# Exploratory Data Analysis for Tabular Playground Series July 2021

The dataset deals with predicting air pollution in a city via various input sensor values. The task is to predict, based on the sensor values, **three** target variables: target_carbon_monoxide, target_benzene and target_nitrogen_oxides. Submissions are evaluated using the [mean column-wise root mean squared logarithmic error](http://www.kaggle.com/c/tabular-playground-series-jul-2021/overview/evaluation).
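
To make the metric concrete, here is a minimal sketch of the mean column-wise RMSLE as I understand its definition: per target column, the square root of the mean squared difference of `log(1 + value)`, then averaged over the columns. The function name `mean_columnwise_rmsle` and the toy numbers are my own and not part of the competition code.

```python
import numpy as np

def mean_columnwise_rmsle(y_true, y_pred):
    """Average the per-column RMSLE over all target columns."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # squared difference of log(1 + value), element by element
    squared_log_error = (np.log1p(y_pred) - np.log1p(y_true)) ** 2
    # one RMSLE per column (axis 0 runs over the rows)
    rmsle_per_column = np.sqrt(squared_log_error.mean(axis=0))
    return rmsle_per_column.mean()

# toy values for illustration only, not competition data
y_true = np.array([[1.0, 10.0, 100.0], [2.0, 20.0, 200.0]])
perfect_score = mean_columnwise_rmsle(y_true, y_true)
doubled_score = mean_columnwise_rmsle(y_true, 2 * y_true)
```

Because of the logarithm, relative errors matter more than absolute ones, so training on `np.log1p` of the targets is a common choice for this kind of metric.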


I chose to use **Plotly** this time as Plotly allows interactive exploration of the data. This is especially nice when examining larger or shorter time intervals.

Looking forward to your feedback and comments!

In [None]:
# import libraries

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib
import matplotlib.pyplot as plt # plotting
%matplotlib inline 
print("matplotlib version: {}". format(matplotlib.__version__))

import plotly
import plotly.express as px
import plotly.graph_objects as go
print("plotly version: {}". format(plotly.__version__))

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
from plotly.offline import plot, iplot, init_notebook_mode
init_notebook_mode(connected=True)
# added to get Plotly working again, see
# https://www.kaggle.com/product-feedback/138599

In [None]:
# read input files
df_train = pd.read_csv("../input/tabular-playground-series-jul-2021/train.csv")
df_test = pd.read_csv("../input/tabular-playground-series-jul-2021/test.csv")
sample_submission = pd.read_csv("../input/tabular-playground-series-jul-2021/sample_submission.csv")
target_cols = ["target_carbon_monoxide","target_benzene","target_nitrogen_oxides"]
# DataFrame.append was removed in pandas 2.0, pd.concat keeps the same row order and index
df_all = pd.concat([df_train.drop(columns=target_cols), df_test]) # makes preprocessing easier

## Quick overview

In [None]:
print("Size of training data: ", df_train.shape)
df_train.head()

In [None]:
print("Size of test data: ", df_test.shape)
df_test.head()

In [None]:
#df_train.describe()
#df_train.dtypes
df_train.info()

**Summary**:
* No missing values. 
* Data types as expected. 
* The dataset is rather small. This needs to be taken into account when choosing the validation scheme. 
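
One possible validation scheme for a small time series is an expanding-window split, where each validation block lies strictly after the rows used for training. The sketch below is only an assumption about what could work here, not a scheme used in this notebook; the helper name `expanding_window_splits` and the toy size of 100 rows are hypothetical.

```python
import numpy as np

def expanding_window_splits(n_rows, n_folds=3):
    """Yield (train_idx, valid_idx) pairs; validation always follows training in time."""
    fold_size = n_rows // (n_folds + 1)
    for fold in range(1, n_folds + 1):
        train_end = fold * fold_size
        # the last fold takes all remaining rows
        valid_end = n_rows if fold == n_folds else train_end + fold_size
        yield np.arange(train_end), np.arange(train_end, valid_end)

# toy example with 100 rows sorted by time
splits = list(expanding_window_splits(n_rows=100, n_folds=3))
fold_sizes = [(len(train_idx), len(valid_idx)) for train_idx, valid_idx in splits]
```

With so few rows, every fold is small, so fold scores are likely to be noisy and should be compared with care.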

## Examining date and time

In [None]:
df_train.date_time

**Train set**:
* First timestamp: 2010-03-10 18:00:00 
* Last timestamp: 2011-01-01 00:00:00
* 1 entry every hour
* March till December (about 9 and a half months of data -> 296 days plus 6 hours)


**Test set**:
* First timestamp: 2011-01-01 00:00:00 -- identical to last timestamp from training data!
* Last timestamp: 2011-04-04 14:00:00
* 1 entry every hour
* January till April (approx. 3 months of data -> 93 days plus 15 hourly entries on April 4th)

The training data covers less than one year, and the prediction period includes months (January, February and early March) for which no training data is available. Not good.

In [None]:
# calculate how many entries are supposed to be there
# (days from April to December + 21 days of March) * 24 + 6 remaining hours of 2010-03-10 + the single entry at 2011-01-01 00:00:00
no_expected_dates = (31+30+31+30+31+31+30+31+30+21) * 24 + 6 + 1
print("Training data:")
print("Number of expected entries if each hour has one timestamp: ", no_expected_dates)
print("Number of actual entries: ", df_train.shape[0])
print("Are there duplicated dates? ", df_train.duplicated(subset = "date_time").any())

In [None]:
# calculate how many entries are supposed to be there
# (days from January to March + 3 days of April) * 24 + 15 entries on 2011-04-04 (00:00 to 14:00)
no_expected_dates = (31+28+31+3) * 24 + 15 
print("Test data:")
print("Number of expected entries if each hour has one timestamp: ", no_expected_dates)
print("Number of actual entries: ", df_test.shape[0])
print("Are there duplicated dates? ", df_test.duplicated(subset = "date_time").any())
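
As a cross-check on the manual day counting, the expected number of hourly timestamps can also be generated with `pd.date_range`. This is a minimal sketch; the helper `check_hourly_coverage` and the toy timestamps are hypothetical, only the start and end timestamps are the ones listed above.

```python
import pandas as pd

ONE_HOUR = pd.Timedelta(hours=1)

def check_hourly_coverage(timestamps):
    """Compare a timestamp column against a complete hourly range."""
    ts = pd.to_datetime(pd.Series(timestamps))
    expected = pd.date_range(ts.min(), ts.max(), freq=ONE_HOUR)
    missing = expected.difference(pd.DatetimeIndex(ts))
    return {
        "expected": len(expected),
        "actual": len(ts),
        "missing": len(missing),
        "duplicated": int(ts.duplicated().sum()),
    }

# toy example: 20:00 is missing from an otherwise hourly sequence
toy = ["2010-03-10 18:00:00", "2010-03-10 19:00:00", "2010-03-10 21:00:00"]
report = check_hourly_coverage(toy)

# expected counts for the train and test ranges quoted above
train_expected = len(pd.date_range("2010-03-10 18:00:00", "2011-01-01 00:00:00", freq=ONE_HOUR))
test_expected = len(pd.date_range("2011-01-01 00:00:00", "2011-04-04 14:00:00", freq=ONE_HOUR))
```

In the notebook, `check_hourly_coverage(df_train.date_time)` would report gaps and duplicates in one go.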

In [None]:
# create new date based columns, will be useful for modelling later

#df_all[["date","time"]] = df_all.date_time.str.split(" ", expand=True)  # alternative to using the datetime functions in pandas
#df_all[["year", "month", "day"]] = df_all.date.str.split("-", expand=True)

df_all['date'] = df_all.date_time.str.split(" ", expand=True)[0] #needed for quick daily grouping
df_all['date_time'] = pd.to_datetime(df_all['date_time'])
df_all['year'] = df_all['date_time'].dt.year
df_all['month'] = df_all['date_time'].dt.month
df_all['day'] = df_all['date_time'].dt.day
df_all['hour'] = df_all['date_time'].dt.hour
df_all['dayofweek'] = df_all['date_time'].dt.dayofweek
df_all['weekend'] = df_all['dayofweek'].apply(lambda x: 1 if (x>4)  else 0) #Sat, Sun are counted as weekend
df_all.head()

In [None]:
# check data types again, look good
df_all.dtypes

## Examining temperature

In [None]:
timestamp_max_temp = df_all.loc[df_all.deg_C == df_all.deg_C.max(),"date"].values[0]
timestamp_min_temp = df_all.loc[df_all.deg_C == df_all.deg_C.min(),"date"].values[0]
print("Minimum temperature: ", df_all.deg_C.min(), " on ", timestamp_min_temp)
print("Maximum temperature: ", df_all.deg_C.max(), " on ", timestamp_max_temp)

In [None]:
# keep in mind that there is one duplicated row at the concatenation of train and test
df_all.iloc[7109:7113]

### These figures are interactive, please explore!
You can zoom in to areas that interest you (form a rectangle with your cursor). To reset the view, double-click or click the house button on the upper right of the plot.

In [None]:
# plot temperature range
temperature_range = df_all.deg_C.value_counts().sort_index()

fig = go.Figure()
fig.add_trace(
      go.Scatter(x=temperature_range.index,
                 y=temperature_range,
                 line = {'width':1}
                )
)
fig.update_layout(
    title="Temperature Range",
    xaxis_title="Temperature",
    yaxis_title="Number of occurrences")
fig

The temperature distribution shows a typical pattern. Temperatures are between 10 and 25 degrees C most of the time. There are a few high extremes, with temperatures rising up to 46.1 degrees C. There is hardly any frost, with only a few days a year slightly below zero. The minimum temperature is -1.8 degrees C.

In [None]:
fig = go.Figure()
fig.add_trace(
      go.Scatter(x=df_all.date_time,
                 y=df_all.deg_C,
                 line = {'width':1}
                )
)

fig.add_vrect(x0="2011-01-01 00:00:00", x1="2011-04-04 14:00:00", line_width=0, fillcolor="grey", opacity=0.1)

fig.add_annotation(x="2010-12-30 00:00:00", y=45, text="Train", showarrow=False, xanchor='right')
fig.add_annotation(x="2011-01-03 00:00:00", y=45, text="Test", showarrow=False, xanchor='left')
fig.add_annotation(x=timestamp_min_temp, y=df_all.deg_C.min()-2, text="Min", showarrow=False)
fig.add_annotation(x=timestamp_max_temp, y=df_all.deg_C.max()+2, text="Max", showarrow=False)

fig.update_layout(
    title="Temperature over time",
    yaxis_title="Temperature")
fig.show()

Here we see a temperature curve that is higher in summer, lower in winter. When zooming in, a daily temperature pattern can be seen with higher temperature in the afternoon, lower temperature in the night. So far, so good. 

But it's easy to spot several areas that are worth further exploration. From December 2010 to March 2011 there are five areas that look erroneous. Zoom in to examine them more closely. There might be more of those areas hidden in the plot...

In [None]:
# detect temperature anomalies - compute standard deviation of temperature for each day
# flag all above a certain threshold 
#np.std(df_all.deg_C[6:30]) # note: np.std() defaults to ddof=0, pandas .std() to ddof=1, so the results differ
#df_all.deg_C[6:30].std()
#np.std(df_all.loc[df_all.date_time.str.contains("2011-02-12")]["deg_C"])
daily_temp_std = df_all.groupby(["year","month","day"])["deg_C"].std()
display(daily_temp_std[daily_temp_std<1])
display(daily_temp_std[daily_temp_std>7])
#fig = px.histogram(daily_temp_std, nbins=30)
#fig.show()
# discontinued: not really suitable for detecting potential outliers, visual inspection works better
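
The difference between `np.std()` and pandas `.std()` noted in the comments above comes from their defaults: NumPy divides by n (`ddof=0`), pandas divides by n - 1 (`ddof=1`). A small sketch with made-up temperature values:

```python
import numpy as np
import pandas as pd

values = pd.Series([10.0, 12.0, 14.0, 16.0])  # toy temperatures, not from the dataset

population_std = np.std(values.to_numpy())          # ddof=0: divides by n
sample_std = values.std()                           # ddof=1: divides by n - 1
matched_std = np.std(values.to_numpy(), ddof=1)     # NumPy with the pandas default
```

For 24 hourly values per day the two differ by only about 2 percent, so the choice does not change which days stand out.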

## Examining Humidity
Click on the legend to show or hide daily averages.

In [None]:
# calculate daily averages
humidity_per_day = df_all.groupby("date")["relative_humidity"].mean().reset_index() # select the column first, then average

fig = go.Figure()
fig.add_trace(
      go.Scatter(x=df_all.date_time,  # hourly data makes messy graph
                 y=df_all.relative_humidity,
                 line = {'width':1},
                 name="Hourly")
)

fig.add_trace(
      go.Scatter(x=humidity_per_day.date, # use daily averages
                 y=humidity_per_day.relative_humidity,
                 mode = 'lines', 
                 line = {'color':'coral', 
                        'width':1
                         },
                 name="Daily Average")
)

fig.add_vrect(x0="2011-01-01 00:00:00", x1="2011-04-04 14:00:00", line_width=0, fillcolor="grey", opacity=0.1)

fig.add_annotation(x="2010-12-30 00:00:00", y=90, text="Train", showarrow=False, xanchor='right')
fig.add_annotation(x="2011-01-03 00:00:00", y=90, text="Test", showarrow=False, xanchor='left')

fig.update_layout(
    title="Relative humidity over time",
    xaxis_title="Time",
    yaxis_title="Humidity")
fig.show()

In [None]:
# calculate daily averages
ahumidity_per_day = df_all.groupby("date")["absolute_humidity"].mean().reset_index() # select the column first, then average

fig = go.Figure()
fig.add_trace(
      go.Scatter(x=df_all.date_time,  # hourly data makes messy graph
                 y=df_all.absolute_humidity,
                 line = {'width':1},
                 name="Hourly")
)

fig.add_trace(
      go.Scatter(x=ahumidity_per_day.date, # use daily averages
                 y=ahumidity_per_day.absolute_humidity,
                 mode = 'lines', 
                 line = {'color':'coral', 
                        'width':1
                         },
                 name="Daily Average")
)

fig.add_vrect(x0="2011-01-01 00:00:00", x1="2011-04-04 14:00:00", line_width=0, fillcolor="grey", opacity=0.1)

fig.add_annotation(x="2010-12-30 00:00:00", y=2.2, text="Train", showarrow=False, xanchor='right')
fig.add_annotation(x="2011-01-03 00:00:00", y=2.2, text="Test", showarrow=False, xanchor='left')

fig.update_layout(
    title="Absolute humidity over time",
    #xaxis_title="Time",
    yaxis_title="Humidity")
fig.show()

On the graph displaying absolute humidity over time there are several occasions with strange drops in absolute humidity, e.g. July 31st and Aug 27th. 

## Examining Sensor Data
Zoom in to spot the daily trends and exceptions!

In [None]:
fig = go.Figure()
fig.add_trace(
      go.Scatter(x=df_all.date_time, 
                 y=df_all.sensor_1,
                 line = {'width':1},
                 name="Sensor 1")
)

fig.add_vrect(x0="2011-01-01 00:00:00", x1="2011-04-04 14:00:00", line_width=0, fillcolor="grey", opacity=0.1)

fig.add_annotation(x="2010-12-30 00:00:00", y=2000, text="Train", showarrow=False, xanchor='right')
fig.add_annotation(x="2011-01-03 00:00:00", y=2000, text="Test", showarrow=False, xanchor='left')

fig.update_layout(
    title="Sensor 1 Data"
)
fig.show()

Sensor 1 data shows a pattern with a drop in the early morning hours (3-4 am).

Exceptions are: August 26th+27th, December 15th-17th, January 3rd+4th, January 29th+30th and February 9th-11th.

In [None]:
fig = go.Figure()
fig.add_trace(
      go.Scatter(x=df_all.date_time, 
                 y=df_all.sensor_2,
                 line = {'width':1},
                 name="Sensor 2")
)

fig.add_vrect(x0="2011-01-01 00:00:00", x1="2011-04-04 14:00:00", line_width=0, fillcolor="grey", opacity=0.1)

fig.add_annotation(x="2010-12-30 00:00:00", y=2250, text="Train", showarrow=False, xanchor='right')
fig.add_annotation(x="2011-01-03 00:00:00", y=2250, text="Test", showarrow=False, xanchor='left')

fig.update_layout(
    title="Sensor 2 Data"
)
fig.show()

Sensor 2 data shows a pattern with a drop in the early morning hours (3-4 am), like Sensor 1. 

Exceptions are: April 9th, May 26th, June 19th-21st, August 9th, August 26th-28th, Sept 8th, Oct 1st, Dec 15th-17th, Jan 3rd+4th, Feb 9th-11th, March 11th. There, the values drop close to the minimum.



In [None]:
fig = go.Figure()
fig.add_trace(
      go.Scatter(x=df_all.date_time, 
                 y=df_all.sensor_3,
                 line = {'width':1},
                 name="Sensor 3")
)

fig.add_vrect(x0="2011-01-01 00:00:00", x1="2011-04-04 14:00:00", line_width=0, fillcolor="grey", opacity=0.1)

fig.add_annotation(x="2010-12-30 00:00:00", y=2500, text="Train", showarrow=False, xanchor='right')
fig.add_annotation(x="2011-01-03 00:00:00", y=2500, text="Test", showarrow=False, xanchor='left')

fig.update_layout(
    title="Sensor 3 Data"
)
fig.show()

Sensor 3 data follows a cyclical pattern with a high in the early morning (3-4 am) and a low around 8 am.

Exceptions are: April 9th, June 20th+21st, Jul 30th, Aug 9th, Aug 26th+27th, Sept 8th, Oct 1st, Nov 14th, Dec 1st, Dec 10th, Dec 15th-17th, Dec 24th, Jan 3rd+4th, Feb 9th-11th, Mar 1st, Mar 11th.

In [None]:
fig = go.Figure()
fig.add_trace(
      go.Scatter(x=df_all.date_time, 
                 y=df_all.sensor_4,
                 line = {'width':1},
                 name="Sensor 4")
)

fig.add_vrect(x0="2011-01-01 00:00:00", x1="2011-04-04 14:00:00", line_width=0, fillcolor="grey", opacity=0.1)

fig.add_annotation(x="2010-12-30 00:00:00", y=2800, text="Train", showarrow=False, xanchor='right')
fig.add_annotation(x="2011-01-03 00:00:00", y=2800, text="Test", showarrow=False, xanchor='left')

fig.update_layout(
    title="Sensor 4 Data"
)
fig.show()

For Sensor 4, the pattern is not so clear. There seems to be one peak in the morning and one peak in the afternoon. 

In [None]:
fig = go.Figure()
fig.add_trace(
      go.Scatter(x=df_all.date_time, 
                 y=df_all.sensor_5,
                 line = {'width':1},
                 name="Sensor 5")
)

fig.add_vrect(x0="2011-01-01 00:00:00", x1="2011-04-04 14:00:00", line_width=0, fillcolor="grey", opacity=0.1)

fig.add_annotation(x="2010-12-30 00:00:00", y=2500, text="Train", showarrow=False, xanchor='right')
fig.add_annotation(x="2011-01-03 00:00:00", y=2500, text="Test", showarrow=False, xanchor='left')

fig.update_layout(
    title="Sensor 5 Data"
)
fig.show()

Like Sensor 4, not such a clear pattern. There seems to be a peak in the morning and one in the afternoon.

## Overlaying the Sensor Data
Without zoom we see a mess.

When we zoom in, e.g. to the December and January data, we can see that there are times when all sensors show atypical values. Why would that be?

In [None]:
fig = go.Figure()

fig.add_trace(
      go.Scatter(x=df_all.date_time, 
                 y=df_all.sensor_1, 
                 mode = 'lines', 
                 line = {'width':1},
                 name="Sensor 1")
)
fig.add_trace(
      go.Scatter(x=df_all.date_time, 
                 y=df_all.sensor_2,
                 mode = 'lines', 
                 line = {'width':1},
                 name="Sensor 2")
)
fig.add_trace(
      go.Scatter(x=df_all.date_time, 
                 y=df_all.sensor_3,
                 mode = 'lines', 
                 line = {'width':1},
                 name="Sensor 3")
)
fig.add_trace(
      go.Scatter(x=df_all.date_time, 
                 y=df_all.sensor_4,
                 mode = 'lines', 
                 line = {'width':1},
                 name="Sensor 4")
)
fig.add_trace(
      go.Scatter(x=df_all.date_time, 
                 y=df_all.sensor_5,
                 mode = 'lines', 
                 line = {'width':1},
                 name="Sensor 5")
)

fig.add_vrect(x0="2011-01-01 00:00:00", x1="2011-04-04 14:00:00", line_width=0, fillcolor="grey", opacity=0.1)

fig.add_annotation(x="2010-12-30 00:00:00", y=2500, text="Train", showarrow=False, xanchor='right')
fig.add_annotation(x="2011-01-03 00:00:00", y=2500, text="Test", showarrow=False, xanchor='left')

fig.update_layout(
    title="Sensor Data, 1-5"
)
fig.show()

In [None]:
hourly_values = df_all.groupby(["hour"]).mean(numeric_only=True) # numeric_only skips the string column "date"

fig = go.Figure()
fig.add_trace(
      go.Scatter(x=hourly_values.index, 
                 y=hourly_values.sensor_1,
                 line = {'width':1},
                 name="Sensor 1")
)

fig.add_trace(
      go.Scatter(x=hourly_values.index, 
                 y=hourly_values.sensor_2,
                 line = {'width':1},
                 name="Sensor 2")
)

fig.add_trace(
      go.Scatter(x=hourly_values.index, 
                 y=hourly_values.sensor_3,
                 line = {'width':1},
                 name="Sensor 3")
)

fig.add_trace(
      go.Scatter(x=hourly_values.index, 
                 y=hourly_values.sensor_4,
                 line = {'width':1},
                 name="Sensor 4")
)

fig.add_trace(
      go.Scatter(x=hourly_values.index, 
                 y=hourly_values.sensor_5,
                 line = {'width':1},
                 name="Sensor 5")
)

fig.update_layout(
    title="Mean Sensor Values per Hour",
    xaxis = dict(
        tickmode = 'linear',
        tick0 = 0,
        dtick = 1
    )
)
fig.show()

We can see that the sensors share common peaks and lows. Sensor_3 seems to be inversely related to the others. Based on this analysis I will introduce an additional time-of-day feature called "rush-hour".
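
A minimal sketch of how such a "rush-hour" feature could be built from the `hour` column. The hour windows in `RUSH_HOURS` are placeholders I picked for illustration and should be replaced with the peaks actually visible in the plot above; the function name `add_rush_hour` is hypothetical as well.

```python
import pandas as pd

# Hypothetical rush-hour windows (7-9 and 17-20), to be adjusted to the observed peaks
RUSH_HOURS = set(range(7, 10)) | set(range(17, 21))

def add_rush_hour(df, hour_col="hour", rush_hours=RUSH_HOURS):
    """Return a copy of df with a 0/1 rush_hour column."""
    out = df.copy()
    out["rush_hour"] = out[hour_col].isin(rush_hours).astype(int)
    return out

# toy frame for illustration
toy = pd.DataFrame({"hour": [3, 8, 12, 18, 23]})
toy = add_rush_hour(toy)
```

In the notebook this would be applied as `df_all = add_rush_hour(df_all)` once the `hour` column exists.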

In [None]:
# comparing groupby and resample
#df_all.groupby(["month"]).mean() # groups by month number: 12 entries
#df_all.resample("m", on="date_time").mean() # groups by year and month: 13 entries as for March there is data from 2010 and 2011

In [None]:
daily_values = df_all.groupby(["dayofweek"]).mean(numeric_only=True) # numeric_only skips the string column "date"

fig = go.Figure()
fig.add_trace(
      go.Scatter(x=daily_values.index, 
                 y=daily_values.sensor_1,
                 line = {'width':1},
                 name="Sensor 1")
)

fig.add_trace(
      go.Scatter(x=daily_values.index, 
                 y=daily_values.sensor_2,
                 line = {'width':1},
                 name="Sensor 2")
)

fig.add_trace(
      go.Scatter(x=daily_values.index, 
                 y=daily_values.sensor_3,
                 line = {'width':1},
                 name="Sensor 3")
)

fig.add_trace(
      go.Scatter(x=daily_values.index, 
                 y=daily_values.sensor_4,
                 line = {'width':1},
                 name="Sensor 4")
)

fig.add_trace(
      go.Scatter(x=daily_values.index, 
                 y=daily_values.sensor_5,
                 line = {'width':1},
                 name="Sensor 5")
)

fig.update_layout(
    title="Mean Sensor Values per Weekday",
    xaxis = dict(
        tickmode = 'array',
        tickvals = list(range(0,7,1)),
        ticktext = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
    )
)
fig.show()

Four sensors show lower values over the weekend. It may be worth experimenting with two weekend definitions: weekend = Sat+Sun or weekend = Sat+Sun+Mon.
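
Both weekend definitions can be produced by one small helper, which makes it easy to compare them during modelling. This is a sketch under the assumption that `dayofweek` follows the pandas convention (Monday = 0, Sunday = 6); the names `weekend_flag` and `weekend_mon` are my own.

```python
import pandas as pd

def weekend_flag(dayofweek, weekend_days=(5, 6)):
    """0/1 flag for the given day numbers (pandas: Monday = 0, Sunday = 6)."""
    return dayofweek.isin(weekend_days).astype(int)

# toy frame with one row per weekday
toy = pd.DataFrame({"dayofweek": range(7)})
toy["weekend"] = weekend_flag(toy["dayofweek"])                  # Sat+Sun
toy["weekend_mon"] = weekend_flag(toy["dayofweek"], (0, 5, 6))   # Sat+Sun+Mon
```

Compared with the `apply(lambda ...)` version used earlier, `isin` is vectorised and lets the set of days be swapped without touching the logic.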

## Examining the target variables

Summary: 
* All target variables are air pollutants.
* There is a yearly pattern with a low in August (vacation time?) and higher values in winter (more exhaust fumes?).
* There is a daily pattern with a high in the morning and one in the afternoon (commuting to and from work?).
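
The yearly and daily patterns can also be checked numerically by averaging the targets per month or per hour. The sketch below uses a synthetic two-day frame as a stand-in for `df_train`; the helper `mean_profile` is hypothetical.

```python
import numpy as np
import pandas as pd

def mean_profile(df, value_cols, by):
    """Mean of value_cols grouped by a datetime component such as 'hour' or 'month'."""
    key = getattr(pd.to_datetime(df["date_time"]).dt, by)
    return df[value_cols].groupby(key).mean()

# synthetic stand-in for df_train: two days of hourly data with a made-up target
toy = pd.DataFrame({
    "date_time": pd.date_range("2010-03-11", periods=48, freq=pd.Timedelta(hours=1)),
    "target_benzene": np.tile(np.arange(24.0), 2),
})
hourly = mean_profile(toy, ["target_benzene"], by="hour")
```

With the real data, `mean_profile(df_train, target_cols, by="month")` would give one row per month, which should show the low in August if the visual impression is right.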

In [None]:
fig = go.Figure()

fig.add_trace(
      go.Scatter(x=df_train.date_time, 
                 y=df_train.target_carbon_monoxide, 
                 mode = 'lines', 
                 line = {'color':'darkgoldenrod', 'width' : 1},
                 #opacity=0.1,
                 name="Carbon Monoxide")
)

fig.update_layout(
    title="Carbon Monoxide over time"
)
fig.show()

Carbon Monoxide (CO) is a colorless, odorless, tasteless and flammable gas. Among other sources it is caused by industrial activities and linked to climate change. High doses are toxic. (Source: Wikipedia)

There is a daily pattern with one peak in the morning and one in the afternoon. There is a "low season" in August and higher values over the winter.

In [None]:
fig = go.Figure()

fig.add_trace(
      go.Scatter(x=df_train.date_time, 
                 y=df_train.target_benzene,
                 #mode = 'lines', 
                 line = {'color':'darkgoldenrod', 'width' : 1},
                 name="Target Benzene")
)

fig.update_layout(
    title="Benzene over time"
)
fig.show()

Benzene is an organic chemical compound with the formula C6H6. Benzene is classified as a carcinogen and originates, for example, from tobacco smoke or engine exhaust, but also from volcanic eruptions and wildfires. (Source: Wikipedia)

There is a daily pattern with one peak in the morning and one in the afternoon. There is a "low season" in August and higher values over the winter. We can see several occasions where the sensor seems to have reached its lower end. 

In [None]:
fig = go.Figure()

fig.add_trace(
      go.Scatter(x=df_train.date_time, 
                 y=df_train.target_nitrogen_oxides,
                 mode = 'lines', 
                 line = {'color':'darkgoldenrod', 'width' : 1},
                 name="Nitrogen_oxides")
)

fig.update_layout(
    title="Nitrogen Oxides over time"
)
fig.show()

Nitrogen Oxides are a group of molecules consisting of nitrogen and oxygen. Examples are: NO, NO2, N2O, N2O3... While there are natural sources like lightning, a big part of the nitrogen oxides comes from burning fossil fuels. A high concentration of nitrogen oxides negatively impacts human lung function. (Source: Wikipedia)

There is a daily pattern with one peak in the morning and one in the afternoon. There is a "low season" in August and higher values over the winter. 

In [None]:
# check correlation between the target columns
df_train[target_cols].corr()

We can see that target_nitrogen_oxides has the weakest correlation with the other two targets, target_carbon_monoxide and target_benzene. The first two target columns might therefore not help much in predicting the last one. 