# Table of Contents
<a id="table-of-contents"></a>
- [1 Introduction](#1)
- [2 Preparations](#2)
- [3 Dataset Overview](#3)
    - [3.1 Train dataset](#3.1)
    - [3.2 Test dataset](#3.2)
- [4 Features](#4)
    - [4.1 Distribution](#4.1)
    - [4.2 Time Series](#4.2)
    - [4.3 Matching Distribution](#4.3)
- [5 Target](#5)
    - [5.1 Distribution](#5.1)
    - [5.2 Time Series](#5.2)
- [6 Correlation](#6)
    - [6.1 Target and Features](#6.1)
    - [6.2 Features](#6.2)
    - [6.3 Target](#6.3)
- [7 Winners Solution](#7)

[back to top](#table-of-contents)
<a id="1"></a>
# 1 Introduction

Kaggle competitions are incredibly fun and rewarding, but they can also be intimidating for people who are relatively new to their data science journey. In the past, Kaggle has launched many Playground competitions that are more approachable than Featured competitions, and thus more beginner-friendly.

The goal of these competitions is to provide a fun, but less challenging, tabular dataset. These competitions will be great for people looking for something in between the Titanic Getting Started competition and a Featured competition.

The dataset used for this competition is based on a real dataset but has synthetically generated aspects. The original dataset deals with predicting air pollution in a city from various input sensor values (e.g., a time series).

Submissions are evaluated using the mean column-wise root mean squared logarithmic error (RMSLE). The RMSLE for a single column is calculated as:

$$\sqrt{\frac{1}{n}\sum_{i=1}^n(\text{log}(p_{i}+1)-\text{log}(a_{i}+1))^2}, $$


where $n$ is the total number of observations, $p_{i}$ is your prediction, $a_{i}$ is the actual value, and $\text{log}(x)$ is the natural logarithm of $x$.

The final score is the mean of the RMSLE over all columns, in this case, 3.
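
As a minimal sketch of the metric described above, the function below computes the per-column RMSLE and averages it over the columns. The function name `mean_columnwise_rmsle` and the toy arrays are illustrative assumptions, not part of the competition code.

```python
import numpy as np


def mean_columnwise_rmsle(actual, predicted):
    """Mean over columns of the per-column RMSLE.

    Both inputs are array-likes of shape (n_observations, n_targets)
    holding non-negative values.
    """
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # log1p(x) computes log(x + 1), matching the formula above.
    squared_log_error = (np.log1p(predicted) - np.log1p(actual)) ** 2
    # Average over observations (axis 0) to get one RMSLE per column.
    rmsle_per_column = np.sqrt(squared_log_error.mean(axis=0))
    # The final score is the mean of the per-column values.
    return float(rmsle_per_column.mean())


# Hypothetical toy values: 2 observations, 3 target columns.
actual = np.array([[1.0, 5.0, 100.0], [2.0, 8.0, 250.0]])
predicted = np.array([[1.2, 4.0, 120.0], [1.8, 9.5, 240.0]])
print(mean_columnwise_rmsle(actual, predicted))
```

Because the error is taken on the logarithm of the values, the metric penalizes relative rather than absolute differences, which keeps the large `target_nitrogen_oxides` values from dominating the score.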

[back to top](#table-of-contents)
<a id="2"></a>
# 2 Preparations
This section loads the packages and data used in the analysis. The packages are mainly for data manipulation and data visualization. Two datasets are used in the analysis: the train dataset and the test dataset. The train dataset is used to train models, which are then used to predict the test dataset. The sample submission file shows participants the expected submission format for the competition.

In [None]:
# import packages
import os
import joblib
import numpy as np
import pandas as pd
import warnings

import matplotlib
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from matplotlib import ticker
import seaborn as sns

# setting up options
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
warnings.filterwarnings('ignore')

# import datasets
train_df = pd.read_csv('../input/tabular-playground-series-jul-2021/train.csv')
test_df = pd.read_csv('../input/tabular-playground-series-jul-2021/test.csv')
submission = pd.read_csv('../input/tabular-playground-series-jul-2021/sample_submission.csv')

train_df['date_time'] = pd.to_datetime(train_df['date_time'])
test_df['date_time'] = pd.to_datetime(test_df['date_time'])

[back to top](#table-of-contents)
<a id="3"></a>
# 3 Dataset Overview
The intent of the overview is to get a feel for the data and its structure in the train, test and submission files. The overview of the train and test datasets includes a quick analysis of missing values and basic statistics, while the sample submission is loaded to see the expected submission format.

The main objective of the competition is to predict the values of air pollution measurements over time, based on basic weather information (temperature and humidity) and the input values of 5 sensors. The three target values to predict are `target_carbon_monoxide`, `target_benzene`, and `target_nitrogen_oxides`.

As there is no official data definition, the definitions below are a best guess. The data may contain the following columns:
- **date_time** is the date and time at which the sensor readings were recorded, at 1-hour intervals.
- **deg_C** is the air temperature measured in degrees Celsius (centigrade).
- **relative_humidity** is the amount of water vapor in the air relative to the maximum the air can hold at that temperature, expressed as a percentage.
- **absolute_humidity** is the measure of water vapor (moisture) in the air, regardless of temperature.
- **sensor_1** to **sensor_5** are the values recorded by the sensors.

Three target values:
- **target_carbon_monoxide** measures carbon monoxide, a colorless, odorless, tasteless, flammable gas that is slightly less dense than air.
- **target_benzene** measures benzene, a chemical that is a colorless or light yellow liquid at room temperature. It has a sweet odor and is highly flammable.
- **target_nitrogen_oxides** measures nitrogen oxides, a family of poisonous, highly reactive gases that form when fuel is burned at high temperatures.

<a id="3.1"></a>
## 3.1 Train dataset
As stated before, the train dataset is mainly used to train a predictive model, as the target variables are available in this set. This dataset is also used to explore the data itself, including the relationship between each predictor and the target variables.

**Observations:**
- The sensor recordings start on `10-Mar-2010` at `6 p.m.` and end on `31-Dec-2010`, a span of roughly `10 months`.
- The lowest temperature recorded in `deg_C` is `1.3` and the highest is `46.1`. The mean of `20.9` is a normal temperature in many countries when measured in Celsius, so it is reasonable to assume the column is in degrees Celsius.
- `relative_humidity` is measured as a percentage and ranges from `8.9%` to `90.8%` in the `train` dataset.
- `absolute_humidity` ranges from `0.2` to `2.2`.
- The readings from `sensor_1` to `sensor_5` seem to come from the same type of device, possibly placed in several different locations, as the sensors do not differ much from one another; together they range from `242.7` to `2913.8`.
- `target_carbon_monoxide` ranges from `0.1` to `12.5`, `target_benzene` from `0.1` to `63.7` and `target_nitrogen_oxides` from `1.91` to `1472.3`. Although the minimum values of the targets are quite similar, their maximum and mean values are quite different.

### 3.1.1 Quick view
Below are the first 5 rows of the train dataset:

In [None]:
train_df.head()

The dimensions and number of missing values in the train dataset are shown below:

In [None]:
print(f'Number of rows: {train_df.shape[0]};  Number of columns: {train_df.shape[1]}; No of missing values: {sum(train_df.isna().sum())}')

The train dataset has `7,111` rows and `12` columns, and there are `0` missing values, so no additional steps are needed to fill in missing values.

### 3.1.2 Data types
Except for the `date_time` column, all columns are of type `float64`.

In [None]:
train_df.dtypes

### 3.1.3 Basic statistics
Below are the basic statistics for each variable: `count`, `mean`, `standard deviation`, `minimum`, `1st quartile`, `median`, `3rd quartile` and `maximum`.

In [None]:
train_df.describe()

[back to top](#table-of-contents)
<a id="3.2"></a>
## 3.2 Test dataset
The test dataset is used to make predictions with the model that has been trained. Exploring this dataset is also needed to see how the data is structured and, especially, how similar it is to the train dataset.

**Observations:**
- Test observations run from `01-Jan-2011` to `04-Apr-2011` at `2 p.m.` The `train` and `test` datasets come from different parts of the yearly cycle: the calendar period they share is less than 1 month, from `10-Mar` to `04-Apr`.
- The lowest temperature recorded in `deg_C` is `-1.8` and the highest is `30`, with a mean of `10.8` degrees Celsius. This is expected, as the train and test datasets cover different seasons.
- `relative_humidity` ranges from `9.8%` to `88.8%`, which closely resembles the train dataset.
- `absolute_humidity` ranges from `0.18` to `1.39`. The maximum absolute humidity is higher in the train dataset than in the test dataset.
- `sensor_1` to `sensor_5` range from `205.3` to `2593.8`, which is close to, though not identical to, the range in the train dataset.
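
The limited calendar overlap noted above can be checked directly. The sketch below is one possible way to list the calendar months that occur in both datasets; the helper name `shared_months` is hypothetical, and the two daily date ranges are stand-ins for `train_df['date_time']` and `test_df['date_time']` built from the start and end dates quoted in the observations.

```python
import pandas as pd


def shared_months(train_times, test_times):
    """Calendar months (1-12) that occur in both datetime Series."""
    train_months = set(train_times.dt.month)
    test_months = set(test_times.dt.month)
    return sorted(int(m) for m in train_months & test_months)


# Stand-in date ranges (daily instead of hourly) based on the dates above.
demo_train = pd.Series(pd.date_range("2010-03-10", "2010-12-31", freq="D"))
demo_test = pd.Series(pd.date_range("2011-01-01", "2011-04-04", freq="D"))
print(shared_months(demo_train, demo_test))  # [3, 4]
```

With the real data the call would be `shared_months(train_df['date_time'], test_df['date_time'])`.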

### 3.2.1 Quick view
Below are the first 5 rows of the test dataset:

In [None]:
test_df.head()

The dimensions and number of missing values in the test dataset are shown below:

In [None]:
print(f'Number of rows: {test_df.shape[0]};  Number of columns: {test_df.shape[1]}; No of missing values: {sum(test_df.isna().sum())}')

Similar to the train dataset, the test dataset has `0` missing values. It has `2,247` rows and `9` columns, and it contains no target columns.

### 3.2.2 Data types
Consistent with the train dataset, all columns are of type `float64` except for the `date_time` column.

In [None]:
test_df.dtypes

### 3.2.3 Basic statistics
Below are the basic statistics for each variable: `count`, `mean`, `standard deviation`, `minimum`, `1st quartile`, `median`, `3rd quartile` and `maximum`.

In [None]:
test_df.describe()

[back to top](#table-of-contents)
<a id="4"></a>
# 4 Features
There are `8` feature columns available for building a prediction model.

<a id="4.1"></a>
## 4.1 Distribution
This section compares the distribution of each feature between the train and test datasets.

**Observations:**
- `deg_C` shows a different distribution in the train and test datasets. This is expected, as the two datasets cover different seasons.
- `absolute_humidity` and `sensor_4` also show different distributions in the test and train datasets.
- `relative_humidity`, `sensor_1`, `sensor_2`, `sensor_3` and `sensor_5` show much the same distribution in the train and test datasets.

In [None]:
features = [feature for feature in train_df.columns if feature not in ["date_time", "target_carbon_monoxide", 
                                                                       "target_benzene", "target_nitrogen_oxides"]]

plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(6, 6), facecolor='#f6f5f5')
gs = fig.add_gridspec(3, 3)
gs.update(wspace=0.2, hspace=0.25)

background_color = "#f6f5f5"

run_no = 0
for row in range(0, 3):
    for col in range(0, 3):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        locals()["ax"+str(run_no)].tick_params(axis='y', which=u'both',length=0)
        locals()["ax"+str(run_no)].set_yticklabels([])
        for s in ["top","right"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1

run_no = 0
for col in features:
    # `fill=True` replaces the deprecated `shade=True` (seaborn >= 0.11)
    sns.kdeplot(test_df[col], ax=locals()["ax"+str(run_no)], fill=True, color='#287094', alpha=0.95, linewidth=0, zorder=2)
    sns.kdeplot(train_df[col], ax=locals()["ax"+str(run_no)], fill=True, color='#fcd12a', alpha=0.95, linewidth=0, zorder=2)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel(col, fontsize=5, fontweight='bold')
    locals()["ax"+str(run_no)].tick_params(labelsize=5, width=0.5, length=1.5)
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.7)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.7)
    run_no += 1
    
ax0.text(-10, 0.086, 'Features Distribution', fontsize=8, fontweight='bold')
ax0.text(-10, 0.081, 'Features comparison between train and test dataset', fontsize=5)
fig.legend(['test', 'train'], ncol=2, facecolor=background_color, edgecolor=background_color, fontsize=4, bbox_to_anchor=(0.24, 0.91))
    
ax8.remove()
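
The visual comparison above can be complemented with a number. The sketch below computes the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical cumulative distribution functions) with NumPy only. This is a technique swapped in for illustration, not something used in the original analysis; the function name `ks_statistic` and the random demo samples are assumptions.

```python
import numpy as np


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max |ECDF_a - ECDF_b|."""
    a = np.sort(np.asarray(sample_a, dtype=float))
    b = np.sort(np.asarray(sample_b, dtype=float))
    # The largest gap between two step functions occurs at a data point,
    # so it is enough to evaluate both ECDFs at every observed value.
    grid = np.concatenate([a, b])
    ecdf_a = np.searchsorted(a, grid, side="right") / a.size
    ecdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(ecdf_a - ecdf_b)))


# Hypothetical demo: a shifted sample gives a larger statistic.
rng = np.random.default_rng(0)
base = rng.normal(20.0, 7.0, size=2000)
similar = rng.normal(20.0, 7.0, size=2000)
shifted = rng.normal(11.0, 7.0, size=2000)
print(ks_statistic(base, similar), ks_statistic(base, shifted))
```

A value near 0 means the two distributions nearly coincide, and a value near 1 means they barely overlap. With the real data, something like `{col: ks_statistic(train_df[col], test_df[col]) for col in features}` would rank the features by how much they shift between train and test.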

<a id="4.2"></a>
## 4.2 Time Series
The dataset is a time series: it starts with the train dataset, followed by the test dataset. Season is an important factor to consider in the time series:
- **Spring**: March to May
- **Summer**: June to August
- **Autumn**: September to November
- **Winter**: December to February

**Observations:**
- `deg_C` is relatively high in `summer` and trends downward in `autumn`. As expected, it is relatively low in `winter` before increasing again in `spring`.
- `relative_humidity` trends upward in `autumn` and is more volatile in `winter`. It is relatively low in `summer`.
- `absolute_humidity` is lower in `winter` and slowly increases as `spring` approaches. It is relatively high in `summer` and `autumn`.
- `sensor_1` shows lower volatility in `summer` than in other seasons.
- `sensor_2` and `sensor_5` show higher volatility in `autumn` than in other seasons.
- `sensor_3` shows lower volatility and lower values in `winter`, while `sensor_4` shows more volatility.

In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(10,8), facecolor='#f6f5f5')
gs = fig.add_gridspec(8, 1)
gs.update(wspace=0, hspace=1.5)

background_color = "#f6f5f5"

run_no = 0
for row in range(0, 8):
    for col in range(0, 1):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1

run_no = 0
for col in features:
    sns.lineplot(ax=locals()["ax"+str(run_no)], y=train_df[col], x=train_df['date_time'], color='#fcd12a')
    sns.lineplot(ax=locals()["ax"+str(run_no)], y=test_df[col], x=test_df['date_time'], color='#287094')
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel(col, fontsize=5, fontweight='bold')
    locals()["ax"+str(run_no)].tick_params(labelsize=5, width=0.5, length=1.5)
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.7)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.7)
    spring = np.arange(np.datetime64("2010-03-10"), np.datetime64("2010-06-02"))
    locals()["ax"+str(run_no)].fill_between(spring, np.max(train_df[col]), color='#ff69b4', alpha=0.2, zorder=2, linewidth=0)
    summer = np.arange(np.datetime64("2010-06-01"), np.datetime64("2010-09-02"))
    locals()["ax"+str(run_no)].fill_between(summer, np.max(train_df[col]), color='#fcd12a', alpha=0.2, zorder=2, linewidth=0)
    autumn = np.arange(np.datetime64("2010-09-01"), np.datetime64("2010-12-02"))
    locals()["ax"+str(run_no)].fill_between(autumn, np.max(train_df[col]), color='#ff9200', alpha=0.2, zorder=2, linewidth=0)
    winter = np.arange(np.datetime64("2010-12-01"), np.datetime64("2011-03-02"))
    locals()["ax"+str(run_no)].fill_between(winter, np.max(train_df[col]), color='#287094', alpha=0.2, zorder=2, linewidth=0)
    spring_2 = np.arange(np.datetime64("2011-03-01"), np.datetime64("2011-04-05"))
    locals()["ax"+str(run_no)].fill_between(spring_2, np.max(train_df[col]), color='#ff69b4', alpha=0.2, zorder=2, linewidth=0)
    run_no += 1
    
ax0.text(14660, 80, 'Time Series', fontsize=8, fontweight='bold')
ax0.text(14660, 65, 'Showing time series data starting from train dataset followed by test dataset', fontsize=5)
fig.legend(['test', 'train'], ncol=2, facecolor=background_color, edgecolor=background_color, fontsize=4, bbox_to_anchor=(0.2, 0.895))

plt.show()
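
The season definition above can also be attached to the data as a column, for example to group or color by season. The following is a minimal sketch; the names `SEASON_BY_MONTH` and `add_season` are hypothetical and the demo rows are made up.

```python
import pandas as pd

# Month number -> season, following the definition used in this section.
SEASON_BY_MONTH = {
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
    12: "winter", 1: "winter", 2: "winter",
}


def add_season(df, time_col="date_time"):
    """Return a copy of df with a 'season' column derived from time_col."""
    out = df.copy()
    out["season"] = out[time_col].dt.month.map(SEASON_BY_MONTH)
    return out


# Made-up timestamps, one per season.
demo = pd.DataFrame({"date_time": pd.to_datetime(
    ["2010-03-10 18:00", "2010-07-01 00:00", "2010-10-15 12:00", "2011-01-01 00:00"]
)})
print(add_season(demo)["season"].tolist())
# ['spring', 'summer', 'autumn', 'winter']
```

Applied to the real data, `add_season(train_df).groupby("season")[features].std()` would be one way to put numbers on the volatility differences listed in the observations.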

<a id="4.3"></a>
## 4.3 Matching Distribution
This section checks whether the monthly distribution of each feature is the same in the test and train datasets, focusing mainly on `Jan`, `Feb`, `Mar` and `Apr`. This is important because most months of the `test` dataset are not available in the `train` dataset and fall in a different year. `Jan` and `Feb` in `test` are mostly compared with `Dec` in the `train` dataset, the nearest month available.

**Observations:**
- In `Apr`, all features except `deg_C` show a different distribution between the `train` and `test` datasets. This may be due to differences in the number of observations.
- **deg_C**
    - `Jan` and `Feb` in the `test` dataset resemble `Dec` in the `train` dataset.
    - `Mar` in the `test` dataset does not fully resemble `Mar` in the `train` dataset.
- **relative_humidity**
    - `Jan` in the `test` dataset closely resembles `Dec` in the `train` dataset.
    - It is hard to find a matching distribution for `Feb` in the `test` dataset.
    - `Mar` in the `test` dataset does not fully resemble `Mar` in the `train` dataset, but it is not far off.
- **absolute_humidity**
    - `Jan` in the `test` dataset is close to `Dec` in the `train` dataset, but still not the same.
    - For `Feb` and `Mar`, the distributions in the `test` dataset are very different from those in the `train` dataset.
- **sensor_1** and **sensor_2**
    - `Jan` in the `test` dataset resembles `Dec` in the `train` dataset, but `Feb` in the `test` dataset resembles `Dec` in the `train` dataset even more closely.
    - `Mar` in the `test` dataset is much the same as `Mar` in the `train` dataset.
- **sensor_3**
    - `Jan` in the `test` dataset resembles `Dec` in the `train` dataset, but `Feb` in the `test` dataset resembles `Dec` in the `train` dataset even more closely.
    - `Mar` in the `test` dataset differs from `Mar` in the `train` dataset.
- **sensor_4**
    - `Jan` in the `test` dataset shows a different distribution from `Dec` in the `train` dataset.
    - `Feb` in the `test` dataset resembles `Dec` in the `train` dataset more closely, but not perfectly.
    - `Mar` in the `test` dataset differs from `Mar` in the `train` dataset.
- **sensor_5**
    - `Jan` and `Feb` in the `test` dataset show a different distribution from `Dec` in the `train` dataset.
    - `Mar` in the `test` dataset differs from `Mar` in the `train` dataset.

In [None]:
train_df['train_test'] = 'train'
test_df['train_test'] = 'test'
combine_df = pd.concat([train_df, test_df], axis=0)
combine_df['month'] = combine_df['date_time'].dt.month

train_df = train_df.drop('train_test', axis=1)
test_df = test_df.drop('train_test', axis=1)

plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(10,20), facecolor='#f6f5f5')
gs = fig.add_gridspec(8, 1)
gs.update(wspace=0, hspace=0.5)

background_color = "#f6f5f5"
sns.set_palette(['#fcd12a', '#287094'])

run_no = 0
for row in range(0, 8):
    for col in range(0, 1):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1

run_no = 0
for col in features:
    sns.violinplot(ax=locals()["ax"+str(run_no)], y=combine_df[col], x=combine_df['month'],hue=combine_df['train_test'], 
                   split = True, saturation=1, linewidth=1, inner="quartile", legend=False, zorder=2)
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.7)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.7)
    locals()["ax"+str(run_no)].set_axisbelow(True) 
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel(col, fontsize = 8)
    locals()["ax"+str(run_no)].get_legend().remove()
    locals()["ax"+str(run_no)].tick_params(labelsize=8, width=0.5, length=1.5)
    run_no += 1
    
ax0.text(-0.5, 70, 'Monthly Distribution', fontsize=10, fontweight='bold')
ax0.text(-0.5, 63, 'Showing features monthly distribution on train and test dataset (months are in number)', fontsize=8)
ax0.legend(ncol=2, facecolor=background_color, edgecolor=background_color, fontsize=7, bbox_to_anchor=(0.16, 1.19))

plt.show()
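
The quartile lines inside the violins can also be read off as a table. The sketch below computes the quartiles of one feature per month and per dataset, assuming a combined frame with a `date_time` column and a `train_test` label column like `combine_df` above. The function name `monthly_quartiles` and the demo rows are hypothetical.

```python
import pandas as pd


def monthly_quartiles(combined, column, split_col="train_test", time_col="date_time"):
    """Quartiles of `column` for every (month, dataset) pair."""
    # Derive the month explicitly so the function does not rely on an
    # existing 'month' column.
    with_month = combined.assign(month=combined[time_col].dt.month)
    grouped = with_month.groupby(["month", split_col])[column]
    # One row per (month, dataset); one column per quantile.
    return grouped.quantile([0.25, 0.5, 0.75]).unstack()


# Made-up rows: three December train readings, three January test readings.
demo = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2010-12-01", "2010-12-02", "2010-12-03",
        "2011-01-01", "2011-01-02", "2011-01-03",
    ]),
    "train_test": ["train"] * 3 + ["test"] * 3,
    "deg_C": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})
print(monthly_quartiles(demo, "deg_C"))
```

Comparing the `Dec`/`train` row with the `Jan`/`test` and `Feb`/`test` rows gives a numeric version of the month-matching comparison described in the observations.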

[back to top](#table-of-contents)
<a id="5"></a>
# 5 Target
This section analyzes the target variables by looking at how they are distributed and how they evolve over time.

<a id="5.1"></a>
## 5.1 Distribution
The target variables consist of `target_carbon_monoxide`, `target_benzene`, and `target_nitrogen_oxides`.

**Observations:**
- `target_carbon_monoxide`, `target_benzene`, and `target_nitrogen_oxides` all appear roughly lognormal: right-skewed, with a long tail of high values.

In [None]:
targets = ["target_carbon_monoxide", "target_benzene", "target_nitrogen_oxides"]

plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(5, 1), facecolor='#f6f5f5')
gs = fig.add_gridspec(1, 3)
gs.update(wspace=0.2, hspace=0.5)

background_color = "#f6f5f5"

run_no = 0
for row in range(0, 1):
    for col in range(0, 3):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        locals()["ax"+str(run_no)].tick_params(axis='y', which=u'both',length=0)
        locals()["ax"+str(run_no)].set_yticklabels([])
        for s in ["top","right"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1

run_no = 0
for col in targets:
    # `fill=True` replaces the deprecated `shade=True` (seaborn >= 0.11)
    sns.kdeplot(train_df[col], ax=locals()["ax"+str(run_no)], fill=True, color='#fcd12a', alpha=0.95, linewidth=0, zorder=2)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel(col, fontsize=5, fontweight='bold')
    locals()["ax"+str(run_no)].tick_params(labelsize=5, width=0.5, length=1.5)
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.7)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.7)
    run_no += 1
    
ax0.text(-1.2, 0.44, 'Target Distribution', fontsize=8, fontweight='bold')
ax0.text(-1.2, 0.40, 'Target variables are showing a lognormal distribution', fontsize=5)

plt.show()
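
One quick way to probe the lognormal impression is to compare the skewness of a target before and after a `log(x + 1)` transform, the same transform the RMSLE metric applies. The sketch below is an illustration on synthetic data, not a formal test of lognormality; the function name `skew_before_after_log` and the random sample are assumptions.

```python
import numpy as np
import pandas as pd


def skew_before_after_log(values):
    """Return (skewness of the raw values, skewness after log1p)."""
    series = pd.Series(values, dtype=float)
    return float(series.skew()), float(np.log1p(series).skew())


# Synthetic right-skewed sample standing in for a target column.
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=0.0, sigma=1.0, size=5000)
raw_skew, log_skew = skew_before_after_log(sample)
print(raw_skew, log_skew)
```

For right-skewed data the second number is smaller than the first. With the real data, `skew_before_after_log(train_df[col])` for each `col` in `targets` would show how much of the skew the transform removes, which is one argument for training on log-transformed targets.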

<a id="5.2"></a>
## 5.2 Time Series
The challenge in making predictions for this time series is that only 1 month of target data is available for `winter`, and there is no information on how the targets transition from `winter` to `spring`.

This section explores the target variables as time series, with the corresponding seasons marked as in the previous sections:
- **Spring**: March to May
- **Summer**: June to August
- **Autumn**: September to November
- **Winter**: December to February

**Observations:**
- Since `autumn` 2010, the values and volatility of all targets appear to increase.
- The highest values of all target variables occur in `autumn` 2010. A question to consider: "Is this a permanent change or a cyclical change?"
- The volatility of `target_carbon_monoxide` decreases as it enters `winter` and continues on a decreasing trend. A question to consider: "Will it continue to decrease and match the previous year's volatility and values?"
- The volatility of `target_benzene` decreases as it enters `winter`. A question to consider: "Will its volatility and values increase again to match the `spring` cycle of the previous year?"
- The volatility and values of `target_nitrogen_oxides` have increased since `autumn` and have remained high since then. A question to consider: "Is this a permanent change or is it due to seasonal conditions?" If it is a permanent change, dropping the train data from March 2010 to September 2010 can be considered, as it would not reflect future conditions.

In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(10,3), facecolor='#f6f5f5')
gs = fig.add_gridspec(3, 1)
gs.update(wspace=0, hspace=1)

background_color = "#f6f5f5"

run_no = 0
for row in range(0, 3):
    for col in range(0, 1):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1

run_no = 0
for col in targets:
    sns.lineplot(ax=locals()["ax"+str(run_no)], y=train_df[col], x=train_df['date_time'], color='#fcd12a')
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel(col, fontsize=5, fontweight='bold')
    locals()["ax"+str(run_no)].tick_params(labelsize=5, width=0.5, length=1.5)
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.7)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.7)
    spring = np.arange(np.datetime64("2010-03-10"), np.datetime64("2010-06-02"))
    locals()["ax"+str(run_no)].fill_between(spring, np.max(train_df[col]), color='#ff69b4', alpha=0.2, zorder=2, linewidth=0)
    summer = np.arange(np.datetime64("2010-06-01"), np.datetime64("2010-09-02"))
    locals()["ax"+str(run_no)].fill_between(summer, np.max(train_df[col]), color='#fcd12a', alpha=0.2, zorder=2, linewidth=0)
    autumn = np.arange(np.datetime64("2010-09-01"), np.datetime64("2010-12-02"))
    locals()["ax"+str(run_no)].fill_between(autumn, np.max(train_df[col]), color='#ff9200', alpha=0.2, zorder=2, linewidth=0)
    winter = np.arange(np.datetime64("2010-12-01"), np.datetime64("2011-01-03"))
    locals()["ax"+str(run_no)].fill_between(winter, np.max(train_df[col]), color='#287094', alpha=0.2, zorder=2, linewidth=0)
    run_no += 1
    
ax0.text(14663, 18, 'Time Series', fontsize=8, fontweight='bold')
ax0.text(14663, 15, 'Showing target time series data', fontsize=5)

plt.show()
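
If the earlier months are judged unrepresentative, as discussed for `target_nitrogen_oxides`, the train data can be restricted to a later period. The sketch below shows one way to do this; the helper name `drop_before` is hypothetical, and the cutoff `2010-09-01` is only an assumed example, since the right cutoff would have to be chosen by validation.

```python
import pandas as pd


def drop_before(df, cutoff, time_col="date_time"):
    """Keep only the rows whose timestamp is on or after `cutoff`."""
    keep = df[time_col] >= pd.Timestamp(cutoff)
    return df.loc[keep].reset_index(drop=True)


# Made-up rows around the assumed cutoff.
demo = pd.DataFrame({
    "date_time": pd.to_datetime(["2010-03-10", "2010-08-31", "2010-09-01", "2010-12-31"]),
    "target_nitrogen_oxides": [100.0, 120.0, 300.0, 450.0],
})
recent = drop_before(demo, "2010-09-01")
print(recent)
```

Dropping rows trades more representative data for less of it, so a comparison of validation scores with and without the early months is a reasonable check before committing to the cut.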

[back to top](#table-of-contents)
<a id="6"></a>
# 6 Correlation
This section explores the correlation between the features and the targets, and also among the features themselves.

<a id="6.1"></a>
## 6.1 Target and Features
This section explores the correlation between the feature variables and the target variables: `target_carbon_monoxide`, `target_benzene`, and `target_nitrogen_oxides`.

**Observations:**
- **General** 
    - `deg_C`, `relative_humidity` and `absolute_humidity` do not show any strong correlation with the target variables.
    - `sensor_1`, `sensor_2` and `sensor_5` are the most promising variables for prediction, as they have the highest correlation with the targets.
    - `deg_C` is mostly concentrated between 10 and 30 degrees Celsius, `relative_humidity` between 20% and 80% and `absolute_humidity` between 0.75 and 1.5.

- **Carbon Monoxide**
    - `sensor_1`, `sensor_2` and `sensor_5` have a very high correlation (above 0.8) with carbon monoxide.
    - `sensor_1` is densest at 750 - 1,250, `sensor_2` at 1,000 - 2,000 and `sensor_5` at 500 - 1,500.

- **Benzene**
    - `sensor_1`, `sensor_2`, `sensor_4` and `sensor_5` have a very high correlation with benzene; `sensor_2` has a correlation of 0.96.
    - `sensor_3` has a strong negative correlation of -0.74 with benzene.
    - `sensor_1` and `sensor_2` are densest at 750 - 1,250, `sensor_4` at 1,250 - 1,750 and `sensor_5` at 500 - 1,500, which is roughly the same density as for the carbon monoxide target.

- **Nitrogen Oxides**
    - `sensor_1`, `sensor_2`, and `sensor_5` have a medium correlation (0.6 - 0.71) with nitrogen oxides, which is not as strong as their correlation with carbon monoxide and benzene.
    - `sensor_1` is concentrated between 750 and 1,125, and `sensor_2` and `sensor_5` between 500 and 1,250.



In [None]:
chart_df = pd.DataFrame(train_df[features].corrwith(train_df['target_carbon_monoxide']))
chart_df.columns = ['corr']

plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(6, 1.5), facecolor='#f6f5f5')
gs = fig.add_gridspec(1, 1)
gs.update(wspace=0.4, hspace=0.1)

background_color = "#f6f5f5"
sns.set_palette(['#00A4CCFF']*6)

ax = fig.add_subplot(gs[0, 0])
for s in ["right", "top"]:
    ax.spines[s].set_visible(False)
ax.set_facecolor(background_color)
ax_sns = sns.barplot(ax=ax, x=chart_df.index, y=chart_df['corr'], color='#fcd12a',
                      zorder=2, linewidth=0, alpha=1, saturation=1)
ax_sns.set_xlabel("Features",fontsize=4, weight='bold')
ax_sns.set_ylabel("Correlation",fontsize=4, weight='bold')
ax_sns.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
ax_sns.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
ax_sns.tick_params(labelsize=4, width=0.5, length=1.5)
ax.text(-0.5, 1.25, 'Carbon Monoxide', fontsize=6, ha='left', va='top', weight='bold')
ax.text(-0.5, 1.13, 'Correlation between carbon monoxide and feature values', fontsize=4, ha='left', va='top')
# data label
for p in ax.patches:
    percentage = f'{p.get_height():.2f}'
    x = p.get_x() + p.get_width() / 2
    y = p.get_height() + 0.05
    ax.text(x, y, percentage, ha='center', va='bottom', fontsize=4,
           bbox=dict(facecolor='none', edgecolor='black', boxstyle='round', linewidth=0.3))

plt.show()
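
The bar charts in this section show one target at a time. As a compact alternative, the sketch below builds a single table of Pearson correlations between every feature and every target. The function name `feature_target_correlation` and the demo frame are hypothetical.

```python
import pandas as pd


def feature_target_correlation(df, features, targets):
    """Pearson correlation table: one row per feature, one column per target."""
    columns = list(features) + list(targets)
    # Compute the full correlation matrix, then keep the feature/target block.
    return df[columns].corr().loc[list(features), list(targets)]


# Made-up data: one feature rises with the target, the other falls.
demo = pd.DataFrame({
    "sensor_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "sensor_b": [5.0, 4.0, 3.0, 2.0, 1.0],
    "target_demo": [3.0, 5.0, 7.0, 9.0, 11.0],
})
table = feature_target_correlation(demo, ["sensor_a", "sensor_b"], ["target_demo"])
print(table)
```

With the real data, `feature_target_correlation(train_df, features, targets)` would return the same values as the three `corrwith` calls, side by side.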

In [None]:
features = [feature for feature in train_df.columns if feature not in ["date_time", "target_carbon_monoxide", 
                                                                       "target_benzene", "target_nitrogen_oxides"]]

plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(8, 8), facecolor='#f6f5f5')
gs = fig.add_gridspec(3, 3)
gs.update(wspace=0.2, hspace=0.25)

background_color = "#f6f5f5"
cmap = sns.light_palette('#fcd12a', as_cmap=True)

run_no = 0
for row in range(0, 3):
    for col in range(0, 3):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1

run_no = 0
for col in features:
    locals()["ax"+str(run_no)].hexbin(x=train_df[col], y=train_df['target_carbon_monoxide'], gridsize=15, 
                                      cmap=cmap, zorder=2, facecolor='black', mincnt=1)
    locals()["ax"+str(run_no)].set_ylabel('target_carbon_monoxide', fontsize=5, fontweight='bold')
    locals()["ax"+str(run_no)].set_xlabel(col, fontsize=5, fontweight='bold')
    locals()["ax"+str(run_no)].tick_params(labelsize=5, width=0.5, length=1.5)
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.7)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.7)
    run_no += 1
    
ax0.text(0, 14.5, 'Carbon Monoxide', fontsize=10, fontweight='bold')
ax0.text(0, 13.6, 'Hexbin plot between features and carbon monoxide', fontsize=7)

ax8.remove()

In [None]:
chart_df = pd.DataFrame(train_df[features].corrwith(train_df['target_benzene']))
chart_df.columns = ['corr']

plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(6, 1.5), facecolor='#f6f5f5')
gs = fig.add_gridspec(1, 1)
gs.update(wspace=0.4, hspace=0.1)

background_color = "#f6f5f5"
sns.set_palette(['#00A4CCFF']*6)

ax = fig.add_subplot(gs[0, 0])
for s in ["right", "top"]:
    ax.spines[s].set_visible(False)
ax.set_facecolor(background_color)
ax_sns = sns.barplot(ax=ax, x=chart_df.index, y=chart_df['corr'], color='#fcd12a',
                      zorder=2, linewidth=0, alpha=1, saturation=1)
ax_sns.set_xlabel("Features",fontsize=4, weight='bold')
ax_sns.set_ylabel("Correlation",fontsize=4, weight='bold')
ax_sns.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
ax_sns.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
ax_sns.tick_params(labelsize=4, width=0.5, length=1.5)
ax.text(-0.5, 1.35, 'Benzene', fontsize=6, ha='left', va='top', weight='bold')
ax.text(-0.5, 1.2, 'Correlation between benzene and feature values', fontsize=4, ha='left', va='top')
# data label
for p in ax.patches:
    percentage = f'{p.get_height():.2f}'
    x = p.get_x() + p.get_width() / 2
    y = p.get_height() + 0.05
    ax.text(x, y, percentage, ha='center', va='bottom', fontsize=4,
           bbox=dict(facecolor='none', edgecolor='black', boxstyle='round', linewidth=0.3))

plt.show()

In [None]:
features = [feature for feature in train_df.columns if feature not in ["date_time", "target_carbon_monoxide", 
                                                                       "target_benzene", "target_nitrogen_oxides"]]

plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(8, 8), facecolor='#f6f5f5')
gs = fig.add_gridspec(3, 3)
gs.update(wspace=0.2, hspace=0.25)

background_color = "#f6f5f5"
cmap = sns.light_palette('#fcd12a', as_cmap=True)

run_no = 0
for row in range(0, 3):
    for col in range(0, 3):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1

run_no = 0
for col in features:
    locals()["ax"+str(run_no)].hexbin(x=train_df[col], y=train_df['target_benzene'], gridsize=15, 
                                      cmap=cmap, zorder=2, facecolor='black', mincnt=1)
    locals()["ax"+str(run_no)].set_ylabel('target_benzene', fontsize=5, fontweight='bold')
    locals()["ax"+str(run_no)].set_xlabel(col, fontsize=5, fontweight='bold')
    locals()["ax"+str(run_no)].tick_params(labelsize=5, width=0.5, length=1.5)
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.7)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.7)
    run_no += 1
    
ax0.text(0, 75, 'Benzene', fontsize=10, fontweight='bold')
ax0.text(0, 70, 'Hexbin plot between features and benzene', fontsize=7)

ax8.remove()

In [None]:
chart_df = pd.DataFrame(train_df[features].corrwith(train_df['target_nitrogen_oxides']))
chart_df.columns = ['corr']

plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(6, 1.5), facecolor='#f6f5f5')
gs = fig.add_gridspec(1, 1)
gs.update(wspace=0.4, hspace=0.1)

background_color = "#f6f5f5"
sns.set_palette(['#00A4CCFF']*6)

ax = fig.add_subplot(gs[0, 0])
for s in ["right", "top"]:
    ax.spines[s].set_visible(False)
ax.set_facecolor(background_color)
ax_sns = sns.barplot(ax=ax, x=chart_df.index, y=chart_df['corr'], color='#fcd12a',
                      zorder=2, linewidth=0, alpha=1, saturation=1)
ax_sns.set_xlabel("Features",fontsize=4, weight='bold')
ax_sns.set_ylabel("Correlation",fontsize=4, weight='bold')
ax_sns.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
ax_sns.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
ax_sns.tick_params(labelsize=4, width=0.5, length=1.5)
ax.text(-0.5, 0.95, 'Nitrogen Oxides', fontsize=6, ha='left', va='top', weight='bold')
ax.text(-0.5, 0.85, 'Correlation between nitrogen oxides and feature values', fontsize=4, ha='left', va='top')
# data label
for p in ax.patches:
    percentage = f'{p.get_height():.2f}'
    x = p.get_x() + p.get_width() / 2
    y = p.get_height() + 0.05
    ax.text(x, y, percentage, ha='center', va='bottom', fontsize=4,
           bbox=dict(facecolor='none', edgecolor='black', boxstyle='round', linewidth=0.3))

plt.show()

In [None]:
features = [feature for feature in train_df.columns if feature not in ["date_time", "target_carbon_monoxide", 
                                                                       "target_benzene", "target_nitrogen_oxides"]]

plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(8, 8), facecolor='#f6f5f5')
gs = fig.add_gridspec(3, 3)
gs.update(wspace=0.2, hspace=0.25)

background_color = "#f6f5f5"
cmap = sns.light_palette('#fcd12a', as_cmap=True)

run_no = 0
for row in range(0, 3):
    for col in range(0, 3):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1

run_no = 0
for col in features:
    locals()["ax"+str(run_no)].hexbin(x=train_df[col], y=train_df['target_nitrogen_oxides'], gridsize=15, 
                                      cmap=cmap, zorder=2, facecolor='black', mincnt=1)
    locals()["ax"+str(run_no)].set_ylabel('target_nitrogen_oxides', fontsize=5, fontweight='bold')
    locals()["ax"+str(run_no)].set_xlabel(col, fontsize=5, fontweight='bold')
    locals()["ax"+str(run_no)].tick_params(labelsize=5, width=0.5, length=1.5)
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.7)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.7)
    run_no += 1
    
ax0.text(0, 1700, 'Nitrogen Oxides', fontsize=10, fontweight='bold')
ax0.text(0, 1600, 'Hexbin plot between features and nitrogen oxides', fontsize=7)

ax8.remove()

[back to top](#table-of-contents)
<a id="6.2"></a>
## 6.2 Features

Explore the correlation between feature variables in the train and test datasets.

**Observations:**
- The strongest feature correlations that are roughly the same in the train and test datasets are:
    - Negative correlation: `sensor_3 and sensor_2`, `sensor_3 and sensor_4` and `sensor_3 and sensor_5`.
    - Positive correlation: `sensor_1 and sensor_2`, `sensor_1 and sensor_5`, `sensor_2 and sensor_4`, `sensor_2 and sensor_5`.
- Some feature correlations differ between train and test due to the different time periods the two datasets cover: the test dataset runs from `winter` to `spring`, while the train dataset runs from `spring` to `winter`. Examples that can be seen in the heatmaps below:
    - The high correlation of `0.78` between `sensor_4 and sensor_5` in the test dataset is not reflected in the train dataset.
    - The correlation of `deg_C and sensor_5` changes sign: it is negative (`-0.051`) in the train dataset but positive (`0.036`) in the test dataset. The same sign flip happens for `relative_humidity and sensor_2`.
- The top 5 differences in feature correlation between the train and test datasets are:
    - `relative_humidity and sensor_4`
    - `relative_humidity and absolute_humidity`
    - `absolute_humidity and sensor_1`
    - `deg_C and sensor_3`
    - `deg_C and sensor_1`
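
A ranking like the top 5 differences listed above can also be extracted programmatically instead of being read off a heatmap. The snippet below is a minimal sketch of one way to do it, not the code used to produce the list. The helper name `top_corr_differences` and the toy frames `toy_train` and `toy_test` are hypothetical and stand in for `train_df[features]` and `test_df[features]`.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def top_corr_differences(train: pd.DataFrame, test: pd.DataFrame, k: int = 5) -> pd.Series:
    """Return the k column pairs whose correlation differs most between two frames."""
    # Absolute difference between the two Pearson correlation matrices.
    diff = (train.corr() - test.corr()).abs()
    # Visit every unordered pair once; the diagonal and mirrored entries are skipped.
    pairs = {(a, b): diff.loc[a, b] for a, b in combinations(diff.columns, 2)}
    return pd.Series(pairs).sort_values(ascending=False).head(k)


# Toy data (hypothetical, not the competition data): the a/b relationship
# is positive in one frame and negative in the other.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy_train = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})
toy_test = pd.DataFrame({
    "a": base,
    "b": -base + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

top_pairs = top_corr_differences(toy_train, toy_test, k=2)
print(top_pairs)
```

On the notebook's data, the equivalent call would presumably be `top_corr_differences(train_df[features], test_df[features])`, assuming both frames contain the same feature columns.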

In [None]:
train_corr = train_df[features].corr()
test_corr = test_df[features].corr()
diff_corr = abs(train_corr - test_corr)

plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(10, 5), facecolor='#f6f5f5')
gs = fig.add_gridspec(1, 3)
gs.update(wspace=0.5, hspace=0)

background_color = "#f6f5f5"
cmap_train = sns.dark_palette('#fcd12a', as_cmap=True)
cmap_test = sns.dark_palette('#287094', as_cmap=True)
cmap_diff = sns.dark_palette('#ff69b4', as_cmap=True)

mask = np.triu(np.ones_like(train_corr, dtype=bool))

run_no = 0
for row in range(0, 1):
    for col in range(0, 3):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1
        
sns.heatmap(train_corr, ax=ax0, cmap=cmap_train, square=True, mask=mask, linewidths=.5, linecolor='#f6f5f5', 
            cbar_kws={"shrink": .3}, annot=True, annot_kws={"fontsize":4})
ax0.tick_params(labelsize=5, width=0.5, length=1.5)
ax0.set_xlabel('Train Dataset', fontsize=5, fontweight='bold')
cax = plt.gcf().axes[-1]
cax.tick_params(labelsize=5)

sns.heatmap(test_corr, ax=ax1, cmap=cmap_test, square=True, mask=mask, linewidths=.5, linecolor='#f6f5f5', 
            cbar_kws={"shrink": .3}, annot=True, annot_kws={"fontsize":4})
ax1.tick_params(labelsize=5, width=0.5, length=1.5)
ax1.set_xlabel('Test Dataset', fontsize=5, fontweight='bold')
cax = plt.gcf().axes[-1]
cax.tick_params(labelsize=5)

sns.heatmap(diff_corr, ax=ax2, cmap=cmap_diff, square=True, mask=mask, linewidths=.5, linecolor='#f6f5f5', 
            cbar_kws={"shrink": .3}, annot=True, annot_kws={"fontsize":4})
ax2.tick_params(labelsize=5, width=0.5, length=1.5)
ax2.set_xlabel('Absolute Correlation Differences', fontsize=5, fontweight='bold')
cax = plt.gcf().axes[-1]
cax.tick_params(labelsize=5)

ax0.text(0, -0.8, 'Features Correlation', fontsize=10, fontweight='bold')
ax0.text(0, -0.2, 'Correlation between features in train and test dataset with their absolute correlation differences', fontsize=7)

plt.show()

[back to top](#table-of-contents)
<a id="6.3"></a>
## 6.3 Target

Explore the correlation between targets in the train dataset to see whether one or more targets can be used to predict the others.

**Observations:**
- `target_carbon_monoxide` has a high correlation with both `target_benzene` (0.88) and `target_nitrogen_oxides` (0.81). This suggests that `target_carbon_monoxide` may be useful for predicting `target_benzene` and `target_nitrogen_oxides`.
- A good prediction of `target_carbon_monoxide` should therefore help to predict the other targets. As a reminder, 3 features (`sensor_1`, `sensor_2` and `sensor_5`) have a very high correlation (above 0.8) with `target_carbon_monoxide`.
- The correlation between `target_benzene` and `target_nitrogen_oxides` (0.66) is weaker than the correlation of either target with `target_carbon_monoxide`.
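
As a rough guide to what these coefficients mean for prediction: in a one-variable linear fit with an intercept, the share of variance explained (R²) equals the squared Pearson correlation. The reported values therefore correspond to roughly 0.77, 0.66 and 0.44 of the variance for 0.88, 0.81 and 0.66 respectively. The sketch below checks the identity on synthetic data. The arrays `x` and `y` are made up for illustration and are not the competition targets, and a linear fit is only one possible way of using one target to predict another.

```python
import numpy as np

# Synthetic, linearly related data (hypothetical, for illustration only).
rng = np.random.default_rng(42)
x = rng.normal(size=1000)
y = 2.0 * x + rng.normal(size=1000)

# Pearson correlation between x and y.
r = np.corrcoef(x, y)[0, 1]

# Ordinary least squares line y ~ slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# Share of the variance of y explained by the fitted line.
r_squared = 1.0 - residuals.var() / y.var()
print(f"r^2 from correlation: {r ** 2:.4f}, R^2 from linear fit: {r_squared:.4f}")

# Squared values of the correlations reported in the observations above.
reported = {
    "carbon_monoxide vs benzene": 0.88,
    "carbon_monoxide vs nitrogen_oxides": 0.81,
    "benzene vs nitrogen_oxides": 0.66,
}
implied = {name: corr ** 2 for name, corr in reported.items()}
for name, value in implied.items():
    print(f"{name}: r^2 = {value:.2f}")
```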

In [None]:
targets = ['target_carbon_monoxide', 'target_benzene', 'target_nitrogen_oxides']
chart_df = train_df[targets].corr()
chart_df

In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(8, 2), facecolor='#f6f5f5')
gs = fig.add_gridspec(1, 3)
gs.update(wspace=0.2, hspace=0.25)

background_color = "#f6f5f5"
cmap = sns.light_palette('#fcd12a', as_cmap=True)

run_no = 0
for row in range(0, 1):
    for col in range(0, 3):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1


ax0.hexbin(x=train_df['target_carbon_monoxide'], y=train_df['target_benzene'], gridsize=15, 
           cmap=cmap, zorder=2, facecolor='black', mincnt=1)
ax0.set_ylabel('target_benzene', fontsize=5, fontweight='bold')
ax0.set_xlabel('target_carbon_monoxide', fontsize=5, fontweight='bold')
ax0.tick_params(labelsize=5, width=0.5, length=1.5)
ax0.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.7)
ax0.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.7)

ax1.hexbin(x=train_df['target_carbon_monoxide'], y=train_df['target_nitrogen_oxides'], gridsize=15, 
           cmap=cmap, zorder=2, facecolor='black', mincnt=1)
ax1.set_ylabel('target_nitrogen_oxides', fontsize=5, fontweight='bold')
ax1.set_xlabel('target_carbon_monoxide', fontsize=5, fontweight='bold')
ax1.tick_params(labelsize=5, width=0.5, length=1.5)
ax1.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.7)
ax1.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.7)

ax2.hexbin(x=train_df['target_benzene'], y=train_df['target_nitrogen_oxides'], gridsize=15, 
           cmap=cmap, zorder=2, facecolor='black', mincnt=1)
ax2.set_ylabel('target_nitrogen_oxides', fontsize=5, fontweight='bold')
ax2.set_xlabel('target_benzene', fontsize=5, fontweight='bold')
ax2.tick_params(labelsize=5, width=0.5, length=1.5)
ax2.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.7)
ax2.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.7)

ax0.text(0, 75, 'Target Correlations', fontsize=10, fontweight='bold')
ax0.text(0, 70, 'target_carbon_monoxide has a high correlation with other targets', fontsize=7)

plt.show()

[back to top](#table-of-contents)
<a id="7"></a>
# 7 Winners Solution
Congratulations to all the winners, and thank you for sharing your solutions. Below are the winners and their solutions:
- 1st place position: [Laura Desplans](https://www.kaggle.com/desplanl) - [Holy cow, I ranked 1st! Solution/Discussion](https://www.kaggle.com/c/tabular-playground-series-jul-2021/discussion/256486)