<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:skyblue;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">
<h1 style="text-align: center;
           padding: 10px;
              color:white">
Tabular Playground Series - Jul 2021
</h1>
</div>

# Introduction
In this notebook, we predict air pollution measurements over time from basic weather information (temperature and humidity) and the readings of 5 sensors.

The three targets to predict are `target_carbon_monoxide`, `target_benzene`, and `target_nitrogen_oxides`.

## Dataset
The data is available at [this link](https://www.kaggle.com/c/tabular-playground-series-jul-2021/data) and contains these files:
* train.csv - the training data, including the weather data, sensor data, and values for the 3 targets
* test.csv - the same format as train.csv, but without the target values; the task is to predict the value of each target for these rows
* sample_submission.csv - a sample submission file in the correct format


**Importing crucial libraries**

In [None]:
import math

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import datetime
import time

from sklearn.preprocessing import MinMaxScaler

import warnings
warnings.filterwarnings("ignore")

**Loading the Datasets**

In [None]:
dataset = pd.read_csv("../input/tabular-playground-series-jul-2021/train.csv")
test = pd.read_csv("../input/tabular-playground-series-jul-2021/test.csv")
sample_submission = pd.read_csv("../input/tabular-playground-series-jul-2021/sample_submission.csv")

## Exploratory Data Analysis (EDA)

In [None]:
dataset.head(5)

In [None]:
dataset.tail(5)

In [None]:
dataset.info()

In [None]:
dataset.isna().sum()

In [None]:
dataset.describe()

### Summary :
1) Data types are as expected
2) No missing values

In [None]:
data = pd.DataFrame({'degree Celsuis': dataset['deg_C'],
                    'Relative Humidity': dataset['relative_humidity'],
                    'Absolute Humidity': dataset['absolute_humidity'],
                    'Sensor 1': dataset['sensor_1'],
                    'Sensor 2': dataset['sensor_2'],
                    'Sensor 3': dataset['sensor_3'],
                    'Sensor 4': dataset['sensor_4'],
                    'Sensor 5': dataset['sensor_5'],  
                       })

In [None]:
target = dataset[['target_carbon_monoxide', 'target_benzene', 'target_nitrogen_oxides']]

### Distribution Analysis :

**For the features :**

In [None]:
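def histplot(data):
    # Draw one histogram per column. hist() returns an array of Axes,
    # so printing it only adds noise to the cell output.
    data.hist(grid=True, orientation='vertical', color='skyblue', edgecolor='red', linewidth=1, bins=30, figsize=(20, 12.5))
    plt.show()

histplot(data)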

**For the targets :**

In [None]:
# One histogram per target; no print() needed, since hist() returns an array of Axes.
target.hist(grid=True, orientation='vertical', color='skyblue', edgecolor='red', linewidth=1, bins=30, figsize=(20, 12.5))
plt.show()

### Detecting the Outliers :

In [None]:
scaler = MinMaxScaler()
scaled_data = pd.DataFrame(scaler.fit_transform(data), columns=data.columns, index=data.index)
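With its default `feature_range=(0, 1)`, `MinMaxScaler` rescales each column as (x - min) / (max - min), which puts all features on a common scale for the box plot. A small sketch of the same arithmetic in plain pandas, using made-up values rather than the competition data:

```python
import pandas as pd

# Made-up values standing in for two columns of `data`.
toy = pd.DataFrame({
    "deg_C": [10.0, 20.0, 30.0],
    "sensor_1": [800.0, 1000.0, 1600.0],
})

# Column-wise min-max scaling: every column ends up in the range [0, 1].
scaled_toy = (toy - toy.min()) / (toy.max() - toy.min())
print(scaled_toy)
```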

In [None]:
scaled_data.boxplot(figsize=(20,7.5),grid=True, rot=0, fontsize=15, patch_artist=True, color='skyblue');
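The points drawn beyond the whiskers of a box plot are, by matplotlib's default, those lying more than 1.5 times the interquartile range (IQR) outside the quartiles. To turn the picture into numbers, a sketch like the following counts such points per column. The helper `count_iqr_outliers` is hypothetical, and the data is made up.

```python
import pandas as pd


def count_iqr_outliers(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values per column lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    outside = (df < q1 - k * iqr) | (df > q3 + k * iqr)
    return outside.sum()


# Made-up data: the first column contains one obvious outlier.
toy = pd.DataFrame({
    "sensor_1": [1.0, 2.0, 3.0, 4.0, 100.0],
    "sensor_2": [1.0, 2.0, 3.0, 4.0, 5.0],
})
outlier_counts = count_iqr_outliers(toy)
print(outlier_counts)
```

Because min-max scaling is a linear transformation, the counts are the same whether the function is applied to `data` or to `scaled_data`.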

**Deeper Insight into Distribution of the Data :**

In [None]:
g = sns.PairGrid(scaled_data, diag_sharey=False, corner=True)
g.map_lower(sns.scatterplot, color='skyblue')
g.map_diag(sns.kdeplot, color='red')

**Target Distribution over Time :**

In [None]:
plt.figure(figsize=(15,8))
plt.grid(True)
dataset['date_time'] = pd.to_datetime(dataset['date_time'], errors='coerce')
sns.scatterplot(x = dataset['date_time'], y = target['target_carbon_monoxide'], alpha=0.5, color='skyblue')
plt.title("Carbon Monoxide (CO) distribution over time", size=15)

In [None]:
plt.figure(figsize=(15,8))
plt.grid(True)
dataset['date_time'] = pd.to_datetime(dataset['date_time'], errors='coerce')
sns.scatterplot(x = dataset['date_time'], y = target['target_benzene'], alpha=0.5, color='skyblue')
plt.title("Benzene (BZ) distribution over time", size=15)

In [None]:
plt.figure(figsize=(15,8))
plt.grid(True)
dataset['date_time'] = pd.to_datetime(dataset['date_time'], errors='coerce')
sns.scatterplot(x = dataset['date_time'], y = target['target_nitrogen_oxides'], alpha=0.5, color='skyblue')
plt.title("Nitrogen Oxides (NOx) distribution over time", size=15)

**Target Trend over Temperature :**

In [None]:
plt.figure(figsize=(15,8))
plt.grid(True)
sns.lineplot(x=data['degree Celsuis'], y=target['target_carbon_monoxide'], color='skyblue')
plt.title("Trend of Carbon Monoxide (CO) over Temperature", size=15)

In [None]:
plt.figure(figsize=(15,8))
plt.grid(True)
sns.lineplot(x = data['degree Celsuis'], y= target['target_benzene'], color='skyblue')
plt.title("Trend of Benzene (BZ) over Temperature", size=15)

In [None]:
plt.figure(figsize=(15,8))
plt.grid(True)
sns.lineplot(x = data['degree Celsuis'], y= target['target_nitrogen_oxides'], color='skyblue')
plt.title("Trend of Nitrogen Oxides (NOx) over Temperature", size=15)

### Understanding Correlations :

In [None]:
plt.figure(figsize = (12, 8))
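# Restrict to numeric columns: 'date_time' is not numeric, and recent pandas
# versions raise an error instead of silently skipping such columns.
corr_train = dataset.select_dtypes(include='number').corr()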
sns.heatmap(corr_train, annot = True, cmap="Blues");

### Summary :
In the heatmap above, 'sensor_3' behaves differently from the other sensors and shows no significant correlation with any of the target values, so it is dropped from both the training and test sets.
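One way to back an observation like this with numbers is to rank the features by their absolute correlation with each target. The sketch below uses made-up data, and the helper `target_correlations` is hypothetical; on the real data it would be called with `dataset` and the three target column names.

```python
import pandas as pd


def target_correlations(df: pd.DataFrame, targets: list) -> pd.DataFrame:
    """Absolute Pearson correlation of each numeric feature with each target."""
    corr = df.select_dtypes(include="number").corr()
    features = [c for c in corr.columns if c not in targets]
    return corr.loc[features, targets].abs()


# Made-up data: sensor_a tracks the target exactly, sensor_b is unrelated to it.
toy = pd.DataFrame({
    "sensor_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "sensor_b": [2.0, 1.0, 3.0, 1.0, 2.0],
    "target": [2.0, 4.0, 6.0, 8.0, 10.0],
})
ranking = target_correlations(toy, ["target"]).sort_values("target")
print(ranking)
```

Features at the top of the sorted table are the weakest linear predictors. Note that a low Pearson correlation only rules out a linear relationship, so this is a screening step rather than proof that a feature is useless.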

In [None]:
dataset.drop(columns = 'sensor_3', inplace = True)

In [None]:
test.drop(columns = 'sensor_3', inplace = True)

### Feature Engineering

In [None]:
dataset['date_time'] = pd.to_datetime(dataset['date_time'], errors='coerce')

dataset['month'] = dataset['date_time'].dt.month
dataset['is_winter'] = dataset['month'].isin([1, 2, 12]).astype('int')
dataset['is_spring'] = dataset['month'].isin([3, 4, 5]).astype('int') 
dataset['is_summer'] = dataset['month'].isin([6, 7, 8]).astype('int')
dataset['is_autumn'] = dataset['month'].isin([9, 10, 11]).astype('int')

dataset['hour'] = dataset['date_time'].dt.hour
# Despite its name, 'hr' holds minutes since midnight (hour * 60 + minute).
dataset['hr'] = dataset['date_time'].dt.hour * 60 + dataset['date_time'].dt.minute

dataset['working_hours'] = dataset['hour'].isin(np.arange(8, 21, 1)).astype('int')
dataset['dayofweek'] = dataset['date_time'].dt.dayofweek
dataset['is_weekend'] = (dataset['date_time'].dt.dayofweek >= 5).astype('int')
dataset.drop(columns = 'hour', inplace = True)

# Humidity-derived features, computed column-wise instead of with per-row lambdas.
# Note: 'dew_point' is a Magnus-style term built from absolute humidity, not a dew point in degrees Celsius.
dataset['dew_point'] = (17.27 * dataset['deg_C']) / (237.7 + dataset['deg_C']) + np.log(dataset['absolute_humidity'])
dataset['partial_pressure'] = ((237.7 + dataset['deg_C']) * 286.8 * dataset['absolute_humidity']) / 100000
dataset['saturated_wvd'] = (dataset['absolute_humidity'] * 100) / dataset['relative_humidity']
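The `dew_point` feature above combines the term 17.27 * T / (237.7 + T) with the logarithm of absolute humidity. For comparison, the textbook Magnus approximation uses the same two constants but takes the logarithm of relative humidity (as a fraction) and then inverts the expression to obtain a temperature. The sketch below shows that standard form; it is not what the notebook computes, and the function name `magnus_dew_point` is hypothetical.

```python
import math


def magnus_dew_point(temp_c: float, relative_humidity: float) -> float:
    """Approximate dew point in deg C from air temperature (deg C) and RH (%)."""
    a, b = 17.27, 237.7  # Magnus coefficients, matching the constants used above
    gamma = (a * temp_c) / (b + temp_c) + math.log(relative_humidity / 100.0)
    return (b * gamma) / (a - gamma)


# At 100 % relative humidity the dew point equals the air temperature.
print(round(magnus_dew_point(20.0, 100.0), 2))
# Drier air at the same temperature has a lower dew point.
print(round(magnus_dew_point(20.0, 50.0), 2))
```

Whether the physically motivated version helps the model more than the notebook's variant is an empirical question that would need to be checked against the leaderboard metric.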

In [None]:
dataset.drop(columns = 'date_time', inplace = True)

In [None]:
test['date_time'] = pd.to_datetime(test['date_time'], errors='coerce')

test['month'] = test['date_time'].dt.month
test['is_winter'] = test['month'].isin([1, 2, 12]).astype('int')
test['is_spring'] = test['month'].isin([3, 4, 5]).astype('int')
test['is_summer'] = test['month'].isin([6, 7, 8]).astype('int')
test['is_autumn'] = test['month'].isin([9, 10, 11]).astype('int')

test['hour'] = test['date_time'].dt.hour
# Despite its name, 'hr' holds minutes since midnight (hour * 60 + minute).
test['hr'] = test['date_time'].dt.hour * 60 + test['date_time'].dt.minute

test['working_hours'] = test['hour'].isin(np.arange(8, 21, 1)).astype('int')
test['dayofweek'] = test['date_time'].dt.dayofweek
test['is_weekend'] = (test['date_time'].dt.dayofweek >= 5).astype('int')
test.drop(columns = 'hour', inplace = True)

# Humidity-derived features, computed column-wise instead of with per-row lambdas.
# Note: 'dew_point' is a Magnus-style term built from absolute humidity, not a dew point in degrees Celsius.
test['dew_point'] = (17.27 * test['deg_C']) / (237.7 + test['deg_C']) + np.log(test['absolute_humidity'])
test['partial_pressure'] = ((237.7 + test['deg_C']) * 286.8 * test['absolute_humidity']) / 100000
test['saturated_wvd'] = (test['absolute_humidity'] * 100) / test['relative_humidity']

In [None]:
test.drop(columns = 'date_time', inplace = True)
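The training and test sets go through exactly the same calendar transformations, so the two cells above could share one function, which removes the risk of the two copies drifting apart. A minimal sketch, assuming a `date_time` column as in the competition files; the helper name `add_time_features` is hypothetical, and the humidity-derived features are left out for brevity.

```python
import pandas as pd


def add_time_features(df: pd.DataFrame, time_col: str = "date_time") -> pd.DataFrame:
    """Return a copy of df with the calendar features used above appended."""
    out = df.copy()
    ts = pd.to_datetime(out[time_col], errors="coerce")
    month = ts.dt.month
    out["month"] = month
    out["is_winter"] = month.isin([1, 2, 12]).astype("int")
    out["is_spring"] = month.isin([3, 4, 5]).astype("int")
    out["is_summer"] = month.isin([6, 7, 8]).astype("int")
    out["is_autumn"] = month.isin([9, 10, 11]).astype("int")
    out["hr"] = ts.dt.hour * 60 + ts.dt.minute  # minutes since midnight
    out["working_hours"] = ts.dt.hour.between(8, 20).astype("int")  # 08:00 to 20:59
    out["dayofweek"] = ts.dt.dayofweek  # Monday = 0, Sunday = 6
    out["is_weekend"] = (ts.dt.dayofweek >= 5).astype("int")
    return out


# Made-up timestamps: a Saturday morning in March and a Wednesday night in December.
toy = pd.DataFrame({"date_time": ["2010-03-13 09:30:00", "2010-12-15 22:00:00"]})
features = add_time_features(toy)
print(features.drop(columns="date_time"))
```

In the notebook this would be used as `dataset = add_time_features(dataset)` and `test = add_time_features(test)`, followed by dropping `date_time` from both.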

<center><img src="https://docs.h2o.ai/h2o/latest-stable/h2o-docs/_images/h2o-automl-logo.jpg" width="200" height="200"></center>
<h3><center>H2O AutoML Modelling</center></h3>

In [None]:
import h2o
from h2o.automl import H2OAutoML

In [None]:
h2o.init()

h2o_train = h2o.H2OFrame(dataset)
h2o_test = h2o.H2OFrame(test)

In [None]:
# Model for carbon monoxide (CO)
features = [x for x in h2o_train.columns if x not in ['date_time', 'target_carbon_monoxide', 'target_benzene', 'target_nitrogen_oxides']]

aml_carbon_monoxide = H2OAutoML(
    max_runtime_secs=180,
    stopping_metric='RMSLE',
    sort_metric='RMSLE'
)

aml_carbon_monoxide.train(x=features, y='target_carbon_monoxide', training_frame=h2o_train)

In [None]:
# Leaderboard for carbon monoxide (CO)
aml_carbon_monoxide.leaderboard
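The AutoML runs in this notebook are stopped and sorted by RMSLE, the root mean squared logarithmic error. As a reminder of what that metric measures, here is a sketch in plain Python. The helper `rmsle` is hypothetical and not part of H2O, and it assumes non-negative true and predicted values.

```python
import math


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for non-negative values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_log_errors = [
        (math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# Identical inputs give an error of 0.0.
print(rmsle([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
# Made-up predictions that miss two of the three values.
print(rmsle([1.0, 2.0, 3.0], [1.5, 2.0, 2.5]))
```

Because the error is taken on log(1 + y), the metric penalises relative rather than absolute deviations, which suits targets that span very different magnitudes.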

In [None]:
# Model for benzene (BZ)
features = [x for x in h2o_train.columns if x not in ['date_time', 'target_carbon_monoxide', 'target_benzene', 'target_nitrogen_oxides']]

aml_benzene = H2OAutoML(
    max_runtime_secs=180,
    stopping_metric='RMSLE',
    sort_metric='RMSLE'
)

aml_benzene.train(x=features, y='target_benzene', training_frame=h2o_train)

In [None]:
# Leaderboard for benzene (BZ)
aml_benzene.leaderboard

In [None]:
# Model for nitrogen oxides (NOx)
features = [x for x in h2o_train.columns if x not in ['date_time', 'target_carbon_monoxide', 'target_benzene', 'target_nitrogen_oxides']]

aml_nitrogen_oxide = H2OAutoML(
    max_runtime_secs=180,
    stopping_metric='RMSLE',
    sort_metric='RMSLE'
)

aml_nitrogen_oxide.train(x=features, y='target_nitrogen_oxides', training_frame=h2o_train)

In [None]:
# Leaderboard for nitrogen oxides (NOx)
aml_nitrogen_oxide.leaderboard

### Prediction :

In [None]:
prediction_1 = aml_carbon_monoxide.predict(h2o_test)
prediction_2 = aml_benzene.predict(h2o_test)
prediction_3 = aml_nitrogen_oxide.predict(h2o_test)

In [None]:
prediction_1.set_names(['target_carbon_monoxide'])
prediction_2.set_names(['target_benzene'])
prediction_3.set_names(['target_nitrogen_oxides']);

In [None]:
pred_1_data = h2o.as_list(prediction_1)
pred_2_data = h2o.as_list(prediction_2)
pred_3_data = h2o.as_list(prediction_3)

In [None]:
submission = pd.concat([pd.DataFrame(sample_submission['date_time']),pred_1_data, pred_2_data, pred_3_data], axis = 1)
submission.to_csv('sample_submission.csv', index = False)
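Before uploading, it can be worth checking the assembled frame against the sample file. The sketch below is one possible set of checks, run here on made-up frames; the helper `check_submission` is hypothetical. The check for negative values is motivated by the RMSLE metric, which takes the logarithm of 1 + prediction.

```python
import pandas as pd

TARGETS = ["target_carbon_monoxide", "target_benzene", "target_nitrogen_oxides"]


def check_submission(submission: pd.DataFrame, sample: pd.DataFrame) -> list:
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if list(submission.columns) != list(sample.columns):
        problems.append("column names or order differ from the sample file")
    if len(submission) != len(sample):
        problems.append("row count differs from the sample file")
    present = [c for c in TARGETS if c in submission.columns]
    if submission[present].isna().any().any():
        problems.append("missing predictions")
    if (submission[present] < 0).any().any():
        problems.append("negative predictions")
    return problems


# Made-up stand-ins for `sample_submission` and `submission`.
sample = pd.DataFrame({
    "date_time": ["2011-01-01 00:00:00", "2011-01-01 01:00:00"],
    "target_carbon_monoxide": [0.0, 0.0],
    "target_benzene": [0.0, 0.0],
    "target_nitrogen_oxides": [0.0, 0.0],
})
good = sample.copy()
good[TARGETS] = [[1.2, 5.0, 150.0], [0.9, 3.1, 90.0]]
bad = good.copy()
bad.loc[0, "target_benzene"] = -0.5

print(check_submission(good, sample))
print(check_submission(bad, sample))
```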

### References :

H2O AutoML : https://www.h2o.ai/products/h2o-automl/