## Tabular Playground Series - Jul 2021

Objective: predict the target features

Dependent variables: target_carbon_monoxide, target_benzene, target_nitrogen_oxides

Independent variables:
date_time, deg_C, relative_humidity, absolute_humidity, sensor_1, sensor_2, sensor_3, sensor_4, sensor_5

#### Are these the only independent variables?

How can we use 'date_time', which is a non-categorical column of dtype string?

This is where time analysis comes in.

#### What is the most suitable regressor for the data?

We will choose the regressor (model) and its hyperparameters manually, without using AutoML.

#### Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import skew
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_log_error, mean_squared_error
from datetime import datetime

%matplotlib inline

1. Convert the date_time strings to datetime objects
2. Use a custom parsing function

In [None]:
# Note: `date_parser` is deprecated since pandas 2.0; `date_format="%Y-%m-%d %H:%M:%S"` replaces it there
custom_date_parser = lambda x: datetime.strptime(x, "%Y-%m-%d %H:%M:%S")
dataset = pd.read_csv("../input/tabular-playground-series-jul-2021/train.csv",
                      parse_dates=['date_time'], date_parser=custom_date_parser)
dataset_test = pd.read_csv("../input/tabular-playground-series-jul-2021/test.csv",
                           parse_dates=['date_time'], date_parser=custom_date_parser)
dataset_test.head()

###  Time analysis 


1. Extract the month and hour
2. Encode them with sin and cos functions
3. Add the new independent features to the train and test dataframes

In [None]:
def extract_dt_feats(df):
    # Extract month, hour and calendar date
    date_enc = pd.to_datetime(df.date_time)
    month = date_enc.dt.month
    hour = date_enc.dt.hour
    date = date_enc.dt.date
    date = pd.DataFrame(date)
    # Add cyclic features, then compute and add is_weekend
    sin_cos_encoding(df, month, 'month', 12)
    # Hours run from 0 to 23, so the period is 24 (with 23, hour 23 would collide with hour 0)
    sin_cos_encoding(df, hour, 'hour', 24)
    df['is_weekend'] = date_enc.dt.day_name().isin(['Saturday', 'Sunday'])*1
    return df,date
def sin_cos_encoding(df, dt, feat_name, max_val):
    # Encode a cyclic variable as a point on the unit circle; modifies df in place
    df['sin_' + feat_name] = np.sin(2 * np.pi * (dt/max_val))
    df['cos_' + feat_name] = np.cos(2 * np.pi * (dt/max_val))
    return None

In [None]:
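dataset,date = extract_dt_feats(dataset.copy())
date.rename(columns = {'date_time':'Only_Dates'}, inplace = True)
result = pd.concat([date, dataset], axis=1)
# Assign the converted column back; the conversion is not applied in place
result['Only_Dates'] = pd.to_datetime(result['Only_Dates'])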

In [None]:
import plotly.express as px
fig = px.line(result, x='Only_Dates', y="target_carbon_monoxide")
fig.show()


In [None]:
import plotly.express as px
fig = px.line(result, x='Only_Dates', y="target_benzene")
fig.show()

In [None]:
import plotly.express as px
fig = px.line(result, x='Only_Dates', y="target_nitrogen_oxides")
fig.show()

#### Encoding cyclical continuous features - 24-hour time

Consider how time progresses over a day: from 00:00 to 24:00 at a constant rate.

Treated as a plain number, time is linear.

The problem is that the real gap between 23:50 and 00:10 is 20 minutes,

but on a linear scale the two values are 23 hours 40 minutes apart.

So we encode the feature with sine and cosine, which are cyclic and bounded between -1 and 1.

In the plots below, all rows with the same hour (or month) map to the same point, so those points overlap.
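As a rough illustration of the 23:50 versus 00:10 example, the sketch below compares the linear gap with the distance between the encoded points on the unit circle. The helper `encode_cyclic` and the variable names are hypothetical and not part of this notebook's pipeline; times are expressed in hours with an assumed period of 24.

```python
import numpy as np

def encode_cyclic(value, period):
    """Map a cyclic value onto the unit circle as (sin, cos)."""
    angle = 2 * np.pi * value / period
    return np.array([np.sin(angle), np.cos(angle)])

# Times expressed in hours; a full day has a period of 24.
late = 23 + 50 / 60    # 23:50
early = 0 + 10 / 60    # 00:10
noon = 12.0            # 12:00, for comparison

# On a linear scale the two times look almost a full day apart.
linear_gap = abs(late - early)

# On the circle, 23:50 and 00:10 are close neighbours, while 12:00 is far away.
circle_gap = np.linalg.norm(encode_cyclic(late, 24) - encode_cyclic(early, 24))
noon_gap = np.linalg.norm(encode_cyclic(late, 24) - encode_cyclic(noon, 24))

print(f"linear gap: {linear_gap:.2f} hours")
print(f"circle distance, 23:50 vs 00:10: {circle_gap:.3f}")
print(f"circle distance, 23:50 vs 12:00: {noon_gap:.3f}")
```

The two late-night times end up close together after encoding, while noon sits almost on the opposite side of the circle, which is the behaviour a model needs from an hour feature.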


In [None]:
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
# Top row: month encoding
dataset.sin_month.plot(ax = axes[0,0])
dataset.cos_month.plot(ax = axes[0,1])
axes[0,2].scatter(x=dataset['sin_month'],y=dataset['cos_month'])
axes[0,2].set_title("Two-feature transformation in 2D as a 12-month cycle")
# Bottom row: hour encoding
dataset.sin_hour.plot(ax = axes[1,0])
dataset.cos_hour.plot(ax = axes[1,1])
axes[1,2].scatter(x=dataset['sin_hour'],y=dataset['cos_hour'])
axes[1,2].set_title("Two-feature transformation in 2D as a 24-hour clock")
axes[0, 0].set_title("sin_month")
axes[0, 1].set_title("cos_month")
axes[1, 0].set_title("sin_hour")
axes[1, 1].set_title("cos_hour")

In [None]:
TARGET_VARS = ['target_carbon_monoxide',
               'target_benzene',
               'target_nitrogen_oxides']
sns.pairplot(dataset, hue='is_weekend', vars=TARGET_VARS, corner=True,
            plot_kws={'alpha':.1})

### Feature Engineering ++ 

We need to treat each target feature individually (a separate dataset is extracted per target for convenience in visualization).

1. Checking for missing values
2. Correlation between the features and the target variables
3. Feature selection
4. Analysing the target variables
5. Log transformations for skewed data
6. Label encoding for categorical data (if needed)

In [None]:
# Shared columns, kept in the same order for every per-target dataset
FEATURE_COLS = ['date_time', 'deg_C', 'absolute_humidity', 'relative_humidity',
                'sensor_1', 'sensor_2', 'sensor_3', 'sensor_4', 'sensor_5']
TIME_COLS = ['sin_hour', 'cos_hour', 'sin_month', 'cos_month']
dataset_co = dataset[FEATURE_COLS + ['target_carbon_monoxide'] + TIME_COLS]
dataset_ben = dataset[FEATURE_COLS + ['target_benzene'] + TIME_COLS]
dataset_ni = dataset[FEATURE_COLS + ['target_nitrogen_oxides'] + TIME_COLS]

In [None]:
dataset.isnull().any()

In [None]:
# Skip the first column (date_time) so only numeric columns enter the correlation
corr_co = dataset_co[dataset_co.columns[1:]].corr()
sns.heatmap(corr_co, cmap='RdYlGn_r', vmax=1.0, vmin=-1 ,annot = True)

corr_co['target_carbon_monoxide']

In [None]:
# Skip the first column (date_time) so only numeric columns enter the correlation
corr_ben = dataset_ben[dataset_ben.columns[1:]].corr()
sns.heatmap(corr_ben, annot = True)
corr_ben['target_benzene']

In [None]:
# Skip the first column (date_time) so only numeric columns enter the correlation
corr_ni = dataset_ni[dataset_ni.columns[1:]].corr()
sns.heatmap(corr_ni, annot = True)
corr_ni['target_nitrogen_oxides']

The sensor readings are much more strongly correlated with the targets than deg_C, absolute_humidity, and relative_humidity.

Although the month and hour features do not stand out much here, they increased the model score by up to 8 percent once the hyperparameters were tuned well.

In [None]:
fig, axes = plt.subplots(2, 4, figsize=(18, 10))

fig.suptitle('Before Log-Transformation')

sns.histplot(dataset['deg_C'],ax=axes[0, 0],kde=True)
sns.histplot(dataset['relative_humidity'],ax=axes[0, 1],kde=True)
sns.histplot(dataset['absolute_humidity'],ax=axes[0, 2],kde=True)
sns.histplot(dataset['sensor_1'],ax=axes[0, 3],kde=True)
sns.histplot(dataset['sensor_2'],ax=axes[1, 0],kde=True)
sns.histplot(dataset['sensor_3'],ax=axes[1, 1],kde=True)
sns.histplot(dataset['sensor_4'],ax=axes[1, 2],kde=True)
sns.histplot(dataset['sensor_5'],ax=axes[1, 3],kde=True)


In [None]:
# First, log-transform the independent features most correlated with the targets
correlated_features = ['sensor_1','sensor_2','sensor_3','sensor_4','sensor_5']
for feature in correlated_features:
    dataset[feature] =  np.log(dataset[feature])
fig, axes = plt.subplots(2, 4, figsize=(18, 10))

fig.suptitle('After Log-Transformation')

sns.histplot(dataset['deg_C'],ax=axes[0, 0],kde=True)
sns.histplot(dataset['relative_humidity'],ax=axes[0, 1],kde=True)
sns.histplot(dataset['absolute_humidity'],ax=axes[0, 2],kde=True)
sns.histplot(dataset['sensor_1'],ax=axes[0, 3],kde=True)
sns.histplot(dataset['sensor_2'],ax=axes[1, 0],kde=True)
sns.histplot(dataset['sensor_3'],ax=axes[1, 1],kde=True)
sns.histplot(dataset['sensor_4'],ax=axes[1, 2],kde=True)
sns.histplot(dataset['sensor_5'],ax=axes[1, 3],kde=True)

Skewness of the Target Variables
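To make the idea concrete before looking at the real targets, here is a minimal sketch on synthetic data showing how `scipy.stats.skew` responds to a `log1p` transform. The lognormal sample and the names `sample`, `skew_raw`, and `skew_log` are assumptions for illustration only; they are not the competition data.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# Synthetic right-skewed, non-negative sample standing in for a target column.
sample = rng.lognormal(mean=0.5, sigma=0.8, size=5000)

# Skewness is about 0 for a symmetric distribution and positive for a long right tail.
skew_raw = skew(sample)
skew_log = skew(np.log1p(sample))

print(f"skewness before log1p: {skew_raw:.3f}")
print(f"skewness after log1p:  {skew_log:.3f}")
```

A value much closer to zero after the transform suggests the transformed target is nearer to symmetric, which is the motivation for applying `log1p` to the targets in the cells that follow.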

In [None]:
sns.displot(dataset_co['target_carbon_monoxide'],kde=True)
x0 = pd.DataFrame(dataset['target_carbon_monoxide']).to_numpy()
print(skew(x0))

In [None]:
sns.displot(dataset_ben['target_benzene'],kde=True)
x1 = pd.DataFrame(dataset['target_benzene']).to_numpy()
print(skew(x1))

In [None]:
sns.displot(dataset_ni['target_nitrogen_oxides'],kde=True)
x2 = pd.DataFrame(dataset['target_nitrogen_oxides']).to_numpy()
print(skew(x2))

In [None]:
y0 = np.log1p(x0)
sns.displot(y0,kde = True)

In [None]:
y1 = np.log1p(x1)
sns.displot(y1,kde = True)

In [None]:
y2 = np.log1p(x2)
sns.displot(y2,kde = True)

In [None]:
dataset[TARGET_VARS] = np.log(dataset[TARGET_VARS] + 1)
sns.pairplot(dataset, hue='is_weekend', vars=TARGET_VARS, corner=True,
            plot_kws={'alpha':.1})

In [None]:
del dataset['date_time'] 
del dataset['target_carbon_monoxide']
del dataset['target_benzene']
del dataset['target_nitrogen_oxides']
X = dataset
dataset.head()

Splitting the dataset to check the performance of the model

In [None]:
X0_train, X0_test, y0_train, y0_test = train_test_split(X, y0, test_size=.2)

Trying out different regression models: Linear Regression, Random Forest Regressor, Gradient Boosting Regressor, and XGBoost Regressor

In [None]:
from sklearn import linear_model
model = linear_model.LinearRegression()
model.fit(X0_train, y0_train)
# score() returns the coefficient of determination (R^2), not a classification accuracy
print("R^2 score (%) --> ", model.score(X0_test, y0_test)*100)

In [None]:
#Train the model
from sklearn.ensemble import RandomForestRegressor
model0 = RandomForestRegressor(n_estimators= 800,min_samples_split= 2,min_samples_leaf= 1,max_features='sqrt',max_depth = 20,bootstrap =  False)
#Fit
model0.fit(X0_train, np.ravel(y0_train,order='C'))
#R^2 score
print("R^2 score (%) --> ", model0.score(X0_test, y0_test)*100)

In [None]:
#GradientBoostingRegressor
#Train the model
from sklearn.ensemble import GradientBoostingRegressor
GBR = GradientBoostingRegressor(n_estimators=100, max_depth=4)
#Fit
GBR.fit(X0_train, np.ravel(y0_train))
print("R^2 score (%) --> ", GBR.score(X0_test, y0_test)*100)

In [None]:
#xgboost regressor
import xgboost
classifier = xgboost.XGBRegressor()
classifier.fit(X0_train, np.ravel(y0_train))
print("R^2 score (%) --> ", classifier.score(X0_test, y0_test)*100)

So I am settling on the Random Forest Regressor for Model 0.

In [None]:
X1_train, X1_test, y1_train, y1_test = train_test_split(X, y1, test_size=.3)

In [None]:
#Train the model
from sklearn import linear_model
model = linear_model.LinearRegression()
#Fit the model
model.fit(X1_train, y1_train)
#R^2 score
print("R^2 score (%) --> ", model.score(X1_test, y1_test)*100)

In [None]:
#Train the model
from sklearn.ensemble import RandomForestRegressor
model1 = RandomForestRegressor(n_estimators= 1200,min_samples_split = 2,min_samples_leaf =1,max_features= 'sqrt',max_depth=20,bootstrap= False)
#Fit
model1.fit(X1_train, np.ravel(y1_train,order='C'))
#R^2 score
print("R^2 score (%) --> ", model1.score(X1_test, y1_test)*100)

In [None]:
#GradientBoostingRegressor
#Train the model
from sklearn.ensemble import GradientBoostingRegressor
model = GradientBoostingRegressor(n_estimators=100, max_depth=4)
#Fit
model.fit(X1_train, np.ravel(y1_train))
print("R^2 score (%) --> ", model.score(X1_test, y1_test)*100)

In [None]:
#xgboost regressor
import xgboost
classifier = xgboost.XGBRegressor()
classifier.fit(X1_train, np.ravel(y1_train))
print("R^2 score (%) --> ", classifier.score(X1_test, y1_test)*100)

So I am settling on the Random Forest Regressor for Model 1.

In [None]:
X2_train, X2_test, y2_train, y2_test = train_test_split(X, y2, test_size=.3)

In [None]:
#Train the model
from sklearn import linear_model
model = linear_model.LinearRegression()
#Fit the model
model.fit(X2_train, y2_train)
#R^2 score
print("R^2 score (%) --> ", model.score(X2_test, y2_test)*100)

In [None]:
#Train the model
from sklearn.ensemble import RandomForestRegressor
from pprint import pprint
model2 = RandomForestRegressor(n_estimators= 1200,min_samples_split = 2,min_samples_leaf =1,max_features= 'sqrt',max_depth=20,bootstrap= False)
#Fit
model2.fit(X2_train, np.ravel(y2_train,order='C'))
#R^2 score
print("R^2 score (%) --> ", model2.score(X2_test, y2_test)*100)

In [None]:
#GradientBoostingRegressor
#Train the model
from sklearn.ensemble import GradientBoostingRegressor
GBR = GradientBoostingRegressor(n_estimators=100, max_depth=4)
#Fit
GBR.fit(X2_train, np.ravel(y2_train))
print("R^2 score (%) --> ", GBR.score(X2_test, y2_test)*100)

In [None]:
#xgboost regressor
import xgboost
classifier = xgboost.XGBRegressor()
classifier.fit(X2_train, np.ravel(y2_train))
print("R^2 score (%) --> ", classifier.score(X2_test, y2_test)*100)

So I am settling on the Random Forest Regressor for Model 2 as well.


Increasing the Performance of the Models

We will perform hyperparameter tuning
(RandomizedSearchCV)

In [None]:
from pprint import pprint
from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
# (1.0 means all features; it replaces 'auto', which newer scikit-learn versions no longer accept)
max_features = [1.0, 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
pprint(random_grid)

In [None]:
from sklearn.model_selection import RandomizedSearchCV

rf = RandomForestRegressor()
# Random search of parameters, using 3-fold cross validation,
# sampling 40 different combinations (n_iter), and using all available cores
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 40, cv = 3, verbose=2, random_state=42, n_jobs = -1)


In [None]:
#To get the hyperparameters, uncomment the lines below.
#fit() returns the same rf_random object each time, so print best_params_
#right after each fit, before the next fit overwrites it.
#rf_random.fit(X0_train, np.ravel(y0_train,order='C'))
#print(rf_random.best_params_)
#rf_random.fit(X1_train, np.ravel(y1_train,order='C'))
#print(rf_random.best_params_)
#rf_random.fit(X2_train, np.ravel(y2_train,order='C'))
#print(rf_random.best_params_)

Let us bring in permutation importance.
How does it work?
It measures how much the model's predictions depend on each independent variable: the values of one variable are shuffled, and the resulting drop in the model's score is recorded.

The features at the top of the table are the ones the model depends on most.

You can check my discussion on how permutation importance works.
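The shuffle-and-rescore idea can be sketched on synthetic data. Note that this sketch uses scikit-learn's own `permutation_importance` from `sklearn.inspection` rather than the `eli5` wrapper used in the cells below, and the data and names (`X_demo`, `y_demo`, `forest`, `result_demo`) are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)

# Three synthetic features: the target depends strongly on the first,
# weakly on the second, and not at all on the third.
X_demo = rng.normal(size=(500, 3))
y_demo = 3.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=500)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Each feature is shuffled n_repeats times; the mean drop in R^2 is its importance.
result_demo = permutation_importance(forest, X_demo, y_demo,
                                     n_repeats=5, random_state=0)

for name, score in zip(["strong", "weak", "noise"], result_demo.importances_mean):
    print(f"{name}: {score:.3f}")
```

The feature that drives the target should show by far the largest score drop when shuffled, which is how the tables produced below are meant to be read.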


In [None]:
import eli5
from eli5.sklearn import PermutationImportance

perm = PermutationImportance(model0, random_state=1).fit(X0_train, y0_train)
eli5.show_weights(perm, feature_names = X0_train.columns.tolist())

In [None]:
import eli5
from eli5.sklearn import PermutationImportance

perm = PermutationImportance(model1, random_state=1).fit(X1_train, y1_train)
eli5.show_weights(perm, feature_names = X1_train.columns.tolist())

In [None]:
import eli5
from eli5.sklearn import PermutationImportance

perm = PermutationImportance(model2, random_state=1).fit(X2_train, y2_train)
eli5.show_weights(perm, feature_names = X2_train.columns.tolist())

In [None]:
## Function that reports the improvement from hyperparameter tuning
def evaluate(model, test_features, test_labels):
    # Flatten the labels so the subtraction does not broadcast (n,) against (n, 1)
    test_labels = np.ravel(test_labels)
    predictions = model.predict(test_features)
    errors = abs(predictions - test_labels)
    # Mean absolute percentage error; assumes no label is exactly zero
    mape = 100 * np.mean(errors / test_labels)
    accuracy = 100 - mape
    print(model.score(test_features, test_labels))
    print('Model Performance')
    print('Average Error: {:0.4f}.'.format(np.mean(errors)))
    print('Accuracy = {:0.2f}%.'.format(accuracy))

    return accuracy

In [None]:
#Bringing the test dataset into the frame
date_time = dataset_test['date_time']
dataset_test,dates = extract_dt_feats(dataset_test.copy())

In [None]:
correlated_features = ['sensor_1','sensor_2','sensor_3','sensor_4','sensor_5']
for features in correlated_features:
    dataset_test[features] =  np.log(dataset_test[features])

The models were trained on log1p-transformed targets, so their predictions are log-transformed values as well.

The inverse of the log transformation (expm1) has to be applied to bring the predictions back to the original scale.
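A quick round-trip check shows why `np.expm1` is the matching inverse of `np.log1p`. The array `original` below holds made-up values, not competition data.

```python
import numpy as np

# Made-up non-negative values standing in for a target column.
original = np.array([0.0, 0.5, 2.1, 10.0])

transformed = np.log1p(original)    # log(1 + x), defined at x = 0
restored = np.expm1(transformed)    # exp(y) - 1, the exact inverse of log1p

# Using plain exp instead would leave every value off by one.
wrong_inverse = np.exp(transformed)

print(np.allclose(original, restored))
print(np.allclose(original + 1, wrong_inverse))
```

Both checks pass, so applying `np.expm1` to the model outputs returns them to the scale of the original targets.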

In [None]:
del dataset_test['date_time']
Xt = dataset_test
dataframe_0 =pd.DataFrame(np.expm1(model0.predict(Xt)), columns=['target_carbon_monoxide']) 
dataframe_1=pd.DataFrame(np.expm1(model1.predict(Xt)), columns=['target_benzene'])
dataframe_2=pd.DataFrame(np.expm1(model2.predict(Xt)), columns=['target_nitrogen_oxides']) 

In [None]:
result = pd.concat([date_time,dataframe_0, dataframe_1,dataframe_2], axis=1)
result.to_csv('submission1.csv',index = False)