# Tabular Playground Series (TPS) - Jul 2021

Kaggle competitions are incredibly fun and rewarding, but they can also be intimidating for people who are relatively new in their data science journey. In the past, we've launched many Playground competitions that are more approachable than our Featured competitions and thus, more beginner-friendly.

In order to have a more consistent offering of these competitions for our community, we're trying a new experiment in 2021. We'll be launching month-long tabular Playground competitions on the 1st of every month and continue the experiment as long as there's sufficient interest and participation.

The goal of these competitions is to provide a fun tabular dataset that is approachable for anyone. These competitions will be great for people looking for something in between the Titanic Getting Started competition and a Featured competition. If you're an established competitions master or grandmaster, these probably won't be much of a challenge for you. We encourage you to avoid saturating the leaderboard.

For each monthly competition, we'll be offering Kaggle Merchandise for the top three teams. And finally, because we want these competitions to be more about learning, we're limiting team sizes to 3 individuals.

The dataset used for this competition is based on a real dataset, but has synthetically generated aspects to it. The original dataset deals with predicting air pollution in a city via various input sensor values (e.g., a time series).

Good luck and have fun!

For ideas on how to improve your score, check out the Intro to Machine Learning and Intermediate Machine Learning courses on Kaggle Learn.

1. Problem definition
2. Data
3. Evaluation
4. Features
5. Modelling
6. Putting it all together

# 1. Problem Definition

In this competition you are predicting the values of air pollution measurements over time, based on basic weather information (temperature and humidity) and the input values of 5 sensors.

The three target values for you to predict are: target_carbon_monoxide, target_benzene, and target_nitrogen_oxides.

# 2. Data

## Files

* train.csv - the training data, including the weather data, sensor data, and values for the 3 targets
* test.csv - the same format as train.csv, but without the target value; your task is to predict the value for each of these targets.
* sample_submission.csv - a sample submission file in the correct format.


# 3. Evaluation

Submissions are evaluated using the mean column-wise root mean squared logarithmic error.

The final score is the mean of the RMSLE over all columns, in this case, 3.
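
As a minimal sketch of the metric described above, the following dependency-free Python computes the RMSLE for each target column and then averages across columns. It assumes the usual definition, RMSLE = sqrt(mean((log(1 + predicted) - log(1 + actual))^2)). The function names `rmsle` and `mean_columnwise_rmsle` are illustrative and not part of the competition code.

```python
import math

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for one target column."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_log_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2 for a, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))

def mean_columnwise_rmsle(true_columns, pred_columns):
    """Average the per-column RMSLE over all target columns."""
    scores = [rmsle(t, p) for t, p in zip(true_columns, pred_columns)]
    return sum(scores) / len(scores)

# Toy values only, not taken from the competition data.
# Each inner list is one target column.
actual = [[2.0, 1.5], [10.0, 8.0], [300.0, 250.0]]
predicted = [[2.0, 1.5], [10.0, 8.0], [300.0, 250.0]]
print(mean_columnwise_rmsle(actual, predicted))  # perfect predictions
```

Because the error is taken on a log scale, the three targets contribute comparably to the final score even though their raw magnitudes differ.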

## Submission File

For each date_time in the test set, you must predict a value for each of the three target variables. The file should contain a header and have the following format:

    date_time,target_carbon_monoxide,target_benzene,target_nitrogen_oxides
    2011-01-01 01:00:00,2.0,10.0,300.0
    2011-01-01 02:00:00,2.0,10.0,300.0
    2011-01-01 03:00:00,2.0,10.0,300.0
    etc.
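
The format above can be produced with a few lines of standard-library Python. This is only a sketch: the `write_submission` helper and the `HEADER` constant are hypothetical names, and the rows are placeholder values copied from the sample, not real predictions.

```python
import csv
import io

HEADER = ["date_time", "target_carbon_monoxide",
          "target_benzene", "target_nitrogen_oxides"]

def write_submission(rows, handle):
    """Write (date_time, co, benzene, nox) tuples as CSV with the required header."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(HEADER)
    for row in rows:
        if len(row) != len(HEADER):
            raise ValueError(f"expected {len(HEADER)} fields, got {len(row)}")
        writer.writerow(row)

# Placeholder predictions, for illustration only.
example_rows = [
    ("2011-01-01 01:00:00", 2.0, 10.0, 300.0),
    ("2011-01-01 02:00:00", 2.0, 10.0, 300.0),
]
buffer = io.StringIO()  # stands in for an open file such as submission.csv
write_submission(example_rows, buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

In a notebook that already holds the predictions in a pandas DataFrame, `DataFrame.to_csv(..., index=False)` would do the same job.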


# 4. Features

## Input / Features

1. date_time
2. deg_C
3. relative_humidity
4. absolute_humidity
5. sensor_1
6. sensor_2
7. sensor_3
8. sensor_4
9. sensor_5

## Output / Label
10. target_carbon_monoxide
11. target_benzene
12. target_nitrogen_oxides

## Standard Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Local
# df = pd.read_csv('Data/train.csv')
# Kaggle
# date_time is read as a string here and converted with pd.to_datetime further down
df = pd.read_csv('/kaggle/input/tabular-playground-series-jul-2021/train.csv')

In [None]:
df.head()

## Data Exploration

In [None]:
df

In [None]:
df.isnull().sum()

In [None]:
df.info()

In [None]:
df['date_time'] = pd.to_datetime(df['date_time'])

In [None]:
df.info()

In [None]:
df

In [None]:
plt.figure(figsize=(20,10))
plt.title('Plot of carbon monoxide over time')
sns.lineplot(data=df,x='date_time', y = 'target_carbon_monoxide');

In [None]:
plt.figure(figsize=(20,10))
plt.title('Plot of Benzene over time')
sns.lineplot(data=df,x='date_time', y = 'target_benzene');

In [None]:
plt.figure(figsize=(20,10))
plt.title('Plot of Nitrogen Oxides over time')
sns.lineplot(data=df,x='date_time', y = 'target_nitrogen_oxides');

In [None]:
plt.figure(figsize=(20,20))
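# Correlations are only defined for the numeric columns, so drop date_time first
sns.heatmap(data=df.drop(columns='date_time').corr(), annot=True)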

In [None]:
sns.pairplot(data=df)

# 5. Modelling

In [None]:
df.columns

In [None]:
X = df.drop(['target_carbon_monoxide','target_benzene', 'target_nitrogen_oxides','date_time'],axis=1)
y_target_carbon_monoxide = df['target_carbon_monoxide']
y_target_benzene = df['target_benzene']
y_target_nitrogen_oxides = df['target_nitrogen_oxides']

In [None]:
from sklearn.model_selection import train_test_split


In [None]:
from sklearn.preprocessing import StandardScaler


## Import Models

In [None]:
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR, LinearSVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import AdaBoostRegressor
from xgboost import XGBRegressor, XGBRFRegressor


## Baseline Models and Scores

In [None]:
def fit_and_score(models, X_train, X_test, y_train, y_test):
    """Fit each model and return its R² on the held-out split, sorted ascending."""
    np.random.seed(42)
    
    model_scores = {}
    
    for name, model in models.items():
        model.fit(X_train,y_train)
        model_scores[name] = model.score(X_test,y_test)

    model_scores = pd.DataFrame(model_scores, index=['Score']).transpose()
    model_scores = model_scores.sort_values('Score')
        
    return model_scores
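
The `score` method of scikit-learn regressors returns the coefficient of determination R², not an accuracy, so the values collected by `fit_and_score` are R² values where 1.0 is a perfect fit. As a rough, dependency-free sketch of what that number means (the helper name `r2_score_manual` is made up for illustration, and it simply raises on a constant target instead of handling that edge case):

```python
def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, the quantity regressors report from .score()."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined here when y_true is constant")
    return 1.0 - ss_res / ss_tot

# Toy numbers, not competition data.
y_true = [1.0, 2.0, 3.0, 4.0]
print(r2_score_manual(y_true, y_true))                # perfect fit
print(r2_score_manual(y_true, [2.5, 2.5, 2.5, 2.5]))  # always predicting the mean
```

Since the competition is scored with RMSLE, ranking models by R² is only a rough guide to which ones are worth tuning.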

In [None]:
models = {'Ridge' : Ridge(),
         'Lasso': Lasso(),
         'ElasticNet': ElasticNet(),
         'KNeighborsRegressor': KNeighborsRegressor(),
         'SVR': SVR(),
         'DecisionTreeRegressor': DecisionTreeRegressor(),
         'RandomForestRegressor':RandomForestRegressor(),
         'GradientBoostingRegressor': GradientBoostingRegressor(),
         'AdaBoostRegressor': AdaBoostRegressor(),
        'XGBRegressor': XGBRegressor(objective='reg:squarederror'),
        'XGBRFRegressor': XGBRFRegressor(objective='reg:squarederror')
         }

### target_carbon_monoxide

In [None]:
X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(X, y_target_carbon_monoxide, test_size=0.3, random_state=42)

In [None]:
baseline_model_scores_carbon_monoxide = fit_and_score(models, X_train_1, X_test_1, y_train_1, y_test_1)

In [None]:
baseline_model_scores_carbon_monoxide

In [None]:
plt.figure(figsize=(20,10))
sns.barplot(data=baseline_model_scores_carbon_monoxide.T)
plt.title('Baseline Model R² Score for target_carbon_monoxide')
plt.xticks(rotation=90);

### target_benzene

In [None]:
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X, y_target_benzene, test_size=0.3, random_state=42)

In [None]:
baseline_model_scores_target_benzene = fit_and_score(models, X_train_2, X_test_2, y_train_2, y_test_2)

In [None]:
baseline_model_scores_target_benzene

In [None]:
plt.figure(figsize=(20,10))
sns.barplot(data=baseline_model_scores_target_benzene.T)
plt.title('Baseline Model R² Score for target_benzene')
plt.xticks(rotation=90);

### target_nitrogen_oxides

In [None]:
X_train_3, X_test_3, y_train_3, y_test_3 = train_test_split(X, y_target_nitrogen_oxides, test_size=0.3, random_state=42)

In [None]:
baseline_model_scores_target_nitrogen_oxides = fit_and_score(models, X_train_3, X_test_3, y_train_3, y_test_3)

In [None]:
baseline_model_scores_target_nitrogen_oxides

### Summary of Baseline modeling 

In [None]:
baseline_model_scores_carbon_monoxide

In [None]:
baseline_model_scores_target_benzene

In [None]:
baseline_model_scores_target_nitrogen_oxides

Since target_benzene had the best baseline scores, we will work on it first by tuning the hyperparameters of the two strongest models:
1. RandomForestRegressor 	0.975238
2. GradientBoostingRegressor 	0.976005

## target_benzene Hyperparam tuning

In [None]:
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X, y_target_benzene, test_size=0.3, random_state=42)

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from warnings import filterwarnings

In [None]:
filterwarnings('ignore')

In [None]:
def randomesearch_cv_scores(models, params, X_train, X_test, y_train, y_test):
    np.random.seed(42)
    
    model_rs_scores = {}
    model_rs_best_param = {}
    
    for name, model in models.items():
        rs_model = RandomizedSearchCV(model,
                                    param_distributions=params[name],
                                    scoring='neg_mean_squared_log_error',
                                      n_iter=20,
                                    n_jobs=-2,
                                    cv=5,
                                    verbose=2)
        
        rs_model.fit(X_train,y_train)

        model_rs_scores[name] = rs_model.score(X_test,y_test)
        model_rs_best_param[name] = rs_model.best_params_

    model_rs_scores = pd.DataFrame(model_rs_scores, index=['neg_mean_squared_log_error'])
    model_rs_scores = model_rs_scores.transpose().sort_values('neg_mean_squared_log_error')
        
    return model_rs_scores, model_rs_best_param
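
`RandomizedSearchCV` maximises its scoring function, which is why scikit-learn exposes the error as `neg_mean_squared_log_error`: the values are negative, and closer to zero is better. To compare a value returned by `rs_model.score` with the competition metric, negate it and take the square root. A small sketch (the function name is illustrative and the input is a made-up number, not a result from this notebook):

```python
import math

def neg_msle_to_rmsle(neg_msle):
    """Convert a scikit-learn 'neg_mean_squared_log_error' score to RMSLE."""
    if neg_msle > 0:
        raise ValueError("a negated MSLE cannot be positive")
    return math.sqrt(-neg_msle)

# Made-up score for illustration only.
example_rmsle = neg_msle_to_rmsle(-0.01)
print(example_rmsle)
```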

### Baseline RandomForestRegressor and GradientBoostingRegressor target_benzene

In [None]:
models = {'RandomForestRegressor': RandomForestRegressor(),
         'GradientBoostingRegressor': GradientBoostingRegressor()}
params = {'RandomForestRegressor': {},
          'GradientBoostingRegressor': {}
         }

In [None]:
model_baseline_scores_target_benzene, model_baseline_best_param_target_benzene = randomesearch_cv_scores(models,
                                                                params,
                                                               X_train_2,
                                                               X_test_2, 
                                                               y_train_2, 
                                                               y_test_2)

In [None]:
model_baseline_scores_target_benzene

In [None]:
model_baseline_best_param_target_benzene

### RS target_benzene model 1

In [None]:
models = {'RandomForestRegressor': RandomForestRegressor(),
         'GradientBoostingRegressor': GradientBoostingRegressor()}
params = {'RandomForestRegressor': {'n_estimators': [50,100,200,300,500],
                                'max_depth': [None, 2,4,10,20],
                                'min_samples_split':[2,4,10,20],
                                'min_samples_leaf': [1,2,5,10,20],
                                'max_features': ['auto','sqrt', 'log2'],
                                'max_leaf_nodes': [None, 2,4,10,20],
                                'bootstrap': [True, False],
                                'oob_score': [True, False],
                                'ccp_alpha': [0.0,0.01,0.001]
                                },
          'GradientBoostingRegressor': {'loss': ['ls','lad','huber','quantile'],
                                       'learning_rate': [0.01,0.1,0.2,0.5,1],
                                       'n_estimators': [50,100,200,300,500],
                                       'criterion': ['friedman_mse','mse'],
                                       'min_samples_split':[2,0.2,0.5],
                                        'min_samples_leaf': [1,0.2,0.5],
                                        'max_depth': [None, 2,4,10,20],
                                        'max_features': ['auto','sqrt', 'log2'],
                                        'alpha': [0.1,0.2,0.5,0.9,1],
                                        'max_leaf_nodes': [None, 2,4,10,20],
                                        'ccp_alpha': [0.0,0.01,0.001]
                                       }
         }

In [None]:
model_rs_scores_target_benzene_1, model_rs_best_param_target_benzene_1 = randomesearch_cv_scores(models,
                                                                params,
                                                               X_train_2,
                                                               X_test_2, 
                                                               y_train_2, 
                                                               y_test_2)

In [None]:
model_rs_scores_target_benzene_1

In [None]:
model_rs_best_param_target_benzene_1

### RS target_benzene model 2

In [None]:
models = {'RandomForestRegressor': RandomForestRegressor()}
params = {'RandomForestRegressor': {'n_estimators': [45,50,55],
                                'max_depth': [9,10,11,12,15],
                                'min_samples_split':[3,4,5,6],
                                'min_samples_leaf': [3,4,5,6,7],
                                'max_features': ['log2'],
                                'max_leaf_nodes': [None],
                                'bootstrap': [True],
                                'oob_score': [True],
                                'ccp_alpha': [0.001,0.005,0.009]
                                },
         }

In [None]:
model_rs_scores_target_benzene_2, model_rs_best_param_target_benzene_2 = randomesearch_cv_scores(models,
                                                                params,
                                                               X_train_2,
                                                               X_test_2, 
                                                               y_train_2, 
                                                               y_test_2)

In [None]:
model_rs_scores_target_benzene_2

In [None]:
model_rs_best_param_target_benzene_2

### RS Model 3

In [None]:
params = {'RandomForestRegressor': {'n_estimators': [42,43,44,45,46,47,48],
                                'max_depth': [11],
                                'min_samples_split':[3],
                                'min_samples_leaf': [3],
                                'max_features': ['log2'],
                                'max_leaf_nodes': [None],
                                'bootstrap': [True],
                                'oob_score': [True],
                                'ccp_alpha': [0.0009,0.001,0.002,0.003]
                                },
         }

In [None]:
model_rs_scores_target_benzene_3, model_rs_best_param_target_benzene_3 = randomesearch_cv_scores(models,
                                                                params,
                                                               X_train_2,
                                                               X_test_2, 
                                                               y_train_2, 
                                                               y_test_2)

In [None]:
model_rs_scores_target_benzene_3

In [None]:
model_rs_best_param_target_benzene_3

### RS Model 4

In [None]:
params = {'RandomForestRegressor': {'n_estimators': [47],
                                'max_depth': [11],
                                'min_samples_split':[3],
                                'min_samples_leaf': [3],
                                'max_features': ['log2'],
                                'max_leaf_nodes': [None],
                                'bootstrap': [True],
                                'oob_score': [True],
                                'ccp_alpha': [0,0.0001,0.0005,0.0009]
                                },
         }

In [None]:
model_rs_scores_target_benzene_4, model_rs_best_param_target_benzene_4 = randomesearch_cv_scores(models,
                                                                params,
                                                               X_train_2,
                                                               X_test_2, 
                                                               y_train_2, 
                                                               y_test_2)

In [None]:
model_rs_scores_target_benzene_4

In [None]:
model_rs_best_param_target_benzene_4

## Evaluation of target_benzene model

We will use the RS 4 model of the RandomForestRegressor, as it performs best.

In [None]:
from sklearn.metrics import mean_squared_log_error

In [None]:
target_benzene_model = RandomForestRegressor(oob_score=True,
                                             n_estimators=47,
                                             min_samples_split=3,
                                             min_samples_leaf=3,
                                             max_leaf_nodes=None,
                                             max_features='log2',
                                             max_depth=11,
                                             ccp_alpha=0,
                                             bootstrap=True )
target_benzene_model.fit(X_train_2, y_train_2)

In [None]:
y_pred_target_benzene = target_benzene_model.predict(X_test_2)

In [None]:
msle = mean_squared_log_error(y_test_2, y_pred_target_benzene)

In [None]:
msle

In [None]:
print(f'Root mean squared log error: {np.sqrt(msle)}')

## Baseline model with target_benzene as an input

In [None]:
X = df.drop(['target_carbon_monoxide','target_nitrogen_oxides','date_time'],axis=1)
y_target_carbon_monoxide = df['target_carbon_monoxide']
y_target_nitrogen_oxides = df['target_nitrogen_oxides']
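
Because test.csv does not contain the target columns, a model trained with `target_benzene` as a feature needs a predicted benzene value at inference time. One way to handle this, sketched below under that assumption, is to run the models in order and append each prediction to the feature row before calling the next model. The `predict_chain` helper and the stand-in lambda models are hypothetical; in the notebook the fitted scikit-learn or XGBoost models would take their place.

```python
def predict_chain(feature_row, chain):
    """Run models in order, feeding each prediction to the models after it.

    feature_row: dict of input features for one timestamp.
    chain: list of (target_name, predict_fn) pairs; predict_fn takes a dict.
    """
    row = dict(feature_row)  # copy so the caller's data is untouched
    predictions = {}
    for target_name, predict_fn in chain:
        value = predict_fn(row)
        predictions[target_name] = value
        row[target_name] = value  # later models see this as an extra feature
    return predictions

# Stand-in models with invented coefficients, for illustration only.
chain = [
    ("target_benzene", lambda r: 0.01 * r["sensor_2"]),
    ("target_carbon_monoxide", lambda r: 0.2 * r["target_benzene"]),
    ("target_nitrogen_oxides", lambda r: 30.0 * r["target_carbon_monoxide"]),
]
result = predict_chain({"sensor_2": 1000.0}, chain)
print(result)
```

Errors in an earlier prediction propagate down the chain, so held-out scores measured with the true `target_benzene` as an input are likely to be optimistic compared with what the chained models achieve on the test set.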

In [None]:
models = {'Ridge' : Ridge(),
         'Lasso': Lasso(),
         'ElasticNet': ElasticNet(),
         'KNeighborsRegressor': KNeighborsRegressor(),
         'SVR': SVR(),
         'DecisionTreeRegressor': DecisionTreeRegressor(),
         'RandomForestRegressor':RandomForestRegressor(),
         'GradientBoostingRegressor': GradientBoostingRegressor(),
         'AdaBoostRegressor': AdaBoostRegressor(),
        'XGBRegressor': XGBRegressor(objective='reg:squarederror'),
        'XGBRFRegressor': XGBRFRegressor(objective='reg:squarederror')
         }

### target_carbon_monoxide

In [None]:
X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(X, y_target_carbon_monoxide, test_size=0.3, random_state=42)

In [None]:
baseline_model_scores_carbon_monoxide = fit_and_score(models, X_train_1, X_test_1, y_train_1, y_test_1)

In [None]:
baseline_model_scores_carbon_monoxide

In [None]:
plt.figure(figsize=(20,10))
sns.barplot(data=baseline_model_scores_carbon_monoxide.T)
plt.title('Baseline Model R² Score for target_carbon_monoxide')
plt.xticks(rotation=90);

### target_nitrogen_oxides

In [None]:
X_train_3, X_test_3, y_train_3, y_test_3 = train_test_split(X, y_target_nitrogen_oxides, test_size=0.3, random_state=42)

In [None]:
baseline_model_scores_target_nitrogen_oxides = fit_and_score(models, X_train_3, X_test_3, y_train_3, y_test_3)

In [None]:
baseline_model_scores_target_nitrogen_oxides

In [None]:
plt.figure(figsize=(20,10))
sns.barplot(data=baseline_model_scores_target_nitrogen_oxides.T)
plt.title('Baseline Model R² Score for target_nitrogen_oxides')
plt.xticks(rotation=90);

The best baseline for target_carbon_monoxide is XGBRegressor, with a score of 0.901986.

### target_carbon_monoxide Hyperparam Tuning

#### Baseline Model 1

In [None]:
models = {'XGBRegressor': XGBRegressor()}
params = {'XGBRegressor': {},
         }

In [None]:
model_base_scores_target_carbon_monoxide, model_base_best_param_target_carbon_monoxide_ = randomesearch_cv_scores(models,
                                                                params,
                                                               X_train_1,
                                                               X_test_1, 
                                                               y_train_1, 
                                                               y_test_1)

In [None]:
model_base_scores_target_carbon_monoxide

In [None]:
model_base_best_param_target_carbon_monoxide_

#### RS Model 1

In [None]:
models = {'XGBRegressor': XGBRegressor()}
params = {'XGBRegressor': {'eta': [0.1,0.3,0.5,0.9,1],
                          'gamma': [0,1,5,10,100,500],
                          'max_depth': [2,6,10,20,50],
                          'min_child_weight': [0,1,5,10,20,50],
                          'max_delta_step': [0,1,5,10,20,50],
                          'reg_lambda': [0,1],
                          'alpha':[0,1],
                          },
         }

In [None]:
model_rs_scores_target_carbon_monoxide_1, model_rs_best_param_target_carbon_monoxide_1 = randomesearch_cv_scores(models,
                                                                params,
                                                               X_train_1,
                                                               X_test_1, 
                                                               y_train_1, 
                                                               y_test_1)

In [None]:
model_rs_scores_target_carbon_monoxide_1

In [None]:
model_rs_best_param_target_carbon_monoxide_1

#### RS Model 2

In [None]:
params = {'XGBRegressor': {'eta': [0.1,0.01,0.05],
                          'gamma': [1,2,3],
                          'max_depth': [15,20,25,30],
                          'min_child_weight': [30,40,50,70,100],
                          'max_delta_step': [0],
                          'reg_lambda': [1],
                          'alpha':[1],
                          },
         }

In [None]:
model_rs_scores_target_carbon_monoxide_2, model_rs_best_param_target_carbon_monoxide_2 = randomesearch_cv_scores(models,
                                                                params,
                                                               X_train_1,
                                                               X_test_1, 
                                                               y_train_1, 
                                                               y_test_1)

In [None]:
model_rs_scores_target_carbon_monoxide_2

In [None]:
model_rs_best_param_target_carbon_monoxide_2

#### RS Model 3

In [None]:
params = {'XGBRegressor': {'eta': [0.03,0.04,0.05,0.06,0.07,0.08,0.09],
                          'gamma': [1],
                          'max_depth': [15,20,25],
                          'min_child_weight': [25,30,35],
                          'max_delta_step': [0],
                          'alpha':[1],
                          },
         }

In [None]:
model_rs_scores_target_carbon_monoxide_3, model_rs_best_param_target_carbon_monoxide_3 = randomesearch_cv_scores(models,
                                                                params,
                                                               X_train_1,
                                                               X_test_1, 
                                                               y_train_1, 
                                                               y_test_1)

In [None]:
model_rs_scores_target_carbon_monoxide_3

In [None]:
model_rs_best_param_target_carbon_monoxide_3

#### RS Model 4

In [None]:
params = {'XGBRegressor': {'eta': [0.08,0.09],
                          'gamma': [1],
                          'max_depth': [14,15,16,17,18,19],
                          'min_child_weight': [28,29,30,31,32,33,34],
                          'max_delta_step': [0],
                          'alpha':[1],
                          },
         }

In [None]:
model_rs_scores_target_carbon_monoxide_4, model_rs_best_param_target_carbon_monoxide_4 = randomesearch_cv_scores(models,
                                                                params,
                                                               X_train_1,
                                                               X_test_1, 
                                                               y_train_1, 
                                                               y_test_1)

In [None]:
model_rs_scores_target_carbon_monoxide_4

In [None]:
model_rs_best_param_target_carbon_monoxide_4

## Evaluation of target_carbon_monoxide model

We will use the hyperparameters from RS 4.

In [None]:
# XGBRegressor, matching the estimator tuned in the RS models above
target_carbon_monoxide_model = XGBRegressor(min_child_weight=30,
                                              max_depth=16,
                                              max_delta_step=0,
                                             gamma=1,
                                                eta=0.09,
                                             alpha=1)
target_carbon_monoxide_model.fit(X_train_1, y_train_1)

In [None]:
y_pred_target_carbon_monoxide = target_carbon_monoxide_model.predict(X_test_1)

In [None]:
msle = mean_squared_log_error(y_test_1, y_pred_target_carbon_monoxide)

In [None]:
msle

In [None]:
print(f'Root mean squared log error: {np.sqrt(msle)}')

## Baseline model with target_benzene and target_carbon_monoxide as an input

In [None]:
X = df.drop(['target_nitrogen_oxides','date_time'],axis=1)
y_target_nitrogen_oxides = df['target_nitrogen_oxides']

In [None]:
models = {'Ridge' : Ridge(),
         'Lasso': Lasso(),
         'ElasticNet': ElasticNet(),
         'KNeighborsRegressor': KNeighborsRegressor(),
         'SVR': SVR(),
         'DecisionTreeRegressor': DecisionTreeRegressor(),
         'RandomForestRegressor':RandomForestRegressor(),
         'GradientBoostingRegressor': GradientBoostingRegressor(),
         'AdaBoostRegressor': AdaBoostRegressor(),
        'XGBRegressor': XGBRegressor(objective='reg:squarederror'),
        'XGBRFRegressor': XGBRFRegressor(objective='reg:squarederror')
         }

### target_nitrogen_oxides

In [None]:
X_train_3, X_test_3, y_train_3, y_test_3 = train_test_split(X, y_target_nitrogen_oxides, test_size=0.3, random_state=42)

In [None]:
baseline_model_scores_target_nitrogen_oxides = fit_and_score(models, X_train_3, X_test_3, y_train_3, y_test_3)

In [None]:
baseline_model_scores_target_nitrogen_oxides

In [None]:
plt.figure(figsize=(20,10))
sns.barplot(data=baseline_model_scores_target_nitrogen_oxides.T)
plt.title('Baseline Model R² Score for target_nitrogen_oxides')
plt.xticks(rotation=90);

We will tune the hyperparameters of the best baseline model:
RandomForestRegressor 0.902264

### target_nitrogen_oxides Hyperparam Tuning

#### Baseline Model 1

In [None]:
models = {'RandomForestRegressor': RandomForestRegressor()}
params = {'RandomForestRegressor': {}
         }

In [None]:
model_base_scores_target_nitrogen_oxides, model_base_best_param_target_nitrogen_oxides = randomesearch_cv_scores(models,
                                                                params,
                                                               X_train_3,
                                                               X_test_3, 
                                                               y_train_3, 
                                                               y_test_3)

In [None]:
model_base_scores_target_nitrogen_oxides

#### RS Model 1

In [None]:
models = {'RandomForestRegressor': RandomForestRegressor()}
params = {'RandomForestRegressor': {'n_estimators': [50,100,200,300,500],
                                'max_depth': [None, 2,4,10,20],
                                'min_samples_split':[2,4,10,20],
                                'min_samples_leaf': [1,2,5,10,20],
                                'max_features': ['auto','sqrt', 'log2'],
                                'max_leaf_nodes': [None, 2,4,10,20],
                                'bootstrap': [True, False],
                                'oob_score': [True, False],
                                'ccp_alpha': [0.0,0.01,0.001]
                                }
         }

In [None]:
model_rs_scores_target_nitrogen_oxides_1, model_rs_best_param_target_nitrogen_oxides_1 = randomesearch_cv_scores(models,
                                                                params,
                                                               X_train_3,
                                                               X_test_3, 
                                                               y_train_3, 
                                                               y_test_3)

In [None]:
model_rs_scores_target_nitrogen_oxides_1

In [None]:
model_rs_best_param_target_nitrogen_oxides_1

#### RS Model 2

In [None]:
params = {'RandomForestRegressor': {'n_estimators': [90,100,110],
                                'max_depth': [None, 2,3],
                                'min_samples_split':[8,9,10,11],
                                'min_samples_leaf': [1],
                                'max_features': ['sqrt'],
                                'max_leaf_nodes': [None],
                                'bootstrap': [True],
                                'oob_score': [False],
                                'ccp_alpha': [0.01,0.02,0.009]
                                }
         }

In [None]:
model_rs_scores_target_nitrogen_oxides_2, model_rs_best_param_target_nitrogen_oxides_2 = randomesearch_cv_scores(models,
                                                                params,
                                                               X_train_3,
                                                               X_test_3, 
                                                               y_train_3, 
                                                               y_test_3)

In [None]:
model_rs_scores_target_nitrogen_oxides_2

In [None]:
model_rs_best_param_target_nitrogen_oxides_2

#### RS Model 3

In [None]:
params = {'RandomForestRegressor': {'n_estimators': [105,110,115],
                                'max_depth': [None],
                                'min_samples_split':[9],
                                'min_samples_leaf': [1],
                                'max_features': ['sqrt'],
                                'max_leaf_nodes': [None],
                                'bootstrap': [True],
                                'oob_score': [False],
                                'ccp_alpha': [0,0.0001,0.007,0.008,0.009]
                                }
         }

In [None]:
model_rs_scores_target_nitrogen_oxides_3, model_rs_best_param_target_nitrogen_oxides_3 = randomesearch_cv_scores(models,
                                                                params,
                                                               X_train_3,
                                                               X_test_3, 
                                                               y_train_3, 
                                                               y_test_3)

In [None]:
model_rs_scores_target_nitrogen_oxides_3

In [None]:
model_rs_best_param_target_nitrogen_oxides_3

#### RS Model 4

In [None]:
params = {'RandomForestRegressor': {'n_estimators': [114,115,120,125],
                                'max_depth': [None],
                                'min_samples_split':[9],
                                'min_samples_leaf': [1],
                                'max_features': ['sqrt'],
                                'max_leaf_nodes': [None],
                                'bootstrap': [True],
                                'oob_score': [False],
                                'ccp_alpha': [0.008]
                                }
         }

In [None]:
model_rs_scores_target_nitrogen_oxides_4, model_rs_best_param_target_nitrogen_oxides_4 = randomesearch_cv_scores(models,
                                                                params,
                                                               X_train_3,
                                                               X_test_3, 
                                                               y_train_3, 
                                                               y_test_3)

In [None]:
model_rs_scores_target_nitrogen_oxides_4

In [None]:
model_rs_best_param_target_nitrogen_oxides_4

#### RS Model 5

In [None]:
params = {'RandomForestRegressor': {'n_estimators': [123,125,130,135],
                                'max_depth': [None],
                                'min_samples_split':[9],
                                'min_samples_leaf': [1],
                                'max_features': ['sqrt'],
                                'max_leaf_nodes': [None],
                                'bootstrap': [True],
                                'oob_score': [False],
                                'ccp_alpha': [0.008]
                                }
         }

In [None]:
model_rs_scores_target_nitrogen_oxides_5, model_rs_best_param_target_nitrogen_oxides_5 = randomesearch_cv_scores(models,
                                                                params,
                                                               X_train_3,
                                                               X_test_3, 
                                                               y_train_3, 
                                                               y_test_3)

In [None]:
model_rs_scores_target_nitrogen_oxides_5

In [None]:
model_rs_best_param_target_nitrogen_oxides_5

## Evaluation of target_nitrogen_oxides model

We will go with RS Model 5, since its hyperparameters gave a better score than the earlier searches.

In [None]:
target_nitrogen_oxides_model = RandomForestRegressor(oob_score=False,n_estimators=125,
                                                     min_samples_split=9,
                                                     min_samples_leaf=1,
                                                     max_leaf_nodes=None,
                                                     max_features='sqrt',
                                                     max_depth=None,
                                                     ccp_alpha=0.008,
                                                     bootstrap=True)
target_nitrogen_oxides_model.fit(X_train_3,y_train_3)

In [None]:
y_pred_target_nitrogen_oxides = target_nitrogen_oxides_model.predict(X_test_3)

In [None]:
msle = mean_squared_log_error(y_test_3, y_pred_target_nitrogen_oxides)

In [None]:
msle

In [None]:
print(f'Root mean squared log error: {np.sqrt(msle)}')
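
For reference, the metric printed above is the square root of the mean squared difference between log(1 + prediction) and log(1 + truth). The sketch below computes it by hand with NumPy. The function name `rmsle` and the sample values are made up for illustration and are not part of the notebook.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared log error, computed by hand (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) = log(1 + x), so zero-valued targets are handled safely
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    return float(np.sqrt(np.mean(log_diff ** 2)))

# Made-up values, only to exercise the function
y_true_demo = [1.0, 3.0, 7.0]
y_pred_demo = [1.0, 3.0, 7.0]
print(rmsle(y_true_demo, y_pred_demo))
```

Because the error is taken on the log scale, it penalises relative rather than absolute differences.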

# 6. Putting it all together

In [None]:
df

In [None]:
def get_preds(df):
    """Predict the three targets in sequence; each prediction becomes a feature for the next model."""
    # Work on a copy so the caller's DataFrame is not modified in place
    df = df.copy()

    # Benzene is predicted from the weather and sensor columns only
    X = df.drop(['date_time'], axis=1)
    df['target_benzene'] = target_benzene_model.predict(X)

    # Carbon monoxide additionally uses the predicted benzene column
    X = df.drop(['date_time'], axis=1)
    df['target_carbon_monoxide'] = target_carbon_monoxide_model.predict(X)

    # Put carbon monoxide before benzene, then predict nitrogen oxides;
    # .copy() avoids a SettingWithCopyWarning on the assignment below
    df = df[['date_time','deg_C', 'relative_humidity', 'absolute_humidity',
           'sensor_1', 'sensor_2','sensor_3',
           'sensor_4','sensor_5','target_carbon_monoxide','target_benzene']].copy()

    X = df.drop(['date_time'], axis=1)
    df['target_nitrogen_oxides'] = target_nitrogen_oxides_model.predict(X)

    df = df[['date_time','deg_C', 'relative_humidity', 'absolute_humidity',
           'sensor_1', 'sensor_2','sensor_3',
           'sensor_4','sensor_5','target_carbon_monoxide','target_benzene', 'target_nitrogen_oxides']]

    return df

In [None]:
testing_df = df.drop(['target_carbon_monoxide','target_benzene', 'target_nitrogen_oxides'], axis=1)

In [None]:
testing_df

In [None]:
testing_df = get_preds(testing_df)

In [None]:
testing_df

In [None]:
df

# Feature importances

### target_benzene_model

In [None]:
feat_importances_target_benzene_model = pd.DataFrame(target_benzene_model.feature_importances_, index=X_train_2.columns)

In [None]:
feat_importances_target_benzene_model

In [None]:
plt.figure(figsize=(20,10))
plt.xticks(rotation=90)
plt.title('Feature Importances')
sns.barplot(data= feat_importances_target_benzene_model.sort_values(0).T);

### target_carbon_monoxide_model

In [None]:
feat_importances_target_carbon_monoxide_model = pd.DataFrame(target_carbon_monoxide_model.feature_importances_, index=X_train_1.columns)

In [None]:
plt.figure(figsize=(20,10))
plt.xticks(rotation=90)
plt.title('Feature Importances')
sns.barplot(data= feat_importances_target_carbon_monoxide_model.sort_values(0).T);

### target_nitrogen_oxides_model

In [None]:
feat_importances_target_nitrogen_oxides_model = pd.DataFrame(target_nitrogen_oxides_model.feature_importances_, index=X_train_3.columns)

In [None]:
plt.figure(figsize=(20,10))
plt.xticks(rotation=90)
plt.title('Feature Importances')
sns.barplot(data= feat_importances_target_nitrogen_oxides_model.sort_values(0).T);
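
The importances can also be read as a sorted table, which is sometimes easier to scan than a wide bar chart. Here is a minimal sketch, with a hypothetical helper `importance_table` and made-up numbers standing in for `feature_importances_` and the training columns:

```python
import pandas as pd

def importance_table(importances, feature_names):
    """Return feature importances as a Series sorted from most to least important."""
    return pd.Series(importances, index=feature_names).sort_values(ascending=False)

# Made-up values; in the notebook these would come from a fitted model,
# e.g. target_benzene_model.feature_importances_ and X_train_2.columns
demo = importance_table([0.1, 0.6, 0.3], ["deg_C", "sensor_2", "sensor_5"])
print(demo)
```

The returned Series can be plotted directly with `demo.plot.barh()` if a chart is still wanted.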

## Submission

In [None]:
test_data = pd.read_csv("/kaggle/input/tabular-playground-series-jul-2021/test.csv")
test_data

In [None]:
test_data_preds = get_preds(test_data)

In [None]:
test_data_preds

In [None]:
test_data_preds.columns

In [None]:
test_data_preds=test_data_preds.drop(['deg_C','relative_humidity','absolute_humidity',
                       'sensor_1','sensor_2','sensor_3','sensor_4','sensor_5'],axis=1)

In [None]:
# test_data_preds.to_csv('submission.csv',index=False)
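
Before writing the file, it can help to confirm the frame has exactly the columns the submission needs. The sketch below rests on assumptions: the helper name `validate_submission` is hypothetical, and the expected column list is simply the columns left after the drop above.

```python
import pandas as pd

EXPECTED_COLUMNS = ["date_time", "target_carbon_monoxide",
                    "target_benzene", "target_nitrogen_oxides"]

def validate_submission(df):
    """Check column names and look for missing or negative predictions."""
    if list(df.columns) != EXPECTED_COLUMNS:
        raise ValueError(f"Unexpected columns: {list(df.columns)}")
    targets = df[EXPECTED_COLUMNS[1:]]
    if targets.isna().any().any():
        raise ValueError("Submission contains missing predictions")
    if (targets < 0).any().any():
        # scikit-learn's mean_squared_log_error rejects negative values
        raise ValueError("Submission contains negative predictions")
    return df

# Made-up row, only to exercise the check
demo = pd.DataFrame({"date_time": ["2011-01-01 00:00:00"],
                     "target_carbon_monoxide": [1.5],
                     "target_benzene": [6.0],
                     "target_nitrogen_oxides": [150.0]})
validate_submission(demo)
```

With the check in place, the export could become `validate_submission(test_data_preds).to_csv('submission.csv', index=False)`.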