# Tabular Playground Series - Jul 2021
Continued from [last time](https://www.kaggle.com/astashiro/tps-jul2021-02pycaretblend).

## Add features
I added features that I expected to be useful and made predictions with them.

In [None]:
!pip install pycaret==2.3.1

In [None]:
!pip install shap

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

from pycaret.regression import setup, compare_models, create_model, tune_model, finalize_model, blend_models, predict_model, interpret_model
import shap

In [None]:
train = pd.read_csv('../input/tabular-playground-series-jul-2021/train.csv')
test = pd.read_csv('../input/tabular-playground-series-jul-2021/test.csv')
sub = pd.read_csv('../input/tabular-playground-series-jul-2021/sample_submission.csv')

# Convert date_time to a datetime type
train['date_time'] = pd.to_datetime(train['date_time'])
test['date_time'] = pd.to_datetime(test['date_time'])

### Causes of outliers
The pair plot shows no outliers between benzene and any of the sensors, and the correlation is strongly positive, especially with sensor_2. Carbon monoxide and nitrogen oxides, in contrast, show noticeable outliers. I therefore hypothesized that the benzene value is calculated from the sensor values, while the carbon monoxide and nitrogen oxide values are not.

**Hypothesis**  
Benzene and Sensor1-5 : positive correlation  
Carbon monoxide and Sensor1-5 : spurious correlation  (confounding variable: Benzene)  
Nitrogen oxides and Sensor1-5 : spurious correlation  (confounding variable: Benzene)  
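
One way to probe this hypothesis is a partial correlation: regress both the target and a sensor on benzene, then correlate the residuals. If the relationship runs mostly through benzene, the partial correlation should be much smaller than the raw one. The following is only a minimal sketch on synthetic data; `partial_corr` is a hypothetical helper, and the arrays are stand-ins, not values from train.csv.

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    x, y, z = (np.asarray(a, dtype=float) for a in (x, y, z))
    # Residuals of a least-squares fit on z (with intercept).
    design = np.column_stack([np.ones_like(z), z])
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return float(np.corrcoef(res_x, res_y)[0, 1])

# Synthetic stand-ins: both "sensor" and "co" depend only on "benzene".
rng = np.random.default_rng(0)
benzene = rng.gamma(shape=2.0, scale=5.0, size=2000)
sensor = 3.0 * benzene + rng.normal(0, 2.0, size=2000)
co = 0.2 * benzene + rng.normal(0, 0.5, size=2000)

raw = float(np.corrcoef(co, sensor)[0, 1])
partial = partial_corr(co, sensor, benzene)
print(f"raw={raw:.3f} partial={partial:.3f}")
```

On the real data the same call would take `train['target_carbon_monoxide']`, a sensor column, and `train['target_benzene']`; a partial correlation that stays large would argue against the hypothesis.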

In [None]:
train_temp = train.loc[:,['target_carbon_monoxide','target_benzene','target_nitrogen_oxides','sensor_1','sensor_2','sensor_3','sensor_4','sensor_5']]
sns.pairplot(train_temp)

Here is a closer look at the actual outliers; the corresponding rows fall around 2010/12/16.

In [None]:
sel_train = train[6600:6900].copy()
cols = ['target_carbon_monoxide','target_benzene','target_nitrogen_oxides','deg_C', 'relative_humidity','absolute_humidity', 'sensor_1', 'sensor_2', 'sensor_3', 'sensor_4' ,'sensor_5']
for col in cols:
    plt.figure(figsize=(16,4))
    plt.plot(sel_train.date_time, sel_train[col])
    plt.ylabel(col)
    plt.show()

Look at the thermometer, the hygrometer, and the sensors during the period when they are not working.  
Benzene follows them down to a value close to zero, but carbon monoxide and nitrogen oxides show spikes that are independent of every sensor.
The sensors still look usable as features for carbon monoxide and nitrogen oxides as long as the outliers are handled separately, so I will add a feature that indicates whether the sensors are off.
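
As a rough sketch, the sensor-off period can also be located by timestamp rather than by row position. The threshold of 0.24 on `absolute_humidity` is the one used for `IsSensorOff` later in this notebook; the helper name `sensor_off_window` and the small demo frame are assumptions made for illustration.

```python
import pandas as pd

def sensor_off_window(df, threshold=0.24):
    """Return (first, last, count) for rows whose absolute_humidity is below threshold."""
    off = df.loc[df["absolute_humidity"] < threshold, "date_time"]
    if off.empty:
        return None
    return off.min(), off.max(), len(off)

# Tiny synthetic frame standing in for train.csv.
demo = pd.DataFrame({
    "date_time": pd.date_range("2010-12-15", periods=6, freq="D"),
    "absolute_humidity": [0.80, 0.20, 0.10, 0.22, 0.75, 0.90],
})
window = sensor_off_window(demo)
print(window)
```

Calling `sensor_off_window(train)` would return the first and last flagged timestamps, which can then be compared against the 2010/12/16 period plotted above.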

### About the value of Nitrogen oxides
Consider which features are needed to predict nitrogen oxides.  
Let's match the scales and plot "benzene and carbon monoxide" and "benzene and nitrogen oxides".
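
The multipliers used in the plots below (4.6 and 19) bring the two series to a similar level. One simple way to obtain such a multiplier is the ratio of the column means; this is only a sketch under that assumption and not necessarily how the values in this notebook were chosen. `scale_factor` is a hypothetical helper and the demo values are made up.

```python
import pandas as pd

def scale_factor(df, reference, other):
    """Multiplier that brings `other` to the same mean level as `reference`."""
    return df[reference].mean() / df[other].mean()

# Made-up values, only to show the arithmetic.
demo = pd.DataFrame({
    "target_benzene": [4.0, 10.0, 16.0],
    "target_carbon_monoxide": [1.0, 2.0, 3.0],
})
factor = scale_factor(demo, "target_benzene", "target_carbon_monoxide")
print(factor)  # 5.0
```

The result could be passed as `num2` to the `plot_data` helper defined in the next cell, for example `plot_data(train, 'target_benzene', 1, 'target_carbon_monoxide', factor)`.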

In [None]:
def plot_data(df, name1, num1, name2,num2):
    plt.figure(figsize=(16,4))
    plt.plot(df.date_time, df[name1]*num1, label=name1)
    plt.plot(df.date_time, df[name2]*num2, label=name2)
    plt.legend()
    plt.show()

#### Benzene and Carbon monoxide

In [None]:
plot_data(train, 'target_benzene', 1, 'target_carbon_monoxide', 4.6)
plot_data(train[:300], 'target_benzene', 1, 'target_carbon_monoxide', 4.6)
plot_data(train[6800:7100], 'target_benzene', 1, 'target_carbon_monoxide', 4.6)

#### Benzene and Nitrogen oxides

In [None]:
plot_data(train, 'target_benzene', 19, 'target_nitrogen_oxides', 1)
plot_data(train[:300], 'target_benzene', 19, 'target_nitrogen_oxides', 1)
plot_data(train[6800:7100], 'target_benzene', 19, 'target_nitrogen_oxides', 1)

Comparing the first and second halves of the training data, "benzene and carbon monoxide" overlap with spikes of almost the same size over the entire period, while the relationship between "benzene and nitrogen oxides" reverses: benzene is larger in the first half and nitrogen oxides are larger in the second half.  
The sensors therefore look sufficient as features for carbon monoxide, but nitrogen oxides seem to need some additional feature.  
Looking at the details, nitrogen oxides seem to increase when the temperature is low, but they do not decrease even when the temperature is temporarily high for a few days in winter. Assuming that benzene and carbon monoxide come from automobile exhaust while nitrogen oxides come from exhaust plus winter heating, and that air pollution does not disappear immediately, I decided to use the monthly average temperature as a feature.

### Add features

Add the following features:  
week_day, hour : Added in the expectation that they help predict carbon monoxide and nitrogen oxide values when the sensors are off.  
IsSensorOff : Added to indicate whether the sensors are off.  
degC_month : Added to capture seasonal variation.
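
Because the same features are needed for both train and test, they can be built in one function. The following is a minimal sketch, not the code this notebook runs: `add_features` is a hypothetical helper, the demo frame is synthetic, and, unlike the cell below, it does not keep a separate `month` column.

```python
import pandas as pd

def add_features(df, humidity_threshold=0.24):
    """Return a copy of df with week_day, hour, IsSensorOff and degC_month added."""
    out = df.copy()
    out["week_day"] = out["date_time"].dt.weekday
    out["hour"] = out["date_time"].dt.hour
    # 1 when the humidity reading suggests the sensors are off, else 0.
    out["IsSensorOff"] = (out["absolute_humidity"] < humidity_threshold).astype(int)
    # Mean temperature of the calendar month, broadcast to every row of that month.
    month = out["date_time"].dt.month
    out["degC_month"] = out.groupby(month)["deg_C"].transform("mean")
    return out

# Synthetic rows standing in for train.csv.
demo = pd.DataFrame({
    "date_time": pd.to_datetime(
        ["2010-03-10 18:00", "2010-03-11 07:00", "2010-12-16 12:00", "2010-12-17 12:00"]
    ),
    "deg_C": [12.0, 14.0, 2.0, 6.0],
    "absolute_humidity": [0.75, 0.80, 0.10, 0.60],
})
features = add_features(demo)
print(features[["week_day", "hour", "IsSensorOff", "degC_month"]])
```

As in the cell below, the monthly mean is computed within whichever frame is passed in, so train and test each get their own monthly averages.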

In [None]:
# Add date-based features
train['week_day'] = train['date_time'].dt.weekday
train['hour'] = train['date_time'].dt.hour

# 1 when the sensors are off
train['IsSensorOff'] = 0
train.loc[train['absolute_humidity'] < 0.24 , 'IsSensorOff'] = 1

# Add the monthly mean temperature
train['month'] = train['date_time'].dt.month
train.loc[:, 'degC_month'] = train.groupby(['month'])['deg_C'].transform('mean')

# Add date-based features
test['week_day'] = test['date_time'].dt.weekday
test['hour'] = test['date_time'].dt.hour

# 1 when the sensors are off
test['IsSensorOff'] = 0
test.loc[test['absolute_humidity'] < 0.24 , 'IsSensorOff'] = 1

# Add the monthly mean temperature (computed from the test rows themselves)
test['month'] = test['date_time'].dt.month
test.loc[:, 'degC_month'] = test.groupby(['month'])['deg_C'].transform('mean')

### Prediction with PyCaret
#### Carbon monoxide
I want the model to clearly separate the carbon monoxide outliers from the other predictions, so I specify IsSensorOff as a categorical feature.

In [None]:
train1 = train.loc[:,['IsSensorOff','degC_month','week_day','hour','sensor_1','sensor_2','sensor_3','sensor_4','sensor_5','target_carbon_monoxide']]
train1.head()

In [None]:
reg1 = setup(data=train1, target='target_carbon_monoxide',categorical_features=["IsSensorOff", "week_day", "hour"], session_id=1)

In [None]:
catboost1 = create_model("catboost", fold=4)
et1 = create_model("et", fold=4)
lightgbm1 = create_model("lightgbm", fold=4)
gbr1 = create_model("gbr", fold=4)
rf1 = create_model("rf", fold=4)
blend1 = blend_models(estimator_list= [catboost1, et1, lightgbm1, gbr1, rf1], fold=4)
pred_h1 = predict_model(blend1)
final1 = finalize_model(blend1)
pred1 = predict_model(final1, data=test)

In [None]:
interpret_model(catboost1)

In [None]:
interpret_model(lightgbm1)

In [None]:
interpret_model(rf1)

#### Benzene
Since there is a very strong positive correlation between benzene and sensor_2, I use sensor_2 as the only feature.
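
The choice of sensor can be checked numerically by ranking the sensors by their correlation with the target. This is only a sketch on synthetic data, in which `sensor_2` is deliberately constructed to track benzene most closely; the numbers say nothing about the real train.csv.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the target and three sensors.
rng = np.random.default_rng(1)
benzene = rng.gamma(shape=2.0, scale=5.0, size=500)
demo = pd.DataFrame({
    "target_benzene": benzene,
    "sensor_1": benzene + rng.normal(0, 8.0, size=500),
    "sensor_2": benzene + rng.normal(0, 0.5, size=500),
    "sensor_3": benzene + rng.normal(0, 6.0, size=500),
})
sensors = ["sensor_1", "sensor_2", "sensor_3"]
# Pearson correlation of every sensor column with the target.
corr = demo[sensors].corrwith(demo["target_benzene"])
best = corr.abs().idxmax()
print(corr.round(3))
print("strongest:", best)
```

On the real data the same two lines would be applied to `train` with all five sensor columns.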

In [None]:
train2 = train.loc[:,['sensor_2','target_benzene']]
train2.head()

In [None]:
reg2 = setup(data=train2, target='target_benzene', session_id=2)

In [None]:
gbr2 = create_model("gbr", fold=4)
et2 = create_model("et", fold=4)
lightgbm2 = create_model("lightgbm", fold=4)
catboost2 = create_model("catboost", fold=4)
rf2 = create_model("rf", fold=4)
blend2 = blend_models(estimator_list= [catboost2, lightgbm2, gbr2, rf2], fold=4)
pred_h2 = predict_model(blend2)
final2 = finalize_model(blend2)
pred2 = predict_model(final2, data=test)

In [None]:
interpret_model(catboost2)

In [None]:
interpret_model(lightgbm2)

In [None]:
interpret_model(rf2)

#### Nitrogen oxides
I specify IsSensorOff as a categorical feature for the same reason as for carbon monoxide. I expect degC_month to help with these predictions.

In [None]:
# Build the nitrogen oxides training frame (without overwriting train1)
train3 = train.loc[:,['IsSensorOff','degC_month','week_day','hour','sensor_1','sensor_2','sensor_3','sensor_4','sensor_5','target_nitrogen_oxides']]
train3.head()

In [None]:
reg3 = setup(data=train3, target='target_nitrogen_oxides',categorical_features=["IsSensorOff", "week_day", "hour"], session_id=3)

In [None]:
catboost3 = create_model("catboost", fold=4)
et3 = create_model("et", fold=4)
lightgbm3 = create_model("lightgbm", fold=4)
gbr3 = create_model("gbr", fold=4)
rf3 = create_model("rf", fold=4)
blend3 = blend_models(estimator_list= [catboost3, et3, lightgbm3, gbr3, rf3], fold=4)
pred_h3 = predict_model(blend3)
final3 = finalize_model(blend3)
pred3 = predict_model(final3, data=test)

In [None]:
interpret_model(catboost3)

In [None]:
interpret_model(lightgbm3)

In [None]:
interpret_model(rf3)

I checked the SHAP values for CatBoost, LightGBM, and Random Forest. For nitrogen oxides, degC_month is the most important feature, and the other features I added are used as well.

### Submission

In [None]:
sub.target_carbon_monoxide = pred1.Label
sub.target_benzene = pred2.Label
sub.target_nitrogen_oxides = pred3.Label
sub.to_csv('addfeatures_submission.csv',index=False)
sub

#### Public Score
**0.26977** 

My score was worse than [last time](https://www.kaggle.com/astashiro/tps-jul2021-02pycaretblend). 
However, when using only LightGBM, predicting with these features gave a better score. 

When using only LightGBM  
0.30025 --> 0.27216  
 
When blended  
0.25441 --> 0.26977  
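
To read these numbers: as far as I understand, the competition is scored with the mean column-wise RMSLE over the three targets, so lower is better. The following sketch shows that calculation under this assumption; `mean_columnwise_rmsle` is a hypothetical helper and the arrays are made-up values.

```python
import numpy as np

def mean_columnwise_rmsle(y_true, y_pred):
    """Average over columns of sqrt(mean((log1p(pred) - log1p(true)) ** 2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    per_column = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2, axis=0))
    return float(per_column.mean())

# Made-up rows: columns stand for carbon monoxide, benzene, nitrogen oxides.
y_true = np.array([[2.0, 10.0, 150.0], [1.0, 5.0, 80.0]])
perfect = mean_columnwise_rmsle(y_true, y_true)
worse = mean_columnwise_rmsle(y_true, y_true * 1.5)
print(perfect, worse)
```

Because the error is taken on a log scale, a relative miss costs roughly the same on a small target such as carbon monoxide as on a large one such as nitrogen oxides.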

The newly added features are likely to be effective, but the way they are used when training multiple models for the blend may be poor.  

**Open issues**
1. Prediction when the sensor is not working
2. Nitrogen Oxides Prediction


I will continue to look into these.