# Tabular Playground Series - Jul 2021
## H2O (AutoML)

I have been using **PyCaret** to try predictions with various feature sets. The one with the best score is [this notebook](https://www.kaggle.com/astashiro/tps-jul2021-06rethink-features).<br>
**Score (PyCaret): 0.20696**

Next, I tried predicting with **LightAutoML**, using the same features that worked well with PyCaret. This is [the notebook](https://www.kaggle.com/astashiro/tps-jul2021-07lightautoml/output?select=LightAutoML_submission.csv).<br>
**Score (LightAutoML): 0.20509**

Both of these AutoML tools worked well and gave similar results for the same features, so I decided to try a third one, **H2O**.

The outcome probably depends on tuning, but H2O's results were much the same as the other two. It seems best to choose an AutoML tool according to your preference for execution speed and for how it visualizes results.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import h2o
from h2o.automl import H2OAutoML

In [None]:
df_train = pd.read_csv('../input/tabular-playground-series-jul-2021/train.csv')
df_test = pd.read_csv('../input/tabular-playground-series-jul-2021/test.csv')

In [None]:
df_train['IsTrain'] = 1
df_test['IsTrain'] = 0
df = pd.concat([df_train, df_test], sort=False,axis=0)

df['date_time'] = pd.to_datetime(df['date_time'])

df['day_of_week'] = df['date_time'].dt.dayofweek
df['hour'] = df['date_time'].dt.hour
# 1 for hours 8..20 inclusive, 0 otherwise
df['working_hours'] = df['hour'].isin(np.arange(8, 21, 1)).astype("int")
# Season code by month: 1 = Mar-May, 2 = Jun-Aug, 3 = Sep-Nov, 4 = Dec-Feb
df.loc[(df['date_time'].dt.month >= 3) & (df['date_time'].dt.month <= 5), 'season'] = 1
df.loc[(df['date_time'].dt.month >= 6) & (df['date_time'].dt.month <= 8), 'season'] = 2
df.loc[(df['date_time'].dt.month >= 9) & (df['date_time'].dt.month <= 11), 'season'] = 3
df.loc[(df['date_time'].dt.month == 12) | (df['date_time'].dt.month <= 2), 'season'] = 4

train = df.query('IsTrain == 1').drop(['IsTrain'], axis=1)
test =  df.query('IsTrain == 0').drop(['IsTrain','target_carbon_monoxide','target_benzene','target_nitrogen_oxides'], axis=1)
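
The feature cell above derives `working_hours` from the hour and a `season` code from the month. As a minimal sketch, the same two rules can be written as plain lookup functions, which makes the boundaries easy to check. `month_to_season` and `is_working_hour` are hypothetical helper names that do not appear in the notebook.

```python
def month_to_season(month: int) -> int:
    """Map a calendar month (1-12) to the season code used in this notebook."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    if 3 <= month <= 5:
        return 1  # Mar-May
    if 6 <= month <= 8:
        return 2  # Jun-Aug
    if 9 <= month <= 11:
        return 3  # Sep-Nov
    return 4      # Dec, Jan, Feb


def is_working_hour(hour: int) -> int:
    """Return 1 for hours 8..20 inclusive (same range as np.arange(8, 21, 1))."""
    return int(8 <= hour <= 20)
```

With pandas, such a function could be applied as `df['date_time'].dt.month.map(month_to_season)`. That would likely give an integer column, whereas assigning into a new column with masked `df.loc` calls typically leaves it as float.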

### Predict with H2O (AutoML)

In [None]:
h2o.init()

In [None]:

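def do_h2o(target, train, test):
    """Train H2O AutoML on `train` and return the leader's predictions for `test`."""
    # Every column except the target is used as a feature.
    features = [x for x in train.columns if x != target]
    h2oaml = H2OAutoML(max_runtime_secs=360, stopping_metric='RMSLE', sort_metric='RMSLE')
    h2oaml.train(x=features, y=target, training_frame=train)
    # A bare expression inside a function displays nothing, so print the leaderboard.
    print(h2oaml.leaderboard)
    # For regression, the prediction frame has a single 'predict' column.
    pred = h2oaml.leader.predict(test).as_data_frame()['predict']
    return pred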
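
`do_h2o` asks AutoML to stop on RMSLE and to rank its models by it. For reference, here is a minimal sketch of that metric in plain Python, assuming the usual definition (the root of the mean squared difference between `log1p` of the prediction and `log1p` of the true value). The function name `rmsle` is illustrative, and this is not H2O's internal implementation.

```python
import math


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for two equal-length sequences.

    Values must be greater than -1, otherwise math.log1p raises ValueError.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("inputs must not be empty")
    squared_log_errors = [
        (math.log1p(p) - math.log1p(t)) ** 2
        for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))
```

Because the error is taken on a log scale, the metric penalizes relative rather than absolute differences, so a miss on a small target counts about as much as a proportionally similar miss on a large one.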

### Prediction when the sensor is on
#### Carbon monoxide

In [None]:
train1 = h2o.H2OFrame(train.query('absolute_humidity >= 0.24').loc[:,['deg_C', 'relative_humidity','absolute_humidity', 'sensor_1', 'sensor_2', 'sensor_5', 'season', 'working_hours', 'target_carbon_monoxide']])
test1 = h2o.H2OFrame(test.loc[:,['deg_C', 'relative_humidity','absolute_humidity', 'sensor_1', 'sensor_2', 'sensor_5', 'season', 'working_hours']])

In [None]:
train1

In [None]:
test1

In [None]:
pred1 = do_h2o('target_carbon_monoxide', train1, test1)
pred1

#### Benzene

In [None]:
train2 = h2o.H2OFrame(train.loc[:,['sensor_2','target_benzene']])
test2 = h2o.H2OFrame(test.loc[:,['sensor_2']])

In [None]:
train2

In [None]:
test2

In [None]:
pred2 = do_h2o('target_benzene', train2, test2)
pred2

#### Nitrogen oxides

In [None]:
train3 = h2o.H2OFrame(train.query('absolute_humidity >= 0.24 & season >= 3').loc[:,['deg_C', 'relative_humidity','absolute_humidity', 'sensor_2', 'sensor_3', 'sensor_4', 'sensor_5', 'working_hours', 'target_nitrogen_oxides']])
test3 = h2o.H2OFrame(test.loc[:,['deg_C', 'relative_humidity','absolute_humidity', 'sensor_2', 'sensor_3', 'sensor_4', 'sensor_5', 'working_hours']])

In [None]:
train3

In [None]:
test3

In [None]:
pred3 = do_h2o('target_nitrogen_oxides', train3, test3)
pred3

In [None]:
sub1 = pd.DataFrame({
    'date_time': test.date_time,
    'target_carbon_monoxide': pred1,
    'target_benzene': pred2,
    'target_nitrogen_oxides': pred3
})

sub1

### Prediction when the sensor is off
#### Carbon monoxide

In [None]:
train4 = h2o.H2OFrame(train.query('season >= 3').loc[:,['day_of_week', 'hour', 'season', 'working_hours', 'target_carbon_monoxide']])
test4 = h2o.H2OFrame(test.loc[:,['day_of_week', 'hour', 'season', 'working_hours']])

In [None]:
train4

In [None]:
test4

In [None]:
pred4 = do_h2o('target_carbon_monoxide', train4, test4)
pred4

#### Nitrogen oxides

In [None]:
train5 = h2o.H2OFrame(train.query('season >= 3').loc[:,['day_of_week', 'hour', 'season', 'working_hours', 'target_nitrogen_oxides']])
test5 = h2o.H2OFrame(test.loc[:,['day_of_week', 'hour', 'season', 'working_hours']])

In [None]:
pred5 = do_h2o('target_nitrogen_oxides', train5, test5)
pred5

In [None]:
sub2 = pd.DataFrame({
    'date_time': test.date_time,
    'target_carbon_monoxide': pred4,
    'target_benzene': pred2,
    'target_nitrogen_oxides': pred5
})

sub2

### Merge predictions

In [None]:
sub_temp1 = sub1.query("date_time < '2011-01-02 21:00:00'")
sub_temp2 = sub2.query("date_time >= '2011-01-02 21:00:00' & date_time <= '2011-01-05 00:00:00'")
sub_temp3 = sub1.query("date_time > '2011-01-05 00:00:00' & date_time < '2011-01-28 17:00:00'")
sub_temp4 = sub1.query("date_time >= '2011-01-28 17:00:00' & date_time <= '2011-01-29 01:00:00'")
sub_temp5 = sub1.query("date_time > '2011-01-29 01:00:00' & date_time < '2011-02-08 17:00:00'")
sub_temp6 = sub2.query("date_time >= '2011-02-08 17:00:00' & date_time <= '2011-02-11 20:00:00'")
sub_temp7 = sub1.query("date_time > '2011-02-11 20:00:00'")

submission = pd.concat([sub_temp1, sub_temp2, sub_temp3, sub_temp4, sub_temp5, sub_temp6, sub_temp7], sort=False,axis=0)
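
The slicing above takes rows from `sub2` (the time-only models) for two windows and from `sub1` everywhere else. A minimal sketch of the same selection rule as a single predicate follows. `OFF_WINDOWS`, `sensor_is_off`, and `pick_prediction` are hypothetical names, and the window bounds are copied from the queries above (both ends inclusive).

```python
from datetime import datetime

# Windows in which the "sensor off" predictions (sub2) are used.
OFF_WINDOWS = [
    (datetime(2011, 1, 2, 21), datetime(2011, 1, 5, 0)),
    (datetime(2011, 2, 8, 17), datetime(2011, 2, 11, 20)),
]


def sensor_is_off(ts: datetime) -> bool:
    """Return True if the timestamp falls inside any sensor-off window."""
    return any(start <= ts <= end for start, end in OFF_WINDOWS)


def pick_prediction(ts: datetime, pred_on: float, pred_off: float) -> float:
    """Choose the sensor-off prediction inside the windows, else the sensor-on one."""
    return pred_off if sensor_is_off(ts) else pred_on
```

Keeping the windows in one list means a boundary only needs to be changed in one place, rather than in two adjacent queries.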

In [None]:
submission

In [None]:
submission.to_csv('automl_h2o_submission.csv', index=False)