<p style='text-align:center;font-family: sans-serif;font-weight:bold;color:#616161;font-size:25px;margin: 30px;'>TPS MAR</p>
<p style='text-align:center;font-family: sans-serif ;font-weight:bold;color:black;font-size:30px;margin: 10px;'>EDA + Modeling with Optuna for <font color='#08B4E4'>Beginners</font></p>
<p style="text-align:center;font-family: sans-serif ;font-weight:bold;color:#616161;font-size:20px;margin: 30px;">Catboost for regression</p>

Hello, this is my first time using CatBoost in TPS. It's an interesting model, and I tuned it with Optuna. Hyperparameter tuning libraries such as Optuna are easy to use and often improve performance, but don't assume the parameters they find are the best possible.

## Import Modules

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
from math import sin, cos, pi

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import mean_absolute_error as mae
from sklearn.ensemble import RandomForestRegressor
from catboost import CatBoostRegressor
import optuna
from optuna.samplers import TPESampler
from functools import partial
from sklearn.preprocessing import LabelEncoder, StandardScaler

BACKCOLOR = '#f6f5f5'
sns.set_palette("Paired")

## Read Data

In [None]:
train = pd.read_csv('../input/tabular-playground-series-mar-2022/train.csv', index_col='row_id', parse_dates=['time'])
test = pd.read_csv('../input/tabular-playground-series-mar-2022/test.csv', index_col='row_id', parse_dates=['time'])
submission = pd.read_csv('../input/tabular-playground-series-mar-2022/sample_submission.csv')
train.head()

In [None]:
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)  
        else:
            df[col] = df[col].astype('category')
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

reduce_mem_usage(train)
reduce_mem_usage(test)

## EDA

### Step1: Understand each feature  
EDA starts with understanding the variables.  
- row_id - a unique identifier for this instance  
- time - the 20-minute period in which each measurement was taken  
- x - the east-west midpoint coordinate of the roadway  
- y - the north-south midpoint coordinate of the roadway  
- direction - the direction of travel of the roadway. EB indicates "eastbound" travel, for example, while SW indicates a "southwest" direction of travel.  
- congestion - congestion levels for the roadway during each hour; the target. The congestion measurements have been normalized to the range 0 to 100.  

In [None]:
# Check the type and missing value of each variable.
train.info()

In [None]:
from IPython.display import HTML  # public import path; IPython.core.display is deprecated for this
def multi_table(table_list):
    return HTML(
        f"<table><tr> {''.join(['<td>' + table._repr_html_() + '</td>' for table in table_list])} </tr></table>")

In [None]:
# Check the actual value and count
multi_table([pd.DataFrame(train[i].value_counts()) for i in train.columns])

The table above shows 65 rows for each timestamp, so each timestamp has 65 combinations of position (x, y) and direction.

The x coordinates are 0, 1, 2 and the y coordinates are 0, 1, 2, 3, which gives 12 possible positions.
There are eight directions, from EB to SE.

So not every combination of position and direction was recorded: 12 positions × 8 directions would give 96 combinations, but only 65 appear.
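The comparison between observed and possible combinations can be checked with a few lines of code. This is a minimal sketch using only the standard library; `combo_coverage` and the sample data are hypothetical names and values made up for illustration, not part of the competition data.

```python
from itertools import product

def combo_coverage(records):
    """Return (observed, possible) counts of (x, y, direction) combinations.

    `records` is any iterable of (x, y, direction) tuples.
    """
    observed = set(records)
    xs = {x for x, _, _ in observed}
    ys = {y for _, y, _ in observed}
    dirs = {d for _, _, d in observed}
    # Every combination that could exist given the values that were seen.
    possible = set(product(xs, ys, dirs))
    return len(observed), len(possible)

# Made-up sample: 2 x-values, 2 y-values and 2 directions allow 8 combinations.
sample = [(0, 0, 'EB'), (0, 0, 'WB'), (1, 1, 'EB'), (0, 0, 'EB')]
n_observed, n_possible = combo_coverage(sample)
print(n_observed, n_possible)
```

On the real data, passing `train[['x', 'y', 'direction']].itertuples(index=False, name=None)` should report the 65 observed combinations against the 96 possible ones.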

### Step2-1 - Time  
Records are taken every 20 minutes from the start time, so 13140 timestamps are expected.  
However, only 13059 distinct timestamps are present, so 81 are missing.  
These gaps would have to be handled in a time-series analysis. This notebook does not use time-series methods, so they are left as is.

In [None]:
import datetime
last_datetime = datetime.datetime.strptime('1991-09-30 11:40:00', '%Y-%m-%d %H:%M:%S')
first_datetime = datetime.datetime.strptime('1991-04-01 00:00:00', '%Y-%m-%d %H:%M:%S')
interval = datetime.timedelta(minutes=20)

exp_count = int((last_datetime - first_datetime) / interval + 1)
cur_count = train.time.nunique()

time_info = pd.DataFrame({'first datetime': [first_datetime], 'last datetime': [last_datetime], 'time interval': [interval]}, index=['value']).T
miss_table = pd.DataFrame({'expected count': [exp_count], 'actual row count': [cur_count], 'missing count': [exp_count - cur_count]}, index=['value']).T
multi_table([time_info, miss_table])
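The cell above only counts the gaps. To see which timestamps are missing, one option is to walk the expected 20-minute grid and collect every step that was not observed. The sketch below uses only the standard library; `missing_timestamps` and the one-hour example are hypothetical and made up for illustration.

```python
from datetime import datetime, timedelta

def missing_timestamps(observed, first, last, step):
    """Return the expected timestamps between first and last that are absent."""
    seen = set(observed)
    missing = []
    current = first
    while current <= last:
        if current not in seen:
            missing.append(current)
        current += step
    return missing

# Made-up example: one hour at 20-minute steps with the 00:20 reading absent.
first = datetime(1991, 4, 1, 0, 0)
last = datetime(1991, 4, 1, 1, 0)
observed = [
    datetime(1991, 4, 1, 0, 0),
    datetime(1991, 4, 1, 0, 40),
    datetime(1991, 4, 1, 1, 0),
]
gaps = missing_timestamps(observed, first, last, timedelta(minutes=20))
print(gaps)
```

Passing the unique values of `train.time` (converted to `datetime` objects) together with `first_datetime`, `last_datetime`, and `interval` from the cell above should return the 81 missing timestamps.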

### Step2-2 - X and Y  
x and y are stored as numbers, but they identify a roadway location, so they behave like nominal (categorical) variables.

In [None]:
f, ax = plt.subplots(1, 2, figsize=(15, 5))
for i, p in enumerate(['x', 'y']):
    # Pass the data by keyword; newer seaborn versions no longer accept it positionally.
    sns.countplot(x=train[p], ax=ax[i], edgecolor='black', linewidth=4)
    ax[i].spines[['top', 'right']].set_visible(False)
    ax[i].set_facecolor(BACKCOLOR)
    ax[i].set_xlabel(p, size=15)
    ax[i].set_ylabel('count', size=15)
    total_cnt = train[p].count()
    # Label each bar with its count and its share of all rows.
    for patch in ax[i].patches:
        x, height, width = patch.get_x(), patch.get_height(), patch.get_width()
        ax[i].text(x + width / 2, height + 5, f'{height:.0f} / {height / total_cnt * 100:2.2f}%', va='center', ha='center', size=8, bbox={'facecolor': 'white', 'boxstyle': 'round'})
f.suptitle('Count by X / Y', size=15)
plt.show()

### Step2-3 - Direction  

In [None]:
f, ax = plt.subplots(1, figsize=(15, 5))
sns.countplot(x=train['direction'], ax=ax, edgecolor='black', linewidth=4)
ax.spines[['top', 'right']].set_visible(False)
ax.set_facecolor(BACKCOLOR)
ax.set_xlabel('direction', size=15)
ax.set_ylabel('count', size=15)
# Take the total from the direction column itself, not from the leftover loop variable `p`.
total_cnt = train['direction'].count()
# Label each bar with its count and its share of all rows.
for patch in ax.patches:
    x, height, width = patch.get_x(), patch.get_height(), patch.get_width()
    ax.text(x + width / 2, height + 5, f'{height:.0f} / {height / total_cnt * 100:2.2f}%', va='center', ha='center', size=10, bbox={'facecolor': 'white', 'boxstyle': 'round'})
    
f.suptitle('Distribution of direction', size=15)
plt.show()

In [None]:
f, ax = plt.subplots(figsize=(10, 7))
# (dx, dy) offsets per direction: east is +x, north is +y.
dir_dict = {'EB': (1, 0), 'NB': (0, 1), 'SB': (0, -1), 'WB': (-1, 0), 'NE': (1, 1), 'SE': (1, -1), 'NW': (-1, 1), 'SW': (-1, -1)}
for _, x, y, d in train[['x', 'y', 'direction']].drop_duplicates().itertuples():
    dx, dy = dir_dict[d]
    dx, dy = dx/4, dy/4
    plt.plot([x, x+dx], [y, y+dy])
ax.spines[['top', 'right']].set_visible(False)
plt.xlabel('x')
plt.ylabel('y')
plt.title('directions', size=15)
plt.show()

### Step2-4 - Congestion  

In [None]:
f, ax = plt.subplots(1, 4, figsize=(35, 10))
sns.histplot(data=train, x='congestion', element='step', ax=ax[0])
# Pass the data by keyword; newer seaborn versions no longer accept it positionally.
sns.violinplot(x=train.congestion, edgecolor='black', linewidth=5, ax=ax[1])
sns.boxplot(x=train.congestion, ax=ax[2])
sns.stripplot(x=train.congestion, ax=ax[3])
for i in range(4):
    ax[i].spines[['top','right']].set_visible(False)
    ax[i].set_facecolor(BACKCOLOR)
f.suptitle("congestion's distribution", weight='bold', size=25)
plt.show()

## Feature Engineering  
For feature engineering, I referred to INVERSION's notebook.  
https://www.kaggle.com/inversion/tps-mar-22-cyclical-features

In [None]:
# train['year'] = train['time'].dt.year
# train['month'] = train['time'].dt.month
# train['day'] = train['time'].dt.day
train['hour'] = train['time'].dt.hour
train['minute'] = train['time'].dt.minute
train['weekday'] = train['time'].dt.weekday

# test['year'] = test['time'].dt.year
# test['month'] = test['time'].dt.month
# test['day'] = test['time'].dt.day
test['hour'] = test['time'].dt.hour
test['minute'] = test['time'].dt.minute
test['weekday'] = test['time'].dt.weekday

train = train.drop('time', axis='columns')
test = test.drop('time', axis='columns')
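The cell above keeps `hour`, `minute`, and `weekday` as plain integers, so hour 23 and hour 0 look far apart even though they are adjacent in time. A common remedy, and the idea the linked notebook's title refers to, is a cyclical encoding that maps each value onto the unit circle. The sketch below is a generic illustration, not a reproduction of that notebook; `cyclical_encode` and `circle_distance` are hypothetical helper names.

```python
from math import sin, cos, pi

def cyclical_encode(value, period):
    """Map a periodic value onto the unit circle as a (sin, cos) pair."""
    angle = 2 * pi * value / period
    return sin(angle), cos(angle)

def circle_distance(a, b, period):
    """Euclidean distance between two values after cyclical encoding."""
    sa, ca = cyclical_encode(a, period)
    sb, cb = cyclical_encode(b, period)
    return ((sa - sb) ** 2 + (ca - cb) ** 2) ** 0.5

# After encoding, 23:00 is as close to 00:00 as 00:00 is to 01:00.
print(circle_distance(23, 0, 24), circle_distance(0, 1, 24))
```

Applied to a DataFrame, the same idea would add two columns per feature, for example a sine and a cosine column for `hour` with a period of 24.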

In [None]:
train.dtypes

In [None]:
for col in train.columns[(train.dtypes == 'object') | (train.dtypes == 'category')]:
    le = LabelEncoder()
    train[col] = le.fit_transform(train[col])
    test[col] = le.transform(test[col])

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
numeric_cols = train.columns[(train.dtypes != 'object') & (train.columns != 'congestion') & (train.columns != 'direction')]
train[numeric_cols] = scaler.fit_transform(train[numeric_cols])
test[numeric_cols] = scaler.transform(test[numeric_cols])

In [None]:
X_train, y_train = train.drop(['congestion'], axis=1), train['congestion']

## Option: Model Selection  
Why did I use CatBoost?  
AutoML libraries make it quick and easy to find which machine learning algorithms perform well on the prepared dataset.  
  
The following code compares the performance of several models.

In [None]:
allow_pycaret = 0

if allow_pycaret:
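    # Cell magics such as %%capture are only valid on the first line of a cell,
    # so use pip's quiet flag to keep the install output short instead.
    !pip install -q pycaret[full]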
    
    from pycaret.regression import *
    
    model = setup(data = train, 
              target = 'congestion', 
              use_gpu = True,
              n_jobs = -1,
              silent = True,
              fold_shuffle = True,
              fold = 5
             )
    
    # The following code takes a lot of time.
    best_model = compare_models(exclude=['et'])

I used pycaret to produce the table below. Random Forest has the lowest MAE (6.2035), but its training time (TT) is about three times CatBoost's (22.1 s vs. 7.4 s), so I chose CatBoost, which has the second-lowest MAE (6.4813).

|Model|MAE|MSE|RMSE|R2|RMSLE|MAPE|TT(Sec)|
|------|---|---|---|---|---|---|---|
|Random Forest Regressor|6.2035|77.1851|8.7855|0.7266|0.2455|0.1676|22.0880|
|CatBoost Regressor|6.4813|83.3279|9.1284|0.7048|0.2570|0.1780|7.3560|
|Extreme Gradient Boosting|6.5138|83.6836|9.1479|0.7036|0.2572|0.1787|1.5180|
|Light Gradient Boosting Machine|7.0354|93.8601|9.6881|0.6675|0.2694|0.1947|3.7400|
|Decision Tree Regressor|8.0557|140.3261|11.8459|0.5029|0.3304|0.2067|3.9780|
|K Neighbors Regressor|9.2048|141.1331|11.8799|0.5001|0.3112|0.2520|1.7700|
|Gradient Boosting Regressor|9.5307|150.3428|12.2612|0.4674|0.3204|0.2623|61.6240|
|Bayesian Ridge|11.7967|214.2355|14.6368|0.2411|0.3666|0.3248|1.5120|
|Least Angle Regression|11.7968|214.2355|14.6368|0.2411|0.3666|0.3248|0.3480|
|Ridge Regression|11.7968|214.2355|14.6368|0.2411|0.3666|0.3248|0.1180|
|Linear Regression|11.7962|214.2554|14.6375|0.2411|0.3666|0.3248|0.1120|
|Huber Regressor|11.7607|214.8679|14.6584|0.2389|0.3659|0.3218|38.9500|
|AdaBoost Regressor|12.6995|238.2440|15.4349|0.1561|0.3998|0.3761|29.6600|
|Orthogonal Matching Pursuit|12.9429|253.2295|15.9132|0.1030|0.3956|0.3635|0.2920|
|Elastic Net|13.4782|268.1863|16.3764|0.0500|0.4112|0.3831|0.1180|
|Lasso Regression|13.6573|274.6520|16.5726|0.0271|0.4153|0.3887|0.1160|
|Lasso Least Angle Regression|13.8705|282.3189|16.8023|-0.0000|0.4207|0.3965|0.2920|
|Dummy Regressor|13.8705|282.3189|16.8023|-0.0000|0.4207|0.3965|0.0960|
|Passive Aggressive Regressor|15.8143|391.0209|19.4711|-0.3857|0.4649|0.4301|1.5940|

## Optimize model  
I used Optuna for a simple hyperparameter search.

In [None]:
def objective(trial):
    params = {
        "random_state":trial.suggest_categorical("random_state", [2022]),
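        # suggest_loguniform is deprecated in Optuna 3; suggest_float(..., log=True) is the equivalent.
        'learning_rate' : trial.suggest_float('learning_rate', 0.0001, 0.3, log=True),
        'bagging_temperature' :trial.suggest_float('bagging_temperature', 0.01, 100.00, log=True),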
        "n_estimators": 1000,
        "max_depth":trial.suggest_int("max_depth", 4, 16),
        'random_strength' :trial.suggest_int('random_strength', 0, 100),
        "l2_leaf_reg":trial.suggest_float("l2_leaf_reg",1e-8,3e-5),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
        "max_bin": trial.suggest_int("max_bin", 200, 500),
        'od_type': trial.suggest_categorical('od_type', ['IncToDec', 'Iter']),
        'task_type': trial.suggest_categorical('task_type', ['GPU']),
        'loss_function': trial.suggest_categorical('loss_function', ['MAE']),
        'eval_metric': trial.suggest_categorical('eval_metric', ['MAE'])
    }

    model = CatBoostRegressor(**params)
    X_train_tmp, X_valid_tmp, y_train_tmp, y_valid_tmp = train_test_split(X_train, y_train, test_size=0.3, random_state=42)
    model.fit(
        X_train_tmp, y_train_tmp,
        eval_set=[(X_valid_tmp, y_valid_tmp)],
        early_stopping_rounds=35, verbose=0
    )
        
    y_train_pred = model.predict(X_train_tmp)
    y_valid_pred = model.predict(X_valid_tmp)
    train_mae = mae(y_train_tmp, y_train_pred)
    valid_mae = mae(y_valid_tmp, y_valid_pred)
    
    print(f'MAE of Train: {train_mae}')
    print(f'MAE of Validation: {valid_mae}')
    
    return valid_mae

In [None]:
allow_optimize = 1

In [None]:
TRIALS = 100
TIMEOUT = 3600


if allow_optimize:
    sampler = TPESampler(seed=42)

    study = optuna.create_study(
        study_name = 'cat_parameter_opt',
        direction = 'minimize',
        sampler = sampler,
    )
    # Stop at whichever limit is reached first: TRIALS trials or TIMEOUT seconds.
    study.optimize(objective, n_trials=TRIALS, timeout=TIMEOUT)
    print("Best Score:",study.best_value)
    print("Best trial",study.best_trial.params)
    
    best_params = study.best_params
    
    X_train_tmp, X_valid_tmp, y_train_tmp, y_valid_tmp = train_test_split(X_train, y_train, test_size=0.3, random_state=42)
    model_tmp = CatBoostRegressor(**best_params, n_estimators=30000, verbose=1000).fit(X_train_tmp, y_train_tmp, eval_set=[(X_valid_tmp, y_valid_tmp)], early_stopping_rounds=35)

## Training model

In [None]:
if allow_optimize:
    # get_best_iteration() is zero-based, so add 1 to get the number of trees.
    model = CatBoostRegressor(**best_params, n_estimators=model_tmp.get_best_iteration() + 1, verbose=1000).fit(X_train, y_train)
else:
    model = CatBoostRegressor(
        verbose=1000,
        early_stopping_rounds=10,
        random_seed=2022,
        max_depth=12,
        task_type='GPU',
        learning_rate=0.035,
        iterations=30000,
        loss_function='MAE',
        eval_metric= 'MAE'
    ).fit(X_train, y_train)

## Interpretation  

In [None]:
optuna.visualization.plot_optimization_history(study)

In [None]:
optuna.visualization.plot_slice(study)

## Predict

In [None]:
y_pred = model.predict(test)
submission['congestion'] = y_pred

In [None]:
submission.to_csv('submission.csv', index=False)
submission = pd.read_csv("submission.csv")
submission