# Tabular Playground Series - Aug 21

This month's data consists of 100 feature variables (`f0` to `f99`) and a target variable, `loss`. We first perform some basic EDA to get a better look at the data, and then start working on our models.

## Plan

This is the plan we will follow. It is not set in stone and I might change it as we move through the notebook, but it shows how I approach datasets like this one.

- *Memory Reduction*
- *Sampling to Reduce Training Time*
- *EDA*
- *Model Development*
- *Hyperparameter Tuning*
- *Feature Importance from top models*
- *Selecting the best Model*

## Imports 

Let's import the libraries we will be using throughout the notebook.

In [None]:
# Data Import on Kaggle
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Importing processing libraries
import numpy as np
import pandas as pd

# Importing Visualisation libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Importing libraries for the metrics
from sklearn import metrics
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, GridSearchCV, KFold

# Importing libraries for the model
import xgboost as xgb 
import lightgbm as lgb
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

In [None]:
data = pd.read_csv('../input/tabular-playground-series-aug-2021/train.csv')
test_data = pd.read_csv('../input/tabular-playground-series-aug-2021/test.csv')

## Memory Reduction

Here we look at the memory consumed by the data as a whole and by each feature, and then try to reduce it by downcasting every numeric column to the smallest dtype that can hold its values.

In [None]:
memory_usage = data.memory_usage(deep=True) / 1024 ** 2
print('memory usage of features: \n', memory_usage.head(7))
print('memory usage sum: ',memory_usage.sum())

In [None]:
def reduce_memory_usage(df, verbose=True):
    numerics = ["int8", "int16", "int32", "int64", "float16", "float32", "float64"]
    start_mem = df.memory_usage().sum() / 1024 ** 2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == "int":
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if (
                    c_min > np.finfo(np.float16).min
                    and c_max < np.finfo(np.float16).max
                ):
                    df[col] = df[col].astype(np.float16)
                elif (
                    c_min > np.finfo(np.float32).min
                    and c_max < np.finfo(np.float32).max
                ):
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024 ** 2
    if verbose:
        print(
            "Mem. usage decreased to {:.2f} Mb ({:.1f}% reduction)".format(
                end_mem, 100 * (start_mem - end_mem) / start_mem
            )
        )
    return df

reduced_df = reduce_memory_usage(data, verbose=True)

In [None]:
reduced_df.describe()
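Downcasting saves memory, but the narrower float types also round the values. The sketch below illustrates this trade-off on a small synthetic frame (the `toy` data and its column names are made up for illustration and are not the competition data). It measures both the memory saved and the rounding error introduced by `float32` and `float16`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real features.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "f_float": rng.random(10_000),                  # float64 values in [0, 1)
    "f_int": rng.integers(0, 100, size=10_000),     # int64 values in [0, 100)
})

def mem_mb(df):
    """Total memory of a DataFrame in megabytes."""
    return df.memory_usage(deep=True).sum() / 1024 ** 2

# Downcast to narrower dtypes that still cover the value ranges.
slim = toy.astype({"f_float": np.float32, "f_int": np.int8})
reduction = 100 * (1 - mem_mb(slim) / mem_mb(toy))

# Largest absolute rounding error caused by each float width.
err32 = (toy["f_float"] - slim["f_float"].astype(np.float64)).abs().max()
err16 = (toy["f_float"] - toy["f_float"].astype(np.float16).astype(np.float64)).abs().max()

print(f"memory reduction: {reduction:.1f}%")
print(f"max rounding error float32: {err32:.2e}, float16: {err16:.2e}")
```

If the `float16` error looks too large for a feature, keeping that column at `float32` is a reasonable compromise.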

## Sampling Data

Now that we have reduced the memory usage by over 70%, let's sample the data to cut the model training time. A uniform random sample approximately preserves the distribution of each feature while keeping only 20% of the rows. We can then perform EDA, modelling, hyperparameter tuning and the other steps on this sample.

Once we decide on the model we want to use, we can train the final model on the entire dataset again.

In [None]:
# Draw a reproducible 20% random sample and drop the row identifier
sample_df = reduced_df.sample(frac=0.2, random_state=42)
sample_df = sample_df.drop(['id'], axis=1)

sample_df.shape

In [None]:
# Let's confirm if the sampling is retaining the feature distributions

fig, ax = plt.subplots(figsize=(6, 4))

sns.histplot(
    data=reduced_df, x="f6", label="Original data", color="red", alpha=0.3, bins=15
)
sns.histplot(
    data=sample_df, x="f6", label="Sample data", color="green", alpha=0.3, bins=15
)

plt.legend()
plt.show();
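A histogram of one feature is a quick visual check. As a complement, here is a minimal sketch that compares a few quantiles of every column between the full data and a 20% sample. It runs on synthetic data (`full` and its columns are hypothetical stand-ins), so the numbers say nothing about the competition data itself.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the full dataset.
rng = np.random.default_rng(42)
full = pd.DataFrame({
    "f_skewed": rng.exponential(scale=2.0, size=50_000),
    "f_normal": rng.normal(loc=5.0, scale=1.5, size=50_000),
})

# Same sampling scheme as in the notebook: 20% of the rows, uniformly at random.
sample = full.sample(frac=0.2, random_state=42)

quantiles = [0.05, 0.25, 0.5, 0.75, 0.95]
full_q = full.quantile(quantiles)
sample_q = sample.quantile(quantiles)

# Side-by-side table of quantiles for a visual comparison.
comparison = pd.concat({"full": full_q, "sample": sample_q}, axis=1)

# Largest quantile gap, expressed in standard deviations of each feature.
max_gap = ((full_q - sample_q).abs() / full.std()).max().max()

print(comparison.round(3))
print(f"largest quantile gap (in standard deviations): {max_gap:.3f}")
```

A small gap for every column suggests the sample is a fair stand-in for the full data.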

## EDA

Let's start looking at any correlations that might exist among the features.
We will also be looking at the densities of every feature.

In [None]:
f, ax = plt.subplots(figsize=(8, 6))
corr = reduced_df.iloc[:,:20].corr()
# `np.bool` was removed from NumPy; the built-in `bool` is the replacement
sns.heatmap(corr, mask=np.zeros_like(corr, dtype=bool), cmap=sns.diverging_palette(220, 10, as_cmap=True),
            square=True, ax=ax)
plt.show()
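With many features, a heatmap can be hard to read, so it may help to list the strongest correlations directly. The following is a sketch on synthetic data (the `toy` frame and the relationship between `f0` and `f1` are assumptions made for the example). It masks the redundant upper triangle of the correlation matrix and ranks the remaining pairs.

```python
import numpy as np
import pandas as pd

# Hypothetical toy features; f1 is built to be strongly tied to f0.
rng = np.random.default_rng(0)
base = rng.normal(size=2_000)
toy = pd.DataFrame({
    "f0": base,
    "f1": base + rng.normal(scale=0.1, size=2_000),
    "f2": rng.normal(size=2_000),
    "f3": rng.normal(size=2_000),
})

corr = toy.corr()

# True on and above the diagonal: those cells duplicate the lower triangle.
upper = np.triu(np.ones_like(corr, dtype=bool))

# Keep the lower triangle only, then rank pairs by absolute correlation.
pairs = (
    corr.mask(upper)
    .stack()
    .dropna()
    .sort_values(key=np.abs, ascending=False)
)

print(pairs.head(3))
```

The same `upper` array can be passed as `mask=` to `sns.heatmap` to hide the duplicated half of the plot.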

In [None]:
sns.set_style("white")
fig = plt.figure(figsize=(15, 50))

# The first 100 columns are the features; 'loss' is the last column
feature_cols = sample_df.columns.tolist()[:100]

for i, col in enumerate(feature_cols):
    plt.subplot(20, 5, i + 1)
    plt.title(col, size=12, fontname='monospace')
    # `fill` replaces the deprecated `shade` argument of kdeplot
    a = sns.kdeplot(sample_df[col], color='#1a5d57', fill=True, alpha=0.9, linewidth=1.5, edgecolor='black')
    plt.ylabel('')
    plt.xlabel('')
    plt.xticks(fontname='monospace')
    plt.yticks([])
    for j in ['right', 'left', 'top']:
        a.spines[j].set_visible(False)
    a.spines['bottom'].set_linewidth(1.2)
        
fig.tight_layout(h_pad = 3)

plt.show()

## Model Selection

In this section, we use some statistical methods and regressions to find significant features and possible interactions between them. For this, we test ANOVA, linear regression and GAM and look at the results.

Following that, we train some basic models on the data to make predictions and see which ones to move forward with.
We will test SVM, XGBoost, LightGBM and Random Forest.

In [None]:
# The results from the tests were not useful so I've deleted them.
# I have kept my code for ANOVA below if you want to refer to it.



# import statsmodels.api as sm
# from statsmodels.formula.api import ols

# all_columns = "+".join(sample_df.columns[:-1])
# my_formula = "loss~" + all_columns

# mod = ols(formula=my_formula,
#                 data=sample_df, family=sm.families.Gaussian()).fit()
                
# aov_table = sm.stats.anova_lm(mod, typ=2)
# print(aov_table)

### Train - Test Split

Let's split our sample into train and test sets.

In [None]:
x = sample_df.drop(['loss'], axis=1)
y = sample_df.loss

x_train,x_test,y_train,y_test = train_test_split(x, y, test_size=0.33, random_state=42)

### Scaling

Here we scale the training data to the range 0 to 1. This has no effect on most of our models because they are tree-based, but it is needed for the Support Vector Machine (SVM).

In [None]:
scaler = MinMaxScaler()
x_scaled = scaler.fit_transform(x_train)

In [None]:
x_scaled
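A scaler that is fitted on the training data has to be applied to the test data with the same minimum and maximum before predicting. One way to keep the two steps together is a scikit-learn pipeline. Below is a minimal sketch on synthetic data (`X`, `y` and the linear relationship are assumptions made for illustration), not the exact setup used in this notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

# Hypothetical regression data on a scale far from [0, 1].
rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=20, size=(500, 5))
y = 0.1 * X[:, 0] + rng.normal(scale=1.0, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=42)

# The pipeline fits the scaler on the training rows only and reuses
# the learned min/max when transforming the test rows in predict().
svm = make_pipeline(MinMaxScaler(), SVR())
svm.fit(X_tr, y_tr)

rmse = np.sqrt(mean_squared_error(y_te, svm.predict(X_te)))
print(f"SVR test RMSE: {rmse:.3f}")
```

Because the scaler lives inside the pipeline, there is no separate scaled array to keep in sync with the raw one.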

### Initial Model Training

In [None]:
model_dict = {
#     'Random Forest Regressor': RandomForestRegressor(random_state=0, verbose=10),
#     'Gradient Boosting Regressor': GradientBoostingRegressor(random_state=0, verbose=10),
#     'Support Vector Machine': SVR(),
#     'Decision Tree': DecisionTreeRegressor(random_state=0),
    # XGBoost's scikit-learn wrapper takes `verbosity` (0-3), not `verbose`
    'XGB': xgb.XGBRegressor(random_state=0, verbosity=1),
    'Light GBM': lgb.LGBMRegressor(random_state=0, verbose=10)
}
model_list = []
train_rmse_list = []
test_rmse_list = []

for model, clf in model_dict.items():
    clf.fit(x_train, y_train)

    # RMSE on the held-out rows
    test_preds = clf.predict(x_test)
    test_rmse = np.sqrt(mean_squared_error(y_test, test_preds))

    # RMSE on the training rows, to spot overfitting
    train_pred = clf.predict(x_train)
    train_rmse = np.sqrt(mean_squared_error(y_train, train_pred))

    model_list.append(model)
    train_rmse_list.append(train_rmse)
    test_rmse_list.append(test_rmse)
    print('{} training'.format(model), 'completed')

results = pd.DataFrame({"model": model_list, "train_rmse": train_rmse_list, "test_rmse": test_rmse_list})


In [None]:
results
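An RMSE is easier to judge next to a trivial reference. As a sketch, the cell below compares a model against a baseline that always predicts the training mean. It uses synthetic data and scikit-learn's `GradientBoostingRegressor` as a stand-in (both are assumptions for the example, not the models or data scored above).

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical regression data with one informative feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 5))
y = 3 * X[:, 0] + rng.normal(scale=1.0, size=1_000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=42)

candidates = {
    "Mean baseline": DummyRegressor(strategy="mean"),
    "Gradient boosting (stand-in)": GradientBoostingRegressor(random_state=0),
}

# Fit each candidate and record its RMSE on the held-out rows.
scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    scores[name] = float(np.sqrt(mean_squared_error(y_te, model.predict(X_te))))

for name, rmse in scores.items():
    print(f"{name}: test RMSE = {rmse:.3f}")
```

A model that barely beats the mean baseline is learning very little from the features, whatever its absolute RMSE.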

### Initial Model Selection

Now that we've trained our first batch of models on default parameters, we can eliminate a few which don't do well.

The XGBoost and LightGBM models performed the best, so we will keep those and perform hyperparameter tuning on them.



## Hyperparameter Tuning

In this step, we tune the hyperparameters of our XGBoost and LightGBM models. We start by running GridSearchCV on both models over a range of parameters and select the best-performing combination based on RMSE.
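One detail that is easy to trip over: with `scoring='neg_root_mean_squared_error'`, GridSearchCV reports the negated RMSE, so the best score is the one closest to zero and always negative. Here is a small sketch on synthetic data, with scikit-learn's `GradientBoostingRegressor` as a stand-in estimator and a made-up grid (neither is the configuration searched in this notebook).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Hypothetical regression data.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=300)

# A deliberately tiny, illustrative grid.
param_grid = {"learning_rate": [0.05, 0.1], "max_depth": [2, 3]}

search = GridSearchCV(
    GradientBoostingRegressor(n_estimators=50, random_state=0),
    param_grid=param_grid,
    scoring="neg_root_mean_squared_error",
    cv=3,
)
search.fit(X, y)

# Flip the sign to get the RMSE back on its usual scale.
best_rmse = -search.best_score_

print("best params:", search.best_params_)
print(f"best_score_: {search.best_score_:.4f} -> RMSE: {best_rmse:.4f}")
```

scikit-learn negates error metrics so that every scorer follows the same "higher is better" convention.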

### XGBoost

The parameters we set for grid search were:

- learning_rate: 0.003, 0.008
- max_depth: 3, 5, 7
- n_estimators: 500, 1000, 2500

and the top-performing parameters after GridSearchCV were:
- learning_rate: 0.003
- max_depth: 7
- n_estimators: 2500

with a best score of -7.915772914886475. The scorer is `neg_root_mean_squared_error`, so this corresponds to an RMSE of about 7.92.

In [None]:
params = {
    'learning_rate': [0.003, 0.008],
    'subsample': [0.84],
    'booster': ['gbtree'],
    'tree_method': ['gpu_hist'],
    'colsample_bytree': [0.70],
    'max_depth': [7],
    'n_estimators': [2500],
}

xgb_estimator = xgb.XGBRegressor(random_state=42)
grid = GridSearchCV(xgb_estimator, param_grid=params, scoring='neg_root_mean_squared_error', cv=5, verbose=100)
xgb_model = grid.fit(x_scaled, y_train)

print(xgb_model.best_params_, xgb_model.best_score_)


In [None]:
xgb_model = xgb.XGBRegressor(random_state=42, booster='gbtree', colsample_bytree= 0.7, learning_rate= 0.003, max_depth=7, n_estimators=2500, subsample= 0.84, tree_method= 'gpu_hist')
xgb_model.fit(x_train, y_train)
oof_pred1 = xgb_model.predict(x_test)
oof_pred1 = np.clip(oof_pred1, y.min(), y.max())

print(f'RMSE: {np.sqrt(mean_squared_error(y_test, oof_pred1))}')

### LightGBM

The parameters we set for grid search were:

- learning_rate: 0.003, 0.009
- max_depth: -1, 3, 5
- n_estimators: 500, 1000
- num_leaves: 28, 31, 50, 75

and the top-performing parameters were:
- learning_rate: 0.003
- max_depth: -1
- n_estimators: 1000
- num_leaves: 50

with a best score of -7.9347, which is the negated RMSE (an RMSE of about 7.93).

In [None]:
params = {
    'num_leaves': [50],
    'learning_rate': [0.003],
    'max_depth': [-1],
    'n_estimators': [1000],
}

lgb_estimator = lgb.LGBMRegressor(random_state=42)

grid = GridSearchCV(lgb_estimator, param_grid=params, scoring='neg_root_mean_squared_error', cv=5, verbose=100)
lgb_model = grid.fit(x_scaled, y_train)

print(lgb_model.best_params_, lgb_model.best_score_)


In [None]:
lgb_model = lgb.LGBMRegressor(learning_rate=0.003, max_depth=-1, n_estimators=1000, num_leaves=50, random_state=42)
lgb_model.fit(x_train, y_train)

oof_pred1 = lgb_model.predict(x_test)
oof_pred1 = np.clip(oof_pred1, y.min(), y.max())

print(f'RMSE: {np.sqrt(mean_squared_error(y_test, oof_pred1))}')

## Feature Importance

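Let's take a look at feature importance for both our models. The two sets of values are min-max scaled so they can share one axis. Note that the libraries measure importance differently by default (split counts for LightGBM, gain for XGBoost), so it is safer to compare rankings than raw values.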

In [None]:
from sklearn.preprocessing import minmax_scale

a1 = lgb_model.feature_importances_
a2 = xgb_model.feature_importances_

axis_x  = x_train.columns.values
axis_y1 = minmax_scale(a1)
axis_y2 = minmax_scale(a2)

# The 'seaborn-whitegrid' matplotlib style name is gone in newer releases; use seaborn directly
sns.set_style('whitegrid')
plt.figure(figsize=(16, 6))
plt.title('XGBoost vs Light GBM Feature Importances', fontsize=12)

plt.scatter(axis_x, axis_y1, s=20, label='Light GBM') 
plt.scatter(axis_x, axis_y2, s=20, label='XGBoost')

plt.legend(fontsize=12, loc=2)
plt.show()
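A scatter plot of 100 features can be hard to read, so it may help to put both importance vectors in one table and check how well the rankings agree. The sketch below does this with two scikit-learn models as stand-ins (`RandomForestRegressor` and `GradientBoostingRegressor`, chosen only so the example runs without extra libraries) on synthetic data where `f0` and `f1` are the informative features by construction.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.preprocessing import minmax_scale

# Hypothetical data: only f0 and f1 drive the target.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(800, 6)), columns=[f"f{i}" for i in range(6)])
y = 4 * X["f0"] + 2 * X["f1"] + rng.normal(scale=0.5, size=800)

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
gb = GradientBoostingRegressor(random_state=0).fit(X, y)

# Scale both importance vectors to [0, 1] so they are comparable in range.
importances = pd.DataFrame(
    {
        "random_forest": minmax_scale(rf.feature_importances_),
        "gradient_boosting": minmax_scale(gb.feature_importances_),
    },
    index=X.columns,
)

# Spearman rank correlation, computed as the Pearson correlation of the ranks.
rank_agreement = importances.rank().corr().iloc[0, 1]

print(importances.sort_values("random_forest", ascending=False).round(3))
print(f"rank agreement between the two models: {rank_agreement:.2f}")
```

Features that rank highly in both columns are the safest candidates to call important.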

## Best Fit

Now that we have tuned our two top models, XGB and LGBM, we can take a closer look at them and pick the model to use. Here we refit the tuned LightGBM model on the non-sampled data, holding out 33% of it to check the RMSE.

In [None]:
# Let's first take the non-sampled data

reduced_df = reduced_df.drop('id', axis=1)
x_final = reduced_df.drop('loss', axis=1)
y_final = reduced_df.loss

x_train,x_test,y_train,y_test = train_test_split(x_final, y_final, test_size=0.33, random_state=42)

In [None]:
lgb_model = lgb.LGBMRegressor(learning_rate=0.003, max_depth=-1, n_estimators=1000, num_leaves=50, random_state=42)
lgb_model.fit(x_train, y_train)

oof_pred1 = lgb_model.predict(x_test)
# Clip to the target range of the full data (y_final), not of the earlier sample
oof_pred1 = np.clip(oof_pred1, y_final.min(), y_final.max())

print(f'RMSE: {np.sqrt(mean_squared_error(y_test, oof_pred1))}')
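The score above comes from a single hold-out split. A more stable estimate uses K-fold cross-validation, where every row gets a prediction from a model that never saw it (true out-of-fold predictions). Here is a minimal sketch with `KFold`, using synthetic data and scikit-learn's `GradientBoostingRegressor` as a stand-in for the tuned model (both are assumptions for the example).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Hypothetical regression data.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = 3 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.5, size=600)

oof = np.zeros(len(y))
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

for train_idx, valid_idx in kfold.split(X):
    model = GradientBoostingRegressor(n_estimators=100, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    preds = model.predict(X[valid_idx])
    # Clip to the target range seen in this fold's training rows only.
    oof[valid_idx] = np.clip(preds, y[train_idx].min(), y[train_idx].max())

# One RMSE over all rows, each predicted by a model that did not train on it.
oof_rmse = np.sqrt(mean_squared_error(y, oof))
print(f"out-of-fold RMSE: {oof_rmse:.3f}")
```

The trade-off is training time: five folds mean five fits, which is one more reason to tune on the sample first.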

In [None]:
final_preds = lgb_model.predict(test_data.drop('id', axis=1))
new_df = pd.DataFrame({'id': test_data['id'], 'loss': final_preds})

# Submission
new_df.to_csv("submission2.csv",index=False)