# TPS Feb 2021: Handling Multimodally Distributed Features

# Table of Contents
* [Importing Libraries](#section-one)
* [Reading the data files](#section-two)
* [Exploring the data](#section-three)
* [Exploratory Data Analysis (EDA)](#section-four)
    - [Scaling](#subsection-fourone)
    - [Correlation Check](#subsection-fourtwo)
    - [Outlier Treatment](#subsection-fourthree)
* [Feature Engineering](#section-five)
* [Modeling](#section-six)
    - [LGBM Hyperparameter Tuning using Optuna](#subsection-sixone)

<a id="section-one"></a>
# Importing Libraries

In [None]:
#Importing Required Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge, Lasso, LinearRegression
from lightgbm import LGBMRegressor
from xgboost.sklearn import XGBRegressor
from sklearn.svm import SVR
from sklearn.metrics import f1_score, confusion_matrix, classification_report
from sklearn.preprocessing import LabelEncoder

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold

from sklearn.mixture import GaussianMixture

import warnings
warnings.filterwarnings("ignore")

pd.set_option('display.max_columns', None)

sns.set_palette("muted")

<a id="section-two"></a>
# Reading the data files

In [None]:
#Reading the data files (Change the paths if running on google colab)

train = pd.read_csv('../input/tabular-playground-series-feb-2021/train.csv')
test = pd.read_csv('../input/tabular-playground-series-feb-2021/test.csv')
sample = pd.read_csv('../input/tabular-playground-series-feb-2021/sample_submission.csv')

<a id="section-three"></a>
# Exploring the data

In [None]:
print(f'Shape of train data: {train.shape}')
print(f'Missing values count: {train.isna().sum().sum()}')

train.head()

In [None]:
train.info()
print('\n')
train.nunique()

* Training data has 300000 records and 26 features.
* Column 'id' is the primary key.
* It's a regression problem, since we need to predict the 'target' feature, which is continuous.
* There are 14 numerical features, which are already scaled, and 10 categorical features in the data.
* There are no missing values in the data.

In [None]:
print(f'Shape of test data: {test.shape}')
print(f'Missing values count: {test.isna().sum().sum()}')

test.head()

In [None]:
test.info()
print('\n')
test.nunique()

* Test data has 200000 records and 25 features. The 'target' feature is absent, as expected.
* Column 'id' is the primary key.
* There are 14 numerical features, which are already scaled, and 10 categorical features in the data.
* There are no missing values in the data.

In [None]:
sample.head()

* We need to submit the predicted target value for each id in the test data.

<a id="section-four"></a>
# Exploratory Data Analysis (EDA)

In [None]:
train = train.set_index('id')
test = test.set_index('id')

In [None]:
#Checking if there is any difference between the behaviour of train and test data
train.describe() - test.describe()

The summary statistics of train and test differ only marginally for every feature. This is a good sign: a validation set drawn from the train data should be representative of the test data.
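To turn that visual check into one number per feature, here is a minimal sketch that reports the largest absolute gap between the two `describe()` tables. The helper name `describe_gap` and the toy frames are assumptions made for illustration; they are not part of the competition data.

```python
import numpy as np
import pandas as pd

def describe_gap(train_df, test_df):
    """Largest absolute gap per shared numeric column between the two describe() tables."""
    shared = train_df.select_dtypes("number").columns.intersection(test_df.columns)
    diff = train_df[shared].describe() - test_df[shared].describe()
    # 'count' only reflects the number of rows, so it is left out of the comparison
    return diff.drop(index="count").abs().max()

# Toy frames standing in for train/test (hypothetical values)
rng = np.random.default_rng(0)
toy_train = pd.DataFrame({"cont1": rng.random(1000), "target": rng.random(1000)})
toy_test = pd.DataFrame({"cont1": rng.random(800)})

gaps = describe_gap(toy_train, toy_test)
print(gaps)
```

On the real data this would be called as `describe_gap(train, test)`. Columns that exist only in train, such as 'target', are skipped automatically.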

In [None]:
num_columns = train.select_dtypes(exclude=['object']).columns
num_columns = [i for i in num_columns if i != 'target']

cat_columns = train.select_dtypes(include=['object']).columns

In [None]:
#Let's check the distribution of target variable

sns.distplot(train['target'], kde=True, bins=120, label="Skew: %.2f"%(train['target'].skew()))
plt.xlabel('Target', fontsize=12); plt.legend()

The distribution of the target variable is bimodal.

**Continuous Features**

In [None]:
# Checking the distribution of continuous features

i = 1
fig, ax = plt.subplots(4, 4, figsize=(14, 14))

for feature in num_columns:
    plt.subplot(4, 4, i)
    sns.distplot(train[feature], kde=True, bins=120, label="Skew: %.2f"%(train[feature].skew()))
    plt.xlabel(feature, fontsize=9); plt.legend(loc="best")
    i += 1

fig.tight_layout()

fig.delaxes(ax[3,2])
fig.delaxes(ax[3,3])

plt.show()

* No feature is highly skewed.
* All continuous features are multimodal in nature.

In [None]:
#Scatterplot for continuous features
fig, ax = plt.subplots(5, 3, figsize=(24, 30))
for i, feature in enumerate(num_columns):
    plt.subplot(5, 3, i+1)
    sns.scatterplot(x=feature, 
                    y="target", 
                    data=train, s = 1)
    plt.xlabel(feature, fontsize=12)

fig.delaxes(ax[4,2])
plt.show()

* We can observe some clusters in these scatter plots.
* The 'cont1' feature has some clearly defined clusters.
* We should try a clustering approach in the feature engineering section.

**Categorical Features**

In [None]:
train.head()

In [None]:
# Checking the distribution of categorical features

i = 1
fig, ax = plt.subplots(3, 4, figsize=(15,12))

for feature in cat_columns:
    plt.subplot(3, 4, i)
    sns.histplot(x=feature, data=train)
    plt.xlabel(feature, fontsize = 9)
    i += 1

fig.suptitle('Distribution of Categorical Features')
plt.tight_layout()

fig.delaxes(ax[2,2])
fig.delaxes(ax[2,3])

plt.show()

* In several features, a few categories dominate the rest. Features in which a single category covers almost all records carry little information and are unlikely to help the models much.
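To quantify how dominant the top category is, a small sketch like the following could be used. The helper `top_category_share` and the toy columns are hypothetical names chosen for this example.

```python
import pandas as pd

def top_category_share(df, columns):
    """Share of rows taken by the most frequent category in each column."""
    return pd.Series(
        {col: df[col].value_counts(normalize=True).iloc[0] for col in columns}
    ).sort_values(ascending=False)

toy = pd.DataFrame({
    "cat_a": ["A"] * 9 + ["B"],   # one category holds 90% of the rows
    "cat_b": ["A", "B"] * 5,      # perfectly balanced
})
shares = top_category_share(toy, ["cat_a", "cat_b"])
print(shares)
```

On the real data the call would be `top_category_share(train, cat_columns)`. Any cut-off for calling a feature "dominated" is a judgment call and should be checked against validation scores.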

<a id="subsection-fourone"></a>
### Scaling

All continuous features are already scaled in the dataset.

<a id="subsection-fourtwo"></a>
### Correlation Check

In [None]:
#Let's check how the features are inter-related to each other and with target variable
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12, 10))
ax.set_title("Correlation Matrix", fontsize=16)

corr = train[num_columns + ['target']].corr().abs()
mask = np.triu(np.ones_like(corr, dtype=bool))  # np.bool was removed in NumPy 1.24

sns.heatmap(corr, mask=mask, annot=True, fmt=".2f", cmap='coolwarm',
            cbar_kws={"shrink": .8}, vmin=0, vmax=1)

# Tick.label is no longer available in recent Matplotlib releases; tick_params sets size and rotation directly
ax.tick_params(axis='x', labelsize=12, labelrotation=90)
ax.tick_params(axis='y', labelsize=12, labelrotation=0)
    
plt.show()

* No pair of features is highly correlated.
* No single feature is strongly correlated with the target.
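Besides reading the heatmap, the strongest pairs can be listed directly. This is a minimal sketch on toy data; `strongest_pairs` is a hypothetical helper and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def strongest_pairs(df, top=5):
    """Return the `top` column pairs with the largest absolute Pearson correlation."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so that each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return upper.stack().dropna().sort_values(ascending=False).head(top)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=500),  # nearly a copy of "a"
    "c": rng.normal(size=500),            # independent noise
})
pairs = strongest_pairs(toy)
print(pairs)
```

For this notebook the input would be `train[num_columns + ['target']]`.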

<a id="subsection-fourthree"></a>
### Outlier Treatment

In [None]:
#Checking for mild outliers (beyond 1.5 * IQR) in the numeric columns
num_data = train.select_dtypes(exclude=['object'])
Q1_train = num_data.quantile(0.25)
Q3_train = num_data.quantile(0.75)
IQR_train = Q3_train - Q1_train

((num_data < Q1_train - 1.5*IQR_train) | (num_data > Q3_train + 1.5*IQR_train)).agg(['sum', 'mean', 'count'])

In [None]:
#Checking for extreme outliers (beyond 3 * IQR) in the numeric columns
num_data = train.select_dtypes(exclude=['object'])
Q1_train = num_data.quantile(0.25)
Q3_train = num_data.quantile(0.75)
IQR_train = Q3_train - Q1_train

((num_data < Q1_train - 3*IQR_train) | (num_data > Q3_train + 3*IQR_train)).agg(['sum', 'mean', 'count'])

The target has a few extreme outliers, and 'cont7' and 'cont10' have some mild outliers.

Let's remove the records with outlying target values and replace the outliers in 'cont7' and 'cont10' with the median of the respective feature.
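As a quick illustration of the IQR rule used in the checks above, here is a minimal sketch on a toy series. The helper `iqr_fences` and the toy values are assumptions made for this example.

```python
import pandas as pd

def iqr_fences(values, k=1.5):
    """Lower and upper fences Q1 - k*IQR and Q3 + k*IQR (k=1.5 mild, k=3 extreme)."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0])
low, high = iqr_fences(toy, k=1.5)   # Q1 = 3.25, Q3 = 7.75, so the fences are -3.5 and 14.5

# Values outside the fences are replaced by the median (5.5 here); everything else is kept
cleaned = toy.where(toy.between(low, high), toy.median())
print(low, high)
print(cleaned.tolist())
```

Only the value 30.0 falls outside the fences, so it is the only one replaced.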

In [None]:
# Removing records with extreme outliers in target variable
low = (Q1_train - 3*IQR_train)['target']
high = (Q3_train + 3*IQR_train)['target']  # the upper fence is built from Q3, not Q1
train = train[train['target'].between(low, high)]

Removed 3 records.

In [None]:
train_num = train.select_dtypes(exclude=['object'])

In [None]:
#Replacing outliers with median value

def replace_outliers(data, k=1.5):
    # k=1.5 is the mild-outlier rule, matching the outliers found in 'cont7' and 'cont10'
    data = data.copy()  # work on a copy so the caller's frame is not modified in place
    for col in data.columns:
        Q1 = data[col].quantile(0.25)
        Q3 = data[col].quantile(0.75)
        IQR = Q3 - Q1
        median_ = data[col].median()

        data.loc[(data[col] < Q1 - k*IQR) | (data[col] > Q3 + k*IQR), col] = median_
    return data

feature_cols = train_num.drop('target', axis = 1).columns
train[feature_cols] = replace_outliers(train[feature_cols])

In [None]:
#Checking the distribution of target variable again
sns.distplot(train['target'], kde=True, bins=120, label='train')
plt.xlabel('Target', fontsize=9); plt.legend()

<a id="section-five"></a>
# Feature Engineering

#### Continuous Features

In [None]:
#Choosing the number of mixture components per feature from the scatterplots above and clustering each feature with a Gaussian Mixture Model

inits = [4,11,8,6,6,6,4,8,8,9,8,5,8,9]
gmms = []
for feature, init in zip(num_columns, inits):
    X_ = train[feature].to_numpy().reshape(-1, 1)
    gmm_ = GaussianMixture(n_components=init).fit(X_)
    gmms.append(gmm_)
    # Label train and test with the same model, fitted on train only
    train[f'{feature}_gmm'] = gmm_.predict(X_)
    test[f'{feature}_gmm'] = gmm_.predict(test[feature].to_numpy().reshape(-1, 1))

In [None]:
#Plotting scatterplot with clusters

fig, ax = plt.subplots(5, 3, figsize=(24, 30))
for i, feature in enumerate(num_columns):
    plt.subplot(5, 3, i+1)
    sns.scatterplot(x=feature, 
                    y="target", 
                    data=train, 
                    hue=f'{feature}_gmm', s = 1, palette='muted')
    
    plt.xlabel(feature, fontsize=12)
    
fig.delaxes(ax[4,2])
plt.show()

In [None]:
#Let's plot the histograms as well with the clusters
fig, ax = plt.subplots(5, 3, figsize=(24, 30))
for i, feature in enumerate(num_columns):
    plt.subplot(5, 3, i+1)
    sns.histplot(x=feature, 
                 data=train[::100], 
                 hue=f'{feature}_gmm', 
                 kde=True, 
                 bins=100, 
                 palette='muted')
    plt.xlabel(feature, fontsize=9)
    
fig.delaxes(ax[4,2])
plt.show()

* We can see how well the Gaussian mixture model identifies these clusters. The cluster labels should give our models a useful extra signal on this data.
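The component counts above were chosen by eye. As a possible cross-check (not what this notebook did), the count can be picked by the Bayesian information criterion (BIC), which `GaussianMixture` exposes through its `bic` method. The helper `pick_n_components` and the toy data are assumptions made for this sketch.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def pick_n_components(x, candidates, seed=0):
    """Return the candidate component count with the lowest BIC on 1-D data x."""
    X_ = np.asarray(x).reshape(-1, 1)
    bics = [
        GaussianMixture(n_components=k, random_state=seed).fit(X_).bic(X_)
        for k in candidates
    ]
    return candidates[int(np.argmin(bics))]

# Toy bimodal feature: two well-separated bumps
rng = np.random.default_rng(0)
toy = np.concatenate([rng.normal(0.2, 0.02, 500), rng.normal(0.8, 0.02, 500)])

candidates = [1, 2, 3, 4]
best_k = pick_n_components(toy, candidates)
print(best_k)
```

On the real features, `pick_n_components(train['cont1'], list(range(2, 13)))` would be one way to sanity-check the first entry of `inits`. Fitting many mixtures on 300000 rows takes a while, so subsampling is a reasonable shortcut.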

#### Categorical Features

In [None]:
#Applying label encoding on the categorical features

for feature in cat_columns:
    le = LabelEncoder()
    le.fit(train[feature])
    train[feature] = le.transform(train[feature])
    test[feature] = le.transform(test[feature])
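`LabelEncoder.transform` raises an error if the test data contains a category that never appears in train. That does not seem to happen here, but as a defensive alternative, a minimal sketch with scikit-learn's `OrdinalEncoder` (version 0.24 or later) maps unseen categories to a sentinel value instead. The toy frames below are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

toy_train = pd.DataFrame({"cat0": ["A", "B", "A"], "cat1": ["X", "Y", "Y"]})
toy_test = pd.DataFrame({"cat0": ["B", "C"], "cat1": ["X", "X"]})  # "C" never appears in train

# Unseen categories are mapped to -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = encoder.fit_transform(toy_train)
test_codes = encoder.transform(toy_test)
print(test_codes)
```

Tree-based models such as LightGBM can treat -1 as just another integer code, so no further handling is needed for them.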

<a id="section-six"></a>
# Modeling

Let's try different ML models and see which performs best.

In [None]:
train = train.reset_index(drop = True)

In [None]:
#Separating the target variable from the features ('id' was already dropped together with the index)
y = train['target']
X = train.drop(['target'], axis = 1)

In [None]:
# Splitting the train data in 80:20 ratio.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)

In [None]:
model_names = ["Linear",  "Lasso", "Ridge", "Decision Tree", "LGBM", "Random Forest", "XGBoost"]

models = [
    LinearRegression(fit_intercept=True),
    Lasso(fit_intercept=True),
    Ridge(fit_intercept=True),
    DecisionTreeRegressor(),
    LGBMRegressor(),
    RandomForestRegressor(n_estimators = 10, max_depth = 50),
    XGBRegressor()]

for name, model in zip(model_names, models):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    score = np.sqrt(mean_squared_error(y_test, y_pred))  # RMSE; the squared=False argument is gone in recent scikit-learn
    print(f'{name}: RMSE: {score}')
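A single 80:20 split gives only one RMSE estimate per model. As a hedged alternative, k-fold cross-validation averages over several splits. This minimal sketch uses `cross_val_score` on synthetic data; the toy dataset and the choice of `Ridge` are assumptions made to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for X and y
X_toy, y_toy = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=40)
# scikit-learn maximises scores, so RMSE is exposed as a negative value
neg_rmse = cross_val_score(Ridge(), X_toy, y_toy, cv=cv,
                           scoring="neg_root_mean_squared_error")
rmse_per_fold = -neg_rmse
print(f"RMSE: {rmse_per_fold.mean():.3f} +/- {rmse_per_fold.std():.3f}")
```

The same call works with any of the models in the list above, for example `cross_val_score(LGBMRegressor(), X, y, cv=cv, scoring="neg_root_mean_squared_error")`.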

Best performing model: LightGBM. It fits the data much better than the other models. Let's submit its predictions for the test data.

In [None]:
X_train.columns.symmetric_difference(test.columns)

In [None]:
train.shape, test.shape

In [None]:
test = test.reset_index(drop = True)

In [None]:
model = LGBMRegressor()
model.fit(X_train, y_train)
sample['target'] = model.predict(test.drop('id', axis = 1, errors = 'ignore'))
sample.to_csv('lgbm.csv', index = False)

Great! We got a leaderboard score of 0.85081.

Since the LGBM model shows good potential, let's dive into tuning its hyperparameters.

<a id="subsection-sixone"></a>
## LGBM Hyperparameter Tuning using Optuna

In [None]:
## Install optuna library
# !pip install optuna

In [None]:
#Importing optuna library
import optuna

In [None]:
#Function for hyperparameter tuning using optuna
import lightgbm

def objective(trial,data=X,target=y):
    
    train_x, test_x, train_y, test_y = train_test_split(data, target, test_size=0.2,random_state=42)
    param = {
        'metric': 'rmse', 
        'random_state': 48,
        'n_estimators': 2000,
        # suggest_float(..., log=True) replaces the deprecated suggest_loguniform
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-3, 10.0, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-3, 10.0, log=True),
        'colsample_bytree': trial.suggest_categorical('colsample_bytree', [0.3,0.4,0.5,0.6,0.7,0.8,0.9, 1.0]),
        'subsample': trial.suggest_categorical('subsample', [0.4,0.5,0.6,0.7,0.8,1.0]),
        'learning_rate': trial.suggest_categorical('learning_rate', [0.006,0.008,0.01,0.014,0.017,0.02]),
        'max_depth': trial.suggest_categorical('max_depth', [10,20,100]),
        # LightGBM requires num_leaves > 1
        'num_leaves' : trial.suggest_int('num_leaves', 2, 1000),
        'min_child_samples': trial.suggest_int('min_child_samples', 1, 300),
        # The trial name matches the LightGBM parameter, so study.best_params can be passed straight to LGBMRegressor
        'cat_smooth' : trial.suggest_int('cat_smooth', 1, 100)
    }
    model = LGBMRegressor(**param)  
    # Early stopping goes through a callback; the fit keywords early_stopping_rounds and verbose were removed in LightGBM 4.0
    model.fit(train_x,train_y,eval_set=[(test_x,test_y)],callbacks=[lightgbm.early_stopping(100, verbose=False)])
    preds = model.predict(test_x)
    
    rmse = np.sqrt(mean_squared_error(test_y, preds))
    
    return rmse

In [None]:
#Hyperparameter tuning to minimize the RMSE for predictions

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=10)
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)

In [None]:
#Checking the best set of hyperparameters

print(f"\tBest value (rmse): {study.best_value:.5f}")
print(f"\tBest params:")

for key, value in study.best_params.items():
    print(f"\t\t{key}: {value}")

In [None]:
#Adding some additional parameters

params=study.best_params   
params['random_state'] = 48
params['n_estimators'] = 2000
params['metric'] = 'rmse'

In [None]:
#Training LGBM with best set of hyperparameters

model = LGBMRegressor(**params)
model.fit(X, y)
sample['target'] = model.predict(test.drop('id', axis = 1, errors = 'ignore'))
sample.to_csv('submission.csv', index = False)

Awesome! We got a leaderboard score of 0.84583 after tuning the LGBM Regressor.

However, it can be improved further by stacking the models together.
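Stacking was not tried in this notebook. As a starting point, here is a minimal sketch with scikit-learn's `StackingRegressor` on synthetic data. The base models, the meta-model and the toy dataset are all assumptions chosen to keep the example self-contained. In practice the tuned LGBM would be one of the base estimators.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for X and y
X_toy, y_toy = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, test_size=0.2, random_state=40)

stack = StackingRegressor(
    estimators=[
        ("ridge", Ridge()),
        ("tree", DecisionTreeRegressor(max_depth=5, random_state=0)),
        ("forest", RandomForestRegressor(n_estimators=20, random_state=0)),
    ],
    final_estimator=Ridge(),  # meta-model fitted on out-of-fold base predictions
    cv=5,
)
stack.fit(X_tr, y_tr)
val_preds = stack.predict(X_va)
print(val_preds[:5])
```

Whether stacking actually beats the tuned LGBM on this competition would have to be verified on a validation split or on the leaderboard.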

# The End!

Thank you for reading this notebook. I have learnt a lot from this exercise and hope you have learnt something too.
Please share feedback if you find any flaws or have a better approach.

Please upvote the notebook if you liked it!

Thank you!