# E-Commerce Shipping - Classification

## Contents:

- Data Description & Cleaning
- Exploratory Data Analysis (EDA)
    * Categorical Features
    * Numerical Features
    * Target Column
- Outliers
    * Log Transformation
    * Square Root Transformation
    * Winsorization
- Heatmap
- One-Hot-Encoding
- Scaling
    * Normalization
    * Standardization
- Building Machine Learning Models
    * Logistic Regression
    * KNN
    * Decision Trees
    * Random Forest
    * AdaBoost
    * Gradient Boosting
    * Extra Trees
    * CatBoost
    * Support Vector Machines
    * XGBoost
    * LightGBM
- Hyperparameter Tuning
    * Logistic Regression
    * KNN
    * Decision Trees
    * Random Forest
    * AdaBoost
    * Gradient Boosting
    * Extra Trees
    * CatBoost
    * Support Vector Machines
    * XGBoost
    * LightGBM
- Best Parameters & Comparison
- Classification with Artificial Neural Networks (ANNs)

## Context
An international e-commerce company wants to discover key insights from its customer database, using advanced machine learning techniques to study its customers. The company sells electronic products.

## Columns
The dataset used for model building contained 10999 observations of 12 variables.
The data contains the following information:

- **ID**: Customer ID number.
- **Warehouse block**: The company has a large warehouse that is divided into blocks such as A, B, C, D, E.
- **Mode of shipment**: The company ships products in multiple ways: Ship, Flight, and Road.
- **Customer care calls**: The number of calls made to enquire about the shipment.
- **Customer rating**: The rating given by each customer. 1 is the lowest (worst), 5 is the highest (best).
- **Cost of the product**: Cost of the product in US dollars.
- **Prior purchases**: The number of prior purchases.
- **Product importance**: The company categorizes each product's importance as low, medium, or high.
- **Gender**: Male or Female.
- **Discount offered**: Discount offered on that specific product.
- **Weight in gms**: Weight of the product in grams.
- **Reached on time**: The target variable, where 1 indicates that the product has NOT reached on time and 0 indicates that it has reached on time.
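
Because 1 marks a late delivery, the target coding is easy to misread. The following is a minimal sketch, on a hypothetical five-row sample rather than the real data, of how the raw 0/1 codes relate to readable labels and to a one-hot indicator. The column prefix `Reached_on_Time` is an illustrative choice.

```python
import pandas as pd

# Hypothetical five-row sample of the raw target column.
raw = pd.Series([1, 0, 1, 1, 0], name="Reached.on.Time_Y.N")

# 1 means the product did NOT arrive on time, so it maps to 'No'.
labels = raw.map({1: "No", 0: "Yes"})

# One-hot encoding with drop_first=True keeps only the 'Yes' indicator.
encoded = pd.get_dummies(labels, prefix="Reached_on_Time", drop_first=True)
on_time_flags = encoded["Reached_on_Time_Yes"].astype(int).tolist()

print(encoded.columns.tolist())
print(on_time_flags)
```

With this encoding the positive class is "reached on time", so metrics such as precision and recall computed on the indicator describe on-time deliveries, not late ones.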

\
**Data Source:** https://www.kaggle.com/prachi13/customer-analytics

# Data Description & Cleaning

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

from scipy.stats.mstats import winsorize

from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier
from catboost import CatBoostClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

from sklearn.model_selection import GridSearchCV

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

import warnings
warnings.filterwarnings("ignore")
import time

In [None]:
df = pd.read_csv('/kaggle/input/customer-analytics/Train.csv')

df.head()

In [None]:
df.describe(include='all')

In [None]:
df.info()

In [None]:
df.drop('ID', axis=1, inplace=True)
df.rename({'Reached.on.Time_Y.N':'Reached_on_Time'}, axis=1, inplace=True)
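# Assign the result back: an in-place replace on a selected column may not update df in recent pandas versions.
df['Reached_on_Time'] = df['Reached_on_Time'].replace({1: 'No', 0: 'Yes'})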

In [None]:
print("Percentage of Null Values:\n")
print(df.isna().sum()*100/df.shape[0])

In [None]:
print("# of Unique Values: \n")
print(df.nunique())

In [None]:
print("Unique Values:\n")
# List the distinct values of every low-cardinality column (fewer than 10 unique values).
for column in df.columns:
    if df[column].nunique() < 10:
        print("- ", column, ": ", sorted(df[column].unique()), sep='')

In [None]:
print("Columns:")
for column in df.columns:
    print("- {}".format(column))

# Exploratory Data Analysis (EDA)

## Categorical Features

In [None]:
plt.figure(figsize=(18, 18))

plt.subplot(3, 3, 1)
sns.countplot(x='Warehouse_block', data=df)
plt.title('Warehouse Block', fontsize=15)

plt.subplot(3, 3, 2)
sns.countplot(x='Mode_of_Shipment', data=df)
plt.title('Mode of Shipment', fontsize=15)

plt.subplot(3, 3, 3)
sns.countplot(x='Customer_care_calls', data=df)
plt.title('Customer Care Calls', fontsize=15)

plt.subplot(3, 3, 4)
sns.countplot(x='Customer_rating', data=df)
plt.title('Customer Rating', fontsize=15)

plt.subplot(3, 3, 5)
sns.countplot(x='Prior_purchases', data=df)
plt.title('Prior Purchases', fontsize=15)

plt.subplot(3, 3, 6)
sns.countplot(x='Product_importance', data=df)
plt.title('Product Importance', fontsize=15)

plt.subplot(3, 3, 7)
sns.countplot(x='Gender', data=df)
plt.title('Gender', fontsize=15)

plt.show()

In [None]:
plt.figure(figsize=(18, 18))

plt.subplot(3, 3, 1)
sns.countplot(x='Warehouse_block', hue='Reached_on_Time', data=df)
plt.title('Warehouse Block', fontsize=15)

plt.subplot(3, 3, 2)
sns.countplot(x='Mode_of_Shipment', hue='Reached_on_Time', data=df)
plt.title('Mode of Shipment', fontsize=15)

plt.subplot(3, 3, 3)
sns.countplot(x='Customer_care_calls', hue='Reached_on_Time',  data=df)
plt.title('Customer Care Calls', fontsize=15)

plt.subplot(3, 3, 4)
sns.countplot(x='Customer_rating', hue='Reached_on_Time',  data=df)
plt.title('Customer Rating', fontsize=15)

plt.subplot(3, 3, 5)
sns.countplot(x='Prior_purchases', hue='Reached_on_Time',  data=df)
plt.title('Prior Purchases', fontsize=15)

plt.subplot(3, 3, 6)
sns.countplot(x='Product_importance', hue='Reached_on_Time',  data=df)
plt.title('Product Importance', fontsize=15)

plt.subplot(3, 3, 7)
sns.countplot(x='Gender', hue='Reached_on_Time',  data=df)
plt.title('Gender', fontsize=15)

plt.show()

## Numerical Features

In [None]:
plt.figure(figsize=(12, 8))

plt.subplot(2, 3, 1)
plt.hist(df['Cost_of_the_Product'], bins=20)
plt.title('Cost of the Product')

plt.subplot(2, 3, 2)
plt.hist(df['Discount_offered'], bins=20)
plt.title('Discount Offered')

plt.subplot(2, 3, 3)
plt.hist(df['Weight_in_gms'], bins=20)
plt.title('Weight in gms')

plt.subplot(2, 3, 4)
plt.boxplot(df['Cost_of_the_Product'])
plt.title('Cost of the Product')

plt.subplot(2, 3, 5)
plt.boxplot(df['Discount_offered'])
plt.title('Discount Offered')

plt.subplot(2, 3, 6)
plt.boxplot(df['Weight_in_gms'])
plt.title('Weight in gms')

plt.show()

In [None]:
plt.figure(figsize=(18, 12))

plt.subplot(2, 3, 1)
plt.scatter(df['Cost_of_the_Product'], df['Reached_on_Time'], s=5)
plt.title("Cost of the Product vs Reached on Time", fontsize=15)

plt.subplot(2, 3, 2)
plt.scatter(df['Discount_offered'], df['Reached_on_Time'], s=5)
plt.title("Discount offered vs Reached on Time", fontsize=15)

plt.subplot(2, 3, 3)
plt.scatter(df['Weight_in_gms'], df['Reached_on_Time'], s=5)
plt.title("Weight in gms vs Reached On Time", fontsize=15)

plt.subplot(2, 3, 4)
sns.violinplot(x='Reached_on_Time', y='Cost_of_the_Product', data=df)
plt.title("Cost of the Product vs Reached on Time", fontsize=15)

plt.subplot(2, 3, 5)
sns.violinplot(x='Reached_on_Time', y='Discount_offered', data=df)
plt.title("Discount offered vs Reached on Time", fontsize=15)

plt.subplot(2, 3, 6)
sns.violinplot(x='Reached_on_Time', y='Weight_in_gms', data=df)
plt.title("Weight in gms vs Reached On Time", fontsize=15)

plt.show()

## Target Column

In [None]:
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.countplot(x='Reached_on_Time', data=df)
plt.title('Reached on Time', fontsize=15)

plt.subplot(1, 2, 2)
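# Take the labels from value_counts() itself so they always match the slice order.
plt.pie(df['Reached_on_Time'].value_counts(), labels=df['Reached_on_Time'].value_counts().index, explode=[0.05, 0.05], autopct='%1.2f%%', shadow=True)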
plt.title('Reached on Time', fontsize=15)

plt.show()

# Outliers

According to the boxplots created earlier, the only column with outliers appears to be **Discount_offered**. However, the boxplot rule (points more than 1.5 × IQR beyond the quartiles) flags a large number of observations in this feature. Therefore, instead of directly removing or winsorizing those outliers, I will first apply a **log transformation** and a **square root transformation** to see which works better. Then, I will **winsorize** the remaining outliers.
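
To make the plan above concrete, here is a minimal sketch on synthetic data. The right-skewed sample, the helper name `count_iqr_outliers`, and the 15% winsorization limits are illustrative assumptions. The log transform also assumes strictly positive values.

```python
import numpy as np
from scipy.stats.mstats import winsorize

def count_iqr_outliers(values):
    """Count points outside the 1.5 * IQR whiskers of a standard boxplot."""
    q1, q3 = np.quantile(values, [0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((values < lower) | (values > upper)).sum())

# Hypothetical right-skewed, strictly positive sample standing in for a discount-like column.
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=2.0, sigma=1.0, size=1000)

raw_count = count_iqr_outliers(skewed)
sqrt_count = count_iqr_outliers(np.sqrt(skewed))
log_count = count_iqr_outliers(np.log(skewed))

# Winsorizing clips the lowest and highest 15% of values to the nearest retained value.
clipped = np.asarray(winsorize(np.log(skewed), limits=(0.15, 0.15)))
win_count = count_iqr_outliers(clipped)

print(raw_count, sqrt_count, log_count, win_count)
```

On skewed data like this, the transformations shrink the long right tail, so fewer points fall beyond the whiskers, and winsorizing then clips whatever remains.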

In [None]:
def sum_outliers(X):
    """Count outliers using the 1.5 * IQR whisker rule that matplotlib boxplots apply by default."""
    IQR = np.quantile(X, q=0.75) - np.quantile(X, q=0.25)
    upper_whisker = np.quantile(X, q=0.75) + (IQR * 1.5)
    lower_whisker = np.quantile(X, q=0.25) - (IQR * 1.5)
    return (X > upper_whisker).sum() + (X < lower_whisker).sum()

In [None]:
plt.figure(figsize=(24, 12))

plt.subplot(2, 4, 1)
plt.boxplot(df['Discount_offered'])
plt.title('Discount Offered')

plt.subplot(2, 4, 2)
plt.boxplot(np.log(df['Discount_offered']))
plt.title('Discount Offered (Log Transformation)')

plt.subplot(2, 4, 3)
plt.boxplot(np.sqrt(df['Discount_offered']))
plt.title('Discount Offered (Square Root Transformation)')

plt.subplot(2, 4, 4)
plt.boxplot(winsorize(np.log(df['Discount_offered']),limits=(0.15, 0.15)));
plt.title('Discount Offered (Log Transformation & Winsorized)')

plt.subplot(2, 4, 5)
plt.hist(df['Discount_offered'], bins=20)
plt.title('Discount Offered')

plt.subplot(2, 4, 6)
plt.hist(np.log(df['Discount_offered']), bins=20)
plt.title('Discount Offered (Log Transformation)')

plt.subplot(2, 4, 7)
plt.hist(np.sqrt(df['Discount_offered']), bins=20)
plt.title('Discount Offered (Square Root Transformation)')

plt.subplot(2, 4, 8)
plt.hist(winsorize(np.log(df['Discount_offered']),limits=(0.15, 0.15)), bins=20);
plt.title('Discount Offered (Log Transformation & Winsorized)')


plt.show()

In [None]:
print("Total number of observations: {}".format(len(df['Discount_offered'])))
print("Number of outliers in 'Discount_offered': {}".format(sum_outliers(df['Discount_offered'])))
print("Number of outliers in 'Discount_offered' (Log Transformation): {}".format(sum_outliers(np.log(df['Discount_offered']))))
print("Number of outliers in 'Discount_offered' (Square Root Transformation): {}".format(sum_outliers(np.sqrt(df['Discount_offered']))))
print("Number of outliers in 'Discount_offered' (Log Transformation & Winsorized): {}".format(sum_outliers(winsorize(np.log(df['Discount_offered']),limits=(0.15, 0.15)))))

In [None]:
df['Discount_offered'] = np.array(winsorize(np.log(df['Discount_offered']),limits=(0.15, 0.15)))

# Heatmap

In [None]:
plt.figure(figsize=(12, 12))
sns.heatmap(pd.get_dummies(df, drop_first=True).corr(), annot=True, fmt='.3f')
plt.show()

In [None]:
pd.get_dummies(df, drop_first=False).corr()['Reached_on_Time_Yes'].sort_values(ascending=False)

- **Weight** and **Cost** are _positively_ correlated with the target variable (`Reached_on_Time_Yes`), while the amount of **Discount** is _negatively_ correlated with it.
- Overall, there does not seem to be a **multicollinearity** problem.
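
One way to check the multicollinearity remark without reading the heatmap by eye is to list feature pairs whose absolute correlation exceeds a cutoff. This is a minimal sketch on a hypothetical toy frame. The helper `high_corr_pairs` and the 0.8 threshold are illustrative assumptions, not a fixed rule.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame, threshold=0.8):
    """Return (col_a, col_b, r) for column pairs whose |Pearson r| exceeds threshold."""
    corr = frame.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], round(float(r), 3)))
    return pairs

# Hypothetical toy frame: 'b' is nearly a copy of 'a', 'c' is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=500),
    "c": rng.normal(size=500),
})

pairs = high_corr_pairs(toy)
print(pairs)
```

In the notebook, the same helper could be called on the one-hot-encoded features (without the target column); an empty list would support the observation above.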

# One-Hot-Encoding

In [None]:
df = pd.get_dummies(df, drop_first=True)

In [None]:
X = df.drop('Reached_on_Time_Yes', axis=1)
y = df['Reached_on_Time_Yes']

# Scaling

## Normalization

In [None]:
# Note: Normalizer rescales each row (sample) to unit norm; it does not rescale columns.
normalizer = Normalizer()
X_normalized = pd.DataFrame(normalizer.fit_transform(X), columns=X.columns)

## Standardization

In [None]:
# StandardScaler rescales each column (feature) to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
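
The two transformers used in this section act along different axes, which is easy to overlook. A minimal sketch on a hypothetical 3 × 2 array (the values are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

X_toy = np.array([[3.0, 4.0],
                  [6.0, 8.0],
                  [1.0, 0.0]])

row_scaled = Normalizer().fit_transform(X_toy)      # each ROW gets unit L2 norm
col_scaled = StandardScaler().fit_transform(X_toy)  # each COLUMN gets mean 0 and std 1

row_norms = np.linalg.norm(row_scaled, axis=1)
print(row_norms)
print(col_scaled.mean(axis=0), col_scaled.std(axis=0))
```

Note that the first two rows, which differ only in magnitude, become identical after `Normalizer`. If column-wise rescaling to a fixed range is what is wanted, `MinMaxScaler` from the same module is the usual choice.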

# Building Machine Learning Models

- Logistic Regression
- KNN
- Decision Trees
- Random Forest
- AdaBoost
- Gradient Boosting
- Extra Trees
- CatBoost
- Support Vector Machines
- XGBoost
- LightGBM

In [None]:
def fit_predict_score(Model, X_train, y_train, X_test, y_test):
    """Fit the given model, predict on the test data, and return classification metrics."""
    model = Model
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    return train_score, test_score, precision_score(y_test, y_pred), recall_score(y_test, y_pred), f1_score(y_test, y_pred)

def model_comparison(X, y):
    """Creates a DataFrame comparing Logistic Regression, K-Nearest Neighbors, Decision Tree,
    Random Forest, AdaBoost, Gradient Boosting, Extra Trees, CatBoost, Support Vector Machines,
    XGBoost, and LightGBM."""
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
    
    lr_train_score, lr_test_score, lr_pr, lr_re, lr_f1 = fit_predict_score(LogisticRegression(), X_train, y_train, X_test, y_test)
    knn_train_score, knn_test_score, knn_pr, knn_re, knn_f1 = fit_predict_score(KNeighborsClassifier(), X_train, y_train, X_test, y_test)
    dtc_train_score, dtc_test_score, dtc_pr, dtc_re, dtc_f1 = fit_predict_score(DecisionTreeClassifier(), X_train, y_train, X_test, y_test)
    rfc_train_score, rfc_test_score, rfc_pr, rfc_re, rfc_f1 = fit_predict_score(RandomForestClassifier(), X_train, y_train, X_test, y_test)
    ada_train_score, ada_test_score, ada_pr, ada_re, ada_f1 = fit_predict_score(AdaBoostClassifier(), X_train, y_train, X_test, y_test)
    gbc_train_score, gbc_test_score, gbc_pr, gbc_re, gbc_f1 = fit_predict_score(GradientBoostingClassifier(), X_train, y_train, X_test, y_test)
    xtc_train_score, xtc_test_score, xtc_pr, xtc_re, xtc_f1 = fit_predict_score(ExtraTreesClassifier(), X_train, y_train, X_test, y_test)
    cbc_train_score, cbc_test_score, cbc_pr, cbc_re, cbc_f1 = fit_predict_score(CatBoostClassifier(verbose=0), X_train, y_train, X_test, y_test)
    svc_train_score, svc_test_score, svc_pr, svc_re, svc_f1 = fit_predict_score(SVC(), X_train, y_train, X_test, y_test)
    xgbc_train_score, xgbc_test_score, xgbc_pr, xgbc_re, xgbc_f1 = fit_predict_score(XGBClassifier(verbosity=0), X_train, y_train, X_test, y_test)
    lgbc_train_score, lgbc_test_score, lgbc_pr, lgbc_re, lgbc_f1 = fit_predict_score(LGBMClassifier(), X_train, y_train, X_test, y_test)
    
    models = ['Logistic Regression', 'K-Nearest Neighbors', 'Decision Tree', 'Random Forest', 'AdaBoost',
              'Gradient Boosting', 'Extra Trees', 'CatBoost', 'Support Vector Machines', 'XGBoost', 'LightGBM']
    train_score = [lr_train_score, knn_train_score, dtc_train_score, rfc_train_score, ada_train_score,
                   gbc_train_score, xtc_train_score, cbc_train_score, svc_train_score, xgbc_train_score, lgbc_train_score]
    test_score = [lr_test_score, knn_test_score, dtc_test_score, rfc_test_score, ada_test_score,
                  gbc_test_score, xtc_test_score, cbc_test_score, svc_test_score, xgbc_test_score, lgbc_test_score]
    precision = [lr_pr, knn_pr, dtc_pr, rfc_pr, ada_pr, gbc_pr, xtc_pr, cbc_pr, svc_pr, xgbc_pr, lgbc_pr]
    recall = [lr_re, knn_re, dtc_re, rfc_re, ada_re, gbc_re, xtc_re, cbc_re, svc_re, xgbc_re, lgbc_re]
    f1 = [lr_f1, knn_f1, dtc_f1, rfc_f1, ada_f1, gbc_f1, xtc_f1, cbc_f1, svc_f1, xgbc_f1, lgbc_f1]
    
    model_comparison = pd.DataFrame(data=[models, train_score, test_score, precision, recall, f1]).T.rename({0: 'Model',
                                                                                                             1:'Training Score',
                                                                                                             2: 'Test Score (Accuracy)',
                                                                                                             3: 'Precision',
                                                                                                             4: 'Recall',
                                                                                                             5: 'F1 Score'
                                                                                                            }, axis=1)
    
    return model_comparison
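
The comparison above can also be written as a loop over a dictionary of models, which avoids one set of variables per model. This is a minimal sketch, not the notebook's own implementation. The function name `compare_models` and the synthetic data from `make_classification` are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def compare_models(models, X, y, test_size=0.33, random_state=42):
    """Fit each named model on one shared split and tabulate its metrics."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state)
    rows = []
    for name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        rows.append({
            "Model": name,
            "Training Score": model.score(X_train, y_train),
            "Test Score (Accuracy)": model.score(X_test, y_test),
            "Precision": precision_score(y_test, y_pred),
            "Recall": recall_score(y_test, y_pred),
            "F1 Score": f1_score(y_test, y_pred),
        })
    return pd.DataFrame(rows)

# Synthetic stand-in data; the notebook would pass its own X and y instead.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
demo_models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}
results = compare_models(demo_models, X_demo, y_demo)
print(results)
```

Adding or removing a model then only requires editing the dictionary.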

In [None]:
print("Default DataFrame:")
display(model_comparison(X, y))
print('-'*40)
print("\nNormalized DataFrame:")
display(model_comparison(X_normalized, y))
print('-'*40)
print("\nStandardized DataFrame:")
display(model_comparison(X_scaled, y))

As expected, normalizing or standardizing the data did not significantly improve the performance of the classification models. I will proceed with the unscaled data.
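
One plausible reason for this result is that most of the models compared are tree-based. Trees split on feature thresholds, so a positive rescaling of a feature generally leaves their predictions unchanged, whereas distance-based models such as KNN can be sensitive to scale. The sketch below illustrates the tree case on synthetic data; the data and the inflated first feature are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)
X_demo[:, 0] *= 1000.0  # put one feature on a much larger scale

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train_std, X_test_std = scaler.transform(X_train), scaler.transform(X_test)

tree_raw = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
tree_std = DecisionTreeClassifier(random_state=0).fit(X_train_std, y_train)

# Share of test rows on which the two trees make the same prediction.
agreement = np.mean(tree_raw.predict(X_test) == tree_std.predict(X_test_std))
print(f"Share of identical test predictions: {agreement:.3f}")
```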

# Hyperparameter Tuning

## Logistic Regression

In [None]:
start = time.time()

params = {"C": [10 ** x for x in range (-5, 5, 1)],
          "penalty": ['l1', 'l2']}

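# The default 'lbfgs' solver does not support the 'l1' penalty; 'liblinear' handles both 'l1' and 'l2'.
lr_grid = GridSearchCV(estimator=LogisticRegression(solver='liblinear'),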
                       param_grid = params,
                       cv = 5,
                       verbose = 0)

lr_grid.fit(X, y)

end = time.time()

In [None]:
print("GridSearchCV Runtime: {} minutes".format(round((end - start) / 60, 2)))
print("Best Parameters : ", lr_grid.best_params_)

## K-Nearest Neighbors

In [None]:
start = time.time()

params = {
    "n_neighbors": [1, 3, 5, 10, 15, 30, 50],
    "weights": ['uniform', 'distance'],
    "metric": ['minkowski', 'euclidean', 'manhattan']
}

knn_grid = GridSearchCV(estimator=KNeighborsClassifier(),
                        param_grid = params,
                        cv = 5,
                        verbose = 0)

knn_grid.fit(X, y)

end = time.time()

In [None]:
print("GridSearchCV Runtime: {} minutes".format(round((end - start) / 60, 2)))
print("Best Parameters : ", knn_grid.best_params_)

## Decision Tree

In [None]:
start = time.time()

params = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [i for i in range(1, 10)],
    'min_samples_split': [i for i in range(2, 10)],  # an integer min_samples_split must be at least 2
    'min_samples_leaf': [i for i in range(1, 5)]
}

dtc_grid = GridSearchCV(estimator = DecisionTreeClassifier(),
                        param_grid = params,
                        cv = 5,
                        verbose = 0)

dtc_grid.fit(X, y)

end = time.time()

In [None]:
print("GridSearchCV Runtime: {} minutes".format(round((end - start) / 60, 2)))
print("Best Parameters : ", dtc_grid.best_params_)

## Random Forest

In [None]:
start = time.time()

params = {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_depth' : [4, 5, 6, 7, 8, 10, 15],
    'criterion' :['gini', 'entropy']
}

rfc_grid = GridSearchCV(estimator = RandomForestClassifier(),
                        param_grid = params,
                        cv = 5,
                        verbose = 0)

rfc_grid.fit(X, y)

end = time.time()

In [None]:
print("GridSearchCV Runtime: {} minutes".format(round((end - start) / 60, 2)))
print("Best Parameters : ", rfc_grid.best_params_)

## AdaBoost

In [None]:
start = time.time()

params = {
    'n_estimators': [10, 50, 100, 200, 500],
    'learning_rate': [0.1, 0.3, 0.5, 0.7]
}

ada_grid = GridSearchCV(estimator = AdaBoostClassifier(),
                        param_grid = params,
                        cv = 5,
                        verbose = 0)

ada_grid.fit(X, y)

end = time.time()

In [None]:
print("GridSearchCV Runtime: {} minutes".format(round((end - start) / 60, 2)))
print("Best Parameters : ", ada_grid.best_params_)

## Gradient Boosting

In [None]:
start = time.time()

params = {
    'learning_rate': [0.1, 0.3, 0.5, 0.8, 1],
    'max_depth': [1, 3, 5, 7, 10, 15, 25],
    'subsample': [0.1, 0.3, 0.5, 0.8, 1],
    'n_estimators' : [50, 100, 250, 500]
}

gbc_grid = GridSearchCV(estimator = GradientBoostingClassifier(),
                        param_grid = params,
                        cv = 3,
                        verbose = 0)

gbc_grid.fit(X, y)

end = time.time()

In [None]:
print("GridSearchCV Runtime: {} minutes".format(round((end - start) / 60, 2)))
print("Best Parameters : ", gbc_grid.best_params_)

## Extra Trees

In [None]:
start = time.time()

params = {
    'n_estimators' : [50, 75, 100, 125, 150],
    'max_depth': [i for i in range(1, 10, 2)],
    'min_samples_leaf': [i for i in range(1, 10, 2)],
    'min_samples_split': [2, 3, 5, 7, 9]  # an integer min_samples_split must be at least 2
}

xtc_grid = GridSearchCV(estimator = ExtraTreesClassifier(),
                        param_grid = params,
                        cv = 3,
                        verbose = 0)

xtc_grid.fit(X, y)

end = time.time()

In [None]:
print("GridSearchCV Runtime: {} minutes".format(round((end - start) / 60, 2)))
print("Best Parameters : ", xtc_grid.best_params_)

## CatBoost

In [None]:
start = time.time()

params = {
    'learning_rate': [0.03, 0.1, 0.5],
    'depth': [4, 6, 10],
    'l2_leaf_reg': [1, 3, 5, 7, 9]
}

cbc_grid = GridSearchCV(estimator = CatBoostClassifier(verbose = 0),
                        param_grid = params,
                        cv = 3,
                        verbose = 0)

cbc_grid.fit(X, y)

end = time.time()

In [None]:
print("GridSearchCV Runtime: {} minutes".format(round((end - start) / 60, 2)))
print("Best Parameters : ", cbc_grid.best_params_)

## Support Vector Machines

In [None]:
start = time.time()

params = {'C': [0.0001, 0.001, 0.01, 0.1, 1, 10]}

svc_grid = GridSearchCV(estimator = SVC(),
                        param_grid = params,                        
                        cv = 5,
                        verbose = 0)

svc_grid.fit(X, y)

end = time.time()

In [None]:
print("GridSearchCV Runtime: {} minutes".format(round((end - start) / 60, 2)))
print("Best Parameters : ", svc_grid.best_params_)

## XGBoost

In [None]:
X_sample = X.sample(n=3000, random_state=42)
y_sample = y[X_sample.index]

start = time.time()

params = {
    'learning_rate': [0.1, 0.3, 0.5],
    'max_depth': [1, 3, 5],
    'min_child_weight': [1, 3, 5, 7, 9],
    'subsample': [0.1, 0.3, 0.5, 0.8, 1],
    'colsample_bytree': [0.1, 0.3, 0.5],
    'n_estimators' : [100, 200, 300, 400, 500],
    'objective': ['binary:logistic']  # classification objective; 'reg:squarederror' is a regression objective
}

xgbc_grid = GridSearchCV(estimator = XGBClassifier(),
                         param_grid = params,
                         cv = 3,
                         verbose = 0)

xgbc_grid.fit(X_sample, y_sample)

end = time.time()

In [None]:
print("GridSearchCV Runtime: {} minutes".format(round((end - start) / 60, 2)))
print('Best Parameters: ', xgbc_grid.best_params_)

## LightGBM

In [None]:
start = time.time()

params = {
    'learning_rate': [10 ** x for x in range (-5, 5, 1)],
    'n_estimators': [x * 100 for x in range(1, 11)]
}

lgbc_grid = GridSearchCV(estimator = LGBMClassifier(),
                        param_grid = params,                        
                        cv = 3,
                        verbose = 0)

lgbc_grid.fit(X, y)  # use the unscaled data, as for the other models

end = time.time()

In [None]:
print("GridSearchCV Runtime: {} minutes".format(round((end - start) / 60, 2)))
print('Best Parameters: ', lgbc_grid.best_params_)

# Best Parameters & Comparison

In [None]:
print("Best Parameters (Logistic Regression): ", lr_grid.best_params_)
print("Best Parameters (K-Nearest Neighbors): ", knn_grid.best_params_)
print("Best Parameters (Decision Tree): ", dtc_grid.best_params_)
print("Best Parameters (Random Forest): ", rfc_grid.best_params_)
print("Best Parameters (AdaBoost): ", ada_grid.best_params_)
print("Best Parameters (Gradient Boosting): ", gbc_grid.best_params_)
print("Best Parameters (Extra Trees): ", xtc_grid.best_params_)
print("Best Parameters (CatBoost): ", cbc_grid.best_params_)
print("Best Parameters (SVC): ", svc_grid.best_params_)
print('Best Parameters (XGBoost):', xgbc_grid.best_params_)
print('Best Parameters (LightGBM): ', lgbc_grid.best_params_)
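
Rather than retyping the printed values, the winning settings can be unpacked straight from `best_params_`. This is a minimal sketch on synthetic data, with a deliberately small hypothetical grid. It also runs the search on the training split only, which keeps the test split unseen during tuning.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42)

grid = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 6], "min_samples_split": [2, 5, 10]},
    cv=3,
)
grid.fit(X_train, y_train)  # search on the training split only

# Unpack the winning settings into a fresh estimator instead of retyping them.
tuned_tree = DecisionTreeClassifier(random_state=0, **grid.best_params_)
tuned_tree.fit(X_train, y_train)
test_accuracy = tuned_tree.score(X_test, y_test)

print(grid.best_params_, round(test_accuracy, 3))
```

By default `GridSearchCV` also refits the best configuration, so `grid.best_estimator_` could be scored directly as an alternative.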

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

lr_train_score, lr_test_score, lr_pr, lr_re, lr_f1 = fit_predict_score(LogisticRegression(C=1, penalty='l2'), X_train, y_train, X_test, y_test)
knn_train_score, knn_test_score, knn_pr, knn_re, knn_f1 = fit_predict_score(KNeighborsClassifier(metric='minkowski', n_neighbors=3, weights='uniform'), X_train, y_train, X_test, y_test)
dtc_train_score, dtc_test_score, dtc_pr, dtc_re, dtc_f1 = fit_predict_score(DecisionTreeClassifier(criterion='gini', max_depth=2, min_samples_leaf=1, min_samples_split=2), X_train, y_train, X_test, y_test)
rfc_train_score, rfc_test_score, rfc_pr, rfc_re, rfc_f1 = fit_predict_score(RandomForestClassifier(criterion='gini', max_depth=15, n_estimators=100), X_train, y_train, X_test, y_test)
ada_train_score, ada_test_score, ada_pr, ada_re, ada_f1 = fit_predict_score(AdaBoostClassifier(learning_rate=0.1, n_estimators=10), X_train, y_train, X_test, y_test)
gbc_train_score, gbc_test_score, gbc_pr, gbc_re, gbc_f1 = fit_predict_score(GradientBoostingClassifier(learning_rate=0.8, max_depth=5, n_estimators=500, subsample=0.1), X_train, y_train, X_test, y_test)
xtc_train_score, xtc_test_score, xtc_pr, xtc_re, xtc_f1 = fit_predict_score(ExtraTreesClassifier(max_depth=1, min_samples_leaf=7, min_samples_split=3, n_estimators=75), X_train, y_train, X_test, y_test)
cbc_train_score, cbc_test_score, cbc_pr, cbc_re, cbc_f1 = fit_predict_score(CatBoostClassifier(verbose = 0, depth=6, l2_leaf_reg=3, learning_rate=0.5), X_train, y_train, X_test, y_test)
svc_train_score, svc_test_score, svc_pr, svc_re, svc_f1 = fit_predict_score(SVC(C=10), X_train, y_train, X_test, y_test)
xgbc_train_score, xgbc_test_score, xgbc_pr, xgbc_re, xgbc_f1 = fit_predict_score(XGBClassifier(colsample_bytree=0.5, learning_rate=0.1, max_depth=1, min_child_weight=7, n_estimators=100, objective='binary:logistic', subsample=0.5), X_train, y_train, X_test, y_test)
lgbc_train_score, lgbc_test_score, lgbc_pr, lgbc_re, lgbc_f1 = fit_predict_score(LGBMClassifier(learning_rate=10000, n_estimators=100), X_train, y_train, X_test, y_test)

models = ['Logistic Regression', 'K-Nearest Neighbors', 'Decision Tree', 'Random Forest', 'AdaBoost',
          'Gradient Boosting', 'Extra Trees', 'CatBoost', 'Support Vector Machines', 'XGBoost', 'LightGBM']
train_score = [lr_train_score, knn_train_score, dtc_train_score, rfc_train_score, ada_train_score,
               gbc_train_score, xtc_train_score, cbc_train_score, svc_train_score, xgbc_train_score, lgbc_train_score]
test_score = [lr_test_score, knn_test_score, dtc_test_score, rfc_test_score, ada_test_score,
              gbc_test_score, xtc_test_score, cbc_test_score, svc_test_score, xgbc_test_score, lgbc_test_score]
precision = [lr_pr, knn_pr, dtc_pr, rfc_pr, ada_pr, gbc_pr, xtc_pr, cbc_pr, svc_pr, xgbc_pr, lgbc_pr]
recall = [lr_re, knn_re, dtc_re, rfc_re, ada_re, gbc_re, xtc_re, cbc_re, svc_re, xgbc_re, lgbc_re]
f1 = [lr_f1, knn_f1, dtc_f1, rfc_f1, ada_f1, gbc_f1, xtc_f1, cbc_f1, svc_f1, xgbc_f1, lgbc_f1]

tuned_models = pd.DataFrame(data=[models, train_score, test_score, precision, recall, f1]).T.rename({0: 'Model',
                                                                                                     1:'Training Score',
                                                                                                     2: 'Test Score (Accuracy)',
                                                                                                     3: 'Precision',
                                                                                                     4: 'Recall',
                                                                                                     5: 'F1 Score'
                                                                                                        }, axis=1)

In [None]:
print("Default Parameters:")
display(model_comparison(X, y))
print('-'*40)
print("\nTuned Parameters:")
display(tuned_models)

# Classification with Artificial Neural Networks (ANNs)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.33, random_state=42)

print("Shape of train set (X) :", X_train.shape)
print("Shape of train set (y) :", y_train.shape)
print("Shape of test set  (X) :", X_test.shape)
print("Shape of test set  (y) :", y_test.shape)

input_shape = X_train.shape[1]

In [None]:
model = Sequential()
model.add(Dense(16, activation='relu', input_shape = (input_shape,), name = "Hidden_Layer_1"))
model.add(Dense(8, activation='relu', name = "Hidden_Layer_2"))
model.add(Dense(4, activation='relu', name = "Hidden_Layer_3"))
model.add(Dense(2, activation='relu', name = "Hidden_Layer_4"))
model.add(Dense(1, activation='sigmoid', name = "Output"))

model.summary()

In [None]:
model.compile(optimizer ='adam',
              loss='binary_crossentropy', 
              metrics =['accuracy'])

In [None]:
model.fit(X_train, y_train, epochs=100)

In [None]:
y_pred = model.predict(X_test)
y_pred = (y_pred > 0.5).ravel()  # threshold the (n, 1) probabilities and flatten to a 1-D array of class predictions

In [None]:
confusion_matrix(y_test, y_pred)

In [None]:
train_score = model.evaluate(X_train, y_train, verbose = 0)[1]
test_score = model.evaluate(X_test, y_test, verbose = 0)[1]

print("Training Score: {:.3f}".format(train_score))
print("Test Score (Accuracy): {:.3f}".format(test_score))
print("Precision: {:.3f}".format(precision_score(y_test, y_pred)))
print("Recall: {:.3f}".format(recall_score(y_test, y_pred)))
print("F1 Score: {:.3f}".format(f1_score(y_test, y_pred)))