In [None]:
import numpy as np                  # Mathematical operations
import pandas as pd                 # Data manipulation

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt     
%matplotlib inline

# Sklearn
from sklearn.model_selection import train_test_split, RandomizedSearchCV, StratifiedKFold, GridSearchCV
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, auc, roc_curve, roc_auc_score, classification_report, mean_squared_error, confusion_matrix, f1_score, precision_recall_curve, r2_score 
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.ensemble import AdaBoostRegressor, BaggingRegressor, GradientBoostingRegressor, RandomForestRegressor

# Scipy
from scipy import stats
from scipy.stats import ttest_ind, ttest_ind_from_stats

# XGBoost
from xgboost import XGBClassifier
from xgboost import XGBRegressor
import xgboost as xgb

# LightGBM
import lightgbm as lgb

# Datetime
import time
from datetime import datetime  # binds the datetime class (importing the module as well would be shadowed by this name)

# Folium
import folium 
from folium import plugins
from folium.plugins import HeatMap

# Image
from IPython.display import Image

# Bayesian Optimizer
from skopt import BayesSearchCV

# Itertools
import itertools

# Remove warnings
import warnings
warnings.filterwarnings('ignore')

## Introduction

Imagine a user visiting a website and performing a job search. From the set of displayed results, the user clicks on the ones they are interested in and, after reading a job description, may click the apply button to land on an application page. The apply rate is defined as the fraction of job description views that result in an apply, and the goal is to predict this metric using the dataset described in the following section.

This notebook is a complete guide for anyone who is new to machine learning. My goal is to provide an end-to-end blueprint for applied machine learning while keeping it as actionable and succinct as possible.



### We will follow this process to solve the problem

1. Data Collection
2. Exploratory Data Analysis
3. Data Cleaning 
4. Feature Engineering 
5. Model Training (including cross validation and hyperparameter tuning)
6. Insights

### 1. Data collection

In [None]:
df = pd.read_csv('../input/Apply_Rate_2019.csv')

### 2. Exploratory Data Analysis

The purpose of the exploratory analysis is to “get to know” the dataset. Doing so up front will make the rest of the project much smoother, in 3 main ways:

- You’ll gain valuable hints for Data Cleaning (which can make or break your models).
- You’ll think of ideas for Feature Engineering (which can take your models from good to great).
- You’ll get a “feel” for the dataset, which will help you communicate results and deliver greater impact.

However, exploratory analysis for machine learning should be quick, efficient, and decisive… not long and drawn out!

Don’t skip this step, but don’t get stuck on it either.

You see, there are infinite possible plots, charts, and tables, but you only need a handful to “get to know” the data well enough to work with it.

In this step, we’ll show you the visualizations that provide the biggest bang for your buck.

#### Start with basics

First, you’ll want to answer a set of basic questions about the dataset:

- How many observations do I have?
- How many features?
- What are the data types of my features? Are they numeric? Categorical?
- Do I have a target variable?


In [None]:
df.head()

In [None]:
# Check the total number of observations in the dataset

print('Total number of observations in the dataset:', df.shape[0])

In [None]:
df.info()

#### Observation: 

1. title_proximity_tfidf, description_proximity_tfidf and city_match contain null values
2. There are 7 float-type, 2 integer-type and 1 object-type features

In [None]:
df.drop(['apply'],axis=1).describe()

#### Observation:

1. For almost all predictors, there is a notably large difference between the 75th percentile and the max value.
2. The median of 'title_proximity_tfidf', 'description_proximity_tfidf', 'main_query_tfidf', 'query_jl_score', 'query_title_score' and 'job_age_days' is lower than the mean, which points to right-skewed distributions.
3. Together, observations 1 and 2 suggest there are a lot of outliers in the data.
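
A mean that sits above the median is a typical sign of a long right tail. The sketch below illustrates the check on a small made-up series (`toy` is a hypothetical column, not part of the dataset), using pandas' sample skewness as an extra indicator.

```python
import pandas as pd

# Hypothetical toy column: mostly small values plus one very large one
toy = pd.Series([0, 0, 1, 1, 2, 2, 3, 3, 4, 50])

toy_mean = toy.mean()      # pulled upwards by the large value
toy_median = toy.median()  # robust to the large value
toy_skew = toy.skew()      # sample skewness; positive means a long right tail

print('mean:', toy_mean, '| median:', toy_median, '| skewness:', round(toy_skew, 2))
```

The same three calls can be run per column of `df` to rank features by how skewed they are.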

In [None]:
# Let's check the class distribution: records with an apply vs. records without one

count_classes = df['apply'].value_counts(sort=True)
count_classes.plot(kind='bar')

plt.title("Apply Rate")
plt.xticks(range(2))
plt.xlabel("Class")
plt.ylabel("Frequency");

print('Number of records without an apply:', count_classes[0])
print('Number of records with an apply:', count_classes[1])
# Applies relative to non-applies, expressed as a percentage
print('Percentage of apply to non apply: {:.2f}%'.format(100 * count_classes[1] / count_classes[0]))

#### Observation:

The data is imbalanced, so we might have to use resampling techniques (undersampling, or oversampling methods such as SMOTE) and evaluate with metrics that suit imbalanced data, such as the area under the ROC curve (AUC-ROC) or the area under the precision-recall curve (AUPRC). Let's explore further to decide which technique to use. Note: the problem statement that accompanies the dataset already specifies AUC as the metric.
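
As a rough illustration of how imbalance can be quantified, the sketch below computes the positive rate and the negative-to-positive ratio on a made-up target (`toy_target` is hypothetical). The ratio is a common starting value for XGBoost's `scale_pos_weight` parameter; whether reweighting helps on this dataset is an open question, not something the notebook establishes.

```python
import pandas as pd

# Hypothetical binary target: 90 records without an apply, 10 with an apply
toy_target = pd.Series([0] * 90 + [1] * 10)

counts = toy_target.value_counts()
apply_rate = toy_target.mean()            # fraction of positive records
neg_to_pos_ratio = counts[0] / counts[1]  # candidate value for scale_pos_weight

print('Apply rate: {:.1%}'.format(apply_rate))
print('Negative-to-positive ratio: {:.1f}'.format(neg_to_pos_ratio))
```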

### Correlation

Correlations allow you to look at the relationships between numeric features and other numeric features.

Correlation is a value between -1 and 1 that represents how closely two features move in unison. You don’t need to remember the math to calculate them. Just know the following intuition:

- Positive correlation means that as one feature increases, the other increases. E.g. a child’s age and her height.
- Negative correlation means that as one feature increases, the other decreases. E.g. hours spent studying and the number of parties attended.
- Correlations near -1 or 1 indicate a strong relationship.
- Those closer to 0 indicate a weak relationship.
- 0 indicates no linear relationship.
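
The intuition above can be checked numerically. This is a minimal sketch with invented numbers for the two examples (the arrays are illustrative, not real measurements), using NumPy's Pearson correlation.

```python
import numpy as np

# Invented example 1: a child's age (years) and height (cm) rise together
age = np.array([2, 4, 6, 8, 10])
height = np.array([86, 102, 115, 128, 138])

# Invented example 2: more hours studying, fewer parties attended
hours_studied = np.array([1, 2, 3, 4, 5])
parties = np.array([9, 7, 6, 3, 1])

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the correlation
r_pos = np.corrcoef(age, height)[0, 1]
r_neg = np.corrcoef(hours_studied, parties)[0, 1]

print('age vs height:', round(r_pos, 3))
print('studying vs parties:', round(r_neg, 3))
```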


In [None]:
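# Let's check the correlation between the numeric features
# (select_dtypes skips non-numeric columns, which recent pandas versions refuse to correlate)

sns.heatmap(df.select_dtypes(include='number').corr())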

#### Observation: 

1. title_proximity_tfidf and main_query_tfidf are correlated, with a value of around 0.7
2. Other features are not highly correlated

### Outliers

One of the most important steps in Exploratory Data Analysis is outlier detection and treatment. Many machine learning algorithms are sensitive to the range and distribution of data points. Outliers can mislead the training process, resulting in longer training times and less accurate models. Outliers are samples that differ significantly from the rest of the data: points that lie outside the overall pattern of the distribution. Statistical measures such as the mean, variance and correlation are very susceptible to outliers.

#### Nature of outliers:

Outliers can occur in a dataset for one of the following reasons:

- Genuine extreme high and low values in the dataset
- Introduced due to human or mechanical error
- Introduced by replacing missing values

#### Outlier Detection

- Extreme Value Analysis
- Z-score method
- K Means clustering-based approach
- Visualizing the data
- Boxplot
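
Of the detection methods listed, the Z-score method is the quickest to sketch. The example below uses made-up values and the conventional cut-off of 3 standard deviations; both the data and the threshold are assumptions for illustration.

```python
import numpy as np

# Made-up sample: 20 ordinary values plus one extreme value
values = np.array([10, 12, 11, 13, 12, 11, 10, 12, 13, 11] * 2 + [95], dtype=float)

# Z-score: distance from the mean in units of standard deviation
z_scores = (values - values.mean()) / values.std()

threshold = 3.0  # conventional cut-off, an assumption rather than a rule
outlier_mask = np.abs(z_scores) > threshold

print('Flagged outliers:', values[outlier_mask])
```

Note that the extreme value itself inflates the mean and standard deviation, so on very small samples the Z-score method can fail to flag anything.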

#### Outlier Treatment

- Mean/Median or random Imputation
- Trimming
- Top, Bottom and Zero Coding
- Discretization

In this notebook, however, I will detect outliers using the boxplot method. If you want to learn in depth how to detect and handle outliers, please refer to this article. A box plot, also called a box-and-whisker plot, is a graphical method built on quartiles and the interquartile range; it defines upper and lower limits beyond which data points are considered outliers.

In brief, quantiles are points in a distribution that relate to the rank order of values in that distribution. For a given sample, you can find any quantile by sorting the sample. The middle value of the sorted sample is the middle quantile, or 50th percentile (also known as the median of the sample).
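
The limits a box plot draws can also be computed directly. The sketch below applies the usual 1.5 × IQR rule (the default whisker length in matplotlib and seaborn box plots) to a made-up sample.

```python
import numpy as np

# Made-up sample with one extreme value
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 30], dtype=float)

# Quartiles: 25th, 50th (median) and 75th percentiles
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1  # interquartile range

# Points beyond these fences are drawn as outliers in a box plot
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print('Q1:', q1, '| median:', median, '| Q3:', q3)
print('fences:', lower_fence, upper_fence, '| outliers:', outliers)
```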



In [None]:
l = ['title_proximity_tfidf', 'description_proximity_tfidf',
       'main_query_tfidf', 'query_jl_score', 'query_title_score',
       'city_match', 'job_age_days']
number_of_columns = 7
# Integer number of subplot rows needed to fit all features
number_of_rows = (len(l) - 1) // number_of_columns + 1
plt.figure(figsize=(2 * number_of_columns, 5 * number_of_rows))
sns.set_style('whitegrid')
for i, col in enumerate(l):
    plt.subplot(number_of_rows, number_of_columns, i + 1)
    # Passing the column as y draws a vertical box plot
    sns.boxplot(y=df[col], color='green')
plt.tight_layout()

#### Observation:

As we can see, there are a lot of outliers in the data.

In [None]:
# Check the distribution

# To check the shape and skewness of each feature, it is good practice to plot its distribution for both classes.
# A kernel density estimate (KDE) is a useful tool for visualizing the shape of a distribution.

for feature in df.columns[:-3]:
    ax = plt.subplot()
    sns.distplot(df[df['apply'] == 1][feature], bins=50, label='Apply')
    sns.distplot(df[df['apply'] == 0][feature], bins=50, label='No apply')
    ax.set_xlabel('')
    ax.set_title('histogram of feature: ' + str(feature))
    plt.legend(loc='best')
    plt.show()

#### Observation:

For all the features, the apply and non-apply classes have nearly identical distributions.

### 3. Data Cleaning

### Drop Duplicates

In [None]:
print(df.shape)
df = df.drop_duplicates(keep = 'first')
df.shape

### Missing Values 

Missing data is a deceptively tricky issue in applied machine learning.

First, just to be clear, you cannot simply ignore missing values in your dataset. You must handle them in some way for the very practical reason that most algorithms do not accept missing values.

“Common sense” is not sensible here.

The following are the most commonly recommended ways of dealing with missing data:

- Dropping observations that have missing values
- Imputing the missing values based on other observations
- Interpolation and Extrapolation
- Using KNN
- Mean/ Median Imputation
- Regression Imputation
- Stochastic regression imputation
- Hot-deck imputation
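
Two of the simplest options, dropping rows and median imputation, are sketched below on a made-up column (`score` is hypothetical). Adding a missing-value indicator before imputing is an optional extra, included here as an assumption rather than something this notebook does.

```python
import numpy as np
import pandas as pd

# Made-up column with two missing values
toy = pd.DataFrame({'score': [0.2, np.nan, 0.4, 0.9, np.nan, 0.5]})

# Option 1: drop observations that have missing values
dropped = toy.dropna(subset=['score'])

# Option 2: median imputation, keeping a flag that records what was missing
imputed = toy.copy()
imputed['score_missing'] = imputed['score'].isnull().astype(int)
imputed['score'] = imputed['score'].fillna(imputed['score'].median())

print('rows after dropping:', len(dropped))
print(imputed)
```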

In [None]:
df.isnull().sum()

In [None]:
# Let's check the value counts for the three columns
df['title_proximity_tfidf'].value_counts().head()

In [None]:
df['description_proximity_tfidf'].value_counts().head()

In [None]:
df['city_match'].value_counts().head()

#### Observation:

The first two columns contain mostly zeros, so imputing a value of 0 for them is a safe option. For the city_match column, let's check the percentage of apply and non-apply records before and after removing the NaN values. If the percentage is the same, we can conclude that it is safe to remove rows that have NaN values in the city_match column.
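
The before/after comparison described above could look like the following sketch. It runs on a made-up frame (`toy`, with invented values) so the numbers say nothing about the real dataset; on the real data the same two lines would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the real data
toy = pd.DataFrame({
    'city_match': [1, 0, np.nan, 1, 0, np.nan, 1, 0, 1, 0],
    'apply':      [1, 0, 0,      0, 0, 1,      0, 0, 1, 0],
})

# Apply rate on all rows vs. only rows where city_match is known
rate_before = toy['apply'].mean()
rate_after = toy.dropna(subset=['city_match'])['apply'].mean()

print('Apply rate before dropping NaN rows: {:.1%}'.format(rate_before))
print('Apply rate after dropping NaN rows:  {:.1%}'.format(rate_after))
```

If the two rates differ noticeably, dropping the rows changes the class balance and imputation may be the safer choice.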

In [None]:
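# Assign the results back instead of using inplace=True on a column selection,
# which does not reliably modify df in recent pandas versions
df['title_proximity_tfidf'] = df['title_proximity_tfidf'].fillna(0)
df['description_proximity_tfidf'] = df['description_proximity_tfidf'].fillna(0)
df = df.dropna(subset=['city_match'])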

#### Note: I will not be removing outliers, since there is a possibility that they carry important information that can help us distinguish the apply and non-apply cases

### 4. Feature Engineering

In general, you can think of data cleaning as a process of subtraction and feature engineering as a process of addition.

This is often one of the most valuable tasks a data scientist can do to improve model performance, for 3 big reasons:

- You can isolate and highlight key information, which helps your algorithms “focus” on what’s important.
- You can bring in your own domain expertise.
- Most importantly, once you understand the “vocabulary” of feature engineering, you can bring in other people’s domain expertise!

Below are some of the ways we can perform feature engineering but please note that this is not an exhaustive compendium of all feature engineering because there are limitless possibilities for this step. The good news is that this skill will naturally improve as you gain more experience.

- Infuse domain knowledge
- Create interaction features
- Combine sparse classes
- Add dummy variables
- Remove unused features
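
Two items from this list, interaction features and dummy variables, can be sketched in a few lines of pandas. The frame and its column names (`title_score`, `query_score`, `job_type`) are hypothetical and do not exist in the dataset.

```python
import pandas as pd

# Hypothetical frame with two numeric features and one categorical feature
toy = pd.DataFrame({
    'title_score': [0.1, 0.5, 0.9],
    'query_score': [2.0, 4.0, 6.0],
    'job_type': ['full_time', 'part_time', 'full_time'],
})

# Interaction feature: the product of two existing features
toy['title_x_query'] = toy['title_score'] * toy['query_score']

# Dummy variables: one indicator column per category
toy = pd.get_dummies(toy, columns=['job_type'])

print(toy.columns.tolist())
```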

In our case, since we do not have much domain knowledge about the dataset, we are restricted in our application of feature engineering. The only feature engineering I have applied is multiplying the two correlated features (title_proximity_tfidf and main_query_tfidf) to create a new column named main_title_tfidf.



In [None]:
# From the correlation heatmap, we observed that title_proximity_tfidf and main_query_tfidf are quite correlated,
# so let's merge them into a single feature by multiplying them

df['main_title_tfidf'] = df['title_proximity_tfidf']*df['main_query_tfidf']

In [None]:
df = df.drop(['title_proximity_tfidf','main_query_tfidf'], axis=1)

### 5. Modeling

#### Some of the factors affecting the choice of a model are:

- Whether the model meets the business goals
- How much pre-processing the model needs
- How accurate the model is
- How explainable the model is
- How fast the model is: How long does it take to build a model, and how long does the model take to make predictions.
- How scalable the model is

An important criterion affecting the choice of algorithm is model complexity. Generally speaking, a model is more complex if:

- It relies on more features to learn and predict (e.g. using two features vs ten features to predict a target)
- It relies on more complex feature engineering (e.g. using polynomial terms, interactions, or principal components)
- It has more computational overhead (e.g. a single decision tree vs. a random forest of 100 trees).

Besides this, the same machine learning algorithm can be made more complex based on the number of parameters or the choice of some hyperparameters. For example,

- A regression model can have more features, or polynomial terms and interaction terms.
- A decision tree can have more or less depth.
- Making the same algorithm more complex increases the chance of overfitting.
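
To make the point about polynomial and interaction terms concrete, the sketch below counts how many columns scikit-learn's `PolynomialFeatures` generates from three input features as the degree increases. The input matrix is arbitrary filler data.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Arbitrary data: 4 rows, 3 input features
X_toy = np.arange(12, dtype=float).reshape(4, 3)

feature_counts = {}
for degree in (1, 2, 3):
    # Output includes the bias column, the original features,
    # and all polynomial and interaction terms up to `degree`
    expanded = PolynomialFeatures(degree=degree).fit_transform(X_toy)
    feature_counts[degree] = expanded.shape[1]

print(feature_counts)
```

Every extra column is another coefficient the model can fit, which is why the same algorithm becomes easier to overfit as the degree increases.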

### Commonly used Machine Learning algorithms for classification

#### Logistic Regression

Logistic regression models fit a linear decision boundary (a “straight line” in feature space). On problems with non-linear structure they often underperform more flexible models, so we skip them here.

Their main advantage is that they are easy to interpret and understand. However, our goal is not to study the data and write a research report. Our goal is to build a model that can make accurate predictions.

In this regard, logistic regression suffers from two major flaws:

* It’s prone to overfit with many input features.
* It cannot easily express non-linear relationships.

#### Regularization

As mentioned above, logistic regression suffers from overfitting and difficulty in handling non-linear relationships. Regularization is a technique used to prevent overfitting by artificially penalizing model coefficients.

* It can discourage large coefficients (by dampening them).
* It can also remove features entirely (by setting their coefficients to 0).
* The “strength” of the penalty is tunable.

Types of regularization are Lasso (L1), ridge (L2) and elastic net (compromise between ridge and lasso)
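
The difference between the L1 and L2 penalties can be seen on synthetic data. This is a minimal sketch, assuming scikit-learn's `LogisticRegression` with the `liblinear` solver and an arbitrary penalty strength (`C=0.05`); it is not part of the modeling in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, only 3 of which carry signal
X_toy, y_toy = make_classification(n_samples=500, n_features=20, n_informative=3,
                                   n_redundant=0, random_state=0)
X_toy = StandardScaler().fit_transform(X_toy)

# Smaller C means a stronger penalty
l1_model = LogisticRegression(penalty='l1', C=0.05, solver='liblinear').fit(X_toy, y_toy)
l2_model = LogisticRegression(penalty='l2', C=0.05, solver='liblinear').fit(X_toy, y_toy)

# L1 can set coefficients exactly to zero; L2 only shrinks them
n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))

print('Zero coefficients with L1:', n_zero_l1)
print('Zero coefficients with L2:', n_zero_l2)
```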

#### Decision Trees

Decision trees model data as a “tree” of hierarchical branches. They make branches until they reach “leaves” that represent predictions. Due to their branching structure, decision trees can easily model nonlinear relationships.

Unfortunately, decision trees suffer from a major flaw as well. If you allow them to grow limitlessly, they can completely “memorize” the training data, just from creating more and more and more branches. As a result, individual unconstrained decision trees are very prone to overfitting.
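
The memorization effect is easy to reproduce. The sketch below trains a shallow and an unconstrained tree on synthetic data with noisy labels (the data, the noise level, and the depth of 3 are all illustrative assumptions) and compares training and test accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with deliberately noisy labels (flip_y)
X_toy, y_toy = make_classification(n_samples=1000, n_features=10, n_informative=4,
                                   flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)

scores = {}
for depth in (3, None):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))

for depth, (train_acc, test_acc) in scores.items():
    print('max_depth={}: train accuracy {:.2f}, test accuracy {:.2f}'.format(depth, train_acc, test_acc))
```

The unconstrained tree reaches perfect training accuracy by fitting the noise, which is the overfitting described above.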

So, how can we take advantage of the flexibility of decision trees while preventing them from overfitting the training data?

#### Tree Ensembles

Ensembles are machine learning methods for combining predictions from multiple separate models. There are a few different methods for ensembling, but the two most common are:

Bagging: attempts to reduce the chance of overfitting complex models.

* It trains a large number of “strong” learners in parallel.
* A strong learner is a model that’s relatively unconstrained.
* Bagging then combines all the strong learners together in order to “smooth out” their predictions.
* A commonly used technique is Random Forest

Boosting: attempts to improve the predictive flexibility of simple models.

* It trains a large number of “weak” learners in sequence.
* A weak learner is a constrained model (i.e. you could limit the max depth of each decision tree).
* Each one in the sequence focuses on learning from the mistakes of the one before it.
* Boosting then combines all the weak learners into a single strong learner.
* Commonly used techniques are XGBoost and LightGBM
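
The contrast between the two families can be sketched with scikit-learn alone. Here `RandomForestClassifier` stands in for bagging with unconstrained trees and `GradientBoostingClassifier` stands in for boosting with shallow trees (a substitute for XGBoost and LightGBM, chosen only to keep the example dependency-free); the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_toy, y_toy = make_classification(n_samples=600, n_features=10, n_informative=4,
                                   random_state=0)

models = {
    # Bagging: many deep ("strong") trees trained independently, then averaged
    'bagging (random forest)': RandomForestClassifier(
        n_estimators=50, max_depth=None, random_state=0),
    # Boosting: many shallow ("weak") trees trained in sequence on previous errors
    'boosting (gradient boosting)': GradientBoostingClassifier(
        n_estimators=50, max_depth=2, random_state=0),
}

auc_scores = {}
for name, model in models.items():
    auc_scores[name] = cross_val_score(model, X_toy, y_toy, cv=3, scoring='roc_auc').mean()
    print('{}: mean CV AUC = {:.3f}'.format(name, auc_scores[name]))
```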

LightGBM: LightGBM is a gradient boosting framework that uses tree-based learning algorithms. LightGBM grows trees leaf-wise (vertically), while many other algorithms grow trees level-wise (horizontally). It chooses the leaf with the maximum delta loss to grow. For the same number of leaves, a leaf-wise algorithm can reduce more loss than a level-wise algorithm.

There are many other algorithms as well, such as Support Vector Machines and Neural Networks, but we won't cover them here.

For our case, I will be using XGBoost, Random Forest and LightGBM.




#### Splitting the dataset into training and testing

In [None]:
# Splitting the dataset by date
train = df.loc[df['search_date_pacific']<'2018-01-27']
test = df.loc[df['search_date_pacific'] == '2018-01-27']

In [None]:
# Drop the unnecessary columns
# (reassigning avoids pandas' SettingWithCopyWarning when modifying a slice of df)
train = train.drop(['search_date_pacific','class_id'],axis=1)
test = test.drop(['search_date_pacific','class_id'],axis=1)

In [None]:
# Drop irrelevant features
X = df.drop(['search_date_pacific','class_id','apply'],axis=1)
y = df['apply']

In [None]:
# Reset the index
X = X.reset_index(drop=True)
y = y.reset_index(drop=True)

In [None]:
X_train = train.drop(['apply'],axis=1)
y_train = train['apply']
X_test = test.drop(['apply'],axis=1)
y_test = test['apply']

In [None]:
# Define a function to plot confusion matrix

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`
    """
    if normalize:
        # Convert raw counts to per-class (row-wise) fractions
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.figure()
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    # Show fractions with two decimals, raw counts as integers
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.show()

In [None]:
# Define a function that reports key metrics: F1 score, AUC-ROC, confusion matrix and ROC curve

def report(test_set, predictions, labels, title, threshold=0.5):
    # F1 and the confusion matrix need hard labels, so threshold the scores;
    # AUC-ROC uses the raw scores (hard 0/1 predictions pass through unchanged)
    hard_predictions = (np.asarray(predictions) >= threshold).astype(int)
    roc_auc = roc_auc_score(test_set, predictions)
    print('F1 score is:', f1_score(test_set, hard_predictions))
    print("AUC-ROC is: %3.2f" % roc_auc)
    # Pass title by keyword: the third positional argument is `normalize`
    plot_confusion_matrix(confusion_matrix(test_set, hard_predictions), labels, title=title)
    
    # Plot the ROC curve
    fpr, tpr, _ = roc_curve(test_set, predictions)
    fig, ax = plt.subplots(figsize=(6,6))
    ax.set_title('Receiver Operating Characteristic')
    plt.plot(fpr, tpr, 'b',label='Model - AUC = %0.3f'% roc_auc)
    ax.legend(loc='lower right')
    plt.plot([0,1],[0,1],'r--', label='Chance')
    ax.legend()
    ax.set_xlim([-0.1,1.0])
    ax.set_ylim([-0.1,1.01])
    ax.set_ylabel('True Positive Rate')
    ax.set_xlabel('False Positive Rate')
    plt.show()

## Hyperparameter tuning and CV using Bayesian Optimizer

Searching for the model parameters that give the best cross-validation performance is necessary in almost all practical cases to get a model that generalizes well. A standard approach in scikit-learn is the GridSearchCV class, which takes a set of values to try for every parameter and simply enumerates all combinations of parameter values. The complexity of such a search grows exponentially with the addition of new parameters. A more scalable approach is RandomizedSearchCV, which, however, does not take advantage of the structure of the search space.

Scikit-optimize provides a drop-in replacement for GridSearchCV, BayesSearchCV, which uses Bayesian optimization: a predictive model referred to as a "surrogate" models the search space and is used to arrive at a good combination of parameter values as quickly as possible.
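
The growth of a grid compared with the fixed budget of a sampled search can be made concrete with scikit-learn's `ParameterGrid` and `ParameterSampler`. The parameter values below are arbitrary, and the sketch only counts candidates; it does not run a Bayesian search.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Arbitrary grid: 3 parameters with 5 candidate values each
grid = {
    'learning_rate': [0.01, 0.03, 0.1, 0.3, 1.0],
    'max_depth': [2, 4, 6, 8, 10],
    'subsample': [0.2, 0.4, 0.6, 0.8, 1.0],
}
n_grid_3 = len(ParameterGrid(grid))  # every combination of the 3 parameters

# Adding a 4th parameter multiplies the number of combinations again
grid['n_estimators'] = [50, 75, 100, 150, 200]
n_grid_4 = len(ParameterGrid(grid))

# A sampled search evaluates a fixed number of candidates regardless of grid size
random_candidates = list(ParameterSampler(grid, n_iter=10, random_state=42))

print('Grid candidates with 3 parameters:', n_grid_3)
print('Grid candidates with 4 parameters:', n_grid_4)
print('Sampled candidates:', len(random_candidates))
```

With 3-fold cross-validation each candidate costs three model fits, so the budget of 10 iterations used in this notebook is far cheaper than an exhaustive grid.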

In [None]:
# Define a function to print the status during Bayesian hyperparameter search

def status_print(optim_result):
    """Status callback during Bayesian hyperparameter search"""
    
    # Get all the models tested so far in DataFrame format
    # (relies on the global bayes_cv_tuner, which is defined in a later cell)
    all_models = pd.DataFrame(bayes_cv_tuner.cv_results_)    
    
    # Report the best score and parameters found so far
    print('Model #{}\nBest ROC-AUC: {}\nBest params: {}\n'.format(
        len(all_models),
        np.round(bayes_cv_tuner.best_score_, 4),
        bayes_cv_tuner.best_params_
    ))



### A.  XGBoost 

In [None]:
# SETTINGS - CHANGE THESE TO GET SOMETHING MEANINGFUL
ITERATIONS = 10
TRAINING_SIZE = 100000 
TEST_SIZE = 25000


# Classifier
bayes_cv_tuner = BayesSearchCV(
    estimator = xgb.XGBClassifier(
        n_jobs = 1,
        objective = 'binary:logistic',
        eval_metric = 'auc',
        verbosity=0,  # replaces `silent`, which is deprecated in recent XGBoost releases
        tree_method='approx'
    ),
    search_spaces = {
        'learning_rate': (0.01, 1.0, 'log-uniform'),
        'max_depth': (0, 50),
        'max_delta_step': (0, 20),
        'subsample': (0.01, 1.0, 'uniform'),
        'colsample_bytree': (0.01, 1.0, 'uniform'),
        'colsample_bylevel': (0.01, 1.0, 'uniform'),
        'reg_lambda': (1e-9, 1000, 'log-uniform'),
        'reg_alpha': (1e-9, 1.0, 'log-uniform'),
        'gamma': (1e-9, 0.5, 'log-uniform'),
        # Listed once: a duplicate dict key would silently override the earlier range
        'min_child_weight': (0, 5),
        'n_estimators': (50, 100),
        'scale_pos_weight': (1e-6, 500, 'log-uniform')
    },    
    scoring = 'roc_auc',
    cv = StratifiedKFold(
        n_splits=3,
        shuffle=True,
        random_state=42
    ),
    n_jobs = 3,
    n_iter = ITERATIONS,   
    verbose = 0,
    refit = True,
    random_state = 42
)


In [None]:
result = bayes_cv_tuner.fit(X, y, callback=status_print)

#### Use the tuned parameters to make the predictions

#### Note: For the predictions, because we are measuring ROC AUC and not accuracy, we have the model predict probabilities and not hard binary values.
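
Why probabilities matter for ROC AUC can be seen on a tiny made-up example: thresholding the scores into hard labels throws away the ranking information that AUC measures. The arrays below are invented for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Invented labels and predicted probabilities
y_true = np.array([0, 0, 0, 1, 1, 1])
y_proba = np.array([0.1, 0.2, 0.45, 0.3, 0.4, 0.9])

# Hard labels at a 0.5 threshold collapse most scores to the same value
y_hard = (y_proba >= 0.5).astype(int)

auc_from_proba = roc_auc_score(y_true, y_proba)
auc_from_hard = roc_auc_score(y_true, y_hard)

print('AUC from probabilities: {:.3f}'.format(auc_from_proba))
print('AUC from hard labels:   {:.3f}'.format(auc_from_hard))
```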

In [None]:
# Named xgb_model so that the imported `xgb` module is not overwritten
xgb_model = XGBClassifier(colsample_bylevel= 0.8390144719977516, colsample_bytree= 0.8844821246070537, 
                    gamma= 4.358684608480795e-07, learning_rate= 0.7988179462781242, max_delta_step= 17, 
                    max_depth= 3, min_child_weight= 1, n_estimators= 68, reg_alpha= 0.0005266983003701547, 
                    reg_lambda= 276.5424475574225, scale_pos_weight= 0.3016410771843142, subsample= 0.9923710598637134)
xgb_model.fit(X_train, y_train)
preds_xgb = xgb_model.predict_proba(X_test)[:, 1]
labels = ['No Apply', 'Apply']
#report(y_test, preds_xgb,labels, 'Confusion Matrix')
# Named auc_xgb so that the imported sklearn `auc` function is not overwritten
auc_xgb = roc_auc_score(y_test, preds_xgb)

print('The AUC of the tuned XGBoost model on the test set is {:.4f}.'.format(auc_xgb))

### B. Random Forest

In [None]:
# ITERATIONS = 10 # 1000
# TRAINING_SIZE = 100000 # 20000000
# TEST_SIZE = 25000
# # Classifier
# bayes_cv_tuner = BayesSearchCV(
#     estimator = RandomForestClassifier(
#         n_jobs = -1
#     ),
#     search_spaces = {
#     'min_samples_split': [3, 5, 8, 10, 20], 
#     'n_estimators' : [100, 500],
#     'max_depth': [3, 5, 8, 10, 15],
#     'max_features': [3, 5, 6]
# },    
#     scoring = 'roc_auc',
#     cv = StratifiedKFold(
#         n_splits=3,
#         shuffle=True,
#         random_state=42
#     ),
#     n_jobs = 3,
#     n_iter = ITERATIONS,   
#     verbose = 0,
#     refit = True,
#     random_state = 42
# )

In [None]:
# result = bayes_cv_tuner.fit(X, y, callback=status_print)

In [None]:
# rf = RandomForestClassifier(
#     n_estimators=421, 
#     max_depth=15,
#     max_features=3,
#     min_samples_split=8, 
#     class_weight="balanced",
#     bootstrap=True,
#     criterion='entropy',
#     random_state=100
#     )

# rf.fit(X_train, y_train)
# preds_rf = rf.predict_proba(X_test)[:,1]
# #labels = ['No Apply', 'Apply']
# #report(y_test, preds_rf,labels, 'Confusion Matrix')
# auc = roc_auc_score(y_test, preds_rf)

# print('The baseline score on the test set is {:.4f}.'.format(auc))

### C. LightGBM

In [None]:
# SETTINGS - CHANGE THESE TO GET SOMETHING MEANINGFUL
ITERATIONS = 10
TRAINING_SIZE = 100000 
TEST_SIZE = 25000


# Classifier
bayes_cv_tuner = BayesSearchCV(
    estimator = lgb.LGBMClassifier(
        n_jobs = 1,
        objective = 'binary',
        # LightGBM's own parameter names: `eval_metric` is a fit() argument in its
        # scikit-learn API, and `tree_method` is an XGBoost parameter
        metric = 'auc',
        verbose = -1   # suppress training output
    ),
    search_spaces = {
        'learning_rate': (0.01, 1.0, 'log-uniform'),
        'max_depth': (0, 50),
        'subsample': (0.01, 1.0, 'uniform'),
        'colsample_bytree': (0.01, 1.0, 'uniform'),
        'reg_lambda': (1e-9, 1000, 'log-uniform'),
        'reg_alpha': (1e-9, 1.0, 'log-uniform'),
        # Listed once: a duplicate dict key would silently override the earlier range
        'min_child_weight': (0, 5),
        'n_estimators': (50, 100)
    },    
    scoring = 'roc_auc',
    cv = StratifiedKFold(
        n_splits=3,
        shuffle=True,
        random_state=42
    ),
    n_jobs = 3,
    n_iter = ITERATIONS,   
    verbose = 0,
    refit = True,
    random_state = 42
)


In [None]:
result = bayes_cv_tuner.fit(X, y, callback=status_print)

In [None]:
model = lgb.LGBMClassifier(colsample_bytree=0.8015579071911014, learning_rate=0.07517239253342656, 
                           max_depth=26, min_child_weight=4, n_estimators=95, reg_alpha=0.002839751649223172, 
                           reg_lambda=0.0001230656555713626, subsample=0.653781260730285)

model.fit(X_train, y_train)


preds_lgb = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, preds_lgb)

print('The AUC of the tuned LightGBM model on the test set is {:.4f}.'.format(auc))

### Insights
- Although the improvement from tuning is not very significant, the Bayesian optimizer needed only a small number of iterations (10 per model) to find its parameters, far fewer fits than an exhaustive grid search over the same space would require.

- The low AUC score of 0.5849 may be attributed to the fact that the dataset has few features, which makes it difficult for the algorithms to separate the two classes.

- We did not have much domain knowledge, so we were not able to perform much feature engineering.

### Things TODO:
- Stack the three algorithms above, which may further improve the AUC

- Include the last column (class_id) as a feature to improve the results
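
As a starting point for the stacking idea, here is a minimal sketch using scikit-learn's `StackingClassifier` on synthetic data. The base models (`RandomForestClassifier` and `GradientBoostingClassifier`) are stand-ins chosen to avoid extra dependencies; swapping in the tuned XGBoost and LightGBM models is left as an untested assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X_toy, y_toy = make_classification(n_samples=600, n_features=10, n_informative=4,
                                   random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3,
                                          stratify=y_toy, random_state=0)

# Base models produce out-of-fold probabilities; a logistic regression combines them
stack = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
        ('gb', GradientBoostingClassifier(n_estimators=50, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    stack_method='predict_proba',
    cv=3,
)
stack.fit(X_tr, y_tr)

stack_auc = roc_auc_score(y_te, stack.predict_proba(X_te)[:, 1])
print('Stacked model AUC on the synthetic test split: {:.3f}'.format(stack_auc))
```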