**Auction Kick Prediction - Laksh Advani**

In this notebook we explore the dataset and build a model to predict whether a car purchased at auction is a good or a bad buy.

## Import Libraries

In [None]:
import time
import warnings
from collections import Counter

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer

# Missing values and feature engineering
from sklearn.feature_selection import SelectKBest, chi2, f_classif, VarianceThreshold
from sklearn.impute import SimpleImputer, KNNImputer, MissingIndicator
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder, MinMaxScaler
from sklearn.decomposition import PCA

# Models, model selection and metrics
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import confusion_matrix, f1_score
from lightgbm import LGBMClassifier


## Import Data

In [None]:
train = pd.read_csv("../input/DontGetKicked/training.csv")
test = pd.read_csv("../input/DontGetKicked/test.csv")

In [None]:
train.head()

In [None]:
train.describe().T

In [None]:
test.describe().T

In [None]:

# Split columns by dtype; 'number' covers both integer and float columns
categorical_cols = list(train.select_dtypes(include=['object', 'category']).columns)
numerical_cols = list(train.select_dtypes(include=['number']).columns)

In [None]:
numerical_cols

In [None]:
categorical_cols

From the cells above we can see that the data contains a mix of numerical and categorical features. Our training set has 72983 examples and the test set has 48707. In the following cells we look more closely at these features, in particular at how strongly they are correlated with one another.

In [None]:
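# distplot is deprecated in recent seaborn releases; histplot with kde=True is the closest replacement
sns.histplot(train['MMRAcquisitionAuctionAveragePrice'], kde=True, color='C0', label='Acquisition')
sns.histplot(train['MMRCurrentAuctionAveragePrice'], kde=True, color='C1', label='Current')
plt.legend()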


In [None]:
# Odometer reading distribution (histplot replaces the deprecated distplot)
sns.histplot(train['VehOdo'], kde=True)



We can see that the odometer readings are concentrated around 70000 miles.

In [None]:

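# 'VehOdo' was listed twice; palette has no effect without hue, so it is dropped
g = sns.pairplot(train, vars=['IsBadBuy', 'VehicleAge', 'VehOdo', 'MMRCurrentAuctionAveragePrice',
                              'MMRAcquisitionAuctionAveragePrice', 'WarrantyCost'])
# Title the whole grid rather than only the last subplot
g.fig.suptitle('Pairplot of the auction features', y=1.02)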

The pairplot above shows the pairwise relationships between the target and the selected numerical features.

In [None]:
plt.figure(figsize=(15,15))
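# Restrict to numeric columns; recent pandas versions raise on non-numeric data in corr()
cor = train.select_dtypes(include='number').corr()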
sns.heatmap(cor, annot=True)
plt.show()

The Pearson correlation heatmap above gives us an idea of the linear dependence between features. The cost variables are highly correlated with each other; we make a note of this and investigate later whether any of them should be dropped.
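
The "highly correlated" observation can be made concrete by listing the feature pairs whose absolute Pearson correlation exceeds a cutoff. The sketch below is not part of the original analysis: it runs on a small synthetic frame, and the helper name `high_corr_pairs`, the 0.9 threshold and the column names are all illustrative assumptions.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df, threshold=0.9):
    """Return (col_a, col_b, r) for numeric column pairs with |r| >= threshold."""
    corr = df.select_dtypes(include="number").corr()
    cols = corr.columns
    pairs = []
    # Walk the upper triangle only, so each pair is reported once
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    # Strongest correlations first
    return sorted(pairs, key=lambda p: -abs(p[2]))


# Synthetic stand-in data (hypothetical columns, not the auction dataset)
rng = np.random.default_rng(0)
price = rng.normal(6000, 1500, size=500)
demo = pd.DataFrame({
    "price_a": price,
    "price_b": price + rng.normal(0, 100, size=500),  # near-duplicate of price_a
    "odo": rng.normal(70000, 15000, size=500),        # independent of the prices
})

pairs = high_corr_pairs(demo)
print(pairs)
```

On the notebook's data, the same helper could presumably be called as `high_corr_pairs(train)` to obtain the candidate columns for dropping.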

In [None]:

# Count the examples in each class and display the result
c = Counter(train['IsBadBuy'])
c

As the class counts above show, the classes are imbalanced, so we need a metric other than accuracy. We also set class_weight="balanced" in our models.
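
For reference, scikit-learn documents the "balanced" heuristic as `n_samples / (n_classes * count_of_class)`, so the rarer class receives the larger weight. The sketch below reproduces that formula by hand; the function name `balanced_class_weights` and the 90/10 toy labels are assumptions for illustration and do not reflect the actual class ratio in this dataset.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Toy labels: 90 good buys (0) and 10 bad buys (1)
toy_labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(toy_labels)
print(weights)
```

With these toy labels the minority class is weighted nine times more heavily than the majority class, which is what lets a misclassified bad buy count for more during training.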

## Divide Dataset into X and Y

In [None]:
# Features: drop the target and the RefId identifier, which carries no predictive information
X = train.drop(['IsBadBuy', 'RefId'], axis=1)
# Target
y = train['IsBadBuy']

In [None]:
X.describe().T

In [None]:
# Hold out 30% of the labelled data for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

In [None]:
X_train.describe().T

In [None]:
# Integer and float columns are treated as numerical, everything else as categorical
numerical_features = [col for col, dtype in zip(X.columns, X.dtypes)
                      if dtype.kind in ['i', 'f']]
categorical_features = [col for col, dtype in zip(X.columns, X.dtypes)
                        if dtype.kind not in ['i', 'f']]

## Setup Pipeline 

In [None]:
preprocessor = make_column_transformer(
    # Numerical columns: impute from the 2 nearest neighbours, then scale to [0, 1]
    (make_pipeline(KNNImputer(n_neighbors=2, weights="uniform"),
                   MinMaxScaler()),
     numerical_features),
    # Categorical columns: fill gaps with a 'missing' category, then one-hot encode
    (make_pipeline(SimpleImputer(strategy='constant', fill_value='missing'),
                   OneHotEncoder(categories='auto', handle_unknown='ignore')),
     categorical_features),
)

In [None]:
preprocessor_best = make_pipeline(preprocessor, 
                                  VarianceThreshold(), 
                                  SelectKBest(f_classif, k = 50)
                                 )

In [None]:

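# Full pipeline: preprocessing followed by LightGBM
# (despite the RF_ prefix, the estimator is an LGBMClassifier, not a random forest)
RF_Model = make_pipeline(preprocessor_best, LGBMClassifier(class_weight="balanced"))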

## Grid Search

In [None]:
# Number of boosting rounds (trees) to try
n_estimators = [int(x) for x in np.linspace(start=100, stop=1000, num=10)]
# Maximum depth of each tree
max_depth = [10, 15, 20, 25, 50]

# Parameter names are prefixed with the step name that make_pipeline assigns
param_grid = {'lgbmclassifier__n_estimators': n_estimators,
              'lgbmclassifier__max_depth': max_depth}

RF_Model = make_pipeline(preprocessor_best, LGBMClassifier(class_weight="balanced"))
# Exhaustive search over the grid with 3-fold cross-validation, scored on F1
rf_RandomGrid1 = GridSearchCV(estimator=RF_Model, param_grid=param_grid, cv=3,
                              verbose=1, n_jobs=-1, scoring='f1')
rf_RandomGrid1.fit(X_train, y_train)

In [None]:
rf_RandomGrid1.best_estimator_

In [None]:
yhat = rf_RandomGrid1.best_estimator_.predict(X_test)
print(yhat)

# f1_score expects (y_true, y_pred); macro averaging weights both classes equally
f1_score(y_test, yhat, average='macro')

In [None]:

# confusion_matrix expects (y_true, y_pred): rows are true labels, columns are predictions
cm = confusion_matrix(y_test, yhat)
print(cm)
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(cm)
plt.title('Confusion matrix of the classifier')
fig.colorbar(cax)

plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()

In [None]:

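warnings.simplefilter(action='ignore', category=FutureWarning)

best_pipeline = rf_RandomGrid1.best_estimator_
# The model sees the 50 encoded features kept by SelectKBest, so X.columns would not
# line up with the importances. Recovering the names through the preprocessing steps
# requires a recent scikit-learn release.
selected_features = best_pipeline[:-1].get_feature_names_out()
feature_imp = pd.DataFrame({'Value': best_pipeline[-1].feature_importances_,
                            'Feature': selected_features})

plt.figure(figsize=(20, 10))
sns.barplot(x="Value", y="Feature", data=feature_imp.sort_values(by="Value", ascending=False))
plt.title('LightGBM feature importances (best estimator)')
plt.tight_layout()
# Save before show(); saving afterwards writes an empty figure
plt.savefig('lgbm_importances-01.png')
plt.show()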

From the plot above we can see that features like 'odometer', 'age', 'size', 'nationality' and 'color' are among the most important features influencing the LightGBM model.

Given the significant class imbalance, I chose the F1 score as the metric for measuring the success of the model.
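
To illustrate why accuracy is misleading here, the sketch below computes F1 from raw counts for two hypothetical classifiers on a made-up set of 100 cars, 10 of which are bad buys. The function name `f1_from_counts` and all the counts are assumptions for illustration, not results from this notebook.

```python
def f1_from_counts(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall for the positive class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Classifier A always predicts "good buy": 90 of 100 correct, but no bad buy is found
acc_a = 90 / 100
f1_a = f1_from_counts(tp=0, fp=0, fn=10)

# Classifier B finds 6 of the 10 bad buys at the cost of 8 false alarms
acc_b = (100 - 8 - 4) / 100
f1_b = f1_from_counts(tp=6, fp=8, fn=4)

print(acc_a, f1_a)
print(acc_b, f1_b)
```

Classifier A has the higher accuracy yet an F1 of zero, while classifier B trades a little accuracy for a clearly better F1, which matches the goal of actually catching bad buys.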

Apart from the LightGBM model, I also tried a Logistic Regression classifier, which gave an F1 score of ~0.6.
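
The logistic regression code is not included in this notebook. A baseline along the following lines is one plausible way to set it up; this is a sketch on synthetic data, and the scaler, the `max_iter` value and the dataset parameters are all assumptions rather than the settings that produced the score quoted above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for the auction features
X_demo, y_demo = make_classification(n_samples=2000, n_features=10, n_informative=5,
                                     weights=[0.88, 0.12], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          stratify=y_demo, random_state=0)

# Same idea as the LightGBM pipeline: preprocessing, then a class-weighted classifier
baseline = make_pipeline(StandardScaler(),
                         LogisticRegression(class_weight="balanced", max_iter=1000))
baseline.fit(X_tr, y_tr)

pred = baseline.predict(X_te)
macro_f1 = f1_score(y_te, pred, average="macro")
print(macro_f1)
```

In the notebook itself, `preprocessor_best` would presumably take the place of `StandardScaler()`, so that both models see identically prepared features.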

Even though the auction price features were correlated, removing them negatively affected the score of the model.

Finally, to summarize, the F1 score for the LightGBM model is 0.65. To improve upon this model I would look into feature transformation and try out more advanced models like Recurrent Neural Networks.



