# Introduction

This notebook is a solution to the Kaggle challenge [Don't Get Kicked!](https://www.kaggle.com/c/DontGetKicked).

One of the biggest challenges for an auto dealership purchasing a used car at an auto auction is the risk that the vehicle might have serious issues that prevent it from being sold to customers. The auto community calls these unfortunate purchases "kicks".

Kicked cars often result when there are tampered odometers, mechanical issues the dealer is not able to address, issues with getting the vehicle title from the seller, or some other unforeseen problem. Kicked cars can be very costly to dealers after transportation costs, throw-away repair work, and market losses in reselling the vehicle.

Modelers who can figure out which cars have a higher risk of being kicks can provide real value to dealerships trying to provide the best inventory selection possible to their customers.

The goal of this competition is to predict whether a car purchased at auction is a kick (a bad buy).

# Import Packages

In [None]:
# data process
import numpy as np 
import pandas as pd

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
import graphviz 
%matplotlib inline

# modeling
from sklearn.preprocessing import LabelEncoder
from sklearn import model_selection
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score

# tensorflow to build neural networks
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras import initializers

# system
import os
import sys

# ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Python version
print("Python version: {}".format(sys.version))

# Acquire Data and Initial Check

In [None]:
# Input data files are available in the "../input/" directory.
print(os.listdir("../input/DontGetKicked"))

train_df = pd.read_csv('../input/DontGetKicked/training.csv')
test_df = pd.read_csv('../input/DontGetKicked/test.csv')
submissions_df = test_df[['RefId']].copy() # save RefId for submission file (copy so a column can be added later)
print(train_df.shape, test_df.shape)

In [None]:
train_df.sample(10)

In [None]:
train_df.info()

In [None]:
print("Description of the dataset columns:\n")
with open('../input/DontGetKicked/Carvana_Data_Dictionary.txt') as f:
    print(f.read())

# Explore Data

In [None]:
numerical_features = train_df.select_dtypes(include = ['float64', 'int64']).columns.drop('RefId')
train_df[numerical_features].hist(figsize=(20, 15), color = "#3498db", bins=30, xlabelsize=8, ylabelsize=8)

In [None]:
plt.figure(figsize=(25, 10))
sns.heatmap(train_df[numerical_features].corr(),
            cmap=sns.diverging_palette(220, 10, as_cmap=True),
            annot=True, linewidths=.5, fmt='.3f')
plt.show()

In [None]:
categorical_features = train_df.select_dtypes(include = 'object').columns.tolist()
train_df[categorical_features].describe()

# Select Features

In [None]:
# The dataset has many features, remove the features which are less likely to contribute

unique_id = ['RefId','BYRNO']
with_many_categories = ['VNZIP1','PurchDate', 'Make', 'Model', 'SubModel', 'Trim', 'VNST', 'Color'] 
redundant = ['WheelTypeID']
high_correlation = [ 'MMRCurrentAuctionCleanPrice',    # 99% corr with MMRCurrentAuctionAveragePrice
                    'MMRCurrentRetailCleanPrice',      # 99% corr with MMRCurrentRetailAveragePrice
                    'MMRAcquisitionAuctionCleanPrice', # 99% corr with MMRAcquisitionAuctionAveragePrice
                    'MMRAcquisitonRetailCleanPrice',   # 99% corr with MMRAcquisitionRetailAveragePrice
                    'VehYear'                          # 96% corr with 'VehicleAge'
                   ]
columns_to_drop = unique_id + with_many_categories + redundant + high_correlation
train_df.drop(columns_to_drop,axis=1,inplace=True)
test_df.drop(columns_to_drop,axis=1,inplace=True)
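
The correlation-based drops above were read off the heatmap by eye. As a rough sketch of how the same pairs could be listed programmatically, the helper below scans the upper triangle of the correlation matrix. The function name `highly_correlated_pairs`, the 0.95 threshold, and the toy columns are all assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(df, threshold=0.95):
    """List numeric column pairs whose absolute Pearson correlation >= threshold."""
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (self-correlation of 1.0) is skipped.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    stacked = corr.where(upper).stack()
    stacked = stacked[stacked >= threshold].sort_values(ascending=False)
    return [(a, b, round(float(v), 3)) for (a, b), v in stacked.items()]

# Toy data (hypothetical): "clean" is nearly a linear function of "average".
rng = np.random.default_rng(0)
average = rng.normal(6000, 1500, size=500)
toy = pd.DataFrame({
    "average": average,
    "clean": average * 1.1 + rng.normal(0, 50, size=500),
    "odometer": rng.normal(70000, 15000, size=500),
})
pairs = highly_correlated_pairs(toy)
print(pairs)
```

On the real data, something like `highly_correlated_pairs(train_df[numerical_features])` run before the drop would be expected to surface the MMR price pairs listed above, though the exact threshold is a judgment call.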

In [None]:
train_df.info()

In [None]:
targets=train_df['IsBadBuy']
train_df.drop('IsBadBuy',axis=1,inplace=True)

# Handle Missing Values

In [None]:
# check missing value
print('Train columns with null values:\n', train_df.isnull().sum())
print("-"*10)
print('Test/Validation columns with null values:\n', test_df.isnull().sum())

In [None]:
# separate numerical and categorical features
train_df.info()
numerical_features = train_df.select_dtypes(include = ['float64', 'int64']).columns.tolist()
categorical_features = train_df.select_dtypes(include = 'object').columns.tolist()

In [None]:
# Replace missing numerical values with the training-set mean
from sklearn.impute import SimpleImputer
imputer=SimpleImputer(strategy='mean')
imputer.fit(train_df[numerical_features])
train_df[numerical_features]=imputer.transform(train_df[numerical_features])
test_df[numerical_features]=imputer.transform(test_df[numerical_features])

In [None]:
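# Add an 'Unknown' category for missing categorical values
# (assign the result back; in-place fillna on a column selection is unreliable in recent pandas)
for c in categorical_features:
    train_df[c] = train_df[c].fillna('Unknown')
    test_df[c] = test_df[c].fillna('Unknown')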
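
To make the two imputation steps concrete, here is a minimal sketch on toy data. The column names `odo` and `wheel` and all values are made up for illustration. The point it tries to show is that the mean is learned from the training rows only and then reused for the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames (hypothetical values), each with one numeric and one categorical column.
train = pd.DataFrame({"odo": [10.0, 20.0, np.nan, 30.0],
                      "wheel": ["Alloy", None, "Covers", "Alloy"]})
test = pd.DataFrame({"odo": [np.nan, 100.0],
                     "wheel": [None, "Special"]})

# Numeric: fit the mean on train only, then apply it to both frames.
imputer = SimpleImputer(strategy="mean").fit(train[["odo"]])
train["odo"] = imputer.transform(train[["odo"]]).ravel()
test["odo"] = imputer.transform(test[["odo"]]).ravel()

# Categorical: missing values become their own 'Unknown' category.
train["wheel"] = train["wheel"].fillna("Unknown")
test["wheel"] = test["wheel"].fillna("Unknown")

print(test)
```

The missing test value is filled with the training mean rather than the test mean, which keeps test-set statistics out of the preprocessing.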

# Encode Categorical Data & Normalize Numerical Data

In [None]:
from sklearn.preprocessing import OneHotEncoder
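# Note: scikit-learn 1.2+ uses `sparse_output` and `get_feature_names_out`;
# older releases use `sparse=False` and `get_feature_names` instead.
encoder=OneHotEncoder(sparse_output=False,handle_unknown='ignore')
encoder.fit(train_df[categorical_features])
encoded_cols=list(encoder.get_feature_names_out(categorical_features))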
train_df[encoded_cols]=encoder.transform(train_df[categorical_features]);
test_df[encoded_cols]=encoder.transform(test_df[categorical_features]);
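
The `handle_unknown='ignore'` setting matters because the test set can contain categories the encoder never saw during fitting. A small sketch of that behavior follows; the category values are toy examples, and the dense output is obtained with `.toarray()` to avoid depending on a particular scikit-learn version's keyword.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical): "CVT" appears only in the test frame.
train = pd.DataFrame({"Transmission": ["AUTO", "MANUAL", "AUTO"]})
test = pd.DataFrame({"Transmission": ["MANUAL", "CVT"]})

encoder = OneHotEncoder(handle_unknown="ignore").fit(train)
encoded_test = encoder.transform(test).toarray()

print(encoder.categories_[0])  # categories learned from train, in sorted order
print(encoded_test)            # the unseen category becomes an all-zero row
```

An all-zero row is a silent fallback, so it may be worth counting how many test rows hit it before trusting the predictions for those rows.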

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler=MinMaxScaler()
scaler.fit(train_df[numerical_features]);
train_df[numerical_features]=scaler.transform(train_df[numerical_features])
test_df[numerical_features]=scaler.transform(test_df[numerical_features])
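
Because the scaler is fitted on the training data only, scaled test values are not guaranteed to stay inside [0, 1]. The sketch below illustrates this with made-up odometer readings.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy odometer readings (hypothetical); train spans 20,000 to 100,000.
train_odo = np.array([[20000.0], [60000.0], [100000.0]])
test_odo = np.array([[10000.0], [60000.0], [120000.0]])

scaler = MinMaxScaler().fit(train_odo)
scaled_test = scaler.transform(test_odo).ravel()

# Each value is (x - train_min) / (train_max - train_min),
# so test values outside the training range fall below 0 or above 1.
print(scaled_test)
```

Tree-based models such as LightGBM are largely insensitive to this, but it can matter for the distance-based and neural-network models tried later.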

In [None]:
train_df=train_df[numerical_features+encoded_cols]
test_df=test_df[numerical_features+encoded_cols]

In [None]:
train_df.columns

# Testing Models

In [None]:
X_train, X_test, y_train, y_test = train_test_split(train_df, targets, test_size=0.15, random_state=0)

In [None]:
# Logistic Regression

logreg = LogisticRegression()
clf = logreg.fit(X_train, y_train)
acc_train_log = round(logreg.score(X_train, y_train) * 100, 2)
acc_test_log = round(logreg.score(X_test, y_test) * 100, 2)
roc_test_log = round(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]),3)
print('logistic regression train accuracy: ',acc_train_log)
print('logistic regression test accuracy: ',acc_test_log)
print('logistic regression test ROC AUC: ',roc_test_log)


# K-Nearest Neighbors

knn = KNeighborsClassifier(n_neighbors = 3)
clf = knn.fit(X_train, y_train)
acc_train_knn = round(knn.score(X_train, y_train) * 100, 2)
acc_test_knn = round(knn.score(X_test, y_test) * 100, 2)
roc_test_knn = round(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]),3)
print('K-Nearest Neighbors train accuracy: ',acc_train_knn)
print('K-Nearest Neighbors test accuracy: ',acc_test_knn)
print('K-Nearest Neighbors test ROC AUC: ',roc_test_knn)

# Decision Tree

decision_tree = DecisionTreeClassifier()
clf = decision_tree.fit(X_train, y_train)
acc_train_decision_tree = round(decision_tree.score(X_train, y_train) * 100, 2)
acc_test_decision_tree = round(decision_tree.score(X_test, y_test) * 100, 2)
roc_test_decision_tree = round(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]),3)
print('Decision Tree train accuracy: ',acc_train_decision_tree)
print('Decision Tree test accuracy: ',acc_test_decision_tree)
print('Decision Tree test ROC AUC: ',roc_test_decision_tree)

# Random Forest

random_forest = RandomForestClassifier(n_estimators=100)
clf = random_forest.fit(X_train, y_train)
acc_train_random_forest = round(random_forest.score(X_train, y_train) * 100, 2)
acc_test_random_forest = round(random_forest.score(X_test, y_test) * 100, 2)
roc_test_random_forest = round(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]),3)
print('Random Forest train accuracy: ',acc_train_random_forest)
print('Random Forest test accuracy: ',acc_test_random_forest)
print('Random Forest test ROC AUC: ',roc_test_random_forest)

# XGBoost
xgb = XGBClassifier()
clr = xgb.fit(X_train, y_train)
acc_train_xgb = round(xgb.score(X_train, y_train) * 100, 2)
acc_test_xgb = round(xgb.score(X_test, y_test) * 100, 2)  # score the XGBoost model, not logreg
roc_test_xgb = round(roc_auc_score(y_test, clr.predict_proba(X_test)[:, 1]),3)
print('xgb train accuracy: ',acc_train_xgb)
print('xgb test accuracy: ',acc_test_xgb)
print('xgb test ROC AUC: ',roc_test_xgb)

# LightGBM

lgbm = LGBMClassifier()
clr = lgbm.fit(X_train, y_train)
acc_train_lgbm = round(lgbm.score(X_train, y_train) * 100, 2)
acc_test_lgbm = round(lgbm.score(X_test, y_test) * 100, 2)
roc_test_lgbm = round(roc_auc_score(y_test, clr.predict_proba(X_test)[:, 1]),3)
print('lgbm train accuracy: ',acc_train_lgbm)
print('lgbm test accuracy: ',acc_test_lgbm)
print('lgbm test ROC AUC: ',roc_test_lgbm)
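
The cell above repeats the same fit-and-score steps for every model. One possible way to factor that out is sketched below, together with a majority-class baseline that shows why accuracy alone can flatter a model when one class is rare. The helper name `evaluate`, the synthetic data, and the 90/10 class split are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def evaluate(model, X_tr, y_tr, X_te, y_te):
    """Fit a classifier and return train/test accuracy (%) and test ROC AUC."""
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]
    return {
        "train_acc": round(model.score(X_tr, y_tr) * 100, 2),
        "test_acc": round(model.score(X_te, y_te) * 100, 2),
        "test_auc": round(roc_auc_score(y_te, proba), 3),
    }

# Synthetic, imbalanced stand-in data (roughly 90% negatives); not the Carvana data.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=0)

results = {
    # Always predicts the majority class: high accuracy, but no ranking ability.
    "baseline": evaluate(DummyClassifier(strategy="most_frequent"), X_tr, y_tr, X_te, y_te),
    "logreg": evaluate(LogisticRegression(max_iter=1000), X_tr, y_tr, X_te, y_te),
}
for name, scores in results.items():
    print(name, scores)
```

With the notebook's own variables, the same helper could be called as `evaluate(LGBMClassifier(), X_train, y_train, X_test, y_test)`. Comparing each model against the baseline row makes it easier to see how much of its accuracy comes from the class balance alone.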

# Choose the Best Model and Generate Prediction

LGBM Classifier has both the best test accuracy and the best ROC AUC score. Its train and test accuracy are very close, which suggests the model is not overfitting. Retrain the model on all of the training data.
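
A single train/test split gives only one estimate of the train-versus-test gap. As a hedged sketch of a more robust check, the cell below uses `cross_validate` with `return_train_score=True` to compare mean train and validation ROC AUC across folds. It runs on synthetic data with decision trees as stand-in models, so the numbers say nothing about the competition data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the Carvana data.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "deep": DecisionTreeClassifier(random_state=0),                 # unrestricted depth
    "shallow": DecisionTreeClassifier(max_depth=3, random_state=0),
}
gaps = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring="roc_auc",
                            return_train_score=True)
    train_auc = scores["train_score"].mean()
    valid_auc = scores["test_score"].mean()
    # A large positive gap means the model fits the training folds
    # much better than the held-out folds.
    gaps[name] = train_auc - valid_auc
    print(f"{name}: train AUC={train_auc:.3f}, validation AUC={valid_auc:.3f}")
```

Swapping `LGBMClassifier()` in for the trees and passing `train_df` and `targets` would apply the same check to the chosen model.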

In [None]:
clr = lgbm.fit(train_df, targets)
predictions = clr.predict_proba(test_df)[:, 1]
submissions_df['IsBadBuy'] = predictions
submissions_df.to_csv('Submissions_lgbm.csv',index=False)

# Alternative Approach - Neural Network

In [None]:
# Get input dimensions
train_df.shape

In [None]:
initializer = initializers.he_normal()
model = Sequential([
        Dense(32, input_shape=(train_df.shape[1],), activation='relu', kernel_initializer=initializer),  # input width from the data (49 columns here)
        Dense(32, activation='relu', kernel_initializer=initializer),
        Dense(16, activation='relu', kernel_initializer=initializer),
        Dense(1, activation='sigmoid')
    ])
model.summary()

In [None]:
model.compile(optimizer='adam',
                 loss='binary_crossentropy',
                 metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=10, batch_size=512, validation_data=(X_test, y_test), verbose=2)

The validation accuracy of the neural network is lower than that of LGBM, so the neural network is not used in this solution.

# Next Steps

1. Try adding back some of the deleted categorical features. (Adding "marker" did not improve the prediction.)
2. Optimize the parameters for LGBM or other models.