# Don't Get Kicked!

# Analysis Roadmap 

This analysis follows these five main steps:

1. EDA.
2. Data prep with the 4 C's: Correcting, Completing, Creating, and Converting.
3. MLA (Machine Learning Algorithm) comparison and selection.
4. Obtain predictions from the selected algorithms.
5. Optimization of the selected algorithms.

This notebook covers only the first four steps. Optimizing the selected algorithms is planned as a follow-up step to further improve the model.


Disclaimer: This notebook builds on an existing Kaggle notebook with excellent EDA work.

# Results

1. The host didn't reveal the evaluation score for the competition, but comparing the candidate metrics (Precision, Recall, F1-score, and AUC) suggests that it is **Recall**. Recall, defined as the percentage of real "kicks" picked up by the prediction, makes the most sense for this data science task. Hence, in step 3 of the analysis, the Recall score on the validation sets is the primary score for algorithm comparison and selection. LightGBM and XGBoost are the top two algorithms selected. The final prediction was based on LightGBM, which obtained a private score of 0.25335 and a public score of 0.25117, ranking 23rd (top 4%) on the leaderboard.

2. This is an imbalanced classification problem, with 12% positives in the training data set. For most classic algorithms, such as logistic regression, SVM, and KNN, balancing the data is necessary. Balancing can be achieved through oversampling, downsampling, SMOTE, or a class-weight option. **However**, the selected LightGBM and XGBoost algorithms appear to overfit when a higher weight is placed on the positive observations or when oversampled data is used. Comparing weight settings from 0 to 10, the highest score was achieved with the weight set to 0. The SMOTE approach should be tried in follow-up work.

3. Adding a couple of features gave a good lift to the model performance.
The important engineered features are:
* miles_per_yr: average miles driven per year (an abnormal value may indicate odometer tampering)
* depr_per_yr: depreciation per year (an abnormal value may indicate abnormal depreciation)
* diff_current_auction_retail: difference of the current auction price from the current retail price
* Month: purchase month
* Weekday: purchase day of week
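To make the Recall definition in result 1 concrete, here is a minimal sketch that computes it by hand. The labels are made up for illustration, and `recall_score_manual` is a hypothetical helper, not part of the notebook or of any library.

```python
import numpy as np

def recall_score_manual(y_true, y_pred):
    """Recall = true positives / (true positives + false negatives)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    true_pos = np.sum((y_true == 1) & (y_pred == 1))
    false_neg = np.sum((y_true == 1) & (y_pred == 0))
    actual_pos = true_pos + false_neg
    # Guard against a label vector with no positives at all.
    return float(true_pos / actual_pos) if actual_pos else 0.0

# Toy labels (assumed): 4 real kicks, of which the prediction catches 2.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]
toy_recall = recall_score_manual(y_true, y_pred)
print(toy_recall)
```

Note that the false positive in the toy prediction does not affect Recall; it would only lower Precision.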
   

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import math

# Task description

In [2]:
train = pd.read_csv('../input/DontGetKicked/training.csv')
test = pd.read_csv('../input/DontGetKicked/test.csv')

# Initial EDA

## Common Info

In [3]:
print(train.shape)
print(test.shape)

In [4]:
train.dtypes

In [5]:
categorical_features = { "Auction", "Make", "Model", "PurchDate", "Size", "Color",
                        "Trim", "SubModel", "Transmission", "WheelType", "WheelTypeID",
                        "Nationality", "TopThreeAmericanName", "PRIMEUNIT",
                        "AUCGUART", "VNZIP1", "VNST", "IsOnlineSale"
                        }
train["IsBadBuy"] = train["IsBadBuy"].astype("category")
for feature in categorical_features:
    train[feature] = train[feature].astype("category")
    test[feature] = test[feature].astype("category")

## Missing Data

In [6]:
print(train.isnull().sum())

In [7]:
print(test.isnull().sum())

# Duplicates

In [8]:
train[train.duplicated()]

In [9]:
test[test.duplicated()]

# Numerical Features

In [10]:
numerical_features = train.select_dtypes(include = ['float64', 'int64']).columns
train[numerical_features].hist(figsize=(20, 20), color = 'b', bins=30, xlabelsize=8, ylabelsize=8)

In [11]:
#scatterplot to look out for outlier

#initiate fig, ax
fig, ax = plt.subplots(2,2, figsize=(16,8))

#scatterplot of the MMRA... Columns
sns.scatterplot(x='MMRAcquisitionAuctionAveragePrice', 
                y='MMRAcquisitionRetailAveragePrice', data=train,  hue='IsBadBuy', ax=ax[0,0]);
sns.scatterplot(x='MMRAcquisitionAuctionCleanPrice',
                y='MMRAcquisitonRetailCleanPrice', data=train,  hue='IsBadBuy', ax=ax[0,1]);
sns.scatterplot(x='MMRCurrentAuctionCleanPrice', 
                y='MMRCurrentRetailCleanPrice', data=train,  hue='IsBadBuy', ax=ax[1,0]);
sns.scatterplot(x='MMRCurrentAuctionAveragePrice', 
                y='MMRCurrentRetailAveragePrice', data=train,  hue='IsBadBuy', ax=ax[1,1]);

In [12]:
sns.scatterplot(x='VehBCost', y='VehOdo', data=train,  hue='IsBadBuy');

# Categorical Features

In [13]:
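# Index with a list: pandas 2.x raises an error when a set is used as a column indexer.
train[list(categorical_features)].describe()

In [14]:
test[list(categorical_features)].describe()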

# Correlation Matrix

In [15]:
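# Correlate the numeric columns only; categorical columns cannot be correlated directly.
corrMatt = train.select_dtypes(include="number").corr()
# Mask the upper triangle so that each pair of features is shown once.
mask = np.triu(np.ones_like(corrMatt, dtype=bool), k=1)
fig, ax = plt.subplots()
fig.set_size_inches(15, 10)
sns.heatmap(corrMatt, cmap="Blues", mask=mask, vmax=.8, square=True, annot=True)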

# Data prep

<a id="ch5"></a>
## The 4 C's of Data Preparation: Correcting, Completing, Creating, and Converting
In this stage, we clean the data by 1) correcting aberrant values and outliers, 2) completing missing information, 3) creating new features for analysis, and 4) converting fields to the correct format for calculations and presentation.

1. **Correcting:** Reviewing the data, there do not appear to be any aberrant or unacceptable inputs. However, some bad buys have much higher auction and retail prices.
2. **Completing:** There are null values, i.e. missing data. Missing values can be a problem because some algorithms cannot handle nulls and will fail, while others, such as XGBoost and LightGBM, can. There are two common remedies: delete the record, or populate the missing value with a reasonable input. Deleting records is not recommended, especially a large percentage of them, unless they truly represent incomplete records. Instead, it is best to impute the missing values. A basic approach for qualitative data is to impute with the mode; for quantitative data, impute with the mean, the median, or the mean plus a randomized standard deviation.
3. **Creating:** Feature engineering uses existing features to create new ones and tests whether they provide new signal for predicting the outcome.
4. **Converting:** For this dataset, we convert some categorical variables into dummy variables.
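The imputation idea from the Completing step can be sketched on a toy frame as follows. The column values are invented for illustration, and the `toy_*` names are hypothetical; the point is only that the fill values are learned from the training data and then reused on the test data.

```python
import numpy as np
import pandas as pd

# Toy data (assumed values), with one numeric and one categorical column.
toy_train = pd.DataFrame({
    "VehOdo": [50000.0, np.nan, 70000.0, 90000.0],
    "Size": ["MEDIUM", "COMPACT", None, "MEDIUM"],
})
toy_test = pd.DataFrame({
    "VehOdo": [np.nan, 60000.0],
    "Size": [None, "LARGE"],
})

# Learn the fill values on the training data only.
fill_values = {
    "VehOdo": toy_train["VehOdo"].median(),   # median for quantitative data
    "Size": toy_train["Size"].mode().iloc[0], # mode for qualitative data
}

# Apply the same fill values to both frames.
toy_train_filled = toy_train.fillna(fill_values)
toy_test_filled = toy_test.fillna(fill_values)
print(toy_test_filled)
```

`mode()` returns a Series because ties are possible, which is why the first entry is taken explicitly.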

## Delete columns

In [16]:
# Drop WheelTypeID because WheelType carries the same information.
# PurchDate stays in the data for now: Month and Weekday are derived from it later, and it is deleted after that.

train.drop("WheelTypeID", inplace=True, axis=1)
test.drop("WheelTypeID", inplace=True, axis=1)

#train.drop("PurchDate", inplace=True, axis=1)
#test.drop("PurchDate", inplace=True, axis=1)

categorical_features.remove("PurchDate")
categorical_features.remove("WheelTypeID")

In [17]:
# Drop Model, Trim, SubModel, VNZIP1, VNST, Make, and Color because they have too many distinct values.

to_much_cat_delete_candidate = { "Model", "Trim", "SubModel", "VNZIP1", "VNST",  "Make", "Color"}
for d in to_much_cat_delete_candidate:
    train.drop(d, inplace=True, axis=1)
    test.drop(d, inplace=True, axis=1)
    categorical_features.remove(d)

In [18]:
# Drop the correlated candidates: all clean prices (the average prices are kept) and VehYear.

corr_delete_candidate = {
                         "MMRCurrentAuctionCleanPrice",
                         "MMRCurrentRetailCleanPrice",
                         "MMRAcquisitionAuctionCleanPrice",
                         "MMRAcquisitonRetailCleanPrice",
                         "VehYear"
                        }

for d in corr_delete_candidate:
    train.drop(d, inplace=True, axis=1)
    test.drop(d, inplace=True, axis=1)
    

# Completing: NaN handling

In [19]:
print(train.isnull().sum())

In [20]:
print(test.isnull().sum())

In [21]:
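# Impute numeric gaps with the training medians; numeric_only skips the categorical columns.
medians = train.median(numeric_only=True)
train.fillna(medians, inplace=True)
test.fillna(medians, inplace=True)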

In [22]:
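# mode() returns a DataFrame, so take row 0; skip the columns that get their own missing-value category in the next cell.
mode_cols = [c for c in categorical_features if c not in {"PRIMEUNIT", "AUCGUART", "WheelType"}]
train[mode_cols] = train[mode_cols].fillna(train[mode_cols].mode().iloc[0])
test[mode_cols] = test[mode_cols].fillna(train[mode_cols].mode().iloc[0])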

In [23]:
nan_new_cat_candidate = { "PRIMEUNIT", "AUCGUART", "WheelType" }

for c in nan_new_cat_candidate:
    # Treat missing values as their own category instead of imputing them.
    train[c] = train[c].cat.add_categories('Unknown').fillna('Unknown')
    test[c] = test[c].cat.add_categories('Unknown').fillna('Unknown')


In [24]:
train["Transmission"].value_counts(dropna=False)

In [25]:
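# Merge the inconsistently cased label; assign the result back instead of relying on inplace on a column selection.
train["Transmission"] = train["Transmission"].replace("Manual", "MANUAL")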

# Target Variable

In [26]:
train["IsBadBuy"].value_counts(dropna=False)

In [27]:
train["IsBadBuy"].value_counts(normalize=True).plot(kind='bar')

# Categorical Features

In [28]:
df = train.copy()

varlist = ['WheelType','Transmission', 'TopThreeAmericanName','PRIMEUNIT',  'AUCGUART', 'Size',  'Nationality', 'Auction',  'IsOnlineSale']
for var in varlist:
    # share of good and bad buys within each category, sorted by the bad-buy rate
    var_isbadbuy = pd.crosstab(index=df.loc[:, var],
                               columns=df.loc[:, 'IsBadBuy'],
                               normalize="index").sort_values(by=[1], ascending=False)

    # rename columns
    var_isbadbuy.columns = ["no", "yes"]

    # initiate fig and ax
    fig, ax = plt.subplots(figsize=(10, 6))

    # plot
    var_isbadbuy.iloc[:, 0:2].plot(kind='bar', ax=ax, linewidth=1, edgecolor='#000000');

# Converting: label encoding/ one-hot encoding

In [29]:
label_enc_candidate = { "AUCGUART", "WheelType" }

for c in label_enc_candidate:
    train[c] = train[c].cat.codes
    test[c] = test[c].cat.codes
    categorical_features.remove(c)

In [30]:
# Use a list rather than a set so the column order is deterministic.
# get_dummies prefixes each dummy column with its source column name by default.
one_hot_enc_candidates = ["Nationality", "TopThreeAmericanName", "Size", "Auction",
                          "IsOnlineSale", "Transmission", "PRIMEUNIT"]

train = pd.get_dummies(train, columns=one_hot_enc_candidates)
test = pd.get_dummies(test, columns=one_hot_enc_candidates)
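One thing to watch when train and test are one-hot encoded separately, as above, is that a category present in only one of them produces different column sets. The sketch below, on made-up data with hypothetical `toy_*` names, shows one possible way to align the test columns to the training columns with `reindex`. It is an illustration of the idea, not a step the notebook performs.

```python
import pandas as pd

# Toy data (assumed): "MANUAL" appears in train but not in test.
toy_train = pd.DataFrame({"Transmission": ["AUTO", "MANUAL", "AUTO"], "VehOdo": [1, 2, 3]})
toy_test = pd.DataFrame({"Transmission": ["AUTO", "AUTO"], "VehOdo": [4, 5]})

train_enc = pd.get_dummies(toy_train, columns=["Transmission"])
test_enc = pd.get_dummies(toy_test, columns=["Transmission"])

# Give test exactly the training columns, in the same order.
# Dummy columns missing from test are filled with 0; extra test columns are dropped.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_enc.columns))
```

With real data, the target and ID columns would be excluded from the training column list before aligning.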

# Creating: Important Feature Engineering

1. miles_per_yr: average miles driven per year (an abnormal value may indicate odometer tampering)
2. depr_per_yr: depreciation per year (an abnormal value may indicate abnormal depreciation)
3. diff_current_auction_retail: difference of the current auction price from the current retail price
4. Month: purchase month
5. Weekday: purchase day of week
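As a small worked example of the features listed above, the sketch below computes `miles_per_yr`, `Month`, and `Weekday` for two made-up rows. The values and the month/day/year date format are assumptions for illustration only.

```python
import pandas as pd

# Toy rows (assumed values); the first vehicle has VehicleAge == 0.
toy = pd.DataFrame({
    "PurchDate": ["12/7/2009", "6/15/2010"],
    "VehicleAge": [0, 4],
    "VehOdo": [1000, 80000],
})

# The small offset keeps the ratio finite when VehicleAge is 0.
toy["miles_per_yr"] = toy["VehOdo"] / (toy["VehicleAge"] + 0.05)

toy["PurchDate"] = pd.to_datetime(toy["PurchDate"], format="%m/%d/%Y")
toy["Month"] = toy["PurchDate"].dt.month
toy["Weekday"] = toy["PurchDate"].dt.weekday  # Monday=0 ... Sunday=6
print(toy)
```

Because of the offset, a brand-new vehicle with even a modest odometer reading gets a very large `miles_per_yr`, which is the kind of abnormal value the feature is meant to surface.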

In [31]:
# The 0.05 offset avoids a division by zero when VehicleAge is 0.
train['miles_per_yr'] = train['VehOdo'] / (train['VehicleAge'] + 0.05)
test['miles_per_yr'] = test['VehOdo'] / (test['VehicleAge'] + 0.05)

In [32]:

train['depr_per_yr'] = (train['MMRCurrentAuctionAveragePrice'] - train['VehBCost'])/(train['VehicleAge']+0.05)
test['depr_per_yr'] = (test['MMRCurrentAuctionAveragePrice'] - test['VehBCost'])/(test['VehicleAge']+0.05)

In [33]:

train['diff_current_auction_retail'] = train['MMRCurrentRetailAveragePrice']  - train['MMRCurrentAuctionAveragePrice']
test['diff_current_auction_retail'] = test['MMRCurrentRetailAveragePrice']  - test['MMRCurrentAuctionAveragePrice']

In [34]:
train["PurchDate"] = pd.to_datetime(train["PurchDate"])
test["PurchDate"] = pd.to_datetime(test["PurchDate"])

In [35]:
train['Month'] = train["PurchDate"].dt.month
test['Month'] = test["PurchDate"].dt.month


In [36]:
train['Weekday'] = train["PurchDate"].dt.weekday
test['Weekday'] = test["PurchDate"].dt.weekday


In [37]:
del train["PurchDate"], test["PurchDate"]

## Balance data Preparation



In [38]:
# Oversampling is one approach to balancing the data.
count_class_0, count_class_1 = train.IsBadBuy.value_counts()

df_class_0 = train[train['IsBadBuy'] == 0]
df_class_1 = train[train['IsBadBuy'] == 1]

df_class_1_over = df_class_1.sample(count_class_0, replace=True)
df_test_over = pd.concat([df_class_0, df_class_1_over], axis=0)

print(df_test_over.IsBadBuy.value_counts())

df_test_over.IsBadBuy.value_counts().plot(kind='bar', title='Count (target)');
train_over = df_test_over

In [39]:
# obtain the weight for specifying weight option 

scale_pos_weight_num = int(count_class_0/count_class_1)
scale_pos_weight_num
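The two balancing options prepared in this section can be checked on a toy label vector. The sketch below uses invented data with the same 12% positive rate; `toy`, `pos_weight`, and `balanced` are hypothetical names, and the fixed `random_state` is only there to make the example repeatable.

```python
import pandas as pd

# Toy data (assumed): 88 good buys and 12 bad buys.
toy = pd.DataFrame({"IsBadBuy": [0] * 88 + [1] * 12, "x": range(100)})

counts = toy["IsBadBuy"].value_counts()
n_neg, n_pos = int(counts.loc[0]), int(counts.loc[1])

# Option 1: a positive-class weight, here the number of negatives per positive.
pos_weight = n_neg // n_pos

# Option 2: oversample the minority class with replacement up to the majority size.
minority_over = toy[toy["IsBadBuy"] == 1].sample(n_neg, replace=True, random_state=0)
balanced = pd.concat([toy[toy["IsBadBuy"] == 0], minority_over], axis=0)

print(pos_weight)
print(balanced["IsBadBuy"].value_counts())
```

Oversampling duplicates rows, so the balanced frame contains many exact copies of the minority examples.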

# Apply MLA Selection Macro

**When it comes to data modeling, there is no super algorithm that works best in all situations and for all datasets. So the best approach is to try multiple MLAs, tune them, and compare them for your specific scenario. This MLA selection macro does the initial selection by comparing cross-validation scores. In this specific case, we use the mean validation recall score. You can also identify overfitting problems by comparing training vs. validation scores.**
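The cross-validation scores mentioned above come from repeated random train/validation splits. The sketch below illustrates the idea with plain NumPy. It is a simplified stand-in, not scikit-learn's implementation, and `shuffle_split_indices` is a hypothetical helper.

```python
import numpy as np

def shuffle_split_indices(n_samples, n_splits=5, test_size=0.3, seed=0):
    """Yield (train_idx, val_idx) pairs from independent random shuffles."""
    rng = np.random.default_rng(seed)
    n_val = int(round(n_samples * test_size))
    for _ in range(n_splits):
        perm = rng.permutation(n_samples)
        # The first n_val shuffled indices validate, the rest train.
        yield perm[n_val:], perm[:n_val]

splits = list(shuffle_split_indices(n_samples=10))
for train_idx, val_idx in splits:
    print(len(train_idx), len(val_idx))
```

Each split is drawn independently, so a given row can land in the validation part of several splits or of none. Averaging a metric over the splits gives the mean scores used for comparison, and a large gap between the training and validation means points to overfitting.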



In [40]:
import xgboost 
xgboost.set_config(verbosity=0)
#Common Model Algorithms
from sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

#Common Model Helpers
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn import feature_selection
from sklearn import model_selection
from sklearn import metrics


In [41]:
train_df=train.drop(["IsBadBuy",'RefId'], axis=1)

In [42]:
test_df=test.drop(['RefId'], axis=1)
test_df

In [43]:
#Machine Learning Algorithm (MLA) Selection and Initialization


MLA = [
    #ensemble.AdaBoostClassifier(),
    #ensemble.GradientBoostingClassifier(),
    #ensemble.RandomForestClassifier(),
 
    #linear_model.LogisticRegressionCV(),
    #linear_model.RidgeClassifierCV(),
    
    #Trees    
    #tree.DecisionTreeClassifier(),
    #tree.ExtraTreeClassifier(),
         
    #xgboost: http://xgboost.readthedocs.io/en/latest/model.html
    XGBClassifier(use_label_encoder=False), 
    
    LGBMClassifier(),
    ]



cv_split = model_selection.ShuffleSplit(n_splits = 5, test_size = .3, train_size = .7, random_state = 0 ) # evaluate each model on 5 random 70/30 train/validation splits

#create table to compare MLA metrics
MLA_columns = ['MLA Name', 'MLA Parameters','MLA Train Recall Mean', 'MLA Val Recall Mean',  'MLA Time']
MLA_compare = pd.DataFrame(columns = MLA_columns)

#create table to compare MLA predictions
MLA_predict = train["IsBadBuy"].values

#index through MLA and save performance to table
row_index = 0
for alg in MLA:

    #set name and parameters
    MLA_name = alg.__class__.__name__
    MLA_compare.loc[row_index, 'MLA Name'] = MLA_name
    MLA_compare.loc[row_index, 'MLA Parameters'] = str(alg.get_params())
    
    
    cv_results = model_selection.cross_validate(alg, 
                                                train_df, 
                                                train["IsBadBuy"].values,
                                                cv  = cv_split, 
                                                scoring = ('precision', 'recall','f1', 'accuracy'),
                                                return_train_score =True)

    MLA_compare.loc[row_index, 'MLA Time'] = cv_results['fit_time'].mean()
    MLA_compare.loc[row_index, 'MLA Train Precision Mean'] = cv_results['train_precision'].mean() 
    MLA_compare.loc[row_index, 'MLA Val Precision Mean'] = cv_results['test_precision'].mean() 
    
    MLA_compare.loc[row_index, 'MLA Train Recall Mean'] = cv_results['train_recall'].mean() 
    MLA_compare.loc[row_index, 'MLA Val Recall Mean'] = cv_results['test_recall'].mean() 
    
    MLA_compare.loc[row_index, 'MLA Train F1 Mean'] = cv_results['train_f1'].mean() 
    MLA_compare.loc[row_index, 'MLA Val F1 Mean'] = cv_results['test_f1'].mean() 
    MLA_compare.loc[row_index, 'MLA Train Accuracy Mean'] = cv_results['train_accuracy'].mean() 
    MLA_compare.loc[row_index, 'MLA Val Accuracy Mean'] = cv_results['test_accuracy'].mean() 
    
    
    
    #if this is an unbiased random sample, then +/-3 standard deviations (std) from the mean should statistically capture 99.7% of the subsets
    MLA_compare.loc[row_index, 'MLA Val Recall 3*STD'] = cv_results['test_recall'].std()*3   #let's know the worst that can happen!
    

    row_index+=1

    

MLA_compare.sort_values(by = ['MLA Val Recall Mean'], ascending = False, inplace = True)
MLA_compare
#MLA_predict




# XGBOOST

In [44]:
clf = XGBClassifier()

clf.fit(train_df, train["IsBadBuy"])


xgb_predictions = clf.predict_proba(test_df)[:, 1]
submit = test[['RefId']].copy()  # copy so that adding a column does not trigger SettingWithCopyWarning
submit['IsBadBuy'] = xgb_predictions
submit.to_csv('xgboost_baseline.csv', index = False)

In [45]:
import xgboost as xgb
fig, ax = plt.subplots(figsize=(10,10))
xgb.plot_importance(clf, max_num_features=50, height=0.5, ax=ax,importance_type='gain')
plt.show()

# LGBMClassifier

In [46]:
from lightgbm import LGBMClassifier

In [47]:
# plot_precision_recall_curve and plot_roc_curve were removed in scikit-learn 1.2 and are not used here.
from sklearn.metrics import precision_recall_curve, roc_curve
train_labels = train["IsBadBuy"]
pd.options.mode.chained_assignment = None  # default='warn'
lgb = LGBMClassifier()
lgb.fit(train_df, train_labels)
lgbm_predictions = lgb.predict_proba(test_df)[:, 1]
submit = test[['RefId']]
submit['IsBadBuy'] = lgbm_predictions
submit.to_csv('lightgbm_baseline.csv', index = False)

In [48]:
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Use the fitted LightGBM model (lgb) here, not the XGBoost model (clf).
feature_imp = pd.DataFrame(sorted(zip(lgb.feature_importances_, train_df.columns)), columns=['Value', 'Feature'])

plt.figure(figsize=(20, 10))
sns.barplot(x="Value", y="Feature", data=feature_imp.sort_values(by="Value", ascending=False))
plt.title('LightGBM Feature Importances')
plt.tight_layout()
# Save before calling show(); saving afterwards can write an empty figure.
plt.savefig('lgbm_importances-01.png')
plt.show()

# LGBM Optimization To be Continued

# Xgboost Optimization To be Continued