<a href="https://colab.research.google.com/github/vanaja-penumatsa-dev/kaggle-dont-get-kicked/blob/main/Modelling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import drive
drive.mount('/content/drive')
DPATH = '/content/drive/MyDrive/apporchid/data/'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import pandas as pd

train = pd.read_csv(DPATH+'train_preprocess.csv')
test = pd.read_csv(DPATH+'test_preprocess.csv')

In [None]:
train['IsBadBuy'].value_counts()

0    64005
1     8976
Name: IsBadBuy, dtype: int64

**Observation**

*   Highly imbalanced dataset: 8,976 of 72,981 rows (about 12.3%) are bad buys.
*   Any model must beat the majority-class baseline: always predicting 0 already gives about 87.7% accuracy.
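
The roughly 87% baseline follows directly from the class counts above. A minimal sketch of that arithmetic, using only the counts printed by `value_counts()` (the variable names are illustrative, not part of the notebook):

```python
# Class counts copied from the value_counts() output above.
good_buys = 64005  # IsBadBuy == 0
bad_buys = 8976    # IsBadBuy == 1
total = good_buys + bad_buys

# A model that always predicts the majority class (0) is right on every good buy.
baseline_accuracy = good_buys / total
bad_buy_rate = bad_buys / total

print(f"majority-class accuracy: {baseline_accuracy:.3f}")
print(f"share of bad buys: {bad_buy_rate:.3f}")
```

Any accuracy figure reported later should be read against this baseline rather than against 50%.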

<h1>Choosing a performance metric</h1>



*   Since the dataset is imbalanced, accuracy is not the right metric for evaluating models.
*   F1-score or ROC AUC are better candidates.
*   For this problem we most want to avoid classifying a BadBuy (1) as a GoodBuy (0), i.e. false negatives, so recall on the BadBuy class matters most.
*   We can use a weighted variant of the F1 score, the F-beta score, which lets us adjust the relative weights of precision and recall. A beta above 1 favours recall; the models below use beta = 2.
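
To make the role of the weight concrete, here is a minimal sketch of the F-beta formula computed from confusion-matrix counts. The helper name `f_beta` and the counts are hypothetical, chosen only for illustration; in practice `sklearn.metrics.fbeta_score(y_true, y_pred, beta=...)` computes the same quantity from label arrays.

```python
def f_beta(tp, fp, fn, beta):
    """F-beta from counts: beta > 1 weights recall more, beta < 1 weights precision more."""
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0

# Hypothetical classifier that misses many bad buys: high precision, low recall.
tp, fp, fn = 20, 5, 80   # precision = 0.80, recall = 0.20

f1 = f_beta(tp, fp, fn, beta=1)
f2 = f_beta(tp, fp, fn, beta=2)        # recall-heavy: punishes the 80 misses harder
f_half = f_beta(tp, fp, fn, beta=0.5)  # precision-heavy

print(f"F0.5={f_half:.3f}  F1={f1:.3f}  F2={f2:.3f}")
```

On these counts the recall-heavy score is the lowest of the three, which is the behaviour we want from a metric that should penalise missed bad buys.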





<h1>1. Data Postprocessing</h1>

In [None]:
train.head()

Unnamed: 0,IsBadBuy,Auction,VehYear,VehicleAge,Make,Model,Trim,SubModel,Color,Transmission,WheelTypeID,WheelType,VehOdo,Nationality,Size,TopThreeAmericanName,MMRAcquisitionAuctionAveragePrice,MMRAcquisitionAuctionCleanPrice,MMRAcquisitionRetailAveragePrice,MMRAcquisitonRetailCleanPrice,MMRCurrentAuctionAveragePrice,MMRCurrentAuctionCleanPrice,MMRCurrentRetailAveragePrice,MMRCurrentRetailCleanPrice,PRIMEUNIT,AUCGUART,BYRNO,VNZIP1,VNST,VehBCost,IsOnlineSale,WarrantyCost,quality,month,WC_Perc
0,0,ADESA,2006,3,MAZDA,MAZDA3,i,4D SEDAN I,RED,AUTO,1.0,Alloy,29682.0,OTHER ASIAN,MEDIUM,OTHER,8155.0,9829.0,11636.0,13600.0,7451.0,8552.0,11597.0,12409.0,NO,Other,21973,33619,FL,7100.0,0,1113,Below Avg,12,19.137466
1,0,ADESA,2004,5,DODGE,1500 RAM PICKUP 2WD,ST,QUAD CAB 4.7L SLT,WHITE,AUTO,1.0,Alloy,18718.6,AMERICAN,LARGE TRUCK,CHRYSLER,6854.0,8383.0,10897.0,12572.0,7456.0,9222.0,11374.0,12791.0,NO,Other,19638,33619,FL,7600.0,0,1053,Avg,12,21.652422
2,0,ADESA,2005,4,DODGE,STRATUS V6,SXT,4D SEDAN SXT FFV,MAROON,AUTO,2.0,Covers,18451.75,AMERICAN,MEDIUM,CHRYSLER,3202.0,4760.0,6943.0,8457.0,4035.0,5557.0,7146.0,8702.0,NO,Other,19638,33619,FL,4900.0,0,1389,Above Avg,12,10.583153
3,0,ADESA,2004,5,DODGE,NEON,SXT,4D SEDAN,SILVER,AUTO,1.0,Alloy,13123.4,AMERICAN,COMPACT,CHRYSLER,1893.0,2675.0,4658.0,5690.0,1844.0,2646.0,4375.0,5518.0,NO,Other,19638,33619,FL,4100.0,0,630,Above Avg,12,19.52381
4,0,ADESA,2005,4,FORD,FOCUS,ZX3,2D COUPE ZX3,SILVER,MANUAL,2.0,Covers,17341.75,AMERICAN,COMPACT,FORD,3913.0,5054.0,7723.0,8707.0,3247.0,4384.0,6739.0,7911.0,NO,Other,19638,33619,FL,4000.0,0,1020,Avg,12,11.764706


In [None]:
cat_cols = ['Auction', 'VehYear', 'Make', 'Model', 'Trim', 'SubModel', 'Color', 'Transmission', 'WheelTypeID', 'WheelType', \
              'Nationality', 'Size', 'TopThreeAmericanName', 'PRIMEUNIT', 'AUCGUART', 'BYRNO', 'VNZIP1', 'VNST', 'IsOnlineSale', 'month', 'quality']
num_cols = ['VehicleAge', 'VehOdo', 'MMRAcquisitionAuctionAveragePrice', 'MMRAcquisitionAuctionCleanPrice', 'MMRAcquisitionRetailAveragePrice', \
                'MMRAcquisitonRetailCleanPrice', 'MMRCurrentAuctionAveragePrice', 'MMRCurrentAuctionCleanPrice', \
                'MMRCurrentRetailAveragePrice', 'MMRCurrentRetailCleanPrice', 'VehBCost', 'WarrantyCost', 'WC_Perc']

In [None]:
assert len(cat_cols) + len(num_cols) == train.shape[1]-1

In [None]:
y = train['IsBadBuy'].values
X = train.drop(['IsBadBuy'],axis=1)
X_test = test

from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [None]:
X_train.shape, y_train.shape, X_valid.shape, y_valid.shape, X_test.shape

((58384, 34), (58384,), (14597, 34), (14597,), (48707, 35))

<h2>1a) Standardising numerical features</h2>

In [None]:
from sklearn.preprocessing import StandardScaler
import numpy as np

def numeric_postprocess(X_train_post, X_valid_post, X_test_post):
    """Append standardised numeric columns to the already-encoded feature arrays.

    Reads the raw columns from the global X_train, X_valid and X_test.
    """
    for col in num_cols:
        # Fit on the training split only, then apply the same scaling to every split.
        scaler = StandardScaler()
        scaler.fit(X_train[col].values.reshape(-1,1))
        X_train_post = np.append(X_train_post, scaler.transform(X_train[col].values.reshape(-1,1)), axis=1)
        X_valid_post = np.append(X_valid_post, scaler.transform(X_valid[col].values.reshape(-1,1)), axis=1)
        X_test_post = np.append(X_test_post, scaler.transform(X_test[col].values.reshape(-1,1)), axis=1)
    return X_train_post, X_valid_post, X_test_post

<h2>1b) Encoding categorical features</h2>



*   I am using **target encoding** for the categorical features: a few features have too many distinct values for one-hot encoding, and they have no ordinal relationship that would justify label encoding.
*   **Target encoding** replaces each category with a numerical value derived from the target variable, an estimate of the posterior probability of a bad buy given that category.
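
As a rough illustration of the idea, the sketch below replaces each category with a smoothed mean of the target for that category. It is not the `TargetEncoder` used in this notebook; the function names, the smoothing weight `m`, and the toy data are assumptions made for the example.

```python
from collections import defaultdict

def fit_target_encoding(categories, targets, m=2.0):
    """Map each category to a smoothed mean of the target.

    m acts like a count of pseudo-observations at the global mean, so rare
    categories are pulled towards the overall bad-buy rate.
    """
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    mapping = {c: (sums[c] + m * global_mean) / (counts[c] + m) for c in counts}
    return mapping, global_mean

def transform_target_encoding(categories, mapping, global_mean):
    """Encode categories; unseen ones fall back to the global mean."""
    return [mapping.get(c, global_mean) for c in categories]

# Toy data (hypothetical): vehicle make and whether the purchase was a bad buy.
makes = ["FORD", "FORD", "FORD", "DODGE", "DODGE", "MAZDA"]
is_bad = [1, 0, 0, 1, 1, 0]

mapping, prior = fit_target_encoding(makes, is_bad)
encoded = transform_target_encoding(["FORD", "MAZDA", "KIA"], mapping, prior)
print(mapping)
print(encoded)
```

The mapping is learned from the training split only and then applied unchanged to the validation and test splits, which is also how `categoric_postprocess` below uses its encoder.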



In [None]:
from target_encoder import TargetEncoder

def categoric_postprocess():

    encoder = TargetEncoder(cat_cols)
    encoder.fit(X_train[cat_cols], y_train)
    X_train_post = encoder.transform(X_train[cat_cols])
    X_valid_post = encoder.transform(X_valid[cat_cols])
    X_test_post = encoder.transform(X_test[cat_cols])
    return X_train_post, X_valid_post, X_test_post

In [None]:
X_train_post, X_valid_post, X_test_post = categoric_postprocess()
X_train_post, X_valid_post, X_test_post = numeric_postprocess(X_train_post, X_valid_post, X_test_post)

In [None]:
X_train_post.shape, X_valid_post.shape, X_test_post.shape

((58384, 34), (14597, 34), (48707, 34))

<h1>Model1. Kernel-SVM</h1>

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import fbeta_score
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
import matplotlib.pyplot as plt

# Candidate values for the SVM regularisation parameter C (called alpha in this cell).
alpha_values = [10 ** x for x in range(-5, 2)]
# The search over C is disabled by wrapping it in a string; the chosen value is hardcoded below.
'''
max_fbeta = float('-inf')
best_alpha = None
fbeta_array=[]
for alpha in alpha_values:
    clf = SVC(kernel='rbf', random_state=1, C=alpha)
    clf.fit(X_train_post, y_train)
    y_pred = clf.predict(X_valid_post)
    current_fbeta = fbeta_score(y_valid, y_pred, beta=2)
    acc = accuracy_score(y_valid, y_pred)
    fbeta_array.append(current_fbeta)
    print('For values of alpha = ', alpha, "The fbeta score is:", current_fbeta, "accuracy is:", acc)
    if current_fbeta > max_fbeta:
        max_fbeta = current_fbeta
        best_alpha = alpha

fig, ax = plt.subplots()
ax.plot(alpha_values, fbeta_array,c='g')
for i, txt in enumerate(np.round(fbeta_array,3)):
    ax.annotate((alpha_values[i],np.round(txt,3)), (alpha_values[i],fbeta_array[i]))
plt.grid()
plt.title("F-beta score for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("F-beta score")
plt.show()
'''
best_alpha = 10
best_clf = SVC(kernel='rbf', random_state=1, C=best_alpha)
best_clf.fit(X_train_post, y_train)

# fbeta_score expects (y_true, y_pred); beta=2 weights recall more than precision.
y_pred = best_clf.predict(X_train_post)
train_fbeta = fbeta_score(y_train, y_pred, beta=2)
acc = accuracy_score(y_train, y_pred)
print("Train fbeta for best c", best_alpha, round(train_fbeta, 6), "accuracy is:", acc)
print(confusion_matrix(y_train, y_pred))

y_pred = best_clf.predict(X_valid_post)
valid_fbeta = fbeta_score(y_valid, y_pred, beta=2)
acc = accuracy_score(y_valid, y_pred)
print("Validation fbeta for best c", best_alpha, round(valid_fbeta, 6), "accuracy is:", acc)
print(classification_report(y_valid, y_pred))
print(confusion_matrix(y_valid, y_pred))

Train fbeta for best c 10 0.012846 accuracy is: 0.8781858043299534
[[51198     5]
 [ 7107    74]]
Validation fbeta for best c 10 0.004863 accuracy is: 0.8768240049325203
              precision    recall  f1-score   support

           0       0.88      1.00      0.93     12802
           1       0.41      0.00      0.01      1795

    accuracy                           0.88     14597
   macro avg       0.64      0.50      0.47     14597
weighted avg       0.82      0.88      0.82     14597

[[12792    10]
 [ 1788     7]]


<h1>Model2. Gaussian Naive Bayes</h1>

In [None]:
from sklearn.naive_bayes import GaussianNB

clf = GaussianNB()
clf.fit(X_train_post, y_train)

# fbeta_score expects (y_true, y_pred); beta=2 weights recall more than precision.
y_pred = clf.predict(X_train_post)
train_fbeta = fbeta_score(y_train, y_pred, beta=2)
acc = accuracy_score(y_train, y_pred)
print("Train fbeta ", round(train_fbeta, 6), "accuracy is:", acc)
print(confusion_matrix(y_train, y_pred))

y_pred = clf.predict(X_valid_post)
valid_fbeta = fbeta_score(y_valid, y_pred, beta=2)
acc = accuracy_score(y_valid, y_pred)
print("Validation fbeta ", round(valid_fbeta, 6), "accuracy is:", acc)
print(classification_report(y_valid, y_pred))
print(confusion_matrix(y_valid, y_pred))

# Write the test-set predictions to a CSV file.
y_pred = clf.predict(X_test_post)
test_gnb = pd.DataFrame()
test_gnb['IsBadBuy'] = y_pred
test_gnb['RefId'] = test['RefId']

test_gnb.to_csv(DPATH+"test_gnb.csv", index=False)

Train fbeta  0.381453 accuracy is: 0.7563716086599068
[[40940 10263]
 [ 3961  3220]]
Validation fbeta  0.366028 accuracy is: 0.7554291977803659
              precision    recall  f1-score   support

           0       0.91      0.80      0.85     12802
           1       0.23      0.43      0.30      1795

    accuracy                           0.76     14597
   macro avg       0.57      0.61      0.58     14597
weighted avg       0.83      0.76      0.78     14597

[[10259  2543]
 [ 1027   768]]


<h1>Model3. RandomForest Classifier</h1>

In [None]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(max_depth=4, random_state=0)
clf.fit(X_train_post, y_train)

# fbeta_score expects (y_true, y_pred); beta=2 weights recall more than precision.
y_pred = clf.predict(X_train_post)
train_fbeta = fbeta_score(y_train, y_pred, beta=2)
acc = accuracy_score(y_train, y_pred)
print("Train fbeta ", round(train_fbeta, 6), "accuracy is:", acc)
print(confusion_matrix(y_train, y_pred))

y_pred = clf.predict(X_valid_post)
valid_fbeta = fbeta_score(y_valid, y_pred, beta=2)
acc = accuracy_score(y_valid, y_pred)
print("Validation fbeta ", round(valid_fbeta, 6), "accuracy is:", acc)
print(classification_report(y_valid, y_pred))
print(confusion_matrix(y_valid, y_pred))

# Write the test-set predictions to a CSV file.
y_pred = clf.predict(X_test_post)
test_rf = pd.DataFrame()
test_rf['IsBadBuy'] = y_pred
test_rf['RefId'] = test['RefId']

test_rf.to_csv(DPATH+"test_ranfor.csv", index=False)

Train fbeta  0.002088 accuracy is: 0.8772095094546452
[[51203     0]
 [ 7169    12]]
Validation fbeta  0.001392 accuracy is: 0.8771665410700829
              precision    recall  f1-score   support

           0       0.88      1.00      0.93     12802
           1       1.00      0.00      0.00      1795

    accuracy                           0.88     14597
   macro avg       0.94      0.50      0.47     14597
weighted avg       0.89      0.88      0.82     14597

[[12802     0]
 [ 1793     2]]


<h1>Model Selection criteria</h1>

*   I chose Gaussian Naive Bayes because the features were found to be almost independent of each other. Target encoding also suits it: each category is replaced by a posterior probability with respect to the target variable, which is close in spirit to what Naive Bayes itself estimates. GNB performed best of the three. Its accuracy (about 76%) is below the majority-class baseline, but it is the only classifier with meaningful recall on the BadBuy class (0.43 on validation), which means it is learning to detect bad buys rather than predicting 0 almost everywhere.

*   I chose RBF SVM because SVMs work well with non-linear, high-dimensional data. It did not perform well here: it flagged only 7 of the 1,795 bad buys in the validation set.

*   I chose RandomForest because ensemble models usually perform well on average, but that is not what we see here: with max_depth=4 it flagged only 2 of the 1,795 bad buys in the validation set.







<h1>How can we improve?</h1>



*   Further data analysis
*   Hyperparameter tuning with grid search
*   Building more complex models
*   **Data augmentation** with oversampling techniques such as **SMOTE**, because well-balanced data often matters more than model complexity.
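
The core of SMOTE is interpolation between a minority-class point and one of its nearest minority-class neighbours. The sketch below shows only that step on toy 2-D points; the function name, `k`, and the data are assumptions for illustration, and a maintained implementation (for example the `imbalanced-learn` package) would be the practical choice.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating towards near neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (excluding x itself).
        others = [p for p in minority if p is not x]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbours)
        u = rng.random()  # position along the segment from x to nb
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(x, nb)))
    return synthetic

# Toy minority-class points (hypothetical, already scaled features).
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(minority, n_new=5)
for p in new_points:
    print(p)
```

Oversampling of this kind should be applied to the training split only, so that the validation and test sets keep the original class balance.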


<h1>Difficulties I faced</h1>

<p align='justify'>Since I chose target encoding, I needed an encoder that the stable scikit-learn release did not provide: from sklearn.preprocessing import CategoricalEncoder is only available in the development version 0.20.dev0. I therefore started from the source code published at https://brendanhasz.github.io/2019/03/04/target-encoding#target-encoding and adapted it to suit this use case.</p>