# **Introduction**

### This notebook is a solution to the kaggle challenge of Don't Get Kicked!

### One of the biggest challenges of an auto dealership purchasing a used car at an auto auction is the risk of that the vehicle might have serious issues that prevent it from being sold to customers. The auto community calls these unfortunate purchases "kicks".

### Kicked cars often result when there are tampered odometers, mechanical issues the dealer is not able to address, issues with getting the vehicle title from the seller, or some other unforeseen problem. Kick cars can be very costly to dealers after transportation cost, throw-away repair work, and market losses in reselling the vehicle.

### Modelers who can figure out which cars have a higher risk of being kick can provide real value to dealerships trying to provide the best inventory selection possible to their customers.

### The challenge of this competition is to predict if the car purchased at the Auction is a Kick (bad buy).

# Breakdown of the script
### 1. Load Data
### 2. Preprocess data - process missing values and categorical values / feature normalization / Drop features 
### 3. Principal component analysis
### 4. Undersample the majority class(0) for training
### 5. Random forest feature importance
### 6. Train Logistic regression, Random forest, bagging model and a feed forward neural network
### 7. Check correlation of the predictions of the above models 
### 8. Create an ensemble of the above classifiers
### 9. Function to calcuate the Gini score of a set of predictions
### 10. Create a submission file for kaggle

# Import required libraries

In [149]:
import pandas as pd
import numpy as np

from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline
from sklearn import svm
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LogisticRegression
from collections import Counter
from sklearn.metrics import f1_score

#Importing all libraries
import sklearn
from sklearn.model_selection import KFold, cross_val_score, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

import copy
from __future__ import print_function
import numpy as np

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten, LSTM, Bidirectional
from keras.layers.convolutional import Conv1D, Conv2D
from keras.layers.pooling import MaxPooling1D, MaxPooling2D
from keras.callbacks import ModelCheckpoint
from keras.optimizers import Adam
from keras.constraints import maxnorm
from keras.layers.wrappers import TimeDistributed
from keras.utils.np_utils import to_categorical
import keras

import pickle
np.random.seed(1789)

# Preprocessing functions

In [2]:
'''
1. Fill up missing values (primitive solution)
- Nominal Columns: Filled with mode 
- Numeric Columns: Filled with median
2. Convert nominal columns to one-hot
'''

    
nominal_cols = ['Auction', 'Make', 'Trim', 'TopThreeAmericanName', 'Model', 'SubModel', 'Color', 'Transmission', 'WheelType', 
                'PRIMEUNIT', 'AUCGUART', 'Nationality', 'Size', 'VNST']

num_cols = ['VehicleAge', 'WheelTypeID', 'VehOdo', 'BYRNO', 'VNZIP1', 'IsOnlineSale', 'WarrantyCost'] + ['MMRAcquisitionAuctionAveragePrice', 'MMRAcquisitionAuctionCleanPrice',
                        'MMRAcquisitionRetailAveragePrice', 'MMRAcquisitonRetailCleanPrice',
                        'MMRCurrentAuctionAveragePrice', 'MMRCurrentAuctionCleanPrice',
                        'MMRCurrentRetailAveragePrice', 'MMRCurrentRetailCleanPrice']

global df

def preprocess(dataframe):
    global df
    
    df = dataframe
    df = df.drop(['RefId'], axis=1)
    return df 


def fill_missing_values(df):
    '''
    This function fills in the missing values
    Currently it's a simple solution
    - Mode for nominal columns
    - Median for numerical columns
    '''

    for col in num_cols:
        df[col] = df[col].fillna(df[col].median())
    
    for col in nominal_cols:
        mode = df[col].mode()[0]
        df[col] = df[col].fillna(mode)

    return df

def show_nominal_values(df):
    
    for col in nominal_cols:
        print col, len(Counter(df[col])) 

def feature_engineering(df):
    '''
    - Drop PurchDate & VehYear since PurchDate = VehYear + VehicleAge -> Features correlated
    - All the Average Prices are related => so just take the average 
    '''
    df = df.drop(['PurchDate'], axis=1)
    df = df.drop(['VehYear'], axis=1)
#     df = merge_auction_ave_price(df)
    
    return df 

def convert_nominal_cols(df):
    '''
    This function converts nominal cols to one-hot vectors
    '''
    global nominal_cols
    df_with_dummies = pd.get_dummies(df, columns = nominal_cols)
    
    return df_with_dummies

    
def merge_auction_ave_price(dataframe):
    '''
    This function takes the average of the 8 variables of the auction average prices
    '''
    auction_averages = ['MMRAcquisitionAuctionAveragePrice', 'MMRAcquisitionAuctionCleanPrice',
                        'MMRAcquisitionRetailAveragePrice', 'MMRAcquisitonRetailCleanPrice',
                        'MMRCurrentAuctionAveragePrice', 'MMRCurrentAuctionCleanPrice',
                        'MMRCurrentRetailAveragePrice', 'MMRCurrentRetailCleanPrice']
                        
    
    dataframe['AuctionAve'] = sum(dataframe[ave] for ave in auction_averages) /len(auction_averages)
    dataframe = dataframe.drop(auction_averages, axis=1)
    
    return dataframe

def apply_pca(train_dataframe, test_dataframe, n_features):
        
    '''
    Function applies PCA on feature vectors and returns the trasformations and the transformation matrix
    '''
    pca = PCA(n_components=n_features)
    pca.fit(np.array(train_dataframe))

    train_dataframe_pca = pca.transform(train_dataframe)

    test_dataframe_pca = pca.transform(test_dataframe)
    
    return [train_dataframe, test_dataframe, train_dataframe_pca, test_dataframe_pca, pca]

# Preprocessing Pipeline

In [107]:
class pipeline():

    def __init__(self, train_file, test_file):
    
        self.train_dataframe = pd.read_csv(train_file, header=0) 
        self.test_dataframe = pd.read_csv(test_file, header=0)
        self.test_dataframe_refId = self.test_dataframe['RefId']
        
        self.y_train = np.array(self.train_dataframe["IsBadBuy"])
        self.train_dataframe.drop("IsBadBuy", axis=1, inplace=True)
        
#         print set(list(self.train_dataframe)).difference(list(self.test_dataframe))
        
        self.preprocess_data()
    
    def preprocess_data(self):
        
        categories = []
        
        self.train_dataframe = feature_engineering(self.train_dataframe)
        self.test_dataframe = feature_engineering(self.test_dataframe)
        
        self.train_dataframe = fill_missing_values(self.train_dataframe)
        self.test_dataframe = fill_missing_values(self.test_dataframe)
        
        self.train_dataframe["dataset"] = "train"
        self.test_dataframe["dataset"] = "test"
        
        self.data = pd.concat([self.train_dataframe, self.test_dataframe])
        self.data = convert_nominal_cols(self.data)
        
        self.train_dataframe = self.data[self.data["dataset"] == "train"]
        self.test_dataframe = self.data[self.data["dataset"] == "test"]

        self.train_dataframe.drop("dataset", axis=1, inplace=True)
        self.test_dataframe.drop("dataset", axis=1, inplace=True)
        
        print("Preprocessing Data")
        self.train_dataframe = preprocess(self.train_dataframe)
        
        print("Preprocessing Test")
        self.test_dataframe = preprocess(self.test_dataframe)

        # Add dummy column to test dataframe to match dimensions
        # Quick hack: should take away
        # 		self.test_dataframe['IsBadBuy'] = 0

In [108]:
pipe = pipeline('training.csv', 'test.csv')

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Preprocessing Data
Preprocessing Test


In [5]:
#len(list(pipe.train_dataframe)), len(list(pipe.test_dataframe))

In [6]:
#Counter(pipe.y_train)

Counter({0: 64007, 1: 8976})

## Apply PCA to reduce from 2000+ features to a 500

In [6]:
""" Apply PCA
    train_dataframe_int -> before PCA
    train_dataframe -> After PCA -> top 500 featurees"""
[train_dataframe_init, test_dataframe_init, train_dataframe, test_dataframe, pca] = apply_pca(pipe.train_dataframe, pipe.test_dataframe, n_features=500)

In [8]:
#train_dataframe.shape, test_dataframe.shape, train_dataframe_init.shape, test_dataframe_init.shape

((72983, 500), (48707, 500), (72983, 2337), (48707, 2337))

In [7]:
# Function to shuffle data
def shuffle_data(x_train, y_train_zero):
    idx = np.random.randint(len(y_train_zero), size=int(len(y_train_zero)))
    y_train_zero = y_train_zero[idx]
    x_train = x_train[idx, :]
    return x_train, y_train_zero

# Undersampling Class 0 and create train, val splits


In [8]:
def filter_data(train_dataframe, y_train, ratio=0.2):
    
    y_train_zero = y_train[y_train == 0]
    train_dataframe_zero = train_dataframe[y_train == 0]

    y_train_one = y_train[y_train == 1]
    train_dataframe_one = train_dataframe[y_train == 1]

    idx = np.random.randint(len(y_train_zero), size=int(ratio*len(y_train_zero)))

    y_train_zero = y_train_zero[idx]
    train_dataframe_zero = train_dataframe_zero[idx,:]

    train_dataframe_new = np.concatenate([train_dataframe_zero, train_dataframe_one])
    y_train_new = np.concatenate([y_train_zero, y_train_one])

    train_X, val_X, train_y, val_y = train_test_split(train_dataframe_new, y_train_new,
                                                        test_size=0.25, random_state=4531)
    
    return [train_X, val_X, train_y, val_y]

In [9]:
# Undersample class 0 and shuffle data - Select 15% of data from class 0
[train_X, val_X, train_y, val_y] = filter_data(train_dataframe, pipe.y_train, 0.15)
[train_X_init, val_X_init, train_y_init, val_y_init] = filter_data(np.array(train_dataframe_init), pipe.y_train, 0.15)

train_X, train_y = shuffle_data(train_X, train_y)
train_X_init, train_y_init = shuffle_data(train_X_init, train_y_init)

In [12]:
#train_X.shape, train_y.shape, train_X_init.shape, train_y_init.shape

((13932, 500), (13932,), (13932, 2337), (13932,))

In [13]:
#len(train_y), Counter(train_y), Counter(val_y)

(13932, Counter({0: 7076, 1: 6856}), Counter({0: 2410, 1: 2235}))

In [14]:
#train_X_init.shape

(13932, 2337)

## Feature normalize the data

In [11]:
#PCA on test data
test_X_init = np.array(pipe.test_dataframe)
test_X = pca.transform(test_X_init)

# Feature normalize after PCA
test_X_n = (test_X - np.mean(train_X, axis = 0)) / (np.std(train_X, axis=0) + 0.01)
test_X_init_n = (test_X_init - np.mean(train_X_init, axis = 0)) / (np.std(train_X_init, axis=0) + 0.01)

In [12]:
# Feature normalize after PCA
train_X_n = (train_X - np.mean(train_X, axis = 0)) / (np.std(train_X, axis=0) + 0.01)
val_X_n = (val_X - np.mean(train_X, axis = 0)) / (np.std(train_X, axis=0) + 0.01)

In [13]:
# Feature normalize without PCA
train_X_init_n = (train_X_init - np.mean(train_X_init, axis = 0)) / (np.std(train_X_init, axis=0) + 0.01)
val_X_init_n = (val_X_init - np.mean(train_X_init, axis = 0)) / (np.std(train_X_init, axis=0) + 0.01)

# Random forest, feature importances

In [None]:
forest = RandomForestClassifier(n_estimators=100,
                              random_state=0,
                               )

forest.fit(train_X, train_y)

# Calculate feature importances
importances = forest.feature_importances_

std = np.std([tree.feature_importances_ for tree in forest.estimators_],
             axis=0)

# Sort in descending order of importance
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

In [14]:
#Compute accuaracy
def accuracy(matrix):
    return (np.trace(matrix)) * 1.0 / np.sum(matrix)

In [83]:
'''
For K fold cross validation
'''
k_fold = 10
kf_total = KFold(n_splits=k_fold)

# Logistic Regression

In [95]:
# Logistic regression - choose parameters using k fold cross validation and grid search
def logistic(train_X, train_y, val_X, val_y, weight="balanced"):
    
    logistic = LogisticRegression(C=0.5, penalty="l2", class_weight=weight, max_iter=100, verbose=1)
    logistic_clf = GridSearchCV(estimator=logistic, param_grid=dict(C=[0.5, 0.7, 1.0], class_weight=['balanced', None]), cv=10, n_jobs=-1)

    cms = [confusion_matrix(train_y[test], logistic_clf.fit(train_X[train],train_y[train]).predict(train_X[test])) for train, test in kf_total.split(train_X)]
    print(cms)

    logistic = LogisticRegression(C=logistic_clf.best_estimator_.C, penalty="l2", class_weight=logistic_clf.best_estimator_.class_weight, max_iter=100, verbose=1)
    
    logistic.fit(train_X, train_y)
    pred = logistic.predict(val_X)
    print (confusion_matrix(val_y, pred))
    print (accuracy(confusion_matrix(val_y, pred)))
    print (f1_score(val_y, pred))
    return logistic

In [96]:
logistic_clf = logistic(train_X, train_y, val_X, val_y)

[LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][array([[441, 241],
       [256, 456]]), array([[475, 233],
       [228, 458]]), array([[419, 267],
       [230, 477]]), array([[417, 309],
       [205, 462]]), array([[451, 259],
       [271, 412]]), array([[425, 293],
       [211, 464]]), array([[453, 261],
       [252, 427]]), array([[414, 309],
       [191, 479]]), array([[434, 280],
       [238, 441]]), array([[421, 274],
       [228, 470]])]
[LibLinear][[1491  919]
 [ 775 1460]]
0.635306781485
0.632856523624


In [38]:
#logistic_clf = logistic(train_X_n, train_y, val_X_n, val_y)

[LibLinear]
[[1553  857]
 [ 854 1381]]
0.631646932185
0.617482673821


In [25]:
#logistic_clf = logistic(train_X_init, train_y_init, val_X_init, val_y_init)

[LibLinear]
[[1467  943]
 [ 797 1438]]
0.625403659849
0.623050259965


In [None]:
#logistic_clf = logistic(train_X, train_y, val_X, val_y, None)

In [None]:
#logistic_clf = logistic(train_X_init, train_y_init, val_X_init, val_y_init, None)

# Random Forest

In [97]:
# Random Forest - choose parameters using k fold cross validation and grid search
def run_forest(train_X, train_y, val_X, val_y):
    forest = RandomForestClassifier(n_estimators=250,
                                  random_state=0,
                                   )
    clf_forest = GridSearchCV(estimator=forest, param_grid=dict(n_estimators=[200, 350, 500], warm_start=[True, False]), cv=k_fold, n_jobs=-1)
    cms = [confusion_matrix(train_y[test], clf_forest.fit(train_X[train],train_y[train]).predict(train_X[test])) for train, test in kf_total.split(train_X)]
#     print(cms)
    
#     forest.fit(train_X, train_y)
#     forest = 

    forest = RandomForestClassifier(n_estimators=clf_forest.best_estimator_.n_estimators,
                                  warm_start=clf_forest.best_estimator_.warm_start,
                                   )

    forest.fit(train_X, train_y)
    pred = forest.predict(val_X)
    print(confusion_matrix(val_y, pred))
    print(accuracy(confusion_matrix(val_y, pred)))
    print(f1_score(val_y, pred))
    return forest

In [168]:
#rf_clf = run_forest(train_X, train_y, val_X, val_y)

[[1577  833]
 [ 871 1364]]
0.633153928956
0.615523465704


In [169]:
#rf_clf = run_forest(train_X_n, train_y, val_X_n, val_y)

[[1575  835]
 [ 876 1359]]
0.631646932185
0.61368254685


In [98]:
rf_clf = run_forest(train_X_init, train_y_init, val_X_init, val_y_init)

[[1563  847]
 [ 817 1418]]
0.641765339074
0.630222222222


In [171]:
#rf_clf = run_forest(train_X_init_n, train_y_init, val_X_init_n, val_y_init)

[[1567  843]
 [ 812 1423]]
0.643702906351
0.632303932459


# Support Vector Machines

In [None]:
# SVM - choose parameters using k fold cross validation and grid search
from sklearn import svm
def run_svm(train_X, train_y, val_X, val_y):

    svc = svm.SVC(C=1.0, kernel='linear')
    svc.fit(train_X, train_y)
    pred = svc.predict(val_X)
    print (confusion_matrix(val_y, pred))
    print (accuracy(confusion_matrix(val_y, pred)))
    print (f1_score(val_y, pred))
    return svc

svc = run_svm(train_X, train_y, val_X, val_y)
svc_2 = run_svm(train_X_init, train_y_init, val_X_init, val_y_init)

In [152]:
[svm] = pickle.load(open("svm.pickle", "r"))
pred = svm.predict(val_X)
print (confusion_matrix(val_y, pred))
print (accuracy(confusion_matrix(val_y, pred)))
print (f1_score(val_y, pred))

[[1558  852]
 [ 890 1345]]
0.624973089343
0.606949458484


# Bagging method

In [22]:
# Bagging - choose parameters using k fold cross validation and grid search
from sklearn.ensemble import BaggingClassifier
bagging = BaggingClassifier(n_estimators=250,
                              random_state=0)
# estimators_list = [150, 250, 500]

# k_fold = 5
# kf_total = KFold(n_splits=k_fold)

# clf_bagging = GridSearchCV(estimator=bagging, param_grid=dict(n_estimators=estimators_list, warm_start=[True, False]), cv=k_fold, n_jobs=-1)
# cms = [confusion_matrix(train_y[test], clf_bagging.fit(train_X[train], train_y[train]).predict(train_X[test])) for train, test in kf_total.split(train_X)]
# accuracies = []
# for cm in cms:
#     accuracies.append(accuracy(cm))
# print(accuracies)
# print(np.mean(accuracies))
bagging.fit(train_X, train_y)
pred = bagging.predict(val_X)
print (confusion_matrix(val_y, pred))
print (accuracy(confusion_matrix(val_y, pred)))
print (f1_score(val_y, pred))

[[1639  771]
 [ 862 1373]]
0.648439181916
0.627083809089


# Feed forward neural network

In [24]:
# Feedforward neural network
batch_size = 64

X_train_lstm = np.reshape(train_X_n, (len(train_X), len(train_X[0])))#, 1))
X_test_lstm = np.reshape(val_X_n, (len(val_X), len(val_X[0])))#, 1))

y_train_lstm = to_categorical(train_y)
y_test_lstm = to_categorical(val_y)

model = Sequential()

model.add(Dense(50, activation='relu', kernel_initializer=keras.initializers.RandomNormal(mean=0.0, stddev=0.1, seed=None)
, input_shape=X_train_lstm.shape[1:]))      
model.add(Dropout(0.5))
model.add(Dense(2, activation='sigmoid'))

# try using different optimizers and different optimizer configs
adam = Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08)
model.compile(loss='binary_crossentropy', optimizer=adam, metrics=['accuracy'])

filepath="weights.best.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
callbacks_list = [checkpoint]

print('Train...')
model.fit(X_train_lstm, y_train_lstm,
          batch_size=batch_size,
          epochs=20,
          validation_split=0.2,
          shuffle=True,
          #callbacks = callbacks_list,
         )
scores = model.evaluate(X_train_lstm, y_train_lstm, verbose=0)
print(scores)
scores = model.evaluate(X_test_lstm, y_test_lstm, verbose=0)
print(scores)

Train...
Train on 11145 samples, validate on 2787 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
[0.44614967418887508, 0.80017226528854435]
[0.72123019305189151, 0.63175457483728936]


In [25]:
#Predictions from the model
pred = np.argmax(model.predict(X_test_lstm), axis = 1)

In [26]:
print (confusion_matrix(val_y, pred))
print (accuracy(confusion_matrix(val_y, pred)))
print (f1_score(val_y, pred))

[[1573  837]
 [ 874 1361]]
0.631646932185
0.61403113016


## Ensemble model - Max polling and predictions between ANN, LR, RF, SVM and Bagging

In [159]:
def ensemble_kaggle_submission(classifiers, data):

    '''
    Predicted is the probability of a bad buy[0-1]
    '''
#     classifiers = [model, logistic_clf, rf_clf, bagging]
#     X_val_nn = np.reshape(val_X_n, (len(val_X_n), len(val_X_n[0])))#, 1))
#     data = [val_X_n, val_X, val_X_init, val_X]
    pred = []

    for i in range(len(classifiers)):

#         if i == 4 or i == 1:
#             continue
            
        if i == 0:
            pred_i = classifiers[i].predict(data[i])[:, 1]
        else:
            pred_i = classifiers[i].predict_proba(data[i])[:, 1]
        
        pred.append(pred_i)

    
    pred = np.array(pred)
    
    #Measure correlation between predictions of different models
    print(np.corrcoef(pred))
    
    pred = np.mean(pred, axis = 0)
#     pred[pred <= 0.5] = 0
#     pred[pred > 0.5] = 1
    return pred

def ensemble(classifiers, data):
    
    '''
    Prediction of whether it is a good or bad buy(0/1)
    '''
    
#     classifiers = [model, logistic_clf, rf_clf, bagging]
#     X_val_nn = np.reshape(val_X_n, (len(val_X_n), len(val_X_n[0])))#, 1))
#     data = [val_X_n, val_X, val_X_init, val_X]
    pred = []

    for i in range(len(classifiers)):

#         if i == 4 or i == 1:
#             continue

        if i == 0:
            pred_i = np.argmax(classifiers[i].predict(data[i]), axis=1)
        else:
            pred_i = classifiers[i].predict(data[i])

        pred.append(pred_i)

    
    pred = np.array(pred)
    
    #Measure correlation between predictions of different models
    print(np.corrcoef(pred))
    
    #Average predictions and threshold them
    pred = np.mean(pred, axis = 0)
    pred[pred < 0.5] = 0
    pred[pred >= 0.5] = 1
    return pred

In [147]:
pred_train = ensemble_kaggle_submission([model, logistic_clf, rf_clf, bagging, svm], [train_X_n, train_X, train_X_init, train_X, train_X])

[[  1.00000000e+00   4.77264104e-01  -1.29529764e-02   9.05710383e-01]
 [  4.77264104e-01   1.00000000e+00   8.79814280e-04   3.82273311e-01]
 [ -1.29529764e-02   8.79814280e-04   1.00000000e+00  -1.59856326e-02]
 [  9.05710383e-01   3.82273311e-01  -1.59856326e-02   1.00000000e+00]]


In [156]:
# Binary prediction on val set
pred = ensemble([model, logistic_clf, rf_clf, bagging, svm], [val_X_n, val_X, val_X_init, val_X, val_X])
# Continuous prediction on val set
pred_continous = ensemble_kaggle_submission([model, logistic_clf, rf_clf, bagging, svm], [val_X_n, val_X, val_X_init, val_X, val_X])

[[ 1.          0.28876431  0.52686806]
 [ 0.28876431  1.          0.28039007]
 [ 0.52686806  0.28039007  1.        ]]


In [143]:
#Validation accuracy - Ensemble of all 5 models
print (confusion_matrix(val_y, pred))
print (accuracy(confusion_matrix(val_y, pred)))
print (f1_score(val_y, pred))

[[1473  937]
 [ 580 1655]]
0.673412271259
0.685726123886


In [157]:
#Validation accuracy - Ensemble of RF + NN + Bagging 
print (confusion_matrix(val_y, pred))
print (accuracy(confusion_matrix(val_y, pred)))
print (f1_score(val_y, pred))

[[1750  660]
 [ 756 1479]]
0.695156081808
0.676268861454


# Generating predictions on kaggle test set - 0.1882(LR+RF+NN+Bagging) Gini for initial submission
## Train on train + val sets before submitting on Kaggle's test

## Training models on train+val to predict on Kaggle's test

In [140]:
logistic_clf.fit(np.concatenate([train_X_n, val_X_n]), np.concatenate([train_y, val_y]))

[LibLinear]

LogisticRegression(C=0.5, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=1, warm_start=False)

In [129]:
rf_clf.fit(np.concatenate([train_X, val_X]), np.concatenate([train_y, val_y]))

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=1,
            oob_score=False, random_state=None, verbose=0, warm_start=True)

In [130]:
model.fit(np.concatenate([X_train_lstm, X_test_lstm]), np.concatenate([y_train_lstm, y_test_lstm]),
          batch_size=batch_size,
          epochs=20,
          validation_split=0.2,
          shuffle=True,
          #callbacks = callbacks_list,
         )

Train on 11145 samples, validate on 2787 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7fb2997bd990>

In [131]:
bagging.fit(np.concatenate([train_X, val_X]), np.concatenate([train_y, val_y]))

BaggingClassifier(base_estimator=None, bootstrap=True,
         bootstrap_features=False, max_features=1.0, max_samples=1.0,
         n_estimators=250, n_jobs=1, oob_score=False, random_state=0,
         verbose=0, warm_start=False)

## Test submission

In [160]:
ref_id = np.array(pipe.test_dataframe_refId)
pred_test = ensemble_kaggle_submission([model, logistic_clf, rf_clf, bagging, svm], [test_X_n, test_X, test_X_init, test_X, test_X])

assert len(ref_id) == len(pred_test)

f = open("submission5.csv", "w")
f.write("RefId,IsBadBuy\n")

for i in range(len(ref_id)):
    if i != len(ref_id) - 1:
        f.write(str(ref_id[i]) + "," + str(pred_test[i]) + "\n")
    else:
        f.write(str(ref_id[i]) + "," + str(pred_test[i]))
f.close()

[[ 1.          0.57170606  0.63946228]
 [ 0.57170606  1.          0.64364128]
 [ 0.63946228  0.64364128  1.        ]]


# Kaggle evaluation metric - Gini coefficient

In [60]:
def gini(actual, pred):

    actual_len = len(actual)
    assert( actual_len == len(pred) )
    all = np.asarray(np.c_[ actual, pred, np.arange(actual_len) ], dtype=np.float)
    all = all[ np.lexsort((all[:,2], -1*all[:,1])) ] 
    giniSum = all[:,0].cumsum().sum() / all[:,0].sum()
    giniSum -= (actual_len + 1) / 2.
    return giniSum / actual_len

def normalized_gini(solution, submission):
    normalized_gini = gini(solution, submission)/gini(solution, solution)
    return normalized_gini


normalized_gini(val_y, pred_continuous)

0.89211147613541375