**PREPROCESS AND TRAINING**

In this kernel I'm going to compare several classifiers, in order to understand which one is the most suitable for this classification problem.

Here I'm using these classification models: RandomForest, Gaussian Naive Bayes (GNB) and MLP. Basically, my idea is to use classification algorithms that belong to different families.
RandomForest is a tree-based algorithm, GNB is a probabilistic algorithm and MLP (multi-layer perceptron) is a simple feed-forward neural network.

In [None]:
#importing all the necessary libraries

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_curve,auc,precision_recall_fscore_support
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score
import csv

In [None]:
df=pd.read_csv("../input/train.csv")
#dropping the ID code, which is useless for our classification task
df=df.drop(['ID_code'],axis=1)

#Y is a list in which all the target values are stored
#(taking the column directly is much faster than looping over the rows with iloc)
Y=df['target'].tolist()

In [None]:
df['target'].value_counts().plot.bar()
#dropping the target, because it is not an explanatory feature
df=df.drop(['target'],axis=1)

#printing the correlation matrix among the features
print(df.corr())

We can see that there is no strong correlation among the explanatory variables, so it seems reasonable to treat them as (roughly) independent of each other, keeping in mind that a low linear correlation does not strictly guarantee independence. Since no feature looks redundant, they can all be used to train our classification models.
Maybe I will upload a new kernel where feature selection is implemented, so that just the features most correlated with the target variable are used during the training phase. For now, let's use all the features (although it may be a little time consuming and computationally demanding).
Besides, from the plot above, we can see that this dataset is highly unbalanced. Truth be told, it doesn't really matter here, because we are interested in the ROC Curve.
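
With many features, the printed correlation matrix gets truncated and is hard to read. As a rough sketch, the check could be condensed into a single number: the largest absolute correlation between two different features. The `demo` DataFrame below is synthetic stand-in data (an assumption for illustration, not the competition set); in the kernel it would be replaced by `df`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the feature DataFrame (hypothetical data, not train.csv)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(1000, 5)),
                    columns=[f"var_{i}" for i in range(5)])

# Absolute pairwise Pearson correlations
corr = demo.corr().abs()

# Keep only the strict upper triangle, so the diagonal of 1.0s
# and the duplicated lower half are ignored
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# Largest absolute correlation between two different features
max_corr = upper.max().max()
print("largest absolute pairwise correlation:", max_corr)
```

A value close to 0 supports the "no strong correlation" reading; a value close to 1 would point to at least one pair of nearly redundant features.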

**The ROC Curve depicts the performance of a classifier regardless of the target class distribution**

There is no need, in my opinion, to perform oversampling (with SMOTE or a similar technique). Besides, with oversampling we could end up with a slightly biased dataset, because some samples would be "artificially" added.
On the other hand, downsampling could be performed, but it is not necessary either.
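
The claim about the ROC Curve and the class distribution can be checked with a small experiment. This is only a sketch on made-up scores (the `pos_scores` and `neg_scores` arrays and the `auc_for` helper are hypothetical): the true positive rate is computed on the positives only and the false positive rate on the negatives only, so repeating every negative sample changes the class ratio but should leave the AUC unchanged.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Hypothetical classifier scores: positives tend to score higher than negatives
pos_scores = rng.normal(loc=1.0, size=200)
neg_scores = rng.normal(loc=0.0, size=200)

def auc_for(neg_copies):
    """AUC when every negative sample is repeated neg_copies times."""
    neg = np.tile(neg_scores, neg_copies)
    y_true = np.concatenate([np.ones(len(pos_scores)), np.zeros(len(neg))])
    y_score = np.concatenate([pos_scores, neg])
    return roc_auc_score(y_true, y_score)

auc_balanced = auc_for(1)     # 1:1 class ratio
auc_unbalanced = auc_for(10)  # 1:10 class ratio
print(auc_balanced, auc_unbalanced)
```

The two printed values should agree up to floating point error. This only says that the metric is insensitive to the class ratio; it does not say that a classifier trained on unbalanced data learns equally well.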


In [None]:
#X is a list of lists where all the values of explanatory variables are stored. It contains all the values
#for each row of our training set (one inner list per row, in column order).
#Converting the whole DataFrame at once is much faster than reading it cell by cell with iloc
X=df.values.tolist()

In [None]:
#splitting into train and validation set
X_train, X_val, y_train, y_val = train_test_split(X,Y,test_size=0.33,shuffle=False)

#scaling the features using the standard scaler (so that variables have mean=0 and std=1)
scaler=StandardScaler()
X_train_std=scaler.fit_transform(X_train)
X_val_std=scaler.transform(X_val)

We split the dataset into train and validation sets and then used a StandardScaler to standardize our data (mean=0 and std=1, computed on the training set only), so that the training phase can be a little faster, which mainly helps the MLP converge.
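
To make the split-then-scale step concrete, here is a minimal sketch on synthetic data (the `X_demo`, `y_demo` and `demo_scaler` names are made up for illustration, they are not the competition data). The scaler is fitted on the training part only, so the training features end up with mean 0 and std 1, while the validation features are only approximately standardized.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical features with mean 5 and std 3, plus random binary labels
X_demo = rng.normal(loc=5.0, scale=3.0, size=(300, 4))
y_demo = rng.integers(0, 2, size=300)

# Same settings as in the kernel: the last 33% of the rows become the validation set
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.33, shuffle=False)

demo_scaler = StandardScaler()
X_tr_std = demo_scaler.fit_transform(X_tr)  # learns mean/std from the training rows
X_va_std = demo_scaler.transform(X_va)      # reuses the training mean/std

print("train mean/std:", X_tr_std.mean(axis=0).round(3), X_tr_std.std(axis=0).round(3))
print("val   mean/std:", X_va_std.mean(axis=0).round(3), X_va_std.std(axis=0).round(3))
```

Fitting the scaler on the training rows only keeps information from the validation set out of the preprocessing step.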


In [None]:
clf_nb=GaussianNB()
clf_nb.fit(X_train_std,y_train)

In [None]:
clf_mlp=MLPClassifier()
clf_mlp.fit(X_train_std,y_train)

In [None]:
#clf_svm=SVC(kernel='rbf',probability=True)
#clf_svm.fit(X_train_std,y_train)

In [None]:
clf_rf=RandomForestClassifier()
clf_rf.fit(X_train_std,y_train)

We trained our different classifiers, so we are ready to test how they perform on our validation set.
Remember that we are interested in the ROC Curve, so we need to predict **the probabilities** rather than the class labels. That is why the next code cell uses the *predict_proba* method.

In [None]:
y_scores_nb=clf_nb.predict_proba(X_val_std)
y_scores_mlp=clf_mlp.predict_proba(X_val_std)
y_scores_rf=clf_rf.predict_proba(X_val_std)

#y_scores_svm=clf_svm.predict_proba(X_val_std)
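
A small sketch of what *predict_proba* returns, on made-up data (`X_demo`, `y_demo` and `clf_demo` are hypothetical names): the result has one column per class, ordered like the `classes_` attribute of the fitted classifier. Looking up the column of label 1 explicitly is a slightly safer alternative to hard-coding index 1.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Two hypothetical classes with different feature means
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(50, 3)),
                    rng.normal(2.0, 1.0, size=(50, 3))])
y_demo = np.array([0] * 50 + [1] * 50)

clf_demo = GaussianNB().fit(X_demo, y_demo)
proba = clf_demo.predict_proba(X_demo)   # shape: (n_samples, n_classes)

# Columns follow the order of clf_demo.classes_
pos_col = list(clf_demo.classes_).index(1)
pos_scores = proba[:, pos_col]           # probability of the positive class

print(clf_demo.classes_, proba.shape, pos_col)
```

With labels 0 and 1 the positive class lands in column 1, which is the column the kernel passes to `roc_curve`.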

Now that we calculated the probabilities for each classifier, we can use them to generate the ROC Curves.
We are interested in people who made a transaction, so we need to specify that the positive label is 1.

In [None]:
false_positive_rate_nb, true_positive_rate_nb, thresholds_nb = roc_curve(y_val,y_scores_nb[:,1],pos_label=1)
false_positive_rate_mlp, true_positive_rate_mlp, thresholds_mlp = roc_curve(y_val,y_scores_mlp[:,1],pos_label=1) 
false_positive_rate_rf, true_positive_rate_rf, thresholds_rf = roc_curve(y_val,y_scores_rf[:,1],pos_label=1)

#false_positive_rate_svm, true_positive_rate_svm, thresholds_svm = roc_curve(y_val,y_scores_svm[:,1],pos_label=1)

Here, we calculate the AUC value for each classifier and then print them out.

In [None]:
roc_auc_nb=auc(false_positive_rate_nb,true_positive_rate_nb)
roc_auc_mlp=auc(false_positive_rate_mlp,true_positive_rate_mlp)
roc_auc_rf=auc(false_positive_rate_rf,true_positive_rate_rf)

#roc_auc_svm=auc(false_positive_rate_svm,true_positive_rate_svm)
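
As a side note, scikit-learn also offers `roc_auc_score`, which goes from labels and scores to the AUC in one call. The sketch below, on made-up labels and scores (`y_true` and `y_score` are hypothetical), checks that it agrees with the two-step `roc_curve` + `auc` approach used above.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc, roc_auc_score

rng = np.random.default_rng(0)
# Hypothetical labels and scores: positives get scores shifted up by 1
y_true = rng.integers(0, 2, size=500)
y_score = rng.normal(size=500) + y_true

# Two-step version, as in the kernel
fpr, tpr, _ = roc_curve(y_true, y_score, pos_label=1)
auc_two_step = auc(fpr, tpr)

# One-call version
auc_direct = roc_auc_score(y_true, y_score)

print(auc_two_step, auc_direct)
```

The two-step version is still the one to use when the curve itself has to be plotted, because it also gives the false and true positive rates.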

In [None]:
print("Area under curve of Naive Bayes:",roc_auc_nb)
print(" ")
print("Area under curve of MultiLayer Perceptron:",roc_auc_mlp)
print(" ")
print("Area under curve of Random Forest:",roc_auc_rf)

#print("Area under curve of Support Vector Machine:",roc_auc_svm)
#print(" ")

In the code cell below, we just plot the ROC Curves and the AUCs of the different classifiers.

In [None]:
plt.title('Receiver Operating Characteristic Comparison')
plt.plot(false_positive_rate_nb,true_positive_rate_nb, 'b', label = 'AUC GNB = %0.2f' % roc_auc_nb)
plt.plot(false_positive_rate_mlp,true_positive_rate_mlp, 'r', label = 'AUC MLP = %0.2f' % roc_auc_mlp)
plt.plot(false_positive_rate_rf,true_positive_rate_rf, 'g', label = 'AUC RF = %0.2f' % roc_auc_rf)

#plt.plot(false_positive_rate_svm,true_positive_rate_svm, 'y', label = 'AUC SVM = %0.2f' % roc_auc_svm)

plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()


From the ROC Curve plot above, it is clear that the ROC Curve of the GNB is the best one of the three: its AUC is 0.89.
It seems that Gaussian Naive Bayes performs well on this dataset. This is probably because the variables are (nearly) independent of each other, as the correlation matrix above suggests, which matches the conditional independence assumption that Naive Bayes makes.
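
To see why independence matters for GNB, the sketch below rebuilds its predicted probabilities by hand on made-up data (`X_demo`, `y_demo`, `clf_demo` and `manual_proba` are hypothetical names). The likelihood of a sample given a class is just the product of one-dimensional Gaussian densities, one per feature (a sum in log space), which is only a faithful model when the features are independent within each class. The fallback to `sigma_` is an assumption for older scikit-learn versions, where the variances were stored under that name.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Two hypothetical classes with independent Gaussian features
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(100, 4)),
                    rng.normal(1.0, 1.5, size=(100, 4))])
y_demo = np.array([0] * 100 + [1] * 100)

clf_demo = GaussianNB().fit(X_demo, y_demo)

mean = clf_demo.theta_  # per-class, per-feature means
# per-class, per-feature variances (named sigma_ in older scikit-learn versions)
var = clf_demo.var_ if hasattr(clf_demo, "var_") else clf_demo.sigma_

# log p(x | class): sum over features of 1D Gaussian log-densities
diff = X_demo[:, None, :] - mean[None, :, :]
log_lik = -0.5 * (np.log(2 * np.pi * var)[None, :, :]
                  + diff ** 2 / var[None, :, :]).sum(axis=2)

# Add the log prior of each class, then normalize to get posteriors
log_joint = np.log(clf_demo.class_prior_)[None, :] + log_lik
log_joint -= log_joint.max(axis=1, keepdims=True)  # for numerical stability
manual_proba = np.exp(log_joint)
manual_proba /= manual_proba.sum(axis=1, keepdims=True)

print(np.allclose(manual_proba, clf_demo.predict_proba(X_demo)))
```

Because the model is only a set of per-feature means and variances for each class, fitting it takes a single pass over the data, which is in line with the low training cost mentioned in this kernel.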

NB: The AUC value of GNB is almost the same as the one reached by LGBM (the most optimized LGBM models reached an AUC value on the validation set of around 0.90 or 0.91), but training a GNB is less time consuming and far less computationally demanding.
So, at the end of the day, I think that Gaussian Naive Bayes, for the reasons mentioned above, is the most suitable classifier for this problem.

**TESTING**

Now we are ready to use the GNB classifier to predict the samples of our test set.
After reading the test file, we save the id codes in a list and then drop them, because we don't need them for our prediction.

In [None]:
df_test=pd.read_csv("../input/test.csv")

id_codes=df_test['ID_code'].tolist()

df_test=df_test.drop(['ID_code'],axis=1)

In the code below, we do the same thing as before, storing the test values in a list called X_test, and after that, we standardize our test data.

In [None]:
#same conversion as for the training set: one inner list per test row
#(much faster than reading the DataFrame cell by cell with iloc)
X_test=df_test.values.tolist()

#reusing the scaler that was fitted on the training data
X_test_std=scaler.transform(X_test)

After doing the preprocessing, we are finally ready to predict, for each sample in our test set, the probability of being a person who made a transaction. Besides, here we also predict the class which each sample belongs to (it is kept in *y_test_pred*, but it is not written to the file). Then, we store the id code and the probability of each test sample in a file called *Sample_Submission.csv*.

In [None]:
y_scores_test_nb=clf_nb.predict_proba(X_test_std)[:,1]
y_test_pred=clf_nb.predict(X_test_std)

In [1]:
submission = pd.DataFrame({"ID_code": id_codes})
submission["target"] = y_scores_test_nb
submission.to_csv("Sample_Submission.csv",index=False)

NB: I commented out the part related to SVC, because it is too expensive in terms of computation and time. Besides, the GNB is still better than SVC.

This is my first Kaggle competition. I hope you like it :)