## Content
#### 1) Libraries
#### 2) Data extraction
#### 3) Data exploration
#### 4) Unbalanced data and resampling
#### 5) Feature selection
#### 6) Classification models
#### 7) Detection of the most influential variables

### Loading the libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.utils import resample
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix,recall_score,roc_auc_score,roc_curve,precision_score,f1_score,auc,accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB


### Data Extraction

In [None]:
customer_data = pd.read_csv('../input/santander-customer-transaction-prediction/train.csv')
test_data = pd.read_csv('../input/santander-customer-transaction-prediction/test.csv')

In [None]:
print(customer_data.shape)
print(test_data.shape)

In [None]:
customer_data.head()

In [None]:
test_data.head()

### Data Exploration

In [None]:
test_data.describe()

This is an anonymised dataset with 200 numeric feature variables, a binary dependent variable (`target`), and a string identifier column (`ID_code`). Two datasets are provided: a training set and a test set. The test set has no target column, so we cannot use it to train the models. The task in this challenge is to predict the value of the target column for the test set.

In [None]:
customer_data['target'].value_counts().plot.bar()
plt.show()

### The data is imbalanced.
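
To put a number on the imbalance rather than reading it off the bar chart, the class shares can be computed directly. The sketch below uses a small made-up label array (`labels` is illustrative, not the competition data); on the real frame, `customer_data['target'].value_counts(normalize=True)` should give the corresponding proportions.

```python
import numpy as np

# Hypothetical labels: 9 negatives and 1 positive, mimicking a roughly 90/10 split.
labels = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

counts = np.bincount(labels)      # occurrences of class 0 and class 1
shares = counts / counts.sum()    # fraction of rows in each class

for cls, (n, share) in enumerate(zip(counts, shares)):
    print(f"target={cls}: {n} rows ({share:.1%})")
```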

### Missing Value Analysis

In [None]:
print(customer_data.isnull().sum().any())
print(test_data.isnull().sum().any())

There are no missing values in either dataset.

## Checking the distribution
To get an idea of how the data is distributed, we plot histograms of the per-record mean across all features in the training set, split by the binary target variable.

In [None]:

columns = list(customer_data.columns)
columns.remove('target')
columns.remove('ID_code')

target0_data = customer_data[customer_data['target']==0]
target1_data = customer_data[customer_data['target']==1]
plt.figure(figsize=(14,8))
plt.title("Distribution of mean row data based on target")
# sns.distplot is deprecated; histplot with stat='density' is its replacement
sns.histplot(target0_data[columns].mean(axis=1),color='blue',kde=True,bins=100,stat='density',label='target_0')
sns.histplot(target1_data[columns].mean(axis=1),color='red',kde=True,bins=100,stat='density',label='target_1')
plt.legend()
plt.show()

As we can see, there is a small difference between the two classes in the mean across all features, which could help explain the target variable.

#### We will look for correlated variables to reduce the high dimensionality. The heatmap is too large to read in detail, so we also check the correlations numerically.

In [None]:
corr_matrix = customer_data[columns].corr()
sns.heatmap(corr_matrix)

We try to detect potentially correlated variables to reduce the high dimensionality. Because the correlation matrix is too large to inspect visually, as seen above, we check numerically for correlations above 0.5 or below -0.5.

In [None]:
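# Correlate the feature columns only (ID_code is a string, target is the label)
corr = customer_data[columns].corr()
# Positions where the absolute correlation exceeds 0.5 (covers both signs)
rows, cols = np.where(corr.abs()>0.5)
# x<y keeps each pair once and skips the diagonal
high_corr = [(corr.index[x],corr.columns[y]) for x,y in zip(rows,cols) if x<y]
if len(high_corr)==0:
    print("There are no correlated variables")
else:
    print(high_corr)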

## PCA

In [None]:
from sklearn.decomposition import PCA
ratio={}
for i in range(190,201):  # 190 to 200 components, inclusive
    pca=PCA(n_components=i).fit(customer_data[columns])
    ratio[i]=sum(pca.explained_variance_ratio_)
    
pd.Series(ratio).plot()
plt.show()

#### We observe that all 200 components are needed to explain the variance, so we will not apply PCA here.
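
The cumulative explained variance can also be read from a single decomposition instead of refitting for every component count. The sketch below is a minimal numpy version on synthetic data (the array `X` is a made-up stand-in for `customer_data[columns]`); with scikit-learn, `np.cumsum(PCA().fit(X).explained_variance_ratio_)` should give the same curve from one fit.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for the feature matrix: 500 rows, 5 independent features.
X = rng.normal(size=(500, 5))

# PCA via SVD of the centered data: squared singular values are proportional
# to the variance captured by each principal component.
X_centered = X - X.mean(axis=0)
singular_values = np.linalg.svd(X_centered, compute_uv=False)
explained_ratio = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained_ratio)

for k, total in enumerate(cumulative, start=1):
    print(f"{k} components explain {total:.3f} of the variance")
```

When the features are close to independent, as in this toy matrix, each component carries a similar share and the curve only reaches 1.0 at the last component.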

## Outlier

In [None]:
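def boxplot_func(data_frame,col):
    # Draw one boxplot per column on a 10x10 grid (expects at most 100 columns)
    sns.set(style="whitegrid")
    fig, ax = plt.subplots(10,10,figsize=(18,24))
    # Title the grid itself; calling plt.title before plt.subplots opens a stray figure
    fig.suptitle("Outliers")
    counter=0
    for c in col:
        counter+=1
        plt.subplot(10,10,counter)
        sns.boxplot(x=data_frame[c])
        plt.xlabel(c)
        plt.tick_params(axis='x', labelsize=7, pad= -7)
    plt.show()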

In [None]:
col = customer_data.columns.values[2:102]
boxplot_func(customer_data,col)

In [None]:
col=customer_data.columns.values[102:]
boxplot_func(customer_data,col)

In [None]:
copy_train_data = customer_data.copy()
copy_test_data = test_data.copy()

In [None]:
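def outlier_removal(df):
    # Drop rows outside the 1.5*IQR fences, one feature column at a time
    for i in columns:
        q75, q25 = np.percentile(df.loc[:,i],[75,25])
        iqr = q75-q25
        # lower/upper instead of min/max, which shadow the built-ins
        lower = q25 - (iqr*1.5)
        upper = q75 + (iqr*1.5)
        df = df.drop(df[df.loc[:,i]<lower].index)
        df = df.drop(df[df.loc[:,i]>upper].index)
    # Return only after every column has been processed
    return df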
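
The interquartile-range rule used above is easiest to see on a single column. This is a small sketch on made-up numbers (the `values` array is illustrative only), showing how the fences are derived and which entries they keep.

```python
import numpy as np

# Hypothetical single feature with one obvious extreme value.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100], dtype=float)

q25, q75 = np.percentile(values, [25, 75])
iqr = q75 - q25
lower = q25 - 1.5 * iqr
upper = q75 + 1.5 * iqr

# Boolean mask of the entries that fall inside the fences.
keep = (values >= lower) & (values <= upper)
print("bounds:", lower, upper)
print("kept:", values[keep])
```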

In [None]:
customer_data = outlier_removal(customer_data)
#test_data = outlier_removal(test_data) ## We don't need to remove outliers from test data
print("Total number of observations dropped in train set:",copy_train_data.shape[0]-customer_data.shape[0])
customer_data.shape

In [None]:
customer_data['target'].value_counts()

## Resampling
Note that we are dealing with a very unbalanced dataset: only about 10% of the records have target 1, i.e. customers who have made a financial transaction. So we will try resampling the data.

## 1. Under Sampling

In [None]:
class_0,class_1 = customer_data.target.value_counts()

df_class_0 = customer_data[customer_data['target']==0]
df_class_1 = customer_data[customer_data['target']==1]

under_df_0 = df_class_0.sample(class_1,random_state=123)  # fixed seed for reproducibility
df_train_under = pd.concat([under_df_0,df_class_1],axis=0)

print(df_train_under.target.value_counts())
df_train_under.describe()

## 2. Oversampling

In [None]:
# Upsample the minority class to the size of the majority class
over_df = resample(df_class_1, replace=True, n_samples=len(df_class_0),random_state=123)

df_train_over = pd.concat([over_df,df_class_0],axis=0)

print(len(df_train_over))
print(df_train_over.target.value_counts())
df_train_over.describe()
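
One caveat with oversampling the whole frame: duplicated minority rows can land on both sides of a later train/test split, which tends to make test scores look better than they are. A possible alternative, sketched here on synthetic labels with numpy only (all array names are illustrative), is to hold out the test rows first and oversample only the training part.

```python
import numpy as np

rng = np.random.default_rng(123)

# Hypothetical labels: 90 majority rows (0) and 10 minority rows (1).
y = np.array([0] * 90 + [1] * 10)
idx = np.arange(len(y))

# 1) Stratified hold-out: take one fifth of each class as the test set.
test_idx = np.concatenate([
    rng.choice(idx[y == cls], size=int(np.sum(y == cls)) // 5, replace=False)
    for cls in (0, 1)
])
train_idx = np.setdiff1d(idx, test_idx)

# 2) Oversample the minority class inside the training part only.
train_major = train_idx[y[train_idx] == 0]
train_minor = train_idx[y[train_idx] == 1]
minor_upsampled = rng.choice(train_minor, size=len(train_major), replace=True)
balanced_train_idx = np.concatenate([train_major, minor_upsampled])

print("train class counts:", np.bincount(y[balanced_train_idx]))
print("test class counts: ", np.bincount(y[test_idx]))
```

The test part keeps the original class ratio, so the scores measured on it reflect the imbalance the model would face on unseen data.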

In [None]:
from sklearn.model_selection import train_test_split

X_train_under = df_train_under[columns]
y_train_under = df_train_under['target']

X_train_over = df_train_over[columns]
y_train_over = df_train_over['target']
print(X_train_under.shape)
train_x,test_x,train_y,test_y = train_test_split(X_train_under,y_train_under,train_size=0.8,random_state=42,stratify=y_train_under)
train_y.shape

In [None]:
print(train_y.value_counts())

## Classification Models

In [None]:
def metrics(y_true,y_pred):
    # Print the standard classification metrics and plot the ROC curve
    print("Confusion Matrix")
    print(confusion_matrix(y_true,y_pred))

    print("Accuracy:", accuracy_score(y_true,y_pred))
    print("Precision:", precision_score(y_true,y_pred))
    print("F1 Score:", f1_score(y_true,y_pred))
    print("Recall:", recall_score(y_true,y_pred))

    # y_pred holds hard 0/1 labels, so the ROC curve has a single operating point
    false_positive_rate,true_positive_rate,thresholds = roc_curve(y_true,y_pred)
    roc_auc = auc(false_positive_rate,true_positive_rate)

    print("ROC AUC:",roc_auc)

    plt.plot(false_positive_rate,true_positive_rate,'b')
    plt.plot([0,1],[0,1],'r--')
    plt.title("AUC=%0.2f"%roc_auc)
    plt.show()
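
The ROC AUC reported by this helper is computed from hard class labels. As a hedged illustration of why that matters, the sketch below computes AUC on a made-up example both from predicted probabilities and from the thresholded labels; `pairwise_auc` is a hypothetical helper written for this example, not part of scikit-learn. In scikit-learn the score-based version would correspond to `roc_auc_score(y_true, model.predict_proba(X)[:, 1])`.

```python
import numpy as np

def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

# Hypothetical true labels and predicted probabilities for the positive class.
y_true = [0, 0, 1, 1, 1]
scores = [0.1, 0.6, 0.4, 0.7, 0.9]
hard_labels = [int(s >= 0.5) for s in scores]  # what .predict() would return

auc_from_scores = pairwise_auc(y_true, scores)
auc_from_labels = pairwise_auc(y_true, hard_labels)
print("AUC from scores:", round(auc_from_scores, 4))
print("AUC from labels:", round(auc_from_labels, 4))
```

Thresholding discards the ranking information inside each predicted class, so the label-based value is generally a less informative summary than the score-based one.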

## Testing with Undersampling Data

### 1) Logistic Regression

In [None]:
logistic_model = LogisticRegression().fit(train_x,train_y)
logistic_predict = logistic_model.predict(train_x)

print("Metrics for train:")
metrics(train_y,logistic_predict)

In [None]:
logistic_predict_test = logistic_model.predict(test_x)
print("Metrics for test:")
metrics(test_y,logistic_predict_test)

### 2) Random Forest

In [None]:
tree = RandomForestClassifier(n_estimators=10,max_depth=7,random_state=1).fit(train_x,train_y)
tree_train_predict = tree.predict(train_x)
    
print("Metrics for train:")
metrics(train_y,tree_train_predict)

In [None]:
tree_test_predict = tree.predict(test_x)
print("Metrics for test:")
metrics(test_y,tree_test_predict)

### 3) Naive Bayes

In [None]:
naive = GaussianNB().fit(train_x,train_y)
naive_train_predict = naive.predict(train_x)
print("Metrics for train:")
metrics(train_y,naive_train_predict)

In [None]:
naive_test_predict = naive.predict(test_x)
print("Metrics for test:")
metrics(test_y,naive_test_predict)

In [None]:
models = []
models.append(("LogisticRegression",LogisticRegression()))
models.append(("Random Forest",RandomForestClassifier(n_estimators=10,max_depth=7,random_state=1)))
models.append(("NaiveBayes",GaussianNB()))



In [None]:
def model_test(x_data,y_data):
    # The split does not depend on the model, so do it once for all of them
    train_x,test_x,train_y,test_y = train_test_split(x_data,y_data,train_size=0.75,random_state=42,stratify=y_data)
    for name,model in models:
        print("#"*10,"Validation for %s "%name,"#"*10)
        model.fit(train_x,train_y)
        print("Training Metrics of %s"%name)
        metrics(train_y,model.predict(train_x))
        pred = model.predict(test_x)
        print("Testing Metrics of %s"%name)
        metrics(test_y,pred)


## Oversampling Data

In [None]:
model_test(X_train_over,y_train_over)

## Feature Importance

In [None]:
from sklearn.inspection import permutation_importance

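# Fixed seed so the shuffles, and therefore the ranking, are reproducible
imps = permutation_importance(naive, test_x, test_y, random_state=42)
importances = imps.importances_mean
std = imps.importances_std
indices = np.argsort(importances)[::-1]

# Print the feature ranking together with the feature names
print("Feature ranking:")
for f in range(test_x.shape[1]):
    print("%d. %s (%f)" % (f + 1, test_x.columns[indices[f]], importances[indices[f]]))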
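
For intuition about what permutation importance measures, here is a minimal hand-rolled sketch on synthetic data. The `predict` function is a hypothetical stand-in for a fitted model, and the loop only approximates the idea (shuffle one column, measure the drop in accuracy); it is not a reimplementation of scikit-learn's function.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: the label depends on feature 0 only; feature 1 is noise.
X = rng.normal(size=(500, 2))
y = (X[:, 0] > 0).astype(int)

def predict(features):
    """Stand-in for a fitted model: thresholds feature 0 and ignores feature 1."""
    return (features[:, 0] > 0).astype(int)

baseline = np.mean(predict(X) == y)

importances = []
for j in range(X.shape[1]):
    drops = []
    for _ in range(5):  # repeat the shuffle a few times and average
        X_perm = X.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])
        drops.append(baseline - np.mean(predict(X_perm) == y))
    importances.append(float(np.mean(drops)))

print("baseline accuracy:", baseline)
print("mean accuracy drop per feature:", importances)
```

Shuffling the informative column breaks its link to the label and the accuracy falls, while shuffling the ignored column leaves the predictions unchanged.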

In [None]:
feature = pd.DataFrame({"imp":importances,"col":columns})
feature = feature.sort_values(['imp','col'],ascending=[True,False]).iloc[-30:]
feature.plot(kind='barh',x='col',y='imp',figsize=(10,7),legend=False)
plt.title("Feature Importances")
plt.ylabel("Features")
plt.xlabel("Importances")
plt.show()

In [None]:
test_data.drop(['ID_code'],axis=1,inplace=True)

predict = naive.predict(test_data)


In [None]:
pd.Series(predict).value_counts().plot(kind='bar')

In [None]:
sample_submission = pd.read_csv('../input/santander-customer-transaction-prediction/sample_submission.csv')
sample_submission['target'] = predict


In [None]:
print(sample_submission.head())
sample_submission.to_csv('submission_1.csv',index=False)