# Introduction

In this project we have to identify which customers will make a specific transaction in the future, irrespective of the amount of money transacted.

Evaluation:
Submissions are evaluated on the area under the ROC curve between the predicted probability and the observed target.
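
To make the metric concrete, here is a minimal sketch using scikit-learn's `roc_auc_score`. The labels and scores are invented toy values, not competition data. It compares scoring predicted probabilities against scoring thresholded 0/1 labels, which throws away the ranking information the metric rewards.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and scores, invented purely for illustration.
y_true = np.array([0, 0, 0, 1, 1, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.8])  # predicted probabilities
y_hard = (y_prob >= 0.5).astype(int)                 # thresholded 0/1 labels

# Every positive is ranked above every negative, so the AUC is perfect.
auc_prob = roc_auc_score(y_true, y_prob)
# After thresholding, most positives tie with the negatives and the AUC drops.
auc_hard = roc_auc_score(y_true, y_hard)
print(auc_prob, auc_hard)
```

In this toy case the probabilities score higher than the hard labels, which is why submitting probabilities is usually the safer choice for an AUC-scored task.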



In [None]:
# Importing all required libraries

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import seaborn as sns

# Stats
from scipy import stats
from scipy.stats import skew, norm
from scipy.special import boxcox1p
from scipy.stats import boxcox_normmax

# For ignoring the warnings
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

import gc
gc.enable()

In [None]:
# Fetching the data to pandas dataframes
train = pd.read_csv('/kaggle/input/santander-customer-transaction-prediction/train.csv')
test = pd.read_csv('/kaggle/input/santander-customer-transaction-prediction/test.csv')
sub = pd.read_csv('/kaggle/input/santander-customer-transaction-prediction/sample_submission.csv')

## Let's explore the training data first

In [None]:
# Checking training data
train.shape

The training data has 200,000 rows and 202 columns, which gives us a good amount of data for training our machine learning models.

In [None]:
train.head()

As we can see, the data has been anonymized by the company: only the target and the ID are provided without anonymization. We will need to explore the data and formulate hypotheses to de-anonymize it, so that we can infer insights that help us predict the target value for the test data.

In [None]:
# Checking the datatypes of columns
train.dtypes

In [None]:
train.dtypes.value_counts()

All the anonymized columns, i.e. var_0 to var_199, have a float datatype. We will have to check whether the values in each column are genuinely numeric or are encodings of a categorical variable.<br>
If they are encodings, we will decode them back to categorical features, which might improve the accuracy of our predictions.
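
One simple way to start this check is to count the distinct values in each column: a float column with only a handful of distinct values is a candidate for an encoded category. The sketch below uses a made-up toy frame in place of `train`, and the cut-off `max_levels` is a hypothetical choice, not something derived from the data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `train`; column names and values are made up.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "var_a": rng.normal(size=1000),                    # continuous values
    "var_b": rng.choice([0.5, 1.5, 2.5], size=1000),   # an encoded category
})

# Number of distinct values per column, smallest first.
n_unique = toy.nunique().sort_values()

# Hypothetical cut-off: columns with few distinct values are candidates.
max_levels = 20
categorical_candidates = n_unique[n_unique <= max_levels].index.tolist()
print(categorical_candidates)
```

On the real data the same idea would be applied to `train[feature_names]`, and any candidate column would still need a closer look before being treated as categorical.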

# Exploratory data analysis

## Checking for missing values in the data

In [None]:
train.isna().sum().sort_values(ascending = False).head()

As this is banking-related data, it is not surprising that there are no null values in it, which suggests that the bank recorded the data consistently.

## Let's explore the columns now

In [None]:
# Checking if every row has a unique id (the count should equal the number of rows)
train['ID_code'].nunique()

In [None]:
# Target column
train['target'].value_counts()

In [None]:
ax = sns.countplot(x="target",data=train)

The target variable is highly imbalanced, as the value 0 has a much higher count than the value 1.<br>
We will need to account for this when cross-validating a model, so that the samples used for training and for validation have a suitable class balance.
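
One common way to handle this during validation is stratified splitting. Note that it preserves the class ratio in every fold rather than rebalancing the classes, so it is only one possible reading of the requirement. The sketch below uses scikit-learn's `StratifiedKFold` on an invented toy target; the 90/10 split and the number of folds are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced target: 90 negatives and 10 positives (invented numbers).
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Share of positives in each validation fold.
fold_positive_rates = []
for train_idx, valid_idx in skf.split(X_toy, y_toy):
    fold_positive_rates.append(y_toy[valid_idx].mean())
print(fold_positive_rates)
```

Each validation fold keeps the same share of positives as the full toy target, so no fold ends up without any positive examples.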

## Checking for duplicate columns

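First we will encode every column with the pandas `factorize` method, which maps each distinct value to an integer code so that columns can be compared by their value patterns. The `tqdm` library only displays a progress bar for the loop, which is handy when handling a large amount of data.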

Fun Fact : **tqdm means "progress" in Arabic (taqadum, تقدّم) and is an abbreviation for "I love you so much" in Spanish (te quiero demasiado).**

In [None]:
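from tqdm.notebook import tqdm  # tqdm.tqdm_notebook is deprecated

# Integer-encode every column; tqdm only shows a progress bar for the loop
train_enc = pd.DataFrame(index=train.index)
for col in tqdm(train.columns):
    train_enc[col] = train[col].factorize()[0]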

There are no duplicate columns present in the data
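
The encoding cell above stops short of the comparison itself. One possible way to finish the check, sketched here on a made-up toy frame, is to transpose the encoded frame and look for repeated rows with `DataFrame.duplicated`. This is an assumption about how the check could be done, not necessarily how the result above was obtained, and on 200,000 rows the transpose may be slow.

```python
import pandas as pd

# Toy frame; names and values are invented for illustration.
toy = pd.DataFrame({
    "var_a": [1.0, 2.0, 1.0, 3.0],
    "var_b": [10.0, 20.0, 10.0, 30.0],  # same pattern as var_a, different scale
    "var_c": [1.0, 1.0, 2.0, 3.0],
})

# Integer-encode each column: equal codes mean equal value patterns.
toy_enc = pd.DataFrame(
    {col: toy[col].factorize()[0] for col in toy.columns},
    index=toy.index,
)

# A column is a duplicate if its encoded values match an earlier column.
is_dup = toy_enc.T.duplicated()
duplicate_columns = is_dup[is_dup].index.tolist()
print(duplicate_columns)
```

Because the comparison runs on the integer codes, `var_b` is flagged even though its raw values differ from those of `var_a`.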

# Feature Skewness

In [None]:
feature_names = train.columns[2:]
feature_names


In [None]:
# Find skewed numerical features
skew_features = train[feature_names].apply(lambda x: skew(x)).sort_values(ascending=False)

high_skew = skew_features[skew_features > 0.5]
skew_index = high_skew.index

print("There are {} numerical features with Skew > 0.5 :".format(high_skew.shape[0]))
skewness = pd.DataFrame({'Skew' :high_skew})
skew_features.head(10)

As there are no highly skewed features, we do not need to apply any transformation to the data.

# Outliers

As the data set is large, we will first calculate the z-scores of the training data and then examine whether the flagged values should really be treated as outliers.

In [None]:
# Dropping ID and target columns
z_score_calc = train.drop(columns=['ID_code', 'target'])
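# Calculating the absolute z-score of every value, column by column
z = np.abs(stats.zscore(z_score_calc))
# print(z)
# A value is flagged as an outlier when its |z| exceeds this threshold
threshold = 4
print(np.where(z > threshold))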

The first array gives the row numbers and the second array gives the corresponding column numbers of the outliers.
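
To see how this output should be read, here is a small sketch on an invented toy matrix with a single extreme value planted at row 7, column 1. The matrix and the planted value are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

# Toy matrix: 20 rows, 2 columns, with one extreme value at (row 7, column 1).
data = np.zeros((20, 2))
data[:, 0] = np.arange(20)
data[:, 1] = 1.0
data[7, 1] = 100.0

# Column-wise absolute z-scores (zscore works along axis=0 by default).
z_toy = np.abs(stats.zscore(data))

# np.where returns one array of row indices and one of column indices.
rows, cols = np.where(z_toy > 4)
print(rows, cols)
```

Pairing the two arrays element by element gives the (row, column) position of each flagged value.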

In [None]:
treated_data = train[(z < 4).all(axis=1)]

In [None]:
print("before treating outliers : {}".format(train.shape))
print("after treating outliers : {}".format(treated_data.shape))

### We have removed **27** rows having outliers

We will try fitting our model on both the treated and the untreated data to see which performs better, and then we can try changing the threshold value used for finding outliers.

# Feature Selection

In [None]:
treated_data.columns

In [None]:
# Creating the variables for model fitting
X = treated_data.drop(columns=['ID_code', 'target'])
y = treated_data['target']

# Test variable
test = test.drop(columns=['ID_code'])

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

In [None]:
# Import the random forest model
from sklearn.ensemble import RandomForestClassifier
# Instantiate the model with default hyperparameters
rf = RandomForestClassifier()
# Fit the model on the training split
rf.fit(X_train, y_train)
# Score it on the held-out split; note that score() returns accuracy, not ROC AUC
rf.score(X_test, y_test)

In [None]:
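# The metric is ROC AUC, so predict the probability of class 1 rather than hard 0/1 labels
prediction_rf = rf.predict_proba(test)[:, 1]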

In [None]:
train.columns

In [None]:
submission=pd.DataFrame({"ID_code":sub['ID_code'],
                         "target":prediction_rf})
submission.to_csv('submission_rf.csv',index=False)

In [None]:
feature_importances = pd.DataFrame(rf.feature_importances_,
                                   index=X_train.columns,
                                   columns=['importance']).sort_values('importance', ascending=False)

In [None]:
feature_importances

In [None]:
feature_importances.median()

In [None]:
feature_importances.tail(15)

Removing the features that have importance less than 0.0038
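
Rather than copying the feature names by hand, the list can be derived from the importance table. The sketch below uses a hypothetical table shaped like `feature_importances`; the feature names and importance values are made up, and only the 0.0038 cut-off comes from the text above.

```python
import pandas as pd

# Hypothetical importance table shaped like `feature_importances`;
# the feature names and numbers are invented for illustration.
toy_importances = pd.DataFrame(
    {"importance": [0.0100, 0.0050, 0.0037, 0.0030]},
    index=["var_x", "var_y", "var_z", "var_w"],
)

threshold = 0.0038  # cut-off quoted in the text

# Names of the features whose importance falls below the cut-off.
below = toy_importances["importance"] < threshold
low_importance_features = toy_importances.index[below].tolist()
print(low_importance_features)
```

The resulting list can be passed to `DataFrame.drop(columns=...)` for the training, validation and test frames, which keeps the three in sync.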


In [None]:
# Features whose importance is below 0.0038
low_importance = ['var_38', 'var_158', 'var_73', 'var_14', 'var_10', 'var_84', 'var_61', 'var_103', 'var_185']
X_train = X_train.drop(columns=low_importance)
X_test = X_test.drop(columns=low_importance)
X_train.head()


In [None]:
# Refit the model on the reduced feature set
rf.fit(X_train, y_train)
# And score it (accuracy) on the held-out split
rf.score(X_test, y_test)

In [None]:
feature_selected_test = test.drop( ['var_38','var_158','var_73','var_14','var_10','var_84','var_61','var_103','var_185'],axis = 1)
feature_selected_test.head()

In [None]:
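# The metric is ROC AUC, so predict the probability of class 1 rather than hard 0/1 labels
prediction_rf = rf.predict_proba(feature_selected_test)[:, 1]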

In [None]:
submission=pd.DataFrame({"ID_code":sub['ID_code'],
                         "target":prediction_rf})
submission.to_csv('submission_rf2.csv',index=False)