# Intro
Welcome to the [Santander Customer Transaction Prediction](https://www.kaggle.com/c/santander-customer-transaction-prediction/overview) competition.
![](https://storage.googleapis.com/kaggle-competitions/kaggle/10385/logos/header.png)

<span style="color: royalblue;">Please upvote the notebook if it helps you. Thank you.</span>

# Libraries
We load some standard libraries along with the scikit-learn, LightGBM and XGBoost packages.

In [None]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score

In [None]:
import warnings
warnings.filterwarnings("ignore")

# Path
Define the input path and show the content of the input folder:

In [None]:
path = '/kaggle/input/santander-customer-transaction-prediction/'
os.listdir(path)

# Load Data

In [None]:
train_data = pd.read_csv(path+'train.csv')
test_data = pd.read_csv(path+'test.csv')
samp_subm = pd.read_csv(path+'sample_submission.csv')

# Overview

This is a big dataset: the train and test sets each contain 200,000 samples with 200 features:

In [None]:
print('number of train samples:', len(train_data))
print('number of test samples:', len(test_data))
print('number of features:', len(train_data.columns)-2)

The target distribution is very imbalanced, with far more 0s than 1s:

In [None]:
train_data['target'].value_counts()
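
To put a number on the imbalance, the class shares can be read off with `value_counts(normalize=True)`. The sketch below uses a small toy series rather than the competition data; the names `target`, `class_share` and `neg_pos_ratio` are illustrative assumptions.

```python
import pandas as pd

# Toy target with a 9:1 imbalance (hypothetical values, not the competition data)
target = pd.Series([0] * 9 + [1] * 1, name='target')

# normalize=True returns class shares instead of raw counts
class_share = target.value_counts(normalize=True)
print(class_share)

# Negatives per positive; a common starting point for XGBoost's
# scale_pos_weight parameter when classes are imbalanced
neg_pos_ratio = (target == 0).sum() / (target == 1).sum()
print('negatives per positive:', neg_pos_ratio)
```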

There are no missing values in the train and test data:

In [None]:
train_data.isnull().sum().sum(), test_data.isnull().sum().sum()

# PCA
We check whether PCA would let us reduce the dimensionality of the feature space:

In [None]:
pca = PCA().fit(train_data[train_data.columns[2:]])
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('No of components')
plt.ylabel('Cumulative explained variance')
plt.grid()
plt.show()

The cumulative explained variance reaches 99% at about 150 components, so 150 principal components would be a reasonable choice for this dataset. That is a reduction of about 25%. The rest of the notebook nevertheless works with the original features.
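
Instead of reading the component count off the plot, it can be computed from the cumulative variance directly. This is only a sketch on synthetic data; `X_demo`, `threshold` and the other names are assumptions for illustration, and the numbers it prints say nothing about the competition data.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the feature matrix (assumption: 500 rows, 20 features)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 20))

pca_demo = PCA().fit(X_demo)
cum_var = np.cumsum(pca_demo.explained_variance_ratio_)

# First index where the cumulative ratio reaches the threshold, +1 to get a count
threshold = 0.99
n_components = int(np.argmax(cum_var >= threshold) + 1)
print('components needed:', n_components)

# A float in (0, 1) asks PCA to pick the number of components itself
pca_auto = PCA(n_components=threshold, svd_solver='full').fit(X_demo)
print('components chosen by PCA:', pca_auto.n_components_)
```

Both routes should agree; the second one is shorter when the reduced data is needed right away via `pca_auto.transform`.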

# Feature Engineering
For every sample (row) we add the row-wise statistics sum, mean, std, min and max, computed over the 200 original features:

In [None]:
train_data['sum'] = train_data[train_data.columns[2:202]].sum(axis=1)
test_data['sum'] = test_data[test_data.columns[1:201]].sum(axis=1)
train_data['mean'] = train_data[train_data.columns[2:202]].mean(axis=1)
test_data['mean'] = test_data[test_data.columns[1:201]].mean(axis=1)
train_data['std'] = train_data[train_data.columns[2:202]].std(axis=1)
test_data['std'] = test_data[test_data.columns[1:201]].std(axis=1)
train_data['min'] = train_data[train_data.columns[2:202]].min(axis=1)
test_data['min'] = test_data[test_data.columns[1:201]].min(axis=1)
train_data['max'] = train_data[train_data.columns[2:202]].max(axis=1)
test_data['max'] = test_data[test_data.columns[1:201]].max(axis=1)
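
The same row-wise statistics can be written as a loop over a fixed list of feature columns. The sketch below uses a small synthetic frame; `demo` and `feature_cols` are hypothetical names. Fixing the column list up front keeps later statistics from being computed over columns that were added earlier.

```python
import numpy as np
import pandas as pd

# Small synthetic frame standing in for train_data (hypothetical columns)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(4, 3)), columns=['var_0', 'var_1', 'var_2'])

# Freeze the list of original features before adding any new column
feature_cols = [c for c in demo.columns if c.startswith('var_')]

# One new column per statistic, each computed across the original features only
for stat in ['sum', 'mean', 'std', 'min', 'max']:
    demo[stat] = demo[feature_cols].agg(stat, axis=1)

print(demo.head())
```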

Plot the distribution of the new features for train (upper row) and test (lower row) data:

In [None]:
def plot_distribution():
    # Histograms of the new features: train in the upper row, test in the lower row
    fig, axs = plt.subplots(2, 5, figsize=(20, 5))
    fig.subplots_adjust(hspace=0.5, wspace=0.2)
    axs = axs.ravel()
    features = ['sum', 'mean', 'std', 'min', 'max']
    bins = 50
    for col in range(5):
        axs[col].hist(train_data[features[col]], bins=bins, color='blue', alpha=0.7)
        axs[col+5].hist(test_data[features[col]], bins=bins, color='red', alpha=0.7)
        axs[col].set_title(features[col]+' - train')
        axs[col+5].set_title(features[col]+' - test')
        axs[col].set_ylabel('Frequency')
        axs[col+5].set_ylabel('Frequency')
        axs[col].grid()
        axs[col+5].grid()
        # Hide the y ticks: only the shape of the distribution matters here
        axs[col].set_yticks([])
        axs[col+5].set_yticks([])

In [None]:
plot_distribution()

# Define Train and Test Data

In [None]:
X_train = train_data[train_data.columns[2:]]
y_train = train_data['target']
X_test = test_data[test_data.columns[1:]]

In [None]:
assert len(X_train.columns) == len(X_test.columns)

In [None]:
print('number of train samples:', len(X_train))
print('number of test samples:', len(X_test))

# Scale Data

In [None]:
min_max = MinMaxScaler()
X_train = min_max.fit_transform(X_train)
X_test = min_max.transform(X_test)
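
The scaler learns its minimum and maximum from the training data only, so transformed test values are not guaranteed to stay inside [0, 1]. A toy sketch with made-up numbers (`train_demo`, `test_demo` and `scaler` are illustrative names):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One toy feature: train spans 0..10, test contains values outside that range
train_demo = np.array([[0.0], [5.0], [10.0]])
test_demo = np.array([[-2.0], [12.0]])

# min=0 and max=10 are learned from the train data only
scaler = MinMaxScaler().fit(train_demo)
scaled_train = scaler.transform(train_demo)
scaled_test = scaler.transform(test_demo)

print(scaled_train.ravel())
print(scaled_test.ravel())
```

Tree-based models such as XGBoost split on feature order rather than magnitude, so this rescaling should matter little for the model used here.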

# Split Train And Val

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.33, random_state=2020)

In [None]:
print('number of train samples:', len(X_train))
print('number of val samples:', len(X_val))
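
With an imbalanced target, a plain random split can leave the two parts with slightly different positive rates. One option, shown here as a sketch on toy data and not as what this notebook does, is the `stratify` argument of `train_test_split`; all `*_demo`, `X_tr`, `X_va` names are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data with 10% positives (hypothetical, not the competition data)
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.array([1] * 100 + [0] * 900)

# stratify keeps the class proportions (almost) identical in both parts
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=2020, stratify=y_demo)

print('positive rate train:', y_tr.mean())
print('positive rate val:', y_va.mean())
```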

# Model

In [None]:
# With the 'binary:logistic' objective set in the parameter grid, predict() returns probabilities
model = xgb.XGBRegressor()

In [None]:
params = {'objective': ['binary:logistic'],
          'random_state': [2020],
          'n_estimators': [300, 500, 700],
          'max_depth': [4, 6, 8],
          'learning_rate': [0.1, 0.01]}

In [None]:
grid = GridSearchCV(model, params, n_jobs=4, scoring='roc_auc', verbose=3)
grid.fit(X_train, y_train)
print('best score:', grid.best_score_)
print('best param:', grid.best_params_)
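
`GridSearchCV` refits the best configuration on all data passed to `fit` by default (`refit=True`), and exposes it as `best_estimator_`. The sketch below illustrates this with a `LogisticRegression` on synthetic data, swapped in only to keep the example small and fast; it is not the XGBoost search from this notebook, and all names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data (assumption: 300 rows, 10 features)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

demo_grid = GridSearchCV(LogisticRegression(max_iter=1000),
                         {'C': [0.1, 1.0, 10.0]},
                         scoring='roc_auc', cv=3)
demo_grid.fit(X_demo, y_demo)

# With refit=True (the default) the best configuration is already
# retrained on everything that was passed to fit()
best_model = demo_grid.best_estimator_
print(demo_grid.best_params_, round(demo_grid.best_score_, 3))
```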

In [None]:
model.set_params(**grid.best_params_)
model.fit(X_train, y_train)

Predict validation data:

In [None]:
y_val_pred = model.predict(X_val)
roc_auc_score(y_val, y_val_pred)
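
ROC AUC measures how well the predictions rank positives above negatives, so it should be given continuous scores rather than thresholded labels. A toy illustration with made-up values (`scores`, `hard_labels` and the other names are assumptions):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.9])    # continuous scores, perfectly ranked
hard_labels = (scores >= 0.5).astype(int)  # thresholding discards part of the ranking

auc_scores = roc_auc_score(y_true, scores)
auc_labels = roc_auc_score(y_true, hard_labels)
print('AUC from scores:', auc_scores)
print('AUC from hard labels:', auc_labels)
```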

Predict test data:

In [None]:
y_test = model.predict(X_test)

# Write Output

In [None]:
output = pd.DataFrame({'ID_code': samp_subm['ID_code'],
                       'target': y_test})

In [None]:
output['target'].describe()

In [None]:
output.to_csv('submission.csv', index=False)