<h1><center>Santander Customer Satisfaction - Baseline model</center></h1>

In this notebook we explore the problem of predicting which Santander customers are dissatisfied. The competition was hosted on Kaggle, and the [data](https://www.kaggle.com/c/santander-customer-satisfaction/) can be downloaded after accepting the competition rules.

The purpose of this notebook is to demonstrate ML techniques for solving structured (tabular) problems of this kind.

In [1]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

In [2]:
import os
import operator
import numpy as np
import pandas as pd
import xgboost as xgb
import matplotlib.pyplot as plt
from collections import defaultdict
from dotenv import find_dotenv, load_dotenv

from sklearn.utils import shuffle
from sklearn.decomposition import PCA
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import Normalizer
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import chi2, f_classif, mutual_info_classif
from sklearn.feature_selection import SelectFromModel, GenericUnivariateSelect

from src.models.train_model import train_xgb_cv
from src.data.clean_dataset import feats_n_unique
from src.data.clean_dataset import remove_new_values
from src.features.preprocess.transform import log_transform
from src.features.pruning.interaction import model_select_features
from src.features.pruning.interaction import get_correlated_features

PROJECT_PATH = '../../'

## Loading the raw data files

In [3]:
train = pd.read_csv(os.path.join(PROJECT_PATH, "data", "raw", "train.csv"))

test = pd.read_csv(os.path.join(PROJECT_PATH, "data", "raw", "test.csv"))

In [4]:
train_x = train.drop(['TARGET'], axis=1)
train_y = train['TARGET']

## 1. Data Cleaning

### 1.1 Limiting test values to the min-max range of the corresponding train features

In [5]:
test = remove_new_values(train_x, test)
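
The source of `remove_new_values` is not shown in this notebook. A minimal sketch of what such a step could look like, assuming it simply clips each numeric test column to the range observed in train, is given below. The function name `clip_to_train_range` and the demo frames are hypothetical, and the real helper may handle unseen values differently.

```python
import pandas as pd


def clip_to_train_range(train_df: pd.DataFrame, test_df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of test_df with each shared column clipped to train's [min, max].

    Assumption: all shared columns are numeric.
    """
    clipped = test_df.copy()
    shared = [col for col in train_df.columns if col in clipped.columns]
    for col in shared:
        clipped[col] = clipped[col].clip(
            lower=train_df[col].min(), upper=train_df[col].max()
        )
    return clipped


# Made-up data, only to illustrate the behaviour.
demo_train = pd.DataFrame({"a": [1, 5, 3], "b": [0.0, 2.0, 1.0]})
demo_test = pd.DataFrame({"a": [-4, 2, 9], "b": [0.5, 7.0, -1.0]})
demo_clipped = clip_to_train_range(demo_train, demo_test)
print(demo_clipped)
```

Note that an identifier column such as `ID` would be clipped as well under this sketch, so it may be worth excluding it before calling the function.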

## 2. Feature Transform

1. Log of certain features
2. Converting categorical to one-hot
3. Missing value imputation

### 2.1 [FAILED] Log of highly varied features

Attempted a log transformation of the highly varied features:
- it did not improve AUC
- in fact, AUC dropped by 0.005 when training with XGBoost
- hence, this step is skipped

In [None]:
# train_x, feats_log_transform = log_transform(train_x)
# test, _ = log_transform(test, features=feats_log_transform)
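
For reference, the kind of transform that was tried can be sketched as follows. This is not the project's `log_transform` (its source is not shown); it is a hypothetical version that assumes "highly varied" means a large max-min range and that uses a signed `log1p` so negative values are handled. The name `signed_log_transform` and the `min_range` cut-off are assumptions.

```python
import numpy as np
import pandas as pd


def signed_log_transform(df, features=None, min_range=1000.0):
    """Apply sign(x) * log1p(|x|) to wide-ranging numeric columns.

    If `features` is None, pick the columns whose (max - min) exceeds
    `min_range`; otherwise transform exactly the columns passed in, so the
    selection made on train can be reused on test.
    """
    out = df.copy()
    if features is None:
        spread = out.max() - out.min()
        features = spread[spread > min_range].index.tolist()
    for col in features:
        out[col] = np.sign(out[col]) * np.log1p(np.abs(out[col]))
    return out, features


# Made-up data, only to illustrate the behaviour.
demo_tr = pd.DataFrame({"small": [0, 1, 2], "wide": [0.0, 99.0, -9999.0]})
demo_te = pd.DataFrame({"small": [3, 4, 5], "wide": [9.0, 0.0, 999.0]})
demo_tr_log, demo_log_feats = signed_log_transform(demo_tr)
demo_te_log, _ = signed_log_transform(demo_te, features=demo_log_feats)
```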

## 3. Feature Pruning

1. Delete features with no variance
2. Delete highly correlated features
3. Rank features (model importances or scikit-learn selectors) and delete them one by one
4. Transform and delete

### 3.1 Removing columns with only 1 unique value

If a feature has only 1 unique value, it carries no information for predicting the target variable.

In [6]:
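# removing features with only one unique value
train_unique_feats = feats_n_unique(train, 1)
train.drop(columns=train_unique_feats, inplace=True)
# keep the modeling frame in sync with train and test
train_x.drop(columns=train_unique_feats, inplace=True)

test.drop(columns=train_unique_feats, inplace=True)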
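
`feats_n_unique` comes from the project's `src` package and is not shown here. A plausible minimal equivalent, assuming it returns the names of columns with exactly `n` distinct values, could look like this (the name `columns_with_n_unique` is hypothetical):

```python
import pandas as pd


def columns_with_n_unique(df: pd.DataFrame, n: int) -> list:
    """Return the names of columns that hold exactly n distinct values.

    NaN is counted as a value (dropna=False), so a column that is constant
    apart from missing entries is not reported as having one unique value.
    """
    counts = df.nunique(dropna=False)
    return counts[counts == n].index.tolist()


# Made-up data, only to illustrate the behaviour.
demo_df = pd.DataFrame(
    {"const": [0, 0, 0], "binary": [0, 1, 0], "id": [1, 2, 3]}
)
demo_constant_cols = columns_with_n_unique(demo_df, 1)
print(demo_constant_cols)
```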

### 3.2 Removing correlated features

In [None]:
feats_corr_del = get_correlated_features(train_x, 0.98)
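
The implementation of `get_correlated_features` is not shown. One common way to get such a list, sketched here under the assumption that the helper uses absolute Pearson correlation and keeps the first feature of each correlated pair, is to scan the upper triangle of the correlation matrix. The name `correlated_columns` is hypothetical.

```python
import numpy as np
import pandas as pd


def correlated_columns(df: pd.DataFrame, threshold: float) -> list:
    """Return columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once
    # and a column is never compared with itself.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]


# Made-up data: x_copy is a linear function of x, noise is not.
demo_corr_df = pd.DataFrame(
    {
        "x": [1, 2, 3, 4, 5],
        "x_copy": [3, 5, 7, 9, 11],
        "noise": [2, -1, 0, 3, -2],
    }
)
demo_corr_del = correlated_columns(demo_corr_df, 0.98)
print(demo_corr_del)
```

Because only the later column of each pair is flagged, dropping the returned list leaves one representative of every correlated group.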

### 3.3 Removing features based on model importances

In [None]:
model = xgb.XGBClassifier(learning_rate=0.02, subsample=0.7, colsample_bytree=0.6)

feats_selected_m, feat_importances = model_select_features(model, train_x, train_y, run_all_thresh=True)
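
`model_select_features` is a project helper whose source is not included. The sketch below shows the general idea of importance-based selection over several thresholds, assuming the helper does something comparable when `run_all_thresh=True`. To keep the example light it uses scikit-learn's `RandomForestClassifier` as a stand-in for the XGBoost classifier; any estimator exposing `feature_importances_` would work the same way. The function name, thresholds, and synthetic data are all hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def select_by_importance(model, X: pd.DataFrame, y, thresholds):
    """Fit model once, then list the features kept at each importance threshold."""
    model.fit(X, y)
    importances = pd.Series(model.feature_importances_, index=X.columns)
    selected = {
        t: importances[importances >= t].index.tolist() for t in thresholds
    }
    return selected, importances.sort_values(ascending=False)


# Synthetic data, only to illustrate the behaviour.
X_arr, y_arr = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_demo_sel = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])
stand_in_model = RandomForestClassifier(n_estimators=50, random_state=0)
selected_demo, importances_demo = select_by_importance(
    stand_in_model, X_demo_sel, y_arr, thresholds=[0.0, 0.1, 0.2]
)
for t, feats in selected_demo.items():
    print(t, len(feats))
```

Each candidate subset would then typically be scored with cross-validation to pick the threshold that keeps AUC while dropping the most features.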

### 3.4 Removing features based on statistical tests

In [None]:
feature_stat = GenericUnivariateSelect(mutual_info_classif, mode='k_best', param=60)
feature_stat.fit(train_x.to_numpy(), train_y)
feature_selected_stat = train_x.columns[feature_stat.get_support()]

In [None]:
# fraction of model-selected features that the statistical test also selected
len(set(feature_selected_stat) & set(feats_selected_m)) / len(feats_selected_m)

## 4. Feature Generation

1. Add PCA features
2. Add an indicator feature for missing values
3. Add number of zeros in each row
4. Perform EDA

### 4.1 Perform PCA with normalizer

In [7]:
normalizer = Normalizer()
pca = PCA(n_components=10)

In [8]:
pca_train = pca.fit_transform(normalizer.fit_transform(train_x))[:, :3]
train["pca_1"] = pca_train[:, 0]
train["pca_2"] = pca_train[:, 1]
train["pca_3"] = pca_train[:, 2]

pca_test = pca.transform(normalizer.transform(test))[:, :3]
test["pca_1"] = pca_test[:, 0]
test["pca_2"] = pca_test[:, 1]
test["pca_3"] = pca_test[:, 2]

### 4.2 Adding n_zeros per row

In [9]:
# adding zeros per row
train_x["n_zeros"] = (train_x==0).sum(axis=1)
test["n_zeros"] = (test==0).sum(axis=1)

## 5. Validation setup

In [10]:
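# random_state only takes effect (and is only accepted by recent scikit-learn) when shuffle=True
skf = StratifiedKFold(n_splits=15, shuffle=True, random_state=22)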
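
Stratification matters here because it keeps the share of positive targets roughly constant across folds, which stabilises per-fold AUC when one class is rare. The following sketch checks this on made-up labels; the 4% positive rate and all `*_demo` names are assumptions chosen for illustration, not figures taken from the competition data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(22)

# Made-up imbalanced target: 60 positives out of 1500 rows.
y_demo = np.zeros(1500, dtype=int)
y_demo[:60] = 1
rng.shuffle(y_demo)
X_demo = rng.normal(size=(1500, 3))

skf_demo = StratifiedKFold(n_splits=15, shuffle=True, random_state=22)

# Positive rate inside each validation fold.
fold_positive_rates = [
    y_demo[val_idx].mean() for _, val_idx in skf_demo.split(X_demo, y_demo)
]
print(np.round(fold_positive_rates, 3))
```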

## 6. Modeling

Here, an XGBoost model is trained using the native XGBoost library API (not the scikit-learn wrapper).

In [11]:
# xgb params
param = {
    'max_depth': 3,
    'eta': 0.1,
    'verbosity': 0,  # replaces the deprecated 'silent': 1
    'objective': 'binary:logistic',
    'nthread': 4,
    'eval_metric': 'auc',
    'colsample_bytree': 0.7,
}

In [None]:
# drop predictions left over from an earlier run, if present
test.drop(['TARGET'], axis=1, inplace=True, errors='ignore')

In [12]:
train_preds, test_preds = train_xgb_cv(train_x, train_y, test, skf, param)


[0]	eval-auc:0.801537	train-auc:0.797202
[1]	eval-auc:0.821005	train-auc:0.807739
[2]	eval-auc:0.800604	train-auc:0.810656
[3]	eval-auc:0.813579	train-auc:0.818085
[4]	eval-auc:0.800654	train-auc:0.810389
[5]	eval-auc:0.806186	train-auc:0.817402
[6]	eval-auc:0.67199	train-auc:0.824313
[7]	eval-auc:0.683968	train-auc:0.823543
[8]	eval-auc:0.508168	train-auc:0.820993
[9]	eval-auc:0.464519	train-auc:0.823296
[10]	eval-auc:0.508615	train-auc:0.826711
[11]	eval-auc:0.593887	train-auc:0.826458
[12]	eval-auc:0.648456	train-auc:0.827159
[13]	eval-auc:0.659726	train-auc:0.826903
[14]	eval-auc:0.665418	train-auc:0.828296


KeyboardInterrupt: 

In [None]:
roc_auc_score(train_y, train_preds)

In [None]:
test["TARGET"] = test_preds
test[["ID", "TARGET"]].to_csv("test_preds_cv15_n130_xgb.csv", index=False)

In [None]:
str(skf)

## Scores

In [None]:
from IPython.display import display
from IPython.display import Image
img = Image("/Users/santoshgsk/Desktop/kaggle_score.png")
display(img)

## Checkpointing

In [13]:
FOLDER_NAME = os.path.basename(os.getcwd())

In [14]:
FOLDER_PATH = os.path.join(PROJECT_PATH, "data", "processed", FOLDER_NAME)

In [15]:
# create the checkpoint folder (and any missing parents) if it does not exist yet
os.makedirs(FOLDER_PATH, exist_ok=True)

In [16]:
import re
import datetime

In [17]:
# timestamp with separators stripped, e.g. 'YYYYMMDDHHMMSS' followed by microseconds
current_time_str = re.sub(r'[- :.]', '', str(datetime.datetime.now()))

In [18]:
train_ckpt = os.path.join(FOLDER_PATH, current_time_str + "_baseline_01_train.csv")
test_ckpt = os.path.join(FOLDER_PATH, current_time_str + "_baseline_01_test.csv")

In [27]:
pd.concat([train_x, train_y.to_frame()], axis=1)

Unnamed: 0,ID,var3,var15,imp_ent_var16_ult1,imp_op_var39_comer_ult1,imp_op_var39_comer_ult3,imp_op_var40_comer_ult1,imp_op_var40_comer_ult3,imp_op_var40_efect_ult1,imp_op_var40_efect_ult3,...,saldo_medio_var33_hace3,saldo_medio_var33_ult1,saldo_medio_var33_ult3,saldo_medio_var44_hace2,saldo_medio_var44_hace3,saldo_medio_var44_ult1,saldo_medio_var44_ult3,var38,n_zeros,TARGET
0,1,2,23,0.0,0.00,0.00,0.00,0.00,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,39205.170000,355,0
1,3,2,34,0.0,0.00,0.00,0.00,0.00,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,49278.030000,329,0
2,4,2,23,0.0,0.00,0.00,0.00,0.00,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,67333.770000,340,0
3,8,2,37,0.0,195.00,195.00,0.00,0.00,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,64007.970000,309,0
4,10,2,39,0.0,0.00,0.00,0.00,0.00,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,117310.979016,319,0
5,13,2,23,0.0,0.00,0.00,0.00,0.00,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,87975.750000,355,0
6,14,2,27,0.0,0.00,0.00,0.00,0.00,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,94956.660000,340,0
7,18,2,26,0.0,0.00,0.00,0.00,0.00,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,251638.950000,341,0
8,20,2,45,0.0,0.00,0.00,0.00,0.00,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,101962.020000,355,0
9,23,2,25,0.0,0.00,0.00,0.00,0.00,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,356463.060000,346,0


In [33]:
test.to_csv(test_ckpt, index=False)

In [35]:
train.to_csv(train_ckpt, index=False)