# Customer satisfaction: models & explainability

The [Santander Customer Satisfaction competition](https://www.kaggle.com/c/santander-customer-satisfaction) was launched in 2016, with the aim of finding the best models to predict which customers are happy and which have complaints. Several models reached an AUC-ROC score of 0.84 at the time, but few submissions included a complete analysis from scratch, post-analysis included.

The idea of this notebook is to provide an **end-to-end approach to this challenge**, starting from understanding the many columns of the dataset (370), comparing which types of models are best suited for the problem, developing a competitive algorithm through feature engineering and finally analysing which features are most important for the predictions and how they affect them.


**TABLE OF CONTENTS**

1. [Exploratory data analysis (EDA)](#section1)

    1.1. [Class balance](#section11)
    
    1.2. [Var columns](#section12)
    
    1.3. [Correlation with TARGET](#section13)
    
    1.4. [Missing data](#section14)
    
    
2. [Comparison of classification models](#section2)


3. [Model development](#section3)

    3.1. [Model 1: Baseline LGB](#section31)

    3.2. [Model 2: Remove duplicated and constant columns](#section32)

    3.3. [Model 3: Transform skewed data](#section33)


4. [Model explainability](#section4)

    4.1. [Feature importance](#section41)
    
    4.2. [SHAP values](#section42)
    

5. [Submission](#section5)

In [None]:
import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_uniform
import time
import warnings
warnings.filterwarnings('ignore')

# ML models
import lightgbm as lgb
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# Additional libraries related to ML tasks
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, normalize
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV, cross_validate, cross_val_score
from sklearn.feature_selection import VarianceThreshold
from sklearn.metrics import confusion_matrix, roc_auc_score, accuracy_score, precision_score, recall_score, f1_score, precision_recall_fscore_support
from sklearn.metrics import mean_squared_error as mse
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from imblearn.under_sampling import RandomUnderSampler
from imblearn.under_sampling import OneSidedSelection
import eli5
from eli5.sklearn import PermutationImportance
import shap

# 1. Exploratory data analysis (EDA) <a id="section1"></a>

Ideally one would perform a thorough analysis of each column, but this requires an amount of time and resources that may not be worthwhile for the sake of this project. I will focus only on the most relevant features of the dataset, but be aware that real-world projects frequently rely on really detailed and thorough EDAs.

First things first; let's load the data, display its structure and get a brief summary:


In [None]:
train = pd.read_csv("../input/santander-customer-satisfaction/train.csv")
test = pd.read_csv("../input/santander-customer-satisfaction/test.csv")
submission_example = pd.read_csv("../input/santander-customer-satisfaction/sample_submission.csv")

print("Train dataset:", len(train), "rows and", len(train.columns), "columns")
print("Test dataset:", len(test), "rows and", len(test.columns), "columns")
display(train.head(5))
display(train.describe())

**OBSERVATIONS**:
* The dataset contains a large number of columns: 371
* Binary target class: 0 and 1
* Some columns present very large values. For example, var38 contains values of order $10^7$
* Ad-hoc values are present in some cases. For example, var3 with -999999 values. This could point in the direction of unknown encoded categorical variables

## 1.1. Class balance <a id="section11"></a>

Distribution of the number of happy and non-happy customers:


In [None]:
len_train = len(train)
target_0 = len(train.loc[train['TARGET']==0])/len_train
target_1 = 1-target_0

fig, axes = plt.subplots(1, 2, figsize=(12,5))

# TARGET distribution count
sns.countplot(x='TARGET', ax=axes[0], data=train, palette='Set2')
axes[0].set_title('Target count')

# TARGET distribution pie chart
axes[1].pie([target_0, target_1], colors=['mediumaquamarine', 'coral'], autopct='%1.2f%%', shadow=True, startangle=90, wedgeprops={'alpha':.5})
axes[1].set_title('Target distribution')
plt.savefig('target_counts.png')

**OBSERVATIONS**:
* The target column is highly unbalanced
* Happy customers (0): 96.04 %
* Unhappy customers (1): 3.96 %
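
With such an imbalance, accuracy alone says very little: a model that labels every customer as happy is right about 96 % of the time without learning anything. The sketch below illustrates this on synthetic labels (not the competition data), using a hypothetical helper `auc_from_scores` that computes AUC-ROC as the probability that a random unhappy customer receives a higher score than a random happy one.

```python
def auc_from_scores(y_true, scores):
    """AUC-ROC as P(score_pos > score_neg), counting ties as 0.5 (illustrative helper)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Synthetic labels with roughly the same 96/4 imbalance as TARGET
y = [0] * 96 + [1] * 4

# "Model" that gives every customer the same low score, i.e. always predicts class 0
constant_scores = [0.0] * len(y)
predictions = [int(s >= 0.5) for s in constant_scores]

accuracy = sum(p == t for p, t in zip(predictions, y)) / len(y)
auc = auc_from_scores(y, constant_scores)

print("Accuracy:", accuracy)  # high, despite the model being useless
print("AUC-ROC:", auc)        # no better than a random ranking
```

This is one reason why a ranking metric such as AUC-ROC is more informative than accuracy for this problem.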

## 1.2. Var columns <a id="section12"></a>

Columns named **varXX are candidates to be relevant features**, since their distinctive name (var) suggests that they might be related to the customer's personal information. In some cases, it seems that they have been transformed through an encoding process (e.g. var3), and hence their original values would be categorical.

In [None]:
train[['var3','var15','var21','var36','var38']].hist(bins=100, figsize=(10, 8), alpha=0.5)
plt.savefig('var_columns_all.png')
plt.show()

* **Var3, var15 & var38**

In [None]:
fig, ax = plt.subplots(1, 3, sharex=False, sharey=False, figsize=(18,4))

train.loc[train.var3.between(5, 300), 'var3'].hist(bins=50, range=(5, 300), ax=ax[0], alpha=0.7)
ax[0].set_title("Var3 distribution")
ax[0].set_ylabel("Count")
ax[0].set_xlabel("var3")
ax[0].set_ylim(0,600)

train.loc[train.var15.between(5, 105), 'var15'].hist(bins=50, range=(5, 105), ax=ax[1], alpha=0.7)
ax[1].set_title("Var15 distribution")
ax[1].set_xlabel("var15")
ax[1].set_ylim(0,27000)

train.var38.hist(bins=100, range=(0, 500000), ax=ax[2], alpha=0.7)
ax[2].set_title("Var38 distribution")
ax[2].set_xlabel("var38")
ax[2].set_ylim(0,18000)

plt.savefig('var3_15_38.png')
plt.show()

In [None]:
print("Var3")
print("Max: ", train['var3'].max())
print("Min: ", train['var3'].min())
print("Unique values: ", train['var3'].nunique())

print("\nVar15")
print("Max: ", train['var15'].max())
print("Min: ", train['var15'].min())
print("Unique values: ", train['var15'].nunique())

print("\nVar38")
print("Max: ", train['var38'].max())
print("Min: ", train['var38'].min())
print("Unique values: ", train['var38'].nunique())

**Observations**: 
* Var3 has 208 unique values, a few of which are much more frequent while the rest appear more uniformly

* Var3 shows ad-hoc values flagged as -999999, suggesting that this variable may be related to a categorical data field. To prevent potential problems due to this large negative value, it is replaced by -0.5 later on (section 3.2)

* **Var3**. From the (not very large) number of distinct values and the fact that there's a manual flag, my guess is that this column is **related to country information**, such as nationality or citizenship

* **Var15** is distributed between 5 and 105 with a decreasing trend. Since the values are "small" numbers, they might be **related to a certain categorization** or count, and not to capital data. My guess is that it's related to the age of the customer, which is consistent with the expected values (quite surprising to find such young customers though!)

* **Var38**. The exponential decay shown above resembles curves from wealth distributions. Moreover, there's a huge peak at 117310.979016 with more than 16000 cases, which is difficult to explain without further details. This variable is probably **related to wealth** or capital in some way
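
A peak like the one in var38 can be quantified by looking at the share of rows taken by the single most frequent value. A minimal sketch on toy data, where `dominant_value` is a hypothetical helper and the numbers are invented for illustration:

```python
import pandas as pd

def dominant_value(series):
    """Return the most frequent value of a Series and the fraction of rows it covers."""
    counts = series.value_counts()
    return counts.index[0], counts.iloc[0] / len(series)

# Toy column: one value repeated many times among otherwise distinct values
toy = pd.Series([117.3] * 6 + [50.0, 80.5, 230.1, 999.9])

value, share = dominant_value(toy)
print(value, share)
```

When one exact value of a continuous variable covers a large share of the rows, imputation (for instance, missing values filled with a summary statistic) is one possible explanation, although it cannot be confirmed from the data alone.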

* **Var21 & var36**

In [None]:
fig, ax = plt.subplots(1, 1, sharex=False, sharey=False, figsize=(7,4))
train.var21.hist(bins=100, range=(0, 30000), ax=ax, alpha=0.7)
ax.set_title("Var21 distribution")
ax.set_xlabel("var21")
ax.set_ylim(0,250)
ax.set_xlim(300,30000)
plt.savefig('var21_36.png')
plt.show()

In [None]:
print("Var21")
print("Max: ", train['var21'].max())
print("Min: ", train['var21'].min())
print("Unique values: ", train['var21'].nunique())
print("List of unique values: ", train['var21'].unique())

print("\nVar36")
print("Max: ", train['var36'].max())
print("Min: ", train['var36'].min())
print("Unique values: ", train['var36'].nunique())
print("List of unique values: ", train['var36'].unique())

**Observations**:
* Both var21 and var36 are comprised of a small number of distinct values
* For **var21**, all values are multiples of 300, which might provide some useful information about the nature of this variable. Despite being quite noisy, the general tendency of the values' frequencies is decreasing
* For **var36** the values range mainly from 0 to 3, while 99 looks like a manual flag to mark specific cases (unknown cases?)

## 1.3. Correlation with TARGET <a id="section13"></a>

Brief analysis of the TOP20 most correlated data columns with target:

In [None]:
# Correlation matrix of relevant features
corr = train.corr()
top20_corr = corr.nlargest(20, 'TARGET')['TARGET']

# Plot top20 correlations
fig, ax = plt.subplots(1, 1, sharex=False, sharey=False, figsize=(7,4))
plt.bar(top20_corr[1:].index.values, top20_corr[1:].values, alpha=0.7)
plt.title("Top 20 most correlated columns with TARGET")
plt.ylabel("Correlation")
plt.xlabel("Features")
plt.xticks(rotation=90)
plt.savefig('correlation_target.png')
plt.show()

**Observations**:
* **Var36** and **var15** are the features most correlated with the target variable, with values around 0.1
* **Ind_var8_0** and **num_var8_0** are also noticeably correlated with the target, with values above 0.045
* There's a drop for the rest of the columns, all of them with correlations below 0.03
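
Note that `nlargest` ranks by signed correlation, so features with a strong negative correlation would not appear in the plot. A hedged alternative is to rank by absolute value while keeping the sign for interpretation. The sketch below uses a toy DataFrame with invented columns `a`, `b` and `c`:

```python
import pandas as pd

toy = pd.DataFrame({
    "a":      [1, 2, 3, 4, 5, 6],   # increases with TARGET
    "b":      [6, 5, 4, 3, 2, 1],   # decreases with TARGET
    "c":      [1, 0, 1, 0, 1, 0],   # weakly related to TARGET
    "TARGET": [0, 0, 0, 1, 1, 1],
})

# Pearson correlation of every feature with TARGET
signed = toy.drop(columns="TARGET").corrwith(toy["TARGET"])

# Rank by magnitude so that strong negative correlations are not lost
ranked = signed.reindex(signed.abs().sort_values(ascending=False).index)
print(ranked)
```

In this toy case `b` is as informative as `a`, yet a ranking by signed correlation would place it last.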

## 1.4. Missing data <a id="section14"></a>

Since we are dealing with such a rich dataset, let's look for missing values:

In [None]:
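# isna().sum().sum() counts missing cells; isna().any().sum() would count columns with missing data
print("Number of missing values in the dataset: ", train.isna().sum().sum())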

**Observations**:
* No missing values. Quite surprising given the number of columns, but good news in the end

# 2. Comparison of classification models <a id="section2"></a>

The brief EDA performed in the previous section has helped us to identify some insights that could be useful in a future feature engineering step. But before feature engineering, we need a baseline model in order to decide which transformations are indeed useful for the prediction purposes.

Let's start by comparing some classification models:

In [None]:
# Train/valid split
def split_dataset(data, split_size):
    y = data['TARGET']
    X = data.drop(['TARGET'], axis=1)
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=split_size, random_state=21)
    return X_train, X_valid, y_train, y_valid

X_train, X_valid, y_train, y_valid = split_dataset(train, 0.2) 

* **Logistic Regression**

In [None]:
ts = time.time()

LR = LogisticRegression()

scoring = ['accuracy', 'precision_macro', 'recall_macro' , 'f1_weighted', 'roc_auc']
scores = cross_validate(LR, X_train, y_train, scoring=scoring, cv=5)

sorted(scores.keys())
LR_fit_time = scores['fit_time'].mean()
LR_score_time = scores['score_time'].mean()
LR_accuracy = scores['test_accuracy'].mean()
LR_precision = scores['test_precision_macro'].mean()
LR_recall = scores['test_recall_macro'].mean()
LR_f1 = scores['test_f1_weighted'].mean()
LR_roc = scores['test_roc_auc'].mean()

print("Time spent: ", time.time()-ts)

* **Support Vector Machine**

In [None]:
ts = time.time()

SVM = SVC(probability = True)

scoring = ['accuracy','precision_macro', 'recall_macro' , 'f1_weighted', 'roc_auc']
scores = cross_validate(SVM, X_train, y_train, scoring=scoring, cv=5)

sorted(scores.keys())
SVM_fit_time = scores['fit_time'].mean()
SVM_score_time = scores['score_time'].mean()
SVM_accuracy = scores['test_accuracy'].mean()
SVM_precision = scores['test_precision_macro'].mean()
SVM_recall = scores['test_recall_macro'].mean()
SVM_f1 = scores['test_f1_weighted'].mean()
SVM_roc = scores['test_roc_auc'].mean()

print("Time spent: ", time.time()-ts)

* **Linear Discriminant Analysis**

In [None]:
ts = time.time()

LDA = LinearDiscriminantAnalysis()

scoring = ['accuracy', 'precision_macro', 'recall_macro' , 'f1_weighted', 'roc_auc']
scores = cross_validate(LDA, X_train, y_train, scoring=scoring, cv=5)

sorted(scores.keys())
LDA_fit_time = scores['fit_time'].mean()
LDA_score_time = scores['score_time'].mean()
LDA_accuracy = scores['test_accuracy'].mean()
LDA_precision = scores['test_precision_macro'].mean()
LDA_recall = scores['test_recall_macro'].mean()
LDA_f1 = scores['test_f1_weighted'].mean()
LDA_roc = scores['test_roc_auc'].mean()

print("Time spent: ", time.time()-ts)

* **Quadratic Discriminant Analysis**

In [None]:
ts = time.time()

QDA = QuadraticDiscriminantAnalysis()

scoring = ['accuracy', 'precision_macro', 'recall_macro' , 'f1_weighted', 'roc_auc']
scores = cross_validate(QDA, X_train, y_train, scoring=scoring, cv=5)

sorted(scores.keys())
QDA_fit_time = scores['fit_time'].mean()
QDA_score_time = scores['score_time'].mean()
QDA_accuracy = scores['test_accuracy'].mean()
QDA_precision = scores['test_precision_macro'].mean()
QDA_recall = scores['test_recall_macro'].mean()
QDA_f1 = scores['test_f1_weighted'].mean()
QDA_roc = scores['test_roc_auc'].mean()

print("Time spent: ", time.time()-ts)

* **Random Forest**

In [None]:
ts = time.time()

random_forest = RandomForestClassifier()

scoring = ['accuracy', 'precision_macro', 'recall_macro' , 'f1_weighted', 'roc_auc']
scores = cross_validate(random_forest, X_train, y_train, scoring=scoring, cv=5)

sorted(scores.keys())
forest_fit_time = scores['fit_time'].mean()
forest_score_time = scores['score_time'].mean()
forest_accuracy = scores['test_accuracy'].mean()
forest_precision = scores['test_precision_macro'].mean()
forest_recall = scores['test_recall_macro'].mean()
forest_f1 = scores['test_f1_weighted'].mean()
forest_roc = scores['test_roc_auc'].mean()

print("Time spent: ", time.time()-ts)

* **K Nearest Neighbors**

In [None]:
ts = time.time()
KNN = KNeighborsClassifier()

scoring = ['accuracy', 'precision_macro', 'recall_macro' , 'f1_weighted', 'roc_auc']
scores = cross_validate(KNN, X_train, y_train, scoring=scoring, cv=5)

sorted(scores.keys())
KNN_fit_time = scores['fit_time'].mean()
KNN_score_time = scores['score_time'].mean()
KNN_accuracy = scores['test_accuracy'].mean()
KNN_precision = scores['test_precision_macro'].mean()
KNN_recall = scores['test_recall_macro'].mean()
KNN_f1 = scores['test_f1_weighted'].mean()
KNN_roc = scores['test_roc_auc'].mean()

print("Time spent: ", time.time()-ts)

* **Naive Bayes (Gaussian)**

In [None]:
ts = time.time()

bayes = GaussianNB()

scoring = ['accuracy', 'precision_macro', 'recall_macro' , 'f1_weighted', 'roc_auc']
scores = cross_validate(bayes, X_train, y_train, scoring=scoring, cv=5)

sorted(scores.keys())
bayes_fit_time = scores['fit_time'].mean()
bayes_score_time = scores['score_time'].mean()
bayes_accuracy = scores['test_accuracy'].mean()
bayes_precision = scores['test_precision_macro'].mean()
bayes_recall = scores['test_recall_macro'].mean()
bayes_f1 = scores['test_f1_weighted'].mean()
bayes_roc = scores['test_roc_auc'].mean()

print("Time spent: ", time.time()-ts)

* **Comparison table**

In [None]:
models_initial = pd.DataFrame({
    'Model'       : ['Logistic Regression', 'Support Vector Machine', 'Linear Discriminant Analysis', 'Quadratic Discriminant Analysis', 'Random Forest', 'K-Nearest Neighbors', 'Bayes'],
    'Fitting time': [LR_fit_time, SVM_fit_time, LDA_fit_time, QDA_fit_time, forest_fit_time, KNN_fit_time, bayes_fit_time],
    'Scoring time': [LR_score_time, SVM_score_time, LDA_score_time, QDA_score_time, forest_score_time, KNN_score_time, bayes_score_time],
    'Accuracy'    : [LR_accuracy, SVM_accuracy, LDA_accuracy, QDA_accuracy, forest_accuracy, KNN_accuracy, bayes_accuracy],
    'Precision'   : [LR_precision, SVM_precision, LDA_precision, QDA_precision, forest_precision, KNN_precision, bayes_precision],
    'Recall'      : [LR_recall, SVM_recall, LDA_recall, QDA_recall, forest_recall, KNN_recall, bayes_recall],
    'F1_score'    : [LR_f1,SVM_f1, LDA_f1, QDA_f1, forest_f1, KNN_f1, bayes_f1],
    'AUC_ROC'     : [LR_roc, SVM_roc, LDA_roc, QDA_roc, forest_roc, KNN_roc, bayes_roc],
    }, columns = ['Model', 'Fitting time', 'Scoring time', 'Accuracy', 'Precision', 'Recall', 'F1_score', 'AUC_ROC'])

models_initial.sort_values(by='AUC_ROC', ascending=False)

**Observations**:
* The metric for this competition is AUC-ROC. The best models under this metric are Linear Discriminant Analysis and Random Forest (RF)
* All accuracies are quite high and above 0.92, except for Bayes and Quadratic Discriminant. The same applies to F1 score
* Given the performance and the possibility of fine-tuning their parameters, we will proceed with tree-based ensemble models such as RF
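
The seven cells above repeat the same extraction of mean metrics from the dictionary returned by `cross_validate`. As a sketch, a small helper (here called `summarize_scores`, a hypothetical name) could build one table row per model. The key names match the `scoring` list used above; the numbers in the example are made up, not results from the dataset.

```python
def summarize_scores(name, scores):
    """Average each metric array returned by cross_validate into a single table row."""
    columns = {
        "Fitting time": "fit_time",
        "Scoring time": "score_time",
        "Accuracy": "test_accuracy",
        "Precision": "test_precision_macro",
        "Recall": "test_recall_macro",
        "F1_score": "test_f1_weighted",
        "AUC_ROC": "test_roc_auc",
    }
    row = {"Model": name}
    for label, key in columns.items():
        values = scores[key]
        row[label] = sum(values) / len(values)
    return row

# Made-up fold scores with the same structure as cross_validate output (2 folds)
fake_scores = {
    "fit_time": [1.0, 3.0], "score_time": [0.2, 0.4],
    "test_accuracy": [0.9, 1.0], "test_precision_macro": [0.5, 0.7],
    "test_recall_macro": [0.5, 0.5], "test_f1_weighted": [0.8, 0.9],
    "test_roc_auc": [0.7, 0.8],
}
row = summarize_scores("Example model", fake_scores)
print(row)
```

Collecting such rows in a list and passing it to `pd.DataFrame` would produce the same kind of comparison table.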

# 3. Model development  <a id="section3"></a>

Tree ensembles such as Random Forest have emerged as the winners of the comparison, so we will use a powerful relative of them: LightGBM (LGB), a gradient boosting framework that also builds an ensemble of decision trees, but trains them sequentially instead of independently. In order to analyze which data transformations are useful to achieve a better classification, we will create one "model-experiment" for each technique, so that the final scores can be properly compared.

## 3.1. Model 1: Baseline LGB <a id="section31"></a>

The first case consists of a **simple LGB model with raw data**, which will serve as a baseline score for the following results.

**Workflow**
* **Split**. Split data into train/validation sets
* **Optimize LGB**. Perform hyperparameter tuning to find optimal LGB parameters
* **Results**. Compute cross-validation AUC scores

In [None]:
# Hyperparametrization of LGB model
def optimize_lgb(X_train, y_train, X_valid, y_valid): 
    
    n_HP_points_to_test = 100

    fit_params={"early_stopping_rounds":30, 
            "eval_metric" : 'auc', 
            "eval_set" : [(X_valid, y_valid)],
            'eval_names': ['valid'],
            'verbose': 0,
            'categorical_feature': 'auto'}
    
    param_test ={'max_depth': [4,5,6,7],
             'num_leaves': sp_randint(6, 50), 
             'min_child_samples': sp_randint(100, 500), 
             'min_child_weight': [1e-5, 1e-3, 1e-2, 1e-1, 1, 1e1, 1e2, 1e3, 1e4],
             'subsample': sp_uniform(loc=0.2, scale=0.8), 
             'colsample_bytree': sp_uniform(loc=0.4, scale=0.6),
             'reg_alpha': [0, 1e-1, 1, 2, 5, 7, 10, 50, 100],
             'reg_lambda': [0, 1e-1, 1, 5, 10, 20, 50, 100]}
    
    clf = lgb.LGBMClassifier(max_depth=-1, random_state=21, silent=True, metric='auc', n_jobs=4, n_estimators=1000)
    
    gs = RandomizedSearchCV(
            estimator=clf, param_distributions=param_test, 
            n_iter=n_HP_points_to_test,
            scoring='roc_auc',
            cv=5,
            refit=True,
            random_state=21,
            verbose=0)
    
    gs.fit(X_train, y_train, **fit_params)
    print('Best score reached: {} with params: {} '.format(gs.best_score_, gs.best_params_))
    
    return gs, gs.best_score_, gs.best_params_, fit_params


# Raw data
X_train, X_valid, y_train, y_valid = split_dataset(train, 0.2)
clf_1, best_score_1, optimal_params_1, fit_params = optimize_lgb(X_train, y_train, X_valid, y_valid)

**Results**:
* Best ROC-AUC score reached: 0.8399974507944336
* Optimal parameters: {'colsample_bytree': 0.6641632439141401, 'max_depth': 5, 'min_child_samples': 224, 'min_child_weight': 0.1, 'num_leaves': 28, 'reg_alpha': 2, 'reg_lambda': 10, 'subsample': 0.9244622777209361} 
* Features at this point: 371
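
A detail of the search space worth keeping in mind: in `scipy.stats`, `uniform(loc, scale)` samples from the interval `[loc, loc + scale]`, and `randint(low, high)` samples integers from `low` up to `high - 1`. So `sp_uniform(loc=0.2, scale=0.8)` keeps `subsample` within `[0.2, 1.0]`. A quick check, with the sample size and seed chosen arbitrarily:

```python
from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_uniform

# Same distributions as in param_test
subsample_dist = sp_uniform(loc=0.2, scale=0.8)
num_leaves_dist = sp_randint(6, 50)

subsample_draws = subsample_dist.rvs(size=1000, random_state=21)
num_leaves_draws = num_leaves_dist.rvs(size=1000, random_state=21)

print("subsample range: ", subsample_draws.min(), subsample_draws.max())
print("num_leaves range:", num_leaves_draws.min(), num_leaves_draws.max())
```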

## 3.2. Model 2: Remove duplicated and constant columns <a id="section32"></a>

The previous prediction has run over 370 features. A lot of these data columns are constant or quasi-constant features, which provide no real information to classify customers into happy/unhappy. Let's get rid of them and compare the results.

**Workflow**
* **Small fixes**. Replace -999999 by -0.5 in var3 and drop ID column
* **Clean duplicates**. Remove duplicated data columns
* **Clean constant**. Remove features in which a single value covers more than a threshold percentage of the rows
* Split. Split data into train/validation sets
* Optimize LGB. Perform hyperparameter tuning to find optimal LGB parameters
* Results. Compute cross validation AUC scores
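
The duplicate-removal step compares every pair of columns. As an alternative sketch (not the approach used in this notebook), pandas can flag repeated columns by transposing the frame and calling `duplicated`, which keeps the first occurrence. The toy frame and its column names are invented; on wide real datasets the transpose may be memory-hungry.

```python
import pandas as pd

toy = pd.DataFrame({
    "x":      [1, 2, 3, 4],
    "x_copy": [1, 2, 3, 4],   # exact duplicate of x
    "y":      [0, 0, 1, 0],
    "z":      [0, 0, 1, 0],   # exact duplicate of y
})

# Columns become rows after transposing, so duplicated() marks repeated columns
is_duplicate = toy.T.duplicated()
deduplicated = toy.loc[:, ~is_duplicate.values]

print(list(deduplicated.columns))
```

The same boolean mask would have to be applied to the test set so that both frames keep identical columns.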

In [None]:
def clean_duplicates(df1, df2):
    remove = []
    cols = df1.columns
    for i in range(len(cols)-1):
        v = df1[cols[i]].values
        for j in range(i+1,len(cols)):
            if np.array_equal(v,df1[cols[j]].values):
                remove.append(cols[j])
    df1.drop(remove, axis=1, inplace=True)
    df2.drop(remove, axis=1, inplace=True)
    return df1, df2


def clean_constant_columns(df1, df2, threshold):
    constant_cols = []
    for i in df1.columns:
        counts = df1[i].value_counts()
        zeros = counts.iloc[0]
        if zeros / len(df1) * 100 > threshold:
            constant_cols.append(i)
    df1 = df1.drop(constant_cols, axis=1)
    df2 = df2.drop(constant_cols, axis=1)
    return df1, df2


# Duplicates of train/test for modification purposes
train_df = train.copy()
test_df = test.copy()

# Replace -999999 by -0.5 in var3 (plain assignment avoids chained in-place modification)
train_df['var3'] = train_df['var3'].replace(-999999, -0.5)
test_df['var3'] = test_df['var3'].replace(-999999, -0.5)

# Drop ID column
train_id = train_df['ID']
test_id = test['ID']
train_df.drop('ID', axis=1, inplace=True)
test_df.drop('ID', axis=1, inplace=True)

# Irrelevant columns cleaning 
train_df, test_df = clean_duplicates(train_df, test_df)
train_df, test_df = clean_constant_columns(train_df, test_df, 99.9)

# Split train dataset and find the best parameters for LGB
X_train, X_valid, y_train, y_valid = split_dataset(train_df, 0.20)
clf_2, best_score_2, optimal_params_2, fit_params = optimize_lgb(X_train, y_train, X_valid, y_valid)

**Results**:
* Classification is now slightly better, confirming that removing quasi-constant features has reduced noise and helped the model's predictions
* Best ROC-AUC score reached: 0.8404171483835411
* Optimal parameters: {'colsample_bytree': 0.652955902411274, 'max_depth': 6, 'min_child_samples': 268, 'min_child_weight': 0.1, 'num_leaves': 37, 'reg_alpha': 2, 'reg_lambda': 10, 'subsample': 0.9251646597209053} 
* Features at this point: 215

## 3.3. Model 3: Transform skewed data <a id="section33"></a>

So far so good, we have achieved a considerable AUC-ROC. In this experiment we will go one step further, putting our attention on data distributions and particularly on features with a high skewness.

In order to normalize a skewed variable $x$, it is common to apply a transformation such as $x^2$ or $\log(x)$. We will opt for a slightly modified version of the second transformation, $\log(x+1)$, since it's able to deal both with positive values and with the additional range $(-1, 0]$.

**Workflow**
* Small fixes. Replace -999999 by -0.5 in var3 and drop ID column
* Clean duplicates. Remove duplicated data columns
* Clean constant. Remove features in which a single value covers more than a threshold percentage of the rows
* Split. Split data into train/validation sets
* **Normality test**. Check for highly skewed columns, and apply a normalization transformation (log1p) when required
* Optimize LGB. Perform hyperparameter tuning to find optimal LGB parameters
* Results. Compute cross validation AUC scores

**Note**: tree-based models are invariant to monotonic transformations of the features, since splits depend only on the ordering of the values, so *a priori* the log transformation should not have any impact on the predictions. However, since the objective of this notebook is to provide useful feature engineering techniques and given that we could end up using additional models not based on trees, the transformation of skewed features may prove useful.
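
The reason behind this is that a tree split only asks whether a value falls below a threshold, and a strictly increasing transformation such as `log1p` preserves the ordering of the values. A small numerical illustration with made-up numbers:

```python
import numpy as np

# Made-up, heavily right-skewed feature values
x = np.array([0.0, 3.0, 12.0, 150.0, 90000.0, 7.0])
x_log = np.log1p(x)

# The ordering of the rows is unchanged by the transformation
same_order = np.array_equal(np.argsort(x), np.argsort(x_log))

# Any split "x <= t" selects the same rows as "log1p(x) <= log1p(t)"
t = 50.0
same_split = np.array_equal(x <= t, x_log <= np.log1p(t))

print(same_order, same_split)
```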

In [None]:
def fix_skewness(df1, df2, nunique, max_skew):
    numeric_cols = [cname for cname in df1.columns if df1[cname].dtype in ['int64', 'float64']]
    skewed_feats = df1[numeric_cols].apply(lambda x: stats.skew(x.dropna())).sort_values(ascending=False)

    # Apply log1p to all columns with >nunique values, |skewness|>max_skew and x>-0.99
    log_col = []
    for col in skewed_feats.index:
        if(df1[col].nunique()>nunique):
            if(abs(skewed_feats[col])>max_skew): 
                if(df1[col].min()>=-0.99):
                    log_col.append(col)
                    df1[col]=df1[col].apply(lambda x: np.log1p(x))
                    df2[col]=df2[col].apply(lambda x: np.log1p(x))
    return df1, df2, log_col


def var38_flag(df1, df2):
    df1['var38_flag'], df2['var38_flag'] = 0, 0
    var38_mode = df1.var38.mode()
    df1.loc[df1['var38']==var38_mode[0], ['var38', 'var38_flag']] = 0, 1
    df2.loc[df2['var38']==var38_mode[0], ['var38', 'var38_flag']] = 0, 1
    return df1, df2
    
    
ts = time.time()

train_df_skw, test_df_skw = train_df.copy(), test_df.copy()

# Transform skewed features (log1p), create flag for var38 mode
train_df_skw, test_df_skw = var38_flag(train_df_skw, test_df_skw)
train_df_skw, test_df_skw, cols_skw = fix_skewness(train_df_skw, test_df_skw, 50, 0.7)

# Split train dataset and find the best parameters for LGB
X_train_skw, X_valid_skw, y_train_skw, y_valid_skw = split_dataset(train_df_skw, 0.20)
clf_3, best_score_3, optimal_params_3, fit_params = optimize_lgb(X_train_skw, y_train_skw, X_valid_skw, y_valid_skw)

print("Time spent: ", time.time()-ts)

**Results**:
* A very slight increase in the score, which can't be interpreted as a real improvement, since it lies within the variation expected from the random components of the search and the model. However, normalizing data distributions does not worsen the results
* Best ROC-AUC score reached: 0.8405401877854473
* Optimal parameters: {'colsample_bytree': 0.5041066856718041, 'max_depth': 6, 'min_child_samples': 215, 'min_child_weight': 0.01, 'num_leaves': 47, 'reg_alpha': 2, 'reg_lambda': 5, 'subsample': 0.7631144227290101} 
* Features at this point: 214

# 4. Model explainability <a id="section4"></a>

Obtaining a model with the best score under a certain metric is important to some extent, since it's the main objective of a data prediction project. However, from a business perspective, **a black-box algorithm that is unable to provide any additional value or explain how the different features affect the predictions might have little value**, depending on the goals of the project.

In this case, Banco Santander could be interested in additional information, like which features have more impact on the algorithm's results or how exactly they affect them. Understanding the features usually helps to identify how to improve the business objectives, eventually leading to recommendations that might end up in new projects.

## 4.1. Feature importance <a id="section41"></a>

The first post-processing task we will tackle is feature importance. Based on our best performing model, we will extract two feature importance measures:

1. **Decision tree feature importance**: importance scores based on the reduction in the criterion used to select split points in a Decision tree algorithm, like Gini or entropy
2. **Permutation importance**: by exchanging values randomly on each feature, a score is computed that tracks the impact on the predictions
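
To make the second measure concrete, here is a from-scratch sketch of permutation importance on synthetic data. It is not the eli5 implementation: the "model" is a fixed hypothetical rule (`toy_predict`) that only looks at the first feature, and accuracy is used as the score for simplicity.

```python
import numpy as np

rng = np.random.default_rng(21)

# Synthetic data: feature 0 determines the label, feature 1 is pure noise
X = rng.normal(size=(500, 2))
y = (X[:, 0] > 0).astype(int)

def toy_predict(X):
    """Hypothetical stand-in for a fitted model: uses feature 0 only."""
    return (X[:, 0] > 0).astype(int)

def permutation_importance(predict, X, y, rng):
    """Drop in accuracy after shuffling each column, one at a time."""
    baseline = np.mean(predict(X) == y)
    importances = []
    for col in range(X.shape[1]):
        X_perm = X.copy()
        X_perm[:, col] = rng.permutation(X_perm[:, col])
        importances.append(baseline - np.mean(predict(X_perm) == y))
    return np.array(importances)

importances = permutation_importance(toy_predict, X, y, rng)
print(importances)
```

Shuffling the feature the rule depends on destroys most of its accuracy, while shuffling the noise feature changes nothing, so its importance is zero.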

In [None]:
clf_lgb_1 = lgb.LGBMClassifier(   colsample_bytree= 0.5041066856718041, 
                                max_depth= 6, 
                                min_child_samples= 215, 
                                min_child_weight= 0.01, 
                                num_leaves= 47, 
                                reg_alpha= 2, 
                                reg_lambda= 5, 
                                subsample= 0.7631144227290101, 
                                random_state=21, 
                                silent=True, 
                                metric='auc', 
                                n_jobs=4, 
                                n_estimators=1000)
clf_lgb_1.fit(X_train, y_train)

# Feature importance
feature_imp = pd.DataFrame(sorted(zip(clf_lgb_1.feature_importances_,X_train.columns)), columns=['Value','Feature']).sort_values(by=['Value'], ascending=False)[:50]
plt.figure(figsize=(20, 10))
sns.barplot(x="Value", y="Feature", data=feature_imp.sort_values(by="Value", ascending=False))
plt.title('LightGB feature importance')
plt.tight_layout()
plt.savefig('lgb_feature_importance.png')
plt.show()

In [None]:
perm = PermutationImportance(clf_lgb_1, random_state=21).fit(X_valid, y_valid)
eli5.show_weights(perm, feature_names = X_valid.columns.tolist())

**Observations**:

* **Var columns**. Most of the var columns identified as relevant in the EDA section (var38, var15 & var36) are present in the LGB feature importance top. In fact, the top 2 most important features are var38 and var15
* **Saldo_medio_var5**. Features related to *saldo_medio_var5_* have a high importance value
* **Saldo_var**. The group including saldo_var30, saldo_var42, saldo_var5 & saldo_var37 has a high importance value
* **Num_var45**. Features related to *num_var45_* have a high importance value
* **Num_var22**. Features related to *num_var22_* have a medium-high importance value

## 4.2. SHAP values <a id="section42"></a>

SHAP values (SHapley Additive exPlanations) interpret the impact of having a certain value for a given feature in comparison to the prediction we'd make if that feature took some baseline value. SHAP values add explainability to the model to some extent, and provide useful tools to analyze the dependence contribution of each feature.
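
The idea can be made concrete with a brute-force computation of exact Shapley values for a tiny made-up function. This is only a sketch of the underlying definition: `shap.TreeExplainer` uses a much faster algorithm specialised for trees, and the function `f`, the instance `x` and the all-zero baseline are assumptions chosen for illustration.

```python
from itertools import combinations
from math import factorial

def f(z):
    """Toy model with one additive term and one interaction term."""
    return 2 * z[0] + z[1] * z[2]

def shapley_values(f, x, baseline):
    """Exact Shapley values: features outside the coalition take the baseline value."""
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return f(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                total += weight * (value(set(subset) | {i}) - value(set(subset)))
        phi.append(total)
    return phi

x, baseline = [1, 2, 3], [0, 0, 0]
phi = shapley_values(f, x, baseline)
print(phi)  # contributions sum to f(x) - f(baseline)
```

The contribution of the interaction term `z[1] * z[2]` is split equally between the two features involved, and all contributions add up to the difference between the prediction and the baseline prediction.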

In [None]:
X_train_red = X_train
X_valid_red = X_valid

clf_lgb_1.fit(X_train_red, y_train)

explainer = shap.TreeExplainer(clf_lgb_1)
shap_values = explainer.shap_values(X_valid_red)
shap.summary_plot(shap_values[1], X_valid_red)
plt.savefig('shap_summary.png')

**Observations**:
* Of the top 20 features with the most SHAP value impact, 3 are var features (var15, var38 & var36)
* **Polarized effects features**. Some features are clearly polarized, so that high feature values produce a positive impact on SHAP, while low values have a negative impact. These features are candidates to be relevant features given their SHAP behavior, since they may contain interesting dependencies with the target. Examples: num_meses_var5_ult3, saldo_var5, var36
* **Indistinguishable effects features**. There are several cases in which high feature values have both positive and negative impacts on SHAP values. Apparently, changing the value of these features produces an outlier-like behavior. Examples: num_var22_ult3, saldo_var42, num_var45_hace3.
* **Mixed effects features**. Other cases present a rich scenario, in which high/low feature values have no clear regime of impact in SHAP values. Some of them are polarized to some extent with a certain transition regime, while others are more mixed. Examples: var15, saldo_medio_var5_hace3, var38, num_var22_hace3

### 4.2.1. Var columns <a id="section421"></a>

In [None]:
shap.dependence_plot('var38', shap_values[1], X_valid_red)
plt.savefig('shap_var38.png')
shap.dependence_plot('var15', shap_values[1], X_valid_red)
plt.savefig('shap_var15.png')
shap.dependence_plot('var36', shap_values[1], X_valid_red)
plt.savefig('shap_var36.png')
shap.dependence_plot('var3', shap_values[1], X_valid_red)
plt.savefig('shap_var3.png')

**Observations**:

* **Var38 (wealth)**. Low var38 values have indistinguishable effects on SHAP, while high values don't seem to have any effect. No clear dependence.
* **Var15 (age)**. The younger the customer (low var15), the more negative the effect on SHAP. Since we are focusing on the probability of being unhappy (target=1), this means that young customers tend to be happier. Customers between 40 and 80 years old have more complaints, and more diverse opinions.
* **Var36**. When var36=0, SHAP values tend to be negative and hence customers are happier. For higher values, and in particular for var36=99, the customer satisfaction drops slightly (recall that high SHAP values are related to a high probability of target=1).
* **Var3 (country)**. Looks like modifying the country variable has a negative impact on SHAP. We can venture a guess that the lowest value of var3 is the most common country of the bank's customers (Spain), and that foreign customers are less prone to complain. This could be a reasonable assumption, but business expertise would be required to confirm this behavior. Additionally, it looks like customers from uncommon countries are not only happier, but also generally older.

### 4.2.2. Saldo_var5 <a id="section422"></a>

In [None]:
shap.dependence_plot('saldo_var5', shap_values[1], X_valid_red)
plt.savefig('shap_saldo_var5.png')
shap.dependence_plot('saldo_medio_var5_hace3', shap_values[1], X_valid_red)
plt.savefig('shap_saldo_medio_var5_hace3.png')
shap.dependence_plot('saldo_medio_var5_ult3', shap_values[1], X_valid_red)
plt.savefig('shap_saldo_medio_var5_ult3.png')

**Observations**:

* **Saldo_var5**. There's an increasing tendency in saldo_var5 and its respective SHAP value. Looks like higher values might be related to less satisfied customers, but the effect is small. Moreover, there's an obvious dependency between saldo_var5 and its historically related variable saldo_medio_var5_ult3.
* **Saldo_medio_var5**. These features don't seem to have a strong impact on SHAP values, but a slight negative effect for high values. 

### 4.2.3. Num_var30 & saldo_var30 <a id="section423"></a>

In [None]:
shap.dependence_plot('num_var30', shap_values[1], X_valid_red)
plt.savefig('shap_num_var30.png')
shap.dependence_plot('saldo_var30', shap_values[1], X_valid_red)
plt.savefig('shap_saldo_var30.png')

**Observations**: 

* No big effects in var30, but a general tendency to have higher satisfaction probabilities for large values of num_var30 and saldo_var30.

### 4.2.4. Num_var45 <a id="section424"></a>

In [None]:
shap.dependence_plot('num_var45_hace2', shap_values[1], X_valid_red)
plt.savefig('shap_num_var45_hace2.png')
shap.dependence_plot('num_var45_hace3', shap_values[1], X_valid_red)
plt.savefig('shap_num_var45_hace3.png')

**Observations**:

* **Num_var45_hace2**. Lower values tend to be related to higher satisfaction probabilities, but the effect is very small.
* **Num_var45_hace3**. Higher values of the feature tend to provide negative SHAP values, and hence more probability for a happy customer. 

# 5. Submission <a id="section5"></a>

In [None]:
clf_lgb = lgb.LGBMClassifier(   colsample_bytree= 0.6641632439141401, 
                                max_depth= 5, 
                                min_child_samples= 224, 
                                min_child_weight= 0.1, 
                                num_leaves= 28, 
                                reg_alpha= 2, 
                                reg_lambda= 10, 
                                subsample= 0.9244622777209361,  
                                random_state=21,
                                silent=True, 
                                metric='auc', 
                                n_jobs=4, 
                                n_estimators=1000)


clf_lgb.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], early_stopping_rounds=10)
test_df = test_df[X_train.columns]
probs = clf_lgb.predict_proba(test_df)

submission = pd.DataFrame({"ID":test_id, "TARGET": probs[:,1]})
submission.to_csv("submission.csv", index=False)