## Santander Customer Satisfaction
#### Which customers are happy customers?

From frontline support teams to C-suites, customer satisfaction is a key measure of success. Unhappy customers don't stick around. What's more, unhappy customers rarely voice their dissatisfaction before leaving.

Santander Bank is asking Kagglers to help them identify dissatisfied customers early in their relationship. Doing so would allow Santander to take proactive steps to improve a customer's happiness before it's too late.

In this competition, you'll work with hundreds of anonymized features to predict if a customer is satisfied or dissatisfied with their banking experience.

In [None]:
# Import the packages used for reading the data and plotting graphs
import re
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


# Show every column when displaying a DataFrame (no column limit)
pd.set_option('display.max_columns', None)

In [None]:
santander_treino = pd.read_csv(r'../input/santander-customer-satisfaction/train.csv')
santander_teste = pd.read_csv(r'../input/santander-customer-satisfaction/test.csv')
# Drop ID: it is only a reference variable and is not used in the analysis.
santander_treino = santander_treino.drop('ID', axis = 1)
santander_teste = santander_teste.drop('ID', axis = 1)

In [None]:
# The data is imbalanced: satisfied customers (TARGET = 0) dominate the training set.
# To keep the model from learning only the patterns of that majority class,
# we apply balancing techniques (oversampling and undersampling).

ax = sns.countplot(x="TARGET", data=santander_treino)
plt.title("Target variable distribution")
plt.xlabel("Target variable")
plt.ylabel("Count")
plt.show()
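
The plot shows the imbalance visually; the same information can be read as numbers. The sketch below uses a toy `TARGET` column with made-up counts (an assumption for illustration, not the real data) to show how the class shares and the majority/minority ratio could be computed.

```python
import pandas as pd

# Toy stand-in for the TARGET column (hypothetical counts, not the real data)
toy = pd.DataFrame({"TARGET": [0] * 96 + [1] * 4})

# Share of each class, as a fraction of all rows
class_share = toy["TARGET"].value_counts(normalize=True)

# How many majority rows there are for each minority row
imbalance_ratio = class_share.max() / class_share.min()

print(class_share)
print("majority/minority ratio:", imbalance_ratio)
```

On the real data, the same two lines would run on `santander_treino["TARGET"]`.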

In [None]:
santander_treino.head(10)

In [None]:
# Check for missing values in the training set
santander_treino.isnull().values.any()
# The result is False, so there are no missing values.

## Feature Selection and Data Processing

In [None]:
# Feature selection packages
from sklearn.feature_selection import VarianceThreshold


def removeUnvariable(X, cols):
    """
    Returns the columns of X whose variance is not 0, with 'TARGET' appended last.
    """

    # Build a new list so the caller's list is not modified.
    cols = [c for c in cols if c != 'TARGET']
    # Keep only columns that vary: a constant column gives the model nothing to generalize from.
    colsVariance = [c for c in cols if X[c].var() != 0]
    colsVariance.append('TARGET')
    return colsVariance

In [None]:
# Work on a copy of the training set so the original stays untouched.
santander_treino_limpo = santander_treino.copy()

# List the training-set columns (TARGET is still included at this point)
cols = list(santander_treino_limpo.columns)

# Removing variables that have no variation
cols = removeUnvariable(santander_treino_limpo, cols)
santander_treino_limpo = santander_treino_limpo[cols]

# Drop the "ind" columns: they are categorical and show little variation between categories.

cols_without_ind = [i for i in cols if not re.match(r'ind', i)]
cols_with_ind = [i for i in cols if re.match(r'ind', i)]

# Optional check: share of the most frequent category in each "ind" column
#for i in cols_with_ind:
#    print(santander_treino_limpo[i].value_counts(normalize = True).max())

santander_treino_limpo = santander_treino_limpo[cols_without_ind]

# Variance selector: drop low-variance features (threshold 0.02)
variance_seletor = VarianceThreshold(threshold=0.02)
variance_seletor.fit(santander_treino_limpo)
colunas_const = variance_seletor.get_support()  # True for the columns that are kept

list_columns = list(santander_treino_limpo.columns)
list_columns.remove('TARGET')

# Drop every feature the selector rejected; TARGET is never dropped.
colunas_mantidas = set(santander_treino_limpo.columns[colunas_const])
columns_drop = [col for col in list_columns if col not in colunas_mantidas]
santander_treino_limpo = santander_treino_limpo.drop(columns=columns_drop)
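
To make the role of `get_support()` concrete, here is a minimal sketch on a toy DataFrame. The column names and values are invented for illustration; only `VarianceThreshold` and the 0.02 threshold come from the cell above.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy features (hypothetical): one constant, one nearly constant, one that varies
toy = pd.DataFrame({
    "constant": [1, 1, 1, 1, 1, 1],
    "almost_constant": [0, 0, 0, 0, 0, 0.1],
    "varying": [0, 1, 0, 1, 0, 1],
})

selector = VarianceThreshold(threshold=0.02)
selector.fit(toy)

# get_support() returns a boolean mask aligned with the input columns:
# True means the column passed the variance threshold.
kept = list(toy.columns[selector.get_support()])
print(kept)
```

Indexing `toy.columns` with the mask keeps the column names, which are lost if the selector's `transform` output is used directly as a NumPy array.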

In [None]:
# Correlation between each feature and the target: apparently few features are strongly correlated with it, positively or negatively
cols = list(santander_treino_limpo.columns)

correlations_list = [
    santander_treino_limpo['TARGET'].corr(santander_treino_limpo[i])
    for i in cols if i != 'TARGET'  # skip TARGET's correlation with itself (always 1)
]

df_correlations = pd.DataFrame(correlations_list, columns = ['correlations'])
df_correlations['correlations'].hist(bins = 10)
plt.show()
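
The histogram shows the overall spread of correlations but not which features they belong to. As a possible complement, the sketch below ranks features by absolute correlation with the target using `DataFrame.corrwith`. The toy columns `feat_a` and `feat_b` are assumptions made for the example.

```python
import pandas as pd

# Toy data (hypothetical): feat_b is the mirror image of feat_a
toy = pd.DataFrame({
    "feat_a": [1, 2, 3, 4, 5, 6],
    "feat_b": [6, 5, 4, 3, 2, 1],
    "TARGET": [0, 0, 0, 1, 1, 1],
})

features = toy.drop(columns="TARGET")

# Pearson correlation of every feature column with the target
corr_with_target = features.corrwith(toy["TARGET"])

# Rank by strength, ignoring the sign
ranked = corr_with_target.abs().sort_values(ascending=False)
print(corr_with_target)
print(ranked)
```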

In [None]:
# Apply the same column selection to the test set
cols_test = [c for c in cols if c != 'TARGET']  # new list, so that cols is not modified
santander_teste = santander_teste[cols_test]

## Modeling using Undersampling and Oversampling

In [None]:
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import classification_report
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import roc_auc_score

In [None]:
X = santander_treino_limpo.drop('TARGET', axis = 1)
y = santander_treino_limpo.TARGET.values

In [None]:
under = RandomUnderSampler(sampling_strategy=0.2)
X_under, y_under = under.fit_resample(X,y)

# create train test split
X_train, X_test, y_train, y_test = train_test_split(X_under, y_under, test_size=0.3, random_state=0)  

# Rescale every feature to the [0, 1] range so that no variable dominates the others because of its scale.

scaler = MinMaxScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
santander_teste_scaler = scaler.transform(santander_teste)


oversample = SMOTE(random_state = 2020)
X_sm_train, y_sm_train = oversample.fit_resample(X_train_scaled, y_train)
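
The cell above undersamples before splitting, so the held-out part is also drawn from the rebalanced data. One alternative ordering, sketched here with plain NumPy on synthetic labels (all names and sizes are assumptions, and this is not what the notebook does), is to split first and resample only the training part, so the held-out rows keep the original class ratio.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical labels: 950 majority (0) and 50 minority (1) rows
y = np.array([0] * 950 + [1] * 50)
X = rng.normal(size=(y.size, 3))

# 1) Hold out 30% first, so the held-out rows keep the original class ratio.
idx = rng.permutation(y.size)
n_test = int(0.3 * y.size)
test_idx, train_idx = idx[:n_test], idx[n_test:]

# 2) Undersample the majority class inside the training part only:
#    keep 5 majority rows per minority row (minority/majority = 0.2).
train_min = train_idx[y[train_idx] == 1]
train_maj = train_idx[y[train_idx] == 0]
n_maj_keep = min(train_maj.size, 5 * train_min.size)
kept_maj = rng.choice(train_maj, size=n_maj_keep, replace=False)
train_bal_idx = np.concatenate([train_min, kept_maj])

X_train_bal, y_train_bal = X[train_bal_idx], y[train_bal_idx]
X_holdout, y_holdout = X[test_idx], y[test_idx]

print("train class counts:", np.bincount(y_train_bal))
print("holdout class counts:", np.bincount(y_holdout))
```

With this ordering, metrics computed on the held-out rows reflect the class ratio the model would face on unseen customers.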

In [None]:
# Feature reduction with PCA
pca = PCA(n_components = 10)
pca.fit(X_sm_train)

X_treino_pca = pca.transform(X_sm_train)
X_teste_pca = pca.transform(X_test_scaled)
santander_teste_pca = pca.transform(santander_teste_scaler)

In [None]:
# Fraction of the total variance retained by the 10 components
sum(pca.explained_variance_ratio_)
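
`explained_variance_` is expressed in the units of the data, while `explained_variance_ratio_` gives each component's share of the total variance, which is usually easier to read when choosing `n_components`. The sketch below runs on synthetic data (shapes and scales are assumptions) and shows the cumulative share per component.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): 6 independent features with decreasing spread
X_toy = rng.normal(size=(200, 6)) * np.array([5.0, 3.0, 1.0, 0.5, 0.2, 0.1])

pca_toy = PCA(n_components=3).fit(X_toy)

# Cumulative share of the total variance captured by the first k components
cumulative = np.cumsum(pca_toy.explained_variance_ratio_)
print(cumulative)
```

The last entry of `cumulative` is the fraction of variance kept after the projection; a low value would suggest that 10 components discard much of the signal.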

In [None]:
model = XGBClassifier()

In [None]:
iterations = 7

# The estimator is a bare XGBClassifier (not a Pipeline), so parameter names take no "xgbclassifier__" prefix.
param_grid = {
 "learning_rate"    : [0.05, 0.08, 0.1] ,
 "max_depth"        : [5, 7, 10, 13],
 "gamma"            : [ 0.0, 0.05, 0.08, 0.1]
}

XGBoostRandomSCV = RandomizedSearchCV(estimator = model, 
                                        param_distributions = param_grid, 
                                        cv = 5, verbose=1, 
                                        n_jobs = -1, 
                                        scoring = 'roc_auc', n_iter = iterations)

XGBoostRandomSCV.fit(X_treino_pca, y_sm_train)
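
The keys of a search space have to match the names returned by the estimator's `get_params()`. The `step__parameter` form applies when the estimator is a `Pipeline`; a bare estimator uses the plain parameter name. The sketch below illustrates the difference with `LogisticRegression` as a stand-in for `XGBClassifier` (an assumption, chosen only so that the example depends on scikit-learn alone).

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Stand-in estimator (hypothetical choice): any scikit-learn estimator behaves the same way
bare = LogisticRegression()
pipe = make_pipeline(MinMaxScaler(), LogisticRegression())

bare_keys = set(bare.get_params())
pipe_keys = set(pipe.get_params())

# A bare estimator exposes plain names; a pipeline exposes "<step>__<param>" names.
print("C" in bare_keys, "logisticregression__C" in bare_keys)
print("C" in pipe_keys, "logisticregression__C" in pipe_keys)
```

Printing `sorted(model.get_params())` before defining the search space is a quick way to check which names the search will accept.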

In [None]:
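y_pred = XGBoostRandomSCV.best_estimator_.predict(X_teste_pca)
# ROC AUC is computed from the predicted probability of the positive class, not from hard labels.
y_proba = XGBoostRandomSCV.best_estimator_.predict_proba(X_teste_pca)[:, 1]
print("AUC Score is", roc_auc_score(y_test, y_proba))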
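
ROC AUC measures how well the scores rank positives above negatives, so it is sensitive to whether it receives probabilities or thresholded labels. A minimal illustration with four made-up observations (the values are assumptions, not model output):

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.8]             # hypothetical predicted probabilities
labels = [int(s >= 0.5) for s in scores]  # hard labels at a 0.5 threshold

# With probabilities, both positives rank above both negatives.
auc_scores = roc_auc_score(y_true, scores)

# With hard labels, one positive ties with the negatives and the ranking is partly lost.
auc_labels = roc_auc_score(y_true, labels)

print(auc_scores, auc_labels)
```

The probabilities give an AUC of 1.0, while the thresholded labels give 0.75 for the same underlying predictions.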

In [None]:
print("Accuracy Score is", XGBoostRandomSCV.best_estimator_.score(X_teste_pca, y_test))

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
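sample_table = pd.read_csv('../input/santander-customer-satisfaction/sample_submission.csv')
# The competition is scored on ROC AUC, so submit the predicted probability of TARGET = 1 rather than hard labels.
prediction = XGBoostRandomSCV.best_estimator_.predict_proba(santander_teste_pca)[:, 1]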

In [None]:
sample_table['TARGET'] = prediction

In [None]:
sample_table.to_csv('submission.csv', index = False)