# Kaggle Competition: Santander Customer Satisfaction

Contents
1. Abstract
2. Introduction
3. Data Exploration
4. Features
5. Modeling
6. Concluding Thoughts

## 1. Abstract
This Kaggle competition asks participants to use the train and test datasets provided by Santander Bank to predict which customers in the test data are unhappy with their experience, based on what can be learned from the train data.

## 2. Introduction
From the competition page:
> "From frontline support teams to C-suites, customer satisfaction is a key measure of success. Unhappy customers don't stick around. What's more, unhappy customers rarely voice their dissatisfaction before leaving.

> "Santander Bank is asking Kagglers to help them identify dissatisfied customers early in their relationship. Doing so would allow Santander to take proactive steps to improve a customer's happiness before it's too late.

> "In this competition, you'll work with hundreds of anonymized features to predict if a customer is satisfied or dissatisfied with their banking experience."

Santander provides the data for competing participants in two .csv files, plus an additional file showing the expected submission format. Not having to dig for the data ourselves is convenient, but there is a caveat: Santander provides no context, let alone a data dictionary or codebook, to accompany the data. Participants have only the feature names and the data itself to work with (and, to some degree, one another, despite this being a competition).

## 3. Data Exploration
The files provided are '**train.csv**' and '**test.csv**', with a third file, sample_submission.csv, demonstrating the required format for submitted predictions. The file 'train.csv' is for training the participant's model, while 'test.csv' holds the observations for which the model must predict responses.

In [7]:
# import all the things
%matplotlib inline
import pandas as pd
import numpy as np

In [11]:
santrainder = pd.read_csv('train.csv')
santestder = pd.read_csv('test.csv')
print(santrainder.shape)
print(santestder.shape)

(76020, 371)
(75818, 370)


Both **train.csv** and **test.csv** are about the same size and shape, with **train.csv** having 76,020 observations and **test.csv** having 75,818. The two files share 370 columns to sift through (the ID column plus the features), and the additional 371st column in train.csv is the response vector.

For this project we focus mostly on **train.csv**, since it is the file we use to train our model. The two datasets are otherwise similar at a glance; the key difference is that **test.csv** has no **response vector** (test.csv's response vector is what participants have to predict and submit for the competition).
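The claim that the two files differ only in the response column can be checked from the column labels alone. The following is a minimal sketch: `compare_columns` is a hypothetical helper, and the short lists are toy stand-ins for `list(santrainder.columns)` and `list(santestder.columns)`.

```python
def compare_columns(train_cols, test_cols):
    """Return (only_in_train, only_in_test), preserving the original order."""
    train_set, test_set = set(train_cols), set(test_cols)
    only_in_train = [c for c in train_cols if c not in test_set]
    only_in_test = [c for c in test_cols if c not in train_set]
    return only_in_train, only_in_test


# Toy stand-ins for list(santrainder.columns) and list(santestder.columns).
train_cols = ['ID', 'var3', 'var15', 'var38', 'TARGET']
test_cols = ['ID', 'var3', 'var15', 'var38']

only_in_train, only_in_test = compare_columns(train_cols, test_cols)
print(only_in_train)
print(only_in_test)
```

If the real frames behave as described, the first list should contain only the response column and the second should be empty.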

Now, looking at the features:

In [12]:
print(santrainder.columns)

Index([u'ID', u'var3', u'var15', u'imp_ent_var16_ult1',
       u'imp_op_var39_comer_ult1', u'imp_op_var39_comer_ult3',
       u'imp_op_var40_comer_ult1', u'imp_op_var40_comer_ult3',
       u'imp_op_var40_efect_ult1', u'imp_op_var40_efect_ult3',
       ...
       u'saldo_medio_var33_hace2', u'saldo_medio_var33_hace3',
       u'saldo_medio_var33_ult1', u'saldo_medio_var33_ult3',
       u'saldo_medio_var44_hace2', u'saldo_medio_var44_hace3',
       u'saldo_medio_var44_ult1', u'saldo_medio_var44_ult3', u'var38',
       u'TARGET'],
      dtype='object', length=371)


We can see that the names are not in English. Santander Bank is a Spanish banking entity, one of the largest banks in the eurozone, and so these features are in Spanish (*abbreviated* Spanish, no less). After some scouring of the competition forum and some translation work, we have the following translation paradigm:

- imp_ent_varX => importe entidad => amount for the bank office
- imp_op_varX_comer => importe opcion comercial => amount for commercial option
- imp_sal_varX => importe salario => amount for wage
- ind_varX_corto => indicador corto => short (time lapse?) indicator/dummy
- ind_varX_medio => indicador medio => medium-sized (time lapse?) indicator/dummy
- ind_varX_largo => indicador largo => long-sized (time lapse?) indicator/dummy
- sald_varX => saldo => balance
- delta_imp_amort_varX_1y3 => importe amortización 1 y 3 => amount/price for redemption (?) 1 and 3
- delta_imp_aport_varX_1y3 => importe aportación 1 y 3 => amount/price for contribution (?) 1 and 3
- delta_imp_reemb_varX_1y3 => importe reembolso 1 y 3 => amount/price for refund 1 and 3
- delta_imp_trasp_varX_out_1y3 => importe traspaso 1 y 3 => amount/price for transfer 1 and 3
- imp_venta_varX => importe venta => sale fee
- ind_varX_emit_ult1 => indicador emitido => indicator of emission
- ind_varX_recib_ult1 => indicador recibido => indicator of reception
- num_varX_hace2 => número hace 2 => number [of variable X] done two units in the past
- num_med_varX => número medio => mean number [of variable X]
- num_meses_varX => número de meses => number of months [for variable X]
- saldo_medio_varX => saldo medio => average balance
- delta_imp_venta_varX_1y3 => importe de venta 1 y 3 => fee on sales [for variable X] 1 and 3
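A translation scheme like the one above could be applied mechanically, token by token. The sketch below is only an illustration: `GLOSSARY` and `translate_column` are hypothetical names, the glossary covers just a few tokens from the list, and it does not reproduce how the translated file used in the next cell was produced.

```python
# Token glossary taken from the list above; anything not listed is left untouched.
GLOSSARY = {
    'imp': 'amount',
    'ent': 'entity',
    'ind': 'indicator',
    'num': 'number',
    'saldo': 'balance',
    'meses': 'months',
    'venta': 'sale',
    'corto': 'short',
    'largo': 'long',
    'emit': 'emitted',
    'recib': 'received',
}


def translate_column(name):
    """Translate known Spanish tokens in a column name, token by token."""
    tokens = name.split('_')
    return ' '.join(GLOSSARY.get(tok, tok) for tok in tokens)


# Illustrative column labels; unknown tokens such as 'var16' pass through as-is.
for col in ['imp_ent_var16_ult1', 'ind_var13_corto', 'var38']:
    print('{} => {}'.format(col, translate_column(col)))
```

With the data loaded, `santrainder.rename(columns=translate_column)` would apply the same function to every column label. Ambiguous tokens such as `medio` ("average" in some names, "medium" in others) are deliberately left out of the glossary.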

In [22]:
santrainder_en = pd.read_csv('train_english.csv')

In [24]:
print(santrainder_en.columns)

Index([u'ID', u'var3', u'var15', u'var16 entity amount last',
       u'var39 commercial option amount ultima1',
       u'var39 commercial option amount ultima3',
       u'var40 commercial option amount ultima1',
       u'var40 commercial option amount ultima3',
       u'var40 amount effective option ultima1',
       u'var40 amount effective option ultima3',
       ...
       u'var33 average balance hace2', u'var33 average balance hace3',
       u'var33 average balance ultima1', u'var33 average balance ultima3',
       u'var44 average balance hace2', u'var44 average balance hace3',
       u'var44 average balance ultima1', u'var44 average balance ultima3',
       u'var38', u'TARGET'],
      dtype='object', length=371)


The translation helps a little, but not for the majority of features. There aren't really any hints as to what the different '**varX**' mean (X being a positive integer from 1 up to 46). Even so, a glance at the data itself is usually enough to tell which features are categorical and which are numerical, and it is possible to guess the nature of certain features by looking at their histograms.
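One rough way to separate categorical-looking features from numerical ones is to count the distinct values in each column. This is a sketch under stated assumptions: `summarize_cardinality` is a hypothetical helper, the `max_categories` cutoff is arbitrary, and the toy frame with made-up values stands in for `santrainder`.

```python
import pandas as pd


def summarize_cardinality(df, max_categories=10):
    """Label each column by how many distinct values it holds.

    The max_categories cutoff is an arbitrary assumption, not a rule.
    """
    n_unique = df.nunique()
    labels = {}
    for col, n in n_unique.items():
        if n <= 1:
            labels[col] = 'constant'
        elif n == 2:
            labels[col] = 'binary'
        elif n <= max_categories:
            labels[col] = 'categorical?'
        else:
            labels[col] = 'numeric?'
    return pd.DataFrame({'n_unique': n_unique, 'guess': pd.Series(labels)})


# Toy frame standing in for santrainder; the values are made up for illustration.
toy = pd.DataFrame({
    'ind_var1': [0, 1, 0, 1, 0, 0],
    'ind_var2': [0, 0, 0, 0, 0, 0],
    'num_var4': [0, 1, 2, 1, 3, 0],
    'var38': [1.5, 20.25, 3.0, 4.75, 5.5, 6.125],
})
summary = summarize_cardinality(toy, max_categories=4)
print(summary)
```

The labels are only guesses: an integer column with few distinct values may still be a count rather than a category, so a histogram of the column is a useful second check.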

(For visual simplicity's sake, we'll be sticking mostly to the original abbreviated Spanish data, making a note whenever we need to make a translation.)

For example, there are 66 total features that are 'indicator' features with dummy values of either 0 or 1.

In [20]:
print(santrainder.columns[20:87])

Index([u'ind_var1_0', u'ind_var1', u'ind_var2_0', u'ind_var2', u'ind_var5_0',
       u'ind_var5', u'ind_var6_0', u'ind_var6', u'ind_var8_0', u'ind_var8',
       u'ind_var12_0', u'ind_var12', u'ind_var13_0', u'ind_var13_corto_0',
       u'ind_var13_corto', u'ind_var13_largo_0', u'ind_var13_largo',
       u'ind_var13_medio_0', u'ind_var13_medio', u'ind_var13', u'ind_var14_0',
       u'ind_var14', u'ind_var17_0', u'ind_var17', u'ind_var18_0',
       u'ind_var18', u'ind_var19', u'ind_var20_0', u'ind_var20',
       u'ind_var24_0', u'ind_var24', u'ind_var25_cte', u'ind_var26_0',
       u'ind_var26_cte', u'ind_var26', u'ind_var25_0', u'ind_var25',
       u'ind_var27_0', u'ind_var28_0', u'ind_var28', u'ind_var27',
       u'ind_var29_0', u'ind_var29', u'ind_var30_0', u'ind_var30',
       u'ind_var31_0', u'ind_var31', u'ind_var32_cte', u'ind_var32_0',
       u'ind_var32', u'ind_var33_0', u'ind_var33', u'ind_var34_0',
       u'ind_var34', u'ind_var37_cte', u'ind_var37_0', u'ind_var37',
       u'i

From these indicator features we can see that they often come in pairs of **ind_varX** and **ind_varX_0**. It's speculated that each of these indicator varXs is some sort of banking feature, service, or product, and that the suffix **0** suggests an explicit opt-out of the product (vs. simply never using the product).
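The pairing of **ind_varX** with **ind_varX_0** can be listed from the column names without relying on positional slices. A minimal sketch, where `pair_indicators` is a hypothetical helper and the short list reuses a few labels from the output above:

```python
def pair_indicators(columns):
    """Map each ind_* base column to its '<base>_0' partner, or None if absent."""
    names = set(columns)
    pairs = {}
    for col in columns:
        if col.startswith('ind_') and not col.endswith('_0'):
            partner = col + '_0'
            pairs[col] = partner if partner in names else None
    return pairs


# A few labels copied from the output above; pass santrainder.columns for the full set.
cols = ['ind_var1_0', 'ind_var1', 'ind_var13_corto_0', 'ind_var13_corto',
        'ind_var19', 'ind_var25_cte', 'var38']
pairs = pair_indicators(cols)
for base, partner in pairs.items():
    print('{} -> {}'.format(base, partner))
```

With the data loaded, a cross-tabulation such as `pd.crosstab(santrainder['ind_var1_0'], santrainder['ind_var1'])` shows how often the two columns of a pair agree, which may help in testing the speculation about the **0** suffix.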

In [6]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="white", color_codes=True)

## 4. Features

## 5. Modeling

## 6. Concluding Thoughts