## Santander Value Prediction Challenge
#### Predict the value of transactions for potential customers.

According to Epsilon research, 80% of customers are more likely to do business with you if you provide personalized service. Banking is no exception.

The digitalization of everyday life means that customers expect services to be delivered in a personalized and timely manner… and often before they've even realized they need the service. In their 3rd Kaggle competition, Santander Group aims to go a step beyond recognizing that there is a need to provide a customer with a financial service and intends to determine the amount or value of the customer's transaction. This means anticipating customer needs in a more concrete, but also simple and personal way. With so many choices for financial services, this need is greater now than ever before.

In this competition, Santander Group is asking Kagglers to help them identify the value of transactions for each potential customer. This is a first step that Santander needs to nail in order to personalize their services at scale.


The evaluation metric for this competition is Root Mean Squared Logarithmic Error.
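
For reference, RMSLE compares predictions and targets on a log scale: it is the square root of the mean of `(log(1 + y_pred) - log(1 + y_true))**2`. A minimal NumPy sketch follows; the helper name `rmsle` is illustrative and not part of the competition code.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) = log(1 + x), which keeps zero values well defined
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))
```

Because the metric works in log space, fitting a model to `np.log1p(target)` with ordinary RMSE and converting predictions back with `np.expm1` optimizes the same quantity.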

In [81]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

import gc

In [2]:
for p in [np, pd, sns]:
    print(p.__version__)

1.14.3
0.23.0
0.8.1


### Feature Engineering

1. Concatenate the train and test data to ensure range consistency
2. Remove columns with zero standard deviation in the train dataset from both datasets
3. Normalize the features to the 0 to 1 range using `MinMaxScaler`
4. For each row, add the mean, standard deviation, median, and maximum.
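
Steps 3 and 4 of the list above could look roughly like the sketch below. It runs on a small made-up frame, does the min-max scaling by hand in pandas (the same idea as `MinMaxScaler` with its default range), and the `row_*` column names are assumptions chosen for illustration.

```python
import pandas as pd

# Toy frame standing in for the concatenated train + test features (made-up values)
toy = pd.DataFrame({
    "x1": [0.0, 5.0, 10.0],
    "x2": [2.0, 2.0, 4.0],
    "x3": [1.0, 3.0, 2.0],
})

# Step 3: min-max scale each column to the [0, 1] range
col_min, col_max = toy.min(), toy.max()
scaled = (toy - col_min) / (col_max - col_min)

# Step 4: per-row summary statistics, computed on the scaled features only
row_stats = pd.DataFrame({
    "row_mean": scaled.mean(axis=1),
    "row_std": scaled.std(axis=1),
    "row_median": scaled.median(axis=1),
    "row_max": scaled.max(axis=1),
})
features = pd.concat([scaled, row_stats], axis=1)
```

A column that is constant across all rows would divide by zero in this hand-rolled scaling, which is one reason to drop zero-variation columns first.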

### Read the data

In [4]:
train = pd.read_csv("../data/train.csv.zip")
train.head()

Unnamed: 0,ID,target,48df886f9,0deb4b6a8,34b15f335,a8cb14b00,2f0771a37,30347e683,d08d1fbe3,6ee66e115,...,3ecc09859,9281abeea,8675bec0b,3a13ed79a,f677d4d13,71b203550,137efaa80,fb36b89d9,7e293fbaf,9fc776466
0,000d6aaf2,38000000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0
1,000fbd867,600000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0
2,0027d6b71,10000000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0
3,0028cbf45,2000000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0
4,002a68644,14400000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0


In [5]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4459 entries, 0 to 4458
Columns: 4993 entries, ID to 9fc776466
dtypes: float64(1845), int64(3147), object(1)
memory usage: 169.9+ MB


In [6]:
# Test Dataset
test = pd.read_csv("../data/test.csv.zip")
test.head()

Unnamed: 0,ID,48df886f9,0deb4b6a8,34b15f335,a8cb14b00,2f0771a37,30347e683,d08d1fbe3,6ee66e115,20aa07010,...,3ecc09859,9281abeea,8675bec0b,3a13ed79a,f677d4d13,71b203550,137efaa80,fb36b89d9,7e293fbaf,9fc776466
0,000137c73,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,00021489f,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0004d7953,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,00056a333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,00056d8eb,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [7]:
# Concatenate the data and rebuild a clean 0..n-1 index
data = pd.concat([train, test], axis=0, sort=False).reset_index(drop=True)
data.head()

Unnamed: 0,ID,target,48df886f9,0deb4b6a8,34b15f335,a8cb14b00,2f0771a37,30347e683,d08d1fbe3,6ee66e115,...,3ecc09859,9281abeea,8675bec0b,3a13ed79a,f677d4d13,71b203550,137efaa80,fb36b89d9,7e293fbaf,9fc776466
0,000d6aaf2,38000000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,000fbd867,600000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0027d6b71,10000000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0028cbf45,2000000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,002a68644,14400000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [8]:
# Verify the concatenation is correct
assert data.target.isnull().sum() == test.shape[0]
assert (~data.target.isnull()).sum() == train.shape[0]

In [9]:
data.target.isnull().sum(), (~data.target.isnull()).sum()

(49342, 4459)

##### Rename the feature columns to `x1..x4991` for easy reference

In [10]:
old_feature_names = [n for n in data.columns if n not in ('ID','target')]
new_feature_names = ['x'+str(i) for i in range(1,len(data.columns)-1)]
assert len(old_feature_names) == len(new_feature_names)
feature_map = {k:v for (k,v) in zip(new_feature_names, old_feature_names)}

In [11]:
data.rename(columns=dict(zip(train.columns, ['ID','target']+new_feature_names)), inplace=True) 

In [12]:
data.head()

Unnamed: 0,ID,target,x1,x2,x3,x4,x5,x6,x7,x8,...,x4982,x4983,x4984,x4985,x4986,x4987,x4988,x4989,x4990,x4991
0,000d6aaf2,38000000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,000fbd867,600000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0027d6b71,10000000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0028cbf45,2000000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,002a68644,14400000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Remove columns with zero variation

In [13]:
train_stats = data.loc[~data.target.isnull(),:].describe()

In [14]:
cols_zero_std = train_stats.loc['std'].loc[train_stats.loc['std'].values <= 0].index.tolist()
print("There are {0} columns that have no variation".format(len(cols_zero_std)))

There are 256 columns that have no variation


In [15]:
data.drop(cols_zero_std, axis=1, inplace=True)
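
To make the filter above concrete, here is a small sketch on made-up data. It calls `.std()` directly on the train rows instead of going through `describe()`; the toy frame and the name `constant_cols` are illustrative assumptions.

```python
import pandas as pd

# Toy frame: rows with a target are "train", rows without are "test" (made-up values)
toy = pd.DataFrame({
    "ID": ["a", "b", "c", "d"],
    "target": [1.0, 2.0, None, None],
    "x1": [0, 0, 3, 4],      # constant on the train rows only
    "x2": [1, 2, 3, 4],
})

is_train = toy["target"].notnull()
# Standard deviation of every feature, computed on train rows only
train_std = toy.loc[is_train].drop(columns=["ID", "target"]).std()
constant_cols = train_std[train_std == 0].index.tolist()

# Drop the constant columns from train and test rows alike
toy_reduced = toy.drop(columns=constant_cols)
```

Note that `x1` varies on the test rows but is still dropped: a feature that never changes in training cannot be learned from, whatever it does in the test set.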

### How many rows are simulated
A row is treated as simulated if any of its values has 4 or more decimal places.
Inspired by:

https://www.kaggle.com/c/santander-value-prediction-challenge/discussion/61288

In [67]:
decimal_threshold = 3
feature_decimal = (data.loc[:,data.columns[2:].tolist()].values*10**decimal_threshold) % 1
# Count, per row, the columns with digits beyond the threshold; the 10e-6 (= 1e-5) tolerance absorbs floating-point rounding noise
num_decimal = np.sum((feature_decimal > 10e-6) & (feature_decimal < 1 - 10e-6), axis=1)
print("Rows with at least 1 column with more than {0} decimal places = {1}".format(decimal_threshold, np.sum(num_decimal>0)))

Rows with at least 1 column with more than 3 decimal places = 31628
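
The decimal-place test can be easier to follow on a tiny example. The sketch below applies the same arithmetic to a made-up matrix; the names `tol`, `leftover`, and `flagged_rows` are assumptions introduced for illustration.

```python
import numpy as np

decimal_threshold = 3
tol = 1e-5  # same magnitude as the 10e-6 tolerance used in the cell above

# Toy feature matrix (made-up values)
values = np.array([
    [1.5, 0.0, 1200.0],     # at most 1 decimal place
    [0.125, 2.0, 3.0],      # exactly 3 decimal places
    [0.12345, 0.0, 7.0],    # 5 decimal places, so this row is flagged
])

# Shift the first `decimal_threshold` decimals left of the point; what remains
# after `% 1` is whatever sits beyond them
leftover = (values * 10 ** decimal_threshold) % 1

# Leftovers very close to 0 or 1 are treated as rounding noise, not real digits
has_extra = (leftover > tol) & (leftover < 1 - tol)
flagged_rows = has_extra.any(axis=1)
```

Only the last row is flagged, since `0.12345 * 1000` leaves a fractional part of about `0.45`.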


In [82]:
# Create new train and test data by removing the simulated data
simulated, real = data.loc[num_decimal > 0,:].copy(), data.loc[~(num_decimal > 0),:].copy()
train_new, test_new = real.loc[~real.target.isnull(),:].copy(), real.loc[real.target.isnull(),:].copy()
print("Simulated data size = {0}".format(simulated.shape[0]))
print("Real data size      = {0}".format(real.shape[0]))
print("Train size          = {0}".format(train_new.shape[0]))
print("Test data size      = {0}".format(test_new.shape[0]))

del data
gc.collect()

Simulated data size = 31628
Real data size      = 22173
Train size          = 4459
Test data size      = 17714


20

### Add the new features
1. Number of zero-valued features in each row
2. Mean, median, standard deviation, and kurtosis of each row, excluding zero-valued features.

In [83]:
# Count the zero-valued features in each row (vectorized comparison instead of a row-wise apply)
real['num_zeros'] = (real.loc[:, real.columns[2:]] == 0).sum(axis=1)
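
The second item in the feature list (statistics that exclude zeros) has no cell yet. One possible approach, shown on a made-up frame, is to mask zeros as `NaN` and rely on pandas reductions skipping `NaN` by default; the `nz_*` column names are hypothetical.

```python
import pandas as pd

# Toy feature block (made-up values); the real frame would use real.columns[2:]
toy = pd.DataFrame({
    "x1": [0.0, 2.0, 0.0],
    "x2": [4.0, 0.0, 0.0],
    "x3": [6.0, 2.0, 0.0],
    "x4": [0.0, 8.0, 0.0],
    "x5": [8.0, 4.0, 0.0],
    "x6": [2.0, 0.0, 0.0],
})

# Turn zeros into NaN so that the NaN-skipping reductions ignore them
nonzero = toy.where(toy != 0)

stats = pd.DataFrame({
    "nz_mean": nonzero.mean(axis=1),
    "nz_median": nonzero.median(axis=1),
    "nz_std": nonzero.std(axis=1),
    "nz_kurt": nonzero.kurt(axis=1),  # NaN when a row has fewer than 4 non-zero values
})
```

Rows that contain only zeros come out as `NaN` in every statistic, so they would need an explicit fill value before modelling.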

### Cross validation
Also confirm that dropping the simulated rows does not change the score.
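
No model has been chosen at this point in the notebook, so the following is only a sketch of how a K-fold RMSLE estimate might be computed. It uses plain NumPy with an ordinary least-squares fit on `log1p(target)` as a stand-in model; the function `kfold_rmsle` and the synthetic data are assumptions, not the notebook's actual pipeline.

```python
import numpy as np

def kfold_rmsle(X, y, n_splits=5, seed=0):
    """Mean RMSLE of a least-squares fit on log1p(y) across shuffled folds."""
    rng = np.random.RandomState(seed)
    folds = np.array_split(rng.permutation(len(y)), n_splits)
    log_y = np.log1p(y)
    scores = []
    for k in range(n_splits):
        val_idx = folds[k]
        trn_idx = np.concatenate([folds[j] for j in range(n_splits) if j != k])
        # Add an intercept column, then solve ordinary least squares
        A_trn = np.column_stack([np.ones(len(trn_idx)), X[trn_idx]])
        A_val = np.column_stack([np.ones(len(val_idx)), X[val_idx]])
        coef, *_ = np.linalg.lstsq(A_trn, log_y[trn_idx], rcond=None)
        # RMSE in log1p space equals RMSLE on the original scale
        scores.append(np.sqrt(np.mean((A_val @ coef - log_y[val_idx]) ** 2)))
    return float(np.mean(scores))

# Synthetic data, for illustration only
rng = np.random.RandomState(42)
X_demo = rng.rand(200, 3)
y_demo = np.expm1(1.0 + X_demo @ np.array([0.5, 1.0, 2.0]))
score = kfold_rmsle(X_demo, y_demo)
```

In the output above, the filtered train set still has 4459 rows, the same as the original train file, so the simulated rows all appear to come from the test set; a score computed on training rows alone should therefore be unaffected by the filter.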