## Santander Value Prediction Challenge
#### Predict the value of transactions for potential customers.

According to Epsilon research, 80% of customers are more likely to do business with you if you provide personalized service. Banking is no exception.

The digitalization of everyday lives means that customers expect services to be delivered in a personalized and timely manner… and often before they've even realized they need the service. In their third Kaggle competition, Santander Group aims to go a step beyond recognizing that there is a need to provide a customer with a financial service and intends to determine the amount or value of the customer's transaction. This means anticipating customer needs in a more concrete, but also simple and personal way. With so many choices for financial services, this need is greater now than ever before.

In this competition, Santander Group is asking Kagglers to help them identify the value of transactions for each potential customer. This is a first step that Santander needs to nail in order to personalize their services at scale.


The evaluation metric for this competition is Root Mean Squared Logarithmic Error.
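For reference, Root Mean Squared Logarithmic Error (RMSLE) over $n$ samples with actual values $a_i$ and predictions $p_i$ is $\sqrt{\frac{1}{n}\sum_i (\log(p_i + 1) - \log(a_i + 1))^2}$. The snippet below is a minimal sketch of that formula; the helper name `rmsle` is an assumption made for illustration and is not part of the competition code.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error (hypothetical helper name)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) computes log(1 + x), which keeps zero-valued entries finite.
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    return float(np.sqrt(np.mean(log_diff ** 2)))

# Identical predictions give an error of zero.
perfect = rmsle([100.0, 1000.0], [100.0, 1000.0])
# Predicting e - 1 for an actual value of 0 gives a log difference of 1.
off_by_one_log = rmsle([0.0], [np.e - 1.0])
```

Because the error is measured on the log scale, it penalizes relative rather than absolute differences, which suits targets that span several orders of magnitude.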

In [12]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats import kurtosis
from math import sqrt
from sklearn.metrics import mean_squared_error 

%matplotlib inline

import gc

In [2]:
for p in [np, pd, sns]:
    print(p.__version__)

1.14.3
0.23.0
0.8.1


### Feature Engineering

1. Concatenate the train and test data so that feature ranges are consistent across both.
2. Remove columns with zero standard deviation in the train dataset from both datasets.
3. Normalize the features to the 0-1 range using `MinMaxScaler`.
4. For each row, add the mean, standard deviation, median, and maximum.
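The steps above could be sketched roughly as follows. This is an illustration on hypothetical toy frames (`toy_train`, `toy_test`), not the notebook's actual pipeline, and it applies the min-max formula directly in pandas rather than calling scikit-learn's `MinMaxScaler`; the `row_*` column names are also assumptions.

```python
import pandas as pd

# Hypothetical toy frames standing in for the real train and test features.
toy_train = pd.DataFrame({"a": [0.0, 2.0, 4.0],
                          "b": [1.0, 1.0, 1.0],   # constant in train
                          "c": [10.0, 0.0, 5.0]})
toy_test = pd.DataFrame({"a": [8.0, 1.0],
                         "b": [1.0, 3.0],
                         "c": [20.0, 0.0]})

# 1. Concatenate so that scaling sees the full range of both datasets.
combined = pd.concat([toy_train, toy_test], keys=["train", "test"])

# 2. Drop columns whose standard deviation is zero in the train data.
constant_cols = toy_train.columns[toy_train.std() == 0]
combined = combined.drop(columns=constant_cols)

# 3. Min-max scale each remaining column to the 0-1 range.
col_min = combined.min()
col_range = combined.max() - col_min
scaled = (combined - col_min) / col_range

# 4. Add per-row summary statistics computed over the scaled features.
row_stats = pd.DataFrame({
    "row_mean": scaled.mean(axis=1),
    "row_std": scaled.std(axis=1),
    "row_median": scaled.median(axis=1),
    "row_max": scaled.max(axis=1),
})
features = pd.concat([scaled, row_stats], axis=1)

# Split back into the two datasets using the concat keys.
train_features = features.loc["train"]
test_features = features.loc["test"]
```

Keying the concatenation with `"train"` and `"test"` makes it straightforward to split the engineered features back apart afterwards.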

### Read the data

In [3]:
train = pd.read_csv("../data/train.csv.zip")
train.head()

Unnamed: 0,ID,target,48df886f9,0deb4b6a8,34b15f335,a8cb14b00,2f0771a37,30347e683,d08d1fbe3,6ee66e115,...,3ecc09859,9281abeea,8675bec0b,3a13ed79a,f677d4d13,71b203550,137efaa80,fb36b89d9,7e293fbaf,9fc776466
0,000d6aaf2,38000000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0
1,000fbd867,600000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0
2,0027d6b71,10000000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0
3,0028cbf45,2000000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0
4,002a68644,14400000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0


In [4]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4459 entries, 0 to 4458
Columns: 4993 entries, ID to 9fc776466
dtypes: float64(1845), int64(3147), object(1)
memory usage: 169.9+ MB


In [5]:
# Test Dataset
test = pd.read_csv("../data/test.csv.zip")
test.head()

Unnamed: 0,ID,48df886f9,0deb4b6a8,34b15f335,a8cb14b00,2f0771a37,30347e683,d08d1fbe3,6ee66e115,20aa07010,...,3ecc09859,9281abeea,8675bec0b,3a13ed79a,f677d4d13,71b203550,137efaa80,fb36b89d9,7e293fbaf,9fc776466
0,000137c73,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,00021489f,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0004d7953,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,00056a333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,00056d8eb,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [6]:
FEATURES = ['f190486d6', '58e2e02e6', 'eeb9cd3aa', '9fd594eec', '6eef030c1', 
            '15ace8c9f', 'fb0f5dbfe', '58e056e12', '20aa07010', '024c577b9', 
            'd6bb78916', 'b43a7cfd5', '58232a6fb', '1702b5bf0', '324921c7b', 
            '62e59a501', '2ec5b290f', '241f0f867', 'fb49e4212', '66ace2992', 
            'f74e8f13d', '5c6487af1', '963a49cdc', '26fc93eb7', '1931ccfdd', 
            '703885424', '70feb1494', '491b9ee45', '23310aa6f', 'e176a204a', 
            '6619d81fc', '1db387535', 'fc99f9426', '91f701ba2', '0572565c2', 
            '190db8488', 'adb64ff71', 'c47340d97', 'c5a231d81', '0ff32eb98']
    
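def get_pred(data, lag=2):
    # Key each row by its first len(FEATURES) - lag feature values.
    d1 = data[FEATURES[:-lag]].apply(tuple, axis=1).to_frame().rename(columns={0: 'key'})
    # Key each row by its last len(FEATURES) - lag feature values.
    d2 = data[FEATURES[lag:]].apply(tuple, axis=1).to_frame().rename(columns={0: 'key'})
    # Candidate prediction carried by each row: its value in column FEATURES[lag - 2].
    d2['pred'] = data[FEATURES[lag - 2]]
    # Discard zero candidates and keys that occur more than once (ambiguous matches).
    d2 = d2[d2.pred != 0]
    d3 = d2[~d2.duplicated(['key'], keep=False)]
    # Rows in d1 without a unique matching key get 0, meaning "no prediction".
    return d1.merge(d3, how='left', on='key').pred.fillna(0)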
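To make the key matching easier to follow, here is a small sketch that applies the same logic to a hypothetical three-row frame. The names `TOY_FEATURES`, `toy`, and `lag_match` are invented for illustration; the data is made up so that the second row looks like the first row shifted two columns to the right.

```python
import pandas as pd

# Hypothetical ordered feature columns and toy data.
TOY_FEATURES = ["c0", "c1", "c2", "c3", "c4"]
toy = pd.DataFrame(
    [[3, 4, 5, 6, 7],   # row 0
     [9, 8, 3, 4, 5],   # row 1: row 0 shifted right by two columns
     [1, 1, 2, 2, 3]],  # row 2: unrelated
    columns=TOY_FEATURES, dtype=float,
)

def lag_match(data, features, lag=2):
    """Same steps as get_pred, with the feature list passed in explicitly."""
    # Leading values of each row form the lookup key.
    d1 = data[features[:-lag]].apply(tuple, axis=1).to_frame().rename(columns={0: "key"})
    # Trailing values of each row form the key that can be matched against.
    d2 = data[features[lag:]].apply(tuple, axis=1).to_frame().rename(columns={0: "key"})
    d2["pred"] = data[features[lag - 2]]
    d2 = d2[d2.pred != 0]
    d3 = d2[~d2.duplicated(["key"], keep=False)]
    return d1.merge(d3, how="left", on="key").pred.fillna(0)

result = lag_match(toy, TOY_FEATURES, lag=2)
print(result.tolist())  # [9.0, 0.0, 0.0]
```

Row 0's leading values `(3, 4, 5)` equal row 1's trailing values, so row 0 receives row 1's `c0` value (9) as its prediction. The other two rows have no matching partner and fall back to 0.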

In [8]:
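def get_all_pred(data, max_lag):
    # Start with 0 everywhere, meaning "no prediction yet".
    target = pd.Series(index=data.index, data=np.zeros(data.shape[0]))
    for lag in range(2, max_lag + 1):
        pred = get_pred(data, lag)
        # Only fill rows that are still empty, so smaller lags take priority.
        mask = (target == 0) & (pred != 0)
        target[mask] = pred[mask]
    return target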
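The masking step in `get_all_pred` only fills rows that have no prediction yet, so a value found at a smaller lag is never overwritten by a later one. A minimal sketch with made-up numbers (the names `current` and `new_pred` are hypothetical):

```python
import pandas as pd

# Predictions gathered so far; 0 means "nothing found yet".
current = pd.Series([0.0, 5.0, 0.0])
# Predictions from the next lag.
new_pred = pd.Series([7.0, 9.0, 0.0])

# Fill only the rows that are still empty and have a non-zero new value.
mask = (current == 0) & (new_pred != 0)
current[mask] = new_pred[mask]

print(current.tolist())  # [7.0, 5.0, 0.0]
```

The second row keeps its earlier value of 5 even though the later lag proposes 9, and the third row stays at 0 because neither lag produced a match.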

In [13]:
for max_lag in range(2, 5):
    pred_train = get_all_pred(train, max_lag)
    have_data = pred_train != 0
    # RMSLE restricted to the rows for which a prediction was found.
    score = sqrt(mean_squared_error(np.log1p(train.target[have_data]), np.log1p(pred_train[have_data])))
    print(f'Max lag {max_lag}: Score = {score} on {have_data.sum()} out of {train.shape[0]} training samples')

Max lag 2: Score = 0.14151594656341263 on 1358 out of 4459 training samples
Max lag 3: Score = 0.14241434561746488 on 1958 out of 4459 training samples
Max lag 4: Score = 0.18436867496270792 on 2352 out of 4459 training samples


In [32]:
subset = train.loc[:1000, FEATURES]
subset

Unnamed: 0,f190486d6,58e2e02e6,eeb9cd3aa,9fd594eec,6eef030c1,15ace8c9f,fb0f5dbfe,58e056e12,20aa07010,024c577b9,...,6619d81fc,1db387535,fc99f9426,91f701ba2,0572565c2,190db8488,adb64ff71,c47340d97,c5a231d81,0ff32eb98
0,1866666.66,12066666.66,7.000000e+05,600000.00,900000.0,4100000.0,0.00,0.00,0.0,0.00,...,400000.0,0.0,0.00,5000000.00,400000.0,0.00,0.00,0.00,0.00,0.00
1,0.00,2850000.00,2.225000e+06,1800000.00,800000.0,0.0,0.00,3300000.00,2200000.0,0.00,...,0.0,0.0,0.00,0.00,0.0,0.00,0.00,0.00,0.00,0.00
2,0.00,0.00,0.000000e+00,0.00,0.0,0.0,0.00,6000000.00,0.0,0.00,...,0.0,0.0,0.00,0.00,0.0,0.00,0.00,0.00,0.00,0.00
3,2000000.00,0.00,0.000000e+00,0.00,0.0,0.0,0.00,0.00,0.0,0.00,...,0.0,0.0,0.00,0.00,0.0,0.00,0.00,0.00,0.00,0.00
4,0.00,0.00,0.000000e+00,0.00,37662000.0,0.0,4000000.00,6700000.00,2000000.0,5400000.00,...,0.0,0.0,0.00,0.00,0.0,0.00,0.00,8000000.00,0.00,0.00
5,0.00,0.00,2.800000e+06,17000000.00,0.0,556000.0,0.00,0.00,17020000.0,17020000.00,...,17020000.0,17020000.0,17020000.00,17020000.00,17020000.0,17020000.00,17020000.00,17020000.00,17020000.00,17020000.00
6,10000.00,4000.00,0.000000e+00,30000.00,0.0,0.0,0.00,4000.00,0.0,2000.00,...,0.0,6000.0,0.00,14000.00,0.0,22000.00,6000.00,15000.00,360000.00,18000.00
7,0.00,0.00,0.000000e+00,65000000.00,0.0,0.0,0.00,0.00,0.0,0.00,...,0.0,0.0,0.00,0.00,0.0,0.00,0.00,0.00,0.00,0.00
8,3333333.34,3925333.34,4.000000e+06,0.00,0.0,0.0,58000.00,58000.00,58000.0,58000.00,...,58000.0,58000.0,58000.00,58000.00,58000.0,58000.00,58000.00,58000.00,58000.00,58000.00
9,7500000.00,0.00,0.000000e+00,0.00,0.0,0.0,0.00,0.00,0.0,0.00,...,1100000.0,0.0,0.00,0.00,0.0,0.00,0.00,0.00,0.00,0.00


In [33]:
d1 = subset[FEATURES[:-2]].apply(tuple, axis=1).to_frame().rename(columns={0: 'key'})
d1

Unnamed: 0,key
0,"(1866666.66, 12066666.66, 700000.0, 600000.0, ..."
1,"(0.0, 2850000.0, 2225000.0, 1800000.0, 800000...."
2,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 6000000.0,..."
3,"(2000000.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,..."
4,"(0.0, 0.0, 0.0, 0.0, 37662000.0, 0.0, 4000000...."
5,"(0.0, 0.0, 2800000.0, 17000000.0, 0.0, 556000...."
6,"(10000.0, 4000.0, 0.0, 30000.0, 0.0, 0.0, 0.0,..."
7,"(0.0, 0.0, 0.0, 65000000.0, 0.0, 0.0, 0.0, 0.0..."
8,"(3333333.34, 3925333.34, 4000000.0, 0.0, 0.0, ..."
9,"(7500000.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,..."


In [35]:
d2 = subset[FEATURES[2:]].apply(tuple, axis=1).to_frame().rename(columns={0: 'key'})
d2

Unnamed: 0,key
0,"(700000.0, 600000.0, 900000.0, 4100000.0, 0.0,..."
1,"(2225000.0, 1800000.0, 800000.0, 0.0, 0.0, 330..."
2,"(0.0, 0.0, 0.0, 0.0, 0.0, 6000000.0, 0.0, 0.0,..."
3,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
4,"(0.0, 0.0, 37662000.0, 0.0, 4000000.0, 6700000..."
5,"(2800000.0, 17000000.0, 0.0, 556000.0, 0.0, 0...."
6,"(0.0, 30000.0, 0.0, 0.0, 0.0, 4000.0, 0.0, 200..."
7,"(0.0, 65000000.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0..."
8,"(4000000.0, 0.0, 0.0, 0.0, 58000.0, 58000.0, 5..."
9,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."


In [36]:
d2['pred'] = subset[FEATURES[2 - 2]]
d2

Unnamed: 0,key,pred
0,"(700000.0, 600000.0, 900000.0, 4100000.0, 0.0,...",1866666.66
1,"(2225000.0, 1800000.0, 800000.0, 0.0, 0.0, 330...",0.00
2,"(0.0, 0.0, 0.0, 0.0, 0.0, 6000000.0, 0.0, 0.0,...",0.00
3,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",2000000.00
4,"(0.0, 0.0, 37662000.0, 0.0, 4000000.0, 6700000...",0.00
5,"(2800000.0, 17000000.0, 0.0, 556000.0, 0.0, 0....",0.00
6,"(0.0, 30000.0, 0.0, 0.0, 0.0, 4000.0, 0.0, 200...",10000.00
7,"(0.0, 65000000.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0...",0.00
8,"(4000000.0, 0.0, 0.0, 0.0, 58000.0, 58000.0, 5...",3333333.34
9,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",7500000.00


In [37]:
d2 = d2[d2.pred != 0]
d2

Unnamed: 0,key,pred
0,"(700000.0, 600000.0, 900000.0, 4100000.0, 0.0,...",1.866667e+06
3,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",2.000000e+06
6,"(0.0, 30000.0, 0.0, 0.0, 0.0, 4000.0, 0.0, 200...",1.000000e+04
8,"(4000000.0, 0.0, 0.0, 0.0, 58000.0, 58000.0, 5...",3.333333e+06
9,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",7.500000e+06
11,"(0.0, 0.0, 6800000.0, 2800000.0, 0.0, 0.0, 0.0...",1.120000e+07
13,"(8800000.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2460...",3.300000e+06
16,"(400000.0, 0.0, 0.0, 0.0, 0.0, 0.0, 7000.0, 0....",2.000000e+04
17,"(0.0, 788000.0, 0.0, 0.0, 0.0, 1600000.0, 0.0,...",7.000000e+05
19,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",2.000000e+06


In [38]:
d3 = d2[~d2.duplicated(['key'], keep=False)]
d3

Unnamed: 0,key,pred
0,"(700000.0, 600000.0, 900000.0, 4100000.0, 0.0,...",1.866667e+06
6,"(0.0, 30000.0, 0.0, 0.0, 0.0, 4000.0, 0.0, 200...",1.000000e+04
8,"(4000000.0, 0.0, 0.0, 0.0, 58000.0, 58000.0, 5...",3.333333e+06
9,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",7.500000e+06
11,"(0.0, 0.0, 6800000.0, 2800000.0, 0.0, 0.0, 0.0...",1.120000e+07
13,"(8800000.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2460...",3.300000e+06
16,"(400000.0, 0.0, 0.0, 0.0, 0.0, 0.0, 7000.0, 0....",2.000000e+04
17,"(0.0, 788000.0, 0.0, 0.0, 0.0, 1600000.0, 0.0,...",7.000000e+05
19,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",2.000000e+06
20,"(0.0, 0.0, 0.0, 700000.0, 0.0, 0.0, 0.0, 0.0, ...",7.000000e+05


In [40]:
d1.merge(d3, how='left', on='key')

Unnamed: 0,key,pred
0,"(1866666.66, 12066666.66, 700000.0, 600000.0, ...",
1,"(0.0, 2850000.0, 2225000.0, 1800000.0, 800000....",
2,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 6000000.0,...",
3,"(2000000.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,...",
4,"(0.0, 0.0, 0.0, 0.0, 37662000.0, 0.0, 4000000....",
5,"(0.0, 0.0, 2800000.0, 17000000.0, 0.0, 556000....",
6,"(10000.0, 4000.0, 0.0, 30000.0, 0.0, 0.0, 0.0,...",
7,"(0.0, 0.0, 0.0, 65000000.0, 0.0, 0.0, 0.0, 0.0...",600000.00
8,"(3333333.34, 3925333.34, 4000000.0, 0.0, 0.0, ...",
9,"(7500000.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,...",


In [16]:
pred = get_pred(train.loc[:10, :])

In [17]:
pred

0     0.0
1     0.0
2     0.0
3     0.0
4     0.0
5     0.0
6     0.0
7     0.0
8     0.0
9     0.0
10    0.0
Name: pred, dtype: float64

In [15]:
train.loc[:10, :]

Unnamed: 0,ID,target,48df886f9,0deb4b6a8,34b15f335,a8cb14b00,2f0771a37,30347e683,d08d1fbe3,6ee66e115,...,3ecc09859,9281abeea,8675bec0b,3a13ed79a,f677d4d13,71b203550,137efaa80,fb36b89d9,7e293fbaf,9fc776466
0,000d6aaf2,38000000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0
1,000fbd867,600000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0
2,0027d6b71,10000000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0
3,0028cbf45,2000000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0
4,002a68644,14400000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0
5,002dbeb22,2800000.0,0.0,0,0.0,0,0,0,0,0,...,12000.0,5600000.0,20000000.0,0,0,0,0,0,0,11000
6,003925ac6,164000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,40000,0,0,0
7,003eb0261,600000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0
8,004b92275,979000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,4000000.0,0,0,0,0,0,0,0
9,0067b4fef,460000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,400000
