## Santander Value Prediction Challenge
#### Predict the value of transactions for potential customers.

According to Epsilon research, 80% of customers are more likely to do business with you if you provide personalized service. Banking is no exception.

The digitalization of everyday lives means that customers expect services to be delivered in a personalized and timely manner… and often before they´ve even realized they need the service. In their 3rd Kaggle competition, Santander Group aims to go a step beyond recognizing that there is a need to provide a customer a financial service and intends to determine the amount or value of the customer's transaction. This means anticipating customer needs in a more concrete, but also simple and personal way. With so many choices for financial services, this need is greater now than ever before.

In this competition, Santander Group is asking Kagglers to help them identify the value of transactions for each potential customer. This is a first step that Santander needs to nail in order to personalize their services at scale.


The evaluation metric for this competition is Root Mean Squared Logarithmic Error.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats import kurtosis

%matplotlib inline

import gc

In [2]:
for p in [np, pd, sns]:
    print(p.__version__)

1.14.3
0.23.0
0.8.1


### Feature Engineering

1. Concatenate the train and test data together to ensure range consistency
2. Remove columns with zero standard deviation in the train dataset from both dataset
3. Normalize the features to 0 - 1 range using minmaxscaler
4. For each row, add mean, std dev, median, maximum.
5. 

### Read the data

In [3]:
train = pd.read_csv("../data/train.csv.zip")
train.head()

Unnamed: 0,ID,target,48df886f9,0deb4b6a8,34b15f335,a8cb14b00,2f0771a37,30347e683,d08d1fbe3,6ee66e115,...,3ecc09859,9281abeea,8675bec0b,3a13ed79a,f677d4d13,71b203550,137efaa80,fb36b89d9,7e293fbaf,9fc776466
0,000d6aaf2,38000000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0
1,000fbd867,600000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0
2,0027d6b71,10000000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0
3,0028cbf45,2000000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0
4,002a68644,14400000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0


In [4]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4459 entries, 0 to 4458
Columns: 4993 entries, ID to 9fc776466
dtypes: float64(1845), int64(3147), object(1)
memory usage: 169.9+ MB


In [5]:
# Test Dataset
test = pd.read_csv("../data/test.csv.zip")
test.head()

Unnamed: 0,ID,48df886f9,0deb4b6a8,34b15f335,a8cb14b00,2f0771a37,30347e683,d08d1fbe3,6ee66e115,20aa07010,...,3ecc09859,9281abeea,8675bec0b,3a13ed79a,f677d4d13,71b203550,137efaa80,fb36b89d9,7e293fbaf,9fc776466
0,000137c73,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,00021489f,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0004d7953,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,00056a333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,00056d8eb,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [6]:
# Concatenate the data
data = pd.concat([train, test], axis=0, sort=False).reset_index().drop('index', axis=1)
data.head()

Unnamed: 0,ID,target,48df886f9,0deb4b6a8,34b15f335,a8cb14b00,2f0771a37,30347e683,d08d1fbe3,6ee66e115,...,3ecc09859,9281abeea,8675bec0b,3a13ed79a,f677d4d13,71b203550,137efaa80,fb36b89d9,7e293fbaf,9fc776466
0,000d6aaf2,38000000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,000fbd867,600000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0027d6b71,10000000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0028cbf45,2000000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,002a68644,14400000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [7]:
# Verify the concatenation is correct
assert data.target.isnull().sum() == test.shape[0]
assert (~data.target.isnull()).sum() == train.shape[0]

In [8]:
data.target.isnull().sum(), (~data.target.isnull()).sum()

(49342, 4459)

##### Rename the columns to `x1..x4993` for easy reference

In [9]:
old_feature_names = [n for n in data.columns if n not in ('ID','target')]
new_feature_names = ['x'+str(i) for i in range(1,len(data.columns)-1)]
assert len(old_feature_names) == len(new_feature_names)
feature_map = {k:v for (k,v) in zip(new_feature_names, old_feature_names)}

In [10]:
data.rename(columns=dict(zip(train.columns, ['ID','target']+new_feature_names)), inplace=True) 

In [11]:
data.head()

Unnamed: 0,ID,target,x1,x2,x3,x4,x5,x6,x7,x8,...,x4982,x4983,x4984,x4985,x4986,x4987,x4988,x4989,x4990,x4991
0,000d6aaf2,38000000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,000fbd867,600000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0027d6b71,10000000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0028cbf45,2000000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,002a68644,14400000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Remove columns with zero variation

In [12]:
train_stats = data.loc[~data.target.isnull(),:].describe()

In [13]:
cols_zero_std = train_stats.loc['std'].loc[train_stats.loc['std'].values <= 0].index.tolist()
print("There are {0} columns that have no variation".format(len(cols_zero_std)))

There are 256 columns that have no variation


In [14]:
data.drop(cols_zero_std, axis=1, inplace=True)

### How many rows are simulated 
A row is simulated if its values have 4 or more decimal places
inspired by: 

https://www.kaggle.com/c/santander-value-prediction-challenge/discussion/61288

In [15]:
decimal_threshold = 3
feature_decimal = (data.loc[:,data.columns[2:].tolist()].values*10**decimal_threshold) % 1
num_decimal = np.sum((feature_decimal > 10e-6) & (feature_decimal < 1 - 10e-6), axis=1)    # floating-point arithmetic quirks
print("Rows with at least 1 column with more than " + str(decimal_threshold) + " decimal places = {0}".format(np.sum(num_decimal>0)))

Rows with at least 1 column with more than 3 decimal places = 31628


In [16]:
# Create new train and test data by removing the simulated data
simulated, real = data.loc[num_decimal > 0,:].copy(), data.loc[~(num_decimal > 0),:].copy()
print("Simulated data size = {0}".format(simulated.shape[0]))
print("Real data size      = {0}".format(real.shape[0]))

del data
gc.collect()

Simulated data size = 31628
Real data size      = 22173


21

### Add the new features
1. Number of zero features
2. Mean, standard deviation, kurtosis. 

In [17]:
# How many features are zeros
real['num_zeros'] = real.loc[:,real.columns[2:].tolist()].apply(lambda x: np.sum(x==0), axis='columns')

# Numerical features
real['feat_mean'] = real.loc[:,real.columns[2:].tolist()].apply(lambda x: np.mean(x), axis='columns')
real['feat_std'] = real.loc[:,real.columns[2:].tolist()].apply(lambda x: np.std(x), axis='columns')
real['feat_kurt'] = real.loc[:,real.columns[2:].tolist()].apply(lambda x: kurtosis(x), axis='columns')

In [18]:
simulated['num_zeros'] = simulated.loc[:,simulated.columns[2:].tolist()].apply(lambda x: np.sum(x==0), axis='columns')
simulated['feat_mean'] = simulated.loc[:,simulated.columns[2:].tolist()].apply(lambda x: np.mean(x), axis='columns')
simulated['feat_std']  = simulated.loc[:,simulated.columns[2:].tolist()].apply(lambda x: np.std(x), axis='columns')
simulated['feat_kurt'] = simulated.loc[:,simulated.columns[2:].tolist()].apply(lambda x: kurtosis(x), axis='columns')

In [19]:
train_new, test_new = real.loc[~real.target.isnull(),:].copy(), real.loc[real.target.isnull(),:].copy()
print("Train data size     = {0}".format(train_new.shape[0]))
print("Test data size      = {0}".format(test_new.shape[0]))

Train data size     = 4459
Test data size      = 17714


In [20]:
# Standardization
from sklearn.base import TransformerMixin
from sklearn.preprocessing import StandardScaler, MinMaxScaler

class DFMinMaxScaler(TransformerMixin):
    # MinMaxScaler for pandas DataFrames

    def __init__(self):
        self.ss = None
        self.min_ = None
        self.max_ = None

    def fit(self, X, y=None):
        self.ss = MinMaxScaler()
        self.ss.fit(X)
        self.min_ = pd.Series(self.ss.data_min_, index=X.columns)
        self.max_ = pd.Series(self.ss.data_max_, index=X.columns)
        return self

    def transform(self, X):
        # assumes X is a DataFrame
        Xss = self.ss.transform(X)
        Xscaled = pd.DataFrame(Xss, index=X.index, columns=X.columns)
        return Xscaled


In [21]:
# Fit the transformer using only train data
# minmaxscaler = DFMinMaxScaler()
# minmaxscaler.fit(train_new.iloc[:,2:])
# real.iloc[:,2:] = minmaxscaler.transform(real.iloc[:,2:])
# simulated.iloc[:,2:] = minmaxscaler.transform(simulated.iloc[:,2:])

### Modeling

2. Random forest
3. Gradient boosting
4. Stacking?

#### Prepare the data

In [22]:
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import KFold, GridSearchCV, cross_val_score
import pickle

In [23]:
kf = KFold(5, random_state=42)

In [35]:
X_train, y_train = train_new.iloc[:,2:], train_new.target
X_test_real, X_test_simulated = test_new.iloc[:,2:], simulated.iloc[:,2:]

#### Linear regression

In [25]:
# lasso, hyperparameter is alpha
# param = [{'alpha': np.logspace(-2,2,num=5)}]
# lr = Ridge(random_state=42, max_iter=2000)
# gcv = GridSearchCV(lr, param, cv=kf, scoring='neg_mean_squared_log_error', n_jobs=2, verbose=1)
# gcv.fit(X_train, y_train)
# lr = gcv.best_estimator_

# print("Best Parameter = {0}".format(gcv.best_params_))
# print("Best Score = {0}".format(-gcv.best_score))


#### Random Forest

In [26]:
from sklearn.ensemble import RandomForestRegressor
param = [

        {
              "n_estimators": np.array([200, 500, 1000]),
              "max_depth": np.array([6,7,8,9]),
              "max_features": ['sqrt','auto'],
              "min_samples_leaf": np.array([2,5,10])
        }

]
rf = RandomForestRegressor(random_state=42)
gcv = GridSearchCV(rf, param, cv=kf, scoring='neg_mean_squared_log_error', n_jobs=1, verbose=10)
gcv.fit(X_train, y_train)
rf = gcv.best_estimator_
s = pickle.dumps(rf)
print("Best Parameter = {0}".format(gcv.best_params_))
print("Best Score = {0}".format(-gcv.best_score_))

Fitting 5 folds for each of 72 candidates, totalling 360 fits
[CV] max_depth=6, max_features=sqrt, min_samples_leaf=2, n_estimators=200 
[CV]  max_depth=6, max_features=sqrt, min_samples_leaf=2, n_estimators=200, score=-3.55569988258224, total=   2.0s
[CV] max_depth=6, max_features=sqrt, min_samples_leaf=2, n_estimators=200 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    2.2s remaining:    0.0s


[CV]  max_depth=6, max_features=sqrt, min_samples_leaf=2, n_estimators=200, score=-4.13946002418476, total=   2.0s
[CV] max_depth=6, max_features=sqrt, min_samples_leaf=2, n_estimators=200 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    4.5s remaining:    0.0s


[CV]  max_depth=6, max_features=sqrt, min_samples_leaf=2, n_estimators=200, score=-3.80397347884679, total=   2.0s
[CV] max_depth=6, max_features=sqrt, min_samples_leaf=2, n_estimators=200 


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    6.8s remaining:    0.0s


[CV]  max_depth=6, max_features=sqrt, min_samples_leaf=2, n_estimators=200, score=-3.9391922871589706, total=   2.0s
[CV] max_depth=6, max_features=sqrt, min_samples_leaf=2, n_estimators=200 


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    9.1s remaining:    0.0s


[CV]  max_depth=6, max_features=sqrt, min_samples_leaf=2, n_estimators=200, score=-4.341737426426927, total=   1.9s
[CV] max_depth=6, max_features=sqrt, min_samples_leaf=2, n_estimators=500 


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   11.3s remaining:    0.0s


[CV]  max_depth=6, max_features=sqrt, min_samples_leaf=2, n_estimators=500, score=-3.5666746234636104, total=   4.7s
[CV] max_depth=6, max_features=sqrt, min_samples_leaf=2, n_estimators=500 


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:   16.3s remaining:    0.0s


[CV]  max_depth=6, max_features=sqrt, min_samples_leaf=2, n_estimators=500, score=-4.145051922579719, total=   4.6s
[CV] max_depth=6, max_features=sqrt, min_samples_leaf=2, n_estimators=500 


[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:   21.3s remaining:    0.0s


[CV]  max_depth=6, max_features=sqrt, min_samples_leaf=2, n_estimators=500, score=-3.812382425512328, total=   4.7s
[CV] max_depth=6, max_features=sqrt, min_samples_leaf=2, n_estimators=500 


[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:   26.3s remaining:    0.0s


[CV]  max_depth=6, max_features=sqrt, min_samples_leaf=2, n_estimators=500, score=-3.944007698787234, total=   4.8s
[CV] max_depth=6, max_features=sqrt, min_samples_leaf=2, n_estimators=500 


[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:   31.5s remaining:    0.0s


[CV]  max_depth=6, max_features=sqrt, min_samples_leaf=2, n_estimators=500, score=-4.338407709672552, total=   4.6s
[CV] max_depth=6, max_features=sqrt, min_samples_leaf=2, n_estimators=1000 
[CV]  max_depth=6, max_features=sqrt, min_samples_leaf=2, n_estimators=1000, score=-3.5684509540954115, total=   9.5s
[CV] max_depth=6, max_features=sqrt, min_samples_leaf=2, n_estimators=1000 
[CV]  max_depth=6, max_features=sqrt, min_samples_leaf=2, n_estimators=1000, score=-4.145851353660196, total=   9.7s
[CV] max_depth=6, max_features=sqrt, min_samples_leaf=2, n_estimators=1000 
[CV]  max_depth=6, max_features=sqrt, min_samples_leaf=2, n_estimators=1000, score=-3.8076945887695453, total=   9.2s
[CV] max_depth=6, max_features=sqrt, min_samples_leaf=2, n_estimators=1000 
[CV]  max_depth=6, max_features=sqrt, min_samples_leaf=2, n_estimators=1000, score=-3.941750631240779, total=   9.3s
[CV] max_depth=6, max_features=sqrt, min_samples_leaf=2, n_estimators=1000 
[CV]  max_depth=6, max_features=sq

[CV]  max_depth=6, max_features=auto, min_samples_leaf=2, n_estimators=500, score=-3.201816613750017, total= 4.5min
[CV] max_depth=6, max_features=auto, min_samples_leaf=2, n_estimators=500 
[CV]  max_depth=6, max_features=auto, min_samples_leaf=2, n_estimators=500, score=-3.290635904561032, total= 4.5min
[CV] max_depth=6, max_features=auto, min_samples_leaf=2, n_estimators=500 
[CV]  max_depth=6, max_features=auto, min_samples_leaf=2, n_estimators=500, score=-3.6068119075262635, total= 4.5min
[CV] max_depth=6, max_features=auto, min_samples_leaf=2, n_estimators=1000 
[CV]  max_depth=6, max_features=auto, min_samples_leaf=2, n_estimators=1000, score=-2.9681056191608675, total= 9.0min
[CV] max_depth=6, max_features=auto, min_samples_leaf=2, n_estimators=1000 


KeyboardInterrupt: 

In [27]:
rf_baseline = RandomForestRegressor()
scores = cross_val_score(rf_baseline, X_train, y_train, cv=kf, scoring='neg_mean_squared_log_error', n_jobs=1, verbose=10)
print("Baseline RF score = {0}".format(-np.mean(scores)))

[CV]  ................................................................
[CV] ....................... , score=-2.381260406766779, total=  39.7s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   39.7s remaining:    0.0s


[CV] ...................... , score=-2.7583640104349603, total=  38.3s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  1.3min remaining:    0.0s


[CV] ....................... , score=-2.472066644481669, total=  40.6s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  2.0min remaining:    0.0s


[CV] ...................... , score=-2.5024418172188154, total=  41.1s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  2.7min remaining:    0.0s


[CV] ....................... , score=-2.769244082715981, total=  36.9s
Baseline RF score = 2.576675392323641


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  3.3min remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  3.3min finished


In [29]:
rf_baseline.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

#### Submission Test
See if the simulated data matter

In [37]:
test_pred = rf_baseline.predict(X_test_real)
simulated_pred_rf = rf_baseline.predict(X_test_simulated)
simulated_pred_sample_mean = np.mean(y_train)

In [40]:
df_test_submission = pd.DataFrame({'ID':test_new.ID, 'Target':np.clip(test_pred, np.min(y_train), np.max(y_train))})
df_simulated_rf_submission = pd.DataFrame({'ID':simulated.ID, 'Target':np.clip(simulated_pred_rf, np.min(y_train), np.max(y_train))})
df_simulated_sm_submission = pd.DataFrame({'ID':simulated.ID, 'Target':simulated_pred_sample_mean})


submission_1 = pd.concat([df_test_submission, df_simulated_rf_submission], axis=0)
print(submission_1.shape)
submission_2 = pd.concat([df_test_submission, df_simulated_rf_submission], axis=0)
print(submission_2.shape)
submission_1.to_csv('Submission_all_rf.csv', index=False)
submission_2.to_csv('Submission_partial_rf.csv', index=False)

(49342, 2)
(49342, 2)
