# Santander Value Prediction

# Context

According to Epsilon research, 80% of customers are more likely to do business with you if you provide personalized service. Banking is no exception.

The digitalization of everyday life means that customers expect services to be delivered in a personalized and timely manner, often before they've even realized they need the service.

# Problem

Santander Group aims to go a step beyond recognizing that a customer needs a financial service: it intends to determine the amount or value of the customer's transaction. This means anticipating customer needs in a more concrete, but also simple and personal way. With so many choices for financial services, this need is greater now than ever before.

# Objective

Identify the value of transactions for each potential customer. This is a first step that Santander needs to nail in order to personalize their services at scale.

# Technical Overview

## Dataset

We are provided with an anonymized dataset containing numeric feature variables, a numeric target column, and a string ID column. The dataset is composed of:

- the training set, in the file train.csv
- the testing set, in the file test.csv (without the target column)

## Evaluation Metric

The evaluation metric for this competition is the Root Mean Squared Logarithmic Error (RMSLE). It is calculated as

![RMSLE](rmse.svg)

where

- $\epsilon$ is the RMSLE value (score),
- $n$ is the total number of observations in the (public/private) data set,
- $p_i$ is your prediction of the target for observation $i$,
- $a_i$ is the actual target for observation $i$, and
- $\log(x)$ is the natural logarithm of $x$.
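
For readers who cannot see the image, the formula can be written out from the symbol definitions above. This is the standard RMSLE definition, reconstructed here rather than copied from the image:

$$
\epsilon = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \log(p_i + 1) - \log(a_i + 1) \right)^2}
$$

Adding $1$ inside each logarithm keeps the expression defined when a prediction or a target equals $0$.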

# Objective

We have to predict the value of the target column in the test set so that $\epsilon$ is as close to $0$ as possible. For every row in test.csv, the submission file has to contain two columns: ID and target. The ID is the value of the ID column for that row in test.csv.

# Data Acquisition

The objective is to get the train and test data sets and extract basic information before starting the exploratory data analysis phase.

## Data Source

The data come from Kaggle and can be downloaded here: https://www.kaggle.com/c/santander-value-prediction-challenge

# Dataset Basic Information

We need to know how much data we have in each data set, which helps us determine a list of algorithms better suited to the problem:

- Number of rows
- Number of columns
- Percentage: number of rows in a data set / total number of rows in the train and test sets × 100
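
The three quantities listed above could be computed along these lines. This is a minimal sketch, not part of the original notebook: the helper name `basic_info` and the toy frames are hypothetical stand-ins for the real `train` and `test` DataFrames.

```python
import pandas

def basic_info(datasets):
    """Return rows, columns and share of total rows for each named DataFrame."""
    total_rows = sum(len(df) for df in datasets.values())
    summary = pandas.DataFrame({
        "rows": {name: df.shape[0] for name, df in datasets.items()},
        "columns": {name: df.shape[1] for name, df in datasets.items()},
    })
    # share of each data set in the combined number of rows, in percent
    summary["percentage"] = summary["rows"] / total_rows * 100
    return summary

# Tiny made-up frames; in the notebook these would be the DataFrames
# read from train.csv and test.csv.
toy_train = pandas.DataFrame({"ID": ["a", "b"], "target": [1.0, 2.0], "f1": [0, 3]})
toy_test = pandas.DataFrame({"ID": ["c", "d", "e", "f", "g", "h"], "f1": [0, 1, 2, 3, 4, 5]})

info = basic_info({"train": toy_train, "test": toy_test})
print(info)
```

With the real data, the call would presumably be `basic_info({"train": train, "test": test})` once both files have been read.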

In [11]:
import pandas
import numpy as np
import matplotlib.pyplot as plt


# read the training and testing sets
train = pandas.read_csv("train.csv")
test = pandas.read_csv("test.csv")


As the `info()` output below shows, the training set has slightly more features (4,991) than observations (4,459). In such a case, we can use:

- Dimension reduction algorithms like PCA (Principal Component Analysis), which combine correlated features into a smaller number of components.

- Regularized models like the SVM (Support Vector Machine), Ridge, Lasso or Elastic Net, whose regularization parameter makes them more resistant to over-fitting. That parameter has to be chosen carefully, for example by cross-validation.
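
As an illustration of the two ideas above, the sketch below chains PCA with a ridge regression whose regularization strength is picked by cross-validation. It is not the approach used later in this notebook. The data is synthetic, and the number of components and the candidate alphas are arbitrary assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with more features than observations
# (made-up values, unrelated to the competition data).
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(60, 200))
y_demo = X_demo[:, :5].sum(axis=1) + rng.normal(scale=0.1, size=60)

# Scale the features, compress 200 of them into 20 principal components,
# then fit a ridge regression that selects alpha by cross-validation.
candidate_alphas = np.logspace(-2, 2, 5)
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=20, random_state=0),
    RidgeCV(alphas=candidate_alphas),
)
pipeline.fit(X_demo, y_demo)

# Inspect the reduced feature matrix and the selected regularization strength.
scaled = pipeline.named_steps["standardscaler"].transform(X_demo)
reduced = pipeline.named_steps["pca"].transform(scaled)
chosen_alpha = pipeline.named_steps["ridgecv"].alpha_
print(reduced.shape, chosen_alpha)
```

Letting `RidgeCV` choose among several alphas is one way to make the choice of the regularization parameter less arbitrary.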

In [12]:
# info() prints its summary directly and returns None, so print() is not needed
train.info()
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4459 entries, 0 to 4458
Columns: 4993 entries, ID to 9fc776466
dtypes: float64(1845), int64(3147), object(1)
memory usage: 169.9+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49342 entries, 0 to 49341
Columns: 4992 entries, ID to 9fc776466
dtypes: float64(4991), object(1)
memory usage: 1.8+ GB


In [13]:
train.head()

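|  | ID | target | 48df886f9 | 0deb4b6a8 | 34b15f335 | a8cb14b00 | 2f0771a37 | 30347e683 | d08d1fbe3 | 6ee66e115 | ... | 3ecc09859 | 9281abeea | 8675bec0b | 3a13ed79a | f677d4d13 | 71b203550 | 137efaa80 | fb36b89d9 | 7e293fbaf | 9fc776466 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 000d6aaf2 | 38000000.0 | 0.0 | 0 | 0.0 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 000fbd867 | 600000.0 | 0.0 | 0 | 0.0 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0027d6b71 | 10000000.0 | 0.0 | 0 | 0.0 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0028cbf45 | 2000000.0 | 0.0 | 0 | 0.0 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 002a68644 | 14400000.0 | 0.0 | 0 | 0.0 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |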


In [14]:
# features: every column except the row identifier and the target
X = train.drop(['ID', 'target'], axis=1)
y = train['target']

# the test set has no target column, so only the ID is dropped
X_test = test.drop('ID', axis=1)

In [15]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

In [16]:
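# apply log(x + 1) to the features and the target; np.log1p is the numerically stable form
# training on log(target + 1) makes squared error on this scale match the RMSLE metric
X = np.log1p(X)
y = np.log1p(y)
X_test = np.log1p(X_test)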

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=1)
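
A likely reason for training on log(x + 1) of the target (an interpretation, since the notebook does not state it) is that a plain RMSE on that scale is the same quantity as the RMSLE on the original scale. The sketch below checks this on a few made-up values and shows that `np.expm1` undoes the transform.

```python
import numpy as np

# Made-up targets and predictions on the original scale (illustration only).
actual = np.array([600000.0, 2000000.0, 38000000.0])
predicted = np.array([500000.0, 2500000.0, 30000000.0])

# RMSLE computed directly from its definition on the original scale.
rmsle_direct = np.sqrt(np.mean((np.log1p(predicted) - np.log1p(actual)) ** 2))

# Plain RMSE computed on the log(x + 1) transformed values.
log_actual = np.log1p(actual)
log_predicted = np.log1p(predicted)
rmse_in_log_space = np.sqrt(np.mean((log_predicted - log_actual) ** 2))

# exp(x) - 1 maps transformed values back to the original scale.
round_trip = np.expm1(log_actual)

print(rmsle_direct, rmse_in_log_space)
```

This is also why predictions made on the transformed scale have to be mapped back with exp(x) - 1 before they are scored or submitted.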

In [17]:
model = RandomForestRegressor(n_estimators=1000, random_state=42, verbose=1)
model.fit(X_train,y_train)

[Parallel(n_jobs=1)]: Done 1000 out of 1000 | elapsed: 59.6min finished


RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=1000, n_jobs=1,
           oob_score=False, random_state=42, verbose=1, warm_start=False)
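
The log above shows that fitting 1000 trees with `n_jobs=1` took about an hour. One possible variation, assuming a machine with several CPU cores, is to let scikit-learn build the trees in parallel with `n_jobs=-1`. The sketch below uses small synthetic data and fewer trees so it runs quickly. It was not run on the competition data, so no timing is claimed.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small synthetic regression problem (made-up data for illustration).
rng = np.random.RandomState(0)
X_small = rng.uniform(size=(200, 30))
y_small = X_small[:, 0] * 3.0 + rng.normal(scale=0.1, size=200)

# Same estimator as above, but with fewer trees for a quick run and
# n_jobs=-1 so that the trees are built on all available CPU cores.
parallel_model = RandomForestRegressor(n_estimators=50, random_state=42, n_jobs=-1)
parallel_model.fit(X_small, y_small)

small_predictions = parallel_model.predict(X_small)
print(small_predictions[:3])
```

With a fixed `random_state`, the fitted forest should not depend on the number of jobs, so only the training time is expected to change.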

In [18]:
predictions = model.predict(X_valid)

def rmsle(real, predicted):
    """Root Mean Squared Logarithmic Error of predictions against real values."""
    real = np.asarray(real, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # pairs containing a negative value are skipped, as log(x + 1) may be undefined for them
    valid = (real >= 0) & (predicted >= 0)
    squared_log_errors = (np.log1p(predicted[valid]) - np.log1p(real[valid])) ** 2
    return np.sqrt(squared_log_errors.sum() / len(predicted))

# the model was trained on log(x + 1) values, so map both back to the original scale
predictions = np.expm1(predictions)
real = np.expm1(y_valid.values)

print(rmsle(real, predictions))

[Parallel(n_jobs=1)]: Done 1000 out of 1000 | elapsed:    0.5s finished


1.387853008115715


In [21]:

# build the submission file: one row per test ID with its predicted target
prediction = pandas.DataFrame()
prediction['ID'] = test['ID']
# the model predicts log(target + 1), so invert the transform with exp(x) - 1
prediction['target'] = np.expm1(model.predict(X_test))

prediction.to_csv('result.csv', index=False)

[Parallel(n_jobs=1)]: Done 1000 out of 1000 | elapsed:   51.3s finished
