# Background
According to Epsilon research, 80% of customers are more likely to do business with you if you provide personalized service. Banking is no exception.

The digitalization of everyday life means that customers expect services to be delivered in a personalized and timely manner… and often before they've even realized they need the service. In their 3rd Kaggle competition, Santander Group aims to go a step beyond recognizing that there is a need to provide a customer a financial service and intends to determine the amount or value of the customer's transaction. This means anticipating customer needs in a more concrete, but also simple and personal way. With so many choices for financial services, this need is greater now than ever before.

In this competition, Santander Group is asking Kagglers to help them identify the value of transactions for each potential customer. This is a first step that Santander needs to nail in order to personalize their services at scale.

# Load Libraries & Load Data
## Load Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb
# import catboost as cbt
import lightgbm as lgb
%matplotlib inline

## Load Data

In [None]:
train = pd.read_csv('../input/train.csv')
test =  pd.read_csv('../input/test.csv')

# Preliminary Data Analysis
## Analysis of Train
1. There are 1845 float features, 3147 integer features, and 1 object feature
2. There are no null values in the train data set
3. 256 columns contain only a single value, which makes them useless for many models since they carry no information.
4. 9 columns provide repeated information

In [None]:
train.head()

- There are 1845 float features, 3147 integer features, and 1 object feature

In [None]:
train.info()

- There are no null values in the train data set

In [None]:
null_num =  train.isnull().sum().sum()
print('There are {} null elements in train data'.format(null_num))

- 256 columns contain only a single value, which makes them useless for many models since they carry no information, so we delete these columns.

Many people merge the train and test data before deleting columns with only one value, but I think doing this on the train data alone is fine, since the model is trained only on the training data.

In [None]:
cols_with_onlyone_val = train.columns[train.nunique() == 1]
cols_with_onlyone_val

In [None]:
len(cols_with_onlyone_val)

In [None]:
cols = [x for x in train.columns if x not in cols_with_onlyone_val]
train_clean = train[cols].copy()

- Drop the duplicate features, keeping only one copy of each.

Here I speed up this process in two ways. The first is a heuristic: we compute only the summary statistics of each column and treat two columns as identical if their summary statistics match. The second is to use NumPy operations for the comparison.



In [None]:
train_clean_info = train_clean.describe()
train_clean_info

In [None]:
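columns = train_clean_info.columns
cols_del = []  # duplicated columns to delete
dup_dict = {}  # kept column -> columns whose summary statistics match it

for i in range(len(columns)-1):
    if columns[i] in cols_del:
        continue  # already marked as a duplicate of an earlier column
    if i % 10 == 0:
        print(i / len(columns))  # progress as a fraction of the columns processed
    first = train_clean_info[columns[i]].values  # summary statistics of column i
    # Subtract column i's statistics from every later column in one vectorized step
    res = train_clean_info.iloc[:,i+1:] - np.tile([first],[train_clean_info.shape[1]-i -1,1]).T
    # A later column whose differences sum to zero is treated as a duplicate of column i
    is_dup = res.sum() == 0
    cols_del.extend(res.columns[is_dup])
    if is_dup.sum() > 0:
        dup_dict[columns[i]] = res.columns[is_dup]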
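Because the comparison above relies only on summary statistics, two columns can match without being identical row by row. The following is a minimal sketch of an exact follow-up check on the candidate pairs. The helper name `confirm_duplicates` and the toy frame are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def confirm_duplicates(df, candidates):
    """Keep only the candidate duplicates that match the kept column exactly, row by row.

    `candidates` maps a kept column name to an iterable of suspected duplicate columns
    (hypothetical helper, shown for illustration only).
    """
    confirmed = {}
    for kept, suspected in candidates.items():
        exact = [c for c in suspected
                 if np.array_equal(df[kept].values, df[c].values)]
        if exact:
            confirmed[kept] = exact
    return confirmed

# Toy data: 'b' is a true duplicate of 'a'; 'c' has the same summary
# statistics as 'a' but a different row order, so it is not a duplicate.
toy = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [1, 2, 3, 4], 'c': [4, 3, 2, 1]})
toy_candidates = {'a': ['b', 'c']}
confirmed = confirm_duplicates(toy, toy_candidates)
print(confirmed)
```

In the notebook, the same check could be run as `confirm_duplicates(train_clean, dup_dict)` before dropping anything.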

- 9 columns provide repeated information, so we can drop them.

In [None]:
cols_del

In [None]:
dup_dict

Let's take a look! Displaying three of the pairs confirms that these are the duplicates we wanted to find.

In [None]:
train_clean[['168b3e5bc','f8d75792f','34ceb0081','d60ddde1b','70f3a87ec','66f57f2e5']]

In [None]:
cols = [x for x in train_clean.columns if x not in cols_del]
train_clean = train_clean[cols].copy()

## Analysis of Test
1. There are 4991 float features
2. There is no null element in test data set
3. No columns with only one value
4. No columns are duplicated

In [None]:
test.head()

- There are 4991 float features, which is interesting, since the train data set has many integer features.

In [None]:
test.info()

The columns we already analyzed in the train data need no extra work here, so we simply keep the same columns in the test data.

In [None]:
cols = [x for x in train_clean.columns if x!='target']
test_clean = test[cols].copy()

- Nice work, no columns with only one value here!

In [None]:
cols_with_onlyone_val = test_clean.columns[test_clean.nunique() == 1]
cols_with_onlyone_val

- Nice work, no columns are duplicated!

In [None]:
test_clean_info = test_clean.describe()
test_clean_info

In [None]:
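columns = test_clean_info.columns
cols_del = []  # duplicated columns to delete
dup_dict = {}  # kept column -> columns whose summary statistics match it

for i in range(len(columns)-1):
    if columns[i] in cols_del:
        continue  # already marked as a duplicate of an earlier column
    if i % 10 == 0:
        print(i / len(columns))  # progress as a fraction of the columns processed
    first = test_clean_info[columns[i]].values  # summary statistics of column i
    # Subtract column i's statistics from every later column in one vectorized step
    res = test_clean_info.iloc[:,i+1:] - np.tile([first],[test_clean_info.shape[1]-i -1,1]).T
    # A later column whose differences sum to zero is treated as a duplicate of column i
    is_dup = res.sum() == 0
    cols_del.extend(res.columns[is_dup])
    if is_dup.sum() > 0:
        dup_dict[columns[i]] = res.columns[is_dup]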

In [None]:
cols_del

## Label
Regression problems allow many tricks, but before using any of them we need to take a look at our label.
- There are unusual values
- All labels are bigger than 0
- Our target is skewed

In [None]:
train['target']

In [None]:
plt.scatter(x = range(train.shape[0]), y = train['target'].values)

In [None]:
train['target'].plot()

- Our target is skewed!

In [None]:
plt.figure(figsize=(12,8))
sns.distplot(train_clean["target"].values, bins=50, kde=False)
plt.xlabel('Target', fontsize=12)
plt.title("Target Histogram", fontsize=14) 

- There are unusual values, by which I mean values with (value - mean) / std > 2

In [None]:
train['target'].describe()

In [None]:
(train['target'].describe().loc['max'] - train['target'].describe().loc['mean']) / train['target'].describe().loc['std']
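The cell above only measures how far the maximum lies from the mean. As a rough sketch, the same rule, (value - mean) / std > 2, can be applied to every label to count how many are unusual. The helper name `count_unusual` and the toy series are hypothetical.

```python
import pandas as pd

def count_unusual(values, threshold=2.0):
    """Count values whose z-score, (value - mean) / std, exceeds `threshold`.

    Uses the sample standard deviation, like `Series.describe()`.
    """
    values = pd.Series(values, dtype='float64')
    z = (values - values.mean()) / values.std()
    return int((z > threshold).sum())

# Toy labels: nine small values and one very large one.
toy_target = pd.Series([1.0] * 9 + [100.0])
print(count_unusual(toy_target))
```

On the real data this would be called as `count_unusual(train['target'])`.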

# Baseline Model
Of course, there is still much more we could do, but since the features in this competition are anonymized, we stop after this simple preprocessing and build a baseline to see how far we have to go.

## Label transformation
To get a good score, it helps to train with a loss function that is easy to optimize, such as MSE, so here we first transform the label with log1p.

In [None]:
train_clean["target"] = train_clean["target"].apply(np.log1p)
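The reason this transformation pairs well with an RMSE objective can be checked numerically: RMSE computed on log1p-transformed values equals RMSLE computed on the original values, and expm1 undoes the transformation. The sketch below assumes the leaderboard metric is RMSLE, and the arrays are made-up toy numbers rather than competition data.

```python
import numpy as np

def rmse(a, b):
    """Root mean squared error between two arrays."""
    return np.sqrt(np.mean((a - b) ** 2))

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error on the original scale."""
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Toy values on the original target scale (illustrative only).
y_true = np.array([40000.0, 600000.0, 10000000.0])
y_pred = np.array([50000.0, 500000.0, 8000000.0])

# Work in log space, as the notebook does for training.
log_true = np.log1p(y_true)
log_pred = np.log1p(y_pred)

print(rmse(log_true, log_pred), rmsle(y_true, y_pred))  # the two numbers agree
print(np.expm1(log_true))  # expm1 recovers the original values
```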

## Validation

Here, to reduce randomness, we use three random seeds and check whether the validation results are similar.

In [None]:
from sklearn import model_selection
from sklearn.model_selection import train_test_split

In [None]:
def run_lgb_val(train_X, train_y, val_X, val_y):
    params = {
        "objective" : "regression",
        "metric" : "rmse",
        "num_leaves" : 64,
        "learning_rate" : 0.005,
        "bagging_fraction" : 0.85,
        "feature_fraction" : 0.85,
        "bagging_freq" : 5,  # re-sample the bagging subset every 5 iterations
        "bagging_seed" : 100,
        "verbosity" : -1,
        "seed": 921212
    }

    lgtrain = lgb.Dataset(train_X, label=train_y)
    lgval = lgb.Dataset(val_X, label=val_y)
    evals_result = {}
    # Train for up to 5000 rounds, stopping early if the validation RMSE stalls for 100 rounds
    model = lgb.train(params, lgtrain, 5000,
                      valid_sets=[lgval],
                      early_stopping_rounds=100,
                      verbose_eval=50,
                      evals_result=evals_result)
    

In [None]:
feature_columns = [x for x in train_clean.columns if x not in ['ID','target']]
for rnd in range(3):
    print('*' * 50)
    print(rnd)
    print('*' * 50)
    train_X, val_X, train_y, val_y = train_test_split(train_clean[feature_columns], train_clean['target'], test_size = 0.3, random_state = rnd)
    run_lgb_val(train_X, train_y, val_X, val_y) 

## Submit (the final result is 1.47 online, which is not a huge gap between offline and online)

For robustness, we use a simple ensemble, averaging the predictions of the three models, for the submission, to see whether the online score is similar to our offline result.


In [None]:
def run_lgb_test(train_X, train_y, val_X, val_y):
    params = {
        "objective" : "regression",
        "metric" : "rmse",
        "num_leaves" : 64,
        "learning_rate" : 0.005,
        "bagging_fraction" : 0.85,
        "feature_fraction" : 0.85,
        "bagging_freq" : 5,  # re-sample the bagging subset every 5 iterations
        "bagging_seed" : 100,
        "verbosity" : -1,
        "seed": 921212
    }

    lgtrain = lgb.Dataset(train_X, label=train_y)
    lgval = lgb.Dataset(val_X, label=val_y)
    evals_result = {}
    # Train for up to 5000 rounds, stopping early if the validation RMSE stalls for 100 rounds
    model = lgb.train(params, lgtrain, 5000,
                      valid_sets=[lgval],
                      early_stopping_rounds=100,
                      verbose_eval=50,
                      evals_result=evals_result)
    return model

In [None]:
feature_columns = [x for x in train_clean.columns if x not in ['ID','target']]
res = []
for rnd in range(3):
    print('*' * 50)
    print(rnd)
    print('*' * 50)
    train_X, val_X, train_y, val_y = train_test_split(train_clean[feature_columns], train_clean['target'], test_size = 0.3, random_state = rnd)
    model = run_lgb_test(train_X, train_y, val_X, val_y) 
    pred = model.predict(test_clean[feature_columns])
    res.append(pred) 

In [None]:
test['target'] = np.expm1(np.mean(res,axis=0))
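Note that the predictions are averaged in log space before `np.expm1` is applied, which is not the same as averaging on the original scale. A small sketch with made-up numbers (not model output) shows the difference: the log-space blend is the geometric mean of (1 + prediction) minus 1, which is pulled less toward large predictions than the arithmetic mean.

```python
import numpy as np

# Toy predictions from two hypothetical models, on the original target scale.
pred_a = np.array([100.0, 1000.0])
pred_b = np.array([400.0, 1000.0])

# Blend in log space, as the notebook does, then map back with expm1.
log_preds = [np.log1p(pred_a), np.log1p(pred_b)]
log_blend = np.expm1(np.mean(log_preds, axis=0))

# Blend directly on the original scale for comparison.
arith_blend = np.mean([pred_a, pred_b], axis=0)

print(log_blend)    # geometric-style average
print(arith_blend)  # arithmetic average
```

Since the models are trained on the log1p-transformed target, averaging in log space keeps the blend on the same scale the loss was optimized on.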

In [None]:
test[['ID','target']].head(100)

In [None]:
test[['ID','target']].to_csv('Baseline.csv',index = False)