# <a id="1">Introduction</a>

Santander Group aims to go a step beyond recognizing that there is a need to provide a customer a financial service and intends to **determine the amount or value of the customer's transaction**. This means anticipating customer needs in a more concrete, but also simple and personal way. With so many choices for financial services, this need is greater now than ever before.

In this project, Santander Group asks us to identify the value of transactions for each potential customer. This is a first step that Santander needs to nail in order to personalize its services at scale.

# <a id="2">Load packages</a>

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import warnings
warnings.filterwarnings('ignore')

import matplotlib
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline


import os
print(os.listdir("../input"))
from math import log1p
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, train_test_split
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso

# <a id="3">Load Data</a>

In [None]:
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')

# <a>Glimpse Of Data</a>

In [None]:
train.head()

In [None]:
test.head()

In [None]:
train.info()

In [None]:
test.info()

# <a>Checking Missing Data</a>

In [None]:
train.isnull().sum().sum()

In [None]:
test.isnull().sum().sum()

The data has no missing values, which is convenient: imputing missing values always introduces some bias, and it would be hard to do well here because most entries are 0. Had there been any NaN values, replacing them with 0 would have been the most natural choice.

# <a>Supervised Regression</a>

The project demands an accurate supervised regressor. We'll try several regressors, including ensemble regressors, and recommend the best one.

In [None]:
columnsList = [ x for x in train.columns if x not in ['ID','target']]
features=train[columnsList]
targets=train['target']

# <a> Removing Column with Constant Entries </a>
*Deleting all columns whose entries are all zero. A regressor cannot learn anything from such columns, so dropping them does not change the fit, but it makes the data cleaner.*

In [None]:
# Only scan the feature columns, so the string 'ID' column is never summed
to_remove_cols=[ x for x in columnsList if train[x].sum()==0]
columnsList = [ x for x in columnsList if x not in to_remove_cols]
print(len(columnsList))
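
The filter above drops columns whose sum is zero, which for non-negative data means all-zero columns. A slightly more general check, sketched here on a small made-up frame, flags any column that holds a single distinct value. The `toy` data and the `constant_columns` helper are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

def constant_columns(df):
    """Return the names of columns that hold a single distinct value."""
    # nunique() counts the distinct non-null values in each column
    counts = df.nunique()
    return counts[counts <= 1].index.tolist()

# Toy frame (made up): 'b' is all zeros, 'c' is constant but non-zero
toy = pd.DataFrame({"a": [1, 0, 3], "b": [0, 0, 0], "c": [7, 7, 7]})
constant = constant_columns(toy)
print(constant)
```

A sum-based filter would keep column `c` here, while the distinct-value check drops it as well.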

# <a>Removing Duplicate Columns</a>
*Duplicate columns carry no extra information for regression; they only add memory use and training time.*

In [None]:
train=train.T.drop_duplicates().T
columnsList = [ x for x in columnsList if x in train.columns]
print(len(columnsList))
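
Transposing a frame that mixes strings and numbers, as the round trip above does, can leave every column with `object` dtype. One possible alternative, sketched on made-up data, uses the transpose only to build a boolean mask and then selects columns from the original frame, so the original dtypes survive. The helper name `drop_duplicate_columns` is hypothetical.

```python
import pandas as pd

def drop_duplicate_columns(df):
    """Keep the first of each group of identical columns, preserving dtypes."""
    # duplicated() on the transpose marks columns that repeat an earlier column
    is_duplicate = df.T.duplicated().to_numpy()
    return df.loc[:, ~is_duplicate]

# Toy frame (made up): 'b' is an exact copy of 'a'
toy = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3], "c": [0.5, 0.0, 2.0]})
deduped = drop_duplicate_columns(toy)
print(deduped.columns.tolist())
print(deduped.dtypes.tolist())
```

This still builds a transposed copy for the comparison, so it does not reduce peak memory; it only avoids the dtype change.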

# <a> Evaluation Metric </a>
The evaluation metric for this competition is Root Mean Squared Logarithmic Error (RMSLE).

Lower RMSLE is better.

We pass log1p-transformed targets and predictions to the function below, so it only needs to compute the plain root mean squared error of its inputs.

In [None]:
def RMSLE_value(test,pred):
    return np.sqrt(np.mean(np.power(test-pred, 2)))
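
To make the relationship explicit, the following minimal sketch checks on made-up numbers that RMSLE computed on raw values equals plain RMSE computed on log1p-transformed values. The helper names `rmsle` and `rmse` are illustrative and are not used elsewhere in the notebook.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error on raw (untransformed) values."""
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

def rmse(y_true, y_pred):
    """Plain root mean squared error."""
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

# Made-up values for illustration
y_true = np.array([100.0, 2000.0, 35000.0])
y_pred = np.array([120.0, 1800.0, 40000.0])

# RMSLE on raw values matches RMSE on log1p-transformed values
same = np.isclose(rmsle(y_true, y_pred),
                  rmse(np.log1p(y_true), np.log1p(y_pred)))
print(same)
```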

# <a>Support Vector Regression</a>

In [None]:
def Support_Vector_Regression(features_train,features_test,targets_train,targets_test,kernel='rbf',gamma='auto',epsilon=0.1,c=1):
    reg=SVR(kernel=kernel,gamma=gamma,C=c,epsilon=epsilon)
    reg.fit(features_train,targets_train)
    pred=reg.predict(features_test)
    results=RMSLE_value(targets_test,pred)
    print('RMSLE_value from Support Vector Regression is',results)
    return reg
        

# <a> Linear Regression </a>

In [None]:
def Linear_Regression(features_train,features_test,targets_train,targets_test):
    reg=LinearRegression()
    reg.fit(features_train,targets_train)
    pred=reg.predict(features_test)
    result=RMSLE_value(targets_test,pred)
    print("RMSLE_value from Linear Regression is ",result)
    return reg

# <a> Lasso Regression </a>

In [None]:
def Lasso_Regression(features_train,features_test,targets_train,targets_test):
    reg=Lasso()
    reg.fit(features_train,targets_train)
    pred=reg.predict(features_test)
    result=RMSLE_value(targets_test,pred)
    print("RMSLE_value from Lasso Regression is ",result)
    return reg

# <a> AdaBoosting using SVR </a>

In [None]:
def AdaBoost_Regression(features_train,features_test,targets_train,targets_test):
    reg=AdaBoostRegressor(SVR(),n_estimators=13)
    reg.fit(features_train,targets_train)
    pred=reg.predict(features_test)
    result=RMSLE_value(targets_test,pred)
    print("RMSLE_value from AdaBoosting using Support Vector Regressor is ",result)
    return reg

# <a> Random Forest Regression  </a>

In [None]:
def Random_Forest_Regression(features_train,features_test,targets_train,targets_test):
    reg=RandomForestRegressor(n_estimators=50,min_samples_split=40,max_depth=500)
    reg.fit(features_train,targets_train)
    pred=reg.predict(features_test)
    result=RMSLE_value(targets_test,pred)
    print("RMSLE_value from Random Forest Regression ",result)
    return reg

# <a>Train Test Split</a>
*K-fold cross-validation would be preferable, but the train and test sets are large and memory use is already high, so a single 80/20 split is used instead; not much overfitting is expected.*

In [None]:
features_train, features_test, targets_train, targets_test = train_test_split(train[columnsList],train['target'],test_size=0.2,random_state=42)
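
For reference, a K-fold evaluation could look roughly like the sketch below. It runs on small synthetic data rather than the competition files, and the helper `kfold_rmse`, the choice of `Lasso(alpha=0.01)`, and the number of folds are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

def kfold_rmse(model, X, y, n_splits=5, seed=42):
    """Return the RMSE of `model` on each held-out fold."""
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, valid_idx in folds.split(X):
        # Refit on the training part of the fold, score on the held-out part
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[valid_idx])
        scores.append(np.sqrt(np.mean((y[valid_idx] - pred) ** 2)))
    return np.array(scores)

# Synthetic stand-in data; in the notebook X and y would be the log1p-transformed arrays
rng = np.random.default_rng(0)
X = rng.random((100, 5))
y = X @ np.array([1.0, 0.0, 2.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=100)

scores = kfold_rmse(Lasso(alpha=0.01), X, y)
print(scores.mean(), scores.std())
```

The spread of the per-fold scores gives a rough idea of how much a single split's score can vary.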

In [None]:
features_train.head()

In [None]:
features_test.head()

In [None]:
targets_train.head()

In [None]:
targets_test.head()

# <a> Converting Training and Testing Data to array form </a>
The data is converted to NumPy arrays and transformed with log1p; feeding the converted data to the regressors gives better results.

In [None]:
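def Convert(data):
    # Convert a DataFrame/Series to a NumPy array of log1p-transformed values
    data=data.values
    # Cast to int (the transposed frame holds object dtype); this truncates fractions
    data=data.astype(int)
    data=np.log1p(data)
    return data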

In [None]:
features_train=Convert(features_train)
features_test=Convert(features_test)
targets_train=Convert(targets_train)
targets_test=Convert(targets_test)
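
As a quick illustration of the transform, the sketch below applies `np.log1p` to a few made-up values and then undoes it with `np.expm1`, which is the inverse that maps predictions made in log space back to the original scale.

```python
import numpy as np

# Made-up transaction values, including zero
values = np.array([0.0, 10.0, 2500.0, 40000000.0])

# log1p(x) = log(1 + x) is defined at zero and compresses large values
logged = np.log1p(values)

# expm1(x) = exp(x) - 1 undoes log1p
restored = np.expm1(logged)

print(logged)
print(np.allclose(restored, values))
```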


# <a> Performing Regressions </a>

In [None]:
Support_Vector_Regression(features_train, features_test, targets_train, targets_test)

In [None]:
Linear_Regression(features_train,features_test,targets_train,targets_test)

In [None]:
Lasso_Regression(features_train,features_test,targets_train,targets_test)

In [None]:
AdaBoost_Regression(features_train,features_test,targets_train,targets_test)

In [None]:
Random_Forest_Regression(features_train,features_test,targets_train,targets_test)

*Support Vector Regression gives an RMSLE of 1.651663589865719.*

*Linear Regression gives more than 843 million, which is far worse than every other model. Note: passing the raw data instead of the log1p-processed data to the regressors brings the RMSLE value down to 16, which is still poor but far more satisfactory.*

*Lasso Regression gives 1.6951226901280325, so the L1 penalty, which can shrink some coefficients to exactly zero and thereby drop features, helps a lot compared with plain Linear Regression. Note: with alpha=0.0 Lasso reduces to ordinary linear regression; the default alpha=1.0 is used here.*

*AdaBoost with SVR gives 1.6559682182238418. For a boosting regressor that is not very good, and it is slightly worse than plain SVR's score, which suggests overfitting.*

*Random Forest gives 1.4456394807462076, the lowest error of the five; it is an ensemble (meta) regressor that averages many decision trees.*
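
The note on Lasso's alpha can be illustrated with a small sketch on synthetic data, where only two of ten features actually influence the target. The data and the alpha values are made up for illustration and say nothing about which features Lasso keeps on the competition data. For alpha=0 itself, scikit-learn's documentation advises using `LinearRegression` rather than `Lasso`.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only the first 2 of 10 features influence the target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

nonzero_counts = {}
for alpha in (0.001, 0.1, 1.0):
    model = Lasso(alpha=alpha).fit(X, y)
    # Count coefficients that the L1 penalty did not shrink to exactly zero
    nonzero_counts[alpha] = int(np.sum(model.coef_ != 0))
    print(alpha, nonzero_counts[alpha])
```

Larger alpha values apply a stronger penalty, so fewer coefficients stay non-zero.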

We'll use the Lasso, Random Forest, and Support Vector regressors to check scores on the actual test set.

# <a>Processing Test Data</a>

In [None]:
features=Convert(train[columnsList])
targets=Convert(train['target'])
test_features=Convert(test[columnsList])

# <a>Calling Regressors to fit the entire data for submission</a>
We fit three regressors and submit each prediction, because the top three regressors score quite close to each other.

In [None]:
RFR = RandomForestRegressor(n_estimators=50,min_samples_split=40,max_depth=500).fit(features,targets)
predRFR=np.expm1(RFR.predict(test_features))

In [None]:
# Use a distinct name so the fitted model does not overwrite the SVR class
SVReg=SVR().fit(features,targets)
predSVR=np.expm1(SVReg.predict(test_features))

In [None]:
LR=Lasso().fit(features,targets)
predLR=np.expm1(LR.predict(test_features))

# <a> Submission </a>

In [None]:
sub_SVR = pd.DataFrame()
sub_SVR['ID'] = test['ID']
sub_SVR['target'] = predSVR
sub_SVR.to_csv('sub_SVR.csv', index=False)

In [None]:
sub_RFR = pd.DataFrame()
sub_RFR['ID'] = test['ID']
sub_RFR['target'] = predRFR
sub_RFR.to_csv('sub_RFR.csv', index=False)

In [None]:
sub_LR = pd.DataFrame()
sub_LR['ID'] = test['ID']
sub_LR['target'] = predLR
sub_LR.to_csv('sub_LR.csv', index=False)

Random Forest Regressor's score is 1.50

Support Vector Regressor's score is 1.77

Lasso Regressor's score is 1.84

**So, the Random Forest Regressor is the clear choice for this regression.**

# <a>Exploratory Data Analysis</a>

*For EDA, we start again by reading CSV files*

In [None]:
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')

#Removing Constant and Duplicate Columns, this time keeping 'target'
columnsList = [ x for x in train.columns if x not in ['ID','target']]
to_remove_cols=[ x for x in columnsList if train[x].sum()==0]
columnsList = [ x for x in columnsList if x not in to_remove_cols]
train_np=train.T.drop_duplicates().T
columnsList = [ x for x in columnsList if x in train_np.columns]

# <a> Defining Some Plot Functions </a>

In [None]:
def plot_distribution(df,title,color):
    plt.figure(figsize=(10,6))
    plt.title("Distribution of %s" % title)
    sns.distplot(df.dropna(),color=color, kde=True,bins=100)
    plt.show()   

In [None]:
def HeatMap(df,target,columnsList):
    df=df.astype(int)
    target=target.astype(int)
    labels = []
    values = []
    for col in columnsList:
        labels.append(col)
        values.append(np.corrcoef(df[col].values, target.values)[0,1])
    corr_df = pd.DataFrame({'columns_labels':labels, 'corr_values':values})
    corr_df = corr_df.sort_values(by='corr_values')
    corr_df = corr_df[(corr_df['corr_values']>0.25) | (corr_df['corr_values']<-0.25)]
    temp = df[corr_df.columns_labels.tolist()]
    corrmat = temp.corr(method='pearson')
    f, ax = plt.subplots(figsize=(12, 12))
    sns.heatmap(corrmat, vmax=1., square=True, cmap="YlOrRd")
    plt.title("Important variables correlation map", fontsize=15)
    plt.show()
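
The correlation filtering step inside `HeatMap` can also be written without an explicit loop. The sketch below, on synthetic data, uses pandas' `DataFrame.corrwith` to compute each feature's Pearson correlation with the target and keeps those above a threshold in absolute value. The helper name `top_correlated` and the toy columns are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def top_correlated(features, target, threshold=0.25):
    """Return features whose absolute Pearson correlation with target exceeds threshold."""
    # corrwith computes the correlation of each column with the target Series
    corr = features.corrwith(target)
    return corr[corr.abs() > threshold].sort_values()

# Synthetic stand-in data: 'x1' drives the target, 'x2' is unrelated noise
rng = np.random.default_rng(0)
features = pd.DataFrame({"x1": rng.random(200), "x2": rng.random(200)})
target = 5.0 * features["x1"] + rng.normal(scale=0.1, size=200)

selected = top_correlated(features, target)
print(selected)
```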

# <a> Target Variable </a>

In [None]:
plot_distribution(train["target"], "target", "blue")

In [None]:
plot_distribution(np.log1p(train["target"]), "log of target", "green")  

# <a>Distribution Of Non Zero per Row</a>

In [None]:
non_zeros = train.ne(0).sum(axis=1)
plot_distribution(np.log1p(non_zeros),"log of non-zero entries per row - train dataset","red")

In [None]:
non_zeros = test.ne(0).sum(axis=1)
plot_distribution(np.log1p(non_zeros),"log of non-zero entries per row - test dataset","magenta")

# <a>Distribution Of Non Zero per Column</a>

In [None]:
non_zeros = train.ne(0).sum(axis=0)
plot_distribution(np.log1p(non_zeros),"log of non-zero entries per column - train dataset","cyan")

In [None]:
non_zeros = test.ne(0).sum(axis=0)
plot_distribution(np.log1p(non_zeros),"log of non-zero entries per column - test dataset","green")

# <a> HeatMap Of Correlation of Important Features of Training Data:</a>

In [None]:
HeatMap(train_np[columnsList],train_np['target'],columnsList)