**Overview**

According to Epsilon research, 80% of customers are more likely to do business with you if you provide personalized service. Banking is no exception.

The digitalization of everyday life means that customers expect services to be delivered in a personalized and timely manner, often before they've even realized they need the service. In their third Kaggle competition, Santander Group aims to go a step beyond recognizing that a customer needs a financial service and intends to determine the amount or value of the customer's transaction. This means anticipating customer needs in a more concrete, but also simple and personal way. With so many choices for financial services, this need is greater now than ever before.

In this competition, Santander Group is asking Kagglers to help them identify the value of transactions for each potential customer. This is a first step that Santander needs to nail in order to personalize their services at scale.

**Note**

Since this was my first Kaggle competition, I didn't fine-tune much and mostly went with default parameter values, using an XGBoost regressor for the problem.

**0. Libraries that are required for analysis**

In [None]:
import xgboost as xgb
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
import warnings as ws
ws.simplefilter("ignore")

**1. Reading the Training and Testing Datasets**

In [None]:
#Reading the training and test data (numpy and pandas are already imported above)
vp_train = pd.read_csv("../input/train.csv", header='infer')
vp_test = pd.read_csv("../input/test.csv", header='infer')

**2. Taking the log transform because the features are skewed**

In [None]:
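#Columns: train has ID, target, then features; test has ID, then features
#log1p computes log(1 + x), so zero-valued features stay at zero
vp_train_log_x = np.log1p(vp_train.iloc[:,2:])
vp_test_log_x = np.log1p(vp_test.iloc[:,1:])
#The target is kept on its original scale
vp_train_y = vp_train.iloc[:,1]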
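
To check that the transform does what the heading says, the skewness can be compared before and after `np.log1p`. This is a minimal sketch on a made-up right-skewed column (the array `raw` and the helper `sample_skewness` are illustrative, not competition data or part of the notebook). It also shows that `np.expm1` undoes the transform.

```python
import numpy as np

def sample_skewness(values):
    """Biased sample skewness: third central moment over variance ** 1.5."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# Hypothetical sparse, heavy-tailed feature (not taken from train.csv)
raw = np.array([0.0, 0.0, 0.0, 10.0, 200.0, 5000.0, 1000000.0])
logged = np.log1p(raw)

skew_before = sample_skewness(raw)
skew_after = sample_skewness(logged)

# expm1 is the inverse of log1p, so the original values are recoverable
restored = np.expm1(logged)

print("skewness before:", skew_before)
print("skewness after: ", skew_after)
```

On the real data the same helper could be applied column by column, for example with `vp_train.iloc[:,2:].apply(sample_skewness)`.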

**3. Relevant Features**

In [None]:
model = xgb.XGBRegressor(colsample_bytree=0.4,
                 gamma=0,                 
                 learning_rate=0.07,
                 max_depth=3,
                 min_child_weight=1.5,
                 n_estimators=1000,                                                                    
                 reg_alpha=0.75,
                 reg_lambda=0.45,
                 subsample=0.6,
                 seed=42)

In [None]:
model.fit(vp_train_log_x, vp_train_y)
#get_fscore() maps each feature used in at least one split to its split count
xgb_fea_imp = pd.DataFrame(list(model.get_booster().get_fscore().items()),
                           columns=['feature','importance']).sort_values('importance', ascending=False)

In [None]:
#Selecting features with importance of at least 10
feature_g10 = xgb_fea_imp[xgb_fea_imp.importance >=10]

In [None]:
#Relevant feature list
feature_list = list(feature_g10.feature)
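
The selection step can be followed on a tiny example. The sketch below stands in a hypothetical dictionary for the output of `get_fscore()` (the feature names and counts are invented) and applies the same threshold of 10. Note that the comparison is `>=`, so a feature with exactly 10 splits is kept.

```python
import pandas as pd

# Hypothetical stand-in for model.get_booster().get_fscore():
# feature name -> number of splits that use the feature
fscore = {"f_a": 42, "f_b": 10, "f_c": 9, "f_d": 3}

importance = (
    pd.DataFrame(list(fscore.items()), columns=["feature", "importance"])
    .sort_values("importance", ascending=False)
)

# Keep features whose split count is at least the threshold
threshold = 10
selected = importance.loc[importance["importance"] >= threshold, "feature"].tolist()

print(selected)
```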

**4. Parameters for grid search**

In [None]:
gbm_param_grid = {
     'colsample_bytree': np.linspace(0.5, 0.9, 5),
     'n_estimators':[100, 200],
     'max_depth': [10, 15, 20, 25]
}

gbm = xgb.XGBRegressor()

grid_mse = GridSearchCV(estimator = gbm, param_grid = gbm_param_grid, scoring = 'neg_mean_squared_error', cv = 5, verbose = 1)
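
Before fitting, it helps to know how many models this grid implies. A rough count, assuming the grid and `cv = 5` above, can be sketched with `itertools.product` (the variable names here are illustrative):

```python
from itertools import product

import numpy as np

# Same grid as above, repeated so the sketch is self-contained
param_grid = {
    "colsample_bytree": np.linspace(0.5, 0.9, 5),
    "n_estimators": [100, 200],
    "max_depth": [10, 15, 20, 25],
}
cv_folds = 5

# Every combination of parameter values is one candidate model
combinations = list(product(*param_grid.values()))
n_candidates = len(combinations)

# Each candidate is trained once per fold
n_fits = n_candidates * cv_folds

print(n_candidates, "candidates,", n_fits, "cross-validation fits")
```

With the default `refit=True`, `GridSearchCV` trains one more model on the full training data using the best combination, so deep trees such as `max_depth = 25` can make this search slow.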

**5. Fitting the Model**

In [None]:
grid_mse.fit(vp_train_log_x[feature_list], vp_train_y)

**6. Predicting on Test data**

In [None]:
pred = grid_mse.predict(vp_test_log_x[feature_list])

**7. Saving the final result**

In [None]:
#Saving the final results
#Predictions can come out negative, so their absolute value is submitted
result = pd.DataFrame({'ID': vp_test['ID'], 'target': np.abs(pred)})
result.to_csv('Submission_7_Aug_2018_2.csv', sep=',', index=False)
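
The absolute value matters because a log-based score is not meaningful for negative predictions. Assuming the competition is scored with root mean squared logarithmic error (RMSLE), which should be checked against the competition page, a local check could be sketched as follows. The function name `rmsle` and the sample values are illustrative only.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error between two non-negative arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # This sketch rejects negative values instead of silently producing NaN
    if (y_true < 0).any() or (y_pred < 0).any():
        raise ValueError("rmsle expects non-negative values")
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Hypothetical targets, not competition data
y_true = np.array([100.0, 1000.0, 10000.0])

# Perfect predictions score zero
perfect = rmsle(y_true, y_true)

# Predictions shifted by exactly 1.0 on the log1p scale score about 1.0
shifted = rmsle(y_true, np.expm1(np.log1p(y_true) + 1.0))

print(perfect, shifted)
```

To use it on this notebook's data, part of the training set could be held out with `train_test_split` (already imported above) and scored with `rmsle` before submitting.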