## Correlation in XGBoost, Part 1

The goal of this exercise is to measure how correlated variables affect the performance of an XGBoost model.

We use the credit card fraud dataset from Kaggle.

We introduce different kinds of correlated variables and observe their impact on the XGBoost model.

In [1]:
# Import the required libraries
import pandas as pd
import numpy as np
import pickle
from sklearn.model_selection import train_test_split
from xgboost.sklearn import XGBClassifier
import random

### Preprocessing

In [2]:
training=pd.read_csv(".../creditcard.csv")

In [3]:
#checking for dtype
training.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
Time      284807 non-null float64
V1        284807 non-null float64
V2        284807 non-null float64
V3        284807 non-null float64
V4        284807 non-null float64
V5        284807 non-null float64
V6        284807 non-null float64
V7        284807 non-null float64
V8        284807 non-null float64
V9        284807 non-null float64
V10       284807 non-null float64
V11       284807 non-null float64
V12       284807 non-null float64
V13       284807 non-null float64
V14       284807 non-null float64
V15       284807 non-null float64
V16       284807 non-null float64
V17       284807 non-null float64
V18       284807 non-null float64
V19       284807 non-null float64
V20       284807 non-null float64
V21       284807 non-null float64
V22       284807 non-null float64
V23       284807 non-null float64
V24       284807 non-null float64
V25       284807 non-null float64
V26  

Every column listed is numeric, so there are no categorical variables to encode.

In [4]:
#checking for class imbalance
imbalance=training['Class'].value_counts().to_frame()
imbalance['percentage']=imbalance['Class'].apply(lambda x : round(x*100/len(training),2))
imbalance

Unnamed: 0,Class,percentage
0,284315,99.83
1,492,0.17


In [6]:
# Check whether any column contains null values

training.isnull().sum()

Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64

In [7]:
# Most columns are already scaled, so we only scale the two that are not (Amount and Time)
from sklearn.preprocessing import StandardScaler, RobustScaler

In [8]:
# RobustScaler is less sensitive to outliers than StandardScaler.

std_scaler = StandardScaler()
rob_scaler = RobustScaler()

In [9]:
training['scaled_amount'] = rob_scaler.fit_transform(training['Amount'].values.reshape(-1,1))
training['scaled_time'] = rob_scaler.fit_transform(training['Time'].values.reshape(-1,1))

training.drop(['Time','Amount'], axis=1, inplace=True)
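To make the choice of scaler concrete, here is a minimal sketch comparing robust scaling with standard scaling. It reimplements both transforms in plain NumPy (centering on the median and dividing by the interquartile range, which is the default behaviour of `RobustScaler`) rather than calling scikit-learn. The `amounts` array is made-up illustrative data, not values from the dataset, and the helper names are hypothetical.

```python
import numpy as np

def robust_scale(x):
    """Center on the median and divide by the interquartile range."""
    x = np.asarray(x, dtype=float)
    q25, q50, q75 = np.percentile(x, [25, 50, 75])
    return (x - q50) / (q75 - q25)

def standard_scale(x):
    """Center on the mean and divide by the standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

# Illustrative amounts (assumption): five typical values and one extreme outlier
amounts = np.array([5.0, 10.0, 20.0, 40.0, 80.0, 25000.0])

robust = robust_scale(amounts)
standard = standard_scale(amounts)

# Spread of the five typical values under each scaler
print("robust spread:  ", robust[:5].max() - robust[:5].min())
print("standard spread:", standard[:5].max() - standard[:5].min())
```

With a single extreme amount, the mean and standard deviation are dominated by the outlier, so the standard-scaled typical values collapse into a very narrow band, while the robust-scaled values stay spread out.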


In [None]:
### Save the cleaned data for future use (to_pickle does not need the pickle module imported)

training.to_pickle('..../cleaned_data')

### Analysis

In [10]:
#reading data
df = pd.read_pickle('.../cleaned_data')

X=df.drop(['Class'],axis=1)
y=df[['Class']]


Split the data 70:30 into training and validation sets.

In [11]:
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.3,random_state=123)


Build a baseline model for performance comparison.

We use AUC as the metric because, on a heavily imbalanced problem like this one (0.17% fraud), accuracy is uninformative: a model that never predicts fraud would still be 99.83% accurate.
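The reason for preferring AUC over accuracy on imbalanced data can be illustrated with a small sketch. It uses toy labels with roughly the same fraud rate as this dataset (an assumption for illustration, not the real data) and a hand-written `roc_auc` helper that computes AUC as the probability that a random positive is scored above a random negative.

```python
import numpy as np

def roc_auc(y_true, scores):
    """AUC as P(score of a positive > score of a negative), ties counting half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive with every negative
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Toy labels: 17 positives in 10,000 rows (0.17%), mimicking the fraud rate
y = np.zeros(10_000, dtype=int)
y[:17] = 1

# A "model" that always predicts non-fraud
constant_scores = np.zeros(len(y))
accuracy = np.mean((constant_scores > 0.5) == y)
auc_constant = roc_auc(y, constant_scores)

# A model whose scores carry some signal about the label
rng = np.random.default_rng(0)
informative_scores = y + rng.normal(0.0, 0.5, size=len(y))
auc_informative = roc_auc(y, informative_scores)

print(accuracy, auc_constant, auc_informative)
```

The constant predictor looks excellent on accuracy but sits at the chance level on AUC, which is why AUC is the more useful yardstick here.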

In [12]:
xgb_model=XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=125,seed=123)
xgb_model.fit(X_train,Y_train,eval_set=[(X_train,Y_train),(X_test,Y_test)],eval_metric=['auc'],verbose=25)

[0]	validation_0-auc:0.903496	validation_1-auc:0.88731
[25]	validation_0-auc:0.924528	validation_1-auc:0.924792
[50]	validation_0-auc:0.974737	validation_1-auc:0.957822
[75]	validation_0-auc:0.992318	validation_1-auc:0.975396
[100]	validation_0-auc:0.996477	validation_1-auc:0.979231
[124]	validation_0-auc:0.997853	validation_1-auc:0.98187


XGBClassifier(n_estimators=125, seed=123)

#### Checking Variable Importance

In [13]:
variable_imp=pd.DataFrame({'variable':X_train.columns,'imp':xgb_model.feature_importances_})
variable_imp.sort_values('imp',ascending=False,inplace=True)
variable_imp['percentile']=variable_imp['imp']*100/variable_imp.iloc[0,1]
variable_imp

Unnamed: 0,variable,imp,percentile
16,V17,0.312298,100.0
13,V14,0.08658,27.723635
6,V7,0.055855,17.885256
9,V10,0.050032,16.020731
25,V26,0.031942,10.228127
3,V4,0.031279,10.015796
26,V27,0.030674,9.822069
11,V12,0.028334,9.07272
20,V21,0.021607,6.918778
27,V28,0.021545,6.898793


### Introducing a variable perfectly correlated with the top variable

In [14]:
# A linear transformation of V17 is perfectly correlated with the original variable
X_train['new_var']=3*X_train['V17']+5
X_test['new_var']=3*X_test['V17']+5

In [15]:
xgb_model=XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=125,seed=123)
xgb_model.fit(X_train,Y_train,eval_set=[(X_train,Y_train),(X_test,Y_test)],eval_metric=['auc'],verbose=25)

[0]	validation_0-auc:0.903496	validation_1-auc:0.88731
[25]	validation_0-auc:0.924528	validation_1-auc:0.924792
[50]	validation_0-auc:0.974737	validation_1-auc:0.957822
[75]	validation_0-auc:0.992318	validation_1-auc:0.975396
[100]	validation_0-auc:0.996477	validation_1-auc:0.979231
[124]	validation_0-auc:0.997853	validation_1-auc:0.98187


XGBClassifier(n_estimators=125, seed=123)

#### We can see that including a perfectly correlated variable has no impact on model performance

Now let's check the impact on variable importance

In [16]:
variable_imp=pd.DataFrame({'variable':X_train.columns,'imp':xgb_model.feature_importances_})
variable_imp.sort_values('imp',ascending=False,inplace=True)
variable_imp['percentile']=variable_imp['imp']*100/variable_imp.iloc[0,1]


In [17]:
variable_imp.head(10)

Unnamed: 0,variable,imp,percentile
16,V17,0.312298,100.0
13,V14,0.08658,27.723635
6,V7,0.055855,17.885256
9,V10,0.050032,16.020731
25,V26,0.031942,10.228127
3,V4,0.031279,10.015796
26,V27,0.030674,9.822069
11,V12,0.028334,9.07272
20,V21,0.021607,6.918778
27,V28,0.021545,6.898793


In [16]:
variable_imp.tail(10)

Unnamed: 0,variable,imp,percentile
8,V9,0.016114,5.159705
4,V5,0.015977,5.115854
28,scaled_amount,0.015308,4.901874
17,V18,0.015256,4.885098
15,V16,0.014522,4.65018
21,V22,0.014339,4.591552
0,V1,0.012375,3.962445
22,V23,0.011037,3.534229
18,V19,0.010897,3.489316
30,new_var,0.0,0.0


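We can see that new_var (the perfectly correlated variable) has a variable importance of 0, implying that it has no effect on the XGBoost model. This is expected: tree splits depend only on the ordering of values, and `3*V17+5` orders the rows exactly as V17 does, so new_var offers no split that V17 does not already provide.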
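A small sketch can make the zero importance less surprising. It uses synthetic data as a stand-in for V17 (an assumption, not the fraud data) and checks two things: the transformed variable has a Pearson correlation of 1 with the original, and every threshold on the original has a matching threshold on the copy that produces exactly the same row partition. Since both columns offer identical candidate splits, the copy can add nothing, and presumably the tie goes to the column the model considers first.

```python
import numpy as np

# Synthetic stand-in for V17 (assumption: not the real fraud data)
rng = np.random.default_rng(123)
v17 = rng.normal(size=1000)
new_var = 3 * v17 + 5  # same transformation as in the notebook

# Pearson correlation is 1 up to floating-point error
corr = np.corrcoef(v17, new_var)[0, 1]

# Every threshold t on v17 has a twin threshold 3*t + 5 on new_var
# that sends exactly the same rows to the left child of a split.
thresholds = [-1.0, 0.0, 1.0]
same_partition = all(
    np.array_equal(v17 < t, new_var < 3 * t + 5) for t in thresholds
)

print(corr, same_partition)
```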

**********************************************************************************************************************

#### Now let's introduce partially correlated variables, with correlation between 0.95 and 0.99

In [None]:
# Introducing partially correlated variable

In [42]:
# Inspect the distribution of the variable to decide how much noise to add when creating the partially correlated variable
X_train['V17'].describe()

count    199364.000000
mean          0.001134
std           0.845466
min         -25.162799
25%          -0.483786
50%          -0.065029
75%           0.400892
max           9.253526
Name: V17, dtype: float64

In [43]:
X_train['new_var_V17']=X_train['V17'].apply(lambda x:x+random.uniform(-0.5,-0.05))
X_test['new_var_V17']=X_test['V17'].apply(lambda x:x+random.uniform(-0.5,-0.05))
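The per-row `random.uniform` calls above are unseeded, so the noisy variable differs on every run. As a hedged alternative, the sketch below adds the same kind of uniform noise in a vectorised, seeded way with NumPy. The helper `add_uniform_noise` is hypothetical (not part of the notebook), and `v17` is a synthetic stand-in with a spread similar to V17.

```python
import numpy as np

def add_uniform_noise(values, low, high, seed):
    """Return values plus independent uniform(low, high) noise, reproducibly."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    return values + rng.uniform(low, high, size=values.shape)

# Synthetic stand-in for V17 with a similar spread (assumption)
v17 = np.random.default_rng(0).normal(0.0, 0.85, size=5000)

# Same seed, same noise: the two results are identical
noisy_a = add_uniform_noise(v17, -0.5, -0.05, seed=123)
noisy_b = add_uniform_noise(v17, -0.5, -0.05, seed=123)

corr = np.corrcoef(v17, noisy_a)[0, 1]
print(np.array_equal(noisy_a, noisy_b), corr)
```

In the notebook this could be applied as `X_train['new_var_V17'] = add_uniform_noise(X_train['V17'], -0.5, -0.05, seed=123)`, ideally with a different seed for `X_test` so the two noise draws stay independent.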


In [44]:
# Let's see the resulting correlation of the new variable with the original variable
X_train.corr()['V17']

V1              -0.003546
V2               0.002895
V3              -0.005260
V4               0.003783
V5              -0.003252
V6              -0.001976
V7              -0.006979
V8               0.000958
V9              -0.003132
V10             -0.006593
V11              0.003287
V12             -0.007348
V13             -0.000395
V14             -0.005373
V15              0.001686
V16             -0.007527
V17              1.000000
V18             -0.003768
V19              0.000765
V20              0.002074
V21             -0.001796
V22              0.001256
V23              0.002407
V24             -0.001959
V25             -0.000666
V26             -0.000063
V27             -0.005216
V28             -0.002053
scaled_amount    0.005924
scaled_time     -0.074405
new_var          1.000000
new_var_V17      0.988391
new_var_V14     -0.005547
new_var_V7      -0.006760
new_var_V10     -0.007382
Name: V17, dtype: float64
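The observed correlation can be anticipated before running the cell. Assuming the noise is independent of the variable, adding uniform noise on an interval of width `high - low` gives an expected correlation of `std_x / sqrt(std_x**2 + (high - low)**2 / 12)`, because a uniform variable has variance `(high - low)**2 / 12`. The sketch below plugs in the standard deviations from the `describe()` outputs in this notebook; `expected_corr` is an illustrative helper name.

```python
import math

def expected_corr(std_x, low, high):
    """Expected Pearson correlation between x and x + uniform(low, high) noise,
    assuming the noise is independent of x."""
    noise_var = (high - low) ** 2 / 12.0
    return std_x / math.sqrt(std_x ** 2 + noise_var)

# Standard deviations taken from the describe() outputs in this notebook
corr_v17 = expected_corr(0.845466, -0.5, -0.05)  # noise range used for V17
corr_v14 = expected_corr(0.956478, -0.4, 0.5)    # noise range used for V14

print(round(corr_v17, 3), round(corr_v14, 3))
```

Both values land within about 0.001 of the correlations measured with `X_train.corr()`, which suggests the noise width is the main knob controlling how correlated the new variable is.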

### Following a similar process for V14, V7, and V10

In [45]:
X_train['V14'].describe()

count    199364.000000
mean          0.000123
std           0.956478
min         -19.214325
25%          -0.424956
50%           0.050716
75%           0.493505
max          10.526766
Name: V14, dtype: float64

In [46]:
X_train['new_var_V14']=X_train['V14'].apply(lambda x:x+random.uniform(-0.4,0.5))
X_test['new_var_V14']=X_test['V14'].apply(lambda x:x+random.uniform(-0.4,0.5))


In [47]:
X_train.corr()['V14']

V1              -0.001992
V2               0.001876
V3              -0.003451
V4               0.000852
V5              -0.001402
V6              -0.001222
V7              -0.007570
V8               0.001039
V9              -0.003786
V10             -0.005378
V11             -0.000328
V12             -0.003224
V13             -0.000725
V14              1.000000
V15             -0.001043
V16             -0.003349
V17             -0.005373
V18             -0.002714
V19              0.001573
V20              0.001175
V21             -0.003647
V22              0.001776
V23             -0.000208
V24              0.001518
V25             -0.000704
V26              0.001554
V27             -0.003828
V28             -0.001962
scaled_amount    0.030429
scaled_time     -0.095317
new_var         -0.005373
new_var_V17     -0.005178
new_var_V14      0.964867
new_var_V7      -0.007048
new_var_V10     -0.006235
Name: V14, dtype: float64

In [48]:
X_train['V7'].describe()

count    199364.000000
mean          0.000137
std           1.246990
min         -43.557242
25%          -0.554875
50%           0.040497
75%           0.570080
max         120.589494
Name: V7, dtype: float64

In [49]:
X_train['new_var_V7']=X_train['V7'].apply(lambda x:x+random.uniform(-0.4,0.5))
X_test['new_var_V7']=X_test['V7'].apply(lambda x:x+random.uniform(-0.4,0.5))


In [50]:
X_test.corr()['V7']

V1               0.008937
V2               0.004467
V3               0.026912
V4              -0.011816
V5               0.045784
V6              -0.020501
V7               1.000000
V8              -0.023351
V9               0.003025
V10              0.004871
V11             -0.007287
V12              0.019507
V13             -0.008818
V14              0.018014
V15             -0.008385
V16              0.003830
V17              0.016479
V18              0.006491
V19             -0.009809
V20              0.047486
V21              0.016248
V22             -0.005244
V23              0.010422
V24              0.001174
V25             -0.002594
V26             -0.000810
V27             -0.033056
V28              0.020174
scaled_amount    0.365779
scaled_time      0.082475
new_var          0.016479
new_var_V17      0.015901
new_var_V14      0.018352
new_var_V7       0.977797
new_var_V10      0.003412
Name: V7, dtype: float64

In [51]:
X_train['V10'].describe()

count    199364.000000
mean         -0.000302
std           1.081960
min         -24.588262
25%          -0.534082
50%          -0.091447
75%           0.453239
max          15.331742
Name: V10, dtype: float64

In [52]:
X_train['new_var_V10']=X_train['V10'].apply(lambda x:x+random.uniform(-0.4,0.5))
X_test['new_var_V10']=X_test['V10'].apply(lambda x:x+random.uniform(-0.4,0.5))


In [53]:
X_test.corr()['V10']

V1              -0.001754
V2               0.013340
V3               0.006780
V4              -0.008555
V5               0.000237
V6               0.006219
V7               0.004871
V8              -0.010123
V9               0.016698
V10              1.000000
V11             -0.006991
V12              0.019921
V13             -0.002981
V14              0.012201
V15             -0.003246
V16              0.007099
V17              0.014843
V18              0.006208
V19             -0.000532
V20              0.010370
V21             -0.013692
V22              0.005718
V23              0.009744
V24             -0.003995
V25              0.007238
V26             -0.004788
V27              0.003566
V28             -0.011253
scaled_amount   -0.102288
scaled_time      0.033125
new_var          0.014843
new_var_V17      0.014589
new_var_V14      0.012206
new_var_V7       0.005489
new_var_V10      0.973359
Name: V10, dtype: float64

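Correlation of each new variable with its source variable:

new_var_V17 with V17: 0.988 (measured on X_train)

new_var_V14 with V14: 0.965 (measured on X_train)

new_var_V7 with V7: 0.978 (measured on X_test)

new_var_V10 with V10: 0.973 (measured on X_test)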

### Checking model performance after introducing 4 partially correlated variables

In [54]:
xgb_model=XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=125,seed=123)
xgb_model.fit(X_train,Y_train,eval_set=[(X_train,Y_train),(X_test,Y_test)],eval_metric=['auc'],verbose=25)

[0]	validation_0-auc:0.903496	validation_1-auc:0.88731
[25]	validation_0-auc:0.924523	validation_1-auc:0.924777
[50]	validation_0-auc:0.975601	validation_1-auc:0.955724
[75]	validation_0-auc:0.992024	validation_1-auc:0.971422
[100]	validation_0-auc:0.996847	validation_1-auc:0.978318
[124]	validation_0-auc:0.998093	validation_1-auc:0.981696


XGBClassifier(n_estimators=125, seed=123)

### Effect of correlation on Variable importance

In [38]:
variable_imp=pd.DataFrame({'variable':X_train.columns,'imp':xgb_model.feature_importances_})
variable_imp.sort_values('imp',ascending=False,inplace=True)
variable_imp['percentile']=variable_imp['imp']*100/variable_imp.iloc[0,1]


In [39]:
variable_imp.head(10)

Unnamed: 0,variable,imp,percentile
16,V17,0.209892,100.0
31,new_var_V17,0.198895,94.760773
32,new_var_V14,0.0741,35.304081
34,new_var_V10,0.048034,22.885035
6,V7,0.043189,20.576725
13,V14,0.042035,20.026888
26,V27,0.024402,11.626035
11,V12,0.022688,10.809585
9,V10,0.022331,10.639076
3,V4,0.021658,10.318623


In [40]:
variable_imp.tail(10)

Unnamed: 0,variable,imp,percentile
21,V22,0.010742,5.117947
33,new_var_V7,0.010394,4.952227
14,V15,0.010295,4.9047
4,V5,0.010063,4.794604
2,V3,0.009901,4.717388
15,V16,0.009426,4.491082
0,V1,0.008642,4.117443
18,V19,0.007792,3.712486
22,V23,0.006107,2.909774
30,new_var,0.0,0.0


#### Performance barely changes: training AUC rises slightly (0.997853 to 0.998093) while validation AUC is essentially flat (0.98187 to 0.981696), implying that XGBoost is not heavily impacted by the inclusion of correlated variables. The variable importance of each original variable is now shared with its new correlated variable (for example, V17 drops from 0.312 to 0.210 while new_var_V17 takes 0.199)
#### Because the correlated variables are created with a random function, a slight change in performance from run to run is expected
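Because importance is split between an original variable and its noisy copy, one hedged way to read the table is to add the two back together. The numbers below are copied from the importance tables in this notebook. The helper `grouped_importance` and the reliance on the `new_var_` prefix are illustrative conventions, not part of XGBoost.

```python
# Importances copied from the tables in this notebook
baseline = {"V17": 0.312298, "V14": 0.08658, "V7": 0.055855, "V10": 0.050032}
with_noisy_copies = {
    "V17": 0.209892, "new_var_V17": 0.198895,
    "V14": 0.042035, "new_var_V14": 0.0741,
    "V7": 0.043189, "new_var_V7": 0.010394,
    "V10": 0.022331, "new_var_V10": 0.048034,
}

def grouped_importance(importances, prefix="new_var_"):
    """Sum the importance of each variable with that of its noisy copy."""
    grouped = {}
    for name, imp in importances.items():
        source = name[len(prefix):] if name.startswith(prefix) else name
        grouped[source] = grouped.get(source, 0.0) + imp
    return grouped

grouped = grouped_importance(with_noisy_copies)
for name in baseline:
    print(f"{name:4s} baseline={baseline[name]:.6f} grouped={grouped[name]:.6f}")
```

Comparing the grouped figure with the baseline gives a fairer picture of how much the model relies on the underlying signal than either column's importance taken alone.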

 *************************************************************************************************************

### Part 1.1

### I tried one more exercise, introducing partially correlated variables with correlation between 0.8 and 0.9

Creating new variables with 0.8 < corr < 0.9

In [58]:
# Create the train-test split again for a fresh start
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.3,random_state=123)


In [59]:
X_train['new_var_V17']=X_train['V17'].apply(lambda x:x+random.uniform(-0.5,1))
X_test['new_var_V17']=X_test['V17'].apply(lambda x:x+random.uniform(-0.5,1))


In [60]:
X_train.corr()['new_var_V17']

V1              -0.003316
V2              -0.000222
V3              -0.004945
V4               0.004516
V5              -0.002112
V6              -0.001271
V7              -0.005241
V8               0.000096
V9              -0.003618
V10             -0.007159
V11              0.003258
V12             -0.006428
V13             -0.000799
V14             -0.002769
V15              0.003794
V16             -0.008051
V17              0.890150
V18             -0.003661
V19              0.000549
V20              0.002323
V21             -0.001850
V22              0.001098
V23              0.002313
V24             -0.001547
V25             -0.000301
V26              0.001640
V27             -0.004218
V28             -0.001244
scaled_amount    0.007969
scaled_time     -0.067215
new_var_V17      1.000000
Name: new_var_V17, dtype: float64

#### new_var_V17 corr is 0.89
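Rather than finding the noise range by trial and error, the width needed for a target correlation can be estimated up front. Assuming the noise is independent of the variable, solving `corr = std_x / sqrt(std_x**2 + width**2 / 12)` for the width gives `width = std_x * sqrt(12 * (1 / corr**2 - 1))`. The sketch below uses the standard deviation of V17 from the earlier `describe()` output; both helper names are illustrative.

```python
import math

def noise_width_for_corr(std_x, target_corr):
    """Width of a uniform noise interval that yields target_corr,
    assuming the noise is independent of x."""
    return std_x * math.sqrt(12.0 * (1.0 / target_corr ** 2 - 1.0))

def expected_corr(std_x, width):
    """Inverse check: expected correlation for a given noise width."""
    return std_x / math.sqrt(std_x ** 2 + width ** 2 / 12.0)

std_v17 = 0.845466  # from X_train['V17'].describe()
width = noise_width_for_corr(std_v17, 0.89)
round_trip = expected_corr(std_v17, width)

print(round(width, 3), round(round_trip, 3))
```

The estimated width comes out very close to 1.5, which matches the interval `random.uniform(-0.5,1)` used in the cell above.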

In [61]:
X_train['new_var_V14']=X_train['V14'].apply(lambda x:x+random.uniform(-0.8,1))
X_test['new_var_V14']=X_test['V14'].apply(lambda x:x+random.uniform(-0.8,1))


In [62]:
X_train.corr()['new_var_V14']

V1              -0.001518
V2               0.002064
V3              -0.003371
V4               0.000127
V5              -0.000997
V6              -0.001673
V7              -0.006035
V8               0.000082
V9              -0.002220
V10             -0.004397
V11              0.000550
V12             -0.004538
V13              0.000383
V14              0.879204
V15             -0.000274
V16             -0.005098
V17             -0.004399
V18             -0.001685
V19              0.003426
V20              0.000622
V21             -0.003764
V22             -0.001930
V23             -0.001879
V24              0.001939
V25             -0.000467
V26              0.000586
V27             -0.002949
V28             -0.000467
scaled_amount    0.026666
scaled_time     -0.083769
new_var_V17     -0.002394
new_var_V14      1.000000
Name: new_var_V14, dtype: float64

#### new_var_V14 corr is 0.88

In [63]:
X_train['new_var_V7']=X_train['V7'].apply(lambda x:x+random.uniform(-1,1.2))
X_test['new_var_V7']=X_test['V7'].apply(lambda x:x+random.uniform(-1,1.2))


In [64]:
X_train.corr()['new_var_V7']

V1              -0.001789
V2              -0.001712
V3              -0.010899
V4               0.004175
V5              -0.016948
V6               0.007311
V7               0.891101
V8               0.008562
V9              -0.000780
V10              0.000640
V11              0.003457
V12             -0.007323
V13              0.003858
V14             -0.005740
V15              0.002900
V16              0.000618
V17             -0.006266
V18             -0.001107
V19              0.004203
V20             -0.017906
V21             -0.004248
V22              0.001149
V23             -0.003876
V24             -0.000356
V25              0.000711
V26             -0.000878
V27              0.011553
V28             -0.004955
scaled_amount    0.365023
scaled_time      0.075821
new_var_V17     -0.004166
new_var_V14     -0.004090
new_var_V7       1.000000
Name: new_var_V7, dtype: float64

#### new_var_V7 corr is 0.89

In [65]:
X_train['new_var_V10']=X_train['V10'].apply(lambda x:x+random.uniform(-1,1))
X_test['new_var_V10']=X_test['V10'].apply(lambda x:x+random.uniform(-1,1))


In [66]:
X_train.corr()['new_var_V10']

V1               0.001497
V2              -0.003987
V3              -0.001215
V4               0.002658
V5               0.000435
V6              -0.002181
V7              -0.002390
V8               0.005423
V9              -0.006753
V10              0.881853
V11              0.002699
V12             -0.006851
V13              0.001449
V14             -0.001955
V15             -0.000250
V16             -0.005174
V17             -0.004083
V18             -0.001435
V19              0.000999
V20             -0.004259
V21              0.004302
V22             -0.002760
V23             -0.003955
V24              0.000962
V25             -0.002915
V26              0.000380
V27             -0.001622
V28              0.004161
scaled_amount   -0.090809
scaled_time      0.025061
new_var_V17     -0.004601
new_var_V14     -0.001602
new_var_V7       0.000365
new_var_V10      1.000000
Name: new_var_V10, dtype: float64

#### new_var_V10 corr is 0.88

### Checking model performance

In [67]:
xgb_model=XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=125,seed=123)
xgb_model.fit(X_train,Y_train,eval_set=[(X_train,Y_train),(X_test,Y_test)],eval_metric=['auc'],verbose=25)

[0]	validation_0-auc:0.895983	validation_1-auc:0.871693
[25]	validation_0-auc:0.924531	validation_1-auc:0.921668
[50]	validation_0-auc:0.975009	validation_1-auc:0.957666
[75]	validation_0-auc:0.99282	validation_1-auc:0.970988
[100]	validation_0-auc:0.996456	validation_1-auc:0.978528
[124]	validation_0-auc:0.997752	validation_1-auc:0.981413


XGBClassifier(n_estimators=125, seed=123)

In [68]:
variable_imp=pd.DataFrame({'variable':X_train.columns,'imp':xgb_model.feature_importances_})
variable_imp.sort_values('imp',ascending=False,inplace=True)
variable_imp['percentile']=variable_imp['imp']*100/variable_imp.iloc[0,1]


In [69]:
variable_imp.head(10)

Unnamed: 0,variable,imp,percentile
16,V17,0.262655,100.0
30,new_var_V17,0.086497,32.931885
13,V14,0.073167,27.856531
6,V7,0.057156,21.760967
9,V10,0.038675,14.7248
33,new_var_V10,0.036213,13.787479
31,new_var_V14,0.033008,12.566878
26,V27,0.025925,9.870428
3,V4,0.025471,9.697683
25,V26,0.024627,9.376287


In [70]:
variable_imp.tail(10)

Unnamed: 0,variable,imp,percentile
17,V18,0.013094,4.985137
32,new_var_V7,0.012728,4.845744
14,V15,0.01226,4.667555
12,V13,0.012104,4.608418
4,V5,0.011916,4.536871
21,V22,0.01163,4.427858
0,V1,0.010952,4.169651
18,V19,0.01045,3.978678
22,V23,0.008909,3.392039
24,V25,0.007922,3.016236


#### Again, performance barely changes: validation AUC moves from 0.98187 (baseline) to 0.981413, implying that XGBoost is not heavily impacted by the inclusion of correlated variables, and that this holds for both correlation ranges tested. The variable importance of each original variable is again shared with its new correlated variable, though less evenly (V17 keeps 0.263 while new_var_V17 takes 0.086)
#### Because the correlated variables are created with a random function, a slight change in performance from run to run is expected
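To put the three experiments side by side, the sketch below computes the change in final validation AUC relative to the baseline. The values are copied from the last logged iteration (124) of each training run in this notebook, and the experiment labels are just descriptive names chosen here.

```python
# Final validation AUC (iteration 124) copied from the training logs in this notebook
final_val_auc = {
    "baseline": 0.98187,
    "perfect copy of V17": 0.98187,
    "4 noisy copies, corr 0.95-0.99": 0.981696,
    "4 noisy copies, corr 0.8-0.9": 0.981413,
}

baseline = final_val_auc["baseline"]
deltas = {name: round(auc - baseline, 6) for name, auc in final_val_auc.items()}
largest_drop = min(deltas.values())

for name, delta in deltas.items():
    print(f"{name:32s} {delta:+.6f}")
```

Every difference is well below 0.001, which is small enough that the unseeded random noise could plausibly account for it.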

# *************Thanks*************