The dataset contains IPL players' auction prices (base price and sold price) and relevant details of each player. 

1) Fit a linear regression model to predict the sold-price of the player.<br>
2) Use variable reduction techniques covered so far to identify significant variables.<br>
3) What is the RMSE of the model?<br>
4) What are the top 5 variables that impact the price of the player?

Make appropriate assumptions as necessary for solving the assignment. 
Note that this data may not match actual player data.

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm
import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', None)

In [2]:
# Load data
ipldata = pd.read_csv("C:/Users/Karthik.Iyer/Downloads/AccelerateAI/Regression-Models-main/Assignment 03-IPL_Case_data.csv")
ipldata.head()

Unnamed: 0,PLAYER NAME,Country,Team,PLAYING ROLE,BAT,BOW,ALL,BAT-StrikeRate,BOW-Economy,BOW*SR-BL,BAT*RUN-S,BOW*WK-I,BAT*T-RUNS,BAT*ODI-RUNS,BOW*WK-O,Total-RUNS,Total-WKTS,ODI-RUNS,ODI-SR-B,ODI-WKTS,ODI-SR-BL,CAPTAINCY EXP,INDIA,AUSTRALIA,OTHERS,Highest Score,AVERAGE RUNS,SR -B,SIXERS,AVE-BL,ECON,SR -BL,Year,Year_Dummy,Base_Price,SoldPrice
0,"Abdulla, YA",SA,KXIP,Allrounder,0.0,0.0,1.0,-,-,-,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,-,-,-,20.47,8.9,13.93,2009.0,0.0,50000.0,50000.0
1,Abdur Razzak,BAN,RCB,Bowler,0.0,1.0,0.0,-,14.5,-,0.0,0.0,0.0,0.0,185.0,214.0,18.0,657.0,71.41,185.0,37.6,0.0,0.0,0.0,1.0,0.0,-,-,-,-,14.5,-,2008.0,0.0,50000.0,50000.0
2,"Agarkar, AB",IND,KKR,Bowler,0.0,1.0,0.0,-,8.81,24.9,0.0,29.0,0.0,0.0,288.0,571.0,58.0,1269.0,80.62,288.0,32.9,0.0,1.0,0.0,0.0,39.0,18.56,121.01,5,36.52,8.81,24.9,2008.0,0.0,200000.0,350000.0
3,"Ashwin, R",IND,CSK,Bowler,0.0,1.0,0.0,-,6.23,22.14,0.0,49.0,0.0,0.0,51.0,284.0,31.0,241.0,84.56,51.0,36.8,0.0,1.0,0.0,0.0,11.0,5.8,76.32,-,22.96,6.23,22.14,2011.0,1.0,100000.0,850000.0
4,"Badrinath, S",IND,CSK,Batsman,1.0,0.0,0.0,120.71,-,-,1317.0,0.0,63.0,79.0,0.0,63.0,0.0,79.0,45.93,0.0,0.0,0.0,1.0,0.0,0.0,71.0,32.93,120.71,28,-,-,-,2011.0,1.0,100000.0,800000.0


In [3]:
# Check shape
ipldata.shape

(131, 36)

In [4]:
# Check data types
ipldata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 131 entries, 0 to 130
Data columns (total 36 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   PLAYER NAME     130 non-null    object 
 1   Country         130 non-null    object 
 2   Team            130 non-null    object 
 3   PLAYING ROLE    130 non-null    object 
 4   BAT             130 non-null    float64
 5   BOW             130 non-null    float64
 6   ALL             130 non-null    float64
 7   BAT-StrikeRate  130 non-null    object 
 8   BOW-Economy     130 non-null    object 
 9   BOW*SR-BL       130 non-null    object 
 10  BAT*RUN-S       130 non-null    float64
 11  BOW*WK-I        130 non-null    float64
 12  BAT*T-RUNS      130 non-null    float64
 13  BAT*ODI-RUNS    130 non-null    float64
 14  BOW*WK-O        131 non-null    float64
 15  Total-RUNS      130 non-null    float64
 16  Total-WKTS      130 non-null    float64
 17  ODI-RUNS        130 non-null    flo

In [5]:
# Let's check the variables with object data type
ipldata[['BAT-StrikeRate','BOW-Economy','BOW*SR-BL','AVERAGE RUNS','SR -B','SIXERS','AVE-BL','ECON','SR -BL']].head()

Unnamed: 0,BAT-StrikeRate,BOW-Economy,BOW*SR-BL,AVERAGE RUNS,SR -B,SIXERS,AVE-BL,ECON,SR -BL
0,-,-,-,-,-,-,20.47,8.9,13.93
1,-,14.5,-,-,-,-,-,14.5,-
2,-,8.81,24.9,18.56,121.01,5,36.52,8.81,24.9
3,-,6.23,22.14,5.8,76.32,-,22.96,6.23,22.14
4,120.71,-,-,32.93,120.71,28,-,-,-


In [6]:
# Data cleaning
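def clean(series):
    # Strip currency symbols and thousands separators, then treat a lone '-' (no value recorded) as 0.0
    stripped = series.str.replace(r'[$,]', '', regex=True).str.strip()
    return stripped.replace('-', '0.0').astype(float)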

ipldata[['BAT-StrikeRate','BOW-Economy','BOW*SR-BL','AVERAGE RUNS',
         'SR -B','SIXERS','AVE-BL','ECON','SR -BL']] = ipldata[['BAT-StrikeRate','BOW-Economy','BOW*SR-BL',
                                                                'AVERAGE RUNS','SR -B','SIXERS','AVE-BL',
                                                                'ECON','SR -BL']].apply(clean)

In [7]:
# Check data
ipldata[['BAT-StrikeRate','BOW-Economy','BOW*SR-BL','AVERAGE RUNS','SR -B','SIXERS','AVE-BL','ECON','SR -BL']].head()

Unnamed: 0,BAT-StrikeRate,BOW-Economy,BOW*SR-BL,AVERAGE RUNS,SR -B,SIXERS,AVE-BL,ECON,SR -BL
0,0.0,0.0,0.0,0.0,0.0,0.0,20.47,8.9,13.93
1,0.0,14.5,0.0,0.0,0.0,0.0,0.0,14.5,0.0
2,0.0,8.81,24.9,18.56,121.01,5.0,36.52,8.81,24.9
3,0.0,6.23,22.14,5.8,76.32,0.0,22.96,6.23,22.14
4,120.71,0.0,0.0,32.93,120.71,28.0,0.0,0.0,0.0
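The cleaning step above maps the `-` placeholder to 0.0. A minimal sketch of the same idea using `pd.to_numeric`, which makes the "not a number" case explicit before a fill value is chosen. The helper name `to_number` and the toy values are illustrative assumptions, not part of the assignment data:

```python
import pandas as pd

def to_number(series, fill_value=0.0):
    """Convert a text column to float; non-numeric entries such as '-' become fill_value."""
    numeric = pd.to_numeric(series, errors='coerce')  # '-' is coerced to NaN
    return numeric.fillna(fill_value)

# Toy column mimicking the object-typed statistics above
toy = pd.Series(['-', '8.81', '120.71'])
cleaned = to_number(toy)
print(cleaned.tolist())
```

Note that `fillna` would also fill rows that are genuinely missing, so in this notebook the all-missing row would have to be dropped before applying such a helper.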


In [8]:
# Drop the variable PLAYING ROLE as we already have the equivalent dummies BAT, BOW and ALL
ipldata.drop('PLAYING ROLE', axis=1, inplace=True)

# Drop the variables INDIA, AUSTRALIA, OTHERS as we can create dummies from Country
ipldata.drop(['INDIA','AUSTRALIA','OTHERS'], axis=1, inplace=True)

# Drop PLAYER NAME as this indicates only unique values
ipldata.drop('PLAYER NAME', axis=1, inplace=True)

# Drop Year and Year_Dummy as this is not time series data
ipldata.drop(['Year','Year_Dummy'], axis=1, inplace=True)

In [9]:
# Check data
ipldata.head()

Unnamed: 0,Country,Team,BAT,BOW,ALL,BAT-StrikeRate,BOW-Economy,BOW*SR-BL,BAT*RUN-S,BOW*WK-I,BAT*T-RUNS,BAT*ODI-RUNS,BOW*WK-O,Total-RUNS,Total-WKTS,ODI-RUNS,ODI-SR-B,ODI-WKTS,ODI-SR-BL,CAPTAINCY EXP,Highest Score,AVERAGE RUNS,SR -B,SIXERS,AVE-BL,ECON,SR -BL,Base_Price,SoldPrice
0,SA,KXIP,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,20.47,8.9,13.93,50000.0,50000.0
1,BAN,RCB,0.0,1.0,0.0,0.0,14.5,0.0,0.0,0.0,0.0,0.0,185.0,214.0,18.0,657.0,71.41,185.0,37.6,0.0,0.0,0.0,0.0,0.0,0.0,14.5,0.0,50000.0,50000.0
2,IND,KKR,0.0,1.0,0.0,0.0,8.81,24.9,0.0,29.0,0.0,0.0,288.0,571.0,58.0,1269.0,80.62,288.0,32.9,0.0,39.0,18.56,121.01,5.0,36.52,8.81,24.9,200000.0,350000.0
3,IND,CSK,0.0,1.0,0.0,0.0,6.23,22.14,0.0,49.0,0.0,0.0,51.0,284.0,31.0,241.0,84.56,51.0,36.8,0.0,11.0,5.8,76.32,0.0,22.96,6.23,22.14,100000.0,850000.0
4,IND,CSK,1.0,0.0,0.0,120.71,0.0,0.0,1317.0,0.0,63.0,79.0,0.0,63.0,0.0,79.0,45.93,0.0,0.0,0.0,71.0,32.93,120.71,28.0,0.0,0.0,0.0,100000.0,800000.0


In [10]:
# Check missing values
ipldata.isnull().sum()

Country           1
Team              1
BAT               1
BOW               1
ALL               1
BAT-StrikeRate    1
BOW-Economy       1
BOW*SR-BL         1
BAT*RUN-S         1
BOW*WK-I          1
BAT*T-RUNS        1
BAT*ODI-RUNS      1
BOW*WK-O          0
Total-RUNS        1
Total-WKTS        1
ODI-RUNS          1
ODI-SR-B          1
ODI-WKTS          1
ODI-SR-BL         1
CAPTAINCY EXP     1
Highest Score     1
AVERAGE RUNS      1
SR -B             1
SIXERS            1
AVE-BL            1
ECON              1
SR -BL            1
Base_Price        1
SoldPrice         1
dtype: int64

In [11]:
# Check missing data
ipldata[ipldata['Country'].isnull()]

Unnamed: 0,Country,Team,BAT,BOW,ALL,BAT-StrikeRate,BOW-Economy,BOW*SR-BL,BAT*RUN-S,BOW*WK-I,BAT*T-RUNS,BAT*ODI-RUNS,BOW*WK-O,Total-RUNS,Total-WKTS,ODI-RUNS,ODI-SR-B,ODI-WKTS,ODI-SR-BL,CAPTAINCY EXP,Highest Score,AVERAGE RUNS,SR -B,SIXERS,AVE-BL,ECON,SR -BL,Base_Price,SoldPrice
130,,,,,,,,,,,,,-0.11979,,,,,,,,,,,,,,,,


In [12]:
# Drop the row which has values missing for most of the variables
ipldata.dropna(inplace=True)

In [13]:
# Check missing values
ipldata.isnull().sum()

Country           0
Team              0
BAT               0
BOW               0
ALL               0
BAT-StrikeRate    0
BOW-Economy       0
BOW*SR-BL         0
BAT*RUN-S         0
BOW*WK-I          0
BAT*T-RUNS        0
BAT*ODI-RUNS      0
BOW*WK-O          0
Total-RUNS        0
Total-WKTS        0
ODI-RUNS          0
ODI-SR-B          0
ODI-WKTS          0
ODI-SR-BL         0
CAPTAINCY EXP     0
Highest Score     0
AVERAGE RUNS      0
SR -B             0
SIXERS            0
AVE-BL            0
ECON              0
SR -BL            0
Base_Price        0
SoldPrice         0
dtype: int64

In [14]:
# Check data types
ipldata.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 130 entries, 0 to 129
Data columns (total 29 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Country         130 non-null    object 
 1   Team            130 non-null    object 
 2   BAT             130 non-null    float64
 3   BOW             130 non-null    float64
 4   ALL             130 non-null    float64
 5   BAT-StrikeRate  130 non-null    float64
 6   BOW-Economy     130 non-null    float64
 7   BOW*SR-BL       130 non-null    float64
 8   BAT*RUN-S       130 non-null    float64
 9   BOW*WK-I        130 non-null    float64
 10  BAT*T-RUNS      130 non-null    float64
 11  BAT*ODI-RUNS    130 non-null    float64
 12  BOW*WK-O        130 non-null    float64
 13  Total-RUNS      130 non-null    float64
 14  Total-WKTS      130 non-null    float64
 15  ODI-RUNS        130 non-null    float64
 16  ODI-SR-B        130 non-null    float64
 17  ODI-WKTS        130 non-null    flo

In [15]:
# Check correlation (copy the slice so that later in-place drops on X do not act on a view of ipldata)
X = ipldata[['BAT-StrikeRate','BOW-Economy','BOW*SR-BL','BAT*RUN-S','BOW*WK-I','BAT*T-RUNS',
             'BAT*ODI-RUNS','BOW*WK-O','Total-RUNS','Total-WKTS','ODI-RUNS','ODI-SR-B','ODI-WKTS',
             'ODI-SR-BL','Highest Score','AVERAGE RUNS','SR -B','SIXERS','AVE-BL','ECON','SR -BL','Base_Price']].copy()

X.corr()

Unnamed: 0,BAT-StrikeRate,BOW-Economy,BOW*SR-BL,BAT*RUN-S,BOW*WK-I,BAT*T-RUNS,BAT*ODI-RUNS,BOW*WK-O,Total-RUNS,Total-WKTS,ODI-RUNS,ODI-SR-B,ODI-WKTS,ODI-SR-BL,Highest Score,AVERAGE RUNS,SR -B,SIXERS,AVE-BL,ECON,SR -BL,Base_Price
BAT-StrikeRate,1.0,-0.549973,-0.5089,0.771467,-0.418884,0.542111,0.595825,-0.35403,0.375961,-0.3398,0.391639,0.13274,-0.453005,-0.255377,0.583224,0.577925,0.308517,0.449078,-0.53049,-0.459459,-0.562355,0.181481
BOW-Economy,-0.549973,1.0,0.877116,-0.415902,0.669939,-0.334367,-0.359769,0.603923,-0.351615,0.361243,-0.428473,-0.190597,0.340591,0.06465,-0.637631,-0.662245,-0.409886,-0.460393,0.201694,0.261582,0.217433,-0.137186
BOW*SR-BL,-0.5089,0.877116,1.0,-0.384841,0.611999,-0.309396,-0.332901,0.561518,-0.319287,0.408115,-0.398401,-0.157422,0.317564,0.09217,-0.573183,-0.566033,-0.334855,-0.421283,0.325248,0.205042,0.345214,-0.088838
BAT*RUN-S,0.771467,-0.415902,-0.384841,1.0,-0.316769,0.487017,0.607285,-0.267725,0.359312,-0.250952,0.444147,0.207016,-0.306087,-0.090818,0.665869,0.607777,0.259198,0.648826,-0.286063,-0.289555,-0.312284,0.195136
BOW*WK-I,-0.418884,0.669939,0.611999,-0.316769,1.0,-0.254669,-0.274016,0.453268,-0.263014,0.321507,-0.32885,-0.147167,0.253417,0.055424,-0.428023,-0.486015,-0.18548,-0.330117,0.072369,0.128831,0.122756,-0.054704
BAT*T-RUNS,0.542111,-0.334367,-0.309396,0.487017,-0.254669,1.0,0.927754,-0.21524,0.881066,-0.190458,0.772119,0.176929,-0.225434,0.024406,0.359836,0.317152,0.075728,0.160287,-0.426511,-0.398001,-0.442381,0.362244
BAT*ODI-RUNS,0.595825,-0.359769,-0.332901,0.607285,-0.274016,0.927754,1.0,-0.231591,0.803746,-0.204617,0.832443,0.210423,-0.221943,0.057811,0.397051,0.358096,0.102759,0.287236,-0.39077,-0.351078,-0.404528,0.335128
BOW*WK-O,-0.35403,0.603923,0.561518,-0.267725,0.453268,-0.21524,-0.231591,1.0,-0.146392,0.791048,-0.234878,-0.0219,0.765327,0.030185,-0.379656,-0.399969,-0.198065,-0.281898,0.100125,0.137833,0.137596,0.085025
Total-RUNS,0.375961,-0.351615,-0.319287,0.359312,-0.263014,0.881066,0.803746,-0.146392,1.0,0.026285,0.892823,0.231411,0.045505,0.0677,0.411209,0.374046,0.114298,0.216571,-0.298999,-0.329022,-0.309105,0.437984
Total-WKTS,-0.3398,0.361243,0.408115,-0.250952,0.321507,-0.190458,-0.204617,0.791048,0.026285,1.0,-0.088276,0.012052,0.82294,0.060641,-0.268432,-0.26554,-0.147752,-0.198036,0.162456,0.11753,0.205208,0.216648


**BAT-StrikeRate is strongly correlated with BAT*RUN-S**<br>
**BOW-Economy is strongly correlated with BOW*SR-BL**<br>
**BAT*T-RUNS is strongly correlated with BAT*ODI-RUNS, Total-RUNS, ODI-RUNS**<br>
**BAT*ODI-RUNS is strongly correlated with Total-RUNS, ODI-RUNS**<br>
**BOW*WK-O is strongly correlated with Total-WKTS**<br>
**Total-RUNS is strongly correlated with ODI-RUNS**<br>
**Total-WKTS is strongly correlated with ODI-WKTS**
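
The pairs listed above were read off the correlation matrix by eye. A minimal sketch of how such pairs could be extracted programmatically; the helper `strong_pairs`, the 0.75 cut-off and the toy data are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

def strong_pairs(df, threshold=0.75):
    """Return column pairs whose absolute correlation is at least `threshold`."""
    corr = df.corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is excluded
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    pairs = pairs[pairs.abs() >= threshold]
    return pairs.sort_values(key=np.abs, ascending=False)

# Toy data: 'b' is almost a copy of 'a', 'c' is unrelated
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({'a': a,
                    'b': a + rng.normal(scale=0.05, size=200),
                    'c': rng.normal(size=200)})
pairs = strong_pairs(toy)
print(pairs)
```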

In [16]:
# Let's check correlation with Y
Y = ipldata['SoldPrice']
X.corrwith(Y)

BAT-StrikeRate    0.213047
BOW-Economy      -0.195143
BOW*SR-BL        -0.136175
BAT*RUN-S         0.403600
BOW*WK-I         -0.023191
BAT*T-RUNS        0.153314
BAT*ODI-RUNS      0.258153
BOW*WK-O         -0.079371
Total-RUNS        0.216752
Total-WKTS        0.035767
ODI-RUNS          0.337834
ODI-SR-B          0.226880
ODI-WKTS          0.112327
ODI-SR-BL         0.075408
Highest Score     0.347473
AVERAGE RUNS      0.396519
SR -B             0.184278
SIXERS            0.450609
AVE-BL            0.128406
ECON              0.040679
SR -BL            0.118296
Base_Price        0.523510
dtype: float64

In [17]:
# Check multi-collinearity
from statsmodels.stats.outliers_influence import variance_inflation_factor
pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns)

BAT-StrikeRate     7.957565
BOW-Economy       11.787684
BOW*SR-BL          9.741185
BAT*RUN-S          6.762779
BOW*WK-I           2.944498
BAT*T-RUNS        47.073733
BAT*ODI-RUNS      40.844943
BOW*WK-O           8.576096
Total-RUNS        44.090132
Total-WKTS         8.786257
ODI-RUNS          40.780446
ODI-SR-B          12.649523
ODI-WKTS          16.305937
ODI-SR-BL          4.051487
Highest Score     20.645394
AVERAGE RUNS      24.418024
SR -B             17.132351
SIXERS             5.809968
AVE-BL            85.341329
ECON               7.481379
SR -BL            90.422077
Base_Price         4.270887
dtype: float64

In [18]:
# Drop SR -BL
X.drop('SR -BL', axis=1, inplace=True)

pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns)

BAT-StrikeRate     7.883348
BOW-Economy       11.324718
BOW*SR-BL          9.531088
BAT*RUN-S          6.688237
BOW*WK-I           2.745517
BAT*T-RUNS        47.014302
BAT*ODI-RUNS      40.218900
BOW*WK-O           8.537238
Total-RUNS        44.081797
Total-WKTS         8.785422
ODI-RUNS          40.311444
ODI-SR-B          12.587140
ODI-WKTS          16.029001
ODI-SR-BL          4.035007
Highest Score     20.631856
AVERAGE RUNS      24.347753
SR -B             17.087321
SIXERS             5.766376
AVE-BL             9.143500
ECON               7.280719
Base_Price         4.270796
dtype: float64

In [19]:
# Drop BAT*T-RUNS
X.drop('BAT*T-RUNS', axis=1, inplace=True)

pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns)

BAT-StrikeRate     7.882938
BOW-Economy       11.105852
BOW*SR-BL          9.422278
BAT*RUN-S          6.677685
BOW*WK-I           2.742504
BAT*ODI-RUNS      14.308190
BOW*WK-O           8.291140
Total-RUNS        11.688247
Total-WKTS         6.806247
ODI-RUNS          23.636474
ODI-SR-B          12.510476
ODI-WKTS          15.938665
ODI-SR-BL          4.034056
Highest Score     19.838509
AVERAGE RUNS      23.565610
SR -B             17.058949
SIXERS             5.539704
AVE-BL             9.128422
ECON               7.280714
Base_Price         4.199901
dtype: float64

In [20]:
# Drop ODI-RUNS

X.drop('ODI-RUNS', axis=1, inplace=True)

pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns)

BAT-StrikeRate     7.866151
BOW-Economy       11.104513
BOW*SR-BL          9.422032
BAT*RUN-S          6.480849
BOW*WK-I           2.735464
BAT*ODI-RUNS       8.481318
BOW*WK-O           6.730890
Total-RUNS         8.094333
Total-WKTS         6.016545
ODI-SR-B          12.504069
ODI-WKTS           8.813900
ODI-SR-BL          4.025197
Highest Score     19.774660
AVERAGE RUNS      23.547751
SR -B             16.812647
SIXERS             5.132494
AVE-BL             9.002543
ECON               7.223823
Base_Price         4.163775
dtype: float64

In [21]:
# Drop AVERAGE RUNS

X.drop('AVERAGE RUNS', axis=1, inplace=True)

pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns)

BAT-StrikeRate     7.789269
BOW-Economy       11.031001
BOW*SR-BL          9.381858
BAT*RUN-S          6.450370
BOW*WK-I           2.651644
BAT*ODI-RUNS       8.338151
BOW*WK-O           6.730854
Total-RUNS         8.044222
Total-WKTS         6.016528
ODI-SR-B          12.503062
ODI-WKTS           8.773022
ODI-SR-BL          4.021266
Highest Score     13.907122
SR -B             14.160962
SIXERS             5.128059
AVE-BL             8.991103
ECON               7.203113
Base_Price         3.727408
dtype: float64

In [22]:
# Drop SR -B

X.drop('SR -B', axis=1, inplace=True)

pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns)

BAT-StrikeRate     6.517487
BOW-Economy       11.029151
BOW*SR-BL          9.365917
BAT*RUN-S          6.013853
BOW*WK-I           2.527764
BAT*ODI-RUNS       8.336406
BOW*WK-O           6.666888
Total-RUNS         7.998638
Total-WKTS         5.988770
ODI-SR-B          10.312665
ODI-WKTS           8.462479
ODI-SR-BL          4.015211
Highest Score     12.002317
SIXERS             5.114634
AVE-BL             8.864859
ECON               7.124212
Base_Price         3.726019
dtype: float64

In [23]:
# Drop Highest Score

X.drop('Highest Score', axis=1, inplace=True)

pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns)

BAT-StrikeRate     6.017362
BOW-Economy       10.750630
BOW*SR-BL          9.363621
BAT*RUN-S          5.793056
BOW*WK-I           2.483556
BAT*ODI-RUNS       7.322424
BOW*WK-O           6.648929
Total-RUNS         6.715190
Total-WKTS         5.807526
ODI-SR-B           8.899415
ODI-WKTS           8.448732
ODI-SR-BL          4.015096
SIXERS             3.623741
AVE-BL             8.824890
ECON               7.121360
Base_Price         3.723156
dtype: float64

In [24]:
# Drop BOW-Economy

X.drop('BOW-Economy', axis=1, inplace=True)

pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns)

BAT-StrikeRate    5.863950
BOW*SR-BL         3.751659
BAT*RUN-S         5.735603
BOW*WK-I          2.325538
BAT*ODI-RUNS      7.209230
BOW*WK-O          6.014614
Total-RUNS        6.633319
Total-WKTS        5.413687
ODI-SR-B          8.781601
ODI-WKTS          8.446008
ODI-SR-BL         4.005940
SIXERS            3.613541
AVE-BL            7.738100
ECON              5.908294
Base_Price        3.723128
dtype: float64

In [25]:
# Drop ODI-SR-B

X.drop('ODI-SR-B', axis=1, inplace=True)

pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns)

BAT-StrikeRate    4.514324
BOW*SR-BL         3.595894
BAT*RUN-S         5.601481
BOW*WK-I          2.301478
BAT*ODI-RUNS      7.195524
BOW*WK-O          5.960416
Total-RUNS        6.610138
Total-WKTS        5.169715
ODI-WKTS          7.387124
ODI-SR-BL         3.366340
SIXERS            3.367868
AVE-BL            7.730528
ECON              5.833009
Base_Price        3.581120
dtype: float64

In [26]:
# Drop AVE-BL

X.drop('AVE-BL', axis=1, inplace=True)

pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns)

BAT-StrikeRate    4.278687
BOW*SR-BL         2.961866
BAT*RUN-S         5.579382
BOW*WK-I          2.129500
BAT*ODI-RUNS      6.980603
BOW*WK-O          5.808874
Total-RUNS        6.585578
Total-WKTS        5.089581
ODI-WKTS          7.365987
ODI-SR-BL         2.952454
SIXERS            3.151608
ECON              3.095869
Base_Price        3.525989
dtype: float64

In [27]:
# Drop ODI-WKTS

X.drop('ODI-WKTS', axis=1, inplace=True)

pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns)

BAT-StrikeRate    4.057278
BOW*SR-BL         2.869803
BAT*RUN-S         5.579053
BOW*WK-I          2.120486
BAT*ODI-RUNS      6.687287
BOW*WK-O          4.288356
Total-RUNS        5.876572
Total-WKTS        4.210100
ODI-SR-BL         2.927462
SIXERS            3.056741
ECON              2.857914
Base_Price        3.482559
dtype: float64

In [28]:
# Drop Total-RUNS

X.drop('Total-RUNS', axis=1, inplace=True)

pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns)

BAT-StrikeRate    4.055904
BOW*SR-BL         2.860674
BAT*RUN-S         5.170728
BOW*WK-I          2.117860
BAT*ODI-RUNS      2.656591
BOW*WK-O          4.087347
Total-WKTS        3.643909
ODI-SR-BL         2.915105
SIXERS            2.965154
ECON              2.845297
Base_Price        3.221692
dtype: float64

In [29]:
# Drop BAT*RUN-S

X.drop('BAT*RUN-S', axis=1, inplace=True)

pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns)

BAT-StrikeRate    2.834855
BOW*SR-BL         2.851695
BOW*WK-I          2.117744
BAT*ODI-RUNS      2.353972
BOW*WK-O          4.081675
Total-WKTS        3.642266
ODI-SR-BL         2.909610
SIXERS            2.109578
ECON              2.840340
Base_Price        3.158883
dtype: float64

**VIFs are below 5 now**
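
The manual drop-and-recompute steps above can be sketched as a loop. This is a minimal sketch rather than the notebook's code: `vif_table` and `drop_high_vif` are hypothetical helpers, the threshold of 5 is taken from the note above, and the VIF is computed here with NumPy from a regression that includes an intercept. The numbers will therefore not match the `variance_inflation_factor` values above, which were computed on `X` without a constant column.

```python
import numpy as np
import pandas as pd

def vif_table(df):
    """VIF per column: 1 / (1 - R^2) from regressing it on the other columns plus an intercept."""
    values = df.to_numpy(dtype=float)
    vifs = {}
    for i, col in enumerate(df.columns):
        y = values[:, i]
        others = np.delete(values, i, axis=1)
        A = np.column_stack([np.ones(len(y)), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        vifs[col] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return pd.Series(vifs)

def drop_high_vif(df, threshold=5.0):
    """Repeatedly drop the column with the largest VIF until all VIFs are <= threshold."""
    df = df.copy()
    dropped = []
    while df.shape[1] > 1:
        vifs = vif_table(df)
        if vifs.max() <= threshold:
            break
        worst = vifs.idxmax()
        dropped.append(worst)
        df = df.drop(columns=worst)
    return df, dropped

# Toy data: 'a_copy' nearly duplicates 'a', so one of the two should be removed
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({'a': a,
                    'b': rng.normal(size=200),
                    'a_copy': a + rng.normal(scale=0.01, size=200)})
reduced, dropped = drop_high_vif(toy)
print(dropped, list(reduced.columns))
```

Always removing the single worst column is one possible rule; domain knowledge about which of two collinear statistics is more meaningful can justify a different choice.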

In [30]:
# Check value counts for categorical variables
catvar = ['Country','Team']
for var in catvar:
    print(ipldata[var].value_counts(normalize=True)*100)
    print()

IND    40.769231
AUS    16.923077
SA     12.307692
SL      9.230769
PAK     6.923077
NZ      5.384615
WI      4.615385
ENG     2.307692
BAN     0.769231
ZIM     0.769231
Name: Country, dtype: float64

CSK      10.769231
RCB+      9.230769
KKR+      9.230769
DD+       7.692308
DC+       7.692308
RR+       6.923077
RCB       6.923077
DC        5.384615
KXIP+     5.384615
MI        4.615385
DD        4.615385
MI+       4.615385
RR        4.615385
CSK+      3.846154
KKR       3.846154
KXIP      3.846154
KXI+      0.769231
Name: Team, dtype: float64



In [31]:
# Let's group the less frequent categories into 'Others'
ipldata['Country'] = ipldata['Country'].replace(['PAK','NZ','WI','ENG','BAN','ZIM'], 'Others')
ipldata['Team'] = ipldata['Team'].replace(['DC','KXIP+','MI','DD','MI+','RR','CSK+','KKR','KXIP','KXI+'], 'Others')

In [32]:
# Check value count after replacement
catvar = ['Country','Team']
for var in catvar:
    print(ipldata[var].value_counts(normalize=True)*100)
    print()

IND       40.769231
Others    20.769231
AUS       16.923077
SA        12.307692
SL         9.230769
Name: Country, dtype: float64

Others    41.538462
CSK       10.769231
RCB+       9.230769
KKR+       9.230769
DD+        7.692308
DC+        7.692308
RCB        6.923077
RR+        6.923077
Name: Team, dtype: float64
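
The grouping above lists the rare categories by hand. A minimal sketch of a frequency-based alternative; the helper `lump_rare`, the 7% cut-off and the toy series are assumptions for illustration and do not reproduce the exact grouping chosen above:

```python
import pandas as pd

def lump_rare(series, min_share=0.07, other_label='Others'):
    """Replace categories whose relative frequency is below min_share with other_label."""
    shares = series.value_counts(normalize=True)
    rare = shares[shares < min_share].index
    return series.where(~series.isin(rare), other_label)

# Toy column: BAN and ZIM each make up about 6% of the rows
toy = pd.Series(['IND'] * 8 + ['AUS'] * 4 + ['SA'] * 3 + ['BAN', 'ZIM'])
lumped = lump_rare(toy)
print(lumped.value_counts())
```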



In [33]:
# Create dummies
Country_dummy = pd.get_dummies(ipldata['Country'], prefix='Cntry', drop_first=True)
Team_dummy = pd.get_dummies(ipldata['Team'], prefix='Team', drop_first=True)

In [34]:
# Merge the data
ipldata = pd.concat([Country_dummy,Team_dummy, ipldata], axis=1)
ipldata.head()

Unnamed: 0,Cntry_IND,Cntry_Others,Cntry_SA,Cntry_SL,Team_DC+,Team_DD+,Team_KKR+,Team_Others,Team_RCB,Team_RCB+,Team_RR+,Country,Team,BAT,BOW,ALL,BAT-StrikeRate,BOW-Economy,BOW*SR-BL,BAT*RUN-S,BOW*WK-I,BAT*T-RUNS,BAT*ODI-RUNS,BOW*WK-O,Total-RUNS,Total-WKTS,ODI-RUNS,ODI-SR-B,ODI-WKTS,ODI-SR-BL,CAPTAINCY EXP,Highest Score,AVERAGE RUNS,SR -B,SIXERS,AVE-BL,ECON,SR -BL,Base_Price,SoldPrice
0,0,0,1,0,0,0,0,1,0,0,0,SA,Others,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,20.47,8.9,13.93,50000.0,50000.0
1,0,1,0,0,0,0,0,0,1,0,0,Others,RCB,0.0,1.0,0.0,0.0,14.5,0.0,0.0,0.0,0.0,0.0,185.0,214.0,18.0,657.0,71.41,185.0,37.6,0.0,0.0,0.0,0.0,0.0,0.0,14.5,0.0,50000.0,50000.0
2,1,0,0,0,0,0,0,1,0,0,0,IND,Others,0.0,1.0,0.0,0.0,8.81,24.9,0.0,29.0,0.0,0.0,288.0,571.0,58.0,1269.0,80.62,288.0,32.9,0.0,39.0,18.56,121.01,5.0,36.52,8.81,24.9,200000.0,350000.0
3,1,0,0,0,0,0,0,0,0,0,0,IND,CSK,0.0,1.0,0.0,0.0,6.23,22.14,0.0,49.0,0.0,0.0,51.0,284.0,31.0,241.0,84.56,51.0,36.8,0.0,11.0,5.8,76.32,0.0,22.96,6.23,22.14,100000.0,850000.0
4,1,0,0,0,0,0,0,0,0,0,0,IND,CSK,1.0,0.0,0.0,120.71,0.0,0.0,1317.0,0.0,63.0,79.0,0.0,63.0,0.0,79.0,45.93,0.0,0.0,0.0,71.0,32.93,120.71,28.0,0.0,0.0,0.0,100000.0,800000.0


In [35]:
# Let's drop Country and Team, from which the dummies were created
ipldata.drop(['Country','Team'], axis=1, inplace=True)

In [36]:
# Let's check CAPTAINCY EXP
ipldata['CAPTAINCY EXP'].sample(5)

101    0.0
34     0.0
4      0.0
90     0.0
56     0.0
Name: CAPTAINCY EXP, dtype: float64

In [37]:
# Let's set the data type for the dummy variables BAT, BOW, ALL, CAPTAINCY EXP
ipldata[['BAT','BOW','ALL','CAPTAINCY EXP']] = ipldata[['BAT','BOW','ALL','CAPTAINCY EXP']].astype(int)

In [38]:
# Check data
ipldata.head()

Unnamed: 0,Cntry_IND,Cntry_Others,Cntry_SA,Cntry_SL,Team_DC+,Team_DD+,Team_KKR+,Team_Others,Team_RCB,Team_RCB+,Team_RR+,BAT,BOW,ALL,BAT-StrikeRate,BOW-Economy,BOW*SR-BL,BAT*RUN-S,BOW*WK-I,BAT*T-RUNS,BAT*ODI-RUNS,BOW*WK-O,Total-RUNS,Total-WKTS,ODI-RUNS,ODI-SR-B,ODI-WKTS,ODI-SR-BL,CAPTAINCY EXP,Highest Score,AVERAGE RUNS,SR -B,SIXERS,AVE-BL,ECON,SR -BL,Base_Price,SoldPrice
0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,20.47,8.9,13.93,50000.0,50000.0
1,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0.0,14.5,0.0,0.0,0.0,0.0,0.0,185.0,214.0,18.0,657.0,71.41,185.0,37.6,0,0.0,0.0,0.0,0.0,0.0,14.5,0.0,50000.0,50000.0
2,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0.0,8.81,24.9,0.0,29.0,0.0,0.0,288.0,571.0,58.0,1269.0,80.62,288.0,32.9,0,39.0,18.56,121.01,5.0,36.52,8.81,24.9,200000.0,350000.0
3,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0.0,6.23,22.14,0.0,49.0,0.0,0.0,51.0,284.0,31.0,241.0,84.56,51.0,36.8,0,11.0,5.8,76.32,0.0,22.96,6.23,22.14,100000.0,850000.0
4,1,0,0,0,0,0,0,0,0,0,0,1,0,0,120.71,0.0,0.0,1317.0,0.0,63.0,79.0,0.0,63.0,0.0,79.0,45.93,0.0,0.0,0,71.0,32.93,120.71,28.0,0.0,0.0,0.0,100000.0,800000.0


In [39]:
# Create X and Y
Y = ipldata['SoldPrice']
X = ipldata.drop(['SoldPrice','SR -BL','BAT*T-RUNS','ODI-RUNS','AVERAGE RUNS','SR -B','Highest Score', 
                  'BOW-Economy','ODI-SR-B','AVE-BL','ODI-WKTS','Total-RUNS','BAT*RUN-S'], axis=1)

X = sm.add_constant(X)
model1 = sm.OLS(Y,X).fit()
model1.summary()

0,1,2,3
Dep. Variable:,SoldPrice,R-squared:,0.563
Model:,OLS,Adj. R-squared:,0.463
Method:,Least Squares,F-statistic:,5.642
Date:,"Wed, 01 Jun 2022",Prob (F-statistic):,1.88e-10
Time:,00:09:19,Log-Likelihood:,-1809.2
No. Observations:,130,AIC:,3668.0
Df Residuals:,105,BIC:,3740.0
Df Model:,24,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-2.988e+04,1.2e+05,-0.250,0.803,-2.67e+05,2.08e+05
Cntry_IND,2.596e+05,8.2e+04,3.166,0.002,9.7e+04,4.22e+05
Cntry_Others,-1.127e+05,9.44e+04,-1.194,0.235,-3e+05,7.44e+04
Cntry_SA,1.807e+04,1.04e+05,0.173,0.863,-1.89e+05,2.25e+05
Cntry_SL,-5.6e+04,1.15e+05,-0.488,0.627,-2.84e+05,1.72e+05
Team_DC+,-1.258e+04,1.3e+05,-0.097,0.923,-2.7e+05,2.45e+05
Team_DD+,-3.016e+04,1.28e+05,-0.235,0.814,-2.84e+05,2.24e+05
Team_KKR+,-2.748e+04,1.2e+05,-0.228,0.820,-2.66e+05,2.11e+05
Team_Others,7.507e+04,9.67e+04,0.776,0.439,-1.17e+05,2.67e+05

0,1,2,3
Omnibus:,21.993,Durbin-Watson:,1.799
Prob(Omnibus):,0.0,Jarque-Bera (JB):,28.422
Skew:,0.949,Prob(JB):,6.74e-07
Kurtosis:,4.283,Cond. No.,4890000000000000.0
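
In the summary above, `Df Model` is 24 although 25 predictors plus a constant were passed, and the condition number is about 4.9e15. Both suggest that one column of `X` is an exact linear combination of the others. A minimal sketch of how such a dependency could be checked on a toy design matrix; the idea that the role dummies `BAT`, `BOW` and `ALL` add up to the constant is only an illustrative assumption and would need to be verified on the real data:

```python
import numpy as np
import pandas as pd

def rank_report(design):
    """Return (matrix rank, number of columns) of a design matrix."""
    values = design.to_numpy(dtype=float)
    return int(np.linalg.matrix_rank(values)), values.shape[1]

# Toy design: const = BAT + BOW + ALL in every row, so one column is redundant
toy = pd.DataFrame({'const': 1.0,
                    'BAT': [1, 0, 0, 1, 0, 0],
                    'BOW': [0, 1, 0, 0, 1, 0],
                    'ALL': [0, 0, 1, 0, 0, 1],
                    'x':   [3.0, 1.0, 4.0, 1.0, 5.0, 9.0]})
rank, n_cols = rank_report(toy)
print(rank, n_cols)            # rank is smaller than the number of columns

# Dropping one dummy as the baseline category removes the dependency
fixed = toy.drop(columns='ALL')
fixed_rank, fixed_cols = rank_report(fixed)
print(fixed_rank, fixed_cols)
```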


In [40]:
# Let's keep only the variables significant at the 5% level
model1.pvalues[model1.pvalues < 0.05]

Cntry_IND     2.024562e-03
SIXERS        5.459499e-04
Base_Price    1.369116e-08
dtype: float64

**Regression Eq (terms significant at the 5% level; the intercept and the other model1 terms are omitted):**<br>
SoldPrice = 259600 * Cntry_IND + 5265.35 * SIXERS + 1.3607 * Base_Price
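
As a worked example, the equation can be evaluated for the Indian batsman in row 4 of the data shown earlier (28 sixes, base price 100000, sold price 800000). This is a rough sketch: the equation keeps only three terms of `model1`, so the result is not the same as `model1.predict` for that row.

```python
# Coefficients copied from the regression equation above
coefficients = {'Cntry_IND': 259600.0, 'SIXERS': 5265.35, 'Base_Price': 1.3607}

# Values for row 4 of the data (Cntry_IND = 1 because Country is IND)
player = {'Cntry_IND': 1, 'SIXERS': 28.0, 'Base_Price': 100000.0}

# 259600 * 1 + 5265.35 * 28 + 1.3607 * 100000
estimate = sum(coefficients[name] * player[name] for name in coefficients)
print(round(estimate, 2))
```

The three-term estimate comes to roughly 543100, compared with the recorded sold price of 800000 for this player.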

In [41]:
# Check RMSE (in-sample, using all predictors in model1)
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

np.sqrt(mean_squared_error(Y, model1.predict(X))).round(2)

267815.35
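
The RMSE above is computed on the same 130 rows the model was fitted on, so it is likely to understate the error on unseen players. A minimal sketch of the RMSE formula and of a hold-out comparison, written with NumPy on synthetic data; `rmse`, `holdout_rmse`, the 25% test share and the toy data are assumptions for illustration:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def holdout_rmse(X, y, test_fraction=0.25, seed=0):
    """Fit OLS with intercept on a random training split; return (train RMSE, test RMSE)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_test = max(1, int(round(test_fraction * len(y))))
    test, train = idx[:n_test], idx[n_test:]
    A_train = np.column_stack([np.ones(len(train)), X[train]])
    A_test = np.column_stack([np.ones(len(test)), X[test]])
    coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
    return rmse(y[train], A_train @ coef), rmse(y[test], A_test @ coef)

# Synthetic data standing in for the player features and sold price
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(130, 2))
y_toy = 1.5 * X_toy[:, 0] - 0.5 * X_toy[:, 1] + rng.normal(scale=1.0, size=130)
train_rmse, test_rmse = holdout_rmse(X_toy, y_toy)
print(train_rmse, test_rmse)
```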

In [42]:
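# Sequential floating forward selection (forward=True, floating=True)
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

Y = ipldata['SoldPrice']
X = ipldata.drop('SoldPrice', axis=1)

lr = LinearRegression()
# k_features=(1,37): return the best-scoring subset of any size between 1 and all 37 predictors
sfs_forward = SequentialFeatureSelector(lr,
                 k_features=(1,37),
                 forward=True,
                 floating=True,
                 scoring='neg_mean_squared_error',
                 cv=10)
# Store the fitted selector under its own name instead of rebinding the imported class
sfs_fit = sfs_forward.fit(X,Y)
print('Forward Selection Subset:', sfs_fit.k_feature_names_)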

Forward Selection Subset: ('Cntry_IND', 'Cntry_Others', 'BOW*WK-O', 'Total-RUNS', 'ODI-RUNS', 'ODI-WKTS', 'SIXERS', 'Base_Price')
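
What forward selection does at each step can be sketched in plain NumPy. This is a simplified illustration, not mlxtend's algorithm: it has no floating (removal) step, it stops as soon as no candidate improves the cross-validated MSE, and `cv_mse`, `forward_select` and the toy data are hypothetical.

```python
import numpy as np

def cv_mse(X, y, n_folds=5):
    """Mean squared error of an OLS fit with intercept, averaged over contiguous folds."""
    all_rows = np.arange(len(y))
    errors = []
    for test in np.array_split(all_rows, n_folds):
        train = np.setdiff1d(all_rows, test)
        A_train = np.column_stack([np.ones(len(train)), X[train]])
        A_test = np.column_stack([np.ones(len(test)), X[test]])
        coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        errors.append(np.mean((y[test] - A_test @ coef) ** 2))
    return float(np.mean(errors))

def forward_select(X, y, names, n_folds=5):
    """Greedily add the feature that lowers the CV error most; stop when nothing helps."""
    selected, best_score = [], np.inf
    remaining = list(range(X.shape[1]))
    while remaining:
        scores = {j: cv_mse(X[:, selected + [j]], y, n_folds) for j in remaining}
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best_score:
            break
        selected.append(j_best)
        remaining.remove(j_best)
        best_score = scores[j_best]
    return [names[j] for j in selected], best_score

# Toy data: only x0 and x2 actually drive the target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 2] + rng.normal(scale=0.1, size=200)
chosen, score = forward_select(X_toy, y_toy, ['x0', 'x1', 'x2', 'x3', 'x4'])
print(chosen, score)
```

In this sketch the order of `chosen` reflects the order in which features were added, which is not the case for the tuple printed by the selector above.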


1) Fit a linear regression model to predict the sold-price of the player.<br>

**Regression Eq (terms significant at the 5% level; the intercept and the other model1 terms are omitted):**<br>
SoldPrice = 259600 * Cntry_IND + 5265.35 * SIXERS + 1.3607 * Base_Price

2) Use variable reduction techniques covered so far to identify significant variables.<br>

Forward Selection Subset: ('Cntry_IND', 'Cntry_Others', 'BOW*WK-O', 'Total-RUNS', 'ODI-RUNS', 'ODI-WKTS', 'SIXERS', 'Base_Price')

3) What is the RMSE of the model?<br>

267815.35

4) What are the top 5 variables that impact the price of the player?<br>

('Cntry_IND', 'Cntry_Others', 'BOW*WK-O', 'Total-RUNS', 'ODI-RUNS')
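
The five names above are the first five entries of the selected subset, which is listed in the same order as the columns of `X` rather than by strength of effect. One possible way to rank variables by impact is to compare standardized coefficients. A minimal sketch on synthetic data follows; the helper `standardized_coefficients` and the toy features are assumptions, and this is not the method used in the notebook.

```python
import numpy as np

def standardized_coefficients(X, y, names):
    """Fit OLS on z-scored X and y; return (name, beta) pairs sorted by |beta|, largest first."""
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)
    yz = (y - y.mean()) / y.std()
    # After centering, the intercept is zero, so no constant column is needed
    beta, *_ = np.linalg.lstsq(Xz, yz, rcond=None)
    order = np.argsort(-np.abs(beta))
    return [(names[j], float(beta[j])) for j in order]

# Toy features on very different scales: raw coefficients alone would mislead
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(300, 3)) * np.array([1.0, 100.0, 0.01])
y_toy = (2.0 * X_toy[:, 0] + 0.05 * X_toy[:, 1] + 10.0 * X_toy[:, 2]
         + rng.normal(scale=0.5, size=300))
ranking = standardized_coefficients(X_toy, y_toy, ['a', 'b', 'c'])
for name, beta in ranking:
    print(name, round(beta, 3))
```

Here `b` has the smallest raw coefficient (0.05) but the largest standardized effect, because it varies on a much larger scale than `a` and `c`.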