## Concepts covered: 

1. Data pre-processing
2. Model building with statsmodels (OLS)
3. Model building with scikit-learn
4. Cross-validation, bias-variance trade-off
5. Feature Selection (SFS, BE, RFE)
6. Hyperparameter Tuning (GridSearchCV, RandomizedSearchCV, Hyperopt)
7. Optimization (Stochastic gradient descent)

In [None]:
pip install mlxtend

In [None]:
pip install optuna

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
import statsmodels.api as sm
from scipy.stats import boxcox 

from sklearn.linear_model import LinearRegression,Lasso,Ridge,ElasticNet,SGDRegressor
from mlxtend.feature_selection import SequentialFeatureSelector as sfs
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler,PowerTransformer
from sklearn.model_selection import KFold,cross_val_score, GridSearchCV,RandomizedSearchCV
from sklearn.model_selection import train_test_split
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif

from sklearn.metrics import r2_score, mean_squared_error

import optuna 
import optuna.trial._state

The Indian Premier League (IPL) is a professional Twenty20 (T20) cricket league (see Exhibit 1) that started in India in 2008. The IPL was initiated by the BCCI with eight franchises comprising players from across the world. The first IPL auction, held in 2008, sold ownership of the teams for 10 years at a base price of USD 50 million. The franchises acquire players through an English auction that is conducted every year, subject to several rules imposed by the IPL; for example, only international players and popular Indian players are auctioned. A player's performance can be measured through several metrics. Although the IPL follows the Twenty20 format, a player's performance in other formats, such as Test and One-Day matches, could influence pricing: a few players had excellent records in Test matches but unimpressive records in Twenty20 matches. The dataset provides performance metrics for 130 players who played in at least one IPL season (2008-2011).

## About the dataset (IPL Auction data)

**PLAYER NAME**: Name of the player<br>
**AGE**: The age of the player, classified into three categories. Category 1 means the player is less than 25 years old, Category 2 means the player is between 25 and 35 years old, and Category 3 means the player is older than 35.<br>
**COUNTRY**: Country of the player<br>
**PLAYING ROLE**: Player's primary skill<br>
**T-RUNS**: Total runs scored in the test matches<br>
**T-WKTS**: Total wickets taken in the test matches<br>
**ODI-RUNS-S**: Runs scored in One Day Internationals<br>
**ODI-SR-B**: Batting strike rate in One Day Internationals<br>
**ODI-WKTS**: Wickets taken in One Day Internationals<br>
**ODI-SR-BL**: Bowling strike rate in One Day Internationals<br>
**CAPTAINCY EXP**: Captained a team or not<br>
**RUNS-S**: Number of runs scored by a player<br>
**HS**: Highest score by a batsman in IPL<br>
**AVE**: Average runs scored by a batsman in IPL<br>
**SR-B**: Batting strike rate (ratio of the number of runs scored to the number of balls faced) in IPL.<br>
**SIXERS**: Number of six runs scored by a player in IPL.<br>
**RUNS-C**: Number of runs conceded by a player<br>
**WKTS**: Number of wickets taken by a player in IPL.<br>
**AVE-BL**: Bowling average (number of runs conceded / number of wickets taken) in IPL.<br>
**ECON**: Economy rate of a bowler in IPL (number of runs conceded by the bowler per over).<br>
**SR-BL**: Bowling strike rate (ratio of the number of balls bowled to the number of wickets taken) in IPL.<br>
**SOLD PRICE**: Auction price of the player (Target Variable)<br>
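
The rate metrics defined above reduce to simple ratios. The sketch below works them out for a hypothetical player; the numbers are made up and are not taken from the dataset. It assumes the usual cricket conventions that batting strike rate is quoted per 100 balls and that one over is 6 balls.

```python
# Hypothetical career figures, for illustration only (not from the dataset).
runs_scored, balls_faced = 450, 360
runs_conceded, balls_bowled, wickets = 300, 240, 12

# Batting strike rate: runs per ball faced, conventionally quoted per 100 balls.
sr_b = 100 * runs_scored / balls_faced

# Bowling average: runs conceded per wicket taken.
ave_bl = runs_conceded / wickets

# Economy rate: runs conceded per over (one over = 6 balls).
econ = runs_conceded / (balls_bowled / 6)

# Bowling strike rate: balls bowled per wicket taken.
sr_bl = balls_bowled / wickets

print(sr_b, ave_bl, econ, sr_bl)
```

Lower is better for the three bowling metrics, while higher is better for the batting strike rate, which is worth keeping in mind when reading coefficient signs later.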

In [None]:
data=pd.read_csv('IPL_IMB_data.csv')
data.head()

In [None]:
data.info()

In [None]:
data.describe().T

In [None]:
data.describe(include='object')

**Interpretation:** The variables `PLAYER NAME`, `COUNTRY` and `PLAYING ROLE` are categorical. All the remaining variables are numerical. 

From the above output, we see that `AGE` and `CAPTAINCY EXP` are stored as 'int64'.

According to the data definition, however, both are categorical variables, so we will convert their data type to 'object'.

In [None]:
data.AGE.value_counts()

In [None]:
data['CAPTAINCY EXP'].value_counts()

In [None]:
data['PLAYER NAME'].nunique()

In [None]:
data.drop('PLAYER NAME',axis=1,inplace=True)

In [None]:
data['AGE']=data['AGE'].astype('object')
data['CAPTAINCY EXP']=data['CAPTAINCY EXP'].astype('object')

In [None]:
target=data['SOLD PRICE']

In [None]:
num=data.select_dtypes(include=np.number)
cat=data.select_dtypes(include='object')

In [None]:
cat.drop('COUNTRY',axis=1, inplace=True)

In [None]:
#Category encoding
dummy=pd.get_dummies(data=cat,drop_first=True,dtype=float)


In [None]:
dummy

In [None]:
plt.figure(figsize=[15,15])
i=1
for col in num:
    ax=plt.subplot(4,5,i)
    sns.histplot(num[col],kde=True,stat='density')  # distplot is deprecated in seaborn
    i=i+1

In [None]:
num.skew()

In [None]:
# Skew treatment

In [None]:
#pt=PowerTransformer()
#pt_sc=pd.DataFrame(pt.fit_transform(num),columns=num.columns)
#pt_sc.skew()
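
The commented-out cell above points to a power transform as one way to reduce skew. Below is a minimal sketch on synthetic right-skewed columns; the column names and distributions are made up and only stand in for the numeric features.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed columns standing in for the numeric features.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "runs": rng.exponential(scale=500, size=300),
    "wickets": rng.lognormal(mean=2.0, sigma=0.8, size=300),
})

# Yeo-Johnson (the default method) accepts zeros and negative values,
# and the transformer standardizes its output by default.
pt = PowerTransformer(method="yeo-johnson")
demo_pt = pd.DataFrame(pt.fit_transform(demo), columns=demo.columns)

skew_before = demo.skew()
skew_after = demo_pt.skew()
print(pd.DataFrame({"before": skew_before, "after": skew_after}))
```

In a modelling workflow, fitting the transformer on the training split only and then applying it to the test split avoids leaking test information.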

In [None]:
# Capping: clip every numeric column to its 18th and 82nd percentiles.
# Note that this also caps the target, SOLD PRICE.
num=num.copy()  # work on a copy to avoid SettingWithCopyWarning
for col in num.columns:
    lc=num[col].quantile(.18)
    uc=num[col].quantile(.82)
    num[col]=num[col].clip(lower=lc,upper=uc)
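
The cell above caps at fixed percentiles, which changes a large share of the rows. A gentler alternative is to clip only at Tukey's IQR fences, Q1 - 1.5*IQR and Q3 + 1.5*IQR. The helper `cap_iqr` below is a hypothetical name, shown on toy data as a sketch rather than as the notebook's method.

```python
import pandas as pd

def cap_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Toy column with one extreme value (illustrative data only).
toy = pd.Series([10.0, 12.0, 13.0, 14.0, 15.0, 16.0, 18.0, 200.0])
capped = cap_iqr(toy)
print(capped.tolist())
```

Only the extreme value is pulled in; the bulk of the column is left untouched.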

In [None]:
num.skew()

In [None]:
# Correlation between numeric variables
plt.figure(figsize=[10,10])
sns.heatmap(num.corr(),annot=True)

In [None]:
#sns.pairplot(num, kind='kde')

In [None]:
# scaling
sc=StandardScaler()
num_sc=pd.DataFrame(sc.fit_transform(num),columns=num.columns)
num_sc.head()

In [None]:
data_final=pd.concat([num_sc,dummy], axis=1)
data_final.head()

In [None]:
y=data_final['SOLD PRICE']
X=data_final.drop('SOLD PRICE', axis=1)

In [None]:
#OLS
X_c=sm.add_constant(X)
X_train,X_test,y_train,y_test=train_test_split(X_c,y,test_size=.2,random_state=10)
X_train.shape

In [None]:
# Model building 
model1= sm.OLS(y_train,X_train).fit()
print(model1.summary())


In [None]:
model1.resid.skew()

In [None]:
# Residuals vs. fitted values: a funnel shape would indicate heteroscedasticity
sns.scatterplot(x=model1.fittedvalues,y=model1.resid)

In [None]:
X_test.shape

In [None]:
model1.rsquared

In [None]:
y_pred=model1.predict(X_test)
np.sqrt(mean_squared_error(y_test,y_pred))

In [None]:
# A lot of insignificant variables (p-values > .05)
# Assumptions:
   # Multicollinearity is present
   # Some amount of heteroscedasticity is present
   # No autocorrelation
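
The autocorrelation and heteroscedasticity remarks above can be backed by two standard statistics. The sketch below computes the Durbin-Watson statistic and the studentized (Koenker) form of the Breusch-Pagan test directly from their definitions on synthetic data. statsmodels also ships `durbin_watson` (in `statsmodels.stats.stattools`) and `het_breuschpagan` (in `statsmodels.stats.diagnostic`); the hand-rolled version here is only meant to show what they measure.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

# Synthetic data with independent, constant-variance errors (illustrative only).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=1.0, size=300)

resid = y_demo - LinearRegression().fit(X_demo, y_demo).predict(X_demo)

# Durbin-Watson: sum of squared successive differences over sum of squares.
# Values near 2 suggest no first-order autocorrelation.
dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

# Breusch-Pagan (studentized LM form): regress squared residuals on the
# predictors; n * R^2 is asymptotically chi-square with k degrees of freedom
# when the error variance is constant.
sq_resid = resid ** 2
aux_r2 = LinearRegression().fit(X_demo, sq_resid).score(X_demo, sq_resid)
lm_stat = len(resid) * aux_r2
bp_pvalue = stats.chi2.sf(lm_stat, df=X_demo.shape[1])

print(f"Durbin-Watson: {dw:.2f}, Breusch-Pagan p-value: {bp_pvalue:.3f}")
```

A small Breusch-Pagan p-value is evidence against constant error variance. Durbin-Watson is mainly informative when the rows have a natural order, which is an assumption to check for this dataset.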

In [None]:
# VIF to check multicollinearity (aim for VIF < 5)

vif_val=[vif(X.values,i)for i in range(0,X.shape[1])]
VIF=pd.DataFrame()
VIF['feat']=X.columns
VIF['vif']=vif_val

VIF.sort_values('vif',ascending=False)



In [None]:
X=X.drop('AVE-BL',axis=1)

vif_val=[vif(X.values,i)for i in range(0,X.shape[1])]
VIF=pd.DataFrame()
VIF['feat']=X.columns
VIF['vif']=vif_val

VIF.sort_values('vif',ascending=False)

In [None]:
X=X.drop('RUNS-C',axis=1)

vif_val=[vif(X.values,i)for i in range(0,X.shape[1])]
VIF=pd.DataFrame()
VIF['feat']=X.columns
VIF['vif']=vif_val

VIF.sort_values('vif',ascending=False)

In [None]:
X=X.drop('RUNS-S',axis=1)

vif_val=[vif(X.values,i)for i in range(0,X.shape[1])]
VIF=pd.DataFrame()
VIF['feat']=X.columns
VIF['vif']=vif_val

VIF.sort_values('vif',ascending=False)

In [None]:
X=X.drop('HS',axis=1)

vif_val=[vif(X.values,i)for i in range(0,X.shape[1])]
VIF=pd.DataFrame()
VIF['feat']=X.columns
VIF['vif']=vif_val

VIF.sort_values('vif',ascending=False)

In [None]:
X=X.drop('ODI-RUNS-S',axis=1)

vif_val=[vif(X.values,i)for i in range(0,X.shape[1])]
VIF=pd.DataFrame()
VIF['feat']=X.columns
VIF['vif']=vif_val

VIF.sort_values('vif',ascending=False)
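
The drop-and-recompute cells above repeat the same steps, so they can be folded into a loop. The sketch below uses two hypothetical helpers, `vif_table` and `drop_high_vif`, and computes VIF from its definition with scikit-learn (each column regressed on the others, with an intercept). Values may therefore differ from `variance_inflation_factor` applied to a matrix without a constant column, and the columns dropped on the real data may differ from the manual sequence above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def vif_table(df: pd.DataFrame) -> pd.Series:
    """VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing column j on the rest."""
    out = {}
    for col in df.columns:
        others = df.drop(columns=col)
        r2 = LinearRegression().fit(others, df[col]).score(others, df[col])
        out[col] = np.inf if r2 >= 1 else 1.0 / (1.0 - r2)
    return pd.Series(out).sort_values(ascending=False)

def drop_high_vif(df: pd.DataFrame, threshold: float = 5.0) -> pd.DataFrame:
    """Repeatedly drop the column with the largest VIF until all are below threshold."""
    df = df.copy()
    while df.shape[1] > 1:
        vifs = vif_table(df)
        if vifs.iloc[0] < threshold:
            break
        df = df.drop(columns=vifs.index[0])
    return df

# Synthetic example: "c" is almost an exact linear combination of "a" and "b".
rng = np.random.default_rng(2)
demo = pd.DataFrame({"a": rng.normal(size=200), "b": rng.normal(size=200)})
demo["c"] = demo["a"] + demo["b"] + rng.normal(scale=0.05, size=200)
demo["d"] = rng.normal(size=200)

reduced = drop_high_vif(demo)
print(list(reduced.columns))
```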

In [None]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.2,random_state=10)

In [None]:
#sklearn
lr=LinearRegression()
model_lr=lr.fit(X_train,y_train)
y_pred1=model_lr.predict(X_test)
np.sqrt(mean_squared_error(y_test,y_pred1))

In [None]:
# kfold cross validation

kf=KFold(n_splits=5)
score=cross_val_score(lr,X_train,y_train,cv=kf,scoring='r2')
bias=1-score.mean()
var=score.std()/score.mean()
bias,var
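
A variation on the cell above, sketched on synthetic data: the folds are shuffled with a fixed seed, and RMSE is reported alongside R². The bias and variance figures are the same heuristic proxies used above, not a formal decomposition.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for X_train / y_train.
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(120, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=120)

# Shuffling guards against any ordering in the rows; the seed keeps folds reproducible.
kf = KFold(n_splits=5, shuffle=True, random_state=10)
r2_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=kf, scoring="r2")

# 1 - mean(R^2) as a bias proxy; coefficient of variation of R^2 across
# folds as a variance proxy.
bias_proxy = 1 - r2_scores.mean()
variance_proxy = r2_scores.std() / r2_scores.mean()

# scikit-learn negates error metrics so that higher is always better.
rmse_scores = -cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=kf,
    scoring="neg_root_mean_squared_error",
)
print(bias_proxy, variance_proxy, rmse_scores.mean())
```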

In [None]:
score

In [None]:
# Feature selection
# Sequential feature selectors
   # Forward
   # Backward
   # RFE

In [None]:
# forward
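
A minimal sketch of forward selection on synthetic data. It uses scikit-learn's own `SequentialFeatureSelector` as a stand-in for the mlxtend class imported at the top of the notebook; the feature names and the choice of two features to keep are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data: only x0 and x1 drive the target (not the IPL data).
rng = np.random.default_rng(4)
X_demo = pd.DataFrame(rng.normal(size=(200, 6)),
                      columns=[f"x{i}" for i in range(6)])
y_demo = 3 * X_demo["x0"] + 2 * X_demo["x1"] + rng.normal(scale=0.1, size=200)

# Start from no features and greedily add the one that most improves CV R^2.
forward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="forward",
    scoring="r2", cv=5,
)
forward.fit(X_demo, y_demo)
forward_features = list(X_demo.columns[forward.get_support()])
print(forward_features)
```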


In [None]:
# backward FE
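
Backward elimination runs the same greedy search in the opposite direction. The sketch below again uses scikit-learn's `SequentialFeatureSelector` rather than mlxtend, on synthetic data with made-up feature names.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data: only x0 and x1 drive the target (not the IPL data).
rng = np.random.default_rng(4)
X_demo = pd.DataFrame(rng.normal(size=(200, 6)),
                      columns=[f"x{i}" for i in range(6)])
y_demo = 3 * X_demo["x0"] + 2 * X_demo["x1"] + rng.normal(scale=0.1, size=200)

# Start from all features and greedily remove the one whose removal
# hurts CV R^2 the least, until two features remain.
backward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="backward",
    scoring="r2", cv=5,
)
backward.fit(X_demo, y_demo)
backward_features = list(X_demo.columns[backward.get_support()])
print(backward_features)
```

Backward elimination fits more models than forward selection when only a few features are kept, so forward selection is usually cheaper for wide data.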


In [None]:
# RFE
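
A minimal sketch of recursive feature elimination with the `RFE` class imported earlier, on synthetic data. RFE ranks features by coefficient magnitude, so it assumes the features are on comparable scales, as they are after standardization.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: only x0 and x1 drive the target (not the IPL data).
rng = np.random.default_rng(4)
X_demo = pd.DataFrame(rng.normal(size=(200, 6)),
                      columns=[f"x{i}" for i in range(6)])
y_demo = 3 * X_demo["x0"] + 2 * X_demo["x1"] + rng.normal(scale=0.1, size=200)

# Fit, drop the feature with the smallest |coefficient|, and repeat
# until two features remain.
rfe = RFE(estimator=LinearRegression(), n_features_to_select=2)
rfe.fit(X_demo, y_demo)

rfe_features = list(X_demo.columns[rfe.support_])
ranking = pd.Series(rfe.ranking_, index=X_demo.columns)  # 1 = selected
print(rfe_features)
print(ranking.sort_values())
```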


In [None]:
# Regularizations
   # Ridge
   # Lasso 
   # Elastic Net
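
The three penalties can be compared side by side. In the sketch below, on synthetic data where only the first two features matter, the `alpha` and `l1_ratio` values are arbitrary illustrations rather than tuned choices.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

# Synthetic data: only the first two columns drive the target.
rng = np.random.default_rng(5)
X_demo = rng.normal(size=(200, 6))
y_demo = 3 * X_demo[:, 0] + 2 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

models = {
    "ridge": Ridge(alpha=1.0),                           # L2 penalty: shrinks, rarely zeroes
    "lasso": Lasso(alpha=0.1),                           # L1 penalty: can zero coefficients
    "elastic_net": ElasticNet(alpha=0.1, l1_ratio=0.5),  # mix of L1 and L2
}
coefs = {name: m.fit(X_demo, y_demo).coef_ for name, m in models.items()}

for name, c in coefs.items():
    print(name, np.round(c, 3))
```

Lasso sets the coefficients of the irrelevant features exactly to zero, which makes it double as a feature selector, while Ridge keeps every feature with a shrunken weight.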

In [None]:
# Hyper parameter optimization
   # Grid search
   # Random Search 
   # Bayesian Opt

In [None]:
# Grid search
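
A minimal grid-search sketch, assuming Ridge as the model and a small hand-picked `alpha` grid; both are illustrative choices, and the data are synthetic.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic data standing in for X_train / y_train.
rng = np.random.default_rng(6)
X_demo = rng.normal(size=(200, 6))
y_demo = 3 * X_demo[:, 0] + 2 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

# Every alpha in the grid is evaluated with 5-fold cross-validation.
param_grid = {"alpha": [0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(
    Ridge(), param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=10),
    scoring="neg_root_mean_squared_error",
)
grid.fit(X_demo, y_demo)

# With refit=True (the default), best_estimator_ is already refitted on all
# of the data passed to fit, using the best parameters.
best_ridge = grid.best_estimator_
print(grid.best_params_, -grid.best_score_)
```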


In [None]:
#build the model with best params


In [None]:
# Randomized search
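
A minimal randomized-search sketch, assuming ElasticNet as the model. Instead of a fixed grid, parameters are sampled from distributions; the ranges and `n_iter=20` are illustrative choices, and the data are synthetic.

```python
import numpy as np
from scipy.stats import loguniform, uniform
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for X_train / y_train.
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(200, 6))
y_demo = 3 * X_demo[:, 0] + 2 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

param_dist = {
    "alpha": loguniform(1e-3, 1e1),   # sampled on a log scale
    "l1_ratio": uniform(0.1, 0.9),    # uniform on [0.1, 1.0]
}
search = RandomizedSearchCV(
    ElasticNet(max_iter=10_000), param_dist,
    n_iter=20, cv=5, scoring="r2", random_state=10,
)
search.fit(X_demo, y_demo)

# As with grid search, best_estimator_ is refitted with the best parameters.
best_enet = search.best_estimator_
print(search.best_params_, search.best_score_)
```

Randomized search covers wide or continuous ranges at a fixed budget, whereas a grid grows multiplicatively with every added parameter.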


In [None]:
#build the model with best params


In [None]:
# Optimization
   # SGD
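
A minimal sketch of fitting a linear model by stochastic gradient descent with the `SGDRegressor` imported earlier. The data are synthetic, with deliberately different column scales to show why standardization comes first; the hyperparameters are scikit-learn's defaults written out explicitly.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (illustrative only).
rng = np.random.default_rng(8)
X_raw = rng.normal(size=(200, 6)) * np.array([1, 10, 100, 1, 10, 100])
y_demo = 3 * X_raw[:, 0] + 0.2 * X_raw[:, 1] + rng.normal(scale=0.1, size=200)

# SGD takes the same step size along every coordinate, so features
# should be standardized before fitting.
X_sc = StandardScaler().fit_transform(X_raw)

# Each update uses the gradient of the squared error on a single sample.
sgd = SGDRegressor(max_iter=1000, tol=1e-3, random_state=10)
sgd.fit(X_sc, y_demo)

sgd_r2 = sgd.score(X_sc, y_demo)
print(np.round(sgd.coef_, 2), round(sgd_r2, 3))
```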