<div align="center">
    <h1>Regression Techniques</h1>
<img src="https://user-images.githubusercontent.com/48846576/102035064-24aa0900-3d85-11eb-9909-1e478abaf98b.jpg"  width="800" height="300">
    <span>Photo by <a href="https://unsplash.com/@bushmush?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">michael weir</a> on <a href="https://unsplash.com/?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></span>
</div><br>
<div align="left">
    <h2>Problem Statement</h2>
    <p>Predict the runs scored by a batsman against a bowler.</p>
    <p>Cricket is a bat-and-ball game. In a Twenty20 game, a bowler may bowl at most 4 overs, and each over consists of 6 legal deliveries, so a bowler bowls at most 24 legal deliveries (balls). Bowlers fall into two major types: pace (fast) bowlers and spin bowlers. Depending on the type of bowler, there are different deliveries such as the inswinger, outswinger, cutter, off spin, and leg spin. Similarly, a batsman has different kinds of shots, such as drives, cuts, pulls, and hooks, to counter the bowling and score runs.</p>
    <p>Can we capture these insights from numerical data about each player and make a use case for linear regression?</p>
   <ul>
       <li>As a first step, I explored the features and built a dataset. Details are in this <a href="https://www.kaggle.com/rajsengo/feature-engineering-for-regression">notebook</a>.</li>
       <li>In this notebook, I build regression models and compare their performance to choose the best-fitting model.</li>
    </ul>
</div>

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
#sns.set()

# Train data

In [None]:
train = pd.read_csv('../input/feature-engineering/train.csv', index_col=None)
pd.set_option('display.max_columns', None)
train.head(5)

# Importance of Venue

The following are the home venues of the IPL teams. However, IPL games were also played at other venues in India and abroad (UAE, South Africa).

* M.Chinnaswamy Stadium, Bengaluru
* MA Chidambaram Stadium, Chepauk, Chennai
* Arun Jaitley Stadium, Delhi
* Rajiv Gandhi International Stadium, Uppal, Hyderabad
* Sawai Mansingh Stadium, Jaipur
* Eden Gardens, Kolkata
* Punjab Cricket Association IS Bindra Stadium, Mohali, Chandigarh
* Wankhede Stadium, Mumbai

As the cell below shows, 64% of the games were played at these venues. Hence, these venues top the lists of most runs scored and most deliveries bowled.

In [None]:
import math

def calculate_top_venues(top_n=8):
    summary_df = pd.read_csv('/kaggle/input/indian-premier-league-ipl-all-seasons/all_season_summary.csv', index_col=None)
    # value_counts() sorts venues by number of games, most frequent first.
    # Working on the Series directly avoids depending on the column names
    # that reset_index() produces, which differ between pandas versions.
    venue_counts = summary_df['venue_name'].value_counts()
    total_games = venue_counts.sum()
    top_venues = venue_counts.head(top_n)
    for venue_name, games in top_venues.items():
        print(venue_name, games)

    print("Percentage of total games played at these venues : {}%".format(math.floor(top_venues.sum() / total_games * 100)))

calculate_top_venues()

In [None]:
# Aggregate only the two columns plotted below, so text columns are not summed
venue = train.groupby('venue')[['runs', 'ball']].sum().reset_index()
venue = venue.sort_values(['runs'], ascending=False).reset_index(drop=True)
f = plt.figure(figsize=(10, 20))
gs = f.add_gridspec(2,1)
with sns.axes_style("darkgrid"):
    #sns.set_context("notebook", font_scale=1.0, rc={"lines.linewidth": 3.5})    
    ax = f.add_subplot(gs[0, 0])    
    g1 = sns.barplot(y="venue", x='runs', data=venue, palette="rocket")
    g1.set_ylabel(None,fontsize=20)
    g1.axes.set_xlabel("Runs",fontsize=18)     
    g1.axes.set_xticks(range(0,30000,5000))    
    g1.axes.set_title("Total Runs scored per venue",fontsize=20)

venue = venue.sort_values(['ball'], ascending=False).reset_index(drop=True)

with sns.axes_style("darkgrid"):
    #sns.set_context("notebook", font_scale=1.0, rc={"lines.linewidth": 3.5})    
    ax = f.add_subplot(gs[1, 0])    
    g1 = sns.barplot(y="venue", x='ball', data=venue, palette="rocket")
    g1.set_ylabel(None,fontsize=20)
    g1.axes.set_xlabel("Balls",fontsize=18)     
    g1.axes.set_xticks(range(0,25000,5000))    
    g1.axes.set_title("Total Balls Bowled per venue",fontsize=20)

Let's shorten the venue names and map the venues outside the top 8 to Others. Note that the mapping below keeps Pune as its own category, so this attribute ends up with up to 10 categorical values (the 8 home venues, Pune, and Others).

In [None]:
venue_map = {'M.Chinnaswamy Stadium, Bengaluru':'Bengaluru',
'Punjab Cricket Association IS Bindra Stadium, Mohali, Chandigarh':'Mohali',
'Arun Jaitley Stadium, Delhi':'Delhi',
'Wankhede Stadium, Mumbai':'Mumbai',
'Eden Gardens, Kolkata':'Kolkata',
'Sawai Mansingh Stadium, Jaipur':'Jaipur',
'Rajiv Gandhi International Stadium, Uppal, Hyderabad':'Hyderabad',
'MA Chidambaram Stadium, Chepauk, Chennai':'Chennai',
'Dr DY Patil Sports Academy, Mumbai':'Others',
'Newlands, Cape Town':'Others',
'St George\'s Park, Port Elizabeth':'Others',
'Kingsmead, Durban':'Others',
'SuperSport Park, Centurion':'Others',
'Buffalo Park, East London':'Others',
'The Wanderers Stadium, Johannesburg':'Others',
'Diamond Oval, Kimberley':'Others',
'Mangaung Oval, Bloemfontein':'Others',
'Brabourne Stadium, Mumbai':'Others',
'Sardar Patel (Gujarat) Stadium, Motera, Ahmedabad':'Others',
'Barabati Stadium, Cuttack':'Others',
'Vidarbha Cricket Association Stadium, Jamtha, Nagpur':'Others',
'Himachal Pradesh Cricket Association Stadium, Dharamsala':'Others',
'Nehru Stadium, Kochi':'Others',
'Holkar Cricket Stadium, Indore':'Others',
'Dr. Y.S. Rajasekhara Reddy ACA-VDCA Cricket Stadium, Visakhapatnam':'Others',
'Maharashtra Cricket Association Stadium, Pune':'Pune',
'Shaheed Veer Narayan Singh International Stadium, Raipur':'Others',
'JSCA International Stadium Complex, Ranchi':'Others',
'Sheikh Zayed Stadium, Abu Dhabi':'Others',
'Sharjah Cricket Stadium':'Others',
'Dubai International Cricket Stadium':'Others',
'Saurashtra Cricket Association Stadium, Rajkot':'Others',
'Green Park, Kanpur':'Others'}
def map_venue(venue_name):
    # Raises KeyError for a venue missing from venue_map, so unmapped names are not silently kept
    return venue_map[venue_name]

# Map the column directly instead of applying a lambda row by row
train['venue'] = train['venue'].map(map_venue)
train['venue'].value_counts()

In [None]:
train_set = train.copy()
print('train {} train_set {}'.format(train.shape, train_set.shape))

# StratifiedShuffleSplit

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(train_set, train_set['venue']):
    # split() yields positional indices, so select rows with iloc
    strat_train_set = train_set.iloc[train_index]
    strat_test_set = train_set.iloc[test_index]
    
strat_test_set['venue'].value_counts() / len(strat_test_set)

In [None]:
train_set['venue'].value_counts() / len(train_set)

The venue proportions in strat_test_set match the distribution of venues in train_set.
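
The comparison can also be made side by side in one table. The following is a minimal sketch on made-up data, not the notebook's dataset: `toy`, `toy_test` and `comparison` are hypothetical names, and the 50/30/20 venue mix is an assumption chosen so the expected proportions are easy to read.

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Toy stand-in for train_set: 100 rows with an imbalanced 'venue' column.
toy = pd.DataFrame({
    "venue": ["Mumbai"] * 50 + ["Chennai"] * 30 + ["Others"] * 20,
    "runs": range(100),
})

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(split.split(toy, toy["venue"]))
toy_test = toy.iloc[test_idx]  # positional indices, hence iloc

# Venue proportions in the full data next to those in the stratified test set.
comparison = pd.DataFrame({
    "overall": toy["venue"].value_counts(normalize=True),
    "stratified_test": toy_test["venue"].value_counts(normalize=True),
})
comparison["abs_diff"] = (comparison["overall"] - comparison["stratified_test"]).abs()
print(comparison)
```

A small `abs_diff` for every venue is what stratification is meant to guarantee; a plain random split gives no such guarantee for rare venues.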

In [None]:
#runs - column is not going to be part of the training set
numeric_features = ['ball', 'avg_runs_scored', 'avg_balls_faced',
       'avg_4s_scored', 'avg_6s_scored', 'batting_st_rate',
       'avg_games_captained', 'total_runs_scored', 'total_innings_batted',
       'total_balls_faced', 'total_4s_hit', 'total_6s_hit',
       'total_games_captained', 'bowler_avg_overs', 'bowler_avg_maidens',
       'bowler_avg_conceded', 'bowler_avg_wkts', 'bowler_econ_rt',
       'bowler_avg_dots', 'bowler_avg_4s', 'bowler_avg_6s', 'bowler_avg_wides',
       'bowler_avg_noballs', 'bowler_avg_captaincy', 'bowler_total_conceded',
       'total_innings_bowled', 'bowler_total_overs', 'bowler_total_maidens',
       'bowler_total_wkts', 'bowler_total_dots', 'bowler_total_4s',
       'bowler_total_6s', 'bowler_total_wides', 'bowler_total_noballs',
       'bowler_total_captaincy']

categorical_features = ['venue', 'batsman_team', 'bowling_team', 'home_game','innings_id']

remainder_features = ['season', 'match_id', 'batsman1_name', 'bowler1_name', 'home_team', 'away_team']

# Pipeline

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn import set_config

X = strat_train_set.drop("runs", axis = 1) # train_data will feed to the model
y = strat_train_set['runs'] # label to predict

def get_pipeline():
    # Scale the numeric columns and one-hot encode the categorical ones.
    # Columns not listed in either group are dropped (ColumnTransformer default),
    # and categories unseen during fit are encoded as all zeros.
    numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])
    categorical_transformer = Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore'))])
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)])
    return preprocessor

def build_model(model):
    # Chain the shared preprocessing with the given estimator
    regr = Pipeline(steps=[('preprocessor', get_pipeline()),
                      ('regression_model', model)])
    set_config(display='diagram')
    return regr

def calculate_train_rmse(name, model):
    runs_predictions = model.predict(X)
    mse = mean_squared_error(y, runs_predictions)
    rmse = np.sqrt(mse)
    print("Training RMSE of {} : {}".format(name,rmse))

def sample_prediction(name, model, num_records):
    some_data = X.iloc[:num_records]
    some_labels = y.iloc[:num_records]
    preds = []
    for label in list(model.predict(some_data)):
        preds.append(math.floor(label))

    print("Predictions on training data using :", name)    
    print("Predictions    :", preds)
    print("Actual labels  :", list(some_labels))    

# LinearRegression

In [None]:
linear_reg = build_model(LinearRegression())
linear_reg.fit(X,y)

In [None]:
calculate_train_rmse("LinearRegression",linear_reg)

In [None]:
sample_prediction("LinearRegression",linear_reg, 10)

# DecisionTrees

In [None]:
from sklearn.tree import DecisionTreeRegressor
tree_reg = build_model(DecisionTreeRegressor())
tree_reg.fit(X,y)

In [None]:
calculate_train_rmse("DecisionTreeRegressor",tree_reg)

In [None]:
sample_prediction("DecisionTreeRegressor", tree_reg, 10)

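DecisionTreeRegressor overfits the training data! An unconstrained tree can fit the training set almost perfectly, so its training RMSE says little about how it will perform on unseen data.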
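
One way to make the overfitting visible is to compare the training RMSE with a cross-validated RMSE. The sketch below does this on synthetic data rather than the cricket dataset; `X_toy`, `y_toy` and `toy_tree` are made-up names, and the linear-signal-plus-noise target is an assumption chosen only to make the effect easy to see.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X_toy = rng.uniform(0, 10, size=(200, 3))
y_toy = X_toy[:, 0] + rng.normal(0, 2, size=200)  # linear signal plus noise

# Error on the same data the tree was fitted on.
toy_tree = DecisionTreeRegressor(random_state=42).fit(X_toy, y_toy)
train_rmse = np.sqrt(mean_squared_error(y_toy, toy_tree.predict(X_toy)))

# Error on held-out folds: each fold is predicted by a tree that never saw it.
neg_mse = cross_val_score(
    DecisionTreeRegressor(random_state=42), X_toy, y_toy,
    scoring="neg_mean_squared_error", cv=5,
)
cv_rmse = np.sqrt(-neg_mse).mean()

print("Training RMSE        :", train_rmse)
print("Cross-validated RMSE :", cv_rmse)
```

A large gap between the two numbers is the signature of overfitting; limiting `max_depth` or `min_samples_leaf` is the usual way to narrow it.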

# RandomForestRegressor

In [None]:
from sklearn.ensemble import RandomForestRegressor
forest_reg = build_model(RandomForestRegressor(random_state = 42))
forest_reg.fit(X,y)

In [None]:
calculate_train_rmse("RandomForestRegressor",forest_reg)

In [None]:
sample_prediction("RandomForestRegressor", forest_reg, 10)

# Support Vector Machines

In [None]:
from sklearn.svm import LinearSVR
svm_reg = build_model(LinearSVR(epsilon = 1.5, max_iter = 3000))
svm_reg.fit(X,y)

In [None]:
calculate_train_rmse("LinearSVR",svm_reg)

In [None]:
sample_prediction("LinearSVR", svm_reg, 10)

So far, RandomForestRegressor seems to give the best results on the training data, so let's tune it.

# Model Tuning

In [None]:
print("RandomForestRegressor params : ", forest_reg[-1].get_params())

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'regression_model__n_estimators': [3, 10, 30, 50], 'regression_model__max_features' : [2, 4, 6, 8, 10, 12]},
    {'regression_model__bootstrap': [False], 'regression_model__n_estimators' : [3, 10], 'regression_model__max_features' : [2, 3, 4] }
]

grid_search = GridSearchCV(forest_reg, param_grid, cv=5, 
                          scoring = 'neg_mean_squared_error',
                          return_train_score=True)
grid_search.fit(X, y)


In [None]:
grid_search.best_params_

In [None]:
set_config(display='diagram')
grid_search.best_estimator_

In [None]:
cvres = grid_search.cv_results_

for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score),params)
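
When the grid is large, it can be easier to read the results as a table sorted by RMSE than as a printed list. The following is a rough sketch on synthetic data with a deliberately tiny grid; `toy_search`, `results` and `ranked` are hypothetical names, and the parameter values are not the ones tuned in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 4))
y_toy = 3 * X_toy[:, 0] + rng.normal(scale=0.5, size=120)

toy_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    {"n_estimators": [3, 10], "max_features": [2, 4]},
    cv=3,
    scoring="neg_mean_squared_error",
)
toy_search.fit(X_toy, y_toy)

# cv_results_ is a dict of arrays, so it converts directly to a DataFrame.
results = pd.DataFrame(toy_search.cv_results_)
# Scores are negated MSE values; flip the sign before taking the square root.
results["rmse"] = np.sqrt(-results["mean_test_score"])
ranked = results.sort_values("rmse")[["params", "rmse", "rank_test_score"]]
print(ranked.to_string(index=False))
```

The first row of `ranked` corresponds to `best_params_`, since the lowest RMSE is the highest negated MSE.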

In [None]:
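# Use the tuned estimator found by the grid search rather than the untuned forest_reg
sample_prediction("RandomForestRegressor (tuned)", grid_search.best_estimator_, 20)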

# Evaluate the model on test set

In [None]:
# best_estimator_ is a full pipeline, so it preprocesses X_test itself
final_model = grid_search.best_estimator_

X_test = strat_test_set.drop("runs", axis = 1)
y_test = strat_test_set["runs"].copy()

final_predictions = final_model.predict(X_test)

final_mse = mean_squared_error(y_test,final_predictions)
final_rmse = np.sqrt(final_mse)
print("Test RMSE : ", final_rmse)

# Cross validation

In [None]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(tree_reg, X, y, scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)

In [None]:
def display_scores(scores):
    print("Scores : ", scores)
    print("Mean : ", scores.mean())
    print("Standard Deviation : ", scores.std())    

In [None]:
display_scores(tree_rmse_scores)

In [None]:
scores_linear = cross_val_score(linear_reg, X, y, scoring="neg_mean_squared_error", cv=10)
ln_rmse_scores = np.sqrt(-scores_linear)

In [None]:
display_scores(ln_rmse_scores)

In [None]:
scores_forest = cross_val_score(forest_reg, X, y, scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-scores_forest)

In [None]:
display_scores(forest_rmse_scores)

In [None]:
svm_cross = cross_val_score(svm_reg, X, y, scoring="neg_mean_squared_error", cv=10)
svm_rmse_scores = np.sqrt(-svm_cross)

In [None]:
display_scores(svm_rmse_scores)

In [None]:
import joblib as jbl
jbl.dump(linear_reg, "linear_reg.pkl")

In [None]:
jbl.dump(tree_reg, "tree_reg.pkl")

In [None]:
jbl.dump(forest_reg, "forest_reg.pkl")

In [None]:
jbl.dump(svm_reg, "svm_reg.pkl")
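
A saved model is only useful if it can be loaded back and still produces the same predictions. The sketch below shows that round trip with `joblib.load` on a tiny made-up model; `toy_model`, `restored` and the file name `toy_reg.pkl` are hypothetical, and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny made-up regression problem: y = 2x + 1.
X_toy = np.arange(10, dtype=float).reshape(-1, 1)
y_toy = 2 * X_toy.ravel() + 1
toy_model = LinearRegression().fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_reg.pkl")
    joblib.dump(toy_model, path)   # same call as in the cells above
    restored = joblib.load(path)   # counterpart that reads the file back

same_predictions = np.allclose(toy_model.predict(X_toy), restored.predict(X_toy))
print("Restored model reproduces predictions:", same_predictions)
```

Pickled scikit-learn models are generally only safe to load with the same library version that wrote them, so it may be worth recording the version next to the file.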

# Predictions

What if IPL 2020 had happened in India?!

In [None]:
summary_df = pd.read_csv('/kaggle/input/indian-premier-league-ipl-all-seasons/all_season_summary.csv', index_col=None)
detail_df = pd.read_csv('/kaggle/input/indian-premier-league-ipl-all-seasons/all_season_details.csv', index_col=None)

Let's take this game, which happened at Sharjah Cricket Stadium, change the venue to Jaipur, which is Rajasthan Royals' home ground, and see what our model predicts.

In [None]:
summary_df.loc[summary_df['id'] == 1216496]

In [None]:
batsman_stats = pd.read_csv('../input/feature-engineering/batsman_numerical.csv', index_col=None)
bowler_stats = pd.read_csv('../input/feature-engineering/bowler_numerical.csv', index_col=None)

class DataToPredict:
    def __init__(self, match_id, venue = 'Others'):
        self.match_id = match_id
        self.data = None
        self.venue = venue
        self.predicted_runs = None # y_hat
        self.X = None
        self.y = None # old values
        
    def load(self):
        match = detail_df.loc[detail_df['match_id'] == self.match_id]
        #print("match shape ",match.shape)        
        match = match.reset_index()
        df = match[(match["isWide"] == False) & (match["isNoball"] == False)]
        df1=pd.pivot_table(df, index=['season','match_id','batsman1_name','bowler1_name','home_team', 'away_team','innings_id'],values=['runs'],aggfunc=sum)
        df2=pd.pivot_table(df,  index=['season','match_id','batsman1_name','bowler1_name','home_team', 'away_team','innings_id'],values=['ball'],aggfunc=len)
        match = pd.concat([df1,df2],axis=1)
        #print("match shape ",match.shape)
        match = match.reset_index()
        match = match.sort_values('innings_id')
        #match = match.drop(columns=['index'], axis=1)
        self.data = match
        return self.data
    
    def fill_batsman_attributes(self):
        df = self.data
        for index, row in df.iterrows():
            try:
                temp = batsman_stats.loc[batsman_stats['fullName'] == df.at[index, 'batsman1_name']]
                temp = temp.reset_index()
                if temp.empty:
                    print('DataFrame is empty for {}'.format(df.at[index, 'batsman1_name']))
                else:
                    df.at[index,'avg_runs_scored'] = temp.at[0,'avg_runs_scored']
                    df.at[index,'avg_balls_faced'] = temp.at[0,'avg_balls_faced']
                    df.at[index,'avg_4s_scored'] = temp.at[0,'avg_4s_scored'] 
                    df.at[index,'avg_6s_scored'] = temp.at[0,'avg_6s_scored'] 
                    df.at[index,'batting_st_rate'] = temp.at[0,'batting_st_rate'] 
                    df.at[index,'avg_games_captained'] = temp.at[0,'avg_games_captained'] 
                    df.at[index,'total_runs_scored'] = temp.at[0,'total_runs_scored'] 
                    df.at[index,'total_innings_batted'] = temp.at[0,'total_innings_batted'] 
                    df.at[index,'total_balls_faced'] = temp.at[0,'total_balls_faced'] 
                    df.at[index,'total_4s_hit'] = temp.at[0,'total_4s_hit'] 
                    df.at[index,'total_6s_hit'] = temp.at[0,'total_6s_hit'] 
                    df.at[index,'total_games_captained'] = temp.at[0,'total_games_captained'] 
            except KeyError as e:
                print(e)
                continue
        self.data = df
        return self.data

    def fill_bowler_attributes(self):
        df = self.data
        for index, row in df.iterrows():
            try:
                temp = bowler_stats.loc[bowler_stats['fullName'] == df.at[index, 'bowler1_name']]
                temp = temp.reset_index()
                if temp.empty:
                    print('DataFrame is empty for {}'.format(df.at[index, 'bowler1_name']))
                else:
                    df.at[index,'bowler_avg_overs'] = temp.at[0,'bowler_avg_overs']
                    df.at[index,'bowler_avg_maidens'] = temp.at[0,'bowler_avg_maidens']
                    df.at[index,'bowler_avg_conceded'] = temp.at[0,'bowler_avg_conceded']
                    df.at[index,'bowler_avg_wkts'] = temp.at[0,'bowler_avg_wkts']
                    df.at[index,'bowler_econ_rt'] = temp.at[0,'bowler_econ_rt']
                    df.at[index,'bowler_avg_dots'] = temp.at[0,'bowler_avg_dots']
                    df.at[index,'bowler_avg_4s'] = temp.at[0,'bowler_avg_4s']
                    df.at[index,'bowler_avg_6s'] = temp.at[0,'bowler_avg_6s']
                    df.at[index,'bowler_avg_wides'] = temp.at[0,'bowler_avg_wides']
                    df.at[index,'bowler_avg_noballs'] = temp.at[0,'bowler_avg_noballs']
                    df.at[index,'bowler_avg_captaincy'] = temp.at[0,'bowler_avg_captaincy']
                    df.at[index,'bowler_total_conceded'] = temp.at[0,'bowler_total_conceded']
                    df.at[index,'total_innings_bowled'] = temp.at[0,'total_innings_bowled']
                    df.at[index,'bowler_total_overs'] = temp.at[0,'bowler_total_overs']
                    df.at[index,'bowler_total_maidens'] = temp.at[0,'bowler_total_maidens']
                    df.at[index,'bowler_total_wkts'] = temp.at[0,'bowler_total_wkts']
                    df.at[index,'bowler_total_dots'] = temp.at[0,'bowler_total_dots']
                    df.at[index,'bowler_total_4s'] = temp.at[0,'bowler_total_4s']
                    df.at[index,'bowler_total_6s'] = temp.at[0,'bowler_total_6s']
                    df.at[index,'bowler_total_wides'] = temp.at[0,'bowler_total_wides']
                    df.at[index,'bowler_total_noballs'] = temp.at[0,'bowler_total_noballs']
                    df.at[index,'bowler_total_captaincy'] = temp.at[0,'bowler_total_captaincy']
            except KeyError as e:
                print(e)
                continue 
        self.data = df
        return self.data

    def add_features(self):
        df = self.data
        for index, row in df.iterrows():
            try:
                temp = summary_df.loc[summary_df['id'] == df.at[index, 'match_id']]
                temp = temp.reset_index()
                #df.at[index,'venue'] = temp.at[0,'venue_name']
                df.at[index,'venue'] = self.venue
                if df.at[index,'batsman1_name'] in (temp.at[0,'home_playx1'] ):
                    df.at[index,'batsman_team'] = temp.at[0,'home_team']
                if df.at[index,'batsman1_name'] in (temp.at[0,'away_playx1'] ):
                    df.at[index,'batsman_team'] = temp.at[0,'away_team']
                if df.at[index,'bowler1_name'] in (temp.at[0,'away_playx1'] ):
                    df.at[index,'bowling_team'] = temp.at[0,'away_team']
                if df.at[index,'bowler1_name'] in (temp.at[0,'home_playx1'] ):
                    df.at[index,'bowling_team'] = temp.at[0,'home_team']
                if df.at[index,'batsman_team'] in (temp.at[0,'home_team'] ):
                    df.at[index,'home_game'] = 1
                else:
                    df.at[index,'home_game'] = 0                                    
            except KeyError as e:
                print(e)
                continue
        self.data = df
        return self.data

    def predict_runs(self):
        self.y = self.data['runs'] # actual runs, kept for comparison
        self.X = self.data.drop(columns=["runs"]) # features fed to the model
        self.predicted_runs = forest_reg.predict(self.X)
        #print("Original Runs  : ", self.y) 
        print("Predicted Runs : ", np.round(self.predicted_runs)) 
        
    def add_preds_to_df(self):
        self.data['prediction'] = np.round(self.predicted_runs)
        return self.data
    
    def print_preds(self):
        print(self.data[['batsman1_name','bowler1_name','innings_id','runs','prediction','ball']])
        
    def declare_winner(self):
        result_df = pd.pivot_table(self.data, index=['innings_id'],values=['runs', 'prediction'],aggfunc='sum')
        # result_df is indexed by innings_id, so look up each innings by its label
        innings1_score = result_df.loc[1, 'prediction']
        innings2_score = result_df.loc[2, 'prediction']
        print("first innings score  : ",innings1_score)
        print("second innings score : ",innings2_score)
        
        #innings_id 	prediction 	runs
        #0 	1 	162.0 	192
        #1 	2 	181.0 	195
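
The `fill_batsman_attributes` and `fill_bowler_attributes` methods copy statistics one cell at a time inside a row loop. A possible alternative, sketched here on made-up frames, is a single left merge on the player name. `toy_match` and `toy_batsman_stats` are hypothetical stand-ins for `self.data` and `batsman_stats`, with only two of the statistic columns included for brevity.

```python
import pandas as pd

# Made-up match rows; "Player C" has no entry in the stats table.
toy_match = pd.DataFrame({
    "batsman1_name": ["Player A", "Player B", "Player C"],
    "bowler1_name": ["Bowler X", "Bowler X", "Bowler Y"],
    "runs": [12, 4, 7],
})
toy_batsman_stats = pd.DataFrame({
    "fullName": ["Player A", "Player B"],
    "avg_runs_scored": [31.5, 18.2],
    "batting_st_rate": [140.1, 122.7],
})

# A left merge keeps every match row and fills NaN where no stats exist.
enriched = toy_match.merge(
    toy_batsman_stats,
    how="left",
    left_on="batsman1_name",
    right_on="fullName",
    indicator=True,  # adds a _merge column marking unmatched rows
)
missing = enriched.loc[enriched["_merge"] == "left_only", "batsman1_name"].tolist()
enriched = enriched.drop(columns=["fullName", "_merge"])

print(enriched)
print("Players without stats:", missing)
```

One difference from the loop: if the stats table held several rows for the same `fullName`, the merge would duplicate the match row, whereas the loop takes the first hit. Calling `drop_duplicates(subset="fullName")` on the stats table beforehand would restore that behaviour.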

In [None]:
loader = DataToPredict(1216496, venue ='Jaipur')
#loader = DataToPredict('336012', venue ='Others')
csk_rr = loader.load()
csk_rr = loader.add_features()
csk_rr = loader.fill_batsman_attributes()
csk_rr = loader.fill_bowler_attributes()
loader.predict_runs()
csk_rr = loader.add_preds_to_df()
#loader.print_preds()
loader.declare_winner()

## Predicted Score Card

In [None]:
score = csk_rr[['batsman1_name','bowler1_name','innings_id','runs','prediction','ball']]
score = score.reset_index()
score = score.groupby(["batsman1_name","innings_id"])[["runs","prediction","ball"]].sum()
score = score.reset_index()
score[["batsman1_name","innings_id","runs","prediction","ball"]].sort_values("innings_id", ascending=True)

# CSK is the Winner!

The model predicts a higher score for the second innings than for the first, indicating that CSK would have won this game had it been played in Jaipur instead of at Sharjah Cricket Stadium.

TODO : 

* Add extras to the predicted totals and run predictions for all the games in the season to see how the model performs and to uncover any issues
* Validate the accuracy of the predictions and re-verify the model parameters
* Try advanced regression techniques