<div align="center">
    <h1>Feature Engineering for Regression</h1>
<img src="https://user-images.githubusercontent.com/48846576/102035064-24aa0900-3d85-11eb-9909-1e478abaf98b.jpg"  width="800" height="300">
    <span>Photo by <a href="https://unsplash.com/@bushmush?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">michael weir</a> on <a href="https://unsplash.com/?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></span>
</div><br>
<div align="left">
    <h2>Problem Statement</h2>
    <p>Cricket is a bat-and-ball game. In a Twenty20 cricket match, a bowler may bowl a maximum of 4 overs. Each over consists of 6 legal deliveries, so a bowler bowls at most 24 legal deliveries (balls). Bowlers fall into two major types: pace (fast) bowlers and spin bowlers. Depending on the type of bowler, there are different deliveries such as the inswinger, outswinger, cutter, off spin and leg spin. Similarly, a batsman has different kinds of shots, such as drives, cuts, pulls and hooks, to counter the bowling and score runs.</p>
        <p>The goal of this exercise is to predict how many runs a batsman will score against a given bowler.</p>
    
This notebook is structured as follows:
<ul>
  <li>Explore the match details data file and build the base training set</li>
  <li>Analyze the batsman's attributes and extract them from the batting summary data file</li>    
  <li>Analyze and extract the bowler's attributes from the bowling summary data file</li>        
  <li>Understand the correlation of various attributes to the runs scored by a batsman</li>    
  <li>Enrich and build training dataset for regression techniques</li>        
</ul> 

Input :
    <ul>
        <li> all_season_details.csv </li>
        <li> all_season_batting_card.csv </li>
        <li> all_season_summary.csv </li>
        <li> all_season_bowling_card.csv </li>
    </ul>

Output :
    <ul>
        <li> train.csv </li>
    </ul>
</div>


In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

print('Input Dataset')
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
details_df = pd.read_csv('/kaggle/input/indian-premier-league-ipl-all-seasons/all_season_details.csv', index_col=None)
batting_df = pd.read_csv('/kaggle/input/indian-premier-league-ipl-all-seasons/all_season_batting_card.csv', index_col=None)
summary_df = pd.read_csv('/kaggle/input/indian-premier-league-ipl-all-seasons/all_season_summary.csv', index_col=None)
bowling_df = pd.read_csv('/kaggle/input/indian-premier-league-ipl-all-seasons/all_season_bowling_card.csv', index_col=None)

# Exploratory Data Analysis
<div align="left">
    <p>In this section, let's start with the match details data file and build the training dataset, then explore the batting and bowling performances of players using the other data files.</p>
</div>

## Match Details Data

Let's look at the first over (first 6 balls) of the first match of the 2020 season. The match details data is presented in ball-by-ball format. From this we are going to compute the total runs scored by a batsman against a bowler in each match.

In [None]:
details_df[['season', 'match_id', 'match_name', 'home_team',
       'away_team', 'current_innings', 'innings_id', 'over', 'ball', 'runs',
       'batsman1_name',  'bowler1_name']].head(6)

## Create initial base version of training data
Create a pivot table with batsman name, bowler name as indices and runs & balls as values

In [None]:
def build_dataset():
    # Keep legal deliveries only (drop wides and no-balls)
    df = details_df[(details_df["isWide"] == False) & (details_df["isNoball"] == False)]
    keys = ['season','match_id','batsman1_name','bowler1_name','home_team', 'away_team','innings_id']
    # Total runs per batsman-bowler pair in each innings
    df1 = pd.pivot_table(df, index=keys, values=['runs'], aggfunc='sum')
    # Number of legal balls bowled to the batsman by that bowler
    df2 = pd.pivot_table(df, index=keys, values=['ball'], aggfunc=len)
    df_pivot = pd.concat([df1,df2],axis=1)
    df_pivot = df_pivot.reset_index()
    return df_pivot
train_df = build_dataset()
print('Pivot table shape :',train_df.shape)
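
The same batsman-versus-bowler aggregation can also be written as a single `groupby` with named aggregations. The following is a minimal sketch on invented toy data, assuming only the column names used in `build_dataset()`; the `balls` and `pairs` names are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy ball-by-ball data; names and values are made up for illustration.
balls = pd.DataFrame({
    "match_id":      [1, 1, 1, 1, 1],
    "batsman1_name": ["A", "A", "A", "B", "A"],
    "bowler1_name":  ["X", "X", "Y", "X", "X"],
    "runs":          [4, 0, 1, 6, 1],
    "isWide":        [False, False, False, False, True],
    "isNoball":      [False, False, False, False, False],
})

# Keep legal deliveries only, mirroring the filter in build_dataset().
legal = balls[~balls["isWide"] & ~balls["isNoball"]]

# One row per (match, batsman, bowler): total runs and number of balls faced.
pairs = (
    legal.groupby(["match_id", "batsman1_name", "bowler1_name"], as_index=False)
         .agg(runs=("runs", "sum"), ball=("runs", "count"))
)
print(pairs)
```

This avoids building two pivot tables and concatenating them, at the cost of counting only non-null `runs` values as balls.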

Now let's view the pivot table, which gives us the starting version of the training data. The data is now nicely transformed into batsman vs. bowler rows for each game.

In [None]:
train_df.head(10)

Let's look at the maximum number of balls faced by a batsman and the maximum runs scored by a batsman against a single bowler in a single game. It's interesting to see that 52 is the highest number of runs scored by a batsman against any single bowler, and 20 balls is the most any bowler has bowled to a single batsman.

In [None]:
train_df.sort_values('runs', ascending=False).head(5)

In [None]:
train_df.sort_values('ball', ascending=False).head(5)

Let's look at the most expensive spell (the most runs conceded, in the fewest balls) and the most economical one (the fewest runs conceded over the most balls) by a bowler against a single batsman in a game.

In [None]:
train_df.loc[train_df.where(train_df.runs.eq(train_df.runs.max())).ball.idxmin()]

In [None]:
train_df.loc[train_df.where(train_df.ball.eq(train_df.ball.max())).runs.idxmin()]
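
The two lookups above each pick one extreme row and break ties on a second column. A possibly more readable way to express that is a two-key sort. This is only a sketch on invented data; the `pairs` frame and its values are hypothetical.

```python
import pandas as pd

# Invented batsman-vs-bowler rows, shaped like train_df.
pairs = pd.DataFrame({
    "bowler1_name": ["X", "Y", "Z", "W"],
    "runs":         [30, 30, 2, 5],
    "ball":         [12, 9, 14, 14],
})

# Most expensive: highest runs, ties broken by the fewest balls.
most_expensive = pairs.sort_values(["runs", "ball"], ascending=[False, True]).iloc[0]

# Most economical long spell: most balls, ties broken by the fewest runs.
most_economical = pairs.sort_values(["ball", "runs"], ascending=[False, True]).iloc[0]

print(most_expensive["bowler1_name"], most_economical["bowler1_name"])
```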


## Add additional features

Let's extract additional attributes like venue, team information and home vs away game. We will look at the importance of these features next

In [None]:
def add_features(df):
    for index, row in df.iterrows():
        try:
            temp = summary_df.loc[summary_df['id'] == df.at[index, 'match_id']]
            temp = temp.reset_index()
            df.at[index,'venue'] = temp.at[0,'venue_name']
            if df.at[index,'batsman1_name'] in (temp.at[0,'home_playx1'] ):
                df.at[index,'batsman_team'] = temp.at[0,'home_team']
            if df.at[index,'batsman1_name'] in (temp.at[0,'away_playx1'] ):
                df.at[index,'batsman_team'] = temp.at[0,'away_team']
            if df.at[index,'bowler1_name'] in (temp.at[0,'away_playx1'] ):
                df.at[index,'bowling_team'] = temp.at[0,'away_team']
            if df.at[index,'bowler1_name'] in (temp.at[0,'home_playx1'] ):
                df.at[index,'bowling_team'] = temp.at[0,'home_team']
            if df.at[index,'batsman_team'] in (temp.at[0,'home_team'] ):
                df.at[index,'home_game'] = 1
            else:
                df.at[index,'home_game'] = 0                                    
        except KeyError as e:
            print(e)
            continue

add_features(train_df)
train_df.head()
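
Row-by-row loops such as `add_features` can be slow on large frames. As a rough sketch of an alternative, the venue and home/away flag could be attached with a single left merge. The example below uses invented data and assumes the batsman's team is already known, which the notebook instead derives from the player lists; `pairs`, `matches` and `enriched` are hypothetical names.

```python
import pandas as pd

# Toy stand-ins for train_df and summary_df; all values are invented.
pairs = pd.DataFrame({
    "match_id":     [10, 10, 11],
    "batsman_team": ["Team A", "Team B", "Team B"],
})
matches = pd.DataFrame({
    "id":         [10, 11],
    "venue_name": ["Ground A", "Ground B"],
    "home_team":  ["Team A", "Team B"],
})

# A left join keeps every batsman-bowler row, even if a match has no summary.
enriched = pairs.merge(
    matches.rename(columns={"id": "match_id", "venue_name": "venue"}),
    on="match_id",
    how="left",
)

# 1 when the batsman plays for the home side, else 0.
enriched["home_game"] = (enriched["batsman_team"] == enriched["home_team"]).astype(int)
print(enriched)
```

Note that this compares team names with `==`, whereas `add_features` uses a substring test (`in`), so the two can differ when one team name is contained in another.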

# Batsman Features
<div align="center">
<img src="https://user-images.githubusercontent.com/48846576/102040230-cf282900-3d91-11eb-897f-172747e94fd9.jpg"  width="500" height="400">
    <span>Photo by <a href="https://unsplash.com/@villagecricketco?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Village Cricket Co</a> on <a href="https://unsplash.com/s/photos/cricket-bat?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a> and annotated by me</span>
</div><br>

Now let's extract important features of a batsman from other data files so that we can associate some numbers to a name. The following stats are important to define the quality of the batsman

* Total runs scored
* Total balls faced
* Strike rate (runs scored per 100 balls faced)
* Total 4s hit (4 runs)
* Total 6s hit (6 runs; the equivalent of a home run in baseball!)

We start with the per-innings average of each of these values.

In [None]:
batting_df['strikeRate'] = batting_df['strikeRate'].replace({"-":"0"})
batting_df = batting_df.astype({"strikeRate": float})
# numeric_only=True keeps newer pandas from trying to average text columns
bat_analysis = batting_df.groupby(['fullName']).mean(numeric_only=True)
bat_analysis = bat_analysis.reset_index()
print("Batting stats dataframe shape ", bat_analysis.shape)

In [None]:
bat_analysis[['fullName','runs','ballsFaced','fours','sixes','strikeRate','captain']].sort_values('runs', ascending=False).head(10)

## Distribution of Batsman attributes

As we can see below, these attributes are on different scales.

In [None]:
batsman_attr = ['runs','ballsFaced','fours','sixes','strikeRate', 'captain']
p = bat_analysis[batsman_attr].hist(bins=100, figsize=(20,15))
plt.suptitle("Histograms of the mean values of batsman attributes")

## Correlation of batsman average values with runs scored

In [None]:
p = sns.pairplot(bat_analysis[batsman_attr])

## Experiment with different attribute combinations

The average value of an attribute may not reflect the longevity of a batsman's performance. For example, when a batsman has played only a few high-scoring games and remained not out in some of them, his batting average is calculated as (sum of runs scored in all games) / (number of games played minus the number of not outs), which inflates it. Hence it is a good idea to also look at the total runs, fours and sixes scored by the batsman and the total number of innings batted.
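
To make the effect of not outs concrete, here is a small worked sketch comparing the plain per-innings mean (what `groupby(...).mean()` computes in this notebook) with the cricket batting average. The innings and the `not_out` column are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Invented innings for one hypothetical batsman; not_out marks unbeaten innings.
innings = pd.DataFrame({
    "runs":    [50, 30, 40],
    "not_out": [False, True, False],
})

total_runs = innings["runs"].sum()           # 50 + 30 + 40 = 120
dismissals = (~innings["not_out"]).sum()     # 2 innings ended in a dismissal

mean_per_innings = innings["runs"].mean()    # 120 / 3
batting_average = total_runs / dismissals    # 120 / 2

print(mean_per_innings, batting_average)     # 40.0 60.0
```

A batsman who is never dismissed has no defined batting average, so real code would need to guard against a zero denominator.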

In [None]:
# Per-batsman total runs and number of innings batted
runs_df = batting_df.groupby("fullName").runs.agg(["sum", "count"])
runs_df = runs_df.reset_index()
runs_df = runs_df.rename(columns={"sum": "total_runs_scored", "count": "total_innings_batted"})

# numeric_only=True keeps newer pandas from trying to sum text columns
balls_boundaries_df = batting_df.groupby(['fullName']).sum(numeric_only=True)
balls_boundaries_df = balls_boundaries_df.reset_index()
balls_boundaries_df = balls_boundaries_df.drop(['season', 'match_id', 'innings_id','runs','strikeRate',
                                   'runningOver','link'], axis=1)
balls_boundaries_df = balls_boundaries_df.rename(columns={"ballsFaced": "total_balls_faced", "fours": "total_4s_hit"
                                                         ,"sixes": "total_6s_hit", "captain": "total_games_captained"})
combined_batting_sum_df = pd.merge(runs_df, balls_boundaries_df, on='fullName')
combined_batting_sum_df = combined_batting_sum_df.reset_index()
combined_batting_sum_df = combined_batting_sum_df.drop(['index'], axis=1)
combined_batting_sum_df

Combine this with the average values of each attribute that we explored earlier.

In [None]:
bat_analysis = bat_analysis.rename(columns={'runs':'avg_runs_scored', 'ballsFaced':'avg_balls_faced'
                                           ,'fours':'avg_4s_scored','sixes':'avg_6s_scored','captain':'avg_games_captained','strikeRate':'batting_st_rate'})
bat_analysis = bat_analysis.drop(['season','match_id', 'innings_id', 'runningOver', 'link'], axis=1)
batsman_stats = pd.merge(bat_analysis, combined_batting_sum_df, on="fullName")

Now we have 13 enriched numerical attributes that describe a batsman.

In [None]:
batsman_stats

## Correlation of total values of attributes of batsman
Now let's visualize the correlation of total runs scored, innings batted, balls faced, 4s and 6s, etc. against avg_runs_scored. Surprisingly, the totals do not show a stronger correlation than the averages we visualized above.

In [None]:
p = sns.pairplot(batsman_stats[['avg_runs_scored', 'total_runs_scored','total_innings_batted','total_balls_faced','total_4s_hit','total_6s_hit','total_games_captained']])

Let's save this numerical data for batsmen to a CSV file for future use.

In [None]:
batsman_stats.to_csv('batsman_numerical.csv',index=False)

# Bowler Features

<div align="center">
<img src="https://user-images.githubusercontent.com/48846576/102232253-7ef5b780-3eb4-11eb-9cc9-d29f3ddb8de8.jpg"  width="500" height="400">
    <span>Photo by <a href="https://unsplash.com/@yogendras31?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Yogendra Singh</a> on <a href="https://unsplash.com/s/photos/cricket-bat?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a> and annotated by me</span>
</div><br>

Moving on to bowler features: we again extract the attributes from the bowling summary data file and look at the average and total values of the following:

* Overs bowled
* Balls bowled
* Runs conceded
* Wickets Taken
* Economy rate (runs conceded per over)
* Number of dot balls bowled (deliveries off which no run is scored)
* 4s conceded
* 6s conceded
* Wides bowled (extras)
* No balls bowled (extras)
* Is the bowler captain and how many games captained

Let's build the average of these attributes across games

In [None]:
bowling_df['economyRate'] = bowling_df['economyRate'].replace({"-":"0"})
bowling_df = bowling_df.astype({"economyRate": float})
# numeric_only=True keeps newer pandas from trying to average text columns
bowl_analysis = bowling_df.groupby(['fullName']).mean(numeric_only=True)
bowl_analysis = bowl_analysis.reset_index()
bowl_analysis = bowl_analysis.drop(['season','match_id', 'innings_id' ], axis=1)
bowl_analysis.head(5)
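
The economy rate listed among the bowler attributes is runs conceded per over. As a hedged illustration, the sketch below computes it from raw figures, assuming overs are written in scorecard notation (3.4 meaning 3 overs and 4 balls); whether the `overs` column in this dataset follows that notation is an assumption. The helper names `overs_to_balls` and `economy_rate` are hypothetical.

```python
def overs_to_balls(overs: float) -> int:
    """Convert scorecard overs (3.4 = 3 overs and 4 balls) to legal balls."""
    whole = int(overs)
    extra = round((overs - whole) * 10)  # digit after the point is a ball count
    return whole * 6 + extra


def economy_rate(runs_conceded: int, overs: float) -> float:
    """Runs conceded per six-ball over; 0.0 when nothing was bowled."""
    balls = overs_to_balls(overs)
    return 0.0 if balls == 0 else runs_conceded * 6 / balls


print(economy_rate(30, 4.0))   # 30 runs off 24 balls -> 7.5
print(economy_rate(22, 3.4))   # 22 runs off 22 balls -> 6.0
```

Returning 0.0 for an empty spell mirrors the notebook's choice of replacing the `-` placeholder with 0.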

Next, build the totals.

In [None]:
bowl_analysis = bowl_analysis.rename(columns={'overs':'bowler_avg_overs',
'maidens':'bowler_avg_maidens',
'conceded':'bowler_avg_conceded',
'wickets':'bowler_avg_wkts',
'economyRate':'bowler_econ_rt',
'dots':'bowler_avg_dots',
'foursConceded':'bowler_avg_4s',
'sixesConceded':'bowler_avg_6s',
'wides':'bowler_avg_wides',
'noballs':'bowler_avg_noballs',
'captain':'bowler_avg_captaincy'})
# Per-bowler total runs conceded and number of innings bowled
conceded_df = bowling_df.groupby("fullName").conceded.agg(["sum", "count"])
conceded_df = conceded_df.reset_index()
conceded_df = conceded_df.rename(columns={"sum": "bowler_total_conceded", "count": "total_innings_bowled"})

# numeric_only=True keeps newer pandas from trying to sum text columns
other_stats_df = bowling_df.groupby(['fullName']).sum(numeric_only=True)
other_stats_df = other_stats_df.reset_index()
other_stats_df = other_stats_df.drop(['season', 'match_id', 'innings_id','conceded','economyRate'
                                   ], axis=1)
other_stats_df = other_stats_df.rename(columns={'overs':'bowler_total_overs',
'maidens':'bowler_total_maidens',
'wickets':'bowler_total_wkts',
'dots':'bowler_total_dots',
'foursConceded':'bowler_total_4s',
'sixesConceded':'bowler_total_6s',
'wides':'bowler_total_wides',
'noballs':'bowler_total_noballs',
'captain':'bowler_total_captaincy'})
combined_bowling_sum_df = pd.merge(conceded_df, other_stats_df, on='fullName')
combined_bowling_sum_df = combined_bowling_sum_df.reset_index()
combined_bowling_sum_df = combined_bowling_sum_df.drop(['index'], axis=1)
combined_bowling_sum_df.head(5)

Combine both the data frames. We get 23 numerical attributes for a bowler. We had only 13 for the batsman.

In [None]:
bowler_stats = pd.merge(bowl_analysis, combined_bowling_sum_df, on="fullName")
bowler_stats

## Bowler Stats Histograms and Correlation

Now let's visualize the bowler stats. The following histograms show that we are dealing with values on different scales and with different distributions.

In [None]:
p = bowler_stats[['bowler_total_conceded',
       'total_innings_bowled', 'bowler_total_overs', 'bowler_total_maidens',
       'bowler_total_wkts', 'bowler_total_dots', 'bowler_total_4s',
       'bowler_total_6s', 'bowler_total_wides', 'bowler_total_noballs',
       'bowler_total_captaincy']].hist( bins=80, figsize=(20,15))


In [None]:
p = bowler_stats[['bowler_avg_conceded', 'bowler_avg_overs', 'bowler_avg_maidens',
       'bowler_avg_wkts', 'bowler_econ_rt',
       'bowler_avg_dots', 'bowler_avg_4s', 'bowler_avg_6s', 'bowler_avg_wides',
       'bowler_avg_noballs', 'bowler_avg_captaincy']].hist( bins=80, figsize=(20,15))


Let's visualize the correlation of bowler attributes

In [None]:
p = sns.pairplot(bowler_stats[['bowler_avg_conceded','bowler_avg_overs', 'bowler_avg_maidens',
       'bowler_avg_wkts', 'bowler_econ_rt',
       'bowler_avg_dots', 'bowler_avg_4s', 'bowler_avg_6s', 'bowler_avg_wides',
       'bowler_avg_noballs', 'bowler_avg_captaincy']])

In [None]:
p = sns.pairplot(bowler_stats[['bowler_total_conceded',
       'total_innings_bowled', 'bowler_total_overs', 'bowler_total_maidens',
       'bowler_total_wkts', 'bowler_total_dots', 'bowler_total_4s',
       'bowler_total_6s', 'bowler_total_wides', 'bowler_total_noballs',
       'bowler_total_captaincy']])

Save the bowler information to csv

In [None]:
bowler_stats.to_csv('bowler_numerical.csv',index=False)

# Enrich the training data

<div align="center">
<img src="https://user-images.githubusercontent.com/48846576/102235511-042e9b80-3eb8-11eb-89c4-740ef682de45.jpg"  width="500" height="400">
           <span>Photo by <a href="https://unsplash.com/@v2osk?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">v2osk</a> on <a href="https://unsplash.com/s/photos/categories?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></span>
</div><br>
It's time to add the batsman and bowler features that we explored and extracted to the training data set.

In [None]:
def fill_batsman_attributes(df):
    for index, row in df.iterrows():
        try:
            temp = batsman_stats.loc[batsman_stats['fullName'] == df.at[index, 'batsman1_name']]
            temp = temp.reset_index()
            if temp.empty:
                print('DataFrame is empty for {}'.format(df.at[index, 'batsman1_name']))
            else:
                df.at[index,'avg_runs_scored'] = temp.at[0,'avg_runs_scored']
                df.at[index,'avg_balls_faced'] = temp.at[0,'avg_balls_faced']
                df.at[index,'avg_4s_scored'] = temp.at[0,'avg_4s_scored'] 
                df.at[index,'avg_6s_scored'] = temp.at[0,'avg_6s_scored'] 
                df.at[index,'batting_st_rate'] = temp.at[0,'batting_st_rate'] 
                df.at[index,'avg_games_captained'] = temp.at[0,'avg_games_captained'] 
                df.at[index,'total_runs_scored'] = temp.at[0,'total_runs_scored'] 
                df.at[index,'total_innings_batted'] = temp.at[0,'total_innings_batted'] 
                df.at[index,'total_balls_faced'] = temp.at[0,'total_balls_faced'] 
                df.at[index,'total_4s_hit'] = temp.at[0,'total_4s_hit'] 
                df.at[index,'total_6s_hit'] = temp.at[0,'total_6s_hit'] 
                df.at[index,'total_games_captained'] = temp.at[0,'total_games_captained'] 
        except KeyError as e:
            print(e)
            continue

def fill_bowler_attributes(df):
    for index, row in df.iterrows():
        try:
            temp = bowler_stats.loc[bowler_stats['fullName'] == df.at[index, 'bowler1_name']]
            temp = temp.reset_index()
            if temp.empty:
                print('DataFrame is empty for {}'.format(df.at[index, 'bowler1_name']))
            else:
                df.at[index,'bowler_avg_overs'] = temp.at[0,'bowler_avg_overs']
                df.at[index,'bowler_avg_maidens'] = temp.at[0,'bowler_avg_maidens']
                df.at[index,'bowler_avg_conceded'] = temp.at[0,'bowler_avg_conceded']
                df.at[index,'bowler_avg_wkts'] = temp.at[0,'bowler_avg_wkts']
                df.at[index,'bowler_econ_rt'] = temp.at[0,'bowler_econ_rt']
                df.at[index,'bowler_avg_dots'] = temp.at[0,'bowler_avg_dots']
                df.at[index,'bowler_avg_4s'] = temp.at[0,'bowler_avg_4s']
                df.at[index,'bowler_avg_6s'] = temp.at[0,'bowler_avg_6s']
                df.at[index,'bowler_avg_wides'] = temp.at[0,'bowler_avg_wides']
                df.at[index,'bowler_avg_noballs'] = temp.at[0,'bowler_avg_noballs']
                df.at[index,'bowler_avg_captaincy'] = temp.at[0,'bowler_avg_captaincy']
                df.at[index,'bowler_total_conceded'] = temp.at[0,'bowler_total_conceded']
                df.at[index,'total_innings_bowled'] = temp.at[0,'total_innings_bowled']
                df.at[index,'bowler_total_overs'] = temp.at[0,'bowler_total_overs']
                df.at[index,'bowler_total_maidens'] = temp.at[0,'bowler_total_maidens']
                df.at[index,'bowler_total_wkts'] = temp.at[0,'bowler_total_wkts']
                df.at[index,'bowler_total_dots'] = temp.at[0,'bowler_total_dots']
                df.at[index,'bowler_total_4s'] = temp.at[0,'bowler_total_4s']
                df.at[index,'bowler_total_6s'] = temp.at[0,'bowler_total_6s']
                df.at[index,'bowler_total_wides'] = temp.at[0,'bowler_total_wides']
                df.at[index,'bowler_total_noballs'] = temp.at[0,'bowler_total_noballs']
                df.at[index,'bowler_total_captaincy'] = temp.at[0,'bowler_total_captaincy']
        except KeyError as e:
            print(e)
            continue            
fill_batsman_attributes(train_df)
fill_bowler_attributes(train_df)
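
The two fill functions copy one attribute at a time for every row. A more compact alternative, sketched here on invented data, is to join the stats tables onto the training rows by player name. The `pairs`, `bat` and `bowl` frames are hypothetical stand-ins for `train_df`, `batsman_stats` and `bowler_stats`, with only one stat column each.

```python
import pandas as pd

# Toy stand-ins (invented values) with one stat column per table.
pairs = pd.DataFrame({
    "batsman1_name": ["A", "B", "C"],
    "bowler1_name":  ["X", "X", "Y"],
    "runs":          [12, 3, 7],
})
bat = pd.DataFrame({"fullName": ["A", "B"], "avg_runs_scored": [25.0, 10.0]})
bowl = pd.DataFrame({"fullName": ["X", "Y"], "bowler_econ_rt": [7.5, 8.2]})

# Left joins keep every training row; unmatched players get NaN stats.
enriched = (
    pairs
    .merge(bat.rename(columns={"fullName": "batsman1_name"}), on="batsman1_name", how="left")
    .merge(bowl.rename(columns={"fullName": "bowler1_name"}), on="bowler1_name", how="left")
)

# Players with no batting stats can be listed instead of printed row by row.
missing = enriched.loc[enriched["avg_runs_scored"].isna(), "batsman1_name"].tolist()
print(enriched)
print(missing)
```

This sketch assumes each player name appears once in each stats table; duplicate names would multiply rows in the join.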

Finally, we have our training data!

In [None]:
pd.set_option('display.max_columns', None)
train_df

In [None]:
train_df.to_csv('train.csv',index=False)

# Correlation Matrix

Let's look at the correlation of attributes to runs. 
* The average runs scored by a batsman has a comparatively strong correlation with how many runs he scores in a given game
* At the opposite end of the spectrum, the average wickets taken by the bowler has a strong negative correlation (relative to the other features) with the runs scored by the batsman
* Home-field advantage favors the batsman a little

In [None]:
corr_matrix = train_df[['batsman1_name', 'bowler1_name', 'home_team',
       'away_team', 'innings_id', 'runs', 'ball', 'venue', 'batsman_team',
       'bowling_team', 'home_game', 'avg_runs_scored', 'avg_balls_faced',
       'avg_4s_scored', 'avg_6s_scored', 'batting_st_rate',
       'avg_games_captained', 'total_runs_scored', 'total_innings_batted',
       'total_balls_faced', 'total_4s_hit', 'total_6s_hit',
       'total_games_captained', 'bowler_avg_overs', 'bowler_avg_maidens',
       'bowler_avg_conceded', 'bowler_avg_wkts', 'bowler_econ_rt',
       'bowler_avg_dots', 'bowler_avg_4s', 'bowler_avg_6s', 'bowler_avg_wides',
       'bowler_avg_noballs', 'bowler_avg_captaincy', 'bowler_total_conceded',
       'total_innings_bowled', 'bowler_total_overs', 'bowler_total_maidens',
       'bowler_total_wkts', 'bowler_total_dots', 'bowler_total_4s',
       'bowler_total_6s', 'bowler_total_wides', 'bowler_total_noballs',
       'bowler_total_captaincy']].select_dtypes(include='number').corr()
corr_matrix['runs'].sort_values(ascending=False)
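
For readers who want to see what the ranking above measures, here is a minimal sketch of Pearson correlation against a target column, using a tiny invented frame. The column names are borrowed from the training data, but the values are made up so that one feature rises and one falls exactly with `runs`.

```python
import pandas as pd

toy = pd.DataFrame({
    "runs":            [2, 4, 6, 8],
    "avg_runs_scored": [10, 20, 30, 40],      # rises in step with runs
    "bowler_avg_wkts": [4, 3, 2, 1],          # falls in step with runs
    "venue":           ["P", "Q", "P", "Q"],  # text column, excluded below
})

# Correlate numeric columns only, then rank features against the target.
corr_with_runs = (
    toy.select_dtypes(include="number")
       .corr()["runs"]
       .drop("runs")
       .sort_values(ascending=False)
)
print(corr_with_runs)
```

Values near +1 or -1 indicate a strong linear relationship with `runs`, while values near 0 indicate little linear relationship.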

In [None]:
plt.figure(figsize=(16,16))
sns.heatmap(corr_matrix,
            vmin=-1,
            cmap='twilight_shifted_r');

The training data is now prepared. The next task is to explore regression techniques and train a best-fit model!