<img src = "https://www.searchpng.com/wp-content/uploads/2019/02/IPL-Logo-PNG-715x715.png" style="align:center; object-fit:fill; display: block; margin-left: auto; margin-right: auto; width: 80%;"/>

# Exploratory Data Analysis on IPL Data

The Indian Premier League (IPL) is a professional Twenty20 cricket league contested by eight teams based in eight different Indian cities. Being an IPL fan, I was drawn to this dataset as soon as I saw it on Kaggle. I did some EDA and found some interesting insights, which I share here. Please leave any insights or suggestions in the comments.

### About the Dataset

This dataset consists of two separate CSV files: matches and deliveries. They contain a summary of each match and ball-by-ball details, respectively.

# Importing Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns

In [None]:
%matplotlib inline
sns.set_style("white")
sns.set_palette("husl", 14, 1)
matplotlib.rcParams['font.size'] = 12
matplotlib.rcParams['figure.figsize'] = (13, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

# Data Preparation and Cleaning

**Matches**

In [None]:
matches = pd.read_csv('../input/ipl-complete-dataset-20082020/IPL Matches 2008-2020.csv')
matches.head()

In [None]:
matches.describe()

In [None]:
matches.info()

Let's check whether the dataset contains NaN values.

In [None]:
match_na = matches.isna().sum()
match_na[match_na > 0]

After inspecting the dataset, we found that the columns winner, result and player_of_match each have 4 NaNs. These rows are matches that ended without a result because of rain. We can drop them, as they do not affect our analysis.

In [None]:
matches = matches.dropna(subset = ['winner', 'result', 'player_of_match'])
matches.shape

The column 'method' has the highest number of NaNs in the dataset, so the simplest course of action is to drop the entire column. However, the rows where 'method' does have a value are the matches in which the D/L method was used. Let us store those rows in a separate DataFrame and then drop 'method' from matches.

In [None]:
DL = matches.dropna(subset = ['method'])
matches = matches.drop('method', axis = 1)

In [None]:
DL.head()

Matches that were held in Dubai International Stadium had NaNs in their city column.

In [None]:
matches.loc[matches.city.isna(), 'city'] = 'Dubai'
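
The assignment above labels every missing city as Dubai. If some of those rows turn out to belong to a different ground, a venue-based lookup is a safer way to fill them. A minimal sketch on toy data, assuming a `venue` column is available; the `toy` frame, the venue names and the `venue_to_city` mapping are all hypothetical and not checked against the file:

```python
import pandas as pd

# Toy frame standing in for `matches`; venue names are illustrative only.
toy = pd.DataFrame({
    "venue": [
        "Dubai International Cricket Stadium",
        "Sharjah Cricket Stadium",
        "Eden Gardens",
    ],
    "city": [None, None, "Kolkata"],
})

# Hypothetical lookup from venue to city.
venue_to_city = {
    "Dubai International Cricket Stadium": "Dubai",
    "Sharjah Cricket Stadium": "Sharjah",
}

# Fill only the missing cities; existing values are left untouched.
toy["city"] = toy["city"].fillna(toy["venue"].map(venue_to_city))
```

Running `matches.loc[matches.city.isna(), 'venue'].unique()` first would show which venues actually need a mapping.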

Let's take a look at the 'result_margin' column, which has NaN values.

In [None]:
matches.loc[matches.result_margin.isna()].head()

On closer inspection, result_margin is NaN in matches that were tied.

In [None]:
matches.loc[matches.result == 'tie', 'result_margin'] = 0

Let's see if all the NaN values are taken care of...

In [None]:
match_na = matches.isna().sum()
match_na[match_na > 0]

After inspecting the matches dataset, we found that the team 'Rising Pune Supergiant' is also written as 'Rising Pune Supergiants'. Also, Delhi Daredevils and Delhi Capitals are the same team. Let's fix this.

In [None]:
matches.winner.unique()

In [None]:
matches.replace('Rising Pune Supergiants', 'Rising Pune Supergiant', inplace = True)

In [None]:
matches.replace('Delhi Daredevils', 'Delhi Capitals', inplace = True)

**Deliveries**

In [None]:
deliveries = pd.read_csv('../input/ipl-complete-dataset-20082020/IPL Ball-by-Ball 2008-2020.csv')
deliveries.head()

In [None]:
deliveries.info()

In [None]:
deliveries['bowling_team'].unique()

Let's fix the team names the same way as in matches.

In [None]:
deliveries.replace('Rising Pune Supergiants', 'Rising Pune Supergiant', inplace = True)

In [None]:
deliveries.replace('Delhi Daredevils', 'Delhi Capitals', inplace = True)


# Exploratory Data Analysis


**How many matches are played each year in IPL?**

In [None]:
matches["Year"] = matches["date"].apply(lambda x:x.split("-")[0])
matches['Year'].unique()
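
Splitting the string works only while every date is written year-first, and it leaves `Year` as text. An alternative is to parse the column and read the year from the datetime accessor. A small sketch on a hypothetical `toy` frame, assuming ISO-style dates:

```python
import pandas as pd

# Toy dates in year-month-day form (an assumption about the file's format).
toy = pd.DataFrame({"date": ["2008-04-18", "2013-05-26", "2020-11-10"]})

# Parse to datetime, then extract the year as an integer.
toy["Year"] = pd.to_datetime(toy["date"]).dt.year
```

With an integer year, the x-axis of a line plot is treated as numeric rather than categorical.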

In [None]:
match_count = matches['Year'].value_counts().sort_index()

In [None]:
sns.lineplot(x = match_count.index, y = match_count.values, )
sns.despine()
plt.ylabel('Number of matches')
plt.xlabel('Years')
plt.ylim((56, 79))
_ = plt.title('Number of matches per Year')

> *2013 saw the most matches of any season between 2008 and 2020.*

**How many matches did each team play in the IPL (2008-2020)?**

In [None]:
# Total matches played by a team
partial_count1 = matches['team1'].value_counts()
partial_count2 = matches['team2'].value_counts()
# fill_value = 0 keeps a team that appears in only one of the two columns
total_matches = partial_count1.add(partial_count2, fill_value = 0).astype(int).sort_values(ascending = False)
total_matches.head()
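
When two count Series do not share exactly the same index, plain addition leaves NaN for any label present in only one of them. A small sketch on made-up counts (the `home` and `away` Series are hypothetical) showing the difference `fill_value=0` makes:

```python
import pandas as pd

# Made-up counts: team B never appears in `away`, team C never in `home`.
home = pd.Series({"A": 3, "B": 2})
away = pd.Series({"A": 1, "C": 4})

# Plain addition aligns on the index and yields NaN for B and C.
naive = home + away

# Series.add with fill_value treats a missing label as 0 instead.
safe = home.add(away, fill_value=0).astype(int)
```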

In [None]:
sns.barplot(x = total_matches.index, y = total_matches.values)
sns.despine()
_ = plt.xticks(rotation = 40)
plt.title('Number of Matches played')
_ = plt.xlabel('Teams')

**Does winning the toss affect the outcome of a match for a team?**

In [None]:
toss_match_wins = matches.loc[(matches['toss_winner'] == matches['winner']), 'toss_winner'].value_counts()
toss_match_wins.head()

In [None]:
win_per_after_toss = np.divide(toss_match_wins, total_matches)*100

In [None]:
sns.barplot(x = win_per_after_toss.index, y = win_per_after_toss.values, )
sns.despine()
_ = plt.xticks(rotation = 30)
plt.title('Win % of Teams that won Toss')
_ = plt.xlabel('Teams')

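> *All the percentages are below 50%. Note, however, that the denominator here is every match a team played, not just the matches in which it won the toss, so these numbers alone cannot show whether winning the toss increases the chances of winning the match.*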
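
To judge whether the toss matters, the natural denominator is the number of tosses a team won rather than all the matches it played. A minimal sketch on made-up data; the `toy` frame and the team labels A, B and C are hypothetical, and only the column names `toss_winner` and `winner` come from the dataset:

```python
import pandas as pd

# Made-up results: who won the toss and who won the match.
toy = pd.DataFrame({
    "toss_winner": ["A", "A", "B", "B", "A", "C"],
    "winner":      ["A", "B", "B", "A", "A", "A"],
})

# Tosses won per team (the denominator).
toss_wins = toy["toss_winner"].value_counts()

# Matches in which the toss winner also won the match (the numerator).
both = toy.loc[toy["toss_winner"] == toy["winner"], "toss_winner"].value_counts()

# Per-team win rate given a toss win; teams with no such wins get 0.
win_rate_given_toss = (
    both.reindex(toss_wins.index, fill_value=0) / toss_wins * 100
).sort_index()

# League-wide share of matches won by the toss winner.
overall = (toy["toss_winner"] == toy["winner"]).mean() * 100
```

An overall share close to 50% would suggest the toss makes little difference; a value well above it would suggest an advantage.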

Next, I want to see how often each team goes on to win a match after winning the toss.


In [None]:
# When Teams won the toss as well as the match
match_win = matches.loc[matches['winner'] == matches['toss_winner'], 'toss_winner'].value_counts()

# Won toss but lost match = Total times teams won toss - WON both toss and match 
match_lose = matches['toss_winner'].value_counts() - match_win
match_lose

In [None]:
ticks = ['CSK', 'DC', 'DelhiC', 'GL', 'KXIP', 'KTK', 'KKR', 'MI', 'PW', 'RR', 'RPS', 'RCB', 'SRH']
match_win.sort_index(inplace = True)
match_lose.sort_index(inplace = True)
x = np.arange(len(ticks))
width = 0.4

In [None]:
plt.bar(x = x-0.2, height = match_win.values, width = width, label = 'Won', color = 'darkblue')
plt.bar(x = x+0.2, height = match_lose.values, width = width, label = 'Lost', color = 'orange')
plt.xticks(x, ticks)
plt.legend()
_ = plt.title('Matches won vs. lost after winning the toss')

> *We can see that there are teams that have actually lost more matches than they have won after winning the toss.*

**Which team has the highest win percentage?**

In [None]:
winner_counts = matches['winner'].value_counts()
win_per = np.round(np.divide(winner_counts, total_matches)*100).sort_values(ascending = False)
win_per.head()

In [None]:
sns.barplot(x = win_per.index, y = win_per.values,)
sns.despine()
_ = plt.xticks(rotation = 40)
plt.title('Team Wins %')
_ = plt.xlabel('Teams')

> *Chennai Super Kings has the highest win percentage, **60%**, among all the teams.*

**Which batsman hit the most sixes?**

In [None]:
# Count the sixes per batsman and keep the ten highest counts
batsman_sixes = deliveries[deliveries['batsman_runs'] == 6].groupby('batsman').agg(sixes = ('batsman_runs', 'count')).nlargest(n = 10, columns = 'sixes')
batsman_sixes.head()

In [None]:
sns.barplot(x = batsman_sixes.index, y = batsman_sixes.sixes)
sns.despine()
plt.title('Top 10 Batsmen - Highest number of 6s hit')
_ = plt.xlabel('Batsmen')

> *Chris Gayle hit the highest number of sixes during the IPL.*

**Which batsman hit the most 4s?**

In [None]:
# Count the fours per batsman and keep the ten highest counts
batsman_4s = deliveries[deliveries['batsman_runs'] == 4].groupby('batsman').agg(fours = ('batsman_runs', 'count')).nlargest(n = 10, columns = 'fours')
batsman_4s.head()

In [None]:
sns.barplot(x = batsman_4s.index, y = batsman_4s.fours, )
sns.despine()
plt.title('Top 10 Batsmen - Highest number of 4s hit')
_ = plt.xlabel('Batsmen')

> *Shikhar Dhawan hit the most 4s.*

**Which bowler conceded the most 6s?**

In [None]:
# Count the sixes conceded per bowler and keep the ten highest counts
bowler_6s = deliveries[deliveries['batsman_runs'] == 6].groupby('bowler').agg(sixes = ('batsman_runs', 'count')).nlargest(n = 10, columns = 'sixes')
bowler_6s.head()

In [None]:
plt.figure(figsize = ( 13, 5 ))
sns.barplot(x = bowler_6s.index, y = bowler_6s.sixes, )
sns.despine()
plt.title('Top 10 Bowlers - Highest number of 6s conceded')
_ = plt.xlabel('Bowlers')

> *Batsmen hit the most 6s against PP Chawla.*

**Which bowler conceded the most 4s?**

In [None]:
# Count the fours conceded per bowler and keep the ten highest counts
bowler_4s = deliveries[deliveries['batsman_runs'] == 4].groupby('bowler').agg(fours = ('batsman_runs', 'count')).nlargest(n = 10, columns = 'fours')

In [None]:
sns.barplot(x = bowler_4s.index, y = bowler_4s.fours, )
sns.despine()
plt.title('Top 10 Bowlers - Highest number of 4s conceded')
_ = plt.xlabel('Bowlers')

> *Batsmen hit the most 4s against UT Yadav.*

I have always wondered whether the team batting in the first innings has an advantage over its rival.<br> **Which teams are more likely to win?**

In [None]:
total_runs_inning = deliveries.groupby(by =['id', 'inning']).agg({'total_runs':'sum'}).reset_index()
winners_ = pd.pivot_table(data = total_runs_inning, columns = 'inning', index = 'id', values = 'total_runs')
winners_['won'] = np.where(winners_[1]>winners_[2], 'Innings 1', 'Innings 2')
# Equal totals are a tie in Twenty20 cricket
winners_['won'] = np.where(winners_[1]==winners_[2], 'tie', winners_['won'])
winners_.head()

In [None]:
sns.catplot(x="won", kind="count", data=winners_)

>*We can see that teams batting in the second innings have a slightly better chance of winning, so batting first may not always be the better choice.*
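
Comparing run totals can mislabel rain-shortened matches decided by the D/L method, where the chasing side may win with fewer runs than the first innings total. An alternative sketch reads the outcome from the `result` column of `matches`, assuming it records `'runs'` when the side batting first won and `'wickets'` when the chasing side won. Only the `'tie'` value is confirmed by the cleaning code earlier; the other two labels and the `toy` frame are assumptions:

```python
import pandas as pd

# Toy stand-in for matches['result']; the 'runs'/'wickets' labels are assumed.
toy = pd.DataFrame({
    "result": ["runs", "wickets", "wickets", "tie", "runs", "wickets"],
})

# A win by runs means the side batting first won;
# a win by wickets means the chasing side won.
label = {"runs": "Innings 1", "wickets": "Innings 2", "tie": "tie"}
won = toy["result"].map(label)

counts = won.value_counts()
```

Checking `matches['result'].unique()` beforehand would confirm which labels the file actually uses.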

**Which player was awarded Player of the Match the most times?**

In [None]:
pom = matches['player_of_match'].value_counts().sort_values(ascending=False).iloc[:10]
sns.barplot(x=pom.index, y=pom.values)
sns.despine()
_ = plt.title('Player of the match')

> *AB de Villiers was awarded Player of the Match the most times.*

Let's move on to some **Player Statistics**.

Let's count the centuries scored by each batsman. We would also like other information on our batsmen, such as strike rate, total runs scored and balls faced.

In [None]:
runs = deliveries.groupby(by =['batsman', 'id']).agg({'batsman_runs':'sum'})
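# sum(level = 0) was removed in pandas 2.0; group on the first index level (batsman) instead
centuries = runs['batsman_runs'].apply(lambda x: (x // 100)).groupby(level = 0).sum()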
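
Integer division by 100 would count a score of 200 or more as two centuries. That is unlikely in a Twenty20 innings, but comparing each score against 100 avoids the question entirely. A sketch on hypothetical per-match totals, with the toy `runs` Series shaped like the grouped result above:

```python
import pandas as pd

# Hypothetical runs per (batsman, match id).
index = pd.MultiIndex.from_tuples(
    [("A", 1), ("A", 2), ("B", 1), ("B", 2), ("C", 1)],
    names=["batsman", "id"],
)
runs = pd.Series([105, 30, 99, 100, 12], index=index, name="batsman_runs")

# One century for every match score of 100 or more, summed per batsman.
centuries = (runs >= 100).groupby(level="batsman").sum()
```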

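Wides are not counted as balls faced by the batsman, and penalty runs are not credited to him, so we exclude those deliveries. We also exclude no-balls here, although strictly a no-ball does count as a ball faced and runs hit off it are credited to the batsman, so the figures below slightly understate both.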

In [None]:
k = deliveries[~(deliveries['extras_type'].isin(['wides', 'noballs', 'penalty']))]

In [None]:
batsman_stats = k.groupby(by =['batsman']).agg({'batsman_runs':'sum', 'ball': 'count'})
batsman_stats['Strike_rate'] = 100*batsman_stats['batsman_runs']/batsman_stats['ball']
batsman_stats['Centuries'] = centuries
batsman_stats.head()

**Which player scored the most centuries?**

In [None]:
top10 = batsman_stats.nlargest(n = 10, columns = 'Centuries')
sns.barplot(x = top10.index, y = top10['Centuries'], )
sns.despine()
plt.title('Top 10 Batsmen - Highest number of Centuries')
plt.ylabel('Centuries')
_ = plt.xlabel('Batsmen')

> *Chris Gayle scored 6 centuries, the most of any batsman.*

**Which batsman has the highest runs?**

In [None]:
top10 = batsman_stats.nlargest(n = 10, columns = 'batsman_runs')
sns.barplot(x = top10.index, y = top10['batsman_runs'], )
sns.despine()
plt.title('Top 10 Batsmen - Highest number of Runs')
plt.ylabel('Runs')
_ = plt.xlabel('Batsmen')

> *Virat Kohli scored the highest number of runs throughout the IPL.*

**Which player has the highest strike rate?**

In [None]:
top10 = batsman_stats.nlargest(n = 10, columns = 'Strike_rate')
sns.barplot(x = top10.index, y = top10['Strike_rate'], )
sns.despine()
plt.title('Top 10 Batsmen - Highest Strike Rate')
plt.ylabel('Strike Rate')
_ = plt.xlabel('Batsmen')

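> *B Stanlake has the highest strike rate. Since no minimum number of balls faced is applied, this ranking can favour players who faced only a handful of deliveries.*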
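
A strike rate computed over very few balls is not very informative, so rankings of this kind usually apply a qualification threshold. A sketch on made-up numbers; the `toy_stats` frame and the `MIN_BALLS` cut-off of 100 are hypothetical choices, not values taken from the dataset:

```python
import pandas as pd

# Made-up batting records with the same columns as batsman_stats.
toy_stats = pd.DataFrame(
    {"batsman_runs": [10, 450, 300], "ball": [4, 300, 250]},
    index=["P1", "P2", "P3"],
)
toy_stats["Strike_rate"] = 100 * toy_stats["batsman_runs"] / toy_stats["ball"]

# Hypothetical qualification threshold on balls faced.
MIN_BALLS = 100

# Keep only batsmen with enough balls faced, then rank by strike rate.
qualified = toy_stats[toy_stats["ball"] >= MIN_BALLS]
top = qualified.nlargest(n=2, columns="Strike_rate")
```

Here P1 has the highest raw strike rate after only four balls but drops out once the threshold is applied.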

Similarly, we would like some statistics on our bowlers, e.g. total wickets, maidens, strike rate (balls / wickets) and bowling average (total runs / wickets).<br><br>
**Data Preparation.**

In [None]:
run_per_over = deliveries.groupby(by =['bowler', 'id', 'over']).agg({'total_runs':'sum'})
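# sum(level = 0) was removed in pandas 2.0; group on the first index level (bowler) instead
maidens = run_per_over['total_runs'].apply(lambda x: x == 0).groupby(level = 0).sum()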

In [None]:
bowler_stats = deliveries.groupby(by ='bowler').agg({'total_runs':'sum', 'ball': 'count', 'is_wicket':'sum'})
bowler_stats['Strike_rate'] = np.divide(bowler_stats['ball'], bowler_stats['is_wicket'])
bowler_stats['BowlingAve'] = np.divide(bowler_stats['total_runs'], bowler_stats['is_wicket'])
bowler_stats['Maidens'] = maidens.astype('int32')
bowler_stats.head()
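
The aggregation above counts every dismissal and every delivery against the bowler. In conventional bowling figures, run outs are not credited to the bowler, and wides and no-balls do not count as balls bowled. A hedged sketch of a stricter count on toy data, assuming a `dismissal_kind` column exists alongside `extras_type`; the `toy` frame and the labels in `not_bowler` are assumptions about how the file names things:

```python
import pandas as pd

# Toy deliveries; column names mirror the dataset, values are made up.
toy = pd.DataFrame({
    "bowler": ["X", "X", "X", "X", "Y", "Y"],
    "is_wicket": [1, 1, 0, 0, 1, 0],
    "dismissal_kind": ["bowled", "run out", None, None, "caught", None],
    "extras_type": [None, None, "wides", None, None, "noballs"],
})

# Dismissals not credited to the bowler (assumed label spellings).
not_bowler = ["run out", "retired hurt", "obstructing the field"]
credited = toy["is_wicket"].eq(1) & ~toy["dismissal_kind"].isin(not_bowler)

# Legal deliveries exclude wides and no-balls.
legal = ~toy["extras_type"].isin(["wides", "noballs"])

# Boolean sums per bowler give the counts.
stats = pd.DataFrame({
    "wickets": credited.groupby(toy["bowler"]).sum(),
    "legal_balls": legal.groupby(toy["bowler"]).sum(),
})
```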

**Which bowler has delivered the most balls?**

In [None]:
top10 = bowler_stats.nlargest(n = 10, columns = 'ball')
sns.barplot(x = top10.index, y = top10['ball'], )
sns.despine()
plt.title('Top 10 Bowlers - Highest number of Balls delivered')
plt.ylabel('Balls')
_ = plt.xlabel('Bowlers')

> *Harbhajan Singh delivered the most balls.*

**Which bowler took the most wickets?**

In [None]:
top10 = bowler_stats.nlargest(n = 10, columns = 'is_wicket')
sns.barplot(x = top10.index, y = top10['is_wicket'], )
sns.despine()
plt.title('Top 10 Bowlers - Highest number of Wickets')
plt.ylabel('Wickets')
_ = plt.xlabel('Bowlers')
_ = plt.xticks(rotation = 20)

> *SL Malinga took the most wickets.*

**Which bowler bowled the most maidens?**

In [None]:
top10 = bowler_stats.nlargest(n = 10, columns = 'Maidens')
sns.barplot(x = top10.index, y = top10['Maidens'], )
sns.despine()
plt.title('Top 10 Bowlers - Highest number of Maidens delivered')
plt.ylabel('Maidens')
_ = plt.xlabel('Bowlers')
_ = plt.xticks(rotation = 20)

> *P Kumar bowled the most maidens.*

**Which batsman has played the most matches throughout the IPL?**

In [None]:
# Number of distinct matches in which each batsman faced at least one ball
matches_played_batsman = deliveries.groupby('batsman')['id'].nunique()
matches_played_batsman = matches_played_batsman.sort_values(ascending = False).iloc[:10]

In [None]:
sns.barplot(x = matches_played_batsman.index, y = matches_played_batsman.values)
sns.despine()
plt.title('Top 10 - Highest number of matches played by Batsmen')
plt.ylabel('No. of matches')
_ = plt.xlabel('Batsmen')

> *Rohit Sharma has played the most matches.*

**Which bowlers played the most matches throughout the IPL?**

In [None]:
# Number of distinct matches in which each bowler bowled at least one ball
matches_played_bowler = deliveries.groupby('bowler')['id'].nunique()
matches_played_bowler = matches_played_bowler.sort_values(ascending = False).iloc[:10]

In [None]:
sns.barplot(x = matches_played_bowler.index, y = matches_played_bowler.values,)
sns.despine()
plt.title('Top 10 - Highest number of matches played by Bowler')
plt.ylabel('No. of matches')
plt.xticks(rotation = 20)
_ = plt.xlabel('Bowlers')

> *PP Chawla has played the most matches.*

# Thank You!