In [228]:
import pandas as pd 
import numpy as np
import plotly.express as px

Displaying Data 

In [229]:
df=pd.read_csv("deliveries_updated_mens_ipl_upto_2024 - deliveries_updated_mens_ipl_upto_2024.csv")
df.head() #displays the first 5 rows

Unnamed: 0,matchId,inning,over_ball,over,ball,batting_team,bowling_team,batsman,non_striker,bowler,batsman_runs,extras,isWide,isNoBall,Byes,LegByes,Penalty,dismissal_kind,player_dismissed,date
0,335982,1,0.1,0,1,Kolkata Knight Riders,Royal Challengers Bangalore,SC Ganguly,BB McCullum,P Kumar,0,1,,,,1.0,,,,2008-04-18
1,335982,1,0.2,0,2,Kolkata Knight Riders,Royal Challengers Bangalore,BB McCullum,SC Ganguly,P Kumar,0,0,,,,,,,,2008-04-18
2,335982,1,0.3,0,3,Kolkata Knight Riders,Royal Challengers Bangalore,BB McCullum,SC Ganguly,P Kumar,0,1,1.0,,,,,,,2008-04-18
3,335982,1,0.4,0,4,Kolkata Knight Riders,Royal Challengers Bangalore,BB McCullum,SC Ganguly,P Kumar,0,0,,,,,,,,2008-04-18
4,335982,1,0.5,0,5,Kolkata Knight Riders,Royal Challengers Bangalore,BB McCullum,SC Ganguly,P Kumar,0,0,,,,,,,,2008-04-18


In [230]:
df.info() #displays the information about the dataset

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 260920 entries, 0 to 260919
Data columns (total 20 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   matchId           260920 non-null  int64  
 1   inning            260920 non-null  int64  
 2   over_ball         260920 non-null  float64
 3   over              260920 non-null  int64  
 4   ball              260920 non-null  int64  
 5   batting_team      260920 non-null  object 
 6   bowling_team      260920 non-null  object 
 7   batsman           260920 non-null  object 
 8   non_striker       260920 non-null  object 
 9   bowler            260920 non-null  object 
 10  batsman_runs      260920 non-null  int64  
 11  extras            260920 non-null  int64  
 12  isWide            8381 non-null    float64
 13  isNoBall          1093 non-null    float64
 14  Byes              673 non-null     float64
 15  LegByes           4001 non-null    float64
 16  Penalty           2 

In [231]:
df.columns #to list out all the columns in the dataset

Index(['matchId', 'inning', 'over_ball', 'over', 'ball', 'batting_team',
       'bowling_team', 'batsman', 'non_striker', 'bowler', 'batsman_runs',
       'extras', 'isWide', 'isNoBall', 'Byes', 'LegByes', 'Penalty',
       'dismissal_kind', 'player_dismissed', 'date'],
      dtype='object')

**Data Cleaning Process**

Replacing all the blank cells with NaN

In [232]:
df=df.replace("",np.nan)
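
`pd.read_csv` already parses empty fields as NaN by default, so the replacement above mainly acts as a safeguard. A minimal sketch, assuming some cells might also contain only whitespace (the toy frame `demo` is hypothetical and is not the IPL data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one empty string and one whitespace-only cell.
demo = pd.DataFrame({"dismissal_kind": ["caught", "", "   ", "bowled"]})

# A regex replace treats empty and whitespace-only strings as missing.
demo_clean = demo.replace(r"^\s*$", np.nan, regex=True)

# Count the cells that are now treated as missing.
print(demo_clean["dismissal_kind"].isna().sum())
```

Whether whitespace-only cells occur in the real file is not known from the output shown here, so this is only a defensive variant.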

In [233]:
print("Missing values per column:")
print(df.isnull().sum())

Missing values per column:
matchId                  0
inning                   0
over_ball                0
over                     0
ball                     0
batting_team             0
bowling_team             0
batsman                  0
non_striker              0
bowler                   0
batsman_runs             0
extras                   0
isWide              252539
isNoBall            259827
Byes                260247
LegByes             256919
Penalty             260918
dismissal_kind      247970
player_dismissed    247970
date                     0
dtype: int64


Removing the Duplicate Rows

In [234]:
before=df.shape[0]
df=df.drop_duplicates()
after=df.shape[0]
print(f"Dropped {before-after} duplicate rows.")

Dropped 3 duplicate rows.


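Convert the `date` column to datetime type, which makes it easier to select matches from a specific month or season. With `errors='coerce'`, any value that cannot be parsed becomes `NaT` instead of raising an error.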

In [235]:
df['date']=pd.to_datetime(df['date'],errors='coerce')
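
Once the column is a datetime, season and month filters can be written with the `.dt` accessor. A minimal sketch on hypothetical toy rows, assuming a season can be identified by its calendar year (the `season` column and the `demo` frame are illustrative and not part of the original notebook):

```python
import pandas as pd

# Hypothetical toy rows standing in for the deliveries data.
demo = pd.DataFrame({
    "matchId": [1, 1, 2, 3],
    "date": ["2008-04-18", "2008-04-18", "2008-05-01", "2024-04-02"],
    "batsman_runs": [4, 1, 6, 2],
})
demo["date"] = pd.to_datetime(demo["date"], errors="coerce")

# Assumption: one season per calendar year.
demo["season"] = demo["date"].dt.year

# Select one season, then narrow it down to a single month.
season_2008 = demo[demo["season"] == 2008]
april_2008 = season_2008[season_2008["date"].dt.month == 4]

print(season_2008["batsman_runs"].sum())
print(april_2008["matchId"].nunique())
```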

**Data Analysis and Insights**


Create a new column `total_runs` by adding the batsman runs and the extras.

Create another column `is_wicket` that stores 1 if a player was dismissed on that ball and 0 otherwise.

In [236]:
df["total_runs"]=df["batsman_runs"]+df["extras"]
df["is_wicket"]=df["player_dismissed"].notnull().astype(int)
df.head()

Unnamed: 0,matchId,inning,over_ball,over,ball,batting_team,bowling_team,batsman,non_striker,bowler,...,isWide,isNoBall,Byes,LegByes,Penalty,dismissal_kind,player_dismissed,date,total_runs,is_wicket
0,335982,1,0.1,0,1,Kolkata Knight Riders,Royal Challengers Bangalore,SC Ganguly,BB McCullum,P Kumar,...,,,,1.0,,,,2008-04-18,1,0
1,335982,1,0.2,0,2,Kolkata Knight Riders,Royal Challengers Bangalore,BB McCullum,SC Ganguly,P Kumar,...,,,,,,,,2008-04-18,0,0
2,335982,1,0.3,0,3,Kolkata Knight Riders,Royal Challengers Bangalore,BB McCullum,SC Ganguly,P Kumar,...,1.0,,,,,,,2008-04-18,1,0
3,335982,1,0.4,0,4,Kolkata Knight Riders,Royal Challengers Bangalore,BB McCullum,SC Ganguly,P Kumar,...,,,,,,,,2008-04-18,0,0
4,335982,1,0.5,0,5,Kolkata Knight Riders,Royal Challengers Bangalore,BB McCullum,SC Ganguly,P Kumar,...,,,,,,,,2008-04-18,0,0


Save the cleaned data to a CSV file

In [237]:
df.to_csv('cleaned_data.csv',index=False)

### Top 10 Run‑Scoring Batsmen

Group the dataset by `batsman`, calculate the **total runs scored** by each player across all games, and sort the result in descending order.
This shows the highest‑scoring batsmen in the IPL data.

In [238]:
top_batsmen=df.groupby("batsman")["batsman_runs"].sum()
top_batsmen=top_batsmen.sort_values(ascending=False).head(10)
print(top_batsmen)

batsman
V Kohli           8014
S Dhawan          6769
RG Sharma         6630
DA Warner         6567
SK Raina          5536
MS Dhoni          5243
AB de Villiers    5181
CH Gayle          4997
RV Uthappa        4954
KD Karthik        4843
Name: batsman_runs, dtype: int64


### Bar Chart for Top 10 Batsmen by Total Runs
- **X‑axis:** Player names  
- **Y‑axis:** Total runs scored across all games in the dataset.

In [239]:
fig=px.bar(
    x=top_batsmen.index,
    y=top_batsmen.values,
    labels={"x":"Batsman", "y":"Total Runs"},
    title="Top 10 Batsmen by Runs in IPL"
)

fig.update_layout(xaxis_tickangle=-45)
fig.show()

**Top Performer – Virat Kohli**

- **Insight:** Virat Kohli is the highest run‑scorer in the dataset with 8,014 runs.
- **Action:** Use bowlers with a proven high dismissal rate against him to increase the chance of removing him early in the innings.
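
One way to look for such bowlers is a batsman-versus-bowler matchup table. The sketch below uses hypothetical toy rows with placeholder names; it counts a dismissal only when `player_dismissed` matches the target batsman, and it counts every delivery as a ball faced, which is a simplification (wides, for example, would need to be excluded in a stricter version):

```python
import pandas as pd

# Hypothetical toy deliveries; the names are placeholders, not real records.
demo = pd.DataFrame({
    "batsman": ["Batter A"] * 6,
    "bowler": ["Bowler X"] * 3 + ["Bowler Y"] * 3,
    "player_dismissed": [None, "Batter A", "Batter A", None, None, "Batter A"],
})

target = "Batter A"
faced = demo[demo["batsman"] == target].copy()

# Flag only the balls on which the target batsman himself was dismissed.
faced["dismissed"] = (faced["player_dismissed"] == target).astype(int)

# Balls bowled to the target and dismissals of the target, per bowler.
matchup = faced.groupby("bowler").agg(
    balls=("dismissed", "size"),
    dismissals=("dismissed", "sum"),
)
matchup["dismissals_per_ball"] = matchup["dismissals"] / matchup["balls"]
matchup = matchup.sort_values("dismissals_per_ball", ascending=False)
print(matchup)
```

On the real data, a minimum number of balls per matchup would be a sensible filter before ranking, since small samples can produce misleading rates.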

### Top 10 Wicket‑Taking Bowlers

Group the dataset by `bowler` and calculate the **total wickets taken** by each player across all games, then sort in descending order. This highlights the most effective bowlers in the IPL data.

In [240]:
top_bowlers=df.groupby("bowler")["is_wicket"].sum()
top_bowlers=top_bowlers.sort_values(ascending=False).head(10)
print(top_bowlers)

bowler
YS Chahal     213
DJ Bravo      207
PP Chawla     201
SP Narine     200
R Ashwin      198
B Kumar       195
SL Malinga    188
A Mishra      183
JJ Bumrah     182
RA Jadeja     169
Name: is_wicket, dtype: int64


### Bar Chart for Top 10 Bowlers by Total Wickets
- **X‑axis:** Bowler names  
- **Y‑axis:** Total wickets taken across all games in the dataset.

In [241]:
fig=px.bar(
    x=top_bowlers.index,
    y=top_bowlers.values,
    color=top_bowlers.values, #required for the colour scale below to take effect
    labels={"x":"Bowler", "y":"Total Wickets"},
    title="Top 10 Bowlers by Wickets in IPL",
    color_continuous_scale="Deep"
)

fig.update_layout(xaxis_tickangle=-45)
fig.show()

**Top Wicket‑Takers**
- **Insight:** The top bowlers (e.g. YS Chahal and DJ Bravo) have taken the most wickets in the dataset, which helps control the opposition’s scoring.
- **Action:** Use these bowlers in high‑pressure overs (Powerplay and Death) to take wickets and contain runs.
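
Note that `is_wicket` flags every dismissal, including run outs, which are not credited to the bowler under standard scoring conventions. A minimal sketch of a stricter count on hypothetical toy rows; the labels in `NOT_BOWLER_WICKETS` are an assumption about how `dismissal_kind` is spelled in this dataset and should be checked against `df["dismissal_kind"].unique()`:

```python
import pandas as pd

# Assumption: these labels match the dataset's `dismissal_kind` values.
NOT_BOWLER_WICKETS = ["run out", "retired hurt", "retired out", "obstructing the field"]

# Hypothetical toy deliveries with placeholder names.
demo = pd.DataFrame({
    "bowler": ["Bowler X", "Bowler X", "Bowler X", "Bowler Y"],
    "dismissal_kind": ["caught", "run out", None, "bowled"],
    "player_dismissed": ["Batter A", "Batter B", None, "Batter C"],
})

# A bowler's wicket: someone was dismissed, and not by an excluded mode.
demo["is_bowler_wicket"] = (
    demo["player_dismissed"].notna()
    & ~demo["dismissal_kind"].isin(NOT_BOWLER_WICKETS)
).astype(int)

bowler_wickets = demo.groupby("bowler")["is_bowler_wicket"].sum()
print(bowler_wickets.sort_values(ascending=False))
```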

### Total Runs by Each Team in the Powerplay (Overs 1–6)

This filters the data to deliveries from overs 1 to 6, known as the Powerplay phase. The `over` column is zero‑based, so `over <= 5` selects these six overs. The data is then grouped by `batting_team` and `total_runs` is summed to find how many runs each team has scored during this period.

This helps to identify the teams that start aggressively.

In [242]:
powerplay_df=df[df['over']<=5]
powerplay_runs=powerplay_df.groupby("batting_team")["total_runs"].sum().sort_values(ascending=False)

### Bar Chart for Total Runs by Teams in the Powerplay

- **X‑axis:** Team names  
- **Y‑axis:** Total runs scored during the Powerplay phase (overs 1–6) across all matches in the dataset.

Higher bars indicate teams that have made stronger starts to their innings.

In [243]:
fig=px.bar(
    x=powerplay_runs.index,
    y=powerplay_runs.values,
    color=powerplay_runs.values,
    labels={"x":"Team", "y":"Total Runs"},
    title="Total Runs by Teams in Powerplay (Overs 1–6)",
)

fig.update_layout(xaxis_tickangle=-45)
fig.show()

**Top Runs in Powerplay Overs**

- **Insight:** Teams like Mumbai Indians and Kolkata Knight Riders have amassed the highest total runs in the first 6 overs across the dataset, reflecting strong starts. Because these are totals, teams that have played more matches tend to rank higher.
- **Action:** Opposition sides should open with their most effective bowlers and set defensive fields.
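
Totals across all matches favour teams that have played more games. A minimal sketch of a per-innings average on hypothetical toy rows (the team names and numbers are made up; the column names follow the dataset):

```python
import pandas as pd

# Hypothetical toy deliveries: Team A bats in two matches, Team B in one.
demo = pd.DataFrame({
    "matchId": [1, 1, 2, 2, 3, 3],
    "inning": [1, 1, 1, 1, 1, 1],
    "over": [0, 5, 0, 7, 1, 2],
    "batting_team": ["Team A", "Team A", "Team A", "Team A", "Team B", "Team B"],
    "total_runs": [4, 6, 4, 6, 6, 6],
})

# `over` is zero-based, so 0-5 covers overs 1-6.
powerplay = demo[demo["over"] <= 5]

# Powerplay runs per innings first, then the mean of those per team.
innings_totals = powerplay.groupby(["batting_team", "matchId", "inning"])["total_runs"].sum()
avg_powerplay = innings_totals.groupby(level="batting_team").mean().sort_values(ascending=False)

total_powerplay = powerplay.groupby("batting_team")["total_runs"].sum()
print(total_powerplay)
print(avg_powerplay)
```

In this toy example Team A has the larger total, while Team B has the higher average per innings, which is the kind of difference that raw totals can hide.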

### Total Runs by Each Team in the Death Overs (Overs 16–20)

This filters the data to deliveries from overs 16 to 20, known as the Death overs phase (`over >= 15`, because the `over` column is zero‑based). The data is then grouped by `batting_team` and `total_runs` is summed to find how many runs each team has scored during this period.

This helps to identify the teams which finish their innings strongly.

In [244]:
death_df=df[df["over"]>=15]
death_runs=death_df.groupby("batting_team")["total_runs"].sum().sort_values(ascending=False)

### Bar Chart for Total Runs by Teams in the Death Overs

- **X‑axis:** Team names  
- **Y‑axis:** Total runs scored during the Death overs (overs 16–20) across all matches in the dataset.

Higher bars indicate teams that finish their innings strongly.

In [245]:

fig=px.bar(
    x=death_runs.index,
    y=death_runs.values,
    color=death_runs.values,  
    labels={"x":"Team", "y":"Total Runs"},
    title="Total Runs by Teams in Death Overs (Overs 16–20)",
)

fig.update_layout(xaxis_tickangle=-45)
fig.show()

### Top Runs in Death Overs (16–20)

- **Insight:** Teams like Mumbai Indians and Chennai Super Kings have scored the most total runs in the final 5 overs across the dataset, showing strong finishing.
- **Action:** Opponents should save their best bowlers for the last few overs to slow down the scoring.

## Insights from Powerplay & Death Overs Analysis

- **Insights:** Mumbai Indians rank among the top teams for total runs at both the start and the end of the innings.  
    - Heavy scoring in the Powerplay (overs 1–6) gives them early momentum.  
    - High totals in the Death overs (overs 16–20) show that they also finish strongly.  
    - Together, this lets them put pressure on the other team from the first ball to the last.

- **Action:** To reduce their advantage:
  - Start with reliable bowlers who can take early wickets.
  - Save the best bowlers for the last few overs to prevent a big score at the end.
  - Limit easy runs by keeping fielders alert.
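
The Powerplay and Death-overs comparison can also be produced in one step by labelling each delivery with its phase. A minimal sketch on hypothetical toy rows; the `phase` column and the "Middle" label are illustrative names that do not appear in the original notebook:

```python
import pandas as pd

# Hypothetical toy deliveries for a single team.
demo = pd.DataFrame({
    "over": [0, 5, 6, 14, 15, 19],
    "batting_team": ["Team A"] * 6,
    "total_runs": [1, 2, 3, 4, 5, 6],
})

# `over` is zero-based: 0-5 = overs 1-6, 6-14 = overs 7-15, 15-19 = overs 16-20.
# pd.cut uses right-closed bins here: (-1, 5], (5, 14], (14, 19].
demo["phase"] = pd.cut(
    demo["over"],
    bins=[-1, 5, 14, 19],
    labels=["Powerplay", "Middle", "Death"],
).astype(str)

# One row per team, one column per phase.
phase_runs = demo.groupby(["batting_team", "phase"])["total_runs"].sum().unstack()
print(phase_runs)
```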


### Wickets Lost in Powerplay Overs (1–6)

This groups the data by `batting_team` and sums `is_wicket` to count how many wickets each team has lost in the Powerplay phase, then sorts the results in descending order.

This helps identify teams that lose early wickets and those that protect wickets well.

In [246]:
powerplay=df[df["over"]<=5]
wickets_in_powerplay=(powerplay.groupby("batting_team")["is_wicket"].sum().sort_values(ascending=False))

### Bar Chart for Wickets Lost in the Powerplay

This chart shows the number of wickets each team has lost during the Powerplay phase (overs 1–6) across all matches in the dataset.

- **X‑axis:** Team names  
- **Y‑axis:** Total wickets lost in the Powerplay 

In [247]:
fig=px.bar(
    x=wickets_in_powerplay.index,
    y=wickets_in_powerplay.values,
    color=wickets_in_powerplay.values,
    labels={"x":"Team", "y":"Wickets Lost"},
    title="Wickets Lost in Powerplay Overs (1–6)"
)

fig.update_layout(xaxis_tickangle=-45)
fig.show()

**Insight:** Teams like Kolkata Knight Riders and Mumbai Indians have lost the most wickets in the Powerplay. This suggests they either take high risks or are weak in the early overs.

**Action:** High‑wicket‑loss teams should take a more cautious batting approach in the first 2–3 overs. Opponents can exploit this weakness by using aggressive bowling tactics in the first six overs.
For low‑loss teams, the stability provides an opportunity to increase the run rate earlier in the match.

### Wickets Lost in Death Overs

This groups the data by `batting_team` and sums `is_wicket` to count how many wickets each team has lost during the death overs (overs 16–20).

This helps identify teams that tend to lose wickets towards the end of an innings and those that finish strongly with wickets in hand. 

In [248]:
death_overs = df[df["over"]>=15]
wickets_in_death = (death_overs.groupby("batting_team")["is_wicket"].sum().sort_values(ascending=False))

### Bar Chart for Wickets Lost in Death Overs (16–20)

This chart shows the number of wickets each team has lost during the death overs across all matches in the dataset.

- **X‑axis:** Team names  
- **Y‑axis:** Total wickets lost in the death overs  

In [249]:
fig=px.bar(
    x=wickets_in_death.index,
    y=wickets_in_death.values,
    color=wickets_in_death.values,  
    labels={"x": "Team", "y": "Wickets Lost"},
    title="Wickets Lost in Death Overs (16–20)"
)
fig.update_layout(xaxis_tickangle=-45)
fig.show()

**Insight:** Teams like Mumbai Indians and Kolkata Knight Riders have lost the most wickets in the last few overs, which can limit how many runs they add at the end.

**Action:** Opponents can bowl their most accurate bowlers at the end to increase the chance of taking wickets.

## Insights from Wickets Lost in Powerplay & Death Overs

- **Insights:** Some teams struggle at both the start and end of the innings due to wicket losses.  
    - Early wickets in the Powerplay slow the scoring rate.  
    - Losing key players again in the Death overs prevents them from finishing strongly.  
    - As a result, even a good middle‑overs performance often fails to produce a good total.

- **Action:** To reduce this weakness:
  - Train batters to adapt quickly to yorkers, slower balls, and tight fielding under pressure.  
  - Opponents can attack with their best wicket‑taking bowlers at the start, and their most accurate bowlers at the end, to maximise dismissal chances.
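
To check whether wicket losses actually go together with lower scoring, runs and wickets can be summarised side by side. A minimal sketch for the Death overs on hypothetical toy rows; `runs_per_wicket` is an illustrative metric, not one computed in the original notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical toy deliveries from the death overs of two teams.
demo = pd.DataFrame({
    "over": [15, 16, 17, 18, 15, 19],
    "batting_team": ["Team A", "Team A", "Team A", "Team A", "Team B", "Team B"],
    "total_runs": [6, 4, 0, 10, 8, 4],
    "is_wicket": [0, 1, 1, 0, 0, 0],
})

# `over` is zero-based, so 15 and above covers overs 16-20.
death = demo[demo["over"] >= 15]

summary = death.groupby("batting_team").agg(
    runs=("total_runs", "sum"),
    wickets=("is_wicket", "sum"),
)

# Replace zero wickets with NaN to avoid dividing by zero.
summary["runs_per_wicket"] = summary["runs"] / summary["wickets"].replace(0, np.nan)
print(summary)
```

A team with no wickets lost in the phase gets NaN here rather than an infinite value, which keeps it out of any later sort on that column.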