# Exploratory Data Analysis of IPL Matches (2008–2024)

# 🏏 Indian Premier League (IPL) Data Analysis (2008–2024)

## 📌 Overview
The **Indian Premier League (IPL)** is India’s premier professional T20 cricket league, where city-based teams compete in high-intensity matches lasting about three hours.  

This dataset spans **17 seasons of IPL history (2008–2024)** and contains rich details about matches, players, teams, and results.  
It provides an excellent foundation to analyze cricket performance, uncover patterns, and explore factors that influence match outcomes.

---

## 📂 Dataset Description
The dataset includes two CSV files:

- **`deliveries.csv`** → Ball-by-ball details of every IPL match  
- **`matches.csv`** → Match-level summaries (teams, venues, winners, etc.)

---

## 🎯 Objectives
With this dataset, we aim to:

- 📊 Examine **team performances** over the seasons  
- 👤 Analyze **player statistics** and consistency  
- 🏆 Study **match outcomes** and winning factors  
- 📈 Explore **seasonal trends** across IPL history  

---


In [None]:
#ignoring warnings to keep the output clean
import warnings
warnings.filterwarnings('ignore')

## Dataset Overview

Before any deeper analysis, get a clear picture of the dataset: its size, its columns, their data types, and where values are missing.

In [None]:
import pandas as pd
df=pd.read_csv('/kaggle/input/ipl-complete-dataset-20082020/matches.csv')
df.head()

In [None]:
#Knowing the number of rows and columns gives you a sense of the dataset's size and complexity.
print(df.shape)


In [None]:
# Column names tell you what variables you have to work with for your analysis.
print(df.columns)


In [None]:
#Different datatypes (integers, strings, dates) require different handling techniques.
print("matches.csv-\n")
df.info()  # info() prints its report directly and returns None
 

In [None]:
#Percentages of Missing Values
mi=df.isnull().sum().sort_values(ascending=False)/len(df)
mi*100

In [None]:
# Plot only the columns that have missing values
mi[mi!=0].plot(kind='barh')
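The missing-value check above can also be wrapped in a small reusable helper that reports counts and percentages side by side. This is a minimal sketch on a made-up frame; `missing_summary` and the toy data are hypothetical names introduced for illustration, not part of the dataset.

```python
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values per column, worst first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns that actually have gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy frame standing in for matches.csv (values are made up)
toy = pd.DataFrame({
    "city": ["Mumbai", None, "Delhi", "Chennai"],
    "method": [None, None, None, "D/L"],
    "winner": ["A", "B", "A", "B"],
})
report = missing_summary(toy)
print(report)
```

The same helper could be called on `df` or `pf` to get one table per file.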

## Data Cleaning


### Handling Inconsistent Data
Teams might be referred to differently across records.

In [None]:
#check for duplicated rows
df.duplicated().sum()

In [None]:
# Check unique team names to identify inconsistencies
print("Unique team1 values:", df['team1'].unique())
print("Unique team2 values:", df['team2'].unique())
print("Unique winner values:", df['winner'].unique())
print("Unique toss_winner values:", df['toss_winner'].unique())

In [None]:
# Creating a mapping for inconsistent team names
team_name_mapping = {
    'Rising Pune Supergiants': 'Rising Pune Supergiant',
    'Delhi Daredevils': 'Delhi Capitals',     
    'Deccan Chargers': 'Sunrisers Hyderabad',
    'Kings XI Punjab': 'Punjab Kings',
    'Royal Challengers Bengaluru': 'Royal Challengers Bangalore',
}

df['team1'] = df['team1'].replace(team_name_mapping)
df['team2'] = df['team2'].replace(team_name_mapping)
df['winner'] = df['winner'].replace(team_name_mapping)
df['toss_winner'] = df['toss_winner'].replace(team_name_mapping)

print("\nUnique team1 values after standardization:", df['team1'].unique())
print("Unique team2 values after standardization:", df['team2'].unique())
print("Unique winner values after standardization:", df['winner'].unique())
print("Unique toss_winner values after standardization:", df['toss_winner'].unique())
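One way to confirm that the standardization caught every column is to apply the mapping to all team columns at once and then look for leftover old names. The sketch below uses a toy frame and a hypothetical two-entry subset of the mapping.

```python
import pandas as pd

# Hypothetical subset of the mapping used above
team_name_mapping = {
    "Delhi Daredevils": "Delhi Capitals",
    "Kings XI Punjab": "Punjab Kings",
}
team_columns = ["team1", "team2", "winner", "toss_winner"]

# Made-up rows for illustration
toy = pd.DataFrame({
    "team1": ["Delhi Daredevils", "Punjab Kings"],
    "team2": ["Kings XI Punjab", "Delhi Capitals"],
    "winner": ["Delhi Daredevils", "Delhi Capitals"],
    "toss_winner": ["Kings XI Punjab", "Punjab Kings"],
})

# Apply the same mapping to every team column in one step
toy[team_columns] = toy[team_columns].replace(team_name_mapping)

# Any old name still present would point to a column that was missed
leftover = set(toy[team_columns].to_numpy().ravel()) & set(team_name_mapping)
print("Old names still present:", leftover)
```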

### Converting Data Types
Proper data types improve analysis efficiency and enable time-series analysis.

In [None]:
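df['date'] = pd.to_datetime(df['date'], errors='coerce')
# Some season labels may not be plain years (e.g. '2007/08'); those become NaN here
df['season'] = pd.to_numeric(df['season'], errors='coerce')
# Fall back to the match year for unparsed labels instead of collapsing them all to 0
df['season'] = df['season'].fillna(df['date'].dt.year).fillna(0).astype(int)
df.info()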
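Season labels that are not plain years need care during conversion, because `errors='coerce'` silently turns them into missing values. The sketch below assumes labels of the form `'2007/08'` may appear (an assumption about the raw file) and falls back to the match year for those rows; the toy rows are made up.

```python
import pandas as pd

# Made-up rows; the 'season' strings mimic split labels such as '2007/08'
toy = pd.DataFrame({
    "season": ["2007/08", "2009", "2020/21", "2024"],
    "date": ["2008-04-18", "2009-04-18", "2020-09-19", "2024-03-22"],
})
toy["date"] = pd.to_datetime(toy["date"], errors="coerce")

# Plain years parse directly; split labels become NaN
numeric = pd.to_numeric(toy["season"], errors="coerce")

# Use the calendar year of the match wherever the label did not parse
toy["season"] = numeric.fillna(toy["date"].dt.year).astype(int)
print(toy["season"].tolist())
```

Filling unparsed seasons with a placeholder such as 0 and filtering them out later would drop whole seasons from every per-season chart, which is why the fallback is worth considering.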

### Handling Missing or Null Values
Missing data can skew the analysis results or cause errors in calculations.

In [None]:
# Most values in the method column are null:
# D/L applies only to rain-affected matches,
# so every other match is labelled 'regular'.
df['method'] = df['method'].fillna('regular')

In [None]:
df.isnull().sum()

In [None]:
df[df['city'].isnull()]

In [None]:
# The rows with a missing city were played at a venue in Dubai
df['city'] = df['city'].fillna('Dubai')

In [None]:
df.isnull().sum()

In [None]:
df[df['result_margin'].isnull()]

In [None]:
# These matches were ties, so there is no winning margin; use 0
df['result_margin'] = df['result_margin'].fillna(0)

In [None]:
df.isnull().sum()

In [None]:
df[df['winner'].isnull()]

In [None]:
# player_of_match, winner and result are missing for the rows shown above, so drop them.
# Selecting by condition avoids hard-coding row indices.
df.drop(df[df['winner'].isnull()].index, inplace=True)

In [None]:
df.isnull().sum()

## Cleaning deliveries.csv

In [None]:
pf=pd.read_csv('/kaggle/input/ipl-complete-dataset-20082020/deliveries.csv')
pf.head()

In [None]:
pf.shape

In [None]:
pf.info()

In [None]:
m=pf.isnull().sum().sort_values(ascending=False)
m

In [None]:
# Plot only the columns that have missing values
m[m!=0].plot(kind='barh')

In [None]:
#check for duplicated rows
pf.duplicated().sum()

In [None]:
# Check unique team names to identify inconsistencies
print("Unique bowling_team values:", pf['bowling_team'].unique())
print("Unique batting_team values:", pf['batting_team'].unique())

In [None]:
# Creating a mapping for inconsistent team names
team_name_mapping = {
    'Rising Pune Supergiants': 'Rising Pune Supergiant',
    'Delhi Daredevils': 'Delhi Capitals',     
    'Deccan Chargers': 'Sunrisers Hyderabad',
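    'Kings XI Punjab': 'Punjab Kings',
    'Royal Challengers Bengaluru': 'Royal Challengers Bangalore',  # same rename as in the matches mapping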
}

pf['bowling_team']=pf['bowling_team'].replace(team_name_mapping)
pf['batting_team'] = pf['batting_team'].replace(team_name_mapping)

print("\nUnique bowling_team values after standardization:", pf['bowling_team'].unique())
print("Unique batting_team values after standardization:", pf['batting_team'].unique())


In [None]:
# as these columns are only populated for specific events
columns_to_fill = ['dismissal_kind', 'player_dismissed', 'fielder', 'extras_type']
pf[columns_to_fill] = pf[columns_to_fill].fillna('NA')

In [None]:
pf.isnull().sum()
#no missing data

# Exploratory Analysis and Visualization

In [None]:
df.describe()

In [None]:
pf.describe()

## 1. Total Matches Played

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Each row of the matches table (df) is one match; pf holds deliveries, not matches
total_matches = len(df)

fig, ax = plt.subplots(figsize=(3, 3))
ax.text(0.5, 0.5, f"{total_matches}\nMatches Played", fontsize=20, ha='center', va='center', color="#006400")
ax.axis('off')
plt.show()

## 2. Matches Per Season

In [None]:
df = df[df['season'] != 0]
matches_per_season = df.groupby('season')['id'].count()
matches_per_season.plot(kind='bar', title='Matches per Season', xlabel='Season', ylabel='Number of Matches', color='green')
plt.show()


## 3. Total teams

In [None]:
# Combine both team columns before taking the unique values
teams = pd.concat([df['team1'], df['team2']]).unique()
print("Total unique teams:", len(teams))

## 4. Most Wins

In [None]:
most_wins = df['winner'].value_counts()
most_wins.plot(kind='bar', title='Most Wins', xlabel='Teams', ylabel='Wins', color='orange')
plt.show()

## 5. Top Venues

In [None]:
top_venues = df['venue'].value_counts().head(10)
top_venues.plot(kind='barh', title='Top 10 Venues', xlabel='Number of Matches', color='green')
plt.show()


## 6. Team-wise Performance Per Season

In [None]:
wins = df.groupby(['season', 'winner']).size().unstack().fillna(0)
wins.plot(kind='bar', stacked=True, figsize=(14,6), colormap='tab20')
plt.title('Team-wise Performance Per Season')
plt.xlabel('Season')
plt.ylabel('Number of Wins')
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
plt.tight_layout()
plt.show()


## 7. Toss Winner vs Match Winner

In [None]:
toss_match = df['toss_winner'] == df['winner']
data = toss_match.value_counts().rename({True: 'Yes', False: 'No'}).to_frame(name='Count')

sns.heatmap(data, annot=True, fmt='d', cmap='YlGnBu')
plt.title('Did Toss Winner Also Win the Match?')
plt.xlabel('Result')
plt.ylabel('')
plt.show()
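The Yes/No counts are easier to interpret as a single percentage. A minimal sketch on made-up rows (the team labels are placeholders):

```python
import pandas as pd

# Made-up matches for illustration
toy = pd.DataFrame({
    "toss_winner": ["A", "B", "A", "C", "B"],
    "winner":      ["A", "A", "A", "B", "B"],
})

# True where the toss winner also won the match
toss_and_match = toy["toss_winner"] == toy["winner"]

# The mean of a boolean Series is the fraction of True values
share = toss_and_match.mean() * 100
print(f"Toss winner also won the match in {share:.1f}% of matches")
```

A share close to 50% would suggest the toss alone has little bearing on the result.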




## 8. Most Wins by Teams at Each Venue

In [None]:
venue_team_wins = df.groupby(['venue', 'winner']).size().unstack(fill_value=0)

plt.figure(figsize=(20, 15))
sns.heatmap(venue_team_wins, cmap='coolwarm', linewidths=0.5)
plt.title('Team Dominance at Different Venues')
plt.xlabel('Teams')
plt.ylabel('Venues')
plt.tight_layout()
plt.show()


## 9. Top Scorers

In [None]:
top_scorers = pf.groupby('batter')['batsman_runs'].sum().sort_values(ascending=False).head(10)

plt.figure(figsize=(10, 6))
sns.barplot(x=top_scorers.values, y=top_scorers.index, palette='magma')
plt.title('Top 10 Run-Scorers in IPL')
plt.xlabel('Total Runs')
plt.ylabel('Batsman')
plt.show()


## 10. Player of the Match

In [None]:
p = df['player_of_match'].value_counts().head(10)

plt.figure(figsize=(10, 6))
sns.barplot(x=p.index, y=p.values, palette='crest')
plt.title('Top 10 Player of the Match Winners')
plt.xlabel('Player')
plt.ylabel('Number of Awards')
plt.show()


# Asking and Answering Questions

## 1. Who took the most wickets?

In [None]:
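# dismissal_kind was filled with 'NA' earlier, so notna() alone would keep every delivery.
# Run outs and retirements are excluded because they are not credited to the bowler.
not_bowler_wicket = ['NA', 'run out', 'retired hurt', 'retired out', 'obstructing the field']
wickets = pf[pf['dismissal_kind'].notna() & ~pf['dismissal_kind'].isin(not_bowler_wicket)]
most_wickets = wickets['bowler'].value_counts().head(10)

plt.figure(figsize=(10, 6))
sns.barplot(x=most_wickets.values, y=most_wickets.index, palette='coolwarm')
plt.title('Top 10 Bowlers with Most Wickets')
plt.xlabel('Number of Wickets')
plt.ylabel('Bowler')
plt.show()


> The bowler at the top of the chart has taken the most wickets.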
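Counting wickets depends on which deliveries are treated as dismissals credited to the bowler. The sketch below shows the filter on a toy frame, assuming non-dismissal rows carry the placeholder `'NA'` as in the cleaning step; the exclusion set is an assumption based on cricket convention, not something taken from the dataset documentation.

```python
import pandas as pd

# Dismissal kinds that are conventionally not credited to the bowler (assumed list)
NOT_BOWLER_WICKETS = {"run out", "retired hurt", "retired out", "obstructing the field"}

# Made-up deliveries for illustration
toy = pd.DataFrame({
    "bowler": ["X", "X", "Y", "Y", "Y", "X"],
    "dismissal_kind": ["caught", "NA", "run out", "bowled", "lbw", "NA"],
})

# Keep real dismissals only, then drop the kinds not credited to the bowler
credited = toy[
    (toy["dismissal_kind"] != "NA")
    & ~toy["dismissal_kind"].isin(NOT_BOWLER_WICKETS)
]
wicket_counts = credited["bowler"].value_counts()
print(wicket_counts)
```

Without the `'NA'` filter, every delivery would count as a wicket and the ranking would simply reflect who bowled the most balls.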

## 2. Who hit the most sixes?

In [None]:
sixes = pf[pf['batsman_runs'] == 6]
most_sixes = sixes['batter'].value_counts().head(10)

plt.figure(figsize=(10, 6))
sns.barplot(x=most_sixes.values, y=most_sixes.index, palette='YlOrBr')
plt.title('Top 10 Batsmen with Most Sixes')
plt.xlabel('Number of Sixes')
plt.ylabel('Batsman')
plt.show()


> We can see that CH Gayle hit the most sixes.

## 3. Does playing on the home ground provide an advantage?

In [None]:
team_home_venues = {
    'Chennai Super Kings': 'MA Chidambaram Stadium, Chepauk',
    'Mumbai Indians': 'Wankhede Stadium',
    'Royal Challengers Bangalore': 'M Chinnaswamy Stadium',
    'Kolkata Knight Riders': 'Eden Gardens',
    'Delhi Capitals': 'Feroz Shah Kotla',
    'Sunrisers Hyderabad': 'Rajiv Gandhi International Stadium, Uppal',
    'Rajasthan Royals': 'Sawai Mansingh Stadium',
    'Punjab Kings': 'Punjab Cricket Association Stadium, Mohali',  # 'Kings XI Punjab' was renamed during cleaning
    'Deccan Chargers': 'Rajiv Gandhi International Stadium, Uppal',
    'Pune Warriors': 'Subrata Roy Sahara Stadium',
    'Gujarat Lions': 'Saurashtra Cricket Association Stadium',
    'Rising Pune Supergiant': 'Maharashtra Cricket Association Stadium',
    'Lucknow Super Giants': 'BRSABV Ekana Cricket Stadium',
    'Gujarat Titans': 'Narendra Modi Stadium'
}

df['home_venue'] = df['team1'].map(team_home_venues)

home_matches = df[df['venue'] == df['home_venue']]
home_wins = home_matches[home_matches['winner'] == home_matches['team1']]


away_matches = home_matches.copy()
away_matches = away_matches[away_matches['winner'] == away_matches['team2']]


home_games = home_matches['team1'].value_counts()
away_games = home_matches['team2'].value_counts()


home_win_counts = home_wins['winner'].value_counts()
away_win_counts = away_matches['winner'].value_counts()


home_win_percent = (home_win_counts / home_games * 100).dropna().sort_values(ascending=False)
away_win_percent = (away_win_counts / away_games * 100).dropna().sort_values(ascending=False)


comparison_df = pd.DataFrame({
    'Home Win %': home_win_percent,
    'Away Win %': away_win_percent
}).fillna(0).sort_values('Home Win %', ascending=False)


comparison_df.head(10).plot(kind='bar', figsize=(12, 6), colormap='Set2')
plt.title('Home vs Away Win Percentage (Top Teams)')
plt.xlabel('Team')
plt.ylabel('Win Percentage')
plt.xticks(rotation=45)
plt.grid(True)
plt.show()


> In this chart, the major teams tend to win a high share of their home matches.
> Home advantage does not guarantee victory, though: overall team strength, form, and match conditions still play big roles.
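The home/away comparison can also be computed by reshaping the matches so that each team appears once per match, which lets either listed side count as the host. This is a sketch under stated assumptions: the teams, grounds, and the `home_venue` lookup below are hypothetical, and venue names are assumed to match the lookup exactly.

```python
import pandas as pd

home_venue = {"A": "Ground A", "B": "Ground B"}  # hypothetical teams and grounds

# Made-up matches for illustration
toy = pd.DataFrame({
    "team1":  ["A", "B", "A", "B"],
    "team2":  ["B", "A", "B", "A"],
    "venue":  ["Ground A", "Ground A", "Ground B", "Ground B"],
    "winner": ["A", "A", "B", "A"],
})

# One row per (match, team) so that either listed side can be the host
per_team = pd.concat([
    toy.rename(columns={"team1": "team"})[["team", "venue", "winner"]],
    toy.rename(columns={"team2": "team"})[["team", "venue", "winner"]],
], ignore_index=True)
per_team["at_home"] = per_team["venue"] == per_team["team"].map(home_venue)
per_team["won"] = per_team["winner"] == per_team["team"]

# Mean of the boolean 'won' column gives the win rate per team and location
win_pct = per_team.groupby(["team", "at_home"])["won"].mean().unstack() * 100
win_pct.columns = ["Away Win %", "Home Win %"]  # unstacked order is False, True
print(win_pct)
```

In real data, venue strings often vary in spelling between seasons, so an exact-match lookup like this one may undercount home games unless venue names are standardized first.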


## 4. How many Super Overs have been played?

In [None]:
super_overs = df[df['result'] == 'tie']

plt.figure(figsize=(6, 4))
sns.countplot(x='season', data=super_overs, palette='Set2')
plt.title('Number of Super Overs per Season')
plt.xlabel('Season')
plt.ylabel('Super Overs')
plt.xticks(rotation=45)
plt.show()

print("Total Super Over Matches:", super_overs.shape[0])


## 5. What is the average number of runs scored per over?

In [None]:
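# Sum the runs in each over of each innings, then average those totals by over number.
# (Averaging total_runs directly would give runs per delivery, not per over.)
# Assumes deliveries.csv has 'match_id' and 'inning' columns.
over_totals = pf.groupby(['match_id', 'inning', 'over'])['total_runs'].sum()
runs_per_over = over_totals.groupby(level='over').mean()

plt.figure(figsize=(10, 6))
sns.lineplot(x=runs_per_over.index, y=runs_per_over.values, marker='o', color='purple')
plt.title('Average Runs Scored Per Over')
plt.xlabel('Over Number')
plt.ylabel('Average Runs')
plt.xticks(runs_per_over.index)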
plt.grid(True)
plt.show()
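The difference between averaging per delivery and averaging per over is easy to see on a tiny example. The sketch below uses made-up deliveries and assumes `match_id` and `inning` columns identify an innings.

```python
import pandas as pd

# Made-up deliveries: two innings, two overs each, three balls per over
toy = pd.DataFrame({
    "match_id":   [1] * 6 + [2] * 6,
    "inning":     [1] * 12,
    "over":       [0, 0, 0, 1, 1, 1] * 2,
    "total_runs": [1, 0, 4, 6, 0, 0, 0, 2, 1, 4, 4, 2],
})

# Mean of total_runs grouped by over is runs per delivery
per_ball = toy.groupby("over")["total_runs"].mean()

# Summing within each innings first gives runs per over
per_over = (
    toy.groupby(["match_id", "inning", "over"])["total_runs"].sum()
       .groupby(level="over").mean()
)
print(per_ball)
print(per_over)
```

Here the per-over averages are three times the per-delivery ones only because every toy over has exactly three balls; with wides and no-balls in real data the ratio varies from over to over.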


### 📌 Inferences and Conclusion  

After conducting Exploratory Data Analysis (EDA) on the IPL dataset (2008–2024), the following key findings emerged:

---

#### 🔢 General Statistics
- The dataset covers **260,920 deliveries** across **17 seasons (2008–2024)** and involves **14 different teams**.  
- The **number of matches per season** has steadily grown, reflecting the tournament’s rising popularity and expansion.  

---

#### 🏆 Performance Highlights
- **Mumbai Indians (MI)** stand out as the **most successful team**, with the highest number of wins.  
- Venues such as **Wankhede Stadium** have hosted the **largest share of matches**, cementing their place as iconic IPL grounds.  
- **AB de Villiers** has earned the **most Player of the Match awards**, underlining his game-changing ability.  
- **Virat Kohli** ranks as the **leading run-scorer** in IPL history.  
- **Chris Gayle** holds the record for the **most sixes**, showcasing the explosive nature of T20 batting.  

---

#### 📍 Venue-based Insights
- Teams like **Chennai Super Kings (CSK)** and **Mumbai Indians (MI)** have excelled at their **home venues**, suggesting a strong **home-ground advantage**.  

---

#### 🧠 Key Questions Explored
- **Does home-ground advantage matter?**  
  ✅ Yes. Teams generally perform better at home, likely due to pitch familiarity and crowd support.  

- **How many super overs have occurred?**  
  ➡️ A total of **9 super overs**, highlighting the competitiveness of close matches.  

- **What is the average scoring rate per over?**  
  ➡️ Run rates differ across match phases:  
    - **Powerplay (1–6 overs):** Fast but calculated  
    - **Middle overs (7–15):** Moderate pace  
    - **Death overs (16–20):** Peak scoring with aggressive hitting  
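The three phases listed above can be derived from per-over averages with `pd.cut`. This is a sketch on invented numbers, assuming overs are numbered 1 to 20; if the data numbers overs from 0, the bin edges would need to shift by one.

```python
import pandas as pd

# Made-up per-over averages, indexed by 1-based over number (assumption)
runs_by_over = pd.Series(
    [6, 7, 8, 8, 8, 8, 7, 7, 7, 7, 7, 8, 8, 8, 8, 9, 10, 10, 11, 12],
    index=range(1, 21),
)

# Bins are right-inclusive: (0, 6], (6, 15], (15, 20]
phases = pd.cut(
    runs_by_over.index.to_numpy(),
    bins=[0, 6, 15, 20],
    labels=["Powerplay", "Middle", "Death"],
)

# Average the per-over values within each phase
phase_avg = runs_by_over.groupby(phases, observed=True).mean()
print(phase_avg)
```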

---

#### 📈 Trends Over Seasons
- Team performance has been **cyclical**, with some franchises dominating for a few years before declining.  
- Winning the **toss** does not ensure victory, but it often shapes **strategic choices** such as batting or bowling first.  

---

✅ Overall, the IPL has evolved into one of the **most competitive and dynamic cricket leagues globally**, with iconic players, teams, and venues shaping its legacy.  


### 📘 Things I Learned  

1. Gained practical experience in **cleaning and preprocessing real-world cricket data** using Pandas.  
2. Learned how to **group, aggregate, and merge datasets** to derive team-level and player-level insights.  
3. Enhanced skills in **data visualization** with Matplotlib and Seaborn.  
4. Created **informative plots** (bar charts, heatmaps, etc.) to reveal patterns and trends.  
5. Explored **team performance trends** across seasons and venues.  
6. Analyzed how **toss outcomes and home-ground advantage** influence match results.  
7. Developed the ability to **interpret numerical data** and translate it into actionable insights.  
8. Improved analytical thinking by learning to **frame and answer data-driven questions** effectively.  


### 🚀 Future Work  

This analysis can be further extended in several ways:  

1. **Interactive Visualizations**  
   - Use tools like **Plotly** or **Tableau** to create dynamic dashboards for better exploration and storytelling.  

2. **Predictive Modeling**  
   - Build **machine learning models** to forecast:  
     - Match outcomes  
     - Top performers  
     - Player form trends across multiple seasons  

3. **Advanced Insights**  
   - Develop a **team-wise performance dashboard** for comparative analysis.  
   - Perform deeper **phase-wise breakdowns** (Powerplay, Middle overs, Death overs) to uncover strategic patterns.  

✅ These enhancements would provide richer insights for **fans, analysts, and team strategists**, making the analysis both informative and actionable.  


# References
**Dataset source:**
- IPL Complete Dataset (2008–2024) on Kaggle

**Python libraries used:**
- Pandas (data manipulation)
- NumPy (numerical operations)
- Matplotlib & Seaborn (data visualization)
- Pandas Profiling (automated EDA reports)

**Documentation & tutorials:**
- Pandas Documentation
- Seaborn Documentation
- Matplotlib Documentation

**Kaggle notebook used for reference and inspiration:**
- https://www.kaggle.com/code/prasadposture121/exploratory-data-analysis-of-ipl-matches/notebook