# IPL Data Analysis - Learning Pandas with Cricket Data

This notebook analyzes IPL (Indian Premier League) cricket data with pandas. After loading, cleaning, and exploring the matches and deliveries datasets (EDA), we work through 20 analytical questions.

**Learning Objectives:**
- Master pandas data manipulation techniques
- Perform data cleaning and preprocessing
- Conduct exploratory data analysis
- Answer analytical questions using groupby, merge, and aggregation functions
- Create visualizations to communicate insights

## Section 1: Import Required Libraries

In [None]:
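# Import Required Libraries
import pandas as pd
import numpy as np
from IPython.display import display  # Built into Jupyter; the explicit import also lets the cells run as a script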

# Display settings for pandas
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

print("Libraries imported successfully!")

## Section 2: Load the Datasets

In [None]:
# Load the matches and deliveries datasets
matches = pd.read_csv('../datasets/matches.csv')
deliveries = pd.read_csv('../datasets/deliveries.csv')

print("Datasets loaded successfully!")
print(f"Matches dataset shape: {matches.shape}")
print(f"Deliveries dataset shape: {deliveries.shape}")

## Section 3: Initial Data Exploration

Let's explore the structure and content of both datasets to understand what we're working with.

In [None]:
# Explore the matches dataset
print("=" * 80)
print("MATCHES DATASET - First 5 rows:")
print("=" * 80)
display(matches.head())

print("\n" + "=" * 80)
print("MATCHES DATASET - Information:")
print("=" * 80)
print(f"Shape: {matches.shape}")
print(f"\nColumn names: {list(matches.columns)}")
print(f"\nData types:\n{matches.dtypes}")
print(f"\nBasic Info:")
matches.info()

In [None]:
# Explore the deliveries dataset
print("=" * 80)
print("DELIVERIES DATASET - First 5 rows:")
print("=" * 80)
display(deliveries.head())

print("\n" + "=" * 80)
print("DELIVERIES DATASET - Information:")
print("=" * 80)
print(f"Shape: {deliveries.shape}")
print(f"\nColumn names: {list(deliveries.columns)}")
print(f"\nData types:\n{deliveries.dtypes}")
print(f"\nBasic Info:")
deliveries.info()
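
Several later questions join `deliveries` to `matches` on `match_id` / `id`, so it can help to confirm up front that the keys line up. The sketch below shows one way to do that on toy stand-in frames; `matches_demo` and `deliveries_demo` and their values are made up for illustration and are not taken from the real files.

```python
import pandas as pd

# Toy stand-ins for the real tables (hypothetical values)
matches_demo = pd.DataFrame({'id': [1, 2, 3], 'season': [2016, 2016, 2017]})
deliveries_demo = pd.DataFrame({'match_id': [1, 1, 2, 4], 'total_runs': [1, 4, 0, 6]})

# The key in matches should be unique: one row per match
ids_unique = matches_demo['id'].is_unique

# Delivery rows whose match_id has no counterpart in matches
orphans = deliveries_demo[~deliveries_demo['match_id'].isin(matches_demo['id'])]

print(f"Match ids unique: {ids_unique}")
print(f"Orphan delivery rows: {len(orphans)}")
```

The same two checks can be run on the real `matches` and `deliveries` frames; any orphan rows would end up with a missing `season` after a left merge.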

## Section 4: Data Cleaning and Preprocessing

Before analysis, we need to clean the data and handle any issues.

In [None]:
# Check for missing values in matches dataset
print("Missing values in MATCHES dataset:")
print("=" * 80)
missing_matches = matches.isnull().sum()
print(missing_matches[missing_matches > 0])
print(f"\nTotal missing values: {matches.isnull().sum().sum()}")

# Calculate percentage of missing values
print("\nPercentage of missing values:")
print((matches.isnull().sum() / len(matches) * 100).round(2))

In [None]:
# Check for missing values in deliveries dataset
print("Missing values in DELIVERIES dataset:")
print("=" * 80)
missing_deliveries = deliveries.isnull().sum()
print(missing_deliveries[missing_deliveries > 0])
print(f"\nTotal missing values: {deliveries.isnull().sum().sum()}")

# Calculate percentage of missing values
print("\nPercentage of missing values:")
print((deliveries.isnull().sum() / len(deliveries) * 100).round(2))
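
The count and percentage checks above are repeated for both datasets. As a possible refactor, they could be wrapped in a small helper that returns one table per dataset. `missing_summary` is a hypothetical name introduced here, and the demo frame is made up.

```python
import pandas as pd

def missing_summary(df):
    """Return count and percentage of missing values, only for columns that have gaps."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        'missing': counts,
        'missing_pct': (counts / len(df) * 100).round(2),
    })
    return summary[summary['missing'] > 0].sort_values('missing', ascending=False)

# Made-up demo data: 'winner' has two gaps, 'city' has one, 'venue' has none
demo = pd.DataFrame({
    'city': ['Mumbai', None, 'Delhi', 'Pune'],
    'winner': ['A', 'B', None, None],
    'venue': ['V1', 'V2', 'V3', 'V4'],
})
result = missing_summary(demo)
print(result)
```

Calling `missing_summary(matches)` and `missing_summary(deliveries)` would then replace the two cells above with one line each.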

In [None]:
# Handle missing values
# For matches dataset - fill missing values appropriately
matches['city'] = matches['city'].fillna('Unknown')
matches['winner'] = matches['winner'].fillna('No Result')
matches['player_of_match'] = matches['player_of_match'].fillna('Not Awarded')

# For deliveries dataset - player_dismissed is null when no wicket falls (this is expected)
# No action needed for deliveries dataset null values as they are meaningful

print("Missing values handled!")

In [None]:
# Convert date column to datetime format
matches['date'] = pd.to_datetime(matches['date'], format='%Y-%m-%d')

# Extract year from date for additional analysis
matches['year'] = matches['date'].dt.year

# Create derived columns for better analysis
matches['win_margin_type'] = matches.apply(
    lambda row: 'Runs' if pd.notna(row['win_by_runs']) and row['win_by_runs'] > 0 
    else ('Wickets' if pd.notna(row['win_by_wickets']) and row['win_by_wickets'] > 0 
    else 'No Result'), axis=1
)

print("Date columns converted and derived columns created!")
print("\nSample of processed data:")
display(matches[['date', 'year', 'season', 'win_margin_type']].head())
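
The `win_margin_type` column above is built with a row-wise `apply`. A vectorized alternative is `np.select`, which evaluates the conditions on whole columns at once. The sketch below uses a made-up four-row frame and keeps the same three labels.

```python
import numpy as np
import pandas as pd

# Made-up rows: a win by runs, a win by wickets, a zero-margin match, and missing margins
demo = pd.DataFrame({
    'win_by_runs':    [35, 0, 0, np.nan],
    'win_by_wickets': [0, 7, 0, np.nan],
})

# Conditions are checked in order; comparisons with NaN evaluate to False
conditions = [demo['win_by_runs'] > 0, demo['win_by_wickets'] > 0]
demo['win_margin_type'] = np.select(conditions, ['Runs', 'Wickets'], default='No Result')
print(demo)
```

Note that, as in the original logic, a tied match (both margins zero) also ends up labelled `'No Result'`.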

## Section 5: Exploratory Data Analysis (EDA)

Let's perform statistical analysis and create initial visualizations to understand our data better.

In [None]:
# Statistical summary of matches dataset
print("Statistical Summary - MATCHES Dataset:")
print("=" * 80)
display(matches.describe(include='all'))

In [None]:
# Check unique values in key categorical columns
print("Unique Values in Key Columns:")
print("=" * 80)
print(f"Total Seasons: {matches['season'].nunique()}")
print(f"Seasons: {sorted(matches['season'].unique())}")
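# Combine team1 and team2 so that no team is missed if it only appears in one column
all_teams = pd.unique(matches[['team1', 'team2']].values.ravel())
print(f"\nTotal Teams: {len(all_teams)}")
print(f"Teams: {sorted(all_teams)}")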
print(f"\nTotal Venues: {matches['venue'].nunique()}")
print(f"\nTotal Cities: {matches['city'].nunique()}")

In [None]:
# Statistical summary of deliveries dataset
print("Statistical Summary - DELIVERIES Dataset:")
print("=" * 80)
display(deliveries.describe())

print("\nKey Statistics:")
print(f"Total Deliveries: {len(deliveries)}")
print(f"Total Matches: {deliveries['match_id'].nunique()}")
print(f"Total Batsmen: {deliveries['batsman'].nunique()}")
print(f"Total Bowlers: {deliveries['bowler'].nunique()}")
print(f"Total Runs Scored: {deliveries['total_runs'].sum()}")

---
# ANALYTICAL QUESTIONS

Now let's dive into specific questions to analyze the IPL data in depth.

---

## Question 1: Total Matches Played Per Season

**Learning Focus:** groupby(), count(), sorting, bar chart visualization

In [None]:
# Count total matches per season
matches_per_season = matches.groupby('season').size().reset_index(name='total_matches')
matches_per_season = matches_per_season.sort_values('season')

print("Matches Played Per Season:")
print("=" * 80)
display(matches_per_season)

print(f"\nAverage matches per season: {matches_per_season['total_matches'].mean():.2f}")
print(f"Maximum matches in a season: {matches_per_season['total_matches'].max()}")
print(f"Minimum matches in a season: {matches_per_season['total_matches'].min()}")

## Question 2: Most Successful Teams by Wins

**Learning Focus:** value_counts(), sorting, horizontal bar chart

In [None]:
# Count total wins for each team (excluding 'No Result')
team_wins = matches[matches['winner'] != 'No Result']['winner'].value_counts().reset_index()
team_wins.columns = ['team', 'total_wins']

print("Most Successful Teams by Total Wins:")
print("=" * 80)
display(team_wins)

print(f"\nMost successful team: {team_wins.iloc[0]['team']} with {team_wins.iloc[0]['total_wins']} wins")

## Question 3: Toss Decision Analysis (Bat vs Field)

**Learning Focus:** value_counts(), percentage calculation, pie chart

In [None]:
# Analyze toss decisions across all seasons
toss_decision_overall = matches['toss_decision'].value_counts()
toss_decision_pct = (toss_decision_overall / len(matches) * 100).round(2)

print("Toss Decision Analysis:")
print("=" * 80)
print(f"\n{toss_decision_overall}")
print(f"\nPercentages:\n{toss_decision_pct}")

# Toss decision by season
toss_by_season = matches.groupby(['season', 'toss_decision']).size().unstack(fill_value=0)

print("\nToss Decisions by Season:")
display(toss_by_season)
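
The percentage step can also be expressed with `value_counts(normalize=True)`, and the per-season table can be turned into row percentages with `pd.crosstab(..., normalize='index')`. This is a small sketch on made-up data, not the notebook's actual numbers.

```python
import pandas as pd

# Made-up toss decisions for two seasons
demo = pd.DataFrame({
    'season': [2016, 2016, 2016, 2017, 2017],
    'toss_decision': ['field', 'field', 'bat', 'bat', 'field'],
})

# Overall share of each decision, in percent
overall_pct = (demo['toss_decision'].value_counts(normalize=True) * 100).round(2)

# Share of each decision within a season (each row sums to 100)
by_season_pct = (
    pd.crosstab(demo['season'], demo['toss_decision'], normalize='index') * 100
).round(2)

print(overall_pct)
print(by_season_pct)
```

Row percentages make seasons with different numbers of matches directly comparable, which raw counts do not.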

## Question 4: Match Results by Decision (Bat First vs Field First)

**Learning Focus:** groupby(), boolean comparison, win rate calculation

In [None]:
# Analyze win rates based on toss decision
# Compare the toss winner with the match winner, excluding matches without a result
match_analysis = matches[matches['winner'] != 'No Result'].copy()

# Check if toss winner won the match
match_analysis['toss_winner_is_match_winner'] = (
    match_analysis['toss_winner'] == match_analysis['winner']
)

# Win rate by toss decision
win_rate_by_decision = match_analysis.groupby('toss_decision').agg({
    'toss_winner_is_match_winner': ['sum', 'count', 'mean']
}).round(3)

win_rate_by_decision.columns = ['wins_after_winning_toss', 'total_matches', 'win_rate']
win_rate_by_decision['win_rate_pct'] = (win_rate_by_decision['win_rate'] * 100).round(2)

print("Win Rate Analysis - Toss Winners by Decision:")
print("=" * 80)
display(win_rate_by_decision)

# Win rate for the side batting first vs. the side fielding first.
# The side batting first is derived from the toss: the toss winner if it chose
# to bat, otherwise its opponent. Matches without a result are already excluded.
toss_loser = match_analysis['team2'].where(
    match_analysis['toss_winner'] == match_analysis['team1'], match_analysis['team1']
)
batting_first = match_analysis['toss_winner'].where(
    match_analysis['toss_decision'] == 'bat', toss_loser
)
bat_first_won = batting_first == match_analysis['winner']

bat_first_win_rate = bat_first_won.mean() * 100
field_first_win_rate = (~bat_first_won).mean() * 100

print("\n\nWin Rate by Batting/Fielding First:")
print("=" * 80)
print(f"Win rate when batting first: {bat_first_win_rate:.2f}%")
print(f"Win rate when fielding first: {field_first_win_rate:.2f}%")

## Question 5: Top Venues by Number of Matches

**Learning Focus:** value_counts(), slicing top N, horizontal bar chart

In [None]:
# Find top 10 venues by number of matches
top_venues = matches['venue'].value_counts().head(10).reset_index()
top_venues.columns = ['venue', 'matches_played']

print("Top 10 Venues by Number of Matches:")
print("=" * 80)
display(top_venues)

print(f"\nMost popular venue: {top_venues.iloc[0]['venue']} ({top_venues.iloc[0]['matches_played']} matches)")

## Question 6: Player of the Match Awards - Top Performers

**Learning Focus:** value_counts(), filtering, visualization

In [None]:
# Find top 10 players with most Player of the Match awards
# Exclude 'Not Awarded'
top_players = matches[matches['player_of_match'] != 'Not Awarded']['player_of_match'].value_counts().head(10).reset_index()
top_players.columns = ['player', 'awards']

print("Top 10 Players with Most 'Player of the Match' Awards:")
print("=" * 80)
display(top_players)

print(f"\nMost awarded player: {top_players.iloc[0]['player']} with {top_players.iloc[0]['awards']} awards")

## Question 7: Winning Margin Analysis (Runs vs Wickets)

**Learning Focus:** filtering, statistical analysis, histogram visualization

In [None]:
# Analyze winning margins
win_by_runs = matches[matches['win_by_runs'] > 0]['win_by_runs']
win_by_wickets = matches[matches['win_by_wickets'] > 0]['win_by_wickets']

print("Winning Margin Analysis:")
print("=" * 80)
print("\nWins by Runs - Statistics:")
print(win_by_runs.describe())

print("\n\nWins by Wickets - Statistics:")
print(win_by_wickets.describe())

# Additional insights
print("\n" + "=" * 80)
print(f"Total matches won by runs: {len(win_by_runs)}")
print(f"Total matches won by wickets: {len(win_by_wickets)}")
print(f"Average winning margin (runs): {win_by_runs.mean():.2f}")
print(f"Average winning margin (wickets): {win_by_wickets.mean():.2f}")
print(f"Maximum winning margin (runs): {win_by_runs.max()}")
print(f"Maximum winning margin (wickets): {win_by_wickets.max()}")

## Question 8: Teams with Most Toss Wins

**Learning Focus:** value_counts(), merge(), correlation analysis

In [None]:
# Count toss wins for each team
toss_wins = matches['toss_winner'].value_counts().reset_index()
toss_wins.columns = ['team', 'toss_wins']

# Count match wins for each team
match_wins = matches[matches['winner'] != 'No Result']['winner'].value_counts().reset_index()
match_wins.columns = ['team', 'match_wins']

# Merge the two dataframes
team_performance = pd.merge(toss_wins, match_wins, on='team', how='outer').fillna(0)
team_performance = team_performance.sort_values('toss_wins', ascending=False)

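# Calculate toss-to-match conversion rate: the share of a team's toss wins
# in which that team also won the match
toss_and_match_wins = matches[matches['toss_winner'] == matches['winner']]['toss_winner'].value_counts()
team_performance['conversion_rate'] = (
    team_performance['team'].map(toss_and_match_wins).fillna(0)
    / team_performance['toss_wins'] * 100
).round(2)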

print("Teams with Most Toss Wins and Their Match Win Correlation:")
print("=" * 80)
display(team_performance)

# Calculate correlation
correlation = team_performance['toss_wins'].corr(team_performance['match_wins'])
print(f"\nCorrelation between toss wins and match wins: {correlation:.4f}")
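
An outer merge followed by `fillna(0)` silently hides teams that appear in only one of the two tables. Passing `indicator=True` to `pd.merge` adds a `_merge` column that makes such rows visible. The frames below are made up for illustration.

```python
import pandas as pd

# Made-up tables: team C never won a match, team D never won a toss
toss_wins_demo = pd.DataFrame({'team': ['A', 'B', 'C'], 'toss_wins': [10, 8, 2]})
match_wins_demo = pd.DataFrame({'team': ['A', 'B', 'D'], 'match_wins': [12, 7, 1]})

merged = pd.merge(toss_wins_demo, match_wins_demo, on='team', how='outer', indicator=True)

# Rows that came from only one side of the merge
only_one_side = merged[merged['_merge'] != 'both']
print(only_one_side)
```

Inspecting `only_one_side` before filling the gaps shows whether a zero in the final table is a real zero or a team-name mismatch between the two sources.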

## Question 9: Most Runs Scored in a Match

**Learning Focus:** merge operations, groupby with sum, sorting

In [None]:
# Calculate total runs per match from deliveries dataset
runs_per_match = deliveries.groupby('match_id')['total_runs'].sum().reset_index()
runs_per_match.columns = ['match_id', 'total_runs']

# Merge with matches dataset to get match details
match_runs = pd.merge(runs_per_match, matches[['id', 'season', 'team1', 'team2', 'venue', 'date']], 
                       left_on='match_id', right_on='id', how='left')

# Sort by total runs and get top 10
highest_scoring_matches = match_runs.sort_values('total_runs', ascending=False).head(10)

print("Top 10 Highest Scoring Matches:")
print("=" * 80)
display(highest_scoring_matches[['season', 'team1', 'team2', 'venue', 'total_runs', 'date']])

print(f"\nHighest scoring match: {highest_scoring_matches.iloc[0]['total_runs']} runs")
print(f"Teams: {highest_scoring_matches.iloc[0]['team1']} vs {highest_scoring_matches.iloc[0]['team2']}")
print(f"Venue: {highest_scoring_matches.iloc[0]['venue']}")
print(f"Season: {highest_scoring_matches.iloc[0]['season']}")

## Question 10: Total Runs Scored Per Season

**Learning Focus:** joining datasets, groupby with aggregation, line chart

In [None]:
# Join deliveries with matches to get season information
deliveries_with_season = pd.merge(
    deliveries, 
    matches[['id', 'season']], 
    left_on='match_id', 
    right_on='id', 
    how='left'
)

# Calculate total runs per season
runs_per_season = deliveries_with_season.groupby('season')['total_runs'].sum().reset_index()
runs_per_season = runs_per_season.sort_values('season')
runs_per_season.columns = ['season', 'total_runs']

print("Total Runs Scored Per Season:")
print("=" * 80)
display(runs_per_season)

# Calculate additional metrics
runs_per_season = pd.merge(runs_per_season, matches_per_season, on='season')
runs_per_season['avg_runs_per_match'] = (
    runs_per_season['total_runs'] / runs_per_season['total_matches']
).round(2)

print("\n\nRuns Per Season with Average Runs Per Match:")
display(runs_per_season)

# idxmax() returns an index label, so use label-based .loc rather than positional .iloc
print(f"\nHighest scoring season: {runs_per_season.loc[runs_per_season['total_runs'].idxmax(), 'season']} " +
      f"({runs_per_season['total_runs'].max()} runs)")
print(f"Season with highest avg runs/match: {runs_per_season.loc[runs_per_season['avg_runs_per_match'].idxmax(), 'season']} " +
      f"({runs_per_season['avg_runs_per_match'].max()} runs/match)")
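
A note on looking up the row with the maximum value: `idxmax()` returns an index label, not a position. Combining it with `.iloc` only works while the index happens to be 0, 1, 2, and so on. The sketch below uses a made-up frame with a deliberately non-default index to show the label-based lookup with `.loc`.

```python
import pandas as pd

demo = pd.DataFrame(
    {'season': [2016, 2017, 2018], 'total_runs': [18000, 19500, 17000]},
    index=[10, 20, 30],   # deliberately not 0..n-1
)

best_label = demo['total_runs'].idxmax()       # index label of the largest value
best_season = demo.loc[best_label, 'season']   # label-based lookup

print(best_label, best_season)
```

With this index, `demo.iloc[best_label]` would raise an `IndexError`, because position 20 does not exist in a three-row frame.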

## Question 11: Top Run Scorers Across All Seasons

**Learning Focus:** groupby with sum, sorting, horizontal bar chart

In [None]:
# Calculate total runs scored by each batsman
top_run_scorers = deliveries.groupby('batsman')['batsman_runs'].sum().reset_index()
top_run_scorers.columns = ['batsman', 'total_runs']
top_run_scorers = top_run_scorers.sort_values('total_runs', ascending=False).head(15)

print("Top 15 Run Scorers Across All Seasons:")
print("=" * 80)
display(top_run_scorers)

print(f"\nHighest run scorer: {top_run_scorers.iloc[0]['batsman']} with {top_run_scorers.iloc[0]['total_runs']} runs")

## Question 12: Top Wicket Takers Across All Seasons

**Learning Focus:** filtering null values, groupby with count, visualization

In [None]:
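# Filter deliveries where a wicket was credited to the bowler.
# Run outs, retired hurt and obstructing the field are not bowler wickets
# (labels assumed to match the values in the dismissal_kind column).
non_bowler_dismissals = ['run out', 'retired hurt', 'obstructing the field']
wickets = deliveries[
    deliveries['player_dismissed'].notna()
    & ~deliveries['dismissal_kind'].isin(non_bowler_dismissals)
]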

# Count wickets taken by each bowler
top_wicket_takers = wickets.groupby('bowler').size().reset_index(name='total_wickets')
top_wicket_takers = top_wicket_takers.sort_values('total_wickets', ascending=False).head(15)

print("Top 15 Wicket Takers Across All Seasons:")
print("=" * 80)
display(top_wicket_takers)

print(f"\nHighest wicket taker: {top_wicket_takers.iloc[0]['bowler']} with {top_wicket_takers.iloc[0]['total_wickets']} wickets")

## Question 13: Most Economical Bowlers

**Learning Focus:** multiple aggregations, calculated columns, filtering

In [None]:
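# Calculate economy rate for bowlers (runs conceded per over = runs per 6 legal balls)
# Byes and leg byes are not charged to the bowler; wides and no-balls are not legal deliveries
bowling = deliveries.assign(
    bowler_runs=deliveries['total_runs'] - deliveries['bye_runs'] - deliveries['legbye_runs'],
    legal_ball=(deliveries['wide_runs'] == 0) & (deliveries['noball_runs'] == 0)
)
bowler_stats = bowling.groupby('bowler').agg({
    'bowler_runs': 'sum',     # Runs charged to the bowler
    'legal_ball': 'sum'       # Legal balls bowled
}).reset_index()

bowler_stats.columns = ['bowler', 'runs_conceded', 'balls_bowled']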

# Filter bowlers who have bowled at least 500 balls (significant sample size)
bowler_stats = bowler_stats[bowler_stats['balls_bowled'] >= 500]

# Calculate economy rate (runs per over)
bowler_stats['economy_rate'] = (bowler_stats['runs_conceded'] / bowler_stats['balls_bowled'] * 6).round(2)

# Calculate overs bowled
bowler_stats['overs_bowled'] = (bowler_stats['balls_bowled'] / 6).round(1)

# Sort by economy rate and get top 15 most economical
most_economical = bowler_stats.sort_values('economy_rate').head(15)

print("Top 15 Most Economical Bowlers (min 500 balls):")
print("=" * 80)
display(most_economical[['bowler', 'overs_bowled', 'runs_conceded', 'economy_rate']])

print(f"\nMost economical bowler: {most_economical.iloc[0]['bowler']} " +
      f"(Economy: {most_economical.iloc[0]['economy_rate']})")
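
To make the economy-rate arithmetic concrete, here is a worked example on one made-up over. It assumes the usual scoring convention that byes and leg byes are not charged to the bowler and that wides and no-balls do not count as legal balls. The column names mirror the deliveries dataset, and the numbers are invented.

```python
import pandas as pd

# One hypothetical over by bowler X: 8 deliveries, including a wide, a leg bye and a no-ball
demo = pd.DataFrame({
    'bowler':      ['X'] * 8,
    'total_runs':  [1, 1, 0, 4, 1, 0, 1, 2],
    'wide_runs':   [0, 1, 0, 0, 0, 0, 0, 0],
    'noball_runs': [0, 0, 0, 0, 0, 0, 1, 0],
    'bye_runs':    [0] * 8,
    'legbye_runs': [0, 0, 0, 0, 1, 0, 0, 0],
})

demo['bowler_runs'] = demo['total_runs'] - demo['bye_runs'] - demo['legbye_runs']
demo['legal_ball'] = (demo['wide_runs'] == 0) & (demo['noball_runs'] == 0)

stats = demo.groupby('bowler').agg(
    runs_conceded=('bowler_runs', 'sum'),
    balls_bowled=('legal_ball', 'sum'),
)
stats['economy_rate'] = (stats['runs_conceded'] / stats['balls_bowled'] * 6).round(2)

# For comparison: all runs divided by all deliveries, ignoring the conventions above
naive_rate = demo['total_runs'].sum() / len(demo) * 6

print(stats)
print(f"Naive rate: {naive_rate}")
```

Here the bowler is charged 9 runs off 6 legal balls (economy 9.0), while the naive calculation gives 7.5, so the two approaches can rank bowlers differently.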

## Question 14: Strike Rate of Top Batsmen

**Learning Focus:** multiple aggregations with groupby, calculated metrics

In [None]:
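# Calculate strike rate for batsmen (runs per 100 balls faced)
# Wides do not count as balls faced, so they are excluded before counting
batsman_stats = deliveries[deliveries['wide_runs'] == 0].groupby('batsman').agg({
    'batsman_runs': 'sum',    # Total runs scored
    'ball': 'count'            # Total balls faced
}).reset_index()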

batsman_stats.columns = ['batsman', 'total_runs', 'balls_faced']

# Filter batsmen who have scored at least 1000 runs
batsman_stats = batsman_stats[batsman_stats['total_runs'] >= 1000]

# Calculate strike rate (runs per 100 balls)
batsman_stats['strike_rate'] = (batsman_stats['total_runs'] / batsman_stats['balls_faced'] * 100).round(2)

# Sort by strike rate
top_strike_rates = batsman_stats.sort_values('strike_rate', ascending=False).head(15)

print("Top 15 Batsmen by Strike Rate (min 1000 runs):")
print("=" * 80)
display(top_strike_rates)

# Also show top 15 by total runs
top_15_by_runs = batsman_stats.nlargest(15, 'total_runs')
print("\n\nTop 15 Batsmen by Total Runs:")
print("=" * 80)
display(top_15_by_runs)

print(f"\nHighest strike rate: {top_strike_rates.iloc[0]['batsman']} " +
      f"(SR: {top_strike_rates.iloc[0]['strike_rate']})")

## Question 15: Boundaries Analysis (4s and 6s) Per Season

**Learning Focus:** filtering multiple conditions, groupby with multiple categories, stacked bar chart

In [None]:
# Filter deliveries where batsman scored 4 or 6
boundaries = deliveries_with_season[deliveries_with_season['batsman_runs'].isin([4, 6])]

# Count 4s and 6s per season
boundaries_per_season = boundaries.groupby(['season', 'batsman_runs']).size().unstack(fill_value=0)
boundaries_per_season.columns = ['Fours', 'Sixes']
boundaries_per_season = boundaries_per_season.sort_index()

print("Boundaries (4s and 6s) Per Season:")
print("=" * 80)
display(boundaries_per_season)

# Calculate total boundaries
boundaries_per_season['Total'] = boundaries_per_season['Fours'] + boundaries_per_season['Sixes']
boundaries_per_season['Six_Percentage'] = (
    boundaries_per_season['Sixes'] / boundaries_per_season['Total'] * 100
).round(2)

print("\n\nBoundaries with Total and Six Percentage:")
display(boundaries_per_season)

# Additional analysis
print("\n\nKey Insights:")
print("=" * 80)
print(f"Season with most fours: {boundaries_per_season['Fours'].idxmax()} " +
      f"({boundaries_per_season['Fours'].max()} fours)")
print(f"Season with most sixes: {boundaries_per_season['Sixes'].idxmax()} " +
      f"({boundaries_per_season['Sixes'].max()} sixes)")
print(f"Season with highest six percentage: {boundaries_per_season['Six_Percentage'].idxmax()} " +
      f"({boundaries_per_season['Six_Percentage'].max()}%)")
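
Assigning `['Fours', 'Sixes']` by position relies on both values being present after `unstack`. A slightly more defensive variant, sketched below on made-up data, renames the columns by their labels and uses `reindex` so that a missing category appears as a zero column rather than causing a length mismatch.

```python
import pandas as pd

# Made-up boundary deliveries: no sixes were hit in 2017
demo = pd.DataFrame({
    'season': [2016, 2016, 2016, 2017, 2017],
    'batsman_runs': [4, 6, 4, 4, 4],
})

counts = (
    demo.groupby(['season', 'batsman_runs']).size()
    .unstack(fill_value=0)
    .reindex(columns=[4, 6], fill_value=0)        # guarantee both columns exist
    .rename(columns={4: 'Fours', 6: 'Sixes'})     # rename by label, not by position
)
print(counts)
```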

## Question 16: Dismissal Types Distribution

**Learning Focus:** value_counts on categorical data, pie chart, filtering nulls

In [None]:
# Analyze dismissal types (excluding no dismissals)
dismissal_types = deliveries[deliveries['dismissal_kind'].notna()]['dismissal_kind'].value_counts()

print("Dismissal Types Distribution:")
print("=" * 80)
print(dismissal_types)
print(f"\nTotal Dismissals: {dismissal_types.sum()}")

# Calculate percentages
dismissal_pct = (dismissal_types / dismissal_types.sum() * 100).round(2)
print("\nPercentages:")
print(dismissal_pct)

print(f"\nMost common dismissal: {dismissal_types.index[0]} ({dismissal_types.values[0]} times, {dismissal_pct.values[0]}%)")

## Question 17: Extras Analysis Per Team

**Learning Focus:** groupby with multiple aggregations, filtering, breakdown analysis

In [None]:
# Analyze extras conceded by each bowling team
extras_by_team = deliveries.groupby('bowling_team')['extra_runs'].sum().reset_index()
extras_by_team.columns = ['team', 'total_extras']
extras_by_team = extras_by_team.sort_values('total_extras', ascending=False)

print("Total Extras Conceded by Each Team:")
print("=" * 80)
display(extras_by_team)

# Breakdown by extra types
# Count different types of extras
wide_balls = deliveries[deliveries['wide_runs'] > 0].groupby('bowling_team').size().reset_index(name='wides')
no_balls = deliveries[deliveries['noball_runs'] > 0].groupby('bowling_team').size().reset_index(name='noballs')
byes = deliveries[deliveries['bye_runs'] > 0].groupby('bowling_team').size().reset_index(name='byes')
leg_byes = deliveries[deliveries['legbye_runs'] > 0].groupby('bowling_team').size().reset_index(name='legbyes')

# Merge all extra types
extras_breakdown = extras_by_team.copy()
for df, col in [(wide_balls, 'wides'), (no_balls, 'noballs'), (byes, 'byes'), (leg_byes, 'legbyes')]:
    extras_breakdown = pd.merge(extras_breakdown, df.rename(columns={'bowling_team': 'team'}), 
                                 on='team', how='left')

extras_breakdown = extras_breakdown.fillna(0)

print("\n\nExtras Breakdown by Type:")
display(extras_breakdown)

print(f"\nTeam conceding most extras: {extras_by_team.iloc[0]['team']} " +
      f"({int(extras_by_team.iloc[0]['total_extras'])} runs)")

## Question 18: Performance in Powerplay Overs

**Learning Focus:** filtering by range, multiple groupby operations, performance comparison

In [None]:
# Filter deliveries for powerplay overs (1-6)
powerplay = deliveries[(deliveries['over'] >= 1) & (deliveries['over'] <= 6)]

# Calculate runs scored by batting team during powerplay
powerplay_runs = powerplay.groupby('batting_team')['total_runs'].sum().reset_index()
powerplay_runs.columns = ['team', 'powerplay_runs']
powerplay_runs = powerplay_runs.sort_values('powerplay_runs', ascending=False)

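# Calculate wickets lost during powerplay
# (rename batting_team to team so the merges below can join on 'team')
powerplay_wickets = (
    powerplay[powerplay['player_dismissed'].notna()]
    .groupby('batting_team').size().reset_index(name='wickets_lost')
    .rename(columns={'batting_team': 'team'})
)

# Calculate balls faced in powerplay
powerplay_balls = (
    powerplay.groupby('batting_team').size().reset_index(name='balls_faced')
    .rename(columns={'batting_team': 'team'})
)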

# Merge all powerplay statistics
powerplay_performance = pd.merge(powerplay_runs, powerplay_wickets, on='team', how='left')
powerplay_performance = pd.merge(powerplay_performance, powerplay_balls, on='team', how='left')
powerplay_performance = powerplay_performance.fillna(0)

# Calculate run rate (runs per over)
powerplay_performance['run_rate'] = (
    powerplay_performance['powerplay_runs'] / (powerplay_performance['balls_faced'] / 6)
).round(2)

print("Powerplay Performance by Team (Overs 1-6):")
print("=" * 80)
display(powerplay_performance)

print(f"\nBest powerplay performance: {powerplay_performance.iloc[0]['team']} " +
      f"({int(powerplay_performance.iloc[0]['powerplay_runs'])} runs)")
print(f"Highest powerplay run rate: {powerplay_performance.loc[powerplay_performance['run_rate'].idxmax(), 'team']} " +
      f"({powerplay_performance['run_rate'].max()} runs/over)")
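
The three separate groupbys and two merges above can also be collapsed into a single pass with named aggregation. The sketch below runs on a made-up six-row frame. As in the cell above, `balls_faced` counts every delivery in the phase, including extras.

```python
import pandas as pd

# Made-up deliveries; the over-7 row falls outside the powerplay
demo = pd.DataFrame({
    'batting_team': ['A', 'A', 'A', 'B', 'B', 'B'],
    'over': [1, 3, 7, 2, 6, 6],
    'total_runs': [4, 1, 6, 0, 2, 1],
    'player_dismissed': [None, 'p1', None, 'p2', None, 'p3'],
})

powerplay_demo = demo[demo['over'].between(1, 6)]   # inclusive on both ends

summary = (
    powerplay_demo.groupby('batting_team')
    .agg(
        powerplay_runs=('total_runs', 'sum'),
        wickets_lost=('player_dismissed', 'count'),   # count() skips nulls
        balls_faced=('total_runs', 'size'),
    )
    .rename_axis('team')
    .reset_index()
)
summary['run_rate'] = (summary['powerplay_runs'] / (summary['balls_faced'] / 6)).round(2)
print(summary)
```

Because everything comes from one groupby, there is no key column to keep in sync between intermediate tables, and the same pattern can be reused for any other range of overs.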

## Question 19: Team Performance at Home vs Away Venues

**Learning Focus:** complex filtering, string operations, aggregation with multiple conditions

In [None]:
# Analyze home vs away performance
# Define home cities for major teams (simplified: one home city per franchise).
# The match is checked against the 'city' column, because stadium names in
# 'venue' do not always contain the city name.
team_home_cities = {
    'Mumbai Indians': 'Mumbai',
    'Chennai Super Kings': 'Chennai',
    'Kolkata Knight Riders': 'Kolkata',
    'Royal Challengers Bangalore': 'Bangalore',
    'Delhi Daredevils': 'Delhi',
    'Delhi Capitals': 'Delhi',
    'Rajasthan Royals': 'Jaipur',
    'Kings XI Punjab': 'Mohali',
    'Punjab Kings': 'Mohali',
    'Sunrisers Hyderabad': 'Hyderabad'
}

# Create a function to determine if match is at home
def is_home_match(row, team):
    if team not in team_home_cities:
        return None
    home_city = team_home_cities[team]
    return home_city.lower() in str(row['city']).lower()

# Analyze for a few major teams
major_teams = ['Mumbai Indians', 'Chennai Super Kings', 'Kolkata Knight Riders', 'Royal Challengers Bangalore']

home_away_stats = []

for team in major_teams:
    # Filter matches where team played
    team_matches = matches[(matches['team1'] == team) | (matches['team2'] == team)].copy()
    
    # Determine if match is at home
    team_matches['is_home'] = team_matches.apply(lambda row: is_home_match(row, team), axis=1)
    
    # Filter only matches where we can determine home/away
    team_matches = team_matches[team_matches['is_home'].notna()]
    
    # Calculate wins at home and away
    home_matches = team_matches[team_matches['is_home'] == True]
    away_matches = team_matches[team_matches['is_home'] == False]
    
    home_wins = len(home_matches[home_matches['winner'] == team])
    away_wins = len(away_matches[away_matches['winner'] == team])
    
    home_total = len(home_matches)
    away_total = len(away_matches)
    
    home_win_pct = (home_wins / home_total * 100) if home_total > 0 else 0
    away_win_pct = (away_wins / away_total * 100) if away_total > 0 else 0
    
    home_away_stats.append({
        'team': team,
        'home_matches': home_total,
        'home_wins': home_wins,
        'home_win_percentage': round(home_win_pct, 2),
        'away_matches': away_total,
        'away_wins': away_wins,
        'away_win_percentage': round(away_win_pct, 2)
    })

home_away_df = pd.DataFrame(home_away_stats)

print("Home vs Away Performance for Major Teams:")
print("=" * 80)
display(home_away_df)

# Calculate average advantage
home_away_df['home_advantage'] = (home_away_df['home_win_percentage'] - home_away_df['away_win_percentage']).round(2)

print("\n\nHome Advantage Analysis:")
print("=" * 80)
display(home_away_df[['team', 'home_win_percentage', 'away_win_percentage', 'home_advantage']])

print(f"\nTeam with biggest home advantage: {home_away_df.loc[home_away_df['home_advantage'].idxmax(), 'team']} " +
      f"(+{home_away_df['home_advantage'].max()} percentage points)")
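
For a single team, the home/away split can also be computed without a row-wise `apply`, by flagging the venue type with `np.where` and aggregating once. The sketch below uses made-up matches and abbreviated team names, and it assumes a `city` column with one home city per team.

```python
import numpy as np
import pandas as pd

# Made-up fixtures (abbreviated team names, hypothetical results)
demo = pd.DataFrame({
    'team1':  ['MI', 'CSK', 'MI', 'KKR'],
    'team2':  ['CSK', 'MI', 'KKR', 'MI'],
    'city':   ['Mumbai', 'Chennai', 'Mumbai', 'Kolkata'],
    'winner': ['MI', 'CSK', 'MI', 'MI'],
})

team, home_city = 'MI', 'Mumbai'
played = demo[(demo['team1'] == team) | (demo['team2'] == team)]

result = (
    played.assign(
        venue_type=np.where(played['city'].eq(home_city), 'home', 'away'),
        won=played['winner'].eq(team),
    )
    .groupby('venue_type')['won']
    .agg(matches='count', wins='sum', win_pct='mean')
)
result['win_pct'] = (result['win_pct'] * 100).round(2)
print(result)
```

Wrapping the last two statements in a loop over `major_teams` would reproduce the table above with one groupby per team.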

## Question 20: Death Overs Analysis (16-20)

**Learning Focus:** filtering by range, aggregations, comparing team performances

In [None]:
# Filter deliveries for death overs (16-20)
death_overs = deliveries[(deliveries['over'] >= 16) & (deliveries['over'] <= 20)]

# Calculate runs scored by batting team during death overs
death_runs = death_overs.groupby('batting_team')['total_runs'].sum().reset_index()
death_runs.columns = ['team', 'death_runs']
death_runs = death_runs.sort_values('death_runs', ascending=False)

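# Calculate wickets lost during death overs
# (rename batting_team to team so the merges below can join on 'team')
death_wickets = (
    death_overs[death_overs['player_dismissed'].notna()]
    .groupby('batting_team').size().reset_index(name='wickets_lost')
    .rename(columns={'batting_team': 'team'})
)

# Calculate balls faced in death overs
death_balls = (
    death_overs.groupby('batting_team').size().reset_index(name='balls_faced')
    .rename(columns={'batting_team': 'team'})
)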

# Merge all death over statistics
death_performance = pd.merge(death_runs, death_wickets, on='team', how='left')
death_performance = pd.merge(death_performance, death_balls, on='team', how='left')
death_performance = death_performance.fillna(0)

# Calculate run rate (runs per over)
death_performance['run_rate'] = (
    death_performance['death_runs'] / (death_performance['balls_faced'] / 6)
).round(2)

print("Death Overs Performance by Team (Overs 16-20):")
print("=" * 80)
display(death_performance)

# Calculate runs conceded by bowling team during death overs
death_runs_conceded = death_overs.groupby('bowling_team')['total_runs'].sum().reset_index()
death_runs_conceded.columns = ['team', 'runs_conceded']
death_runs_conceded = death_runs_conceded.sort_values('runs_conceded')

print("\n\nDeath Overs - Runs Conceded by Bowling Team:")
print("=" * 80)
display(death_runs_conceded)

print(f"\nBest death overs batting: {death_performance.iloc[0]['team']} " +
      f"({int(death_performance.iloc[0]['death_runs'])} runs)")
print(f"Highest death overs run rate: {death_performance.loc[death_performance['run_rate'].idxmax(), 'team']} " +
      f"({death_performance['run_rate'].max()} runs/over)")
print(f"Best death overs bowling: {death_runs_conceded.iloc[0]['team']} " +
      f"({int(death_runs_conceded.iloc[0]['runs_conceded'])} runs conceded)")