# 2025 IPL Winner Prediction using Statistical Learning

<!-- (c) Aritro 'sortira' Shome 2025-Present -->
(c) Team: ISeeData BrainDead  2025

## Part 1 : Cleaning and encoding the data for better model performance

Upon examining the dataset, a significant number of missing values are identified, along with the presence of multiple categorical (nominal) features. Since the predictive model performs optimally with numerical data, it is necessary to preprocess the dataset by handling missing values, encoding categorical variables, and applying feature engineering techniques. These steps ensure the dataset is properly structured and suitable for model training, thereby enhancing predictive performance.
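
A quick way to quantify both problems is to count the missing values per column and list the non-numeric columns. The sketch below is only illustrative: it uses a small hypothetical frame (`toy_df`, with made-up rows) rather than the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the matches table (values are made up).
toy_df = pd.DataFrame({
    'city': ['Delhi', None, 'Mumbai', 'Delhi'],
    'method': [None, None, 'D/L', None],
    'target_runs': [130.0, 166.0, np.nan, 111.0],
})

# Number of missing entries per column, largest first.
missing_counts = toy_df.isna().sum().sort_values(ascending=False)

# Non-numeric columns, i.e. candidates for encoding.
categorical_columns = toy_df.select_dtypes(exclude=['number']).columns.tolist()

print(missing_counts)
print(categorical_columns)
```

Running the same two expressions on `matches_df` after loading it would show which columns need imputation and which need encoding.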

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
import warnings


In [None]:
data_path = 'Brain Dead IPL Dataset/matches.csv' # path of the dataset

In [None]:
matches_df = pd.read_csv(data_path)
matches_df.drop(columns=['id'], inplace=True) # we don't need id as we automatically get an index in a dataframe
matches_df.head()

Unnamed: 0,season,city,date,match_type,player_of_match,venue,team1,team2,toss_winner,toss_decision,winner,result,result_margin,target_runs,target_overs,super_over,method,umpire1,umpire2
0,2007/08,Bangalore,2008-04-18,League,BB McCullum,M Chinnaswamy Stadium,Royal Challengers Bangalore,Kolkata Knight Riders,Royal Challengers Bangalore,field,Kolkata Knight Riders,runs,140.0,223.0,20.0,N,,Asad Rauf,RE Koertzen
1,2007/08,Chandigarh,2008-04-19,League,MEK Hussey,"Punjab Cricket Association Stadium, Mohali",Kings XI Punjab,Chennai Super Kings,Chennai Super Kings,bat,Chennai Super Kings,runs,33.0,241.0,20.0,N,,MR Benson,SL Shastri
2,2007/08,Delhi,2008-04-19,League,MF Maharoof,Feroz Shah Kotla,Delhi Daredevils,Rajasthan Royals,Rajasthan Royals,bat,Delhi Daredevils,wickets,9.0,130.0,20.0,N,,Aleem Dar,GA Pratapkumar
3,2007/08,Mumbai,2008-04-20,League,MV Boucher,Wankhede Stadium,Mumbai Indians,Royal Challengers Bangalore,Mumbai Indians,bat,Royal Challengers Bangalore,wickets,5.0,166.0,20.0,N,,SJ Davis,DJ Harper
4,2007/08,Kolkata,2008-04-20,League,DJ Hussey,Eden Gardens,Kolkata Knight Riders,Deccan Chargers,Deccan Chargers,bat,Kolkata Knight Riders,wickets,5.0,111.0,20.0,N,,BF Bowden,K Hariharan


In the dataset, the *season* and *date* attributes, despite being numerical in nature, are represented as string values. These must be appropriately converted into their respective numerical formats for effective processing. Additionally, categorical features such as *venue* and *city* require systematic organization to ensure consistency. Furthermore, these categorical variables, along with attributes like *umpires*, will be encoded using **Label Encoding** to transform them into a numerical representation suitable for model training.
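
As a rough illustration of what Label Encoding does, the sketch below encodes a short hand-picked list of venue names. `LabelEncoder` sorts the distinct values and assigns each one its position in that sorted list, so the resulting integers carry an alphabetical order that has no cricketing meaning.

```python
from sklearn.preprocessing import LabelEncoder

# A short hand-picked list used only for illustration.
venues = ['Wankhede Stadium', 'Eden Gardens', 'Feroz Shah Kotla', 'Eden Gardens']

venue_le = LabelEncoder()
venue_codes = venue_le.fit_transform(venues)

# classes_ holds the sorted distinct values; a value's code is its index there.
print(list(venue_le.classes_))
print(list(venue_codes))

# The fitted encoder can map codes back to the original strings.
decoded = list(venue_le.inverse_transform(venue_codes))
print(decoded)
```

Tree-based models can usually cope with such arbitrary integer codes, which is one reason this encoding is a reasonable fit for the models evaluated later.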

In [None]:
%pip install scikit-learn

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
matches_df['umpire1_encoded'] = le.fit_transform(matches_df['umpire1'])
matches_df['umpire2_encoded'] = le.fit_transform(matches_df['umpire2'])
matches_df['venue_encoded'] = le.fit_transform(matches_df['venue'])
matches_df['season_encoded'] = le.fit_transform(matches_df['season'])
matches_df = matches_df.drop(columns=['umpire1', 'umpire2', 'venue', 'season'])
matches_df.head(5)

Unnamed: 0,city,date,match_type,player_of_match,team1,team2,toss_winner,toss_decision,winner,result,result_margin,target_runs,target_overs,super_over,method,umpire1_encoded,umpire2_encoded,venue_encoded,season_encoded
0,Bangalore,2008-04-18,League,BB McCullum,Royal Challengers Bangalore,Kolkata Knight Riders,Royal Challengers Bangalore,field,Kolkata Knight Riders,runs,140.0,223.0,20.0,N,,9,41,23,0
1,Chandigarh,2008-04-19,League,MEK Hussey,Kings XI Punjab,Chennai Super Kings,Chennai Super Kings,bat,Chennai Super Kings,runs,33.0,241.0,20.0,N,,34,52,40,0
2,Delhi,2008-04-19,League,MF Maharoof,Delhi Daredevils,Rajasthan Royals,Rajasthan Royals,bat,Delhi Daredevils,wickets,9.0,130.0,20.0,N,,8,15,16,0
3,Mumbai,2008-04-20,League,MV Boucher,Mumbai Indians,Royal Challengers Bangalore,Mumbai Indians,bat,Royal Challengers Bangalore,wickets,5.0,166.0,20.0,N,,51,14,55,0
4,Kolkata,2008-04-20,League,DJ Hussey,Kolkata Knight Riders,Deccan Chargers,Deccan Chargers,bat,Kolkata Knight Riders,wickets,5.0,111.0,20.0,N,,10,24,14,0


Next, we perform feature extraction from the date variable by deriving attributes such as month, weekday, and quarter. These temporal features are crucial, as factors like fixture scheduling, seasonal variations, and weather conditions can significantly influence a team's win probability. Incorporating these derived features enhances the model's ability to capture patterns and dependencies related to match outcomes.

In [None]:
matches_df['date'] = pd.to_datetime(matches_df['date'])
matches_df['year'] = matches_df['date'].dt.year
matches_df['month'] = matches_df['date'].dt.month
matches_df['day'] = matches_df['date'].dt.day
matches_df['weekday'] = matches_df['date'].dt.weekday  # Monday=0, Sunday=6
matches_df['quarter'] = matches_df['date'].dt.quarter
matches_df['day_of_year'] = matches_df['date'].dt.dayofyear
matches_df.drop(columns=['date'], inplace=True)
matches_df.head(3)

Unnamed: 0,city,match_type,player_of_match,team1,team2,toss_winner,toss_decision,winner,result,result_margin,...,umpire1_encoded,umpire2_encoded,venue_encoded,season_encoded,year,month,day,weekday,quarter,day_of_year
0,Bangalore,League,BB McCullum,Royal Challengers Bangalore,Kolkata Knight Riders,Royal Challengers Bangalore,field,Kolkata Knight Riders,runs,140.0,...,9,41,23,0,2008,4,18,4,2,109
1,Chandigarh,League,MEK Hussey,Kings XI Punjab,Chennai Super Kings,Chennai Super Kings,bat,Chennai Super Kings,runs,33.0,...,34,52,40,0,2008,4,19,5,2,110
2,Delhi,League,MF Maharoof,Delhi Daredevils,Rajasthan Royals,Rajasthan Royals,bat,Delhi Daredevils,wickets,9.0,...,8,15,16,0,2008,4,19,5,2,110


To ensure the dataset is suitable for model training, categorical variables such as *result*, *toss_decision*, *match_type*, *super_over*, and *method* are encoded using **Label Encoding**. This process converts categorical values into numerical representations while preserving their relative distinctions. Following the encoding, the original categorical columns are removed to avoid redundancy. This transformation enhances the model's ability to interpret and utilize these features effectively.

In [None]:
matches_df['result_encoded'] = le.fit_transform(matches_df['result'])
matches_df.drop(columns=['result'], inplace=True)
matches_df['toss_decision_encoded'] = le.fit_transform(matches_df['toss_decision'])
matches_df.drop(columns=['toss_decision'], inplace=True)
matches_df['match_type_encoded'] = le.fit_transform(matches_df['match_type'])
matches_df.drop(columns=['match_type'], inplace=True)
matches_df['super_over_encoded'] = le.fit_transform(matches_df['super_over'])
matches_df.drop(columns=['super_over'], inplace=True)
matches_df['method_encoded'] = le.fit_transform(matches_df['method'])
matches_df.drop(columns=['method'], inplace=True)
matches_df.head(3)

Unnamed: 0,city,player_of_match,team1,team2,toss_winner,winner,result_margin,target_runs,target_overs,umpire1_encoded,...,month,day,weekday,quarter,day_of_year,result_encoded,toss_decision_encoded,match_type_encoded,super_over_encoded,method_encoded
0,Bangalore,BB McCullum,Royal Challengers Bangalore,Kolkata Knight Riders,Royal Challengers Bangalore,Kolkata Knight Riders,140.0,223.0,20.0,9,...,4,18,4,2,109,1,1,4,0,1
1,Chandigarh,MEK Hussey,Kings XI Punjab,Chennai Super Kings,Chennai Super Kings,Chennai Super Kings,33.0,241.0,20.0,34,...,4,19,5,2,110,1,0,4,0,1
2,Delhi,MF Maharoof,Delhi Daredevils,Rajasthan Royals,Rajasthan Royals,Delhi Daredevils,9.0,130.0,20.0,8,...,4,19,5,2,110,3,0,4,0,1


In [None]:
matches_df['team1'].unique()

array(['Royal Challengers Bangalore', 'Kings XI Punjab',
       'Delhi Daredevils', 'Mumbai Indians', 'Kolkata Knight Riders',
       'Rajasthan Royals', 'Deccan Chargers', 'Chennai Super Kings',
       'Kochi Tuskers Kerala', 'Pune Warriors', 'Sunrisers Hyderabad',
       'Gujarat Lions', 'Rising Pune Supergiants',
       'Rising Pune Supergiant', 'Delhi Capitals', 'Punjab Kings',
       'Lucknow Super Giants', 'Gujarat Titans',
       'Royal Challengers Bengaluru'], dtype=object)

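Historical changes in team names and franchises have introduced inconsistencies in the dataset. For instance, *Kings XI Punjab* was rebranded as *Punjab Kings*, *Delhi Daredevils* was renamed *Delhi Capitals*, and *Royal Challengers Bangalore* became *Royal Challengers Bengaluru*. Additionally, *Chennai Super Kings (CSK)* and *Rajasthan Royals (RR)* were suspended for the 2016 and 2017 seasons, during which *Rising Pune Supergiant(s)* and *Gujarat Lions* played as interim teams; this notebook maps those two interim teams to CSK and RR respectively. *Deccan Chargers* was succeeded by *Sunrisers Hyderabad (SRH)*. The defunct *Kochi Tuskers Kerala* and *Pune Warriors* have no direct successors; as a simplification, they are folded into *Gujarat Titans (GT)* and *Lucknow Super Giants (LSG)* respectively.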

To standardize team names and maintain consistency in the dataset, all team name variations are mapped to a single short abbreviation per franchise. This ensures uniformity in the data and prevents the same franchise from appearing as several distinct classes during model training. The replacements are applied to every column that holds a team name: *team1*, *team2*, *toss_winner*, and *winner*.

In [None]:
# Map every historical or alternative team name to one abbreviation per franchise.
team_name_map = {
    'Chennai Super Kings': 'CSK',
    'Rising Pune Supergiants': 'CSK',
    'Rising Pune Supergiant': 'CSK',
    'Royal Challengers Bangalore': 'RCB',
    'Royal Challengers Bengaluru': 'RCB',
    'Rajasthan Royals': 'RR',
    'Gujarat Lions': 'RR',
    'Sunrisers Hyderabad': 'SRH',
    'Deccan Chargers': 'SRH',
    'Delhi Daredevils': 'DD',
    'Delhi Capitals': 'DD',
    'Kings XI Punjab': 'PBKS',
    'Punjab Kings': 'PBKS',
    'Lucknow Super Giants': 'LSG',  # spelled as it appears in the dataset
    'Pune Warriors': 'LSG',
    'Gujarat Titans': 'GT',
    'Kochi Tuskers Kerala': 'GT',
    'Kolkata Knight Riders': 'KKR',
    'Mumbai Indians': 'MI',
}

# Apply the same mapping to every column that contains a team name.
for col in ['team1', 'team2', 'toss_winner', 'winner']:
    matches_df[col] = matches_df[col].replace(team_name_map)


In [None]:
matches_df.head(10)

Unnamed: 0,city,player_of_match,team1,team2,toss_winner,winner,result_margin,target_runs,target_overs,umpire1_encoded,...,month,day,weekday,quarter,day_of_year,result_encoded,toss_decision_encoded,match_type_encoded,super_over_encoded,method_encoded
0,Bangalore,BB McCullum,RCB,KKR,RCB,KKR,140.0,223.0,20.0,9,...,4,18,4,2,109,1,1,4,0,1
1,Chandigarh,MEK Hussey,PBKS,CSK,CSK,CSK,33.0,241.0,20.0,34,...,4,19,5,2,110,1,0,4,0,1
2,Delhi,MF Maharoof,DD,RR,RR,DD,9.0,130.0,20.0,8,...,4,19,5,2,110,3,0,4,0,1
3,Mumbai,MV Boucher,MI,RCB,MI,RCB,5.0,166.0,20.0,51,...,4,20,6,2,111,3,0,4,0,1
4,Kolkata,DJ Hussey,KKR,SRH,SRH,KKR,5.0,111.0,20.0,10,...,4,20,6,2,111,3,0,4,0,1
5,Jaipur,SR Watson,RR,PBKS,PBKS,RR,6.0,167.0,20.0,8,...,4,21,0,2,112,3,0,4,0,1
6,Hyderabad,V Sehwag,SRH,DD,SRH,DD,9.0,143.0,20.0,24,...,4,22,1,2,113,3,0,4,0,1
7,Chennai,ML Hayden,CSK,MI,MI,CSK,6.0,209.0,20.0,18,...,4,23,2,2,114,1,1,4,0,1
8,Hyderabad,YK Pathan,SRH,RR,RR,RR,3.0,215.0,20.0,9,...,4,24,3,2,115,3,1,4,0,1
9,Chandigarh,KC Sangakkara,PBKS,MI,MI,PBKS,66.0,183.0,20.0,8,...,4,25,4,2,116,1,1,4,0,1


With the team names standardized, the team columns can now be label-encoded. A single encoder is fitted on all four team columns so that each team receives the same code wherever it appears.
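
Why one encoder is fitted across all team columns can be seen on a toy example. The frame below is hypothetical; it only shows that separate encoders may give the same team different codes in different columns, whereas a shared encoder does not.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical fixtures used only for illustration.
toy = pd.DataFrame({
    'team1': ['MI', 'RCB', 'KKR'],
    'team2': ['CSK', 'KKR', 'MI'],
})

# Separate encoders: each column gets its own alphabetical numbering,
# so 'MI' is 1 in team1 but 2 in team2.
separate_team1 = LabelEncoder().fit_transform(toy['team1'])
separate_team2 = LabelEncoder().fit_transform(toy['team2'])

# Shared encoder: fit once on the union of values, then transform each column.
shared_le = LabelEncoder().fit(pd.concat([toy['team1'], toy['team2']]))
shared_team1 = shared_le.transform(toy['team1'])
shared_team2 = shared_le.transform(toy['team2'])

print(list(separate_team1), list(separate_team2))
print(list(shared_team1), list(shared_team2))
```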

In [None]:
team_columns = ['team1', 'team2', 'toss_winner', 'winner']
city_column = 'city'
team_le = LabelEncoder()
all_teams = pd.concat([matches_df[col] for col in team_columns])
team_le.fit(all_teams)
label_mapping = {index: label for index, label in enumerate(team_le.classes_)}
for col in team_columns:
    matches_df[f'{col}_encoded'] = team_le.transform(matches_df[col])
    matches_df.drop(columns=[col], inplace=True)
city_le = LabelEncoder()
matches_df[f'{city_column}_encoded'] = city_le.fit_transform(matches_df[city_column])
matches_df.drop(columns=[city_column], inplace=True)

Since *player_of_match* is known only after a match has been played and is a consequence of its outcome, keeping it would leak information about the target variable. It is therefore dropped, which also removes one dimension.

In [None]:
matches_df.drop(columns=['player_of_match'], inplace=True)
matches_df.head(10)

Unnamed: 0,result_margin,target_runs,target_overs,umpire1_encoded,umpire2_encoded,venue_encoded,season_encoded,year,month,day,...,result_encoded,toss_decision_encoded,match_type_encoded,super_over_encoded,method_encoded,team1_encoded,team2_encoded,toss_winner_encoded,winner_encoded,city_encoded
0,140.0,223.0,20.0,9,41,23,0,2008,4,18,...,1,1,4,0,1,8,3,8,3,2
1,33.0,241.0,20.0,34,52,40,0,2008,4,19,...,1,0,4,0,1,7,0,0,0,7
2,9.0,130.0,20.0,8,15,16,0,2008,4,19,...,3,0,4,0,1,1,9,9,1,10
3,5.0,166.0,20.0,51,14,55,0,2008,4,20,...,3,0,4,0,1,6,8,6,8,26
4,5.0,111.0,20.0,10,24,14,0,2008,4,20,...,3,0,4,0,1,3,10,10,3,23
5,6.0,167.0,20.0,8,40,46,0,2008,4,21,...,3,0,4,0,1,9,7,7,9,18
6,9.0,143.0,20.0,24,4,42,0,2008,4,22,...,3,0,4,0,1,10,1,10,1,16
7,6.0,209.0,20.0,18,15,27,0,2008,4,23,...,1,1,4,0,1,0,6,6,0,8
8,3.0,215.0,20.0,9,30,42,0,2008,4,24,...,3,1,4,0,1,10,9,9,9,16
9,66.0,183.0,20.0,8,4,40,0,2008,4,25,...,1,1,4,0,1,7,6,6,7,7


To enhance the dataset's interpretability and facilitate model training, the *result_margin* column has been processed to create two distinct numerical features: **result_runs** and **result_wickets**.

- The **result_runs** column captures the margin of victory in terms of runs when the match outcome corresponds to *result_encoded = 1* (indicating a win by runs).  
- The **result_wickets** column reflects the margin of victory in terms of wickets when *result_encoded = 3* (indicating a win by wickets).  

For matches where the result does not fall under these categories (i.e., cases where *result_encoded* is neither 1 nor 3), both **result_runs** and **result_wickets** are set to zero to maintain consistency in the data structure.

After this transformation, the original *result_margin* column, which contained mixed numerical values (runs or wickets), is removed to prevent redundancy. This refined representation ensures clarity in distinguishing between different types of match outcomes, thereby improving the dataset's usability for predictive modeling.

In [None]:
matches_df['result_runs'] = 0
matches_df['result_wickets'] = 0
matches_df.loc[matches_df['result_encoded'] == 1, 'result_runs'] = matches_df['result_margin']
matches_df.loc[matches_df['result_encoded'] == 3, 'result_wickets'] = matches_df['result_margin']
matches_df.loc[~matches_df['result_encoded'].isin([1, 3]), ['result_runs', 'result_wickets']] = 0
matches_df.drop(columns=['result_margin'], inplace=True)
matches_df.head()

Unnamed: 0,target_runs,target_overs,umpire1_encoded,umpire2_encoded,venue_encoded,season_encoded,year,month,day,weekday,...,match_type_encoded,super_over_encoded,method_encoded,team1_encoded,team2_encoded,toss_winner_encoded,winner_encoded,city_encoded,result_runs,result_wickets
0,223.0,20.0,9,41,23,0,2008,4,18,4,...,4,0,1,8,3,8,3,2,140,0
1,241.0,20.0,34,52,40,0,2008,4,19,5,...,4,0,1,7,0,0,0,7,33,0
2,130.0,20.0,8,15,16,0,2008,4,19,5,...,4,0,1,1,9,9,1,10,0,9
3,166.0,20.0,51,14,55,0,2008,4,20,6,...,4,0,1,6,8,6,8,26,0,5
4,111.0,20.0,10,24,14,0,2008,4,20,6,...,4,0,1,3,10,10,3,23,0,5


To put all features on a comparable scale, **normalization** is applied to the dataset.  

- Each numerical column is normalized using **Min-Max Scaling**.  
- The transformation follows the formula:  
  $$ x_{\text{normalized}} = \frac{x - \min(x)}{\max(x) - \min(x)} $$  
  where $x$ is the original feature value, and $\min(x)$ and $\max(x)$ are the minimum and maximum values of the respective column.  

This scaling maps every numerical feature to the [0,1] range, preventing dominance by features with larger absolute values.

The encoded team columns (*winner_encoded*, *team1_encoded*, *team2_encoded*, and *toss_winner_encoded*) denote classes rather than magnitudes, so exempting them would be natural. The check that did so is commented out in the cell below, however, so these columns are scaled as well; the target labels are converted back to integers before model training.

In [None]:
# Every categorical column has been encoded by now, so the dtype check is only a safeguard.
categorical_cols = matches_df.select_dtypes(include=['object']).columns
for column in matches_df.columns:
    # if column == 'winner_encoded' or column == 'team1_encoded' or column == 'team2_encoded' or column == 'toss_winner_encoded':
    #     continue
    if column not in categorical_cols:
        min_value = matches_df[column].min()
        max_value = matches_df[column].max()
        value_range = max_value - min_value
        if value_range == 0:
            continue  # constant column: skip to avoid division by zero
        matches_df[column] = (matches_df[column] - min_value) / value_range
matches_df.head(15)

Unnamed: 0,target_runs,target_overs,umpire1_encoded,umpire2_encoded,venue_encoded,season_encoded,year,month,day,weekday,...,match_type_encoded,super_over_encoded,method_encoded,team1_encoded,team2_encoded,toss_winner_encoded,winner_encoded,city_encoded,result_runs,result_wickets
0,0.625,0.75,0.147541,0.672131,0.403509,0.0,0.0,0.090909,0.548387,0.666667,...,0.571429,0.0,1.0,0.8,0.3,0.8,0.272727,0.055556,0.958904,0.0
1,0.6875,0.75,0.557377,0.852459,0.701754,0.0,0.0,0.090909,0.580645,0.833333,...,0.571429,0.0,1.0,0.7,0.0,0.0,0.0,0.194444,0.226027,0.0
2,0.302083,0.75,0.131148,0.245902,0.280702,0.0,0.0,0.090909,0.580645,0.833333,...,0.571429,0.0,1.0,0.1,0.9,0.9,0.090909,0.277778,0.0,0.9
3,0.427083,0.75,0.836066,0.229508,0.964912,0.0,0.0,0.090909,0.612903,1.0,...,0.571429,0.0,1.0,0.6,0.8,0.6,0.727273,0.722222,0.0,0.5
4,0.236111,0.75,0.163934,0.393443,0.245614,0.0,0.0,0.090909,0.612903,1.0,...,0.571429,0.0,1.0,0.3,1.0,1.0,0.272727,0.638889,0.0,0.5
5,0.430556,0.75,0.131148,0.655738,0.807018,0.0,0.0,0.090909,0.645161,0.0,...,0.571429,0.0,1.0,0.9,0.7,0.7,0.818182,0.5,0.0,0.6
6,0.347222,0.75,0.393443,0.065574,0.736842,0.0,0.0,0.090909,0.677419,0.166667,...,0.571429,0.0,1.0,1.0,0.1,1.0,0.090909,0.444444,0.0,0.9
7,0.576389,0.75,0.295082,0.245902,0.473684,0.0,0.0,0.090909,0.709677,0.333333,...,0.571429,0.0,1.0,0.0,0.6,0.6,0.0,0.222222,0.041096,0.0
8,0.597222,0.75,0.147541,0.491803,0.736842,0.0,0.0,0.090909,0.741935,0.5,...,0.571429,0.0,1.0,1.0,0.9,0.9,0.818182,0.444444,0.0,0.3
9,0.486111,0.75,0.131148,0.065574,0.701754,0.0,0.0,0.090909,0.774194,0.666667,...,0.571429,0.0,1.0,0.7,0.6,0.6,0.636364,0.194444,0.452055,0.0
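
Standard min-max scaling, $(x - \min(x)) / (\max(x) - \min(x))$, can be checked by hand on a few values. The sketch below applies it to a made-up column and compares the result with scikit-learn's `MinMaxScaler`, which implements the same transformation; the numbers are illustrative and not taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up values for illustration.
runs = pd.Series([100.0, 150.0, 200.0, 250.0], name='target_runs')

# Manual min-max scaling: (x - min) / (max - min).
manual = (runs - runs.min()) / (runs.max() - runs.min())

# The same transformation via scikit-learn (it expects a 2-D input).
scaled = MinMaxScaler().fit_transform(runs.to_frame()).ravel()

print(manual.tolist())
print(np.allclose(manual, scaled))
```

The smallest value maps to 0 and the largest to 1, with the others spaced proportionally in between.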


Lastly, to fill the remaining NaN values, we run `interpolate()`, which by default fills each gap linearly from the values in the neighbouring rows.
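
As a small illustration of what `interpolate()` does with its default settings, the sketch below fills the gaps in a made-up column. Whether a linear fill between neighbouring matches is appropriate for a given feature is an assumption worth checking against the data.

```python
import numpy as np
import pandas as pd

# Made-up column with interior gaps.
toy = pd.DataFrame({'target_runs': [0.2, np.nan, 0.6, np.nan, np.nan, 0.9]})

# Default method is 'linear': gaps are filled evenly between the known neighbours.
filled = toy.interpolate()

print(filled['target_runs'].tolist())
print(int(filled.isna().sum().sum()))
```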

In [None]:
matches_df = matches_df.interpolate()
matches_df.columns
# matches_df.to_csv('/kaggle/working/processed_matches.csv')

Index(['target_runs', 'target_overs', 'umpire1_encoded', 'umpire2_encoded',
       'venue_encoded', 'season_encoded', 'year', 'month', 'day', 'weekday',
       'quarter', 'day_of_year', 'result_encoded', 'toss_decision_encoded',
       'match_type_encoded', 'super_over_encoded', 'method_encoded',
       'team1_encoded', 'team2_encoded', 'toss_winner_encoded',
       'winner_encoded', 'city_encoded', 'result_runs', 'result_wickets'],
      dtype='object')

## Part 2: Finding the best model for predicting the winner

The underlying approach for this analysis is structured in two key phases:  

1. **Match-Level Prediction:**  
   - Given that the dataset provides match-wise data, the initial objective is to develop a predictive model capable of accurately determining the winner of an individual match based on historical match statistics, team performance metrics, and other relevant features.  
   - This model will leverage feature-engineered variables, encoded categorical attributes, and normalized numerical data to enhance predictive accuracy.  
   - Supervised learning techniques, particularly classification models such as logistic regression, decision trees, or ensemble methods, will be explored to optimize the prediction performance.  

2. **Tournament Simulation:**  
   - Once a robust match-level predictor is established, it will be extended to simulate an entire tournament structure, including the **league stage, playoffs, and finals**.  
   - The tournament will be simulated using match-by-match predictions, determining the progression of teams based on their projected performances.  
   - The framework will follow standard tournament rules, wherein teams accumulate points during the league stage, followed by knockout rounds leading up to the final match.  
   - By iteratively applying the trained model to simulate multiple tournament scenarios, insights can be derived regarding the most probable tournament outcomes, dominant teams, and key factors influencing overall performance.  

This hierarchical approach enables both **granular match-level insights** and **higher-level tournament forecasting**, thereby providing a comprehensive predictive framework for competitive analysis.
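
The two phases above can be sketched as follows. This is a minimal outline under stated assumptions, not the notebook's implementation: `predict_winner` is a hypothetical stand-in for the trained match-level model (here a coin flip), the league stage is simplified to a full double round-robin, ties in the points table are broken by team name rather than net run rate, and the playoffs follow the Qualifier 1, Eliminator, Qualifier 2, Final format.

```python
import itertools
import random
from collections import Counter

TEAMS = ['CSK', 'DD', 'GT', 'KKR', 'LSG', 'MI', 'PBKS', 'RCB', 'RR', 'SRH']

def predict_winner(team_a, team_b, rng):
    """Hypothetical placeholder for the trained model: a fair coin flip."""
    return team_a if rng.random() < 0.5 else team_b

def simulate_league(teams, rng):
    """Play every ordered pairing once (each pair meets twice), 2 points per win."""
    points = Counter({team: 0 for team in teams})
    for home, away in itertools.permutations(teams, 2):
        points[predict_winner(home, away, rng)] += 2
    # Rank by points; ties are broken alphabetically (simplification).
    return sorted(teams, key=lambda team: (-points[team], team))

def simulate_playoffs(top_four, rng):
    """Qualifier 1, Eliminator, Qualifier 2, then the Final."""
    first, second, third, fourth = top_four
    q1_winner = predict_winner(first, second, rng)
    q1_loser = second if q1_winner == first else first
    eliminator_winner = predict_winner(third, fourth, rng)
    q2_winner = predict_winner(q1_loser, eliminator_winner, rng)
    return predict_winner(q1_winner, q2_winner, rng)

def simulate_tournaments(teams, n_runs=1000, seed=42):
    """Repeat the whole tournament and count how often each team wins the title."""
    rng = random.Random(seed)
    titles = Counter()
    for _ in range(n_runs):
        standings = simulate_league(teams, rng)
        titles[simulate_playoffs(standings[:4], rng)] += 1
    return titles

title_counts = simulate_tournaments(TEAMS, n_runs=200)
print(title_counts.most_common(3))
```

Replacing the coin flip with a call to the trained classifier would turn the title counts into an estimate of each team's chance of winning the tournament under the model.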

But first things first, splitting the dataset for training and testing.

In [None]:
from sklearn.model_selection import train_test_split
y = matches_df['winner_encoded']
X = matches_df.drop(columns=['winner_encoded'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
X_train

Unnamed: 0,target_runs,target_overs,umpire1_encoded,umpire2_encoded,venue_encoded,season_encoded,year,month,day,weekday,...,toss_decision_encoded,match_type_encoded,super_over_encoded,method_encoded,team1_encoded,team2_encoded,toss_winner_encoded,city_encoded,result_runs,result_wickets
6,0.347222,0.75,0.393443,0.065574,0.736842,0.0000,0.000000,0.090909,0.677419,0.166667,...,0.0,0.571429,0.0,1.0,1.0,0.1,1.0,0.444444,0.000000,0.9
575,0.416667,0.75,0.524590,0.213115,0.280702,0.5000,0.003953,0.181818,0.838710,0.666667,...,1.0,0.857143,0.0,1.0,0.9,1.0,1.0,0.277778,0.000000,0.4
821,0.371528,0.75,0.639344,0.934426,0.491228,0.8125,0.006423,0.090909,0.419355,0.333333,...,1.0,0.571429,0.0,1.0,0.8,1.0,1.0,0.222222,0.041096,0.0
1063,0.631944,0.75,0.508197,0.934426,0.017544,1.0000,0.007905,0.090909,0.741935,0.333333,...,1.0,0.571429,0.0,1.0,0.1,0.2,0.2,0.277778,0.027397,0.0
905,0.607639,0.75,0.327869,0.967213,0.105263,0.8750,0.006917,0.090909,0.548387,0.000000,...,1.0,0.571429,0.0,1.0,0.9,0.3,0.3,0.722222,0.047945,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
330,0.579861,0.75,0.524590,0.950820,0.964912,0.3125,0.002470,0.090909,0.258065,0.166667,...,0.0,0.571429,0.0,1.0,0.6,0.1,0.6,0.722222,0.301370,0.0
466,0.423611,0.75,0.065574,0.770492,0.771930,0.4375,0.003458,0.090909,0.419355,0.166667,...,0.0,0.571429,0.0,1.0,0.9,0.6,0.6,0.027778,0.000000,0.7
121,0.559028,0.75,0.786885,0.229508,0.403509,0.1250,0.000988,0.000000,0.483871,0.166667,...,0.0,0.571429,0.0,1.0,0.8,0.7,0.7,0.055556,0.000000,0.8
1044,0.420139,0.75,0.016393,0.967213,0.070175,1.0000,0.007905,0.090909,0.193548,1.000000,...,0.0,0.571429,0.0,1.0,0.5,0.2,0.5,0.666667,0.226027,0.0


In [None]:
y_train

6       0.090909
575     0.909091
821     0.727273
1063    0.090909
905     0.818182
          ...   
330     0.545455
466     0.818182
121     0.727273
1044    0.454545
860     0.636364
Name: winner_encoded, Length: 876, dtype: float64

The problem can be framed as a **classification task**: binary for a single fixture, where one of two teams wins, and multiclass as encoded here, since *winner_encoded* ranges over all teams. The objective is to determine the winner of a match from a set of input features, which intuitively involves analyzing how different feature values influence the likelihood of one team winning over another.  

From a computational perspective, this can be abstracted as a decision-making problem governed by numerous conditional statements. For each match instance, if certain conditions hold (relating to team strength, past performances, toss outcomes, venue conditions, and other match-related factors), the probability of one team winning increases, and vice versa.  
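
The idea of a model as a set of learned conditional statements can be made concrete with a very small decision tree. The sketch below is illustrative only: the two features and all rows are made up, and `export_text` is used to print the learned splits as nested conditions.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up rows: [toss_won_by_team1, team1_recent_wins]; label 1 means team1 won.
X_toy = [[1, 4], [1, 3], [0, 4], [0, 1], [1, 0], [0, 0]]
y_toy = [1, 1, 1, 0, 0, 0]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, y_toy)

# Print the learned splits as nested if/else style rules.
rules = export_text(tree, feature_names=['toss_won_by_team1', 'team1_recent_wins'])
print(rules)
```

In this toy data the outcome is fully determined by recent wins, so the tree needs only a single condition on that feature.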

### **Choice of Algorithms**  
Given the nature of the problem, decision-tree-based models are well-suited, as they efficiently learn decision boundaries from structured data. The following algorithms have been selected for evaluation:  

1. **Decision Tree Classifier**  
   - A simple, interpretable model that recursively splits the dataset based on feature values, forming a tree-like structure of conditional decisions.  

2. **Random Forest Classifier**  
   - An ensemble of multiple decision trees that aggregates predictions to reduce overfitting and improve generalization.  

3. **XGBoost Classifier**  
   - An advanced gradient boosting algorithm that optimizes decision trees sequentially to minimize error, often yielding superior performance in structured data problems.  

---

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
import xgboost as xgb
import numpy as np
import matplotlib.pyplot as plt

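# Convert the scaled target back to discrete integer class labels.
# Ranking the distinct scaled values avoids hard-coding the number of classes.
target_classes = np.unique(y)
y_train_discrete = np.searchsorted(target_classes, y_train)
y_test_discrete = np.searchsorted(target_classes, y_test)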

# Define parameter grids
rf_params = {
    'n_estimators': [100, 200, 500, 1000],
    'max_depth': [10, 50, 100, 200],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

dt_params = {
    'max_depth': [10, 50, 100, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

xgb_params = {
    'n_estimators': [100, 200, 500],
    'max_depth': [3, 6, 9],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.5, 0.8, 1],
    'colsample_bytree': [0.5, 0.8, 1]
}

# Initialize models
rf_classifier = RandomForestClassifier(random_state=42)
dt_classifier = DecisionTreeClassifier(random_state=42)
xgb_classifier = xgb.XGBClassifier(random_state=42, eval_metric='mlogloss')

# Perform hyperparameter tuning
rf_search = RandomizedSearchCV(rf_classifier, rf_params, n_iter=20, cv=5, scoring='accuracy', n_jobs=-1, random_state=42)
dt_search = RandomizedSearchCV(dt_classifier, dt_params, n_iter=20, cv=5, scoring='accuracy', n_jobs=-1, random_state=42)
xgb_search = RandomizedSearchCV(xgb_classifier, xgb_params, n_iter=20, cv=5, scoring='accuracy', n_jobs=-1, random_state=42)

rf_search.fit(X_train, y_train_discrete)
dt_search.fit(X_train, y_train_discrete)
xgb_search.fit(X_train, y_train_discrete)

# Get best models
rf_best = rf_search.best_estimator_
dt_best = dt_search.best_estimator_
xgb_best = xgb_search.best_estimator_

# Make predictions
rf_predictions = rf_best.predict(X_test)
dt_predictions = dt_best.predict(X_test)
xgb_predictions = xgb_best.predict(X_test)

# Evaluate models
metrics = ['Accuracy', 'F1-Score', 'Precision', 'Recall']

rf_scores = [
    accuracy_score(y_test_discrete, rf_predictions),
    f1_score(y_test_discrete, rf_predictions, average='weighted'),
    precision_score(y_test_discrete, rf_predictions, average='weighted'),
    recall_score(y_test_discrete, rf_predictions, average='weighted')
]

dt_scores = [
    accuracy_score(y_test_discrete, dt_predictions),
    f1_score(y_test_discrete, dt_predictions, average='weighted'),
    precision_score(y_test_discrete, dt_predictions, average='weighted'),
    recall_score(y_test_discrete, dt_predictions, average='weighted')
]

xgb_scores = [
    accuracy_score(y_test_discrete, xgb_predictions),
    f1_score(y_test_discrete, xgb_predictions, average='weighted'),
    precision_score(y_test_discrete, xgb_predictions, average='weighted'),
    recall_score(y_test_discrete, xgb_predictions, average='weighted')
]

# Plot results
x = np.arange(len(metrics))
plt.figure(figsize=(12, 6))
width = 0.25
plt.bar(x - width, rf_scores, width=width, label='Random Forest', color='skyblue')
plt.bar(x, dt_scores, width=width, label='Decision Tree', color='green')
plt.bar(x + width, xgb_scores, width=width, label='XGBoost', color='salmon')

plt.xticks(ticks=x, labels=metrics, fontsize=12)
plt.ylabel('Score', fontsize=12)
plt.title('Optimized Model Performance Metrics', fontsize=14)
plt.legend()
plt.ylim(0, 1)
plt.show()

print("Best Random Forest Params:", rf_search.best_params_)
print("Best Decision Tree Params:", dt_search.best_params_)
print("Best XGBoost Params:", xgb_search.best_params_)

print("\n")
print("="*50)
print("\n")

print(f"Decision Tree Classifier Accuracy: {dt_scores[0]:.4f}, F1 Score: {dt_scores[1]:.4f}, Precision: {dt_scores[2]:.4f}, Recall: {dt_scores[3]:.4f}")

print(f"Random Forest Classifier Accuracy: {rf_scores[0]:.4f}, F1 Score: {rf_scores[1]:.4f}, Precision: {rf_scores[2]:.4f}, Recall: {rf_scores[3]:.4f}")

print(f"XGBoost Classifier Accuracy: {xgb_scores[0]:.4f}, F1 Score: {xgb_scores[1]:.4f}, Precision: {xgb_scores[2]:.4f}, Recall: {xgb_scores[3]:.4f}")

import warnings

warnings.filterwarnings('ignore')
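The `np.round(y * 11).astype(int)` step in the cell above undoes a normalization of the class labels. The sketch below illustrates why rounding comes before the integer cast. It assumes (this is not shown in this section) that the labels were integer codes 0 to 11 divided by 11; the array `codes` is hypothetical.

```python
import numpy as np

# Hypothetical integer team codes 0..11 (assumed, for illustration only).
codes = np.arange(12)

# Assumed normalization: divide by 11 and store as float32, so values lie in [0, 1].
normalized = (codes / 11).astype(np.float32)

# Truncation: floating-point error can leave a product just below the
# intended integer, which astype(int) would then cut down by one.
truncated = (normalized * 11).astype(int)

# Rounding first, as in the cell above, recovers every code exactly.
recovered = np.round(normalized * 11).astype(int)

print("recovered == codes:", np.array_equal(recovered, codes))
print("truncated == codes:", np.array_equal(truncated, codes))
```

Whether truncation actually loses a label depends on the stored precision, which is why rounding is the safer choice.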

In [None]:
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Compute confusion matrices
rf_cm = confusion_matrix(y_test_discrete, rf_predictions)
dt_cm = confusion_matrix(y_test_discrete, dt_predictions)
xgb_cm = confusion_matrix(y_test_discrete, xgb_predictions)

# Function to plot confusion matrix
def plot_confusion_matrix(cm, title):
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='viridis', xticklabels=np.unique(y_test_discrete), yticklabels=np.unique(y_test_discrete))
    plt.xlabel('Predicted Label')
    plt.ylabel('True Label')
    plt.title(title)
    plt.show()

# Plot confusion matrices
plot_confusion_matrix(rf_cm, "Random Forest Confusion Matrix")
plot_confusion_matrix(dt_cm, "Decision Tree Confusion Matrix")
plot_confusion_matrix(xgb_cm, "XGBoost Confusion Matrix")


**Looks like these won't do. Let's try the MLP Classifier as MLP Classifiers can capture significantly more complex patterns and trends. 🤞**

In [None]:
from sklearn.neural_network import MLPClassifier
mlp_model = MLPClassifier(
    hidden_layer_sizes=(512,),  # one hidden layer of 512 units
    activation='tanh',
    solver='lbfgs',
    max_iter=1000,
    random_state=42,
    learning_rate_init=0.020718394976751383,  # only used by the 'sgd' and 'adam' solvers
    alpha=0.4303451144499504,
)
mlp_model.fit(X_train, y_train)
mlp_preds = mlp_model.predict(X_test)
mlp_accuracy = accuracy_score(y_test, mlp_preds)
print(f"MLP Accuracy: {mlp_accuracy:.4f}")

**That is a significant jump in accuracy, to roughly 90%. Next, we use Optuna's hyperparameter tuning procedure to find the best hyperparameters for the model.**

In [None]:
import optuna
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

def objective(trial):
    num_layers = trial.suggest_int("num_layers", 1, 2)

    # The first hidden layer is always sampled; the second only for two-layer networks.
    layer1 = trial.suggest_categorical("layer1", [50, 100, 128, 256, 512])
    if num_layers == 2:
        layer2 = trial.suggest_categorical("layer2", [50, 100, 128, 256, 512])

    activation = trial.suggest_categorical("activation", ["relu", "tanh"])
    solver = trial.suggest_categorical("solver", ["lbfgs", "adam"])
    learning_rate_init = trial.suggest_float("learning_rate_init", 0.0001, 0.02, log=True)
    alpha = trial.suggest_float("alpha", 0.0001, 0.05)

    model = MLPClassifier(
        hidden_layer_sizes=(layer1,) if num_layers == 1 else (layer1, layer2),
        activation=activation,
        solver=solver,
        learning_rate_init=learning_rate_init,
        alpha=alpha,
        max_iter=3000
    )

    model.fit(X_train, y_train)
    # Note: each trial is scored on the test set, so the tuned accuracy reported below is optimistic.
    score = model.score(X_test, y_test)
    return score


# Run Optuna optimization
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)

# Get best params and train final model
best_mlp_params = study.best_params

# Build the final MLP model with best parameters
if best_mlp_params["num_layers"] == 1:
    final_hidden_layer_sizes = (best_mlp_params["layer1"],)
else:
    final_hidden_layer_sizes = (best_mlp_params["layer1"], best_mlp_params["layer2"])

final_mlp_model = MLPClassifier(
    hidden_layer_sizes=final_hidden_layer_sizes,
    activation=best_mlp_params["activation"],
    solver=best_mlp_params["solver"],
    learning_rate_init=best_mlp_params["learning_rate_init"],
    alpha=best_mlp_params["alpha"],
    max_iter=2000,
    random_state=42
)

final_mlp_model.fit(X_train, y_train)
mlp_preds = final_mlp_model.predict(X_test)
mlp_accuracy = accuracy_score(y_test, mlp_preds)
print(f"Tuned MLP Accuracy: {mlp_accuracy:.4f}")
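The objective above scores each trial on `X_test`, so the same data both selects the hyperparameters and reports the final accuracy. One common alternative is to score trials by cross-validation on the training data only. The following is a minimal sketch under stated assumptions: the data comes from `make_classification` as a stand-in for the match features, and `cv_score` is a hypothetical helper that is not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (assumption: the notebook would pass its own X_train, y_train).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

def cv_score(params, X_train, y_train, folds=3):
    """Hypothetical helper: mean cross-validated accuracy for one hyperparameter set."""
    model = MLPClassifier(max_iter=500, random_state=42, **params)
    # Only training data is used here; the test set stays untouched until the end.
    return cross_val_score(model, X_train, y_train, cv=folds, scoring="accuracy").mean()

candidate = {"hidden_layer_sizes": (32,), "activation": "tanh",
             "solver": "lbfgs", "alpha": 0.01}
score = cv_score(candidate, X_train, y_train)
print(f"Mean CV accuracy: {score:.4f}")
```

Inside an Optuna `objective(trial)`, the value returned by such a helper would take the place of `model.score(X_test, y_test)`, leaving the test set for a single final evaluation.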

That is indeed a substantial improvement over the roughly 50% accuracy of Random Forest and the like. This will be our final match-winner predictor. Let's save the model.

In [None]:
import pickle

filename = '/kaggle/working/fine_tuned_MLP_model.sav'
# Use a context manager so the file handle is closed after writing.
with open(filename, 'wb') as file:
    pickle.dump(final_mlp_model, file)
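A quick round-trip check can confirm that a pickled object loads back unchanged. The sketch below uses a plain dictionary and a temporary file as stand-ins (both are assumptions for illustration); the same pattern applies to a fitted scikit-learn estimator.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted model (assumption: any picklable object behaves the same way).
stand_in_model = {"hidden_layer_sizes": (512,), "activation": "tanh"}

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.sav")

    # Context managers close the file handles even if an error occurs.
    with open(path, "wb") as file:
        pickle.dump(stand_in_model, file)

    with open(path, "rb") as file:
        restored = pickle.load(file)

print(restored == stand_in_model)
```

Pickle files should only be loaded from trusted sources, and a pickled estimator is best loaded with the same scikit-learn version that saved it.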

## Part 3 : Actually predicting the winner of the tournament using the match winner model

In [None]:
import pickle
filename = '/kaggle/input/modelmodel/scikitlearn/default/1/fine_tuned_MLP_model.sav'
with open(filename, 'rb') as file:
    model = pickle.load(file)
model.get_params()

{'activation': 'tanh',
 'alpha': 0.028397959255942663,
 'batch_size': 'auto',
 'beta_1': 0.9,
 'beta_2': 0.999,
 'early_stopping': False,
 'epsilon': 1e-08,
 'hidden_layer_sizes': (512,),
 'learning_rate': 'constant',
 'learning_rate_init': 0.005720263132860582,
 'max_fun': 15000,
 'max_iter': 2000,
 'momentum': 0.9,
 'n_iter_no_change': 10,
 'nesterovs_momentum': True,
 'power_t': 0.5,
 'random_state': 42,
 'shuffle': True,
 'solver': 'lbfgs',
 'tol': 0.0001,
 'validation_fraction': 0.1,
 'verbose': False,
 'warm_start': False}

### **Tournament Simulation Approach Using MLP Classifier**  

The given code simulates a full season of cricket matches using a trained **Multi-Layer Perceptron (MLP) classifier**. The process involves generating synthetic matchups, predicting outcomes using the trained model, and aggregating results to determine the season's winner.  

---

### **1. Generating Match Fixtures**  
- Unique teams are extracted from the dataset using `matches_df['team1_encoded'].unique()`.  
- **All possible matchups** between teams are generated using `itertools.combinations(teams, 2)`, so each pair of teams appears exactly once as a fixture.  
- A **points table** (`points_table`) is initialized as a dictionary, where each team starts with zero points.
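The fixture generation can be sketched in a few lines. The team codes below are hypothetical; the notebook itself takes them from `matches_df['team1_encoded']`.

```python
from itertools import combinations
from math import comb

# Hypothetical team codes (assumption: the notebook uses label-encoded integers).
teams = [0, 1, 2, 3]

# Every unordered pair appears exactly once.
matchups = list(combinations(teams, 2))

# Each team starts the season with zero points.
points_table = {team: 0 for team in teams}

print(matchups)
print(len(matchups) == comb(len(teams), 2))
```

With n teams this yields n(n-1)/2 fixtures, so 11 distinct team codes would give 55 pairings.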

---

### **2. Synthetic Match Generation**  
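- **Feature Distributions**: The code extracts the unique values of key match-related features (e.g., umpire, venue, season, result, toss decision), so synthetic matches only use values observed in the historical data. Each value is drawn uniformly from these unique values, not in proportion to its historical frequency.  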
- **Monte Carlo Simulation**: Each matchup is simulated **1000 times** to account for randomness in match conditions.  
- A new match instance is created by randomly sampling values from the historical dataset distributions. Features include:
  - **Teams**: `team1_encoded`, `team2_encoded`
  - **Match Conditions**: `target_runs`, `target_overs`, `result_runs`, `result_wickets`
  - **Venue and Umpires**: `venue_encoded`, `umpire1_encoded`, `umpire2_encoded`
  - **Seasonal Attributes**: `year`, `month`, `day`, `weekday`
  - **Match Type and Toss**: `match_type_encoded`, `toss_decision_encoded`, `toss_winner_encoded`
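Sampling from a feature's unique values treats every observed value as equally likely, which is not the same as sampling from its historical frequency. The sketch below contrasts the two on a hypothetical venue column; the column and the draw count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical venue column: venue 0 hosted 90 matches, venue 1 hosted 10.
venue_column = np.array([0] * 90 + [1] * 10)

n_draws = 10_000

# Uniform over unique values: each venue is equally likely.
uniform_draws = rng.choice(np.unique(venue_column), size=n_draws)

# Sampling from the full column instead preserves the historical frequencies.
empirical_draws = rng.choice(venue_column, size=n_draws)

uniform_share = (uniform_draws == 0).mean()
empirical_share = (empirical_draws == 0).mean()
print(f"Share of venue 0, uniform over unique values: {uniform_share:.2f}")
print(f"Share of venue 0, sampled from the column:    {empirical_share:.2f}")
```

Which variant is preferable depends on the goal: uniform sampling probes the model across all observed conditions, while frequency-based sampling mimics a typical season more closely.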

---

### **3. Match Outcome Prediction**  
- The trained **MLPClassifier model** takes the generated synthetic match data (`match_df`) as input and predicts the **winning team** (`predicted_winner`).  
- The `points_table` is updated by incrementing the points for the predicted winner.  
- If a team is not present in `points_table`, it is added dynamically using `points_table.get(predicted_winner, 0) + 1`.

---

### **4. Mapping Encoded Teams to Original Names**  
- Since label encoding was used, a mapping dictionary (`label_mapping`) converts encoded team IDs back to their original names.  
- Franchise renaming is handled:
  - `"Lucknow Super Giants"` points are merged into `"LSG"`.
  - `"Delhi Daredevils"` (`DD`) is renamed to `"Delhi Capitals"` (`DC`).
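One way to handle such renames is a small alias table applied while re-keying the points. This is a sketch with made-up point values; `aliases` is a hypothetical name and not part of the notebook.

```python
# Hypothetical points keyed by decoded team name (values are illustrative only).
points = {"LSG": 10, "Lucknow Super Giants": 5, "DD": 7, "MI": 3}

# Aliases mapping old or alternate names to the label that should be kept.
aliases = {"Lucknow Super Giants": "LSG", "DD": "DC"}

merged = {}
for name, value in points.items():
    canonical = aliases.get(name, name)          # fall back to the name itself
    merged[canonical] = merged.get(canonical, 0) + value

print(merged)
```

Compared with updating keys one by one, this version does not raise a `KeyError` when one of the old names is absent from the table.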

---

### **5. Season Winner Estimation**  
- The final **aggregated points table** ranks teams based on their predicted wins.
- The team with the highest points is the predicted season winner.
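The ranking step can be illustrated with the points table printed in this notebook's simulation output. The share calculation is an added convenience and not part of the original code.

```python
# Points table as printed in this notebook's simulation output.
final = {'RCB': 1746, 'PBKS': 139, 'MI': 1330, 'KKR': 1360, 'RR': 5246,
         'SRH': 25648, 'CSK': 101, 'GT': 2151, 'LSG': 1167, 'DC': 16112}

# Rank teams by predicted wins, highest first.
ranking = sorted(final.items(), key=lambda item: item[1], reverse=True)

winner, winner_points = ranking[0]
total_points = sum(final.values())

print(f"Predicted season winner: {winner}")
print(f"Share of all simulated wins: {winner_points / total_points:.1%}")
```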

---

### **Key Takeaways**  
- **Monte Carlo-style simulations** repeat every matchup many times, so the points table averages over randomly drawn match conditions instead of relying on a single draw.  
- The **MLP model generalizes** match conditions and predicts results based on historical patterns.  
- **Dynamic points tracking** keeps the tally valid even when the model predicts a team that was missing from the initial table.  
- **Post-processing ensures accurate team representation** despite name changes.  

This approach can be extended to **multi-stage tournaments**, knockout formats, or **weighted probability simulations** for enhanced realism.

In [None]:
import numpy as np
import pandas as pd
from itertools import combinations
from sklearn.neural_network import MLPClassifier
unique_values = {
    'umpire1_encoded': matches_df['umpire1_encoded'].unique(),
    'umpire2_encoded': matches_df['umpire2_encoded'].unique(),
    'venue_encoded': matches_df['venue_encoded'].unique(),
    'season_encoded': matches_df['season_encoded'].unique(),
    'year': matches_df['year'].unique(),
    'month': matches_df['month'].unique(),
    'day': matches_df['day'].unique(),
    'weekday': matches_df['weekday'].unique(),
    'quarter': matches_df['quarter'].unique(),
    'day_of_year': matches_df['day_of_year'].unique(),
    'result_encoded': matches_df['result_encoded'].unique(),
    'toss_decision_encoded': matches_df['toss_decision_encoded'].unique(),
    'match_type_encoded': matches_df['match_type_encoded'].unique(),
    'super_over_encoded': matches_df['super_over_encoded'].unique(),
    'method_encoded': matches_df['method_encoded'].unique(),
    'city_encoded': matches_df['city_encoded'].unique(),
    'result_runs': matches_df['result_runs'].unique(),
    'result_wickets': matches_df['result_wickets'].unique()
}
teams = matches_df['team1_encoded'].unique()
matchups = list(combinations(teams, 2))
points_table = {team: 0 for team in teams}
for team1, team2 in matchups:
    for _ in range(1000):
        synthetic_data = {
            # Note: 'target_runs' has no entry in unique_values, so it is drawn from the 'result_runs' values.
            'target_runs': np.random.choice(unique_values['result_runs']),
            'target_overs': np.random.choice([20]),  # Example overs
            'umpire1_encoded': np.random.choice(unique_values['umpire1_encoded']),
            'umpire2_encoded': np.random.choice(unique_values['umpire2_encoded']),
            'venue_encoded': np.random.choice(unique_values['venue_encoded']),
            'season_encoded': np.random.choice(unique_values['season_encoded']),
            'year': np.random.choice(unique_values['year']),
            'month': np.random.choice(unique_values['month']),
            'day': np.random.choice(unique_values['day']),
            'weekday': np.random.choice(unique_values['weekday']),
            'quarter': np.random.choice(unique_values['quarter']),
            'day_of_year': np.random.choice(unique_values['day_of_year']),
            'result_encoded': np.random.choice(unique_values['result_encoded']),
            'toss_decision_encoded': np.random.choice(unique_values['toss_decision_encoded']),
            'match_type_encoded': np.random.choice(unique_values['match_type_encoded']),
            'super_over_encoded': np.random.choice(unique_values['super_over_encoded']),
            'method_encoded': np.random.choice(unique_values['method_encoded']),
            'team1_encoded': team1,
            'team2_encoded': team2,
            'toss_winner_encoded': np.random.choice([team1, team2]),
            'city_encoded': np.random.choice(unique_values['city_encoded']),
            'result_runs': np.random.choice(unique_values['result_runs']),
            'result_wickets': np.random.choice(unique_values['result_wickets'])
        }
        match_df = pd.DataFrame([synthetic_data])
        predicted_winner = model.predict(match_df)[0]
        points_table[predicted_winner] = points_table.get(predicted_winner, 0) + 1
def update(encoded_team):
    return label_mapping[encoded_team]
final = {update(k): v for k, v in points_table.items()}
final['LSG'] += final['Lucknow Super Giants']
del final['Lucknow Super Giants']
final['DC'] = final.pop('DD')
print(final)

{'RCB': 1746, 'PBKS': 139, 'MI': 1330, 'KKR': 1360, 'RR': 5246, 'SRH': 25648, 'CSK': 101, 'GT': 2151, 'LSG': 1167, 'DC': 16112}


## Part 4 : Exploring Probabilistic Models like Naive Bayes and Markov Chains

In [None]:
import pandas as pd
df = pd.read_csv('./Brain Dead IPL Dataset/matches.csv', index_col=0)

### Match Outcome Prediction Using Multinomial Naive Bayes

The implementation below uses a Multinomial Naïve Bayes (`MultinomialNB`) classifier to predict match outcomes from historical data. The dataset is partitioned into training and testing sets, denoted (X_train, y_train) and (X_test, y_test), respectively. The class priors and feature likelihoods are estimated from the training data, and each prediction follows the maximum a posteriori (MAP) decision rule under the conditional independence assumption:

$P(y \mid X) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)$

where y represents the match outcome class, and X = (x_1, x_2, ..., x_n) denotes the feature vector.

Predictions $\hat{y}$ are generated for X_test, and model performance is assessed via the accuracy score:

$\text{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}(\hat{y}_i = y_i)$

where N is the total number of test samples. The computed accuracy quantifies the model’s generalization capability in predicting match results.
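The decision rule and the accuracy formula can be traced by hand on a tiny example. The sketch below uses made-up count data and Laplace smoothing with `alpha = 1.0` (an assumption here, although it matches scikit-learn's default); in the multinomial model each feature's log-likelihood is weighted by its count.

```python
import numpy as np

# Tiny hypothetical count data: two features, two classes.
X_train = np.array([[3, 0], [2, 1], [0, 3], [1, 2]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[4, 1], [0, 2]])
y_test = np.array([0, 1])

classes = np.unique(y_train)
alpha = 1.0  # Laplace smoothing (assumed)

# log P(y): class priors from relative frequencies.
log_prior = np.log(np.array([(y_train == c).mean() for c in classes]))

# log P(x_i | y): smoothed share of each feature's counts within a class.
counts = np.array([X_train[y_train == c].sum(axis=0) for c in classes]) + alpha
log_likelihood = np.log(counts / counts.sum(axis=1, keepdims=True))

# Unnormalized log posterior: log P(y) + sum_i x_i * log P(x_i | y).
log_posterior = log_prior + X_test @ log_likelihood.T

# MAP decision rule, then accuracy as the mean of the indicator 1(y_hat == y).
y_hat = classes[np.argmax(log_posterior, axis=1)]
accuracy = (y_hat == y_test).mean()
print(y_hat, accuracy)
```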

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
X = df.drop(columns=["winner_encoded"])
y = df["winner_encoded"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)
y_pred = nb_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Multinomial Naïve Bayes Accuracy: {accuracy:.4f}")

Multinomial Naïve Bayes Accuracy: 0.1781


### Match Outcome Prediction Using Markov Chain Monte Carlo

This model implements a Bayesian Multinomial Logistic Regression using Markov Chain Monte Carlo (MCMC) inference via No-U-Turn Sampling (NUTS). The model aims to predict match winners (winner_encoded) based on a sequence of six structured input features.

**1. Model Definition (Bayesian Logistic Regression)**

The model assumes a probabilistic relationship between the input features and the match winner. It places a prior distribution over the model parameters:

* Weights (`weights`): each feature's contribution to the logit of each possible winner is drawn from a Normal(0, 1) prior.
* Bias (`bias`): the intercept term for each possible winner is also drawn from a Normal(0, 1) prior.

Given an input feature matrix $X$, the model computes the logits as:

$\text{logits} = XW + b$

where $W$ is the weight matrix and $b$ is the bias vector.

The logits parameterize a Categorical likelihood over the possible winners.

**2. Bayesian Inference via MCMC (NUTS)**

Instead of optimizing point estimates of the parameters (as in traditional logistic regression), the model infers a posterior distribution over the weights and biases using NUTS.

NUTS is an adaptive variant of Hamiltonian Monte Carlo (HMC) that explores high-dimensional parameter spaces efficiently without manual tuning of the trajectory length.

The sampler runs 300 warm-up (burn-in) steps, which are discarded, and then draws 500 samples to approximate the posterior distribution over the parameters.

**3. Prediction using the Posterior Distribution**

To predict match winners on the test set:

* The input features X_test are passed through every posterior sample of the weights and biases.
* The resulting logits are averaged over all posterior samples.
* The predicted winner is the argmax over the averaged logits, i.e. the team with the highest mean logit.

**4. Evaluation**

The model's predictions are compared against the actual winners in the test set. Accuracy is computed as the fraction of correctly predicted outcomes; weighted precision, recall, and F1-score are reported alongside it.
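The prediction step can be sketched without running a sampler. The example below uses randomly generated arrays as stand-ins for posterior samples (an assumption for illustration) and compares averaging the logits, as this notebook does, with averaging the softmax probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)
num_samples, num_features, num_classes = 200, 2, 2

# Hypothetical posterior samples: weights scattered around the identity matrix,
# biases around zero (stand-ins for real MCMC output).
weights = np.eye(num_features, num_classes) + 0.1 * rng.standard_normal(
    (num_samples, num_features, num_classes))
bias = 0.1 * rng.standard_normal((num_samples, num_classes))

X = np.array([[5.0, 0.0], [0.0, 5.0]])  # two test points

# Logits for every posterior sample: shape [S, batch, C].
logits = np.einsum("bf,sfc->sbc", X, weights) + bias[:, None, :]

# Rule used in this notebook: average the logits, then take the argmax.
pred_mean_logit = logits.mean(axis=0).argmax(axis=1)

# Alternative: average the softmax probabilities (the posterior predictive).
shifted = logits - logits.max(axis=2, keepdims=True)   # for numerical stability
probs = np.exp(shifted) / np.exp(shifted).sum(axis=2, keepdims=True)
pred_mean_prob = probs.mean(axis=0).argmax(axis=1)

print(pred_mean_logit, pred_mean_prob)
```

The two rules agree on this toy input, but they can differ when the posterior samples disagree strongly about a data point.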

In [None]:
%pip install pyro-ppl

In [None]:
import pandas as pd
import numpy as np
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import MCMC, NUTS
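from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns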

# ---------------------------
# Data Preparation
# ---------------------------
# Filter rows that have valid 'team1_encoded' values.
df = matches_df[matches_df['team1_encoded'].notna()]

# Define features in the required order and the target column.
features = ['target_runs', 'team1_encoded', 'team2_encoded',
            'toss_winner_encoded', 'result_runs', 'result_wickets']
target = 'winner_encoded'

# Extract feature matrix X and target vector y.
X = df[features].values.astype(np.float32)
y = df[target].values.astype(np.int64)  # assuming winner_encoded is integer-encoded

# Scale features to have zero mean and unit variance.
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Split the dataset into training (80%) and test (20%) sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert data to PyTorch tensors.
X_train = torch.tensor(X_train)
y_train = torch.tensor(y_train)
X_test = torch.tensor(X_test)
y_test = torch.tensor(y_test)

num_features = X_train.shape[1]
num_classes = int(y.max()) + 1  # Categorical expects labels in 0..num_classes-1, even if a class is absent

# ---------------------------
# Define the Bayesian Logistic Regression Model
# ---------------------------
def model(X, y=None):
    # Priors for weights and biases (set with mean 0 and std 1)
    weight_prior = dist.Normal(torch.zeros(num_features, num_classes),
                               torch.ones(num_features, num_classes))
    bias_prior = dist.Normal(torch.zeros(num_classes), torch.ones(num_classes))

    # Sample weights and biases from their priors.
    weights = pyro.sample("weights", weight_prior)
    bias = pyro.sample("bias", bias_prior)

    # Compute logits: [batch_size, num_classes]
    logits = torch.matmul(X, weights) + bias

    # Likelihood: Categorical distribution for each data point.
    with pyro.plate("data", X.shape[0]):
        obs = pyro.sample("obs", dist.Categorical(logits=logits), obs=y)
    return logits

# ---------------------------
# Run MCMC to Sample the Posterior
# ---------------------------
# Use the No-U-Turn Sampler (NUTS) for MCMC.
nuts_kernel = NUTS(model)
mcmc = MCMC(nuts_kernel, num_samples=500, warmup_steps=300, num_chains=1)
mcmc.run(X_train, y_train)
posterior_samples = mcmc.get_samples()

# ---------------------------
# Posterior Predictive Function
# ---------------------------
def predict(X, posterior_samples):
    """
    For each test sample, compute the logits using each posterior sample,
    average the logits, and return the class with the highest average logit.
    """
    # Posterior samples shapes:
    # weights: [num_samples, num_features, num_classes]
    # bias: [num_samples, num_classes]
    weights_samples = posterior_samples["weights"]  # [S, F, C]
    bias_samples = posterior_samples["bias"]        # [S, C]

    num_posterior = weights_samples.shape[0]
    batch_size = X.shape[0]

    # Expand X for each posterior sample: shape becomes [S, batch_size, num_features]
    X_expanded = X.unsqueeze(0).expand(num_posterior, batch_size, num_features)

    # Compute logits for each posterior sample: [S, batch_size, num_classes]
    logits_samples = torch.bmm(X_expanded, weights_samples) + bias_samples.unsqueeze(1)

    # Average logits over the posterior samples to get mean logits for each test sample.
    mean_logits = logits_samples.mean(0)

    # Predict by taking the argmax over classes.
    predictions = torch.argmax(mean_logits, dim=1)
    return predictions

# Predict on the test set.
predictions = predict(X_test, posterior_samples)
accuracy = accuracy_score(y_test.numpy(), predictions.numpy())
precision = precision_score(y_test.numpy(), predictions.numpy(), average='weighted')
recall = recall_score(y_test.numpy(), predictions.numpy(), average='weighted')
f1 = f1_score(y_test.numpy(), predictions.numpy(), average='weighted')

print(f"MCMC Bayesian Logistic Regression Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}, Recall: {recall:.4f}, F1-Score: {f1:.4f}")

# ---------------------------
# Confusion Matrix Plot
# ---------------------------
cm = confusion_matrix(y_test.numpy(), predictions.numpy())

plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='viridis', xticklabels=np.unique(y_test), yticklabels=np.unique(y_test))
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('MCMC Bayesian Logistic Regression - Confusion Matrix')
plt.show()

Markov Chain Test Accuracy: 0.1651


### Match Outcome Prediction Using Categorical Naive Bayes

The implemented model employs a **Categorical Naive Bayes (CNB) classifier** for match outcome prediction. This classifier is particularly suited for categorical data and estimates posterior probabilities using **Bayes' theorem** under the assumption of conditional independence among features.

Model Training and Prediction

The CNB classifier is trained on historical match data using **CategoricalNB()**. The feature set $X$ consists of categorical match-related attributes, while the target variable $y$ represents the match outcome. The classifier computes the likelihood of each possible class (match winner) given the feature values and assigns the most probable class label.

Performance Evaluation

After training, the model is tested on unseen data, and its performance is assessed using multiple evaluation metrics:

* **Accuracy Score**: Measures the proportion of correctly predicted match outcomes.

* **Precision Score**: Defined as $\frac{TP}{TP + FP}$, where $TP$ denotes true positives and $FP$ represents false positives. It quantifies the reliability of positive predictions.

* **Recall Score**: Given by $\frac{TP}{TP + FN}$, where $FN$ denotes false negatives. It evaluates the ability of the model to identify all relevant instances.

* **F1-Score**: The harmonic mean of precision and recall, computed as $2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$. This metric balances precision and recall, providing a single score to measure overall performance.
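These metrics can be computed by hand from per-class counts. The sketch below uses made-up counts for two hypothetical classes and also shows support-weighted averaging, which corresponds to the `average='weighted'` option used in this notebook.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from raw counts for one class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts per class: (TP, FP, FN).
per_class = {"team_a": (8, 2, 4), "team_b": (3, 4, 2)}

scores = {}
supports = {}
for name, (tp, fp, fn) in per_class.items():
    scores[name] = precision_recall_f1(tp, fp, fn)
    supports[name] = tp + fn  # number of true instances of the class

# Weighted average: each class contributes in proportion to its support.
total_support = sum(supports.values())
weighted_f1 = sum(scores[name][2] * supports[name] / total_support for name in scores)

print(scores)
print(f"Weighted F1: {weighted_f1:.4f}")
```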



In [None]:
from sklearn.naive_bayes import CategoricalNB

nb_model = CategoricalNB()
nb_model.fit(X_train, y_train)
y_pred = nb_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
# Store the score as `f1` so the imported f1_score function is not overwritten.
f1 = f1_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
precision = precision_score(y_test, y_pred, average='weighted')

print(f"Categorical Naïve Bayes Accuracy: {accuracy:.4f}")
print(f"Categorical Naïve Bayes F1 Score: {f1:.4f}")
print(f"Categorical Naïve Bayes Recall: {recall:.4f}")
print(f"Categorical Naïve Bayes Precision: {precision:.4f}")

# Keep the matrix in `cm` so the imported confusion_matrix function is not overwritten.
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='rocket', xticklabels=np.unique(y_test), yticklabels=np.unique(y_test))
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Categorical Naive Bayes Confusion Matrix')
plt.show()


Categorical Naïve Bayes Accuracy: 0.5114


### Match Outcome Prediction Using Gradient Boosting with Hyperparameter Tuning

This implementation employs a **Gradient Boosting Classifier (GBC)** for match outcome prediction, leveraging an ensemble of weak learners to iteratively reduce prediction errors. The model is trained using Gradient Tree Boosting, where each subsequent tree corrects the residual errors of the previous trees.
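The residual-correction idea can be sketched on a toy regression problem. This is a simplified illustration with squared loss and one-split stumps, not the multi-class procedure that `GradientBoostingClassifier` runs internally; `fit_stump` and the data are hypothetical.

```python
import numpy as np

def fit_stump(x, residual):
    """Fit a one-split regression stump to the residuals by least squares."""
    best = None
    for threshold in np.unique(x)[:-1]:          # keep both sides non-empty
        left = x <= threshold
        left_value, right_value = residual[left].mean(), residual[~left].mean()
        sse = ((residual - np.where(left, left_value, right_value)) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    _, threshold, left_value, right_value = best
    return lambda z: np.where(z <= threshold, left_value, right_value)

x = np.linspace(0.0, 1.0, 50)
y = np.sin(2 * np.pi * x)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())           # start from a constant model
mse_history = [float(((y - prediction) ** 2).mean())]

for _ in range(50):                              # 50 boosting rounds
    residual = y - prediction                    # what the ensemble still gets wrong
    stump = fit_stump(x, residual)
    prediction = prediction + learning_rate * stump(x)   # shrunken correction
    mse_history.append(float(((y - prediction) ** 2).mean()))

print(f"MSE before boosting: {mse_history[0]:.4f}, after: {mse_history[-1]:.4f}")
```

The number of rounds plays the role of `n_estimators`, and the shrinkage factor plays the role of `learning_rate`: smaller steps need more rounds to reach the same training error.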

To enhance performance and mitigate overfitting, GridSearchCV is employed for hyperparameter tuning. A pipeline is constructed, integrating the classifier with a hyperparameter search strategy. The optimization focuses on key hyperparameters:

* **n_estimators**: Controls the number of boosting iterations, influencing model complexity and convergence.

* **learning_rate (α)**: Determines the contribution of each tree to the final prediction, balancing the bias-variance tradeoff.

* **max_depth**: Regulates tree depth, controlling model capacity and preventing overfitting to training data.

The model is trained on the encoded historical match data. Predictions are generated on the test set, and model performance is assessed using the accuracy score.

The best hyperparameters are identified based on cross-validation performance, ensuring an optimal tradeoff between predictive accuracy and model complexity.
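The cost of an exhaustive grid search is easy to estimate before running it. The sketch below enumerates the grid used in this notebook's `GridSearchCV` cell with the standard library only; the variable names are illustrative.

```python
from itertools import product

# Same grid as in the GridSearchCV cell of this notebook.
param_grid = {
    'classifier__n_estimators': [100, 200],
    'classifier__learning_rate': [0.01, 0.1, 0.2],
    'classifier__max_depth': [3, 5, 7],
}
cv_folds = 5

# Every combination the search will try (Cartesian product of the value lists).
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_candidates = len(candidates)
n_fits = n_candidates * cv_folds   # one fit per candidate per fold
print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

With `GridSearchCV`'s default `refit=True`, one additional fit on the full training set follows for the best candidate.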

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Assuming X and y are your features and target variable
# Check the feature names in your dataframe
print(X.columns)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


# Define a pipeline with a classifier
pipeline = Pipeline(steps=[
    ('classifier', GradientBoostingClassifier(random_state=42))
])

# Define a grid of hyperparameters to search
param_grid = {
    'classifier__n_estimators': [100, 200],
    'classifier__learning_rate': [0.01, 0.1, 0.2],
    'classifier__max_depth': [3, 5, 7]
}

# Use GridSearchCV to find the best hyperparameters
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy', error_score='raise')
grid_search.fit(X_train, y_train)

# Make predictions on the test set
y_pred = grid_search.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.4f}")

# Print the best parameters
print("Best hyperparameters:", grid_search.best_params_)

Index(['target_runs', 'target_overs', 'umpire1_encoded', 'umpire2_encoded',
       'venue_encoded', 'season_encoded', 'year', 'month', 'day', 'weekday',
       'quarter', 'day_of_year', 'result_encoded', 'toss_decision_encoded',
       'match_type_encoded', 'super_over_encoded', 'method_encoded',
       'team1_encoded', 'team2_encoded', 'toss_winner_encoded', 'city_encoded',
       'result_runs', 'result_wickets'],
      dtype='object')




Model Accuracy: 0.7717
Best hyperparameters: {'classifier__learning_rate': 0.1, 'classifier__max_depth': 7, 'classifier__n_estimators': 200}


### Match Outcome Prediction Using Stacked Ensemble with Randomized Search

This implementation utilizes a Stacked Ensemble Learning approach to improve match outcome prediction by leveraging the strengths of multiple classifiers. The ensemble consists of Random Forest (RF) and XGBoost (XGB) as base learners, while Logistic Regression (LR) serves as the meta-learner (final estimator) to aggregate predictions from base models.

The training process follows a two-layer architecture:

Base Learners:

* Random Forest (RF): Constructs multiple decision trees in a bootstrap aggregation (bagging) framework, reducing variance and enhancing generalization.

* XGBoost (XGB): Implements an optimized gradient boosting framework, iteratively minimizing residual errors using second-order gradients.

Stacking Mechanism:

* The cross-validated (out-of-fold) predictions from RF and XGB form a new feature space.

* A Logistic Regression model is trained on this feature set, learning optimal weightings for final predictions.

To optimize hyperparameters efficiently, **RandomizedSearchCV** is applied, tuning the following key parameters:

* `rf__n_estimators`: Number of trees in the Random Forest.

* `rf__max_depth`: Depth of decision trees to control model complexity.

* `xgb__n_estimators`: Number of boosting rounds in XGBoost.

* `xgb__max_depth`: Maximum depth for XGB trees, affecting the model's capacity to capture non-linearity.

* `final_estimator__C`: Inverse of the regularization strength in Logistic Regression; smaller values mean stronger regularization.

The best-performing configuration is selected based on **Cross-Validation Accuracy**.
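The two-layer mechanism can be made explicit by building the meta-features by hand. This is a minimal sketch on synthetic data, with a decision tree standing in for XGBoost to keep dependencies small (both are assumptions for illustration); `StackingClassifier` performs the equivalent steps internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with three classes (assumption, for illustration only).
X, y = make_classification(n_samples=200, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)

# Two base learners; the decision tree is a stand-in for XGBoost.
base_models = [
    RandomForestClassifier(n_estimators=20, random_state=0),
    DecisionTreeClassifier(max_depth=3, random_state=0),
]

# Out-of-fold class probabilities: each row is predicted by a model that never saw it.
meta_blocks = [cross_val_predict(m, X, y, cv=5, method="predict_proba") for m in base_models]
meta_features = np.hstack(meta_blocks)          # shape: [n_samples, n_models * n_classes]

# The meta-learner is trained on the stacked probabilities; smaller C means stronger regularization.
meta_learner = LogisticRegression(C=1.0, max_iter=1000)
meta_learner.fit(meta_features, y)

print(meta_features.shape)
```

Using out-of-fold predictions keeps the meta-learner from simply trusting base models that have memorized their own training rows.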

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Ensure X and y are preprocessed correctly before splitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Define base models for stacking
estimators = [
    ('rf', RandomForestClassifier(random_state=42)),
    ('xgb', XGBClassifier(use_label_encoder=False, eval_metric='mlogloss', random_state=42))
]

# Stacked ensemble model
stacked_model = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(),
    cv=5,
    n_jobs=-1
)

# Define a parameter distribution for randomized search
param_dist = {
    'rf__n_estimators': [50, 100, 200, 250, 500],
    'rf__max_depth': [None, 10, 20, 30, 50],  # Removed 100 for efficiency
    'xgb__n_estimators': [50, 100, 150, 200, 250],
    'xgb__max_depth': [3, 6, 9, 12, 15],
    'final_estimator__C': [0.01, 0.1, 1, 10, 100]  # Increased range for better tuning
}

# Use RandomizedSearchCV for faster hyperparameter tuning
random_search = RandomizedSearchCV(
    estimator=stacked_model,
    param_distributions=param_dist,
    n_iter=30,  # Reduced to 30 for efficiency (can increase if needed)
    cv=5,
    n_jobs=-1,
    scoring='accuracy',
    random_state=42
)

random_search.fit(X_train, y_train)

# Retrieve the best model
best_model = random_search.best_estimator_
print("Best parameters found:", random_search.best_params_)

# Evaluate on test data
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

# Print evaluation metrics
print(f"Optimized Stacked Ensemble Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}, Recall: {recall:.4f}, F1-Score: {f1:.4f}")

# Store metrics for visualization
metrics = ["Accuracy", "Precision", "Recall", "F1-Score"]
scores = [accuracy, precision, recall, f1]

# Plot performance metrics
plt.figure(figsize=(10, 5))
plt.bar(metrics, scores, color=['blue', 'orange', 'green', 'red'])
plt.ylim(0, 1)
plt.xlabel("Metric")
plt.ylabel("Score")
plt.title("Stacked Ensemble Performance Metrics")
for i, v in enumerate(scores):
    plt.text(i, v + 0.02, f"{v:.2f}", ha='center', fontsize=12, fontweight='bold')
plt.show()

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=np.unique(y), yticklabels=np.unique(y))
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix for Stacked Ensemble Model")
plt.show()
