# 2025 IPL Winner Prediction using Statistical Learning

(c) Aritro 'sortira' Shome 2025-Present

## Part 1 : Cleaning and encoding the data for better model performance

Upon examining the dataset, a significant number of missing values are identified, along with the presence of multiple categorical (nominal) features. Since the predictive model performs optimally with numerical data, it is necessary to preprocess the dataset by handling missing values, encoding categorical variables, and applying feature engineering techniques. These steps ensure the dataset is properly structured and suitable for model training, thereby enhancing predictive performance.

In [2]:
import pandas as pd # required for the data cleaning part

In [3]:
data_path = '/kaggle/input/ipl-cricinfo-dataset/matches.csv' # path where the raw/provided dataset is stored

In [4]:
matches_df = pd.read_csv(data_path)
matches_df.drop(columns=['id'], inplace=True) # we don't need id as we automatically get an index in a dataframe
matches_df.head()

Unnamed: 0,season,city,date,match_type,player_of_match,venue,team1,team2,toss_winner,toss_decision,winner,result,result_margin,target_runs,target_overs,super_over,method,umpire1,umpire2
0,2007/08,Bangalore,2008-04-18,League,BB McCullum,M Chinnaswamy Stadium,Royal Challengers Bangalore,Kolkata Knight Riders,Royal Challengers Bangalore,field,Kolkata Knight Riders,runs,140.0,223.0,20.0,N,,Asad Rauf,RE Koertzen
1,2007/08,Chandigarh,2008-04-19,League,MEK Hussey,"Punjab Cricket Association Stadium, Mohali",Kings XI Punjab,Chennai Super Kings,Chennai Super Kings,bat,Chennai Super Kings,runs,33.0,241.0,20.0,N,,MR Benson,SL Shastri
2,2007/08,Delhi,2008-04-19,League,MF Maharoof,Feroz Shah Kotla,Delhi Daredevils,Rajasthan Royals,Rajasthan Royals,bat,Delhi Daredevils,wickets,9.0,130.0,20.0,N,,Aleem Dar,GA Pratapkumar
3,2007/08,Mumbai,2008-04-20,League,MV Boucher,Wankhede Stadium,Mumbai Indians,Royal Challengers Bangalore,Mumbai Indians,bat,Royal Challengers Bangalore,wickets,5.0,166.0,20.0,N,,SJ Davis,DJ Harper
4,2007/08,Kolkata,2008-04-20,League,DJ Hussey,Eden Gardens,Kolkata Knight Riders,Deccan Chargers,Deccan Chargers,bat,Kolkata Knight Riders,wickets,5.0,111.0,20.0,N,,BF Bowden,K Hariharan


In the dataset, the *season* and *date* attributes, despite being numerical in nature, are represented as string values. These must be appropriately converted into their respective numerical formats for effective processing. Additionally, categorical features such as *venue* and *city* require systematic organization to ensure consistency. Furthermore, these categorical variables, along with attributes like *umpires*, will be encoded using **Label Encoding** to transform them into a numerical representation suitable for model training.

In [5]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
matches_df['umpire1_encoded'] = le.fit_transform(matches_df['umpire1'])
matches_df['umpire2_encoded'] = le.fit_transform(matches_df['umpire2'])
matches_df['venue_encoded'] = le.fit_transform(matches_df['venue'])
matches_df['season_encoded'] = le.fit_transform(matches_df['season'])
matches_df = matches_df.drop(columns=['umpire1', 'umpire2', 'venue', 'season'])
matches_df.head(5)

Unnamed: 0,city,date,match_type,player_of_match,team1,team2,toss_winner,toss_decision,winner,result,result_margin,target_runs,target_overs,super_over,method,umpire1_encoded,umpire2_encoded,venue_encoded,season_encoded
0,Bangalore,2008-04-18,League,BB McCullum,Royal Challengers Bangalore,Kolkata Knight Riders,Royal Challengers Bangalore,field,Kolkata Knight Riders,runs,140.0,223.0,20.0,N,,9,41,23,0
1,Chandigarh,2008-04-19,League,MEK Hussey,Kings XI Punjab,Chennai Super Kings,Chennai Super Kings,bat,Chennai Super Kings,runs,33.0,241.0,20.0,N,,34,52,40,0
2,Delhi,2008-04-19,League,MF Maharoof,Delhi Daredevils,Rajasthan Royals,Rajasthan Royals,bat,Delhi Daredevils,wickets,9.0,130.0,20.0,N,,8,15,16,0
3,Mumbai,2008-04-20,League,MV Boucher,Mumbai Indians,Royal Challengers Bangalore,Mumbai Indians,bat,Royal Challengers Bangalore,wickets,5.0,166.0,20.0,N,,51,14,55,0
4,Kolkata,2008-04-20,League,DJ Hussey,Kolkata Knight Riders,Deccan Chargers,Deccan Chargers,bat,Kolkata Knight Riders,wickets,5.0,111.0,20.0,N,,10,24,14,0


Next, we perform feature extraction from the date variable by deriving attributes such as month, weekday, and quarter. These temporal features are crucial, as factors like fixture scheduling, seasonal variations, and weather conditions can significantly influence a team's win probability. Incorporating these derived features enhances the model's ability to capture patterns and dependencies related to match outcomes.

In [6]:
matches_df['date'] = pd.to_datetime(matches_df['date'])
matches_df['year'] = matches_df['date'].dt.year
matches_df['month'] = matches_df['date'].dt.month
matches_df['day'] = matches_df['date'].dt.day
matches_df['weekday'] = matches_df['date'].dt.weekday  # Monday=0, Sunday=6
matches_df['quarter'] = matches_df['date'].dt.quarter
matches_df['day_of_year'] = matches_df['date'].dt.dayofyear
matches_df.drop(columns=['date'], inplace=True)
matches_df.head(3)

Unnamed: 0,city,match_type,player_of_match,team1,team2,toss_winner,toss_decision,winner,result,result_margin,...,umpire1_encoded,umpire2_encoded,venue_encoded,season_encoded,year,month,day,weekday,quarter,day_of_year
0,Bangalore,League,BB McCullum,Royal Challengers Bangalore,Kolkata Knight Riders,Royal Challengers Bangalore,field,Kolkata Knight Riders,runs,140.0,...,9,41,23,0,2008,4,18,4,2,109
1,Chandigarh,League,MEK Hussey,Kings XI Punjab,Chennai Super Kings,Chennai Super Kings,bat,Chennai Super Kings,runs,33.0,...,34,52,40,0,2008,4,19,5,2,110
2,Delhi,League,MF Maharoof,Delhi Daredevils,Rajasthan Royals,Rajasthan Royals,bat,Delhi Daredevils,wickets,9.0,...,8,15,16,0,2008,4,19,5,2,110


To ensure the dataset is suitable for model training, categorical variables such as *result*, *toss_decision*, *match_type*, *super_over*, and *method* are encoded using **Label Encoding**. This process converts categorical values into numerical representations while preserving their relative distinctions. Following the encoding, the original categorical columns are removed to avoid redundancy. This transformation enhances the model's ability to interpret and utilize these features effectively.

In [7]:
matches_df['result_encoded'] = le.fit_transform(matches_df['result'])
matches_df.drop(columns=['result'], inplace=True)
matches_df['toss_decision_encoded'] = le.fit_transform(matches_df['toss_decision'])
matches_df.drop(columns=['toss_decision'], inplace=True)
matches_df['match_type_encoded'] = le.fit_transform(matches_df['match_type'])
matches_df.drop(columns=['match_type'], inplace=True)
matches_df['super_over_encoded'] = le.fit_transform(matches_df['super_over'])
matches_df.drop(columns=['super_over'], inplace=True)
matches_df['method_encoded'] = le.fit_transform(matches_df['method'])
matches_df.drop(columns=['method'], inplace=True)
matches_df.head(3)

Unnamed: 0,city,player_of_match,team1,team2,toss_winner,winner,result_margin,target_runs,target_overs,umpire1_encoded,...,month,day,weekday,quarter,day_of_year,result_encoded,toss_decision_encoded,match_type_encoded,super_over_encoded,method_encoded
0,Bangalore,BB McCullum,Royal Challengers Bangalore,Kolkata Knight Riders,Royal Challengers Bangalore,Kolkata Knight Riders,140.0,223.0,20.0,9,...,4,18,4,2,109,1,1,4,0,1
1,Chandigarh,MEK Hussey,Kings XI Punjab,Chennai Super Kings,Chennai Super Kings,Chennai Super Kings,33.0,241.0,20.0,34,...,4,19,5,2,110,1,0,4,0,1
2,Delhi,MF Maharoof,Delhi Daredevils,Rajasthan Royals,Rajasthan Royals,Delhi Daredevils,9.0,130.0,20.0,8,...,4,19,5,2,110,3,0,4,0,1


In [8]:
matches_df['team1'].unique()

array(['Royal Challengers Bangalore', 'Kings XI Punjab',
       'Delhi Daredevils', 'Mumbai Indians', 'Kolkata Knight Riders',
       'Rajasthan Royals', 'Deccan Chargers', 'Chennai Super Kings',
       'Kochi Tuskers Kerala', 'Pune Warriors', 'Sunrisers Hyderabad',
       'Gujarat Lions', 'Rising Pune Supergiants',
       'Rising Pune Supergiant', 'Delhi Capitals', 'Punjab Kings',
       'Lucknow Super Giants', 'Gujarat Titans',
       'Royal Challengers Bengaluru'], dtype=object)

Historical changes in team names and franchises have introduced inconsistencies in the dataset. For instance, *Kings XI Punjab* was rebranded as *Punjab Kings*, and *Delhi Daredevils* was renamed *Delhi Capitals*. Additionally, *Chennai Super Kings (CSK)* and *Rajasthan Royals (RR)* were temporarily suspended and replaced with interim teams. Similarly, *Deccan Chargers* transitioned into *Sunrisers Hyderabad (SRH)*, while *Kochi Tuskers Kerala* was replaced by *Gujarat Titans (GT)*. 

To standardize team names and maintain consistency in the dataset, all team name variations have been mapped to their latest or most commonly recognized abbreviations. This ensures uniformity in the data and prevents discrepancies during model training. The replacements were performed systematically for both *team1* and *team2* columns, aligning all team references to their standardized short forms.

In [9]:
matches_df['team1'] = matches_df['team1'].replace('Rising Pune Supergiants', 'CSK')
matches_df['team1'] = matches_df['team1'].replace('Rising Pune Supergiant', 'CSK')

matches_df['team2'] = matches_df['team2'].replace('Rising Pune Supergiants', 'CSK')
matches_df['team2'] = matches_df['team2'].replace('Rising Pune Supergiant', 'CSK')

matches_df['team1'] = matches_df['team1'].replace('Royal Challengers Bengaluru', 'RCB')
matches_df['team1'] = matches_df['team1'].replace('Royal Challengers Bangalore', 'RCB')

matches_df['team2'] = matches_df['team2'].replace('Royal Challengers Bengaluru', 'RCB')
matches_df['team2'] = matches_df['team2'].replace('Royal Challengers Bangalore', 'RCB')

matches_df['team1'] = matches_df['team1'].replace('Gujarat Lions', 'RR')
matches_df['team2'] = matches_df['team2'].replace('Gujarat Lions', 'RR')

matches_df['team1'] = matches_df['team1'].replace('Rajasthan Royals', 'RR')
matches_df['team2'] = matches_df['team2'].replace('Rajasthan Royals', 'RR')

matches_df['team1'] = matches_df['team1'].replace('Chennai Super Kings', 'CSK')
matches_df['team2'] = matches_df['team2'].replace('Chennai Super Kings', 'CSK')

matches_df['team1'] = matches_df['team1'].replace('Sunrisers Hyderabad', 'SRH')
matches_df['team2'] = matches_df['team2'].replace('Sunrisers Hyderabad', 'SRH')

matches_df['team1'] = matches_df['team1'].replace('Deccan Chargers', 'SRH')
matches_df['team2'] = matches_df['team2'].replace('Deccan Chargers', 'SRH')

matches_df['team1'] = matches_df['team1'].replace('Delhi Daredevils', 'DD')
matches_df['team2'] = matches_df['team2'].replace('Delhi Daredevils', 'DD')

matches_df['team1'] = matches_df['team1'].replace('Delhi Capitals', 'DD')
matches_df['team2'] = matches_df['team2'].replace('Delhi Capitals', 'DD')

matches_df['team1'] = matches_df['team1'].replace('Kings XI Punjab', 'PBKS')
matches_df['team2'] = matches_df['team2'].replace('Kings XI Punjab', 'PBKS')

matches_df['team1'] = matches_df['team1'].replace('Pune Warriors', 'LSG')
matches_df['team2'] = matches_df['team2'].replace('Pune Warriors', 'LSG')

matches_df['team1'] = matches_df['team1'].replace('Punjab Kings', 'PBKS')
matches_df['team2'] = matches_df['team2'].replace('Punjab Kings', 'PBKS')

matches_df['team1'] = matches_df['team1'].replace('Kochi Tuskers Kerala', 'GT')
matches_df['team2'] = matches_df['team2'].replace('Kochi Tuskers Kerala', 'GT')

matches_df['team1'] = matches_df['team1'].replace('Gujarat Titans', 'GT')
matches_df['team2'] = matches_df['team2'].replace('Gujarat Titans', 'GT')

matches_df['team1'] = matches_df['team1'].replace('Lucknow Supergiants', 'LSG')
matches_df['team2'] = matches_df['team2'].replace('Lucknow Supergiants', 'LSG')

matches_df['team1'] = matches_df['team1'].replace('Kolkata Knight Riders', 'KKR')
matches_df['team2'] = matches_df['team2'].replace('Kolkata Knight Riders', 'KKR')

matches_df['team1'] = matches_df['team1'].replace('Mumbai Indians', 'MI')
matches_df['team2'] = matches_df['team2'].replace('Mumbai Indians', 'MI')

In [10]:
matches_df['toss_winner'] = matches_df['toss_winner'].replace('Rising Pune Supergiants', 'CSK')
matches_df['toss_winner'] = matches_df['toss_winner'].replace('Rising Pune Supergiant', 'CSK')

matches_df['toss_winner'] = matches_df['toss_winner'].replace('Royal Challengers Bengaluru', 'RCB')
matches_df['toss_winner'] = matches_df['toss_winner'].replace('Royal Challengers Bangalore', 'RCB')

matches_df['toss_winner'] = matches_df['toss_winner'].replace('Gujarat Lions', 'RR')
matches_df['toss_winner'] = matches_df['toss_winner'].replace('Rajasthan Royals', 'RR')

matches_df['toss_winner'] = matches_df['toss_winner'].replace('Chennai Super Kings', 'CSK')

matches_df['toss_winner'] = matches_df['toss_winner'].replace('Sunrisers Hyderabad', 'SRH')
matches_df['toss_winner'] = matches_df['toss_winner'].replace('Deccan Chargers', 'SRH')

matches_df['toss_winner'] = matches_df['toss_winner'].replace('Delhi Daredevils', 'DD')
matches_df['toss_winner'] = matches_df['toss_winner'].replace('Delhi Capitals', 'DD')

matches_df['toss_winner'] = matches_df['toss_winner'].replace('Kings XI Punjab', 'PBKS')
matches_df['toss_winner'] = matches_df['toss_winner'].replace('Pune Warriors', 'LSG')

matches_df['toss_winner'] = matches_df['toss_winner'].replace('Punjab Kings', 'PBKS')

matches_df['toss_winner'] = matches_df['toss_winner'].replace('Kochi Tuskers Kerala', 'GT')
matches_df['toss_winner'] = matches_df['toss_winner'].replace('Gujarat Titans', 'GT')

matches_df['toss_winner'] = matches_df['toss_winner'].replace('Lucknow Supergiants', 'LSG')
matches_df['toss_winner'] = matches_df['toss_winner'].replace('Kolkata Knight Riders', 'KKR')

matches_df['toss_winner'] = matches_df['toss_winner'].replace('Mumbai Indians', 'MI')

In [11]:
matches_df['winner'] = matches_df['winner'].replace('Rising Pune Supergiants', 'CSK')
matches_df['winner'] = matches_df['winner'].replace('Rising Pune Supergiant', 'CSK')

matches_df['winner'] = matches_df['winner'].replace('Royal Challengers Bengaluru', 'RCB')
matches_df['winner'] = matches_df['winner'].replace('Royal Challengers Bangalore', 'RCB')

matches_df['winner'] = matches_df['winner'].replace('Gujarat Lions', 'RR')
matches_df['winner'] = matches_df['winner'].replace('Rajasthan Royals', 'RR')

matches_df['winner'] = matches_df['winner'].replace('Chennai Super Kings', 'CSK')

matches_df['winner'] = matches_df['winner'].replace('Sunrisers Hyderabad', 'SRH')
matches_df['winner'] = matches_df['winner'].replace('Deccan Chargers', 'SRH')

matches_df['winner'] = matches_df['winner'].replace('Delhi Daredevils', 'DD')
matches_df['winner'] = matches_df['winner'].replace('Delhi Capitals', 'DD')

matches_df['winner'] = matches_df['winner'].replace('Kings XI Punjab', 'PBKS')
matches_df['winner'] = matches_df['winner'].replace('Pune Warriors', 'LSG')

matches_df['winner'] = matches_df['winner'].replace('Punjab Kings', 'PBKS')

matches_df['winner'] = matches_df['winner'].replace('Kochi Tuskers Kerala', 'GT')
matches_df['winner'] = matches_df['winner'].replace('Gujarat Titans', 'GT')

matches_df['winner'] = matches_df['winner'].replace('Lucknow Supergiants', 'LSG')
matches_df['winner'] = matches_df['winner'].replace('Kolkata Knight Riders', 'KKR')

matches_df['winner'] = matches_df['winner'].replace('Mumbai Indians', 'MI')


In [12]:
matches_df.head(10)

Unnamed: 0,city,player_of_match,team1,team2,toss_winner,winner,result_margin,target_runs,target_overs,umpire1_encoded,...,month,day,weekday,quarter,day_of_year,result_encoded,toss_decision_encoded,match_type_encoded,super_over_encoded,method_encoded
0,Bangalore,BB McCullum,RCB,KKR,RCB,KKR,140.0,223.0,20.0,9,...,4,18,4,2,109,1,1,4,0,1
1,Chandigarh,MEK Hussey,PBKS,CSK,CSK,CSK,33.0,241.0,20.0,34,...,4,19,5,2,110,1,0,4,0,1
2,Delhi,MF Maharoof,DD,RR,RR,DD,9.0,130.0,20.0,8,...,4,19,5,2,110,3,0,4,0,1
3,Mumbai,MV Boucher,MI,RCB,MI,RCB,5.0,166.0,20.0,51,...,4,20,6,2,111,3,0,4,0,1
4,Kolkata,DJ Hussey,KKR,SRH,SRH,KKR,5.0,111.0,20.0,10,...,4,20,6,2,111,3,0,4,0,1
5,Jaipur,SR Watson,RR,PBKS,PBKS,RR,6.0,167.0,20.0,8,...,4,21,0,2,112,3,0,4,0,1
6,Hyderabad,V Sehwag,SRH,DD,SRH,DD,9.0,143.0,20.0,24,...,4,22,1,2,113,3,0,4,0,1
7,Chennai,ML Hayden,CSK,MI,MI,CSK,6.0,209.0,20.0,18,...,4,23,2,2,114,1,1,4,0,1
8,Hyderabad,YK Pathan,SRH,RR,RR,RR,3.0,215.0,20.0,9,...,4,24,3,2,115,3,1,4,0,1
9,Chandigarh,KC Sangakkara,PBKS,MI,MI,PBKS,66.0,183.0,20.0,8,...,4,25,4,2,116,1,1,4,0,1


Now that the teams and the historical discrepancies have been resolved it's time to Label Encode them. 

In [13]:
team_columns = ['team1', 'team2', 'toss_winner', 'winner']
city_column = 'city'
team_le = LabelEncoder()
all_teams = pd.concat([matches_df[col] for col in team_columns])
team_le.fit(all_teams)
label_mapping = {index: label for index, label in enumerate(team_le.classes_)}
for col in team_columns:
    matches_df[f'{col}_encoded'] = team_le.transform(matches_df[col])
    matches_df.drop(columns=[col], inplace=True)  
city_le = LabelEncoder()
matches_df[f'{city_column}_encoded'] = city_le.fit_transform(matches_df[city_column])
matches_df.drop(columns=[city_column], inplace=True)  

Since player of the match is a feature dependent or rather a consequence of the target variable, it is safe to drop it and reduce one extra dimension.

In [14]:
matches_df.drop(columns=['player_of_match'], inplace=True)
matches_df.head(10)

Unnamed: 0,result_margin,target_runs,target_overs,umpire1_encoded,umpire2_encoded,venue_encoded,season_encoded,year,month,day,...,result_encoded,toss_decision_encoded,match_type_encoded,super_over_encoded,method_encoded,team1_encoded,team2_encoded,toss_winner_encoded,winner_encoded,city_encoded
0,140.0,223.0,20.0,9,41,23,0,2008,4,18,...,1,1,4,0,1,8,3,8,3,2
1,33.0,241.0,20.0,34,52,40,0,2008,4,19,...,1,0,4,0,1,7,0,0,0,7
2,9.0,130.0,20.0,8,15,16,0,2008,4,19,...,3,0,4,0,1,1,9,9,1,10
3,5.0,166.0,20.0,51,14,55,0,2008,4,20,...,3,0,4,0,1,6,8,6,8,26
4,5.0,111.0,20.0,10,24,14,0,2008,4,20,...,3,0,4,0,1,3,10,10,3,23
5,6.0,167.0,20.0,8,40,46,0,2008,4,21,...,3,0,4,0,1,9,7,7,9,18
6,9.0,143.0,20.0,24,4,42,0,2008,4,22,...,3,0,4,0,1,10,1,10,1,16
7,6.0,209.0,20.0,18,15,27,0,2008,4,23,...,1,1,4,0,1,0,6,6,0,8
8,3.0,215.0,20.0,9,30,42,0,2008,4,24,...,3,1,4,0,1,10,9,9,9,16
9,66.0,183.0,20.0,8,4,40,0,2008,4,25,...,1,1,4,0,1,7,6,6,7,7


To enhance the dataset's interpretability and facilitate model training, the *result_margin* column has been processed to create two distinct numerical features: **result_runs** and **result_wickets**. 

- The **result_runs** column captures the margin of victory in terms of runs when the match outcome corresponds to *result_encoded = 1* (indicating a win by runs).  
- The **result_wickets** column reflects the margin of victory in terms of wickets when *result_encoded = 3* (indicating a win by wickets).  

For matches where the result does not fall under these categories (i.e., cases where *result_encoded* is neither 1 nor 3), both **result_runs** and **result_wickets** are set to zero to maintain consistency in the data structure. 

After this transformation, the original *result_margin* column, which contained mixed numerical values (runs or wickets), is removed to prevent redundancy. This refined representation ensures clarity in distinguishing between different types of match outcomes, thereby improving the dataset's usability for predictive modeling.

In [15]:
matches_df['result_runs'] = 0 
matches_df['result_wickets'] = 0  
matches_df.loc[matches_df['result_encoded'] == 1, 'result_runs'] = matches_df['result_margin']
matches_df.loc[matches_df['result_encoded'] == 3, 'result_wickets'] = matches_df['result_margin']
matches_df.loc[~matches_df['result_encoded'].isin([1, 3]), ['result_runs', 'result_wickets']] = 0
matches_df.drop(columns=['result_margin'], inplace=True)
matches_df.head()

Unnamed: 0,target_runs,target_overs,umpire1_encoded,umpire2_encoded,venue_encoded,season_encoded,year,month,day,weekday,...,match_type_encoded,super_over_encoded,method_encoded,team1_encoded,team2_encoded,toss_winner_encoded,winner_encoded,city_encoded,result_runs,result_wickets
0,223.0,20.0,9,41,23,0,2008,4,18,4,...,4,0,1,8,3,8,3,2,140,0
1,241.0,20.0,34,52,40,0,2008,4,19,5,...,4,0,1,7,0,0,0,7,33,0
2,130.0,20.0,8,15,16,0,2008,4,19,5,...,4,0,1,1,9,9,1,10,0,9
3,166.0,20.0,51,14,55,0,2008,4,20,6,...,4,0,1,6,8,6,8,26,0,5
4,111.0,20.0,10,24,14,0,2008,4,20,6,...,4,0,1,3,10,10,3,23,0,5


To ensure uniform feature scaling and improve the model's performance, **normalization** has been applied to the dataset.  

- Each numerical column (excluding categorical features such as *winner_encoded*, *team1_encoded*, *team2_encoded*, and *toss_winner_encoded*) is normalized using **Min-Max Scaling**.  
- The transformation follows the formula:  
  $$ x_{\text{normalized}} = \frac{x - \min(x)}{\max(x)} $$  
  where $x$ is the original feature value, and $\min(x)$ and $\max(x)$ are the minimum and maximum values of the respective column.  

This scaling ensures that all numerical features are mapped to a [0,1] range, preventing dominance by features with larger absolute values. By maintaining categorical features in their encoded form, we retain their discrete nature while ensuring numerical consistency across continuous variables.

It may seem odd as to why we are exempting the team related columns, it's because they denote classes and not values as such hence.

In [16]:
for column in matches_df.columns:
    # if column == 'winner_encoded' or column == 'team1_encoded' or column == 'team2_encoded' or column == 'toss_winner_encoded':
    #     continue
    categorical_cols = matches_df.select_dtypes(include=['object']).columns
    if column not in categorical_cols:
        max_value = matches_df[column].max()
        min_value = matches_df[column].min()
        matches_df[column] = matches_df[column].apply(lambda x: (x - min_value) / max_value)
matches_df.head(15)

Unnamed: 0,target_runs,target_overs,umpire1_encoded,umpire2_encoded,venue_encoded,season_encoded,year,month,day,weekday,...,match_type_encoded,super_over_encoded,method_encoded,team1_encoded,team2_encoded,toss_winner_encoded,winner_encoded,city_encoded,result_runs,result_wickets
0,0.625,0.75,0.147541,0.672131,0.403509,0.0,0.0,0.090909,0.548387,0.666667,...,0.571429,0.0,1.0,8,3,8,3,0.055556,0.958904,0.0
1,0.6875,0.75,0.557377,0.852459,0.701754,0.0,0.0,0.090909,0.580645,0.833333,...,0.571429,0.0,1.0,7,0,0,0,0.194444,0.226027,0.0
2,0.302083,0.75,0.131148,0.245902,0.280702,0.0,0.0,0.090909,0.580645,0.833333,...,0.571429,0.0,1.0,1,9,9,1,0.277778,0.0,0.9
3,0.427083,0.75,0.836066,0.229508,0.964912,0.0,0.0,0.090909,0.612903,1.0,...,0.571429,0.0,1.0,6,8,6,8,0.722222,0.0,0.5
4,0.236111,0.75,0.163934,0.393443,0.245614,0.0,0.0,0.090909,0.612903,1.0,...,0.571429,0.0,1.0,3,10,10,3,0.638889,0.0,0.5
5,0.430556,0.75,0.131148,0.655738,0.807018,0.0,0.0,0.090909,0.645161,0.0,...,0.571429,0.0,1.0,9,7,7,9,0.5,0.0,0.6
6,0.347222,0.75,0.393443,0.065574,0.736842,0.0,0.0,0.090909,0.677419,0.166667,...,0.571429,0.0,1.0,10,1,10,1,0.444444,0.0,0.9
7,0.576389,0.75,0.295082,0.245902,0.473684,0.0,0.0,0.090909,0.709677,0.333333,...,0.571429,0.0,1.0,0,6,6,0,0.222222,0.041096,0.0
8,0.597222,0.75,0.147541,0.491803,0.736842,0.0,0.0,0.090909,0.741935,0.5,...,0.571429,0.0,1.0,10,9,9,9,0.444444,0.0,0.3
9,0.486111,0.75,0.131148,0.065574,0.701754,0.0,0.0,0.090909,0.774194,0.666667,...,0.571429,0.0,1.0,7,6,6,7,0.194444,0.452055,0.0


lastly to remove any and every NaN instance, we run interpolate.

In [17]:
matches_df = matches_df.interpolate()
matches_df.columns
matches_df.to_csv('/kaggle/working/processed_matches.csv')

## Part 2: Finding the best model for predicting the winner

The underlying approach for this analysis is structured in two key phases:  

1. **Match-Level Prediction:**  
   - Given that the dataset provides match-wise data, the initial objective is to develop a predictive model capable of accurately determining the winner of an individual match based on historical match statistics, team performance metrics, and other relevant features.  
   - This model will leverage feature-engineered variables, encoded categorical attributes, and normalized numerical data to enhance predictive accuracy.  
   - Supervised learning techniques, particularly classification models such as logistic regression, decision trees, or ensemble methods, will be explored to optimize the prediction performance.  

2. **Tournament Simulation:**  
   - Once a robust match-level predictor is established, it will be extended to simulate an entire tournament structure, including the **league stage, playoffs, and finals**.  
   - The tournament will be simulated using match-by-match predictions, determining the progression of teams based on their projected performances.  
   - The framework will follow standard tournament rules, wherein teams accumulate points during the league stage, followed by knockout rounds leading up to the final match.  
   - By iteratively applying the trained model to simulate multiple tournament scenarios, insights can be derived regarding the most probable tournament outcomes, dominant teams, and key factors influencing overall performance.  

This hierarchical approach enables both **granular match-level insights** and **higher-level tournament forecasting**, thereby providing a comprehensive predictive framework for competitive analysis.

But first things first, splitting the dataset for training and testing.

In [50]:
from sklearn.model_selection import train_test_split
y = matches_df['winner_encoded']
X = matches_df.drop(columns=['winner_encoded'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

The problem can be conceptualized as a **binary/multiclass classification task**, where the objective is to determine the winner of a match based on a set of input features. Intuitively, this involves analyzing how different feature values influence the likelihood of one team winning over another.  

From a computational perspective, this can be abstracted as a decision-making problem governed by numerous conditional statements. Essentially, for each match instance, if certain conditions hold true—such as team strength, past performances, toss outcomes, venue conditions, and other match-related factors—the probability of one team winning increases, and vice versa.  

### **Choice of Algorithms**  
Given the nature of the problem, decision-tree-based models are well-suited, as they efficiently learn decision boundaries from structured data. The following algorithms have been selected for evaluation:  

1. **Decision Tree Classifier**  
   - A simple, interpretable model that recursively splits the dataset based on feature values, forming a tree-like structure of conditional decisions.  

2. **Random Forest Classifier**  
   - An ensemble of multiple decision trees that aggregates predictions to reduce overfitting and improve generalization.  

3. **XGBoost Classifier**  
   - An advanced gradient boosting algorithm that optimizes decision trees sequentially to minimize error, often yielding superior performance in structured data problems.  

---

In [None]:
from sklearn.metrics import accuracy_score, f1_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
import xgboost as xgb
rf_classifier = RandomForestClassifier(random_state=42, n_estimators=1000, max_depth=2000)
dt_classifier = DecisionTreeClassifier(random_state=42)
xgb_classifier = xgb.XGBClassifier(random_state=42, use_label_encoder=True, eval_metric='mlogloss', alpha=0.1, lambda_=0.1)
rf_classifier.fit(X_train, y_train)
dt_classifier.fit(X_train, y_train)
xgb_classifier.fit(X_train, y_train)
rf_predictions = rf_classifier.predict(X_test)
dt_predictions = dt_classifier.predict(X_test)
xgb_predictions = xgb_classifier.predict(X_test)
rf_accuracy = accuracy_score(y_test, rf_predictions)
dt_accuracy = accuracy_score(y_test, dt_predictions)
xgb_accuracy = accuracy_score(y_test, xgb_predictions)
rf_f1 = f1_score(y_test, rf_predictions, average='weighted')
dt_f1 = f1_score(y_test, dt_predictions, average='weighted')
xgb_f1 = f1_score(y_test, xgb_predictions, average='weighted')
print(f"Random Forest Classifier Accuracy: {rf_accuracy:.4f}, F1 Score: {rf_f1:.4f}")
print(f"Decision Tree Classifier Accuracy: {dt_accuracy:.4f}, F1 Score: {dt_f1:.4f}")
print(f"XGBoost Classifier Accuracy: {xgb_accuracy:.4f}, F1 Score: {xgb_f1:.4f}")

Looks like these won't do. Let's try the MLP Classifier as MLP Classifiers can capture significantly more complex patterns and trends. 🤞

In [None]:
from sklearn.neural_network import MLPClassifier
mlp_model = MLPClassifier(hidden_layer_sizes=(512), activation='tanh', solver='lbfgs', max_iter=1000, random_state=42, learning_rate_init=0.020718394976751383, alpha=0.4303451144499504)
mlp_model.fit(X_train, y_train)
mlp_preds = mlp_model.predict(X_test)
mlp_accuracy = accuracy_score(y_test, mlp_preds)
print(f"MLP Accuracy: {mlp_accuracy:.4f}")

That is a significant jump in accuracy ~90%. Now we shall use the Optuna HyperParameter Turning Procedure to find the best hyperparameters for the model.

In [None]:
import optuna
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

def objective(trial):
    num_layers = trial.suggest_int("num_layers", 1, 2)
    
    if num_layers == 1:
        layer1 = trial.suggest_categorical("layer1", [50, 100, 128, 256, 512])
    else:
        layer1 = trial.suggest_categorical("layer1", [50, 100, 128, 256, 512])
        layer2 = trial.suggest_categorical("layer2", [50, 100, 128, 256, 512])

    activation = trial.suggest_categorical("activation", ["relu", "tanh"])
    solver = trial.suggest_categorical("solver", ["lbfgs", "adam"])
    learning_rate_init = trial.suggest_float("learning_rate_init", 0.0001, 0.02, log=True)
    alpha = trial.suggest_float("alpha", 0.0001, 0.05)

    model = MLPClassifier(
        hidden_layer_sizes=(layer1,) if num_layers == 1 else (layer1, layer2),
        activation=activation,
        solver=solver,
        learning_rate_init=learning_rate_init,
        alpha=alpha,
        max_iter=3000
    )

    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    return score


# Run Optuna optimization
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)

# Get best params and train final model
best_mlp_params = study.best_params

# Build the final MLP model with best parameters
if best_mlp_params["num_layers"] == 1:
    final_hidden_layer_sizes = (best_mlp_params["layer1"],)
else:
    final_hidden_layer_sizes = (best_mlp_params["layer1"], best_mlp_params["layer2"])

final_mlp_model = MLPClassifier(
    hidden_layer_sizes=final_hidden_layer_sizes,
    activation=best_mlp_params["activation"],
    solver=best_mlp_params["solver"],
    learning_rate_init=best_mlp_params["learning_rate_init"],
    alpha=best_mlp_params["alpha"],
    max_iter=2000,
    random_state=42
)

final_mlp_model.fit(X_train, y_train)
mlp_preds = final_mlp_model.predict(X_test)
mlp_accuracy = accuracy_score(y_test, mlp_preds)
print(f"Tuned MLP Accuracy: {mlp_accuracy:.4f}")

That indeed is a great improvement from the initial 50%-ish accuracies of Random Forests and like. This will be our final match winner predictor. Let's save the model.

In [None]:
import pickle
filename = '/kaggle/working/fine_tuned_MLP_model.sav'
pickle.dump(final_mlp_model, open(filename, 'wb'))

## Part 3 : Exploring additional models like Naive Bayes and Markov Chains

In [19]:
import pandas as pd
df = pd.read_csv('/kaggle/working/processed_matches.csv', index_col=0)

### Match Outcome Prediction Using Multinomial Naïve Bayes

The given code applies a Multinomial Naïve Bayes classifier to predict match winners based on historical data. The dataset is split into training and testing sets, and the model is trained using MultinomialNB(). Predictions are made on the test set, and the model's accuracy is evaluated using accuracy_score(). The final accuracy score provides insight into how well the Naïve Bayes model generalizes match outcomes.

In [20]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
X = df.drop(columns=["winner_encoded"])
y = df["winner_encoded"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)
y_pred = nb_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Multinomial Naïve Bayes Accuracy: {accuracy:.4f}")

Multinomial Naïve Bayes Accuracy: 0.1781


### Match Outcome Prediction Using a Markov Chain Model

This code models match winner transitions using a Markov Chain approach. A transition matrix is built by counting winner sequences in historical data, converting them into probability distributions. The next match winner is predicted based on the most likely transition from the current winner. Model accuracy is evaluated on the last 20% of matches, comparing predicted and actual winners. The final accuracy score reflects the effectiveness of winner transition probabilities in forecasting match outcomes.

In [22]:
import pandas as pd
import numpy as np
from collections import defaultdict
transition_counts = defaultdict(lambda: defaultdict(int))
for i in range(len(df) - 1):
    current_winner = df.loc[i, "winner_encoded"]
    next_winner = df.loc[i + 1, "winner_encoded"]
    transition_counts[current_winner][next_winner] += 1
transition_matrix = {}
for winner, next_winners in transition_counts.items():
    total_transitions = sum(next_winners.values())
    transition_matrix[winner] = {k: v / total_transitions for k, v in next_winners.items()}
def predict_next_winner(current_winner):
    if current_winner not in transition_matrix:
        return np.random.choice(df["winner_encoded"].unique())  
    next_winners = transition_matrix[current_winner]
    return max(next_winners, key=next_winners.get)  
test_size = int(0.2 * len(df))
correct_predictions = 0
total_predictions = 0
for i in range(len(df) - test_size, len(df) - 1):
    actual_winner = df.loc[i + 1, "winner_encoded"]
    predicted_winner = predict_next_winner(df.loc[i, "winner_encoded"])
    if predicted_winner == actual_winner:
        correct_predictions += 1
    total_predictions += 1
markov_accuracy = correct_predictions / total_predictions
print(f"Markov Chain Test Accuracy: {markov_accuracy:.4f}")

Markov Chain Test Accuracy: 0.1651


### Match Outcome Prediction Using Categorical Naïve Bayes 

The given code employs a **Categorical Naïve Bayes classifier** to predict match winners. This model is well-suited for categorical data, making it ideal for match outcome prediction. The classifier is trained on historical match data using `CategoricalNB()`, and predictions are made on the test set. Model performance is assessed using `accuracy_score()`, which quantifies how well the model generalizes to unseen matches. The final accuracy score provides insight into the reliability of the predictions.

In [23]:
from sklearn.naive_bayes import CategoricalNB

nb_model = CategoricalNB()
nb_model.fit(X_train, y_train)
y_pred = nb_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Categorical Naïve Bayes Accuracy: {accuracy:.4f}")


Categorical Naïve Bayes Accuracy: 0.5114


### Match Outcome Prediction Using Gradient Boosting with Hyperparameter Tuning 

This code utilizes a **Gradient Boosting Classifier** to predict match winners, incorporating **GridSearchCV** for hyperparameter tuning. A pipeline is created with the classifier, and a grid search optimizes key parameters such as `n_estimators`, `learning_rate`, and `max_depth`. The model is trained on historical match data, and predictions are made on the test set. Model performance is evaluated using `accuracy_score`, and the best hyperparameters are identified, ensuring an optimal balance between accuracy and model complexity.

In [46]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
pipeline = Pipeline(steps=[
    ('classifier', GradientBoostingClassifier(random_state=42))
])
param_grid = {
    'classifier__n_estimators': [100, 200],
    'classifier__learning_rate': [0.01, 0.1, 0.2],
    'classifier__max_depth': [3, 5, 7]
}
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy', error_score='raise')
grid_search.fit(X_train, y_train)
y_pred = grid_search.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.4f}")
print("Best hyperparameters:", grid_search.best_params_)

Index(['target_runs', 'target_overs', 'umpire1_encoded', 'umpire2_encoded',
       'venue_encoded', 'season_encoded', 'year', 'month', 'day', 'weekday',
       'quarter', 'day_of_year', 'result_encoded', 'toss_decision_encoded',
       'match_type_encoded', 'super_over_encoded', 'method_encoded',
       'team1_encoded', 'team2_encoded', 'toss_winner_encoded', 'city_encoded',
       'result_runs', 'result_wickets'],
      dtype='object')




Model Accuracy: 0.7717
Best hyperparameters: {'classifier__learning_rate': 0.1, 'classifier__max_depth': 7, 'classifier__n_estimators': 200}


### Match Outcome Prediction Using Stacked Ensemble with Randomized Search 

This code employs a **Stacked Ensemble Model** combining **Random Forest** and **XGBoost** as base learners, with **Logistic Regression** as the final estimator. The model leverages **RandomizedSearchCV** to optimize key hyperparameters, ensuring efficient tuning across multiple classifiers. The best-performing configuration is selected based on accuracy, and predictions are made on the test set. The final accuracy score reflects the effectiveness of stacked learning in enhancing match outcome predictions.

In [None]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.metrics import accuracy_score
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
estimators = [
    ('rf', RandomForestClassifier(random_state=42)),
    ('xgb', XGBClassifier(use_label_encoder=False, eval_metric='mlogloss', random_state=42))
]
stacked_model = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(),
    cv=5,
    n_jobs=-1
)
param_dist = {
    'rf__n_estimators': [50, 100, 200, 250, 500],
    'rf__max_depth': [None, 10, 20, 30, 100, 50],
    'xgb__n_estimators': [50, 100, 200, 150, 250],
    'xgb__max_depth': [3, 6, 9, 12, 15],
    'final_estimator__C': [0.01, 0.1, 1, 10]
}
random_search = RandomizedSearchCV(
    estimator=stacked_model,
    param_distributions=param_dist,
    n_iter=50, 
    cv=5,
    n_jobs=-1,
    scoring='accuracy',
    random_state=42
)
random_search.fit(X_train, y_train)
best_model = random_search.best_estimator_
print("Best parameters found:", random_search.best_params_)
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Optimized Stacked Ensemble Accuracy: {accuracy:.4f}")



## Part 4: Actually predicting the winner of the tournament using the match winner model

In [36]:
import pickle
filename = '/kaggle/input/modelmodel/scikitlearn/default/1/fine_tuned_MLP_model.sav'
with open(filename, 'rb') as file:
    model = pickle.load(file)
model.get_params()

{'activation': 'tanh',
 'alpha': 0.028397959255942663,
 'batch_size': 'auto',
 'beta_1': 0.9,
 'beta_2': 0.999,
 'early_stopping': False,
 'epsilon': 1e-08,
 'hidden_layer_sizes': (512,),
 'learning_rate': 'constant',
 'learning_rate_init': 0.005720263132860582,
 'max_fun': 15000,
 'max_iter': 2000,
 'momentum': 0.9,
 'n_iter_no_change': 10,
 'nesterovs_momentum': True,
 'power_t': 0.5,
 'random_state': 42,
 'shuffle': True,
 'solver': 'lbfgs',
 'tol': 0.0001,
 'validation_fraction': 0.1,
 'verbose': False,
 'warm_start': False}

### **Tournament Simulation Approach Using MLP Classifier**  

The given code simulates a full season of cricket matches using a trained **Multi-Layer Perceptron (MLP) classifier**. The process involves generating synthetic matchups, predicting outcomes using the trained model, and aggregating results to determine the season's winner.  

---

### **1. Generating Match Fixtures**  
- Unique teams are extracted from the dataset using `matches_df['team1_encoded'].unique()`.  
- **All possible matchups** between teams are generated using `itertools.combinations(teams, 2)`, ensuring each team competes against every other team at least once.  
- A **points table** (`points_table`) is initialized as a dictionary, where each team starts with zero points.

---

### **2. Synthetic Match Generation**  
- **Feature Distributions**: The code extracts unique values for key match-related features (e.g., umpire, venue, season, result, toss decision) to ensure that synthetic match data follows real-world distributions.  
- **Monte Carlo Simulation**: Each matchup is simulated **1000 times** to account for randomness in match conditions.  
- A new match instance is created by randomly sampling values from the historical dataset distributions. Features include:
  - **Teams**: `team1_encoded`, `team2_encoded`
  - **Match Conditions**: `target_runs`, `target_overs`, `result_runs`, `result_wickets`
  - **Venue and Umpires**: `venue_encoded`, `umpire1_encoded`, `umpire2_encoded`
  - **Seasonal Attributes**: `year`, `month`, `day`, `weekday`
  - **Match Type and Toss Decision**: `toss_winner_encoded`, `match_type_encoded`

---

### **3. Match Outcome Prediction**  
- The trained **MLPClassifier model** takes the generated synthetic match data (`match_df`) as input and predicts the **winning team** (`predicted_winner`).  
- The `points_table` is updated by incrementing the points for the predicted winner.  
- If a team is not present in `points_table`, it is added dynamically using `points_table.get(predicted_winner, 0) + 1`.

---

### **4. Mapping Encoded Teams to Original Names**  
- Since label encoding was used, a mapping dictionary (`label_mapping`) converts encoded team IDs back to their original names.  
- Franchise renaming is handled:
  - `"Lucknow Super Giants"` points are merged into `"LSG"`.
  - `"Delhi Daredevils"` (`DD`) is renamed to `"Delhi Capitals"` (`DC`).

---

### **5. Season Winner Estimation**  
- The final **aggregated points table** ranks teams based on their predicted wins.
- The team with the highest points is the predicted season winner.

---

### **Key Takeaways**  
- **Monte Carlo-style simulations** generate multiple match outcomes for better statistical accuracy.  
- The **MLP model generalizes** match conditions and predicts results based on historical patterns.  
- **Dynamic points tracking** enables real-time tournament simulation.  
- **Post-processing ensures accurate team representation** despite name changes.  

This approach can be extended to **multi-stage tournaments**, knockout formats, or **weighted probability simulations** for enhanced realism.

In [76]:
import numpy as np
import pandas as pd
from itertools import combinations
from sklearn.neural_network import MLPClassifier
unique_values = {
    'umpire1_encoded': matches_df['umpire1_encoded'].unique(),
    'umpire2_encoded': matches_df['umpire2_encoded'].unique(),
    'venue_encoded': matches_df['venue_encoded'].unique(),
    'season_encoded': matches_df['season_encoded'].unique(),
    'year': matches_df['year'].unique(),
    'month': matches_df['month'].unique(),
    'day': matches_df['day'].unique(),
    'weekday': matches_df['weekday'].unique(),
    'quarter': matches_df['quarter'].unique(),
    'day_of_year': matches_df['day_of_year'].unique(),
    'result_encoded': matches_df['result_encoded'].unique(),
    'toss_decision_encoded': matches_df['toss_decision_encoded'].unique(),
    'match_type_encoded': matches_df['match_type_encoded'].unique(),
    'super_over_encoded': matches_df['super_over_encoded'].unique(),
    'method_encoded': matches_df['method_encoded'].unique(),
    'city_encoded': matches_df['city_encoded'].unique(),
    'result_runs': matches_df['result_runs'].unique(),
    'result_wickets': matches_df['result_wickets'].unique()
}
teams = matches_df['team1_encoded'].unique()
matchups = list(combinations(teams, 2))
points_table = {team: 0 for team in teams}
for team1, team2 in matchups:
    for _ in range(1000):  
        synthetic_data = {
            'target_runs': np.random.choice(unique_values['result_runs']),
            'target_overs': np.random.choice([20]),  # Example overs
            'umpire1_encoded': np.random.choice(unique_values['umpire1_encoded']),
            'umpire2_encoded': np.random.choice(unique_values['umpire2_encoded']),
            'venue_encoded': np.random.choice(unique_values['venue_encoded']),
            'season_encoded': np.random.choice(unique_values['season_encoded']),
            'year': np.random.choice(unique_values['year']),
            'month': np.random.choice(unique_values['month']),
            'day': np.random.choice(unique_values['day']),
            'weekday': np.random.choice(unique_values['weekday']),
            'quarter': np.random.choice(unique_values['quarter']),
            'day_of_year': np.random.choice(unique_values['day_of_year']),
            'result_encoded': np.random.choice(unique_values['result_encoded']),
            'toss_decision_encoded': np.random.choice(unique_values['toss_decision_encoded']),
            'match_type_encoded': np.random.choice(unique_values['match_type_encoded']),
            'super_over_encoded': np.random.choice(unique_values['super_over_encoded']),
            'method_encoded': np.random.choice(unique_values['method_encoded']),
            'team1_encoded': team1,
            'team2_encoded': team2,
            'toss_winner_encoded': np.random.choice([team1, team2]),
            'city_encoded': np.random.choice(unique_values['city_encoded']),
            'result_runs': np.random.choice(unique_values['result_runs']),
            'result_wickets': np.random.choice(unique_values['result_wickets'])
        }
        match_df = pd.DataFrame([synthetic_data])
        predicted_winner = model.predict(match_df)[0]
        points_table[predicted_winner] = points_table.get(predicted_winner, 0) + 1
def update(encoded_team):
    return label_mapping[encoded_team]
final = {update(k): v for k, v in points_table.items()}
final['LSG'] += final['Lucknow Super Giants']
del final['Lucknow Super Giants']
final['DC'] = final.pop('DD')
print(final)

{'RCB': 1746, 'PBKS': 139, 'MI': 1330, 'KKR': 1360, 'RR': 5246, 'SRH': 25648, 'CSK': 101, 'GT': 2151, 'LSG': 1167, 'DC': 16112}
