# **Cricket Match Outcome Prediction: Feature Generation**

## Problem Description:
The goal of this project is to predict the outcome of a cricket match (Win, Loss, or Draw) based on various match-related features, including player performance, team stats, environmental factors, and historical data. The model can be trained on past match data to provide predictions for future matches, assisting fans, analysts, and even betting platforms in understanding match dynamics.

## **Importing Libraries**

In [1]:
import pandas as pd
import numpy as np

## Step 1: Team Batting Average
This step generates the **Team Batting Average** feature based on the following logic:
- Values range based on the match format:
  - ODIs: 200–350
  - T20s: 120–220
  - Tests: 300–450 (innings basis)
- Outliers would reflect extraordinary matches (e.g., high-scoring games); the code below draws uniformly within the ranges above and does not generate any.

In [2]:
# Initialize the dataset with an empty DataFrame
num_samples = 100000  # Number of matches
dataset = pd.DataFrame()

In [3]:
# Define the match format
# 0: Test, 1: ODI, 2: T20
np.random.seed(42)  # For reproducibility
dataset['Match_Format'] = np.random.choice([0, 1, 2], size=num_samples, p=[0.2, 0.5, 0.3])

In [4]:
# Generate Team Batting Average based on Match Format
def generate_batting_average(match_format):
    if match_format == 0:  # Test
        return np.random.randint(300, 451)
    elif match_format == 1:  # ODI
        return np.random.randint(200, 351)
    else:  # T20
        return np.random.randint(120, 221)

In [5]:
# Apply the function to generate the feature
dataset['Team_Batting_Average'] = dataset['Match_Format'].apply(generate_batting_average)
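
The per-row `apply` above can also be written as a single vectorized draw. The following is a minimal sketch, not the notebook's method: it assumes NumPy's `Generator` API (`np.random.default_rng`), so its values will differ from those produced with `np.random.seed(42)`. The names `LOW`, `HIGH`, `rng`, and `demo` are illustrative.

```python
import numpy as np
import pandas as pd

# Inclusive run ranges per format code, taken from the list in Step 1.
# Index 0: Test, 1: ODI, 2: T20
LOW = np.array([300, 200, 120])
HIGH = np.array([450, 350, 220])

rng = np.random.default_rng(42)  # assumed seed, separate from np.random.seed
formats = rng.choice([0, 1, 2], size=1_000, p=[0.2, 0.5, 0.3])

# Look up the bounds for each row's format, then draw one integer per row.
# endpoint=True makes the upper bound inclusive.
batting_avg = rng.integers(LOW[formats], HIGH[formats], endpoint=True)

demo = pd.DataFrame({"Match_Format": formats, "Team_Batting_Average": batting_avg})
summary = demo.groupby("Match_Format")["Team_Batting_Average"].agg(["min", "max"])
print(summary)
```

Keeping the bounds in two arrays means a range change is a one-line edit, and the `groupby` summary gives a quick check that every format stays inside its range.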

## Step 2: Team Bowling Average
This step generates the **Team Bowling Average** feature based on the following logic:
- Values range based on the match format:
  - ODIs: 25–50
  - T20s: 20–40
  - Tests: 30–55
- Lower values indicate stronger bowling performance.
- Pitch conditions such as spin-friendly pitches may influence the bowling average.

In [6]:
# Add Pitch Type to simulate bowling conditions
# 0: Batting-friendly, 1: Spin-friendly, 2: Fast-bowler-friendly
dataset['Pitch_Type'] = np.random.choice([0, 1, 2], size=num_samples, p=[0.4, 0.3, 0.3])

In [7]:
# Generate Team Bowling Average based on Match Format and Pitch Type
def generate_bowling_average(row):
    match_format = row['Match_Format']
    pitch_type = row['Pitch_Type']
    
    if match_format == 0:  # Test
        base = np.random.randint(30, 56)
    elif match_format == 1:  # ODI
        base = np.random.randint(25, 51)
    else:  # T20
        base = np.random.randint(20, 41)
    
    # Adjust for pitch type
    if pitch_type == 1:  # Spin-friendly
        return max(base - np.random.randint(2, 6), 20)
    elif pitch_type == 2:  # Fast-bowler-friendly
        return max(base - np.random.randint(1, 4), 20)
    else:  # Batting-friendly
        return min(base + np.random.randint(1, 6), 55)

In [8]:
# Apply the function to generate the feature
dataset['Team_Bowling_Average'] = dataset.apply(generate_bowling_average, axis=1)
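
The pitch adjustment and its clamping are easier to follow with the randomness removed. This sketch assumes a hypothetical helper, `adjust_for_pitch`, that takes the adjustment size as an argument instead of drawing it; the floor of 20 and cap of 55 are the ones used in the cell above.

```python
def adjust_for_pitch(base, pitch_type, delta):
    """Apply the pitch adjustment to a base bowling average.

    pitch_type: 0 batting-friendly, 1 spin-friendly, 2 fast-bowler-friendly.
    delta: size of the adjustment (random in the notebook, fixed here).
    """
    if pitch_type in (1, 2):
        # Bowler-friendly pitch: the average drops, but never below 20.
        return max(base - delta, 20)
    # Batting-friendly pitch: the average rises, but never above 55.
    return min(base + delta, 55)


print(adjust_for_pitch(22, 1, 5))  # 22 - 5 = 17, clamped up to 20
print(adjust_for_pitch(53, 0, 4))  # 53 + 4 = 57, clamped down to 55
print(adjust_for_pitch(40, 2, 3))  # 40 - 3 = 37, no clamping needed
```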

## Step 3: Home/Away Game
This step generates the **Home/Away Game** feature based on the following logic:
- The match location can be:
  - Home: 1
  - Away: 0
  - Neutral: 2
- Home matches generally favor the team due to familiarity with conditions and crowd support.
- Assign probabilities:
  - Home: 50%
  - Away: 30%
  - Neutral: 20%

In [9]:
# Generate Home/Away Game feature
# 1: Home, 0: Away, 2: Neutral
dataset['Home_Away_Game'] = np.random.choice([1, 0, 2], size=num_samples, p=[0.5, 0.3, 0.2])
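
A quick way to confirm that the sampled shares match the intended 50/30/20 split is to compare normalized value counts against the target probabilities. The sketch below uses its own generator and seed (both assumptions), so it checks the technique and not the notebook's exact column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # assumed seed for this demo only
location = pd.Series(rng.choice([1, 0, 2], size=100_000, p=[0.5, 0.3, 0.2]))

# Empirical share of each code; should land near 0.3 / 0.5 / 0.2 for 0 / 1 / 2.
shares = location.value_counts(normalize=True).sort_index()
print(shares.rename({0: "Away", 1: "Home", 2: "Neutral"}))
```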

## Step 4: Recent Form
This step generates the **Recent Form** feature based on the following logic:
- Values range from 0 to 5, representing the number of matches won in the last 5 games.
- Include diverse teams:
  - Strong teams: 4–5 wins.
  - Weak teams: 0–2 wins.
  - Average teams: 2–3 wins.


In [10]:
# Generate Recent Form feature
def generate_recent_form():
    # Assign probabilities for win counts
    return np.random.choice([0, 1, 2, 3, 4, 5], p=[0.1, 0.15, 0.2, 0.25, 0.2, 0.1])

In [11]:
dataset['Recent_Form'] = [generate_recent_form() for _ in range(num_samples)]
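
The weights in `generate_recent_form` fix both the expected number of recent wins and the share of teams with three or more wins. A short worked calculation, using only the probabilities from the cell above:

```python
import numpy as np

wins = np.arange(6)                                  # 0..5 wins
probs = np.array([0.1, 0.15, 0.2, 0.25, 0.2, 0.1])   # same weights as generate_recent_form

expected_wins = float(np.dot(wins, probs))           # sum of k * P(k)
share_three_plus = float(probs[wins >= 3].sum())     # P(Recent_Form >= 3)

print(round(expected_wins, 2), round(share_three_plus, 2))
```

The expected value is 2.6 wins, and 55% of teams have a form of 3 or higher.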

## Step 5: Team Experience
This step generates the **Team Experience** feature based on the following logic:
- Values range from 500 to 1500 matches, representing the sum of all players' match experience.
- Logic:
  - Experienced teams: 1200–1500.
  - Balanced teams: 800–1200.
  - Inexperienced teams: 500–800.
- Factor in squad changes due to injuries or retirements by introducing variability.


In [12]:
# Generate Team Experience feature
def generate_team_experience():
    # Pick an experience tier, then draw a value inside it
    if np.random.rand() < 0.3:  # 30% of teams: experienced
        return np.random.randint(1200, 1501)
    # The second draw is independent of the first, so it splits the remaining 70% in half:
    # 35% balanced, 35% inexperienced
    elif np.random.rand() < 0.5:
        return np.random.randint(800, 1201)
    else:
        return np.random.randint(500, 801)

In [13]:
dataset['Team_Experience'] = [generate_team_experience() for _ in range(num_samples)]
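
Tier probabilities are easier to control when the tier is chosen with a single categorical draw. The sketch below assumes a 30/50/20 split between experienced, balanced, and inexperienced teams; `TIERS`, `TIER_PROBS`, and `sample_team_experience` are illustrative names, and the generator and seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)  # assumed seed for this demo only

# Inclusive (low, high) bounds for each tier, from the list in Step 5.
TIERS = {
    "experienced": (1200, 1500),
    "balanced": (800, 1200),
    "inexperienced": (500, 800),
}
TIER_PROBS = [0.3, 0.5, 0.2]  # assumed split; adjust as needed


def sample_team_experience(rng):
    """Pick a tier with one categorical draw, then a value inside that tier."""
    tier = str(rng.choice(list(TIERS), p=TIER_PROBS))
    low, high = TIERS[tier]
    return tier, int(rng.integers(low, high, endpoint=True))


samples = [sample_team_experience(rng) for _ in range(20_000)]
tier_share = {t: sum(s[0] == t for s in samples) / len(samples) for t in TIERS}
print(tier_share)
```

With one draw per team, the observed tier shares track `TIER_PROBS` directly, with no dependence on the order of the branches.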

## Step 6: Batting Depth
This step generates the **Batting Depth** feature based on the following logic:
- Values range from 5 to 11, representing the number of capable batsmen in the lineup.
- Logic:
  - Higher values indicate deeper batting lineups.
  - T20 teams typically have deeper batting lineups (close to 11).
  - Test teams may have fewer (closer to 5 or 6).
  - ODIs usually fall in between.


In [14]:
# Generate Batting Depth feature
def generate_batting_depth(match_format):
    if match_format == 0:  # Test
        return np.random.randint(5, 7)  # Fewer capable batsmen
    elif match_format == 1:  # ODI
        return np.random.randint(6, 9)  # Balanced lineup
    else:  # T20
        return np.random.randint(9, 12)  # Deeper lineup

In [15]:
dataset['Batting_Depth'] = dataset['Match_Format'].apply(generate_batting_depth)

## Step 7: Bowling Depth
This step generates the **Bowling Depth** feature based on the following logic:
- Values range from 3 to 6, representing the number of quality bowlers in the lineup.
- Logic:
  - Teams with 6 bowlers have a deeper bowling attack, typically for formats like Tests.
  - Teams with 3-4 bowlers are more common in T20 and ODI formats.


In [16]:
# Generate Bowling Depth feature
def generate_bowling_depth(match_format):
    if match_format == 0:  # Test
        return np.random.randint(4, 7)  # More quality bowlers
    elif match_format == 1:  # ODI
        return np.random.randint(3, 6)  # Balanced bowling attack
    else:  # T20
        return np.random.randint(3, 5)  # Shorter formats, fewer bowlers

In [17]:
dataset['Bowling_Depth'] = dataset['Match_Format'].apply(generate_bowling_depth)
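
Batting Depth and Bowling Depth follow the same pattern: an integer range that depends on the match format. One possible arrangement, sketched here with a hypothetical `DEPTH_RANGES` table and `sample_depth` helper, keeps all ranges in one place with inclusive bounds that mirror the `randint` calls in Steps 6 and 7.

```python
import numpy as np

# Inclusive (low, high) ranges per format code 0: Test, 1: ODI, 2: T20.
DEPTH_RANGES = {
    "Batting_Depth": {0: (5, 6), 1: (6, 8), 2: (9, 11)},
    "Bowling_Depth": {0: (4, 6), 1: (3, 5), 2: (3, 4)},
}


def sample_depth(feature, match_format, rng):
    """Draw one depth value for the given feature and format code."""
    low, high = DEPTH_RANGES[feature][match_format]
    return int(rng.integers(low, high, endpoint=True))


rng = np.random.default_rng(1)  # assumed seed for this demo only
t20_bowling = [sample_depth("Bowling_Depth", 2, rng) for _ in range(200)]
test_batting = [sample_depth("Batting_Depth", 0, rng) for _ in range(200)]
print(sorted(set(t20_bowling)), sorted(set(test_batting)))
```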

## Step 8: Weather Conditions
This step generates the **Weather Conditions** feature based on the following logic:
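- Values (stored in the dataset as string labels and drawn with equal probability; the numbers are the intended integer codes):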
  - Sunny: 0
  - Overcast: 1
  - Rainy: 2
  - Humid: 3
- Logic:
  - Overcast conditions may favor bowlers, especially swing bowlers.
  - Rainy conditions could lead to match interruptions or reduced overs.
  - Sunny conditions typically favor batting.

In [18]:
# Generate Weather Conditions feature
weather_conditions = ['Sunny', 'Overcast', 'Rainy', 'Humid']
dataset['Weather_Conditions'] = np.random.choice(weather_conditions, size=num_samples)
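
Because the column holds string labels, most models will need it converted to numbers first. A minimal sketch of that conversion, using the codes listed in this step; `WEATHER_CODES` and the five-row sample series are illustrative.

```python
import pandas as pd

# Label-to-code mapping from the list in Step 8.
WEATHER_CODES = {"Sunny": 0, "Overcast": 1, "Rainy": 2, "Humid": 3}

weather = pd.Series(
    ["Overcast", "Sunny", "Humid", "Rainy", "Sunny"], name="Weather_Conditions"
)
weather_code = weather.map(WEATHER_CODES)
print(weather_code.tolist())  # [1, 0, 3, 2, 0]
```

Since these codes carry no natural order, one-hot encoding (for example with `pd.get_dummies`) may suit some models better than integer codes.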

## Step 11: Toss Winner
This step generates the **Toss Winner** feature based on the following logic:
- Values:
  - Team A: 0
  - Team B: 1
- Logic:
  - Toss winners often have an advantage, especially in conditions like dew or overcast skies.
  - The toss result is randomly assigned here for simulation purposes.


In [19]:
# Generate Toss Winner feature
dataset['Toss_Winner'] = np.random.choice([0, 1], size=num_samples)

## Step 12: Team's Overall Batting Strength
This step generates the **Team's Overall Batting Strength** feature based on the following logic:
- Values range from 200 to 399 (the upper bound passed to `np.random.randint` is exclusive), calculated as the sum of the average runs scored by the top 5 batsmen.
- Logic:
  - Higher values indicate stronger batting strength.
  - The batting strength can vary depending on the match format (e.g., higher in ODIs and T20s).


In [20]:
# Generate Team's Overall Batting Strength feature
def generate_batting_strength(match_format):
    if match_format == 0:  # Test
        return np.random.randint(200, 300)  # Lower batting strength in Tests
    elif match_format == 1:  # ODI
        return np.random.randint(300, 350)  # Balanced for ODIs
    else:  # T20
        return np.random.randint(350, 400)  # Strong batting for T20

In [21]:
dataset['Batting_Strength'] = dataset['Match_Format'].apply(generate_batting_strength)
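
`np.random.randint(low, high)` excludes `high`, which is why the calls above top out at 299, 349, and 399. A short demonstration, with an assumed seed and sample size:

```python
import numpy as np

np.random.seed(0)  # assumed seed for this demo only

# The upper bound is exclusive, so 400 itself never appears.
draws = np.random.randint(350, 400, size=100_000)
print(draws.min(), draws.max())

# Passing high + 1 makes the stated upper limit reachable.
inclusive = np.random.randint(350, 401, size=100_000)
print(inclusive.max())
```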

## Step 13: Team's Overall Bowling Strength
This step generates the **Team's Overall Bowling Strength** feature based on the following logic:
- Values range from 15 to 39, calculated as the sum of the average wickets taken by the top 5 bowlers.
- Logic:
  - Higher values indicate stronger bowling, since the score counts wickets taken.
  - The bowling strength can vary depending on the match format (e.g., more focus on pace in T20s and ODIs).


In [22]:
# Generate Team's Overall Bowling Strength feature
def generate_bowling_strength(match_format):
    if match_format == 0:  # Test
        return np.random.randint(30, 40)  # Stronger bowling in Tests
    elif match_format == 1:  # ODI
        return np.random.randint(20, 30)  # Balanced bowling for ODIs
    else:  # T20
        return np.random.randint(15, 25)  # Lower for T20s as focus is on batting

In [23]:
dataset['Bowling_Strength'] = dataset['Match_Format'].apply(generate_bowling_strength)

## Step 14: Adding Team A and Team B Country Features
We need to add **Team A Country** and **Team B Country** features to the dataset. These columns will represent the countries of the two teams playing in the match.
- **Team A Country** and **Team B Country** will be randomly chosen from the list of cricket-playing nations.


In [24]:
# List of cricket-playing countries (for demonstration)
cricket_countries = ['India', 'Australia', 'England', 'New Zealand', 'South Africa', 'Pakistan', 
                     'Sri Lanka', 'West Indies', 'Bangladesh', 'Afghanistan', 'Zimbabwe', 'Ireland']

In [25]:
# Generate Team A and Team B countries ensuring they are from the same pool
def generate_team_countries():
    team_a_country = np.random.choice(cricket_countries)
    # Ensure Team B is different from Team A, but still a cricket-playing country
    team_b_country = np.random.choice([country for country in cricket_countries if country != team_a_country])
    return team_a_country, team_b_country

In [26]:
# Add Team_A_Country and Team_B_Country to the dataset
dataset[['Team_A_Country', 'Team_B_Country']] = [generate_team_countries() for _ in range(len(dataset))]
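
An alternative to filtering the list on every call is to sample both teams at once without replacement. This sketch assumes NumPy's `Generator.choice`; `sample_fixture` is an illustrative name, and the seed is arbitrary.

```python
import numpy as np

cricket_countries = ['India', 'Australia', 'England', 'New Zealand', 'South Africa', 'Pakistan',
                     'Sri Lanka', 'West Indies', 'Bangladesh', 'Afghanistan', 'Zimbabwe', 'Ireland']

rng = np.random.default_rng(3)  # assumed seed for this demo only


def sample_fixture(rng):
    """Draw two distinct countries; replace=False rules out a team facing itself."""
    team_a, team_b = rng.choice(cricket_countries, size=2, replace=False)
    return str(team_a), str(team_b)


fixtures = [sample_fixture(rng) for _ in range(1_000)]
print(fixtures[:3])
```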

## Step 15: Head-to-Head Results
With **Team A Country** and **Team B Country** in place, we can add the **Head-to-Head Results** feature.
- **Head-to-Head Results** represents the historical win/loss ratio between the two teams in the same match format.
- In this simulation the values are drawn uniformly from 0.1 to 5.0 and do not depend on which countries are playing.


In [27]:
# Generate Head-to-Head Results feature
def generate_head_to_head(team_a_country, team_b_country):
    # The country arguments are accepted but not used:
    # the win/loss ratio is drawn uniformly from [0.1, 5.0) for every pairing
    return np.random.uniform(0.1, 5.0)

In [28]:
dataset['Head_to_Head'] = dataset.apply(lambda row: generate_head_to_head(row['Team_A_Country'], row['Team_B_Country']), axis=1)

## Step 16: Fielding Efficiency
This step generates the **Fielding Efficiency** feature, which measures the percentage of successful fielding actions like catches, run-outs, and stops.
- **Fielding Efficiency** will range from 0.5 (50%) to 1.0 (100%).
- The higher the value, the better the team's fielding performance.


In [29]:
# Generate Fielding Efficiency feature
def generate_fielding_efficiency():
    # Randomly generate fielding efficiency between 50% and 100% (0.5 to 1.0)
    return np.random.uniform(0.5, 1.0)

In [30]:
dataset['Fielding_Efficiency'] = dataset.apply(lambda row: generate_fielding_efficiency(), axis=1)

## Step 17: Captain's Score
This step generates the **Captain's Score** feature, which is a composite score based on the captain's leadership and personal performance.
- **Captain's Score** will range from 1 to 10, where higher values indicate a better captain.


In [31]:
# Generate Captain's Score feature
def generate_captains_score():
    # Randomly generate captain's score between 1 and 10
    return np.random.randint(1, 11)

In [32]:
dataset['Captains_Score'] = dataset.apply(lambda row: generate_captains_score(), axis=1)

## Target Variable
- 1 (Win): Means Team A is expected to win.
- 0 (Loss): Means Team B is expected to win.
- 2 (Draw): Means the match is likely to end in a draw.

In [33]:
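import random  # note: np.random.seed(42) does not seed this module

def target(row):
    # Label noise: 45% of rows receive a uniformly random outcome
    if random.random() < 0.45:
        return random.choice([0, 1, 2])
    else:
        # Batting averages (120-450) always exceed bowling averages (20-55) by more than 10,
        # so this branch effectively tests Recent_Form >= 3
        if row['Team_Batting_Average'] > row['Team_Bowling_Average'] + 10 and row['Recent_Form'] >= 3:
            return 1  # Team A wins
        # Never true for the generated ranges, so label 0 comes only from the noise branch
        elif row['Team_Bowling_Average'] > row['Team_Batting_Average'] and row['Recent_Form'] < 3:
            return 0  # Team B wins
        else:
            return 2  # Draw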

In [34]:
dataset['Match_Outcome'] = dataset.apply(target, axis=1)

In [35]:
dataset.head()

| | Match_Format | Team_Batting_Average | Pitch_Type | Team_Bowling_Average | Home_Away_Game | Recent_Form | Team_Experience | Batting_Depth | Bowling_Depth | Weather_Conditions | Toss_Winner | Batting_Strength | Bowling_Strength | Team_A_Country | Team_B_Country | Head_to_Head | Fielding_Efficiency | Captains_Score | Match_Outcome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 297 | 1 | 44 | 2 | 4 | 1060 | 6 | 5 | Overcast | 0 | 305 | 28 | Sri Lanka | South Africa | 0.611561 | 0.930467 | 5 | 1 |
| 1 | 2 | 191 | 0 | 35 | 0 | 5 | 1137 | 10 | 4 | Overcast | 1 | 355 | 24 | Ireland | West Indies | 4.925231 | 0.779202 | 4 | 1 |
| 2 | 2 | 181 | 2 | 28 | 1 | 3 | 1187 | 11 | 3 | Sunny | 0 | 397 | 15 | West Indies | Ireland | 1.809952 | 0.51862 | 1 | 2 |
| 3 | 1 | 278 | 0 | 33 | 0 | 4 | 647 | 7 | 5 | Sunny | 1 | 330 | 29 | Ireland | Sri Lanka | 3.299185 | 0.915704 | 3 | 1 |
| 4 | 0 | 316 | 1 | 35 | 0 | 3 | 1147 | 6 | 4 | Sunny | 0 | 212 | 33 | New Zealand | Afghanistan | 4.159099 | 0.581732 | 9 | 2 |


In [36]:
dataset.columns

Index(['Match_Format', 'Team_Batting_Average', 'Pitch_Type',
       'Team_Bowling_Average', 'Home_Away_Game', 'Recent_Form',
       'Team_Experience', 'Batting_Depth', 'Bowling_Depth',
       'Weather_Conditions', 'Toss_Winner', 'Batting_Strength',
       'Bowling_Strength', 'Team_A_Country', 'Team_B_Country', 'Head_to_Head',
       'Fielding_Efficiency', 'Captains_Score', 'Match_Outcome'],
      dtype='object')

In [37]:
dataset.describe()

| | Match_Format | Team_Batting_Average | Pitch_Type | Team_Bowling_Average | Home_Away_Game | Recent_Form | Team_Experience | Batting_Depth | Bowling_Depth | Toss_Winner | Batting_Strength | Bowling_Strength | Head_to_Head | Fielding_Efficiency | Captains_Score | Match_Outcome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 |
| mean | 1.09875 | 263.81015 | 0.89931 | 35.8333 | 0.89641 | 2.60056 | 983.28433 | 7.60291 | 4.04902 | 0.50359 | 324.3207 | 25.00664 | 2.540877 | 0.750154 | 5.49706 | 1.251 |
| std | 0.699488 | 82.154655 | 0.830842 | 8.804405 | 0.70076 | 1.460845 | 298.529668 | 1.833849 | 0.901468 | 0.49999 | 47.065066 | 5.946541 | 1.414616 | 0.144288 | 2.87488 | 0.693285 |
| min | 0.0 | 120.0 | 0.0 | 20.0 | 0.0 | 0.0 | 500.0 | 5.0 | 3.0 | 0.0 | 200.0 | 15.0 | 0.100051 | 0.5 | 1.0 | 0.0 |
| 25% | 1.0 | 202.0 | 0.0 | 29.0 | 0.0 | 2.0 | 714.0 | 6.0 | 3.0 | 0.0 | 304.0 | 21.0 | 1.315464 | 0.625526 | 3.0 | 1.0 |
| 50% | 1.0 | 261.0 | 1.0 | 35.0 | 1.0 | 3.0 | 972.0 | 7.0 | 4.0 | 1.0 | 329.0 | 24.0 | 2.543082 | 0.750123 | 5.0 | 1.0 |
| 75% | 2.0 | 325.0 | 2.0 | 42.0 | 1.0 | 4.0 | 1251.0 | 9.0 | 5.0 | 1.0 | 358.0 | 29.0 | 3.766913 | 0.874768 | 8.0 | 2.0 |
| max | 2.0 | 450.0 | 2.0 | 55.0 | 2.0 | 5.0 | 1500.0 | 11.0 | 6.0 | 1.0 | 399.0 | 39.0 | 4.999954 | 0.999998 | 10.0 | 2.0 |


In [38]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 19 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   Match_Format          100000 non-null  int64  
 1   Team_Batting_Average  100000 non-null  int64  
 2   Pitch_Type            100000 non-null  int64  
 3   Team_Bowling_Average  100000 non-null  int64  
 4   Home_Away_Game        100000 non-null  int64  
 5   Recent_Form           100000 non-null  int64  
 6   Team_Experience       100000 non-null  int64  
 7   Batting_Depth         100000 non-null  int64  
 8   Bowling_Depth         100000 non-null  int64  
 9   Weather_Conditions    100000 non-null  object 
 10  Toss_Winner           100000 non-null  int64  
 11  Batting_Strength      100000 non-null  int64  
 12  Bowling_Strength      100000 non-null  int64  
 13  Team_A_Country        100000 non-null  object 
 14  Team_B_Country        100000 non-null  object 
 15  H

In [39]:
dataset.dtypes

Match_Format              int64
Team_Batting_Average      int64
Pitch_Type                int64
Team_Bowling_Average      int64
Home_Away_Game            int64
Recent_Form               int64
Team_Experience           int64
Batting_Depth             int64
Bowling_Depth             int64
Weather_Conditions       object
Toss_Winner               int64
Batting_Strength          int64
Bowling_Strength          int64
Team_A_Country           object
Team_B_Country           object
Head_to_Head            float64
Fielding_Efficiency     float64
Captains_Score            int64
Match_Outcome             int64
dtype: object

In [40]:
dataset.tail()

| | Match_Format | Team_Batting_Average | Pitch_Type | Team_Bowling_Average | Home_Away_Game | Recent_Form | Team_Experience | Batting_Depth | Bowling_Depth | Weather_Conditions | Toss_Winner | Batting_Strength | Bowling_Strength | Team_A_Country | Team_B_Country | Head_to_Head | Fielding_Efficiency | Captains_Score | Match_Outcome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 99995 | 2 | 127 | 0 | 41 | 1 | 0 | 679 | 11 | 3 | Overcast | 1 | 361 | 22 | Zimbabwe | Bangladesh | 4.764504 | 0.67479 | 8 | 2 |
| 99996 | 2 | 219 | 0 | 41 | 1 | 4 | 1263 | 9 | 3 | Sunny | 0 | 391 | 22 | Sri Lanka | South Africa | 2.286204 | 0.979679 | 9 | 1 |
| 99997 | 1 | 211 | 2 | 25 | 0 | 3 | 1416 | 6 | 3 | Humid | 1 | 307 | 28 | Sri Lanka | Australia | 3.318669 | 0.961371 | 3 | 2 |
| 99998 | 1 | 266 | 1 | 35 | 0 | 2 | 1307 | 8 | 3 | Humid | 0 | 300 | 29 | New Zealand | India | 1.154285 | 0.509006 | 1 | 2 |
| 99999 | 1 | 330 | 0 | 40 | 1 | 3 | 540 | 8 | 4 | Rainy | 0 | 315 | 25 | England | Australia | 0.653475 | 0.914466 | 3 | 2 |


In [41]:
dataset['Match_Outcome'].value_counts()

Match_Outcome
1    45636
2    39732
0    14632
Name: count, dtype: int64
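
These counts can be cross-checked against the labelling rule. The generated batting averages (at least 120) always exceed the bowling averages (at most 55) by more than 10, so outside the noise branch the label depends only on whether `Recent_Form` is 3 or higher. A worked calculation under that observation, using the 45% noise rate and the Recent Form weights from Step 4:

```python
noise = 0.45                      # share of rows given a uniformly random label
p_form_3_plus = 0.25 + 0.2 + 0.1  # P(Recent_Form >= 3) from the Step 4 weights

# Each label gets noise / 3 from the random branch; the rule adds the rest.
p_win = noise / 3 + (1 - noise) * p_form_3_plus
p_draw = noise / 3 + (1 - noise) * (1 - p_form_3_plus)
p_loss = noise / 3                # the rule never assigns 0 for these ranges

n = 100_000
expected = {1: round(p_win * n), 2: round(p_draw * n), 0: round(p_loss * n)}
print(expected)  # {1: 45250, 2: 39750, 0: 15000}
```

The observed counts sit within half a percentage point of these expected shares, which is consistent with label 0 arising from noise alone.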

## **Saving The CSV File**

In [42]:
dataset.to_csv("Match_Prediction.csv", index=False)