# Problem Statement

The goal is to predict the winner of a T20 cricket match from historical match and ball-by-ball data, by identifying the key factors that influence outcomes and building machine learning models that give reliable predictions.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Data Collection

The data was collected from [Cricsheet](https://cricsheet.org/), which publishes ball-by-ball data for historical T20 matches.

In [None]:
df = pd.read_csv("data/matches.csv")
df.head()

## Initial Data Checks

- check missing values
- check duplicates
- check data types
- check number of unique values in each categorical column
- check statistics of numerical columns

### Check Missing Values

In [None]:
df.info()

There are no missing values in any column.
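`df.info()` reports non-null counts, which then have to be compared against the row count by eye. A more direct check counts missing values per column. The sketch below uses a small toy frame with made-up values (not the real `matches.csv`) so that it runs on its own; on the real data the same two calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the matches data (hypothetical values)
toy = pd.DataFrame({
    "team1": ["India", "Australia", None],
    "team2": ["England", "India", "Pakistan"],
    "elo_diff": [12.5, np.nan, -30.0],
})

# Number of missing values in each column
missing_counts = toy.isna().sum()
print(missing_counts)

# Single yes/no answer: is anything missing anywhere?
has_missing = bool(toy.isna().any().any())
print(has_missing)
```

If every entry of `missing_counts` is 0 for the real `df`, the statement above holds.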

### Check Duplicates

In [7]:
df.duplicated(subset=['team1', 'team2', 'date']).sum()

np.int64(8)

There are 8 duplicate rows, i.e. rows that repeat an earlier combination of `team1`, `team2`, and `date`.
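Before dropping anything, it can help to look at the duplicated rows side by side and confirm they really describe the same match. A minimal sketch on toy data (the rows below are invented for illustration): `keep=False` flags every member of a duplicated group, not only the repeats.

```python
import pandas as pd

# Toy frame with one repeated (team1, team2, date) combination (hypothetical values)
toy = pd.DataFrame({
    "team1": ["India", "India", "Australia"],
    "team2": ["England", "England", "India"],
    "date": ["2021-03-12", "2021-03-12", "2021-06-01"],
    "outcome": ["India", "India", "Australia"],
})

key = ["team1", "team2", "date"]

# keep=False marks all rows in a duplicated group, so both copies are shown
dupes = toy[toy.duplicated(subset=key, keep=False)].sort_values(key)
print(dupes)

# Dropping duplicates keeps the first row of each group by default
deduped = toy.drop_duplicates(subset=key)
print(len(toy), "->", len(deduped))
```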

In [8]:
df.drop_duplicates(subset=['team1', 'team2', 'date'], inplace=True)

### Check Data Types

In [None]:
cat_cols = [col for col in df.columns if df[col].dtype == 'O']
num_cols = [col for col in df.columns if df[col].dtype != 'O']

print(f"total {len(cat_cols)} categorical columns:")
for col in cat_cols:
    print(col)

print("==============================")

print(f"total {len(num_cols)} numerical columns")
for col in num_cols:
    print(col)

### Number of Unique Values in Each Categorical Column

In [None]:
for col in cat_cols:
    print(f"{col}: total {len(df[col].unique())} unique values")

### Statistics of Numerical Columns

In [None]:
df.describe()

The numerical columns carry more decimal places than needed. Rounding to 2 decimal places is sufficient.

In [None]:
for col in num_cols:
    df[col] = df[col].round(2)

df[num_cols].head()
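Rounding changes the stored values. If the extra digits only matter for readability, an alternative is to leave the data untouched and change how pandas displays floats. The sketch below contrasts the two on a toy column (values are made up); which option fits depends on whether later steps need the full precision.

```python
import pandas as pd

# Toy numeric column (hypothetical values)
toy = pd.DataFrame({"elo_diff": [12.34567, -7.891011]})

# Option A: round the stored values to the nearest 2 decimal places
rounded = toy["elo_diff"].round(2)
print(rounded)

# Option B: keep the data as-is and only limit the displayed precision
with pd.option_context("display.precision", 2):
    print(toy)
```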

## Exploratory Analysis

### Rating Difference

Does the rating difference have a strong impact on the outcome?

In [None]:
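min_elo_diff = df['elo_diff'].min()
max_elo_diff = df['elo_diff'].max()

# Binary target: 1 if team1 won the match, 0 otherwise
df['team1_win'] = (df['outcome'] == df['team1']).astype(int)

# Bucket the rating difference into 20-point bins; floor/ceil plus padding
# ensures the smallest and largest values both fall inside a bin
bins = range(int(np.floor(min_elo_diff)) - 20, int(np.ceil(max_elo_diff)) + 40, 20)
df['elo_diff_bucket'] = pd.cut(df['elo_diff'], bins=bins)

# Win rate and match count per bucket (empty buckets are skipped)
prob_by_bucket = df.groupby('elo_diff_bucket', observed=True).team1_win.mean()
count_by_bucket = df.groupby('elo_diff_bucket', observed=True).team1_win.count()

# Plot the win rate against each bucket's midpoint
bucket_mid = [interval.mid for interval in prob_by_bucket.index]
sns.lineplot(x=bucket_mid, y=prob_by_bucket.values)
plt.xlabel("Rating Difference (Team1 - Team2)")
plt.ylabel("Probability of Team1 Win")
plt.show()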
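As a reference point for the empirical curve, the standard Elo model gives an expected win probability of 1 / (1 + 10^(-d/400)) for a rating difference d. The sketch below implements that formula. It assumes `elo_diff` is on the conventional 400-point Elo scale, which this notebook does not state; the function name and the `scale` parameter are illustrative.

```python
import numpy as np

def elo_expected_win_prob(elo_diff, scale=400.0):
    """Expected win probability for team1 under the standard Elo formula.

    Assumption: ratings use the conventional scale, where a 400-point
    advantage corresponds to 10:1 odds.
    """
    elo_diff = np.asarray(elo_diff, dtype=float)
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / scale))

# Evenly matched teams, a moderate edge, and a 400-point edge
probs = elo_expected_win_prob([0, 100, 400])
print(probs)
```

Overlaying this curve on the bucketed win rates would show how closely the observed outcomes track the ratings. Buckets with few matches (see `count_by_bucket`) will be noisy and should be read with caution.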