# Q0: Your level of cricket experience
Honestly, 1.

# Q1: Question 1.
Determine the win records (percentage win and total wins) for each team by year and gender,
excluding ties, matches with no result, and matches decided by the DLS method in the event that, for whatever
reason, the planned innings can’t be completed. Consider only data from 2019. Which male and female teams had
the highest win percentages? Which had the highest total wins? Were these teams the same as those with the
highest win percentages? Comment on why the leaders of these two stats might differ.

Here we work only with the match result data. We want the win records for each team and the teams with the highest values. Let's explore the data first and pull out some summary stats. I have to decide whether to use pandas, Flask, or Spark.

In [None]:
# Import necessary libraries
import json
import pandas as pd
from pathlib import Path
import os

# Load the match results data
with open('../data/match_results.json') as file:
    match_results = json.load(file)

# Convert the data to pandas DataFrames
match_results_df = pd.DataFrame(match_results)
n_entries = len(match_results_df)
match_results_df = match_results_df.dropna(thresh=0.01*n_entries, axis=1)

# List of columns to keep
cols_to_keep = ['dates', 'gender', 'outcome.winner', 
                'teams', 'outcome.method']

# Get the list of all columns
all_cols = match_results_df.columns

# Get the list of columns to drop
cols_to_drop = all_cols.difference(cols_to_keep)

# Drop unnecessary columns
match_results_df = match_results_df.drop(columns=cols_to_drop)

# Display the remaining columns, the number of entries, and the first few rows
print("Number of attributes: ", len(match_results_df.keys()))
print(match_results_df.keys())
print("Number of entries: ", len(match_results_df))
print(match_results_df.head())

In [None]:
# Convert 'dates' to datetime and extract the year
match_results_df['year'] = pd.to_datetime(match_results_df['dates']).dt.year

# Filter for D/L method and count the games per year and team
dl_counts_per_year = match_results_df[match_results_df['outcome.method'] == "D/L"].groupby(['year', 'teams']).size()

print(dl_counts_per_year)

In [None]:
# Convert the 'dates' column to datetime format
match_results_df['dates'] = pd.to_datetime(match_results_df['dates'])

# Keep only 2019 matches, as the question asks
match_results_df = match_results_df[match_results_df['dates'].dt.year == 2019]
# Restrict to men's matches (rerun with 'female' for the women's records)
match_results_df = match_results_df[match_results_df['gender'] == 'male']

print("Number of entries: ", len(match_results_df))
print(match_results_df.head())

In [None]:
# Count wins per team, excluding matches decided by the D/L method
win_counts_no_dl = match_results_df[(match_results_df['outcome.winner'] == match_results_df['teams']) & 
                                    (match_results_df['outcome.method'] != 'D/L')]['teams'].value_counts()

print(win_counts_no_dl)

In [None]:
# Calculate total games per team
total_games_no_dl = match_results_df[(match_results_df['outcome.method'] != 'D/L')]['teams'].value_counts()

# Calculate win percentage
win_percentage_no_dl = (win_counts_no_dl / total_games_no_dl) * 100

print(win_percentage_no_dl)

In [None]:
# Calculate win counts excluding D/L method and ties
win_counts_no_dl_no_tie = match_results_df[(match_results_df['outcome.winner'] == match_results_df['teams']) & 
                                           (match_results_df['outcome.method'] != 'D/L') &
                                           (match_results_df['outcome.winner'].notna())]['teams'].value_counts()

# Calculate total games per team excluding ties
total_games_no_dl_no_tie = match_results_df[(match_results_df['outcome.method'] != 'D/L') & 
                                            (match_results_df['outcome.winner'].notna())]['teams'].value_counts()

# Calculate win percentage
win_percentage_no_dl_no_tie = (win_counts_no_dl_no_tie / total_games_no_dl_no_tie) * 100

print(win_percentage_no_dl_no_tie)
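
The question also asks why the leader in total wins can differ from the leader in win percentage. A minimal sketch on made-up records illustrates one reason: a team that plays few matches can win all of them, while a team that plays many matches collects more wins at a lower rate. The team names, the `matches` list, and the `win_records` helper are hypothetical and are not taken from the real data.

```python
from collections import defaultdict

# Hypothetical decided matches (ties, no-results and D/L games assumed already removed)
matches = [
    {"year": 2019, "gender": "male", "teams": ("A", "B"), "winner": "A"},
    {"year": 2019, "gender": "male", "teams": ("A", "B"), "winner": "A"},
    {"year": 2019, "gender": "male", "teams": ("A", "C"), "winner": "C"},
    {"year": 2019, "gender": "male", "teams": ("A", "B"), "winner": "A"},
    {"year": 2019, "gender": "male", "teams": ("C", "B"), "winner": "C"},
]


def win_records(matches):
    """Return wins, games played and win percentage per (year, gender, team)."""
    played = defaultdict(int)
    won = defaultdict(int)
    for m in matches:
        for team in m["teams"]:
            key = (m["year"], m["gender"], team)
            played[key] += 1
            if team == m["winner"]:
                won[key] += 1
    return {
        key: {"wins": won[key], "played": n, "win_pct": 100.0 * won[key] / n}
        for key, n in played.items()
    }


records = win_records(matches)
most_wins = max(records, key=lambda k: records[k]["wins"])
best_pct = max(records, key=lambda k: records[k]["win_pct"])
print("Most wins:", most_wins, records[most_wins])
print("Best win %:", best_pct, records[best_pct])
```

In this toy data team A has the most wins (3 of 4 games) while team C has the best percentage (2 of 2 games), so a minimum-games threshold is worth considering before ranking by percentage.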

# Q2.
Setting aside individual batter production, cricket teams have two main ‘resources’ for producing
runs: remaining overs and wickets. The role resources have on run production is central to the statistical method
known as ‘DLS’, which is used to award a winner in the case of incomplete/disrupted matches. Use the ball-by-ball
summaries under the innings descriptions of each men’s match to make a dataset with the run and wicket outcomes
for each delivery in a match, excluding matches with no result.
Develop a model to predict an average team’s expected runs per over. Please state or include the assumptions/validation
used to justify your model choice. A visualization prior to modelling could be helpful to justify
your modelling decisions. Save your intermediate data with team, inning order, remaining overs, and remaining
wickets to a JSON/CSV file for Q4. Summarize your conclusions.

First I should keep only the men's data. Let's also start with a single game.
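
Before the full pipeline, here is a small sketch of the target dataset: one row per over, with the runs and wickets from that over and the resources available when it started. It assumes deliveries arrive in bowling order and that `over` is a string such as `"0.1"` (over index, then ball). The `per_over_resources` helper and the toy deliveries are illustrative only; the cells below compute the same quantities differently.

```python
def per_over_resources(deliveries, total_overs=50, total_wickets=10):
    """Aggregate deliveries into one row per over.

    Resources (remaining overs and wickets) are recorded at the START of
    each over, so they never include information from the over itself.
    """
    rows = []
    wickets_lost = 0
    current = None
    for d in deliveries:  # assumed to be sorted in bowling order
        over = int(d["over"].split(".")[0])
        if current is None or current["over"] != over:
            current = {
                "over": over,
                "remaining_overs": total_overs - over,
                "remaining_wickets": total_wickets - wickets_lost,
                "runs": 0,
                "wickets": 0,
            }
            rows.append(current)
        current["runs"] += d["runs"]
        if d["wicket"]:
            current["wickets"] += 1
            wickets_lost += 1
    return rows


# Hypothetical deliveries for the first three overs of one innings
deliveries = [
    {"over": "0.1", "runs": 1, "wicket": False},
    {"over": "0.2", "runs": 0, "wicket": True},
    {"over": "0.3", "runs": 4, "wicket": False},
    {"over": "1.1", "runs": 2, "wicket": False},
    {"over": "1.2", "runs": 0, "wicket": True},
    {"over": "2.1", "runs": 6, "wicket": False},
]

rows = per_over_resources(deliveries)
for row in rows:
    print(row)
```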

In [None]:
def shift_resources(df):
    # Rolling mean of runs over the last 5 overs; incomplete windows default to 2
    df['last_5_overs_mean_runs'] = df['runs'].rolling(window=5).mean()
    df['last_5_overs_mean_runs'] = df['last_5_overs_mean_runs'].fillna(2)

    # Shift by one over so each row only sees the state before that over was bowled
    df['last_5_overs_mean_runs'] = df['last_5_overs_mean_runs'].shift(1)
    df['remaining_wickets'] = df['remaining_wickets'].shift(1)
    df['runs_needed_to_par'] = df['runs_needed_to_par'].shift(1)
    df['cumulative_runs'] = df['cumulative_runs'].shift(1)

    # Initialize the first row: default run rate of 2, all 10 wickets in hand, no runs scored
    df.loc[df.index[0], 'last_5_overs_mean_runs'] = 2
    df.loc[df.index[0], 'remaining_wickets'] = 10
    df.loc[df.index[0], 'cumulative_runs'] = 0
    df.loc[df.index[0], 'runs_needed_to_par'] = df.loc[df.index[0], 'first_innings_runs']

    return df
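
The one-over shift in `shift_resources` matters because the model should predict an over's runs from the state before that over, not from totals that already include it. The sketch below shows the idea with plain lists. `lag` and `prior_mean` are hypothetical helpers, and unlike the rolling window above, `prior_mean` averages over however many earlier overs exist instead of requiring a full window of 5.

```python
def lag(values, initial):
    """Shift values by one position; the first slot gets `initial`."""
    return [initial] + list(values[:-1])


def prior_mean(runs, window=5, default=2.0):
    """Mean of up to `window` earlier overs for each over; `default` when none exist."""
    out = []
    for i in range(len(runs)):
        earlier = runs[max(0, i - window):i]
        out.append(sum(earlier) / len(earlier) if earlier else default)
    return out


runs_per_over = [5, 2, 6]          # runs scored in overs 0, 1, 2
cumulative_runs = [5, 7, 13]       # totals at the END of each over

lagged_cumulative = lag(cumulative_runs, initial=0)
recent_rate = prior_mean(runs_per_over)

print(lagged_cumulative)  # totals known BEFORE each over starts
print(recent_rate)        # average of earlier overs only
```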

In [None]:
# Import necessary libraries
import json
import pandas as pd
from pathlib import Path
import os
import numpy as np

# Load the match results data
with open('../data/match_results.json') as file:
    match_results = json.load(file)

match_results_df = pd.DataFrame(match_results)

# Rename 'teams' column in match_results_df to 'team' for consistency
match_results_df = match_results_df.rename(columns={'teams': 'team'})
match_results_df = match_results_df[match_results_df['outcome.method'] != "D/L"]

# Load the ball-by-ball innings data
with open('../data/innings_results.json') as file:
    innings_data = json.load(file)

innings_df = pd.DataFrame(innings_data)

# Remove matches with no result
match_results_df = match_results_df[match_results_df['outcome.winner'].notna()]

# Get matchids of matches with result
match_ids_with_result = match_results_df['matchid'].unique()

# Filter innings_df to include only matchids of matches with result
innings_df = innings_df[innings_df['matchid'].isin(match_ids_with_result)]

# Merge the gender field from match_results_df to innings_df
innings_df = innings_df.merge(match_results_df[['matchid', 'team', 'gender', 'dates']], on=['matchid', 'team'], how='left')

# Filter for male matches
innings_df = innings_df[innings_df['gender'] == 'male']

# Convert the 'dates' column to datetime format
innings_df['dates'] = pd.to_datetime(innings_df['dates'])

# Optionally restrict to a single year (disabled so that all years are used for modelling)
# innings_df = innings_df[innings_df['dates'].dt.year == 2020]

# Keep only necessary columns 
innings_df = innings_df[['matchid', 'innings', 'over', 'runs.total', 'wicket.kind', 'team']]
innings_df = innings_df.drop_duplicates(subset=['matchid', 'innings', 'over', 'team'])
# innings_df = innings_df[innings_df['innings'] == 1]

# Compute wickets and cumulative wickets directly in innings_df
innings_df['wickets'] = innings_df['wicket.kind'].notnull()
innings_df['wickets_cumul'] = innings_df.groupby(['matchid', 'innings', 'team'])['wickets'].cumsum()
innings_df['runs_cumul'] = innings_df.groupby(['matchid', 'innings', 'team'])['runs.total'].cumsum()
assert innings_df['wickets_cumul'].between(0,10).all()
# Compute remaining wickets and remaining overs
innings_df['over'] = innings_df['over'].apply(lambda x: int(x.split('.')[0]) if int(x.split('.')[1]) == 6 else float(x)).astype(int)
innings_df['remaining_wickets'] = 10 - innings_df['wickets_cumul']
innings_df['remaining_overs'] = 50 - innings_df['over']

# Calculate total runs per match per innings
total_runs = innings_df.groupby(['matchid', 'innings'])['runs.total'].sum().reset_index()

# Filter to keep only the first innings
first_innings_runs = total_runs[total_runs['innings'] == 1]

# Rename columns for merging
first_innings_runs = first_innings_runs.rename(columns={'runs.total': 'first_innings_runs'})
first_innings_runs = first_innings_runs.drop(columns=['innings'])

# Merge first_innings_runs into the main dataframe
innings_df = innings_df.merge(first_innings_runs, on='matchid', how='left')

# Forward fill the NaNs to propagate the first_innings_runs to the second innings
innings_df['first_innings_runs'] = innings_df.groupby('matchid')['first_innings_runs'].ffill()

# At this point, the 'first_innings_runs' column for the first innings holds its own score, 
# which is not what we want. We set 'first_innings_runs' to 0 for the first innings:
print(innings_df.columns)
innings_df.loc[innings_df['innings'] == 1, 'first_innings_runs'] = 0
innings_df['runs_needed_to_par'] = innings_df['first_innings_runs'] - innings_df['runs_cumul']
innings_df.loc[innings_df['innings'] == 1, 'runs_needed_to_par'] = 0
#  Display the dataframe
# with pd.option_context('display.max_rows', None, 'display.max_columns', None, 'display.precision', 3):
#     display(innings_df)

grouped_df = innings_df.groupby(['matchid', 'innings', 'team', 'over']).agg(
    runs=('runs.total', 'sum'),
    wickets=('wickets', 'sum'),  
    remaining_wickets=('remaining_wickets', 'min'),  # minimum remaining wickets in the over
    remaining_overs=('remaining_overs', 'min'),  # minimum remaining overs in the over
    cumulative_runs=('runs_cumul', 'max'),  # cumulative runs at the end of the over
    first_innings_runs=('first_innings_runs', 'max'),
    runs_needed_to_par=('runs_needed_to_par', 'min')
).reset_index()
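# group_keys=False keeps the original flat index instead of adding the group keys as index levels
grouped_df = grouped_df.groupby(['matchid', 'innings'], group_keys=False).apply(shift_resources)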


In [None]:
grouped_df.to_csv('../data/intermediate_data.csv', index=False)

In [None]:
# Display the dataframe
# with pd.option_context('display.max_rows', None, 'display.max_columns', None, 'display.precision', 3):
#     display(grouped_df)

In [None]:
# Display the dataframe
with pd.option_context('display.max_rows', None, 'display.max_columns', None, 'display.precision', 3):
    display(grouped_df[grouped_df['matchid']=="1153846"])

### Sanity Checks

In [None]:
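# Rows where the chasing side has already passed the first-innings total (negative runs needed to par)
anomalies = grouped_df[grouped_df['runs_needed_to_par'] < 0]
# Clamp only the affected column to 0; assigning to the whole row would overwrite every field
grouped_df.loc[grouped_df['runs_needed_to_par'] < 0, 'runs_needed_to_par'] = 0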
print(len(grouped_df))
# Display these instances
print(len(anomalies))
display(anomalies)
anomaly_details = innings_df[(innings_df['matchid'].isin(anomalies['matchid'])) & 
                             (innings_df['innings'].isin(anomalies['innings']))]
print(anomaly_details)

In [None]:
# Sanity checks
assert grouped_df['runs'].between(0, 36).all()
assert grouped_df['remaining_wickets'].between(0, 10).all()
assert grouped_df['remaining_overs'].between(0, 50).all()
assert innings_df['wickets'].between(0,6).all()
print(min(grouped_df['runs_needed_to_par']), max(grouped_df['runs_needed_to_par']))
assert grouped_df['runs_needed_to_par'].between(0,1000*50*6*6).all()
assert grouped_df.isnull().sum().sum() == 0
assert grouped_df.duplicated().sum() == 0
# groupd_df = grouped_df[['over', 'remaining_overs', 'remaining_wickets', 'runs']]
# assert (grouped_df.sort_values(['matchid', 'innings', 'over'])['remaining_overs'].diff().dropna() <= 0).all()

### Visualizations

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Calculate the percentage of remaining resources
grouped_df['percentage_of_combined_resources'] = (grouped_df['remaining_wickets'] / 10 + grouped_df['remaining_overs'] / 50) / 2 * 100

plt.figure(figsize=(10, 8))

# Create a line plot for each value of remaining wickets
for wickets in grouped_df['remaining_wickets'].unique():
    subset_df = grouped_df[grouped_df['remaining_wickets'] == wickets]
    sns.lineplot(x='remaining_overs', y='percentage_of_combined_resources', data=subset_df, label=f'Remaining wickets: {wickets}')

plt.legend()
plt.xlabel('Remaining Overs')
plt.ylabel('Percentage of Combined Resources')
plt.title('Remaining Resources vs Remaining Overs for Different Numbers of Wickets')
plt.show()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Remaining overs vs Runs
plt.figure(figsize=(8, 6))
sns.scatterplot(data=grouped_df, x="remaining_overs", y="runs")
plt.title('Remaining Overs vs Runs')
plt.xlabel('Remaining Overs')
plt.ylabel('Runs')
plt.show()

# 2. Remaining wickets vs Runs
plt.figure(figsize=(8, 6))
sns.scatterplot(data=grouped_df, x="remaining_wickets", y="runs")
plt.title('Remaining Wickets vs Runs')
plt.xlabel('Remaining Wickets')
plt.ylabel('Runs')
plt.show()

# 3. Remaining overs and Remaining wickets vs Runs
plt.figure(figsize=(8, 6))
sns.scatterplot(data=grouped_df, x="remaining_overs", y="runs", hue="remaining_wickets", palette="viridis")
plt.title('Remaining Overs and Remaining Wickets vs Runs')
plt.xlabel('Remaining Overs')
plt.ylabel('Runs')
plt.legend(title='Remaining Wickets')
plt.show()

# 4. Percentage of combined resources vs Runs
plt.figure(figsize=(8, 6))
sns.scatterplot(data=grouped_df, x="percentage_of_combined_resources", y="runs")
plt.title('Percentage of Combined Resources vs Runs')
plt.xlabel('Percentage of Combined Resources')
plt.ylabel('Runs')
plt.show()


In [None]:
# Line plots

# 1. Remaining overs vs Runs
plt.figure(figsize=(8, 6))
sns.lineplot(data=grouped_df, x="remaining_overs", y="runs")
plt.title('Remaining Overs vs Runs')
plt.xlabel('Remaining Overs')
plt.ylabel('Runs')
plt.show()

# 2. Remaining wickets vs Runs
plt.figure(figsize=(8, 6))
sns.lineplot(data=grouped_df, x="remaining_wickets", y="runs")
plt.title('Remaining Wickets vs Runs')
plt.xlabel('Remaining Wickets')
plt.ylabel('Runs')
plt.show()

# 3. Remaining overs and Remaining wickets vs Runs
# For this, we will make a heatmap
pivot_table = grouped_df.pivot_table(index='remaining_overs', columns='remaining_wickets', values='runs', aggfunc='mean')
plt.figure(figsize=(10, 8))
sns.heatmap(pivot_table, cmap='viridis')
plt.title('Remaining Overs and Remaining Wickets vs Runs')
plt.xlabel('Remaining Wickets')
plt.ylabel('Remaining Overs')
plt.show()

# 4. Percentage of combined resources vs Runs
plt.figure(figsize=(8, 6))
sns.lineplot(data=grouped_df, x="percentage_of_combined_resources", y="runs")
plt.title('Percentage of Combined Resources vs Runs')
plt.xlabel('Percentage of Combined Resources')
plt.ylabel('Runs')
plt.show()


In [None]:
#  1. Line plots
plt.figure(figsize=(8, 6))
sns.lineplot(data=grouped_df, x="remaining_overs", y="runs")
plt.title('Remaining Overs vs Runs')
plt.xlabel('Remaining Overs')
plt.ylabel('Runs')
plt.show()

# 1b. Line plot: remaining overs vs remaining wickets
plt.figure(figsize=(8, 6))
sns.lineplot(data=grouped_df, x="remaining_overs", y="remaining_wickets")
plt.title('Remaining Overs vs remaining wickets')
plt.xlabel('Remaining Overs')
plt.ylabel('wickets')
plt.show()

# 2. Box plots
plt.figure(figsize=(8, 6))
sns.boxplot(data=grouped_df, x="remaining_overs", y="runs")
plt.title('Remaining Overs vs Runs')
plt.xlabel('Remaining Overs')
plt.ylabel('Runs')
plt.show()

# 3. Histograms
plt.figure(figsize=(8, 6))
sns.histplot(data=grouped_df, x="runs")
plt.title('Distribution of Runs')
plt.xlabel('Runs')
plt.ylabel('Count')
plt.show()

# 3b. Histogram of wickets per over
plt.figure(figsize=(8, 6))
sns.histplot(data=grouped_df, x="wickets")
plt.title('Distribution of wickets')
plt.xlabel('Wickets')
plt.ylabel('Count')
plt.show()

# 3c. Histogram of remaining wickets
plt.figure(figsize=(8, 6))
sns.histplot(data=grouped_df, x="remaining_wickets")
plt.title('Distribution of remaining wickets')
plt.xlabel('remaining wickets')
plt.ylabel('Count')
plt.show()

# 3d. Histogram of remaining overs
plt.figure(figsize=(8, 6))
sns.histplot(data=grouped_df, x="remaining_overs")
plt.title('Distribution of remaining overs')
plt.xlabel('remaining overs')
plt.ylabel('Count')
plt.show()

# 4. Heatmaps
plt.figure(figsize=(8, 6))
heatmap_data = grouped_df.pivot_table(index='remaining_wickets', columns='remaining_overs', values='runs', aggfunc='mean')
sns.heatmap(heatmap_data, cmap="YlGnBu")
plt.title('Runs by Remaining Overs and Wickets')
plt.xlabel('Remaining Overs')
plt.ylabel('Remaining Wickets')
plt.show()


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Histogram for runs
plt.figure(figsize=(8, 6))
sns.histplot(data=grouped_df, x="runs", bins=30, kde=True)
plt.title('Distribution of Runs')
plt.xlabel('Runs')
plt.ylabel('Frequency')
plt.show()

# Histogram for remaining overs
plt.figure(figsize=(8, 6))
sns.histplot(data=grouped_df, x="remaining_overs", bins=30, kde=True)
plt.title('Distribution of Remaining Overs')
plt.xlabel('Remaining Overs')
plt.ylabel('Frequency')
plt.show()

# Histogram for remaining wickets
plt.figure(figsize=(8, 6))
sns.histplot(data=grouped_df, x="remaining_wickets", bins=10, kde=True)
plt.title('Distribution of Remaining Wickets')
plt.xlabel('Remaining Wickets')
plt.ylabel('Frequency')
plt.show()


### Model training

In [None]:
np.mean(abs(np.array(y_pred - y * grouped_df["runs"].max())))

In [None]:
X /= np.array([10, 50])
print(max(X[:,0]))
print(max(X[:,1]))

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
from xgboost import XGBRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import RandomForestRegressor

# Prepare the data
X = np.sqrt(np.array(grouped_df[["remaining_overs", "remaining_wickets", "innings", "cumulative_runs", "runs_needed_to_par", "last_5_overs_mean_runs"]]))
# X = np.sqrt(np.array(grouped_df[["remaining_overs"]]))
y = np.array(grouped_df["runs"])

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define preprocessing steps
preprocessor = StandardScaler()

# Dictionary of models
models = {
    "Linear Regression": LinearRegression(),
    "XGBoost": XGBRegressor(objective ='reg:squarederror', n_estimators=100, max_depth=7, eta=1, subsample=0.7, colsample_bytree=0.8),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
    "MLP": MLPRegressor(hidden_layer_sizes=(50, 10), learning_rate_init=0.01, max_iter=1000, random_state=1)
}

# Loop through the models and train and evaluate each one
for model_name, model in models.items():
    # Optional pipeline (preprocessor + model), currently disabled:
    # pipeline = Pipeline(steps=[('preprocessor', preprocessor),
    #                            ('model', model)])
    # Train the model
    model.fit(X_train, y_train)
    # Make predictions on the train set
    y_pred_train = model.predict(X_train)
    
    # Make predictions on the test set
    y_pred_val = model.predict(X_test)
    # Evaluate the model
    mse_train = mean_squared_error(y_train, y_pred_train) 
    mse_test = mean_squared_error(y_test, y_pred_val)
    r2 = r2_score(y_test, y_pred_val)
    print(f"Model: {model_name}")
    print(f"RMSE train: {np.sqrt(mse_train)}")
    print(f"RMSE test: {np.sqrt(mse_test)}")
    print(f"R2 score test: {r2}")
    print("--------------------------")
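
Two validation ideas may help put the RMSE values in context. Overs from the same match are correlated, so a random row-level split can place neighbouring overs of one match in both train and test, and any RMSE is easier to judge next to a baseline that always predicts the training mean. The sketch below splits by `matchid` and computes that baseline on synthetic rows. `split_by_match`, `rmse`, and the generated data are hypothetical and are not part of the notebook's pipeline.

```python
import math
import random


def split_by_match(rows, test_fraction=0.2, seed=42):
    """Hold out whole matches so no match contributes rows to both sets."""
    match_ids = sorted({r["matchid"] for r in rows})
    rng = random.Random(seed)
    rng.shuffle(match_ids)
    n_test = max(1, round(len(match_ids) * test_fraction))
    test_ids = set(match_ids[:n_test])
    train = [r for r in rows if r["matchid"] not in test_ids]
    test = [r for r in rows if r["matchid"] in test_ids]
    return train, test


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Synthetic data: 10 matches with 4 overs each (values are arbitrary)
rows = [{"matchid": f"m{i // 4}", "runs": (i * 7) % 11} for i in range(40)]

train, test = split_by_match(rows)

# Baseline: always predict the mean runs per over seen in training
baseline = sum(r["runs"] for r in train) / len(train)
baseline_rmse = rmse([r["runs"] for r in test], [baseline] * len(test))
print(f"Baseline RMSE (predict training mean): {baseline_rmse:.3f}")
```

A model whose test RMSE is not clearly below this baseline is adding little beyond the average run rate.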


In [159]:
from joblib import dump
dump(models["MLP"], "../models/cricket_model.pkl")

['../models/cricket_model.pkl']

# Q3. 
More generally and unrelated to cricket or the previous questions, model deployment in a production
environment is an important aspect of an engineer’s toolkit. Describe a scalable architecture (a diagram may
be helpful) that would be appropriate for deploying a model that predicts frame-level play values into a cloud
environment with the following assumptions:
• Spatial temporal high frame-rate data (~1 GB per game)
• Play-values are predicted at each frame of a game
• Game predictions are expected to be delivered overnight
• 500 games per season with 50 games a day
• 5 seasons of existing data
• Model training resources:
– 8 hour runtime with multiple cores (8) and large memory usage
• Model prediction resources:
– 60 min runtime per game with a single CPU and 4 GB of memory usage
List out the services, tooling, and reasoning for the choices of architecture. For example, a LAMP stack could be
appropriate for an internal home network webpage on a Raspberry Pi.
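
A quick back-of-the-envelope calculation helps size the architecture from the numbers above. The 8-hour overnight window is an assumption, since the prompt only says "overnight"; every other figure comes from the stated requirements. The `nightly_plan` helper is illustrative.

```python
import math

# Figures taken from the stated assumptions
GAMES_PER_DAY = 50
GAMES_PER_SEASON = 500
SEASONS_OF_HISTORY = 5
GB_PER_GAME = 1
PREDICT_MINUTES_PER_GAME = 60
PREDICT_MEMORY_GB = 4


def nightly_plan(window_hours):
    """Size the nightly prediction batch for a given delivery window (hours)."""
    # Each job occupies one CPU for the whole game, so jobs are indivisible
    games_per_worker = (window_hours * 60) // PREDICT_MINUTES_PER_GAME
    workers = math.ceil(GAMES_PER_DAY / games_per_worker)
    return {
        "cpu_hours": GAMES_PER_DAY * PREDICT_MINUTES_PER_GAME / 60,
        "min_workers": workers,
        "peak_memory_gb": workers * PREDICT_MEMORY_GB,
        "daily_ingest_gb": GAMES_PER_DAY * GB_PER_GAME,
    }


historical_storage_gb = SEASONS_OF_HISTORY * GAMES_PER_SEASON * GB_PER_GAME
plan = nightly_plan(window_hours=8)  # assumed overnight window

print("Historical storage (GB):", historical_storage_gb)
print("Nightly plan:", plan)
```

Under these assumptions the nightly batch needs about 50 CPU-hours and at least 7 single-CPU workers with 4 GB each. That points toward independent per-game jobs on a batch or container service that scales out, with object storage holding roughly 2.5 TB of history, and a separate, larger instance reserved for the 8-hour training run.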