# MLB sports betting forecasting model

## Forked from quantgalore.substack.com

My fork is [here](https://github.com/sculd/mlb-props).

* `collect_data_run.ipynb` needs to be run once initially. It collects the player stats, game matchup data, etc., and builds the dataset used for training and testing.
  * `df_game_matchup_total.pkl` is the pickled dataframe that holds the matchups from 2011 to 2023.
  * The `collect_data` directory is stored in the `mlb_props_data` bucket on Google Drive.
* `model_training_run.ipynb` trains the model.
* `update_data_run.py` should run at the beginning of each day. It fetches the previous day's matchups and updates all the data.
* `fetch_today_matchup_and_odds_run.py` should also run at the beginning of each day. It creates the matchups for today's live bets and fetches the odds for today's games.
* Add the line `0 8 * * * /home/junlim/projects/mlb-props/daily_run.sh` to the crontab to run `daily_run.sh` every day at 8 a.m.
* Add the line `0 10 * * * /home/junlim/projects/mlb-props/daily_live_run.sh` to the crontab to run `daily_live_run.sh` every day at 10 a.m.


In [2]:
import pycaret
import pandas as pd
import numpy as np
import sqlalchemy
import mysql.connector

from pycaret import classification
from pycaret.classification import plot_model
from datetime import datetime
import importlib
import model.common
from static_data.load_static_data import *

In [6]:
collect_data_Base_dir = 'collect_data'
df_game_matchup_total = pd.read_pickle(f'{collect_data_Base_dir}/df_game_matchup_total.pkl')
print(len(df_game_matchup_total))

322963


In [15]:
test_data = df_game_matchup_total[(df_game_matchup_total.game_date > "2022-12-01")][model.common.features]

In [21]:
regression_model = pycaret.classification.load_model(model.common.model_file_name)

Transformation Pipeline and Model Successfully Loaded


In [22]:
test_prediction = pycaret.classification.predict_model(data = test_data, estimator = regression_model)
test_prediction = pd.merge(test_prediction, df_player_team_positions[['player_id','player_team_name']], left_on='batting_id', right_on='player_id', how='left')
test_prediction["theo_odds"] = test_prediction["prediction_score"].apply(model.common.odds_calculator)
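
The `theo_odds` column turns each `prediction_score` into a betting price. The source of `model.common.odds_calculator` is not shown here, so the function below is only a sketch under an assumption: that the score is treated as a win probability and converted to fair (no-vig) American odds. The name `probability_to_american_odds` is hypothetical. The standard conversion does agree with the values in the output table (0.91 gives -1011, 0.90 gives -900, 0.75 gives -300).

```python
def probability_to_american_odds(probability: float) -> int:
    """Convert a win probability into fair American odds.

    Hypothetical helper: an assumed stand-in for model.common.odds_calculator,
    whose real implementation is not shown in this README.
    """
    if not 0.0 < probability < 1.0:
        raise ValueError("probability must be strictly between 0 and 1")
    if probability >= 0.5:
        # Favorite: negative odds, the stake needed to win 100.
        return -round(100 * probability / (1 - probability))
    # Underdog: positive odds, the profit on a stake of 100.
    return round(100 * (1 - probability) / probability)


for score in (0.91, 0.90, 0.88, 0.75):
    print(score, probability_to_american_odds(score))
```

Read this way, a bet is only attractive when the sportsbook offers a better price than the theoretical one, for example -250 against a theoretical -300.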

In [23]:
def get_eval_profile(df_prediction, score_threshold):
    # Keep confident positive predictions, one row (the highest score) per batter.
    confident_prediction = df_prediction[(df_prediction["prediction_score"] >= score_threshold) & (df_prediction["prediction_label"] == 1)].sort_values(by = "prediction_score", ascending = False).drop_duplicates("batting_name")
    n = len(confident_prediction)
    if n == 0:
        # No row clears the threshold, so the hit rate is undefined.
        return 0, float("nan")
    # (sample size, fraction of those batters who actually recorded a hit)
    return n, confident_prediction.batting_hit_recorded.sum() / n

In [24]:
score_threshold = 0.75
confident_test_prediction = test_prediction[(test_prediction["prediction_score"] >= score_threshold) & (test_prediction["prediction_label"] == 1)].sort_values(by = "prediction_score", ascending = False).drop_duplicates("batting_name")
confident_test_prediction[["game_date", "batting_name", "batting_hit_recorded", "prediction_score", "player_team_name", "theo_odds"]]

| | game_date | batting_name | batting_hit_recorded | prediction_score | player_team_name | theo_odds |
|---|---|---|---|---|---|---|
| 23933 | 2023-05-25 | Randal Grichuk | 1 | 0.91 | Colorado Rockies | -1011 |
| 16151 | 2023-05-07 | Freddie Freeman | 1 | 0.90 | Atlanta Braves | -900 |
| 7494 | 2023-04-17 | Shohei Ohtani | 1 | 0.88 | Los Angeles Angels | -733 |
| 9918 | 2023-04-22 | Rafael Devers | 1 | 0.88 | Boston Red Sox | -733 |
| 25906 | 2023-06-03 | Nico Hoerner | 1 | 0.88 | Chicago Cubs | -733 |
| ... | ... | ... | ... | ... | ... | ... |
| 22095 | 2023-05-23 | Anthony Santander | 1 | 0.75 | Baltimore Orioles | -300 |
| 17950 | 2023-05-12 | Marcus Semien | 1 | 0.75 | Oakland Athletics | -300 |
| 15338 | 2023-05-06 | C.J. Cron | 1 | 0.75 | Colorado Rockies | -300 |
| 11867 | 2023-04-27 | Trey Mancini | 1 | 0.75 | Chicago Cubs | -300 |


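`get_eval_profile` returns a pair. The first value is the number of rows at or above the score threshold, counting each batter once (only the batter's highest-scoring positive prediction is kept). The second value is the fraction of those rows in which the batter actually recorded a hit, i.e. the precision at that threshold.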
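
To make the two numbers concrete, here is a dependency-free restatement of the same filter, sort, de-duplicate, and ratio steps on plain Python dictionaries. It is a sketch, not the notebook's code: the function name `eval_profile` and the rows in `toy_rows` are made up for illustration, and only the column names are taken from the notebook.

```python
def eval_profile(rows, score_threshold):
    """Return (count, hit_rate) over confident positive predictions, one row per batter."""
    confident = [
        r for r in rows
        if r["prediction_score"] >= score_threshold and r["prediction_label"] == 1
    ]
    # Keep only the highest-scoring row per batter, mirroring
    # sort_values(ascending=False) followed by drop_duplicates("batting_name").
    best_per_batter = {}
    for r in sorted(confident, key=lambda r: r["prediction_score"], reverse=True):
        best_per_batter.setdefault(r["batting_name"], r)
    count = len(best_per_batter)
    if count == 0:
        return 0, float("nan")  # hit rate is undefined with no samples
    hits = sum(r["batting_hit_recorded"] for r in best_per_batter.values())
    return count, hits / count


# Made-up rows for illustration only; not taken from the real dataset.
toy_rows = [
    {"batting_name": "A", "prediction_score": 0.90, "prediction_label": 1, "batting_hit_recorded": 1},
    {"batting_name": "A", "prediction_score": 0.80, "prediction_label": 1, "batting_hit_recorded": 0},
    {"batting_name": "B", "prediction_score": 0.78, "prediction_label": 1, "batting_hit_recorded": 0},
    {"batting_name": "C", "prediction_score": 0.76, "prediction_label": 1, "batting_hit_recorded": 1},
    {"batting_name": "D", "prediction_score": 0.95, "prediction_label": 0, "batting_hit_recorded": 0},
    {"batting_name": "E", "prediction_score": 0.60, "prediction_label": 1, "batting_hit_recorded": 1},
]

# At 0.75, batters A, B, and C qualify, and two of them recorded a hit.
print(eval_profile(toy_rows, 0.75))
# At 0.85, only batter A qualifies.
print(eval_profile(toy_rows, 0.85))
```

Batter A's second row is dropped by the de-duplication, batter D is excluded because the predicted label is 0, and batter E falls below the threshold. Raising the threshold shrinks the sample, so hit rates at high thresholds rest on few observations.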

In [25]:
print(get_eval_profile(test_prediction, 0.6))
print(get_eval_profile(test_prediction, 0.7))
print(get_eval_profile(test_prediction, 0.75))
print(get_eval_profile(test_prediction, 0.80))
print(get_eval_profile(test_prediction, 0.85))

(215, 0.6744186046511628)
(105, 0.7238095238095238)
(61, 0.8360655737704918)
(29, 0.9655172413793104)
(10, 1.0)
