# MLB sports betting forecasting model

To convert this notebook to a Markdown file, run:

```
jupyter nbconvert --execute --to markdown readme.ipynb
```

## Forked from quantgalore.substack.com

My fork is [here](https://github.com/sculd/mlb-props).

* `collect_data_run.ipynb` must be run first. It collects the player stats, game matchup data, etc. and builds the dataset used for training and testing.
  * `df_game_matchup_total.pkl` is the pickled DataFrame that holds the 2011 to 2023 matchups.
  * The `collect_data` output is stored in the `mlb_props_data` bucket on Google Drive.
* `model_training_run.ipynb` trains a model.
* `update_data_run.py` should run at the beginning of each day. It fetches the previous day's matchups and updates all the data.
* `fetch_today_matchup_and_odds_run.py` should also run at the beginning of each day. It creates the matchups for today's live bets and fetches the odds for today's games.

```
$ crontab -l
0 12-20 * * * TZ=US/Eastern /home/sculd3/projects/mlb-props/scripts/daily_cloud_live_update_run.sh
58 12-20 * * * TZ=US/Eastern /home/sculd3/projects/mlb-props/scripts/telegram_notify_new_prediction.sh
0 12 * * * TZ=US/Eastern /home/sculd3/projects/mlb-props/scripts/daily_cloud_run.sh
0 14 * * * TZ=US/Eastern /home/sculd3/projects/mlb-props/scripts/daily_cloud_live_run.sh
```

The odds fetch runs on the GCP VM (sandbox(2)) and writes its results to the `trading-290017.major_league_baseball.odds_hit_recorded` table.

In [1]:
import pycaret
import pandas as pd
import numpy as np
from pycaret import classification
from datetime import datetime
import model.common
from static_data.load_static_data import *  # provides df_player_team_positions, used below

In [3]:
# Directory that holds the output of collect_data_run.ipynb.
collect_data_base_dir = 'collect_data'
df_game_matchup_total = pd.read_pickle(f'{collect_data_base_dir}/df_game_matchup_total.pkl')
# Hold out the games played after 2022-12-01 as the test set.
test_data = df_game_matchup_total[df_game_matchup_total.game_date > "2022-12-01"][model.common.features_1hits_recorded]

In [4]:
regression_model = pycaret.classification.load_model(model.common.model_1hits_file_name)

Transformation Pipeline and Model Successfully Loaded


In [5]:
# Score the test set; this adds the prediction_label and prediction_score columns.
test_prediction = pycaret.classification.predict_model(data = test_data, estimator = regression_model)
# Attach each batter's team name.
test_prediction = pd.merge(test_prediction, df_player_team_positions[['player_id','player_team_name']], left_on='batting_id', right_on='player_id', how='left')
# Convert the predicted probability to theoretical odds.
test_prediction["theo_odds"] = test_prediction["prediction_score"].apply(model.common.odds_calculator)

In [11]:
def get_eval_profile(df_prediction, score_threshold):
    # Keep positive predictions at or above the threshold, one row per batter (the highest score).
    confident_prediction = df_prediction[(df_prediction["prediction_score"] >= score_threshold) & (df_prediction["prediction_label"] == 1)].sort_values(by = "prediction_score", ascending = False).drop_duplicates("batting_name")
    n = len(confident_prediction)
    # Return (row count, fraction of those rows where a hit was actually recorded).
    return n, (confident_prediction.batting_1hits_recorded.sum() / n if n else float("nan"))

In [9]:
score_threshold = 0.75
confident_test_prediction = test_prediction[(test_prediction["prediction_score"] >= score_threshold) & (test_prediction["prediction_label"] == 1)].sort_values(by = "prediction_score", ascending = False).drop_duplicates("batting_name")
confident_test_prediction[['game_date', "batting_name", "batting_1hits_recorded", "prediction_score", "player_team_name", "theo_odds"]]

| | game_date | batting_name | batting_1hits_recorded | prediction_score | player_team_name | theo_odds |
|---|---|---|---|---|---|---|
| 1276 | 2023-04-03 | Paul Goldschmidt | 1 | 0.9856 | Arizona Diamondbacks | -6844 |
| 1085 | 2023-04-03 | Dansby Swanson | 1 | 0.9799 | Atlanta Braves | -4875 |
| 705 | 2023-04-02 | Willson Contreras | 1 | 0.9774 | Chicago Cubs | -4325 |
| 235 | 2023-04-01 | Rafael Devers | 1 | 0.9760 | Boston Red Sox | -4067 |
| 790 | 2023-04-02 | Taylor Ward | 1 | 0.9728 | Los Angeles Angels | -3576 |
| ... | ... | ... | ... | ... | ... | ... |
| 9965 | 2023-04-22 | Teoscar Hernandez | 1 | 0.7524 | Seattle Mariners | -304 |
| 25214 | 2023-05-31 | Ildemaro Vargas | 0 | 0.7522 | Arizona Diamondbacks | -304 |
| 9631 | 2023-04-22 | Josh Bell | 1 | 0.7520 | Cleveland Guardians | -303 |
| 1832 | 2023-04-04 | Eddie Rosario | 1 | 0.7506 | Minnesota Twins | -301 |
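
The `theo_odds` column comes from `model.common.odds_calculator`, whose source is not shown here. The values in the table are consistent with the standard conversion from a win probability to fair American (moneyline) odds, so the following is a minimal sketch under that assumption. `probability_to_american_odds` is a hypothetical name, not the repository's function.

```python
def probability_to_american_odds(p):
    """Convert a win probability in (0, 1) to fair American odds (no bookmaker margin).

    Assumed equivalent of model.common.odds_calculator; the real function may differ.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("probability must be strictly between 0 and 1")
    if p > 0.5:
        # Favorite: the stake needed to win 100, written as a negative number.
        return -round(100 * p / (1 - p))
    # Underdog or even money: the profit on a 100 stake, written as a positive number.
    return round(100 * (1 - p) / p)


# Scores taken from the first and the last two rows of the table.
for score in (0.9856, 0.7520, 0.7506):
    print(score, probability_to_american_odds(score))
```

For these three scores the sketch gives -6844, -303, and -301, matching the `theo_odds` values in the table.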


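`get_eval_profile` returns a pair. The first value is the number of positive predictions at or above the score threshold, counting each batter once (their highest-scoring game). The second is the fraction of those predictions in which the batter actually recorded a hit, i.e. the precision at that threshold.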
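
To make the filtering and de-duplication steps concrete, here is a dependency-free sketch of the same logic on a few made-up rows. The function name `eval_profile`, the dictionary keys, and the toy data are all illustrative assumptions; the notebook itself works on the pandas DataFrame.

```python
def eval_profile(rows, score_threshold):
    """Return (count, hit_rate) for confident positive predictions, one per player."""
    # Keep positive predictions at or above the threshold.
    confident = [r for r in rows if r["score"] >= score_threshold and r["label"] == 1]
    # Highest scores first, so the first row seen for a player is their best one.
    confident.sort(key=lambda r: r["score"], reverse=True)
    best_per_player = {}
    for r in confident:
        best_per_player.setdefault(r["name"], r)
    n = len(best_per_player)
    if n == 0:
        return 0, float("nan")
    hits = sum(r["hit"] for r in best_per_player.values())
    return n, hits / n


# Made-up rows for illustration only.
toy_rows = [
    {"name": "Player A", "score": 0.91, "label": 1, "hit": 1},
    {"name": "Player A", "score": 0.80, "label": 1, "hit": 0},  # dropped as a duplicate
    {"name": "Player B", "score": 0.78, "label": 1, "hit": 0},
    {"name": "Player C", "score": 0.76, "label": 1, "hit": 1},
    {"name": "Player D", "score": 0.62, "label": 1, "hit": 1},
    {"name": "Player E", "score": 0.90, "label": 0, "hit": 0},  # negative prediction
]
print(eval_profile(toy_rows, 0.75))
print(eval_profile(toy_rows, 0.60))
```

At a threshold of 0.75 the toy data keeps three players, two of whom recorded a hit. Lowering the threshold to 0.60 adds Player D and raises the count to four.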

In [12]:
print(get_eval_profile(test_prediction, 0.6))
print(get_eval_profile(test_prediction, 0.7))
print(get_eval_profile(test_prediction, 0.75))
print(get_eval_profile(test_prediction, 0.80))
print(get_eval_profile(test_prediction, 0.85))

(404, 0.8564356435643564)
(325, 0.8738461538461538)
(262, 0.8931297709923665)
(174, 0.896551724137931)
(104, 0.9230769230769231)
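
The hit rate rises with the threshold while the sample shrinks from 404 rows to 104, so the higher-threshold estimates are noisier. As a rough check that is not part of the original notebook, the sketch below computes a 95% Wilson score interval for each row of the output. The hit counts are recovered by multiplying each row count by its hit rate, `wilson_interval` is a hypothetical helper, and the interval assumes independent predictions, which may not hold for games from the same season.

```python
import math


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# (threshold, row count, hit rate) copied from the output above.
profiles = [
    (0.60, 404, 0.8564356435643564),
    (0.70, 325, 0.8738461538461538),
    (0.75, 262, 0.8931297709923665),
    (0.80, 174, 0.896551724137931),
    (0.85, 104, 0.9230769230769231),
]
for threshold, n, rate in profiles:
    hits = round(n * rate)  # recover the integer number of hits
    low, high = wilson_interval(hits, n)
    print(f"threshold={threshold:.2f} n={n} hit_rate={rate:.3f} 95% CI=({low:.3f}, {high:.3f})")
```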
