# Betting on Baseball - How it Works
- ### Baseball bets are made with respect to a "money line"
- ### The line represents the oddsmaker's estimate of the probability that a given team will win a particular game.
- ### The less likely it is for a team to win, the more money you make on your bet if it wins.

- ### Suppose a team has probability 1/3 of winning, and you bet \\$100.  A fair bet would pay you \\$200 if you win.
- ### In that case, your "expected value" would be zero:
    - ### $EV = (1/3) * 200 + (2/3) * (-100) = 0$
- ### In the U.S., odds are represented by a Line value, which is either a positive number (+100 or higher) or a negative number (-100 or lower).
- ### A positive number means the team is an underdog, and the "implied probability" is $100 / (100 + Line)$
    - ### Example: If the line for the Angels is +150 against the Dodgers, the "implied probability" is $100 / (100 + 150) = .4 = 40\%$
- ### A negative number means the team is the favorite, and the implied probability is $-Line / (100 - Line)$
    - ### Example: If the line for the Cubs is -300 against the Dodgers, the "implied probability" is $-(-300) / (100 - (-300)) = 300/400 = .75 = 75\%$
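
A minimal sketch of the two formulas above in plain Python, together with the expected-value check from the 1/3 example. The function names `implied_prob`, `profit_if_win`, and `expected_value` are illustrative and not part of this notebook. The profit rule assumed here is the usual reading of U.S. odds: a +L line pays L for every 100 staked, and a -L line pays 100 for every L staked.

```python
def implied_prob(line):
    """Implied win probability for a U.S. money line."""
    if line > 0:                      # underdog
        return 100 / (100 + line)
    return -line / (100 - line)       # favorite


def profit_if_win(line, stake=100):
    """Profit on a winning bet of `stake` at the given line."""
    if line > 0:
        return stake * line / 100
    return stake * 100 / -line


def expected_value(line, p_win, stake=100):
    """Expected profit when the true win probability is p_win."""
    return p_win * profit_if_win(line, stake) - (1 - p_win) * stake


print(implied_prob(150))           # Angels example from the text
print(implied_prob(-300))          # Cubs example from the text
print(expected_value(200, 1 / 3))  # fair bet: zero up to floating-point rounding
```

A line of +200 corresponds to an implied probability of 1/3, so betting 100 at that line on a team whose true win probability is 1/3 reproduces the zero expected value computed above.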

- ### Since bookies are trying to make money, they will "slant" the probabilities (on both sides) in their favor.
- ### For example, if a game is truly 50-50, you will typically win only \\$90 or \\$95 on a \\$100 bet.
- ### This means the implied probabilities on both sides of a game add up to more than 1 (more than 100\%).
- ### This "edge" is often referred to as "vigorish" or "vig".
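
To make the size of the vig concrete, here is a small sketch using the 90 and 95 payouts mentioned above. It assumes both sides of a 50-50 game are offered at the same payout. The helper name `implied_prob_from_payout` is hypothetical. It uses the fair-bet condition from the expected-value example (probability times winnings equals the complementary probability times the stake).

```python
def implied_prob_from_payout(stake, win):
    """Probability at which risking `stake` to win `win` has zero expected value."""
    return stake / (stake + win)


for win in (90, 95):
    p_side = implied_prob_from_payout(100, win)
    total = 2 * p_side  # both sides priced identically (assumption)
    print(f"win {win}: each side {p_side:.4f}, sum {total:.4f}, excess {total - 1:.4f}")
```

The amount by which the sum exceeds 1 is one simple way to quantify the bookmaker's edge on that game.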

## Baseball Prediction: 3a - Getting Odds Data

In this notebook we will get historical odds data from oddsshark.com.  We will use the pandas `read_html` function to read an HTML table into a dataframe, and show how to programmatically sweep through all the necessary URLs to get the data we need.

We will save this data as a collection of CSV files.  In the next notebook, we will use these CSV files to add the odds information to our primary dataframe.
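
As a quick offline illustration of what `read_html` does, the sketch below parses a small hand-written HTML table. The rows are made up for the example and are not real odds. `read_html` returns a list of dataframes, one per `<table>` found, which is why we index with `[0]`. It needs an HTML parser such as `lxml` (or `bs4` with `html5lib`) to be installed.

```python
from io import StringIO

import pandas as pd

# Hand-written stand-in for a scraped page (made-up rows)
html = """
<table>
  <thead>
    <tr><th>Date</th><th>Opponent</th><th>Game</th><th>Line</th></tr>
  </thead>
  <tbody>
    <tr><td>Apr 1, 2021</td><td>@ Detroit</td><td>REG</td><td>-150</td></tr>
    <tr><td>Apr 3, 2021</td><td>@ Detroit</td><td>REG</td><td>120</td></tr>
  </tbody>
</table>
"""

tables = pd.read_html(StringIO(html))  # list of dataframes, one per table
df_demo = tables[0]
print(len(tables))
print(df_demo)
```

Wrapping the markup in `StringIO` avoids the deprecation warning that recent pandas versions raise when a literal HTML string is passed directly.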

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
pd.set_option('display.max_columns',1000)
pd.set_option('display.max_rows',1000)

In [2]:
import lxml
import html5lib
from urllib.request import urlopen
import time


### Get Line and over/under from oddsshark
Go to https://www.oddsshark.com/mlb/game-logs

In [3]:
df1 = pd.read_html('https://www.oddsshark.com/stats/gamelog/baseball/mlb/27024?season=2020')[0] # read_html returns a list of tables; take the first

In [4]:
df1.head(10)

Unnamed: 0,Date,Opponent,Game,Result,Score,Line,OU,Total
0,"Mar 27, 2025",@ Kansas City,REG,,,,,
1,"Mar 29, 2025",@ Kansas City,REG,,,,,
2,"Mar 30, 2025",@ Kansas City,REG,,,,,
3,"Mar 31, 2025",@ San Diego,REG,,,,,
4,"Apr 1, 2025",@ San Diego,REG,,,,,
5,"Apr 2, 2025",@ San Diego,REG,,,,,
6,"Apr 4, 2025",@ LA Angels,REG,,,,,
7,"Apr 5, 2025",@ LA Angels,REG,,,,,
8,"Apr 6, 2025",@ LA Angels,REG,,,,,
9,"Apr 8, 2025",vs Chi White Sox,REG,,,,,


In [5]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 162 entries, 0 to 161
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Date      162 non-null    object 
 1   Opponent  162 non-null    object 
 2   Game      162 non-null    object 
 3   Result    0 non-null      float64
 4   Score     0 non-null      float64
 5   Line      0 non-null      float64
 6   OU        0 non-null      float64
 7   Total     0 non-null      float64
dtypes: float64(5), object(3)
memory usage: 10.3+ KB


In [6]:
def line_to_prob(line):
    # probability for the underdog side of a line with this magnitude
    prob_underdog = 100/(np.abs(line)+100)
    add_term = ((1-np.sign(line))/2) # 1 if negative, 0 if positive
    mult_factor = np.sign(line) # -1 if negative, 1 if positive
    # if line is positive, team is underdog: give 0 + 1*prob_underdog
    # if line is negative, team is favorite: give 1 + (-1)*prob_underdog
    imp_prob = add_term + mult_factor * prob_underdog
    return imp_prob
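
For comparison, here is an equivalent formulation using `np.where`, which some readers may find easier to follow than the sign arithmetic. This is an alternative sketch, not the function used in the rest of the notebook, and the name `line_to_prob_where` is illustrative. It reproduces the Angels (+150) and Cubs (-300) examples from the introduction, and missing lines (NaN) stay NaN.

```python
import numpy as np


def line_to_prob_where(line):
    """Implied probability from a U.S. money line (scalar or array-like)."""
    line = np.asarray(line, dtype=float)
    magnitude = np.abs(line)
    prob_underdog = 100 / (100 + magnitude)        # used when line > 0
    prob_favorite = magnitude / (100 + magnitude)  # used when line < 0
    # NaN fails the `line > 0` test and falls through to the NaN favorite value
    return np.where(line > 0, prob_underdog, prob_favorite)


print(line_to_prob_where([150, -300, np.nan]))
```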

## Plan of Attack
- Get the "number" for each team
- Read the table for each team, and for each season (2019-2022)
- Lightly process the data frame (remove playoffs, process date, add game_no, add "team_source", convert line)
- Save each file





In [7]:
# manually figure out what number in url corresponds to which team
# use the 3 letter abbrev from retrosheet for each team

oddsshark_num_to_team_dict = {}
oddsshark_num_to_team_dict[26995]='PHI'
oddsshark_num_to_team_dict[26996]='SDN'
oddsshark_num_to_team_dict[26997]='SFN'
oddsshark_num_to_team_dict[26998]='ANA'
oddsshark_num_to_team_dict[26999]='DET'
oddsshark_num_to_team_dict[27000]='CIN'
oddsshark_num_to_team_dict[27001]='NYA'
oddsshark_num_to_team_dict[27002]='TEX'
oddsshark_num_to_team_dict[27003]='TBA'
oddsshark_num_to_team_dict[27004]='COL'
oddsshark_num_to_team_dict[27005]='MIN'
oddsshark_num_to_team_dict[27006]='KCA'
oddsshark_num_to_team_dict[27007]='ARI'
oddsshark_num_to_team_dict[27008]='BAL'
oddsshark_num_to_team_dict[27009]='ATL'
oddsshark_num_to_team_dict[27010]='TOR'
oddsshark_num_to_team_dict[27011]='SEA'
oddsshark_num_to_team_dict[27012]='MIL'
oddsshark_num_to_team_dict[27013]='PIT'
oddsshark_num_to_team_dict[27014]='NYN'
oddsshark_num_to_team_dict[27015]='LAN'
oddsshark_num_to_team_dict[27016]='OAK'
oddsshark_num_to_team_dict[27017]='WAS'
oddsshark_num_to_team_dict[27018]='CHA'
oddsshark_num_to_team_dict[27019]='SLN'
oddsshark_num_to_team_dict[27020]='CHN'
oddsshark_num_to_team_dict[27021]='BOS'
oddsshark_num_to_team_dict[27022]='MIA'
oddsshark_num_to_team_dict[27023]='HOU'
oddsshark_num_to_team_dict[27024]='CLE'

In [8]:
for i in range(26995, 27025):
    team_name = oddsshark_num_to_team_dict[i]
    print(team_name)
    for season in range(2019,2023):
        print(season)
        url = 'https://www.oddsshark.com/stats/gamelog/baseball/mlb/'+str(i)+'?season='+str(season)
        df_temp = pd.read_html(url)[0]
        # keep regular-season games only; copy so the new columns below
        # are added to an independent frame rather than a slice
        df_temp = df_temp[df_temp.Game=='REG'].copy()
        print(df_temp.shape)
        df_temp['team_source'] = team_name
        df_temp['season'] = season
        # date as a YYYYMMDD string
        df_temp['date_numeric'] = pd.to_datetime(df_temp.Date).astype(str).str.replace('-','')
        df_temp['game_no'] = np.arange(1,df_temp.shape[0]+1)
        df_temp['prob_implied'] = line_to_prob(df_temp['Line'])
        # doubleheader flag: compare each date with the next and previous game's date
        # 0 = single game, 1 = first game of a doubleheader, 2 = second game
        next_game_date = np.concatenate((df_temp['date_numeric'].iloc[1:],[0]))
        previous_game_date = np.concatenate(([0], df_temp['date_numeric'].iloc[:-1]))
        game_1_dblheader = (df_temp.date_numeric.to_numpy()==next_game_date).astype(int)
        game_2_dblheader = (df_temp.date_numeric.to_numpy()==previous_game_date).astype(int)*2
        df_temp['dblheader_num'] = game_1_dblheader+game_2_dblheader
        fname_out = 'oddsshark_'+team_name+'_'+str(season)+'.csv'
        df_temp.to_csv(fname_out,index=False)
        time.sleep(.1) # brief pause between requests

PHI
2019
(162, 8)
2020
(162, 8)
2021
(162, 8)
2022
(162, 8)
SDN
2019
(162, 8)
2020
(162, 8)
2021
(161, 8)
2022
(162, 8)
SFN
2019
(162, 8)
2020
(162, 8)
2021
(162, 8)
2022
(162, 8)
ANA
2019
(162, 8)
2020
(162, 8)
2021
(162, 8)
2022
(162, 8)
DET
2019
(162, 8)
2020
(162, 8)
2021
(162, 8)
2022
(162, 8)
CIN
2019
(162, 8)
2020
(162, 8)
2021
(162, 8)
2022
(162, 8)
NYA
2019
(162, 8)
2020
(162, 8)
2021
(162, 8)
2022
(162, 8)
TEX
2019
(162, 8)
2020
(162, 8)
2021
(162, 8)
2022
(162, 8)
TBA
2019
(162, 8)
2020
(162, 8)
2021
(162, 8)
2022
(162, 8)
COL
2019
(162, 8)
2020
(162, 8)
2021
(161, 8)
2022
(162, 8)
MIN
2019
(162, 8)
2020
(162, 8)
2021
(162, 8)
2022
(162, 8)
KCA
2019
(162, 8)
2020
(162, 8)
2021
(162, 8)
2022
(162, 8)
ARI
2019
(162, 8)
2020
(162, 8)
2021
(162, 8)
2022
(162, 8)
BAL
2019
(162, 8)
2020
(162, 8)
2021
(162, 8)
2022
(162, 8)
ATL
2019
(162, 8)
2020
(162, 8)
2021
(160, 8)
2022
(162, 8)
TOR
2019
(162, 8)
2020
(162, 8)
2021
(162, 8)
2022
(162, 8)
SEA
2019
(162, 8)
2020
(162, 8)
2021
(16