# Analysing Baseball Game Data
## Zsombor Hegedűs & Brúnó Helmeczy

#### Prepared for: Coding 3: Data Analysis & Management with Python 
#### Instructor: Eszter Somos 
#### MSc Business Analytics @ Central European University
#### [Github](https://github.com/zsomborh/analyse_baseball_matches) 

---
## Notebook on regression analysis

The purpose of this notebook is to merge the initial baseball dataframe with the weather dataframe we created, and to run regression analysis on our two research questions:

1. Can game length be explained better by adding variables that describe the weather?
2. Does wind speed have an effect on game performance (here, pitch speed), and how about all the other match attributes?

We did not aim to carry out a very detailed analysis. We just wanted to showcase the power of the `statsmodels` package and see whether statistical metrics show any improvement when weather data is included.

In [1]:
import pandas as pd 
from datetime import datetime
import statsmodels.formula.api as smf
import os


pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)

In [2]:
# !!! please set up the directory where data is stored as default with the below code !!!

os.chdir(r'C:/Users/helme/Desktop/CEU/WINTER_Term/Coding_3/Term_ProjWorkFolder/analyse_baseball_matches/data')

# Read in data from disk

weather_df = pd.read_csv('location_weather.csv', index_col = 'Unnamed: 0')
baseball_df = pd.read_csv('Baseball_merged_fin.csv',index_col = 'Unnamed: 0')

# Convert the date strings to date objects so that we can join on them as well
baseball_df['date'] = pd.to_datetime(baseball_df['date'], format="%Y-%m-%d").dt.date
weather_df['date'] = pd.to_datetime(weather_df['date'], format="%Y-%m-%d").dt.date

# Join the two dataframes on venue and date (inner join, the pd.merge default)
merged_df = pd.merge(baseball_df,
                     weather_df,
                     on=['venue_name', 'date'])
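
Because the join above is an inner join, game rows without a matching weather record are dropped silently. The following is a minimal sketch of how such rows could be counted before modelling, using the `indicator` and `validate` options of `pd.merge`. The two small dataframes are made-up stand-ins for `baseball_df` and `weather_df`, not the project data.

```python
import pandas as pd

# Toy stand-ins for baseball_df and weather_df (values are invented)
games = pd.DataFrame({
    "venue_name": ["Busch Stadium", "Busch Stadium", "Fenway Park"],
    "date": ["2017-04-02", "2017-04-03", "2017-04-03"],
    "g_id": [1, 2, 3],
})
weather = pd.DataFrame({
    "venue_name": ["Busch Stadium", "Fenway Park"],
    "date": ["2017-04-02", "2017-04-03"],
    "temp": [13.3, 9.1],
})

# Same date handling as in the notebook: strings -> datetime.date objects
for frame in (games, weather):
    frame["date"] = pd.to_datetime(frame["date"], format="%Y-%m-%d").dt.date

# A left join keeps every game row; `_merge` records whether weather was found.
# validate="many_to_one" raises if a (venue, date) pair appears twice in weather.
checked = pd.merge(
    games,
    weather,
    on=["venue_name", "date"],
    how="left",
    validate="many_to_one",
    indicator=True,
)

n_unmatched = int((checked["_merge"] == "left_only").sum())
print(f"{n_unmatched} of {len(checked)} game rows have no weather match")
```

If the count is zero on the real data, the inner join loses nothing and the check can be dropped.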

In [3]:
merged_df.head()

Unnamed: 0,g_id,inning,wind,weather,venue_name,ab_id,b_count,s_count,outs,pitch_num,spin_rate,start_speed,end_speed,zone,o,p_score,top,attendance,elapsed_time,delay,away_final_score,home_final_score,date,Total_Pitches,Max_Pitch_Count,Most_Freq_Pitch,Most_Freq_Pitch%,Game_Length_Innings,latitude,longitude,wdir,temp,maxt,visibility,wspd,cloudcover,mint,precip,snowdepth,dew,humidity,precipcover
0,201700001,1,8,63,Busch Stadium,2017000000.0,0.870968,0.741935,0.741935,2.612903,1862.029581,90.780645,82.641935,10.709677,1.580645,0.0,0.516129,47566.0,213.0,0.0,3.0,4.0,2017-04-02,31,14,FF,45.16,9,38.622554,-90.193922,116.71,13.3,19.4,9.9,12.5,11.0,7.3,0.25,0.0,7.0,66.63,4.17
1,201700001,2,8,63,Busch Stadium,2017000000.0,1.0,1.195122,0.731707,3.365854,1859.655122,89.473171,81.397561,9.609756,1.536585,0.0,0.317073,47566.0,213.0,0.0,3.0,4.0,2017-04-02,41,18,FF,43.9,9,38.622554,-90.193922,116.71,13.3,19.4,9.9,12.5,11.0,7.3,0.25,0.0,7.0,66.63,4.17
2,201700001,3,8,63,Busch Stadium,2017000000.0,0.804348,1.021739,1.043478,2.978261,1730.975304,88.467391,80.595652,10.652174,1.586957,0.0,0.347826,47566.0,213.0,0.0,3.0,4.0,2017-04-02,46,14,FF,30.43,9,38.622554,-90.193922,116.71,13.3,19.4,9.9,12.5,11.0,7.3,0.25,0.0,7.0,66.63,4.17
3,201700001,4,8,63,Busch Stadium,2017000000.0,0.722222,0.333333,0.944444,2.055556,1825.145611,88.694444,80.538889,11.444444,1.944444,0.555556,0.555556,47566.0,213.0,0.0,3.0,4.0,2017-04-02,18,5,CH,27.78,9,38.622554,-90.193922,116.71,13.3,19.4,9.9,12.5,11.0,7.3,0.25,0.0,7.0,66.63,4.17
4,201700001,5,8,63,Busch Stadium,2017000000.0,0.96,0.72,1.28,2.68,1895.17672,90.856,82.672,10.6,2.08,0.56,0.56,47566.0,213.0,0.0,3.0,4.0,2017-04-02,25,18,FF,72.0,9,38.622554,-90.193922,116.71,13.3,19.4,9.9,12.5,11.0,7.3,0.25,0.0,7.0,66.63,4.17


### 1) Can weather data help us explain the length of a game? 

We are going to run 4 simple OLS regressions on the full dataset to find patterns of association worth noting. We also wish to see whether adding weather data improves the fit of our model, as measured by R<sup>2</sup>.

Our target variable will be the total length of the game (maximum number of innings). The explanatory variables will be:

1. Only precipitation and precipitation coverage: we thought that the more rain falls, the more likely it is that a match will be prolonged
2. All weather data that we downloaded with the API
3. All match-related data that we had in our original baseball dataframe
4. The combined variable set of the 2nd and 3rd.


In [4]:
# We will define variable sets for statsmodels

var_set1 = 'Game_Length_Innings ~  precip + precipcover'

var_set2 = """
Game_Length_Innings ~
wdir+temp+maxt+visibility+wspd+cloudcover+
mint+precip+snowdepth+dew+humidity+precipcover
"""

var_set3 = """
Game_Length_Innings ~
inning+wind+weather+venue_name+b_count+
s_count+outs+spin_rate+ end_speed +
zone+o+p_score+top+attendance+elapsed_time+delay+
Total_Pitches+Max_Pitch_Count+Most_Freq_Pitch

"""

var_set4 = """
Game_Length_Innings ~
inning+wind+weather+venue_name+b_count+
s_count+outs+spin_rate+ end_speed+
zone+o+p_score+top+attendance+elapsed_time+delay+
Total_Pitches+Max_Pitch_Count+Most_Freq_Pitch+
wdir+temp+maxt+visibility+wspd+cloudcover+
mint+precip+snowdepth+dew+humidity+precipcover

"""
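
The variable sets above are fitted one by one in the following cells. As an alternative, here is a hedged sketch of fitting several formulas in a loop and collecting the fit statistics in one table. It runs on synthetic data with a few column names borrowed from the notebook; the `formulas` dictionary and the `comparison` frame are illustrative names, not part of the original analysis.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: game length depends on a match variable but not on weather
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "precip": rng.gamma(1.0, 2.0, n),
    "temp": rng.normal(20, 5, n),
    "Total_Pitches": rng.integers(15, 50, n),
})
toy["Game_Length_Innings"] = (
    9 + 0.05 * toy["Total_Pitches"] + rng.normal(0, 0.5, n)
)

formulas = {
    "weather only": "Game_Length_Innings ~ precip + temp",
    "match only": "Game_Length_Innings ~ Total_Pitches",
    "combined": "Game_Length_Innings ~ Total_Pitches + precip + temp",
}

rows = []
for label, formula in formulas.items():
    # cov_type="HC1" gives heteroskedasticity-robust standard errors;
    # R-squared and AIC do not depend on the covariance type
    res = smf.ols(formula, data=toy).fit(cov_type="HC1")
    rows.append({
        "model": label,
        "r2": res.rsquared,
        "adj_r2": res.rsquared_adj,
        "aic": res.aic,
        "n_params": len(res.params),
    })

comparison = pd.DataFrame(rows).set_index("model")
print(comparison.round(3))
```

Putting the metrics side by side makes it easier to see how much each variable set contributes than scrolling through separate summary tables.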

In [5]:
# We print the summary table for all 4 feature sets to see whether R2 or other
# relevant metrics show any improvement

lpm = smf.ols(var_set1, data=merged_df).fit()
print(lpm.get_robustcov_results(cov_type='HC1').summary())

                             OLS Regression Results                            
Dep. Variable:     Game_Length_Innings   R-squared:                       0.001
Model:                             OLS   Adj. R-squared:                  0.000
Method:                  Least Squares   F-statistic:                     14.59
Date:                 Mon, 29 Mar 2021   Prob (F-statistic):           4.65e-07
Time:                         14:17:55   Log-Likelihood:                -57930.
No. Observations:                44498   AIC:                         1.159e+05
Df Residuals:                    44495   BIC:                         1.159e+05
Df Model:                            2                                         
Covariance Type:                   HC1                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept       9.2386      0.005   1975

In [6]:
lpm = smf.ols(var_set2, data=merged_df).fit()
print(lpm.get_robustcov_results(cov_type='HC1').summary())

                             OLS Regression Results                            
Dep. Variable:     Game_Length_Innings   R-squared:                       0.006
Model:                             OLS   Adj. R-squared:                  0.006
Method:                  Least Squares   F-statistic:                     34.95
Date:                 Mon, 29 Mar 2021   Prob (F-statistic):           7.69e-82
Time:                         14:18:07   Log-Likelihood:                -57805.
No. Observations:                44498   AIC:                         1.156e+05
Df Residuals:                    44485   BIC:                         1.157e+05
Df Model:                           12                                         
Covariance Type:                   HC1                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept       8.9559      0.121     73

In [7]:
# Here we just want to see R2 and how much it improves when we add weather data in the last regression

lpm = smf.ols(var_set3, data=merged_df).fit().summary()
lpm.tables[0]

0,1,2,3
Dep. Variable:,Game_Length_Innings,R-squared:,0.546
Model:,OLS,Adj. R-squared:,0.545
Method:,Least Squares,F-statistic:,970.8
Date:,"Mon, 29 Mar 2021",Prob (F-statistic):,0.0
Time:,14:18:15,Log-Likelihood:,-40385.0
No. Observations:,44498,AIC:,80880.0
Df Residuals:,44442,BIC:,81370.0
Df Model:,55,,
Covariance Type:,nonrobust,,


In [8]:
lpm = smf.ols(var_set4, data=merged_df).fit()
print(lpm.get_robustcov_results(cov_type='HC1').summary())

                             OLS Regression Results                            
Dep. Variable:     Game_Length_Innings   R-squared:                       0.548
Model:                             OLS   Adj. R-squared:                  0.547
Method:                  Least Squares   F-statistic:                     117.6
Date:                 Mon, 29 Mar 2021   Prob (F-statistic):               0.00
Time:                         14:18:20   Log-Likelihood:                -40290.
No. Observations:                44498   AIC:                         8.072e+04
Df Residuals:                    44430   BIC:                         8.131e+04
Df Model:                           67                                         
Covariance Type:                   HC1                                         
                                                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------

#### Conclusions for Question 1: 

Judging by R<sup>2</sup> alone, the OLS models that use only weather data explain very little of the variance (R<sup>2</sup> of 0.001 and 0.006). Using all the match data raises R<sup>2</sup> greatly, to 0.546, and combining the two sets gives the best fit of all (0.548, a gain of 0.002), so we think it makes sense to increase model complexity and add weather data to our analysis (we used heteroskedasticity-robust HC1 standard errors).
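
A small gain in R<sup>2</sup> is hard to judge by eye. One possible check, sketched here on synthetic data rather than on `merged_df`, is a nested-model F-test with `compare_f_test` from `statsmodels`, which tests whether the added weather variables are jointly different from zero. The test assumes the default (non-robust) covariance, so the models below are fitted without `HC1`. The column names mirror the notebook but the values are invented.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with a strong match effect and a weak weather effect
rng = np.random.default_rng(1)
n = 1000
toy = pd.DataFrame({
    "Total_Pitches": rng.integers(15, 50, n),
    "precip": rng.gamma(1.0, 2.0, n),
    "temp": rng.normal(20, 5, n),
})
toy["Game_Length_Innings"] = (
    9
    + 0.05 * toy["Total_Pitches"]
    + 0.02 * toy["precip"]
    + rng.normal(0, 0.5, n)
)

# Restricted model: match data only. Full model: match data plus weather.
restricted = smf.ols("Game_Length_Innings ~ Total_Pitches", data=toy).fit()
full = smf.ols(
    "Game_Length_Innings ~ Total_Pitches + precip + temp", data=toy
).fit()

# H0: the coefficients on precip and temp are both zero
f_value, p_value, df_diff = full.compare_f_test(restricted)
print(f"F = {f_value:.2f}, p = {p_value:.4f}, extra parameters = {df_diff:.0f}")
print(f"R2 gain = {full.rsquared - restricted.rsquared:.4f}")
```

With tens of thousands of observations, as in this notebook, even a tiny R<sup>2</sup> gain can be statistically significant, so the size of the gain still needs to be judged separately.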

When we look at the coefficients, most of the weather variables appear to be significant at the 5% level, but their coefficients are rather small (it is also important to note that the variables are not scaled, so coefficient sizes are not directly comparable).
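
To make coefficient sizes comparable, the predictors could be standardised before fitting. The sketch below uses synthetic data and z-scores two predictors with plain pandas. It is only an illustration of the idea and was not part of the original analysis.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: two predictors on very different scales
rng = np.random.default_rng(2)
n = 800
toy = pd.DataFrame({
    "attendance": rng.normal(30000, 8000, n),
    "precip": rng.gamma(1.0, 2.0, n),
})
toy["Game_Length_Innings"] = (
    9
    + 0.00001 * toy["attendance"]
    + 0.05 * toy["precip"]
    + rng.normal(0, 0.5, n)
)

# z-score the predictors: subtract the mean, divide by the standard deviation
predictors = ["attendance", "precip"]
scaled = toy.copy()
scaled[predictors] = (toy[predictors] - toy[predictors].mean()) / toy[predictors].std()

formula = "Game_Length_Innings ~ attendance + precip"
raw_fit = smf.ols(formula, data=toy).fit()
std_fit = smf.ols(formula, data=scaled).fit()

# After scaling, each coefficient is the change in innings per one
# standard deviation of the predictor; the model fit itself is unchanged
print(pd.DataFrame({"raw": raw_fit.params, "standardised": std_fit.params}))
```

Scaling changes the units of the coefficients only. R<sup>2</sup>, t-statistics and p-values of the slopes stay the same.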

Precipitation was the focus of our original question, and we can see that 1 unit higher precipitation is associated with a match that is 0.0013 innings longer (holding every other variable unchanged). A positive association is what we expected and what we got, although with a very small coefficient.

---


### 2) Does windspeed have an effect on pitch speed? 

We are going to run 4 simple OLS regressions on the full dataset to find noteworthy patterns of association. We also wish to see whether adding weather data improves the fit of our model, as measured by R<sup>2</sup>.

Our target variable will be the end speed of a pitch. The explanatory variables will be:

1. Only wind speed, wind direction and their interaction: we thought that wind could significantly impact the speed of a pitch, especially the end speed
2. All weather data that we downloaded with the API
3. All match-related data that we had in our original baseball dataframe
4. The combined variable set of the 2nd and 3rd.


In [9]:
# We will define variable sets for statsmodels

# Note: in formula syntax `wspd * wdir` already expands to wspd + wdir + wspd:wdir
var_set1 = 'end_speed ~ wspd + wdir + wspd * wdir'

var_set2 = """end_speed ~ 
wdir+temp+maxt+visibility+wspd+cloudcover+
mint+precip+snowdepth+dew+humidity+precipcover
+ wspd * wdir
"""

var_set3 = """
end_speed ~
inning+wind+weather+venue_name+b_count+
s_count+outs+spin_rate +
zone+o+p_score+top+attendance+elapsed_time+delay+
Total_Pitches+Max_Pitch_Count+Most_Freq_Pitch

"""

var_set4 = """
end_speed ~
inning+wind+weather+venue_name+b_count+
s_count+outs+spin_rate+
zone+o+p_score+top+attendance+elapsed_time+delay+
Total_Pitches+Max_Pitch_Count+Most_Freq_Pitch+
wdir+temp+maxt+visibility+wspd+cloudcover+
mint+precip+snowdepth+dew+humidity+precipcover
+ wspd * wdir
"""

In [10]:
# We print the summary table for all 4 feature sets to see whether R2 or other
# relevant metrics show any improvement

lpm = smf.ols(var_set1, data=merged_df).fit()
print(lpm.get_robustcov_results(cov_type='HC1').summary())

                            OLS Regression Results                            
Dep. Variable:              end_speed   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                     6.691
Date:                Mon, 29 Mar 2021   Prob (F-statistic):           0.000164
Time:                        14:18:39   Log-Likelihood:            -1.0018e+05
No. Observations:               44498   AIC:                         2.004e+05
Df Residuals:                   44494   BIC:                         2.004e+05
Df Model:                           3                                         
Covariance Type:                  HC1                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     81.7543      0.102    801.224      0.0

In [11]:
lpm = smf.ols(var_set2, data=merged_df).fit()
print(lpm.get_robustcov_results(cov_type='HC1').summary())

                            OLS Regression Results                            
Dep. Variable:              end_speed   R-squared:                       0.014
Model:                            OLS   Adj. R-squared:                  0.014
Method:                 Least Squares   F-statistic:                     48.57
Date:                Mon, 29 Mar 2021   Prob (F-statistic):          1.32e-125
Time:                        14:18:40   Log-Likelihood:                -99879.
No. Observations:               44498   AIC:                         1.998e+05
Df Residuals:                   44484   BIC:                         1.999e+05
Df Model:                          13                                         
Covariance Type:                  HC1                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept      80.9479      0.273    296.213      

In [12]:
# Here we just want to see R2 and how much it improves when we add weather data in the last regression

lpm = smf.ols(var_set3, data=merged_df).fit().summary()
lpm.tables[0]

0,1,2,3
Dep. Variable:,end_speed,R-squared:,0.326
Model:,OLS,Adj. R-squared:,0.325
Method:,Least Squares,F-statistic:,397.2
Date:,"Mon, 29 Mar 2021",Prob (F-statistic):,0.0
Time:,14:18:42,Log-Likelihood:,-91427.0
No. Observations:,44498,AIC:,183000.0
Df Residuals:,44443,BIC:,183400.0
Df Model:,54,,
Covariance Type:,nonrobust,,


In [13]:
lpm = smf.ols(var_set4, data=merged_df).fit()
print(lpm.get_robustcov_results(cov_type='HC1').summary())

                            OLS Regression Results                            
Dep. Variable:              end_speed   R-squared:                       0.327
Model:                            OLS   Adj. R-squared:                  0.326
Method:                 Least Squares   F-statistic:                     315.9
Date:                Mon, 29 Mar 2021   Prob (F-statistic):               0.00
Time:                        14:18:44   Log-Likelihood:                -91380.
No. Observations:               44498   AIC:                         1.829e+05
Df Residuals:                   44430   BIC:                         1.835e+05
Df Model:                          67                                         
Covariance Type:                  HC1                                         
                                                 coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------------

#### Conclusions for Question 2: 

Judging by R<sup>2</sup> alone, the OLS models that use only weather data explain very little of the variance (R<sup>2</sup> of 0.000 and 0.014). Using all the match data raises R<sup>2</sup> greatly, to 0.326. Combining the two sets barely improves the fit (0.327), so we concluded that it is not really worth going the extra mile and adding weather data to our analysis if we want to understand what impacts the speed of pitches.

When we look at the coefficients, most of the weather variables appear to be significant at the 5% level, but their coefficients are rather small (again, the variables are not scaled).

Wind speed was the focus of our original question, and we can see that 1 unit higher wind speed is associated with a 0.0013 lower end speed for a pitch (holding every other variable unchanged). It would be a long shot to conclude anything from this, but one possible explanation is wind blowing against the direction of the thrown ball.
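
Wind direction is a circular quantity, so entering `wdir` linearly treats 359 and 1 degrees as far apart. A possible refinement, sketched here under the assumption that `wdir` is measured in degrees, is to split the wind into two perpendicular components and use those as regressors. The names `wind_north` and `wind_east` are hypothetical, and the sketch ignores the orientation of each ballpark, which would be needed to tell headwind from tailwind.

```python
import numpy as np
import pandas as pd

# Toy readings: same wind speed, different directions (assumed degrees)
toy = pd.DataFrame({
    "wdir": [0.0, 90.0, 180.0, 359.0],
    "wspd": [10.0, 10.0, 10.0, 10.0],
})

# Project the wind speed onto two perpendicular axes
theta = np.deg2rad(toy["wdir"])
toy["wind_north"] = toy["wspd"] * np.cos(theta)
toy["wind_east"] = toy["wspd"] * np.sin(theta)

# 0 and 359 degrees now yield almost identical components
print(toy.round(3))
```

The two components could then replace `wspd`, `wdir` and their interaction in the formulas above.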


### Summary

There are a couple of limitations to keep in mind for this analysis: 

- The baseball data we analysed was only for two years: 2017, 2018
- The weather data we used was for the full day and not only for the duration of a given baseball match

Due to the above limitations, the external validity of this analysis is probably not very high. There are ways to improve it, however: the weather API can provide data at an hourly frequency, and the original database has a few more years that could be used by anyone who wishes to take this further.
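
As a rough sketch of the hourly idea, the snippet below averages hourly wind speed over the duration of a single game only. It assumes a game start time is available (our dataset does not contain one, so `game_start` is hypothetical) and that `elapsed_time` is measured in minutes. All values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly weather readings for one venue and day
first_reading = pd.Timestamp("2017-04-02 17:00")
hourly = pd.DataFrame({
    "timestamp": first_reading + pd.to_timedelta(np.arange(6), unit="h"),
    "wspd": [5.0, 8.0, 12.0, 14.0, 9.0, 4.0],
})

# Hypothetical start time; elapsed_time is assumed to be in minutes
game_start = pd.Timestamp("2017-04-02 18:10")
elapsed_minutes = 213.0
game_end = game_start + pd.Timedelta(minutes=elapsed_minutes)

# Keep only the readings taken while the game was in progress
in_game = hourly["timestamp"].between(game_start, game_end)
game_wspd = hourly.loc[in_game, "wspd"].mean()

print(f"Mean wind speed during the game: {game_wspd:.2f}")
print(f"Mean wind speed over all readings: {hourly['wspd'].mean():.2f}")
```

The same window could be applied per game with a group-by once start times are known.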

Overall, in this assignment we analysed important metrics related to baseball matches, collected weather information, and did some high-level regression analysis on a few metrics. We wanted to answer 2 questions: whether weather data has an impact on game length, and whether it impacts the speed of a pitch. Comparing a few descriptive statistics and looking at some estimated coefficients, we found that weather conditions can explain some variance in game length, but they are not very useful for explaining pitch speed.

We also identified a few areas where this analysis can be improved, and we strongly believe that our notebooks serve as a good foundation for anyone wishing to delve deeper into the exciting world of baseball.