# NBA Game Simulator Project Report

- Reian Festejo
- CS558

## Introduction
- This Jupyter notebook models and predicts the outcome of an NBA game based on past game statistics. We use BeautifulSoup, a Python web-scraping library, to scrape data from www.basketball-reference.com/, collecting game outcomes for each NBA team from the past 3 seasons. We load the data into several dataframes using the pandas library. The dataset is then cleaned and transformed to generate metrics whose averages serve as predictors in a Monte Carlo simulation based on the Poisson distribution.


- Monte Carlo simulations are widely used in sports betting, since the multitude of possibilities makes direct calculation of game outcomes impractical. Error correlation across games requires due care, as demonstrated with a realistic multilevel basketball model, similar to today's models. The model is built for the past seasons of 2016-2019, modeling how many points a certain team will score against another team, depending on which team is 'Away' and which team is 'Home'. This notebook focuses on clarifying the use of Monte Carlo simulations for probability calculations in sports.

## Background
- Since the 1980s, the National Basketball Association (NBA) has cemented itself as a major player in the world of professional sports. With the growth of cable television, games are watched by millions of fans both domestically and internationally. 


- With 30 teams each playing 82 games per season, professional basketball has become one of the most common avenues for sports gamblers to place bets. Professionals with years of experience and a strong understanding of the sport generate gambling odds for each game, which can then be bet on through online gambling houses or casinos. Many die-hard basketball fans with good instincts have profited from betting against these odds. But how much does instinct really have to do with it?


- Some websites, such as http://www.nbagamesim.com/, run simulation models to suggest bets, but they are not free. We instead start from one basic assumption, that the scoring process follows a Poisson distribution, and run the simulation ourselves.


- In this notebook, we will show that even with relatively little experience watching and analyzing basketball, statistical learning methods can be used to systematically predict the outcome of an NBA game.


- We found this topic very interesting and wanted to build our own model to simulate game results. Previous work suggests that the Poisson distribution is a good approximation to the dynamics of a sports game. Given the similar nature of sports results, we expect a Poisson process to be a reasonable approximation for a basketball game as well. In this project, we implement a model based on the Poisson distribution and see how well it predicts the actual results.

## Model Description
- Data Collecting
    - We use BeautifulSoup to collect historical NBA scores from www.basketball-reference.com/. The data includes the game data of the past 3 seasons (2016-2019).
    - The function *get_data* works as a scraper to grab data from our source website. We apply this function to a specific year and month. We then merge the results and drop the data we do not need, forming a pandas dataframe, *stats*, that is used in the next step.
    - Information includes: the year and month of the played NBA game, home and away teams, and their scores.



- Data Manipulation
    - After we get the *stats* dataframe, we add the score difference and score sum of each game to the table.
    - Then, we extract the list of NBA team names from the raw table, which will be used to construct the 'battle' format later.
    - From the team name table, we then construct the 'battle' format dataframe called *games* for each possible NBA game, stating which team is the 'Away_Team' and the 'Home_Team'. This will be our core dataframe and it will be used to get simulation results.


- Simulation Calculation
    - In this stage, we create the 'Team' class to calculate additional parameters for each game, as well as other sample parameters that are fed into the simulation later.
    - The constructor initializes the parameters, including locating the relevant data cells and computing the average and variance of the score sum and the variance of the score difference.
    - The *score_sum* method calculates the average sum of points scored and conceded by each team. The *score_difference* method calculates the average (scored-conceded) per game for each team.
    - Then, we calculate the parameters *delta_g* and *sigma_g* using the *score_difference* and *score_sum* methods respectively, and append the results to a dataframe.
    - Another class, 'Game', is created to facilitate the simulation process.
    - As shown in the class constructor, it takes two team names to initialize the simulation.
    - Following the model specified in the next section, the *sim_result* method calculates the g parameters in the model.
    

- Simulation
    - We use the *all_sim* function to run the simulation. *one_game* is an instance from the *Game* class, and we call the *sim_result* method to generate the simulation result.
    - After the simulation, we write the result to a csv file, which has been submitted in our project zip file.


- Model accuracy test
    - To better understand how well our model works, we use the *test* function to compare the simulation result with the actual result.


- User interaction
    - As described in the project purpose section, we want to provide a tool for basketball lovers to predict the game result of any game they choose. Hence, the last part of our project is designed for user interaction. Users can specify the teams of a certain match that they are interested in, and the model will print out a prediction result for the users.
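
The data manipulation step above can be sketched with pandas as follows. This is a minimal illustration, not the notebook's actual code: the rows are made up, the column names `PTS_Diff` and `PTS_Sum` are hypothetical, and the difference is taken as away minus home to match the convention used in the model section.

```python
from itertools import permutations

import pandas as pd

# Made-up rows in the shape described above (not real results).
stats = pd.DataFrame(
    {
        "Year": [2018, 2018, 2018],
        "Month": ["october", "october", "november"],
        "Away_Team": ["Team A", "Team B", "Team C"],
        "Away_PTS": [101, 95, 110],
        "Home_Team": ["Team B", "Team C", "Team A"],
        "Home_PTS": [99, 104, 108],
    }
)

# Score difference (assumed here to be away minus home) and score sum per game.
stats["PTS_Diff"] = stats["Away_PTS"] - stats["Home_PTS"]
stats["PTS_Sum"] = stats["Away_PTS"] + stats["Home_PTS"]

# Every team that appears as either the away or the home side.
teams = sorted(set(stats["Away_Team"]) | set(stats["Home_Team"]))

# 'Battle' format: one row per ordered (away, home) pair of distinct teams.
games = pd.DataFrame(
    list(permutations(teams, 2)), columns=["Away_Team", "Home_Team"]
)
print(games)
```

With the 30 NBA teams, this construction would give 30 × 29 = 870 ordered matchups.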

## Model Development

Conceptually, our analysis relies on recent observations in describing the teams: 
1. The home advantage is characterized by a team-independent but season-dependent increase of the home team score difference $c_{home} > 0$
<br /> 
2. An appropriate observable to characterize the fitness of a team i in a given season is the average point difference (normalized per match) $\Delta G_i(N)$, i.e. the difference of the points scored and conceded during N matches. In particular it contains more information about the team fitness than, e.g., the number of points.

<br /> 
<br /> 
So firstly, we would like to use our game data to compute $\Delta G_i(N)$ and $\sum G_i(N)$, which can be represented by the score difference and score sum.

- As an immediate consequence, the limit of $\Delta G_i(N)$ for large N, corresponding to the true fitness $\Delta G_i$, is well-defined. Since we only have a finite sample of N games to compute $\Delta G_i(N)$, the consistent estimator for $\Delta G_i$, based on $\Delta G_i(N)$, reads:

    $$\Delta G_i = a_N\Delta G_i(N)$$

    with $a_N = 1/[1+3/(N\sigma^2_{\Delta G})]$


- Analogously, we introduce $\sum G_i(N)$ as the average sum of points scored and conceded by team i in N matches. 
    - The long-term limit can be estimated via $\sum G_i - \xi = b_N(\sum G_i(N)-\xi)$
    with $b_N = 1/[1+3/(N\sigma^2_{\sum G})]$, where $\xi$ is the average number of points per match in the respective season.
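
The two estimators above can be sketched in a few lines of Python. The function and argument names are hypothetical, and the input values in the usage lines are made up for illustration; the variances are simply passed in as numbers.

```python
def shrink_factor(n_games: int, variance: float) -> float:
    """Return 1 / (1 + 3 / (N * sigma^2)), the form shared by a_N and b_N."""
    return 1.0 / (1.0 + 3.0 / (n_games * variance))


def estimate_fitness(delta_g_n: float, n_games: int, var_delta_g: float) -> float:
    """Estimate the long-run point difference from an N-game average."""
    return shrink_factor(n_games, var_delta_g) * delta_g_n


def estimate_score_sum(
    sum_g_n: float, n_games: int, var_sum_g: float, xi: float
) -> float:
    """Shrink the N-game average score sum toward the season average xi."""
    return xi + shrink_factor(n_games, var_sum_g) * (sum_g_n - xi)


# Illustrative (made-up) inputs: an 82-game sample.
print(estimate_fitness(5.0, n_games=82, var_delta_g=20.0))
print(estimate_score_sum(215.0, n_games=82, var_sum_g=50.0, xi=210.0))
```

Both factors lie strictly between 0 and 1 and approach 1 as N grows, so the correction matters most for short samples.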


<br /> 
<br /> 
Then, we would like to compute parameter $g_{i,j}$.
$$ g_{i,j}=g_i-g_j$$

$g_{i,j}$ expresses the point difference of a match between team i and team j with the final result $g_i:g_j$. $g_i$ denotes our final simulation result for team i's score, and $g_j$ denotes team j's score. Here we set team i to be the away team and team j to be the home team. 

So $g_{i,j}$ actually means score_awayteam - score_hometeam.


Theoretically, $g_{i,j}$ consists of two parts:

$$g_{i,j} = q_{i,j} + f_{i,j} $$

- $q_{i,j}$ expresses the average outcome which can be expected based on knowledge of the team fitness value $\Delta G_i$ and $\Delta G_j$, respectively. Conceptually this can be determined by averaging over all matches when teams with these fitness values play against each other. 


- $f_{i,j}$ denotes the external influences on a specific match, e.g. weather or health conditions, which are hard to evaluate.

For simplicity, here we only consider the influence of the main component $q_{i,j}$, as $f_{i,j}$ is hard to evaluate. So:

$$g_{i,j} = q_{i,j}$$

<br /> 
<br /> 
To derive $q_{i,j}$:

- $q_{i,j}$ needs to fulfill a basic condition (taking into account the home advantage, which enters with a negative sign because team i is the away team):
    -  symmetry condition: $$ q_{i,j} + c_{home} = -(q_{j,i} + c_{home})$$  



- The simplest general estimate of $g_{i,j}$ (denoted by $\widetilde g_{i,j}$) is given as follows.
$$ g_{i,j} = g_{i} - g_{j}= g_{awayteam} - g_{hometeam}$$
$$\widetilde g_{i,j} = q_{i,j} = \Delta G_i - \Delta G_j - c_{home}$$
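
As a small sketch of the estimate $\widetilde g_{i,j} = \Delta G_i - \Delta G_j - c_{home}$, the snippet below evaluates it for both venues of the same pairing. The function name, the fitness values, and the home advantage of 3 points are hypothetical and chosen only for illustration.

```python
def expected_margin(delta_g_away: float, delta_g_home: float, c_home: float) -> float:
    """Expected away-minus-home point difference: dG_i - dG_j - c_home."""
    return delta_g_away - delta_g_home - c_home


# Hypothetical fitness values and home advantage, for illustration only.
delta_g = {"Team A": 4.0, "Team B": -1.5}
C_HOME = 3.0

a_at_b = expected_margin(delta_g["Team A"], delta_g["Team B"], C_HOME)  # A away
b_at_a = expected_margin(delta_g["Team B"], delta_g["Team A"], C_HOME)  # B away

print(a_at_b, b_at_a)
# Swapping the venue flips the fitness term but not the home advantage,
# so the two margins always add up to -2 * c_home.
print(a_at_b + b_at_a == -2 * C_HOME)
```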

<br /> 
<br /> 

We also compute the mean of $spoints_{i,j}$, where $spoints_{i,j}$ denotes the sum of the two teams' scores. We denote the mean as $\bar {spoints}_{i,j}$.
- We assume that the underlying scoring process of each team follows a Poisson distribution, and the sum of two independent Poisson variables also follows a Poisson distribution. 


- It follows that $spoints_{i,j}$ follows a Poisson distribution, so we can run a Poisson simulation for $spoints_{i,j}$. Here we compute the mean $\bar {spoints}_{i,j}$, which is the parameter $\lambda$ of this Poisson distribution. The formula is given below:

$$ spoints_{i,j} = g_i + g_j = g_{awayteam} + g_{hometeam}$$
$$ \bar {spoints}_{i,j} = \sum G_i + \sum G_j -{\xi}_2$$

Similar to $\xi$ used before, which is the average score of each team per match, ${\xi}_2$ is the average total score of the two teams per match in the respective season.

<br /> 
<br /> 

Theoretically, knowing $g_{i,j}$ and $spoints_{i,j}$ together allows us to calculate the expected number of points for both teams individually.

- Solve the system of equations:
$$ g_{i,j} = g_{i} - g_{j} \quad (1)$$
$$ spoints_{i,j} = g_{i} + g_{j} \quad (2)$$


- We have:
$$ g_{i} = g_{awayteam} = (spoints_{i,j} + g_{i,j})/2 $$
$$ g_{j} = g_{hometeam} = (spoints_{i,j} - g_{i,j})/2 $$

Here, as we have already computed the estimates $\widetilde g_{i,j}$ and $\bar {spoints}_{i,j}$, we can get:

$$\widetilde g_{i} = (\bar {spoints}_{i,j} + \widetilde g_{i,j})/2$$
$$\widetilde g_{j} = (\bar {spoints}_{i,j} - \widetilde g_{i,j})/2$$
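
As a worked example with made-up numbers, suppose $\bar {spoints}_{i,j} = 212$ and $\widetilde g_{i,j} = -4$ (the home team is expected to win by 4). Then:

$$\widetilde g_{i} = (212 + (-4))/2 = 104$$
$$\widetilde g_{j} = (212 - (-4))/2 = 108$$

As a check, $\widetilde g_{i} - \widetilde g_{j} = -4$ and $\widetilde g_{i} + \widetilde g_{j} = 212$, which recovers both inputs.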

We run a Monte Carlo simulation of the Poisson process using $\lambda = \bar {spoints}_{i,j}$ to get the simulated value of $spoints_{i,j}$ for each game. Then we allocate the points to each team proportionally.

$$g_{i} = \frac{\widetilde g_{i}}{\widetilde g_{i}+ \widetilde g_{j}} \cdot spoints_{i,j}$$
$$g_{j} = \frac{\widetilde g_{j}}{\widetilde g_{i}+ \widetilde g_{j}} \cdot spoints_{i,j}$$
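
The simulation step can be sketched with NumPy as below. This is an illustrative sketch under stated assumptions rather than the notebook's *Game* class: the function name `simulate_game`, its arguments, the fixed seed, and the input values are all hypothetical.

```python
import numpy as np


def simulate_game(est_gij, mean_spoints, n_paths=1000, seed=0):
    """Simulate away and home scores for one matchup.

    est_gij      -- estimated away-minus-home margin
    mean_spoints -- expected total points, used as the Poisson rate lambda
    """
    rng = np.random.default_rng(seed)

    # Expected score of each side from the sum/difference system.
    est_away = (mean_spoints + est_gij) / 2
    est_home = (mean_spoints - est_gij) / 2

    # One Poisson draw of the total points per simulated path.
    totals = rng.poisson(lam=mean_spoints, size=n_paths)

    # Allocate each simulated total in proportion to the expected scores.
    away_share = est_away / (est_away + est_home)
    away = away_share * totals
    home = totals - away
    return away, home


# Made-up inputs: home side favoured by 4, 212 total points expected.
away, home = simulate_game(est_gij=-4.0, mean_spoints=212.0)
print(away.mean(), home.mean())
```

Note that with a fixed proportional split, every simulated path has the same sign of margin; only the total and the size of the margin vary from path to path.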


<br /> 
<br />



## Code Details
- Function get_table():
    - This function scrapes game data from the website *www.basketball-reference.com/* using BeautifulSoup. It takes the parameters (year, month) and returns a list of scraped data.
    
    - The data includes:
        - Year: year of game played
        - Month: month of game played
        - Away_Team: name of the away team
        - Away_PTS: points scored by the away team
        - Home_Team: name of the home team
        - Home_PTS: points scored by the home team
    
    - We can call this function in a loop to get the data within a time range, using pandas to store the data in a dataframe.
    
    

- Class Team:
    - This class is used to calculate some necessary parameters for the simulation.
    
    - The *\_\_init\_\_* function sets up the data needed for the calculations for team = name.

    - The *score_difference()* function calculates the average (scored-conceded) per game for team = name.

    - The *score_sum()* function calculates the average sum of points scored and conceded by team = name.

    - We put all the teams' *score_difference* values into the vector *dlt_G* (corresponding to $\Delta G$ in the model).

    - We put all the teams' *score_sum* values into the vector *sum_G* (corresponding to $\sum G$ in the model).


- Class Game:
    - This class runs the simulation for a specific game (the game between the inputs tm1 and tm2).
    
    - The *\_\_init\_\_()* function first sets up the teams to predict. tm1 is the away team and tm2 is the home team. N is the number of paths for the simulation (we use 1000 here). 
    
    - We also calculate esti_gij and mean_sgoals in the *\_\_init\_\_* function. 'esti_gij' corresponds to $\widetilde g_{i,j}$ in the modelling part, and 'mean_sgoals' corresponds to $\bar {spoints}_{i,j}$ in our model.

    - The *sim_result()* function runs the game simulation between tm1 and tm2. It simulates the Poisson process of the total score of the match (${spoints}_{i,j}$ in the model section), with parameter $\lambda = \bar {spoints}_{i,j}$. The detailed simulation method is described in the model section.


- Function all_sim():
    - This function loops over all possible pairings of NBA teams and calls the *Game.sim_result()* method for each, producing the full simulation for all games between all NBA teams.
    
    - This function also outputs the time needed to run the full simulation, which is about 4 to 5 minutes.
    
    
- Function test(): 
    - This function tests the accuracy of the simulation. Usually, in a simulation setting, we do not need to split the original data for an out-of-sample test (the simulation uses past data only through the team-level averages). So we just use our scraped data to test the accuracy.
    
    - We calculate the difference between the two teams' simulated scores (sim_mar) and the difference between the two teams' actual scores in the real game data (real_mar). We test two things:
        - Check whether sim_mar * real_mar is positive. If it is, we picked the same winner as the real game. (Correct Picks)
        - When people bet on NBA games, they usually bet on the score difference between the two teams. We check whether the difference between sim_mar and real_mar is within 5 points. If so, we consider it a good estimate. (Final Margin within 5 Pts)
        
    - Finally, the function outputs the test accuracy results.
    
    
- Function read_result():
    - This function handles the user interaction. The user inputs the team names and the function outputs the simulation result.
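
The two accuracy checks described for *test()* can be sketched as follows. The function name `accuracy`, the inclusive 5-point tolerance, and the margins are assumptions made for illustration, not the notebook's actual code or results.

```python
def accuracy(sim_mar, real_mar, tolerance=5):
    """Return (share of correct picks, share of margins within `tolerance`)."""
    pairs = list(zip(sim_mar, real_mar))
    # Same sign of margin means the same winner was picked.
    correct = sum(1 for s, r in pairs if s * r > 0)
    # Simulated margin within `tolerance` points of the real margin (inclusive).
    close = sum(1 for s, r in pairs if abs(s - r) <= tolerance)
    return correct / len(pairs), close / len(pairs)


# Made-up margins (away minus home), for illustration only.
sim_mar = [-4.0, 6.5, 2.0, -1.0]
real_mar = [-7, 3, -5, -12]

picks, within_5 = accuracy(sim_mar, real_mar)
print(f"Correct Picks: {picks:.0%}, Final Margin within 5 Pts: {within_5:.0%}")
```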
    


## Model Verification and Validation

This project assumes that scoring in basketball is a random process that can be described by a Poisson distribution. This assumption is a simplification of the real game process. 
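
One simple way to probe this assumption, sketched here as a suggestion rather than something done in the notebook, is to compare the sample variance of a set of scores with their mean, since a Poisson distribution has equal mean and variance. The scores below are made up for illustration.

```python
from statistics import mean, variance


def dispersion_index(scores):
    """Variance-to-mean ratio; close to 1 for Poisson-distributed counts."""
    return variance(scores) / mean(scores)


# Made-up single-team scores, for illustration only.
scores = [98, 104, 110, 95, 121, 107, 102, 113]
print(dispersion_index(scores))
```

A ratio well above 1 would suggest more spread than a Poisson model allows, and a ratio well below 1 would suggest less.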

- Python's BeautifulSoup is very handy for scraping data, and pandas organizes the data into a clean dataframe format. A pandas dataframe is easy to manipulate and convenient for calculations. The simulation is a little slow, although we expect it to still be faster than R. It is also easy to write a dataframe to csv files.

## Model Application and Transition

- For this simulation, we think Python is the best language to program in. However, in our other coursework, we found that R performs better at statistical modelling and fitting (the packages are more accurate). For very speed-demanding and complicated programs, C++ would be a better choice. Every programming language has its own strengths, and picking the right one is very helpful in solving the whole problem. 


- Jupyter notebooks are also very useful because we can run and modify the code cell by cell.

## Conclusions and Recommendations

Overall, we had a very positive experience coding this project in Python.

- Unlike soccer, in basketball the last minute is always critical. This means that the number of baskets (n) over most time intervals follows a Poisson distribution, but some intervals (especially during the last minutes) might follow different dynamics. There is research suggesting that last-minute scoring follows a power law. This would be a possible modification to our current model.


- In our underlying model to estimate $g_{i,j}$, we only take $q_{i,j}$ into account and ignore $f_{i,j}$, which represents the external factors that might influence the game result. It would be reasonable to consider this term in further development.

## References
[1] A. Heuer, C. Müller, and O. Rubner, *Soccer: is scoring goals a predictable Poisson process?*, Westfälische Wilhelms-Universität Münster (2010)

[2] J. Hatzius, J. Stehn, and D. Millar, *The Econometrician's take on Euro 2016*, Goldman Sachs Macroeconomics Insight (2016)

[3] J. Martín-González, Y. Guerra, J. García-Manso, E. Arriaza, and T. Valverde-Estévez, *The Poisson model limits in NBA basketball: Complexity in team sports*, Physica A 464 (2016)

[4] A. Heuer and O. Rubner, *Fitness, chance, and myths: an objective view on soccer results*, University of Münster (2009)