# NBA Game Simulation

- Mengyu Huang (mh3685)
- Xiaoling Ma (xm2185)

## Abstract & how to run this file

- Abstract

    NBA games are very popular. Fans root for their home team and are anxious about the result, and for people who bet on games the results matter even more. There have been many studies in this field, and some websites run simulations to give betting suggestions. We build on one basic assumption, that the scoring process follows a Poisson distribution, and try to run the simulation ourselves.
    

- How to run this file

    Running this file does not require any additional inputs or path changes. You can run the whole Jupyter notebook cell by cell. You can check the simulation result by looking at the output file 'simulation_result.csv' or by using the *read_result()* function.

In [2]:
import requests
import bs4
import math
import time
from bs4 import BeautifulSoup
import os, pickle
import pandas as pd
import numpy as np
path = os.getcwd()
os.chdir(path)

import csv


## Using beautiful soup to scrape NBA schedule & game data

- data source: www.basketball-reference.com/
- we use data from the 2014-2015, 2015-2016 and 2016-2017 seasons for the simulation


In [3]:
url_start='http://www.basketball-reference.com/leagues/NBA_'
url_end='.html'

In [4]:
# function for scraping the data from the url

def get_table(year,month):
    temp=[]
    url_var=str(year)+'_games-'+month
    url=url_start+url_var+url_end
    r=requests.get(url)
    soup=BeautifulSoup(r.content, 'lxml')
    table = soup.find('table', attrs={'class':'suppress_glossary sortable stats_table'})
    table_body = table.find('tbody')
    rows = table_body.find_all('tr')
    for row in rows:
        cols = row.find_all('td')
        cols = [ele.text.strip() for ele in cols]
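        # cols[2] is the away score: rows without a score are skipped,
        # and empty cells are dropped from the rows that are kept
        temp2=[ele for ele in cols if cols[2] and ele]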
        if temp2:
            temp.append([year,month]+temp2)
    return temp

In [5]:
# read the data of the 2015-2017 seasons into the list data
# if there is no data for a specific period, print that the year-month is not included

seasons=[2015,2016,2017]
months=['october', 'november', 'december', 'january', 'february',
       'march', 'april', 'may','june']
data=[]

for season in seasons:
    for month in months:
        try:
            temps=get_table(season, month)
            #exclude empty months
            if temps:
                data=data+temps
        except Exception:
            print(str(season)+month+' not included')

2017may not included
2017june not included


In [6]:
# convert the scraped data into a dataframe and drop the columns we do not need

game_data = pd.DataFrame(data, columns=['year', 'month','time', 'away_tm','away_score','home_tm','home_score','del1','del2','del3'])
game_data = game_data.drop(['time','del1','del2','del3'],axis=1)

## Basic Calculations

- Please first refer to the *Underlying Model* part of the report to help understand the calculations
- calculate the coefficients for our Poisson model
- get the NBA team list and construct the battle list (every team plays every other team, both home and away). Save the battle list and related data in a new dataframe called *games*
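
The battle list described above is the set of ordered (away, home) pairs of distinct teams. A minimal sketch using `itertools.permutations` on a hypothetical three-team list (the list `sample_teams` is for illustration only; the notebook builds the same pairs with two nested loops over all 30 teams):

```python
from itertools import permutations

# Hypothetical three-team league, for illustration only.
sample_teams = ['Boston Celtics', 'Brooklyn Nets', 'Utah Jazz']

# Ordered pairs: (away, home) and (home, away) are different fixtures.
battles = list(permutations(sample_teams, 2))

for away, home in battles:
    print(away + ' vs ' + home)

# n teams give n * (n - 1) ordered pairings.
n_pairs = len(sample_teams) * (len(sample_teams) - 1)
print(len(battles) == n_pairs)
```

With 30 teams this gives 30 * 29 = 870 pairings, which is the number of rows in *games*.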

In [7]:
# check the data we have scraped

game_data.head()

|  | year | month | away_tm | away_score | home_tm | home_score |
|---|---|---|---|---|---|---|
| 0 | 2015 | october | Houston Rockets | 108 | Los Angeles Lakers | 90 |
| 1 | 2015 | october | Orlando Magic | 84 | New Orleans Pelicans | 101 |
| 2 | 2015 | october | Dallas Mavericks | 100 | San Antonio Spurs | 101 |
| 3 | 2015 | october | Brooklyn Nets | 105 | Boston Celtics | 121 |
| 4 | 2015 | october | Milwaukee Bucks | 106 | Charlotte Hornets | 108 |


In [8]:
# get the difference & sum of two teams' scores of each game. 
# This is useful for further calculation

game_data['diff']=game_data['away_score'].astype(float)-game_data['home_score'].astype(float)
game_data['sum']=game_data['away_score'].astype(float)+game_data['home_score'].astype(float)

# get the list of NBA teams

teams=game_data['home_tm']
teams=teams.drop_duplicates()
teams=teams.values.tolist()

teams_df=pd.DataFrame(data={'name':teams})

# construct new dataframe games to save our simulation-related data & results
# create the battle form (get home team & away team)
temp = []
temp2 = []
for item in teams:
    for item2 in teams:
        if item != item2:
            temp.append(item)
            temp2.append(item2)
games = pd.DataFrame(data={'a_tm':temp, 'h_tm':temp2})

games['Battle'] = games['a_tm']+' vs '+games['h_tm']
games = games.set_index('Battle')
games.index.name=None

In [9]:
game_data.head()

|  | year | month | away_tm | away_score | home_tm | home_score | diff | sum |
|---|---|---|---|---|---|---|---|---|
| 0 | 2015 | october | Houston Rockets | 108 | Los Angeles Lakers | 90 | 18.0 | 198.0 |
| 1 | 2015 | october | Orlando Magic | 84 | New Orleans Pelicans | 101 | -17.0 | 185.0 |
| 2 | 2015 | october | Dallas Mavericks | 100 | San Antonio Spurs | 101 | -1.0 | 201.0 |
| 3 | 2015 | october | Brooklyn Nets | 105 | Boston Celtics | 121 | -16.0 | 226.0 |
| 4 | 2015 | october | Milwaukee Bucks | 106 | Charlotte Hornets | 108 | -2.0 | 214.0 |


In [9]:
teams

['Los Angeles Lakers',
 'New Orleans Pelicans',
 'San Antonio Spurs',
 'Boston Celtics',
 'Charlotte Hornets',
 'Denver Nuggets',
 'Indiana Pacers',
 'Memphis Grizzlies',
 'Miami Heat',
 'New York Knicks',
 'Phoenix Suns',
 'Portland Trail Blazers',
 'Sacramento Kings',
 'Toronto Raptors',
 'Utah Jazz',
 'Cleveland Cavaliers',
 'Dallas Mavericks',
 'Los Angeles Clippers',
 'Minnesota Timberwolves',
 'Orlando Magic',
 'Chicago Bulls',
 'Milwaukee Bucks',
 'Atlanta Hawks',
 'Detroit Pistons',
 'Golden State Warriors',
 'Houston Rockets',
 'Oklahoma City Thunder',
 'Philadelphia 76ers',
 'Washington Wizards',
 'Brooklyn Nets']

In [10]:
games.head()

|  | a_tm | h_tm |
|---|---|---|
| Los Angeles Lakers vs New Orleans Pelicans | Los Angeles Lakers | New Orleans Pelicans |
| Los Angeles Lakers vs San Antonio Spurs | Los Angeles Lakers | San Antonio Spurs |
| Los Angeles Lakers vs Boston Celtics | Los Angeles Lakers | Boston Celtics |
| Los Angeles Lakers vs Charlotte Hornets | Los Angeles Lakers | Charlotte Hornets |
| Los Angeles Lakers vs Denver Nuggets | Los Angeles Lakers | Denver Nuggets |


## Simulation Calculations
- construct the class Team to calculate the required parameters for each team
- construct the class Game to run the simulation for a game between two input teams


The class Team calculates the parameters needed for the simulation.

- The *\_\_init\_\_()* function stores the data needed for the calculations for team = name.

- The *difscore()* function calculates the average score difference (scored minus conceded) per game for team = name.

- The *sscore()* function calculates the average total score (scored plus conceded) per game for team = name.

- Later, all the teams' *difscore* values are put into the vector *dlt_G* (corresponding to $\Delta G$ in the model)

- All the teams' *sscore* values are put into the vector *sum_G* (corresponding to $\sum G$ in the model)
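
Both averages are shrunk before they are stored: the team's sample mean is pulled toward a league-wide value with a weight that approaches 1 as the number of games grows. A minimal sketch of that step, assuming the same weight formula as the Team code below (the function name `shrink`, the parameter `k` and all numbers are hypothetical; the constant 3 is taken from the notebook code):

```python
from statistics import mean

def shrink(team_values, league_mean, league_var, k=3.0):
    """Pull a team's sample mean toward the league mean.

    weight = 1 / (1 + k / (n * league_var)), where n is the number of games.
    """
    n = len(team_values)
    weight = 1.0 / (1.0 + k / (n * league_var))
    return league_mean + weight * (mean(team_values) - league_mean)

# Made-up game totals for one team, plus a made-up league mean and variance.
team_totals = [200, 210, 190, 220]
shrunk_total = shrink(team_totals, league_mean=200.0, league_var=300.0)

# For score differences the league mean is 0, so the team mean is only scaled.
team_diffs = [5, -3, 10, 4]
shrunk_diff = shrink(team_diffs, league_mean=0.0, league_var=150.0)

print(shrunk_total, shrunk_diff)
```

In this sketch `sscore()` corresponds to shrinking toward the league average total, and `difscore()` to shrinking toward zero.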


In [11]:
# Create class team for calculating the scores & average scores

class Team:
    def __init__(self, name, data):
        self.name=name
        self.data=data
        
        #the lines that include team=name as the away/home team
        self.a_games=data.loc[data['away_tm']==name]
        self.h_games=data.loc[data['home_tm']==name]
        
        #average sum of scores for all games
        self.xi=data['sum'].mean()
        self.sumvar=data['sum'].var()
        self.diffvar=data['diff'].var()
        
        
    #the average sum of scores scored and conceded by team=name
    def sscore(self):
        tp1=self.a_games['sum'].values.tolist()
        tp1.extend(self.h_games['sum'].values.tolist())
        tp2=np.mean(tp1)
        b_n=1/(1+3/(len(tp1)*self.sumvar))
        tp3=self.xi+b_n*(tp2-self.xi)
        return tp3
    
    
    #the average (scored-conceded) per game for team=name
    def difscore(self):
        tp1=self.a_games['diff'].values.tolist()
        tp1.extend([-x for x in self.h_games['diff'].values.tolist()])
        tp2=np.mean(tp1)
        a_n=1/(1+3/(len(tp1)*self.diffvar))
        tp3=a_n*tp2
        return tp3

In [12]:
# put the difscore of all teams into column 'dlt_G' (the \Delta G_i for each team)
# and the sscore of all teams into column 'sum_G' (the \Sum G_i for each team)

diff_li=[]
sum_li=[]
for name in teams_df['name']:
    team=Team(name,game_data)
    diff_li.append(team.difscore())
    sum_li.append(team.sscore())
teams_df['dlt_G']=diff_li
teams_df['sum_G']=sum_li
teams_df = teams_df.set_index('name')
teams_df.index.name=None

In [13]:
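# coefficients
# c_home: average home advantage (home score minus away score)
# xi_2: average total score per game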
c_home=-game_data['diff'].mean()
xi_2=game_data['sum'].mean()

# simulation paths
N=1000

In [14]:
teams_df.head()

|  | dlt_G | sum_G |
|---|---|---|
| Los Angeles Lakers | -7.896163 | 208.061894 |
| New Orleans Pelicans | -1.638103 | 205.890224 |
| San Antonio Spurs | 7.980194 | 200.11212 |
| Boston Celtics | 1.505832 | 207.450526 |
| Charlotte Hornets | -0.15599 | 201.468115 |


The Game class runs the simulations.

- The *\_\_init\_\_()* function first sets up the two teams (away team tm1 and home team tm2) and the number of simulation paths N (we use 1000 here). It also calculates the parameters needed for the simulation, 'esti_gij' and 'mean_sgoals', which correspond to $\widetilde g_{i,j}$ and $\overline{sgoals}_{i,j}$ in our model respectively.

- The *sim_result()* function runs the game simulation between tm1 and tm2. It simulates the Poisson process for the total score of the match (${sgoals}_{i,j}$ in the model part) with parameter $\lambda = \overline{sgoals}_{i,j}$. The simulation method is described in detail in the model part.
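
The idea behind *sim_result()* can be sketched on its own: a Poisson count with rate $\lambda$ is the number of exponential inter-arrival times that fit into the unit interval, and the averaged total is then split between the two teams in proportion to their expected scores. This is a minimal sketch under stated assumptions, not the notebook's exact code: it uses the standard-library `random` module instead of NumPy, and the names `poisson_count` and `simulate_scores` and the input numbers are hypothetical.

```python
import math
import random

def poisson_count(lam, rng):
    """Number of arrivals in [0, 1] for a Poisson process with rate lam."""
    t = 0.0
    count = 0
    while True:
        # Exponential inter-arrival time; 1 - random() lies in (0, 1],
        # so the logarithm is always defined.
        t -= math.log(1.0 - rng.random()) / lam
        if t > 1.0:
            return count  # the arrival that crossed t = 1 is not counted
        count += 1

def simulate_scores(mean_total, expected_margin, n_paths=1000, seed=0):
    """Average n_paths simulated totals and split them into (away, home)."""
    rng = random.Random(seed)
    avg_total = sum(poisson_count(mean_total, rng) for _ in range(n_paths)) / n_paths
    # Expected away score is (total + margin) / 2, so its share of the total is:
    away_share = (mean_total + expected_margin) / (2.0 * mean_total)
    away = round(avg_total * away_share)
    home = round(avg_total * (1.0 - away_share))
    return away, home

# Illustrative inputs: expected total of about 209 and an away margin of about -9.
away_score, home_score = simulate_scores(208.72, -9.09)
print(away_score, home_score)
```

Because the totals are averaged over all paths before the split, the result is close to the expected scores rather than a single random game outcome.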

In [15]:
#game related functions
#tm1 is away team, tm2 is home team

class Game:
    def __init__(self, tm1, tm2):
        self.tm1=tm1
        self.tm2=tm2
        
        #value of esti_gij (considering home advantage)
        #attention i is away team here
        # .loc replaces the deprecated .ix indexer for label-based lookups
        self.esti_gij=teams_df.loc[self.tm1,'dlt_G']-teams_df.loc[self.tm2,'dlt_G']-c_home

        #mean of sum of goals g_i+g_j
        self.mean_sgoals=teams_df.loc[self.tm1,'sum_G']+teams_df.loc[self.tm2,'sum_G']-xi_2
        
        self.times=N
        
    # Monte Carlo simulation of a Poisson distribution
    # (the Poisson simulation method was learned from the simulation course).
    # As the scoring process of each team follows a Poisson distribution,
    # the sum of the two teams' scores in a game also follows a Poisson distribution
    # (the sum of two independent Poisson variables is again Poisson).
    # So we first simulate the total score of the two teams (lambda = a_mean+h_mean)
    # and then assign scores to each team.
    
    def sim_result(self):
        temp=0
        game_esti_gij=games['esti_gij'].loc[self.tm1+' vs '+self.tm2]
        game_mean_sgoals=games['mean_sgoals'].loc[self.tm1+' vs '+self.tm2]
        esti_gi=(game_mean_sgoals+game_esti_gij)/2
        esti_gj=(game_mean_sgoals-game_esti_gij)/2
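        for i in range(self.times):
            t=0 
            I=0
            while t<=1:
                # lambda of the poisson distribution is (a_mean+h_mean)
                # 1-random_sample() lies in (0, 1], so the log is always defined
                t=t-1/game_mean_sgoals*math.log(1-np.random.random_sample())
                I=I+1
            # the last arrival falls after t=1, so the count within [0, 1] is I-1
            temp=temp+I-1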
            
        g_i=round(temp/self.times*esti_gi/(esti_gi+esti_gj))
        g_j=round(temp/self.times*esti_gj/(esti_gi+esti_gj))
        return g_i, g_j

In [16]:
# for each game, get the expected total score g_i+g_j (mean_sgoals) and the expected margin (esti_gij)

temp1=[]
temp2=[]
for index, row in games.iterrows():
    one_game=Game(row['a_tm'],row['h_tm'])
    temp1.append(one_game.mean_sgoals)
    temp2.append(one_game.esti_gij)
games['mean_sgoals']=temp1
games['esti_gij']=temp2
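
As a worked check of these two columns, the sketch below recomputes them for one pairing (Los Angeles Lakers at San Antonio Spurs) from the *dlt_G* and *sum_G* values shown in the teams_df preview. The values of `c_home` and `xi_2` are not printed in this notebook; the numbers used here are assumptions, back-calculated from the displayed tables.

```python
# Values copied from the teams_df preview in this notebook.
dlt_G = {'Los Angeles Lakers': -7.896163, 'San Antonio Spurs': 7.980194}
sum_G = {'Los Angeles Lakers': 208.061894, 'San Antonio Spurs': 200.11212}

# Assumption: back-calculated from the displayed tables, not printed anywhere.
c_home = 2.827858   # average home advantage
xi_2 = 205.232063   # average total score per game

away, home = 'Los Angeles Lakers', 'San Antonio Spurs'

# Expected margin (away minus home) and expected total for this pairing.
esti_gij = dlt_G[away] - dlt_G[home] - c_home
mean_sgoals = sum_G[away] + sum_G[home] - xi_2

# Expected individual scores: half the total plus or minus half the margin.
esti_gi = (mean_sgoals + esti_gij) / 2
esti_gj = (mean_sgoals - esti_gij) / 2

print(round(esti_gij, 3), round(mean_sgoals, 3))
print(round(esti_gi, 1), round(esti_gj, 1))
```

Up to rounding, the margin and total agree with the *esti_gij* and *mean_sgoals* values stored in *games* for this pairing, and the two expected scores add up to the expected total.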

## Simulation

Run the simulation for all the games in the battle list and write the result to the file 'simulation_result.csv'.

In [18]:
#the time needed is about 200-300 sec
#Get for each game the simulation result

def all_sim():
    t1=time.time()
    tp_1=[]
    tp_2=[]
    for index, row in games.iterrows():
        one_game=Game(row['a_tm'],row['h_tm'])
        rlt1, rlt2=one_game.sim_result()
        tp_1.append(rlt1)
        tp_2.append(rlt2)
    games['sim_ascore']=tp_1
    games['sim_hscore']=tp_2
    t2=time.time()-t1
    print('Time used for this simulation is '+str(t2)+' seconds')
    return 

In [19]:
all_sim()

Time used for this simulation is 246.13110280036926 seconds


In [20]:
games.to_csv("simulation_result.csv")

In [21]:
games.head()

|  | a_tm | h_tm | mean_sgoals | esti_gij | sim_ascore | sim_hscore |
|---|---|---|---|---|---|---|
| Los Angeles Lakers vs New Orleans Pelicans | Los Angeles Lakers | New Orleans Pelicans | 208.720055 | -9.085918 | 101.0 | 110.0 |
| Los Angeles Lakers vs San Antonio Spurs | Los Angeles Lakers | San Antonio Spurs | 202.941951 | -18.704215 | 93.0 | 112.0 |
| Los Angeles Lakers vs Boston Celtics | Los Angeles Lakers | Boston Celtics | 210.280358 | -12.229853 | 100.0 | 112.0 |
| Los Angeles Lakers vs Charlotte Hornets | Los Angeles Lakers | Charlotte Hornets | 204.297946 | -10.568032 | 98.0 | 108.0 |
| Los Angeles Lakers vs Denver Nuggets | Los Angeles Lakers | Denver Nuggets | 214.895732 | -8.484503 | 103.0 | 112.0 |


## Model Accuracy Testing

- We calculate the margin between the two teams' simulated scores (sim_mar) and the margin between the two teams' real scores in the game data (real_mar). We test two things:

    - Check whether sim_mar * real_mar is positive. If it is, we have picked the same winner as the real game. (Correct Picks)
    - When people bet on NBA games, they usually bet on the score difference between the two teams. We check whether the difference between sim_mar and real_mar is within 5 points. If it is, we consider the estimate good. (Final Margin within 5 Pts)
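
The two checks can be sketched on a handful of games. The real margins below are the away-minus-home differences of the first four rows of the game_data preview; the simulated margins are made up for illustration, and the function name `accuracy` and its `tolerance` parameter are hypothetical.

```python
def accuracy(real_margins, sim_margins, tolerance=5):
    """Return (percent correct picks, percent of margins within tolerance)."""
    pairs = list(zip(real_margins, sim_margins))
    # Same sign means the same winner; a simulated tie never counts as correct.
    correct = sum(1 for real, sim in pairs if real * sim > 0)
    close = sum(1 for real, sim in pairs if abs(real - sim) <= tolerance)
    return 100.0 * correct / len(pairs), 100.0 * close / len(pairs)

real_mar = [18, -17, -1, -16]   # away score minus home score, real games
sim_mar = [-5, -9, -19, -12]    # made-up simulated margins

pick_pct, margin_pct = accuracy(real_mar, sim_mar)
print(pick_pct, margin_pct)
```

Here three of the four winners are picked correctly, but only one simulated margin lies within 5 points of the real one, which shows why the second measure is much harder to score well on.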


In [22]:
#test the model


def test():
    temp1=0
    temp2=0
    for index, row in game_data.iterrows():
        real_mar=float(row['away_score'])-float(row['home_score'])
        sim_mar=games['sim_ascore'].loc[row['away_tm']+' vs '+row['home_tm']]-games['sim_hscore'].loc[row['away_tm']+' vs '+row['home_tm']]
        temp1=temp1+int(real_mar*sim_mar>0)
        temp2=temp2+int(math.fabs(real_mar-sim_mar)<=5)
    pick=temp1/len(game_data)*100
    mar=temp2/len(game_data)*100
    return pick,mar

pick_accu, mar_accu=test()
print('We pick the right winner for %s percent of time, predict the right margin for %s percent of time'
      %(pick_accu, mar_accu))

We pick the right winner for 64.94086727989487 percent of time, predict the right margin for 36.530880420499344 percent of time


The accuracy rate of nbagamesim.com is as follows:
- Correct Picks  (64%)
- Final Margin within 5 Pts  (36%)

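Our results (64.9% correct picks and 36.5% of margins within 5 points) are slightly higher on both measures, although we test on the same games that were used to estimate the parameters.
This is similar to, or even slightly better than, the simulation accuracy of most of the simulation websites. Cheers!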

## Simulate the game you want!

Run the following function. Enter the home team name and the away team name to get the simulation result.

In [23]:
def read_result():
    test = [0,0]
    print('Hi! We want to help you predict a game!')
    while not any(test):
        home = input('Please input the home team: ')
        away = input('Please input the away team: ')
        away_text = away.title()
        home_text = home.title()
        test = games['a_tm'].str.contains(away_text) & games['h_tm'].str.contains(home_text)
        if any(test):
            print('\nSimulation Start!')
            break
        else:
            print('\nSorry! These aren\'t valid names. Please try again!')

    away_score = games.loc[test]['sim_ascore'].item()
    home_score = games.loc[test]['sim_hscore'].item()
    if away_score > home_score:
        print('\nWe predict the winner team to be:', games.loc[test]['a_tm'].item())
    elif away_score < home_score:
        print('\nWe predict the winner team to be:', games.loc[test]['h_tm'].item())
    else:
        print('\nWe predict the game to be even')
    # the score line is the same for all three outcomes
    print('And the score detail will approximately be (Home vs. Away):\n', home_score, ':', away_score)

        
read_result()

Hi! We want to help you predict a game!
Please input the home team: nets
Please input the away team: lakers

Simulation Start!

We predict the winner team to be: Brooklyn Nets
And the score detail will approximately be (Home vs. Away):
 108.0 : 103.0
