# NBA GAME PREDICTION

### Final Project for 15-688 Practical Data Science
#### Authors: Runchang Kang / Zheng Luo

The charm of sports lies in their unpredictability. Countless buzzer beaters and upsets not only drive fans crazy but also spark curiosity about whether a machine can predict the result. As basketball fans and students who have just taken the Practical Data Science course, we are eager to apply the knowledge and skills we learned in the course to this "mystical power".

### Project Content

In this project, we walk through a pipeline that uses data science techniques to explore the data and get the best prediction we can. The project is organized as follows:
- [Data Collection](#Data-Collection)
- [Data Processing](#Data-Processing)
- [Model Selection](#Model-Selection)
- [Training and Prediction](#Training-and-Prediction)
- [Iteration and Adjustment](#Iteration-and-Adjustment)
- [Conclusion](#Conclusion)

In [1]:
import datetime
import requests
import pickle
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

## Discussion

Although many datasets already exist online, many of them are poorly structured and hard to use. We decided to scrape the raw data from an NBA statistics website, [https://www.basketball-reference.com/](https://www.basketball-reference.com/). The website documents every game in an orderly way, so it is not too hard to get the data we need.

Before scraping the website, we had to decide what data we needed and how to structure it locally. After a heated discussion, we concluded that there is no perfect way to make the prediction, because of the large number of potentially influential factors: the players' physical and emotional state, team chemistry, strength of schedule, home or road games, the influence of social media, players' fluctuations during a game, and so on. We have to admit that bias exists.

We planned to collect the statistics of every game for every team from 1995 to 2017, and to use the mean of the previous `5 games' data` to represent the general state of a team. The opponents in those five games differ, so we assume that the mean neutralizes the differences in opponent strength.
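The windowing idea can be sketched in plain Python. This is a minimal illustration under assumptions, not the notebook's own implementation: the function name `previous_k_mean` and the sample scores are hypothetical, and the window deliberately excludes the game being predicted.

```python
def previous_k_mean(values, k):
    """For each game i >= k, return the mean of the k games played before it.

    Games 0..k-1 do not have a full window, so they are reported as None.
    """
    means = []
    for i in range(len(values)):
        if i < k:
            means.append(None)
        else:
            # values[i - k:i] covers games i-k .. i-1; game i itself is excluded
            window = values[i - k:i]
            means.append(sum(window) / k)
    return means


# Hypothetical points scored by one team in seven consecutive games
points = [100, 90, 110, 95, 105, 120, 80]
print(previous_k_mean(points, k=5))
# [None, None, None, None, None, 100.0, 104.0]
```

Note that Python slices (and `DataFrame.iloc` slices) exclude the end position, so a window of `k` previous games ends at `i`, not `i - 1`.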

## Data Collection

We scrape the statistics of all games for all teams from 1995 to 2017. Some teams changed their name or location during this period, so we have to process these data carefully.


### Utils

In [2]:
TEAM_NAMES=["ATL","BOS","BRK","NJN","CHA","CHH","CHO","CHI","CLE","DAL","DEN","DET","GSW","HOU","IND","LAC","LAL","MEM","VAN","MIA","MIL","MIN","NOP","NOH","NOK","NYK","OKC","SEA","ORL","PHI","PHO","POR","SAC","SAS","TOR","UTA", "WSB","WAS"]
#season years as strings; range() excludes its end value, so this covers 1995 through 2016
YEARS=["%s" %x for x in range(1995,2017)]
BASE_TEAM_URL= "https://www.basketball-reference.com/teams"
#several teams have multiple names
TEAM_INDEX={"ATL":0,
 "BOS":1,
 "BRK":2,"NJN":2,
 "CHA":3,"CHH":3,"CHO":3,
 "CHI":4,
 "CLE":5,
 "DAL":6,
 "DEN":7,
 "DET":8,
 "GSW":9,
 "HOU":10,
 "IND":11,
 "LAC":12,
 "LAL":13,
 "MEM":14,"VAN":14,
 "MIA":15,
 "MIL":16,
 "MIN":17,
 "NOH":18,"NOK":18,"NOP":18,
 "NYK":19,
 "SEA":20,"OKC":20,
 "ORL":21,
 "PHI":22,
 "PHO":23,
 "POR":24,
 "SAC":25,
 "SAS":26,
 "TOR":27,
 "UTA":28,
 "WSB":29,"WAS":29}

#this function creates empty, structured dataframes (basic and advanced stats) for a team
#we store the data as a pickle file for each team
def createNewDataframes():
    return (pd.DataFrame(columns = ["teamIndex","oppoIndex","gameDate","result","FG","FGA","3P","3PA","FT","FTA","ORB","DRB","AST","STL","BLK","TOV","PF","PTS"]),
           pd.DataFrame(columns = ["teamIndex","oppoIndex","gameDate","result","TS%","eFG%","3PAr","FTr","ORB%","DRB%","TRB%","AST%","STL%","BLK%","TOV%","USG%","ORtg","DRtg"]))



### Scraping

Scraping took about four hours to run. We split the work across three laptops and stored the data as pickle files, one per team.
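The notebook does not record how the teams were divided among the laptops. One simple option, shown here purely as an assumption, is a round-robin split of the team codes; the helper name `split_round_robin` is hypothetical.

```python
def split_round_robin(items, n_workers):
    """Deal items out to n_workers lists, one item at a time."""
    return [items[i::n_workers] for i in range(n_workers)]


# A few team codes from TEAM_NAMES, used only for illustration
teams = ["ATL", "BOS", "BRK", "NJN", "CHA", "CHH", "CHO"]
chunks = split_round_robin(teams, 3)
for worker, chunk in enumerate(chunks):
    print(worker, chunk)
# 0 ['ATL', 'NJN', 'CHO']
# 1 ['BOS', 'CHA']
# 2 ['BRK', 'CHH']
```

Because every team is written to its own pickle file, the machines never write to the same file and the results can simply be copied into one folder afterwards.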

We store the following features of each game:

- `FG`: Field Goals
- `FGA`: Field Goal Attempts
- `3P`: 3-Point Field Goals
- `3PA`: 3-Point Field Goal Attempts
- `FT`: Free Throws
- `FTA`: Free Throw Attempts
- `ORB`: Offensive Rebounds
- `DRB`: Defensive Rebounds
- `AST`: Assists
- `STL`: Steals
- `BLK`: Blocks
- `TOV`: Turnovers
- `PF`: Personal Fouls
- `PTS`: Points
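Several of the advanced columns created by `createNewDataframes` (`TS%`, `eFG%`, `3PAr`, `FTr`) are rates derived from the basic counts above. The notebook scrapes them directly rather than computing them; the sketch below only illustrates how they relate to the basic features, using the commonly published definitions. The function name `shooting_rates` and the sample box score are hypothetical.

```python
def shooting_rates(fg, fga, fg3, fg3a, fta, pts):
    """Derive four shooting rates from basic box-score counts."""
    return {
        # effective field goal %: a made three counts as 1.5 field goals
        "eFG%": (fg + 0.5 * fg3) / fga,
        # true shooting %: points per shooting possession, free throws weighted by 0.44
        "TS%": pts / (2 * (fga + 0.44 * fta)),
        # share of field goal attempts taken from three-point range
        "3PAr": fg3a / fga,
        # free throw attempts per field goal attempt
        "FTr": fta / fga,
    }


# Hypothetical team totals for one game
line = shooting_rates(fg=40, fga=80, fg3=10, fg3a=25, fta=20, pts=110)
for name, value in line.items():
    print(name, round(value, 3))
```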


In [None]:
#this block of code will scrape and store the data as pickle files for each team

for team in TEAM_NAMES:
    basicDf,advancedDf = createNewDataframes()
    for year in YEARS:
        url = BASE_TEAM_URL +"/"+team+"/"+year+"_games.html"
        print(url)
        response = requests.get(url)
        root = BeautifulSoup(response.text,"html5lib")
        items = root.find_all("td",attrs={"data-stat": "box_score_text","class":"center"})
        #find_all returns an empty list (never None) when nothing matches
        if not items:
            continue
        #get game results
        winItems = root.find_all("td",attrs={"data-stat": "game_result","class":"center"})
        winList = []
        for i in winItems:
            if i.text=="W":
                winList.append(1)
            elif i.text =="L":
                winList.append(0)
        
        if len(items)!= len(winList):
            print("mismatch!")
        #get opponents:
        oppoList = []
        oppoItems = root.find_all("td",attrs={"data-stat": "opp_name","class":"left"})
        
        
        for i in oppoItems:
            #the "csk" attribute starts with the opponent's three-letter code
            tmp = i.get("csk")
            if tmp is not None:
                oppoList.append(tmp[:3])
            else:
                print("here is a None!")
        
        if len(items)!= len(oppoList):
            print("mismatch!")
        for index,item in enumerate(items):
            newUrl = BASE_TEAM_URL[:-6]+item.a["href"]
            date = item.a["href"][11:19]
            basicList = [TEAM_INDEX[team],TEAM_INDEX[oppoList[index]],date,winList[index]]
            advancedList = [TEAM_INDEX[team],TEAM_INDEX[oppoList[index]],date,winList[index]]
            newResponse = requests.get(newUrl)
            tmpRoot = BeautifulSoup(newResponse.text,"html5lib")
            idString = "all_box_%s_basic" % team.lower()
            tmpItem = tmpRoot.find("div",attrs={"id":idString}).find("tfoot")
            #append the team totals from the table footer, in column order
            for stat in ("fg","fga","fg3","fg3a","ft","fta","orb","drb","ast","stl","blk","tov","pf","pts"):
                basicList.append(tmpItem.find("td",attrs={"data-stat":stat}).text)
            basicDf.loc[basicDf.shape[0]] = basicList
            
            
            idString = "all_box_%s_advanced" % team.lower()
            tmpItem = tmpRoot.find("div",attrs={"id":idString}).find("tfoot")
            #append the advanced team totals from the table footer, in column order
            for stat in ("ts_pct","efg_pct","fg3a_per_fga_pct","fta_per_fga_pct","orb_pct","drb_pct","trb_pct",
                         "ast_pct","stl_pct","blk_pct","tov_pct","usg_pct","off_rtg","def_rtg"):
                advancedList.append(tmpItem.find("td",attrs={"data-stat":stat}).text)
            advancedDf.loc[advancedDf.shape[0]] = advancedList
        basicName = "%s_basic_data.pickle"%team
        advancedName = "%s_advanced_data.pickle"%team
        basicDf.to_pickle(basicName)
        advancedDf.to_pickle(advancedName)

## Data Processing

#### Utils

In [139]:

#manipulate the dataframe into right data type
def changeDtypes(df):
    df['gameDate'] = pd.to_datetime(df['gameDate'], format="%Y%m%d")
    for i in df.columns:
        if i not in ('teamIndex','gameDate','result',"oppoIndex"):
            df[i]=df[i].astype("float64") 
        
    return df

#replace each column listed in a with the mean of the previous k games,
#and add a preResult column holding the mean result of those k games
def refactorDF2(df,k,a):
    newDf = df.copy()
    for i in range(k,df.shape[0]):
        #iloc slices exclude the end position, so i-k:i covers the k games before game i
        window = df.iloc[i-k:i]
        for column in a:
            newDf.at[i,column] = window[column].mean()
        newDf.at[i,"preResult"] = window["result"].astype("float64").mean()
    return newDf.iloc[k:]

#replace each column listed in a with the mean of the previous k games
def refactorDF(df,k,a):
    newDf = df.copy()
    for i in range(k,df.shape[0]):
        #iloc slices exclude the end position, so i-k:i covers the k games before game i
        window = df.iloc[i-k:i]
        for column in a:
            newDf.at[i,column] = window[column].mean()
    return newDf.iloc[k:]
#get the stat columns to average (everything except the identifiers, result and preResult)
def getBasicColumnList(df):
    a=list(df.columns)
    for i in ['teamIndex','gameDate','result',"oppoIndex","preResult"]:
        a.remove(i)
    return a
#get the feature columns (everything except the identifiers and result; preResult is kept if present)
def getBasicColumnList2(df):
    a=list(df.columns)
    for i in ['teamIndex','gameDate','result',"oppoIndex"]:
        a.remove(i)
    return a
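#helper function that adds preResult: the team's win rate so far in the current season, before each game
#a gap of 16 weeks or more between two games marks the start of a new season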
def compareDate(df):
    newDf = df.copy()
    DELTALimit = datetime.timedelta(weeks=16)
    newDf.at[0,"preResult"] = 0
    newSeasonIndex = 0
    for i in range(1,df.shape[0]):
        delta = newDf.at[i,"gameDate"]- newDf.at[i-1,"gameDate"]
        
        if delta < DELTALimit:
            newDf.at[i,"preResult"] = df.iloc[newSeasonIndex:i]["result"].astype("float64").sum()/(i-newSeasonIndex)
        else:
            
            newSeasonIndex = i
            newDf.at[i,"preResult"] = 0
    return newDf


basicDataFileNames = ["%s_basic_data.pickle"%x for x in TEAM_INDEX]
advancedDataFileNames = ["%s_advanced_data.pickle"%x for x in TEAM_INDEX]
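The season logic in `compareDate` can be illustrated on a small example without pandas. This is a minimal sketch under assumptions: the function name `season_win_rate` and the sample dates and results are hypothetical, and it mirrors the same rule (a gap of at least 16 weeks starts a new season, and a season opener gets a rate of 0).

```python
from datetime import date, timedelta


def season_win_rate(dates, results, gap=timedelta(weeks=16)):
    """For each game, return the win rate over the earlier games of the same season."""
    rates = []
    season_start = 0
    for i in range(len(dates)):
        # a long break since the previous game means a new season has started
        if i > 0 and dates[i] - dates[i - 1] >= gap:
            season_start = i
        played = i - season_start
        wins = sum(results[season_start:i])
        rates.append(wins / played if played else 0.0)
    return rates


# Hypothetical schedule: three games in April, then two after the off-season
dates = [date(2015, 4, 10), date(2015, 4, 12), date(2015, 4, 15),
         date(2015, 10, 28), date(2015, 10, 30)]
results = [1, 0, 1, 1, 0]
print(season_win_rate(dates, results))
# [0.0, 1.0, 0.5, 0.0, 1.0]
```

Only games before the current one enter the rate, so the feature never leaks the result that is being predicted.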

In [None]:
#this function combines all of the data that we scraped from the stat website,
#merging the files of teams that changed their names
#it takes some time to run, so we store the result in a file
def combineData(DataFileNames,k=5):
    dataFrameSet = dict()
    for fileName in DataFileNames:
        df = pd.read_pickle(fileName)
        df = changeDtypes(df)
        df = compareDate(df)
        #assumption: average only the stat columns (getBasicColumnList),
        #leaving preResult as computed by compareDate
        df = refactorDF(df,k,getBasicColumnList(df))
        
        curIndex = df.iloc[0]["teamIndex"]
        if curIndex not in dataFrameSet:
            dataFrameSet[curIndex] = df
        else:
            dataFrameSet[curIndex] = pd.concat([dataFrameSet[curIndex],df])
    return dataFrameSet


dataFrameSet = combineData(advancedDataFileNames)

with open("join_data.pickle","wb") as f:
    pickle.dump(dataFrameSet,f,protocol=pickle.HIGHEST_PROTOCOL)

In [None]:
#this function builds the features from the dataframes and returns them as numpy arrays
#each row holds a team's features followed by its opponent's features for the same game
def createTheFeatures(dataFrameSet,ignoreCol):
    finalArray = np.array([])
    flag = True
    yValue = np.array([])
    for key in dataFrameSet:
        df = dataFrameSet[key]
        
        for i in range(df.shape[0]):
            tmpArray = df.iloc[i][ignoreCol].values
            tmpArray = tmpArray.reshape(1,tmpArray.shape[0])
            if np.isnan(np.min(tmpArray)):
                continue
            oppoDf = dataFrameSet[df.iloc[i]["oppoIndex"]]
            time = df.iloc[i]["gameDate"]
            oppoRow = oppoDf.loc[oppoDf["gameDate"]==time]
            if not oppoRow.empty:
                
                oppoArray = oppoRow[ignoreCol].values
                if np.isnan(np.min(oppoArray)):
                    continue
                tmpArray = np.append(tmpArray,oppoArray)
                if flag:
                    finalArray = tmpArray
                    flag = False
                else:   
                    finalArray = np.vstack((finalArray,tmpArray))
                yValue = np.append(yValue,df.iloc[i]["result"])
                print(finalArray.shape)
                print(yValue.shape)
    return finalArray,yValue


with open("join_data.pickle","rb") as f:
    dataFrameSet=pickle.load(f)
ignoreCol = getBasicColumnList2(dataFrameSet[0])
X,y = createTheFeatures(dataFrameSet,ignoreCol)
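The pairing step in `createTheFeatures` (a team's averaged features followed by the opponent's features for the same date) can be sketched on plain Python data. This is an illustration under assumptions, not the notebook's implementation: `build_samples`, the dictionary layout, and the sample values are hypothetical, and a dictionary keyed by `(team, date)` stands in for the DataFrame lookup.

```python
def build_samples(games):
    """Pair each game row with the opponent's row for the same date.

    games: list of dicts with keys team, oppo, date, result, features.
    Returns (X, y), where each row of X is team features + opponent features.
    """
    by_team_date = {(g["team"], g["date"]): g for g in games}
    X, y = [], []
    for g in games:
        oppo = by_team_date.get((g["oppo"], g["date"]))
        if oppo is None:
            continue  # no matching opponent row, so the sample is skipped
        X.append(g["features"] + oppo["features"])
        y.append(g["result"])
    return X, y


games = [
    {"team": 0, "oppo": 1, "date": "20151028", "result": 1, "features": [0.55, 110.0]},
    {"team": 1, "oppo": 0, "date": "20151028", "result": 0, "features": [0.50, 101.0]},
    # the opponent's row for this game is missing, so it produces no sample
    {"team": 2, "oppo": 3, "date": "20151028", "result": 1, "features": [0.52, 105.0]},
]
X, y = build_samples(games)
print(X)
print(y)
```

Each game contributes two samples, one from each team's point of view. Collecting rows in a list and converting once at the end also avoids calling `np.vstack` on a growing array for every row.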

In [144]:
#store the data into files
with open("X_ad3.pickle","wb") as f:
    pickle.dump(X,f,protocol=pickle.HIGHEST_PROTOCOL)
with open("y_ad3.pickle","wb") as f:
    pickle.dump(y,f,protocol=pickle.HIGHEST_PROTOCOL)