# NBA GAME PREDICTION

### Final Project for 15-688 Practical Data Science
#### Authors: Runchang Kang / Zheng Luo

The charm of sports lies in their unpredictability. Countless buzzer beaters and upsets have not only thrilled fans but also sparked curiosity about whether machines can predict game results. As basketball fans and students who just took the Practical Data Science course, we couldn't wait to apply the knowledge and skills we learned in the course to this "mystical power".

### Project Content

In this project, we walk through a pipeline that uses data science techniques to process NBA game data and produce the best predictions we can. The project is structured as follows:
- [Data Collection](#Data-Collection)
- [Data Processing](#Data-Processing)
- [Model Selection](#Model-Selection)
- [Training and Prediction](#Training-and-Prediction)
- [Iteration and Adjustment](#Iteration-and-Adjustment)
- [Conclusion](#Conclusion)

In [1]:
import requests
import pickle
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

## Discussion

Although many datasets already exist online, many of them are poorly structured and hard to use. We decided to scrape the raw data from an NBA statistics website, [https://www.basketball-reference.com/](https://www.basketball-reference.com/). The website documents every game in a consistent layout, so getting the data we need is not too hard.

Before scraping the website, we had to think about what kind of data we needed and how to structure it locally. After a heated discussion, we concluded that there is no perfect way to make the prediction, because of the large number of potentially influential factors: the players' physical and emotional state, team chemistry, strength of schedule, home or road games, the influence of social media, players' fluctuations during a game, and so on. We have to acknowledge that some bias is unavoidable.

We planned to collect each team's statistics for every game from 1995 to 2017 and to use the mean of the `previous 5 games' data` to represent a team's general form. The opponents in those five games differ, so we assume that averaging neutralizes the differences in opponent strength.
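The "mean of the previous 5 games" idea can be sketched with a shifted rolling window in pandas. This is a minimal illustration, not the code used later in the notebook; the helper name `previous_k_mean` and the sample point totals are made up for the example.

```python
import pandas as pd

def previous_k_mean(values, k=5):
    """Mean of the k games played before each game (current game excluded)."""
    s = pd.Series(values, dtype="float64")
    # shift(1) removes the current game from its own window, so only earlier
    # games contribute; rows without k earlier games stay NaN.
    return s.shift(1).rolling(window=k).mean()

# hypothetical points scored in seven consecutive games
points = [100, 90, 110, 95, 105, 120, 80]
form = previous_k_mean(points, k=5)
print(form.tolist())
```

The first five entries are NaN because fewer than five earlier games exist; from the sixth game on, each value summarizes only games that were already finished, which keeps the feature free of information from the game being predicted.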

## Data Collection

We scrape the statistics of all games for all teams from 1995 to 2017. Some teams changed their name or location during this period, so we have to process their data carefully.


### Utils

In [2]:
TEAM_NAMES=["ATL","BOS","BRK","NJN","CHA","CHH","CHO","CHI","CLE","DAL","DEN","DET","GSW","HOU","IND","LAC","LAL","MEM","VAN","MIA","MIL","MIN","NOP","NOH","NOK","NYK","OKC","SEA","ORL","PHI","PHO","POR","SAC","SAS","TOR","UTA", "WSB","WAS"]
#range() excludes its end value, so this covers 1995 through 2016
YEARS=["%s" %x for x in range(1995,2017)]
BASE_TEAM_URL= "https://www.basketball-reference.com/teams"
#several teams have multiple names
TEAM_INDEX={"ATL":0,
 "BOS":1,
 "BRK":2,"NJN":2,
 "CHA":3,"CHH":3,"CHO":3,
 "CHI":4,
 "CLE":5,
 "DAL":6,
 "DEN":7,
 "DET":8,
 "GSW":9,
 "HOU":10,
 "IND":11,
 "LAC":12,
 "LAL":13,
 "MEM":14,"VAN":14,
 "MIA":15,
 "MIL":16,
 "MIN":17,
 "NOH":18,"NOK":18,"NOP":18,
 "NYK":19,
 "SEA":20,"OKC":20,
 "ORL":21,
 "PHI":22,
 "PHO":23,
 "POR":24,
 "SAC":25,
 "SAS":26,
 "TOR":27,
 "UTA":28,
 "WSB":29,"WAS":29}

#this function creates two empty dataframes for a team:
#one for the basic box-score stats and one for the advanced stats
#the data of each team is later stored as pickle files
def createNewDataframes():
    return (pd.DataFrame(columns = ["teamIndex","oppoIndex","gameDate","result","FG","FGA","3P","3PA","FT","FTA","ORB","DRB","AST","STL","BLK","TOV","PF","PTS"]),
           pd.DataFrame(columns = ["teamIndex","oppoIndex","gameDate","result","TS%","eFG%","3PAr","FTr","ORB%","DRB%","TRB%","AST%","STL%","BLK%","TOV%","USG%","ORtg","DRtg"]))



### Scraping

The scraping took about four hours to run. We split the work across 3 laptops and stored all the data as pickle files named by team.

We store the following features of each game:

- `FG : Field Goals`
- `FGA : Field Goal Attempts`
- `3P : 3-Point Field Goals`
- `3PA : 3-Point Field Goal Attempts`
- `FT : Free Throws`
- `FTA : Free Throw Attempts`
- `ORB : Offensive Rebounds`
- `DRB : Defensive Rebounds`
- `AST : Assists`
- `STL : Steals`
- `BLK : Blocked Shots`
- `TOV : Turnovers`
- `PF : Personal Fouls`
- `PTS : Points`
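The scraper recovers each game's date from the box-score link by slicing the `href` string. A slightly more defensive way to do the same thing is sketched below. The helper name `parse_box_score_href` and the sample link are illustrative, and the link layout (`/boxscores/` followed by an eight-digit date) is an assumption about how the site names its pages.

```python
from datetime import datetime

BASE_URL = "https://www.basketball-reference.com"

def parse_box_score_href(href):
    """Return (full box-score URL, game date) for a relative box-score link."""
    # Assumed layout: "/boxscores/" + YYYYMMDD + game id suffix + ".html"
    page_name = href.rsplit("/", 1)[-1].split(".")[0]   # e.g. "201610250CLE"
    # strptime raises ValueError if the first 8 characters are not a valid date,
    # which surfaces layout changes instead of silently storing garbage
    game_date = datetime.strptime(page_name[:8], "%Y%m%d").date()
    return BASE_URL + href, game_date

# hypothetical link in the assumed layout
url, game_date = parse_box_score_href("/boxscores/201610250CLE.html")
print(url, game_date)
```

Parsing the date up front means a malformed link fails at scraping time rather than later, when the `gameDate` column is converted to datetimes.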


In [None]:
#this function scrapes the data and stores it as pickle files for each team
def scrape_data(TEAM_NAMES,YEARS, BASE_TEAM_URL):
    for team in TEAM_NAMES:
        basicDf,advancedDf = createNewDataframes()
        for year in YEARS:
            url = BASE_TEAM_URL +"/"+team+"/"+year+"_games.html"
            print(url)
            response = requests.get(url)
            root = BeautifulSoup(response.text,"html5lib")
            items = root.find_all("td",attrs={"data-stat": "box_score_text","class":"center"})
            #find_all returns an empty list (never None) when nothing matches
            if not items:
                continue
            #get game results
            winItems = root.find_all("td",attrs={"data-stat": "game_result","class":"center"})
            winList = []
            for i in winItems:
                if i.text=="W":
                    winList.append(1)
                elif i.text =="L":
                    winList.append(0)

            if len(items)!= len(winList):
                print("mismatch!")
            #get opponents:
            oppoList = []
            oppoItems = root.find_all("td",attrs={"data-stat": "opp_name","class":"left"})


            for i in oppoItems:
                tmp = i["csk"][:3]
                if tmp != None:
                    oppoList.append(tmp)
                else:
                    print("here is a None!")

            if len(items)!= len(oppoList):
                print("opponent mismatch!")
            for index,item in enumerate(items):
                newUrl = BASE_TEAM_URL[:-6]+item.a["href"]
                date = item.a["href"][11:19]
                basicList = [TEAM_INDEX[team],TEAM_INDEX[oppoList[index]],date,winList[index]]
                advancedList = [TEAM_INDEX[team],TEAM_INDEX[oppoList[index]],date,winList[index]]
                newResponse = requests.get(newUrl)
                tmpRoot = BeautifulSoup(newResponse.text,"html5lib")
                idString = "all_box_%s_basic" % team.lower()
                tmpItem = tmpRoot.find("div",attrs={"id":idString}).find("tfoot")
                basicList.append(tmpItem.find("td",attrs={"data-stat":"fg"}).text)
                basicList.append(tmpItem.find("td",attrs={"data-stat":"fga"}).text)
                basicList.append(tmpItem.find("td",attrs={"data-stat":"fg3"}).text)
                basicList.append(tmpItem.find("td",attrs={"data-stat":"fg3a"}).text)
                basicList.append(tmpItem.find("td",attrs={"data-stat":"ft"}).text)
                basicList.append(tmpItem.find("td",attrs={"data-stat":"fta"}).text)
                basicList.append(tmpItem.find("td",attrs={"data-stat":"orb"}).text)
                basicList.append(tmpItem.find("td",attrs={"data-stat":"drb"}).text)
                basicList.append(tmpItem.find("td",attrs={"data-stat":"ast"}).text)
                basicList.append(tmpItem.find("td",attrs={"data-stat":"stl"}).text)
                basicList.append(tmpItem.find("td",attrs={"data-stat":"blk"}).text)
                basicList.append(tmpItem.find("td",attrs={"data-stat":"tov"}).text)
                basicList.append(tmpItem.find("td",attrs={"data-stat":"pf"}).text)
                basicList.append(tmpItem.find("td",attrs={"data-stat":"pts"}).text)
                basicDf.loc[basicDf.shape[0]] = basicList


                idString = "all_box_%s_advanced" % team.lower()
                tmpItem = tmpRoot.find("div",attrs={"id":idString}).find("tfoot")
                advancedList.append(tmpItem.find("td",attrs={"data-stat":"ts_pct"}).text)
                advancedList.append(tmpItem.find("td",attrs={"data-stat":"efg_pct"}).text)
                advancedList.append(tmpItem.find("td",attrs={"data-stat":"fg3a_per_fga_pct"}).text)
                advancedList.append(tmpItem.find("td",attrs={"data-stat":"fta_per_fga_pct"}).text)
                advancedList.append(tmpItem.find("td",attrs={"data-stat":"orb_pct"}).text)
                advancedList.append(tmpItem.find("td",attrs={"data-stat":"drb_pct"}).text)
                advancedList.append(tmpItem.find("td",attrs={"data-stat":"trb_pct"}).text)
                advancedList.append(tmpItem.find("td",attrs={"data-stat":"ast_pct"}).text)
                advancedList.append(tmpItem.find("td",attrs={"data-stat":"stl_pct"}).text)
                advancedList.append(tmpItem.find("td",attrs={"data-stat":"blk_pct"}).text)
                advancedList.append(tmpItem.find("td",attrs={"data-stat":"tov_pct"}).text)
                advancedList.append(tmpItem.find("td",attrs={"data-stat":"usg_pct"}).text)
                advancedList.append(tmpItem.find("td",attrs={"data-stat":"off_rtg"}).text)
                advancedList.append(tmpItem.find("td",attrs={"data-stat":"def_rtg"}).text)
                advancedDf.loc[advancedDf.shape[0]] = advancedList
            basicName = "%s_basic_data.pickle"%team
            advancedName = "%s_advanced_data.pickle"%team
            basicDf.to_pickle(basicName)
            advancedDf.to_pickle(advancedName)

## Data Processing

#### Utils

In [139]:

#manipulate the dataframe into right data type
def changeDtypes(df):
    df['gameDate'] = pd.to_datetime(df['gameDate'], format="%Y%m%d")
    for i in df.columns:
        if i not in ('teamIndex','gameDate','result',"oppoIndex"):
            df[i]=df[i].astype("float64") 
        
    return df

#replace each column in a with the mean of that column over the previous k games
#and add a "preResult" column holding the mean result (win rate) of those games
def refactorDF2(df,k,a):
    newDf = df.copy()
    for i in range(k,df.shape[0]):
        #rows i-k .. i-1 are the k games played before game i
        window = df.iloc[i-k:i]
        for column in a:
            newDf.at[i,column] = window[column].mean()
        newDf.at[i,"preResult"] = window["result"].astype("float64").mean()
    return newDf.iloc[k:]

#replace each column in a with the mean of that column over the previous k games
def refactorDF(df,k,a):
    newDf = df.copy()
    for i in range(k,df.shape[0]):
        #rows i-k .. i-1 are the k games played before game i
        window = df.iloc[i-k:i]
        for column in a:
            newDf.at[i,column] = window[column].mean()

    return newDf.iloc[k:]
#get the column names that need to process
def getBasicColumnList(df):
    a=list(df.columns)
    for i in ['teamIndex','gameDate','result',"oppoIndex","preResult"]:
        a.remove(i)
    return a
#get the column names that need to process
def getBasicColumnList2(df):
    a=list(df.columns)
    for i in ['teamIndex','gameDate','result',"oppoIndex"]:
        a.remove(i)
    return a
#helper function for adding the up to date seasonal win records
def compareDate(df):
    newDf = df.copy()
    #a gap longer than this between two games marks the start of a new season
    DELTALimit = pd.Timedelta(weeks=16)
    newDf.at[0,"preResult"] = 0
    newSeasonIndex = 0
    for i in range(1,df.shape[0]):
        delta = newDf.at[i,"gameDate"]- newDf.at[i-1,"gameDate"]
        
        if delta < DELTALimit:
            newDf.at[i,"preResult"] = df.iloc[newSeasonIndex:i]["result"].astype("float64").sum()/(i-newSeasonIndex)
        else:
            
            newSeasonIndex = i
            newDf.at[i,"preResult"] = 0
    return newDf


basicDataFileNames = ["%s_basic_data.pickle"%x for x in TEAM_INDEX]
advancedDataFileNames = ["%s_advanced_data.pickle"%x for x in TEAM_INDEX]
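The season logic in `compareDate` (a gap of more than 16 weeks starts a new season, and `preResult` is the win rate over the games already played in that season) can also be written without an explicit loop. The sketch below is an alternative formulation under the same assumptions, with a hypothetical helper name `season_win_rate` and toy data; it is not the code that produced the stored pickles.

```python
import pandas as pd

def season_win_rate(df, gap_weeks=16):
    """Win rate over the games already played in the current season."""
    out = df.sort_values("gameDate").reset_index(drop=True)
    results = out["result"].astype("float64")
    # a gap longer than gap_weeks between consecutive games starts a new season
    new_season = out["gameDate"].diff() > pd.Timedelta(weeks=gap_weeks)
    season_id = new_season.cumsum()
    grouped = results.groupby(season_id)
    wins_before = grouped.cumsum() - results      # wins strictly before this game
    games_before = grouped.cumcount()             # games strictly before this game
    # the first game of a season has no history (0/0), which is stored as 0
    out["preResult"] = (wins_before / games_before).fillna(0.0)
    return out

# toy schedule: three games in March, then two games after a long summer break
toy = pd.DataFrame({
    "gameDate": pd.to_datetime(["2016-03-01", "2016-03-03", "2016-03-05",
                                "2016-10-26", "2016-10-28"]),
    "result": [1, 0, 1, 1, 1],
})
toy_out = season_win_rate(toy)
print(toy_out["preResult"].tolist())
```

The win rate resets after the summer gap, so early-season games are not described by the previous season's record.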

In [None]:
#this function combines all of the data that we scraped from the stats website,
#merging the files of teams that changed their names
#it takes some time to run, so we store the result in a file
def combineData(DataFileNames,k=5):
    dataFrameSet = dict()
    for fileName in DataFileNames:
        df = pd.read_pickle(fileName)
        df = changeDtypes(df)
        df = compareDate(df)
        #average every stat column; "preResult" already holds the season win rate
        df = refactorDF(df,k,getBasicColumnList(df))
        
        curIndex = df.iloc[0]["teamIndex"]
        if curIndex not in dataFrameSet:
            dataFrameSet[curIndex] = df
        else:
            dataFrameSet[curIndex] = pd.concat([dataFrameSet[curIndex],df])
    return dataFrameSet


dataFrameSet = combineData(advancedDataFileNames)

#the with block closes the file automatically
with open("join_data.pickle","wb") as f:
    pickle.dump(dataFrameSet,f,protocol=pickle.HIGHEST_PROTOCOL)

In [None]:
#this function builds one feature row per game by joining a team's features
#with its opponent's features for the same date, and returns numpy arrays (X, y)
def createTheFeatures(dataFrameSet,ignoreCol):
    finalArray = np.array([])
    flag = True
    yValue = np.array([])
    for key in dataFrameSet:
        df = dataFrameSet[key]
        
        for i in range(df.shape[0]):
            tmpArray = df.iloc[i][ignoreCol].values
            tmpArray = tmpArray.reshape(1,tmpArray.shape[0])
            if np.isnan(np.min(tmpArray)):
                continue
            oppoDf = dataFrameSet[df.iloc[i]["oppoIndex"]]
            time = df.iloc[i]["gameDate"]
            oppoRow = oppoDf.loc[oppoDf["gameDate"]==time]
            if not oppoRow.empty:
                
                oppoArray = oppoRow[ignoreCol].values
                if np.isnan(np.min(oppoArray)):
                    continue
                tmpArray = np.append(tmpArray,oppoArray)
                if flag:
                    finalArray = tmpArray
                    flag = False
                else:   
                    finalArray = np.vstack((finalArray,tmpArray))
                yValue = np.append(yValue,df.iloc[i]["result"])
                print(finalArray.shape)
                print(yValue.shape)
    return finalArray,yValue


with open("join_data.pickle","rb") as f:
    dataFrameSet=pickle.load(f)
ignoreCol = getBasicColumnList2(dataFrameSet[0])
X,y = createTheFeatures(dataFrameSet,ignoreCol)
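The pairing step in `createTheFeatures` (look up the opponent's row for the same date and append its features) can be expressed as a single join. The sketch below is one possible alternative, assuming all per-team frames have been concatenated into one frame `games` with numeric `teamIndex` and `oppoIndex` columns; the helper name `pair_with_opponent` and the toy values are illustrative.

```python
import pandas as pd

def pair_with_opponent(games, feature_cols):
    """Attach the opponent's pre-game features to every team-game row."""
    oppo_names = {c: "oppo_" + c for c in feature_cols}
    # view each row from the other side: its teamIndex becomes the join key oppoIndex
    oppo = games[["teamIndex", "gameDate"] + feature_cols].rename(
        columns={"teamIndex": "oppoIndex", **oppo_names}
    )
    # the inner join keeps only games where the opponent also has a row for that date
    paired = games.merge(oppo, on=["oppoIndex", "gameDate"], how="inner")
    all_cols = feature_cols + list(oppo_names.values())
    paired = paired.dropna(subset=all_cols)   # skip games with missing features
    X = paired[all_cols].to_numpy(dtype="float64")
    y = paired["result"].to_numpy(dtype="float64")
    return X, y

# toy data: teams 0 and 1 play each other; team 2 has no row of its own
games = pd.DataFrame({
    "teamIndex": [0, 1, 0],
    "oppoIndex": [1, 0, 2],
    "gameDate": pd.to_datetime(["2016-10-25", "2016-10-25", "2016-10-27"]),
    "result": [1, 0, 1],
    "ORtg": [110.0, 100.0, 105.0],
    "DRtg": [102.0, 108.0, 99.0],
})
X_pairs, y_pairs = pair_with_opponent(games, ["ORtg", "DRtg"])
print(X_pairs.shape, y_pairs)
```

A join avoids growing the array with `np.vstack` one row at a time, which becomes slow as the number of games increases.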

In [144]:
#store the data into files (the with blocks close the files automatically)
with open("X_ad3.pickle","wb") as f:
    pickle.dump(X,f,protocol=pickle.HIGHEST_PROTOCOL)
with open("y_ad3.pickle","wb") as f:
    pickle.dump(y,f,protocol=pickle.HIGHEST_PROTOCOL)

## Model Selection

After the data have been processed, we are ready to start model selection. In this project, we use two methods for hyperparameter selection: random search and grid search.

In [None]:
# import runtime_path
from scipy.stats import randint, uniform
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
# from sklearn.model_selection import RandomizedSearchCV, cross_val_score, train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
# from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

In order to find the hyper parameters for each model, we first implement random search and grid search.
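To make the difference between the two search strategies concrete, here is a small self-contained sketch on synthetic data. The dataset from `make_classification` and the names `X_demo`, `y_demo`, `grid`, and `rand` are placeholders for illustration and have nothing to do with the NBA features.

```python
from scipy.stats import uniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# synthetic binary classification data, used only for this illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
clf = LogisticRegression(max_iter=1000)

# grid search: evaluates every listed value exactly once
grid = GridSearchCV(clf, param_grid={"C": [0.1, 1.0, 10.0]}, cv=3)
grid.fit(X_demo, y_demo)

# random search: draws n_iter values from a distribution;
# uniform(loc, scale) samples from [loc, loc + scale], here [0.1, 20.1]
rand = RandomizedSearchCV(clf, param_distributions={"C": uniform(0.1, 20)},
                          n_iter=5, cv=3, random_state=0)
rand.fit(X_demo, y_demo)

print(len(grid.cv_results_["params"]), grid.best_params_)
print(len(rand.cv_results_["params"]), rand.best_params_)
```

Grid search cost grows with the product of the list lengths, while random search cost is fixed by `n_iter`, which is why random search suits continuous parameters such as `C`.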

For logistic regression, we use random search to decide the value of C and whether or not to use the dual formulation.




In [None]:
# log regression
def lr_search(train_X, train_y, verbose=False):
    if verbose:
        print('Logistic regression')
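    # the dual formulation searched below is only implemented for the liblinear solver
    clf = LogisticRegression(solver='liblinear', max_iter=10000, random_state=0, tol=1e-4)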
    param_dist = {'C': uniform(0.1, 20), 'dual': [True, False]}

    # hyper params search
    search_iters = 30
    random_search = RandomizedSearchCV(
        clf, param_distributions=param_dist, n_iter=search_iters,
        verbose=False, cv=5)

    if verbose:
        print("Starting random search for logistic regression hyper parameters...")
    random_search.fit(train_X, train_y)
    if verbose:
        print("Random search complete")
        print("Random search best score: ", random_search.best_score_)
        print("Random search gives best params as: ", random_search.best_params_)

    return random_search.best_params_

For SVM, we use grid search to find the type of kernel and the values of gamma and C.
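Before launching a grid search it helps to count how many models it will fit. The sketch below does this for a grid with the same shape as the SVM grid used in this notebook (5 gamma values, 2 kernels, 11 values of C) and 5-fold cross-validation; the variable names are illustrative.

```python
from itertools import product
import numpy as np

param_grid = {
    "gamma": np.logspace(-3, 1, 5),
    "kernel": ["linear", "rbf"],
    "C": [0.1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
}
cv_folds = 5

# every combination of (gamma, kernel, C) is evaluated once per fold
combos = list(product(*param_grid.values()))
n_fits = len(combos) * cv_folds

# gamma has no effect on a linear kernel, so linear combinations that differ
# only in gamma repeat the same model
n_linear = sum(1 for gamma, kernel, C in combos if kernel == "linear")
redundant = n_linear - len(param_grid["C"])

print(len(combos), n_fits, redundant)
```

Passing a list of two dictionaries as the grid, one per kernel with `gamma` listed only for `rbf`, would avoid the repeated linear fits.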

In [None]:
#SVM
def svm_search(train_X, train_y, verbose=False):
    if verbose:
        print('SVM classifier...')
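    # standardize the features; dividing by the column sums breaks down on
    # already-standardized data, whose column sums are close to zero
    train_X = StandardScaler().fit_transform(train_X)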

    clf = svm.SVC(random_state=0, tol=1e-4, max_iter=10000)
    param_dist = {
        'gamma': np.logspace(-3,1,5),
        'kernel': ['linear', 'rbf'],
        'C': [0.1,1,2,3,4,5,6,7,8,9,10]}

    # hyper params search (grid search tries every combination)
    grid_search = GridSearchCV(
        clf, param_grid=param_dist,
        verbose=False, cv=5)
    
    if verbose:
        print("Starting grid search for svm hyper parameters...")
    grid_search.fit(train_X, train_y)
    if verbose:
        print("Grid search complete")
        print("Grid search best score: ", grid_search.best_score_)
        print("Grid search gives best params as: ", grid_search.best_params_)

    return grid_search.best_params_

For the neural network, we use grid search to find the number of hidden units, the learning rate, and the regularization strength.

In [None]:
# Neural Network
def nn_search(train_X, train_y, verbose=False):
    if verbose:
        print('Neural Net classifier...')
    scaler = StandardScaler()  
    scaler.fit(train_X)  
    train_X = scaler.transform(train_X)  

    clf = MLPClassifier(solver='adam', activation='relu', 
                        max_iter=10000)

    param_dist = {
        'hidden_layer_sizes': [(5,),(10,),(15,),(20,)],
        'alpha': [1e-2,1e-1,1],
        'learning_rate_init': [1e-1, 1, 10]}

    # hyper params search (grid search tries every combination)
    grid_search = GridSearchCV(
        clf, param_grid=param_dist,
        verbose=False, cv=5)
    
    if verbose:
        print("Starting grid search for neural network hyper parameters...")
    grid_search.fit(train_X, train_y)
    if verbose:
        print("Grid search complete")
        print("Grid search best score: ", grid_search.best_score_)
        print("Grid search gives best params as: ", grid_search.best_params_)

    return grid_search.best_params_


Now let's compare the models (Naive Bayes, logistic regression, and the neural network; the SVM is commented out in the code) and see which one does a better job.
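One detail that matters when models are compared on held-out data is how the features are standardized: the usual practice is to learn the mean and standard deviation from the training data only and reuse them for the test data. A minimal sketch on random data (the arrays `train_demo` and `test_demo` are synthetic placeholders):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
test_demo = rng.normal(loc=55.0, scale=10.0, size=(50, 3))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_demo)  # learn mean/std from the training data
test_scaled = scaler.transform(test_demo)        # reuse them; no refit on the test data

print(train_scaled.mean(axis=0))  # numerically zero by construction
print(test_scaled.mean(axis=0))   # generally not zero: the shift between the sets is kept
```

Refitting the scaler on the test set would hide any shift between the two sets and would use test-set statistics that are not available at prediction time.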

In [None]:
def model_contest(X, y):
    np.random.seed(100)

    # shuffle once, then carve out three subsets:
    #   para_*  : 5000 samples, used only for the hyper parameter search
    #   train_* : the last 10000 samples of the permutation
    #   test_*  : all remaining samples (this still includes the para_* samples)
    n = X.shape[0]
    P = np.random.permutation(n)
    para_X = X[P[:5000],:]
    para_y = y[P[:5000]]
    train_X = X[P[-10000:],:]
    train_y = y[P[-10000:]]
    test_X = X[P[:-10000],:]
    test_y = y[P[:-10000]]

    # standardize the features; the test set reuses the mean and variance
    # fitted on the training set instead of being scaled on its own
    para_X = StandardScaler().fit_transform(para_X)
    scaler = StandardScaler()
    train_X = scaler.fit_transform(train_X)
    test_X = scaler.transform(test_X)

    print(para_X.shape)
    print(para_y.shape)
    print(train_X.shape)
    print(train_y.shape)
    print(test_X.shape)
    print(test_y.shape)

    # Hyper parameter search for Logistic Regression
    lr_params = lr_search(para_X, para_y)
    # Hyper parameter search for SVM (disabled)
#     svm_params = svm_search(para_X, para_y)
    # Hyper parameter search for Neural Network
    nn_params = nn_search(para_X, para_y)

    # Initialize models (liblinear supports both values of dual)
    nb_model = GaussianNB()
    lr_model = LogisticRegression(C=lr_params["C"], dual = lr_params["dual"], solver='liblinear',
                                  max_iter=10000, random_state=0, tol=1e-4)
#     svm_model = svm.SVC(C=svm_params['C'], gamma=svm_params['gamma'], kernel=svm_params['kernel'], 
#                         random_state=0, tol=1e-4, max_iter=10000)
    nn_model = MLPClassifier(alpha=nn_params['alpha'], hidden_layer_sizes=nn_params['hidden_layer_sizes'],
                             learning_rate_init=nn_params['learning_rate_init'], 
                             solver='adam', activation='relu', max_iter=10000)
    # Train models
    nb_model.fit(train_X, train_y)
    lr_model.fit(train_X, train_y)
#     svm_model.fit(train_X, train_y)
    nn_model.fit(train_X, train_y)

    # Test accuracy of each model
    result = {'NB': nb_model.score(test_X, test_y), 'LR': lr_model.score(test_X, test_y), 
              'NN': nn_model.score(test_X, test_y)}
    return result

In [None]:
with open('X_data.pickle', 'rb') as f_x:
    X = pickle.load(f_x)
    print(X.shape)
    
with open('y_data.pickle', 'rb') as f_y:
    y = pickle.load(f_y)
    print(y.shape)

In [None]:
model_contest(X, y)

As we can see, ... has the highest accuracy among all models. The differences between the models are very small, which suggests that this is close to the best result we can get from these features: the information they carry is not enough to increase our accuracy further. Here we use a team's average recent performance to represent how good it is, which is a very rough estimate. Many more factors affect the result of a game, and we did not consider them.

To get a 

In [None]:
with open('X_ad3.pickle', 'rb') as f_x:
    X_ad = pickle.load(f_x)
    print(X_ad.shape)
    
with open('y_ad3.pickle', 'rb') as f_y:
    y_ad = pickle.load(f_y)
    print(y_ad.shape)

In [None]:
X_both = np.hstack((X,X_ad))
# X_both = X_both[~np.isnan(X_both).any(axis=1)]
X_both.shape
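Stacking two feature blocks with `np.hstack` assumes that row `i` of both blocks describes the same game, and any row filtering has to be applied to the labels as well. The sketch below shows one way to keep features and labels aligned when dropping rows with missing values; the helper name `stack_and_drop_nan` and the toy arrays are illustrative.

```python
import numpy as np

def stack_and_drop_nan(X_a, X_b, y):
    """Concatenate two feature blocks column-wise and drop rows containing NaN."""
    if not (X_a.shape[0] == X_b.shape[0] == y.shape[0]):
        raise ValueError("feature blocks and labels must have the same number of rows")
    X_both = np.hstack((X_a, X_b))
    keep = ~np.isnan(X_both).any(axis=1)
    # apply the same mask to y so that features and labels stay aligned
    return X_both[keep], y[keep]

# toy data: the second row has a missing value and is removed from X and y
X_a = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0]])
X_b = np.array([[10.0], [20.0], [30.0]])
y_toy = np.array([1.0, 0.0, 1.0])
X_clean, y_clean = stack_and_drop_nan(X_a, X_b, y_toy)
print(X_clean.shape, y_clean)
```

Filtering only the feature matrix would leave it shorter than the label vector, so the two would no longer describe the same games.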

In [None]:
model_contest(X_both, y_ad)