# ENGG 192 
### Dartmouth College - January 7, 2019
### Spencer Bertsch
This program uses both TensorFlow and PyTorch to create dense, fully connected neural networks designed to predict the results of NBA games. 

The data source for this project can be found [here](https://www.kaggle.com/ionaskel/nba-games-stats-from-2014-to-2018#nba.games.stats.csv). 



In [89]:
# Imports 
import tensorflow as tf
import numpy as np 
import pandas as pd
import keras

#Plotting
%matplotlib inline 
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

#Data cleaning and preprocessing
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import LabelBinarizer


## Load the Dataset 

Our dataset is a .csv file downloaded from Kaggle. In a more developed project we could improve the way that data is loaded into this notebook, but for now we can simply use the pandas function `read_csv` to import the .csv into a pandas dataframe. 

In [23]:
# Import the dataset - load .csv file
path = '/Users/spencerbertsch/Desktop/ENGS 192/Dense_NNs/Data/nba_raw_data.csv'
#create a dataframe from the NBA data stored in the .csv file
NBA_data_raw = pd.read_csv(path)

print("Shape of raw dataset:", NBA_data_raw.shape)
print("So we can see that we have 40 features and one label vector")
print("We can also see that we have 9,840 rows")
print(" ")
print("We can also preview our dataset by observing the top five rows")
NBA_data_raw.head(5)

Shape of raw dataset: (9840, 41)
So we can see that we have 40 features and one label vector
We can also see that we have 9,840 rows
 
We can also preview our dataset by observing the top five rows


Unnamed: 0.1,Unnamed: 0,Team,Game,Date,Home,Opponent,WINorLOSS,TeamPoints,OpponentPoints,FieldGoals,...,Opp.FreeThrows,Opp.FreeThrowsAttempted,Opp.FreeThrows.,Opp.OffRebounds,Opp.TotalRebounds,Opp.Assists,Opp.Steals,Opp.Blocks,Opp.Turnovers,Opp.TotalFouls
0,1,ATL,1,10/29/14,Away,TOR,L,102,109,40,...,27,33,0.818,16,48,26,13,9,9,22
1,2,ATL,2,11/1/14,Home,IND,W,102,92,35,...,18,21,0.857,11,44,25,5,5,18,26
2,3,ATL,3,11/5/14,Away,SAS,L,92,94,38,...,27,38,0.711,11,50,25,7,9,19,15
3,4,ATL,4,11/7/14,Away,CHO,L,119,122,43,...,20,27,0.741,11,51,31,6,7,19,30
4,5,ATL,5,11/8/14,Home,NYK,W,103,96,33,...,8,11,0.727,13,44,26,2,6,15,29


In [25]:
print(" ")
print("We can then make a list from the column headers and observe all of the information that our dataset has aggregated for each game")
cols = list(NBA_data_raw.columns.values) #Make a list of all of the columns in the df
cols

 
We can then make a list from the column headers and observe all of the information that our dataset has aggregated for each game


['Unnamed: 0',
 'Team',
 'Game',
 'Date',
 'Home',
 'Opponent',
 'WINorLOSS',
 'TeamPoints',
 'OpponentPoints',
 'FieldGoals',
 'FieldGoalsAttempted',
 'FieldGoals.',
 'X3PointShots',
 'X3PointShotsAttempted',
 'X3PointShots.',
 'FreeThrows',
 'FreeThrowsAttempted',
 'FreeThrows.',
 'OffRebounds',
 'TotalRebounds',
 'Assists',
 'Steals',
 'Blocks',
 'Turnovers',
 'TotalFouls',
 'Opp.FieldGoals',
 'Opp.FieldGoalsAttempted',
 'Opp.FieldGoals.',
 'Opp.3PointShots',
 'Opp.3PointShotsAttempted',
 'Opp.3PointShots.',
 'Opp.FreeThrows',
 'Opp.FreeThrowsAttempted',
 'Opp.FreeThrows.',
 'Opp.OffRebounds',
 'Opp.TotalRebounds',
 'Opp.Assists',
 'Opp.Steals',
 'Opp.Blocks',
 'Opp.Turnovers',
 'Opp.TotalFouls']
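
The `Unnamed: 0` entry at the top of this list is the name pandas gives a column whose header cell in the .csv is empty. A minimal sketch of how such a column could be read as the row index instead, assuming it is only a saved row counter (the preview values 1, 2, 3, ... suggest this, but the source does not confirm it). The small in-memory CSV below is made up for illustration:

```python
import io

import pandas as pd

# Toy CSV (hypothetical data) whose first header cell is empty,
# mimicking a file that was saved together with its row index.
csv_text = ",Team,Game,WINorLOSS\n1,ATL,1,L\n2,ATL,2,W\n"

# Default behavior: the unnamed first column becomes a regular
# column called 'Unnamed: 0'.
with_index_column = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells pandas to use the first column as the row index,
# so it no longer appears among the features.
without_index_column = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_index_column.columns))
print(list(without_index_column.columns))
```

Reading the file this way would keep a meaningless counter from being treated as a feature later on.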

## Data Cleaning

We can see that our target vector (Win or Loss) is the seventh column from the left, so the first thing we need to do is move that vector to the end (right side) of the dataframe. 

The next thing we can see is that several of these feature vectors, in addition to our label vector, contain categorical data in the form of strings. The general way to deal with categorical data is to encode it as numeric data, which will allow our dense neural network to learn from it. 

After our features are separated from the label and we have purely numeric data, we can scale our dataset and we will be ready for training! 

In [120]:
#move label vector to the end of the dataframe 

cols = list(NBA_data_raw.columns.values) #Make a list of all of the columns in the df
cols.pop(cols.index('WINorLOSS')) #Remove WINorLOSS from list
df = NBA_data_raw[cols+['WINorLOSS']] #Create new dataframe with columns in the order you want
df.head(5) 

Unnamed: 0.1,Unnamed: 0,Team,Game,Date,Home,Opponent,TeamPoints,OpponentPoints,FieldGoals,FieldGoalsAttempted,...,Opp.FreeThrowsAttempted,Opp.FreeThrows.,Opp.OffRebounds,Opp.TotalRebounds,Opp.Assists,Opp.Steals,Opp.Blocks,Opp.Turnovers,Opp.TotalFouls,WINorLOSS
0,1,ATL,1,10/29/14,Away,TOR,102,109,40,80,...,33,0.818,16,48,26,13,9,9,22,L
1,2,ATL,2,11/1/14,Home,IND,102,92,35,69,...,21,0.857,11,44,25,5,5,18,26,W
2,3,ATL,3,11/5/14,Away,SAS,92,94,38,92,...,38,0.711,11,50,25,7,9,19,15,L
3,4,ATL,4,11/7/14,Away,CHO,119,122,43,93,...,27,0.741,11,51,31,6,7,19,30,L
4,5,ATL,5,11/8/14,Home,NYK,103,96,33,81,...,11,0.727,13,44,26,2,6,15,29,W


### Continued data cleaning

We can see that our label vector has been moved to the end of the dataframe. We can now encode our categorical data so that we have a purely numeric dataset. 

In [121]:
cols = df.shape[1]
rows = df.shape[0]

for i in range(0, cols, 1):
    
    values = df.iloc[:,i]
    
    #integer encode each column
    label_encoder = LabelEncoder()
    integer_encoded = label_encoder.fit_transform(values)
   
    encoded_vec = pd.DataFrame(integer_encoded) #convert to pd.dataframe
    
    df = pd.concat([df, encoded_vec], axis=1)
    

In [122]:
cols

41

In [136]:

final_df = df.iloc[:,-cols:]
print(final_df.shape)

final_df

(9840, 41)


Unnamed: 0,0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,...,0.10,0.11,0.12,0.13,0.14,0.15,0.16,0.17,0.18,0.19
0,0,0,0,143,0,27,37,44,21,19,...,32,210,16,27,20,13,9,7,15,0
1,1,0,1,155,1,11,37,27,16,8,...,20,231,11,23,19,5,5,16,19,1
2,2,0,2,251,0,26,27,29,19,31,...,37,145,11,29,19,7,9,17,8,0
3,3,0,3,259,0,4,54,57,24,32,...,26,163,11,30,25,6,7,17,23,0
4,4,0,4,263,1,19,38,31,14,20,...,10,154,13,23,20,2,6,13,22,1
5,5,0,5,159,0,19,26,20,8,10,...,11,68,11,19,17,4,2,13,19,1
6,6,0,6,167,1,28,35,32,20,15,...,11,38,8,9,22,12,8,9,10,1
7,7,0,7,175,1,15,49,38,23,14,...,24,267,5,11,21,10,3,12,13,1
8,8,0,8,179,0,5,29,62,21,29,...,12,180,12,26,33,12,2,11,7,0
9,9,0,9,191,1,13,44,49,22,24,...,21,101,13,23,18,7,0,9,17,0
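
Note that the loop above runs `LabelEncoder` over every column, including the ones that were already numeric. For those columns each value is replaced by its rank among the column's unique values, which is why the first row's `TeamPoints` of 102 shows up as 37 in the table. A minimal sketch of an alternative, assuming we only want to encode the string columns and leave numeric columns untouched (the toy dataframe below is made up for illustration, using column names from the dataset):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy sample (hypothetical values) with the same kinds of columns
# as the NBA dataset: strings and integers mixed together.
sample = pd.DataFrame({
    "Team": ["ATL", "ATL", "TOR"],
    "Home": ["Away", "Home", "Home"],
    "TeamPoints": [102, 92, 109],
    "WINorLOSS": ["L", "W", "W"],
})

encoded = sample.copy()
for name in encoded.columns:
    # Only integer-encode columns that are not already numeric
    if not pd.api.types.is_numeric_dtype(encoded[name]):
        encoded[name] = LabelEncoder().fit_transform(encoded[name])

print(encoded)
```

With this approach `TeamPoints` keeps its original values, while `Home` and `WINorLOSS` become 0/1 columns as before.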


In [133]:
#define a function to label encode a 2D (x,y) matrix and return an encoded matrix
def int_encode(df):
    original_size = df.shape[1] #number of columns in the input
    
    for i in range(original_size):
    
        values = df.iloc[:,i]

        #integer encode each column
        label_encoder = LabelEncoder()
        integer_encoded = label_encoder.fit_transform(values)

        encoded_vec = pd.DataFrame(integer_encoded) #convert to pd.dataframe

        df = pd.concat([df, encoded_vec], axis=1)
    

    #keep only the encoded columns appended above; the result is exactly as wide
    #as the input, so an 82-column result means the input already had 82 columns
    final_df = df.iloc[:,-original_size:]
    print(final_df.shape)

    return final_df


In [134]:
cols

41

In [135]:
XX = int_encode(input_df)

XX

(9840, 82)


Unnamed: 0,0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,...,0.10,0.11,0.12,0.13,0.14,0.15,0.16,0.17,0.18,0.19
0,0,0,0,143,0,27,37,44,21,19,...,32,210,16,27,20,13,9,7,15,0
1,1,0,1,155,1,11,37,27,16,8,...,20,231,11,23,19,5,5,16,19,1
2,2,0,2,251,0,26,27,29,19,31,...,37,145,11,29,19,7,9,17,8,0
3,3,0,3,259,0,4,54,57,24,32,...,26,163,11,30,25,6,7,17,23,0
4,4,0,4,263,1,19,38,31,14,20,...,10,154,13,23,20,2,6,13,22,1
5,5,0,5,159,0,19,26,20,8,10,...,11,68,11,19,17,4,2,13,19,1
6,6,0,6,167,1,28,35,32,20,15,...,11,38,8,9,22,12,8,9,10,1
7,7,0,7,175,1,15,49,38,23,14,...,24,267,5,11,21,10,3,12,13,1
8,8,0,8,179,0,5,29,62,21,29,...,12,180,12,26,33,12,2,11,7,0
9,9,0,9,191,1,13,44,49,22,24,...,21,101,13,23,18,7,0,9,17,0
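
Before scaling, the features still need to be separated from the label. A minimal sketch, assuming the label is the last column of the encoded dataframe (as arranged earlier) and using a tiny made-up matrix in place of the real data:

```python
import pandas as pd

# Toy stand-in (hypothetical values) for the encoded dataframe:
# three feature columns followed by the 0/1 label in the last column.
encoded = pd.DataFrame([
    [0, 37, 44, 0],
    [1, 37, 27, 1],
    [0, 27, 29, 0],
])

# Everything except the last column is a feature
X = encoded.iloc[:, :-1].to_numpy()

# The last column is the label vector (1 = win, 0 = loss)
y = encoded.iloc[:, -1].to_numpy()

print(X.shape, y.shape)
```

Keeping the label out of `X` matters here because scaling should only be applied to the features.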


The last step in preprocessing is to scale our dataset. 

In [137]:
#Scaling
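
The scaling cell above is empty in this notebook. A minimal sketch of one common choice, assuming standardization with scikit-learn's `StandardScaler` (the notebook does not say which scaler was intended) and a small made-up feature matrix:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (hypothetical values): rows are games,
# columns are two numeric features on different scales.
X = np.array([
    [102.0, 0.818],
    [102.0, 0.857],
    [92.0, 0.711],
    [119.0, 0.741],
])

# Standardize each column to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```

In a full pipeline we would probably fit the scaler on the training split only and reuse it with `scaler.transform` on the test split, so that no test statistics leak into training.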

# Deep Learning - TensorFlow 

Before we jump to our feet and run to the sports betting websites, we have to realize that the features in our dataset contain data from the games that we are trying to predict. In other words, we would have to wait through the entire game to collect and aggregate the data from the game before cleaning it and feeding it to our dense, fully connected network. 

Unless we could do this very efficiently (essentially instantly), there is little use for this program in terms of actual sports betting. Still, it was a good way to practice using several deep learning libraries running on different operating systems. It's important to remember that although libraries such as TensorFlow have reached a relatively mature stage, the APIs that are often used on top of them can still be underdeveloped. 
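
One way around the limitation described above would be to build each game's features only from games that were played earlier. A minimal sketch, assuming a running average of a team's points over its previous games. `Team`, `Game`, and `TeamPoints` are columns from the dataset; the `PriorAvgPoints` name and the toy values are made up for illustration, and this is not something the notebook itself does:

```python
import pandas as pd

# Toy game log (hypothetical values), one row per team per game
games = pd.DataFrame({
    "Team": ["ATL", "ATL", "ATL", "TOR", "TOR"],
    "Game": [1, 2, 3, 1, 2],
    "TeamPoints": [102, 100, 92, 109, 100],
})

# Make sure each team's games are in chronological order
games = games.sort_values(["Team", "Game"]).reset_index(drop=True)

# shift(1) drops the current game from the window, so each row only
# sees results that were known before tip-off; the first game of each
# team has no history and is left as NaN.
games["PriorAvgPoints"] = (
    games.groupby("Team")["TeamPoints"]
    .transform(lambda s: s.shift(1).expanding().mean())
)

print(games)
```

Features built this way are available before the game starts, which is what a real prediction setting would need.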