We will look at predicting the winner of games in the National
Basketball Association (NBA). The NBA is the major men's professional basketball league in North America and is widely considered the premier men's professional basketball league in the world. It has 30 teams (29 in the United States and 1 in Canada).

Various research into predicting the winner suggests that there may be an upper
limit to sports outcome prediction accuracy which, depending on the sport, is
between 70 percent and 80 percent accuracy.


The data comes from
https://www.basketball-reference.com/leagues/NBA_2017_games-october.html
and has been assembled into a CSV file.

In [4]:
import pandas as pd
dataset = pd.read_csv("NBA_2017_regularGames.csv",parse_dates=["Date"])

In [5]:
dataset.head(2)

Unnamed: 0,Date,Start (ET),Visitor/Neutral,PTS,Home/Neutral,PTS.1,Unnamed: 6,Unnamed: 7,Notes
0,2016-10-25,7:30 pm,New York Knicks,88,Cleveland Cavaliers,117,Box Score,,
1,2016-10-25,10:30 pm,San Antonio Spurs,129,Golden State Warriors,100,Box Score,,


In [6]:
#Rename the columns
dataset.columns = ["Date","Time","Visitor Team","Visitor Points","Home Team","Home Points","Score Type","OT?","Notes"]

In [7]:
dataset.head(2)

Unnamed: 0,Date,Time,Visitor Team,Visitor Points,Home Team,Home Points,Score Type,OT?,Notes
0,2016-10-25,7:30 pm,New York Knicks,88,Cleveland Cavaliers,117,Box Score,,
1,2016-10-25,10:30 pm,San Antonio Spurs,129,Golden State Warriors,100,Box Score,,


Now that we have our dataset, we can compute a baseline. A baseline is the accuracy
achieved by a simple, easy-to-implement strategy. Any data mining solution should
beat it.

In each match, we have two teams: a home team and a visitor team. An obvious
baseline, called the chance rate, is 50 percent. Choosing randomly will (over time)
result in an accuracy of 50 percent.
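
As a quick sanity check on the chance rate, the following sketch simulates random guessing with NumPy. The outcomes are synthetic (a hypothetical 60 percent home win rate chosen only for illustration), not the NBA data. Whatever the true rate is, coin-flip guesses should land near 50 percent.

```python
import numpy as np

rng = np.random.default_rng(0)
n_games = 100_000

# Synthetic outcomes: True means the home team won (hypothetical 60% rate).
outcomes = rng.random(n_games) < 0.6
# Random guessing: predict a home win with probability 0.5.
guesses = rng.random(n_games) < 0.5

chance_accuracy = np.mean(guesses == outcomes)
print("Chance accuracy: {0:.1f}%".format(100 * chance_accuracy))
```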

###### Prediction Class
We need to specify our class value, which will give
our classification algorithm something to compare against to see if its prediction
is correct or not. This could be encoded in a number of ways; however, for this
application, we will specify our class as 1 if the home team wins and 0 if the visitor
team wins. In basketball, the team with the most points wins. So, while the data set
doesn't specify who wins, we can compute it easily.

In [8]:
dataset["HomeWin"] = dataset["Visitor Points"] < dataset["Home Points"]

In [12]:
print("Home Win percentage: {0:.1f}%".format(100 * dataset["HomeWin"].sum() / dataset["HomeWin"].count()))

Home Win percentage: 58.4%


In [9]:
y_true = dataset["HomeWin"].values

In [10]:
#The array now holds our class values in a format that scikit-learn can read.
y_true

array([ True, False,  True, ...,  True, False,  True], dtype=bool)

##### Feature Engineering

The first two features we want to create to help us predict which team will win
are whether either of those two teams won their last game. This would roughly
approximate which team is playing well.

We will compute this feature by iterating through the rows in order and recording
which team won. When we get to a new row, we look up whether the team won the
last time we saw them.

Currently, this gives a false value to all teams (including the previous year's
champion!) when they are first seen.

In [13]:
dataset["HomeLastWin"] = False
dataset["VisitorLastWin"] = False
# This creates two new columns, all set to False
dataset.loc[:5]

Unnamed: 0,Date,Time,Visitor Team,Visitor Points,Home Team,Home Points,Score Type,OT?,Notes,HomeWin,HomeLastWin,VisitorLastWin
0,2016-10-25,7:30 pm,New York Knicks,88,Cleveland Cavaliers,117,Box Score,,,True,False,False
1,2016-10-25,10:30 pm,San Antonio Spurs,129,Golden State Warriors,100,Box Score,,,False,False,False
2,2016-10-25,10:00 pm,Utah Jazz,104,Portland Trail Blazers,113,Box Score,,,True,False,False
3,2016-10-26,7:30 pm,Brooklyn Nets,117,Boston Celtics,122,Box Score,,,True,False,False
4,2016-10-26,7:00 pm,Dallas Mavericks,121,Indiana Pacers,130,Box Score,OT,,True,False,False
5,2016-10-26,10:30 pm,Houston Rockets,114,Los Angeles Lakers,120,Box Score,,,True,False,False


In [14]:
# Now compute the actual values for these
# Did the home and visitor teams win their last game?
# We first create a (default) dictionary to store the team's last result:
from collections import defaultdict
won_last = defaultdict(bool)

The key of this dictionary will be the team and the value will be whether they won
their previous game. We can then iterate over all the rows and update the current
row with the team's last result. 

Note that the following code relies on our dataset being in chronological order. Our
dataset is in order; however, if you are using a dataset that is not in order, you will
need to replace dataset.iterrows() with dataset.sort_values("Date").iterrows().

In [15]:
for index, row in dataset.iterrows():
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    # Record each team's previous result on the current row
    dataset.at[index, "HomeLastWin"] = bool(won_last[home_team])
    dataset.at[index, "VisitorLastWin"] = bool(won_last[visitor_team])
    # We then update the dictionary with each team's result (from this row)
    # for the next time we see these teams.
    won_last[home_team] = row["HomeWin"]
    won_last[visitor_team] = not row["HomeWin"]
    

There isn't much point in
looking at the first five games though. Due to the way our code runs, we didn't have
data for them at that point. Therefore, until a team's second game of the season, we
won't know their current form. We can instead look at different places in the list.
The following code will show the 20th to the 25th games of the season:

In [16]:
dataset.loc[20:25]

Unnamed: 0,Date,Time,Visitor Team,Visitor Points,Home Team,Home Points,Score Type,OT?,Notes,HomeWin,HomeLastWin,VisitorLastWin
20,2016-10-28,8:00 pm,Charlotte Hornets,97,Miami Heat,91,Box Score,,,False,True,True
21,2016-10-28,9:30 pm,Golden State Warriors,122,New Orleans Pelicans,114,Box Score,,,False,False,False
22,2016-10-28,8:00 pm,Phoenix Suns,110,Oklahoma City Thunder,113,Box Score,OT,,True,True,False
23,2016-10-28,7:00 pm,Cleveland Cavaliers,94,Toronto Raptors,91,Box Score,,,False,True,True
24,2016-10-28,9:00 pm,Los Angeles Lakers,89,Utah Jazz,96,Box Score,,,True,False,True
25,2016-10-29,8:00 pm,Indiana Pacers,101,Chicago Bulls,118,Box Score,,,True,True,False
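
The row-by-row loop used to build HomeLastWin and VisitorLastWin is easy to follow but slow on larger datasets. A vectorized alternative is sketched below on a hypothetical four-game schedule (the `toy_games` frame and its team names are made up for illustration). It assumes the rows are already in chronological order and that a team's first game counts as "did not win last time", matching the loop's behaviour.

```python
import pandas as pd

# Hypothetical toy schedule, already in chronological order.
toy_games = pd.DataFrame({
    "Home Team": ["A", "C", "B", "A"],
    "Visitor Team": ["B", "A", "C", "C"],
    "HomeWin": [True, False, True, False],
})

# One row per (game, team), recording whether that team won the game.
home = pd.DataFrame({
    "game": toy_games.index.to_numpy(),
    "team": toy_games["Home Team"].to_numpy(),
    "won": toy_games["HomeWin"].to_numpy().astype(int),
    "side": "home",
})
visitor = pd.DataFrame({
    "game": toy_games.index.to_numpy(),
    "team": toy_games["Visitor Team"].to_numpy(),
    "won": (~toy_games["HomeWin"]).to_numpy().astype(int),
    "side": "visitor",
})
long = pd.concat([home, visitor], ignore_index=True)
long = long.sort_values("game", kind="stable")

# Within each team, shift results down by one game; first appearances get 0.
long["won_last"] = (
    long.groupby("team")["won"].shift(1, fill_value=0).astype(bool)
)

# Map the per-team values back onto the original game rows.
is_home = long["side"] == "home"
toy_games["HomeLastWin"] = long[is_home].set_index("game")["won_last"]
toy_games["VisitorLastWin"] = long[~is_home].set_index("game")["won_last"]
print(toy_games)
```

The same idea should carry over to the real dataset by swapping `toy_games` for `dataset`, though that has not been checked against the notebook's output here.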


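The scikit-learn package implements an optimized version of the CART (Classification
and Regression Trees) algorithm as its default decision tree class. CART can use both
categorical and continuous features, although scikit-learn's implementation expects
numeric input, so categorical features must be encoded as numbers first.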

The decision tree implementation in scikit-learn provides a method to stop the
building of a tree using the following options:

    • min_samples_split: This specifies how many samples are needed in order to split a node, that is, to create a new decision node in the tree

    • min_samples_leaf: This specifies how many samples must end up in each node resulting from a split for that split to be kept

The first dictates whether a decision node will be created, while the second dictates whether a decision node will be kept.
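
To see the two stopping options in action, here is a minimal sketch on synthetic data from `make_classification` (not the NBA dataset). The values 20 and 10 are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration only.
X_demo, y_demo = make_classification(n_samples=200, n_features=4,
                                     random_state=14)

# A tree grown without limits, and one with both stopping options set.
unrestricted = DecisionTreeClassifier(random_state=14).fit(X_demo, y_demo)
restricted = DecisionTreeClassifier(min_samples_split=20,
                                    min_samples_leaf=10,
                                    random_state=14).fit(X_demo, y_demo)

print("Leaves without limits:", unrestricted.get_n_leaves())
print("Leaves with limits:", restricted.get_n_leaves())

# Leaves are the nodes with no children (children_left == -1).
is_leaf = restricted.tree_.children_left == -1
smallest_leaf = restricted.tree_.n_node_samples[is_leaf].min()
print("Smallest leaf in the restricted tree:", smallest_leaf)
```

With `min_samples_leaf=10`, no leaf of the restricted tree can hold fewer than 10 training samples, which is what the last line checks.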

Another parameter for decision trees is the criterion for creating a decision.
Gini impurity and information gain are two popular ones:

    • Gini impurity: This is a measure of how often a decision node would incorrectly predict a sample's class

    • Information gain: This uses information-theory-based entropy to indicate how much extra information is gained by the decision node
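
Both criteria can be computed by hand for a small example. The sketch below uses the standard textbook definitions (Gini impurity as one minus the sum of squared class proportions, entropy in bits). The helper names are hypothetical and this is not scikit-learn's internal code.

```python
import numpy as np

def class_proportions(labels):
    """Return the fraction of samples belonging to each class."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def gini_impurity(labels):
    p = class_proportions(labels)
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    p = class_proportions(labels)
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    """Entropy of the parent minus the weighted entropy of its children."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children

# A node with five home wins and five visitor wins, split perfectly in two.
parent = np.array([True] * 5 + [False] * 5)
left, right = parent[:5], parent[5:]

print("Gini impurity of parent:", gini_impurity(parent))
print("Entropy of parent:", entropy(parent))
print("Information gain of split:", information_gain(parent, left, right))
```

In scikit-learn these correspond to `criterion="gini"` (the default) and `criterion="entropy"` on `DecisionTreeClassifier`.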


In [17]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=14)

We now need to extract the dataset from our pandas data frame in order to use
it with our scikit-learn classifier. We do this by specifying the columns we
wish to use and using the values attribute of a view of the data frame. The
following code creates a dataset using our last win values for both the home
team and the visitor team:

In [18]:
X_previouswins = dataset[["HomeLastWin", "VisitorLastWin"]].values

In [21]:
import numpy as np
from sklearn.model_selection import cross_val_score

clf = DecisionTreeClassifier(random_state=14)
scores = cross_val_score(clf, X_previouswins, y_true, scoring='accuracy')
print("Using just the last result from the home and visitor teams")
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))

Using just the last result from the home and visitor teams
Accuracy: 58.4%


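This scores 58.4%, which is better than choosing randomly. However, it is identical
to the home win percentage computed earlier, so these two features do no better than
always predicting a home win. We should be able to do better.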
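
One way to put a cross-validated score in context is to run a trivial model through the same procedure. The sketch below uses scikit-learn's `DummyClassifier` with `strategy="most_frequent"`, which always predicts the majority class. The labels are synthetic (drawn with a 58.4 percent rate borrowed from the home win percentage above) and the feature matrix is a placeholder, so the printed number is illustrative rather than a result on the NBA data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(14)

# Synthetic labels: True means a home win (hypothetical 58.4% rate).
y_synthetic = rng.random(1000) < 0.584
# The dummy model ignores its inputs, so a column of zeros is enough.
X_placeholder = np.zeros((len(y_synthetic), 1))

majority_baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(majority_baseline, X_placeholder,
                                  y_synthetic, scoring="accuracy")
print("Majority-class baseline: {0:.1f}%".format(
    np.mean(baseline_scores) * 100))
```

Passing `X_previouswins` and `y_true` instead of the synthetic arrays would give the corresponding baseline for the real dataset.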

###### More Feature Engineering

We will try to answer the following
questions:

    • Which team is considered better generally?
    • Which team won their last encounter?

We will also try putting the raw teams into the algorithm to check whether the
algorithm can learn a model that checks how different teams play against each other.
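
Team names are strings, so they have to be encoded as numbers before a scikit-learn tree can use them. One common option, shown here as a sketch on a hypothetical four-game schedule, is one-hot encoding with `OneHotEncoder`. This is an assumption about how the raw teams could be fed in, not necessarily the encoding used later.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy schedule with made-up team names.
toy_schedule = pd.DataFrame({
    "Home Team": ["A", "C", "B", "A"],
    "Visitor Team": ["B", "A", "C", "C"],
})

# Each distinct value in each column becomes its own 0/1 feature.
encoder = OneHotEncoder()
X_teams = encoder.fit_transform(
    toy_schedule[["Home Team", "Visitor Team"]]
).toarray()

print(X_teams.shape)  # one column per (column, team) pair
print(X_teams)
```

Every row has exactly two ones: one marking the home team and one marking the visitor team.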

In [25]:
# What about win streaks?
dataset["HomeWinStreak"] = 0
dataset["VisitorWinStreak"] = 0
# How many games in a row has each team won coming into this game?
from collections import defaultdict
win_streak = defaultdict(int)

for index, row in dataset.iterrows():  # Note that this is not efficient
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    dataset.at[index, "HomeWinStreak"] = win_streak[home_team]
    dataset.at[index, "VisitorWinStreak"] = win_streak[visitor_team]
    # Update the streaks with the result of this game
    if row["HomeWin"]:
        win_streak[home_team] += 1
        win_streak[visitor_team] = 0
    else:
        win_streak[home_team] = 0
        win_streak[visitor_team] += 1

In [28]:
dataset.loc[100:105]

Unnamed: 0,Date,Time,Visitor Team,Visitor Points,Home Team,Home Points,Score Type,OT?,Notes,HomeWin,HomeLastWin,VisitorLastWin,HomeWinStreak,VisitorWinStreak
100,2016-11-08,7:00 pm,Atlanta Hawks,110,Cleveland Cavaliers,106,Box Score,,,False,True,True,6,1
101,2016-11-08,10:30 pm,Dallas Mavericks,109,Los Angeles Lakers,97,Box Score,,,False,True,True,3,1
102,2016-11-08,8:00 pm,Denver Nuggets,107,Memphis Grizzlies,108,Box Score,,,True,False,True,0,1
103,2016-11-08,10:00 pm,Phoenix Suns,121,Portland Trail Blazers,124,Box Score,,,True,True,False,2,0
104,2016-11-08,10:30 pm,New Orleans Pelicans,94,Sacramento Kings,102,Box Score,,,True,True,False,1,0
105,2016-11-09,7:30 pm,Chicago Bulls,107,Atlanta Hawks,115,Box Score,,,True,True,True,2,1


In [29]:
clf = DecisionTreeClassifier(random_state=14)
X_winstreak = dataset[["HomeLastWin", "VisitorLastWin", "HomeWinStreak", "VisitorWinStreak"]].values
scores = cross_val_score(clf, X_winstreak, y_true, scoring='accuracy')
print("Using the last result and win streak of the home and visitor teams")
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))

Using the last result and win streak of the home and visitor teams
Accuracy: 56.3%


In [39]:
# Let's see which team is better on the ladder, using the previous year's ladder:
# https://www.basketball-reference.com/leagues/NBA_2016_standings.html
standing = pd.read_csv("ExapandedStanding.csv")
standing = standing.set_index('Rk')