# 3 Predicting Sports Winners with Decision Trees

Loading the dataset

Using pandas to load the dataset

The pandas library is a library for loading, managing, and manipulating data.
It handles data structures behind-the-scenes and supports analysis methods,
such as computing the mean

In [85]:
import pandas as pd
results=pd.read_csv('leagues_NBA_2014_games_games.csv')
data_filename='leagues_NBA_2014_games_games.csv'
results.ix[:5]

Unnamed: 0,Date,Unnamed: 1,Visitor/Neutral,PTS,Home/Neutral,PTS.1,Unnamed: 6,Notes
0,Tue Oct 29 2013,Box Score,Orlando Magic,87,Indiana Pacers,97,,
1,Tue Oct 29 2013,Box Score,Los Angeles Clippers,103,Los Angeles Lakers,116,,
2,Tue Oct 29 2013,Box Score,Chicago Bulls,95,Miami Heat,107,,
3,Wed Oct 30 2013,Box Score,Brooklyn Nets,94,Cleveland Cavaliers,98,,
4,Wed Oct 30 2013,Box Score,Atlanta Hawks,109,Dallas Mavericks,118,,
5,Wed Oct 30 2013,Box Score,Washington Wizards,102,Detroit Pistons,113,,


Cleaning up the dataset

In [86]:
'''The pandas.read_csv function has parameters to fix each of these issues, which we
can specify when loading the file. We can also change the headings after loading the
file'''

'The pandas.read_csv function has parameters to fix each of these issues, which we\ncan specify when loading the file. We can also change the headings after loading the\nfile'

In [88]:
# Don't read the first row, as it is blank, and parse the date column as a date
results = pd.read_csv(data_filename, skiprows=[0,])
# Fix the name of the columns
results.columns = ["Date", "Score Type", "Visitor Team", "VisitorPts", 
                   "Home Team", "HomePts", "OT?", "Notes"]

results.ix[:5]

Unnamed: 0,Date,Score Type,Visitor Team,VisitorPts,Home Team,HomePts,OT?,Notes
0,Tue Oct 29 2013,Box Score,Los Angeles Clippers,103,Los Angeles Lakers,116,,
1,Tue Oct 29 2013,Box Score,Chicago Bulls,95,Miami Heat,107,,
2,Wed Oct 30 2013,Box Score,Brooklyn Nets,94,Cleveland Cavaliers,98,,
3,Wed Oct 30 2013,Box Score,Atlanta Hawks,109,Dallas Mavericks,118,,
4,Wed Oct 30 2013,Box Score,Washington Wizards,102,Detroit Pistons,113,,
5,Wed Oct 30 2013,Box Score,Los Angeles Lakers,94,Golden State Warriors,125,,


In [89]:
results["HomeWin"] = results["VisitorPts"] < results["HomePts"]
# Our "class values"
y_true = results["HomeWin"].values
results.ix[:5]

Unnamed: 0,Date,Score Type,Visitor Team,VisitorPts,Home Team,HomePts,OT?,Notes,HomeWin
0,Tue Oct 29 2013,Box Score,Los Angeles Clippers,103,Los Angeles Lakers,116,,,True
1,Tue Oct 29 2013,Box Score,Chicago Bulls,95,Miami Heat,107,,,False
2,Wed Oct 30 2013,Box Score,Brooklyn Nets,94,Cleveland Cavaliers,98,,,True
3,Wed Oct 30 2013,Box Score,Atlanta Hawks,109,Dallas Mavericks,118,,,True
4,Wed Oct 30 2013,Box Score,Washington Wizards,102,Detroit Pistons,113,,,True
5,Wed Oct 30 2013,Box Score,Los Angeles Lakers,94,Golden State Warriors,125,,,False


Extracting new features

In [90]:
import numpy as np
#Make sure HomeWin column was added in dataset
print("Home Win percentage: {0:.1f}%".
      format(100 * results["HomeWin"].sum() / results["HomeWin"].count())) 
dataset['HomeWin']=dataset['Visitor Team']<dataset['HomePts']

Home Win percentage: 47.0%


In [70]:
dataset["HomeLastWin"] = False
dataset["VisitorLastWin"] = False
# This creates two new columns, all set to False
dataset.ix[:5]

Unnamed: 0,Date,Score Type,Visitor Team,VisitorPts,Home Team,HomePts,OT?,Notes,HomeWin,HomeLastWin,VisitorLastWin
0,Tue Oct 29 2013,Box Score,Los Angeles Clippers,103,Los Angeles Lakers,116,,,False,False,False
1,Tue Oct 29 2013,Box Score,Chicago Bulls,95,Miami Heat,107,,,False,False,False
2,Wed Oct 30 2013,Box Score,Brooklyn Nets,94,Cleveland Cavaliers,98,,,False,False,False
3,Wed Oct 30 2013,Box Score,Atlanta Hawks,109,Dallas Mavericks,118,,,False,False,False
4,Wed Oct 30 2013,Box Score,Washington Wizards,102,Detroit Pistons,113,,,False,False,False
5,Wed Oct 30 2013,Box Score,Los Angeles Lakers,94,Golden State Warriors,125,,,False,False,False


In [91]:
# Now compute the actual values for these
# Did the home and visitor teams win their last game?
from collections import defaultdict
won_last = defaultdict(int)

for index, row in results.iterrows():  # Note that this is not efficient
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    row["HomeLastWin"] = won_last[home_team]
    row["VisitorLastWin"] = won_last[visitor_team]
    results.ix[index] = row    
    # Set current win
    won_last[home_team] = row["HomeWin"]
    won_last[visitor_team] = not row["HomeWin"]
results.ix[20:25]

Unnamed: 0,Date,Score Type,Visitor Team,VisitorPts,Home Team,HomePts,OT?,Notes,HomeWin
20,Fri Nov 1 2013,Box Score,Milwaukee Bucks,105,Boston Celtics,98,,,True
21,Fri Nov 1 2013,Box Score,Miami Heat,100,Brooklyn Nets,101,,,True
22,Fri Nov 1 2013,Box Score,Cleveland Cavaliers,84,Charlotte Bobcats,90,,,True
23,Fri Nov 1 2013,Box Score,Portland Trail Blazers,113,Denver Nuggets,98,,,True
24,Fri Nov 1 2013,Box Score,Dallas Mavericks,105,Houston Rockets,113,,,True
25,Fri Nov 1 2013,Box Score,San Antonio Spurs,91,Los Angeles Lakers,85,,,False


# Decision trees

Parameters in decision trees

Using decision trees

In [95]:
'''import the DecisionTreeClassifier class and create a decision tree using
scikit-learn:'''
from sklearn.tree import DecisionTreeClassifier
clf=DecisionTreeClassifier(random_state=14)#enalbe exmaine random_stete as any number

In [96]:
from sklearn.cross_validation import cross_val_score

# Create a dataset with just the neccessary information
X_previouswins = dataset[["HomeLastWin", "VisitorLastWin"]].values
clf = DecisionTreeClassifier(random_state=14)
scores = cross_val_score(clf, X_previouswins, y_true, scoring='accuracy')
print("Using just the last result from the home and visitor teams")
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))
scores = cross_val_score(clf, X_previouswins, y_true, scoring='accuracy')
print("Using just the last result from the home and visitor teams")
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))

Using just the last result from the home and visitor teams
Accuracy: 50.0%
Using just the last result from the home and visitor teams
Accuracy: 50.0%


Sports outcome prediction

In [104]:
# Let's try see which team is better on the ladder. Using the previous year's ladder
ladder_filename = ("leagues_NBA_2013_standings_expanded-standings.csv")
ladder = pd.read_csv(ladder_filename, skiprows=[0,1])
ladder.head()

Unnamed: 0,Rk,Team,Overall,Home,Road,E,W,A,C,SE,...,Post,≤3,≥10,Oct,Nov,Dec,Jan,Feb,Mar,Apr
0,1,Miami Heat,66-16,37-4,29-12,41-11,25-5,14-4,12-6,15-1,...,30-2,9-3,39-8,1-0,10-3,10-5,8-5,12-1,17-1,8-1
1,2,Oklahoma City Thunder,60-22,34-7,26-15,21-9,39-13,7-3,8-2,6-4,...,21-8,3-6,44-6,,13-4,11-2,11-5,7-4,12-5,6-2
2,3,San Antonio Spurs,58-24,35-6,23-18,25-5,33-19,8-2,9-1,8-2,...,16-12,9-5,31-10,1-0,12-4,12-4,12-3,8-3,10-4,3-6
3,4,Denver Nuggets,57-25,38-3,19-22,19-11,38-14,5-5,10-0,4-6,...,24-4,11-7,28-8,0-1,8-8,9-6,12-3,8-4,13-2,7-1
4,5,Los Angeles Clippers,56-26,32-9,24-17,21-9,35-17,7-3,8-2,6-4,...,17-9,3-5,38-12,1-0,8-6,16-0,9-7,8-5,7-7,7-1


In [102]:
# We can create a new feature -- HomeTeamRanksHighe
results["HomeTeamRanksHigher"] = 0
for index, row in results.iterrows():
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    if home_team == "New Orleans Pelicans":
        home_team = "New Orleans Hornets"
    elif visitor_team == "New Orleans Pelicans":
        visitor_team = "New Orleans Hornets"
    home_rank = ladder[ladder["Team"] == home_team]["Rk"].values[0]
    visitor_rank = ladder[ladder["Team"] == visitor_team]["Rk"].values[0]
    row["HomeTeamRanksHigher"] = int(home_rank > visitor_rank)
    results.ix[index] = row
results.head()

IndexError: index 0 is out of bounds for axis 0 with size 0