# NBA Isolation Forests

I was curious how outlier detection methods from machine learning would do when applied to NBA rate stats using data from Basketball Reference.

The method we're going to use is called an isolation forest. It measures how easy a point is to separate from the rest of the data using random splits: points that take fewer splits to isolate are more likely to be outliers. Our training set is the 2013 to 2016 seasons, and we test on the 2017 data.
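
To make the "easy to separate" idea concrete, here is a minimal pure-Python sketch of the isolation step. It is a simplification, not scikit-learn's implementation (no subsampling, no score normalisation), and the helper names `isolation_depth` and `mean_isolation_depth` as well as the toy data are made up for illustration.

```python
import random

def isolation_depth(point, data, rng, max_depth=50):
    """Count random axis-aligned splits until `point` sits alone."""
    remaining = list(data)
    depth = 0
    while len(remaining) > 1 and depth < max_depth:
        dim = rng.randrange(len(point))          # pick a random feature
        lo = min(row[dim] for row in remaining)
        hi = max(row[dim] for row in remaining)
        depth += 1
        if lo == hi:                             # nothing to split on this feature
            continue
        split = rng.uniform(lo, hi)              # pick a random cut point
        # Keep only the side of the cut that contains `point`.
        if point[dim] < split:
            remaining = [row for row in remaining if row[dim] < split]
        else:
            remaining = [row for row in remaining if row[dim] >= split]
    return depth

def mean_isolation_depth(point, data, n_trees=200, seed=0):
    """Average the isolation depth over many random 'trees'."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(n_trees)) / n_trees

# Toy data (hypothetical): a tight cluster plus one far-away point.
gen = random.Random(42)
cluster = [(gen.gauss(0, 1), gen.gauss(0, 1)) for _ in range(200)]
outlier = (8.0, 8.0)
data = cluster + [outlier]
typical = min(cluster, key=lambda p: p[0] ** 2 + p[1] ** 2)  # point nearest the centre

outlier_depth = mean_isolation_depth(outlier, data)
typical_depth = mean_isolation_depth(typical, data)
print(outlier_depth, typical_depth)
```

The far-away point should need noticeably fewer splits on average than the point in the middle of the cluster, which is the signal an isolation forest turns into an anomaly score.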

In [1]:
from requests import get
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

In [2]:
columns = [
    'Player','Pos','Age','Tm','G','GS','MP',
    'FG','FGA','FG%','3P','3PA','3P%','2P','2PA','2P%',
    'FT','FTA','FT%','ORB','DRB','TRB','AST','STL','BLK',
    'TOV','PF','PTS','Year'
]

In [3]:
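per36 = []
for year in range(2013, 2018):
    url = "http://www.basketball-reference.com/leagues/NBA_{}_per_minute.html".format(year)
    soup = BeautifulSoup(get(url).text, 'lxml')
    # One list of cell texts per table row (first 28 cells); the first row is skipped.
    rows = [[cell.text for cell in row.find_all('td')][:28] for row in soup.find_all('tr')][1:]
    # Drop rows without <td> cells (repeated header rows) and tag each row with its season.
    per36 += [row + [year] for row in rows if row]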

In [4]:
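per100 = []
for year in range(2013, 2018):
    url = "http://www.basketball-reference.com/leagues/NBA_{}_per_poss.html".format(year)
    soup = BeautifulSoup(get(url).text, 'lxml')
    # One list of cell texts per table row (first 28 cells); the first row is skipped.
    rows = [[cell.text for cell in row.find_all('td')][:28] for row in soup.find_all('tr')][1:]
    # Drop rows without <td> cells (repeated header rows) and tag each row with its season.
    per100 += [row + [year] for row in rows if row]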

In [13]:
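pergame = []
for year in range(2013, 2018):
    url = "http://www.basketball-reference.com/leagues/NBA_{}_per_game.html".format(year)
    soup = BeautifulSoup(get(url).text, 'lxml')
    # Caution: the per-game table has an eFG% cell after 2P% that `columns` does not list,
    # so with [:28] the labels from FT onward are shifted by one and PTS is cut off.
    rows = [[cell.text for cell in row.find_all('td')][:28] for row in soup.find_all('tr')][1:]
    pergame += [row + [year] for row in rows if row]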

In [17]:
# convert_objects was removed in pandas 1.0, so coerce every non-text column with pd.to_numeric instead.
text_cols = ['Player', 'Pos', 'Tm']
def to_frame(rows):
    df = pd.DataFrame(rows, columns=columns)
    return df.apply(lambda s: s if s.name in text_cols else pd.to_numeric(s, errors='coerce'))
pd36, pd100, pdgame = to_frame(per36), to_frame(per100), to_frame(pergame)


In [18]:
# Keep players with more than 10 games, then drop identifiers and playing-time columns
# so that only age and the rate stats remain as features.
# Train on seasons up to 2016, test on 2017.
drop_cols = ['Player', 'Pos', 'Tm', 'G', 'GS', 'MP', 'Year']

train36 = pd36[(pd36['Year'] <= 2016) & (pd36['G'] > 10)].drop(columns=drop_cols)
test36 = pd36[(pd36['Year'] > 2016) & (pd36['G'] > 10)].drop(columns=drop_cols)
test36.head()

train100 = pd100[(pd100['Year'] <= 2016) & (pd100['G'] > 10)].drop(columns=drop_cols)
test100 = pd100[(pd100['Year'] > 2016) & (pd100['G'] > 10)].drop(columns=drop_cols)
test100.head()

traingame = pdgame[(pdgame['Year'] <= 2016) & (pdgame['G'] > 10)].drop(columns=drop_cols)
testgame = pdgame[(pdgame['Year'] > 2016) & (pdgame['G'] > 10)].drop(columns=drop_cols)
testgame.head()

Unnamed: 0,Age,FG,FGA,FG%,3P,3PA,3P%,2P,2PA,2P%,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
2413,23,1.3,3.8,0.342,0.7,2.7,0.264,0.6,1.2,0.522,...,0.6,0.909,0.2,0.8,0.9,0.4,0.3,0.1,0.7,0.9
2415,23,4.7,8.0,0.585,0.0,0.0,0.0,4.7,8.0,0.588,...,3.3,0.742,2.9,4.7,7.5,0.7,1.3,0.9,2.0,2.3
2416,31,2.6,6.4,0.405,0.6,2.0,0.311,2.0,4.5,0.447,...,1.6,0.917,0.2,2.0,2.2,1.0,0.3,0.0,0.2,1.5
2417,28,2.2,4.6,0.481,0.0,0.2,0.0,2.2,4.5,0.5,...,0.7,0.75,1.1,3.4,4.5,0.3,0.4,0.7,0.9,1.9
2418,28,1.2,2.2,0.556,0.0,0.0,,1.2,2.2,0.556,...,0.6,0.688,1.3,2.6,4.0,0.6,0.6,0.6,0.3,1.9


The isolation forest has a tunable parameter called contamination: the expected fraction of outliers in the data, which sets the threshold for flagging a point. Smaller values flag fewer points as outliers; larger values flag more.
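
As a rough illustration of how contamination changes the number of flagged points, the sketch below fits the same forest at three settings on synthetic data. The random matrix `X` is a hypothetical stand-in for the stat tables, and `random_state=0` is added only so the three forests are identical apart from the threshold.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
# Hypothetical stand-in for a stat table: 500 rows, 4 numeric features.
X = rng.normal(size=(500, 4))

flagged = {}
for contamination in (0.001, 0.01, 0.1):
    model = IsolationForest(contamination=contamination, random_state=0).fit(X)
    # predict returns -1 for points flagged as outliers and 1 for inliers.
    flagged[contamination] = int((model.predict(X) == -1).sum())
    print(contamination, flagged[contamination])
```

The count of flagged training rows should grow roughly in proportion to contamination, since the parameter only moves the cutoff applied to the anomaly scores.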

With a small value (0.001), Westbrook is the only player flagged in all three views: per 36 minutes, per 100 possessions, and per game.

In [23]:
g36 = IsolationForest(contamination=0.001).fit(train36.fillna(0))
g100 = IsolationForest(contamination=0.001).fit(train100.fillna(0))
gg = IsolationForest(contamination=0.001).fit(traingame.fillna(0))

In [24]:
pd36[(pd36['Year'] > 2016) & (pd36['G'] > 10)][g36.predict(test36.fillna(0)) < 0]

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,...,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Year
2506,Anthony Davis,PF,23,NOP,27,27,1011,10.0,20.4,0.493,...,1.7,8.8,10.5,2.1,1.5,2.7,2.4,2.0,28.4,2017
2529,Joel Embiid,C,22,PHI,18,18,436,9.2,19.7,0.471,...,3.0,8.1,11.1,2.7,1.0,3.7,5.4,4.9,27.2,2017
2835,Russell Westbrook,PG,28,OKC,27,27,949,10.3,24.4,0.423,...,2.1,8.6,10.8,11.3,1.4,0.3,5.7,2.5,31.1,2017


In [25]:
pd100[(pd100['Year'] > 2016) & (pd100['G'] > 10)][g100.predict(test100.fillna(0)) < 0]

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,...,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Year
2835,Russell Westbrook,PG,28,OKC,27,27,949,14,33.1,0.423,...,2.9,11.7,14.6,15.3,1.9,0.5,7.8,3.4,42.2,2017


In [26]:
pdgame[(pdgame['Year'] > 2016) & (pdgame['G'] > 10)][gg.predict(testgame.fillna(0)) < 0]

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,...,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Year
2497,DeMarcus Cousins,C,26,SAC,26,26,34.5,9.7,21.6,0.451,...,0.756,2.3,8.4,10.7,3.4,1.2,1.4,2.9,3.7,2017
2506,Anthony Davis,PF,23,NOP,27,27,37.4,10.4,21.2,0.493,...,0.813,1.8,9.1,10.9,2.2,1.6,2.8,2.5,2.1,2017
2577,James Harden,PG,27,HOU,28,28,36.8,8.0,18.3,0.439,...,0.836,1.3,6.7,8.0,11.8,1.5,0.3,5.6,2.6,2017
2835,Russell Westbrook,PG,28,OKC,27,27,35.1,10.1,23.8,0.423,...,0.814,2.1,8.4,10.5,11.0,1.3,0.3,5.6,2.5,2017


With a slightly larger value (0.01), we find more outliers.

In [27]:
g36 = IsolationForest(contamination=0.01).fit(train36.fillna(0))
g100 = IsolationForest(contamination=0.01).fit(train100.fillna(0))
gg = IsolationForest(contamination=0.01).fit(traingame.fillna(0))

In [28]:
pd36[(pd36['Year'] > 2016) & (pd36['G'] > 10)][g36.predict(test36.fillna(0)) < 0]

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,...,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Year
2497,DeMarcus Cousins,C,26,SAC,26,26,898,10.1,22.5,0.451,...,2.4,8.8,11.1,3.5,1.2,1.5,3.0,3.9,29.5,2017
2506,Anthony Davis,PF,23,NOP,27,27,1011,10.0,20.4,0.493,...,1.7,8.8,10.5,2.1,1.5,2.7,2.4,2.0,28.4,2017
2529,Joel Embiid,C,22,PHI,18,18,436,9.2,19.7,0.471,...,3.0,8.1,11.1,2.7,1.0,3.7,5.4,4.9,27.2,2017
2575,A.J. Hammons,C,24,DAL,14,0,65,3.3,10.0,0.333,...,2.2,8.3,10.5,1.1,0.0,3.3,0.6,5.5,9.4,2017
2577,James Harden,PG,27,HOU,28,28,1030,7.9,17.9,0.439,...,1.3,6.6,7.8,11.6,1.4,0.3,5.5,2.6,27.1,2017
2637,James Jones,SF,36,CLE,14,0,90,6.8,10.4,0.654,...,0.4,1.6,2.0,0.8,1.2,1.6,1.6,2.8,20.8,2017
2692,JaVale McGee,C,29,GSW,24,2,197,10.6,15.9,0.667,...,4.0,6.6,10.6,1.3,0.9,2.9,3.3,5.7,24.5,2017
2835,Russell Westbrook,PG,28,OKC,27,27,949,10.3,24.4,0.423,...,2.1,8.6,10.8,11.3,1.4,0.3,5.7,2.5,31.1,2017


In [29]:
pd100[(pd100['Year'] > 2016) & (pd100['G'] > 10)][g100.predict(test100.fillna(0)) < 0]

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,...,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Year
2497,DeMarcus Cousins,C,26,SAC,26,26,898,14.2,31.5,0.451,...,3.3,12.3,15.6,4.9,1.7,2.1,4.2,5.5,41.3,2017
2506,Anthony Davis,PF,23,NOP,27,27,1011,13.6,27.7,0.493,...,2.4,11.9,14.3,2.9,2.0,3.6,3.3,2.8,38.6,2017
2529,Joel Embiid,C,22,PHI,18,18,436,12.8,27.1,0.471,...,4.1,11.2,15.3,3.8,1.4,5.1,7.4,6.7,37.6,2017
2575,A.J. Hammons,C,24,DAL,14,0,65,4.9,14.6,0.333,...,3.2,12.1,15.4,1.6,0.0,4.9,0.8,8.1,13.8,2017
2577,James Harden,PG,27,HOU,28,28,1030,10.7,24.4,0.439,...,1.7,9.0,10.7,15.8,2.0,0.4,7.4,3.5,36.9,2017
2835,Russell Westbrook,PG,28,OKC,27,27,949,14.0,33.1,0.423,...,2.9,11.7,14.6,15.3,1.9,0.5,7.8,3.4,42.2,2017


In [30]:
pdgame[(pdgame['Year'] > 2016) & (pdgame['G'] > 10)][gg.predict(testgame.fillna(0)) < 0]

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,...,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Year
2428,Giannis Antetokounmpo,SG,22,MIL,25,25,34.7,8.4,16.0,0.524,...,0.765,1.8,7.2,9.1,6.0,2.1,2.0,3.3,3.8,2017
2497,DeMarcus Cousins,C,26,SAC,26,26,34.5,9.7,21.6,0.451,...,0.756,2.3,8.4,10.7,3.4,1.2,1.4,2.9,3.7,2017
2506,Anthony Davis,PF,23,NOP,27,27,37.4,10.4,21.2,0.493,...,0.813,1.8,9.1,10.9,2.2,1.6,2.8,2.5,2.1,2017
2577,James Harden,PG,27,HOU,28,28,36.8,8.0,18.3,0.439,...,0.836,1.3,6.7,8.0,11.8,1.5,0.3,5.6,2.6,2017
2622,LeBron James,SF,32,CLE,23,23,36.8,9.2,17.8,0.517,...,0.687,1.2,6.4,7.6,9.0,1.4,0.5,4.0,1.7,2017
2830,John Wall,PG,26,WAS,24,24,36.2,8.6,18.8,0.456,...,0.807,0.9,3.5,4.4,9.7,2.3,0.6,4.5,2.3,2017
2835,Russell Westbrook,PG,28,OKC,27,27,35.1,10.1,23.8,0.423,...,0.814,2.1,8.4,10.5,11.0,1.3,0.3,5.6,2.5,2017
2837,Hassan Whiteside,C,27,MIA,28,28,33.4,7.4,13.6,0.543,...,0.527,4.2,10.6,14.8,0.7,0.8,2.3,2.0,3.3,2017
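
One simple way to summarise the three tables at contamination 0.01 is to intersect the flagged player names. The sets below are copied by hand from the outputs above; the variable names are just for illustration.

```python
# Players flagged at contamination=0.01, copied from the three output tables.
flagged_per36 = {
    "DeMarcus Cousins", "Anthony Davis", "Joel Embiid", "A.J. Hammons",
    "James Harden", "James Jones", "JaVale McGee", "Russell Westbrook",
}
flagged_per100 = {
    "DeMarcus Cousins", "Anthony Davis", "Joel Embiid", "A.J. Hammons",
    "James Harden", "Russell Westbrook",
}
flagged_pergame = {
    "Giannis Antetokounmpo", "DeMarcus Cousins", "Anthony Davis", "James Harden",
    "LeBron James", "John Wall", "Russell Westbrook", "Hassan Whiteside",
}

# Players flagged by every model, regardless of how the stats are normalised.
common = flagged_per36 & flagged_per100 & flagged_pergame
print(sorted(common))
# ['Anthony Davis', 'DeMarcus Cousins', 'James Harden', 'Russell Westbrook']
```

Low-minute players such as A.J. Hammons and James Jones appear only in the per-minute or per-possession views, where small samples inflate rate stats, so the intersection leans toward high-usage starters.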
