Some good ole' sports betting. I think there's a big opportunity for reinforcement learning in sports betting. For now, we'll analyze betting lines for NBA games, and hopefully soon we can try some reinforcement learning algorithms to decide which bets to take, based on financial reward. Let's take a look at this data.

In [1]:
#imports
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt


In [2]:
nba_df=pd.read_csv('../input/nba_game_scores_1g.csv')
nba_df.head()

A little introduction to these betting terms: the line is how many points are given to (or taken away from) the home team, in order to even out the likelihood of either side winning that bet. The over/under is similar: it lets people bet on whether both teams' combined points will be more or less than that number.
Not included, but also important, is the money line bet: this is just betting on one team beating the other. Let me show an example:

In [3]:
nba_df.iloc[0]

In the above record, the Detroit Pistons played the Atlanta Hawks on Oct 27th. The Hawks lost at home, 106-94. The line was -7, showing that Atlanta was the favorite. If you took Atlanta on the spread (interchangeable with line), you would have lost 106-87 (87 = 94 - 7). The over/under was 197, and the combined score of both teams was 106 + 94 = 200. If you took the over, you would have won! **Note:** I won't be talking about payouts, because that data is not included. For now, let's add some more columns to the data frame, so we can easily analyze the whole dataset.
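
The arithmetic in this example can be written out as a small helper. This is only a sketch: `settle_game` and its return keys are made-up names for illustration, and a tie on either bet is reported as a push, which the rest of this notebook does not track.

```python
def settle_game(home_score, away_score, line, over_under):
    """Settle the spread and over/under for one game.

    The line is relative to the home team, so it is added to the home score.
    """
    line_adjusted = home_score + line
    if line_adjusted > away_score:
        spread_winner = 'home'
    elif line_adjusted < away_score:
        spread_winner = 'away'
    else:
        spread_winner = 'push'  # assumption: ties are reported separately

    total_points = home_score + away_score
    if total_points > over_under:
        over_under_hit = 'over'
    elif total_points < over_under:
        over_under_hit = 'under'
    else:
        over_under_hit = 'push'  # assumption: ties are reported separately

    return {
        'line_adjusted': line_adjusted,
        'spread_winner': spread_winner,
        'total_points': total_points,
        'over_under_hit': over_under_hit,
    }


# the Pistons at Hawks game described above: Hawks at home, 94 to 106, line -7, over/under 197
example = settle_game(home_score=94, away_score=106, line=-7, over_under=197)
print(example)
```

The adjusted home score comes out to 87 against Detroit's 106, so the away side covers, and the 200 combined points clear the 197 over/under.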

We should create columns for: 
* actual combined score
* line adjusted
* did the over or under hit
* which team won the line
* was the winner home or away
* was the line winner home or away


In [4]:
nba_df['line_adjusted']= nba_df['line'] + nba_df['home.score']
nba_df['total_points']= nba_df['away.score'] + nba_df['home.score']

#set a default column
nba_df['over_under_hit'] = 'over'
#replace over with under when the total fell below the over/under (an exact tie stays 'over')
nba_df.loc[nba_df['total_points'] < nba_df['over_under'], 'over_under_hit'] = 'under'


#who won the spread
nba_df['spread_winner'] = nba_df['home.team']
#replace home with away when the away team scored more than the adjusted home score
away_covered = nba_df['line_adjusted'] < nba_df['away.score']
nba_df.loc[away_covered, 'spread_winner'] = nba_df.loc[away_covered, 'away.team']

#was winner home or away
nba_df['win_home_away'] = 'home'
#replace home with away when the away team won
nba_df.loc[nba_df['home.score'] < nba_df['away.score'], 'win_home_away'] = 'away'

#was line winner home or away
nba_df['line_win_home_away'] = 'home'
#replace home with away when the away team covered the spread
nba_df.loc[away_covered, 'line_win_home_away'] = 'away'
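
One edge case in the columns above: when the total lands exactly on the over/under, or the adjusted home score equals the away score, the bet is a tie (a push), yet the default-then-overwrite pattern labels those rows 'over' and 'home'. One possible way to keep pushes separate is `np.select`, sketched here on a tiny made-up frame. The scores are invented for illustration; only the column names match the ones used above.

```python
import numpy as np
import pandas as pd

# made-up scores, chosen so that the middle row is a push on both bets
toy = pd.DataFrame({
    'home.score': [94, 100, 110],
    'away.score': [106, 97, 90],
    'line': [-7.0, -3.0, 5.0],
    'over_under': [197.0, 197.0, 210.0],
})
toy['line_adjusted'] = toy['line'] + toy['home.score']
toy['total_points'] = toy['away.score'] + toy['home.score']

# the first matching condition wins; rows matching neither fall through to 'push'
toy['over_under_hit'] = np.select(
    [toy['total_points'] > toy['over_under'],
     toy['total_points'] < toy['over_under']],
    ['over', 'under'],
    default='push',
)
toy['line_win_home_away'] = np.select(
    [toy['line_adjusted'] > toy['away.score'],
     toy['line_adjusted'] < toy['away.score']],
    ['home', 'away'],
    default='push',
)
print(toy[['total_points', 'over_under_hit', 'line_adjusted', 'line_win_home_away']])
```

Whether pushes matter depends on how often lines are set on whole numbers in this dataset, which is worth checking before modeling.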


Cool! Now let's take a look at the new fields:

In [5]:
nba_df.head()

Let's try to make a predictive model. In our case, we are interested in predicting the *likelihood* of our bets winning- not necessarily the teams! But, only a few of these fields can actually be used as independent variables in our analysis! These variables are: home team, away team, over/under, and line- the rest are the results! Let's try to look at this data differently: Let's get each NBA team's schedule- this will let us get info on win streaks, loss streaks, road games after a home win, etc!

In [6]:
#get a list of all teams (you can get this from the away or home team field because every team plays both home and away)
teams = nba_df['away.team'].unique()

away_df = pd.DataFrame()
home_df = pd.DataFrame()
schedule= []
for team in teams:
    away=[]
    away_df = (nba_df[['date','season','home.team','away.score','home.score','line','over_under']][nba_df['away.team']==team])
    away= away_df.values.tolist()
    #let's add an indicator saying these are away games
    away=[[team] + ['away']+ game for game in away]
    
    home=[]
    home_df = (nba_df[['date','season','away.team','home.score','away.score','line','over_under']][nba_df['home.team']==team])
    home= home_df.values.tolist()
    #let's add an indicator saying these are home games
    home =[[team] +['home']+ game for game in home]
    for a_games in away: 
        schedule.append(a_games)
    for h_games in home:
        schedule.append(h_games)
    
    
#let's load the list to a dataframe
schedule_df= pd.DataFrame(schedule, columns = ['team','home_away','date','season','opponent','team_score','opponent_score','line','over_under'])
schedule_df.head()
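
The same team-by-team layout can also be built without Python loops, by stacking a "home view" and an "away view" of the games. This is a sketch on made-up games with placeholder team names, assuming the same column names as `nba_df`; `games`, `home_view`, `away_view` and `schedule` are illustrative names, not part of the notebook.

```python
import pandas as pd

# made-up games with placeholder team names; columns mirror nba_df
games = pd.DataFrame({
    'date': ['2015-10-27', '2015-10-28'],
    'season': [2015, 2015],
    'home.team': ['AAA', 'BBB'],
    'away.team': ['BBB', 'CCC'],
    'home.score': [94, 101],
    'away.score': [106, 99],
    'line': [-7.0, 2.5],
    'over_under': [197.0, 203.5],
})

# each game as seen by the home team
home_view = games.rename(columns={
    'home.team': 'team', 'away.team': 'opponent',
    'home.score': 'team_score', 'away.score': 'opponent_score',
}).assign(home_away='home')

# the same game as seen by the away team
away_view = games.rename(columns={
    'away.team': 'team', 'home.team': 'opponent',
    'away.score': 'team_score', 'home.score': 'opponent_score',
}).assign(home_away='away')

cols = ['team', 'home_away', 'date', 'season', 'opponent',
        'team_score', 'opponent_score', 'line', 'over_under']
schedule = pd.concat([home_view, away_view], ignore_index=True)[cols]
schedule = schedule.sort_values(['team', 'date']).reset_index(drop=True)
print(schedule)
```

Note that in both approaches the `line` column is still expressed from the home team's point of view, even on rows where `team` is the away side.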


Now we are able to see the games with respect to each specific team. Next we can sort by team, then by date.

In [7]:
schedule_df = schedule_df.sort_values(['team','date'],ascending=[True,True])
schedule_df.head()

This data is now in an easier format to analyze: let's quickly check out how many wins and losses each team had

In [8]:
#add a column for win/loss
schedule_df['win/loss'] = 'win'
#use .loc rather than chained indexing so the assignment always reaches schedule_df
schedule_df.loc[schedule_df['team_score'] < schedule_df['opponent_score'], 'win/loss'] = 'loss'

schedule_df.head()

#graph it 
win_by_team = schedule_df.groupby(['team','win/loss'])['win/loss'].size().unstack()

win_by_team =win_by_team.sort_values(['win'],ascending=[False])
win_by_team.plot.bar(figsize=(20,12))


Let's look into win streaks and loss streaks:

In [30]:
#get the previous game outcome within each team, and create new column
schedule_df['prev_game'] = schedule_df.groupby('team')['win/loss'].shift(1)

#a game continues a streak when its result matches the team's previous result
continues_streak = schedule_df['win/loss'] == schedule_df['prev_game']
#every break starts a new run; counting within a run gives the streak length
#(relies on schedule_df being sorted by team, then date)
run_id = (~continues_streak).cumsum()
run_length = schedule_df.groupby(run_id).cumcount() + 1

#win_streak is filled on wins, lose_streak on losses, and each is 0 otherwise
is_win = schedule_df['win/loss'] == 'win'
schedule_df['win_streak'] = run_length.where(is_win, 0)
schedule_df['lose_streak'] = run_length.where(~is_win, 0)

schedule_df[0:20]
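
The streak counter relies on a run-length trick: mark every row where the result changes, take the cumulative sum of those marks to give each run its own id, then count rows within each run. Here is a minimal sketch on a made-up sequence of results for a single team (`results`, `new_run`, `run_id` and `run_length` are illustrative names):

```python
import pandas as pd

# made-up results for one team, in date order
results = pd.Series(['win', 'win', 'loss', 'win', 'win', 'win'])

# True wherever the result differs from the previous game (the first game always starts a run)
new_run = results != results.shift(1)

# cumulative sum of the breaks labels each run: 1, 1, 2, 3, 3, 3
run_id = new_run.cumsum()

# position within the run, starting at 1, is the current streak length
run_length = results.groupby(run_id).cumcount() + 1

print(pd.DataFrame({'result': results, 'run_id': run_id, 'run_length': run_length}))
```

With several teams stacked in one frame, a change of team also has to count as a break, otherwise a streak would carry over from one team's last game to the next team's first.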

We can use a similar approach to see how many games were played at home and away in a row:

In [32]:
#get the previous game location within each team
prev_location = schedule_df.groupby('team')['home_away'].shift(1)

#a game continues a home stand or road trip when its location matches the team's previous game
continues_streak = schedule_df['home_away'] == prev_location
#every break starts a new run; counting within a run gives the streak length
#(relies on schedule_df being sorted by team, then date)
run_id = (~continues_streak).cumsum()
run_length = schedule_df.groupby(run_id).cumcount() + 1

#home_streak is filled on home games, away_streak on away games, and each is 0 otherwise
is_home = schedule_df['home_away'] == 'home'
schedule_df['home_streak'] = run_length.where(is_home, 0)
schedule_df['away_streak'] = run_length.where(~is_home, 0)

schedule_df[0:20]

If we are going to build a predictive model, we need to lag the streaks, to get the performance of the team going into their next game. We won't be able to input variables for win streak and lose streak, because those depend on the outcome of the game we are trying to predict! What we can do is look at the streaks prior to that game. Let's look at an example:

In [34]:
schedule_df[2:4]

If we are trying to predict the outcome for the ATL game on 2015-11-01, we can use the previous game's streaks as independent variables. In this example, the Hawks are coming off of a 2 game win streak, and a 1 game streak at home. This can easily be done by creating 4 new columns, and lagging them by 1.

In [36]:
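#shift within each team so a team's first game doesn't inherit the previous team's last streak
schedule_df['lag_win_streak'] = schedule_df.groupby('team')['win_streak'].shift(1)
schedule_df['lag_lose_streak'] = schedule_df.groupby('team')['lose_streak'].shift(1)
schedule_df['lag_home_streak'] = schedule_df.groupby('team')['home_streak'].shift(1)
schedule_df['lag_away_streak'] = schedule_df.groupby('team')['away_streak'].shift(1)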
schedule_df.head()
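
When lagging a frame that stacks several teams, it matters whether the shift happens within each team. The sketch below uses a made-up two-team frame (`toy`, `plain_lag` and `team_lag` are illustrative names) to compare a plain `shift(1)` with a shift grouped by team:

```python
import pandas as pd

# made-up streak values for two placeholder teams, already sorted by team and date
toy = pd.DataFrame({
    'team': ['A', 'A', 'B', 'B'],
    'win_streak': [1, 2, 0, 1],
})

# plain shift: team B's first game picks up team A's last value
toy['plain_lag'] = toy['win_streak'].shift(1)

# grouped shift: each team's first game has no history, so it is NaN
toy['team_lag'] = toy.groupby('team')['win_streak'].shift(1)

print(toy)
```

The NaN on each team's first game is an honest "no previous game" marker, and it will need to be filled or dropped before fitting most models.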