## Assumption

Following Skiena's notes, we can start with a linear model of this form:
$$P(Wins \mid \sim) = f(score,\ time\ left,\ field\ position)$$ where $\sim$ denotes the given observations.


### Logistic Regression

#### Motivation
Since we want to decide whether a team is going to clear its bets, this can be framed as a binary classification problem (**clear** vs. **not clear**). We can use **logistic regression** to estimate the probability that team A clears the bet.

#### Usage
```python
from sklearn.linear_model import LogisticRegression

lg = LogisticRegression()
lg.fit(X, y)  # X: feature vectors of (score, time_left, field_position)
              # y: whether the bet was cleared (1) or not (0) for each vector
prob = lg.predict_proba(X_new)[:, 1]  # probability of the positive class
```
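
To make the `predict_proba` step concrete, here is a minimal sketch of what a fitted binary logistic regression computes for one observation: a weighted sum of the features passed through the sigmoid. The weights, the intercept and the helper name `cover_probability` are made-up illustrations, not values fitted on this dataset.

```python
import math

def cover_probability(features, weights, intercept):
    """Logistic model: sigmoid of the linear score w.x + b."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights for (score_diff, time_left in seconds, field_position)
weights = [0.15, 0.0, -0.01]
intercept = 0.0

p_tied = cover_probability([0, 3600, 50], weights, intercept)   # tied at kickoff
p_lead = cover_probability([10, 900, 50], weights, intercept)   # 10-point lead late
```

With a positive weight on the score difference, the larger lead yields the higher probability, which is the behaviour we would expect the fitted model to learn.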

This raises a **question**: how do we decide the ground truth for each time stamp (time_left) in the training (historical) dataset?

##### Possible solution
1. For a specific team, count how many times the bet was cleared at a given time left, conditioned on:

  a. score
  
  b. field position, since these two variables are time-related.
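
One way to read this idea is as an empirical frequency: bucket historical plays by time left, score difference and field position, then count how often the bet was cleared in each bucket. The sketch below assumes that every play inherits the final outcome of its game as its label. The bucket sizes, the tuple layout and the name `cover_rate_by_bucket` are illustrative choices, not part of the original notebook.

```python
from collections import defaultdict

def cover_rate_by_bucket(plays, time_step=300, yard_step=20):
    """plays: iterable of (time_left, score_diff, field_pos, cleared) tuples,
    where cleared is 1 if the bet was eventually cleared in that game, else 0.
    Returns {bucket: fraction of plays in the bucket whose game cleared}."""
    counts = defaultdict(lambda: [0, 0])  # bucket -> [cleared, total]
    for time_left, score_diff, field_pos, cleared in plays:
        bucket = (int(time_left // time_step), score_diff, int(field_pos // yard_step))
        counts[bucket][0] += cleared
        counts[bucket][1] += 1
    return dict((b, float(c) / n) for b, (c, n) in counts.items())

# Toy data (made up): three plays fall in the same bucket, two of them cleared.
toy = [(3500, 0, 35, 1), (3400, 0, 25, 1), (3350, 0, 30, 0), (100, 7, 80, 1)]
rates = cover_rate_by_bucket(toy)
```

Buckets with few plays give noisy rates, which is one reason a smooth model such as logistic regression may be preferable to raw counts.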
  


## Import library

In [5]:
import pandas as pd
import numpy as np
import csv
import json
import requests
from xml.etree import ElementTree
import glob
import re
import sklearn.linear_model as lm

## Load dataset

In [105]:
points = pd.read_csv('data/2014_point_spread_footballlocks.csv')
points

Unnamed: 0,eid,season,week,Home,Away,HomeScore,AwayScore,Day,Time,Favorite,Underdog,Spread,CoverOrNot
0,2014090400,2014,1,SEA,GB,36,16,Thu,8:30,SEA,GB,-5.0,1.0
1,2014090700,2014,1,ATL,NO,37,34,Sun,1:00,NO,ATL,-3.0,-1.0
2,2014090701,2014,1,BAL,CIN,16,23,Sun,1:00,BAL,CIN,-1.0,-1.0
3,2014090702,2014,1,CHI,BUF,20,23,Sun,1:00,CHI,BUF,-7.0,-1.0
4,2014090703,2014,1,HOU,WAS,17,6,Sun,1:00,HOU,WAS,-3.0,1.0
5,2014090704,2014,1,KC,TEN,10,26,Sun,1:00,KC,TEN,-3.0,-1.0
6,2014090705,2014,1,MIA,NE,33,20,Sun,1:00,NE,MIA,-4.0,-1.0
7,2014090706,2014,1,NYJ,OAK,19,14,Sun,1:00,NYJ,OAK,-6.5,-1.0
8,2014090707,2014,1,PHI,JAC,34,17,Sun,1:00,PHI,JAC,-10.0,1.0
9,2014090708,2014,1,PIT,CLE,30,27,Sun,1:00,PIT,CLE,-6.0,-1.0


In [11]:
# ## TODO: change to complete dataset
# covers = []
# for i, r in points.iterrows():
#     h = r['Home']
#     a = r['Away']
#     f = r['Favorite']
#     u = r['Underdog']
#     hs = r['HomeScore']
#     vs = r['AwayScore']
#     fs = -1
#     us = -1
#     if h == f:
#         fs = hs
#         us = vs
#     else:
#         fs = vs
#         us = hs
#     if (fs == -1) or (us == -1):
#         print "Favorite or Underdog not match"
#         print r
#         break
#     if (fs - us) > -r['Spread']:
#         cover = 1
#     elif (fs - us) == -r['Spread']:
#         cover = 0
#     else:
#         cover = -1
#     covers.append(cover)
# points["CoverOrFail"] = pd.Series(covers)
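
The commented-out loop above can be condensed into a small pure function. This is a sketch of the same rule (the favorite's margin compared against the negated spread); the name `cover_result` is made up here, and the three sample games are taken from the rows of the table above.

```python
def cover_result(home, favorite, home_score, away_score, spread):
    """Return 1 if the favorite covers, 0 on a push, -1 otherwise.
    spread is the favorite's line, e.g. -5.0 means favored by 5."""
    if home == favorite:
        fav_score, dog_score = home_score, away_score
    else:
        fav_score, dog_score = away_score, home_score
    margin = fav_score - dog_score
    if margin > -spread:
        return 1
    if margin == -spread:
        return 0
    return -1

sea = cover_result('SEA', 'SEA', 36, 16, -5.0)   # SEA wins by 20, favored by 5
atl = cover_result('ATL', 'NO', 37, 34, -3.0)    # favorite NO loses by 3
nyj = cover_result('NYJ', 'NYJ', 19, 14, -6.5)   # NYJ wins by 5, favored by 6.5
```

The three results agree with the `CoverOrNot` column for those games (1.0, -1.0 and -1.0).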

## Take Season 2014 week 1 game 1 as example

In [108]:
eid = 2014090400

In [162]:
plays = pd.read_csv('data/game_data/'+str(eid)+'_plays.csv')
# plays = pd.read_csv(str(eid)+'.csv')  # alternate path; would overwrite the read above

In [164]:
plays_parse = plays.loc[:, ['time', 'desc', 'qtr', 'yrdln', 'posteam', 'note']].copy()
plays_parse = plays_parse.sort_values(by=['qtr', 'time'], ascending=[True, False], axis=0).reset_index()

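# [time, Missing Value] the end-of-period rows have NaN time; set them to '0:00'
time_null_idx = plays_parse['time'].index[pd.isnull(plays_parse['time'].values)]
plays_parse.loc[time_null_idx, 'time'] = '0:00'
# convert the 'MM:SS' quarter clock to seconds
plays_parse['time'] = pd.Series([float(a.split(':')[0])*60+float(a.split(':')[1]) for a in plays_parse['time'].values])
# NOTE: this value grows as the quarter clock runs down (2700 at the opening kickoff);
# parseMatches below uses 3600 - 15*60*qtr + time to get the seconds remaining.
plays_parse['lefttime'] = 3600 - 15*60*(plays_parse['qtr']-1) - plays_parse['time']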

# [yrdln, Missing Value] Missing Value: make nan the same as previous one
plays_parse['yrdln'] = plays_parse['yrdln'].fillna(method='ffill')

plays_parse

Unnamed: 0,index,time,desc,qtr,yrdln,posteam,note,lefttime
0,86,900.0,S.Hauschka kicks 71 yards from SEA 35 to GB -6...,1,SEA 35,SEA,KICKOFF,2700.0
1,84,896.0,(14:56) E.Lacy right tackle to GB 19 for 6 yar...,1,GB 13,GB,,2704.0
2,90,870.0,(14:30) E.Lacy left tackle to GB 22 for 3 yard...,1,GB 19,GB,PENALTY,2730.0
3,88,851.0,(14:11) (Shotgun) E.Lacy up the middle to GB 3...,1,GB 24,GB,,2749.0
4,87,812.0,(13:32) (No Huddle) J.Starks right guard to GB...,1,GB 39,GB,,2788.0
5,83,786.0,(13:06) (Shotgun) A.Rodgers pass short left to...,1,GB 41,GB,,2814.0
6,89,750.0,(12:30) (Shotgun) A.Rodgers sacked at GB 39 fo...,1,GB 39,GB,,2850.0
7,85,717.0,"(11:57) T.Masthay punts 29 yards to SEA 32, Ce...",1,GB 39,GB,PUNT,2883.0
8,98,713.0,(11:53) R.Wilson pass short left to P.Harvin t...,1,SEA 35,SEA,,2887.0
9,102,680.0,(11:20) M.Lynch left tackle to SEA 44 for 5 ya...,1,SEA 39,SEA,,2920.0
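
The clock handling above can be isolated into two small helpers. This sketch assumes `time` is the quarter clock in `MM:SS` form and uses the seconds-remaining formula that appears later in `parseMatches` for regulation games; the helper names are hypothetical.

```python
def clock_to_seconds(clock):
    """Convert a quarter clock string such as '14:56' to seconds."""
    minutes, seconds = clock.split(':')
    return int(minutes) * 60 + int(seconds)

def seconds_left(qtr, clock_seconds, quarter_length=15 * 60, quarters=4):
    """Seconds remaining in regulation, given the quarter and its clock."""
    return quarter_length * (quarters - qtr) + clock_seconds

kickoff = seconds_left(1, clock_to_seconds('15:00'))      # start of the game
second_play = seconds_left(1, clock_to_seconds('14:56'))  # four seconds later
end_of_game = seconds_left(4, clock_to_seconds('0:00'))   # final whistle
```

Defined this way the value decreases monotonically from 3600 to 0 over a regulation game, which is what a "time left" feature should do.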


In [197]:
# Assign Home, Away, HomeScore, AwayScore, CoverOrNot to current match
plays_parse['Home'] = points[points['eid'] == eid].Home.iloc[0]
plays_parse['Away'] = points[points['eid'] == eid].Away.iloc[0]
# plays_parse['Favorite'] = points[points['eid'] == eid].Favorite.iloc[0]
# plays_parse['Underdog'] = points[points['eid'] == eid].Underdog.iloc[0]

plays_parse['HomeScore'] = 0
plays_parse['AwayScore'] = 0
plays_parse['CoverOrNot'] = points[points['eid'] == eid].CoverOrNot.iloc[0]
plays_parse

Unnamed: 0,index,time,desc,qtr,yrdln,posteam,note,lefttime,Home,Away,HomeScore,AwayScore,CoverOrNot,h_yrdln,v_yrdln
0,86,900.0,S.Hauschka kicks 71 yards from SEA 35 to GB -6...,1,SEA 35,SEA,KICKOFF,2700.0,SEA,GB,0,0,1.0,35.0,65.0
1,84,896.0,(14:56) E.Lacy right tackle to GB 19 for 6 yar...,1,GB 13,GB,,2704.0,SEA,GB,0,0,1.0,87.0,13.0
2,90,870.0,(14:30) E.Lacy left tackle to GB 22 for 3 yard...,1,GB 19,GB,PENALTY,2730.0,SEA,GB,0,0,1.0,81.0,19.0
3,88,851.0,(14:11) (Shotgun) E.Lacy up the middle to GB 3...,1,GB 24,GB,,2749.0,SEA,GB,0,0,1.0,76.0,24.0
4,87,812.0,(13:32) (No Huddle) J.Starks right guard to GB...,1,GB 39,GB,,2788.0,SEA,GB,0,0,1.0,61.0,39.0
5,83,786.0,(13:06) (Shotgun) A.Rodgers pass short left to...,1,GB 41,GB,,2814.0,SEA,GB,0,0,1.0,59.0,41.0
6,89,750.0,(12:30) (Shotgun) A.Rodgers sacked at GB 39 fo...,1,GB 39,GB,,2850.0,SEA,GB,0,0,1.0,61.0,39.0
7,85,717.0,"(11:57) T.Masthay punts 29 yards to SEA 32, Ce...",1,GB 39,GB,PUNT,2883.0,SEA,GB,0,0,1.0,61.0,39.0
8,98,713.0,(11:53) R.Wilson pass short left to P.Harvin t...,1,SEA 35,SEA,,2887.0,SEA,GB,0,0,1.0,35.0,65.0
9,102,680.0,(11:20) M.Lynch left tackle to SEA 44 for 5 ya...,1,SEA 39,SEA,,2920.0,SEA,GB,0,0,1.0,39.0,61.0


In [198]:
# Accumulate the running score from TD, FG, XP and SAF plays
for i, r in plays_parse.iterrows():
    
    # Dealing with points
    if r['note'] == 'TD':
        if r['posteam'] == r['Home']:
            plays_parse.loc[i:, 'HomeScore'] = plays_parse.loc[i:, 'HomeScore'] + 6
        elif r['posteam'] == r['Away']:
            plays_parse.loc[i:, 'AwayScore'] = plays_parse.loc[i:, 'AwayScore'] + 6
    if r['note'] == 'FG':
        if r['posteam'] == r['Home']:
            plays_parse.loc[i:, 'HomeScore'] = plays_parse.loc[i:, 'HomeScore'] + 3
        elif r['posteam'] == r['Away']:
            plays_parse.loc[i:, 'AwayScore'] = plays_parse.loc[i:, 'AwayScore'] + 3
    if r['note'] == 'XP':
        if r['posteam'] == r['Home']:
            plays_parse.loc[i:, 'HomeScore'] = plays_parse.loc[i:, 'HomeScore'] + 1
        elif r['posteam'] == r['Away']:
            plays_parse.loc[i:, 'AwayScore'] = plays_parse.loc[i:, 'AwayScore'] + 1
    if r['note'] == 'SAF':
        if r['posteam'] == r['Home']:
            plays_parse.loc[i:, 'AwayScore'] = plays_parse.loc[i:, 'AwayScore'] + 2
        elif r['posteam'] == r['Away']:
            plays_parse.loc[i:, 'HomeScore'] = plays_parse.loc[i:, 'HomeScore'] + 2
            
    # yrdln: home 0, away 100
    side = r['yrdln'].split(' ')[0]
    yrdln = int(r['yrdln'].split(' ')[1])
    if side == r['Home']:
        h_yrdln = yrdln
        v_yrdln = 100 - yrdln
    elif side == r['Away']:
        v_yrdln = yrdln
        h_yrdln = 100 - yrdln
    plays_parse.loc[i, 'h_yrdln'] = h_yrdln
    plays_parse.loc[i, 'v_yrdln'] = v_yrdln

plays_parse

Unnamed: 0,index,time,desc,qtr,yrdln,posteam,note,lefttime,Home,Away,HomeScore,AwayScore,CoverOrNot,h_yrdln,v_yrdln
0,86,900.0,S.Hauschka kicks 71 yards from SEA 35 to GB -6...,1,SEA 35,SEA,KICKOFF,2700.0,SEA,GB,0,0,1.0,35.0,65.0
1,84,896.0,(14:56) E.Lacy right tackle to GB 19 for 6 yar...,1,GB 13,GB,,2704.0,SEA,GB,0,0,1.0,87.0,13.0
2,90,870.0,(14:30) E.Lacy left tackle to GB 22 for 3 yard...,1,GB 19,GB,PENALTY,2730.0,SEA,GB,0,0,1.0,81.0,19.0
3,88,851.0,(14:11) (Shotgun) E.Lacy up the middle to GB 3...,1,GB 24,GB,,2749.0,SEA,GB,0,0,1.0,76.0,24.0
4,87,812.0,(13:32) (No Huddle) J.Starks right guard to GB...,1,GB 39,GB,,2788.0,SEA,GB,0,0,1.0,61.0,39.0
5,83,786.0,(13:06) (Shotgun) A.Rodgers pass short left to...,1,GB 41,GB,,2814.0,SEA,GB,0,0,1.0,59.0,41.0
6,89,750.0,(12:30) (Shotgun) A.Rodgers sacked at GB 39 fo...,1,GB 39,GB,,2850.0,SEA,GB,0,0,1.0,61.0,39.0
7,85,717.0,"(11:57) T.Masthay punts 29 yards to SEA 32, Ce...",1,GB 39,GB,PUNT,2883.0,SEA,GB,0,0,1.0,61.0,39.0
8,98,713.0,(11:53) R.Wilson pass short left to P.Harvin t...,1,SEA 35,SEA,,2887.0,SEA,GB,0,0,1.0,35.0,65.0
9,102,680.0,(11:20) M.Lynch left tackle to SEA 44 for 5 ya...,1,SEA 39,SEA,,2920.0,SEA,GB,0,0,1.0,39.0,61.0
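
The yard-line conversion in the loop above can also be written as a standalone function, which makes the midfield case explicit (a bare `50` has no team prefix). This is a sketch with a hypothetical name, `parse_yrdln`; it returns the distance from the home team's goal line and from the away team's goal line.

```python
def parse_yrdln(yrdln, home, away):
    """Map a yard-line string such as 'GB 13' to (h_yrdln, v_yrdln).
    Raises ValueError for strings it does not recognise."""
    parts = yrdln.split(' ')
    if parts == ['50']:
        return 50, 50
    if len(parts) == 2:
        side, yards = parts[0], int(parts[1])
        if side == home:
            return yards, 100 - yards
        if side == away:
            return 100 - yards, yards
    raise ValueError('cannot parse yrdln: %r' % (yrdln,))

own_35 = parse_yrdln('SEA 35', 'SEA', 'GB')   # kickoff spot in the table above
gb_13 = parse_yrdln('GB 13', 'SEA', 'GB')     # first GB snap in the table above
midfield = parse_yrdln('50', 'SEA', 'GB')
```

Raising on unknown input, instead of silently reusing the previous row's value, makes malformed yard lines visible early.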


In [202]:
print plays_parse.iloc[-1]['HomeScore'] == points.iloc[0]['HomeScore'] and plays_parse.iloc[-1]['AwayScore'] == points.iloc[0]['AwayScore']

True


## Go through season 2014, take REG 1-16 as training set, 17 as testing set
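
A sketch of the split named in this heading, assuming each game is represented by a record with a `week` field as in the point-spread table. The function name and the toy records are made up for illustration.

```python
def split_by_week(games, last_train_week=16):
    """Split game records into (train, test) on the regular-season week."""
    train = [g for g in games if g['week'] <= last_train_week]
    test = [g for g in games if g['week'] > last_train_week]
    return train, test

# Toy records: the week-1 eids come from the table; the week-17 eid is a placeholder.
games = [{'eid': 2014090400, 'week': 1},
         {'eid': 2014090700, 'week': 1},
         {'eid': 'week17-game', 'week': 17}]
train, test = split_by_week(games)
```

Splitting by week rather than at random keeps every play of a game on the same side of the split.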

In [261]:
def parseMatches(df, eid, points):
    try:
        df = df.loc[:, ['time', 'desc', 'qtr', 'yrdln', 'posteam', 'note']].copy()
        df = df.sort_values(by=['qtr', 'time'], ascending=[True, False], axis=0).reset_index()

        # [time, Missing Value] make endtime row as 0, since they are nan
        time_null_idx = df['time'].index[pd.isnull(df['time'].values)]
        df.loc[time_null_idx, 'time'] = '0:00'
        df['time'] = pd.Series([float(a.split(':')[0])*60+float(a.split(':')[1]) for a in df['time'].values])

        if df['qtr'].max() == 4:  # regulation only (len(df.qtr) would count rows, not quarters)
            df['lefttime'] = 3600 - 15*60*(df['qtr']) + df['time'] # unit: seconds
        else: # suppose only 1 overtime
            df['lefttime'] = 3600 + 15*60 - df.iloc[-1].time - 15*60*(df['qtr']) + df['time'] # unit: seconds


        # [yrdln, Missing Value] Missing Value: make nan the same as previous one
        df['yrdln'] = df['yrdln'].fillna(method='ffill')

        # Assign Home, Away, HomeScore, AwayScore, CoverOrNot to current match
        df['Home'] = points[points['eid'] == eid].Home.iloc[0]
        df['Away'] = points[points['eid'] == eid].Away.iloc[0]

        df['HomeScore'] = 0
        df['AwayScore'] = 0
        df['CoverOrNot'] = points[points['eid'] == eid].CoverOrNot.iloc[0]

        for i, r in df.iterrows():
            # Accumulate the running score from TD, FG, XP and SAF plays
            if r['note'] == 'TD':
                if r['posteam'] == r['Home']:
                    df.loc[i:, 'HomeScore'] = df.loc[i:, 'HomeScore'] + 6
                elif r['posteam'] == r['Away']:
                    df.loc[i:, 'AwayScore'] = df.loc[i:, 'AwayScore'] + 6
            if r['note'] == 'FG':
                if r['posteam'] == r['Home']:
                    df.loc[i:, 'HomeScore'] = df.loc[i:, 'HomeScore'] + 3
                elif r['posteam'] == r['Away']:
                    df.loc[i:, 'AwayScore'] = df.loc[i:, 'AwayScore'] + 3
            if r['note'] == 'XP':
                if r['posteam'] == r['Home']:
                    df.loc[i:, 'HomeScore'] = df.loc[i:, 'HomeScore'] + 1
                elif r['posteam'] == r['Away']:
                    df.loc[i:, 'AwayScore'] = df.loc[i:, 'AwayScore'] + 1
            if r['note'] == 'SAF':
                if r['posteam'] == r['Home']:
                    df.loc[i:, 'AwayScore'] = df.loc[i:, 'AwayScore'] + 2
                elif r['posteam'] == r['Away']:
                    df.loc[i:, 'HomeScore'] = df.loc[i:, 'HomeScore'] + 2

            # yrdln: home 0, away 100
            yrdln_split = r['yrdln'].split(' ')
            if len(yrdln_split) == 1 and yrdln_split[0] == str(50):
                h_yrdln = 50
                v_yrdln = 50
            elif len(yrdln_split) == 2:
                side = r['yrdln'].split(' ')[0]
                yrdln = int(r['yrdln'].split(' ')[1])
                if side == r['Home']:
                    h_yrdln = yrdln
                    v_yrdln = 100 - yrdln
                elif side == r['Away']:
                    v_yrdln = yrdln
                    h_yrdln = 100 - yrdln
            else:
                print 'Error in parsing yrdln, eid=', eid, 'yrdln=', r['yrdln']


            df.loc[i, 'h_yrdln'] = h_yrdln
            df.loc[i, 'v_yrdln'] = v_yrdln
    except AttributeError:
        print 'AttributeError:', eid, df['time'].values
    ## Verify the parsed final score against the official result
    final = points[points['eid'] == eid].iloc[0]
    if df.iloc[-1]['HomeScore'] == final['HomeScore'] and df.iloc[-1]['AwayScore'] == final['AwayScore']:
        print 'Successfully finish:', eid
    else:
        print 'Fail finishing:', eid, 'parse home score =', df.iloc[-1]['HomeScore']
    return df

In [262]:
for i, r in points[(points['season'] == 2014) & (points['week'] == 1)].iterrows():
    eid = r['eid']
    plays = pd.read_csv('data/game_data/'+str(eid)+'_plays.csv')
    parseMatches(plays, eid, points)

Successfully finish: 2014090400
Successfully finish: 2014090700
Fail finishing: 2014090701 parse home score = 16
Successfully finish: 2014090702
Fail finishing: 2014090703 parse home score = 10
Successfully finish: 2014090704
Successfully finish: 2014090705
Successfully finish: 2014090706
Fail finishing: 2014090707 parse home score = 27
Successfully finish: 2014090708
Fail finishing: 2014090709 parse home score = 12
Fail finishing: 2014090710 parse home score = 23
Successfully finish: 2014090711
Successfully finish: 2014090712
Fail finishing: 2014090800 parse home score = 33
Successfully finish: 2014090801
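
To narrow down why some games fail the final-score check, it may help to record the gap between the official and parsed scores rather than only printing a message. The sketch below uses a hypothetical helper, `score_gap`; the numbers are the official home scores from the point-spread table and the parsed home scores printed above. A gap of exactly 7 would be consistent with a touchdown plus extra point that the `note`/`posteam` rule did not credit to the home team (for example a defensive or return score), but that is a guess to be checked against the play descriptions.

```python
def score_gap(official, parsed):
    """Official minus parsed score; 0 means the parse matches."""
    return official - parsed

# eid -> (official HomeScore, parsed HomeScore)
home_scores = {
    2014090701: (16, 16),
    2014090703: (17, 10),
    2014090707: (34, 27),
}
gaps = dict((eid, score_gap(off, par)) for eid, (off, par) in home_scores.items())
```

For 2014090701 the home score already matches, so the mismatch must be on the away side.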


In [253]:
eid = 2014090700
df = pd.read_csv('data/game_data/'+str(eid)+'_plays.csv')
df = df.loc[:, ['time', 'desc', 'qtr', 'yrdln', 'posteam', 'note']].copy()
df = df.sort_values(by=['qtr', 'time'], ascending=[True, False], axis=0).reset_index()

# [time, Missing Value] make endtime row as 0, since they are nan
time_null_idx = df['time'].index[pd.isnull(df['time'].values)]
df.loc[time_null_idx, 'time'] = '0:00'
df['time'] = pd.Series([float(a.split(':')[0])*60+float(a.split(':')[1]) for a in df['time'].values])
if df['qtr'].max() == 4:  # regulation only (len(df.qtr) would count rows, not quarters)
    df['lefttime'] = 3600 - 15*60*(df['qtr']) + df['time'] # unit: seconds
else: # suppose only 1 overtime
    df['lefttime'] = 3600 + 15*60 - df.iloc[-1].time - 15*60*(df['qtr']) + df['time'] # unit: seconds

# [yrdln, Missing Value] Missing Value: make nan the same as previous one
df['yrdln'] = df['yrdln'].fillna(method='ffill')

# Assign Home, Away, HomeScore, AwayScore, CoverOrNot to current match
df['Home'] = points[points['eid'] == eid].Home.iloc[0]
df['Away'] = points[points['eid'] == eid].Away.iloc[0]

df['HomeScore'] = 0
df['AwayScore'] = 0
df['CoverOrNot'] = points[points['eid'] == eid].CoverOrNot.iloc[0]

for i, r in df.iterrows():
    # Accumulate the running score from TD, FG, XP and SAF plays
    if r['note'] == 'TD':
        if r['posteam'] == r['Home']:
            df.loc[i:, 'HomeScore'] = df.loc[i:, 'HomeScore'] + 6
        elif r['posteam'] == r['Away']:
            df.loc[i:, 'AwayScore'] = df.loc[i:, 'AwayScore'] + 6
    if r['note'] == 'FG':
        if r['posteam'] == r['Home']:
            df.loc[i:, 'HomeScore'] = df.loc[i:, 'HomeScore'] + 3
        elif r['posteam'] == r['Away']:
            df.loc[i:, 'AwayScore'] = df.loc[i:, 'AwayScore'] + 3
    if r['note'] == 'XP':
        if r['posteam'] == r['Home']:
            df.loc[i:, 'HomeScore'] = df.loc[i:, 'HomeScore'] + 1
        elif r['posteam'] == r['Away']:
            df.loc[i:, 'AwayScore'] = df.loc[i:, 'AwayScore'] + 1
    if r['note'] == 'SAF':
        if r['posteam'] == r['Home']:
            df.loc[i:, 'AwayScore'] = df.loc[i:, 'AwayScore'] + 2
        elif r['posteam'] == r['Away']:
            df.loc[i:, 'HomeScore'] = df.loc[i:, 'HomeScore'] + 2

    # yrdln: home 0, away 100
    yrdln_split = r['yrdln'].split(' ')
    if len(yrdln_split) == 1 and yrdln_split[0] == str(50):
        h_yrdln = 50
        v_yrdln = 50
    elif len(yrdln_split) == 2:
        side = r['yrdln'].split(' ')[0]
        yrdln = int(r['yrdln'].split(' ')[1])
        if side == r['Home']:
            h_yrdln = yrdln
            v_yrdln = 100 - yrdln
        elif side == r['Away']:
            v_yrdln = yrdln
            h_yrdln = 100 - yrdln
    else:
        print 'Error in parsing yrdln, eid=', eid, 'yrdln=', r['yrdln']


    df.loc[i, 'h_yrdln'] = h_yrdln
    df.loc[i, 'v_yrdln'] = v_yrdln

In [254]:
df

Unnamed: 0,index,time,desc,qtr,yrdln,posteam,note,lefttime,Home,Away,HomeScore,AwayScore,CoverOrNot,h_yrdln,v_yrdln
0,8,900.0,(15:00) (Shotgun) D.Brees pass short right to ...,1,NO 20,NO,,3698.0,ATL,NO,0,0,-1.0,80.0,20.0
1,10,900.0,M.Bosher kicks 65 yards from ATL 35 to end zon...,1,ATL 35,ATL,KICKOFF,3698.0,ATL,NO,0,0,-1.0,35.0,65.0
2,17,878.0,(14:38) D.Brees pass deep middle to B.Cooks to...,1,NO 25,NO,,3676.0,ATL,NO,0,0,-1.0,75.0,25.0
3,15,838.0,(13:58) M.Ingram left tackle to ATL 37 for 6 y...,1,ATL 43,NO,,3636.0,ATL,NO,0,0,-1.0,43.0,57.0
4,13,806.0,(13:26) (Shotgun) D.Brees pass short right to ...,1,ATL 37,NO,,3604.0,ATL,NO,0,0,-1.0,37.0,63.0
5,7,774.0,(12:54) (Shotgun) D.Brees pass short middle to...,1,ATL 35,NO,,3572.0,ATL,NO,0,0,-1.0,35.0,65.0
6,16,740.0,(12:20) M.Ingram left end pushed ob at ATL 17 ...,1,ATL 18,NO,,3538.0,ATL,NO,0,0,-1.0,18.0,82.0
7,9,710.0,(11:50) (Shotgun) D.Brees pass short middle to...,1,ATL 17,NO,PENALTY,3508.0,ATL,NO,0,0,-1.0,17.0,83.0
8,14,684.0,(11:24) D.Brees pass incomplete short right to...,1,ATL 27,NO,,3482.0,ATL,NO,0,0,-1.0,27.0,73.0
9,12,678.0,(11:18) (Shotgun) D.Brees pass short left to B...,1,ATL 27,NO,,3476.0,ATL,NO,0,0,-1.0,27.0,73.0
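
Once a game is parsed, each play can be turned into one training row for the logistic regression described at the top. The sketch below assumes the features are the home-minus-away score difference, `lefttime` and `h_yrdln`, and that the label is 1 when `CoverOrNot` is 1 and 0 otherwise (so pushes count as not cleared). Both choices and the name `to_xy` are assumptions, not something the notebook fixes.

```python
def to_xy(rows):
    """rows: list of dicts holding the parsed play columns.
    Returns (X, y) as plain lists."""
    X, y = [], []
    for r in rows:
        X.append([r['HomeScore'] - r['AwayScore'], r['lefttime'], r['h_yrdln']])
        y.append(1 if r['CoverOrNot'] == 1 else 0)
    return X, y

# First two rows of the table above
rows = [
    {'HomeScore': 0, 'AwayScore': 0, 'lefttime': 3698.0, 'h_yrdln': 80.0, 'CoverOrNot': -1.0},
    {'HomeScore': 0, 'AwayScore': 0, 'lefttime': 3698.0, 'h_yrdln': 35.0, 'CoverOrNot': -1.0},
]
X, y = to_xy(rows)
```

Note that `CoverOrNot` is defined relative to the favorite while the score difference here is relative to the home team, so a real feature set may need to be re-oriented around the favorite.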

