# College Football


In this notebook we're going to look at the raw football data and pare it down to a more manageable size and a normalized format.

- College Football schedules
  * Regular
  * Bowl
- Win/Loss
- Score delta

In [15]:
# Libraries, base paths, etc.
import pandas as pd
import numpy as np
import json

dirData = '../data/'
dirDataProc = dirData + 'processed/'
dirCfb = dirData + 'external/cfb/'

## College Football

**Lines**: Some files contain dirty data: the `Line` column header is misspelled as `Line ` (with a trailing space). We'll rename it as each file is loaded.
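
As a more general alternative to renaming that one header, whitespace can be stripped from every column name. This is a minimal sketch on a toy frame, not the approach the loading cell uses; `normalize_columns` is a hypothetical helper name introduced here for illustration.

```python
import pandas as pd

def normalize_columns(frame):
    """Return a copy of `frame` with whitespace stripped from column names.

    Hypothetical helper; assumes every column name is a string.
    """
    return frame.rename(columns=lambda name: name.strip())

# Toy frame reproducing the trailing-space header described above
dirty = pd.DataFrame({'Date': ['1978-09-01'], 'Line ': [-24.5]})
clean = normalize_columns(dirty)

print(clean.columns.tolist())
```

The original frame is left untouched, so the helper can be applied to each file right after `pd.read_csv`.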

In [5]:
# Concatenate all the csv files from College Football regular season
yearsCfb = range(1978, 2015)

# Collect one data frame per regular season
framesRegular = []
for year in yearsCfb:

    # The file naming changes starting with 2009
    if year < 2009:
        path = dirCfb + 'ncaa{}lines.csv'.format(year)
    else:
        path = dirCfb + 'cfb{}lines.csv'.format(year)

    frame = pd.read_csv(path, parse_dates=['Date'])
    # Some files misspell the 'Line' header with a trailing space
    if 'Line ' in frame.columns:
        frame.rename(columns={'Line ': 'Line'}, inplace=True)

    framesRegular.append(frame)

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
gamesRegular = pd.concat(framesRegular, ignore_index=True)

print ("There were {} regular season games from 1978 - 2014.".format(gamesRegular.shape[0]))
gamesRegular.head()

There were 25906 regular season games from 1978 - 2014.


Unnamed: 0,Date,Visitor,Visitor Score,Home Team,Home Score,Line
0,1978-09-01,Penn State,10.0,Temple,7.0,-24.5
1,1978-09-02,Arkansas State,20.0,Tulsa,21.0,1.0
2,1978-09-02,East Carolina,14.0,Western Carolina,6.0,
3,1978-09-02,Eastern Michigan,3.0,Northern Michigan,30.0,
4,1978-09-02,Nebraska,3.0,Alabama,20.0,11.5


In [6]:
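# Create a bowl season data frame
framesBowl = []

for year in yearsCfb:
    # dirCfb already ends with '/', so no leading slash is needed
    path = dirCfb + 'bowl{}lines.csv'.format(year)
    framesBowl.append(pd.read_csv(path, parse_dates=['Date']))

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
gamesBowl = pd.concat(framesBowl, ignore_index=True)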
    
gamesBowl.head()

Unnamed: 0,Bowl Name,Date,Home Score,Home Team,Line,Visitor,Visitor Score
0,Garden State Bowl,1978-12-16,18,Rutgers,-11.0,Arizona State,34
1,Independence Bowl,1978-12-16,13,Louisiana Tech,,East Carolina,35
2,Hall of Fame Bowl,1978-12-20,28,Texas A+M,6.5,Iowa State,12
3,Holiday Bowl,1978-12-22,23,Navy,5.5,Brigham Young,16
4,Liberty Bowl,1978-12-23,20,Missouri,7.0,Louisiana State,15


In [7]:
# Make a master College Football data frame
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
gamesMaster = pd.concat([gamesBowl, gamesRegular])
gamesMaster.shape

(26732, 7)
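
Because the master frame is stacked without `ignore_index=True`, the bowl and regular season rows keep their original row labels, so labels such as `0` appear twice. The sketch below shows the effect on two toy frames (team names borrowed from the tables above); whether the duplicate labels matter depends on how the frame is used later.

```python
import pandas as pd

# Toy stand-ins for the bowl and regular season frames
bowl = pd.DataFrame({'Home Team': ['Rutgers', 'Navy']})
regular = pd.DataFrame({'Home Team': ['Temple', 'Tulsa', 'Alabama']})

# Stacking as-is keeps each frame's own labels: 0, 1, 0, 1, 2
stacked = pd.concat([bowl, regular])

# ignore_index=True renumbers the rows 0..4
renumbered = pd.concat([bowl, regular], ignore_index=True)

print(stacked.index.is_unique)
print(renumbered.index.is_unique)
```

With duplicate labels, `stacked.loc[0]` returns two rows rather than one, which is worth keeping in mind before any label-based lookups.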

In [7]:
# Helper function to grab the games for any team
# Inputs are strings; output is a data frame
#   college: team name
#   gameType: 'regular', 'bowl', or 'all'
#   gameLoc: 'home', 'away', or 'all'

def getCollegeGames(college, gameType="regular", gameLoc="all"):
    # Switch to set which dataframe we'll be pulling from
    if (gameType == 'regular'):
        df = gamesRegular
    elif (gameType == 'bowl'):
        df = gamesBowl
    elif (gameType == 'all'):
        df = gamesMaster
        
    # Switch for which locations to look at
    if (gameLoc == 'all'):
        return df[(df['Home Team'] == college) | (df['Visitor'] == college)]
    elif (gameLoc == 'home'):
        return df[(df['Home Team'] == college)]
    elif (gameLoc == 'away'):
        return df[(df['Visitor'] == college)]
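
The helper above silently returns `None` (or fails with a `NameError`) when it receives an option it does not recognize. As a hedged alternative, the sketch below validates both options up front. `get_games` and its `frames` dictionary argument are hypothetical names introduced for illustration, and the toy rows are made up for the example rather than taken from the data.

```python
import pandas as pd

def get_games(frames, college, game_type='regular', game_loc='all'):
    """Return the rows of frames[game_type] that involve `college`.

    Hypothetical variant of getCollegeGames: `frames` maps a game type
    ('regular', 'bowl', 'all') to the data frame holding those games.
    """
    if game_type not in frames:
        raise ValueError('unknown game_type: {}'.format(game_type))
    df = frames[game_type]

    # One boolean mask per supported location
    masks = {
        'home': df['Home Team'] == college,
        'away': df['Visitor'] == college,
    }
    masks['all'] = masks['home'] | masks['away']

    if game_loc not in masks:
        raise ValueError('unknown game_loc: {}'.format(game_loc))
    return df[masks[game_loc]]

# Toy schedule (illustrative rows only)
toy = pd.DataFrame({
    'Visitor': ['Penn State', 'Nebraska', 'Temple'],
    'Home Team': ['Temple', 'Alabama', 'Tulsa'],
})
frames = {'regular': toy}

print(len(get_games(frames, 'Temple')))
print(len(get_games(frames, 'Temple', game_loc='away')))
```

Passing the frames in explicitly also avoids relying on notebook-level globals, at the cost of a slightly longer call.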

In [8]:
# Helper function to calc Win/Score delta column
def getGameWin(college):
    # Create a DataFrame of only this team's home games
    frame = gamesRegular[gamesRegular['Home Team'] == college]
    
    # Create the Win and Delta columns
    frame.loc[:, 'Win'] = frame.apply(lambda x: 1 if (x['Home Score'] > x['Visitor Score']) else 0, axis=1)
    frame.loc[:, 'Delta'] = frame.apply(lambda x: x['Home Score'] - x['Visitor Score'], axis=1)
    
    return frame
    

In [12]:
# Iowa home games are what we're interested in.
iowaHome = getGameWin("Iowa")
iowaHome.iloc[-30:-20]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[key] = _infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


Unnamed: 0,Date,Visitor,Visitor Score,Home Team,Home Score,Line,Win,Delta
22401,2010-10-30,Michigan State,6.0,Iowa,37.0,6.5,1,31.0
22576,2010-11-20,Ohio State,20.0,Iowa,17.0,-3.0,0,-3.0
22735,2011-09-03,Tennessee Tech,7.0,Iowa,34.0,,1,27.0
22837,2011-09-17,Pittsburgh,27.0,Iowa,31.0,3.0,1,4.0
22939,2011-09-24,UL-Monroe,17.0,Iowa,45.0,17.0,1,28.0
23106,2011-10-15,Northwestern,31.0,Iowa,41.0,6.0,1,10.0
23118,2011-10-22,Indiana,24.0,Iowa,45.0,23.0,1,21.0
23243,2011-11-05,Michigan,16.0,Iowa,24.0,-4.0,1,8.0
23303,2011-11-12,Michigan State,37.0,Iowa,21.0,-3.0,0,-16.0
23564,2012-09-08,Iowa State,9.0,Iowa,6.0,3.5,0,-3.0
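
The warning in this cell's output appears because the new columns are assigned to a filtered slice of `gamesRegular`. One common way to avoid it is to take an explicit `.copy()` of the slice before adding columns. The sketch below does that and computes both columns with vectorized arithmetic instead of `apply`. `add_win_delta` is a hypothetical name, and the toy rows are copied from the tables above.

```python
import pandas as pd

def add_win_delta(games, college):
    """Return `college`'s home games with Delta and Win columns added.

    Hypothetical variant of getGameWin that works on an explicit copy.
    """
    home = games[games['Home Team'] == college].copy()
    home['Delta'] = home['Home Score'] - home['Visitor Score']
    # 1 for a home win, 0 otherwise (ties and missing scores count as 0)
    home['Win'] = (home['Delta'] > 0).astype(int)
    return home

# Three games taken from the output tables in this notebook
toy = pd.DataFrame({
    'Visitor': ['Michigan State', 'Ohio State', 'Penn State'],
    'Visitor Score': [6.0, 20.0, 10.0],
    'Home Team': ['Iowa', 'Iowa', 'Temple'],
    'Home Score': [37.0, 17.0, 7.0],
})

iowa = add_win_delta(toy, 'Iowa')
print(iowa[['Visitor', 'Delta', 'Win']])
```

Because the function works on a copy, the source frame is never modified and pandas has no ambiguity to warn about.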


Let's write these combined CSVs to our processed data folder. We might yet do more analysis, so it's best to keep them in this 'rough draft' form.

In [11]:
gamesRegular.to_csv(dirDataProc + 'football/cfb-regular-games.csv')
gamesBowl.to_csv(dirDataProc + 'football/cfb-bowl-games.csv')
gamesMaster.to_csv(dirDataProc + 'football/cfb-all-games.csv')

# Write the Iowa home games out
iowaHome.to_csv(dirDataProc + 'iowa-home-games.csv')
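
Note that `to_csv` writes the row index as an unnamed first column by default. When these files are read back, passing `index_col=0` restores that column as the index, and `parse_dates` restores the `Date` dtype. The sketch below demonstrates the round trip with an in-memory buffer standing in for a file on disk; the two toy rows reuse values from the regular season table above.

```python
import io
import pandas as pd

games = pd.DataFrame({
    'Date': pd.to_datetime(['1978-09-01', '1978-09-02']),
    'Home Team': ['Temple', 'Tulsa'],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
games.to_csv(buffer)

# Reading back naively turns the saved index into a data column
buffer.seek(0)
naive = pd.read_csv(buffer)

# index_col=0 and parse_dates recover the original layout
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0, parse_dates=['Date'])

print(naive.columns.tolist())
print(restored.columns.tolist())
```

Alternatively, `to_csv(..., index=False)` skips the index on write, which may be preferable for frames whose index carries no meaning.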