# Data Processing and Exploration


In this notebook we'll look at the raw data and pare it down to a more manageable size and a normalized format. The datasets are:

- College Football schedules (regular and bowl seasons)
- University of Iowa arrests
- Johnson County Arrests
- Other related datasets as they come up.

In [1]:
# Libraries, base paths, etc.
import pandas as pd
import numpy as np
import json

dirData = '../data/'
dirDataProc = dirData + 'processed/'

dirCfb = dirData + 'in/cfb/'
dirScrapeFran = dirData + 'in/franzen-scrape/'

## College Football

**Lines**: Some files contain dirty data, with the `Line` column misspelled as `Line ` (with a trailing space). We'll rename it as we load each file.
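
As a more general alternative to renaming one known bad header, the column names can be stripped of surrounding whitespace in a single step. This is a minimal sketch on a toy frame (the values are placeholders, not rows read from the real files), and it assumes stray whitespace is the only problem in the headers.

```python
import pandas as pd

# Toy frame standing in for one season's CSV; the trailing space
# in 'Line ' mimics the dirty header described above.
frame = pd.DataFrame({
    'Date': ['09/01/1978'],
    'Home Team': ['Temple'],
    'Line ': [-24.5],
})

# Strip leading/trailing whitespace from every column name.
frame.columns = frame.columns.str.strip()

print(frame.columns.tolist())
```

This would also catch a padded `Date` or `Visitor` header, at the cost of silently touching every column name.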

In [2]:
# Concatenate all the csv files from College Football regular season
yearsCfb = range(1978, 2015)

# Create a regular season data frame
gamesRegular = pd.DataFrame()
for year in yearsCfb:
    
    # The file naming convention changes starting with the 2009 season
    if year < 2009:
        path = dirCfb + 'ncaa{}lines.csv'.format(year)
    else:
        path = dirCfb + 'cfb{}lines.csv'.format(year)
        
    frame = pd.read_csv(path, parse_dates=True)
    if ('Line ' in frame.columns.tolist()):
        frame.rename(columns={'Line ': 'Line'}, inplace=True)
        
    gamesRegular = gamesRegular.append(frame, ignore_index=True)

print ("There were {} regular season games from 1978 - 2014.".format(gamesRegular.shape[0]))
gamesRegular.head()

There were 25906 regular season games from 1978 - 2014.


Unnamed: 0,Date,Visitor,Visitor Score,Home Team,Home Score,Line
0,09/01/1978,Penn State,10.0,Temple,7.0,-24.5
1,09/02/1978,Arkansas State,20.0,Tulsa,21.0,1.0
2,09/02/1978,East Carolina,14.0,Western Carolina,6.0,
3,09/02/1978,Eastern Michigan,3.0,Northern Michigan,30.0,
4,09/02/1978,Nebraska,3.0,Alabama,20.0,11.5
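
`DataFrame.append` was removed in pandas 2.0, so the loop above will not run on recent versions. A possible replacement, sketched here with toy frames instead of the real CSV files, collects the per-year frames in a list and combines them once with `pd.concat`. The helper `cfbFileName` is hypothetical and only mirrors the file-name switch used in the cell above.

```python
import pandas as pd

def cfbFileName(year):
    # Hypothetical helper: same naming switch as the loop above.
    prefix = 'ncaa' if year < 2009 else 'cfb'
    return '{}{}lines.csv'.format(prefix, year)

# Toy frames standing in for the results of pd.read_csv(...);
# the first one carries the misspelled 'Line ' header.
frames = [
    pd.DataFrame({'Date': ['09/01/1978'], 'Line ': [-24.5]}),
    pd.DataFrame({'Date': ['09/05/2009'], 'Line': [3.0]}),
]

# rename() ignores keys that are not present, so it is safe on every frame.
cleaned = [f.rename(columns={'Line ': 'Line'}) for f in frames]

# One concat at the end avoids re-copying the growing frame on each iteration.
combined = pd.concat(cleaned, ignore_index=True)
print(combined.shape)
```

Column order after `pd.concat` may differ from what the old `append` produced, so any code that relies on positional columns should be checked.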


In [3]:
# Create a bowl season data frame
gamesBowl = pd.DataFrame()

for year in yearsCfb:
    # dirCfb already ends with a slash, so no leading slash is needed here
    path = dirCfb + 'bowl{}lines.csv'.format(year)
    frame = pd.read_csv(path, parse_dates=True)
    
    gamesBowl = gamesBowl.append(frame, ignore_index=True)
    
gamesBowl.head()

Unnamed: 0,Bowl Name,Date,Home Score,Home Team,Line,Visitor,Visitor Score
0,Garden State Bowl,12/16/1978,18,Rutgers,-11.0,Arizona State,34
1,Independence Bowl,12/16/1978,13,Louisiana Tech,,East Carolina,35
2,Hall of Fame Bowl,12/20/1978,28,Texas A+M,6.5,Iowa State,12
3,Holiday Bowl,12/22/1978,23,Navy,5.5,Brigham Young,16
4,Liberty Bowl,12/23/1978,20,Missouri,7.0,Louisiana State,15


In [4]:
# Make a master College Football data frame
# ignore_index=True avoids duplicate index labels in the combined frame
gamesMaster = gamesBowl.append(gamesRegular, ignore_index=True)
gamesMaster.shape

(26732, 7)

In [5]:
# Let's write these combined CSVs to our processed data folder.
# We might do more analysis later, so treat these files as a rough draft.
gamesRegular.to_csv(dirDataProc + 'cfb-regular-games.csv')
gamesBowl.to_csv(dirDataProc + 'cfb-bowl-games.csv')
gamesMaster.to_csv(dirDataProc + 'cfb-all-games.csv')

In [6]:
# Helper function to grab the games for any team.
# All inputs are strings; the output is a dataframe.
#   college:  team name as it appears in the data
#   gameType: 'regular', 'bowl', or 'all'
#   gameLoc:  'home', 'away', or 'all'

def getCollegeGames(college, gameType="regular", gameLoc="all"):
    # Switch to set which dataframe we'll be pulling from
    if (gameType == 'regular'):
        df = gamesRegular
    elif (gameType == 'bowl'):
        df = gamesBowl
    elif (gameType == 'all'):
        df = gamesMaster
        
    # Switch for which locations to look at
    if (gameLoc == 'all'):
        return df[(df['Home Team'] == college) | (df['Visitor'] == college)]
    elif (gameLoc == 'home'):
        return df[(df['Home Team'] == college)]
    elif (gameLoc == 'away'):
        return df[(df['Visitor'] == college)]
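
As written, the helper returns `None` when `gameLoc` is not one of the expected strings, and fails with an unassigned-variable error when `gameType` is not recognized. One way to make such mistakes visible is to validate the argument up front. The sketch below is a hypothetical variant, `selectGames`, that takes the dataframe as a parameter and only handles the location switch; the team names are placeholders.

```python
import pandas as pd

def selectGames(df, college, gameLoc='all'):
    """Hypothetical variant: filter df by team and location, rejecting unknown values."""
    isHome = df['Home Team'] == college
    isAway = df['Visitor'] == college
    masks = {'all': isHome | isAway, 'home': isHome, 'away': isAway}
    if gameLoc not in masks:
        raise ValueError("gameLoc must be one of {}".format(sorted(masks)))
    return df[masks[gameLoc]]

# Toy schedule with placeholder team names
games = pd.DataFrame({
    'Visitor': ['Team A', 'Team B', 'Team C'],
    'Home Team': ['Team B', 'Team A', 'Team D'],
})

print(len(selectGames(games, 'Team A')))          # games at either location
print(len(selectGames(games, 'Team A', 'home')))  # home games only
```

Passing the dataframe in, instead of reading the notebook-level `gamesRegular`, `gamesBowl`, and `gamesMaster`, also makes the function easier to test on small inputs.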

In [10]:
# Iowa's home games and bowl games are what we're interested in.
# Note: this particular call returns only bowl games in which Iowa is the listed home team.
iowaHome = getCollegeGames('Iowa', 'bowl', 'home')
iowaHome.head()

# write it out!
#iowaHome.to_csv(dirDataProc + 'iowa-home-and-bowl-games.csv')

## Franzen scraped data

**To do**
- Describe the scraping process
- Get caveats and possible flaws from L. Franzen


`activities.json` is the most complete dataset, so we'll start from there.

In [8]:
# Each file holds one JSON object (a dict) per line rather than a single JSON document,
# so we parse it line by line.

with open(dirScrapeFran + 'activities.json') as f:
    scrapeActivities = pd.DataFrame(json.loads(line) for line in f)
    
with open(dirScrapeFran + 'activityList.json') as f:
    scrapeActivitiesList = pd.DataFrame(json.loads(line) for line in f)
    
with open(dirScrapeFran + 'dispositionList.json') as f:
    scrapedispositionList = pd.DataFrame(json.loads(line) for line in f)
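
Files with one JSON object per line can also be read directly with `pd.read_json(..., lines=True)`. The sketch below uses an in-memory string with made-up records instead of the scraped files. Note that `read_json` infers dtypes and may try to parse date-like columns, so the result is not guaranteed to match the `json.loads` approach above column for column.

```python
import io
import pandas as pd

# Toy stand-in for a file with one JSON object per line (made-up records)
text = '{"_id": 1, "activity": "A"}\n{"_id": 2, "activity": "B"}\n'

# lines=True tells pandas to treat each line as a separate JSON object
records = pd.read_json(io.StringIO(text), lines=True)
print(records.shape)
```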

In [9]:
scrapeActivities.columns.tolist()

[u'_id',
 u'activity',
 u'address',
 u'addresses',
 u'apt',
 u'date',
 u'datetime',
 u'details',
 u'dispatch',
 u'disposition',
 u'inc',
 u'lat',
 u'link',
 u'loc',
 u'lon',
 u'time']