## You're ready to put into practice everything that you've learned so far. Here are the next steps for your capstone:

## 1) Go out and find a dataset of interest. It could be from one of the recommended resources or some other aggregation. Or it could be something that you scraped yourself. Just make sure that it has lots of variables, including an outcome of interest to you.

## 2) Explore the data. Get to know the data. Spend a lot of time going over its quirks. You should understand how it was gathered, what's in it, and what the variables look like.

## 3) Model your outcome of interest. You should try several different approaches and really work to tune a variety of models before using the model evaluation techniques to choose what you consider to be the best performer. Make sure to think about explanatory versus predictive power, and experiment with both.

## Execute the three tasks above in a Jupyter Notebook that you will submit to the grading team.

## Next, to prepare for your presentation, create a slide deck and a 15-minute presentation that guides viewers through your model. Be sure to cover a few specific topics:

* ## A specified research question that your model addresses
* ## How you chose your model specification and what alternatives you compared it to
* ## The practical uses of your model for an audience of interest
* ## Any weak points or shortcomings of your model

## This presentation is not a drill. You'll be presenting this slide deck live to a group as the culmination of all your work so far on supervised learning. As a secondary matter, your slides and the Jupyter Notebook should be worthy of inclusion as examples of your work product when applying to jobs.

```r
library(xlsx)
#fileUrl <- "http://www.aussportsbetting.com/historical_data/nfl.xlsx" # money line, open/close lines historic info
#download.file(fileUrl)
#nfl_asb <- read.csv("nfl_2014_2017_asb.csv") 

# pro football reference game info
library(XML)
library(RCurl)
library(rvest)
#filename <- NA
#for(year in 1966:2017){
#        filename[year] <- paste("https://www.pro-football-reference.com/years/",year,"/games.htm#games::none",sep="")
#} # read seasons 1966 to 2017
#url_pfr_games <- getURL(filename[1966:2017])  #getURL("https://www.pro-football-reference.com/years/2014/games.htm#games::none")
#pfr_games_raw <- readHTMLTable(url_pfr_games, trim=T, as.data.frame=T, header=T)
#pfr_games <-bind_rows(pfr_games_raw)
#my_df <- as.data.frame(read_html(url_pfr_games) %>% html_table(fill=TRUE))
```

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
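all_dfs = []

# Read the games table for each season from 1966 through 2017.
for year in range(1966, 2018):
    print(year)  # progress indicator while the pages download
    url = "https://www.pro-football-reference.com/years/" + str(year) + "/games.htm#games::none"
    # read_html returns a list with every table found on the page;
    # the games table is the first one.
    df = pd.read_html(url)[0]
    df['Year'] = year
    all_dfs.append(df)

# Stack the seasons into one frame with a fresh 0..n-1 index.
scores1_df = pd.concat(all_dfs, ignore_index=True)
scores1_df.shape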

1966
1967
1968
1969
1970
1971
1972
1973
1974
1975
1976
1977
1978
1979
1980
1981
1982
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1995
1996
1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017


(12743, 15)

In [3]:
scores1_df.shape

(12743, 15)

In [4]:
scores1_df.head()

Unnamed: 0,Week,Day,Date,Time,Winner/tie,Unnamed: 5,Loser/tie,Unnamed: 7,PtsW,PtsL,YdsW,TOW,YdsL,TOL,Year
0,1,Sat,September 10,,Green Bay Packers,,Baltimore Colts,boxscore,24,3,292,1,213,3,1966
1,1,Sun,September 11,,Los Angeles Rams,@,Atlanta Falcons,boxscore,19,14,421,2,237,2,1966
2,1,Sun,September 11,,Detroit Lions,,Chicago Bears,boxscore,14,3,208,3,256,3,1966
3,1,Sun,September 11,,San Francisco 49ers,,Minnesota Vikings,boxscore,20,20,307,1,298,1,1966
4,1,Sun,September 11,,Pittsburgh Steelers,,New York Giants,boxscore,34,34,404,3,279,5,1966
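
The two unnamed columns in this table are easier to work with once they are renamed or dropped. The following is a minimal sketch, not the notebook's own method. It assumes that `Unnamed: 5` holds `@` when the winner was the visiting team, that `Unnamed: 7` only carries the box score link text, and that the scraped table may repeat its header row as a data row. The helper `clean_scores`, the columns `winner_away` and `tie`, and the `sample` frame are hypothetical; the sample reuses three rows from the output above and adds one made-up repeated header row.

```python
import pandas as pd

def clean_scores(df):
    """Return a tidied copy of a scraped games table (hypothetical helper)."""
    out = df.copy()
    # Assumption: "@" in this column means the winner was the visiting team.
    out["winner_away"] = out["Unnamed: 5"].eq("@")
    # Assumption: this column only carries the "boxscore" link text.
    out = out.drop(columns=["Unnamed: 5", "Unnamed: 7"])
    # Coerce the stat columns to numbers; non-numeric cells become NaN.
    num_cols = ["PtsW", "PtsL", "YdsW", "TOW", "YdsL", "TOL"]
    out[num_cols] = out[num_cols].apply(pd.to_numeric, errors="coerce")
    # Drop rows without numeric scores (for example, repeated header rows).
    out = out.dropna(subset=["PtsW", "PtsL"]).reset_index(drop=True)
    out["tie"] = out["PtsW"].eq(out["PtsL"])
    return out

# Three rows from the head() output above plus one made-up header row.
sample = pd.DataFrame({
    "Week": ["1", "1", "Week", "1"],
    "Winner/tie": ["Green Bay Packers", "Los Angeles Rams",
                   "Winner/tie", "San Francisco 49ers"],
    "Unnamed: 5": [None, "@", None, None],
    "Loser/tie": ["Baltimore Colts", "Atlanta Falcons",
                  "Loser/tie", "Minnesota Vikings"],
    "Unnamed: 7": ["boxscore", "boxscore", None, "boxscore"],
    "PtsW": ["24", "19", "PtsW", "20"],
    "PtsL": ["3", "14", "PtsL", "20"],
    "YdsW": ["292", "421", "YdsW", "307"],
    "TOW": ["1", "2", "TOW", "1"],
    "YdsL": ["213", "237", "YdsL", "298"],
    "TOL": ["3", "2", "TOL", "1"],
    "Year": [1966, 1966, 1966, 1966],
})

cleaned = clean_scores(sample)
print(cleaned[["Winner/tie", "Loser/tie", "PtsW", "PtsL", "winner_away", "tie"]])
```

Keeping the helper separate from the scraping loop means the raw `scores1_df` stays available if one of these assumptions turns out not to hold for some seasons.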


In [5]:
# Historical betting odds, read directly from the published Excel workbook.
odds_df = pd.read_excel("http://www.aussportsbetting.com/historical_data/nfl.xlsx")
# Local CSV files: game scores with spreads and weather, stadium details, and team identifiers.
scores2_df = pd.read_csv('data/spreadspoke_scores.csv', encoding = "ISO-8859-1", engine='python')
stadiums_df = pd.read_csv('data/nfl_stadiums.csv', encoding = "ISO-8859-1", engine='python')
teams_df = pd.read_csv('data/nfl_teams.csv', encoding = "ISO-8859-1", engine='python')

In [6]:
odds_df.head()

Unnamed: 0,Date,Home Team,Away Team,Home Score,Away Score,Overtime?,Playoff Game?,Neutral Venue?,Home Odds Open,Home Odds Min,...,Total Score Close,Total Score Over Open,Total Score Over Min,Total Score Over Max,Total Score Over Close,Total Score Under Open,Total Score Under Min,Total Score Under Max,Total Score Under Close,Notes
0,2020-02-02,Kansas City Chiefs,San Francisco 49ers,31,20,,Y,Y,1.8,1.8,...,53.0,1.9,1.9,1.9,1.9,1.9,1.9,1.9,1.9,
1,2020-01-19,San Francisco 49ers,Green Bay Packers,37,20,,Y,,1.31,1.26,...,46.5,1.9,1.9,1.9,1.9,1.9,1.9,1.9,1.9,
2,2020-01-19,Kansas City Chiefs,Tennessee Titans,35,24,,Y,,1.29,1.28,...,51.0,1.9,1.9,1.9,1.9,1.9,1.9,1.9,1.9,
3,2020-01-12,Green Bay Packers,Seattle Seahawks,28,23,,Y,,1.52,1.45,...,45.5,1.9,1.9,1.9,1.9,1.9,1.9,1.9,1.9,
4,2020-01-12,Kansas City Chiefs,Houston Texans,51,31,,Y,,1.27,1.2,...,50.5,1.9,1.9,1.95,1.9,1.9,1.9,1.9,1.9,
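
The odds values above (1.8, 1.31, and so on) look like decimal odds. If that assumption holds, the probability implied by a price is its reciprocal, which still includes the bookmaker's margin. The sketch below applies this to the five `Home Odds Open` values shown in the output; the frame `odds_sample` and the column `home_implied_prob_open` are hypothetical names.

```python
import pandas as pd

# First five rows of "Home Odds Open" from the odds_df.head() output above.
odds_sample = pd.DataFrame({
    "Home Team": ["Kansas City Chiefs", "San Francisco 49ers",
                  "Kansas City Chiefs", "Green Bay Packers",
                  "Kansas City Chiefs"],
    "Home Odds Open": [1.8, 1.31, 1.29, 1.52, 1.27],
})

# Assumption: these are decimal odds, so implied probability = 1 / odds.
# The result is not margin-free: the bookmaker's margin is still included.
odds_sample["home_implied_prob_open"] = 1.0 / odds_sample["Home Odds Open"]

print(odds_sample.round(3))
```

Removing the margin would need the matching away price for each game, so this sketch stops at the raw implied probability.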


In [7]:
scores2_df.head()

Unnamed: 0,schedule_date,schedule_season,schedule_week,schedule_playoff,team_home,score_home,score_away,team_away,team_favorite_id,spread_favorite,over_under_line,stadium,stadium_neutral,weather_temperature,weather_wind_mph,weather_humidity,weather_detail
0,9/2/1966,1966,1,False,Miami Dolphins,14,23,Oakland Raiders,,,,Orange Bowl,False,83.0,6.0,71,
1,9/3/1966,1966,1,False,Houston Oilers,45,7,Denver Broncos,,,,Rice Stadium,False,81.0,7.0,70,
2,9/4/1966,1966,1,False,San Diego Chargers,27,7,Buffalo Bills,,,,Balboa Stadium,False,70.0,7.0,82,
3,9/9/1966,1966,2,False,Miami Dolphins,14,19,New York Jets,,,,Orange Bowl,False,82.0,11.0,78,
4,9/10/1966,1966,1,False,Green Bay Packers,24,3,Baltimore Colts,,,,Lambeau Field,False,64.0,8.0,62,
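
One candidate outcome of interest in this table is whether the home team won. The sketch below is only an illustration of how such a target could be derived from `score_home` and `score_away`; the notebook does not state which outcome it will model. The frame `games` and the columns `point_diff` and `home_win` are hypothetical names, and the rows are copied from the output above.

```python
import pandas as pd

# Rows copied from the scores2_df.head() output above.
games = pd.DataFrame({
    "schedule_date": ["9/2/1966", "9/3/1966", "9/4/1966", "9/9/1966", "9/10/1966"],
    "team_home": ["Miami Dolphins", "Houston Oilers", "San Diego Chargers",
                  "Miami Dolphins", "Green Bay Packers"],
    "score_home": [14, 45, 27, 14, 24],
    "score_away": [23, 7, 7, 19, 3],
    "team_away": ["Oakland Raiders", "Denver Broncos", "Buffalo Bills",
                  "New York Jets", "Baltimore Colts"],
})

# Parse the month/day/year strings into real timestamps.
games["schedule_date"] = pd.to_datetime(games["schedule_date"], format="%m/%d/%Y")

# Positive margin means the home team won; a tie counts as "not a home win".
games["point_diff"] = games["score_home"] - games["score_away"]
games["home_win"] = games["point_diff"] > 0

print(games[["schedule_date", "team_home", "team_away", "point_diff", "home_win"]])
```

`point_diff` suits a regression framing and `home_win` a classification framing, so keeping both leaves either modeling route open.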


In [8]:
stadiums_df.head()

Unnamed: 0,stadium_name,stadium_location,stadium_open,stadium_close,stadium_type,stadium_address,stadium_weather_station_code,stadium_weather_type,stadium_capacity,stadium_surface,STATION,NAME,LATITUDE,LONGITUDE,ELEVATION
0,Alamo Dome,"San Antonio, TX",,,indoor,"100 Montana St, San Antonio, TX 78203",78203.0,dome,72000.0,FieldTurf,,,,,
1,Alltel Stadium,"Jacksonville, FL",,,,,,,,,,,,,
2,Alumni Stadium,"Chestnut Hill, MA",,,outdoor,"Perimeter Rd, Chestnut Hill, MA 02467",2467.0,cold,,Grass,,,,,
3,Anaheim Stadium,"Anaheim, CA",1980.0,1994.0,outdoor,"2000 E Gene Autry Way, Anaheim, CA 92806",92806.0,warm,,,,,,,
4,Arrowhead Stadium,"Kansas City, MO",1972.0,,outdoor,"1 Arrowhead Dr, Kansas City, MO 64129",64129.0,cold,76416.0,Grass,US1MOJC0028,"KANSAS CITY 5.1 SE, MO US",39.0692,-94.4871,264.9


In [9]:
teams_df.head()

Unnamed: 0,team_name,team_name_short,team_id,team_id_pfr,team_conference,team_division,team_conference_pre2002,team_division_pre2002
0,Arizona Cardinals,Cardinals,ARI,CRD,NFC,NFC West,NFC,NFC West
1,Phoenix Cardinals,Cardinals,ARI,CRD,NFC,,NFC,NFC East
2,St. Louis Cardinals,Cardinals,ARI,ARI,NFC,,NFC,NFC East
3,Atlanta Falcons,Falcons,ATL,ATL,NFC,NFC South,NFC,NFC West
4,Baltimore Ravens,Ravens,BAL,RAV,AFC,AFC North,AFC,AFC Central
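
The scores table identifies teams by full name (`team_home`, `team_away`) but identifies the favorite by an ID (`team_favorite_id`), so the teams table can serve as a lookup between the two. The sketch below assumes that `team_favorite_id` uses the same codes as `team_id`, which the output above does not confirm. The `teams_sample` rows come from the output above; the `games_sample` rows are made up for illustration, and `home_id`, `away_id`, and `home_favored` are hypothetical column names.

```python
import pandas as pd

# Name-to-ID pairs taken from the teams_df.head() output above.
teams_sample = pd.DataFrame({
    "team_name": ["Arizona Cardinals", "Atlanta Falcons", "Baltimore Ravens"],
    "team_id": ["ARI", "ATL", "BAL"],
})

# Made-up game rows, only to illustrate the lookup.
games_sample = pd.DataFrame({
    "team_home": ["Atlanta Falcons", "Baltimore Ravens"],
    "team_away": ["Arizona Cardinals", "Atlanta Falcons"],
    "team_favorite_id": ["ATL", "ATL"],
})

# Build a Series indexed by full name so .map() can translate names to IDs.
name_to_id = teams_sample.set_index("team_name")["team_id"]
games_sample["home_id"] = games_sample["team_home"].map(name_to_id)
games_sample["away_id"] = games_sample["team_away"].map(name_to_id)

# Assumption: team_favorite_id shares the team_id coding.
games_sample["home_favored"] = games_sample["home_id"].eq(games_sample["team_favorite_id"])

print(games_sample)
```

Names missing from the lookup map to `NaN`, which makes unmatched or renamed franchises easy to spot before modeling.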
