## You're ready to put into practice everything that you've learned so far. Here are the next steps for your capstone:

## 1) Go out and find a dataset of interest. It could be from one of the recommended resources or some other aggregation. Or it could be something that you scraped yourself. Just make sure that it has lots of variables, including an outcome of interest to you.

## 2) Explore the data. Get to know the data. Spend a lot of time going over its quirks. You should understand how it was gathered, what's in it, and what the variables look like.

## 3) Model your outcome of interest. You should try several different approaches and really work to tune a variety of models before using the model evaluation techniques to choose what you consider to be the best performer. Make sure to think about explanatory versus predictive power, and experiment with both.

## Execute the three tasks above in a Jupyter Notebook that you will submit to the grading team.

## Next, to prepare for your presentation, create a slide deck and a 15-minute presentation that guides viewers through your model. Be sure to cover a few specific topics:

* ## A specified research question that your model addresses
* ## How you chose your model specification and what alternatives you compared it to
* ## The practical uses of your model for an audience of interest
* ## Any weak points or shortcomings of your model

## This presentation is not a drill. You'll be presenting this slide deck live to a group as the culmination of all your work so far on supervised learning. As a secondary matter, your slides and the Jupyter Notebook should be worthy of inclusion as examples of your work product when applying to jobs.

```r
library(xlsx)
#fileUrl <- "http://www.aussportsbetting.com/historical_data/nfl.xlsx" # money line, open/close lines historic info
#download.file(fileUrl)
#nfl_asb <- read.csv("nfl_2014_2017_asb.csv") 

# pro football reference game info
library(XML)
library(RCurl)
library(rvest)
#filename <- NA
#for(year in 1966:2017){
#        filename[year] <- paste("https://www.pro-football-reference.com/years/",year,"/games.htm#games::none",sep="")
#} # read seasons 1966 to 2017
#url_pfr_games <- getURL(filename[1966:2017])  #getURL("https://www.pro-football-reference.com/years/2014/games.htm#games::none")
#pfr_games_raw <- readHTMLTable(url_pfr_games, trim=T, as.data.frame=T, header=T)
#pfr_games <-bind_rows(pfr_games_raw)
#my_df <- as.data.frame(read_html(url_pfr_games) %>% html_table(fill=TRUE))
```

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [23]:
all_dfs = []

# read seasons 1966 to 2017
for year in range(1966,2018):
    print(year)
    url = "https://www.pro-football-reference.com/years/" + str(year) + "/games.htm#games::none"
    df = pd.read_html(url)
    line = str(df[0])
    rows = line.split('\n')
    
    for i in range(0, len(rows)):
        if i == 0:
            rows[0] = 'Year' + ' ' + rows[0]
        else:
            rows[i] = str(year) + ' ' + rows[i]
            
        print(rows[i])
        
        
    df = pd.DataFrame(np.array(rows).reshape(1,27), columns = ['Year','Week','Day','Date','Time','Winner/tie','Unnamed: 5',
                                                               'Loser/tie','Unnamed: 5','PtsW','PtsL','YdsW','TOW','YdsL', 
                                                               'TOL'])
    print(df)
    all_dfs.extend(df)
    
scores1_df = pd.concat(all_dfs)
scores1_df.shape

1966
Year           Week  Day          Date Time           Winner/tie Unnamed: 5  \
1966 0            1  Sat  September 10  NaN    Green Bay Packers        NaN   
1966 1            1  Sun  September 11  NaN     Los Angeles Rams          @   
1966 2            1  Sun  September 11  NaN        Detroit Lions        NaN   
1966 3            1  Sun  September 11  NaN  San Francisco 49ers        NaN   
1966 4            1  Sun  September 11  NaN  Pittsburgh Steelers        NaN   
1966 ..         ...  ...           ...  ...                  ...        ...   
1966 117         15  Sun   December 18  NaN  Pittsburgh Steelers          @   
1966 118         15  Sun   December 18  NaN    Green Bay Packers          @   
1966 119        NaN  NaN      Playoffs  NaN                  NaN        NaN   
1966 120      Champ  Sun     January 1  NaN    Green Bay Packers          @   
1966 121  SuperBowl  Sun    January 15  NaN    Green Bay Packers          N   
1966 
1966               Loser/tie Unnamed: 7 P

ValueError: Shape of passed values is (1, 27), indices imply (1, 15)

In [11]:
scores1_df.head()

Unnamed: 0,Week,Day,Date,Time,Winner/tie,Unnamed: 5,Loser/tie,Unnamed: 7,PtsW,PtsL,YdsW,TOW,YdsL,TOL
0,1,Sat,September 10,,Green Bay Packers,,Baltimore Colts,boxscore,24,3,292,1,213,3
1,1,Sun,September 11,,Los Angeles Rams,@,Atlanta Falcons,boxscore,19,14,421,2,237,2
2,1,Sun,September 11,,Detroit Lions,,Chicago Bears,boxscore,14,3,208,3,256,3
3,1,Sun,September 11,,San Francisco 49ers,,Minnesota Vikings,boxscore,20,20,307,1,298,1
4,1,Sun,September 11,,Pittsburgh Steelers,,New York Giants,boxscore,34,34,404,3,279,5


In [None]:
odds_df = pd.read_excel("http://www.aussportsbetting.com/historical_data/nfl.xlsx")
scores2_df = pd.read_csv('data/spreadspoke_scores.csv', encoding = "ISO-8859-1", engine='python')
stadiums_df = pd.read_csv('data/nfl_stadiums.csv', encoding = "ISO-8859-1", engine='python')
teams_df = pd.read_csv('data/nfl_teams.csv', encoding = "ISO-8859-1", engine='python')

In [None]:
odds_df.head()

In [None]:
scores2_df.head()

In [None]:
stadiums_df.head()

In [None]:
teams_df.head()