In [7]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

# PREMISE

I'm sure we all know someone whose go-to excuse when their favorite team loses a game is "[insert team here] vs Refs", or a sarcastic "I wonder who Vegas has in this one". I know it's hard to accept when your team has been flat-out beaten, but with data science we can actually check whether refs tend to favor the Vegas-backed team, and whether refs really do impact games in a significant way.

To start, this tutorial scrapes data on NFL games: which referee officiated each one, the moneyline and other betting lines, the play-by-play, who was home and who was away, etc.

Once we have all the data we need in tidy form and have analyzed it, we will be able to answer a lot of questions about refs' tendencies, and you will be able to tell that friend either to just accept the loss or that they have a legitimate claim.

Some good questions to look at are:
1. When the favored team falls behind, do the refs call more penalties on the opposing team?
2. Do certain refs tend to favor certain calls, or certain teams?
3. ...

## DATA COLLECTION

What we want to do first is collect all the data we need by web scraping with BeautifulSoup and Requests, and then store it all in a Pandas DataFrame.

Here, I have already downloaded an Excel sheet from https://www.sportsbookreviewsonline.com/scoresoddsarchives/nfl/nfloddsarchives.html

We read it into a spreadsheet-like data type from the Pandas module called a DataFrame. (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)

In [8]:
gamedata = pd.read_excel('nflodds.xlsx')
gamedata

Unnamed: 0,Date,Rot,VH,Team,1st,2nd,3rd,4th,Final,Open,Close,ML,2H
0,909,451,V,Dallas,7,9,10,3,29,52.5,52.5,375,27
1,909,452,H,TampaBay,7,14,7,3,31,7,10,-450,2
2,912,453,V,Pittsburgh,0,0,6,17,23,51,47.5,240,0.5
3,912,454,H,Buffalo,3,7,0,6,16,7,7,-280,21.5
4,912,455,V,NYJets,0,0,8,6,14,43.5,44.5,160,21
...,...,...,...,...,...,...,...,...,...,...,...,...,...
565,130,322,H,KansasCity,7,14,0,3,24,7,7,-310,3
566,130,323,V,SanFrancisco,0,10,7,0,17,47.5,45.5,170,23.5
567,130,324,H,LARams,0,7,0,13,20,3,-3.5,-180,0.5
568,213,101,N,LARams,7,6,3,7,23,4,4,-200,0.5


This data has one row per team per game (the `VH` column marks the visiting and home teams), with the score by quarter and the opening, closing, moneyline (`ML`), and halftime (`2H`) lines for each game.
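
Since the whole analysis hinges on knowing which team Vegas backed, here is a minimal sketch of how the `ML` column can be read. It assumes the values are American moneyline odds (negative for the favorite, positive for the underdog); the helper name `implied_probability` is made up for this illustration and is not part of the notebook.

```python
def implied_probability(moneyline: float) -> float:
    """Convert an American moneyline to the win probability it implies.

    Assumption: the ML column uses American odds. The bookmaker's margin
    is not removed, so the two teams' values add up to slightly more than 1.
    """
    if moneyline < 0:
        # Favorite: you risk |moneyline| to win 100.
        return -moneyline / (-moneyline + 100)
    # Underdog: you risk 100 to win `moneyline`.
    return 100 / (moneyline + 100)


# ML values from the first two rows of the table above (Dallas at TampaBay).
dallas = implied_probability(375)
tampa_bay = implied_probability(-450)
favorite = "TampaBay" if tampa_bay > dallas else "Dallas"
print(favorite, round(dallas, 3), round(tampa_bay, 3))
# prints: TampaBay 0.211 0.818
```

Under that assumption, the team with the negative moneyline in each pair of rows is the one Vegas favored.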

What we want to do now is scrape the play-by-play data so we can see exactly when every single flag was thrown. The website we are scraping from, https://www.nflpenalties.com/all-referees.php?year=2021, contains links to each referee's individual page. To analyze webpages, you don't need to know anything about HTML except how to right-click, choose Inspect, and look through the HTML for what you want; you then use BeautifulSoup (https://www.crummy.com/software/BeautifulSoup/bs4/doc/) and regular expressions to extract the information. BeautifulSoup parses the content returned by a request sent through the requests module and turns it into an object you can search for specific HTML elements.

In [9]:
URL = 'https://www.nflpenalties.com/all-referees.php?year=2021'
PREFIX = 'https://www.nflpenalties.com'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

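# Collect every tag that has an href attribute, then keep only the referee links.
elts = soup.find_all(href=True)
# group(1) is the relative link to the referee's page, group(2) is the referee's name.
match_referee = re.compile(r'<a href="(/referee/.*)">(.*)<')
ref_to_games = {}
for elt in elts:
    if ref := match_referee.match(str(elt)):
        ref_to_games[ref.group(2)] = PREFIX + ref.group(1)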
        
ref_to_games['Adrian Hill']

'https://www.nflpenalties.com/referee/adrian-hill?year=2021'

Using BeautifulSoup, the re module, and requests, we create a Python dictionary, ref_to_games, that maps each referee's name to the link to their page, as shown above.

From there, we have to follow each link and see where it takes us. Looking at the HTML for these pages, I noticed that each one links to every game refereed by that ref, so the next step is to extract those links for each referee.
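
To make the regular expression less of a black box, here is a small sketch that runs the same pattern on a single anchor tag. The tag is hand-written for illustration (it is an assumption about what the site's HTML looks like, not copied from it), so the real tags may differ in detail.

```python
import re

# Same pattern as in the cell above.
match_referee = re.compile(r'<a href="(/referee/.*)">(.*)<')

# Hand-written example tag (assumed shape, not taken from the live page).
sample_tag = '<a href="/referee/adrian-hill?year=2021">Adrian Hill</a>'

m = match_referee.match(sample_tag)
path = m.group(1)  # first capture group: the relative link
name = m.group(2)  # second capture group: the text between the tags
print(name, "->", "https://www.nflpenalties.com" + path)
# prints: Adrian Hill -> https://www.nflpenalties.com/referee/adrian-hill?year=2021
```

Note that `match` only succeeds when the pattern matches from the very start of the string, so a tag with another attribute before `href` would be skipped by this pattern.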

In [10]:
# group(1) is the relative link to a single game's play-by-play page.
match_game = re.compile(r'<a href="(/game/.*)">')
for name, link in ref_to_games.items():
    games = []
    page = requests.get(link)
    soup = BeautifulSoup(page.content, 'html.parser')
    elts = soup.find_all(href=True)
    for elt in elts:
        if game := match_game.match(str(elt)):
            games.append(PREFIX + game.group(1))
    # Replace the referee's page link with the list of their game links.
    ref_to_games[name] = games
    
ref_to_games['Adrian Hill']

['https://www.nflpenalties.com/game/minnesota-vikings-at-cincinnati-bengals-09-12-2021',
 'https://www.nflpenalties.com/game/detroit-lions-at-green-bay-packers-09-20-2021',
 'https://www.nflpenalties.com/game/atlanta-falcons-at-new-york-giants-09-26-2021',
 'https://www.nflpenalties.com/game/tennessee-titans-at-new-york-jets-10-03-2021',
 'https://www.nflpenalties.com/game/kansas-city-chiefs-at-washington-football-team-10-17-2021',
 'https://www.nflpenalties.com/game/denver-broncos-at-cleveland-browns-10-21-2021',
 'https://www.nflpenalties.com/game/jacksonville-jaguars-at-seattle-seahawks-10-31-2021',
 'https://www.nflpenalties.com/game/houston-texans-at-miami-dolphins-11-07-2021',
 'https://www.nflpenalties.com/game/carolina-panthers-at-arizona-cardinals-11-14-2021',
 'https://www.nflpenalties.com/game/chicago-bears-at-detroit-lions-11-25-2021',
 'https://www.nflpenalties.com/game/washington-football-team-at-las-vegas-raiders-12-05-2021',
 'https://www.nflpenalties.com/game/seattle-s

In this code, we have updated the ref_to_games dictionary to map each referee to the links to the play-by-play pages for the games that they refereed, as shown above.

Now we have to go through those links and extract the play-by-play for each game.
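
The game links themselves carry useful information. Judging only from the links printed above, each one appears to end in `<away-team>-at-<home-team>-MM-DD-YYYY`; assuming that holds for every game, a sketch like the following could pull out the teams and the date, which may help later when matching games to the betting lines. The names `GAME_SLUG` and `parse_game_link` are hypothetical helpers, not part of the notebook.

```python
import re
from datetime import date

# Assumed link shape: .../game/<away>-at-<home>-MM-DD-YYYY
GAME_SLUG = re.compile(
    r'/game/(?P<away>.+?)-at-(?P<home>.+)-'
    r'(?P<mm>\d{2})-(?P<dd>\d{2})-(?P<yyyy>\d{4})$'
)


def parse_game_link(link: str) -> dict:
    """Split a game link into away team, home team, and game date."""
    m = GAME_SLUG.search(link)
    if m is None:
        raise ValueError(f"unrecognised game link: {link}")
    return {
        "away": m.group("away"),
        "home": m.group("home"),
        "date": date(int(m.group("yyyy")), int(m.group("mm")), int(m.group("dd"))),
    }


info = parse_game_link(
    'https://www.nflpenalties.com/game/minnesota-vikings-at-cincinnati-bengals-09-12-2021'
)
print(info["away"], info["home"], info["date"])
# prints: minnesota-vikings cincinnati-bengals 2021-09-12
```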

In [11]:
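for name, links in ref_to_games.items():
    games = []
    for link in links:
        # read_html returns a list of matching tables; take the first one with class "standard".
        games.append(pd.read_html(link, attrs={'class': 'standard'})[0])
    # Replace the list of links with the list of play-by-play DataFrames.
    ref_to_games[name] = games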
    
ref_to_games['Adrian Hill'][0]

Unnamed: 0,MIN,CIN,Poss,Time,Down,Distance,Yardline,Play
0,0,0,CIN,Q1 15:00,0,0,CIN 35,2-E.McPherson kicks 65 yards from CIN 35 to en...
1,0,0,MIN,Q1 15:00,1,10,MIN 25,"(15:00) PENALTY on MIN-30-C.Ham, False Start, ..."
2,0,0,MIN,Q1 15:00,1,15,MIN 20,(15:00) 8-K.Cousins pass short right to 33-D.C...
3,0,0,MIN,Q1 14:25,2,6,MIN 29,"(14:25) PENALTY on MIN-83-T.Conklin, False Sta..."
4,0,0,MIN,Q1 14:10,2,11,MIN 24,(14:10) 8-K.Cousins pass short middle to 83-T....
...,...,...,...,...,...,...,...,...
235,24,24,CIN,Q5 00:06,2,5,MIN 15,(:06) 9-J.Burrow spiked the ball to stop the c...
236,24,24,,Q5 00:05,0,0,,Timeout #1 by MIN at 00:05.
237,24,27,CIN,Q5 00:05,3,5,MIN 15,(:05) 2-E.McPherson 38 yard field goal is GOOD...
238,MIN,CIN,Poss,Time,Down,Distance,Yardline,Play


These few lines of code take a while to run (upwards of 2 minutes on my laptop) because they request the page for every NFL game played in the season. Once they finish, the ref_to_games dictionary maps each referee to a list of DataFrames, one per game the referee officiated, each holding that game's play-by-play. At this point we have finished scraping our data, and we have stored the play-by-play for every single game in the season.

Printed above is the first game refereed by Adrian Hill in the 2021 season. In general, `ref_to_games[name][i]` is the `(i+1)`th game refereed by `name`. Note that the final row (238) repeats the column headers rather than describing a play.
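
Flags show up in the `Play` column as text beginning with "PENALTY on". As a rough sketch of how one might be picked apart, the pattern below is based only on the two truncated penalty rows visible above, so its shape (`TEAM-number-player, foul,`) is an assumption, and other kinds of penalty text would not match. `PENALTY` and `parse_penalty` are hypothetical names.

```python
import re

# Assumed shape, inferred from the rows shown above:
#   PENALTY on <TEAM>-<jersey number>-<player>, <foul>, ...
PENALTY = re.compile(
    r'PENALTY on (?P<team>[A-Z]+)-(?P<number>\d+)-(?P<player>[^,]+), (?P<foul>[^,]+),'
)


def parse_penalty(play: str):
    """Return the penalty details found in a play description, or None."""
    m = PENALTY.search(play)
    return m.groupdict() if m else None


# Play text exactly as it is displayed (truncated) in the output above.
flag = parse_penalty("(15:00) PENALTY on MIN-30-C.Ham, False Start, ...")
no_flag = parse_penalty("(14:10) 8-K.Cousins pass short middle to 83-T....")
print(flag)
# prints: {'team': 'MIN', 'number': '30', 'player': 'C.Ham', 'foul': 'False Start'}
print(no_flag)
# prints: None
```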

## DATA PROCESSING