# NFL Data Scraping Project
This notebook describes the web scraping process for NFL team statistics.

To start, we need to decide what information we need and where to find it. For this project I want to analyze how a team's regular season statistics affect its performance during the regular season and in the playoffs.
- To find the regular season statistics I will be using the information provided by the NFL here: **https://www.nfl.com/stats/team-stats/**
- To find the win and loss information for teams I will be using the information from this link: **https://www.teamrankings.com/nfl/trends/win_trends/**

Let's start by gathering the win-loss data.

### Packages
For this project I will be using the Beautiful Soup, pandas, datetime, and requests packages to conduct web scraping.

In [95]:
# IMPORT STATEMENTS
from bs4 import BeautifulSoup
from requests_html import HTMLSession
from urllib.parse import urljoin
import requests as r
import pandas as pd

In [96]:
url = "https://www.teamrankings.com/nfl/trends/win_trends/?sc=is_regular_season"
response = r.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
specific_div = soup.find('table') # finding the table with the record of each team
rows = specific_div.find_all('tr') # get the rows of data

# Now that we have the data as a list of rows, we can parse the data to construct a data frame
data = []
for row in rows:
        cells = row.find_all(['td', 'th'])  # 'td' for regular cells, 'th' for header cells
        row_data = [cell.text.strip() for cell in cells] #extract the contents in each cell
        data.append(row_data)
columns = data[0]
df = pd.DataFrame(data[1:], columns=columns)
print(df)


             Team Win-Loss Record  Win %    MOV ATS +/-
0    Philadelphia           8-1-0  88.9%    6.3    +1.2
1     Kansas City           7-2-0  77.8%    7.2    +1.1
2         Detroit           7-2-0  77.8%    4.2    +1.5
3       Baltimore           7-3-0  70.0%   11.3    +6.3
4         Seattle           6-3-0  66.7%   -0.1    -2.0
5           Miami           6-3-0  66.7%    6.7    +3.2
6   San Francisco           6-3-0  66.7%   12.1    +5.4
7          Dallas           6-3-0  66.7%   11.6    +6.3
8       Cleveland           6-3-0  66.7%    4.9    +4.7
9      Pittsburgh           6-3-0  66.7%   -2.9    -2.1
10   Jacksonville           6-3-0  66.7%    0.7    -0.1
11      Minnesota           6-4-0  60.0%    2.4    +3.6
12     Cincinnati           5-4-0  55.6%   -1.1    -3.0
13        Houston           5-4-0  55.6%    2.8    +5.3
14      Las Vegas           5-5-0  50.0%   -3.3    -1.9
15        Buffalo           5-5-0  50.0%    7.8    +1.7
16   Indianapolis           5-5-0  50.0%   -0.6 
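
The row-parsing loop above can be tried offline on a small inline table, which makes it easier to see what each step produces. The following is only a minimal sketch: `SAMPLE_HTML` is hand-written stand-in markup rather than the real page, and `table_to_frame` is a hypothetical helper name that does not appear elsewhere in this notebook.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical, hand-written HTML standing in for the downloaded page
SAMPLE_HTML = """
<table>
  <tr><th>Team</th><th>Win-Loss Record</th><th>Win %</th></tr>
  <tr><td> Philadelphia </td><td>8-1-0</td><td>88.9%</td></tr>
  <tr><td>Kansas City</td><td>7-2-0</td><td>77.8%</td></tr>
</table>
"""

def table_to_frame(html):
    """Convert the first <table> in an HTML string into a DataFrame."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:  # soup.find returns None when the page has no table
        raise ValueError("no <table> found in the HTML")
    # One list of stripped cell texts per <tr>; the first row holds the headers
    rows = [
        [cell.get_text(strip=True) for cell in row.find_all(["td", "th"])]
        for row in table.find_all("tr")
    ]
    return pd.DataFrame(rows[1:], columns=rows[0])

sample_df = table_to_frame(SAMPLE_HTML)
print(sample_df)
```

Checking for a missing table up front gives a clearer error than the `AttributeError` that would otherwise be raised when calling `find_all` on `None`.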

We have now parsed the data for one year. The next challenge is to change the filters to get postseason data, and then to change the year filter to get data for each year. Inspecting the HTML, we can see that there is a 'div class=filter' element which holds the different filters that we can change. When the year filter is changed, the URL changes. For example: *https://www.teamrankings.com/nfl/trends/win_trends/?sc=is_regular_season&range=yearly_2022*. Using the final parameter "&range=yearly_2022", we can adjust the URL to select the year to scrape.


In [98]:
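year = 2022
# append the year filter as a query parameter to the regular season url from above
year_url = url + "&range=yearly_" + str(year)
print(year_url)

https://www.teamrankings.com/nfl/trends/win_trends/?sc=is_regular_season&range=yearly_2022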
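
The query string described above can also be assembled from its individual parameters rather than by string concatenation. This is a minimal sketch using only the standard library. `build_url` and `BASE_URL` are hypothetical names introduced for illustration, and the `sc` and `range` parameter names are taken from the example URL shown earlier.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.teamrankings.com/nfl/trends/win_trends/"

def build_url(year, game_type="regular_season"):
    # build_url is a hypothetical helper, not part of the original notebook.
    # "sc" selects the game type and "range" selects the season.
    params = {"sc": f"is_{game_type}", "range": f"yearly_{year}"}
    return BASE_URL + "?" + urlencode(params)

print(build_url(2022))
```

Keeping the parameters in a dictionary makes it harder to drop the `&range=yearly_` separator by accident when only the year changes.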


Now the goal is to turn the above code into a function and iterate through each year to pull data. Note that we only want to scrape complete seasons. At the time of writing the 2023 season is still in progress, so we will pull data up to 2022.

In [None]:
def read_data(year):
    # Reads the data from the website extracting the values of the table based on the year
    # Param: year - year of data to be collected

    url = "https://www.teamrankings.com/nfl/trends/win_trends/?sc=is_regular_season&range=yearly_"
    url = url + str(year) # str() allows the year to be passed as an int or a string
    response = r.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    specific_div = soup.find('table') # finding the table with the record of each team
    rows = specific_div.find_all('tr') # get the rows of data

    # Now that we have the data as a list of rows, we can parse the data to construct a data frame
    data = []
    for row in rows:
            cells = row.find_all(['td', 'th'])  # 'td' for regular cells, 'th' for header cells
            row_data = [cell.text.strip() for cell in cells] #extract the contents in each cell
            data.append(row_data)
    columns = data[0]
    df = pd.DataFrame(data[1:], columns=columns)
    return df # return the data frame so it can be used outside the function
read_data("2003")

Now that we have a function, we want one table with data from all available years. To do this, we need to identify each data point by year. Therefore, we adjust the original function to add a column for the year.

In [67]:
def read_data(year):
    # Reads the data from the website extracting the values of the table based on the year
    # Param: year - year of data to be collected

    url = "https://www.teamrankings.com/nfl/trends/win_trends/?sc=is_regular_season&range=yearly_"
    url = url + str(year)
    response = r.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    specific_div = soup.find('table') # finding the table with the record of each team
    rows = specific_div.find_all('tr') # get the rows of data

    # Now that we have the data as a list of rows, we can parse the data to construct a data frame
    data = []
    for row in rows:
            cells = row.find_all(['td', 'th'])  # 'td' for regular cells, 'th' for header cells
            row_data = [cell.text.strip() for cell in cells] #extract the contents in each cell
            data.append(row_data)
    columns = data[0]
    df = pd.DataFrame(data[1:], columns=columns)
    df["Year"] = year #helps identify the data points based on year
    return df
read_data("2004")

Unnamed: 0,Team,Win-Loss Record,Win %,MOV,ATS +/-,Year
0,Pittsburgh,15-1-0,93.8%,7.6,4.8,2004
1,New England,14-2-0,87.5%,11.1,4.2,2004
2,Philadelphia,13-3-0,81.3%,7.9,1.8,2004
3,Indianapolis,12-4-0,75.0%,10.7,5.4,2004
4,LA Chargers,12-4-0,75.0%,8.3,9.0,2004
5,Atlanta,11-5-0,68.8%,0.2,-2.0,2004
6,NY Jets,10-6-0,62.5%,4.5,2.3,2004
7,Green Bay,10-6-0,62.5%,2.8,0.8,2004
8,Denver,10-6-0,62.5%,4.8,0.1,2004
9,Seattle,9-7-0,56.3%,-0.1,-3.8,2004


Now that we can identify the year for each data table, we need to loop through each year and pull the data for the regular season. To get the range of years, the __datetime__ package is used to look up the current year. We loop through the years from 2003 to the year before the current year. All of this data is then placed into one table.

In [69]:
import datetime as dt
all_data = []
current_year = dt.datetime.today().year
for i in range(2003, current_year): # range() excludes current_year itself
    x = read_data(i)
    all_data.append(x) # adding data frame objects to a list

regular_season_data = pd.concat(all_data) # concatenate once at the end instead of growing a data frame inside the loop
regular_season_data


Unnamed: 0,Team,Win-Loss Record,Win %,MOV,ATS +/-,Year
0,New England,14-2-0,87.5%,6.9,+5.1,2003
1,Kansas City,13-3-0,81.3%,9.5,+3.7,2003
2,Philadelphia,12-4-0,75.0%,5.4,+3.1,2003
3,Tennessee,12-4-0,75.0%,6.9,+3.0,2003
4,Indianapolis,12-4-0,75.0%,6.9,+3.4,2003
...,...,...,...,...,...,...
27,LA Rams,5-12-0,29.4%,-4.5,-3.0,2022
28,Indianapolis,4-12-1,25.0%,-8.1,-6.6,2022
29,Arizona,4-13-0,23.5%,-6.4,-2.5,2022
30,Houston,3-13-1,18.8%,-7.7,-0.4,2022
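
Note that the index in the combined table restarts for every season, because `pd.concat` keeps each data frame's own index by default. The sketch below illustrates this on two tiny placeholder frames (the team names and years are made up for the example, not scraped data) and shows the `ignore_index=True` option, which renumbers the rows instead.

```python
import pandas as pd

# Hypothetical two-row frames standing in for two seasons of scraped data
season_a = pd.DataFrame({"Team": ["A", "B"], "Year": [2003, 2003]})
season_b = pd.DataFrame({"Team": ["A", "B"], "Year": [2004, 2004]})

stacked = pd.concat([season_a, season_b])                        # keeps each frame's index
renumbered = pd.concat([season_a, season_b], ignore_index=True)  # fresh 0..n-1 index

print(stacked.index.tolist())     # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

Either form works for later filtering by the Year column. A unique index mainly matters when selecting rows by label with `.loc`.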


Now that we have the regular season win-loss information, we can adjust our function to get information about the playoff games. We can do this by adding a parameter for the type of game we want to search for.

In [88]:
def read_data_v2(year, game_type='playoff'):
    # Reads the data from the website extracting the values of the table based on the year
    # Param:
    # year - year of data to be collected
    # game_type - either playoff ("playoff") or regular season ("regular_season"); the "is_" prefix is added when the url is built

    url = f"https://www.teamrankings.com/nfl/trends/win_trends/?sc=is_{game_type}&range=yearly_"
    url = url + str(year)
    response = r.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    specific_div = soup.find('table') # finding the table with the record of each team
    rows = specific_div.find_all('tr') # get the rows of data

    # Now that we have the data as a list of rows, we can parse the data to construct a data frame
    data = []
    for row in rows:
            cells = row.find_all(['td', 'th'])  # 'td' for regular cells, 'th' for header cells
            row_data = [cell.text.strip() for cell in cells] #extract the contents in each cell
            data.append(row_data)
    columns = data[0]
    df = pd.DataFrame(data[1:], columns=columns)
    df["Year"] = year #helps identify the data points based on year
    return df
read_data_v2("2022", "playoff")

Unnamed: 0,Team,Win-Loss Record,Win %,MOV,ATS +/-,Year
0,Kansas City,3-0-0,100.0%,4.3,0.8,2022
1,Philadelphia,2-1-0,66.7%,17.3,13.5,2022
2,San Francisco,2-1-0,66.7%,0.3,-3.2,2022
3,Cincinnati,2-1-0,66.7%,7.0,7.2,2022
4,Dallas,1-1-0,50.0%,5.0,5.5,2022
5,NY Giants,1-1-0,50.0%,-12.0,-6.8,2022
6,Jacksonville,1-1-0,50.0%,-3.0,2.8,2022
7,Buffalo,1-1-0,50.0%,-7.0,-17.0,2022
8,LA Chargers,0-1-0,0.0%,-1.0,-3.0,2022
9,Seattle,0-1-0,0.0%,-18.0,-8.5,2022


Now let's create a function that will loop through the years as before.


In [92]:
def capture_all_data(game_type):
    all_data = []
    current_year = int(dt.datetime.today().strftime("%Y"))
    for i in range(2003, current_year):
        x = read_data_v2(i, game_type)
        all_data.append(x) # adding data frame objects to a list
    data = pd.concat(all_data)
    return data
capture_all_data('playoff')

Unnamed: 0,Team,Win-Loss Record,Win %,MOV,ATS +/-,Year
0,New England,3-0-0,100.0%,5.3,0.0,2003
1,Carolina,3-1-0,75.0%,8.3,+12.4,2003
2,Indianapolis,2-1-0,66.7%,9.3,+10.3,2003
3,Philadelphia,1-1-0,50.0%,-4.0,-9.3,2003
4,Green Bay,1-1-0,50.0%,1.5,+0.8,2003
...,...,...,...,...,...,...
9,Tampa Bay,0-1-0,0.0%,-17.0,-14.5,2022
10,Seattle,0-1-0,0.0%,-18.0,-8.5,2022
11,Miami,0-1-0,0.0%,-3.0,+11.0,2022
12,Baltimore,0-1-0,0.0%,-7.0,+0.5,2022


## PART 2: Gathering NFL Team Data

In [100]:
url = "https://www.nfl.com/stats/team-stats/"
response = r.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
specific_div = soup.find('table') # finding the table with the statistics of each team
rows = specific_div.find_all('tr') # get the rows of data

# Now that we have the data as a list of rows, we can parse the data to construct a data frame
data = []
for row in rows:
        cells = row.find_all(['td', 'th'])  # 'td' for regular cells, 'th' for header cells
        row_data = [cell.text.strip() for cell in cells] #extract the contents in each cell
        data.append(row_data)
columns = data[0]
df = pd.DataFrame(data[1:], columns=columns)
df

Unnamed: 0,Team,Att,Cmp,Cmp %,Yds/Att,Pass Yds,TD,INT,Rate,1st,1st%,20+,40+,Lng,Sck,SckY
0,Commanders\n \n \n\n ...,397,264,66.5,7.0,2783,17,9,91.5,127,32.0,32,3,51T,47,317
1,Vikings\n \n \n\n ...,385,267,69.4,7.4,2858,21,5,103.6,138,35.8,43,5,62T,23,134
2,Saints\n \n \n\n ...,382,248,64.9,6.6,2526,13,7,87.4,113,29.6,28,8,58T,24,152
3,Bills\n \n \n\n ...,350,246,70.3,7.4,2600,19,11,96.6,124,35.4,27,5,55T,13,64
4,Panthers\n \n \n\n ...,349,217,62.2,5.5,1928,10,7,78.1,97,27.8,17,4,48T,32,261
5,Bengals\n \n \n\n ...,349,233,66.8,6.3,2208,14,6,90.3,110,31.5,20,4,64T,22,166
6,Patriots\n \n \n\n ...,349,222,63.6,6.1,2135,10,11,77.0,98,28.1,19,2,58T,23,139
7,Colts\n \n \n\n ...,340,213,62.6,6.8,2298,11,7,84.7,106,31.2,30,6,75,22,117
8,Chiefs\n \n \n\n ...,340,232,68.2,7.3,2473,17,10,93.7,120,35.3,29,4,54,12,89
9,Lions\n \n \n\n ...,326,223,68.4,7.7,2507,14,5,99.0,120,36.8,41,5,46,16,100
