# NFL Data Scraping Project
This notebook describes the web scraping process for NFL team statistics.

To start, we need to decide what information we need and where to find it. For this project I want to analyze how a team's regular season statistics relate to its performance during the regular season and in the playoffs.
- To find the regular season statistics I will be using the information provided by the NFL here: **https://www.nfl.com/stats/team-stats/**
- To find the win and loss information for teams I will be using the information from this link: **https://www.teamrankings.com/nfl/trends/win_trends/**

Let's start by gathering the win-loss data.

### Packages
For this project I will be using the Beautiful Soup, pandas, datetime, and requests packages to conduct web scraping.

In [2]:
# IMPORT STATEMENTS
from bs4 import BeautifulSoup
from requests_html import HTMLSession
from urllib.parse import urljoin
import requests as r
import pandas as pd
import datetime as dt

In [2]:
url = "https://www.teamrankings.com/nfl/trends/win_trends/?sc=is_regular_season"
response = r.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
specific_div = soup.find('table') # finding the table with the record of each team
rows = specific_div.find_all('tr') # get the rows of data

# Now that we have the data as a list of rows, we can parse the data to construct a data frame
data = []
for row in rows:
        cells = row.find_all(['td', 'th'])  # 'td' for regular cells, 'th' for header cells
        row_data = [cell.text.strip() for cell in cells] #extract the contents in each cell
        data.append(row_data)
columns = data[0]
df = pd.DataFrame(data[1:], columns=columns)
print(df)


             Team Win-Loss Record  Win %    MOV ATS +/-
0       Baltimore          13-4-0  76.5%   11.9    +8.3
1          Dallas          12-5-0  70.6%   11.4    +5.6
2         Detroit          12-5-0  70.6%    3.9    +0.6
3   San Francisco          12-5-0  70.6%   11.4    +3.4
4    Philadelphia          11-6-0  64.7%    0.3    -4.2
5       Cleveland          11-6-0  64.7%    2.0    +1.5
6     Kansas City          11-6-0  64.7%    4.5    -1.3
7           Miami          11-6-0  64.7%    6.2    +1.4
8         Buffalo          11-6-0  64.7%    8.2    +2.4
9      Pittsburgh          10-7-0  58.8%   -1.2    -1.0
10        Houston          10-7-0  58.8%    1.4    +2.1
11        LA Rams          10-7-0  58.8%    1.6    +1.9
12   Jacksonville           9-8-0  52.9%    0.4    -1.3
13      Green Bay           9-8-0  52.9%    1.9    +2.7
14      Tampa Bay           9-8-0  52.9%    1.4    +3.4
15   Indianapolis           9-8-0  52.9%   -1.1    +0.1
16        Seattle           9-8-0  52.9%   -2.2 

We have now parsed the data for one year. The next challenge is to change the filters to get postseason data, and then to change the year filter to get data for each year. Inspecting the HTML shows a `div` with class `filter` that holds the filters we can change. When the year filter is changed, the URL changes, for example: *https://www.teamrankings.com/nfl/trends/win_trends/?sc=is_regular_season&range=yearly_2022*. By adjusting the final parameter, `&range=yearly_2022`, we can select which year to scrape.


In [98]:
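base_url = "https://www.teamrankings.com/nfl/trends/win_trends/?sc=is_regular_season"
year = 2022
url = f"{base_url}&range=yearly_{year}"  # the year is selected by the "range" parameter
print(url)

https://www.teamrankings.com/nfl/trends/win_trends/?sc=is_regular_season&range=yearly_2022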

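The query string can also be assembled with the standard library instead of string concatenation, which makes it harder to drop the `range=yearly_` piece by accident. This is a minimal sketch, assuming `sc` and `range` are the only two parameters needed (as in the example URL above); `build_win_trends_url` is a hypothetical helper name, not part of the original notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.teamrankings.com/nfl/trends/win_trends/"

def build_win_trends_url(year, season_filter="is_regular_season"):
    """Return the win-trends URL for one season (hypothetical helper)."""
    # "sc" selects the game filter and "range" selects the year.
    params = {"sc": season_filter, "range": f"yearly_{year}"}
    return f"{BASE_URL}?{urlencode(params)}"

print(build_win_trends_url(2022))
```

Passing a different `year` changes only the `range` value, so the same helper can be reused inside a loop over seasons.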

Now the goal is to wrap the code above in a function and iterate through each year to pull the data. Note that we only want to scrape full seasons. At the time of writing the 2023 season is still in progress, so we will pull data up to 2022.

In [None]:
def read_data(year):
    # Reads the data from the website extracting the values of the table based on the year
    # Param: year - year of data to be collected

    url = "https://www.teamrankings.com/nfl/trends/win_trends/?sc=is_regular_season&range=yearly_"
    url = url + str(year) # str() lets the function accept the year as an int or a string
    response = r.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    specific_div = soup.find('table') # finding the table with the record of each team
    rows = specific_div.find_all('tr') # get the rows of data

    # Now that we have the data as a list of rows, we can parse the data to construct a data frame
    data = []
    for row in rows:
            cells = row.find_all(['td', 'th'])  # 'td' for regular cells, 'th' for header cells
            row_data = [cell.text.strip() for cell in cells] #extract the contents in each cell
            data.append(row_data)
    columns = data[0]
    df = pd.DataFrame(data[1:], columns=columns)
    return df
read_data("2003")

Now that we have a function, we want one table with data from all available years. To do this we need to identify each data point by year, so we adjust the original function to add a column for the year.

In [67]:
def read_data(year):
    # Reads the data from the website extracting the values of the table based on the year
    # Param: year - year of data to be collected

    url = "https://www.teamrankings.com/nfl/trends/win_trends/?sc=is_regular_season&range=yearly_"
    url = url + str(year)
    response = r.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    specific_div = soup.find('table') # finding the table with the record of each team
    rows = specific_div.find_all('tr') # get the rows of data

    # Now that we have the data as a list of rows, we can parse the data to construct a data frame
    data = []
    for row in rows:
            cells = row.find_all(['td', 'th'])  # 'td' for regular cells, 'th' for header cells
            row_data = [cell.text.strip() for cell in cells] #extract the contents in each cell
            data.append(row_data)
    columns = data[0]
    df = pd.DataFrame(data[1:], columns=columns)
    df["Year"] = year #helps identify the data points based on year
    return df
read_data("2004")

Unnamed: 0,Team,Win-Loss Record,Win %,MOV,ATS +/-,Year
0,Pittsburgh,15-1-0,93.8%,7.6,4.8,2004
1,New England,14-2-0,87.5%,11.1,4.2,2004
2,Philadelphia,13-3-0,81.3%,7.9,1.8,2004
3,Indianapolis,12-4-0,75.0%,10.7,5.4,2004
4,LA Chargers,12-4-0,75.0%,8.3,9.0,2004
5,Atlanta,11-5-0,68.8%,0.2,-2.0,2004
6,NY Jets,10-6-0,62.5%,4.5,2.3,2004
7,Green Bay,10-6-0,62.5%,2.8,0.8,2004
8,Denver,10-6-0,62.5%,4.8,0.1,2004
9,Seattle,9-7-0,56.3%,-0.1,-3.8,2004

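The same table-parsing steps are repeated in every cell of this notebook, so they could be factored into one helper that works on an HTML string. The sketch below is one way to do that under stated assumptions: `parse_first_table` is a hypothetical name, the sample HTML is made up for illustration, and the `None` check is an addition (in the cells above, a page without a table would fail with an `AttributeError` on `find_all`).

```python
from bs4 import BeautifulSoup

def parse_first_table(html):
    """Return (columns, rows) for the first <table> in an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        # soup.find returns None when the page contains no table
        raise ValueError("no <table> element found in the page")
    parsed = []
    for tr in table.find_all("tr"):
        cells = tr.find_all(["td", "th"])  # header and regular cells
        parsed.append([cell.text.strip() for cell in cells])
    return parsed[0], parsed[1:]

# Made-up sample page, used only to show the shape of the result.
sample_html = """
<table>
  <tr><th>Team</th><th>Win %</th></tr>
  <tr><td> Baltimore </td><td>76.5%</td></tr>
</table>
"""
columns, rows = parse_first_table(sample_html)
print(columns)
print(rows)
```

The returned pair can be passed straight to `pd.DataFrame(rows, columns=columns)`, which is what each scraping cell does after its own parsing loop.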

Now that each table is tagged with its year, we need to loop through each year and pull the regular season data. The __datetime__ package is used to get the current year, so the loop runs from 2003 to the year before the current year. All of this data is then combined into one table.

In [69]:

all_data = []
current_year = int(dt.datetime.today().strftime("%Y"))
for i in range(2003, current_year):
    x = read_data(i)
    all_data.append(x) # adding data frame objects to a list

regular_season_data = pd.concat(all_data) # one pd.concat call is more efficient than growing a data frame inside the loop
regular_season_data


Unnamed: 0,Team,Win-Loss Record,Win %,MOV,ATS +/-,Year
0,New England,14-2-0,87.5%,6.9,+5.1,2003
1,Kansas City,13-3-0,81.3%,9.5,+3.7,2003
2,Philadelphia,12-4-0,75.0%,5.4,+3.1,2003
3,Tennessee,12-4-0,75.0%,6.9,+3.0,2003
4,Indianapolis,12-4-0,75.0%,6.9,+3.4,2003
...,...,...,...,...,...,...
27,LA Rams,5-12-0,29.4%,-4.5,-3.0,2022
28,Indianapolis,4-12-1,25.0%,-8.1,-6.6,2022
29,Arizona,4-13-0,23.5%,-6.4,-2.5,2022
30,Houston,3-13-1,18.8%,-7.7,-0.4,2022

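In the combined table above the row index restarts at 0 for every season, because `pd.concat` keeps each frame's own index by default. The sketch below shows the same loop with `ignore_index=True`, which renumbers the rows. It is an illustration only: `collect_seasons` and `fake_reader` are hypothetical names, and `fake_reader` stands in for `read_data` so the example runs without network access.

```python
import datetime as dt
import pandas as pd

def collect_seasons(reader, first_year=2003, today=None):
    """Call reader(year) for every year from first_year up to last year and stack the results."""
    if today is None:
        today = dt.date.today()
    # range() stops before today.year, so the current year is skipped
    frames = [reader(year) for year in range(first_year, today.year)]
    return pd.concat(frames, ignore_index=True)

def fake_reader(year):
    # Stand-in for read_data: two made-up rows per season.
    return pd.DataFrame({"Team": ["A", "B"], "Year": [year, year]})

combined = collect_seasons(fake_reader, first_year=2020, today=dt.date(2023, 6, 1))
print(combined)
```

Passing `today` explicitly makes the year range reproducible. Left as `None`, it falls back to the current date, as in the cell above.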

Now that we have the regular season win-loss information, we can adjust our function to get information about the playoff games. We can do this by adding a parameter for the type of game we want to search for.

In [88]:
def read_data_v2(year, game_type='playoff'):
    # Reads the data from the website, extracting the values of the table based on the year
    # Params:
    # year - year of data to be collected
    # game_type - either playoff ('playoff') or regular season ('regular_season'); inserted into the URL as is_{game_type}

    url = f"https://www.teamrankings.com/nfl/trends/win_trends/?sc=is_{game_type}&range=yearly_"
    url = url + str(year)
    response = r.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    specific_div = soup.find('table') # finding the table with the record of each team
    rows = specific_div.find_all('tr') # get the rows of data

    # Now that we have the data as a list of rows, we can parse the data to construct a data frame
    data = []
    for row in rows:
            cells = row.find_all(['td', 'th'])  # 'td' for regular cells, 'th' for header cells
            row_data = [cell.text.strip() for cell in cells] #extract the contents in each cell
            data.append(row_data)
    columns = data[0]
    df = pd.DataFrame(data[1:], columns=columns)
    df["Year"] = year #helps identify the data points based on year
    return df
read_data_v2("2022", "playoff")

Unnamed: 0,Team,Win-Loss Record,Win %,MOV,ATS +/-,Year
0,Kansas City,3-0-0,100.0%,4.3,0.8,2022
1,Philadelphia,2-1-0,66.7%,17.3,13.5,2022
2,San Francisco,2-1-0,66.7%,0.3,-3.2,2022
3,Cincinnati,2-1-0,66.7%,7.0,7.2,2022
4,Dallas,1-1-0,50.0%,5.0,5.5,2022
5,NY Giants,1-1-0,50.0%,-12.0,-6.8,2022
6,Jacksonville,1-1-0,50.0%,-3.0,2.8,2022
7,Buffalo,1-1-0,50.0%,-7.0,-17.0,2022
8,LA Chargers,0-1-0,0.0%,-1.0,-3.0,2022
9,Seattle,0-1-0,0.0%,-18.0,-8.5,2022

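Because `game_type` is pasted directly into the URL, a value such as `'is_playoff'` would silently produce `sc=is_is_playoff`. One way to guard against that is to look the value up in a small mapping first. This is a sketch under the assumption that `'playoff'` and `'regular_season'` are the only two game types used in this notebook; `SEASON_FILTERS` and `season_filter` are hypothetical names.

```python
# Accepted game types and the "sc" value each one maps to in the URL.
SEASON_FILTERS = {
    "regular_season": "is_regular_season",
    "playoff": "is_playoff",
}

def season_filter(game_type):
    """Translate a game type into its URL filter, rejecting unknown values."""
    try:
        return SEASON_FILTERS[game_type]
    except KeyError:
        allowed = ", ".join(sorted(SEASON_FILTERS))
        raise ValueError(f"unknown game_type {game_type!r}; expected one of: {allowed}") from None

print(season_filter("playoff"))
```

Calling `season_filter("is_playoff")` raises a `ValueError` instead of building a URL that returns the wrong page.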

Now let's create a function that loops through the years as before.


In [25]:
def capture_all_data(game_type):
    all_data = []
    current_year = int(dt.datetime.today().strftime("%Y"))
    for i in range(2003, current_year):
        x = read_data_v2(i, game_type)
        all_data.append(x) # adding data frame objects to a list
    data = pd.concat(all_data)
    return data
playoff_win_loss = capture_all_data('playoff')
regular_season_win_loss = capture_all_data('regular_season')


## PART 2: Gathering NFL Team Data

We can use code similar to the above to gather this data, since it is also presented in tables.

In [8]:
url = "https://www.nfl.com/stats/team-stats/offense/passing/2023/reg/all"
response = r.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
specific_div = soup.find('table') # finding the table with the record of each team
rows = specific_div.find_all('tr') # get the rows of data

# Now that we have the data as a list of rows, we can parse the data to construct a data frame
data = []
for row in rows:
        cells = row.find_all(['td', 'th'])  # 'td' for regular cells, 'th' for header cells
        row_data = [cell.text.strip() for cell in cells] #extract the contents in each cell
        data.append(row_data)
columns = data[0]
df = pd.DataFrame(data[1:], columns=columns)
df["Year"] = 2023
df

Unnamed: 0,Team,Att,Cmp,Cmp %,Yds/Att,Pass Yds,TD,INT,Rate,1st,1st%,20+,40+,Lng,Sck,SckY,Year
0,Commanders\n \n \n\n ...,636,407,64.0,6.6,4174,24,21,81.6,190,29.9,48,5,51T,65,449,2023
1,Chiefs\n \n \n\n ...,635,421,66.3,6.9,4383,28,17,89.6,216,34.0,52,8,67T,28,195,2023
2,Chargers\n \n \n\n ...,632,409,64.7,6.8,4312,24,8,91.8,204,32.3,55,7,79,43,355,2023
3,Vikings\n \n \n\n ...,631,424,67.2,7.4,4700,30,19,92.4,220,34.9,74,7,62T,47,341,2023
4,Browns\n \n \n\n ...,624,355,56.9,6.4,4011,24,23,73.7,173,27.7,53,15,75,45,318,2023
5,Jaguars\n \n \n\n ...,620,412,66.4,7.1,4377,22,14,89.3,199,32.1,51,10,65T,41,251,2023
6,Bengals\n \n \n\n ...,615,420,68.3,6.9,4257,27,14,93.0,208,33.8,46,10,80,50,362,2023
7,Cowboys\n \n \n\n ...,614,428,69.7,7.6,4660,36,10,104.6,229,37.3,64,7,92T,40,263,2023
8,Lions\n \n \n\n ...,606,408,67.3,7.6,4606,30,12,98.1,228,37.6,70,9,70T,31,205,2023
9,Saints\n \n \n\n ...,606,406,67.0,7.0,4225,28,11,94.8,199,32.8,52,11,58T,35,235,2023


Looking at the result, the table needs some cleaning: the **Team** column contains trailing whitespace and extra text after the team name, and some values in the **Lng** column carry a trailing `T`.

In [4]:
df['Team'] = df['Team'].apply(lambda x: x.split('\n')[0])
df['Lng'] = df['Lng'].apply(lambda x: x.split('T')[0])
df

Unnamed: 0,Team,Att,Cmp,Cmp %,Yds/Att,Pass Yds,TD,INT,Rate,1st,1st%,20+,40+,Lng,Sck,SckY,Year
0,Commanders,636,407,64.0,6.6,4174,24,21,81.6,190,29.9,48,5,51,65,449,2023
1,Chiefs,635,421,66.3,6.9,4383,28,17,89.6,216,34.0,52,8,67,28,195,2023
2,Chargers,632,409,64.7,6.8,4312,24,8,91.8,204,32.3,55,7,79,43,355,2023
3,Vikings,631,424,67.2,7.4,4700,30,19,92.4,220,34.9,74,7,62,47,341,2023
4,Browns,624,355,56.9,6.4,4011,24,23,73.7,173,27.7,53,15,75,45,318,2023
5,Jaguars,620,412,66.4,7.1,4377,22,14,89.3,199,32.1,51,10,65,41,251,2023
6,Bengals,615,420,68.3,6.9,4257,27,14,93.0,208,33.8,46,10,80,50,362,2023
7,Cowboys,614,428,69.7,7.6,4660,36,10,104.6,229,37.3,64,7,92,40,263,2023
8,Lions,606,408,67.3,7.6,4606,30,12,98.1,228,37.6,70,9,70,31,205,2023
9,Saints,606,406,67.0,7.0,4225,28,11,94.8,199,32.8,52,11,58,35,235,2023

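The same cleaning can be written with pandas string methods, which also makes it easy to convert `Lng` to a number. The sketch below uses a made-up two-row frame: the text after the first newline in `Team` is a placeholder, not the real page content, and keeping the `T` marker in a separate column (`Lng_has_T`, a hypothetical name) is an assumption about what might be useful later.

```python
import pandas as pd

# Toy data shaped like the scraped table; values after "\n" are placeholders.
raw = pd.DataFrame({
    "Team": ["Commanders\n \n  placeholder", "Chargers\n \n  placeholder"],
    "Lng": ["51T", "79"],
})

clean = raw.copy()
# Keep only the text before the first newline in the team cell.
clean["Team"] = clean["Team"].str.split("\n").str[0].str.strip()
# Remember which rows carried the trailing "T" before stripping it.
clean["Lng_has_T"] = clean["Lng"].str.endswith("T")
# Strip the marker and convert the remaining digits to integers.
clean["Lng"] = pd.to_numeric(clean["Lng"].str.rstrip("T"))
print(clean)
```

Using `rstrip("T")` removes only a trailing marker, whereas `split('T')[0]` would also cut a value at any earlier `T`.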

However, the 2023 season has not ended at this time, so I want to use the previous year instead. I also want the rushing stats, not just the passing stats. (Note: the receiving and scoring tabs can be derived from the rushing and passing tabs.)

To do this we need to decide how to store the scraped data. For my analysis I want one table that shows the passing stats for all years and another table for the rushing stats.

To achieve this I am going to create a nested loop that, for each type ('passing', 'rushing'), builds a table with the stats per team for each year going back to 2003, which is the earliest year available in the win-loss table.

In [25]:
url = "https://www.nfl.com/stats/team-stats/offense"
types = ["passing", "rushing"]
current_year = int(dt.datetime.today().strftime("%Y"))
# url = url + (f'/{types[0]}/{current_year}/reg/all')
offense_data_dict = {'passing': [], 'rushing': []}
for i in offense_data_dict.keys():
        for year in range(2003, current_year):
                url = "https://www.nfl.com/stats/team-stats/offense"
                url = url + (f'/{i}/{year}/reg/all')
                response = r.get(url)
                soup = BeautifulSoup(response.text, 'html.parser')
                specific_div = soup.find('table')
                #print(soup)
                rows = specific_div.find_all('tr')
                data = []
                for row in rows:
                        cells = row.find_all(['td', 'th'])  # 'td' for regular cells, 'th' for header cells
                        row_data = [cell.text.strip() for cell in cells] #extract the contents in each cell
                        data.append(row_data)
                columns = data[0]
                data_table = pd.DataFrame(data[1:], columns=columns)
                data_table["Year"] = year
                tables = offense_data_dict[i]
                tables.append(data_table)
        df = pd.concat(tables)
        df['Team'] = df['Team'].apply(lambda x: x.split('\n')[0])
        df['Lng'] = df['Lng'].apply(lambda x: x.split('T')[0])
        offense_data_dict[i] = df


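The nested loop above sends one request per stat type per year and assumes every request succeeds. The sketch below shows one possible way to make that more robust: reuse a single `requests.Session`, set a timeout, stop on HTTP errors with `raise_for_status()`, and pause between requests. The function name `fetch_pages`, the 30-second timeout, and the one-second default pause are all assumptions chosen for illustration.

```python
import time
import requests

def fetch_pages(urls, pause_seconds=1.0, session=None):
    """Yield (url, html) pairs, pausing between requests and failing on HTTP errors."""
    if session is None:
        session = requests.Session()  # reuse one connection pool for all pages
    for index, url in enumerate(urls):
        if index:
            time.sleep(pause_seconds)  # wait before every request except the first
        response = session.get(url, timeout=30)
        response.raise_for_status()  # raise instead of parsing an error page
        yield url, response.text
```

Each yielded `html` string can then be handed to `BeautifulSoup(html, 'html.parser')` exactly as in the loop above.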

In [None]:
offense_data_dict['passing']

Now that we have extracted information about the offense, let's use the same process for the defense.

In [29]:
url = "https://www.nfl.com/stats/team-stats/defense"
types = ["Passing", "Rushing", "Downs", "Fumbles", "Interceptions"] # found by visiting the actual site
current_year = int(dt.datetime.today().strftime("%Y"))
# url = url + (f'/{types[0]}/{current_year}/reg/all')
defense_data_dict = {'passing': [], 'rushing': [], 'Downs': [], 'Fumbles': [], 'Interceptions': []}

for i in defense_data_dict.keys():
        for year in range(2003, current_year):
                url = "https://www.nfl.com/stats/team-stats/defense" # defense pages, not offense
                url = url + (f'/{i}/{year}/reg/all')
                response = r.get(url)
                soup = BeautifulSoup(response.text, 'html.parser')
                specific_div = soup.find('table')
                #print(soup)
                rows = specific_div.find_all('tr')
                data = []
                for row in rows:
                        cells = row.find_all(['td', 'th'])  # 'td' for regular cells, 'th' for header cells
                        row_data = [cell.text.strip() for cell in cells] #extract the contents in each cell
                        data.append(row_data)
                columns = data[0]
                data_table = pd.DataFrame(data[1:], columns=columns)
                data_table["Year"] = year
                tables = defense_data_dict[i]
                tables.append(data_table)
        df = pd.concat(tables)
        df['Team'] = df['Team'].apply(lambda x: x.split('\n')[0])
        defense_data_dict[i] = df # store the combined table, as in the offense loop

In [22]:
# url = "https://www.nfl.com/stats/team-stats/defense"
# def_types = ["Passing", "Rushing", "Downs", "Fumbles", "Interceptions"] # found by visiting the actual site
# off_types = ['passing', 'rushing']
current_year = int(dt.datetime.today().strftime("%Y"))
# url = url + (f'/{types[0]}/{current_year}/reg/all')
defense_data_dict = {'Passing': [], 'Rushing': [], 'Downs': [], 'Fumbles': [], 'Interceptions': []}
offense_data_dict = {'Passing': [], 'Rushing': []}
for side in ['offense', 'defense']:
    key = None
    if side == 'offense':
        key = offense_data_dict
    else:
        key = defense_data_dict
    for i in key.keys():
        for year in range(2003, current_year):
            url = f"https://www.nfl.com/stats/team-stats/{side}"
            url = url + (f'/{i}/{year}/reg/all')
            response = r.get(url)
            soup = BeautifulSoup(response.text, 'html.parser')
            specific_div = soup.find('table')
            rows = specific_div.find_all('tr')
            data = []
            for row in rows:
                cells = row.find_all(['td', 'th'])  # 'td' for regular cells, 'th' for header cells
                row_data = [cell.text.strip() for cell in cells] #extract the contents in each cell
                data.append(row_data)
            columns = data[0]
            data_table = pd.DataFrame(data[1:], columns=columns)
            data_table["Year"] = year
            tables = key[i]
            tables.append(data_table)
        df = pd.concat(tables)
        df['Team'] = df['Team'].apply(lambda x: x.split('\n')[0])
        if 'Lng' in df.columns:
            df['Lng'] = df['Lng'].apply(lambda x: str(x).split('T')[0])
        key[i] = df

In [25]:
x = offense_data_dict['Passing']
x

Unnamed: 0,Team,Att,Cmp,Cmp %,Yds/Att,Pass Yds,TD,INT,Rate,1st,1st%,20+,40+,Lng,Sck,SckY,Year
0,Giants,616,344,55.8,5.9,3642,16,20,68.4,184,29.9,35,8,77,44,259,2003
1,Rams,600,377,62.8,7.2,4287,23,23,81,211,35.2,64,6,48,43,326,2003
2,Buccaneers,592,369,62.3,6.7,3941,27,22,81.5,190,32.1,38,9,76,23,136,2003
3,Lions,588,319,54.2,5.1,2988,17,24,61.1,152,25.8,26,3,72,11,64,2003
4,Colts,569,381,67,7.5,4289,29,10,99,212,37.3,45,9,79,19,110,2003
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27,Broncos,513,337,65.7,7,3566,28,9,96.7,154,30,44,13,60,52,304,2023
28,Steelers,506,323,63.8,6.8,3421,13,9,84.6,153,30.2,41,9,86,36,258,2023
29,Ravens,494,328,66.4,7.9,3881,27,7,102.5,180,36.4,52,9,80,41,246,2023
30,Titans,494,304,61.5,7.1,3512,14,11,83.2,158,32,54,10,70,64,445,2023

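The combined offense and defense loop visits every (side, category, year) combination, so the list of pages can be generated up front and inspected before any request is sent. This is a minimal sketch using `itertools.product`; `stat_page_urls` is a hypothetical helper, and the short year range is only there to keep the printed output small.

```python
from itertools import product

BASE = "https://www.nfl.com/stats/team-stats"

def stat_page_urls(side, categories, years):
    """Yield ((side, category, year), url) for every category and year."""
    for category, year in product(categories, years):
        yield (side, category, year), f"{BASE}/{side}/{category}/{year}/reg/all"

# Two categories and two seasons give four pages.
urls = dict(stat_page_urls("offense", ["passing", "rushing"], range(2021, 2023)))
for key, url in urls.items():
    print(key, url)
```

Keying the result by `(side, category, year)` makes it straightforward to tag each scraped table with its year and to group the tables by category afterwards.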

In [21]:
one = {'a': [], 'b': []}
two = {'c': [1, 2], 'd': [1, 2]}
key = one
key = two
print(key)
key = None
print(key)

{'c': [1, 2], 'd': [1, 2]}
None
