## Web Scraping Part I

This notebook is associated with the lesson titled **Introduction to Web Scraping Part I** in the Web Scraping and Data Storage Module. In this notebook we perform the following tasks:

- Make a request to the pro-football-reference webpage
- Parse through the HTML to find the data we are interested in
- Extract and store the data
- Generalize the process for all seasons

In [None]:
# Imports
import requests
from bs4 import BeautifulSoup
import numpy as np

## 2020 Proof of Concept

Below we explore fetching each team's landing page from the 2020 season. We will first demonstrate it for a single team and then generalize it for all teams.

In [None]:
# 2020 NFL Standings & Team Stats URL
url = "https://www.pro-football-reference.com/years/2020/"

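# Request the page and stop early if the server returns an HTTP error status
resp = requests.get(url)
resp.raise_for_status()

# Parse the returned HTML into a BeautifulSoup object
soup = BeautifulSoup(resp.text, 'html.parser')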

In [None]:
print(soup)

### Automating Table Scraping

In [None]:
# Use BeautifulSoup to find the div that holds the AFC standings table
afc_div = soup.find(id="div_AFC")
# Within that div, find all hyperlinks (<a> tags), which point to the team pages
afc_links = afc_div.find_all('a')

print(afc_links[:5])
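
One thing to watch for in the cell above: `soup.find` returns `None` when no element matches, so calling `.find_all` on the result fails with an `AttributeError` if the div is missing (for example, if the request did not return the page we expected). Below is a minimal sketch of a guard around that lookup. The helper name `find_links_in_div` and the small HTML fragment are hypothetical and only stand in for the real page.

```python
from bs4 import BeautifulSoup


def find_links_in_div(soup, div_id):
    """Return all <a> tags inside the element with the given id.

    Hypothetical helper: raises a descriptive error instead of failing
    with AttributeError when the element is not present.
    """
    div = soup.find(id=div_id)
    if div is None:
        raise ValueError(f"No element with id={div_id!r} found in the page")
    return div.find_all("a")


# Made-up fragment that only imitates the structure of a standings div
sample_html = '<div id="div_AFC"><a href="/teams/buf/2020.htm">Buffalo Bills</a></div>'
sample_soup = BeautifulSoup(sample_html, "html.parser")

# The div exists, so we get its links back
print(find_links_in_div(sample_soup, "div_AFC"))

# The div does not exist in the fragment, so we get a clear error message
try:
    find_links_in_div(sample_soup, "div_NFC")
except ValueError as err:
    print(err)
```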

In [None]:
# Repeat for NFC
nfc_div = soup.find(id="div_NFC")
nfc_links = nfc_div.find_all('a')

print(nfc_links[:5])
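
The lists above hold `<a>` tag objects rather than plain URLs. To request a team page later we would need the `href` value of each tag, and if those values are relative paths they have to be joined with the site's base URL. The sketch below shows one way to do that with `urllib.parse.urljoin`. The HTML fragment, its `href` values, and the helper name `links_to_urls` are assumptions made for illustration, not content taken from the live page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.pro-football-reference.com"

# Made-up fragment that imitates a standings div with relative team links
sample_html = """
<div id="div_AFC">
  <a href="/teams/buf/2020.htm">Buffalo Bills</a>
  <a href="/teams/mia/2020.htm">Miami Dolphins</a>
</div>
"""


def links_to_urls(links, base_url=BASE_URL):
    """Return (link text, absolute URL) pairs for a list of <a> tags.

    Hypothetical helper: tags without an href attribute are skipped.
    """
    return [
        (a.get_text(strip=True), urljoin(base_url, a["href"]))
        for a in links
        if a.get("href")
    ]


sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_links = sample_soup.find(id="div_AFC").find_all("a")

team_urls = links_to_urls(sample_links)
for name, team_url in team_urls:
    print(name, team_url)
```

With the real page, the same helper could be called as `links_to_urls(afc_links)` or `links_to_urls(nfc_links)`.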

## Generalize the Process

Now let's productize what we have learned by exploring the 2020 season. The site is structured so that all of the work we did above applies to any season, so let's pass the year in as an argument to our new function. We can make the URL dynamic by using an f-string and repeat what we did above for the AFC and the NFC. To finish up, we simply merge the two lists of links we generated and return the merged list.

In [None]:
def extract_team_links(year):
    """Takes a season year, requests the NFL Standings & Team Stats page for that year, and returns a list
    of the <a> tags that link to each team's landing page for that season (AFC teams first, then NFC)."""

    resp = requests.get(f"https://www.pro-football-reference.com/years/{year}/")
    soup = BeautifulSoup(resp.text, 'html.parser')
    nfc_div = soup.find(id="div_NFC")
    afc_div = soup.find(id="div_AFC")
    nfc_links = nfc_div.find_all('a')
    afc_links = afc_div.find_all('a')
    team_links = afc_links + nfc_links
    return team_links

extract_team_links(year=2020)
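
To cover several seasons we can call the function once per year. The sketch below is one possible way to do that, pausing between requests so we do not send them back to back. The helper name `collect_team_links` and the default delay are assumptions; check the site's terms of use for what request rate is acceptable. The extractor is passed in as an argument so the loop can be tried out with a stand-in function that makes no network requests.

```python
import time


def collect_team_links(years, extract, delay_seconds=3.0):
    """Call extract(year) for each year and return a dict of year -> result.

    Hypothetical helper: sleeps for delay_seconds between consecutive calls
    (but not before the first one).
    """
    links_by_year = {}
    for i, year in enumerate(years):
        if i > 0:
            time.sleep(delay_seconds)
        links_by_year[year] = extract(year)
    return links_by_year


def fake_extract(year):
    """Stand-in for extract_team_links that returns made-up paths."""
    return [f"/teams/buf/{year}.htm", f"/teams/mia/{year}.htm"]


# Try the loop with the stand-in and no delay
demo = collect_team_links(range(2018, 2021), fake_extract, delay_seconds=0)
print(sorted(demo))
print(demo[2020])
```

In the notebook, the real call would look like `collect_team_links(range(2018, 2021), extract_team_links)`.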