## Web Scraping Part II

This notebook is associated with the lesson titled **Introduction to Web Scraping Part 2** in the Web Scraping and Data Storage Module. In this notebook we perform the following tasks:

- Explore the output of `extract_team_links`
- Retrieve the URLs associated with each team's boxscore hyperlinks
- Consolidate the links to all games played during a season

In [None]:
# Imports
import requests
from bs4 import BeautifulSoup
import numpy as np

## Reviewing Output of extract_team_links

In [None]:
# copy of function created in the previous lesson
def extract_team_links(year):
    """Takes a season year, requests the NFL Standings & Team Stats page for that year, and returns a list
    of the anchor (<a>) elements found in the AFC and NFC standings sections, each linking to a team's
    landing page for that season."""
    
    resp = requests.get(f"https://www.pro-football-reference.com/years/{year}/")
    soup = BeautifulSoup(resp.text, 'html.parser')
    nfc_div = soup.find(id="div_NFC")
    afc_div = soup.find(id="div_AFC")
    nfc_links = nfc_div.find_all('a')
    afc_links = afc_div.find_all('a')
    team_links = afc_links + nfc_links
    return team_links


In [None]:
team_links = extract_team_links(year=2020)

print(f"Element: {team_links[0]}")
print(f"Element Type: {type(team_links[0])}")
print(f"Reference: {team_links[0]['href']}")


## Retrieve URLs from Boxscore Hyperlinks

In [None]:
sample_href = team_links[0]["href"]
full_url = "https://www.pro-football-reference.com" + sample_href
print(full_url)
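
String concatenation works here because each `href` begins with a slash. As a hedged alternative, the standard library's `urllib.parse.urljoin` can build the same URL and also copes with an `href` that is already absolute. The suffix below is a hypothetical value written in the same shape as the `href` strings above, not one read from the site.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.pro-football-reference.com"

# Hypothetical suffix, shaped like the href values returned by extract_team_links
example_suffix = "/teams/buf/2020.htm"

# A root-relative suffix is resolved against the site's domain
joined = urljoin(BASE_URL, example_suffix)
print(joined)

# An href that is already a full URL is returned unchanged
already_absolute = urljoin(BASE_URL, "https://example.com/page.htm")
print(already_absolute)
```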

In [None]:
def extract_boxscore_links(team_season_overview_suffix):
    """Takes the URL suffix of a team's season overview page, requests the page, and extracts the href of
    every hyperlink whose text is 'boxscore'. Returns a list of URL suffix strings covering all of the
    team's games during the season."""
    
    full_url = "https://www.pro-football-reference.com" + team_season_overview_suffix
    resp = requests.get(full_url)
    soup = BeautifulSoup(resp.text, 'html.parser')
    link_elements = [a for a in soup.find_all("a") if a.text == 'boxscore']
    links = [l['href'] for l in link_elements]
    return links
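
To see the text filter in isolation, without sending a request, the same list comprehension can be run against a small HTML string. This is only a sketch: the fragment and its `href` values are made up for illustration and are not copied from a real team page.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating a schedule table; not real page content
sample_html = """
<table>
  <tr><td><a href="/boxscores/202009130buf.htm">boxscore</a></td>
      <td><a href="/teams/nyj/2020.htm">New York Jets</a></td></tr>
  <tr><td><a href="/boxscores/202009200mia.htm">boxscore</a></td>
      <td><a href="/teams/mia/2020.htm">Miami Dolphins</a></td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Keep only the anchors whose visible text is exactly 'boxscore'
boxscore_hrefs = [a["href"] for a in soup.find_all("a") if a.text == "boxscore"]
print(boxscore_hrefs)
```

Only the two boxscore anchors pass the filter; the team-name anchors are dropped. If the link text ever carried surrounding whitespace, comparing `a.get_text(strip=True)` instead would be a more tolerant variant.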

In [None]:
links = extract_boxscore_links(team_season_overview_suffix=sample_href)
links

## Join Functions

In [None]:
def unique_game_links(year):
    """Takes a year. Extracts each team's season overview URL. For each team, extracts the links to all games
    the team played during the season. Merges all game links and removes duplicates. Returns a sorted list of
    URL suffix strings."""

    # One list of boxscore suffixes per team
    all_boxscores = [extract_boxscore_links(link['href']) for link in extract_team_links(year)]
    # Flatten the per-team lists into a single array
    flattened_list = np.hstack([np.array(b) for b in all_boxscores])
    # np.unique sorts and removes duplicates; tolist() returns a plain list, as the docstring states
    unique_links = np.unique(flattened_list).tolist()
    return unique_links
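
Duplicates are expected because a game involves two teams, so its boxscore link should show up once in each team's list. The sketch below shows the flatten-then-deduplicate step on made-up data, using only the standard library (`itertools.chain` and `set`) in place of `np.hstack` and `np.unique`. The suffix strings are hypothetical placeholders.

```python
from itertools import chain

# Hypothetical per-team lists; each game appears once for each of its two teams
per_team = [
    ["/boxscores/g1.htm", "/boxscores/g2.htm"],  # team A
    ["/boxscores/g1.htm", "/boxscores/g3.htm"],  # team B
    ["/boxscores/g2.htm", "/boxscores/g3.htm"],  # team C
]

# Flatten the list of lists into a single list
flattened = list(chain.from_iterable(per_team))

# Drop duplicates with a set, then sort for a stable order
unique_sorted = sorted(set(flattened))

print(len(flattened), len(unique_sorted))
print(unique_sorted)
```

Six links collapse to three unique games, which is the halving to expect when every game is listed by both of its participants.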

In [None]:
links = unique_game_links(year=2020)
links