## DA 320 

| Key         | Value |
| ----------- | ----------- |
| Assignment  | Basics of Web Scraping  |
| Author   | Ted Spence        |
| Date   | 2022-10-13        |

This example notebook walks through basic web scraping techniques: fetching web pages, searching text with regular expressions, and formatting the results.

***
# Fetch the contents of a webpage
***

In [None]:
import urllib3
import certifi

# Demonstration of how to fetch text from a website using certifi, urllib3, and UTF-8 encoding.
# This also simulates a browser agent.
# Adapted from https://tedspence.com/teaching-python-and-jupyter-cbffb68e2194

# This line of code generates a pool for HTTP requests.  Pooling means that a second request to the same
# website will be faster than the first one since it will retain an open connection for a period of time.
http = urllib3.PoolManager(ca_certs=certifi.where())

# These lines of code choose which website we will fetch and which browser we will simulate.
url = 'https://www.wikipedia.org/'
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:105.0) Gecko/20100101 Firefox/105.0"

# This instruction requests the contents of the page and decodes the raw bytes as UTF-8, the most
# widely used text encoding on the web today.
response = http.request('GET', url, headers={"User-Agent": user_agent})
page_content = response.data.decode('utf-8')

# Print out what we achieved!
print(f"Fetched {len(page_content)} characters from {url} (status: {response.status})")
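
The cell above assumes the page is encoded as UTF-8. As a minimal sketch of a more defensive approach, the helper below reads the charset from a `Content-Type` header value and falls back to UTF-8 when none is declared. The name `decode_body` and its parameters are hypothetical and are not part of urllib3; the header parsing uses the standard library's `email.message.Message`.

```python
from email.message import Message


def decode_body(data: bytes, content_type: str = "", default: str = "utf-8") -> str:
    """Decode raw response bytes using the charset named in a Content-Type header.

    Hypothetical helper for illustration only. Falls back to `default` when the
    header declares no charset or names an encoding Python does not know.
    """
    header = Message()
    header["Content-Type"] = content_type or "text/html"
    charset = header.get_content_charset() or default
    try:
        # errors="replace" substitutes U+FFFD for undecodable bytes instead of raising
        return data.decode(charset, errors="replace")
    except LookupError:
        return data.decode(default, errors="replace")


# A Latin-1 page that declares its encoding, and a UTF-8 page that declares nothing
print(decode_body("café".encode("latin-1"), "text/html; charset=ISO-8859-1"))
print(decode_body("café".encode("utf-8")))
```

With urllib3, the header value could be read with `response.headers.get("Content-Type", "")` and passed in alongside `response.data`.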

***
# Fetch the contents of multiple pages in a loop
***

In [None]:
import urllib3
import certifi

# Construct an HTTP request pool to make multiple page fetches faster
http = urllib3.PoolManager(ca_certs=certifi.where())

# Loop through the years 1914 - 1918 and gather data from Wikipedia.
# Note that Python ranges stop *before* the final number.
for year in range(1914, 1919):
    print(f"Collecting data for {year}...")
    url = f"https://en.wikipedia.org/wiki/{year}"
    response = http.request('GET', url, headers={'User-Agent': 'Mozilla/5.0'})
    page_content = str(response.data, "utf-8")

    # Here is where we would apply regular expressions to the page content
    print(f"Fetched {len(page_content)} characters from {url} (status: {response.status})")
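
The loop above leaves a placeholder where regular expressions would be applied to each page. One possible sketch of that step pulls the `<title>` text out of the fetched HTML. The function name `extract_title`, the pattern, and the sample HTML are all assumptions made for illustration; the sample is not real Wikipedia output.

```python
import re
from typing import Optional

# Hypothetical pattern: captures the text between <title> and </title>.
# DOTALL lets the title span multiple lines; IGNORECASE accepts <TITLE> too.
TITLE_PATTERN = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(page_content: str) -> Optional[str]:
    """Return the page title, or None if the page has no <title> element."""
    match = TITLE_PATTERN.search(page_content)
    return match.group(1).strip() if match else None


# Made-up page content standing in for what the loop would fetch
sample_page = "<html><head><title>1914 - Wikipedia</title></head><body></body></html>"
print(extract_title(sample_page))
print(extract_title("<p>A fragment with no title</p>"))
```

Regular expressions are fine for simple, predictable patterns like this one, but they become fragile on nested or irregular HTML, where a dedicated HTML parser is usually more robust.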

***
# Searching for facts within text using Regular Expressions
***

In [None]:
import re

# Here is the text we will search
content = "The quick brown fox jumps over the lazy dog."

# This defines our regular expression. You can use https://regex101.com as a great test site!
regular_expression = r"quick brown (.*) jumps"

# Demonstration of how to search and find matches within some text
matches = re.findall(regular_expression, content)
print(f"Found {len(matches)} matches.  The first match is '{matches[0]}'.")
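
The pattern above uses `(.*)`, which is greedy: it captures as much text as it can. That works for this sentence, but on longer text it can capture more than intended. The sketch below, using a made-up sentence with two occurrences of "jumps", compares the greedy form with the lazy form `(.*?)` and also shows how to guard against an empty result before indexing into it.

```python
import re

# Made-up sample text containing the word "jumps" twice
text = "The quick brown fox jumps over the lazy dog, then jumps again."

# Greedy: .* extends to the LAST " jumps" in the text
greedy = re.findall(r"quick brown (.*) jumps", text)

# Lazy: .*? stops at the FIRST " jumps" it can reach
lazy = re.findall(r"quick brown (.*?) jumps", text)

print(greedy)  # ['fox jumps over the lazy dog, then']
print(lazy)    # ['fox']

# findall returns an empty list when nothing matches, so check before indexing
missing = re.findall(r"slow brown (.*?) jumps", text)
print(missing[0] if missing else "no match")
```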

***
# Formatting output using a loop
***

In [None]:
# Using a loop is the simplest form of printing tabular data.

# Store the data in three parallel Python lists (no Pandas needed yet)
cities = ['Seattle', 'Bellevue', 'Redmond']
founded = [1851, 1953, 1912]
population = [737015, 151854, 73256]

# Loop through the dataset by index, starting at position zero
for i in range(len(cities)):
    print(f"City {cities[i]} was founded in {founded[i]} and has a population of {population[i]}")
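
As an alternative sketch of the same loop, `zip` walks the three lists together without an index, and f-string format specifiers pad each value to a fixed width so the output lines up in columns. The column widths (10, 6, and 12) are arbitrary choices for this example.

```python
cities = ['Seattle', 'Bellevue', 'Redmond']
founded = [1851, 1953, 1912]
population = [737015, 151854, 73256]

# zip pairs up the items at the same position in each list
rows = []
for city, year, people in zip(cities, founded, population):
    # <10 left-aligns in 10 characters; >6 and >12 right-align;
    # the comma adds thousands separators to the population
    rows.append(f"{city:<10}{year:>6}{people:>12,}")

# Print a header using the same widths, then each formatted row
print(f"{'City':<10}{'Year':>6}{'Population':>12}")
for row in rows:
    print(row)
```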

***
# Formatting Output using Pandas DataFrames
***

In [None]:
import pandas as pd
from IPython.display import HTML

# Construct a Pandas dataframe using named columns
dataset = {
    "City": ['Seattle', 'Bellevue', 'Redmond'], 
    "Year Founded": [1851, 1953, 1912], 
    "Population 2020": [737015, 151854, 73256],
    "Thumbnail": ['https://upload.wikimedia.org/wikipedia/commons/thumb/7/7e/Downtown_Seattle_skyline_from_Kerry_Park_-_October_2019.jpg/536px-Downtown_Seattle_skyline_from_Kerry_Park_-_October_2019.jpg', 'https://upload.wikimedia.org/wikipedia/commons/thumb/9/9a/BellevueAndSeattle.jpg/560px-BellevueAndSeattle.jpg', 'https://upload.wikimedia.org/wikipedia/commons/thumb/1/16/Bicycle_Capital_of_the_Northwest.JPG/1600px-Bicycle_Capital_of_the_Northwest.JPG']
    }
dataFrame = pd.DataFrame(dataset)

# Create a formatter to convert URLs into inline images
def path_to_image_html(path: str):
    return f"<img src=\"{path}\" width=\"60\" />"

# Format the pandas dataframe to output using the formatter we created
formatters = dict(Thumbnail = path_to_image_html)
html = dataFrame.to_html(escape = False, formatters = formatters)
HTML(html)
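
Because `to_html` is called with `escape = False`, whatever the formatter returns is placed in the page as raw HTML. That is fine for the trusted URLs in this dataset, but scraped values may contain quotes or angle brackets. A minimal sketch of a more defensive formatter, assuming a hypothetical name `safe_path_to_image_html`, escapes the URL with the standard library's `html.escape` before embedding it:

```python
import html


def safe_path_to_image_html(path: str, width: int = 60) -> str:
    """Hypothetical variant of path_to_image_html that escapes the URL first."""
    # quote=True also escapes double and single quotes, so the value
    # cannot break out of the src="..." attribute
    escaped = html.escape(path, quote=True)
    return f'<img src="{escaped}" width="{width}" />'


# An ordinary URL passes through unchanged
print(safe_path_to_image_html('https://example.com/a.jpg'))

# A value containing quotes is neutralized instead of adding a new attribute
print(safe_path_to_image_html('x" onerror="alert(1)'))
```

This function could be passed to `formatters` in place of `path_to_image_html`; the `example.com` URL is a placeholder, not part of the dataset.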
