## DA 320 

| Key         | Value |
| ----------- | ----------- |
| Assignment  | Basics of Web Scraping  |
| Author   | Ted Spence        |
| Date   | 2022-10-13        |

This is an example of a Markdown header cell in a Jupyter notebook, including a simple table.

***
# Check whether Jupyter and Python are installed correctly
***

In [2]:
# A basic Python command you can use to verify that Python is running correctly
print("Hello, world!")

Hello, world!


***
# Use "Pip Install" within a Jupyter Notebook
***

The next code segment demonstrates how you can use `pip install` within a notebook.

Note that the `pip install` command must be alone in its segment: if you have *any* Python code at all in the same segment, it won't work!

In [3]:
pip install urllib3

Note: you may need to restart the kernel to use updated packages.
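After installing, it can help to confirm which packages the kernel can actually see. The following is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical, and the package list is an assumption based on the imports used later in this notebook.

```python
from importlib import metadata
from typing import Optional


def installed_version(package_name: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report on the packages this notebook relies on (assumed list).
for name in ["urllib3", "certifi", "pandas"]:
    version = installed_version(name)
    print(f"{name}: {version if version else 'not installed'}")
```

If a package shows as not installed right after a `pip install`, restarting the kernel as the note above suggests is usually the first thing to try.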


***
# Retrieve connection strings, passwords, or secrets from a JSON file 
***

In [None]:
import json

# Demonstration of how to load a file that contains secrets without accidentally leaking those secrets
with open('f:\\git\\teds-secrets.json') as f:
    data = json.load(f)

    # If you want your data to be secure, don't print this variable out!
    # Jupyter will retain a cached version of any printed data and it can be
    # accidentally committed to version control.
    secret_key = data['my-secret-key']

# We can safely print the length of the secret key. That won't leak any sensitive information.
print(f"My secret key is {len(secret_key)} characters in length.")
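The same idea can be wrapped in a small helper so that a missing key produces a clear error without ever echoing the file contents. This is only a sketch: the function name `load_secret` and the throwaway file `demo-secrets.json` are hypothetical and exist purely so the example runs anywhere.

```python
import json
import os
import tempfile


def load_secret(path: str, key: str) -> str:
    """Read one value from a JSON secrets file without ever printing it."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    if key not in data:
        # Report the missing key name only, never the file contents.
        raise KeyError(f"Key '{key}' was not found in the secrets file.")
    return data[key]


# Build a throwaway secrets file so this example does not need a real one.
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "demo-secrets.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump({"my-secret-key": "not-a-real-secret"}, f)

    demo_secret = load_secret(demo_path, "my-secret-key")

# As before, only the length is printed, never the value itself.
print(f"The demo secret is {len(demo_secret)} characters in length.")
```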

***
# Fetch the contents of a webpage
***

In [None]:
import urllib3
import certifi

# Demonstration of how to fetch text from a website using certifi, urllib3, and UTF-8 encoding.
# This also simulates a browser agent.
# Adapted from https://tedspence.com/teaching-python-and-jupyter-cbffb68e2194

# This line of code creates a connection pool for HTTP requests.  Pooling means that a second request to the
# same website is usually faster than the first, because the pool keeps the connection open for a period of time.
http = urllib3.PoolManager(ca_certs=certifi.where())

# These lines of code choose which website we will fetch and which browser we will simulate.
url = 'https://www.wikipedia.org/'
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:105.0) Gecko/20100101 Firefox/105.0"

# This instruction requests the contents of the page and decodes it using UTF-8, the most widely used text
# encoding on the web today.
response = http.request('GET', url, headers={"User-Agent": user_agent})
page_content = response.data.decode('utf-8')

# Print out what we achieved!
print(f"Fetched {len(page_content)} characters from {url} (status: {response.status})")

Fetched 73183 characters from https://www.wikipedia.org/ (status: 200)
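Not every page is served as UTF-8, so a scraper may want to honor the charset named in the `Content-Type` header and fall back to UTF-8 otherwise. The sketch below shows one way to do that with the standard library; the helper name `decode_body` is hypothetical, and sample bytes stand in for `response.data` so that no network access is needed.

```python
from email.message import Message


def decode_body(body: bytes, content_type: str) -> str:
    """Decode response bytes using the charset named in a Content-Type header."""
    header = Message()
    header["Content-Type"] = content_type
    # Fall back to UTF-8 when the header does not name a charset.
    charset = header.get_content_charset() or "utf-8"
    try:
        return body.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown charset name or invalid bytes: decode leniently instead of failing.
        return body.decode("utf-8", errors="replace")


# Sample bytes stand in for response.data in this example.
print(decode_body("café".encode("latin-1"), "text/html; charset=ISO-8859-1"))
print(decode_body("café".encode("utf-8"), "text/html"))
```

With the request above, this could presumably be called as `decode_body(response.data, response.headers.get("Content-Type", ""))`, assuming the server sends that header.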


***
# Searching for facts within text using Regular Expressions
***

In [None]:
import re

# Here is the text we will search
content = "The quick brown fox jumps over the lazy dog."

# This defines our regular expression. https://regex101.com is a great site for testing it!
regular_expression = r"quick brown (.*) jumps"

# Demonstration of how to search and find matches within some text
matches = re.findall(regular_expression, content)
print(f"Found {len(matches)} matches.  The first match is '{matches[0]}'.")

Found 1 matches.  The first match is 'fox'.
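The pattern above works because the sentence contains the word "jumps" only once. On real web pages a pattern often repeats, and `(.*)` is greedy, so it can swallow far more than intended. The sketch below uses a made-up sentence (not from the original notebook) to compare greedy `(.*)` with non-greedy `(.*?)`.

```python
import re

# A made-up sentence in which the pattern occurs twice.
content = "The quick brown fox jumps, and then the quick brown cat jumps."

# Greedy: (.*) grabs as much text as possible, up to the last " jumps".
greedy = re.findall(r"quick brown (.*) jumps", content)

# Non-greedy: (.*?) stops at the first " jumps" after each starting point.
lazy = re.findall(r"quick brown (.*?) jumps", content)

print(f"Greedy found {len(greedy)} match(es): {greedy}")
print(f"Non-greedy found {len(lazy)} match(es): {lazy}")
```

When scraping repeated elements such as table rows or links, the non-greedy form is usually the safer default.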


***
# Formatting Output using Pandas DataFrames
***

In [None]:
import pandas as pd
from IPython.display import HTML

# Construct a Pandas dataframe using named columns
dataset = {
    "City": ['Seattle', 'Bellevue', 'Redmond'], 
    "Year Founded": [1851, 1953, 1912], 
    "Population 2020": [737015, 151854, 73256],
    "Thumbnail": ['https://upload.wikimedia.org/wikipedia/commons/thumb/7/7e/Downtown_Seattle_skyline_from_Kerry_Park_-_October_2019.jpg/536px-Downtown_Seattle_skyline_from_Kerry_Park_-_October_2019.jpg', 'https://upload.wikimedia.org/wikipedia/commons/thumb/9/9a/BellevueAndSeattle.jpg/560px-BellevueAndSeattle.jpg', 'https://upload.wikimedia.org/wikipedia/commons/thumb/1/16/Bicycle_Capital_of_the_Northwest.JPG/1600px-Bicycle_Capital_of_the_Northwest.JPG']
    }
dataFrame = pd.DataFrame(dataset)

# Create a formatter to convert URLs into inline images
def path_to_image_html(path: str):
    return f"<img src=\"{path}\" width=\"60\" />"

# Format the pandas dataframe to output using the formatter we created
formatters = dict(Thumbnail = path_to_image_html)
html = dataFrame.to_html(escape = False, formatters = formatters)
HTML(html)


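|   | City | Year Founded | Population 2020 | Thumbnail |
| - | ---- | ------------ | --------------- | --------- |
| 0 | Seattle | 1851 | 737015 | (image) |
| 1 | Bellevue | 1953 | 151854 | (image) |
| 2 | Redmond | 1912 | 73256 | (image) |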
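Because `to_html` is called with `escape = False`, whatever the formatter returns is placed into the page as raw HTML. When the URLs come from scraped data rather than a hand-written list, it may be worth escaping them first. The following is a minimal sketch of a safer formatter; the name `safe_image_html` and the example URLs are hypothetical.

```python
from html import escape


def safe_image_html(path: str, width: int = 60) -> str:
    """Build an <img> tag, escaping the URL so it cannot break out of the attribute."""
    return f'<img src="{escape(path, quote=True)}" width="{width}" />'


# A well-behaved URL passes through unchanged.
print(safe_image_html("https://example.com/a.jpg"))

# A URL containing a double quote is neutralized instead of injecting a new attribute.
print(safe_image_html('https://example.com/x.jpg" onerror="alert(1)'))
```

It could be used in place of `path_to_image_html` in the `formatters` dictionary without other changes.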
