### <img style="float: left; padding-right: 10px; width: 45px" src="https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/iacs.png"> CS109A Introduction to Data Science 

## Lab 1: Introduction to Web Scraping

**Harvard University**<br/>
**Fall 2024**<br/>
**Instructors**: Pavlos Protopapas and Natesh Pillai<br/>
**Lab Authors**: Chris Gumb and Eleni Kaxiras

<hr style='height:2px'>

In [1]:
# Importing necessary libraries
# Python standard library
from collections import Counter, defaultdict
import html # for converting escaped characters in HTML content
import json # key-value data structure
import os
import re # regular expressions
import time
# 3rd party libraries
from bs4 import BeautifulSoup # HTML parsing
from IPython.display import HTML # render HTML in the notebook
import matplotlib.pyplot as plt # plotting library
import numpy as np
import pandas as pd # tabular data (Dataframes)
import requests # get/post HTTP requests

## Lab Learning Objectives

When we're done today, you will approach messy, real-world data on the web with the confidence that you can get it into a format that you can use to answer your questions of interest.

Specifically, our learning objectives are:
* Getting HTML content from the web programmatically with [Requests](https://requests.readthedocs.io/en/latest/)
* Navigating the tree-like structure of an HTML document and using that structure to extract desired information with the help of [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
* Wrangling, parsing, and storing information pulled from the web using regular expressions, lists, and dictionaries.

We'll also have a preview of:
* Creating visualizations with `matplotlib`
* Creating a tabular data object (`pandas` DataFrame) from parsed data and saving it as a comma separated values file (csv)


## Nobel Prize Website

The source of our data will be the official site for the Nobel Prize,
https://www.nobelprize.org/.

Some questions we might want to answer:
- Who has been awarded more than one prize?
- How many unique prizes are there and has this always been the case?
- How has the number of recipients changed over time?
- Which prize has the fewest recipients?


## Getting HTML from the Web

If our end goal is to extract data from HTML documents, our first step is to acquire those HTML files from the web.\
We could of course visit a web page in our browser and save the page to disk. But that sort of tedious clicking is what we use programming to avoid!\
There are command line applications like [curl](https://curl.se/) which can be used to pull web data, and options like this can be worked into larger scripts or programming projects. But we're looking for something a bit more... Pythonic.

## Requests

"[Requests](https://requests.readthedocs.io/en/latest/) is an elegant and simple [HTTP](https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol) library for Python, built for human beings."

Its ease of use has made this 3rd party library a very popular option for pulling content from the web via the HTTP `GET` request. This is the request your browser sends to the webserver to view a webpage.

First, we'll need the target URL. Specifically, this will be the page that lists all prizes across all years.

In [2]:
all_prizes_url = 'https://www.nobelprize.org/prizes/lists/all-nobel-prizes/all/'

Then we use the `requests.get` method to grab the content of that URL.\
The returned object is a `Response` which has some useful information beyond the website content. For example, the response status code.

In [3]:
response = requests.get(all_prizes_url)

In [4]:
response.status_code

200

Response [200] is a success status code. Let's google: [`response 200 meaning`](https://www.google.com/search?q=response+200+meaning&oq=response+%5B200%5D+m&aqs=chrome.1.69i57j0l5.6184j0j7&sourceid=chrome&ie=UTF-8). All possible codes and their meanings can be found [here](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status).
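If you'd rather look a code up without leaving Python, the standard library's `http.HTTPStatus` enum maps each numeric code to its standard phrase. Here is a minimal sketch; the helper name `describe_status` is our own and not part of any library.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a string like '200 OK' for a numeric HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for codes the enum doesn't know
    return f"{status.value} {status.phrase}"

print(describe_status(200))  # 200 OK
print(describe_status(404))  # 404 Not Found
```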

What if we accidentally make a `GET` request for a page that doesn't exist?

In [5]:
bogus_url = 'https://www.google.com/oops'
bogus_response = requests.get(bogus_url)
bogus_response.status_code

404

The dreaded "404 NOT FOUND" error! ☠️\
(JupyterHub users may see 403)
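In a script we usually don't want to check `status_code` by hand. One common pattern, sketched below, is to call `Response.raise_for_status()`, which raises an exception for any 4xx or 5xx response. The helper name `get_html` and the 10-second timeout are our own choices for illustration.

```python
import requests

def get_html(url, timeout=10):
    """Hypothetical helper: return a page's HTML, or raise on a 4xx/5xx response."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response.text

# Example usage (requires network access):
# try:
#     page = get_html('https://www.google.com/oops')
# except requests.HTTPError as err:
#     print('Request failed:', err)
```

Failing loudly here keeps an error page from being silently parsed as if it were the data we wanted.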

**HTML Content**

Now let's look at the HTML we just scraped, which is contained in the `text` attribute of the `Response` object we called `response`.

In [6]:
# peek at the HTML
response.text[:200]

'\t<!DOCTYPE html>\n<html lang="en-US" class="no-js">\n<head>\n<meta charset="UTF-8"><script type="text/javascript">(window.NREUM||(NREUM={})).init={privacy:{cookies_enabled:true},ajax:{deny_list:["bam.eu0'

In [7]:
# stash it in a variable (note: this name shadows the `html` module imported in the first cell)
html = response.text

Great! We now have the page's HTML data in a Python string. 

## HTML

The first step in web scraping is to understand the HTML structure of the webpage.

HTML (**H**yper **T**ext **M**arkup **L**anguage), is the standard markup language for documents designed to be displayed in a web browser. It is often complemented by CSS for styling and JavaScript for dynamic, interactive content.

HTML describes the structure of a web page *semantically*. That is, it describes not only the position of each element in the document's tree structure (e.g., 2nd child of tree's root), but also what each element *means* (title, level 2 header, paragraph, etc.).

### Standard HTML documents

HTML documents generally have the following structure:

\<!DOCTYPE html>

<html>

    <head>

        <title>Page Title</title>

    </head>

    <body>

        <h1>Page Heading</h1>

        <p>The first paragraph of page</p>

        ...
        ...
        ...

    </body>

\</html>

### What does each of these tags indicate?

- The **\<!DOCTYPE html>** declaration defines that this document is an HTML5 document

- The **\<html>** element is the root element of an HTML page

- The **\<head>** element contains meta information about the HTML page

- The **\<title>** element specifies a title for the HTML page (which is shown in the browser's title bar or in the page's tab)

- The **\<body>** element defines the document's body, and is a container for all the visible contents, such as headings, paragraphs, images, hyperlinks, tables, lists, etc.

- The **\<h1>** element defines a large heading. There are other heading tags in HTML: **\<h2>, \<h3>, \<h4>, \<h5>, \<h6>**

- The **\<p>** element defines a paragraph
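To see the nesting of these tags as a tree, here is a small sketch that uses Python's built-in `html.parser` module to print each tag indented by its depth. The class name `TagPrinter` and the one-line `page` string are made up for illustration, and the sketch assumes every tag is explicitly closed.

```python
from html.parser import HTMLParser

class TagPrinter(HTMLParser):
    """Record each start tag and text node, indented by its depth in the tree."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + f"<{tag}>")
        self.depth += 1  # children of this tag sit one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1  # closing a tag moves us back up one level

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only text
            self.lines.append("  " * self.depth + repr(data.strip()))

page = (
    "<html><head><title>Page Title</title></head>"
    "<body><h1>Page Heading</h1><p>The first paragraph of page</p></body></html>"
)
printer = TagPrinter()
printer.feed(page)
print("\n".join(printer.lines))
```

The printed outline shows `<head>` and `<body>` as children of `<html>`, with the title, heading, and paragraph one level further down.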


### What is an HTML Element?

An HTML element is defined by a start tag, some content, and an end tag:

**\<tagname> Tag content \</tagname>**

An example of an HTML element is as follows:

**\<h1> The Page Heading \</h1>**



Let's take a look at the [Nobel Prizes across all years](https://www.nobelprize.org/prizes/lists/all-nobel-prizes/all/) and inspect the HTML under the hood: right-click on the page and select `inspect`. You should see something like this.
![20240906_02h56m43s_grim.png](attachment:b2472872-0fcc-419d-a890-ddcc533e753f.png)

### Mapping the HTML tags to the webpage

When you inspect, try to map each element on the webpage to its HTML. 

![html2.png](attachment:ace6ff66-272d-4742-a12b-d9f1390dc1c2.png)

But how should we go about parsing and navigating all this text to find what we are interested in? You might be tempted to try...

## Regular Expressions

You can find specific patterns or strings in text by using Regular Expressions (often abbreviated "re", "regex", or "regexp"): This is a pattern matching mechanism used throughout Computer Science and programming (it's not just specific to Python). 

Some great resources that we recommend, if you are interested in them (could be very useful for a homework problem):
- https://docs.python.org/3/howto/regex.html 
- https://regexone.com (tutorial)
- https://regex101.com (testing patterns)

Regular expressions use specific characters to stand for sets of characters or modifiers. A regex itself defines a set of strings which we call "matches."\
Some examples: 
- ```\s``` : Matches any Unicode whitespace character: spaces, tabs, newlines
- ```\S``` : Matches any character which is **NOT** a Unicode whitespace character (note: capital 'S')
- ```\d``` : Matches any Unicode decimal digit, `0`, `1`, ..., `9`
- ```\w``` : Matches any "word" characters: letters, numbers, and underscores
- ```*``` : Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible.
- ```+``` : Similar to `*` but matches 1 or more repetitions of the preceding RE
- ```{n}```: Matches exactly `n` occurrences of the preceding RE. Ex: `b{3}` would match `'bbb'`

Python includes [re](https://docs.python.org/3.12/library/re.html) in its standard library. We can use its `findall` function to find all substrings in a target string that match a regular expression of our devising. 
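Before turning to the scraped page, here is a quick sketch of `re.findall` with a few of the special characters listed above. The `sample` sentence is made up for illustration.

```python
import re

sample = "Marie Curie won in 1903 and 1911; Pierre Curie won in 1903."

years = re.findall(r"\d{4}", sample)       # exactly four digits in a row
curies = re.findall(r"\w+ Curie", sample)  # a word, a space, then 'Curie'
tokens = re.findall(r"\S+", "a  b\tc")     # runs of non-whitespace characters

print(years)   # ['1903', '1911', '1903']
print(curies)  # ['Marie Curie', 'Pierre Curie']
print(tokens)  # ['a', 'b', 'c']
```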

**Let's find all the occurrences of 'Marie' in our `html` string:**

In [8]:
re.findall(r'Marie', html)

['Marie', 'Marie', 'Marie', 'Marie']

**Now use ```\S``` to match 'Marie' + ' ' + 'any character which is not a Unicode whitespace character':**

In [9]:
re.findall(r'Marie \S', html)

['Marie G', 'Marie L', 'Marie C', 'Marie C']

**🤔 Q: How would we find the last names that come after Marie?**


In [10]:
# your code here

This is an example of code that is *intended* to grab the h3 header containing an anchor (link) tag with non-empty text content.

In [11]:
first_title = re.findall(r'<h3>\n<a.*>\n.+<\/a>\n<\/h3>', html)[0]
print(first_title)

<h3>
<a href="https://www.nobelprize.org/prizes/physics/2023/summary/">
The Nobel Prize in Physics 2023 </a>
</h3>


It seems to work in this particular situation. But regular expressions can quickly become long and hard to read, and while they prove useful in many scenarios (like those in HW0!), it is ill-advised to use them to parse HTML tags. This is because regular expressions are not expressive enough to capture all possible HTML structures, such as arbitrarily nested tags. If you're curious you can read about the distinctions between [Regular Grammars](https://en.wikipedia.org/wiki/Regular_grammar) (regex) and [Context-free Grammars](https://en.wikipedia.org/wiki/Context-free_grammar) (HTML).
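To make the limitation concrete, here is a small sketch on two made-up HTML strings. Neither the greedy nor the lazy pattern extracts `<div>` elements correctly for both sibling and nested tags.

```python
import re

siblings = "<div>a</div><div>b</div>"
nested = "<div>outer <div>inner</div> tail</div>"

greedy = r"<div>.*</div>"   # .* grabs as much as possible
lazy = r"<div>.*?</div>"    # .*? grabs as little as possible

print(re.findall(greedy, siblings))  # ['<div>a</div><div>b</div>']  (two elements merged)
print(re.findall(lazy, siblings))    # ['<div>a</div>', '<div>b</div>']  (correct)
print(re.findall(greedy, nested))    # ['<div>outer <div>inner</div> tail</div>']  (correct)
print(re.findall(lazy, nested))      # ['<div>outer <div>inner</div>']  (outer element cut short)
```

Each pattern gets one case right and the other wrong, which is why we hand the job to a real HTML parser.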

## Parse the HTML with BeautifulSoup

"[BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/) is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work."

BeautifulSoup works by parsing the raw html text into a tree. Every tag in the raw html becomes a node in the tree. We can then navigate the tree by selecting a node and querying its parent, children, siblings, etc.

![html-dom.png](attachment:5ab1c47c-976a-428f-8e57-c21cef6afa8b.png)

In [12]:
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid a parser-guessing warning

Key BeautifulSoup functions we’ll be using in this lab:

**Visually Inspecting**
- **`tag.prettify()`**: Returns cleaned-up version of raw HTML, useful for printing

**Searching**
- **`tag.select(selector)`**: Return a list of nodes matching a [CSS selector](https://developer.mozilla.org/en-US/docs/Learn/CSS/Introduction_to_CSS/Simple_selectors)
- **`tag.select_one(selector)`**: Return the first node matching a CSS selector
- **`tag.text`/`soup.get_text()`**: Returns visible text of a node (e.g.,"`<p>Some text</p>`" -> "Some text")
- **`tag.contents`**: A list of the immediate children of this node

You can also use these functions to find nodes.
- **`tag.find_all(tag_name, attrs=attributes_dict)`**: Returns a list of matching nodes
- **`tag.find(tag_name, attrs=attributes_dict)`**: Returns first matching node

You are not limited to searching for tags. You can also search for text, even using regular expressions:
- **`tag.find("div", string="Text of Interest")`**: Returns the first `div` tag whose text content is exactly "Text of Interest"
- **`tag.find("a", string=re.compile(r"\d+"))`**: Returns the first `a` tag whose text content contains one or more digits
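Here is a minimal sketch of these search methods. It runs on a tiny, made-up HTML snippet rather than the Nobel page, and parses it with Python's built-in `html.parser`; the names `snippet` and `mini_soup` are ours.

```python
import re
from bs4 import BeautifulSoup

snippet = (
    '<div class="card">'
    '<h3><a href="/physics/2023/">Physics 2023</a></h3>'
    '<p class="note">first</p><p>second</p>'
    '</div>'
)
mini_soup = BeautifulSoup(snippet, 'html.parser')

# CSS selectors
print(mini_soup.select_one('h3 a').get('href'))             # /physics/2023/
print([p.text for p in mini_soup.select('p')])              # ['first', 'second']

# find / find_all by tag name and attributes
print(mini_soup.find('p', attrs={'class': 'note'}).text)    # first
print(len(mini_soup.find_all('p')))                         # 2

# searching by text, literally or with a regular expression
print(mini_soup.find('p', string='second'))                 # <p>second</p>
print(mini_soup.find('a', string=re.compile(r'\d+')).text)  # Physics 2023
```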

Or you can move about the tree using relative position:

**Going Down**
- **`tag.children`**: Returns an iterator over the nodes directly below `tag` in the tree (nodes include text strings as well as tags).
- **`tag.descendants`**: Returns an iterator over *all* nodes below `tag`, at any depth. Like a recursive version of `tag.children`

**Going Up**
- **`tag.parent`**: Returns the tag directly above `tag`
- **`tag.parents`**: Returns an iterator over the ancestors of `tag`, all the way to the root of the tree

**Going Sideways**
- **`tag.next_sibling`**: Returns the next node with the same parent as `tag`
- **`tag.previous_sibling`**: Returns the previous node with the same parent as `tag`
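The navigation attributes can be sketched on a tiny, made-up snippet (again parsed with the built-in `html.parser`). The snippet deliberately has no whitespace between tags; on real pages, siblings and children are often whitespace-only text nodes, so expect to filter those out.

```python
from bs4 import BeautifulSoup, Tag

snippet = '<div class="card"><h3><a href="/x/">Title</a></h3><p>first</p><p>second</p></div>'
mini_soup = BeautifulSoup(snippet, 'html.parser')
card = mini_soup.div
first_p = mini_soup.find('p')

# Going down
print([child.name for child in card.children])                        # ['h3', 'p', 'p']
print([d.name for d in card.descendants if isinstance(d, Tag)])       # ['h3', 'a', 'p', 'p']

# Going up
print(first_p.parent.name)                                            # div
print([ancestor.name for ancestor in first_p.parents])                # ['div', '[document]']

# Going sideways
print(first_p.next_sibling.text)                                      # second
print(first_p.previous_sibling.name)                                  # h3
```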


BeautifulSoup is a very powerful library -- much more info here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

### Let's practice some BeautifulSoup commands...

**Output a portion of HTML but with nice indenting**

In [13]:
pretty_soup = soup.prettify()
print(pretty_soup[:500])

<!DOCTYPE html>
<html class="no-js" lang="en-US">
 <head>
  <meta charset="utf-8"/>
  <script type="text/javascript">
   (window.NREUM||(NREUM={})).init={privacy:{cookies_enabled:true},ajax:{deny_list:["bam.eu01.nr-data.net"]},distributed_tracing:{enabled:true}};(window.NREUM||(NREUM={})).loader_config={agentID:"212354677",accountID:"3196970",trustKey:"3196970",xpid:"VwcOV19UCBACVVNRAgUCUlc=",licenseKey:"NRJS-0940f7eaf4cb46e1cfe",applicationID:"212354543"};;/*! For license information please see


**Extract the text of the first “title” tag** 

In [15]:
# your code here
soup.select_one('title').text

'NobelPrize.org'

Get the `href` of the first anchor tag (`<a ....></a>`) that is itself contained inside an `<h3>` header tag:

In [17]:
# your code here
soup.select('h3 a')[0].get('href')

'https://www.nobelprize.org/prizes/physics/2023/summary/'

## Extracting award data

Let's use the structure of the HTML document to extract the data we want.

From inspecting the page in our browser, we found that each award is in a `div` with a `card-prize` class. Let's get the first one.

When using `select` we can prepend a `'.'` to a string to specify that we are talking about the tag's class attribute, rather than the name of a type of tag. This is just like CSS syntax.

In [18]:
tag = soup.select('.card-prize')[0]

In [19]:
# Print the tag with 'pretty' indenting
print(tag.prettify())

<div class="card-prize">
 <h3>
  <a href="https://www.nobelprize.org/prizes/physics/2023/summary/">
   The Nobel Prize in Physics 2023
  </a>
 </h3>
 <div class="card-prize--laureates">
  <div class="card-prize--laureates --three">
   <div class="card-prize--laureates--links">
    <a class="card-prize--laureates--links--link" href="https://www.nobelprize.org/prizes/physics/2023/agostini/facts/">
     Pierre Agostini
    </a>
    ,
    <a class="card-prize--laureates--links--link" href="https://www.nobelprize.org/prizes/physics/2023/krausz/facts/">
     Ferenc Krausz
    </a>
    and
    <a class="card-prize--laureates--links--link" href="https://www.nobelprize.org/prizes/physics/2023/lhuillier/facts/">
     Anne L’Huillier
    </a>
   </div>
   <blockquote class="card-prize--laureates--motivation --last">
    “for experimental methods that generate attosecond pulses of light for the study of electron dynamics in matter”
   </blockquote>
  </div>
 </div>
</div>



In [20]:
# Render the HTML for the tag in our notebook
HTML(tag.prettify())

#### Let's practice getting data out of a BS Node

### Prize title (no year)

In [22]:
# Prize Name
tag.h3.text.strip()[:-5]

'The Nobel Prize in Physics'

In [23]:
# A regex would be more robust but also more involved
pattern = r'(.*) \d{4}'
s = 'Blah blah 1991'
re.sub(pattern, r'\1', s)

'Blah blah'

### Prize year

In [24]:
# Prize Year
int(tag.h3.text.strip().split()[-1])

2023
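As an alternative to slicing and splitting separately, the title and year can be pulled apart in one pass with regex capture groups. This is only a sketch: the helper name `split_title` is hypothetical, and it assumes the header text always ends in a four-digit year.

```python
import re

def split_title(text):
    """Hypothetical helper: split 'Some Prize 2023' into ('Some Prize', 2023)."""
    match = re.fullmatch(r'(.*\S)\s+(\d{4})', text.strip())
    if match is None:
        raise ValueError(f'no trailing four-digit year in {text!r}')
    return match.group(1), int(match.group(2))

print(split_title('\nThe Nobel Prize in Physics 2023 '))
# ('The Nobel Prize in Physics', 2023)
```

Unlike `[:-5]`, this raises an error instead of silently chopping characters when a header doesn't follow the expected pattern.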

### Prize description
(i.e., the reason the prize was awarded)

In [25]:
# Description
tag.blockquote.text.strip()

'“for experimental methods that generate attosecond pulses of light for the study of electron dynamics in matter”'

### Laureates
Each prize can have multiple laureates. For each laureate, the HTML contains their name as well as a URL to a page with additional facts. 

For our example prize, let's create a `list` of `tuples` where the first element in each tuple is the laureate's name and the second element is the URL of their facts page.

This one can be tricky...

**Hints:** Be sure to consult some of the BeautifulSoup methods listed above.

In [26]:
# Laureates 
tag.find_all('a', attrs={'class': re.compile(r'.*')})

[<a class="card-prize--laureates--links--link" href="https://www.nobelprize.org/prizes/physics/2023/agostini/facts/">
 Pierre Agostini</a>,
 <a class="card-prize--laureates--links--link" href="https://www.nobelprize.org/prizes/physics/2023/krausz/facts/">
 Ferenc Krausz</a>,
 <a class="card-prize--laureates--links--link" href="https://www.nobelprize.org/prizes/physics/2023/lhuillier/facts/">
 Anne L’Huillier</a>]

In [27]:
[t.text.strip() for t in tag.find_all('a', {'class': re.compile(r'.*')})]

['Pierre Agostini', 'Ferenc Krausz', 'Anne L’Huillier']

In [28]:
[t.get('href') for t in tag.find_all('a', {'class': re.compile(r'.*')})]

['https://www.nobelprize.org/prizes/physics/2023/agostini/facts/',
 'https://www.nobelprize.org/prizes/physics/2023/krausz/facts/',
 'https://www.nobelprize.org/prizes/physics/2023/lhuillier/facts/']

In [29]:
[(t.text.strip(), t.get('href')) for t in tag.find_all('a', {'class': re.compile(r'.*')})]

[('Pierre Agostini',
  'https://www.nobelprize.org/prizes/physics/2023/agostini/facts/'),
 ('Ferenc Krausz',
  'https://www.nobelprize.org/prizes/physics/2023/krausz/facts/'),
 ('Anne L’Huillier',
  'https://www.nobelprize.org/prizes/physics/2023/lhuillier/facts/')]

### Helper Functions
Let's now take the code we developed above for extracting various information and place it into helper functions.

This will allow us to reuse the same code logic on different inputs. For simple, one-line functions it can be nice to define them using the `lambda` keyword rather than `def`.
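To illustrate the two styles on something unrelated to the exercise, here is the same one-line helper written with `def` and with `lambda`. The names `clean_text_def` and `clean_text` and the `raw` string are made up for this sketch.

```python
# Two equivalent ways to define the same one-line helper
def clean_text_def(text):
    """Collapse runs of whitespace and trim both ends."""
    return ' '.join(text.split())

clean_text = lambda text: ' '.join(text.split())

raw = '\n  The Nobel Prize   in Physics 2023 '
print(clean_text_def(raw))  # The Nobel Prize in Physics 2023
print(clean_text(raw))      # The Nobel Prize in Physics 2023
```

The `def` version has room for a docstring and shows its own name in tracebacks, so it is usually the better choice once a helper grows beyond a single expression.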

In [None]:
# helper functions
# your code here



We can now use our helper functions to create a dictionary from each prize with keys 'title', 'year', 'laureates', and 'description'.


In [None]:
prize_soup = get_prizes(soup)

parsed_prizes = []
for p in prize_soup:
    d = {
        'title': get_title(p),
        'year': get_year(p),
        'laureates': get_laureates(p),
        'description': get_desc(p),
    }
    parsed_prizes.append(d)

In [None]:
# Last prize in our list
parsed_prizes[-1]

In [None]:
# How many prizes?
len(parsed_prizes)

## Asking Questions

### "What are all the unique prize titles?"

In [None]:
# Unique prize titles
# your code here

### "When was the Economics prize added?"
Save your result as `first_econ`

In [None]:
# First year for Economics prize
# your code here

In [None]:
first_econ

### "Who has been awarded more than one prize?"

In [None]:
# Simple append example
l1 = [1,2,3]
l2 = [4,5,6]
l1.append(l2)
l1

In [None]:
# Simple extend example
l1 = [1,2,3]
l2 = [4,5,6]
l1.extend(l2)
l1

In [None]:
# List of all laureate name/link tuples
laureates = []
for p in parsed_prizes:
    laureates.extend(p['laureates'])
laureates[:10]

In [None]:
[l[0] for l in laureates if 'Marie' in l[0]]

In [None]:
len(laureates)

In [None]:
counts = Counter(l[0] for l in laureates)

In [None]:
# display only those with a count > 1
[(name, count) for name, count in counts.items() if count > 1]

In [None]:
# sorted by count
sorted([(name, count) for name, count in counts.items() if count > 1], key=lambda x: x[1], reverse=True)
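`Counter` can also do the sorting for us: its `most_common` method returns `(item, count)` pairs ordered from most to least frequent. A small sketch on made-up names (`names`, `toy_counts`, and `repeats` are our own):

```python
from collections import Counter

names = ['Ada', 'Grace', 'Ada', 'Alan', 'Grace', 'Ada']  # made-up data
toy_counts = Counter(names)

print(toy_counts.most_common(2))  # [('Ada', 3), ('Grace', 2)]

# keep only the items that appear more than once, already sorted by count
repeats = [(name, count) for name, count in toy_counts.most_common() if count > 1]
print(repeats)  # [('Ada', 3), ('Grace', 2)]
```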

### "How many laureates were there for each year?"

Initial plan: Iterate over prizes and accumulate the count of winners for each year.

Data structure idea 💡: A dictionary where keys are years and values are number of laureates that year.

Save your resulting dictionary as `winners_per_year`.

**Hint:** This accumulation method can cause problems when you encounter a key that is not already in the dictionary. You could add some conditional logic or pre-populate all the keys, but the `defaultdict` from the `collections` module makes this very simple.

In [None]:
# This causes an error since the key 2023 isn't in the dict
try:
    d = {}
    d[2023] += 1
    d
except Exception as e:
    print(e)
    print("Oops!")
d

In [None]:
# defaultdict will assume a sane default for missing keys
# for ints it is 0
d = defaultdict(int)
d[2023] += 1
d
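For comparison, here are the alternatives the hint mentions side by side, counting letters in a made-up string rather than laureates. All three build the same mapping; the variable names are ours.

```python
from collections import defaultdict

letters = 'mississippi'  # made-up data

# Option 1: conditional logic
counts_if = {}
for ch in letters:
    if ch not in counts_if:
        counts_if[ch] = 0
    counts_if[ch] += 1

# Option 2: dict.get with a default value
counts_get = {}
for ch in letters:
    counts_get[ch] = counts_get.get(ch, 0) + 1

# Option 3: defaultdict supplies the 0 for us
counts_dd = defaultdict(int)
for ch in letters:
    counts_dd[ch] += 1

print(counts_get)                                  # {'m': 1, 'i': 4, 's': 4, 'p': 2}
print(counts_if == counts_get == dict(counts_dd))  # True
```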

In [None]:
# Create winners_per_year dictionary
# your code here

In [None]:
# how many winners in 2023?
winners_per_year[2023]

## Visualization (Preview)

While we now have the laureate counts for each year, this isn't very helpful since it is just a long list of numbers.

We can use the `matplotlib` library to visualize the number of winners per year.

In [None]:
plt.figure(figsize=(10,4))
first_year = min(winners_per_year.keys())
last_year = max(winners_per_year.keys())
prize_added = (first_econ-first_year)/(last_year-first_year)  # approx. fraction of the x-axis (from the left) where the Economics prize begins
plt.bar(winners_per_year.keys(), winners_per_year.values())
plt.axhline(len(unique_prizes)-1, xmax=prize_added, c='k', ls=':', label='# of unique prizes')
plt.axhline(len(unique_prizes), xmin=prize_added, c='k', ls=':')
plt.title("Number of Nobel Prize Winners by Year");
plt.legend()
plt.xlabel("Year")
plt.ylabel("# of winners");

**🤔 Q: Do you notice anything interesting in the plot above?**

## Pandas DataFrame (preview)

Next, we parse the collected data and create a `pandas.DataFrame`. A DataFrame is like a table, where each row corresponds to a data entry and each column corresponds to a feature. Once we have a DataFrame, we can easily export it to our disk in CSV, JSON, or other formats.

The easiest way to create a DataFrame is to build a list of dictionaries.

Each entry in the list (a dictionary) is a data point, where keys are column names in the table. Let's see it in action.

In this case, we'd actually like each row in the table to be a laureate, so we will construct a new list of laureate dictionaries from our existing list of prize dictionaries!

In [None]:
laureate_dicts = []
for p in parsed_prizes:
    for l in p['laureates']:
        laureate_dicts.append({
            'name': l[0],
            'url': l[1],
            'year': p['year'],
            'prize': re.sub(r'(.*) \d{4}', r'\1', p['title']),
            'desc': p['description']
        })

In [None]:
laureate_dicts[0]

In [None]:
df = pd.DataFrame(laureate_dicts)
df.head()

Finally, we can save all of our work by writing our DataFrame to a csv file.

In [None]:
os.makedirs('data', exist_ok=True)  # make sure the output directory exists
df.to_csv('data/scraped_awards.csv', index=False)

Which can of course be loaded to reproduce our DataFrame! 

In [None]:
df_reloaded = pd.read_csv('data/scraped_awards.csv')
df_reloaded.head()
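To see why we pass `index=False`, here is a small round-trip sketch on made-up rows. It writes to an in-memory buffer instead of a file on disk; the names `rows`, `toy_df`, and `toy_reloaded` are ours.

```python
import io
import pandas as pd

rows = [
    {'name': 'Ada', 'year': 1843, 'prize': 'Made-up Prize'},
    {'name': 'Grace', 'year': 1952, 'prize': 'Made-up Prize'},
]
toy_df = pd.DataFrame(rows)

# Round trip without the index: the columns come back unchanged
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)
buffer.seek(0)
toy_reloaded = pd.read_csv(buffer)
print(toy_reloaded.columns.tolist())  # ['name', 'year', 'prize']

# Round trip with the index: it is written as an extra, unnamed column
with_index = io.StringIO()
toy_df.to_csv(with_index)
with_index.seek(0)
print(pd.read_csv(with_index).columns.tolist())  # ['Unnamed: 0', 'name', 'year', 'prize']
```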

🌈 **The End**