# Getting Data - Part 2 


Some information comes from Ch 9 of Data Science from Scratch, 2nd Edition by Joel Grus. This book is available for free through the library's connection to O'Reilly's learning platform. Additional content comes from DSC 80. Some examples are adapted from [zlotnick's Text As Data examples](http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html) (*note: the site is now unavailable*).

### The data science lifecycle

<center><img src="imgs/ds-lifecycle.svg" width="60%"></center>

This week we are continuing to focus on Step 2 - Obtaining Data 

### Data Sources 

* Often, the data you need doesn't exist in "clean" `.csv` files.

* **Solution**: Collect your own data!
    - Design and administer your own survey or run an experiment.
    - Find related data on the internet.

- The internet contains **massive** amounts of historical record; for most questions you can think of, the answer exists somewhere on the internet.

### Collecting data from the internet

- There are two ways to programmatically access data on the internet:
    - through an API.
    - by scraping.


- We will discuss the differences between both approaches, but for now, the important part is that they **both use HTTP**.

## HTTP

- HTTP stands for **Hypertext Transfer Protocol**.
    - It was developed in 1989 by Tim Berners-Lee (and friends).

- It is a **request-response** protocol.
    - Protocol = set of rules.

- HTTP allows...
    - computers to talk to each other over a network.
    - devices to fetch data from "web servers."

- The "S" in HTTPS stands for "secure".

### The request-response model

HTTP follows the **request-response** model.

<center><img src='imgs/req-response.png' width=500></center>

- A <b><span style="color:blue">request</span></b> is made by the <b><span style="color:blue">client</span></b>.

- A <b><span style="color:orange">response</span></b> is returned by the <b><span style="color:orange">server</span></b>.

- **Example**: YouTube search.
    - Consider the following URL: https://www.youtube.com/results?search_query=apple+vision+pro.
    - Your web browser, a **client**, makes an HTTP **request** with a search query.
    - The **server**, YouTube, is a computer that is sitting somewhere else.
    - The server returns a **response** that contains the search results.
    - Note: ?search_query=apple+vision+pro is called a "query string."
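
The pieces of that URL can be pulled apart with Python's standard `urllib.parse` module. The snippet below is a small sketch using the YouTube URL from the example; nothing is sent over the network, it only splits and rebuilds the string.

```python
from urllib.parse import urlsplit, parse_qs, urlencode

url = "https://www.youtube.com/results?search_query=apple+vision+pro"

parts = urlsplit(url)
print(parts.scheme)  # the protocol
print(parts.netloc)  # the host, i.e. the server we are talking to
print(parts.path)    # the resource being requested
print(parts.query)   # the query string, without the leading "?"

# parse_qs decodes the query string into a dict; each "+" becomes a space.
params = parse_qs(parts.query)
print(params)

# urlencode goes the other way: dict -> query string.
rebuilt = urlencode({"search_query": "apple vision pro"})
print(rebuilt)
```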

### Request methods

The request methods you will use most often are `GET` and `POST`; see [Mozilla's web docs](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods) for a detailed list of request methods.    

- `GET` is used to request data **from** a specified resource.

- `POST` is used to **send** data to the server. 
    - For example, uploading a photo to Instagram or entering credit card information on Amazon.

### Example `GET` request

Below is an example `GET` HTTP request made by Safari when accessing [www.mtu.edu](https://www.mtu.edu).

<img src="imgs/get-safari.png">


The request information includes the following: 

```HTTP
GET / HTTP/1.1
Connection: keep-alive
Host: www.mtu.edu
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/605.1.15
...
```

- The first line (`GET / HTTP/1.1`) is called the "request line", and the lines afterwards are called "header fields". Header fields contain metadata. 

- We _could_ also provide a "body" after the header fields.

- To see HTTP requests in Google Chrome, follow [these steps](https://mkyong.com/computer-tips/how-to-view-http-headers-in-google-chrome/).

<img src="imgs/get-chrome.png">
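
The same request line and header fields can be inspected from Python. The sketch below uses `requests` to *prepare* a `GET` request without sending it, so it runs offline. The header values you see depend on your installed `requests` version, and some fields (such as `Host`) are typically filled in only when the request is actually sent.

```python
import requests

# Build (but do not send) a GET request so we can look at its parts.
session = requests.Session()
req = requests.Request("GET", "https://www.mtu.edu/")
prepared = session.prepare_request(req)

print(prepared.method)    # the request method
print(prepared.path_url)  # the path that appears in the request line
print(prepared.url)

# Header fields that requests adds by default.
for name, value in prepared.headers.items():
    print(f"{name}: {value}")
```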

#### Developer Tools 

You will need to use your browser's developer tools to explore the information for this week.

### Example `GET` response

```HTML

<!doctype html>
<html lang="en">

<head>
	<meta charset="UTF-8" />
	<meta name="viewport" content="width=device-width, initial-scale=1" />
	<meta name="theme-color" content="#000000" />
	<title>Michigan Technological University</title>
	
	<!-- <link type="text/css" rel="stylesheet" href="//www.mtu.edu/mtu_resources/styles/n/normalize.css" /> -->
			<link type="text/css" rel="stylesheet" href="//www.mtu.edu/mtu_resources/styles/n/base.css" />
		<link href="//www.mtu.edu/mtu_resources/styles/n/print.css" type="text/css" rel="stylesheet" media="print" />

...
```

### Consequences of the request-response model

- When a request is sent to view content on a webpage, the server must:
    - process your request (i.e. prepare data for the response).
    - send content back to the client in its response.

- Remember, servers are computers. 
    - Someone has to pay to keep these computers running.
    - **This means that every time you access a website, someone has to pay.**

## Making HTTP requests 

There are (at least) two ways to make HTTP requests outside of a browser:

- From the command line, with `curl`.

- **From Python, with the `requests` package.**

### The `requests` module 

`requests` is a Python module that allows you to use Python to interact with the internet!  

There are other packages that work similarly (e.g. `urllib`), but `requests` is arguably the easiest to use.



Some other libraries we will be using this lesson are: 

* `re` - we learned about this last time for regular expressions
* `requests` 
* `beautifulsoup4` - more on this later. 
* `html5lib` 

Let's import our libraries. 


In [None]:
import pandas as pd 
from pathlib import Path 
import re
from IPython.display import Image
from IPython.display import HTML

from bs4 import BeautifulSoup
import requests

## Example 1 - `GET` request

Let's look at the source code of the MTU homepage, [https://mtu.edu](https://mtu.edu).

In [None]:
resp = requests.get("https://mtu.edu")

`resp` is a `Response` object.

In [None]:
resp

The `text` attribute of `resp` is a string that contains the entire response. 

In [None]:
...

In [None]:
...

In [None]:
...

We could also chain the calls together to get the HTML directly.

In [None]:
html = requests.get("https://mtu.edu").text
html

## Example 2 - `POST` request 

The following call to `requests.post` makes a post request to https://httpbin.org/post, with a `'name'` parameter of `'Tester'`.

In [None]:
post_res = requests.post('https://httpbin.org/post',
                         data={'name': 'Tester'})
post_res

In [None]:
post_res.text

In [None]:
# More on this shortly!
post_res.json()
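
To see what the `data` argument turns into, here is a minimal sketch that prepares the same `POST` request without sending it. It assumes only that `requests` is installed; no network access is needed.

```python
import requests

# Prepare (but do not send) the POST request from above.
req = requests.Request("POST", "https://httpbin.org/post",
                       data={"name": "Tester"})
prepared = req.prepare()

print(prepared.method)
print(prepared.headers["Content-Type"])  # how the body is encoded
print(prepared.body)                     # the form-encoded request body
```

Unlike a `GET` query string, the parameters travel in the request body rather than in the URL.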

What happens when we try and make a `POST` request somewhere where we're unable to?

In [None]:
yt_res = requests.post('https://youtube.com',
                       data={'name': 'Tester'})
yt_res

In [None]:
...

`yt_res.text` is a string containing HTML – we can render this in-line using `IPython.display.HTML`.

In [None]:
...

### HTTP status codes 

When we **request** data from a website, the server includes an **HTTP status code** in the response.  

The most common status code is `200`, which means there were no issues.  

* Other times, you will see a different status code, describing some sort of event or error.
    - Common examples: `400` – bad request, `404` – page not found, `500` – internal server error.
    - [The first digit of a status describes its general "category".](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status)
    
- See [https://httpstat.us](https://httpstat.us/) for a list of all HTTP status codes.
    - It also has example sites for each status code; for example, https://httpstat.us/404 returns a `404`.

In [None]:
...

In [None]:
# ok checks if the result was successful.
...
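
As a rough illustration of the "first digit = category" idea, the sketch below uses Python's standard `http.HTTPStatus` enum to describe a few codes. The `CATEGORIES` dictionary and the `describe` helper are names made up for this example. With `requests`, the code itself is available as `resp.status_code`, and `resp.ok` is `True` when the code is below 400.

```python
from http import HTTPStatus

# The first digit of a status code gives its general category.
CATEGORIES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}

def describe(code):
    """Return a short description such as '404 Not Found (client error)'."""
    status = HTTPStatus(code)
    category = CATEGORIES[code // 100]
    return f"{code} {status.phrase} ({category})"

for code in (200, 400, 404, 500):
    print(describe(code))
```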

### Handling unsuccessful requests 

- Unsuccessful requests can be re-tried, depending on the issue.
    - A good first step is to wait a little, then try again.

- A common issue is that you're making too many requests to a particular server at a time – if this is the case, increase the time between each request. You can even do this programmatically, say, using `time.sleep`.

- See this [textbook](https://learningds.org/ch/14/web_http.html) for more examples.
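
One way to sketch the "wait a little, then try again" idea is shown below. The helper name `get_with_retries` and its parameters are assumptions made for this example, not part of `requests`. To keep the sketch runnable offline it is demonstrated with a stand-in function; in practice you would pass `requests.get` as `fetch`.

```python
import time

def get_with_retries(fetch, url, max_tries=3, wait=0.5):
    """Call fetch(url) until it succeeds or max_tries is reached.

    fetch is any callable that returns an object with a status_code
    attribute (for example requests.get). The wait doubles after
    each failed attempt.
    """
    for attempt in range(1, max_tries + 1):
        resp = fetch(url)
        if resp.status_code < 400:
            return resp
        if attempt < max_tries:
            time.sleep(wait)
            wait *= 2
    return resp  # the last (unsuccessful) response


# Stand-in for requests.get: fails twice with 429, then succeeds.
class FakeResponse:
    def __init__(self, status_code):
        self.status_code = status_code

codes = iter([429, 429, 200])

def fake_get(url):
    return FakeResponse(next(codes))

result = get_with_retries(fake_get, "https://example.com", wait=0.01)
print(result.status_code)
```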

## Data formats 

Internet responses usually come in one of two formats: HTML or JSON.

* The response body of a `GET` request is usually either JSON (when using an API) or HTML (when accessing a webpage). 

* The response body of a `POST` request is usually JSON. 

* XML is also a common format, but not as popular as it once was. 

### Next Week - APIs and JSON 

Next week we are going to focus on using APIs or application programming interfaces for getting data (in the JSON format). 

## Web Scraping 

Web scraping is the process of programmatically "browsing" the web, downloading the source code (HTML) of pages that you're interested in extracting data from.

Big advantage: You can always do it! For example, Google scrapes webpages in order to make them searchable.

Disadvantages:

- It is often difficult to parse and clean scraped data.
    - Source code often includes a lot of content unrelated to the data you're trying to find (e.g. formatting, advertisements, other text).

- Websites can change often, so scraping code can get outdated quickly.

- Websites may not want you to scrape their data!

- **In general, we prefer APIs, but scraping is a useful skill to learn.**

### Legality of Web Scraping 

Scraping a site can be impolite, and in some cases potentially illegal. The legality of web scraping has been the subject of several lawsuits. 

A closely watched court case involved LinkedIn:

* https://techcrunch.com/2016/08/15/linkedin-sues-scrapers/
* https://www.eff.org/cases/hiq-v-linkedin
* https://arstechnica.com/tech-policy/2019/09/web-scraping-doesnt-violate-anti-hacking-law-appeals-court-rules/

Ultimately, the appeals court ruled that scraping public data doesn't violate anti-hacking law.


#### Question to Ask Yourself 

Before embarking on a project that involves web scraping, there are several questions you may want to ask yourself or find answers to:

* Are you allowed to access the data, or is the data public? 
    * private data (requires username / password) generally should not be scraped 
* Is the data copyrighted? 
* Did you read the Terms of Use?  Does it exist? Does scraping violate these policies? 
* Did you read the `robots.txt` file? 
* Are you rate limiting your requests? 

### Best practices for scraping

1. **Send requests slowly** and be upfront about what you are doing!
2. Respect the policy published in the page's `robots.txt` file.
    - Many sites have a `robots.txt` file in their root directory, which contains a policy that allows or disallows automatic access to their site. 
    - If there isn't one, like in Project 3, use a 0.5 second delay between requests.
3. Don't spoof your User-agent (i.e. don't try to trick the server into thinking you are a person).
4. Read the Terms of Service for the site and follow it.
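
Python's standard library can read a `robots.txt` policy for you. The sketch below parses a hypothetical policy (the rules and the `example.com` URLs are made up for illustration) with `urllib.robotparser`, so it runs without network access. For a live site you would call `set_url` with the site's `robots.txt` address and then `read()` instead of `parse`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents for an imaginary site.
robots_lines = """
User-agent: *
Disallow: /private/
Crawl-delay: 2
""".strip().splitlines()

rp = RobotFileParser()
rp.parse(robots_lines)

# Is a generic crawler ("*") allowed to fetch these pages?
print(rp.can_fetch("*", "https://example.com/index.html"))
print(rp.can_fetch("*", "https://example.com/private/data.html"))

# Requested delay (in seconds) between requests, or None if unspecified.
print(rp.crawl_delay("*"))
```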

#### Consequences of irresponsible scraping

If you make too many requests:
* The server may block your IP Address.
* You may take down the website.
    - A journalist scraped and accidentally took down the Cook County Inmate Locator.
    - As a result, inmates' families weren't able to contact them while the site was down.

### Web Scraping - GenAI 

More and more tech and AI companies need vast amounts of data to train their models. Many services are changing their Terms of Service to allow data to be used to train their own AI models. 

Right now there is just a lot of uncertainty in this space.


References: 

* https://www.nytimes.com/2024/06/26/technology/terms-service-ai-training.html
* https://news.bloomberglaw.com/ip-law/openais-legal-woes-driven-by-unclear-mesh-of-web-scraping-laws

## The anatomy of HTML documents

Let's look more closely at the HTML documents returned in our requests. 

### What is HTML?

* HTML (HyperText Markup Language) is **the** basic building block of the internet. 
* It defines the content and layout of a webpage, and as such, it is what you get back when you scrape a webpage.
* See [this tutorial](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/HTML_basics) for more details.


For instance, here's the content of a very basic webpage.

In [None]:
!cat data/lec10_ex1.html

Using `IPython.display.HTML`, we can render it directly in our notebook.

In [None]:
# from IPython.display import HTML  # already imported above
HTML(filename=Path('data') / 'lec10_ex1.html')

### The anatomy of HTML documents

* **HTML document**: The totality of markup that makes up a webpage.

* **Document Object Model (DOM)**: The internal representation of an HTML document as a hierarchical **tree** structure.

* **HTML element**: An object in the DOM, such as a paragraph, header, or title.
* **HTML tags**: Markers that denote the **start** and **end** of an element, such as `<p>` and `</p>`.

<center><img src='imgs/dom.jpg'></center>

<center><a href='https://simplesnippets.tech/what-is-document-object-modeldom-how-js-interacts-with-dom/'>(source)</a></center>

### Useful tags to know


|Element|Description|
|:---|:---|
|`<html>`|the document|
|`<head>`|the head (metadata such as the title)|
|`<body>`|the body|
|`<div>` |a logical division of the document|
|`<span>`|an *inline* logical division|
|`<p>`|a paragraph|
| `<a>`| an anchor (hyperlink)|
|`<h1>, <h2>, ...`| heading(s) |
|`<img>`| an image |

There are many, many more, but these are by far the most common. See [this article](https://en.wikipedia.org/wiki/HTML_element) for examples.

#### Example: Images and hyperlinks

Tags can have **attributes**, which further specify how to display information on a webpage.

For instance, `<img>` tags have `src` and `alt` attributes (among others):

```html
<img src="blizzard.png" alt="A photograph of Blizzard T. Husky." width=500>
```

Hyperlinks have `href` attributes:

```html
Click <a href="https://www.mtu.edu">this link</a> to see the MTU homepage. 
```

In [None]:
!cat data/lec10_ex2.html

### The `<div>` tag

```html
<div style="background-color:lightblue">
  <h3>This is a heading</h3>
  <p>This is a paragraph.</p>
</div>
```

* The `<div>` tag defines a division or a "section" of an HTML document.
    * Think of a `<div>` as a "cell" in a Jupyter Notebook.

* The `<div>` element is often used as a container for other HTML elements to style them with CSS or to perform operations involving them using JavaScript.

* `<div>` elements often have attributes, **which are important when scraping**!

### Document trees

Under the document object model (DOM), HTML documents are trees. In DOM trees, child nodes are **ordered**.

<center>

<img src="imgs/webpage_anatomy.png" width="50%">

</center>    

What does the DOM tree look like for this document?

<center><img src="imgs/dom_tree.png" width="50%"></center>

## Parsing HTML using Beautiful Soup

* [Beautiful Soup 4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a Python HTML parser.
    - To "parse" means to "extract meaning from a sequence of symbols".
* **Warning**: Beautiful Soup 4 and Beautiful Soup 3 work differently, so make sure you are using and looking at documentation for Beautiful Soup 4.


### Example HTML document

To start, we'll work with the source code for an HTML page with the DOM tree shown below:

<center><img src="imgs/dom_tree_1.png" width="50%"></center>

The string `html_string` contains an HTML "document".

In [None]:
html_string = '''
<html>
    <body>
      <div id="content">
        <h1>Heading here</h1>
        <p>My First paragraph</p>
        <p>My <em>second</em> paragraph</p>
        <hr>
      </div>
      <div id="nav">
        <ul>
          <li>item 1</li>
          <li>item 2</li>
          <li>item 3</li>
        </ul>
      </div>
    </body>
</html>
'''.strip()

In [None]:
HTML(html_string)

### `BeautifulSoup` objects 

`bs4.BeautifulSoup` takes in a string or file-like object representing HTML (`markup`) and returns a **parsed** document.

Normally, we pass the result of a GET request to `BeautifulSoup`, but here we will pass our hand-crafted `html_string`.

In [None]:
soup = BeautifulSoup(html_string)
soup

In [None]:
type(soup)

`BeautifulSoup` objects have several useful attributes, e.g., `text`: 

In [None]:
print(soup.text)

### Traversing through `descendants`

The `descendants` attribute traverses a `BeautifulSoup` tree using **depth-first traversal**.

Why depth-first? Elements closer to one another on a page are more likely to be related than elements further away.

<center><img src="imgs/dom_tree_1.png" width="60%"></center>

In [None]:
soup.descendants

In [None]:
for child in soup.descendants:
#     print(child) # What would happen if we ran this instead?
    if isinstance(child, str):
        continue
    print(child.name)

### Finding elements in a tree

Practically speaking, you will not use the `descendants` attribute (or the related `children` attribute) directly very often. Instead, you will use the following methods:

- `soup.find(tag)`, which finds the **first** instance of a tag (the first one on the page, i.e. the first one that DFS sees).
    - More general: `soup.find(name=None, attrs={}, recursive=True, string=None, **kwargs)` (older versions of Beautiful Soup 4 call the `string` argument `text`).
- `soup.find_all(tag)` will find **all** instances of a tag.

**`find` finds tags!**

### Using `find` 

Let's try and extract the first `<div>` subtree.

<center><img src="imgs/dom_tree_1.png" width="60%"></center>  

In [None]:
...

<center><img src="imgs/dom_subtree_1.png" width="30%"></center>  

Let's try and find the `<div>` element that has an `id` attribute equal to `'nav'`.

In [None]:
...

`find` will return the first occurrence of a tag, regardless of its depth in the tree.

In [None]:
# The ul child is not at the top of the tree, but we can still find it.
...
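
The three lookups described above can be sketched on a small stand-alone document. This is one possible way to write them, not necessarily the solution intended for the blank cells. The variable names `html_doc` and `doc` are made up here so that they do not overwrite `html_string` and `soup`.

```python
from bs4 import BeautifulSoup

html_doc = """
<html><body>
  <div id="content"><h1>Heading here</h1><p>My first paragraph</p></div>
  <div id="nav"><ul><li>item 1</li><li>item 2</li></ul></div>
</body></html>
"""
doc = BeautifulSoup(html_doc, "html.parser")

first_div = doc.find("div")                     # first <div> in document order
nav_div = doc.find("div", attrs={"id": "nav"})  # the <div> whose id is "nav"
ul = doc.find("ul")                             # nested tags are found too

print(first_div["id"])
print(nav_div.find("li").text)
print(ul.parent["id"])
```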

### Using `find_all`

`find_all` returns a list of all matches.

In [None]:
soup.find_all('div')

In [None]:
soup.find_all('li')

In [None]:
[x.text for x in soup.find_all('li')]

### Node attributes
* The `text` attribute of a tag element gets the text between the opening and closing tags.
* The `attrs` attribute of a tag element lists all of its attributes.
* The `get` method of a tag element **gets the value of an attribute**.

In [None]:
soup.find('p')

The `get` method must be called directly on the node that contains the attribute you're looking for.
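
Here is a short sketch of `text`, `attrs`, and `get` on a hand-written snippet (the snippet and variable names are made up for illustration). Note how asking the `<div>` for `href` returns `None`, because that attribute belongs to the `<a>` inside it.

```python
from bs4 import BeautifulSoup

snippet = '<div id="nav"><a href="https://www.mtu.edu" class="home">MTU homepage</a></div>'
link = BeautifulSoup(snippet, "html.parser").find("a")

print(link.text)          # text between <a> and </a>
print(link.attrs)         # all attributes, as a dict
print(link.get("href"))   # the value of one attribute
print(link.get("title"))  # None: this attribute does not exist

# get must be called on the node that owns the attribute.
div = link.parent
print(div.get("href"))    # None, because href belongs to <a>, not <div>
print(div.get("id"))
```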

## Example 1 - Revisited 

Let's get the MTU homepage again. We saw that the response is a very long HTML document.

In [None]:
resp = requests.get("https://mtu.edu")

In [None]:
len(resp.text)

To **parse** HTML, we'll use the BeautifulSoup library. 

In [None]:
# Add the html5lib parameter for better formatting 
soup = BeautifulSoup(resp.text, 'html5lib')
soup

Typically, we want to find things using either the text on the page or `Tag` objects. 

For example, let's find the first `<p>` tag and its contents on the page: 

In [None]:
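first_p = soup.find('p')
first_p

In [None]:
# soup.p is shorthand for soup.find('p')
first_p = soup.p
first_p

You can access the text within the `<p>` or other `Tag`s using the `text` property:

In [None]:
first_p_text = first_p.text
first_p_text

You can get individual words using the string `split` function: 

In [None]:
first_p_words = first_p.text.split()
first_p_words

You can extract a tag's attributes by treating it like a `dict`:

In [None]:
# first_p['id'] raises a KeyError if there is no id; get returns None instead
first_p_id = first_p.get('id')
first_p_id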

You can get multiple tags at once:
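
A minimal sketch of this, using a small made-up page so that it runs without a network connection (the names `page` and `doc` are chosen for this example only):

```python
from bs4 import BeautifulSoup

page = """
<p id="intro">First paragraph</p>
<p>Second paragraph</p>
<p id="outro">Third paragraph</p>
"""
doc = BeautifulSoup(page, "html.parser")

# find_all returns every match; calling doc("p") is a shortcut for it.
all_paragraphs = doc.find_all("p")

# Keep only the paragraphs that have an id attribute.
paragraphs_with_ids = [p for p in doc("p") if p.get("id")]

print(len(all_paragraphs))
print([p["id"] for p in paragraphs_with_ids])
```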

## Example 3 

Let's look at another website: [Slashdot Articles filtered by Python](https://slashdot.org/index2.pl?fhfilter=python)

In [None]:
url = "https://slashdot.org/index2.pl?fhfilter=python"
soup = BeautifulSoup(requests.get(url).text, 'html5lib')

It is often very useful to look at the page's source; Safari, Chrome, Firefox, and other browsers each provide a way to do this.

https://slashdot.org/index2.pl?fhfilter=python

Here are some of the relevant sections (note that the example was adapted somewhat to remove extra whitespace):

```html
<article onclick="javascript:return false;"  id="firehose-166258475" data-fhid="166258475" data-fhtype="story" class="fhitem fhitem-story briefarticle usermode thumbs grid_24">
		<span class="sd-info-block" style="display: none">
			<span class="sd-key-firehose-id">166258475</span>
			<span class="type">story</span>
			
		</span>

<header>
	
		<span class="topic" id="topic-166258475">
			<a href="//slashdot.org/index2.pl?fhfilter=censorship" onclick="return addfhfilter('censorship');">
			
				<img src="//a.fsdn.com/sd/topics/censorship_64.png" width="64" height="64" alt="Censorship" title="Censorship">
			
		</a>
		</span>
	

	<h2 class="story">
		<span id="title-166258475" class="story-title"> <a onclick="return toggle_fh_body_wrap_return(this);"  href="//yro.slashdot.org/story/22/09/18/2246254/do-americas-free-speech-protections-protect-code---and-prevent-cryptocurrency-regulation">Do America's Free-Speech Protections Protect Code - and Prevent Cryptocurrency Regulation?</a> <span class=" no extlnk"><a class="story-sourcelnk" href="https://www.marketplace.org/shows/marketplace-tech/why-the-first-amendment-also-protects-code/"  title="External link - https://www.marketplace.org/shows/marketplace-tech/why-the-first-amendment-also-protects-code/" target="_blank"> (marketplace.org) </a></span></span>
		<!--<span class="comments commentcnt-166258475" >64</span>-->
		<!-- comment bubble -->
		
			<span class="comment-bubble"><a href="//yro.slashdot.org/story/22/09/18/2246254/do-americas-free-speech-protections-protect-code---and-prevent-cryptocurrency-regulation#comments" title="">64</a></span>
		
	</h2>
	<div class="details" id="details-166258475">
		<span class="story-details">
		<span class="story-views">
			<span class="sodify" onclick="firehose_set_options('color', 'red')" title="Filter Firehose to entries rated red or better"></span><span class="icon-beaker pop1 " alt="Popularity" title="Filter Firehose to entries rated red or better" onclick="firehose_set_options('color', 'red')"><span></span></span> 
		</span>
		</span>
		<span class="story-byline">			
			Posted
				by 	
				  EditorDavid
	
		<time id="fhtime-166258475" datetime="on Sunday September 18, 2022 @06:51PM">on Sunday September 18, 2022 @06:51PM</time>	
			 from the <span class="dept-text">million-dollar-questions</span> dept.
		
		</span>
	</div>
</header>

<div class="hide" id="fhbody-166258475">
	

	
		
		<div id="text-166258475" class="p">
			...
		</div>

	</div>
	<aside class="novote">
		
	</aside>
		<footer class="clearfix meta article-foot">
			<div class="story-controls">
			</div>
			
				<div class="story-tags">
					<span class="tright tags"><menu type="toolbar" class="edit-bar">
		<span id="tagbar-166258475" class="tag-bar none">
			<a  class="topic tag" rel="statictag" href="//slashdot.org/tag/" target="_blank"></a>

		</span>
		<div class="tag-menu">
			<input class="tag-entry default" type="text" value="apply tags">
		</div>
	</menu></span>
				</div>
		</footer>
	
	</article><article onclick="javascript:return false;"  id="firehose-166233741" data-fhid="166233741" data-fhtype="story" class="fhitem fhitem-story briefarticle usermode thumbs grid_24">
		<span class="sd-info-block" style="display: none">
			<span class="sd-key-firehose-id">166233741</span>
			<span class="type">story</span>
			
		</span>


```

It looks like each item returned is wrapped in an `article` tag with the class `fhitem`.

In [None]:
soup

How do we find each of the stories?  What tag or tag/attribute combination should be used?

In [None]:
articles = soup('article', 'fhitem')
print(len(articles))

 
Let's try to collect for each news story: the title, the author (who posted it), and the number of comments. 

In [None]:
articles = soup('article', 'fhitem')
for a in articles: 
    # get title, title inside <span class="story-title"> stored as text in link. 
    title = ...
    print(title)
    # get author inside <span class="story-byline"> 
    author = ...
    print(author)
    # get the number of comments <span class="comment-bubble"> stored as text in link.
    comments = ...
    print(comments)
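
Because the live page changes constantly, it can help to practise on a fixed sample first. The sketch below is one possible approach, not the only one: it runs on a hand-trimmed copy of the markup shown earlier (links shortened to `#`), and the variable names `sample`, `doc`, and `art` are made up for this example.

```python
from bs4 import BeautifulSoup

sample = """
<article id="firehose-166258475" class="fhitem fhitem-story">
  <h2 class="story">
    <span id="title-166258475" class="story-title">
      <a href="#">Do America's Free-Speech Protections Protect Code - and Prevent Cryptocurrency Regulation?</a>
      <span class="no extlnk"><a class="story-sourcelnk" href="#">(marketplace.org)</a></span>
    </span>
    <span class="comment-bubble"><a href="#">64</a></span>
  </h2>
  <span class="story-byline"> Posted by EditorDavid
    <time>on Sunday September 18, 2022 @06:51PM</time>
  </span>
</article>
"""
doc = BeautifulSoup(sample, "html.parser")

for art in doc("article", "fhitem"):
    # The title is the text of the first link inside <span class="story-title">.
    title = art.find("span", "story-title").find("a").text
    # The comment count is the text of the link inside <span class="comment-bubble">.
    comments = art.find("span", "comment-bubble").find("a").text
    # The byline still contains extra whitespace and the timestamp.
    byline = art.find("span", "story-byline").text
    print(title)
    print(int(comments))
    print(" ".join(byline.split()))
```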

When writing code that loops over many items, remember you can always test it on a single case or a few cases first:

In [None]:
articles = soup('article', 'fhitem')
story = articles[1]
story

In [None]:
author = story.find('span', 'story-byline').text
print(author)

We may want to think about how to strip the "Posted by" text, the time, and the extra whitespace from the author information.


In [None]:
author = story.find('span', 'story-byline').text
print(author)
aut = re.match(..., author)
print(aut.group(1))
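
One possible pattern is sketched below on a byline string copied by hand from the sample markup. It assumes that usernames contain no whitespace, which may not hold for every author. It uses `re.search` rather than `re.match` so that the leading whitespace does not need to be matched explicitly.

```python
import re

byline = """
            Posted
                by
                  EditorDavid

        on Sunday September 18, 2022 @06:51PM
             from the million-dollar-questions dept.
"""

# \s+ matches any run of whitespace (spaces, tabs, newlines);
# (\S+) captures the next run of non-whitespace characters as the username.
m = re.search(r"Posted\s+by\s+(\S+)", byline)
print(m.group(1))
```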

Let's create a data frame to store the news story information in. 

In [None]:
# find the stories first, so we know how many rows are needed
articles = soup('article', 'fhitem')

# create the data frame to store the information
df = pd.DataFrame(columns = ['Title', 'Author', 'Comments'],
                 index = range(0,len(articles)))

ai = 0
for a in articles: 
    
    # get title, title inside <span class="story-title"> stored as text in link. 
    title = ...
    #print(title)
    # get author inside <span class="story-byline"> 
    author = ...
    #print(author)
    aut = ...
    # get the number of comments <span class="comment-bubble"> stored as text in link.
    comments = ...
    #print(comments)
    df.iloc[ai,0] = title
    df.iloc[ai,1] = aut
    df.iloc[ai,2] = comments
    ai = ai+1
    
df
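
An alternative to filling a pre-sized data frame cell by cell is to collect one dictionary per story and build the data frame at the end. The sketch below uses made-up placeholder values in `scraped` rather than real scraped data; the name `stories` is chosen so that it does not overwrite `df`.

```python
import pandas as pd

# Made-up placeholder values standing in for scraped (title, author, comments).
scraped = [
    ("Story A", "author_a", "64"),
    ("Story B", "author_b", "12"),
]

rows = []
for title, author, comments in scraped:
    # Scraped text is always a string, so convert the count to an int.
    rows.append({"Title": title, "Author": author, "Comments": int(comments)})

stories = pd.DataFrame(rows, columns=["Title", "Author", "Comments"])
print(stories.shape)
print(stories["Comments"].sum())
```

This avoids having to know the number of rows in advance and removes the manual row counter.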

## Example 4 

Let's look at some data from the US Representatives - https://www.house.gov/representatives 

In [None]:
html = requests.get("https://www.house.gov/representatives").text
soup = BeautifulSoup(html, 'html5lib')
print(type(soup))

In [None]:
# Examine the webpage (HTML)
print(soup.prettify()[0:1000])

**Grab Representatives Websites**

Let's first just try to grab all the URLs linked from the page.

In [None]:
all_urls = [a['href'] 
            for a in soup('a')
            if a.has_attr('href')]

print(len(all_urls))

*ASIDE: List Comprehensions*

In [None]:
# As an aside, we will use expressions like the one above extensively in this section,
# namely list comprehensions.
evens = [a for a in range(10) if a % 2 == 0]
evens

In [None]:
odds = [i for i in range(10) if i % 2 == 1]
odds

In [None]:
print(all_urls[1:10])

This is too many! Some of these are relative links within this site, rather than links to the representatives' websites.

In [None]:
# Let's examine the part of the HTML file that stores the tables
print(soup.prettify()[21000:23000])

By looking at a number of the different links we want to find, we can see that most start with http:// or https:// and end with .house.gov or .house.gov/

Therefore, we should think about using a regular expression.

In [None]:
# Must start with http:// or https://
# Must end with .house.gov or .house.gov/
pat = r"^https?://.*\.house\.gov/?$"

In [None]:
# Let's check some tests! 
assert re.match(pat, "http://joel.house.gov")
assert re.match(pat, "https://joel.house.gov")
assert re.match(pat, "http://joel.house.gov/")
assert re.match(pat, "https://joel.house.gov/")
assert not re.match(pat, "joel.house.gov")
assert not re.match(pat, "http://joel.house.com")
assert not re.match(pat, "https://joel.house.gov/biography")
assert re.match(pat, "http://joel.lname.house.gov")

In [None]:
# And now apply
good_urls = [url for url in all_urls if re.match(pat, url)]
print(len(good_urls))

This is still a lot of urls.  

Let's try to trim this list. 

In [None]:
[i for i in range(len(good_urls)) if good_urls[i]== "https://barrymoore.house.gov"]

There are multiple instances of the same url. 

Let's convert the list to a **set** to only consider the unique urls (get rid of duplicates). 

In [None]:
good_urls = list(set(good_urls))
print(len(good_urls))

This is getting really close: we expect to see 435, so there are only a few extras.

For now, let's move forward; we can always clean the list further as a next step.

Let's look at individual representative sites for links to press releases. For example:

In [None]:
html = requests.get('https://bergman.house.gov').text
soup = BeautifulSoup(html, 'html5lib')

# Use a set because the links might appear multiple times.
links = {a['href'] for a in soup('a') if 'press releases' in a.text.lower()}

print(links) # {'/media/press-releases'}

Notice this is a relative link, which means we need to remember the originating site.
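
The standard library's `urllib.parse.urljoin` is one way to combine the originating site with a relative link. A minimal sketch, using the Bergman example above:

```python
from urllib.parse import urljoin

base = "https://bergman.house.gov"
relative = "/media/press-releases"

# Combine the originating site with the relative link.
full_url = urljoin(base, relative)
print(full_url)

# Absolute links pass through unchanged, so urljoin can be applied to every href.
absolute = urljoin(base, "https://www.house.gov/representatives")
print(absolute)
```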

Let's put this together.

In [None]:
from typing import Dict, Set
import time

press_releases: Dict[str, Set[str]] = {}

for house_url in good_urls:
    html = requests.get(house_url).text
    soup = BeautifulSoup(html, 'html5lib')
    pr_links = {a['href'] for a in soup('a') if 'press releases'
                                             in a.text.lower()}
    print(f"{house_url}: {pr_links}")
    press_releases[house_url] = pr_links
    time.sleep(0.5)  # be polite: pause between requests (see best practices)

## Example 5 

For this example, we want to scrape an HTML table. 

Here we will look at a Fandom wiki page: https://among-us.fandom.com/wiki/Tasks

We are interested in the table of tasks.

In [None]:
html = requests.get("https://among-us.fandom.com/wiki/Tasks").text
soup = BeautifulSoup(html, 'html5lib')

We can find the table in the source code: 

```html 
<table class="wikitable list-table sortable mw-collapsible">
<tbody><tr>
<th style="min-width: 25%">Task
</th>
<th style="min-width: 15%">Map
</th>
<th style="min-width: 15%">Type
</th>
<th>List of steps
</th></tr>
<tr>
<td><a href="/wiki/Align_Engine_Output" title="Align Engine Output">Align Engine Output</a>
</td>
<td><a href="/wiki/The_Skeld" title="The Skeld">The Skeld</a>
</td>
<td>Long
</td>
<td style="padding-left: 10px;">
<ul><li><a href="/wiki/Upper_Engine" title="Upper Engine">Upper Engine</a></li>
<li><a href="/wiki/Lower_Engine" title="Lower Engine">Lower Engine</a></li></ul>
</td></tr>
<tr>
<td><a href="/wiki/Align_Telescope" title="Align Telescope">Align Telescope</a>
</td>
<td><a href="/wiki/Polus" title="Polus">Polus</a>
</td>
<td>Short
```

In [None]:
# Grab every <table class="wikitable list-table sortable mw-collapsible">
#  tag in the HTML
tasks_tab = soup.find_all("table", 
                          {"class": "wikitable list-table sortable mw-collapsible"})
tasks_tab

In [None]:
len(tasks_tab)

In [None]:
# How can we grab all the rows of the table? 
tasks_dat = tasks_tab[0].find_all("tr")
tasks_dat
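
Once the rows are in hand, each row's cells can be turned into a list of strings. The sketch below does this on a shortened, hand-copied piece of the table markup shown above, so it runs offline; the names `table_html`, `rows`, `header`, and `records` are made up for this example.

```python
from bs4 import BeautifulSoup

table_html = """
<table class="wikitable">
<tbody><tr>
<th>Task
</th>
<th>Map
</th>
<th>Type
</th></tr>
<tr>
<td><a href="/wiki/Align_Engine_Output">Align Engine Output</a>
</td>
<td><a href="/wiki/The_Skeld">The Skeld</a>
</td>
<td>Long
</td></tr>
</tbody></table>
"""
table = BeautifulSoup(table_html, "html.parser").find("table")

rows = []
for tr in table.find_all("tr"):
    # Header cells are <th>, data cells are <td>; strip=True drops the newlines.
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

# The first row holds the column names; the rest are data.
header, *records = rows
print(header)
print(records)
```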

**Take Advantage of Pandas**

Pandas has the `read_html` function, which can parse HTML tables as well.

In [None]:
df2 = ...
df2.head(10)

In [None]:
df3 = ...
df3.head(10)
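
As a hedged illustration of how `read_html` behaves, the sketch below parses a tiny hand-written table instead of the live page. It assumes that a parser backend for `read_html` (by default `lxml`) is installed; the names `table_html`, `tables`, and `tasks` are made up for this example.

```python
from io import StringIO

import pandas as pd

table_html = """<table>
<tr><th>Task</th><th>Map</th><th>Type</th></tr>
<tr><td>Align Engine Output</td><td>The Skeld</td><td>Long</td></tr>
<tr><td>Align Telescope</td><td>Polus</td><td>Short</td></tr>
</table>"""

# read_html returns a list of DataFrames, one per <table> it finds.
tables = pd.read_html(StringIO(table_html))
tasks = tables[0]

print(len(tables))
print(tasks)
```

Because the result is always a list, remember to index into it (here `tables[0]`) even when the page contains only one table.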