# Getting Data - Part 2 


Some information comes from Ch. 9 of *Data Science from Scratch*, 2nd Edition, by Joel Grus. This book is available for free through the library's connection to O'Reilly's learning platform. Additional content comes from DSC 80. Some examples are adapted from [zlotnick's Text As Data examples](http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html) (*note: the site is now unavailable*).

### The data science lifecycle

<center><img src="imgs/ds-lifecycle.svg" width="60%"></center>

This week we are continuing to focus on Step 2 - Obtaining Data 

### Data Sources 

* Often, the data you need doesn't exist in "clean" `.csv` files.

* **Solution**: Collect your own data!
    - Design and administer your own survey or run an experiment.
    - Find related data on the internet.

- The internet contains **massive** amounts of historical record; for most questions you can think of, the answer exists somewhere on the internet.

### Collecting data from the internet

- There are two ways to programmatically access data on the internet:
    - through an API.
    - by scraping.


- We will discuss the differences between the two approaches, but for now, the important part is that they **both use HTTP**.

## HTTP

- HTTP stands for **Hypertext Transfer Protocol**.
    - It was developed in 1989 by Tim Berners-Lee (and friends).

- It is a **request-response** protocol.
    - Protocol = set of rules.

- HTTP allows...
    - computers to talk to each other over a network.
    - devices to fetch data from "web servers."

- The "S" in HTTPS stands for "secure".

### The request-response model

HTTP follows the **request-response** model.

<center><img src='imgs/req-response.png' width=500></center>

- A <b><span style="color:blue">request</span></b> is made by the <b><span style="color:blue">client</span></b>.

- A <b><span style="color:orange">response</span></b> is returned by the <b><span style="color:orange">server</span></b>.

- **Example**: YouTube search.
    - Consider the following URL: https://www.youtube.com/results?search_query=apple+vision+pro.
    - Your web browser, a **client**, makes an HTTP **request** with a search query.
    - The **server**, YouTube, is a computer that is sitting somewhere else.
    - The server returns a **response** that contains the search results.
    - Note: `?search_query=apple+vision+pro` is called a "query string."
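
The parts of this URL can be pulled apart with Python's standard `urllib.parse` module. The sketch below is only an illustration (the variable names are ours). It suggests that a query string is a set of key-value pairs, and that `+` in a query string stands for a space.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://www.youtube.com/results?search_query=apple+vision+pro"

parts = urlsplit(url)
print(parts.scheme)   # the protocol
print(parts.netloc)   # the server
print(parts.path)     # the resource on the server
print(parts.query)    # the query string, without the leading "?"

# parse_qs decodes the query string into a dict that maps keys to lists of values.
query = parse_qs(parts.query)
print(query)
```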

### Request methods

The request methods you will use most often are `GET` and `POST`; see [Mozilla's web docs](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods) for a detailed list of request methods.    

- `GET` is used to request data **from** a specified resource.

- `POST` is used to **send** data to the server. 
    - For example, uploading a photo to Instagram or entering credit card information on Amazon.

### Example `GET` request

Below is an example `GET` HTTP request made by Safari when accessing [www.mtu.edu](https://www.mtu.edu).

<img src="imgs/get-safari.png">


The request information includes the following: 

```HTTP
GET / HTTP/1.1
Connection: keep-alive
Host: www.mtu.edu
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/605.1.15
...
```

- The first line (`GET / HTTP/1.1`) is called the "request line", and the lines afterwards are called "header fields". Header fields contain metadata. 

- We _could_ also provide a "body" after the header fields.

- To see HTTP requests in Google Chrome, follow [these steps](https://mkyong.com/computer-tips/how-to-view-http-headers-in-google-chrome/).

<img src="imgs/get-chrome.png">
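
To make the request line / header fields / body structure concrete, here is a small sketch that splits a raw request into those three parts by hand. The `raw_request` string is a shortened copy of the request shown above; in practice a library does this parsing for you, so treat this as an illustration rather than something you would write yourself.

```python
# HTTP/1.1 separates lines with "\r\n"; a blank line ends the header fields.
raw_request = (
    "GET / HTTP/1.1\r\n"
    "Connection: keep-alive\r\n"
    "Host: www.mtu.edu\r\n"
    "\r\n"
)

# Everything before the first blank line is the head; anything after it is the body.
head, _, body = raw_request.partition("\r\n\r\n")
request_line, *header_lines = head.split("\r\n")

method, path, version = request_line.split(" ")
headers = dict(line.split(": ", 1) for line in header_lines)

print(method, path, version)
print(headers)
print(repr(body))   # empty here: this GET request carries no body
```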

#### Developer Tools 

You will need to use the developer tools in whatever browser you use to explore information for this week.

### Example `GET` response

```HTML

<!doctype html>
<html lang="en">

<head>
	<meta charset="UTF-8" />
	<meta name="viewport" content="width=device-width, initial-scale=1" />
	<meta name="theme-color" content="#000000" />
	<title>Michigan Technological University</title>
	
	<!-- <link type="text/css" rel="stylesheet" href="//www.mtu.edu/mtu_resources/styles/n/normalize.css" /> -->
			<link type="text/css" rel="stylesheet" href="//www.mtu.edu/mtu_resources/styles/n/base.css" />
		<link href="//www.mtu.edu/mtu_resources/styles/n/print.css" type="text/css" rel="stylesheet" media="print" />

...
```

### Consequences of the request-response model

- When a request is sent to view content on a webpage, the server must:
    - process your request (i.e. prepare data for the response).
    - send content back to the client in its response.

- Remember, servers are computers. 
    - Someone has to pay to keep these computers running.
    - **This means that every time you access a website, someone has to pay.**

## Making HTTP requests 

There are (at least) two ways to make HTTP requests outside of a browser:

- From the command line, with `curl`.

- **From Python, with the `requests` package.**

### The `requests` module 

`requests` is a Python module that allows you to use Python to interact with the internet!  

There are other packages that work similarly (e.g. `urllib`), but `requests` is arguably the easiest to use.



The libraries we will be using in this lesson are:

* `re` - we learned about this last time for regular expressions
* `requests` 
* `beautifulsoup4` - more on this later. 
* `html5lib` 

Let's import our libraries. 


In [1]:
import pandas as pd 
from pathlib import Path 
import re
from IPython.display import Image
from IPython.display import HTML

from bs4 import BeautifulSoup
import requests

## Example 1 - `GET` request

Let's look at the source code of the MTU homepage, [https://mtu.edu](https://mtu.edu).

In [2]:
resp = requests.get("https://mtu.edu")

`resp` is a `Response` object.

In [3]:
resp

<Response [200]>

The `text` attribute of `resp` is a string that contains the entire response body.

In [4]:
type(resp.text)

str

In [5]:
len(resp.text)

100206

In [6]:
print(resp.text[:1000])


<!doctype html>
<html lang="en">

<head>
	<meta charset="UTF-8" />
	<meta name="viewport" content="width=device-width, initial-scale=1" />
	<meta name="theme-color" content="#000000" />
	<title>Michigan Technological University</title>
	
	<!-- <link type="text/css" rel="stylesheet" href="//www.mtu.edu/mtu_resources/styles/n/normalize.css" /> -->
			<link type="text/css" rel="stylesheet" href="//www.mtu.edu/mtu_resources/styles/n/base.css" />
		<link href="//www.mtu.edu/mtu_resources/styles/n/print.css" type="text/css" rel="stylesheet" media="print" />

		<meta name="p:domain_verify" content="41b1dc4a38dcb175c2a5a7b8ffb3f20b" />
    <meta name="facebook-domain-verification" content="9trxpqskskav4wnzj1urr9mnw175a7" />
    <meta name="google-site-verification" content="_AZlprY5eM4Vq-CEZYqi5Vdijf5ACXlgjUcmaZO44NU" />
	<meta itemprop="genre" content="Education" />
	<meta name="description" content="Michigan Technological University is a flagship public research university founded in 1885. 

We could also chain the call and the attribute access together to get the HTML directly.

In [7]:
html = requests.get("https://mtu.edu").text
html

'\n<!doctype html>\n<html lang="en">\n\n<head>\n\t<meta charset="UTF-8" />\n\t<meta name="viewport" content="width=device-width, initial-scale=1" />\n\t<meta name="theme-color" content="#000000" />\n\t<title>Michigan Technological University</title>\n\t\n\t<!-- <link type="text/css" rel="stylesheet" href="//www.mtu.edu/mtu_resources/styles/n/normalize.css" /> -->\n\t\t\t<link type="text/css" rel="stylesheet" href="//www.mtu.edu/mtu_resources/styles/n/base.css" />\n\t\t<link href="//www.mtu.edu/mtu_resources/styles/n/print.css" type="text/css" rel="stylesheet" media="print" />\n\n\t\t<meta name="p:domain_verify" content="41b1dc4a38dcb175c2a5a7b8ffb3f20b" />\n    <meta name="facebook-domain-verification" content="9trxpqskskav4wnzj1urr9mnw175a7" />\n    <meta name="google-site-verification" content="_AZlprY5eM4Vq-CEZYqi5Vdijf5ACXlgjUcmaZO44NU" />\n\t<meta itemprop="genre" content="Education" />\n\t<meta name="description" content="Michigan Technological University is a flagship public res

## Example 2 - `POST` request 

The following call to `requests.post` makes a `POST` request to https://httpbin.org/post, with a `'name'` parameter of `'Tester'`.

In [8]:
post_res = requests.post('https://httpbin.org/post',
                         data={'name': 'Tester'})
post_res

<Response [200]>

In [9]:
post_res.text

'{\n  "args": {}, \n  "data": "", \n  "files": {}, \n  "form": {\n    "name": "Tester"\n  }, \n  "headers": {\n    "Accept": "*/*", \n    "Accept-Encoding": "gzip, deflate, br", \n    "Content-Length": "11", \n    "Content-Type": "application/x-www-form-urlencoded", \n    "Host": "httpbin.org", \n    "User-Agent": "python-requests/2.31.0", \n    "X-Amzn-Trace-Id": "Root=1-66fb679f-47582cdd4faa22516541cce7"\n  }, \n  "json": null, \n  "origin": "47.6.17.142", \n  "url": "https://httpbin.org/post"\n}\n'

In [10]:
# More on this shortly!
post_res.json()

{'args': {},
 'data': '',
 'files': {},
 'form': {'name': 'Tester'},
 'headers': {'Accept': '*/*',
  'Accept-Encoding': 'gzip, deflate, br',
  'Content-Length': '11',
  'Content-Type': 'application/x-www-form-urlencoded',
  'Host': 'httpbin.org',
  'User-Agent': 'python-requests/2.31.0',
  'X-Amzn-Trace-Id': 'Root=1-66fb679f-47582cdd4faa22516541cce7'},
 'json': None,
 'origin': '47.6.17.142',
 'url': 'https://httpbin.org/post'}
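
To see where the `Content-Type` and `Content-Length` values in this output come from, we can build the same request without sending it. The sketch below uses `requests.Request(...).prepare()`, which only constructs the request locally; the variable names are ours.

```python
import requests

# Build the same POST request as above, but only prepare it (nothing is sent).
req = requests.Request("POST", "https://httpbin.org/post", data={"name": "Tester"})
prepared = req.prepare()

print(prepared.method)                     # POST
print(prepared.headers["Content-Type"])    # application/x-www-form-urlencoded
print(prepared.body)                       # name=Tester
print(prepared.headers["Content-Length"])  # 11, the length of "name=Tester"
```

The `data` dictionary is form-encoded into the request body, which is why httpbin reports it under `'form'` rather than `'args'`.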

What happens when we try to make a `POST` request to a site that doesn't accept it?

In [11]:
yt_res = requests.post('https://youtube.com',
                       data={'name': 'Tester'})
yt_res

<Response [400]>

In [12]:
yt_res.text

'<html lang="en" dir="ltr"><head><title>Oops</title><style nonce="7Xn_6L9E-1d74nmHM2avHQ">html{font-family:Roboto,Arial,sans-serif;font-size:14px}body{background-color:#f9f9f9;margin:0}#content{max-width:440px;margin:128px auto}svg{display:block;pointer-events:none}#monkey{width:280px;margin:0 auto}h1,p{text-align:center;margin:0;color:#131313}h1{padding:24px 0 8px;font-size:24px;font-weight:400}p{line-height:21px}sentinel{}</style><link rel="shortcut icon" href="https://www.youtube.com/img/favicon.ico" type="image/x-icon"><link rel="icon" href="https://www.youtube.com/img/favicon_32.png" sizes="32x32"><link rel="icon" href="https://www.youtube.com/img/favicon_48.png" sizes="48x48"><link rel="icon" href="https://www.youtube.com/img/favicon_96.png" sizes="96x96"><link rel="icon" href="https://www.youtube.com/img/favicon_144.png" sizes="144x144"></head><body><div id="content"><h1>Something went wrong</h1><p><svg id="monkey" viewBox="0 0 490 525"><path fill="#6A1B9A" d="M325 85c1 12-1 25-

`yt_res.text` is a string containing HTML – we can render this in-line using `IPython.display.HTML`.

In [13]:
HTML(yt_res.text)

### HTTP status codes 

When we **request** data from a website, the server includes an **HTTP status code** in the response.  

The most common status code is `200`, which means there were no issues.  

* Other times, you will see a different status code, describing some sort of event or error.
    - Common examples: `400` – bad request, `404` – page not found, `500` – internal server error.
    - [The first digit of a status describes its general "category".](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status)
    
- See [https://httpstat.us](https://httpstat.us/) for a list of all HTTP status codes.
    - It also has example sites for each status code; for example, https://httpstat.us/404 returns a `404`.
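
As a quick reference, Python's standard library knows the official phrase for each status code. The sketch below pairs that phrase with the "category" given by the first digit; the `CATEGORIES` dictionary and the `describe` helper are our own names, written only for illustration.

```python
from http import HTTPStatus

# The first digit of a status code gives its general category.
CATEGORIES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}

def describe(code):
    """Return a short description such as '404 Not Found (client error)'."""
    phrase = HTTPStatus(code).phrase
    category = CATEGORIES[code // 100]
    return f"{code} {phrase} ({category})"

for code in [200, 400, 404, 500]:
    print(describe(code))
```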

In [14]:
yt_res.status_code

400

In [15]:
# ok checks if the result was successful.
yt_res.ok

False

### Handling unsuccessful requests 

- Unsuccessful requests can be re-tried, depending on the issue.
    - A good first step is to wait a little, then try again.

- A common issue is that you're making too many requests to a particular server at a time – if this is the case, increase the time between each request. You can even do this programmatically, say, using `time.sleep`.

- See this [textbook](https://learningds.org/ch/14/web_http.html) for more examples.
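
One possible way to "wait a little, then try again" is sketched below. The helper `fetch_with_retries` and its parameters are hypothetical names of ours, not part of `requests`; it accepts any function that returns an object with an `ok` attribute, so `requests.get` can be passed in directly.

```python
import time

def fetch_with_retries(fetch, url, max_tries=3, wait_seconds=1.0):
    """Call fetch(url) until the response is ok or max_tries is reached.

    fetch is any function that returns an object with an `ok` attribute,
    such as requests.get.
    """
    for attempt in range(1, max_tries + 1):
        resp = fetch(url)
        if resp.ok:
            break
        if attempt < max_tries:
            # Wait a little longer after each failed attempt.
            time.sleep(wait_seconds * attempt)
    return resp
```

For example, `fetch_with_retries(requests.get, "https://mtu.edu")` would make up to three attempts. Not every failure is worth retrying: a `404` will most likely stay a `404`, so a fuller version might retry only on selected status codes.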

## Data formats 

Responses from the internet usually come in one of two data formats: HTML or JSON.

* The response body of a `GET` request is usually either JSON (when using an API) or HTML (when accessing a webpage). 

* The response body of a `POST` request is usually JSON. 

* XML is also a common format, but not as popular as it once was. 
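
As a small preview of JSON, the sketch below parses a trimmed copy of the response text from Example 2 with the standard `json` module. This is roughly what the `.json()` method of a response gives you; the variable names are ours.

```python
import json

# A trimmed version of the JSON text returned by httpbin in Example 2.
body = '{"form": {"name": "Tester"}, "json": null, "url": "https://httpbin.org/post"}'

parsed = json.loads(body)       # str -> dict
print(type(parsed))
print(parsed["form"]["name"])
print(parsed["json"])           # JSON null becomes Python None
```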

### Next Week - APIs and JSON 

Next week we are going to focus on using APIs or application programming interfaces for getting data (in the JSON format). 

## Web Scraping 

Web scraping is the process of  programmatically "browsing" the web, downloading the source code (HTML) of pages that you're interested in extracting data from.

Big advantage: You can always do it! For example, Google scrapes webpages in order to make them searchable.

Disadvantages:

- It is often difficult to parse and clean scraped data.
    - Source code often includes a lot of content unrelated to the data you're trying to find (e.g. formatting, advertisements, other text).

- Websites can change often, so scraping code can get outdated quickly.

- Websites may not want you to scrape their data!

- **In general, we prefer APIs, but scraping is a useful skill to learn.**

### Legality of Web Scraping 

Scraping a site can be impolite, and in some cases potentially illegal. The legality of web scraping has been the subject of several lawsuits.

A closely watched court case involved LinkedIn:

* https://techcrunch.com/2016/08/15/linkedin-sues-scrapers/
* https://www.eff.org/cases/hiq-v-linkedin
* https://arstechnica.com/tech-policy/2019/09/web-scraping-doesnt-violate-anti-hacking-law-appeals-court-rules/

Ultimately, the appeals court ruled that scraping public data doesn't violate anti-hacking laws.


#### Question to Ask Yourself 

Before embarking on a project that involves web scraping, there are several questions you may want to ask yourself or find answers to:

* Are you allowed to access the data, or is the data public? 
    * private data (requires username / password) generally should not be scraped 
* Is the data copyrighted? 
* Did you read the Terms of Use? Do they exist? Does scraping violate these policies?
* Did you read the `robots.txt` file? 
* Are you rate limiting your requests? 

### Best practices for scraping

1. **Send requests slowly** and be upfront about what you are doing!
2. Respect the policy published in the page's `robots.txt` file.
    - Many sites have a `robots.txt` file in their root directory, which contains a policy that allows or disallows automatic access to their site. 
    - If there isn't one, like in Project 3, use a 0.5 second delay between requests.
3. Don't spoof your User-agent (i.e. don't try to trick the server into thinking you are a person).
4. Read the Terms of Service for the site and follow it.
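
Python's standard library includes `urllib.robotparser` for reading `robots.txt` policies. The sketch below parses a made-up policy from a list of lines, so it runs without touching the network. The policy, the `example.com` URLs, and the User-agent name are all assumptions chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt policy, used here only for illustration.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
]

rp = RobotFileParser()
rp.parse(robots_lines)   # for a real site: rp.set_url(".../robots.txt"); rp.read()

agent = "un5550-student-bot"   # hypothetical User-agent name
print(rp.can_fetch(agent, "https://example.com/index.html"))
print(rp.can_fetch(agent, "https://example.com/private/data.html"))

# Fall back to the 0.5 second delay if the policy does not set one.
delay = rp.crawl_delay(agent) or 0.5
print(delay)
```

A scraper could then call `time.sleep(delay)` between requests and skip any URL for which `can_fetch` returns `False`.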

#### Consequences of irresponsible scraping

If you make too many requests:
* The server may block your IP Address.
* You may take down the website.
    - A journalist scraped and accidentally took down the Cook County Inmate Locator.
    - As a result, inmates' families weren't able to contact them while the site was down.

### Web Scraping - GenAI 

More and more tech and AI companies need vast amounts of data to train their models, and many services are changing their Terms of Service to allow data to be used to train their own AI models.

Right now there is just a lot of uncertainty in the space.


References: 

* https://www.nytimes.com/2024/06/26/technology/terms-service-ai-training.html
* https://news.bloomberglaw.com/ip-law/openais-legal-woes-driven-by-unclear-mesh-of-web-scraping-laws

## The anatomy of HTML documents

Let's look more closely at the HTML documents returned in our requests. 

### What is HTML?

* HTML (HyperText Markup Language) is **the** basic building block of the internet. 
* It defines the content and layout of a webpage, and as such, it is what you get back when you scrape a webpage.
* See [this tutorial](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/HTML_basics) for more details.


For instance, here's the content of a very basic webpage.

In [16]:
!cat data/lec10_ex1.html

<html>
  <head>
    <title>Page title</title>
  </head>

  <body>
    <h1>This is a heading</h1>
    <p>This is a paragraph.</p>
    <p>This is <b>another</b> paragraph.</p>
  </body>
</html>


Using `IPython.display.HTML`, we can render it directly in our notebook.

In [17]:
# from IPython.display import HTML  # already imported above
HTML(filename=Path('data') / 'lec10_ex1.html')

### The anatomy of HTML documents

* **HTML document**: The totality of markup that makes up a webpage.

* **Document Object Model (DOM)**: The internal representation of an HTML document as a hierarchical **tree** structure.

* **HTML element**: An object in the DOM, such as a paragraph, header, or title.
* **HTML tags**: Markers that denote the **start** and **end** of an element, such as `<p>` and `</p>`.

<center><img src='imgs/dom.jpg'></center>

<center><a href='https://simplesnippets.tech/what-is-document-object-modeldom-how-js-interacts-with-dom/'>(source)</a></center>
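
To get a feel for the tree structure, the sketch below walks the basic page from earlier with the standard library's `html.parser` and indents each tag by its depth. The `TreePrinter` class is our own illustrative helper. It assumes every opening tag has a matching closing tag, which holds for this page but not for tags like `<hr>` or `<img>`.

```python
from html.parser import HTMLParser

class TreePrinter(HTMLParser):
    """Record each opening tag, indented by its depth in the tree."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        self.depth += 1   # children of this tag sit one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1   # closing a tag moves back up one level

page = """
<html>
  <head>
    <title>Page title</title>
  </head>
  <body>
    <h1>This is a heading</h1>
    <p>This is a paragraph.</p>
    <p>This is <b>another</b> paragraph.</p>
  </body>
</html>
"""

printer = TreePrinter()
printer.feed(page)
print("\n".join(printer.lines))
```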

### Useful tags to know


|Element|Description|
|:---|:---|
|`<html>`|the document|
|`<head>`|the header|
|`<body>`|the body|
|`<div>` |a logical division of the document|
|`<span>`|an *inline* logical division|
|`<p>`|a paragraph|
| `<a>`| an anchor (hyperlink)|
|`<h1>, <h2>, ...`| header(s) |
|`<img>`| an image |

There are many, many more, but these are by far the most common. See [this article](https://en.wikipedia.org/wiki/HTML_element) for examples.

#### Example: Images and hyperlinks

Tags can have **attributes**, which further specify how to display information on a webpage.

For instance, `<img>` tags have `src` and `alt` attributes (among others):

```html
<img src="blizzard.png" alt="A photograph of Blizzard T. Husky." width=500>
```

Hyperlinks have `href` attributes:

```html
Click <a href="https://www.mtu.edu">this link</a> to see the MTU homepage. 
```

In [18]:
!cat data/lec10_ex2.html

<html>
  <head>
    <title>UN 5550, Fall 2024</title>
    <link
      href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0-alpha1/dist/css/bootstrap.min.css"
      rel="stylesheet"
    />
  </head>

  <body>
    <h1>Class Overview</h1>
    <img src="../imgs/pizza.png" width="200" alt="My favorite pizza." />
    <p>
      Information on the class is available on:
      <a href="https://mtu.instructure.com/courses/1528222">Canvas course</a
      >.
    </p>
    <center>
      <h3>
        Information about discussion is available on EdStem platform.
      </h3>
    </center>
  </body>
</html>


### The `<div>` tag

```html
<div style="background-color:lightblue">
  <h3>This is a heading</h3>
  <p>This is a paragraph.</p>
</div>
```

* The `<div>` tag defines a division or a "section" of an HTML document.
    * Think of a `<div>` as a "cell" in a Jupyter Notebook.

* The `<div>` element is often used as a container for other HTML elements to style them with CSS or to perform operations involving them using JavaScript.

* `<div>` elements often have attributes, **which are important when scraping**!

### Document trees

Under the document object model (DOM), HTML documents are trees. In DOM trees, child nodes are **ordered**.

<center>

<img src="imgs/webpage_anatomy.png" width="50%">

</center>    

What does the DOM tree look like for this document?

<center><img src="imgs/dom_tree.png" width="50%"></center>

## Parsing HTML using Beautiful Soup

* [Beautiful Soup 4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a Python HTML parser.
    - To "parse" means to "extract meaning from a sequence of symbols".
* **Warning**: Beautiful Soup 4 and Beautiful Soup 3 work differently, so make sure you are using and looking at documentation for Beautiful Soup 4.


### Example HTML document

To start, we'll work with the source code for an HTML page with the DOM tree shown below:

<center><img src="imgs/dom_tree_1.png" width="50%"></center>

The string `html_string` contains an HTML "document".

In [19]:
html_string = '''
<html>
    <body>
      <div id="content">
        <h1>Heading here</h1>
        <p>My First paragraph</p>
        <p>My <em>second</em> paragraph</p>
        <hr>
      </div>
      <div id="nav">
        <ul>
          <li>item 1</li>
          <li>item 2</li>
          <li>item 3</li>
        </ul>
      </div>
    </body>
</html>
'''.strip()

In [20]:
HTML(html_string)

### `BeautifulSoup` objects 

`bs4.BeautifulSoup` takes in a string or file-like object representing HTML (`markup`) and returns a **parsed** document.

Normally, we pass the result of a GET request to `BeautifulSoup`, but here we will pass our hand-crafted `html_string`.

In [21]:
soup = BeautifulSoup(html_string)
soup

<html>
<body>
<div id="content">
<h1>Heading here</h1>
<p>My First paragraph</p>
<p>My <em>second</em> paragraph</p>
<hr/>
</div>
<div id="nav">
<ul>
<li>item 1</li>
<li>item 2</li>
<li>item 3</li>
</ul>
</div>
</body>
</html>

In [22]:
type(soup)

bs4.BeautifulSoup

`BeautifulSoup` objects have several useful attributes, e.g., `text`: 

In [23]:
print(soup.text)




Heading here
My First paragraph
My second paragraph




item 1
item 2
item 3






### Traversing through `descendants`

The `descendants` attribute traverses a `BeautifulSoup` tree using **depth-first traversal**.

Why depth-first? Elements closer to one another on a page are more likely to be related than elements further away.

<center><img src="imgs/dom_tree_1.png" width="60%"></center>

In [24]:
soup.descendants

<generator object Tag.descendants at 0x12c04f990>

In [25]:
for child in soup.descendants:
#     print(child) # What would happen if we ran this instead?
    if isinstance(child, str):
        continue
    print(child.name)

html
body
div
h1
p
p
em
hr
div
ul
li
li
li
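
The loop above skips strings because `descendants` yields both tags and the pieces of text between them. The sketch below makes this visible on a one-line document of our own; `html.parser` is passed explicitly as the parser so the example does not depend on which optional parsers are installed.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

small = BeautifulSoup("<p>My <em>second</em> paragraph</p>", "html.parser")

for node in small.descendants:
    if isinstance(node, Tag):
        print("tag:   ", node.name)
    elif isinstance(node, NavigableString):
        print("string:", repr(str(node)))

# The same traversal, collected into a list: tag names and raw strings in DFS order.
kinds = [node.name if isinstance(node, Tag) else str(node) for node in small.descendants]
print(kinds)
```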


### Finding elements in a tree

Practically speaking, you will not use the `descendants` attribute (or the related `children` attribute) directly very often. Instead, you will use the following methods:

- `soup.find(tag)`, which finds the **first** instance of a tag (the first one on the page, i.e. the first one that DFS sees).
    - More general: `soup.find(name=None, attrs={}, recursive=True, text=None, **kwargs)`.
- `soup.find_all(tag)` will find **all** instances of a tag.

**`find` finds tags!**
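
Here is a short, self-contained sketch of a few common ways to call these methods. The document `doc` is a condensed version of the example HTML, and the variable names are ours.

```python
from bs4 import BeautifulSoup

doc = """
<div id="content"><h1>Heading here</h1><p>My First paragraph</p></div>
<div id="nav"><ul><li>item 1</li><li>item 2</li><li>item 3</li></ul></div>
"""
demo = BeautifulSoup(doc, "html.parser")

first_div = demo.find("div")                      # first match in document order
nav_div = demo.find("div", attrs={"id": "nav"})   # filter on an attribute
first_two = demo.find_all("li", limit=2)          # stop after two matches
missing = demo.find("table")                      # no match gives None, not an error

print(first_div.get("id"))
print(nav_div.get("id"))
print([li.text for li in first_two])
print(missing)
```

Because `find` returns `None` when nothing matches, it is worth checking the result before accessing attributes such as `.text` on it.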

### Using `find` 

Let's try and extract the first `<div>` subtree.

<center><img src="imgs/dom_tree_1.png" width="60%"></center>  

In [26]:
soup.find('div')

<div id="content">
<h1>Heading here</h1>
<p>My First paragraph</p>
<p>My <em>second</em> paragraph</p>
<hr/>
</div>

<center><img src="imgs/dom_subtree_1.png" width="30%"></center>  

Let's try and find the `<div>` element that has an `id` attribute equal to `'nav'`.

In [27]:
soup.find('div', attrs={'id': 'nav'})

<div id="nav">
<ul>
<li>item 1</li>
<li>item 2</li>
<li>item 3</li>
</ul>
</div>

`find` will return the first occurrence of a tag, regardless of its depth in the tree.

In [28]:
# The ul child is not at the top of the tree, but we can still find it.
soup.find('ul')

<ul>
<li>item 1</li>
<li>item 2</li>
<li>item 3</li>
</ul>

### Using `find_all`

`find_all` returns a list of all matches.

In [29]:
soup.find_all('div')

[<div id="content">
 <h1>Heading here</h1>
 <p>My First paragraph</p>
 <p>My <em>second</em> paragraph</p>
 <hr/>
 </div>,
 <div id="nav">
 <ul>
 <li>item 1</li>
 <li>item 2</li>
 <li>item 3</li>
 </ul>
 </div>]

In [30]:
soup.find_all('li')

[<li>item 1</li>, <li>item 2</li>, <li>item 3</li>]

In [31]:
[x.text for x in soup.find_all('li')]

['item 1', 'item 2', 'item 3']

### Node attributes
* The `text` attribute of a tag element gets the text between the opening and closing tags.
* The `attrs` attribute of a tag element lists all of its attributes.
* The `get` method of a tag element **gets the value of an attribute**.

In [32]:
soup.find('p')

<p>My First paragraph</p>

In [33]:
soup.find('p').text

'My First paragraph'

In [34]:
soup.find('div')

<div id="content">
<h1>Heading here</h1>
<p>My First paragraph</p>
<p>My <em>second</em> paragraph</p>
<hr/>
</div>

In [35]:
soup.find('div').text

'\nHeading here\nMy First paragraph\nMy second paragraph\n\n'

In [36]:
soup.find('div').attrs

{'id': 'content'}

In [37]:
soup.find('div').get('id')

'content'

The `get` method must be called directly on the node that contains the attribute you're looking for.
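
For instance, in the small snippet below (our own example), the `href` attribute belongs to the `<a>` tag, not to the `<p>` tag that contains it. Calling `get('href')` on the paragraph returns `None`, so we first navigate to the `<a>` tag.

```python
from bs4 import BeautifulSoup

snippet = '<p id="intro">Visit <a href="https://www.mtu.edu">MTU</a>.</p>'
p = BeautifulSoup(snippet, "html.parser").find("p")

print(p.get("id"))               # the <p> tag itself has an id
print(p.get("href"))             # href lives on the child <a>, so this is None
print(p.find("a").get("href"))   # navigate to the <a> first, then call get
```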

## Example 1 - Revisited 

Let's get the MTU homepage again. We saw that the response is a very long HTML document.

In [38]:
resp = requests.get("https://mtu.edu")

In [39]:
len(resp.text)

100206

To **parse** HTML, we'll use the BeautifulSoup library. 

In [40]:
# Add the html5lib parameter for better formatting 
soup = BeautifulSoup(resp.text, 'html5lib')
soup

<!DOCTYPE html>
<html lang="en"><head>
	<meta charset="utf-8"/>
	<meta content="width=device-width, initial-scale=1" name="viewport"/>
	<meta content="#000000" name="theme-color"/>
	<title>Michigan Technological University</title>
	
	<!-- <link type="text/css" rel="stylesheet" href="//www.mtu.edu/mtu_resources/styles/n/normalize.css" /> -->
			<link href="//www.mtu.edu/mtu_resources/styles/n/base.css" rel="stylesheet" type="text/css"/>
		<link href="//www.mtu.edu/mtu_resources/styles/n/print.css" media="print" rel="stylesheet" type="text/css"/>

		<meta content="41b1dc4a38dcb175c2a5a7b8ffb3f20b" name="p:domain_verify"/>
    <meta content="9trxpqskskav4wnzj1urr9mnw175a7" name="facebook-domain-verification"/>
    <meta content="_AZlprY5eM4Vq-CEZYqi5Vdijf5ACXlgjUcmaZO44NU" name="google-site-verification"/>
	<meta content="Education" itemprop="genre"/>
	<meta content="Michigan Technological University is a flagship public research university founded in 1885. Our campus in Michigan's Upper 

In [41]:
type(soup)

bs4.BeautifulSoup

Typically, we want to find things using either the text on the page or `Tag` objects. 

For example, let's find the first `<p>` tag and its contents on the page: 

In [42]:
first_p = soup.find('p')
first_p

<p>Learning by doing is the foundation of a Michigan Tech education. Huskies can be involved in <a href="https://www.mtu.edu/admissions/academics/research/">undergraduate research</a> from their very first year on campus.</p>

In [43]:
first_p = soup.p
first_p

<p>Learning by doing is the foundation of a Michigan Tech education. Huskies can be involved in <a href="https://www.mtu.edu/admissions/academics/research/">undergraduate research</a> from their very first year on campus.</p>

You can access the text within the `<p>` or other `Tag`s using the `text` property:

In [44]:
first_p_text = first_p.text 
first_p_text

'Learning by doing is the foundation of a Michigan Tech education. Huskies can be involved in\xa0undergraduate research from their very first year on campus.'

You can get individual words using the string `split` function: 

In [45]:
first_p_words = first_p.text.split()
first_p_words

['Learning',
 'by',
 'doing',
 'is',
 'the',
 'foundation',
 'of',
 'a',
 'Michigan',
 'Tech',
 'education.',
 'Huskies',
 'can',
 'be',
 'involved',
 'in',
 'undergraduate',
 'research',
 'from',
 'their',
 'very',
 'first',
 'year',
 'on',
 'campus.']
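
The `\xa0` in the text above is a non-breaking space. Python counts it as whitespace, which is why `split` still separated `'in'` from `'undergraduate'`. If you want to keep the text as one string, the sketch below shows two ways to turn non-breaking spaces into ordinary spaces; the shortened `text` is ours.

```python
import unicodedata

text = "Huskies can be involved in\xa0undergraduate research"

print("\xa0".isspace())      # str.split() treats it as whitespace
print(text.split()[-3:])

# Two ways to turn non-breaking spaces into ordinary spaces:
cleaned = text.replace("\xa0", " ")
normalized = unicodedata.normalize("NFKC", text)
print(cleaned == normalized)
```

Note that `NFKC` normalization also rewrites other "compatibility" characters, so `replace` is the more targeted choice when only `\xa0` is a concern.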

You can extract a tag's attributes by treating it like a `dict`:

In [46]:
first_p_id = soup.p['id']           # raises a KeyError if no 'id'

KeyError: 'id'

In [47]:
first_p_id = soup.p.get('id')       # returns None if no 'id'
print(first_p_id)

None


In [48]:
first_meta_content = soup.meta.get('charset')
first_meta_content

'UTF-8'

You can get multiple tags at once:

In [49]:
all_p = soup.find_all('p')
all_p

[<p>Learning by doing is the foundation of a Michigan Tech education. Huskies can be involved in <a href="https://www.mtu.edu/admissions/academics/research/">undergraduate research</a> from their very first year on campus.</p>,
 <p>Michigan Technological University <span itemprop="description">is a flagship public research university founded in <span itemprop="foundingDate">1885</span>. Our campus in Michigan's Upper Peninsula overlooks the Keweenaw Waterway and is just a few miles from Lake Superior.</span></p>,
 <p class="margin-bottom-2x text-center text-left-medium">
 						Find what makes you tick. Work hard. Change the world. With more than 125 degree programs to choose from, you’ll be sure to find an academic home at Michigan Tech that suits your unique passions and strengths.
 					</p>,
 <p>in the nation for return on investment (Forbes)</p>,
 <p>Best Public College in the US (Wall Street Journal)</p>,
 <p>median early career pay</p>,
 <p>enrolled students from 55+ countries</

In [50]:
for p in soup.find_all('p'):
    print(p.text)

Learning by doing is the foundation of a Michigan Tech education. Huskies can be involved in undergraduate research from their very first year on campus.
Michigan Technological University is a flagship public research university founded in 1885. Our campus in Michigan's Upper Peninsula overlooks the Keweenaw Waterway and is just a few miles from Lake Superior.

						Find what makes you tick. Work hard. Change the world. With more than 125 degree programs to choose from, you’ll be sure to find an academic home at Michigan Tech that suits your unique passions and strengths.
					
in the nation for return on investment (Forbes)
Best Public College in the US (Wall Street Journal)
median early career pay
enrolled students from 55+ countries
Our beautiful waterfront campus in Michigan's Upper Peninsula is situated just miles
                     from Lake Superior. Find yourself in a close-knit college town where winters are epic
                     and outdoor adventure awaits. At Tech, w

## Example 3 

Let's look at another website: [Slashdot articles filtered by Python](https://slashdot.org/index2.pl?fhfilter=python)

In [51]:
url = "https://slashdot.org/index2.pl?fhfilter=python"
soup = BeautifulSoup(requests.get(url).text, 'html5lib')

It is often very useful to look at the page's source (there are ways to do this in Safari, Chrome, Firefox, IE, etc.):

https://slashdot.org/index2.pl?fhfilter=python

Here are some of the relevant sections (note that the example was adapted somewhat to remove extra whitespace):

```html
<article onclick="javascript:return false;"  id="firehose-166258475" data-fhid="166258475" data-fhtype="story" class="fhitem fhitem-story briefarticle usermode thumbs grid_24">
		<span class="sd-info-block" style="display: none">
			<span class="sd-key-firehose-id">166258475</span>
			<span class="type">story</span>
			
		</span>

<header>
	
		<span class="topic" id="topic-166258475">
			<a href="//slashdot.org/index2.pl?fhfilter=censorship" onclick="return addfhfilter('censorship');">
			
				<img src="//a.fsdn.com/sd/topics/censorship_64.png" width="64" height="64" alt="Censorship" title="Censorship">
			
		</a>
		</span>
	

	<h2 class="story">
		<span id="title-166258475" class="story-title"> <a onclick="return toggle_fh_body_wrap_return(this);"  href="//yro.slashdot.org/story/22/09/18/2246254/do-americas-free-speech-protections-protect-code---and-prevent-cryptocurrency-regulation">Do America's Free-Speech Protections Protect Code - and Prevent Cryptocurrency Regulation?</a> <span class=" no extlnk"><a class="story-sourcelnk" href="https://www.marketplace.org/shows/marketplace-tech/why-the-first-amendment-also-protects-code/"  title="External link - https://www.marketplace.org/shows/marketplace-tech/why-the-first-amendment-also-protects-code/" target="_blank"> (marketplace.org) </a></span></span>
		<!--<span class="comments commentcnt-166258475" >64</span>-->
		<!-- comment bubble -->
		
			<span class="comment-bubble"><a href="//yro.slashdot.org/story/22/09/18/2246254/do-americas-free-speech-protections-protect-code---and-prevent-cryptocurrency-regulation#comments" title="">64</a></span>
		
	</h2>
	<div class="details" id="details-166258475">
		<span class="story-details">
		<span class="story-views">
			<span class="sodify" onclick="firehose_set_options('color', 'red')" title="Filter Firehose to entries rated red or better"></span><span class="icon-beaker pop1 " alt="Popularity" title="Filter Firehose to entries rated red or better" onclick="firehose_set_options('color', 'red')"><span></span></span> 
		</span>
		</span>
		<span class="story-byline">			
			Posted
				by 	
				  EditorDavid
	
		<time id="fhtime-166258475" datetime="on Sunday September 18, 2022 @06:51PM">on Sunday September 18, 2022 @06:51PM</time>	
			 from the <span class="dept-text">million-dollar-questions</span> dept.
		
		</span>
	</div>
</header>

<div class="hide" id="fhbody-166258475">
	

	
		
		<div id="text-166258475" class="p">
			...
		</div>

	</div>
	<aside class="novote">
		
	</aside>
		<footer class="clearfix meta article-foot">
			<div class="story-controls">
			</div>
			
				<div class="story-tags">
					<span class="tright tags"><menu type="toolbar" class="edit-bar">
		<span id="tagbar-166258475" class="tag-bar none">
			<a  class="topic tag" rel="statictag" href="//slashdot.org/tag/" target="_blank"></a>

		</span>
		<div class="tag-menu">
			<input class="tag-entry default" type="text" value="apply tags">
		</div>
	</menu></span>
				</div>
		</footer>
	
	</article><article onclick="javascript:return false;"  id="firehose-166233741" data-fhid="166233741" data-fhtype="story" class="fhitem fhitem-story briefarticle usermode thumbs grid_24">
		<span class="sd-info-block" style="display: none">
			<span class="sd-key-firehose-id">166233741</span>
			<span class="type">story</span>
			
		</span>


```

It looks like each item returned is wrapped in an `article` tag with the class `fhitem`. 

In [52]:
soup

<!-- html-header type=current begin --><!DOCTYPE html>
<html lang="en"><head>
	<!-- Render IE9 -->
	<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>

	

	<script id="before-content" type="text/javascript">
(function () {
    if (typeof window.sdmedia !== 'object') {
         window.sdmedia = {};
    }
    if (typeof window.sdmedia.site !== 'object') {
        window.sdmedia.site = {};
    }

    var site = window.sdmedia.site;
    site.rootdir = "//slashdot.org";
}());

var pageload = {
	pagemark: '704282867769648278',
	before_content: (new Date).getTime()
};
function pageload_done( $, console, maybe ){
	pageload.after_readycode	= (new Date).getTime();
	pageload.content_ready_time	= pageload.content_ready - pageload.before_content;
	pageload.script_ready_time	= pageload.after_readycode - pageload.content_ready;
	pageload.ready_time		= pageload.after_readycode - pageload.before_content;
	// Only report 1% of cases.
	maybe || (Math.random()>0.01) || $.ajax({ type: 'POST', 

How do we find each of the stories?  What tag or tag/attribute combination should be used?

In [53]:
articles = soup('article', 'fhitem')
print(len(articles))

20
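The call `soup('article', 'fhitem')` is BeautifulSoup shorthand: calling a soup object is the same as calling `find_all`, and a bare string as the second argument is matched against the CSS class. The sketch below checks that equivalence on a tiny hand-written snippet. The HTML is a made-up stand-in for the Slashdot page, and `html.parser` is used only so the example does not depend on an extra parser being installed.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the Slashdot markup, for illustration only.
html = """
<article class="fhitem fhitem-story" id="firehose-1"><h2>First</h2></article>
<article class="fhitem fhitem-story" id="firehose-2"><h2>Second</h2></article>
<article class="other" id="sidebar"><h2>Not a story</h2></article>
"""
soup_demo = BeautifulSoup(html, "html.parser")

# Three ways to ask for <article> tags that carry the class "fhitem".
shorthand = soup_demo("article", "fhitem")
explicit = soup_demo.find_all("article", class_="fhitem")
css = soup_demo.select("article.fhitem")

ids_shorthand = [a["id"] for a in shorthand]
ids_explicit = [a["id"] for a in explicit]
ids_css = [a["id"] for a in css]
print(ids_shorthand)
```

All three return the same two tags here; which form to use is mostly a matter of taste.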


 
Let's try to collect for each news story: the title, the author (who posted it), and the number of comments. 

In [54]:
articles = soup('article', 'fhitem')
for a in articles: 
    # get title, title inside <span class="story-title"> stored as text in link. 
    title = a.find('span', 'story-title').a.text.strip()
    print(title)
    # get author inside <span class="story-byline"> 
    author = a.find('span', 'story-byline').text
    print(author)
    # get the number of comments <span class="comment-bubble"> stored as text in link.
    comments = a.find('span', 'comment-bubble').a.text
    print(comments)

Microsoft Releases and Patents 'Python In Excel'

	
				
			Posted
				by 
		
		
			
				  BeauHD
			
		
		

		
		
		on Wednesday September 18, 2024 @06:00AM
		
		
			 from the would-you-look-at-that dept.
		
		
67
Fake Python Coding Tests Installed Malicious Software Packages From North Korea

	
				
			Posted
				by 
		
		
			
				  EditorDavid
			
		
		

		
		
		on Sunday September 15, 2024 @03:34AM
		
		
			 from the passing-is-failing dept.
		
		
22
JavaScript, Python, Java:  Redmonk's Programming Language Ranking Sees Lack of Change

	
				
			Posted
				by 
		
		
			
				  EditorDavid
			
		
		

		
		
		on Saturday September 14, 2024 @11:34AM
		
		
			 from the static-variables dept.
		
		
30
Python, JavaScript, Java: ZDNet Calculates The Most Popular Programming Languages

	
				
			Posted
				by 
		
		
			
				  EditorDavid
			
		
		

		
		
		on Sunday September 01, 2024 @08:09PM
		
		
			 from the popularity-contests dept.
		
		
39
VS Code Fork 'Cursor' - the ChatGPT of Codin

When writing code that loops over many items, remember that you can always test it on one or a few cases first: 

In [55]:
articles = soup('article', 'fhitem')
story = articles[1]
story

<article class="fhitem fhitem-story briefarticle usermode thumbs grid_24" data-fhid="174997541" data-fhtype="story" id="firehose-174997541" onclick="javascript:return false;">
		<span class="sd-info-block" style="display: none">
			<span class="sd-key-firehose-id">174997541</span>
			<span class="type">story</span>
			
		</span>







	
	

<header>
	
		<span class="topic" id="topic-174997541">
			<a href="//slashdot.org/index2.pl?fhfilter=python" onclick="return addfhfilter('python');">
			
				<img alt="Python" height="64" src="//a.fsdn.com/sd/topics/python_64.png" title="Python" width="64"/>
			
		</a>
		</span>
	

	<h2 class="story">
		

		

		
		

		

		

		

		<span class="story-title" id="title-174997541"> <a href="//developers.slashdot.org/story/24/09/15/0030229/fake-python-coding-tests-installed-malicious-software-packages-from-north-korea" onclick="return toggle_fh_body_wrap_return(this);">Fake Python Coding Tests Installed Malicious Software Packages From North Korea</a> <sp

In [56]:
author = story.find('span', 'story-byline').text
print(author)


	
				
			Posted
				by 
		
		
			
				  EditorDavid
			
		
		

		
		
		on Sunday September 15, 2024 @03:34AM
		
		
			 from the passing-is-failing dept.
		
		


We may want to think about how to strip the "Posted by" text, the time, and the extra whitespace from the author information. 
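As a first step, the whitespace alone can be collapsed with plain string methods. This is a minimal sketch; `raw_byline` is a hand-typed stand-in shaped like the byline text printed above, not the live page.

```python
# Hypothetical byline text, shaped like the output printed above.
raw_byline = (
    "\n\t\n\t\t\tPosted\n\t\t\t\tby \n\t\t\t\t  EditorDavid\n\t\t\n"
    "\t\ton Sunday September 15, 2024 @03:34AM\n\t\t\n"
    "\t\t\t from the passing-is-failing dept.\n\t\t"
)

# str.split() with no arguments splits on any run of whitespace and drops
# empty strings, so joining with a single space normalizes the text.
clean_byline = " ".join(raw_byline.split())
print(clean_byline)
```

This leaves a single readable line, which is easier to pick apart afterwards.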


In [57]:
author = story.find('span', 'story-byline').text
print(author)
aut = re.match(r'\s+Posted\s+by\s+(\w+)\s+on.*', author)
print(aut.group(1))


	
				
			Posted
				by 
		
		
			
				  EditorDavid
			
		
		

		
		
		on Sunday September 15, 2024 @03:34AM
		
		
			 from the passing-is-failing dept.
		
		
EditorDavid
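Note that `re.match` returns `None` when the byline does not have the expected shape, and calling `.group(1)` on `None` raises an error. A slightly more defensive version might look like the sketch below. The helper name `extract_author` is made up for this example, and `\S+` is used instead of `\w+` on the assumption that some user names contain characters such as hyphens.

```python
import re
from typing import Optional

# Same idea as the pattern above, but compiled once and applied with
# re.search so leading whitespace does not have to be matched explicitly.
BYLINE_RE = re.compile(r"Posted\s+by\s+(\S+)\s+on\b")

def extract_author(byline: str) -> Optional[str]:
    """Return the poster's name, or None if the byline has an unexpected shape."""
    m = BYLINE_RE.search(byline)
    return m.group(1) if m else None

# Hypothetical inputs: one well-formed byline and one that does not match.
good = extract_author("\n\t Posted\n\t by \n\t  EditorDavid\n\t on Sunday September 15, 2024 @03:34AM")
bad = extract_author("no byline here")
print(good, bad)
```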


Let's create a data frame to store the news story information in. 

In [58]:
# find the stories first so the data frame can be sized to match
articles = soup('article', 'fhitem')

# create the data frame to store the information
df = pd.DataFrame(columns = ['Title', 'Author', 'Comments'],
                 index = range(0,len(articles)))

for ai, a in enumerate(articles): 
    # get title, title inside <span class="story-title"> stored as text in link. 
    title = a.find('span', 'story-title').a.text.strip()
    # get author inside <span class="story-byline"> 
    author = a.find('span', 'story-byline').text
    # keep only the name that follows "Posted by"
    aut = re.match(r'\s+Posted\s+by\s+(\w+)\s+on.*', author)
    # get the number of comments <span class="comment-bubble"> stored as text in link.
    comments = a.find('span', 'comment-bubble').a.text
    df.iloc[ai,0] = title
    df.iloc[ai,1] = aut.group(1)
    df.iloc[ai,2] = comments
    
df

|   | Title | Author | Comments |
|---|-------|--------|----------|
| 0 | Microsoft Releases and Patents 'Python In Excel' | BeauHD | 67 |
| 1 | Fake Python Coding Tests Installed Malicious S... | EditorDavid | 22 |
| 2 | JavaScript, Python, Java: Redmonk's Programmi... | EditorDavid | 30 |
| 3 | Python, JavaScript, Java: ZDNet Calculates The... | EditorDavid | 39 |
| 4 | VS Code Fork 'Cursor' - the ChatGPT of Coding? | EditorDavid | 69 |
| 5 | Python Developer Survey: 55% Use Linux, 6% Us... | EditorDavid | 68 |
| 6 | Ryzen 9 9950X Performs 16% Faster On Intel-Opt... | BeauHD | 21 |
| 7 | Cancel Bill Gates? New Book Paints Philanthro... | EditorDavid | 176 |
| 8 | NIST Releases an Open-Source Platform for AI S... | EditorDavid | 4 |
| 9 | Coders Don't Fear AI, Reports Stack Overflow's... | EditorDavid | 134 |
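An alternative to filling the frame cell by cell is to collect one record per story and build the data frame in a single step. This is only a sketch: the `rows` list holds made-up values standing in for the (title, author, comments) triples gathered inside the loop.

```python
import pandas as pd

# Hypothetical scraped values, standing in for (title, author, comments)
# triples collected inside the loop.
rows = [
    ("Story A", "EditorDavid", "22"),
    ("Story B", "BeauHD", "67"),
]

# Build the frame in one step from a list of dicts instead of assigning
# cell by cell with .iloc.
records = [{"Title": t, "Author": a, "Comments": c} for t, a, c in rows]
df_demo = pd.DataFrame(records, columns=["Title", "Author", "Comments"])

# The scraped comment counts are strings; convert them for numeric work.
df_demo["Comments"] = df_demo["Comments"].astype(int)
print(df_demo)
```

Converting `Comments` to integers matters as soon as we want to sort or sum by it, since text from a web page always arrives as strings.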


## Example 4 

Let's look at some data from the US Representatives - https://www.house.gov/representatives 

In [59]:
html = requests.get("https://www.house.gov/representatives").text
soup = BeautifulSoup(html, 'html5lib')
print (type(soup))

<class 'bs4.BeautifulSoup'>


In [60]:
# Examine the webpage (HTML)
print (soup.prettify()[0:1000])

<!DOCTYPE html>
<html dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <link href="http://www.house.gov/representatives" rel="canonical"/>
  <meta content="Drupal 10 (https://www.drupal.org)" name="Generator"/>
  <meta content="width" name="MobileOptimized"/>
  <meta content="true" name="HandheldFriendly"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <link href="/sites/default/themes/housegov/favicon.ico" rel="icon" type="image/vnd.microsoft.icon"/>
  <title>
   Representatives | house.gov
  </title>
  <link href="/sites/default/files/css/css_gIwMhjeUxSP62kHxwDqZMkz0w-XtnSO_QfeU6wSxDo0.css?delta=0&amp;language=en&amp;theme=housegov&amp;include=eJxFjOESwiAMg1-oG4_EFegxtKNKKcrbOz2nf3Jfckl0aqfdBVSCTUwpy3CZJSAv2ieXmv_5CUvE9kYGevajcnWp2Q15_Vq43I3a9FY8xigtFanuR8A4xbpPRaOMo-ekUhSGUeih7qPrLsmY4JjQed4xMG2EidoLy5dH-Q" media="all" rel="stylesheet"/>
  <link href="/sites/default/files/css/css_umwJEJp5ol85E_gcKRD6ggKhFtIVzn7D11xlaPlBYLk.css?delta=1&amp;langua

**Grab Representatives Websites**

Let's first just try to grab all the URLs linked from the page.

In [61]:
all_urls = [a['href'] 
            for a in soup('a')
            if a.has_attr('href')]

print(len(all_urls))

967


*ASIDE: List Comprehensions*

In [62]:
# As an aside, we will use expressions like the one above extensively in this section.  
# They are called list comprehensions.
evens = [a for a in range(10) if a % 2 == 0]
evens

[0, 2, 4, 6, 8]

In [63]:
odds = [i for i in range(10) if i % 2 == 1]
odds

[1, 3, 5, 7, 9]
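To make the aside concrete, here is a small sketch showing that a list comprehension is a compact form of a `for` loop with an `if`, and that the same syntax also builds sets and dictionaries. All variable names are just for illustration.

```python
# The comprehension [x for x in range(10) if x % 2 == 0] is equivalent to:
evens_loop = []
for x in range(10):
    if x % 2 == 0:
        evens_loop.append(x)

evens_comp = [x for x in range(10) if x % 2 == 0]

# The same syntax with braces builds a set (duplicates collapse) ...
remainders = {x % 3 for x in range(10)}
# ... and with key: value pairs it builds a dict.
squares = {x: x * x for x in range(4)}

print(evens_loop == evens_comp)
```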

In [64]:
print(all_urls[1:10])

['/', '/', '/representatives', '/leadership', '/committees', '/legislative-activity', '/the-house-explained', '/visitors', '/educators-and-students']


This is too many! Some of these are relative links to pages on this site, rather than links to the representatives' websites. 
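One way to see the difference between the two kinds of links is to check whether a URL names a host. The sketch below uses the standard library's `urlparse`; the sample list is hand-made (the `#main-content` entry in particular is hypothetical).

```python
from urllib.parse import urlparse

# Hypothetical sample of hrefs like the ones printed above.
sample_urls = ["/", "/representatives", "https://carl.house.gov", "#main-content"]

# An absolute URL has a network location (host); a relative one does not.
absolute = [u for u in sample_urls if urlparse(u).netloc]
relative = [u for u in sample_urls if not urlparse(u).netloc]

print(absolute)
print(relative)
```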

In [65]:
# Let's examine the part of the HTML file that stores the tables
print (soup.prettify()[21000:23000])

                  Phone
                 </th>
                 <th class="views-field views-field-markup" id="view-markup-table-column" scope="col">
                  Committee Assignment
                 </th>
                </tr>
               </thead>
               <tbody>
                <tr>
                 <td class="views-field views-field-value-2" headers="view-value-2-table-column">
                  1st
                 </td>
                 <td class="views-field views-field-value-4 views-field-value-5" headers="view-value-4-table-column">
                  <a href="https://carl.house.gov">
                   Carl, Jerry
                  </a>
                 </td>
                 <td class="views-field views-field-value-7" headers="view-value-7-table-column">
                  R
                 </td>
                 <td class="views-field views-field-value-8 views-field-value-9" headers="view-value-9-table-column">
                  1330 LHOB
                 </td

By looking at a number of the links we want to find, we can see that most start with `http://` or `https://` and end with `.house.gov` or `.house.gov/`.

Therefore, we should think about using a regular expression.

In [66]:
# Must start with http:// or https://
# Must end with .house.gov or .house.gov/
pat = r"^https?://.*\.house\.gov/?$"

In [67]:
# Let's check some tests! 
assert re.match(pat, "http://joel.house.gov")
assert re.match(pat, "https://joel.house.gov")
assert re.match(pat, "http://joel.house.gov/")
assert re.match(pat, "https://joel.house.gov/")
assert not re.match(pat, "joel.house.gov")
assert not re.match(pat, "http://joel.house.com")
assert not re.match(pat, "https://joel.house.gov/biography")
assert re.match(pat, "http://joel.lname.house.gov")
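To read the pattern piece by piece, it can help to write it with `re.VERBOSE`, which allows whitespace and comments inside the regular expression. The sketch below is meant to be the same pattern as `pat`, just spelled out; `pat_verbose` is a name made up for this example.

```python
import re

# The same pattern as `pat`, spelled out piece by piece with re.VERBOSE.
pat_verbose = re.compile(
    r"""
    ^https?://      # scheme: http:// or https:// at the start
    .*              # any text in between (greedy)
    \.house\.gov    # literal ".house.gov" (dots escaped)
    /?              # optional trailing slash
    $               # nothing else may follow
    """,
    re.VERBOSE,
)

print(bool(pat_verbose.match("https://joel.house.gov/")))
print(bool(pat_verbose.match("https://joel.house.gov/biography")))

# Caveat: ".*" is permissive, so a URL on another host whose path happens
# to end in ".house.gov" is accepted as well.
print(bool(pat_verbose.match("https://example.com/a.house.gov")))
```

The last case is unlikely to appear on this page, but it is worth keeping in mind before reusing the pattern elsewhere.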

In [68]:
# And now apply
good_urls = [url for url in all_urls if re.match(pat, url)]
print(len(good_urls))

874


This is still a lot of URLs.  

Let's try to trim this list. 

In [69]:
[i for i in range(len(good_urls)) if good_urls[i]== "https://barrymoore.house.gov"]

[1, 718]

There are multiple instances of the same URL. 

Let's convert the list to a **set** to keep only the unique URLs (getting rid of duplicates). 
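To check for repeats across the whole list rather than one URL at a time, `collections.Counter` is handy. A minimal sketch on a hand-made list (standing in for `good_urls`):

```python
from collections import Counter

# Hypothetical list with repeats, standing in for good_urls.
urls = [
    "https://barrymoore.house.gov",
    "https://carl.house.gov",
    "https://barrymoore.house.gov",
]

# Count every URL, then keep only the ones seen more than once.
counts = Counter(urls)
duplicates = {u: n for u, n in counts.items() if n > 1}
print(duplicates)
```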

In [70]:
good_urls = list(set(good_urls))
print(len(good_urls))

437


This is getting really close: we expect to see 435, so there are only a few extras.

Let's move forward at this point. We can always clean the list further as a next step.
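One possible cleaning step, sketched below, is to deduplicate by host name instead of by the full string. This assumes the extras are the same site written slightly differently (with or without a trailing slash, `http` vs `https`); that is a guess and has not been checked against the live list. The sample URLs are hand-made.

```python
from urllib.parse import urlparse

# Hypothetical variants of the same sites.
urls = [
    "https://clarke.house.gov/",
    "https://clarke.house.gov",
    "http://clarke.house.gov",
    "https://gaetz.house.gov",
]

# Keep one entry per host name, ignoring scheme and trailing slash.
unique_hosts = sorted({urlparse(u).netloc.lower() for u in urls})
print(unique_hosts)
```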

Let's look at individual representative sites for links to press releases. For example:

In [71]:
html = requests.get('https://bergman.house.gov').text
soup = BeautifulSoup(html, 'html5lib')

# Use a set because the links might appear multiple times.
links = {a['href'] for a in soup('a') if 'press releases' in a.text.lower()}

print(links)

{'/news/documentquery.aspx?DocumentTypeID=27'}


Notice this is a relative link, which means we need to remember the originating site.
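The standard library's `urljoin` can combine the originating site with a relative link. A small sketch using the values printed above:

```python
from urllib.parse import urljoin

base = "https://bergman.house.gov"
relative_link = "/news/documentquery.aspx?DocumentTypeID=27"

# Resolve the relative link against the site it came from.
full = urljoin(base, relative_link)
print(full)

# Absolute links pass through unchanged, so urljoin is safe for both kinds.
already_absolute = urljoin(base, "https://clarke.house.gov/category/pr/")
print(already_absolute)
```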

Let's put this together.

In [72]:
from typing import Dict, Set

press_releases: Dict[str, Set[str]] = {}

for house_url in good_urls:
    html = requests.get(house_url).text
    soup = BeautifulSoup(html, 'html5lib')
    pr_links = {a['href'] for a in soup('a') if 'press releases'
                                             in a.text.lower()}
    print(f"{house_url}: {pr_links}")
    press_releases[house_url] = pr_links

https://clarke.house.gov/: {'https://clarke.house.gov/category/pr/'}
https://gaetz.house.gov: {'/media/press-releases'}
https://bera.house.gov: {'/news/documentquery.aspx?DocumentTypeID=2402'}
https://halrogers.house.gov/: {'/press-releases'}
https://lindasanchez.house.gov/: {'/media-center/press-releases'}
https://boebert.house.gov: {'/media/press-releases'}
https://juliabrownley.house.gov: {'https://juliabrownley.house.gov/category/press-releases/'}
https://fong.house.gov/: {'/media/press-releases'}
https://williams.house.gov: {'/media-center/press-releases'}
https://tenney.house.gov/: {'/media/press-releases'}
https://bilirakis.house.gov/: {'/media/press-releases'}
https://rouzer.house.gov/: {'/news/documentquery.aspx?DocumentTypeID=27'}
https://davids.house.gov/: {'/media/press-releases'}
https://amodei.house.gov: set()
https://kean.house.gov: {'/media/press-releases'}
https://scalise.house.gov/: {'/media/press-releases'}
https://waters.house.gov: {'/media-center/press-releases'}
h

KeyboardInterrupt: 
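Visiting hundreds of sites takes a while, and a single slow or broken site can stall or crash the loop. Below is a hedged sketch of a more forgiving version. The function names (`find_press_release_links`, `crawl`, `fake_fetch`) are made up for this example, and the fetching step is passed in as a function so the sketch can run without network access.

```python
import time
from typing import Callable, Dict, Iterable, Set

from bs4 import BeautifulSoup


def find_press_release_links(html: str) -> Set[str]:
    """Return hrefs of links whose text mentions 'press releases'."""
    page = BeautifulSoup(html, "html.parser")
    # href=True skips <a> tags without an href, which would raise KeyError.
    return {a["href"] for a in page("a", href=True)
            if "press releases" in a.text.lower()}


def crawl(urls: Iterable[str],
          fetch: Callable[[str], str],
          delay: float = 0.0) -> Dict[str, Set[str]]:
    """Fetch each URL, skipping (not crashing on) sites that fail."""
    results: Dict[str, Set[str]] = {}
    for url in urls:
        try:
            html = fetch(url)
        except Exception as exc:  # e.g. timeout or connection error
            print(f"skipping {url}: {exc}")
            continue
        results[url] = find_press_release_links(html)
        time.sleep(delay)  # pause between requests to be polite
    return results


# Stand-in fetcher so the sketch runs without network access.
def fake_fetch(url: str) -> str:
    if "down" in url:
        raise RuntimeError("site unreachable")
    return '<a href="/media/press-releases">Press Releases</a><a>Press Releases</a>'


demo = crawl(["https://ok.house.gov", "https://down.house.gov"], fake_fetch)
print(demo)
```

With `requests`, the `fetch` argument could be something like `lambda u: requests.get(u, timeout=10).text`; the 10-second timeout is an arbitrary choice.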

## Example 5 

For this example, we want to scrape an HTML table. 

Here we will look at a Fandom wiki page: https://among-us.fandom.com/wiki/Tasks

We are interested in the table of tasks. 

In [73]:
html = requests.get("https://among-us.fandom.com/wiki/Tasks").text
soup = BeautifulSoup(html, 'html5lib')

We can find the table in the source code: 

```html 
<table class="wikitable list-table sortable mw-collapsible">
<tbody><tr>
<th style="min-width: 25%">Task
</th>
<th style="min-width: 15%">Map
</th>
<th style="min-width: 15%">Type
</th>
<th>List of steps
</th></tr>
<tr>
<td><a href="/wiki/Align_Engine_Output" title="Align Engine Output">Align Engine Output</a>
</td>
<td><a href="/wiki/The_Skeld" title="The Skeld">The Skeld</a>
</td>
<td>Long
</td>
<td style="padding-left: 10px;">
<ul><li><a href="/wiki/Upper_Engine" title="Upper Engine">Upper Engine</a></li>
<li><a href="/wiki/Lower_Engine" title="Lower Engine">Lower Engine</a></li></ul>
</td></tr>
<tr>
<td><a href="/wiki/Align_Telescope" title="Align Telescope">Align Telescope</a>
</td>
<td><a href="/wiki/Polus" title="Polus">Polus</a>
</td>
<td>Short
```

In [74]:
# Grab every <table class="wikitable list-table sortable mw-collapsible"> tag
#  that appears in the HTML
tasks_tab = soup.find_all("table", 
                          {"class": "wikitable list-table sortable mw-collapsible"})
tasks_tab

[<table class="wikitable list-table sortable mw-collapsible">
 <tbody><tr>
 <th style="min-width: 25%">Task
 </th>
 <th style="min-width: 15%">Map
 </th>
 <th style="min-width: 15%">Type
 </th>
 <th>List of steps
 </th></tr>
 <tr>
 <td><a href="/wiki/Align_Engine_Output" title="Align Engine Output">Align Engine Output</a>
 </td>
 <td><a href="/wiki/The_Skeld" title="The Skeld">The Skeld</a>
 </td>
 <td>Long
 </td>
 <td style="padding-left: 10px;">
 <ul><li><a href="/wiki/Upper_Engine" title="Upper Engine">Upper Engine</a></li>
 <li><a href="/wiki/Lower_Engine" title="Lower Engine">Lower Engine</a></li></ul>
 </td></tr>
 <tr>
 <td><a href="/wiki/Align_Telescope" title="Align Telescope">Align Telescope</a>
 </td>
 <td><a href="/wiki/Polus" title="Polus">Polus</a>
 </td>
 <td>Short
 </td>
 <td>
 <ul><li><a href="/wiki/Laboratory" title="Laboratory">Laboratory</a></li></ul>
 </td></tr>
 <tr>
 <td rowspan="2"><a href="/wiki/Assemble_Artifact" title="Assemble Artifact">Assemble Artifact</a>


In [75]:
len(tasks_tab)

1

In [76]:
# How can we grab all the rows of the table? 
tasks_dat = tasks_tab[0].find_all("tr")
tasks_dat

[<tr>
 <th style="min-width: 25%">Task
 </th>
 <th style="min-width: 15%">Map
 </th>
 <th style="min-width: 15%">Type
 </th>
 <th>List of steps
 </th></tr>,
 <tr>
 <td><a href="/wiki/Align_Engine_Output" title="Align Engine Output">Align Engine Output</a>
 </td>
 <td><a href="/wiki/The_Skeld" title="The Skeld">The Skeld</a>
 </td>
 <td>Long
 </td>
 <td style="padding-left: 10px;">
 <ul><li><a href="/wiki/Upper_Engine" title="Upper Engine">Upper Engine</a></li>
 <li><a href="/wiki/Lower_Engine" title="Lower Engine">Lower Engine</a></li></ul>
 </td></tr>,
 <tr>
 <td><a href="/wiki/Align_Telescope" title="Align Telescope">Align Telescope</a>
 </td>
 <td><a href="/wiki/Polus" title="Polus">Polus</a>
 </td>
 <td>Short
 </td>
 <td>
 <ul><li><a href="/wiki/Laboratory" title="Laboratory">Laboratory</a></li></ul>
 </td></tr>,
 <tr>
 <td rowspan="2"><a href="/wiki/Assemble_Artifact" title="Assemble Artifact">Assemble Artifact</a>
 </td>
 <td><a href="/wiki/MIRA_HQ" title="MIRA HQ">MIRA HQ</a>
 <

**Scrape Table into DataFrame**

Start with just the html in the table.

In [77]:
# Grab every <table class="wikitable list-table sortable mw-collapsible"> tag
#  that appears in the HTML
tasks_tab = soup.find_all("table", {"class": "wikitable list-table sortable mw-collapsible"})
tasks_tab[0]

<table class="wikitable list-table sortable mw-collapsible">
<tbody><tr>
<th style="min-width: 25%">Task
</th>
<th style="min-width: 15%">Map
</th>
<th style="min-width: 15%">Type
</th>
<th>List of steps
</th></tr>
<tr>
<td><a href="/wiki/Align_Engine_Output" title="Align Engine Output">Align Engine Output</a>
</td>
<td><a href="/wiki/The_Skeld" title="The Skeld">The Skeld</a>
</td>
<td>Long
</td>
<td style="padding-left: 10px;">
<ul><li><a href="/wiki/Upper_Engine" title="Upper Engine">Upper Engine</a></li>
<li><a href="/wiki/Lower_Engine" title="Lower Engine">Lower Engine</a></li></ul>
</td></tr>
<tr>
<td><a href="/wiki/Align_Telescope" title="Align Telescope">Align Telescope</a>
</td>
<td><a href="/wiki/Polus" title="Polus">Polus</a>
</td>
<td>Short
</td>
<td>
<ul><li><a href="/wiki/Laboratory" title="Laboratory">Laboratory</a></li></ul>
</td></tr>
<tr>
<td rowspan="2"><a href="/wiki/Assemble_Artifact" title="Assemble Artifact">Assemble Artifact</a>
</td>
<td><a href="/wiki/MIRA_HQ"
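From HTML like this, a data frame can be built by hand: the first `tr` holds the `th` header cells and the remaining rows hold `td` data cells. The sketch below works on a simplified, hand-copied fragment of the table (three columns, no `rowspan`), so it is an illustration and not a full solution. Cells with `rowspan`, like "Assemble Artifact" in the real table, would leave later rows short, and this sketch does not handle that.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical, simplified copy of the first rows of the tasks table.
table_html = """
<table class="wikitable">
<tbody><tr><th>Task
</th><th>Map
</th><th>Type
</th></tr>
<tr><td><a href="/wiki/Align_Engine_Output">Align Engine Output</a>
</td><td><a href="/wiki/The_Skeld">The Skeld</a>
</td><td>Long
</td></tr>
<tr><td><a href="/wiki/Align_Telescope">Align Telescope</a>
</td><td><a href="/wiki/Polus">Polus</a>
</td><td>Short
</td></tr></tbody>
</table>
"""
table = BeautifulSoup(table_html, "html.parser").find("table")
rows = table.find_all("tr")

# First row holds the <th> header cells; the rest hold <td> data cells.
header = [th.get_text(strip=True) for th in rows[0].find_all("th")]
data = [[td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in rows[1:]]

tasks_df = pd.DataFrame(data, columns=header)
print(tasks_df)
```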

**Take Advantage of Pandas**

Pandas also has a `read_html` function, which parses every HTML table it finds on a page and returns a list of data frames. 

In [78]:
df2 = pd.read_html('https://among-us.fandom.com/wiki/Tasks')[0]
df2.head(10)

|   | Task | Map | Type | List of steps |
|---|------|-----|------|---------------|
| 0 | Align Engine Output | The Skeld | Long | Upper Engine Lower Engine |
| 1 | Align Telescope | Polus | Short | Laboratory |
| 2 | Assemble Artifact | MIRA HQ | Short | Laboratory |
| 3 | Assemble Artifact | The Fungle | Short | Laboratory |
| 4 | Build Sandcastle | The Fungle | Short | Splash Zone |
| 5 | Buy Beverage | MIRA HQ | Short | Cafeteria |
| 6 | Calibrate Distributor | The Skeld | Short | Electrical |
| 7 | Calibrate Distributor | The Airship | Short | Electrical |
| 8 | Chart Course | The Skeld | Short | Navigation |
| 9 | Chart Course | MIRA HQ | Short | Admin |


In [79]:
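# wrap the HTML string in StringIO: newer pandas versions deprecate passing literal HTML
from io import StringIO
df3 = pd.read_html(StringIO(soup.prettify()), flavor=['bs4'])[0]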
df3.head(10)

|   | Task | Map | Type | List of steps |
|---|------|-----|------|---------------|
| 0 | Align Engine Output | The Skeld | Long | Upper Engine Lower Engine |
| 1 | Align Telescope | Polus | Short | Laboratory |
| 2 | Assemble Artifact | MIRA HQ | Short | Laboratory |
| 3 | Assemble Artifact | The Fungle | Short | Laboratory |
| 4 | Build Sandcastle | The Fungle | Short | Splash Zone |
| 5 | Buy Beverage | MIRA HQ | Short | Cafeteria |
| 6 | Calibrate Distributor | The Skeld | Short | Electrical |
| 7 | Calibrate Distributor | The Airship | Short | Electrical |
| 8 | Chart Course | The Skeld | Short | Navigation |
| 9 | Chart Course | MIRA HQ | Short | Admin |
