In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (10, 5)


# Lecture 10 - HTTP

## DSC 80, Fall 2022

## Today, in DSC 80...

- How do we programmatically get data from the internet?

## Introduction to HTTP

<center><img src="imgs/DSLC.png" width="40%"></center>

### Collecting data

* Often, the data you need doesn't exist in "clean" `.csv` files.
* **Solution:** collect your own data!
    - Design and administer your own survey or run an experiment.
    - Find related data on the internet.

### Data on the internet

- The internet contains **massive** amounts of historical record.
    - News stories provide a record of world events.
    - Social networks and commerce sites provide an insight into human behavior.
- For most questions you can think of, the data to answer the question exists somewhere on the internet.

### Collecting data from the internet

- There are two ways to programmatically access data on the internet.
    - Through an API.
    - By scraping.
- We will discuss the differences between these two approaches next lecture, but for now, the important part is that they **both use HTTP**.

### HTTP

- HTTP stands for **Hypertext Transfer Protocol**.
    - It was developed in 1989 by Tim Berners-Lee (and friends).
- It is a **request-response** protocol.
    - Protocol = set of rules.
- HTTP allows...
    - computers to talk to each other over a network.
    - devices to fetch data from "web servers".
- The "S" in HTTPS stands for "secure".

<center><img src='imgs/ucsd.png' width=750></center>

UCSD was a node in ARPANET, the predecessor to the modern internet ([source](https://en.wikipedia.org/wiki/ARPANET#/media/File:Arpanet_map_1973.jpg/)).

### The request-response model

HTTP follows the **request-response** model.

<center><img src='imgs/req-response.png' width=600></center>
    
- A **request** is made by the **client**.
- A **response** is returned by the **server**.

- **Example:** YouTube 🎥.
    - Your phone's web browser, a **client**, makes an HTTP **request** to view a video.
    - The **server**, YouTube, is a computer that is sitting somewhere else.
    - The server returns a **response** that contains the video.

### Request methods

- The request methods you will use most often are `GET` and `POST`.

    - `GET` is used to request data **from** a specified resource.

    - `POST` is used to **send** data to the server. 
        - e.g. uploading a photo to Instagram or entering credit card information on Amazon.
    
- See [Mozilla's web docs](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods) for a detailed list of request methods.

### Example `GET` request

Below is an example `GET` HTTP request made by a browser when accessing [datascience.ucsd.edu](https://datascience.ucsd.edu).

```HTTP
GET / HTTP/1.1
Host: datascience.ucsd.edu
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36
Connection: keep-alive
Accept-Language: en-US,en;q=0.9
```

- The first line (`GET / HTTP/1.1`) is called the "request line", and the lines afterwards are called "header fields". We could also provide a "body" after the header fields.
- To see HTTP requests in Google Chrome, follow [these steps](https://mkyong.com/computer-tips/how-to-view-http-headers-in-google-chrome/).
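
To make the anatomy of a request concrete, here is a minimal sketch that assembles the text of a `GET` request by hand. The function name `build_get_request` and its parameters are made up for illustration; in practice a browser, `curl`, or a Python library writes this text for you.

```python
def build_get_request(host, path="/", extra_headers=None):
    """Return the text of a minimal HTTP/1.1 GET request (hypothetical helper)."""
    headers = {"Host": host, "Connection": "close"}
    headers.update(extra_headers or {})

    request_line = f"GET {path} HTTP/1.1"
    header_lines = [f"{name}: {value}" for name, value in headers.items()]

    # Each line ends with CRLF; an empty line marks the end of the header fields.
    # A GET request usually has no body, so nothing follows the empty line.
    return "\r\n".join([request_line, *header_lines, "", ""])


raw = build_get_request(
    "datascience.ucsd.edu",
    extra_headers={"Accept-Language": "en-US,en;q=0.9"},
)
print(raw)
```

The printed text has the same shape as the browser's request above: one request line, then one header field per line.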

### Example `GET` response

The response below was generated by executing the request on the previous slide.

```HTTP
HTTP/1.1 200 OK
Date: Fri, 29 Apr 2022 02:54:41 GMT
Server: Apache
Link: <https://datascience.ucsd.edu/wp-json/>; rel="https://api.w.org/"
Link: <https://datascience.ucsd.edu/wp-json/wp/v2/pages/2427>; rel="alternate"; type="application/json"
Link: <https://datascience.ucsd.edu/>; rel=shortlink
Content-Type: text/html; charset=UTF-8

<!DOCTYPE html>
<html lang="en-US">
<head>
	<meta charset="UTF-8">
	<link rel="profile" href="https://gmpg.org/xfn/11">
	<style media="all">img.wp-smiley,img.emoji{display:inline !important;border:none
...
```
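
A response has the same three-part shape: a status line, header fields, and (usually) a body. The sketch below splits a shortened copy of the response above into those parts. `parse_response` is a hypothetical helper written for this slide, and it assumes lines are separated by `\n` for readability (on the wire, HTTP uses `\r\n`).

```python
def parse_response(raw):
    """Split raw HTTP response text into status line, headers, and body."""
    # The first empty line separates the head (status line + headers) from the body.
    head, _, body = raw.partition("\n\n")
    status_line, *header_lines = head.split("\n")

    version, code, reason = status_line.split(" ", 2)

    # Headers are kept as a list of pairs because a name (e.g. Link) may repeat.
    headers = []
    for line in header_lines:
        name, _, value = line.partition(": ")
        headers.append((name, value))

    return {
        "version": version,
        "status": int(code),
        "reason": reason,
        "headers": headers,
        "body": body,
    }


raw_response = (
    "HTTP/1.1 200 OK\n"
    "Server: Apache\n"
    "Content-Type: text/html; charset=UTF-8\n"
    "\n"
    "<!DOCTYPE html>\n"
    '<html lang="en-US">'
)

parsed_response = parse_response(raw_response)
print(parsed_response["status"], parsed_response["reason"])
print(parsed_response["headers"])
```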

### Consequences of the request-response model

- When a request is sent to view content on a webpage, the server must:
    - process your request (i.e. prepare data for the response).
    - send content back to the client in its response.
- Remember, servers are computers. 
    - Someone has to pay to keep these computers running.
    - **This means that every time you access a website, someone has to pay.**

### Example: [istheshipstillstuck.com](https://istheshipstillstuck.com)

<center><img src='imgs/ship.png' width=500></center>

Read [_Inside a viral website_](https://notfunatparties.substack.com/p/inside-a-viral-website), an account of what it's like to run a site that gained 50 million+ views in 5 days.

## Making HTTP requests

### Making HTTP requests

We'll see two ways to make HTTP requests outside of a browser:

- From the command line, with `curl`.
- From Python, with the `requests` package.

### Making HTTP requests using `curl`

[`curl`](https://curl.haxx.se/docs/httpscripting.html) is a **command-line tool** that sends HTTP requests, like a browser.

1. The client, `curl`, sends an HTTP request. 
2. The request contains a method (e.g. `GET` or `POST`).
3. The HTTP server responds with 
    - a status line, indicating if things went well, 
    - response headers, and
    - (usually) a response body, containing the requested data.
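
The steps above can be watched end to end without touching the internet. The sketch below is an illustration, not how `curl` itself works: it uses only Python's standard library to start a toy server on your own machine, send it one `GET` request, and print the status line, one header, and the body. The names `HelloHandler` and `fetch_once` are made up for this example.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """A toy server that answers every GET request with a small HTML page."""

    def do_GET(self):
        body = b"<html><body><h1>Hello, client!</h1></body></html>"
        self.send_response(200)                         # status line
        self.send_header("Content-Type", "text/html")   # response headers
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)                          # response body

    def log_message(self, *args):
        pass  # silence the default per-request logging


def fetch_once():
    """Start the toy server, make one GET request to it, then shut it down."""
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0: pick any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = HTTPConnection("127.0.0.1", server.server_address[1], timeout=5)
        conn.request("GET", "/")            # the client sends the request
        resp = conn.getresponse()           # the server's response arrives
        result = (resp.status, resp.reason,
                  resp.getheader("Content-Type"), resp.read())
        conn.close()
        return result
    finally:
        server.shutdown()
        server.server_close()


status, reason, content_type, body = fetch_once()
print(status, reason)
print(content_type)
print(body.decode())
```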

### Example: `GET` requests via `curl`

- By default, `curl` issues a `GET` request.
- Remember, you can run command-line commands in a Jupyter Notebook by placing a `!` before them.

```zsh
curl -v https://httpbin.org/html
# (`-v` is short for verbose)
```

- After running the command, go to [https://httpbin.org/html](https://httpbin.org/html) in your browser. What do you notice?

In [4]:
!curl -v https://httpbin.org/html

*   Trying 54.161.34.85:443...
* Connected to httpbin.org (54.161.34.85) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*  CAfile: /etc/ssl/cert.pem
*  CApath: none
* (304) (OUT), TLS handshake, Client hello (1):
* (304) (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=httpbin.org
*  start date: Oct 21 00:00:00 2022 GMT
*  expire date: Nov 19 23:59:59 2023 GMT
*  subjectAltName: host 

### Queries in a `GET` request

- In order to request more specific information, we can include a **query string** in the URL.
- `?` begins a query. For instance,

<a href="https://www.google.com/search?q=ucsd+dsc+80+hard&client=safari"><pre>
https://www.google.com/search?q=ucsd+dsc+80+hard&client=safari
</pre></a>

- This method works well when sending small amounts of data.
- We will use a similar technique when working with APIs next lecture.
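
Query strings can be built and taken apart with Python's standard library instead of by hand. The sketch below rebuilds the Google URL above with `urllib.parse`; the variable names are chosen for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base = "https://www.google.com/search"
params = {"q": "ucsd dsc 80 hard", "client": "safari"}

# urlencode joins key=value pairs with & and replaces spaces with +.
search_url = f"{base}?{urlencode(params)}"
print(search_url)

# Going the other way: pull the query string out of a URL and decode it.
query = urlsplit(search_url).query
decoded = parse_qs(query)
print(decoded)
```

Note that `parse_qs` returns a list for each key, since a key may appear more than once in a query string.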

### Example: `POST` requests via `curl`

- When using `curl`, `-d` is short for `--data`; passing it makes `curl` send a `POST` request with the given data as the request body.
- Below is an example `curl` `POST` request that sends `'King Triton'` as the parameter `'name'`.

```zsh
curl -d 'name=King Triton' https://httpbin.org/post
```

In [5]:
!curl -d 'name=King Triton' https://httpbin.org/post

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "name": "King Triton"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Content-Length": "16", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "curl/7.79.1", 
    "X-Amzn-Trace-Id": "Root=1-63575905-2e50836e339664382527fba7"
  }, 
  "json": null, 
  "origin": "70.95.126.121", 
  "url": "https://httpbin.org/post"
}


- Run the cell below. Notice the difference?

In [6]:
!curl -d 'name=King Triton' https://youtube.com

<html lang="en" dir="ltr"><head><title>Oops</title><style nonce="Or5IryB8pxTGN3ljrNwA4g">html{font-family:Roboto,Arial,sans-serif;font-size:14px}body{background-color:#f9f9f9;margin:0}#content{max-width:440px;margin:128px auto}svg{display:block;pointer-events:none}#monkey{width:280px;margin:0 auto}h1,p{text-align:center;margin:0;color:#131313}h1{padding:24px 0 8px;font-size:24px;font-weight:400}p{line-height:21px}</style><link rel="shortcut icon" href="https://www.youtube.com/img/favicon.ico" type="image/x-icon"><link rel="icon" href="https://www.youtube.com/img/favicon_32.png" sizes="32x32"><link rel="icon" href="https://www.youtube.com/img/favicon_48.png" sizes="48x48"><link rel="icon" href="https://www.youtube.com/img/favicon_96.png" sizes="96x96"><link rel="icon" href="https://www.youtube.com/img/favicon_144.png" sizes="144x144"></head><body><div id="content"><h1>Something went wrong</h1><p><svg id="monkey" viewBox="0 0 490 525"><path fill="#6A1B9A" d="M325 85c1 12-1 25-5 38-8 29-3

### Making HTTP requests using `requests`

- `requests` is a Python package that allows you to use Python to interact with the internet!  
- There are other packages that work similarly (e.g. `urllib`), but `requests` is arguably the easiest to use.

In [7]:
import requests

### Example: `GET` requests via `requests`

To access the source code of the UCSD home page, all we need to run is the following:

```py
text = requests.get('https://ucsd.edu').text
```

In [8]:
url = 'https://ucsd.edu'
resp = requests.get(url)

`resp` is now a `Response` object.

In [9]:
resp

<Response [200]>

The `text` attribute of `resp` is a string containing the entire response body.

In [10]:
type(resp.text)

str

In [11]:
len(resp.text)

42833

In [12]:
print(resp.text[:1000])

<!DOCTYPE html>
<html lang="en">
  <head>
  
  

 





    <meta charset="utf-8"/>
    <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
    <meta content="width=device-width, initial-scale=1" name="viewport"/>
    <title>University of California San Diego</title>
    <meta content="University of California, San Diego" name="ORGANIZATION"/>
    <meta content="index,follow,noarchive" name="robots"/>
    <meta content="UCSD" name="SITE"/>
    <meta content="University of California San Diego" name="PAGETITLE"/>
    <meta content="The University California San Diego is one of the world's leading public research universities, located in beautiful La Jolla, California" name="DESCRIPTION"/>
    <link href="favicon.ico" rel="icon"/>


    
  




<!-- Site-specific CSS files -->
    
  <link href="https://www.ucsd.edu/_resources/css/vendor/brix_sans.css" rel="stylesheet" type="text/css"/>
  
  <!-- CSS complied from style overrides -->
  <link href="https://www.ucsd.edu/_resources/css/s

The `url` attribute of `resp.request` contains the URL that we accessed.

In [13]:
resp.request.url

'https://ucsd.edu/'

### Example: `POST` requests via `requests`

In [14]:
post_response = requests.post('https://httpbin.org/post',
                              data={'name': 'King Triton'})
post_response

<Response [200]>

In [15]:
print(post_response.text)

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "name": "King Triton"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate, br", 
    "Content-Length": "16", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.26.0", 
    "X-Amzn-Trace-Id": "Root=1-63575951-26cf2d611b5ae8586b5de85d"
  }, 
  "json": null, 
  "origin": "70.95.126.121", 
  "url": "https://httpbin.org/post"
}
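
Both `POST` outputs report `"Content-Type": "application/x-www-form-urlencoded"` and `"Content-Length": "16"`. As a rough sketch of where those values come from, the code below form-encodes the same data with the standard library; this approximates what a client does before sending and is not a copy of any library's internals.

```python
from urllib.parse import urlencode, parse_qs

form_data = {"name": "King Triton"}

# Form encoding turns the dictionary into key=value text; the space becomes +.
encoded_body = urlencode(form_data)
print(encoded_body)

# Content-Length is the number of bytes in the request body.
body_length = len(encoded_body.encode("utf-8"))
print(body_length)

# The server decodes the body to recover the form fields.
recovered = parse_qs(encoded_body)
print(recovered)
```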



### HTTP status codes

- When we **request** data from a website, the server includes an **HTTP status code** in the response.  
* The most common status code is `200`, which means there were no issues.  
* Other times, you will see a different status code, describing some sort of event or error.
    - e.g. `404`: page not found; `500`: internal server error.
    - [The first digit of a status describes its general "category".](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status)
- See [https://httpstat.us](https://httpstat.us/) for a list of all HTTP status codes.
    - It also has example sites for each status code.
    - For example, https://httpstat.us/404 returns a `404`.
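
Since the first digit of a status code gives its category, a code can be classified with integer division. The sketch below is an illustration using the standard library's `http.HTTPStatus`; the `CATEGORIES` dictionary and the `describe` function are made-up names.

```python
from http import HTTPStatus

# First digit of the status code -> general category.
CATEGORIES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe(code):
    """Return a one-line description of a standard HTTP status code."""
    category = CATEGORIES[code // 100]
    phrase = HTTPStatus(code).phrase  # e.g. "Not Found" for 404
    return f"{code} {phrase} ({category})"


for code in [200, 201, 404, 500, 503]:
    print(describe(code))
```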

In [16]:
r = requests.get('https://httpstat.us/503')
print(r.status_code)

503


In [17]:
r.text

'503 Service Unavailable'

### Successful requests ✅

- You can check if a request was successful using the `ok` attribute, which is a bool.
    - `ok` is `True` whenever the status code is less than 400, which includes every status in the 200s.
- Unsuccessful requests can be re-tried, depending on the issue.
    - Wait a little, then try the request again.
    - You can even re-try requests programmatically (e.g. using a loop). If you are sending requests too quickly, wait longer between retries (e.g. using `time.sleep`).
- See the [course notes](https://notes.dsc80.com/content/07/requests.html#responsible-use-of-http-requests) for more examples.
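
Here is one possible shape for a retry loop, written as a sketch rather than a recommended implementation. The function `get_with_retries` and the `FakeResponse` class are hypothetical; the fake stands in for a flaky server so that the example runs offline and without actually sleeping.

```python
import time


def get_with_retries(fetch, max_tries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until its result has ok == True or the tries run out.

    The wait doubles after each failed attempt: base_delay, 2 * base_delay, ...
    """
    for attempt in range(max_tries):
        resp = fetch()
        if resp.ok:
            return resp
        if attempt < max_tries - 1:
            sleep(base_delay * 2 ** attempt)
    return resp  # still unsuccessful; the caller decides what to do next


class FakeResponse:
    """Stand-in for a response object, with just the attributes used here."""

    def __init__(self, status_code):
        self.status_code = status_code
        self.ok = status_code < 400


# Simulate a server that fails twice and then succeeds.
outcomes = iter([503, 503, 200])
waits = []  # record the requested delays instead of really sleeping
result = get_with_retries(lambda: FakeResponse(next(outcomes)), sleep=waits.append)
print(result.status_code, waits)
```

With `requests`, the `fetch` argument could be something like `lambda: requests.get(url)`, leaving `sleep` at its default so the loop really pauses between attempts.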


In [18]:
status_codes = [200, 201, 403, 404, 503]

for code in status_codes:
    r = requests.get(f'https://httpstat.us/{code}')
    print(f'{code} ok: {r.ok}')

200 ok: True
201 ok: True
403 ok: False
404 ok: False
503 ok: False


- The `raise_for_status` method of a `Response` object raises an exception when the status code indicates an error (i.e. when `ok` is `False`).

In [19]:
requests.get('https://httpstat.us/400').raise_for_status()

HTTPError: 400 Client Error: Bad Request for url: https://httpstat.us/400

### The data formats of the internet

Responses typically come in one of two formats: HTML or JSON.
- The response body of a `GET` request is usually either JSON (when using an API) or HTML (when accessing a webpage).
- The response body of a `POST` request is usually JSON.
- XML is also a common format, but not as popular as it once was.

<center><img src='imgs/json.png' width=400></center>

### JSON

- JSON stands for **JavaScript Object Notation**.
- It is a lightweight format for storing and transferring data.
- It is:
    - very easy for computers to read and write.
    - moderately easy for programmers to read and write by hand.
    - meant to be generated and parsed.
- Most modern languages have an interface for working with JSON objects.
    - JSON objects _resemble_ Python dictionaries (but are not the same!).
- JSON is replacing XML, another text-based format for sending data from/to servers.

### JSON data types

- string: anything inside double quotes.
- number: any number (no difference between ints and floats).
- boolean: `true` and `false`.
- array: anything wrapped in `[]`.
- null: JSON's empty value, denoted by `null`.
- object: a collection of key-value pairs (like dictionaries).
    - Keys must be strings, values can be anything (even other objects).

See [json-schema.org](https://json-schema.org/understanding-json-schema/reference/type.html) for more details.
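
To see how each JSON type maps to a Python type, the sketch below parses a small JSON string with `json.loads`. The record itself is made up for illustration.

```python
import json

text = """
{
    "name": "King Triton",
    "age": 21,
    "gpa": 3.9,
    "enrolled": true,
    "minor": null,
    "courses": ["DSC 80", "DSC 40B"]
}
"""

record = json.loads(text)
for key, value in record.items():
    print(f"{key}: {value!r} is a Python {type(value).__name__}")

# JSON object keys must be strings, so non-string dictionary keys are
# converted to strings when a dictionary is written out as JSON.
as_json = json.dumps({1: "one"})
print(as_json)
```

The last line is one reason JSON objects only resemble Python dictionaries: a dictionary with integer keys does not survive a round trip unchanged.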

### Example JSON object

See `data/family.json`.

<center><img src='imgs/hierarchy.png' width=500></center>

In [21]:
import json

In [23]:
import os
# Using `with` closes the file automatically once the JSON has been loaded.
with open(os.path.join('data', 'family.json'), 'r') as f:
    family_tree = json.load(f)

In [24]:
family_tree

{'name': 'Grandma',
 'age': 94,
 'children': [{'name': 'Dad',
   'age': 60,
   'children': [{'name': 'Me', 'age': 23}, {'name': 'Brother', 'age': 21}]},
  {'name': 'My Aunt',
   'children': [{'name': 'Cousin 1', 'age': 34},
    {'name': 'Cousin 2',
     'age': 36,
     'children': [{'name': 'Cousin 2 Jr.', 'age': 2}]}]}]}

In [25]:
family_tree['children'][0]['children'][0]['age']

23
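
Hard-coding a chain of keys and indices works for a single lookup, but nested JSON is often easier to explore recursively. The sketch below collects every name in the family tree. Since `data/family.json` is not reproduced here, the JSON string is transcribed from the output above, and `all_names` is a hypothetical helper.

```python
import json

family_json = """
{"name": "Grandma", "age": 94, "children": [
    {"name": "Dad", "age": 60, "children": [
        {"name": "Me", "age": 23},
        {"name": "Brother", "age": 21}]},
    {"name": "My Aunt", "children": [
        {"name": "Cousin 1", "age": 34},
        {"name": "Cousin 2", "age": 36, "children": [
            {"name": "Cousin 2 Jr.", "age": 2}]}]}]}
"""


def all_names(person):
    """Return the name of a person followed by the names of all descendants."""
    names = [person["name"]]
    # Not every person has a "children" key, so default to an empty list.
    for child in person.get("children", []):
        names.extend(all_names(child))
    return names


tree = json.loads(family_json)
names = all_names(tree)
print(names)
```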

### Aside: `eval`

- `eval`, which stands for "evaluate", is a function built into Python.
- It takes in a **string containing a Python expression** and evaluates it in the current context.

In [26]:
x = 4

In [27]:
eval('x + 5')

9

- It seems like `eval` can do the same thing that `json.load` does...

In [28]:
f = open(os.path.join('data', 'family.json'), 'r')
eval(f.read())

{'name': 'Grandma',
 'age': 94,
 'children': [{'name': 'Dad',
   'age': 60,
   'children': [{'name': 'Me', 'age': 23}, {'name': 'Brother', 'age': 21}]},
  {'name': 'My Aunt',
   'children': [{'name': 'Cousin 1', 'age': 34},
    {'name': 'Cousin 2',
     'age': 36,
     'children': [{'name': 'Cousin 2 Jr.', 'age': 2}]}]}]}

- But you should **never use `eval`**. The next slide demonstrates why.

### `eval` gone wrong

- Observe what happens when we use `eval` on a string representation of a JSON object:

In [30]:
import util
f_other = open(os.path.join('data', 'evil_family.json'))
eval(f_other.read())

ValueError: i just deleted all your files lol 😂

- Oh no! Since `evil_family.json`, which could have been downloaded from the internet, contained malicious code, we have now lost all of our files.
- This happened because `eval` **evaluates** all parts of the input string as if it were Python code. **You never need to do this. Instead, use the `json` library.**
    - `json.load` loads a JSON object from a file.
    - `json.loads` loads a JSON object from a string.

In [31]:
f_other = open(os.path.join('data', 'evil_family.json'))
s = f_other.read()
s

'{\n    "name": "Grandma",\n    "age": 94,\n    "children": [\n        {\n        "name": util.err(),\n        "age": 60,\n        "children": [{"name": "Me", "age": 23}, \n                     {"name": "Brother", "age": 21}]\n        },\n        {\n        "name": "My Aunt",\n        "children": [{"name": "Cousin 1", "age": 34}, \n                     {"name": "Cousin 2", "age": 36, "children": \n                        [{"name": "Cousin 2 Jr.", "age": 2}]\n                     }\n                    ]\n        }\n    ]\n}'

In [32]:
json.loads(s)

JSONDecodeError: Expecting value: line 6 column 17 (char 84)

- Since `util.err()` is not a valid JSON value (it is not a string, because there are no quotes around it), `json.loads` is not able to parse the input as a JSON object.
- This "safety check" is intentional.
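
In practice, this means a failed parse can be caught and handled rather than crashing the program. The sketch below wraps `json.loads` in a `try`/`except`; `parse_untrusted` is a made-up helper name, and the two input strings are shortened versions of the files above.

```python
import json


def parse_untrusted(text):
    """Parse JSON text, returning None (and a message) if it is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        # The parser never runs the text as code; it only reports where it gave up.
        print(f"Rejected: {err.msg} (line {err.lineno}, column {err.colno})")
        return None


good = '{"name": "Dad", "age": 60}'
bad = '{"name": util.err(), "age": 60}'

good_result = parse_untrusted(good)
bad_result = parse_untrusted(bad)
print(good_result)
print(bad_result)
```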

### Handling _unfamiliar_ data
- Never trust data from an unfamiliar site.
- **Never** use `eval` on "raw" data that you didn't create!
- The JSON data format needs to be **parsed**, not evaluated as a dictionary.
    - It was designed with safety in mind!

## APIs and Scraping

### Programmatic requests

* We learned how to use the Python `requests` package to exchange data via HTTP.
    - `GET` requests are used to request data **from** a server.
    - `POST` requests are used to **send** data to a server. 
* There are two ways of collecting data via requests:
    * By using a published API (application programming interface).
    * By scraping a webpage to collect its HTML source code.

### APIs

* An API is a service that makes data directly available to the user in a convenient fashion.

* Advantages:
    - The data are usually clean, up-to-date, and ready to use.
    - The presence of an API signals that the data provider is okay with you using their data.
    - The data provider can plan and regulate data usage.
        - Some APIs require you to create an API "key", which is like an account for using the API.
        - APIs can also give you access to data that isn't publicly available on a webpage.

* Disadvantages:
    - APIs don't always exist for the data you want!

### API terminology

- A URL, or uniform resource locator, describes the location of a website or resource.

- An **API endpoint** is a URL of the data source that the user wants to make requests to.

- For example, on the [Reddit API](https://www.reddit.com/dev/api/):
    * the `/comments` endpoint retrieves information about comments.
    * the `/hot` endpoint retrieves data about posts labeled "hot" right now. 
    - To access these endpoints, you add the endpoint name to the base URL of the API.
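
As a small sketch of "adding the endpoint name to the base URL", the code below joins a base URL and path pieces while avoiding doubled or missing slashes. It uses the Pokémon API base URL that appears later in this lecture; `endpoint_url` is a hypothetical helper, not part of any library.

```python
BASE_URL = "https://pokeapi.co/api/v2"


def endpoint_url(base, *parts):
    """Join a base URL and path pieces with exactly one slash between them."""
    cleaned = [str(part).strip("/") for part in parts]
    return "/".join([base.rstrip("/"), *cleaned])


squirtle_url = endpoint_url(BASE_URL, "pokemon", "squirtle")
print(squirtle_url)
```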

### API requests

- API requests are just `GET`/`POST` requests to a specially maintained URL.
- Let's test out the [Pokémon API](https://pokeapi.co).

First, let's make a `GET` request for `'squirtle'`.

In [2]:
r = requests.get('https://pokeapi.co/api/v2/pokemon/squirtle')
r

<Response [200]>

Remember, the 200 status code is good! Let's take a look at the **content**:

In [3]:
r.content[:1000]

b'{"abilities":[{"ability":{"name":"torrent","url":"https://pokeapi.co/api/v2/ability/67/"},"is_hidden":false,"slot":1},{"ability":{"name":"rain-dish","url":"https://pokeapi.co/api/v2/ability/44/"},"is_hidden":true,"slot":3}],"base_experience":63,"forms":[{"name":"squirtle","url":"https://pokeapi.co/api/v2/pokemon-form/7/"}],"game_indices":[{"game_index":177,"version":{"name":"red","url":"https://pokeapi.co/api/v2/version/1/"}},{"game_index":177,"version":{"name":"blue","url":"https://pokeapi.co/api/v2/version/2/"}},{"game_index":177,"version":{"name":"yellow","url":"https://pokeapi.co/api/v2/version/3/"}},{"game_index":7,"version":{"name":"gold","url":"https://pokeapi.co/api/v2/version/4/"}},{"game_index":7,"version":{"name":"silver","url":"https://pokeapi.co/api/v2/version/5/"}},{"game_index":7,"version":{"name":"crystal","url":"https://pokeapi.co/api/v2/version/6/"}},{"game_index":7,"version":{"name":"ruby","url":"https://pokeapi.co/api/v2/version/7/"}},{"game_index":7,"version":{"n

Looks like JSON. We can extract the JSON from this response with the `json` method (or by passing `r.text` to `json.loads`).

In [4]:
r.json()

{'abilities': [{'ability': {'name': 'torrent',
    'url': 'https://pokeapi.co/api/v2/ability/67/'},
   'is_hidden': False,
   'slot': 1},
  {'ability': {'name': 'rain-dish',
    'url': 'https://pokeapi.co/api/v2/ability/44/'},
   'is_hidden': True,
   'slot': 3}],
 'base_experience': 63,
 'forms': [{'name': 'squirtle',
   'url': 'https://pokeapi.co/api/v2/pokemon-form/7/'}],
 'game_indices': [{'game_index': 177,
   'version': {'name': 'red', 'url': 'https://pokeapi.co/api/v2/version/1/'}},
  {'game_index': 177,
   'version': {'name': 'blue', 'url': 'https://pokeapi.co/api/v2/version/2/'}},
  {'game_index': 177,
   'version': {'name': 'yellow',
    'url': 'https://pokeapi.co/api/v2/version/3/'}},
  {'game_index': 7,
   'version': {'name': 'gold', 'url': 'https://pokeapi.co/api/v2/version/4/'}},
  {'game_index': 7,
   'version': {'name': 'silver',
    'url': 'https://pokeapi.co/api/v2/version/5/'}},
  {'game_index': 7,
   'version': {'name': 'crystal',
    'url': 'https://pokeapi.co/api/
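
Once the response is parsed, pulling out specific fields is ordinary dictionary and list work. The sketch below runs offline on an excerpt of the response shown above (only the `abilities` and `base_experience` keys are copied); with a live response, `r.json()` would take the place of `json.loads(excerpt)`.

```python
import json

# Excerpt of the squirtle response shown above, not the full response.
excerpt = """
{"abilities": [
    {"ability": {"name": "torrent", "url": "https://pokeapi.co/api/v2/ability/67/"},
     "is_hidden": false, "slot": 1},
    {"ability": {"name": "rain-dish", "url": "https://pokeapi.co/api/v2/ability/44/"},
     "is_hidden": true, "slot": 3}],
 "base_experience": 63}
"""

data = json.loads(excerpt)

# Each entry in "abilities" wraps the ability's name inside a nested object.
ability_names = [entry["ability"]["name"] for entry in data["abilities"]]
hidden_names = [entry["ability"]["name"]
                for entry in data["abilities"] if entry["is_hidden"]]

print(ability_names)
print(hidden_names)
print(data["base_experience"])
```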

Let's try a `GET` request for `'billy'`.

In [5]:
r = requests.get('https://pokeapi.co/api/v2/pokemon/billy')
r

<Response [404]>

Uh oh...

### Scraping

* Scraping is the act of programmatically "browsing" the web, downloading the source code (HTML) of pages that you're interested in extracting data from.

* Advantages:
    * You can always do it!
        - e.g. Google scrapes webpages in order to make them searchable.

* Disadvantages:
    - It is often difficult to parse and clean scraped data.
        - Source code often includes a lot of content unrelated to the data you're trying to find (e.g. formatting, advertisements, other text).
    - Websites can change often, so scraping code can get outdated quickly.
    - Websites may not want you to scrape their data!

- In general, we prefer APIs.

### Accessing HTML

Let's make a `GET` request to the HDSI Faculty page and see what the resulting HTML looks like. 

In [6]:
url = 'https://datascience.ucsd.edu/about/faculty/faculty/'
r = requests.get(url)
r

<Response [200]>

In [7]:
urlText = r.text
len(urlText)

875232

In [8]:
print(urlText[:1000])

<!DOCTYPE html><html lang="en-US"><head><meta charset="UTF-8"><link rel="profile" href="https://gmpg.org/xfn/11"><style media="all">img.wp-smiley,img.emoji{display:inline !important;border:none !important;box-shadow:none !important;height:1em !important;width:1em !important;margin:0 .07em !important;vertical-align:-.1em !important;background:0 0 !important;padding:0 !important}
/*!
 * Font Awesome Free 6.1.2 by @fontawesome - https://fontawesome.com
 * License - https://fontawesome.com/license/free (Icons: CC BY 4.0, Fonts: SIL OFL 1.1, Code: MIT License)
 * Copyright 2022 Fonticons, Inc.
 */
.fa{font-family:var(--fa-style-family,"Font Awesome 6 Free");font-weight:var(--fa-style,900)}.fa,.fa-brands,.fa-duotone,.fa-light,.fa-regular,.fa-solid,.fa-thin,.fab,.fad,.fal,.far,.fas,.fat{-moz-osx-font-smoothing:grayscale;-webkit-font-smoothing:antialiased;display:var(--fa-display,inline-block);font-style:normal;font-variant:normal;line-height:1;text-rendering:auto}.fa-1x{font-size:1em}.fa-2x{f

Wow, that is gross looking! 😰 

- It is **raw** HTML, which web browsers use to display websites.
- The information we are looking for – faculty information – is in there somewhere, but we have to search for it and extract it, which we wouldn't have to do if we had an API.

### Best practices for scraping

1. **Send requests slowly** and be upfront about what you are doing!
2. Respect the policy published in the page's `robots.txt` file.
    - Many sites have a `robots.txt` file in their root directory, which contains a policy that allows or disallows automatic access to their site. 
    - See [here](https://moz.com/learn/seo/robotstxt) or Lab 5, Question 5 for more details.
3. Don't spoof your `User-Agent` header (i.e. don't try to trick the server into thinking you are a person).
4. Read the Terms of Service for the site and follow it.
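
Checking a `robots.txt` policy can be automated with the standard library's `urllib.robotparser`. The sketch below parses a made-up policy directly from a list of lines so that it runs offline; the bot name `dsc80-student-bot`, the `example.com` URLs, and the policy itself are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt policy, one line per list element.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
]

parser = RobotFileParser()
parser.parse(robots_lines)

agent = "dsc80-student-bot"
allowed_home = parser.can_fetch(agent, "https://example.com/index.html")
allowed_private = parser.can_fetch(agent, "https://example.com/private/data.html")
delay = parser.crawl_delay(agent)  # seconds the site asks you to wait between requests

print(allowed_home, allowed_private, delay)
```

For a real site, `parser.set_url(...)` followed by `parser.read()` downloads the policy instead, and calling `time.sleep(delay)` between requests is one way to honor the requested delay.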

### Consequences of irresponsible scraping

If you make too many requests:
* The server may block your IP Address.
    - Everyone in your dorm might lose access to Google! (Seriously!)
* You may take down the website.
    - A journalist scraped and accidentally took down the Cook County Inmate Locator.
    - As a result, inmates' families weren't able to contact them while the site was down.

## Next time in DSC 80...

- Parsing HTML