# Homework 8: HTTP, REST APIs, Data Formats

You may use the following packages/imports for any of the questions below:

* `bs4` ([docs](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) - should already be installed with Anaconda) 
* `requests` ([docs](https://docs.python-requests.org/en/latest/) - should already be installed with Anaconda)
* `json` (standard library)
* `re` (standard library)

Total questions: 5<br/>
Total points: 8

## Question 1

Write a function called `get_api_header_data` that sends a `GET` request to `http://numbersapi.com/21/trivia` and returns a `dict` with the following keys, each mapped to the value found in the response's headers:

* `"Server"`
* `"Content-Type"`
* `"Content-Length"`

[1 point]

In [1]:
import requests

def get_api_header_data() -> dict:
    resp = requests.get("http://numbersapi.com/21/trivia")
    return {
        "Server": resp.headers["Server"],
        "Content-Type": resp.headers["Content-Type"],
        "Content-Length": resp.headers["Content-Length"],
    }

In [2]:
# autograder tests

# Ensure the returned result is a dictionary
result = get_api_header_data()
assert isinstance(result, dict)

In [3]:
# autograder tests
# Ensure the expected results are returned
actual = get_api_header_data()
assert "nginx/1.4.6 (Ubuntu)" == actual["Server"]
assert "text/plain; charset=utf-8" == actual["Content-Type"]

# Content-Length varies with the number of characters in the fact,
# but it should always be convertible to an integer.
assert isinstance(int(actual["Content-Length"]), int)
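
HTTP header names are case-insensitive, and `resp.headers` in `requests` already behaves that way. As a rough illustration of the same idea on a plain `dict`, here is a minimal sketch. `pick_headers` and `sample_headers` are hypothetical names introduced for this example, and the `"42"` length is a made-up value rather than a real response.

```python
def pick_headers(headers, names):
    """Return {name: value} for each requested header, matching case-insensitively.

    Headers that are absent map to None instead of raising KeyError.
    """
    lowered = {key.lower(): value for key, value in headers.items()}
    return {name: lowered.get(name.lower()) for name in names}


# Hypothetical sample data; the Content-Length value is invented for illustration.
sample_headers = {
    "server": "nginx/1.4.6 (Ubuntu)",
    "content-type": "text/plain; charset=utf-8",
    "content-length": "42",
}

picked = pick_headers(sample_headers, ["Server", "Content-Type", "Content-Length"])
print(picked)
```

Header values arrive as strings, so a numeric header such as `Content-Length` still needs `int(...)` before it can be used as a number.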

## Question 2

Write a function called `post_data` that takes one argument, `data` (a dictionary), and returns a `requests.Response` object.

This function should submit a `POST` request to the url `https://httpbin.org/post`. The request should post the `data` argument as JSON. It should also set the headers `"Content-Type"` and `"Accept"` to the appropriate [MIME type](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types) for sending JSON data. Finally, it should return the response received from the post request.

[2 points]

In [4]:
import json
import requests

def post_data(data):
    headers = {
        "Content-Type": "application/json",
        "Accept": "application/json",
    }
    resp = requests.post("https://httpbin.org/post", headers=headers, json=data)
    # also accepted:
    # data = json.dumps(data)
    # resp = requests.post("https://httpbin.org/post", headers=headers, data=data)
    return resp


In [5]:
# autograder tests
result = post_data({"foo": "bar", "baz": "blah"})

# Ensure the correct type is returned
assert isinstance(result, requests.Response)

# Ensure the correct request method was used
assert "POST" == result.request.method

In [7]:
# autograder tests
result = post_data({"foo": "bar", "baz": "blah"})
json_data = json.dumps({"foo": "bar", "baz": "blah"})
exp_req_headers = {
    "Content-Length": str(len(json_data)),
    "Content-Type": "application/json",
    "Accept": "application/json"
}

# Ensure the correct headers were sent in the request
for exp_key, exp_val in exp_req_headers.items():
    assert exp_val == result.request.headers[exp_key]

In [8]:
# autograder tests
result = post_data({"foo": "bar", "baz": "blah"})

# Ensure the data is sent as JSON and not altered
act_resp_json = result.json()["json"]
assert {"foo": "bar", "baz": "blah"} == act_resp_json
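
The header test above compares `Content-Length` with the length of `json.dumps(...)`. That works because `Content-Length` counts bytes, and with the default `ensure_ascii=True` every character in the serialized JSON is a single byte. The sketch below shows where the two counts diverge; `json_body` is a hypothetical helper used only for illustration.

```python
import json


def json_body(data, ensure_ascii=True):
    """Serialize data to JSON text and return (text, UTF-8 encoded bytes)."""
    text = json.dumps(data, ensure_ascii=ensure_ascii)
    return text, text.encode("utf-8")


# Default settings escape non-ASCII characters, so characters and bytes match.
ascii_text, ascii_body = json_body({"foo": "bar", "baz": "blah"})
print(len(ascii_text), len(ascii_body))

# With ensure_ascii=False, "é" is one character but two UTF-8 bytes.
text, body = json_body({"name": "café"}, ensure_ascii=False)
print(len(text), len(body))
```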

## Question 3

Define a function called `get_html` that returns an HTML string. The HTML returned should contain the following HTML elements:

* Three paragraphs
* One photo
* Three hypertext reference links

The paragraphs and link elements must contain content. The HTML returned must be valid HTML (you can use [this HTML validator](https://validator.w3.org/nu/#textarea) to check if your HTML is valid), and must contain all necessary container tags.

[2 points]

In [9]:
def get_html():
    return """
    <!doctype html>
    <html lang=en>
        <head>
            <meta charset=utf-8>
            <title>My Title</title>
        </head>
        <body>
            <img src="test.jpg" alt="A test image">
            <p>I'm some content. <a class="large" href="https://www.google.com">link2</a></p>
            <p>
                <a class="large" href="https://www.google.com">link1</a>
            </p>
            <p>
                Some content. <a class="large" href="https://www.google.com">link3</a>
            </p>
        </body>
    </html>
    """


In [10]:
# autograder tests
# Ensure the HTML has the correct header
html = get_html()
# first normalize possible correct implementations
cleaned_html = ' '.join(html.split()).strip().lower()
# then assert that it starts with the expected header
assert cleaned_html.startswith("<!doctype html>")

In [11]:
# autograder tests
import bs4

soup = bs4.BeautifulSoup(get_html(), 'html.parser')

# Ensure there's a `<head>` tag with a title set for the website
assert len(soup.find_all('title')) == 1
assert soup.find('title').parent.name.lower() == 'head'

In [12]:
# autograder tests
# Ensure there are three links, and three paragraphs
assert len(soup.find_all('a')) == 3
assert len(soup.find_all('p')) == 3

In [13]:
# autograder tests
assert len(soup.find_all('img')) == 1
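
The same element counts can be checked without BeautifulSoup. The following is a minimal sketch using the standard library's `html.parser`, offered as an alternative for quick self-checks and not as the autograder's method. `TagCounter`, `count_tags`, and `sample_html` are hypothetical names introduced here.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how many times each start tag appears in a document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag names before calling this method.
        self.counts[tag] += 1


def count_tags(html):
    """Return a Counter mapping tag name to number of occurrences."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


# Hypothetical sample page, similar in shape to what Question 3 asks for.
sample_html = """<!doctype html>
<html lang="en">
<head><meta charset="utf-8"><title>Sample</title></head>
<body>
<img src="test.jpg" alt="A test image">
<p>First <a href="https://www.google.com">link1</a></p>
<p>Second <a href="https://www.google.com">link2</a></p>
<p>Third <a href="https://www.google.com">link3</a></p>
</body>
</html>"""

counts = count_tags(sample_html)
print(counts["p"], counts["a"], counts["img"])
```

Counting tags only confirms that the elements are present. It does not check validity, so the HTML validator linked above is still the place to confirm that.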

## Question 4

Write a function called `ones` which takes two arguments, `m` and `n`, and produces an `m` row by `n` column JSON array-of-arrays, with all elements being 1.

For example:

```py
>>> ones(2, 2)
'[[1, 1], [1, 1]]'
```

Another example:

```py
>>> ones(1, 4)
'[[1, 1, 1, 1]]'
```

You may assume that `m` and `n` will only ever be positive integers.

[1 point]

In [14]:
import json

def ones(m, n):
    return json.dumps([[1] * n] * m)

In [15]:
# autograder tests
assert json.loads(ones(2, 2)) == [[1, 1], [1, 1]]
assert json.loads(ones(1, 4)) == [[1, 1, 1, 1]]

In [16]:
# autograder tests
assert json.loads(ones(3, 2)) == [[1, 1], [1, 1], [1, 1]]
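
Building a grid with `[[1] * n] * m` repeats a reference to one inner list `m` times. That makes no difference once the result is serialized with `json.dumps`, but it matters if the nested list is modified first. A minimal sketch of the difference (the variable names are only illustrative):

```python
import json

# Every row here is the same list object, repeated three times.
shared = [[1] * 2] * 3
# A comprehension builds a separate inner list for each row.
independent = [[1] * 2 for _ in range(3)]

# Before any mutation, both serialize to identical JSON.
same_json = json.dumps(shared) == json.dumps(independent)

# Changing one cell shows the difference.
shared[0][0] = 9        # affects all rows, because they are one object
independent[0][0] = 9   # affects only the first row

print(same_json)
print(shared)
print(independent)
```

If the grid is only ever passed straight to `json.dumps`, either form produces the same string. The comprehension is the safer habit when the rows might be edited.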

## Question 5

As practice for your final project, write a function called `find_taxi_parquet_links`. It should call the `get_taxi_html` function already defined for you (which returns the entire HTML of the NYC taxi data website used for your final project) and return a `list` of `str`s containing all links to _both_ Yellow and Green taxi trip records.

Use BeautifulSoup to parse the HTML returned by `get_taxi_html`. Do not include any other types of links, including for-hire vehicle trip records.

[2 points]

In [17]:
import bs4
import requests


TAXI_URL = "https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page"


def get_taxi_html():
    response = requests.get(TAXI_URL)
    html = response.content
    return html


def find_taxi_parquet_links():
    ### BEGIN SOLUTION
    html = get_taxi_html()
    soup = bs4.BeautifulSoup(html, "html.parser")
    green_a_tags = soup.find_all("a", attrs={"title": "Green Taxi Trip Records"})
    yellow_a_tags = soup.find_all("a", attrs={"title": "Yellow Taxi Trip Records"})
    all_a_tags = green_a_tags + yellow_a_tags
    return [a["href"] for a in all_a_tags]
    ### END SOLUTION

In [18]:
# autograder tests
result = find_taxi_parquet_links()

# Ensure the returned results are of the expected type.
assert isinstance(result, list)
assert len(result) > 0
assert isinstance(result[0], str)

# Check the number of returned results.
# Since the site is regularly updated with new monthly data,
# this is only a rough approximation.
assert len(result) >= 269
assert len(result) < 300

In [19]:
# autograder tests
import re

result = find_taxi_parquet_links()

pattern = re.compile(
    r"(green|yellow)_tripdata_20([0-9]{2})-([0-9]{2})\.parquet"
)
expected = []
for link in result:
    match = pattern.search(link)
    if match:
        expected.append(match.string)

assert len(result) == len(expected)

In [20]:
# autograder tests
import re

result = find_taxi_parquet_links()
result = sorted(result)

# Just spot check a couple links
expected0 = "https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2013-08.parquet"
expected180 = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2014-09.parquet"
expected268 = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet"

assert expected0 == result[0], result[0]
assert expected180 == result[180], result[180]
assert expected268 == result[268], result[268]
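
The file names in these links encode the taxi type, year, and month, which can be useful later when deciding which files to download. Below is a minimal sketch of pulling those parts out with `re`. `parse_taxi_link` is a hypothetical helper, and the for-hire vehicle link is an invented example of a link that should be rejected, not one copied from the site.

```python
import re

# The dot before "parquet" is escaped so it matches only a literal ".".
TAXI_FILE_PATTERN = re.compile(r"(green|yellow)_tripdata_(\d{4})-(\d{2})\.parquet$")


def parse_taxi_link(link):
    """Return (color, year, month) for a Yellow/Green parquet link, else None."""
    # Stripping whitespace is a defensive assumption about how hrefs may be written.
    match = TAXI_FILE_PATTERN.search(link.strip())
    if match is None:
        return None
    color, year, month = match.groups()
    return color, int(year), int(month)


sample_links = [
    "https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2013-08.parquet",
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet",
    # Hypothetical for-hire vehicle link, included only to show a non-match.
    "https://d37ci6vzurychx.cloudfront.net/trip-data/fhv_tripdata_2015-01.parquet",
]

parsed = [parse_taxi_link(link) for link in sample_links]
print(parsed)
```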