# HTTP clients
FPNP3e ch9

# References
- [HTTP](https://en.wikipedia.org/wiki/HTTP)
- [HTTP Documentation](https://httpwg.org/specs/)

Objectives
---
- Learn how to use the HTTP protocol from the perspective of a client
  - fetch and cache documents
  - submit queries or data to the server 
- Get familiar with HTTP version 1.1 defined in [RFCs 9110-9112](https://httpwg.org/specs/)
  -  the most common version in use today

HTTP overview
---
- a request–response protocol in the client–server model
  - end-to-end headers travel unchanged between the client and the origin server
- intermediate HTTP nodes (proxy servers, web caches, etc.) may be used to improve performance
  - hop-by-hop headers such as `Connection` apply only to a single connection between neighboring nodes
- a stateless protocol
  - no requirements on the web server to retain information or status about each user for the duration of multiple requests
- states can be implemented to manage user sessions
  - using cookies or hidden variables 
- HTTP 1.1/2 runs on TCP
  - HTTP 3 runs on QUIC + UDP


Python Client Libraries
---
- [urllib](https://docs.python.org/3/library/urllib.html), built into the Python Standard Library (PSL)
- [Requests](https://requests.readthedocs.io/en/latest/), a full-featured third-party solution
- Their basic interfaces are quite similar
    - a callable that opens an HTTP connection,
    - makes a request, and waits for the response headers 
        - before returning a response object that presents them to the programmer
    - The response body is left queued on the incoming socket
        - and read only when the programmer asks
- testbed website: [httpbin.org](http://httpbin.org/)
  ```bash
    # install required packages
    pip install gunicorn requests
    # Host httpbin.org locally with docker
    docker run -p 80:80 kennethreitz/httpbin
  ```

In [None]:
# fetch httpbin with Requests
# http://httpbin.org
# or 
# https://pie.dev

import requests
# r = requests.get('http://localhost/headers')
r = requests.get('http://httpbin.org/headers')
print(r.text)

In [None]:
# fetch httpbin with urllib
from urllib.request import urlopen
import urllib.error
#r = urlopen('http://localhost/headers')
r = urlopen('http://httpbin.org/headers')
print(r.read().decode('utf-8'))

Differences between urllib and Requests
---
| lib\feature | supports gzip | determines correct decoding |
| --- | --- | --- |
| Requests | Y | Y |
| urllib | N | N |

- To go beyond the HTTP protocol to be more browser-like
    - refer to related libraries such as *mechanize*
- Here we focus on the HTTP protocol

Ports, Encryption, and Framing
---
- 80:  the standard port for plain-text HTTP conversations
- 443:  the standard port for HTTP conversations wrapped by TLS
- Non standard ports can be used. 
  - The client needs to specify it in the URLs
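
The port rule above can be sketched with the standard-library `urllib.parse.urlsplit`. The helper name `effective_port` and the `DEFAULT_PORTS` table are illustrative only, not part of any library.

```python
from urllib.parse import urlsplit

# Standard ports for the two URL schemes discussed above
DEFAULT_PORTS = {"http": 80, "https": 443}


def effective_port(url):
    """Return the port a client would connect to for this URL."""
    parts = urlsplit(url)
    # parts.port is None when the URL does not name a port explicitly
    return parts.port or DEFAULT_PORTS[parts.scheme]


for url in ("http://httpbin.org/ip",
            "https://httpbin.org/ip",
            "http://localhost:8080/ip"):
    print(url, "->", effective_port(url))
```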

```mermaid
sequenceDiagram
  Client->>Server: send a request that names a document
  Note right of Client: wait for a complete response 
  Server-->>Client: a response of an error or  the requested document 
```
- the request and response use the same rules to establish formatting and framing
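- In HTTP/1.1, a client normally waits for the response to finish before transmitting a second request over the same socket
  - the standard allows pipelining several requests, but responses must return in request order and most clients leave it disabled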


🔭 Practice
---
- Run httpbin with docker
  - Access http://localhost/ip with curl
    ```bash
    curl -v localhost/ip

    # or use telnet
    telnet httpbin.org 80
    # GET / HTTP/1.1
    # Host: httpbin.org
    ```
- *Optional:* Explore HTTP request and response using 
  - [httpie](https://httpie.io/)
    - [Install httpie](https://httpie.io/docs/cli/debian-and-ubuntu) then play with the examples
  - or [http-prompt](https://http-prompt.com/)


HTTP message structure
---
- Both HTTP requests and responses are called HTTP messages
- Each message is composed of three parts
  - The first two parts are made of lines
    - each line ends with a carriage return and linefeed (CRLF, ASCII codes 13 and 10)
  1. A first line that names
     - a method and document in the request
     - a return code and description in the response
  2. Zero or more lines that represent header entries
     - each entry consists of a name, a colon, and a value
     - entry names are case-insensitive
     - A *mandatory* blank line terminates the list of entries, so the header section ends with CRLFCRLF
  3. An optional body
     - There are several options for framing the body
- There is no prior warning about how long the first line and the headers might be
  - servers set commonsense maximums on their length to avoid DoS attacks
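
The three-part layout can be illustrated with a small parser. This is a simplified sketch, not a complete HTTP parser: the function name `parse_message` and the sample response are made up for illustration, and repeated header names are not handled.

```python
def parse_message(raw: bytes):
    """Split a raw HTTP/1.1 message into first line, headers, and body."""
    # The mandatory blank line (CRLFCRLF) separates the headers from the body
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    first_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them in lowercase
        headers[name.strip().lower()] = value.strip()
    return first_line, headers, body


raw_response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/plain\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
)
first_line, headers, body = parse_message(raw_response)
print(first_line)
print(headers)
print(body)
```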



Three framing options for the message body
---
1. *A Content-Length header entry whose value is a decimal integer* specifies the length of the body in bytes, similar to framing method **M5**.
   - may not be feasible for data generated dynamically
2. A header entry that specifies *Transfer-Encoding: chunked*, similar to framing method **M6**
   - used to frame a body without knowing its length beforehand
   - the body is delivered in smaller pieces, each prefixed by its length, with these fields in order
     - a *hexadecimal* length field
     - (optional $O_1$): a semicolon and extension option
     - a line delimiter CRLF
     - a block of data of the stated length
     - again a line delimiter CRLF
   - the last chunk has a length of 0 bytes and no block of data
   - (optional $O_2$): a few final HTTP header entries if $O_1$ is specified
3. *Connection: close* is specified by the server, which sends a body of arbitrary length and then closes the TCP socket to mark its end
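
Option 2 can be made concrete with a small decoder. This is a sketch under simplifying assumptions: the whole chunked body is already in memory, chunk extensions are skipped rather than interpreted, and trailing header entries are ignored. The name `decode_chunked` is illustrative.

```python
def decode_chunked(data: bytes) -> bytes:
    """Reassemble a body framed with Transfer-Encoding: chunked."""
    body = b""
    pos = 0
    while True:
        # Each chunk starts with a line holding the hexadecimal length
        line_end = data.index(b"\r\n", pos)
        size_field = data[pos:line_end].split(b";")[0]  # drop any extension
        size = int(size_field.decode("ascii").strip(), 16)
        pos = line_end + 2
        if size == 0:
            break  # the zero-length chunk marks the end of the body
        body += data[pos:pos + size]
        pos += size + 2  # skip the data block and its trailing CRLF
    return body


chunked = b"5\r\nHello\r\n7;ext=1\r\n, world\r\n0\r\n\r\n"
print(decode_chunked(chunked))
```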

[Methods](https://en.wikipedia.org/wiki/HTTP)
---
- The first word of an HTTP request specifies 
  - the *action, operation, or method* that the client is requesting of the server
- Two basic methods, GET and POST, provide the basic “read” and “write” operations of HTTP
- GET method syntax
  ```
  GET URL HTTP/1.1
  ```
  - URL - [Uniform Resource Locator](https://en.wikipedia.org/wiki/URL) locates the document requested
  - No body
  - Any parameters in the URL can only modify the document that is being returned
  - The client cannot modify data on the server so
    - lets a client safely re-attempt a GET if a first attempt is interrupted
    - allows GET responses to be cached
    - makes it safe for web scraping programs to visit as many URLs as they want
  - a GET request can be sent by urllib.request.urlopen() or requests.get()
- POST is used to submit new data to the server
  - the results of a POST cannot be cached
  - cannot be retried automatically if the response does not arrive
  - a POST request can be sent by urllib.request.urlopen(url, data) or requests.post()
- OPTIONS and HEAD are methods similar to GET
  -  OPTIONS asks what header values will work with a particular path
  -  HEAD asks the server to transmit only the response headers, without the body
- PUT and DELETE are methods similar to POST
  -  PUT uploads a new document to the path that the request specifies
  -  DELETE deletes the path and any content associated with it
  -  both methods are *idempotent*, but POST is not
- TRACE is used for debugging
- CONNECT asks a proxy to open a tunnel for traffic other than plain HTTP
  - e.g. TLS to an HTTPS site, or WebSockets over HTTP/2; HTTP/1.1 turns on WebSockets with a GET request carrying an Upgrade header
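
As a small illustration of how a client library chooses the method, the sketch below builds `urllib.request.Request` objects without sending them. The URL is a placeholder and no network traffic occurs; `Request.get_method()` reports the method that would go on the first line of the request.

```python
from urllib.parse import urlencode
from urllib.request import Request

url = "http://localhost:8080/data"  # placeholder URL; nothing is sent
form = urlencode({"name": "John Doe", "age": 30}).encode("utf-8")

get_request = Request(url)                           # no body: defaults to GET
post_request = Request(url, data=form)               # a body: defaults to POST
put_request = Request(url, data=form, method="PUT")  # explicit method
head_request = Request(url, method="HEAD")

for request in (get_request, post_request, put_request, head_request):
    print(request.get_method(), request.full_url, request.data)
```

Passing one of these objects to `urllib.request.urlopen()` is what actually transmits the request.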

### Issue HTTP methods/commands

| HTTP Method | [cURL](https://curl.se/) Command                                                                                                        | [HTTPie](https://httpie.io/) Command                                                              |
|-------------|----------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------|
| GET         | `curl https://api.example.com/data`                                                                                  | `http GET https://api.example.com/data`                                     |
| POST        | `curl -X POST https://api.example.com/data -H "Content-Type: application/json" -d '{"name":"John Doe", "age":30}'`   | `http POST https://api.example.com/data name="John Doe" age:=30`            |
| PUT         | `curl -X PUT https://api.example.com/data/123 -H "Content-Type: application/json" -d '{"name":"Jane Smith", "age":25}'` | `http PUT https://api.example.com/data/123 name="Jane Smith" age:=25`       |
| DELETE      | `curl -X DELETE https://api.example.com/data/123`                                                                   | `http DELETE https://api.example.com/data/123`                              |
| HEAD        | `curl -I https://api.example.com/data`                                                                              | `http HEAD https://api.example.com/data`                                    |
| OPTIONS     | `curl -X OPTIONS https://api.example.com/data`                                                                      | `http OPTIONS https://api.example.com/data`                                 |

- 🔭 An interactive HTTP client [http-prompt](https://github.com/httpie/http-prompt)

| HTTP Method | `urllib` Command                                                                                                  | `requests` Command                                      |
|-------------|-------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------|
| GET         | `urllib.request.urlopen("https://api.example.com/data")`                                                          | `requests.get("https://api.example.com/data")`          |
| POST        | `urllib.request.urlopen("https://api.example.com/data", data=urllib.parse.urlencode({"name": "John Doe", "age": 30}).encode("utf-8"))` | `requests.post("https://api.example.com/data", json={"name": "John Doe", "age": 30})` |
| PUT         | `urllib.request.urlopen(urllib.request.Request("https://api.example.com/data/123", data=urllib.parse.urlencode({"name": "Jane Smith", "age": 25}).encode("utf-8"), method="PUT"))` | `requests.put("https://api.example.com/data/123", json={"name": "Jane Smith", "age": 25})` |
| DELETE      | `urllib.request.urlopen(urllib.request.Request("https://api.example.com/data/123", method="DELETE"))`             | `requests.delete("https://api.example.com/data/123")`   |
| HEAD        | `urllib.request.urlopen(urllib.request.Request("https://api.example.com/data", method="HEAD"))`                   | `requests.head("https://api.example.com/data")`         |
| OPTIONS     | `urllib.request.urlopen(urllib.request.Request("https://api.example.com/data", method="OPTIONS"))`                | `requests.options("https://api.example.com/data")`      |


### A Server used to test HTTP requests

In [None]:
from http.server import BaseHTTPRequestHandler, HTTPServer

class RequestHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self._print_request_details()
        self._send_response()

    def do_POST(self):
        self._print_request_details()
        self._send_response()

    def do_PUT(self):
        self._print_request_details()
        self._send_response()

    def do_DELETE(self):
        self._print_request_details()
        self._send_response()

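    def do_HEAD(self):
        self._print_request_details()
        # A HEAD response carries headers only, so no body is written
        self.send_response(200)
        self.send_header("Content-type", "text/plain")
        self.end_headers()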

    def do_OPTIONS(self):
        self._print_request_details()
        self._send_response()

    def _print_request_details(self):
        # Print request method and path
        print(f"Received {self.command} request for {self.path}")
        
        # Print headers
        print("Headers:")
        for key, value in self.headers.items():
            print(f"  {key}: {value}")
        
        # Print body if available
        content_length = int(self.headers.get('Content-Length', 0))
        if content_length > 0:
            body = self.rfile.read(content_length)
            print("Body:")
            print(body.decode("utf-8"))
        print("-" * 40)

    def _send_response(self):
        # Send a simple 200 OK response
        self.send_response(200)
        self.send_header("Content-type", "text/plain")
        self.end_headers()
        self.wfile.write(b"Request received and logged.")

# Define server details
def run(server_class=HTTPServer, handler_class=RequestHandler, port=8080):
    server_address = ('', port)
    httpd = server_class(server_address, handler_class)
    print(f"Starting HTTP server on port {port}")
    httpd.serve_forever()

if __name__ == "__main__":
    run()

Paths and Hosts
---
- `GET /html/rfc7230` on its own was a legal request in early versions of HTTP
  - it is illegal in modern versions
- a modern request adds the protocol version and a Host header
  ```html
  ---
  GET /html/rfc7230 HTTP/1.1
  Host: tools.ietf.org
  ---
  ```
  - the Host header lets a single web server support many websites
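
A minimal sketch of how such a request could be assembled by hand. The helper name `build_get_request` is hypothetical, and the `Connection: close` line is an addition that asks the server to hang up after one response.

```python
def build_get_request(host, path="/"):
    """Return the bytes of a minimal HTTP/1.1 GET request."""
    lines = [
        f"GET {path} HTTP/1.1",  # method, path, and protocol version
        f"Host: {host}",         # names the website wanted on this server
        "Connection: close",     # ask the server to close after responding
        "",                      # blank line that ends the header list
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


request_bytes = build_get_request("tools.ietf.org", "/html/rfc7230")
print(request_bytes.decode("ascii"))
```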

[Status Codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
---
- The response line consists of *HttpVersion StatusCode StatusDescription*
  - e.g: HTTP/1.1 200 OK
  - StatusDescription is informal text for human readers; it may be localized, so programs should rely on StatusCode
- There are 5 classes of status code

| class | meaning |
| --- | --- |
| 1xx informational response | the request was received, continuing process |
| 2xx successful | the request was successfully received, understood, and accepted |
| 3xx redirection | further action needs to be taken in order to complete the request |
| 4xx client error | the request contains bad syntax or cannot be fulfilled |
| 5xx server error | the server failed to fulfil an apparently valid request |

- 4xx and 5xx responses have entities offering human-readable description of the error
  - handcrafted by the server programmers to help developers recover from the error
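
The class of a status code is simply its first digit. The sketch below combines that with the standard-library `http.HTTPStatus` enum, which supplies the standard reason phrases. The `CLASSES` table and the `describe` helper are illustrative names.

```python
from http import HTTPStatus

# First digit of the status code -> class name from the table above
CLASSES = {
    1: "informational response",
    2: "successful",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe(code):
    """Return the code, its standard phrase, and its class."""
    status = HTTPStatus(code)
    return f"{code} {status.phrase} ({CLASSES[code // 100]})"


for code in (200, 301, 404, 500):
    print(describe(code))
```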


🔭 Explore
----
- Explore the [list of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- Further explore the status code in [the HTTP standard](https://datatracker.ietf.org/doc/html/rfc9110)
 

How to handle 3xx redirection?
---
- use the correct URL to avoid redirection
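
To see what following a redirect involves, here is a self-contained sketch that uses the low-level `http.client` module against a throwaway server on the loopback interface. The handler class, the `/old` and `/new` paths, and the `fetch` helper are all made up for this illustration.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/old":
            # Point the client at the new location
            self.send_response(301)
            self.send_header("Location", "/new")
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            body = b"final document"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch(port, path):
    """Issue one GET and return (status, Location header, body)."""
    conn = HTTPConnection("127.0.0.1", port, timeout=5)
    conn.request("GET", path)
    response = conn.getresponse()
    result = (response.status, response.getheader("Location"), response.read())
    conn.close()
    return result


server = HTTPServer(("127.0.0.1", 0), RedirectHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

first = fetch(port, "/old")      # http.client does not follow the redirect
second = fetch(port, first[1])   # so the client follows Location itself
server.shutdown()
server.server_close()
print(first)
print(second)
```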

In [None]:
# For 3xx redirects, the PSL http.client module (formerly httplib) makes you follow them yourself
# But the urllib module will follow them for you in conformance with the standard

from urllib.request import urlopen
import urllib.error
r = urlopen('http://httpbin.org/status/301')
(r.status, r.url)


In [None]:
# requests offer more with a history attribute 
# that lists the whole series of redirects that brought you to the final location

import requests
r = requests.get('http://httpbin.org/status/301')
print("requests.get():", r.status_code, r.url)
r.history

In [None]:
# redirection can be turned off with requests
r = requests.get('http://httpbin.org/status/301', allow_redirects=False)
r.raise_for_status() # no exception here: only 4xx and 5xx statuses raise
(r.status_code, r.url, r.headers['Location'])

In [None]:
# detect 301 redirects so that later requests can use the final URL directly
# the most common redirects add or remove the www prefix of the hostname
r = requests.get('http://google.com/')
print("Google homepage:", r.url, r.history)

r = requests.get('http://www.twitter.com/')
print('Twitter homepage:', r.url, r.history)

How to handle 4xx client errors and 5xx server errors?
---
- for 4xx client errors, find and solve them
- for 5xx server errors, report to the webmaster

In [None]:
# libraries have their own handling approaches
# the PSL urllib.request.urlopen raises an exception
urlopen('http://httpbin.org/status/500')

In [None]:
# handle the exception yourself so that the client keeps running instead of crashing
try:
  urlopen('http://httpbin.org/status/500')
except urllib.error.HTTPError as e:
  print(e.status, repr(e.headers['Content-Type']))

In [None]:
# The Requests lib returns a response from error status 
# instead of raising an exception. 
r = requests.get('http://httpbin.org/status/500')
r.status_code

In [None]:
# Further tests are needed to solve the problems
# an exception can be raised manually
r.raise_for_status()

Caching and Validation
---
- Caching improves performance by avoiding repeated GET requests
- cacheable resources are indicated by header entries in the responses
- the clients manage the caches
- detailed in [RFC 9111](https://datatracker.ietf.org/doc/html/rfc9111)
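
As a rough illustration of how a client-side cache might read those header entries, the sketch below checks the `Cache-Control` response header defined in RFC 9111. It is deliberately simplified: the helper names `max_age` and `is_fresh` are hypothetical, only the `max-age`, `no-store`, and `no-cache` directives are considered, and validation with the server is left out.

```python
def max_age(cache_control):
    """Return the max-age value in seconds, or None if absent."""
    for directive in cache_control.split(","):
        name, _, value = directive.strip().partition("=")
        if name.lower() == "max-age" and value.isdigit():
            return int(value)
    return None


def is_fresh(headers, age_seconds):
    """Decide whether a stored response may be reused without asking the server."""
    cache_control = headers.get("Cache-Control", "")
    directives = [d.strip().lower() for d in cache_control.split(",")]
    if "no-store" in directives or "no-cache" in directives:
        return False
    limit = max_age(cache_control)
    return limit is not None and age_seconds < limit


stored = {"Cache-Control": "public, max-age=3600"}
print(is_fresh(stored, 120))    # stored two minutes ago
print(is_fresh(stored, 4000))   # stored more than an hour ago
```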


Transfer Encoding
---
- Transfer encoding turns a resource into an HTTP response body
  - a wrapper used for data delivery, not a change in the underlying data itself
  - modern web browsers support several encodings
    - the most popular one is the compressed encoding *gzip*, which servers normally announce in a Content-Encoding header
      - not decoded automatically by urllib
      - decoded automatically by Requests

```http
# A client indicates capability of gzip in an Accept-Encoding header
---
GET / HTTP/1.1
Accept-Encoding: gzip
---
# and examines whether the server supports gzip from its responses
---
HTTP/1.1 200 OK
Content-Length: 3913
Content-Encoding: gzip
---
```
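
Because urllib hands back the bytes exactly as they arrived, a client that asks for gzip has to decompress the body itself. A minimal sketch, assuming the server labels the compressed body with `Content-Encoding: gzip` (the usual convention); the helper name `decode_body` and the sample data are illustrative.

```python
import gzip


def decode_body(headers, body):
    """Undo gzip compression when the response headers announce it."""
    if headers.get("Content-Encoding", "").lower() == "gzip":
        return gzip.decompress(body)
    return body


# Simulate what a server would send for a compressible document
original = b"<html>" + b"hello " * 100 + b"</html>"
compressed = gzip.compress(original)

print(len(original), "bytes before,", len(compressed), "bytes on the wire")
print(decode_body({"Content-Encoding": "gzip"}, compressed) == original)
```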

Content Negotiation and Content Type
---
- Content type and content encoding are visible to client program
- Popular content types
  - text/plain, text/html
    - charset or text encoding needs to be specified, like
    - Content-Type: text/html; charset=utf-8
  - image/gif, image/jpeg, image/png
  - application/pdf, application/octet-stream
- The client and server negotiate the file format representing a given resource
  - and encoding if the format is text
  - the client states its acceptable content types using request headers such as
    - Accept, Accept-Charset, Accept-Language, User-Agent
  - the server states its choice in the response headers
- Content negotiation is often ignored in favor of letting users control their own experience
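
A client that reads the body with urllib has to pick the text encoding itself. One possible sketch uses the standard-library `email.message.Message` class to pull the charset parameter out of a Content-Type value. The helper name `charset_of` and its `default` argument are illustrative choices.

```python
from email.message import Message


def charset_of(content_type, default="utf-8"):
    """Return the charset named in a Content-Type value, in lowercase."""
    message = Message()
    message["Content-Type"] = content_type
    # get_content_charset() returns None when no charset parameter is present
    return message.get_content_charset() or default


body = b"caf\xc3\xa9"
header = "text/html; charset=UTF-8"
print(charset_of(header))
print(body.decode(charset_of(header)))
```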


In [None]:
# negotiation headers can be set in both urllib and Requests
s = requests.Session()
s.headers.update({'Accept-Language': 'en-US,en;q=0.8'})
# q=0.8 indicates the user's preference level for a specific language
# refer to https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Accept-Language
print('request headers:\n', s.headers)
r = s.get('http://httpbin.org/status/200')
print('response headers:\n', r.headers)

[HTTP Authentication](https://developer.mozilla.org/en-US/docs/Web/HTTP/Authentication)
---
- describes the procedures determining whether a request really comes from authorized users
- The error code '401 Unauthorized' indicates
  - the requestor identity cannot be authenticated
  - when the identity is fine but is not authorized to view the requested resource, '403 Forbidden' is the more precise code
  - both are rarely shown to users; sites usually redirect to the login page with the code '303 See Other'
- every HTTP request is standalone and independent of all other requests
   - even those that come right before and after it on the same socket
   - so any authenticating information must be carried separately in every single request
   - This independence makes it safe for proxy servers and load balancers to distribute HTTP requests
- Refer to [RFC 9110](https://datatracker.ietf.org/doc/html/rfc9110) for further information
- Basic Authentication (or “Basic Auth”)
  - has a string called a *realm* in the headers of its 401 Unauthorized response
    -  allows a single server to protect different parts of its document tree with different passwords
  - The client then repeats its request with an Authorization header giving the base-64 encoded username and password
    - enhanced with 'Digest access authentication'
      - the server issues a challenge 
      - the client replies with an MD5 hash of the challenge-plus-password
      - username is still visible in the clear
  - all communications are in plaintext
    - vulnerable to sniffing and man-in-the-middle attacks

```
GET / HTTP/1.1
...
HTTP/1.1 401 Unauthorized
WWW-Authenticate: Basic realm="engineering team"
...
GET / HTTP/1.1
Authorization: Basic YnJhbmRvbjphdGlnZG5nbmF0d3dhbA==
...
HTTP/1.1 200 OK
```
- today 'Basic Auth' is protected with HTTPS
  - used by many simple HTTPS-protected APIs and web applications
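
The Authorization value in the exchange above is only an encoding, not encryption. The sketch below rebuilds it with the standard-library `base64` module and then reverses it; the helper name `basic_auth_header` is illustrative.

```python
import base64


def basic_auth_header(username, password):
    """Build the value of an Authorization header for Basic Auth."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return f"Basic {token}"


header_value = basic_auth_header("brandon", "atigdngnatwwal")
print(header_value)

# Anyone who sees the header can recover the password
token = header_value.split(" ", 1)[1]
print(base64.b64decode(token).decode("utf-8"))
```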

In [None]:
# 'Basic Auth' is supported by both urllib and Requests
# for a single request
r = requests.get('http://httpbin.org/basic-auth/username/password', auth=('username', 'password'))
r

In [None]:
# for multiple requests
s = requests.Session()
s.auth = 'brandon', 'atigdngnatwwal' # 'username', 'password'
s.get('http://httpbin.org/basic-auth/brandon/atigdngnatwwal')

🔭 Explore
---
- Find the authentication methods for
  - [Spotify Web API](https://developer.spotify.com/documentation/web-api)
  - [Slack Web API](https://api.slack.com/web)

[Cookies](https://developer.mozilla.org/en-US/docs/Web/HTTP/Cookies)
---
- HTTP-mediated authentication is rare today
- Websites provide their own login pages assisted with cookies
- A cookie is a key-value pair received from a successful response
  - and should be opaque or encrypted
    - otherwise the cookie can be forged
  - cookie transmission should be protected with HTTPS
    - otherwise the cookie can be stolen and the user can be impersonated
- cookies can also be used to track users' browsing activities
- cookies are supported by both [urllib](https://docs.python.org/3/library/urllib.html) and [Requests](https://requests.readthedocs.io/en/latest/user/advanced/)

```
# the client receives a cookie from the server after successful authentication
GET /login HTTP/1.1
...
HTTP/1.1 200 OK
Set-Cookie: session-id=d41d8cd98f00b204e9800998ecf8427e; Path=/
...
# submit the cookie for all further requests
GET /login HTTP/1.1
Cookie: session-id=d41d8cd98f00b204e9800998ecf8427e
```
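
The exchange above can be mimicked with the standard-library `http.cookies` module. This sketch parses the Set-Cookie value from the example and builds the Cookie header that a client would send back; it ignores expiry and domain matching.

```python
from http.cookies import SimpleCookie

# Value of the Set-Cookie header from the response above
set_cookie = "session-id=d41d8cd98f00b204e9800998ecf8427e; Path=/"

cookie = SimpleCookie()
cookie.load(set_cookie)
morsel = cookie["session-id"]
print(morsel.key, morsel.value, morsel["path"])

# Header to attach to every later request on this site
request_header = f"Cookie: {morsel.key}={morsel.value}"
print(request_header)
```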

[Connections, Keep-Alive, and httplib](https://developer.mozilla.org/en-US/docs/Web/HTTP/Connection_management_in_HTTP_1.x)
---
- connection reuse saves time and resources by avoiding a new three-way TCP handshake for every request
- HTTP/1.1 keeps an HTTP connection open after a request by default
- Either the client or the server can specify 'Connection: close' to hang up
- client programs usually create multiple TCP connections for parallel communication
- [requests.Session uses keep-alive by default](https://stackoverflow.com/questions/25239650/python-requests-speed-up-using-keep-alive)

In [None]:
# requests.Session uses keep-alive by default
import logging
import requests

logging.basicConfig(level=logging.DEBUG)
s = requests.Session()
s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
s.get('http://httpbin.org/cookies/set/anothercookie/123456789')
r = s.get("http://httpbin.org/cookies")
print(r.text)