<div style="position: relative;">
<img src="https://user-images.githubusercontent.com/7065401/98728503-5ab82f80-2378-11eb-9c79-adeb308fc647.png"></img>

<h1 style="color: white; position: absolute; top:27%; left:10%;">
    Introduction to HTTP using Python
</h1>

<h3 style="color: #ef7d22; font-weight: normal; position: absolute; top:56%; left:10%;">
    David Mertz, Ph.D.
</h3>

<h3 style="color: #ef7d22; font-weight: normal; position: absolute; top:63%; left:10%;">
    Data Scientist
</h3>
</div>

## HTTP Protocol
![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

Everyone who is taking this class uses the HTTP protocol hundreds of times a day.  Web browsers, and the servers they connect to, are an obvious example.  But many dedicated applications, whether on desktop computers, servers, cluster nodes, or mobile devices, also use the HTTP protocol.

Understanding the underlying protocol is important for developing automated consumers of "web" content, as well as for writing servers.  Even when simply using web browsers or more specialized applications, an understanding of HTTP provides a conceptual model of the security and network issues that underlie its use.

HTTP is an "application-layer" protocol, in contrast to TCP and IP which are "transport-layer" and "internet-layer" protocols, respectively.  HTTP is almost always transmitted over TCP/IP, but can technically operate over different lower-level layers.

## Protocol components
![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

1. A client establishes a TCP/IP **connection** to a server (the transport layer is outside the specific scope of this course)
2. An HTTP **session** is a sequence of *request* and *response* messages, the former from client to server, the latter in reply to each, from server to client.
3. Each message consists of a **header** and an (optional) **body**, separated by a blank line.  Each line of the header, and the blank line itself, is terminated by CRLF (U+000D U+000A).
   * A request header consists of a request line followed by zero or more header fields
   * A response header consists of a status line followed by zero or more header fields
   * Headers only allow 7-bit ASCII characters within them
4. A body, if present, may be any binary data.  The `Content-Length` header field will indicate how many bytes to expect.  Often the `Content-Type` header field will provide a hint about how to interpret those bytes.
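
The components listed above can be sketched in a few lines of Python. This is only an illustration, not a complete parser: the function name `split_message` and the sample message are assumptions made for this example, and storing fields in a plain `dict` discards repeated header fields, which real messages may contain.

```python
def split_message(raw: bytes):
    """Split a raw HTTP message into (start_line, fields, body)."""
    # The first blank line (CRLF CRLF) separates the header from the body
    head, _, body = raw.partition(b'\r\n\r\n')
    lines = head.decode('ascii').split('\r\n')
    start_line, field_lines = lines[0], lines[1:]
    fields = {}
    for line in field_lines:
        name, _, value = line.partition(':')
        # Field names are case-insensitive, so normalize them for lookup
        fields[name.strip().lower()] = value.strip()
    return start_line, fields, body

# Hypothetical response used only for this illustration
raw = (b'HTTP/1.1 200 OK\r\n'
       b'Content-Type: text/plain; charset=utf-8\r\n'
       b'Content-Length: 9\r\n'
       b'\r\n'
       b'N\xc7\x90n h\xc7\x8eo')

start_line, fields, body = split_message(raw)
print(start_line)
print(fields)
print(len(body), 'bytes;', len(body.decode('utf-8')), 'characters')
```

Note that `Content-Length` counts bytes, not characters: the UTF-8 body here is 9 bytes long but decodes to 7 characters.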

A number of topics that are presented only passingly in this course are addressed in more depth in the INE course *Secure RESTful APIs using Python*.  Some of these include more details of HTTP headers, status codes, the HTTPS protocol and SSL/TLS encryption, and choosing and working with different content types.

It's worth briefly summarizing the most common status codes.  200 indicates success and carries the status line message "OK".  404 is "Not Found" and 403 is "Forbidden"; other 4xx codes indicate various flaws in the client request.  The 5xx codes indicate a problem on the server, most commonly 500 for "Internal Server Error".  The 3xx codes, especially 301 "Moved Permanently", tell a client that it needs to take additional action to obtain a resource.
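
The standard library already knows these codes and their messages. As a small sketch, the `http.HTTPStatus` enumeration can look up the phrase for a code, while the first digit gives its general class; the helper name `describe` and the wording of the class labels are choices made for this example.

```python
from http import HTTPStatus

CLASSES = {1: 'informational', 2: 'success', 3: 'redirection',
           4: 'client error', 5: 'server error'}

def describe(code: int) -> str:
    """Return a one-line description of an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # HTTPStatus only contains registered codes
        phrase = 'unregistered code'
    return f'{code} {phrase} ({CLASSES.get(code // 100, "unknown class")})'

for code in (200, 301, 403, 404, 500):
    print(describe(code))
```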

## Sample sessions
![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

To illustrate the HTTP protocol itself, this lesson will not use higher-level wrappers like `requests` or the standard library `urllib`.  Here we use the `telnetlib` module, which very closely matches how you might use the command-line `telnet` command.

In principle, we could also use the still-lower level `socket` module to make the connection, or the `ssl` module for an encrypted socket, and send bytes over it.  What we wish to do is send certain requests to a server at a given hostname and port, and examine the raw content the server returns as a response.
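
For readers on recent Python versions, note that `telnetlib` was deprecated in Python 3.11 and removed in 3.13. The following is a minimal sketch of the `socket` alternative mentioned above; the function names `exchange` and `fetch_raw` and the default timeout are assumptions of this example, not part of the lesson's code.

```python
import socket

def exchange(sock: socket.socket, request: bytes) -> bytes:
    """Send `request` on an open socket and collect the reply.

    Reading stops when the peer closes the connection or the socket's
    timeout expires, whichever happens first.
    """
    sock.sendall(request)
    chunks = []
    while True:
        try:
            data = sock.recv(4096)
        except socket.timeout:
            break              # nothing more arrived within the timeout
        if not data:
            break              # empty read: the peer closed the connection
        chunks.append(data)
    return b''.join(chunks)

def fetch_raw(request: bytes, host: str, port: int,
              timeout: float = 1.0) -> bytes:
    """Open a TCP connection, perform one exchange, and close it."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        return exchange(sock, request)
```

With the lesson's server running, a call such as `fetch_raw(b'GET /greeting HTTP/1.0\r\n\r\n', 'popbox.kdm.local', 2501)` would be expected to return the same raw bytes that the `telnetlib` version prints.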

In [1]:
from sys import stderr
from telnetlib import Telnet
from time import sleep
eot = '⌁'.encode()    # end-of-transmission character
host = 'popbox.kdm.local'
port = 2501

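def connection(reqs: list, host: str=host, port: int=port, raw: bool=False):
    """Send each request over one connection, printing each response."""
    # telnet popbox.kdm.local 2501
    with Telnet(host, port) as conn:
        conn.interact()
        for request in reqs:
            conn.write(request)
            # Read until the end-of-transmission marker or a brief timeout
            resp = conn.read_until(eot, 0.01)
            # Show decoded text by default, or the raw bytes on request
            print(resp.decode() if not raw else resp)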

Let's try sending a particular message to a running server.  The first line is the request line, and has a method (such as `GET`) followed by a resource name, then a protocol version.  Each of the next lines is a header field: a field name, followed by a colon and space, followed by a value.  Many of these headers are standard or semi-standard; others begin with `X-` to indicate they serve a custom purpose.

The IANA has a list of [Permanent Message Header Field Names](https://www.iana.org/assignments/message-headers/message-headers.xml#perm-headers) with several hundred widely used headers.  All are optional, but several are nearly ubiquitous in actual HTTP use.
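
The layout just described is regular enough to assemble programmatically. The helper below is a hypothetical convenience written for this illustration (the name `build_request` and its parameters are not from any library); it joins the request line and header fields with CRLF and ends the header with a blank line.

```python
def build_request(method: str, target: str, fields: dict,
                  version: str = 'HTTP/1.1') -> bytes:
    """Assemble a body-less HTTP request as bytes."""
    lines = [f'{method} {target} {version}']
    lines += [f'{name}: {value}' for name, value in fields.items()]
    # CRLF ends every line; one extra CRLF is the blank line ending the header
    return ('\r\n'.join(lines) + '\r\n\r\n').encode('ascii')

req = build_request('GET', '/greeting', {
    'Host': 'popbox.kdm.local',
    'User-Agent': 'telnet (INE course)',
    'X-INE-Student': 'David',
})
print(req)
```

Encoding as ASCII makes the helper fail loudly if a field contains characters outside the 7-bit range that headers allow.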

In [2]:
req = (
    b'GET /greeting HTTP/1.1\r\n'
    b'Accept-Encoding: identity\r\n'
    b'Accept-Charset: utf-8\r\n'
    b'Host: popbox.kdm.local\r\n'
    b'User-Agent: telnet (INE course)\r\n'
    b'X-INE-Student: David\r\n'
    b'\r\n')
connection([req])

HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
Content-Length: 110
X-INE-Course: HTTP with Python
Server: Werkzeug/2.0.0 Python/3.8.10
Date: Tue, 08 Jun 2021 00:46:47 GMT

<html>
  <head>
    <title>Test Page</title>
  </head>
  <body>
    <p>Hello David!</p>
  </body>
</html>
    


The first message was pedantically correct.  In particular, while CRLF is strictly required, essentially every HTTP server of the last 25 years will gracefully handle the use of LF alone.  This follows Postel's Law (the Robustness Principle):

> Be conservative in what you send, be liberal in what you accept

In [3]:
req = b'''GET /greeting?lang=zh HTTP/1.0

'''
connection([req], raw=True)

b'HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=utf-8\r\nContent-Length: 116\r\nX-INE-Course: HTTP with Python\r\nServer: Werkzeug/2.0.0 Python/3.8.10\r\nDate: Tue, 08 Jun 2021 00:50:45 GMT\r\n\r\n<html>\n  <head>\n    <title>Test Page</title>\n  </head>\n  <body>\n    <p>N\xc7\x90n h\xc7\x8eo Student!</p>\n  </body>\n</html>\n    '
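
Notice that the server itself answers with `\r\n` line endings even though the request used bare LF. A client that wants to stay on the conservative side of Postel's Law can normalize its own header before sending. The following is a small sketch; the name `to_crlf` is an assumption of this example, and it should only be applied to the header, never to a binary body.

```python
import re

def to_crlf(header: bytes) -> bytes:
    """Replace every bare LF (one not already preceded by CR) with CRLF."""
    return re.sub(rb'(?<!\r)\n', b'\r\n', header)

lenient = b'GET /greeting?lang=zh HTTP/1.0\n\n'
strict = to_crlf(lenient)
print(strict)
```

Because existing CRLF pairs are left alone, applying the function twice gives the same result as applying it once.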


That raw `bytes` response is a bit harder to read than decoded text, so let's present it as text.

In [4]:
req = b'''GET /greeting?lang=zh HTTP/1.0

'''
connection([req])

HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
Content-Length: 116
X-INE-Course: HTTP with Python
Server: Werkzeug/2.0.0 Python/3.8.10
Date: Tue, 08 Jun 2021 00:52:38 GMT

<html>
  <head>
    <title>Test Page</title>
  </head>
  <body>
    <p>Nǐn hǎo Student!</p>
  </body>
</html>
    


## Connection life and HTTP versions
![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

HTTP is versioned, as we can see in request lines and status lines of corresponding client and server headers.  Version 0.9 was a retroactively numbered earlier version that was, as the number suggests, a "beta version."  You are unlikely to encounter HTTP/0.9 servers in the wild.  One notable aspect of this very old version is that it has only a body and no header:

In [5]:
connection([b'GET /greeting HTTP/0.9\n\n'])

<html>
  <head>
    <title>Test Page</title>
  </head>
  <body>
    <p>Hello Student!</p>
  </body>
</html>
    


HTTP/1.0 has most of the semantics of later HTTP versions.  One notable change, however, between HTTP/1.0 and HTTP/1.1 is that in the earlier version, every connection consisted of a single request/response pair, and the connection was closed after those two messages.

HTTP/1.1 adds a mechanism for persistent connections.  This doesn't really change anything in the HTTP protocol itself, but the underlying TCP/IP sockets take overhead to establish; allowing multiple request/response pairs in the same session is often considerably faster.

Whether or not HTTP/1.1 (and higher) uses persistent connections is controlled by the `Connection` header.  Nearly all servers default to `keep-alive` if not otherwise specified.  We can see this by passing several requests within one connection.

In [6]:
req1 = b'''GET /greeting HTTP/1.1
Connection: keep-alive

'''
req2 = b'''GET /greeting?lang=fr HTTP/1.1
Connection: close

'''
req3 = b'''GET /greeting?lang=en HTTP/1.1

'''
try:
    connection([req1, req2, req3])
except Exception as err:
    print(repr(err), file=stderr)

HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
Content-Length: 112
X-INE-Course: HTTP with Python
Server: Werkzeug/2.0.0 Python/3.8.10
Date: Tue, 08 Jun 2021 02:22:19 GMT

<html>
  <head>
    <title>Test Page</title>
  </head>
  <body>
    <p>Hello Student!</p>
  </body>
</html>
    
HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
Content-Length: 114
X-INE-Course: HTTP with Python
Server: Werkzeug/2.0.0 Python/3.8.10
Date: Tue, 08 Jun 2021 02:22:19 GMT

<html>
  <head>
    <title>Test Page</title>
  </head>
  <body>
    <p>Bonjour Student!</p>
  </body>
</html>
    


EOFError('telnet connection closed')
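
The `EOFError` is what we would expect: the second request asked for `Connection: close`, so the server closed the socket before the third request could be answered. A rough sketch of the rule a client or server might apply is below; the function name `stays_open` is hypothetical, and it assumes header field names have already been lowercased.

```python
def stays_open(version: str, fields: dict) -> bool:
    """Return True if the connection should persist after this message."""
    # The Connection header holds a comma-separated list of tokens
    tokens = {token.strip().lower()
              for token in fields.get('connection', '').split(',')
              if token.strip()}
    if 'close' in tokens:
        return False
    if version == 'HTTP/1.1':
        return True            # persistent unless told otherwise
    # HTTP/1.0: one exchange per connection unless keep-alive is requested
    return 'keep-alive' in tokens

print(stays_open('HTTP/1.1', {'connection': 'keep-alive'}))
print(stays_open('HTTP/1.1', {'connection': 'close'}))
print(stays_open('HTTP/1.0', {}))
```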


## Streams and chunks
![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

Since 2015, HTTP/2 has built on HTTP/1.1 in a backwards-compatible way.  Most popular websites and all popular web browsers support HTTP/2.  HTTP/3, standardized in 2022, is the successor to HTTP/2 and builds on many of the same optimizations.

In large part, HTTP/2 takes ideas developed by Google as SPDY.  The main advantage of HTTP/2 over the HTTP/1.1 "chunked transfer encoding" is that HTTP/2 supports multiple "streams" within a connection, with each "DATA frame" tagged by an integer identifying its stream.  HTTP/2 *does not* support chunked transfer encoding.

One principal use of HTTP/2 streams is to allow a server to push to a client cache resources it anticipates the client will want, before the client actually requests them.

For example, imagine this HTML web page that might be retrieved over HTTP:

```html
<html>
  <head>
    <title>A large page with other resources</title>
    <script src="http://example.com/large_script.js"></script>
    <link rel="stylesheet" href="http://example.com/many_styles.css">
  </head>
  <body>
    <img src="http://example.com/big_img1.png"/>
    <img src="http://example.com/big_img2.png"/>
    <!-- ... much more HTML content ... -->
    <img src="http://example.com/big_img98.png"/>
    <img src="http://example.com/big_img99.png"/>
    <script src="http://example.com/other_script.js"></script>
  </body>
</html>
```

Under HTTP/1.1, a client such as a web browser would need to read this entire page before it would see that `other_script.js` might be used to manipulate the page. Moreover, the entire page would likewise need to be read before the client was aware of the related resources `big_img98.png` and `big_img99.png` whose presence might affect the overall layout of the rendered page.  

Even if the client realizes after a few bytes that it will need `large_script.js` and `many_styles.css`, under HTTP/1.1 the client would need to launch a new connection to obtain these resources before the first HTML page request completed.
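
To make the timing concrete, the sketch below feeds a shortened version of such a page to the standard library's `html.parser.HTMLParser` in small pieces, as if they were arriving over the network, and records how much had been received when each related resource was first noticed. The class name `ResourceFinder`, the abbreviated page, and the 50-character piece size are all assumptions of this illustration.

```python
from html.parser import HTMLParser

class ResourceFinder(HTMLParser):
    """Record sub-resource URLs in the order the parser encounters them."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ('script', 'img') and attrs.get('src'):
            self.found.append(attrs['src'])
        elif tag == 'link' and attrs.get('href'):
            self.found.append(attrs['href'])

page = (
    '<html><head>'
    '<script src="http://example.com/large_script.js"></script>'
    '<link rel="stylesheet" href="http://example.com/many_styles.css">'
    '</head><body>'
    '<img src="http://example.com/big_img1.png"/>'
    '<script src="http://example.com/other_script.js"></script>'
    '</body></html>'
)

finder = ResourceFinder()
first_seen = {}                      # URL -> characters received when found
for start in range(0, len(page), 50):
    piece = page[start:start + 50]   # pretend this piece just arrived
    finder.feed(piece)               # incomplete tags are buffered
    received = start + len(piece)
    for url in finder.found:
        first_seen.setdefault(url, received)
finder.close()

for url, received in first_seen.items():
    print(f'{received:4d} characters in: {url}')
```

The script at the end of the body is only discovered once nearly the whole page has been received, which is the delay that server push aims to avoid.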

If both ends of the communication agree to HTTP/2, the server can start a stream for these anticipated resource needs before the client realizes it will need them.  Moreover, HTTP/2 streams, including their headers, are both binary and compressed to minimize size.

Unfortunately, because HTTP/2 is a binary protocol, it is very difficult to read and create manually in a manner similar to the above examples.  Moreover, the popular Python `requests` library that is discussed in this course does not currently support HTTP/2.  HTTP/2 support in `requests` is planned, and the newer modules `Hyper` and `HTTPX` support HTTP/2, but are not themselves widely adopted.
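
To give a feel for the binary format without implementing the protocol, here is a sketch of just the fixed 9-byte header that precedes every HTTP/2 frame: a 24-bit payload length, an 8-bit type, 8 bits of flags, and a reserved bit followed by a 31-bit stream identifier. The function names are assumptions of this example, and only DATA frames are shown because HEADERS frames additionally require header compression.

```python
import struct

FRAME_DATA = 0x0         # frame type code for DATA
FLAG_END_STREAM = 0x1    # flag marking the last frame of a stream

def pack_frame(frame_type: int, flags: int, stream_id: int,
               payload: bytes) -> bytes:
    """Prefix `payload` with a 9-byte HTTP/2 frame header."""
    length = len(payload)
    # The 24-bit length is packed as one byte plus one 16-bit integer
    header = struct.pack('>BHBBI', length >> 16, length & 0xFFFF,
                         frame_type, flags, stream_id & 0x7FFFFFFF)
    return header + payload

def unpack_header(frame: bytes):
    """Return (length, type, flags, stream_id) from a frame's first 9 bytes."""
    high, low, frame_type, flags, stream = struct.unpack('>BHBBI', frame[:9])
    # Mask off the reserved top bit of the stream identifier
    return (high << 16) | low, frame_type, flags, stream & 0x7FFFFFFF

frame = pack_frame(FRAME_DATA, FLAG_END_STREAM, 3, b'Hello')
print(frame)
print(unpack_header(frame))
```

The stream identifier is the integer tag that lets frames from several streams be interleaved on one connection.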

---
### Reading streams

Because of limited library support and its binary format, we will not look in more detail at HTTP/2 streamed content.  This is also not an example of `Transfer-Encoding: chunked`, which HTTP/2 does not support.
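
For contrast, this is roughly how an HTTP/1.1 chunked body is framed: each chunk is a hexadecimal size line, the data, and a CRLF, and a zero-size chunk ends the body. The decoder below is a minimal sketch with a made-up sample body; the name `decode_chunked` is an assumption, and it ignores chunk extensions and any trailer fields after the final chunk.

```python
def decode_chunked(body: bytes) -> bytes:
    """Reassemble the payload of a chunked-encoded HTTP/1.1 body."""
    decoded = bytearray()
    pos = 0
    while True:
        line_end = body.index(b'\r\n', pos)
        # The size line is hexadecimal; anything after ';' is an extension
        size = int(body[pos:line_end].split(b';')[0], 16)
        if size == 0:
            break                      # a zero-length chunk ends the body
        start = line_end + 2
        decoded += body[start:start + size]
        pos = start + size + 2         # skip the CRLF that follows the data
    return bytes(decoded)

wire = b'5\r\nHello\r\n8\r\n, world!\r\n0\r\n\r\n'
print(decode_chunked(wire))
```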

However, let us look at responses that may simply arrive slowly, but still benefit from incremental processing as body content arrives.  A more real-world example might incrementally process CSV records or JSON Lines objects.
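
As a sketch of that more real-world case, the generator below accepts byte chunks in whatever sizes they happen to arrive and yields each JSON Lines record as soon as its line is complete. The function name `json_lines` and the sample chunks are assumptions made for this illustration.

```python
import json

def json_lines(chunks):
    """Yield one parsed object per complete line as byte chunks arrive."""
    buffer = b''
    for chunk in chunks:
        buffer += chunk
        # Everything before the last newline is complete; keep the rest
        *complete, buffer = buffer.split(b'\n')
        for line in complete:
            if line.strip():
                yield json.loads(line)
    if buffer.strip():
        # The final record may arrive without a trailing newline
        yield json.loads(buffer)

# Chunk boundaries deliberately fall in the middle of records
arriving = [b'{"greet": "Hel', b'lo"}\n{"greet": "Bon',
            b'jour"}\n{"greet": "Hola"}']
for record in json_lines(arriving):
    print(record)
```

Splitting on the newline byte before decoding means a multi-byte UTF-8 character that straddles two chunks is never decoded in halves.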

In [7]:
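def stream_conn(request: bytes, host: str=host, port: int=port):
    """Print the response header once, then each body piece as it arrives."""
    with Telnet(host, port) as conn:
        conn.interact()
        conn.write(request)
        while True:
            # Wait up to 0.1 seconds for the next piece of the response
            resp = conn.read_until(eot, 0.1)
            if b'\r\n\r\n' in resp:
                # Split only at the first blank line, which ends the header
                header, resp = resp.split(b'\r\n\r\n', 1)
                print(header.decode(), end='\n\n')
            if eot in resp:
                break
            elif resp:
                greet = resp.decode()
                print(f'{greet} [{len(greet)} characters]')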

In [8]:
req = b'''GET /stream HTTP/1.1
Connection: keep-alive

'''
stream_conn(req)

HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
Connection: close
Server: Werkzeug/2.0.0 Python/3.8.10
Date: Tue, 08 Jun 2021 02:43:26 GMT

Hello [5 characters]
Nǐn hǎo [7 characters]
Bonjour [7 characters]
Hola [4 characters]
Zdravstvuyte [12 characters]


## Wrapping up
![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

This section has presented the basics of HTTP as a protocol.  In the next sections and lessons, we turn to Python libraries for making HTTP requests to servers from clients, and to creating HTTP servers for clients to utilize.  

In the course of talking about those libraries, we will address topics such as status codes and content types in more detail.