# Network data and network errors

## Bytes and strings

In [1]:
# 98 represented in different numerical systems
98 == 0b1100010 == 0o142 == 0x62

True

In [5]:
#  a list of numbers -> bytes
b = bytes([0,11,22,33,99,111,222,255])
b, len(b), type(b), chr(33),chr(99),chr(111)

(b'\x00\x0b\x16!co\xde\xff', 8, bytes, '!', 'c', 'o')

In [6]:
list(b)

[0, 11, 22, 33, 99, 111, 222, 255]

## Character strings

In [8]:
for i in range(32,128,32):
    print(' '.join(chr(j) for j in range(i, i+32)))

  ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
@ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _
` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~ 


- Encoding: string -> bytes
- Decoding: bytes -> string
- Further info: [codecs — Codec registry and base classes](https://docs.python.org/3/library/codecs.html)

In [10]:
mybytes = '和平万岁！'.encode('utf-8')
mybytes

b'\xe5\x92\x8c\xe5\xb9\xb3\xe4\xb8\x87\xe5\xb2\x81\xef\xbc\x81'

In [11]:
mybytes.decode('latin1')

'å\x92\x8cå¹³ä¸\x87å²\x81ï¼\x81'

In [12]:
mybytes.decode('latin2')

'ĺ\x92\x8cĺšłä¸\x87ĺ˛\x81ďź\x81'

In [13]:
mybytes.decode('greek')

'ε\x92\x8cεΉ³δΈ\x87ε²\x81οΌ\x81'

In [14]:
mybytes.decode('hebrew')

'ו\x92\x8cו¹³ה¸\x87ו²\x81ן¼\x81'

In [15]:
mybytes.decode('utf-8')

'和平万岁！'

In [16]:
'和平万岁！'.encode('utf-16')

b'\xff\xfe\x8cTs^\x07N\x81\\\x01\xff'

In [17]:
'和平万岁！'.encode('utf-32')

b'\xff\xfe\x00\x00\x8cT\x00\x00s^\x00\x00\x07N\x00\x00\x81\\\x00\x00\x01\xff\x00\x00'

In [18]:
mybytes.decode('ascii')

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)

In [19]:
'ΛλΘθΩω'.encode('latin-1')

UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-5: ordinal not in range(256)

In [20]:
mybytes.decode('ascii', 'replace')

'���������������'

In [21]:
mybytes.decode('ascii', 'ignore')

''

In [22]:
'ΛλΘθΩω'.encode('latin-1', 'replace')

b'??????'

In [23]:
'ΛλΘθΩω'.encode('latin-1', 'ignore')

b''
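
The `'replace'` and `'ignore'` handlers hide decoding problems by dropping information. With network data, a common cause of such problems is a multi-byte character split across two reads. A minimal sketch using the standard library's incremental decoder; the 4-byte chunking is only an assumption that simulates successive `recv()` calls:

```python
import codecs

text = '和平万岁！'
payload = text.encode('utf-8')          # 15 bytes, 3 per character

# Simulate network reads that cut characters in the middle
chunks = [payload[i:i + 4] for i in range(0, len(payload), 4)]

# The incremental decoder buffers incomplete byte sequences between calls
decoder = codecs.getincrementaldecoder('utf-8')()
pieces = [decoder.decode(chunk) for chunk in chunks]
pieces.append(decoder.decode(b'', final=True))   # flush; raises if bytes are left over
decoded = ''.join(pieces)
print(decoded == text)
```

Decoding each chunk separately with `chunk.decode('utf-8')` would raise `UnicodeDecodeError` on the first chunk here, because it ends in the middle of a character.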

## Binary numbers and network byte order

In [28]:
import struct
struct.pack('<i', 0x12abcdef) # little endian

b'\xef\xcd\xab\x12'

In [29]:
struct.pack('>i',0x12abcdef) # big endian

b'\x12\xab\xcd\xef'

In [34]:
struct.unpack('>i',b'\x12\xab\xcd\xef')

(313249263,)

In [35]:
hex(313249263)

'0x12abcdef'

In [36]:
struct.pack('!i',0x12abcdef) # network order

b'\x12\xab\xcd\xef'
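
The cells above suggest that the `!` flag yields the same bytes as `>`, since network byte order is big-endian. A small check of that, with `int.to_bytes()` and `int.from_bytes()` shown as an alternative to `struct` for a single integer:

```python
import struct
import sys

value = 0x12abcdef

network = struct.pack('!i', value)
big = struct.pack('>i', value)
little = struct.pack('<i', value)

print(network == big)                           # network order is big-endian
print(little == big[::-1])                      # little-endian reverses the byte sequence
print(value.to_bytes(4, 'big') == network)      # same bytes without struct
print(int.from_bytes(network, 'big') == value)  # and back to the integer
print(sys.byteorder)                            # host order: 'little' or 'big'
```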

## Python `bytes` and `bytearray`
- `bytes` and `bytearray` are fundamental Python types for handling binary data.
- Both represent sequences of bytes, but they differ in mutability.
- Used in various applications like file I/O, network communication, and data processing.
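
A short sketch of what the difference in mutability means in practice: item assignment works on a `bytearray` but raises `TypeError` on `bytes`, and only the immutable type is hashable.

```python
b = bytes(b'Hello')
ba = bytearray(b'Hello')

ba[0] = ord('J')          # in-place item assignment works on bytearray
print(ba)

try:
    b[0] = ord('J')       # bytes does not support item assignment
except TypeError as e:
    print('TypeError:', e)

# bytes is hashable (usable as a dict key); bytearray is not
table = {b: 'ok'}
try:
    hash(ba)
    hashable = True
except TypeError:
    hashable = False
print(hashable)
```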

---

### **Basic Comparison**
| **Feature**        | **`bytes`**               | **`bytearray`**           |
|--------------------|--------------------------|---------------------------|
| **Mutability**     | Immutable                 | Mutable                    |
| **Usage**          | When data should not change | When data needs modification |
| **Memory Efficiency** | More efficient for immutable data | Less efficient but flexible |
| **Methods Available** | Most string methods      | Most string methods, plus mutation methods |

---

### **Creation Methods**
| **Creation Method** | **`bytes`**               | **`bytearray`**           |
|--------------------|--------------------------|---------------------------|
| **From String**    | `bytes('string', 'encoding')` | `bytearray('string', 'encoding')` |
| **From List of Integers** | `bytes([65, 66, 67])` | `bytearray([65, 66, 67])` |
| **From Existing Byte Data** | `b = b'Hello'` | `ba = bytearray(b'Hello')` |

In [9]:
# 🍎
b = bytes('Hello', 'utf-8')
ba = bytearray([65, 66, 67])
print(b)  # Output: b'Hello'
print(ba)  # Output: bytearray(b'ABC')

b'Hello'
bytearray(b'ABC')


### **Common Operations**
| **Operation**      | **`bytes`**               | **`bytearray`**           |
|--------------------|--------------------------|---------------------------|
| **Indexing**       | `b[0]`                    | `ba[0]`                    |
| **Slicing**        | `b[1:4]`                  | `ba[1:4]`                  |
| **Concatenation**  | `b1 + b2`                 | `ba1 + ba2`                |
| **Repetition**     | `b * 2`                   | `ba * 2`                   |
| **Membership Test** | `72 in b`                | `72 in ba`                 |

In [10]:
# 🍎

b1 = b'Hello, '
b2 = b'World!'
ba1 = bytearray(b'Hello, ')
ba2 = bytearray(b'World!')

b = b1 + b2  # Concatenation
ba = ba1 + ba2  # Concatenation
print(b)  # Output: b'Hello, World!'
print(ba)  # Output: bytearray(b'Hello, World!')

b3 = b * 2  # Repetition
ba3 = ba * 2  # Repetition
print(b3)  # Output: b'Hello, World!Hello, World!'
print(ba3)  # Output: bytearray(b'Hello, World!Hello, World!')

b'Hello, World!'
bytearray(b'Hello, World!')
b'Hello, World!Hello, World!'
bytearray(b'Hello, World!Hello, World!')


### **Methods Comparison**
| **Method**         | **`bytes`**               | **`bytearray`**           |
|--------------------|--------------------------|---------------------------|
| **`.decode()`**    | Yes                       | Yes                       |
| **`.find()`**      | Yes                       | Yes                       |
| **`.split()`**     | Yes                       | Yes                       |
| **`.append()`**    | No                        | Yes                       |
| **`.extend()`**    | No                        | Yes                       |
| **`.pop()`**       | No                        | Yes                       |

In [11]:
# 🍎
ba = bytearray(b'Hello')
ba.append(33)  # Appends '!'
print(ba)  # Output: bytearray(b'Hello!')

bytearray(b'Hello!')


### **🍎 Practical Use Case: Network Communication**
| **Scenario**       | **`bytes`**               | **`bytearray`**           |
|--------------------|--------------------------|---------------------------|
| **Sending Data**   | Yes                       | Yes                       |
| **Receiving Data** | Yes                       | Yes                       |
| **Modifying Received Data** | No                | Yes                       |
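
The last row of the table can be illustrated with `recv_into()`, which writes incoming data into a preallocated `bytearray`. This is only a sketch: `socket.socketpair()` stands in for a real client and server so it runs in a single process, and the buffer size of 1024 is an arbitrary choice.

```python
import socket

# socketpair() returns two already-connected sockets, handy for a local demo
sender, receiver = socket.socketpair()
sender.sendall(b'hello, client!')

buf = bytearray(1024)               # preallocated, mutable receive buffer
n = receiver.recv_into(buf)         # returns the number of bytes written
message = buf[:n]                   # slicing a bytearray yields a bytearray
message[0] = ord('H')               # modify the received data in place

frozen = bytes(message)             # immutable snapshot once editing is done
print(frozen)

sender.close()
receiver.close()
```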

In [None]:
# put the code into a python file then run it
# Don't run it from here!
import socket

# Server
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(('localhost', 8080))
server.listen(1)
conn, addr = server.accept()
data = conn.recv(1024)  # Receives data as bytes
print(data)
conn.sendall(b'Hello, Client!')

In [13]:
import socket

# Client
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(('localhost', 8080))  # must match the port the server binds to
client.sendall(b'Hello, Server!')
data = client.recv(1024)
print(data)

b'Hello, Client!'


### **Summary**
| **Feature**        | **`bytes`**               | **`bytearray`**           |
|--------------------|--------------------------|---------------------------|
| **Mutability**     | Immutable                 | Mutable                    |
| **Common Use**     | Read-only binary data     | Modifiable binary data     |
| **Typical Use Cases** | File I/O, network protocols | Network communication, byte-level data manipulation |

- **Key Takeaway**:
  - Choose `bytes` when immutability is desired.
  - Choose `bytearray` when you need to modify the data.
  - 🔗 Official Python Documentation: [bytes](https://docs.python.org/3/library/stdtypes.html#bytes), [bytearray](https://docs.python.org/3/library/stdtypes.html#bytearray)


## Python [`struct`](https://docs.python.org/3/library/struct.html)

### **The `struct` Module**
- The `struct` module in Python is used for working with binary data.
- It provides functions to convert between Python values and C structs represented as Python bytes objects.
- Useful for reading and writing binary files, network protocols, and data serialization.

---

### **Basic Concepts**
- **Packing**: Converting Python values into binary data.
- **Unpacking**: Converting binary data back into Python values.
- **Format Strings**: Describe the layout of the data when packing/unpacking.
  - Examples: `i` for integer, `f` for float, `s` for string.
- **Endianness**: Defines the byte order in which data is stored. 
  - `>` for big-endian, `<` for little-endian.

---

### **Format Strings**

- **Format**: `[<endian-flag>][integer]<type-codes>`
  - **Endian-flag** (optional): Specifies the byte order, size, and alignment.
  - **Integer** (optional): Repeats the type-code this many times; for `s` and `p` it is instead the length in bytes of a single bytes value.
  - **Type-codes**: Defines the type of data to be packed/unpacked.
- **Endian Flags and Their Meanings**

| **Flag** | **Description**                  | **Byte Order**  | **Size** | **Alignment**      |
|----------|----------------------------------|-----------------|---------| ------------------|
| `@`      | Native                            | Host’s native   | Native  | Native   |
| `=`      | Native                            | Host’s native   | Standard | None |
| `<`      | Little-endian                     | Little-endian   | Standard | None |
| `>`      | Big-endian                        | Big-endian      | Standard | None |
| `!`      | Network (Big-endian)              | Big-endian      | Standard | None |

- **Example Format String**:
- **`>4sI`**
  - `>`: Big-endian
  - `4s`: 4-byte string
  - `I`: Unsigned int (4 bytes)

---

## Type codes

| Format | C Type | Python Type | Standard Size (bytes) |
|---|---|---|---|
| x | pad byte | no value |  |
| c | char | bytes of length 1 | 1 |
| b | signed char | integer | 1 |
| B | unsigned char | integer | 1 |
| ? | _Bool | bool | 1 |
| h | short | integer | 2 |
| H | unsigned short | integer | 2 |
| i | int | integer | 4 |
| I | unsigned int | integer | 4 |
| l | long | integer | 4 |
| L | unsigned long | integer | 4 |
| q | long long | integer | 8 |
| Q | unsigned long long | integer | 8 |
| n | ssize_t | integer |  |
| N | size_t | integer |  |
| e |  | float | 2 |
| f | float | float | 4 |
| d | double | float | 8 |
| s | char[] | bytes |  |
| p | char[] | bytes |  |
| P | void* | integer |  |


In [1]:
### **1. Packing/Unpacking Data**
import struct

# Packing data into a binary format
packed_data = struct.pack('>I4s', 1024, b'ABCD')
print(f"Packed Data: {packed_data}")

unpacked_data = struct.unpack('>I4s', packed_data)
print(f"Unpacked Data: {unpacked_data}")

Packed Data: b'\x00\x00\x04\x00ABCD'
Unpacked Data: (1024, b'ABCD')


In [2]:
### **2. Working with Complex Structures**
packed_data = struct.pack('>I f 5s', 1234, 2.34, b'Hello')
unpacked_data = struct.unpack('>I f 5s', packed_data)
print(f"Unpacked Data: {unpacked_data}")

Unpacked Data: (1234, 2.3399999141693115, b'Hello')


In [7]:
### **3. Handling Endianness**
import binascii

little_endian = struct.pack('<I', 0x12345678)
big_endian = struct.pack('>I', 0x12345678)

print(f"Little-endian: {little_endian}")
print(f"Big-endian: {big_endian}")

print(f"Little-endian: {binascii.hexlify(little_endian).decode()}")
print(f"Big-endian: {binascii.hexlify(big_endian).decode()}")

Little-endian: b'xV4\x12'
Big-endian: b'\x124Vx'
Little-endian: 78563412
Big-endian: 12345678


In [8]:
### **4. Practical Use Case: Binary File I/O**

# Writing to a binary file
with open('data.bin', 'wb') as f:
    f.write(struct.pack('>I', 1234))

# Reading from a binary file
with open('data.bin', 'rb') as f:
    data = f.read()
    unpacked_data = struct.unpack('>I', data)
print(f"Data from file: {unpacked_data}")

Data from file: (1234,)
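
Binary files often hold many fixed-size records rather than a single value. A sketch of that case using a precompiled `struct.Struct` and `iter_unpack()`; the record layout `'>HI'` is a hypothetical example, and `BytesIO` stands in for a file opened in binary mode:

```python
import struct
from io import BytesIO

record = struct.Struct('>HI')    # hypothetical record: 2-byte id, 4-byte value

f = BytesIO()                    # stands in for open('data.bin', 'wb')
for rec_id, value in [(1, 100), (2, 200), (3, 300)]:
    f.write(record.pack(rec_id, value))

raw = f.getvalue()
# iter_unpack() walks the buffer one record at a time;
# the buffer length must be a multiple of record.size
records = list(record.iter_unpack(raw))
print(record.size, records)
```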


## How to determine a complete message in the received data?
- **M1**: In TCP blocking mode, `recv()` returning an empty bytes object (`b''`) means the peer has closed the connection, which marks the end of the message
  - [Simply Send All Data and Then Close the Connection](./streamer.py)
    ```bash
    # open two terminals, one runs as the server
    python3 streamer.py
    # the other runs the client
    python3 streamer.py -c
    ```
- **M2**: Stream in both directions alternately
- **M3**: Use fixed-length messages
  ```python
  def recvall(sock, length):
      data = b''
      while len(data) < length:
          more = sock.recv(length - len(data))
          if not more:
              raise EOFError(f'socket closed {len(data)} bytes into a {length}-byte message.')
          data += more
      return data
  ```
- **M4**: Delimit messages with special characters
  - Hard to choose the delimiter if any data is legal in the message
- **M5**: Prefix each message with its length
  - The length of each message must be known in advance
  - [Framing Each Block of Data by Preceding It with Its Length](./blocks.py)
    ```bash
    # open two terminals, one runs as the server
    python3 blocks.py
    # the other runs the client
    python3 blocks.py -c
    ```
- **M6**: Send a message of unknown length as a series of blocks
  - If the total length is not known in advance, apply **M5** to each block of data as it becomes available, and
  - Signal the end with a block of length 0

- HTTP uses both **M4** and **M5**
  - a blank line (the sequence `\r\n\r\n`) delimits its header
  - the HTTP payload, such as an image, may be pure binary data,
    - so `Content-Length` is provided in the header to determine the amount of data
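
Length-prefixed framing (**M5**) with a zero-length terminator (**M6**) can be sketched as follows. This is an illustration and not necessarily what the linked `blocks.py` does: the helper names `put_block` and `get_block` and the 4-byte `!I` header are assumptions, and `socket.socketpair()` replaces a real client and server.

```python
import socket
import struct

header = struct.Struct('!I')      # 4-byte unsigned length, network byte order

def recvall(sock, length):
    """Read exactly `length` bytes or raise EOFError."""
    data = b''
    while len(data) < length:
        more = sock.recv(length - len(data))
        if not more:
            raise EOFError(f'socket closed {len(data)} bytes into a {length}-byte message')
        data += more
    return data

def put_block(sock, message):
    """Send one block: the length prefix followed by the payload."""
    sock.sendall(header.pack(len(message)) + message)

def get_block(sock):
    """Receive one block; an empty block signals the end of the stream."""
    (length,) = header.unpack(recvall(sock, header.size))
    return recvall(sock, length)

a, b = socket.socketpair()
for part in [b'first block', b'second, longer block', b'']:
    put_block(a, part)

received = []
while True:
    block = get_block(b)
    if not block:             # zero-length block: sender is done
        break
    received.append(block)
print(received)
a.close()
b.close()
```

Because the receiver always knows how many bytes to expect, the payload may contain any byte values, which is the weakness of delimiter-based framing (**M4**).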

## 5.5 Pickles and self-delimiting formats
- pickles have built-in delimiting

In [37]:
import pickle

In [41]:
pa = pickle.dumps([10,'hi',20]) # the period . at the end of the output is the delimiter
pa

b'\x80\x04\x95\x0e\x00\x00\x00\x00\x00\x00\x00]\x94(K\n\x8c\x02hi\x94K\x14e.'

In [42]:
pickle.loads(pa)

[10, 'hi', 20]

- pickle.load() reads from a file and stops at the end of the pickle data

In [45]:
from io import BytesIO
f = BytesIO(b'\x80\x04\x95\x0e\x00\x00\x00\x00\x00\x00\x00]\x94(K\n\x8c\x02hi\x94K\x14e.more data to come')

In [46]:
pickle.load(f)

[10, 'hi', 20]

In [47]:
f.tell()

25

In [48]:
f.read()

b'more data to come'

- wrap a socket in a Python file object with `makefile()`, then pass it to `pickle.load()`
- [further info: pickle: Python object serialization](https://docs.python.org/3/library/pickle.html)
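
A sketch of the `makefile()` approach, with `socket.socketpair()` standing in for a real connection. Two pickles are sent back to back with no extra framing, and each `pickle.load()` call stops at the end of one pickle:

```python
import pickle
import socket

a, b = socket.socketpair()

# Two pickles back to back, with no length prefix or delimiter between them
a.sendall(pickle.dumps([10, 'hi', 20]) + pickle.dumps({'done': True}))
a.close()

with b.makefile('rb') as f:        # buffered binary file object over the socket
    first = pickle.load(f)         # stops at the end of the first pickle
    second = pickle.load(f)        # continues where the first one ended
b.close()
print(first, second)
```

Unpickling can execute arbitrary code, so this pattern is only appropriate when the peer is trusted.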

## 5.6 XML and JSON
- widely used data formats
  - no framing support
- JSON is used to exchange data between different programming languages
  - allows Unicode characters in its strings and payload
  - encodes JSON strings as UTF-8 for network transmission
- XML is better for documents

In [50]:
import json
json.dumps([1979, '中美建交'])

'[1979, "\\u4e2d\\u7f8e\\u5efa\\u4ea4"]'

In [51]:
json.dumps([51, 'Nixon and 周恩来'], ensure_ascii=False)

'[51, "Nixon and 周恩来"]'

In [53]:
json.loads('{"USA":"Nixon", "中国":"周恩来"}')

{'USA': 'Nixon', '中国': '周恩来'}

- XML and JSON are text formats
- Binary formats like Thrift and Google Protocol Buffers are more efficient
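
Since `json.dumps()` returns a `str`, the text still has to be encoded to bytes before it goes on the wire. A small sketch of the round trip through UTF-8; the dictionary content is just sample data:

```python
import json

message = {'year': 1979, 'event': '中美建交'}

wire = json.dumps(message, ensure_ascii=False).encode('utf-8')   # raw UTF-8 characters
escaped = json.dumps(message).encode('utf-8')                    # \uXXXX escapes, pure ASCII

print(len(wire), len(escaped))                 # the escaped form is longer
print(json.loads(wire.decode('utf-8')) == message)
print(json.loads(escaped) == message)          # json.loads() also accepts UTF-8 bytes
```

Both forms decode to the same object; `ensure_ascii=False` mainly saves bytes when the text contains many non-ASCII characters.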

## 5.7 Compression
- Network throughput is a bottleneck in distributed applications
  - it is worthwhile to compress data before transmission
- zlib is a popular compression format
  - available in the Python Standard Library (PSL)
  - supports self-framing
  - however, most protocols choose to do their own framing

In [54]:
# there is overhead of compression
# here, two compression streams are separated with '|'
import zlib
data  = zlib.compress(b'First sentence') + b'|' + zlib.compress(b'Second sentence') + b'|'
data

b'x\x9cs\xcb,*.Q(N\xcd+I\xcdKN\x05\x00(d\x05~|x\x9c\x0bNM\xce\xcfKQ(N\xcd+I\xcdKN\x05\x00-\xab\x05\xd2|'

In [55]:
len(data)

47

In [67]:
# suppose network block is of size 8 bytes
d = zlib.decompressobj()
d.decompress(data[0:8]), d.unused_data # empty unused_data indicates more data to come

(b'First', b'')

In [68]:
d.decompress(data[8:16]), d.unused_data

(b' sentenc', b'')

In [69]:
d.decompress(data[16:24]), d.unused_data 
# in b'|x', the byte after the '|' separator belongs to the next compression stream; feed it to a new decompressobj

(b'e', b'|x')

In [70]:
d2 = zlib.decompressobj()
d2.decompress(b'x'), d2.unused_data

(b'', b'')

In [71]:
d2.decompress(data[24:32]), d2.unused_data

(b'Second', b'')

In [72]:
d2.decompress(data[32:]), d2.unused_data
# a nonempty unused_data indicates that the second compression stream is complete and intact

(b' sentence', b'|')
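
The manual steps above can be folded into a loop. The sketch below is a variation rather than the same protocol: it assumes the compressed streams are concatenated with no `|` separator and relies on the `eof` and `unused_data` attributes of the decompression object to find each boundary. The function name `split_streams` is hypothetical.

```python
import zlib

def split_streams(chunks):
    """Decompress back-to-back zlib streams that arrive in arbitrary chunks."""
    d = zlib.decompressobj()
    out = b''
    messages = []
    for chunk in chunks:
        buf = chunk
        while buf:
            out += d.decompress(buf)
            if d.eof:                      # current stream is complete
                messages.append(out)
                buf = d.unused_data        # bytes that belong to the next stream
                d = zlib.decompressobj()   # each stream needs a fresh object
                out = b''
            else:
                buf = b''                  # chunk fully consumed, wait for more
    return messages

data = zlib.compress(b'First sentence') + zlib.compress(b'Second sentence')
chunks = [data[i:i + 8] for i in range(0, len(data), 8)]   # simulate 8-byte reads
result = split_streams(chunks)
print(result)
```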

## 5.8 Network exceptions
- the number of socket errors is quite large
- but the number of actual exceptions raised by socket operations is quite small
  - `OSError`: raised by nearly every failure at any stage of network transmission
  - `socket.gaierror`: raised when `getaddrinfo()` fails to find a name or service
  - `socket.timeout`: raised when a timeout expires before an operation could complete normally
  - `socket.herror`: raised by certain old-fashioned address lookup calls
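
In Python 3 the three socket-specific exceptions are subclasses of `OSError`, so a handler for `OSError` catches them all, while more specific handlers have to be tested first. A sketch, in which the `describe` helper is a hypothetical illustration:

```python
import socket

for exc in (socket.gaierror, socket.herror, socket.timeout):
    print(exc.__name__, issubclass(exc, OSError))

def describe(error):
    """Classify an exception, checking the most specific types first."""
    if isinstance(error, socket.gaierror):
        return 'name lookup failed'
    if isinstance(error, socket.timeout):
        return 'timed out'
    if isinstance(error, OSError):
        return 'other network or OS error'
    return 'not a network error'

print(describe(socket.gaierror(-2, 'Name or service not known')))
print(describe(ConnectionRefusedError()))
```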

In [76]:
import socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    s.connect(('nonexistent.website.com', 80))
except socket.gaierror as e: # e.errno => -2; e.strerror => 'Name or service not known'
    raise

gaierror: [Errno -2] Name or service not known

- Higher-level socket-based libraries in the Python Standard Library (PSL) either
  - expose raw socket errors, as `http.client` (*httplib* in Python 2) does, or
  - catch raw socket errors and turn them into their own kind of errors

In [None]:
import http.client
h = http.client.HTTPConnection('nonexistent.website.com')
h.request('GET', '/')

- But `urllib.request` (*urllib2* in Python 2) hides this same error and raises `URLError` to be clean and neutral

In [None]:
import urllib.request
urllib.request.urlopen('http://nonexistent.website.com/')

## 5.9 Catching and reporting network exceptions
- Granular exception handler
  - wrap every network call with a try...except clause
  - suitable for short programs
- Blanket exception handler
  - wrap blocks of code with clear purposes
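
The two styles can be contrasted in a short sketch. Everything named here is hypothetical (`FetchError`, `read_reply`, `exchange`), and `socket.socketpair()` is used so that both failure cases can be triggered locally without a real server:

```python
import socket

class FetchError(Exception):
    """Hypothetical application-level error used by the blanket handler."""

def read_reply(sock, timeout=0.05):
    # Granular handler: one try...except around a single network call
    sock.settimeout(timeout)
    try:
        return sock.recv(1024)
    except socket.timeout:
        return None                      # no reply in time; let the caller decide

def exchange(sock, request):
    # Blanket handler: one try...except around a block with a single purpose
    try:
        sock.sendall(request)
        reply = sock.recv(1024)
    except OSError as e:
        raise FetchError(f'exchange failed: {e}') from e
    return reply

a, b = socket.socketpair()
nothing = read_reply(a)                  # the peer sent nothing, so this times out
print(nothing)

a.close()
b.close()
try:
    exchange(a, b'ping')                 # using a closed socket raises OSError
    failed = False
except FetchError as e:
    failed = True
    print(e)
```

The blanket version reports one application-level error for the whole exchange, while the granular version lets the caller react differently to each individual call.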