# 01. 🧵 Strings, Bytes, and Character Encodings

Computers do not understand letters; they only understand numbers (bits and bytes). 
To display text, we need a **Mapping** (Encoding) between numbers and characters.

**Key Topics Covered:**
* **ASCII:** The original 1960s standard (English only).
* **Unicode:** The modern standard (All languages + Emojis).
* **Python Types:** `str` (Unicode) vs. `bytes` (Raw Data).
* **The Bridge:** `encode()` and `decode()`.

## 1.1 🏛️ ASCII (The Legacy Standard)

In the 1960s and 70s, memory was expensive. The **American Standard Code for Information Interchange (ASCII)** used just **7 bits** to represent 128 characters.

-   **0-31:** Control codes (e.g., `\n` Newline, `\r` Carriage Return).
-   **32-126:** Printable characters (A-Z, a-z, 0-9, punctuation).
-   **127:** Delete.

This was efficient but could only represent English.

In [1]:
# Inspecting ASCII values using ord() (Ordinal)
print(f"'A' = {ord('A')}")
print(f"'a' = {ord('a')}")
print(f"'0' = {ord('0')}")
newline = '\n'  # f-string expressions cannot contain a backslash before Python 3.12
print(f"Newline (\\n) = {ord(newline)}")

# Converting back using chr() (Character)
print(f"65 = {chr(65)}")
print(f"97 = {chr(97)}")

'A' = 65
'a' = 97
'0' = 48
Newline (\n) = 10
65 = A
97 = a
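
To make the "English only" limit concrete, here is a minimal sketch that probes the 7-bit boundary. The sample strings and the names `ascii_flags` and `error_reason` are illustrative and not part of the original notebook; `str.isascii()` assumes Python 3.7 or newer.

```python
# Sketch: probing the 7-bit ASCII limit (sample strings are illustrative).
samples = ["Hello", "Café", "naïve"]

# str.isascii() is True only when every code point is below 128
ascii_flags = {s: s.isascii() for s in samples}
for s, flag in ascii_flags.items():
    print(f"{s!r}: isascii={flag}")

error_reason = None
try:
    "Café".encode("ascii")
except UnicodeEncodeError as exc:
    # 'é' is code point 233, which is outside the 0-127 range
    error_reason = exc.reason
    print(f"Cannot encode: {error_reason}")
```

Any character outside the 128 slots simply has no ASCII number, which is why the strict `ascii` codec raises an error instead of guessing.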


## 1.2 🌍 Unicode (The Universal Standard)

To support Japanese, Arabic, Emoji, etc., we needed more than 128 slots. 
**Unicode** assigns a unique number (a **Code Point**) to each character it covers, spanning the world's major writing systems as well as symbols and emoji.

-   **UTF-8:** The most popular encoding for Unicode. It is **variable length**:
    -   Standard ASCII characters use **1 byte** (compatible with old systems).
    -   Non-ASCII characters (like 'ñ' or '🚀') use **2-4 bytes**.
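
The variable-length behaviour described above can be checked directly. This is a small sketch with hand-picked sample characters (an assumption for illustration); `widths` is a hypothetical helper name.

```python
# Sketch: UTF-8 width grows with the code point (sample characters are illustrative).
samples = ["A", "ñ", "€", "🚀"]

widths = {}
for ch in samples:
    encoded = ch.encode("utf-8")
    widths[ch] = len(encoded)
    # Show the character, its code point in U+XXXX form, and its UTF-8 bytes in hex
    print(f"{ch!r}  U+{ord(ch):04X}  {len(encoded)} byte(s)  {encoded.hex()}")
```

The code point is the abstract number Unicode assigns; the byte count is a property of the encoding chosen to store it.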

In [2]:
# Python 3 strings are Unicode by default
text = "Hello 🚀"

print(f"String: {text}")
print(f"Length (Characters): {len(text)}")

# Notice: The Rocket emoji is just 1 character to Python's high-level str type.

String: Hello 🚀
Length (Characters): 7


## 1.3 🌉 The Bridge: Encode & Decode

When sending data over a network (Sockets) or saving to a file, we cannot send abstract "Unicode Characters". We must send physical **Bytes**.

![Encode Decode Diagram](https://www.w3.org/International/questions/qa-what-is-encoding.en.png) 
*(Conceptual visualization)*

-   **Encode:** String $\rightarrow$ Bytes (Outgoing).
-   **Decode:** Bytes $\rightarrow$ String (Incoming).

In [None]:
text_u = "Café"

# 1. Encode to Bytes (UTF-8)
data_bytes = text_u.encode('utf-8')

print(f"Original: {text_u} (type: {type(text_u)})")
print(f"str length: {len(text_u)}")
print(f"Encoded:  {data_bytes} (type: {type(data_bytes)})")
print(f"Bytes Length: {len(data_bytes)}")
# Notice: Length is 5, not 4! The 'é' took 2 bytes.

# 2. Decode back to String
decoded_text = data_bytes.decode('utf-8')
print(f"Decoded:  {decoded_text}")

print(f"text_u == decoded_text: {text_u == decoded_text}")
print(f"text_u is decoded_text: {text_u is decoded_text}")
# Equal in value, but decode() created a separate object.

Original: Café (type: <class 'str'>)
str length: 4
Encoded:  b'Caf\xc3\xa9' (type: <class 'bytes'>)
Bytes Length: 5
Decoded:  Café
text_u == decoded_text: True
text_u is decoded_text: False
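
The bridge only works when both sides agree on the encoding. The sketch below shows what might happen otherwise: the choice of `latin-1` as the "wrong" codec and the names `garbled` and `lossy` are assumptions made for illustration.

```python
# Sketch: decoding with a codec that does not match the one used to encode.
data = "Café".encode("utf-8")  # b'Caf\xc3\xa9'

# Latin-1 maps every byte to exactly one character, so the two bytes
# of 'é' come back as two unrelated characters instead of an error.
garbled = data.decode("latin-1")
print(garbled)

# Strict ASCII decoding rejects any byte >= 128 ...
try:
    data.decode("ascii")
except UnicodeDecodeError as exc:
    print(f"Decode failed: {exc.reason}")

# ... unless an error handler is supplied; "replace" substitutes
# U+FFFD (the replacement character) for each undecodable byte.
lossy = data.decode("ascii", errors="replace")
print(lossy)
```

A silent mismatch like the `latin-1` case is often harder to spot than an exception, because the program keeps running with corrupted text.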


### ⚠️ The Network Trap
Sockets (Notebook 04) only accept Bytes. If you try to send a String directly, Python raises a `TypeError`.

```python
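# `sock` is assumed to be an already-connected socket.socket object.
# sock.send("Hello")                    # TypeError: a bytes-like object is required, not 'str'
sock.sendall("Hello".encode("utf-8"))   # Correct: encode to bytes first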
```
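
For a version that runs without a server, the following sketch uses `socket.socketpair()`, which returns two sockets that are already connected to each other. The variable names (`left`, `right`, `received`, `message`) are illustrative, and the single `recv(1024)` call assumes the short message arrives in one piece, which is typical for a local pair but not guaranteed for real network traffic.

```python
import socket

# socketpair() returns two already-connected sockets, so no server is needed.
left, right = socket.socketpair()
try:
    try:
        left.sendall("Hello")  # a str is rejected
    except TypeError as exc:
        print(f"TypeError: {exc}")

    left.sendall("Hello".encode("utf-8"))  # bytes are accepted
    received = right.recv(1024)            # recv() also returns bytes
    message = received.decode("utf-8")     # decode on the way back in
    print(received, message)
finally:
    left.close()
    right.close()
```

The pattern is symmetric: encode just before sending, decode just after receiving.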

---

## Mini-Challenge: The Decoder Ring

**Task:**
You have intercepted a raw byte sequence from a server.
`secret_bytes = [72, 101, 108, 108, 111, 33]`

1.  Use a loop and `chr()` to decode these numbers manually.
2.  Join them into a single string.

In [7]:
secret_bytes = [72, 101, 108, 108, 111, 33]

# Write your decoder loop here


In [8]:
# Solution
decoded_chars = [chr(b) for b in secret_bytes]
message = "".join(decoded_chars)
print(f"Secret Message: {message}")

Secret Message: Hello!
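
As an alternative to the manual loop, the same list can be turned into a `bytes` object and decoded in one step. This is a sketch rather than the intended solution; it assumes every value is in the ASCII range, which holds for this particular list, and `raw` is an illustrative name.

```python
secret_bytes = [72, 101, 108, 108, 111, 33]

# bytes() accepts an iterable of integers in range(256)
raw = bytes(secret_bytes)

# decode() then applies the character mapping to the whole sequence at once
message = raw.decode("ascii")
print(raw, message)
```

This is the same mapping `chr()` performs per number, expressed through the `bytes`/`str` bridge from section 1.3.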


---

## 🌟 Core Insight for Your CSE Career

### Python 2 vs Python 3
In Python 2, `str` was a byte string and text lived in a separate `unicode` type. The two mixed implicitly, which caused frequent bugs where people combined text and binary data.
In Python 3, `str` is strictly Unicode, and `bytes` is strictly binary. This separation forces you to be explicit about Encoding, which helps prevent data corruption in databases and web requests.
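
The strictness is easy to observe. In this minimal sketch the values `text` and `payload` are made up for illustration:

```python
text = "price: "
payload = b"42"

mixed_ok = True
try:
    combined = text + payload  # Python 3 refuses to guess an encoding
except TypeError as exc:
    mixed_ok = False
    print(f"TypeError: {exc}")

# Being explicit about the encoding makes the operation legal
combined = text + payload.decode("ascii")
print(combined)

# The two types never compare equal, even when they look alike
same = ("42" == b"42")
print(same)
```

The error appears at the exact line where text and binary data meet, rather than later as corrupted output.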