# PCAP L3 - Exercise 1 (Answers Notebook)

`Mission 01`: Decode strange characters from mission logs using Unicode code points, ASCII vs UTF-8 encodings, and per-character byte lengths.

**Context:** Your network analysis tool just captured traffic from users all over the world. Some payloads are plain ASCII, others have accented characters, Chinese text, or emojis. Your job is to build tiny helpers that explain what each character really is under the hood.

> 🎯 Your Key Objectives:
> - Use `ord()` and `chr()` to connect characters to their Unicode code points.
> - Practice encoding text with `.encode()` and gracefully handle `UnicodeEncodeError`.
> - Inspect how many bytes per character UTF-8 uses for different characters.
> - Keep your code small, clear, and focused so each helper does one job well.

## Part 1: Code Point Scanner (`ord()` and formatted output)

Context: We'll start with a small helper that explains what a single character looks like in Unicode.

Your Tasks:
- [ ] 1.1 - Write a `describe_char()` function that takes in one character `ch`.
- [ ] 1.2 - Assume the input is a single character string (you don't need extra validation here).
- [ ] 1.3 - Use `ord()` for `ch` inside the function to get the integer code point.
- [ ] 1.4 - Return a formatted string: `"A -> 65 (U+0041)"` for the letter `"A"`.
- [ ] 1.5 - Make sure the `U+` code is padded to at least 4 hex digits (use the format spec `{code_point:04X}` inside an f-string).


In [None]:
# === Part 1: Your code here ===

# 1.1 / 1.2)
def describe_char(ch: str) -> str:
    """
    Return a formatted string describing the character's Unicode code point.

    Example: "A -> 65 (U+0041)"
    """
    # 1.3)
    code_point = ord(ch)
    # 1.4 / 1.5)
    return f"{ch} -> {code_point} (U+{code_point:04X})"
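
The objectives also mention `chr()`, which the helper above does not use. As a minimal sketch (the name `round_trip` is hypothetical and not part of the exercise), the inverse direction and the padding behaviour of `04X` can be checked like this:

```python
def round_trip(ch: str) -> bool:
    """Return True if chr(ord(ch)) gives back the same single character."""
    return chr(ord(ch)) == ch


# ord() and chr() are inverses for any single code point.
samples = ["A", "é", "中", "🚀"]
round_trip_results = [round_trip(ch) for ch in samples]

# 04X pads to at least four hex digits; larger values keep all their digits.
padded_small = f"{ord('A'):04X}"   # '0041'
padded_large = f"{ord('🚀'):04X}"  # '1F680'

print(round_trip_results)
print(padded_small, padded_large)
```

Note that the rocket's code point needs five hex digits, so the width in the format spec is a minimum rather than a fixed length.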


## Part 2: Safe ASCII Encoder (`.encode()` with error handling)

Context: ASCII is tiny and fast... but it only handles basic English characters (code points 0-127). Let's build a function that tries to encode text as ASCII and fails gracefully when the text contains anything outside that range.

Your Tasks:
- [ ] 2.1 - Write a `safe_ascii_encode()` function that takes in a string `text`.
- [ ] 2.2 - Inside a `try:` block, call `text.encode("ascii")` and store the result in `encoded`.
- [ ] 2.3 - Catch `UnicodeEncodeError` as `e` if the text contains non-ASCII characters.
- [ ] 2.4 - On failure, print a friendly message and return `None`.
- [ ] 2.5 - On success, print `"Encoded OK: ..."`, then return the `encoded` bytes object.


In [None]:
# === Part 2: Your code here ===

# 2.1)
def safe_ascii_encode(text: str):
    """
    Try to encode text using ASCII.

    On success:
      - Print a success message
      - Return the encoded bytes

    On failure (UnicodeEncodeError):
      - Print a message describing the problem
      - Return None
    """
    try:
        # 2.2)
        encoded = text.encode("ascii")
    # 2.3)
    except UnicodeEncodeError as e:
        # 2.4)
        print(f"[safe_ascii_encode] Cannot encode text as ASCII: {e}")
        return None
    else:
        # 2.5)
        print(f"[safe_ascii_encode] Encoded OK: {encoded}")
        return encoded
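
Catching the exception is one option; `str.encode()` also accepts an `errors` argument that substitutes or drops unencodable characters instead of raising. The sketch below is an illustration rather than part of the exercise, and the helper name `explain_ascii_failure` is hypothetical. It also shows that a caught `UnicodeEncodeError` carries `object`, `start`, and `end` attributes that locate the offending slice.

```python
from typing import Optional


def explain_ascii_failure(text: str) -> Optional[str]:
    """Return the first slice of text that ASCII cannot encode, or None."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError as e:
        # e.object is the original string; e.start and e.end bound the bad slice.
        return e.object[e.start:e.end]
    return None


text = "Café"
print(explain_ascii_failure(text))                       # é
print(text.encode("ascii", errors="replace"))            # b'Caf?'
print(text.encode("ascii", errors="ignore"))             # b'Caf'
print(text.encode("ascii", errors="backslashreplace"))   # b'Caf\\xe9'
```

Which handler fits depends on the use case: `"replace"` and `"ignore"` lose information, while `"backslashreplace"` keeps the code point visible in the output.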


## Part 3: UTF-8 Byte Length Analyzer

UTF-8 uses:
- 1 byte for basic ASCII characters,
- 2 to 4 bytes for everything else, such as accented letters, non-Latin scripts, and emojis.
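
The byte count follows from the numeric range of the code point. A minimal sketch of that mapping (the function name `expected_utf8_length` is hypothetical), compared against what `.encode("utf-8")` actually produces:

```python
def expected_utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 needs for a given Unicode code point."""
    if code_point <= 0x7F:
        return 1  # ASCII range
    if code_point <= 0x7FF:
        return 2  # e.g. accented Latin, Greek, Cyrillic
    if code_point <= 0xFFFF:
        return 3  # rest of the Basic Multilingual Plane, e.g. most CJK
    return 4      # supplementary planes, e.g. most emoji


for ch in "Aé中🚀":
    actual = len(ch.encode("utf-8"))
    print(ch, expected_utf8_length(ord(ch)), actual)
```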

Context: Let's build a helper that shows how many bytes each character uses.

Your Tasks:
- [ ] 3.1 - Write a `utf8_byte_lengths()` function that takes in a string `text`.
- [ ] 3.2 - Start with an empty list called `result`.
  - It will store pairs like `("A", 1)` or `("💻", 4)`, so its type hint should describe "a list of (str, int) tuples".
  - Here's a template to get started: `result: list[tuple[_______]] = _______`
- [ ] 3.3 - Loop over each character `ch` in `text`.
- [ ] 3.4 - For each `ch`, compute `utf8_bytes = ch.encode("utf-8")`.
- [ ] 3.5 - Append a `(ch, len(utf8_bytes))` tuple to `result`.
- [ ] 3.6 - Return the `result` list at the end.


In [None]:
# === Part 3: Your code here ===

# 3.1)
def utf8_byte_lengths(text: str):
    """
    Build a list of (character, utf8_length) pairs for the given text.
    """
    # 3.2)
    result: list[tuple[str, int]] = []
    # 3.3)
    for ch in text:
        # 3.4)
        utf8_bytes = ch.encode("utf-8")
        # 3.5)
        result.append((ch, len(utf8_bytes)))
    # 3.6)
    return result
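
When inspecting captured payloads it often helps to see the bytes themselves, not only their count. As a sketch assuming Python 3.9 or later (for the built-in generic type hints and for `bytes.hex()` with a separator), a hypothetical companion helper `utf8_hex_bytes()` could look like this:

```python
def utf8_hex_bytes(text: str) -> list[tuple[str, str]]:
    """Build a list of (character, hex-encoded UTF-8 bytes) pairs."""
    return [(ch, ch.encode("utf-8").hex(" ")) for ch in text]


print(utf8_hex_bytes("A世🌍"))
# [('A', '41'), ('世', 'e4 b8 96'), ('🌍', 'f0 9f 8c 8d')]
```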


## Part 4: Mission Run - Exercise All Paths

Context: Once you've implemented Parts 1-3, run the cells below without modification. They'll help you confirm that your helpers behave correctly for different kinds of text.


In [None]:
# === Tests for Part 1: describe_char() ===

print("describe_char('A')   ->", describe_char("A"))  # Expect: A -> 65 (U+0041)
print("describe_char('é')   ->", describe_char("é"))  # Expect: é -> 233 (U+00E9)
print("describe_char('中')  ->", describe_char("中"))  # Expect: 中 -> 20013 (U+4E2D)
print("describe_char('🚀') ->", describe_char("🚀"))  # Expect: 🚀 -> 128640 (U+1F680)

describe_char('A')   -> A -> 65 (U+0041)
describe_char('é')   -> é -> 233 (U+00E9)
describe_char('中')  -> 中 -> 20013 (U+4E2D)
describe_char('🚀') -> 🚀 -> 128640 (U+1F680)


In [None]:
# === Tests for Part 2: safe_ascii_encode() ===

# Expect: encodes cleanly and returns bytes
print("safe_ascii_encode('Hello') ->", safe_ascii_encode("Hello"))

print()  # spacer

# Expect: UnicodeEncodeError handled, prints friendly message, returns None
print("safe_ascii_encode('Café') ->", safe_ascii_encode("Café"))

print()  # spacer

# Expect: UnicodeEncodeError handled for emoji, returns None
print("safe_ascii_encode('Hi 🌍') ->", safe_ascii_encode("Hi 🌍"))

[safe_ascii_encode] Encoded OK: b'Hello'
safe_ascii_encode('Hello') -> b'Hello'

[safe_ascii_encode] Cannot encode text as ASCII: 'ascii' codec can't encode character '\xe9' in position 3: ordinal not in range(128)
safe_ascii_encode('Café') -> None

[safe_ascii_encode] Cannot encode text as ASCII: 'ascii' codec can't encode character '\U0001f30d' in position 3: ordinal not in range(128)
safe_ascii_encode('Hi 🌍') -> None


In [None]:
# === Tests for Part 3: utf8_byte_lengths() ===

sample = "Hi 世🌍"

print(f"Text: {sample!r}")
print("utf8_byte_lengths(sample) ->")
lengths = utf8_byte_lengths(sample)
print(lengths)

# Expect (conceptually):
# 'H' -> 1 byte
# 'i' -> 1 byte
# ' ' -> 1 byte
# '世' -> 3 bytes
# '🌍' -> 4 bytes

Text: 'Hi 世🌍'
utf8_byte_lengths(sample) ->
[('H', 1), ('i', 1), (' ', 1), ('世', 3), ('🌍', 4)]


## Part 5: Debrief (short answers)

### Q1 of 2: Why is `UTF-8` backward-compatible with ASCII, and why does that matter in real applications?

*Your answers here:*

UTF-8 encodes the first 128 code points with the same single-byte values as ASCII, which means any valid ASCII text is already valid UTF-8 with no changes.

This matters because existing ASCII-only files, network protocols, and applications keep working unchanged when a system adopts UTF-8: their data needs no conversion. We get global character support while everything that already worked in ASCII keeps working exactly the same. So it's a win-win!
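
A quick check of the compatibility claim, as an illustrative sketch (the sample request line is made up): ASCII-only text produces identical bytes under both codecs, and bytes written as ASCII decode cleanly as UTF-8. The reverse direction fails once non-ASCII characters appear.

```python
ascii_text = "GET /index.html HTTP/1.1"

as_ascii = ascii_text.encode("ascii")
as_utf8 = ascii_text.encode("utf-8")

# Same bytes either way, so an ASCII-era payload is already valid UTF-8.
print(as_ascii == as_utf8)
print(as_ascii.decode("utf-8") == ascii_text)

# UTF-8 bytes for non-ASCII text cannot be decoded as ASCII.
utf8_only = "Café".encode("utf-8")  # b'Caf\xc3\xa9'
try:
    utf8_only.decode("ascii")
    decodes_as_ascii = True
except UnicodeDecodeError:
    decodes_as_ascii = False
print(decodes_as_ascii)
```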


_______

### Q2 of 2: In your own words, what's the difference between a character and a byte?

A character is the symbol we see in digital interfaces all the time (like `A`, `é`, `中`, or `🚀`). A byte, on the other hand, is just 8 bits of raw data: the unit the computer stores and transmits, and something we rarely see directly.

An encoding (like UTF-8) defines how each character is represented as one or more bytes.  

Simple ASCII characters use 1 byte, but many Unicode characters (especially emojis and non-Latin scripts) need multiple bytes to represent a single character.
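
The distinction shows up directly in `len()` and indexing: on a `str` they work in characters (strictly, code points), on `bytes` they work in bytes. A small illustration using the sample text from the Part 3 tests:

```python
text = "Hi 世🌍"
data = text.encode("utf-8")

print(len(text))  # 5 characters
print(len(data))  # 10 bytes: 1 + 1 + 1 + 3 + 4

# Indexing follows the same split: str yields characters, bytes yields integers.
print(text[3])    # 世
print(data[3])    # 228, i.e. 0xE4, the first byte of 世
```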

_______