# Unit 2

## Diff Parser: Breaking Down Code Changes for Review


# Welcome Back: Parsing Unified Diffs for AI Code Review

Welcome back! In the previous lesson, you learned how to set up the **OpenAI client** for code review tasks. Now that you can connect to the OpenAI API, the next step is to prepare the code changes you want to review.

When developers make changes to code, these changes are often shared as "**diffs**." A **diff** shows what was added, removed, or changed in a file. For an AI code review assistant to be helpful, it needs to understand these diffs. That’s why **parsing code diffs** is so important: it lets us break down code changes into a format that both humans and AI can understand and analyze.

In this lesson, you will learn how to parse a **unified diff** into structured data using **Python**. This is a key step before sending code changes to an AI for review.

-----

## Recall: What Does a Unified Diff Look Like?

Let’s quickly remind ourselves what a **unified diff** is. You may have seen diffs before when using tools like **Git**. A unified diff is a text format that shows the differences between two versions of a file.

Here’s a small example:

```diff
diff --git a/example.py b/example.py
index 83db48f..f735c60 100644
--- a/example.py
+++ b/example.py
@@ -1,4 +1,5 @@
 import os
+import sys
 
 def foo():
-    print("Hello")
+    print("Hello, world!")
```

  * Lines starting with `---` and `+++` show the old and new file names.
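  * The `@@` line is called a **"hunk header"** and shows which lines are changing: `-1,4` covers 4 lines starting at line 1 of the old file, and `+1,5` covers 5 lines starting at line 1 of the new file.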
  * Lines starting with `+` are **additions**.
  * Lines starting with `-` are **removals**.
  * Lines starting with a space are **unchanged (context)**.

This format is what we will parse in this lesson.
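The prefix rules above can be sketched as a small classifier. This is only an illustration: `classify_line` is a hypothetical helper, not part of the lesson's parser, and it assumes file headers appear outside hunks (a removed line whose content begins with `-- ` would otherwise look like a `---` header).

```python
def classify_line(line: str) -> str:
    """Label one line of a unified diff by its prefix (hypothetical helper)."""
    # File headers must be tested before the single-character prefixes,
    # because '---' and '+++' also start with '-' and '+'.
    if line.startswith('--- ') or line.startswith('+++ '):
        return 'file_header'
    if line.startswith('@@'):
        return 'hunk_header'
    if line.startswith('+'):
        return 'added'
    if line.startswith('-'):
        return 'removed'
    if line.startswith(' '):
        return 'context'
    return 'other'  # e.g. 'diff --git ...' or 'index ...' metadata


sample = [
    'diff --git a/example.py b/example.py',
    '--- a/example.py',
    '+++ b/example.py',
    '@@ -1,4 +1,5 @@',
    ' import os',
    '+import sys',
    '-    print("Hello")',
]
labels = [classify_line(line) for line in sample]
for line, label in zip(sample, labels):
    print(f'{label:12} | {line}')
```

The order of the checks matters: testing `+` before `+++` would mislabel the file header as an added line.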

-----

## Understanding Diff Components In Python

To work with diffs in **Python**, it helps to break them down into smaller parts. We will use **data classes** to represent these parts. Data classes are a simple way to group related data together.

Let’s look at the three main components we’ll use:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DiffLine:
    line_number: int
    content: str
    change_type: str  # 'added', 'removed', 'context'

@dataclass
class DiffHunk:
    old_start: int
    new_start: int
    lines: List['DiffLine']

@dataclass
class FileDiff:
    file_path: str
    hunks: List['DiffHunk']
    is_new: bool
    is_deleted: bool
```

  * **`DiffLine`** represents a single line in the diff. It stores the line number in the new file (`0` for removed lines, which have no position there), the content, and the type of change (`added`, `removed`, or `context`).
  * **`DiffHunk`** represents a group of changes (a "hunk") in the file. It stores where the changes start in the old and new files, and a list of `DiffLine` objects.
  * **`FileDiff`** represents all the changes for a single file. It stores the file path, a list of hunks, and flags for whether the file is new or deleted.

For example, if a diff adds a line to a file, we would create a `DiffLine` with `change_type='added'` and include it in a `DiffHunk`, which is then part of a `FileDiff`.
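As a minimal sketch of that nesting, the snippet below builds the three objects by hand for a single added line. The data classes are repeated so the snippet runs on its own, and the values (`example.py`, line 2) are made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DiffLine:
    line_number: int
    content: str
    change_type: str  # 'added', 'removed', 'context'


@dataclass
class DiffHunk:
    old_start: int
    new_start: int
    lines: List[DiffLine]


@dataclass
class FileDiff:
    file_path: str
    hunks: List[DiffHunk]
    is_new: bool
    is_deleted: bool


# One added line, wrapped in a hunk, wrapped in a file-level diff.
added = DiffLine(line_number=2, content='import sys', change_type='added')
hunk = DiffHunk(old_start=1, new_start=1, lines=[added])
file_diff = FileDiff(file_path='example.py', hunks=[hunk], is_new=False, is_deleted=False)

# Navigate from the file down to the individual line.
first_line = file_diff.hunks[0].lines[0]
print(first_line.change_type, first_line.content)
```

Because `@dataclass` generates `__eq__` and `__repr__`, these objects can be compared and printed directly, which is handy when checking parser output.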

-----

## Step-By-Step: Parsing A Unified Diff

Now, let’s build the parser step by step. We want to turn a unified diff text into a **`FileDiff`** object.

### 1. Extracting the File Path and Status

First, we need to find the file path and check if the file is new or deleted.

```python
import re

def parse_unified_diff(diff_text: str):
    lines = diff_text.split('\n')
        
    # Extract file path
    file_path = "unknown"
    for line in lines:
        if line.startswith('+++'):
            file_path = line[4:].split('\t')[0]
            if file_path.startswith('b/'):
                file_path = file_path[2:]
            break
            
    # Check file status
    is_new = any('new file mode' in line for line in lines)
    is_deleted = any('deleted file mode' in line for line in lines)
```

  * We split the diff into lines.
  * We look for the line starting with `+++` to get the new file path.
  * We check for lines that mention `new file mode` or `deleted file mode` to see if the file is new or deleted.
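One caveat with the substring check: a content line that happens to contain the text `new file mode` would also set the flag. A possible refinement, sketched here under the assumption that Git writes these markers at the start of a header line before the first hunk, is to scan only the header region. `detect_file_status` is a hypothetical helper name, not part of the lesson's code.

```python
from typing import List, Tuple


def detect_file_status(lines: List[str]) -> Tuple[bool, bool]:
    """Return (is_new, is_deleted), looking only at lines before the first hunk."""
    is_new = False
    is_deleted = False
    for line in lines:
        if line.startswith('@@'):
            break  # hunk content starts here, so stop scanning
        if line.startswith('new file mode'):
            is_new = True
        elif line.startswith('deleted file mode'):
            is_deleted = True
    return is_new, is_deleted


diff_text = """diff --git a/notes.txt b/notes.txt
index 83db48f..f735c60 100644
--- a/notes.txt
+++ b/notes.txt
@@ -1 +1,2 @@
 release notes
+git prints 'new file mode' when a file is created"""

status = detect_file_status(diff_text.split('\n'))
print(status)  # (False, False)
```

Here the added line mentions `new file mode`, but the file itself is only modified, so both flags stay `False`.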

### 2. Identifying Hunks and Parsing Lines

Next, we need to find each hunk and parse the lines inside.

```python
    hunks = []
    i = 0
    while i < len(lines):
        line = lines[i]

        # Find hunk header
        hunk_match = re.match(r'@@ -(\d+),?\d* \+(\d+),?\d* @@', line)
        if hunk_match:
            old_start = int(hunk_match.group(1))
            new_start = int(hunk_match.group(2))
            
            # Parse hunk lines
            diff_lines = []
            i += 1
            current_line = new_start

            while i < len(lines) and not lines[i].startswith('@@'):
                content = lines[i]
                if content.startswith('+'):
                    diff_lines.append(DiffLine(current_line, content[1:], 'added'))
                    current_line += 1
                elif content.startswith('-'):
                    diff_lines.append(DiffLine(0, content[1:], 'removed'))
                elif content.startswith(' '):
                    diff_lines.append(DiffLine(current_line, content[1:], 'context'))
                    current_line += 1
                elif content == '':
                    # End of diff or empty line, skip
                    pass
                i += 1
            
            hunks.append(DiffHunk(old_start, new_start, diff_lines))
        else:
            i += 1
```

  * We loop through the lines, looking for **hunk headers** (lines starting with `@@`).
  * When we find a hunk, we record where it starts in the old and new files.
  * For each line in the hunk:
      * If it starts with `+`, it’s an **added** line.
      * If it starts with `-`, it’s a **removed** line.
      * If it starts with a space, it’s a **context** (unchanged) line.
  * We create `DiffLine` objects for each line and add them to the hunk.
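The regular expression in the parser keeps only the two start numbers. If the line counts are also useful, a variant like the one below could capture them too. This is a sketch rather than the lesson's implementation: `parse_hunk_header` and `HUNK_HEADER` are hypothetical names, and it relies on the unified diff convention that an omitted count means one line.

```python
import re
from typing import Optional, Tuple

# Start numbers are required; the ',count' parts are optional.
HUNK_HEADER = re.compile(r'^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@')


def parse_hunk_header(line: str) -> Optional[Tuple[int, int, int, int]]:
    """Return (old_start, old_count, new_start, new_count), or None if no match."""
    match = HUNK_HEADER.match(line)
    if match is None:
        return None
    old_start, old_count, new_start, new_count = match.groups()
    # An omitted count means the range covers exactly one line.
    return (
        int(old_start),
        int(old_count) if old_count is not None else 1,
        int(new_start),
        int(new_count) if new_count is not None else 1,
    )


print(parse_hunk_header('@@ -1,4 +1,5 @@'))           # (1, 4, 1, 5)
print(parse_hunk_header('@@ -10 +12 @@ def foo():'))  # (10, 1, 12, 1)
print(parse_hunk_header('+import sys'))               # None
```

Text after the closing `@@` (such as `def foo():`) is extra context that Git may append to the header; the pattern simply ignores it.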

### 3. Putting It All Together

Finally, we return a `FileDiff` object with all the parsed information.

```python
    return FileDiff(file_path, hunks, is_new, is_deleted)
```

Here’s how you might use this function with a sample diff:

```python
sample_diff = """diff --git a/example.py b/example.py
index 83db48f..f735c60 100644
--- a/example.py
+++ b/example.py
@@ -1,4 +1,5 @@
 import os
+import sys
 
 def foo():
-    print("Hello")
+    print("Hello, world!")"""

file_diff = parse_unified_diff(sample_diff)
print(file_diff)
```

**Sample Output:**

```text
FileDiff(file_path='example.py', hunks=[DiffHunk(old_start=1, new_start=1, lines=[DiffLine(line_number=1, content='import os', change_type='context'), DiffLine(line_number=2, content='import sys', change_type='added'), DiffLine(line_number=3, content='', change_type='context'), DiffLine(line_number=4, content='def foo():', change_type='context'), DiffLine(line_number=0, content='    print("Hello")', change_type='removed'), DiffLine(line_number=5, content='    print("Hello, world!")', change_type='added')])], is_new=False, is_deleted=False)
```

This output shows the file path, the hunks, and each line’s type and content.
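Once the diff is structured, it becomes easy to pull out just the parts a reviewer cares about. The sketch below collects the added lines from a `FileDiff`; `added_lines` is a hypothetical helper, and the `FileDiff` is built by hand (mirroring part of the sample output) so the snippet runs without the parser.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DiffLine:
    line_number: int
    content: str
    change_type: str  # 'added', 'removed', 'context'


@dataclass
class DiffHunk:
    old_start: int
    new_start: int
    lines: List[DiffLine]


@dataclass
class FileDiff:
    file_path: str
    hunks: List[DiffHunk]
    is_new: bool
    is_deleted: bool


def added_lines(file_diff: FileDiff) -> List[DiffLine]:
    """Collect every added line across all hunks, in order."""
    return [
        line
        for hunk in file_diff.hunks
        for line in hunk.lines
        if line.change_type == 'added'
    ]


file_diff = FileDiff(
    file_path='example.py',
    hunks=[DiffHunk(1, 1, [
        DiffLine(1, 'import os', 'context'),
        DiffLine(2, 'import sys', 'added'),
        DiffLine(0, '    print("Hello")', 'removed'),
        DiffLine(5, '    print("Hello, world!")', 'added'),
    ])],
    is_new=False,
    is_deleted=False,
)

for line in added_lines(file_diff):
    print(f'{file_diff.file_path}:{line.line_number}: {line.content}')
```

A `path:line: content` format like this is one plausible way to reference specific lines when the changes are later described to a reviewer.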

-----

## Summary and What’s Next

In this lesson, you learned how to break down a **unified diff** into structured data using **Python**. You saw how to:

  * Identify the file path and file status from a diff.
  * Find and parse each hunk and its lines.
  * Represent the diff using simple data classes.

This is a key step in building an AI code review assistant, as it allows you to analyze and process code changes before sending them to the AI.

In the next set of exercises, you’ll get hands-on practice parsing diffs and working with the parsed data. This will help you become comfortable with handling code changes in real-world scenarios. Good luck!

## Processing Diff Lines for Code Review

Now that you understand the structure of a unified diff, let's focus on one of its core components: the individual diff lines. Each line in a diff has a special prefix character that indicates whether content was added, removed, or left unchanged.

In this exercise, you'll complete the `parse_diff_lines` function, which processes raw diff lines and converts them into structured `DiffLine` objects. You need to:

  * Identify each line's type based on its prefix (`+`, `-`, or space).
  * Extract the actual content by removing the prefix.
  * Assign proper line numbers (remember that removed lines don't exist in the new file).
  * Create appropriate `DiffLine` objects with the correct attributes.

This function is a fundamental building block for our diff parser. By breaking down diffs into structured data, you're creating the foundation that will allow an AI to analyze code changes effectively.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class DiffLine:
    line_number: int
    content: str
    change_type: str  # 'added', 'removed', 'context'

@dataclass
class DiffHunk:
    old_start: int
    new_start: int
    lines: List['DiffLine']

@dataclass
class FileDiff:
    file_path: str
    hunks: List['DiffHunk']
    is_new: bool
    is_deleted: bool

def parse_diff_lines(raw_lines: List[str], start_line: int = 1) -> List[DiffLine]:
    """
    Parse a list of raw diff lines into DiffLine objects.
    
    Args:
        raw_lines: List of strings representing diff lines (with +, -, or space prefixes)
        start_line: The starting line number for the first line
        
    Returns:
        List of DiffLine objects representing the parsed diff lines
    """
    result = []
    current_line = start_line
    
    for line in raw_lines:
        if not line:  # Skip empty lines
            continue
            
        # TODO: Check if the line starts with '+' and create a DiffLine for an added line
        # Remember to increment current_line after adding a line
        
        # TODO: Check if the line starts with '-' and create a DiffLine for a removed line
        # Removed lines should have line_number=0 as they don't exist in the new file
        
        # TODO: Check if the line starts with a space and create a DiffLine for a context line
        # Remember to increment current_line after a context line
        
    return result

def parse_unified_diff(diff_text: str) -> FileDiff:
    """Parse a unified diff format"""
    lines = diff_text.split('\n')
    
    # Extract file path
    file_path = "unknown"
    for line in lines:
        if line.startswith('+++'):
            file_path = line[4:].split('\t')[0]
            if file_path.startswith('b/'):
                file_path = file_path[2:]
            break
    
    # Check file status
    is_new = any('new file mode' in line for line in lines)
    is_deleted = any('deleted file mode' in line for line in lines)
    
    # Parse hunks
    hunks = []
    i = 0
    while i < len(lines):
        line = lines[i]
        
        # Find hunk header
        hunk_match = re.match(r'@@ -(\d+),?\d* \+(\d+),?\d* @@', line)
        if hunk_match:
            old_start = int(hunk_match.group(1))
            new_start = int(hunk_match.group(2))
            
            # Parse hunk lines
            diff_lines = []
            i += 1
            current_line = new_start
            
            while i < len(lines) and not lines[i].startswith('@@'):
                content = lines[i]
                if content.startswith('+'):
                    diff_lines.append(DiffLine(current_line, content[1:], 'added'))
                    current_line += 1
                elif content.startswith('-'):
                    diff_lines.append(DiffLine(0, content[1:], 'removed'))
                elif content.startswith(' '):
                    diff_lines.append(DiffLine(current_line, content[1:], 'context'))
                    current_line += 1
                elif content == '':
                    # End of diff or empty line, skip
                    pass
                i += 1
            
            hunks.append(DiffHunk(old_start, new_start, diff_lines))
        else:
            i += 1
    
    return FileDiff(file_path, hunks, is_new, is_deleted)

# Example usage for demonstration:
if __name__ == "__main__":
    # Test the parse_diff_lines function
    test_lines = [
        "+import sys",
        "-print('old code')",
        " def hello():",
        "+    print('Hello, world!')",
        "-    print('Hello')"
    ]
    
    diff_lines = parse_diff_lines(test_lines)
    for line in diff_lines:
        print(f"Line {line.line_number}: {line.change_type} - {line.content}")
    
    # Example of full diff parsing
    sample_diff = """diff --git a/example.py b/example.py
index 83db48f..f735c60 100644
--- a/example.py
+++ b/example.py
@@ -1,4 +1,5 @@
 import os
+import sys
 
 def foo():
-    print("Hello")
+    print("Hello, world!")
"""
    file_diff = parse_unified_diff(sample_diff)
    print(file_diff)
```

Converting the raw text lines into structured objects is a critical step in preparing the data for AI analysis.

Here is the completed `parse_diff_lines` function, including the logic for identifying the line type, extracting content, and assigning the correct line numbers.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class DiffLine:
    line_number: int
    content: str
    change_type: str  # 'added', 'removed', 'context'

@dataclass
class DiffHunk:
    old_start: int
    new_start: int
    lines: List['DiffLine']

@dataclass
class FileDiff:
    file_path: str
    hunks: List['DiffHunk']
    is_new: bool
    is_deleted: bool

def parse_diff_lines(raw_lines: List[str], start_line: int = 1) -> List[DiffLine]:
    """
    Parse a list of raw diff lines into DiffLine objects.
    
    Args:
        raw_lines: List of strings representing diff lines (with +, -, or space prefixes)
        start_line: The starting line number for the first line in the new file (b/ file)
        
    Returns:
        List of DiffLine objects representing the parsed diff lines
    """
    result = []
    current_line = start_line
    
    for line in raw_lines:
        if not line:  # Skip empty lines
            continue
            
        prefix = line[0]
        content = line[1:]
        
        if prefix == '+':
            # Added line: uses the current line number and then increments it
            result.append(DiffLine(current_line, content, 'added'))
            current_line += 1
            
        elif prefix == '-':
            # Removed line: line_number is set to 0 as it doesn't exist in the new file
            result.append(DiffLine(0, content, 'removed'))
            
        elif prefix == ' ':
            # Context (unchanged) line: uses the current line number and then increments it
            result.append(DiffLine(current_line, content, 'context'))
            current_line += 1
            
        # Optional: handle lines that start with neither '+', '-', nor ' ' 
        # (e.g., diff headers, index lines, or other metadata)
        # We can simply skip them as they are not core diff content
        
    return result

def parse_unified_diff(diff_text: str) -> FileDiff:
    """Parse a unified diff format"""
    lines = diff_text.split('\n')
    
    # Extract file path
    file_path = "unknown"
    for line in lines:
        if line.startswith('+++'):
            file_path = line[4:].split('\t')[0]
            if file_path.startswith('b/'):
                file_path = file_path[2:]
            break
    
    # Check file status
    is_new = any('new file mode' in line for line in lines)
    is_deleted = any('deleted file mode' in line for line in lines)
    
    # Parse hunks
    hunks = []
    i = 0
    while i < len(lines):
        line = lines[i]
        
        # Find hunk header
        # The logic below is from the previous step and is fully functional
        hunk_match = re.match(r'@@ -(\d+),?\d* \+(\d+),?\d* @@', line)
        if hunk_match:
            old_start = int(hunk_match.group(1))
            new_start = int(hunk_match.group(2))
            
            # Collect raw lines for the hunk
            raw_hunk_lines = []
            i += 1
            
            while i < len(lines) and not lines[i].startswith('@@'):
                if lines[i] != '': # Collect all non-empty lines in the hunk
                    raw_hunk_lines.append(lines[i])
                i += 1
            
            # Use the completed parse_diff_lines function
            diff_lines = parse_diff_lines(raw_hunk_lines, new_start)
            
            hunks.append(DiffHunk(old_start, new_start, diff_lines))
        else:
            i += 1
    
    return FileDiff(file_path, hunks, is_new, is_deleted)

# Example usage for demonstration:
if __name__ == "__main__":
    # Test the parse_diff_lines function
    test_lines = [
        "+import sys",
        "-print('old code')",
        " def hello():",
        "+    print('Hello, world!')",
        "-    print('Hello')"
    ]
    
    # Starting line number for this example should be 1
    diff_lines = parse_diff_lines(test_lines, start_line=1)
    
    print("--- Test parse_diff_lines Output ---")
    for line in diff_lines:
        print(f"Line {line.line_number}: {line.change_type} - {line.content}")
    
    print("\n--- Full diff parsing Output ---")
    # Example of full diff parsing
    sample_diff = """diff --git a/example.py b/example.py
index 83db48f..f735c60 100644
--- a/example.py
+++ b/example.py
@@ -1,4 +1,5 @@
 import os
+import sys
 
 def foo():
-    print("Hello")
+    print("Hello, world!")
"""
    file_diff = parse_unified_diff(sample_diff)
    
    # parse_unified_diff now delegates line parsing to parse_diff_lines,
    # so this output also serves as a check of the completed function.
    
    print(file_diff)
    
    # Let's print the lines from the parsed FileDiff for better visualization
    if file_diff.hunks:
        print("\n--- Details of the First Hunk ---")
        for dline in file_diff.hunks[0].lines:
             print(f"[New Line: {dline.line_number}] Type: {dline.change_type.ljust(7)} | Content: {dline.content}")
```

Here is the completed `parse_diff_lines` function once more, with the `TODO` logic implemented to identify the line type, extract the content, and track line numbers correctly. This version of the demo starts numbering at line 3.

The `parse_unified_diff` function has also been adjusted slightly to call the new `parse_diff_lines` function, although the previous version already performed similar parsing inline.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class DiffLine:
    line_number: int
    content: str
    change_type: str  # 'added', 'removed', 'context'

@dataclass
class DiffHunk:
    old_start: int
    new_start: int
    lines: List['DiffLine']

@dataclass
class FileDiff:
    file_path: str
    hunks: List['DiffHunk']
    is_new: bool
    is_deleted: bool

def parse_diff_lines(raw_lines: List[str], start_line: int = 1) -> List[DiffLine]:
    """
    Parse a list of raw diff lines into DiffLine objects.
    
    Args:
        raw_lines: List of strings representing diff lines (with +, -, or space prefixes)
        start_line: The starting line number for the first line in the new file (b/ file)
        
    Returns:
        List of DiffLine objects representing the parsed diff lines
    """
    result = []
    current_line = start_line
    
    for line in raw_lines:
        if not line:  # Skip empty lines
            continue
            
        prefix = line[0]
        content = line[1:]
        
        if prefix == '+':
            # Added line: uses current_line, then increments it.
            result.append(DiffLine(current_line, content, 'added'))
            current_line += 1
            
        elif prefix == '-':
            # Removed line: line_number = 0 because it does not exist in the new file.
            result.append(DiffLine(0, content, 'removed'))
            
        elif prefix == ' ':
            # Context line: uses current_line, then increments it.
            result.append(DiffLine(current_line, content, 'context'))
            current_line += 1
            
        # Any other line is ignored
            
    return result

def parse_unified_diff(diff_text: str) -> FileDiff:
    """Parse a unified diff format"""
    lines = diff_text.split('\n')
    
    # Extract file path
    file_path = "unknown"
    for line in lines:
        if line.startswith('+++'):
            file_path = line[4:].split('\t')[0]
            if file_path.startswith('b/'):
                file_path = file_path[2:]
            break
    
    # Check file status
    is_new = any('new file mode' in line for line in lines)
    is_deleted = any('deleted file mode' in line for line in lines)
    
    # Parse hunks
    hunks = []
    i = 0
    while i < len(lines):
        line = lines[i]
        
        # Find hunk header
        hunk_match = re.match(r'@@ -(\d+),?\d* \+(\d+),?\d* @@', line)
        if hunk_match:
            old_start = int(hunk_match.group(1))
            new_start = int(hunk_match.group(2))
            
            # Collect the raw lines of the hunk
            raw_hunk_lines = []
            i += 1
            
            while i < len(lines) and not lines[i].startswith('@@'):
                if lines[i] != '': # Collect all non-empty lines in the hunk
                    raw_hunk_lines.append(lines[i])
                i += 1
            
            # Call the newly completed parse_diff_lines function
            diff_lines = parse_diff_lines(raw_hunk_lines, new_start)
            
            hunks.append(DiffHunk(old_start, new_start, diff_lines))
        else:
            i += 1
    
    return FileDiff(file_path, hunks, is_new, is_deleted)

# Example usage for demonstration:
if __name__ == "__main__":
    # Test the parse_diff_lines function
    test_lines = [
        "+import sys",
        "-print('old code')",
        " def hello():",
        "+    print('Hello, world!')",
        "-    print('Hello')"
    ]
    
    # start_line=3 simulates a hunk whose first line falls on line 3 of the new file.
    diff_lines = parse_diff_lines(test_lines, start_line=3)
    
    print("--- Test parse_diff_lines Output (Starting at Line 3) ---")
    # Expected output:
    # Line 3: added   - import sys
    # Line 0: removed - print('old code')
    # Line 4: context - def hello():
    # Line 5: added   -     print('Hello, world!')
    # Line 0: removed -     print('Hello')
    for line in diff_lines:
        print(f"Line {line.line_number}: {line.change_type.ljust(7)} - {line.content}")
    
    print("\n--- Full diff parsing Output ---")
    sample_diff = """diff --git a/example.py b/example.py
index 83db48f..f735c60 100644
--- a/example.py
+++ b/example.py
@@ -1,4 +1,5 @@
 import os
+import sys
 
 def foo():
-    print("Hello")
+    print("Hello, world!")
"""
    file_diff = parse_unified_diff(sample_diff)
    
    print(file_diff)
    
    if file_diff.hunks:
        print("\n--- Details of the First Hunk ---")
        # Hunk header: @@ -1,4 +1,5 @@ -> new_start = 1
        # The first line (' import os') becomes Line 1, the second ('+import sys') becomes Line 2, and so on.
        for dline in file_diff.hunks[0].lines:
             print(f"[New Line: {dline.line_number}] Type: {dline.change_type.ljust(7)} | Content: {dline.content}")
```

## Extracting File Paths from Diff Headers

Now that you've learned how to process individual diff lines, let's focus on another important part of parsing diffs: extracting file paths from diff headers. In Git diffs, file paths often come with prefixes like `b/` that we need to clean up before using.

In this exercise, you'll implement the `extract_file_path` function, which takes a diff header line (starting with `+++`) and extracts just the clean file path. You'll need to:

  * Check whether the line starts with `+++`.
  * Extract the path portion after the prefix.
  * Remove the `b/` prefix if it exists.
  * Handle various edge cases, such as files with spaces or nested directories.

This function is crucial for our diff parser because, before we can analyze code changes, we need to know exactly which files were modified. By creating a clean, reusable function for this task, you're making the overall parser more modular and easier to maintain.


```python
import re
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class DiffLine:
    line_number: int
    content: str
    change_type: str  # 'added', 'removed', 'context'

@dataclass
class DiffHunk:
    old_start: int
    new_start: int
    lines: List['DiffLine']

@dataclass
class FileDiff:
    file_path: str
    hunks: List['DiffHunk']
    is_new: bool
    is_deleted: bool

def extract_file_path(header_line: str) -> str:
    """
    Extract the file path from a diff header line.
    
    Args:
        header_line: A diff header line starting with '+++'
        
    Returns:
        The clean file path without prefixes
    """
    # TODO: Check if the line starts with '+++', return "unknown" if not
    
    # TODO: Extract the path portion after the +++ prefix (be careful of tab characters)
    
    # TODO: Remove the b/ prefix if present
    
    return "unknown"

def parse_unified_diff(diff_text: str) -> FileDiff:
    """Parse a unified diff format"""
    lines = diff_text.split('\n')
    
    # Extract file path
    file_path = "unknown"
    for line in lines:
        if line.startswith('+++'):
            # TODO: Use the extract_file_path function instead of inline logic
            file_path = line[4:].split('\t')[0]
            if file_path.startswith('b/'):
                file_path = file_path[2:]
            break
    
    # Check file status
    is_new = any('new file mode' in line for line in lines)
    is_deleted = any('deleted file mode' in line for line in lines)
    
    # Parse hunks
    hunks = []
    i = 0
    while i < len(lines):
        line = lines[i]
        
        # Find hunk header
        hunk_match = re.match(r'@@ -(\d+),?\d* \+(\d+),?\d* @@', line)
        if hunk_match:
            old_start = int(hunk_match.group(1))
            new_start = int(hunk_match.group(2))
            
            # Parse hunk lines
            diff_lines = []
            i += 1
            current_line = new_start
            
            while i < len(lines) and not lines[i].startswith('@@'):
                content = lines[i]
                if content.startswith('+'):
                    diff_lines.append(DiffLine(current_line, content[1:], 'added'))
                    current_line += 1
                elif content.startswith('-'):
                    diff_lines.append(DiffLine(0, content[1:], 'removed'))
                elif content.startswith(' '):
                    diff_lines.append(DiffLine(current_line, content[1:], 'context'))
                    current_line += 1
                elif content == '':
                    # End of diff or empty line, skip
                    pass
                i += 1
            
            hunks.append(DiffHunk(old_start, new_start, diff_lines))
        else:
            i += 1
    
    return FileDiff(file_path, hunks, is_new, is_deleted)

# Example usage for demonstration:
if __name__ == "__main__":
    # Test the extract_file_path function
    test_headers = [
        "+++ b/path/to/file.py",
        "+++ b/file with spaces.py",
        "+++ b/rootfile.py",
        "+++ /dev/null",
        "+++ b/deeply/nested/directory/structure/file.py"
    ]
    
    for header in test_headers:
        path = extract_file_path(header)
        print(f"Header: {header} -> Path: {path}")
    
    # Example of full diff parsing
    sample_diff = """diff --git a/example.py b/example.py
index 83db48f..f735c60 100644
--- a/example.py
+++ b/example.py
@@ -1,4 +1,5 @@
 import os
+import sys
 
 def foo():
-    print("Hello")
+    print("Hello, world!")
"""
    file_diff = parse_unified_diff(sample_diff)
    print(file_diff)
```

Implementing helper functions for specific parsing tasks, like extracting file paths, makes the main parser cleaner and more robust.

Here is the completed `extract_file_path` function, along with the necessary update to `parse_unified_diff` to use it.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class DiffLine:
    line_number: int
    content: str
    change_type: str  # 'added', 'removed', 'context'

@dataclass
class DiffHunk:
    old_start: int
    new_start: int
    lines: List['DiffLine']

@dataclass
class FileDiff:
    file_path: str
    hunks: List['DiffHunk']
    is_new: bool
    is_deleted: bool

def extract_file_path(header_line: str) -> str:
    """
    Extract the file path from a diff header line.
    
    Args:
        header_line: A diff header line starting with '+++'
        
    Returns:
        The clean file path without prefixes
    """
    # 1. Check if the line starts with '+++', return "unknown" if not
    if not header_line.startswith('+++ '):
        # We only care about the new file path header
        return "unknown"
    
    # 2. Extract the path portion after the +++ prefix
    # The prefix is '+++ ' (4 characters). We strip off the first 4 characters.
    path_with_metadata = header_line[4:]
    
    # In some Git formats, the path might be followed by a tab (\t) and a timestamp/metadata.
    # We split by tab and take the first element (the path).
    clean_path = path_with_metadata.split('\t')[0]
    
    # 3. Remove the b/ prefix if present
    if clean_path.startswith('b/'):
        clean_path = clean_path[2:]
        
    return clean_path.strip() # .strip() to remove any lingering whitespace


def parse_unified_diff(diff_text: str) -> FileDiff:
    """Parse a unified diff format"""
    lines = diff_text.split('\n')
    
    # Extract file path
    file_path = "unknown"
    for line in lines:
        if line.startswith('+++'):
            # Use the extract_file_path function
            file_path = extract_file_path(line)
            break
    
    # Check file status
    is_new = any('new file mode' in line for line in lines)
    is_deleted = any('deleted file mode' in line for line in lines)
    
    # Parse hunks (rest of the logic remains unchanged for this exercise)
    hunks = []
    i = 0
    while i < len(lines):
        line = lines[i]
        
        # Find hunk header
        hunk_match = re.match(r'@@ -(\d+),?\d* \+(\d+),?\d* @@', line)
        if hunk_match:
            old_start = int(hunk_match.group(1))
            new_start = int(hunk_match.group(2))
            
            # Parse hunk lines
            diff_lines = []
            i += 1
            current_line = new_start
            
            while i < len(lines) and not lines[i].startswith('@@'):
                content = lines[i]
                if content.startswith('+'):
                    diff_lines.append(DiffLine(current_line, content[1:], 'added'))
                    current_line += 1
                elif content.startswith('-'):
                    diff_lines.append(DiffLine(0, content[1:], 'removed'))
                elif content.startswith(' '):
                    diff_lines.append(DiffLine(current_line, content[1:], 'context'))
                    current_line += 1
                elif content == '':
                    # End of diff or empty line, skip
                    pass
                i += 1
            
            hunks.append(DiffHunk(old_start, new_start, diff_lines))
        else:
            i += 1
    
    return FileDiff(file_path, hunks, is_new, is_deleted)

# Example usage for demonstration:
if __name__ == "__main__":
    
    print("--- Test extract_file_path Output ---")
    # Test the extract_file_path function
    test_headers = [
        "+++ b/path/to/file.py",
        "+++ b/file with spaces.py",
        "+++ b/rootfile.py",
        "+++ /dev/null",
        "+++ b/deeply/nested/directory/structure/file.py\t2025-01-01 12:00:00.000000000 +0000", # Example with metadata
        "--- a/oldfile.txt" # Should return "unknown"
    ]
    
    for header in test_headers:
        path = extract_file_path(header)
        print(f"Header: '{header}'\n -> Path: '{path}'\n")
    
    print("--- Full diff parsing Output ---")
    # Example of full diff parsing
    sample_diff = """diff --git a/example.py b/example.py
index 83db48f..f735c60 100644
--- a/example.py
+++ b/example.py
@@ -1,4 +1,5 @@
 import os
+import sys
 
 def foo():
-    print("Hello")
+    print("Hello, world!")
"""
    file_diff = parse_unified_diff(sample_diff)
    print(file_diff)
```

Implementing helper functions for specific parsing tasks, such as extracting file paths, keeps the main parser cleaner and easier to test.

Here is the completed `extract_file_path` function, along with the update to `parse_unified_diff` that uses it.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class DiffLine:
    line_number: int
    content: str
    change_type: str  # 'added', 'removed', 'context'

@dataclass
class DiffHunk:
    old_start: int
    new_start: int
    lines: List['DiffLine']

@dataclass
class FileDiff:
    file_path: str
    hunks: List['DiffHunk']
    is_new: bool
    is_deleted: bool

def extract_file_path(header_line: str) -> str:
    """
    Extract the file path from a diff header line.
    
    Args:
        header_line: A diff header line starting with '+++' or '---'
        
    Returns:
        The clean file path without prefixes
    """
    # 1. Check if the line starts with '+++', return "unknown" if not
    if not header_line.startswith('+++ '):
        # We only care about the new file path header
        return "unknown"
    
    # 2. Extract the path portion after the +++ prefix
    # The prefix is '+++ ' (4 characters). We strip off the first 4 characters.
    path_with_metadata = header_line[4:]
    
    # In some Git formats, the path might be followed by a tab (\t) and a timestamp/metadata.
    # We split by tab and take the first element (the path).
    clean_path = path_with_metadata.split('\t')[0]
    
    # 3. Remove the b/ prefix if present
    if clean_path.startswith('b/'):
        clean_path = clean_path[2:]
        
    return clean_path.strip() # .strip() to remove any lingering whitespace


def parse_unified_diff(diff_text: str) -> FileDiff:
    """Parse a unified diff format"""
    lines = diff_text.split('\n')
    
    # Extract file path
    file_path = "unknown"
    for line in lines:
        if line.startswith('+++'):
            # Use the extract_file_path function
            file_path = extract_file_path(line)
            break
    
    # Check file status
    is_new = any('new file mode' in line for line in lines)
    is_deleted = any('deleted file mode' in line for line in lines)
    
    # Parse hunks (rest of the logic remains unchanged for this exercise)
    hunks = []
    i = 0
    while i < len(lines):
        line = lines[i]
        
        # Find hunk header
        hunk_match = re.match(r'@@ -(\d+),?\d* \+(\d+),?\d* @@', line)
        if hunk_match:
            old_start = int(hunk_match.group(1))
            new_start = int(hunk_match.group(2))
            
            # Parse hunk lines
            diff_lines = []
            i += 1
            current_line = new_start
            
            while i < len(lines) and not lines[i].startswith('@@'):
                content = lines[i]
                if content.startswith('+'):
                    diff_lines.append(DiffLine(current_line, content[1:], 'added'))
                    current_line += 1
                elif content.startswith('-'):
                    diff_lines.append(DiffLine(0, content[1:], 'removed'))
                elif content.startswith(' '):
                    diff_lines.append(DiffLine(current_line, content[1:], 'context'))
                    current_line += 1
                elif content == '':
                    # End of diff or empty line, skip
                    pass
                i += 1
            
            hunks.append(DiffHunk(old_start, new_start, diff_lines))
        else:
            i += 1
    
    return FileDiff(file_path, hunks, is_new, is_deleted)

# Example usage for demonstration:
if __name__ == "__main__":
    
    print("--- Test extract_file_path Output ---")
    # Test the extract_file_path function
    test_headers = [
        "+++ b/path/to/file.py",
        "+++ b/file with spaces.py",
        "+++ b/rootfile.py",
        "+++ /dev/null",
        "+++ b/deeply/nested/directory/structure/file.py\t2025-01-01 12:00:00.000000000 +0000", # Example with metadata
        "--- a/oldfile.txt" # Should return "unknown"
    ]
    
    for header in test_headers:
        path = extract_file_path(header)
        print(f"Header: '{header}'\n -> Path: '{path}'\n")
    
    print("--- Full diff parsing Output ---")
    # Example of full diff parsing
    sample_diff = """diff --git a/example.py b/example.py
index 83db48f..f735c60 100644
--- a/example.py
+++ b/example.py
@@ -1,4 +1,5 @@
 import os
+import sys
 
 def foo():
-    print("Hello")
+    print("Hello, world!")
"""
    file_diff = parse_unified_diff(sample_diff)
    print(file_diff)
```
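The test list above includes `+++ /dev/null`, which is what Git writes as the new-file header when a file has been deleted. `extract_file_path` returns `/dev/null` in that case, so a caller may prefer to fall back to the `---` header. The following is a minimal sketch of that idea, not part of the exercise: `path_from_header` and `resolve_file_path` are hypothetical helper names, and the sketch assumes Git's default `a/` and `b/` prefixes.

```python
def path_from_header(header_line: str) -> str:
    """Return the path from a '--- a/...' or '+++ b/...' header line."""
    if not header_line.startswith(('--- ', '+++ ')):
        return "unknown"
    # Drop the 4-character marker and any tab-separated metadata
    path = header_line[4:].split('\t')[0].strip()
    # Assumption: Git's default prefixes (a/ for old paths, b/ for new paths)
    if path.startswith(('a/', 'b/')):
        path = path[2:]
    return path


def resolve_file_path(old_header: str, new_header: str) -> str:
    """Prefer the new path; fall back to the old path for deleted files."""
    new_path = path_from_header(new_header)
    if new_path == '/dev/null':
        return path_from_header(old_header)
    return new_path


print(resolve_file_path('--- a/old.py', '+++ /dev/null'))  # deleted file
print(resolve_file_path('--- /dev/null', '+++ b/new.py'))  # new file
```

With this fallback, a deleted file is reported under its old name and a new file under its new name.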

## Parsing Hunk Headers for Diff Analysis

You've made good progress processing diff lines and extracting file paths! Now, let's tackle another critical part of our diff parser: the hunk header. These headers (like `@@ -1,4 +1,5 @@`) tell us exactly which lines changed in a file.

In this exercise, you'll create a `parse_hunk_header` function that correctly extracts line numbers from various hunk header formats. The current code has a regex pattern that works only when both counts are present and fails on variations like `@@ -5 +5,2 @@`, where a count is omitted.

Your task is to:

  * Create a new function that handles all common hunk header variations
  * Fix the regex pattern to properly capture start line numbers
  * Update the main parsing function to use your new implementation

Accurate hunk header parsing is essential because these headers define the exact location of code changes. Without this, we can't properly track line numbers for the AI to review the right sections of code.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class DiffLine:
    line_number: int
    content: str
    change_type: str  # 'added', 'removed', 'context'

@dataclass
class DiffHunk:
    old_start: int
    new_start: int
    lines: List['DiffLine']

@dataclass
class FileDiff:
    file_path: str
    hunks: List['DiffHunk']
    is_new: bool
    is_deleted: bool

# TODO: Create a new function called parse_hunk_header that takes a header_line parameter
# and returns a tuple of (old_start, new_start) as integers

def parse_unified_diff(diff_text: str) -> FileDiff:
    """Parse a unified diff format"""
    lines = diff_text.split('\n')
    
    # Extract file path
    file_path = "unknown"
    for line in lines:
        if line.startswith('+++'):
            file_path = line[4:].split('\t')[0]
            if file_path.startswith('b/'):
                file_path = file_path[2:]
            break
    
    # Check file status
    is_new = any('new file mode' in line for line in lines)
    is_deleted = any('deleted file mode' in line for line in lines)
    
    # Parse hunks
    hunks = []
    i = 0
    while i < len(lines):
        line = lines[i]
        
        # Find hunk header
        hunk_match = re.match(r'@@ -(\d+),\d+ \+(\d+),\d+ @@', line)
        if hunk_match:
            old_start = int(hunk_match.group(1))
            new_start = int(hunk_match.group(2))
            
            # Parse hunk lines
            diff_lines = []
            i += 1
            current_line = new_start
            
            while i < len(lines) and not lines[i].startswith('@@'):
                content = lines[i]
                if content.startswith('+'):
                    diff_lines.append(DiffLine(current_line, content[1:], 'added'))
                    current_line += 1
                elif content.startswith('-'):
                    diff_lines.append(DiffLine(0, content[1:], 'removed'))
                elif content.startswith(' '):
                    diff_lines.append(DiffLine(current_line, content[1:], 'context'))
                    current_line += 1
                elif content == '':
                    # End of diff or empty line, skip
                    pass
                i += 1
            
            hunks.append(DiffHunk(old_start, new_start, diff_lines))
        else:
            i += 1
    
    return FileDiff(file_path, hunks, is_new, is_deleted)

# Example usage for demonstration:
if __name__ == "__main__":
    # Test the hunk header parser with different formats
    test_headers = [
        "@@ -1,4 +1,5 @@",       # Standard format
        "@@ -5 +5,2 @@",         # Missing old count
        "@@ -10,3 +10 @@",       # Missing new count
        "@@ -7 +7 @@"            # Missing both counts
    ]
    
    # TODO: Add code to test your parse_hunk_header function with the test_headers
    
    # Example of full diff parsing
    sample_diff = """diff --git a/example.py b/example.py
index 83db48f..f735c60 100644
--- a/example.py
+++ b/example.py
@@ -1,4 +1,5 @@
 import os
+import sys
 
 def foo():
-    print("Hello")
+    print("Hello, world!")
"""
    file_diff = parse_unified_diff(sample_diff)
    print(file_diff)
    
    # Example with a non-standard hunk header
    sample_diff2 = """diff --git a/test.py b/test.py
index 123abc..456def 100644
--- a/test.py
+++ b/test.py
@@ -5 +5,2 @@ def test_function():
     print("Testing")
+    print("More tests")
"""
    file_diff2 = parse_unified_diff(sample_diff2)
    print(file_diff2)
```
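To see the limitation concretely, the strict pattern from the starter code can be run against the four test headers on its own. This is a small standalone check; the names `STRICT_PATTERN` and `results` are introduced here only for illustration.

```python
import re

# The strict pattern from the starter code: both counts are mandatory.
STRICT_PATTERN = re.compile(r'@@ -(\d+),\d+ \+(\d+),\d+ @@')

test_headers = [
    "@@ -1,4 +1,5 @@",   # both counts present
    "@@ -5 +5,2 @@",     # old count omitted
    "@@ -10,3 +10 @@",   # new count omitted
    "@@ -7 +7 @@",       # both counts omitted
]

# Record which headers the strict pattern accepts
results = {h: STRICT_PATTERN.match(h) is not None for h in test_headers}
for header, matched in results.items():
    print(f"{header!r}: {'match' if matched else 'no match'}")
```

Only the first header matches. The other three are silently skipped, so `parse_unified_diff` would return no hunks for them.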

A hunk header has the form `@@ -<start_old>,<count_old> +<start_new>,<count_new> @@`, and a count is omitted when it equals 1, so a robust regex must treat both counts as optional.

Here is the completed code, including the new flexible `parse_hunk_header` function and the updated `parse_unified_diff` to use it.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class DiffLine:
    line_number: int
    content: str
    change_type: str  # 'added', 'removed', 'context'

@dataclass
class DiffHunk:
    old_start: int
    new_start: int
    lines: List['DiffLine']

@dataclass
class FileDiff:
    file_path: str
    hunks: List['DiffHunk']
    is_new: bool
    is_deleted: bool

def parse_hunk_header(header_line: str) -> Optional[tuple[int, int]]:
    """
    Extracts the old_start and new_start line numbers from a hunk header.
    Handles formats where the line counts are omitted (e.g., '@@ -5 +5 @@').
    
    Args:
        header_line: The line starting with '@@'.
        
    Returns:
        A tuple (old_start, new_start) or None if parsing fails.
    """
    # Regex breakdown:
    # -(\d+)     -> captures old_start
    # (?:,\d+)?  -> optional ',count' group (the count is omitted when it is 1)
    # \+(\d+)    -> captures new_start
    # (?:,\d+)?  -> optional ',count' group
    regex = r'@@ -(\d+)(?:,\d+)? \+(\d+)(?:,\d+)? @@'
    
    hunk_match = re.match(regex, header_line)
    
    if hunk_match:
        # group(1) is old_start, group(2) is new_start
        old_start = int(hunk_match.group(1))
        new_start = int(hunk_match.group(2))
        return old_start, new_start
    
    return None

def parse_unified_diff(diff_text: str) -> FileDiff:
    """Parse a unified diff format"""
    lines = diff_text.split('\n')
    
    # Extract file path
    file_path = "unknown"
    for line in lines:
        if line.startswith('+++'):
            # The inline logic is used here as the file path extraction was done in the previous step
            file_path = line[4:].split('\t')[0]
            if file_path.startswith('b/'):
                file_path = file_path[2:]
            break
    
    # Check file status
    is_new = any('new file mode' in line for line in lines)
    is_deleted = any('deleted file mode' in line for line in lines)
    
    # Parse hunks
    hunks = []
    i = 0
    while i < len(lines):
        line = lines[i]
        
        # Find hunk header and use the new function
        start_lines = parse_hunk_header(line)
        
        if start_lines:
            old_start, new_start = start_lines
            
            # Parse hunk lines
            diff_lines = []
            i += 1
            current_line = new_start
            
            while i < len(lines) and not lines[i].startswith('@@'):
                content = lines[i]
                if content.startswith('+'):
                    diff_lines.append(DiffLine(current_line, content[1:], 'added'))
                    current_line += 1
                elif content.startswith('-'):
                    diff_lines.append(DiffLine(0, content[1:], 'removed'))
                elif content.startswith(' '):
                    diff_lines.append(DiffLine(current_line, content[1:], 'context'))
                    current_line += 1
                elif content == '':
                    # End of diff or empty line, skip
                    pass
                i += 1
            
            hunks.append(DiffHunk(old_start, new_start, diff_lines))
        else:
            i += 1
    
    return FileDiff(file_path, hunks, is_new, is_deleted)

# Example usage for demonstration:
if __name__ == "__main__":
    
    # Test the hunk header parser with different formats
    test_headers = [
        "@@ -1,4 +1,5 @@",       # Standard format (Expected: 1, 1)
        "@@ -5 +5,2 @@",         # Missing old count (Expected: 5, 5)
        "@@ -10,3 +10 @@",       # Missing new count (Expected: 10, 10)
        "@@ -7 +7 @@",           # Missing both counts (Expected: 7, 7)
        "@@ -100,20 +120,30 @@ class TestClass:", # With description (Expected: 100, 120)
        "Not a header line"      # Invalid line (Expected: None)
    ]
    
    print("--- Test parse_hunk_header Output ---")
    for header in test_headers:
        result = parse_hunk_header(header)
        # Print the header as-is; appending ' @@' would mislabel the invalid test line
        print(f"Header: {header!r} -> Result: {result}")
        
    print("\n--- Full diff parsing with standard header ---")
    # Example of full diff parsing
    sample_diff = """diff --git a/example.py b/example.py
index 83db48f..f735c60 100644
--- a/example.py
+++ b/example.py
@@ -1,4 +1,5 @@
 import os
+import sys
 
 def foo():
-    print("Hello")
+    print("Hello, world!")
"""
    file_diff = parse_unified_diff(sample_diff)
    print(file_diff)
    
    print("\n--- Full diff parsing with non-standard header ---")
    # Example with a non-standard hunk header
    sample_diff2 = """diff --git a/test.py b/test.py
index 123abc..456def 100644
--- a/test.py
+++ b/test.py
@@ -5 +5,2 @@ def test_function():
     print("Testing")
+    print("More tests")
"""
    file_diff2 = parse_unified_diff(sample_diff2)
    print(file_diff2)
    
    # Verification of line numbers from the non-standard diff (sample_diff2):
    if file_diff2.hunks:
        hunk = file_diff2.hunks[0]
        print(f"\nVerification: Hunk starts at old={hunk.old_start}, new={hunk.new_start}")
        print("Parsed lines for non-standard diff:")
        for dline in hunk.lines:
            print(f"[New Line: {dline.line_number}] Type: {dline.change_type.ljust(7)} | Content: {dline.content}")
```



## Parsing Complete Hunks for Code Review

You've done excellent work parsing diff lines, file paths, and hunk headers! Now it's time to bring these skills together by implementing a function that processes a complete hunk from start to finish.

In this exercise, you'll create the `parse_single_hunk` function, which takes a multi-line string containing a hunk (header plus content lines) and converts it into a structured `DiffHunk` object. You'll need to:

  * Extract the starting line numbers from the hunk header
  * Process each content line based on its prefix character
  * Track line numbers correctly as you go through the hunk
  * Build a complete collection of `DiffLine` objects

This function brings together what you've learned so far, combining header parsing with line-by-line processing. By creating a dedicated function for hunk parsing, you're building a modular component that makes the overall diff parser more maintainable and easier to understand.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class DiffLine:
    line_number: int
    content: str
    change_type: str  # 'added', 'removed', 'context'

@dataclass
class DiffHunk:
    old_start: int
    new_start: int
    lines: List['DiffLine']

@dataclass
class FileDiff:
    file_path: str
    hunks: List['DiffHunk']
    is_new: bool
    is_deleted: bool

def parse_hunk_header(header_line: str) -> Tuple[int, int]:
    """
    Parse a hunk header line to extract the starting line numbers.
    
    Args:
        header_line: A string containing the hunk header (e.g., "@@ -1,4 +1,5 @@")
        
    Returns:
        A tuple of (old_start, new_start) as integers
    """
    pattern = r'@@ -(\d+)(?:,\d+)? \+(\d+)(?:,\d+)? @@'
    match = re.match(pattern, header_line)
    
    if match:
        old_start = int(match.group(1))
        new_start = int(match.group(2))
        return old_start, new_start
    
    # If we can't parse the header, return default values
    return 1, 1

def parse_single_hunk(hunk_text: str) -> DiffHunk:
    """
    Parse a single hunk from a unified diff into a DiffHunk object.
    
    Args:
        hunk_text: A multi-line string containing a single hunk (header + lines)
        
    Returns:
        A DiffHunk object with parsed information
    """
    # TODO: Split the hunk text into lines
    
    # TODO: Check if the first line is a valid hunk header, return empty DiffHunk if not
    
    # TODO: Parse the header to get old_start and new_start line numbers
    
    # TODO: Initialize an empty list for diff_lines and set current_line to new_start
    
    # TODO: Loop through each line after the header
    # For each line:
    # - If it starts with '+', create a DiffLine with type 'added' and increment current_line
    # - If it starts with '-', create a DiffLine with type 'removed' and line_number=0
    # - If it starts with ' ', create a DiffLine with type 'context' and increment current_line
    
    # TODO: Return a DiffHunk with the parsed information
    return DiffHunk(1, 1, [])

def parse_unified_diff(diff_text: str) -> FileDiff:
    """Parse a unified diff format"""
    lines = diff_text.split('\n')
    
    # Extract file path
    file_path = "unknown"
    for line in lines:
        if line.startswith('+++'):
            file_path = line[4:].split('\t')[0]
            if file_path.startswith('b/'):
                file_path = file_path[2:]
            break
    
    # Check file status
    is_new = any('new file mode' in line for line in lines)
    is_deleted = any('deleted file mode' in line for line in lines)
    
    # Parse hunks
    hunks = []
    i = 0
    while i < len(lines):
        line = lines[i]
        
        # Find hunk header
        if line.startswith('@@'):
            # Find the end of this hunk
            hunk_start = i
            i += 1
            while i < len(lines) and not lines[i].startswith('@@'):
                i += 1
            
            # Extract the hunk text and parse it
            hunk_text = '\n'.join(lines[hunk_start:i])
            # TODO: Use the parse_single_hunk function to parse the hunk text
            hunk = DiffHunk(1, 1, [])  # Replace this with a call to parse_single_hunk
            hunks.append(hunk)
        else:
            i += 1
    
    return FileDiff(file_path, hunks, is_new, is_deleted)

# Example usage for demonstration:
if __name__ == "__main__":
    # Test the parse_single_hunk function
    test_hunks = [
        """@@ -1,4 +1,5 @@
 import os
+import sys
 
 def foo():
-    print("Hello")
+    print("Hello, world!")""",
        
        """@@ -10 +10,2 @@ def test_function():
     print("Testing")
+    print("More tests")"""
    ]
    
    for i, hunk_text in enumerate(test_hunks):
        print(f"Hunk {i+1}:")
        hunk = parse_single_hunk(hunk_text)
        print(f"  old_start: {hunk.old_start}, new_start: {hunk.new_start}")
        print(f"  lines: {len(hunk.lines)}")
        for line in hunk.lines:
            print(f"    Line {line.line_number}: {line.change_type} - {line.content}")
    
    # Example of full diff parsing
    sample_diff = """diff --git a/example.py b/example.py
index 83db48f..f735c60 100644
--- a/example.py
+++ b/example.py
@@ -1,4 +1,5 @@
 import os
+import sys
 
 def foo():
-    print("Hello")
+    print("Hello, world!")
"""
    file_diff = parse_unified_diff(sample_diff)
    print(file_diff)

```
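The exercise stores `line_number=0` for removed lines because they have no position in the new file. A variant that is not required by the exercise tracks two counters, one per file version, so removed lines keep their old-file number. The sketch below illustrates the counting rule only; `number_hunk_lines` is a hypothetical helper that returns plain tuples rather than `DiffLine` objects.

```python
from typing import List, Optional, Tuple

NumberedLine = Tuple[Optional[int], Optional[int], str]  # (old_no, new_no, text)


def number_hunk_lines(old_start: int, new_start: int,
                      body_lines: List[str]) -> List[NumberedLine]:
    """Assign old-file and new-file line numbers to the lines of one hunk."""
    old_no, new_no = old_start, new_start
    numbered: List[NumberedLine] = []
    for line in body_lines:
        if line.startswith('+'):
            # Added: exists only in the new file
            numbered.append((None, new_no, line[1:]))
            new_no += 1
        elif line.startswith('-'):
            # Removed: exists only in the old file
            numbered.append((old_no, None, line[1:]))
            old_no += 1
        elif line.startswith(' '):
            # Context: exists in both files, so both counters advance
            numbered.append((old_no, new_no, line[1:]))
            old_no += 1
            new_no += 1
    return numbered


body = [
    " import os",
    "+import sys",
    " ",
    " def foo():",
    '-    print("Hello")',
    '+    print("Hello, world!")',
]
for old_no, new_no, text in number_hunk_lines(1, 1, body):
    print(old_no, new_no, repr(text))
```

The rule is the same as in the exercise: added and context lines advance the new-file counter, while removed and context lines advance the old-file counter.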

## Handling Multiple Hunks in Diff Parsing

Excellent work on building the components of our diff parser! You've mastered processing individual lines, extracting file paths, parsing hunk headers, and handling complete hunks. Now it's time to tackle a real-world challenge: handling multiple hunks in a single file diff.

In practice, code changes often affect different parts of a file, resulting in multiple separate hunks in the diff. Our current parser has a limitation: it stops after finding the first hunk, missing any additional changes.

In this exercise, you'll enhance the `parse_unified_diff` function to process all hunks in a file diff. You need to:

  * Remove the premature exit after processing the first hunk
  * Ensure the parser continues scanning for additional hunk headers
  * Properly track boundaries between hunks to avoid overlap
  * Add all parsed hunks to the result list in the correct order

This improvement is crucial for real-world code review, as many meaningful changes affect multiple parts of a file. By handling multiple hunks, your diff parser will be ready to process complex code changes that AI assistants need to review effectively.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class DiffLine:
    line_number: int
    content: str
    change_type: str  # 'added', 'removed', 'context'

@dataclass
class DiffHunk:
    old_start: int
    new_start: int
    lines: List['DiffLine']

@dataclass
class FileDiff:
    file_path: str
    hunks: List['DiffHunk']
    is_new: bool
    is_deleted: bool

def parse_hunk_header(header_line: str) -> Tuple[int, int]:
    """
    Parse a hunk header line to extract the starting line numbers.
    
    Args:
        header_line: A string containing the hunk header (e.g., "@@ -1,4 +1,5 @@")
        
    Returns:
        A tuple of (old_start, new_start) as integers
    """
    pattern = r'@@ -(\d+)(?:,\d+)? \+(\d+)(?:,\d+)? @@'
    match = re.match(pattern, header_line)
    
    if match:
        old_start = int(match.group(1))
        new_start = int(match.group(2))
        return old_start, new_start
    
    # If we can't parse the header, return default values
    return 1, 1

def extract_file_path(header_line: str) -> str:
    """
    Extract the file path from a diff header line.
    
    Args:
        header_line: A diff header line starting with '+++' or '---'
        
    Returns:
        The clean file path without prefixes
    """
    if not header_line.startswith('+++'):
        return "unknown"
    
    # Extract the path portion after the +++ prefix
    path = header_line[4:].split('\t')[0]
    
    # Remove the b/ prefix if present
    if path.startswith('b/'):
        path = path[2:]
        
    return path

def parse_single_hunk(hunk_text: str) -> DiffHunk:
    """
    Parse a single hunk from a unified diff into a DiffHunk object.
    
    Args:
        hunk_text: A multi-line string containing a single hunk (header + lines)
        
    Returns:
        A DiffHunk object with parsed information
    """
    # Strip only newlines: a bare strip() would also drop a trailing blank context line (' ')
    lines = hunk_text.strip('\n').split('\n')
    
    # First line should be the hunk header
    if not lines or not lines[0].startswith('@@'):
        return DiffHunk(1, 1, [])
    
    # Parse the header to get starting line numbers
    old_start, new_start = parse_hunk_header(lines[0])
    
    # Process each line in the hunk
    diff_lines = []
    current_line = new_start
    
    for line in lines[1:]:  # Skip the header line
        if not line:  # Skip empty lines
            continue
            
        if line.startswith('+'):
            # Added line
            diff_lines.append(DiffLine(current_line, line[1:], 'added'))
            current_line += 1
        elif line.startswith('-'):
            # Removed line - no line number in new file
            diff_lines.append(DiffLine(0, line[1:], 'removed'))
        elif line.startswith(' '):
            # Context line
            diff_lines.append(DiffLine(current_line, line[1:], 'context'))
            current_line += 1
    
    return DiffHunk(old_start, new_start, diff_lines)

def parse_unified_diff(diff_text: str) -> FileDiff:
    """Parse a unified diff format"""
    lines = diff_text.split('\n')
    
    # Extract file path
    file_path = "unknown"
    for line in lines:
        if line.startswith('+++'):
            file_path = extract_file_path(line)
            break
    
    # Check file status
    is_new = any('new file mode' in line for line in lines)
    is_deleted = any('deleted file mode' in line for line in lines)
    
    # Parse hunks
    hunks = []
    i = 0
    
    # This implementation only processes the first hunk it finds
    while i < len(lines):
        line = lines[i]
        
        # Find hunk header
        if line.startswith('@@'):
            old_start, new_start = parse_hunk_header(line)
            
            # Parse hunk lines
            diff_lines = []
            i += 1
            current_line = new_start
            
            while i < len(lines) and not lines[i].startswith('@@'):
                content = lines[i]
                if content.startswith('+'):
                    diff_lines.append(DiffLine(current_line, content[1:], 'added'))
                    current_line += 1
                elif content.startswith('-'):
                    diff_lines.append(DiffLine(0, content[1:], 'removed'))
                elif content.startswith(' '):
                    diff_lines.append(DiffLine(current_line, content[1:], 'context'))
                    current_line += 1
                elif content == '':
                    # End of diff or empty line, skip
                    pass
                i += 1
            
            hunks.append(DiffHunk(old_start, new_start, diff_lines))
            
            # TODO: Instead of stopping after the first hunk, we should continue parsing
            # to find additional hunks. Remove this break statement.
            break
        else:
            i += 1
    
    return FileDiff(file_path, hunks, is_new, is_deleted)

# Example usage for demonstration:
if __name__ == "__main__":
    # Test with a diff containing multiple hunks
    multi_hunk_diff = """diff --git a/example.py b/example.py
index 83db48f..f735c60 100644
--- a/example.py
+++ b/example.py
@@ -1,4 +1,5 @@
 import os
+import sys
 
 def foo():
-    print("Hello")
+    print("Hello, world!")
@@ -10,5 +11,6 @@ def bar():
     return 42
 
 def baz():
-    print("Baz")
+    print("Baz!")
+    return None
 
"""
    file_diff = parse_unified_diff(multi_hunk_diff)
    print(f"File: {file_diff.file_path}")
    print(f"Number of hunks: {len(file_diff.hunks)}")
    
    for i, hunk in enumerate(file_diff.hunks):
        print(f"\nHunk {i+1}:")
        print(f"  old_start: {hunk.old_start}, new_start: {hunk.new_start}")
        print(f"  lines: {len(hunk.lines)}")
        for line in hunk.lines:
            print(f"    Line {line.line_number}: {line.change_type} - {line.content}")

```
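One way to picture the hunk boundaries is to collect the index of every line that starts with `@@` and treat each index as the start of a chunk that runs up to the next one. The sketch below shows that idea in isolation. `split_into_hunks` is a hypothetical helper, and it assumes the input holds the diff of a single file.

```python
from typing import List


def split_into_hunks(diff_text: str) -> List[str]:
    """Split a single-file diff into one text chunk per hunk, in order."""
    lines = diff_text.split('\n')
    # Every hunk header marks the start of a chunk
    starts = [i for i, line in enumerate(lines) if line.startswith('@@')]
    chunks = []
    for n, start in enumerate(starts):
        # A chunk ends where the next header begins, or at the end of the text
        end = starts[n + 1] if n + 1 < len(starts) else len(lines)
        chunks.append('\n'.join(lines[start:end]))
    return chunks


demo_diff = """--- a/example.py
+++ b/example.py
@@ -1,2 +1,3 @@
 import os
+import sys
 x = 1
@@ -10,2 +11,2 @@ def baz():
-    print("Baz")
+    print("Baz!")
     return 1
"""
chunks = split_into_hunks(demo_diff)
print(len(chunks))
for chunk in chunks:
    print(chunk.split('\n')[0])
```

Each chunk could then be handed to a function such as `parse_single_hunk`. Because the chunks never overlap, the hunks come out in file order.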

The limitation in the current parser comes from the `break` that ends the loop after the first hunk. The key is to **remove the `break` statement** so that the outer `while` loop continues scanning the remaining lines for new hunk headers.

Here is the enhanced `parse_unified_diff` function:

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class DiffLine:
    line_number: int
    content: str
    change_type: str  # 'added', 'removed', 'context'

@dataclass
class DiffHunk:
    old_start: int
    new_start: int
    lines: List['DiffLine']

@dataclass
class FileDiff:
    file_path: str
    hunks: List['DiffHunk']
    is_new: bool
    is_deleted: bool

# --- Supporting functions (Keeping them as provided) ---

def parse_hunk_header(header_line: str) -> Tuple[int, int]:
    """
    Parse a hunk header line to extract the starting line numbers.
    """
    pattern = r'@@ -(\d+)(?:,\d+)? \+(\d+)(?:,\d+)? @@'
    match = re.match(pattern, header_line)
    
    if match:
        old_start = int(match.group(1))
        new_start = int(match.group(2))
        return old_start, new_start
    
    return 1, 1

def extract_file_path(header_line: str) -> str:
    """
    Extract the file path from a diff header line.
    """
    if not header_line.startswith('+++'):
        return "unknown"
    
    path = header_line[4:].split('\t')[0]
    
    if path.startswith('b/'):
        path = path[2:]
        
    return path

# The original parse_single_hunk is now integrated into the main function's logic
# and can be removed, but we'll keep it simple by fixing the main function.
def parse_single_hunk(hunk_text: str) -> DiffHunk:
    """
    Parse a single hunk from a unified diff into a DiffHunk object.
    (This function is not used in the final implementation below, 
    as the logic is now inline in parse_unified_diff for efficiency).
    """
    # Strip only newlines: a bare strip() would also drop a trailing blank context line (' ')
    lines = hunk_text.strip('\n').split('\n')
    
    if not lines or not lines[0].startswith('@@'):
        return DiffHunk(1, 1, [])
    
    old_start, new_start = parse_hunk_header(lines[0])
    
    diff_lines = []
    current_line = new_start
    
    for line in lines[1:]:
        if not line:
            continue
            
        if line.startswith('+'):
            diff_lines.append(DiffLine(current_line, line[1:], 'added'))
            current_line += 1
        elif line.startswith('-'):
            diff_lines.append(DiffLine(0, line[1:], 'removed'))
        elif line.startswith(' '):
            diff_lines.append(DiffLine(current_line, line[1:], 'context'))
            current_line += 1
    
    return DiffHunk(old_start, new_start, diff_lines)


def parse_unified_diff(diff_text: str) -> FileDiff:
    """
    Parse a unified diff format, processing all hunks within the file diff.
    """
    lines = diff_text.split('\n')
    
    # --- 1. Extract file path and status ---
    file_path = "unknown"
    for line in lines:
        if line.startswith('+++'):
            file_path = extract_file_path(line)
            break
    
    is_new = any('new file mode' in line for line in lines)
    is_deleted = any('deleted file mode' in line for line in lines)
    
    # --- 2. Parse all hunks ---
    hunks = []
    i = 0
    
    while i < len(lines):
        line = lines[i]
        
        # Find hunk header
        if line.startswith('@@'):
            old_start, new_start = parse_hunk_header(line)
            
            # Start parsing this hunk's lines
            diff_lines = []
            i += 1
            current_line = new_start
            
            # Loop until the end of the lines OR until the next hunk header ('@@') is found
            while i < len(lines) and not lines[i].startswith('@@'):
                content = lines[i]
                
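                # Skip empty strings, such as the trailing one produced by split('\n').
                # Caveat: a blank context line whose leading space was stripped is
                # skipped too, which would shift the following line numbers by one.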
                if not content:
                    i += 1
                    continue
                    
                if content.startswith('+'):
                    diff_lines.append(DiffLine(current_line, content[1:], 'added'))
                    current_line += 1
                elif content.startswith('-'):
                    diff_lines.append(DiffLine(0, content[1:], 'removed'))
                elif content.startswith(' '):
                    diff_lines.append(DiffLine(current_line, content[1:], 'context'))
                    current_line += 1
                # Lines that aren't +, -, or context (' ') are typically metadata
                # (like \ No newline at end of file) or header lines we missed.
                # We skip them if they aren't one of the three main types.
                
                i += 1
            
            # Add the completed hunk to the list
            hunks.append(DiffHunk(old_start, new_start, diff_lines))
            
            # IMPORTANT: The loop continues scanning from the current index 'i', 
            # which will either be the start of the next '@@' header or the end of the file.
            # This correctly handles multiple hunks.
            
        else:
            # Skip non-hunk lines (like 'diff', 'index', '---', '+++')
            i += 1
    
    return FileDiff(file_path, hunks, is_new, is_deleted)

# Example usage for demonstration:
if __name__ == "__main__":
    # Test with a diff containing multiple hunks
    multi_hunk_diff = """diff --git a/example.py b/example.py
index 83db48f..f735c60 100644
--- a/example.py
+++ b/example.py
@@ -1,4 +1,5 @@
 import os
+import sys
 
 def foo():
-    print("Hello")
+    print("Hello, world!")
@@ -10,5 +11,6 @@ def bar():
     return 42
 
 def baz():
-    print("Baz")
+    print("Baz!")
+    return None
 
"""
    file_diff = parse_unified_diff(multi_hunk_diff)
    print(f"File: {file_diff.file_path}")
    print(f"Number of hunks: {len(file_diff.hunks)}")
    
    for i, hunk in enumerate(file_diff.hunks):
        print(f"\nHunk {i+1}:")
        print(f"  old_start: {hunk.old_start}, new_start: {hunk.new_start}")
        print(f"  lines: {len(hunk.lines)}")
        for line in hunk.lines:
            print(f"    Line {line.line_number}: {line.change_type} - {line.content}")

```
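
The line-numbering rule used by the parser can be checked in isolation. The sketch below is a simplified stand-in, not part of the lesson's code: `number_new_side` is a hypothetical helper that returns plain tuples instead of `DiffLine` objects. It illustrates that added and context lines each consume a line number in the new file, while removed lines get the placeholder `0` and do not advance the counter.

```python
def number_new_side(new_start: int, body: list[str]) -> list[tuple[int, str]]:
    """Pair each hunk body line with its new-file line number (0 for removals).

    Hypothetical helper for illustration only; it mirrors the counting rule
    of the parser above but is not part of it.
    """
    numbered = []
    current = new_start
    for raw in body:
        if raw.startswith('+'):
            numbered.append((current, 'added'))
            current += 1
        elif raw.startswith('-'):
            # Removed lines do not exist in the new file, so no number is used up.
            numbered.append((0, 'removed'))
        elif raw.startswith(' '):
            numbered.append((current, 'context'))
            current += 1
    return numbered


# Body of the first hunk from the example diff, which starts at new line 1.
body = [
    ' import os',
    '+import sys',
    ' ',
    ' def foo():',
    '-    print("Hello")',
    '+    print("Hello, world!")',
]

for number, kind in number_new_side(1, body):
    print(number, kind)
```

The last added line ends up as line 5, which matches the `+1,5` range in that hunk's header.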

## Key Improvements

The key change from the earlier version is that the `if line.startswith('@@'):` block no longer ends with a `break`, so parsing does not stop after the first hunk.

1.  **Continuous Scanning:** The main `while i < len(lines)` loop now continues iterating through the file lines after processing a hunk.
2.  **Hunk Boundary Tracking:** The inner `while i < len(lines) and not lines[i].startswith('@@'):` loop ensures that the parser stops collecting lines for the current hunk precisely when it encounters the **next `@@` header** or reaches the end of the file.
3.  **Correct Index Management:** After the inner loop finishes, the index `i` is correctly positioned either at the line containing the next hunk header or at the end of the file, allowing the outer loop to seamlessly continue searching for the next hunk.
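
The three points above can be seen in a stripped-down form. The following is a minimal sketch, assuming only a list of raw diff lines: `group_by_hunk` is a hypothetical function that keeps the same outer and inner `while` loops with a shared index `i`, but collects raw strings instead of building `DiffHunk` objects.

```python
def group_by_hunk(lines: list[str]) -> list[list[str]]:
    """Group raw diff lines into one list per hunk, header line included.

    Hypothetical illustration of the loop structure only; it does not
    classify lines or compute line numbers.
    """
    groups = []
    i = 0
    while i < len(lines):                      # outer loop: keeps scanning after each hunk
        if lines[i].startswith('@@'):
            group = [lines[i]]
            i += 1
            # inner loop: stop at the next header or at the end of the input
            while i < len(lines) and not lines[i].startswith('@@'):
                group.append(lines[i])
                i += 1
            groups.append(group)
            # i now points at the next '@@' line (or past the end),
            # so the outer loop picks up exactly where the hunk ended.
        else:
            i += 1                             # header lines before the first hunk
    return groups


sample = [
    '--- a/x.py',
    '+++ b/x.py',
    '@@ -1,1 +1,1 @@',
    '-a',
    '+b',
    '@@ -9,1 +9,2 @@',
    ' c',
    '+d',
]

groups = group_by_hunk(sample)
print(len(groups))
```

Because the inner loop never consumes a `@@` line, no header is skipped and no `break` is needed to hand control back to the outer loop.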

With this change, the parser handles **multi-hunk changes** within a single file's diff, which are common in real-world code reviews!
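
Since `parse_unified_diff` returns a single `FileDiff`, a diff that covers several files would need to be split into per-file chunks first. The sketch below shows one possible way to do that, under the assumption that the input is Git-style output in which every file section begins with a `diff --git ` line. `split_by_file` is a hypothetical helper, not part of the lesson's code.

```python
def split_by_file(diff_text: str) -> list[str]:
    """Split Git-style diff text into one chunk per file.

    Assumption: each file section starts with a line beginning 'diff --git '.
    Anything before the first such line (e.g. a commit message) is ignored.
    """
    chunks: list[list[str]] = []
    for line in diff_text.split('\n'):
        if line.startswith('diff --git '):
            chunks.append([line])          # start a new per-file chunk
        elif chunks:
            chunks[-1].append(line)        # belongs to the current file
    return ['\n'.join(chunk) for chunk in chunks]


two_files = (
    "diff --git a/a.py b/a.py\n"
    "--- a/a.py\n"
    "+++ b/a.py\n"
    "@@ -1 +1 @@\n"
    "-x\n"
    "+y\n"
    "diff --git a/b.py b/b.py\n"
    "--- a/b.py\n"
    "+++ b/b.py\n"
    "@@ -1 +1 @@\n"
    "-p\n"
    "+q\n"
)

chunks = split_by_file(two_files)
print(len(chunks))
```

Each resulting chunk could then be passed to `parse_unified_diff` on its own. Lines inside a hunk always start with `+`, `-`, or a space, so they cannot be mistaken for a `diff --git ` header unless their prefix was stripped.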