# Unit 3

## Git History Extraction with Python

## Introduction: Why Extract Git History?

Welcome back! In the previous lessons, you learned how to set up the **LLM Code Review Assistant** project and how to scan a codebase to collect information about code files. Now we are ready to take the next step: extracting the history of changes made to the project using **Git**.

**Git history** is a record of all the changes that have been made to a project over time. This history is very valuable for understanding how a project has evolved, who made which changes, and why those changes were made. For code review and analysis, being able to look at past commits and file changes helps you spot patterns, understand the reasoning behind code, and even catch mistakes.

In this lesson, you will learn how to use **Python** to extract Git history from a project. This will give you the tools to analyze changes and prepare for more advanced code review tasks.

-----

## Quick Recall: Git Repositories and Commits

Before we dive in, let’s quickly remind ourselves what a **Git repository** and a **commit** are.

  * A **Git repository** is a folder that tracks changes to files using Git. It stores all the information about the project’s history, including every change made to the files.
  * A **commit** is a snapshot of the project at a certain point in time. Each commit has:
      * A unique **hash** (an ID)
      * A **message** describing the change
      * The **author’s name** and **email**
      * The **date** and **time** of the change

In the last lesson, you learned how to scan a codebase for files. Now, we will focus on reading the history of commits and the changes they contain.

-----

## What Information Can We Get from Git History?

When we extract Git history, we are mainly interested in two things:

1.  **Commit Details:** Information about each commit, such as the hash, message, author, and date.
2.  **File Changes:** Which files were changed in each commit, and what the changes were.

Here is an example of what a commit might look like:

```
Commit: 1a2b3c4d
Message: Add user authentication endpoints
Author: Alice Johnson <alice@example.com>
Date: 2024-06-01 10:15:00
```

And an example of a file change in that commit:

```
File: backend/src/api/auth.py
Change: (diff content showing what was added or removed)
```

This information helps you answer questions like:

  * Who made a certain change?
  * When was a feature added?
  * What exactly was changed in a file?
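
As a minimal sketch of how such questions might be answered once the history has been extracted, the snippet below searches a small hand-written list of records. The `CommitRecord` and `ChangeRecord` classes, the sample data (including the `orders.py` path), and the helpers `who_changed` and `first_commit_mentioning` are hypothetical stand-ins for illustration, not part of the lesson's code.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class CommitRecord:  # hypothetical stand-in for an extracted commit
    hash: str
    message: str
    author: str
    date: datetime


@dataclass
class ChangeRecord:  # hypothetical stand-in for one changed file in a commit
    file_path: str
    commit_hash: str


# Hand-written sample data; a real run would fill these from a repository
commits = [
    CommitRecord("9f8e7d6c", "Add order processing functionality",
                 "Carol Davis <carol@example.com>", datetime(2024, 6, 1, 12, 0)),
    CommitRecord("1a2b3c4d", "Add user authentication endpoints",
                 "Alice Johnson <alice@example.com>", datetime(2024, 6, 1, 10, 15)),
]
changes = [
    ChangeRecord("backend/src/api/orders.py", "9f8e7d6c"),
    ChangeRecord("backend/src/api/auth.py", "1a2b3c4d"),
]


def who_changed(file_path, commits, changes):
    """Return the authors of every commit that touched file_path."""
    hashes = {c.commit_hash for c in changes if c.file_path == file_path}
    return [c.author for c in commits if c.hash in hashes]


def first_commit_mentioning(keyword, commits):
    """Return the oldest commit whose message contains keyword, or None."""
    matches = [c for c in commits if keyword.lower() in c.message.lower()]
    return min(matches, key=lambda c: c.date, default=None)


print(who_changed("backend/src/api/auth.py", commits, changes))
print(first_commit_mentioning("authentication", commits).date)
```

Linking the two record types through the commit hash is what makes "who" and "when" questions about a file answerable.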

-----

## Extracting Git History with Python

Let's walk through how to extract Git history using **Python**. We will use the **`gitpython`** library, which makes it easy to interact with Git repositories from Python code.

### Installing Required Dependencies

Before we can start working with Git repositories in Python, we need to install the `gitpython` library. Run this command in your terminal:

```bash
pip install gitpython
```

This will install the library that allows Python to interact with Git repositories.
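
One detail that can trip people up: the package is installed under the name `gitpython` but imported as `git`. The short check below is one possible way to confirm the install before running the rest of the lesson; the helper name `gitpython_available` is made up for this sketch.

```python
import importlib.util


def gitpython_available():
    """Return True if the `git` module (provided by GitPython) can be found."""
    return importlib.util.find_spec("git") is not None


if gitpython_available():
    print("GitPython is installed")
else:
    print("GitPython is missing; run: pip install gitpython")
```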

### Step 1: Import Required Libraries

First, we need to import the libraries we will use. **`gitpython`** is used to interact with Git, and **`dataclasses`** help us organize the data.

```python
from git import Repo
from datetime import datetime
from dataclasses import dataclass
```

  * `Repo` lets us work with a Git repository.
  * `datetime` is used for handling dates.
  * `dataclass` helps us define simple classes for storing data.

### Step 2: Define Data Structures

We will use two data classes: one for commits and one for file changes.

```python
@dataclass
class GitCommit:
    hash: str
    message: str
    author: str
    date: datetime

@dataclass
class FileChange:
    file_path: str
    commit_hash: str
    diff_content: str
```

  * `GitCommit` stores information about each commit.
  * `FileChange` stores information about each file change in a commit.
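
To make the benefit of `@dataclass` concrete, here is a small sketch that builds one instance of each class by hand. The field values are sample data for illustration only; in the lesson they come from a real repository.

```python
from dataclasses import dataclass, asdict
from datetime import datetime


@dataclass
class GitCommit:
    hash: str
    message: str
    author: str
    date: datetime


@dataclass
class FileChange:
    file_path: str
    commit_hash: str
    diff_content: str


# Sample values for illustration only
commit = GitCommit(
    hash="1a2b3c4d",
    message="Add user authentication endpoints",
    author="Alice Johnson <alice@example.com>",
    date=datetime(2024, 6, 1, 10, 15),
)
change = FileChange(
    file_path="backend/src/api/auth.py",
    commit_hash=commit.hash,
    diff_content="+def login():\n+    pass\n",
)

# @dataclass generates __init__, __repr__, and __eq__ for us
print(commit)
same = GitCommit("1a2b3c4d", commit.message, commit.author, commit.date)
print(commit == same)

# asdict() converts an instance to a plain dictionary, handy for serialization
record = asdict(change)
print(record["file_path"])
```

Because equality and a readable `repr` come for free, the records are easy to compare in tests and to inspect while debugging.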

### Step 3: Create the `GitHistoryExtractor` Class

Now, let’s create a class that will handle extracting the history.

```python
class GitHistoryExtractor:
    def __init__(self):
        self.commits = []
        self.file_changes = []
```

The `__init__` method sets up two lists: one for commits and one for file changes.

### Step 4: Extract Commits and File Changes

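Let's add a method to extract commits and their file changes. It belongs inside the `GitHistoryExtractor` class, so indent it one level under the class definition; it is shown here without that indentation to keep the lines short.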

```python
def extract_commits(self, repo_path, max_commits=50):
    """
    Extract commit history and file changes from a git repository.
    
    Args:
        repo_path (str): Path to the git repository directory
        max_commits (int): Maximum number of commits to process (default: 50)
        
    Returns:
        list: List of GitCommit objects containing commit information
    """
    print(f"Extracting git history: {repo_path}")
    
    # Initialize the repository object - this connects to the git repo
    repo = Repo(repo_path)

    # Iterate through commits starting from the most recent (HEAD)
    # max_count limits how many commits we process to avoid overwhelming data
    for commit in repo.iter_commits(max_count=max_commits):
        
        # Create a GitCommit object with all the essential commit information
        git_commit = GitCommit(
            hash=commit.hexsha,                    # Full SHA hash (unique identifier)
            message=commit.message.strip(),       # Commit message without leading/trailing whitespace
            author=f"{commit.author.name} <{commit.author.email}>",  # Author info
            date=commit.committed_datetime        # When the commit was made
        )
        
        # Add this commit to our collection
        self.commits.append(git_commit)
        
        # Extract file changes by comparing this commit with its parent
        # Check if commit has parents (first commit in repo has no parents)
        if commit.parents:
            # Get the immediate parent commit (most commits have one parent)
            parent = commit.parents[0]
            
            # Generate diff between parent and current commit
            # create_patch=True gives us the actual diff content (what changed)
            for diff in parent.diff(commit, create_patch=True):
                
                # diff.b_path is the file path after the change
                # We check b_path exists to handle deleted files gracefully
                if diff.b_path:
                    file_change = FileChange(
                        file_path=diff.b_path,                    # Path to the changed file
                        commit_hash=commit.hexsha,                # Which commit this change belongs to
                        # Decode binary diff data to string, ignoring errors
                        diff_content=diff.diff.decode('utf-8', errors='ignore') 
                    )
                    
                    # Add this file change to our collection
                    self.file_changes.append(file_change)

    # Report what we found
    print(f"Found {len(self.commits)} commits, {len(self.file_changes)} changes")
    return self.commits
```

**Breakdown of `extract_commits`:**

  * **Repository Connection:** `Repo(repo_path)` connects to the Git repository.
  * **Commit Iteration:** `repo.iter_commits(max_count=max_commits)` iterates through commits from most recent backwards, limited by `max_commits`.
  * **Commit Data:** Uses properties like `commit.hexsha`, `commit.message.strip()`, and `commit.committed_datetime` to populate the `GitCommit` dataclass.
  * **File Change (Diff):**
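      * It checks `if commit.parents` so that the root commit, which has no parent to diff against, is recorded without file changes. For merge commits, only the first parent is compared.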
      * It uses `parent.diff(commit, create_patch=True)` to generate the changes.
      * `diff.b_path` is used for the file path after the change.
      * `diff.diff.decode('utf-8', errors='ignore')` safely converts the binary diff content to a readable string.
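
The effect of `errors='ignore'` can be seen without a repository. The sketch below decodes a hand-made byte string (sample data, not real diff output) three ways; it uses only Python's built-in `bytes.decode`.

```python
# Bytes containing valid UTF-8 ("café") followed by an invalid byte (0xff)
raw = b"+name = 'caf\xc3\xa9'\xff\n"

ignored = raw.decode("utf-8", errors="ignore")    # drops the invalid byte
replaced = raw.decode("utf-8", errors="replace")  # substitutes U+FFFD

print(repr(ignored))
print(repr(replaced))

strict_error = None
try:
    raw.decode("utf-8")  # the default, errors="strict", raises
except UnicodeDecodeError as exc:
    strict_error = exc
    print(f"strict decoding failed: {exc.reason}")
```

Ignoring errors keeps the extractor from crashing on binary or oddly encoded files, at the cost of silently dropping bytes; `errors="replace"` is an alternative when it matters that something was lost.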

-----

### Step 5: Using the Extractor

Let’s see how to use this class in a script.

```python
def main():
    extractor = GitHistoryExtractor()
    repo_path = "./sample-ecommerce-api"
    
    # You must ensure that './sample-ecommerce-api' is a valid git repository
    commits = extractor.extract_commits(repo_path, max_commits=10)
    
    print("\nRecent commits:")
    for i, commit in enumerate(commits[:3]):
        print(f"{i+1}. {commit.hash[:8]} - {commit.message[:50]}...")
        print(f"   Author: {commit.author}")
        print(f"   Date: {commit.date}")
        print()

# Typically, this would be wrapped in a __main__ block for execution
# if __name__ == "__main__":
#     main()
```

**Example Output:**

```
Extracting git history: ./sample-ecommerce-api
Found 5 commits, 4 changes

Recent commits:
1. 9f8e7d6c - Add order processing functionality...
   Author: Carol Davis <carol@example.com>
   Date: 2024-06-01 12:00:00

2. 7a6b5c4d - Implement product CRUD operations...
   Author: Bob Smith <bob@example.com>
   Date: 2024-06-01 11:30:00

3. 5e4d3c2b - Add user authentication endpoints...
   Author: Alice Johnson <alice@example.com>
   Date: 2024-06-01 11:00:00
```

-----

## Summary And What’s Next

In this lesson, you learned how to extract Git history from a project using **Python**. You saw how to:

  * Use the **`gitpython`** library to access a repository.
  * Collect **commit details** and **file changes**.
  * Organize this information using **data classes**.

This prepares you for the practice exercises, where you will try out these steps yourself and get comfortable working with Git history in Python. Understanding Git history is a key skill for code review and project analysis, and you are now ready to put it into practice!

## Customizing Git Commit Display Format

Now that you've learned how to extract Git history using Python, let's customize how this information is displayed to users. In this exercise, you'll modify the output format of our commit display to make it more concise and readable.

Your task is to update the `print` statement in the `main` function that shows commit information. Specifically:

  * Change the commit hash display to show only the first 6 characters (instead of 8).
  * Change the commit message display to show only the first 30 characters (instead of 50).

This small change makes the output more compact while still providing enough information to identify commits. It is also a good way to become familiar with how the extracted Git data flows through the application.

By completing this exercise, you'll take your first step in customizing how Git history is presented, which is an important skill for building effective code review tools.

```python
from git import Repo
from datetime import datetime
import os

# Minimal dataclasses for continuity with the outline
from dataclasses import dataclass

@dataclass
class GitCommit:
    hash: str
    message: str
    author: str
    date: datetime

@dataclass
class FileChange:
    file_path: str
    commit_hash: str
    diff_content: str

class GitHistoryExtractor:
    def __init__(self):
        self.commits = []
        self.file_changes = []
    
    def extract_commits(self, repo_path, max_commits=50):
        print(f"Extracting git history: {repo_path}")
        repo = Repo(repo_path)
        
        for commit in repo.iter_commits(max_count=max_commits):
            git_commit = GitCommit(
                hash=commit.hexsha,
                message=commit.message.strip(),
                author=f"{commit.author.name} <{commit.author.email}>",
                date=commit.committed_datetime
            )
            self.commits.append(git_commit)
            
            # Extract file changes
            if commit.parents:
                parent = commit.parents[0]
                for diff in parent.diff(commit, create_patch=True):
                    if diff.b_path:
                        file_change = FileChange(
                            file_path=diff.b_path,
                            commit_hash=commit.hexsha,
                            diff_content=diff.diff.decode('utf-8', errors='ignore')
                        )
                        self.file_changes.append(file_change)
        
        print(f"Found {len(self.commits)} commits, {len(self.file_changes)} changes")
        return self.commits


def main():
    extractor = GitHistoryExtractor()
    
    # Use the sample project directory as the git repository
    repo_path = "./sample-ecommerce-api"
    
    # Check that repo_path is a git repository before extracting
    if not os.path.exists(os.path.join(repo_path, ".git")):
        print(f"Not a git repository: {repo_path}")
        # For demo purposes, we'll just show the class structure
        print("GitHistoryExtractor initialized successfully!")
        return
    
    try:
        commits = extractor.extract_commits(repo_path, max_commits=10)
        
        # Display first few commits
        print("\nRecent commits:")
        for i, commit in enumerate(commits[:3]):
            # TODO: Change the commit hash display to show only the first 6 characters (instead of 8)
            # TODO: Change the commit message display to show only the first 30 characters (instead of 50)
            print(f"{i+1}. {commit.hash[:8]} - {commit.message[:50]}...")
            print(f"   Author: {commit.author}")
            print(f"   Date: {commit.date}")
            print()
            
    except Exception as e:
        print(f"Error extracting git history: {e}")


if __name__ == "__main__":
    main()

```

A simple, effective change to improve readability! Here is the full script with the updated `main` function: the commit hash is truncated to 6 characters and the commit message to 30 characters.

## Updated `main` Function

The modification is applied directly to the `print` statement within the loop that displays the recent commits.

```python
from git import Repo
from datetime import datetime
import os

# Minimal dataclasses for continuity with the outline
from dataclasses import dataclass

@dataclass
class GitCommit:
    hash: str
    message: str
    author: str
    date: datetime

@dataclass
class FileChange:
    file_path: str
    commit_hash: str
    diff_content: str

class GitHistoryExtractor:
    def __init__(self):
        self.commits = []
        self.file_changes = []
    
    def extract_commits(self, repo_path, max_commits=50):
        print(f"Extracting git history: {repo_path}")
        # NOTE: In a real scenario, you'd add error handling for Repo initialization
        repo = Repo(repo_path)
        
        for commit in repo.iter_commits(max_count=max_commits):
            git_commit = GitCommit(
                hash=commit.hexsha,
                message=commit.message.strip(),
                author=f"{commit.author.name} <{commit.author.email}>",
                date=commit.committed_datetime
            )
            self.commits.append(git_commit)
            
            # Extract file changes
            if commit.parents:
                parent = commit.parents[0]
                for diff in parent.diff(commit, create_patch=True):
                    if diff.b_path:
                        file_change = FileChange(
                            file_path=diff.b_path,
                            commit_hash=commit.hexsha,
                            diff_content=diff.diff.decode('utf-8', errors='ignore')
                        )
                        self.file_changes.append(file_change)
        
        print(f"Found {len(self.commits)} commits, {len(self.file_changes)} changes")
        return self.commits


def main():
    extractor = GitHistoryExtractor()
    
    # Use a sample directory for the git repository
    repo_path = "./sample-ecommerce-api"
    
    # Check that repo_path is a git repository before extracting
    if not os.path.exists(os.path.join(repo_path, ".git")):
        print(f"Not a git repository: {repo_path}")
        # Nothing to extract without a repository, so stop here
        print("Please ensure a valid git repo exists at './sample-ecommerce-api' to see full output.")
        return
    
    try:
        commits = extractor.extract_commits(repo_path, max_commits=10)
        
        # Display first few commits
        print("\nRecent commits:")
        for i, commit in enumerate(commits[:3]):
            # FIX: Commit hash display changed from [:8] to [:6]
            # FIX: Commit message display changed from [:50] to [:30]
            print(f"{i+1}. {commit.hash[:6]} - {commit.message[:30]}...")
            print(f"   Author: {commit.author}")
            print(f"   Date: {commit.date}")
            print()
            
    except Exception as e:
        # Repo() raises InvalidGitRepositoryError or NoSuchPathError for a bad path
        print(f"Error extracting git history. Ensure '{repo_path}' is a valid repository. Error: {e}")


if __name__ == "__main__":
    main()
```

The key line that was updated is:

```python
print(f"{i+1}. {commit.hash[:6]} - {commit.message[:30]}...")
```
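
Hard-coded slice lengths work, but they always append `...`, even when nothing was cut, and a multi-line commit message can spill onto extra lines. One possible refinement, sketched below, moves the formatting into a helper. The function `format_commit_line`, its parameters, and the sample hashes and messages are all hypothetical and not part of the exercise solution.

```python
def format_commit_line(index, commit_hash, message, hash_len=6, msg_len=30):
    """Build one summary line; all parameter names here are illustrative."""
    # Use only the first line of the message (the subject line)
    lines = message.splitlines()
    subject = lines[0] if lines else ""
    # Append "..." only when the subject was actually shortened
    if len(subject) > msg_len:
        subject = subject[:msg_len] + "..."
    return f"{index}. {commit_hash[:hash_len]} - {subject}"


# Sample hashes and messages for illustration
print(format_commit_line(1, "9f8e7d6c5b4a", "Add order processing functionality"))
print(format_commit_line(2, "7a6b5c4d3e2f", "Fix typo\n\nLonger body text"))
```

Keeping the lengths as parameters means the display can be tuned in one place instead of editing every `print` call.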

## Filtering Git History by Author

Now that you've customized how Git commits are displayed, let's add a useful filtering feature to our extractor. When working with large projects, you often need to focus on changes made by specific team members.

In this exercise, you'll enhance the `extract_commits` method to accept an `author_filter` parameter. When provided, the method should include only commits from authors whose names contain the filter string (case-insensitive).

Your task involves:

  * Adding the `author_filter` parameter with a default value of `None`
  * Adding logic to check whether commit authors match the filter
  * Updating the `main()` function to demonstrate this filtering capability

This filtering feature is helpful when analyzing contributions from specific team members or tracking down who made certain changes to the codebase. By implementing it, you'll gain practical experience working with Git metadata and learn how to extract targeted information from a repository's history.

```python
from git import Repo
from datetime import datetime
import os

# Minimal dataclasses for continuity with the outline
from dataclasses import dataclass

@dataclass
class GitCommit:
    hash: str
    message: str
    author: str
    date: datetime

@dataclass
class FileChange:
    file_path: str
    commit_hash: str
    diff_content: str

class GitHistoryExtractor:
    def __init__(self):
        self.commits = []
        self.file_changes = []
    
    # TODO: Add an author_filter parameter with a default value of None
    def extract_commits(self, repo_path, max_commits=50):
        print(f"Extracting git history: {repo_path}")
        repo = Repo(repo_path)
        
        for commit in repo.iter_commits(max_count=max_commits):
            author_name = commit.author.name
            
            # TODO: Add a condition to check if the commit's author name contains the filter string (case-insensitive)
            # TODO: Skip commits that don't match the author filter
            
            git_commit = GitCommit(
                hash=commit.hexsha,
                message=commit.message.strip(),
                author=f"{author_name} <{commit.author.email}>",
                date=commit.committed_datetime
            )
            self.commits.append(git_commit)
            
            # Extract file changes
            if commit.parents:
                parent = commit.parents[0]
                for diff in parent.diff(commit, create_patch=True):
                    if diff.b_path:
                        file_change = FileChange(
                            file_path=diff.b_path,
                            commit_hash=commit.hexsha,
                            diff_content=diff.diff.decode('utf-8', errors='ignore')
                        )
                        self.file_changes.append(file_change)
        
        print(f"Found {len(self.commits)} commits, {len(self.file_changes)} changes")
        return self.commits


def main():
    extractor = GitHistoryExtractor()
    
    # Use the sample project directory as the git repository
    repo_path = "./sample-ecommerce-api"
    
    # Check that repo_path is a git repository before extracting
    if not os.path.exists(os.path.join(repo_path, ".git")):
        print(f"Not a git repository: {repo_path}")
        # For demo purposes, we'll just show the class structure
        print("GitHistoryExtractor initialized successfully!")
        return
    
    try:
        # Extract all commits first
        commits = extractor.extract_commits(repo_path, max_commits=10)
        
        # Display first few commits
        print("\nAll recent commits:")
        for i, commit in enumerate(commits[:3]):
            print(f"{i+1}. {commit.hash[:8]} - {commit.message[:50]}...")
            print(f"   Author: {commit.author}")
            print(f"   Date: {commit.date}")
            print()
        
        # TODO: Reset the extractor and create a new instance
        # TODO: Define an author name to filter by (e.g., "Alice")
        # TODO: Call extract_commits with the author_filter parameter
        
        # TODO: Display filtered commits with appropriate header
        # TODO: Add logic to handle the case when no commits match the filter
            
    except Exception as e:
        print(f"Error extracting git history: {e}")


if __name__ == "__main__":
    main()

```

That's a practical and necessary feature for any code analysis tool! Here is the updated code, incorporating the `author_filter` into the `extract_commits` method and demonstrating its use in `main()`.

The key change is the check: `if author_filter and author_filter.lower() not in author_name.lower():`.

```python
from git import Repo
from datetime import datetime
import os

# Minimal dataclasses for continuity with the outline
from dataclasses import dataclass

@dataclass
class GitCommit:
    hash: str
    message: str
    author: str
    date: datetime

@dataclass
class FileChange:
    file_path: str
    commit_hash: str
    diff_content: str

class GitHistoryExtractor:
    def __init__(self):
        self.commits = []
        self.file_changes = []
    
    # FIX: Added author_filter parameter with default None
    def extract_commits(self, repo_path, max_commits=50, author_filter=None):
        print(f"Extracting git history: {repo_path}")
        # Clear previous results before a new scan
        self.commits = []
        self.file_changes = []

        try:
            repo = Repo(repo_path)
        except Exception as e:
            print(f"Error initializing Git repository at {repo_path}: {e}")
            return []

        for commit in repo.iter_commits(max_count=max_commits):
            author_name = commit.author.name
            
            # FIX: Logic to check if the commit's author name matches the filter
            # Note: max_count caps the commits scanned, so fewer may remain after filtering
            if author_filter:
                # Case-insensitive check: if filter is provided and not found in author name, skip
                if author_filter.lower() not in author_name.lower():
                    continue # Skip to the next commit
            
            git_commit = GitCommit(
                hash=commit.hexsha,
                message=commit.message.strip(),
                author=f"{author_name} <{commit.author.email}>",
                date=commit.committed_datetime
            )
            self.commits.append(git_commit)
            
            # Extract file changes
            if commit.parents:
                parent = commit.parents[0]
                for diff in parent.diff(commit, create_patch=True):
                    # Only process if the file exists after the change (not a deletion)
                    if diff.b_path:
                        file_change = FileChange(
                            file_path=diff.b_path,
                            commit_hash=commit.hexsha,
                            diff_content=diff.diff.decode('utf-8', errors='ignore')
                        )
                        self.file_changes.append(file_change)
        
        print(f"Found {len(self.commits)} commits, {len(self.file_changes)} changes")
        return self.commits


def display_commits(commits, header="Recent commits:"):
    print(f"\n--- {header} ({len(commits)} found) ---")
    if not commits:
        print("No commits found matching the criteria.")
        return
        
    for i, commit in enumerate(commits[:3]): # Display top 3
        # Use the customized display format (hash[:6], message[:30])
        print(f"{i+1}. {commit.hash[:6]} - {commit.message[:30]}...")
        print(f"   Author: {commit.author}")
        print(f"   Date: {commit.date.strftime('%Y-%m-%d %H:%M:%S')}")
        print()


def main():
    # Setup for a runnable demonstration
    repo_path = "./sample-ecommerce-api"
    if not os.path.exists(os.path.join(repo_path, ".git")):
        print(f"Git repository not found at {repo_path}. Cannot run demonstration.")
        return

    try:
        # 1. Unfiltered Scan
        extractor = GitHistoryExtractor()
        all_commits = extractor.extract_commits(repo_path, max_commits=10)
        display_commits(all_commits, "All recent commits")
        
        # 2. Filtered Scan
        # FIX: Reset the extractor by creating a new instance
        extractor = GitHistoryExtractor() 
        
        # FIX: Define an author name to filter by
        author_filter = "Alice" 
        
        # FIX: Call extract_commits with the author_filter parameter
        filtered_commits = extractor.extract_commits(repo_path, max_commits=10, author_filter=author_filter)
        
        # FIX: Display filtered commits
        display_commits(filtered_commits, f"Commits filtered by Author: '{author_filter}'")
            
    except Exception as e:
        print(f"Error extracting git history: {e}")


if __name__ == "__main__":
    # NOTE: For this to run successfully, the './sample-ecommerce-api' directory must exist
    # and be a valid git repository with commit history from different authors.
    main()
```
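
In the solution above, `max_count` limits how many commits are scanned before the author check runs, so asking for 10 commits by "Alice" can return fewer even when more exist further back in history. One possible alternative, sketched here with plain tuples instead of a real repository, is to filter lazily and cap the number of matches with `itertools.islice`. The function `matching_commits` and the sample history are assumptions made for this illustration.

```python
from itertools import islice

# (author name, message) pairs, newest first; sample data for illustration
history = [
    ("Carol Davis", "Add order processing functionality"),
    ("Bob Smith", "Implement product CRUD operations"),
    ("Alice Johnson", "Add user authentication endpoints"),
    ("alice johnson", "Set up project skeleton"),
]


def matching_commits(history, author_filter=None, max_commits=50):
    """Return up to max_commits entries whose author contains author_filter."""
    needle = author_filter.lower() if author_filter else None
    filtered = (
        entry for entry in history
        if needle is None or needle in entry[0].lower()
    )
    # islice stops after max_commits matches, not after max_commits scanned
    return list(islice(filtered, max_commits))


print(matching_commits(history, "ALICE", max_commits=2))
print(len(matching_commits(history, max_commits=3)))
```

With a real repository, the generator would wrap the commit iterator rather than a list; that adaptation is left untested here.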

## Identifying Hot Spots in Git History

## Filtering Git History by File Type

## Generating Repository Activity Insights at Scale
