# Unit 4

## CLI Integration and Data Analysis

## Welcome Back! Combining Tools into a Command-Line Interface

Welcome back! So far, you have learned how to scan a codebase for files and extract git history using **Python**. In this lesson, you will see how to combine these tools into a single **command-line interface (CLI)** program. This CLI will let you analyze any codebase by simply running a command in your terminal.

Using a CLI is a common way to interact with developer tools. It allows you to run your analysis on any project folder, see results right away, and even automate tasks. By the end of this lesson, you will understand how to build a simple CLI that ties together your codebase scanner and git history extractor and displays useful statistics about a project.

-----

## Quick Recap: Scanning and Git Extraction

Before we dive in, let’s quickly remind ourselves of the two main components you built in previous lessons:

  * **Repository Scanner:** This tool walks through a project directory, finds code files, reads their contents, and detects their programming language.
  * **Git History Extractor:** This tool uses the `gitpython` library to read the commit history and file changes from a git repository.

You already have classes like `RepositoryScanner` and `GitHistoryExtractor` that handle these tasks. In this lesson, you will see how to use them together in a new way.

-----

## Building a Simple CLI with `argparse`

To make your tool easy to use from the terminal, you will use Python’s **`argparse`** library. This library helps you accept arguments from the command line, such as the path to the repository you want to analyze.

Let’s start by importing `argparse` and setting up a basic CLI:

```python
import argparse

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--repo', default='.', help='Repository path')
    args = parser.parse_args()
    print(f"Analyzing repository at: {args.repo}")

if __name__ == "__main__":
    main()
```

**Explanation:**

  * `argparse.ArgumentParser()` creates a parser for command-line arguments.
  * `add_argument('--repo', ...)` lets the user specify a repository path with `--repo`. If not provided, it defaults to the current directory (`.`).
  * `args = parser.parse_args()` reads the arguments from the command line.
  * The script prints out the path it will analyze.
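
`parse_args()` also accepts an explicit list of strings, which makes it easy to try the parser without a terminal. The following is a minimal sketch; the `build_parser` helper name is an assumption made for illustration and is not part of the lesson code.

```python
import argparse

def build_parser():
    # Hypothetical helper: builds the same parser that main() creates
    parser = argparse.ArgumentParser()
    parser.add_argument('--repo', default='.', help='Repository path')
    return parser

parser = build_parser()

# No arguments: --repo falls back to its default value
default_args = parser.parse_args([])
print(default_args.repo)  # .

# Explicit path, as if it had been typed on the command line
custom_args = parser.parse_args(['--repo', './sample-ecommerce-api'])
print(custom_args.repo)  # ./sample-ecommerce-api
```

When `parse_args()` is called with no list, as in `main()`, it reads the arguments from `sys.argv` instead.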

**Example usage:**

```bash
python3 code_review_assistant/cli.py --repo ./sample-ecommerce-api
```

This command tells the program to analyze the `sample-ecommerce-api` folder.

-----

## Connecting the Pieces: Running the Analysis

Now, let’s see how to use your existing scanner and git extractor inside the CLI. You will create a new class called **`CodebaseAnalyzer`** that brings everything together.

First, import your scanner and git extractor:

```python
from scanner import RepositoryScanner
from git_extractor import GitHistoryExtractor
```

Next, define the `CodebaseAnalyzer` class:

```python
class CodebaseAnalyzer:
    def __init__(self):
        self.scanner = RepositoryScanner()
        self.git_extractor = GitHistoryExtractor()

    def analyze_repository(self, repo_path):
        print("Starting analysis...")
        
        # Scan files
        files = self.scanner.scan_repository(repo_path)
        
        # Extract git history
        commits = self.git_extractor.extract_commits(repo_path)
        
        # Show results
        print(f"\nResults:")
        print(f"Files: {len(files)}")
        print(f"Commits: {len(commits)}")
        print(f"File changes: {len(self.git_extractor.file_changes)}")
        
        # Language stats
        languages = {}
        for file in files:
            languages[file.language] = languages.get(file.language, 0) + 1
        print(f"Languages: {languages}")
```

**Explanation:**

  * The `CodebaseAnalyzer` class creates instances of your scanner and git extractor.
  * The `analyze_repository` method:
      * Scans the repository for code files.
      * Extracts the git commit history.
      * Prints out the number of files, commits, file changes, and a breakdown of programming languages.
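
The language tally in `analyze_repository` is a plain counting loop, so it can be tried in isolation. The sketch below uses `collections.Counter` as an alternative to the manual dictionary, with `SimpleNamespace` stand-ins for the file objects; the only thing it assumes about real scanner results is that each one has a `language` attribute.

```python
from collections import Counter
from types import SimpleNamespace

# Stand-in file objects; real ones would come from RepositoryScanner
files = [
    SimpleNamespace(language='Python'),
    SimpleNamespace(language='Python'),
    SimpleNamespace(language='JavaScript'),
]

# Counter tallies each language in one pass, like the manual dict loop
languages = Counter(file.language for file in files)
print(f"Languages: {dict(languages)}")
```

Both approaches produce the same counts; `Counter` mainly saves the `languages.get(..., 0) + 1` bookkeeping.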

Now, update your `main()` function to use this analyzer:

```python
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--repo', default='.', help='Repository path')
    args = parser.parse_args()
    
    analyzer = CodebaseAnalyzer()
    analyzer.analyze_repository(args.repo)
```

-----

## Displaying Results: Making Sense of the Output

When you run your CLI tool, you will see output like this:

```
Starting analysis...
Scanning: ./sample-ecommerce-api
Found 12 files
Extracting git history: ./sample-ecommerce-api
Found 25 commits, 40 changes

Results:
Files: 12
Commits: 25
File changes: 40
Languages: {'Python': 10, 'JavaScript': 2}
```

**What does this mean?**

  * **Files:** The number of code files found in the repository.
  * **Commits:** The number of git commits extracted (up to the limit set in your extractor).
  * **File changes:** The number of file changes detected in the commit history.
  * **Languages:** A dictionary showing how many files were found for each programming language.

This output gives you a quick overview of the project’s size, activity, and language breakdown. It’s a useful starting point for any code review or analysis.
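
The raw dictionary is compact but gets harder to read as the number of languages grows. One possible refinement, sketched here under the assumption that the counts arrive as a plain `dict`, is to print one line per language with its share of the files. The function name `format_language_breakdown` is hypothetical.

```python
def format_language_breakdown(languages):
    """Return one line per language, most common first, with its share of files."""
    total = sum(languages.values())
    lines = []
    # Sort by file count, largest first
    for name, count in sorted(languages.items(), key=lambda item: item[1], reverse=True):
        share = (count / total * 100) if total else 0.0
        lines.append(f"{name}: {count} files ({share:.1f}%)")
    return lines

breakdown = format_language_breakdown({'Python': 10, 'JavaScript': 2})
for line in breakdown:
    print(line)
# Python: 10 files (83.3%)
# JavaScript: 2 files (16.7%)
```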

-----

## Summary and Practice Preview

In this lesson, you learned how to build a simple **command-line interface (CLI)** that brings together your codebase scanner and git history extractor. You saw how to use `argparse` to accept user input, how to connect your existing components, and how to display useful statistics about a codebase.

Next, you will get a chance to practice running and modifying this CLI tool yourself. Try analyzing different repositories, experiment with the output, and see how the results change. This hands-on practice will help you become comfortable using and extending CLI tools for code analysis.

## Adding Output Options to CLI Tool

Now that you've built a basic CLI tool that combines your scanner and git extractor, let's make it more user-friendly by adding output options. Different users have different needs — some want just the key facts, while others need all the details.

In this exercise, you'll enhance the CLI by adding a new `--output-format` argument that lets users choose between "simple" and "detailed" output modes. When users select "simple" (the default), they'll see just the essential file and commit counts. With "detailed", they'll get the full analysis, including language breakdown and file changes.

Your tasks:

  * Add the new `--output-format` argument to the parser with appropriate choices.
  * Update the `analyze_repository` method to accept this parameter.
  * Modify the output logic to display different information based on the selected format.

This enhancement makes your tool more versatile and respects users' preferences for how much information they want to see. Command-line tools that adapt to different use cases are much more likely to be adopted by development teams.

```python
import argparse
from scanner import RepositoryScanner
from git_extractor import GitHistoryExtractor

class CodebaseAnalyzer:
    def __init__(self):
        self.scanner = RepositoryScanner()
        self.git_extractor = GitHistoryExtractor()
    
    # TODO: Update the method signature to accept an output_format parameter with "simple" as the default
    def analyze_repository(self, repo_path):
        print("Starting analysis...")
        
        # Scan files
        files = self.scanner.scan_repository(repo_path)
        
        # Extract git history
        commits = self.git_extractor.extract_commits(repo_path)
        
        # Show results
        print(f"\nResults:")
        print(f"Files: {len(files)}")
        print(f"Commits: {len(commits)}")
        
        # TODO: Add conditional logic here to only show the detailed information below 
        # when output_format is "detailed"
        print(f"File changes: {len(self.git_extractor.file_changes)}")
        
        # Language stats
        languages = {}
        for file in files:
            languages[file.language] = languages.get(file.language, 0) + 1
        print(f"Languages: {languages}")

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--repo', default='.', help='Repository path')
    # TODO: Add a new argument called --output-format that accepts 'simple' or 'detailed' with 'simple' as default
    args = parser.parse_args()
    
    analyzer = CodebaseAnalyzer()
    # TODO: Update this line to pass the output_format argument to the analyze_repository method
    analyzer.analyze_repository(args.repo)

if __name__ == "__main__":
    main()
```

That's a great next step! Making CLI tools adaptable with different output formats significantly improves user experience.

Here is the updated code incorporating the new `--output-format` argument and the conditional output logic inside the `CodebaseAnalyzer` class.

```python
import argparse
from scanner import RepositoryScanner
from git_extractor import GitHistoryExtractor

class CodebaseAnalyzer:
    def __init__(self):
        self.scanner = RepositoryScanner()
        self.git_extractor = GitHistoryExtractor()
    
    # Updated: Added output_format parameter with "simple" as the default
    def analyze_repository(self, repo_path, output_format="simple"):
        print("Starting analysis...")
        
        # Scan files
        files = self.scanner.scan_repository(repo_path)
        
        # Extract git history
        commits = self.git_extractor.extract_commits(repo_path)
        
        # Show results
        print(f"\nResults:")
        # Essential information (always shown)
        print(f"Files: {len(files)}")
        print(f"Commits: {len(commits)}")
        
        # Conditional logic: Show detailed information only when output_format is "detailed"
        if output_format == "detailed":
            print("\n--- Detailed Analysis ---")
            print(f"File changes: {len(self.git_extractor.file_changes)}")
            
            # Language stats
            languages = {}
            for file in files:
                languages[file.language] = languages.get(file.language, 0) + 1
            print(f"Languages: {languages}")

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--repo', default='.', help='Repository path')
    # Added: The --output-format argument
    parser.add_argument(
        '--output-format', 
        choices=['simple', 'detailed'], # Specify valid choices
        default='simple',             # Set the default value
        help='Output format (simple or detailed)'
    )
    args = parser.parse_args()
    
    analyzer = CodebaseAnalyzer()
    # Updated: Passed the output_format argument to the analyze_repository method
    analyzer.analyze_repository(args.repo, args.output_format)

if __name__ == "__main__":
    main()
```

### Key Changes Explained:

1.  **`main()` function:**

      * We added:
        ```python
        parser.add_argument(
            '--output-format', 
            choices=['simple', 'detailed'],
            default='simple',
            help='Output format (simple or detailed)'
        )
        ```
        This tells `argparse` to expect an optional `--output-format` argument, validates its value against the allowed `choices`, and sets a `default` value.
      * We updated the analyzer call:
        ```python
        analyzer.analyze_repository(args.repo, args.output_format)
        ```
        This passes the value of the new argument from `argparse` to the analyzer method.

2.  **`CodebaseAnalyzer.analyze_repository()` method:**

      * The method signature was updated to accept the new parameter:
        ```python
        def analyze_repository(self, repo_path, output_format="simple"):
        ```
      * The detailed output logic is now enclosed in a conditional block:
        ```python
        if output_format == "detailed":
            # ... detailed stats (file changes, languages) ...
        ```
        This ensures the detailed information is only printed when the user explicitly requests it.
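
The `choices` validation can be observed without running the full tool. This is a small sketch that builds only the parser and feeds it argument lists directly; it relies on standard `argparse` behavior, where a rejected value prints a usage error to stderr and exits with status 2.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    '--output-format',
    choices=['simple', 'detailed'],
    default='simple',
    help='Output format (simple or detailed)',
)

# The dash in --output-format becomes an underscore in the attribute name
print(parser.parse_args([]).output_format)  # simple
print(parser.parse_args(['--output-format', 'detailed']).output_format)  # detailed

# A value outside choices makes argparse report an error and exit
try:
    parser.parse_args(['--output-format', 'verbose'])
except SystemExit as exc:
    exit_code = exc.code
    print(f"argparse exited with status {exit_code}")
```

Because `argparse` rejects invalid values before `analyze_repository` runs, the method itself only ever needs to distinguish "simple" from "detailed".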

## Adding Fast Scan Mode to CLI

Excellent work on adding output format options to your CLI! Now, let's make your tool even more flexible by adding a "scan-only" mode that skips the git extraction process entirely.

Git history extraction can be time-consuming for large repositories, and sometimes users just want a quick overview of the codebase structure. In this exercise, you'll add a fast analysis option that focuses only on file statistics.

Your tasks:

  * Create a new method called `scan_only_mode` in the `CodebaseAnalyzer` class that:
      * Uses only the repository scanner (skips git extraction)
      * Counts the total lines of code across all files
      * Displays the file count, language breakdown, and total lines
  * Add a new command-line flag `--scan-only` that activates this faster analysis mode

This enhancement will make your CLI tool more practical for everyday use, especially when working with large codebases where full analysis might take too long. Professional developer tools often provide these kinds of "quick scan" options to respect users' time.

```python
import argparse
from scanner import RepositoryScanner
from git_extractor import GitHistoryExtractor

class CodebaseAnalyzer:
    def __init__(self):
        self.scanner = RepositoryScanner()
        self.git_extractor = GitHistoryExtractor()
    
    def analyze_repository(self, repo_path):
        print("Starting full analysis...")
        
        # Scan files
        files = self.scanner.scan_repository(repo_path)
        
        # Extract git history
        commits = self.git_extractor.extract_commits(repo_path)
        
        # Show results
        print(f"\nResults:")
        print(f"Files: {len(files)}")
        print(f"Commits: {len(commits)}")
        print(f"File changes: {len(self.git_extractor.file_changes)}")
        
        # Language stats
        languages = {}
        for file in files:
            languages[file.language] = languages.get(file.language, 0) + 1
        print(f"Languages: {languages}")
    
    # TODO: Create a new method called scan_only_mode that takes repo_path as a parameter
    # This method should:
    # 1. Print a starting message
    # 2. Use the scanner to get files (but skip git extraction)
    # 3. Count total lines of code across all files
    # 4. Display results including file count, total lines, and language breakdown

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--repo', default='.', help='Repository path')
    # TODO: Add a new argument --scan-only that is a flag (no value needed)
    args = parser.parse_args()
    
    analyzer = CodebaseAnalyzer()
    
    # TODO: Add conditional logic to call either scan_only_mode or analyze_repository
    # based on whether the --scan-only flag is present
    analyzer.analyze_repository(args.repo)

if __name__ == "__main__":
    main()

```

You're absolutely right; adding a **"scan-only" mode** is a professional and practical enhancement. It provides users with a faster analysis when full Git history isn't needed.

Here is the updated code implementing the new `scan_only_mode` method and integrating the `--scan-only` flag into the CLI logic.

```python
import argparse
from scanner import RepositoryScanner
# Assuming the file classes (like File or CodeFile) have a 'lines_of_code' attribute 
# that the scanner populates. We don't import them, but we assume they exist for the calculation.
from git_extractor import GitHistoryExtractor

class CodebaseAnalyzer:
    def __init__(self):
        self.scanner = RepositoryScanner()
        self.git_extractor = GitHistoryExtractor()
    
    def analyze_repository(self, repo_path):
        print("Starting full analysis...")
        
        # Scan files
        files = self.scanner.scan_repository(repo_path)
        
        # Extract git history
        commits = self.git_extractor.extract_commits(repo_path)
        
        # Show results
        print(f"\nResults:")
        print(f"Files: {len(files)}")
        print(f"Commits: {len(commits)}")
        print(f"File changes: {len(self.git_extractor.file_changes)}")
        
        # Language stats
        languages = {}
        for file in files:
            languages[file.language] = languages.get(file.language, 0) + 1
        print(f"Languages: {languages}")

    # New method for fast, scan-only mode
    def scan_only_mode(self, repo_path):
        print("Starting fast analysis (scan-only mode)...")
        
        # 1. Use the scanner to get files (skips git extraction)
        files = self.scanner.scan_repository(repo_path)
        
        # 2. Count total lines of code across all files
        total_lines = sum(file.lines_of_code for file in files)
        
        # 3. Calculate Language stats
        languages = {}
        for file in files:
            languages[file.language] = languages.get(file.language, 0) + 1
            
        # 4. Display results
        print(f"\nScan Results:")
        print(f"Total Files: {len(files)}")
        print(f"Total Lines of Code: {total_lines}")
        print(f"Languages: {languages}")

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--repo', default='.', help='Repository path')
    # Added: The --scan-only flag, using action='store_true'
    parser.add_argument(
        '--scan-only', 
        action='store_true', 
        help='Activate fast analysis mode, skipping Git history extraction.'
    )
    args = parser.parse_args()
    
    analyzer = CodebaseAnalyzer()
    
    # Conditional logic based on the --scan-only flag
    if args.scan_only:
        analyzer.scan_only_mode(args.repo)
    else:
        analyzer.analyze_repository(args.repo)

if __name__ == "__main__":
    main()
```

### Key Changes Explained:

1.  **`CodebaseAnalyzer.scan_only_mode()`:**

      * It only calls `self.scanner.scan_repository(repo_path)`, **skipping** the `git_extractor`.
      * It calculates the **Total Lines of Code** using a generator expression and `sum()`: `total_lines = sum(file.lines_of_code for file in files)`. This assumes your `scanner` correctly populates a `lines_of_code` attribute on each file object.
      * It prints the relevant metrics for a fast scan: **Total Files**, **Total Lines of Code**, and the **Language breakdown**.

2.  **`main()` function:**

      * We added the `--scan-only` argument using `action='store_true'`:
        ```python
        parser.add_argument('--scan-only', action='store_true', ...)
        ```
        The `action='store_true'` setting means that if the user provides the flag, `args.scan_only` will be `True`; otherwise, it defaults to `False`.
      * We implemented the **conditional call** logic:
        ```python
        if args.scan_only:
            analyzer.scan_only_mode(args.repo)
        else:
            analyzer.analyze_repository(args.repo)
        ```
        The program now checks the flag and executes the appropriate analysis mode, respecting the user's need for speed. 🏎️
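
To check the flag dispatch without scanning a real repository, the analyzer can be swapped for a stand-in that only records which method was called. In this sketch, `FakeAnalyzer` and `run` are hypothetical names introduced for illustration; `run` mirrors the logic of `main()` but takes the argument list and the analyzer as parameters.

```python
import argparse

class FakeAnalyzer:
    """Hypothetical stand-in for CodebaseAnalyzer that only records calls."""
    def __init__(self):
        self.calls = []

    def scan_only_mode(self, repo_path):
        self.calls.append(('scan_only', repo_path))

    def analyze_repository(self, repo_path):
        self.calls.append(('full', repo_path))

def run(argv, analyzer):
    # Same parser and branching as main(), but with injectable inputs
    parser = argparse.ArgumentParser()
    parser.add_argument('--repo', default='.', help='Repository path')
    parser.add_argument('--scan-only', action='store_true')
    args = parser.parse_args(argv)
    if args.scan_only:
        analyzer.scan_only_mode(args.repo)
    else:
        analyzer.analyze_repository(args.repo)

analyzer = FakeAnalyzer()
run([], analyzer)                                                   # flag absent
run(['--scan-only', '--repo', './sample-ecommerce-api'], analyzer)  # flag present
print(analyzer.calls)
# [('full', '.'), ('scan_only', './sample-ecommerce-api')]
```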


If the scan-only code above fails with an `AttributeError`, the `CodeFile` objects returned by your `RepositoryScanner` do not have an attribute named **`lines_of_code`**.

The error occurs here in your `scan_only_mode` method:

```
  File "/usercode/FILESYSTEM/code_review_assistant/cli.py", line 41, in scan_only_mode
    total_lines = sum(file.lines_of_code for file in files)
```

### The Solution

To fix this, you need to update the logic to **calculate the lines of code** *before* you try to sum them up, or you need to ensure your `RepositoryScanner` is responsible for adding that attribute to the `CodeFile` objects.

Assuming the scanner returns objects that have a **`content`** attribute (the string content of the file), you can calculate the line count by counting the newline characters (`\n`).

Here is the corrected `scan_only_mode` method:

```python
    def scan_only_mode(self, repo_path):
        print("Starting fast analysis (scan-only mode)...")
        
        # 1. Use the scanner to get files
        files = self.scanner.scan_repository(repo_path)
        
        # 2. Calculate Language stats and Total Lines of Code
        languages = {}
        total_lines = 0
        
        for file in files:
            # Assuming 'file' has a 'content' attribute from the scanner.
            # If the file object *already* had lines_of_code, use that instead:
            # line_count = file.lines_of_code
            #
            # Since it doesn't, count the newline characters, then add 1 only
            # when the last line has no trailing newline, so files that end
            # with '\n' are not overcounted.
            content = file.content or ""
            line_count = content.count('\n')
            if content and not content.endswith('\n'):
                line_count += 1

            total_lines += line_count
            
            languages[file.language] = languages.get(file.language, 0) + 1
            
        # 3. Display results
        print(f"\nScan Results:")
        print(f"Total Files: {len(files)}")
        # Updated line: Now uses the calculated total_lines variable
        print(f"Total Lines of Code: {total_lines}")
        print(f"Languages: {languages}")
```

By iterating over `files` and calculating `line_count` from `file.content` (which the scanner presumably extracts), you avoid the `AttributeError` and correctly implement the total line count.
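
Line counting has a trailing-newline edge case: most source files end with `\n`, so adding 1 to the newline count unconditionally reports one line too many for them. A minimal sketch of a helper that sidesteps this with `str.splitlines()` follows; the name `count_lines` is hypothetical. Note that `splitlines()` also splits on other line boundaries such as `\r\n` and form feeds, which may or may not match how you want to count.

```python
def count_lines(content):
    """Count lines in a string, whether or not it ends with a newline."""
    if not content:
        # Covers both None and the empty string
        return 0
    return len(content.splitlines())

print(count_lines("a = 1\nb = 2\n"))  # 2 (trailing newline, no extra line)
print(count_lines("a = 1\nb = 2"))    # 2 (no trailing newline)
print(count_lines(""))                # 0
print(count_lines(None))              # 0
```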

## Making Your CLI Tool Error Proof

After adding output options and a fast scan mode to your CLI, it's time to make your tool more robust by fixing some bugs. Your current analyzer works well with ideal repositories, but real-world codebases can be messy!

In this exercise, you'll improve the error handling in your CodebaseAnalyzer class to make it work reliably with any repository. The current code has issues when dealing with non-git repositories, invalid paths, and files with unknown language types.

Your tasks:

  * Add a check to verify that the repository path exists before attempting analysis
  * Add error handling around the git extraction call to gracefully handle repositories without git history
  * Fix the language statistics calculation to properly handle all file types without crashing

These improvements will make your tool much more reliable in real-world scenarios. Professional-grade tools need to handle unexpected situations gracefully: users should see helpful messages instead of cryptic errors when something goes wrong.

CLI Tip: When building CLI tools, you may want to add boolean flags (like `--verbose` or `--scan-only`). For boolean flags, you must specify the `action` parameter so that `argparse` knows to store a boolean value:

```python
parser.add_argument('--scan-only', action='store_true', help='Run only file scanning (skip git extraction)')
```

Without `action='store_true'`, `argparse` expects a value to be provided after the flag. With this action, the flag becomes a simple on/off switch that defaults to `False` and becomes `True` when the flag is present.


```python
import argparse
from scanner import RepositoryScanner
from git_extractor import GitHistoryExtractor

class CodebaseAnalyzer:
    def __init__(self):
        self.scanner = RepositoryScanner()
        self.git_extractor = GitHistoryExtractor()
    
    def analyze_repository(self, repo_path):
        print("Starting analysis...")
        
        # TODO: Add a check to verify if the path exists
        
        # Scan files
        files = self.scanner.scan_repository(repo_path)
        
        # Extract git history
        # TODO: Add error handling around this call to handle repositories without git history
        commits = self.git_extractor.extract_commits(repo_path)
        
        # Show results
        print(f"\nResults:")
        print(f"Files: {len(files)}")
        print(f"Commits: {len(commits)}")
        print(f"File changes: {len(self.git_extractor.file_changes)}")
        
        # Language stats
        languages = {}
        for file in files:
            languages[file.language] = languages.get(file.language, 0) + 1
        print(f"Languages: {languages}")

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--repo', default='.', help='Repository path')
    args = parser.parse_args()
    
    analyzer = CodebaseAnalyzer()
    analyzer.analyze_repository(args.repo)

if __name__ == "__main__":
    main()

```

Making your CLI tool **error-proof** is essential for professional deployment. This task requires adding checks for path existence, gracefully handling Git errors, and ensuring the language calculation can process all file types.

Path validation needs the `os` module, and exiting with an error status needs `sys`, so both are imported. The Git extraction call is wrapped in a `try...except` block that catches `InvalidGitRepositoryError` (the exception GitPython raises when a directory is not a Git repository, assuming `GitHistoryExtractor` lets it propagate) and falls back to a generic `Exception` for any other failure.

Here is the updated and more robust code:

```python
import argparse
import os  # Needed for path validation
import sys # Needed for exiting the program gracefully

# Assuming these classes and their potential exceptions are defined elsewhere
from scanner import RepositoryScanner
from git_extractor import GitHistoryExtractor 
# We'll assume the git library used by GitHistoryExtractor might raise this:
from git.exc import InvalidGitRepositoryError 

class CodebaseAnalyzer:
    def __init__(self):
        self.scanner = RepositoryScanner()
        self.git_extractor = GitHistoryExtractor()
    
    def analyze_repository(self, repo_path):
        print("Starting analysis...")
        
        # 1. Add a check to verify if the path exists
        if not os.path.isdir(repo_path):
            print(f"ERROR: Repository path not found or is not a directory: {repo_path}")
            sys.exit(1) # Exit with a non-zero status code to indicate an error
        
        # Scan files
        files = self.scanner.scan_repository(repo_path)
        
        # Extract git history (Error Handling Added)
        commits = []
        try:
            commits = self.git_extractor.extract_commits(repo_path)
        except InvalidGitRepositoryError:
            print("WARNING: Not a valid Git repository. Skipping Git history extraction.")
        except Exception as e:
            # Catch other potential errors during git interaction
            print(f"WARNING: Error during Git history extraction: {e}")

        # Show results
        print(f"\nResults:")
        print(f"Files: {len(files)}")
        # If git extraction failed, commits will be an empty list []
        print(f"Commits: {len(commits)}") 
        
        # Only show file changes if commits were successfully extracted
        if commits:
            print(f"File changes: {len(self.git_extractor.file_changes)}")
        else:
            print("File changes: 0 (Git history not available)")
        
        # Language stats (Handling Unknown Language Types)
        languages = {}
        for file in files:
            # Fix: Group files without a detected language under a readable label.
            # The scanner is expected to assign a language, but if it leaves it as
            # None (or an empty string), count the file as "Unknown" instead of
            # showing a None or empty key in the output.
            lang = file.language if file.language else "Unknown"
            languages[lang] = languages.get(lang, 0) + 1
        print(f"Languages: {languages}")

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--repo', default='.', help='Repository path')
    args = parser.parse_args()
    
    analyzer = CodebaseAnalyzer()
    # Note: If you were using the scan-only mode, you'd add the flag and logic here.
    analyzer.analyze_repository(args.repo)

if __name__ == "__main__":
    main()
```

### Summary of Error Proofing:

1.  **Path Validation:**

      * Imported `os` and `sys`.
      * Used `if not os.path.isdir(repo_path)` to check for a valid directory.
      * Used `sys.exit(1)` to stop the program gracefully if the path is invalid.

2.  **Git Error Handling:**

      * Wrapped `self.git_extractor.extract_commits(repo_path)` in a `try...except` block.
      * Initialized `commits = []` to ensure it's always defined, even if the Git extraction fails.
      * Used `except InvalidGitRepositoryError` (or a similar exception for your Git library) to catch non-Git repos and print a **warning** instead of crashing.
      * Added a generic `except Exception` for other unexpected Git errors.
      * The printout for **File changes** is now conditional on whether any commits were extracted; an empty list (including the failure case) prints `0`.

3.  **Language Statistics Fix:**

      * Modified the loop to handle files where `file.language` might be `None` (or empty):
        ```python
        lang = file.language if file.language else "Unknown" 
        languages[lang] = languages.get(lang, 0) + 1
        ```
        This ensures that files with an unknown or undetected language are counted under an **"Unknown"** category, so the output never shows a `None` or empty key. It only covers a `language` value of `None` or `""`; it still assumes every file object has a `language` attribute.
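
The three safeguards can be exercised in isolation with stand-ins. In the sketch below, `FailingExtractor` and `safe_extract` are hypothetical names: the first simulates an extractor that fails on a non-git directory, and the second wraps the call the same way the `try...except` block in `analyze_repository` does.

```python
import os
import tempfile

class FailingExtractor:
    """Hypothetical stand-in for GitHistoryExtractor on a non-git directory."""
    def __init__(self):
        self.file_changes = []

    def extract_commits(self, repo_path):
        raise RuntimeError(f"{repo_path} is not a git repository")

def safe_extract(extractor, repo_path):
    """Return the commits, or an empty list if extraction fails."""
    try:
        return extractor.extract_commits(repo_path)
    except Exception as error:
        print(f"WARNING: Error during Git history extraction: {error}")
        return []

with tempfile.TemporaryDirectory() as workdir:
    # 1. Path validation: an existing directory passes, a missing one does not
    print(os.path.isdir(workdir))                                  # True
    print(os.path.isdir(os.path.join(workdir, 'does-not-exist')))  # False

    # 2. Git error handling: the failure becomes a warning and an empty list
    commits = safe_extract(FailingExtractor(), workdir)
    print(f"Commits: {len(commits)}")                              # Commits: 0

# 3. Language fallback: None and "" are both grouped under "Unknown"
languages = {}
for language in ['Python', None, '']:
    label = language if language else 'Unknown'
    languages[label] = languages.get(label, 0) + 1
print(languages)  # {'Python': 1, 'Unknown': 2}
```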

## Enhancing CLI Output with Detailed Statistics

## Building Robust CLI Error Handling