# Code Comment Generator

## Goal
Build a tool that takes a Python function as input and automatically generates a docstring explaining what the function does.

## Tech Stack
- **Model**: Salesforce CodeT5 (`Salesforce/codet5-base-multi-sum`, an encoder-decoder model fine-tuned for code summarization)
- **Libraries**: Transformers, Torch

## Project Overview
1. Load a pre-trained code summarization model
2. Wrap tokenization and generation in a docstring helper
3. Generate docstrings for sample Python functions
4. Display the results, then apply the same process to a whole file

## Step 1: Install Required Libraries

In [28]:
!pip install transformers torch accelerate -q

## Step 2: Import Libraries

In [29]:
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import warnings
warnings.filterwarnings('ignore')

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

Using device: cpu


## Step 3: Load Pre-trained Model

In [30]:
model_name = "Salesforce/codet5-base-multi-sum"

print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_name)

print("Loading model...")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model = model.to(device)

print(f"Model loaded successfully: {model_name}")

Loading tokenizer...
Loading model...
Model loaded successfully: Salesforce/codet5-base-multi-sum


## Step 4: Create Docstring Generator Function

In [31]:
def generate_docstring(code: str, max_length: int = 128) -> str:
    """
    Generate a docstring for the given Python code.

    Args:
        code: Python function code as a string
        max_length: Maximum length of the generated docstring, in tokens

    Returns:
        Generated docstring describing the function
    """
    # Tokenize the source; inputs longer than 512 tokens are truncated.
    inputs = tokenizer(
        code,
        return_tensors="pt",
        max_length=512,
        truncation=True
    ).to(device)

    # Beam search; **inputs forwards the attention mask along with the input IDs.
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_length=max_length,
            num_beams=5,
            early_stopping=True,
            no_repeat_ngram_size=2
        )

    docstring = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return docstring


def format_function_with_docstring(code: str, docstring: str) -> str:
    """
    Format the function with the generated docstring inserted.

    Args:
        code: Original Python function code
        docstring: Generated docstring

    Returns:
        Function code with docstring inserted
    """
    lines = code.strip().split('\n')

    for i, line in enumerate(lines):
        if line.strip().startswith('def '):
            indent = len(line) - len(line.lstrip()) + 4
            docstring_formatted = ' ' * indent + '"""' + docstring + '"""'
            lines.insert(i + 1, docstring_formatted)
            break

    return '\n'.join(lines)


print("Docstring generator functions created!")

Docstring generator functions created!
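
`format_function_with_docstring` inserts the docstring on the line right after the one that starts with `def`, so a signature that spans several lines would receive the docstring in the middle of its parameter list. The following is a minimal sketch of an alternative that uses the standard-library `ast` module to locate the first body statement instead. The name `insert_docstring_ast` is hypothetical and not part of the notebook, and the sketch assumes the source parses on its own and that the body starts on its own line.

```python
import ast


def insert_docstring_ast(source: str, docstring: str) -> str:
    """Insert a one-line docstring into the first function found in `source`.

    Hypothetical helper: assumes `source` is parseable as a standalone module.
    """
    tree = ast.parse(source)
    func = next(
        (n for n in ast.walk(tree)
         if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))),
        None,
    )
    if func is None:
        raise ValueError("no function definition found")

    first_stmt = func.body[0]
    lines = source.splitlines()
    idx = first_stmt.lineno - 1  # ast line numbers are 1-based

    # Reuse the body's own leading whitespace as the docstring indentation.
    indent = lines[idx][:first_stmt.col_offset]
    if indent.strip():
        # The body shares a line with the signature, e.g. `def f(): return 1`.
        raise ValueError("body must start on its own line in this sketch")

    # Escape backslashes and double quotes so the text cannot end the literal.
    safe = docstring.replace("\\", "\\\\").replace('"', '\\"')
    lines.insert(idx, f'{indent}"""{safe}"""')
    return "\n".join(lines)


example = "def add(\n    a,\n    b,\n):\n    return a + b"
print(insert_docstring_ast(example, "Add two numbers."))
```

Because the insertion point comes from the parsed tree rather than from string matching, the docstring lands after the closing `):` line regardless of how the signature is wrapped.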


## Step 5: Test with Sample Python Functions

In [32]:
sample_functions = [
    '''def calculate_area(length, width):
    return length * width''',

    '''def find_max(numbers):
    if not numbers:
        return None
    max_val = numbers[0]
    for num in numbers:
        if num > max_val:
            max_val = num
    return max_val''',

    '''def reverse_string(s):
    return s[::-1]''',

    '''def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)''',

    '''def read_file_lines(filepath):
    with open(filepath, 'r') as f:
        lines = f.readlines()
    return [line.strip() for line in lines]'''
]

print(f"Loaded {len(sample_functions)} sample functions for testing")

Loaded 5 sample functions for testing
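
Before sending snippets to the model, it can help to confirm that each one is a single, syntactically valid function, since the model will produce a summary for malformed input just as readily. A small sketch of such a check, using only the standard library; `is_single_function` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
import ast


def is_single_function(source: str) -> bool:
    """Return True if `source` parses and consists of exactly one function."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    return (
        len(tree.body) == 1
        and isinstance(tree.body[0], (ast.FunctionDef, ast.AsyncFunctionDef))
    )


# Illustrative inputs: one valid function and one with a missing colon.
candidates = [
    "def reverse_string(s):\n    return s[::-1]",
    "def broken(s)\n    return s",
]
for snippet in candidates:
    print(is_single_function(snippet))
```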


### Generate Docstrings for All Sample Functions

In [33]:
print("=" * 60)
print("CODE COMMENT GENERATOR - RESULTS")
print("=" * 60)

for i, func in enumerate(sample_functions, 1):
    print(f"\nFunction {i}:")
    print("-" * 40)
    print("Original Code:")
    print(func)
    print("\nGenerated Docstring:")
    docstring = generate_docstring(func)
    print(f'   """{docstring}"""')
    print("\nFunction with Docstring:")
    print(format_function_with_docstring(func, docstring))
    print("=" * 60)

CODE COMMENT GENERATOR - RESULTS

Function 1:
----------------------------------------
Original Code:
def calculate_area(length, width):
    return length * width

Generated Docstring:
   """Calculate the area of the image ."""

Function with Docstring:
def calculate_area(length, width):
    """Calculate the area of the image ."""
    return length * width

Function 2:
----------------------------------------
Original Code:
def find_max(numbers):
    if not numbers:
        return None
    max_val = numbers[0]
    for num in numbers:
        if num > max_val:
            max_val = num
    return max_val

Generated Docstring:
   """Find the maximum number in a sequence of sequence ."""

Function with Docstring:
def find_max(numbers):
    """Find the maximum number in a sequence of sequence ."""
    if not numbers:
        return None
    max_val = numbers[0]
    for num in numbers:
        if num > max_val:
            max_val = num
    return max_val

Function 3:
----------------------
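
The generated summaries shown above are emitted with a space before the final period (for example `image .`). One possible post-processing step, sketched here with a hypothetical helper named `tidy_summary`, collapses repeated whitespace and removes the space before punctuation. It only normalizes spacing and does not attempt to fix the wording of the summary.

```python
import re


def tidy_summary(text: str) -> str:
    """Normalize spacing in a generated one-line summary."""
    # Collapse runs of whitespace and trim both ends.
    text = " ".join(text.split())
    # Remove the space the model leaves before punctuation marks.
    text = re.sub(r"\s+([.,;:!?])", r"\1", text)
    # End with a period if no sentence-ending mark is present.
    if text and text[-1] not in ".!?":
        text += "."
    return text


print(tidy_summary("Calculate the area of the image ."))
```

The helper could be applied to the return value of `generate_docstring` before the text is inserted into the function.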

## Step 6: Read from File and Generate Docstrings

This step reads a Python file, extracts each function definition, and inserts a generated docstring into every function that does not already contain one.
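
The `extract_functions` helper in this step finds functions with a regular expression, which can miss signatures that contain nested parentheses (for example a default value such as `h=max(1, 2)`), and it checks for existing docstrings by looking for triple quotes anywhere in the function text. As a hedged alternative, the sketch below uses the standard-library `ast` module (Python 3.8+ for `ast.get_source_segment`). The name `extract_functions_ast` is hypothetical, and the approach assumes the whole file parses without syntax errors.

```python
import ast
from typing import List, Tuple


def extract_functions_ast(code: str) -> List[Tuple[str, bool]]:
    """Return (source, has_docstring) for every function found in `code`.

    Hypothetical helper. Nested functions and methods are included; their
    source segment starts at `def`, so the first line loses its indentation.
    Decorators are not part of the returned segment.
    """
    tree = ast.parse(code)
    results = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            source = ast.get_source_segment(code, node)
            has_doc = ast.get_docstring(node) is not None
            results.append((source, has_doc))
    return results


sample = (
    "def area(w, h=max(1, 2)):\n"
    "    return w * h\n"
    "\n"
    "\n"
    "def documented():\n"
    '    """Already has one."""\n'
    "    return None\n"
)
for source, has_doc in extract_functions_ast(sample):
    print(has_doc, source.splitlines()[0])
```

Only the functions whose flag is `False` would then need to be passed to `generate_docstring`.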

In [34]:
import re
from pathlib import Path

def extract_functions(code: str) -> list:
    """
    Extract all function definitions from Python code.

    Args:
        code: Python source code as a string

    Returns:
        List of function code strings
    """
    pattern = r'(def\s+\w+\s*\([^)]*\)\s*:[\s\S]*?)(?=\ndef\s|\nclass\s|\Z)'
    functions = re.findall(pattern, code)
    return [f.strip() for f in functions if f.strip()]


def process_python_code(code: str) -> str:
    """
    Process Python code and add docstrings to all functions.

    Args:
        code: Python source code as a string

    Returns:
        Code with generated docstrings added
    """
    functions = extract_functions(code)
    result = code

    for func in functions:
        if '"""' in func or "'''" in func:
            continue

        docstring = generate_docstring(func)
        func_with_doc = format_function_with_docstring(func, docstring)
        result = result.replace(func, func_with_doc)

    return result


def process_python_file(filepath: str, output_path: str = None) -> str:
    """
    Read a Python file, add docstrings to all functions, and optionally save.

    Args:
        filepath: Path to the Python file to process
        output_path: Optional path to save the processed code (if None, doesn't save)

    Returns:
        Code with generated docstrings added
    """
    path = Path(filepath)

    if not path.exists():
        raise FileNotFoundError(f"File not found: {filepath}")

    if path.suffix != '.py':
        raise ValueError(f"File must be a Python file (.py): {filepath}")

    code = path.read_text()
    processed_code = process_python_code(code)

    if output_path:
        output = Path(output_path)
        output.write_text(processed_code)
        print(f"Processed code saved to: {output_path}")

    return processed_code


# Example usage with a file path
filepath = "file.py"  # Change this to your Python file path

print(f"Processing file: {filepath}")
print("=" * 60)

try:
    # Read original code
    original_code = Path(filepath).read_text()
    print("Original Python Code:")
    print(original_code)
    print("\n" + "=" * 60)
    print("Code with Generated Docstrings:")
    print("=" * 60)

    # Process and display (without saving)
    processed_code = process_python_file(filepath)
    print(processed_code)

    # Process again and save the result to a new file (comment out to skip saving)
    processed_code = process_python_file(filepath, output_path="file_regenerated.py")

except FileNotFoundError as e:
    print(f"Error: {e}")
except ValueError as e:
    print(f"Error: {e}")

Processing file: file.py
Original Python Code:
def calculate_area(length, width):
    return length * width

def find_max(numbers):
    if not numbers:
        return None
    max_val = numbers[0]
    for num in numbers:
        if num > max_val:
            max_val = num
    return max_val

def reverse_string(s):
    return s[::-1]

def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)

def read_file_lines(filepath):
    with open(filepath, 'r') as f:
        lines = f.readlines()
    return [line.strip() for line in lines]

Code with Generated Docstrings:
def calculate_area(length, width):
    """Calculate the area of the image ."""
    return length * width

def find_max(numbers):
    """Find the maximum number in a sequence of sequence ."""
    if not numbers:
        return None
    max_val = numbers[0]
    for num in numbers:
        if num > max_val:
            max_val = num
    return max_val

def reverse_string(s):
    """Reverse the las