# GitHub Repository Documentation Generator with OpenAI SDK

## Setup Instructions

This notebook uses the OpenAI SDK with the open-source `gpt-oss-120b` model to generate repository documentation.

1. **Get OpenAI API Key** from https://platform.openai.com/api-keys
2. **Set environment variables**: `OPENAI_API_KEY=your_token_here` (already in the `.env` file) and `GITHUB_USERNAME`; `GITHUB_TOKEN` is optional
3. **Run the cells below** to generate documentation

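The default model is `gpt-oss-120b`, set in `MODEL_NAME`. Some OpenAI-compatible providers list the same model as `openai/gpt-oss-120b`; use whichever identifier your endpoint accepts.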
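
Before running the cells, it can help to confirm that the variables are actually visible to the Python process. This is a minimal sketch, assuming the variable names read in the configuration cell; the helper `missing_env_vars` is hypothetical and not part of this notebook.

```python
import os
from typing import Iterable, List, Mapping


def missing_env_vars(names: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names whose values are unset or blank in `env`."""
    return [n for n in names if not env.get(n, "").strip()]


# Illustrative environment; omit the second argument to check the real os.environ.
example_env = {"OPENAI_API_KEY": "sk-example", "GITHUB_USERNAME": "  "}
missing = missing_env_vars(["OPENAI_API_KEY", "GITHUB_USERNAME"], example_env)
print(missing)
```

A blank value is treated as missing because the configuration cell strips whitespace before using each variable.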

In [1]:
import os
import re
import json
import base64
import pathlib
import requests
from datetime import datetime
from openai import OpenAI
from tqdm import tqdm

In [None]:
GITHUB_USERNAME = os.getenv("GITHUB_USERNAME", "").strip()  # required by main()
GITHUB_TOKEN = os.getenv("GITHUB_TOKEN", "").strip()  # optional; authenticated requests get higher rate limits
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "").strip()  # required
MODEL_NAME = "gpt-oss-120b"
OUT_DIR = pathlib.Path("../data/raw/repo_summaries").resolve()

In [3]:
OUT_DIR.mkdir(parents=True, exist_ok=True)

MAX_TREE_ITEMS = 800
MAX_FILE_CHARS = 12000
MAX_SOURCE_FILES = 12
WORKERS = 10  # not used yet: main() processes repositories sequentially

IMPORTANT_FILES = [
    "README.md", "README.MD", "README.rst",
    "pyproject.toml", "requirements.txt", "Pipfile", "setup.py",
    "package.json", "pnpm-lock.yaml", "yarn.lock", "package-lock.json",
    "Cargo.toml", "go.mod", "pom.xml", "build.gradle", "build.gradle.kts",
    "Dockerfile", "docker-compose.yml",
    # The trailing slash marks a directory prefix; pick_key_files only
    # prefix-matches entries that end with "/".
    ".env.example", ".github/workflows/",
    "Makefile", "compose.yaml",
]

SOURCE_EXTS = {
    ".py", ".js", ".ts", ".tsx", ".java", ".kt", ".go", ".rs", ".cpp", ".c", ".h", ".hpp",
    ".cs", ".php", ".rb", ".swift", ".scala", ".lua", ".sql", ".sh"
}

In [4]:
def gh_headers():
    h = {"Accept": "application/vnd.github+json"}
    if GITHUB_TOKEN:
        h["Authorization"] = f"Bearer {GITHUB_TOKEN}"
    return h

In [5]:
def gh_get(url, params=None):
    r = requests.get(url, headers=gh_headers(), params=params, timeout=60)
    r.raise_for_status()
    return r.json()

In [6]:
def list_repos(username):
    """Fetch all repos using pagination (100 per page)."""
    repos = []
    page = 1
    while True:
        batch = gh_get(
            f"https://api.github.com/users/{username}/repos",
            params={"per_page": 100, "page": page, "sort": "updated"}
        )
        if not batch:
            break
        repos.extend(batch)
        page += 1
    return repos
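
The pagination loop can be exercised without network access. The sketch below is an illustration under assumptions, not the notebook's implementation: `paginate` and `fake_fetch` are hypothetical names, and the loop stops as soon as a page comes back shorter than `per_page`, which saves the final empty request that `list_repos` makes.

```python
from typing import Callable, Dict, List


def paginate(fetch_page: Callable[[int], List[Dict]], per_page: int = 100) -> List[Dict]:
    """Collect items page by page until a short (or empty) page is returned."""
    items: List[Dict] = []
    page = 1
    while True:
        batch = fetch_page(page)
        items.extend(batch)
        if len(batch) < per_page:
            break  # a short page means there is nothing left to fetch
        page += 1
    return items


# Stand-in for the GitHub API: 250 fake repositories served 100 at a time.
_FAKE_REPOS = [{"id": i} for i in range(250)]


def fake_fetch(page: int) -> List[Dict]:
    start = (page - 1) * 100
    return _FAKE_REPOS[start:start + 100]


all_items = paginate(fake_fetch)
print(len(all_items))
```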

In [7]:
def get_default_branch(repo_full_name, fallback="main"):
    repo = gh_get(f"https://api.github.com/repos/{repo_full_name}")
    return repo.get("default_branch") or fallback

In [8]:
def get_repo_tree(repo_full_name, branch):
    """
    Two-step process: get commit SHA from branch ref, then fetch full recursive tree.
    """
    ref = gh_get(f"https://api.github.com/repos/{repo_full_name}/git/refs/heads/{branch}")
    sha = ref["object"]["sha"]
    
    tree = gh_get(
        f"https://api.github.com/repos/{repo_full_name}/git/trees/{sha}",
        params={"recursive": 1}
    )
    return tree.get("tree", [])

In [9]:
def is_probably_binary(path):
    """Check file extension against known binary formats."""
    return any(path.lower().endswith(ext) for ext in [
        ".png", ".jpg", ".jpeg", ".gif", ".webp",
        ".pdf", ".zip", ".gz", ".7z",
        ".mp4", ".mov", ".avi",
        ".exe", ".dll", ".so", ".dylib"
    ])

In [10]:
def fetch_file_text(repo_full_name, path, branch):
    """
    GitHub API returns file content base64-encoded.
    Decode and handle both UTF-8 and Latin-1 encodings.
    """
    data = gh_get(
        f"https://api.github.com/repos/{repo_full_name}/contents/{path}",
        params={"ref": branch}
    )
    
    if isinstance(data, dict) and data.get("type") == "file":
        content = data.get("content", "")
        if data.get("encoding") == "base64" and content:
            raw = base64.b64decode(content)
            try:
                # Decode strictly so invalid UTF-8 actually reaches the fallback.
                txt = raw.decode("utf-8")
            except UnicodeDecodeError:
                # Latin-1 maps every byte to a character, so this cannot fail.
                txt = raw.decode("latin-1")
            return txt[:MAX_FILE_CHARS]
    return ""
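
The decoding step can be tried offline on hand-made payloads. This is a sketch under assumptions: `decode_contents_payload` is a hypothetical helper, and the payloads are built with `base64.encodebytes` to imitate base64 text that may contain line breaks.

```python
import base64


def decode_contents_payload(content_b64: str, max_chars: int = 12000) -> str:
    """Decode base64 text, preferring UTF-8 and falling back to Latin-1."""
    # b64decode ignores characters outside the base64 alphabet, including newlines.
    raw = base64.b64decode(content_b64)
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        text = raw.decode("latin-1")
    return text[:max_chars]


# One valid UTF-8 payload and one that is only valid as Latin-1.
utf8_payload = base64.encodebytes("héllo wörld\n".encode("utf-8")).decode("ascii")
latin1_payload = base64.encodebytes(b"caf\xe9").decode("ascii")

print(decode_contents_payload(utf8_payload))
print(decode_contents_payload(latin1_payload))
```

The fallback only triggers when the UTF-8 decode is strict; with `errors="replace"` the first decode never raises and invalid bytes become replacement characters instead.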

In [11]:
def pick_key_files(tree_paths):
    """
    Multi-stage file selection:
    1. Important files (README, config, CI/CD)
    2. Source code files sorted by depth
    3. Deduplicate while preserving order
    """
    picked = []
    
    for imp in tqdm(IMPORTANT_FILES, desc="Finding important files"):
        for p in tree_paths:
            if p == imp or p.endswith("/" + imp) or (imp.endswith("/") and p.startswith(imp)):
                picked.append(p)

    src = [p for p in tree_paths if pathlib.Path(p).suffix in SOURCE_EXTS and not is_probably_binary(p)]
    src.sort(key=lambda x: (x.count("/"), len(x)))
    picked.extend(src[:MAX_SOURCE_FILES])

    seen, out = set(), []
    for p in tqdm(picked, desc="Deduplicating files"):
        if p not in seen:
            seen.add(p)
            out.append(p)
    return out
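
To see the selection order on a small input, the stages can be sketched on a hand-written list of paths. This is an illustration, not the notebook's function: `pick` is a hypothetical name, the constants are trimmed copies, and the deduplication uses `dict.fromkeys` instead of the explicit `seen` set (both preserve first-seen order).

```python
from pathlib import PurePosixPath
from typing import List

SOURCE_EXTS = {".py", ".js"}  # trimmed copy for the example
IMPORTANT = ["README.md"]     # trimmed copy for the example


def pick(paths: List[str], max_source_files: int = 12) -> List[str]:
    # Stage 1: important files, matched at the root or in any subfolder.
    important = [p for imp in IMPORTANT for p in paths
                 if p == imp or p.endswith("/" + imp)]
    # Stage 2: source files, shallowest first, then shortest path first.
    src = [p for p in paths if PurePosixPath(p).suffix in SOURCE_EXTS]
    src.sort(key=lambda p: (p.count("/"), len(p)))
    # Stage 3: drop duplicates while keeping the first occurrence of each path.
    return list(dict.fromkeys(important + src[:max_source_files]))


paths = ["src/pkg/deep/mod.py", "main.py", "docs/README.md",
         "src/app.js", "README.md", "main.py"]
selected = pick(paths)
print(selected)
```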

In [12]:
def build_repo_context(repo, tree):
    repo_full = repo["full_name"]
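    # The repo listing already carries default_branch; only call the API if it is missing.
    branch = repo.get("default_branch") or get_default_branch(repo_full)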

    tree_items = [t for t in tree if t.get("type") in ("blob", "tree")]
    tree_items = tree_items[:MAX_TREE_ITEMS]

    paths = [t["path"] for t in tree_items if "path" in t]
    files = [p for p in paths if not is_probably_binary(p)]

    chosen = pick_key_files(files)

    file_blobs = []
    for p in chosen:
        try:
            txt = fetch_file_text(repo_full, p, branch)
            if txt.strip():
                file_blobs.append({"path": p, "text": txt})
        except Exception:
            continue

    meta = {
        "name": repo.get("name"),
        "full_name": repo_full,
        "description": repo.get("description"),
        "topics": repo.get("topics", []),
        "default_branch": branch,
        "language": repo.get("language"),
        "updated_at": repo.get("updated_at"),
        "stargazers": repo.get("stargazers_count"),
        "forks": repo.get("forks_count"),
        "open_issues": repo.get("open_issues_count"),
        "license": (repo.get("license") or {}).get("spdx_id"),
        "html_url": repo.get("html_url"),
    }
    return meta, paths, file_blobs


In [13]:
def make_prompt(meta, paths, file_blobs):
    tree_preview = "\n".join(paths[:600])

    def fence(path, text):
        ext = pathlib.Path(path).suffix.lstrip(".")
        lang = ext if ext else ""
        text = text[:MAX_FILE_CHARS]
        return f"### {path}\n```{lang}\n{text}\n```\n"

    snippets = "\n".join(fence(f["path"], f["text"]) for f in file_blobs[:20])

    return f"""
You are a senior engineer writing documentation.

Generate a SINGLE markdown document that explains this GitHub repository clearly.
Use proper headings and subheadings and keep it accurate based only on provided data.
If something is unclear, say so.

# Required structure
- Title with repo name
- Overview (what it is, who it's for)
- Key Features (bullets)
- Architecture / How it works (based on files/config)
- Notable folders/files (explain why they matter)
- Setup & Run (infer from configs; include commands if obvious)
- How to use (examples if you can infer)
- Testing / CI (if present)
- Deployment (if present)
- Contribution notes (if present)
- Limitations / TODOs you infer (clearly labeled as inference)

# Repo metadata (JSON)
{json.dumps(meta, indent=2)}

# File tree (preview)
{tree_preview}

# File excerpts
{snippets}
""".strip()

In [14]:
def safe_filename(name: str) -> str:
    """Replace invalid filename characters with underscores."""
    name = re.sub(r"[^a-zA-Z0-9._-]+", "_", name).strip("_")
    return name or "repo"
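
A few sample inputs show what the substitution does. The function body is copied from the cell above so the snippet runs on its own; the sample names are made up for illustration.

```python
import re


def safe_filename(name: str) -> str:
    """Replace runs of characters outside [a-zA-Z0-9._-] with one underscore."""
    name = re.sub(r"[^a-zA-Z0-9._-]+", "_", name).strip("_")
    return name or "repo"


samples = ["my repo (v2)!", "LLM-Code-Explainer", "???"]
cleaned = [safe_filename(s) for s in samples]
print(cleaned)
```

Distinct inputs can collapse to the same result (for example `"a b"` and `"a/b"` both become `a_b`), so a later file would overwrite an earlier one if two names ever collided.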

In [15]:
def generate_markdown_with_openai(prompt: str) -> str:
    """
    Generate markdown documentation using OpenAI SDK with OSS model.
    """
    if not OPENAI_API_KEY:
        raise RuntimeError("OPENAI_API_KEY is missing. Set it in your environment.")

    client = OpenAI(api_key=OPENAI_API_KEY)

    response = client.chat.completions.create(
        model=MODEL_NAME,
        messages=[
            {"role": "system", "content": "You write high-quality repo documentation in Markdown."},
            {"role": "user", "content": prompt},
        ],
        temperature=0.2,
        max_tokens=4096,
        top_p=0.9,
    )

    return response.choices[0].message.content
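
The run recorded at the end of this notebook fails with `model_not_found` for `gpt-oss-120b`, which suggests the endpoint used with this key did not serve that model. One possible workaround, sketched here under assumptions, is to point the SDK at an OpenAI-compatible endpoint that hosts the model through the client's `base_url` argument. The helper `client_kwargs`, the choice of the `OPENAI_BASE_URL` variable, and the placeholder URL are illustrative and not part of the notebook.

```python
from typing import Dict, Mapping


def client_kwargs(env: Mapping[str, str]) -> Dict[str, str]:
    """Build keyword arguments for OpenAI(...) from an environment mapping."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is missing. Set it in your environment.")
    kwargs = {"api_key": key}
    base_url = env.get("OPENAI_BASE_URL", "").strip()
    if base_url:
        # Only override the endpoint when one is configured explicitly.
        kwargs["base_url"] = base_url
    return kwargs


# Placeholder values; replace with a real key and a real provider URL.
example = client_kwargs({
    "OPENAI_API_KEY": "sk-example",
    "OPENAI_BASE_URL": "https://example.invalid/v1",
})
print(sorted(example))

# With the SDK installed, the client would then be created as:
# client = OpenAI(**example)
```

Providers may also use a different model identifier, so `MODEL_NAME` may need to change along with the endpoint.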

In [16]:
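def process_one_repo(repo):
    """
    Process a single repository: build its context, generate the documentation,
    and write the Markdown file. Returns a result dict instead of raising,
    so one failing repository does not stop the rest of the run.
    """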
    repo_full = repo["full_name"]
    try:
        branch = get_default_branch(repo_full)
        tree = get_repo_tree(repo_full, branch)
        meta, paths, file_blobs = build_repo_context(repo, tree)
        prompt = make_prompt(meta, paths, file_blobs)
        md = generate_markdown_with_openai(prompt)

        out_path = OUT_DIR / f"{safe_filename(repo['name'])}.md"
        header = f"<!-- Generated: {datetime.utcnow().isoformat()}Z | Model: {MODEL_NAME} -->\n\n"
        out_path.write_text(header + md.strip() + "\n", encoding="utf-8")

        return {"repo": repo_full, "ok": True, "path": str(out_path)}
    except Exception as e:
        return {"repo": repo_full, "ok": False, "error": str(e)}
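
The `WORKERS` constant is defined but the loop in `main()` runs one repository at a time. Because each repository spends most of its time waiting on HTTP calls, a thread pool is one way to overlap that waiting. The sketch below is an assumption-laden illustration: `run_parallel` and `fake_worker` are hypothetical, and `fake_worker` stands in for `process_one_repo` so the snippet runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List


def run_parallel(items: List[Dict], worker: Callable[[Dict], Dict],
                 max_workers: int = 10) -> List[Dict]:
    """Run `worker` on every item in a thread pool; results arrive in completion order."""
    results: List[Dict] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(worker, item) for item in items]
        for fut in as_completed(futures):
            results.append(fut.result())
    return results


def fake_worker(repo: Dict) -> Dict:
    # Stand-in for process_one_repo: returns the same result shape.
    return {"repo": repo["full_name"], "ok": True}


demo_repos = [{"full_name": f"user/repo{i}"} for i in range(5)]
demo = run_parallel(demo_repos, fake_worker, max_workers=3)
print(len(demo))
```

Results come back in completion order rather than input order, so anything that depends on ordering should sort by the `repo` key afterwards. API rate limits on both GitHub and the model endpoint still apply, so a smaller `max_workers` may be safer.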

In [17]:
def main():
    """Fetch all repos and process them one at a time with progress tracking."""
    if not GITHUB_USERNAME:
        raise RuntimeError("Set GITHUB_USERNAME in env.")

    repos = list_repos(GITHUB_USERNAME)
    print(f"Found {len(repos)} repos for @{GITHUB_USERNAME}")

    for repo in tqdm(repos, desc="Processing repositories"):
        res = process_one_repo(repo)
        if res["ok"]:
            print(f"✅ {res['repo']} -> {res['path']}")
        else:
            print(f"⚠️ {res['repo']} failed: {res['error']}")


if __name__ == "__main__":
    main()

Found 41 repos for @upratham


Finding important files: 100%|██████████| 22/22 [00:00<?, ?it/s]
Deduplicating files: 100%|██████████| 1/1 [00:00<?, ?it/s]
Processing repositories:   2%|▏         | 1/41 [00:01<01:14,  1.86s/it]

⚠️ upratham/LLM-RAG-private-knowldge-worker failed: Error code: 404 - {'error': {'message': 'The model `gpt-oss-120b` does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}


Finding important files: 100%|██████████| 22/22 [00:00<?, ?it/s]
Deduplicating files: 100%|██████████| 4/4 [00:00<?, ?it/s]
Processing repositories:   5%|▍         | 2/41 [00:03<01:00,  1.55s/it]

⚠️ upratham/AI-in-Production-Healthcare-App failed: Error code: 404 - {'error': {'message': 'The model `gpt-oss-120b` does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}


Finding important files: 100%|██████████| 22/22 [00:00<?, ?it/s]
Deduplicating files: 100%|██████████| 8/8 [00:00<?, ?it/s]
Processing repositories:   7%|▋         | 3/41 [00:05<01:08,  1.79s/it]

⚠️ upratham/production failed: Error code: 404 - {'error': {'message': 'The model `gpt-oss-120b` does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}


Finding important files: 100%|██████████| 22/22 [00:00<?, ?it/s]
Deduplicating files: 100%|██████████| 1/1 [00:00<?, ?it/s]
Processing repositories:  10%|▉         | 4/41 [00:07<01:11,  1.94s/it]

⚠️ upratham/Breast-Cancer-Segmentation failed: Error code: 404 - {'error': {'message': 'The model `gpt-oss-120b` does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}


Finding important files: 100%|██████████| 22/22 [00:00<?, ?it/s]
Deduplicating files: 100%|██████████| 4/4 [00:00<?, ?it/s]
Processing repositories:  12%|█▏        | 5/41 [00:08<01:03,  1.77s/it]

⚠️ upratham/LLM-Meeting-Minutes-Generation failed: Error code: 404 - {'error': {'message': 'The model `gpt-oss-120b` does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}


Finding important files: 100%|██████████| 22/22 [00:00<00:00, 43362.17it/s]
Deduplicating files: 100%|██████████| 4/4 [00:00<?, ?it/s]
Processing repositories:  15%|█▍        | 6/41 [00:11<01:06,  1.90s/it]

⚠️ upratham/LLM-Code-Explainer failed: Error code: 404 - {'error': {'message': 'The model `gpt-oss-120b` does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}


Finding important files: 100%|██████████| 22/22 [00:00<?, ?it/s]
Deduplicating files: 100%|██████████| 10/10 [00:00<?, ?it/s]
Processing repositories:  15%|█▍        | 6/41 [00:12<01:10,  2.01s/it]


KeyboardInterrupt: 