# Code Assistant

## What’s inside this notebook

This notebook is a compact, teachable example of a *code-generating agent* you can run locally:

- **Generator**: turns a natural-language task into Python code  
- **Runner**: executes the code in a subprocess from a temporary directory (a scratch space, not a security sandbox) and captures outputs/errors  
- **Auto-debugger**: if the first run fails, asks the model for one fix, then falls back to a template if the code is still broken  
- **Evaluator**: checks simple tasks against assertions (offline)  

> Why local-first?  
> - **Zero external services required** (after the one-time `pip install`, works without internet)  
> - **Deterministic fallbacks**: you always get runnable code even if a model can’t be downloaded  
> - **Portable**: runs on CPU/Windows with no GPU dependency


## Modes at a glance

- **`GEN_ONLY`**: One-shot generation. Best for quick demos or sanity checks.  
- **`AUTODEBUG`**: Generate → run → if error, try one *greedy* fix. Always falls back to a clean template if still broken.  
- **`EVAL_TOY`**: Runs a tiny offline test set with assertions (no downloads). Good for quick regression checks.

### Suggested usage
- Start with `AUTODEBUG` — it’s forgiving and shows the full loop.
- Switch to `EVAL_TOY` to see small but concrete pass/fail signals.
- Use `GEN_ONLY` when you just want a snippet fast.


In [1]:
%pip install -q transformers accelerate


Note: you may need to restart the kernel to use updated packages.





In [2]:
import os, sys, re, textwrap, subprocess, tempfile, ast
from typing import Optional
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

In [3]:
RUN_MODE = "AUTODEBUG"   # "GEN_ONLY" | "AUTODEBUG" | "EVAL_TOY"
MAX_NEW_TOKENS = 180     # (kept modest)
TIMEOUT_SEC = 25

# tiny → small models; we decode greedily to avoid junk
MODEL_CANDIDATES = [
    "sshleifer/tiny-gpt2",  # ultra tiny demo model (OK for structure)
    "gpt2",                 # small; slower, but better outputs
]


In [4]:

def run_python(code_text: str, timeout: int = TIMEOUT_SEC):
    """Run code from a temp file; return (returncode, stdout, stderr, code)."""
    with tempfile.TemporaryDirectory() as td:
        fn = os.path.join(td, "prog.py")
        with open(fn, "w", encoding="utf-8") as f:
            f.write(code_text)
        try:
            p = subprocess.run([sys.executable, fn], capture_output=True, text=True, timeout=timeout)
            # Return the code we wrote rather than re-reading the file (avoids an unclosed handle)
            return p.returncode, p.stdout, p.stderr, code_text
        except subprocess.TimeoutExpired:
            # 124 mirrors the exit status of the coreutils `timeout` command
            return 124, "", f"[timeout] exceeded {timeout}s", code_text


In [5]:
def strip_non_ascii(s: str) -> str:
    return s.encode("ascii", "ignore").decode("ascii")

In [6]:
def sanitize_code(s: str) -> str:
    s = re.sub(r"```(python)?", "", s, flags=re.I).replace("```", "")
    if "# Code:" in s: s = s.split("# Code:", 1)[-1]
    s = strip_non_ascii(s)
    return s.strip()

In [7]:
def is_valid_python(code: str) -> bool:
    try:
        ast.parse(code)
        return True
    except Exception:
        return False


## Design choices

### Greedy decoding > sampling (for robustness)
We disable sampling (`do_sample=False`) to reduce nonsense tokens from tiny models. This favors **predictability** over creativity.
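
To make the difference concrete, here is a toy illustration that does not call a real model. The token names and probabilities are invented for the example, and `pick_greedy` / `pick_sampled` are hypothetical helpers, not part of this notebook. Greedy decoding always takes the most likely token, while sampling draws from the distribution and can occasionally pick a poor one.

```python
import random

# Hypothetical next-token distribution (made-up values, not from a real model).
NEXT_TOKEN_PROBS = {"def": 0.55, "import": 0.30, "lol": 0.15}

def pick_greedy(probs: dict) -> str:
    """Return the single most likely token (roughly what do_sample=False does per step)."""
    return max(probs, key=probs.get)

def pick_sampled(probs: dict, rng: random.Random) -> str:
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Greedy gives the same answer on every call.
print([pick_greedy(NEXT_TOKEN_PROBS) for _ in range(3)])

# Sampling may differ between calls and between runs.
rng = random.Random()
print([pick_sampled(NEXT_TOKEN_PROBS, rng) for _ in range(3)])
```

With a tiny model the low-probability tokens are often nonsense, which is why the greedy path tends to be the safer default here.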

### Sanitization & validation
We strip non-ASCII characters, remove markdown fences, and validate with `ast.parse`. If validation fails, or the output contains neither a `def` nor an `if __name__` block, we **fall back** to a known-correct template for the requested task.

### Offline-first evaluation
The **Toy Dataset** is tiny and local. It checks outcomes with Python `assert` statements, which keeps the evaluation crystal clear.

> Trade-off: You won’t get fancy or novel solutions, but you get **reliable, runnable** code every time.
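
As a rough, standalone sketch of the sanitize-then-validate idea (a simplified version of what the notebook's helpers do; `strip_fences` and `parses` are illustrative names, not the notebook's functions), the snippet below strips markdown fences from two fake model outputs and checks whether each one parses.

```python
import ast
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable

def strip_fences(text: str) -> str:
    """Remove markdown code fences, with or without a 'python' language tag."""
    return re.sub(r"`{3}(?:python)?", "", text, flags=re.I).strip()

def parses(code: str) -> bool:
    """True if the text is syntactically valid Python."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

# Two invented "model outputs": one valid, one missing a colon.
raw_good = f"{FENCE}python\ndef add(a, b):\n    return a + b\n{FENCE}"
raw_bad = f"{FENCE}python\ndef add(a, b) return a + b\n{FENCE}"

for raw in (raw_good, raw_bad):
    cleaned = strip_fences(raw)
    print("valid" if parses(cleaned) else "invalid -> use template")
```

Note that a successful parse only proves the syntax is valid; it says nothing about whether the code solves the task, which is why the runner and assertions still matter.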


In [8]:
LLM_OK, PIPE, LLM_ERR = True, None, None

def try_load_llm():
    global LLM_OK, PIPE, LLM_ERR
    if PIPE is not None or LLM_OK is False: return
    try:
        for name in MODEL_CANDIDATES:
            try:
                tok = AutoTokenizer.from_pretrained(name)
                if tok.pad_token is None: tok.pad_token = tok.eos_token
                model = AutoModelForCausalLM.from_pretrained(name)
                PIPE = pipeline(
                    "text-generation",
                    model=model, tokenizer=tok,
                    # Greedy decoding to avoid nonsense
                    do_sample=False,
                    temperature=None,
                    top_k=None, top_p=None,
                    pad_token_id=tok.eos_token_id,
                )
                print(f"[llm] loaded {name}")
                return
            except Exception as e:
                LLM_ERR = f"{name}: {e}"
                continue
        LLM_OK = False
        print("[llm] none loaded; using fallback templates.")
        if LLM_ERR: print("[llm] last error:", LLM_ERR)
    except Exception as e:
        LLM_OK, LLM_ERR = False, str(e)
        print("[llm] transformers unavailable; using fallback templates.")
        print("[llm] error:", LLM_ERR)


In [9]:
def llm_generate(prompt: str, max_new_tokens=MAX_NEW_TOKENS) -> str:
    if PIPE is None: raise RuntimeError("LLM pipeline not available.")
    out = PIPE(prompt, max_new_tokens=max_new_tokens)[0]["generated_text"]
    return sanitize_code(out)
    

## Why templates?

Even good small models sometimes emit junk, especially with limited context. Templates guarantee a runnable baseline:

- If the model output is invalid or unparseable → **template takes over**
- Templates are *minimal but correct*; they include a demo in `if __name__ == "__main__":`

This keeps demos dependable during interviews, workshops, or limited-network environments.


In [10]:
def fallback_code_for(task: str) -> str:
    t = task.lower()
    if "factorial" in t:
        return """def factorial(n):
    f = 1
    for i in range(2, n+1): f *= i
    return f

if __name__ == "__main__":
    print(factorial(5))
"""
    if "palindrome" in t:
        return """def is_palindrome(s):
    s = str(s)
    return s == s[::-1]

if __name__ == "__main__":
    print(is_palindrome("racecar"))
"""
    if "reverse" in t:
        return """def reverse(s): return s[::-1]

if __name__ == "__main__":
    print(reverse("abc"))
"""
    if "prime" in t:
        return """def is_prime(n):
    if n < 2: return False
    if n % 2 == 0: return n == 2
    f = 3
    while f*f <= n:
        if n % f == 0: return False
        f += 2
    return True

if __name__ == "__main__":
    print(is_prime(29))
"""
    return """def add(a, b): return a + b

if __name__ == "__main__":
    print(add(2, 3))
"""


## Architecture & flow




1. **Prompt builder** formats your task into a strict “code-only” instruction.  
2. **Generator** tries a tiny local model. If none loads, we use templates.  
3. **Sanitizer/validator** ensures we only pass clean, parseable Python to the runner.  
4. **Runner** executes the code from a temp file in a subprocess; we capture `stdout`, `stderr`, and an exit code.  
5. **Auto-debug** (optional) applies *one greedy* correction if the run failed and an LLM is available.
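
The first four steps can be sketched end to end without a model. This is a simplified illustration under stated assumptions: `fake_generate` is a stand-in that deliberately returns broken code, `TEMPLATE` is a made-up known-good snippet, and the auto-debug step is left out.

```python
import ast
import subprocess
import sys
import tempfile
from pathlib import Path

TEMPLATE = 'print("fallback ran")\n'  # hypothetical known-good snippet

def fake_generate(task: str) -> str:
    """Stand-in for the model call; deliberately returns syntactically broken code."""
    return "def broken(:\n    pass\n"

def is_valid(code: str) -> bool:
    """True if the code parses as Python."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def run(code: str, timeout: int = 10):
    """Write code to a temp file, run it in a child process, return (rc, stdout, stderr)."""
    with tempfile.TemporaryDirectory() as td:
        path = Path(td) / "prog.py"
        path.write_text(code, encoding="utf-8")
        proc = subprocess.run([sys.executable, str(path)],
                              capture_output=True, text=True, timeout=timeout)
    return proc.returncode, proc.stdout, proc.stderr

def pipeline_once(task: str):
    code = fake_generate(task)   # steps 1-2: build prompt and generate (stubbed)
    if not is_valid(code):       # step 3: validate, otherwise use the template
        code = TEMPLATE
    return run(code)             # step 4: execute and capture the result

rc, out, err = pipeline_once("any task")
print(rc, out.strip())
```

Because the stub always fails validation, this run exercises the template path, which is the behavior you would expect to see when no usable model output is available.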


In [11]:
SYSTEM_PROMPT = "You are a helpful code generator. Output ONLY valid Python code for a single file, no explanations."

def build_prompt(task: str) -> str:
    return f"""{SYSTEM_PROMPT}

# Task:
{task}
# Code:
"""
    

In [12]:
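def build_fix_prompt(prev_code: str, stderr: str, task: str) -> str:
    # Show the model the failing code (previously unused) and end with the
    # "# Code:" marker so sanitize_code() can strip the echoed prompt.
    return f"""{SYSTEM_PROMPT}

# Task:
{task}

# Previous attempt:
{prev_code}

# It failed with this error:
{stderr}

# Code:
"""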


In [13]:
def ensure_valid_or_fallback(code: str, task: str) -> str:
    code = sanitize_code(code)
    # Hard checks for typical code structure; enforce validity
    if not is_valid_python(code) or ("def " not in code and "if __name__" not in code):
        return fallback_code_for(task)
    return code

In [14]:
def generate_code(task: str) -> str:
    try_load_llm()
    if LLM_OK:
        try:
            raw = llm_generate(build_prompt(task))
            return ensure_valid_or_fallback(raw, task)
        except Exception as e:
            print("[gen] LLM error; using fallback:", e)
    else:
        print("[gen] LLM unavailable; using fallback.")
    return fallback_code_for(task)

In [15]:
def auto_debug(task: str, max_attempts: int = 1):
    """
    Generate code and run it. If the run fails AND an LLM is loaded, ask for up to
    `max_attempts` greedy fixes. If the code still fails, run the clean fallback.
    """
    code = generate_code(task)
    rc, out, err, cur = run_python(code)
    tries = 0
    while rc != 0 and tries < max_attempts and LLM_OK:
        try:
            fixed = llm_generate(build_fix_prompt(cur, err, task))
        except Exception as e:
            # Mirror generate_code(): a model error should not crash the loop
            print("[fix] LLM error; using fallback:", e)
            break
        fixed = ensure_valid_or_fallback(fixed, task)
        rc, out, err, cur = run_python(fixed)
        code, tries = fixed, tries + 1
    if rc != 0:
        code = fallback_code_for(task)
        rc, out, err, cur = run_python(code)
    return rc, out, err, code
    

## Evaluation philosophy

- Use **small, unambiguous tasks** that fit in a single file.  
- Evaluate with **assertions** (no extra harnesses or packages).  
- Favor **binary pass/fail** outcomes you can read at a glance.

This approach isn’t meant to “benchmark” LLMs. It’s meant to **prove the loop works** and to help you iterate safely.
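
A minimal sketch of assertion-based checking, assuming the candidate code and its tests are plain strings: `passes` is a hypothetical helper that runs both in a child interpreter via `python -c` (the notebook itself writes a temp file instead) and treats a clean exit plus an `OK` marker as a pass.

```python
import subprocess
import sys

def passes(code: str, tests: str, timeout: int = 10) -> bool:
    """Run code followed by its assert-based tests in a child interpreter."""
    program = f"{code}\n\n{tests}\nprint('OK')\n"
    proc = subprocess.run([sys.executable, "-c", program],
                          capture_output=True, text=True, timeout=timeout)
    # A failed assert raises AssertionError, so the exit code is non-zero
    # and the OK marker is never printed.
    return proc.returncode == 0 and "OK" in proc.stdout

good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
tests = "assert add(2, 3) == 5"

print(passes(good, tests))  # True
print(passes(bad, tests))   # False
```

Running the tests in a separate process keeps a crashing or hanging candidate from taking the notebook kernel down with it.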


In [16]:
TOY_TASKS = [
    {"prompt": "Implement add(a,b) that returns a+b. Include a tiny demo.", "tests": "assert add(2,3)==5"},
    {"prompt": "Implement reverse(s) that returns reversed string. Demo it.", "tests": "assert reverse('abc')=='cba'"},
    {"prompt": "Implement factorial(n). Demo in main.", "tests": "assert factorial(5)==120"},
    {"prompt": "Implement is_prime(n). Demo in main.", "tests": "assert is_prime(29) is True and is_prime(1) is False"},
]


## Extending the toy dataset

Want to add your own tasks? Use this pattern:

- A short, one-sentence prompt (no ambiguous constraints)
- A simple postcondition checked with an `assert`
- A tiny demo print for quick visual sanity

Example:

```python
TOY_TASKS.append({
    "prompt": "Implement nth_fib(n) returning the nth Fibonacci (0-indexed). Demo with n=7.",
    "tests": "assert nth_fib(7)==13"
})
```


In [17]:
def eval_toy():
    passed = 0
    for t in TOY_TASKS:
        code = generate_code(t["prompt"])
        code += f"\n\n# tests\n{t['tests']}\nprint('OK')\n"
        rc, out, err, _ = run_python(code)
        ok = (rc == 0 and "OK" in out)
        print(f"[toy] {t['prompt'][:40]}... -> {'PASS' if ok else 'FAIL'}")
        if not ok: print("stderr:", err)
        passed += int(ok)
    print(f"[toy] Passed {passed}/{len(TOY_TASKS)}")




## Reproducibility & determinism

- **Greedy decoding** avoids randomness in model outputs.  
- **Fallback templates** make success deterministic if the model fails.  
- **Assertions** act as exact specs for success.

If you later enable sampling for creativity, keep the template fallback as a safety net.


In [18]:
mode = RUN_MODE.upper().strip()
print("MODE:", mode)

if mode == "GEN_ONLY":
    task = "Write a function factorial(n) that returns n! and include a tiny demo in main."
    code = generate_code(task)
    print("----- GENERATED CODE -----\n", code)
    rc, out, err, _ = run_python(code)
    print("---- RESULT ----\nreturn code:", rc, "\nstdout:\n", out, "\nstderr:\n", err)

elif mode == "AUTODEBUG":
    task = "Write a function to check if a string is a palindrome; print a demo in main."
    rc, out, err, final_code = auto_debug(task, max_attempts=1)
    print("---- RESULT ----\nreturn code:", rc, "\nstdout:\n", out, "\nstderr:\n", err)
    print("----- FINAL CODE -----\n", final_code)

elif mode == "EVAL_TOY":
    eval_toy()

else:
    print("Unknown RUN_MODE. Use: GEN_ONLY | AUTODEBUG | EVAL_TOY")



MODE: AUTODEBUG


Device set to use cpu


[llm] loaded sshleifer/tiny-gpt2
---- RESULT ----
return code: 0 
stdout:
 True
 
stderr:
 
----- FINAL CODE -----
 def is_palindrome(s):
    s = str(s)
    return s == s[::-1]

if __name__ == "__main__":
    print(is_palindrome("racecar"))



In [19]:
task = "Write a function factorial(n) that returns n! and include a tiny demo in main."
rc, out, err, final_code = auto_debug(task, max_attempts=1)  # or: code = generate_code(task); rc,out,err,_ = run_python(code)
print("RC:", rc); print("STDOUT:\n", out); print("FINAL CODE:\n", final_code)


RC: 0
STDOUT:
 120

FINAL CODE:
 def factorial(n):
    f = 1
    for i in range(2, n+1): f *= i
    return f

if __name__ == "__main__":
    print(factorial(5))



In [20]:
task = "Write a function is_prime(n) that returns True if n is prime else False; print a quick demo for 29 and 1."
rc, out, err, final_code = auto_debug(task, max_attempts=1)
print("RC:", rc); print("STDOUT:\n", out); print("FINAL CODE:\n", final_code)


RC: 0
STDOUT:
 True

FINAL CODE:
 def is_prime(n):
    if n < 2: return False
    if n % 2 == 0: return n == 2
    f = 3
    while f*f <= n:
        if n % f == 0: return False
        f += 2
    return True

if __name__ == "__main__":
    print(is_prime(29))



## Customization guide

### Swap models
In the configuration cell (`In [3]`), reorder `MODEL_CANDIDATES` to put `"gpt2"` first for slightly richer outputs (at the cost of a larger download). On a GPU, feel free to try a small code model like `"bigcode/tiny_starcoder_py"`.

### Add new “skills”
Add a new branch in `fallback_code_for(task)` for a task pattern you care about, e.g., CSV parsing, simple math utilities, or string formatting.
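
If the chain of `if` branches grows, one possible alternative is a keyword-to-template registry. The sketch below is only an illustration: `TEMPLATES`, `lookup_template`, and the two example skills are hypothetical and not part of the notebook.

```python
# Hypothetical registry: keyword -> template source. All names are illustrative.
TEMPLATES = {
    "celsius": (
        "def c_to_f(c):\n"
        "    return c * 9 / 5 + 32\n"
        "\n"
        'if __name__ == "__main__":\n'
        "    print(c_to_f(100))\n"
    ),
    "word count": (
        "def word_count(s):\n"
        "    return len(s.split())\n"
        "\n"
        'if __name__ == "__main__":\n'
        '    print(word_count("a b c"))\n'
    ),
}

def lookup_template(task: str):
    """Return the first template whose keyword appears in the task, else None."""
    t = task.lower()
    for keyword, source in TEMPLATES.items():
        if keyword in t:
            return source
    return None

print(lookup_template("Convert Celsius to Fahrenheit") is not None)
print(lookup_template("Something unrelated"))
```

One way to wire this in would be to call `lookup_template(task)` at the top of `fallback_code_for` and fall through to the existing branches when it returns `None`.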

### Hardening
- Increase `TIMEOUT_SEC` for heavier tasks
- Add stricter validators (e.g., require specific function signatures)
- Log runs to disk (timestamped files) for auditing
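
As an example of a stricter validator, the sketch below checks that generated code defines a top-level function with a given name and number of positional parameters. `has_function` is a hypothetical helper, and the rule it enforces is just one possible policy.

```python
import ast

def has_function(code: str, name: str, n_args: int) -> bool:
    """True if code defines a top-level function `name` with exactly n_args positional parameters."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False
    for node in tree.body:
        if isinstance(node, ast.FunctionDef) and node.name == name:
            # Count positional-only and regular positional parameters.
            return len(node.args.posonlyargs) + len(node.args.args) == n_args
    return False

print(has_function("def add(a, b):\n    return a + b\n", "add", 2))  # True
print(has_function("def add(a):\n    return a\n", "add", 2))         # False
print(has_function("x = 1\n", "add", 2))                             # False
```

A check like this could sit next to the `ast.parse` validation so that output with the wrong signature triggers the template before the tests ever run.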

### Turning this into a CLI
You can wrap the `generate_code`, `auto_debug`, and `eval_toy` functions with `argparse` to build a tiny command-line tool.
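
A possible shape for such a tool is sketched below. The subcommand names (`gen`, `debug`, `eval`) and the `--max-attempts` flag are assumptions for illustration, and the branches only print what they would call so the snippet runs on its own.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Build a parser with one subcommand per notebook entry point."""
    parser = argparse.ArgumentParser(prog="codegen", description="Tiny code-assistant CLI (sketch).")
    sub = parser.add_subparsers(dest="command", required=True)

    gen = sub.add_parser("gen", help="generate code for a task")
    gen.add_argument("task")

    dbg = sub.add_parser("debug", help="generate, run, and attempt a fix")
    dbg.add_argument("task")
    dbg.add_argument("--max-attempts", type=int, default=1)

    sub.add_parser("eval", help="run the toy test set")
    return parser

def main(argv=None) -> int:
    args = build_parser().parse_args(argv)
    # In the notebook these branches would call generate_code, auto_debug, and eval_toy.
    if args.command == "gen":
        print(f"would call generate_code({args.task!r})")
    elif args.command == "debug":
        print(f"would call auto_debug({args.task!r}, max_attempts={args.max_attempts})")
    else:
        print("would call eval_toy()")
    return 0

# Simulate: codegen debug "reverse a string" --max-attempts 2
main(["debug", "reverse a string", "--max-attempts", "2"])
```

In a real script you would replace the final line with a `main()` call under an `if __name__ == "__main__":` guard so the arguments come from the command line.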


## FAQ

**Q: Why do I sometimes see the fallback code even when an LLM loaded?**  
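A: We validate outputs with `ast.parse` and also require a `def` or an `if __name__` block. If the model's output fails either check, or the code still errors after the fix attempt, the fallback guarantees a working snippet. This is by design.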

**Q: Can I make the model more creative?**  
A: Yes. Put `"gpt2"` first in `MODEL_CANDIDATES` and enable sampling in the loader (`do_sample=True` with a `temperature` in `try_load_llm`). Keep the validator and fallback in place to avoid breaking runs.

**Q: How do I run bigger tasks?**  
A: Increase `TIMEOUT_SEC`, raise `MAX_NEW_TOKENS`, and consider a stronger local model or a hosted API. For complex tasks, switch to multi-file with tests, but keep the same run/validate pattern.

**Q: Why not use MBPP/HumanEval directly here?**  
A: This notebook is *offline-first*. MBPP and HumanEval require downloads and a test harness. For a heavier project, you can integrate them in a separate section while keeping this simple loop as a quick sanity stage.

---

