# Cervella Baby POC - Week 2

> **TIER 2 - Medium Tasks (T11-T18)**

---

## About

| Item | Value |
|------|-------|
| **Week 1 Result** | PASS: 9/10 tasks passed |
| **Timeline** | Week 2: January 11-17, 2026 |
| **Model** | Qwen3-4B-Instruct-2507 (Apache 2.0) |
| **Tasks** | T11-T18 (TIER 2 - Medium) |

---

## Success Criteria Week 2

- **PASS:** >=5/8 tasks with score >=75%
- **CONDITIONAL:** 3-4/8 tasks pass
- **FAIL:** <3/8 tasks pass
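
The three outcomes above can be expressed as a small helper. This is a minimal sketch: the function name `week2_verdict` and its arguments are illustrative and not part of the notebook, and it assumes a task counts as passed when its score is at least 75%.

```python
def week2_verdict(scores_pct, pass_score=75.0):
    """Map per-task scores (0-100) to PASS / CONDITIONAL / FAIL.

    Hypothetical helper: the thresholds mirror the Week 2 criteria
    (>=5 passed -> PASS, 3-4 passed -> CONDITIONAL, <3 passed -> FAIL).
    """
    passed = sum(1 for s in scores_pct if s >= pass_score)
    if passed >= 5:
        return "PASS"
    if passed >= 3:
        return "CONDITIONAL"
    return "FAIL"


# Illustrative scores only, not real results: five of them reach 75%
example_scores = [80, 75, 90, 60, 85, 70, 95, 50]
print(week2_verdict(example_scores))  # PASS
```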

---

*"The magic now comes with awareness!"*

## 1. Setup Environment

**IMPORTANT:** Before running:
1. Runtime > Change runtime type > **T4 GPU**
2. Make sure at least 15 GB of RAM is available

In [None]:
# Check that a GPU is available
!nvidia-smi

In [None]:
%%capture
# Install Unsloth (optimized for Colab); %%capture must be the first line of the cell
!pip install unsloth
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

In [None]:
%%capture
# Install the other dependencies
!pip install transformers datasets accelerate bitsandbytes

In [None]:
# Imports
import json
import time
from datetime import datetime
from unsloth import FastLanguageModel
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'None'}")

## 2. Load Model - Qwen3-4B-Instruct-2507

In [None]:
# Load model with Unsloth (4-bit quantization)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-4B-Instruct-2507",
    max_seq_length=4096,  # Increased for more complex tasks
    dtype=None,
    load_in_4bit=True,
)

# Enable fast inference
FastLanguageModel.for_inference(model)

print("\nModel loaded!")
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")

## 3. System Prompt - Cervella CONSTITUTION

In [None]:
SYSTEM_PROMPT = """
# CERVELLA - Core Identity

## WHO I AM

I am Cervella, Rafa's STRATEGIC PARTNER (not an assistant).

**The difference:**
- Assistant: "Yes Rafa, I'll do it right away"
- Partner: "Wait Rafa, first I need to understand/research/think"

**Roles:**
- Rafa = CEO & Visionary (the WHY)
- Me = Strategic Partner (the HOW)
- Together = The magic

## FINAL GOAL: GEOGRAPHIC FREEDOM

We don't work for the code. We work for FREEDOM.

## CORE PHILOSOPHY - The Pillars:

1. "We work in PEACE! Without CHAOS! It depends on US!"
2. "Done WELL > Done FAST"
3. "Details ALWAYS make the difference"
4. "Nothing is complex - only not yet studied!"
5. "It's not always how we imagine it... but in the end it's 100000%!"

## HOW I WORK - THE 4 RULES OF THE PARTNER

1. REASON - Don't execute blindly
2. RESEARCH - Before proposing
3. DISSENT - When necessary
4. PROTECT - The project and Rafa

## TONE & VOICE

- With CALM and PRECISION
- Never rushed, never approximate
- Every detail counts. Always.
- CONCISE, structured output

## GOLDEN RULE

BEFORE ACTING, ASK YOURSELF:
1. Have I UNDERSTOOD what is really needed?
2. Have I RESEARCHED how to do it?
3. Have I REASONED about the consequences?
4. Am I doing the RIGHT thing or the FAST thing?

If even ONE answer is NO -> STOP and THINK
"""

print(f"System prompt length: {len(SYSTEM_PROMPT)} chars")

## 4. Task Dataset - T11-T18 (TIER 2 - Medium)

In [None]:
# Task Dataset TIER 2 (Medium)
TASKS = [
    {
        "id": "T11",
        "name": "Multi-Worker Orchestration",
        "input": """Goal: Deploy PHASE 5 of the Miracollo database
Target files: 22 tables, 47 optimized queries

Task: Plan the orchestration of 3 parallel workers.
Available: cervella-data, cervella-backend, cervella-tester

Constraints:
- cervella-data: DB analysis, migrations
- cervella-backend: Service refactoring
- cervella-tester: Performance verification

Output: Sequential/parallel plan with dependencies.""",
        "pass_threshold": 0.75
    },
    {
        "id": "T12",
        "name": "Simple Architectural Decision",
        "input": """Context: Cervella Baby POC has 20 tasks to run.

Option A: Single script run_all_tasks.py (sequential)
- Pros: Simple, 1 file, single log
- Cons: Slow, hard to debug a single task

Option B: Separate task files + orchestrator
- Pros: Easy single-task testing, parallel execution possible
- Cons: More files, complex setup

Option C: Colab notebook with cells
- Pros: Interactive, immediate visualization
- Cons: Not automatable, hard to version

Task: Choose the best option. Output: Decision + WHY + next step.""",
        "pass_threshold": 0.75
    },
    {
        "id": "T13",
        "name": "Code Review Basic",
        "input": """```python
class CervellaClient:
    def __init__(self, api_key):
        self.api_key = api_key  # ISSUE: not validated

    def chat(self, message):
        try:
            response = requests.post(...)
            return response.json()
        except:
            return {"error": "Request failed"}
```

Task: Code review. Find 2+ issues and suggest fixes.""",
        "pass_threshold": 0.75
    },
    {
        "id": "T14",
        "name": "Bug Analysis from Log",
        "input": """Error log:
[14:24:13] INFO: Writing file: docs/RICERCA.md
[14:24:14] INFO: File saved!
[14:24:20] ERROR: Regina verification failed: File not found

Context:
- Recurring problem (3+ times)
- Other workers do NOT have the problem
- File path looks correct

Task: Root cause analysis + suggested fix.""",
        "pass_threshold": 0.75
    },
    {
        "id": "T15",
        "name": "Documenting an Emerged Pattern",
        "input": """Pattern observed across sessions:
1. Session N: cervella-researcher does the research
2. Session N+1: Regina implements based on the research
3. Session N+2: cervella-tester verifies

This pattern worked with a score of 10/10.

Task: Document the pattern as a reusable guide.""",
        "pass_threshold": 0.75
    },
    {
        "id": "T16",
        "name": "Multi-Scenario Cost Analysis",
        "input": """Data:
- Claude API: $3/M input, $15/M output
- Self-host Qwen3: $175/month (Vast.ai)

Monthly volume scenarios:
1. Startup: 30K requests (avg 500 input + 1000 output tokens)
2. Growth: 100K requests
3. Scale: 500K requests

Task: Comparative cost table, break-even.""",
        "pass_threshold": 0.75
    },
    {
        "id": "T17",
        "name": "Refactoring Plan from Code Smell",
        "input": """Current function: create_booking()
- 250 lines
- Does: validation, pricing calculation, DB insert, email notification, logging
- Too many nested if/else
- Hard to test

Task: Propose a refactoring into 3+ separate services.
Output: Plan with new files, responsibilities, effort.""",
        "pass_threshold": 0.75
    },
    {
        "id": "T18",
        "name": "In-Depth Research Summary",
        "input": """3 reports to synthesize:

Report 14 - Detailed Costs (1087 lines):
- Claude: $3/M input, $15/M output
- Qwen3 self-host: $175/month Vast.ai
- Break-even: 12.5M tokens/month

Report 15 - Timeline and Risks (1400 lines):
- Full Independence: 9-14 months
- MVP Hybrid: 6-8 weeks
- Risk: Performance gap 60-70%

Report 16 - GO/NO-GO Framework (1050 lines):
- Score: 7.5/10
- Recommendation: CONDITIONAL GO
- $50 POC validates everything

Task: Executive summary, max 300 words.
Focus: GO/NO-GO decision, next step, risk.""",
        "pass_threshold": 0.75
    }
]

print(f"Loaded {len(TASKS)} tasks (T11-T18) - TIER 2 Medium")
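
When scoring T16 for correctness it helps to have reference figures at hand. The sketch below recomputes them from the numbers in the task input. It rests on two assumptions that the task does not state: the Growth and Scale scenarios use the same average of 500 input and 1000 output tokens per request as Startup, and a single $175/month self-hosted instance can absorb every volume without extra hardware. All function and constant names are illustrative.

```python
# Prices come from the T16 task input.
CLAUDE_INPUT_USD_PER_M = 3.0
CLAUDE_OUTPUT_USD_PER_M = 15.0
SELF_HOST_USD_PER_MONTH = 175.0

# ASSUMPTION: every scenario uses the Startup averages (500 in + 1000 out).
AVG_INPUT_TOKENS = 500
AVG_OUTPUT_TOKENS = 1000


def api_cost_per_request(input_tokens=AVG_INPUT_TOKENS,
                         output_tokens=AVG_OUTPUT_TOKENS):
    """Claude API cost in USD for one request."""
    return (input_tokens * CLAUDE_INPUT_USD_PER_M
            + output_tokens * CLAUDE_OUTPUT_USD_PER_M) / 1_000_000


def monthly_api_cost(requests):
    """Claude API cost in USD for a month with the given request count."""
    return requests * api_cost_per_request()


def break_even_requests():
    """Monthly request count above which the flat self-host fee is cheaper.

    ASSUMPTION: one $175/month instance can serve the whole volume.
    """
    return SELF_HOST_USD_PER_MONTH / api_cost_per_request()


for name, volume in [("Startup", 30_000), ("Growth", 100_000), ("Scale", 500_000)]:
    print(f"{name:<8} API ${monthly_api_cost(volume):>9,.2f} "
          f"vs self-host ${SELF_HOST_USD_PER_MONTH:,.2f}")
print(f"Break-even: about {break_even_requests():,.0f} requests/month")
```

Under these assumptions the API costs about $495, $1,650 and $8,250 per month for the three scenarios, and self-hosting becomes cheaper above roughly 10,600 requests per month.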

## 5. Inference Function

In [None]:
def run_inference(task_input, system_prompt=SYSTEM_PROMPT, max_new_tokens=1024):
    """Run inference on a single task.
    
    max_new_tokens raised to 1024 for more complex tasks.
    """
    
    # Format messages for Qwen3 chat template
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": task_input}
    ]
    
    # Apply chat template
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    
    # Tokenize
    inputs = tokenizer([text], return_tensors="pt").to("cuda")
    
    # Generate (do_sample=True so that temperature/top_p are actually applied)
    start_time = time.time()
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        use_cache=True,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
    )
    latency = time.time() - start_time
    
    # Decode only the newly generated tokens, so the prompt is never part of
    # the response and no string splitting on "assistant" is needed
    prompt_length = inputs["input_ids"].shape[1]
    response = tokenizer.decode(
        outputs[0][prompt_length:], skip_special_tokens=True
    ).strip()
    
    return {
        "response": response,
        "latency_seconds": latency,
        "tokens_generated": len(outputs[0]) - len(inputs["input_ids"][0])
    }

print("Inference function ready! (max_new_tokens=1024 for medium tasks)")
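
Latency alone is hard to compare across tasks with different output lengths. As a small sketch, the dictionary returned by `run_inference` can be reduced to tokens per second. The helper name `tokens_per_second` is an assumption and not part of the notebook, and the sample values are made up.

```python
def tokens_per_second(result):
    """Generated tokens divided by wall-clock latency.

    `result` is expected to carry the keys returned by run_inference:
    "tokens_generated" and "latency_seconds".
    """
    latency = result["latency_seconds"]
    if latency <= 0:
        return 0.0  # avoid division by zero on degenerate timings
    return result["tokens_generated"] / latency


# Made-up numbers, for illustration only
sample = {"response": "...", "latency_seconds": 20.0, "tokens_generated": 500}
print(f"{tokens_per_second(sample):.1f} tok/s")  # 25.0 tok/s
```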

## 6. Test Single Task (T11)

In [None]:
# Test T11
task = TASKS[0]
print(f"Testing: {task['id']} - {task['name']}")
print(f"Input:\n{task['input']}")
print("\n" + "="*60 + "\n")

result = run_inference(task["input"])

print(f"Response ({result['latency_seconds']:.2f}s):")
print(result["response"])

## 7. Run All T11-T18

In [None]:
# Run all tasks
results = []

for task in TASKS:
    print(f"\n{'='*60}")
    print(f"Running: {task['id']} - {task['name']}")
    print(f"{'='*60}")
    
    result = run_inference(task["input"])
    
    results.append({
        "task_id": task["id"],
        "task_name": task["name"],
        "input": task["input"],
        "output": result["response"],
        "latency_seconds": result["latency_seconds"],
        "tokens_generated": result["tokens_generated"],
        "pass_threshold": task["pass_threshold"],
        "score": None,
        "passed": None
    })
    
    print(f"\nResponse ({result['latency_seconds']:.2f}s):")
    print(result["response"][:800] + "..." if len(result["response"]) > 800 else result["response"])

print(f"\n\n{'='*60}")
print(f"COMPLETED: {len(results)}/{len(TASKS)} tasks")
print(f"Avg latency: {sum(r['latency_seconds'] for r in results) / len(results):.2f}s")

## 8. Evaluation Framework

**TIER 2 evaluation rubric (1-5 for each criterion):**

| Criterion | 5 | 4 | 3 | 2 | 1 |
|-----------|---|---|---|---|---|
| Correctness | Perfect | 1-2 minor errors | 3-4 errors | 5+ errors | Completely wrong |
| Completeness | Everything + extras | Everything requested | 1 secondary item missing | 2+ important items missing | Unusable |
| Cervella style | Calm, precise, explains the WHY | Professional | Functional but generic | Robotic/casual | Not recognizable |
| Utility | Actionable immediately | Needs minor editing | Needs context | Too generic | Unusable |

**Final score:** mean of the 4 criteria x 20, which gives 20-100% (each criterion is at least 1)

**Pass TIER 2:** Score >= 75%
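
As a worked example of the formula, a task rated 4, 4, 3 and 4 averages 3.75, which converts to exactly 75% and passes. Equivalently, the four ratings must sum to at least 15. The sketch below checks this; `rubric_score_pct` is an illustrative name, and the range validation is an added assumption rather than something the notebook enforces.

```python
def rubric_score_pct(correctness, completeness, style, utility):
    """Convert four 1-5 rubric ratings into a percentage score."""
    ratings = (correctness, completeness, style, utility)
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("each rating must be between 1 and 5")
    return sum(ratings) / len(ratings) * 20


print(rubric_score_pct(4, 4, 3, 4))  # 75.0 -> passes the 75% threshold
print(rubric_score_pct(4, 3, 3, 4))  # 70.0 -> fails
```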

In [None]:
def evaluate_task(task_result, correttezza, completezza, stile, utility):
    """Evaluate a task result manually.
    
    Args:
        task_result: dict from results list
        correttezza: 1-5
        completezza: 1-5
        stile: 1-5
        utility: 1-5
    
    Returns:
        Updated task_result with score and passed
    """
    avg_score = (correttezza + completezza + stile + utility) / 4
    score_pct = avg_score * 20  # Convert to 0-100%
    passed = score_pct >= (task_result["pass_threshold"] * 100)
    
    task_result["evaluation"] = {
        "correttezza": correttezza,
        "completezza": completezza,
        "stile": stile,
        "utility": utility,
        "avg_score": avg_score
    }
    task_result["score"] = score_pct
    task_result["passed"] = passed
    
    status = "PASS" if passed else "FAIL"
    print(f"{task_result['task_id']}: {status} ({score_pct:.0f}%) - threshold {task_result['pass_threshold']*100:.0f}%")
    
    return task_result

print("Evaluation function ready!")
print("\nTo evaluate a task:")
print('results[0] = evaluate_task(results[0], correttezza=4, completezza=4, stile=4, utility=4)')

In [None]:
# Evaluate T11 - Multi-Worker Orchestration
# FILL IN AFTER REVIEWING THE OUTPUT!

# results[0] = evaluate_task(results[0], correttezza=?, completezza=?, stile=?, utility=?)

In [None]:
# Evaluate T12 - Architectural Decision
# results[1] = evaluate_task(results[1], correttezza=?, completezza=?, stile=?, utility=?)

In [None]:
# Evaluate T13 - Code Review Basic
# results[2] = evaluate_task(results[2], correttezza=?, completezza=?, stile=?, utility=?)

In [None]:
# Evaluate T14 - Bug Analysis
# results[3] = evaluate_task(results[3], correttezza=?, completezza=?, stile=?, utility=?)

In [None]:
# Evaluate T15 - Pattern Documentation
# results[4] = evaluate_task(results[4], correttezza=?, completezza=?, stile=?, utility=?)

In [None]:
# Evaluate T16 - Cost Analysis
# results[5] = evaluate_task(results[5], correttezza=?, completezza=?, stile=?, utility=?)

In [None]:
# Evaluate T17 - Refactoring Plan
# results[6] = evaluate_task(results[6], correttezza=?, completezza=?, stile=?, utility=?)

In [None]:
# Evaluate T18 - Research Summary
# results[7] = evaluate_task(results[7], correttezza=?, completezza=?, stile=?, utility=?)

## 9. Save Results

In [None]:
# Save results to JSON
output = {
    "metadata": {
        "poc_week": 2,
        "tier": "TIER 2 - Medium",
        "date": datetime.now().isoformat(),
        "model": "Qwen3-4B-Instruct-2507",
        "quantization": "4-bit",
        "total_tasks": len(results),
        "avg_latency": sum(r["latency_seconds"] for r in results) / len(results) if results else 0
    },
    "week1_summary": {
        "result": "PASS",
        "score": "9/10",
        "avg_latency": "19.35s"
    },
    "results": results,
    "summary": {
        # Identity checks distinguish True / False from None (not yet evaluated)
        "tasks_passed": sum(1 for r in results if r["passed"] is True),
        "tasks_failed": sum(1 for r in results if r["passed"] is False),
        "tasks_pending": sum(1 for r in results if r["passed"] is None),
        "pass_threshold": ">=5/8 (62.5%)"
    }
}

# Save to file (UTF-8, keeping any accented characters readable)
with open("week2_results.json", "w", encoding="utf-8") as f:
    json.dump(output, f, indent=2, ensure_ascii=False)

print("Results saved to week2_results.json")
print(f"\nSummary:")
print(f"- Total tasks: {output['metadata']['total_tasks']}")
print(f"- Avg latency: {output['metadata']['avg_latency']:.2f}s")
print(f"- Passed: {output['summary']['tasks_passed']}")
print(f"- Failed: {output['summary']['tasks_failed']}")
print(f"- Pending: {output['summary']['tasks_pending']}")
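
After downloading `week2_results.json`, the file can be inspected outside Colab. A minimal sketch, assuming the structure written by the cell above (a top-level `results` list whose items carry `task_id` and `passed`); `pending_task_ids` is a hypothetical helper name.

```python
import json


def pending_task_ids(path):
    """Return the ids of tasks whose `passed` field is still null (None)."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return [r["task_id"] for r in data["results"] if r["passed"] is None]
```

Calling `pending_task_ids("week2_results.json")` before any task has been scored would list all eight ids, which is a quick way to confirm that evaluation is complete before reading the final summary.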

## 10. Final Summary Week 2

In [None]:
# Final Summary
evaluated = [r for r in results if r["score"] is not None]
passed = [r for r in results if r["passed"] is True]
failed = [r for r in results if r["passed"] is False]

print("=" * 60)
print("POC CERVELLA BABY - WEEK 2 RESULTS")
print("=" * 60)
print(f"\nWeek 1 Result: PASS (9/10)")
print(f"\nWeek 2 Tasks evaluated: {len(evaluated)}/{len(results)}")
print(f"Tasks passed: {len(passed)}/{len(evaluated) if evaluated else len(results)}")
print(f"Tasks failed: {len(failed)}/{len(evaluated) if evaluated else len(results)}")

if evaluated:
    avg_score = sum(r["score"] for r in evaluated) / len(evaluated)
    print(f"\nAverage score: {avg_score:.1f}%")
    
    print("\n" + "-" * 40)
    print("\nDetailed Results:")
    for r in results:
        status = "PASS" if r["passed"] else ("FAIL" if r["passed"] is False else "PENDING")
        score = f"{r['score']:.0f}%" if r["score"] is not None else "N/A"
        print(f"  {r['task_id']}: {status} ({score}) - {r['task_name']}")
    
    print("\n" + "=" * 60)
    if len(passed) >= 5:
        print("WEEK 2 RESULT: PASS (>=5/8)")
        print("\nProceed to Week 3 (T19-T20 Complex)!")
        print("GO/NO-GO Decision: February 1, 2026")
    elif len(passed) >= 3:
        print("WEEK 2 RESULT: CONDITIONAL (3-4/8)")
        print("\nReview failed tasks, analyze gaps.")
    else:
        print("WEEK 2 RESULT: FAIL (<3/8)")
        print("\nConsider: larger model, more tuning, or hybrid approach.")
else:
    print("\nNo tasks evaluated yet. Use evaluate_task() to score each result.")

print("\n" + "=" * 60)
print('"The magic now comes with awareness!"')

---

## Next Steps

1. **Download results:** `week2_results.json`
2. **Update SNCP:** `.sncp/stato/oggi.md` with the results
3. **Decision:**
   - PASS >= 5/8 -> Proceed Week 3 (T19-T20 Complex)
   - CONDITIONAL 3-4/8 -> Review and decide
   - FAIL < 3/8 -> Consider alternatives

---

*POC Cervella Baby - Week 2*
*"Push beyond your own limits!"*