# Phase 0.1: Colab Setup & FinRLlama Inference Test

**What this notebook does:**
1. Sets up the project environment on Colab
2. Verifies GPU access
3. Logs into HuggingFace (needed for gated LLaMA model)
4. Runs the FinRLlama prompt template on Qwen 2.5-3B to establish a BASELINE
5. (Once approved) Runs the actual FinRLlama model for comparison

**Why Colab?** Our 8GB Mac can't fit a 3B model in memory for inference. Colab Pro gives us an A100 with 40GB VRAM, which is more than enough.
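
The memory claim above comes down to simple arithmetic: weight memory is roughly parameter count times bytes per parameter. A minimal sketch follows. `estimate_weight_memory_gb` is a hypothetical helper (not part of this project), and it counts weights only; activations, the KV cache, and framework overhead come on top.

```python
def estimate_weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    """Rough size of the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


# A nominal 3B-parameter model; real checkpoints differ slightly in size.
N_PARAMS = 3e9
fp32_gb = estimate_weight_memory_gb(N_PARAMS, 4)  # 4 bytes per fp32 weight
fp16_gb = estimate_weight_memory_gb(N_PARAMS, 2)  # 2 bytes per fp16 weight
print(f"fp32 weights: ~{fp32_gb:.0f} GB, fp16 weights: ~{fp16_gb:.0f} GB")
```

At fp16 the weights alone come to about 6 GB, which leaves very little headroom on an 8GB machine that also has to hold the OS and the Python process.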

## Step 1: Verify GPU

First, make sure Colab gave us a GPU. Go to **Runtime → Change runtime type → T4 or A100**.

In [None]:
# 🎓 This cell checks what GPU Colab assigned us.
# nvidia-smi is the NVIDIA command to see GPU info (like 'top' for your GPU).
!nvidia-smi

import torch
print(f"\nPyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    # The attribute is total_memory (in bytes); divide by 1e9 to report GB.
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("⚠️ No GPU! Go to Runtime → Change runtime type → select T4 or A100")

## Step 2: Clone Our Repo

In [None]:
# 🎓 Clone our private repo into Colab's temporary filesystem.
# Since you connected GitHub to Colab, authentication should work automatically.
# If it asks for auth, use a GitHub Personal Access Token.
import os

REPO_DIR = "/content/india-credit-signals"

if not os.path.exists(REPO_DIR):
    !git clone https://github.com/spiffler33/india-credit-signals.git {REPO_DIR}
    print(f"✅ Cloned to {REPO_DIR}")
else:
    !cd {REPO_DIR} && git pull
    print(f"✅ Pulled latest into {REPO_DIR}")

os.chdir(REPO_DIR)
!git log --oneline -5

## Step 3: Install Dependencies

Colab comes with many packages pre-installed, but we need our specific versions.

In [None]:
# 🎓 Install our project deps. We use pip here (not uv) because Colab's
# environment is pre-built and uv venvs conflict with it.
# The -q flag = quiet (less output noise).
# Each spec is quoted so the shell doesn't treat ">=" as an output redirect.
!pip install -q "transformers>=4.48.0" "peft>=0.14.0" "datasets>=3.0.0" "accelerate>=1.0.0" \
    "bitsandbytes>=0.43.0" "loguru>=0.7.0" "httpx>=0.27.0" "tenacity>=8.2.0" \
    "polars>=1.0.0" "pyyaml>=6.0"

print("\n✅ Dependencies installed")

# Verify key imports
import transformers, peft, datasets, accelerate, bitsandbytes
print(f"transformers: {transformers.__version__}")
print(f"peft: {peft.__version__}")
print(f"bitsandbytes: {bitsandbytes.__version__}")
print(f"torch: {torch.__version__}, CUDA: {torch.cuda.is_available()}")
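
Printing versions shows what is installed but not whether it meets the minimums requested above. A rough sketch of an explicit check, using only the standard library, could look like the following. `parse_release`, `meets_minimum`, and `MINIMUMS` are hypothetical names introduced for illustration, and the comparison is deliberately simple: it looks only at the leading numeric part of a version and ignores pre-release ordering (`packaging.version` handles those cases more thoroughly).

```python
import re
from importlib.metadata import PackageNotFoundError, version


def parse_release(v: str) -> tuple:
    """Turn '4.48.0' or '2.1.0+cu121' into a tuple of ints, e.g. (4, 48, 0)."""
    m = re.match(r"\d+(?:\.\d+)*", v)
    return tuple(int(p) for p in m.group().split(".")) if m else ()


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare numeric release parts, padding with zeros so '6.0' equals '6.0.0'."""
    a, b = parse_release(installed), parse_release(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


# Minimums copied from the pip command above (subset shown).
MINIMUMS = {"transformers": "4.48.0", "peft": "0.14.0", "bitsandbytes": "0.43.0"}

for pkg, minimum in MINIMUMS.items():
    try:
        installed = version(pkg)
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
        continue
    status = "ok" if meets_minimum(installed, minimum) else f"below {minimum}"
    print(f"{pkg}: {installed} ({status})")
```

If a package reports "below" right after a successful install, the usual cause on Colab is that the older version was already imported; restarting the runtime picks up the new one.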

## Step 4: HuggingFace Login

We need this to access Meta's gated LLaMA model (base model for FinRLlama).

**If your LLaMA access is still PENDING**, skip this: Step 5 uses Qwen, which doesn't need auth.

To get your token: https://huggingface.co/settings/tokens → New token → Read access.

In [None]:
# 🎓 This stores your HuggingFace token so the transformers library
# can download gated models. The token is saved in ~/.cache/huggingface/.
from huggingface_hub import login

# Option A: Use Colab's Secrets feature (recommended: keeps the token out of the notebook)
# Go to the 🔑 icon in Colab sidebar → add secret named HF_TOKEN
try:
    from google.colab import userdata
    hf_token = userdata.get('HF_TOKEN')
    login(token=hf_token)
    print("✅ Logged in via Colab Secrets")
except Exception:
    # Option B: Paste token manually (will prompt you)
    login()
    print("✅ Logged in manually")

## Step 5: Baseline Test with Qwen 2.5-3B (no fine-tuning)

We run the FinRLlama prompt template on a model that was NOT fine-tuned for finance.
This is our **baseline**: how well does a generic LLM do at scoring financial news?

Later we compare: baseline Qwen → fine-tuned FinRLlama → our credit-risk model.

In [None]:
import re
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# 🎓 Same prompt that FinRLlama was fine-tuned on (from their task2_signal.py).
# An un-tuned model sees this for the first time, so it has to rely on general
# language understanding, not specialized financial training.
SIGNAL_PROMPT = """
Task: Analyze the following news headline about a stock and provide a sentiment score between -{signal_strength} and {signal_strength}, where:
-{signal_strength} represents a highly negative sentiment, likely indicating a substantial decline in stock performance.
-{threshold} represents a moderate negative sentiment, suggesting a slight potential decline in stock performance.
0 represents neutral sentiment, indicating no significant impact on stock performance.
{threshold} represents a moderate positive sentiment, indicating potential for slight stock growth.
{signal_strength} represents a highly positive sentiment, indicating significant potential for stock appreciation.

Consider the likely influence of market feedback from previous price movements and sentiment trends:
How has the stock's price responded to similar news in the past?
Does the headline align with prevailing market sentiment, or does it contradict current trends?
How might this sentiment lead to a change in the stock's behavior, considering both historical price patterns and market expectations?

Examples of sentiment scoring:
"Company X announces layoffs amidst economic downturn." Score: -8
"Company Y reports record revenue growth in Q1." Score: 7
"Market sees strong response to Company Z's new product release." Score: 5

Do not provide any explanations or reasoning. Output only a single integer in the range of -{signal_strength} to {signal_strength} based on the sentiment of the news and its potential impact on stock performance.

News headline: "{news}"

Price Data: "{prices}"

SENTIMENT SCORE:
"""

# Test cases: mix of US equity (in-domain) and Indian credit (out-of-domain)
TEST_CASES = [
    {
        "name": "Strong positive: earnings beat",
        "news": "Apple reports Q4 revenue of $94.9 billion, beating analyst estimates by 5%. iPhone sales surge 12% year-over-year driven by strong demand in emerging markets.",
        "prices": "AAPL: Open=175.20, High=178.50, Low=174.80, Close=177.90, Volume=82M",
    },
    {
        "name": "Strong negative: fraud/governance",
        "news": "SEC charges Wirecard executives with massive accounting fraud. Company files for insolvency after revealing 1.9 billion euros missing from accounts.",
        "prices": "WDI: Open=104.50, High=104.50, Low=1.28, Close=1.28, Volume=350M",
    },
    {
        "name": "Ambiguous: mixed signals",
        "news": "Tesla announces 10% workforce reduction while simultaneously revealing record vehicle deliveries of 1.8 million units in 2023.",
        "prices": "TSLA: Open=248.50, High=252.30, Low=245.10, Close=246.80, Volume=115M",
    },
    {
        "name": "NBFC credit: IL&FS crisis (out of domain)",
        "news": "IL&FS defaults on Rs 1,000 crore commercial paper. RBI expresses concern about liquidity in NBFC sector. DHFL share price crashes 60% on contagion fears.",
        "prices": "ILFS: Open=25.50, High=25.50, Low=12.20, Close=12.80, Volume=45M",
    },
    {
        "name": "Regulatory: RBI action (out of domain)",
        "news": "RBI increases risk weights on NBFC lending by 25 basis points, citing rapid credit growth concerns. Banking stocks fall 2-3% across the board.",
        "prices": "NIFTYBANK: Open=44250, High=44300, Low=43100, Close=43200, Volume=200M",
    },
]

In [None]:
def load_model(model_name: str):
    """Load a model onto GPU with appropriate settings."""
    print(f"Loading {model_name}...")
    t0 = time.time()

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # 🎓 On Colab GPU we can load in fp16 directly onto CUDA.
    # device_map="auto" lets transformers figure out the best GPU placement.
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        device_map="auto",
    )

    load_time = time.time() - t0
    param_count = sum(p.numel() for p in model.parameters())
    print(f"✅ Loaded in {load_time:.1f}s | {param_count / 1e9:.2f}B params")
    return model, tokenizer


def run_test(model, tokenizer, model_label: str):
    """Run all test cases and print results."""
    signal_strength = 10
    threshold = signal_strength // 3
    device = next(model.parameters()).device

    print("\n" + "=" * 70)
    print(f"Results: {model_label}")
    print("=" * 70)

    results = []
    for i, case in enumerate(TEST_CASES, 1):
        prompt = SIGNAL_PROMPT.format(
            signal_strength=signal_strength,
            threshold=threshold,
            news=case["news"],
            prices=case["prices"],
        )

        inputs = tokenizer(prompt, return_tensors="pt").to(device)

        t0 = time.time()
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=10,
                pad_token_id=tokenizer.eos_token_id,
                do_sample=False,
            )
        gen_time = time.time() - t0

        new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
        response = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

        match = re.search(r"-?\d+", response)
        score = int(match.group()) if match else None
        direction = None
        if score is not None:
            direction = "BULLISH" if score >= threshold else "BEARISH" if score <= -threshold else "NEUTRAL"

        results.append({"name": case["name"], "score": score, "direction": direction, "raw": response})

        print(f"\n[{i}/{len(TEST_CASES)}] {case['name']}")
        print(f"  Raw output: {repr(response)}")
        # Compare against None explicitly: a valid score of 0 is falsy.
        print(f"  Score: {score} → {direction}" if score is not None else "  Score: PARSE FAILED")
        print(f"  Time: {gen_time:.2f}s")

    return results
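
The parsing step inside `run_test` is easy to get subtly wrong, so it helps to see it in isolation. The sketch below pulls the same regex and threshold logic into two small functions. `parse_score` and `classify` are hypothetical names, and the out-of-range check is an addition that the notebook's own code does not perform; treat it as one possible hardening, not as FinRLlama's method.

```python
import re
from typing import Optional


def parse_score(response: str, signal_strength: int = 10) -> Optional[int]:
    """Return the first integer in the response, or None if absent or out of range."""
    match = re.search(r"-?\d+", response)
    if match is None:
        return None
    score = int(match.group())
    # Assumption: anything outside [-signal_strength, signal_strength] is a parse failure.
    if abs(score) > signal_strength:
        return None
    return score


def classify(score: Optional[int], threshold: int) -> Optional[str]:
    """Map a score to a direction label using the same cutoffs as run_test."""
    if score is None:
        return None
    if score >= threshold:
        return "BULLISH"
    if score <= -threshold:
        return "BEARISH"
    return "NEUTRAL"


threshold = 10 // 3  # integer division, so the cutoff is 3
for raw in ["7", "Score: -8.", "0", "no number here"]:
    score = parse_score(raw)
    print(repr(raw), "->", score, classify(score, threshold))
```

Note that the regex takes the first integer it finds, so a response like "3-5" parses as 3. Printing the raw output next to the parsed score, as `run_test` does, is what makes such cases visible.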

In [None]:
# 🎓 BASELINE: Qwen 2.5-3B-Instruct, a general-purpose LLM, NOT fine-tuned for finance.
# This shows us what a "smart but untrained" model does with the FinRLlama prompt.
qwen_model, qwen_tokenizer = load_model("Qwen/Qwen2.5-3B-Instruct")
qwen_results = run_test(qwen_model, qwen_tokenizer, "Qwen 2.5-3B (BASELINE: no finance training)")

# Free GPU memory before loading next model
del qwen_model, qwen_tokenizer
torch.cuda.empty_cache()

## Step 6: FinRLlama Test (skip if LLaMA access still pending)

Run the SAME test cases on the fine-tuned FinRLlama model.
This lets us compare: does fine-tuning on market feedback actually improve signal quality?

**Skip this cell if your LLaMA access is still PENDING.**

In [None]:
# 🎓 FinRLlama: LLaMA 3.2-3B fine-tuned with RLMF (market feedback as reward).
# This model was specifically trained to produce better sentiment scores.
# Compare its outputs to the baseline Qwen above.
try:
    finrl_model, finrl_tokenizer = load_model("Arnav-Gr0ver/FinRLlama-3.2-3B-Instruct")
    finrl_results = run_test(finrl_model, finrl_tokenizer, "FinRLlama 3.2-3B (RLMF fine-tuned)")

    del finrl_model, finrl_tokenizer
    torch.cuda.empty_cache()
except Exception as e:
    print(f"⚠️ Could not load FinRLlama: {e}")
    print("This is expected if LLaMA access is still PENDING.")
    print("Re-run this cell once Meta approves your access.")
    finrl_results = None

## Step 7: Compare Results

Side-by-side comparison of baseline vs fine-tuned model.

In [None]:
# 🎓 Summary comparison table
print("\n" + "=" * 70)
print("COMPARISON: Baseline vs Fine-tuned")
print("=" * 70)
print(f"{'Test Case':<45} {'Qwen (base)':>12} {'FinRLlama':>12}")
print("-" * 70)

for i, case in enumerate(TEST_CASES):
    qwen_score = qwen_results[i]["score"] if qwen_results[i]["score"] is not None else "FAIL"
    finrl_score = "N/A"
    if finrl_results and finrl_results[i]["score"] is not None:
        finrl_score = finrl_results[i]["score"]
    elif finrl_results is None:
        finrl_score = "PENDING"

    print(f"{case['name']:<45} {str(qwen_score):>12} {str(finrl_score):>12}")

print("\n" + "=" * 70)
print("🎓 KEY QUESTION: Do the scores make sense?")
print("   - Earnings beat should be strongly positive (+7 to +10)")
print("   - Fraud should be strongly negative (-8 to -10)")
print("   - Mixed signals should be near 0")
print("   - NBFC/RBI headlines are OUT OF DOMAIN: interesting to see what happens")
print("   → This is WHY we need to fine-tune: generic models don't understand credit risk.")
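
Beyond eyeballing the table, one rough way to summarize the comparison is the share of test cases where both models land on the same direction label. The sketch below assumes result rows shaped like those returned by `run_test` (a dict with a `"direction"` key). `direction_agreement` is a hypothetical helper, and the demo rows are placeholders for illustration only, not real model outputs.

```python
from typing import Optional


def direction_agreement(results_a: list, results_b: list) -> Optional[float]:
    """Fraction of cases where both runs parsed a score and agree on direction."""
    comparable = [
        (a["direction"], b["direction"])
        for a, b in zip(results_a, results_b)
        if a["direction"] is not None and b["direction"] is not None
    ]
    if not comparable:
        return None  # nothing to compare, e.g. every parse failed
    return sum(x == y for x, y in comparable) / len(comparable)


# Placeholder rows for illustration only; these are NOT real model outputs.
demo_a = [{"direction": "BULLISH"}, {"direction": "BEARISH"}, {"direction": None}]
demo_b = [{"direction": "BULLISH"}, {"direction": "NEUTRAL"}, {"direction": "BEARISH"}]
print(direction_agreement(demo_a, demo_b))
```

With only five test cases this is a sanity check rather than a metric; in the notebook it would be called as `direction_agreement(qwen_results, finrl_results)` once both runs have completed.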

---

## What Just Happened

```
✅ DONE: Phase 0.1, baseline inference test on Colab
📊 We now have baseline scores from an un-tuned model using the FinRLlama prompt
🎓 KEY INSIGHT: The generic model probably scores US equity headlines OK, but
   struggles with Indian NBFC/credit headlines, because it was never trained on them.
   This gap is exactly what our fine-tuning will fix.
⏭️ NEXT: Read FinGPT codebase (Phase 0.2), then start data collection (Phase 1)
```