# Nemotron Phishing Detection Workshop

This notebook walks through the full fine-tuning workflow on the Enron email dataset:
1. Download the dataset
2. Convert to JSONL and trim to a 10% subset
3. Serve and evaluate the base model
4. Fine-tune Nemotron with LoRA
5. Serve and evaluate the tuned model


## Environment setup (run in terminal)
Create a virtual environment and install dependencies from the shell.

```bash
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
pip install -r requirements.txt
pip install -r requirements-vllm.txt
```


## Prepare the data


## Configure Kaggle API
Set your Kaggle credentials as environment variables before downloading the dataset, replacing the placeholder values with your own.

In [None]:
import os
os.environ['KAGGLE_USERNAME'] = 'YYYYYYYYYY'
os.environ['KAGGLE_KEY'] = 'XXXXXXXXXXXXX'
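
Hardcoding keys in a notebook cell makes them easy to leak through version control. As a minimal sketch, assuming the two variables are already exported in the shell that launched Jupyter, a small check can fail early when either one is missing. `require_kaggle_credentials` is a hypothetical helper, not part of the workshop scripts.

```python
import os

def require_kaggle_credentials(env=os.environ):
    """Return (username, key), or raise if either variable is missing or empty."""
    missing = [name for name in ("KAGGLE_USERNAME", "KAGGLE_KEY") if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return env["KAGGLE_USERNAME"], env["KAGGLE_KEY"]

# Demo with an explicit mapping, so no real credentials are needed.
demo_env = {"KAGGLE_USERNAME": "alice", "KAGGLE_KEY": "secret"}
username, _ = require_kaggle_credentials(demo_env)
print(username)
```

Calling `require_kaggle_credentials()` with no argument checks the real process environment instead of the demo mapping.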

## Download the dataset

In [3]:
!python ../scripts/download_dataset.py --output_dir ../data/raw

Downloading wcukierski/enron-email-dataset to ../data/raw...
Dataset URL: https://www.kaggle.com/datasets/wcukierski/enron-email-dataset
Download completed, but maildir was not found. Check the output directory.


## Convert to JSONL
The script labels each email as phishing or benign with a simple keyword heuristic, so the labels are approximate rather than ground truth.

In [1]:
!python ../scripts/prepare_jsonl.py --input_csv ../data/raw/emails.csv --output_dir ../data/processed

Parsing emails: 517401it [05:59, 1440.15it/s]
Wrote JSONL files to ../data/processed
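
The heuristic inside `prepare_jsonl.py` is not shown in this notebook. The sketch below illustrates the general idea of keyword-based labeling; the keyword list and the `heuristic_label` function are assumptions made up for illustration and may not match what the script does.

```python
# Hypothetical keyword list; the real script's keywords are not shown in this notebook.
PHISHING_KEYWORDS = ("verify your account", "password", "click here", "urgent")

def heuristic_label(subject, body, keywords=PHISHING_KEYWORDS):
    """Label an email 'phishing' if any keyword appears in it, else 'benign'."""
    text = f"{subject} {body}".lower()
    return "phishing" if any(kw in text for kw in keywords) else "benign"

print(heuristic_label("Pipeline 30 Years", "Pipeline 30 Years"))
print(heuristic_label("Urgent", "Click here to verify your account"))
```

Because labels produced this way follow surface keywords, a model trained on them learns to reproduce the heuristic, and later accuracy numbers measure agreement with it.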


## Inspect dataset stats

In [2]:
import json
from pathlib import Path
stats = json.loads(Path('../data/processed/stats.json').read_text())
stats

{'total': 50000,
 'train': 40000,
 'val': 5000,
 'test': 5000,
 'phishing': 6567,
 'benign': 43433}
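
The counts above show a clear class imbalance, which matters when reading accuracy numbers later. As a small worked example using only the figures printed above, the share of each label and the accuracy of a trivial "always predict the majority class" baseline can be computed like this (`full_stats` and `class_balance` are names introduced here for illustration):

```python
full_stats = {"total": 50000, "phishing": 6567, "benign": 43433}

def class_balance(stats):
    """Return the share of each label and the majority-class baseline accuracy."""
    total = stats["total"]
    shares = {label: stats[label] / total for label in ("phishing", "benign")}
    return shares, max(shares.values())

shares, baseline = class_balance(full_stats)
print(f"phishing share: {shares['phishing']:.2%}")    # 13.13%
print(f"majority-class baseline: {baseline:.2%}")     # 86.87%
```

This describes the full 50,000-sample set; the ratio inside an individual split may differ slightly.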

In [3]:
# Trim each split to 10% to target ~1 hour of training on an L4 GPU
import json
import random
from pathlib import Path

src_dir = Path('../data/processed')
dst_dir = Path('../data/processed_small')
dst_dir.mkdir(parents=True, exist_ok=True)
seed = 42
fraction = 0.1

def sample_jsonl(src, dst, fraction, seed):
    lines = Path(src).read_text().splitlines()
    rng = random.Random(seed)
    k = max(1, int(len(lines) * fraction))
    sample = rng.sample(lines, k)
    Path(dst).write_text('\n'.join(sample) + '\n')
    return k

counts = {}
counts['train'] = sample_jsonl(src_dir / 'train.jsonl', dst_dir / 'train.jsonl', fraction, seed)
counts['val'] = sample_jsonl(src_dir / 'val.jsonl', dst_dir / 'val.jsonl', fraction, seed)
counts['test'] = sample_jsonl(src_dir / 'test.jsonl', dst_dir / 'test.jsonl', fraction, seed)

def count_labels(path):
    stats = {'phishing': 0, 'benign': 0, 'total': 0}
    for line in Path(path).read_text().splitlines():
        if not line:
            continue
        obj = json.loads(line)
        label = obj.get('label', '')
        if label in stats:
            stats[label] += 1
        stats['total'] += 1
    return stats

train_stats = count_labels(dst_dir / 'train.jsonl')
val_stats = count_labels(dst_dir / 'val.jsonl')
test_stats = count_labels(dst_dir / 'test.jsonl')
small_stats = {
    'total': train_stats['total'] + val_stats['total'] + test_stats['total'],
    'train': train_stats['total'],
    'val': val_stats['total'],
    'test': test_stats['total'],
    'phishing': train_stats['phishing'] + val_stats['phishing'] + test_stats['phishing'],
    'benign': train_stats['benign'] + val_stats['benign'] + test_stats['benign'],
}
(dst_dir / 'stats.json').write_text(json.dumps(small_stats, indent=2))
small_stats


{'total': 5000,
 'train': 4000,
 'val': 500,
 'test': 500,
 'phishing': 649,
 'benign': 4351}
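
`sample_jsonl` draws a plain random sample, so the phishing share in the small set can drift from the original (here 649 of 5,000 versus 6,567 of 50,000). If keeping the ratio fixed matters, one option is to sample each label group separately. The sketch below is an alternative technique, not what the cell above does; `stratified_sample` is a hypothetical helper and the demo records are made up.

```python
import random
from collections import defaultdict

def stratified_sample(records, fraction, seed, key="label"):
    """Sample `fraction` of the records from each label group separately."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec.get(key, "")].append(rec)
    rng = random.Random(seed)
    sample = []
    for label in sorted(groups):  # sorted for a reproducible group order
        items = groups[label]
        k = max(1, round(len(items) * fraction))
        sample.extend(rng.sample(items, k))
    rng.shuffle(sample)  # avoid leaving the output grouped by label
    return sample

demo = [{"label": "phishing"}] * 130 + [{"label": "benign"}] * 870
small = stratified_sample(demo, fraction=0.1, seed=42)
print(len(small), sum(r["label"] == "phishing" for r in small))  # 100 13
```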

## Base model experience


## Serve the base model (run in terminal)
Start the untuned base model before fine-tuning to sanity-check inference.

Terminal A (serve):
```bash
# Optional: pip install -r requirements-vllm.txt
python scripts/serve_vllm.py --model_name nvidia/Nemotron-Mini-4B-Instruct \
  --served_model_name base --port 8000 \
  --enforce_eager --max_model_len 512 --gpu_memory_utilization 0.6 \
  --max_num_batched_tokens 512 --max_num_seqs 4
```
The base model is registered as the OpenAI model name `base`.

Stop the server with Ctrl+C when done.
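
Before sending prompts, it can help to confirm that the server is up and that the name `base` is registered. The sketch below assumes `serve_vllm.py` launches the standard OpenAI-compatible vLLM server, which exposes a `/v1/models` listing; the helper names are introduced here for illustration.

```python
import json
from urllib.request import urlopen

def served_model_ids(payload):
    """Extract model ids from an already-parsed /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]

def list_served_models(base_url="http://127.0.0.1:8000", timeout=10):
    """Query the server's model list; raises if the server is unreachable."""
    with urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        return served_model_ids(json.load(resp))

# Offline demo using the response shape of the OpenAI models endpoint.
sample = {"object": "list", "data": [{"id": "base", "object": "model"}]}
print(served_model_ids(sample))  # ['base']
```

With the server running, `"base" in list_served_models()` should be true before the smoke test is attempted.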


## Smoke test the base model (streaming)
This example is labeled benign in the dataset and is a known base-model miss.


In [21]:
import json
import requests

endpoint = "http://127.0.0.1:8000/v1/completions"
model = "base"

def build_prompt(subject, body):
    return (
        "### Instruction:\n"
        "Classify the email as phishing or benign. Reply with only the label.\n"
        "### Email:\n"
        f"Subject: {subject.strip()}\n"
        f"Body: {body.strip()}\n"
        "### Response:\n"
    )

def normalize_label(text):
    lowered = text.lower()
    if "phish" in lowered:
        return "phishing"
    if "benign" in lowered or "ham" in lowered:
        return "benign"
    return "unknown"

def stream_completion(prompt):
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": 6,
        "temperature": 0.0,
        "stream": True,
    }
    chunks = []
    with requests.post(endpoint, json=payload, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines(decode_unicode=True):
            if not line:
                continue
            if line.startswith("data: "):
                data = line[len("data: "):].strip()
                if data == "[DONE]":
                    break
                chunk = json.loads(data)
                text = chunk.get("choices", [{}])[0].get("text", "")
                if text:
                    print(text, end="", flush=True)
                    chunks.append(text)
    print()
    return "".join(chunks).strip()

subject = "Pipeline 30 Years"
body = "Pipeline 30 Years"
expected = "benign"

print(f"Subject: {subject}")
raw = stream_completion(build_prompt(subject, body))
pred = normalize_label(raw)
status = "correct" if pred == expected else "wrong"
print(f"Predicted: {pred} | Expected: {expected} | {status}")


Subject: Pipeline 30 Years
phishing
Predicted: phishing | Expected: benign | wrong
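
`normalize_label` uses substring checks, so any output containing `ham` inside another word (for example "shameful") would be read as benign. A stricter variant that matches whole words only could look like the sketch below; `normalize_label_strict` is a hypothetical alternative, not the function the evaluation script uses.

```python
import re

# Whole-word match for the labels the original function recognizes.
LABEL_PATTERN = re.compile(r"\b(phishing|benign|ham)\b")

def normalize_label_strict(text):
    """Map the first whole-word label in the text to 'phishing' or 'benign'."""
    match = LABEL_PATTERN.search(text.lower())
    if match is None:
        return "unknown"
    return "phishing" if match.group(1) == "phishing" else "benign"

print(normalize_label_strict("Phishing"))   # phishing
print(normalize_label_strict("shameful"))   # unknown
```

The trade-off is that a truncated output such as "phish" is no longer accepted, which may matter with a small `max_tokens` budget.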


## Evaluate the base model (run in terminal)
Score 500 test samples against the base endpoint and save results. Overlength samples are skipped.

Terminal B (evaluate):
```bash
python scripts/test_model.py --api openai \
  --endpoint http://127.0.0.1:8000/v1/completions \
  --openai_model base \
  --test_file data/processed_small/test.jsonl \
  --max_samples 500 \
  --output_file outputs/eval_base.json
```
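
How `test_model.py` decides that a sample is overlength is not shown in this notebook. One plausible approach, sketched here under that assumption, is to compare each prompt's token count against the server's context limit minus the tokens reserved for the reply. `split_by_length` is a hypothetical helper, and whitespace splitting stands in for the real tokenizer.

```python
def split_by_length(prompts, max_model_len, max_new_tokens, count_tokens=None):
    """Partition prompts into (kept, skipped) lists by a token budget."""
    # Whitespace splitting is a crude stand-in for the model tokenizer.
    count_tokens = count_tokens or (lambda text: len(text.split()))
    budget = max_model_len - max_new_tokens
    kept, skipped = [], []
    for prompt in prompts:
        (kept if count_tokens(prompt) <= budget else skipped).append(prompt)
    return kept, skipped

kept, skipped = split_by_length(
    ["short email", "word " * 600], max_model_len=512, max_new_tokens=6
)
print(len(kept), len(skipped))  # 1 1
```

Under this logic, a smaller `--max_model_len` on the server leads to more skipped samples.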


## View base accuracy


In [22]:
import json
from pathlib import Path

base = json.loads(Path("../outputs/eval_base.json").read_text())
print(f"Base accuracy: {base['accuracy']:.2%} ({base['correct']}/{base['total']})")
print(f"Skipped (overlength): {base.get('skipped', 0)}")


Base accuracy: 59.08% (218/369)
Skipped (overlength): 131
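
The reported accuracy covers only the samples that were scored. As a worked example using the numbers printed above, the snippet below also shows coverage (how much of the 500-sample request was scored) and a conservative accuracy that counts skipped samples as errors. `coverage_report` is a name introduced here for illustration.

```python
def coverage_report(correct, scored, skipped):
    """Accuracy on scored samples, coverage, and accuracy with skips counted as errors."""
    attempted = scored + skipped
    return {
        "accuracy_scored": correct / scored,
        "coverage": scored / attempted,
        "accuracy_all": correct / attempted,
    }

report = coverage_report(correct=218, scored=369, skipped=131)
for name, value in report.items():
    print(f"{name}: {value:.2%}")
# accuracy_scored: 59.08%
# coverage: 73.80%
# accuracy_all: 43.60%
```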


## Fine-tune and evaluate


## Fine-tune the model (run in terminal)
Training is long-running (the 10% subset targets about one hour on an L4), so run it from a shell instead of the notebook.


```bash
python scripts/train.py --data_dir data/processed_small --output_dir outputs \
  --model_name nvidia/Nemotron-Mini-4B-Instruct --num_train_epochs 1 --max_seq_length 512
```
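
LoRA keeps the base weights frozen and trains two small low-rank factors per adapted matrix, which is why the result is a compact adapter rather than a full model copy. The arithmetic below illustrates the savings for a single matrix; the layer shape and rank are illustrative assumptions, not the actual Nemotron-Mini-4B dimensions or the settings used by `train.py`.

```python
def lora_param_counts(d_out, d_in, rank):
    """Compare a full weight matrix with LoRA factors B (d_out x r) and A (r x d_in)."""
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora, lora / full

# Illustrative sizes only; not taken from the model or the training script.
full, lora, ratio = lora_param_counts(d_out=4096, d_in=4096, rank=16)
print(full, lora, f"{ratio:.2%}")  # 16777216 131072 0.78%
```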


## Serve the tuned model (run in terminal)
If you already trained an adapter in `outputs/adapter`, serve it with the OpenAI-compatible vLLM server so streaming works.

Terminal A (serve):
```bash
# Optional: pip install -r requirements-vllm.txt
python scripts/serve_vllm.py --model_name nvidia/Nemotron-Mini-4B-Instruct \
  --adapter_dir outputs/adapter --port 8000
```
The adapter is registered as the OpenAI model name `phishing`.


If you hit CUDA OOM during startup, retry with tighter limits:
```bash
python scripts/serve_vllm.py --model_name nvidia/Nemotron-Mini-4B-Instruct \
  --adapter_dir outputs/adapter --port 8000 \
  --enforce_eager --max_model_len 1024 --gpu_memory_utilization 0.8 \
  --max_num_batched_tokens 1024 --max_num_seqs 8
```


## Smoke test the tuned model (streaming)
Re-run the same example to see whether the tuned model fixes the mistake.


In [23]:
import json
import requests

endpoint = "http://127.0.0.1:8000/v1/completions"
model = "phishing"

def build_prompt(subject, body):
    return (
        "### Instruction:\n"
        "Classify the email as phishing or benign. Reply with only the label.\n"
        "### Email:\n"
        f"Subject: {subject.strip()}\n"
        f"Body: {body.strip()}\n"
        "### Response:\n"
    )

def normalize_label(text):
    lowered = text.lower()
    if "phish" in lowered:
        return "phishing"
    if "benign" in lowered or "ham" in lowered:
        return "benign"
    return "unknown"

def stream_completion(prompt):
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": 6,
        "temperature": 0.0,
        "stream": True,
    }
    chunks = []
    with requests.post(endpoint, json=payload, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines(decode_unicode=True):
            if not line:
                continue
            if line.startswith("data: "):
                data = line[len("data: "):].strip()
                if data == "[DONE]":
                    break
                chunk = json.loads(data)
                text = chunk.get("choices", [{}])[0].get("text", "")
                if text:
                    print(text, end="", flush=True)
                    chunks.append(text)
    print()
    return "".join(chunks).strip()

subject = "Pipeline 30 Years"
body = "Pipeline 30 Years"
expected = "benign"

print(f"Subject: {subject}")
raw = stream_completion(build_prompt(subject, body))
pred = normalize_label(raw)
status = "correct" if pred == expected else "wrong"
print(f"Predicted: {pred} | Expected: {expected} | {status}")


Subject: Pipeline 30 Years
benign
Predicted: benign | Expected: benign | correct


## Evaluate the tuned model (run in terminal)
Score 500 test samples against the tuned endpoint and save results. Overlength samples are skipped.

Terminal B (evaluate):
```bash
python scripts/test_model.py --api openai \
  --endpoint http://127.0.0.1:8000/v1/completions \
  --openai_model phishing \
  --test_file data/processed_small/test.jsonl \
  --max_samples 500 \
  --output_file outputs/eval_tuned.json
```


## View tuned accuracy


In [24]:
import json
from pathlib import Path

tuned = json.loads(Path("../outputs/eval_tuned.json").read_text())
print(f"Tuned accuracy: {tuned['accuracy']:.2%} ({tuned['correct']}/{tuned['total']})")
print(f"Skipped (overlength): {tuned.get('skipped', 0)}")


Tuned accuracy: 87.15% (407/467)
Skipped (overlength): 33
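
With roughly 13% phishing samples in the dataset, accuracy alone can hide poor performance on the minority class. If per-sample expected and predicted labels are available, precision and recall for the phishing class give a fuller picture. The sketch below uses made-up pairs, because the per-sample layout of the evaluation JSON is not shown in this notebook; `precision_recall` is a hypothetical helper.

```python
def precision_recall(pairs, positive="phishing"):
    """Precision, recall, and F1 for one class from (expected, predicted) pairs."""
    tp = sum(1 for exp, pred in pairs if exp == positive and pred == positive)
    fp = sum(1 for exp, pred in pairs if exp != positive and pred == positive)
    fn = sum(1 for exp, pred in pairs if exp == positive and pred != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up pairs for illustration only.
pairs = [
    ("phishing", "phishing"),
    ("phishing", "benign"),
    ("benign", "benign"),
    ("benign", "phishing"),
]
print(precision_recall(pairs))  # (0.5, 0.5, 0.5)
```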


## Compare results
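Calculate the accuracy gain from fine-tuning. The two runs skipped different numbers of overlength samples (131 vs 33), so the accuracies are computed over different subsets and the gain is approximate.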


In [25]:
import json
from pathlib import Path

base = json.loads(Path("../outputs/eval_base.json").read_text())
tuned = json.loads(Path("../outputs/eval_tuned.json").read_text())

def fmt(result):
    return f"{result['accuracy']:.2%} ({result['correct']}/{result['total']})"

print("Base accuracy:", fmt(base))
print("Tuned accuracy:", fmt(tuned))
print("Absolute gain:", f"{(tuned['accuracy'] - base['accuracy']) * 100:.2f} pp")


Base accuracy: 59.08% (218/369)
Tuned accuracy: 87.15% (407/467)
Absolute gain: 28.07 pp
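
With a few hundred scored samples per run, each accuracy carries some sampling uncertainty. One common way to express it is a Wilson score interval; the sketch below applies it to the counts printed above. This is an added illustration rather than something the workshop scripts compute.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for roughly 95%)."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half

for name, correct, total in [("base", 218, 369), ("tuned", 407, 467)]:
    low, high = wilson_interval(correct, total)
    print(f"{name}: {correct / total:.2%} (95% CI {low:.2%} to {high:.2%})")
```

If the two intervals do not overlap, the gain is unlikely to be explained by sampling noise alone, although it says nothing about label quality.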
