# Chapter 8 -- NLP and Transformers
## *Python for AI/ML: A Complete Learning Journey*

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/timothy-watt/python-for-ai-ml/blob/main/CH08_NLP_and_Transformers.ipynb)
&nbsp;&nbsp;[![Back to TOC](https://img.shields.io/badge/Back_to-Table_of_Contents-1B3A5C?style=flat-square)](https://colab.research.google.com/github/timothy-watt/python-for-ai-ml/blob/main/Python_for_AIML_TOC.ipynb)

---

**Part:** 3 -- Machine Learning and AI  
**Prerequisites:** Chapter 7 (Deep Learning with PyTorch)  
**Estimated time:** 5-6 hours

---

> **Before running this notebook:** go to **Runtime → Change runtime type → T4 GPU**.
> The transformer inference and fine-tuning cells run best on a GPU. A CPU will work, but
> fine-tuning will take 15-30 minutes instead of 2-3 minutes.

---

### Learning Objectives

By the end of this chapter you will be able to:

- Explain tokens, vocabularies, and why text must be numerically encoded before modelling
- Use `nltk` and `re` for classical text preprocessing: cleaning, tokenising, stemming, stopwords
- Build a TF-IDF feature matrix and train a text classifier with scikit-learn
- Explain word embeddings and why `word2vec`-style representations outperform one-hot encoding
- Load a pre-trained transformer model with HuggingFace `transformers`
- Run zero-shot inference: sentiment analysis and text classification without training
- Fine-tune a pre-trained model on a custom text classification task
- Interpret attention weights to understand what the model focuses on

---

### Project Thread -- Chapter 8

The Stack Overflow (SO) 2025 survey dataset contains free-text columns -- job titles,
developer type labels, and AI tool descriptions. We build three NLP pipelines on this data:

1. **Classical NLP** -- TF-IDF + Logistic Regression to classify developer role from job title text
2. **Zero-shot inference** -- sentiment analysis on developer comments using a pre-trained transformer
3. **Fine-tuning** -- adapt `distilbert-base-uncased` to classify whether a developer
   is data-focused or software-focused from their self-described role text


---

## Setup -- Install, Import, and Data


In [None]:
# Install libraries not pre-installed in Colab
import subprocess
subprocess.run(['pip', 'install', 'transformers', 'datasets', 'accelerate',
                'nltk', '-q'], check=False)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
import warnings
warnings.filterwarnings('ignore')

import nltk
nltk.download('stopwords', quiet=True)
nltk.download('punkt',     quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('wordnet',   quiet=True)
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report, accuracy_score
from sklearn.pipeline import Pipeline

import torch
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Device: {DEVICE}')

import transformers
print(f'Transformers: {transformers.__version__}')

plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.dpi']     = 110
plt.rcParams['axes.titlesize'] = 13

DATASET_URL  = 'https://raw.githubusercontent.com/timothy-watt/python-for-ai-ml/main/data/so_survey_2025_curated.csv'
RANDOM_STATE = 42


In [None]:
# Load SO 2025 and extract text columns
df_raw = pd.read_csv(DATASET_URL)
df = df_raw.copy()

# We focus on DevType -- semicolon-separated role labels
# e.g. 'Developer, full-stack;Developer, back-end;Data scientist'
text_col = 'DevType'
if text_col not in df.columns:
    # Fallback: use any available text column
    text_candidates = [c for c in df.columns
                       if df[c].dtype == object and df[c].str.len().mean() > 10]
    text_col = text_candidates[0] if text_candidates else None
    print(f'DevType not found -- using: {text_col}')

df = df[df[text_col].notna()].copy()
df = df.reset_index(drop=True)

# Primary role: take the first semicolon-separated value
df['primary_role'] = df[text_col].str.split(';').str[0].str.strip()

# Binary target: data-focused vs software-focused
data_keywords = ['data scientist', 'data engineer', 'data analyst',
                 'machine learning', 'research', 'analyst']
df['is_data_role'] = df['primary_role'].str.lower().apply(
    lambda x: int(any(kw in x for kw in data_keywords))
)

print(f'Dataset: {len(df):,} rows with non-null {text_col}')
print(f'Unique primary roles: {df["primary_role"].nunique()}')
print(f'Data-focused roles:   {df["is_data_role"].sum():,} ({df["is_data_role"].mean()*100:.1f}%)')
print()
print('Most common primary roles:')
print(df['primary_role'].value_counts().head(8).to_string())


---

## Section 8.1 -- Classical NLP: Text Preprocessing and TF-IDF

Before transformers dominated NLP, the standard pipeline was:
clean text → tokenise → remove stopwords → stem/lemmatise → TF-IDF features → train classifier.
This pipeline still works well for short, domain-specific text and is 100x faster
to train than a transformer. It is worth knowing as a fast baseline.

**TF-IDF (Term Frequency -- Inverse Document Frequency)** scores each word by
how often it appears in a document (TF) weighted down by how common it is
across all documents (IDF). Rare words that appear in specific documents
get high scores; common words like 'the' get low scores.
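
To make the scoring concrete, here is a minimal hand-rolled sketch of plain TF x IDF on a toy corpus. The three token lists are hypothetical examples, not rows from the survey. Note that scikit-learn's `TfidfVectorizer` (used below) applies a smoothed IDF and L2-normalises each row by default, so its numbers will differ from this textbook version.

```python
import math
from collections import Counter

# Toy corpus: three hypothetical role strings, already lowercased and tokenised
toy_corpus = [
    ['data', 'scientist', 'python'],
    ['web', 'developer', 'python'],
    ['data', 'engineer', 'python'],
]

def tf_idf(corpus):
    """Return one {term: score} dict per document using plain TF * IDF."""
    n_docs = len(corpus)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    all_scores = []
    for doc in corpus:
        counts = Counter(doc)
        doc_scores = {}
        for term, count in counts.items():
            tf = count / len(doc)                    # term frequency in this document
            idf = math.log(n_docs / doc_freq[term])  # rarer across documents -> larger
            doc_scores[term] = tf * idf
        all_scores.append(doc_scores)
    return all_scores

toy_scores = tf_idf(toy_corpus)
for doc, doc_scores in zip(toy_corpus, toy_scores):
    print(doc, {t: round(s, 3) for t, s in doc_scores.items()})
```

In this toy corpus 'python' appears in every document, so its IDF is log(1) = 0 and it contributes nothing, while 'scientist' (one document) outscores 'data' (two documents).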


In [None]:
# 8.1.1 -- Text cleaning pipeline

STOP_WORDS = set(stopwords.words('english'))
stemmer    = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def clean_text(text: str, use_lemma: bool = True) -> str:
    """
    Full classical NLP preprocessing pipeline.
    1. Lowercase
    2. Remove punctuation and digits
    3. Tokenise
    4. Remove stopwords
    5. Lemmatise (or stem)
    """
    if not isinstance(text, str):
        return ''
    text = text.lower()
    text = re.sub(r'[^a-z\s]', ' ', text)   # keep only letters and spaces
    text = re.sub(r'\s+', ' ', text).strip()
    tokens = word_tokenize(text)
    tokens = [t for t in tokens if t not in STOP_WORDS and len(t) > 2]
    if use_lemma:
        tokens = [lemmatizer.lemmatize(t) for t in tokens]
    else:
        tokens = [stemmer.stem(t) for t in tokens]
    return ' '.join(tokens)


# Demonstrate the pipeline step by step
sample = 'Developer, full-stack; building web APIs and React front-ends'
print(f'Original:      {sample}')
print(f'Lowercased:    {sample.lower()}')
cleaned = re.sub(r'[^a-z\s]', ' ', sample.lower())
print(f'No punct:      {cleaned}')
tokens = word_tokenize(cleaned)
print(f'Tokenised:     {tokens}')
no_stop = [t for t in tokens if t not in STOP_WORDS and len(t) > 2]
print(f'No stopwords:  {no_stop}')
lemmatised = [lemmatizer.lemmatize(t) for t in no_stop]
print(f'Lemmatised:    {lemmatised}')
print(f'Final string:  {clean_text(sample)}')

# Apply to the full dataset
df['role_clean'] = df['primary_role'].apply(clean_text)
print(f'Cleaned {len(df):,} role strings')


In [None]:
# 8.1.2 -- TF-IDF features and Logistic Regression classifier

# Keep the top 8 roles by frequency for a clean multi-class problem
top_roles = df['primary_role'].value_counts().head(8).index.tolist()
df_clf    = df[df['primary_role'].isin(top_roles)].copy()

X_text = df_clf['role_clean'].values
y_role = df_clf['primary_role'].values

X_tr, X_te, y_tr, y_te = train_test_split(
    X_text, y_role, test_size=0.2,
    random_state=RANDOM_STATE, stratify=y_role
)

# TF-IDF pipeline: vectorise text then classify
tfidf_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(
        max_features=500,    # keep only the 500 most frequent terms across the corpus
        ngram_range=(1, 2),  # include both single words and 2-word phrases
        min_df=3,            # ignore terms appearing in fewer than 3 documents
        sublinear_tf=True,   # replace TF with 1 + log(TF) to dampen very frequent terms
    )),
    ('clf', LogisticRegression(
        max_iter=1000,
        C=1.0,               # inverse regularisation strength
        random_state=RANDOM_STATE
    )),
])

tfidf_pipe.fit(X_tr, y_tr)
y_pred = tfidf_pipe.predict(X_te)
acc    = accuracy_score(y_te, y_pred)

print(f'TF-IDF + Logistic Regression accuracy: {acc:.4f}  ({acc*100:.1f}%)')
print()
print(classification_report(y_te, y_pred, zero_division=0))


In [None]:
# 8.1.3 -- Visualise TF-IDF: top terms per class

vectorizer  = tfidf_pipe.named_steps['tfidf']
classifier  = tfidf_pipe.named_steps['clf']
feature_names = vectorizer.get_feature_names_out()

# For each class, find the terms with the highest logistic regression coefficients
n_top = 8
classes = classifier.classes_
n_classes = len(classes)
cols = min(4, n_classes)
rows = (n_classes + cols - 1) // cols

fig, axes = plt.subplots(rows, cols, figsize=(cols * 4, rows * 3))
axes_flat  = np.atleast_1d(axes).ravel()   # works for a single Axes or a grid

for i, (cls, ax) in enumerate(zip(classes, axes_flat)):
    coefs = classifier.coef_[i]
    top_idx  = np.argsort(coefs)[-n_top:]
    top_terms = feature_names[top_idx]
    top_coefs = coefs[top_idx]
    ax.barh(top_terms, top_coefs, color='#2E75B6')
    ax.set_title(cls[:30], fontsize=9)
    ax.tick_params(labelsize=8)

for ax in axes_flat[n_classes:]:
    ax.set_visible(False)

plt.suptitle('TF-IDF: Top Terms per Developer Role\n(higher coefficient = stronger signal)',
             fontsize=12, fontweight='bold')
plt.tight_layout()
plt.show()


---

## Section 8.2 -- Word Embeddings: From Counts to Meaning

TF-IDF represents each document as a sparse vector of term weights.
It has no concept of meaning -- 'developer' and 'engineer' are completely
unrelated in a TF-IDF vocabulary even though they are semantically close.

**Word embeddings** solve this by mapping each word to a dense vector
in a continuous space where similar words are geometrically close.
The famous example: `king - man + woman ≈ queen`.

Modern transformers replace per-word embeddings with **contextual embeddings** --
the same word gets a different vector depending on its surrounding context.
'Python' in 'Python developer' and 'Python snake' would have different embeddings.
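
The geometry can be illustrated with a small sketch. The 3-dimensional vectors below are hand-made, hypothetical values chosen so the analogy works; real embeddings are learned from data and have hundreds of dimensions. The sketch also shows why one-hot vectors carry no notion of similarity: any two distinct words are orthogonal.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# One-hot encoding: every pair of distinct words has similarity 0
one_hot = {'developer': [1, 0, 0], 'engineer': [0, 1, 0], 'banana': [0, 0, 1]}
one_hot_sim = cosine(one_hot['developer'], one_hot['engineer'])
print(f'one-hot developer vs engineer: {one_hot_sim:.2f}')

# Hypothetical dense vectors; dimensions loosely read as
# [royalty, maleness, femaleness]
toy_vectors = {
    'king':  [0.9, 0.8, 0.1],
    'queen': [0.9, 0.1, 0.8],
    'man':   [0.1, 0.9, 0.1],
    'woman': [0.1, 0.1, 0.9],
    'apple': [0.0, 0.2, 0.2],
}

# king - man + woman, computed element by element
target = [k - m + w for k, m, w in zip(toy_vectors['king'],
                                       toy_vectors['man'],
                                       toy_vectors['woman'])]

# Find the closest remaining word by cosine similarity
candidates = {w: v for w, v in toy_vectors.items()
              if w not in ('king', 'man', 'woman')}
nearest = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(f'king - man + woman is closest to: {nearest}')
```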


In [None]:
# 8.2.1 -- Tokenisation with a pre-trained transformer tokeniser
#
# Before a model can look up embeddings, text must be converted to token IDs.
# We use DistilBERT's tokeniser to show that conversion.

from transformers import AutoTokenizer

MODEL_NAME = 'distilbert-base-uncased'
print(f'Loading tokeniser: {MODEL_NAME}...')
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
print('Tokeniser loaded.')

# Tokenise some example developer role texts
examples = [
    'Developer, full-stack',
    'Data scientist or machine learning specialist',
    'DevOps specialist',
]

print()
for text in examples:
    tokens    = tokenizer.tokenize(text)
    token_ids = tokenizer.encode(text)
    print(f'Text:      {text}')
    print(f'Tokens:    {tokens}')
    print(f'Token IDs: {token_ids}')
    print(f'[CLS] id={token_ids[0]}, [SEP] id={token_ids[-1]}')
    print()

print('Key observations:')
print('  [CLS] token prepended -- its embedding becomes the sentence representation')
print('  [SEP] token appended  -- marks end of sequence')
print('  Subword tokenisation: unknown words split into known pieces')
print('  e.g. "DevOps" might become ["dev", "##ops"]')


---

## Section 8.3 -- Zero-Shot Inference with Pre-trained Transformers

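A pre-trained transformer has already learned rich language representations
from billions of words of text. For many tasks, you can use such a model directly,
without any further training on your own data -- this chapter calls that **zero-shot inference**.
Strictly speaking, the sentiment model in 8.3.1 was already fine-tuned for sentiment
by its authors; only the classifier in 8.3.2 is zero-shot in the narrow sense of
handling labels it never saw during training.

The HuggingFace `pipeline` function provides a one-line interface to hundreds of
pre-trained models for common NLP tasks.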
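
Zero-shot classifiers built on natural language inference (NLI) models, such as the one used in 8.3.2, roughly work as follows: each candidate label is turned into a hypothesis sentence, the model scores whether the input text entails that hypothesis, and the entailment scores are normalised across labels. The sketch below shows only that last bookkeeping step. The logits are hypothetical stand-ins for real model output, and the hypothesis template is an illustrative assumption.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

toy_text = 'Managing Kubernetes clusters and CI/CD pipelines on AWS'
toy_labels = ['data science', 'web development', 'DevOps and infrastructure']

# One hypothesis per label; an NLI model would score (toy_text, hypothesis) pairs
hypotheses = [f'This example is about {label}.' for label in toy_labels]

# Hypothetical entailment logits, one per hypothesis (NOT real model output)
entailment_logits = [-1.2, 0.3, 3.1]

# Normalise across labels so the scores sum to 1, then rank
probs = softmax(entailment_logits)
ranked = sorted(zip(toy_labels, probs), key=lambda pair: pair[1], reverse=True)
for label, p in ranked:
    print(f'{label:<28} {p:.3f}')
```

Because the labels only appear inside the hypothesis text, they can be changed at inference time without retraining.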


In [None]:
# 8.3.1 -- Sentiment analysis pipeline (zero-shot)

from transformers import pipeline

print('Loading sentiment analysis pipeline...')
# This downloads a fine-tuned DistilBERT model (~67M parameters, roughly 270 MB) on first run
sentiment_pipe = pipeline(
    'sentiment-analysis',
    model='distilbert-base-uncased-finetuned-sst-2-english',
    device=0 if torch.cuda.is_available() else -1
)
print('Pipeline ready.')

# Simulate developer sentiment about tools and work conditions
developer_statements = [
    'I love working with Python, the ecosystem is incredible.',
    'The legacy codebase is a nightmare, no documentation anywhere.',
    'Remote work has been really positive for my productivity.',
    'The on-call rotation is exhausting and unsustainable.',
    'GitHub Copilot has genuinely made me more productive.',
    'Constantly switching between five different frameworks is frustrating.',
]

print()
print(f'{"Statement":<55} {"Sentiment":<12} {"Confidence"}')
print('-' * 80)
results = sentiment_pipe(developer_statements)
for stmt, result in zip(developer_statements, results):
    label = result['label']
    score = result['score']
    icon  = 'positive' if label == 'POSITIVE' else 'negative'
    print(f'{stmt[:53]:<55} {icon:<12} {score:.3f}')


In [None]:
# 8.3.2 -- Zero-shot text classification
#
# Zero-shot classification lets you classify text into ANY categories
# you define at inference time -- no training data needed.
# The model uses natural language inference to decide which label
# best describes the input text.

print('Loading zero-shot classification pipeline...')
zs_pipe = pipeline(
    'zero-shot-classification',
    model='facebook/bart-large-mnli',
    device=0 if torch.cuda.is_available() else -1
)
print('Pipeline ready.')

# Classify developer job descriptions into categories we define
candidate_labels = ['data science', 'web development', 'DevOps and infrastructure',
                    'mobile development', 'security']

job_descriptions = [
    'Building machine learning models and data pipelines for e-commerce recommendations',
    'Developing React front-end components and REST APIs with Node.js',
    'Managing Kubernetes clusters and CI/CD pipelines on AWS',
    'Writing Swift and SwiftUI apps for iOS and watchOS',
]

print()
for desc in job_descriptions:
    result = zs_pipe(desc, candidate_labels)
    top_label = result['labels'][0]
    top_score = result['scores'][0]
    print(f'Text:   {desc[:60]}')
    print(f'Label:  {top_label}  ({top_score:.3f})')
    print()


---

## Section 8.4 -- Fine-tuning a Pre-trained Transformer

Zero-shot inference is convenient but limited. **Fine-tuning** adapts a pre-trained
model to your specific task by continuing training on your labelled data.
Because the model already understands language, fine-tuning typically needs
only a small dataset and a few epochs -- far less than training from scratch.

We fine-tune `distilbert-base-uncased` to classify developer roles as
data-focused or software-focused using the `is_data_role` label we created
from the SO 2025 `DevType` column.


In [None]:
# 8.4.1 -- Prepare data for fine-tuning

from transformers import AutoModelForSequenceClassification
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW

# Use primary role text as input, is_data_role as label
# Downsample to 2000 examples for fast fine-tuning in Colab
df_ft = df[['primary_role', 'is_data_role']].dropna().copy()
df_ft = df_ft.sample(n=min(2000, len(df_ft)), random_state=RANDOM_STATE).reset_index(drop=True)

# Balance classes
n_min = df_ft['is_data_role'].value_counts().min()
df_ft = pd.concat([
    df_ft[df_ft['is_data_role'] == 0].sample(n_min, random_state=RANDOM_STATE),
    df_ft[df_ft['is_data_role'] == 1].sample(n_min, random_state=RANDOM_STATE),
]).sample(frac=1, random_state=RANDOM_STATE).reset_index(drop=True)

train_df, test_df = train_test_split(df_ft, test_size=0.2,
                                     random_state=RANDOM_STATE,
                                     stratify=df_ft['is_data_role'])

print(f'Fine-tuning dataset: {len(train_df)} train, {len(test_df)} test')
print(f'Class balance: {train_df["is_data_role"].mean()*100:.0f}% data roles')

# Tokenise all texts
def tokenise_batch(texts: list[str], max_length: int = 64) -> dict:
    return tokenizer(
        list(texts),
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors='pt'
    )

# Custom Dataset
class RoleDataset(Dataset):
    def __init__(self, texts, labels, max_length: int = 64) -> None:
        # texts and labels may be lists or pandas Series
        self.encodings = tokenizer(
            list(texts), padding=True, truncation=True,
            max_length=max_length, return_tensors='pt'
        )
        self.labels = torch.tensor(list(labels), dtype=torch.long)

    def __len__(self) -> int:
        return len(self.labels)

    def __getitem__(self, idx: int) -> dict:
        return {
            'input_ids':      self.encodings['input_ids'][idx],
            'attention_mask': self.encodings['attention_mask'][idx],
            'labels':         self.labels[idx],
        }

train_ds = RoleDataset(train_df['primary_role'], train_df['is_data_role'])
test_ds  = RoleDataset(test_df['primary_role'],  test_df['is_data_role'])
train_loader_ft = DataLoader(train_ds, batch_size=32, shuffle=True)
test_loader_ft  = DataLoader(test_ds,  batch_size=64, shuffle=False)
print(f'Train batches: {len(train_loader_ft)},  Test batches: {len(test_loader_ft)}')


In [None]:
# 8.4.2 -- Load model and fine-tune for 3 epochs

print(f'Loading {MODEL_NAME} for sequence classification...')
model_ft = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=2,
    ignore_mismatched_sizes=True
)
model_ft = model_ft.to(DEVICE)
print(f'Model loaded. Parameters: {sum(p.numel() for p in model_ft.parameters()):,}')

optimizer_ft = AdamW(model_ft.parameters(), lr=2e-5, weight_decay=0.01)

N_EPOCHS_FT   = 3
ft_train_losses = []
ft_val_accs     = []

print(f'Fine-tuning on {DEVICE} for {N_EPOCHS_FT} epochs...')
print(f'{"Epoch":>6}  {"Train Loss":>12}  {"Val Acc":>10}')
print('-' * 32)

for epoch in range(1, N_EPOCHS_FT + 1):
    # Training
    model_ft.train()
    epoch_loss = 0.0
    for batch in train_loader_ft:
        input_ids      = batch['input_ids'].to(DEVICE)
        attention_mask = batch['attention_mask'].to(DEVICE)
        labels         = batch['labels'].to(DEVICE)
        optimizer_ft.zero_grad()
        outputs = model_ft(input_ids=input_ids,
                           attention_mask=attention_mask,
                           labels=labels)
        outputs.loss.backward()
        torch.nn.utils.clip_grad_norm_(model_ft.parameters(), 1.0)
        optimizer_ft.step()
        epoch_loss += outputs.loss.item()
    avg_loss = epoch_loss / len(train_loader_ft)
    ft_train_losses.append(avg_loss)

    # Validation (evaluated on the held-out test split; there is no separate validation set)
    model_ft.eval()
    all_preds, all_labels = [], []
    with torch.no_grad():
        for batch in test_loader_ft:
            input_ids      = batch['input_ids'].to(DEVICE)
            attention_mask = batch['attention_mask'].to(DEVICE)
            outputs = model_ft(input_ids=input_ids, attention_mask=attention_mask)
            preds   = outputs.logits.argmax(dim=-1).cpu().numpy()
            all_preds.extend(preds)
            all_labels.extend(batch['labels'].numpy())
    val_acc = accuracy_score(all_labels, all_preds)
    ft_val_accs.append(val_acc)
    print(f'{epoch:>6}  {avg_loss:>12.4f}  {val_acc:>10.4f}')

print('Fine-tuning complete.')


In [None]:
# 8.4.3 -- Evaluate and visualise fine-tuning results

print(f'Final fine-tuned model accuracy: {ft_val_accs[-1]:.4f}')
print()
print(classification_report(all_labels, all_preds,
                             target_names=['Software-focused', 'Data-focused']))

# Training curve
fig, axes = plt.subplots(1, 2, figsize=(13, 4))

axes[0].plot(range(1, N_EPOCHS_FT+1), ft_train_losses, 'o-', color='#E8722A',
             linewidth=2, markersize=8)
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Training Loss')
axes[0].set_title('DistilBERT Fine-tuning: Training Loss')
axes[0].set_xticks(range(1, N_EPOCHS_FT+1))

axes[1].plot(range(1, N_EPOCHS_FT+1), ft_val_accs, 'o-', color='#2E75B6',
             linewidth=2, markersize=8)
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Validation Accuracy')
axes[1].set_title('DistilBERT Fine-tuning: Validation Accuracy')
axes[1].set_xticks(range(1, N_EPOCHS_FT+1))
axes[1].set_ylim(0.5, 1.0)

plt.suptitle('SO 2025 Role Classifier: Fine-tuned DistilBERT',
             fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

# Compare with TF-IDF baseline
tfidf_binary = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=300, ngram_range=(1,2))),
    ('clf',   LogisticRegression(max_iter=500, random_state=RANDOM_STATE)),
])
tfidf_binary.fit(train_df['primary_role'], train_df['is_data_role'])
baseline_acc = accuracy_score(test_df['is_data_role'],
                               tfidf_binary.predict(test_df['primary_role']))

print(f'Comparison on data-role classification:')
print(f'  TF-IDF + Logistic Regression: {baseline_acc:.4f}')
print(f'  Fine-tuned DistilBERT:        {ft_val_accs[-1]:.4f}')
print(f'  Improvement:                  {(ft_val_accs[-1]-baseline_acc)*100:+.1f} percentage points')


---

## Section 8.5 -- Understanding Attention

The transformer's key innovation is the **attention mechanism** -- a learned
weighting that lets the model focus on the most relevant parts of the input
when encoding each token. Visualising attention weights gives intuition
for what the model is 'looking at' when making predictions.
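
As a rough illustration of the mechanism, the sketch below implements scaled dot-product attention, `softmax(Q K^T / sqrt(d_k)) V`, in plain Python. The token vectors are hypothetical, and the same vectors are reused as queries, keys, and values; a real transformer first projects embeddings through learned Q, K, and V matrices and runs several heads in parallel.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Return (outputs, weights) for lists of query, key, and value vectors."""
    d_k = len(K[0])
    weights = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(softmax(scores))
    # Each output is the attention-weighted average of the value vectors
    outputs = [
        [sum(w * v[j] for w, v in zip(w_row, V)) for j in range(len(V[0]))]
        for w_row in weights
    ]
    return outputs, weights

# Three toy token vectors (hypothetical values)
toy_tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
toy_outputs, toy_weights = scaled_dot_product_attention(toy_tokens, toy_tokens, toy_tokens)

for i, row in enumerate(toy_weights):
    print(f'token {i} attends with weights {[round(w, 3) for w in row]}')
```

Each row of `toy_weights` sums to 1, which is the same property the heatmap in the next cell relies on: a row shows how one query token distributes its attention over all key tokens.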


In [None]:
# 8.5.1 -- Extract and visualise attention weights

from transformers import AutoModel

# Load the base model with output_attentions=True
attn_model = AutoModel.from_pretrained(
    MODEL_NAME, output_attentions=True
).to(DEVICE)
attn_model.eval()

# Encode a sample text
sample_text = 'Machine learning engineer building recommendation systems'
inputs = tokenizer(sample_text, return_tensors='pt').to(DEVICE)
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

with torch.no_grad():
    outputs = attn_model(**inputs)

# outputs.attentions: tuple with one tensor per layer, each of shape (batch, heads, seq, seq)
# Average across all heads in the last layer
last_layer_attn = outputs.attentions[-1][0]          # shape (heads, seq, seq)
avg_attn        = last_layer_attn.mean(dim=0).cpu().numpy()  # shape (seq, seq)

# Plot attention heatmap
fig, ax = plt.subplots(figsize=(10, 8))
im = ax.imshow(avg_attn, cmap='Blues', aspect='auto')
ax.set_xticks(range(len(tokens)))
ax.set_yticks(range(len(tokens)))
ax.set_xticklabels(tokens, rotation=45, ha='right', fontsize=10)
ax.set_yticklabels(tokens, fontsize=10)
plt.colorbar(im, ax=ax, shrink=0.8)
ax.set_title(f'DistilBERT Attention Weights (last layer, averaged over heads)\n"{sample_text}"',
             fontsize=11)
plt.tight_layout()
plt.show()

print('Rows = query token (what is attending)')
print('Cols = key token (what is being attended to)')
print('Bright cells = high attention weight')
print('[CLS] often attends broadly -- it aggregates the full sequence for classification')


---

## Section 8.6 -- Retrieval-Augmented Generation (RAG)

**The problem with fine-tuning for knowledge:** fine-tuning teaches a model
*how* to behave, not *what* to know. If you fine-tune on your company's
documentation, the knowledge is frozen at training time and expensive to update.

**RAG** solves this by splitting the problem in two:

1. **Retrieve** -- at query time, search a document store for the most relevant chunks
2. **Generate** -- pass the retrieved chunks as context to a language model and ask it
   to answer using that context

The documents live outside the model and can be updated without retraining.
This is the dominant architecture for production Q&A systems over private documents.

```
  Query
    |
    v
  Embed query ──> Vector similarity search ──> Top-k document chunks
                       (FAISS / ChromaDB)               |
                                                        v
                                             [context + query] ──> LLM ──> Answer
```

**Our implementation:**
- **Corpus:** SO 2025 developer job descriptions (synthetic, from the dataset)
- **Embeddings:** `sentence-transformers/all-MiniLM-L6-v2` (~80MB, fast, high quality)
- **Vector store:** FAISS (Facebook AI Similarity Search, CPU-only, runs locally)
- **Generation:** HuggingFace `pipeline` with a small generative model
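
Before bringing in real models, the retrieve-then-prompt flow can be sketched with the standard library alone. Everything here is a simplified stand-in: `embed` uses word counts rather than a neural embedding, `retrieve_toy` and `build_prompt` are hypothetical helper names, the three documents are made up, and no language model is called.

```python
import math
from collections import Counter

# Hypothetical mini-corpus in the same "field | field" style used later
toy_docs = [
    'Role: Data scientist | Languages: Python, SQL, R',
    'Role: Developer, front-end | Languages: JavaScript, TypeScript',
    'Role: DevOps specialist | Languages: Bash, Go, Python',
]

def embed(text):
    """Stand-in embedding: lowercase word counts (NOT a neural embedding)."""
    cleaned = ''.join(ch if ch.isalnum() else ' ' for ch in text.lower())
    return Counter(cleaned.split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve_toy(query, k=2):
    """Step 1 (retrieve): rank documents by similarity to the query."""
    q = embed(query)
    ranked = sorted(toy_docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, chunks):
    """Step 2 (generate): assemble the prompt an LLM would receive."""
    context = '\n'.join(f'- {c}' for c in chunks)
    return f'Context:\n{context}\n\nQuestion: {query}'

toy_query = 'Which languages does a data scientist use?'
top_chunks = retrieve_toy(toy_query, k=1)
print(build_prompt(toy_query, top_chunks))
```

Updating the knowledge base in this design means editing `toy_docs` and re-embedding; nothing about the generator has to change.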


In [None]:
# 8.6.1 -- Install RAG dependencies

import subprocess
subprocess.run(['pip', 'install', 'sentence-transformers', 'faiss-cpu', '-q'], check=False)

import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
import faiss

print(f'sentence-transformers and faiss-cpu ready')


In [None]:
# 8.6.2 -- Build the document corpus from SO 2025
#
# We construct synthetic job-description documents from the survey fields
# so the RAG system has something meaningful to retrieve.

df_rag = pd.read_csv(DATASET_URL)

def build_doc(row: "pd.Series") -> str | None:
    """Build a short text document from a survey row."""
    parts = []
    if pd.notna(row.get('DevType')):
        parts.append(f"Role: {row['DevType']}")
    if pd.notna(row.get('LanguageHaveWorkedWith')):
        langs = row['LanguageHaveWorkedWith'].replace(';', ', ')
        parts.append(f"Languages: {langs}")
    if pd.notna(row.get('Country')):
        parts.append(f"Country: {row['Country']}")
    if pd.notna(row.get('EdLevel')):
        parts.append(f"Education: {row['EdLevel']}")
    if pd.notna(row.get('YearsCodePro')):
        parts.append(f"Professional coding experience: {row['YearsCodePro']} years")
    if pd.notna(row.get('ConvertedCompYearly')):
        parts.append(f"Annual compensation: ${float(row['ConvertedCompYearly']):,.0f}")
    if pd.notna(row.get('AIToolCurrently')):
        tools = row['AIToolCurrently'].replace(';', ', ')
        parts.append(f"AI tools currently used: {tools}")
    return ' | '.join(parts) if parts else None

df_rag['document'] = df_rag.apply(build_doc, axis=1)
docs = df_rag['document'].dropna().tolist()

# Use a representative 2000-doc subset for speed
import random
random.seed(42)
docs_subset = random.sample(docs, min(2000, len(docs)))

print(f'Documents built: {len(docs_subset):,}')
print(f'Sample document:')
print(f'  {docs_subset[0]}')


In [None]:
# 8.6.3 -- Embed all documents and build a FAISS vector index

# Load the embedding model
# all-MiniLM-L6-v2: 22M parameters, 384-dim embeddings, very fast
print('Loading sentence embedding model...')
embedder = SentenceTransformer('all-MiniLM-L6-v2')

print(f'Embedding {len(docs_subset):,} documents...')
doc_embeddings = embedder.encode(
    docs_subset,
    batch_size=128,
    show_progress_bar=True,
    convert_to_numpy=True
)

print(f'Embedding matrix shape: {doc_embeddings.shape}')
print(f'  {doc_embeddings.shape[0]} documents x {doc_embeddings.shape[1]} dimensions')

# Build FAISS index
# IndexFlatIP: exact inner-product (cosine) search
# Normalise first so inner product == cosine similarity
faiss.normalize_L2(doc_embeddings)
dimension = doc_embeddings.shape[1]
index = faiss.IndexFlatIP(dimension)
index.add(doc_embeddings)

print(f'FAISS index built: {index.ntotal:,} vectors indexed')


In [None]:
# 8.6.4 -- Retrieval: find the most relevant documents for a query

def retrieve(query: str, k: int = 3, verbose: bool = True) -> list[dict]:
    """Embed the query and return the top-k most similar documents."""
    q_emb = embedder.encode([query], convert_to_numpy=True)
    faiss.normalize_L2(q_emb)
    scores, indices = index.search(q_emb, k)
    results = []
    for rank, (score, idx) in enumerate(zip(scores[0], indices[0]), 1):
        results.append({'rank': rank, 'score': float(score), 'doc': docs_subset[idx]})
        if verbose:
            print(f'[{rank}] Score={score:.4f}')
            print(f'     {docs_subset[idx][:120]}...' if len(docs_subset[idx]) > 120
                  else f'     {docs_subset[idx]}')
    return results


test_queries = [
    'senior Python developer with machine learning experience',
    'frontend engineer using JavaScript and TypeScript',
    'data scientist in Germany using AI tools',
]

for query in test_queries:
    print(f'Query: "{query}"')
    retrieve(query, k=2)
    print()


In [None]:
# 8.6.5 -- Full RAG pipeline: retrieve + generate answer
#
# We use the retrieved documents as context and ask a model to
# synthesise an answer grounded in that context.
# Using flan-t5-base: small (250M params), instruction-following, no GPU needed.

from transformers import pipeline as hf_pipeline

print('Loading generative model (flan-t5-base, ~1GB)...')
generator = hf_pipeline(
    'text2text-generation',
    model='google/flan-t5-base',
    max_new_tokens=150
)

def rag_answer(question: str, k: int = 3) -> tuple[str, list]:
    """
    Full RAG pipeline:
    1. Retrieve top-k relevant documents
    2. Build a context-augmented prompt
    3. Generate an answer grounded in the retrieved context
    """
    # Step 1: retrieve
    results = retrieve(question, k=k, verbose=False)
    context = '\n'.join([f'- {r["doc"]}' for r in results])

    # Step 2: build prompt with context
    prompt = (
        f'Based on the following developer profiles from the Stack Overflow 2025 survey:\n'
        f'{context}\n\n'
        f'Answer this question: {question}'
    )

    # Step 3: generate
    response = generator(prompt)[0]['generated_text'].strip()
    return response, results


rag_questions = [
    'What programming languages are commonly used by senior developers with high salaries?',
    'What AI tools do data scientists typically use?',
]

for question in rag_questions:
    print(f'Question: {question}')
    answer, sources = rag_answer(question, k=3)
    print(f'Answer:   {answer}')
    print(f'Sources:  {len(sources)} documents retrieved')
    for s in sources:
        print(f'  [{s["rank"]}] {s["doc"][:80]}...')
    print()


---

## Section 8.7 -- API-Based LLM Integration

Section 8.6 built a RAG system using a small local model (Flan-T5).
In production, most teams call a hosted LLM API instead -- larger models,
no GPU required, better reasoning, pay-per-token pricing.

**The pattern is identical to local RAG:**
retrieve relevant context → inject into a prompt → call the API → return the answer.
The only change is the generation step.

We implement this in a **provider-agnostic** way: a single `LLMClient` class
that works with OpenAI, Anthropic, or any OpenAI-compatible endpoint
(Groq, Together AI, Ollama, etc.) by swapping one parameter.

**API key handling:** keys are never hardcoded. We use environment variables
(`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`) or Colab Secrets (recommended).
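
One possible way to look up a key is sketched below: check the environment first, then fall back to Colab Secrets when running inside Colab. The helper name `get_api_key` is hypothetical, and the function deliberately returns `None` rather than raising when nothing is found, so calling code can decide what to do.

```python
import os

def get_api_key(name):
    """Return the secret called `name`, or None if it cannot be found."""
    # 1. Environment variable (works locally, in CI, and in Colab)
    key = os.environ.get(name)
    if key:
        return key
    # 2. Colab Secrets (only importable inside Google Colab)
    try:
        from google.colab import userdata
        return userdata.get(name)
    except Exception:
        # Not running in Colab, or the secret is missing / access not granted
        return None

openai_key = get_api_key('OPENAI_API_KEY')
print('OpenAI key found' if openai_key else 'No OpenAI key configured')
```

The key itself is never printed; only its presence is reported.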

**Cost awareness:** always log token usage. A single RAG query with 3 retrieved
chunks is typically 500-1,500 tokens -- a fraction of a cent at current pricing.
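
The arithmetic behind a cost estimate is simple enough to sketch. The per-million-token prices below are placeholders, not real quotes from any provider; substitute the figures from your provider's pricing page. The function name `estimate_cost_usd` is likewise hypothetical.

```python
def estimate_cost_usd(input_tokens, output_tokens,
                      usd_per_million_input, usd_per_million_output):
    """Estimate the cost of one request from token counts and per-million prices."""
    return (input_tokens * usd_per_million_input
            + output_tokens * usd_per_million_output) / 1_000_000

# Placeholder prices in USD per million tokens (NOT real quotes)
PRICE_IN = 0.50
PRICE_OUT = 1.50

# A RAG query at the upper end of the range above: 1,200 prompt tokens
# (question plus retrieved chunks) and 300 tokens of answer
query_cost = estimate_cost_usd(1200, 300, PRICE_IN, PRICE_OUT)
print(f'Estimated cost per query:          ${query_cost:.6f}')
print(f'Estimated cost per 10,000 queries: ${query_cost * 10_000:.2f}')
```

Input and output tokens are priced separately because many providers charge more for generated tokens than for prompt tokens.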


In [None]:
# 8.7.1 -- Provider-agnostic LLM client

import os
import json as json_lib
from typing import Optional

class LLMClient:
    """
    Thin wrapper around LLM APIs.
    Supports OpenAI and Anthropic with a unified interface.
    Falls back to a mock response when no API key is set (for demo purposes).
    """

    PROVIDERS = ['openai', 'anthropic', 'mock']

    def __init__(self, provider: str = 'mock', model: Optional[str] = None):
        self.provider = provider.lower()
        assert self.provider in self.PROVIDERS, f'Unknown provider: {provider}'

        # Auto-detect available provider from environment
        if self.provider == 'mock':
            if os.environ.get('OPENAI_API_KEY'):
                self.provider = 'openai'
            elif os.environ.get('ANTHROPIC_API_KEY'):
                self.provider = 'anthropic'

        # Default models
        if model:
            self.model = model
        elif self.provider == 'openai':
            self.model = 'gpt-4o-mini'
        elif self.provider == 'anthropic':
            self.model = 'claude-haiku-4-5-20251001'
        else:
            self.model = 'mock'

        self.total_tokens = 0
        print(f'LLMClient ready: provider={self.provider}, model={self.model}')

    def chat(self, system: str, user: str, max_tokens: int = 300) -> str:
        """Send a chat request and return the response text."""
        if self.provider == 'openai':
            return self._call_openai(system, user, max_tokens)
        elif self.provider == 'anthropic':
            return self._call_anthropic(system, user, max_tokens)
        else:
            return self._mock_response(user)

    def _call_openai(self, system: str, user: str, max_tokens: int) -> str:
        try:
            import openai
            client = openai.OpenAI(api_key=os.environ['OPENAI_API_KEY'])
            resp   = client.chat.completions.create(
                model=self.model,
                messages=[{'role': 'system', 'content': system},
                           {'role': 'user',   'content': user}],
                max_tokens=max_tokens,
                temperature=0.2,
            )
            self.total_tokens += resp.usage.total_tokens
            return resp.choices[0].message.content.strip()
        except Exception as e:
            return f'[OpenAI error: {e}]'

    def _call_anthropic(self, system: str, user: str, max_tokens: int) -> str:
        try:
            import anthropic
            client = anthropic.Anthropic(api_key=os.environ['ANTHROPIC_API_KEY'])
            resp   = client.messages.create(
                model=self.model,
                system=system,
                messages=[{'role': 'user', 'content': user}],
                max_tokens=max_tokens,
            )
            self.total_tokens += resp.usage.input_tokens + resp.usage.output_tokens
            return resp.content[0].text.strip()
        except Exception as e:
            return f'[Anthropic error: {e}]'

    def _mock_response(self, user: str) -> str:
        """Return a canned response when no API key is available."""
        return (
            '[MOCK RESPONSE -- set OPENAI_API_KEY or ANTHROPIC_API_KEY to use a real LLM]\n'
            f'Your question was: {user[:100]}...\n'
            'In production, a real LLM would synthesise the retrieved context '
            'into a coherent answer here.'
        )

    def usage_summary(self) -> None:
        print(f'Total tokens used this session: {self.total_tokens:,}')
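        # Rough estimate only: applies an approximate input-token list price to all tokens.
        # Output tokens cost more and prices change, so check the provider's pricing page.
        cost_map = {'gpt-4o-mini': 0.15/1e6, 'claude-haiku-4-5-20251001': 1.00/1e6}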
        rate = cost_map.get(self.model, 0.5/1e6)
        print(f'Estimated cost: ${self.total_tokens * rate:.6f} '
              f'(at ${rate*1e6:.2f}/1M tokens)')


# Instantiate -- auto-detects provider from environment
llm = LLMClient()


In [None]:
# 8.7.2 -- API RAG: combine FAISS retrieval with LLM generation
#
# Reuses the embedder and FAISS index built in Section 8.6
# Swap the generation step from Flan-T5 to the API client

SYSTEM_PROMPT = """
You are a data analyst assistant with expertise in developer compensation
and career trends. You answer questions using only the provided context
from the Stack Overflow 2025 Developer Survey. Be concise and specific.
If the context does not contain enough information to answer, say so clearly.
""".strip()


def api_rag_answer(question: str, k: int = 4) -> dict:
    """
    Full API-based RAG pipeline:
    1. Embed the question with sentence-transformers
    2. Retrieve top-k documents from FAISS index
    3. Build a context-augmented prompt
    4. Call the LLM API (or mock if no key is set)
    5. Return answer + sources + token usage
    """
    # Step 1 & 2: retrieve
    results = retrieve(question, k=k, verbose=False)
    context_lines = [f'{i+1}. {r["doc"]}' for i, r in enumerate(results)]
    context = '\n'.join(context_lines)

    # Step 3: build user message with context
    user_msg = (
        f'Context (from SO 2025 Developer Survey):\n{context}\n\n'
        f'Question: {question}\n\n'
        f'Answer based only on the context above:'
    )

    # Step 4: call API
    answer = llm.chat(system=SYSTEM_PROMPT, user=user_msg, max_tokens=250)

    return {
        'question': question,
        'answer':   answer,
        'sources':  results,
        'n_sources': len(results),
    }


# Run the API RAG pipeline
questions = [
    'What programming languages are most common among high-earning developers?',
    'How does AI tool adoption vary by country in this survey?',
    'What education levels are typical for data scientists?',
]

print('API-based RAG responses:')
print('=' * 65)
for q in questions:
    result = api_rag_answer(q)
    print(f'Q: {result["question"]}')
    print(f'A: {result["answer"]}')
    print(f'   ({result["n_sources"]} documents retrieved)')
    print('-' * 65)


In [None]:
# 8.7.3 -- Prompt engineering: structured output
#
# LLMs can be instructed to return JSON, making their output
# usable in downstream code after a single json.loads() call.

STRUCTURED_SYSTEM = """
You are a data extraction assistant. Given context from the SO 2025 survey,
extract structured information and return ONLY valid JSON with no other text.
""".strip()

def extract_structured(question: str, schema_description: str, k: int = 4) -> dict:
    """
    Use the LLM to extract structured data from retrieved context.
    Returns a parsed dict if the response is valid JSON, else raw text.
    """
    results = retrieve(question, k=k, verbose=False)
    context = '\n'.join([f'{i+1}. {r["doc"]}' for i, r in enumerate(results)])

    user_msg = (
        f'Context:\n{context}\n\n'
        f'Task: {question}\n'
        f'Return your answer as JSON with this structure: {schema_description}\n'
        f'Return ONLY the JSON object, no markdown fences or explanation.'
    )

    raw = llm.chat(system=STRUCTURED_SYSTEM, user=user_msg, max_tokens=300)

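    # Strip markdown fences if present. removeprefix/removesuffix (Python 3.9+)
    # remove exact strings; lstrip/rstrip would strip any of the given characters.
    clean = raw.strip().removeprefix('```json').removeprefix('```').removesuffix('```').strip()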
    try:
        return json_lib.loads(clean)
    except json_lib.JSONDecodeError:
        return {'raw_response': raw, 'parse_error': True}


# Extract structured insights
result = extract_structured(
    question='From the developer profiles, summarise the top 3 languages and typical salary range',
    schema_description='{"top_languages": ["lang1", "lang2", "lang3"], "salary_range": {"low": int, "high": int, "currency": "USD"}, "sample_size": int}',
    k=5
)

print('Structured extraction result:')
print(json_lib.dumps(result, indent=2))
print()
llm.usage_summary()


---

## Section 8.7 Key Takeaways

- **Provider-agnostic clients** insulate your code from vendor lock-in.
  Switching from OpenAI to Anthropic (or a local Ollama model) is one parameter change.
- **API keys belong in environment variables, never in code.**
  In Colab use Secrets (`🔑` icon in the left sidebar): they persist across sessions
  and are never stored in the notebook file.
- **The RAG pattern is the same regardless of the generation backend:**
  embed → retrieve → augment prompt → generate.
  The choice of local model vs API affects cost, latency, and privacy, not the architecture.
- **Structured output** (instructing the LLM to return JSON) makes LLM responses
  directly consumable by downstream code without fragile string parsing.
  Always strip markdown fences before calling `json.loads()`.
- **Token tracking** is essential in production. Log the token counts reported on every
  API call (`usage.total_tokens` for OpenAI, `usage.input_tokens + usage.output_tokens`
  for Anthropic) and set `max_tokens` explicitly to prevent runaway costs.
- **Mock/fallback mode** lets the notebook run without an API key, which is useful for
  testing pipelines and CI environments where keys are not available.
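
The fence-stripping advice in the structured-output takeaway can be sketched as a small helper. The name `parse_llm_json` is hypothetical. It uses `str.removeprefix` and `str.removesuffix` (Python 3.9+), which remove exact strings. By contrast, `lstrip('```json')` strips any leading character from that set, so an unfenced `null` would be mangled to `ull`.

```python
import json

def parse_llm_json(raw: str):
    """Strip optional markdown fences, then parse; return an error dict on failure."""
    text = raw.strip()
    # Try the longer prefix first so '```json' is removed as a whole
    for prefix in ('```json', '```'):
        text = text.removeprefix(prefix)
    text = text.removesuffix('```').strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return {'raw_response': raw, 'parse_error': True}

print(parse_llm_json('```json\n{"top_languages": ["Python"]}\n```'))
print(parse_llm_json('Sorry, I cannot answer that.'))
```

Returning an error dict instead of raising mirrors the behaviour of `extract_structured` in cell 8.7.3, so callers can check for `parse_error` in both cases.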


---

## Concept Check Questions

> Test your understanding before moving on. Answer each question without referring back to the notebook, then expand to check.

**Q1.** What is **subword tokenisation** and why is it better than word-level for rare words?

<details><summary>Show answer</summary>

Subword tokenisation splits rare words into sub-units: 'unhappiness' → ['un', '##happi', '##ness']. Almost any word can be represented by combining known subwords, so the 'unknown token' fallback, which discards information, is rarely needed. The model also shares knowledge across morphologically related words.
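
To make the splitting rule concrete, here is a toy greedy longest-match tokeniser in the style of WordPiece. The tiny vocabulary and the function name `wordpiece_tokenize` are assumptions for illustration. Real tokenisers learn vocabularies of tens of thousands of pieces from data.

```python
def wordpiece_tokenize(word: str, vocab: set, unk: str = '[UNK]') -> list:
    """Greedy longest-match-first split of one word into known subword pieces."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate  # marks a piece that continues a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no known piece covers this position
        tokens.append(piece)
        start = end
    return tokens

toy_vocab = {'un', 'happy', '##happi', '##ness', '##ly'}
print(wordpiece_tokenize('unhappiness', toy_vocab))
```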

</details>

**Q2.** What role does the `[CLS]` token play in a BERT-style model?

<details><summary>Show answer</summary>

`[CLS]` is prepended to every input. After self-attention, the transformer aggregates information from all tokens into this position. Its final hidden state encodes a sequence-level representation used by the classification head.

</details>

**Q3.** Explain the RAG architecture in three sentences without using acronyms.

<details><summary>Show answer</summary>

A large collection of documents is split into chunks, converted into dense numerical vectors by a language model, and stored in a vector index. When a question arrives, it is also converted to a vector and the index returns the most similar chunks. Those chunks are injected into a prompt alongside the question, and a language model generates a grounded answer.

</details>

**Q4.** What is the difference between **fine-tuning** and **few-shot prompting**? When to use each?

<details><summary>Show answer</summary>

**Fine-tuning** updates model weights on labelled examples: higher accuracy for well-defined tasks with 100+ examples, but it requires a GPU and creates a new artefact to maintain. **Few-shot prompting** includes examples in the prompt without changing weights: zero maintenance, works with API-only access, but less accurate on complex tasks.
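
As an illustration of the few-shot side, the sketch below assembles a prompt from labelled examples. The helper `build_few_shot_prompt` and its prompt layout are assumptions, and the example titles are borrowed from Exercise 1. The resulting string could be sent as the `user` message of any chat API.

```python
def build_few_shot_prompt(examples, query,
                          task='Classify the seniority of each job title.'):
    """Place labelled examples before the unlabelled query; no weights change."""
    lines = [task, '']
    for text, label in examples:
        lines.append(f'Title: {text}')
        lines.append(f'Seniority: {label}')
        lines.append('')
    # The final block is left open for the model to complete
    lines.append(f'Title: {query}')
    lines.append('Seniority:')
    return '\n'.join(lines)

examples = [
    ('junior developer', 'junior'),
    ('senior engineer', 'senior'),
    ('tech lead', 'lead'),
]
prompt = build_few_shot_prompt(examples, 'principal engineer')
print(prompt)
```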

</details>

**Q5.** An LLM ignores your 'return only JSON' instruction and wraps the response in markdown fences. How do you handle this robustly?

<details><summary>Show answer</summary>

Strip the fences before parsing, using `str.removeprefix` and `str.removesuffix` rather than `lstrip`/`rstrip`, which remove any of the listed characters instead of an exact string. Wrap `json.loads(clean)` in `try/except json.JSONDecodeError`. For production, use a model with native JSON mode to avoid this entirely.

</details>



---

## Coding Exercises

> Three exercises per chapter: **🔧 Guided** (fill-in-the-blanks) · **🔨 Applied** (write from scratch) · **🏗️ Extension** (go beyond the chapter)

Exercises use the SO 2025 developer survey dataset.
Expand each **Solution** block only after attempting the exercise.


### Exercise 1 🔧 Guided -- TF-IDF job title seniority classifier

Complete the TF-IDF → Logistic Regression pipeline that classifies
developer job titles into four seniority levels:
`junior`, `mid`, `senior`, `lead` (lead/principal).


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

titles = (
    ['junior developer']*80 + ['junior software engineer']*60 +
    ['software engineer']*100 + ['developer']*80 +
    ['senior engineer']*90 + ['senior developer']*70 +
    ['staff engineer']*40 + ['principal engineer']*40 + ['tech lead']*40
)
labels = ['junior']*140 + ['mid']*180 + ['senior']*160 + ['lead']*120

pipe = Pipeline([
    ('tfidf', TfidfVectorizer(
        # YOUR CODE: set ngram_range, min_df, max_features
    )),
    ('clf', LogisticRegression(
        # YOUR CODE: set max_iter (and optionally C)
    ))
])

scores = cross_val_score(pipe, titles, labels, cv=5, scoring='accuracy')
print(f'CV Accuracy: {scores.mean():.3f} ± {scores.std():.3f}')

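<details><summary>💡 Hint</summary>

`TfidfVectorizer(ngram_range=(1,2), min_df=2, max_features=500)` is a good starting point.
`LogisticRegression(max_iter=500, C=1.0)` is enough for the classifier: multinomial loss is already the default
for multiclass targets, and the `multi_class` argument is deprecated since scikit-learn 1.5.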

</details>

<details><summary>✅ Solution</summary>

```python
pipe = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2), min_df=1, max_features=500)),
    ('clf', LogisticRegression(max_iter=500, C=1.0)),
])
```

</details>


### Exercise 2 🔨 Applied -- Sentence embedding similarity search

Build a simple semantic search engine for SO 2025 developer survey responses.
Using a sentence transformer model, embed a corpus of job descriptions,
then implement `find_similar(query, corpus, top_k=3)` that returns the most
similar job descriptions using cosine similarity.

Use `sentence-transformers` (install if needed) or fall back to TF-IDF vectors.


In [None]:
JOB_DESCRIPTIONS = [
    'Python developer building ML pipelines and data products',
    'Full stack JavaScript engineer working on React and Node',
    'Data scientist specialising in NLP and large language models',
    'DevOps engineer managing Kubernetes and cloud infrastructure',
    'Backend engineer working with Python, FastAPI, and PostgreSQL',
    'Machine learning engineer deploying models to production at scale',
    'Data engineer building Spark and dbt data pipelines',
    'Frontend developer focused on accessibility and performance',
]

def find_similar(query: str, corpus: list[str], top_k: int = 3) -> list[tuple[str,float]]:
    """Return top_k (description, similarity_score) pairs."""
    # YOUR CODE
    pass

results = find_similar('NLP researcher working on transformers', JOB_DESCRIPTIONS)
for desc, score in results:
    print(f'{score:.3f}  {desc}')

<details><summary>💡 Hint</summary>

TF-IDF fallback: fit `TfidfVectorizer` on the corpus, transform query,
compute `cosine_similarity(query_vec, corpus_vecs)`, sort descending.
For sentence-transformers: `SentenceTransformer('all-MiniLM-L6-v2').encode(texts)`

</details>

<details><summary>✅ Solution</summary>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
def find_similar(query, corpus, top_k=3):
    vec = TfidfVectorizer().fit(corpus)
    c_vecs = vec.transform(corpus)
    q_vec = vec.transform([query])
    sims = cosine_similarity(q_vec, c_vecs)[0]
    top = np.argsort(sims)[::-1][:top_k]  # indices of the highest scores first
    return [(corpus[i], sims[i]) for i in top]
```

</details>


### Exercise 3 🏗️ Extension -- RAG system: 'which language pays most in country X?'

Build a mini RAG pipeline that answers natural language salary questions.

Components:
1. A knowledge base of (language, country, median_salary) facts stored as text chunks
2. A retriever that returns the top-3 most relevant chunks for a query
3. A simple answer extractor that parses the retrieved chunks

The system should handle queries like:
'What language pays best in Germany?' and
'Which programming language has the highest median salary in the US?'


In [None]:
import json

# Knowledge base: salary facts (simulated SO 2025 data)
KB = [
    {'lang':'Python','country':'US','median_salary':125000},
    {'lang':'Java','country':'US','median_salary':118000},
    {'lang':'JavaScript','country':'US','median_salary':108000},
    {'lang':'Python','country':'Germany','median_salary':78000},
    {'lang':'Rust','country':'Germany','median_salary':85000},
    {'lang':'JavaScript','country':'Germany','median_salary':72000},
    {'lang':'Python','country':'India','median_salary':18000},
    {'lang':'Java','country':'India','median_salary':16000},
]

def build_chunks(kb: list[dict]) -> list[str]:
    # Convert each record to a text chunk
    pass

def retrieve(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    # Return top_k relevant chunks
    # NOTE: this shadows the Section 8.6 `retrieve`; re-run that cell before reusing api_rag_answer
    pass

def answer_query(query: str, chunks: list[str]) -> str:
    # Parse chunks and formulate answer
    pass

chunks = build_chunks(KB)
for q in ['What pays best in Germany?', 'Top language salary in US?']:
    print(f'Q: {q}\nA: {answer_query(q, chunks)}\n')

<details><summary>💡 Hint</summary>

For `build_chunks`: convert each dict to a descriptive sentence like
'Python developers in Germany earn a median salary of $78,000 per year.'
For `retrieve`: use TF-IDF cosine similarity (same pattern as Exercise 2).
For `answer_query`: retrieve top chunks, find the one with the highest salary mentioned.

</details>

<details><summary>✅ Solution</summary>

```python
import re

def build_chunks(kb):
    return [f'{r["lang"]} developers in {r["country"]} earn a median salary of '
            f'${r["median_salary"]:,} per year.' for r in kb]

def retrieve(query, chunks, top_k=3):
    # Reuse find_similar from Ex 2; keep only the text to match the list[str] signature
    return [chunk for chunk, _score in find_similar(query, chunks, top_k)]

def salary_of(chunk):
    # Pull the dollar amount out of a chunk, e.g. '$78,000' -> 78000 (0 if absent)
    match = re.search(r'\$([\d,]+)', chunk)
    return int(match.group(1).replace(',', '')) if match else 0

def answer_query(query, chunks):
    # Among the retrieved chunks, return the one mentioning the highest salary
    return max(retrieve(query, chunks), key=salary_of)
```

</details>


---

## Chapter 8 Summary

### Key Takeaways

- **Classical NLP pipeline:** clean → tokenise → remove stopwords → lemmatise → TF-IDF → classifier.
  Fast, interpretable, and competitive on short domain-specific text.
- **TF-IDF** scores words by local frequency times global rarity.
  `ngram_range=(1,2)` captures multi-word phrases like 'machine learning'.
- **Contextual embeddings** (transformers) map the same word to different vectors
  depending on context. `[CLS]` aggregates the full sequence for classification.
- **Zero-shot inference** uses pre-trained models directly with no task-specific training.
- **Fine-tuning** adapts a pre-trained model with a small labelled dataset.
  3 epochs on ~1,600 examples produces a strong classifier.
- **`clip_grad_norm_`** prevents exploding gradients during fine-tuning.
- **Sentence transformers** produce fixed-size embeddings optimised for semantic
  similarity -- sentences with similar meaning get nearby vectors, even when worded differently.
- **FAISS** indexes millions of vectors and returns the top-k nearest neighbours
  in milliseconds using approximate or exact search.
- **API-based LLM integration** follows the same RAG pattern but replaces the local
  model with an API call. Use a provider-agnostic client class so swapping vendors
  requires changing one parameter, not rewriting the pipeline.
- **Structured output** prompting (return JSON only) makes LLM responses
  directly parseable by downstream code without fragile string parsing.
- **RAG** separates knowledge (the document store, updatable) from reasoning
  (the LLM, fixed). It is the dominant architecture for Q&A over private documents.
  Each phase has three steps: chunk, embed, and index at build time;
  retrieve, augment the prompt, and generate at query time.
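
The "local frequency times global rarity" description of TF-IDF in the takeaways above can be made concrete with a small worked sketch. It uses the textbook formula `tf × ln(N / df)` on a made-up three-document corpus. Note that `TfidfVectorizer` defaults to a smoothed variant, `ln((1 + N) / (1 + df)) + 1`, followed by L2 normalisation, so its numbers will differ.

```python
import math
from collections import Counter

# Toy corpus (illustrative, not from the survey data)
docs = [
    'senior python developer',
    'junior python developer',
    'python data engineer',
]
tokenised = [d.split() for d in docs]
n_docs = len(tokenised)

# Document frequency: in how many documents does each term appear?
df = Counter(term for doc in tokenised for term in set(doc))

def tf_idf(term: str, doc_tokens: list) -> float:
    tf = doc_tokens.count(term) / len(doc_tokens)   # local frequency
    idf = math.log(n_docs / df[term])               # global rarity
    return tf * idf

# 'python' appears in every document, so its rarity (and score) is zero
for term in tokenised[2]:
    print(f'{term:<10} {tf_idf(term, tokenised[2]):.3f}')
```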

### Project Thread Status

| Task | Method | Result |
|------|--------|--------|
| Developer role classification | TF-IDF + Logistic Regression | Accuracy reported |
| Sentiment analysis | Zero-shot DistilBERT | Labels + confidence |
| Zero-shot role classification | BART-large-MNLI | Top label per description |
| Fine-tuned role classifier | DistilBERT 3 epochs | Accuracy vs baseline |
| Attention visualisation | DistilBERT last layer | Heatmap plotted |
| RAG over SO 2025 profiles | MiniLM + FAISS + Flan-T5 | End-to-end Q&A |
| API-based RAG | LLMClient + FAISS | Provider-agnostic Q&A |
| Structured extraction | JSON-only prompting | Parsed dict output |

---

### What's Next: Chapter 9 -- Computer Vision with PyTorch

Chapter 9 applies the PyTorch training loop from Chapter 7 to image data:
CNNs from scratch, transfer learning with ResNet-18, feature map visualisation,
object detection with Faster R-CNN, and semantic segmentation with DeepLabV3.
Images are the third major data modality after tabular (Ch 5–6) and text (Ch 8).

---

*End of Chapter 8 -- Python for AI/ML*  
[![Back to TOC](https://img.shields.io/badge/Back_to-Table_of_Contents-1B3A5C?style=flat-square)](https://colab.research.google.com/github/timothy-watt/python-for-ai-ml/blob/main/Python_for_AIML_TOC.ipynb)
