# Chapter 8 -- NLP and Transformers
## *Python for AI/ML: A Complete Learning Journey*

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/timothy-watt/python-for-ai-ml/blob/main/CH08_NLP_and_Transformers.ipynb)
&nbsp;&nbsp;[![Back to TOC](https://img.shields.io/badge/Back_to-Table_of_Contents-1B3A5C?style=flat-square)](https://colab.research.google.com/github/timothy-watt/python-for-ai-ml/blob/main/Python_for_AIML_TOC.ipynb)

---

**Part:** 3 -- Machine Learning and AI  
**Prerequisites:** Chapter 7 (Deep Learning with PyTorch)  
**Estimated time:** 5-6 hours

---

> **Before running this notebook:** go to **Runtime → Change runtime type → T4 GPU**.
> The transformer inference and fine-tuning cells run much faster on a GPU. They also
> work on CPU, but fine-tuning takes roughly 15-30 minutes instead of 2-3 minutes.

---

### Learning Objectives

By the end of this chapter you will be able to:

- Explain tokens, vocabularies, and why text must be numerically encoded before modelling
- Use `nltk` and `re` for classical text preprocessing: cleaning, tokenising, stemming, stopwords
- Build a TF-IDF feature matrix and train a text classifier with scikit-learn
- Explain word embeddings and why `word2vec`-style representations outperform one-hot encoding
- Load a pre-trained transformer model with HuggingFace `transformers`
- Run inference without any training of your own: off-the-shelf sentiment analysis and zero-shot text classification
- Fine-tune a pre-trained model on a custom text classification task
- Interpret attention weights to understand what the model focuses on

---

### Project Thread -- Chapter 8

The Stack Overflow (SO) 2025 survey dataset contains free-text columns -- job titles,
developer type labels, and AI tool descriptions. We build three NLP pipelines on this data:

1. **Classical NLP** -- TF-IDF + Logistic Regression to classify developer role from job title text
2. **Zero-shot inference** -- sentiment analysis and zero-shot classification of sample developer statements using pre-trained transformers, with no training of our own
3. **Fine-tuning** -- adapt `distilbert-base-uncased` to classify whether a developer
   is data-focused or software-focused from their self-described role text


---

## Setup -- Install, Import, and Data


In [None]:
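# Install libraries not pre-installed in Colab
import subprocess
import sys
# Call pip through the running interpreter so packages land in this environment;
# check=True raises immediately if the install fails
subprocess.run([sys.executable, '-m', 'pip', 'install', '-q', 'transformers',
                'datasets', 'accelerate', 'nltk'], check=True)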

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
import warnings
warnings.filterwarnings('ignore')

import nltk
nltk.download('stopwords', quiet=True)
nltk.download('punkt',     quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('wordnet',   quiet=True)
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report, accuracy_score
from sklearn.pipeline import Pipeline

import torch
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Device: {DEVICE}')

import transformers
print(f'Transformers: {transformers.__version__}')

plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.dpi']     = 110
plt.rcParams['axes.titlesize'] = 13

DATASET_URL  = 'https://raw.githubusercontent.com/timothy-watt/python-for-ai-ml/main/data/so_survey_2025_curated.csv'
RANDOM_STATE = 42


In [None]:
# Load SO 2025 and extract text columns
df_raw = pd.read_csv(DATASET_URL)
df = df_raw.copy()

# We focus on DevType -- semicolon-separated role labels
# e.g. 'Developer, full-stack;Developer, back-end;Data scientist'
text_col = 'DevType'
if text_col not in df.columns:
    # Fallback: use any available text column
    text_candidates = [c for c in df.columns
                       if df[c].dtype == object and df[c].str.len().mean() > 10]
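    text_col = text_candidates[0] if text_candidates else None
    if text_col is None:
        # Fail early with a clear message instead of a KeyError further down
        raise ValueError('No suitable free-text column found in the dataset')
    print(f'DevType not found -- using: {text_col}')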

df = df[df[text_col].notna()].copy()
df = df.reset_index(drop=True)

# Primary role: take the first semicolon-separated value
df['primary_role'] = df[text_col].str.split(';').str[0].str.strip()

# Binary target: data-focused vs software-focused
data_keywords = ['data scientist', 'data engineer', 'data analyst',
                 'machine learning', 'research', 'analyst']
df['is_data_role'] = df['primary_role'].str.lower().apply(
    lambda x: int(any(kw in x for kw in data_keywords))
)

print(f'Dataset: {len(df):,} rows with non-null {text_col}')
print(f'Unique primary roles: {df["primary_role"].nunique()}')
print(f'Data-focused roles:   {df["is_data_role"].sum():,} ({df["is_data_role"].mean()*100:.1f}%)')
print()
print('Most common primary roles:')
print(df['primary_role'].value_counts().head(8).to_string())


---

## Section 8.1 -- Classical NLP: Text Preprocessing and TF-IDF

Before transformers dominated NLP, the standard pipeline was:
clean text → tokenise → remove stopwords → stem/lemmatise → TF-IDF features → train classifier.
This pipeline still works well for short, domain-specific text and typically trains
orders of magnitude faster than a transformer, so it is worth knowing as a fast baseline.

**TF-IDF (Term Frequency -- Inverse Document Frequency)** scores each word by
how often it appears in a document (TF) weighted down by how common it is
across all documents (IDF). Rare words that appear in specific documents
get high scores; common words like 'the' get low scores.
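
To make the scoring concrete, here is a minimal sketch that computes plain TF-IDF by hand on a three-document toy corpus. The corpus and the names `docs`, `df_counts`, and `tf_idf` are made up for illustration and are not part of the survey data or of scikit-learn.

```python
import math
from collections import Counter

# Toy corpus of role descriptions (hypothetical, not taken from the survey data)
docs = [
    "data scientist building the models",
    "the developer building the web apps",
    "the data engineer building pipelines",
]
tokenised = [d.split() for d in docs]
n_docs = len(tokenised)

# Document frequency: in how many documents does each term appear?
df_counts = Counter(term for doc in tokenised for term in set(doc))

def tf_idf(term, doc_tokens):
    """Plain TF-IDF: relative term frequency times log inverse document frequency."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    idf = math.log(n_docs / df_counts[term])
    return tf * idf

# Score every term of the first document, highest first
scores = {term: tf_idf(term, tokenised[0]) for term in set(tokenised[0])}
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term:<10} {score:.3f}")
```

Terms that occur in every document ('the', 'building') score zero, while 'scientist', which occurs in one document only, scores highest. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalises each row by default, so its numbers will differ from this plain version even though the ranking idea is the same.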


In [None]:
# 8.1.1 -- Text cleaning pipeline

STOP_WORDS = set(stopwords.words('english'))
stemmer    = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def clean_text(text, use_lemma=True):
    """
    Full classical NLP preprocessing pipeline.
    1. Lowercase
    2. Remove punctuation and digits
    3. Tokenise
    4. Remove stopwords and tokens of 2 characters or fewer
       (this also drops short terms such as 'ml' or 'ui')
    5. Lemmatise (or stem)
    """
    if not isinstance(text, str):
        return ''
    text = text.lower()
    text = re.sub(r'[^a-z\s]', ' ', text)   # keep only letters and spaces
    text = re.sub(r'\s+', ' ', text).strip()
    tokens = word_tokenize(text)
    tokens = [t for t in tokens if t not in STOP_WORDS and len(t) > 2]
    if use_lemma:
        tokens = [lemmatizer.lemmatize(t) for t in tokens]
    else:
        tokens = [stemmer.stem(t) for t in tokens]
    return ' '.join(tokens)


# Demonstrate the pipeline step by step
sample = 'Developer, full-stack; building web APIs and React front-ends'
print(f'Original:      {sample}')
print(f'Lowercased:    {sample.lower()}')
cleaned = re.sub(r'[^a-z\s]', ' ', sample.lower())
print(f'No punct:      {cleaned}')
tokens = word_tokenize(cleaned)
print(f'Tokenised:     {tokens}')
no_stop = [t for t in tokens if t not in STOP_WORDS and len(t) > 2]
print(f'No stopwords:  {no_stop}')
lemmatised = [lemmatizer.lemmatize(t) for t in no_stop]
print(f'Lemmatised:    {lemmatised}')
print(f'Final string:  {clean_text(sample)}')

# Apply to the full dataset
df['role_clean'] = df['primary_role'].apply(clean_text)
print(f'Cleaned {len(df):,} role strings')


In [None]:
# 8.1.2 -- TF-IDF features and Logistic Regression classifier

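# Keep the top 8 roles by frequency for a clean multi-class problem.
# Note: role_clean is derived from primary_role, so the features largely encode
# the label and accuracy will be close to perfect. This cell demonstrates the
# mechanics of a TF-IDF pipeline rather than a realistic benchmark.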
top_roles = df['primary_role'].value_counts().head(8).index.tolist()
df_clf    = df[df['primary_role'].isin(top_roles)].copy()

X_text = df_clf['role_clean'].values
y_role = df_clf['primary_role'].values

X_tr, X_te, y_tr, y_te = train_test_split(
    X_text, y_role, test_size=0.2,
    random_state=RANDOM_STATE, stratify=y_role
)

# TF-IDF pipeline: vectorise text then classify
tfidf_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(
        max_features=500,    # keep the 500 most frequent terms across the corpus
        ngram_range=(1, 2),  # include both single words and 2-word phrases
        min_df=3,            # ignore terms appearing in fewer than 3 documents
        sublinear_tf=True,   # replace TF with 1 + log(TF) to dampen very frequent terms
    )),
    ('clf', LogisticRegression(
        max_iter=1000,
        C=1.0,               # inverse regularisation strength
        random_state=RANDOM_STATE
    )),
])

tfidf_pipe.fit(X_tr, y_tr)
y_pred = tfidf_pipe.predict(X_te)
acc    = accuracy_score(y_te, y_pred)

print(f'TF-IDF + Logistic Regression accuracy: {acc:.4f}  ({acc*100:.1f}%)')
print()
print(classification_report(y_te, y_pred, zero_division=0))


In [None]:
# 8.1.3 -- Visualise TF-IDF: top terms per class

vectorizer  = tfidf_pipe.named_steps['tfidf']
classifier  = tfidf_pipe.named_steps['clf']
feature_names = vectorizer.get_feature_names_out()

# For each class, find the terms with the highest logistic regression coefficients
n_top = 8
classes = classifier.classes_
n_classes = len(classes)
cols = min(4, n_classes)
rows = (n_classes + cols - 1) // cols

fig, axes = plt.subplots(rows, cols, figsize=(cols * 4, rows * 3), squeeze=False)
axes_flat  = axes.ravel()   # squeeze=False always returns a 2-D array of axes

for i, (cls, ax) in enumerate(zip(classes, axes_flat)):
    coefs = classifier.coef_[i]
    top_idx  = np.argsort(coefs)[-n_top:]
    top_terms = feature_names[top_idx]
    top_coefs = coefs[top_idx]
    ax.barh(top_terms, top_coefs, color='#2E75B6')
    ax.set_title(cls[:30], fontsize=9)
    ax.tick_params(labelsize=8)

for ax in axes_flat[n_classes:]:
    ax.set_visible(False)

plt.suptitle('TF-IDF: Top Terms per Developer Role\n(higher coefficient = stronger signal)',
             fontsize=12, fontweight='bold')
plt.tight_layout()
plt.show()


---

## Section 8.2 -- Word Embeddings: From Counts to Meaning

TF-IDF represents each document as a sparse vector of term weights.
It has no concept of meaning -- 'developer' and 'engineer' are completely
unrelated in a TF-IDF vocabulary even though they are semantically close.

**Word embeddings** solve this by mapping each word to a dense vector
in a continuous space where similar words are geometrically close.
The famous example: `king - man + woman ≈ queen`.
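
The analogy can be reproduced with a few lines of vector arithmetic. The sketch below uses hand-made 3-D vectors chosen purely for illustration (the name `toy_vectors` and every number in it are assumptions). Real embeddings are learned from data and usually have hundreds of dimensions.

```python
import math

# Hand-made 3-D vectors for illustration only; real embeddings are learned
toy_vectors = {
    "king":  [0.9,  0.8, 0.1],
    "queen": [0.9, -0.7, 0.2],
    "man":   [0.1,  0.9, 0.0],
    "woman": [0.1, -0.8, 0.1],
    "apple": [-0.5, 0.0, 0.9],
}

def cosine(u, v):
    """Cosine similarity: 1 = same direction, 0 = orthogonal, -1 = opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# king - man + woman, computed component-wise
target = [k - m + w for k, m, w in zip(toy_vectors["king"],
                                       toy_vectors["man"],
                                       toy_vectors["woman"])]

# Rank the remaining words by cosine similarity to the target vector
# (the three input words are excluded, as is customary for analogy queries)
candidates = {word: cosine(target, vec) for word, vec in toy_vectors.items()
              if word not in {"king", "man", "woman"}}
nearest = max(candidates, key=candidates.get)
print(nearest, round(candidates[nearest], 3))
```

With these toy numbers the nearest remaining word is 'queen'. The point is the mechanism, not the values: directions in the space carry meaning, so adding and subtracting vectors moves between related concepts.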

Modern transformers replace per-word embeddings with **contextual embeddings** --
the same word gets a different vector depending on its surrounding context.
'Python' in 'Python developer' and 'Python snake' would have different embeddings.
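
Before a transformer can produce any embedding, the text has to be split into tokens from a fixed vocabulary. DistilBERT uses WordPiece-style subword tokenisation, and the greedy longest-match-first idea behind it can be sketched as follows. This is a simplified illustration with a hypothetical seven-entry vocabulary (`toy_vocab`), not the library's implementation.

```python
def wordpiece_tokenise(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece   # continuation pieces carry a '##' prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]               # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary (hypothetical); DistilBERT's real vocabulary has roughly 30,000 entries
toy_vocab = {"dev", "##ops", "data", "##base", "##s", "full", "stack"}

for word in ["devops", "databases", "kotlin"]:
    print(word, "->", wordpiece_tokenise(word, toy_vocab))
```

A word that is not in the vocabulary as a whole, such as 'devops', is still representable as known pieces, which is why subword vocabularies rarely need the unknown token.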


In [None]:
# 8.2.1 -- Demonstrate tokenisation, the step before the embedding lookup
#
# We use DistilBERT's tokeniser to show how text is converted to token IDs
# before being fed to the model.

from transformers import AutoTokenizer

MODEL_NAME = 'distilbert-base-uncased'
print(f'Loading tokeniser: {MODEL_NAME}...')
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
print('Tokeniser loaded.')

# Tokenise some example developer role texts
examples = [
    'Developer, full-stack',
    'Data scientist or machine learning specialist',
    'DevOps specialist',
]

print()
for text in examples:
    tokens    = tokenizer.tokenize(text)
    token_ids = tokenizer.encode(text)
    print(f'Text:      {text}')
    print(f'Tokens:    {tokens}')
    print(f'Token IDs: {token_ids}')
    print(f'[CLS] id={token_ids[0]}, [SEP] id={token_ids[-1]}')
    print()

print('Key observations:')
print('  [CLS] token prepended -- its embedding becomes the sentence representation')
print('  [SEP] token appended  -- marks end of sequence')
print('  Subword tokenisation: unknown words split into known pieces')
print('  e.g. "DevOps" might become ["dev", "##ops"]')


---

## Section 8.3 -- Zero-Shot Inference with Pre-trained Transformers

A pre-trained transformer has already learned rich language representations
from billions of words of text. For many tasks, you can use an existing model directly
without any further training of your own. When the model handles labels or tasks it was
never explicitly trained on, this is called **zero-shot inference**.

HuggingFace's `pipeline()` function provides a one-line interface to hundreds of
pre-trained models for common NLP tasks.
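
Zero-shot classifiers of the kind used later in this section are typically built on a natural language inference (NLI) model: each candidate label is turned into a hypothesis sentence, and the model scores how strongly the input text entails it. The sketch below shows only that framing. `toy_entailment_score` is a hypothetical stand-in that counts shared words, and the template string is an assumption. A real pipeline would call a trained NLI model instead.

```python
import math

def softmax(xs):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def toy_entailment_score(premise, hypothesis):
    """Stand-in for an NLI model (hypothetical): counts shared lowercase words."""
    return float(len(set(premise.lower().split()) & set(hypothesis.lower().split())))

def zero_shot_classify(text, labels, score_fn, template="This example is about {}"):
    # One hypothesis per candidate label; the input text acts as the premise
    hypotheses = [template.format(label) for label in labels]
    logits = [score_fn(text, h) for h in hypotheses]
    probs = softmax(logits)
    return sorted(zip(labels, probs), key=lambda pair: -pair[1])

ranked = zero_shot_classify(
    "Managing Kubernetes clusters and cloud infrastructure",
    ["web development", "cloud infrastructure", "mobile development"],
    toy_entailment_score,
)
for label, prob in ranked:
    print(f"{label:<22} {prob:.3f}")
```

Because the labels only enter through the hypothesis text, they can be changed at inference time without retraining anything, which is what makes the approach zero-shot.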


In [None]:
# 8.3.1 -- Sentiment analysis pipeline (off-the-shelf model, no training of our own)

from transformers import pipeline

print('Loading sentiment analysis pipeline...')
# This downloads a DistilBERT model fine-tuned on SST-2 (~67M parameters, roughly 270 MB) on first run
sentiment_pipe = pipeline(
    'sentiment-analysis',
    model='distilbert-base-uncased-finetuned-sst-2-english',
    device=0 if torch.cuda.is_available() else -1
)
print('Pipeline ready.')

# Simulate developer sentiment about tools and work conditions
developer_statements = [
    'I love working with Python, the ecosystem is incredible.',
    'The legacy codebase is a nightmare, no documentation anywhere.',
    'Remote work has been really positive for my productivity.',
    'The on-call rotation is exhausting and unsustainable.',
    'GitHub Copilot has genuinely made me more productive.',
    'Constantly switching between five different frameworks is frustrating.',
]

print()
print(f'{"Statement":<55} {"Sentiment":<12} {"Confidence"}')
print('-' * 80)
results = sentiment_pipe(developer_statements)
for stmt, result in zip(developer_statements, results):
    label = result['label']
    score = result['score']
    icon  = 'positive' if label == 'POSITIVE' else 'negative'
    print(f'{stmt[:53]:<55} {icon:<12} {score:.3f}')


In [None]:
# 8.3.2 -- Zero-shot text classification
#
# Zero-shot classification lets you classify text into ANY categories
# you define at inference time -- no training data needed.
# The model uses natural language inference to decide which label
# best describes the input text.

print('Loading zero-shot classification pipeline...')
zs_pipe = pipeline(
    'zero-shot-classification',
    model='facebook/bart-large-mnli',
    device=0 if torch.cuda.is_available() else -1
)
print('Pipeline ready.')

# Classify developer job descriptions into categories we define
candidate_labels = ['data science', 'web development', 'DevOps and infrastructure',
                    'mobile development', 'security']

job_descriptions = [
    'Building machine learning models and data pipelines for e-commerce recommendations',
    'Developing React front-end components and REST APIs with Node.js',
    'Managing Kubernetes clusters and CI/CD pipelines on AWS',
    'Writing Swift and SwiftUI apps for iOS and watchOS',
]

print()
for desc in job_descriptions:
    result = zs_pipe(desc, candidate_labels)
    top_label = result['labels'][0]
    top_score = result['scores'][0]
    print(f'Text:   {desc[:60]}')
    print(f'Label:  {top_label}  ({top_score:.3f})')
    print()


---

## Section 8.4 -- Fine-tuning a Pre-trained Transformer

Zero-shot inference is convenient but limited. **Fine-tuning** adapts a pre-trained
model to your specific task by continuing training on your labelled data.
Because the model already understands language, fine-tuning typically needs
only a small dataset and a few epochs -- far less than training from scratch.

We fine-tune `distilbert-base-uncased` to classify developer roles as
data-focused or software-focused using the `is_data_role` label we created
from the SO 2025 `DevType` column.
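
The training loop in this section calls `torch.nn.utils.clip_grad_norm_(..., 1.0)` before each optimiser step. As a rough sketch of what norm clipping does, the pure-Python function below rescales a flat list of gradient values so that their combined L2 norm never exceeds a limit. The name `clip_by_global_norm` is made up for this illustration. It mirrors the idea, not PyTorch's implementation.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale gradient values so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm      # already small enough: unchanged
    scale = max_norm / total_norm           # shrink every value by the same factor
    return [g * scale for g in grads], total_norm

grads = [3.0, 4.0]                          # combined norm = 5.0
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(f'before: {norm_before:.2f}  after: {norm_after:.2f}  clipped: {clipped}')
```

All values are scaled by the same factor, so the update direction is preserved and only the step size is capped. This keeps a single unusually large batch gradient from destabilising the pre-trained weights.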


In [None]:
# 8.4.1 -- Prepare data for fine-tuning

from transformers import AutoModelForSequenceClassification
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW

# Use primary role text as input, is_data_role as label
# Downsample to 2000 examples for fast fine-tuning in Colab
df_ft = df[['primary_role', 'is_data_role']].dropna().copy()
df_ft = df_ft.sample(n=min(2000, len(df_ft)), random_state=RANDOM_STATE).reset_index(drop=True)

# Balance classes by downsampling the majority class
# (final size = 2 * n_min, which can be well below 2000)
n_min = df_ft['is_data_role'].value_counts().min()
df_ft = pd.concat([
    df_ft[df_ft['is_data_role'] == 0].sample(n_min, random_state=RANDOM_STATE),
    df_ft[df_ft['is_data_role'] == 1].sample(n_min, random_state=RANDOM_STATE),
]).sample(frac=1, random_state=RANDOM_STATE).reset_index(drop=True)

train_df, test_df = train_test_split(df_ft, test_size=0.2,
                                     random_state=RANDOM_STATE,
                                     stratify=df_ft['is_data_role'])

print(f'Fine-tuning dataset: {len(train_df)} train, {len(test_df)} test')
print(f'Class balance: {train_df["is_data_role"].mean()*100:.0f}% data roles')

# Tokenise all texts
def tokenise_batch(texts, max_length=64):
    return tokenizer(
        list(texts),
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors='pt'
    )

# Custom Dataset
class RoleDataset(Dataset):
    def __init__(self, texts, labels, max_length=64):
        # Reuse the helper above so tokenisation settings live in one place
        self.encodings = tokenise_batch(texts, max_length=max_length)
        self.labels = torch.tensor(labels.values, dtype=torch.long)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {
            'input_ids':      self.encodings['input_ids'][idx],
            'attention_mask': self.encodings['attention_mask'][idx],
            'labels':         self.labels[idx],
        }

train_ds = RoleDataset(train_df['primary_role'], train_df['is_data_role'])
test_ds  = RoleDataset(test_df['primary_role'],  test_df['is_data_role'])
train_loader_ft = DataLoader(train_ds, batch_size=32, shuffle=True)
test_loader_ft  = DataLoader(test_ds,  batch_size=64, shuffle=False)
print(f'Train batches: {len(train_loader_ft)},  Test batches: {len(test_loader_ft)}')


In [None]:
# 8.4.2 -- Load model and fine-tune for 3 epochs

print(f'Loading {MODEL_NAME} for sequence classification...')
model_ft = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=2,
    ignore_mismatched_sizes=True
)
model_ft = model_ft.to(DEVICE)
print(f'Model loaded. Parameters: {sum(p.numel() for p in model_ft.parameters()):,}')

optimizer_ft = AdamW(model_ft.parameters(), lr=2e-5, weight_decay=0.01)

N_EPOCHS_FT   = 3
ft_train_losses = []
ft_val_accs     = []

print(f'Fine-tuning on {DEVICE} for {N_EPOCHS_FT} epochs...')
print(f'{"Epoch":>6}  {"Train Loss":>12}  {"Val Acc":>10}')
print('-' * 32)

for epoch in range(1, N_EPOCHS_FT + 1):
    # Training
    model_ft.train()
    epoch_loss = 0.0
    for batch in train_loader_ft:
        input_ids      = batch['input_ids'].to(DEVICE)
        attention_mask = batch['attention_mask'].to(DEVICE)
        labels         = batch['labels'].to(DEVICE)
        optimizer_ft.zero_grad()
        outputs = model_ft(input_ids=input_ids,
                           attention_mask=attention_mask,
                           labels=labels)
        outputs.loss.backward()
        torch.nn.utils.clip_grad_norm_(model_ft.parameters(), 1.0)
        optimizer_ft.step()
        epoch_loss += outputs.loss.item()
    avg_loss = epoch_loss / len(train_loader_ft)
    ft_train_losses.append(avg_loss)

    # Validation (the held-out test split doubles as the validation set here)
    model_ft.eval()
    all_preds, all_labels = [], []
    with torch.no_grad():
        for batch in test_loader_ft:
            input_ids      = batch['input_ids'].to(DEVICE)
            attention_mask = batch['attention_mask'].to(DEVICE)
            outputs = model_ft(input_ids=input_ids, attention_mask=attention_mask)
            preds   = outputs.logits.argmax(dim=-1).cpu().numpy()
            all_preds.extend(preds)
            all_labels.extend(batch['labels'].numpy())
    val_acc = accuracy_score(all_labels, all_preds)
    ft_val_accs.append(val_acc)
    print(f'{epoch:>6}  {avg_loss:>12.4f}  {val_acc:>10.4f}')

print('Fine-tuning complete.')


In [None]:
# 8.4.3 -- Evaluate and visualise fine-tuning results

print(f'Final fine-tuned model accuracy: {ft_val_accs[-1]:.4f}')
print()
print(classification_report(all_labels, all_preds,
                             target_names=['Software-focused', 'Data-focused']))

# Training curve
fig, axes = plt.subplots(1, 2, figsize=(13, 4))

axes[0].plot(range(1, N_EPOCHS_FT+1), ft_train_losses, 'o-', color='#E8722A',
             linewidth=2, markersize=8)
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Training Loss')
axes[0].set_title('DistilBERT Fine-tuning: Training Loss')
axes[0].set_xticks(range(1, N_EPOCHS_FT+1))

axes[1].plot(range(1, N_EPOCHS_FT+1), ft_val_accs, 'o-', color='#2E75B6',
             linewidth=2, markersize=8)
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Validation Accuracy')
axes[1].set_title('DistilBERT Fine-tuning: Validation Accuracy')
axes[1].set_xticks(range(1, N_EPOCHS_FT+1))
axes[1].set_ylim(0.5, 1.0)

plt.suptitle('SO 2025 Role Classifier: Fine-tuned DistilBERT',
             fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

# Compare with TF-IDF baseline
tfidf_binary = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=300, ngram_range=(1,2))),
    ('clf',   LogisticRegression(max_iter=500, random_state=RANDOM_STATE)),
])
tfidf_binary.fit(train_df['primary_role'], train_df['is_data_role'])
baseline_acc = accuracy_score(test_df['is_data_role'],
                               tfidf_binary.predict(test_df['primary_role']))

print(f'Comparison on data-role classification:')
print(f'  TF-IDF + Logistic Regression: {baseline_acc:.4f}')
print(f'  Fine-tuned DistilBERT:        {ft_val_accs[-1]:.4f}')
print(f'  Difference:                   {(ft_val_accs[-1]-baseline_acc)*100:+.1f} percentage points')


---

## Section 8.5 -- Understanding Attention

The transformer's key innovation is the **attention mechanism** -- a learned
weighting that lets the model focus on the most relevant parts of the input
when encoding each token. Visualising attention weights gives intuition
for what the model is 'looking at' when making predictions.
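
The weights themselves come from scaled dot-product attention: each query is compared with every key, the scaled scores are passed through a softmax, and the result is used to average the value vectors. Below is a minimal single-head sketch in pure Python. The toy vectors in `X` are made up, and the learned query, key, and value projections of a real transformer are skipped for brevity.

```python
import math

def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention on lists of vectors: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    weights, outputs = [], []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)               # one row of the attention matrix
        weights.append(w)
        # Output = attention-weighted average of the value vectors
        outputs.append([sum(wi * v[j] for wi, v in zip(w, V))
                        for j in range(len(V[0]))])
    return weights, outputs

# Three toy token vectors (made-up numbers). In a real model Q, K and V are
# learned linear projections of the token representations.
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
weights, outputs = scaled_dot_product_attention(X, X, X)
for row in weights:
    print([round(w, 2) for w in row])
```

Each printed row sums to 1 and corresponds to one row of the heatmap drawn in this section: the first token puts more weight on the similar second token than on the dissimilar third.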


In [None]:
# 8.5.1 -- Extract and visualise attention weights

from transformers import AutoModel

# Load the base model with output_attentions=True
attn_model = AutoModel.from_pretrained(
    MODEL_NAME, output_attentions=True
).to(DEVICE)
attn_model.eval()

# Encode a sample text
sample_text = 'Machine learning engineer building recommendation systems'
inputs = tokenizer(sample_text, return_tensors='pt').to(DEVICE)
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

with torch.no_grad():
    outputs = attn_model(**inputs)

# outputs.attentions: tuple with one tensor per layer, each of shape (batch, heads, seq, seq)
# Average across all heads in the last layer
last_layer_attn = outputs.attentions[-1][0]          # shape (heads, seq, seq)
avg_attn        = last_layer_attn.mean(dim=0).cpu().numpy()  # shape (seq, seq)

# Plot attention heatmap
fig, ax = plt.subplots(figsize=(10, 8))
im = ax.imshow(avg_attn, cmap='Blues', aspect='auto')
ax.set_xticks(range(len(tokens)))
ax.set_yticks(range(len(tokens)))
ax.set_xticklabels(tokens, rotation=45, ha='right', fontsize=10)
ax.set_yticklabels(tokens, fontsize=10)
plt.colorbar(im, ax=ax, shrink=0.8)
ax.set_title(f'DistilBERT Attention Weights (last layer, averaged over heads)\n"{sample_text}"',
             fontsize=11)
plt.tight_layout()
plt.show()

print('Rows = query token (what is attending)')
print('Cols = key token (what is being attended to)')
print('Darker cells = higher attention weight')
print('[CLS] often attends broadly -- it aggregates the full sequence for classification')


---

## Chapter 8 Summary

### Key Takeaways

- **Classical NLP pipeline:** clean -> tokenise -> remove stopwords -> lemmatise -> TF-IDF -> classifier.
  Fast, interpretable, and competitive on short domain-specific text.
- **TF-IDF** scores words by local frequency times global rarity.
  `ngram_range=(1,2)` captures multi-word phrases like 'machine learning'.
- **Embeddings** map words to dense vectors where semantic similarity corresponds to geometric proximity.
  Contextual embeddings (transformers) go further: same word, different context, different vector.
- **Subword tokenisation** (BPE/WordPiece) handles unknown words by splitting them into
  known subword pieces. `[CLS]` and `[SEP]` are special control tokens.
- **Zero-shot inference** applies a pre-trained model to labels or tasks it was not explicitly
  trained on, with no task-specific training on your side.
  HuggingFace `pipeline()` is the one-line entry point.
- **Fine-tuning** adapts a pre-trained model to your task with a small labelled dataset.
  3 epochs on at most 1,600 training examples (fewer after class balancing) can produce a
  strong classifier because the model already understands language -- you are only
  teaching it your specific categories.
- **`clip_grad_norm_`** caps the gradient norm, guarding against exploding gradients during
  fine-tuning. It is a cheap safeguard worth including by default.
- **Attention weights** show which tokens the model focuses on. `[CLS]` often attends
  broadly because it aggregates the full sequence for classification output.

### Project Thread Status

| Task | Method | Result |
|------|--------|--------|
| Developer role classification | TF-IDF + Logistic Regression | Accuracy reported |
| Sentiment analysis on dev statements | Pre-trained DistilBERT (SST-2) | Labels + confidence |
| Zero-shot role classification | BART-large-MNLI | Top label per description |
| Data vs software role classification | Fine-tuned DistilBERT | Accuracy vs baseline |
| Attention visualisation | DistilBERT last layer | Heatmap plotted |

---

### What's Next: Chapter 9 -- Ethics, Bias, and Responsible AI

Chapter 9 examines the risks introduced by everything built in Part 3:
bias in training data, fairness metrics, model interpretability with SHAP,
and the practical steps for building more responsible ML systems.
The SO 2025 dataset provides concrete examples -- salary prediction models
can encode geographic and demographic biases that require explicit mitigation.

---

*End of Chapter 8 -- Python for AI/ML*  
[![Back to TOC](https://img.shields.io/badge/Back_to-Table_of_Contents-1B3A5C?style=flat-square)](https://colab.research.google.com/github/timothy-watt/python-for-ai-ml/blob/main/Python_for_AIML_TOC.ipynb)
