# 08 — Transformers for Text (PyTorch + Hugging Face) — Inference-first (CPU)

Objectives:
- Use Hugging Face pipelines for quick text classification (sentiment) on CPU
- Understand tokenization, truncation, and context window limits
- Batch inference efficiently and evaluate accuracy on a small validation split
- Discuss privacy, bias, and compliance considerations

Assumptions:
- Inputs are tokenized text sequences; the model relates tokens across positions via self-attention
- The context window is finite (e.g., 512 tokens), so longer text must be truncated or chunked

Cautions/Data Prep:
- Clean/tokenize appropriately; watch special tokens and truncation behavior
- Ensure input length ≤ model max length; otherwise truncate/slide window
- Training/fine-tuning requires significant compute and data; avoid in CPU-only short labs
- Handle privacy: avoid sensitive data; consider anonymization and auditability
- Monitor for bias and harmful outputs in enterprise contexts


In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
from time import perf_counter
from pprint import pprint

from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
from datasets import load_dataset

MODEL = 'distilbert-base-uncased-finetuned-sst-2-english'  # small, CPU-friendly sentiment model


## 1) Quickstart: Sentiment analysis pipeline
Pipelines wrap tokenization + model forward + postprocessing. First call downloads weights (cached afterwards).

In [None]:
clf = pipeline('sentiment-analysis', model=MODEL)
texts = [
    "I loved the movie! The performances were outstanding.",
    "This was a waste of time. Completely boring and predictable.",
]
pprint(clf(texts))
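
To make the postprocessing step concrete, here is a rough plain-Python sketch of what a single-label classification pipeline does with the model's raw outputs: apply softmax to the class logits and report the top class. The logit values and the `postprocess` helper are made up for illustration; only the label mapping comes from this notebook.

```python
import math

# Same id -> label mapping as the label_map used in the evaluation section
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def postprocess(logits, id2label=ID2LABEL):
    """Turn raw class logits into a pipeline-style {'label', 'score'} dict."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

example_logits = [-2.1, 2.4]  # hypothetical logits for one positive review
result = postprocess(example_logits)
print(result)
```

Note that the returned `score` is the probability of the predicted label only, which matters when aggregating scores across chunks later on.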

Under the hood: tokenization, truncation, and max sequence length. Inspect tokenizer and model config (e.g., max length ~512 tokens).

In [None]:
tok = AutoTokenizer.from_pretrained(MODEL)
mdl = AutoModelForSequenceClassification.from_pretrained(MODEL)
tok.model_max_length, mdl.config.id2label

Manual tokenization example: encode/decode, attention masks, and truncation vs no truncation.

In [None]:
sample = "This product is surprisingly good and well worth the price! " * 30  # long input; trailing space keeps sentences apart
enc_trunc = tok(sample, truncation=True, max_length=128, padding='max_length', return_tensors='pt')
enc_no_trunc = tok(sample, truncation=False, return_tensors='pt')
len(enc_trunc['input_ids'][0]), enc_no_trunc['input_ids'].shape[1]
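
The relationship between truncation, padding, and the attention mask can be sketched without a tokenizer. The toy helper below (`pad_or_truncate`, a hypothetical name) works on made-up token ids and assumes a pad id of 0. Unlike a real tokenizer, it does not re-append the closing special token after truncating, so treat it as an illustration of the shapes rather than of the exact ids.

```python
def pad_or_truncate(ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = ids[:max_length]                 # truncation: drop tokens past the limit
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad     # padding: fill up to max_length
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask

short_ids = [101, 2023, 2003, 2204, 102]  # hypothetical ids for a short sentence
long_ids = list(range(1000, 1012))        # 12 hypothetical ids, longer than the limit

short_out = pad_or_truncate(short_ids, max_length=8)
long_out = pad_or_truncate(long_ids, max_length=8)
print(short_out)
print(long_out)
```

The short input is padded and its mask ends in zeros; the long input loses its last four tokens and its mask is all ones.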

You can pass tokenizer kwargs (`truncation`, `max_length`) through the pipeline call, together with the pipeline's own `batch_size` argument, to control input length and throughput on CPU.

In [None]:
batch = [
    "Terrific pacing and solid acting.",
    "Mediocre plot, but some scenes were fun.",
    "I wouldn't recommend this to anyone.",
    "Absolutely fantastic!",
] * 64  # 256 texts to simulate a larger batch

t0 = perf_counter()
out = clf(batch, batch_size=32, truncation=True, max_length=128)
dt = perf_counter() - t0
print(f"Processed {len(batch)} texts in {dt:.2f}s (~{len(batch)/dt:.1f} texts/s on CPU)")
out[:3]
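
Why batch composition affects CPU throughput can be sketched with simple arithmetic, assuming each batch is padded to its longest sequence (the usual dynamic-padding behavior). The helpers and token lengths below are hypothetical. Sorting by length is an extra technique not used elsewhere in this notebook, shown only to illustrate how much padding a mixed batch can carry.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def padded_tokens(lengths, batch_size):
    """Total tokens processed when every batch is padded to its longest member."""
    return sum(max(b) * len(b) for b in iter_batches(lengths, batch_size))

lengths = [5, 40, 6, 38, 7, 42, 5, 41]  # hypothetical token counts per text

real = sum(lengths)
mixed = padded_tokens(lengths, batch_size=4)
sorted_first = padded_tokens(sorted(lengths), batch_size=4)
print({"real": real, "mixed_batches": mixed, "length_sorted": sorted_first})
```

In this toy case the mixed batches process 328 token slots for 184 real tokens, while length-sorted batches process 196. If you sort inputs this way, remember to restore the original order of the predictions.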

## 2) Evaluate on a standard dataset (SST-2 validation subset)
We use the GLUE SST-2 validation split (binary sentiment) and measure pipeline accuracy on the first N examples, which keeps the exercise quick on CPU.

In [None]:
ds = load_dataset('glue', 'sst2', split='validation')  # requires internet to fetch
N = 200  # keep small for CPU
sampled = ds.select(range(N))
label_map = {0: 'NEGATIVE', 1: 'POSITIVE'}

preds = clf(sampled['sentence'], batch_size=32, truncation=True, max_length=128)
pred_labels = [p['label'] for p in preds]
true_labels = [label_map[y] for y in sampled['label']]
acc = np.mean(np.array(pred_labels) == np.array(true_labels))
print({'N': N, 'accuracy': round(float(acc), 3)})
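
Accuracy alone hides which class the errors fall on. As a minimal sketch on made-up labels (not real SST-2 results), the helpers below compute accuracy and per-pair confusion counts from two label lists; `accuracy` and `confusion_counts` are hypothetical names, and the same functions could be applied to `pred_labels` and `true_labels`.

```python
from collections import Counter

def accuracy(pred, true):
    """Fraction of positions where the predicted label equals the true label."""
    if len(pred) != len(true) or not true:
        raise ValueError("pred and true must be non-empty and the same length")
    return sum(p == t for p, t in zip(pred, true)) / len(true)

def confusion_counts(pred, true):
    """Count (true_label, predicted_label) pairs."""
    return Counter(zip(true, pred))

# Toy labels for illustration only
toy_true = ["POSITIVE", "NEGATIVE", "POSITIVE", "NEGATIVE", "POSITIVE"]
toy_pred = ["POSITIVE", "NEGATIVE", "NEGATIVE", "NEGATIVE", "POSITIVE"]

toy_acc = accuracy(toy_pred, toy_true)
toy_conf = confusion_counts(toy_pred, toy_true)
print(toy_acc, dict(toy_conf))
```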

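Truncation effect: vary `max_length` and check whether accuracy stays stable. A larger limit keeps more context from long inputs at the cost of runtime. SST-2 examples are single sentences and generally short, so accuracy will likely change little across these settings.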

In [None]:
accs = {}
for L in [64, 128, 256]:
    preds_L = clf(sampled['sentence'], batch_size=32, truncation=True, max_length=L)
    pred_L = [p['label'] for p in preds_L]
    accs[L] = float(np.mean(np.array(pred_L) == np.array(true_labels)))
accs

## 3) Handling long texts (chunking)
For texts longer than the model max length, split the text into overlapping chunks and aggregate the per-chunk predictions (e.g., average logits or take a majority vote of labels). Here `stride` is the number of tokens shared by consecutive chunks, and we use a simple majority vote over chunk labels as a demonstration.

In [None]:
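def chunk_text(text, tokenizer, max_length=256, stride=32):
    """Split text into overlapping chunks of at most max_length tokens, incl. [CLS]/[SEP]."""
    # Tokenize once without special tokens; the pipeline re-adds them for each chunk
    ids = tokenizer(text, add_special_tokens=False)['input_ids']
    window = max(1, max_length - 2)   # reserve room for [CLS]/[SEP]
    step = max(1, window - stride)    # stride = overlap; always advance by at least one token
    chunks = []
    i = 0
    while i < len(ids):
        chunks.append(tokenizer.decode(ids[i:i + window]))
        if i + window >= len(ids):    # this window reached the end; skip an overlap-only tail
            break
        i += step
    return chunks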

long_text = ("The film starts strong with engaging characters and witty dialogue, "
             "but gradually loses momentum, suffering from pacing issues. "
             "Nonetheless, certain scenes are genuinely moving, and the score is memorable. "
             "Overall, it's a mixed bag with both charming moments and frustrating flaws. ") * 20

parts = chunk_text(long_text, tok, max_length=128, stride=32)
pred_parts = clf(parts, batch_size=16, truncation=True, max_length=128)
labels = [p['label'] for p in pred_parts]
final = pd.Series(labels).value_counts().idxmax()
final, pd.Series(labels).value_counts().to_dict()
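
The window arithmetic behind overlapping chunks is easier to check on token counts alone. The sketch below uses a hypothetical helper, `window_spans`, that returns the `(start, end)` token spans for a given sequence length, window size, and overlap. It follows the same convention as this notebook, where `stride` means the number of tokens shared by consecutive chunks.

```python
def window_spans(n_tokens, window, stride):
    """Return (start, end) token spans; consecutive spans overlap by `stride` tokens."""
    if window <= 0 or not 0 <= stride < window:
        raise ValueError("need window > 0 and 0 <= stride < window")
    if n_tokens <= 0:
        return []
    step = window - stride
    spans = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end >= n_tokens:   # the last span reaches the end of the sequence
            break
        start += step
    return spans

# 300 tokens, 126 usable tokens per chunk (128 minus two special tokens), 32 overlap
spans = window_spans(300, window=126, stride=32)
print(spans)
```

Every token is covered at least once, and the final span is shorter than the window rather than being a redundant, overlap-only chunk.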

## 4) Notes on privacy, bias, and compliance
- Avoid passing sensitive data to third-party endpoints; models downloaded locally run offline after caching, but take care when logging text
- Audit and anonymize where necessary; keep access controls and data retention policies
- Monitor outputs for bias and harmful content; document known model limitations
- In enterprise, add guardrails, human-in-the-loop, and domain-specific evaluation
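
One concrete way to take care when logging text is to mask obvious identifiers before anything is written out. The sketch below is deliberately crude: the two regular expressions and the `redact` helper are illustrative assumptions, they catch only simple email and phone patterns, and they are no substitute for a reviewed anonymization process.

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text):
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)

raw = "Contact jane.doe@example.com or +1 (555) 010-9999 about the refund."
safe = redact(raw)
print(safe)
```

Applying such a step to inputs before they reach logs keeps the model call itself unchanged while reducing what is retained.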


## Exercises
Instructor solution cells are hidden/collapsed.
1. Batch vs single: Measure throughput (texts/sec) for single-item calls in a loop vs a batched call of size 32 on 256 texts.
2. Truncation stability: For SST-2 N=200, compute accuracy at max_length in [32, 64, 128, 256]; plot or print results. Which setting balances speed and accuracy best on CPU?
3. Chunking policy: Implement a function that aggregates chunk predictions by averaging scores (use the `score` field), not just majority label.


In [None]:
# Exercise 1: Batch vs single
# TODO: Create 256 medium-length sentences. Time clf on: (a) loop calling one-by-one; (b) one batched call (batch_size=32).
# Report times and throughput.
...

In [None]:
# Solution 1 (hidden)
samples = ["The movie was okay, some parts dragged but acting was fine." for _ in range(256)]
# (a) one call per text, with the same tokenizer kwargs as the batched run for a fair comparison
t0 = perf_counter()
single = [clf([s], truncation=True, max_length=128)[0] for s in samples]
t_single = perf_counter() - t0
# (b) a single batched call
t0 = perf_counter()
batched = clf(samples, batch_size=32, truncation=True, max_length=128)
t_batch = perf_counter() - t0
{
    'single_s': round(t_single, 2),
    'batch_s': round(t_batch, 2),
    'throughput_single': round(256 / max(t_single, 1e-6), 1),
    'throughput_batch': round(256 / max(t_batch, 1e-6), 1),
}

In [None]:
# Exercise 2: Truncation stability on SST-2
# TODO: For L in [32,64,128,256], compute accuracy on the first N=200 validation examples.
# Hint: reuse 'sampled' and 'true_labels'.
...

In [None]:
# Solution 2 (hidden)
accs2 = {}
for L in [32,64,128,256]:
    predsL = clf(sampled['sentence'], batch_size=32, truncation=True, max_length=L)
    accs2[L] = float(np.mean(np.array([p['label'] for p in predsL]) == np.array(true_labels)))
accs2

In [None]:
# Exercise 3: Score-averaging chunking
# TODO: Modify the chunk aggregation to average class scores and pick the final label by max average score.
# Hint: the pipeline returns only the top label and its score. For this two-class model the other
# class's score is 1 - score, so build a [score_neg, score_pos] vector per chunk and average across chunks.
...

In [None]:
# Solution 3 (hidden)
def chunk_predict_avg(text, clf, tokenizer, max_length=128, stride=32):
    chunks = chunk_text(text, tokenizer, max_length=max_length, stride=stride)
    outs = clf(chunks, batch_size=16, truncation=True, max_length=max_length)
    # Take labels from the model config so both classes exist even if every chunk agrees
    labels = sorted(clf.model.config.id2label.values())
    idx = {lab: i for i, lab in enumerate(labels)}
    scores = np.zeros((len(outs), len(labels)))
    for i, o in enumerate(outs):
        # The pipeline returns only the top label's probability; with two softmax classes
        # the other class gets the remainder, 1 - score (valid for binary models only)
        scores[i, :] = 1.0 - o['score']
        scores[i, idx[o['label']]] = o['score']
    avg = scores.mean(axis=0)
    pred = labels[int(np.argmax(avg))]
    return pred, {labels[i]: float(avg[i]) for i in range(len(labels))}

pred_label, avg_scores = chunk_predict_avg(long_text, clf, tok)
pred_label, avg_scores

## Wrap-up checklist
- [ ] Use pipelines for quick CPU inference; cache models locally
- [ ] Control truncation and max_length; respect model context window
- [ ] Batch inputs for speed on CPU
- [ ] Evaluate on a small standard split to sanity-check performance
- [ ] Consider privacy, bias, and compliance in production use
