Week 10 · Day 6 — Packaging & Inference Pipeline
Why this matters

Training a model is only half the job: to use it, you need a clean inference pipeline. Saving the weights, vocab, and preprocessing rules together keeps predictions reproducible. A simple predict() function makes your model usable anywhere (apps, scripts, APIs).

Theory Essentials

Save model + vocab: both must match training.

Deterministic preprocessing: tokenization, padding, masks must be identical.

Batch vs single prediction: pipeline should handle both.

predict() function: one clean entry point → {label, prob}.

Latency check: measure average inference time.
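
One way to keep preprocessing deterministic is to store the rules next to the vocab and have a single encode function read them. The sketch below is a minimal version under a few assumptions: the file name preproc.json, the config keys, and the save_bundle/load_bundle/encode helpers are all hypothetical, and JSON is used instead of torch.save only to keep the file human-readable.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical preprocessing config; the key names are assumptions of this sketch.
PREPROC = {
    "lowercase": True,
    "pad_id": 0,
    "unk_id": 0,      # the notebook maps unknown words to 0, the same id as padding
    "max_len": 512,
}
VOCAB = {"i": 1, "love": 2, "hate": 3, "this": 4, "movie": 5}


def save_bundle(path, vocab, preproc):
    """Write vocab and preprocessing rules to one JSON file."""
    Path(path).write_text(json.dumps({"vocab": vocab, "preproc": preproc}, indent=2))


def load_bundle(path):
    """Read the bundle back; returns (vocab, preproc)."""
    data = json.loads(Path(path).read_text())
    return data["vocab"], data["preproc"]


def encode(text, vocab, preproc):
    """Whitespace-tokenize and map tokens to ids using only the saved rules."""
    words = text.split()
    if preproc["lowercase"]:
        words = [w.lower() for w in words]
    ids = [vocab.get(w, preproc["unk_id"]) for w in words][: preproc["max_len"]]
    return ids or [preproc["pad_id"]]   # never return an empty sequence


with tempfile.TemporaryDirectory() as tmp:
    bundle_path = Path(tmp) / "preproc.json"
    save_bundle(bundle_path, VOCAB, PREPROC)
    vocab2, preproc2 = load_bundle(bundle_path)

ids = encode("I love this movie", vocab2, preproc2)
print(ids)  # [1, 2, 4, 5]
```

The same encode function would then be called from both the training script and predict(), so there is only one tokenization path to maintain.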

In [None]:
import torch, torch.nn as nn
import numpy as np, time

torch.manual_seed(42)

# Model definition
class BiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size=100, embed_dim=16, hidden_dim=32, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim*2, num_classes)
    def forward(self, x, lengths):
        embeds = self.embedding(x)
        packed = nn.utils.rnn.pack_padded_sequence(
            embeds, lengths.cpu(), batch_first=True, enforce_sorted=False
        )
        _, (h_n, _) = self.lstm(packed)
        h_cat = torch.cat((h_n[-2], h_n[-1]), dim=1)
        return self.fc(h_cat)

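# Stand-in for a trained model: the weights here are randomly initialized, not trained
model = BiLSTMClassifier()
torch.save(model.state_dict(), "bilstm.pt")  # save weights

# Toy vocab (token-to-id); id 0 is reserved for padding and is also used for unknown words
vocab = {"i":1, "love":2, "hate":3, "this":4, "movie":5}
torch.save(vocab, "vocab.pt")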

# Load model + vocab for inference
vocab = torch.load("vocab.pt")
model = BiLSTMClassifier()
model.load_state_dict(torch.load("bilstm.pt"))
model.eval()

# Inference function
def predict(text, model, vocab):
    tokens = [vocab.get(w.lower(), 0) for w in text.split()]
    lengths = torch.tensor([len(tokens)])
    x = torch.tensor(tokens).unsqueeze(0)  # batch=1
    with torch.no_grad():
        logits = model(x, lengths)
        probs = torch.softmax(logits, dim=1).cpu().numpy()[0]
    pred = int(np.argmax(probs))
    return {"label": pred, "prob": float(probs[pred])}

# Demo
print(predict("I love this movie", model, vocab))
print(predict("I hate this", model, vocab))

# Latency check
t0 = time.perf_counter()  # monotonic, high-resolution timer
for _ in range(100):
    predict("I love this movie", model, vocab)
print("Avg latency:", (time.perf_counter()-t0)/100, "sec per sample")


{'label': 1, 'prob': 0.5582220554351807}
{'label': 1, 'prob': 0.5485066175460815}
Avg latency: 0.0010336852073669434 sec per sample
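
A single average hides variation between calls, and the first few calls are often slower than the rest. A minimal sketch of a reusable timing helper, assuming a hypothetical benchmark() function that is not part of the notebook; it discards warm-up calls and reports mean, median, and 95th-percentile latency:

```python
import statistics
import time


def benchmark(fn, *args, warmup=10, runs=100):
    """Time fn(*args) and return latency statistics in milliseconds."""
    for _ in range(warmup):          # warm-up calls are not timed
        fn(*args)
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()     # monotonic, high-resolution clock
        fn(*args)
        samples.append((time.perf_counter() - t0) * 1e3)
    samples.sort()
    return {
        "mean_ms": statistics.fmean(samples),
        "median_ms": statistics.median(samples),
        "p95_ms": samples[int(0.95 * (len(samples) - 1))],
    }


# Stand-in workload so the sketch runs on its own; in the notebook this would be
# benchmark(predict, "I love this movie", model, vocab).
stats = benchmark(sum, range(10_000))
print(stats)
```

Reporting the median and a high percentile alongside the mean makes occasional slow calls visible, which matters more for user-facing latency than the average does.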


1) Core (10–15 min)

Task: Modify predict() to return all class probabilities instead of only the top one.

In [2]:
def predict(text, model, vocab):
    tokens = [vocab.get(w.lower(), 0) for w in text.split()]
    lengths = torch.tensor([len(tokens)])
    x = torch.tensor(tokens).unsqueeze(0)  # batch=1
    with torch.no_grad():
        logits = model(x, lengths)
        probs = torch.softmax(logits, dim=1).cpu().numpy()[0]
    pred = int(np.argmax(probs))
    return {"label": pred, "probs": probs.tolist()}

print(predict("I love this movie", model, vocab))
print(predict("I hate this", model, vocab))

{'label': 1, 'probs': [0.44177791476249695, 0.5582220554351807]}
{'label': 1, 'probs': [0.45149338245391846, 0.5485066175460815]}


2) Practice (10–15 min)

Task: Extend predict() to accept a list of sentences (batch inference). Compare latency vs single inference.

In [7]:
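def predict_batch(texts, model, vocab):
    # Tokenize each sentence; unknown words map to 0, and an empty sentence becomes [0]
    seqs = [[vocab.get(w.lower(), 0) for w in t.split()] or [0] for t in texts]
    lengths = torch.tensor([len(s) for s in seqs])   # true lengths before padding
    # Right-pad with 0 (the padding_idx) up to the longest sentence in the batch
    padded = nn.utils.rnn.pad_sequence([torch.tensor(s) for s in seqs], batch_first=True)
    with torch.no_grad():
        logits = model(padded, lengths)
        probs = torch.softmax(logits, dim=1).cpu().numpy()
    return probs   # shape [batch, num_classes]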

texts = [
    "I love this movie",
    "I hate this",
    "love love love",
    "hate movie",
    "this movie",
    "love movie",
    "I love hate this movie",  # mixed sentiment
]

print(predict_batch(texts, model, vocab))

t0 = time.perf_counter()  # monotonic, high-resolution timer
for _ in range(100):
    predict_batch(texts, model, vocab)
print("Avg latency:", (time.perf_counter()-t0)/100, "sec per batch")

[[0.4417779  0.55822206]
 [0.45149338 0.5485066 ]
 [0.4435395  0.55646056]
 [0.42600808 0.57399195]
 [0.44542235 0.55457765]
 [0.43711635 0.5628836 ]
 [0.44422817 0.5557718 ]]
Avg latency: 0.0033507895469665526 sec per batch


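Each batch call takes roughly three times as long as a single predict() call, but it processes 7 sentences, so the cost per sentence is lower (about 0.48 ms versus 1.03 ms).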
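
To compare the two timings fairly, both need to be in the same unit. A small calculation using the two latency values printed in this notebook (batch size 7, as in texts):

```python
# Numbers copied from the two latency printouts in this notebook.
single_sec_per_sample = 0.0010336852073669434   # predict(), one sentence per call
batch_sec_per_call = 0.0033507895469665526      # predict_batch(), 7 sentences per call
batch_size = 7

batch_sec_per_sample = batch_sec_per_call / batch_size
speedup = single_sec_per_sample / batch_sec_per_sample

print(f"batched: {batch_sec_per_sample * 1e3:.2f} ms per sentence")
print(f"single:  {single_sec_per_sample * 1e3:.2f} ms per sentence")
print(f"per-sentence speedup from batching: {speedup:.1f}x")
```

On this run batching comes out a little over twice as fast per sentence. The exact ratio depends on hardware, batch size, and sentence lengths.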

3) Stretch (optional, 10–15 min)

Task: Add an option to predict() to also return highlighted tokens using gradient-based saliency (from Day 5).

In [9]:
def predict_sal(text, model, vocab, return_saliency=False):
    # --- tokenize ---
    ids = [vocab.get(w.lower(), 0) for w in text.split()] or [0]  # handle empty
    lengths = torch.tensor([len(ids)], dtype=torch.long)
    x = torch.tensor(ids, dtype=torch.long).unsqueeze(0)          # [1, T]

    # --- standard inference ---
    model.eval()
    with torch.no_grad():
        logits = model(x, lengths)                 # [1, C]
        probs = torch.softmax(logits, dim=1)[0]    # [C]
    pred = int(torch.argmax(probs).item())
    out = {"label": pred, "prob": float(probs[pred].item())}

    if not return_saliency:
        return out

    # --- saliency (extra pass that tracks gradients) ---
    # Recreate the forward with retained grad on embeddings (no model changes needed)
    embeds = model.embedding(x)                    # [1, T, E]
    embeds.retain_grad()                           # keep grads on non-leaf tensor
    packed = nn.utils.rnn.pack_padded_sequence(
        embeds, lengths.cpu(), batch_first=True, enforce_sorted=False
    )
    _, (h_n, _) = model.lstm(packed)
    h_cat = torch.cat((h_n[-2], h_n[-1]), dim=1)   # [1, 2H]
    logits2 = model.fc(h_cat)                      # [1, C]

    score = logits2[0, pred]                       # use predicted-class logit
    model.zero_grad()
    score.backward()

    L = int(lengths[0].item())
    grads = embeds.grad[0, :L]                     # [T, E]
    sal = grads.norm(dim=1).detach().cpu().numpy() # [T]
    # normalize 0–1 (safe)
    rng = sal.max() - sal.min()
    sal = (sal - sal.min()) / (rng if rng > 0 else 1.0)

    tokens = text.split()[:L]                      # raw tokens (same split as above)
    out.update({"tokens": tokens, "saliency": sal.tolist()})
    return out

print(predict_sal("I love this movie", model, vocab))
print(predict_sal("I hate this", model, vocab, return_saliency=True))
# Pretty-print saliency
r = predict_sal("I love hate this movie", model, vocab, return_saliency=True)
for t, s in zip(r["tokens"], r["saliency"]):
    print(f"{t:<12} {s:.2f}")


{'label': 1, 'prob': 0.5582220554351807}
{'label': 1, 'prob': 0.5485066175460815, 'tokens': ['I', 'hate', 'this'], 'saliency': [1.0, 0.0, 0.2271358221769333]}
I            1.00
love         0.23
hate         0.00
this         0.03
movie        0.88


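These scores should not be read as meaningful yet: the model was never trained (its weights are randomly initialized), so the saliency reflects random parameters. The scores are also min-max normalized per sentence, so the top token always gets 1.00 and the lowest 0.00. A trained sentiment model would be expected to put more weight on words like 'love' and 'hate' than on 'I'.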
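
Because predict_sal() rescales the gradient norms to the 0–1 range within each sentence, the printed values only rank tokens relative to each other. A minimal sketch with made-up gradient norms (not values from the run above) shows two very different raw scales giving the same normalized scores:

```python
def min_max(values):
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    rng = (hi - lo) or 1.0           # avoid dividing by zero when all values are equal
    return [(v - lo) / rng for v in values]


# Hypothetical raw gradient norms, chosen only for illustration.
small = [0.0010, 0.0012, 0.0011]
large = [10.0, 12.0, 11.0]

print(min_max(small))
print(min_max(large))
```

A score of 1.00 therefore says "largest in this sentence", not "large in absolute terms".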

Mini-Challenge (≤40 min)

Task: Build a predict.py script that:

Loads bilstm.pt + vocab.pt.

Accepts text input from CLI.

Prints predicted label, probability, and (optional) latency.

Acceptance Criteria:

Runs as python predict.py "Some text".

Outputs label + probability.

Works for both short and long inputs.

In [11]:
# Mini-Challenge: interactive predict() in a notebook (no CLI args)

import time, torch, torch.nn as nn, numpy as np

# --- 1) Model class (must match how you saved weights) ---
class BiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size=100, embed_dim=16, hidden_dim=32, num_classes=2, pad_idx=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim * 2, num_classes)
        self.pad_idx = pad_idx
    def forward(self, x, lengths):
        embeds = self.embedding(x)
        packed = nn.utils.rnn.pack_padded_sequence(
            embeds, lengths.cpu(), batch_first=True, enforce_sorted=False
        )
        _, (h_n, _) = self.lstm(packed)
        h_cat = torch.cat((h_n[-2], h_n[-1]), dim=1)
        return self.fc(h_cat)

# --- 2) Load vocab + model (from files if available) ---
PAD_ID = 0
try:
    vocab = torch.load("vocab.pt")
except FileNotFoundError:
    # fallback tiny vocab so the cell still works
    vocab = {"i":1, "love":2, "hate":3, "this":4, "movie":5}
try:
    state = torch.load("bilstm.pt", map_location="cpu")
    model = BiLSTMClassifier()  # default sizes must match the checkpoint (saved with vocab_size=100)
    model.load_state_dict(state)
except FileNotFoundError:
    # untrained demo model if weights aren’t present
    model = BiLSTMClassifier(vocab_size=max(vocab.values(), default=0)+1, pad_idx=PAD_ID)
model.eval();

# --- 3) Inference helpers ---
LABEL_MAP = {0: "neg", 1: "pos"}
MAX_LEN = 512  # cap extremely long inputs

def text_to_ids(text, vocab, pad_id=0, max_len=MAX_LEN):
    ids = [vocab.get(w.lower(), pad_id) for w in text.split()]
    ids = ids[:max_len]
    if not ids:
        ids = [pad_id]
    return ids

def predict(text: str):
    ids = text_to_ids(text, vocab, pad_id=PAD_ID)
    x = torch.tensor(ids, dtype=torch.long).unsqueeze(0)              # [1, T]
    lengths = torch.tensor([len(ids)], dtype=torch.long)
    with torch.no_grad():
        logits = model(x, lengths)                                    # [1, 2]
        probs = torch.softmax(logits, dim=1)[0].cpu().numpy()
    pred = int(np.argmax(probs))
    return pred, float(probs[pred])

# --- 4) Interactive loop with latency print ---
print("IMDB sentiment demo — type a review. Press Enter on an empty line or 'q' to quit.\n")
while True:
    txt = input("Your text> ").strip()
    if txt == "" or txt.lower() == "q":
        print("Bye!")
        break
    t0 = time.perf_counter()
    label, prob = predict(txt)
    dt = (time.perf_counter() - t0) * 1e3
    print(f"Prediction: {LABEL_MAP.get(label, label)}  |  prob={prob:.3f}  |  latency={dt:.2f} ms\n")


IMDB sentiment demo — type a review. Press Enter on an empty line or 'q' to quit.



Prediction: pos  |  prob=0.551  |  latency=2.81 ms

Prediction: pos  |  prob=0.541  |  latency=2.48 ms

Prediction: pos  |  prob=0.574  |  latency=4.21 ms

Prediction: pos  |  prob=0.532  |  latency=3.27 ms

Prediction: pos  |  prob=0.520  |  latency=2.81 ms

Bye!
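
The cell above answers the challenge with an interactive loop rather than command-line arguments. To meet the python predict.py "Some text" criterion, the CLI wiring could look roughly like the sketch below. It uses only argparse from the standard library; stub_predict is a placeholder standing in for the notebook's predict(text) (model and vocab loading would be the same as in the cell above), and the --latency flag name is an assumption of this sketch.

```python
import argparse
import time


def stub_predict(text):
    """Placeholder for the notebook's predict(text); returns (label_id, prob)."""
    return 1, 0.5


def main(argv=None, predict_fn=stub_predict):
    parser = argparse.ArgumentParser(description="Classify one piece of text.")
    parser.add_argument("text", help="input text, quoted so the shell passes one argument")
    parser.add_argument("--latency", action="store_true", help="also print inference time")
    args = parser.parse_args(argv)   # argv=None makes argparse read sys.argv[1:]

    t0 = time.perf_counter()
    label, prob = predict_fn(args.text)
    elapsed_ms = (time.perf_counter() - t0) * 1e3

    line = f"label={label} prob={prob:.3f}"
    if args.latency:
        line += f" latency={elapsed_ms:.2f}ms"
    print(line)
    return line


# In predict.py the last lines would be:
#     if __name__ == "__main__":
#         main()
# Here the argument list is passed explicitly so the sketch runs anywhere.
result = main(["Some text", "--latency"])
```

Passing predict_fn as a parameter keeps the argument handling testable without loading the model.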


Notes / Key Takeaways

Always save weights + vocab + preprocessing rules.

Reproducibility = same pipeline for training & inference.

Batch inference is faster per sample than looping single calls (roughly 2x here with a batch of 7).

Predict function = clear API entry point.

Latency checks matter if deploying to real users.

Reflection

Why must vocab/preprocessing be saved alongside model weights?

Why is batch inference faster than repeated single inferences?

Why is it important to package the model together with preprocessing (vocab, tokenization, padding)?
Because inference must exactly match training. If you save only the model weights but not the vocab or preprocessing steps, new inputs may be tokenized differently and lead to wrong predictions. Packaging ensures reproducibility, consistency, and prevents “training-serving skew.”
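
A small illustration of how the skew can arise, assuming a hypothetical build_vocab() helper that assigns ids by order of first appearance (the notebook itself hard-codes its vocab): rebuilding the vocab at serving time from the same sentences in a different order gives the same words different ids, so the saved embeddings would be looked up with the wrong indices.

```python
def build_vocab(corpus):
    """Assign ids 1..N in order of first appearance; 0 is reserved for padding/unknown."""
    vocab = {}
    for sentence in corpus:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab) + 1)
    return vocab


def encode(text, vocab):
    """Map whitespace tokens to ids, sending unknown words to 0."""
    return [vocab.get(w.lower(), 0) for w in text.split()]


train_corpus = ["I love this movie", "I hate this"]
vocab_at_training = build_vocab(train_corpus)
# Rebuilding from the same sentences in a different order changes the ids.
vocab_rebuilt = build_vocab(list(reversed(train_corpus)))

text = "I love this movie"
ids_train = encode(text, vocab_at_training)
ids_serve = encode(text, vocab_rebuilt)
print(ids_train)  # [1, 2, 3, 4]
print(ids_serve)  # [1, 4, 3, 5]
```

Loading the saved vocab file instead of rebuilding it removes this source of mismatch.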

Why measure latency and test batch vs single inference?
Latency tells us if the model is fast enough for real-world use. Comparing single vs batched inference shows efficiency: batching usually reduces per-sample cost because the fixed per-call overhead is shared across the batch and the tensor operations process all sequences in parallel. These checks ensure the model is not only accurate but also usable in deployment.