# Coding Exercise 3: Local LLM Sentiment Classification

In this exercise, we'll use a small LLM _running on our own machine_ (well, on a Google Colab virtual machine...) to automatically classify customer reviews as positive, negative, or neutral.

This is a practical example of how businesses use AI today — instead of a human reading thousands of reviews, an LLM can do it in seconds.

It follows from **Episode 7**, where we looked at how LLMs work and the wealth of possible use cases. It also illustrates an area where LLMs can perform as well as, if not better than, traditional NLP-based sentiment analysis.

## What we'll do:
1. **Fetch reviews** from a URL using Python's `requests` library (simulating pulling data from an API)
2. **Load a small LLM** called Qwen2.5-0.5B-Instruct onto our machine
3. **Classify each review** by giving the model a carefully worded prompt
4. **Review the results** and discuss what worked and what didn't

**Note: From the Runtime menu above, select "Change runtime type" and then select "T4 GPU" for the hardware accelerator. This will run the model significantly faster than on the CPU runtime.**

## Why a local model?
Services like ChatGPT and Claude run on remote servers — you send your data to them over the internet. A **local model** runs right here on this computer. This matters when:
- You're working with **sensitive data** (medical records, financial info) that can't leave your system
- You want to avoid **API costs** (cloud AI services charge per request)
- You need to work **offline** or with very low latency

The trade-off: local models are typically much smaller and less capable than cloud models.

## Learning Objectives

By the end of this exercise, you should understand:
1. How the `requests` Python library can be used to retrieve data from the web
2. How to download a local model from Hugging Face
3. How to load that model into PyTorch, along with its tokenizer
4. How to write a classifier function we can use to classify each review
5. How to clean and sanitise the model output
6. How to summarise and sanity check results

The aim here is to move from building and training models to actually using one for a practical task.

---
## 1: Install dependencies

We need three Python libraries:
- **`transformers`** — Hugging Face's library for downloading and running AI models
- **`torch`** (PyTorch) — the underlying framework that does the actual computation
- **`requests`** — a simple library for fetching data from URLs (comes pre-installed, but we include it for clarity)

The `-q` flag means "quiet" — it suppresses most of the installation output to keep things tidy.

In [2]:
!pip install -q transformers torch requests

---
## 2: Fetch customer reviews

In a real-world scenario, reviews might come from:
- A company's database or API
- A web scraping pipeline
- A CSV file exported from a survey tool

Here we fetch them from a hosted JSON file using `requests.get()`. The commented-out Option B in the same cell shows how you could define a handful of reviews inline instead, which is handy for quick testing.

Each review is a Python **dictionary** with an `id` and `text` field — a common format for structured data.
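
Data pulled from an API is not always as tidy as we hope, so it can be worth checking its shape before handing it to a model. The following is a minimal sketch of such a check, assuming each review should be a dictionary with an `id` and a non-empty `text` string. The `validate_reviews` helper and the sample data are hypothetical and not part of the exercise code.

```python
def validate_reviews(data):
    """Return only the well-formed reviews: dicts with an 'id' and a non-empty 'text' string."""
    if not isinstance(data, list):
        raise ValueError("Expected a JSON array of review objects")

    valid = []
    for item in data:
        if (
            isinstance(item, dict)
            and "id" in item
            and isinstance(item.get("text"), str)
            and item["text"].strip()
        ):
            valid.append(item)
    return valid


# Hypothetical sample data, including three malformed entries
sample = [
    {"id": 1, "text": "Great pasta."},
    {"id": 2, "text": "   "},          # blank text: dropped
    {"text": "No id on this one."},    # missing id: dropped
    "not a dictionary",                # wrong type: dropped
]
clean = validate_reviews(sample)
print(f"Kept {len(clean)} of {len(sample)} reviews")
```

In the notebook you could call `validate_reviews(reviews)` straight after the fetch and compare the two lengths to see whether anything was dropped.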

In [3]:
import requests
import json

# ------------------------------------------------------------------
# OPTION A: Fetch from a real URL (active by default)
# ------------------------------------------------------------------
url = "https://raw.githubusercontent.com/uoy-research/tfds-course-ai/refs/heads/main/Data/CodingExercises/CE3/02_ce3_restaurant_reviews.json"
response = requests.get(url, timeout=30)  # Send an HTTP GET request to the URL
response.raise_for_status()               # Stop with a clear error if the download failed
reviews = response.json()                 # Parse the JSON response into a Python list

# ------------------------------------------------------------------
# OPTION B: Define reviews inline for testing (uncomment to use instead)
# ------------------------------------------------------------------
# reviews = [
#     {"id": 1, "text": "Absolutely love this product! Best purchase I've made all year."},
#     {"id": 2, "text": "Terrible quality. Broke after two days. Complete waste of money."},
#     {"id": 3, "text": "It's okay. Does what it's supposed to do, nothing special."},
#     {"id": 4, "text": "Shipping was fast and the item exceeded my expectations!"},
#     {"id": 5, "text": "Not worth the price. The description was misleading."},
#     {"id": 6, "text": "Decent product for the price point. Would consider buying again."},
#     {"id": 7, "text": "DO NOT BUY. Customer service was rude and unhelpful."},
#     {"id": 8, "text": "Works as advertised. Arrived on time."},
#     {"id": 9, "text": "This changed my life! I've already recommended it to all my friends."},
#     {"id": 10, "text": "Meh. It's fine I guess. Not great, not terrible."}
# ]

# Print what we've got
print(f"Loaded {len(reviews)} reviews\n")
for r in reviews:
    print(f"  [{r['id']:>2}] {r['text']}")

Loaded 250 reviews

  [ 1] I've tried to recreate their carbonara at home and I simply cannot. There's some kind of magic happening in that kitchen.
  [ 2] They make their own pasta in-house and you can watch them doing it. The tagliatelle was sublime.
  [ 3] The salad was fresh enough and reasonably sized. Fairly priced too. Just not very exciting.
  [ 4] I asked for no onions due to an allergy and my dish arrived loaded with onions. Dangerous.
  [ 5] The vegan options consisted of a side salad. In 2024. Really.
  [ 6] The panini was so tightly pressed it was basically a cracker with some cheese inside.
  [ 7] The chef came out to check on our table personally. That kind of attention to detail is rare.
  [ 8] Decent enough for a quick lunch. Standard fare at standard prices.
  [ 9] We come here occasionally when everywhere else is full. It does what it needs to do.
  [10] Sticky tables, dirty cutlery, and a general feeling that hygiene isn't a priority here.
  [11] The staff went abov

---
## 3: Load the language model

We're using **Qwen2.5-0.5B-Instruct** — a model with 0.5 billion parameters created by Alibaba's Qwen team.

To put that in context:
- **GPT-4** is estimated to have over **1 trillion** parameters
- **Qwen2.5-0.5B** has **500 million** — roughly 2,000x smaller

Despite being tiny by modern standards, it's been specifically **instruction-tuned**, meaning it was trained to follow instructions (like "classify this review") rather than just predict the next word.

### What's happening in the code below:

1. **`AutoTokenizer`** loads the model's **tokenizer** — this converts human-readable text into numbers (tokens) that the model can process. Every model has its own specific tokenizer.

2. **`AutoModelForCausalLM`** loads the actual model weights — this is the "brain" of the model, containing all 500 million learned parameters.

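3. **`torch_dtype=torch.float32`** tells PyTorch what numerical precision to use. `float32` is the standard: each parameter is stored as a 32-bit floating point number. Recent versions of `transformers` warn that `torch_dtype` is deprecated in favour of `dtype`; the code still runs, as the output below shows.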

4. **Device selection** — we check if a GPU is available (faster) and fall back to CPU if not. For a model this small, CPU works fine.
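
To make the precision point concrete, here is a rough back-of-the-envelope sketch of how much memory the weights alone need at different precisions. It uses the parameter count that the loading cell reports (494,032,768) and ignores activations and other runtime overhead, so treat the figures as lower bounds. The `weight_size_gb` helper is hypothetical.

```python
PARAMS = 494_032_768  # parameter count reported by the loading cell in this notebook

# Bytes needed to store one parameter at each precision
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_size_gb(n_params, dtype):
    """Approximate size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_size_gb(PARAMS, dtype):.2f} GB")
```

At 32-bit precision the weights take just under 2 GB, and half that at 16-bit. The roughly 1 GB download is consistent with the published weights being stored at 16-bit precision, although that is an inference from the file size rather than something this notebook checks.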

In [4]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "Qwen/Qwen2.5-0.5B-Instruct"

# --- Load the tokenizer ---
# The tokenizer converts text to tokens (numbers) and back again.
# For example: "I love it" might become [40, 3021, 433]
print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# --- Load the model ---
# This downloads ~1GB of model weights the first time.
# Subsequent runs use the cached version.
print("Loading model (this may take a minute the first time)...")
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float32  # Standard precision; works on all hardware
)

# --- Choose the best available device ---
if torch.cuda.is_available():
    device = "cuda"   # NVIDIA GPU — much faster
    print("Using GPU")
else:
    device = "cpu"     # Standard processor — slower but always available
    print("Using CPU (plenty fast for a 0.5B model)")

# Move the model to our chosen device
model = model.to(device)

# Print a summary
total_params = sum(p.numel() for p in model.parameters())
print(f"\nModel loaded on {device}!")
print(f"Total parameters: {total_params:,} ({total_params / 1e9:.2f} billion)")

Loading tokenizer...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


`torch_dtype` is deprecated! Use `dtype` instead!


Loading model (this may take a minute the first time)...


Using GPU

Model loaded on cuda!
Total parameters: 494,032,768 (0.49 billion)


---
## 4: Build the classification function

This is the core of the exercise. We need to:
1. **Construct a prompt** that tells the model what we want
2. **Format it** using the model's expected chat template
3. **Tokenize** the prompt (convert text → numbers)
4. **Generate** a response from the model
5. **Decode** the response (convert numbers → text)
6. **Parse** the result to extract a clean label

### About the prompt design:

We use two messages — a **system message** and a **user message**:

- The **system message** sets the model's role and behaviour: *"You are a sentiment classifier. Reply with exactly one word."* This is like giving an employee their job description before handing them a task.

- The **user message** contains the actual review to classify.

Separating instructions from data is a prompting best practice — it helps the model understand what's an instruction vs. what's the content to process.
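
To see what "formatting the messages" amounts to, the sketch below renders a message list into a single ChatML-style string by hand. It is only an approximation for illustration: the `render_chatml` helper is hypothetical, and the exact layout (including the `<|im_end|>` marker) is an assumption based on the ChatML convention. In the real code the tokenizer's own `apply_chat_template` does this job and is the authoritative version.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Hand-written approximation of a ChatML-style chat template (illustrative only)."""
    parts = []
    for message in messages:
        # Each turn is wrapped in start/end markers and labelled with its role
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model knows it should answer next
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a sentiment classifier. Reply with exactly one word."},
    {"role": "user", "content": 'Classify this review:\n"The tagliatelle was sublime."'},
]
prompt = render_chatml(messages)
print(prompt)
```

The role markers are what let the model tell the instruction (system turn) apart from the data (user turn), which is why the split matters even for a one-word task.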

### About `temperature`:

Temperature controls randomness in the model's output:
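- **0.0** = always pick the most likely next token (deterministic). In `transformers` you get this behaviour by setting `do_sample=False`; a temperature of exactly 0 is rejected when sampling is enabled
- **1.0** = sample proportionally from all possibilities (creative/varied)
- **0.1** (our setting) = almost deterministic, with just a tiny bit of variation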

For classification, we want consistency, so we keep temperature very low.
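
The effect of temperature can be illustrated with a few lines of arithmetic. The sketch below applies a softmax to some made-up scores (logits) for three candidate tokens, dividing by the temperature first. The numbers are hypothetical and chosen only to show the shape of the effect; a real model scores its entire vocabulary.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, dividing by the temperature first."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max before exponentiating, for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next tokens
labels = ["POSITIVE", "NEUTRAL", "NEGATIVE"]
logits = [2.0, 1.5, 0.5]

for t in (1.0, 0.1):
    probs = softmax_with_temperature(logits, t)
    summary = ", ".join(f"{label}={p:.3f}" for label, p in zip(labels, probs))
    print(f"temperature={t}: {summary}")
```

With these scores the top candidate's share rises from roughly 0.55 at temperature 1.0 to above 0.99 at temperature 0.1, which is why a low temperature gives nearly the same answer every time.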

In [5]:
def classify_review(review_text, model, tokenizer, device):
    """
    Classify a single review as POSITIVE, NEGATIVE, or NEUTRAL.

    Args:
        review_text: The review string to classify
        model: The loaded language model
        tokenizer: The model's tokenizer
        device: 'cuda' for GPU or 'cpu'

    Returns:
        A tuple of (normalised_label, raw_model_output)
    """

    # --- 1. Construct the prompt as a conversation ---
    messages = [
        {
            "role": "system",
            "content": (
                "You are a sentiment classifier. "
                "You reply with exactly one word: POSITIVE, NEGATIVE, or NEUTRAL. "
                "No explanation."
            )
        },
        {
            "role": "user",
            "content": f"Classify this review:\n\"{review_text}\""
        }
    ]

    # --- 2. Apply the chat template ---
    # Each model expects its prompt formatted a specific way.
    # The chat template converts our messages list into the exact format
    # the model was trained on (with special tokens like <|im_start|>).
    # add_generation_prompt=True adds the marker that tells the model
    # "now it's your turn to respond".
    input_text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    # --- 3. Tokenize ---
    # Convert the formatted text into token IDs (numbers) and move
    # them to the same device (CPU/GPU) as the model.
    # return_tensors="pt" means "return PyTorch tensors"
    inputs = tokenizer(input_text, return_tensors="pt").to(device)

    # --- 4. Generate a response ---
    # torch.no_grad() tells PyTorch we're only doing inference (not training),
    # so it doesn't need to track gradients — this saves memory and speeds things up.
    with torch.no_grad():
        outputs = model.generate(
            **inputs,               # The tokenized prompt
            max_new_tokens=5,       # Only generate up to 5 new tokens (we just need one word)
            temperature=0.1,        # Very low randomness — we want a consistent answer
            do_sample=True,         # Enable sampling; temperature is ignored unless this is True
            pad_token_id=tokenizer.eos_token_id  # Technical detail: tells model what "padding" looks like
        )

    # --- 5. Decode the response ---
    # The model's output includes our entire input PLUS the new tokens.
    # We slice off just the new tokens using the length of our input.
    new_tokens = outputs[0][inputs['input_ids'].shape[1]:]  # Everything after our prompt
    generated = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

    # --- 6. Parse and normalise the label ---
    # The model might say "POSITIVE" or "Positive" or "positive." etc.
    # We extract the first word and check if it matches our categories.
    first_word = generated.upper().split()[0] if generated else ""

    if "POSITIVE" in first_word or "POS" in first_word:
        return "POSITIVE", generated
    elif "NEGATIVE" in first_word or "NEG" in first_word:
        return "NEGATIVE", generated
    elif "NEUTRAL" in first_word or "NEU" in first_word:
        return "NEUTRAL", generated
    else:
        return "UNKNOWN", generated  # Model didn't give us a clean answer
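
The label-parsing step above uses substring checks, so a first word such as "POSSIBLY" would be read as POSITIVE. As an alternative, here is a sketch of a stricter parser that strips surrounding punctuation and then requires an exact match. This is one possible design rather than the notebook's method, and the `normalise_label` helper is hypothetical.

```python
import string

VALID_LABELS = ("POSITIVE", "NEGATIVE", "NEUTRAL")


def normalise_label(raw_output):
    """Map raw model text to one of VALID_LABELS, or 'UNKNOWN' if there is no exact match."""
    words = raw_output.strip().upper().split()
    if not words:
        return "UNKNOWN"
    # Strip surrounding punctuation such as quotes, full stops, or asterisks
    first_word = words[0].strip(string.punctuation)
    return first_word if first_word in VALID_LABELS else "UNKNOWN"


for raw in ["POSITIVE", "positive.", '"Neutral"', "Possibly negative", ""]:
    print(f"{raw!r:>20} -> {normalise_label(raw)}")
```

The trade-off is that a stricter parser produces more UNKNOWN results, but each of those is an honest signal that the model did not follow the instruction, which is easier to audit than a silent misread.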

---
## 5: Run the classifier on all reviews

Now we loop through each review and classify it. We also time the process to see how fast our small model is.

In [6]:
import time

print("Classifying reviews...\n")
print(f"{'ID':>3} | {'Label':<10} | {'Raw Output':<15} | Review")
print("-" * 80)

results = []
start_time = time.time()

for review in reviews:
    label, raw = classify_review(review["text"], model, tokenizer, device)
    results.append({
        "id": review["id"],
        "text": review["text"],
        "label": label,
        "raw": raw
    })
    print(f"{review['id']:>3} | {label:<10} | {raw:<15} | {review['text'][:45]}")

elapsed = time.time() - start_time
print(f"\nDone! Classified {len(reviews)} reviews in {elapsed:.1f}s")
print(f"Average: {elapsed/len(reviews):.2f}s per review")

Classifying reviews...

 ID | Label      | Raw Output      | Review
--------------------------------------------------------------------------------
  1 | NEGATIVE   | NEGATIVE        | I've tried to recreate their carbonara at hom
  2 | POSITIVE   | POSITIVE        | They make their own pasta in-house and you ca
  3 | NEGATIVE   | NEGATIVE        | The salad was fresh enough and reasonably siz
  4 | NEGATIVE   | NEGATIVE        | I asked for no onions due to an allergy and m
  5 | NEGATIVE   | NEGATIVE        | The vegan options consisted of a side salad. 
  6 | NEGATIVE   | NEGATIVE        | The panini was so tightly pressed it was basi
  7 | POSITIVE   | POSITIVE        | The chef came out to check on our table perso
  8 | NEUTRAL    | NEUTRAL         | Decent enough for a quick lunch. Standard far
  9 | NEUTRAL    | NEUTRAL         | We come here occasionally when everywhere els
 10 | NEGATIVE   | NEGATIVE        | Sticky tables, dirty cutlery, and a general f
 11 | POSITIVE   | PO

---
## 6: View the summary

Let's see how the model did overall. We use Python's `Counter` to tally up how many reviews fell into each category.

In [7]:
from collections import Counter

counts = Counter(r["label"] for r in results)

print("=== Classification Summary ===")
for label in ["POSITIVE", "NEGATIVE", "NEUTRAL", "UNKNOWN"]:
    if counts[label] > 0:
        bar = "█" * counts[label]  # Simple bar chart
        print(f"  {label:<10} {bar} ({counts[label]})")

# Flag any reviews the model couldn't classify cleanly
if counts["UNKNOWN"] > 0:
    print(f"\n⚠️  {counts['UNKNOWN']} reviews couldn't be classified cleanly.")
    print("   This can happen with small models — here's what it actually said:")
    for r in results:
        if r["label"] == "UNKNOWN":
            print(f"     [{r['id']}] Model said: '{r['raw']}'")

=== Classification Summary ===
  POSITIVE   ██████████████████████████████████████████████████████████████████████████████████████████████████ (98)
  NEGATIVE   ████████████████████████████████████████████████████████████████████████████████████████████████████████ (104)
  NEUTRAL    ████████████████████████████████████████████████ (48)
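
Counts alone do not tell us whether the labels are right. One simple sanity check is to label a handful of reviews by hand and compare. The sketch below does this for the first four reviews, using the predictions shown in the output of section 5. The hand labels are hypothetical (one reader's judgement, not part of the dataset), and `accuracy_report` is an illustrative helper.

```python
from collections import Counter


def accuracy_report(results, gold_labels):
    """Compare predicted labels with hand-assigned ones for the ids present in gold_labels."""
    checked = [r for r in results if r["id"] in gold_labels]
    correct = sum(1 for r in checked if r["label"] == gold_labels[r["id"]])
    # Count each (expected, predicted) pair where the two disagree
    confusions = Counter(
        (gold_labels[r["id"]], r["label"])
        for r in checked
        if r["label"] != gold_labels[r["id"]]
    )
    accuracy = correct / len(checked) if checked else 0.0
    return accuracy, confusions


# Predictions copied from the first four rows of the classification output
demo_results = [
    {"id": 1, "label": "NEGATIVE"},
    {"id": 2, "label": "POSITIVE"},
    {"id": 3, "label": "NEGATIVE"},
    {"id": 4, "label": "NEGATIVE"},
]
# Hypothetical hand labels: one reader's judgement, not part of the dataset
gold = {1: "POSITIVE", 2: "POSITIVE", 3: "NEUTRAL", 4: "NEGATIVE"}

accuracy, confusions = accuracy_report(demo_results, gold)
print(f"Accuracy on {len(gold)} hand-labelled reviews: {accuracy:.0%}")
for (expected, predicted), n in confusions.items():
    print(f"  expected {expected}, model said {predicted}: {n}")
```

In the notebook you could pass the full `results` list instead of `demo_results`. Four reviews is far too few to draw conclusions from, so a larger hand-labelled sample would be needed for a meaningful estimate.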


---
## 7: Discussion Points

**How well did it do?**
- Look at the results above. Were there any obvious mistakes?
- Which reviews were hardest for the model? Why?

**The role of the prompt:**
- We gave the model a system message ("You are a sentiment classifier") and a user message (the review). Why split it this way?
- What might happen if we just said "What do you think of this review?" instead?

**Size matters:**
- This model has 0.5 billion parameters. GPT-4 is estimated to have over 1 trillion. What's the trade-off?
- Could you use a model this small in production? When would you need something bigger?

**Local vs. cloud:**
- What are the advantages of running a model locally instead of calling an API like ChatGPT?
- What are the disadvantages?

---
## 8: Challenges to try

1. **Few-shot prompting:** Add examples to the system message, e.g. *'Example: "Great product!" → POSITIVE'*. Does accuracy improve?
2. **Five categories:** Change the prompt to classify as very positive / positive / neutral / negative / very negative. Can the model handle it?
3. **Swap the model:** Try `HuggingFaceTB/SmolLM2-360M-Instruct` (even smaller). How does it compare?
4. **Your own reviews:** Uncomment the inline reviews list (Option B) and replace its contents with reviews that you've written, or reviews that you've found on a real online shop. Do the results surprise you?
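
As a starting point for the few-shot challenge, here is a sketch of a helper that builds the message list with labelled examples appended to the system prompt. The `build_few_shot_messages` function and the example reviews are hypothetical. The wording of the base instruction is copied from `classify_review`.

```python
def build_few_shot_messages(review_text, examples):
    """Build a chat message list whose system prompt includes labelled examples."""
    example_lines = "\n".join(
        f'Example: "{text}" -> {label}' for text, label in examples
    )
    system_prompt = (
        "You are a sentiment classifier. "
        "You reply with exactly one word: POSITIVE, NEGATIVE, or NEUTRAL. "
        "No explanation.\n\n"
        f"{example_lines}"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f'Classify this review:\n"{review_text}"'},
    ]


# Hypothetical examples; swap in ones that resemble your own data
examples = [
    ("Great product!", "POSITIVE"),
    ("Broke after two days.", "NEGATIVE"),
    ("It's fine, nothing special.", "NEUTRAL"),
]
messages = build_few_shot_messages("The tagliatelle was sublime.", examples)
print(messages[0]["content"])
```

To try it, replace the `messages = [...]` block inside `classify_review` with a call to this helper and re-run the classification loop. Whether accuracy improves is the open question of the challenge.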