# Exercise 1 — Starter

Build on the **demo** by exploring: (A) generation length, (B) numerical precision, and (C) KV cache.

## 0) Environment guard

#### Built on top of the Demo
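Avoid NumPy 2.x + TensorFlow wheel conflicts by disabling the optional TF/Flax imports in `transformers`. Run this cell before anything imports `transformers`, because the flags are read at import time.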

In [None]:
import os
os.environ["TRANSFORMERS_NO_TF"] = "1"
os.environ["TRANSFORMERS_NO_FLAX"] = "1"

## 1) Imports & globals

#### Built on top of the Demo
If the preferred model isn't available, the loader is meant to fall back to a tiny model (see `FALLBACK_MODELS` in `load_model_and_tokenizer`) so the notebook stays runnable everywhere.

In [None]:
import time
from typing import Dict, Any, List
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import matplotlib.pyplot as plt

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Device={device}; NumPy={np.__version__}; Torch={torch.__version__}")

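PREFERRED_MODEL = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
# Tried in order if PREFERRED_MODEL cannot be loaded. The id below is an ASSUMED
# stand-in for "a tiny model"; replace it with the demo's fallback if that differs.
FALLBACK_MODELS = ["sshleifer/tiny-gpt2"]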

DEFAULT_PROMPT = (
        "Artificial intelligence (AI) is intelligence demonstrated by machines, in contrast to the natural "
        "intelligence displayed by humans and animals. Leading AI textbooks define the field as the study "
        "of intelligent agents: any system that perceives its environment and takes actions that maximize "
        "its chance of achieving its goals. Colloquially, the term \"artificial intelligence\" is often "
        "used to describe machines that mimic cognitive functions that humans associate with the human mind, "
        "such as learning and problem-solving."
)


## 2) Helpers

#### Built on top of the Demo

In [None]:
def _resolve_dtype(dtype_str: str, device: torch.device):
    """
      1) Normalize/validate `dtype_str` (accept 'float16' or 'float32'; case-insensitive is fine).
         - If unsupported, raise ValueError with a clear message.

      2) CPU guard:
         - If device.type == 'cpu' and requested dtype is float16, fall back to torch.float32
           (most CPUs don’t support fp16 efficiently).
         - Return both: (torch.float32, "float32 (forced on CPU)").

      3) Otherwise map the string to the torch dtype:
         - 'float16' -> torch.float16
         - 'float32' -> torch.float32

      4) Return a tuple: (resolved_torch_dtype, label_string), where label_string is
         a human-readable descriptor like 'float16' or 'float32 (forced on CPU)'.
    """
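    # Step 1: normalize and validate the requested precision.
    key = dtype_str.strip().lower()
    supported = {'float16': torch.float16, 'float32': torch.float32}
    if key not in supported:
        raise ValueError(f"Unsupported dtype {dtype_str!r}; expected 'float16' or 'float32'.")
    # Step 2: fp16 is poorly supported on most CPUs, so coerce it to fp32 there.
    if device.type == 'cpu' and key == 'float16':
        return torch.float32, 'float32 (forced on CPU)'
    # Steps 3-4: map the string to the torch dtype and return it with its label.
    return supported[key], key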

In [None]:
def load_model_and_tokenizer(model_name: str, dtype_str: str = "float16"):
    """
      1) Resolve the requested precision:
         - Call `_resolve_dtype(dtype_str, device)` to get `(torch_dtype, dtype_label)`.
         - Ensure CPU requests for 'float16' are coerced to float32.

      2) Build a candidate list of model ids:
         - Start with `[model_name]`.
         - (Optional) Append `FALLBACK_MODELS` so the function still works if the preferred model
           isn’t available in the environment.

      3) Try candidates in order:
         for name in candidates:
           try:
             - tokenizer = AutoTokenizer.from_pretrained(name, use_fast=True)
             - model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch_dtype)
             - Move model to `device` and call `model.eval()`
             - If `tokenizer.pad_token_id` is None and `tokenizer.eos_token_id` exists,
               set `tokenizer.pad_token = tokenizer.eos_token` to avoid padding warnings.
             - RETURN `(tokenizer, model, name, dtype_label)`
           except Exception as e:
             - Collect the error and continue to the next candidate.

      4) If all candidates fail:
         - Raise `RuntimeError` with a concise message and include the first/last error
           to aid debugging.

    Notes:
      - You may pass `low_cpu_mem_usage=True` or `device_map="auto"` (if appropriate for your runtime).
      - Keep `trust_remote_code=False` unless you explicitly need custom modeling code.
    """
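    dtype, label = _resolve_dtype(dtype_str, device)
    # Preferred model first, then any module-level FALLBACK_MODELS (none if undefined).
    fallbacks = globals().get('FALLBACK_MODELS', [])
    candidates = [model_name] + [m for m in fallbacks if m != model_name]
    errors = []
    for name in candidates:
        try:
            tok = AutoTokenizer.from_pretrained(name, use_fast=True)
            mdl = AutoModelForCausalLM.from_pretrained(name, torch_dtype=dtype)
            mdl.to(device)
            mdl.eval()
            # Reuse EOS as the pad token to avoid padding warnings during generation.
            if tok.pad_token_id is None and tok.eos_token_id is not None:
                tok.pad_token = tok.eos_token
            return tok, mdl, name, label
        except Exception as e:
            # Keep the error for the final message and try the next candidate.
            errors.append(f"{name}: {type(e).__name__}: {e}")
    raise RuntimeError(
        f"Could not load any model from {candidates}. "
        f"First error: {errors[0]} | Last error: {errors[-1]}"
    )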

In [None]:
def time_generate(
    model,
    tokenizer,
    prompt: str,
    max_new_tokens: int = 50,
    use_cache: bool = True,
    num_warmup: int = 1,
    num_runs: int = 3,
) -> Dict[str, Any]:
    """
      1) Tokenize the prompt:
         - Create `input_ids = tokenizer(prompt, return_tensors='pt').input_ids.to(model.device)`.

      2) Warm up (no timing):
         - Under `torch.no_grad()`, run `model.generate(input_ids, max_new_tokens=8, use_cache=use_cache)`
           `num_warmup` times to stabilize kernels/clocks.
         - If CUDA, you may `torch.cuda.synchronize()` after warmup.

      3) Timed runs:
         - Initialize `latencies = []` and `tokens_out = []`.
         - Under `torch.no_grad()`, loop `num_runs` times:
             * If CUDA, `torch.cuda.synchronize()` before starting the timer.
             * Record `t0 = time.perf_counter()`.
             * Call `out = model.generate(input_ids, max_new_tokens=max_new_tokens, use_cache=use_cache)`.
             * If CUDA, `torch.cuda.synchronize()` to ensure all work has finished.
             * Record `t1 = time.perf_counter()` and append `(t1 - t0)` to `latencies`.
             * Compute generated token count: `gen = out.shape[-1] - input_ids.shape[-1]`
               and append to `tokens_out`.

      4) Aggregate metrics:
         - `L = average(latencies)` (seconds)
         - `TG = average(tokens_out)` (tokens)
         - `TPS = TG / L` (tokens/sec), guard divide-by-zero
         - `LPT = (L / max(TG, 1e-9)) * 1000.0` (ms/token)

      5) Return a dict with keys:
         - 'total_latency_s', 'tokens_generated', 'tokens_per_sec', 'avg_latency_per_token_ms'

      Notes:
        - Ensure `model.eval()` was called beforehand.
        - Do all forwards under `torch.no_grad()` to avoid autograd overhead.
        - We measure **end-to-end** generation time for the given `max_new_tokens`.
    """
    input_ids = tokenizer(prompt, return_tensors='pt').input_ids.to(model.device)
    # CUDA kernels run asynchronously, so synchronize around the timers on GPU.
    use_cuda = input_ids.is_cuda
    with torch.no_grad():
        for _ in range(num_warmup):
            _ = model.generate(input_ids, max_new_tokens=8, use_cache=use_cache)
        if use_cuda:
            torch.cuda.synchronize()
    latencies, tokens_out = [], []
    with torch.no_grad():
        for _ in range(num_runs):
            if use_cuda:
                torch.cuda.synchronize()
            t0 = time.perf_counter()
            out = model.generate(input_ids, max_new_tokens=max_new_tokens, use_cache=use_cache)
            if use_cuda:
                torch.cuda.synchronize()  # wait for all queued GPU work to finish
            t1 = time.perf_counter()
            latencies.append(t1 - t0)
            # Count only newly generated tokens; generation may stop early at EOS.
            tokens_out.append(out.shape[-1] - input_ids.shape[-1])
    L = sum(latencies) / max(len(latencies), 1)
    TG = sum(tokens_out) / max(len(tokens_out), 1)
    TPS = (TG / L) if L > 0 else float('nan')
    LPT = (L / max(TG, 1e-9)) * 1000.0
    return {'total_latency_s': L, 'tokens_generated': TG, 'tokens_per_sec': TPS, 'avg_latency_per_token_ms': LPT}
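
The aggregation in step 4 is plain arithmetic, so it can be checked without loading a model. Below is a minimal sketch using made-up latencies and token counts; the helper name `aggregate_metrics` is hypothetical and not part of the notebook.

```python
from typing import Dict, List


def aggregate_metrics(latencies: List[float], tokens_out: List[int]) -> Dict[str, float]:
    """Average per-run latencies and token counts, then derive throughput metrics."""
    avg_latency = sum(latencies) / max(len(latencies), 1)    # seconds
    avg_tokens = sum(tokens_out) / max(len(tokens_out), 1)   # tokens
    # Guard against division by zero in both derived metrics.
    tps = (avg_tokens / avg_latency) if avg_latency > 0 else float("nan")
    ms_per_token = (avg_latency / max(avg_tokens, 1e-9)) * 1000.0
    return {
        "total_latency_s": avg_latency,
        "tokens_generated": avg_tokens,
        "tokens_per_sec": tps,
        "avg_latency_per_token_ms": ms_per_token,
    }


# Illustrative numbers only: three runs of 50 tokens taking 2.0 s, 2.5 s and 1.5 s.
metrics = aggregate_metrics([2.0, 2.5, 1.5], [50, 50, 50])
print(metrics)
```

With these inputs the mean latency is 2.0 s for 50 tokens, which works out to about 25 tokens/s and about 40 ms per token.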


In [None]:
def plot_xy(xs: List[float], ys: List[float], xlabel: str, ylabel: str, title: str):
    plt.figure()
    plt.plot(xs, ys, marker='o')
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.grid(True)
    plt.show()

## 3) Baseline
Run one fixed setup as baseline: `max_new_tokens=50`, `float16`, `use_cache=True`.

In [None]:
tokenizer, model, MODEL_USED, DTYPE_USED = load_model_and_tokenizer(PREFERRED_MODEL, dtype_str='float16')
print(f"Loaded: {MODEL_USED} | dtype={DTYPE_USED}")
baseline = time_generate(model, tokenizer, DEFAULT_PROMPT, max_new_tokens=50, use_cache=True)
assert all(k in baseline for k in ['total_latency_s','tokens_generated','tokens_per_sec','avg_latency_per_token_ms'])
baseline

## 4) Exercise A — Generation Length vs Cost
Vary `max_new_tokens` and plot latency/throughput.
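
When reading the plots, a simple mental model may help: total latency is roughly a fixed prompt-processing cost plus a per-token decoding cost. The sketch below uses assumed, made-up constants; `T_PREFILL_S`, `T_DECODE_S`, `toy_latency`, and `toy_throughput` are hypothetical names, not measurements or notebook helpers.

```python
# Assumed constants for illustration only; real values depend on hardware and model.
T_PREFILL_S = 0.20   # one-off cost of processing the prompt
T_DECODE_S = 0.03    # cost of each generated token (treated as constant here)


def toy_latency(n_tokens: int) -> float:
    """Linear cost model: fixed prefill plus a constant cost per new token."""
    return T_PREFILL_S + n_tokens * T_DECODE_S


def toy_throughput(n_tokens: int) -> float:
    """Tokens per second implied by the linear cost model."""
    return n_tokens / toy_latency(n_tokens)


for n in [10, 50, 100, 250]:
    print(f"n={n:>3}  latency={toy_latency(n):.2f}s  throughput={toy_throughput(n):.1f} tok/s")
```

Under this model latency grows linearly with length, while throughput rises and levels off as the fixed prefill cost is amortized over more tokens. Measured curves can deviate, for example because per-token cost grows slowly with context length.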

In [None]:
def run_for_varied_lengths():
    """
    TODO:
      1) Define a list of generation lengths, e.g. lengths = [10, 50, 100, 250].

      2) For each L in lengths:
         - Call time_generate(model, tokenizer, DEFAULT_PROMPT, max_new_tokens=L, use_cache=True)
         - Collect each result dict in results_len.

      3) From results_len compute:
         - latencies = [r['total_latency_s'] for r in results_len]
         - throughputs = [r['tokens_per_sec'] for r in results_len]

      4) Plot two charts using plot_xy (one figure per chart):
         - Latency vs Generation Length
         - Throughput vs Generation Length
         (Use simple line+marker plots with grid; no custom colors.)

      5) (Optional) return a dict for downstream use:
         return {'lengths': lengths, 'latencies': latencies, 'throughputs': throughputs}
    """
    raise NotImplementedError("Implement run_for_varied_lengths() per the TODO above.")


run_for_varied_lengths()

## 5) Exercise B — Numerical Precision (float16 vs float32)
Reload for each precision with identical settings and compare.
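
One reason precision matters is weight memory: float16 stores each parameter in 2 bytes and float32 in 4. Here is a rough sketch, assuming about 1.1 billion parameters (read off the model name, not measured) and ignoring activations and the KV cache. `weight_memory_gib` is a hypothetical helper.

```python
# Storage size of one parameter for the two precisions compared in this exercise.
BYTES_PER_PARAM = {"float16": 2, "float32": 4}


def weight_memory_gib(num_params: float, dtype_str: str) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype_str] / 2**30


ASSUMED_PARAMS = 1.1e9  # assumption based on the "1.1B" in the model id

for p in ("float16", "float32"):
    print(f"{p}: ~{weight_memory_gib(ASSUMED_PARAMS, p):.2f} GiB of weights")
```

Halving the bytes per parameter halves the data the hardware has to move for every generated token. This is one plausible reason float16 tends to be faster on GPUs, though the actual gap is what the exercise measures.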

In [None]:
def run_varied_precision():
    """
    TODO:
      1) Define the list of precisions to test, e.g. ['float16', 'float32'].

      2) For each precision p in the list:
         - Load model + tokenizer using load_model_and_tokenizer(PREFERRED_MODEL, dtype_str=p).
         - Run time_generate with:
             * DEFAULT_PROMPT
             * max_new_tokens=50
             * use_cache=True
         - Store the result dictionary in compare_precision[p].

      3) Return compare_precision, which maps precision → generation performance metrics.

      4) (Optional) Plot or print results for easier comparison.
         Example metrics: latency, tokens/sec, latency per token.
    """
    raise NotImplementedError("Implement run_varied_precision() per the TODO above.")



compare_precision = run_varied_precision()

## 6) Exercise C — KV Cache On vs Off
Use the same model and toggle `use_cache`.
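
To build intuition for the gap you should see, the toy count below tallies how many token positions are fed through the model over one generation. It is a simplified sketch, not how `generate` is implemented, and `positions_processed` is a hypothetical helper.

```python
def positions_processed(prompt_len: int, new_tokens: int, use_cache: bool) -> int:
    """Count token positions fed through the model over a whole generation."""
    if new_tokens <= 0:
        return 0
    if use_cache:
        # One pass over the prompt yields the first new token; each later step
        # feeds only the most recent token because earlier keys/values are cached.
        return prompt_len + (new_tokens - 1)
    # Without a cache, step i re-feeds the prompt plus the i tokens generated so far.
    return sum(prompt_len + i for i in range(new_tokens))


with_cache = positions_processed(prompt_len=100, new_tokens=50, use_cache=True)
without_cache = positions_processed(prompt_len=100, new_tokens=50, use_cache=False)
print(with_cache, without_cache)
```

The count grows linearly with the cache and quadratically without it. Real latency also depends on attention cost, memory bandwidth, and kernel overheads, so the measured speedup will not match this ratio exactly.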

In [None]:
kv_results = {}
kv_results['use_cache=True'] = time_generate(model, tokenizer, DEFAULT_PROMPT, max_new_tokens=50, use_cache=True)
kv_results['use_cache=False'] = time_generate(model, tokenizer, DEFAULT_PROMPT, max_new_tokens=50, use_cache=False)
kv_results

## 7) (Optional) Compact summary table
Summarize the three ideas in a small table.

In [None]:
import pandas as pd
rows = []
# DTYPE_USED records the effective precision: a float16 request is coerced to float32 on CPU.
rows.append({'label':'baseline (fp16 requested, cache on)','dtype':DTYPE_USED,'use_cache':True,'max_new_tokens':50, **baseline})
rows.append({'label':'fp32, cache on','dtype':'float32','use_cache':True,'max_new_tokens':50, **compare_precision['float32']})
rows.append({'label':'fp16 requested, cache off','dtype':DTYPE_USED,'use_cache':False,'max_new_tokens':50, **kv_results['use_cache=False']})
df = pd.DataFrame(rows)
df