## Installation

In [1]:
%%capture
!pip install vllm==0.7.1 evaluate==0.4.3 rouge_score==0.1.2 bitsandbytes==0.45.1

## Global Variables

In [2]:
dataset_name = input("Enter the name of the dataset generated in Lesson 3. Press Enter to use our cached dataset.") or "pauliusztin/second_brain_course_summarization_task"
print(f"{dataset_name=}")
model_name = input("Enter the name of the LLM fine-tuned in Lesson 4. Press Enter to use our fine-tuned LLM.") or "pauliusztin/Meta-Llama-3.1-8B-Instruct-Second-Brain-Summarization"
print(f"{model_name=}")

Enter the name of the dataset generated in Lesson 3. Press Enter to use our cached dataset.
dataset_name='pauliusztin/second_brain_course_summarization_task'
Enter the name of the LLM fine-tuned in Lesson 4. Press Enter to use our fine-tuned LLM.
model_name='pauliusztin/Meta-Llama-3.1-8B-Instruct-Second-Brain-Summarization'


In [21]:
import torch


def get_gpu_info() -> str | None:
    """Gets GPU device name if available.

    Returns:
        str | None: Name of the GPU device if available, None if no GPU is found.
    """
    if not torch.cuda.is_available():
        return None

    gpu_name = torch.cuda.get_device_properties(0).name

    return gpu_name


active_gpu_name = get_gpu_info()

print("GPU type:")
print(active_gpu_name)

GPU type:
Tesla T4


Depending on the GPU type, we cap the number of evaluation samples so that generating the answers needed for evaluation does not take too long.

In [22]:
if active_gpu_name and "T4" in active_gpu_name:
    max_evaluation_samples = 8
elif active_gpu_name and ("A100" in active_gpu_name or "L4" in active_gpu_name):
    max_evaluation_samples = 70
elif active_gpu_name:
    max_evaluation_samples = 8
else:
    raise ValueError("No Nvidia GPU found.")

print("--- Parameters ---")
print(f"{max_evaluation_samples=}")

--- Parameters ---
max_evaluation_samples=10


## Load Fine-tuned LLM

In [3]:
from vllm import LLM, SamplingParams

llm = LLM(model=model_name, max_model_len=4096, dtype="float16", quantization="bitsandbytes", load_format="bitsandbytes")

INFO 02-04 17:11:58 __init__.py:183] Automatically detected platform cuda.


INFO 02-04 17:12:14 config.py:526] This model supports multiple tasks: {'score', 'classify', 'embed', 'generate', 'reward'}. Defaulting to 'generate'.
INFO 02-04 17:12:15 llm_engine.py:232] Initializing a V0 LLM engine (v0.7.1) with config: model='pauliusztin/Meta-Llama-3.1-8B-Instruct-Second-Brain-Summarization', speculative_config=None, tokenizer='pauliusztin/Meta-Llama-3.1-8B-Instruct-Second-Brain-Summarization', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.BITSANDBYTES, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, 

Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]


INFO 02-04 17:12:25 model_runner.py:1116] Loading model weights took 5.3422 GB
INFO 02-04 17:12:48 worker.py:266] Memory profiling takes 22.20 seconds
INFO 02-04 17:12:48 worker.py:266] the current vLLM instance can use total_gpu_memory (14.74GiB) x gpu_memory_utilization (0.90) = 13.27GiB
INFO 02-04 17:12:48 worker.py:266] model weights take 5.34GiB; non_torch_memory takes 0.05GiB; PyTorch activation peak memory takes 1.20GiB; the rest of the memory reserved for KV Cache is 6.68GiB.
INFO 02-04 17:12:48 executor_base.py:108] # CUDA blocks: 3418, # CPU blocks: 2048
INFO 02-04 17:12:48 executor_base.py:113] Maximum concurrency for 4096 tokens per request: 13.35x
INFO 02-04 17:12:50 model_runner.py:1435] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_uti

Capturing CUDA graph shapes: 100%|██████████| 35/35 [01:38<00:00,  2.82s/it]

INFO 02-04 17:14:29 model_runner.py:1563] Graph capturing finished in 99 secs, took 0.71 GiB
INFO 02-04 17:14:29 llm_engine.py:429] init engine (profile, create kv cache, warmup model) took 123.83 seconds
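The memory figures in the log above can be checked by hand, which helps when deciding whether a different `max_model_len` or GPU would still fit. The sketch below copies the numbers from the log; the block size of 16 tokens is an assumption (it is not printed in the log, but it is consistent with the reported 13.35x concurrency).

```python
# Figures copied from the vLLM log lines above (GiB).
total_gpu_memory = 14.74
gpu_memory_utilization = 0.90
model_weights = 5.34
non_torch_memory = 0.05
activation_peak = 1.20

# Memory vLLM is allowed to use, and what remains for the KV cache.
usable_memory = total_gpu_memory * gpu_memory_utilization
kv_cache_memory = usable_memory - model_weights - non_torch_memory - activation_peak

# Assumption: each KV-cache block holds 16 tokens (not shown in the log above).
block_size = 16
num_gpu_blocks = 3418  # "# CUDA blocks" from the log
max_model_len = 4096

# Total tokens the KV cache can hold, and how many full-length requests fit at once.
kv_cache_tokens = num_gpu_blocks * block_size
max_concurrency = kv_cache_tokens / max_model_len

print(f"usable memory:   {usable_memory:.2f} GiB")
print(f"KV cache memory: {kv_cache_memory:.2f} GiB")
print(f"KV cache tokens: {kv_cache_tokens}")
print(f"max concurrency: {max_concurrency:.2f}x")
```

Under these assumptions the script reproduces the 13.27 GiB, 6.68 GiB, and 13.35x values reported by vLLM.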





## Prepare Input Samples

In [23]:
from datasets import load_dataset


alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
You are a helpful assistant specialized in summarizing documents. Generate a concise TL;DR summary in markdown format having a maximum of 512 characters of the key findings from the provided documents, highlighting the most significant insights

### Input:
{}

### Response:
{}"""

def format_sample(sample: dict) -> str:
    """Fill the Alpaca template with the document, leaving the response empty for generation."""
    return alpaca_prompt.format(sample["instruction"], "")

In [33]:
dataset = load_dataset(dataset_name, split="test")
dataset = dataset.select(range(max_evaluation_samples))

In [34]:
len(dataset)

10

In [35]:
dataset[0]["instruction"][:1000]

'# Notes\n\n\n\n<child_page>\n# Design Patterns\n\n# Training code\n\nThe most natural way of splitting the training code:\n- Dataset\n- DatasetLoader\n- Model\n- ModelFactory\n- Trainer (takes in the dataset and model)\n- Evaluator\n\n# Serving code\n\n[Infrastructure]Model (takes in the trained model)\n\t- register\n\t- deploy\n</child_page>\n\n\n---\n\n# Resources [Community]\n\n# Resources [Science]\n\n# Tools'

In [36]:
dataset[0]["answer"][:1000]

'```markdown\n# TL;DR Summary\n\n## Design Patterns\n- **Training Code Structure**: Key components include Dataset, DatasetLoader, Model, ModelFactory, Trainer, and Evaluator.\n- **Serving Code**: Infrastructure for Model registration and deployment.\n\n## Tags\n- Generative AI\n- LLMs\n```'

In [37]:
dataset = dataset.map(lambda sample: {"prompt": format_sample(sample)})

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

In [38]:
dataset[0]

{'instruction': '# Notes\n\n\n\n<child_page>\n# Design Patterns\n\n# Training code\n\nThe most natural way of splitting the training code:\n- Dataset\n- DatasetLoader\n- Model\n- ModelFactory\n- Trainer (takes in the dataset and model)\n- Evaluator\n\n# Serving code\n\n[Infrastructure]Model (takes in the trained model)\n\t- register\n\t- deploy\n</child_page>\n\n\n---\n\n# Resources [Community]\n\n# Resources [Science]\n\n# Tools',
 'answer': '```markdown\n# TL;DR Summary\n\n## Design Patterns\n- **Training Code Structure**: Key components include Dataset, DatasetLoader, Model, ModelFactory, Trainer, and Evaluator.\n- **Serving Code**: Infrastructure for Model registration and deployment.\n\n## Tags\n- Generative AI\n- LLMs\n```',
 'prompt': 'Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nYou are a helpful assistant specialized in summarizing documents. Ge

In [39]:
dataset["prompt"][0]

'Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nYou are a helpful assistant specialized in summarizing documents. Generate a concise TL;DR summary in markdown format having a maximum of 512 characters of the key findings from the provided documents, highlighting the most significant insights\n\n### Input:\n# Notes\n\n\n\n<child_page>\n# Design Patterns\n\n# Training code\n\nThe most natural way of splitting the training code:\n- Dataset\n- DatasetLoader\n- Model\n- ModelFactory\n- Trainer (takes in the dataset and model)\n- Evaluator\n\n# Serving code\n\n[Infrastructure]Model (takes in the trained model)\n\t- register\n\t- deploy\n</child_page>\n\n\n---\n\n# Resources [Community]\n\n# Resources [Science]\n\n# Tools\n\n### Response:\n'

## Generate Answers

In [None]:
from vllm import SamplingParams

# temperature=0.0 selects greedy decoding, so top_p and min_p have no effect on the output.
sampling_params = SamplingParams(temperature=0.0, top_p=0.95, min_p=0.05, max_tokens=4096)
predictions = llm.generate(dataset["prompt"], sampling_params)

Processed prompts:   0%|          | 0/10 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]



Processed prompts:  80%|████████  | 8/10 [01:17<00:20, 10.19s/it, est. speed input: 1834.90 toks/s, output: 1.18 toks/s]

In [None]:
predictions[0].outputs[0].text

In [None]:
answers = [prediction.outputs[0].text for prediction in predictions]
answers[0]
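Since the prompt asks for summaries of at most 512 characters, it can be useful to check how many generated answers actually respect that limit before computing metrics. The helper below is a hypothetical sketch (`length_report` and `MAX_SUMMARY_CHARS` are not part of the notebook) and runs on toy strings; with the notebook state available, you would pass `answers` instead.

```python
MAX_SUMMARY_CHARS = 512  # limit stated in the instruction part of the prompt


def length_report(texts: list[str], max_chars: int = MAX_SUMMARY_CHARS) -> dict:
    """Summarize how many texts respect a character limit."""
    lengths = [len(text) for text in texts]
    within_limit = sum(length <= max_chars for length in lengths)
    return {
        "num_texts": len(texts),
        "max_len": max(lengths, default=0),
        "within_limit": within_limit,
        "within_limit_ratio": within_limit / len(texts) if texts else 0.0,
    }


# Toy stand-ins for the generated answers: one short summary, one that is too long.
toy_answers = ["# TL;DR Summary\n- short point", "x" * 600]
report = length_report(toy_answers)
print(report)
```

Note that the count is in characters, not tokens, and includes any markdown fence the model emits around the summary.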

## Evaluate

The last step is to compute ROUGE metrics on the test split to see how well our fine-tuned LLM performs.

In [None]:
import evaluate
import numpy as np

rouge = evaluate.load("rouge")

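def compute_metrics(predictions: list[str], references: list[str]) -> dict[str, float]:
    """Compute ROUGE scores plus the mean prediction length in characters."""
    result = rouge.compute(
        predictions=predictions, references=references, use_stemmer=True
    )
    # Mean length is measured in characters, not tokens.
    result["mean_len"] = np.mean([len(p) for p in predictions])

    return {k: round(v, 4) for k, v in result.items()}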

In [None]:
references = dataset["answer"]

In [None]:
references[0]

In [None]:
test_metrics = compute_metrics(answers, references)
print(test_metrics)
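To build intuition for what a ROUGE-1 score measures, here is a simplified pure-Python sketch of unigram-overlap F1. It is not the `rouge_score` implementation: it assumes plain lowercase whitespace tokenization and skips stemming, so its numbers will generally differ from the values computed above. The function name `rouge1_f1` is hypothetical.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap between two texts."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: a token counts at most as often as it appears in both texts.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1(
    "the model summarizes documents",
    "the model summarizes long documents",
)
print(round(score, 4))  # 0.8889 (precision 4/4, recall 4/5)
```

A prediction that covers every reference word but adds many extra words keeps high recall and loses precision, which is why it helps to read the ROUGE scores alongside `mean_len`.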