## Installation

In [1]:
%%capture
!pip install vllm==0.7.1 evaluate==0.4.3 rouge_score==0.1.2 bitsandbytes==0.45.1

## Global Variables

In [2]:
dataset_name = (
    input(
        "Enter the name of the dataset generated during Lesson 3. Press Enter to default to our cached dataset."
    )
    or "pauliusztin/second_brain_course_summarization_task"
)
print(f"{dataset_name=}")
model_name = (
    input(
        "Enter the name of the LLM fine-tuned during Lesson 4. Press Enter to default to our fine-tuned LLM."
    )
    or "pauliusztin/Meta-Llama-3.1-8B-Instruct-Second-Brain-Summarization"
)
print(f"{model_name=}")

Enter the name of the dataset generated during Lesson 3. Press Enter to default to our cached dataset.
dataset_name='pauliusztin/second_brain_course_summarization_task'
Enter the name of the LLM fine-tuned during Lesson 4. Press Enter to default to our fine-tuned LLM.
model_name='pauliusztin/Meta-Llama-3.1-8B-Instruct-Second-Brain-Summarization'


In [3]:
import torch


def get_gpu_info() -> str | None:
    """Gets GPU device name if available.

    Returns:
        str | None: Name of the GPU device if available, None if no GPU is found.
    """
    if not torch.cuda.is_available():
        return None

    gpu_name = torch.cuda.get_device_properties(0).name

    return gpu_name


active_gpu_name = get_gpu_info()

print("GPU type:")
print(active_gpu_name)

GPU type:
Tesla T4


Depending on the GPU type, we cap the number of evaluation samples so that generating the answers required for evaluation does not take too long.

In [4]:
if active_gpu_name and "T4" in active_gpu_name:
    max_evaluation_samples = 8
elif active_gpu_name and ("A100" in active_gpu_name or "L4" in active_gpu_name):
    max_evaluation_samples = 70
elif active_gpu_name:
    max_evaluation_samples = 8
else:
    raise ValueError("No Nvidia GPU found.")

print("--- Parameters ---")
print(f"{max_evaluation_samples=}")

--- Parameters ---
max_evaluation_samples=8


## Load Fine-tuned LLM

In [5]:
from vllm import LLM

llm = LLM(
    model=model_name,
    max_model_len=4096,
    dtype="float16",
    quantization="bitsandbytes",
    load_format="bitsandbytes",
)

INFO 03-05 13:47:57 __init__.py:183] Automatically detected platform cuda.




config.json:   0%|          | 0.00/981 [00:00<?, ?B/s]

INFO 03-05 13:48:16 config.py:526] This model supports multiple tasks: {'score', 'generate', 'classify', 'embed', 'reward'}. Defaulting to 'generate'.
INFO 03-05 13:48:17 llm_engine.py:232] Initializing a V0 LLM engine (v0.7.1) with config: model='pauliusztin/Meta-Llama-3.1-8B-Instruct-Second-Brain-Summarization', speculative_config=None, tokenizer='pauliusztin/Meta-Llama-3.1-8B-Instruct-Second-Brain-Summarization', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.BITSANDBYTES, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, 

tokenizer_config.json:   0%|          | 0.00/55.5k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

INFO 03-05 13:48:21 cuda.py:184] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 03-05 13:48:21 cuda.py:232] Using XFormers backend.
INFO 03-05 13:48:22 model_runner.py:1111] Starting to load model pauliusztin/Meta-Llama-3.1-8B-Instruct-Second-Brain-Summarization...
INFO 03-05 13:48:22 loader.py:1078] Loading weights with BitsAndBytes quantization.  May take a while ...
INFO 03-05 13:48:23 weight_utils.py:251] Using model weights format ['*.safetensors']


model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]


INFO 03-05 13:54:05 model_runner.py:1116] Loading model weights took 5.3520 GB
INFO 03-05 13:54:28 worker.py:266] Memory profiling takes 23.22 seconds
INFO 03-05 13:54:28 worker.py:266] the current vLLM instance can use total_gpu_memory (14.74GiB) x gpu_memory_utilization (0.90) = 13.27GiB
INFO 03-05 13:54:28 worker.py:266] model weights take 5.35GiB; non_torch_memory takes 0.05GiB; PyTorch activation peak memory takes 1.20GiB; the rest of the memory reserved for KV Cache is 6.67GiB.
INFO 03-05 13:54:29 executor_base.py:108] # CUDA blocks: 3413, # CPU blocks: 2048
INFO 03-05 13:54:29 executor_base.py:113] Maximum concurrency for 4096 tokens per request: 13.33x
INFO 03-05 13:54:32 model_runner.py:1435] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_uti

Capturing CUDA graph shapes: 100%|██████████| 35/35 [01:41<00:00,  2.90s/it]

INFO 03-05 13:56:13 model_runner.py:1563] Graph capturing finished in 101 secs, took 0.71 GiB
INFO 03-05 13:56:13 llm_engine.py:429] init engine (profile, create kv cache, warmup model) took 128.24 seconds
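
The memory figures in the log above can be cross-checked with a few lines of arithmetic. The sketch below copies the numbers from the log and adds two assumptions that the log does not state: a KV-cache block size of 16 tokens (vLLM's default) and the Llama 3.1 8B architecture (32 layers, 8 KV heads, head dimension 128) with a float16 KV cache. Under those assumptions, the block count, the KV cache size, and the reported maximum concurrency all line up.

```python
# Values copied from the vLLM log lines above.
total_gpu_memory_gib = 14.74
gpu_memory_utilization = 0.90
model_weights_gib = 5.35
non_torch_memory_gib = 0.05
activation_peak_gib = 1.20
num_cuda_blocks = 3413
max_model_len = 4096

# Assumption: vLLM's default KV-cache block size of 16 tokens.
block_size_tokens = 16

# Assumption: Llama 3.1 8B architecture and a float16 KV cache (2 bytes per value).
num_layers, num_kv_heads, head_dim, bytes_per_value = 32, 8, 128, 2

# Memory vLLM is allowed to use, and what is left for the KV cache.
usable_gib = total_gpu_memory_gib * gpu_memory_utilization
kv_cache_gib = (
    usable_gib - model_weights_gib - non_torch_memory_gib - activation_peak_gib
)

# Both keys and values are cached for every layer, hence the factor of 2.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
kv_cache_tokens = num_cuda_blocks * block_size_tokens
kv_cache_from_blocks_gib = kv_cache_tokens * kv_bytes_per_token / 1024**3

# How many full-length (4096-token) requests fit in the KV cache at once.
max_concurrency = kv_cache_tokens / max_model_len

print(f"usable memory:         {usable_gib:.2f} GiB")
print(f"KV cache (budget):     {kv_cache_gib:.2f} GiB")
print(f"KV cache (from blocks): {kv_cache_from_blocks_gib:.2f} GiB")
print(f"max concurrency:       {max_concurrency:.2f}x")
```

If you lower `max_model_len` or raise `gpu_memory_utilization`, the same arithmetic gives a rough idea of how many requests vLLM can batch together.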





## Prepare Input Samples

In [6]:
from datasets import load_dataset

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
You are a helpful assistant specialized in summarizing documents. Generate a concise TL;DR summary in markdown format having a maximum of 512 characters of the key findings from the provided documents, highlighting the most significant insights

### Input:
{}

### Response:
{}"""


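def format_sample(sample: dict) -> str:
    # The dataset's "instruction" column holds the document to summarize. The
    # response slot stays empty so the model generates the summary itself.
    return alpaca_prompt.format(sample["instruction"], "")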
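
To see what the empty response slot does, here is a minimal sketch with a shortened, hypothetical stand-in for the template and a made-up sample. At inference time the prompt ends right after `### Response:`, so the model continues from there. Filling the second slot with the reference answer, shown only for contrast, gives the kind of text a fine-tuning run would typically train on; whether Lesson 4 used exactly this layout is an assumption.

```python
# Shortened, hypothetical stand-in for the Alpaca template used in this notebook.
template = """### Instruction:
Summarize the document.

### Input:
{}

### Response:
{}"""

# Made-up sample with the same column names as the dataset.
sample = {
    "instruction": "# Notes\nSome document text.",
    "answer": "# TL;DR\nShort summary.",
}

# Inference: leave the response slot empty so the model writes the summary.
inference_prompt = template.format(sample["instruction"], "")

# Training-style text: both slots filled (shown only for contrast).
training_text = template.format(sample["instruction"], sample["answer"])

print(repr(inference_prompt[-16:]))
print(training_text.startswith(inference_prompt))
```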

In [7]:
dataset = load_dataset(dataset_name, split="test")
dataset = dataset.select(range(max_evaluation_samples))

README.md:   0%|          | 0.00/529 [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/3.25M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/1.11M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/955k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/575 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/72 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/72 [00:00<?, ? examples/s]

In [8]:
len(dataset)

8

In [9]:
dataset[0]["instruction"][:1000]

'# Notes\n\n\n\n<child_page>\n# Design Patterns\n\n# Training code\n\nThe most natural way of splitting the training code:\n- Dataset\n- DatasetLoader\n- Model\n- ModelFactory\n- Trainer (takes in the dataset and model)\n- Evaluator\n\n# Serving code\n\n[Infrastructure]Model (takes in the trained model)\n\t- register\n\t- deploy\n</child_page>\n\n\n---\n\n# Resources [Community]\n\n# Resources [Science]\n\n# Tools'

In [10]:
dataset[0]["answer"][:1000]

'```markdown\n# TL;DR Summary\n\n## Design Patterns\n- **Training Code Structure**: Key components include Dataset, DatasetLoader, Model, ModelFactory, Trainer, and Evaluator.\n- **Serving Code**: Infrastructure for Model registration and deployment.\n\n## Tags\n- Generative AI\n- LLMs\n```'

In [11]:
dataset = dataset.map(lambda sample: {"prompt": format_sample(sample)})

Map:   0%|          | 0/8 [00:00<?, ? examples/s]

In [12]:
dataset[0]

{'instruction': '# Notes\n\n\n\n<child_page>\n# Design Patterns\n\n# Training code\n\nThe most natural way of splitting the training code:\n- Dataset\n- DatasetLoader\n- Model\n- ModelFactory\n- Trainer (takes in the dataset and model)\n- Evaluator\n\n# Serving code\n\n[Infrastructure]Model (takes in the trained model)\n\t- register\n\t- deploy\n</child_page>\n\n\n---\n\n# Resources [Community]\n\n# Resources [Science]\n\n# Tools',
 'answer': '```markdown\n# TL;DR Summary\n\n## Design Patterns\n- **Training Code Structure**: Key components include Dataset, DatasetLoader, Model, ModelFactory, Trainer, and Evaluator.\n- **Serving Code**: Infrastructure for Model registration and deployment.\n\n## Tags\n- Generative AI\n- LLMs\n```',
 'prompt': 'Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nYou are a helpful assistant specialized in summarizing documents. Ge

In [13]:
dataset["prompt"][0]

'Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nYou are a helpful assistant specialized in summarizing documents. Generate a concise TL;DR summary in markdown format having a maximum of 512 characters of the key findings from the provided documents, highlighting the most significant insights\n\n### Input:\n# Notes\n\n\n\n<child_page>\n# Design Patterns\n\n# Training code\n\nThe most natural way of splitting the training code:\n- Dataset\n- DatasetLoader\n- Model\n- ModelFactory\n- Trainer (takes in the dataset and model)\n- Evaluator\n\n# Serving code\n\n[Infrastructure]Model (takes in the trained model)\n\t- register\n\t- deploy\n</child_page>\n\n\n---\n\n# Resources [Community]\n\n# Resources [Science]\n\n# Tools\n\n### Response:\n'

## Generate Answers

In [14]:
from vllm import SamplingParams

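sampling_params = SamplingParams(
    # temperature=0.0 selects greedy decoding, so top_p and min_p have no
    # practical effect on the generated tokens.
    temperature=0.0, top_p=0.95, min_p=0.05, max_tokens=4096
)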
predictions = llm.generate(dataset["prompt"], sampling_params)

Processed prompts: 100%|██████████| 8/8 [01:27<00:00, 10.88s/it, est. speed input: 1404.74 toks/s, output: 3.30 toks/s]


In [15]:
predictions[0].outputs[0].text

'```markdown\n# TL;DR Summary\n\n**Design Patterns:**\n\n  * Split training code into:\n    * Dataset\n    * DatasetLoader\n    * Model\n    * ModelFactory\n    * Trainer\n    * Evaluator\n  * Split serving code into:\n    * Infrastructure\n    * Model (register, deploy)\n\n**Key Insights:**\n\n  * Modularize training and serving code for better organization and scalability.\n  * Utilize design patterns to enhance code maintainability and efficiency.\n\n\n\n**Tags:** Generative AI, LLMs, MLOps\n```'

In [16]:
answers = [prediction.outputs[0].text for prediction in predictions]
answers[0]

'```markdown\n# TL;DR Summary\n\n**Design Patterns:**\n\n  * Split training code into:\n    * Dataset\n    * DatasetLoader\n    * Model\n    * ModelFactory\n    * Trainer\n    * Evaluator\n  * Split serving code into:\n    * Infrastructure\n    * Model (register, deploy)\n\n**Key Insights:**\n\n  * Modularize training and serving code for better organization and scalability.\n  * Utilize design patterns to enhance code maintainability and efficiency.\n\n\n\n**Tags:** Generative AI, LLMs, MLOps\n```'
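
The prompt asks for summaries of at most 512 characters, so before scoring it can be useful to check how many generated answers respect that limit. The helper below is a minimal sketch: `length_report` is a hypothetical name, and the two example strings stand in for the real `answers` list.

```python
MAX_CHARS = 512  # limit stated in the prompt's instruction


def length_report(texts: list[str], max_chars: int = MAX_CHARS) -> dict[str, int]:
    """Counts how many texts stay within the character limit."""
    lengths = [len(text) for text in texts]
    within_limit = sum(length <= max_chars for length in lengths)

    return {
        "count": len(texts),
        "within_limit": within_limit,
        "max_len": max(lengths, default=0),
    }


# Hypothetical stand-ins for the generated answers.
example_answers = ["# TL;DR\nShort summary.", "x" * 600]
report = length_report(example_answers)
print(report)
```

In the notebook you would call `length_report(answers)` instead of using the example list.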

## Evaluate

The last step is to compute ROUGE metrics on the first `max_evaluation_samples` samples of the test split to see how well our fine-tuned LLM performs.

In [17]:
import evaluate
import numpy as np

rouge = evaluate.load("rouge")


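def compute_metrics(predictions: list[str], references: list[str]) -> dict[str, float]:
    """Computes ROUGE scores and the mean prediction length in characters."""
    result = rouge.compute(
        predictions=predictions, references=references, use_stemmer=True
    )
    # Mean length of the generated answers, measured in characters.
    result["mean_len"] = np.mean([len(p) for p in predictions])

    return {k: round(v, 4) for k, v in result.items()}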

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

In [18]:
references = dataset["answer"]

In [19]:
references[0]

'```markdown\n# TL;DR Summary\n\n## Design Patterns\n- **Training Code Structure**: Key components include Dataset, DatasetLoader, Model, ModelFactory, Trainer, and Evaluator.\n- **Serving Code**: Infrastructure for Model registration and deployment.\n\n## Tags\n- Generative AI\n- LLMs\n```'

In [20]:
test_metrics = compute_metrics(answers, references)
print(test_metrics)

{'rouge1': 0.2208, 'rouge2': 0.1163, 'rougeL': 0.1801, 'rougeLsum': 0.1924, 'mean_len': 173.875}
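
To make the `rouge1` figure more concrete, here is a simplified sketch of what ROUGE-1 measures: the F1 score of the unigram overlap between a prediction and its reference. It assumes plain lowercase whitespace tokenization and no stemming, so its numbers will not match the `rouge_score` implementation used above. It is meant only to illustrate the idea.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap with whitespace tokenization."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0

    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)

    return 2 * precision * recall / (precision + recall)


# Five of the six words overlap, so precision = recall = 5/6.
score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 4))
```

Because the score depends on exact word overlap, a summary that is correct but worded or formatted differently from the reference (for example, bullet lists instead of headings) can still score low.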
