In this notebook, we show how to use the KVpress pipeline by answering questions about the NVIDIA Wikipedia article.

In [None]:
import requests
from bs4 import BeautifulSoup

import torch
from transformers import pipeline

from kvpress import (
    ExpectedAttentionPress,
    KnormPress,
    ObservedAttentionPress,
    RandomPress,
    SnapKVPress,
    StreamingLLMPress,
    TOVAPress,
)

# Load the pipeline and data

In [None]:
# Load pipeline

device = "cuda:0"
ckpt = "meta-llama/Meta-Llama-3.1-8B-Instruct"
# ckpt = "microsoft/Phi-3.5-mini-instruct"
# ckpt = "mistralai/Mistral-Nemo-Instruct-2407"
# Use "eager" for ObservedAttentionPress, or "sdpa" if you can't use "flash_attention_2"
attn_implementation = "flash_attention_2"
pipe = pipeline(
    "kv-press-text-generation",
    model=ckpt,
    device=device,
    torch_dtype="auto",
    model_kwargs={"attn_implementation": attn_implementation},
)

You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
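
The choice of `attn_implementation` described in the comment of the cell above can also be made programmatically. The helper below is a minimal sketch, not part of kvpress or transformers: `pick_attn_implementation` is a hypothetical name, and it assumes that flash attention is usable whenever the `flash_attn` package can be imported, which ignores GPU and dtype requirements.

```python
import importlib.util


def pick_attn_implementation(needs_attention_weights: bool = False) -> str:
    """Return a value for the `attn_implementation` model kwarg.

    Hypothetical helper (an assumption for illustration, not a library API).
    - "eager" when the press needs the attention weights, as the notebook
      comment recommends for ObservedAttentionPress
    - "flash_attention_2" when the flash_attn package is importable
    - "sdpa" otherwise
    """
    if needs_attention_weights:
        return "eager"
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


if __name__ == "__main__":
    print(pick_attn_implementation())
    print(pick_attn_implementation(needs_attention_weights=True))
```

The returned string can be passed as `model_kwargs={"attn_implementation": ...}` when building the pipeline.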

In [3]:
# Load data
url = "https://en.wikipedia.org/wiki/Nvidia"
content = requests.get(url).content
soup = BeautifulSoup(content, "html.parser")
context = "".join([p.text for p in soup.find_all("p")]) + "\n\n"
tokens = pipe.tokenizer.encode(context, return_tensors="pt").to(device)
print(f"Number of tokens: {tokens.size(1)}")

Number of tokens: 8747


# Use the pipeline with a press

In [4]:
# Pick a press and a compression ratio. You can rerun the following cells with different presses.
compression_ratio = 0.5
press = ExpectedAttentionPress(compression_ratio)
# press = KnormPress(compression_ratio)
# press = RandomPress(compression_ratio)
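
To get a feel for what a given `compression_ratio` means for this context, the sketch below estimates how many key-value pairs remain per layer after compression. It assumes the ratio is the fraction of pairs that gets pruned and that the kept count is rounded down; `kept_tokens` is a hypothetical helper, not a kvpress function, and individual presses may round or budget differently.

```python
def kept_tokens(n_tokens: int, compression_ratio: float) -> int:
    """Estimate the number of KV pairs kept after compression.

    Assumption: compression_ratio is the fraction of pairs removed,
    and the kept count is truncated to an integer.
    """
    if not 0.0 <= compression_ratio < 1.0:
        raise ValueError("compression_ratio must be in [0, 1)")
    return int(n_tokens * (1 - compression_ratio))


if __name__ == "__main__":
    n_context_tokens = 8747  # token count printed when loading the data
    for ratio in (0.0, 0.25, 0.5):
        print(f"ratio={ratio}: about {kept_tokens(n_context_tokens, ratio)} tokens kept")
```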

In [5]:
# Run the pipeline on a single question

question = "Complete this sentence: The Nvidia GeForce Partner Program was a ..."
true_answer = "marketing program designed to provide partnering companies with benefits such as public relations support, video game bundling, and marketing development funds."
pred_answer = pipe(context, question=question, press=press)["answer"]

print(f"Question:   {question}")
print(f"Answer:     {true_answer}")
print(f"Prediction: {pred_answer}")

The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.


Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)


Question:   Complete this sentence: The Nvidia GeForce Partner Program was a ...
Answer:     marketing program designed to provide partnering companies with benefits such as public relations support, video game bundling, and marketing development funds.
Prediction: marketing program designed to provide partnering companies with benefits such as public relations support, video game bundling, and marketing development funds.


In [6]:
# Run the pipeline on multiple questions. The context is compressed only once.

questions = [
    "What happened on March 1, 2024?",
    "What was the unofficial company motto of Nvidia during the early days?",
]

true_answers = [
    "Nvidia became the third company in the history of the United States to close with a market capitalization in excess of $2 trillion",
    "Our company is thirty days from going out of business",
]

pred_answers = pipe(context, questions=questions, press=press)["answers"]
for question, pred_answer, true_answer in zip(questions, pred_answers, true_answers):
    print(f"Question:   {question}")
    print(f"Answer:     {true_answer}")
    print(f"Prediction: {pred_answer}")
    print()

Question:   What happened on March 1, 2024?
Answer:     Nvidia became the third company in the history of the United States to close with a market capitalization in excess of $2 trillion
Prediction: Nvidia became the third company in U.S. history to close with a market capitalization of over $2 trillion.

Question:   What was the unofficial company motto of Nvidia during the early days?
Answer:     Our company is thirty days from going out of business
Prediction: The unofficial company motto of Nvidia during the early days was "Thirty days from bankruptcy."
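
The predictions above are compared with the reference answers by eye. When trying several presses or compression ratios, a rough automatic score can help. The sketch below computes a token-overlap F1 score between a prediction and a reference. It is an illustrative metric chosen for this example, not something provided by kvpress, and `token_f1` and `normalize` are hypothetical helper names.

```python
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a prediction and a reference answer."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match, one empty counts as a miss
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both texts
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    reference = "Our company is thirty days from going out of business"
    prediction = (
        "The unofficial company motto of Nvidia during the early days "
        'was "Thirty days from bankruptcy."'
    )
    print(f"Token F1: {token_f1(prediction, reference):.2f}")
```

A score like this only measures word overlap, so a correct paraphrase can still score low. It is best used to compare presses against each other rather than as an absolute measure of quality.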



In [7]:
# Use an answer prefix and limit the number of tokens in the answer

question = "What is GTC ?"
true_answer = "Nvidia's GPU Technology Conference (GTC) is a series of technical conferences held around the world."
answer_prefix = "Come on you don't know GTC ? Everyone"
max_new_tokens = 30

pred_answer_with_prefix = pipe(context, question=question, answer_prefix=answer_prefix, press=press, max_new_tokens=max_new_tokens)["answer"]
pred_answer_without_prefix = pipe(context, question=question, press=press, max_new_tokens=max_new_tokens)["answer"]

print(f"Question:              {question}")
print(f"Answer:                {true_answer}")
print(f"Prediction w/o prefix: {pred_answer_without_prefix}")
print(f"Prediction w/ prefix : {answer_prefix + pred_answer_with_prefix}")

Question:              What is GTC ?
Answer:                Nvidia's GPU Technology Conference (GTC) is a series of technical conferences held around the world.
Prediction w/o prefix: GTC stands for GPU Technology Conference. It is a series of technical conferences held by Nvidia, a multinational technology company that specializes in graphics processing units (
Prediction w/ prefix : Come on you don't know GTC ? Everyone knows GTC. GTC stands for GPU Technology Conference. It is a series of technical conferences held by Nvidia, a multinational technology company that specializes in


In [11]:
# SnapKV uses the latest queries to prune the KV cache. It therefore works better if we include the question during compression, as the latest queries will then correspond to the question.
# However, it also implies that SnapKV cannot compress the context well independently of the question (e.g. in a chat use case)


question = "Complete this sentence: In April 2016, Nvidia produced the DGX-1 based on an 8 GPU cluster,"
true_answer = (
    "to improve the ability of users to use deep learning by combining GPUs with integrated deep learning software"
)

press = SnapKVPress(compression_ratio=0.8)

pred_answer_with_question = pipe(context + question, press=press)["answer"]
pred_answer_without_question = pipe(context, question=question, press=press)["answer"]

print(f"Question:         {question}")
print(f"Answer:           {true_answer}")
print(f"Prediction w/ Q:  {pred_answer_with_question}")
print(f"Prediction w/o Q: {pred_answer_without_question}")

Question:         Complete this sentence: In April 2016, Nvidia produced the DGX-1 based on an 8 GPU cluster,
Answer:           to improve the ability of users to use deep learning by combining GPUs with integrated deep learning software
Prediction w/ Q:  to improve the ability of users to use deep learning by combining GPUs with integrated deep learning software.
Prediction w/o Q: In April 2016, Nvidia produced the DGX-1 based on an 8 GPU cluster, which was the first commercially available deep learning system.
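
The difference between the two predictions follows from how the pruning scores are computed. The toy sketch below illustrates the idea stated in the cell comments: the queries of the last few tokens score the earlier positions, and only the highest-scoring positions are kept. It is a simplified, single-head, pure-Python illustration under stated assumptions, not the kvpress `SnapKVPress` implementation, and it leaves out details of the real method. The function name `snapkv_like_keep_indices` and its parameters are hypothetical.

```python
import math


def snapkv_like_keep_indices(queries, keys, window_size, n_keep):
    """Select positions to keep, scored by the last `window_size` queries.

    queries, keys: one float vector per token position (same dimension).
    The last `window_size` positions are always kept. Earlier positions are
    ranked by the softmax attention they receive from the window queries,
    and the top `n_keep` of them are kept.
    """
    n_tokens = len(keys)
    n_prefix = n_tokens - window_size
    if n_prefix <= 0:
        raise ValueError("window_size must be smaller than the sequence length")
    dim = len(keys[0])
    scores = [0.0] * n_prefix
    for query in queries[-window_size:]:
        # Scaled dot-product logits of this query against the prefix keys
        logits = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in keys[:n_prefix]
        ]
        # Numerically stable softmax over the prefix positions
        max_logit = max(logits)
        exps = [math.exp(logit - max_logit) for logit in logits]
        total = sum(exps)
        for i, value in enumerate(exps):
            scores[i] += value / total
    ranked = sorted(range(n_prefix), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep]) + list(range(n_prefix, n_tokens))


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
    # The last query plays the role of the question: it aligns with position 1
    queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
    print(snapkv_like_keep_indices(queries, keys, window_size=1, n_keep=1))
```

When the question is appended to the context, the window queries come from the question, so the positions relevant to it score highest. When the context is compressed alone, the window contains only the end of the article, and the kept positions need not be the ones the question requires.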
