In this notebook, we showcase how to use the KVpress pipeline by answering questions about the NVIDIA Wikipedia article.

In [1]:
import requests
from bs4 import BeautifulSoup

from transformers import pipeline

from kvpress import ExpectedAttentionPress, KnormPress, RandomPress

# Load the pipeline and data

In [2]:
# Load pipeline

device = "cuda:0"
ckpt = "Qwen/Qwen3-8B"
# Use "eager" for ObservedAttentionPress, or "sdpa" if you can't use "flash_attention_2"
attn_implementation = "flash_attention_2"
pipe = pipeline(
    "kv-press-text-generation", model=ckpt, device=device, dtype="auto",
    model_kwargs={"attn_implementation": attn_implementation},
)

Device set to use cuda:0


In [3]:
# Load data
url = "https://en.wikipedia.org/wiki/Nvidia"
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) "}
content = requests.get(url, headers=headers).content
soup = BeautifulSoup(content, "html.parser")
context = "".join([p.text for p in soup.find_all("p")]) + "\n\n"
tokens = pipe.tokenizer.encode(context, return_tensors="pt").to(device)
print(f"Number of tokens: {tokens.size(1)}")

Number of tokens: 11686


# Use the pipeline with a press

In [4]:
# Pick a press and a compression ratio; you can rerun the following cells with different presses
compression_ratio = 0.5
press = ExpectedAttentionPress(compression_ratio)
# press = KnormPress(compression_ratio)
# press = RandomPress(compression_ratio)
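
To get a feel for what a compression ratio of 0.5 means in memory terms, the sketch below estimates the KV cache size before and after compression. It assumes the press keeps a fraction `1 - compression_ratio` of the tokens in every layer. The model dimensions in the example (36 layers, 8 key/value heads, head dimension 128, 2 bytes per value) are assumptions about Qwen3-8B in bfloat16, not values read from the loaded config, and `kv_cache_bytes` is a hypothetical helper that is not part of kvpress.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim,
                   bytes_per_value=2, compression_ratio=0.0):
    """Rough KV cache size in bytes for one sequence.

    Assumes the press keeps a fraction (1 - compression_ratio) of the tokens
    in every layer, and that keys and values have the same shape.
    """
    if not 0.0 <= compression_ratio < 1.0:
        raise ValueError("compression_ratio must be in [0, 1)")
    n_kept = int(n_tokens * (1 - compression_ratio))
    # The factor 2 accounts for storing both keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * n_kept * bytes_per_value


# Assumed Qwen3-8B dimensions; check the model config before relying on them.
dims = dict(n_layers=36, n_kv_heads=8, head_dim=128)
n_tokens = 11686  # token count printed when the data was loaded

full = kv_cache_bytes(n_tokens, **dims)
compressed = kv_cache_bytes(n_tokens, compression_ratio=0.5, **dims)
print(f"Full cache:       {full / 2**30:.2f} GiB")
print(f"Compressed cache: {compressed / 2**30:.2f} GiB")
```

The estimate covers only the cached keys and values; model weights and activations are not included.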

In [5]:
# Run the pipeline on a single question

question = "Complete this sentence: The Nvidia GeForce Partner Program was a ..."
true_answer = "marketing program designed to provide partnering companies with benefits such as public relations support, video game bundling, and marketing development funds."
pred_answer = pipe(context, question=question, press=press)["answer"]

print(f"Question:   {question}")
print(f"Answer:     {true_answer}")
print(f"Prediction: {pred_answer}")

Question:   Complete this sentence: The Nvidia GeForce Partner Program was a ...
Answer:     marketing program designed to provide partnering companies with benefits such as public relations support, video game bundling, and marketing development funds.
Prediction: The Nvidia GeForce Partner Program was a marketing initiative designed to provide partnering companies with benefits such as public relations support, video game bundling, and marketing development funds, but it became controversial due to allegations of anti-competitive practices.
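
Since the same call works for any press, one way to see how compression affects the answer is to repeat it over several ratios. The sketch below relies only on the call pattern used in this notebook (`pipe(context, question=..., press=...)["answer"]` and a press class constructed from a compression ratio). `sweep_compression_ratios` is a hypothetical helper, and the stub pipeline and stub press exist only so the snippet runs without a GPU.

```python
def sweep_compression_ratios(pipe, press_cls, context, question, ratios):
    """Ask the same question once per compression ratio.

    Returns a list of (ratio, answer) pairs. A new press is built for every
    ratio, so each call compresses the context again.
    """
    results = []
    for ratio in ratios:
        press = press_cls(ratio)
        answer = pipe(context, question=question, press=press)["answer"]
        results.append((ratio, answer))
    return results


# Stand-ins so the sketch runs on its own; in the notebook, pass the real
# `pipe` and a press class such as ExpectedAttentionPress instead.
class StubPress:
    def __init__(self, compression_ratio):
        self.compression_ratio = compression_ratio


def stub_pipe(context, question, press):
    return {"answer": f"answer at ratio {press.compression_ratio}"}


for ratio, answer in sweep_compression_ratios(
    stub_pipe, StubPress, "some context", "some question", [0.25, 0.5, 0.75]
):
    print(f"{ratio:.2f}: {answer}")
```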


In [6]:
# Run the pipeline on multiple questions; the context will be compressed only once

questions = [
    "What happened on March 1, 2024?",
    "What was the unofficial company motto of Nvidia during the early days?",
]

true_answers = [
    "Nvidia became the third company in the history of the United States to close with a market capitalization in excess of $2 trillion",
    "Our company is thirty days from going out of business",
]

pred_answers = pipe(context, questions=questions, press=press)["answers"]
for question, pred_answer, true_answer in zip(questions, pred_answers, true_answers):
    print(f"Question:   {question}")
    print(f"Answer:     {true_answer}")
    print(f"Prediction: {pred_answer}")
    print()

Question:   What happened on March 1, 2024?
Answer:     Nvidia became the third company in the history of the United States to close with a market capitalization in excess of $2 trillion
Prediction: On **March 1, 2024**, **Nvidia** was ranked **#3** on **Forbes' "Best Places to Work" list**. This recognition highlighted the company's strong workplace culture, employee satisfaction, and

Question:   What was the unofficial company motto of Nvidia during the early days?
Answer:     Our company is thirty days from going out of business
Prediction: The unofficial company motto of Nvidia during the early days was **"A flywheel to reach large markets funding huge R&D to solve massive computational problems."** This motto was inspired by the concept of a flywheel, which is a device that stores rotational
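
Comparing predictions and references by eye does not scale to many questions. As a rough, hypothetical alternative, the sketch below computes a token-overlap F1 score between a prediction and a reference, in the spirit of SQuAD-style evaluation. The normalization (lowercasing and dropping punctuation) is a simplifying assumption, and the score does not credit paraphrases.

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase and split into alphanumeric tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(prediction, reference):
    """Token-overlap F1 between a prediction and a reference answer."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection: each shared token counts min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "Our company is thirty days from going out of business"
print(token_f1(reference, reference))
print(token_f1("A flywheel to reach large markets", reference))
```

In the notebook, the same function could be applied to each pair from `zip(pred_answers, true_answers)`.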



In [7]:
# Use an answer prefix and limit the number of tokens in the answer

question = "What is GTC ?"
true_answer = "Nvidia's GPU Technology Conference (GTC) is a series of technical conferences held around the world."
answer_prefix = "Come on you don't know GTC ? Everyone"
max_new_tokens = 30

pred_answer_with_prefix = pipe(context, question=question, answer_prefix=answer_prefix, press=press, max_new_tokens=max_new_tokens)["answer"]
pred_answer_without_prefix = pipe(context, question=question, press=press, max_new_tokens=max_new_tokens)["answer"]

print(f"Question:              {question}")
print(f"Answer:                {true_answer}")
print(f"Prediction w/o prefix: {pred_answer_without_prefix}")
print(f"Prediction w/ prefix : {answer_prefix + pred_answer_with_prefix}")

Question:              What is GTC ?
Answer:                Nvidia's GPU Technology Conference (GTC) is a series of technical conferences held around the world.
Prediction w/o prefix: **GTC** stands for **GPU Technology Conference**, which is a major annual event organized by **NVIDIA**. It is a technical conference that
Prediction w/ prefix : Come on you don't know GTC ? Everyone knows GTC is the biggest AI conference in the world. It's held by NVIDIA, right? I mean, it's like the Super Bowl of
