## Performance Experiments with DeepSeek R1 Distill Qwen 7B Model

This notebook runs inference performance experiments with the DeepSeek R1 Distill Qwen 7B model. The aim is to evaluate how different parameters and configurations affect the model’s latency and resource usage. The experiments focus on:

- Comparing inference times while varying parameters such as the number of maximum new tokens and truncation lengths.
- Assessing how changes in memory allocation (using the max_memory variable) impact performance.
- Exploring different precision and quantization settings via the BitsAndBytesConfig.
- Measuring runtime with the current model configuration as a baseline.

This workflow provides useful feedback on optimizing inference performance and resource management for large-scale language models.

In [1]:
# Check PyTorch and CUDA compatibility
import torch

print("torch version:", torch.__version__)
print("cuda available:", torch.cuda.is_available())

torch version: 2.6.0+cu124
cuda available: True


In [2]:
# Import necessary libraries
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import time
import torch

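4-bit BitsAndBytes quantization reduces weight storage from 16 bits (2 bytes per parameter) to 4 bits (0.5 bytes per parameter), plus a small amount of scale metadata. For a 7B-parameter model, FP16 weights take roughly 14 GB, while 4-bit weights take roughly 3.5 GB before metadata and any modules left unquantized. The footprint reported at the end of this notebook is about 5.6 GB, which is higher partly because the embedding layer shown in the model printout is not quantized.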

Only the weights are stored in 4 bits; activations and the Transformer's key-value cache remain in FP16 or BF16.

4-bit quantization is a common choice for memory-efficient QLoRA/LoRA fine-tuning and for inference on GPUs with 8-12 GB of VRAM.
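
The weight footprint can be estimated from the parameter count and the number of bits per parameter. The sketch below is a back-of-the-envelope calculation only: the helper name `weight_memory_gb` and the round 7e9 parameter count are assumptions for illustration, and the real footprint is higher because of quantization metadata and modules that stay in 16-bit.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring metadata."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 7e9  # assumed round figure for a "7B" model

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # 2 bytes per parameter
int4_gb = weight_memory_gb(NUM_PARAMS, 4)   # 0.5 bytes per parameter

print(f"FP16 weights : {fp16_gb:.1f} GB")
print(f"4-bit weights: {int4_gb:.1f} GB")
```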

In [3]:
# Configure model for 4-bit quantization using BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

1. (load_in_4bit=True): Quantizes the model weights to 4 bits at load time. Activations, layer norms and the key-value cache are not quantized and stay in the original compute dtype.
2. (bnb_4bit_quant_type="nf4"): NF4 (NormalFloat4) uses quantization levels matched to normally distributed weights, so it generally preserves weights better than uniform (linear) int4. It is usually the recommended type for 4-bit quality.
3. (bnb_4bit_use_double_quant=True): Used in QLoRA. Block-wise quantization stores the weights in 4 bits together with per-block metadata (a scale and, in affine schemes, a zero point) that is used to reconstruct the float value at runtime: w ≈ s(q - z), where q is the stored 4-bit code, s is the scale of the block and z is the zero point. This metadata adds overhead, so double quantization quantizes the scales themselves a second time, saving roughly 0.4 bits per parameter.
4. (bnb_4bit_compute_dtype=torch.float16): At matrix multiplication time the 4-bit weights are dequantized into this dtype. FP16 keeps VRAM usage low and works well on consumer cards.
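
To make the per-block scale and zero point concrete, here is a minimal sketch of uniform affine 4-bit quantization of one block of weights. It is an illustration of the general idea only, not the bitsandbytes implementation: NF4 uses a non-uniform codebook rather than evenly spaced levels, and the function names and sample values are hypothetical.

```python
def quantize_block(weights, bits=4):
    """Map floats to integer codes in [0, 2**bits - 1] with one scale and zero point."""
    levels = 2 ** bits - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels or 1.0  # fall back to 1.0 for a constant block
    zero_point = round(-w_min / scale)
    codes = [min(levels, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_block(codes, scale, zero_point):
    """Reconstruct approximate floats: w ~ s * (q - z)."""
    return [scale * (q - zero_point) for q in codes]


block = [-0.8, -0.1, 0.0, 0.3, 0.7]  # made-up weights for illustration
codes, scale, zero_point = quantize_block(block)
restored = dequantize_block(codes, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(block, restored))

print("codes     :", codes)
print("scale     :", scale, "zero point:", zero_point)
print("max error :", max_error)
```

Each block needs its own `scale` (and `zero_point` here), which is the metadata overhead that double quantization reduces.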

`device_map="auto"` is handled by Hugging Face Accelerate. At load time it estimates the size of each submodule from its dtype and quantization, then assigns whole blocks or layers to devices so that they **fit within the budget** configured in `max_memory`. It fills the GPU first, spills the remainder to the CPU, and puts whatever is left on disk if allowed. The result is a static mapping, not live rebalancing during generation; it can be inspected via `model.hf_device_map`.
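
The budget-based placement can be pictured with a toy greedy allocator. This is a simplified sketch, not Accelerate's actual algorithm: the function `assign_devices`, the module names and the sizes (in arbitrary units) are all hypothetical.

```python
def assign_devices(module_sizes, budgets):
    """Place modules, in forward order, on the first device that still has room.

    module_sizes: dict of module name -> size (any consistent unit)
    budgets: dict of device -> capacity, ordered from fastest to slowest
    Modules that fit nowhere are assigned to "disk".
    """
    remaining = dict(budgets)
    devices = list(budgets)
    current = 0  # index of the device currently being filled
    placement = {}
    for name, size in module_sizes.items():
        # Once a device is full, move on and never go back (static mapping).
        while current < len(devices) and remaining[devices[current]] < size:
            current += 1
        if current < len(devices):
            target = devices[current]
            remaining[target] -= size
        else:
            target = "disk"
        placement[name] = target
    return placement


sizes = {"embed": 2, "layer0": 2, "layer1": 2, "layer2": 2, "lm_head": 2}
placement = assign_devices(sizes, {0: 5, "cpu": 3})
print(placement)
```

With these made-up sizes, the first two modules land on GPU 0, one layer spills to the CPU, and the rest go to disk.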

`sdpa` selects PyTorch's fast, memory-efficient scaled dot-product attention.
Using **low_cpu_mem_usage** streams weights during loading to avoid large CPU RAM spikes.


In [4]:
# Define maximum memory allocation for CPU and GPU
max_memory = {
    0: "5GiB",
    "cpu": "25GiB",
}

# Load tokenizer and model with specified configurations
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")
model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    torch_dtype="auto",  # Use the dtype stored in the checkpoint config for non-quantized modules
    low_cpu_mem_usage=True,
    attn_implementation="sdpa",
    quantization_config=bnb_config,
    max_memory=max_memory,
    device_map="auto",
)

# Set model to evaluation mode
model.eval()

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(152064, 3584)
    (layers): ModuleList(
      (0-27): 28 x Qwen2DecoderLayer(
        (self_attn): Qwen2Attention(
          (q_proj): Linear4bit(in_features=3584, out_features=3584, bias=True)
          (k_proj): Linear4bit(in_features=3584, out_features=512, bias=True)
          (v_proj): Linear4bit(in_features=3584, out_features=512, bias=True)
          (o_proj): Linear4bit(in_features=3584, out_features=3584, bias=False)
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear4bit(in_features=3584, out_features=18944, bias=False)
          (up_proj): Linear4bit(in_features=3584, out_features=18944, bias=False)
          (down_proj): Linear4bit(in_features=18944, out_features=3584, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
      )
    )
    (norm): Qwen2RMSNorm((3584,), 

- messages: the chat turns to be sent to the model.
- apply_chat_template: formats these turns using the model's chat template, then **tokenizes** them.
- truncation, max_length: cap the prompt at 512 tokens so the KV cache stays small, since it is one of the main consumers of VRAM at runtime.
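
The KV cache grows linearly with the number of tokens, which is why capping the prompt length bounds its size. A rough estimate can be sketched from the model printout above (28 decoder layers, `k_proj`/`v_proj` output width of 512), assuming FP16 cache entries and batch size 1. The helper name `kv_cache_bytes` is hypothetical, and allocator overhead is ignored.

```python
def kv_cache_bytes(num_tokens, num_layers=28, kv_dim=512, bytes_per_elem=2, batch_size=1):
    """Estimate KV-cache size: one K and one V tensor per layer,
    each of shape [batch_size, num_tokens, kv_dim]."""
    return 2 * num_layers * kv_dim * bytes_per_elem * num_tokens * batch_size


per_token = kv_cache_bytes(1)
for tokens in (512, 4096, 32768):
    mib = kv_cache_bytes(tokens) / 2**20
    print(f"{tokens:>6} tokens -> {mib:8.1f} MiB")
print(f"per token: {per_token} bytes")
```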

In [5]:
# Define input messages for the model
messages = [
    {
        "role": "user",
        "content": "Imagine you are asked to consider, in the most detailed and precise way possible, how intelligence—whether it appears in human beings through biological processes shaped by evolution, or in artificial systems through training on data and optimization algorithms—can be rigorously defined, evaluated, and compared, and in doing so you must take into account the historical roots of the concept of intelligence from ancient philosophy through to the computational era, the biological foundations of cognition such as neurons, synapses, and brain networks, the computational equivalents such as artificial neurons, gradient descent, and representation learning, and then compare how metrics like accuracy or benchmark scores differ from broader qualities like adaptability, creativity, or robustness, and in this comparison also weigh the critiques such as Searle’s Chinese Room argument that challenge whether symbol manipulation equates to understanding, while considering whether embodied cognition, multimodal perception, or interaction with the real world might bring AI closer to human-like grounding, and then reflect on whether scaling laws alone can sustain progress or if new paradigms like neuromorphic hardware, probabilistic reasoning, or hybrid neuro-symbolic approaches are required, and finally I want you to integrate ethical, social, and cultural perspectives, since intelligence is not only a scientific matter but also one that influences human society, governance, and coexistence with AI, so the unifying question is: how can we construct a definition and measurement of intelligence that is at once scientifically rigorous, philosophically sound, practically useful, ethically responsible, and adaptable to future forms of cognition that may arise in both artificial and biological domains?",
    },
]

# Prepare inputs using the tokenizer with chat template
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    truncation=True,
    max_length=512,
)

1. `next(model.parameters()).device` returns the device of the model's first parameter.
2. Accelerate may have placed the first layers on GPU 0 or on the CPU. The inputs must start on the device where the first layer lives; otherwise there will be extra transfers or errors.
3. Every tensor in inputs (input_ids, attention_mask) is then moved to that device, so the devices match for the forward pass.

In [6]:
# Move inputs to the same device as the model
device = next(model.parameters()).device
inputs = {k: v.to(device) for k, v in inputs.items()}

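model.generate() runs autoregressive decoding for the causal LM to produce new tokens (greedy decoding unless sampling is enabled in the generation config). It uses the KV cache so that each new token reuses the attention states of the previous tokens, which avoids recomputation at the cost of VRAM.
It returns a tensor of token IDs, the prompt followed by the generated tokens, on the same device as the inputs.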

In [7]:
# Ensure CUDA operations are complete before starting timing
if torch.cuda.is_available():
    torch.cuda.synchronize()

# Measure inference time for generation only
t0 = time.perf_counter()

# Generate outputs with a limit on new tokens
outputs = model.generate(**inputs, max_new_tokens=40)

# Wait for queued CUDA kernels to finish before stopping the clock
if torch.cuda.is_available():
    torch.cuda.synchronize()
dt = time.perf_counter() - t0

# Decode only the newly generated tokens (outside the timed region)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1] :]))
print(f"Time taken: {dt:.2f} seconds")

Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


Okay, so I'm trying to wrap my head around this big question about what intelligence really is. The user wants a detailed and precise definition that applies to both humans and AI, considering everything from philosophy
Time taken: 5.15 seconds
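
A single measurement like the one above may include one-off warm-up costs, so comparing configurations is easier with a small reusable timer. The sketch below is one possible approach, not part of the original experiment: `time_call` and `tokens_per_second` are hypothetical helpers, the workload is a stand-in, and the throughput figure assumes all 40 requested tokens were generated.

```python
import time


def time_call(fn, sync=None, warmup=1, repeats=3):
    """Return the best wall-clock time in seconds of fn() over `repeats` runs.

    `sync` is an optional callable run before and after each timed call,
    for example torch.cuda.synchronize when fn launches CUDA work.
    """
    for _ in range(warmup):
        fn()  # untimed warm-up runs
    timings = []
    for _ in range(repeats):
        if sync is not None:
            sync()
        t0 = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        timings.append(time.perf_counter() - t0)
    return min(timings)


def tokens_per_second(num_new_tokens, seconds):
    """Decoding throughput for one generation call."""
    return num_new_tokens / seconds


calls = []  # records each invocation of the stand-in workload
best = time_call(lambda: calls.append(sum(range(100_000))), warmup=1, repeats=3)
print(f"best of 3: {best * 1e3:.3f} ms")
print(f"throughput: {tokens_per_second(40, 5.15):.1f} tokens/s")
```

In this notebook the timed callable would be something like `lambda: model.generate(**inputs, max_new_tokens=40)` with `sync=torch.cuda.synchronize`.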


In [8]:
# Move result off GPU, drop references, and flush caches
outputs = outputs.to("cpu")
del inputs
import gc

gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    torch.cuda.ipc_collect()

In [9]:
print("First shard device:", next(model.parameters()).device)
print("Allocated MB:", round(torch.cuda.memory_allocated() / 1e6, 1))
print("Reserved  MB:", round(torch.cuda.memory_reserved() / 1e6, 1))
print("Device map:", model.hf_device_map)

First shard device: cuda:0
Allocated MB: 5588.1
Reserved  MB: 5653.9
Device map: {'': 0}
