### Open the notebook on Colab

By this point, you should have started a notebook server in a container on a Chameleon GPU host, and set up an SSH tunnel to that notebook server. Now, you will connect this notebook to the runtime on Chameleon. This is a convenient way to work, because the notebook and its outputs are saved automatically in your Google Drive.

-   Next to the “Connect” button in the top right, there is a ▼ symbol. Click on this symbol to expand the menu, and choose “Connect to a local runtime”.
-   Paste the `http://127.0.0.1:8888/lab?token=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX` URL you copied earlier into the field that appears, and choose “Connect”.

**Alternatively, if you prefer not to use Colab** (or can’t, for some reason): paste the `http://127.0.0.1:8888/lab?token=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX` URL you copied earlier into your browser to open the Jupyter interface directly. In that case, you will need to open a terminal in the Jupyter interface and run

    wget https://raw.githubusercontent.com/teaching-on-testbeds/llm-chi/refs/heads/main/workspace/2_single_gpu_a100.ipynb

to get a copy of this notebook in that workspace.
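
If the connection fails, one thing worth checking is that the URL you pasted has the expected shape: a loopback host, the tunneled port, and a non-empty token. The following is a minimal sketch using only the Python standard library; `parse_jupyter_url` is a hypothetical helper name, and the token in the example is a placeholder, not a real one.

```python
from urllib.parse import urlparse, parse_qs

def parse_jupyter_url(url):
    """Split a Jupyter URL into (host, port, token). Hypothetical helper."""
    parts = urlparse(url)
    # parse_qs returns a dict of lists; fall back to an empty token if absent
    token = parse_qs(parts.query).get("token", [""])[0]
    return parts.hostname, parts.port, token

# Placeholder token for illustration only
host, port, token = parse_jupyter_url("http://127.0.0.1:8888/lab?token=abc123")
print(host, port, bool(token))  # prints: 127.0.0.1 8888 True
```

If the host is not `127.0.0.1` or the port does not match the one forwarded by your SSH tunnel, the browser or Colab will not be able to reach the server.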

In [1]:
# Install necessary packages
!pip install transformers datasets torch accelerate

Collecting transformers
  Downloading transformers-4.51.3-py3-none-any.whl.metadata (38 kB)
Collecting datasets
  Downloading datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting accelerate
  Downloading accelerate-1.6.0-py3-none-any.whl.metadata (19 kB)
Collecting huggingface-hub<1.0,>=0.30.0 (from transformers)
  Downloading huggingface_hub-0.30.2-py3-none-any.whl.metadata (13 kB)
Collecting regex!=2019.12.17 (from transformers)
  Downloading regex-2024.11.6-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (40 kB)
Collecting tokenizers<0.22,>=0.21 (from transformers)
  Downloading tokenizers-0.21.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.8 kB)
Collecting safetensors>=0.4.3 (from transformers)
  Downloading safetensors-0.5.3-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.8 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datas

In [2]:
import time

import accelerate
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

In [3]:
device = "cuda" if torch.cuda.is_available() else "cpu"

In [8]:
# Function to load models and apply quantization
def load_and_quantize_model(model_name, quantize=True):
    # Load model and tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, use_fast=False)
    model = AutoModelForCausalLM.from_pretrained(model_name,
        device_map="auto",
        torch_dtype=torch.float16,
        trust_remote_code=True)

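    # Apply dynamic int8 quantization.
    # Note: quantize_dynamic is designed for float32 modules running on CPU.
    # Applying it to this float16 model is the likely cause of the
    # "expected scalar type Float but found Half" error shown at the end
    # of this notebook.
    if quantize:
        model = torch.quantization.quantize_dynamic(model, dtype=torch.qint8)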

    # Compile model for faster inference (requires PyTorch 2.0+)
    model = torch.compile(model)

    return tokenizer, model
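
To make the idea behind int8 quantization concrete, here is a simplified sketch of the arithmetic in plain Python. It illustrates symmetric per-tensor quantization only; it is not PyTorch's implementation, which may use a different scheme and additionally quantizes activations at run time. The function names and the example weights are made up for illustration.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; guard against an all-zero tensor
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map the stored integers back to approximate float values."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.0, 1.0]  # illustrative values, not real model weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, max_error <= scale / 2 + 1e-12)
```

Each weight is stored as one byte plus a shared scale, and the rounding error per weight is at most about half the scale. That is the trade-off quantization makes: less memory per parameter in exchange for a small approximation error.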

In [5]:
def generate_code(model, tokenizer, prompt, max_new_tokens=128):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

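    # Disable the KV cache (also passed to generate below). Without it, each new
    # token recomputes attention over the whole sequence, so generation is slower.
    model.config.use_cache = False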

    with torch.no_grad():
        start_time = time.time()
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # Using greedy decoding for speed
            eos_token_id=tokenizer.eos_token_id,
            use_cache=False
        )
        end_time = time.time()

    inference_time = end_time - start_time
    decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return decoded_output, inference_time
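
Wall-clock time alone is hard to compare across prompts of different lengths, so it can be useful to also report throughput in tokens per second. The sketch below shows one way to do this with the standard library; `timer` and `tokens_per_second` are hypothetical helpers, and `time.sleep` stands in for the call to `model.generate`.

```python
import time
from contextlib import contextmanager

@contextmanager
def timer(result):
    """Store the elapsed wall-clock seconds of the with-block in result['seconds']."""
    start = time.perf_counter()  # monotonic, high-resolution clock
    try:
        yield
    finally:
        result["seconds"] = time.perf_counter() - start

def tokens_per_second(num_new_tokens, seconds):
    """Throughput of a generation call; guards against a zero duration."""
    return num_new_tokens / seconds if seconds > 0 else float("inf")

timing = {}
with timer(timing):
    time.sleep(0.01)  # stand-in for model.generate(...)
rate = tokens_per_second(128, timing["seconds"])
print(f"{timing['seconds']:.4f} s, {rate:.1f} tokens/s")
```

In the notebook, the number of new tokens could be taken as the difference between the output length and the prompt length, assuming `generate` returns the prompt tokens followed by the generated ones, as it does for decoder-only models.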

In [6]:
# Load CoNaLa dataset
dataset = load_dataset("neulab/conala", split="train[:5]")  # Just a few samples for quick test

README.md:   0%|          | 0.00/3.90k [00:00<?, ?B/s]

conala.py:   0%|          | 0.00/4.30k [00:00<?, ?B/s]

The repository for neulab/conala contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/neulab/conala.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


conala-paired-train.json:   0%|          | 0.00/518k [00:00<?, ?B/s]

conala-paired-test.json:   0%|          | 0.00/109k [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

In [10]:
# Hugging Face repository IDs of the two models to compare

models = {
    "Lite-Base": "deepseek-ai/DeepSeek-Coder-V2-Lite-Base",
    "Lite-Instruct": "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct"
}

# Load both models without quantization
loaded_models = {
    name: {
        "original": load_and_quantize_model(path, quantize=False),
        # "quantized": load_and_quantize_model(path, quantize=True)
    }
    for name, path in models.items()
}




Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

tokenizer_config.json:   0%|          | 0.00/1.28k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/4.61M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.52k [00:00<?, ?B/s]

configuration_deepseek.py:   0%|          | 0.00/10.3k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct:
- configuration_deepseek.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


modeling_deepseek.py:   0%|          | 0.00/78.7k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct:
- modeling_deepseek.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


model.safetensors.index.json:   0%|          | 0.00/480k [00:00<?, ?B/s]

Fetching 4 files:   0%|          | 0/4 [00:00<?, ?it/s]

model-00004-of-000004.safetensors:   0%|          | 0.00/5.64G [00:00<?, ?B/s]

model-00002-of-000004.safetensors:   0%|          | 0.00/8.59G [00:00<?, ?B/s]

model-00003-of-000004.safetensors:   0%|          | 0.00/8.59G [00:00<?, ?B/s]

model-00001-of-000004.safetensors:   0%|          | 0.00/8.59G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/181 [00:00<?, ?B/s]
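
The shard sizes in the download progress above give a rough estimate of how much memory the weights need. The sketch below adds them up, assuming the progress bars report decimal gigabytes, that the weights are stored at 2 bytes per parameter (16-bit), and that the Base checkpoint is about the same size as the Instruct one whose download is shown.

```python
# Shard sizes in GB, read from the download progress output above
shard_sizes_gb = [8.59, 8.59, 8.59, 5.64]

total_gb = sum(shard_sizes_gb)

# Assumption: 16-bit weights, i.e. 2 bytes per parameter
BYTES_PER_PARAM = 2
approx_params_billion = total_gb / BYTES_PER_PARAM

# Assumption: both checkpoints are roughly the same size
both_models_gb = 2 * total_gb

print(f"{total_gb:.2f} GB per model, ~{approx_params_billion:.1f}B params, "
      f"{both_models_gb:.2f} GB for both")
# prints: 31.41 GB per model, ~15.7B params, 62.82 GB for both
```

This counts weights only. Activations and any temporary buffers used during generation add to the total, so whether both models fit on one GPU at the same time depends on how much memory your GPU has.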

In [11]:
# Iterate over the dataset and run inference with each loaded model (here, the original models)
for idx, sample in enumerate(dataset):
    print(f"\n==================== Sample {idx + 1} ====================")
    print(f"🔸 Intent: {sample['intent']}")

    prompt = f"### Instruction:\n{sample['intent']}\n\n### Response:"

    # For each loaded model variant, generate output and report the inference time
    for name, models_dict in loaded_models.items():
        for model_type, (tokenizer, model) in models_dict.items():
            output, inference_time = generate_code(model, tokenizer, prompt)
            print(f"\n🔹 Output from {name} ({model_type}):\n{output}")
            print(f"⏱️ Inference time: {inference_time:.4f} seconds\n")

Setting `pad_token_id` to `eos_token_id`:100001 for open-end generation.



🔸 Intent: How to convert a list of multiple integers into a single integer?


KeyboardInterrupt: 
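
The decoded output from `generate_code` includes the prompt as well as the generated text, because the model returns the input tokens followed by the new ones. The sketch below shows one way to build the prompt used above and then separate out the response; `build_prompt` and `extract_response` are hypothetical helpers, and the decoded string in the example is made up for illustration, not actual model output.

```python
def build_prompt(intent):
    """Format an intent using the instruction/response template from the loop above."""
    return f"### Instruction:\n{intent}\n\n### Response:"

def extract_response(decoded_output, prompt):
    """Return only the text generated after the prompt.

    Falls back to the full text if the decoded output does not start with
    the prompt (for example, if tokenization altered its whitespace).
    """
    if decoded_output.startswith(prompt):
        return decoded_output[len(prompt):].strip()
    return decoded_output.strip()

prompt = build_prompt("How to convert a list of multiple integers into a single integer?")
# Made-up decoded output for illustration only
decoded = prompt + "\nint(''.join(map(str, numbers)))"
print(extract_response(decoded, prompt))
```

Printing only the response makes it easier to compare the two models side by side, since the instruction text is identical for both.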

In [12]:
# Load both quantized models
loaded_models = {
    name: {
        # "original": load_and_quantize_model(path, quantize=False),
        "quantized": load_and_quantize_model(path, quantize=True)
    }
    for name, path in models.items()
}

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

In [13]:
# Iterate over the dataset and run inference with each loaded model (here, the quantized models)
for idx, sample in enumerate(dataset):
    print(f"\n==================== Sample {idx + 1} ====================")
    print(f"🔸 Intent: {sample['intent']}")

    prompt = f"### Instruction:\n{sample['intent']}\n\n### Response:"

    # For each loaded model variant, generate output and report the inference time
    for name, models_dict in loaded_models.items():
        for model_type, (tokenizer, model) in models_dict.items():
            output, inference_time = generate_code(model, tokenizer, prompt)
            print(f"\n🔹 Output from {name} ({model_type}):\n{output}")
            print(f"⏱️ Inference time: {inference_time:.4f} seconds\n")

Setting `pad_token_id` to `eos_token_id`:100001 for open-end generation.



🔸 Intent: How to convert a list of multiple integers into a single integer?


RuntimeError: expected scalar type Float but found Half