# Quantizing DeepSeek-R1-Distill-Qwen-1.5B

This notebook shows how to load the `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B` model in one of two ways: on a CUDA GPU with 4-bit or 8-bit quantization via `bitsandbytes`, or on CPU in default precision.

**Note:** Please verify the exact model ID on Hugging Face Hub. `bitsandbytes` 4-bit/8-bit quantization is primarily for NVIDIA GPUs.

## 1. Setup: Install Libraries

Uncomment and run the following cell if you haven't installed the necessary libraries.

In [None]:
# !pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 # Adjust cuXXX to your CUDA version
# !pip install transformers accelerate bitsandbytes sentencepiece
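Before moving on, it can help to confirm which of these packages are actually present in the environment. The following is a minimal sketch that uses only the standard library; the helper name `installed_versions` is an illustrative assumption and not part of any of the libraries above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: installed version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Package names mirror the pip command in the cell above.
for name, ver in installed_versions(
    ["torch", "transformers", "accelerate", "bitsandbytes", "sentencepiece"]
).items():
    print(f"{name}: {ver or 'not installed'}")
```

A package reported as `not installed` is a cue to uncomment and run the install cell.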

## 2. Configuration

Set your model ID and choose your device and quantization options.

In [None]:
MODEL_ID = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # !!! VERIFY THIS MODEL ID !!!

# --- Choose your configuration ---
USE_CUDA = True  # Set to True for GPU (with quantization), False for CPU (default precision)
QUANTIZATION_BITS = 4  # Choose 4 or 8. Only applicable if USE_CUDA is True.
# ---------------------------------
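The two settings above interact: `QUANTIZATION_BITS` only matters when `USE_CUDA` is true and a GPU is actually present. As a rough sketch of how the loading cell in section 4 branches on them, the hypothetical helper below (the name `resolve_load_mode` and its return strings are assumptions made for illustration) maps a configuration to the load path it would take.

```python
def resolve_load_mode(use_cuda, quantization_bits, cuda_available):
    """Describe how the model would be loaded for a given configuration.

    Returns one of: "cuda-4bit", "cuda-8bit", "cuda-default", "cpu-default",
    or "unavailable" (CUDA requested but no GPU present).
    """
    if not use_cuda:
        return "cpu-default"
    if not cuda_available:
        return "unavailable"
    if quantization_bits in (4, 8):
        return f"cuda-{quantization_bits}bit"
    # Any other bit width falls back to default precision on the GPU.
    return "cuda-default"


print(resolve_load_mode(use_cuda=True, quantization_bits=4, cuda_available=True))
```

Keeping this decision in a pure function makes it easy to check a configuration before starting a download.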

## 3. Import Libraries and Load Tokenizer

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"Current GPU: {torch.cuda.get_device_name(torch.cuda.current_device())}")

try:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    print(f"Tokenizer for {MODEL_ID} loaded successfully.")
except Exception as e:
    print(f"Error loading tokenizer for {MODEL_ID}: {e}")
    print("Please ensure the MODEL_ID is correct and you have an internet connection.")
    tokenizer = None

## 4. Load Model

This section loads the model based on the configuration set above.

In [None]:
model = None
quantization_config = None

if tokenizer: # Proceed only if tokenizer was loaded
    if USE_CUDA and torch.cuda.is_available():
        print(f"Attempting to load model on CUDA with {QUANTIZATION_BITS}-bit quantization...")
        if QUANTIZATION_BITS == 4:
            quantization_config = BitsAndBytesConfig(
                load_in_4bit=True,
                bnb_4bit_quant_type="nf4",        # nf4 is a common choice for 4-bit
                bnb_4bit_compute_dtype=torch.bfloat16 # Or torch.float16 if bfloat16 is not supported
            )
            print("Using 4-bit quantization (NF4) with bfloat16 compute dtype.")
        elif QUANTIZATION_BITS == 8:
            quantization_config = BitsAndBytesConfig(
                load_in_8bit=True
            )
            print("Using 8-bit quantization.")
        else:
            print(f"Unsupported QUANTIZATION_BITS: {QUANTIZATION_BITS}. Loading in default precision on CUDA.")
            # Fallback to default precision on CUDA if bits not 4 or 8
            quantization_config = None 

        try:
            model = AutoModelForCausalLM.from_pretrained(
                MODEL_ID,
                quantization_config=quantization_config,  # None loads in default precision
                device_map="auto",  # Let accelerate place the layers on the available device(s)
                trust_remote_code=True  # Runs code from the model repo; enable only if the model requires it and you trust the source
            )
            print(f"Model {MODEL_ID} loaded successfully on CUDA.")
            if quantization_config:
                print(f"Model memory footprint: {model.get_memory_footprint() / (1024**3):.2f} GB")
        except Exception as e:
            print(f"Error loading model on CUDA: {e}")
            print("Ensure 'bitsandbytes' is correctly installed and your GPU is compatible.")

    elif not USE_CUDA:
        print(f"Attempting to load model {MODEL_ID} on CPU in default precision...")
        print("(Note: bitsandbytes 4/8-bit quantization is primarily for CUDA GPUs)")
        try:
            model = AutoModelForCausalLM.from_pretrained(
                MODEL_ID,
                device_map="cpu",
                trust_remote_code=True # Add if model requires it
            )
            print(f"Model {MODEL_ID} loaded successfully on CPU.")
        except Exception as e:
            print(f"Error loading model on CPU: {e}")
    else:
        print("CUDA selected but not available. Please check your PyTorch and CUDA setup.")
else:
    print("Tokenizer not loaded. Skipping model loading.")
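To put the reported memory footprint in context, the weight storage at each precision can be estimated with simple arithmetic: parameters times bits per parameter, divided by 8 to get bytes. The sketch below assumes a nominal 1.5 billion parameters, taken only from the "1.5B" in the model name; the real count may differ. Quantized loads also typically keep some layers in higher precision and store extra quantization constants, so treat these figures as rough lower bounds rather than predictions of `get_memory_footprint()`.

```python
def estimate_weight_gib(num_params, bits_per_param):
    """Approximate weight storage in GiB: parameters * bits / 8 bytes."""
    return num_params * bits_per_param / 8 / (1024 ** 3)


NOMINAL_PARAMS = 1.5e9  # Assumption: taken from the "1.5B" in the model name

for label, bits in [("float32", 32), ("bfloat16/float16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>17}: ~{estimate_weight_gib(NOMINAL_PARAMS, bits):.2f} GiB")
```

The helper divides by `1024**3`, the same divisor the loading cell uses when it prints the footprint.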

## 5. Test Inference

Let's try a simple prompt to see if the model generates text.

In [None]:
if model and tokenizer:
    prompt = "What is the capital of France?"
    print(f"\nPrompt: {prompt}")

    # Tokenize, then move the tensors to the device holding the model's first parameters.
    # This covers CPU, single-GPU, and device_map="auto" loads alike; leaving the inputs on
    # CPU while the model sits on CUDA can trigger a device-mismatch warning or error.
    input_device = next(model.parameters()).device
    inputs = tokenizer(prompt, return_tensors="pt").to(input_device)

    print("Generating response...")
    try:

        outputs = model.generate(**inputs, max_new_tokens=50, pad_token_id=tokenizer.eos_token_id)
        generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        
        print("\nGenerated Text:")
        print(generated_text)
    except Exception as e:
        print(f"Error during inference: {e}")
        message = str(e).lower()  # Lowercase once so the substring checks below are case-insensitive
        if "expected scalar type half but found float" in message and USE_CUDA and QUANTIZATION_BITS == 4:
            print("Hint: This error might occur if bnb_4bit_compute_dtype=torch.float16 was used and the GPU has limited float16 support, or an operation wasn't correctly cast. Try torch.bfloat16 if your GPU supports it (Ampere architecture or newer), or remove bnb_4bit_compute_dtype.")
        elif "cuda out of memory" in message:
            print("Hint: CUDA out of memory. Try a smaller model, reduce batch size (if applicable for training/fine-tuning), or use more aggressive quantization if possible.")
else:
    print("Model or tokenizer not loaded. Skipping inference test.")
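The error handling in the cell above works by matching substrings of the exception message. When doing this, both the message and the search string need the same casing, otherwise the check silently never fires. A minimal sketch of case-insensitive matching as a small pure function follows; the name `inference_hint` and the hint wording are assumptions made for illustration.

```python
def inference_hint(error_message):
    """Map a runtime error message to a troubleshooting hint, or None if unrecognized."""
    message = error_message.lower()  # Lowercase the message; the patterns below are lowercase too
    if "expected scalar type half but found float" in message:
        return "dtype mismatch: check bnb_4bit_compute_dtype"
    if "out of memory" in message:
        return "out of memory: use a smaller model or more aggressive quantization"
    return None


print(inference_hint("RuntimeError: expected scalar type Half but found Float"))
```

Because the function takes a plain string, it can be checked against sample messages without a GPU or a loaded model.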

## 6. Conclusion

This notebook provided a basic framework for loading the DeepSeek model with quantization options.

**Next Steps:**
*   **Verify Model ID:** Double-check the `MODEL_ID` on Hugging Face Hub.
*   **Evaluate Performance:** Measure actual inference speed and output quality.
*   **Advanced Quantization:** If loading directly in 4-bit with `bitsandbytes` does not give good enough results, explore methods like GPTQ (`auto-gptq`) or AWQ (`autoawq`), which require a calibration dataset.
*   **CPU Quantization:** If you need optimized CPU performance, look into PyTorch's dynamic or static quantization (`torch.quantization`, exposed as `torch.ao.quantization` in newer releases) or `optimum` with ONNX Runtime and its CPU quantization capabilities.