
- [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)

- [Qwen2.5-Coder-32B-Instruct](https://www.youtube.com/channel/UCfOvNb3xj28SNqPQ_JIbumg/community?lb=UgkxOmW9TwEVkoTFxfz7kJDUhnh4Eetr2ar5)

### Verify everything is set up correctly

In [1]:
import torch
import transformers
import bitsandbytes
import accelerate

print(f"PyTorch version: {torch.__version__}")
print(f"Transformers version: {transformers.__version__}")
print(f"Bitsandbytes version: {bitsandbytes.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name()}")
    print(f"CUDA memory allocated: {torch.cuda.memory_allocated()/1024**2:.2f}MB")

PyTorch version: 2.5.1+cu121
Transformers version: 4.47.0.dev0
Bitsandbytes version: 0.44.1
CUDA available: True
CUDA device: NVIDIA GeForce GTX 1080 Ti
CUDA memory allocated: 0.00MB


In [2]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import os
from huggingface_hub import login

# Get API token from environment variable
api_key = os.getenv('HUGGING_FACE_HUB_TOKEN')
if api_key is None:
    raise ValueError("HUGGING_FACE_HUB_TOKEN not found in environment variables")

# Login to Hugging Face
login(token=api_key)

# Check CUDA
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


Using device: cuda
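
The note above indicates that `HF_TOKEN` was already set while the cell looked up `HUGGING_FACE_HUB_TOKEN`. A minimal sketch of checking both names in one place is shown here; `resolve_hf_token` is a hypothetical helper written for illustration, not part of `huggingface_hub`, and the lookup order is an assumption.

```python
import os


def resolve_hf_token(env=None, names=("HF_TOKEN", "HUGGING_FACE_HUB_TOKEN")):
    """Hypothetical helper: return (variable_name, token) for the first
    non-empty variable in `names`, or raise if none is set."""
    env = os.environ if env is None else env
    for name in names:
        value = env.get(name, "").strip()
        if value:
            return name, value
    raise ValueError(f"None of {names} found in environment variables")


# Usage in the notebook (not executed here, since it needs a real token):
#   source, api_key = resolve_hf_token()
#   login(token=api_key)
```

Returning the variable name alongside the token makes it easy to print which source was used without printing the token itself.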


In [3]:
if torch.cuda.is_available():
    print(f"GPU Memory Before Loading: {torch.cuda.memory_allocated()/1024**2:.2f}MB")

GPU Memory Before Loading: 0.00MB
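
`torch.cuda.memory_allocated()` only counts tensors that PyTorch itself has allocated in this process, so `0.00MB` does not mean the whole card is free. The sketch below additionally reports device-wide free and total memory through `torch.cuda.mem_get_info()`. The helper names `format_mb` and `gpu_memory_report` are illustrative assumptions, and the function returns `None` when PyTorch or a CUDA device is not available.

```python
def format_mb(num_bytes):
    """Format a byte count the same way as the cells above (1024**2 bytes per MB)."""
    return f"{num_bytes / 1024**2:.2f}MB"


def gpu_memory_report(device=0):
    """Hypothetical helper: summarize device-wide and PyTorch-side memory use."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    # Free and total bytes on the device, as reported by the CUDA driver.
    free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    return {
        "free": format_mb(free_bytes),
        "total": format_mb(total_bytes),
        "allocated_by_torch": format_mb(torch.cuda.memory_allocated(device)),
        "reserved_by_torch": format_mb(torch.cuda.memory_reserved(device)),
    }


report = gpu_memory_report()
print(report if report is not None else "No CUDA device visible to PyTorch")
```

Comparing `free` against `total` before loading shows how much memory other processes (for example a desktop session) are already holding.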


In [4]:
try:
    # For 11GB VRAM, let's use a smaller model or enable quantization
    model_name = "Qwen/Qwen1.5-7B-Chat"  # Using a smaller 7B model instead of 32B
    
    # Load model with 4-bit quantization (swap in load_in_8bit=True for 8-bit)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,  # Use float16 for reduced memory
        load_in_4bit=True,  # Enable 4-bit quantization
        device_map="auto"
    )
    
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Prepare prompt
    prompt = "write a quick sort algorithm."
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]

    # Generate response
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
    
    # Adjust generation parameters for memory efficiency
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.2,
        top_p=0.95,
        num_beams=1,  # Single beam (no beam search) to save memory
    )
    
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    print("Response:", response)

except Exception as e:
    print(f"An error occurred: {str(e)}")
finally:
    # Clear CUDA cache after generation
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        print(f"Final GPU Memory: {torch.cuda.memory_allocated()/1024**2:.2f}MB")

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
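
The warning above asks for a `BitsAndBytesConfig` passed through `quantization_config`. A minimal sketch of that form follows. The helper names `quantization_kwargs` and `load_quantized` are hypothetical, and the choice of `float16` as the 4-bit compute dtype is an assumption carried over from the `torch_dtype` used in the cell; the loader is defined but not called here because it needs a GPU and a model download.

```python
def quantization_kwargs(bits):
    """Hypothetical helper: map a bit width to BitsAndBytesConfig keyword arguments."""
    if bits == 4:
        return {"load_in_4bit": True}
    if bits == 8:
        return {"load_in_8bit": True}
    raise ValueError("bits must be 4 or 8")


def load_quantized(model_name, bits=4):
    """Sketch: load a causal LM with bitsandbytes quantization via quantization_config."""
    # Imported lazily so the helper above can be used without a GPU stack installed.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    kwargs = quantization_kwargs(bits)
    if bits == 4:
        # Assumption: compute in float16, matching torch_dtype in the original cell.
        kwargs["bnb_4bit_compute_dtype"] = torch.float16

    quant_config = BitsAndBytesConfig(**kwargs)
    return AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quant_config,
        device_map="auto",
    )


# Usage in the notebook:
#   model = load_quantized("Qwen/Qwen1.5-7B-Chat", bits=4)
```

Passing the configuration object replaces the deprecated `load_in_4bit` / `load_in_8bit` keyword arguments without changing which quantization mode is requested.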


model.safetensors.index.json:   0%|          | 0.00/31.7k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/4 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/3.99G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/3.96G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/3.96G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/3.54G [00:00<?, ?B/s]

An error occurred: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules in 32-bit, you need to set `llm_int8_enable_fp32_cpu_offload=True` and pass a custom `device_map` to `from_pretrained`. Check https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu for more details. 
Final GPU Memory: 0.00MB
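
To put the error in context, a rough estimate of the weight footprint can be derived from the four shard sizes in the download output. This is a back-of-the-envelope sketch, assuming the checkpoint stores 16-bit weights; it ignores activations, the KV cache, CUDA context overhead, quantization metadata, and any layers kept in higher precision. `estimate_weights_gb` is a hypothetical helper.

```python
# Shard sizes in GB, copied from the download progress output above.
SHARD_SIZES_GB = [3.99, 3.96, 3.96, 3.54]


def estimate_weights_gb(checkpoint_gb, checkpoint_bits=16, target_bits=4):
    """Scale the checkpoint size by the ratio of target to stored bits per weight."""
    return checkpoint_gb * target_bits / checkpoint_bits


checkpoint_gb = sum(SHARD_SIZES_GB)
for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    size_gb = estimate_weights_gb(checkpoint_gb, target_bits=bits)
    print(f"{label}: ~{size_gb:.2f} GB of weights")
```

Under these assumptions the 4-bit weights alone come out well below 11GB, which suggests that the device placement chosen by `device_map="auto"`, rather than the raw quantized size, may be what triggered the offload error. That is an inference from the arithmetic, not something the log confirms. The error message itself names the two documented options: free more GPU memory, or set `llm_int8_enable_fp32_cpu_offload=True` together with a custom `device_map`.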
