# Load Llama 3.2 1B, Apply LoRA, and Push Adapters to Hugging Face

This notebook demonstrates how to:
1. Load the Llama 3.2 1B Instruct model from Hugging Face
2. Apply LoRA adapters (without training)
3. Push the LoRA adapters to Hugging Face Hub


## Install Dependencies

Make sure you have the required packages installed.


In [None]:
# Uncomment and run if needed
# !pip install transformers accelerate torch huggingface_hub peft python-dotenv


## Import Libraries


In [52]:
import torch
import os
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
from huggingface_hub import login
from dotenv import load_dotenv


## Authenticate with Hugging Face

You need to be logged in to Hugging Face to:
1. Access gated models like Llama 3.2
2. Push models to your account

Get your token from: https://huggingface.co/settings/tokens


In [53]:
# Load HF_USERNAME and HF_TOKEN from a local .env file, if one exists.
# The configuration cell below prompts for any missing value and performs the login.
# Make sure you have accepted the Llama 3.2 license on Hugging Face

load_dotenv()

HF_USERNAME = os.environ.get("HF_USERNAME")
HF_TOKEN = os.environ.get("HF_TOKEN")


## Configuration


In [54]:
# Model configuration
MODEL_ID = "meta-llama/Llama-3.2-1B-Instruct"

# Your Hugging Face username and token (prompt only if they were not found in .env)
if HF_USERNAME is None:
    HF_USERNAME = input("Enter your Hugging Face username: ")

if HF_TOKEN is None:
    from getpass import getpass  # hides the token while it is typed
    HF_TOKEN = getpass("Enter your Hugging Face token: ")

login(token=HF_TOKEN)

NEW_MODEL_NAME = "llama-3.2-1b-lora-adapters"  # <-- Change this to your desired model name

# Full repo ID for pushing
PUSH_REPO_ID = f"{HF_USERNAME}/{NEW_MODEL_NAME}"

# LoRA Configuration
LORA_R = 16  # Rank of the LoRA matrices
LORA_ALPHA = 32  # Alpha scaling factor
LORA_DROPOUT = 0.05  # Dropout probability
TARGET_MODULES = ["q_proj", "k_proj", "v_proj", "o_proj"]


Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


## Load the Model and Tokenizer


In [55]:
print(f"Loading model: {MODEL_ID}")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Load model
# Using float16 to save memory, adjust based on your hardware
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    device_map="auto",  # Automatically distribute across available devices
    trust_remote_code=True
)

print("Model loaded successfully!")
print(f"Model has {model.num_parameters():,} parameters")


Loading model: meta-llama/Llama-3.2-1B-Instruct
Model loaded successfully!
Model has 1,235,814,400 parameters


## Test the Original Model (Optional)

Let's generate some text before attaching the LoRA adapters, so we have a baseline to compare against.


In [56]:
def generate_text(model, tokenizer, prompt, max_new_tokens=100):
    """Generate text from a prompt."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.01,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id
        )
    
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Test prompt
test_prompt = "What is the meaning of life?"
print("=" * 50)
print("ORIGINAL MODEL OUTPUT:")
print("=" * 50)
print(generate_text(model, tokenizer, test_prompt))


ORIGINAL MODEL OUTPUT:
What is the meaning of life? A question that has puzzled philosophers, theologians, scientists, and everyday people for centuries. The answer, of course, is subjective and can vary greatly depending on one's beliefs, values, and experiences.

Some people believe that the meaning of life is to find happiness, fulfillment, and purpose. They may think that the key to a happy life is to pursue one's passions, build meaningful relationships, and cultivate a sense of gratitude and contentment.

Others may believe that the meaning of life is to


## Apply LoRA Adapters

We'll attach LoRA (Low-Rank Adaptation) adapters to the model without training them. The adapters keep their default initialization: random A matrices and zero B matrices.


In [57]:
# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    target_modules=TARGET_MODULES,
    bias="none",
)

print("LoRA Configuration:")
print(f"  - Rank (r): {LORA_R}")
print(f"  - Alpha: {LORA_ALPHA}")
print(f"  - Dropout: {LORA_DROPOUT}")
print(f"  - Target modules: {TARGET_MODULES}")


LoRA Configuration:
  - Rank (r): 16
  - Alpha: 32
  - Dropout: 0.05
  - Target modules: ['q_proj', 'k_proj', 'v_proj', 'o_proj']


In [58]:
# Apply LoRA to the model
print("Applying LoRA adapters to the model...")
peft_model = get_peft_model(model, lora_config)

print("\nLoRA adapters applied successfully!")
print(f"Trainable parameters: {peft_model.num_parameters(only_trainable=True):,}")
print(f"Total parameters: {peft_model.num_parameters():,}")
print(f"Trainable %: {100 * peft_model.num_parameters(only_trainable=True) / peft_model.num_parameters():.2f}%")


Applying LoRA adapters to the model...

LoRA adapters applied successfully!
Trainable parameters: 3,407,872
Total parameters: 1,239,222,272
Trainable %: 0.28%


## View LoRA Adapter Info


In [59]:
# Show LoRA adapter modules
print("LoRA adapter modules:")
peft_model.print_trainable_parameters()

print("\nLoRA layers added:")
for name, param in peft_model.named_parameters():
    if "lora" in name.lower():
        print(f"  - {name}: {param.shape}")


LoRA adapter modules:
trainable params: 3,407,872 || all params: 1,239,222,272 || trainable%: 0.2750

LoRA layers added:
  - base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight: torch.Size([16, 2048])
  - base_model.model.model.layers.0.self_attn.q_proj.lora_B.default.weight: torch.Size([2048, 16])
  - base_model.model.model.layers.0.self_attn.k_proj.lora_A.default.weight: torch.Size([16, 2048])
  - base_model.model.model.layers.0.self_attn.k_proj.lora_B.default.weight: torch.Size([512, 16])
  - base_model.model.model.layers.0.self_attn.v_proj.lora_A.default.weight: torch.Size([16, 2048])
  - base_model.model.model.layers.0.self_attn.v_proj.lora_B.default.weight: torch.Size([512, 16])
  - base_model.model.model.layers.0.self_attn.o_proj.lora_A.default.weight: torch.Size([16, 2048])
  - base_model.model.model.layers.0.self_attn.o_proj.lora_B.default.weight: torch.Size([2048, 16])
  - base_model.model.model.layers.1.self_attn.q_proj.lora_A.default.weight: torch.Size([1
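
The shapes listed above are enough to reproduce the trainable parameter count by hand. The sketch below is a back-of-the-envelope check, not something PEFT runs. It assumes the model has 16 decoder layers (the listing above is cut off, so this number is an assumption) and that every layer has the same four projection shapes shown for layer 0.

```python
# Back-of-the-envelope check of the trainable parameter count.
# Assumptions (not printed in full by the notebook): 16 decoder layers,
# each with the projection shapes listed for layer 0.
NUM_LAYERS = 16
HIDDEN = 2048   # input width of every targeted projection
KV_DIM = 512    # output width of k_proj and v_proj
RANK = 16       # LORA_R

# (in_features, out_features) of each targeted projection
projections = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    """lora_A has shape (rank, in_features); lora_B has shape (out_features, rank)."""
    return rank * in_features + out_features * rank


per_layer = sum(lora_params(i, o, RANK) for i, o in projections.values())
total = per_layer * NUM_LAYERS

print(f"Per layer:  {per_layer:,}")   # 212,992
print(f"All layers: {total:,}")       # 3,407,872
```

Under these assumptions the total matches the 3,407,872 trainable parameters reported by `print_trainable_parameters()`.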

## Test the Model with LoRA (Optional)

Let's see how the model behaves with the LoRA adapters attached. Since the adapters are untrained, the output should match the base model: the LoRA A matrices are initialized randomly and the B matrices to zero, so the update B·A is zero and the adapters have no effect until they are trained.
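
To see why a zero B matrix leaves the output untouched, here is a toy sketch of the usual LoRA forward pass, `W x + (alpha / r) * B (A x)`, written with plain Python lists. It is an illustration only and not PEFT's implementation: the helper names (`matvec`, `lora_forward`) and the tiny dimensions are made up for this example, and dropout is omitted.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Base projection plus the scaled low-rank update (alpha / r) * B (A x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


random.seed(0)
d_in, d_out, r, alpha = 6, 4, 2, 4  # toy sizes, chosen only for illustration

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(r)]  # random init
B = [[0.0] * r for _ in range(d_out)]                              # zero init
x = [random.gauss(0, 1) for _ in range(d_in)]

base_out = matvec(W, x)
untrained_out = lora_forward(W, A, B, x, alpha, r)
print(untrained_out == base_out)  # True: a zero B cancels the whole update

# Once B moves away from zero (as it would during training), the output changes.
B[0][0] = 0.5
trained_out = lora_forward(W, A, B, x, alpha, r)
print(trained_out == base_out)    # False
```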


In [60]:
print("=" * 50)
print("MODEL WITH LORA OUTPUT:")
print("=" * 50)
print(generate_text(peft_model, tokenizer, test_prompt))


MODEL WITH LORA OUTPUT:
What is the meaning of life? A question that has puzzled philosophers, theologians, scientists, and everyday people for centuries. The answer, of course, is subjective and can vary greatly depending on one's beliefs, values, and experiences.

Some people believe that the meaning of life is to find happiness, fulfillment, and purpose. They may think that the key to a happy life is to pursue one's passions, build meaningful relationships, and cultivate a sense of gratitude and contentment.

Others may believe that the meaning of life is to


## Push the LoRA Adapters to Hugging Face Hub


In [61]:
print(f"Pushing LoRA adapters to: {PUSH_REPO_ID}")
print("-" * 50)

# Push only the LoRA adapters (not the full model)
peft_model.push_to_hub(
    PUSH_REPO_ID,
    commit_message="Upload LoRA adapters for Llama 3.2 1B",
    private=True  # Set to False if you want a public repo
)

print("LoRA adapters pushed successfully!")


Pushing LoRA adapters to: moo3030/llama-3.2-1b-lora-adapters
--------------------------------------------------


Processing Files (1 / 1): 100%|██████████| 13.6MB / 13.6MB,  316kB/s  
New Data Upload: 100%|██████████| 13.6MB / 13.6MB,  316kB/s  


LoRA adapters pushed successfully!
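
The 13.6MB upload can be sanity-checked against the trainable parameter count. The sketch below assumes the adapter weights are stored as 32-bit floats (4 bytes each). That is an assumption consistent with the upload size, not something the notebook states, and it ignores the small file header and the adapter config file.

```python
# Rough size estimate for the uploaded adapter weights.
TRAINABLE_PARAMS = 3_407_872  # reported earlier by print_trainable_parameters()
BYTES_PER_PARAM = 4           # assumption: float32 storage

size_bytes = TRAINABLE_PARAMS * BYTES_PER_PARAM
size_mb = size_bytes / 1e6

print(f"{size_bytes:,} bytes = {size_mb:.1f} MB")  # 13,631,488 bytes = 13.6 MB
```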


In [62]:
# Push the tokenizer as well
tokenizer.push_to_hub(
    PUSH_REPO_ID,
    commit_message="Upload tokenizer for Llama 3.2 1B LoRA"
)

print("Tokenizer pushed successfully!")


Processing Files (1 / 1): 100%|██████████| 17.2MB / 17.2MB,  0.00B/s  
New Data Upload: |          |  0.00B /  0.00B,  0.00B/s  
No files have been modified since last commit. Skipping to prevent empty commit.


Tokenizer pushed successfully!


## Create Model Card (Optional)


In [None]:
from huggingface_hub import ModelCard, ModelCardData

# Calculate trainable params for the card
trainable_params = peft_model.num_parameters(only_trainable=True)
total_params = peft_model.num_parameters()

# Create a model card
card_data = ModelCardData(
    language="en",
    license="llama3.2",
    base_model=MODEL_ID,
    tags=["llama", "lora", "peft", "adapter"]
)

card_content = f"""
# {NEW_MODEL_NAME}

This is a LoRA adapter for [{MODEL_ID}](https://huggingface.co/{MODEL_ID}).

## Model Details

- **Base Model:** {MODEL_ID}
- **Adapter Type:** LoRA (Low-Rank Adaptation)
- **LoRA Rank (r):** {LORA_R}
- **LoRA Alpha:** {LORA_ALPHA}
- **LoRA Dropout:** {LORA_DROPOUT}
- **Target Modules:** {', '.join(TARGET_MODULES)}
- **Trainable Parameters:** {trainable_params:,} ({100 * trainable_params / total_params:.2f}%)

## Note

⚠️ These LoRA adapters have NOT been trained. They contain only their initial values (random A matrices, zero B matrices), so they do not change the base model's output.
This is intended for experimental/educational purposes.

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model
base_model = AutoModelForCausalLM.from_pretrained("{MODEL_ID}")
tokenizer = AutoTokenizer.from_pretrained("{MODEL_ID}")

# Load LoRA adapters
model = PeftModel.from_pretrained(base_model, "{PUSH_REPO_ID}")
```

## License

This adapter inherits the Llama 3.2 Community License from the base model.
"""

card = ModelCard(card_content)
card.push_to_hub(PUSH_REPO_ID)

print("Model card created and pushed!")


## Load and Test the Pushed Model

Now let's verify that the pushed adapters work by loading them fresh from Hugging Face and generating predictions.


In [63]:
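import gc

from peft import PeftModel

# First, drop every reference to the existing model to simulate a fresh load.
# get_peft_model injects the adapters into the base model in place, so `model`
# still holds the weights and must be deleted as well.
del peft_model, model
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()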

print(f"Loading LoRA adapters from: {PUSH_REPO_ID}")
print("-" * 50)

# Load a fresh base model
base_model_fresh = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)

# Load the LoRA adapters from Hugging Face Hub
loaded_model = PeftModel.from_pretrained(base_model_fresh, PUSH_REPO_ID)

print("LoRA adapters loaded successfully from Hugging Face Hub!")
print(f"Model parameters: {loaded_model.num_parameters():,}")


Loading LoRA adapters from: moo3030/llama-3.2-1b-lora-adapters
--------------------------------------------------
LoRA adapters loaded successfully from Hugging Face Hub!
Model parameters: 1,239,222,272


In [65]:
# Generate predictions with the loaded model
print("=" * 50)
print("LOADED MODEL PREDICTIONS:")
print("=" * 50)

test_prompts = [
    "What is the meaning of life?",
    "Explain quantum computing in simple terms.",
    "Write a haiku about programming."
]

for prompt in test_prompts:
    print(f"\nPrompt: {prompt}")
    print("-" * 40)
    output = generate_text(loaded_model, tokenizer, prompt, max_new_tokens=80)
    # Remove the prompt from the output for cleaner display
    response = output[len(prompt):].strip() if output.startswith(prompt) else output
    print(f"Response: {response}")
    print()


LOADED MODEL PREDICTIONS:

Prompt: What is the meaning of life?
----------------------------------------
Response: A question that has puzzled philosophers, theologians, scientists, and everyday people for centuries. The answer, of course, is subjective and can vary greatly depending on one's beliefs, values, and experiences.

Some people believe that the meaning of life is to find happiness, fulfillment, and purpose. They may think that the key to a happy life is to pursue one's passions, build meaningful relationships,


Prompt: Explain quantum computing in simple terms.
----------------------------------------
Response: Imagine you have a huge library with an infinite number of books, each representing a possible solution to a problem. In classical computing, you would have to look through the entire library one book at a time to find the solution. But with quantum computing, you can use the principles of quantum mechanics to quickly and efficiently search through the library.

Here