
# 🤗 Hugging Face — Attention & Adapters Quickstart

This notebook demonstrates **Hugging Face** examples for:
- Enabling **FlashAttention-2** (or SDPA fallback) in `transformers`
- Adding **LoRA** adapters with **PEFT** (LongLoRA-style LoRA+ notes)
- Optional **bitsandbytes** quantization for memory savings
- Minimal **SFT training loop** with PEFT on a tiny sample



> **Default base model:** `TinyLlama/TinyLlama-1.1B-Chat-v1.0` (≈1.1B params).  
> Runs comfortably on a single **A100** in Colab, even without quantization.  
> Alternatives you can try:
> - `Qwen/Qwen2.5-1.5B-Instruct` (~1.5B)
> - `microsoft/Phi-3-mini-4k-instruct` (~3.8B) — use 4-bit for comfort on smaller GPUs
>
> For larger models (3–7B), enable the **bitsandbytes 4-bit** cell.


In [None]:

%pip install -U transformers accelerate peft datasets
%pip install -U --index-url https://download.pytorch.org/whl/cu121 torch
%pip install -U flash-attn       # optional; requires matching CUDA & PyTorch
%pip install -U bitsandbytes     # optional for 8-bit / 4-bit loading


Collecting transformers
  Downloading transformers-4.56.2-py3-none-any.whl.metadata (40 kB)
Collecting datasets
  Downloading datasets-4.1.1-py3-none-any.whl.metadata (18 kB)
Collecting pyarrow>=21.0.0 (from datasets)
  Downloading pyarrow-21.0.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Downloading transformers-4.56.2-py3-none-any.whl (11.6 MB)
Downloading datasets-4.1.1-py3-none-any.whl (503 kB)
Downloading pyarrow-21.0.0-cp312-cp312-manylinux_2_28_x86_64.whl (42.8 MB)
Installing co


## 1) Enable FlashAttention-2 (or SDPA) in `transformers`

Pass `attn_implementation="flash_attention_2"` to `from_pretrained` when `flash-attn` is installed and a CUDA GPU is available. Otherwise, use `"sdpa"`, PyTorch's built-in scaled dot-product attention.
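
The choice between the two backends can be sketched as a small helper. This is a minimal sketch and not part of `transformers`: `pick_attn_backend` and `flash_attn_installed` are hypothetical names, and the rule it encodes (FlashAttention-2 only when both a CUDA device and the `flash_attn` package are present) is a simplifying assumption.

```python
import importlib.util


def flash_attn_installed() -> bool:
    """True if the `flash_attn` package can be found in this environment."""
    return importlib.util.find_spec("flash_attn") is not None


def pick_attn_backend(has_cuda: bool, has_flash_attn: bool) -> str:
    """Hypothetical helper: choose the string passed as `attn_implementation`."""
    if has_cuda and has_flash_attn:
        return "flash_attention_2"
    return "sdpa"


# On a CPU-only runtime the helper always selects SDPA.
print(pick_attn_backend(has_cuda=False, has_flash_attn=flash_attn_installed()))
```

In this notebook, `has_cuda` would come from `torch.cuda.is_available()`. A try/except around the actual load is still worth keeping, since loading can fail for other reasons, such as a version mismatch between `flash-attn`, CUDA, and PyTorch.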


In [None]:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # replace with a larger long-context model when you have a GPU
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Try FlashAttention-2 first; fall back to SDPA if unavailable.
# Note: recent transformers releases warn that `torch_dtype` is deprecated in favor of `dtype`.
attn_backend = "flash_attention_2"
try:
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
        device_map="auto" if torch.cuda.is_available() else None,
        attn_implementation=attn_backend,
    )
except Exception as e:
    print("flash_attention_2 not available, falling back to sdpa. Err:", e)
    attn_backend = "sdpa"
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
        device_map="auto" if torch.cuda.is_available() else None,
        attn_implementation=attn_backend,
    )

print("Using attention backend:", attn_backend)

inputs = tokenizer("Explain FlashAttention-2 briefly.", return_tensors="pt")
if torch.cuda.is_available():
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(out[0], skip_special_tokens=True))


`torch_dtype` is deprecated! Use `dtype` instead!


flash_attention_2 not available, falling back to sdpa. Err: FlashAttention2 has been toggled on, but it cannot be used due to the following error: Flash Attention 2 is not available on CPU. Please make sure torch can access a CUDA device.


Using attention backend: sdpa
Explain FlashAttention-2 briefly.



## 2) Optional: 4-bit / 8-bit loading with **bitsandbytes**

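Loading the weights in 4-bit or 8-bit precision reduces memory use, which helps fit larger models on a single GPU. The cell below is commented out; uncomment it to reload the model with quantization.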
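
To get a feel for the savings, here is a rough back-of-the-envelope sketch. It counts the weights only and assumes a round 1.1B parameters; quantization constants, activations, the KV cache, optimizer state, and any layers kept in higher precision are all ignored. `weight_memory_gb` is a hypothetical helper name.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 1.1e9  # assumed round figure for TinyLlama-1.1B
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(n_params, bits):.2f} GB")
```

Under these assumptions the weights take about 2.2 GB at 16-bit and about 0.55 GB at 4-bit, which is why 4-bit loading matters more as the model grows to 3–7B parameters.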


In [None]:

# from transformers import BitsAndBytesConfig
# bnb_config = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_quant_type="nf4",
#     bnb_4bit_compute_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
# )
# model = AutoModelForCausalLM.from_pretrained(
#     model_id,
#     quantization_config=bnb_config,
#     device_map="auto" if torch.cuda.is_available() else None,
#     attn_implementation="sdpa",   # or "flash_attention_2" if flash-attn is installed
# )



## 3) Add **LoRA** Adapters with **PEFT**

This follows the LoRA recipe used in LongLoRA. For **LoRA+** (the improved variant described in the LongLoRA paper), also make the **embedding** and **normalization** layers trainable.


In [None]:

from peft import LoraConfig, get_peft_model

# Reuse `model` from above (or load a larger base).
# Target modules depend on the architecture. The names below (q_proj, k_proj, etc.) fit LLaMA-style models; adjust them for other model families.
lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"],  # adjust per model
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(model, lora_cfg)
print("LoRA attached. Trainable params:", sum(p.numel() for p in peft_model.parameters() if p.requires_grad))

# LongLoRA-style tweak (LoRA+): also train embeddings & norms
for name, p in peft_model.named_parameters():
    if any(k in name for k in ["embed_tokens", "wte", "ln_", "norm"]):
        p.requires_grad = True

print("Trainable after LoRA+ tweak:", sum(p.numel() for p in peft_model.parameters() if p.requires_grad))




LoRA attached. Trainable params: 6307840
Trainable after LoRA+ tweak: 71936000
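
Both counts printed above can be reproduced by hand. The sketch below assumes the TinyLlama-1.1B shapes from the model's public `config.json` (hidden size 2048, MLP size 5632, 22 layers, 32 attention heads, 4 key/value heads, vocabulary 32000) and assumes that the LoRA+ tweak matches only the input embedding and the RMSNorm weights. `lora_params` is a hypothetical helper name.

```python
# Assumed TinyLlama-1.1B config values (not taken from this notebook).
hidden, intermediate, n_layers = 2048, 5632, 22
n_heads, n_kv_heads, vocab = 32, 4, 32000
r = 8  # LoRA rank, as in lora_cfg

head_dim = hidden // n_heads      # 64
kv_dim = n_kv_heads * head_dim    # 256, smaller than hidden due to grouped-query attention

# (in_features, out_features) of each targeted linear layer in one decoder block.
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) to each targeted layer."""
    return rank * (in_features + out_features)


lora_total = n_layers * sum(lora_params(i, o, r) for i, o in shapes.values())

# LoRA+ extras: token embedding plus two norms per block and one final norm.
extras = vocab * hidden + (2 * n_layers + 1) * hidden

print("LoRA only:", lora_total)
print("LoRA + embeddings + norms:", lora_total + extras)
```

Most of the jump after the tweak comes from the embedding matrix (32000 × 2048 entries), so LoRA+ trades a much larger trainable footprint for the extra capacity.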



## 4) Minimal SFT Training Loop (Toy Example)

A tiny example to show the **PEFT** workflow. Replace the dataset with your class data.


In [None]:

from torch.utils.data import Dataset, DataLoader
import torch


class TinyTextDataset(Dataset):
    def __init__(self, tokenizer, texts):
        self.data = []
        for t in texts:
            ids = tokenizer(
                t,
                return_tensors="pt",
                truncation=True,
                padding="max_length",
                max_length=64
            )
            # Remove the extra leading batch dimension
            ids = {k: v.squeeze(0) for k, v in ids.items()}
            ids["labels"] = ids["input_ids"].clone()
            self.data.append(ids)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i):
        return self.data[i]

texts = [
    "Summarize: FlashAttention-2 reduces memory I/O using tiling and online softmax.",
    "Explain: Shifted sparse attention groups tokens and shifts half heads for info flow.",
    "Describe: LoRA updates a small low-rank set of weights for efficient fine-tuning.",
]

ds = TinyTextDataset(tokenizer, texts)
dl = DataLoader(ds, batch_size=2, shuffle=True)

device = next(peft_model.parameters()).device
optim = torch.optim.AdamW([p for p in peft_model.parameters() if p.requires_grad], lr=2e-4)

peft_model.train()
for step, batch in enumerate(dl):
    batch = {k: v.to(device) for k, v in batch.items()}
    out = peft_model(**batch)
    loss = out.loss
    loss.backward()
    optim.step()
    optim.zero_grad()
    print(f"step {step} loss {loss.item():.4f}")


step 0 loss 11.7856
step 1 loss 5.1302
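
In the dataset above, `labels` is a straight copy of `input_ids`, so the padded positions also count toward the loss. A common refinement is to set the labels at padded positions to `-100`, the value the cross-entropy loss in `transformers` skips. The sketch below shows the idea on plain Python lists; `mask_padding` is a hypothetical helper and the token ids are made up for illustration.

```python
IGNORE_INDEX = -100  # label value ignored by the loss


def mask_padding(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [
        tok if keep == 1 else IGNORE_INDEX
        for tok, keep in zip(input_ids, attention_mask)
    ]


# Made-up example: three real tokens followed by two padding tokens.
input_ids = [1, 450, 9122, 2, 2]
attention_mask = [1, 1, 1, 0, 0]
print(mask_padding(input_ids, attention_mask))  # [1, 450, 9122, -100, -100]
```

With tensors, the same effect can be had inside `TinyTextDataset` by writing `ids["labels"][ids["attention_mask"] == 0] = -100` right after the labels are cloned. Loss values will then differ from the ones printed above, since fewer positions are scored.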



## 5) Inference with the LoRA Adapter


In [None]:

peft_model.eval()
prompt = "What is one benefit of FlashAttention-2?"
inputs = tokenizer(prompt, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}
with torch.no_grad():
    gen = peft_model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(gen[0], skip_special_tokens=True))


What is one benefit of FlashAttention-2?
