# 🔧 Practical Blueprint: QLoRA Fine-Tuning for C & Linux System Programming (T4-ready)

This Colab-ready notebook fine-tunes a 7B instruct model using **QLoRA** on a **T4 (16GB) GPU**.

## What you get
- QLoRA training on a small example dataset (replace it with your own)
- Domain-focused formatting for **C & Linux system programming**
- Safe defaults for T4: 4-bit quantization, small per-device batch, gradient accumulation
- Quick evaluation and (optional) compile-check of generated C code
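
To make the case for 4-bit quantization concrete, here is a back-of-the-envelope estimate of weight memory. It is only a sketch: `N_PARAMS` is an approximate parameter count for a 7B-class model (an assumption, not read from the checkpoint), and the figures ignore activations, optimizer state, and quantization constants.

```python
# Rough weight-memory estimate. N_PARAMS is an approximation, not read from the model.
N_PARAMS = 7.24e9
GIB = 1024 ** 3


def weight_gib(n_params: float, bits_per_param: float) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / GIB


for name, bits in [("fp32", 32), ("fp16", 16), ("nf4 (4-bit)", 4)]:
    print(f"{name:12s} {weight_gib(N_PARAMS, bits):6.1f} GiB")
```

Under these assumptions the fp16 weights alone need about 13.5 GiB, which leaves very little of a T4's memory for training, while the 4-bit weights need about 3.4 GiB.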

**Tip:** For real training, plug in your larger dataset and extend training time/epochs.


## 0) Runtime & GPU check
- In Colab: `Runtime → Change runtime type → GPU (T4)`

In [5]:
import torch, platform
print('CUDA available:', torch.cuda.is_available())
print('GPU:', torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'CPU')
print('Torch:', torch.__version__)
print('Python:', platform.python_version())

CUDA available: True
GPU: Tesla T4
Torch: 2.8.0+cu126
Python: 3.12.11


## 1) Install dependencies
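We pin versions known to work together on a Colab T4. After the install cell finishes, restart the runtime so the pinned versions replace the ones already imported (the GPU check above still reports Colab's preinstalled Torch 2.8.0). Skipping the restart is a likely cause of import errors such as `empty_like method already has a different docstring`.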

In [4]:
# --- Inspect environment (optional) ---
!nvidia-smi
!python -V

# --- 0) Clean out conflicting installs ---
!pip -q uninstall -y tokenizers transformers bitsandbytes triton peft trl accelerate datasets sentencepiece evaluate

# --- 1) Make sure pip can fetch binary wheels (no source builds) ---
!pip -q install -U pip setuptools wheel

# IMPORTANT: Force a prebuilt wheel for tokenizers (avoids Rust build on Py3.12)
!pip -q install --only-binary=:all: tokenizers==0.19.1

# --- 2) Install PyTorch built for CUDA 12.1 (runs on the CUDA 12.4 driver reported by nvidia-smi) ---
!pip -q install --upgrade --no-cache-dir \
  torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 \
  --index-url https://download.pytorch.org/whl/cu121

# --- 3) GPU-enabled bitsandbytes + matching triton for Torch 2.4 ---
!pip -q install bitsandbytes==0.43.3 triton==3.0.0

# --- 4) Training stack (versions compatible with our SFTTrainer usage) ---
!pip -q install transformers==4.44.2 trl==0.9.6 peft==0.12.0 accelerate==0.34.2 datasets==2.19.1 sentencepiece evaluate


Mon Sep 22 19:54:16 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   47C    P8             11W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

## 2) Configuration
Choose a base model. The default is **Mistral 7B Instruct** (Apache 2.0 license).

In [5]:
BASE_MODEL = 'mistralai/Mistral-7B-Instruct-v0.2'  # or mistralai/Mistral-7B-Instruct-v0.3 if available
OUTPUT_DIR = '/content/qlora-mistral-sysprog'
USE_4BIT = True  # QLoRA
SEQ_LEN = 2048   # keep <= 2048 for T4 comfort
EPOCHS = 1       # increase for real training
LR = 2e-4
BATCH_SIZE = 1
GRAD_ACCUM = 16  # effective batch = 16
WARMUP_RATIO = 0.05
SAVE_STEPS = 200
LOG_STEPS = 10
VAL_SET_SIZE = 0  # set >0 if you add a validation split
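
The interaction between `BATCH_SIZE`, `GRAD_ACCUM`, and dataset size is easy to misjudge, so here is a small arithmetic sketch. `optimizer_steps` is a hypothetical helper that approximates how the Hugging Face `Trainer` counts updates (micro-batches per epoch, integer-divided by the accumulation steps, with a minimum of one); it is not the library's own code.

```python
import math


def optimizer_steps(num_sequences: int, batch_size: int, grad_accum: int, epochs: int = 1) -> int:
    """Approximate number of optimizer updates for a run."""
    micro_batches = math.ceil(num_sequences / batch_size)
    return epochs * max(1, micro_batches // grad_accum)


effective_batch = 1 * 16  # BATCH_SIZE * GRAD_ACCUM
print("effective batch:", effective_batch)
print("25 sequences   :", optimizer_steps(25, 1, 16), "step(s) per epoch")
print("5000 sequences :", optimizer_steps(5000, 1, 16), "step(s) per epoch")
```

With only a few dozen sequences, one epoch amounts to a single optimizer step, so `LOG_STEPS = 10` and `SAVE_STEPS = 200` never trigger. Lower `GRAD_ACCUM` or raise `EPOCHS` when experimenting on tiny datasets.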

## 3) Prepare a **sample** domain dataset (replace with your own)
Format: instruction-style records with `instruction`, `input`, and `output` fields, targeting C & Linux system topics. We write a tiny JSONL file as a demo.

In [4]:
import json, os
os.makedirs('/content/data', exist_ok=True)
samples = [
  {
    'instruction': 'Explain the difference between fork() and vfork() with a small C example.',
    'input': '',
    'output': (
      'fork() creates a child with a copy-on-write address space; both parent and child may run concurrently. '
      'vfork() guarantees the child runs first and shares the address space until exec/exit; parent is suspended.\n\n'
      '```c\n#include <stdio.h>\n#include <sys/types.h>\n#include <unistd.h>\nint main(){ pid_t p=fork(); if(p==0){printf("child\\n");} else {printf("parent\\n");} }\n```'
    )
  },
  {
    'instruction': 'Write a C function that safely reads a file and prints its content using POSIX APIs.',
    'input': '',
    'output': (
      '```c\n#define _XOPEN_SOURCE 700\n#include <fcntl.h>\n#include <unistd.h>\n#include <stdio.h>\n#include <errno.h>\n#include <string.h>\n'
      'void print_file(const char *path){ int fd=open(path,O_RDONLY); if(fd<0){perror("open");return;} '
      'char buf[4096]; ssize_t n; while((n=read(fd,buf,sizeof(buf)))>0){ if(write(STDOUT_FILENO,buf,n)<0){perror("write");break;} } '
      'if(n<0) perror("read"); if(close(fd)<0) perror("close"); }\n```'
    )
  },
  {
    'instruction': 'Given this segfaulting code, identify the bug and provide a fix. Code: int *p=NULL; *p=42; explain.',
    'input': '',
    'output': 'Dereferencing a NULL pointer causes a segmentation fault. Allocate memory or point to a valid object before dereferencing.'
  }
]
with open('/content/data/sysprog_train.jsonl', 'w') as f:
    for s in samples:
        f.write(json.dumps(s)+'\n')
print('Saved /content/data/sysprog_train.jsonl with', len(samples), 'examples')

Saved /content/data/sysprog_train.jsonl with 3 examples
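
Before swapping in a larger file, it can help to check that every line parses and carries the fields the formatter expects. The following is a minimal sketch; `validate_jsonl_lines` and `REQUIRED_KEYS` are hypothetical names, and treating `input` as optional is an assumption based on the sample records above.

```python
import json

REQUIRED_KEYS = ("instruction", "output")  # "input" is treated as optional here


def validate_jsonl_lines(lines):
    """Return (line_number, problem) tuples; an empty list means every record looks usable."""
    problems = []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((lineno, "record is not a JSON object"))
            continue
        for key in REQUIRED_KEYS:
            value = record.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append((lineno, f"missing or empty '{key}'"))
    return problems


demo = [
    '{"instruction": "Explain fork()", "input": "", "output": "fork() creates a child process."}',
    '{"instruction": "No answer given"}',
    "not json",
]
print(validate_jsonl_lines(demo))
```

To check a real file, pass an open file handle, for example `validate_jsonl_lines(open(path))`.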


## 4) Dataset loader
Converts instruction-style JSONL records into text rendered with the model's chat template.

In [6]:
from datasets import load_dataset
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, use_fast=True)
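# These paths point at larger uploaded files (25 train and 20 validation examples in the run shown).
# To use the demo file from section 3 instead, set "train" to "/content/data/sysprog_train.jsonl".
ds = load_dataset("json", data_files={
    "train": "/content/sysprog_train_large.jsonl",
    "valid": "/content/sysprog_valid_small.jsonl",
})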

def to_chat_format(example):
    user = example["instruction"]
    if example.get("input"):
        user += f"\n\nInput:\n{example['input']}"
    messages = [
        {"role": "user", "content": user},
        {"role": "assistant", "content": example["output"]},
    ]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
    return {"text": text}

train_ds = ds["train"].map(to_chat_format, remove_columns=[c for c in ds["train"].column_names if c != "text"])
valid_ds = ds["valid"].map(to_chat_format, remove_columns=[c for c in ds["valid"].column_names if c != "text"])

# With a larger dataset, packing can be enabled in TRL's SFTTrainer (section 7).
# If memory is tight, lower SEQ_LEN to 1024 (or 1536 if memory allows).


Map:   0%|          | 0/25 [00:00<?, ? examples/s]

Map:   0%|          | 0/20 [00:00<?, ? examples/s]
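
To show what `apply_chat_template` roughly produces for a single user/assistant exchange, here is a hand-written approximation of the Mistral instruct layout. Treat it as an assumption for illustration: exact whitespace can differ between model revisions, and the tokenizer's own template remains authoritative. `render_mistral_chat` is a hypothetical helper.

```python
# Hand-written approximation of the Mistral instruct layout (not the tokenizer's template).
BOS, EOS = "<s>", "</s>"


def render_mistral_chat(messages):
    """Concatenate user/assistant turns in [INST] ... [/INST] form."""
    parts = [BOS]
    for msg in messages:
        if msg["role"] == "user":
            parts.append(f"[INST] {msg['content']} [/INST]")
        elif msg["role"] == "assistant":
            parts.append(f"{msg['content']}{EOS}")
        else:
            raise ValueError(f"unsupported role: {msg['role']!r}")
    return "".join(parts)


example = [
    {"role": "user", "content": "What does fork() return in the child?"},
    {"role": "assistant", "content": "It returns 0 in the child process."},
]
print(render_mistral_chat(example))
```

The point to notice is that the assistant turn ends with an end-of-sequence token, which is what teaches the fine-tuned model where to stop.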

## 5) QLoRA setup & tokenizer
- Load 4-bit quantized base model weights (nf4) for training adapters
- Use **fp16** compute on T4 (bf16 is not supported on T4)
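
The fp16 choice follows from the GPU generation. Here is a minimal sketch, assuming bf16 needs compute capability 8.0 or higher (Ampere and newer), while the T4 is a Turing card reporting 7.5. `pick_compute_dtype` is a hypothetical helper. In Colab the tuple could come from `torch.cuda.get_device_capability(0)`.

```python
def pick_compute_dtype(capability):
    """Return 'bfloat16' for compute capability >= 8.0, otherwise 'float16'."""
    major, _minor = capability
    return "bfloat16" if major >= 8 else "float16"


print("T4   (7, 5):", pick_compute_dtype((7, 5)))
print("A100 (8, 0):", pick_compute_dtype((8, 0)))
```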

In [7]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, TaskType
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=USE_4BIT,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, use_fast=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config if USE_4BIT else None,
    device_map='auto'
)
model.config.use_cache = False  # important for gradient checkpointing
print('Model loaded.')

Model loaded.
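
The `bnb_4bit_use_double_quant=True` flag trims the overhead of the per-block quantization constants. The arithmetic below follows the figures given in the QLoRA paper (64-weight blocks with 32-bit constants, then 8-bit constants grouped in blocks of 256). That these block sizes match the installed bitsandbytes defaults is an assumption, and `constant_overhead_bits` is a hypothetical helper.

```python
def constant_overhead_bits(block_size=64, double_quant=False, second_block_size=256):
    """Extra bits per weight spent on quantization constants."""
    if not double_quant:
        return 32 / block_size  # one fp32 constant per block
    # 8-bit constants per block, plus one fp32 constant per group of constants
    return 8 / block_size + 32 / (block_size * second_block_size)


plain = constant_overhead_bits()
double = constant_overhead_bits(double_quant=True)
n_params = 7.24e9  # approximate parameter count, an assumption
saved_gib = (plain - double) * n_params / 8 / 1024 ** 3
print(f"overhead: {plain:.3f} -> {double:.3f} bits/param, saving ~{saved_gib:.2f} GiB")
```

The saving is modest next to the weights themselves, but on a 16 GB card it is memory that activations can use.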


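## 6) Format dataset to instruction prompts
We wrap each example in an Alpaca-style `### Instruction` / `### Response` prompt. This replaces the chat-template `train_ds` built in section 4, so the inference prompts in sections 8 and 9 must use the same layout.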

In [8]:
def format_example(example):
    instruction = example['instruction']
    input_text = example.get('input','')
    output_text = example['output']
    if input_text:
        prompt = f"### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n"
    else:
        prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    return {
        'text': prompt + output_text
    }

train_ds = ds['train'].map(format_example)
train_ds = train_ds.remove_columns([c for c in train_ds.column_names if c!='text'])
train_ds = train_ds.shuffle(seed=42)
train_ds[0]

Map:   0%|          | 0/25 [00:00<?, ? examples/s]

{'text': '### Instruction:\nProvide a minimal TCP server in C that listens on port 10000 and echoes one line.\n\n### Response:\n```c\n#include <stdio.h>\n#include <sys/types.h>\n#include <sys/socket.h>\n#include <netinet/in.h>\n#include <arpa/inet.h>\n#include <unistd.h>\n\nint main(int argc, char **argv) {\n    int s=socket(AF_INET,SOCK_STREAM,0); if(s<0){perror("socket");return 1;}\n    struct sockaddr_in a={0}; a.sin_family=AF_INET; a.sin_addr.s_addr=htonl(INADDR_ANY); a.sin_port=htons(10000);\n    int on=1; setsockopt(s,SOL_SOCKET,SO_REUSEADDR,&on,sizeof(on));\n    if(bind(s,(struct sockaddr*)&a,sizeof(a))<0){perror("bind");return 1;} listen(s,1);\n    int c=accept(s,NULL,NULL); if(c<0){perror("accept");return 1;} char b[256]; ssize_t n=read(c,b,sizeof(b)); if(n>0) write(c,b,n);\n    close(c); close(s); return 0;\n}\n```'}
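
Because the model only sees this layout during training, inference prompts should be built by the same code path. The sketch below factors the prompt out of `format_example` into one function and adds a helper that strips the prompt from generated text. Both `build_prompt` and `extract_response` are hypothetical helper names.

```python
RESPONSE_MARKER = "### Response:\n"


def build_prompt(instruction: str, input_text: str = "") -> str:
    """Build the prompt prefix in the same layout used for training text."""
    if input_text:
        return f"### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n{RESPONSE_MARKER}"
    return f"### Instruction:\n{instruction}\n\n{RESPONSE_MARKER}"


def extract_response(generated: str) -> str:
    """Return the text after the first response marker, or the whole text if it is absent."""
    _, found, tail = generated.partition(RESPONSE_MARKER)
    return tail.strip() if found else generated.strip()


prompt = build_prompt("Provide a minimal C program that prints argc and argv.")
print(prompt)
print(extract_response(prompt + "int main(int argc, char **argv) { /* ... */ }"))
```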

## 7) Trainer (TRL SFTTrainer with PEFT LoRA)
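We apply LoRA to the attention projections (`q_proj`, `k_proj`, `v_proj`, `o_proj`) and the MLP layers (`gate_proj`, `up_proj`, `down_proj`), the usual targets for Mistral/Llama models.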

In [9]:
from trl import SFTTrainer
from transformers import TrainingArguments
from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=['q_proj','k_proj','v_proj','o_proj','gate_proj','up_proj','down_proj'],
    lora_dropout=0.05,
    bias='none',
    task_type=TaskType.CAUSAL_LM
)

training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRAD_ACCUM,
    learning_rate=LR,
    lr_scheduler_type='cosine',
    warmup_ratio=WARMUP_RATIO,
    logging_steps=LOG_STEPS,
    save_steps=SAVE_STEPS,
    save_total_limit=2,
    fp16=True, bf16=False,
    max_grad_norm=1.0,
    gradient_checkpointing=True,
    optim='paged_adamw_8bit',
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,           # passed directly to SFTTrainer in TRL 0.9.6
    train_dataset=train_ds,
    peft_config=lora_config,
    max_seq_length=SEQ_LEN,        # passed directly to SFTTrainer in TRL 0.9.6
    dataset_text_field='text',
    packing=True,                  # concatenates short examples into SEQ_LEN-token sequences
    args=training_args,
)
trainer.train()


RuntimeError: Failed to import trl.trainer.sft_trainer because of the following error (look up to see its traceback):
Failed to import transformers.data.data_collator because of the following error (look up to see its traceback):
empty_like method already has a different docstring
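
For a sense of scale, the number of trainable parameters implied by the `LoraConfig` in this section can be worked out by hand: each adapted linear layer adds `r * (in_features + out_features)` weights. The layer dimensions below are assumed from the published Mistral 7B configuration (hidden size 4096, MLP size 14336, key/value projection width 1024, 32 layers) rather than read from the loaded model.

```python
R = 16
HIDDEN, INTERMEDIATE, KV_DIM, LAYERS = 4096, 14336, 1024, 32  # assumed Mistral 7B dimensions

SHAPES = {  # module name -> (in_features, out_features)
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(shapes, r, layers):
    """Total LoRA weights: r * (in + out) per adapted module, summed over all layers."""
    per_layer = sum(r * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * layers


total = lora_params(SHAPES, R, LAYERS)
print(f"{total:,} trainable LoRA parameters")
```

Under these assumptions the adapters hold roughly 42 million weights, well under 1% of the base model.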

## 8) Save adapter & run quick inference
We keep the base in 4-bit and use LoRA adapters for generation.

In [10]:
trainer.model.save_pretrained(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
print('Saved adapters to', OUTPUT_DIR)

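def generate(prompt, max_new_tokens=256):
    gen_model = trainer.model
    gen_model.eval()  # turn off LoRA dropout and gradient checkpointing for inference
    inputs = tokenizer(prompt, return_tensors='pt').to(gen_model.device)
    with torch.no_grad():
        out = gen_model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True,
                                 temperature=0.2, top_p=0.9, use_cache=True,
                                 pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(out[0], skip_special_tokens=True)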

test_prompt = (
    '### Instruction:\nWrite a C program that creates a child with fork(), waits for it, and prints both PIDs.\n\n### Response:\n'
)
print(generate(test_prompt))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
Caching is incompatible with gradient checkpointing in MistralDecoderLayer. Setting `past_key_values=None`.


Saved adapters to /content/qlora-mistral-sysprog


  return fn(*args, **kwargs)


### Instruction:
Write a C program that creates a child with fork(), waits for it, and prints both PIDs.

### Response:

 Q: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Quest
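
Output like the run above, a single token repeated until the length limit, is worth flagging automatically during quick evaluation. Here is a minimal sketch of such a check. `repetition_ratio` and `looks_degenerate` are hypothetical helpers, and the threshold values are arbitrary starting points rather than tuned settings.

```python
def repetition_ratio(text: str) -> float:
    """Fraction of whitespace-separated tokens that repeat an earlier token."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return 1.0 - len(set(tokens)) / len(tokens)


def looks_degenerate(text: str, threshold: float = 0.8, min_tokens: int = 20) -> bool:
    """Flag long outputs dominated by repeated tokens."""
    return len(text.split()) >= min_tokens and repetition_ratio(text) >= threshold


sample = "Q: " + "Question: " * 50
print(round(repetition_ratio(sample), 2), looks_degenerate(sample))
```

A flagged sample suggests the run needs attention (data volume, learning rate, or prompt format) before any compile checks are meaningful.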

## 9) (Optional) Compile-check the generated C code
This cell attempts to extract a C block and compile it with `gcc`. Useful for smoke tests.

In [11]:
import re, subprocess
gen = generate('### Instruction:\nProvide a minimal C program that prints argc and argv.\n\n### Response:\n')
print(gen)
m = re.search(r'```c\n(.*?)\n```', gen, flags=re.S)
if m:
    code = m.group(1)
    with open('/content/test.c','w') as f:
        f.write(code)
    print('\n[compile] test.c -> a.out')
    proc = subprocess.run(['gcc','/content/test.c','-o','/content/a.out'], capture_output=True, text=True)
    print('gcc rc:', proc.returncode)
    print('stderr:', proc.stderr[:500])
    if proc.returncode==0:
        run = subprocess.run(['/content/a.out','hello','world'], capture_output=True, text=True)
        print('[run] output:\n', run.stdout)
else:
    print('No ```c fenced block found to compile.')

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


### Instruction:
Provide a minimal C program that prints argc and argv.

### Response:

 Q: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question: Question
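
A slightly more reusable variant of the smoke test is sketched below. It collects every fenced C block from a response and runs `gcc -fsyntax-only` in a temporary directory, so nothing is executed and no files are left behind. `extract_c_blocks` and `syntax_check` are hypothetical helpers, and the sketch assumes `gcc` is on `PATH` (it reports `None` when it is not).

```python
import re
import shutil
import subprocess
import tempfile
from pathlib import Path

FENCE = "`" * 3  # built at runtime so this example contains no literal nested fence
FENCE_RE = re.compile(FENCE + r"c\s*\n(.*?)" + FENCE, re.S)


def extract_c_blocks(text):
    """Return the body of every fenced C block found in text."""
    return [body.strip() + "\n" for body in FENCE_RE.findall(text)]


def syntax_check(code, timeout=20):
    """Return (ok, stderr); ok is None when gcc is not installed."""
    if shutil.which("gcc") is None:
        return None, "gcc not found"
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "snippet.c"
        src.write_text(code)
        proc = subprocess.run(["gcc", "-fsyntax-only", "-Wall", str(src)],
                              capture_output=True, text=True, timeout=timeout)
    return proc.returncode == 0, proc.stderr


demo = f'Here you go:\n{FENCE}c\n#include <stdio.h>\nint main(void){{ puts("hi"); return 0; }}\n{FENCE}\n'
blocks = extract_c_blocks(demo)
print(len(blocks), "block(s); syntax ok:", syntax_check(blocks[0])[0])
```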

## 10) (Optional) Merge LoRA into base weights (for export)
Merging reloads the base model unquantized (roughly 14.5 GB of weights in fp16 for a 7B model), so consider doing this on a larger GPU. Skip this step if you serve the base model with adapters.

In [None]:
MERGE = False  # set True to attempt a full merge
if MERGE:
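    import torch
    from peft import AutoPeftModelForCausalLM
    # Load the base in fp16 (not 4-bit, not the fp32 default) so the LoRA deltas merge into half-precision weights
    m = AutoPeftModelForCausalLM.from_pretrained(OUTPUT_DIR, device_map='auto', torch_dtype=torch.float16)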
    m = m.merge_and_unload()
    m.save_pretrained(OUTPUT_DIR + '-merged', safe_serialization=True)
    tokenizer.save_pretrained(OUTPUT_DIR + '-merged')
    print('Merged model saved to', OUTPUT_DIR + '-merged')
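
Conceptually, merging folds each adapter into its base weight as `W + (alpha / r) * B @ A`. The toy example below shows that arithmetic in plain Python with tiny matrices. It is an illustration of the formula under the standard LoRA scaling, not PEFT's implementation, and `merge_lora` is a hypothetical helper.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, lora_a, lora_b, alpha, r):
    """Return W + (alpha / r) * (B @ A) as a new matrix."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy 2x2 weight with a rank-1 adapter; alpha / r = 2, the same ratio as 32 / 16 in section 7.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]    # shape (r, in_features)
B = [[0.5], [0.0]]  # shape (out_features, r)
print(merge_lora(W, A, B, alpha=2, r=1))
```

After merging, the adapter matrices are no longer needed at inference time, which is why the merged checkpoint can be served like an ordinary model.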

## 11) Serving tips
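- For inference on a GPU with limited memory, keep the 4-bit base + adapters (the pinned bitsandbytes 4-bit path needs a CUDA GPU, so it does not cover CPU-only serving).
- For best latency, run on a GPU; `load_in_4bit=True` keeps memory low, though a merged fp16 model is usually faster when it fits.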
- Consider adding a **RAG** layer to inject fresh man pages when required.

## 12) Next steps
- Point the loader in section 4 at your full dataset (the demo file from section 3 is `/content/data/sysprog_train.jsonl`).
- Increase `EPOCHS`, and consider a validation split with early stopping.
- Add **safety/eval** harness: unit tests, syscall quizzes, troubleshooting tasks.
- Track metrics with Weights & Biases (`wandb`) if desired.
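
As a starting point for the validation split mentioned above, here is a minimal sketch that shuffles records deterministically and holds out a fraction of them. `split_records` is a hypothetical helper, and the 20% hold-out is an arbitrary choice.

```python
import random


def split_records(records, valid_fraction=0.1, seed=42):
    """Shuffle deterministically and return (train, valid) lists."""
    if not 0.0 <= valid_fraction < 1.0:
        raise ValueError("valid_fraction must be in [0, 1)")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]


records = [{"instruction": f"task {i}", "input": "", "output": "..."} for i in range(25)]
train, valid = split_records(records, valid_fraction=0.2)
print(len(train), "train /", len(valid), "valid")
```

Each list can then be written back to JSONL and passed to `load_dataset` as the `train` and `valid` files.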