<a href="https://colab.research.google.com/github/tmechouma/finetuninggpt/blob/main/Fine_tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!python -m pip install --upgrade pip
!pip install transformers datasets accelerate evaluate
!pip install peft bitsandbytes


In [2]:
import argparse
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments, DataCollatorForLanguageModeling
import torch
from typing import Dict
from transformers import pipeline

# Function: `build_prompt`
## Purpose
Takes a dictionary containing a question and, optionally, an answer, and formats them into a standardized prompt string.

1. Extracts the question

   - `example["question"].strip()`:

     - Fetches the value for the key `"question"`.

     - `.strip()` removes leading/trailing whitespace (e.g., `" What? "` → `"What?"`).

     - Stored in variable `q`.

2. Extracts the answer (if it exists)

   - `example.get("answer", "").strip()`:

     - Safely fetches `"answer"` if the key exists; defaults to `""` if it is missing.

     - `.strip()` cleans whitespace (e.g., `" Yes "` → `"Yes"`).

     - Stored in variable `a`.

3. Constructs the prompt

   - Combines `q` and `a` into the template `f"Question: {q}\nAnswer: {a}"`.
   - Example: `q = "What is Python?"` and `a = "A programming language."` give `"Question: What is Python?\nAnswer: A programming language."`.

4. Returns the formatted string
   - The prompt is returned for downstream use (e.g., input to a model).
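
The four steps above can be tried on a small input. This is a minimal sketch: `format_qa` is a hypothetical stand-in name that follows the same steps, and the example strings are made up. It also shows what happens when the `"answer"` key is missing.

```python
from typing import Dict


def format_qa(example: Dict[str, str]) -> str:
    """Format a question/answer pair; the answer may be absent."""
    q = example["question"].strip()        # required key
    a = example.get("answer", "").strip()  # optional key, defaults to ""
    return f"Question: {q}\nAnswer: {a}"


with_answer = format_qa({"question": "  What is Python? ", "answer": " A programming language. "})
without_answer = format_qa({"question": "What is Python?"})

print(repr(with_answer))     # 'Question: What is Python?\nAnswer: A programming language.'
print(repr(without_answer))  # 'Question: What is Python?\nAnswer: '
```

Without an answer the prompt ends in `Answer: `, which is the shape a model would be asked to complete.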






In [3]:
def build_prompt(example: Dict) -> str:
  # Simple format: "Question: ...\nAnswer: ..."
  q = example["question"].strip()
  # The answer is optional; a missing key yields an empty answer
  a = example.get("answer", "").strip()
  prompt = f"Question: {q}\nAnswer: {a}"
  return prompt

# Function: `tokenize_and_mask`
## Purpose
Prepares text data for causal language modeling by:

  1. Tokenizing a Q&A prompt.

  2. Masking labels (setting -100) for the question tokens to ensure the model only learns from the answer during training.

| Parameter  | Type                | Description                                     |
|------------|---------------------|-------------------------------------------------|
| example    | Dict                | Dict with `"question"` and `"answer"` keys.     |
| tokenizer  | PreTrainedTokenizer | HuggingFace tokenizer.                          |
| max_length | int                 | Max token length (default: 512).                |

## Key Steps
1. Build prompt:

    - Formats `example` into `"Question: ...\nAnswer: ..."` using `build_prompt`.

2. Tokenize:

    - Truncates/pads the text to `max_length`.

    - Returns `input_ids` and `attention_mask`.

3. Locate answer start:

    - Searches `input_ids` for the tokenized answer prefix.

    - Falls back to masking the first half if the prefix is not found.

4. Mask labels:

    - Sets `labels` to -100 (ignored by the loss) for question tokens.

    - Only answer tokens (starting at `answer_start`) are kept in `labels`.

- Example:

```python
example = {"question": "What is Python?", "answer": "A language."}
tokenize_and_mask(example, tokenizer)
```

Returns a dictionary of the form (token IDs are illustrative):

```python
{
    "input_ids": [314, 512, ..., 102],     # Tokenized "Question: ... Answer: ..."
    "attention_mask": [1, 1, ..., 0],      # 1 for real tokens, 0 for padding
    "labels": [-100, -100, ..., 412, 102]  # Only answer tokens are unmasked
}
```
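
The masking step can be illustrated without a tokenizer. The sketch below uses made-up token IDs. `mask_labels` is a hypothetical helper that mirrors the key steps above and, as an addition not listed in those steps, also masks padding positions via `attention_mask`.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def find_sublist(haystack, needle):
    """Return the first index where `needle` occurs in `haystack`, or -1."""
    for i in range(len(haystack) - len(needle) + 1):
        if haystack[i:i + len(needle)] == needle:
            return i
    return -1


def mask_labels(input_ids, attention_mask, answer_prefix):
    """Keep labels only for real tokens that come after the answer prefix."""
    idx = find_sublist(input_ids, answer_prefix)
    # Same fallback as described above: supervise the second half
    start = len(input_ids) // 2 if idx == -1 else idx + len(answer_prefix)
    return [
        tok if (pos >= start and keep == 1) else IGNORE_INDEX
        for pos, (tok, keep) in enumerate(zip(input_ids, attention_mask))
    ]


# Made-up token IDs: 10-12 = question, 20-21 = answer prefix, 30-31 = answer, 0 = padding
input_ids = [10, 11, 12, 20, 21, 30, 31, 0, 0]
attention_mask = [1, 1, 1, 1, 1, 1, 1, 0, 0]

labels = mask_labels(input_ids, attention_mask, answer_prefix=[20, 21])
print(labels)  # [-100, -100, -100, -100, -100, 30, 31, -100, -100]
```

Only the two answer tokens keep their IDs, so only they would contribute to the loss.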

In [4]:
def tokenize_and_mask(example, tokenizer, max_length=512):
  text = build_prompt(example)
  enc = tokenizer(text, truncation=True, max_length=max_length, padding='max_length')
  input_ids = enc["input_ids"]
  attention_mask = enc["attention_mask"]
  # Find where the "Answer:" token span starts so question tokens can be masked in labels.
  # The prefix is tokenized without a trailing space: GPT-2-style BPE tokenizers attach
  # the space to the following word, so "Answer: " would not match the prompt's tokens.
  answer_prefix = tokenizer("Answer:", add_special_tokens=False)["input_ids"]
  # naive search for prefix in input_ids
  def find_sublist(haystack, needle):
    for i in range(len(haystack)-len(needle)+1):
      if haystack[i:i+len(needle)] == needle:
        return i
    return -1
  idx = find_sublist(input_ids, answer_prefix)
  if idx == -1:
    # fallback: mask first half (e.g. the answer prefix was truncated away)
    answer_start = len(input_ids)//2
  else:
    answer_start = idx + len(answer_prefix)

  labels = [-100] * len(input_ids)
  # allow loss only on real (non-padding) tokens from answer_start onward
  for i in range(answer_start, len(input_ids)):
    if attention_mask[i] == 1:
      labels[i] = input_ids[i]

  return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

# Function: `main`

## **Purpose**
Trains a causal language model (e.g., GPT-style) on the GSM8K dataset (math Q&A) using HuggingFace Transformers.  
Key steps:  
1. Loads dataset and model.  
2. Tokenizes data and builds `labels` with the question tokens masked.  
3. Fine-tunes the model with specified training arguments.  

---

## **Parameters (Command-Line Args)**
| Argument                         | Default                     | Description                                  |
|----------------------------------|-----------------------------|----------------------------------------------|
| `--model_name`                   | `EleutherAI/gpt-neo-125M`   | Pretrained model to fine-tune.               |
| `--output_dir`                   | `out-small`                 | Directory to save the trained model.         |
| `--epochs`                       | `3`                         | Number of training epochs.                   |
| `--per_device_train_batch_size`  | `2`                         | Batch size per device (GPU or CPU).          |
| `--max_length`                   | `512`                       | Max token length for truncation/padding.     |
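
The arguments in the table can be declared with `argparse`. Inside a notebook the kernel adds its own command-line arguments, which is why the notebook calls `parse_known_args()`. The sketch below simulates that situation; the `argv` list and the `kernel.json` file name are made-up placeholders.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--model_name", default="EleutherAI/gpt-neo-125M")
parser.add_argument("--output_dir", default="out-small")
parser.add_argument("--epochs", type=int, default=3)
parser.add_argument("--per_device_train_batch_size", type=int, default=2)
parser.add_argument("--max_length", type=int, default=512)

# Simulated argv: a notebook kernel typically passes "-f <connection file>"
argv = ["-f", "kernel.json", "--epochs", "5"]

# parse_known_args() returns unrecognized arguments instead of exiting with an error
args, unknown = parser.parse_known_args(argv)

print(args.epochs)      # 5
print(args.model_name)  # EleutherAI/gpt-neo-125M
print(unknown)          # ['-f', 'kernel.json']
```

With `parse_args()` the same `argv` would stop the program with an "unrecognized arguments" error.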

---

## **Workflow**
1. **Load Data**:  
   - Uses `openai/gsm8k` dataset (train split).  
   - Shuffles data with seed `42`.

2. **Setup Tokenizer & Model**:  
   - Adds a `pad_token` if missing (required for padding).  
   - Resizes model embeddings to match tokenizer.

3. **Tokenize & Mask**:  
   - Applies `tokenize_and_mask` to each example:  
     - Formats as `"Question: ...\nAnswer: ..."`.  
     - Masks question tokens in `labels` (set to `-100`).  

4. **Training**:  
   - Uses `Trainer` with:  
     - FP16 if GPU available.  
     - Causal LM training (no MLM).  
   - Saves model and tokenizer to `output_dir`.

---

## **Key Components**
| Component                         | Role                                                                                                                                    |
|-----------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------|
| `tokenize_and_mask`               | Builds `labels` with question tokens set to `-100` (see the `tokenize_and_mask` section above).                                         |
| `DataCollatorForLanguageModeling` | Pads and batches examples; with `mlm=False` it rebuilds `labels` from `input_ids` (padding set to `-100`), replacing existing `labels`. |
| `TrainingArguments`               | Configures training (logging, saving, batch size, etc.).                                                                                |
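
With `mlm=False`, `DataCollatorForLanguageModeling` derives `labels` from `input_ids`, so a precomputed question mask reaches the loss only if the collator passes `labels` through unchanged. The sketch below shows one possible way to do that. `collate_keep_labels` and its `to_tensor` parameter are hypothetical names, not part of Transformers. It assumes every example is already padded to the same length and keeps at least one unmasked label.

```python
IGNORE_INDEX = -100


def collate_keep_labels(features, to_tensor=None):
    """Batch pre-padded examples without rebuilding their labels.

    `features` is a list of dicts with equally long "input_ids",
    "attention_mask" and "labels" sequences.
    """
    batch = {
        key: [[int(x) for x in f[key]] for f in features]
        for key in ("input_ids", "attention_mask", "labels")
    }
    # Padding positions never contribute to the loss
    for labels, mask in zip(batch["labels"], batch["attention_mask"]):
        for i, keep in enumerate(mask):
            if keep == 0:
                labels[i] = IGNORE_INDEX
    if to_tensor is None:
        return batch  # plain nested lists, handy for inspection
    return {key: to_tensor(rows) for key, rows in batch.items()}


# Made-up features: the first token stands for the question, 0 is padding
features = [
    {"input_ids": [5, 6, 7, 0], "attention_mask": [1, 1, 1, 0], "labels": [-100, 6, 7, 0]},
    {"input_ids": [8, 9, 0, 0], "attention_mask": [1, 1, 0, 0], "labels": [-100, 9, 0, 0]},
]
batch = collate_keep_labels(features)
print(batch["labels"])  # [[-100, 6, 7, -100], [-100, 9, -100, -100]]
```

With PyTorch, passing `functools.partial(collate_keep_labels, to_tensor=torch.tensor)` as `data_collator` would yield tensor batches. Transformers' `default_data_collator` also leaves an existing `labels` column untouched.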

---

## **Example Command**
```bash
python script.py \
  --model_name "EleutherAI/gpt-neo-1.3B" \
  --output_dir "out-large" \
  --epochs 5 \
  --per_device_train_batch_size 4
```

In [5]:
def main(args):
  print("Loading dataset...")
  ds = load_dataset("openai/gsm8k", 'main')
  train = ds["train"]
  # shuffle the training examples (fixed seed for reproducibility)
  train = train.shuffle(seed=42)

  print("Loading tokenizer & model...")
  tokenizer = AutoTokenizer.from_pretrained(args.model_name, use_fast=True)
  # GPT-Neo's tokenizer might not have pad_token
  if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({"pad_token": "<|pad|>"})

  model = AutoModelForCausalLM.from_pretrained(args.model_name)
  model.resize_token_embeddings(len(tokenizer))

  print("Tokenizing and building labels (this may take a minute)...")
  tokenized = train.map(lambda ex: tokenize_and_mask(ex, tokenizer, args.max_length),
                        remove_columns=train.column_names)

  tokenized.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])

  training_args = TrainingArguments(
      output_dir=args.output_dir,
      num_train_epochs=args.epochs,
      per_device_train_batch_size=args.per_device_train_batch_size,
      save_strategy="epoch",
      logging_steps=50,
      fp16=torch.cuda.is_available(),
      remove_unused_columns=False,
      push_to_hub=False
  )

  trainer = Trainer(
      model=model,
      args=training_args,
      train_dataset=tokenized,
      # With mlm=False this collator rebuilds `labels` from `input_ids`
      # (padding -> -100), replacing the labels built by tokenize_and_mask.
      data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
  )

  trainer.train()
  trainer.save_model(args.output_dir)
  tokenizer.save_pretrained(args.output_dir)
  print("Done.")

parser = argparse.ArgumentParser()
parser.add_argument("--model_name", default="EleutherAI/gpt-neo-125M")
parser.add_argument("--output_dir", default="out-small")
parser.add_argument("--epochs", type=int, default=3)
parser.add_argument("--per_device_train_batch_size", type=int, default=2)
parser.add_argument("--max_length", type=int, default=512)

_StoreAction(option_strings=['--max_length'], dest='max_length', nargs=None, const=None, default=512, type=<class 'int'>, choices=None, required=False, help=None, metavar=None)

In [6]:
if __name__ == "__main__":
  # Use parse_known_args() to ignore extra arguments passed by the notebook environment
  args, unknown = parser.parse_known_args()
  main(args)

Loading dataset...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md: 0.00B [00:00, ?B/s]

main/train-00000-of-00001.parquet:   0%|          | 0.00/2.31M [00:00<?, ?B/s]

main/test-00000-of-00001.parquet:   0%|          | 0.00/419k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/7473 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1319 [00:00<?, ? examples/s]

Loading tokenizer & model...


tokenizer_config.json:   0%|          | 0.00/727 [00:00<?, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/357 [00:00<?, ?B/s]

config.json: 0.00B [00:00, ?B/s]

model.safetensors:   0%|          | 0.00/526M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/119 [00:00<?, ?B/s]

The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


Tokenizing and building labels (this may take a minute)...


Map:   0%|          | 0/7473 [00:00<?, ? examples/s]



Step,Training Loss
50,2.226
100,1.9606
150,1.9652
200,1.7462
250,1.8572
300,1.8048
350,1.7273
400,1.7468
450,1.7491
500,1.7163


Done.


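# Function: `build_prompt_template`
## Purpose
Creates a standardized prompt for question-answer tasks by formatting the input question into a structured string that ends with `Answer:`, leaving the answer for the model to generate.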

In [8]:
def build_prompt_template(question):
    return f"Question: {question}\nAnswer:"

# Function: `mainn`
## Purpose
Evaluates a fine-tuned language model on the GSM8K test set by printing model predictions next to the ground-truth answers.

## Command Line Arguments

| Argument         | Type | Default                | Description                             |
|------------------|------|------------------------|-----------------------------------------|
| `--model_dir`    | str  | `"/content/out-small"` | Directory containing the trained model. |
| `--num_examples` | int  | `10`                   | Number of test examples to evaluate.    |

## Key Components
1. Model Loading:

    - Loads tokenizer and model from `model_dir`

    - Uses FP16 if a GPU is available, for faster inference

2. Generation Pipeline:

    - Creates a text-generation pipeline and uses greedy decoding (`do_sample=False`), so outputs are deterministic

    - Places the model on the GPU if one is available, otherwise on the CPU

3. Evaluation Loop:

    - For each test example:

        - Builds prompt using `build_prompt_template`

        - Generates answer with a 128-token limit

        - Prints the prediction and the ground-truth answer (truncated to 400 characters) for comparison

    - Prints formatted results with a separator between examples
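
The loop prints predictions and targets side by side. Since GSM8K reference answers end with a line of the form `#### <number>`, an automatic check could compare the numbers after that marker. The sketch below is one possible approach: `extract_final_answer` and `is_correct` are hypothetical helpers, the `target` string is hand-written, and `prediction` is abbreviated from the sample generation shown in this notebook.

```python
import re
from typing import Optional

# Matches the number that follows the "####" marker, e.g. "#### 72" or "#### 1,000"
FINAL_ANSWER = re.compile(r"####\s*(-?[\d,]*\.?\d+)")


def extract_final_answer(text: str) -> Optional[str]:
    """Return the number after the first "####" marker, or None if absent."""
    match = FINAL_ANSWER.search(text)
    if match is None:
        return None
    return match.group(1).replace(",", "")  # "1,000" -> "1000"


def is_correct(prediction: str, target: str) -> bool:
    """Exact string match on the extracted final answers."""
    pred = extract_final_answer(prediction)
    gold = extract_final_answer(target)
    return pred is not None and pred == gold


target = "3 - 1 = <<3-1=2>>2 apples are left.\n#### 2"
prediction = "The total number of apples remaining is <<4+3=7>>7 apples\n#### 7 apples is 7 apples/apple"

print(extract_final_answer(target))      # 2
print(extract_final_answer(prediction))  # 7
print(is_correct(prediction, target))    # False
```

Using the first `####` marker keeps the check well defined when a generation repeats the marker several times.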


In [10]:
def mainn():
  parser = argparse.ArgumentParser()
  parser.add_argument("--model_dir", default="/content/out-small")
  parser.add_argument("--num_examples", type=int, default=10)
  # parse_known_args() ignores extra arguments passed by the notebook environment
  args, _ = parser.parse_known_args()

  # GSM8K needs an explicit config name; "main" matches the training cell
  ds = load_dataset("openai/gsm8k", "main")
  test = ds["test"]

  tokenizer = AutoTokenizer.from_pretrained(args.model_dir, use_fast=True)
  model = AutoModelForCausalLM.from_pretrained(args.model_dir, torch_dtype=torch.float16 if torch.cuda.is_available() else None)
  model.eval()
  device = 0 if torch.cuda.is_available() else -1
  gen = pipeline("text-generation", model=model, tokenizer=tokenizer, device=device)

  for i in range(min(args.num_examples, len(test))):
    q = test[i]["question"]
    prompt = build_prompt_template(q)
    # do_sample=False selects greedy decoding; temperature only applies when sampling
    out = gen(prompt, max_new_tokens=128, do_sample=False)
    print("="*30)
    print("Q:", q)
    print("Predicted:\n", out[0]["generated_text"].replace(prompt,"").strip())
    print("Target:\n", test[i].get("answer","").strip()[:400])
    print()

In [11]:
if __name__ == "__main__":
    mainn()

In [9]:
print(pipeline("text-generation", model="/content/out-small")("Question: If you have 3 apples and eat one, how many are left?"))


Device set to use cuda:0


[{'generated_text': 'Question: If you have 3 apples and eat one, how many are left?\nAnswer: The total number of apples eaten is 3 + 1 = <<3+1=4>>4 apples\nThe total number of apples remaining is 4 apples + 3 apples = <<4+3=7>>7 apples\n#### 7 apples is 7 apples/apple\n7 apples / 4 apples/apple = <<7/4=1>>1 apple/apple\n#### 1 apple/apple is 1 apple/apple\n#### 1 apple/apple is 1 apple/apple\n#### 1 apple/apple is 1 apple/apple\n#### 1 apple/ apple is 1 apple/apple\n#### 1 apple/apple is 1 apple/apple\n#### 1 apple/apple is 1 apple/apple\n#### 1 apple/apple is 1 apple/apple\n#### 1 apple/apple is 1 apple/apple\n#### 1 apple/apple is 1 apple/apple\n#### 1 apple/apple is 1 apple/apple\n#### 1 apple/apple is 1 apple/apple\n#### 1 apple/apple is 1 apple/apple\n#### 1 apple/apple is 1 apple/apple\n#### 1 apple/apple is 1 apple/apple\n#### 1 apple/apple is 1 apple/apple\n#### 1 apple/apple is 1 apple/apple\n#### 1 apple/'}]
