<a href="https://colab.research.google.com/github/victor-onoja/DeepLearningLearning/blob/main/Copy_of_Data_Preparation_and_Model_Loading.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Preparation and Model Loading

In this Colab notebook, you'll work to get your dataset ready for model fine-tuning. You'll also look at options for quantizing model parameters to match your training resources (i.e., GPU memory) and performance requirements.

> This notebook is based on [@maximelabonne's Llama 2 fine-tuning notebook](https://github.com/mlabonne/llm-course/blob/main/Fine_tune_Llama_2_in_Google_Colab.ipynb), which is, in turn, based on Younes Belkada's [GitHub Gist](https://gist.github.com/younesbelkada/9f7f75c94bdc1981c8ca5cc937d4a4da). It also borrows from [this example](https://github.com/brevdev/notebooks/blob/main/phi2-finetune-own-data.ipynb) of Phi-2 fine-tuning.

Note: in this course we're *not* going to cover every line of code. Much of the code is well-commented, and from the comments and the online documentation you should be able to figure out how any particular piece of code works.


## Install Python Libraries

First, we need to install the necessary libraries. We've pinned every package to a version known to work. Few things are more frustrating than code that used to run smoothly failing because a package update introduced an incompatibility. This matters on Colab in particular: a runtime starts with many packages preinstalled, and Google updates their versions over time.

In [None]:
# Upgrade pip
!pip install -U pip

# Uninstall packages that will conflict with those we're about to install
!pip uninstall --yes opencv-contrib-python thinc opencv-python opencv-python-headless albumentations spacy dopamine-rl albucore fastai jax shap jaxlib pytensor pymc flax chex orbax-checkpoint optax

# Downgrade to numpy 1.26.4 (needed to support pandas 2.2.2).
# NOTE: If asked to restart the runtime, do so. You don't need to rerun this cell after restarting.
!pip uninstall --yes numpy
!pip install numpy==1.26.4


In [None]:
# Install a CUDA 12.1 build of PyTorch compatible with Python 3.12
!pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121

# Core libs
!pip install \
  accelerate==1.10.1 \
  transformers==4.56.2 \
  datasets==4.0.0 \
  peft==0.17.1 \
  sentence-transformers==5.1.0 \
  einops==0.8.1 \
  safetensors==0.6.2 \
  jinja2==3.1.6 \
  regex==2025.9.18 \
  fsspec==2025.3.0 \
  gcsfs==2025.3.0 \
  pandas==2.2.2 \
  pyarrow==15.0.2 \
  pytz==2024.1

# bitsandbytes with CUDA 12 support (use a recent version)
!pip install bitsandbytes==0.47.0
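
After the install cells finish (and after any runtime restart), it can be worth confirming that the pinned versions are the ones actually loaded. The following is a minimal sketch using only the standard library; the `PINNED` dictionary and the `check_versions` helper are illustrative names, not part of the original notebook, and the dictionary lists only a few of the pins above.

```python
from importlib import metadata

# A subset of the pins from the install cells; extend as needed.
PINNED = {
    "torch": "2.5.1",
    "transformers": "4.56.2",
    "datasets": "4.0.0",
    "peft": "0.17.1",
    "bitsandbytes": "0.47.0",
    "numpy": "1.26.4",
}

def check_versions(pins):
    """Return {package: (expected, installed_or_None, matches)}."""
    report = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Wheels from the PyTorch index carry a local suffix such as "+cu121".
        matches = installed is not None and installed.split("+")[0] == expected
        report[name] = (expected, installed, matches)
    return report

for name, (expected, installed, ok) in check_versions(PINNED).items():
    status = "ok" if ok else "MISMATCH"
    print(f"{name:15s} expected {expected:10s} installed {installed!s:15s} {status}")
```

A `MISMATCH` line usually means the runtime needs a restart or that a later install pulled in a different version.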


## Import Modules
Next we import the modules we're going to need for fine-tuning.

In [None]:
import torch

from transformers import (
    AutoModelForCausalLM, # Will be used to load the pre-trained model
    AutoTokenizer, # Will be used to load the pre-trained tokenizer
    BitsAndBytesConfig, # For model quantization settings
    GenerationConfig, # To control generation (inference) from a model
    TrainingArguments, # To specify parameters of the fine-tuning process
    Trainer, # The object that abstracts away the training and evaluation loop
    pipeline, # Stringing together tokenization and inference, for convenience
    logging
)

from datasets import Dataset, DatasetDict # For data handling.

from peft import LoraConfig, PeftModel, get_peft_model # PEFT stands for "Parameter Efficient Fine-Tuning"
                                                       # These objects will help us to run Low Rank Adaptation
                                                       # instead of full fine-tuning.

torch.manual_seed(42); # Set the state of the random number generator. Important for reproducibility.

## Verify GPU Availability

If the following cell gives an error, make sure you have a T4 GPU selected in Colab. Go to Runtime -> Change runtime type -> T4 GPU. After that, restart the notebook and re-run the code cells above.

In [None]:
if not torch.cuda.is_available():
    raise ValueError("Wrong runtime type, please fix before proceeding. "
                     "We need a GPU for this fine-tuning notebook to work.")

# Data Preparation

It's now time to load and tokenize the training and evaluation data.

## Load the Data

The first step is to load training and evaluation data. The `gdown` module provides a way to load data from the internet to the Colab notebook's filesystem.

> The URLs in the cell below correspond to the texts linked on the course page. You can also navigate directly to them at [*Men Without Women*](https://uploads.smart.ly/assets/551504549949061acef222a18e665c51dba4bb15c341451942cddeabc6cdcab9/original/551504549949061acef222a18e665c51dba4bb15c341451942cddeabc6cdcab9.txt) and [*The Sun Also Rises*](https://uploads.smart.ly/assets/7f5a282142c700b82ca778a890c74679493ec6651c70e0ee35e84fdef829cccd/original/7f5a282142c700b82ca778a890c74679493ec6651c70e0ee35e84fdef829cccd.txt).

In [None]:
import gdown
gdown.download("https://quanticedu.github.io/llm-fine-tuning/MenWithoutWomenCleaned.txt",
               "./MenWithoutWomen.txt", quiet=True)
gdown.download("https://quanticedu.github.io/llm-fine-tuning/TheSunAlsoRisesCleaned.txt",
               "./TheSunAlsoRises.txt", quiet=True)

# Loading our training data (one book)
with open("MenWithoutWomen.txt", "r", encoding="utf-8") as f:
    raw_training_text = f.read()

# Loading our evaluation and test data (one book)
with open("TheSunAlsoRises.txt", 'r', encoding="utf-8") as f:
    raw_eval_text = f.read()

## Get a Tokenizer

The next step is to tokenize the training and evaluation data. To begin, we get the pretrained tokenizer from the pretrained model. Run this cell, ignoring warnings about `HF_TOKEN` and special tokens.

In [None]:
# The model that you want to train from the Hugging Face hub
model_name = "microsoft/phi-2"
revision = "523a3d62e793d3f51ad6334ccfd3b67de28771c0"
# Once you get a model working, it's good practice to "freeze" the revision so
# subsequent changes to the model don't affect your results. To do this on Hugging Face,
# go to the model's page, select "Files and Versions", then "History" in the upper
# right. Then select the Copy icon next to the commit hash of the most recent
# commit and use the full hash as the revision parameter for loading the tokenizer
# and the model itself.

# Load the pre-trained Phi2 tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, revision=revision)
tokenizer.pad_token = tokenizer.eos_token # A common, slightly hacky solution.
                                          # Some models are trained without padding,
                                          # but we can usually reuse the eos (end of sequence) token
                                          # for padding purposes.

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
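
To make the padding comment above concrete, here is a small sketch of what reusing the end-of-sequence token for padding amounts to. It is pure Python and does not call the tokenizer; `pad_batch` is a hypothetical helper, right-padding is assumed, and `50256` is taken as the `eos_token_id` this notebook reports for Phi-2.

```python
PAD_ID = 50256  # assumed id of the eos token, reused here as the pad token

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad token-id lists to a common length and build attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

ids, mask = pad_batch([[262, 3420, 13], [198]])
print(ids)   # [[262, 3420, 13], [198, 50256, 50256]]
print(mask)  # [[1, 1, 1], [1, 0, 0]]
```

Because the pad id equals the eos id, the attention mask is what tells padding apart from a genuine end of sequence.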



## Explore Tokenizing Strategies

We can use the tokenizer to explore different strategies for tokenizing the data. Try various combinations of `True` and `False` for the `truncation` and `return_overflowing_tokens` parameters to see how the tokenizer behaves.

In [None]:
# Applying our pre-trained tokenizer to text.
outputs = tokenizer(
    raw_training_text,
    truncation=True,
    max_length=128,
    return_overflowing_tokens=True,
    return_length=True,
)

print(type(outputs))
for key in outputs:
  print(f"{key}:")
  if len(outputs[key]) > 1:
    print(f"  length: {len(outputs[key])}\n  first three elements: {outputs[key][:3]}\n  last three elements: {outputs[key][-3:]}")
  else:
    print(f"  value: {outputs[key][0]}")

<class 'transformers.tokenization_utils_base.BatchEncoding'>
input_ids:
  length: 543
  first three elements: [[628, 198, 50261, 49275, 42881, 45386, 1677, 628, 628, 198, 50259, 10970, 4725, 7206, 15112, 11617, 628, 198, 10725, 52, 3698, 402, 25793, 3539, 19952, 262, 16046, 284, 2094, 29825, 4990, 2271, 447, 247, 82, 2607, 13, 679, 900, 198, 2902, 465, 45391, 290, 13642, 319, 262, 3420, 13, 1318, 373, 645, 3280, 13, 25995, 11, 198, 5646, 287, 262, 23959, 11, 2936, 612, 373, 617, 530, 287, 262, 2119, 13, 679, 2936, 340, 198, 9579, 262, 3420, 13, 198, 198, 447, 250, 9781, 2271, 11, 447, 251, 339, 531, 11, 8680, 13, 198, 198, 1858, 373, 645, 3280, 13, 198, 198, 1544, 447, 247, 82, 612, 11, 477, 826, 11, 25995, 1807, 13, 198, 198, 447, 250, 9781, 2271, 11, 447, 251, 339, 531, 290, 275, 5102], [262, 3420, 13, 198, 198, 447, 250, 8241, 447, 247, 82, 612, 30, 447, 251, 531, 617, 530, 287, 262, 2607, 13, 198, 198, 447, 250, 5308, 11, 1869, 14057, 11, 447, 251, 25995, 531, 13, 198, 198, 447, 25
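
As a mental model for the experiment above, the sketch below imitates how the two flags interact on a plain list of token ids. It is a simplification under stated assumptions (no stride, no special tokens) and not the tokenizer's actual implementation; `chunk_ids` is an illustrative name.

```python
def chunk_ids(token_ids, max_length, truncation=True, return_overflowing_tokens=True):
    """Toy model of truncation and overflow handling for one long sequence."""
    if not truncation:
        # Nothing is cut: a single row of arbitrary length.
        return [list(token_ids)]
    if not return_overflowing_tokens:
        # Only the first window survives; everything after it is discarded.
        return [list(token_ids[:max_length])]
    # Every window of up to max_length tokens is kept as its own row.
    return [list(token_ids[i:i + max_length])
            for i in range(0, len(token_ids), max_length)]

toy_ids = list(range(300))  # stand-in for a tokenized book
for trunc in (True, False):
    for overflow in (True, False):
        rows = chunk_ids(toy_ids, 128, trunc, overflow)
        print(f"truncation={trunc!s:5} overflow={overflow!s:5} row lengths={[len(r) for r in rows]}")
```

With both flags on, the whole text is preserved as a stack of fixed-size windows, which is the form we want for training.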

## Tokenize the Datasets

We now write the code to tokenize the datasets. Here we'll select `True` for `truncation` and `return_overflowing_tokens`. We'll also add parameters to control the padding and select what type of data will be returned.

In [None]:
raw_train_data = Dataset.from_dict({"text": [raw_training_text]}) # Wrapping our data into a huggingface library's Dataset object,
raw_eval_data = Dataset.from_dict({"text": [raw_eval_text]})      # which allows convenient data preprocessing options.

raw_datasets = DatasetDict( # Wrapping both datasets into a "DatasetDict" object that can hold different data splits.
    {
        "train": raw_train_data,
        "valid": raw_eval_data,
    }
)

raw_datasets.set_format("torch") # Makes the datasets more convenient to use with pytorch.

# How much context should we consider at once. We'll set it to a relatively short 250 token context to keep things manageable.
context_length = 250

def tokenize(element):
    '''A function to tokenize a given element (or a batch of elements) in the data.'''

    outputs = tokenizer(
        element["text"],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
        padding=True, # Using a special padding token, extend shorter sequences in the batch of elements to match the length of the longest one.
        return_tensors='pt' # Returned data will be PyTorch tensors.
    )

    tokenized_words = outputs["input_ids"].to("cuda:0")

    # Note that in Causal Language Modeling, the answer to each input is just the next token in the input.
    # So essentially the outputs are inputs shifted by one. Here we provide labels to be the same as inputs
    # because during training, this label shifting will be done for us automatically.
    return {"input_ids": tokenized_words, "labels": tokenized_words.clone()}

tokenized_datasets = raw_datasets.map(
    tokenize, remove_columns="text"
).shuffle()
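
The comment in `tokenize` about label shifting can be illustrated without any model. The sketch below shows the shift that, as described above, is applied for us during training: position `i` of the input is scored against token `i + 1` of the labels. `shift_for_causal_lm` is a hypothetical helper for illustration only.

```python
def shift_for_causal_lm(input_ids, labels):
    """Align each input position with the token that should follow it."""
    shifted_inputs = input_ids[:-1]   # the last token has nothing to predict
    shifted_labels = labels[1:]       # the first token is never a target
    return shifted_inputs, shifted_labels

tokens = ["The", " door", " was", " open", "."]
inputs, targets = shift_for_causal_lm(tokens, list(tokens))
for position, target in enumerate(targets):
    context = "".join(inputs[: position + 1])
    print(f"context {context!r:22} -> target {target!r}")
```

This is why `labels` can simply be a copy of `input_ids` in the tokenization function.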



# Specify Quantization and Load the Model

We'll come back to the tokenized datasets in the next lesson. Now we turn our attention to loading the model.

## Select Quantization

`bitsandbytes` is a Python library for quantizing models; we'll use it to quantize ours. In the cell below, we specify the quantization parameters.

In [None]:
################################################################################
# bitsandbytes (quantization) parameters
################################################################################

# Activate 4-bit precision base model loading
use_4bit = True # Whether to quantize model weights to 4bits.

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16" # For some GPUs, 'bfloat16' format could be the optimal choice.

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "fp4" # Choosing between different number representation formats.

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = True

# Use variables above to define a quantization configuration object.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=bnb_4bit_compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,

)
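
To give a feel for what 4-bit quantization does to weights, here is a deliberately simplified sketch: absmax quantization of one block of floats to signed 4-bit integers. This is an assumption-laden illustration, not how `bitsandbytes` works internally; its `fp4` and `nf4` formats use non-uniform levels, and the helper names below are invented for this example.

```python
def quantize_block(values, bits=4):
    """Absmax-quantize one block of floats to signed integers of the given width."""
    levels = 2 ** (bits - 1) - 1                   # 7 for 4 bits: codes in [-7, 7]
    absmax = max(abs(v) for v in values) or 1.0    # guard against an all-zero block
    scale = absmax / levels                        # one float stored per block
    codes = [round(v / scale) for v in values]
    return codes, scale

def dequantize_block(codes, scale):
    """Recover approximate floats from integer codes and the block scale."""
    return [c * scale for c in codes]

weights = [0.12, -0.70, 0.33, 0.04, -0.21, 0.64, -0.02, 0.48]
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:", codes)
print(f"scale: {scale:.4f}  max abs error: {max_error:.4f}")
```

Each block stores small integer codes plus one floating-point scale. Nested (double) quantization goes one step further and also quantizes those per-block scales to save a little more memory.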


## Load the Model

Load the model using parameters specified above.

In [None]:
# Specify that we want to load the entire model onto GPU 0
if not torch.cuda.is_available():
    raise ValueError("Please make sure your runtime is set to GPU.")

device = "cuda:0" # The first among the available GPUs.
device_map = {"": 0} # Specify which elements of the model go to which device.
                     # This is especially relevant for huge models that don't fit on one GPU.
                     # In our case, we map everything to device 0 (GPU number 0) when loading the model.


# Load base model

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    revision=revision,
    device_map=device_map,
    trust_remote_code=True, # Allow running custom model code shipped in the model repository; only enable this for sources you trust.
    quantization_config=bnb_config if use_4bit else None,
    torch_dtype=torch.float16 # When quantization is not used,
                              # we need to specify this to avoid loading the model in 32bit.
)

model.config.use_cache = False # Caching speeds up inference, but is irrelevant for training/fine-tuning.
                               # We've found it interferes with Colab behavior when different models are loaded/unloaded.
                               # So we'll keep it off. In practice, for inference, setting it to True (default) is advisable.
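
A back-of-the-envelope calculation shows why the dtype and quantization choices matter on a single Colab GPU. The sketch below assumes Phi-2 has roughly 2.7 billion parameters and counts weights only; it ignores quantization constants, any layers kept in higher precision, activations, and CUDA overhead, so real usage will be higher.

```python
N_PARAMS = 2.7e9  # assumed approximate parameter count of Phi-2

def weight_memory_gib(n_params, bits_per_param):
    """Memory needed to store the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30

for label, bits in [("float32", 32), ("float16", 16), ("4-bit", 4)]:
    print(f"{label:8s} ~{weight_memory_gib(N_PARAMS, bits):.1f} GiB")
```

Comparing the three lines makes it clear how much headroom 4-bit loading leaves for activations and optimizer state during fine-tuning.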


Before jumping to fine-tuning it is crucial to check that the base model works as expected. We'll also use this opportunity to quickly check whether quantization affects performance. In this lesson, we'll make a simple manual comparison.

There is a [popular myth](https://theamericanscholar.org/the-shortest-story-ever-told/) that Hemingway once won a bet by writing a one-sentence story that made people cry. Therefore we'll use the following prompt in our examples: "As promised, here is a one-sentence story that will make you cry: "

In [None]:
# Tokenize our prompt:
inputs = tokenizer('''As promised, here is a one-sentence story that will make you cry: ''',
                      return_tensors="pt").to(device)

torch.manual_seed(42) # Specify the seed for the (pseudo)random number generator:
                      # useful when we sample randomly but want our results to be reproducible.

with torch.no_grad(): # Currently we are not training our model, so we don't need to keep track of gradients.

    generation_config = GenerationConfig(max_length=200,
                                         eos_token_id=tokenizer.eos_token_id,
                                         # do_sample=False: deterministic (highest-probability) decoding.
                                         # do_sample=True: sample each next token in proportion to its predicted probability.
                                         do_sample=False,
                                         use_cache=False)

    outputs = model.generate(**inputs, generation_config=generation_config)

    text = tokenizer.batch_decode(outputs)[0]
    print(text)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


As promised, here is a one-sentence story that will make you cry: 

"The world is a cruel and unforgiving place, and no amount of love or kindness can change that."
<|endoftext|>


**Bonus exploration suggestions:** When sampling, you can play around with extra parameters to change sampling procedure and outcomes. Try changing `do_sample` to `True` and adding some of the following arguments to the generation config:

* `temperature=1.2` (the higher the temperature, the more creative/unhinged the generation will be)
* `top_k=50` (sampling is restricted to the 50 most likely tokens, to allow some creativity without going too crazy)
* `top_p=0.7` (nucleus sampling: a similar idea, but each next token is drawn from the smallest set of most probable tokens whose probabilities together reach 70%)
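
To see what these three parameters do to a next-token distribution, here is a toy sketch in pure Python. The vocabulary and logits are made up for illustration, and the helper functions are not part of `transformers`; they only mirror the general idea of temperature scaling followed by top-k and top-p filtering.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; higher temperature flattens the result."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = set(), 0.0
    for i in order:
        keep.add(i)
        mass += probs[i]
        if mass >= p:
            break
    kept = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    total = sum(kept)
    return [q / total for q in kept]

vocab = ["baby", "shoes", "never", "worn", "the"]   # hypothetical tiny vocabulary
logits = [2.0, 1.5, 0.5, 0.2, -1.0]                 # hypothetical model scores

random.seed(42)
probs = softmax(logits, temperature=1.2)
probs = top_k_filter(probs, k=3)
probs = top_p_filter(probs, p=0.7)
print([round(q, 3) for q in probs])
print("sampled:", random.choices(vocab, weights=probs, k=1)[0])
```

Tokens filtered out by either step end up with probability zero, so they can never be sampled; raising `temperature`, `top_k`, or `top_p` widens the pool.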