### Continued Pretraining with Unsloth

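In this notebook, I performed continued pretraining of a language model in a causal language modeling setup. Although the title mentions the **Unsloth** framework, the code below uses only the standard Hugging Face `transformers` and `datasets` APIs; no Unsloth import appears in any cell. The goal was to extend a base model's knowledge using a custom dataset. The main steps included: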

- Installing essential Hugging Face libraries such as `transformers` and `datasets`.
- Loading a pretrained language model and tokenizer suitable for further pretraining.
- Creating or formatting a dataset for language modeling using Hugging Face’s `Dataset` class.
- Tokenizing the dataset and applying necessary preprocessing like padding and truncation.
- Configuring training parameters using `TrainingArguments` for self-supervised pretraining.
- Executing the training loop using the `Trainer` class with `DataCollatorForLanguageModeling`.
- Saving the model after continued pretraining for downstream fine-tuning or inference.



### Install Required Libraries
- Install Hugging Face libraries like `transformers` and `datasets`.
- These tools are needed for tokenization, model loading, and training.


In [None]:
!pip install -q transformers datasets

### Import Dependencies
- Import essential modules from `torch`, `datasets`, and `transformers`.
- These handle dataset creation, tokenization, model training, and collation.


In [None]:
import torch
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments, DataCollatorForLanguageModeling


### Load Language Model
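- Load a small causal language model (`distilgpt2`) and its tokenizer for continued pretraining.
- GPT-2 tokenizers define no padding token, so the end-of-sequence token is reused for padding.
- After this step the model is ready to be trained on the new dataset.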


In [None]:
# ✅ Use a small model like DistilGPT-2
model_name = "distilgpt2"

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token; reuse EOS so padding works

model = AutoModelForCausalLM.from_pretrained(model_name)



### Create Dataset
- Define a small list of example sentences inline.
- The list is converted into a `Dataset` (from `datasets`) in the next step and used for continued pretraining of the model.


In [None]:
texts = [
    "Photosynthesis is the process by which green plants create energy from sunlight.",
    "The mitochondrion is often referred to as the powerhouse of the cell.",
    "Genetic mutations can be inherited or occur spontaneously.",
    "CRISPR is a revolutionary gene-editing technology.",
    "RNA plays a crucial role in translating DNA into proteins."
]

### Define Tokenization Function
- Wrap the sentences in a `Dataset` and create a function to tokenize its `text` field.
- The function pads or truncates every example to a fixed length of 64 tokens.
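
To make the padding and truncation step concrete, here is a minimal pure-Python sketch of what `truncation=True` with `padding="max_length"` does to a single sequence. The helper `pad_or_truncate` and the token IDs are hypothetical and only illustrate the idea; the real tokenizer handles this internally. It assumes right-side padding and truncation, which is the default for GPT-2 tokenizers.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    ids = list(token_ids)[:max_length]   # truncation: drop tokens past max_length
    mask = [1] * len(ids)                # 1 marks a real token
    n_pad = max_length - len(ids)
    ids += [pad_id] * n_pad              # padding: fill the rest with pad_id
    mask += [0] * n_pad                  # 0 marks a padding position
    return ids, mask


# Hypothetical token IDs. 50256 is GPT-2's end-of-text ID, reused here as the pad ID.
PAD_ID = 50256
short_ids, short_mask = pad_or_truncate([11, 22, 33], max_length=5, pad_id=PAD_ID)
long_ids, long_mask = pad_or_truncate(range(100, 110), max_length=5, pad_id=PAD_ID)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Padding every example to a fixed length keeps the batch shapes uniform, at the cost of computing over padding positions that the attention mask then tells the model to ignore.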


In [None]:
# Convert to Hugging Face Dataset
dataset = Dataset.from_dict({"text": texts})

# Tokenize function
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=64)

tokenized_dataset = dataset.map(tokenize_function, batched=True)


### Define Training Arguments

In [None]:
# ✅ Training arguments (CPU-friendly)
training_args = TrainingArguments(
    output_dir="./distilgpt2-medical-pretrain",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    logging_steps=1,
    save_steps=10,
    save_total_limit=1,
    prediction_loss_only=True,
    report_to="none",  # Disable reporting to external trackers (e.g. W&B, TensorBoard)
)
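
As a rough check on these settings, the number of optimizer steps can be estimated before training. The sketch below is a simplification that assumes a single device, no gradient accumulation, and that the last partial batch is kept; `optimizer_steps` is a hypothetical helper, not part of `transformers`.

```python
import math


def optimizer_steps(num_examples, batch_size, num_epochs):
    """Estimate total optimizer steps (single device, no gradient accumulation)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # last partial batch is kept
    return steps_per_epoch * num_epochs


# Values taken from this notebook: 5 sentences, batch size 1, 1 epoch.
total_steps = optimizer_steps(num_examples=5, batch_size=1, num_epochs=1)
print(total_steps)
```

With five examples and a batch size of 1, this gives five steps, so `logging_steps=1` produces one logged loss per example.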

### Set Up Trainer and Train
- Reuse the causal language model (`AutoModelForCausalLM`) loaded earlier for continued pretraining.
- Set up `DataCollatorForLanguageModeling` with `mlm=False` so that labels are created from the inputs, then build the `Trainer` and run training.
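
To make the label creation concrete, here is a simplified pure-Python sketch of what the collator does when `mlm=False`: labels are a copy of `input_ids` with padding positions replaced by `-100`, the value PyTorch's cross-entropy loss ignores by default. The one-position shift for next-token prediction happens inside the model, not in the collator. The helper names and token IDs are hypothetical.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def make_causal_lm_labels(input_ids, pad_id):
    """Copy input_ids into labels, masking padding positions."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in input_ids]


def next_token_pairs(input_ids, labels):
    """Pair each token with the label one position ahead, skipping ignored targets.

    This mirrors the shift a causal LM applies internally when computing the loss.
    """
    return [
        (input_ids[i], labels[i + 1])
        for i in range(len(input_ids) - 1)
        if labels[i + 1] != IGNORE_INDEX
    ]


# Hypothetical padded sequence; 50256 is GPT-2's end-of-text ID used as the pad ID.
PAD_ID = 50256
input_ids = [11, 22, 33, PAD_ID, PAD_ID]
labels = make_causal_lm_labels(input_ids, PAD_ID)
pairs = next_token_pairs(input_ids, labels)

print(labels)
print(pairs)
```

One consequence of setting the pad token to the EOS token is that any genuine EOS token in the data would also be masked out of the loss. As far as I know, the GPT-2 tokenizer does not append EOS automatically, so that should not affect this notebook, but it matters if EOS markers are added to the training text.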


In [None]:
# Data collator for language modeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
)

# Train!
trainer.train()

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


| Step | Training Loss |
|------|---------------|
| 1    | 4.6184        |
| 2    | 3.6378        |
| 3    | 3.888         |
| 4    | 3.7541        |
| 5    | 3.821         |


TrainOutput(global_step=5, training_loss=3.943846893310547, metrics={'train_runtime': 7.4214, 'train_samples_per_second': 0.674, 'train_steps_per_second': 0.674, 'total_flos': 81655234560.0, 'train_loss': 3.943846893310547, 'epoch': 1.0})
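
The reported numbers can be sanity-checked with a little arithmetic. The sketch below uses the five logged losses and the runtime from the output above; the assumption that `training_loss` is the mean of the per-step losses fits this run, where every step was logged.

```python
import math

# Per-step training losses copied from the log above.
logged_losses = [4.6184, 3.6378, 3.888, 3.7541, 3.821]

mean_loss = sum(logged_losses) / len(logged_losses)  # compare with training_loss
perplexity = math.exp(mean_loss)                     # perplexity = exp(cross-entropy)
samples_per_second = 5 / 7.4214                      # 5 examples over train_runtime

print(f"mean loss: {mean_loss:.4f}")
print(f"perplexity: {perplexity:.1f}")
print(f"samples/s: {samples_per_second:.3f}")
```

The mean agrees with the reported `training_loss` of about 3.9438 up to rounding of the logged values, and the corresponding perplexity is about 51.6. With only five sentences and a single epoch, these figures show that the pipeline runs and do not measure model quality.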

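### Save Model and Tokenizer
- Save the trained model and tokenizer to a local directory for later fine-tuning or inference.


In [None]: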
# Save the trained model and tokenizer
model.save_pretrained("continued-pretrained-model")
tokenizer.save_pretrained("continued-pretrained-model")
print("Model and tokenizer saved.")


Model and tokenizer saved.
