# 🧠 LLM Training Demo
This notebook demonstrates how to load a lightweight dataset, tokenize it, configure a small language model (DistilGPT2), and train it using Hugging Face's Trainer API in Google Colab.

In [None]:
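import os

# Disable Weights & Biases logging so the Trainer does not prompt for a W&B login.
# Newer transformers releases deprecate this variable in favor of report_to="none".
os.environ["WANDB_DISABLED"] = "true"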



**Step 1: Load the Dataset**
--------------------------------------------------
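In this step, we load a lightweight dataset suitable for quick experimentation.

*   We use the AG News dataset ("ag_news"), which consists of news article titles and descriptions, each with a class index.
*   Only the title and description are used; they are joined into a single text field, and the data is limited to 100 training and 50 evaluation samples.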
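As a rough illustration of the record layout, the sketch below parses one headerless CSV row with the standard `csv` module and joins title and description into a single string. The column order (class index, title, description) mirrors the column names used in this notebook; the helper name `rows_to_text` and the class index value `3` are assumptions made for the example, while the title and description are copied from the notebook's sample evaluation record.

```python
import csv
import io

COLUMNS = ["Class Index", "Title", "Description"]

def rows_to_text(csv_file):
    """Yield 'Title. Description' strings from a headerless three-column CSV."""
    for row in csv.DictReader(csv_file, fieldnames=COLUMNS):
        yield f'{row["Title"]}. {row["Description"]}'

# Build a one-row CSV in memory. The class index "3" is an assumed value;
# title and description come from the notebook's sample eval output.
buffer = io.StringIO()
csv.writer(buffer, quoting=csv.QUOTE_ALL).writerow([
    "3",
    "Fears for T N pension after talks",
    "Unions representing workers at Turner   Newall say they are "
    "'disappointed' after talks with stricken parent firm Federal Mogul.",
])
buffer.seek(0)

texts = list(rows_to_text(buffer))
print(texts[0])
```

Passing `fieldnames` explicitly makes `DictReader` treat the first row as data rather than as a header, which matches a file that has no header line.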






In [None]:
# Step 1: Download and Load CSV Dataset
import pandas as pd
from datasets import Dataset

# Download CSVs (for Colab use)
!wget -q https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv
!wget -q https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/test.csv

# Load CSVs into DataFrames
train_df = pd.read_csv("train.csv", header=None, names=["Class Index", "Title", "Description"])
test_df = pd.read_csv("test.csv", header=None, names=["Class Index", "Title", "Description"]) # Load the test data

# Combine title and description for training text
train_df["text"] = train_df["Title"] + ". " + train_df["Description"]
train_dataset = Dataset.from_pandas(train_df[["text"]].head(100))  # limit to 100 samples for demo

# Combine title and description for test text and create eval dataset
test_df["text"] = test_df["Title"] + ". " + test_df["Description"]
eval_dataset = Dataset.from_pandas(test_df[["text"]].head(50)) # limit test data for evaluation

print("Sample data:")
print(train_dataset[0])
print("\nSample eval data:") # Print sample from eval dataset
print(eval_dataset[0])




Sample data:
{'text': "Wall St. Bears Claw Back Into the Black (Reuters). Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again."}

Sample eval data:
{'text': "Fears for T N pension after talks. Unions representing workers at Turner   Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul."}


**Step 2: Tokenization**
--------------------------------------------------
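* We tokenize the text with a pre-trained tokenizer. In this example, we use the tokenizer of a lightweight causal language model, DistilGPT2 (`distilgpt2`).
* Tokenization splits text into tokens and pads or truncates each sequence to a fixed length (64 tokens here) so that examples can be batched.
* GPT-2 tokenizers define no padding token, so the end-of-sequence token is reused for padding.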
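To make the padding and truncation step concrete, here is a minimal sketch that works on plain lists of token IDs, independent of the tokenizer. It also shows one common way to keep padded positions out of the loss: setting their labels to `-100`, the value the causal LM loss ignores. The helper names `pad_or_truncate` and `mask_labels` are hypothetical, and the masking step is an optional variation, not something this notebook does.

```python
PAD_ID = 50256       # DistilGPT2's EOS id, reused as the pad id in this notebook
IGNORE_INDEX = -100  # label value that the causal LM loss skips

def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]            # truncate on the right
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad      # pad on the right
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask

def mask_labels(input_ids, attention_mask):
    """Copy input ids as labels, replacing padded positions with IGNORE_INDEX."""
    return [tok if keep else IGNORE_INDEX
            for tok, keep in zip(input_ids, attention_mask)]

short_ids = [22401, 520, 13]   # first three ids of the training sample
long_ids = list(range(10))     # hypothetical ids, longer than max_length

ids, mask = pad_or_truncate(short_ids, max_length=6)
labels = mask_labels(ids, mask)
truncated, _ = pad_or_truncate(long_ids, max_length=6)

print(ids)
print(mask)
print(labels)
print(truncated)
```

Because the pad token and the EOS token share an ID here, masking by attention mask rather than by token ID avoids accidentally hiding a genuine end-of-sequence token.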

In [None]:
# Step 2: Tokenization
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token  # Set pad token to end-of-sequence token

def tokenize_function(example):
    # Tokenize a batch of texts, padding or truncating each one to 64 tokens
    tokenized_output = tokenizer(example["text"], truncation=True, padding="max_length", max_length=64)
    # Causal LM training uses the input IDs as labels (the model shifts them internally).
    # Padded positions are copied too, so they also count toward the loss.
    tokenized_output["labels"] = tokenized_output["input_ids"].copy()
    return tokenized_output

tokenized_train_dataset = train_dataset.map(tokenize_function, batched=True) # Tokenize train dataset
tokenized_eval_dataset = eval_dataset.map(tokenize_function, batched=True) # Tokenize eval dataset







In [None]:
print(tokenized_train_dataset[0]['text'])
print("------------------------------------")
print(tokenized_train_dataset[0]['input_ids'])
print("------------------------------------")
print(tokenized_eval_dataset[0]['text'])
print("------------------------------------")
print(tokenized_eval_dataset[0]['input_ids'])
# tokenized_train_dataset[0].keys()

Wall St. Bears Claw Back Into the Black (Reuters). Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.
------------------------------------
[22401, 520, 13, 15682, 30358, 5157, 20008, 262, 2619, 357, 12637, 737, 8428, 532, 10073, 12, 7255, 364, 11, 5007, 3530, 338, 45215, 59, 3903, 286, 14764, 12, 948, 77, 873, 11, 389, 4379, 4077, 757, 13, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256]
------------------------------------
Fears for T N pension after talks. Unions representing workers at Turner   Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul.
------------------------------------
[37, 4127, 329, 309, 399, 13553, 706, 6130, 13, 791, 507, 10200, 3259, 379, 15406, 220, 220, 968, 439, 910, 484, 389, 705, 6381, 32924, 6, 706, 6130, 351, 47455, 2560, 4081, 5618, 30926, 3

**Step 3: Configure LLM Parameters**
--------------------------------------------------
Here, we load the pre-trained model and define the training arguments.
* We use the DistilGPT2 model for demonstration because it is small and fast to train.
* `TrainingArguments` specifies hyperparameters such as batch size, number of epochs, logging settings, and save strategy.
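The step-based settings are easier to reason about once the number of optimizer steps is known. The sketch below works through that arithmetic for this notebook's values (100 training examples, batch size 4, 10 epochs, logging every 10 steps, saving every 20 steps). It assumes a single device, no gradient accumulation, and that a final partial batch is kept; the function name `training_schedule` is hypothetical, and the checkpoint count covers only step-interval saves.

```python
import math

def training_schedule(num_examples, batch_size, epochs, logging_steps, save_steps):
    """Estimate step counts for a single-device run without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "log_events": total_steps // logging_steps,
        "interval_checkpoints": total_steps // save_steps,
    }

schedule = training_schedule(
    num_examples=100, batch_size=4, epochs=10, logging_steps=10, save_steps=20
)
print(schedule)
```

With `save_total_limit=1`, only the most recent of those checkpoints stays on disk at any time.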

In [None]:
# Step 3: Configure LLM Parameters
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

tiny_model = AutoModelForCausalLM.from_pretrained("distilgpt2")

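training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",          # evaluate at the end of every epoch
    per_device_train_batch_size=4,
    num_train_epochs=10,
    logging_dir="./logs",
    logging_steps=10,               # log training loss every 10 optimizer steps
    save_steps=20,                  # write a checkpoint every 20 steps
    save_total_limit=1,             # keep only the most recent checkpoint
    report_to="none",               # disable external loggers such as W&B
    fp16=False,
)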




Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


**Step 4: Train the LLM**
--------------------------------------------------
* We now initialize the `Trainer` object with the model, tokenizer, training arguments, and the tokenized training and evaluation datasets.
* `Trainer` handles the training loop internally and, with the settings above, evaluates once per epoch.
* We then call the `.train()` method to start training.
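The loop that `Trainer` runs is hidden behind `.train()`. The sketch below is a deliberately tiny stand-in, not the library's implementation: it fits a one-parameter model `y = w * x` with mini-batch gradient descent and evaluates once per epoch, which is the same overall shape (epochs, batches, update, per-epoch evaluation). All names (`batches`, `mse`, `train`) and the toy data are hypothetical.

```python
def batches(data, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

def mse(w, data):
    """Mean squared error of the model y = w * x on (x, y) pairs."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def train(train_data, eval_data, epochs=10, batch_size=4, lr=0.01):
    w = 0.0
    history = []
    for epoch in range(1, epochs + 1):
        for batch in batches(train_data, batch_size):
            # Gradient of the batch MSE with respect to w
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad                      # optimizer step
        # Evaluate at the end of each epoch, as eval_strategy="epoch" does
        history.append((epoch, mse(w, train_data), mse(w, eval_data)))
    return w, history

# Toy data generated from y = 2x
train_data = [(float(x), 2.0 * x) for x in range(1, 9)]
eval_data = [(float(x), 2.0 * x) for x in range(9, 12)]

w, history = train(train_data, eval_data)
print(f"learned w = {w:.6f}")
```

The real `Trainer` adds much more on top of this shape, including device placement, learning-rate scheduling, logging, and checkpointing.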

In [None]:
# Step 4: Train the LLM
trainer = Trainer(
    model=tiny_model,
    args=training_args,
    train_dataset=tokenized_train_dataset,  # Use the tokenized training dataset
    eval_dataset=tokenized_eval_dataset,    # Provide the tokenized evaluation dataset
    processing_class=tokenizer,             # replaces the deprecated tokenizer= argument in recent transformers releases
)

trainer.train()

print("\nTraining complete!")

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


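| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | 3.3681 | 3.45168 |
| 2 | 2.9595 | 3.384086 |
| 3 | 2.6271 | 3.388221 |
| 4 | 2.2664 | 3.456424 |
| 5 | 1.96 | 3.514488 |
| 6 | 2.0016 | 3.574892 |
| 7 | 1.7646 | 3.632213 |
| 8 | 1.7793 | 3.68539 |
| 9 | 1.7097 | 3.708081 |
| 10 | 1.719 | 3.722567 |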
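The results suggest overfitting on the 100-sample training set: training loss keeps falling while validation loss reaches its minimum early and then rises. A small sketch for reading the numbers, assuming the reported loss is a mean cross-entropy in nats so that perplexity is its exponential; the variable names are hypothetical and the values are copied from the results above.

```python
import math

# (epoch, training loss, validation loss) copied from the training results
history = [
    (1, 3.3681, 3.45168),
    (2, 2.9595, 3.384086),
    (3, 2.6271, 3.388221),
    (4, 2.2664, 3.456424),
    (5, 1.96, 3.514488),
    (6, 2.0016, 3.574892),
    (7, 1.7646, 3.632213),
    (8, 1.7793, 3.68539),
    (9, 1.7097, 3.708081),
    (10, 1.719, 3.722567),
]

# Epoch with the lowest validation loss
best_epoch, _, best_val_loss = min(history, key=lambda row: row[2])

# Perplexity is the exponential of the mean cross-entropy loss
best_perplexity = math.exp(best_val_loss)

# Gap between validation and training loss at the last epoch
final_gap = history[-1][2] - history[-1][1]

print(f"best epoch: {best_epoch}")
print(f"validation perplexity at best epoch: {best_perplexity:.1f}")
print(f"final validation-minus-training gap: {final_gap:.3f}")
```

For a demo this small, stopping near the best epoch or training on more than 100 samples would be the usual first things to try.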



Training complete!


**Step 5: Inference**
--------------------------------------------------
* Now we use the trained model to generate text. We provide a prompt and let the model sample a continuation of up to 50 new tokens, with `temperature`, `top_k`, and `top_p` controlling how the next token is chosen.

In [None]:
def generate_text(prompt):
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"]
    attention_mask = inputs["attention_mask"]

    # Get the device of the model
    device = tiny_model.device

    # Move input tensors to the model's device
    input_ids = input_ids.to(device)
    attention_mask = attention_mask.to(device)

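    outputs = tiny_model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_new_tokens=50,
        num_return_sequences=1,
        do_sample=True,            # sample instead of greedy decoding
        top_k=50,                  # consider only the 50 most likely tokens
        top_p=0.95,                # then keep the smallest set covering 95% probability
        temperature=0.7,           # values below 1 sharpen the distribution
        pad_token_id=tokenizer.eos_token_id,  # avoids the "Setting pad_token_id" notice
    )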
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
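The sampling arguments passed to `generate` can be illustrated on a toy vocabulary. The sketch below is a simplified pure-Python rendering of temperature scaling, top-k filtering, and top-p (nucleus) filtering applied in that order; it is not the library's code, and the function names, toy logits, and the demo values `top_k=3` and `top_p=0.9` are assumptions chosen so the filtering is visible.

```python
import math
import random

def softmax(logits):
    """Convert a {token: logit} dict into a {token: probability} dict."""
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def filter_candidates(logits, temperature=0.7, top_k=50, top_p=0.95):
    # 1. Temperature: dividing by a value below 1 sharpens the distribution
    scaled = {tok: v / temperature for tok, v in logits.items()}
    # 2. Top-k: keep only the k highest-scoring tokens
    ranked = sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    probs = softmax(dict(ranked))
    # 3. Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the surviving probabilities sum to 1
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

def sample_next(logits, rng, **kwargs):
    """Draw one token from the filtered distribution."""
    dist = filter_candidates(logits, **kwargs)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]

# Hypothetical next-token logits
toy_logits = {"Apple": 2.0, "Google": 1.5, "the": 0.5, "zebra": -3.0}

dist = filter_candidates(toy_logits, temperature=0.7, top_k=3, top_p=0.9)
token = sample_next(toy_logits, random.Random(0), temperature=0.7, top_k=3, top_p=0.9)
print(sorted(dist))
print(token)
```

In this toy case top-k removes the least likely token and top-p then trims one more, so sampling happens over the two most likely candidates only.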

In [None]:
# Example usage:
prompt = "Breaking news:"
generated_text = generate_text(prompt)
print("\nGenerated Text:\n", generated_text)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



Generated Text:
 Breaking news: Apple could be planning to sell its operating system to Google in the wake of an agreement it struck with the world's leading technology firm.
