# 🧠 LLM Training Demo
This notebook demonstrates how to load a lightweight dataset, tokenize it, configure a small GPT-2 style language model initialized from scratch, and train it using Hugging Face's Trainer API in Google Colab.

In [None]:
# Disable Weights & Biases logging so training does not prompt for a W&B login
import os
os.environ["WANDB_DISABLED"] = "true"



**Step 1: Load the Dataset**
--------------------------------------------------
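In this step, we load a lightweight dataset suitable for quick experimentation.

*   We use the AG News dataset ("ag_news"), read from CSV files whose rows hold a class index, a news article title, and a description.
*   Title and description are joined into a single `text` field, and only the first 100 training rows and 50 test rows are kept so the demo runs quickly.
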
In [None]:
# Step 1: Download and Load CSV Dataset
import pandas as pd
from datasets import Dataset

# Download CSVs (for Colab use)
!wget -q https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv
!wget -q https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/test.csv

# Load CSVs into DataFrames
train_df = pd.read_csv("train.csv", header=None, names=["Class Index", "Title", "Description"])
test_df = pd.read_csv("test.csv", header=None, names=["Class Index", "Title", "Description"]) # Load the test data

# Combine title and description for training text
train_df["text"] = train_df["Title"] + ". " + train_df["Description"]
train_dataset = Dataset.from_pandas(train_df[["text"]].head(100))  # limit to 100 samples for demo

# Combine title and description for test text and create eval dataset
test_df["text"] = test_df["Title"] + ". " + test_df["Description"]
eval_dataset = Dataset.from_pandas(test_df[["text"]].head(50)) # limit test data for evaluation

print("Sample data:")
print(train_dataset[0])
print("\nSample eval data:") # Print sample from eval dataset
print(eval_dataset[0])




Sample data:
{'text': "Wall St. Bears Claw Back Into the Black (Reuters). Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again."}

Sample eval data:
{'text': "Fears for T N pension after talks. Unions representing workers at Turner   Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul."}


**Step 2: Tokenization**
--------------------------------------------------
* We tokenize the text using the pre-trained `gpt2` tokenizer. GPT-2 has no dedicated padding token, so the end-of-text token is reused for padding.
* Tokenization splits text into tokens and pads or truncates each sequence to a fixed length (64 tokens here) so that examples can be batched.
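
To make the padding and truncation step concrete, here is a minimal pure-Python sketch of what happens to one sequence of token IDs. The helper name `pad_or_truncate` is hypothetical and is not part of the tokenizer API; it assumes right-side padding, which is the GPT-2 tokenizer's default.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids[:max_length])              # truncation: drop anything past max_length
    n_pad = max_length - len(ids)                   # how many padding slots are needed
    attention_mask = [1] * len(ids) + [0] * n_pad   # 1 = real token, 0 = padding
    input_ids = ids + [pad_id] * n_pad              # pad on the right
    return input_ids, attention_mask


PAD_ID = 50256  # GPT-2's end-of-text id, reused as the padding id in this notebook

short_ids, short_mask = pad_or_truncate([22401, 520, 13], max_length=6, pad_id=PAD_ID)
long_ids, long_mask = pad_or_truncate(list(range(10)), max_length=6, pad_id=PAD_ID)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model tell real tokens from padding, which matters here because the padding id and the end-of-text id are the same number.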

In [None]:
# Step 2: Tokenization
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(example):
    result = tokenizer(example["text"], truncation=True, padding="max_length", max_length=64)
    result["labels"] = result["input_ids"].copy()
    return result

tokenized_train_dataset = train_dataset.map(tokenize_function, batched=True)
tokenized_eval_dataset = eval_dataset.map(tokenize_function, batched=True)


In [None]:
print(tokenized_train_dataset[0]['text'])
print("------------------------------------")
print(tokenized_train_dataset[0]['input_ids'])
print("------------------------------------")
print(tokenized_eval_dataset[0]['text'])
print("------------------------------------")
print(tokenized_eval_dataset[0]['input_ids'])
# tokenized_train_dataset[0].keys()

Wall St. Bears Claw Back Into the Black (Reuters). Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.
------------------------------------
[22401, 520, 13, 15682, 30358, 5157, 20008, 262, 2619, 357, 12637, 737, 8428, 532, 10073, 12, 7255, 364, 11, 5007, 3530, 338, 45215, 59, 3903, 286, 14764, 12, 948, 77, 873, 11, 389, 4379, 4077, 757, 13, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256]
------------------------------------
Fears for T N pension after talks. Unions representing workers at Turner   Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul.
------------------------------------
[37, 4127, 329, 309, 399, 13553, 706, 6130, 13, 791, 507, 10200, 3259, 379, 15406, 220, 220, 968, 439, 910, 484, 389, 705, 6381, 32924, 6, 706, 6130, 351, 47455, 2560, 4081, 5618, 30926, 3
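
The long run of `50256` values at the end of the first list of token IDs printed above is padding. Because `tokenize_function` copies `input_ids` into `labels` unchanged, those padding positions also count toward the training loss. A common variation, sketched below, replaces the labels at padded positions with `-100`, the value PyTorch's cross-entropy loss ignores by default. This is an optional alternative and not what the notebook's cell does; the helper name `mask_padding_labels` is hypothetical.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX.

    The attention mask is used rather than the token id because the padding id
    and the end-of-text id are identical for this tokenizer setup.
    """
    return [
        tok if keep == 1 else IGNORE_INDEX
        for tok, keep in zip(input_ids, attention_mask)
    ]


example_ids = [22401, 520, 13, 50256, 50256]
example_mask = [1, 1, 1, 0, 0]
example_labels = mask_padding_labels(example_ids, example_mask)
print(example_labels)
```

Applying this inside `tokenize_function` would change the reported loss values, so numbers from a masked run are not directly comparable with the ones shown in this notebook.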

### 🔧 Step 3: Create GPT-2 from Scratch

In this step, we define and initialize a GPT-2 model **without loading any pretrained weights**. This allows us to train the model from scratch using our own dataset.

#### 🛠️ Key Components:
- **GPT2Config**: Specifies the architecture of the model, including the number of layers, attention heads, embedding size, and context window.
- **GPT2LMHeadModel**: This is the causal language model built on GPT-2, with a linear layer on top of the hidden states for language modeling tasks.

#### 📌 Configuration Details:
```python
config = GPT2Config(
    vocab_size=tokenizer.vocab_size,  # Use vocabulary size from the tokenizer
    n_positions=64,                   # Maximum sequence length the model can handle
    n_ctx=64,                         # Context size (same as n_positions)
    n_embd=128,                       # Size of token embeddings and hidden states
    n_layer=4,                        # Number of transformer blocks (depth of the model)
    n_head=4,                         # Number of attention heads
    pad_token_id=tokenizer.pad_token_id  # Define padding token to avoid mismatch
)
```

In [None]:
# Step 3: Create GPT-2 from scratch
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=tokenizer.vocab_size,
    n_positions=64,
    n_ctx=64,
    n_embd=128,
    n_layer=4,
    n_head=4,
    pad_token_id=tokenizer.pad_token_id
)

model = GPT2LMHeadModel(config)  # Model initialized from scratch
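
To get a feel for how small this configuration is, the parameter count can be estimated by hand. The sketch below assumes the standard GPT-2 layout: learned token and position embeddings, an MLP inner size of `4 * n_embd`, biases on every linear layer, and an output projection that shares weights with the token embedding. The function name `gpt2_param_count` is hypothetical. The number of attention heads does not affect the count.

```python
def gpt2_param_count(vocab_size, n_positions, n_embd, n_layer):
    """Estimate trainable parameters for a GPT-2 style model with tied output weights."""
    token_emb = vocab_size * n_embd              # token embedding matrix
    pos_emb = n_positions * n_embd               # learned position embeddings
    layer_norm = 2 * n_embd                      # scale + shift
    attn_qkv = n_embd * 3 * n_embd + 3 * n_embd  # fused query/key/value projection
    attn_out = n_embd * n_embd + n_embd          # attention output projection
    mlp_in = n_embd * 4 * n_embd + 4 * n_embd    # MLP expansion
    mlp_out = 4 * n_embd * n_embd + n_embd       # MLP contraction
    block = 2 * layer_norm + attn_qkv + attn_out + mlp_in + mlp_out
    return token_emb + pos_emb + n_layer * block + layer_norm  # plus final layer norm


# 50257 is the GPT-2 tokenizer's vocabulary size
demo_params = gpt2_param_count(vocab_size=50257, n_positions=64, n_embd=128, n_layer=4)
embedding_share = (50257 * 128) / demo_params
print(f"{demo_params:,} parameters, {embedding_share:.0%} in the token embedding")
```

Under these assumptions the model has roughly 7.2 million parameters, most of which sit in the token embedding rather than in the transformer blocks. If the assumptions hold, `model.num_parameters()` should report the same figure.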

### 🧾 Step 4: Define TrainingArguments

Once the dataset has been tokenized and the model is ready, we need to define how the training process should proceed. Hugging Face's `TrainingArguments` class allows us to configure all key parameters for training, evaluation, logging, and checkpointing.

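Below is an annotated configuration for training GPT-2 **from scratch** on a small dataset. Note that the executed cell that follows differs in two places: it sets `num_train_epochs=100` and leaves `evaluation_strategy` commented out, so no per-epoch evaluation runs.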

```python
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./gpt2_scratch",      # Directory to save model checkpoints and final model
    evaluation_strategy="epoch",      # Evaluate the model after each epoch
    per_device_train_batch_size=4,    # Batch size per device (GPU/CPU)
    num_train_epochs=10,              # Number of training epochs
    logging_dir="./logs",             # Directory to store training logs
    logging_steps=10,                 # Log metrics every 10 steps
    save_steps=20,                    # Save model checkpoint every 20 steps
    save_total_limit=1,               # Retain only the most recent checkpoint
    fp16=False                        # Mixed precision disabled (set True on a GPU that supports it)
)
```

In [None]:
# Step 4: Define TrainingArguments
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./gpt2_scratch",
    # evaluation_strategy="epoch",
    per_device_train_batch_size=4,
    num_train_epochs=100,
    logging_dir="./logs",
    logging_steps=10,
    save_steps=20,
    save_total_limit=1,
    fp16=False
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
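
A quick way to sanity-check these settings is to work out how many optimizer steps they imply. The sketch below is plain arithmetic, not a Trainer API; the function name `training_schedule` is hypothetical. It assumes a single device, no gradient accumulation, and that the last partial batch of an epoch is kept.

```python
import math


def training_schedule(num_examples, batch_size, num_epochs, logging_steps, save_steps):
    """Return step counts implied by the dataset size and training settings."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * num_epochs
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "log_events": total_steps // logging_steps,   # one loss line every logging_steps
        "periodic_saves": total_steps // save_steps,  # saves triggered by save_steps only
    }


# 100 training examples and batch size 4, as used in this notebook
schedule_10_epochs = training_schedule(100, 4, num_epochs=10, logging_steps=10, save_steps=20)
schedule_100_epochs = training_schedule(100, 4, num_epochs=100, logging_steps=10, save_steps=20)
print(schedule_10_epochs)
print(schedule_100_epochs)
```

With `save_total_limit=1`, only the most recent of those periodic checkpoints remains on disk at any time.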


### 🏋️‍♂️ Step 5: Train the GPT-2 Model

In this step, we train our GPT-2 model from scratch using the `Trainer` API provided by Hugging Face's `transformers` library.

#### 🧠 What is the `Trainer`?
The `Trainer` class is a high-level API that simplifies the training and evaluation loop. It takes care of batching, optimization, logging, checkpointing, and evaluation under the hood.

#### ⚙️ Trainer Setup:
We configure the `Trainer` with the following parameters:

```python
trainer = Trainer(
    model=model,                                  # GPT-2 model initialized from scratch
    args=training_args,                           # TrainingArguments that specify epochs, logging, batch size, etc.
    train_dataset=tokenized_train_dataset,        # Tokenized training data
    eval_dataset=tokenized_eval_dataset,          # Tokenized evaluation data (optional but useful for monitoring)
    tokenizer=tokenizer                           # Tokenizer used for encoding/decoding text
)
```

In [None]:
# Step 5: Train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_eval_dataset,
    tokenizer=tokenizer
)

trainer.train()
print("✅ Training complete.")

| Step | Training Loss |
|------|---------------|
| 10   | 5.8075        |
| 20   | 5.4117        |
| 30   | 5.3605        |
| 40   | 5.2996        |
| 50   | 5.4156        |
| 60   | 5.297         |
| 70   | 5.4189        |
| 80   | 5.1114        |
| 90   | 5.4238        |
| 100  | 5.1388        |




✅ Training complete.
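
The logged training loss is easier to interpret as perplexity. The sketch below assumes the reported value is the mean per-token cross-entropy in nats, so perplexity is its exponential. For comparison it also computes the loss of a model that guesses uniformly over the 50,257-token GPT-2 vocabulary. The variable names are illustrative only.

```python
import math


def perplexity(cross_entropy_loss):
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(cross_entropy_loss)


last_logged_loss = 5.1388              # loss at step 100 in the log above
uniform_loss = math.log(50257)         # loss of a uniform guess over the vocabulary

print(f"perplexity at step 100: {perplexity(last_logged_loss):.1f}")
print(f"uniform-guess loss:     {uniform_loss:.2f}")
```

Because the labels in this notebook include padding positions, which are easy to predict, the logged loss likely understates how hard the real text is for the model.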


**Step 6: Inference**
--------------------------------------------------
* Now we use the trained model to generate text. We provide a prompt and let the model predict the continuation of the text.

In [None]:
def generate_text(prompt):
    """Sample a continuation of `prompt` from the trained model and return it as a string."""
    inputs = tokenizer(prompt, return_tensors="pt")

    # Move input tensors to the same device as the model (CPU or GPU)
    device = model.device
    input_ids = inputs["input_ids"].to(device)
    attention_mask = inputs["attention_mask"].to(device)

    outputs = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_new_tokens=50,       # prompt length + 50 must fit in the 64-position context
        num_return_sequences=1,
        do_sample=True,          # sample instead of greedy decoding
        top_k=50,                # keep only the 50 most likely next tokens
        top_p=0.95,              # then keep the smallest set covering 95% of the probability
        temperature=0.7          # values below 1 sharpen the distribution
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

In [None]:
# Example usage:
prompt = "Breaking news:"
generated_text = generate_text(prompt)
print("\nGenerated Text:\n", generated_text)


Generated Text:
 Breaking news: are with the all- (Reuters). Reuters - (Reuters).
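
The sampling settings passed to `model.generate` (`temperature`, `top_k`, `top_p`) can be illustrated on a toy list of logits. The following is a simplified pure-Python sketch of the idea, not the library's implementation; the function name `filter_logits` and the toy values are hypothetical.

```python
import math
import random


def filter_logits(logits, temperature=0.7, top_k=50, top_p=0.95):
    """Return {token_id: probability} after temperature, top-k, and top-p filtering."""
    scaled = [x / temperature for x in logits]      # temperature < 1 sharpens the distribution

    # Top-k: keep only the k highest-scoring token ids, best first
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    kept = order[:top_k]

    # Softmax over the surviving tokens (subtract the max for numerical stability)
    peak = scaled[kept[0]]
    exps = {i: math.exp(scaled[i] - peak) for i in kept}
    total = sum(exps.values())
    probs = {i: e / total for i, e in exps.items()}

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    nucleus, cumulative = [], 0.0
    for i in kept:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    norm = sum(probs[i] for i in nucleus)           # renormalize over the nucleus
    return {i: probs[i] / norm for i in nucleus}


toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
toy_dist = filter_logits(toy_logits, temperature=1.0, top_k=3, top_p=0.8)
next_token = random.choices(list(toy_dist), weights=list(toy_dist.values()))[0]
print(toy_dist, next_token)
```

In this toy case, top-k drops the two least likely tokens and top-p then drops a third, so sampling only ever picks token 0 or token 1.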
