<center><h2><strong><font color="blue">WFH 2024 idBigData - Fine-Tuning LLM</font></strong></h2></center>

<img alt="" src="https://github.com/taudataanalytics/WFH-idBigData-2024/blob/main/images/covers/cover_taudata_uin.jpg?raw=1"/>

# Don't forget to switch the Runtime to GPU in Google Colab

In [None]:
import warnings; warnings.simplefilter('ignore')
from tqdm import tqdm
import os
os.environ["WANDB_MODE"] = "disabled"  # Disable wandb logging

try:
    import google.colab; IN_COLAB = True
    print("Installing the required modules")
    !pip install datasets --q
    print("preparing directories and assets")
    #!mkdir data images output models
    #!wget https://raw.githubusercontent.com/taudata...
except ImportError:  # google.colab is only importable inside Colab
    IN_COLAB = False
    print("Running the code locally, please make sure all the python module versions agree with colab environment and all data/assets downloaded")

from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
from datasets import load_dataset
import torch

# Step 3: Load pre-trained DistilGPT2 and tokenizer
model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# GPT-2 models ship without a padding token, so reuse the EOS token for padding
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id  # Set pad token ID in model config

# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# ...

In [None]:
# Step 4: Load a small dataset
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")

# Select a small subset of the data for quick training
small_train_dataset = dataset["train"].select(range(100))  # Select first 100 examples
small_eval_dataset = dataset["validation"].select(range(10))  # Select first 10 examples

# Step 5: Tokenize the dataset
def tokenize_function(examples):
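    # Truncate or pad every example to 128 tokens, then copy 'input_ids' into 'labels';
    # the model shifts the labels internally to form next-token targets.
    # Padded positions stay in the labels, so they also count toward the loss.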
    tokens = tokenizer(examples["text"], truncation=True, padding="max_length", max_length=128)
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

tokenized_train_dataset = small_train_dataset.map(tokenize_function, batched=True)
tokenized_eval_dataset = small_eval_dataset.map(tokenize_function, batched=True)

# ...

In [None]:
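# Step 6: Define training arguments (no training-loss logging, no reporting to external services)
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",  # Evaluate after every epoch (evaluation_strategy in older transformers releases)
    logging_strategy="no",  # Skip training-loss logging, hence "No log" in the results table
    report_to="none",  # Do not report metrics to external services such as wandb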
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=3,  # Adjust for more training
    weight_decay=0.01,
)

# Step 7: Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_eval_dataset,
)
"Done"

<center><h2><strong><font color="blue">Train Model</font></strong></h2></center>

In [3]:
# Step 8: Train the model
trainer.train()

# Step 9: Save the fine-tuned model
model.save_pretrained("./fine-tuned-distilgpt2")
tokenizer.save_pretrained("./fine-tuned-distilgpt2")

# Step 10: Test the model with text generation
input_text = "Artificial intelligence is transforming"
inputs = tokenizer(input_text, return_tensors="pt", padding=True).to(device)
outputs = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],  # Pass attention mask explicitly
    max_length=50,
    num_return_sequences=1,
    pad_token_id=tokenizer.pad_token_id  # Set pad_token_id for reliable results
)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("Generated text:", generated_text)



| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | No log | 1.324533 |
| 2 | No log | 1.29263 |
| 3 | No log | 1.285892 |


Generated text: Artificial intelligence is transforming the way we interact with the world.
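
The validation loss reported by the Trainer is a mean cross-entropy in nats per token, so exponentiating it gives a perplexity. The sketch below does this for the three values in the training log above. The variable names are illustrative. Because padded positions also count toward the loss in this notebook, the results are best read as a relative trend across epochs, not as a benchmark figure.

```python
import math

# Validation losses copied from the training log (mean cross-entropy, in nats).
validation_losses = {1: 1.324533, 2: 1.29263, 3: 1.285892}

# Perplexity is the exponential of the mean cross-entropy.
perplexities = {epoch: math.exp(loss) for epoch, loss in validation_losses.items()}

for epoch, ppl in perplexities.items():
    print(f"epoch {epoch}: perplexity = {ppl:.2f}")
```

Perplexity falls from about 3.76 after the first epoch to about 3.62 after the third, which mirrors the small but steady drop in validation loss.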


<center><h2><strong><font color="blue">Notes</font></strong></h2></center>

* **Dataset Loading**: We use a small subset of the WikiText dataset for simplicity. You can replace it with your own data if you wish.
* **Tokenization**: The tokenizer converts text into token IDs compatible with DistilGPT2, truncating or padding every example to 128 tokens.
* **Training**: The Trainer class handles training and evaluation with the specified parameters.
* **Testing**: After fine-tuning, the model is tested on a short input to demonstrate its ability to generate text.
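
To make the tokenization step more concrete, here is a minimal pure-Python sketch of what `truncation=True, padding="max_length"` does to a list of token IDs. The helper `pad_or_truncate` and the toy token IDs are hypothetical; the real work is done inside the Hugging Face tokenizer. The pad ID 50256 is GPT-2's end-of-text token, which the notebook reuses as its padding token.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), each with exactly max_length entries."""
    kept = token_ids[:max_length]                    # truncation: drop tokens past max_length
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad              # padding on the right, up to max_length
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask


PAD_ID = 50256  # GPT-2 end-of-text ID, reused as the pad token in the notebook

# Toy token IDs (hypothetical), with max_length=5 instead of 128 for readability.
short_ids, short_mask = pad_or_truncate([10, 11, 12], 5, PAD_ID)
long_ids, long_mask = pad_or_truncate(list(range(8)), 5, PAD_ID)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The short example gains two padding entries with a mask of 0, while the long example is cut to its first five tokens.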


The code performs supervised fine-tuning with a causal language modeling objective: it uses the Trainer class from the Hugging Face transformers library to train the model on labels derived from its own input sequences. Here’s a breakdown of how it works and where it fits in the spectrum of fine-tuning approaches:

## Supervised Fine-Tuning with Labels
* **Goal**: The model learns to predict the next token in a sequence, with the training text itself providing the targets.
* **Labels**: In this code, the input_ids are copied into the labels of the tokenized dataset. The model shifts the labels by one position internally, which creates a teacher-forcing setup: each token is predicted from the true tokens that precede it.
* **Loss Calculation**: Since the labels are provided, the model calculates the cross-entropy loss during training, encouraging it to predict tokens correctly within the input context.
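
The label shift and the loss can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: `mask_padding` and `causal_lm_loss` are hypothetical helpers, the logits are toy values, and the masking step is a common variation that the notebook itself does not apply (it leaves padded positions in the labels). The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
import math

IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def mask_padding(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [tok if keep == 1 else IGNORE_INDEX
            for tok, keep in zip(input_ids, attention_mask)]


def causal_lm_loss(logits, labels):
    """Mean next-token cross-entropy: logits at position t are scored against labels[t + 1]."""
    total, count = 0.0, 0
    for step_logits, target in zip(logits[:-1], labels[1:]):
        if target == IGNORE_INDEX:
            continue  # ignored positions contribute nothing to the loss
        peak = max(step_logits)  # subtract the max for a numerically stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in step_logits))
        total += log_norm - step_logits[target]  # negative log-probability of the target
        count += 1
    return total / count if count else 0.0


# Toy sequence: three real tokens followed by two padding tokens, vocabulary of size 4.
labels = mask_padding([1, 2, 3, 0, 0], [1, 1, 1, 0, 0])
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(5)]
loss = causal_lm_loss(uniform_logits, labels)

print(labels)
print(loss)
```

With uniform logits over four tokens, every scored position costs log(4), so the mean loss equals log(4) regardless of how many padded positions are skipped.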

## Common Fine-Tuning Approach for Causal Language Models
* **Sequential Prediction**: This method is typical for language models that generate text by predicting the next token, like GPT models. Each prediction is conditioned only on the tokens that precede it.
* **Training with Trainer**: The Trainer class simplifies the training loop, handling backpropagation, batching, and evaluation automatically. It’s a practical choice for demonstrations or cases where default training settings are sufficient.
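
As a rough sketch of what the Trainer does with the settings used in this notebook, the snippet below computes the number of optimizer steps and the learning rate along the way. It assumes a single device, no gradient accumulation, and the Trainer's default linear learning-rate decay without warmup; the helper `linear_decay_lr` is hypothetical.

```python
import math

num_examples = 100   # size of small_train_dataset
batch_size = 2       # per_device_train_batch_size
num_epochs = 3       # num_train_epochs
peak_lr = 2e-5       # learning_rate

# Assumption: one device and no gradient accumulation, so one batch = one optimizer step.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs


def linear_decay_lr(step, total=total_steps, peak=peak_lr):
    """Learning rate after `step` updates under linear decay to zero, without warmup."""
    return peak * max(0.0, (total - step) / total)


print("optimizer steps:", total_steps)
for step in (0, total_steps // 2, total_steps):
    print(f"step {step}: lr = {linear_decay_lr(step):.2e}")
```

Under these assumptions the run takes 150 optimizer steps, and the learning rate is halved by the midpoint of training.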

## Other Fine-Tuning Approaches

* **Instruction Tuning**: If we were tuning the model to follow instructions or produce specific response patterns (as ChatGPT does), we would train on prompt-response pairs and might introduce special tokens to separate prompts from responses.
* **Reinforcement Learning from Human Feedback (RLHF)**: A more advanced fine-tuning method, particularly useful for aligning a model’s responses with human preferences. It is more complex because it involves collecting human feedback and training a reward model.
* **Parameter-Efficient Fine-Tuning (PEFT)**: Techniques such as LoRA (Low-Rank Adaptation) train only a small number of added or selected parameters while the rest of the model stays frozen, which makes them efficient for larger models or limited compute resources. They weren’t applied here but can be more suitable for low-resource environments.
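
To give a sense of the savings LoRA offers, the sketch below counts the trainable parameters for one weight matrix. LoRA freezes the original matrix and learns two low-rank factors whose product is added to it. The rank of 8 and the choice of a square 768 x 768 projection are assumptions for illustration (768 is DistilGPT2's hidden size); `lora_trainable_params` is a hypothetical helper, not part of any library.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the two low-rank factors: A (rank x d_in) and B (d_out x rank)."""
    return rank * d_in + d_out * rank


d_model = 768  # DistilGPT2 hidden size
rank = 8       # hypothetical LoRA rank, chosen for illustration

full = d_model * d_model                             # parameters in the frozen weight matrix
lora = lora_trainable_params(d_model, d_model, rank)  # parameters actually trained
fraction = lora / full

print(f"full matrix: {full} parameters")
print(f"LoRA factors: {lora} parameters ({fraction:.2%} of the full matrix)")
```

For this single matrix, the low-rank factors hold 12,288 parameters against 589,824 in the full matrix, roughly 2% of the original count.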