<center><h2><strong><font color="blue">WFH 2024 idBigData - Fine-Tuning an LLM</font></strong></h2></center>

<img alt="" src="https://github.com/taudataanalytics/WFH-idBigData-2024/blob/main/images/covers/cover_taudata_uin.jpg?raw=1"/>

# Remember to switch the Runtime to GPU in Google Colab

<center><h2><strong><font color="blue">Example 01: Accuracy</font></strong></h2></center>

* A small public LLM, DistilGPT2 from Hugging Face’s Transformers library, fine-tuned on a causal language modeling (text generation) task.
* A small subset (1% of the training split) of the WikiText-2 dataset as the fine-tuning corpus.

In [None]:
import warnings; warnings.simplefilter('ignore')
from tqdm import tqdm

try:
    import google.colab
    IN_COLAB = True
    print("Installing the required modules")
    #!pip install transformers datasets scikit-learn --q
    print("Preparing directories and assets")
    #!mkdir data images output models
    #!wget https://raw.githubusercontent.com/taudata...
except ImportError:
    # google.colab is only importable inside Colab
    IN_COLAB = False
    print("Running the code locally. Please make sure all Python module versions match the Colab environment and that all data/assets are downloaded.")

from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
from datasets import load_dataset
import torch

# Set device to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

# Load the pre-trained DistilGPT2 model, tokenizer, and dataset

In [None]:
# Load the pre-trained model and tokenizer
model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

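# Prepare a small custom dataset
# Replace with your own data or use a simple example dataset
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")  # Small subset for demonstration
# Drop blank lines; an empty example has no tokens to learn from
dataset = dataset.filter(lambda example: example["text"].strip() != "")

# GPT-2 tokenizers have no padding token by default, so reuse the end-of-sequence token
tokenizer.pad_token = tokenizer.eos_token

# Tokenize the dataset
def tokenize_function(examples):
    tokens = tokenizer(examples["text"], truncation=True, padding="max_length", max_length=128)
    # Causal LM training needs labels: copy input_ids and mask padding with -100 so the loss ignores it
    tokens["labels"] = [
        [tok if mask == 1 else -100 for tok, mask in zip(ids, attn)]
        for ids, attn in zip(tokens["input_ids"], tokens["attention_mask"])
    ]
    return tokens

tokenized_datasets = dataset.map(tokenize_function, batched=True)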
tokenized_datasets = tokenized_datasets.train_test_split(test_size=0.1)
train_dataset = tokenized_datasets["train"]
eval_dataset = tokenized_datasets["test"]

"Done"

# Define the training arguments and initialize the Trainer

In [None]:
# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",  # named evaluation_strategy in transformers releases before 4.41
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=1,  # Adjust for more training
    weight_decay=0.01,
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
"Done"

<center><h2><strong><font color="blue">Train Model</font></strong></h2></center>

In [None]:
trainer.train()

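# Save the fine-tuned model
model.save_pretrained("./fine-tuned-distilgpt2")
tokenizer.save_pretrained("./fine-tuned-distilgpt2")

# Test the model with text generation
model.eval()
input_text = "Artificial intelligence is transforming"
# Put the inputs on the same device the Trainer left the model on
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
# Pass the attention mask along with the input IDs, and set pad_token_id explicitly (GPT-2 defines none)
outputs = model.generate(**inputs, max_length=50, num_return_sequences=1, pad_token_id=tokenizer.eos_token_id)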
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("Generated text:", generated_text)
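A generated sample gives only a qualitative impression of the fine-tuned model. For a causal language model, a common quantitative summary is perplexity, the exponential of the mean per-token cross-entropy loss. The sketch below assumes the loss comes from the `eval_loss` entry that `trainer.evaluate()` typically reports when labels are available; the helper name `perplexity_from_loss` and the metrics dictionary are hypothetical, and the loss value is a placeholder rather than a measured result.

```python
import math


def perplexity_from_loss(eval_loss: float) -> float:
    """Convert a mean per-token cross-entropy loss (in nats) into perplexity."""
    return math.exp(eval_loss)


# Hypothetical metrics dictionary, shaped like the output of trainer.evaluate().
# The loss value is a placeholder, not a result from this notebook.
example_metrics = {"eval_loss": 3.5}
print(f"Perplexity: {perplexity_from_loss(example_metrics['eval_loss']):.2f}")
```

Lower perplexity means the model assigns higher probability to the held-out text, so comparing the value before and after fine-tuning is one way to check whether training helped.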

<center><h2><strong><font color="blue">Notes</font></strong></h2></center>

* **Dataset Loading**: We use a small subset (1% of the training split) of the WikiText-2 dataset for simplicity. You can replace it with your own data if you wish.
* **Tokenization**: The tokenizer converts text into token IDs compatible with DistilGPT2.
* **Training**: The Trainer class handles training and evaluation with the specified parameters.
* **Testing**: After fine-tuning, the model is tested on a short input to demonstrate its ability to generate text.
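To make the tokenization and training notes above more concrete, here is a minimal pure-Python sketch of what padding to a fixed length involves when preparing data for causal language modeling. It is an illustration under stated assumptions, not the tokenizer's actual implementation: the function `pad_and_label` and the toy token IDs are hypothetical. The value `-100` is the default `ignore_index` of PyTorch's cross-entropy loss, so label positions set to it do not contribute to the training loss.

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def pad_and_label(
    token_ids: List[int], max_length: int, pad_id: int
) -> Tuple[List[int], List[int], List[int]]:
    """Truncate or pad one sequence, then build its attention mask and labels."""
    ids = token_ids[:max_length]          # truncation
    n_real = len(ids)
    n_pad = max_length - n_real
    input_ids = ids + [pad_id] * n_pad    # padding up to max_length
    attention_mask = [1] * n_real + [0] * n_pad
    # Labels copy the real tokens; padded positions are excluded from the loss.
    labels = ids + [IGNORE_INDEX] * n_pad
    return input_ids, attention_mask, labels


# Toy token IDs for illustration only, not real DistilGPT2 output.
input_ids, attention_mask, labels = pad_and_label([11, 22, 33], max_length=5, pad_id=0)
print(input_ids)       # [11, 22, 33, 0, 0]
print(attention_mask)  # [1, 1, 1, 0, 0]
print(labels)          # [11, 22, 33, -100, -100]
```

The labels are the same tokens as the inputs because GPT-2 style models in Transformers shift the labels internally to predict the next token.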