# Workshop 11: Fine-tune a Small Open-source Model with Your Custom Dataset
In this workshop, you'll:
- Load a text dataset (CSV format) for classification
- Tokenize text using a Hugging Face tokenizer
- Load a small Transformer model (e.g., DistilBERT)
- Fine-tune the model on your data
- Evaluate and make predictions

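The same steps apply to English and Thai text. Pick a checkpoint that covers your language (the default, `distilbert-base-uncased`, is an English-only model) and swap in your own dataset.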

**Requirements:** Place `train.csv` and `val.csv` in the same directory with columns: `text`, `label`.
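
Before running the rest of the notebook, it can help to confirm that both files really have the expected columns. The following is a minimal sketch, not part of the original workshop: `check_csv` is a hypothetical helper, the sample rows are made up, and it uses only the standard library so it can run before the dependencies are installed.

```python
import csv
import io

REQUIRED_COLUMNS = {"text", "label"}

def check_csv(handle):
    """Return a list of problems found in an open CSV file; an empty list means OK."""
    reader = csv.DictReader(handle)
    header = set(reader.fieldnames or [])
    missing = REQUIRED_COLUMNS - header
    if missing:
        return [f"missing columns: {sorted(missing)}"]
    problems = []
    # start=2 because row 1 of the file is the header
    for row_no, row in enumerate(reader, start=2):
        if not (row["text"] or "").strip() or not (row["label"] or "").strip():
            problems.append(f"line {row_no}: empty text or label")
    return problems

# In-memory stand-in for train.csv; the second data row has no label.
sample = io.StringIO("text,label\nGood morning!,greeting\nBye bye!,\n")
print(check_csv(sample))  # ['line 3: empty text or label']
```

For the real files, the same function could be called as `check_csv(open('train.csv', newline='', encoding='utf-8'))`.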

## Step 1: Install dependencies (if needed)
You may need to restart your kernel after installing.

In [1]:
# accelerate is required by the PyTorch Trainer in recent transformers releases
!pip install transformers datasets torch pandas scikit-learn accelerate --quiet

## Step 2: Load Custom Dataset
Assumes `train.csv` and `val.csv` with columns: `text`, `label`.

In [25]:
import pandas as pd
train_df = pd.read_csv('train.csv')
val_df = pd.read_csv('val.csv')
print(train_df)

                      text       label
0    ร้านอาหารไหนอร่อยบ้าง  restaurant
1            Good morning!    greeting
2             ขอบคุณมากค่ะ   gratitude
3                 Bye bye!     goodbye
4   I really appreciate it   gratitude
..                     ...         ...
67       มีรถไฟฟ้าใกล้ๆไหม      travel
68           สวัสดีตอนเย็น    greeting
69         ห้องน้ำไม่สะอาด   complaint
70         อาหารเย็นเกินไป   complaint
71    My room is too noisy   complaint

[72 rows x 2 columns]


In [11]:
# Encode labels
labels = sorted(train_df['label'].unique())
label2id = {l: i for i, l in enumerate(labels)}
id2label = {i: l for l, i in label2id.items()}
train_df['labels'] = train_df['label'].map(label2id)
val_df['labels'] = val_df['label'].map(label2id)

print(label2id)

{'complaint': 0, 'goodbye': 1, 'gratitude': 2, 'greeting': 3, 'joke': 4, 'restaurant': 5, 'travel': 6, 'weather': 7}
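
The mapping above is built from `train_df` only, so any label that appears in `val.csv` but not in `train.csv` would become `NaN` after `.map(label2id)`. A small sketch of how to check for that, using plain lists in place of the DataFrame columns (`unseen_labels` and all `demo_*` names are hypothetical, and the data is made up):

```python
def unseen_labels(train_labels, val_labels):
    """Labels that occur in the validation data but not in the training data."""
    return sorted(set(val_labels) - set(train_labels))

# Made-up label columns standing in for train_df['label'] and val_df['label'].
demo_train_labels = ["greeting", "goodbye", "gratitude", "greeting"]
demo_val_labels = ["greeting", "weather"]

demo_label2id = {name: i for i, name in enumerate(sorted(set(demo_train_labels)))}
print(demo_label2id)   # {'goodbye': 0, 'gratitude': 1, 'greeting': 2}
print(unseen_labels(demo_train_labels, demo_val_labels))  # ['weather']
```

An empty list means every validation label has an id.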


## Step 3: Tokenization
Pick a pre-trained checkpoint that matches your language: `distilbert-base-uncased` for English, `airesearch/wangchanberta-base-att-spm-uncased` for Thai, and so on.

In [3]:
from transformers import AutoTokenizer
# You can change to a Thai model, e.g. 'airesearch/wangchanberta-base-att-spm-uncased'
model_checkpoint = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

def tokenize_function(example):
    return tokenizer(
        example['text'],
        truncation=True,
        padding='max_length',
        max_length=128,
    )
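
With `padding='max_length'` and `max_length=128`, every example is padded to 128 tokens, even though the sentences in this dataset are only a few tokens long. The sketch below compares the number of token positions the model would process under fixed-length padding with the number under per-batch padding (pad to the longest sequence in each batch). The token counts are made up and `padded_positions` is a hypothetical helper. In Transformers, per-batch padding is what `DataCollatorWithPadding` provides.

```python
def padded_positions(lengths, batch_size, fixed_length=None):
    """Count token positions after padding.

    fixed_length=None pads each batch to its own longest sequence;
    otherwise every sequence is padded to fixed_length.
    """
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        target = fixed_length if fixed_length is not None else max(batch)
        total += target * len(batch)
    return total

# Hypothetical token counts for eight short sentences (special tokens included).
demo_lengths = [6, 5, 7, 4, 8, 9, 6, 7]

print(padded_positions(demo_lengths, batch_size=4, fixed_length=128))  # 1024
print(padded_positions(demo_lengths, batch_size=4))                    # 64
```

Fixed-length padding is simpler to reason about and is fine for a dataset this small. The difference matters more as the dataset grows.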

In [13]:
from datasets import Dataset

train_ds = Dataset.from_pandas(train_df[['text', 'labels']])
val_ds = Dataset.from_pandas(val_df[['text', 'labels']])

train_ds = train_ds.map(tokenize_function, batched=True)
val_ds = val_ds.map(tokenize_function, batched=True)

# Set format for PyTorch
cols = ['input_ids', 'attention_mask', 'labels']
train_ds.set_format(type='torch', columns=cols)
val_ds.set_format(type='torch', columns=cols)


In [22]:
import torch

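# Prefer CUDA, then Apple-silicon MPS, then fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu")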
print(f"Using device: {device}")

Using device: mps


## Step 4: Load Pre-trained Model for Sequence Classification

In [14]:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint,
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id,
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [8]:
!pip install --upgrade transformers






In [10]:
import transformers
print(transformers.__version__)

4.52.4


## Step 5: Set Training Arguments

In [27]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',
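    do_eval=True,  # evaluation is run explicitly in Step 7; this flag alone does not schedule evaluation during training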
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    learning_rate=2e-5,
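    logging_steps=20,  # this run has only 15 optimizer steps, so no training loss gets logged; use a smaller value (e.g. 5) to see it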
    report_to='none',
    load_best_model_at_end=False,
)
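
It is worth working out how many optimizer steps these settings produce, because `logging_steps` only has an effect if the run reaches that many steps. The sketch below assumes a single device, no gradient accumulation, and that the last partial batch is kept. `total_optimizer_steps` is a hypothetical helper, and the 72 comes from the row count printed in Step 2.

```python
import math

def total_optimizer_steps(num_examples, batch_size, epochs):
    """Optimizer steps for a single-device run that keeps the last partial batch."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs

demo_steps = total_optimizer_steps(num_examples=72, batch_size=16, epochs=3)
print(demo_steps)        # 15
print(demo_steps >= 20)  # False: a logging interval of 20 is never reached
```

This agrees with the `global_step=15` that `trainer.train()` reports in Step 6.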

In [15]:
import numpy as np
from sklearn.metrics import accuracy_score, classification_report

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    acc = accuracy_score(labels, predictions)
    return {'accuracy': acc}
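
To see what `compute_metrics` does with its inputs, the same argmax-and-compare logic can be tried on a few hand-written logits. This is an illustrative sketch: `accuracy_from_logits` is a hypothetical stand-alone version that uses NumPy only, and the numbers are made up.

```python
import numpy as np

def accuracy_from_logits(logits, label_ids):
    """Argmax over the class axis, then the fraction of predictions that match."""
    predictions = np.argmax(logits, axis=-1)
    return float((predictions == label_ids).mean())

# Made-up logits for 4 examples and 3 classes.
demo_logits = np.array([
    [2.0, 0.1, -1.0],   # argmax -> class 0
    [0.3, 0.2, 1.5],    # argmax -> class 2
    [0.0, 0.9, 0.1],    # argmax -> class 1
    [1.2, 1.1, 0.0],    # argmax -> class 0
])
demo_label_ids = np.array([0, 2, 2, 0])

print(accuracy_from_logits(demo_logits, demo_label_ids))  # 0.75
```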

In [16]:
from transformers import Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    processing_class=tokenizer,  # replaces the deprecated `tokenizer=` argument
    compute_metrics=compute_metrics,
)


## Step 6: Train the Model

In [17]:
trainer.train()





TrainOutput(global_step=15, training_loss=2.0697107950846356, metrics={'train_runtime': 4.498, 'train_samples_per_second': 48.021, 'train_steps_per_second': 3.335, 'total_flos': 7154004934656.0, 'train_loss': 2.0697107950846356, 'epoch': 3.0})

## Step 7: Evaluate and Predict

In [18]:
metrics = trainer.evaluate()
print(metrics)



{'eval_loss': 2.0598747730255127, 'eval_accuracy': 0.125, 'eval_runtime': 0.2253, 'eval_samples_per_second': 106.52, 'eval_steps_per_second': 8.877, 'epoch': 3.0}
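
A useful sanity check is to compare these numbers with what a classifier that has learned nothing would score. With 8 equally frequent classes, a uniform prediction has cross-entropy ln(8) and an accuracy of 1/8. The sketch below computes both (`chance_baseline` is a hypothetical helper).

```python
import math

def chance_baseline(num_classes):
    """Loss and accuracy of a classifier that gives every class equal probability."""
    uniform_loss = math.log(num_classes)   # cross-entropy of a uniform prediction
    uniform_accuracy = 1.0 / num_classes   # holds when classes are equally frequent
    return uniform_loss, uniform_accuracy

demo_loss, demo_acc = chance_baseline(8)
print(round(demo_loss, 4), demo_acc)  # 2.0794 0.125
```

The reported `eval_loss` of about 2.06 and `eval_accuracy` of 0.125 sit almost exactly on this baseline, which suggests the model has barely moved from its randomly initialized classification head. Possible causes, none confirmed by this notebook, include too few optimizer steps for a dataset this small and an English-only checkpoint being applied to partly Thai text.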


In [19]:
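# Predict on validation set
preds = trainer.predict(val_ds)
y_true = preds.label_ids
y_pred = preds.predictions.argmax(axis=-1)
# zero_division=0 reports 0.00 (without a warning) for classes that were never predicted
print(classification_report(
    y_true, y_pred,
    labels=list(range(len(labels))),
    target_names=labels,
    zero_division=0,
))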



              precision    recall  f1-score   support

   complaint       0.00      0.00      0.00         3
     goodbye       0.00      0.00      0.00         3
   gratitude       0.00      0.00      0.00         3
    greeting       0.00      0.00      0.00         3
        joke       0.00      0.00      0.00         3
  restaurant       0.00      0.00      0.00         3
      travel       0.14      1.00      0.24         3
     weather       0.00      0.00      0.00         3

    accuracy                           0.12        24
   macro avg       0.02      0.12      0.03        24
weighted avg       0.02      0.12      0.03        24
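
The `travel` row can be cross-checked by hand. Recall is 1.00 with a support of 3, so all 3 true `travel` examples were found. Precision and F1 then depend only on how many of the 24 validation examples were predicted as `travel`. The sketch below, with a hypothetical helper `prf`, tries every possible count and keeps those that reproduce the rounded scores in the report.

```python
def prf(true_positives, predicted, actual):
    """Precision, recall and F1 for one class from raw counts."""
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Keep every count of 'travel' predictions (3..24) whose rounded
# precision, recall and F1 match the report row (0.14, 1.00, 0.24).
demo_matches = [
    n for n in range(3, 25)
    if tuple(round(v, 2) for v in prf(3, n, 3)) == (0.14, 1.0, 0.24)
]
print(demo_matches)  # [22]
```

So the report is consistent with the model assigning `travel` to 22 of the 24 validation examples, which is why every other class scores 0.00.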





## Step 8: Inference Example

In [24]:
def classify(text):
    # 1. Tokenize the input text (a single input needs no padding)
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=128)
    
    # 2. Move the input tensors to the device the model is actually on
    inputs = {key: value.to(model.device) for key, value in inputs.items()}
    
    # 3. Switch to eval mode (disables dropout) and skip gradient tracking
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)
    
    # 4. Pick the highest-scoring class and map it back to its label name
    pred_id = outputs.logits.argmax(dim=-1).item()
    
    return id2label[pred_id]

print(classify('อากาศดีไหม'))

travel
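
The input asks about the weather, yet the model answers `travel`, in line with the evaluation results in Step 7. Looking at class probabilities instead of only the argmax makes this kind of low-confidence prediction visible. The sketch below applies a softmax to hypothetical, nearly flat logits for the 8 classes. The values are invented for illustration and are not taken from the trained model.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for 8 classes; index 6 is only slightly ahead of the rest.
demo_class_logits = [0.02, -0.01, 0.00, 0.01, -0.03, 0.00, 0.10, 0.04]
demo_probs = softmax(demo_class_logits)
demo_top = max(range(len(demo_probs)), key=demo_probs.__getitem__)

# The winning probability is only a little above the uniform value 1/8 = 0.125.
print(demo_top, round(demo_probs[demo_top], 3))
```

Inside `classify`, the equivalent call would be `torch.softmax(outputs.logits, dim=-1)`.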
