## Fine-tuning a multilingual text classifier to identify children's websites

This notebook demonstrates the process of fine-tuning a multilingual text classifier and evaluating it using cross-validation. Specifically, we fine-tune the `sentence-transformers/paraphrase-multilingual-mpnet-base-v2` model using a dataset containing webpage titles and descriptions in different languages.

### Prerequisites
Ensure you have the following Python libraries installed:
- `torch`
- `transformers`
- `datasets`
- `scikit-learn`
- `numpy`
- `sentence-transformers`


### Dataset Description
For the positive class, which identifies child-directed websites, we used sources including the [kidSAFE Seal Program](https://www.kidsafeseal.com/certifiedproducts.html), [CommonSense](https://www.commonsensemedia.org/website-lists) (specifically filtered for content suitable for children below the age of 13), a list compiled by [Kaspersky](https://kids.kaspersky.com/kids-website-list/), and children's websites identified by [VirusTotal](https://www.virustotal.com/gui/home/search) from the top one million websites in the Chrome User Experience Report (CrUX) list from May 2022. For the negative class, which consists of websites not directed at children, we randomly selected webpages from the [Common Crawl](https://commoncrawl.org/blog/june-july-2022-crawl-archive-now-available) dataset. To ensure the accuracy and relevance of our dataset, we also manually reviewed the samples. Please refer to Section 3 of our paper for more details.


The dataset consists of the following columns:
1. **Text**: A string that includes the webpage's title and description concatenated with `". "` (dot, space).
2. **Label**: A numerical label (either '0.0' or '1.0') indicating the classification category of the text. In this case, '1.0' indicates that the webpage is a child-directed one, while '0.0' indicates that it is not.
3. **URL**: The URL of the webpage from which the text is extracted.
4. **Language**: The language of the text, indicated by a two-letter code (e.g., 'en' for English).
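
As a minimal sketch of the layout described above, the snippet below builds two rows the way the **Text** column is described: title and description joined with `". "`. The helper `build_text`, the example titles, descriptions, and URLs are all hypothetical, and the column name `language` is an assumption; `text`, `labels`, and `urls` are the names read by `load_data` later in this notebook.

```python
import pandas as pd

def build_text(title: str, description: str) -> str:
    """Join a page title and description with '. ' (hypothetical helper)."""
    return f"{title}. {description}"

# Made-up example rows, for illustration only.
rows = [
    {"text": build_text("Fun Math Games", "Counting games for young learners"),
     "labels": 1.0, "urls": "https://kids.example.com", "language": "en"},
    {"text": build_text("Quarterly Tax Guide", "Filing deadlines for small businesses"),
     "labels": 0.0, "urls": "https://tax.example.com", "language": "en"},
]
df = pd.DataFrame(rows)
df["labels"] = df["labels"].astype(int)  # same cast that load_data applies
print(df[["text", "labels"]])
```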

### Configuration
The following constants are used to configure the model training:
- `RANDOM_SEED`: Seed for random number generation to ensure reproducibility.
- `MODEL_CKPT`: Model checkpoint used for sequence classification.
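- `BATCH_SIZE`, `LEARNING_RATE`, `EPOCHS`: Training hyperparameters.
- `OUTPUT_DIR`: Base directory where the model and tokenizer for each fold are saved.
- `train_set_path`: Path to the training CSV file.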


### Model Training
The model is fine-tuned using the following steps:
1. Data Loading: Load the data from a CSV file and preprocess it.
2. Tokenization: Tokenize text data using a pretrained tokenizer.
3. Cross-Validation Setup: Split the data into training and validation sets using KFold cross-validation.
4. Training: Train the model on each fold and evaluate its performance.
5. Results: Aggregate and display the results from all folds.
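
The cross-validation step above can be illustrated on toy data. This is only a sketch: the 20-element array and the choice of 5 splits are assumptions made for readability, while the notebook itself calls `KFold` with `n_splits=10` on the real DataFrame.

```python
import numpy as np
from sklearn.model_selection import KFold

toy_samples = np.arange(20)  # stand-in for 20 dataset rows
kf = KFold(n_splits=5, shuffle=True, random_state=200)

val_seen = []
for fold, (train_idx, val_idx) in enumerate(kf.split(toy_samples), start=1):
    # Train and validation indices never overlap within a fold.
    assert set(train_idx).isdisjoint(val_idx)
    val_seen.extend(val_idx.tolist())
    print(f"fold {fold}: {len(train_idx)} train rows, {len(val_idx)} validation rows")

# Across all folds, every row is used for validation exactly once.
assert sorted(val_seen) == list(range(20))
```

Plain `KFold` does not look at the labels when splitting. If the two classes are imbalanced, scikit-learn's `StratifiedKFold` is one alternative that keeps the class ratio roughly constant in each fold.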

In [None]:
import os
import torch
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

# Setting up constants and configurations
RANDOM_SEED = 200
MODEL_CKPT = 'sentence-transformers/paraphrase-multilingual-mpnet-base-v2'
BATCH_SIZE = 12
LEARNING_RATE = 4.2e-05
EPOCHS = 2
OUTPUT_DIR = 'output'

train_set_path = 'data/train_data.csv'

torch.manual_seed(RANDOM_SEED)
torch.cuda.manual_seed(RANDOM_SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
np.random.seed(RANDOM_SEED)

### Fine-tuning the classifier

In [2]:
# If CUDA is available, allocate a small tensor on the GPU so that the
# CUDA context is initialized before training starts
if torch.cuda.is_available():
    torch.zeros(1).cuda()

def load_data(file_path):
    """Load data from a csv file and return a pandas DataFrame."""
    data = pd.read_csv(file_path, sep=',')
    data['labels'] = data['labels'].astype(int)
    return data[['text', 'labels', 'urls']]

def create_folds(dataframe, n_splits=10):
    """Create KFold cross-validation folds and return a list of tuples
      containing train and validation indices."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=RANDOM_SEED)
    return [(train_idx, val_idx) for train_idx, val_idx in kf.split(dataframe)]

def create_dataset_from_indices(df, indices, tokenizer):
    """Create a dataset from a dataframe."""
    dataset = Dataset.from_pandas(df.iloc[indices])
    return dataset.map(lambda x: tokenize_function(x, tokenizer), batched=True)

def tokenize_function(example, tokenizer):
    """Tokenize a dataset."""
    return tokenizer(example["text"], truncation=True, padding=True)

def compute_metrics(pred):
    """Compute metrics for a prediction."""
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    return {'accuracy': accuracy_score(labels, preds), 'f1': f1, 'precision': precision, 'recall': recall}

def train_model_for_fold(train_dataset, val_dataset, tokenizer, model_checkpoint, output_dir):
    """Train a model for a single fold and return evaluation metrics."""
    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=EPOCHS,
        learning_rate=LEARNING_RATE,
        per_device_train_batch_size=BATCH_SIZE,
        per_device_eval_batch_size=BATCH_SIZE,
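        logging_steps=max(1, len(train_dataset) // BATCH_SIZE),  # Roughly once per epoch; never 0
        evaluation_strategy="epoch",  # Newer transformers releases name this argument `eval_strategy`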
        save_strategy="epoch",  # Save the model at the end of each epoch
        save_total_limit=1       # Optional: Limits the total amount of checkpoints; helps save disk space
    )

    model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=2)
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics
    )

    trainer.train()
    trainer.save_model(output_dir)  # Explicitly save the final model after training
    tokenizer.save_pretrained(output_dir)  # Save the tokenizer used with the model
    return trainer.evaluate()
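
To make the arithmetic in `compute_metrics` concrete, here is a small worked example on toy logits. The six logit rows and labels are invented for illustration; only the `argmax` step and the scikit-learn calls mirror the function above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy logits for six samples: column 0 = not child-directed, column 1 = child-directed.
logits = np.array([[2.0, 0.1], [0.2, 1.5], [0.3, 0.9],
                   [1.2, 0.4], [0.1, 2.2], [1.0, 0.8]])
labels = np.array([0, 1, 0, 1, 1, 0])

preds = logits.argmax(-1)  # predicted class = index of the larger logit

# With average="binary", precision/recall/F1 are reported for the positive class (label 1).
precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="binary")
accuracy = accuracy_score(labels, preds)

print(preds.tolist())
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here there are 2 true positives, 1 false positive, 1 false negative, and 2 true negatives, so precision, recall, F1, and accuracy all come out to 2/3.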


In [None]:
def fine_tune_model_with_training_data(train_set_path, model_ckpt, output_dir, n_splits=10):
    """
    Main function to run the cross-validation and train the model.

    Parameters:
    - train_set_path: str, path to the training dataset file.
    - model_ckpt: str, model checkpoint for loading the tokenizer and model.
    - output_dir: str, base path for saving output from each fold.
    - n_splits: int, number of splits for cross-validation.

    Returns:
    - final_results: dict, average evaluation metrics across all folds.
    """
    df_data = load_data(train_set_path)
    folds = create_folds(df_data, n_splits=n_splits)
    tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

    # Create the base output directory if it does not exist yet
    os.makedirs(output_dir, exist_ok=True)

    results = []
    for i, (train_idx, val_idx) in enumerate(folds):
        # Use a separate variable for the per-fold path; reassigning `output_dir`
        # here would nest each fold's directory inside the previous fold's
        fold_output_dir = os.path.join(output_dir, f"output_fold_{i+1}")
        train_ds = create_dataset_from_indices(df_data, train_idx, tokenizer)
        val_ds = create_dataset_from_indices(df_data, val_idx, tokenizer)

        print(f"Training fold {i+1}")
        result = train_model_for_fold(train_ds, val_ds, tokenizer, model_ckpt, fold_output_dir)
        results.append(result)

    # Average the evaluation metrics across all folds
    final_results = {key: np.mean([result[key] for result in results]) for key in results[0]}
    print("Average across folds:", final_results)
    return final_results

In [None]:
fine_tune_model_with_training_data(train_set_path, MODEL_CKPT, OUTPUT_DIR, n_splits=10)
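
When reporting cross-validation results, it can help to show the spread across folds next to the mean. The sketch below is one possible way to do that; the helper `summarize` is hypothetical, and the per-fold numbers are placeholders, not results from this notebook. The `eval_` key prefix follows what `Trainer.evaluate()` returns by default.

```python
import numpy as np

# Placeholder per-fold metrics, NOT results from this notebook.
fold_results = [
    {"eval_f1": 0.90, "eval_accuracy": 0.92},
    {"eval_f1": 0.80, "eval_accuracy": 0.88},
    {"eval_f1": 1.00, "eval_accuracy": 0.96},
]

def summarize(results):
    """Return {metric: (mean, std)} over a list of per-fold metric dicts."""
    return {
        key: (float(np.mean([r[key] for r in results])),
              float(np.std([r[key] for r in results])))
        for key in results[0]
    }

summary = summarize(fold_results)
for key, (mean, std) in summary.items():
    print(f"{key}: {mean:.3f} ± {std:.3f}")
```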