# RLHF-101

## Exploring Dataset

In [None]:
from datasets import load_dataset

In [None]:
ds = load_dataset("RLHFlow/Argilla-Math-DPO-standard")

In [None]:
dset = ds['train']
dset

Dataset({
    features: ['chosen', 'rejected', 'chosen_rating', 'rejected_rating', 'metadata'],
    num_rows: 2418
})

In [None]:
len(dset)

2418

In [None]:
ds_split=dset.train_test_split(test_size=0.2)
ds_split

DatasetDict({
    train: Dataset({
        features: ['chosen', 'rejected', 'chosen_rating', 'rejected_rating', 'metadata'],
        num_rows: 1934
    })
    test: Dataset({
        features: ['chosen', 'rejected', 'chosen_rating', 'rejected_rating', 'metadata'],
        num_rows: 484
    })
})

In [None]:
ds_sample = ds['train'][0]
ds_sample

{'chosen': [{'content': 'How can I simplify the algebraic expression `(3x^2 - 4y^3) / (2x)`?.\n Take a deep breath, think step by step, and give an accurate response',
   'role': 'user'},
  {'content': 'To simplify the algebraic expression `(3x^2 - 4y^3) / (2x)`, we can follow a few steps:\n\nStep 1: Distribute the division symbol by multiplying the expression by the reciprocal of the denominator. The reciprocal of `2x` is `1/(2x)`, so the expression becomes `(3x^2 - 4y^3) * (1/(2x))`.\n\nStep 2: Simplify within the parentheses by dividing each term separately. \n\n- For the first term, `3x^2`, divide `3x^2` by `2x`. This gives us `(3x^2) / (2x) = (3/2) * (x^2 / x) = (3/2) * x`.\n\n- For the second term, `-4y^3`, divide `-4y^3` by `2x`. This gives us `(-4y^3) / (2x) = (-2) * (y^3 / x)`.\n\nStep 3: Combine the simplified terms from Step 2. The expression now becomes `(3/2) * x - 2 * (y^3 / x)`.\n\nSo, the simplified form of the algebraic expression `(3x^2 - 4y^3) / (2x)` is `(3/2) * x -

In [None]:
ds_sample.keys()

dict_keys(['chosen', 'rejected', 'chosen_rating', 'rejected_rating', 'metadata'])

In [None]:
ds_sample['chosen']

[{'content': 'How can I simplify the algebraic expression `(3x^2 - 4y^3) / (2x)`?.\n Take a deep breath, think step by step, and give an accurate response',
  'role': 'user'},
 {'content': 'To simplify the algebraic expression `(3x^2 - 4y^3) / (2x)`, we can follow a few steps:\n\nStep 1: Distribute the division symbol by multiplying the expression by the reciprocal of the denominator. The reciprocal of `2x` is `1/(2x)`, so the expression becomes `(3x^2 - 4y^3) * (1/(2x))`.\n\nStep 2: Simplify within the parentheses by dividing each term separately. \n\n- For the first term, `3x^2`, divide `3x^2` by `2x`. This gives us `(3x^2) / (2x) = (3/2) * (x^2 / x) = (3/2) * x`.\n\n- For the second term, `-4y^3`, divide `-4y^3` by `2x`. This gives us `(-4y^3) / (2x) = (-2) * (y^3 / x)`.\n\nStep 3: Combine the simplified terms from Step 2. The expression now becomes `(3/2) * x - 2 * (y^3 / x)`.\n\nSo, the simplified form of the algebraic expression `(3x^2 - 4y^3) / (2x)` is `(3/2) * x - 2 * (y^3 / x

In [None]:
ds_sample['rejected']

[{'content': 'How can I simplify the algebraic expression `(3x^2 - 4y^3) / (2x)`?.\n Take a deep breath, think step by step, and give an accurate response',
  'role': 'user'},
 {'content': "To simplify the algebraic expression `(3x^2 - 4y^3) / (2x)`, you can divide each term in the numerator by the denominator.\n\nFirst, let's divide `3x^2` by `2x`. Since both terms have a common factor of `x`, we can simplify this expression to `3x`.\n\nNext, we divide `-4y^3` by `2x`. We can simplify this expression by dividing each term separately. \n\nDividing `-4` by `2` gives `-2`. Then, dividing `y^3` by `x` gives `y^3/x`.\n\nSo, the simplified form of `(3x^2 - 4y^3) / (2x)` is `3x - 2y^3/x`.",
  'role': 'assistant'}]

In [None]:
ds_sample['chosen_rating']

9.0

In [None]:
ds_sample['rejected_rating']

7.0

In [None]:
ds_sample['metadata']

'{"length-input": 139, "length-generations-1": 539, "length-generations-2": 493, "length-generations-3": 801, "rating-generations-1": 8.0, "rating-generations-2": 7.0, "rating-generations-3": 9.0, "distance-best-rated": 1.0}'

## Modelling

### Simple NN

#### Regression Model

In [None]:
from typing import Dict, List, Tuple
from datasets import Dataset as DS, load_dataset
from sentence_transformers import SentenceTransformer
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm

In [None]:
BATCH_SIZE = 64
EPOCHS = 3
INPUT_DIM = 768
HIDDEN_DIM = 384
OUTPUT_DIM = 1
LEARNING_RATE = 0.01

DEVICE = 'cuda'

In [None]:
embedding_model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2", device=DEVICE)

def embed(text: str) -> torch.Tensor:
    embedding = embedding_model.encode(text)
    return embedding

In [None]:
class MathDataset(Dataset):
    def __init__(self, ds: DS) -> None:
        self.ds = ds


    def __len__(self) -> int:
        return len(self.ds)


    def __getitem__(self, index: int) -> Tuple[torch.Tensor, float]:
        item = self.ds[index]
        question = item['question']
        answer = item['answer']
        rating = torch.tensor(item['rating'] / 10, dtype=torch.float32).reshape(1)
        input_text = f"Question: {question}\n\nAnswer: {answer}"
        input_embedding = embed(input_text)
        return input_embedding, rating

In [None]:
class NeuralNet(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int, output_dim: int) -> None:
        super(NeuralNet, self).__init__()
        self.model = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim),
            nn.Sigmoid()
        )


    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.model(x)

In [None]:
def metrics(actual_rating: torch.Tensor, predicted_rating: torch.Tensor) -> Tuple[float, float, float]:
    actual_rating = actual_rating * 10
    predicted_rating = predicted_rating * 10

    mae = torch.mean(torch.abs(actual_rating - predicted_rating))
    mse = torch.mean((actual_rating - predicted_rating) ** 2)
    rmse = torch.sqrt(mse)
    return mae.item(), mse.item(), rmse.item()

In [None]:
def process_data(data_split: DS) -> List[Dict]:
    data_list = []
    for ds in data_split:
        for subset in ('chosen', 'rejected'):
            subset_convo = ds[subset]
            subset_rating = ds[f'{subset}_rating']

            data_dict = dict()
            for i, convo in enumerate(subset_convo):
                if i == 0:
                    data_dict['question'] = convo['content']
                elif i == 1:
                    data_dict['answer'] = convo['content']
            data_dict['rating'] = subset_rating

            data_list.append(data_dict)
    return data_list


raw_dataset = load_dataset("RLHFlow/Argilla-Math-DPO-standard")

dset = raw_dataset['train']
ds_split=dset.train_test_split(test_size=0.2)

train_split = process_data(ds_split['train'])
test_split = process_data(ds_split['test'])

train_dataset = MathDataset(train_split)
test_datast = MathDataset(test_split)

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
test_loader = DataLoader(test_datast, batch_size=BATCH_SIZE, shuffle=True)

In [None]:
model = NeuralNet(
    input_dim=INPUT_DIM,
    hidden_dim=HIDDEN_DIM,
    output_dim=OUTPUT_DIM
).to(DEVICE)

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

In [None]:
def train_epoch(
    model: NeuralNet,
    device: str,
    train_loader: DataLoader,
    optimizer: torch.optim.Adam,
    epoch: int
) -> Tuple[float, float, float]:
    model.train()
    running_mae, running_mse, running_rmse = 0.0, 0.0, 0.0
    for data, target in tqdm(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        outputs = model(data)
        loss = criterion(outputs, target)
        loss.backward()
        optimizer.step()

        mae, mse, rmse = metrics(target, outputs)
        running_mae += mae
        running_mse += mse
        running_rmse += rmse

    avg_mae = running_mae / len(train_loader)
    avg_mse = running_mse / len(train_loader)
    avg_rmse = running_rmse / len(train_loader)

    return avg_mae, avg_mse, avg_rmse


def test_epoch(
    model: NeuralNet,
    device: str,
    test_loader: DataLoader,
    criterion: torch.optim.Adam
) -> Tuple[float, float, float]:
    model.eval()
    running_mae, running_mse, running_rmse = 0.0, 0.0, 0.0
    with torch.no_grad():
        for data, target in tqdm(test_loader):
            data, target = data.to(device), target.to(device)
            outputs = model(data)
            loss = criterion(outputs, target)

            mae, mse, rmse = metrics(target, outputs)
            running_mae += mae
            running_mse += mse
            running_rmse += rmse

        avg_mae = running_mae / len(test_loader)
        avg_mse = running_mse / len(test_loader)
        avg_rmse = running_rmse / len(test_loader)

        return avg_mae, avg_mse, avg_rmse

In [None]:
history = []
for epoch in range(1, EPOCHS+1):
    train_mae, train_mse, train_rmse = train_epoch(
        model=model,
        device=DEVICE,
        train_loader=train_loader,
        optimizer=optimizer,
        epoch=epoch
    )
    test_mae, test_mse, test_rmse = test_epoch(
        model=model,
        device=DEVICE,
        test_loader=test_loader,
        criterion=criterion
    )

    history.append((train_mae, train_mse, train_rmse, test_mae, test_mse, test_rmse))

    print(f"Epoch {epoch} | Train MAE: {train_mae:.4f}, Train MSE: {train_mse:.4f}, Train RMSE: {train_rmse:.4f} | Test MAE: {test_mae:.4f}, Test MSE: {test_mse:.4f}, Test RMSE: {test_rmse:.4f}")

100%|██████████| 61/61 [01:43<00:00,  1.70s/it]
100%|██████████| 16/16 [00:25<00:00,  1.62s/it]


Epoch 1 | Train MAE: 1.0458, Train MSE: 1.6541, Train RMSE: 1.2341 | Test MAE: 0.9267, Test MSE: 1.1022, Test RMSE: 1.0453


100%|██████████| 61/61 [01:43<00:00,  1.70s/it]
100%|██████████| 16/16 [00:25<00:00,  1.58s/it]


Epoch 2 | Train MAE: 0.9193, Train MSE: 1.1249, Train RMSE: 1.0568 | Test MAE: 0.9056, Test MSE: 1.1296, Test RMSE: 1.0601


100%|██████████| 61/61 [01:45<00:00,  1.73s/it]
100%|██████████| 16/16 [00:25<00:00,  1.59s/it]

Epoch 3 | Train MAE: 0.9159, Train MSE: 1.1023, Train RMSE: 1.0473 | Test MAE: 0.9060, Test MSE: 1.0859, Test RMSE: 1.0396





In [None]:
def logistic_test(data_split: DS) -> List[Dict]:
    data_list = []
    for ds in tqdm(data_split):
        data_dict = dict()
        for subset in ('chosen', 'rejected'):
            subset_convo = ds[subset]
            subset_rating = ds[f'{subset}_rating']

            for i, convo in enumerate(subset_convo):
                if i == 0:
                    data_dict[f'{subset}_question'] = convo['content']
                elif i == 1:
                    data_dict[f'{subset}_answer'] = convo['content']
            data_dict[f'{subset}_actual_rating'] = subset_rating

            input_text = f"Question: {data_dict[f'{subset}_question']}\n\nAnswer: {data_dict[f'{subset}_answer']}"
            data_dict[f'{subset}_predicted_rating'] = model(torch.Tensor(embed(input_text)).to(DEVICE)).to('cpu').item() * 10

        data_list.append(data_dict)
        # break
    return data_list


def get_scores(metrics: List[Dict], subset: str) -> Tuple[float, float, float, float]:
    actual_vals, predicted_vals = [], []
    for metric in metrics:
        actual = metric['chosen_actual_rating'] > metric['rejected_actual_rating']
        predicted = metric['chosen_predicted_rating'] > metric['rejected_predicted_rating']
        actual_vals.append(actual)
        predicted_vals.append(predicted)

    accuracy = sum(actual == predicted for actual, predicted in zip(actual_vals, predicted_vals)) / len(actual_vals)
    precision = sum(actual == predicted for actual, predicted in zip(actual_vals, predicted_vals)) / sum(predicted_vals)
    recall = sum(actual == predicted for actual, predicted in zip(actual_vals, predicted_vals)) / sum(actual_vals)
    f1_score = 2 * (precision * recall) / (precision + recall)

    print(f"{subset}: Accuracy: {accuracy:.4f} | Precision: {precision:.4f} | Recall: {recall:.4f} | F1 Score: {f1_score:.4f}")



logistic_train_metrics = logistic_test(ds_split['train'])
logistic_test_metrics = logistic_test(ds_split['test'])

100%|██████████| 1934/1934 [01:44<00:00, 18.48it/s]
100%|██████████| 484/484 [00:26<00:00, 18.59it/s]


In [None]:
get_scores(logistic_train_metrics, 'Train')
get_scores(logistic_test_metrics, 'Test')

Train: Accuracy: 0.5471 | Precision: 1.0262 | Recall: 0.5548 | F1 Score: 0.7202
Test: Accuracy: 0.5475 | Precision: 1.0271 | Recall: 0.5556 | F1 Score: 0.7211


Train: Accuracy: 0.5481 | Precision: 1.0291 | Recall: 0.5567 | F1 Score: 0.7226

Test: Accuracy: 0.5372 | Precision: 1.0156 | Recall: 0.5417 | F1 Score: 0.7065

#### Classification Model

In [None]:
import random
from typing import Dict, List, Tuple
from datasets import Dataset as DS, load_dataset
from sentence_transformers import SentenceTransformer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm

In [None]:
BATCH_SIZE = 64
EPOCHS = 10
INPUT_DIM = 768
HIDDEN_DIM = 384
OUTPUT_DIM = 1
LEARNING_RATE = 0.001

DEVICE = 'cuda'

In [None]:
embedding_model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2", device=DEVICE)

def embed(text: str) -> torch.Tensor:
    embedding = embedding_model.encode(text)
    return torch.Tensor(embedding)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

In [None]:
class MathDataset(Dataset):
    def __init__(self, ds: DS) -> None:
        self.ds = ds


    def __len__(self) -> int:
        return len(self.ds)


    def __getitem__(self, index: int) -> Tuple[torch.Tensor, int]:
        item = self.ds[index]
        data_dict = dict()
        subsets = ('chosen', 'rejected')
        for subset in subsets:
            subset_convo = item[subset]
            subset_rating = item[f'{subset}_rating']
            for i, convo in enumerate(subset_convo):
                if i == 0:
                    data_dict[f'{subset}_question'] = convo['content']
                elif i == 1:
                    data_dict[f'{subset}_answer'] = convo['content']
            data_dict[f'{subset}_input_text'] = f"Question: {data_dict[f'{subset}_question']}\n\nAnswer: {data_dict[f'{subset}_answer']}"
            data_dict[f'{subset}_embedding'] = embed(data_dict[f'{subset}_input_text'])
            data_dict[f'{subset}_rating'] = torch.tensor(item[f'{subset}_rating'] / 10, dtype=torch.float32).reshape(1)

        pos_ordering = random.choice([True, False])
        if not pos_ordering:
            subsets = ('rejected', 'chosen')

        input_embeddings = data_dict[f'{subsets[0]}_embedding'] - data_dict[f'{subsets[1]}_embedding']
        rating_diff = data_dict[f'{subsets[0]}_rating'] - data_dict[f'{subsets[1]}_rating']
        rating_label = torch.tensor(0 if rating_diff < 0 else 1, dtype=torch.float32)
        return input_embeddings, rating_label

In [None]:
class NeuralNet(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int, output_dim: int) -> None:
        super(NeuralNet, self).__init__()
        self.model = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim),
            nn.Sigmoid()
        )


    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.model(x)

In [None]:
def metrics(actual_rating: torch.Tensor, predicted_rating: torch.Tensor) -> Tuple[float, float, float, float]:
    accuracy = accuracy_score(actual_rating.cpu(), predicted_rating.cpu())
    precision = precision_score(actual_rating.cpu(), predicted_rating.cpu(), zero_division=0)
    recall = recall_score(actual_rating.cpu(), predicted_rating.cpu(), zero_division=0)
    f1 = f1_score(actual_rating.cpu(), predicted_rating.cpu(), zero_division=0)
    return accuracy, precision, recall, f1

In [None]:
raw_dataset = load_dataset("RLHFlow/Argilla-Math-DPO-standard")

dset = raw_dataset['train']
ds_split=dset.train_test_split(test_size=0.2)

train_dataset = MathDataset(ds_split['train'])
test_datast = MathDataset(ds_split['test'])

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
test_loader = DataLoader(test_datast, batch_size=BATCH_SIZE, shuffle=True)

README.md:   0%|          | 0.00/576 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/2.93M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/2418 [00:00<?, ? examples/s]

In [None]:
model = NeuralNet(
    input_dim=INPUT_DIM,
    hidden_dim=HIDDEN_DIM,
    output_dim=OUTPUT_DIM
).to(DEVICE)

criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

In [None]:
def train_epoch(
    model: NeuralNet,
    device: str,
    train_loader: DataLoader,
    optimizer: torch.optim.Adam,
    epoch: int
) -> Tuple[float, float, float, float]:
    model.train()
    running_accuracy, running_precision, runniung_recall, running_f1 = 0.0, 0.0, 0.0, 0.0
    for data, target in tqdm(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        outputs = model(data)
        loss = criterion(outputs, target.unsqueeze(1))
        loss.backward()
        optimizer.step()

        predicted = (outputs > 0.5).float()
        accuracy, precision, recall, f1 = metrics(target, predicted)
        running_accuracy += accuracy
        running_precision += precision
        runniung_recall += recall
        running_f1 += f1

    avg_accuracy = running_accuracy / len(train_loader)
    avg_precision = running_precision / len(train_loader)
    avg_recall = runniung_recall / len(train_loader)
    avg_f1 = running_f1 / len(train_loader)

    return avg_accuracy, avg_precision, avg_recall, avg_f1


def test_epoch(
    model: NeuralNet,
    device: str,
    test_loader: DataLoader,
    criterion: torch.optim.Adam
) -> Tuple[float, float, float, float]:
    model.eval()
    running_accuracy, running_precision, runniung_recall, running_f1 = 0.0, 0.0, 0.0, 0.0
    with torch.no_grad():
        for data, target in tqdm(test_loader):
            data, target = data.to(device), target.to(device)
            outputs = model(data)
            loss = criterion(outputs, target.unsqueeze(1))

            predicted = (outputs > 0.5).float()
            accuracy, precision, recall, f1 = metrics(target, predicted)
            running_accuracy += accuracy
            running_precision += precision
            runniung_recall += recall
            running_f1 += f1

        avg_accuracy = running_accuracy / len(test_loader)
        avg_precision = running_precision / len(test_loader)
        avg_recall = runniung_recall / len(test_loader)
        avg_f1 = running_f1 / len(test_loader)

        return avg_accuracy, avg_precision, avg_recall, avg_f1

In [None]:
history = []
for epoch in range(1, EPOCHS+1):
    train_acc, train_pre, train_rec, train_f1 = train_epoch(
        model=model,
        device=DEVICE,
        train_loader=train_loader,
        optimizer=optimizer,
        epoch=epoch
    )
    test_acc, test_pre, test_rec, test_f1 = test_epoch(
        model=model,
        device=DEVICE,
        test_loader=test_loader,
        criterion=criterion
    )

    history.append((train_acc, train_pre, train_rec, train_f1, test_acc, test_pre, test_rec, test_f1))

    print(f"Epoch {epoch} | Train Acc: {train_acc:.4f}, Train Pre: {train_pre:.4f}, Train Rec: {train_rec:.4f}, Train F1: {train_f1:.4f} | Test Acc: {test_acc:.4f}, Test Pre: {test_pre:.4f}, Test Rec: {test_rec:.4f}, Test F1: {test_f1:.4f}")

100%|██████████| 31/31 [01:36<00:00,  3.11s/it]
100%|██████████| 8/8 [00:24<00:00,  3.08s/it]


Epoch 1 | Train Acc: 0.5002, Train Pre: 0.4811, Train Rec: 0.7920, Train F1: 0.5622 | Test Acc: 0.5033, Test Pre: 0.6952, Test Rec: 0.0850, Test F1: 0.1467


100%|██████████| 31/31 [01:36<00:00,  3.12s/it]
100%|██████████| 8/8 [00:24<00:00,  3.07s/it]


Epoch 2 | Train Acc: 0.5282, Train Pre: 0.6185, Train Rec: 0.2628, Train F1: 0.2794 | Test Acc: 0.5543, Test Pre: 0.5401, Test Rec: 0.8627, Test F1: 0.6620


100%|██████████| 31/31 [01:36<00:00,  3.12s/it]
100%|██████████| 8/8 [00:24<00:00,  3.09s/it]


Epoch 3 | Train Acc: 0.5896, Train Pre: 0.5668, Train Rec: 0.7745, Train F1: 0.6451 | Test Acc: 0.5137, Test Pre: 0.5588, Test Rec: 0.4742, Test F1: 0.5082


100%|██████████| 31/31 [01:37<00:00,  3.13s/it]
100%|██████████| 8/8 [00:24<00:00,  3.12s/it]


Epoch 4 | Train Acc: 0.5945, Train Pre: 0.6103, Train Rec: 0.5954, Train F1: 0.5904 | Test Acc: 0.5254, Test Pre: 0.5276, Test Rec: 0.6805, Test F1: 0.5932


100%|██████████| 31/31 [01:38<00:00,  3.17s/it]
100%|██████████| 8/8 [00:24<00:00,  3.12s/it]


Epoch 5 | Train Acc: 0.6184, Train Pre: 0.6197, Train Rec: 0.7085, Train F1: 0.6546 | Test Acc: 0.5380, Test Pre: 0.5291, Test Rec: 0.7303, Test F1: 0.6124


100%|██████████| 31/31 [01:37<00:00,  3.15s/it]
100%|██████████| 8/8 [00:25<00:00,  3.17s/it]


Epoch 6 | Train Acc: 0.6244, Train Pre: 0.6409, Train Rec: 0.6198, Train F1: 0.6209 | Test Acc: 0.5373, Test Pre: 0.5662, Test Rec: 0.5929, Test F1: 0.5727


100%|██████████| 31/31 [01:38<00:00,  3.17s/it]
100%|██████████| 8/8 [00:24<00:00,  3.12s/it]


Epoch 7 | Train Acc: 0.6324, Train Pre: 0.6122, Train Rec: 0.7282, Train F1: 0.6595 | Test Acc: 0.5007, Test Pre: 0.5054, Test Rec: 0.4681, Test F1: 0.4825


100%|██████████| 31/31 [01:37<00:00,  3.14s/it]
100%|██████████| 8/8 [00:24<00:00,  3.11s/it]


Epoch 8 | Train Acc: 0.6392, Train Pre: 0.6477, Train Rec: 0.6336, Train F1: 0.6353 | Test Acc: 0.5132, Test Pre: 0.5060, Test Rec: 0.5734, Test F1: 0.5304


100%|██████████| 31/31 [01:37<00:00,  3.14s/it]
100%|██████████| 8/8 [00:24<00:00,  3.11s/it]


Epoch 9 | Train Acc: 0.6534, Train Pre: 0.6494, Train Rec: 0.6872, Train F1: 0.6623 | Test Acc: 0.5111, Test Pre: 0.5103, Test Rec: 0.5380, Test F1: 0.5173


100%|██████████| 31/31 [01:36<00:00,  3.12s/it]
100%|██████████| 8/8 [00:24<00:00,  3.11s/it]

Epoch 10 | Train Acc: 0.6622, Train Pre: 0.6663, Train Rec: 0.6788, Train F1: 0.6673 | Test Acc: 0.5132, Test Pre: 0.5321, Test Rec: 0.6329, Test F1: 0.5731





### Reward Model

#### Simple Frozen-Embedding Input

In [None]:
import random
from typing import Dict, List, Tuple
from datasets import Dataset as DS, load_dataset
from huggingface_hub import login
from sentence_transformers import SentenceTransformer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from transformers import AutoModel, AutoTokenizer
from tqdm import tqdm

In [None]:
login(token="...")

In [None]:
BATCH_SIZE = 64
EPOCHS = 10
INPUT_DIM = 1024 # 768
HIDDEN_DIM = 512 # 384
OUTPUT_DIM = 1
LEARNING_RATE = 0.0001

DEVICE = 'cuda'

In [None]:
# embedding_model = SentenceTransformer("intfloat/multilingual-e5-large-instruct", device=DEVICE)

# def embed(text: str) -> torch.Tensor:
#     embedding = embedding_model.encode(text)
#     return torch.Tensor(embedding)

In [None]:
MODEL_NAME = "microsoft/deberta-v3-large"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer_model = AutoModel.from_pretrained(MODEL_NAME).to(DEVICE)

tokenizer_model.eval()

OUTPUT_DIMENSION = tokenizer_model.config.hidden_size
print(f"Loaded model: {MODEL_NAME} with hidden dimension: {OUTPUT_DIMENSION}")


def tokenize_and_encode(text: str) -> torch.Tensor:
    inputs = tokenizer(
        text,
        return_tensors="pt",       # Return PyTorch tensors
        padding="max_length",      # Pad all sequences to max_length
        truncation=True,           # Truncate sequences longer than max_length
        max_length=512             # Standard max length for BERT-style models
    ).to(DEVICE)

    with torch.no_grad():
        outputs = tokenizer_model(**inputs)

    last_hidden_state = outputs.last_hidden_state
    cls_embedding = last_hidden_state[:, 0, :]

    return cls_embedding.squeeze(0)

input_text = "Prompt: Tell me about RLHF. Answer: RLHF stands for Reinforcement Learning from Human Feedback, which involves three main steps: SFT, RM training, and PPO."

token_embedding = tokenize_and_encode(input_text)

print("-" * 30)
print(f"✅ Input Text Processed.")
print(f"Output type: {type(token_embedding)}")
print(f"Output shape: {token_embedding.shape}") # Should match (OUTPUT_DIMENSION,)
print(f"First 5 values: {token_embedding[:5].cpu().numpy()}")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Loaded model: microsoft/deberta-v3-large with hidden dimension: 1024
------------------------------
✅ Input Text Processed.
Output type: <class 'torch.Tensor'>
Output shape: torch.Size([1024])
First 5 values: [-0.33532432 -0.17786105  0.10254461 -0.05691522  0.14579831]


In [None]:
class MathDataset(Dataset):
    def __init__(self, ds: DS) -> None:
        self.ds = ds


    def __len__(self) -> int:
        return len(self.ds)


    def __getitem__(self, index: int) -> Tuple[torch.Tensor, torch.Tensor]:
        item = self.ds[index]
        data_dict = dict()
        subsets = ('chosen', 'rejected')
        for subset in subsets:
            subset_convo = item[subset]
            subset_rating = item[f'{subset}_rating']
            for i, convo in enumerate(subset_convo):
                if i == 0:
                    data_dict[f'{subset}_question'] = convo['content']
                elif i == 1:
                    data_dict[f'{subset}_answer'] = convo['content']
            data_dict[f'{subset}_input_text'] = f"Question: {data_dict[f'{subset}_question']}\n\nAnswer: {data_dict[f'{subset}_answer']}"
            data_dict[f'{subset}_embedding'] = tokenize_and_encode(data_dict[f'{subset}_input_text'])
            data_dict[f'{subset}_rating'] = torch.tensor(item[f'{subset}_rating'] / 10, dtype=torch.float32).reshape(1)

        chosen_data, rejected_data = data_dict[f'chosen_embedding'], data_dict[f'rejected_embedding']
        return chosen_data, rejected_data

In [None]:
class NeuralNet(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int, output_dim: int) -> None:
        super(NeuralNet, self).__init__()
        self.model = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(),
            nn.Linear(hidden_dim, output_dim),
            nn.Sigmoid()
        )


    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.model(x)

In [None]:
def metrics(actual_rating: torch.Tensor, predicted_rating: torch.Tensor) -> Tuple[float, float, float, float]:
    accuracy = accuracy_score(actual_rating.cpu(), predicted_rating.cpu())
    precision = precision_score(actual_rating.cpu(), predicted_rating.cpu(), zero_division=0)
    recall = recall_score(actual_rating.cpu(), predicted_rating.cpu(), zero_division=0)
    f1 = f1_score(actual_rating.cpu(), predicted_rating.cpu(), zero_division=0)
    return accuracy, precision, recall, f1

In [None]:
raw_dataset = load_dataset("RLHFlow/Argilla-Math-DPO-standard")

dset = raw_dataset['train']
ds_split=dset.train_test_split(test_size=0.2)

train_dataset = MathDataset(ds_split['train'])
test_datast = MathDataset(ds_split['test'])

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
test_loader = DataLoader(test_datast, batch_size=BATCH_SIZE, shuffle=True)

In [None]:
class BradleyTerryLoss(nn.Module):
    def __init__(self) -> None:
        super(BradleyTerryLoss, self).__init__()


    def forward(self, chosen_reward: torch.Tensor, rejected_reward: torch.Tensor) -> torch.Tensor:
        reward_difference = chosen_reward - rejected_reward
        loss = -F.logsigmoid(reward_difference).mean()
        return loss

In [None]:
model = NeuralNet(
    input_dim=INPUT_DIM,
    hidden_dim=HIDDEN_DIM,
    output_dim=OUTPUT_DIM
).to(DEVICE)

criterion = BradleyTerryLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

In [None]:
def train_epoch(
    model: NeuralNet,
    device: str,
    train_loader: DataLoader,
    optimizer: torch.optim.Adam,
    epoch: int
) -> Tuple[float, float, float, float]:
    model.train()
    running_accuracy, running_precision, runniung_recall, running_f1 = 0.0, 0.0, 0.0, 0.0
    for chosen_data, rejected_data in tqdm(train_loader):
        chosen_data, rejected_data = chosen_data.to(device), rejected_data.to(device)
        optimizer.zero_grad()
        chosen_output = model(chosen_data)
        rejected_output = model(rejected_data)
        loss = criterion(chosen_output, rejected_output)
        loss.backward()
        optimizer.step()

        actual = torch.ones(chosen_output.shape)
        predicted = (chosen_output > rejected_output).float()
        accuracy, precision, recall, f1 = metrics(actual, predicted)
        running_accuracy += accuracy
        running_precision += precision
        runniung_recall += recall
        running_f1 += f1

    avg_accuracy = running_accuracy / len(train_loader)
    avg_precision = running_precision / len(train_loader)
    avg_recall = runniung_recall / len(train_loader)
    avg_f1 = running_f1 / len(train_loader)

    return avg_accuracy, avg_precision, avg_recall, avg_f1


def test_epoch(
    model: NeuralNet,
    device: str,
    test_loader: DataLoader,
    criterion: torch.optim.Adam
) -> Tuple[float, float, float, float]:
    model.eval()
    running_accuracy, running_precision, runniung_recall, running_f1 = 0.0, 0.0, 0.0, 0.0
    with torch.no_grad():
        for chosen_data, rejected_data in tqdm(test_loader):
            chosen_data, rejected_data = chosen_data.to(device), rejected_data.to(device)
            chosen_output = model(chosen_data)
            rejected_output = model(rejected_data)
            loss = criterion(chosen_output, rejected_output)

            actual = torch.ones(chosen_output.shape)
            predicted = (chosen_output > rejected_output).float()
            accuracy, precision, recall, f1 = metrics(actual, predicted)
            running_accuracy += accuracy
            running_precision += precision
            runniung_recall += recall
            running_f1 += f1

        avg_accuracy = running_accuracy / len(test_loader)
        avg_precision = running_precision / len(test_loader)
        avg_recall = runniung_recall / len(test_loader)
        avg_f1 = running_f1 / len(test_loader)

        return avg_accuracy, avg_precision, avg_recall, avg_f1

In [None]:
history = []
for epoch in range(1, EPOCHS+1):
    train_acc, train_pre, train_rec, train_f1 = train_epoch(
        model=model,
        device=DEVICE,
        train_loader=train_loader,
        optimizer=optimizer,
        epoch=epoch
    )
    test_acc, test_pre, test_rec, test_f1 = test_epoch(
        model=model,
        device=DEVICE,
        test_loader=test_loader,
        criterion=criterion
    )

    history.append((train_acc, train_pre, train_rec, train_f1, test_acc, test_pre, test_rec, test_f1))

    print(f"Epoch {epoch} | Train Acc: {train_acc:.4f}, Train Pre: {train_pre:.4f}, Train Rec: {train_rec:.4f}, Train F1: {train_f1:.4f} | Test Acc: {test_acc:.4f}, Test Pre: {test_pre:.4f}, Test Rec: {test_rec:.4f}, Test F1: {test_f1:.4f}")

100%|██████████| 31/31 [11:26<00:00, 22.13s/it]
100%|██████████| 8/8 [02:50<00:00, 21.32s/it]


Epoch 1 | Train Acc: 0.5369, Train Pre: 1.0000, Train Rec: 0.5369, Train F1: 0.6965 | Test Acc: 0.5039, Test Pre: 1.0000, Test Rec: 0.5039, Test F1: 0.6680


 26%|██▌       | 8/31 [03:01<08:40, 22.65s/it]

#### Simple Non-Frozen-Embedding Input

In [1]:
import os
import random
from typing import Dict, List, Tuple
from datasets import Dataset as DS, load_dataset
from huggingface_hub import login
import mlflow
from sentence_transformers import SentenceTransformer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from transformers import AutoModel, AutoTokenizer
from tqdm import tqdm

  from .autonotebook import tqdm as notebook_tqdm


In [2]:
# os.environ["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = "0.7"
# os.environ["PYTORCH_MPS_LOW_WATERMARK_RATIO"] = "0.6"

In [3]:
BATCH_SIZE = 64
EPOCHS = 5 # 10
OUTPUT_DIM = 1
LEARNING_RATE = 0.0001

BASE_MODEL_NAME = "distilbert-base-uncased"
BASE_MODEL_MAX_LENGTH=512

DEVICE = 'cpu'

In [4]:
from typing import List, Tuple, Dict, Any
import torch
from torch.utils.data import Dataset
from transformers import PreTrainedTokenizerBase

class MathDataset(Dataset):

    def __init__(self, ds: DS, tokenizer: PreTrainedTokenizerBase, max_length: int = 512):
        self.ds = ds
        self.tokenizer = tokenizer
        self.max_length = max_length


    def __len__(self) -> int:
        return len(self.ds)


    def _convo_to_qa(self, convo: List[Dict[str, str]]) -> Tuple[str, str]:
        q = ""
        a = ""
        if not convo:
            return "", ""
        if len(convo) == 1:
            a = convo[0].get("content", "") if isinstance(convo[0], dict) else str(convo[0])
        else:
            first = convo[0]
            second = convo[1]
            q = first.get("content", "") if isinstance(first, dict) else str(first)
            a = second.get("content", "") if isinstance(second, dict) else str(second)
        return q, a


    def __getitem__(self, index: int) -> Tuple[str, str]:
        item = self.ds[index]

        chosen_convo = item.get("chosen", [])
        rejected_convo = item.get("rejected", [])

        chosen_q, chosen_a = self._convo_to_qa(chosen_convo)
        rejected_q, rejected_a = self._convo_to_qa(rejected_convo)

        chosen_text = f"Question: {chosen_q}\n\nAnswer: {chosen_a}".strip()
        rejected_text = f"Question: {rejected_q}\n\nAnswer: {rejected_a}".strip()

        return chosen_text, rejected_text


def rlhf_collate_fn(
    batch: List[Tuple[str, str]],
    tokenizer: PreTrainedTokenizerBase,
    max_length: int = 512
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
    chosen_texts = [b[0] for b in batch]
    rejected_texts = [b[1] for b in batch]

    all_texts = chosen_texts + rejected_texts
    enc = tokenizer(
        all_texts,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )

    input_ids = enc["input_ids"]
    attention_mask = enc["attention_mask"]

    B = len(batch)
    chosen_input_ids = input_ids[:B]
    rejected_input_ids = input_ids[B:]
    chosen_attention_mask = attention_mask[:B]
    rejected_attention_mask = attention_mask[B:]

    return chosen_input_ids, chosen_attention_mask, rejected_input_ids, rejected_attention_mask

In [5]:
class HFRewardModel(nn.Module):
    def __init__(self, base_model_name: str) -> None:
        super().__init__()
        self.backbone = AutoModel.from_pretrained(base_model_name)
        self.reward_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        hidden = out.last_hidden_state
        last_token = hidden[:, -1, :]
        reward = self.reward_head(last_token).squeeze(-1)
        return reward

In [6]:
def metrics(actual_rating: torch.Tensor, predicted_rating: torch.Tensor) -> Tuple[float, float, float, float]:
    accuracy = accuracy_score(actual_rating.cpu(), predicted_rating.cpu())
    precision = precision_score(actual_rating.cpu(), predicted_rating.cpu(), zero_division=0)
    recall = recall_score(actual_rating.cpu(), predicted_rating.cpu(), zero_division=0)
    f1 = f1_score(actual_rating.cpu(), predicted_rating.cpu(), zero_division=0)
    return accuracy, precision, recall, f1

In [7]:
class BradleyTerryLoss(nn.Module):
    def __init__(self) -> None:
        super(BradleyTerryLoss, self).__init__()


    def forward(self, chosen_reward: torch.Tensor, rejected_reward: torch.Tensor) -> torch.Tensor:
        reward_difference = chosen_reward - rejected_reward
        loss = -F.logsigmoid(reward_difference).mean()
        return loss

In [8]:
raw_dataset = load_dataset("RLHFlow/Argilla-Math-DPO-standard")

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_NAME)

dset = raw_dataset['train'] #.select(range(100))
ds_split=dset.train_test_split(test_size=0.2)

train_dataset = MathDataset(ds_split['train'], tokenizer, BASE_MODEL_MAX_LENGTH)
test_datast = MathDataset(ds_split['test'], tokenizer, BASE_MODEL_MAX_LENGTH)

train_loader = DataLoader(
    train_dataset,
    batch_size=BATCH_SIZE,
    shuffle=True,
    collate_fn=lambda b: rlhf_collate_fn(b, tokenizer=tokenizer, max_length=BASE_MODEL_MAX_LENGTH)
)
test_loader = DataLoader(
    test_datast,
    batch_size=BATCH_SIZE,
    shuffle=True,
    collate_fn=lambda b: rlhf_collate_fn(b, tokenizer=tokenizer, max_length=BASE_MODEL_MAX_LENGTH),
)

In [9]:
model = HFRewardModel(base_model_name=BASE_MODEL_NAME).to(DEVICE)

criterion = BradleyTerryLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

In [10]:
def train_epoch(
    model: HFRewardModel,
    device: str,
    train_loader: DataLoader,
    optimizer: torch.optim.Adam,
    epoch: int
) -> Tuple[float, float, float, float]:
    model.train()
    running_accuracy, running_precision, runniung_recall, running_f1 = 0.0, 0.0, 0.0, 0.0
    running_loss = 0.0
    for (
        chosen_input_ids,
        chosen_attention_mask,
        rejected_input_ids,
        rejected_attention_mask
    ) in tqdm(train_loader):
        (
            chosen_input_ids,
            chosen_attention_mask,
            rejected_input_ids,
            rejected_attention_mask
        ) = (
            chosen_input_ids.to(device),
            chosen_attention_mask.to(device),
            rejected_input_ids.to(device),
            rejected_attention_mask.to(device)
        )
        optimizer.zero_grad()
        chosen_output = model(chosen_input_ids, chosen_attention_mask)
        rejected_output = model(rejected_input_ids, rejected_attention_mask)
        loss = criterion(chosen_output, rejected_output)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

        actual = torch.ones(chosen_output.shape)
        predicted = (chosen_output > rejected_output).float()
        accuracy, precision, recall, f1 = metrics(actual, predicted)
        running_accuracy += accuracy
        running_precision += precision
        runniung_recall += recall
        running_f1 += f1

    avg_loss = running_loss / len(train_loader)
    avg_accuracy = running_accuracy / len(train_loader)
    avg_precision = running_precision / len(train_loader)
    avg_recall = runniung_recall / len(train_loader)
    avg_f1 = running_f1 / len(train_loader)

    mlflow.log_metric("train_loss", avg_loss, step=epoch)
    mlflow.log_metric("train_accuracy", avg_accuracy, step=epoch)
    mlflow.log_metric("train_precision", avg_precision, step=epoch)
    mlflow.log_metric("train_recall", avg_recall, step=epoch)
    mlflow.log_metric("train_f1", avg_f1, step=epoch)

    return avg_accuracy, avg_precision, avg_recall, avg_f1


def test_epoch(
    model: HFRewardModel,
    device: str,
    test_loader: DataLoader,
    criterion: torch.optim.Adam
) -> Tuple[float, float, float, float, float]:
    model.eval()
    running_accuracy, running_precision, runniung_recall, running_f1 = 0.0, 0.0, 0.0, 0.0
    running_loss = 0.0
    with torch.no_grad():
        for (
            chosen_input_ids,
            chosen_attention_mask,
            rejected_input_ids,
            rejected_attention_mask
        ) in tqdm(test_loader):
            (
                chosen_input_ids,
                chosen_attention_mask,
                rejected_input_ids,
                rejected_attention_mask
            ) = (
                chosen_input_ids.to(device),
                chosen_attention_mask.to(device),
                rejected_input_ids.to(device),
                rejected_attention_mask.to(device)
            )
            chosen_output = model(chosen_input_ids, chosen_attention_mask)
            rejected_output = model(rejected_input_ids, rejected_attention_mask)
            loss = criterion(chosen_output, rejected_output)

            running_loss += loss.item()

            actual = torch.ones(chosen_output.shape)
            predicted = (chosen_output > rejected_output).float()
            accuracy, precision, recall, f1 = metrics(actual, predicted)
            running_accuracy += accuracy
            running_precision += precision
            runniung_recall += recall
            running_f1 += f1

        avg_loss = running_loss / len(test_loader)
        avg_accuracy = running_accuracy / len(test_loader)
        avg_precision = running_precision / len(test_loader)
        avg_recall = runniung_recall / len(test_loader)
        avg_f1 = running_f1 / len(test_loader)

        return avg_accuracy, avg_precision, avg_recall, avg_f1, avg_loss

In [11]:
with mlflow.start_run():
    train_data_size = len(train_dataset)
    test_data_size = len(test_datast)
    
    mlflow.log_param("train_data_size", train_data_size)
    mlflow.log_param("test_data_size", test_data_size)

    history = []
    for epoch in range(1, EPOCHS+1):
        train_acc, train_pre, train_rec, train_f1 = train_epoch(
            model=model,
            device=DEVICE,
            train_loader=train_loader,
            optimizer=optimizer,
            epoch=epoch
        )
        test_acc, test_pre, test_rec, test_f1, test_loss = test_epoch(
            model=model,
            device=DEVICE,
            test_loader=test_loader,
            criterion=criterion
        )

        mlflow.log_metric("test_loss", test_loss, step=epoch) 
        mlflow.log_metric("test_accuracy", test_acc, step=epoch)
        mlflow.log_metric("test_precision", test_pre, step=epoch)
        mlflow.log_metric("test_recall", test_rec, step=epoch)
        mlflow.log_metric("test_f1", test_f1, step=epoch)

        history.append((train_acc, train_pre, train_rec, train_f1, test_acc, test_pre, test_rec, test_f1))

        print(f"Epoch {epoch} | Train Acc: {train_acc:.4f}, Train Pre: {train_pre:.4f}, Train Rec: {train_rec:.4f}, Train F1: {train_f1:.4f} | Test Acc: {test_acc:.4f}, Test Pre: {test_pre:.4f}, Test Rec: {test_rec:.4f}, Test F1: {test_f1:.4f}")

    mlflow.pytorch.log_model(model, "model")

100%|██████████| 31/31 [1:07:39<00:00, 130.96s/it]
100%|██████████| 8/8 [00:35<00:00,  4.48s/it]


Epoch 1 | Train Acc: 0.5540, Train Pre: 1.0000, Train Rec: 0.5540, Train F1: 0.7104 | Test Acc: 0.5308, Test Pre: 1.0000, Test Rec: 0.5308, Test F1: 0.6927


100%|██████████| 31/31 [1:06:15<00:00, 128.23s/it]
100%|██████████| 8/8 [00:37<00:00,  4.64s/it]


Epoch 2 | Train Acc: 0.5832, Train Pre: 1.0000, Train Rec: 0.5832, Train F1: 0.7335 | Test Acc: 0.5382, Test Pre: 1.0000, Test Rec: 0.5382, Test F1: 0.6987


100%|██████████| 31/31 [1:06:32<00:00, 128.78s/it]
100%|██████████| 8/8 [00:36<00:00,  4.60s/it]


Epoch 3 | Train Acc: 0.6760, Train Pre: 1.0000, Train Rec: 0.6760, Train F1: 0.8054 | Test Acc: 0.4987, Test Pre: 1.0000, Test Rec: 0.4987, Test F1: 0.6616


100%|██████████| 31/31 [1:05:46<00:00, 127.30s/it]
100%|██████████| 8/8 [00:36<00:00,  4.54s/it]


Epoch 4 | Train Acc: 0.7874, Train Pre: 1.0000, Train Rec: 0.7874, Train F1: 0.8797 | Test Acc: 0.5354, Test Pre: 1.0000, Test Rec: 0.5354, Test F1: 0.6955


100%|██████████| 31/31 [1:05:58<00:00, 127.70s/it]
100%|██████████| 8/8 [00:36<00:00,  4.61s/it]


Epoch 5 | Train Acc: 0.9102, Train Pre: 1.0000, Train Rec: 0.9102, Train F1: 0.9527 | Test Acc: 0.5425, Test Pre: 1.0000, Test Rec: 0.5425, Test F1: 0.7025


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
