# Tutorial on Training Classification models using BilbyStats

In this short tutorial we train, run and evaluate some basic transformer-based models for classification. First we work through an example on a small subset of the data that runs quickly. We then consider the full dataset below. The dataset we will use is originally from https://www.kaggle.com/datasets/ankurzing/sentiment-analysis-in-commodity-market-gold but is saved in bilbystats. It is a news dataset consisting of over 10000 news headlines regarding commodities. The headlines are in English, but I have also translated them to Chinese so that I can illustrate analysis using Chinese transformer-based models.

In [2]:
import pandas as pd
import bilbystats as bs

# Load in the dataset (originally from https://www.kaggle.com/datasets/ankurzing/sentiment-analysis-in-commodity-market-gold)
df = bs.read_data('gold-dataset-sinha-khandait.parquet')

# Restrict to the first 1000 rows for this simple example
df_first_1000 = df.head(1000)

# Illustrate the first 3 rows of the dataframe
df_first_1000.head(3)



Unnamed: 0,Dates,URL,News,Price Direction Up,Price Direction Constant,Price Direction Down,Asset Comparision,Past Information,Future Information,Price Sentiment,News_Chinese
0,28-01-2016,http://www.marketwatch.com/story/april-gold-do...,"april gold down 20 cents to settle at $1,116.1...",0,0,1,0,1,0,negative,四月黄金期货下跌20美分，收于每盎司1116.10美元。
1,13-09-2017,http://www.marketwatch.com/story/gold-prices-s...,gold suffers third straight daily decline,0,0,1,0,1,0,negative,黄金遭遇连续第三天下跌
2,26-07-2016,http://www.marketwatch.com/story/gold-futures-...,Gold futures edge up after two-session decline,1,0,0,0,1,0,positive,黄金期货在两日下跌后小幅上涨
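
The target used in the rest of this tutorial, `Price Direction Up`, is one of three indicator columns. As a small sketch, using only the three rows displayed above (copied by hand rather than loaded through bilbystats, with shortened hypothetical key names), the binary target can be read as follows. Note that in these rows a label of 0 corresponds to "down" headlines, so 0 means "not up" rather than strictly "unchanged".

```python
# Three rows copied by hand from the table above (indicator columns only).
# The keys "up", "constant" and "down" are shortened, hypothetical names.
rows = [
    {"News": "april gold down 20 cents to settle at $1,116.1...", "up": 0, "constant": 0, "down": 1},
    {"News": "gold suffers third straight daily decline", "up": 0, "constant": 0, "down": 1},
    {"News": "Gold futures edge up after two-session decline", "up": 1, "constant": 0, "down": 0},
]


def binary_up_label(row):
    """Return 1 if the headline is flagged as 'up', otherwise 0."""
    return int(row["up"] == 1)


labels = [binary_up_label(r) for r in rows]
print(labels)  # [0, 0, 1]

# In these rows, every headline with label 0 is a "down" headline.
zero_rows = [r for r in rows if binary_up_label(r) == 0]
print(all(r["down"] == 1 for r in zero_rows))  # True
```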


## Performing classification using transformer based models

### Quickly training a transformer classifier on a small subset of the data

We now give an example of training a classification model on this dataset using bilbystats. We start with a small subset of the data so that you can see an example that runs really fast.

In [None]:
# Specify the covariate and target columns
covariate = 'News'
target = 'Price Direction Up'

# Split the indices into training, validation, and testing
indices = bs.data_idx_split(df_first_1000.index)

# Split the data itself into training, validation, and testing sets
train_data, valid_data, test_data = bs.train_val_test_split(
    df_first_1000, covariate, target, indices)

# Define the model name and tokenize the data
model_name = "distilbert-base-uncased" 
train_data_tk, valid_data_tk, test_data_tk = bs.tokenize_data(
    train_data, valid_data, test_data, model_name)

# Define the output directory and save name for the model
savedir = bs.check_dir("bs_examples/") # You can replace this with a str containing your desired directory
savename = "bs_training_example"

# Define the label mapping for the target variable
label2id = {"NEUTRAL": 0, "UP": 1}

# Train the model using bilbystats
trainer, model, training_args = bs.trainTFmodel(
    train_data_tk, valid_data_tk, model_name, savename=savename, savedir=savedir, num_labels=2, label2id=label2id)

Map: 100%|██████████| 800/800 [00:00<00:00, 17646.39 examples/s]
Map: 100%|██████████| 100/100 [00:00<00:00, 16989.93 examples/s]
Map: 100%|██████████| 100/100 [00:00<00:00, 16428.28 examples/s]
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,No log,0.362264,0.9,0.901654,0.87934,0.888592
2,No log,0.317178,0.9,0.888795,0.897569,0.89275
3,0.338700,0.362722,0.9,0.888795,0.897569,0.89275
4,0.338700,0.423557,0.89,0.877015,0.895833,0.883856




Note that, because this example uses so few datapoints, no training loss is reported for the first two epochs (the `No log` entries). In practice I recommend using more data, in which case this will not be a problem, as the full-dataset example further below shows. This example is designed to run quickly for illustration purposes.
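
One possible explanation for the `No log` entries, sketched below, is that the training loss is only logged every fixed number of optimisation steps, so short epochs can finish before the first log. Everything numeric here is an assumption: the batch size of 8 is inferred from the checkpoint name `checkpoint-1057` used later in this tutorial (8456 training rows over 1057 steps), and the logging interval of 250 steps is a purely illustrative value, not a confirmed bilbystats setting. The function name `first_logged_epoch` is hypothetical.

```python
import math


def first_logged_epoch(n_train, batch_size, logging_steps):
    """First epoch at whose end a training loss has already been logged."""
    steps_per_epoch = math.ceil(n_train / batch_size)
    return math.ceil(logging_steps / steps_per_epoch)


# Assumed values: batch_size=8 (inferred) and logging_steps=250 (illustrative).
print(first_logged_epoch(800, 8, 250))   # 3
print(first_logged_epoch(8456, 8, 250))  # 1
```

Under these assumptions the 800-row run reports its first training loss at epoch 3, while the 8456-row run already has one at epoch 1.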

We can do the same for the Chinese sentences by simply changing the covariate column and the model as follows.

In [None]:
# Specify the covariate and target columns
covariate = 'News_Chinese'
target = 'Price Direction Up'

# Split the indices into training, validation, and testing
indices = bs.data_idx_split(df_first_1000.index)

# Split the data itself into training, validation, and testing sets
train_data, valid_data, test_data = bs.train_val_test_split(
    df_first_1000, covariate, target, indices)

# Define the model name and tokenize the data
model_name = "hfl/chinese-roberta-wwm-ext" 
train_data_tk, valid_data_tk, test_data_tk = bs.tokenize_data(
    train_data, valid_data, test_data, model_name)

# Define the output directory and save name for the model
savedir = bs.check_dir("bs_examples/") # You can replace this with a str containing your desired directory
savename = "bs_training_example_chinese"

# Define the label mapping for the target variable
label2id = {"NEUTRAL": 0, "UP": 1}

# Train the model using bilbystats
trainer, model, training_args = bs.trainTFmodel(
    train_data_tk, valid_data_tk, model_name, savename=savename, savedir=savedir, num_labels=2, label2id=label2id)

Map: 100%|██████████| 800/800 [00:00<00:00, 12261.49 examples/s]
Map: 100%|██████████| 100/100 [00:00<00:00, 9043.35 examples/s]
Map: 100%|██████████| 100/100 [00:00<00:00, 11184.51 examples/s]
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at hfl/chinese-roberta-wwm-ext and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,No log,0.22073,0.92,0.908046,0.931424,0.915931
2,No log,0.229893,0.95,0.939024,0.960938,0.947207
3,0.232800,0.289713,0.95,0.939024,0.960938,0.947207
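
The precision, recall and F1 columns differ from the accuracy because they appear to be macro-averaged over the two classes. As a sketch under that assumption, the following computes the four metrics from a binary confusion matrix. The counts used (59, 5, 0, 36) are a hypothetical confusion matrix on 100 validation headlines, chosen because it is consistent with the epoch 2 row of the table above; the actual counts are not shown in this tutorial.

```python
from statistics import mean


def macro_metrics(tn, fp, fn, tp):
    """Accuracy and macro-averaged precision, recall, F1 for two classes.

    Assumes both classes are predicted and present at least once
    (otherwise a division by zero occurs).
    """
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    # Per-class values: class 1 ("UP") first, then class 0.
    precisions = [tp / (tp + fp), tn / (tn + fn)]
    recalls = [tp / (tp + fn), tn / (tn + fp)]
    f1s = [2 * p * r / (p + r) for p, r in zip(precisions, recalls)]
    return accuracy, mean(precisions), mean(recalls), mean(f1s)


# Hypothetical counts, consistent with the epoch 2 row above.
acc, prec, rec, f1 = macro_metrics(tn=59, fp=5, fn=0, tp=36)
print(round(acc, 4), round(prec, 4), round(rec, 4), round(f1, 4))
# 0.95 0.939 0.9609 0.9472
```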




In [None]:
# Specify the covariate and target columns
covariate = 'News_Chinese'
target = 'Price Direction Up'

# Split the indices into training, validation, and testing
indices = bs.data_idx_split(df_first_1000.index)

# Split the data itself into training, validation, and testing sets
train_data, valid_data, test_data = bs.train_val_test_split(
    df_first_1000, covariate, target, indices)

# Define the model name and tokenize the data
model_name = "schen/longformer-chinese-base-4096"
train_data_tk, valid_data_tk, test_data_tk = bs.tokenize_data(
    train_data, valid_data, test_data, model_name)

# Define the output directory and save name for the model
savedir = bs.check_dir("bs_examples/") # You can replace this with a str containing your desired directory
savename = "bs_training_example_longformer_chinese"

# Define the label mapping for the target variable
label2id = {"NEUTRAL": 0, "UP": 1}

# Train the model using bilbystats
trainer, model, training_args = bs.trainTFmodel(
    train_data_tk, valid_data_tk, model_name, savename=savename, savedir=savedir, num_labels=2, label2id=label2id)

Map: 100%|██████████| 800/800 [00:00<00:00, 11457.85 examples/s]
Map: 100%|██████████| 100/100 [00:00<00:00, 12196.29 examples/s]
Map: 100%|██████████| 100/100 [00:00<00:00, 12746.71 examples/s]
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at schen/longformer-chinese-base-4096 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,No log,0.376419,0.88,0.868506,0.894097,0.875
2,No log,0.326329,0.91,0.9,0.929688,0.906629
3,0.314100,0.326802,0.92,0.909091,0.9375,0.916667
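
Each of the cells above starts by splitting the index into training, validation and test parts, and the progress bars (800/100/100 rows) suggest an 80/10/10 split. The following is a minimal sketch of a seeded split of that kind. It is not the bilbystats implementation: the function name `split_indices`, the `ratios` and `seed` parameters, and the `"train"` and `"valid"` keys are hypothetical (only the `"test"` key appears in this tutorial).

```python
import random


def split_indices(index, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle the index with a fixed seed and cut it into three parts."""
    idx = list(index)
    random.Random(seed).shuffle(idx)
    n_train = int(ratios[0] * len(idx))
    n_valid = int(ratios[1] * len(idx))
    return {
        "train": idx[:n_train],
        "valid": idx[n_train:n_train + n_valid],
        "test": idx[n_train + n_valid:],  # whatever remains
    }


first = split_indices(range(1000))
second = split_indices(range(1000))
print({name: len(part) for name, part in first.items()})
# {'train': 800, 'valid': 100, 'test': 100}
print(first == second)  # True: the same seed gives the same split
```

Fixing the seed is what makes it safe to re-create the split in a later cell and still evaluate on rows that the model never saw during training.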




### Training on the Full Dataset

Below we do the same thing but using a model fit on the whole dataset (maintaining the same split ratios). Note the increase in the metrics: validation accuracy rises from about 0.90 on the small subset to about 0.95.
In this case there is no problem with the training loss not being defined.

In [None]:
# Specify the covariate and target columns
covariate = 'News'
target = 'Price Direction Up'

# Split the indices into training, validation, and testing
indices = bs.data_idx_split(df.index)

# Split the data itself into training, validation, and testing sets
train_data, valid_data, test_data = bs.train_val_test_split(
    df, covariate, target, indices)

# Define the model name and tokenize the data
model_name = "distilbert-base-uncased"
train_data_tk, valid_data_tk, test_data_tk = bs.tokenize_data(
    train_data, valid_data, test_data, model_name)

# Define the output directory and save name for the model
savedir = bs.check_dir("bs_examples/") # You can replace this with a str containing your desired directory
savename = "bs_training_example_full"

# Define the label mapping for the target variable
label2id = {"NEUTRAL": 0, "UP": 1}

# Train the model using bilbystats
trainer, model, training_args = bs.trainTFmodel(
    train_data_tk, valid_data_tk, model_name, savename=savename, savedir=savedir, num_labels=2, label2id=label2id)

Map: 100%|██████████| 8456/8456 [00:00<00:00, 19346.44 examples/s]
Map: 100%|██████████| 1057/1057 [00:00<00:00, 19857.92 examples/s]
Map: 100%|██████████| 1057/1057 [00:00<00:00, 19154.47 examples/s]
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,0.2228,0.202126,0.94702,0.946968,0.942595,0.944644
2,0.1312,0.242618,0.945128,0.942939,0.942939,0.942939
3,0.0851,0.249069,0.95175,0.950993,0.948477,0.949688




We can once more do the same for the Chinese sentences as follows.

In [None]:
# Specify the covariate and target columns
covariate = 'News_Chinese'
target = 'Price Direction Up'

# Split the indices into training, validation, and testing
indices = bs.data_idx_split(df.index)

# Split the data itself into training, validation, and testing sets
train_data, valid_data, test_data = bs.train_val_test_split(
    df, covariate, target, indices)

# Define the model name and tokenize the data
model_name = "hfl/chinese-roberta-wwm-ext"
train_data_tk, valid_data_tk, test_data_tk = bs.tokenize_data(
    train_data, valid_data, test_data, model_name)

# Define the output directory and save name for the model
savedir = bs.check_dir("bs_examples/") # You can replace this with a str containing your desired directory
savename = "bs_training_example_full_chinese"

# Define the label mapping for the target variable
label2id = {"NEUTRAL": 0, "UP": 1}

# Train the model using bilbystats
trainer, model, training_args = bs.trainTFmodel(
    train_data_tk, valid_data_tk, model_name, savename=savename, savedir=savedir, num_labels=2, label2id=label2id)

Map: 100%|██████████| 8456/8456 [00:00<00:00, 14119.53 examples/s]
Map: 100%|██████████| 1057/1057 [00:00<00:00, 14232.68 examples/s]
Map: 100%|██████████| 1057/1057 [00:00<00:00, 14423.31 examples/s]
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at hfl/chinese-roberta-wwm-ext and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,0.2653,0.211437,0.940397,0.938789,0.937057,0.937899
2,0.1745,0.196335,0.947966,0.947769,0.943771,0.945654
3,0.1135,0.215896,0.95175,0.95174,0.947707,0.949607



## Performing prediction using the trained models

### Prediction on the English Headlines

We can use the saved models to predict the test texts in bulk as follows.

In [13]:
from sklearn.metrics import accuracy_score
model_path = bs.check_dir("bs_examples/bs_training_example_full/checkpoint-1057/")

covariate = 'News'
target = 'Price Direction Up'

# Split the data (using the same seed as before) - this bit is not necessary if you have already performed the runs above
indices = bs.data_idx_split(df.index)
train_data, valid_data, test_data = bs.train_val_test_split(df, covariate, target, indices)
model_name = "distilbert-base-uncased" 

test_predictions = bs.predict(test_data, model_path, model_name)
accuracy_score(test_predictions['true_labels'], test_predictions['pred_labels'])

Map: 100%|██████████| 1057/1057 [00:00<00:00, 17376.60 examples/s]


0.9441816461684012

To predict individual pieces of text you can also do the following.

In [29]:
direction_mapping = {
    0: 'Neutral',
    1: 'Up'
}

example_sentence = df.loc[indices['test'][0], 'News']
print(example_sentence)

# You can predict this sentence directly
prediction = bs.predict(example_sentence, model_path, model_name)
print(direction_mapping[prediction['pred_labels'][0]])

gold edges higher, trades at $1,431.20 an ounce


Map: 100%|██████████| 1/1 [00:00<00:00, 485.90 examples/s]


Up


This gives the same result as the corresponding entry of the bulk test predictions, as we see here:

In [28]:
# Or obtain it from the bulk predictions on the test set
prediction2 = test_predictions['pred_labels'][0]
print(direction_mapping[prediction2])

Up
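
The `direction_mapping` dictionary used above turns predicted integer ids back into readable names, so it plays the role of the inverse of the `label2id` dictionary passed at training time (with differently capitalised names). As a small sketch, such an inverse can be derived rather than typed by hand, so that the two cannot drift apart. The `predicted_ids` list is made up for illustration.

```python
label2id = {"NEUTRAL": 0, "UP": 1}

# Invert the mapping so that integer predictions map back to label names.
id2label = {idx: name for name, idx in label2id.items()}

predicted_ids = [1, 0, 1]  # made-up predictions, for illustration only
print([id2label[i] for i in predicted_ids])  # ['UP', 'NEUTRAL', 'UP']

# Sanity check: the ids should be exactly 0..num_labels-1, with no duplicates.
num_labels = 2
assert sorted(id2label) == list(range(num_labels))
```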


Note that running in bulk rather than individually can save time. At evaluation time, however, you may sometimes want to run on individual pieces of text, so it's useful to have both types of functionality!

### Prediction on the Chinese Headlines

In [3]:
from sklearn.metrics import accuracy_score
model_path = bs.check_dir("bs_examples/bs_training_example_full_chinese/checkpoint-2114/")

covariate = 'News_Chinese'
target = 'Price Direction Up'

# Split the data (using the same seed as before) - this bit is not necessary if you have already performed the runs above
indices = bs.data_idx_split(df.index)
train_data, valid_data, test_data = bs.train_val_test_split(df, covariate, target, indices)
model_name = "hfl/chinese-roberta-wwm-ext"

test_predictions = bs.predict(test_data, model_path, model_name)
accuracy_score(test_predictions['true_labels'], test_predictions['pred_labels'])

Map: 100%|██████████| 1057/1057 [00:00<00:00, 13605.17 examples/s]


0.935666982024598

Note the small loss in test accuracy (about 0.936 here versus about 0.944 for the English model). This may be due to translation quality, since I translated the original English headlines to Chinese for this tutorial. In our datasets the original data will be in Chinese, so we shouldn't suffer from this effect.
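
To get a feel for how small this gap is, the two test accuracies reported in this tutorial can be converted back into counts of correctly classified headlines. The sketch below also computes a rough binomial standard error for a single accuracy estimate; this is only an approximation, since both models are evaluated on the same 1057 test headlines and a paired comparison would be more appropriate.

```python
import math

n_test = 1057
acc_english = 0.9441816461684012  # test accuracy reported for the English model
acc_chinese = 0.935666982024598   # test accuracy reported for the Chinese model

# Convert the accuracies back into numbers of correct predictions.
correct_english = round(acc_english * n_test)
correct_chinese = round(acc_chinese * n_test)
print(correct_english, correct_chinese, correct_english - correct_chinese)
# 998 989 9

# Rough binomial standard error of one accuracy estimate on 1057 headlines.
standard_error = math.sqrt(acc_english * (1 - acc_english) / n_test)
print(round(standard_error, 4))  # 0.0071
```

The difference amounts to 9 headlines out of 1057, which is of the same order as the standard error of a single estimate, so it should not be over-interpreted.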

As above, we can predict individual pieces of text as follows.

In [5]:
direction_mapping = {
    0: 'Neutral',
    1: 'Up'
}

example_sentence = df.loc[indices['test'][0], 'News_Chinese']
print(example_sentence)

# You can predict this sentence directly
prediction = bs.predict(example_sentence, model_path, model_name)
print(direction_mapping[prediction['pred_labels'][0]])

黄金小幅走高，报每盎司1,431.20美元。


Map: 100%|██████████| 1/1 [00:00<00:00, 399.65 examples/s]


Up


In [33]:
# Or obtain it from the bulk predictions on the test set
prediction2 = test_predictions['pred_labels'][0]
print(direction_mapping[prediction2])

Up


In [None]:
# We can also try with examples which are not in the training set. For instance:
ex_sentence = bs.translate('Gold is going up', 'gpt-4o', 'Chinese')
prediction = bs.predict(ex_sentence, model_path, model_name)
print(direction_mapping[prediction['pred_labels'][0]])

Map: 100%|██████████| 1/1 [00:00<00:00, 526.00 examples/s]


Up
