# Fine-tune Hugging Face BERT with PyTorch Lightning

In [3]:
import os
import sys

# Detect if running in Google Colab
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    print("Running in Google Colab. Setting up virtual environment from project repository...")

    # Install uv package manager if not already installed
    !pip install uv

    # Clone the project repository into the current directory
    !git clone https://github.com/anitamaxvim/toxy-bot.git .

    # Verify the files exist
    if os.path.exists('pyproject.toml'):
        print("Found pyproject.toml file.")
    else:
        print("Warning: pyproject.toml not found. Virtual environment setup may be incomplete.")

    if os.path.exists('uv.lock'):
        print("Found uv.lock file.")
    else:
        print("Warning: uv.lock file not found. Will rely on pyproject.toml for dependencies.")

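    # Create a virtual environment in .venv (uv venv creates it but does not activate it)
    !uv venv

    # Install the project in editable mode with its dependencies from pyproject.toml.
    # Note: `uv pip install` resolves from pyproject.toml and does not read uv.lock.
    !uv pip install -e .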

    # Check installation
    !uv pip list

    print("\nVirtual environment setup complete. You can now import your project packages.")
else:
    print("Not running in Google Colab. Using local environment.")

Not running in Google Colab. Using local environment.


The following cells train the model with the settings shown below.

In [4]:
import torch

from toxy_bot.ml.datamodule import AutoTokenizerDataModule
from toxy_bot.ml.module import SequenceClassificationModule
from toxy_bot.ml.utils import create_dirs
from toxy_bot.ml.config import Config, DataModuleConfig, ModuleConfig, TrainerConfig

from toxy_bot.ml.trainer import train


A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.2.4 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/dbozbay/Dev/toxy-bot/.venv/lib/python3.11/site-packages/ipykernel_launcher.py", line 18, in <module>
    app.launch_new_instance()
  File "/Users/dbozbay/Dev/toxy-bot/.venv/lib/python3.11/site-packages/traitlets/config/application.py", line 1075, in launch_instance
    app.start()
  File "/Users/dbozbay/Dev/toxy-bot/.venv/lib/python3.11/site-packages/ipykernel/kernelapp.py", line 739, in start
    self.io
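The warning above says that a module compiled against NumPy 1.x is being loaded under NumPy 2.2.4, and suggests either downgrading to `numpy<2` or upgrading the affected module. As a minimal sketch, the check below reports which side of that boundary the current environment is on. The helper names `major_version` and `numpy_is_2x` are illustrative and not part of the project.

```python
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string: str) -> int:
    """Return the leading integer of a dotted version string, e.g. '2.2.4' -> 2."""
    return int(version_string.split(".")[0])


def numpy_is_2x(version_string: str) -> bool:
    """True when the version belongs to the NumPy 2.x series (or later)."""
    return major_version(version_string) >= 2


# Read the installed version from package metadata without importing NumPy itself.
try:
    installed = version("numpy")
except PackageNotFoundError:
    installed = None

if installed is None:
    print("NumPy is not installed in this environment.")
elif numpy_is_2x(installed):
    # Only a problem if some compiled dependency was built against NumPy 1.x.
    print(f"NumPy {installed} detected; modules built against 1.x may fail to import.")
else:
    print(f"NumPy {installed} detected; 1.x-compiled modules should load.")
```

If the check reports a 2.x version and the warning persists, pinning `numpy<2` in the project dependencies is the route the message itself recommends.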

First, let's read the basic settings from the config classes and print them.

In [5]:
model_name = ModuleConfig.model_name
lr = ModuleConfig.learning_rate
dataset_name = DataModuleConfig.dataset_name
batch_size = DataModuleConfig.batch_size

print(f"Model: {model_name}")
print(f"Learning rate: {lr}")
print(f"Dataset: {dataset_name}")
print(f"Batch size: {batch_size}")

cache_dir = Config.cache_dir
log_dir = Config.log_dir
ckpt_dir = Config.ckpt_dir
perf_dir = Config.perf_dir

print(f"Cache dir: {cache_dir}")
print(f"Log dir: {log_dir}")
print(f"Checkpoints dir: {ckpt_dir}")
print(f"Performance dir: {perf_dir}")

# Trade some float32 matmul precision for speed on hardware that supports it
torch.set_float32_matmul_precision("medium")

Model: google/bert_uncased_L-4_H-512_A-8
Learning rate: 3e-05
Dataset: anitamaxvim/jigsaw-toxic-comments
Batch size: 16
Cache dir: /Users/dbozbay/Dev/toxy-bot/data
Log dir: /Users/dbozbay/Dev/toxy-bot/logs
Checkpoints dir: /Users/dbozbay/Dev/toxy-bot/checkpoints
Performance dir: /Users/dbozbay/Dev/toxy-bot/logs/perf
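The source of `toxy_bot.ml.config` is not shown in this notebook. The sketch below is one plausible layout that would produce the values printed above when the attributes are read directly from the classes. The use of dataclasses and the `ROOT` variable are assumptions; only the attribute names and printed values come from the cell above.

```python
from dataclasses import dataclass
from pathlib import Path

# Hypothetical project root; the real package may resolve this differently.
ROOT = Path.cwd()


@dataclass(frozen=True)
class Config:
    # Directory layout inferred from the printed paths.
    cache_dir: Path = ROOT / "data"
    log_dir: Path = ROOT / "logs"
    ckpt_dir: Path = ROOT / "checkpoints"
    perf_dir: Path = ROOT / "logs" / "perf"


@dataclass(frozen=True)
class ModuleConfig:
    model_name: str = "google/bert_uncased_L-4_H-512_A-8"
    learning_rate: float = 3e-5


@dataclass(frozen=True)
class DataModuleConfig:
    dataset_name: str = "anitamaxvim/jigsaw-toxic-comments"
    batch_size: int = 16


# Defaults stay available as class attributes, so no instance is needed.
print(f"Model: {ModuleConfig.model_name}")
print(f"Batch size: {DataModuleConfig.batch_size}")
print(f"Performance dir: {Config.perf_dir}")
```

Keeping the defaults on frozen dataclasses lets the notebook read them as `ModuleConfig.model_name` while still allowing an instance with overrides to be built for experiments.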


In [None]:
train(perf=True)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google/bert_uncased_L-4_H-512_A-8 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/Users/dbozbay/Dev/toxy-bot/.venv/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/accelerator_connector.py:513: You passed `Trainer(accelerator='cpu', precision='16-mixed')` but AMP with fp16 is not supported on CPU. Using `precision='bf16-mixed'` instead.
Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
Seed set to 42
[2025-03-27 16:28:36.163277] Data cache exists. Loading from cache.
Map: 100%|██████████| 135635/135635 [01:24<00:00, 1601.82 examples/s]
Map: 100%|██████████| 23936/23936 [00:14<00:00, 1664.67 examples/s]

  | Name      | Type                

Sanity Checking DataLoader 0:   0%|          | 0/2 [00:00<?, ?it/s]

/Users/dbozbay/Dev/toxy-bot/.venv/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:425: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.


                                                                           

/Users/dbozbay/Dev/toxy-bot/.venv/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:425: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.


Epoch 0:   0%|          | 3/8478 [09:01<425:00:35,  0.01it/s, v_num=1, train_loss=0.678]
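Two details of the log above can be checked with a few lines of arithmetic. The progress bar total of 8478 matches the 135635 mapped examples at a batch size of 16, assuming those examples are the training split and the last partial batch is kept. The precision warning shows Lightning replacing `16-mixed` with `bf16-mixed` on CPU. The helpers below are a sketch under those assumptions; their names are illustrative, and the `cpu_count - 1` worker heuristic is a guess at how the suggested `num_workers=7` might be derived.

```python
import math


def steps_per_epoch(num_examples: int, batch_size: int, drop_last: bool = False) -> int:
    """Number of batches a DataLoader yields per epoch."""
    if drop_last:
        return num_examples // batch_size
    return math.ceil(num_examples / batch_size)


def pick_precision(has_cuda: bool) -> str:
    """fp16 AMP needs a GPU; on CPU the log shows a switch to bf16."""
    return "16-mixed" if has_cuda else "bf16-mixed"


def pick_num_workers(cpu_count: int) -> int:
    """Heuristic (assumption): leave one core free for the main process."""
    return max(1, cpu_count - 1)


print(steps_per_epoch(135_635, 16))   # 8478, the total shown in the progress bar
print(steps_per_epoch(23_936, 16))    # 1496
print(pick_precision(has_cuda=False))
print(pick_num_workers(8))
```

In a real run, `has_cuda` could come from `torch.cuda.is_available()` and `cpu_count` from `os.cpu_count()`. Whether the project's `train` function accepts such overrides is not shown here.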