# BERT Fine-tuning for IMDB Sentiment Classification

Fine-tune a pretrained BERT model (`bert-base-uncased`) on the IMDB movie-review dataset using Hugging Face `transformers` and `datasets`. This notebook includes installation cells and a minimal training run based on the `Trainer` API.

## 0. Environment / Install (run if needed)
Run the install cell below if any packages are missing. On Colab, packages that are already installed can be skipped.

In [1]:
import sys
print('Python', sys.version)


Python 3.12.11 (main, Jun  4 2025, 08:56:18) [GCC 11.4.0]


In [2]:
# Install required packages: uncomment the next line and run this cell before proceeding (takes time on first run).
# !pip install -q transformers datasets accelerate evaluate
# On Colab, also install sentencepiece if needed.

In [3]:
# Check that the packages needed for the Hugging Face fine-tuning example are available.
import importlib.util  # find_spec lives in the importlib.util submodule, which must be imported explicitly
for pkg in ['transformers', 'datasets']:
    spec = importlib.util.find_spec(pkg)
    print(pkg, 'found' if spec else 'NOT found')

transformers found
datasets found
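
The install cell lists four packages, while the check above covers only two. A minimal sketch of a broader check follows. `missing_packages` is a hypothetical helper name, and it only tests whether a top-level module can be located, not whether its version is suitable.

```python
import importlib.util

def missing_packages(names):
    """Return the names from `names` that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Package list taken from the install cell above.
required = ["transformers", "datasets", "accelerate", "evaluate"]
missing = missing_packages(required)
print("missing:", missing if missing else "none")
```

If the list is non-empty, running the install cell and restarting the kernel is the likely next step.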


In [4]:
# If packages are installed, continue. Load the full IMDB dataset via datasets (small subsets are selected later).
try:
    from datasets import load_dataset
    ds = load_dataset('imdb')
    print(ds)
except Exception as e:
    print('Could not load IMDB (datasets missing or download failed). Run the install cell and restart the kernel.', e)

Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})
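
When a small subset is carved out of splits like these, taking the first N rows is only representative if the rows are not grouped by label, which is why the preparation cell below shuffles before selecting. The sketch below illustrates the effect on a toy label list. The grouped ordering is an assumption made for illustration, not a claim about the actual IMDB files, and `label_counts` is a hypothetical helper.

```python
import random
from collections import Counter

# Toy label column stored grouped by class (assumed ordering, for illustration only).
labels = [0] * 12500 + [1] * 12500

def label_counts(indices, labels):
    """Count how often each label occurs among the given row indices."""
    return Counter(labels[i] for i in indices)

# First 2000 rows versus 2000 rows drawn at random with a fixed seed.
first_2000 = range(2000)
shuffled_2000 = random.Random(42).sample(range(len(labels)), 2000)

print("first rows:", label_counts(first_2000, labels))
print("random rows:", label_counts(shuffled_2000, labels))
```

On the real data, an analogous check would be something like `Counter(int(y) for y in small_train['labels'])` once the subsets exist.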


In [5]:
# Tokenization and preparation (will run if transformers is available)
try:
    from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
    tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
    def preprocess(example):
        # Truncate or pad every review to exactly 256 tokens.
        return tokenizer(example['text'], truncation=True, padding='max_length', max_length=256)
    tokenized = ds.map(preprocess, batched=True)
    # The model's forward pass takes the targets under the name 'labels'.
    tokenized = tokenized.rename_column('label', 'labels')
    tokenized.set_format(type='torch', columns=['input_ids','attention_mask','labels'])
    # Shuffle with a fixed seed, then keep small subsets for a quick demo run.
    small_train = tokenized['train'].shuffle(seed=42).select(range(2000))
    small_test = tokenized['test'].shuffle(seed=42).select(range(1000))
    # num_labels=2 adds a fresh binary classification head on top of the pretrained encoder.
    model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
    print('Prepared tokenizer and model (ready for Trainer).')
except Exception as e:
    print('Transformers not available or download blocked in this environment.', e)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Prepared tokenizer and model (ready for Trainer).
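
The `truncation=True, padding='max_length', max_length=256` arguments in the cell above give every review the same length. The toy sketch below shows the idea on short lists of token ids. `pad_or_truncate` is a hypothetical helper, the ids are arbitrary examples, and a pad id of 0 is assumed. Unlike the real tokenizer, it simply cuts the sequence and does not keep special tokens in place.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Toy version of truncation=True combined with padding='max_length'."""
    kept = list(token_ids[:max_length])   # truncation: drop anything past max_length
    n_pad = max_length - len(kept)        # padding: fill up to max_length
    return {
        "input_ids": kept + [pad_id] * n_pad,
        # 1 marks real tokens, 0 marks padding the model should ignore.
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }

short = pad_or_truncate([101, 2023, 3185, 102], max_length=6)
long = pad_or_truncate(list(range(1, 11)), max_length=6)
print(short)
print(long)
```

Padding everything to a fixed length is simple but spends compute on pad tokens for short reviews; the attention mask is what keeps those positions from influencing the result.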


In [None]:
# Training with the Trainer API (small run). Needs the model and subsets from the previous cell; slow without a GPU.
training_args = TrainingArguments(
    output_dir='./bert-imdb-demo', num_train_epochs=1,
    per_device_train_batch_size=8, per_device_eval_batch_size=8,
    # eval_strategy is named evaluation_strategy in older transformers releases.
    eval_strategy='epoch', save_strategy='no', logging_steps=50, report_to=[],
)
trainer = Trainer(model=model, args=training_args, train_dataset=small_train, eval_dataset=small_test)
trainer.train()
metrics = trainer.evaluate()
print(metrics)
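
Without a `compute_metrics` function, `trainer.evaluate()` reports the evaluation loss but no accuracy. A minimal sketch of an accuracy metric follows, assuming the evaluation output unpacks into a pair of per-class logits and integer labels. It is written in plain Python so that it runs on nested lists as well as arrays; `np.argmax(logits, axis=-1)` is the usual shortcut when NumPy is at hand.

```python
def compute_metrics(eval_pred):
    """Accuracy from (logits, labels); logits hold one score per class per example."""
    logits, labels = eval_pred
    # Predicted class = index of the largest logit in each row.
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == int(y)) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# Toy check with three examples and two classes.
toy_logits = [[0.1, 0.9], [2.0, -1.0], [0.3, 0.2]]
toy_labels = [1, 0, 1]
print(compute_metrics((toy_logits, toy_labels)))
```

It would be passed as `Trainer(..., compute_metrics=compute_metrics)` when constructing the trainer.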





## Notes
- Fine-tuning BERT requires the `transformers` and `datasets` packages and is much faster on a GPU. This notebook is a minimal template: it trains for one epoch on 2,000 training reviews and evaluates on 1,000 test reviews. For a full training run, use the complete splits, train for more epochs, and use a GPU.
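
To get a rough sense of how much more work a full run is, the number of optimizer steps per epoch can be estimated from the dataset size and batch size. This sketch assumes a single device, no gradient accumulation, and that the last partial batch is kept; `steps_per_epoch` is a hypothetical helper.

```python
import math

def steps_per_epoch(num_examples, batch_size):
    """Batches needed to see every example once, keeping the last partial batch."""
    return math.ceil(num_examples / batch_size)

demo_steps = steps_per_epoch(2000, 8)    # demo subset used in this notebook
full_steps = steps_per_epoch(25000, 8)   # full IMDB train split
print(demo_steps, full_steps)            # prints: 250 3125
```

At the same batch size, one epoch over the full train split is therefore 12.5 times as many steps as the demo run.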