Trains BERT (specifically 'bert-base-uncased'), a transformer language model developed by Google, several times on the IMDB dataset with different optimisation techniques to maximise performance while minimising training time. The objective of this experiment is to compare how efficient certain optimisers are during the early stages of model training when running on a local machine (i.e. no cloud resources).
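The model setup described above can be sketched with the Hugging Face Transformers API (a minimal illustration, not the exact training script; variable names are assumptions):

```python
# Minimal sketch: load the 'bert-base-uncased' checkpoint with a
# two-label classification head for binary sentiment on IMDB.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Roughly 110M parameters, matching the size quoted in this README
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters")
```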
Model Parameter Size: 110M
The IMDB dataset is a collection of movie reviews paired with sentiment labels, commonly used in NLP, especially for sentiment analysis. It consists of two columns, 'review' and 'sentiment', where 'review' is a long string of free text and 'sentiment' is a binary label, either 'positive' or 'negative'.
IMDB Train Dataset Full Size: 25k (scaled down to 3.2k for faster training)
IMDB Test Dataset Full Size: 25k (scaled down to 800 to maintain an 80:20 train:test split)
Both datasets are completely independent of one another.
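The down-sampling described above can be sketched with the standard library (a hedged illustration, not the exact preprocessing code; the placeholder data and seed are assumptions, but the 3.2k:800 sizes preserve the 80:20 ratio):

```python
import random

random.seed(42)  # assumed seed for reproducibility; the original is not stated

# Stand-ins for the full 25k-row train and test splits
full_train = [f"train_review_{i}" for i in range(25_000)]
full_test = [f"test_review_{i}" for i in range(25_000)]

TRAIN_SIZE = 3_200           # scaled-down training set
TEST_SIZE = TRAIN_SIZE // 4  # 800, preserving the 80:20 train:test ratio

small_train = random.sample(full_train, TRAIN_SIZE)
small_test = random.sample(full_test, TEST_SIZE)

print(len(small_train), len(small_test))  # 3200 800
```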
Create Virtual Environment:
python -m venv bert-env
Activate Virtual Environment:
source bert-env/bin/activate (MacOS/Linux)
bert-env\Scripts\Activate.ps1 (Windows)
Libraries/Dependencies used:
- PyTorch
- Hugging Face Transformers
- Matplotlib
- scikit-learn (sklearn)
- tqdm; gc and warnings (Python standard library)
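The dependencies above can be installed into the activated environment with pip (package names are assumptions based on the list; gc and warnings ship with Python and need no install):

```shell
pip install torch transformers matplotlib scikit-learn tqdm
```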
Python Version: 3.12.2