This project is based on this TinyBERT repository. Some of the changes we introduced:
- New model architectures based on the original TinyBERT implementation.
- Support for newer Python versions.
- Code formatting.
This repository does not include the training code - only code to run inference.
```bash
# Clone the NL-DPE repository.
git clone https://github.com/HewlettPackard/nl-dpe
cd ./nl-dpe

# Create Python virtual environment.
virtualenv ./.env --python=3
source ./.env/bin/activate
pip install poetry==2.1.2

# Install project and its dependencies.
poetry install --without=dev

# Make sure project modules are importable.
export PYTHONPATH=$(pwd)
```
- `MLFLOW_TRACKING_URI` [optional]: MLflow tracking server URI to load models from (when the model URI is `mlflow:{MLFLOW_RUN_ID}`). Currently, this only works when evaluation runs on the system where the models under test were trained.
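For illustration, a minimal sketch of pointing the evaluation script at an MLflow run; the tracking URI below is a placeholder, and the run ID is simply the SST-2 run listed later in this README:

```python
import os
import subprocess

# Placeholder value: point this at your own MLflow tracking server.
os.environ["MLFLOW_TRACKING_URI"] = "http://127.0.0.1:5000"

# Run ID of the SST-2 model listed in the models section below (wise-tern-584).
mlflow_run_id = "70457740292c4ff58fa8cb7ca009f154"

# Model URIs of the form mlflow:{MLFLOW_RUN_ID} are resolved through the tracking server.
subprocess.run(
    ["python", "./nl_dpe/evaluate.py", "sst-2", f"mlflow:{mlflow_run_id}", "./datasets/glue/SST-2"],
    check=True,
)
```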
The evaluation script uses the GLUE dataset. See the README file from the TinyBERT repository (Data Augmentation section) for more details. From that README file:

> Before running data augmentation of GLUE tasks you should download the GLUE data by running this script and unpack it to some directory.

We provide the download_glue_data.py script, which is an updated version of the GitHub gist mentioned above.
- Download the MRPC dataset:

  ```bash
  mkdir -p ./datasets/mrpc && cd ./datasets/mrpc
  wget https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_train.txt
  wget https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_test.txt
  ls -lh
  sha1sum ./msr_paraphrase_test.txt ./msr_paraphrase_train.txt
  # size   sha1                                      file
  # 431K   4265196c15cf75620b0b592b8b921f543bda7e6c  msr_paraphrase_test.txt
  # 1023K  716e0f67af962f08220b7e97d229b293077ef41f  msr_paraphrase_train.txt
  ```

- Download the GLUE dataset (after downloading the MRPC data):

  ```bash
  python ./nl_dpe/download_glue_data.py --data_dir=./datasets/glue --tasks=all --path_to_mrpc ./datasets/mrpc
  ```
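After downloading, a quick sanity check can confirm that the task directories used later in this README exist (plain Python sketch; other GLUE task directories are omitted here):

```python
from pathlib import Path

# Check only the GLUE task directories referenced by the evaluation examples below.
glue_dir = Path("./datasets/glue")
for task in ("SST-2", "CoLA", "MRPC"):
    print(task, "present" if (glue_dir / task).is_dir() else "MISSING")
```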
Three models are located in the models directory.
- wise-tern-584: SST-2
  - From paper
    - BERT_base_teacher = 93.4, TinyBERT_4 = 92.6, TinyBERT_6 = 93.1
  - Our experiments
    - TinyBERT = 92.77
    - TinyBERT with hidden state averaging (run_name = wise-tern-584, run_id = 70457740292c4ff58fa8cb7ca009f154) = 91.74
- blushing-dove-984: CoLA
  - From paper
    - BERT_base_teacher = 52.8, TinyBERT_4 = 44.1, TinyBERT_6 = 51.1
  - Our experiments
    - TinyBERT = 40.92
    - TinyBERT with hidden state averaging (run_name = blushing-dove-984, run_id = 37059330c45b4a259ad5fadd812beaa5) = 31.18
- spiffy-snake-501: MRPC
  - From paper
    - BERT_base_teacher = 87.5, TinyBERT_4 = 86.4, TinyBERT_6 = 87.3
  - Our experiments
    - TinyBERT = 87.36
    - TinyBERT with hidden state averaging (run_name = spiffy-snake-501, run_id = b3c699617932420da22862c37698724a) = 78.72
Look at the nl_dpe/evaluate.py file:
```bash
python ./nl_dpe/evaluate.py <task_name> <model_uri> <data_dir>
# <task_name>: One of 'cola', 'sst-2', 'mrpc', 'sts-b', 'qqp', 'mnli', 'mnli-mm', 'qnli', 'rte', 'wnli'.
#              The task name must be consistent with the model being evaluated.
# <model_uri>: Path to a model directory.
# <data_dir>:  Path to the data directory (e.g., ${GLUE_DIR}/SST-2).

python ./nl_dpe/evaluate.py sst-2 ./models/wise-tern-584/ ./datasets/glue/SST-2
# acc = 0.9174311926605505, eval_loss = 0.2411253090415682

python ./nl_dpe/evaluate.py cola ./models/blushing-dove-984/ ./datasets/glue/CoLA
# mcc = 0.3118300074953714, eval_loss = 0.5854980647563934

python ./nl_dpe/evaluate.py mrpc ./models/spiffy-snake-501/ ./datasets/glue/MRPC
# acc_and_f1 = 0.787219887955182, eval_loss = 0.583455924804394
```
The model.py file provides a number of CLI commands to extract self-attention weights and then update the original model with new weights.
Currently, all models in this repository have the same architecture: 4 encoder layers, 12 attention heads, hidden size 312, and attention head dimension 26 (26 * 12 = 312). One self-attention head can then be indexed with a tuple `[layer, head]`, where `layer` is the 1-based layer index (1, ..., 4 inclusive) and `head` is the 1-based index of the attention head (1, ..., 12 inclusive).
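As a quick illustration of this addressing scheme (plain Python, nothing repository-specific):

```python
# Dimensions of the models in this repository.
num_layers, num_heads, hidden_size = 4, 12, 312
head_dim = hidden_size // num_heads  # 26, since 26 * 12 = 312

# All addressable self-attention heads as 1-based [layer, head] tuples.
heads = [(layer, head) for layer in range(1, num_layers + 1) for head in range(1, num_heads + 1)]
print(head_dim, len(heads))  # 26 48
```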
```bash
# Evaluate this model to get the expected performance.
python ./nl_dpe/evaluate.py cola ./models/blushing-dove-984/ ./datasets/glue/CoLA
# eval_loss = 0.5854980764967023, mcc = 0.3118300074953714

# Save attention weights for the 1st layer and the last attention head.
python ./nl_dpe/model.py extract-attention-weights --model-path=./models/blushing-dove-984/ --output-file=att-matrices.pkl --layer=1 --head=12

# Update the model with the unmodified weights. The model will be saved in a different directory.
python ./nl_dpe/model.py update-attention-weights --model-path=./models/blushing-dove-984/ --weights-file=att-matrices.pkl --head=12 --output-dir=./models/blushing-dove-984-updated

# Evaluate this model to confirm the original performance.
python ./nl_dpe/evaluate.py cola ./models/blushing-dove-984-updated ./datasets/glue/CoLA
# eval_loss = 0.5854980764967023, mcc = 0.3118300074953714

# Now, remove the saved model and add Gaussian noise to the serialized attention weights.
rm -r ./models/blushing-dove-984-updated
python ./nl_dpe/model.py add-noise-to-attention-weights --weights-file=att-matrices.pkl --mean=0.0 --std=0.1

# Update the model again with slightly different attention weights.
python ./nl_dpe/model.py update-attention-weights --model-path=./models/blushing-dove-984/ --weights-file=att-matrices.pkl --head=12 --output-dir=./models/blushing-dove-984-updated

# Confirm that evaluation metrics are different.
python ./nl_dpe/evaluate.py cola ./models/blushing-dove-984-updated ./datasets/glue/CoLA
# eval_loss = 0.59219029545784, mcc = 0.29623405262647357
```
The add-noise-to-attention-weights command above is used only for testing purposes. The file with weights must always have the following properties:
- It must be a pickle file containing a Python dictionary.
- Keys in that dictionary must be valid tensor names in the original model. Users who modify the saved weights file must keep these names as is.
- There must be 6 entries in this dictionary: weights and biases for the K, Q and V matrices. Biases are 1D numpy arrays, while weights are 2D numpy arrays.
- Given the models this repository provides, the weight arrays have shape (26, 26), while the bias arrays have shape (26,).
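For example, a minimal Python sketch of inspecting and modifying such a weights file; the key names printed are whatever tensor names the extract command wrote, and the noise step only mirrors in spirit what add-noise-to-attention-weights does:

```python
import pickle

import numpy as np

# Load a weights file produced by `model.py extract-attention-weights`.
with open("att-matrices.pkl", "rb") as stream:
    weights = pickle.load(stream)

# The file must hold a dictionary with 6 entries: K, Q and V weights and biases,
# keyed by the original model's tensor names (these names must not be changed).
assert isinstance(weights, dict) and len(weights) == 6
for name, array in weights.items():
    # For the models in this repository: 2D (26, 26) weights and 1D (26,) biases.
    print(name, array.shape, array.dtype)

# A user modification that keeps the required structure: perturb every array and
# write the file back under the same keys.
noisy = {
    name: array + np.random.normal(0.0, 0.1, size=array.shape).astype(array.dtype)
    for name, array in weights.items()
}
with open("att-matrices.pkl", "wb") as stream:
    pickle.dump(noisy, stream)
```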
The inference code in this project supports hybrid inference, where different layers of the model run on different compute devices. Concretely, the inference code supports computing the K, Q and V weights of one self-attention head on an external device, while the majority of the model runs on a CPU in the host system. This can be achieved by following these steps:
- Extract the self-attention weights (W_k, W_q and W_v) using the procedures outlined above.
- Program or transfer these self-attention weights to an external compute device.
- Change the dpe.py implementation. Concretely:
  - Update the `_BERT_LAYER_INDEX` and `_ATTENTION_HEAD_INDEX` variables to match the self-attention weights that have been exported. These variables provide brief doc strings.
  - Implement the `call` method in that file. We provide a placeholder that demonstrates one possible implementation.
- Run model inference using the CLI described above. The inference code will detect whether this functionality is enabled and for which self-attention head. It will then call the `dpe.py::call` function and use the three matrices it returns instead of those computed by the model itself. For implementation details on how the `call` method is invoked during inference, see the `BertSelfAttention::forward` method in the modelling.py file. The relevant code segment starts with the `if dpe.enabled(self.bert_layer_index)` line.
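For orientation only, here is a minimal sketch of what such a `call` implementation could look like. The actual signature and configuration live in the dpe.py placeholder; the parameter names, the input shape, and the assumption that the host emulates the external device are all illustrative:

```python
from typing import Tuple

import numpy as np

# Hypothetical per-head projection parameters. In practice they come from the exported
# weights file (or are programmed into the external device); for the models in this
# repository the weights are 2D (26, 26) arrays and the biases are 1D (26,) arrays.
w_k, b_k = np.zeros((26, 26), np.float32), np.zeros((26,), np.float32)
w_q, b_q = np.zeros((26, 26), np.float32), np.zeros((26,), np.float32)
w_v, b_v = np.zeros((26, 26), np.float32), np.zeros((26,), np.float32)


def call(hidden_states: np.ndarray) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    """Return the K, Q and V matrices for the selected self-attention head.

    In a real deployment this is where the host would hand `hidden_states` to the
    external device and collect the three matrices it computes; here they are
    computed on the host as plain affine projections.
    """
    k = hidden_states @ w_k + b_k
    q = hidden_states @ w_q + b_q
    v = hidden_states @ w_v + b_v
    return k, q, v


# Example shape check with random inputs (a sequence of 8 token vectors of width 26).
k, q, v = call(np.random.randn(8, 26).astype(np.float32))
print(k.shape, q.shape, v.shape)  # (8, 26) (8, 26) (8, 26)
```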