Lil Bevo — UT Austin's submission to BabyLM Challenge

This repository contains code and instructions to build Lil Bevo, UT Austin's submission to the BabyLM Challenge.

Python Environment

Install the latest version of Miniconda.

To recreate the exact Python environment configuration in conda, run the following commands in order:

conda create -n bevo pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch-nightly -c nvidia jupyter pandas numpy matplotlib scikit-learn tqdm
conda activate bevo
pip install git+https://github.com/huggingface/transformers wandb ipdb datasets sentencepiece evaluate pytest accelerate mido
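
After installation, a quick import check can catch a partially built environment. The following is a minimal sketch and not part of this repository: the helper name missing_packages is hypothetical, and the import names are assumed from the package lists above (for example, scikit-learn imports as sklearn and the conda pytorch package as torch).

```python
from importlib.util import find_spec

# Import names assumed from the conda and pip commands above.
EXPECTED = [
    "torch", "torchvision", "torchaudio", "pandas", "numpy", "matplotlib",
    "sklearn", "tqdm", "transformers", "wandb", "ipdb", "datasets",
    "sentencepiece", "evaluate", "pytest", "accelerate", "mido",
]


def missing_packages(names):
    """Return the top-level module names that cannot be found (hypothetical helper)."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(EXPECTED)
    print("missing:", ", ".join(missing) if missing else "none")
```

Run it inside the activated bevo environment; any name it prints needs to be reinstalled.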

Scripts

training_bevo.py takes as an argument any encoder-style LM on the Hugging Face Hub and trains it on BabyLM data. First, concatenate all the training files into one text file to pass as input to this script (cat babylm_data/babylm_10M/*.train > train.txt), and do the same for the dev files. Then set the WANDB_PROJECT environment variable to lil-bevo and run the script:

export WANDB_PROJECT="lil-bevo"
python training_bevo.py \
    --config_name microsoft/deberta-v3-small \
    --tokenizer_name_or_path tokenizers/10m_maestro/ \
    --train_file babylm_data/maestro/all-10M.txt \
    --validation_file babylm_data/babylm_dev/dev.txt \
    --per_device_train_batch_size 770 \
    --per_device_eval_batch_size 128 \
    --do_train \
    --num_train_epochs 5 \
    --do_eval \
    --save_strategy epoch \
    --optim adamw_torch_fused \
    --warmup_ratio=0.0001 \
    --weight_decay 0.1 \
    --log_level error \
    --learning_rate 5e-4 \
    --evaluation_strategy steps \
    --eval_steps 500 \
    --output_dir deberta-small/redux/ \
    --logging_steps 10 \
    --save_total_limit 1 \
    --overwrite_output_dir \
    --torch_compile True \
    --disable_tqdm False \
    --max_seq_length 32 \
    --report_to wandb
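
If cat is not available (for example on Windows), the concatenation step can be done in Python. This is a minimal sketch, not a script from this repository; the function name concatenate_files is hypothetical. Unlike cat, it adds a newline after any file that lacks a trailing one, so the last line of one file is not merged with the first line of the next.

```python
from pathlib import Path


def concatenate_files(input_dir, pattern, output_path):
    """Concatenate files matching `pattern` in `input_dir`, in sorted order.

    Returns the number of files written. Hypothetical helper; keep
    `output_path` from matching `pattern` inside `input_dir`.
    """
    paths = sorted(Path(input_dir).glob(pattern))
    with open(output_path, "w", encoding="utf-8") as out:
        for path in paths:
            text = path.read_text(encoding="utf-8")
            out.write(text)
            # Make sure each file ends with a newline before the next one starts.
            if text and not text.endswith("\n"):
                out.write("\n")
    return len(paths)
```

For example, concatenate_files("babylm_data/babylm_10M", "*.train", "train.txt") mirrors the cat command above. Each file is read fully into memory, which should be acceptable for corpora of this size.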

Evaluation

To set up the evaluation pipeline as the BabyLM repo instructs, but in a separate conda environment:

git clone https://github.com/babylm/evaluation-pipeline
cd evaluation-pipeline
conda create -n babyeval python==3.9 pip git-lfs
conda activate babyeval
pip install --no-build-isolation -e ".[dev]"
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113 sentencepiece

Models

We trained two models, one for the strict-small track and another for the strict track:

Lil-Bevo is based on a deberta-v3-small model and has 55M parameters with a vocab size of 16640.
Lil-Bevo-X is based on a deberta-v3-base model and has 112M parameters with a vocab size of 33280.

We also pretrained some ablation models; details are in our paper. You can find all of our models in this Hugging Face collection.

Training Regime for Lil-Bevo

5 epochs on the MAESTRO dataset (85M non-language music tokens) combined with the strict-small dataset.
50 epochs of pretraining with sequence length of 128 on strict-small dataset.
2 epochs of targeted MLM.

Training Regime for Lil-Bevo-X

5 epochs on the MAESTRO dataset (85M non-language music tokens) combined with the strict-small dataset.
50 epochs of pretraining with sequence length of 128 on strict dataset.
150 epochs of pretraining with sequence length of 512 on strict dataset.
10 epochs of targeted MLM.

Please read our paper for more details on our training regime and the reasoning behind these decisions.
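
For scripting or bookkeeping, the two regimes listed above can be written down as data. This is only an illustrative sketch: the Stage class and total_epochs helper are hypothetical and do not appear in training_bevo.py, and max_seq_length is left as None wherever the lists above give no value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stage:
    """One training stage (hypothetical structure, for illustration only)."""
    description: str
    epochs: int
    max_seq_length: Optional[int] = None  # None where no value is stated above


LIL_BEVO = [
    Stage("MAESTRO music tokens + strict-small", 5),
    Stage("pretraining on strict-small", 50, 128),
    Stage("targeted MLM", 2),
]

LIL_BEVO_X = [
    Stage("MAESTRO music tokens + strict-small", 5),
    Stage("pretraining on strict", 50, 128),
    Stage("pretraining on strict", 150, 512),
    Stage("targeted MLM", 10),
]


def total_epochs(stages):
    """Sum the epoch counts over all stages of a regime."""
    return sum(stage.epochs for stage in stages)


if __name__ == "__main__":
    print("Lil-Bevo:", total_epochs(LIL_BEVO), "epochs")
    print("Lil-Bevo-X:", total_epochs(LIL_BEVO_X), "epochs")
```

Summed this way, Lil-Bevo sees 57 epochs in total and Lil-Bevo-X sees 215, although the stages use different data and sequence lengths, so the totals are not directly comparable.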

Results

DynaBench

Model	Score
Lil-Bevo	0.64
Lil-Bevo-X	0.69

BLiMP

Model	Anaphor Agr.	Agr. Structure	Binding	Control/Raising	D-N Agr.	Ellipsis	Filler-Gap	Irregular Forms	Island Effects	NPI Licensing	Quantifiers	S-V Agr.
Lil-Bevo	90.9	72.5	63.3	70.0	91.7	82.0	77.5	85.3	55.8	78.5	68.7	84.8
Lil-Bevo-X	97.2	80.6	63.9	69.5	96.4	87.0	78.4	89.2	71.4	85.6	63.2	86.3

BLiMP Supplement

Model	Hypernym	QA Congruence (easy)	QA Congruence (tricky)	Subj.-Aux. Inversion	Turn Taking
Lil-Bevo	48.1	82.8	57.0	76.5	68.2
Lil-Bevo-X	45.2	75.0	63.6	81.4	78.2

(Super)GLUE

Model	CoLA	SST-2	MRPC (F1)	QQP (F1)	MNLI	MNLI-mm	QNLI	RTE	BoolQ	MultiRC	WSC
Lil-Bevo	73.7	88.4	82.2	85.5	75.4	76.3	81.6	46.5	65.4	66.0	61.5
Lil-Bevo-X	76.5	88.8	82.6	86.4	77.7	79.0	83.6	49.5	68.0	65.6	61.4

MSGS

Model	CR (Control)	CR_LC	CR_RTP	LC (Control)	MV (Control)	MV_LC	MV_RTP	RP (Control)	SC (Control)	SC_LC	SC_RP
Lil-Bevo	91.9	66.6	67.4	100.0	99.8	75.7	78.0	93.8	91.5	65.7	64.2
Lil-Bevo-X	92.5	66.5	68.5	100.0	100.0	66.7	68.5	99.1	90.0	68.2	64.7

Age-of-acquisition Prediction (Mean absolute deviation in months across LOO cross-validation folds)

Model	Overall (591 words)	Nouns (322)	Predicates (167)	Function words (102)
Lil-Bevo	2.06	2.0	1.84	2.65
Lil-Bevo-X	2.05	1.99	1.85	2.59
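
To make the metric in the table above concrete, here is a minimal sketch of mean absolute deviation under leave-one-out (LOO) cross-validation. It is not the evaluation pipeline's code: the function names are hypothetical, the toy ages are made up, and the mean-of-training-targets predictor is only a stand-in for whatever model the pipeline fits on each fold.

```python
def mean_baseline(train_targets):
    """Stand-in predictor (an assumption): predict the mean of the training targets."""
    return sum(train_targets) / len(train_targets)


def loo_mean_absolute_deviation(targets, predict=mean_baseline):
    """Hold out each item once, predict it from the rest, and average the absolute errors."""
    errors = []
    for i, held_out in enumerate(targets):
        train = targets[:i] + targets[i + 1:]
        errors.append(abs(predict(train) - held_out))
    return sum(errors) / len(errors)


if __name__ == "__main__":
    # Toy ages of acquisition in months (made-up values, not from the benchmark).
    toy_ages = [20.0, 22.0, 24.0]
    print(loo_mean_absolute_deviation(toy_ages))
```

On the toy list the per-fold errors are 3, 0, and 3 months, giving a mean absolute deviation of 2.0 months. Lower values are better.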
