
Avey-B


This repository contains the official implementation, pretraining code, and evaluation scripts for Avey-B, as presented in the paper "Avey-B" (ICLR 2026).

Abstract: Compact pretrained bidirectional encoders remain the backbone of industrial NLP under tight compute and memory budgets. Their effectiveness stems from self-attention’s ability to deliver high-quality bidirectional contextualization with sequence-level parallelism, as popularized by BERT-style architectures. Recently, Avey was introduced as an autoregressive, attention-free alternative that naturally admits an encoder-only adaptation. In this paper, we reformulate Avey for the encoder-only paradigm and propose several innovations to its architecture, including decoupled static and dynamic parameterizations, stability-oriented normalization, and neural compression. Results show that this reformulated architecture compares favorably to four widely used Transformer-based encoders, consistently outperforming them on standard token-classification and information-retrieval benchmarks while scaling more efficiently to long contexts.


Repository Structure

.
├── avey_b/              # Core implementation of the Avey-B model architecture
├── EncodEval/           # Evaluation framework (SC, TC, QA, IR benchmarks)
├── EncodEval/neobert/   # Custom implementations for NeoBERT baseline comparisons
├── bench_latency.py     # Script for benchmarking inference latency
├── bench_throughput.py  # Script for benchmarking training/inference throughput
├── setup.sh             # Environment setup script
├── train.sh             # Training launcher script
├── train_mlm.py         # Masked Language Modeling (MLM) pretraining script
└── pyproject.toml       # Dependency management via uv

Setup & Installation

The codebase is tested on Ubuntu 22.04 using NVIDIA A100 and H100 GPUs. Python environments are managed using uv for strict reproducibility.

  1. Clone the Repository:

    git clone https://github.com/rimads/avey-b
    cd avey-b
  2. Initialize Environment: The provided setup.sh script installs system dependencies (including awscli), installs uv, and syncs the Python environment defined in pyproject.toml.

    bash setup.sh
  3. Activate Environment:

    source .venv/bin/activate

Pre-training

We provide scripts to pretrain Avey-B from scratch using the Masked Language Modeling (MLM) objective. Note that running pre-training will download the dataset specified in dataloader.py (sample-10BT from HuggingFaceFW/fineweb by default).
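
For reference, the default dataset load in dataloader.py presumably resembles the following datasets call (a sketch, not the repository's exact code); editing the dataset ID or config name changes the pretraining corpus:

    from datasets import load_dataset

    # Stream the sample-10BT subset of FineWeb, the default pretraining corpus.
    # Swap the dataset ID or config name to pretrain on different data.
    dataset = load_dataset(
        "HuggingFaceFW/fineweb",
        name="sample-10BT",
        split="train",
        streaming=True,  # avoids materializing the full dump on disk
    )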

  1. Configuration:

    • Either log in to wandb with wandb login or disable it with wandb disabled.
    • Logging in to Hugging Face with hf auth login is recommended, to avoid rate-limit errors while downloading the datasets.
  2. Model Config: Adjust model hyperparameters inside train_mlm.py (approx. line 242) if needed; a sketch of this block follows this list.

  3. Launch Training: Use train.sh to automatically detect available GPUs and launch the training run. You can control the per-device batch size via environment variables.

    # Example: Set batch size to 16 (fits on 80GB VRAM)
    export BATCH_SIZE=16
    bash train.sh

    Note: train.sh handles single-node multi-GPU setups. For multi-node training, invoke torchrun manually with the appropriate rendezvous arguments.
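
As an illustration, the model configuration block in train_mlm.py might look like the following (the field names here are hypothetical placeholders; check the actual definitions around line 242):

    # Hypothetical sketch of the hyperparameter block in train_mlm.py;
    # the real field names and default values may differ.
    model_config = dict(
        hidden_size=768,   # width of token embeddings
        num_layers=12,     # depth of the encoder stack
        max_seq_len=512,   # pretraining context length
        mask_prob=0.15,    # fraction of tokens masked for MLM
    )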


Evaluation

Our evaluation framework is adapted from EncodEval.

1. Preparation

Navigate to the EncodEval directory (cd EncodEval) to run the evaluations. If you intend to run the long-range Needle-In-A-Haystack (NIAH) benchmarks, first generate the data:

python gen_niah.py

2. Running Benchmarks

  1. Open EncodEval/run.py and specify the following (a configuration sketch follows this list):

    • model_name: Local path or HuggingFace ID (e.g., google-bert/bert-base-uncased).
    • learning_rates: List of LRs to sweep.
    • Benchmarks and random seeds.
  2. Ensure YAML configurations for your chosen benchmarks, learning rates, and seeds exist in EncodEval/configs (configs for the values already specified in run.py are provided).

  3. Run the evaluation:

    python run.py

    run.py will automatically schedule the benchmarks to run on all GPUs on the machine as they become available.

  4. Print the results:

    python print_results.py

    The model name and learning rates will need to be specified inside print_results.py. The script prints results in a format that can be pasted into Google Sheets.
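
For illustration, the edits in step 1 might look like this near the top of run.py (model_name and learning_rates are the documented settings; the remaining variable names are hypothetical placeholders):

    # Illustrative run.py configuration sketch; only model_name and
    # learning_rates are documented knobs, the rest are placeholders.
    model_name = "google-bert/bert-base-uncased"  # local path or HuggingFace ID
    learning_rates = [1e-5, 3e-5, 5e-5]           # LRs to sweep
    benchmarks = ["TC", "IR"]                     # hypothetical: subset of SC/TC/QA/IR
    seeds = [0, 1, 2]                             # hypothetical: random seeds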

NeoBERT Specifics

If evaluating NeoBERT, its custom token-classification implementations must be placed inside the model directory:

  1. Download the NeoBERT model.
  2. Move the files from EncodEval/neobert/ (in this repo) into your downloaded NeoBERT model directory.
  3. Point model_name in EncodEval/run.py to this local directory.
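
These steps can be scripted, for example as follows (the NeoBERT path below is a placeholder for wherever you downloaded the checkpoint):

    import shutil
    from pathlib import Path

    overrides = Path("EncodEval/neobert")    # custom files shipped in this repo
    neobert_dir = Path("/path/to/NeoBERT")   # placeholder: your local checkpoint

    # Copy the token-classification overrides into the model directory.
    for f in overrides.iterdir():
        if f.is_file():
            shutil.copy2(f, neobert_dir / f.name)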

Efficiency Benchmarks

To reproduce the efficiency plots (throughput and latency) found in the paper:

# Generate throughput data
python bench_throughput.py

# Generate latency data
python bench_latency.py

To benchmark the unoptimized version of Avey-B, remove the @torch.compile decorator from the implementation (see the sketch below). To test the optimized versions of the other models, flash-attention must be installed.
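
Concretely, the toggle is just the decorator; a minimal self-contained illustration (the decorated function below is hypothetical, not the repository's actual code):

    import torch

    @torch.compile  # comment out this line to benchmark the eager, unoptimized path
    def fused_block(x: torch.Tensor) -> torch.Tensor:
        # Placeholder computation standing in for an Avey-B block.
        return torch.nn.functional.gelu(x) * x

    y = fused_block(torch.randn(8, 16))  # first call triggers compilation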

Note on NeoBERT Efficiency Testing: To test NeoBERT beyond its training window (solely for efficiency measurements), you must manually override its config:

  1. Download the NeoBERT checkpoint.
  2. Modify config.json: Set max_length to a large value (e.g., 100000); see the snippet after this list.
  3. Update the benchmarking scripts to point to this modified local checkpoint.
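
The step-2 override can be applied with a few lines of Python (the checkpoint path is a placeholder):

    import json
    from pathlib import Path

    config_path = Path("/path/to/NeoBERT/config.json")  # placeholder path
    config = json.loads(config_path.read_text())
    config["max_length"] = 100000  # extend well beyond the training window
    config_path.write_text(json.dumps(config, indent=2))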

Citation

If you use Avey-B or this codebase in your research, please cite our paper:

@inproceedings{2026aveyb,
  title={Avey-B},
  author={Acharya, Devang and Hammoud, Mohammad},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026}
}