Simplicity Bias in Transformers

Simplicity Bias in Transformers and their Ability to Learn Sparse Boolean Functions

Despite the widespread success of Transformers on NLP tasks, recent works have found that they struggle to model several formal languages when compared to recurrent models. This raises the question of why Transformers perform well in practice and whether they have any properties that enable them to generalize better than recurrent models. In this work, we conduct an extensive empirical study on Boolean functions to demonstrate the following: (i) Random Transformers are relatively more biased towards functions of low sensitivity. (ii) When trained on Boolean functions, both Transformers and LSTMs prioritize learning functions of low sensitivity, with Transformers ultimately converging to functions of lower sensitivity. (iii) On sparse Boolean functions which have low sensitivity, we find that Transformers generalize near perfectly even in the presence of noisy labels whereas LSTMs overfit and achieve poor generalization accuracy. Overall, our results provide strong quantifiable evidence that suggests differences in the inductive biases of Transformers and recurrent models which may help explain Transformers' effective generalization performance despite relatively limited expressiveness.
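For reference, the sensitivity of a Boolean function f at an input x is the number of coordinates whose flip changes f(x); its average over uniformly random inputs is the low-sensitivity notion referred to in the abstract above. The following is a small illustrative sketch (ours, not the paper's code) of estimating average sensitivity by sampling:

import random

def avg_sensitivity(f, n, num_samples=1000):
    # Monte Carlo estimate of the average sensitivity of a Boolean
    # function f: {0,1}^n -> {0,1}: for each sampled input, count how
    # many single-bit flips change the output, then average.
    total = 0
    for _ in range(num_samples):
        x = [random.randint(0, 1) for _ in range(n)]
        y = f(x)
        for i in range(n):
            x[i] ^= 1            # flip bit i
            if f(x) != y:
                total += 1
            x[i] ^= 1            # restore bit i
    return total / num_samples

# A parity over k fixed bits (a sparse parity) has sensitivity k at every
# input, while the full parity over all n bits has the maximal value n.
n, k = 20, 3
print(avg_sensitivity(lambda x: sum(x[:k]) % 2, n))   # ~= 3.0
print(avg_sensitivity(lambda x: sum(x) % 2, n))       # ~= 20.0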

...

Dependencies

  • Compatible with Python 3
  • Dependencies can be installed using Transformer-Simplicity/requirements.txt

Setup

Install VirtualEnv using the following (optional):

$ [sudo] pip install virtualenv

Create and activate your virtual environment (optional):

$ virtualenv -p python3 venv
$ source venv/bin/activate

Install all the required packages:

At Transformer-Simplicity/:

$ pip install -r requirements.txt

Models

The current repository includes 4 directories implementing different models and settings:

  • Training Transformers on Boolean functions: Transformer-Simplicity/FLTAtt
  • Training LSTMs on Boolean functions: Transformer-Simplicity/FLTClassifier
  • Experiments with Random Transformers: Transformer-Simplicity/RandFLTAtt
  • Experiments with Random LSTMs: Transformer-Simplicity/RandFLTClassifier

Usage

The available command-line arguments are listed in the respective args.py file. Here, we illustrate training a Transformer on sparse parities; follow the same procedure to run any experiment with LSTMs.

At Transformer-Simplicity/FLTAtt:

$	python -m src.main -mode train -gpu 0 -dataset sparity40_5k -run_name trafo_sparity_40_5k -depth 4 -lr 0.001
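The dataset name sparity40_5k presumably encodes sparse parity over length-40 inputs with 5k training examples; the exact files and format are determined by the repository's data loading code. As a purely hypothetical illustration of the underlying task (not the repo's data pipeline), a sparse (n, k)-parity labels each n-bit string by the parity of a fixed set of k relevant bits:

import random

# Hypothetical sketch of a sparse (n, k)-parity dataset; the actual data
# used by -dataset sparity40_5k lives in the repository.
n, k, num_examples = 40, 3, 5000
relevant = sorted(random.sample(range(n), k))    # the k "important" coordinates

data = []
for _ in range(num_examples):
    bits = [random.randint(0, 1) for _ in range(n)]
    label = sum(bits[i] for i in relevant) % 2   # parity of the relevant bits only
    data.append(("".join(map(str, bits)), label))

print(relevant, data[0])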

To compute the sensitivity of randomly initialized Transformers, run the following at Transformer-Simplicity/RandFLTAtt:

$	python rand_sensi.py -gpu 0 -sample_size 1000 -len 20 -trials 100
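The exact procedure and the meaning of these flags are defined in rand_sensi.py. Conceptually, the sensitivity of a randomly initialized model can be estimated by sampling random binary strings, flipping one position at a time, and counting how often the predicted label changes. A minimal PyTorch-style sketch with a placeholder network (not the repository's architecture):

import torch
import torch.nn as nn

# Sketch of sensitivity estimation for a randomly initialized classifier;
# the tiny Transformer below is only a stand-in for the repo's model.
seq_len, sample_size, d_model = 20, 1000, 32

embed = nn.Embedding(2, d_model)
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
head = nn.Linear(d_model, 1)

def predict(bits):
    # bits: LongTensor of shape (seq_len,) with 0/1 entries -> binary label
    with torch.no_grad():
        h = encoder(embed(bits.unsqueeze(0)))      # (1, seq_len, d_model)
        return int(head(h.mean(dim=1)).item() > 0)

total_flips = 0
for _ in range(sample_size):
    x = torch.randint(0, 2, (seq_len,))
    y = predict(x)
    for i in range(seq_len):
        x_flip = x.clone()
        x_flip[i] = 1 - x_flip[i]                  # flip position i
        if predict(x_flip) != y:
            total_flips += 1

print("estimated average sensitivity:", total_flips / sample_size)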

Citation

If you use our data or code, please cite our work:

@inproceedings{bhattamishra-etal-2023-simplicity,
    title = "Simplicity Bias in Transformers and their Ability to Learn Sparse {B}oolean Functions",
    author = "Bhattamishra, Satwik  and
      Patel, Arkil  and
      Kanade, Varun  and
      Blunsom, Phil",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.317",
    pages = "5767--5791",
}

For any clarification, comments, or suggestions, please contact Satwik or Arkil.
