llm-behavior-survey

Webpage for the paper: Language Model Behavior: A Comprehensive Survey (Computational Linguistics, 2024).

See the paper for discussion and details on each section. Links to citations are included below.

Abstract

Transformer language models have received widespread public attention, yet their generated text is often surprising even to NLP researchers. In this survey, we discuss over 250 recent studies of English language model behavior before task-specific fine-tuning. Language models possess basic capabilities in syntax, semantics, pragmatics, world knowledge, and reasoning, but these capabilities are sensitive to specific inputs and surface features. Despite dramatic increases in generated text quality as models scale to hundreds of billions of parameters, the models are still prone to unfactual responses, commonsense errors, memorized text, and social biases. Many of these weaknesses can be framed as over-generalizations or under-generalizations of learned patterns in text. We synthesize recent results to highlight what is currently known about large language model capabilities, thus providing a resource for applied work and for research in adjacent fields that use language models.

Contents

  • Transformer language models
  • Syntax
  • Semantics and pragmatics
  • Commonsense and world knowledge
  • Logical and numerical reasoning
  • Memorized vs. novel text
  • Bias, privacy, and toxicity
  • Misinformation, personality, and politics

Citation

@article{chang-bergen-2024-language,
  title={Language Model Behavior: A Comprehensive Survey},
  author={Tyler A. Chang and Benjamin K. Bergen},
  journal={Computational Linguistics},
  year={2024},
  url={https://arxiv.org/abs/2303.11504},
}

Transformer Language Models

Architectures

The basic Transformer language model architecture has remained largely unchanged since 2018 (Radford et al., 2018; Devlin et al., 2019). First, an input text string is converted into a sequence of tokens, roughly corresponding to words. Each token is mapped to a fixed vector "embedding", learned during pre-training. The sequence of embeddings is passed through a stack of Transformer layers that essentially mix the embeddings between tokens (using "self-attention"; Vaswani et al., 2017). This mixing results in a "contextualized" vector representation for each token (e.g. a representation for the word "dog" in the context "I saw a dog"). Finally, after the stack of Transformer layers, each output token representation is projected into a distribution over the same token vocabulary used in the input. In other words, the overall architecture maps each input token to a probability distribution over output tokens (e.g. the next token).
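
As a rough, self-contained sketch of this pipeline (illustrative only, not code from the paper or this repository; all layer sizes and names are placeholder assumptions), a minimal autoregressive Transformer language model can be written in PyTorch as:

import torch
import torch.nn as nn

class TinyTransformerLM(nn.Module):
    """Minimal sketch: tokens -> embeddings -> Transformer layers (self-attention)
    -> a distribution over the token vocabulary at each position."""

    def __init__(self, vocab_size=50257, d_model=256, n_heads=4, n_layers=2, max_len=512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)    # learned token embeddings
        self.pos_emb = nn.Embedding(max_len, d_model)         # learned position embeddings
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.layers = nn.TransformerEncoder(layer, n_layers)  # stack of Transformer layers
        self.to_vocab = nn.Linear(d_model, vocab_size)        # project back onto the vocabulary

    def forward(self, token_ids):  # token_ids: (batch, seq_len) integer IDs
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.token_emb(token_ids) + self.pos_emb(positions)
        # Causal mask so each position only attends to earlier tokens (autoregressive).
        mask = nn.Transformer.generate_square_subsequent_mask(token_ids.size(1))
        x = self.layers(x, mask=mask)         # self-attention mixes representations between tokens
        return self.to_vocab(x)               # logits, i.e. an unnormalized next-token distribution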

Training

Language modeling refers to predicting tokens (roughly equivalent to words) from context, usually text. Masked and autoregressive language models are "pre-trained" to predict masked (i.e. hidden, fill-in-the-blank) or upcoming tokens, respectively. Popular recent language models (e.g. ChatGPT) are primarily autoregressive: for each input token, the model produces a probability distribution over the next token, which can be used for text generation. These models are trained to maximize the probability of each next token.
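
Concretely (again an illustrative sketch rather than any particular implementation), the autoregressive objective scores the model's predicted distribution at each position against the token that actually comes next, using a cross-entropy loss:

import torch.nn.functional as F

def autoregressive_loss(model, token_ids):
    """Next-token prediction loss for a batch of token IDs of shape (batch, seq_len)."""
    logits = model(token_ids[:, :-1])   # predicted distribution over the next token at each position
    targets = token_ids[:, 1:]          # the tokens that actually come next
    # Minimizing cross-entropy maximizes the probability assigned to each next token.
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))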

Language models are pre-trained using gradient descent, observing many examples of plain text. Due to high computational costs, relatively few language models are pre-trained from scratch, and they are usually trained in industry labs. In practice, most NLP researchers build applications on top of existing pre-trained language models. Recent language models often include further non-task-specific fine-tuning stages, such as additional training on examples that correctly follow instructions ("instruction tuning"; Wei et al., 2022) or reinforcement learning from human feedback ("RLHF"; Ouyang et al., 2022). We focus on non-fine-tuned language models, which still serve as the foundation for more recent language models.
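
Pre-training then amounts to repeated gradient descent steps on this loss over batches of tokenized plain text. A hypothetical loop, using the sketches above, with random token IDs standing in for a real pre-training corpus:

import torch

model = TinyTransformerLM()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Random token IDs stand in for batches of tokenized pre-training text.
corpus_batches = [torch.randint(0, 50257, (8, 128)) for _ in range(10)]

for token_ids in corpus_batches:
    loss = autoregressive_loss(model, token_ids)
    optimizer.zero_grad()
    loss.backward()    # gradient descent on the language modeling objective
    optimizer.step()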

Syntax

Language models generally produce grammatical text, adhering to a wide variety of syntactic rules.

Citations: Warstadt et al. (2020); Hu et al. (2020); Gauthier et al. (2020); Park et al. (2021); Wilcox et al. (2022); Hu et al. (2020); Warstadt et al. (2019); Lee and Schuster (2022); Perez-Mayos et al. (2021); Mahowald (2023); Zhang et al. (2022).

They learn subject-verb agreement, but they are sensitive to intervening clauses and specific words.

Citations: van Schijndel et al. (2019); Goldberg (2019); Bacon and Regier (2019); Ryu and Lewis (2021); Lakretz et al. (2022); Lampinen (2022); Yu et al. (2020); Chaves and Richter (2021); Newman et al. (2021); Wei et al. (2021); Lasri et al. (2022); Lasri et al. (2022).

They learn syntactic rules early in pre-training.

Citations: Liu et al. (2021); Zhang et al. (2021); Huebner et al. (2021); Choshen et al. (2022); Misra (2022); Chang and Bergen (2022).

They can learn word order without explicit position information, but word order is not necessary in many examples.

Citations: Sinha et al. (2021); Abdou et al. (2022); Haviv et al. (2022); Chang et al. (2021); Lasri et al. (2022); Wettig et al. (2023); Malkin et al. (2021); Sinha et al. (2022).

Semantics and Pragmatics

Language models learn semantic and compositional properties of individual words, including argument structure, synonyms, and hypernyms (i.e. lexical semantics).

Citations: Senel and Schutze (2021); Hanna and Marecek (2021); Ravichander et al. (2020); Misra et al. (2021); Arefyev et al. (2020); Warstadt et al. (2020); Davis and van Schijndel (2020); Upadhye et al. (2020); Kementchedjhieva et al. (2021); Huynh et al. (2022); Hawkins et al. (2020).

They struggle with negation, often performing worse as models scale.

Citations: Ettinger (2020); Kassner and Schutze (2020); Michaelov and Bergen (2022); Gubelmann and Handschuh (2022); Jang et al. (2022).

They construct coherent but brittle situation models.

Citations: Schuster and Linzen (2022); Pandit and Hou (2021); Zhang et al. (2023); Summers-Stay et al. (2021).

They recognize basic analogies, metaphors, and figurative language.

Citations: Pedinotti et al. (2021); Griciute et al. (2022); Comsa et al. (2022); Liu et al. (2022); He et al. (2022); Ushio et al. (2021); Czinczoll et al. (2022); Bhavya et al. (2022); Weissweiler et al. (2022).

They can infer the mental states of characters in text.

Citations: Summers-Stay et al. (2021); Sap et al. (2022); Lal et al. (2022); Hu et al. (2022); Trott et al. (2022); Masis and Anderson (2021).

They struggle with implied meaning and pragmatics.

Citations: Beyer et al. (2021); Ruis et al. (2022); Cong (2022); Kabbara and Cheung (2022); Kim et al. (2022).

Commonsense and World Knowledge

Language models learn facts and commonsense properties of objects, particularly as models scale.

Citations: Davison et al. (2019); Petroni et al. (2019); Penha and Hauff (2020); Jiang et al. (2020); Adolphs et al. (2021); Kalo and Fichtel (2022); Lin et al. (2020); Peng et al. (2022); Misra et al. (2023); Sahu et al. (2022); Kadavath et al. (2022).

They are less sensitive than people to physical properties.

Citations: Apidianaki and Gari Soler (2021); Weir et al. (2020); Paik et al. (2021); Liu et al. (2022); Shi and Wolff (2021); De Bruyn et al. (2022); Jiang and Riloff (2021); Jones et al. (2022); Stevenson et al. (2022).

Learned facts are sensitive to context.

Citations: Elazar et al. (2021); Cao et al. (2022); Podkorytov et al. (2021); Cao et al. (2021); Kwon et al. (2019); Beloucif and Biemann (2021); Lin et al. (2020); Poerner et al. (2019); Pandia and Ettinger (2021); Kassner and Schutze (2020); Elazar et al. (2022).

Learned facts are also sensitive to a fact's frequency in the pre-training corpus.

Citations: Kassner et al. (2020); Kandpal et al. (2022); Mallen et al. (2022); Romero and Razniewski (2022).

Factual knowledge continues to evolve late in pre-training.

Citations: Chiang et al. (2020); Swamy et al. (2021); Liu et al. (2021); Zhang et al. (2021); Porada et al. (2022); Misra et al. (2023).

Language models have a limited but nontrivial ability to make commonsense inferences about actions and events.

Citations: Cho et al. (2021); Shwartz and Choi (2020); Beyer et al. (2021); Kauf et al. (2022); Qin et al. (2021); Zhao et al. (2021); Li et al. (2022); Stammbach et al. (2022); Jin et al. (2022); Tamborrino et al. (2020); Misra (2022); Pandia et al. (2021); Ko and Li (2020); Lee et al. (2021); Pedinotti et al. (2021); Li et al. (2022); Zhou et al. (2021); Sancheti and Rudinger (2022); Aroca-Ouellette et al. (2021); Jones and Bergen (2021).

Logical and Numerical Reasoning

Large language models can perform basic logical reasoning when prompted.

Citations: Wei et al. (2022); Suzgun et al. (2022); Lampinen et al. (2022); Webb et al. (2022); Han et al. (2022); Kojima et al. (2022); Wang et al. (2022); Min et al. (2022).

They still struggle with complex reasoning.

Citations: Saparov and He (2023); Valmeekam et al. (2022); Press et al. (2022); Katz et al. (2022); Betz et al. (2021); Dasgupta et al. (2022).

They exhibit basic numerical and probabilistic reasoning abilities, but their performance depends on the specific inputs.

Citations: Brown et al. (2020); Wang et al. (2021); Wallace et al. (2019); Jiang et al. (2020); Fujisawa and Kanai (2022); Razeghi et al. (2022); Stolfo et al. (2022); Shi et al. (2023); Hagendorff et al. (2022); Hendrycks et al. (2021); Binz and Schulz (2023).

Memorized vs. Novel Text

As language models scale, they are more likely to generate memorized text from the pre-training corpus.

Citations: Carlini et al. (2021); Lee et al. (2022); Carlini et al. (2023); Kandpal et al. (2022); Hernandez et al. (2022); Lee et al. (2023); Ippolito et al. (2022); Tirumala et al. (2022); Kharitonov et al. (2021).

They generate novel text that is consistent with the input context.

Citations: Tuckute et al. (2022); McCoy et al. (2021); Meister and Cotterell (2021); Chiang and Chen (2021); Massarelli et al. (2020); Cifka and Liutkus (2022); Dou et al. (2022); Sinclair et al. (2022); Sinha et al. (2022); Aina and Linzen (2021); Reif et al. (2022); O'Connor and Andreas (2021); Misra et al. (2020); Michaelov and Bergen (2022); Armeni et al. (2022).

Bias, Privacy, and Toxicity

Language models sometimes generate offensive text and hate speech, particularly in response to targeted prompts.

Citations: Ganguli et al. (2022); Gehman et al. (2020); Wallace et al. (2019); Heidenreich and Williams (2021); Mehrabi et al. (2022); Perez et al. (2022).

They can expose private information, but it is often not tied to specific individuals.

Citations: Ganguli et al. (2022); Perez et al. (2022); Huang et al. (2022); Lehman et al. (2021); Shwartz et al. (2020).

Language model performance varies across demographic groups.

Citations: Smith et al. (2022); Brandl et al. (2022); Zhang et al. (2021); Groenwold et al. (2020); Zhou et al. (2022).

Probabilities of toxic text vary across demographic groups.

Citations: Hassan et al. (2021); Ousidhoum et al. (2021); Nozza et al. (2022); Sheng et al. (2019); Magee et al. (2021); Dhamala et al. (2021); Sheng et al. (2021); Akyurek et al. (2022); Kurita et al. (2019); Silva et al. (2021).

Language models reflect harmful group-specific stereotypes based on gender, sexuality, race, religion, and other demographic identities.

Citations: Nangia et al. (2020); Kurita et al. (2019); Choenni et al. (2021); Nadeem et al. (2021); Nozza et al. (2021); Felkner et al. (2022); Abid et al. (2021); Kirk et al. (2021); Bartl et al. (2020); de Vassimon Manela et al. (2021); Touileb (2022); Alnegheimish et al. (2022); Tal et al. (2022); Srivastava et al. (2022); Tang and Jiang (2022); Seshadri et al. (2022); Mattern et al. (2022); Akyurek et al. (2022); Shaikh et al. (2022).

Misinformation, Personality, and Politics

Language models can generate convincing unfactual text.

Citations: Levy et al. (2021); Lin et al. (2022); Rae et al. (2021); Raj et al. (2022); Heidenreich and Williams (2021); Spitale et al. (2023); Chen et al. (2022).

They can generate unsafe advice.

Citations: Zellers et al. (2021); Chuang and Yang (2022); Levy et al. (2022); Jin et al. (2022).

Model-generated text is difficult to distinguish from human-generated text.

Citations: Brown et al. (2020); Wahle et al. (2022); Spitale et al. (2023); Ippolito et al. (2020); Clark et al. (2021); Dugan et al. (2023); Jakesch et al. (2023); Jawahar et al. (2020); Wahle et al. (2022).

Language model "personality" and politics depend on the input context.

Citations: Perez et al. (2022); Simmons (2022); Argyle et al. (2023); Liu et al. (2022); Johnson et al. (2022); Bang et al. (2021); Sheng et al. (2021); Patel and Pavlick (2021); Chen et al. (2022); Caron and Srivastava (2022); Jiang et al. (2022); Li et al. (2022); Miotto et al. (2022); Aher et al. (2022).

Discussion

See full paper for discussion on the effects of scale (model size), language modeling as generalization, and levels of analysis in understanding language models.
