
1. Best practices for corpus building

Shelley Staples edited this page Sep 2, 2022 · 17 revisions


What is a corpus?

A corpus (plural: corpora) is a principled collection of texts (e.g., news articles, academic research articles, conversations) that is stored electronically. It is not a random collection: it is designed to represent the type of language you want to explore. Put very simply, a corpus is a collection of language that is used to describe some aspect of language use or to answer a question about it. For example, we can find out:

  • Which verbs and tenses are most common when presenting an argument vs. telling a story
  • Whether the use of adjectives varies across genders
  • Which transition words are most common in different genres

What kinds of corpora are there?

There are two types of corpora:

  1. General corpora that represent a general aspect of language use (e.g., COCA)
  2. Specific corpora that represent a particular slice of language use (e.g., Crow)

There are also two main ways to access corpora:

  1. Online/web-based interfaces (Crow; MICUSP; COCA), where the corpus is hosted online and explored with tools provided on the website
  2. Offline corpora that you store on your own computer and explore with corpus software or with your own scripts

Corpora can include both spoken and written language.
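If you work with an offline corpus and your own scripts, a common first step is to read the text files into memory. Here is a minimal sketch in Python. It assumes the corpus is a single folder of plain-text `.txt` files saved as UTF-8; the function name `load_corpus` and the folder name in the usage note are hypothetical.

```python
from pathlib import Path


def load_corpus(folder):
    """Read every .txt file in a folder into a {filename: text} dictionary."""
    corpus = {}
    # sorted() keeps the file order stable from one run to the next
    for path in sorted(Path(folder).glob("*.txt")):
        corpus[path.name] = path.read_text(encoding="utf-8")
    return corpus
```

For example, `load_corpus("my_corpus")` would return one entry per text file in a folder called `my_corpus`. Files with other extensions are skipped.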

What kinds of searches can you do with a corpus?

There are many ways that we can search a corpus to find answers to research questions or to develop teaching materials. Here are four common types of corpus searches:

  1. Frequency lists with different ways to sort: to see most frequent words
  2. Keyword(s) in context (KWIC): to see the company a word or words keep
  3. Wildcard searches: to see different forms of a word
  4. N-grams or clusters/groups of words: to see groups of words that go together

Below we show examples of each of these four types of searches with screenshots from the freeware program AntConc.

Frequency lists

Frequency order

Screenshot of AntConc frequency order

Alphabetical order (select POS under "Sort by")

Screenshot of AntConc alphabetical order

Word-end order (select TypeEnd under "Sort by")

Screenshot of AntConc word-end order
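If you prefer scripting, the three sort orders above can be approximated in a few lines of Python. This is only a sketch, not how AntConc itself works: the tokenizer is a simple assumption (lowercased runs of letters, with an optional apostrophe part such as "don't"), and the sample sentence is made up for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out words (assumed tokenization rule)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def frequency_list(text):
    """Count how many times each word form occurs."""
    return Counter(tokenize(text))


sample = "The cat sat on the mat. The dog sat too."
freqs = frequency_list(sample)

# Frequency order: most frequent first, ties broken alphabetically
by_frequency = sorted(freqs.items(), key=lambda item: (-item[1], item[0]))

# Alphabetical order
alphabetical = sorted(freqs.items())

# Word-end order: sort on the reversed word so shared endings group together
by_word_end = sorted(freqs.items(), key=lambda item: item[0][::-1])

for word, count in by_frequency:
    print(f"{count:>3}  {word}")
```

Word-end order is useful for spotting suffixes: words ending in the same letters (for example, -ing or -tion) appear next to each other in the list.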

Keyword(s) in context (KWIC)

Looking at the placement of however

Screenshot of AntConc KWIC: However

Looking at a group of words: there are

Screenshot of AntConc there are
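A KWIC display can also be sketched in Python: find each occurrence of the search word or phrase and keep a fixed number of characters on either side. The function name `kwic`, the `width` parameter, and the sample text are assumptions made for this illustration.

```python
import re


def kwic(text, phrase, width=30):
    """Return (left context, match, right context) for each hit of phrase."""
    # \b keeps the search from matching inside longer words; case is ignored
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        lines.append((left, match.group(), right))
    return lines


sample = ("There are many ways to search a corpus. However, there are "
          "limits to what a small corpus can show.")

# Right-align the left context so the search phrase lines up in a column
for left, hit, right in kwic(sample, "there are", width=20):
    print(f"{left:>20} | {hit} | {right}")
```

Lining up the search phrase in one column is what makes patterns visible, such as whether however tends to start a sentence or sit in the middle of one.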

Wildcards to find word forms: believe* → believe; believed; believes

Screenshot of AntConc wildcard search: believe*
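In a script, a wildcard search like believe* can be approximated with a regular expression. The sketch below treats `*` as "zero or more letters", which is a simplifying assumption rather than AntConc's exact wildcard behavior; `wildcard_search` and the sample sentence are hypothetical.

```python
import re
from collections import Counter


def wildcard_search(text, pattern):
    """Count word forms matching a pattern in which * means any run of letters."""
    # Escape the pattern, then turn the escaped * into a letters-only wildcard
    regex = re.escape(pattern).replace(r"\*", "[a-z]*")
    tokens = re.findall(r"[a-z]+", text.lower())
    # fullmatch requires the whole word to fit the pattern
    return Counter(token for token in tokens if re.fullmatch(regex, token))


sample = ("I believe she believed him, and he believes her. "
          "Nobody disbelieves them.")

for form, count in sorted(wildcard_search(sample, "believe*").items()):
    print(form, count)
```

Because the whole word must match, disbelieves is not counted here. A pattern such as `*believe*` would include it.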

N-grams or word groups/clusters

Screenshot of AntConc N-grams
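N-grams are straightforward to count in a script: slide a window of n words across the text and tally each sequence. This is a minimal sketch with an assumed tokenizer and made-up sample text. Note that it ignores punctuation, so some n-grams span sentence boundaries.

```python
import re
from collections import Counter


def ngrams(text, n=2):
    """Count every sequence of n consecutive words in the text."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    # Each window of n tokens starting at position i is one n-gram
    return Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


sample = "There are many texts. There are many tools. There are few rules."

for gram, count in ngrams(sample, n=2).most_common(3):
    print(count, gram)
```

Changing `n` gives longer word groups: `ngrams(sample, n=3)` counts three-word sequences such as "there are many".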

Do I need to build a corpus or can I use an existing one?

It’s always best to use an existing corpus if it represents the language you are interested in exploring. Here are some corpora that might be of use for you:

How big does my corpus need to be?

The size of the corpus you need depends on the types of questions you are trying to answer.

A small corpus can be great for classroom research, to answer questions such as:

  • What words will students need to be familiar with in order to read biology or engineering texts?
  • How are my students using transitions or time markers in their writing?

A larger corpus is needed if you want to capture the range of variation across different aspects of language use, to answer questions such as:

  • What are the linguistic characteristics of travel blogs?
  • What language is used in introductory biology textbooks?
  • How do people express disagreement?

If I decide to build a corpus, where do I start?

You've come to the right place! The rest of this wiki describes the steps you'll want to take to build a corpus that meets your needs.

Additional resources

Readings

Reppen, R. (2010). Building a corpus: What are the key considerations? In A. O’Keeffe & M. McCarthy (Eds.), The Routledge handbook of corpus linguistics (pp. 31-37). Routledge.

Egbert, J., Larsson, T., & Biber, D. (2020). Doing linguistics with a corpus: Methodological considerations for the everyday user. Cambridge University Press.

Tools

Video presentation

A video version of this content is available on the Crow YouTube channel.

Video: Best practices for corpus building

Navigating CIABATTA

Previous: Home

Next: 2. CIABATTA overview
