CABNC transcription project
The CABNC corpus is an open-licensed, detailed conversation analytic re-transcription of naturalistic conversations from a subcorpus of the British National Corpus, amounting to around 4.2 million words in 1436 separate conversations.
The project aims to produce transcripts usable for both computational and detailed qualitative analysis. If you are a CA transcriptionist and you use the data, please make sure you re-submit your updated transcripts to help improve the corpus over time.
Accessing and using CABNC data
- You can browse and listen to the latest versions of individual CABNC transcripts as CHAT-CA files on TalkBank, and the latest stable release of CABNC can be downloaded as an archive from TalkBank.
- The corresponding audio files for each transcript can be downloaded from the Audio BNC project’s list of URLs.
Guidelines for transcribers are currently being devised to help with standardisation. They adhere as closely as possible to current standards in CA without sacrificing machine readability.
To use or contribute to these transcripts:
- download and install CLAN,
- download the corresponding audio file from the Audio BNC site,
- improve existing transcripts with CLAN, then submit them to the CABNC project for inclusion.
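For computational use, the CHAT-CA transcripts can also be read as plain text outside CLAN. The following is a minimal sketch, assuming only the general CHAT conventions that header lines start with "@", main speaker tiers start with "*", dependent tiers start with "%", and tab-indented lines continue the previous tier. The sample text, speaker codes, and the function name read_main_tiers are hypothetical and are not taken from CABNC; real CABNC files also carry CA symbols and timing marks that this sketch leaves untouched.

```python
from collections import Counter

# Hypothetical CHAT-style excerpt for illustration only; not taken from CABNC.
SAMPLE = (
    "@Begin\n"
    "@Participants:\tANN Ann Speaker, BOB Bob Speaker\n"
    "*ANN:\thello there\n"
    "*BOB:\thi\n"
    "\thow are you\n"
    "%com:\tan illustrative comment tier\n"
    "*ANN:\tfine thanks\n"
    "@End\n"
)


def read_main_tiers(chat_text):
    """Return (speaker, utterance) pairs from the main tiers of CHAT text."""
    turns = []
    in_main_tier = False
    for line in chat_text.splitlines():
        if line.startswith("*") and ":" in line:
            # Split on the first colon only, since CA notation also uses colons.
            speaker, _, utterance = line[1:].partition(":")
            turns.append([speaker, utterance.strip()])
            in_main_tier = True
        elif line.startswith("\t") and in_main_tier:
            # A tab-indented line continues the current main tier.
            turns[-1][1] += " " + line.strip()
        else:
            # Headers (@) and dependent tiers (%) end the current main tier.
            in_main_tier = False
    return [(speaker, utterance) for speaker, utterance in turns]


turns = read_main_tiers(SAMPLE)
turns_per_speaker = Counter(speaker for speaker, _ in turns)
print(turns)
print(turns_per_speaker)
```

A dedicated CHAT reader is likely a better choice for anything beyond quick counts, since this sketch does no validation of the file.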
Underlying BNC Data and Usage Rights
Accessing original BNC data
The data on which this project builds is available here:
- The original Audio BNC transcripts are available in HTML format via the AudioBNC
- Audio data and Praat TextGrid files are available on the Oxford Phonetics Institute AudioBNC site.
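The Praat TextGrid files listed above pair stretches of audio with labels, so they can be used to look up when a word was spoken. The following is a rough sketch of pulling labelled intervals out of a TextGrid, assuming the file is in Praat's long text format; whether the AudioBNC files use that variant is an assumption. The sample fragment, the tier name, and the function name read_intervals are hypothetical.

```python
import re

# Hypothetical TextGrid fragment in Praat's long text format (illustration only).
SAMPLE = '''File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0
xmax = 1.5
tiers? <exists>
size = 1
item []:
    item [1]:
        class = "IntervalTier"
        name = "words"
        xmin = 0
        xmax = 1.5
        intervals: size = 3
        intervals [1]:
            xmin = 0
            xmax = 0.4
            text = "hello"
        intervals [2]:
            xmin = 0.4
            xmax = 0.9
            text = ""
        intervals [3]:
            xmin = 0.9
            xmax = 1.5
            text = "there"
'''

# Matches one interval entry: its start time, end time, and label.
INTERVAL = re.compile(
    r'intervals \[\d+\]:\s*'
    r'xmin = ([\d.]+)\s*'
    r'xmax = ([\d.]+)\s*'
    r'text = "([^"]*)"'
)


def read_intervals(textgrid_text):
    """Return (start_seconds, end_seconds, label) for each non-empty interval."""
    return [
        (float(start), float(end), label)
        for start, end, label in INTERVAL.findall(textgrid_text)
        if label  # skip unlabelled (silent) intervals
    ]


for start, end, label in read_intervals(SAMPLE):
    print(f"{start:.2f}-{end:.2f}s {label}")
```

This sketch merges intervals from all tiers into one list and does not handle labels containing doubled quotation marks, so it is only suitable for a quick inspection of a single-tier file.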
If you want to perform complex searches on BNC data:
- Lancaster University's BNCweb tool provides a useful web interface for searching the BNC and checking the audio location of sections of transcript.
Subcorpus Data Selection Rationale
The Audio BNC contains about 7.5 million words of recorded speech, all of it already roughly transcribed. The audio recordings are of sufficient quality for automated phonetic transcription, and full Praat TextGrid files aligning audio to transcriptions are available for the entire corpus. There are also comprehensive wordclass and part-of-speech tag annotations. Within the overall BNC corpus, this project focuses on a subcorpus of more naturalistic conversations from informal contexts. These include 152 rough transcripts of audio files, labelled by the original BNC transcribers with the following tags:
- Overall category: Demographically sampled (subjects carrying audio recorders around)
- Interaction type: Dialogue (rather than speeches/monologues)
- Genre type: conv. (conversation).
These are conversations around water-coolers, in corridors, at bus-stops, in homes, etc., and as such are most useful for analysing natural talk-in-interaction. There are 4,228,314 words in this subcorpus.
Rights and Usage Information
- All files are publicly available under a Creative Commons Attribution License (details here)
- Please cite use of these transcriptions as: Saul Albert, Laura E. de Ruiter, and J.P. de Ruiter (2015) CABNC: the Jeffersonian transcription of the Spoken British National Corpus. https://saulalbert.github.io/CABNC/.
- BNC spoken audio recordings were created or collected from other sources by Longman Dictionaries for the British National Corpus Consortium. Their usage is governed by the terms of the original recording permissions agreement with the contributors, which requires that they can only be "used for scientific study and publication by writers of dictionaries and educational material and language researchers".
- Please cite use of the AudioBNC recordings and associated transcription/annotations as: John Coleman, Ladan Baghai-Ravary, John Pybus, and Sergio Grau (2012) Audio BNC: the audio edition of the Spoken British National Corpus. Phonetics Laboratory, University of Oxford. http://www.phon.ox.ac.uk/AudioBNC.
- Many thanks to Dr. Margaret E. L. Renwick for her forced alignment data, which we used to enrich the BNC-XML with word and turn-timing data.
- We are very grateful to Prof. Brian MacWhinney for his tireless support helping us prepare these data for TalkBank.