CHAT CA lite
This page documents a set of constraints on full Jeffersonian transcription that CA transcribers should adopt to be able to convert their transcripts into computer-readable CHAT format using the jeffersonize script.
In CA transcription, new TCUs start with a capital letter (Hepburn & Bolden 2017 pp.55-57).
When transcribing in CAlite make sure your turn-beginnings also start with a newline.
TCU-initial capitalisation is analytically useful for CA (marking turn-beginnings), but introduces ambiguities into the transcript that cannot be resolved straightforwardly in computational analysis. The jeffersonize script converts the first letter in each new speaker line into a capital when going from CHAT to CAlite, and reverses that process when going from CAlite to CHAT.
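The capitalisation round trip described above can be sketched as follows. This is a minimal illustration, not the jeffersonize script itself: the function names and the assumed shape of a speaker line (an asterisk, a speaker code, a colon, then whitespace) are assumptions made for the example.

```python
import re

# Assumed shape of a CHAT main-tier line: "*SPK:" followed by whitespace.
SPEAKER_PREFIX = re.compile(r"^\*\w+:\s*")


def _map_first_letter(line, convert):
    """Apply `convert` to the first alphabetic character of a speaker line."""
    match = SPEAKER_PREFIX.match(line)
    if match is None:
        return line  # headers, dependent tiers and blank lines are left alone
    start = match.end()
    for offset, char in enumerate(line[start:]):
        if char.isalpha():
            index = start + offset
            return line[:index] + convert(char) + line[index + 1:]
    return line  # no letters on this line (e.g. only a timed pause)


def chat_to_calite(line):
    """Hypothetical helper: capitalise the turn-beginning (CHAT to CAlite)."""
    return _map_first_letter(line, str.upper)


def calite_to_chat(line):
    """Hypothetical helper: undo the capitalisation (CAlite to CHAT)."""
    return _map_first_letter(line, str.lower)
```

Note that the reverse direction in this sketch lowercases whatever letter comes first, so a turn beginning with 'I' or a proper noun would need special handling. That is the kind of ambiguity the paragraph above refers to.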
Use ceil and floor characters instead of [ and ]
When marking overlaps, use CLAN's overlap markers, ⌈ ⌉ for the top row and ⌊ ⌋ for the bottom row, instead of standard square brackets [ and ], which are more ambiguous in multiple rows of overlap and can easily be confused with CHAT-CA's use of square brackets for [^ comments] and other non-transcript information.
Do not space out words to show their length
Please ensure that spaces are not used to enhance the visual representation of the timing of speaker overlaps.
For example, a pair of overlapping turns in one of the CABNC transcripts could be re-transcribed manually like so:
*NOR: those co⌈lours ⌉ are- (.) ⌈n o t t̲ ̲o̲ ̲o̲ b a : d. they're no-
*CHR: ⌊hmm? ⌋ ⌊(hm) no problem the̲::̲re?
However, this would render 'not too bad' unrecoverable in a lexical search. Therefore please use something like this instead:
*NOR: those co⌈lours ⌉ are- (.) ⌈not t̲o̲:̲:o ba:̲d they're no-
*CHR: ⌊hmm? ⌋ ⌊(hm) no problem the̲::̲re?
The only compromise here is that any alignment spacing has to be added between words rather than within them. In other cases where overlap may occur within a word, use the overlap markers to indicate the extent of the overlap e.g.:
*NOR: °something's pressed h⌈ole⌉s for the::re°≈
*CHR: ⌊°°hm°°⌋

*NOR: °something's pressed h⌈ ole ⌉s for the::re°≈
*CHR: ⌊°°hm°°⌋
Although the latter is clearly more readable as a graphical representation, it renders the word 'holes' unavailable to lexical search. For publication, the spacing can always be re-arranged for greater readability.
For the purposes of searching large corpora, it is essential to maintain the recoverability of lexemes. Therefore, only use non-standard orthography when absolutely necessary to capture the detail of a person's speech.
If a word sounds like it is being cut short (e.g. it sounds like "it's cos" instead of "it's because"), you may be able to hear that it is actually pronounced by the participant, but simply spoken very fast, so consider using:
If, however, the participants themselves are making an issue of not sticking to standard pronunciation e.g. putting on an accent, doing reported speech, or the pronunciation is otherwise relevant interactionally, use this replacement format:
it's cos [: because]
A search by the TalkBank search engine can then target the replacement lexeme.
CABNC timing uses both objective machine timing and Jeffersonian timing of 'beats' approximated by analysts relative to the pace of surrounding talk.
- Jeffersonian 'beats' of silence are marked using single parentheses e.g.: (1.0) - denoting 1 'beat' of silence.
- Objective machine-measured timings can be associated with these more subjective measurements using the bullet delimited duration time-codes produced by CLAN. e.g.: •1094676_1095860• - denoting a duration of 1184 ms or 1.184 seconds.
In this case, 1.0 'beats' equates to 1.184 s of silence, given a 'normal' pace of surrounding talk. As a rule of thumb, round up or down depending on whether the talk is slow or fast, and always subtract approximately 0.2 s to account for the standard turn-transition time when measuring Jeffersonian 'beats'. A 1184 ms pause would therefore be transcribed as (1.0), which might be marked in the transcript as •1094676_1095860•
In general, any pause/gap time under 0.5 seconds (once the 0.2 seconds has been subtracted) can be transcribed as a micropause like so: (.), although it may still be marked up with a machine timing e.g.: •1024676_1024870• - denoting 194 ms or approximately 0.2 s of silence.
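The arithmetic above can be sketched as a small helper. The function names are hypothetical, and the sketch uses plain rounding to one decimal place, which is an assumption: it does not model the analyst's judgement about whether the surrounding talk is slow or fast.

```python
import re

# CLAN-style bullet-delimited time-code: •start_end• in milliseconds.
TIMECODE = re.compile(r"•(\d+)_(\d+)•")


def duration_ms(timecode):
    """Return the duration in milliseconds encoded by a •start_end• time-code."""
    match = TIMECODE.search(timecode)
    if match is None:
        raise ValueError(f"no time-code found in {timecode!r}")
    start, end = int(match.group(1)), int(match.group(2))
    return end - start


def jeffersonian_pause(ms, transition_s=0.2, micropause_s=0.5):
    """Approximate a Jeffersonian pause marking from a machine-measured duration.

    Subtracts the assumed 0.2 s turn-transition time, then marks anything
    under 0.5 s as a micropause. Rounding is plain, not pace-sensitive.
    """
    adjusted = ms / 1000 - transition_s
    if adjusted < micropause_s:
        return "(.)"
    return f"({adjusted:.1f})"
```

For the two time-codes used above, `jeffersonian_pause(duration_ms("•1094676_1095860•"))` gives `(1.0)` and `jeffersonian_pause(duration_ms("•1024676_1024870•"))` gives `(.)`.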
Laughter within words and words within laughter
Laughter-within-words in CA transcription uses strings of either h or H in parentheses e.g.:
If used consistently, this is relatively simple to filter out algorithmically, so no special alteration to existing practices is necessary to maintain lexical recoverability in a search.
Words-within-laughter however are more complex e.g.:
A minimal adaptation to existing practices would involve using Ἡ (H with dasia) - already one of the CHAT-CA symbols easily available via keyboard shortcuts in CLAN.
For lowercase laugh-tokens, ħ could be used instead of 'h'.
Even laugh-token strings that include vowels: e.g.: huh/hah/heh/hih could be filtered out if rendered as ħuħ/ħaħ/ħeħ/ħiħ or ἩuἩ/ἩaἩ/ἩeἩ/ἩiἩ or some combination of these:
Although a ((laughter)) comment can also be used, using these characters to mark laughter would make it easier to find episodes of laughter, laughter-within-words and words-within-laughter in a large corpus.
Alternatively, as with non-standard orthography, a replacement code can be used:
hihyeahh [: yeah]
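As a rough illustration of the filtering this section has in mind, the sketch below strips parenthesised laugh particles and ħ/Ἡ-marked laugh tokens so that the remaining lexemes can be searched. The function name and the exact patterns are assumptions for the example, not part of CLAN or the jeffersonize script.

```python
import re

# Parenthesised laugh particles inside words, e.g. "ye(h)s" or "n(HH)o".
PAREN_LAUGH = re.compile(r"\([hH]+\)")

# Runs of ħ (U+0127) or Ἡ (U+1F29), optionally linked by the vowels of
# huh/hah/heh/hih, e.g. "ħ", "ħħ", "ħuħ", "ἩaἩ".
MARKED_LAUGH = re.compile("[\u0127\u1f29]+(?:[aeiu][\u0127\u1f29]+)*")


def strip_laughter(text):
    """Remove laugh tokens and tidy the whitespace left behind."""
    text = PAREN_LAUGH.sub("", text)
    text = MARKED_LAUGH.sub("", text)
    return " ".join(text.split())
```

Under these assumptions `ye(h)s` becomes `yes`, and a word such as `ħiħyeahħ` becomes `yeah`, because the ordinary letter h is left untouched and only the dedicated laugh characters are removed.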
Variations from CHAT-CA
If you are a CHAT-CA transcriber with expertise in maintaining machine readability of transcripts, please give feedback on this section of the document.
This section discusses the ways that CHAT-CA-lite will vary from CHAT-CA's use of novel machine-readable symbols in place of traditional CA conventions. It focuses on the rationale for how Jeffersonian conventions can be translated into CHAT-CA algorithmically.
As much as possible, the burden is placed on algorithmic approaches to reconciling differences between Jeffersonian conventions and CHAT-CA approaches to maximise the number of traditionally CA-trained transcribers who can get involved in the CABNC project.
Most CA transcripts follow Jefferson in using TCU-initial (not sentence initial) capitalization. This introduces significant ambiguities - especially if also marking loudness using capitals, so CHAT CA lite will have to come up with some way of disambiguating TCU-initiation capitalisations from other kinds of capitalization.
Round parentheses for doubtful transcriptions:
I find it useful to keep these - maintaining a collaboratively authored transcript with these 'best guesses' will help with incremental improvements as future transcribers will re-listen knowing what previous transcribers guessed. Also sometimes participants will design talk for indistinctness - this convention reflects that practice. The double parentheses can also be maintained harmlessly (search/replace all parenthesized text with nothing to convert to strict CHAT-CA). CHAT-CA has a code for this (??) - but as a reader of CA transcripts I find that harder to scan as an ambiguous coding - and I think it diminishes the lay readability of the transcript.
In the CA manuals I've read, a period marks unit-final intonation, not necessarily sentence structure, and a '?' marks rising intonation, not necessarily interrogative morphosyntax. There are many others in current use - these are very well established conventions, and it seems to me that a simple mapping could be made:
- question mark? = ⇗rising to high
- comma, = ↗ rising to mid
- underscore_ = → level
- period. = ↘ falling to mid
I don't think there's a distinct convention for ⇘ falling to low beyond 'period intonation' in CA other than emphasizing the drop with a within-unit ↓ shift to low pitch.
All these marks are used unit-finally in CA transcripts, not necessarily turn-finally or sentence-finally. It therefore doesn't make sense to restrict them to the end of a turn or sentence: they don't reflect syntax or turn structure, just things hearable as unit-ends.
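The punctuation-to-contour mapping listed above could be applied with a search/replace along these lines. This is only a sketch: treating a mark as unit-final when it follows a letter, colon or underline and precedes whitespace or the end of the line is an assumption, chosen so that pauses such as (.) and (1.0) are left alone.

```python
import re

# Mapping taken from the list above.
CONTOURS = {"?": "⇗", ",": "↗", "_": "→", ".": "↘"}

# A mark counts as unit-final here if it directly follows a word character,
# a lengthening colon or a combining underline (U+0332), and is followed by
# whitespace or the end of the line.
UNIT_FINAL = re.compile(r"(?<=[\w:\u0332])([?,_.])(?=\s|$)")


def to_chat_contours(talk):
    """Replace unit-final CA punctuation with CHAT-CA contour arrows."""
    return UNIT_FINAL.sub(lambda m: CONTOURS[m.group(1)], talk)
```

A mark immediately followed by an overlap marker (e.g. `hmm?⌋`) is not caught by this pattern, so a fuller version would need to look past such symbols.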
However, there are also widely used CA conventions for pitch variation that use colons and underlines:
fa̲:lling contours take an underline followed by a colon
ri:̲sing contours take underlined colons
w̲o:̲:rd 'up-down-up' type contours do both.
These could be mapped to ↗ rising to mid and ↘ falling to mid relatively easily. Multiple colons in a row could be easily search/replaced out, so I think all this can be left out of any guidelines as it does not necessarily introduce any harmful ambiguities.
The equals sign is generally used by CA transcribers in place of CHAT-CA's double and triple wavy and triple equals signs. Given the placement of tiers, I don't think all of CHAT-CA's differentiations are necessary for the transcriber to make. A simple search could replace = with ≈ signs, and replace the second of two matched ≈ with a +≈. It would not be much more complicated to search for = or ≈ signs latched across another speaker tier's turn and replace them with ≋ and +≋. I don't think there is a risk of introducing irreconcilable ambiguities by using either = or ≈ in all these cases in CHAT-CA-lite - but I may have misunderstood something about these distinctions.
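A minimal sketch of the simpler of these two replacements is given below, assuming latches are marked with = at the end of one speaker line and at the start of the talk on the very next speaker line. The function name is hypothetical, lines are assumed to be passed without trailing newlines, and the ≋ / +≋ case (latching across another speaker's turn) is deliberately left out.

```python
import re

# Assumed shape of a CHAT main-tier line: "*SPK:" followed by whitespace.
SPEAKER_PREFIX = re.compile(r"^\*\w+:\s*")


def convert_latches(lines):
    """Turn matched '=' pairs on adjacent speaker lines into ≈ and +≈."""
    out = list(lines)
    for i in range(len(out) - 1):
        match = SPEAKER_PREFIX.match(out[i + 1])
        if match is None:
            continue  # next line is not a speaker line
        start = match.end()
        ends_latched = out[i].rstrip().endswith("=")
        begins_latched = out[i + 1][start:].startswith("=")
        if ends_latched and begins_latched:
            out[i] = out[i].rstrip()[:-1] + "≈"
            out[i + 1] = out[i + 1][:start] + "+≈" + out[i + 1][start + 1:]
    return out
```

Unmatched = signs are left as they are, so they can be reviewed by hand rather than silently converted.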
Angle brackets for faster and slower:
These can easily be search/replaced with ∆ and ∇, whether >matched< (fast/slow), >>doubled<< (faster/slower) or unmatched (e.g. the 'left push' < indicating a fast start done as if part of the previous unit/turn). I don't see any unresolvable ambiguities here.
I see CA transcripts use °degrees° to mark quiet talk, °°double degrees°° to mark extra quiet talk. This is different from ∬whispers∬ which can be done quite loudly and have a distinctive production from very quiet talk. Again - I don't think there's a problem with search/replacing double to single degrees if necessary.
CA transcribers use non-standard orthography, but replacement lexemes e.g.:
it's cos [: because]
should solve this problem. If the transcript becomes unacceptably unreadable as a result of so many replacement markings, it might be useful to use an alternate track in the transcript. This will need to be investigated more thoroughly.
Loudness with ALL CAPS:
This is very readable and widely used in CA. It should be straightforward, especially if turn-beginnings are not capitalized, to search/replace strings of caps with ◉ markers, so I don't think this needs to be in the guidelines. There is the issue of proper nouns or pronouns like ◉I◉ being accidentally loud-marked in a search/replace. However, this ambiguity could be very significantly mitigated by a search/replace function omitting single-character capitalisations at word-beginnings and special cases like 'I'. On balance I still think losing ALL CAPS may be a deal-breaker for some CA transcribers.
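A search/replace of this kind might look like the sketch below. The function name is hypothetical, and the rule that a 'loud' word is any run of two or more capitals (optionally containing apostrophes) is an assumption: it skips single capitals such as 'I' and word-initial capitals, but would still mis-mark acronyms and would lowercase contractions such as I'M.

```python
import re

# Assumed shape of a CHAT main-tier line: "*SPK:" followed by whitespace.
SPEAKER_PREFIX = re.compile(r"^\*\w+:\s*")

# Two or more capitals, apostrophes allowed inside (e.g. NO, DON'T).
CAPS_WORD = re.compile(r"\b[A-Z][A-Z']*[A-Z]\b")


def caps_to_loud_markers(line):
    """Wrap ALL CAPS words in ◉ markers, leaving the speaker code untouched."""
    match = SPEAKER_PREFIX.match(line)
    start = match.end() if match else 0
    prefix, talk = line[:start], line[start:]
    marked = CAPS_WORD.sub(lambda m: "◉" + m.group(0).lower() + "◉", talk)
    return prefix + marked
```

The speaker code is split off first because codes such as NOR and CHR are themselves strings of capitals.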
Overlap markers next to/within words:
A simple search/replace can shift all ⌉ ⌈ ⌋ and ⌊ out of any words they intersect with... or a search algorithm could easily be programmed to ignore them to recover lexical matches, so I'm not sure specific guidance on that front is necessary.
Underlines are used in standard CA to mark emphasis rather than loudness per se - so many follow Jefferson in using them to also mark pitch shifts within words in conjunction with colons as above. They can also be used in combination with ALL CAPS for emphatic shouting as opposed to just loud speech. I'm not sure I see the ambiguities here - more a matter of degrees. Underlined speech is loud/emphatic, ALL CAPS (or ◉) marked speech is louder, UNDERLINED ALL CAPS is really loud. If CHAT-CA doesn't need to use this level of distinction, both variants could be search/replaced with ◉ markers.
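Ignoring the overlap markers for search purposes can be as simple as deleting them from a copy of the line before matching. The sketch below (with a hypothetical function name) does only that, and it also shows why spacing inside words remains a problem even when the markers are ignored.

```python
# Translation table that deletes the four overlap markers ⌈ ⌉ ⌊ ⌋.
OVERLAP_MARKERS = str.maketrans("", "", "\u2308\u2309\u230a\u230b")


def searchable(text):
    """Return a copy of the text with overlap markers removed."""
    return text.translate(OVERLAP_MARKERS)


# 'holes' is recoverable when the markers sit directly inside the word...
assert "holes" in searchable("pressed h⌈ole⌉s for the::re")
# ...but not when spaces have been added inside the word for alignment.
assert "holes" not in searchable("pressed h⌈ ole ⌉s for the::re")
```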
- Hepburn, A. & Bolden, G. B. (2013) The Conversation Analytic Approach to Transcription in Sidnell, J. & Stivers, T. (Eds.) The Handbook of Conversation Analysis, Wiley-Blackwell, 57-76