English-Czech Cohesion Corpus Readme
Parallel corpus from 2002, developed by Dominik Lukeš for the comparative study of cohesive devices in Czech and English texts.
The complete corpus consists of about 100,000 tokens: 46,689 in Czech and 56,905 in English (103,594 in total).
The corpus consists of 24 texts in two broad genres: fiction and non-fiction. Each genre is represented by 6 texts in each direction of translation, i.e. 6 fiction texts translated from English to Czech, 6 fiction texts translated from Czech to English, and likewise for non-fiction.
This means that there are a total of 48 samples, half in Czech and half in English. The average sample length is about 2,000 tokens.
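The composition figures can be cross-checked with a few lines of arithmetic. The following is only an illustrative sketch: the counts are copied from this readme, while the variable names and direction labels are made up for the example and do not correspond to anything in the corpus files.

```python
# Figures copied from this readme; all names below are illustrative only.
GENRES = ("fiction", "non-fiction")
DIRECTIONS = ("English to Czech", "Czech to English")
TEXTS_PER_GENRE_AND_DIRECTION = 6
TOKENS = {"Czech": 46_689, "English": 56_905}

# 2 genres x 2 directions x 6 texts = 24 texts
texts = len(GENRES) * len(DIRECTIONS) * TEXTS_PER_GENRE_AND_DIRECTION

# Each text exists in both languages, so there are twice as many samples.
samples = texts * len(TOKENS)

total_tokens = sum(TOKENS.values())
mean_sample_length = total_tokens / samples

print(f"texts: {texts}, samples: {samples}")
print(f"total tokens: {total_tokens}")
print(f"mean sample length: {mean_sample_length:.0f} tokens")

# Per-language mean, since the English side is noticeably larger.
for language, count in TOKENS.items():
    print(f"{language}: {count / texts:.0f} tokens per sample on average")
```

The overall mean comes out slightly above 2,000 tokens per sample, with the English samples somewhat longer on average than the Czech ones.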
Origins of the corpus
The corpus was compiled for a particular study at a time when reliable parallel corpora were not available. Since then, the Czech National Corpus has made a better parallel corpus available. Still, this corpus can probably be of use for smaller exploratory studies.
The corpus was tagged for use with the ParaConc parallel concordancer.