Parallel corpus from 2002 for the comparative study of Czech and English cohesion.
Corpus by Genre
Pilot Corpus
Sample queries and search results
.gitignore
Czech-Source.txt
Czech-Target.txt
English-Source.txt
English-Target.txt
README.md


English-Czech Cohesion Corpus Readme

Parallel corpus from 2002 for the comparative study of Czech and English cohesion. Developed by Dominik Lukeš to study the differences between Czech and English cohesive devices.

Corpus information

The complete corpus consists of about 100,000 tokens: 46,689 in Czech and 56,905 in English.

The corpus consists of 24 texts in two broad genres: fiction and non-fiction. Each genre is represented by 6 texts in each direction of translation, i.e. 6 fiction texts translated from English to Czech, 6 fiction texts translated from Czech to English, and likewise for non-fiction.

Because each text is included both in the original and in translation, there are 48 samples in total, half in Czech and half in English. The average sample length is about 2,000 tokens.
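
The figures above can be cross-checked with a few lines of arithmetic. This is only a sketch based on the numbers stated in this readme; the variable names are illustrative and are not part of the corpus or its tooling.

```python
# All figures are taken from this readme; names are illustrative only.
GENRES = 2            # fiction, non-fiction
DIRECTIONS = 2        # English to Czech, Czech to English
TEXTS_PER_CELL = 6    # texts per genre and direction

CZECH_TOKENS = 46_689
ENGLISH_TOKENS = 56_905

texts = GENRES * DIRECTIONS * TEXTS_PER_CELL   # number of texts
samples = texts * 2                            # each text exists in both languages
total_tokens = CZECH_TOKENS + ENGLISH_TOKENS
average_sample_length = total_tokens / samples

print(f"texts: {texts}")
print(f"samples: {samples}")
print(f"total tokens: {total_tokens}")
print(f"average sample length: {average_sample_length:.0f} tokens")
```

The exact average comes out slightly above the rounded 2,000 tokens quoted above, which is consistent with the total being a little over 100,000.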

Origins of the corpus

The corpus was compiled for a particular study at a time when reliable parallel corpora were not available. Since then, the Czech National Corpus has made a better parallel corpus available. Still, this corpus can probably be of use for smaller exploratory studies.

Markup

The corpus was tagged for use with the ParaConc parallel concordancer.