DLI32-corpus

DLI32 and DLI32-2 are two small corpora for automatic language identification of written texts. They were collected from various discussion forums and contain noisy texts encoded in UTF-8. The texts may contain any of the following kinds of noise: URLs, citations in other languages, tags, abbreviations, unaccented characters, misspellings, typing errors, HTML tags and objects, insignificant characters, and SMS writing style. The DLI32 corpus contains 320 texts (10 texts for each of 32 languages), with text lengths ranging from 93 to 146 words. The DLI32-2 corpus is a subdivision of DLI32: it contains 640 texts (20 texts per language), with text lengths ranging from 43 to 67 words.
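The figures above (number of texts, texts per language, word-length range) can be checked after extracting an archive. The following is a minimal sketch, not part of the corpus distribution. It assumes a hypothetical layout of one folder per language containing `.txt` files, and it counts words by splitting on whitespace, which may differ from the counting method used by the corpus authors. The names `load_corpus` and `corpus_stats` are illustrative only.

```python
from collections import defaultdict
from pathlib import Path


def load_corpus(root):
    """Read texts from a HYPOTHETICAL layout: root/<language>/<name>.txt.

    The real archive layout may differ; adapt the glob pattern accordingly.
    """
    corpus = defaultdict(list)
    for path in sorted(Path(root).glob("*/*.txt")):
        # The README states that the texts are UTF-8 encoded.
        corpus[path.parent.name].append(path.read_text(encoding="utf-8"))
    return dict(corpus)


def corpus_stats(texts_by_language):
    """Summarize a mapping {language: [text, ...]}.

    Word counts use whitespace splitting (an assumption, see above).
    """
    lengths = [
        len(text.split())
        for texts in texts_by_language.values()
        for text in texts
    ]
    return {
        "languages": len(texts_by_language),
        "texts": len(lengths),
        # Distinct per-language text counts; a balanced corpus yields one value.
        "texts_per_language": sorted({len(t) for t in texts_by_language.values()}),
        "min_words": min(lengths, default=0),
        "max_words": max(lengths, default=0),
    }


if __name__ == "__main__":
    # Toy data standing in for an extracted corpus.
    toy = {
        "en": ["one two three", "four five"],
        "fr": ["un deux", "trois quatre cinq six"],
    }
    print(corpus_stats(toy))
```

If the assumed layout matched, `corpus_stats(load_corpus("DLI32"))` would be expected to report 32 languages, 320 texts, and a single per-language count of 10. The reported word range could deviate from 93 to 146 if the authors counted words differently.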

For more details, I recommend reading the DLI32.pdf file.

This corpus was used to evaluate the algorithms proposed in the following article:

K. Abainia, S. Ouamour and H. Sayoud, "Robust language identification of noisy texts: Proposal of hybrid approaches", International Workshop on Text-based Information Retrieval TIR’14, pp. 228-232.