DLI32 and DLI32-2 are two small corpora for automatic language identification of written texts. They were collected from various discussion forums and contain noisy, UTF-8 encoded texts. The texts may contain any of the following kinds of noise: URLs, citations in other languages, tags, abbreviations, unaccented characters, misspellings, typing errors, HTML tags and objects, insignificant characters, and SMS-style writing. The DLI32 corpus contains 320 texts (10 texts per language), with text lengths ranging between 93 and 146 words. The DLI32-2 corpus is a subdivision of DLI32: it contains 640 texts (20 texts per language), with text lengths ranging between 43 and 67 words.
For more details, I recommend reading the DLI32.pdf file.
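As a quick sanity check after loading either corpus, the figures above can be compared against a simple summary. The sketch below is only illustrative: the function name `summarize_corpus` and the (language, text) pair format are hypothetical, and counting words by splitting on whitespace is an assumption that may not match how the lengths were originally measured. Loading is left out because the file layout is not described here.

```python
from collections import Counter


def summarize_corpus(samples):
    """Summarize an iterable of (language_label, text) pairs.

    Hypothetical helper: words are counted by whitespace splitting,
    which is an assumption, not necessarily the corpus authors' method.
    """
    per_language = Counter()
    lengths = []
    for language, text in samples:
        per_language[language] += 1
        lengths.append(len(text.split()))
    return {
        "texts": sum(per_language.values()),
        "languages": len(per_language),
        # Distinct per-language counts; a balanced corpus gives one value.
        "texts_per_language": sorted(set(per_language.values())),
        "min_words": min(lengths) if lengths else 0,
        "max_words": max(lengths) if lengths else 0,
    }


# Made-up toy input, only to show the call; not taken from the corpus.
toy = [
    ("en", "hello world this is a test"),
    ("fr", "bonjour tout le monde"),
]
summary = summarize_corpus(toy)
print(summary)
```

For DLI32 one would expect 320 texts with 10 per language (which implies 32 languages) and lengths of roughly 93 to 146 words; for DLI32-2, 640 texts with 20 per language and lengths of roughly 43 to 67 words. Small deviations may simply reflect a different word-counting convention.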
This corpus was used to evaluate the algorithms proposed in the following article:
K. Abainia, S. Ouamour and H. Sayoud, "Robust language identification of noisy texts: Proposal of hybrid approaches", International Workshop on Text-based Information Retrieval TIR’14, pp. 228-232.