ANTSIX is a small Arabic corpus dedicated to automatic topic identification of written texts. It contains noisy texts collected from different Arabic discussion forums and covers 6 topics. The texts may be corrupted by the following kinds of noise: URLs, quotations in other languages, tags, abbreviations, misspellings, typing errors, HTML tags and objects, insignificant characters, SMS writing style and mistaken letters. The corpus is also unbalanced in terms of text size: text length ranges between 32 and 318 words. Each topic contains 50 texts encoded in UTF-8. Overall, the ANTSIX corpus therefore contains 300 short noisy Arabic texts related to 6 different topics.
For more details, please read the ATNSIX.pdf file.
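As a rough illustration of how the figures above could be checked, here is a minimal Python sketch. It assumes a hypothetical layout in which each topic is a sub-folder holding one UTF-8 .txt file per text; the actual layout of the distributed corpus may differ. The names strip_basic_noise and corpus_stats are illustrative only and are not part of the corpus or of the cited articles, and the cleaning step handles only two of the listed noise types (HTML-like tags and URLs).

```python
import re
from pathlib import Path

# Simple patterns for two of the noise types listed above (assumed, not exhaustive).
TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+|www\.\S+")


def strip_basic_noise(text: str) -> str:
    """Remove HTML-like tags and URLs, then collapse runs of whitespace."""
    text = TAG_RE.sub(" ", text)   # tags first, so URLs inside attributes vanish with them
    text = URL_RE.sub(" ", text)
    return " ".join(text.split())


def corpus_stats(root):
    """Return {topic: (n_texts, min_words, max_words)}.

    Assumes the hypothetical layout root/<topic>/<name>.txt, UTF-8 encoded.
    Words are counted by splitting on whitespace.
    """
    stats = {}
    for topic_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        lengths = [
            len(f.read_text(encoding="utf-8").split())
            for f in sorted(topic_dir.glob("*.txt"))
        ]
        if lengths:
            stats[topic_dir.name] = (len(lengths), min(lengths), max(lengths))
    return stats
```

Under that assumed layout, corpus_stats should report 6 topics with 50 texts each. The word counts may not match the 32 to 318 range exactly, since whitespace splitting is only one possible way of counting words.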
This corpus was used to evaluate the algorithms proposed in the following articles:
K. Abainia, S. Ouamour and H. Sayoud, "A novel robust Arabic light stemmer", Journal of Experimental & Theoretical Artificial Intelligence, Vol. 29, No. 3, pp. 557-573.
K. Abainia, S. Ouamour and H. Sayoud, "Topic Identification of Arabic Noisy Texts Based on KNN", International Conference on Information and Communication Technology Research, pp. 89-92.
K. Abainia, S. Ouamour and H. Sayoud, "Topic Identification of Noisy Arabic Texts Using Graph Approaches", International Workshop on Text-based Information Retrieval TIR’15, pp. 254-258.
K. Abainia, "Topic Identification of Noisy Texts: Statistical Approaches", International Journal of Hidden Data Mining and Scientific Knowledge Discovery, Vol. 01, No. 01, pp. 2-8.