Polyglot is a language identifier that detects documents containing text written in more than one language and identifies the languages they contain. It is an experimental project. For monolingual language detection, langid.py is a proven off-the-shelf solution.

The theoretical motivation behind polyglot is described in "Automatic Detection and Language Identification of Multilingual Documents", Marco Lui, Jey Han Lau, Timothy Baldwin, TACL Vol 2 (2014) [2].

To re-train polyglot on custom data, use the training tools for langid.py [1] to build a model, then convert it to polyglot's format using the script in ./polyglot/convert.py.

Marco Lui <firstname.lastname@example.org>, November 2013

[1] https://github.com/saffsd/langid.py
[2] https://tacl2013.cs.columbia.edu/ojs/index.php/tacl/article/view/86
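
For the monolingual case, where langid.py is the recommended tool, a minimal sketch of a call might look like the following. It assumes langid.py is installed (for example via `pip install langid`) and uses its `classify` function, which returns a (language code, score) pair. The wrapper name `identify_monolingual` and its `classify` parameter are hypothetical, introduced here only so the example can be exercised without the real model. This sketch does not show polyglot's own interface.

```python
from typing import Callable, Optional, Tuple

# A classifier maps a text to a (language_code, score) pair,
# which is the shape returned by langid.classify.
Classifier = Callable[[str], Tuple[str, float]]


def identify_monolingual(
    text: str, classify: Optional[Classifier] = None
) -> Tuple[str, float]:
    """Return (language_code, score) for a text assumed to be in one language.

    `identify_monolingual` is a hypothetical helper for illustration only;
    it is not part of langid.py or polyglot.
    """
    if classify is None:
        # Imported lazily so this module loads even without langid installed.
        import langid  # third-party package, assumed to be installed

        classify = langid.classify
    return classify(text)


if __name__ == "__main__":
    lang, score = identify_monolingual("This is a short English sentence.")
    print(lang, score)
```

A document that mixes several languages is outside what this call is meant for; that is the case polyglot targets.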