Language identification program for 32 languages in Python

vevake/language_identifier

language_identifier

This project aims to create a Python program that identifies the language a document or text is written in, without using any external libraries.

The implementation uses a Naive Bayes approach. The model is trained on the following languages:

French, English, Arabic, Russian, German, Italian, Greek, Spanish, Thai,
Persian, Chinese, Turkish, Finnish, Portuguese, Romanian, Indonesian,
Polish, Dutch, Irish, Icelandic, Hindi, Czech, Malay, Bulgarian, Urdu,
Norwegian, Danish, Hebrew, Swedish, Hungarian, Latin, and Albanian
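
As a rough illustration of the Naive Bayes approach mentioned above, here is a minimal sketch that uses only the standard library. The choice of character bigrams as features, the add-one smoothing, and the names `char_ngrams` and `NaiveBayesLanguageModel` are assumptions made for this example; the repository's actual features and class names may differ.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of a lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NaiveBayesLanguageModel:
    """Hypothetical multinomial Naive Bayes over character n-grams."""

    def __init__(self, n=2):
        self.n = n
        self.ngram_counts = defaultdict(Counter)  # language -> n-gram counts
        self.doc_counts = Counter()               # language -> number of texts
        self.vocabulary = set()                   # all n-grams seen in training

    def train(self, samples):
        """samples: iterable of (text, language) pairs."""
        for text, language in samples:
            grams = char_ngrams(text, self.n)
            self.ngram_counts[language].update(grams)
            self.doc_counts[language] += 1
            self.vocabulary.update(grams)

    def predict(self, text):
        """Return the language with the highest log posterior."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        grams = char_ngrams(text, self.n)
        best_language, best_score = None, float("-inf")
        for language, counts in self.ngram_counts.items():
            # Log prior: share of training texts written in this language.
            score = math.log(self.doc_counts[language] / total_docs)
            total = sum(counts.values())
            for gram in grams:
                # Add-one smoothing keeps unseen n-grams from zeroing the score.
                score += math.log((counts[gram] + 1) / (total + vocab_size))
            if score > best_score:
                best_language, best_score = language, score
        return best_language


# Toy usage with two made-up training sentences (not the DLI32 data).
model = NaiveBayesLanguageModel()
model.train([
    ("the weather here is awesome", "English"),
    ("le temps ici est magnifique", "French"),
])
print(model.predict("the weather is nice"))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied together for a long document.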

Train and Test data

The model is trained on the DLI32 corpus.

The DLI32 corpus contains 320 texts (10 per language) and is used for training. The DLI32-2 corpus contains 640 texts (20 per language) and is used for testing.

The train and test data have already been parsed and saved in CSV format.
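
The parsed data could be read back with the standard `csv` module, roughly as sketched below. The column names `text` and `language` and the helper names are assumptions made for illustration; check the header row of the actual CSV before relying on them.

```python
import csv
from collections import Counter


def parse_samples(lines, text_column="text", language_column="language"):
    """Turn CSV lines (header first) into a list of (text, language) pairs."""
    return [
        (row[text_column], row[language_column])
        for row in csv.DictReader(lines)
    ]


def load_samples(csv_path, **columns):
    """Read a CSV file from disk; csv_path is supplied by the caller."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return parse_samples(handle, **columns)


def texts_per_language(samples):
    """Count texts per language, e.g. to confirm 10 per language in training."""
    return Counter(language for _, language in samples)
```

Counting texts per language is a cheap sanity check that the file matches the corpus description (10 texts per language for training, 20 for testing).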

Running the program

To identify the language of a sentence passed directly on the command line:

$ python language_identifier.py <text_sentence>

Example:

$ python language_identifier.py 'The weather here is awesome'

To identify the language of a document:

$ python language_identifier.py <file_path>
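
Both invocations pass a single argument, so the script has to decide whether it is a file path or literal text. One plausible way to do that is sketched below; the README does not say how `language_identifier.py` actually makes this decision, and `resolve_input` and `main` are hypothetical names.

```python
import os


def resolve_input(argument):
    """Return file contents if the argument is an existing file, else the argument itself."""
    if os.path.isfile(argument):
        with open(argument, encoding="utf-8") as handle:
            return handle.read()
    return argument


def main(argv):
    """argv mirrors sys.argv; returns a process exit code."""
    if len(argv) != 2:
        print("usage: python language_identifier.py <text_sentence | file_path>")
        return 1
    text = resolve_input(argv[1])
    # A real script would classify `text` here; this sketch only reports its size.
    print(f"read {len(text)} characters")
    return 0
```

In a script this would typically be wired up with `sys.exit(main(sys.argv))`. Note that a sentence which happens to match an existing file name would be treated as a path under this scheme.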
