language_identifier

This project aims to create a Python program that identifies the language a document or text is written in, without using any external libraries.

The implementation uses a Naive Bayes approach. The model is trained on the following languages:

French, English, Arabic, Russian, German, Italian, Greek, Spanish, Thai,
Persian, Chinese, Turkish, Finnish, Portuguese, Romanian, Indonesian,
Polish, Dutch, Irish, Icelandic, Hindi, Czech, Malay, Bulgarian, Urdu,
Norwegian, Danish, Hebrew, Swedish, Hungarian, Latin, and Albanian
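To illustrate the Naive Bayes approach, here is a minimal sketch that uses only the standard library. It is not taken from language_identifier.py: the choice of character bigrams as features, the add-one (Laplace) smoothing, and the names char_ngrams, train, and predict are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of a lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train(samples, n=2):
    """Count n-grams per language from (language, text) pairs.

    Hypothetical helper: the real script may use different features.
    """
    ngram_counts = defaultdict(Counter)
    doc_counts = Counter()
    for language, text in samples:
        ngram_counts[language].update(char_ngrams(text, n))
        doc_counts[language] += 1
    vocabulary = set()
    for counts in ngram_counts.values():
        vocabulary.update(counts)
    return ngram_counts, doc_counts, vocabulary


def predict(text, model, n=2):
    """Return the language with the highest log-posterior score."""
    ngram_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    vocab_size = len(vocabulary)
    grams = char_ngrams(text, n)
    best_language, best_score = None, float("-inf")
    for language, counts in ngram_counts.items():
        total = sum(counts.values())
        # Log prior: share of training texts written in this language.
        score = math.log(doc_counts[language] / total_docs)
        for gram in grams:
            # Add-one smoothing keeps unseen n-grams from zeroing the score.
            score += math.log((counts[gram] + 1) / (total + vocab_size))
        if score > best_score:
            best_language, best_score = language, score
    return best_language


# Toy usage with two made-up training sentences.
toy_model = train([
    ("English", "the weather here is awesome and the sky is clear"),
    ("Spanish", "el clima aquí es increíble y el cielo está despejado"),
])
print(predict("the weather is clear", toy_model))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, which matters once texts are longer than a few words.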

Train and Test data

The model is trained on the DLI32 corpus, which contains 320 texts (10 per language). It is tested on the DLI32-2 corpus, which contains 640 texts (20 per language).

The training and test data are already parsed and saved as CSV files (train_data.csv and test_data.csv).
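Since the data is stored as CSV, it can be read with Python's built-in csv module, in keeping with the no-external-libraries goal. The sketch below is an assumption about the file layout, not a description of it: it supposes two columns per row (language label first, text second) and no header row, and load_samples is a hypothetical helper name.

```python
import csv


def load_samples(csv_path):
    """Read (language, text) pairs from a CSV file.

    Assumed layout: column 0 is the language label, column 1 is the text.
    Adjust the indices if the real files are organised differently.
    """
    samples = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            if len(row) < 2:
                continue  # skip blank or malformed rows
            samples.append((row[0].strip(), row[1]))
    return samples
```

If the files do have a header row, csv.DictReader would be a natural substitute for csv.reader.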

Running the program

To identify the language of a sentence passed directly on the command line:

$ python language_identifier.py <text_sentence>

Example:

$ python language_identifier.py 'The weather here is awesome'

To identify the language of a document:

$ python language_identifier.py <file_path>
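Both invocations take a single argument, so the script has to decide whether it received a sentence or a file path. One possible way to do that is sketched below; read_input is a hypothetical name, and treating the argument as a path only when a file exists at that location is an assumption, not necessarily what language_identifier.py does.

```python
import os


def read_input(argument):
    """Return the file's contents if `argument` names an existing file,
    otherwise return the argument itself as the text to classify."""
    if os.path.isfile(argument):
        with open(argument, encoding="utf-8") as handle:
            return handle.read()
    return argument
```

With this rule, a sentence that happens to match an existing file name would be read as a file, so a dedicated flag would be a more explicit alternative.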
