lang_detector

lang_detector is a program for detecting the language of input texts, and for training and evaluating the models used for detection.

Dependencies

lang_detector has been developed and tested on Ubuntu, so the instructions below assume a Linux environment.

lang_detector is written in Python, so a recent version of Python 3 must be installed before using it. Downloads for Python can be found at https://www.python.org/downloads/ (version 3.5.1 recommended).

The following python libraries will be needed to run lang_detector components:

Usage

With Python installed, lang_detector can be run from the command line.

In all examples, *PATH* refers to the path leading from the current directory to the directory in which the lang_detector folder is located.

Detecting language

Passing command line options to lang_detector

On Linux, the -d / --detect option detects the language of a string of text using a given model and either a bigram-based or a trigram-based language profile.

For example, to detect the language of 'the quick brown fox jumped over the lazy dog', using a pre-trained model called 'test_model':

python3 *PATH*/lang_detector/lang_detector.py -d -i "The quick brown fox jumped over the lazy dog" -m models/test_model
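
To illustrate what a bigram or trigram-based language profile could look like, the following is a minimal sketch in Python. It is not taken from lang_detector's source: the function names `ngram_profile` and `most_similar`, the padding scheme, and the overlap score are all assumptions made for this example, and the real models may be built and compared differently.

```python
from collections import Counter


def ngram_profile(text, n=3):
    """Return relative frequencies of character n-grams in text.

    Hypothetical helper for illustration; not lang_detector's own code.
    """
    # Lowercase, collapse whitespace, and pad with spaces so that word
    # boundaries are represented in the n-grams.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    total = sum(counts.values())
    if not total:
        return {}
    return {gram: count / total for gram, count in counts.items()}


def most_similar(text, profiles, n=3):
    """Return the language whose profile overlaps most with the text's profile."""
    sample = ngram_profile(text, n)

    def overlap(profile):
        # Sum the shared probability mass over n-grams seen in the sample.
        return sum(min(freq, profile.get(gram, 0.0)) for gram, freq in sample.items())

    return max(profiles, key=lambda language: overlap(profiles[language]))


# Toy profiles built from a single sentence each (assumed data, far too
# small for real use).
profiles = {
    "English": ngram_profile("the quick brown fox jumped over the lazy dog"),
    "Spanish": ngram_profile("el rápido zorro marrón saltó sobre el perro perezoso"),
}
print(most_similar("the dog jumped", profiles))
```

Passing `n=2` to both functions would give the bigram variant of the same idea.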

Required arguments

-d/--detect

-i/--input [string]

A string of text for which to detect the language (enclosed in quotation marks).

Optional arguments

N.B.: If the -m / --model option is not provided, the default language model (models/default) will be used.

-m/--model [string]

A pre-trained language detection model to be used.
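
The options described above could be declared with Python's standard `argparse` module roughly as follows. This is a hedged sketch rather than the parser used in lang_detector.py: the function name `build_parser` and the help strings are assumptions, and only the option names and the models/default fallback come from this README.

```python
import argparse


def build_parser():
    # Hypothetical parser mirroring the documented options; the real
    # lang_detector.py may define them differently.
    parser = argparse.ArgumentParser(prog="lang_detector.py")
    parser.add_argument("-d", "--detect", action="store_true",
                        help="detect the language of the input text")
    parser.add_argument("-i", "--input", type=str,
                        help="text for which to detect the language")
    # Fall back to the default model when -m / --model is not given.
    parser.add_argument("-m", "--model", type=str, default="models/default",
                        help="pre-trained language detection model to use")
    return parser


args = build_parser().parse_args(
    ["-d", "-i", "The quick brown fox", "-m", "models/test_model"]
)
print(args.detect, args.input, args.model)
```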

Passing a string of text to lang_detector

Alternatively, a single text string (enclosed in quotation marks) can be passed as an argument to lang_detector. This will return the most likely language using the default language model (models/default). For example, in Linux:

python3 *PATH*/lang_detector/lang_detector.py "The quick brown fox jumped over the lazy dog"

Taking input text from standard input

lang_detector also accepts text passed via standard input. This will return the most likely language using the default language model (models/default). For example, in Linux:

echo "The quick brown fox jumped over the lazy dog" | python3 *PATH*/lang_detector/lang_detector.py

cat example.txt | python3 *PATH*/lang_detector/lang_detector.py

python3 *PATH*/lang_detector/lang_detector.py < example.txt
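
One way a script can support both a positional text argument and standard input is sketched below. The helper `read_text` is hypothetical and only illustrates the fallback order described in this README; how lang_detector.py actually reads its input is not shown here.

```python
import io
import sys


def read_text(argv, stdin=sys.stdin):
    """Return the text to classify: the first argument if given, else stdin.

    Hypothetical helper for illustration; not lang_detector's own code.
    """
    if argv:
        return argv[0]
    # No argument was passed, so read piped or redirected input instead.
    return stdin.read().strip()


# Text passed as an argument.
print(read_text(["The quick brown fox jumped over the lazy dog"]))
# Text arriving on standard input (simulated here with an in-memory stream).
print(read_text([], stdin=io.StringIO("The quick brown fox jumped over the lazy dog\n")))
```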

Default model performance

The following is an evaluation of lang_detector's default model, which can be found in models/default. The model was trained using the data/default_data.csv file, which contains 50000 sentences from each of the English, French, Spanish, Portuguese, Italian, and German sides of the Europarl parallel corpus (the file was created using data/process_data.py).

The model was trained on 45000 sentences from each language and evaluated on the remaining 5000 sentences from each language. The training and test sets used for the model can be found in models/default/train.csv and models/default/test.csv.
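
A per-language split of this kind can be sketched as follows. The function `split_per_language`, the (language, sentence) row layout, and the fixed random seed are assumptions made for this example; the README does not say how train.csv and test.csv were actually produced.

```python
import random
from collections import defaultdict


def split_per_language(rows, test_size, seed=0):
    """Split (language, sentence) rows so that every language contributes
    exactly test_size rows to the test set.

    Hypothetical helper for illustration; not lang_detector's own code.
    """
    by_language = defaultdict(list)
    for language, sentence in rows:
        by_language[language].append(sentence)

    rng = random.Random(seed)
    train, test = [], []
    for language, sentences in by_language.items():
        shuffled = sentences[:]
        rng.shuffle(shuffled)
        test += [(language, s) for s in shuffled[:test_size]]
        train += [(language, s) for s in shuffled[test_size:]]
    return train, test


# Scaled-down stand-in for the 50000 / 5000 split: 10 rows per language,
# 1 of which is held out for testing.
rows = [(lang, f"{lang} sentence {i}")
        for lang in ("English", "French") for i in range(10)]
train, test = split_per_language(rows, test_size=1)
print(len(train), len(test))  # 18 2
```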

Language	Sentences	% Correct
Spanish	5000	98.42%
Portuguese	5000	98.66%
French	5000	98.86%
English	5000	99.26%
German	5000	99.50%
Italian	5000	99.50%
------	------	------
TOTAL	30000	99.03%
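
As a quick check of the arithmetic in the table, the TOTAL row can be recomputed from the per-language figures. The snippet below only uses the numbers reported above; the variable names are chosen for this example.

```python
# Per-language accuracy (%) as reported in the table above.
accuracy = {
    "Spanish": 98.42,
    "Portuguese": 98.66,
    "French": 98.86,
    "English": 99.26,
    "German": 99.5,
    "Italian": 99.5,
}
SENTENCES_PER_LANGUAGE = 5000

# Convert each percentage back into a count of correctly detected sentences.
correct = {lang: round(SENTENCES_PER_LANGUAGE * pct / 100)
           for lang, pct in accuracy.items()}
total_sentences = SENTENCES_PER_LANGUAGE * len(accuracy)
total_accuracy = 100 * sum(correct.values()) / total_sentences

print(total_sentences, sum(correct.values()), round(total_accuracy, 2))
# 30000 29710 99.03
```

Because every language has the same number of test sentences, the overall figure equals the plain average of the six per-language percentages.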

Tests

lang_detector is complemented by unit tests, located in the tests directory. To run them, type the following from lang_detector's root directory:

python3 -m unittest discover -v -s tests
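
For readers unfamiliar with `unittest` discovery: the command above collects test cases from files in the tests directory whose names match `test*.py` (the default pattern). The following is a minimal sketch of what such a file could contain. The function `normalise` and the test class are hypothetical stand-ins and do not correspond to actual code or tests in the repository.

```python
import unittest


def normalise(text):
    # Hypothetical function under test: lowercase and collapse whitespace.
    return " ".join(text.lower().split())


class NormaliseTest(unittest.TestCase):
    def test_collapses_whitespace_and_lowercases(self):
        self.assertEqual(normalise("The  Quick\tFox"), "the quick fox")

    def test_empty_string_stays_empty(self):
        self.assertEqual(normalise(""), "")
```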
