Pure C natural language identifier with support for 97 languages
C
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
.gitignore
LICENSE
Makefile
README
TODO
_langid.c
acquis.model
compact_lang_det_batch.cc
compile_batch.sh
langid.c
langid.proto
ldpy.model
ldpy2ldc.py
liblangid.c
liblangid.h
model.c
model.h
setup.py
sparseset.c
sparseset.h

README

================
``langid.c`` readme
================

Introduction
------------
`langid.c` is an experimental implementation of the language identifier
described by [1] in pure C. It is largely based on the design of
`langid.py`[2], and uses `langid.py` to train models. 

Planned features
----------------
See TODO

Speed
-----

Initial comparisons against Google's cld2[3] suggest that `langid.c` is about
twice as fast.

    (langid.c) @mlui langid.c git:[master] wc -l wikifiles 
    28600 wikifiles
    (langid.c) @mlui langid.c git:[master] time cat wikifiles | ./compact_lang_det_batch > xxx
    cat wikifiles  0.00s user 0.00s system 0% cpu 7.989 total
    ./compact_lang_det_batch > xxx  7.77s user 0.60s system 98% cpu 8.479 total
    (langid.c) @mlui langid.c git:[master] time cat wikifiles | ./langidOs -b > xxx           
    cat wikifiles  0.00s user 0.00s system 0% cpu 3.577 total
    ./langidOs -b > xxx  3.44s user 0.24s system 97% cpu 3.759 total

    (langid.c) @mlui langid.c git:[master] wc -l rcv2files 
    20000 rcv2files
    (langid.c) @mlui langid.c git:[master] time cat rcv2files | ./langidO2 -b > xxx     
    cat rcv2files  0.00s user 0.00s system 0% cpu 31.702 total
    ./langidO2 -b > xxx  8.23s user 0.54s system 22% cpu 38.644 total
    (langid.c) @mlui langid.c git:[master] time cat rcv2files | ./compact_lang_det_batch > xxx 
    cat rcv2files  0.00s user 0.00s system 0% cpu 18.343 total
    ./compact_lang_det_batch > xxx  18.14s user 0.53s system 97% cpu 19.155 total


Model Training
--------------

Google's protocol buffers [4] are used to transfer models between languages. The
Python program `ldpy2ldc.py` can convert a model produced by langid.py [2] into
the protocol-buffer format, and also the C source format used to compile an
in-built model directly into executable.

Dependencies
------------
Protocol buffers [4]
protobuf-c [5]

Contact
-------
Marco Lui <saffsd@gmail.com>

References
----------
[1] http://aclweb.org/anthology-new/I/I11/I11-1062.pdf
[2] https://github.com/saffsd/langid.py
[3] https://code.google.com/p/cld2/
[4] https://github.com/google/protobuf/
[5] https://github.com/protobuf-c/protobuf-c