===================
``langid.c`` readme
===================

Introduction
------------
`langid.c` is an experimental pure-C implementation of the language
identifier described in [1]. It is largely based on the design of
`langid.py` [2], and uses `langid.py` to train its models.
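
For orientation, the identifier in [1] is, to the best of our understanding,
a multinomial naive Bayes classifier over byte n-gram features. The sketch
below is not code from `langid.c`. It only illustrates the scoring step under
that assumption, and the names (`nb_classify`, `log_prior`, `log_likelihood`)
and the row-major parameter layout are hypothetical choices made for this
example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Illustrative multinomial naive Bayes scoring (hypothetical names and
 * layout, not taken from langid.c).
 *
 * counts[f]                         : occurrences of feature f in the input
 * log_prior[c]                      : log P(class c)
 * log_likelihood[f * n_classes + c] : log P(feature f | class c)
 *
 * Returns the index of the highest-scoring class.
 */
size_t nb_classify(const unsigned *counts, size_t n_features,
                   const double *log_prior, const double *log_likelihood,
                   size_t n_classes)
{
    assert(n_classes > 0);

    size_t best = 0;
    double best_score = 0.0;

    for (size_t c = 0; c < n_classes; c++) {
        double score = log_prior[c];
        for (size_t f = 0; f < n_features; f++) {
            /* Features that do not occur contribute nothing. */
            if (counts[f] != 0)
                score += counts[f] * log_likelihood[f * n_classes + c];
        }
        /* The first class always seeds the running maximum. */
        if (c == 0 || score > best_score) {
            best = c;
            best_score = score;
        }
    }
    return best;
}
```

Working in log space turns the product of per-feature probabilities into a
sum, which avoids floating-point underflow on long inputs.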

Planned features
----------------
See TODO

Speed
-----

Initial comparisons against Google's cld2 [3] suggest that `langid.c` is about
twice as fast, measured in CPU (user) time::

    (langid.c) @mlui langid.c git:[master] wc -l wikifiles 
    28600 wikifiles
    (langid.c) @mlui langid.c git:[master] time cat wikifiles | ./compact_lang_det_batch > xxx
    cat wikifiles  0.00s user 0.00s system 0% cpu 7.989 total
    ./compact_lang_det_batch > xxx  7.77s user 0.60s system 98% cpu 8.479 total
    (langid.c) @mlui langid.c git:[master] time cat wikifiles | ./langidOs -b > xxx           
    cat wikifiles  0.00s user 0.00s system 0% cpu 3.577 total
    ./langidOs -b > xxx  3.44s user 0.24s system 97% cpu 3.759 total

    (langid.c) @mlui langid.c git:[master] wc -l rcv2files 
    20000 rcv2files
    (langid.c) @mlui langid.c git:[master] time cat rcv2files | ./langidO2 -b > xxx     
    cat rcv2files  0.00s user 0.00s system 0% cpu 31.702 total
    ./langidO2 -b > xxx  8.23s user 0.54s system 22% cpu 38.644 total
    (langid.c) @mlui langid.c git:[master] time cat rcv2files | ./compact_lang_det_batch > xxx 
    cat rcv2files  0.00s user 0.00s system 0% cpu 18.343 total
    ./compact_lang_det_batch > xxx  18.14s user 0.53s system 97% cpu 19.155 total
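
The "about twice as fast" figure can be checked from the user times in the
transcripts above. The helper below is only a worked-arithmetic sketch; the
name `speedup` is ours and is not part of `langid.c`.

```c
#include <assert.h>

/*
 * Ratio of CPU times: how many times faster the candidate is than the
 * baseline. Both arguments are "user" seconds as reported by time.
 */
double speedup(double baseline_user_s, double candidate_user_s)
{
    assert(candidate_user_s > 0.0);
    return baseline_user_s / candidate_user_s;
}
```

With the numbers above, `speedup(7.77, 3.44)` is roughly 2.26 for `wikifiles`
and `speedup(18.14, 8.23)` is roughly 2.20 for `rcv2files`. The comparison
uses user time rather than wall-clock time because the `langidO2` run on
`rcv2files` used only 22% CPU, so its 38.6 s total was mostly spent waiting
rather than computing.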


Model Training
--------------

Google's protocol buffers [4] are used to transfer models between programming
languages (from Python to C). The Python program `ldpy2ldc.py` can convert a
model produced by `langid.py` [2] into the protocol-buffer format, and also
into the C source format used to compile a built-in model directly into the
executable.
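
To make the idea of a model compiled into the executable concrete, here is a
minimal sketch of model data embedded as constant C arrays. Everything in it is
an assumption made for illustration: the `embedded_model` struct, the
`load_builtin_model` function and the toy values are hypothetical, and the
source file actually generated by `ldpy2ldc.py` may look entirely different.

```c
#include <stddef.h>

/*
 * Hypothetical layout for a model compiled into the binary. This is NOT
 * the format emitted by ldpy2ldc.py; it only shows the general technique.
 */
typedef struct {
    size_t n_classes;
    size_t n_features;
    const char *const *class_names;  /* n_classes language codes */
    const double *log_prior;         /* n_classes entries */
    const double *log_likelihood;    /* n_features * n_classes, row-major */
} embedded_model;

/* Toy values invented for this example, not trained parameters. */
static const char *const toy_names[] = { "en", "de" };
static const double toy_prior[] = { -0.6931, -0.6931 };
static const double toy_ll[] = { -0.1, -2.0, -2.0, -0.1 };

static const embedded_model toy_model = {
    2, 2, toy_names, toy_prior, toy_ll
};

/* The data lives in the executable image, so no file I/O or parsing
 * is needed at start-up. */
const embedded_model *load_builtin_model(void)
{
    return &toy_model;
}
```

The trade-off of this approach is that changing the model requires
recompiling, whereas a protocol-buffer file can be swapped at run time.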

Dependencies
------------
- Protocol buffers [4]
- protobuf-c [5]

Contact
-------
Marco Lui <saffsd@gmail.com>

References
----------
[1] http://aclweb.org/anthology-new/I/I11/I11-1062.pdf
[2] https://github.com/saffsd/langid.py
[3] https://code.google.com/p/cld2/
[4] https://github.com/google/protobuf/
[5] https://github.com/protobuf-c/protobuf-c

About

Pure C natural language identifier with support for 97 languages
