-
Notifications
You must be signed in to change notification settings - Fork 11
Pure C natural language identifier with support for 97 languages
License
saffsd/langid.c
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
================ ``langid.c`` readme ================ Introduction ------------ `langid.c` is an experimental implementation of the language identifier described by [1] in pure C. It is largely based on the design of `langid.py`[2], and uses `langid.py` to train models. Planned features ---------------- See TODO Speed ----- Initial comparisons against Google's cld2[3] suggest that `langid.c` is about twice as fast. (langid.c) @mlui langid.c git:[master] wc -l wikifiles 28600 wikifiles (langid.c) @mlui langid.c git:[master] time cat wikifiles | ./compact_lang_det_batch > xxx cat wikifiles 0.00s user 0.00s system 0% cpu 7.989 total ./compact_lang_det_batch > xxx 7.77s user 0.60s system 98% cpu 8.479 total (langid.c) @mlui langid.c git:[master] time cat wikifiles | ./langidOs -b > xxx cat wikifiles 0.00s user 0.00s system 0% cpu 3.577 total ./langidOs -b > xxx 3.44s user 0.24s system 97% cpu 3.759 total (langid.c) @mlui langid.c git:[master] wc -l rcv2files 20000 rcv2files (langid.c) @mlui langid.c git:[master] time cat rcv2files | ./langidO2 -b > xxx cat rcv2files 0.00s user 0.00s system 0% cpu 31.702 total ./langidO2 -b > xxx 8.23s user 0.54s system 22% cpu 38.644 total (langid.c) @mlui langid.c git:[master] time cat rcv2files | ./compact_lang_det_batch > xxx cat rcv2files 0.00s user 0.00s system 0% cpu 18.343 total ./compact_lang_det_batch > xxx 18.14s user 0.53s system 97% cpu 19.155 total Model Training -------------- Google's protocol buffers [4] are used to transfer models between languages. The Python program `ldpy2ldc.py` can convert a model produced by langid.py [2] into the protocol-buffer format, and also the C source format used to compile an in-built model directly into executable. Dependencies ------------ Protocol buffers [4] protobuf-c [5] Contact ------- Marco Lui <saffsd@gmail.com> References ---------- [1] http://aclweb.org/anthology-new/I/I11/I11-1062.pdf [2] https://github.com/saffsd/langid.py [3] https://code.google.com/p/cld2/ [4] https://github.com/google/protobuf/ [5] https://github.com/protobuf-c/protobuf-c
About
Pure C natural language identifier with support for 97 languages
Resources
License
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published