libvarnam is a cross-platform, self-learning, open source library which supports transliteration and reverse transliteration for Indian languages. At its core is a C shared library providing algorithms and patterns for transliteration.
libvarnam has a simple learning module built-in which can learn words to improve the transliteration experience.
wget http://download.savannah.gnu.org/releases/varnamproject/libvarnam/source/libvarnam-$VERSION.tar.gz
tar -xvf libvarnam-$VERSION.tar.gz
cd libvarnam-$VERSION
cmake . && make
sudo make install
This will install the libvarnam shared libraries and the varnamc command line utility. varnamc can be used to quickly try out varnam.
Installation on Windows
On Windows, you can compile libvarnam using Visual Studio. Use the following cmake command to generate the project files.
cmake -DBUILD_TESTS=false -DBUILD_VST=false -DRUN_TESTS=false .
Usage: varnamc -s lang_code -t word
varnamc -s ml -t varnam
വർണം
വർണമേറിയത്
Usage: varnamc -s lang_code -r word
varnamc -s ml -r വർണം
varnam
libvarnam is a learning system, and it works better once it has learned from a word corpus. You can obtain a word corpus and make varnam learn all the words in it. This enables libvarnam to provide intelligent suggestions.
Here is an example of loading the Malayalam word corpus:
mkdir words
cd words
wget http://download.savannah.gnu.org/releases/varnamproject/words/ml/ml.tar.gz
tar -xvf ml.tar.gz
varnamc -s ml --learn-from .
This will take some time, depending on how many words you are loading.
There is also an --import-learnings-from option to import files that already contain learned data. Importing these files takes much less time than learning from the word corpus.
If you just want to use varnam for input, you have the following options
If you are a programmer, you will be interested in libvarnam. You can use it to provide Indian language support in your applications.
libvarnam can be used from different programming languages.
How Varnam works
- Scheme files and symbol tables
Scheme files and symbol tables
A scheme file maps English letters to their phonetically equivalent Indic letters. In it, all vowels, consonants and consonant clusters are mapped to their Indic equivalents. Varnam uses the scheme file mapping to perform transliteration.
Scheme files are plain text but use a custom DSL to make the mapping easier. This DSL is implemented in Ruby, so a scheme file can contain any valid Ruby code. The DSL also provides many helper functions to make the mapping easier.
The schemes/ directory contains the scheme files for all supported languages. Each language is represented by its ISO language code.
The compiled version of a scheme file is called a Varnam Symbol Table (vst). The compilation is done using the varnamc command line utility:
varnamc --compile schemes/ml
Symbol tables are binary representations of the plain text scheme files. They also contain other metadata items to make lookup easier.
libvarnam understands only the symbol table format. Because of this, every scheme file must be compiled into the vst format before it can be used with varnam.
can be used to compile all scheme files present in the schemes directory.
Symbol table lookup
Varnam can be initialized with just the ISO language code. When this happens, varnam scans the following directories and tries to find a matching symbol table file. If one is found, it is loaded and used for all operations.
varnam_transliterate(varnam *handle, const char *input, varray **output);
This is the entry point for transliteration. Transliteration converts the input to its phonetically equivalent Indic text. It also provides a set of possible matches for the given input.
Transliteration does the following steps under the hood:
It performs tokenization on the input. Varnam uses a greedy tokenizer which processes the input from left to right. The tokenizer tries all possible combinations to generate the longest possible tokens for the given input. These tokens are generated using the symbol table which is provided to varnam.
The generated tokens are assembled and varnam computes all possibilities of these tokens. Assume the input is malayalam: varnam generates tokens like മ, ല, യാ, ളം ([ma], [la], [ya], [lam]) and many others. Once these tokens are generated, they are combined and tested against the learning model to get rid of garbage values and come up with the most used words. Words are sorted by frequency and returned to the caller.
Most of the processing in varnam is language agnostic and should work fine for all Indian languages. However, language specific fixes are sometimes required. Varnam handles these using renderers. Any language can register renderers, and varnam invokes them just before rendering the final output. A renderer can contain language specific rules which can't be generalized otherwise.
varnam_learn(varnam *handle, const char *word);
Varnam can learn new words. The more words it learns, the better it performs. The learning process learns each word and its patterns.
The learning process persists the following data:
- Patterns: All English combinations which can be used to input the given Indic text
- Words: Indic text itself
- Prefixes: Prefixes of patterns and words
When an Indic word is learned, varnam tokenizes the word using the symbol table and tries to learn all possible patterns that can be used to input the word. Internally, varnam keeps a prefix tree and the frequencies of all patterns. This storage structure allows varnam to retrieve matching words efficiently when a pattern is presented. Basic stemming is also performed while learning words.
When the same word/pattern combination is learned again, varnam updates the frequency with which it has seen this pattern. This frequency is used to sort and pick the best candidate while performing transliteration.
Learning can be initiated by calling Varnam APIs directly or using varnamc.
Input tools like ibus-engine will automatically learn the words that you are typing.
Learned data is kept in one of the following locations:
- APPDATA\varnam\suggestions (Windows)
Mozilla Public License
Copyright (c) 2016 Navaneeth.K.N
This Source Code Form is subject to the terms of the Mozilla Public License, v. 2.0. If a copy of the MPL was not distributed with this file, You can obtain one at https://mozilla.org/MPL/2.0/.