docs.txt

For the most important information, please read the README file in this
directory.

Implementation details:

1) General rules
    Rules are generated by the Python script `build_predicates.py` in the
    `databases` directory and stored in `rules.pl`. As its source, the script
    uses the *.txt files in the `databases` directory (which are heavily based
    on https://en.m.wikibooks.org/wiki/Czech/Numbers).
    The script strips diacritics so that input tokens are accepted both with
    and without diacritics. The source files have to contain every inflected
    form of the input tokens that is needed (for example the 7th case for
    division, as in "sto deleno peti").
    We wanted to implement automatic inflection, because the author focuses on
    it in his thesis, but the current inflection models support nouns only (and
    are therefore not useful for numerals).
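
    The diacritics removal can be pictured with a short Python sketch. It is
    not taken from `build_predicates.py`: the function name `strip_diacritics`
    and the approach (Unicode NFD decomposition, then dropping the combining
    marks) are assumptions made for illustration only.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Return `text` with accents removed (hypothetical helper)."""
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only the characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

    With such a helper, "děleno pěti" and "deleno peti" reduce to the same
    token sequence, so one rule can cover both spellings.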

2) Tests
    All tests are in `tests.pl`. They can be run as follows:
    
    # Run SWI-Prolog
    > swipl
    
    # Compile the tests
    ?- [tests].
    true.
    
    # Run tests.
    ?- test.

    This runs a number of tests, which are self-explanatory.

3) Expression parsing
    Parsing is implemented in `process_expression.pl` and
    `process_integer.pl`. The predicates are explained in the source code; as
    an overview, the basic procedure is:
    - the expression string is split into tokens.
    - the list of tokens is processed from the beginning, always looking for
    the longest valid sequence that represents a unit (our name for a part of
    an expression: an integer unit representing one integer, an operator unit
    representing an operator, or a parenthesis unit).
    - the units are validated independently, and each unit yields one token in
    the expression list, which contains numbers and operators.
    - the most interesting and non-trivial part is parsing a string that
    represents an integer (in `process_integer.pl`). We illustrate it with the
    example string "padesat milionu sto dvanact tisic pet set osmdesat devet".

        # input: padesat milionu sto dvanact tisic pet set osmdesat devet
        
        I) The list of integer tokens is split into smaller parts called
        components, each of them representing thousands ('tisic'), millions
        ('milion'), billions ('miliarda'), etc.
        
        # components:   padesat milionu; 
                        sto dvanact tisic; 
                        pet set osmdesat devet (to be exact, this is parsed as
                        an element, not a component)
        
        Each component is then decomposed into its identifier part ('tisic',
        'milion', ...) and its element part (our name for the part carrying
        the actual value: 'sto padesat sest' in the component
        'sto padesat sest tisic').

        # components:   ID: milionu; element: padesat
                        ID: tisic; element: sto dvanact
                        element: pet set osmdesat devet
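
        The split into identifier and element parts can be sketched in Python
        as follows. This is an illustration only, not a transcription of
        `process_integer.pl`: the name `split_components` and the small
        `IDENTIFIERS` table are hypothetical, and the real vocabulary lives in
        `rules.pl`.

```python
# Hypothetical, hand-written subset of identifier forms (without diacritics).
IDENTIFIERS = {
    "tisic": 10**3, "tisice": 10**3,
    "milion": 10**6, "miliony": 10**6, "milionu": 10**6,
    "miliarda": 10**9, "miliardy": 10**9, "miliard": 10**9,
}


def split_components(tokens):
    """Split integer tokens into (identifier, element_tokens) pairs.

    Tokens left over after the last identifier form a bare element and are
    returned with the identifier set to None.
    """
    components = []
    element = []
    for token in tokens:
        if token in IDENTIFIERS:
            # An identifier closes the element collected so far.
            components.append((token, element))
            element = []
        else:
            element.append(token)
    if element:
        components.append((None, element))
    return components
```

        For the example input this yields the three parts listed above:
        ('milionu', ['padesat']), ('tisic', ['sto', 'dvanact']) and
        (None, ['pet', 'set', 'osmdesat', 'devet']). The sketch does no
        validation, for example of the order of the identifiers.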

        The element part is decomposed further into a hundreds part and a
        remainder below one hundred. The remainder is either a tens part
        ('dvacet') followed by a digits part ('pet'), or a single teens part
        ('sestnact').

        # elements:     padesat     ->  hundreds part:  []
                                        tens part:      padesat
                        sto dvanact ->  hundreds part:  sto
                                        teens part:     dvanact
                        pet set osmdesat devet
                                    ->  hundreds part:  pet set
                                        tens part:      osmdesat
                                        digits part:    devet
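
        As a rough Python illustration of this last step, an element can be
        evaluated by consuming an optional hundreds part and then either a
        teens part or a tens part plus a digits part. The function names and
        the word tables below are assumptions (hand-written, without
        diacritics, and not the contents of `rules.pl`); the Prolog predicates
        in `process_integer.pl` may be organised differently.

```python
DIGITS = {"jedna": 1, "dva": 2, "tri": 3, "ctyri": 4, "pet": 5,
          "sest": 6, "sedm": 7, "osm": 8, "devet": 9}
TEENS = {"deset": 10, "jedenact": 11, "dvanact": 12, "trinact": 13,
         "ctrnact": 14, "patnact": 15, "sestnact": 16, "sedmnact": 17,
         "osmnact": 18, "devatenact": 19}
TENS = {"dvacet": 20, "tricet": 30, "ctyricet": 40, "padesat": 50,
        "sedesat": 60, "sedmdesat": 70, "osmdesat": 80, "devadesat": 90}
HUNDREDS = {("sto",): 100, ("dve", "ste"): 200, ("tri", "sta"): 300,
            ("ctyri", "sta"): 400, ("pet", "set"): 500, ("sest", "set"): 600,
            ("sedm", "set"): 700, ("osm", "set"): 800, ("devet", "set"): 900}


def element_value(tokens):
    """Return the value (1..999) of an element, or None if it is invalid."""
    rest = list(tokens)
    value = 0
    # Hundreds part: try the two-token forms ('pet set') before 'sto'.
    for length in (2, 1):
        key = tuple(rest[:length])
        if key in HUNDREDS:
            value += HUNDREDS[key]
            rest = rest[length:]
            break
    # Teens part: a single token that replaces tens and digits.
    if len(rest) == 1 and rest[0] in TEENS:
        return value + TEENS[rest[0]]
    # Otherwise an optional tens part followed by an optional digits part.
    if rest and rest[0] in TENS:
        value += TENS[rest[0]]
        rest = rest[1:]
    if rest and rest[0] in DIGITS:
        value += DIGITS[rest[0]]
        rest = rest[1:]
    if rest or value == 0:
        return None  # leftover tokens or an empty element
    return value


def integer_value(components):
    """Sum multiplier * element value over (multiplier, element_tokens)."""
    total = 0
    for multiplier, element_tokens in components:
        value = element_value(element_tokens)
        if value is None:
            return None
        total += multiplier * value
    return total
```

        For the example, the three elements evaluate to 50, 112 and 589, and
        combining them with the multipliers 10**6, 10**3 and 1 gives 50112589.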

4) Expression evaluation
    Evaluation is performed by a simple Prolog DCG (Definite Clause Grammar).
    It takes an expression represented as a list of tokens (numbers,
    operators, parentheses) and evaluates it.
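
    For readers unfamiliar with DCGs, the same idea can be sketched in Python
    as a recursive-descent evaluator over a token list. This is an analogue
    only, not the project's grammar: the function name `evaluate`, the
    operator set (+, -, *, /), the usual precedence, and the use of true
    division are all assumptions.

```python
def evaluate(tokens):
    """Evaluate a token list such as [2, '+', 3, '*', '(', 4, '-', 1, ')']."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def factor():
        # factor := number | '(' expression ')'
        token = peek()
        if token == "(":
            advance()
            value = expression()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            advance()
            return value
        if isinstance(token, (int, float)):
            advance()
            return token
        raise ValueError(f"unexpected token: {token!r}")

    def term():
        # term := factor (('*' | '/') factor)*
        value = factor()
        while peek() in ("*", "/"):
            op = peek()
            advance()
            right = factor()
            value = value * right if op == "*" else value / right
        return value

    def expression():
        # expression := term (('+' | '-') term)*
        value = term()
        while peek() in ("+", "-"):
            op = peek()
            advance()
            right = term()
            value = value + right if op == "+" else value - right
        return value

    result = expression()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result
```

    Where this sketch raises ValueError on malformed input, a Prolog DCG
    simply fails to produce a parse.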

5) Handling of incorrect input
    Error handling is not provided. If the user enters completely incorrect
    input, the predicate simply fails.
    On the other hand, some technically incorrect phrases that mix several
    ways of expressing the same thing are accepted (e.g. 'dva tisic' instead
    of the correct 'dva tisice', or even 'peti miliardou' instead of the
    correct 'peti miliardami').

6) Possible improvements
    If we had enough time and wanted to create a really useful library, we
    could add support for:
        - automatic inflection of the rules (see section 1)
        - floats and fractions (it would be easy to add further kinds of
        units, and the integer parsing would need to be extended to support
        the notation of floats and fractions)
        - "composed" tens in Czech (such as "dvaadvacet")
        - other kinds of brackets (curly braces, square brackets, etc.)