For the most important information, please read the README file in this directory.
Implementation details:
1) general rules.
Rules are generated by a Python script `build_predicates.py` in the
`databases` directory, and stored in `rules.pl`. As a source it uses the
*.txt files in the `databases` directory (which are heavily based on
https://en.m.wikibooks.org/wiki/Czech/Numbers).
The script removes diacritics so that input tokens are accepted both with
and without diacritics. The source files have to contain every inflected
form of the input tokens that is needed (e.g. the instrumental, i.e. 7th,
case for division: "sto deleno peti").
We originally wanted to generate the inflected forms automatically, since
the author works on automatic inflection in his thesis, but the current
inflection models support nouns only and are therefore not useful for
numerals.
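The diacritics removal described above can be sketched in a few lines of
Python. This is only an illustration and not the code of
`build_predicates.py`: the function names `strip_diacritics` and
`token_variants` are hypothetical, and we assume that Unicode decomposition
is an acceptable way to drop the accents.

```python
import unicodedata


def strip_diacritics(text):
    """Return text with all combining marks removed."""
    # NFD splits every accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks and keep every other character unchanged.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def token_variants(token):
    """Both accepted spellings of a source token (hypothetical helper)."""
    return {token, strip_diacritics(token)}
```

A rule generator could then emit one rule per variant, so that both
spellings of a token are accepted.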
2) tests.
All tests are in `tests.pl`. They can be run as follows:
# Run SWI-Prolog
> swipl
# Compile tests
?- [tests].
true.
# Run tests
?- test.
This runs a set of self-explanatory tests.
3) expression parsing
Is implemented in `process_expression.pl` and `process_integer.pl`.
The predicates are explained in the source code; as an overview, the basic
procedure is:
- the expression string is split into tokens
- the list of tokens is processed from the beginning, always trying to find
the longest valid sequence representing a unit (our name for a part of an
expression: an integer unit representing one integer, an operator unit
representing an operator, or a parenthesis unit).
- the units are validated independently, and every unit results in one token
in the expression list containing numbers and operators.
- the most interesting and non-trivial part is parsing a string
representing an integer (in `process_integer.pl`). We demonstrate it on the
example string "padesat milionu sto dvanact tisic pet set osmdesat devet".
# input: padesat milionu sto dvanact tisic pet set osmdesat devet
I) It splits the list of integer tokens into smaller parts (called
components), each of them representing thousands ('tisic'), millions
('milion'), billions ('miliarda') etc.
# components: padesat milionu;
sto dvanact tisic;
pet set osmdesat devet (to be exact, this is parsed as
an element, not a component)
Each component is then decomposed further into its identifier part
('tisic', 'milion', ...) and its element part (our name for the part
representing the actual value: 'sto padesat sest' in the component
'sto padesat sest tisic').
# components: ID: milionu; element: padesat
ID: tisic; element: sto dvanact
element: pet set osmdesat devet
The element part is decomposed even further into a hundreds part and a
remainder below one hundred, which is either a tens part ('dvacet')
followed by a digits part ('pet'), or a single teens part ('sestnact').
# elements: padesat -> hundreds part: []; tens part: padesat
sto dvanact -> hundreds part: sto; teens part: dvanact
pet set osmdesat devet -> hundreds part: pet set; tens part: osmdesat; digits part: devet
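The decomposition above (components, then identifier and element, then
hundreds, tens, digits or teens) can be sketched in Python. This is only an
illustration of the idea, not a translation of `process_integer.pl`: the
lookup tables, the names `parse_element` and `parse_integer`, the rule that
a bare identifier counts as one, and the rule that identifiers must appear
in decreasing order are all our assumptions. The tables contain only a
small subset of the diacritics-free forms.

```python
# Hypothetical lookup tables (small subset, diacritics-free forms only).
DIGITS = {"jedna": 1, "dva": 2, "dve": 2, "tri": 3, "ctyri": 4, "pet": 5,
          "sest": 6, "sedm": 7, "osm": 8, "devet": 9}
TEENS = {"deset": 10, "jedenact": 11, "dvanact": 12, "trinact": 13,
         "ctrnact": 14, "patnact": 15, "sestnact": 16, "sedmnact": 17,
         "osmnact": 18, "devatenact": 19}
TENS = {"dvacet": 20, "tricet": 30, "ctyricet": 40, "padesat": 50,
        "sedesat": 60, "sedmdesat": 70, "osmdesat": 80, "devadesat": 90}
HUNDREDS = {("dve", "ste"): 200, ("tri", "sta"): 300, ("ctyri", "sta"): 400,
            ("pet", "set"): 500, ("sest", "set"): 600, ("sedm", "set"): 700,
            ("osm", "set"): 800, ("devet", "set"): 900}
IDENTIFIERS = {"tisic": 10**3, "tisice": 10**3,
               "milion": 10**6, "miliony": 10**6, "milionu": 10**6,
               "miliarda": 10**9, "miliardy": 10**9, "miliard": 10**9}


def parse_element(tokens):
    """Value of an element (1..999), or None if the tokens are not valid."""
    rest = list(tokens)
    value = 0
    # Hundreds part: 'sto' or a two-token form such as 'pet set'.
    if rest[:1] == ["sto"]:
        value, rest = 100, rest[1:]
    elif tuple(rest[:2]) in HUNDREDS:
        value, rest = HUNDREDS[tuple(rest[:2])], rest[2:]
    # Teens part: one token that stands for both the tens and digits part.
    if len(rest) == 1 and rest[0] in TEENS:
        return value + TEENS[rest[0]]
    # Tens part, then digits part; both are optional.
    if rest and rest[0] in TENS:
        value, rest = value + TENS[rest[0]], rest[1:]
    if rest and rest[0] in DIGITS:
        value, rest = value + DIGITS[rest[0]], rest[1:]
    # Leftover tokens or an empty element mean the input is not valid.
    return value if not rest and value > 0 else None


def parse_integer(text):
    """Value of a whole integer string, or None if it cannot be parsed."""
    tokens = text.split()
    total, element, last_scale = 0, [], None
    for token in tokens:
        if token not in IDENTIFIERS:
            element.append(token)  # still collecting the element part
            continue
        scale = IDENTIFIERS[token]
        # Assumption: components appear in decreasing order of magnitude.
        if last_scale is not None and scale >= last_scale:
            return None
        # Assumption: a bare identifier ('tisic') counts as one.
        count = parse_element(element) if element else 1
        if count is None:
            return None
        total += count * scale
        element, last_scale = [], scale
    if element:  # trailing element without an identifier
        tail = parse_element(element)
        if tail is None:
            return None
        total += tail
    return total if tokens else None
```

With these tables, the example string evaluates to 50112589. Because the
sketch does not check agreement between the element and the identifier, a
phrase such as 'dva tisic' is accepted as well.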
4) expression evaluation
Is performed by a simple Prolog DCG (Definite Clause Grammar). It takes an
expression represented by a list of tokens (numbers, operators,
parentheses) and evaluates it.
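For readers who do not know DCGs, the same idea can be sketched as a small
recursive-descent evaluator in Python. This is a different technique from
the DCG used in the project and only illustrates how a token list is
evaluated. The function name `evaluate`, the operator symbols '+', '-',
'*', '/' and the use of Python's true division are assumptions.

```python
def evaluate(tokens):
    """Evaluate a token list such as [2, '+', 3, '*', '(', 4, '-', 1, ')']."""
    pos = 0  # index of the next unread token

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def factor():
        # factor := number | '(' expr ')'
        token = peek()
        if token == "(":
            advance()
            value = expr()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            advance()
            return value
        if isinstance(token, (int, float)) and not isinstance(token, bool):
            advance()
            return token
        raise ValueError("unexpected token: %r" % (token,))

    def term():
        # term := factor (('*' | '/') factor)*
        value = factor()
        while peek() in ("*", "/"):
            op = peek()
            advance()
            right = factor()
            value = value * right if op == "*" else value / right
        return value

    def expr():
        # expr := term (('+' | '-') term)*
        value = term()
        while peek() in ("+", "-"):
            op = peek()
            advance()
            right = term()
            value = value + right if op == "+" else value - right
        return value

    result = expr()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the expression")
    return result
```

Unlike the Prolog predicate, which is simply false for invalid input, this
sketch raises `ValueError`.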
5) handling of incorrect input
Is not provided. If the user enters completely incorrect input, the
predicate is simply false.
On the other hand, some grammatically incorrect phrases are accepted,
namely those that mix several valid ways of expressing something (e.g.
'dva tisic' instead of the correct 'dva tisice' or even 'peti miliardou'
instead of the correct 'peti miliardami' etc.).
6) possible improvements
If we had enough time and wanted to create a really useful library, we
could add support for:
- automatic inflection of the rules (see section 1)
- floats and fractions (it would be simple to add further unit types, and
the parsing of integers would need to be extended to support the
notations of floats and fractions)
- "composed" tens in Czech (such as "dvaadvacet")
- other kinds of brackets (curly braces, square brackets etc.)