6f2bfae @hxw Add a list of bugs and known problems
hxw authored
BUGS and Problems
=================

1. Missing Fonts

A few fonts are missing Cyrillic glyphs, so Russian Wikipedia shows a
few missing characters (rendered as the "box" character).

The Korean and Arabic character sets are absent from all fonts.


2. Restricted search index character set

The index is restricted to the set [A-Z0-9] and some punctuation
characters in order to speed up searching and reduce the size of the
index files. This leads to problems with non-Latin letters.

These are described below:

a. All accents are stripped, i.e. everything that looks like 'A'
   (e.g. "aāáăàȧĀÁĂÀȦ" etc.) is converted to 'A'.

   This uses the Python function: unicodedata.normalize('NFD', text)

b. Japanese is handled as a special case using a two-stage
   translation. Stage one uses a dictionary (currently MeCab) to
   translate to Katakana. Stage two translates Katakana and Hiragana
   to Romaji. This is only activated if the language is set to "ja".

c. Chinese is translated character by character to Pinyin. Accent
   stripping causes both 西安 and 先 to convert to "xian", so the index
   sort order is not as would be expected.

d. Korean, Cyrillic, Greek, Coptic... are looked up in the Unicode
   tables provided by Python's unicodedata.name() (in Python 2.6 these
   tables are missing some characters).

   e.g. unicodedata.name(u'서')
        returns: 'HANGUL SYLLABLE SEO'
        therefore 'SEO' will be used to represent the '서' character.

   Notes: For Cyrillic, some extra 'H' and 'E' letters are dropped
          from the name to make typing easier.

          Katakana and Hiragana will also be processed by this method
          except when the Japanese dictionary is in use; the result
          will not be the same as Romaji.

e. Ligatures like "æœĳ" are replaced by "ae", "oe" and "ij"
   respectively.

f. Some special letters are also converted.

   e.g. "ÐðÞþ" (eth and thorn, used in Icelandic, are represented
        by "eth" and "th")

g. Anything left over is unchanged and eventually ends up being
   dropped.

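Rules (a), (d), (e) and (f) above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions, not the
code used to build the index: it is written for Python 3.7 or later
rather than Python 2.6, the SPECIAL table and all function names are
hypothetical, and the Cyrillic 'H'/'E' dropping, the Japanese
dictionary and the Pinyin table are left out.

```python
import unicodedata

# Hypothetical replacement table for rules (e) and (f); the real table
# is not shown in this file. "\u0133" is the "ij" ligature.
SPECIAL = {
    "æ": "ae", "œ": "oe", "\u0133": "ij",
    "Ð": "eth", "ð": "eth",
    "Þ": "th", "þ": "th",
}


def strip_accents(text):
    """Rule (a): decompose with NFD, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def name_suffix(char):
    """Rule (d): use the last word of the Unicode character name."""
    name = unicodedata.name(char, "")
    return name.split()[-1] if name else char


def translate_char(char):
    if char in SPECIAL:
        return SPECIAL[char]
    stripped = strip_accents(char)
    if stripped.isascii():
        return stripped
    # Look up the original character: NFD would split '서' into jamo.
    return name_suffix(char)


def translate(text):
    return "".join(translate_char(c) for c in text).upper()


print(translate("aāáăàȧ"))  # AAAAAA
print(translate("서"))       # SEO
print(translate("æœþ"))     # AEOETH
```

The name lookup is applied to the original character rather than to
its NFD form, because NFD also decomposes Hangul syllables and some
Cyrillic letters, which would change the name that is found.
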
When the index is prepared from the string as translated by the
rules above, any character that is not in the limited [A-Z0-9] plus
punctuation set is simply dropped. The sort order is then based on
these modified strings. The original string is kept for display, so
the search results can appear to be out of order.


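The effect on sort order can be shown with a small Python sketch. It
is a simplified stand-in for the real index builder: the punctuation
in ALLOWED is a guess (this file only says "some punctuation"), the
three-entry PINYIN table is hypothetical, and only accent stripping
and the Pinyin step are applied before filtering.

```python
import string
import unicodedata

# Assumption: the exact punctuation set is not given in this file.
ALLOWED = set(string.ascii_uppercase + string.digits + " -.,'")

# Hypothetical three-entry Pinyin table standing in for rule (c).
PINYIN = {"西": "xī", "安": "ān", "先": "xiān"}


def index_key(title):
    """Translate, strip accents, upper-case, then drop disallowed chars."""
    pinyin = "".join(PINYIN.get(c, c) for c in title)
    decomposed = unicodedata.normalize("NFD", pinyin)
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return "".join(c for c in no_marks.upper() if c in ALLOWED)


titles = ["先", "Tokyo", "Østfold", "西安", "Paris", "Oslo"]

# Sort on the reduced key, but keep the original title for display.
ordered = sorted(titles, key=index_key)

for title in ordered:
    print(index_key(title), "<-", title)
```

In this sketch 'Ø' has no NFD decomposition, so it is dropped and
"Østfold" is sorted under 'S', while 西安 and 先 share the key "XIAN"
and keep their input order relative to each other.
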
3. Keyboard

There is only a basic QWERTY keyboard plus a second numbers and
punctuation layout (the index process matches this character subset).
This makes entering text in other languages difficult in this version.