
Add a list of bugs and known problems

Signed-off-by: Christopher Hall <>
1 parent 84cc0e1 commit 6f2bfae531377bbde0c4e121a08ea57eaf0c66ec @hxw hxw committed May 12, 2010
Showing with 73 additions and 0 deletions.
  1. +73 −0 BUGS
@@ -0,0 +1,73 @@
+BUGS and Problems
+1. Missing Fonts
+ A few Cyrillic glyphs are missing from the fonts, so Russian
+ Wikipedia shows a few missing characters (the "box" character).
+ The Korean and Arabic character sets are absent from all fonts.
+2. Restricted search index character set
+ The index is restricted to the set [A-Z0-9] and some punctuation
+ characters in order to speed up the searching process and reduce the
+ size of the index files. This leads to problems with non-Latin
+ letters.
+ These are described below:
+ a. All accents are stripped, i.e. everything that looks like 'A'
+ (e.g. "aāáăàȧĀÁĂÀȦ" etc.) is converted to an 'A'.
+ This uses the Python function: unicodedata.normalize('NFD', text)
+ b. Japanese is handled as a special case using a two-stage
+ translation. Stage one uses a dictionary (currently MeCab) to
+ translate to Katakana. Stage two translates Katakana and
+ Hiragana to Romaji. This is only activated if the language is
+ set to "ja".
+ c. Chinese is translated character by character to Pinyin. Accent
+ stripping causes both 西安 and 先 to convert to "xian" so index
+ sort order is not as would be expected.
+ d. Korean, Cyrillic, Greek, Coptic... are looked up in the Unicode
+ tables provided by Python (in Python 2.6 these tables are
+ missing some characters),
+ e.g. the Unicode name of '서' is "HANGUL SYLLABLE SEO",
+ therefore 'SEO' will be used to represent the '서' character.
+ Notes: For Cyrillic, some extra 'H' and 'E' letters are dropped
+ from the name to make typing easier.
+ Katakana and Hiragana are also processed by this method,
+ except when using the Japanese dictionary; the result
+ will not be the same as Romaji.
+ e. Ligatures such as "æ", "œ" and "ĳ" are replaced by "ae", "oe"
+ and "ij" respectively.
+ f. Some special letters are also converted.
+ e.g. "ÐðÞþ" (eth and thorn are represented by "eth" and "th")
+ (Used in Icelandic)
+ g. Anything left over is unchanged and eventually ends up being
+ dropped.
+ When the index is prepared from the string as translated by the
+ rules above, any character that is not in the limited [A-Z0-9] plus
+ punctuation set is simply dropped. The sort order is then based on
+ these modified strings. The original string is kept for display, so
+ the search results can appear to be out of order.
+3. Keyboard
+ There is only a basic QWERTY keyboard plus a second numbers and
+ punctuation keyboard (the index process matches this character
+ subset). This makes supporting other languages difficult in this
+ version.
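
The index-key rules in item 2 of the BUGS file can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's actual indexer: the names `strip_accents`, `name_fallback`, `index_key`, `LIGATURES`, `SPECIALS` and `ALLOWED` are hypothetical, the allowed punctuation is assumed to be just a space, and the Japanese (MeCab), Pinyin and Cyrillic 'H'/'E' trimming steps are left out.

```python
import unicodedata

# Hypothetical replacement tables, based on rules (e) and (f) of the BUGS file.
LIGATURES = {"æ": "ae", "œ": "oe", "ĳ": "ij"}
SPECIALS = {"ð": "eth", "þ": "th"}

# Assumption: the file only says "[A-Z0-9] and some punctuation";
# in this sketch a space is the only punctuation that is kept.
ALLOWED = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 ")


def strip_accents(text):
    """Rule (a): decompose with NFD, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # NFD also splits Hangul syllables into jamo; NFC joins them again.
    return unicodedata.normalize("NFC", stripped)


def name_fallback(ch):
    """Rule (d): use the last word of the Unicode name of a letter or syllable."""
    name = unicodedata.name(ch, "")
    if "LETTER" in name or "SYLLABLE" in name:
        return name.split()[-1]
    # Rule (g): anything else is left unchanged (and is filtered out later).
    return ch


def index_key(text):
    """Build a restricted [A-Z0-9 ] key from an article title."""
    parts = []
    for ch in strip_accents(text).lower():
        if ch in LIGATURES:
            parts.append(LIGATURES[ch])
        elif ch in SPECIALS:
            parts.append(SPECIALS[ch])
        elif ord(ch) > 127:
            parts.append(name_fallback(ch))
        else:
            parts.append(ch)
    key = "".join(parts).upper()
    # Final step: drop every character outside the restricted set.
    return "".join(c for c in key if c in ALLOWED)
```

With this sketch, `index_key("서")` yields `"SEO"`, and the Pinyin strings `"xīān"` and `"xiān"` both collapse to `"XIAN"`, which is the sort-order collision described in rule (c).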
