Ukrainian koi8u char freq. tab + language model #2

Merged
merged 17 commits into from May 23, 2019
@@ -41,7 +41,7 @@ Pulls together:
 * Traditional and Simplified Chinese: Big5, GB18030, EUC-TW, HZ-GB-2312, ISO-2022-CN
 * Japanese: EUC-JP, SHIFT_JIS, ISO-2022-JP
 * Korean: EUC-KR, ISO-2022-KR
-* Cyrillic: KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, WINDOWS-1251
+* Cyrillic: KOI8-R, KOI8-U, MacCyrillic, IBM855, IBM866, ISO-8859-5, WINDOWS-1251
 * Hungarian: ISO-8859-2, WINDOWS-1250
 * Bulgarian: ISO-8859-5, WINDOWS-1251
 * English: WINDOWS-1252
@@ -32,14 +32,25 @@
 maxrank = 64

 def Usage():
-    print "Usage: mkchartoorder.py <charstats file> <text reference file>"
+    print "Usage: mkchartoorder.py [--apostrophe] <charstats file> <text reference file>"
     sys.exit(1)

-if len(sys.argv) != 3:
+if len(sys.argv) < 3:
     Usage()

-charstats = sys.argv[1]
-reftext = sys.argv[2]
+if sys.argv[1] == "--apostrophe":
+    # required for Ukrainian because apostrophe is used as frequently used letter there
+    if len(sys.argv) != 4:
+        Usage()
+    apostrophe_code = 0x27
+    charstats = sys.argv[2]
+    reftext = sys.argv[3]
+else:
+    if len(sys.argv) != 3:
+        Usage()
+    apostrophe_code = -1
+    charstats = sys.argv[1]
+    reftext = sys.argv[2]

 # print "Charstats file:", charstats, "Ref text:", reftext

@@ -62,7 +73,7 @@ def Usage():
     # Eliminate the common control/punctuation areas. Note that this is only
     # the ascii control / punctuation because the winxxx encodings have
     # lexical characters in the 80-a0 area
-    if bytevalue <= 0x40 or \
+    if (bytevalue <= 0x40 and bytevalue != apostrophe_code) or \
        (bytevalue >= 0x5b and bytevalue <= 0x60) or \
        (bytevalue >= 0x7b and bytevalue <= 0x7f):
         continue
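
Note on the hunks above: the combined effect of the new `--apostrophe` option and the adjusted byte filter can be checked in isolation. The following is only a minimal sketch in Python 3 (the script itself uses Python 2 `print` statements); the helper names `parse_args` and `is_skipped` are hypothetical and do not exist in the PR, which keeps this logic inline.

```python
APOSTROPHE = 0x27  # ASCII code of "'"

USAGE = "Usage: mkchartoorder.py [--apostrophe] <charstats file> <text reference file>"


def parse_args(argv):
    """Hypothetical helper: return (apostrophe_code, charstats, reftext).

    Mirrors the argument handling in the diff: an optional leading
    --apostrophe flag followed by exactly two file arguments.
    """
    args = argv[1:]
    apostrophe_code = -1  # -1 never equals a byte value, so nothing is exempted
    if args and args[0] == "--apostrophe":
        apostrophe_code = APOSTROPHE
        args = args[1:]
    if len(args) != 2:
        raise SystemExit(USAGE)
    return apostrophe_code, args[0], args[1]


def is_skipped(bytevalue, apostrophe_code=-1):
    """Hypothetical helper: True if the byte is in the ASCII control or
    punctuation ranges that the script drops, except for the apostrophe
    when it has been exempted."""
    return ((bytevalue <= 0x40 and bytevalue != apostrophe_code)
            or 0x5b <= bytevalue <= 0x60
            or 0x7b <= bytevalue <= 0x7f)
```

Without the flag, byte 0x27 is dropped like any other ASCII punctuation; with `--apostrophe` it is kept and ranked with the letters, while the remaining bytes up to 0x40 are still skipped.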
@@ -12,6 +12,9 @@ Steps:
 - Produce character frequency table by running charstats on the chunk, as:
 mkcharstats french/french_cp1252.txt | sort -nr +2 > \
 french/charstats_french_cp1252.txt
+or (for other versions of sort)
+mkcharstats french/french_cp1252.txt | sort -nr -k3 > \
+french/charstats_french_cp1252.txt

 - Edit the resulting file, Just get rid of a few lines that break the
 following step (the first one, the last one and the one for space (0x20)