
Could not build data files #20

Closed
felixonmars opened this issue Aug 22, 2012 · 6 comments
@felixonmars
I'm trying to:

scons --prefix=/usr

mkdir -p $srcdir/$_gitname-build/raw
cd $srcdir/$_gitname-build/raw
tar xjvf ${srcdir}/lm_sc.t3g.arpa.tar.bz2
tar xjvf ${srcdir}/dict.utf8.tar.bz2

PATH=$PATH:$srcdir/$_gitname-build/src
make -f ../doc/SLM.mk slm_bin

but I got the following error when trying the genpyt part:

scons: done building targets.
lm_sc.t3g.arpa
dict.utf8
slmbuild -n 3 -w 200000 -c 0,2,2 -d ABS,0.0005 -d ABS -d ABS -b 10 -e 9 \
        -o lm_sc.3gram
Parameter input_file error

Usage:
  slmbuild options idngram

Description:
  This program generates a language model from an idngram file.

Options:
  -n --ngram     N            # 1 for unigram, 2 for bigram, 3 for trigram...
  -o --out       output       # output file name
  -l --log                    # using -log(pr), default use pr directly
  -w --wordcount N            # lexicon size, the number of different words
  -b --brk       id[,id...]   # set the ids which should be treated as breakers
  -e --exclude   id[,id...]   # set the ids which should not be put into the LM
  -c --cut       c1[,c2...]   # k-grams whose freq <= c[k] are dropped
  -d --discount  method,param # the k-th -d param specifies the discount method
      for k-gram. Possible values for method/param:
          GT,R,dis  : GT discount for r <= R, r is the freq of an ngram.
                      Linear discount for those r > R, i.e. r'=r*dis
                      0 < dis < 1.0, for example 0.999
          ABS,[dis] : Absolute discount r'=r-dis. dis is optional,
                      0 < dis < cut[k]+1.0, normally dis < 1.0.
          LIN,[dis] : Linear discount r'=r*dis. dis is optional,
                      0 < dis < 1.0

Notes:
      -n must be given before -c and -b. -c must give the right number of cut-offs,
  and -d must appear exactly N times, specifying the discount for 1-gram, 2-gram...,
  respectively.
      BREAKER-IDs can be SentenceTokens or ParagraphTokens. Conceptually,
  these ids have no meaning when they appear in the middle of an n-gram.
      EXCLUDE-IDs can be ambiguous ids. Conceptually, n-grams which
  contain those ids are meaningless.
      We cannot erase ngrams matching BREAKER-IDs and EXCLUDE-IDs directly
  from the IDNGRAM file, because some low-level information in it is still useful.

Example:
      The following example reads 'all.id3gram' and writes the trigram model 'all.slm'.
  At 1-gram level, use Good-Turing discount with cut-off 0, R=8, dis=0.9995. At
  2-gram level, use Absolute discount with cut-off 3, dis auto-calc. At 3-gram
  level, use Absolute discount with cut-off 2, dis auto-calc. Word ids 10,11,12
  are breakers (sentence/para/paper breaker, etc). Exclude-ID is 9. The lexicon
  contains 200000 words. The resulting language model uses -log(pr).

        slmbuild -l -n 3 -o all.slm -w 200000 -c 0,3,2 -d GT,8,0.9995
                 -d ABS -d ABS -b 10,11,12 -e 9 all.id3gram

make: *** [lm_sc.3gram] Error 100
make: *** Waiting for unfinished jobs....
genpyt -i dict.utf8 -s  -l pydict3_sc.log -o pydict3_sc.bin
Opening language model...open -l: No such file or directory
error!
make: *** [pydict3_sc.bin] Error 255
@bigeagle

The data file corpus.utf8 is missing, so the rule

mmseg_ids: ${DICT_FILE} ${CORPUS_FILE}
    mmseg -f bin -s 10 -a 9 -d ${DICT_FILE} ${CORPUS_FILE} > ${IDS_FILE}

would just fail.

@felixonmars
Author

I used strace to find out which command fails, and it turned out to be the genpyt step; execution never reaches mmseg.
Also, where can I get the file corpus.utf8?

@yongsun
Member

yongsun commented Aug 22, 2012

  1. download the *.tar.bz2 files from the open-gram project and place them under the 'raw' folder,
  2. download this Makefile to the 'data' folder: https://raw.github.com/sunpinyin/sunpinyin/d427362a4b784a6ce62bea6c37a6b1d0a115f299/data/Makefile
  3. run make under the 'data' folder

This is a temporary solution while we wait for caspervector@gmail.com to fix this ...
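The steps above can be sketched as a short shell session. The directory layout is an assumption based on this thread, and the actual download and build steps are left as comments because they need files that are not part of the repository:

```shell
# Run from the top of the sunpinyin source tree (assumed layout).
mkdir -p raw data

# 1. Place the open-gram archives, downloaded manually, under raw/:
#      raw/lm_sc.t3g.arpa.tar.bz2
#      raw/dict.utf8.tar.bz2

# 2. Save the pinned Makefile linked above as data/Makefile.

# 3. Build the lexicon data files (needs the files from steps 1-2):
#      cd data && make
```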

@ghost ghost assigned CasperVector Aug 22, 2012
@felixonmars
Author

Two problems:

  1. executable files such as ./genpyt are located in the 'src' folder, so putting the Makefile into 'data' and running make doesn't work; it works correctly only when put into 'src'.
  2. there's no install section in the Makefile, so there is no way to `make install` the built files. I've put them into '/usr/lib/sunpinyin/data/' as before.

many thanks.
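Since the Makefile has no install target, the manual install described above could be staged for packaging; a minimal sketch, assuming a DESTDIR-style staging root and the file names seen in the build log:

```shell
# Stage the built lexicon data into a package root instead of writing
# straight to /; the destination path matches the comment above.
DESTDIR=${DESTDIR:-pkgroot}
install -d "$DESTDIR"/usr/lib/sunpinyin/data
# Copy the generated files (names assumed from the build log in this issue):
#   install -m 644 lm_sc.3gram pydict3_sc.bin "$DESTDIR"/usr/lib/sunpinyin/data/
```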

@CasperVector
Member

Fixed in 65be3e1.
The issue was caused by make(1)'s dependency mechanism; separating the lexicon installation code into a separate Makefile fixes it easily.
Sorry for not testing before committing; I will try to avoid similar mistakes in the future :(

@CasperVector
Member

BTW, users can now refer to doc/README[.in] for instructions on installing the lexicon data files.
