
Issue to train a new pipeline #16

Closed
jplu opened this issue Feb 6, 2019 · 6 comments

jplu commented Feb 6, 2019

First of all, thanks a lot for this new neural NLP pipeline. I'm currently trying to train a new model for French with my data plus the UD datasets, but before that I would like to be able to properly reproduce the training steps. Right now, I'm struggling to train the tokenizer. Here is what I'm doing:

  • mkdir -p ./extern_data/word2vec
  • scripts/download_vectors.sh ./extern_data/word2vec/
  • mkdir -p ./data/uddata/
  • mkdir ./data/tokenize
  • cd ./data/uddata
  • git clone https://github.com/UniversalDependencies/UD_French-GSD.git
  • cd ../..
  • Setting the UDBASE env variable to ./data/uddata in scripts/config.sh (see the sketch after this list)
  • scripts/run_tokenize.sh UD_French-GSD
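
For reference, the UDBASE change is just a one-line edit in scripts/config.sh; a minimal sketch (the variable name comes from this thread, the path is my local layout):

  export UDBASE=./data/uddata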

And I get the following output with several errors:

Preparing tokenizer train data...
Traceback (most recent call last):
  File "stanfordnlp/utils/prepare_tokenizer_data.py", line 14, in <module>
    with open(args.plaintext_file, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: './data/uddata/UD_French-GSD/fr_gsd-ud-train.txt'
cp: cannot stat './data/uddata/UD_French-GSD/fr_gsd-ud-train.txt': No such file or directory
bash: warning: setlocale: LC_ALL: cannot change locale (fr_FR.UTF-8)
bash: warning: setlocale: LC_ALL: cannot change locale (fr_FR.UTF-8)
Preparing tokenizer dev data...
Traceback (most recent call last):
  File "stanfordnlp/utils/prepare_tokenizer_data.py", line 14, in <module>
    with open(args.plaintext_file, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: './data/uddata/UD_French-GSD/fr_gsd-ud-dev.txt'
cp: cannot stat './data/uddata/UD_French-GSD/fr_gsd-ud-dev.txt': No such file or directory
Traceback (most recent call last):
  File "stanfordnlp/utils/avg_sent_len.py", line 12, in <module>
    with open(toklabels, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: './data/tokenize/fr_gsd-ud-train.toklabels'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
TypeError: ceil() argument after * must be an iterable, not float
Running tokenizer with ...
usage: tokenizer.py [-h] [--txt_file TXT_FILE] [--label_file LABEL_FILE]
                    [--json_file JSON_FILE] [--mwt_json_file MWT_JSON_FILE]
                    [--conll_file CONLL_FILE] [--dev_txt_file DEV_TXT_FILE]
                    [--dev_label_file DEV_LABEL_FILE]
                    [--dev_json_file DEV_JSON_FILE]
                    [--dev_conll_gold DEV_CONLL_GOLD] [--lang LANG]
                    [--shorthand SHORTHAND] [--mode {train,predict}]
                    [--emb_dim EMB_DIM] [--hidden_dim HIDDEN_DIM]
                    [--conv_filters CONV_FILTERS] [--no-residual]
                    [--no-hierarchical] [--hier_invtemp HIER_INVTEMP]
                    [--input_dropout] [--conv_res CONV_RES]
                    [--rnn_layers RNN_LAYERS] [--max_grad_norm MAX_GRAD_NORM]
                    [--anneal ANNEAL] [--anneal_after ANNEAL_AFTER]
                    [--lr0 LR0] [--dropout DROPOUT]
                    [--unit_dropout UNIT_DROPOUT] [--tok_noise TOK_NOISE]
                    [--weight_decay WEIGHT_DECAY] [--max_seqlen MAX_SEQLEN]
                    [--batch_size BATCH_SIZE] [--epochs EPOCHS]
                    [--steps STEPS] [--report_steps REPORT_STEPS]
                    [--shuffle_steps SHUFFLE_STEPS] [--eval_steps EVAL_STEPS]
                    [--save_name SAVE_NAME] [--load_name LOAD_NAME]
                    [--save_dir SAVE_DIR] [--cuda CUDA] [--cpu] [--seed SEED]
tokenizer.py: error: argument --max_seqlen: expected one argument
Running tokenizer in predict mode
Directory saved_models/tokenize do not exist; creating...
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/jplu/stanfordnlp/stanfordnlp/models/tokenizer.py", line 182, in <module>
    main()
  File "/home/jplu/stanfordnlp/stanfordnlp/models/tokenizer.py", line 93, in main
    evaluate(args)
  File "/home/jplu/stanfordnlp/stanfordnlp/models/tokenizer.py", line 159, in evaluate
    mwt_dict = load_mwt_dict(args['mwt_json_file'])
  File "/home/jplu/stanfordnlp/stanfordnlp/models/tokenize/utils.py", line 10, in load_mwt_dict
    with open(filename, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: './data/tokenize/fr_gsd-ud-dev-mwt.json'
Traceback (most recent call last):
  File "stanfordnlp/utils/conll18_ud_eval.py", line 532, in <module>
    main()
  File "stanfordnlp/utils/conll18_ud_eval.py", line 500, in main
    evaluation = evaluate_wrapper(args)
  File "stanfordnlp/utils/conll18_ud_eval.py", line 483, in evaluate_wrapper
    system_ud = load_conllu_file(args.system_file)
  File "stanfordnlp/utils/conll18_ud_eval.py", line 477, in load_conllu_file
    _file = open(path, mode="r", **({"encoding": "utf-8"} if sys.version_info >= (3, 0) else {}))
FileNotFoundError: [Errno 2] No such file or directory: './data/tokenize/fr_gsd.dev.pred.conllu'
fr_gsd

Indeed, by checking the stanfordnlp/utils/prepare_tokenizer_data.py file, I can see that .txt files are needed, but they do not exist in the repo of the UD dataset I'm using. Any hint on where I can find them? Or, if they are not provided for free, what should they look like, so that I can create them from the conllu files?
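
For concreteness, here is the kind of reconstruction I have in mind, as a sketch (the script and its name are my own, not part of stanfordnlp; it assumes the .conllu file carries the standard "# text = ..." sentence comments):

# conllu_to_txt.py -- hypothetical helper: rebuild approximate plaintext
# from the "# text = ..." comments of a CoNLL-U file.
import sys

def conllu_to_txt(conllu_path, txt_path):
    with open(conllu_path, encoding='utf-8') as f_in, \
         open(txt_path, 'w', encoding='utf-8') as f_out:
        for line in f_in:
            if line.startswith('# text = '):
                # Everything after the marker is the raw sentence text.
                f_out.write(line[len('# text = '):].rstrip('\n') + ' ')
        f_out.write('\n')

if __name__ == '__main__':
    conllu_to_txt(sys.argv[1], sys.argv[2])

Note that the officially released .txt files also preserve document/paragraph breaks and original spacing (UD marks these with # newdoc, # newpar and SpaceAfter=No), so a reconstruction like this would only be an approximation.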

Thanks in advance :)

qipeng commented Feb 6, 2019

It looks like the git repo doesn't have the plaintext files necessary to train the tokenizer -- could you try downloading the treebank (with the plaintext inputs) from https://universaldependencies.org and see if that works?
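
For completeness, a sketch of that download step under the directory layout used earlier in this thread (<release-url> is a placeholder; take the actual .tgz link from the release page on https://universaldependencies.org, which points to the LINDAT/CLARIN archive):

wget -O ud-treebanks.tgz "<release-url>"
mkdir -p ./data/uddata
tar -xzf ud-treebanks.tgz -C ./data/uddata --strip-components=1  # archive layout may vary per release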

jplu commented Feb 7, 2019

Thanks a lot for your answer. I downloaded the data from the website you suggested, and indeed it is now training. Nevertheless, I still get an error from the stanfordnlp/utils/prepare_tokenizer_data.py file:

Traceback (most recent call last):
  File "stanfordnlp/utils/prepare_tokenizer_data.py", line 87, in <module>
    index, word_found = find_next_word(index, text, word, output)
  File "stanfordnlp/utils/prepare_tokenizer_data.py", line 39, in find_next_word
    assert text[index].replace('\n', ' ') == word[idx], "character mismatch: raw text contains |%s| but the next word is |%s|." % (word_sofar, word)
AssertionError: character mismatch: raw text contains |,| but the next word is |6|.

qipeng commented Feb 8, 2019

Hi @jplu, I just downloaded the UD treebanks v2.3 and attempted to replicate this, but I cannot reproduce the issue.

The error message you're seeing is most likely due to a mismatch between the conllu file and the txt file, which can happen if they come from different sources (git vs. the UD website, or different UD releases, for example). Could you double-check that you're using consistent .txt and .conllu files, and that neither is corrupted?
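
A quick way to check this, as a sketch (this script is my own, not part of the repo; it assumes the .conllu carries "# text = ..." comments and roughly mirrors the whitespace handling of prepare_tokenizer_data.py by treating newlines as spaces):

# check_txt_conllu.py -- hypothetical sanity check: does every sentence
# from the .conllu appear, in order, in the plaintext file?
import sys

def check(txt_path, conllu_path):
    with open(txt_path, encoding='utf-8') as f:
        text = f.read().replace('\n', ' ')
    pos = 0
    with open(conllu_path, encoding='utf-8') as f:
        for lineno, line in enumerate(f, 1):
            if line.startswith('# text = '):
                sent = line[len('# text = '):].rstrip('\n')
                found = text.find(sent, pos)
                if found < 0:
                    print('Mismatch near conllu line %d: %r' % (lineno, sent[:60]))
                    return False
                pos = found + len(sent)
    return True

if __name__ == '__main__':
    sys.exit(0 if check(sys.argv[1], sys.argv[2]) else 1)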

jplu commented Feb 9, 2019

Indeed, it works! Maybe there is a difference between the data in the GitHub repository and the data provided by the downloaded archive.

Thanks a lot :)

jplu closed this as completed Feb 9, 2019
yunlizzzzhu commented

Hi @qipeng, I ran into similar error messages when following the instructions at https://stanfordnlp.github.io/stanfordnlp/training.html.

When I run bash scripts/run_tokenize.sh UD_English-EWT --batch_size 32 --dropout 0.33, I get the following errors:

Running tokenizer with --batch_size 32 --dropout 0.33...
Running tokenizer in train mode
Traceback (most recent call last):
  File "E:\Anaconda\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "E:\Anaconda\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "F:\zhudanyang\stanfordnlp\stanfordnlp\models\tokenizer.py", line 182, in <module>
    main()
  File "F:\zhudanyang\stanfordnlp\stanfordnlp\models\tokenizer.py", line 91, in main
    train(args)
  File "F:\zhudanyang\stanfordnlp\stanfordnlp\models\tokenizer.py", line 96, in train
    mwt_dict = load_mwt_dict(args['mwt_json_file'])
  File "F:\zhudanyang\stanfordnlp\stanfordnlp\models\tokenize\utils.py", line 10, in load_mwt_dict
    with open(filename, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: './scripts/data/tokenize/en_ewt-ud-dev-mwt.json'
Running tokenizer in predict mode
Traceback (most recent call last):
 File "E:\Anaconda\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "E:\Anaconda\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "F:\zhudanyang\stanfordnlp\stanfordnlp\models\tokenizer.py", line 182, in <module>
    main()
  File "F:\zhudanyang\stanfordnlp\stanfordnlp\models\tokenizer.py", line 93, in main
    evaluate(args)
  File "F:\zhudanyang\stanfordnlp\stanfordnlp\models\tokenizer.py", line 159, in evaluate
    mwt_dict = load_mwt_dict(args['mwt_json_file'])
  File "F:\zhudanyang\stanfordnlp\stanfordnlp\models\tokenize\utils.py", line 10, in load_mwt_dict
    with open(filename, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: './scripts/data/tokenize/en_ewt-ud-dev-mwt.json'
Traceback (most recent call last):
  File "stanfordnlp/utils/conll18_ud_eval.py", line 532, in <module>
    main()
  File "stanfordnlp/utils/conll18_ud_eval.py", line 500, in main
    evaluation = evaluate_wrapper(args)
  File "stanfordnlp/utils/conll18_ud_eval.py", line 483, in evaluate_wrapper
    system_ud = load_conllu_file(args.system_file)
  File "stanfordnlp/utils/conll18_ud_eval.py", line 477, in load_conllu_file
    _file = open(path, mode="r", **({"encoding": "utf-8"} if sys.version_info >= (3, 0) else {}))
FileNotFoundError: [Errno 2] No such file or directory: './scripts/data/tokenize/en_ewt.dev.pred.conllu'
en_ewt --batch_size 32 --dropout 0.33

Could you please give me some advice? Thank you in advance :)

wangcug commented Dec 12, 2020

(Quoting @yunlizzzzhu's comment above in full.)

You can try to copy en_test-ud-dev-mwt.json and rename it to the missing en_ewt-ud-dev-mwt.json.
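
As a sketch, under the paths from the error message above (the source filename here is my guess at the intended file; use whichever *-mwt.json your data preparation actually produced):

cp ./scripts/data/tokenize/en_ewt-ud-test-mwt.json \
   ./scripts/data/tokenize/en_ewt-ud-dev-mwt.json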
