
Issue to train a new pipeline #16

Closed
jplu opened this issue Feb 6, 2019 · 6 comments

jplu commented Feb 6, 2019

First of all, thanks a lot for this new neural NLP pipeline. I'm currently trying to train a new model for French with my data plus the UD datasets, but before that I would like to be able to properly reproduce the training steps. Right now, I'm struggling to train the tokenizer. Here is what I'm doing:

  • mkdir -p ./extern_data/word2vec
  • scripts/download_vectors.sh ./extern_data/word2vec/
  • mkdir -p ./data/uddata/
  • mkdir ./data/tokenize
  • cd ./data/uddata
  • git clone https://github.com/UniversalDependencies/UD_French-GSD.git
  • cd ../..
  • Setting the UDBASE env variable to ./data/uddata in scripts/config.sh (see the sketch after this list)
  • scripts/run_tokenize.sh UD_French-GSD
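
For reference, the UDBASE change is just a one-line edit in scripts/config.sh; a minimal sketch (the variable name comes from this thread, the path is my local layout):

  export UDBASE=./data/uddata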

And I get the following output with several errors:

Preparing tokenizer train data...
Traceback (most recent call last):
  File "stanfordnlp/utils/prepare_tokenizer_data.py", line 14, in <module>
    with open(args.plaintext_file, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: './data/uddata/UD_French-GSD/fr_gsd-ud-train.txt'
cp: cannot stat './data/uddata/UD_French-GSD/fr_gsd-ud-train.txt': No such file or directory
bash: warning: setlocale: LC_ALL: cannot change locale (fr_FR.UTF-8)
bash: warning: setlocale: LC_ALL: cannot change locale (fr_FR.UTF-8)
Preparing tokenizer dev data...
Traceback (most recent call last):
  File "stanfordnlp/utils/prepare_tokenizer_data.py", line 14, in <module>
    with open(args.plaintext_file, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: './data/uddata/UD_French-GSD/fr_gsd-ud-dev.txt'
cp: cannot stat './data/uddata/UD_French-GSD/fr_gsd-ud-dev.txt': No such file or directory
Traceback (most recent call last):
  File "stanfordnlp/utils/avg_sent_len.py", line 12, in <module>
    with open(toklabels, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: './data/tokenize/fr_gsd-ud-train.toklabels'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
TypeError: ceil() argument after * must be an iterable, not float
Running tokenizer with ...
usage: tokenizer.py [-h] [--txt_file TXT_FILE] [--label_file LABEL_FILE]
                    [--json_file JSON_FILE] [--mwt_json_file MWT_JSON_FILE]
                    [--conll_file CONLL_FILE] [--dev_txt_file DEV_TXT_FILE]
                    [--dev_label_file DEV_LABEL_FILE]
                    [--dev_json_file DEV_JSON_FILE]
                    [--dev_conll_gold DEV_CONLL_GOLD] [--lang LANG]
                    [--shorthand SHORTHAND] [--mode {train,predict}]
                    [--emb_dim EMB_DIM] [--hidden_dim HIDDEN_DIM]
                    [--conv_filters CONV_FILTERS] [--no-residual]
                    [--no-hierarchical] [--hier_invtemp HIER_INVTEMP]
                    [--input_dropout] [--conv_res CONV_RES]
                    [--rnn_layers RNN_LAYERS] [--max_grad_norm MAX_GRAD_NORM]
                    [--anneal ANNEAL] [--anneal_after ANNEAL_AFTER]
                    [--lr0 LR0] [--dropout DROPOUT]
                    [--unit_dropout UNIT_DROPOUT] [--tok_noise TOK_NOISE]
                    [--weight_decay WEIGHT_DECAY] [--max_seqlen MAX_SEQLEN]
                    [--batch_size BATCH_SIZE] [--epochs EPOCHS]
                    [--steps STEPS] [--report_steps REPORT_STEPS]
                    [--shuffle_steps SHUFFLE_STEPS] [--eval_steps EVAL_STEPS]
                    [--save_name SAVE_NAME] [--load_name LOAD_NAME]
                    [--save_dir SAVE_DIR] [--cuda CUDA] [--cpu] [--seed SEED]
tokenizer.py: error: argument --max_seqlen: expected one argument
Running tokenizer in predict mode
Directory saved_models/tokenize do not exist; creating...
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/jplu/stanfordnlp/stanfordnlp/models/tokenizer.py", line 182, in <module>
    main()
  File "/home/jplu/stanfordnlp/stanfordnlp/models/tokenizer.py", line 93, in main
    evaluate(args)
  File "/home/jplu/stanfordnlp/stanfordnlp/models/tokenizer.py", line 159, in evaluate
    mwt_dict = load_mwt_dict(args['mwt_json_file'])
  File "/home/jplu/stanfordnlp/stanfordnlp/models/tokenize/utils.py", line 10, in load_mwt_dict
    with open(filename, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: './data/tokenize/fr_gsd-ud-dev-mwt.json'
Traceback (most recent call last):
  File "stanfordnlp/utils/conll18_ud_eval.py", line 532, in <module>
    main()
  File "stanfordnlp/utils/conll18_ud_eval.py", line 500, in main
    evaluation = evaluate_wrapper(args)
  File "stanfordnlp/utils/conll18_ud_eval.py", line 483, in evaluate_wrapper
    system_ud = load_conllu_file(args.system_file)
  File "stanfordnlp/utils/conll18_ud_eval.py", line 477, in load_conllu_file
    _file = open(path, mode="r", **({"encoding": "utf-8"} if sys.version_info >= (3, 0) else {}))
FileNotFoundError: [Errno 2] No such file or directory: './data/tokenize/fr_gsd.dev.pred.conllu'
fr_gsd

Indeed, by checking the stanfordnlp/utils/prepare_tokenizer_data.py file, I can see that .txt files are needed, but they do not exist in the repo of the UD dataset I'm using. Any hint on where I can find them? Or, if they are not provided for free, what should they look like, so that I can create them from the conllu files?
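
For concreteness, here is the kind of reconstruction I have in mind, as a sketch (the script and its name are my own, not part of stanfordnlp; it assumes the .conllu file carries the standard "# text = ..." sentence comments):

# conllu_to_txt.py -- hypothetical helper: rebuild approximate plaintext
# from the "# text = ..." comments of a CoNLL-U file.
import sys

def conllu_to_txt(conllu_path, txt_path):
    with open(conllu_path, encoding='utf-8') as f_in, \
         open(txt_path, 'w', encoding='utf-8') as f_out:
        for line in f_in:
            if line.startswith('# text = '):
                # Everything after the marker is the raw sentence text.
                f_out.write(line[len('# text = '):].rstrip('\n') + ' ')
        f_out.write('\n')

if __name__ == '__main__':
    conllu_to_txt(sys.argv[1], sys.argv[2])

Note that the officially released .txt files also preserve document/paragraph breaks and original spacing (UD marks these with # newdoc, # newpar and SpaceAfter=No), so a reconstruction like this would only be an approximation.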

Thanks in advance :)

qipeng commented Feb 6, 2019

It looks like the git repo doesn't have the plaintext files necessary to train the tokenizer -- could you try downloading the treebank (with the plaintext inputs) from https://universaldependencies.org and see if that works?
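
For completeness, a sketch of that download step under the directory layout used earlier in this thread (<release-url> is a placeholder; take the actual .tgz link from the release page on https://universaldependencies.org, which points to the LINDAT/CLARIN archive):

wget -O ud-treebanks.tgz "<release-url>"
mkdir -p ./data/uddata
tar -xzf ud-treebanks.tgz -C ./data/uddata --strip-components=1  # archive layout may vary per release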

jplu commented Feb 7, 2019

Thanks a lot for your answer. I downloaded the data from the website you suggested, and indeed it is now training. Nevertheless, I still get an error from the stanfordnlp/utils/prepare_tokenizer_data.py file:

Traceback (most recent call last):
  File "stanfordnlp/utils/prepare_tokenizer_data.py", line 87, in <module>
    index, word_found = find_next_word(index, text, word, output)
  File "stanfordnlp/utils/prepare_tokenizer_data.py", line 39, in find_next_word
    assert text[index].replace('\n', ' ') == word[idx], "character mismatch: raw text contains |%s| but the next word is |%s|." % (word_sofar, word)
AssertionError: character mismatch: raw text contains |,| but the next word is |6|.

qipeng commented Feb 8, 2019

Hi @jplu, I just downloaded the UD treebanks v2.3 and attempted to replicate this, but I cannot reproduce the issue.

The error message you're seeing is most likely due to a mismatch between the conllu file and the txt file, which can happen if they come from different sources (git vs. the UD website, or different UD releases, for example). Could you double-check that you're using consistent .txt and .conllu files, and that neither is corrupted?
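
A quick way to check this, as a sketch (this script is my own, not part of the repo; it assumes the .conllu carries "# text = ..." comments and roughly mirrors the whitespace handling of prepare_tokenizer_data.py by treating newlines as spaces):

# check_txt_conllu.py -- hypothetical sanity check: does every sentence
# from the .conllu appear, in order, in the plaintext file?
import sys

def check(txt_path, conllu_path):
    with open(txt_path, encoding='utf-8') as f:
        text = f.read().replace('\n', ' ')
    pos = 0
    with open(conllu_path, encoding='utf-8') as f:
        for lineno, line in enumerate(f, 1):
            if line.startswith('# text = '):
                sent = line[len('# text = '):].rstrip('\n')
                found = text.find(sent, pos)
                if found < 0:
                    print('Mismatch near conllu line %d: %r' % (lineno, sent[:60]))
                    return False
                pos = found + len(sent)
    return True

if __name__ == '__main__':
    sys.exit(0 if check(sys.argv[1], sys.argv[2]) else 1)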

jplu commented Feb 9, 2019

Indeed, it works! Maybe there is a difference between the data in the GitHub repository and the data provided by the downloaded archive.

Thanks a lot :)

jplu closed this as completed Feb 9, 2019
yunlizzzzhu commented

Hi @qipeng, I ran into similar error messages when following the instructions at https://stanfordnlp.github.io/stanfordnlp/training.html.

When I run bash scripts/run_tokenize.sh UD_English-EWT --batch_size 32 --dropout 0.33, I get the following errors:

Running tokenizer with --batch_size 32 --dropout 0.33...
Running tokenizer in train mode
Traceback (most recent call last):
  File "E:\Anaconda\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "E:\Anaconda\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "F:\zhudanyang\stanfordnlp\stanfordnlp\models\tokenizer.py", line 182, in <module>
    main()
  File "F:\zhudanyang\stanfordnlp\stanfordnlp\models\tokenizer.py", line 91, in main
    train(args)
  File "F:\zhudanyang\stanfordnlp\stanfordnlp\models\tokenizer.py", line 96, in train
    mwt_dict = load_mwt_dict(args['mwt_json_file'])
  File "F:\zhudanyang\stanfordnlp\stanfordnlp\models\tokenize\utils.py", line 10, in load_mwt_dict
    with open(filename, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: './scripts/data/tokenize/en_ewt-ud-dev-mwt.json'
Running tokenizer in predict mode
Traceback (most recent call last):
 File "E:\Anaconda\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "E:\Anaconda\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "F:\zhudanyang\stanfordnlp\stanfordnlp\models\tokenizer.py", line 182, in <module>
    main()
  File "F:\zhudanyang\stanfordnlp\stanfordnlp\models\tokenizer.py", line 93, in main
    evaluate(args)
  File "F:\zhudanyang\stanfordnlp\stanfordnlp\models\tokenizer.py", line 159, in evaluate
    mwt_dict = load_mwt_dict(args['mwt_json_file'])
  File "F:\zhudanyang\stanfordnlp\stanfordnlp\models\tokenize\utils.py", line 10, in load_mwt_dict
    with open(filename, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: './scripts/data/tokenize/en_ewt-ud-dev-mwt.json'
Traceback (most recent call last):
  File "stanfordnlp/utils/conll18_ud_eval.py", line 532, in <module>
    main()
  File "stanfordnlp/utils/conll18_ud_eval.py", line 500, in main
    evaluation = evaluate_wrapper(args)
  File "stanfordnlp/utils/conll18_ud_eval.py", line 483, in evaluate_wrapper
    system_ud = load_conllu_file(args.system_file)
  File "stanfordnlp/utils/conll18_ud_eval.py", line 477, in load_conllu_file
    _file = open(path, mode="r", **({"encoding": "utf-8"} if sys.version_info >= (3, 0) else {}))
FileNotFoundError: [Errno 2] No such file or directory: './scripts/data/tokenize/en_ewt.dev.pred.conllu'
en_ewt --batch_size 32 --dropout 0.33

Could you please give me some advice? Thank you in advance :)

wangcug commented Dec 12, 2020

(Quoting @yunlizzzzhu's comment above in full.)

You can try to copy en_test-ud-dev-mwt.json and rename it to the missing en_ewt-ud-dev-mwt.json.
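
As a sketch, under the paths from the error message above (the source filename here is my guess at the intended file; use whichever *-mwt.json your data preparation actually produced):

cp ./scripts/data/tokenize/en_ewt-ud-test-mwt.json \
   ./scripts/data/tokenize/en_ewt-ud-dev-mwt.json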
