Issue to train a new pipeline #16
Comments
It looks like the git repo doesn't have the plaintext files necessary to train the tokenizer -- could you try downloading the treebank (with the plaintext inputs) from https://universaldependencies.org and see if that works?
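Fetching a full UD release might look like the sketch below; the download URL is a placeholder (copy the real link for the release you want from https://universaldependencies.org), and the archive name and layout assume the v2.3 tarball:

```bash
# Placeholder URL -- use the actual download link from universaldependencies.org
UD_RELEASE_URL='https://...'
wget -O ud-treebanks-v2.3.tgz "$UD_RELEASE_URL"
tar -xzf ud-treebanks-v2.3.tgz          # extracts to ud-treebanks-v2.3/
cp -r ud-treebanks-v2.3/UD_French-GSD ./data/uddata/
```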
Thanks a lot for your answer. I did download the data from the website you proposed, and indeed it is now training. Nevertheless, I still get an error with the
Hi @jplu, I just downloaded UD treebanks v2.3 and attempted to replicate this, but I cannot reproduce the issue. The error message you're seeing is most likely due to a mismatch between the conllu file and the txt file, which can happen if your conllu and txt files come from different sources (git vs the UD website, or different UD releases, for example). Could you double check that you're using a consistent
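One way to sanity-check that a conllu/txt pair comes from the same release is to rebuild the raw text from the conllu sentence comments and compare it, whitespace aside, with the shipped plaintext. The sketch below assumes the standard UD_French-GSD file names:

```bash
cd ./data/uddata/UD_French-GSD
for split in train dev test; do
  conllu=fr_gsd-ud-${split}.conllu
  txt=fr_gsd-ud-${split}.txt
  # Reconstruct the raw text from the "# text = ..." comments, then compare
  # with the released plaintext after stripping all whitespace.
  a=$(grep '^# text = ' "$conllu" | sed 's/^# text = //' | tr -d '[:space:]')
  b=$(tr -d '[:space:]' < "$txt")
  [ "$a" = "$b" ] && echo "$split: consistent" || echo "$split: MISMATCH"
done
```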
Indeed it works! Maybe there is a difference between the data from the GitHub repository and the data provided by the downloaded archive. Thanks a lot :)
Hi @qipeng, I ran into similar error messages when following the instructions at https://stanfordnlp.github.io/stanfordnlp/training.html. When I run `bash scripts/run_tokenize.sh UD_English-EWT --batch_size 32 --dropout 0.33`, I get the following errors:
Could you please give me some advice? Thank you in advance :)
You can try to copy
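Assuming the truncated suggestion refers to the plaintext inputs, copying the released .txt files next to the conllu files the scripts expect might look like this (an assumption, not the confirmed advice):

```bash
# Assumption: copy the released .txt inputs into the treebank
# directory that $UDBASE points at.
cp ud-treebanks-v2.3/UD_English-EWT/*.txt "$UDBASE"/UD_English-EWT/
```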
First of all, thanks a lot for this new neural NLP pipeline. I'm currently trying to train a new model for French with my data + UD datasets, but before that I would like to be able to properly reproduce the training steps. Right now, I'm struggling to train the tokenizer. Here is what I'm doing:
```bash
mkdir -p ./extern_data/word2vec
scripts/download_vectors.sh ./extern_data/word2vec/
mkdir -p ./data/uddata/
mkdir ./data/tokenize
cd ./data/uddata
git clone https://github.com/UniversalDependencies/UD_French-GSD.git
cd ../..
```

Then I set the `UDBASE` env variable to `./data/uddata` in `scripts/config.sh` (see the sketch below) and run:

```bash
scripts/run_tokenize.sh UD_French-GSD
```
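The `scripts/config.sh` edit boils down to something like the following excerpt; only `UDBASE` is named in this thread, so the surrounding contents of the file are assumed:

```bash
# scripts/config.sh (assumed excerpt)
export UDBASE=./data/uddata
```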
And I get the following output with several errors:
Apparently, looking at the `stanfordnlp/utils/prepare_tokenizer_data.py` file, there is indeed a need for `txt` files, but they do not exist in the repo of the UD dataset I'm using. Any hint on where I can find them? Or, if they are not provided for free, what should they look like, so that I can create them from the `conllu` files? Thanks in advance :)
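In case it helps, an approximation of the plaintext files can be rebuilt from the `# text = ...` sentence comments in the conllu files. This is a sketch, not the official procedure: it produces one sentence per line and loses the paragraph breaks and inter-sentence spacing that the released `txt` files preserve:

```bash
# Rebuild an approximate .txt next to each .conllu (one sentence per line).
for f in ./data/uddata/UD_French-GSD/*.conllu; do
  grep '^# text = ' "$f" | sed 's/^# text = //' > "${f%.conllu}.txt"
done
```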