Data Directory used when running test_phrase_grammar.py #24

Closed
YianZhang opened this issue Apr 19, 2020 · 9 comments

Comments

@YianZhang

Hi Yikang and other Contributors,

Thank you for making the source code public! I am trying to reproduce your results, but I am not sure what path to pass as the --data command-line argument of test_phrase_grammar.py. I downloaded the PTB data and am currently passing treebank_3/parsed/mrg as the data argument, but it does not work.

The listings under treebank_3/parsed/mrg:
atis brown readme.mrg swbd wsj

The listings under treebank_3/parsed/mrg/wsj:

00 06 12 18 24
01 07 13 19 MERGE.LOG
02 08 14 20
03 09 15 21
04 10 16 22
05 11 17 23

Thank you for your time!
Ian

@yikangshen
Owner

Hi Ian,
You need to copy the wsj folder to ~/nltk_data/corpora/ptb/WSJ.
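
For readers who prefer to script this step, here is a minimal sketch of the copy described above. The helper name `install_wsj` and its parameters are hypothetical and not part of the repository; the destination `~/nltk_data/corpora/ptb/WSJ` is taken from the reply, and the sketch assumes Python 3.8 or later for `dirs_exist_ok`.

```python
import shutil
from pathlib import Path


def install_wsj(treebank_mrg_dir, nltk_data_dir=None):
    """Copy <treebank_mrg_dir>/wsj to <nltk_data_dir>/corpora/ptb/WSJ.

    Hypothetical helper: the name and signature are illustrative only.
    """
    src = Path(treebank_mrg_dir) / "wsj"
    if not src.is_dir():
        raise FileNotFoundError(f"expected a wsj folder at {src}")

    # Default root follows the path given in the reply (~/nltk_data);
    # pass nltk_data_dir explicitly if your NLTK data lives elsewhere.
    root = Path(nltk_data_dir) if nltk_data_dir else Path.home() / "nltk_data"
    dst = root / "corpora" / "ptb" / "WSJ"
    dst.parent.mkdir(parents=True, exist_ok=True)

    # dirs_exist_ok=True (Python 3.8+) merges into an existing WSJ folder
    # instead of raising an error.
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return dst
```

With the layout from the original post, the call would look like `install_wsj("treebank_3/parsed/mrg")`.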

@YianZhang
Author

Hi Yikang,

Thanks for the response! I figured that out. However, what is args.data used for in test_phrase_grammar.py?

Thanks,
Ian

@yikangshen
Owner

It points to the dictionary that the model actually uses.

@YianZhang
Author

> It points to the dictionary that the model actually uses.

Thanks for the response! Do you mean "directory" or "dictionary"?

Best,
Ian

@yikangshen
Owner

Dictionary

@yikangshen
Owner

When testing parsing F1, the model still needs to load the dictionary from the training corpus.

@YianZhang
Author

YianZhang commented Apr 28, 2020

Thanks for your prompt response!

After carefully checking your code, I believe the dictionary is loaded from a fixed path:

fn = 'corpus.{}.data'.format(hashlib.md5('data/penn'.encode()).hexdigest())
print('Loading cached dataset...')
corpus = torch.load(fn)
dictionary = corpus.dictionary

And args.data is used as the directory of the test data:

corpus = data_ptb.Corpus(args.data)

Am I correct?

Thanks again for your help! I would appreciate it if you could also look at my other issue, #25. As far as I know, that problem has confused other researchers as well.

Best,
Ian

@shawntan
Collaborator

The code assumes the cached dataset is already present in the directory the script is run from; it will be there if the training script was run before test_phrase_grammar.py.

But yes, you are correct.
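
To check whether that cache exists before running the test script, the file name can be recomputed the same way as in the snippet quoted earlier in the thread. This is only a sketch: the key `'data/penn'` and the `corpus.{}.data` pattern come from that snippet, while the helper names `cache_filename` and `find_cache` are hypothetical. It only looks for the file and does not load it with `torch.load`.

```python
import hashlib
from pathlib import Path


def cache_filename(key="data/penn"):
    """Return the cache file name derived from the MD5 hash of `key`.

    Mirrors the expression quoted in the thread; the default key is the
    fixed string used there, not the value of args.data.
    """
    digest = hashlib.md5(key.encode()).hexdigest()
    return "corpus.{}.data".format(digest)


def find_cache(directory=".", key="data/penn"):
    """Return the cache path if the file exists in `directory`, else None."""
    path = Path(directory) / cache_filename(key)
    return path if path.is_file() else None
```

If `find_cache()` returns `None` in the directory you launch test_phrase_grammar.py from, the training script has probably not been run there yet, so the dictionary cannot be loaded.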

@YianZhang
Author

Thanks a lot!
