
Missing tokenize_sent.sh and detokenize.sh ? #1

Closed
GenTxt opened this issue Dec 20, 2018 · 11 comments

Comments

@GenTxt

GenTxt commented Dec 20, 2018

Thanks for the interesting repo.

I've downloaded the specified version of the stanford parser but I can't find "stanford-parser-full-2017-06-09/tokenize_sent.sh" and "detokenize.sh" required in the notebook.

Are these renamed files in the parser folder?
If not, is it possible to upload these to this repo or provide a link?

Thanks,

@superMDguy
Owner

Sorry, this is super messy, and all built around stuff I have downloaded. I actually don't have access to the machine I have the code on right now, and won't for a few weeks, so I don't have the exact files. I do know roughly what their contents are though:

  • tokenize_sent.sh is pretty much `java edu.stanford.nlp.process.DocumentPreprocessor /tmp/in.txt > /tmp/out.txt`. I might also have the -preserveLines option, but I'm not sure if I'm using that.
  • detokenize_sent.sh is pretty much `java edu.stanford.nlp.process.DocumentPreprocessor /tmp/in.txt > /tmp/out.txt`.

I put both of those files in the same folder as the parser to simplify things. I don't use java much, but I probably should've added it to the classpath or something. If you aren't able to get that to work, you could probably throw in a different sentence tokenizer without much of a difference.

Good luck on getting it to work, and let me know how it goes!
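If the Stanford scripts stay troublesome, a minimal regex-based sentence splitter in plain Python can stand in for tokenize_sent.sh. This is a rough sketch, not the repo's actual tokenizer, and it handles far fewer edge cases than Stanford's DocumentPreprocessor:

```python
import re

def tokenize_sentences(text):
    # Naive splitter: break after ., !, or ? followed by whitespace.
    # A crude stand-in for Stanford's DocumentPreprocessor.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    return [s for s in sentences if s]

print(tokenize_sentences("It was dark. Holmes smiled. Who knocked?"))
# → ['It was dark.', 'Holmes smiled.', 'Who knocked?']
```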

@GenTxt
Author

GenTxt commented Dec 21, 2018

Thanks for the quick reply and solution. Hopefully only one more question, as described below.

I have the parser and the required .sh files in:
Datasets/stanford-parser-full-2017-06-09/tokenize_sent.sh (and detokenize.sh)
This runs without error in the revised Jupyter notebook.

I changed the location of the books to:
prefix = 'Textfiles/sources/' (the same program folder that contains 'Datasets')
and placed the renamed text files 'JaneAustenNorthanger_Abbey.txt' and 'Sir_Arthur_Conan_Doyle' in the 'sources' subfolder. I also changed some of the code in the notebook to remove spaces in the output names (running Ubuntu 18).

Continuing to run the notebook generates this FileNotFoundError:

```
changed = change_book(open(prefix + 'JaneAustenNorthanger_Abbey.txt').read(), get_corpus('Sir_Arthur_Conan_Doyle'))
write_file('Northanger_Abbey_x_Doyle_with_translation', "Sir_Arthur_Conan_Doyle's_Northanger_Abbey", changed)

FileNotFoundError                         Traceback (most recent call last)
----> 1 changed = change_book(open(prefix + 'JaneAustenNorthanger_Abbey.txt').read(), get_corpus('Sir_Arthur_Conan_Doyle'))
      2 write_file('Northanger_Abbey_x_Doyle_with_translation', "Sir_Arthur_Conan_Doyle's_Northanger_Abbey", changed)

FileNotFoundError: [Errno 2] No such file or directory:
'Textfiles/sources/JaneAustenNorthanger_Abbey.txt'
```

I then changed it to prefix = 'Datasets/Textfiles/sources/' (similar location logic to the original) and got the same error: No such file or directory: 'Datasets/Textfiles/sources/JaneAustenNorthanger_Abbey.txt'

I'm not sure why the notebook isn't finding the file. I'm new to Python and would appreciate knowing how to fix this for future projects.

Thanks
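When a notebook raises FileNotFoundError on a relative path like this, the mismatch is almost always between the notebook's working directory and the path as typed. Printing both (a general debugging technique, with the path from the comment above) shows exactly where Python is looking:

```python
import os

fname = 'Textfiles/sources/JaneAustenNorthanger_Abbey.txt'  # path from the notebook

print('working directory:', os.getcwd())         # relative paths resolve against this
print('resolved path:', os.path.abspath(fname))  # where open() will actually look
print('exists:', os.path.exists(fname))          # False means the path is wrong
```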

@superMDguy
Owner

So the Datasets folder is inside this project's directory?

Also, I think you'll have problems with the get_corpus('Sir_Arthur_Conan_Doyle') call. The get_corpus() method assumes that you have the Project Gutenberg dataset from here downloaded and unzipped into the prefix directory. If you're using other files, it should still work, but you'll have to replace the get_corpus call with something like open(FILE_NAME).read().
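That suggested replacement can be sketched as a small helper (get_corpus_from_file is a hypothetical name; the file name is whatever corpus file you actually have):

```python
def get_corpus_from_file(file_name):
    # Drop-in substitute for get_corpus(): treat one plain-text file
    # as the whole corpus and return its contents as a single string.
    with open(file_name, encoding='utf-8') as f:
        return f.read()

# Usage, roughly: source = get_corpus_from_file('Textfiles/sources/doyle.txt')
```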

@GenTxt
Author

GenTxt commented Dec 21, 2018 via email

@GenTxt
Author

GenTxt commented Dec 23, 2018

Hi: I have Gutenberg set up and I'm running the notebook with the original get_corpus call. I changed a few directory locations, for example all instances of '/tmp/in.txt' and '/tmp/out.txt' to 'Datasets/tmp/in.txt' and 'Datasets/tmp/out.txt'. It appears to be working until the error below.

FileNotFoundError: [Errno 2] No such file or directory: 'Datasets/tmp/out.txt'

It writes 'in.txt' (the processed Northanger Abbey) to 'Datasets/tmp/in.txt' but not 'out.txt'.

I would appreciate any suggestions on how to fix. Thanks


```
FileNotFoundError                         Traceback (most recent call last)
----> 1 changed = change_book(open(prefix + 'Jane Austen___Northanger Abbey.txt').read(), get_corpus('Sir Arthur Conan Doyle'))
      2 write_file('Northanger_Abbey_x_Doyle_with_translation', "Sir Arthur Conan Doyle's Northanger Abbey", changed)

in change_book(toChange, source, withTranslation, useAnnoy, maxChars)
      1 def change_book(toChange, source, withTranslation = True, useAnnoy = False, maxChars = 5000000):
----> 2     toChangeSent = tokenize_sentences(toChange)
      3     sourceSent = tokenize_sentences(source[:maxChars])
      4
      5     model.build_vocab(toChangeSent + sourceSent, tokenize=True)

in tokenize_sentences(text)
      3     open('Datasets/tmp/in.txt', 'w').write(text.replace('\n\n', NEWLINE))
      4     os.system('Datasets/stanford-parser-full-2017-06-09/tokenize_sent.sh')
----> 5     tokens = open('Datasets/tmp/out.txt').read().split('\n')
      6     print('Total tokens in dataset', len(tokens))
      7

FileNotFoundError: [Errno 2] No such file or directory: 'Datasets/tmp/out.txt'
```
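Since os.system() here discards the script's exit status, out.txt never appearing usually means the script itself failed silently. Running it through subprocess instead (a general debugging sketch; the script path is the one from the traceback) surfaces the real error:

```python
import subprocess

# Run the tokenizer script and report failure instead of ignoring it,
# as os.system(...) does when its return value is discarded.
result = subprocess.run(
    ['bash', 'Datasets/stanford-parser-full-2017-06-09/tokenize_sent.sh'],
    capture_output=True, text=True,
)
if result.returncode != 0:
    print('tokenizer failed with code', result.returncode)
    print(result.stderr)  # e.g. a missing-file message or a Java classpath error
```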

@superMDguy
Owner

I know it's some issue with the file paths in the sentence tokenizer, but I'm not sure exactly what. I would try changing tokenize_sent.sh to `java edu.stanford.nlp.process.DocumentPreprocessor ../tmp/in.txt > ../tmp/out.txt` and detokenize_sent.sh to `java edu.stanford.nlp.process.DocumentPreprocessor ../tmp/in.txt > ../tmp/out.txt`.

@GenTxt
Author

GenTxt commented Dec 29, 2018

Thanks for the reply. Unfortunately it's the same error. I'll close this for now and search for possible solutions.

Cheers

@GenTxt GenTxt closed this as completed Dec 29, 2018
@superMDguy
Owner

Were you ever able to get it to work? I have access to the machine that I originally developed this on now, so I might be able to help you more, if you want.

@GenTxt
Author

GenTxt commented Jan 18, 2019 via email

@superMDguy
Owner

Wow, looks like there are several differences from what I remembered. You'll still have to change the /tmp/in.txt paths to wherever your tmp directory is relative to the parser directory. But, other than that, you should be able to use the same files.

tokenize.sh:

```sh
export CLASSPATH=$(dirname $0)/stanford-parser.jar

java edu.stanford.nlp.process.PTBTokenizer -preserveLines /tmp/in.txt > /tmp/out.txt
```

detokenize.sh:

```sh
export CLASSPATH=$(dirname $0)/stanford-parser.jar

java edu.stanford.nlp.process.PTBTokenizer -untok /tmp/in.txt > /tmp/out.txt
```
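The export CLASSPATH=$(dirname $0)/... line is what makes the jar lookup independent of the caller's working directory: dirname $0 expands to the directory containing the script itself. A tiny standalone illustration (no parser needed; the paths are made-up examples):

```shell
#!/bin/sh
# dirname strips the last path component, so for a script invoked as
# /home/user/stanford-parser-full-2017-06-09/tokenize.sh,
# $(dirname $0) yields /home/user/stanford-parser-full-2017-06-09.
dirname /home/user/stanford-parser-full-2017-06-09/tokenize.sh
```

Note that the /tmp/in.txt and /tmp/out.txt paths in the scripts are absolute, so only the jar lookup depends on where the script lives; if you relocate the tmp files, those paths still need editing.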

@GenTxt
Author

GenTxt commented Jan 19, 2019 via email
