
Missing tokenize_sent.sh and detokenize.sh ? #1

Closed
GenTxt opened this issue Dec 20, 2018 · 11 comments

Comments

@GenTxt

GenTxt commented Dec 20, 2018

Thanks for the interesting repo.

I've downloaded the specified version of the stanford parser but I can't find "stanford-parser-full-2017-06-09/tokenize_sent.sh" and "detokenize.sh" required in the notebook.

Are these renamed files in the parser folder?
If not, is it possible to upload these to this repo or provide a link?

Thanks,

@superMDguy
Owner

Sorry, this is super messy, and all built around stuff I have downloaded. I actually don't have access to the machine I have the code on right now, and won't for a few weeks, so I don't have the exact files. I do know roughly what their contents are though:

  • tokenize_sent.sh is pretty much `java edu.stanford.nlp.process.DocumentPreprocessor /tmp/in.txt > /tmp/out.txt`. I might also have the -preserveLines option, but I'm not sure if I'm using that.
  • detokenize_sent.sh is pretty much `java edu.stanford.nlp.process.DocumentPreprocessor /tmp/in.txt > /tmp/out.txt`.

I put both of those files in the same folder as the parser to simplify things. I don't use java much, but I probably should've added it to the classpath or something. If you aren't able to get that to work, you could probably throw in a different sentence tokenizer without much of a difference.

Good luck on getting it to work, and let me know how it goes!
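If the Stanford scripts stay troublesome, a minimal regex-based sentence splitter in plain Python can stand in for tokenize_sent.sh. This is a rough sketch, not the repo's actual tokenizer, and it handles far fewer edge cases than Stanford's DocumentPreprocessor:

```python
import re

def tokenize_sentences(text):
    # Naive splitter: break after ., !, or ? followed by whitespace.
    # A crude stand-in for Stanford's DocumentPreprocessor.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    return [s for s in sentences if s]

print(tokenize_sentences("It was dark. Holmes smiled. Who knocked?"))
# → ['It was dark.', 'Holmes smiled.', 'Who knocked?']
```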

@GenTxt
Author

GenTxt commented Dec 21, 2018

Thanks for the quick reply and solution. Hopefully only one more question, as described below.

I have the parser and the required .sh files in:
Datasets/stanford-parser-full-2017-06-09/tokenize_sent.sh (and detokenize.sh)
This runs without error in the revised Jupyter notebook.

I changed the location of the books to:
prefix = 'Textfiles/sources/' (the same program folder that contains 'Datasets')
and placed the renamed text files 'JaneAustenNorthanger_Abbey.txt' and 'Sir_Arthur_Conan_Doyle' in the 'sources' subfolder. I also changed some of the code in the notebook to remove spaces in the output names (running Ubuntu 18).

Continuing to run the notebook generates this FileNotFoundError:

```
changed = change_book(open(prefix + 'JaneAustenNorthanger_Abbey.txt').read(), get_corpus('Sir_Arthur_Conan_Doyle'))
write_file('Northanger_Abbey_x_Doyle_with_translation', "Sir_Arthur_Conan_Doyle's_Northanger_Abbey", changed)

FileNotFoundError                         Traceback (most recent call last)
----> 1 changed = change_book(open(prefix + 'JaneAustenNorthanger_Abbey.txt').read(), get_corpus('Sir_Arthur_Conan_Doyle'))
      2 write_file('Northanger_Abbey_x_Doyle_with_translation', "Sir_Arthur_Conan_Doyle's_Northanger_Abbey", changed)

FileNotFoundError: [Errno 2] No such file or directory:
'Textfiles/sources/JaneAustenNorthanger_Abbey.txt'
```

I then changed it to prefix = 'Datasets/Textfiles/sources/' (similar location logic to the original) and got the same error: No such file or directory: 'Datasets/Textfiles/sources/JaneAustenNorthanger_Abbey.txt'

I'm not sure why the notebook isn't finding the file. I'm new to Python and would appreciate knowing how to fix this for future projects.

Thanks
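When a notebook raises FileNotFoundError on a relative path like this, the mismatch is almost always between the notebook's working directory and the path as typed. Printing both (a general debugging technique, with the path from the comment above) shows exactly where Python is looking:

```python
import os

fname = 'Textfiles/sources/JaneAustenNorthanger_Abbey.txt'  # path from the notebook

print('working directory:', os.getcwd())         # relative paths resolve against this
print('resolved path:', os.path.abspath(fname))  # where open() will actually look
print('exists:', os.path.exists(fname))          # False means the path is wrong
```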

@superMDguy
Owner

So the Datasets folder is inside this project's directory?

Also, I think you'll have problems with the get_corpus('Sir_Arthur_Conan_Doyle') call. The get_corpus() method assumes that you have the Project Gutenberg dataset from here downloaded and unzipped into the prefix directory. If you're using other files, it should still work, but you'll have to replace the get_corpus call with something like open(FILE_NAME).read().
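That suggested replacement can be sketched as a small helper (get_corpus_from_file is a hypothetical name; the file name is whatever corpus file you actually have):

```python
def get_corpus_from_file(file_name):
    # Drop-in substitute for get_corpus(): treat one plain-text file
    # as the whole corpus and return its contents as a single string.
    with open(file_name, encoding='utf-8') as f:
        return f.read()

# Usage, roughly: source = get_corpus_from_file('Textfiles/sources/doyle.txt')
```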

@GenTxt
Author

GenTxt commented Dec 21, 2018 via email

@GenTxt
Author

GenTxt commented Dec 23, 2018

Hi: I have Gutenberg set up and I'm running the notebook with the original get_corpus call. I changed a few directory locations, for example all instances of '/tmp/in.txt' and '/tmp/out.txt' to 'Datasets/tmp/in.txt' and 'Datasets/tmp/out.txt'. It appears to be working until the error below.

FileNotFoundError: [Errno 2] No such file or directory: 'Datasets/tmp/out.txt'

It writes 'in.txt' (the processed Northanger Abbey) to 'Datasets/tmp/in.txt' but not 'out.txt'.

I would appreciate any suggestions on how to fix. Thanks


```
FileNotFoundError                         Traceback (most recent call last)
----> 1 changed = change_book(open(prefix + 'Jane Austen___Northanger Abbey.txt').read(), get_corpus('Sir Arthur Conan Doyle'))
      2 write_file('Northanger_Abbey_x_Doyle_with_translation', "Sir Arthur Conan Doyle's Northanger Abbey", changed)

in change_book(toChange, source, withTranslation, useAnnoy, maxChars)
      1 def change_book(toChange, source, withTranslation = True, useAnnoy = False, maxChars = 5000000):
----> 2     toChangeSent = tokenize_sentences(toChange)
      3     sourceSent = tokenize_sentences(source[:maxChars])
      4
      5     model.build_vocab(toChangeSent + sourceSent, tokenize=True)

in tokenize_sentences(text)
      3     open('Datasets/tmp/in.txt', 'w').write(text.replace('\n\n', NEWLINE))
      4     os.system('Datasets/stanford-parser-full-2017-06-09/tokenize_sent.sh')
----> 5     tokens = open('Datasets/tmp/out.txt').read().split('\n')
      6     print('Total tokens in dataset', len(tokens))
      7

FileNotFoundError: [Errno 2] No such file or directory: 'Datasets/tmp/out.txt'
```
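Since os.system() here discards the script's exit status, out.txt never appearing usually means the script itself failed silently. Running it through subprocess instead (a general debugging sketch; the script path is the one from the traceback) surfaces the real error:

```python
import subprocess

# Run the tokenizer script and report failure instead of ignoring it,
# as os.system(...) does when its return value is discarded.
result = subprocess.run(
    ['bash', 'Datasets/stanford-parser-full-2017-06-09/tokenize_sent.sh'],
    capture_output=True, text=True,
)
if result.returncode != 0:
    print('tokenizer failed with code', result.returncode)
    print(result.stderr)  # e.g. a missing-file message or a Java classpath error
```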

@superMDguy
Owner

I know it's some issue with the file paths in the sentence tokenizer, but I'm not sure exactly what. I would try changing tokenize_sent.sh to `java edu.stanford.nlp.process.DocumentPreprocessor ../tmp/in.txt > ../tmp/out.txt` and detokenize_sent.sh to `java edu.stanford.nlp.process.DocumentPreprocessor ../tmp/in.txt > ../tmp/out.txt`.

@GenTxt
Author

GenTxt commented Dec 29, 2018

Thanks for the reply. Unfortunately it's the same error. I'll close this for now and search for possible solutions.

Cheers

@GenTxt GenTxt closed this as completed Dec 29, 2018
@superMDguy
Owner

Were you ever able to get it to work? I have access to the machine that I originally developed this on now, so I might be able to help you more, if you want.

@GenTxt
Author

GenTxt commented Jan 18, 2019 via email

@superMDguy
Owner

Wow, looks like there are several differences from what I remembered. You'll still have to change the /tmp/in.txt paths to wherever your tmp directory is relative to the parser directory. But, other than that, you should be able to use the same files.

tokenize.sh:

```sh
export CLASSPATH=$(dirname $0)/stanford-parser.jar

java edu.stanford.nlp.process.PTBTokenizer -preserveLines /tmp/in.txt > /tmp/out.txt
```

detokenize.sh:

```sh
export CLASSPATH=$(dirname $0)/stanford-parser.jar

java edu.stanford.nlp.process.PTBTokenizer -untok /tmp/in.txt > /tmp/out.txt
```
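The export CLASSPATH=$(dirname $0)/... line is what makes the jar lookup independent of the caller's working directory: dirname $0 expands to the directory containing the script itself. A tiny standalone illustration (no parser needed; the paths are made-up examples):

```shell
#!/bin/sh
# dirname strips the last path component, so for a script invoked as
# /home/user/stanford-parser-full-2017-06-09/tokenize.sh,
# $(dirname $0) yields /home/user/stanford-parser-full-2017-06-09.
dirname /home/user/stanford-parser-full-2017-06-09/tokenize.sh
```

Note that the /tmp/in.txt and /tmp/out.txt paths in the scripts are absolute, so only the jar lookup depends on where the script lives; if you relocate the tmp files, those paths still need editing.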

@GenTxt
Author

GenTxt commented Jan 19, 2019 via email
