Missing tokenize_sent.sh and detokenize.sh? #1
Sorry, this is super messy, and all built around stuff I have downloaded. I actually don't have access to the machine I have the code on right now, and won't for a few weeks, so I don't have the exact files. I do know roughly what their contents are though:
I put both of those files in the same folder as the parser to simplify things. I don't use Java much, but I probably should've added it to the classpath or something. If you aren't able to get that to work, you could probably swap in a different sentence tokenizer without much of a difference. Good luck on getting it to work, and let me know how it goes! |
Thanks for the quick reply and solution. Hopefully only one more question, as described below. Changed the location of the books to:

changed = change_book(open(prefix + 'JaneAustenNorthanger_Abbey.txt').read(), get_corpus('Sir_Arthur_Conan_Doyle'))

which raises:

FileNotFoundError Traceback (most recent call last)

Changed to:

prefix = 'Datasets/Textfiles/sources/'

(similar location logic as the original). Not sure why the notebook isn't finding the file. I'm new to Python and would appreciate knowing how to fix this for future projects. Thanks |
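A quick way to see why a notebook can't find a file like this is to print what the relative path actually resolves to. A minimal sketch, using the prefix and file name from the call above (the file itself is assumed to live wherever you unpacked the texts):

```python
import os

prefix = 'Datasets/Textfiles/sources/'       # prefix used in the notebook
book = 'JaneAustenNorthanger_Abbey.txt'      # file name from the change_book call

path = os.path.join(prefix, book)
print("working directory:", os.getcwd())     # relative paths resolve from here
print("looking for:", os.path.abspath(path))
print("exists:", os.path.exists(path))       # False means the prefix or cwd is wrong
```

If `exists` prints False, either the notebook's working directory isn't the project root, or the `Datasets/Textfiles/sources/` folder doesn't sit where the prefix expects.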
So the Datasets folder is inside this project's directory? Also, I think you'll have problems with the get_corpus('Sir_Arthur_Conan_Doyle') call. The get_corpus() method assumes that you have the Project Gutenberg dataset from https://web.eecs.umich.edu/~lahiri/gutenberg_dataset.html downloaded and unzipped into the prefix directory. If you're using other files, it should still work, but you'll have to replace the get_corpus call with something like open(FILE_NAME).read(). |
Yes.
SentenceEmbeddings/SentenceChange.ipynb (revised notebook)
SentenceEmbeddings/Infersent/dataset (model)
SentenceEmbeddings/Infersent/encoder (.pkl)
SentenceEmbeddings/Datasets/stanford-parser-full-2017-06-09 (parser and .sh
files)
SentenceEmbeddings/Datasets/Textfiles/sources (revised location and named
text files)
Downloading PG archive and will install as per your advice and make changes
as necessary.
Thanks
…On Fri, Dec 21, 2018 at 10:09 AM Matthew Dangerfield < ***@***.***> wrote:
So the Datasets folder is inside this projects directory?
Also, I think you'll have problems with the
get_corpus('Sir_Arthur_Conan_Doyle') call. The get_corpus() method
assumes that you have the project Gutenberg dataset from here
<https://web.eecs.umich.edu/~lahiri/gutenberg_dataset.html> downloaded
and unzipped into the prefix directory. If you're using other files, it
should still work, but you'll have to replace the get_corpus call with
something like open(FILE_NAME).read().
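The suggestion above, replacing the get_corpus call with a direct open(FILE_NAME).read(), amounts to something like the following sketch. The file name and contents here are stand-ins created just so the example is self-contained:

```python
import os
import tempfile

# Stand-in for a downloaded source text (hypothetical content for illustration).
tmp = tempfile.mkdtemp()
source_path = os.path.join(tmp, 'Sir_Arthur_Conan_Doyle.txt')
with open(source_path, 'w') as f:
    f.write('It is a capital mistake to theorize before one has data.')

# Instead of get_corpus('Sir_Arthur_Conan_Doyle'), read the plain text file directly:
source_text = open(source_path).read()
```

In the notebook you would point `source_path` at whatever plain-text file you want to use as the source corpus.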
|
Hi: Have Gutenberg set up and am running the notebook with the original get_corpus call. Changed a few directory locations, for example all instances of '/tmp/in.txt' and '/tmp/out.txt' to 'Datasets/tmp/in.txt' and 'Datasets/tmp/out.txt'. Appears to be working until the error below:

FileNotFoundError Traceback (most recent call last)
in change_book(toChange, source, withTranslation, useAnnoy, maxChars)
in tokenize_sentences(text)
FileNotFoundError: [Errno 2] No such file or directory: 'Datasets/tmp/out.txt'

It writes 'in.txt' (the processed Northanger Abbey) to 'Datasets/tmp/in.txt' but never writes 'out.txt'. I would appreciate any suggestions on how to fix this. Thanks |
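A missing out.txt like this usually means the shell script ran but exited before writing its output, and the failure only surfaces later when the notebook tries to read the file. A defensive wrapper can turn that into a readable error; this is a sketch, not the notebook's actual code, and the `cmd` passed in would be whatever invokes the tokenizer script:

```python
import os
import subprocess

def run_and_check(cmd, out_path):
    # Make sure the directory the output file goes into exists first.
    os.makedirs(os.path.dirname(out_path), exist_ok=True)
    # Capture stderr so a failing script explains itself instead of
    # surfacing later as FileNotFoundError on out_path.
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0 or not os.path.exists(out_path):
        raise RuntimeError("tokenizer failed: " + result.stderr)
    return open(out_path).read()
```

Wrapping the tokenize_sent.sh call in something like this at least replaces the silent failure with the script's own error message.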
I know it's some issue with the file paths in the sentence tokenizer, but I'm not sure exactly what. I would try changing |
Thanks for the reply. Unfortunately it's the same error. I'll close this for now and search for possible solutions. Cheers |
Were you ever able to get it to work? I have access to the machine that I originally developed this on now, so I might be able to help you more, if you want. |
Hi:
Thanks for getting back to me. Same errors as before. I've tried numerous >
redirect combinations, but each generates either the original error or a new
one. Posted a "please help" on Stack Overflow but no solution yet.
Can you post the original tokenize.sh and detokenize.sh?
Any help is appreciated.
…On Fri, Jan 18, 2019 at 12:05 PM Matthew Dangerfield < ***@***.***> wrote:
Were you ever able to get it to work? I have access to the machine that I
originally developed this on now, so I might be able to help you more, if
you want.
|
Wow, looks like there are several differences from what I remembered. You'll still have to change the /tmp/in.txt paths to wherever your tmp directory is relative to the parser directory. But, other than that, you should be able to use the same files.

tokenize.sh:
export CLASSPATH=$(dirname $0)/stanford-parser.jar
java edu.stanford.nlp.process.PTBTokenizer -preserveLines /tmp/in.txt > /tmp/out.txt

detokenize.sh:
export CLASSPATH=$(dirname $0)/stanford-parser.jar
java edu.stanford.nlp.process.PTBTokenizer -untok /tmp/in.txt > /tmp/out.txt
|
Works like a charm now with PG-format text files. Thanks.
Testing unwrapped-line texts, but that generates out-of-memory errors on my
GTX 1070. Will test Java memory settings.
Cheers
…On Fri, Jan 18, 2019 at 12:57 PM Matthew Dangerfield < ***@***.***> wrote:
Wow, looks like there are several differences from what I remembered.
You'll still have to change the /tmp/in.txt paths to wherever your tmp
directory is relative to the parser directory. But, other than that, you
should be able to use the same files.
tokenize.sh:
export CLASSPATH=$(dirname $0)/stanford-parser.jar
java edu.stanford.nlp.process.PTBTokenizer -preserveLines /tmp/in.txt > /tmp/out.txt
detokenize.sh
export CLASSPATH=$(dirname $0)/stanford-parser.jar
java edu.stanford.nlp.process.PTBTokenizer -untok /tmp/in.txt > /tmp/out.txt
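Computing the replacement for the /tmp/in.txt paths can be done mechanically. A small sketch, assuming the directory layout mentioned earlier in the thread and that, as suggested above, paths in the scripts are resolved relative to the parser directory:

```python
import os

parser_dir = 'Datasets/stanford-parser-full-2017-06-09'  # where the .sh files live
tmp_dir = 'Datasets/tmp'                                 # where the notebook writes in.txt

# Path to use inside tokenize.sh in place of /tmp/in.txt:
in_path = os.path.join(os.path.relpath(tmp_dir, parser_dir), 'in.txt')
print(in_path)  # ../tmp/in.txt on POSIX
```

The same substitution applies to /tmp/out.txt in both scripts.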
|
Thanks for the interesting repo.
I've downloaded the specified version of the stanford parser but I can't find "stanford-parser-full-2017-06-09/tokenize_sent.sh" and "detokenize.sh" required in the notebook.
Are these renamed files in the parser folder?
If not, is it possible to upload these to this repo or provide a link?
Thanks,