Inputting Tokenized data to Depparser #34
Hi @lwolfsonkin, it seems like your need is related to https://github.com/stanfordnlp/stanfordnlp/issues/24. Can you take a look and see if it works for you?
More specifically, the following code

```python
import stanfordnlp

config = {
    'processors': 'tokenize,pos,lemma,depparse',
    'tokenize_pretokenized': True,
}
nlp = stanfordnlp.Pipeline(**config)
doc = nlp('Joe Smith lives in California .\nHe loves pizza .')
print(doc.conll_file.conll_as_string())
```

should produce the expected CoNLL output for the two pretokenized sentences.
Thanks @yuhaozhang and @qipeng! The other thread was helpful context. In my case, since my text is not only pretokenized but already in a CoNLL-U file, can I feed the file directly to be annotated? Something like this (adapted from @qipeng's example in #24):

```python
import stanfordnlp
from stanfordnlp.models.common.conll import CoNLLFile

config = {
    'processors': 'tokenize,pos,lemma,depparse',
    'tokenize_pretokenized': True,
    'lang': 'es',
    'treebank': 'es_ancora'
}
nlp = stanfordnlp.Pipeline(**config)
doc = stanfordnlp.Document('')
# I'm assuming that this tokenized file is quite large
my_tokenized_file_path = 'tokenized.conllu'
doc.conll_file = CoNLLFile(filename=my_tokenized_file_path)
annotated = nlp(doc)
```

Unfortunately, when I try to run this, I get:

```
<ipython-input-120-698f2cb0fcf6> in <module>
----> 1 annotated = nlp(doc)

~/.linuxbrew/opt/python/lib/python3.7/site-packages/stanfordnlp/pipeline/core.py in __call__(self, doc)
     72         if isinstance(doc, str):
     73             doc = Document(doc)
---> 74         self.process(doc)
     75         return doc

~/.linuxbrew/opt/python/lib/python3.7/site-packages/stanfordnlp/pipeline/core.py in process(self, doc)
     66         for processor_name in self.processor_names:
     67             if self.processors[processor_name] is not None:
---> 68                 self.processors[processor_name].process(doc)
     69         doc.load_annotations()
     70 

~/.linuxbrew/opt/python/lib/python3.7/site-packages/stanfordnlp/pipeline/tokenize_processor.py in process(self, doc)
     32         output_predictions(conll_output_string, self.trainer, batches, self.vocab, None, self.config['max_seqlen'])
     33         # set conll file for doc
---> 34         doc.conll_file = conll.CoNLLFile(input_str=conll_output_string.getvalue())
     35 

~/.linuxbrew/opt/python/lib/python3.7/site-packages/stanfordnlp/models/common/conll.py in __init__(self, filename, input_str, ignore_gapping)
     18             raise Exception("File not found at: " + filename)
     19         if filename is None:
---> 20             assert input_str is not None and len(input_str) > 0
     21             self._file = input_str
     22             self._from_str = True

AssertionError: 
```

Thanks!
@lwolfsonkin It seems like there are two issues here:
So in order for this to work, you have two options:
Thanks for the thorough description, @yuhaozhang! That helps a lot! I opted for the second option, which seems to be working, though I have the issue that my tokenized CoNLL-U file is far too large to fit into memory at once. Is there a way to make the annotator act lazily, or in a streaming fashion of some kind?
We don't have that now, so you'll have to hack a little bit: you can create a wrapper that reads N sentences at a time from the pre-tokenized CoNLL file, converts those sentences into a string (with words separated by spaces and sentences separated by newlines), processes that string with the pipeline, and appends the CoNLL-format output to a file. Note that you don't need to recreate the pipeline and reload the model every time, so that will save you a lot of time.
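A minimal sketch of such a wrapper, assuming sentences in the CoNLL-U file are separated by blank lines and that word forms sit in the second tab-separated column (`read_sentence_batches` is a hypothetical helper name of my own, and the commented usage assumes the `nlp` pipeline built with `tokenize_pretokenized` as earlier in the thread):

```python
def read_sentence_batches(conll_path, batch_size=1000):
    """Yield batches of sentences from a CoNLL-U file, where each
    sentence is a string of space-separated surface tokens."""
    batch, tokens = [], []
    with open(conll_path, encoding='utf-8') as f:
        for line in f:
            line = line.rstrip('\n')
            if not line:  # blank line ends a sentence
                if tokens:
                    batch.append(' '.join(tokens))
                    tokens = []
                if len(batch) == batch_size:
                    yield batch
                    batch = []
            elif not line.startswith('#'):
                # column 2 of a CoNLL-U token line is the word form
                tokens.append(line.split('\t')[1])
        if tokens:  # flush a sentence with no trailing blank line
            batch.append(' '.join(tokens))
        if batch:
            yield batch

# Hypothetical usage: build the pipeline once, then stream batches
# through it and append the output to disk.
# with open('parsed.conllu', 'w') as out:
#     for batch in read_sentence_batches('tokenized.conllu'):
#         doc = nlp('\n'.join(batch))
#         out.write(doc.conll_file.conll_as_string())
```

Only one batch of sentences is ever held in memory, and the model is loaded a single time.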
Another alternative is to simply divide up your CoNLL-U file, process each part with stanfordnlp, and then put the results back together.
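As a sketch of that splitting approach, assuming sentences are separated by single blank lines (the file names, and the tiny demo input standing in for the real file, are placeholders), a short awk script can cut the file every n sentences without loading it into memory:

```shell
# demo input: four one-token sentences (placeholder for your real file)
printf '1\tA\n\n1\tB\n\n1\tC\n\n1\tD\n\n' > input.conllu

# write each line to the current part file; after every n blank lines
# (i.e., n sentence boundaries), close it and start the next part
awk -v n=2 '
  { out = sprintf("part_%03d.conllu", f); print > out }
  /^$/ { c++; if (c == n) { c = 0; f++; close(out) } }
' input.conllu
```

Because CoNLL-U comment lines (`# ...`) immediately precede their sentence, counting blank lines keeps each comment attached to the correct part.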
Now part of the v0.1.2 release, and part of the PyPI distribution!
For my application, I have data in a (very large) CoNLL-U file that has already been tokenized, which I would like to parse. How can I parse it lazily and write the result back to a file?
Thank you!