
Inputting Tokenized data to Depparser #34

Closed
lwolfsonkin opened this issue Feb 19, 2019 · 8 comments

@lwolfsonkin
Contributor

For my application, I have data in a (very large) CoNLL-U file that has already been tokenized, and I would like to parse it. How can I (lazily) parse this file and write the result back to disk?

Thank you!

@yuhaozhang
Member

Hi @lwolfsonkin, it seems like your question is related to https://github.com/stanfordnlp/stanfordnlp/issues/24. Can you take a look and see if the approach there works for you?

@qipeng
Collaborator

qipeng commented Feb 19, 2019

More specifically, the following code

import stanfordnlp

config = {
    'processors': 'tokenize,pos,lemma,depparse',
    'tokenize_pretokenized': True,
}
nlp = stanfordnlp.Pipeline(**config)
doc = nlp('Joe Smith lives in California .\nHe loves pizza .')
print(doc.conll_file.conll_as_string())

should produce

1	Joe	Joe	PROPN	NNP	Number=Sing	3	nsubj	_	_
2	Smith	Smith	PROPN	NNP	Number=Sing	1	flat	_	_
3	lives	live	VERB	VBZ	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	0	root	_	_
4	in	in	ADP	IN	_	5	case	_	_
5	California	California	PROPN	NNP	Number=Sing	3	obl	_	_
6	.	.	PUNCT	.	_	3	punct	_	_

1	He	he	PRON	PRP	Case=Nom|Gender=Masc|Number=Sing|Person=3|PronType=Prs	2	nsubj	_	_
2	loves	love	VERB	VBZ	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	0	root	_	_
3	pizza	pizza	NOUN	NN	Number=Sing	2	obj	_	_
4	.	.	PUNCT	.	_	2	punct	_	_

When tokenize_pretokenized is set to True, we treat the input text as newline-separated sentences, each consisting of space-separated words.
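If your tokens are already in memory as lists, the input string for this mode can be built with a one-line helper (a sketch; the function name is ours, not part of stanfordnlp):

```python
def to_pretokenized(sentences):
    """Join token lists into the pretokenized input format:
    words separated by spaces, sentences separated by newlines."""
    return '\n'.join(' '.join(sent) for sent in sentences)

text = to_pretokenized([
    ['Joe', 'Smith', 'lives', 'in', 'California', '.'],
    ['He', 'loves', 'pizza', '.'],
])
# text == 'Joe Smith lives in California .\nHe loves pizza .'
```

The resulting string can then be passed directly to the pipeline as in the example above.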

@lwolfsonkin
Contributor Author

Thanks @yuhaozhang and @qipeng! The other thread was helpful context, and the tokenize_pretokenized option is definitely useful for providing pre-tokenized text.

In my case, since my text is not only pre-tokenized but already in a CoNLL-U file, can I feed the file directly to be annotated?

Something like this (adapted from @qipeng's example in #24):

config = {
    'processors': 'tokenize,pos,lemma,depparse',
    'tokenize_pretokenized': True,
    'lang': 'es',
    'treebank': 'es_ancora',
}
nlp = stanfordnlp.Pipeline(**config)
doc = stanfordnlp.Document('')
# I'm assuming that this tokenized file is quite large
my_tokenized_file_path = 'tokenized.conllu'
doc.conll_file = CoNLLFile(filename=my_tokenized_file_path)
annotated = nlp(doc)

Unfortunately, when I try to run this, I get:

<ipython-input-120-698f2cb0fcf6> in <module>
----> 1 annotated = nlp(doc)

~/.linuxbrew/opt/python/lib/python3.7/site-packages/stanfordnlp/pipeline/core.py in __call__(self, doc)
     72         if isinstance(doc, str):
     73             doc = Document(doc)
---> 74         self.process(doc)
     75         return doc

~/.linuxbrew/opt/python/lib/python3.7/site-packages/stanfordnlp/pipeline/core.py in process(self, doc)
     66         for processor_name in self.processor_names:
     67             if self.processors[processor_name] is not None:
---> 68                 self.processors[processor_name].process(doc)
     69         doc.load_annotations()
     70

~/.linuxbrew/opt/python/lib/python3.7/site-packages/stanfordnlp/pipeline/tokenize_processor.py in process(self, doc)
     32             output_predictions(conll_output_string, self.trainer, batches, self.vocab, None, self.config['max_seqlen'])
     33             # set conll file for doc
---> 34             doc.conll_file = conll.CoNLLFile(input_str=conll_output_string.getvalue())
     35

~/.linuxbrew/opt/python/lib/python3.7/site-packages/stanfordnlp/models/common/conll.py in __init__(self, filename, input_str, ignore_gapping)
     18             raise Exception("File not found at: " + filename)
     19         if filename is None:
---> 20             assert input_str is not None and len(input_str) > 0
     21             self._file = input_str
     22             self._from_str = True

AssertionError:

Thanks!

@yuhaozhang
Member

@lwolfsonkin It seems like there are two issues here:

  1. Based on the error log, it seems like you are using the v0.1.0 release. The tokenize_pretokenized option is a newly added feature that is not yet in any release, so currently you'll need to install stanfordnlp from source to use it.

  2. When you add the tokenize processor, the pipeline will assume there is a valid string input, instead of a CoNLL file object.

So in order for this to work, you have two options:

  1. Install stanfordnlp from the master branch, convert your pre-tokenized CoNLL file to a text string, and use your current config; or
  2. Feed the pre-tokenized CoNLL object as you are doing now, but remove the tokenize processor from the list.

@lwolfsonkin
Contributor Author

Thanks for the thorough description, @yuhaozhang! That helps a lot! I opted for the second option, which seems to be working. However, my tokenized CoNLL-U file is far too large to fit into memory at once. Is there a way to make the annotator act lazily, or in some streaming fashion?

@yuhaozhang
Member

We don't have that now, so you'll have to hack a little bit: you can create a wrapper that reads N sentences at a time from the pre-tokenized CoNLL file, converts those sentences into a string (with words separated by spaces and sentences separated by newlines), processes that string with the pipeline, and appends the CoNLL-format output to a file. Note that you don't need to recreate the pipeline and reload the models for every batch, so reusing them will save you a lot of time.
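The reading side of such a wrapper might look like the sketch below (pure Python, independent of stanfordnlp; it assumes the word form is the second tab-separated field of each CoNLL-U line and that sentences are separated by blank lines; the function name is ours):

```python
def pretokenized_batches(path, batch_size=1000):
    """Lazily yield pretokenized strings of up to batch_size sentences
    each, read from a CoNLL-U file without loading it all into memory."""
    sentences, words = [], []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line.startswith('#'):          # skip comment lines
                continue
            if not line:                      # blank line ends a sentence
                if words:
                    sentences.append(' '.join(words))
                    words = []
                if len(sentences) >= batch_size:
                    yield '\n'.join(sentences)
                    sentences = []
            else:
                words.append(line.split('\t')[1])   # FORM column
        if words:                             # flush a trailing sentence
            sentences.append(' '.join(words))
        if sentences:
            yield '\n'.join(sentences)

# Each batch can then be fed to the already-constructed pipeline and the
# output appended to a file, e.g.:
#
# with open('annotated.conllu', 'a') as out:
#     for batch in pretokenized_batches('tokenized.conllu', 1000):
#         doc = nlp(batch)
#         out.write(doc.conll_file.conll_as_string())
```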

@qipeng
Collaborator

qipeng commented Feb 22, 2019

Another alternative is to simply divide up your CoNLL-U file and process each part with stanfordnlp, before putting them back together.
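That splitting has to happen at sentence boundaries (blank lines), which a short helper can handle (a sketch; the function name and part-file naming scheme are placeholders):

```python
def split_conllu(path, out_prefix, sentences_per_part=10000):
    """Split a CoNLL-U file into parts of up to sentences_per_part
    sentences each, breaking only at blank-line sentence boundaries.
    Returns the list of part file names."""
    parts, buf, count = [], [], 0

    def flush():
        name = '%s.%03d.conllu' % (out_prefix, len(parts))
        with open(name, 'w', encoding='utf-8') as out:
            out.writelines(buf)
        parts.append(name)

    with open(path, encoding='utf-8') as f:
        for line in f:
            buf.append(line)
            if line.strip() == '':            # blank line ends a sentence
                count += 1
                if count >= sentences_per_part:
                    flush()
                    buf, count = [], 0
    if any(l.strip() for l in buf):           # write a trailing partial part
        flush()
    return parts
```

Each part can then be processed independently and the outputs concatenated in order.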

@qipeng
Collaborator

qipeng commented Feb 26, 2019

Now part of the v0.1.2 release, and part of the PyPI distribution!
