
Inputting Tokenized data to Depparser #34

Closed
lwolfsonkin opened this issue Feb 19, 2019 · 8 comments

@lwolfsonkin
Contributor

For my application, I have data in a (very large) CoNLL-U file that has already been tokenized, and I would like to parse it. How can I (lazily) parse this file and write the result back to disk?

Thank you!

@yuhaozhang
Member

Hi @lwolfsonkin, it seems like your question is related to https://github.com/stanfordnlp/stanfordnlp/issues/24. Can you take a look and see if the approach there works for you?

@qipeng
Collaborator

qipeng commented Feb 19, 2019

More specifically, the following code

import stanfordnlp

config = {
    'processors': 'tokenize,pos,lemma,depparse',
    'tokenize_pretokenized': True,
}
nlp = stanfordnlp.Pipeline(**config)
doc = nlp('Joe Smith lives in California .\nHe loves pizza .')
print(doc.conll_file.conll_as_string())

should produce

1	Joe	Joe	PROPN	NNP	Number=Sing	3	nsubj	_	_
2	Smith	Smith	PROPN	NNP	Number=Sing	1	flat	_	_
3	lives	live	VERB	VBZ	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	0	root	_	_
4	in	in	ADP	IN	_	5	case	_	_
5	California	California	PROPN	NNP	Number=Sing	3	obl	_	_
6	.	.	PUNCT	.	_	3	punct	_	_

1	He	he	PRON	PRP	Case=Nom|Gender=Masc|Number=Sing|Person=3|PronType=Prs	2	nsubj	_	_
2	loves	love	VERB	VBZ	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	0	root	_	_
3	pizza	pizza	NOUN	NN	Number=Sing	2	obj	_	_
4	.	.	PUNCT	.	_	2	punct	_	_

When tokenize_pretokenized is set to True, we treat the input text as newline-separated sentences, each consisting of space-separated words.
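If your tokens are already in memory as lists, the input string for this mode can be built with a one-line helper (a sketch; the function name is ours, not part of stanfordnlp):

```python
def to_pretokenized(sentences):
    """Join token lists into the pretokenized input format:
    words separated by spaces, sentences separated by newlines."""
    return '\n'.join(' '.join(sent) for sent in sentences)

text = to_pretokenized([
    ['Joe', 'Smith', 'lives', 'in', 'California', '.'],
    ['He', 'loves', 'pizza', '.'],
])
# text == 'Joe Smith lives in California .\nHe loves pizza .'
```

The resulting string can then be passed directly to the pipeline as in the example above.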

@lwolfsonkin
Contributor Author

Thanks @yuhaozhang and @qipeng! The other thread was helpful context, and the tokenize_pretokenized option is definitely useful for providing pre-tokenized text.

In my case, since my text is not only pre-tokenized but already in a CoNLL-U file, can I feed the file directly to be annotated?

Something like this (adapted from @qipeng's example in #24):

config = {
    'processors': 'tokenize,pos,lemma,depparse',
    'tokenize_pretokenized': True,
    'lang': 'es',
    'treebank': 'es_ancora',
}
nlp = stanfordnlp.Pipeline(**config)
doc = stanfordnlp.Document('')
# I'm assuming that this tokenized file is quite large
my_tokenized_file_path = 'tokenized.conllu'
doc.conll_file = CoNLLFile(filename=my_tokenized_file_path)
annotated = nlp(doc)

Unfortunately, when I try to run this, I get:

<ipython-input-120-698f2cb0fcf6> in <module>
----> 1 annotated = nlp(doc)

~/.linuxbrew/opt/python/lib/python3.7/site-packages/stanfordnlp/pipeline/core.py in __call__(self, doc)
     72         if isinstance(doc, str):
     73             doc = Document(doc)
---> 74         self.process(doc)
     75         return doc

~/.linuxbrew/opt/python/lib/python3.7/site-packages/stanfordnlp/pipeline/core.py in process(self, doc)
     66         for processor_name in self.processor_names:
     67             if self.processors[processor_name] is not None:
---> 68                 self.processors[processor_name].process(doc)
     69         doc.load_annotations()
     70

~/.linuxbrew/opt/python/lib/python3.7/site-packages/stanfordnlp/pipeline/tokenize_processor.py in process(self, doc)
     32             output_predictions(conll_output_string, self.trainer, batches, self.vocab, None, self.config['max_seqlen'])
     33             # set conll file for doc
---> 34             doc.conll_file = conll.CoNLLFile(input_str=conll_output_string.getvalue())
     35

~/.linuxbrew/opt/python/lib/python3.7/site-packages/stanfordnlp/models/common/conll.py in __init__(self, filename, input_str, ignore_gapping)
     18             raise Exception("File not found at: " + filename)
     19         if filename is None:
---> 20             assert input_str is not None and len(input_str) > 0
     21             self._file = input_str
     22             self._from_str = True

AssertionError:

Thanks!

@yuhaozhang
Member

@lwolfsonkin It seems like there are two issues here:

  1. Based on the error log, it seems like you are using the v0.1.0 release. The tokenize_pretokenized option is a newly added feature that is not yet in any release, so currently you'll need to install stanfordnlp from source to use it.

  2. When you add the tokenize processor, the pipeline will assume there is a valid string input, instead of a CoNLL file object.

So in order for this to work, you have two options:

  1. Install stanfordnlp from the master branch, convert your pre-tokenized CoNLL file to a text string, and use your current config; or
  2. Feed the pre-tokenized CoNLL object as you are doing now, but remove the tokenize processor from the list.

@lwolfsonkin
Contributor Author

Thanks for the thorough description, @yuhaozhang! That helps a lot! I opted for the second option, which seems to be working. However, my tokenized CoNLL-U file is far too large to fit into memory at once. Is there a way to make the annotator act lazily, or in some streaming fashion?

@yuhaozhang
Member

We don't have that now, so you'll have to hack a little bit: you can create a wrapper that reads N sentences at a time from the pre-tokenized CoNLL file, converts those sentences into a string (with words separated by spaces and sentences separated by newlines), processes that string with the pipeline, and appends the CoNLL-format output to a file. Note that you don't need to recreate the pipeline and reload the models for every batch, so reusing them will save you a lot of time.
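The reading side of such a wrapper might look like the sketch below (pure Python, independent of stanfordnlp; it assumes the word form is the second tab-separated field of each CoNLL-U line and that sentences are separated by blank lines; the function name is ours):

```python
def pretokenized_batches(path, batch_size=1000):
    """Lazily yield pretokenized strings of up to batch_size sentences
    each, read from a CoNLL-U file without loading it all into memory."""
    sentences, words = [], []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line.startswith('#'):          # skip comment lines
                continue
            if not line:                      # blank line ends a sentence
                if words:
                    sentences.append(' '.join(words))
                    words = []
                if len(sentences) >= batch_size:
                    yield '\n'.join(sentences)
                    sentences = []
            else:
                words.append(line.split('\t')[1])   # FORM column
        if words:                             # flush a trailing sentence
            sentences.append(' '.join(words))
        if sentences:
            yield '\n'.join(sentences)

# Each batch can then be fed to the already-constructed pipeline and the
# output appended to a file, e.g.:
#
# with open('annotated.conllu', 'a') as out:
#     for batch in pretokenized_batches('tokenized.conllu', 1000):
#         doc = nlp(batch)
#         out.write(doc.conll_file.conll_as_string())
```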

@qipeng
Collaborator

qipeng commented Feb 22, 2019

Another alternative is to simply divide up your CoNLL-U file and process each part with stanfordnlp, before putting them back together.
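That splitting has to happen at sentence boundaries (blank lines), which a short helper can handle (a sketch; the function name and part-file naming scheme are placeholders):

```python
def split_conllu(path, out_prefix, sentences_per_part=10000):
    """Split a CoNLL-U file into parts of up to sentences_per_part
    sentences each, breaking only at blank-line sentence boundaries.
    Returns the list of part file names."""
    parts, buf, count = [], [], 0

    def flush():
        name = '%s.%03d.conllu' % (out_prefix, len(parts))
        with open(name, 'w', encoding='utf-8') as out:
            out.writelines(buf)
        parts.append(name)

    with open(path, encoding='utf-8') as f:
        for line in f:
            buf.append(line)
            if line.strip() == '':            # blank line ends a sentence
                count += 1
                if count >= sentences_per_part:
                    flush()
                    buf, count = [], 0
    if any(l.strip() for l in buf):           # write a trailing partial part
        flush()
    return parts
```

Each part can then be processed independently and the outputs concatenated in order.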

@qipeng
Collaborator

qipeng commented Feb 26, 2019

Now part of the v0.1.2 release, and part of the PyPI distribution!
