compatibility with udpipe? #3
Comments
I am essentially trying to use the nice
and the likes with the output generated from |
@kasperwelbers, I'm guessing you have a good answer for this :) |
Hi @randomgambit, there have recently been some major updates to the package, which actually include direct support for udpipe. There have also been minor changes to the syntax (compared to the tutorial from which you got the code). I've added a pdf of a paper we are working on that serves as the most recent tutorial (see the readme). The udpipe implementation is not yet in the tutorial, but see the ?udpipe_tokenindex function for details. You can create the tokens as follows.
|
(Thanks Kasper! Closing this issue, feel free to reopen if there are still things unclear!) |
@kasperwelbers @vanatteveldt thanks for your quick reply, but are you able to get the output of
using the Thanks! |
There are several things at play here. Firstly, the "save" argument is now called "label", so the query becomes
Secondly, udpipe uses universal dependencies, so the POS tag query should be "VERB". The example above is based on the Penn Treebank, whose verb tags are different: they all start with VB. So it becomes:
By default this now also returns the 'fill' nodes, so if you just want to find the verb, fill should be turned off.
|
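To make the tag-scheme difference in the comment above concrete, here is a minimal plain-Python sketch. It does not use rsyntax or udpipe; the helper `is_verb` and the two example tag lists are hypothetical and exist only for illustration.

```python
# Penn Treebank verb tags all share the "VB" prefix; Universal Dependencies
# uses the single coarse tag "VERB" (auxiliaries get the separate tag "AUX").
PENN_VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}


def is_verb(tag: str, scheme: str) -> bool:
    """Return True if `tag` marks a verb under the given tag scheme."""
    if scheme == "penn":
        return tag in PENN_VERB_TAGS
    if scheme == "ud":
        return tag == "VERB"
    raise ValueError(f"unknown scheme: {scheme!r}")


# The same four-word sentence tagged under both schemes (hand-written tags).
tags_penn = ["NNP", "VBD", "DT", "NN"]
tags_ud = ["PROPN", "VERB", "DET", "NOUN"]

penn_hits = [t for t in tags_penn if is_verb(t, "penn")]
ud_hits = [t for t in tags_ud if is_verb(t, "ud")]
print(penn_hits, ud_hits)
```

A query written for one scheme silently matches nothing under the other, which is consistent with the empty result reported earlier in the thread.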
thank you @kasperwelbers , this is very helpful. Actually what I am trying to do is to reproduce the noun-phrase extraction that is available in
My understanding is that |
yes @randomgambit, you could use rsyntax for noun phrase extraction. Broadly speaking, the purpose of rsyntax is to provide a query language for dependency trees. This means you can very specifically determine what part of the tree to extract. The tradeoff is that you do need to define the query yourself. For a base noun phrase this should be fairly easy. The words that describe the noun should be its children, so you get these automatically due to the fill heuristic (see paper for details). The following code annotates the tokens and plots the dependency tree (which helps to develop queries).
The tokens data.table now has a column (np_id) that gives the noun phrases unique ids. Note that you might prefer to exclude the 'case' child ("from") from the third noun phrase. This can be achieved by specifying the exclusion in the query.
|
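The idea in the comment above (a noun plus everything below it in the tree, optionally minus the 'case' child) can be sketched in plain Python. This is not rsyntax code. The sentence and its parse are hand-written assumptions rather than real udpipe output, and `noun_phrases` is a hypothetical helper.

```python
from collections import defaultdict

# Hand-written toy parse (an assumption, not real udpipe output):
# (token_id, word, upos, head_id, relation); head_id 0 marks the root.
TOKENS = [
    (1, "The", "DET", 3, "det"),
    (2, "old", "ADJ", 3, "amod"),
    (3, "man", "NOUN", 4, "nsubj"),
    (4, "received", "VERB", 0, "root"),
    (5, "a", "DET", 6, "det"),
    (6, "letter", "NOUN", 4, "obj"),
    (7, "from", "ADP", 9, "case"),
    (8, "his", "PRON", 9, "nmod:poss"),
    (9, "sister", "NOUN", 4, "obl"),
]


def noun_phrases(tokens, exclude=()):
    """Collect each NOUN with its descendants, skipping excluded relations."""
    by_id = {t[0]: t for t in tokens}
    children = defaultdict(list)
    for tid, _, _, head, _ in tokens:
        children[head].append(tid)

    def subtree(tid):
        ids = [tid]
        for child in children[tid]:
            if by_id[child][4] in exclude:
                continue  # drop the excluded child and everything below it
            ids.extend(subtree(child))
        return ids

    phrases = []
    for tid, _, upos, _, _ in tokens:
        if upos == "NOUN":
            words = [by_id[i][1] for i in sorted(subtree(tid))]
            phrases.append(" ".join(words))
    return phrases


with_case = noun_phrases(TOKENS)
without_case = noun_phrases(TOKENS, exclude={"case"})
print(with_case)     # ['The old man', 'a letter', 'from his sister']
print(without_case)  # ['The old man', 'a letter', 'his sister']
```

As in the thread, the third phrase picks up the preposition only because the parse attaches "from" to the noun as a 'case' child.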
interesting, thanks! so this is perhaps a way to get the proper noun phrases back
something I dont get is how did you figure out that I needed to exclude the |
Also, I am a bit puzzled by the following parsing. In Spacy I get the correct
but here I get very strange results
gives:
while
Am I missing something obvious here? |
Right, sorry, I didn't mean to imply that the query I gave was already good enough. My point was that you can use rsyntax to create a noun phrase chunker by writing queries for noun phrases in dependency trees. This is what spacy does as well. To develop a proper noun chunker, we can use the spacy code as inspiration. We see that spacy specifically looks at nouns, proper nouns and pronouns with certain types of parent relations (nsubj, ROOT, etc.). This, for instance, avoids treating nouns that have the 'compound' relation as the head of a noun phrase. We cannot directly use spacy's approach, because spacy seems to use a slightly different version of universal dependencies for English. For instance, the "case" token seen before isn't actually a child of the noun in the spacy dependency tree (which is probably better). udpipe also has slightly different relation labels, such as 'obl'. Finally, to really see what spacy does here we'd need to see what .subtree returns. I recall spacy doesn't include relative clauses (relcl) in the noun chunks, so those might be excluded from the subtree. A quick and dirty mimic in the rsyntax query, with the previous exclusion of "case", would be:
So, this says: look for a head where the relation is one of the given labels, and the pos tag is a type of noun. Then look for all children except those of relation 'case' or 'relcl'. The connected=T argument states that the children of excluded children should also be excluded. Note that in the output 'doubt' is included in the noun chunk for 'china trade deal'. Again, this is either how udpipe does it, or a parse error. In my eyes, the parse tree from spacy makes more sense here.
For udpipe we'd have to exclude this type of relation manually as well.
So this is approaching a decent noun phrase chunker, but surely we'd need a bit more testing. If we have the code for another noun phrase chunker based on udpipe we could imitate it. |
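The chunking logic described in the comment above can be sketched independently of rsyntax in plain Python. Everything here is an assumption made for illustration: the parse is hand-written rather than udpipe output, the `HEAD_RELATIONS` set is an arbitrary sample, `noun_chunks` is a hypothetical helper, and the relative-clause label is spelled `acl:relcl` as in Universal Dependencies (the thread calls it relcl). The `connected` flag mimics the behaviour described for connected=T: when it is on, tokens below an excluded child are dropped as well.

```python
from collections import defaultdict, namedtuple

Token = namedtuple("Token", "id word upos head relation")

# Hand-written toy parse (an assumption, not real udpipe output).
TOKENS = [
    Token(1, "The", "DET", 4, "det"),
    Token(2, "China", "PROPN", 4, "compound"),
    Token(3, "trade", "NOUN", 4, "compound"),
    Token(4, "deal", "NOUN", 8, "nsubj"),
    Token(5, "that", "PRON", 7, "obj"),
    Token(6, "investors", "NOUN", 7, "nsubj"),
    Token(7, "wanted", "VERB", 4, "acl:relcl"),
    Token(8, "collapsed", "VERB", 0, "root"),
]

HEAD_RELATIONS = {"nsubj", "obj", "iobj", "obl", "nmod", "root"}  # illustrative
NOUN_TAGS = {"NOUN", "PROPN", "PRON"}
EXCLUDE = {"case", "acl:relcl"}


def noun_chunks(tokens, connected=True):
    """Return one chunk per noun-like head whose relation is in HEAD_RELATIONS."""
    by_id = {t.id: t for t in tokens}
    children = defaultdict(list)
    for t in tokens:
        children[t.head].append(t.id)

    def collect(tid):
        ids = [tid]
        for child in children[tid]:
            excluded = by_id[child].relation in EXCLUDE
            if excluded and connected:
                continue  # drop the excluded child and its whole subtree
            below = collect(child)
            # If not connected, drop only the excluded token itself.
            ids.extend(below[1:] if excluded else below)
        return ids

    chunks = []
    for t in tokens:
        if t.relation in HEAD_RELATIONS and t.upos in NOUN_TAGS:
            words = [by_id[i].word for i in sorted(collect(t.id))]
            chunks.append(" ".join(words))
    return chunks


chunks = noun_chunks(TOKENS)
loose = noun_chunks(TOKENS, connected=False)
print(chunks)  # ['The China trade deal', 'that', 'investors']
print(loose)   # ['The China trade deal that investors', 'that', 'investors']
```

"China" and "trade" do not start chunks of their own because 'compound' is not among the head relations, and with the connected behaviour the relative clause is removed as a whole rather than leaving its children behind.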
thanks @kasperwelbers ! I will run some more tests and report to the base |
Hello there! Thanks for this nice package. I am trying to use it with udpipe, as it appears this should be compatible. However, I am unable to make it work. What is the exact syntax to use?
Here is my udpipe output:
Thanks!