
compatibility with udpipe? #3

Closed · randomgambit opened this issue Nov 27, 2018 · 12 comments

@randomgambit

Hello there! Thanks for this nice package. I am trying to use it with udpipe, as it appears the two should be compatible.

However, I am unable to make it work. What is the exact syntax to use? Here is my udpipe output:


> head(x,1)
    doc_id paragraph_id sentence_id                           sentence token_id token lemma  upos xpos feats head_token_id dep_rel
1 32198807            1           1 Gwen fue una magnifica anfitriona.        1  Gwen  Gwen PROPN <NA>  <NA>             5   nsubj
  deps misc phrase_tag
1 <NA> <NA>          N
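
For reference, output in this shape comes from the standard udpipe pipeline, roughly as follows (a sketch; the phrase_tag column above is a custom addition, not part of udpipe's default output):

library(udpipe)

## download and load a pre-trained Spanish model
## (the exact model file path comes from the download)
m = udpipe_download_model(language = "spanish")
m = udpipe_load_model(m$file_model)

## annotate a sentence and flatten the result to a data.frame
x = as.data.frame(udpipe_annotate(m, x = "Gwen fue una magnifica anfitriona."))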

Thanks!

@randomgambit (Author)

I am essentially trying to use queries like

q = tquery(POS='VB*', save='verb')
apply_queries(tokens, q1 = q)

with the output generated from udpipe above. Is that possible?

@vanatteveldt (Owner)

@kasperwelbers, I'm guessing you have a good answer for this :)

@kasperwelbers (Collaborator)

Hi @randomgambit,

There have recently been some major updates to the package, which actually include direct support for udpipe.

There have also been minor changes to the syntax (compared to the tutorial from which you got the code). I've added a PDF of a paper we are working on that serves as the most recent tutorial (see the readme).

The udpipe implementation is not yet in the tutorial, but see ?udpipe_tokenindex for details. You can create the tokens as follows:

tokens = udpipe_tokenindex("John loves Mary", udpipe_model = "english")

## visualize the dependency tree
tokens %>%
  plot_tree(token, lemma, POS)

@vanatteveldt (Owner)

(Thanks Kasper! Closing this issue, feel free to reopen if there are still things unclear!)

@randomgambit (Author)

@kasperwelbers @vanatteveldt thanks for your quick reply, but are you able to get the output of

q = tquery(POS='VB*', save='verb')
apply_queries(tokens, q1 = q)

using udpipe_tokenindex? I tried, and it did not work for me. I am much more interested in that function than in plot_tree.

Thanks!

@kasperwelbers (Collaborator)

There are several things at play here.

Firstly, the "save" argument is now called "label", so the query becomes:

q = tquery(POS='VB*', label='verb')

Secondly, udpipe uses universal dependencies, so the POS tag query should be "VERB". The example above is based on the Penn Treebank tagset, in which the POS tags for verbs start with VB.

So it becomes:

tokens = udpipe_tokenindex("John loves Mary", udpipe_model = 'english')

q = tquery(POS='VERB', label='verb')
apply_queries(tokens, q1 = q)

By default this now also gives the 'fill' nodes (the children of the matched node). So if you just want to find the verb itself, fill should be turned off:

apply_queries(tokens, q1 = q, fill = F)

@randomgambit (Author)

Thank you @kasperwelbers, this is very helpful. What I am actually trying to do is reproduce the noun-phrase extraction that is available in spacy, such as:


import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"Autonomous cars shift insurance liability toward manufacturers")
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
          chunk.root.head.text)

Autonomous cars cars nsubj shift
insurance liability liability dobj shift
manufacturers manufacturers pobj toward

My understanding is that rsyntax is the right tool for that. Is that correct?
Thanks!

@kasperwelbers (Collaborator)

Yes @randomgambit, you could use rsyntax for noun phrase extraction.

Broadly speaking, the purpose of rsyntax is to provide a query language for dependency trees. This means you can determine very specifically which part of the tree to extract. The tradeoff is that you do need to define the query yourself.

For a base noun phrase this should be fairly easy. The words that describe the noun should be its children, so you get these automatically due to the fill heuristic (see the paper for details). The following code annotates the tokens and plots the dependency tree (which helps when developing queries):

tokens = udpipe_tokenindex("The young cat scratched the curls from the old and broken stairs", udpipe_model = 'english')

q = tquery(POS='NOUN', label='NP')

tokens = tokens %>%
  annotate('np', q, overwrite = T) %>%
  plot_tree(annotation = 'np', token, lemma, POS)

tokens

The tokens data.table now has a column (np_id) that gives each noun phrase a unique id.

Note that you might prefer to exclude the 'case' child ("from") from the third noun phrase. This can be achieved by refining the query:

q = tquery(POS='NOUN', label='NP',
           fill(NOT(relation = 'case')))
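
Re-running the annotation with this refined query (a sketch, reusing the pipeline from above) should then drop "from" from the third noun phrase:

tokens = tokens %>%
  annotate('np', q, overwrite = T) %>%
  plot_tree(annotation = 'np', token, lemma, POS)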

@randomgambit (Author)

Interesting, thanks! So this is perhaps a way to get the proper noun phrases back:

library(dplyr)
tokens %>%
  filter(!is.na(np_id)) %>%
  group_by(doc_id, sentence, np_id) %>%
  summarize(noun_phrase = paste(lemma, collapse = ' '))
# A tibble: 3 x 4
# Groups:   doc_id, sentence [?]
  doc_id sentence np_id  noun_phrase             
  <fct>     <int> <fct>  <chr>                   
1 1             1 1.1.3  the young cat           
2 1             1 1.1.6  the curl                
3 1             1 1.1.12 the old and broken stair

Something I don't get: how did you figure out that I needed to exclude the 'case' child? Is that a common exception?

@randomgambit (Author)

Also, I am a bit puzzled by the following parse.

In spacy I get the correct "White House" noun chunk:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"White House casts doubt on China trade deal at G20")
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
          chunk.root.head.text)

White House House nsubj casts
doubt doubt dobj casts
China trade deal deal pobj on
G20 G20 pobj at

but here I get very strange results.

q = tquery(POS='NOUN', label='NP')

gives:

# Groups:   doc_id, sentence [?]
  doc_id sentence np_id noun_phrase     
  <fct>     <int> <fct> <chr>           
1 1             1 1.1.8 doubt China deal
2 1             1 1.1.7 trade  

while

q = tquery(POS='PROPN', label='NP')

# A tibble: 4 x 4
# Groups:   doc_id, sentence [?]
  doc_id sentence np_id  noun_phrase
  <fct>     <int> <fct>  <chr>      
1 1             1 1.1.1  White      
2 1             1 1.1.2  House      
3 1             1 1.1.6  China      
4 1             1 1.1.10 G20  

Am I missing something obvious here?
Thanks!!

@kasperwelbers (Collaborator)

Right, sorry, I didn't mean to imply that the query I gave was already good enough.

My point was that you can use rsyntax to create a noun phrase chunker, by writing queries for noun phrases in dependency trees. This is what spacy does as well.

To develop a proper noun chunker, we can use the spacy code as inspiration. Spacy specifically looks at nouns, proper nouns and pronouns with certain types of parent relations (nsubj, ROOT, etc.). This, for instance, ensures that nouns with the 'compound' relation are not treated as the head of a noun phrase.

We cannot directly copy spacy's approach, because spacy seems to use a slightly different version of universal dependencies for English. For instance, the "case" token seen before isn't actually a child of the noun in the spacy dependency tree (which is probably better), and udpipe has slightly different relation labels, such as 'obl'. Also, to really see what spacy does here we'd need to know what .subtree returns; I recall that spacy doesn't include relative clauses (relcl) in noun chunks, so those might be excluded from the subtree.

A quick and dirty mimic in rsyntax, keeping the earlier exclusion of "case", would be:

tokens = udpipe_tokenindex("The White House casts doubt on China trade deal at G20")

## stolen from spacy, but added 'obl'. Needs update based on universal dependency labels
labels = c('nsubj', 'nsubj:pass', 'obl', 'obj', 'iobj', 'ROOT', 'appos', 'nmod', 'nmod:poss')

q = tquery(relation = labels, POS = c('PROPN', 'NOUN', 'PRON'), label='NP',
           fill(NOT(relation = c('case', 'relcl')), connected = T))

tokens = tokens %>%
  annotate('np', q, overwrite = T) %>%
  plot_tree(annotation = 'np', token, lemma, POS)

So, this says: look for a head whose relation is one of the given labels and whose POS tag is a type of noun, then fill in all children except those with relation 'case' or 'relcl'. The connected=T argument means that the children of excluded children are also excluded.

Note that in the output 'doubt' is included in the noun chunk for 'China trade deal'. Again, this is either how udpipe parses it, or a parse error. In my eyes, the parse tree from spacy makes more sense here:

library(spacyr)
spacy_parse("The White House casts doubt on China trade deal at G20", dependency=T) %>%
  plot_tree(token, lemma, pos)

For udpipe we'd have to exclude this type of relation manually as well.

q = tquery(relation = labels, POS = c('PROPN', 'NOUN', 'PRON'), label='NP',
           fill(NOT(relation = c('case', 'advmod', 'relcl')), connected = T))

So this is approaching a decent noun phrase chunker, but it would surely need more testing. If we can find code for another noun phrase chunker based on udpipe, we could imitate it.
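
To make this easier to reuse, the query and the aggregation used earlier in this thread could be wrapped into a small helper. A hypothetical sketch (extract_noun_phrases is a made-up name, not part of rsyntax, and the label set still needs the testing mentioned above):

library(rsyntax)
library(dplyr)

## hypothetical helper: annotate noun phrases with the query developed
## above, then collapse each phrase into a single string
extract_noun_phrases = function(tokens) {
  labels = c('nsubj', 'nsubj:pass', 'obl', 'obj', 'iobj', 'ROOT', 'appos', 'nmod', 'nmod:poss')
  q = tquery(relation = labels, POS = c('PROPN', 'NOUN', 'PRON'), label = 'NP',
             fill(NOT(relation = c('case', 'advmod', 'relcl')), connected = T))
  tokens %>%
    annotate('np', q, overwrite = T) %>%
    filter(!is.na(np_id)) %>%
    group_by(doc_id, sentence, np_id) %>%
    summarize(noun_phrase = paste(token, collapse = ' '))  # use lemma instead of token for lemmatized phrases
}

## usage
tokens = udpipe_tokenindex("The White House casts doubt on China trade deal at G20", udpipe_model = 'english')
extract_noun_phrases(tokens)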

@randomgambit (Author)

Thanks @kasperwelbers! I will run some more tests and report back.
