
compatibility with udpipe? #3

Closed · randomgambit opened this issue Nov 27, 2018 · 12 comments

@randomgambit

Hello there! Thanks for this nice package. I am trying to use it with udpipe, as it appears the two should be compatible.

However, I am unable to make it work. What is the exact syntax to use? Here is my udpipe output:


> head(x,1)
    doc_id paragraph_id sentence_id                           sentence token_id token lemma  upos xpos feats head_token_id dep_rel
1 32198807            1           1 Gwen fue una magnifica anfitriona.        1  Gwen  Gwen PROPN <NA>  <NA>             5   nsubj
  deps misc phrase_tag
1 <NA> <NA>          N
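
For reference, output in this shape comes from the standard udpipe pipeline, roughly as follows (a sketch; the phrase_tag column above is a custom addition, not part of udpipe's default output):

library(udpipe)

## download and load a pre-trained Spanish model
## (the exact model file path comes from the download)
m = udpipe_download_model(language = "spanish")
m = udpipe_load_model(m$file_model)

## annotate a sentence and flatten the result to a data.frame
x = as.data.frame(udpipe_annotate(m, x = "Gwen fue una magnifica anfitriona."))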

Thanks!

@randomgambit (Author)

I am essentially trying to use queries like

q = tquery(POS='VB*', save='verb')
apply_queries(tokens, q1 = q)

with the output generated from udpipe above. Is that possible?

@vanatteveldt (Owner)

@kasperwelbers, I'm guessing you have a good answer for this :)

@kasperwelbers (Collaborator)

Hi @randomgambit,

There have recently been some major updates to the package, which actually include direct support for udpipe.

There have also been minor changes to the syntax (compared to the tutorial from which you got the code). I've added a PDF of a paper we are working on that serves as the most recent tutorial (see the readme).

The udpipe implementation is not yet in the tutorial, but see ?udpipe_tokenindex for details. You can create the tokens as follows:

tokens = udpipe_tokenindex("John loves Mary", udpipe_model = "english")

## visualize the dependency tree
tokens %>%
  plot_tree(token, lemma, POS)

@vanatteveldt (Owner)

(Thanks Kasper! Closing this issue, feel free to reopen if there are still things unclear!)

@randomgambit (Author)

@kasperwelbers @vanatteveldt thanks for your quick reply, but are you able to get the output of

q = tquery(POS='VB*', save='verb')
apply_queries(tokens, q1 = q)

using udpipe_tokenindex? I tried, and it did not work for me. I am much more interested in that function than in plot_tree.

Thanks!

@kasperwelbers (Collaborator)

There are several things at play here.

Firstly, the "save" argument is now called "label", so the query becomes:

q = tquery(POS='VB*', label='verb')

Secondly, udpipe uses universal dependencies, so the POS tag query should be "VERB". The example above is based on the Penn Treebank tagset, in which the POS tags for verbs start with VB.

So it becomes:

tokens = udpipe_tokenindex("John loves Mary", udpipe_model = 'english')

q = tquery(POS='VERB', label='verb')
apply_queries(tokens, q1 = q)

By default this now also gives the 'fill' nodes (the children of the matched node). So if you just want to find the verb itself, fill should be turned off:

apply_queries(tokens, q1 = q, fill = F)

@randomgambit (Author)

Thank you @kasperwelbers, this is very helpful. What I am actually trying to do is reproduce the noun-phrase extraction that is available in spacy, such as:


import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"Autonomous cars shift insurance liability toward manufacturers")
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
          chunk.root.head.text)

Autonomous cars cars nsubj shift
insurance liability liability dobj shift
manufacturers manufacturers pobj toward

My understanding is that rsyntax is the right tool for that. Is that correct?
Thanks!

@kasperwelbers (Collaborator)

Yes @randomgambit, you could use rsyntax for noun phrase extraction.

Broadly speaking, the purpose of rsyntax is to provide a query language for dependency trees. This means you can determine very specifically which part of the tree to extract. The tradeoff is that you do need to define the query yourself.

For a base noun phrase this should be fairly easy. The words that describe the noun should be its children, so you get these automatically due to the fill heuristic (see the paper for details). The following code annotates the tokens and plots the dependency tree (which helps when developing queries):

tokens = udpipe_tokenindex("The young cat scratched the curls from the old and broken stairs", udpipe_model = 'english')

q = tquery(POS='NOUN', label='NP')

tokens = tokens %>%
  annotate('np', q, overwrite = T) %>%
  plot_tree(annotation = 'np', token, lemma, POS)

tokens

The tokens data.table now has a column (np_id) that gives each noun phrase a unique id.

Note that you might prefer to exclude the 'case' child ("from") from the third noun phrase. This can be achieved by refining the query:

q = tquery(POS='NOUN', label='NP',
           fill(NOT(relation = 'case')))
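
Re-running the annotation with this refined query (a sketch, reusing the pipeline from above) should then drop "from" from the third noun phrase:

tokens = tokens %>%
  annotate('np', q, overwrite = T) %>%
  plot_tree(annotation = 'np', token, lemma, POS)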

@randomgambit (Author)

Interesting, thanks! So this is perhaps a way to get the proper noun phrases back:

library(dplyr)
tokens %>%
  filter(!is.na(np_id)) %>%
  group_by(doc_id, sentence, np_id) %>%
  summarize(noun_phrase = paste(lemma, collapse = ' '))
# A tibble: 3 x 4
# Groups:   doc_id, sentence [?]
  doc_id sentence np_id  noun_phrase             
  <fct>     <int> <fct>  <chr>                   
1 1             1 1.1.3  the young cat           
2 1             1 1.1.6  the curl                
3 1             1 1.1.12 the old and broken stair

Something I don't get: how did you figure out that I needed to exclude the 'case' child? Is that a common exception?

@randomgambit (Author)

Also, I am a bit puzzled by the following parse.

In spacy I get the correct "White House" noun chunk:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"White House casts doubt on China trade deal at G20")
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
          chunk.root.head.text)

White House House nsubj casts
doubt doubt dobj casts
China trade deal deal pobj on
G20 G20 pobj at

but here I get very strange results.

q = tquery(POS='NOUN', label='NP')

gives:

# Groups:   doc_id, sentence [?]
  doc_id sentence np_id noun_phrase     
  <fct>     <int> <fct> <chr>           
1 1             1 1.1.8 doubt China deal
2 1             1 1.1.7 trade  

while

q = tquery(POS='PROPN', label='NP')

# A tibble: 4 x 4
# Groups:   doc_id, sentence [?]
  doc_id sentence np_id  noun_phrase
  <fct>     <int> <fct>  <chr>      
1 1             1 1.1.1  White      
2 1             1 1.1.2  House      
3 1             1 1.1.6  China      
4 1             1 1.1.10 G20  

Am I missing something obvious here?
Thanks!!

@kasperwelbers (Collaborator)

Right, sorry, I didn't mean to imply that the query I gave was already good enough.

My point was that you can use rsyntax to create a noun phrase chunker, by writing queries for noun phrases in dependency trees. This is what spacy does as well.

To develop a proper noun chunker, we can use the spacy code as inspiration. Spacy specifically looks at nouns, proper nouns and pronouns with certain types of parent relations (nsubj, ROOT, etc.). This, for instance, ensures that nouns with the 'compound' relation are not treated as the head of a noun phrase.

We cannot directly copy spacy's approach, because spacy seems to use a slightly different version of universal dependencies for English. For instance, the "case" token seen before isn't actually a child of the noun in the spacy dependency tree (which is probably better), and udpipe has slightly different relation labels, such as 'obl'. Also, to really see what spacy does here we'd need to know what .subtree returns; I recall that spacy doesn't include relative clauses (relcl) in noun chunks, so those might be excluded from the subtree.

A quick and dirty mimic in rsyntax, keeping the earlier exclusion of "case", would be:

tokens = udpipe_tokenindex("The White House casts doubt on China trade deal at G20")

## stolen from spacy, but added 'obl'. Needs update based on universal dependency labels
labels = c('nsubj', 'nsubj:pass', 'obl', 'obj', 'iobj', 'ROOT', 'appos', 'nmod', 'nmod:poss')

q = tquery(relation = labels, POS = c('PROPN', 'NOUN', 'PRON'), label='NP',
           fill(NOT(relation = c('case', 'relcl')), connected = T))

tokens = tokens %>%
  annotate('np', q, overwrite = T) %>%
  plot_tree(annotation = 'np', token, lemma, POS)

So, this says: look for a head whose relation is one of the given labels and whose POS tag is a type of noun, then fill in all children except those with relation 'case' or 'relcl'. The connected=T argument means that the children of excluded children are also excluded.

Note that in the output 'doubt' is included in the noun chunk for 'China trade deal'. Again, this is either how udpipe parses it, or a parse error. In my eyes, the parse tree from spacy makes more sense here:

library(spacyr)
spacy_parse("The White House casts doubt on China trade deal at G20", dependency=T) %>%
  plot_tree(token, lemma, pos)

For udpipe we'd have to exclude this type of relation manually as well.

q = tquery(relation = labels, POS = c('PROPN', 'NOUN', 'PRON'), label='NP',
           fill(NOT(relation = c('case', 'advmod', 'relcl')), connected = T))

So this is approaching a decent noun phrase chunker, but it would surely need more testing. If we can find code for another noun phrase chunker based on udpipe, we could imitate it.
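
To make this easier to reuse, the query and the aggregation used earlier in this thread could be wrapped into a small helper. A hypothetical sketch (extract_noun_phrases is a made-up name, not part of rsyntax, and the label set still needs the testing mentioned above):

library(rsyntax)
library(dplyr)

## hypothetical helper: annotate noun phrases with the query developed
## above, then collapse each phrase into a single string
extract_noun_phrases = function(tokens) {
  labels = c('nsubj', 'nsubj:pass', 'obl', 'obj', 'iobj', 'ROOT', 'appos', 'nmod', 'nmod:poss')
  q = tquery(relation = labels, POS = c('PROPN', 'NOUN', 'PRON'), label = 'NP',
             fill(NOT(relation = c('case', 'advmod', 'relcl')), connected = T))
  tokens %>%
    annotate('np', q, overwrite = T) %>%
    filter(!is.na(np_id)) %>%
    group_by(doc_id, sentence, np_id) %>%
    summarize(noun_phrase = paste(token, collapse = ' '))  # use lemma instead of token for lemmatized phrases
}

## usage
tokens = udpipe_tokenindex("The White House casts doubt on China trade deal at G20", udpipe_model = 'english')
extract_noun_phrases(tokens)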

@randomgambit (Author)

Thanks @kasperwelbers! I will run some more tests and report back.
