Incorrect skipgrams #24

Closed

koheiw opened this issue Oct 4, 2016 · 47 comments

@koheiw

koheiw commented Oct 4, 2016

I expect skipgrams with n=3 and k=2 to produce

"a b c" "a b d" "a c d" "a c e" "b c d" "b c e" "b d e" "c d e"

But I am getting

> tokenizers::tokenize_skip_ngrams('a b c d e', n=3, k=2)
[[1]]
[1] "a c e" "a b c" "b c d" "c d e"

Setting k=1 does not affect the behaviour of the function either:

> tokenizers::tokenize_skip_ngrams('a b c d e', n=3, k=1)
[[1]]
[1] "a c e" "a b c" "b c d" "c d e"
@koheiw koheiw changed the title Skipgrams Incorrect skipgrams Oct 5, 2016
@lmullen
Member

lmullen commented Oct 5, 2016

@koheiw: The input in the example you give is too short for what you are asking. You have an input string which will tokenize to five words, but the window when n = 3 and k = 2 is 7 words. You get the output that you expect if you use a longer input string. For example:

x <- paste(letters, collapse = " ")
out <- tokenizers::tokenize_skip_ngrams(x, n = 3, k = 2)

That returns:

[[1]]
 [1] "a d g" "b e h" "c f i" "d g j" "e h k" "f i l" "g j m" "h k n" "i l o" "j m p" "k n q" "l o r"
[13] "m p s" "n q t" "o r u" "p s v" "q t w" "r u x" "s v y" "t w z" "a c e" "b d f" "c e g" "d f h"
[25] "e g i" "f h j" "g i k" "h j l" "i k m" "j l n" "k m o" "l n p" "m o q" "n p r" "o q s" "p r t"
[37] "q s u" "r t v" "s u w" "t v x" "u w y" "v x z" "a b c" "b c d" "c d e" "d e f" "e f g" "f g h"
[49] "g h i" "h i j" "i j k" "j k l" "k l m" "l m n" "m n o" "n o p" "o p q" "p q r" "q r s" "r s t"
[61] "s t u" "t u v" "u v w" "v w x" "w x y" "x y z"

And if I test for the output you expected, they are all there:

test <- c("a b c", "a b d", "a c d", "a c e", "b c d", "b c e", "b d e", "c d e")
all(test %in% out)

Now, the function should fail or warn when the requested window is too big for the input vector, so that's a fix I'll have to make. But do you get the output you expect when you use longer inputs?
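
For reference, the 7-word figure follows from the widest span a single skip n-gram can cover; a quick arithmetic sketch (not a function in the package):

n <- 3
k <- 2
n + (n - 1) * k  # the n words plus up to k skipped words in each of the n - 1 gaps
#> [1] 7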

@koheiw
Author

koheiw commented Oct 5, 2016

I ran your commands, but got a very different result. Correct me if I am wrong.

> x <- paste(letters, collapse = " ")
> out <- tokenizers::tokenize_skip_ngrams(x, n = 3, k = 2)
> out
[[1]]
 [1] "a d g" "b e h" "c f i" "d g j" "e h k" "f i l" "g j m" "h k n" "i l o" "j m p" "k n q" "l o r" "m p s" "n q t" "o r u"
[16] "p s v" "q t w" "r u x" "s v y" "t w z" "a c e" "b d f" "c e g" "d f h" "e g i" "f h j" "g i k" "h j l" "i k m" "j l n"
[31] "k m o" "l n p" "m o q" "n p r" "o q s" "p r t" "q s u" "r t v" "s u w" "t v x" "u w y" "v x z" "a b c" "b c d" "c d e"
[46] "d e f" "e f g" "f g h" "g h i" "h i j" "i j k" "j k l" "k l m" "l m n" "m n o" "n o p" "o p q" "p q r" "q r s" "r s t"
[61] "s t u" "t u v" "u v w" "v w x" "w x y" "x y z"

> test <- c("a b c", "a b d", "a c d", "a c e", "b c d", "b c e", "b d e", "c d e")
> test
[1] "a b c" "a b d" "a c d" "a c e" "b c d" "b c e" "b d e" "c d e"
> test %in% out
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> all(test %in% out)
[1] FALSE

@lmullen
Member

lmullen commented Oct 5, 2016

Can you please post the results of sessionInfo()?

@koheiw
Author

koheiw commented Oct 5, 2016

Here it is. Thank you for investigating this issue.

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8    LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] tokenizers_0.1.4 quanteda_0.9.9.1

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.7               lattice_0.20-34           SnowballC_0.5.1           chron_2.3-47             
 [5] grid_3.3.1                R6_2.1.3                  RcppParallel_4.3.20       httr_1.2.1               
 [9] stringi_1.1.1             data.table_1.9.6          RcppArmadillo_0.7.400.2.0 ca_0.64                  
[13] Matrix_1.2-7.1            fastmatch_1.0-4           tools_3.3.1               parallel_3.3.1 

@lmullen
Member

lmullen commented Oct 5, 2016

I had an error in the code above. It really should be test %in% out[[1]].

But looking more closely at the things that you are expecting, some of them don't belong. For instance, "a b d" has a skip of one between b and d but not between a and b, which would never happen. Same with "a c d".
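
For completeness, the corrected membership check against the earlier example reads as follows (same commands as before; out is a list with one character vector per input document):

x <- paste(letters, collapse = " ")
out <- tokenizers::tokenize_skip_ngrams(x, n = 3, k = 2)
test <- c("a b c", "a b d", "a c d", "a c e", "b c d", "b c e", "b d e", "c d e")
test %in% out[[1]]  # index into the first document's tokens, not the list itself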

@kbenoit
Contributor

kbenoit commented Oct 5, 2016

This is really a definitional debate then... We defined skipgrams in quanteda following Guthrie, D., B. Allison, W. Liu, and L. Guthrie. 2006. "A Closer Look at Skip-Gram Modelling.":

Skip-grams reported for a certain skip distance k allow a
total of k or less skips to construct the n-gram. As such, “4-skip-n-gram” results include 4 skips, 3 skips, 2 skips, 1 skip, and 0 skips (typical n-grams formed from adjacent words).

So this would include "a b d" and "a c d". The output we programmed matches this, for instance their "2-skip-tri-grams" from the paper are matched as:

> tokens <- quanteda::tokenize(toLower("Insurgents killed in ongoing fighting."), 
                               removePunct = TRUE, simplify = TRUE)
> quanteda::skipgrams(tokens, n = 3, skip = 0:2, concatenator = " ")   
 [1] "insurgents killed in"        "insurgents killed ongoing"  
 [3] "insurgents killed fighting"  "insurgents in ongoing"      
 [5] "insurgents in fighting"      "insurgents ongoing fighting"
 [7] "killed in ongoing"           "killed in fighting"         
 [9] "killed ongoing fighting"     "in ongoing fighting"   

By the way I really like your package and we are considering using it for quanteda's tokenizer.

@lmullen
Member

lmullen commented Oct 5, 2016

@kbenoit That's an interesting citation that I wasn't aware of. We will have to take account of this.

@dselivanov What are your thoughts? I'm thinking that I should take a look at the skipgrams in quanteda and modify our implementation. For the kind of applications I've been using, especially since there is often noisy OCR interpolated between the real words, that kind of definition of skip grams would make a lot of sense.

@lmullen
Member

lmullen commented Oct 5, 2016

BTW, @kbenoit, glad to hear you are thinking about tokenizers for quanteda. Let us know if you have any particular improvements that you think would be useful.

@dselivanov
Contributor

@lmullen I can only add that I'm disappointed that @kbenoit & co decided to reimplement some features that have already been in text2vec for quite some time and on which I spent a lot of time (feature hashing and co-occurrences - see https://github.com/kbenoit/quanteda/branches).

Especially because of some edge cases that not every developer will realize.

@dselivanov
Contributor

Instead of implementing something new and useful for the community.

@lmullen
Member

lmullen commented Oct 5, 2016

@dselivanov Well, everyone is free to implement whatever they like. While I think that some convergence of the many options for text analysis in R is desirable, I think that is far more likely to happen in the R environment by working together to make our packages inter-operable rather than by asking everyone to pick a single package.

@dselivanov
Contributor

everyone is free to implement whatever they like

vs

working together to make our packages inter-operable

@lmullen
Member

lmullen commented Apr 1, 2017

There is a Python implementation here based on the paper @kbenoit and @koheiw cite above: http://stackoverflow.com/questions/31847682/how-to-compute-skipgrams-in-python

@lmullen lmullen mentioned this issue Apr 2, 2017
@Ironholds
Member

appears, as requested

what's the latest?

@lmullen
Member

lmullen commented Apr 2, 2017

@Ironholds I've figured out how to do skip grams in a way that I think will be both performant and correct. I'm working on this on the skipgrams branch.

https://github.com/ropensci/tokenizers/blob/skipgrams/R/ngram-tokenizers.R#L99

Here's the idea. The R function tokenize_skip_ngrams() will call the function get_valid_skips() linked above. This produces a list of 0-indexed word positions which should be part of the skip grams, depending on the values of n and k. Results look like this:

tokenizers:::get_valid_skips(n = 3, k = 2)
#> [[1]]
#> [1] 0 1 2
#> 
#> [[2]]
#> [1] 0 1 3
#> 
#> [[3]]
#> [1] 0 1 4
#> 
#> [[4]]
#> [1] 0 2 3
#> 
#> [[5]]
#> [1] 0 2 4
#> 
#> [[6]]
#> [1] 0 2 5
#> 
#> [[7]]
#> [1] 0 3 4
#> 
#> [[8]]
#> [1] 0 3 5
#> 
#> [[9]]
#> [1] 0 3 6
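
One way to produce that list of patterns in plain R is sketched below; it assumes each of the n - 1 gaps may skip at most k words, which matches the nine patterns above, but it is not the implementation in the linked file.

# Hypothetical sketch of generating the 0-indexed skip patterns shown above:
# every step to the next position advances by 1 to k + 1 words.
get_valid_skips_sketch <- function(n, k) {
  steps <- as.matrix(expand.grid(rep(list(1:(k + 1)), n - 1)))
  lapply(seq_len(nrow(steps)), function(i) cumsum(c(0, steps[i, ])))
}
get_valid_skips_sketch(n = 3, k = 2)  # same nine patterns, possibly in a different order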

A vector of words, along with that list of positions and n and k, will be passed to the C++ function. The C++ function will operate like so (in hastily written R-ish pseudocode):

generate_skip_grams <- function(words, skips, n, k) {
  # remove stopwords from words here
  out <- character(0)  # initialize a std::deque here in the C++ version
  for (i in seq_along(words)) {
    for (j in seq_along(skips)) {
      word_indices <- skips[[j]] + i  # add the index of the word
      if (all(word_indices <= length(words))) {  # make sure we don't overstep the end of the vector
        skipgram <- paste(words[word_indices], collapse = " ")
        out <- c(out, skipgram)  # out.push_back(skipgram) in C++
      }
    }
  }
  out  # returned as a character vector
}
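
Treated as a plain R reference implementation, the pseudocode above can be exercised directly on the original example (a hypothetical check, assuming get_valid_skips() from the skipgrams branch; the real version would be the C++ function):

words <- c("a", "b", "c", "d", "e")
skips <- tokenizers:::get_valid_skips(n = 3, k = 2)  # the 0-indexed patterns shown earlier
generate_skip_grams(words, skips, n = 3, k = 2)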

I can write that out in C++ if you like, but you certainly know what you are doing in C++ better than I do. Are you willing to work on that function?

@lmullen
Member

lmullen commented Apr 2, 2017

@Ironholds Also, I just pushed some tests drawn from the paper above for valid skip gram output. So now that branch is failing, which is to be expected.

@Ironholds
Member

Definitely happy to work on it! Wouldn't it return a list, rather than a vector?

@lmullen
Member

lmullen commented Apr 3, 2017

Yes, good point. It would be a list with one element for each input document to tokenize_skip_ngrams(), with each element containing a character vector of tokens.

@Ironholds
Member

thumbs up, want me to just put the skipgrams-branch code in my fork so you can review it all at once?

@lmullen
Member

lmullen commented Apr 3, 2017

Yes, that'd be perfect.

🚀

@Ironholds
Member

Er. So I've implemented it but not run it because I'm pretty sure there's a use-case in which it'll segfault; specifically, how will it handle stopwords in generating the skipgram indices?

@Ironholds
Member

Ah, clever! Let's see what I can do with that.

@Ironholds
Member

Ironholds commented Apr 3, 2017

Output for the initial example is now:

tokenizers::tokenize_skip_ngrams('a b c d e', n=3, k=2)
[[1]]
[1] "a b c" "a b d" "a b"   "a c d" "a c"   "a c"   "a d"   "a d"   "a d"

Is this the desired behaviour? (Should n=3 be excluding the length-2 entries? Code is in the PR.)

@lmullen
Member

lmullen commented Apr 3, 2017

I made a change in the signature of the function. When someone is using skip grams it's because they want to get a lot of different tokens out of the input text (e.g., to deal with noisy OCR or to be robust to intentional changes when looking for text reuse). So I added an argument n_min, as in the tokenize_ngrams() function, so that users can get the n-grams for a range of n. Unlike in tokenize_ngrams(), however, I've set n_min to 1 instead of to n, since by default people will probably want the range from 1 to n. That's a long way of saying that the bi-grams should be there. I will need to look into why it doesn't have uni-grams too, though.
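
For illustration, a call against the branch would then look like this (hypothetical until the fix lands; output not shown):

tokenizers::tokenize_skip_ngrams("a b c d e", n = 3, n_min = 1, k = 2)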

@Ironholds
Member

Probably because I didn't pass n_min through. Lemme rerun.

@lmullen
Member

lmullen commented Apr 3, 2017

@Ironholds The PR looks really good. I think we just have a git problem at this point.

You shouldn't need n_min: that is used to generate the list of skips, but once you have the skips you don't need n_min.

Does your fork have my changes in my skipgrams branch? E.g., I don't see n_min here:
https://github.com/Ironholds/tokenizers/blob/master/R/ngram-tokenizers.R#L106 That would also account for the old tests being run by Travis.

I think we need to combine your PR plus the skipgrams branch. How do you want to do that?

@Ironholds
Member

tokenizers::tokenize_skip_ngrams('a b c d e', n=3, k=2)
[[1]]
 [1] "a"     "a b"   "a c"   "a d"   "a b c" "a b d" "a b"   "a c d" "a c"   "a c"   "a d"   "a d"   "a d" 

Now it gets unigrams, but duplicatively. This looks to be because 'a d' (0, 4) is actually (0, 4, 6): when the Rcpp code hits it, it doesn't include the element at index 6, since the word vector only has 5 elements. And since (0, 4) is already in the skiplist... yeah.

Got any suggestions for how to fix this in the skiplist logic? I imagine it's a classic off-by-one error caused by indexing differences. I'd rather not fix it iteratively in the Rcpp, since that'll remove our ability to construct the output vector in a single allocation.

@Ironholds
Member

Ironholds commented Apr 3, 2017

And yep, I meant n_min being an argument to tokenize* at all! I'll handle the merging by hand when this is worked out; it looks like GitHub now has a nice clean web editor for merge conflicts, which is nice.

@lmullen
Member

lmullen commented Apr 3, 2017

So if I understand correctly, in the Rcpp function if you get a vector of indices c(0, 4, 6) and the vector of words is length 5, the function throws away only the 6 and not the whole vector of indices? Or am I misunderstanding?

@Ironholds
Member

Precisely. And if we change it to throw away the whole vector - which we can, as a solution to this problem - then we lose the ability to allocate the output object efficiently, since it means the actual length of the output is non-deterministic. So fixing it in the skip generation would be better.

@Ironholds
Member

Although thinking it through out loud, since skip generation is for the entire list of tokens, this is probably impossible to do correctly. I'll see if I can fix it in the compiled code without ruining performance.

@lmullen
Member

lmullen commented Apr 3, 2017

Yes, I don't think it is possible to fix it in the skip-index generation, since it's for the entire vector of words. I would think that given a vector of length w, the value of k, and the range of values of n, it should be possible to get a formula for the total number of skip grams. This paper gives tables for different values of n, k, and w and also a formula for a single value of n and k, but I haven't succeeded in generalizing the formula.

http://homepages.inf.ed.ac.uk/ballison/pdf/lrec_skipgrams.pdf
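
Short of a closed formula, a brute-force count for a single value of n is easy to write down; this is an editorial sketch under the per-gap reading of k that get_valid_skips() uses, not something taken from the paper or the package, and summing it over n_min to n would cover the range case:

# Count the skip n-grams produced from a w-word document, assuming each of the
# n - 1 gaps may skip at most k words (hypothetical helper for sizing the output).
count_skip_ngrams <- function(w, n, k) {
  steps <- as.matrix(expand.grid(rep(list(1:(k + 1)), n - 1)))
  spans <- rowSums(steps)   # distance from the first to the last index in each pattern
  sum(pmax(w - spans, 0))   # number of starting positions that fit for each pattern
}
count_skip_ngrams(w = 26, n = 3, k = 2)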

@Ironholds
Member

Think I've fixed it in the compiled code; running some tests and performance benchmarks now.

@Ironholds
Member

Welp, it produces the right result:

> tokenizers::tokenize_skip_ngrams('a b c d e', n=3, k=2)
[[1]]
 [1] "a"     "a b"   "a c"   "a d"   "a b c" "a b d" "a b e" "a c d" "a c e" "a d e"

And is also faster for large vectors!

> a <- rep("a b c d e", 100000)
> library(microbenchmark)

> microbenchmark({tokenize_skip_ngrams(a)})

# Old version
Unit: milliseconds
                            expr      min      lq     mean   median       uq      max neval
 {     tokenize_skip_ngrams(a) } 730.8347 765.642 787.9186 779.7935 798.9349 895.6179   100

> microbenchmark({tokenize_skip_ngrams(a)})

# New version
Unit: milliseconds
                            expr      min       lq     mean   median      uq      max neval
 {     tokenize_skip_ngrams(a) } 511.7679 538.9027 559.7002 549.6378 575.628 687.7942   100

Throwing it in the PR branch now; will resolve merge conflicts so you can accept it, unless you spot a goof in the output?

@lmullen lmullen closed this as completed in e6960b6 Apr 3, 2017
@lmullen
Member

lmullen commented Apr 3, 2017

@Ironholds Hmm, sorry to impose on your good graces, but I think there is a problem with the output.

tokenize_skip_ngrams("This is the sample sentence which has exactly ten words.", n = 3, k = 2)

produces

[[1]]
 [1] "this"                 "this is"              "this the"            
 [4] "this sample"          "this is the"          "this is sample"      
 [7] "this is sentence"     "this the sample"      "this the sentence"   
[10] "this the which"       "this sample sentence" "this sample which"   
[13] "this sample has"    

That output is correct for the skip indices at the first word, "this," but it doesn't look like it is iterating over the rest of the words in the sentence. Looking at the Rcpp code, it looks like there is not a loop that goes over the words.

I've also added some tests drawn from the paper defining skip-grams which reproduce this problem.

@lmullen
Member

lmullen commented Apr 3, 2017

Just pushed the tests which I forgot to do earlier.

@Ironholds
Member

Oh, doh. Wait, so it should be, for each word, generating paste(word, skips), as R pseudocode? And presumably dropping the skips value that corresponds to word if it appears in skips?

@kbenoit
Contributor

kbenoit commented Apr 3, 2017

If you want to try to match the quanteda behaviour (see above...way above), here's an update using our latest API changes:

> require(quanteda)
Loading required package: quanteda
quanteda version 0.9.9.44
Using 7 of 8 cores for parallel computing

Attaching package: 'quanteda'

The following object is masked from 'package:utils':

    View

> quanteda::tokens("insurgents killed in ongoing fighting", n = 3, skip = 0:2, concatenator = " ")
tokens from 1 document.
Component 1 :
 [1] "insurgents killed in"        "insurgents killed ongoing"   "insurgents killed fighting"  "insurgents in ongoing"      
 [5] "insurgents in fighting"      "insurgents ongoing fighting" "killed in ongoing"           "killed in fighting"         
 [9] "killed ongoing fighting"     "in ongoing fighting"  

@lmullen
Member

lmullen commented Apr 3, 2017

Yes, the index positions are the pattern for a window of skip grams, and that window has to be slid over every word. By adding the value of the iterator for the word to the index positions in skips, you get new index positions. (Skip grams can generate an enormous number of tokens.) I should have noticed this earlier, but we were using very short input texts so I didn't.
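
For instance, shifting one of the patterns shown earlier so that it starts at the third word is just arithmetic on the 0-indexed positions:

c(0, 1, 3) + 2
#> [1] 2 3 5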

@lmullen
Member

lmullen commented Apr 3, 2017

@kbenoit Thanks for the update. That's helpful. That's exactly the text that I'm using in the tests here. https://github.com/ropensci/tokenizers/blob/master/tests/testthat/test-ngrams.R#L138

@Ironholds
Member

Gotcha. Okay, this'll impact performance some - I'll see what I can come up with.

@Ironholds
Member

Hrm, and now I can't make it anything but duplicative. Wanna take a stab?

@lmullen
Member

lmullen commented Apr 4, 2017

@Ironholds Sure, I'll make an attempt. Will have to give it a shot in the afternoon after tomorrow's classes.

@lmullen
Member

lmullen commented Apr 4, 2017

@Ironholds How is this? Any potential problems I'm missing? 0a952ef

The results are the same as quanteda and it passes the failing tests I added earlier, and performance seems acceptable.

suppressPackageStartupMessages(library(quanteda))
suppressPackageStartupMessages(library(tokenizers))
library(microbenchmark)

# We really need a better sentence
input <- "insurgents killed in ongoing fighting"
microbenchmark(
  q <- quanteda::tokens(input, n = 3, skip = 0:2, concatenator = " "),
  w <- tokenizers::tokenize_skip_ngrams(input, n = 3, n_min = 3, k = 2, simplify = TRUE)
)
#> Unit: microseconds
#>                                                                                         expr
#>                          q <- quanteda::tokens(input, n = 3, skip = 0:2, concatenator = " ")
#>  w <- tokenizers::tokenize_skip_ngrams(input, n = 3, n_min = 3,      k = 2, simplify = TRUE)
#>      min       lq     mean   median       uq      max neval cld
#>  610.369 641.2515 694.2775 658.9820 701.2025 2918.523   100   b
#>  264.379 291.4475 334.6837 301.3375 325.6095 2942.237   100  a
identical(as.character(q), w)
#> [1] TRUE

@Ironholds
Member

Ohh nice, I see where I was going wrong!

@lmullen
Member

lmullen commented Apr 4, 2017

Okay, closing this then. Thanks for all your work on it!
