Incorrect skipgrams #24

Closed

koheiw opened this issue Oct 4, 2016 · 47 comments

@koheiw

koheiw commented Oct 4, 2016

I expect skipgrams with n=3 and k=2 to produce

"a b c" "a b d" "a c d" "a c e" "b c d" "b c e" "b d e" "c d e"

But I am getting

> tokenizers::tokenize_skip_ngrams('a b c d e', n=3, k=2)
[[1]]
[1] "a c e" "a b c" "b c d" "c d e"

Setting k=1 does not affect the behaviour of the function either:

> tokenizers::tokenize_skip_ngrams('a b c d e', n=3, k=1)
[[1]]
[1] "a c e" "a b c" "b c d" "c d e"
@koheiw koheiw changed the title Skipgrams Incorrect skipgrams Oct 5, 2016
@lmullen
Member

lmullen commented Oct 5, 2016

@koheiw: The input in the example you give is too short for what you are asking. You have an input string which will tokenize to five words, but the window when n = 3 and k = 2 is 7 words. You get the output that you expect if you use a longer input string. For example:

x <- paste(letters, collapse = " ")
out <- tokenizers::tokenize_skip_ngrams(x, n = 3, k = 2)

That returns:

[[1]]
 [1] "a d g" "b e h" "c f i" "d g j" "e h k" "f i l" "g j m" "h k n" "i l o" "j m p" "k n q" "l o r"
[13] "m p s" "n q t" "o r u" "p s v" "q t w" "r u x" "s v y" "t w z" "a c e" "b d f" "c e g" "d f h"
[25] "e g i" "f h j" "g i k" "h j l" "i k m" "j l n" "k m o" "l n p" "m o q" "n p r" "o q s" "p r t"
[37] "q s u" "r t v" "s u w" "t v x" "u w y" "v x z" "a b c" "b c d" "c d e" "d e f" "e f g" "f g h"
[49] "g h i" "h i j" "i j k" "j k l" "k l m" "l m n" "m n o" "n o p" "o p q" "p q r" "q r s" "r s t"
[61] "s t u" "t u v" "u v w" "v w x" "w x y" "x y z"

And if I test for the output you expected, they are all there:

test <- c("a b c", "a b d", "a c d", "a c e", "b c d", "b c e", "b d e", "c d e")
all(test %in% out)

Now, the function should fail or warn when the requested window is too big for the input vector, so that's a fix I'll have to make. But do you get the output you expect when you use longer inputs?
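
For reference, the 7-word figure follows from the widest span a single skip n-gram can cover; a quick arithmetic sketch (not a function in the package):

n <- 3
k <- 2
n + (n - 1) * k  # the n words plus up to k skipped words in each of the n - 1 gaps
#> [1] 7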

@koheiw
Author

koheiw commented Oct 5, 2016

I ran your commands, but got a very different result. Correct me if I am wrong.

> x <- paste(letters, collapse = " ")
> out <- tokenizers::tokenize_skip_ngrams(x, n = 3, k = 2)
> out
[[1]]
 [1] "a d g" "b e h" "c f i" "d g j" "e h k" "f i l" "g j m" "h k n" "i l o" "j m p" "k n q" "l o r" "m p s" "n q t" "o r u"
[16] "p s v" "q t w" "r u x" "s v y" "t w z" "a c e" "b d f" "c e g" "d f h" "e g i" "f h j" "g i k" "h j l" "i k m" "j l n"
[31] "k m o" "l n p" "m o q" "n p r" "o q s" "p r t" "q s u" "r t v" "s u w" "t v x" "u w y" "v x z" "a b c" "b c d" "c d e"
[46] "d e f" "e f g" "f g h" "g h i" "h i j" "i j k" "j k l" "k l m" "l m n" "m n o" "n o p" "o p q" "p q r" "q r s" "r s t"
[61] "s t u" "t u v" "u v w" "v w x" "w x y" "x y z"

> test <- c("a b c", "a b d", "a c d", "a c e", "b c d", "b c e", "b d e", "c d e")
> test
[1] "a b c" "a b d" "a c d" "a c e" "b c d" "b c e" "b d e" "c d e"
> test %in% out
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> all(test %in% out)
[1] FALSE

@lmullen
Member

lmullen commented Oct 5, 2016

Can you please post the results of sessionInfo()?

@koheiw
Author

koheiw commented Oct 5, 2016

Here it is. Thank you for investigating this issue.

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8    LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] tokenizers_0.1.4 quanteda_0.9.9.1

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.7               lattice_0.20-34           SnowballC_0.5.1           chron_2.3-47             
 [5] grid_3.3.1                R6_2.1.3                  RcppParallel_4.3.20       httr_1.2.1               
 [9] stringi_1.1.1             data.table_1.9.6          RcppArmadillo_0.7.400.2.0 ca_0.64                  
[13] Matrix_1.2-7.1            fastmatch_1.0-4           tools_3.3.1               parallel_3.3.1 

@lmullen
Member

lmullen commented Oct 5, 2016

I had an error in the code above. It really should be test %in% out[[1]].

But looking more closely at the things that you are expecting, some of them don't belong. For instance, "a b d" has a skip of one between b and d but not between a and b, which would never happen. Same with "a c d".
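
For completeness, the corrected membership check against the earlier example reads as follows (same commands as before; out is a list with one character vector per input document):

x <- paste(letters, collapse = " ")
out <- tokenizers::tokenize_skip_ngrams(x, n = 3, k = 2)
test <- c("a b c", "a b d", "a c d", "a c e", "b c d", "b c e", "b d e", "c d e")
test %in% out[[1]]  # index into the first document's tokens, not the list itself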

@kbenoit
Contributor

kbenoit commented Oct 5, 2016

This is really a definitional debate then... We defined skipgrams in quanteda following Guthrie, D., B. Allison, W. Liu, and L. Guthrie. 2006. "A Closer Look at Skip-Gram Modelling.":

Skip-grams reported for a certain skip distance k allow a
total of k or less skips to construct the n-gram. As such, “4-skip-n-gram” results include 4 skips, 3 skips, 2 skips, 1 skip, and 0 skips (typical n-grams formed from adjacent words).

So this would include "a b d" and "a c d". The output we programmed matches this, for instance their "2-skip-tri-grams" from the paper are matched as:

> tokens <- quanteda::tokenize(toLower("Insurgents killed in ongoing fighting."), 
                               removePunct = TRUE, simplify = TRUE)
> quanteda::skipgrams(tokens, n = 3, skip = 0:2, concatenator = " ")   
 [1] "insurgents killed in"        "insurgents killed ongoing"  
 [3] "insurgents killed fighting"  "insurgents in ongoing"      
 [5] "insurgents in fighting"      "insurgents ongoing fighting"
 [7] "killed in ongoing"           "killed in fighting"         
 [9] "killed ongoing fighting"     "in ongoing fighting"   

By the way I really like your package and we are considering using it for quanteda's tokenizer.

@lmullen
Member

lmullen commented Oct 5, 2016

@kbenoit That's an interesting citation that I wasn't aware of. We will have to take account of this.

@dselivanov What are your thoughts? I'm thinking that I should take a look at the skipgrams in quanteda and modify our implementation. For the kind of applications I've been using, especially since there is often noisy OCR interpolated between the real words, that kind of definition of skip grams would make a lot of sense.

@lmullen
Member

lmullen commented Oct 5, 2016

BTW, @kbenoit, glad to hear you are thinking about tokenizers for quanteda. Let us know if you have any particular improvements that you think would be useful.

@dselivanov
Contributor

@lmullen I can only add that I'm disappointed that @kbenoit & co decided to reimplement some features that have already been in text2vec for quite some time and on which I spent a lot of time (feature hashing and co-occurrences - see https://github.com/kbenoit/quanteda/branches).

Especially because of some edge cases that not every developer will realize.

@dselivanov
Contributor

Instead of implementing something new and useful for the community.

@lmullen
Member

lmullen commented Oct 5, 2016

@dselivanov Well, everyone is free to implement whatever they like. While I think that some convergence of the many options for text analysis in R is desirable, I think that is far more likely to happen in the R environment by working together to make our packages inter-operable rather than by asking everyone to pick a single package.

@dselivanov
Contributor

everyone is free to implement whatever they like

vs

working together to make our packages inter-operable

@lmullen
Member

lmullen commented Apr 1, 2017

There is a Python implementation here based on the paper @kbenoit and @koheiw cite above: http://stackoverflow.com/questions/31847682/how-to-compute-skipgrams-in-python

@lmullen lmullen mentioned this issue Apr 2, 2017
@Ironholds
Member

appears, as requested

what's the latest?

@lmullen
Member

lmullen commented Apr 2, 2017

@Ironholds I've figured out how to do skip grams in a way that I think will be both performant and correct. I'm working on this on the skipgrams branch.

https://github.com/ropensci/tokenizers/blob/skipgrams/R/ngram-tokenizers.R#L99

Here's the idea. The R function tokenize_skip_ngrams() will call the function get_valid_skips() linked above. This produces a list of 0-indexed word positions which should be part of the skip grams, depending on the values of n and k. Results look like this:

tokenizers:::get_valid_skips(n = 3, k = 2)
#> [[1]]
#> [1] 0 1 2
#> 
#> [[2]]
#> [1] 0 1 3
#> 
#> [[3]]
#> [1] 0 1 4
#> 
#> [[4]]
#> [1] 0 2 3
#> 
#> [[5]]
#> [1] 0 2 4
#> 
#> [[6]]
#> [1] 0 2 5
#> 
#> [[7]]
#> [1] 0 3 4
#> 
#> [[8]]
#> [1] 0 3 5
#> 
#> [[9]]
#> [1] 0 3 6
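
One way to produce that list of patterns in plain R is sketched below; it assumes each of the n - 1 gaps may skip at most k words, which matches the nine patterns above, but it is not the implementation in the linked file.

# Hypothetical sketch of generating the 0-indexed skip patterns shown above:
# every step to the next position advances by 1 to k + 1 words.
get_valid_skips_sketch <- function(n, k) {
  steps <- as.matrix(expand.grid(rep(list(1:(k + 1)), n - 1)))
  lapply(seq_len(nrow(steps)), function(i) cumsum(c(0, steps[i, ])))
}
get_valid_skips_sketch(n = 3, k = 2)  # same nine patterns, possibly in a different order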

A vector of words, along with that list of positions and n and k, will be passed to the C++ function. The C++ function will operate like so (in hastily written R-ish pseudocode):

generate_skip_grams <- function(words, skips, n, k) {
  # remove stopwords from words here
  out <- character(0)  # initialize a std::deque here in the C++ version
  for (i in seq_along(words)) {
    for (j in seq_along(skips)) {
      word_indices <- skips[[j]] + i  # add the index of the word
      if (all(word_indices <= length(words))) {  # make sure we don't overstep the end of the vector
        skipgram <- paste(words[word_indices], collapse = " ")
        out <- c(out, skipgram)  # out.push_back(skipgram) in C++
      }
    }
  }
  out  # returned as a character vector
}
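
Treated as a plain R reference implementation, the pseudocode above can be exercised directly on the original example (a hypothetical check, assuming get_valid_skips() from the skipgrams branch; the real version would be the C++ function):

words <- c("a", "b", "c", "d", "e")
skips <- tokenizers:::get_valid_skips(n = 3, k = 2)  # the 0-indexed patterns shown earlier
generate_skip_grams(words, skips, n = 3, k = 2)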

I can write that out in C++ if you like, but you certainly know what you are doing in C++ better than I do. Are you willing to work on that function?

@lmullen
Member

lmullen commented Apr 2, 2017

@Ironholds Also, I just pushed some tests drawn from the paper above for valid skip gram output. So now that branch is failing, which is to be expected.

@Ironholds
Member

Definitely happy to work on it! Wouldn't it return a list, rather than a vector?

@lmullen
Member

lmullen commented Apr 3, 2017

Yes, good point. It would be a list with one element for each input document to tokenize_skip_ngrams(), with each element containing a character vector of tokens.

@Ironholds
Member

thumbs up, want me to just put the skipgrams-branch code in my fork so you can review it all at once?

@lmullen
Member

lmullen commented Apr 3, 2017

Yes, that'd be perfect.

🚀

@Ironholds
Member

Er. So I've implemented it but not run it because I'm pretty sure there's a use-case in which it'll segfault; specifically, how will it handle stopwords in generating the skipgram indices?

@Ironholds
Member

Ah, clever! Let's see what I can do with that.

@Ironholds
Member

Ironholds commented Apr 3, 2017

Output for the initial example is now:

tokenizers::tokenize_skip_ngrams('a b c d e', n=3, k=2)
[[1]]
[1] "a b c" "a b d" "a b"   "a c d" "a c"   "a c"   "a d"   "a d"   "a d"

Is this the desired behaviour? (Should n=3 be excluding the length-2 entries? Code is in the PR.)

@lmullen
Member

lmullen commented Apr 3, 2017

I made a change in the signature of the function. When someone is using skip grams it's because they want to get a lot of different tokens out of the input text (e.g., to deal with noisy OCR or to be robust to intentional changes when looking for text reuse). So I added an argument n_min, as in the tokenize_ngrams() function, so that users can get the n-grams for a range of n. Unlike in tokenize_ngrams(), however, I've set n_min to 1 instead of to n, since by default people will probably want the range from 1 to n. That's a long way of saying that the bi-grams should be there. I will need to look into why it doesn't have uni-grams too, though.
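
For illustration, a call against the branch would then look like this (hypothetical until the fix lands; output not shown):

tokenizers::tokenize_skip_ngrams("a b c d e", n = 3, n_min = 1, k = 2)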

@Ironholds
Member

Probably because I didn't pass n_min through. Lemme rerun.

@lmullen
Member

lmullen commented Apr 3, 2017

@Ironholds The PR looks really good. I think we just have a git problem at this point.

You shouldn't need n_min: that is used to generate the list of skips, but once you have the skips you don't need n_min.

Does your fork have my changes in my skipgrams branch? E.g., I don't see n_min here:
https://github.com/Ironholds/tokenizers/blob/master/R/ngram-tokenizers.R#L106 That would also account for the old tests being run by Travis.

I think we need to combine your PR plus the skipgrams branch. How do you want to do that?

@Ironholds
Member

tokenizers::tokenize_skip_ngrams('a b c d e', n=3, k=2)
[[1]]
 [1] "a"     "a b"   "a c"   "a d"   "a b c" "a b d" "a b"   "a c d" "a c"   "a c"   "a d"   "a d"   "a d" 

Now it gets unigrams, but duplicatively. This looks to be because 'a d' (0, 4) is actually (0, 4, 6): when the Rcpp code hits it, it doesn't include the element at index 6, since the word vector only has 5 elements. And since (0, 4) is already in the skiplist... yeah.

Got any suggestions for how to fix this in the skiplist logic? I imagine it's a classic off-by-one error caused by indexing differences. I'd rather not fix it iteratively in the Rcpp, since that'll remove our ability to construct the output vector in a single allocation.

@Ironholds
Member

Ironholds commented Apr 3, 2017

And yep, I meant n_min being an argument to tokenize* at all! I'll handle the merging by hand when this is worked out; it looks like GitHub now has a nice clean web editor for merge conflicts, which is nice.

@lmullen
Member

lmullen commented Apr 3, 2017

So if I understand correctly, in the Rcpp function if you get a vector of indices c(0, 4, 6) and the vector of words is length 5, the function throws away only the 6 and not the whole vector of indices? Or am I misunderstanding?

@Ironholds
Member

Precisely. And if we change it to throw away the whole vector - which we can, as a solution to this problem - then we lose the ability to allocate the output object efficiently, since it means the actual length of the output is non-deterministic. So fixing it in the skip generation would be better.

@Ironholds
Member

Although thinking it through out loud, since skip generation is for the entire list of tokens, this is probably impossible to do correctly. I'll see if I can fix it in the compiled code without ruining performance.

@lmullen
Member

lmullen commented Apr 3, 2017

Yes, I don't think it is possible to fix it in the skip-index generation, since it's for the entire vector of words. I would think that given a vector of length w, the value of k, and the range of values of n, it should be possible to get a formula for the total number of skip grams. This paper gives tables for different values of n, k, and w and also a formula for a single value of n and k, but I haven't succeeded in generalizing the formula.

http://homepages.inf.ed.ac.uk/ballison/pdf/lrec_skipgrams.pdf
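
Short of a closed formula, a brute-force count for a single value of n is easy to write down; this is an editorial sketch under the per-gap reading of k that get_valid_skips() uses, not something taken from the paper or the package, and summing it over n_min to n would cover the range case:

# Count the skip n-grams produced from a w-word document, assuming each of the
# n - 1 gaps may skip at most k words (hypothetical helper for sizing the output).
count_skip_ngrams <- function(w, n, k) {
  steps <- as.matrix(expand.grid(rep(list(1:(k + 1)), n - 1)))
  spans <- rowSums(steps)   # distance from the first to the last index in each pattern
  sum(pmax(w - spans, 0))   # number of starting positions that fit for each pattern
}
count_skip_ngrams(w = 26, n = 3, k = 2)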

@Ironholds
Member

Think I've fixed it in the compiled code; running some tests and performance benchmarks now.

@Ironholds
Member

Welp, it produces the right result:

> tokenizers::tokenize_skip_ngrams('a b c d e', n=3, k=2)
[[1]]
 [1] "a"     "a b"   "a c"   "a d"   "a b c" "a b d" "a b e" "a c d" "a c e" "a d e"

And is also faster for large vectors!

> a <- rep("a b c d e", 100000)
> library(microbenchmark)

> microbenchmark({tokenize_skip_ngrams(a)})

# Old version
Unit: milliseconds
                            expr      min      lq     mean   median       uq      max neval
 {     tokenize_skip_ngrams(a) } 730.8347 765.642 787.9186 779.7935 798.9349 895.6179   100

> microbenchmark({tokenize_skip_ngrams(a)})

# New version
Unit: milliseconds
                            expr      min       lq     mean   median      uq      max neval
 {     tokenize_skip_ngrams(a) } 511.7679 538.9027 559.7002 549.6378 575.628 687.7942   100

Throwing it in the PR branch now; will resolve merge conflicts so you can accept it, unless you spot a goof in the output?

@lmullen lmullen closed this as completed in e6960b6 Apr 3, 2017
@lmullen
Member

lmullen commented Apr 3, 2017

@Ironholds Hmm, sorry to impose on your good graces, but I think there is a problem with the output.

tokenize_skip_ngrams("This is the sample sentence which has exactly ten words.", n = 3, k = 2)

produces

[[1]]
 [1] "this"                 "this is"              "this the"            
 [4] "this sample"          "this is the"          "this is sample"      
 [7] "this is sentence"     "this the sample"      "this the sentence"   
[10] "this the which"       "this sample sentence" "this sample which"   
[13] "this sample has"    

That output is correct for the skip indices at the first word, "this," but it doesn't look like it is iterating over the rest of the words in the sentence. Looking at the Rcpp code, it looks like there is not a loop that goes over the words.

I've also added some tests drawn from the paper defining skip-grams which reproduce this problem.

@lmullen
Member

lmullen commented Apr 3, 2017

Just pushed the tests which I forgot to do earlier.

@Ironholds
Member

Oh, doh. Wait, so it should be, for each word, generating paste(word, skips), as R pseudocode? And presumably dropping the skips value that corresponds to word if it appears in skips?

@kbenoit
Contributor

kbenoit commented Apr 3, 2017

If you want to try to match the quanteda behaviour (see above...way above), here's an update using our latest API changes:

> require(quanteda)
Loading required package: quanteda
quanteda version 0.9.9.44
Using 7 of 8 cores for parallel computing

Attaching package: 'quanteda'

The following object is masked from 'package:utils':

    View

> quanteda::tokens("insurgents killed in ongoing fighting", n = 3, skip = 0:2, concatenator = " ")
tokens from 1 document.
Component 1 :
 [1] "insurgents killed in"        "insurgents killed ongoing"   "insurgents killed fighting"  "insurgents in ongoing"      
 [5] "insurgents in fighting"      "insurgents ongoing fighting" "killed in ongoing"           "killed in fighting"         
 [9] "killed ongoing fighting"     "in ongoing fighting"  

@lmullen
Member

lmullen commented Apr 3, 2017

Yes, the index positions are the pattern for a window of skip grams, and that window has to be slid over every word. By adding the value of the iterator for the word to the index positions in skips, you get new index positions. (Skip grams can generate an enormous number of tokens.) I should have noticed this earlier, but we were using very short input texts so I didn't.
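
For instance, shifting one of the patterns shown earlier so that it starts at the third word is just arithmetic on the 0-indexed positions:

c(0, 1, 3) + 2
#> [1] 2 3 5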

@lmullen
Member

lmullen commented Apr 3, 2017

@kbenoit Thanks for the update. That's helpful. That's exactly the text that I'm using in the tests here. https://github.com/ropensci/tokenizers/blob/master/tests/testthat/test-ngrams.R#L138

@Ironholds
Member

Gotcha. Okay, this'll impact performance some - I'll see what I can come up with.

@Ironholds
Member

Hrm, and now I can't make it anything but duplicative. Wanna take a stab?

@lmullen
Member

lmullen commented Apr 4, 2017

@Ironholds Sure, I'll make an attempt. Will have to give it a shot in the afternoon after tomorrow's classes.

@lmullen
Member

lmullen commented Apr 4, 2017

@Ironholds How is this? Any potential problems I'm missing? 0a952ef

The results are the same as quanteda and it passes the failing tests I added earlier, and performance seems acceptable.

suppressPackageStartupMessages(library(quanteda))
suppressPackageStartupMessages(library(tokenizers))
library(microbenchmark)

# We really need a better sentence
input <- "insurgents killed in ongoing fighting"
microbenchmark(
  q <- quanteda::tokens(input, n = 3, skip = 0:2, concatenator = " "),
  w <- tokenizers::tokenize_skip_ngrams(input, n = 3, n_min = 3, k = 2, simplify = TRUE)
)
#> Unit: microseconds
#>                                                                                         expr
#>                          q <- quanteda::tokens(input, n = 3, skip = 0:2, concatenator = " ")
#>  w <- tokenizers::tokenize_skip_ngrams(input, n = 3, n_min = 3,      k = 2, simplify = TRUE)
#>      min       lq     mean   median       uq      max neval cld
#>  610.369 641.2515 694.2775 658.9820 701.2025 2918.523   100   b
#>  264.379 291.4475 334.6837 301.3375 325.6095 2942.237   100  a
identical(as.character(q), w)
#> [1] TRUE

@Ironholds
Member

Ohh nice, I see where I was going wrong!

@lmullen
Member

lmullen commented Apr 4, 2017

Okay, closing this then. Thanks for all your work on it!
