Incorrect skipgrams #24
@koheiw: I expect skipgrams with k = 2 to produce …

But I am getting …

Setting k = 1 does not affect the behaviour of the function. |
@koheiw: The input in the example you give is too short for what you are asking. You have an input string which will tokenize to five words, but the window, once skips are included, is wider than that.
That returns: …

And if I test for the output you expected, they are all there: …
Now, the function should fail or warn when the requested window is too big for the input vector, so that's a fix I'll have to make. But do you get the output you expect when you use longer inputs? |
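To make the window-size point concrete, here is a minimal sketch of such a check (max_window is a hypothetical helper, not part of tokenizers): an n-gram with up to k skips between adjacent words can span at most 1 + (n - 1) * (k + 1) input tokens.

max_window <- function(n, k) {
  # n words, with up to k skipped tokens between each adjacent pair
  1 + (n - 1) * (k + 1)
}

max_window(3, 2)
#> [1] 7
# A five-word input cannot contain every skip gram requested with n = 3 and
# k = 2, so the tokenizer could warn or fail in that case.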
I ran your command, but got a very different result. Correct me if I am wrong.
|
Can you please post the results of …? |
Here it is. Thank you for investigating this issue.
|
I had an error in the code above. It really should be …. But looking more closely at the things that you are expecting, some of them don't belong. For instance, … |
This is really a definitional debate then... We defined skipgrams in quanteda following Guthrie, D., B. Allison, W. Liu, and L. Guthrie. 2006. "A Closer Look at Skip-Gram Modelling.":
So this would include "a b d" and "a c d". The output we programmed matches this; for instance, their "2-skip-tri-grams" from the paper are matched as:

> tokens <- quanteda::tokenize(toLower("Insurgents killed in ongoing fighting."),
+                              removePunct = TRUE, simplify = TRUE)
> quanteda::skipgrams(tokens, n = 3, skip = 0:2, concatenator = " ")
 [1] "insurgents killed in"        "insurgents killed ongoing"
 [3] "insurgents killed fighting"  "insurgents in ongoing"
 [5] "insurgents in fighting"      "insurgents ongoing fighting"
 [7] "killed in ongoing"           "killed in fighting"
 [9] "killed ongoing fighting"     "in ongoing fighting"

By the way, I really like your package and we are considering using it for quanteda's tokenizer. |
@kbenoit That's an interesting citation that I wasn't aware of. We will have to take it into account. @dselivanov, what are your thoughts? I'm thinking that I should take a look at the skipgrams in quanteda and modify our implementation. For the kinds of applications I've been using, especially since there is often noisy OCR interpolated between the real words, that definition of skip grams would make a lot of sense. |
BTW, @kbenoit, glad to hear you are thinking about tokenizers for quanteda. Let us know if you have any particular improvements that you think would be useful. |
@lmullen I can only add that I'm disappointed that @kbenoit & co. decided to reimplement features that have already been in text2vec for quite some time, and on which I spent a lot of time (feature hashing and co-occurrences; see https://github.com/kbenoit/quanteda/branches), instead of implementing something new and useful for the community. This is especially so because of edge cases that not every developer will realize are there. |
@dselivanov Well, everyone is free to implement whatever they like. While I think that some convergence of the many options for text analysis in R is desirable, I think that is far more likely to happen in the R environment by working together to make our packages interoperable rather than by asking everyone to pick a single package. |
There is a Python implementation here based on the paper @kbenoit and @koheiw cite above: http://stackoverflow.com/questions/31847682/how-to-compute-skipgrams-in-python |
Appears, as requested. What's the latest? |
@Ironholds I've figured out how to do the skip grams correctly in a way that I think will also be performant. I'm working on this in the skipgrams branch: https://github.com/ropensci/tokenizers/blob/skipgrams/R/ngram-tokenizers.R#L99 Here's the idea. An R function generates the valid index offsets for each window:

tokenizers:::get_valid_skips(n = 3, k = 2)
#> [[1]]
#> [1] 0 1 2
#>
#> [[2]]
#> [1] 0 1 3
#>
#> [[3]]
#> [1] 0 1 4
#>
#> [[4]]
#> [1] 0 2 3
#>
#> [[5]]
#> [1] 0 2 4
#>
#> [[6]]
#> [1] 0 2 5
#>
#> [[7]]
#> [1] 0 3 4
#>
#> [[8]]
#> [1] 0 3 5
#>
#> [[9]]
#> [1] 0 3 6

A vector of words, along with that list of positions, and …
I can write that out in C++ if you like, but you certainly know your way around C++ better than I do. Are you willing to work on that function? |
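For readers following the thread, here is a minimal R sketch of how those offset patterns could be enumerated (get_valid_skips_sketch is a hypothetical stand-in, not the unexported tokenizers:::get_valid_skips):

get_valid_skips_sketch <- function(n, k) {
  # Every pattern starts at offset 0. Each subsequent word can follow the
  # previous one immediately (a step of 1) or after up to k skipped tokens
  # (a step of k + 1).
  if (n == 1) return(list(0L))
  shorter <- get_valid_skips_sketch(n - 1, k)
  out <- list()
  for (skips in shorter) {
    last <- skips[length(skips)]
    for (step in 1:(k + 1)) {
      out[[length(out) + 1]] <- c(skips, last + step)
    }
  }
  out
}

get_valid_skips_sketch(3, 2)  # reproduces the nine offset vectors shown above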
@Ironholds Also, I just pushed some tests drawn from the paper above for valid skip gram output. So now that branch is failing, which is to be expected. |
Definitely happy to work on it! Wouldn't it return a list, rather than a vector? |
Yes, good point. It would be a list with one element for each input document passed to tokenize_skip_ngrams(). |
Thumbs up; want me to just put the skipgrams-branch code in my fork so you can review it all at once? |
Yes, that'd be perfect. 🚀 |
Er. So I've implemented it but not run it because I'm pretty sure there's a use-case in which it'll segfault; specifically, how will it handle stopwords in generating the skipgram indices? |
Ah, clever! Let's see what I can do with that. |
Output for the initial example is now: …

Is this the desired behaviour? (Should n = 3 be excluding the length-2 entries? Code is in the PR.) |
I made a change in the signature of the function. When someone uses skip grams, it's because they want to get a lot of different tokens out of the input text (e.g., to deal with noisy OCR, or to be robust to intentional changes when looking for text reuse). So I added an n_min argument. |
Probably because I didn't pass n_min through. Lemme rerun. |
@Ironholds The PR looks really good. I think we just have a git problem at this point. You shouldn't need …. Does your fork have my changes from my skipgrams branch? I think we need to combine your PR plus the skipgrams branch. |
Now gets unigrams, but duplicatively, which looks to be because "a d" (0, 4) is actually (0, 4, 6); when the Rcpp code hits it, it doesn't include the 6th element, since the input is a vector of 5, and (0, 4) is already in the skip list... yeah. Got any suggestions for how to fix this in the skip-list logic? I imagine it's a classic off-by-one error caused by indexing differences. I'd rather not fix it iteratively in the Rcpp code, since that would remove our ability to construct the output vector in a single allocation. |
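A toy illustration of that failure mode, with assumed letter tokens (the thread's exact example may differ): truncating an out-of-range pattern to its in-range positions collapses it onto a shorter pattern whose output has already been emitted.

words  <- c("a", "b", "c", "d", "e")
runoff <- c(0, 4, 6)                      # offset 6 is out of range for 5 words
trunc  <- runoff[runoff < length(words)]  # silently collapses to c(0, 4)
paste(words[c(0, 4) + 1], collapse = " ") # the in-range pattern gives "a e"
#> [1] "a e"
paste(words[trunc + 1], collapse = " ")   # the truncated pattern repeats it
#> [1] "a e"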
And yep, I meant … |
So if I understand correctly, in the Rcpp function, if you get a vector of indices that runs past the end of the input, you keep only the in-range positions rather than discarding the whole window? |
Precisely. And if we change it to throw away the whole vector - which we can, as a solution to this problem - then we lose the ability to allocate the output object efficiently, since it means the actual length of the output is non-deterministic. So fixing it in the skip generation would be better. |
Although thinking it through out loud, since skip generation is for the entire list of tokens, this is probably impossible to do correctly. I'll see if I can fix it in the compiled code without ruining performance. |
Yes, I don't think it is possible to fix it in the skip-index generation, since that is done for the entire vector of words. I would think that given a vector of length w, the value of k, and the range of values of n, it should be possible to get a formula for the total number of skip grams. This paper gives tables for different values of n, k, and w, and also a formula for a single value of n and k, but I haven't succeeded in generalizing the formula. http://homepages.inf.ed.ac.uk/ballison/pdf/lrec_skipgrams.pdf |
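One way to sanity-check any closed-form count is to tally the windows directly. A sketch under the same assumptions as before, reusing the hypothetical get_valid_skips_sketch defined above:

count_skip_grams <- function(w, n, k) {
  # A pattern whose largest offset is m fits at start positions 1 .. w - m,
  # i.e. max(0, w - m) placements in a w-token input.
  patterns <- get_valid_skips_sketch(n, k)
  sum(vapply(patterns, function(p) max(0, w - max(p)), numeric(1)))
}

count_skip_grams(5, 3, 2)
#> [1] 10

This agrees with the ten skip grams in the quanteda output quoted earlier.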
Think I've fixed it in the compiled code; running some tests and performance benchmarks now. |
Welp, it produces the right result: …

And it is also faster for large vectors! …
Throwing it in the PR branch now; will resolve merge conflicts so you can accept it, unless you spot a goof in the output? |
@Ironholds Hmm, sorry to impose on your good graces, but I think there is a problem with the output. Running … produces … That output is correct for the skip indices at the first word, "this," but it doesn't look like it is iterating over the rest of the words in the sentence. Looking at the Rcpp code, there doesn't appear to be a loop over the words. I've also added some tests, drawn from the paper defining skip-grams, which reproduce this problem. |
Just pushed the tests which I forgot to do earlier. |
Oh, doh. Wait, so it should be generating, for each word, the set of skip-gram windows that start at that word? |
If you want to try to match the quanteda behaviour (see above... way above), here's an update using our latest API changes:

> require(quanteda)
Loading required package: quanteda
quanteda version 0.9.9.44
Using 7 of 8 cores for parallel computing

Attaching package: ‘quanteda’

The following object is masked from ‘package:utils’:

    View

> quanteda::tokens("insurgents killed in ongoing fighting", n = 3, skip = 0:2, concatenator = " ")
tokens from 1 document.
Component 1 :
 [1] "insurgents killed in"         "insurgents killed ongoing"    "insurgents killed fighting"   "insurgents in ongoing"
 [5] "insurgents in fighting"       "insurgents ongoing fighting"  "killed in ongoing"            "killed in fighting"
 [9] "killed ongoing fighting"      "in ongoing fighting"
|
Yes, the index positions are the pattern for a window of skip grams, and that window has to be slid over every word. By adding the value of the iterator for the word to the index positions in skips, you get new index positions. (Skip grams can generate an enormous number of tokens.) I should have noticed this earlier, but we were using very short input texts so I didn't. |
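Putting the pieces together, a plain-R sketch of the sliding behaviour just described (skip_grams_sketch is a hypothetical name, reusing the hypothetical get_valid_skips_sketch from earlier; the package does this in compiled code):

skip_grams_sketch <- function(words, n, k) {
  patterns <- get_valid_skips_sketch(n, k)  # offset patterns, as sketched above
  out <- character(0)
  for (start in seq_along(words)) {         # slide the window over every word
    for (p in patterns) {
      idx <- start + p                      # shift the pattern to this word
      if (max(idx) > length(words)) next    # drop, rather than truncate
      out <- c(out, paste(words[idx], collapse = " "))
    }
  }
  out
}

skip_grams_sketch(c("insurgents", "killed", "in", "ongoing", "fighting"),
                  n = 3, k = 2)
#>  [1] "insurgents killed in"        "insurgents killed ongoing"
#>  [3] "insurgents killed fighting"  "insurgents in ongoing"
#>  [5] "insurgents in fighting"      "insurgents ongoing fighting"
#>  [7] "killed in ongoing"           "killed in fighting"
#>  [9] "killed ongoing fighting"     "in ongoing fighting"

This reproduces the ten tokens in the quanteda output above.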
@kbenoit Thanks for the update. That's helpful. That's exactly the text that I'm using in the tests here. https://github.com/ropensci/tokenizers/blob/master/tests/testthat/test-ngrams.R#L138 |
Gotcha. Okay, this'll impact performance some - I'll see what I can come up with. |
Hrm, and now I can't make it anything but duplicative. Wanna take a stab? |
@Ironholds Sure, I'll make an attempt. Will have to give it a shot in the afternoon after tomorrow's classes. |
@Ironholds How is this? Any potential problems I'm missing? 0a952ef The results are the same as quanteda's, it passes the failing tests I added earlier, and performance seems acceptable.

suppressPackageStartupMessages(library(quanteda))
suppressPackageStartupMessages(library(tokenizers))
library(microbenchmark)
# We really need a better sentence
input <- "insurgents killed in ongoing fighting"
microbenchmark(
q <- quanteda::tokens(input, n = 3, skip = 0:2, concatenator = " "),
w <- tokenizers::tokenize_skip_ngrams(input, n = 3, n_min = 3, k = 2, simplify = TRUE)
)
#> Unit: microseconds
#> expr
#> q <- quanteda::tokens(input, n = 3, skip = 0:2, concatenator = " ")
#> w <- tokenizers::tokenize_skip_ngrams(input, n = 3, n_min = 3, k = 2, simplify = TRUE)
#> min lq mean median uq max neval cld
#> 610.369 641.2515 694.2775 658.9820 701.2025 2918.523 100 b
#> 264.379 291.4475 334.6837 301.3375 325.6095 2942.237 100 a
identical(as.character(q), w)
#> [1] TRUE |
Ohh nice, I see where I was going wrong! |
Okay, closing this then. Thanks for all your work on it! |