Allow asymmetric window sizes in fcm() #1413
Comments
I thought this could be done by using an asymmetric window:
> require(quanteda)
> txt <- c("a b c", "b d e")
> fcm(tokens(txt), "window", window = c(0, 1), tri = FALSE)
Error in try(if (window < 2) stop("The window size is too small.")) :
The window size is too small.
In addition: Warning message:
In if (window < 2) stop("The window size is too small.") :
the condition has length > 1 and only the first element will be used
Error in qatd_cpp_fcm(x, length(types), count, window, weights, ordered, :
Expecting a single value: [extent=2].
By looking at the code, I remembered that |
Yes, that was my first line of attack too. In a test branch I commented out the condition requiring window to be at least 2, and it produced some results I could not quite explain. It's something we need to dig deeper into. The other angle of attack would be:
Then 6. could be used as a bigram generator, similar to the one in the NLTK book. |
Alternatively, the size of the window could be only an integer, but we could record words before/after in upper/lower triangle when a new argument |
That’s what or at least I think that is what it does. When I ran some tests I found some results I could not explain. That’s why I think this needs to be checked again very thoroughly, and matched to results we would count by hand. |
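For the hand check, a reference count along these lines might help. This is only a minimal base R sketch, not how fcm() is implemented: `count_window()` and its `before`/`after` arguments are hypothetical names invented here, and tokenization is assumed to be a plain split on single spaces.

```r
# Hand-count co-occurrences with an asymmetric window, base R only.
# count_window(), before, and after are hypothetical names for this sketch;
# they are not part of quanteda.
count_window <- function(docs, before = 0L, after = 1L) {
  toks <- strsplit(docs, " ", fixed = TRUE)
  types <- sort(unique(unlist(toks)))
  m <- matrix(0L, length(types), length(types),
              dimnames = list(types, types))
  for (doc in toks) {
    n <- length(doc)
    for (i in seq_len(n)) {
      # positions inside the window around token i, excluding i itself
      j <- setdiff(seq(max(1L, i - before), min(n, i + after)), i)
      for (k in j) {
        # row = target token, column = token found in its window
        m[doc[i], doc[k]] <- m[doc[i], doc[k]] + 1L
      }
    }
  }
  m
}

txt <- c("a b c", "b d e")
count_window(txt, before = 0L, after = 2L)
```

With `before = 0` and `after = 2`, only tokens that follow the target are counted, so the lower triangle stays empty for these two documents. Setting `before = after` gives a symmetric window to compare against.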
You are right, it is there, but it produces what I do not expect:
> require(quanteda)
> txt <- c("a b c", "b d e")
> toks <- tokens(txt)
> fcm(toks, "window", window = 2, ordered = TRUE, tri = FALSE)
Feature co-occurrence matrix of: 5 by 5 features.
5 x 5 sparse Matrix of class "fcm"
features
features a b c d e
a 0 1 1 0 0
b 0 0 1 1 1
c 0 0 0 0 0
d 0 0 0 0 1
e 0 0 0 0 0 |
Yes, that's what I found. If this did work, and we removed the We need better substantive tests on this, clearly - should start a new issue to fix this. |
Don't you see something like this?
> require(quanteda)
> require(Matrix)
>
> packageVersion("quanteda")
[1] ‘1.3.5’
>
> txt <- c("a b c", "b a e")
> toks <- tokens(txt)
> mt <- fcm(toks, "window", window = 1, ordered = TRUE, tri = FALSE)
> mt[lower.tri(mt)] <- 0
> mt
4 x 4 sparse Matrix of class "dgCMatrix"
features
features a b c e
a . 1 . 1
b . . 1 .
c . . . .
e . . . . |
@koheiw The goal is next word prediction. Link to the SO question where I enclosed that text and code similar to yours. |
You should not run this on a large corpus, but it does the job:
> require(quanteda)
> ngms <- tokens("a b 1 2 3 a b 2 3 4 a b 3 4 5", n = 2:5)
> fmt <-
+ ngms %>%
+ as.list() %>%
+ unlist() %>%
+ stringi::stri_replace_last_fixed("_", " ") %>%
+ tokens() %>%
+ fcm(ordered = TRUE)
> fmt[33:39, 1:6]
Feature co-occurrence matrix of: 7 by 6 features.
7 x 6 sparse Matrix of class "fcm"
features
features a b 1 2 3 4
3_a_b_2 0 0 0 0 1 0
a_b_2_3 0 0 0 0 0 1
b_2_3_4 1 0 0 0 0 0
2_3_4_a 0 1 0 0 0 0
3_4_a_b 0 0 0 0 1 0
4_a_b_3 0 0 0 0 0 1
a_b_3_4 0 0 0 0 0 0 |
Thank you, very kind and very useful. I posted a simplified version of your code on SO so other people can also benefit from it |
Posted the question here on Stack Overflow; no answer yet:
Is there a way to access a dfm using the n-1 predecessor types as index values?
[abstract pasted from link above] "For next word prediction using ngrams I would need to find all the ngrams (and their frequencies) given n-1 predecessor words.
In dfm I could not see any way to do that, so I started implementing it manually on textstat_frequency (data.frame). ...
(Implicitly, and maybe wrongly, ruling out regexes, which I normally love, because of a prejudice that running them on hundreds of thousands of strings might be too slow/heavy.)
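The lookup described above (all n-grams and their frequencies given the n-1 predecessor words) can be sketched without regexes by tabulating predecessors against next words. This is a rough base R illustration only: `next_word_table()` is a hypothetical helper, it does not use a dfm or textstat_frequency, and it is not meant for a large corpus.

```r
# Tabulate next-word frequencies by their n-1 predecessor words, base R only.
# next_word_table() is a hypothetical helper for this sketch, not a quanteda function.
next_word_table <- function(words, n = 3L) {
  stopifnot(n >= 2L, length(words) >= n)
  starts <- seq_len(length(words) - n + 1L)
  # predecessor = first n-1 words of each n-gram, joined with "_"
  pred <- vapply(
    starts,
    function(i) paste(words[i:(i + n - 2L)], collapse = "_"),
    character(1)
  )
  # the word that completes each n-gram
  nxt <- words[starts + n - 1L]
  # rows: predecessors, columns: next words, cells: n-gram frequencies
  table(predecessor = pred, next_word = nxt)
}

words <- strsplit("a b 1 2 3 a b 2 3 4 a b 3 4 5", " ", fixed = TRUE)[[1]]
tab <- next_word_table(words, n = 3L)
tab["a_b", ]                    # counts of every word seen after "a b"
names(which.max(tab["a_b", ]))  # most frequent continuation (first one on ties)
```

Indexing the table by the predecessor string avoids scanning every n-gram with a pattern match; for a real corpus a sparse structure would be the more sensible choice.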