Allow asymmetric window sizes in fcm() #1413

enrico200165 · 2018-08-09T15:21:56Z

I posted this question on Stack Overflow; it has no answer yet:
Is there a way to access a dfm using the n-1 predecessor types as index values?

[excerpt from the Stack Overflow question] "For next word prediction using ngrams I would need to find all the ngrams (and their frequencies) given n-1 predecessor words.
In dfm I could not see any way to do that, so I started implementing it manually on textstat_frequency (data.frame). ..."
(I am implicitly, and maybe wrongly, ruling out regexes, which I normally love, because of a prejudice that running them on hundreds of thousands of strings might be too slow/heavy.)

koheiw · 2018-08-13T14:58:59Z

I thought this could be done by using asymmetric window values in fcm(), but it raises an error:

> require(quanteda)
> txt <- c("a b c", "b d e")
> fcm(tokens(txt), "window", window = c(0, 1), tri = FALSE,)
Error in try(if (window < 2) stop("The window size is too small.")) : 
  The window size is too small.
In addition: Warning message:
In if (window < 2) stop("The window size is too small.") :
  the condition has length > 1 and only the first element will be used
Error in qatd_cpp_fcm(x, length(types), count, window, weights, ordered,  : 
  Expecting a single value: [extent=2].

Looking at the code, I remembered that fcm() takes only a single value for window, unlike the window argument of tokens_select(). It makes sense to me to add such a capability to the function.

kbenoit · 2018-08-13T16:24:38Z

Yes, that was my first line of attack too. In a test branch I commented out the condition requiring window to be >= 2, and it produced some results I could not quite explain. It's something we need to dig deeper into.

The other angle of attack would be:

1. Form bigrams with a space concatenator.
2. Coerce those tokens to character.
3. Tokenize them again and coerce to a list.
4. Coerce to a data.frame.
5. Use some tidy-fu to get a matrix with features in rows and the features that followed them in columns.
6. Divide by rowSums() to get a transition matrix.

The result of step 6 could then be used as a bigram generator, similar to the one in the NLTK book.
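As a rough illustration of where those steps end up, here is a minimal sketch in base R that counts "word followed by next word" pairs by hand and row-normalises them. It deliberately avoids quanteda calls, so it is not the implementation proposed above; the object names (pairs, counts, trans) are made up for this example, and splitting on single spaces is an assumption that only suits toy input like the text used in this thread.

```r
# Sketch: bigram transition matrix built by hand in base R (no quanteda calls).
txt <- c("a b c", "b d e")
toks <- strsplit(txt, " ", fixed = TRUE)  # assumption: single-space tokenisation

# Collect (word, next word) pairs within each document.
pairs <- do.call(rbind, lapply(toks, function(w) {
  if (length(w) < 2) return(NULL)
  cbind(from = w[-length(w)], to = w[-1])
}))

# Use one shared set of levels so the count matrix is square.
types <- sort(unique(unlist(toks)))
counts <- table(
  from = factor(pairs[, "from"], levels = types),
  to   = factor(pairs[, "to"],   levels = types)
)

# Row-normalise; rows without successors stay at zero instead of NaN.
rs <- rowSums(counts)
trans <- counts / ifelse(rs == 0, 1, rs)
print(trans)
```

Each row of trans that has at least one successor sums to 1, so a row can be read as the empirical distribution of the next word given the row word.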

koheiw · 2018-08-15T07:54:51Z

Alternatively, the window size could remain a single integer, but we could record the words before/after in the upper/lower triangle when a new argument, asymmetric = TRUE, is set.

kbenoit · 2018-08-15T07:57:27Z

That’s what ordered = TRUE already does: each row is a term, and the columns show the counts of the words that follow the row word within the (one-sided/asymmetric) window size.

Or at least I think that is what it does. When I ran some tests I found some results I could not explain. That’s why I think this needs to be checked again very thoroughly, and matched against results we count by hand.
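A hand count of that kind could be sketched as below, in base R only. This encodes just one reading of the description above (for each token, count the tokens that follow it within `window` positions in the same document); count_following() is a hypothetical helper written for this illustration, not a quanteda function, and its output is a reference to compare against, not a statement of what fcm() returns.

```r
# Hypothetical helper: count, for every token, the tokens that follow it
# within `window` positions in the same document (one-sided window).
count_following <- function(toks, window = 1L) {
  types <- sort(unique(unlist(toks)))
  m <- matrix(0L, length(types), length(types),
              dimnames = list(features = types, features = types))
  for (w in toks) {
    n <- length(w)
    for (i in seq_len(n)) {
      j_max <- min(i + window, n)
      if (j_max <= i) next  # no following token inside the window
      for (j in (i + 1):j_max) {
        m[w[i], w[j]] <- m[w[i], w[j]] + 1L
      }
    }
  }
  m
}

# Toy input from this thread, split on single spaces (an assumption).
toks <- strsplit(c("a b c", "b d e"), " ", fixed = TRUE)
count_following(toks, window = 1)
count_following(toks, window = 2)
```

Because the function accepts window = 1, it also covers the bigram case that the current window >= 2 check in fcm() rules out.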

koheiw · 2018-08-15T08:21:23Z

You are right, it is there, but it produces results I do not expect:

> require(quanteda)
> txt <- c("a b c", "b d e")
> toks <- tokens(txt)
> fcm(toks, "window", window = 2, ordered = TRUE, tri = FALSE)
Feature co-occurrence matrix of: 5 by 5 features.
5 x 5 sparse Matrix of class "fcm"
        features
features a b c d e
       a 0 1 1 0 0
       b 0 0 1 1 1
       c 0 0 0 0 0
       d 0 0 0 0 1
       e 0 0 0 0 0

kbenoit · 2018-08-15T08:32:07Z

Yes, that's what I found. If this did work, and we removed the window >= 2 condition, then it could easily serve as a generator for a bigram transition matrix.

We clearly need better substantive tests on this, and we should open a new issue to fix it.

enrico200165 · 2018-08-20T15:04:34Z

@koheiw I installed from GitHub today (20 August) and I still cannot find a way to do this.
I have updated the question on SO to include my unsuccessful attempt and a simple text that you could use if you reply with sample code illustrating how to do it.

koheiw · 2018-08-21T04:12:59Z

Don't you see something like this?

> require(quanteda)
> require(Matrix)
> 
> packageVersion("quanteda")
[1] ‘1.3.5’
> 
> txt <- c("a b c", "b a e")
> toks <- tokens(txt)
> mt <- fcm(toks, "window", window = 1, ordered = TRUE, tri = FALSE)
> mt[lower.tri(mt)] <- 0
> mt
4 x 4 sparse Matrix of class "dgCMatrix"
        features
features a b c e
       a . 1 . 1
       b . . 1 .
       c . . . .
       e . . . .

enrico200165 · 2018-08-21T05:55:49Z

@koheiw
I see exactly that, but that is not what I asked (or meant to ask), except when n-1 is 1.
My question is not "given an ngram, find subsequent ngrams and their frequencies (as continuations of the given ngram)".
My question is:
"given an (n-1)gram, get the next words and their frequencies as successors" (possibly avoiding searches that imply regex matching, as I guess they might be slow; happy to be informed/corrected on that).
So, using the text from the updated SO question I mentioned before:
"a b 1 2 3 a b 2 3 4 a b 3 4 5"
and using "a","b" as the given (n-1)gram, I would like a way to get something like:
next word = "1" frequency (as successor of "a","b") = 1
next word = "2" frequency (as successor of "a","b") = 1
next word = "3" frequency (as successor of "a","b") = 1

The goal is next word prediction.

Link to the SO question, where I included that text and code similar to yours.
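To make the request concrete, here is a minimal base R sketch of the lookup described above. It scans a token vector for every occurrence of the given (n-1)-gram and tabulates the word that follows, using exact vector comparison instead of regex matching. next_words() is a hypothetical helper written for this illustration, not part of quanteda; it assumes a non-empty context and has not been tuned for large corpora.

```r
# Hypothetical helper: tabulate the words that immediately follow `context`
# (a character vector of n-1 words) in the token vector `words`.
next_words <- function(words, context) {
  k <- length(context)  # assumption: k >= 1
  n <- length(words)
  if (n <= k) return(table(character(0)))
  starts <- seq_len(n - k)
  # Exact comparison of each k-word slice with the context; no regex involved.
  hit <- vapply(starts, function(i) identical(words[i:(i + k - 1)], context),
                logical(1))
  if (!any(hit)) return(table(character(0)))
  sort(table(words[starts[hit] + k]), decreasing = TRUE)
}

# Toy input from this thread, split on single spaces (an assumption).
words <- strsplit("a b 1 2 3 a b 2 3 4 a b 3 4 5", " ", fixed = TRUE)[[1]]
next_words(words, c("a", "b"))
```

For the sample text, the context "a","b" yields the successors "1", "2" and "3" with a count of 1 each, which matches the result asked for above.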

koheiw · 2018-08-21T19:36:55Z

You should not run this on a large corpus, but it does the job:

> require(quanteda)
> ngms <- tokens("a b 1 2 3 a b 2 3 4 a b 3 4 5", n = 2:5)
> fmt <- 
+     ngms %>%
+     as.list() %>%
+     unlist() %>% 
+     stringi::stri_replace_last_fixed("_", " ") %>% 
+     tokens() %>% 
+     fcm(ordered = TRUE)
> fmt[33:39, 1:6]
Feature co-occurrence matrix of: 7 by 6 features.
7 x 6 sparse Matrix of class "fcm"
         features
features  a b 1 2 3 4
  3_a_b_2 0 0 0 0 1 0
  a_b_2_3 0 0 0 0 0 1
  b_2_3_4 1 0 0 0 0 0
  2_3_4_a 0 1 0 0 0 0
  3_4_a_b 0 0 0 0 1 0
  4_a_b_3 0 0 0 0 0 1
  a_b_3_4 0 0 0 0 0 0

enrico200165 · 2018-08-22T07:33:04Z

Thank you, that is very kind and very useful. I posted a simplified version of your code on SO so that other people can also benefit from it.

koheiw changed the title ~~Beginner/basic question, select ngrams given n-1 predecessors~~ Allow asymmetric window sizes in fcm() Aug 13, 2018

koheiw added the enhancement label Aug 13, 2018

koheiw self-assigned this Aug 13, 2018

koheiw added a commit that referenced this issue Aug 15, 2018

Fix counting when ordered = TRUE to solve #1413

d53c4e0

koheiw mentioned this issue Aug 15, 2018

Issue 1413 #1415

Merged

kbenoit added a commit that referenced this issue Aug 18, 2018

Update NEWS and version for #1413 fix

2fc27f6

kbenoit closed this as completed Aug 18, 2018
