Allow asymmetric window sizes in fcm() #1413

Closed
enrico200165 opened this issue Aug 9, 2018 · 11 comments

@enrico200165

I posted a question about this on Stack Overflow, which has no answer yet:
Is there a way to access a dfm using the n-1 predecessor types as index values?

[Abstract pasted from the link above] "For next-word prediction using ngrams, I need to find all the ngrams (and their frequencies) given the n-1 predecessor words.
In a dfm I could not see any way to do that, so I started implementing it manually on the textstat_frequency output (a data.frame). ..."
(I am implicitly, and maybe wrongly, excluding regexes, which I normally love, because of a prejudice that running them on hundreds of thousands of strings might be too slow or heavy.)

@koheiw
Collaborator

koheiw commented Aug 13, 2018

I thought this could be done by using asymmetric window values in fcm(), but it raises an error:

> require(quanteda)
> txt <- c("a b c", "b d e")
> fcm(tokens(txt), "window", window = c(0, 1), tri = FALSE)
Error in try(if (window < 2) stop("The window size is too small.")) : 
  The window size is too small.
In addition: Warning message:
In if (window < 2) stop("The window size is too small.") :
  the condition has length > 1 and only the first element will be used
Error in qatd_cpp_fcm(x, length(types), count, window, weights, ordered,  : 
  Expecting a single value: [extent=2].

Looking at the code, I remembered that fcm() only takes a single value for window, unlike the window argument of tokens_select(). It makes sense to me to add such a capability to the function.

@koheiw changed the title from "Beginner/basic question, select ngrams given n-1 predecessors" to "Allow asymmetric window sizes in fcm()" on Aug 13, 2018
@koheiw self-assigned this on Aug 13, 2018
@kbenoit
Collaborator

kbenoit commented Aug 13, 2018

Yes, that was my first line of attack too. In a test branch I commented out the condition requiring window to be >= 2, and it produced some results I could not quite explain. It's something we need to dig deeper into.

The other angle of attack would be:

  1. form bigrams with a space concatenator
  2. coerce those tokens to character
  3. tokenize them again, then coerce to a list
  4. coerce to a data.frame
  5. use some tidy-fu to get a count matrix with features in rows and the features that followed them in columns
  6. divide by rowSums() to get a transition matrix

The result of step 6 could then be used as a bigram generator, similar to the one in the NLTK book.
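
As a rough illustration of where those steps end up, here is a minimal sketch in base R only. It does not follow the quanteda route described above (no bigram tokens, no tidy verbs); it simply pairs each token with its successor and normalises the counts by row. The whitespace split and the variable names are assumptions made for the example.

```r
# Sketch: bigram transition matrix in base R (not the quanteda pipeline above).
txt <- c("a b c", "b d e")
toks <- strsplit(txt, " ", fixed = TRUE)   # naive whitespace tokenisation

# Pair each token with the token that follows it, within each document.
pairs <- do.call(rbind, lapply(toks, function(x) {
  if (length(x) < 2) return(NULL)
  data.frame(first = head(x, -1), second = tail(x, -1),
             stringsAsFactors = FALSE)
}))

# Count matrix: rows are a word, columns are the word that followed it.
types <- sort(unique(unlist(toks)))
counts <- matrix(0, length(types), length(types),
                 dimnames = list(types, types))
for (i in seq_len(nrow(pairs))) {
  counts[pairs$first[i], pairs$second[i]] <-
    counts[pairs$first[i], pairs$second[i]] + 1
}

# Divide each row by its sum; rows with no successors stay all zero.
rs <- rowSums(counts)
trans <- counts / ifelse(rs == 0, 1, rs)
trans
```

Each non-empty row of `trans` sums to 1, so a row can be passed as the `prob` argument of `sample()` to draw a next word.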

@koheiw
Collaborator

koheiw commented Aug 15, 2018

Alternatively, the window size could remain a single integer, and we could record the words before/after in the upper/lower triangle when a new argument, asymmetric = TRUE, is set.

@kbenoit
Collaborator

kbenoit commented Aug 15, 2018

That’s what ordered = TRUE already does: it shows a term in the rows and, in the columns, the counts of the words that occur following the row word within the (one-sided/asymmetric) window size.

Or at least I think that is what it does. When I ran some tests I found some results I could not explain. That’s why I think this needs to be checked again very thoroughly, and matched against results we would count by hand.
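
A hand count of this kind can be sketched in base R. This is only one reading of the behaviour described above (for each word, count every word that follows it within `window` positions in the same document); `count_following` is a hypothetical helper written for this example, not a quanteda function.

```r
# Sketch: hand count of "word in column follows word in row within `window`".
# count_following() is a hypothetical helper, not part of quanteda.
count_following <- function(toks, window) {
  types <- sort(unique(unlist(toks)))
  m <- matrix(0, length(types), length(types),
              dimnames = list(types, types))
  for (x in toks) {                      # one character vector per document
    n <- length(x)
    for (i in seq_len(n)) {
      j <- i + seq_len(window)           # positions after i inside the window
      j <- j[j <= n]                     # do not run past the document end
      for (k in j) m[x[i], x[k]] <- m[x[i], x[k]] + 1
    }
  }
  m
}

toks <- strsplit(c("a b c", "b d e"), " ", fixed = TRUE)
m <- count_following(toks, window = 2)
m
```

The resulting matrix can be compared cell by cell with `as.matrix()` of an fcm built from the same texts.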

@koheiw
Collaborator

koheiw commented Aug 15, 2018

You are right, it is already there, but it produces results I do not expect:

> require(quanteda)
> txt <- c("a b c", "b d e")
> toks <- tokens(txt)
> fcm(toks, "window", window = 2, ordered = TRUE, tri = FALSE)
Feature co-occurrence matrix of: 5 by 5 features.
5 x 5 sparse Matrix of class "fcm"
        features
features a b c d e
       a 0 1 1 0 0
       b 0 0 1 1 1
       c 0 0 0 0 0
       d 0 0 0 0 1
       e 0 0 0 0 0

@kbenoit
Collaborator

kbenoit commented Aug 15, 2018

Yes, that's what I found. If this did work, and we removed the window >= 2 condition, then it could easily serve as a generator for a bigram transition matrix.

Clearly we need better substantive tests on this; we should start a new issue to fix it.

@enrico200165
Author

enrico200165 commented Aug 20, 2018

@koheiw I installed the latest version from GitHub today (20 August), but I am still unable to find a way to do this.
I have updated the question on SO to include my unsuccessful attempt and a simple text that can be used if you reply with sample code illustrating how to do it.

@koheiw
Collaborator

koheiw commented Aug 21, 2018

Don't you see something like this?

> require(quanteda)
> require(Matrix)
> 
> packageVersion("quanteda")
[1] ‘1.3.5’
> 
> txt <- c("a b c", "b a e")
> toks <- tokens(txt)
> mt <- fcm(toks, "window", window = 1, ordered = TRUE, tri = FALSE)
> mt[lower.tri(mt)] <- 0
> mt
4 x 4 sparse Matrix of class "dgCMatrix"
        features
features a b c e
       a . 1 . 1
       b . . 1 .
       c . . . .
       e . . . .

@enrico200165
Author

enrico200165 commented Aug 21, 2018

@koheiw
I see exactly that, but that is not what I asked (or meant to ask), except when n-1 is 1.
My question is not "given an ngram, find the subsequent ngrams and their frequencies (as continuations of the given ngram)".
My question is: "given an (n-1)gram, get the next words and their frequencies as successors" (possibly avoiding searches that involve regex matching, as I guess they might be slow; I am happy to be informed or corrected on that).
So, using the text from the updated SO question I mentioned before:
"a b 1 2 3 a b 2 3 4 a b 3 4 5"
and using "a","b" as the given (n-1)gram, I would like a way to get something like:
next word = "1" frequency (as successor of "a","b") = 1
next word = "2" frequency (as successor of "a","b") = 1
next word = "3" frequency (as successor of "a","b") = 1

The goal is next word prediction.

Link to the SO question where I enclosed that text and code similar to yours.

@koheiw
Collaborator

koheiw commented Aug 21, 2018

You should not run this on a large corpus, but it does the job:

> require(quanteda)
> ngms <- tokens("a b 1 2 3 a b 2 3 4 a b 3 4 5", n = 2:5)
> fmt <- 
+     ngms %>%
+     as.list() %>%
+     unlist() %>% 
+     stringi::stri_replace_last_fixed("_", " ") %>% 
+     tokens() %>% 
+     fcm(ordered = TRUE)
> fmt[33:39, 1:6]
Feature co-occurrence matrix of: 7 by 6 features.
7 x 6 sparse Matrix of class "fcm"
         features
features  a b 1 2 3 4
  3_a_b_2 0 0 0 0 1 0
  a_b_2_3 0 0 0 0 0 1
  b_2_3_4 1 0 0 0 0 0
  2_3_4_a 0 1 0 0 0 0
  3_4_a_b 0 0 0 0 1 0
  4_a_b_3 0 0 0 0 0 1
  a_b_3_4 0 0 0 0 0 0
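
For comparison, the lookup asked about above (given an (n-1)gram, list the words that follow it and how often) can also be sketched directly in base R, with no regex matching and no ngram matrix. This is an illustrative alternative, not the approach shown in the transcript; `next_words` is a hypothetical helper, and the tokens are assumed to come from a plain whitespace split.

```r
# Sketch: successors of a given (n-1)-gram, by direct comparison of tokens.
# next_words() is a hypothetical helper, not part of quanteda.
next_words <- function(toks, context) {
  k <- length(context)
  n <- length(toks)
  if (n <= k) return(table(character(0)))
  starts <- seq_len(n - k)               # positions where the context could begin
  hit <- vapply(starts, function(i) {
    all(toks[i:(i + k - 1)] == context)  # exact match of the k context tokens
  }, logical(1))
  table(toks[starts[hit] + k])           # tabulate the token right after each match
}

toks <- strsplit("a b 1 2 3 a b 2 3 4 a b 3 4 5", " ", fixed = TRUE)[[1]]
res <- next_words(toks, c("a", "b"))
res
```

With the example text, `res` holds a count of 1 for each of "1", "2" and "3". The scan is linear in the number of tokens for each query; for many queries on a large corpus, precomputing a lookup keyed by the context would likely be preferable.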

@enrico200165
Author

Thank you, very kind and very useful. I posted a simplified version of your code on SO so that other people can also benefit from it.
