Applying thesaurus / dictionary to n-grams (with n > 1) #116

LucFrachon · 2016-03-23T08:36:42Z

This follows up on our e-mail conversation.

When building a DFM with n-grams (rather than unigrams), the option to apply a thesaurus or dictionary fails because there is no match between an n-gram and dictionary keys (which are usually unigrams).
We would need a way to apply a dictionary before building the DFM, as per your suggestion:

I get what you are saying on the ngrams and the dictionary. Can’t be done using the existing tools, for the reasons you specify, but could be easily solved by adding an applyDictionary() method for tokenizedTexts. Steps would be:

1. unigram tokenise
2. apply dictionary, exclusive = FALSE to unigram tokens
3. dfm the thesaurus-ized tokens with ngrams = 2.
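The three steps above can be sketched roughly as follows. This is only an illustration, and it assumes a later quanteda release that provides tokens(), tokens_lookup() and tokens_ngrams(); it is not the applyDictionary() method for tokenizedTexts proposed in the e-mail. The dictionary and texts are made-up examples.

```r
library(quanteda)

# Hypothetical dictionary with unigram values only, and two toy documents
dict <- dictionary(list(a = c("one", "two"), b = "three"))
txt <- c(txt1 = "one two three four five", txt2 = "three two one")

# Step 1: unigram tokenise
toks <- tokens(txt)

# Step 2: apply the dictionary as a thesaurus; exclusive = FALSE keeps
# tokens that match no dictionary value instead of dropping them
toks_thes <- tokens_lookup(toks, dictionary = dict, exclusive = FALSE)

# Step 3: form bigrams from the thesaurus-ized tokens, then build the dfm
toks_bi <- tokens_ngrams(toks_thes, n = 2)
dfmat <- dfm(toks_bi)

featnames(dfmat)
```

Because the lookup happens before the n-grams are formed, the dictionary keys end up inside the bigram features rather than having to match whole bigrams.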

Thanks.

adamobeng · 2016-06-08T17:39:15Z

This is the same issue as #188. We're working on it, @LucFrachon!

kbenoit · 2016-11-06T14:41:08Z

I think this is the same issue as #188, but I am not 100% sure. It is more unusual to create ngrams and apply a dictionary in a single dfm() call, but once we have addressed #188 this will probably work: the features created will be ngrams, and any that are present as dictionary keys will be matched during the dfm processing.

For fixed pattern matches, this already works as of v0.9.8.8:

packageVersion("quanteda")
## [1] ‘0.9.8.8’
dict <- dictionary(list(a = c("one two", "one", "two"), b = "three"))
txt <- c(txt1 = "one two three four five", txt2 = "three two one")
dfm(txt, ngrams = 1:2, concatenator = " ", thesaurus = dict)
# Creating a dfm from a character vector ...
#   ... lowercasing
#   ... tokenizing
#   ... indexing documents: 2 documents
#   ... indexing features: 11 feature types
#   ... applying a dictionary consisting of 2 keys
#   ... created a 2 x 9 sparse dfm
#   ... complete. 
# Elapsed time: 0.021 seconds.
# Document-feature matrix of: 2 documents, 9 features (38.9% sparse).
# 2 x 9 sparse Matrix of class "dfmSparse"
#      four five two three three four four five three two two one A B
# txt1    1    1         1          1         1         0       0 3 1
# txt2    0    0         0          0         0         1       1 2 1
# Warning message:
#     In applyDictionary.dfm(dfmresult, dictionary, exclusive = ifelse(!is.null(thesaurus),  :
#       You will probably not get correct behaviour applying a dictionary with multi-word keys to a dfm.

@LucFrachon is that the behaviour you wanted?

koheiw · 2017-05-17T09:59:49Z

I see @LucFrachon's problem. This has nothing to do with ngrams. It is a problem with dfm_lookup() when exclusive = FALSE, which returns an empty dfm when there is no match.
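A minimal way to exercise that case, assuming a recent quanteda release; the dictionary below is a hypothetical one chosen so that none of its values occur in the texts:

```r
library(quanteda)

txt <- c(txt1 = "one two three four five", txt2 = "three two one")
dfmat <- dfm(tokens(txt))

# Hypothetical dictionary whose only value does not occur in txt
dict_nomatch <- dictionary(list(z = "zebra"))

# With exclusive = FALSE, features that match no dictionary value should be
# kept, so the result should not be empty even when nothing matches
res <- dfm_lookup(dfmat, dictionary = dict_nomatch, exclusive = FALSE)

dim(res)
```

If dim(res) reports zero features, the installed version still has the behaviour described here.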

Fix and test for #116

koheiw · 2017-06-19T14:35:25Z

It works.

> dict <- dictionary(list(a = c("one two", "one", "two"), b = "three"))
> txt <- c(txt1 = "one two three four five", txt2 = "three two one")
> dfm(txt, ngrams = 1:2, concatenator = " ", thesaurus = dict)
Document-feature matrix of: 2 documents, 10 features (40% sparse).
2 x 10 sparse Matrix of class "dfmSparse"
      features
docs   four five one two two three three four four five three two two one A B
  txt1    1    1       1         1          1         1         0       0 1 1
  txt2    0    0       0         0          0         0         1       1 2 1

kbenoit modified the milestone: v1.0 Mar 16, 2017

kbenoit assigned koheiw May 17, 2017

kbenoit added the dictionary label May 17, 2017

kbenoit modified the milestones: CRAN v0.9.9.9000, v1.0 May 17, 2017

koheiw added a commit that referenced this issue May 17, 2017

Fix and test for #116

edb1c13

kbenoit added a commit that referenced this issue May 17, 2017

Update NEWS for #116

5883c46

kbenoit added a commit that referenced this issue May 17, 2017

Merge pull request #737 from kbenoit/issue-116

c71aa0f

Fix and test for #116

koheiw closed this as completed Jun 19, 2017
