Applying thesaurus / dictionary to n-grams (with n > 1) #116

Closed
LucFrachon opened this issue Mar 23, 2016 · 4 comments

@LucFrachon

This follows up on our e-mail conversation.

When building a DFM with n-grams (rather than unigrams), the option to apply a thesaurus or dictionary fails because there is no match between an n-gram and dictionary keys (which are usually unigrams).
We would need a way to apply a dictionary before building the DFM, as per your suggestion:

I get what you are saying on the ngrams and the dictionary. Can’t be done using the existing tools, for the reasons you specify, but could be easily solved by adding an applyDictionary() method for tokenizedTexts. Steps would be:

1. unigram tokenise
2. apply dictionary, exclusive = FALSE to unigram tokens
3. dfm the thesaurus-ized tokens with ngrams = 2.
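
The three steps above can be sketched in base R as follows. This is only an illustration of the idea, not quanteda's implementation: `tokenise()`, `apply_thesaurus()` and `make_bigrams()` are hypothetical helper names, and the sketch assumes that every dictionary value is a single word (a multi-word value such as "one two" is not handled).

```r
# Hypothetical base R helpers for illustration; none of these are quanteda functions.

# Step 1: lowercase and split on whitespace to get unigram tokens.
tokenise <- function(txt) strsplit(tolower(txt), "\\s+")

# Step 2: replace tokens that match a dictionary value with the upper-cased key.
# Unmatched tokens are kept, which mirrors exclusive = FALSE.
apply_thesaurus <- function(tokens, dict) {
  values <- unlist(dict, use.names = FALSE)
  keys <- toupper(rep(names(dict), lengths(dict)))
  lapply(tokens, function(tok) {
    idx <- match(tok, values)   # NA where a token has no dictionary entry
    hit <- !is.na(idx)
    tok[hit] <- keys[idx[hit]]
    tok
  })
}

# Step 3: build bigrams from the thesaurus-ized tokens.
make_bigrams <- function(tokens, concatenator = " ") {
  lapply(tokens, function(tok) {
    if (length(tok) < 2) return(character(0))
    paste(tok[-length(tok)], tok[-1], sep = concatenator)
  })
}

# Assumed example data: a unigram-only dictionary and two short texts.
dict <- list(a = c("one", "two"), b = "three")
txt <- c(txt1 = "one two three four five", txt2 = "three two one")

bigrams <- make_bigrams(apply_thesaurus(tokenise(txt), dict))
print(bigrams)
```

Because the replacement happens on unigram tokens before the n-grams are formed, the dictionary keys end up inside the bigram features, which is the behaviour requested here. In later quanteda releases, tokens_lookup() and tokens_ngrams() play the roles of steps 2 and 3.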

Thanks.

@adamobeng
Collaborator

This is the same issue as #188. We're working on it, @LucFrachon!

@kbenoit
Collaborator

kbenoit commented Nov 6, 2016

I think this is the same issue as #188, but I am not 100% sure. Creating ngrams and applying a dictionary in a single dfm() call is unusual, but once we have addressed #188 it should work: the features created will be ngrams, and they will be matched during dfm processing if they are present as dictionary keys.

For fixed pattern matches, this already works as of v0.9.8.8:

packageVersion("quanteda")
## [1] ‘0.9.8.8’
dict <- dictionary(list(a = c("one two", "one", "two"), b = "three"))
txt <- c(txt1 = "one two three four five", txt2 = "three two one")
dfm(txt, ngrams = 1:2, concatenator = " ", thesaurus = dict)
# Creating a dfm from a character vector ...
#   ... lowercasing
#   ... tokenizing
#   ... indexing documents: 2 documents
#   ... indexing features: 11 feature types
#   ... applying a dictionary consisting of 2 keys
#   ... created a 2 x 9 sparse dfm
#   ... complete. 
# Elapsed time: 0.021 seconds.
# Document-feature matrix of: 2 documents, 9 features (38.9% sparse).
# 2 x 9 sparse Matrix of class "dfmSparse"
#      four five two three three four four five three two two one A B
# txt1    1    1         1          1         1         0       0 3 1
# txt2    0    0         0          0         0         1       1 2 1
# Warning message:
#     In applyDictionary.dfm(dfmresult, dictionary, exclusive = ifelse(!is.null(thesaurus),  :
#       You will probably not get correct behaviour applying a dictionary with multi-word keys to a dfm.

@LucFrachon is that the behaviour you wanted?

@kbenoit kbenoit modified the milestone: v1.0 Mar 16, 2017
@kbenoit kbenoit modified the milestones: CRAN v0.9.9.9000, v1.0 May 17, 2017
@koheiw
Collaborator

koheiw commented May 17, 2017

I see @LucFrachon's problem. This has nothing to do with ngrams; it is a problem with dfm_lookup() when exclusive = FALSE, which returns an empty dfm when there is no match.
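
To make the expected behaviour concrete, here is a minimal base R sketch of a non-exclusive lookup on a count matrix. It is not the code of dfm_lookup(): `lookup_features()` is a hypothetical helper, and keeping the unmatched columns while adding all-zero key columns is an assumption about what a correct result should look like when nothing matches.

```r
# Hypothetical base R helper, not quanteda's dfm_lookup(): collapse the columns
# of a document-by-feature count matrix onto upper-cased dictionary keys.
lookup_features <- function(counts, dict, exclusive = TRUE) {
  feats <- colnames(counts)
  # One summed column per key; a key with no matching feature sums to zero.
  key_list <- lapply(names(dict), function(k) {
    rowSums(counts[, feats %in% dict[[k]], drop = FALSE])
  })
  names(key_list) <- toupper(names(dict))
  key_counts <- do.call(cbind, key_list)
  if (exclusive) return(key_counts)
  # exclusive = FALSE: keep every feature that matched no dictionary value.
  matched <- feats %in% unlist(dict, use.names = FALSE)
  cbind(counts[, !matched, drop = FALSE], key_counts)
}

# Assumed example: no feature matches the dictionary at all.
counts <- matrix(c(1, 0, 1, 2), nrow = 2,
                 dimnames = list(c("txt1", "txt2"), c("four", "five")))
dict <- list(a = "one")

res <- lookup_features(counts, dict, exclusive = FALSE)
print(res)
```

In this sketch the original features survive a lookup with no matches, rather than the whole result coming back empty.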

koheiw added a commit that referenced this issue May 17, 2017
kbenoit added a commit that referenced this issue May 17, 2017
kbenoit added a commit that referenced this issue May 17, 2017
@koheiw koheiw closed this as completed Jun 19, 2017
@koheiw
Collaborator

koheiw commented Jun 19, 2017

It works.

> dict <- dictionary(list(a = c("one two", "one", "two"), b = "three"))
> txt <- c(txt1 = "one two three four five", txt2 = "three two one")
> dfm(txt, ngrams = 1:2, concatenator = " ", thesaurus = dict)
Document-feature matrix of: 2 documents, 10 features (40% sparse).
2 x 10 sparse Matrix of class "dfmSparse"
      features
docs   four five one two two three three four four five three two two one A B
  txt1    1    1       1         1          1         1         0       0 1 1
  txt2    0    0       0         0          0         0         1       1 2 1
