Applying thesaurus / dictionary to n-grams (with n > 1) #116
Comments
This is the same issue as #188. We're working on it, @LucFrachon!
I think this is the same issue as #188, but I am not 100% sure. It's more unusual to create ngrams and apply a dictionary in a single step using one (for fixed pattern matches). This already works as of v0.9.8.8:
packageVersion("quanteda")
## [1] ‘0.9.8.8’
dict <- dictionary(list(a = c("one two", "one", "two"), b = "three"))
txt <- c(txt1 = "one two three four five", txt2 = "three two one")
dfm(txt, ngrams = 1:2, concatenator = " ", thesaurus = dict)
# Creating a dfm from a character vector ...
# ... lowercasing
# ... tokenizing
# ... indexing documents: 2 documents
# ... indexing features: 11 feature types
# ... applying a dictionary consisting of 2 keys
# ... created a 2 x 9 sparse dfm
# ... complete.
# Elapsed time: 0.021 seconds.
# Document-feature matrix of: 2 documents, 9 features (38.9% sparse).
# 2 x 9 sparse Matrix of class "dfmSparse"
#      four five two three three four four five three two two one A B
# txt1    1    1         1          1         1         0       0 3 1
# txt2    0    0         0          0         0         1       1 2 1
# Warning message:
# In applyDictionary.dfm(dfmresult, dictionary, exclusive = ifelse(!is.null(thesaurus), :
#   You will probably not get correct behaviour applying a dictionary with multi-word keys to a dfm.
@LucFrachon is that the behaviour you wanted?
I see @LucFrachon's problem. This is nothing to do with ngrams, but with
It works.
> dict <- dictionary(list(a = c("one two", "one", "two"), b = "three"))
> txt <- c(txt1 = "one two three four five", txt2 = "three two one")
> dfm(txt, ngrams = 1:2, concatenator = " ", thesaurus = dict)
Document-feature matrix of: 2 documents, 10 features (40% sparse).
2 x 10 sparse Matrix of class "dfmSparse"
      features
docs   four five one two two three three four four five three two two one A B
  txt1    1    1       1         1          1         1         0       0 1 1
  txt2    0    0       0         0          0         0         1       1 2 1
E-mail conversation refers.
When building a DFM with n-grams (rather than unigrams), the option to apply a thesaurus or dictionary fails because there is no match between an n-gram and dictionary keys (which are usually unigrams).
We would need a way to apply a dictionary before building the DFM, as per your suggestion:
Thanks.
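To illustrate why the order of operations matters here, the following is a minimal base-R sketch, not quanteda's implementation. The helpers `apply_thesaurus()` and `make_ngrams()` are hypothetical names made up for this example, and the dictionary is simplified to unigram values only (the multi-word value "one two" from the examples in this thread is left out). It compares building bigrams first and then looking up the dictionary with doing the lookup on the tokens first.

```r
# Illustrative sketch in base R only; this is NOT how quanteda does it.
# apply_thesaurus() and make_ngrams() are hypothetical helper names.

# Replace every token that matches a dictionary value with its upper-cased key;
# unmatched tokens are kept as they are (thesaurus-style behaviour).
apply_thesaurus <- function(tokens, dict) {
  for (key in names(dict)) {
    tokens[tokens %in% dict[[key]]] <- toupper(key)
  }
  tokens
}

# Build all n-grams of size n from a token vector, joined by `sep`.
make_ngrams <- function(tokens, n, sep = " ") {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts, function(i) paste(tokens[i:(i + n - 1)], collapse = sep),
         character(1))
}

# Assumption: a simplified dictionary with unigram values only.
dict <- list(a = c("one", "two"), b = "three")
txt  <- c(txt1 = "one two three four five", txt2 = "three two one")
toks <- strsplit(tolower(txt), " ", fixed = TRUE)

# Order 1: n-grams first, then lookup. No bigram is equal to a unigram
# dictionary value, so the lookup leaves everything unchanged.
bigrams_first <- lapply(toks, make_ngrams, n = 2)
after_lookup  <- lapply(bigrams_first, apply_thesaurus, dict = dict)
identical(bigrams_first, after_lookup)  # TRUE

# Order 2: lookup first, then n-grams. The keys now appear inside the bigrams.
lookup_first <- lapply(toks, apply_thesaurus, dict = dict)
lapply(lookup_first, make_ngrams, n = 2)
```

With the second ordering, the bigrams for txt1 come out as "A A", "A B", "B four" and "four five", which is the kind of result the request above is after. Handling multi-word dictionary values would need a longest-match step before the replacement, which this sketch does not attempt.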