dfm_weight with weights option throws error #1150

Closed
thomasd2 opened this issue Dec 18, 2017 · 5 comments

@thomasd2

Using quanteda 0.99.9027, the following throws an error about "too many replacement values":

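library(quanteda)

testText <- c("brown brown yellow green", "yellow green blue")
(testDfm <- dfm(tokens(testText)))
# one random weight per feature, named by feature
testWeights <- rnorm(4)
names(testWeights) <- featnames(testDfm)
testWeights

# this call throws the "too many replacement values" error
dfm_weight(testDfm, weights = testWeights)
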
@kbenoit
Collaborator

kbenoit commented Dec 18, 2017

Thanks @thomasd2, I just pushed a fix and this should make its way into master by this afternoon.

@kbenoit kbenoit self-assigned this Dec 18, 2017
kbenoit added a commit that referenced this issue Dec 18, 2017
Fix #1150, bug for named weights in dfm_weight()
koheiw added a commit that referenced this issue Dec 18, 2017
@koheiw
Collaborator

koheiw commented Mar 24, 2018

In response to https://stackoverflow.com/questions/49467432/using-the-french-anew-dictionary-for-sentiment-analysis

require(quanteda)
w <- c('a' = 0.1, 'b' = 0.4,  'c' = 0.1)
txt <- c('a a b b', 'b b c d', 'c c d')
mt <- dfm(txt)

I think it should weight d by zero, but d currently keeps its raw counts:

mt %>%
    dfm_weight(weights = w)

# Document-feature matrix of: 3 documents, 4 features (41.7% sparse).
# 3 x 4 sparse Matrix of class "dfm"
#        features
# docs      a   b   c d
#   text1 0.2 0.8 0   0
#   text2 0   0.8 0.1 1
#   text3 0   0   0.2 1

This errors:

mt %>%
    dfm_select(names(w)) %>%
    dfm_weight(weights = w)

# Error in slot(value, what) : 
#   no slot of name "factors" for this object of class "dtCMatrix"

@koheiw koheiw reopened this Mar 24, 2018
koheiw added a commit that referenced this issue Mar 24, 2018
@koheiw koheiw added the design label Mar 24, 2018
@kbenoit
Collaborator

kbenoit commented Mar 24, 2018

It's a good question as to what to do for a feature whose weight is missing. I agree it should be zeroed or removed, or we should provide an option for that.

This looks like a bug in dfm_weight(x, weights = w) - the first two are fine but the third is definitely not!

> dfm_weight(mt[, 1:2], weights = w)
Document-feature matrix of: 3 documents, 2 features (50% sparse).
3 x 2 sparse Matrix of class "dfm"
       features
docs      a   b
  text1 0.2 0.8
  text2 0   0.8
  text3 0   0  
Warning message:
dfm_weight(): ignoring 1 unmatched weight feature 
> dfm_weight(mt[, 1:4], weights = w)
Document-feature matrix of: 3 documents, 4 features (41.7% sparse).
3 x 4 sparse Matrix of class "dfm"
       features
docs      a   b   c d
  text1 0.2 0.8 0   0
  text2 0   0.8 0.1 1
  text3 0   0   0.2 1
> dfm_weight(mt[, 1:3], weights = w)
 Error in slot(value, what) : 
  no slot of name "factors" for this object of class "dtCMatrix" 

🤔

@koheiw
Collaborator

koheiw commented Mar 24, 2018

I fixed the bug already. Let's remove missing features (unless you can think of a case where zeroed features are useful), so that people can apply a weighted lexicon in the same way as in dfm_lookup().

@kbenoit
Collaborator

kbenoit commented Mar 25, 2018

If the weights are

w <- c(a = 0.1, b = 0.4,  c = 0.1, d = 0)

then we would want the zero-weighted d to appear as zeroes in the dfm, but if it is omitted from the weights, then it's a candidate for exclusion. That mingles the weight and select operations, so it's worth thinking about the use cases carefully.
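
To make the options concrete, here is a minimal base R sketch on a plain matrix rather than a dfm. The helper weight_features() and its unmatched argument are hypothetical names used only for illustration; they are not part of quanteda, and this is not a claim about how dfm_weight() is or will be implemented.

```r
# Toy counts for the three texts above (documents x features).
counts <- matrix(
  c(2, 2, 0, 0,
    0, 2, 1, 1,
    0, 0, 2, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(docs = paste0("text", 1:3),
                  features = c("a", "b", "c", "d"))
)

# Hypothetical helper (NOT a quanteda function): multiply each column by its
# named weight, then handle features that have no weight at all.
weight_features <- function(x, weights, unmatched = c("drop", "zero", "keep")) {
  unmatched <- match.arg(unmatched)
  matched <- intersect(colnames(x), names(weights))
  no_weight <- setdiff(colnames(x), names(weights))
  out <- x
  # scale the matched columns by their weights
  out[, matched] <- sweep(x[, matched, drop = FALSE], 2, weights[matched], `*`)
  if (unmatched == "zero") out[, no_weight] <- 0                 # keep column, zero it
  if (unmatched == "drop") out <- out[, matched, drop = FALSE]   # remove column
  out                                                            # "keep": raw counts stay
}

w_omitted <- c(a = 0.1, b = 0.4, c = 0.1)          # d has no weight
w_zero    <- c(a = 0.1, b = 0.4, c = 0.1, d = 0)   # d explicitly weighted zero

weight_features(counts, w_omitted, unmatched = "drop")  # d is removed
weight_features(counts, w_omitted, unmatched = "zero")  # d is kept as zeroes
weight_features(counts, w_zero,    unmatched = "drop")  # d is kept as zeroes
```

Under the "drop" policy, an explicit d = 0 still produces a column of zeroes, while an omitted d disappears, which is the distinction described above. The "keep" policy corresponds to the output shown earlier in the thread, where d retains its raw counts.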

@kbenoit kbenoit added this to the CRAN release of v. 1.3 milestone May 22, 2018
@kbenoit kbenoit closed this as completed May 27, 2018