textstat_dist() is failing with dfm with empty rows #1730

Closed
koheiw opened this issue Jul 8, 2019 · 5 comments

@koheiw
Collaborator

koheiw commented Jul 8, 2019

library(quanteda)
library(readtext)

data_twitter <- readtext("/home/kohei/packages/quanteda.tutorials/content/data/twitter.json", 
                         source = "twitter")

corp_tweets <- corpus(data_twitter)
dfmat_tweets <- dfm(corp_tweets,
                    remove_punct = TRUE, remove_url = TRUE,
                    remove = c('*.tt', '*.uk', '*.com', 'rt', '#*', '@*')) %>% 
    dfm_remove(stopwords('en'))

dfmat_users <- dfm_group(dfmat_tweets, groups = 'screen_name')

dfmat_users_prop <- dfmat_users %>%
    dfm_select(min_nchar = 2) %>%
    dfm_trim(min_termfreq = 10) %>%
    dfm_weight('prop')

# number of users whose rows are empty after feature selection and trimming
sum(rowSums(dfmat_users_prop) == 0)
# [1] 200

textstat_dist(dfmat_users_prop)
# Error in rep_len(value, lenRepl) : attempt to replicate non-vector

Because of this bug, we cannot build quanteda.tutorials.io
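
A possible stopgap, sketched on toy data rather than the Twitter file: drop the empty documents before computing distances. The toy texts and the names `toks`, `dfmat` and `dfmat_nonempty` are made up for illustration, and whether this avoids the error in the actual build is an assumption, not something verified in this thread.

```r
library(quanteda)

# Toy input (hypothetical): "u2" has no text, so its dfm row is empty.
toks <- tokens(c(u1 = "apple banana apple", u2 = "", u3 = "banana cherry"))
dfmat <- dfm(toks)

# Keep only documents that contain at least one token.
dfmat_nonempty <- dfm_subset(dfmat, ntoken(dfmat) > 0)

ndoc(dfmat)           # documents before filtering
ndoc(dfmat_nonempty)  # documents after filtering
```

Filtering on the count dfm, before `dfm_weight('prop')`, keeps the check independent of the weighting scheme.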

koheiw self-assigned this Jul 8, 2019
koheiw added the bug label Jul 8, 2019
@kbenoit
Collaborator

kbenoit commented Jul 8, 2019

What's the dim here of dfmat_users_prop?

@koheiw
Collaborator Author

koheiw commented Jul 8, 2019

> dim(dfmat_users_prop)
[1] 5061  942

@kbenoit
Collaborator

kbenoit commented Jul 8, 2019

So the problem is an NA in the dfm? I can't understand the problem without the source data, but then I'm sure you are on top of this.

@koheiw
Collaborator Author

koheiw commented Jul 8, 2019

data_twitter <- readtext("content/data/twitter.json", source = "twitter")

@koheiw
Collaborator Author

koheiw commented Jul 8, 2019

This block is failing

quanteda/R/textstat_simil.R

Lines 574 to 577 in f72848a

if (any(na1))
    result[na1, , drop = FALSE] <- NA
if (any(na2))
    result[, na2, drop = FALSE] <- NA
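
For reference, R's `?Extract` help page notes that `drop` only works for extracting elements, not for replacement, so passing it in an assignment is a plausible culprit here. Below is a minimal base-R sketch of the same masking step without `drop`. It uses a plain matrix and `dist()` rather than quanteda's internals, and it assumes `na1` flags all-zero rows, which this thread does not confirm.

```r
# Toy document-feature matrix (hypothetical); "d2" is an empty document.
x <- matrix(c(1, 2, 0,
              0, 0, 0,
              3, 0, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("d1", "d2", "d3"), c("a", "b", "c")))

# Pairwise Euclidean distances as a full square matrix.
result <- as.matrix(dist(x))

# Assumption: na1 marks the documents with no features at all.
na1 <- rowSums(x) == 0

# Blank out the rows and columns of empty documents.
# No `drop` argument: it belongs to extraction with `[`, not to `[<-`.
if (any(na1)) {
    result[na1, ] <- NA
    result[, na1] <- NA
}

result
```

Distances between non-empty documents are left untouched; only the row and column for "d2" become NA.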

koheiw closed this as completed Jul 10, 2019