upgrade_dfm() not working on some pre-docid_ v2 objects #2097

Closed
chainsawriot opened this issue Mar 21, 2021 · 9 comments

Contributor

chainsawriot commented Mar 21, 2021

Describe the bug

I encountered this problem while working with the useNews (@cbpuschmann & @MarHai, 2020) dataset. As the data is not mine, this could be a data issue (e.g. the dfm is malformed), but printing the dataset doesn't suggest so.

I can't pinpoint the actual cause of the error. debug() points to dfm_match, exactly this line.

Reproducible code

Please paste minimal code that reproduces the bug. If possible, please upload the data file as .rds.

require(osfr)
require(quanteda)

osf_retrieve_node("uzca3") %>% osf_ls_files(n_max = 1, pattern = "usenews.mediacloud.wm.RData") %>% osf_download(path = ".", progress = TRUE)

load("usenews.mediacloud.wm.RData")

## mediacloud.wordmatrix2019 and mediacloud.wordmatrix2020 are a list of dfms.

mediacloud.wordmatrix2019[[1]] ## looks okay, it was created with v 1.5.2
mediacloud.wordmatrix2020[[1]] ## this one might be corrupted, but looks okay, was created with v 2.1.0 

dfm_lookup(mediacloud.wordmatrix2019[[1]], dictionary(list(Boston = c("boston")))) ## works
dfm_match(mediacloud.wordmatrix2019[[1]], "boston") ## works

dfm_lookup(mediacloud.wordmatrix2020[[1]], dictionary(list(Boston = c("boston")))) ## error
dfm_match(mediacloud.wordmatrix2020[[1]], "boston") ## print problem

Expected behavior

It works for 2019 data but not for 2020 data. 2020 data should work.

System information

R version 4.0.4 (2021-02-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] quanteda_3.0.0 osfr_0.2.8    

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6         magrittr_2.0.1     stopwords_2.2      tidyselect_1.1.0  
 [5] lattice_0.20-41    R6_2.5.0           rlang_0.4.10       fastmatch_1.1-0   
 [9] httr_1.4.2         dplyr_1.0.2        tools_4.0.4        grid_4.0.4        
[13] ellipsis_0.3.1     RcppParallel_5.0.3 digest_0.6.27      httpcode_0.3.0    
[17] tibble_3.0.4       lifecycle_0.2.0    crayon_1.4.1       Matrix_1.3-2      
[21] purrr_0.3.4        vctrs_0.3.5        fs_1.5.0           curl_4.3          
[25] crul_1.0.0         memoise_1.1.0      glue_1.4.2         stringi_1.5.3     
[29] compiler_4.0.4     pillar_1.4.7       generics_0.1.0     jsonlite_1.7.1    
[33] pkgconfig_2.0.3   

Additional info

Both the CRAN version (2.1.2) and the GitHub version (3.0.0) have the same issue.

Collaborator

kbenoit commented Mar 21, 2021

Thanks for filing this and making it easy to trace. I've figured out the issue.

@koheiw what's happening here is that our is_pre2() returns FALSE, so upgrade_dfm() is never triggered. However, even if we do try to run upgrade_docvars() from that function on this dfm, it fails. I think we must have introduced the new builder functions and the docid_ etc. structure just after releasing v2, so we didn't catch this case. It would be great to solve this before we release v3. Maybe change the is_pre2() conditions to become is_pre3()?

To make this more easily reproducible (since the original files are several GB each) I've extracted just the offending dfm here:

library("quanteda")
## Package version: 3.0.0
## Unicode version: 10.0
## ICU version: 61.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
problem_dfmat <- readRDS(url("https://www.dropbox.com/s/p15zhbluiltp36x/problem_dfmat.rds?dl=1"))

meta(problem_dfmat, type = "system")
## $`package-version`
## [1] '2.1.0'
## 
## $`r-version`
## [1] '4.0.2'
## 
## $system
##  sysname  machine     user 
## "Darwin" "x86_64"     "cp" 
## 
## $directory
## [1] "/Users/cp/GDrive/Projekte/useNews"
## 
## $created
## [1] "2020-10-30"

# problem is no docid_
names(quanteda:::get_docvars.dfm(problem_dfmat, user = TRUE, system = TRUE))
##  [1] "stories_id"           "processed_stories_id" "collect_date"        
##  [4] "guid"                 "title"                "publish_date"        
##  [7] "url"                  "language"             "ap_syndicated"       
## [10] "media_id"             "media_name"           "media_url"
docid(problem_dfmat)
## Error in select_docvars(x@docvars, field, user, system, drop): field(s) docid_ not found

Created on 2021-03-21 by the reprex package (v1.0.0)

@kbenoit kbenoit added this to the v3 release milestone Mar 21, 2021
@kbenoit kbenoit changed the title from "dfm_lookup error: all(dims >= dims.min) is not TRUE" to "upgrade_dfm() not working on some pre-docid_ v2 objects" Mar 21, 2021
Collaborator

kbenoit commented Mar 21, 2021

@koheiw it's possible the same issue will also affect corpus and tokens objects, but since we didn't get that object "in the wild" I have not confirmed.

@koheiw koheiw mentioned this issue Mar 22, 2021
Collaborator

koheiw commented Mar 22, 2021

I think the malformed dfm was created because we forgot to call as.tokens() in dfm.tokens(). I fixed that recently.


MarHai commented Mar 22, 2021

Thanks for the report @chainsawriot and the quick help, @kbenoit and @koheiw. Anything we (as the dataset authors) could do as of now? Could we, for example, trigger any update mechanisms manually to update the datasets?

kbenoit added a commit that referenced this issue Mar 22, 2021
Collaborator

kbenoit commented Mar 22, 2021

@MarHai I'd suggest

mediacloud.wordmatrix2020 <- lapply(mediacloud.wordmatrix2020, as.dfm)

to update your objects.
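To check whether the upgrade helped, a minimal sketch along these lines might be useful. It assumes quanteda >= 3.0 (where docid() is exported); lacks_docid() and dfmat_list are hypothetical names used only for illustration, and the small list stands in for the real, much larger objects.

```r
library("quanteda")

# Hypothetical helper (not part of quanteda): TRUE when docid() fails on a dfm,
# which is the symptom shown earlier in this thread.
lacks_docid <- function(x) {
    inherits(try(docid(x), silent = TRUE), "try-error")
}

# Stand-in for the real list of dfms; names and texts are made up.
dfmat_list <- list(
    a = dfm(tokens(c(d1 = "boston is a city", d2 = "so is new york"))),
    b = dfm(tokens(c(d1 = "another small example")))
)

# Which elements show the symptom before the upgrade?
before <- vapply(dfmat_list, lacks_docid, logical(1))

# Upgrade every element, as suggested above.
dfmat_list <- lapply(dfmat_list, as.dfm)

# Re-check afterwards; any TRUE here would mean the upgrade did not help.
after <- vapply(dfmat_list, lacks_docid, logical(1))
print(rbind(before, after))
```

With the real data, replace dfmat_list with mediacloud.wordmatrix2020; elements flagged in before but not in after are the ones that were repaired.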

@koheiw in #2098, now merged, I noticed that as.dfm() (via upgrade_dfm()) does not update the system meta. We should change that, no? Below it exits before updating the rest.

upgrade_dfm <- function(x) {
    if (!is_pre2(x)) return(x)
    attrs <- attributes(x)
    if ("meta" %in% names(attrs)) {
        x@docvars <- upgrade_docvars(attrs$docvars, rownames(x))
        return(x)
    }
    # (excerpt: the rest of the function is omitted)

Collaborator

koheiw commented Mar 22, 2021

There is only one version of the system meta, so it should be valid if an object has it, i.e. if ("meta" %in% names(attrs)) == TRUE.

Collaborator

kbenoit commented Mar 22, 2021

What I meant was we don't update it for the version that performed the upgrade.

> problem_dfmat <- readRDS(url("https://www.dropbox.com/s/p15zhbluiltp36x/problem_dfmat.rds?dl=1"))
> problem_dfmat2 <- as.dfm(problem_dfmat)
> meta(problem_dfmat2, type = "system")
$`package-version`
[1] ‘2.1.0’

$`r-version`
[1] ‘4.0.2’

$system
 sysname  machine     user 
"Darwin" "x86_64"     "cp" 

$directory
[1] "/Users/cp/GDrive/Projekte/useNews"

$created
[1] "2020-10-30"

Collaborator

koheiw commented Mar 22, 2021

I see. I think it is OK without updating the version, because the object structure is of v2.1 (the same in v3.0).

@kbenoit kbenoit closed this as completed Mar 22, 2021

MarHai commented Mar 30, 2021

For the sake of completeness, @chainsawriot, the respective objects in the OSF repository of useNews are now updated as well.
