upgrade_dfm() not working on some pre-`docid_` v2 objects #2097

chainsawriot · 2021-03-21T16:31:26Z

Describe the bug

I encountered this problem while working with the useNews dataset (@cbpuschmann & @MarHai, 2020). As the data are not mine, it could be a data issue (e.g. a malformed dfm), but printing the dataset doesn't suggest so.

I can't pinpoint the actual cause of the error. debug() points to dfm_match, exactly this line.

Reproducible code

require(osfr)
require(quanteda)

osf_retrieve_node("uzca3") %>%
    osf_ls_files(n_max = 1, pattern = "usenews.mediacloud.wm.RData") %>%
    osf_download(path = ".", progress = TRUE)

load("usenews.mediacloud.wm.RData")

## mediacloud.wordmatrix2019 and mediacloud.wordmatrix2020 are a list of dfms.

mediacloud.wordmatrix2019[[1]] ## looks okay, it was created with v 1.5.2
mediacloud.wordmatrix2020[[1]] ## this one might be corrupted, but looks okay, was created with v 2.1.0 

dfm_lookup(mediacloud.wordmatrix2019[[1]], dictionary(list(Boston = c("boston")))) ## works
dfm_match(mediacloud.wordmatrix2019[[1]], "boston") ## works

dfm_lookup(mediacloud.wordmatrix2020[[1]], dictionary(list(Boston = c("boston")))) ## error
dfm_match(mediacloud.wordmatrix2020[[1]], "boston") ## print problem

Expected behavior

Both calls work on the 2019 data but fail on the 2020 data. They should work on the 2020 data as well.

System information

R version 4.0.4 (2021-02-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] quanteda_3.0.0 osfr_0.2.8    

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6         magrittr_2.0.1     stopwords_2.2      tidyselect_1.1.0  
 [5] lattice_0.20-41    R6_2.5.0           rlang_0.4.10       fastmatch_1.1-0   
 [9] httr_1.4.2         dplyr_1.0.2        tools_4.0.4        grid_4.0.4        
[13] ellipsis_0.3.1     RcppParallel_5.0.3 digest_0.6.27      httpcode_0.3.0    
[17] tibble_3.0.4       lifecycle_0.2.0    crayon_1.4.1       Matrix_1.3-2      
[21] purrr_0.3.4        vctrs_0.3.5        fs_1.5.0           curl_4.3          
[25] crul_1.0.0         memoise_1.1.0      glue_1.4.2         stringi_1.5.3     
[29] compiler_4.0.4     pillar_1.4.7       generics_0.1.0     jsonlite_1.7.1    
[33] pkgconfig_2.0.3

Additional info

Both the CRAN version (2.1.2) and the GitHub version (3.0.0) have the same issue.

kbenoit · 2021-03-21T17:37:20Z

Thanks for filing this and making it easy to trace. I've figured out the issue.

@koheiw what's happening here is that our is_pre2() returns FALSE, so upgrade_dfm() is never triggered. However, even if we do try to run upgrade_docvars() from that function on this dfm, it fails. I think we must have introduced the new builder functions and the docid_ etc. structure just after releasing v2, so we didn't catch this case. It would be great to solve this before we release v3. Maybe change the is_pre2() conditions so that it becomes is_pre3()?

To make this more easily reproducible (since the original files are several GB each) I've extracted just the offending dfm here:

library("quanteda")
## Package version: 3.0.0
## Unicode version: 10.0
## ICU version: 61.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
problem_dfmat <- readRDS(url("https://www.dropbox.com/s/p15zhbluiltp36x/problem_dfmat.rds?dl=1"))

meta(problem_dfmat, type = "system")
## $`package-version`
## [1] '2.1.0'
## 
## $`r-version`
## [1] '4.0.2'
## 
## $system
##  sysname  machine     user 
## "Darwin" "x86_64"     "cp" 
## 
## $directory
## [1] "/Users/cp/GDrive/Projekte/useNews"
## 
## $created
## [1] "2020-10-30"

# problem is no docid_
names(quanteda:::get_docvars.dfm(problem_dfmat, user = TRUE, system = TRUE))
##  [1] "stories_id"           "processed_stories_id" "collect_date"        
##  [4] "guid"                 "title"                "publish_date"        
##  [7] "url"                  "language"             "ap_syndicated"       
## [10] "media_id"             "media_name"           "media_url"
docid(problem_dfmat)
## Error in select_docvars(x@docvars, field, user, system, drop): field(s) docid_ not found

Created on 2021-03-21 by the reprex package (v1.0.0)

kbenoit · 2021-03-21T18:39:08Z

@koheiw it's possible the same issue will also affect corpus and tokens objects, but since we haven't come across such an object "in the wild", I have not confirmed this.

koheiw · 2021-03-22T00:41:00Z

I think the dfm ended up in this state because we forgot to call as.tokens() in dfm.tokens(). I fixed that recently.

MarHai · 2021-03-22T08:00:04Z

Thanks for the report, @chainsawriot, and for the quick help, @kbenoit and @koheiw. Is there anything we (as the dataset authors) could do for now? Could we, for example, trigger an update mechanism manually to update the datasets?

kbenoit · 2021-03-22T08:46:58Z

@MarHai I'd suggest

mediacloud.wordmatrix2020 <- lapply(mediacloud.wordmatrix2020, as.dfm)

to update your objects.

@koheiw in #2098, now merged, I noticed that as.dfm() (via upgrade_dfm()) does not update the system meta. We should change that, no? In the code below, it returns before updating the rest.

quanteda/R/object-builder.R

Lines 52 to 58 in ffb814a

    
upgrade_dfm <- function(x) {
    if (!is_pre2(x)) return(x)
    attrs <- attributes(x)
    if ("meta" %in% names(attrs)) {
        x@docvars <- upgrade_docvars(attrs$docvars, rownames(x))
        return(x)
    }

koheiw · 2021-03-22T09:09:09Z

There is only one version of the system meta, so it should be valid if an object has it, i.e. ("meta" %in% names(attrs)) == TRUE.

kbenoit · 2021-03-22T09:13:00Z

What I meant was we don't update it for the version that performed the upgrade.

> problem_dfmat <- readRDS(url("https://www.dropbox.com/s/p15zhbluiltp36x/problem_dfmat.rds?dl=1"))
> problem_dfmat2 <- as.dfm(problem_dfmat)
> meta(problem_dfmat2, type = "system")
$`package-version`
[1] ‘2.1.0’

$`r-version`
[1] ‘4.0.2’

$system
 sysname  machine     user 
"Darwin" "x86_64"     "cp" 

$directory
[1] "/Users/cp/GDrive/Projekte/useNews"

$created
[1] "2020-10-30"

koheiw · 2021-03-22T10:27:16Z

I see. I think it is OK not to update the version, because the object structure is that of v2.1 (the same as in v3.0).

MarHai · 2021-03-30T11:10:23Z

For the sake of completeness, @chainsawriot, the respective objects in the OSF repository of useNews are now updated as well.

kbenoit assigned koheiw Mar 21, 2021

kbenoit added this to the v3 release milestone Mar 21, 2021

kbenoit changed the title ~~dfm_lookup error: all(dims >= dims.min) is not TRUE~~ upgrade_dfm() not working on some pre-docid_ v2 objects Mar 21, 2021

koheiw mentioned this issue Mar 22, 2021

Issue 2097 #2098

Merged

kbenoit added a commit that referenced this issue Mar 22, 2021

Update NEWS for #2097

998f361

kbenoit closed this as completed Mar 22, 2021
