upgrade_dfm() not working on some pre-docid_ v2 objects #2097
Comments
Thanks for filing this and making it easy to trace. I've figured out the issue. @koheiw what's happening here is that our
To make this more easily reproducible (since the original files are several GB each), I've extracted just the offending dfm here:
library("quanteda")
## Package version: 3.0.0
## Unicode version: 10.0
## ICU version: 61.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
problem_dfmat <- readRDS(url("https://www.dropbox.com/s/p15zhbluiltp36x/problem_dfmat.rds?dl=1"))
meta(problem_dfmat, type = "system")
## $`package-version`
## [1] '2.1.0'
##
## $`r-version`
## [1] '4.0.2'
##
## $system
## sysname machine user
## "Darwin" "x86_64" "cp"
##
## $directory
## [1] "/Users/cp/GDrive/Projekte/useNews"
##
## $created
## [1] "2020-10-30"
# problem is no docid_
names(quanteda:::get_docvars.dfm(problem_dfmat, user = TRUE, system = TRUE))
## [1] "stories_id" "processed_stories_id" "collect_date"
## [4] "guid" "title" "publish_date"
## [7] "url" "language" "ap_syndicated"
## [10] "media_id" "media_name" "media_url"
docid(problem_dfmat)
## Error in select_docvars(x@docvars, field, user, system, drop): field(s) docid_ not found
Created on 2021-03-21 by the reprex package (v1.0.0)
@koheiw it's possible the same issue will also affect corpus and tokens objects, but since we didn't get that object "in the wild", I have not confirmed.
I think that the DFM was created because we forgot to call
Thanks for the report, @chainsawriot, and for the quick help, @kbenoit and @koheiw. Is there anything we (as the dataset authors) could do right now? Could we, for example, trigger any update mechanisms manually to update the datasets?
@MarHai I'd suggest
mediacloud.wordmatrix2020 <- lapply(mediacloud.wordmatrix2020, as.dfm)
to update your objects.
@koheiw in #2098, now merged, I noticed that Lines 52 to 58 in ffb814a
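For anyone holding a list of older dfm objects, the suggestion above can be narrowed so that only objects missing docid_ are converted. This is a minimal sketch, not the maintainers' method: has_docid() and upgrade_if_needed() are hypothetical helper names, and the check relies on the internal quanteda:::get_docvars.dfm() call shown in the reprex above, which is not exported and may change between versions. The toy dfm is only there to make the example run offline.

```r
library("quanteda")

# Hypothetical helper, not part of quanteda: TRUE when the dfm carries the
# docid_ system docvar that docid() looks up. Relies on the internal
# get_docvars.dfm() used in the reprex above (an assumption for other versions).
has_docid <- function(x) {
  "docid_" %in% names(quanteda:::get_docvars.dfm(x, user = TRUE, system = TRUE))
}

# Hypothetical helper: apply as.dfm() only to list elements that lack docid_.
upgrade_if_needed <- function(dfm_list) {
  lapply(dfm_list, function(x) if (has_docid(x)) x else as.dfm(x))
}

# Toy dfm created with the installed version, so it should already have docid_.
dfmat <- dfm(tokens(c(d1 = "a b c", d2 = "b c d")))
dfm_list <- upgrade_if_needed(list(toy = dfmat))

# One logical per list element: does it have docid_ after the pass?
vapply(dfm_list, has_docid, logical(1))
```

With the real data, the list passed to upgrade_if_needed() would be mediacloud.wordmatrix2020 rather than the toy list.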
There is only one version of the system meta, so it should be valid if an object has it. ie
What I meant was we don't update it for the version that performed the upgrade.
> problem_dfmat <- readRDS(url("https://www.dropbox.com/s/p15zhbluiltp36x/problem_dfmat.rds?dl=1"))
> problem_dfmat2 <- as.dfm(problem_dfmat)
> meta(problem_dfmat2, type = "system")
$`package-version`
[1] ‘2.1.0’
$`r-version`
[1] ‘4.0.2’
$system
sysname machine user
"Darwin" "x86_64" "cp"
$directory
[1] "/Users/cp/GDrive/Projekte/useNews"
$created
[1] "2020-10-30"
I see. I think it is OK not to update the version, because the object structure is that of v2.1 (the same as in v3.0).
For the sake of completeness, @chainsawriot, the respective objects in the OSF repository of useNews are now updated as well.
Describe the bug
I encountered this problem while working with the useNews (@cbpuschmann & @MarHai, 2020) dataset. As the data is not mine, it could be a data issue (e.g. the dfm is malformed), but displaying the dataset doesn't suggest so.
I can't directly pinpoint the actual cause of the error. debug() actually points to dfm_match, exactly this line.
Reproducible code
Please paste minimal code that reproduces the bug. If possible, please upload the data file as .rds.
Expected behavior
It works for 2019 data but not for 2020 data. 2020 data should work.
System information
Additional info
Both the CRAN version (2.1.2) and the Github version (3.0.0) have the same issue.