Convert to data.frame issue if 'document' is non-unique column name. #1918

Closed
danlewis85 opened this issue Mar 26, 2020 · 0 comments · Fixed by #1919

Converting a dfm to a data.frame using `convert()` produces two columns with the same name if one of the features in your dfm is also called 'document', because the column holding the document names is itself named 'document'.

Perhaps rename the document column to something more likely to be unique, such as "doc_id", in line with the rOpenSci Text Interchange Format.

Reproducible code

library(magrittr)
library(quanteda)

# convert dfm to data.frame
dfm_df <- dfm(c("this is a fine document")) %>% convert(to = 'data.frame')

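# workaround: rename the first (document name) column so it no longer clashes with the 'document' feature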
names(dfm_df)[1] <- "doc_id"
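
To illustrate why the rename helps, here is a minimal base R sketch. It assumes the converted object looks roughly like the hand-built `dup_df` below; that data.frame is a stand-in written for illustration, not actual quanteda output.

```r
# Hand-built stand-in for the converted dfm (assumption: not produced by quanteda).
# check.names = FALSE keeps the duplicated name instead of making it unique.
dup_df <- data.frame(document = "text1", this = 1, fine = 1, document = 1,
                     check.names = FALSE, stringsAsFactors = FALSE)

names(dup_df)  # two columns share the name "document"
has_dup <- anyDuplicated(names(dup_df)) > 0

# Name-based extraction only ever reaches the first matching column
first_match <- dup_df[["document"]]

# Same workaround as above: rename the first column
names(dup_df)[1] <- "doc_id"
still_dup <- anyDuplicated(names(dup_df)) > 0
```

After the rename, `dup_df[["document"]]` refers unambiguously to the feature count, and `anyDuplicated()` no longer reports a clash.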

Expected behavior

With two 'document' columns in the data.frame, R throws an rlang error as soon as you try to make use of that column, for example:

Call `rlang::last_error()` to see a backtrace.

System information

R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] quanteda_2.0.1 magrittr_1.5

loaded via a namespace (and not attached):
[1] Rcpp_1.0.3 rstudioapi_0.10 stopwords_1.0 tidyselect_0.2.5
[5] munsell_0.5.0 colorspace_1.4-1 lattice_0.20-38 R6_2.4.1
[9] rlang_0.4.1 fastmatch_1.1-0 dplyr_0.8.3 tools_3.6.1
[13] grid_3.6.1 data.table_1.12.8 gtable_0.3.0 lazyeval_0.2.2
[17] RcppParallel_5.0.0 assertthat_0.2.1 tibble_2.1.3 lifecycle_0.1.0
[21] crayon_1.3.4 Matrix_1.2-17 purrr_0.3.3 ggplot2_3.2.1
[25] glue_1.3.1 stringi_1.4.3 compiler_3.6.1 pillar_1.4.2
[29] scales_1.1.0 pkgconfig_2.0.3

kbenoit added a commit that referenced this issue Mar 27, 2020
- Adds a `docid_field` argument, defaulting to `"doc_id"`, to `convert(x, to = "data.frame")`
- Checks for collisions between the `docid_field` and a named feature
- Re-implements the deprecated `as.data.frame.dfm()` to use the same internal function as `convert(x, to = "data.frame")`
- Updates tests
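
The collision check described in these notes could look roughly like the sketch below. `with_docid()` is a hypothetical helper written in base R for illustration; it is not quanteda's internal function, and the error message wording is an assumption.

```r
# Hypothetical helper: prepend a document-id column to a feature table,
# refusing to do so if the chosen name is already a feature name.
with_docid <- function(features, docnames, docid_field = "doc_id") {
  if (docid_field %in% names(features)) {
    stop("'", docid_field, "' matches a feature name; choose another docid_field")
  }
  ids <- data.frame(docnames, stringsAsFactors = FALSE)
  names(ids) <- docid_field
  cbind(ids, features)
}

# Stand-in feature counts for one document (assumption: not real dfm output)
features <- data.frame(this = 1, fine = 1, document = 1)

# The default name does not clash with any feature
ok <- with_docid(features, "text1")

# Reusing the old column name collides with the 'document' feature
res <- tryCatch(
  with_docid(features, "text1", docid_field = "document"),
  error = function(e) conditionMessage(e)
)
```

With the fixed version of the package, the corresponding call would presumably be `convert(x, to = "data.frame", docid_field = "doc_id")`, as named in the notes above.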
kbenoit mentioned this issue Mar 27, 2020
kbenoit added a commit that referenced this issue Apr 4, 2020