corpus_reshape and trim operations drop documents #1978

Closed
kbenoit opened this issue Jul 2, 2020 · 1 comment

@kbenoit
Collaborator

kbenoit commented Jul 2, 2020

When a document is reshaped to sentences and a subsequent operation, such as corpus_trim(), removes all of that document's sentences, reshaping back to the document level drops the document entirely. This looks like an unanticipated side effect. If we want it to be policy, we should decide that explicitly and document it, rather than leaving it as a surprise.

I noticed this while answering this SO question about textstat_readability(). Because that function trims short sentences rather than scoring them, it was removing an entire document that consisted of only one short sentence. (This can happen a lot with tweets.) We fixed that in #1977 but made a note to revisit the underlying issue for a more careful, more fundamental fix.

library("quanteda")
## Package version: 2.1.0
## Parallel computing: 2 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View

txt <- c(
  d1 = "The cat in the hat ate green ham and eggs.",
  d2 = "",
  d3 = "Once upon a time.",
  d4 = NA
)

corp <- corpus(txt)
## Warning: NA is replaced by empty string

txt
##                                           d1 
## "The cat in the hat ate green ham and eggs." 
##                                           d2 
##                                           "" 
##                                           d3 
##                          "Once upon a time." 
##                                           d4 
##                                           NA
corp
## Corpus consisting of 4 documents.
## d1 :
## "The cat in the hat ate green ham and eggs."
## 
## d2 :
## ""
## 
## d3 :
## "Once upon a time."
## 
## d4 :
## ""

corp %>%
  corpus_reshape(to = "sentences") %>%
  corpus_reshape(to = "documents")
## Corpus consisting of 2 documents.
## d1 :
## "The cat in the hat ate green ham and eggs."
## 
## d3 :
## "Once upon a time."

char_trim(txt, what = "documents", min_ntoken = 0)
## Warning: NA is replaced by empty string
##                                           d1 
## "The cat in the hat ate green ham and eggs." 
##                                           d2 
##                                           "" 
##                                           d3 
##                          "Once upon a time." 
##                                           d4 
##                                           ""
char_trim(txt, what = "documents", min_ntoken = 1)
## Warning: NA is replaced by empty string
##                                           d1 
## "The cat in the hat ate green ham and eggs." 
##                                           d3 
##                          "Once upon a time."

char_trim(txt, what = "sentences", min_ntoken = 0)
## Warning: NA is replaced by empty string
##                                           d1 
## "The cat in the hat ate green ham and eggs." 
##                                           d3 
##                          "Once upon a time."
char_trim(txt, what = "sentences", min_ntoken = 1)
## Warning: NA is replaced by empty string
##                                           d1 
## "The cat in the hat ate green ham and eggs." 
##                                           d3 
##                          "Once upon a time."
char_trim(txt, what = "sentences", min_ntoken = 5)
## Warning: NA is replaced by empty string
##                                           d1 
## "The cat in the hat ate green ham and eggs."

corpus_trim(corp, what = "sentences", min_ntoken = 0)
## Corpus consisting of 2 documents.
## d1 :
## "The cat in the hat ate green ham and eggs."
## 
## d3 :
## "Once upon a time."
corpus_trim(corp, what = "sentences", min_ntoken = 1)
## Corpus consisting of 2 documents.
## d1 :
## "The cat in the hat ate green ham and eggs."
## 
## d3 :
## "Once upon a time."
corpus_trim(corp, what = "sentences", min_ntoken = 5)
## Corpus consisting of 1 document.
## d1 :
## "The cat in the hat ate green ham and eggs."
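
The dropping behaviour in the transcript above can be reproduced in miniature with base R alone. The sketch below only illustrates the mechanism and is not quanteda's implementation: the regex sentence splitter and the names sents, sent_docid, naive, and kept are assumptions made for this example. Regrouping sentences by a plain character id loses documents that contributed no sentences, whereas regrouping by a factor that keeps every original document name as a level preserves them as empty documents.

```r
# Toy input mirroring the issue; d4 is "" because corpus() replaces NA with "".
txt <- c(
  d1 = "The cat in the hat ate green ham and eggs.",
  d2 = "",
  d3 = "Once upon a time.",
  d4 = ""
)

# Naive sentence split (assumption: split on whitespace after ., ! or ?).
# Empty strings yield zero sentences.
sents <- strsplit(txt, "(?<=[.!?])\\s+", perl = TRUE)
sent_docid <- rep(names(sents), lengths(sents))
sent_text <- unlist(sents, use.names = FALSE)

# Regroup by a character id: d2 and d4 have no rows, so they disappear.
naive <- tapply(sent_text, sent_docid, paste, collapse = " ")

# Regroup by a factor whose levels are all original document names:
# empty levels survive split() and come back as empty documents.
sent_docid_f <- factor(sent_docid, levels = names(txt))
kept <- vapply(split(sent_text, sent_docid_f), paste, character(1), collapse = " ")

names(naive)
names(kept)
```

Whether empty documents should come back as empty strings or be dropped is exactly the policy question raised above; the factor-based variant simply makes the choice explicit.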
@koheiw
Collaborator

koheiw commented Mar 15, 2021

corpus_reshape() now retains the original docid, so you can do:

require(quanteda)
#> Loading required package: quanteda
#> Package version: 2.9.9000
#> Unicode version: 13.0
#> ICU version: 66.1
#> Parallel computing: 6 of 6 threads used.
#> See https://quanteda.io for tutorials and examples.
txt <- c(
    d1 = "The cat in the hat ate green ham and eggs.",
    d2 = "",
    d3 = "Once upon a time.",
    d4 = NA
)

corp <- corpus(txt)
corp_sent <- corpus_reshape(corp)
docid(corpus_subset(corp_sent, nchar(corp_sent) > 0, drop_docid = FALSE))
#> [1] d1 d3
#> Levels: d1 d2 d3 d4
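
Building on the snippet above, the retained levels make it possible to list which original documents no longer have any sentences after subsetting. This is a sketch that reuses only the calls shown in this comment plus base R's levels() and setdiff(). It assumes a quanteda version in which docid() returns a factor and corpus_subset() accepts drop_docid, as in the output above. The names corp_kept, ids, and dropped are introduced here for illustration.

```r
library("quanteda")

txt <- c(
  d1 = "The cat in the hat ate green ham and eggs.",
  d2 = "",
  d3 = "Once upon a time.",
  d4 = NA
)

corp <- corpus(txt)
corp_sent <- corpus_reshape(corp, to = "sentences")

# Keep non-empty sentences but retain all original docid levels.
corp_kept <- corpus_subset(corp_sent, nchar(corp_sent) > 0, drop_docid = FALSE)

# Levels with no remaining sentences are the documents that would be dropped.
ids <- docid(corp_kept)
dropped <- setdiff(levels(ids), as.character(ids))
dropped
```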

kbenoit closed this as completed on Mar 15, 2021