corpus_reshape and trim operations drop documents #1978

kbenoit · 2020-07-02T08:18:07Z

When a corpus is reshaped to sentences and some operation, such as corpus_trim(), then removes all of a document's sentences, reshaping back to the document level drops that document entirely. This looks like an unanticipated side effect. If we want it to be policy, we should decide that explicitly and document it rather than leave it as a surprise.

I noticed this when answering a Stack Overflow question about textstat_readability(). Because that function trims short sentences rather than scoring them, it was removing an entire document that consisted of a single short sentence. (This can happen a lot with Tweets.) We fixed that in #1977, but made a note to revisit the underlying issue for a more careful and more fundamental fix.

library("quanteda")
## Package version: 2.1.0
## Parallel computing: 2 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View

txt <- c(
  d1 = "The cat in the hat ate green ham and eggs.",
  d2 = "",
  d3 = "Once upon a time.",
  d4 = NA
)

corp <- corpus(txt)
## Warning: NA is replaced by empty string

txt
##                                           d1 
## "The cat in the hat ate green ham and eggs." 
##                                           d2 
##                                           "" 
##                                           d3 
##                          "Once upon a time." 
##                                           d4 
##                                           NA
corp
## Corpus consisting of 4 documents.
## d1 :
## "The cat in the hat ate green ham and eggs."
## 
## d2 :
## ""
## 
## d3 :
## "Once upon a time."
## 
## d4 :
## ""

corp %>%
  corpus_reshape(to = "sentences") %>%
  corpus_reshape(to = "documents")
## Corpus consisting of 2 documents.
## d1 :
## "The cat in the hat ate green ham and eggs."
## 
## d3 :
## "Once upon a time."

char_trim(txt, what = "documents", min_ntoken = 0)
## Warning: NA is replaced by empty string
##                                           d1 
## "The cat in the hat ate green ham and eggs." 
##                                           d2 
##                                           "" 
##                                           d3 
##                          "Once upon a time." 
##                                           d4 
##                                           ""
char_trim(txt, what = "documents", min_ntoken = 1)
## Warning: NA is replaced by empty string
##                                           d1 
## "The cat in the hat ate green ham and eggs." 
##                                           d3 
##                          "Once upon a time."

char_trim(txt, what = "sentences", min_ntoken = 0)
## Warning: NA is replaced by empty string
##                                           d1 
## "The cat in the hat ate green ham and eggs." 
##                                           d3 
##                          "Once upon a time."
char_trim(txt, what = "sentences", min_ntoken = 1)
## Warning: NA is replaced by empty string
##                                           d1 
## "The cat in the hat ate green ham and eggs." 
##                                           d3 
##                          "Once upon a time."
char_trim(txt, what = "sentences", min_ntoken = 5)
## Warning: NA is replaced by empty string
##                                           d1 
## "The cat in the hat ate green ham and eggs."

corpus_trim(corp, what = "sentences", min_ntoken = 0)
## Corpus consisting of 2 documents.
## d1 :
## "The cat in the hat ate green ham and eggs."
## 
## d3 :
## "Once upon a time."
corpus_trim(corp, what = "sentences", min_ntoken = 1)
## Corpus consisting of 2 documents.
## d1 :
## "The cat in the hat ate green ham and eggs."
## 
## d3 :
## "Once upon a time."
corpus_trim(corp, what = "sentences", min_ntoken = 5)
## Corpus consisting of 1 document.
## d1 :
## "The cat in the hat ate green ham and eggs."
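
One way to picture why d2 and d4 vanish, as a base-R analogy only: it does not call quanteda and is not meant to mirror its internals. The vectors `sentences`, `parent` and `all_docs` are made-up stand-ins for the sentence-level units and their parent document ids. Regrouping over only the ids that survive loses the emptied documents, while regrouping over a factor that keeps every original level retains them as empty strings.

```r
# Base-R analogy only; hypothetical stand-ins, not quanteda objects.
# Sentence-level units that survive, each tagged with its parent document.
sentences <- c("The cat in the hat ate green ham and eggs.", "Once upon a time.")
parent <- c("d1", "d3")
all_docs <- c("d1", "d2", "d3", "d4")

# Regrouping by the ids that happen to survive: d2 and d4 are lost.
regrouped_lossy <- vapply(split(sentences, parent), paste,
                          character(1), collapse = " ")

# Regrouping over a factor that keeps every original level:
# d2 and d4 come back as empty documents.
parent_f <- factor(parent, levels = all_docs)
regrouped_full <- vapply(split(sentences, parent_f), paste,
                         character(1), collapse = " ")

names(regrouped_lossy)
names(regrouped_full)
```

Whether empty documents should be kept in this way is exactly the policy question raised above; the sketch only shows that both outcomes follow from how the grouping variable is stored.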


koheiw · 2021-03-15T10:20:12Z

corpus_reshape() no longer forgets the docid, so you can now do:

require(quanteda)
#> Loading required package: quanteda
#> Package version: 2.9.9000
#> Unicode version: 13.0
#> ICU version: 66.1
#> Parallel computing: 6 of 6 threads used.
#> See https://quanteda.io for tutorials and examples.
txt <- c(
    d1 = "The cat in the hat ate green ham and eggs.",
    d2 = "",
    d3 = "Once upon a time.",
    d4 = NA
)

corp <- corpus(txt)
corp_sent <- corpus_reshape(corp)
docid(corpus_subset(corp_sent, nchar(corp_sent) > 0, drop_docid = FALSE))
#> [1] d1 d3
#> Levels: d1 d2 d3 d4
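
Because the unused levels are kept, the emptied documents can be recovered from the docid factor itself. A minimal sketch, assuming a factor shaped like the output above; `ids` is a hand-built stand-in so the snippet runs without quanteda, and in practice it would be the result of the docid() call shown above.

```r
# Stand-in for the docid() result printed above (hypothetical, built by hand).
ids <- factor(c("d1", "d3"), levels = c("d1", "d2", "d3", "d4"))

# Documents with no remaining sentences: levels that no longer occur.
dropped <- setdiff(levels(ids), as.character(ids))

# Sentence counts per original document, with zeros for emptied ones.
counts <- table(ids)

dropped
counts
```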

kbenoit added the corpus label Jul 2, 2020

kbenoit assigned kbenoit and koheiw Jul 2, 2020

koheiw mentioned this issue Jul 3, 2020

Fix corpus_reshape() #1980

Merged

koheiw mentioned this issue Aug 2, 2020

Add drop_docid = FALSE to *_subset() #1988

Closed

koheiw mentioned this issue Feb 1, 2021

Issue 1988 #2044

Merged

koheiw added a commit that referenced this issue Mar 15, 2021

Stop dropping levels to address #1978

45e30d6

koheiw mentioned this issue Mar 15, 2021

Issue 1978 #2084

Merged

kbenoit added this to the v3 release milestone Mar 15, 2021

kbenoit closed this as completed Mar 15, 2021
