When a document is reshaped to sentences, and then some operation removes all of the original document's sentences - such as in corpus_trim() - then reshaping to the document level removes that document. This seems like an unanticipated side effect, so if we want this to be policy, we should decide on that and document it rather than having it be something of a surprise.
I noticed this when answering this SO question affecting textstat_readability(). Since that function trims short sentences rather than scoring them, it was removing any document that consisted only of one short sentence. (This can happen a lot in tweets.) We fixed that in #1977 but made a note to revisit the underlying issue for a more careful and more fundamental fix.
library("quanteda")
## Package version: 2.1.0
## Parallel computing: 2 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
##     View
txt<- c(
d1="The cat in the hat ate green ham and eggs.",
d2="",
d3="Once upon a time.",
d4=NA
)
corp<- corpus(txt)
## Warning: NA is replaced by empty string
txt
## d1
## "The cat in the hat ate green ham and eggs."
## d2
## ""
## d3
## "Once upon a time."
## d4
## NA
corp
## Corpus consisting of 4 documents.
## d1 :
## "The cat in the hat ate green ham and eggs."
##
## d2 :
## ""
##
## d3 :
## "Once upon a time."
##
## d4 :
## ""
corp %>%
  corpus_reshape(to="sentences") %>%
  corpus_reshape(to="documents")
## Corpus consisting of 2 documents.
## d1 :
## "The cat in the hat ate green ham and eggs."
##
## d3 :
## "Once upon a time."
char_trim(txt, what="documents", min_ntoken=0)
## Warning: NA is replaced by empty string
## d1
## "The cat in the hat ate green ham and eggs."
## d2
## ""
## d3
## "Once upon a time."
## d4
## ""
char_trim(txt, what="documents", min_ntoken=1)
## Warning: NA is replaced by empty string
## d1
## "The cat in the hat ate green ham and eggs."
## d3
## "Once upon a time."
char_trim(txt, what="sentences", min_ntoken=0)
## Warning: NA is replaced by empty string
## d1
## "The cat in the hat ate green ham and eggs."
## d3
## "Once upon a time."
char_trim(txt, what="sentences", min_ntoken=1)
## Warning: NA is replaced by empty string
## d1
## "The cat in the hat ate green ham and eggs."
## d3
## "Once upon a time."
char_trim(txt, what="sentences", min_ntoken=5)
## Warning: NA is replaced by empty string
## d1
## "The cat in the hat ate green ham and eggs."
corpus_trim(corp, what="sentences", min_ntoken=0)
## Corpus consisting of 2 documents.
## d1 :
## "The cat in the hat ate green ham and eggs."
##
## d3 :
## "Once upon a time."
corpus_trim(corp, what="sentences", min_ntoken=1)
## Corpus consisting of 2 documents.
## d1 :
## "The cat in the hat ate green ham and eggs."
##
## d3 :
## "Once upon a time."
corpus_trim(corp, what="sentences", min_ntoken=5)
## Corpus consisting of 1 document.
## d1 :
## "The cat in the hat ate green ham and eggs."
corpus_reshape() does not forget docid now so you can do:
require(quanteda)
#> Loading required package: quanteda
#> Package version: 2.9.9000
#> Unicode version: 13.0
#> ICU version: 66.1
#> Parallel computing: 6 of 6 threads used.
#> See https://quanteda.io for tutorials and examples.
txt<- c(
d1="The cat in the hat ate green ham and eggs.",
d2="",
d3="Once upon a time.",
d4=NA
)
corp<- corpus(txt)
corp_sent<- corpus_reshape(corp)
docid(corpus_subset(corp_sent, nchar(corp_sent) >0, drop_docid=FALSE))
#> [1] d1 d3
#> Levels: d1 d2 d3 d4
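Because the factor levels still list all four original documents, the dropped ones can be recovered from the docid alone. The following is only a base-R sketch of that idea, not quanteda's implementation: `sent_text`, `sent_docid`, `dropped` and `doc_text` are hypothetical names, and the two vectors are hand-built stand-ins for the sentence-level corpus and its docid shown above.

```r
# Hypothetical stand-ins for the remaining sentences and their docid factor;
# no quanteda functions are called here.
sent_text <- c("The cat in the hat ate green ham and eggs.",
               "Once upon a time.")
sent_docid <- factor(c("d1", "d3"), levels = c("d1", "d2", "d3", "d4"))

# Documents that lost every sentence: levels with no remaining observations.
dropped <- setdiff(levels(sent_docid), as.character(sent_docid))

# Regroup sentences by document. split() keeps unused factor levels by
# default, so d2 and d4 come back as empty strings instead of vanishing.
doc_text <- vapply(split(sent_text, sent_docid), paste, character(1),
                   collapse = " ")
```

Whether a document-level reshape should restore such documents as empty texts or drop them is the policy question raised in this issue; the sketch only shows that the retained levels make either choice possible.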