---
title: Change units of texts
weight: 30
draft: false
---
require(quanteda)

corpus_reshape() changes the unit of texts between documents, paragraphs and sentences. Because it records the original document identifiers, texts can be restored to their original unit even after the corpus has been modified by other functions.
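
As a minimal sketch of this round trip, the toy corpus below is split into sentences and then restored. The texts and the document names `d1` and `d2` are made up for illustration and are not part of the tutorial data.

```r
library(quanteda)

# Toy corpus; the texts and the names "d1" and "d2" are illustrative assumptions
toy_corp <- corpus(c(
    d1 = "This is the first sentence. This is the second sentence.",
    d2 = "This document has only one sentence."
))

# Split into sentences: each sentence becomes a document of its own
toy_sent <- corpus_reshape(toy_corp, to = "sentences")
ndoc(toy_sent)

# Restore the original unit from the recorded document identifiers
toy_docs <- corpus_reshape(toy_sent, to = "documents")
ndoc(toy_docs)
docnames(toy_docs)
```

Comparing docnames(toy_docs) with docnames(toy_corp) is a quick way to check which original documents the restored texts belong to.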

corp <- corpus(data_char_ukimmig2010)
ndoc(corp)
## [1] 9

Change the unit of texts to sentences.

sent_corp <- corpus_reshape(corp, 'sentences')
ndoc(sent_corp)
## [1] 207

Restore the original documents.

corp2 <- corpus_reshape(sent_corp, 'documents')
ndoc(corp2)
## [1] 9

If you apply corpus_subset() to sent_corp, you can keep only long sentences (10 or more tokens) and then restore the documents from the sentences that remain.

longsent_corp <- corpus_subset(sent_corp, ntoken(sent_corp) >= 10)
ndoc(longsent_corp)
## [1] 183
corp3 <- corpus_reshape(longsent_corp, 'documents')
ndoc(corp3)
## [1] 9
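
The restored corpus still has one text per original document, but each text should now contain only the sentences that passed the filter. One way to check this, sketched below, is to compare token counts per document before and after; the name `removed` is a hypothetical helper variable, and the exact counts depend on the installed quanteda version.

```r
library(quanteda)

# Rebuild the objects from the steps above so this block runs on its own
corp <- corpus(data_char_ukimmig2010)
sent_corp <- corpus_reshape(corp, to = "sentences")
longsent_corp <- corpus_subset(sent_corp, ntoken(sent_corp) >= 10)
corp3 <- corpus_reshape(longsent_corp, to = "documents")

# Token counts per document before and after dropping short sentences
ntoken(corp)
ntoken(corp3)

# Total number of tokens removed together with the short sentences
# ("removed" is a hypothetical name used only in this sketch)
removed <- sum(ntoken(corp)) - sum(ntoken(corp3))
removed
```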