> testCorpus <- corpus(c("First line\n##INTRO This is the introduction.
+ ##DOC1 This is the first document. Second sentence in Doc 1.
+ ##DOC3 Third document starts here. End of third document.",
+ "##INTRO Document ##NUMBER Two starts before ##NUMBER Three. ##END"))
> testCorpusSeg <- corpus_segment(testCorpus, "tags")
> summary(testCorpusSeg)
Corpus consisting of 8 documents.
    Text Types Tokens Sentences      tag
 text1.1     2      2         1  ##INTRO
 text1.2     5      5         1   ##DOC1
 text1.3    11     12         2   ##DOC3
 text1.4     8     10         2  ##INTRO
 text2.1     1      1         1 ##NUMBER
 text2.2     3      3         1 ##NUMBER
 text2.3     2      2         1    ##END
 text2.4     0      0         0     <NA>
Source: /Users/kbenoit/Dropbox (Personal)/GitHub/quanteda/* on x86_64 by kbenoit
Created: Wed Apr 5 11:38:04 2017
Notes: corpus_segment(corpus_segment.corpus)corpus_segment(testCorpus)corpus_segment(tags)
Also: the segmentation itself is fast, but retrieving the tags is very slow. A re-implementation would solve both issues.
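As a rough illustration of what such a re-implementation might look like, here is a minimal base-R sketch. It is not quanteda's code: `segment_by_tag()` and its `pattern` argument are hypothetical names, and it assumes the intended behaviour is that each tag is paired with the text that follows it, with any text before the first tag kept as an untagged (`NA`) segment. Tags and segments are extracted in a single pass with `gregexpr()` and `regmatches()`, so tag retrieval costs no more than the split itself.

```r
# Hypothetical helper, not part of quanteda: split each text on tags matching
# `pattern` and return one row per segment together with its tag.
segment_by_tag <- function(texts, pattern = "##[A-Z0-9]+") {
  out <- lapply(seq_along(texts), function(i) {
    x <- texts[i]
    m <- gregexpr(pattern, x, perl = TRUE)
    tags <- regmatches(x, m)[[1]]
    if (length(tags) == 0L) {
      # No tags at all: keep the whole text as one untagged segment.
      return(data.frame(doc = i, tag = NA_character_, text = trimws(x),
                        stringsAsFactors = FALSE))
    }
    # invert = TRUE returns the pieces between matches: length(tags) + 1
    # pieces, the first being whatever precedes the first tag.
    pieces <- trimws(regmatches(x, m, invert = TRUE)[[1]])
    res <- data.frame(doc = i, tag = tags, text = pieces[-1],
                      stringsAsFactors = FALSE)
    if (nzchar(pieces[1])) {
      # Assumption: text before the first tag becomes a segment with tag NA.
      res <- rbind(data.frame(doc = i, tag = NA_character_, text = pieces[1],
                              stringsAsFactors = FALSE), res)
    }
    res
  })
  do.call(rbind, out)
}

# Same two texts as in the example above.
texts <- c("First line\n##INTRO This is the introduction.
##DOC1 This is the first document. Second sentence in Doc 1.
##DOC3 Third document starts here. End of third document.",
           "##INTRO Document ##NUMBER Two starts before ##NUMBER Three. ##END")
seg <- segment_by_tag(texts)
print(seg[, c("doc", "tag")])
```

Under these assumptions the trailing `##END` tag yields an empty segment rather than being dropped; whether that is the desired behaviour is a design choice for the real implementation.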