> testCorpus <- corpus(c("First line\n##INTRO This is the introduction.
+     ##DOC1 This is the first document. Second sentence in Doc 1.
+     ##DOC3 Third document starts here. End of third document.",
+     "##INTRO Document ##NUMBER Two starts before ##NUMBER Three. ##END"))
> testCorpusSeg <- corpus_segment(testCorpus, "tags")
> summary(testCorpusSeg)
Corpus consisting of 8 documents.

    Text Types Tokens Sentences      tag
 text1.1     2      2         1  ##INTRO
 text1.2     5      5         1   ##DOC1
 text1.3    11     12         2   ##DOC3
 text1.4     8     10         2  ##INTRO
 text2.1     1      1         1 ##NUMBER
 text2.2     3      3         1 ##NUMBER
 text2.3     2      2         1    ##END
 text2.4     0      0         0     <NA>

Source:  /Users/kbenoit/Dropbox (Personal)/GitHub/quanteda/* on x86_64 by kbenoit
Created: Wed Apr 5 11:38:04 2017
Notes:   corpus_segment(corpus_segment.corpus) corpus_segment(testCorpus) corpus_segment(tags)
Also: the segmentation itself is fast, but retrieving the tags is very slow. A re-implementation would solve both issues: the misaligned tags in the summary above and the slow tag retrieval.
- Reimplements corpus_segment using stringi::stri_extract to get the tags, and catches all exceptions
- Improves char_segment
- Significant speed gains as well
- Fixes #634
- Adds tests
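
The idea behind the re-implementation can be illustrated with a minimal sketch. This is not quanteda's actual code: `segment_by_tag` is a hypothetical helper, the default pattern `##\\w+` is an assumption about what a tag looks like, and it uses `stri_extract_all_regex` and `stri_split_regex` from stringi. Splitting on the tag pattern always yields one more piece than there are tags, so prepending `NA` to the tag vector keeps each tag aligned with the text that follows it, even when untagged text precedes the first tag.

```r
library(stringi)

# Hypothetical helper (not part of quanteda): split one text on a tag pattern
# and pair every segment with the tag that introduces it.
segment_by_tag <- function(txt, pattern = "##\\w+") {
  # All tags in order of appearance; a single NA if there is no match
  tags <- stri_extract_all_regex(txt, pattern)[[1]]
  if (length(tags) == 1 && is.na(tags)) {
    return(data.frame(tag = NA_character_, text = stri_trim_both(txt),
                      stringsAsFactors = FALSE))
  }
  # Splitting on the same pattern gives length(tags) + 1 pieces;
  # the first piece is whatever precedes the first tag (possibly empty)
  segs <- stri_trim_both(stri_split_regex(txt, pattern)[[1]])
  out <- data.frame(tag = c(NA_character_, tags), text = segs,
                    stringsAsFactors = FALSE)
  # Drop only the untagged leading piece when it is empty
  out[!(is.na(out$tag) & out$text == ""), , drop = FALSE]
}

segment_by_tag("First line\n##INTRO This is the introduction.\n##DOC1 Doc one.")
```

In this sketch, text before the first tag is kept as its own segment with an `NA` tag rather than shifting every later tag by one position, and a trailing tag with no text after it is kept as an empty segment.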