corpus_segment fails when second text ends with tag #634

kbenoit · 2017-04-05T10:38:39Z

> testCorpus <- corpus(c("First line\n##INTRO This is the introduction.
+                            ##DOC1 This is the first document.  Second sentence in Doc 1.
+                        ##DOC3 Third document starts here.  End of third document.",
+                        "##INTRO Document ##NUMBER Two starts before ##NUMBER Three. ##END"))
> testCorpusSeg <- corpus_segment(testCorpus, "tags")
> summary(testCorpusSeg)
Corpus consisting of 8 documents.

    Text Types Tokens Sentences      tag
 text1.1     2      2         1  ##INTRO
 text1.2     5      5         1   ##DOC1
 text1.3    11     12         2   ##DOC3
 text1.4     8     10         2  ##INTRO
 text2.1     1      1         1 ##NUMBER
 text2.2     3      3         1 ##NUMBER
 text2.3     2      2         1    ##END
 text2.4     0      0         0     <NA>

Source:  /Users/kbenoit/Dropbox (Personal)/GitHub/quanteda/* on x86_64 by kbenoit
Created: Wed Apr  5 11:38:04 2017
Notes:   corpus_segment(corpus_segment.corpus)corpus_segment(testCorpus)corpus_segment(tags)

Also: the segmentation itself is fast, but retrieving the tags is very slow. A re-implementation would solve both issues: the failure shown above and the slow tag retrieval.

- Reimplements corpus_segment using stringi::stri_extract to get the tags, and catches all exceptions
- Improves char_segment
- Significant speed gains as well
- Fixes #634
- Adds tests
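
For readers who want to see the general idea, here is a minimal sketch of tag-based segmentation done in one vectorized pass with stringi. It is not the code from the commit: `segment_by_tag` is a hypothetical helper, and the tag pattern `##\\w+` is an assumption chosen to match the example texts above. Each segment is paired with the tag that precedes it, text before the first tag gets an `NA` tag, and a tag at the very end of a text yields a tagged but empty segment rather than an error.

```r
library(stringi)

# Hypothetical helper (not part of quanteda): split each text at tags matching
# `pattern` and pair every segment with the tag that precedes it.
segment_by_tag <- function(x, pattern = "##\\w+") {
  # All tags found in each text; character(0) when a text has no tag
  tags <- stri_extract_all_regex(x, pattern, omit_no_match = TRUE)
  # Text pieces around the tags; each text yields length(tags) + 1 pieces
  pieces <- stri_split_regex(x, pattern)

  segs <- Map(function(tg, pc, i) {
    # The first piece precedes any tag, so it has no tag of its own
    tg <- c(NA_character_, tg)
    pc <- stri_trim_both(pc)
    # Drop only the untagged leading piece when it is empty
    keep <- !(is.na(tg) & pc == "")
    data.frame(doc = rep(i, sum(keep)), tag = tg[keep], text = pc[keep],
               stringsAsFactors = FALSE)
  }, tags, pieces, seq_along(x))

  res <- do.call(rbind, segs)
  rownames(res) <- NULL
  res
}

# Shortened versions of the texts from the report; the second ends with a tag
x <- c("First line\n##INTRO This is the introduction.\n##DOC1 This is the first document.",
       "##INTRO Document ##NUMBER Two starts before ##NUMBER Three. ##END")
res <- segment_by_tag(x)
print(res)
```

Because tags and pieces come from the same pattern applied to the same text, they stay aligned within each document, which is the property the summary output above lacks.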

kbenoit self-assigned this Apr 5, 2017

kbenoit added the bug label Apr 5, 2017

kbenoit added a commit that referenced this issue Apr 5, 2017

Reimplement corpus_segment

9c746b7

kbenoit mentioned this issue Apr 5, 2017

Reimplement corpus_segment #636

Merged

kbenoit closed this as completed in #636 Apr 5, 2017
