
corpus_segment fails when second text ends with tag #634

@kbenoit

> testCorpus <- corpus(c("First line\n##INTRO This is the introduction.
+                            ##DOC1 This is the first document.  Second sentence in Doc 1.
+                        ##DOC3 Third document starts here.  End of third document.",
+                        "##INTRO Document ##NUMBER Two starts before ##NUMBER Three. ##END"))
> testCorpusSeg <- corpus_segment(testCorpus, "tags")
> summary(testCorpusSeg)
Corpus consisting of 8 documents.

    Text Types Tokens Sentences      tag
 text1.1     2      2         1  ##INTRO
 text1.2     5      5         1   ##DOC1
 text1.3    11     12         2   ##DOC3
 text1.4     8     10         2  ##INTRO
 text2.1     1      1         1 ##NUMBER
 text2.2     3      3         1 ##NUMBER
 text2.3     2      2         1    ##END
 text2.4     0      0         0     <NA>

Source:  /Users/kbenoit/Dropbox (Personal)/GitHub/quanteda/* on x86_64 by kbenoit
Created: Wed Apr  5 11:38:04 2017
Notes:   corpus_segment(corpus_segment.corpus)corpus_segment(testCorpus)corpus_segment(tags)

Also: the segmentation itself is fast, but retrieving the tags is very slow. A re-implementation would solve both issues: the failure on a trailing tag and the slow tag retrieval.
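To make the idea of a re-implementation concrete, here is a minimal base R sketch that finds the tags and cuts the segments in a single pass over each text. It is not quanteda's code: `segment_by_tags()` is a hypothetical helper, the default tag pattern is an assumption based on the example above, and dropping empty segments (so that a trailing tag such as `##END` produces no empty document) is an assumption about the desired behaviour.

```r
# Hypothetical helper, not part of quanteda: split texts on "##TAG" markers
# and keep each tag next to the segment that follows it.
segment_by_tags <- function(texts, pattern = "##[A-Za-z0-9]+") {
  # Assumed naming scheme: text1, text2, ... when the input is unnamed
  if (is.null(names(texts))) names(texts) <- paste0("text", seq_along(texts))
  pieces <- lapply(seq_along(texts), function(i) {
    nm  <- names(texts)[i]
    txt <- texts[[i]]
    m   <- gregexpr(pattern, txt)[[1]]
    if (m[1] == -1L) {
      # No tag at all: the whole text is one untagged segment
      tags <- NA_character_
      segs <- trimws(txt)
    } else {
      starts <- as.integer(m)
      ends   <- starts + attr(m, "match.length") - 1L
      # First segment is any text before the first tag (tag = NA);
      # every other segment runs from the end of a tag to the next tag
      tags <- c(NA_character_, substring(txt, starts, ends))
      segs <- trimws(substring(txt, c(1L, ends + 1L), c(starts - 1L, nchar(txt))))
    }
    # Assumption: empty segments (e.g. after a trailing tag) are dropped
    keep <- nzchar(segs)
    data.frame(docname = sprintf("%s.%d", nm, seq_len(sum(keep))),
               tag     = tags[keep],
               text    = segs[keep],
               stringsAsFactors = FALSE)
  })
  do.call(rbind, pieces)
}

texts <- c("First line\n##INTRO This is the introduction.
           ##DOC1 This is the first document.  Second sentence in Doc 1.
           ##DOC3 Third document starts here.  End of third document.",
           "##INTRO Document ##NUMBER Two starts before ##NUMBER Three. ##END")
segs <- segment_by_tags(texts)
```

Under these assumptions the two example texts give seven segments rather than eight, with each tag attached to the text that follows it and no empty `<NA>` document after the trailing `##END`. Whether a trailing tag should instead be kept as an empty, tagged document is a design choice this sketch does not settle.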
