corpus_segment fails when second text ends with tag #634

Closed
kbenoit opened this issue Apr 5, 2017 · 0 comments
kbenoit commented Apr 5, 2017

> testCorpus <- corpus(c("First line\n##INTRO This is the introduction.
+                            ##DOC1 This is the first document.  Second sentence in Doc 1.
+                        ##DOC3 Third document starts here.  End of third document.",
+                        "##INTRO Document ##NUMBER Two starts before ##NUMBER Three. ##END"))
> testCorpusSeg <- corpus_segment(testCorpus, "tags")
> summary(testCorpusSeg)
Corpus consisting of 8 documents.

    Text Types Tokens Sentences      tag
 text1.1     2      2         1  ##INTRO
 text1.2     5      5         1   ##DOC1
 text1.3    11     12         2   ##DOC3
 text1.4     8     10         2  ##INTRO
 text2.1     1      1         1 ##NUMBER
 text2.2     3      3         1 ##NUMBER
 text2.3     2      2         1    ##END
 text2.4     0      0         0     <NA>

Source:  /Users/kbenoit/Dropbox (Personal)/GitHub/quanteda/* on x86_64 by kbenoit
Created: Wed Apr  5 11:38:04 2017
Notes:   corpus_segment(corpus_segment.corpus)corpus_segment(testCorpus)corpus_segment(tags)

Also: the segmentation itself is fast, but retrieving the tags is very slow. A re-implementation would solve both the bug and the speed problem.

kbenoit self-assigned this Apr 5, 2017
kbenoit added the bug label Apr 5, 2017
kbenoit added a commit that referenced this issue Apr 5, 2017
- Reimplements corpus_segment using stringi::stri_extract to get the tags, and catches all exceptions
- Improves char_segment
- Significant speed gains as well
- Fixes #634
- Adds tests
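
For readers following the fix, here is a minimal sketch of the general idea: extract the tags and split the text with the same pattern, so that each segment stays paired with the tag that precedes it. This is not the code from the referenced commit. The helper name `segment_by_tag`, the pattern `##\\w+`, and the choice to drop segments with no text (such as the one after a trailing `##END`) are assumptions made for illustration. It uses `stringi::stri_extract_all_regex()`, `stringi::stri_split_regex()`, and `stringi::stri_trim_both()`.

```r
library(stringi)

# Hypothetical helper for illustration only; not quanteda's implementation.
# Splits each text on a tag pattern and pairs every segment with the tag
# that immediately precedes it.
segment_by_tag <- function(texts, pattern = "##\\w+") {
  pieces <- lapply(seq_along(texts), function(i) {
    x <- texts[[i]]
    tags <- stri_extract_all_regex(x, pattern)[[1]]
    segs <- stri_trim_both(stri_split_regex(x, pattern)[[1]])
    # stri_extract_all_regex() returns a single NA when nothing matches
    if (length(tags) == 1 && is.na(tags)) tags <- character(0)
    # Splitting on n tags gives n + 1 pieces; the first piece comes before
    # any tag, so it is paired with NA
    tags <- c(NA_character_, tags)[seq_along(segs)]
    # Assumption: a tag with no text after it does not create an empty document
    keep <- segs != ""
    data.frame(text = segs[keep],
               tag = tags[keep],
               docnum = rep(i, sum(keep)),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, pieces)
}

# The two texts from the report above
texts <- c("First line\n##INTRO This is the introduction.
            ##DOC1 This is the first document.  Second sentence in Doc 1.
            ##DOC3 Third document starts here.  End of third document.",
           "##INTRO Document ##NUMBER Two starts before ##NUMBER Three. ##END")
result <- segment_by_tag(texts)
print(result[, c("docnum", "tag", "text")])
```

In this sketch the untagged "First line" is kept with an `NA` tag instead of shifting every later tag by one position, and the second text no longer produces a zero-token segment. Whether untagged leading text should be kept or discarded is a design decision the sketch does not settle.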