Split on sentence and other boundaries #58

Closed
wants to merge 46 commits into
base: master
from

@lmullen
Contributor

lmullen commented Feb 25, 2015

This pull request fixes a problem with splitting on boundaries other than words. Currently, splitting on sentence boundaries returns a list with an empty character vector:

str_split("This is a sentence. So is this.", boundary("sentence"))
#> [[1]]
#> character(0)

The problem is that boundary() sets skip_word_none = TRUE by default. But if stringi::stri_split_boundaries() is called for any boundary other than word boundaries, and skip_word_none is set to TRUE, then it returns an empty character vector. For non-word boundaries, this fix sets skip_word_none to FALSE unless the user has deliberately chosen otherwise.

str_split("This is a sentence. So is this.", boundary("sentence"))
#> [[1]]
#> [1] "This is a sentence. " "So is this."
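
For illustration, the defaulting rule described above can be sketched as a standalone function. This is a minimal sketch, not the code in this PR: `boundary_missing_sketch` is a hypothetical name, it returns a plain list rather than a real stringr pattern object, and it uses base R's `missing()` and `&&` where the diff uses `missingArg()` and `&`.

```r
# Hypothetical stand-in for boundary(); returns a plain list so the
# resolved options can be inspected without stringr or stringi.
boundary_missing_sketch <- function(type = c("character", "line_break", "sentence", "word"),
                                    skip_word_none = TRUE) {
  type <- match.arg(type)
  # Only override the default when the caller did not supply skip_word_none
  # and the boundary type is something other than "word".
  if (type != "word" && missing(skip_word_none)) {
    skip_word_none <- FALSE
  }
  list(type = type, skip_word_none = skip_word_none)
}

boundary_missing_sketch("sentence")$skip_word_none        # FALSE (default overridden)
boundary_missing_sketch("word")$skip_word_none            # TRUE  (default kept)
boundary_missing_sketch("sentence", TRUE)$skip_word_none  # TRUE  (explicit choice kept)
```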

The PR adds tests for sentence splitting.

@gagolews

Contributor

gagolews commented Feb 25, 2015

@lmullen that's right.

@@ -109,6 +109,9 @@ regex <- function(pattern, ignore_case = FALSE, multiline = FALSE,
boundary <- function(type = c("character", "line_break", "sentence", "word"),
skip_word_none = TRUE, ...) {
type <- match.arg(type)
if (type != "word" & missingArg(skip_word_none)) skip_word_none <- FALSE

@hadley

hadley Feb 25, 2015

Member

Maybe it would be better to make the default value of skip_word_none NA, and then do:

if (identical(skip_word_none, NA)) {
  skip_word_none <- type == "word"
}

This would also need some doc updates
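
A minimal sketch of how the NA-default suggestion might look in a complete function, assuming nothing beyond base R. `boundary_na_sketch` is a hypothetical name and the return value is a plain list for inspection, not the object that stringr's boundary() actually returns.

```r
# Hypothetical stand-in for boundary() with NA as the "not specified" sentinel.
boundary_na_sketch <- function(type = c("character", "line_break", "sentence", "word"),
                               skip_word_none = NA) {
  type <- match.arg(type)
  # NA means the caller expressed no preference: skip only for word boundaries.
  if (identical(skip_word_none, NA)) {
    skip_word_none <- type == "word"
  }
  list(type = type, skip_word_none = skip_word_none)
}

boundary_na_sketch("sentence")$skip_word_none      # FALSE
boundary_na_sketch("word")$skip_word_none          # TRUE
boundary_na_sketch("word", FALSE)$skip_word_none   # FALSE (explicit choice kept)
```

Compared with checking whether the argument is missing, the NA sentinel makes the "decide from type" behaviour visible in the function signature, which is likely why it would call for documentation updates.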

@dselivanov

dselivanov commented Oct 21, 2015

When will this be merged?

@hadley

Member

hadley commented Oct 30, 2015

@lmullen do you want to finish this off? It also needs a bullet point in NEWS

@lmullen

Contributor

lmullen commented Oct 30, 2015

@hadley Sorry, I screwed up squashing the pull request. Mind if I resubmit this as a new, clean PR?

@hadley

Member

hadley commented Oct 30, 2015

Yeah, sure

hadley closed this Oct 30, 2015

lmullen referenced this pull request Oct 30, 2015: Fix sentence splitting #101 (Merged)
