Split on sentence and other boundaries #58

Closed
wants to merge 46 commits into from

@lmullen
Contributor

@lmullen lmullen commented Feb 25, 2015

This pull request fixes a problem with splitting on boundaries other than words. Currently, splitting on sentence boundaries returns a list with an empty character vector:

str_split("This is a sentence. So is this.", boundary("sentence"))
#> [[1]]
#> character(0)

The problem is that boundary() sets skip_word_none = TRUE by default. But if stringi::stri_split_boundaries() is called for any boundary type other than "word" with skip_word_none set to TRUE, it returns an empty character vector. For non-word boundaries, this fix sets skip_word_none to FALSE unless the user has deliberately chosen otherwise.

str_split("This is a sentence. So is this.", boundary("sentence"))
#> [[1]]
#> [1] "This is a sentence. " "So is this."

The PR adds tests for sentence splitting.

@gagolews
Contributor

@gagolews gagolews commented Feb 25, 2015

@lmullen that's right.

@@ -109,6 +109,9 @@ regex <- function(pattern, ignore_case = FALSE, multiline = FALSE,
boundary <- function(type = c("character", "line_break", "sentence", "word"),
skip_word_none = TRUE, ...) {
type <- match.arg(type)

if (type != "word" & missingArg(skip_word_none)) skip_word_none <- FALSE


@hadley

hadley Feb 25, 2015
Member

Maybe it would be better to make the default value of skip_word_none NA, and then do:

if (identical(skip_word_none, NA)) {
  skip_word_none <- type == "word"
}

This would also need some doc updates
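
A minimal sketch of the NA-default idea suggested above, not the code that was eventually merged. `boundary_sketch()` is a hypothetical stand-in for `boundary()`: it only resolves the default and returns the chosen options as a plain list, so it runs without stringr or stringi.

```r
# Hypothetical stand-in for boundary(): resolves skip_word_none and returns
# the options as a list instead of building a stringi boundary matcher.
boundary_sketch <- function(type = c("character", "line_break", "sentence", "word"),
                            skip_word_none = NA) {
  type <- match.arg(type)

  # NA means "not chosen by the caller": skip only when splitting on words.
  if (identical(skip_word_none, NA)) {
    skip_word_none <- type == "word"
  }

  list(type = type, skip_word_none = skip_word_none)
}

boundary_sketch("sentence")$skip_word_none                         # FALSE
boundary_sketch("word")$skip_word_none                             # TRUE
boundary_sketch("sentence", skip_word_none = TRUE)$skip_word_none  # TRUE
```

One possible advantage of the NA sentinel over a `missingArg()` check is that the "unset" state is visible in the signature and survives being passed along explicitly by a wrapper function.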

@dselivanov

@dselivanov dselivanov commented Oct 21, 2015

When will this be merged?

@hadley
Member

@hadley hadley commented Oct 30, 2015

@lmullen do you want to finish this off? It also needs a bullet point in NEWS

@lmullen
Contributor Author

@lmullen lmullen commented Oct 30, 2015

@hadley Sorry, I screwed up squashing the pull request. Mind if I resubmit this as a new, clean PR?

@hadley
Member

@hadley hadley commented Oct 30, 2015

Yeah, sure

@hadley hadley closed this Oct 30, 2015
@lmullen lmullen mentioned this pull request Oct 30, 2015