Split on sentence and other boundaries #58

Closed
wants to merge 46 commits into from

@lmullen lmullen commented Feb 25, 2015

This pull request fixes a problem with splitting on boundaries other than words. Currently, splitting on sentence boundaries returns a list with an empty character vector:

str_split("This is a sentence. So is this.", boundary("sentence"))
#> [[1]]
#> character(0)

The problem is that boundary() sets skip_word_none = TRUE by default. When stringi::stri_split_boundaries() is called with skip_word_none = TRUE for any boundary type other than word, it returns an empty character vector. This fix sets skip_word_none to FALSE for non-word boundaries unless the user has explicitly set it, so the same call now returns both sentences:

str_split("This is a sentence. So is this.", boundary("sentence"))
#> [[1]]
#> [1] "This is a sentence. " "So is this."

The PR adds tests for sentence splitting.

@gagolews
Contributor

@lmullen that's right.

@@ -109,6 +109,9 @@ regex <- function(pattern, ignore_case = FALSE, multiline = FALSE,
boundary <- function(type = c("character", "line_break", "sentence", "word"),
skip_word_none = TRUE, ...) {
type <- match.arg(type)

if (type != "word" & missingArg(skip_word_none)) skip_word_none <- FALSE
Member

Maybe it would be better to make the default value of skip_word_none NA, and then do:

if (identical(skip_word_none, NA)) {
  skip_word_none <- type == "word"
}

This would also need some doc updates.
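
For illustration, the NA-default suggestion above could be sketched in isolation as follows. This is only a minimal sketch: `resolve_skip_word_none()` is a hypothetical helper written for this example, not part of stringr, and it returns only the resolved option rather than building a real boundary object.

```r
# Hypothetical helper (not stringr's boundary()): shows how an NA default
# for skip_word_none could be resolved from the boundary type.
resolve_skip_word_none <- function(type = c("character", "line_break",
                                            "sentence", "word"),
                                   skip_word_none = NA) {
  type <- match.arg(type)

  # NA means "the caller did not choose": skip only for word boundaries.
  if (identical(skip_word_none, NA)) {
    skip_word_none <- type == "word"
  }

  skip_word_none
}

resolve_skip_word_none("sentence")        # FALSE: default for non-word types
resolve_skip_word_none("word")            # TRUE: default for word boundaries
resolve_skip_word_none("sentence", TRUE)  # TRUE: explicit choice is kept
```

With an NA default, the "not chosen" state is visible in the function signature and can also be requested explicitly by passing NA.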

@dselivanov

When will this be merged?

@hadley
Member

hadley commented Oct 30, 2015

@lmullen do you want to finish this off? It also needs a bullet point in NEWS.

@lmullen
Contributor Author

lmullen commented Oct 30, 2015

@hadley Sorry, I screwed up squashing the pull request. Mind if I resubmit this as a new, clean PR?

@hadley
Member

hadley commented Oct 30, 2015

Yeah, sure

@hadley hadley closed this Oct 30, 2015
@lmullen lmullen mentioned this pull request Oct 30, 2015