Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add stringIn1 parser #110

Merged
merged 14 commits into from
Dec 22, 2020
Merged

Add stringIn1 parser #110

merged 14 commits into from
Dec 22, 2020

Conversation

satabin
Copy link
Contributor

@satabin satabin commented Dec 11, 2020

This parser builds a radix tree from non empty string alternatives.
Radix trees are interesting as they allow for efficient parsing of
string alternatives (e.g. enum names) without backtracking and require
only 1 character lookahead.

String alternatives are first sorted and then grouped by common non
empty prefix. Grouped string are then recursively grouped with the same
process. In the resulting tree, each group of substrings start with a
different character, which can be used to decide which alternative
branch to take.

This operator is inspired by fastparse stringIn one, and requires all
alternatives to be non empty.

Thank you for contributing to cats-parse!

This is a kind reminder to run sbt prePR and commit the changed files, if any, before submitting.

@satabin
Copy link
Contributor Author

satabin commented Dec 11, 2020

Is it ok to depend on scala-collection-compat to solve the cross-compilation problem with 2.12?

@codecov-io
Copy link

codecov-io commented Dec 11, 2020

Codecov Report

Merging #110 (cc084d0) into main (454127f) will increase coverage by 0.42%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #110      +/-   ##
==========================================
+ Coverage   94.45%   94.88%   +0.42%     
==========================================
  Files           6        7       +1     
  Lines         758      821      +63     
  Branches       77       88      +11     
==========================================
+ Hits          716      779      +63     
  Misses         42       42              
Impacted Files Coverage Δ
core/shared/src/main/scala/cats/parse/Parser.scala 95.14% <100.00%> (+0.24%) ⬆️
...e/shared/src/main/scala/cats/parse/RadixNode.scala 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 454127f...cc084d0. Read the comment docs.

Copy link
Collaborator

@johnynek johnynek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! This is something I have wanted to get to.

I'd like to make two suggestions:

  1. Can we remove the stringIn1 API? It seems to me this can be an internal detail of how to combine Str parsers inside of oneOf much as we internally combine CharIn parsers.
  2. If you feel we should keep the stringIn API, I'm concerned that we don't seem to be deduplicating it anywhere. I'd rather accept Iterable, then make a distinct set and the if it has only one item just return Str. Also if we keep the API it should have a comment in particular describing the backtracking behavior of the method.
  3. Since this is a performance improvement, can you add a benchmark? Users now have two ways to do this: oneOf + string1 or stringIn1, or if we don't want to expose the API internally we still need to choose. We need to show it is faster to use the tree. I suspect it will be in some cases, but maybe not all cases especially when there are few and distinct key words. But a benchmark will help us see. See the current benchmark directory for examples.
  4. I have some suggestions on improving the performance of the matching loop but I'll wait till we have benchmarks so we can see if it is working.

Thanks for sending the PR.

@satabin
Copy link
Contributor Author

satabin commented Dec 11, 2020

Thanks for the PR! This is something I have wanted to get to.

I'd like to make two suggestions:

1. Can we remove the stringIn1 API? It seems to me this can be an internal detail of how to combine Str parsers inside of oneOf much as we internally combine CharIn parsers.

The current implementation heavily relies of the fact that it takes strings and not parsers. I can see if I can internalize it in oneOf1 but it seems a bit different in semantics to me currently: there is no need for .backtrack when several alternatives share a common prefix. I fear that working below oneOf will hide this from us, but I can give it a try.

2. If you feel we should keep the stringIn API, I'm concerned that we don't seem to be deduplicating it anywhere. I'd rather accept Iterable, then make a distinct set and the if it has only one item just return Str. Also if we keep the API it should have a comment in particular describing the backtracking behavior of the method.

Deduplication happens implicitly when building the tree, but I can make it explicit first, before I sort. I modeled the pre-conditio after the one from oneOf but I can make it smarter to return a Str in case there is only one in the end.

I will document the (lack of) backtracking in the API

3. Since this is a performance improvement, can you add a benchmark? Users now have two ways to do this: oneOf + string1 or stringIn1, or if we don't want to expose the API internally we still need to choose. We need to show it is faster to use the tree. I suspect it will be in some cases, but maybe not all cases especially when there are few and distinct key words. But a benchmark will help us see. See the current benchmark directory for examples.

I will add those.

4. I have some suggestions on improving the performance of the matching loop but I'll wait till we have benchmarks so we can see if it is working.

Sure, let's see how it can be further improved.

Thanks for sending the PR.

@satabin
Copy link
Contributor Author

satabin commented Dec 16, 2020

Hello @johnynek,

I reworked this PR and made the following changes:

  • added a benchmark for the stringIn1 implementation
  • made the operator smarter in case there is only 0 or 1 alternatives
  • rewrote the radix tree implementation to use an array map, as the benchmark showed the Map was a big bottleneck (which I was supposing, but the benchmark showed it clearly)

I still didn't merge it with oneOf1 since the semantics of this parser is a bit different. The order of the string doesn't matter and it parses the longest match, and not the first match. I prefer not to mix them.

I still have a bug to fix though, and I will add tests that cover it.

@satabin
Copy link
Contributor Author

satabin commented Dec 16, 2020

The PR is now ready , I think. I ran the benchmarks and here are the results:

[info] Benchmark                         Mode  Cnt   Score   Error  Units
[info] StringInBenchmarks.oneOfParse     avgt   25  38,604 ± 3,312  ns/op
[info] StringInBenchmarks.stringInParse  avgt   25  22,861 ± 0,162  ns/op

Copy link
Collaborator

@johnynek johnynek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks nice.

Can you post the results of the benchmark as you run them and update them? This helps us record also for the future if they are logged in the message of the PR.

bench/src/main/scala/cats/parse/bench/StringInBench.scala Outdated Show resolved Hide resolved
bench/src/main/scala/cats/parse/bench/StringInBench.scala Outdated Show resolved Hide resolved
core/shared/src/main/scala/cats/parse/Parser.scala Outdated Show resolved Hide resolved
core/shared/src/test/scala/cats/parse/ParserTest.scala Outdated Show resolved Hide resolved

import scala.annotation.tailrec

class RadixNode(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you are using an Array here to avoid the boxing of using a map. Another approach would be to use scala.collection.immutable.IntMap and lift the Char into an Int, and then IntMap won't box.

That will avoid the binary search, which might be a significant win.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, let's see if it improves. The Map method also has the overhead of creating an Option, I will compare to the current approach.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you have a chance to try the IntMap approach? I would think for larger sets it would be a big win, but I could be wrong.

core/shared/src/main/scala/cats/parse/Parser.scala Outdated Show resolved Hide resolved
}
}
if (lastMatch < 0) {
state.error = Chain.one(Expectation.OneStr(startOffset, all))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to generate the subtree that failed and the offset that we failed at. So, if we have foobar, foobaz, foofoo, and we try to parse fooquux, see would point to quux and say we expected bar, baz or foo. That would be a better error message I think.

What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like to see the strings as atoms which either match completely or not at all, so I like to see the error at the beginning of the atom. But I am fine either way in the end.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about it again, and a typical use case for this operator is to match enum-like names. As a user, I think I am more interested in knowing where it fails at the beginning of the matching of the names than somewhere in the middle of one of the possibilities.

For instance if I want to parse the name of a weekday, I'd have:

stringIn1(List("monday", "tuesday", "wednesday", "thursday", "firday", "saturday", "sunday"))

Let's say I made a type and wrote thrusday, it would feel weird to me to know that my possiblities where hursday and uesday.
Of course this means that the failure offset is closer to the matching error, but semantically speaking, the list in stringIn1 is what I, as a user, defined as my possibilities, and would expect to see in error messages.

What do you think?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that makes sense. I guess one thing I'm concerned about is this list getting very big and seeing a giant list on the typo. Like, in your example, I might say, I would expect Thursday or Tuesday, but I don't want to see Sunday in the list, personally. What do you think of that?

This parser builds a radix tree from non empty string alternatives.
Radix trees are interesting as they allow for efficient parsing of
string alternatives (e.g. enum names) without backtracking and require
only 1 character lookahead.

String alternatives are first sorted and then grouped by common non
empty prefix. Grouped string are then recursively grouped with the same
process. In the resulting tree, each group of substrings start with a
different character, which can be used to decide which alternative
branch to take.

This operator is inspired by fastparse `stringIn` one, and requires all
alternatives to be non empty.
@satabin
Copy link
Contributor Author

satabin commented Dec 17, 2020

I took your comments and feedback into account and pushed the new version. I kept the array based radix tree implementation due to the results of the benchmarks:

Arrays
Benchmark                         Mode  Cnt    Score   Error  Units
StringInBenchmarks.stringInParse  avgt   25  163,929 ± 4,582  ns/op

IntMap.get
Benchmark                         Mode  Cnt    Score   Error  Units
StringInBenchmarks.stringInParse  avgt   25  198,055 ± 3,762  ns/op

IntMap.contains + IntMap.apply
Benchmark                         Mode  Cnt    Score    Error  Units
StringInBenchmarks.stringInParse  avgt   25  231,887 ± 10,620  ns/op

oneOf
Benchmark                      Mode  Cnt    Score    Error  Units
StringInBenchmarks.oneOfParse  avgt   25  296,758 ± 22,775  ns/op

The IntMap implementation creates more objects (especially the Option for the result), or requires two lookups (contains+apply), which results in some overhead over the tree implementation.

Copy link
Collaborator

@johnynek johnynek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is coming along very nicely


val oneOf =
Parser.oneOf1(
List(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we use the style of using :: to construct the args:

oneOf1(
  Parser.string1("foobar") ::
  Parser.string1("foobaz") ::
  ...

This makes using the List more ergonomic since you are just replacing , with :: (and slightly more efficient since you don't need to use the varargs constructors.

@@ -501,6 +502,7 @@ object Parser extends ParserInstances {

object Expectation {
case class Str(offset: Int, str: String) extends Expectation
case class OneStr(offset: Int, strs: List[String]) extends Expectation
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this should be a Set[String] or SortedSet[String] actually, since the order does not matter.

Also, I wonder if we really need Str and OneStr, maybe we just combine them into one error case. Having too many error results makes it harder to present simple messages to the users.

*
* If no string matches, this parser results in an epsilon failure.
*/
def stringIn1(strings: List[String]): Parser1[Unit] =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since order does not matter here maybe we should take an Iterable[String] to make that clearer. List (or any Seq) tends to imply order does matter.

}
}
if (lastMatch < 0) {
state.error = Chain.one(Expectation.OneStr(startOffset, all))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that makes sense. I guess one thing I'm concerned about is this list getting very big and seeing a giant list on the typo. Like, in your example, I might say, I would expect Thursday or Tuesday, but I don't want to see Sunday in the list, personally. What do you think of that?

@@ -1796,6 +1853,15 @@ object Parser extends ParserInstances {
override def parseMut(state: State): A = oneOf(ary, state)
}

case class StringIn1(all: List[String]) extends Parser1[Unit] {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we should accept a SortedSet[String] here, which makes it structurally deterministic, and then we can skip having sorted, and just all.toList will be a sorted list we can pass to the RadixNode. What do you think?


import scala.annotation.tailrec

class RadixNode(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you have a chance to try the IntMap approach? I would think for larger sets it would be a big win, but I could be wrong.

@@ -581,6 +592,17 @@ class ParserTest extends munit.ScalaCheckSuite {
parseTest(Parser.product(fooP, barP), "foobar", ((), ()))
}

test("longest match stringIn1") {
parseTest(Parser.stringIn1(List("foo", "foobar", "foofoo", "foobat")).string, "foo", "foo")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use a val to not repeat the parser here? It looks like we repeat it 4 times.

if (s.nonEmpty && Parser.string1(s).parse(toParse).isRight)
assert(Parser.stringIn1(ss1).parse(toParse).isRight)
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't it also true that:

P.string1(s0).orElse(P.string1(s1)).parse(arg).toOption ==
  P.stringIn1(s0 :: s1 :: Nil).parse(arg).toOption

That is: if we union to string parsers we get the same as a stringIn?

Also, doesn't that also hold for two stringIn parsers?

P.stringIn1(s0).orElse(P.stringIn1(s1)).parse(arg).toOption ==
  P.stringIn1(s0 ::: s1).parse(arg).toOption

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I don't think these laws are true in general. I think they are only true if the first item in the orElse has no prefixes in the second. I think we can check for that in the law:

val noPrefix = left.forall { s => !right.exists(_.startsWith(s)) }

something like that.

finally, here is one more law:

forAll { (ss0: List[String], toParse: String) =>
  val ss = ss0.filterNot(_.isEmpty)
  val left = P.stringIn1(ss).parse(toParse).toOption
  val right = ss.filter(toParse.startsWith(_)).sortBy { s => -s.length }
  assertEquals(left, right.headOption)
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If s0 is a prefix of s1 then result is different when arg matches s1. The thing is that the radix tree approach really doesn't care about the order of the alternatives, whereas the orElse approach does and takes the first one that matches, eagerly.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's why I suggested the modification above. Maybe our comments crossed in flight.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I didn't se that when replying, my bad. I will add these laws.

@johnynek
Copy link
Collaborator

after you add the laws and merge main, I'm ready to merge this.

Thanks.

Note #116 is a pretty big change you need to be a bit careful with (Parser1 became Parser and Parser became Parser0)

@satabin
Copy link
Contributor Author

satabin commented Dec 21, 2020

after you add the laws and merge main, I'm ready to merge this.

Thanks.

Note #116 is a pretty big change you need to be a bit careful with (Parser1 became Parser and Parser became Parser0)

I didn't have much time this weekend, I'll try to finish this tomorrow, and rebase everything properly.

@satabin
Copy link
Contributor Author

satabin commented Dec 22, 2020

I think I took all your comments into account. For the IntMap approach, see the benchmark results.

@satabin
Copy link
Contributor Author

satabin commented Dec 22, 2020

These are the latest results of the benchmark with the current code

Benchmark                         Mode  Cnt    Score   Error  Units
StringInBenchmarks.oneOfParse     avgt   25  325,339 ± 7,546  ns/op
StringInBenchmarks.stringInParse  avgt   25  154,768 ± 3,157  ns/op

Copy link
Collaborator

@johnynek johnynek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really great!

Thanks for addressing all my comments. I really appreciate it.

This will really improve the library.

I plan to release in a day or two version 0.3.0 with this change.

@johnynek johnynek merged commit 378d838 into typelevel:main Dec 22, 2020
@satabin satabin deleted the string-in branch December 22, 2020 16:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants