Add `stringIn1` parser #110

satabin · 2020-12-11T10:46:22Z

This parser builds a radix tree from non empty string alternatives.
Radix trees are interesting as they allow for efficient parsing of
string alternatives (e.g. enum names) without backtracking and require
only 1 character lookahead.

String alternatives are first sorted and then grouped by common non
empty prefix. Grouped string are then recursively grouped with the same
process. In the resulting tree, each group of substrings start with a
different character, which can be used to decide which alternative
branch to take.

This operator is inspired by fastparse stringIn one, and requires all
alternatives to be non empty.

Thank you for contributing to cats-parse!

This is a kind reminder to run sbt prePR and commit the changed files, if any, before submitting.

satabin · 2020-12-11T10:56:36Z

Is it ok to depend on scala-collection-compat to solve the cross-compilation problem with 2.12?

codecov-io · 2020-12-11T11:04:40Z

Codecov Report

Merging #110 (cc084d0) into main (454127f) will increase coverage by 0.42%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #110      +/-   ##
==========================================
+ Coverage   94.45%   94.88%   +0.42%     
==========================================
  Files           6        7       +1     
  Lines         758      821      +63     
  Branches       77       88      +11     
==========================================
+ Hits          716      779      +63     
  Misses         42       42

Impacted Files	Coverage Δ
core/shared/src/main/scala/cats/parse/Parser.scala	`95.14% <100.00%> (+0.24%)`	⬆️
...e/shared/src/main/scala/cats/parse/RadixNode.scala	`100.00% <100.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 454127f...cc084d0. Read the comment docs.

johnynek

Thanks for the PR! This is something I have wanted to get to.

I'd like to make two suggestions:

Can we remove the stringIn1 API? It seems to me this can be an internal detail of how to combine Str parsers inside of oneOf much as we internally combine CharIn parsers.
If you feel we should keep the stringIn API, I'm concerned that we don't seem to be deduplicating it anywhere. I'd rather accept Iterable, then make a distinct set and the if it has only one item just return Str. Also if we keep the API it should have a comment in particular describing the backtracking behavior of the method.
Since this is a performance improvement, can you add a benchmark? Users now have two ways to do this: oneOf + string1 or stringIn1, or if we don't want to expose the API internally we still need to choose. We need to show it is faster to use the tree. I suspect it will be in some cases, but maybe not all cases especially when there are few and distinct key words. But a benchmark will help us see. See the current benchmark directory for examples.
I have some suggestions on improving the performance of the matching loop but I'll wait till we have benchmarks so we can see if it is working.

Thanks for sending the PR.

core/shared/src/main/scala/cats/parse/RadixNode.scala

satabin · 2020-12-11T16:56:35Z

Thanks for the PR! This is something I have wanted to get to.

I'd like to make two suggestions:
1. Can we remove the stringIn1 API? It seems to me this can be an internal detail of how to combine Str parsers inside of oneOf much as we internally combine CharIn parsers.

The current implementation heavily relies of the fact that it takes strings and not parsers. I can see if I can internalize it in oneOf1 but it seems a bit different in semantics to me currently: there is no need for .backtrack when several alternatives share a common prefix. I fear that working below oneOf will hide this from us, but I can give it a try.

2. If you feel we should keep the stringIn API, I'm concerned that we don't seem to be deduplicating it anywhere. I'd rather accept Iterable, then make a distinct set and the if it has only one item just return Str. Also if we keep the API it should have a comment in particular describing the backtracking behavior of the method.

Deduplication happens implicitly when building the tree, but I can make it explicit first, before I sort. I modeled the pre-conditio after the one from oneOf but I can make it smarter to return a Str in case there is only one in the end.

I will document the (lack of) backtracking in the API

3. Since this is a performance improvement, can you add a benchmark? Users now have two ways to do this: oneOf + string1 or stringIn1, or if we don't want to expose the API internally we still need to choose. We need to show it is faster to use the tree. I suspect it will be in some cases, but maybe not all cases especially when there are few and distinct key words. But a benchmark will help us see. See the current benchmark directory for examples.

I will add those.

4. I have some suggestions on improving the performance of the matching loop but I'll wait till we have benchmarks so we can see if it is working.

Sure, let's see how it can be further improved.

Thanks for sending the PR.

satabin · 2020-12-16T18:21:57Z

Hello @johnynek,

I reworked this PR and made the following changes:

added a benchmark for the stringIn1 implementation
made the operator smarter in case there is only 0 or 1 alternatives
rewrote the radix tree implementation to use an array map, as the benchmark showed the Map was a big bottleneck (which I was supposing, but the benchmark showed it clearly)

I still didn't merge it with oneOf1 since the semantics of this parser is a bit different. The order of the string doesn't matter and it parses the longest match, and not the first match. I prefer not to mix them.

I still have a bug to fix though, and I will add tests that cover it.

satabin · 2020-12-16T19:11:11Z

The PR is now ready , I think. I ran the benchmarks and here are the results:

[info] Benchmark                         Mode  Cnt   Score   Error  Units
[info] StringInBenchmarks.oneOfParse     avgt   25  38,604 ± 3,312  ns/op
[info] StringInBenchmarks.stringInParse  avgt   25  22,861 ± 0,162  ns/op

johnynek

This looks nice.

Can you post the results of the benchmark as you run them and update them? This helps us record also for the future if they are logged in the message of the PR.

bench/src/main/scala/cats/parse/bench/StringInBench.scala

core/shared/src/main/scala/cats/parse/Parser.scala

core/shared/src/test/scala/cats/parse/ParserTest.scala

johnynek · 2020-12-16T19:27:45Z

core/shared/src/main/scala/cats/parse/RadixNode.scala

+
+import scala.annotation.tailrec
+
+class RadixNode(


I see you are using an Array here to avoid the boxing of using a map. Another approach would be to use scala.collection.immutable.IntMap and lift the Char into an Int, and then IntMap won't box.

That will avoid the binary search, which might be a significant win.

Indeed, let's see if it improves. The Map method also has the overhead of creating an Option, I will compare to the current approach.

did you have a chance to try the IntMap approach? I would think for larger sets it would be a big win, but I could be wrong.

core/shared/src/main/scala/cats/parse/Parser.scala

johnynek · 2020-12-16T19:33:37Z

core/shared/src/main/scala/cats/parse/Parser.scala

+        }
+      }
+      if (lastMatch < 0) {
+        state.error = Chain.one(Expectation.OneStr(startOffset, all))


I think it would be better to generate the subtree that failed and the offset that we failed at. So, if we have foobar, foobaz, foofoo, and we try to parse fooquux, see would point to quux and say we expected bar, baz or foo. That would be a better error message I think.

What do you think?

I like to see the strings as atoms which either match completely or not at all, so I like to see the error at the beginning of the atom. But I am fine either way in the end.

I thought about it again, and a typical use case for this operator is to match enum-like names. As a user, I think I am more interested in knowing where it fails at the beginning of the matching of the names than somewhere in the middle of one of the possibilities.

For instance if I want to parse the name of a weekday, I'd have:

stringIn1(List("monday", "tuesday", "wednesday", "thursday", "firday", "saturday", "sunday"))

Let's say I made a type and wrote thrusday, it would feel weird to me to know that my possiblities where hursday and uesday.
Of course this means that the failure offset is closer to the matching error, but semantically speaking, the list in stringIn1 is what I, as a user, defined as my possibilities, and would expect to see in error messages.

What do you think?

I think that makes sense. I guess one thing I'm concerned about is this list getting very big and seeing a giant list on the typo. Like, in your example, I might say, I would expect Thursday or Tuesday, but I don't want to see Sunday in the list, personally. What do you think of that?

This parser builds a radix tree from non empty string alternatives. Radix trees are interesting as they allow for efficient parsing of string alternatives (e.g. enum names) without backtracking and require only 1 character lookahead. String alternatives are first sorted and then grouped by common non empty prefix. Grouped string are then recursively grouped with the same process. In the resulting tree, each group of substrings start with a different character, which can be used to decide which alternative branch to take. This operator is inspired by fastparse `stringIn` one, and requires all alternatives to be non empty.

satabin · 2020-12-17T18:27:55Z

I took your comments and feedback into account and pushed the new version. I kept the array based radix tree implementation due to the results of the benchmarks:

Arrays
Benchmark                         Mode  Cnt    Score   Error  Units
StringInBenchmarks.stringInParse  avgt   25  163,929 ± 4,582  ns/op

IntMap.get
Benchmark                         Mode  Cnt    Score   Error  Units
StringInBenchmarks.stringInParse  avgt   25  198,055 ± 3,762  ns/op

IntMap.contains + IntMap.apply
Benchmark                         Mode  Cnt    Score    Error  Units
StringInBenchmarks.stringInParse  avgt   25  231,887 ± 10,620  ns/op

oneOf
Benchmark                      Mode  Cnt    Score    Error  Units
StringInBenchmarks.oneOfParse  avgt   25  296,758 ± 22,775  ns/op

The IntMap implementation creates more objects (especially the Option for the result), or requires two lookups (contains+apply), which results in some overhead over the tree implementation.

johnynek

This is coming along very nicely

johnynek · 2020-12-17T18:26:22Z

bench/src/main/scala/cats/parse/bench/StringInBench.scala

+
+  val oneOf =
+    Parser.oneOf1(
+      List(


nit: can we use the style of using :: to construct the args:

oneOf1( Parser.string1("foobar") :: Parser.string1("foobaz") :: ...

This makes using the List more ergonomic since you are just replacing , with :: (and slightly more efficient since you don't need to use the varargs constructors.

johnynek · 2020-12-17T18:28:26Z

core/shared/src/main/scala/cats/parse/Parser.scala

@@ -501,6 +502,7 @@ object Parser extends ParserInstances {

  object Expectation {
    case class Str(offset: Int, str: String) extends Expectation
+    case class OneStr(offset: Int, strs: List[String]) extends Expectation


I wonder if this should be a Set[String] or SortedSet[String] actually, since the order does not matter.

Also, I wonder if we really need Str and OneStr, maybe we just combine them into one error case. Having too many error results makes it harder to present simple messages to the users.

johnynek · 2020-12-17T18:29:11Z

core/shared/src/main/scala/cats/parse/Parser.scala

+    *
+    * If no string matches, this parser results in an epsilon failure.
+    */
+  def stringIn1(strings: List[String]): Parser1[Unit] =


since order does not matter here maybe we should take an Iterable[String] to make that clearer. List (or any Seq) tends to imply order does matter.

johnynek · 2020-12-17T18:31:23Z

core/shared/src/main/scala/cats/parse/Parser.scala

+        }
+      }
+      if (lastMatch < 0) {
+        state.error = Chain.one(Expectation.OneStr(startOffset, all))


I think that makes sense. I guess one thing I'm concerned about is this list getting very big and seeing a giant list on the typo. Like, in your example, I might say, I would expect Thursday or Tuesday, but I don't want to see Sunday in the list, personally. What do you think of that?

core/shared/src/main/scala/cats/parse/Parser.scala

johnynek · 2020-12-17T18:33:30Z

core/shared/src/main/scala/cats/parse/Parser.scala

@@ -1796,6 +1853,15 @@ object Parser extends ParserInstances {
      override def parseMut(state: State): A = oneOf(ary, state)
    }

+    case class StringIn1(all: List[String]) extends Parser1[Unit] {


maybe we should accept a SortedSet[String] here, which makes it structurally deterministic, and then we can skip having sorted, and just all.toList will be a sorted list we can pass to the RadixNode. What do you think?

johnynek · 2020-12-17T18:35:53Z

core/shared/src/main/scala/cats/parse/RadixNode.scala

+
+import scala.annotation.tailrec
+
+class RadixNode(


did you have a chance to try the IntMap approach? I would think for larger sets it would be a big win, but I could be wrong.

johnynek · 2020-12-17T18:36:44Z

core/shared/src/test/scala/cats/parse/ParserTest.scala

@@ -581,6 +592,17 @@ class ParserTest extends munit.ScalaCheckSuite {
    parseTest(Parser.product(fooP, barP), "foobar", ((), ()))
  }

+  test("longest match stringIn1") {
+    parseTest(Parser.stringIn1(List("foo", "foobar", "foofoo", "foobat")).string, "foo", "foo")


can we use a val to not repeat the parser here? It looks like we repeat it 4 times.

johnynek · 2020-12-17T18:41:01Z

core/shared/src/test/scala/cats/parse/ParserTest.scala

+      if (s.nonEmpty && Parser.string1(s).parse(toParse).isRight)
+        assert(Parser.stringIn1(ss1).parse(toParse).isRight)
+    }
+  }


isn't it also true that:

P.string1(s0).orElse(P.string1(s1)).parse(arg).toOption == P.stringIn1(s0 :: s1 :: Nil).parse(arg).toOption

That is: if we union to string parsers we get the same as a stringIn?

Also, doesn't that also hold for two stringIn parsers?

P.stringIn1(s0).orElse(P.stringIn1(s1)).parse(arg).toOption == P.stringIn1(s0 ::: s1).parse(arg).toOption

Okay, I don't think these laws are true in general. I think they are only true if the first item in the orElse has no prefixes in the second. I think we can check for that in the law:

val noPrefix = left.forall { s => !right.exists(_.startsWith(s)) }

something like that.

finally, here is one more law:

forAll { (ss0: List[String], toParse: String) => val ss = ss0.filterNot(_.isEmpty) val left = P.stringIn1(ss).parse(toParse).toOption val right = ss.filter(toParse.startsWith(_)).sortBy { s => -s.length } assertEquals(left, right.headOption) }

If s0 is a prefix of s1 then result is different when arg matches s1. The thing is that the radix tree approach really doesn't care about the order of the alternatives, whereas the orElse approach does and takes the first one that matches, eagerly.

Yes, that's why I suggested the modification above. Maybe our comments crossed in flight.

Oh I didn't se that when replying, my bad. I will add these laws.

johnynek · 2020-12-21T18:50:12Z

after you add the laws and merge main, I'm ready to merge this.

Thanks.

Note #116 is a pretty big change you need to be a bit careful with (Parser1 became Parser and Parser became Parser0)

satabin · 2020-12-21T23:00:27Z

after you add the laws and merge main, I'm ready to merge this.

Thanks.

Note #116 is a pretty big change you need to be a bit careful with (Parser1 became Parser and Parser became Parser0)

I didn't have much time this weekend, I'll try to finish this tomorrow, and rebase everything properly.

satabin · 2020-12-22T13:48:28Z

I think I took all your comments into account. For the IntMap approach, see the benchmark results.

satabin · 2020-12-22T14:06:19Z

These are the latest results of the benchmark with the current code

Benchmark                         Mode  Cnt    Score   Error  Units
StringInBenchmarks.oneOfParse     avgt   25  325,339 ± 7,546  ns/op
StringInBenchmarks.stringInParse  avgt   25  154,768 ± 3,157  ns/op

johnynek

This looks really great!

Thanks for addressing all my comments. I really appreciate it.

This will really improve the library.

I plan to release in a day or two version 0.3.0 with this change.

satabin force-pushed the string-in branch from 04dd432 to c2d6979 Compare December 11, 2020 10:57

johnynek reviewed Dec 11, 2020

View reviewed changes

core/shared/src/main/scala/cats/parse/RadixNode.scala Outdated Show resolved Hide resolved

satabin force-pushed the string-in branch from 929775f to 631e9e6 Compare December 16, 2020 18:41

satabin requested a review from johnynek December 16, 2020 19:11

johnynek requested changes Dec 16, 2020

View reviewed changes

satabin added 9 commits December 17, 2020 19:14

Implement radix tree with array map

a4db928

Add basic stringIn1 benchmark

161a422

Fix longest match semantics

230ef69

Avoid building tuples

499810e

Add longest match test

82f80f1

Fix formatting

3e77536

Benchmark more cases

6353391

Take review comments into account

f10a5a9

satabin force-pushed the string-in branch from f84b401 to 039c285 Compare December 17, 2020 18:15

Add some StringIn1 laws

cc084d0

satabin force-pushed the string-in branch from 039c285 to cc084d0 Compare December 17, 2020 18:27

satabin requested a review from johnynek December 17, 2020 18:28

johnynek requested changes Dec 17, 2020

View reviewed changes

satabin added 2 commits December 22, 2020 14:10

Merge branch 'main' into string-in

923763e

Unify Str and OneOfStr errors

54adec5

satabin added 2 commits December 22, 2020 14:39

Make use of sorted set internally

409f367

Add more laws

7d90c18

satabin requested a review from johnynek December 22, 2020 13:48

johnynek approved these changes Dec 22, 2020

View reviewed changes

johnynek merged commit 378d838 into typelevel:main Dec 22, 2020

satabin deleted the string-in branch December 22, 2020 16:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add `stringIn1` parser #110

Add `stringIn1` parser #110

satabin commented Dec 11, 2020 •

edited

Loading

satabin commented Dec 11, 2020

codecov-io commented Dec 11, 2020 •

edited

Loading

johnynek left a comment

satabin commented Dec 11, 2020

satabin commented Dec 16, 2020

satabin commented Dec 16, 2020

johnynek left a comment

johnynek Dec 16, 2020

satabin Dec 16, 2020

johnynek Dec 17, 2020

johnynek Dec 16, 2020

satabin Dec 16, 2020

satabin Dec 17, 2020

johnynek Dec 17, 2020

satabin commented Dec 17, 2020

johnynek left a comment

johnynek Dec 17, 2020

johnynek Dec 17, 2020

johnynek Dec 17, 2020

johnynek Dec 17, 2020

johnynek Dec 17, 2020

johnynek Dec 17, 2020

johnynek Dec 17, 2020

johnynek Dec 17, 2020

johnynek Dec 17, 2020

satabin Dec 17, 2020

johnynek Dec 17, 2020

satabin Dec 17, 2020

johnynek commented Dec 21, 2020

satabin commented Dec 21, 2020

satabin commented Dec 22, 2020

satabin commented Dec 22, 2020

johnynek left a comment

Add stringIn1 parser #110

Add stringIn1 parser #110

Conversation

satabin commented Dec 11, 2020 • edited Loading

satabin commented Dec 11, 2020

codecov-io commented Dec 11, 2020 • edited Loading

Codecov Report

johnynek left a comment

Choose a reason for hiding this comment

satabin commented Dec 11, 2020

satabin commented Dec 16, 2020

satabin commented Dec 16, 2020

johnynek left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

satabin commented Dec 17, 2020

johnynek left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

johnynek commented Dec 21, 2020

satabin commented Dec 21, 2020

satabin commented Dec 22, 2020

satabin commented Dec 22, 2020

johnynek left a comment

Choose a reason for hiding this comment

Add `stringIn1` parser #110

Add `stringIn1` parser #110

satabin commented Dec 11, 2020 •

edited

Loading

codecov-io commented Dec 11, 2020 •

edited

Loading