-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add stringIn1
parser
#110
Add stringIn1
parser
#110
Conversation
Is it ok to depend on |
Codecov Report
@@ Coverage Diff @@
## main #110 +/- ##
==========================================
+ Coverage 94.45% 94.88% +0.42%
==========================================
Files 6 7 +1
Lines 758 821 +63
Branches 77 88 +11
==========================================
+ Hits 716 779 +63
Misses 42 42
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! This is something I have wanted to get to.
I'd like to make two suggestions:
- Can we remove the stringIn1 API? It seems to me this can be an internal detail of how to combine Str parsers inside of oneOf much as we internally combine CharIn parsers.
- If you feel we should keep the stringIn API, I'm concerned that we don't seem to be deduplicating it anywhere. I'd rather accept Iterable, then make a distinct set and the if it has only one item just return Str. Also if we keep the API it should have a comment in particular describing the backtracking behavior of the method.
- Since this is a performance improvement, can you add a benchmark? Users now have two ways to do this: oneOf + string1 or stringIn1, or if we don't want to expose the API internally we still need to choose. We need to show it is faster to use the tree. I suspect it will be in some cases, but maybe not all cases especially when there are few and distinct key words. But a benchmark will help us see. See the current benchmark directory for examples.
- I have some suggestions on improving the performance of the matching loop but I'll wait till we have benchmarks so we can see if it is working.
Thanks for sending the PR.
The current implementation heavily relies of the fact that it takes strings and not parsers. I can see if I can internalize it in
Deduplication happens implicitly when building the tree, but I can make it explicit first, before I sort. I modeled the pre-conditio after the one from I will document the (lack of) backtracking in the API
I will add those.
Sure, let's see how it can be further improved.
|
Hello @johnynek, I reworked this PR and made the following changes:
I still didn't merge it with I still have a bug to fix though, and I will add tests that cover it. |
The PR is now ready , I think. I ran the benchmarks and here are the results:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks nice.
Can you post the results of the benchmark as you run them and update them? This helps us record also for the future if they are logged in the message of the PR.
|
||
import scala.annotation.tailrec | ||
|
||
class RadixNode( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see you are using an Array here to avoid the boxing of using a map. Another approach would be to use scala.collection.immutable.IntMap and lift the Char into an Int, and then IntMap won't box.
That will avoid the binary search, which might be a significant win.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, let's see if it improves. The Map
method also has the overhead of creating an Option
, I will compare to the current approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did you have a chance to try the IntMap approach? I would think for larger sets it would be a big win, but I could be wrong.
} | ||
} | ||
if (lastMatch < 0) { | ||
state.error = Chain.one(Expectation.OneStr(startOffset, all)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be better to generate the subtree that failed and the offset that we failed at. So, if we have foobar, foobaz, foofoo, and we try to parse fooquux, see would point to quux and say we expected bar, baz or foo. That would be a better error message I think.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like to see the strings as atoms which either match completely or not at all, so I like to see the error at the beginning of the atom. But I am fine either way in the end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about it again, and a typical use case for this operator is to match enum-like names. As a user, I think I am more interested in knowing where it fails at the beginning of the matching of the names than somewhere in the middle of one of the possibilities.
For instance if I want to parse the name of a weekday, I'd have:
stringIn1(List("monday", "tuesday", "wednesday", "thursday", "firday", "saturday", "sunday"))
Let's say I made a type and wrote thrusday
, it would feel weird to me to know that my possiblities where hursday
and uesday
.
Of course this means that the failure offset is closer to the matching error, but semantically speaking, the list in stringIn1
is what I, as a user, defined as my possibilities, and would expect to see in error messages.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that makes sense. I guess one thing I'm concerned about is this list getting very big and seeing a giant list on the typo. Like, in your example, I might say, I would expect Thursday or Tuesday, but I don't want to see Sunday in the list, personally. What do you think of that?
This parser builds a radix tree from non empty string alternatives. Radix trees are interesting as they allow for efficient parsing of string alternatives (e.g. enum names) without backtracking and require only 1 character lookahead. String alternatives are first sorted and then grouped by common non empty prefix. Grouped string are then recursively grouped with the same process. In the resulting tree, each group of substrings start with a different character, which can be used to decide which alternative branch to take. This operator is inspired by fastparse `stringIn` one, and requires all alternatives to be non empty.
I took your comments and feedback into account and pushed the new version. I kept the array based radix tree implementation due to the results of the benchmarks:
The |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is coming along very nicely
|
||
val oneOf = | ||
Parser.oneOf1( | ||
List( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can we use the style of using ::
to construct the args:
oneOf1(
Parser.string1("foobar") ::
Parser.string1("foobaz") ::
...
This makes using the List more ergonomic since you are just replacing , with :: (and slightly more efficient since you don't need to use the varargs constructors.
@@ -501,6 +502,7 @@ object Parser extends ParserInstances { | |||
|
|||
object Expectation { | |||
case class Str(offset: Int, str: String) extends Expectation | |||
case class OneStr(offset: Int, strs: List[String]) extends Expectation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this should be a Set[String] or SortedSet[String] actually, since the order does not matter.
Also, I wonder if we really need Str and OneStr, maybe we just combine them into one error case. Having too many error results makes it harder to present simple messages to the users.
* | ||
* If no string matches, this parser results in an epsilon failure. | ||
*/ | ||
def stringIn1(strings: List[String]): Parser1[Unit] = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since order does not matter here maybe we should take an Iterable[String] to make that clearer. List (or any Seq) tends to imply order does matter.
} | ||
} | ||
if (lastMatch < 0) { | ||
state.error = Chain.one(Expectation.OneStr(startOffset, all)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that makes sense. I guess one thing I'm concerned about is this list getting very big and seeing a giant list on the typo. Like, in your example, I might say, I would expect Thursday or Tuesday, but I don't want to see Sunday in the list, personally. What do you think of that?
@@ -1796,6 +1853,15 @@ object Parser extends ParserInstances { | |||
override def parseMut(state: State): A = oneOf(ary, state) | |||
} | |||
|
|||
case class StringIn1(all: List[String]) extends Parser1[Unit] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we should accept a SortedSet[String] here, which makes it structurally deterministic, and then we can skip having sorted, and just all.toList will be a sorted list we can pass to the RadixNode. What do you think?
|
||
import scala.annotation.tailrec | ||
|
||
class RadixNode( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did you have a chance to try the IntMap approach? I would think for larger sets it would be a big win, but I could be wrong.
@@ -581,6 +592,17 @@ class ParserTest extends munit.ScalaCheckSuite { | |||
parseTest(Parser.product(fooP, barP), "foobar", ((), ())) | |||
} | |||
|
|||
test("longest match stringIn1") { | |||
parseTest(Parser.stringIn1(List("foo", "foobar", "foofoo", "foobat")).string, "foo", "foo") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we use a val to not repeat the parser here? It looks like we repeat it 4 times.
if (s.nonEmpty && Parser.string1(s).parse(toParse).isRight) | ||
assert(Parser.stringIn1(ss1).parse(toParse).isRight) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't it also true that:
P.string1(s0).orElse(P.string1(s1)).parse(arg).toOption ==
P.stringIn1(s0 :: s1 :: Nil).parse(arg).toOption
That is: if we union to string parsers we get the same as a stringIn?
Also, doesn't that also hold for two stringIn parsers?
P.stringIn1(s0).orElse(P.stringIn1(s1)).parse(arg).toOption ==
P.stringIn1(s0 ::: s1).parse(arg).toOption
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I don't think these laws are true in general. I think they are only true if the first item in the orElse has no prefixes in the second. I think we can check for that in the law:
val noPrefix = left.forall { s => !right.exists(_.startsWith(s)) }
something like that.
finally, here is one more law:
forAll { (ss0: List[String], toParse: String) =>
val ss = ss0.filterNot(_.isEmpty)
val left = P.stringIn1(ss).parse(toParse).toOption
val right = ss.filter(toParse.startsWith(_)).sortBy { s => -s.length }
assertEquals(left, right.headOption)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If s0
is a prefix of s1
then result is different when arg
matches s1
. The thing is that the radix tree approach really doesn't care about the order of the alternatives, whereas the orElse
approach does and takes the first one that matches, eagerly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's why I suggested the modification above. Maybe our comments crossed in flight.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I didn't se that when replying, my bad. I will add these laws.
after you add the laws and merge main, I'm ready to merge this. Thanks. Note #116 is a pretty big change you need to be a bit careful with (Parser1 became Parser and Parser became Parser0) |
I didn't have much time this weekend, I'll try to finish this tomorrow, and rebase everything properly. |
I think I took all your comments into account. For the |
These are the latest results of the benchmark with the current code
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks really great!
Thanks for addressing all my comments. I really appreciate it.
This will really improve the library.
I plan to release in a day or two version 0.3.0 with this change.
This parser builds a radix tree from non empty string alternatives.
Radix trees are interesting as they allow for efficient parsing of
string alternatives (e.g. enum names) without backtracking and require
only 1 character lookahead.
String alternatives are first sorted and then grouped by common non
empty prefix. Grouped string are then recursively grouped with the same
process. In the resulting tree, each group of substrings start with a
different character, which can be used to decide which alternative
branch to take.
This operator is inspired by fastparse
stringIn
one, and requires allalternatives to be non empty.
Thank you for contributing to
cats-parse
!This is a kind reminder to run
sbt prePR
and commit the changed files, if any, before submitting.