-
Notifications
You must be signed in to change notification settings - Fork 31
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
String matcher #30
String matcher #30
Conversation
Most of the time when I'm using string scanner I am checking multiple patterns. In this case I usually have to join them in a single regex or I just check them with multiple elsif clauses. Instead, it would be nice to initialize a string scanner that could check multiple patterns and then return to me the value that I want. In this commit I introduce a `StringMatcher` object that does just that. For example: ``` require "strmatch" NumberToken = Struct.new(:value, keyword_init: true) StringToken = Struct.new(:value, keyword_init: true) matcher = StringMatcher.new("12ab34") matcher.match(/\d+/) { |value| NumberToken.new(value: value.to_i) } matcher.match(/[a-z]+/) { |value| StringToken.new(value:) } until matcher.eos? case matcher.select in NumberToken[value:] if value >= 20 puts "NUMBER: #{value.inspect} (big)" in NumberToken[value:] if value < 20 puts "NUMBER: #{value.inspect} (small)" in StringToken[value:] puts "STRING: #{value.inspect}" end end ``` Running the above example will produce: ``` NUMBER: 12 (small) STRING: "ab" NUMBER: 34 (big) ``` This is more ergonomic since you can now use pattern matching instead of a chain of if/elsif clauses.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand that your use case is one of use cases but it's not one of major use cases. How about creating a new gem instead of putting this to strscan?
One of major use cases is parsing a structured format such as CSV and LDIF. In this use case, there are some states in parsing. Patterns to be used for the next scan are depend on the current states:
def parse_column(scanner)
if scanner.scan(/"/)
# state: in quote
column = scanner.scan(/[^"]+/)
unless scanner.scan(/"/)
raise "end quote is missing"
end
else
# state: no quote
column = scanner.scan(/[^,\n]+/)
end
column
end
See also:
- https://github.com/ruby/csv/blob/master/lib/csv/parser.rb#L389
- https://github.com/activeldap/activeldap/blob/master/lib/active_ldap/ldif.rb#L107
BTW, I think the following API is better than the current API:
#match
->on
#select
->#match
#select_all
->#match_all
Because class name is StringMatcher
. I think that #match
should be used for the main method.
# is given, then an enumerator is returned. | ||
def select_all | ||
return to_enum(__method__) unless block_given? | ||
yield select until eos? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may cause infinite loop:
matcher = StringMatcher.new("1a2b3")
matcher.match(/\d/) { |value| value.to_i }
p matcher.select_all.to_a
patterns.find do |(pattern, block)| | ||
value = scan(pattern) | ||
break block.call(value) if value | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
each
and return
are better than find
and break
because this doesn't use return value of find
:
patterns.find do |(pattern, block)| | |
value = scan(pattern) | |
break block.call(value) if value | |
end | |
patterns.each do |(pattern, block)| | |
value = scan(pattern) | |
return block.call(value) if value | |
end | |
nil |
TBH, I am not convinced that this should be in the stdlib library. Though, what I can agree with, is that the frequent pattern of working with StringScanner is some branching depending on what's next in the stream. If it could've been done like this (impossible currently): case scanner
in match(/\d+/) => value
NumberToken.new(value: value.to_i)
in match(/[a-z]+/) => value
StringToken.new(value:)
#... This raises the question of pattern-matching limitation! Something like this might be imitated with the API like: scanner.branch(
/\d+/ => proc { |value| NumberToken.new(value: value.to_i) }
/[a-z]+/ => proc { |value| StringToken.new(value:) }
) ...which is an interesting possibility, but somewhat clumsy and unlike most other stdlib APIs. |
No worries, thank for the feedback. |
Most of the time when I'm using string scanner I am checking multiple patterns. In this case I usually have to join them in a single regex or I just check them with multiple elsif clauses. Instead, it would be nice to initialize a string scanner that could check multiple patterns and then return to me the value that I want. In this commit I introduce a
StringMatcher
object that does just that. For example:Running the above example will produce:
This is more ergonomic since you can now use pattern matching instead of a chain of if/elsif clauses.