Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String matcher #30

Closed
wants to merge 1 commit into from
Closed

String matcher #30

wants to merge 1 commit into from

Conversation

kddnewton
Copy link

Most of the time when I'm using string scanner I am checking multiple patterns. In this case I usually have to join them in a single regex or I just check them with multiple elsif clauses. Instead, it would be nice to initialize a string scanner that could check multiple patterns and then return to me the value that I want. In this commit I introduce a StringMatcher object that does just that. For example:

require "strmatch"

NumberToken = Struct.new(:value, keyword_init: true)
StringToken = Struct.new(:value, keyword_init: true)

matcher = StringMatcher.new("12ab34")
matcher.match(/\d+/) { |value| NumberToken.new(value: value.to_i) }
matcher.match(/[a-z]+/) { |value| StringToken.new(value:) }

until matcher.eos?
  case matcher.select
  in NumberToken[value:] if value >= 20
    puts "NUMBER: #{value.inspect} (big)"
  in NumberToken[value:] if value < 20
    puts "NUMBER: #{value.inspect} (small)"
  in StringToken[value:]
    puts "STRING: #{value.inspect}"
  end
end

Running the above example will produce:

NUMBER: 12 (small)
STRING: "ab"
NUMBER: 34 (big)

This is more ergonomic since you can now use pattern matching instead of a chain of if/elsif clauses.

Most of the time when I'm using string scanner I am checking multiple patterns. In this case I usually have to join them in a single regex or I just check them with multiple elsif clauses. Instead, it would be nice to initialize a string scanner that could check multiple patterns and then return to me the value that I want. In this commit I introduce a `StringMatcher` object that does just that. For example:

```
require "strmatch"

NumberToken = Struct.new(:value, keyword_init: true)
StringToken = Struct.new(:value, keyword_init: true)

matcher = StringMatcher.new("12ab34")
matcher.match(/\d+/) { |value| NumberToken.new(value: value.to_i) }
matcher.match(/[a-z]+/) { |value| StringToken.new(value:) }

until matcher.eos?
  case matcher.select
  in NumberToken[value:] if value >= 20
    puts "NUMBER: #{value.inspect} (big)"
  in NumberToken[value:] if value < 20
    puts "NUMBER: #{value.inspect} (small)"
  in StringToken[value:]
    puts "STRING: #{value.inspect}"
  end
end
```

Running the above example will produce:

```
NUMBER: 12 (small)
STRING: "ab"
NUMBER: 34 (big)
```

This is more ergonomic since you can now use pattern matching instead of a chain of if/elsif clauses.
Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that your use case is one of use cases but it's not one of major use cases. How about creating a new gem instead of putting this to strscan?

One of major use cases is parsing a structured format such as CSV and LDIF. In this use case, there are some states in parsing. Patterns to be used for the next scan are depend on the current states:

def parse_column(scanner)
  if scanner.scan(/"/)
    # state: in quote
    column = scanner.scan(/[^"]+/)
    unless scanner.scan(/"/)
      raise "end quote is missing"
    end
  else
    # state: no quote
    column = scanner.scan(/[^,\n]+/)
  end
  column
end

See also:

BTW, I think the following API is better than the current API:

  • #match -> on
  • #select -> #match
  • #select_all -> #match_all

Because class name is StringMatcher. I think that #match should be used for the main method.

# is given, then an enumerator is returned.
def select_all
return to_enum(__method__) unless block_given?
yield select until eos?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may cause infinite loop:

matcher = StringMatcher.new("1a2b3")
matcher.match(/\d/) { |value| value.to_i }
p matcher.select_all.to_a

Comment on lines +57 to +60
patterns.find do |(pattern, block)|
value = scan(pattern)
break block.call(value) if value
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

each and return are better than find and break because this doesn't use return value of find:

Suggested change
patterns.find do |(pattern, block)|
value = scan(pattern)
break block.call(value) if value
end
patterns.each do |(pattern, block)|
value = scan(pattern)
return block.call(value) if value
end
nil

@zverok
Copy link

zverok commented Feb 12, 2022

TBH, I am not convinced that this should be in the stdlib library.
It seems to handle one arbitrary approach to parsing, and it is not clear why the users shouldn't implement it themselves, as the implementation is relatively simple and not really coupled with StringScanner.

Though, what I can agree with, is that the frequent pattern of working with StringScanner is some branching depending on what's next in the stream. If it could've been done like this (impossible currently):

case scanner
in match(/\d+/) => value 
  NumberToken.new(value: value.to_i)
in match(/[a-z]+/) => value 
  StringToken.new(value:)
#...

This raises the question of pattern-matching limitation!

Something like this might be imitated with the API like:

scanner.branch(
  /\d+/ => proc { |value| NumberToken.new(value: value.to_i) }
  /[a-z]+/ => proc { |value| StringToken.new(value:) }
)

...which is an interesting possibility, but somewhat clumsy and unlike most other stdlib APIs.

@kddnewton
Copy link
Author

No worries, thank for the feedback.

@kddnewton kddnewton closed this Feb 16, 2022
@kddnewton kddnewton deleted the string-matcher branch February 16, 2022 15:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants