Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support regex in topic protobuf mappings #584

Closed
wants to merge 5 commits into from

Conversation

upils
Copy link

@upils upils commented Dec 21, 2022

Fixing #405

This is still a work in progress (need more manual and unit tests) but If you find time you can validate you are OK with the direction I am heading.

I have concerns about performance if every message end up matching with the last regex in a long list. I may add a benchmark to get an idea of the difference between regex matching vs directly matching in the map (I expect a huge difference).

@CLAassistant
Copy link

CLAassistant commented Dec 21, 2022

CLA assistant check
All committers have signed the CLA.

@CLAassistant
Copy link

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@upils
Copy link
Author

upils commented Jan 9, 2023

I fixed merge conflicts. @weeco / @rikimaru0345 can you take a look on this?

@weeco
Copy link
Contributor

weeco commented Jan 11, 2023

Hey @upils ,
I share your concerns in regards to the performance. This is code executed in the hot-path if you search a topic's messages with search filters enabled. It wouldn't be uncommon to process ~100k records / seconds. I think there are solutions to this problem - such as caching the results for requested types etc.

@upils
Copy link
Author

upils commented Jan 11, 2023

@weeco thanks for your answer. I have now added matching topics in the strictMappingsByTopic (see https://github.com/redpanda-data/console/pull/584/files#diff-cb0859875b7a3634d6b559a35d024fc14ba558a7960fabbd6287d716423246a7R296) so a given topic never have to be searched twice in the regexMappingsByTopic.

I wrote a benchmark with a variable proportion of strict/regex mappings.

Before the fix, performances were degrading following the ratio of regex mappings:

goos: linux
goarch: amd64
pkg: github.com/redpanda-data/console/backend/pkg/proto
cpu: Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz
BenchmarkService_getMatchingMapping
BenchmarkService_getMatchingMapping/Only_strict_mappings
BenchmarkService_getMatchingMapping/Only_strict_mappings-8                  5998            186877 ns/op               8 B/op          0 allocs/op
BenchmarkService_getMatchingMapping/10%_regex_mappings
BenchmarkService_getMatchingMapping/10%_regex_mappings-8                     249           4464246 ns/op             451 B/op          2 allocs/op
BenchmarkService_getMatchingMapping/50%_regex_mappings
BenchmarkService_getMatchingMapping/50%_regex_mappings-8                      91          12583620 ns/op            2226 B/op         15 allocs/op
BenchmarkService_getMatchingMapping/90%_regex_mappings
BenchmarkService_getMatchingMapping/90%_regex_mappings-8                      58          17922738 ns/op            4674 B/op         41 allocs/op
PASS
ok      github.com/redpanda-data/console/backend/pkg/proto      5.692s

After the fix, it shows no significant difference:

goos: linux
goarch: amd64
pkg: github.com/redpanda-data/console/backend/pkg/proto
cpu: Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz
BenchmarkService_getMatchingMapping
BenchmarkService_getMatchingMapping/Only_strict_mappings
BenchmarkService_getMatchingMapping/Only_strict_mappings-8                  5724            180517 ns/op               8 B/op          0 allocs/op
BenchmarkService_getMatchingMapping/10%_regex_mappings
BenchmarkService_getMatchingMapping/10%_regex_mappings-8                    7711            150304 ns/op              12 B/op          0 allocs/op
BenchmarkService_getMatchingMapping/50%_regex_mappings
BenchmarkService_getMatchingMapping/50%_regex_mappings-8                    7566            147093 ns/op              30 B/op          0 allocs/op
BenchmarkService_getMatchingMapping/90%_regex_mappings
BenchmarkService_getMatchingMapping/90%_regex_mappings-8                    6926            157498 ns/op              42 B/op          0 allocs/op
PASS
ok      github.com/redpanda-data/console/backend/pkg/proto      6.286s

I may add more use cases to the benchmark if I think of some corner cases.

Copy link
Contributor

@smatvienko-tb smatvienko-tb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey! Your feature is awesome and I really need it for my deployment!

Here you can find the solution how to avoid performance degradation to get a green light for this feature!

It also will be nice to have a small example in the configuration mapping in the same project. Because it is the only source of information right behind the mapping data.

Have a good day!


var match bool
for _, rMapping := range s.regexMappingsByTopic {
match = rMapping.r.MatchString(topicName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a performance degradation for a non-matched topic.
Each time for non-matched topics, the algorithm iterate and match all known regex.

The maximum number of searches has to be estimated as total unique topic names O(N).

I suggest saving the nil result to the strict mapping s.strictMappingsByTopic[topicName] and checking on the nil output at the beginning of this function to avoid another loop searching by regex.

@smatvienko-tb
Copy link
Contributor

@upils , @weeco can we apprise this topic one final time?
It's almost done, guys!
I left my comment in the review on how to fix the complexity and keep the steam inside.
Can you help me?

@upils upils force-pushed the master branch 2 times, most recently from 233cfd9 to 4e3d357 Compare August 14, 2023 19:24
@brycefisher
Copy link
Contributor

@upils were you able to sign the Contributor License Agreement? I honestly can't tell from the bot 😅 Looks like once that is sorted and you get an approval, this PR is ready for prime time!

Copy link
Contributor

@smatvienko-tb smatvienko-tb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@upils , @brycefisher , @weeco
Please, take a look at my suggestions in the comments.
With real-world test case suggestions.
I hope it will help Redpanda Console to perform at maximum speed with less configuration required!

fsSvc *filesystem.Service
schemaSvc *schema.Service
strictMappingsByTopic map[string]config.ProtoTopicMapping
regexMappingsByTopic map[string]regexProtoTopicMapping
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are only iterated by regexMapping but never search in the map by regex definition. So the map data structure is not suited for this.

Another point is key-value ordering by range: the key order does not guarantee for map.
A single topic may match by more than one regex.
So I suggest iterating in the same order as mappings appear in config.

In Go, the order in which the range keyword iterates over the keys of a map is not specified and is not guaranteed to be the same from one iteration to the next.

That is why I suggest using the array or Slice.

	mappingsByTopic   map[string]config.ProtoTopicMapping
	mappingsRegex     []regexProtoTopicMapping

Tests

First test case for validation:

mappings:

      - topicName: tb_core.notifications.*
        isRegex: true
        valueProtoType: transport.ToCoreNotificationMsg
      - topicName: tb_core.*
        isRegex: true
        valueProtoType: transport.ToCoreMsg

inputs (first matches once, second matches for both):
tb_core.notifications.tb-core-0
tb_core.0

outputs:
transport.ToCoreNotificationMsg
transport.ToCoreMsg

Second test case (first regex always win):

mappings:

      - topicName: tb_core.*
        valueProtoType: transport.ToCoreMsg
      - topicName: tb_core.notifications.*
        valueProtoType: transport.ToCoreNotificationMsg

inputs (first matches once, second matches for both):
tb_core.notifications.tb-core-0
tb_core.0

outputs:
transport.ToCoreMsg
transport.ToCoreMsg

return mapping, nil
}

var match bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, validate the complexity with a simple test.
It is convenient to extract the method that iterates by range s.regexMappingsByTopic.
Let's assume that the method will have the name findMappingByRegex(topicName string) (mapping config.ProtoTopicMapping, err error)

Test case

Call getMatchingMapping for each topic 3 times. Verify exactly one method call findMappingByRegex

Mappings:


      - topicName: "tb_core.0"
        isRegex: false
        valueProtoType: transport.ToCoreMsg
      - topicName: "tb_core.notifications.*"
        isRegex: true
        valueProtoType: transport.ToCoreNotificationMsg

Inputs (strict, regex, none-mapping):
tb_core.0
tb_core.notifications.tb-core-0
tb_rule_engine.main.9

Verify call getMatchingMapping (times), call findMappingByRegex (times):
3, 0 strict
3, 1 regex
3, 1 no match

TopicName string `yaml:"topicName"`

// IsRegex indicates if the topicName must be handled as a regex and match multiple topic names
IsRegex bool `yaml:"isRegex"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After all suggestions...
Why can we not approach any topic name as a regex?
Why do we need to increase the complexity of the mapping?

Let's try to compile every topicName as regex.
For the Kafka name policy, it is completely fine!

If the topic name can not be compiled, let's put a warning to the logs and recognize such a topicName as a strict match to not break any exotic deployments.

@upils
Copy link
Author

upils commented Aug 16, 2023

@upils were you able to sign the Contributor License Agreement? I honestly can't tell from the bot 😅 Looks like once that is sorted and you get an approval, this PR is ready for prime time!

Yes, but I guess I messed up with the mail address I used in my commits. I will rewrite them and it should be good.

@smatvienko-tb
Copy link
Contributor

raising it up

@twmb
Copy link
Contributor

twmb commented Sep 19, 2023

@upils ping on rewriting the commit emails.

redpanda-data#1122)

* backend: check for connect clusters in new connect service startup to avoid confusing logs

* backend: improve check for connect cluster len

* backend: code whitespace cleanup
@upils
Copy link
Author

upils commented Feb 17, 2024

Hey @smatvienko-tb @twmb @brycefisher . Sorry for this long hiatus. I fixed the commit author and started to apply the changes suggested by @smatvienko-tb . I need to add test cases but let me know if this is what you had in mind.

Copy link
Contributor

@smatvienko-tb smatvienko-tb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@weeco
Copy link
Contributor

weeco commented Feb 22, 2024

I left two comments. Overall I'd like to suggest to change this PR so that regex strings are recognized if the input is wrapped into two slashes (see my other PR comment).

We can implement a custom type RegexpOrLiteral which implements the Stringer method and has a custom Unmarshaler, so that you can provide mappings like this:

mappings:
  - topicName: normal-exact-match
    keyProtoType: "foo"
  - topicName: /some-topic-prefix-matc.*/
    keyProtoType: "bar"

You can do so by creating a type such as:

type RegexpOrLiteral struct {
	literal string
	*regexp.Regexp
}

func (r *RegexpOrLiteral) String() string {
	if r.literal != "" {
		return r.literal
	}
	if r.Regexp != nil {
		return r.Regexp.String()
	}
	return ""
}

// UnmarshalText unmarshals json into a regexp.Regexp
func (r *RegexpOrLiteral) UnmarshalText(b []byte) error {
	regex, err := compileRegex(string(b))
	if err == nil {
		r.Regexp = regex
	}
	r.literal = string(b)
	return nil
}

// MarshalText marshals regexp.Regexp as string
func (r *RegexpOrLiteral) MarshalText() ([]byte, error) {
	if r.Regexp != nil {
		return []byte(r.Regexp.String()), nil
	}

	return []byte(r.literal), nil
}

Additionally koanf (our config management lib) needs to be configured so that it will try the TextUnmarshalling, by adding the mapstructure.TextUnmarshallerHookFunc(), option here:

mapstructure.StringToTimeDurationHookFunc()),

The compileRegex method already exists here and checks for the wrapped slashes:

func CompileRegex(expr string) (*regexp.Regexp, error) {
if strings.HasPrefix(expr, "/") && strings.HasSuffix(expr, "/") {
substr := expr[1 : len(expr)-1]
regex, err := regexp.Compile(substr)
if err != nil {
return nil, err
}
return regex, nil
}
// If this is no regex input (which is marked by the slashes around it) then we escape it so that it's a literal
regex, err := regexp.Compile("^" + expr + "$")
if err != nil {
return nil, err
}
return regex, nil
}
. By doing so you have the best of both worlds:

  • Ability to provide regexes while being backwards compatible (a single string can be provided for the topic name)
  • Regexes are compiled at unmarshal time so that we error early and users should clearly know whether regex is used or not

What do you think @upils @smatvienko-tb ?

Comment on lines -44 to 58
logger.Info("creating Kafka connect HTTP clients and testing connectivity to all clusters")
clientsByCluster := make(map[string]*ClientWithConfig)

if len(cfg.Clusters) == 0 {
return &Service{
Cfg: cfg,
Logger: logger,
ClientsByCluster: clientsByCluster,
Interceptor: interceptor.NewInterceptor(),
}, nil
}

// 1. Create a client for each configured Connect cluster
clientsByCluster := make(map[string]*ClientWithConfig)
logger.Info("creating Kafka connect HTTP clients and testing connectivity to all clusters")

for _, clusterCfg := range cfg.Clusters {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not related to this PR, I assume something went wrong when syncing your fork? Please fix this

r, err := regexp.Compile(mapping.TopicName)
if err != nil {
strictMappingsByTopic[mapping.TopicName] = mapping
logger.Warn("topic name handled as a strict match", zap.String("topic_name", mapping.TopicName))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So as of today where everyone use "strict matches" this would print one warning message for every single configured topic name. I think we need to change this. I agree that we want to be explicit about what's being used since the regexp compilation can fail.

I suggested something in my comment to overcome this issue

@brycefisher
Copy link
Contributor

Hi @upils and @weeco! Thanks so much for all the hardwork on implementing and reviewing this feature so far. Is there anything I can do to help push this feature over the line?

@twmb
Copy link
Contributor

twmb commented Mar 25, 2024

@brycefisher if you'd like to take this on, would you be able to fork and address suggestions from @weeco just above?

@brycefisher
Copy link
Contributor

Yes! Happy to put the finishing touches on this.

@weeco
Copy link
Contributor

weeco commented Apr 8, 2024

@upils @smatvienko-tb @brycefisher ping

@twmb
Copy link
Contributor

twmb commented May 6, 2024

We'll move this to unscheduled on our side, but please ping once y'all are able to uptake this and we can re-review.

@bojand
Copy link
Member

bojand commented Jun 19, 2024

This should be addressed now via #1291.

@bojand bojand closed this Jun 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants