
Use case: syncing to Elasticsearch/Solr #66

Open · kwakwaversal opened this issue on Aug 26, 2020 · 7 comments

Labels: question (Further information is requested)

Comments

@kwakwaversal

#40 (comment)

Can you elaborate a bit on your particular use-case? It will help with our planning and improvements. Feel free to start a new issue if I haven't correctly addressed your question here

Hi @kiwicopple. I'm currently in an exploratory stage, following a requirement to sync data from PostgreSQL to Solr or Elasticsearch, and seeing what's out there. I knew logical replication was the only way to go, but needed to explore existing solutions. I heard about Supabase Realtime a year ago, so I've been kicking the tyres to see what it could do.

My thinking was to subscribe to tables of interest and update or insert records in ES/Solr. Guaranteed delivery is obviously preferred for this. I don't like the idea of having to occasionally re-sync hundreds of thousands of emails that might have drifted out of sync.
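For concreteness, the forwarding loop I have in mind looks roughly like this. It's only a sketch, assuming the realtime-js v1 client; the server URL, channel topic, table, and Elasticsearch index are illustrative, and error handling is omitted:

```ts
import { Socket } from '@supabase/realtime-js'

// Connect to the realtime server (URL illustrative).
const socket = new Socket('ws://localhost:4000/socket')
socket.connect()

// Subscribe to changes on one table of interest.
const channel = socket.channel('realtime:public:emails')
channel.join()

// Mirror inserts/updates into Elasticsearch, keyed by primary key.
const upsert = (record: Record<string, any>) =>
  fetch(`http://localhost:9200/emails/_doc/${record.id}`, {
    method: 'PUT',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(record),
  })

channel.on('INSERT', (payload: any) => upsert(payload.record))
channel.on('UPDATE', (payload: any) => upsert(payload.record))
```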

While exploring supabase/realtime, however, I've seen how it might be beneficial for other aspects of a product I'm working on.

• Subscribing to tables instead of creating lots of NOTIFY triggers (as we currently do in production).

One thing that jumped out at me is that the payload size, with the types object included (useful as it is), really adds up over thousands of records (even more if you're receiving the OLD record). I think I'd prefer to be able to request specific data when subscribing.

This seems to be partly discussed in #47, but that appears to be about conditionally filtering data (which is good). What I would find useful is requesting a subset of the data, so that optionally only the bare minimum needs to be transmitted between the realtime server and client. A JSON query and transformation language like https://github.com/schibsted/jslt or https://github.com/jsonata-js/jsonata would be ideal, as it could be passed as an argument on subscription and transform the data on the server side. I don't know whether equivalent libraries are available in Elixir.
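To make that concrete, here is roughly what such a transform could do, shown client-side with jsonata (a sketch only: the event shape is abbreviated, and evaluate() returns a promise in recent jsonata versions):

```ts
import jsonata from 'jsonata'

// An abbreviated change event as emitted today: the record, the old
// record, and per-column type metadata all travel together.
const change = {
  type: 'UPDATE',
  table: 'emails',
  columns: [
    { name: 'id', type: 'int8' },
    { name: 'subject', type: 'text' },
    { name: 'body', type: 'text' },
  ],
  record: { id: '42', subject: 'Hello', body: 'a very large body...' },
  old_record: { id: '42', subject: 'Hi', body: 'another large body...' },
}

// The expression a client could pass on subscription: keep two fields
// and rename them. Ideally everything else never leaves the server.
const expr = jsonata('{ "docId": record.id, "title": record.subject }')

const trimmed = await expr.evaluate(change)
// => { docId: '42', title: 'Hello' }
```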

Question:
I'm assuming the preferred approach to reduce the load on the server and the streaming output from PostgreSQL is to create a publication for only a subset of tables, rather than CREATE PUBLICATION supabase_realtime FOR ALL TABLES;?

One more thing: we are planning the benchmark test as a requirement to move from alpha to beta.

BTW, the checkboxes in the README for this repo suggest you're already in Beta (https://github.com/supabase/realtime#status), but the paragraph below suggests otherwise. Originally I only read the checkboxes, not the paragraph, so I thought you were in Beta. :)

kwakwaversal added the question label on Aug 26, 2020
@kiwicopple (Member)

Hey @kwakwaversal, thanks for the amazing write-up.

the checkboxes in the README for this repo suggest you're already in Beta

Oops, thanks for the heads up! We are standardising release statuses across the whole org. I've shifted it back to alpha; I think that's a fairer representation of the state of the server, and of Supabase as a whole. We will move into Beta once we have some benchmarks and, more importantly, once we've hired someone who can support this repo full time (we're a very small team right now).

update or insert records in ES/Solr. Guaranteed delivery is obviously preferred for this.

Very cool. This one will be doable with the webhooks (#33). Once I iron out the issue with reusing replication slots (#22), it will have guaranteed reads from Postgres and guaranteed sends. I'm reluctant to call it guaranteed delivery, because once it sends it to ES it won't wait for an acknowledgement back. If the payload fails to reach ES (for whatever reason), the change event is lost too. This might be something we work on in the future, but it would be a major development (since it requires persistence).

the payload size, with the types object included (useful as it is), really adds up over thousands of records (even more if you're receiving the OLD record)

Interesting, and one I hadn't thought of. I'll dig into the idea of a JSON transformer. The payload values are all strings, so it might be more universally useful to add a boolean flag like PARSE_PAYLOAD: if true, this server could parse the payload into the correct types, dropping the columns key altogether. This will have an impact on performance, so it's one we'll have to weigh up. You would also lose a lot of detail (is it a smallint or a bigint?), but in many cases that won't matter.
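To illustrate, here's a client-side sketch of what such a flag could do with today's payloads; the type names and the mapping are illustrative, not exhaustive:

```ts
// Coerce the all-string record values using the columns metadata,
// after which the columns key itself could be dropped.
type Column = { name: string; type: string }

function parseRecord(
  columns: Column[],
  record: Record<string, string | null>
): Record<string, unknown> {
  const out: Record<string, unknown> = {}
  for (const { name, type } of columns) {
    const raw = record[name]
    if (raw == null) {
      out[name] = raw
      continue
    }
    switch (type) {
      case 'int2':
      case 'int4':
      case 'int8': // note: large int8 values overflow Number; BigInt would be safer
      case 'float4':
      case 'float8':
        out[name] = Number(raw)
        break
      case 'bool':
        out[name] = raw === 't' || raw === 'true'
        break
      case 'json':
      case 'jsonb':
        out[name] = JSON.parse(raw)
        break
      default:
        out[name] = raw // text, timestamps, etc. stay as strings
    }
  }
  return out
}
```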

My initial thoughts on this:

  • storage is cheaper than compute
  • performance is important

Therefore, it may not be the best idea if it's only to save space on a database. Any reason you couldn't strip the columns key from the payload after it has landed in ES?

I'm assuming the preferred approach to reduce the load on the server and the streaming output from PostgreSQL is to create a publication for only a subset of tables

Yes, that's correct. If there are tables you don't care about streaming, just don't enable the replication. Let me know if that's a problem for your use-case!
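Concretely, something like this (a sketch using node-postgres; the table names are illustrative):

```ts
import { Client } from 'pg'

const client = new Client({ connectionString: process.env.DATABASE_URL })
await client.connect()

// Publish only the tables you want streamed, instead of FOR ALL TABLES.
await client.query(
  'CREATE PUBLICATION supabase_realtime FOR TABLE public.emails, public.users'
)

await client.end()
```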

Subscribing to tables instead of creating lots of NOTIFY triggers (as we currently do in production).

In case you didn't see this, there is an 8000-byte limit for NOTIFY payloads, so be careful that your system isn't silently dropping events. Postgres actually raises an error, but it's hard to catch. This is exactly why we created Realtime last year: I was using trigger/NOTIFY and discovered this the hard way.
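For anyone reading along, the usual workaround is to NOTIFY with only the primary key and have the listener re-read the full row itself. A sketch via node-postgres (table and channel names illustrative; EXECUTE FUNCTION assumes Postgres 11+):

```ts
import { Client } from 'pg'

const client = new Client({ connectionString: process.env.DATABASE_URL })
await client.connect()

// The payload is just the row id, so it never hits the 8000-byte limit.
await client.query(`
  CREATE OR REPLACE FUNCTION notify_email_change() RETURNS trigger AS $$
  BEGIN
    PERFORM pg_notify('email_changes', NEW.id::text);
    RETURN NEW;
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER emails_notify
    AFTER INSERT OR UPDATE ON public.emails
    FOR EACH ROW EXECUTE FUNCTION notify_email_change();
`)

await client.end()
```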

Next steps

I will explore a JSON transformer with one of our team and let you know. It could take some time, but I think it's worth considering as part of our work on #47.

@kwakwaversal (Author)

I'm reluctant to call it guaranteed delivery, because once it sends it to ES it won't wait for an acknowledgement back.

I think it would make sense for me to handle the writes to Solr myself rather than create a webhook in the server. I would need to update Solr FIFO, but I read in #33 (comment) that for webhooks that don't return a 2XX you suggested a log be written somewhere. That would break the order of the records updating the relevant documents in Solr, which isn't ideal.

As a side note, if records that aren't successfully forwarded to a webhook can't easily be queried (and optionally extracted), I think that might be an issue. A log is great, obviously, but I would like to see some stats when webhooks fail for whatever reason. Programmatically polling the server for failed webhooks would be helpful, but then you're changing the simplicity of the server. It does sound more and more like you're going to end up having to write a job queue for the webhooks, which is non-trivial. Or, at the very least, guarantee writes to somewhere so that events are not lost.
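To show the kind of ordered forwarder I mean, here's an in-memory sketch (the Solr endpoint and core name are illustrative, and a crash would still lose anything buffered here):

```ts
// A strictly ordered forwarder: the head of the queue is retried until
// Solr accepts it, so documents are never applied out of order.
const queue: object[] = []
let draining = false

async function forward(doc: object): Promise<void> {
  const res = await fetch(
    'http://localhost:8983/solr/emails/update/json/docs?commit=true',
    {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(doc),
    }
  )
  if (!res.ok) throw new Error(`Solr returned ${res.status}`)
}

export function enqueue(doc: object): void {
  queue.push(doc)
  void drain()
}

async function drain(): Promise<void> {
  if (draining) return
  draining = true
  while (queue.length > 0) {
    try {
      await forward(queue[0])
      queue.shift() // drop the event only once Solr has accepted it
    } catch {
      await new Promise((r) => setTimeout(r, 1000)) // back off, then retry in order
    }
  }
  draining = false
}
```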

The payload values are all strings, so it might be more universally useful to add a boolean flag like PARSE_PAYLOAD: if true, this server could parse the payload into the correct types, dropping the columns key altogether.

Parsing the payload into the correct types might be useful, but I wouldn't say it's necessary. A JSON transformer would let someone cherry-pick properties so that, from a table of 30 columns, it would return just 3, and rename the columns in the process.

storage is cheaper than compute

Therefore, it may not be the best idea if it's only to save space on a database. Any reason you couldn't strip the columns key from the payload after it has landed in ES?

When I mentioned the payload size, I wasn't talking about storing it; I was literally talking about the payload size that's being sent across interfaces. The network traffic between hosts.

If I have a DB server and put supabase/realtime on it, and then have a consuming websocket server on a different host, I'm going to get a lot of data sent between the hosts. It's quite possible that I'm throwing away a lot of that data, so it could be more performant for me to revert to sending NOTIFY from triggers. But then I lose some of the flexibility that supabase/realtime originally gave me: not having to keep changing my database migrations to add/remove columns from the NOTIFY payload. As discussed, the JSON transformers would really shine here.

In case you didn't see this, there is an 8000-byte limit for NOTIFY payloads

Hah, yeah. I ran into this when trying to send emails stored in the database over the notifies. I mean, it makes sense for maintaining performance, but at the time it was a head-scratcher.

I will explore a JSON transformer with one of our team and let you know. It could take some time, but I think it's worth considering as part of our work on #47.

Great stuff.

Addendum

For anyone else who might stumble across this issue looking for a way of syncing PostgreSQL to ES/Solr, https://debezium.io/ looks like a good alternative. I was hoping not to add too much extra to the stack, because this becomes: ES/Solr, Debezium, Zookeeper, Kafka. But I guess that's the price you have to pay for guaranteed delivery in this instance.

@kiwicopple (Member)

I was literally talking about the payload size that's being sent across interfaces. The network traffic between hosts.

Got it, that makes sense.

Debezium

Yes, great system! It's definitely heavy, but if you want full functionality I think it's the only thing on the market right now.

Thanks again for the context. I will need to chat to Francesco about this use-case. He's not active right now, so unfortunately it could take a while to see progress on this one. Nonetheless, here are the actions I have:

  • Investigate the use of a queuing system, or some other guaranteed delivery mechanism
  • Explore a JSON transformer

kiwicopple mentioned this issue on Oct 23, 2020
@rrjanbiah

@kwakwaversal

Addendum

For anyone else who might stumble across this issue looking for a way of syncing PostgreSQL to ES/Solr, https://debezium.io/ looks like a good alternative. I was hoping not to add too much extra to the stack, because this becomes: ES/Solr, Debezium, Zookeeper, Kafka. But I guess that's the price you have to pay for guaranteed delivery in this instance.

I was looking for a Debezium alternative (to avoid the JVM stack) and stumbled upon this issue. Did you succeed in your attempt to use Supabase for the Elasticsearch sync? In my case, I have found the Elasticsearch alternative MeiliSearch (crosslinking meilisearch/integration-guides#20).

On another note, for JSON transforming, I think these libraries are better and more popular: https://github.com/jmespath/jmespath.js, https://github.com/jsonata-js/jsonata or https://github.com/wankdanker/node-object-mapper (FWIW, https://www.npmtrends.com/json-query-vs-jsonata-vs-JSONPath-vs-jsonpath-vs-jsonpath-plus-vs-jmespath-vs-object-mapper).

@kiwicopple (Member)

Thanks for the parsers @rrjanbiah - we might need to find something Elixir-native, but these are good prior art to work from.

Our current discussion is to update the event parser to be more in line with AWS's Step Functions workflows, e.g. https://states-language.net/#choice-state. This is a full state machine language, which should make this a very versatile server. It's a large task though, so we'll have to spend time figuring out whether it's possible and whether we can ship it in parts.
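For reference, a Choice state in that language inspects a field of the input and branches to a different next state. Expressed here as a TypeScript object rather than raw JSON (state names illustrative):

```ts
const choiceState = {
  Type: 'Choice',
  Choices: [
    { Variable: '$.type', StringEquals: 'INSERT', Next: 'IndexDocument' },
    { Variable: '$.type', StringEquals: 'DELETE', Next: 'RemoveDocument' },
  ],
  Default: 'IgnoreEvent',
}
```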

@rrjanbiah

@kiwicopple

Thanks for the parsers @rrjanbiah - we might need to find something Elixir-native, but these are good prior art to work from.

It looks unmaintained, but I found this one: https://github.com/stephan83/ex-jmes

Our current discussion is to update the event parser to be more in line with AWS's Step Functions workflows, e.g. https://states-language.net/#choice-state. This is a full state machine language, which should make this a very versatile server. It's a large task though, so we'll have to spend time figuring out whether it's possible and whether we can ship it in parts.

I don't quite understand this... but since you're referring to state machines, you may want to check https://github.com/davidkpiano/xstate, as that's quite popular.

@chasers (Contributor) commented on Sep 26, 2022

This is a great idea, and with Realtime v2 we have an extensions concept where we could additionally insert your realtime feed into another database.
