
improvement: run the batch ingestion process in parallel #14

Closed
ghost opened this issue Jul 24, 2017 · 7 comments

@ghost commented Jul 24, 2017

As an improvement:
run the batch process (executing fn_process_batch) in parallel (thread/fork) with the parsing of the binary logs when using start_replica.

That would help when the replica needs to catch up with the database.

Note: I am not asking for parallel table processing when running init_replica, but that would also help ;)
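
Something like this as a rough sketch (parse_binlog and process_batch are just stand-ins for illustration, not pg_chameleon's actual internals):

```python
import multiprocessing as mp
import time

def parse_binlog():
    """Stand-in for the binlog parser: yields fake batch ids."""
    for batch_id in range(5):
        time.sleep(0.1)  # simulate reading a chunk of binlog
        yield batch_id

def process_batch(batch_id):
    """Stand-in for calling fn_process_batch on the PostgreSQL side."""
    time.sleep(0.2)  # simulate the replay work
    print(f"replayed batch {batch_id}")

def reader(q):
    # Producer: keep parsing the binlog, hand finished batches over.
    for batch_id in parse_binlog():
        q.put(batch_id)
    q.put(None)  # sentinel: no more batches

def replayer(q):
    # Consumer: replay batches while the reader keeps parsing.
    while True:
        batch_id = q.get()
        if batch_id is None:
            break
        process_batch(batch_id)

if __name__ == "__main__":
    q = mp.Queue()
    procs = [mp.Process(target=reader, args=(q,)),
             mp.Process(target=replayer, args=(q,))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

With this shape the binlog reader never waits for PostgreSQL: batches queue up and the replayer drains them as fast as the database allows.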

@ghost
Copy link
Author

ghost commented Jul 24, 2017

One idea: why not run fn_process_batch entirely in the database, using a trigger on the t_replica_batch table?
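
Roughly like this (untested sketch; the fn_process_batch call signature is a guess here and would need to be checked against the installed sch_chameleon schema):

```python
import psycopg2

TRIGGER_DDL = """
CREATE OR REPLACE FUNCTION sch_chameleon.fn_batch_trigger()
RETURNS trigger AS
$body$
BEGIN
    -- Hypothetical wrapper: replay the batch as soon as it is inserted.
    -- Caveat: this runs inside the inserting transaction, so a slow
    -- replay would block the binlog reader that inserted the row.
    PERFORM sch_chameleon.fn_process_batch();
    RETURN NEW;
END;
$body$
LANGUAGE plpgsql;

CREATE TRIGGER trg_process_batch
AFTER INSERT ON sch_chameleon.t_replica_batch
FOR EACH ROW EXECUTE PROCEDURE sch_chameleon.fn_batch_trigger();
"""

connection = psycopg2.connect("dbname=my_replica_db")  # assumed DSN
with connection, connection.cursor() as cursor:
    cursor.execute(TRIGGER_DDL)
connection.close()
```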

@the4thdoctor (owner) commented Jul 24, 2017

Hi, the version 2 I'm currently working on will split the read and replay into two separate subprocesses. I'll try to release it by the end of the year. It will also run init_replica in parallel, with a less invasive flush process.

@the4thdoctor (owner)

Btw, I'm getting curious about your migration. Any chance you'll be able to write a case study? :-)

@ghost commented Jul 24, 2017

Well, in principle we have a single MySQL instance/process running 3 schemas.
We want to migrate those 3 to Postgres (2 to the same database as 2 different schemas, the last maybe to a different instance) and we want the switchover to run with minimal downtime.
pg_chameleon is the tool that makes this possible.

As for speed we found that:
a) the initial migration of the biggest database takes about 6 hours (~50 GB) - we have not tested the others yet, but they are smaller
b) when running start_replica (recovering), there are long stretches where we only process binlogs while Postgres sits idle.

So it is not something "special".

The only thing is that I keep finding things that - to me - look strange, and as we want to have all the data in PostgreSQL we have to make sure there is no data loss. Hence all those questions regarding logging and mismatches...

Besides that I am totally happy with the tool itself - thanks for providing it!

the4thdoctor added this to the Version 2.0 milestone on Jul 25, 2017
@the4thdoctor (owner)

thanks for sharing :)
I'm happy the tool is proving useful.
The replica process is the biggest issue of this initial version, as it does not provide read and replay in parallel. And I found it quite difficult to change this approach, because I made wrong decisions when I wrote the initial implementation.

I'll try to speed up the development of version 2 to provide a better experience :)

@the4thdoctor (owner) commented Jul 31, 2017

@martinsperl-kognitiv I've just pushed an improvement for the replay function. My tests showed faster execution, with reduced CPU load and I/O wait. Feel free to give it a try. :)

The upgrade procedure will add an extra table used by the replay function. Be sure to stop all the replica processes before upgrading the schema, and take a backup of sch_chameleon first.
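
For the backup, something along these lines does the job (assuming the replica database is called my_replica_db; pg_dump's -n flag restricts the dump to the named schema):

```python
import subprocess

# Dump only the sch_chameleon schema before upgrading.
subprocess.run(
    [
        "pg_dump",
        "-n", "sch_chameleon",
        "-f", "sch_chameleon_backup.sql",
        "my_replica_db",
    ],
    check=True,
)
```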

the4thdoctor modified the milestones: ver1.7, ver2.0 on Aug 4, 2017
@the4thdoctor (owner)

version 1.7 will have a threaded option for running read and replay in parallel
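
Roughly, the threaded flavour looks like this (illustrative stand-ins only, not the actual 1.7 code):

```python
import queue
import threading

def reader(q):
    # Stand-in for the binlog reader thread.
    for batch_id in range(5):
        q.put(batch_id)
    q.put(None)  # sentinel: reading finished

def replayer(q):
    # Stand-in for the replay thread calling fn_process_batch.
    while True:
        batch_id = q.get()
        if batch_id is None:
            break
        print(f"replaying batch {batch_id}")

q = queue.Queue()
threads = [threading.Thread(target=reader, args=(q,)),
           threading.Thread(target=replayer, args=(q,))]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Both sides are mostly I/O bound (MySQL on one end, PostgreSQL on the other), so threads work fine here despite the GIL.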
