fix data-integrity problem when add unique key #912
Adding a Can you please clarify what is the exact use case (possibly with a test) that this PR solves? Thank you! |
Thanks for your quick reply, @shlomi-noach. With this PR, if there are duplicates on the newly added unique key, the migration fails instead of silently ignoring them.

Background

gh-ost uses two threads to populate the ghost table. One thread copies the existing rows of the original table using insert ignore into; the other handles incremental changes by parsing the binlog and applying it using replace into. Because the two threads run concurrently, the same row is likely to be processed twice, and insert ignore into and replace into were chosen so that such repeats do not cause a failure. However, when the migration adds a unique key, rows that are genuinely duplicated on the new unique key are also ignored or replaced. This causes data loss. In the extreme case, if you add a new column and a unique key on that column at the same time, only one record is left once the migration completes. This may cause serious problems.

Proposal

Based on the above background, we can handle this case specially. For the data copy, only copy rows whose primary key value is not already in _xx_gho, which can be written as insert into _xx_gho select a.* from xx a left join _xx_gho b on a.pk = b.pk where b.pk is null; for the binlog apply, replace into can be rewritten as delete + insert.

Rationale

The data copy only copies rows whose primary key value is not in the gho table, but we don’t need use

Thank you! |
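The row-copy half of the proposal above can be sketched as follows. This is only an illustration, not gh-ost code: it uses Python's built-in sqlite3 as a stand-in for MySQL (SQLite's INSERT OR IGNORE plays the role of insert ignore into), the table names xx and _xx_gho come from the comment above, and the email column and the function names are hypothetical.

```python
import sqlite3


def make_tables(conn):
    """Create a hypothetical original table and its ghost table."""
    # Original table: primary key only, duplicates on email are allowed.
    conn.execute("CREATE TABLE xx (pk INTEGER PRIMARY KEY, email TEXT)")
    # Ghost table: same columns plus the newly added unique key.
    conn.execute(
        "CREATE TABLE _xx_gho (pk INTEGER PRIMARY KEY, email TEXT, UNIQUE (email))"
    )
    conn.executemany(
        "INSERT INTO xx VALUES (?, ?)",
        [(1, "a@example.com"), (2, "a@example.com"), (3, "b@example.com")],
    )


def copy_ignoring_conflicts(conn):
    """Row copy in the current style: conflicting rows are silently skipped."""
    conn.execute("INSERT OR IGNORE INTO _xx_gho SELECT pk, email FROM xx")
    return conn.execute("SELECT COUNT(*) FROM _xx_gho").fetchone()[0]


def copy_strict(conn):
    """Row copy in the proposed style: anti-join on pk, plain INSERT.

    Rows already present in the ghost table (same pk) are skipped by the
    anti-join, so re-copying is harmless, but a real duplicate on the new
    unique key raises an error. Returns the error text, or None on success.
    """
    try:
        conn.execute(
            "INSERT INTO _xx_gho "
            "SELECT a.pk, a.email FROM xx a "
            "LEFT JOIN _xx_gho b ON a.pk = b.pk "
            "WHERE b.pk IS NULL"
        )
    except sqlite3.IntegrityError as err:
        return str(err)
    return None
```

With the three sample rows, the ignoring copy leaves two rows in the ghost table (the row with pk 2 is dropped without any error), while the strict copy reports a unique-constraint violation, which is the signal the migration would need in order to abort.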
Added 2 test cases.
Original table definition:
Test case 1: prepare: execute: We can insert a duplicate row before the table cut-over by using the postpone flag. Insert SQL: result:
Test case 2: prepare: execute: result:
|
@cenkore thank you for the PR and for the explanation. I've been thinking, possibly we don't need to change the |
Oh, strike that. The trivial case when the table is pre-populated with conflicting values + zero traffic in the binary logs is enough to prove my thinking was wrong. |
@cenkore would you mind submitting an identical PR on https://github.com/openark/gh-ost/, where I can take a closer look please? |
@shlomi-noach ok, I will submit later. Thank you. |
@shlomi-noach I submitted to |
@cenkore thank you so much for the effort! I will double check. |
When will this PR be merged? |
When will this be fixed? |
This situation is addressed by #1500 |
Hi,
This PR addresses #485, #477 and #167.
gh-ost rewrites inserts as insert ignore into (row-copy) and replace into (binlog apply goroutine). If the data in the original table, or the data after binlog apply, contains conflicts (that is, it does not meet the conditions for adding a unique key), data is lost in the current version. This PR detects such data conflicts and exits, covering both conflicts in the original table data and conflicts in the data after binlog apply. PTAL.
The error looks like the following and is very easy to reproduce:
thanks.
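For the binlog-apply side, the difference between replace into and delete + insert can be sketched in the same spirit. Again this is a hedged illustration rather than the PR's implementation: sqlite3 stands in for MySQL, and the table layout, the email column and the function names are assumptions made for the example.

```python
import sqlite3


def new_ghost(conn):
    """Hypothetical ghost table with a unique key and one existing row."""
    conn.execute(
        "CREATE TABLE _xx_gho (pk INTEGER PRIMARY KEY, email TEXT, UNIQUE (email))"
    )
    conn.execute("INSERT INTO _xx_gho VALUES (1, 'a@example.com')")


def apply_with_replace(conn, pk, email):
    """Apply a row event with REPLACE: any conflicting row is deleted first.

    Returns the primary keys left in the ghost table afterwards.
    """
    conn.execute("REPLACE INTO _xx_gho (pk, email) VALUES (?, ?)", (pk, email))
    return [row[0] for row in conn.execute("SELECT pk FROM _xx_gho ORDER BY pk")]


def apply_with_delete_insert(conn, pk, email):
    """Apply a row event as DELETE by primary key followed by a plain INSERT.

    Replaying an event for the same pk still succeeds, but a conflict with a
    different row on the unique key raises. Returns the error text or None.
    """
    conn.execute("DELETE FROM _xx_gho WHERE pk = ?", (pk,))
    try:
        conn.execute("INSERT INTO _xx_gho (pk, email) VALUES (?, ?)", (pk, email))
    except sqlite3.IntegrityError as err:
        return str(err)
    return None
```

Applying an event for pk 2 with the same email as pk 1 shows the contrast: the replace variant leaves only pk 2 in the table, so pk 1 is lost without any error, whereas the delete + insert variant returns a unique-constraint error that can be used to stop the migration.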