Data corruption in concurrent delete #23

Closed
dvarrazzo opened this issue May 18, 2014 · 1 comment

@dvarrazzo
Member

Analysis and bugfix by Josh. Opening the issue for reference, as Josh's email is not available in the ML archive. The bug has already been fixed and is ready to be released in pg_repack 1.2.0.

OK, I found some time to revisit this bug. Here's a test script with
which I managed to reproduce the problem fairly consistently (see its
comments for the initial table setup, etc.). I was testing with
pg_repack master, with a few sleep calls sprinkled in to encourage
races (extra_sleeps.diff debugging patch attached). To reproduce,
start up that repack_bug.py script, let it run for a few seconds to
start populating rows, then kick off an extra_sleeps-patched pg_repack
of table "parent"; afterwards you should get some broken FKs.

AFAICT, the problem only occurs when the test script is using more
than one child process, i.e. when concurrent transactions are queueing
rows into the repack log table. So where's the bug? It seems the
sql_pop query used in repack_apply() is to blame. It tries to
bulk-delete the rows it has already processed from the log table via a
query like:

DELETE FROM repack.log_22520137 WHERE id <= $1

assuming that any "id" value less than the first "id" it started
processing in its current batch is safe to delete. But this assumption
may not hold if we have, say:

T1: BEGIN
T1: DELETE FROM table_being_repacked WHERE ...
T1: ... hangs out for a while ...

T2: BEGIN
T2: INSERT INTO table_being_repacked ...
T2: COMMIT

... later ...

T1: COMMIT
/* repack_apply's sql_pop may run here */
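
To make the ordering concrete, here is an illustrative sketch of the
log contents at that point (the ids are made up for illustration; the
log table name is the one from the query above):

-- log ids are assigned when the row-change triggers fire, not at commit:
--   id 41: logged by T1's DELETE  (transaction still open)
--   id 42: logged by T2's INSERT  (already committed)
-- repack_apply() can only see and apply id 42; once T1 has committed,
-- sql_pop runs with a cutoff that covers id 41 as well:
DELETE FROM repack.log_22520137 WHERE id <= $1;  -- $1 >= 41 here
-- T1's log row is removed without ever having been replayed against
-- the new table, so its DELETE is lost.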

In this example repack_apply() may delete T1's entry in the repack log
table without having actually carried out T1's DELETE on the repack
temp table. Attached is a patch to turn that bulk-delete into a delete
for every row we process out of the log table
(delete_individual_rows.diff). That'll be slower of course, but it
could be optimized into a bulk delete by keeping track of all the IDs
we have processed, and performing a single DELETE FROM repack.log_xyz
WHERE id IN (...).
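
A minimal sketch of that bulk variant, assuming the apply loop collects
the log ids it has actually replayed into an array (the = ANY($1) form
is just a parameterized spelling of the IN (...) idea above, not the
code that was shipped):

DELETE FROM repack.log_22520137 WHERE id = ANY($1);
-- $1: array holding only the log ids read and applied in this batch,
-- so rows logged by transactions that committed later are untouched.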

At least, I hope this is the bug which I got bitten by back in
February. Other thoughts are welcome of course.

@dvarrazzo
Member Author

Fixed in 5061046

dvarrazzo added a commit that referenced this issue May 19, 2014