Data corruption in concurrent delete #23

Closed
dvarrazzo opened this issue May 18, 2014 · 1 comment

@dvarrazzo
Member

Analysis and bugfix by Josh. Opening the issue for reference, as Josh's email is not available in the ML archive. The bug has already been fixed and is ready to be released in pg_repack 1.2.0.

OK, I found some time to revisit this bug. Here's a test script with
which I managed to reproduce the problem fairly consistently (see its
comments for the initial table setup, etc.). I was testing with
pg_repack master, with a few sleep calls sprinkled in to encourage
races (extra_sleeps.diff debugging patch attached). To reproduce,
start up that repack_bug.py script, let it run for a few seconds to
start populating rows, then kick off an extra_sleeps-patched pg_repack
of table "parent"; afterwards you should get some broken FKs.

AFAICT, the problem only occurs when the test script is using more
than one child process, i.e. when concurrent transactions are queueing
rows into the repack log table. So where's the bug? It seems the
sql_pop query used in repack_apply() is to blame. It tries to
bulk-delete the rows it has already processed from the log table via a
query like:

DELETE FROM repack.log_22520137 WHERE id <= $1

assuming that any "id" value less than the first "id" it started
processing in its current batch is safe to delete. But this assumption
may not hold if we have, say:

T1: BEGIN
T1: DELETE FROM table_being_repacked WHERE ...
T1: ... hangs out for a while ...

T2: BEGIN
T2: INSERT INTO table_being_repacked ...
T2: COMMIT

... later ...

T1: COMMIT
/* repack_apply's sql_pop may run here */
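
To make the ordering concrete, here is an illustrative sketch of the
log contents at that point (the ids are made up for illustration; the
log table name is the one from the query above):

-- log ids are assigned when the row-change triggers fire, not at commit:
--   id 41: logged by T1's DELETE  (transaction still open)
--   id 42: logged by T2's INSERT  (already committed)
-- repack_apply() can only see and apply id 42; once T1 has committed,
-- sql_pop runs with a cutoff that covers id 41 as well:
DELETE FROM repack.log_22520137 WHERE id <= $1;  -- $1 >= 41 here
-- T1's log row is removed without ever having been replayed against
-- the new table, so its DELETE is lost.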

In this example repack_apply() may delete T1's entry in the repack log
table without having actually carried out T1's DELETE on the repack
temp table. Attached is a patch to turn that bulk-delete into a delete
for every row we process out of the log table
(delete_individual_rows.diff). That'll be slower of course, but it
could be optimized into a bulk delete by keeping track of all the IDs
we have processed, and performing a single DELETE FROM repack.log_xyz
WHERE id IN (...).
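
A minimal sketch of that bulk variant, assuming the apply loop collects
the log ids it has actually replayed into an array (the = ANY($1) form
is just a parameterized spelling of the IN (...) idea above, not the
code that was shipped):

DELETE FROM repack.log_22520137 WHERE id = ANY($1);
-- $1: array holding only the log ids read and applied in this batch,
-- so rows logged by transactions that committed later are untouched.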

At least, I hope this is the bug which I got bitten by back in
February. Other thoughts are welcome of course.

@dvarrazzo
Member Author

Fixed in 5061046

dvarrazzo added a commit that referenced this issue May 19, 2014