
Copy phase is acquiring a lock on my original table I don't know why #94

Closed
yash-toddleapp opened this issue Jun 24, 2023 · 18 comments

@yash-toddleapp commented Jun 24, 2023

[screenshot: database monitoring graph]

It's mostly happening for UPDATE queries, and I don't know why. I thought it would release the lock, but the curve kept going up. I had to manually intervene and stop the process; everything instantly came back to normal as soon as I stopped it.

@jfrost (Collaborator) commented Jun 24, 2023

Could you give it another test and run this query as your RDS admin user when you see those waiting locks:

WITH RECURSIVE
     c(requested, current) AS
       ( VALUES
         ('AccessShareLock'::text, 'AccessExclusiveLock'::text),
         ('RowShareLock'::text, 'ExclusiveLock'::text),
         ('RowShareLock'::text, 'AccessExclusiveLock'::text),
         ('RowExclusiveLock'::text, 'ShareLock'::text),
         ('RowExclusiveLock'::text, 'ShareRowExclusiveLock'::text),
         ('RowExclusiveLock'::text, 'ExclusiveLock'::text),
         ('RowExclusiveLock'::text, 'AccessExclusiveLock'::text),
         ('ShareUpdateExclusiveLock'::text, 'ShareUpdateExclusiveLock'::text),
         ('ShareUpdateExclusiveLock'::text, 'ShareLock'::text),
         ('ShareUpdateExclusiveLock'::text, 'ShareRowExclusiveLock'::text),
         ('ShareUpdateExclusiveLock'::text, 'ExclusiveLock'::text),
         ('ShareUpdateExclusiveLock'::text, 'AccessExclusiveLock'::text),
         ('ShareLock'::text, 'RowExclusiveLock'::text),
         ('ShareLock'::text, 'ShareUpdateExclusiveLock'::text),
         ('ShareLock'::text, 'ShareRowExclusiveLock'::text),
         ('ShareLock'::text, 'ExclusiveLock'::text),
         ('ShareLock'::text, 'AccessExclusiveLock'::text),
         ('ShareRowExclusiveLock'::text, 'RowExclusiveLock'::text),
         ('ShareRowExclusiveLock'::text, 'ShareUpdateExclusiveLock'::text),
         ('ShareRowExclusiveLock'::text, 'ShareLock'::text),
         ('ShareRowExclusiveLock'::text, 'ShareRowExclusiveLock'::text),
         ('ShareRowExclusiveLock'::text, 'ExclusiveLock'::text),
         ('ShareRowExclusiveLock'::text, 'AccessExclusiveLock'::text),
         ('ExclusiveLock'::text, 'RowShareLock'::text),
         ('ExclusiveLock'::text, 'RowExclusiveLock'::text),
         ('ExclusiveLock'::text, 'ShareUpdateExclusiveLock'::text),
         ('ExclusiveLock'::text, 'ShareLock'::text),
         ('ExclusiveLock'::text, 'ShareRowExclusiveLock'::text),
         ('ExclusiveLock'::text, 'ExclusiveLock'::text),
         ('ExclusiveLock'::text, 'AccessExclusiveLock'::text),
         ('AccessExclusiveLock'::text, 'AccessShareLock'::text),
         ('AccessExclusiveLock'::text, 'RowShareLock'::text),
         ('AccessExclusiveLock'::text, 'RowExclusiveLock'::text),
         ('AccessExclusiveLock'::text, 'ShareUpdateExclusiveLock'::text),
         ('AccessExclusiveLock'::text, 'ShareLock'::text),
         ('AccessExclusiveLock'::text, 'ShareRowExclusiveLock'::text),
         ('AccessExclusiveLock'::text, 'ExclusiveLock'::text),
         ('AccessExclusiveLock'::text, 'AccessExclusiveLock'::text)
       ),
     l AS
       (
         SELECT
             (pl.locktype,pl.DATABASE,pl.relation::regclass::text,pl.page,pl.tuple,pl.virtualxid,pl.transactionid,pl.classid,pl.objid,pl.objsubid) AS target,
             pl.virtualtransaction,
             pl.pid,
             pl.mode,
             pl.granted,
             psa.query
           FROM pg_catalog.pg_locks pl
           JOIN pg_catalog.pg_stat_activity psa
             ON pl.pid = psa.pid
       ),
     t AS
       (
         SELECT
             blocker.target  AS blocker_target,
             blocker.pid     AS blocker_pid,
             blocker.mode    AS blocker_mode,
             blocker.query   AS blocker_query,
             blocked.target  AS target,
             blocked.pid     AS pid,
             blocked.mode    AS mode
           FROM l blocker
           JOIN l blocked
             ON ( NOT blocked.granted
              AND blocker.granted
              AND blocked.pid != blocker.pid
              AND blocked.target IS NOT DISTINCT FROM blocker.target)
           JOIN c ON (c.requested = blocked.mode AND c.current = blocker.mode)
       ),
     r AS
       (
         SELECT
             blocker_target,
             blocker_pid,
             blocker_mode,
             '1'::int        AS depth,
             target,
             pid,
             mode,
             blocker_pid::text || ',' || pid::text AS seq,
             blocker_query
           FROM t
         UNION ALL
         SELECT
             blocker.blocker_target,
             blocker.blocker_pid,
             blocker.blocker_mode,
             blocker.depth + 1,
             blocked.target,
             blocked.pid,
             blocked.mode,
             blocker.seq || ',' || blocked.pid::text,
             blocker.blocker_query
           FROM r blocker
           JOIN t blocked
             ON (blocked.blocker_pid = blocker.pid)
           WHERE blocker.depth < 1000
       )
SELECT * FROM r
  ORDER BY seq;
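
On PostgreSQL 9.6+, pg_blocking_pids() gives a quicker sanity check; a minimal sketch (not a replacement for the full lock tree above):

-- Sketch: list blocked backends and the PIDs blocking them (PostgreSQL 9.6+).
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       state,
       left(query, 80) AS query
  FROM pg_stat_activity
 WHERE cardinality(pg_blocking_pids(pid)) > 0;  -- only backends that are actually blocked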

Would also be good to just grab a dump of pg_stat_activity like so:

SELECT * FROM pg_stat_activity;

Make sure to REDACT any sensitive info before pasting it into your reply.
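
If the full dump is too noisy, a filtered view of just the backends waiting on locks may be enough; a sketch (the wait_event columns exist on PostgreSQL 9.6+):

-- Sketch: only sessions currently waiting on heavyweight locks.
SELECT pid, usename, state, wait_event_type, wait_event, query_start, query
  FROM pg_stat_activity
 WHERE wait_event_type = 'Lock';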

@jfrost (Collaborator) commented Jun 24, 2023

BTW, it might also be helpful to see the queries section of Performance Insights from the same time period as the screenshot you posted. Again, remember to blur any sensitive info if you post that.

@shayonj (Owner) commented Jun 24, 2023

Yeah, the output of the above query will definitely be useful.

Thinking out loud - I wonder if it's a byproduct of using TRANSACTION ISOLATION LEVEL SERIALIZABLE until we add the triggers (during the initial copy); this can be prohibitive on large DBs 🤔

Query.run(client.connection, "SET TRANSACTION ISOLATION LEVEL SERIALIZABLE", true)
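
In SQL terms, the copy phase effectively runs something like this (a sketch paraphrasing the flow described above; the comment stands in for the actual steps):

BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- setup_shadow_table, trigger creation, and copy_data all happen inside this transaction
COMMIT;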

@dhrumil-toddleapp commented:

@shayonj , I also think it's a byproduct of TRANSACTION ISOLATION LEVEL SERIALIZABLE, because the transaction carries over from setup_shadow_table to the copy_data function. What if we don't do copy_data in the same transaction?

@jfrost Also, here are the screenshots of the queries section of Performance Insights. Mostly the UPDATE queries on the original table are stuck. Here the INSERT INTO is happening on the shadow table, and the UPDATEs are happening on the original table.

[screenshot: Performance Insights, queries section, 2023-06-25 10:06 AM]

@yash-toddleapp (Author) commented:

@shayonj I am not finding locks per se with the query @jfrost shared, but this is the behavior while trying to replicate the same thing locally. You can see how queries go from taking <10ms to 400ms and more.

clip of what's happening while trying to replicate it locally: https://imgur.com/M5h5ZPm

@jfrost (Collaborator) commented Jun 25, 2023

If the query I sent over returned zero results while the issue was happening and you can see the lock waits in Performance Insights, then it's possible you are running the query as a user that doesn't have the appropriate permissions. You can do a simple SELECT * FROM pg_stat_activity; to see whether your user can see other sessions' queries or not.
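
A quick way to check visibility (a sketch; the pg_read_all_stats role exists on PostgreSQL 10+). Without membership in it or equivalent rights, other backends' query text shows up as <insufficient privilege>:

-- Sketch: does the current user belong to pg_read_all_stats?
SELECT pg_has_role(current_user, 'pg_read_all_stats', 'member');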

@shayonj (Owner) commented Jun 25, 2023

Yeah, that's interesting. If you have any reproducible script or similar, that could be super useful. I also see that in the video the latency spikes for a handful of queries, but most are <100ms (?).

I will take a deeper look and get back within the week. I have been meaning to refactor parts of this code and introduce the concept of two connections to set up the copy and triggers, like pg_repack does, which I think would also allow us to drop the SET TRANSACTION ISOLATION LEVEL SERIALIZABLE. In the meantime, any more info like @jfrost mentioned, or a reproducible script, would be super useful. Thanks!!

@yash-toddleapp (Author) commented:

@shayonj As of now we have only been able to replicate the locking behaviour on our Amazon RDS instances (it doesn't happen in a local environment). Also, copying data in batches instead of a single query solves it. So I don't think the serializable transaction is the issue.
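
For reference, batching along these lines replaces the single long INSERT ... SELECT; a sketch with placeholder table/column names, assuming an integer primary key id:

-- Sketch: copy rows in keyset batches instead of one long statement.
INSERT INTO shadow_table (id, data)
  SELECT id, data
    FROM ONLY original_table
   WHERE id > 0 AND id <= 10000;
-- then id > 10000 AND id <= 20000, and so on, each batch in its own short transaction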

@jfrost (Collaborator) commented Jun 25, 2023

@yash-toddleapp Any chance you are running those UPDATE queries on your RDS instance with serializable isolation, or something other than read committed?
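
A sketch of how to check the instance-wide default:

SHOW default_transaction_isolation;
-- or via the settings catalog:
SELECT name, setting FROM pg_settings WHERE name = 'default_transaction_isolation';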

@yash-toddleapp (Author) commented:

@shayonj @jfrost Nah, I don't think so. I found something weirder. The table only locks while

INSERT INTO #{shadow_table}(#{insert_into_columns.join(", ")}) SELECT #{select_columns.join(", ")} FROM ONLY #{client.table_name}

is run from pg-osc.

What I did was make a list of all the queries being executed, with the help of the --verbose flag, and then run them using psql. That does not lock the table. So there's something in the tool itself, though I don't know what, that is causing the lock.
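
For anyone following along, the interpolated statement comes out roughly like this (placeholder table/column names); note that FROM ONLY restricts the scan to the named table, excluding any inheritance children:

INSERT INTO shadow_table (id, title)
  SELECT id, title FROM ONLY books;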

@shayonj (Owner) commented Jun 30, 2023

Interesting! I will look a bit more deeply in the next few days and get back.

@shayonj (Owner) commented Jul 1, 2023

So, I am unable to replicate this in our smoke spec env, and I'm not seeing any consistent locking on the parent table. I am curious:

  • How did you pin it down to the issue being at INSERT INTO #{shadow_table}(#{insert_into_columns.join(", ")}) SELECT #{select_columns.join(", ")}?
  • Also, what version of PG are you on?
  • Lastly, do you have any custom settings set on your RDS instances?

Thanks

@shayonj (Owner) commented Jul 1, 2023

*also, roughly how big is the table and how many writes/updates does it get? Thanks!

@yash-toddleapp (Author) commented:

@shayonj Apparently the issue is that the audit table is getting locked because of the ALTER TABLE #{audit_table} SET (autovacuum_enabled = false, toast.autovacuum_enabled = false); query, which runs in a transaction that only commits after the data is copied.

So writes get stuck on the trigger, which wants to acquire a row-level lock on the audit table to insert the log of each insert/update but can't. And yes, this behavior happens only on RDS; we still haven't figured out why.

Running the disable-autovacuum query before the serializable transaction starts fixes the issue. I have a fork, https://github.dev/yash-toddleapp/pg-osc, which has this fix along with some other fixes. I'll make a PR once I get time.
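
In other words, the fix reorders things like this (a sketch with placeholder table names):

-- Run the autovacuum ALTER outside the copy transaction so its lock is
-- released right away instead of being held until the copy commits.
ALTER TABLE audit_table
  SET (autovacuum_enabled = false, toast.autovacuum_enabled = false);

BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
INSERT INTO shadow_table (id, data)
  SELECT id, data FROM ONLY original_table;
COMMIT;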

We have tables as big as 50-100 million rows.

@shayonj (Owner) commented Jul 1, 2023

That's a good find! And it makes sense, since the serializable transaction shouldn't be causing issues. I will get in my refactor after your patch. Let me know if I can help with the patch or anything. Thanks!

Curious why it's only happening on RDS. Will try to look into it later, but I can see how the ALTER can cause an access exclusive lock and a lock queue.
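
While the copy runs, a sketch like this shows which locks are still held on the audit table (the table name is a placeholder):

SELECT l.pid, l.mode, l.granted
  FROM pg_locks l
  JOIN pg_class c ON c.oid = l.relation
 WHERE c.relname = 'audit_table';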

@shayonj (Owner) commented Jul 2, 2023

I proposed a simplified change here: #97

Feel free to add to it or open new PRs with any other fixes you found; also, if you are able to test/verify, that'd be great too. Thanks!

@shayonj (Owner) commented Jul 4, 2023

v.0.9.2 is now out with the fix: https://github.com/shayonj/pg-osc/releases/tag/v0.9.2

@shayonj (Owner) commented Jul 10, 2023

Closing this, thanks for the report and brainstorming! Please feel free to open reports or suggest PRs too. Thanks again!

shayonj closed this as completed Jul 10, 2023