
wip: automatic retries for transient failures #267

Closed · wants to merge 1 commit

Conversation

darora (Contributor) commented Apr 17, 2022

No description provided.

return new Promise((resolve) => setTimeout(() => resolve(), ms))
}

const RetryableErrorCodes = new Set([503, 504, 522, 523, 524])
Member:

Could we add comments on what these codes mean? I think some of the semantics are Cloudflare-specific.

Member:

I think 504 Gateway Timeout and 522/523/524 (Cloudflare-specific) are fine to retry on.

503 means that PostgreSQL is unreachable for PostgREST; retrying in that case could be handled in another PR, since PostgREST provides a Retry-After header. Later on, 503 will also signal when PostgREST's pool is overloaded.
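To illustrate the idea being discussed, here is a minimal sketch of a retry loop that honors Retry-After. This is not this PR's code: `fetchWithRetry` and its Retry-After handling are assumptions layered on top of the `RetryableErrorCodes` set from the diff above.

```typescript
// Status codes treated as transient, matching the diff above.
const RetryableErrorCodes = new Set([503, 504, 522, 523, 524])

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms))

// Hypothetical helper: retry a fetch-like call on transient status codes,
// honoring the server's Retry-After header (in seconds) when present,
// and falling back to a fixed delay otherwise.
async function fetchWithRetry(
  doFetch: () => Promise<Response>,
  retries: number,
  retryDelayMs: number
): Promise<Response> {
  let res = await doFetch()
  while (retries > 0 && RetryableErrorCodes.has(res.status)) {
    const retryAfter = res.headers.get('Retry-After')
    const delayMs = retryAfter !== null ? Number(retryAfter) * 1000 : retryDelayMs
    await sleep(delayMs)
    retries--
    res = await doFetch()
  }
  return res
}
```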

@@ -88,6 +94,35 @@ export abstract class PostgrestBuilder<T> implements PromiseLike<PostgrestRespon
return this
}

retryingFetch(url: string, opts: any, retries: number, retryDelayMs: number): Promise<Response> {
Member:

Make it private?

Member:

To make it mirror fetch:

Suggested change
retryingFetch(url: string, opts: any, retries: number, retryDelayMs: number): Promise<Response> {
retryingFetch(input: RequestInfo, init: RequestInit | undefined, retries: number, retryDelayMs: number): Promise<Response> {

return res.then(onfulfilled, onrejected)
}

processResponse = async <TResult1 = PostgrestResponse<T>>(res: Response): Promise<PostgrestResponse<T>> => {
Member:

Make it private


app.get('/temporary503failure', async function(request, reply) {
if (count_503 < 2) {
count_503++;
Member:

npm run format

Comment on lines +99 to +101
if (!['GET', 'HEAD'].includes(this.method)) {
return Promise.resolve(this.fetch(url, opts))
}
Member:

This is a good call. However, how should we handle PATCH requests that time out, as in supabase/supabase#6813?

Member:

@darora Perhaps for this type of case, we can increase the Cloudflare timeout?

https://support.cloudflare.com/hc/en-us/articles/115003011431-Troubleshooting-Cloudflare-5XX-errors#524error

Enterprise customers can increase the 524 timeout up to 6000 seconds using the proxy_read_timeout API endpoint.

soedirgo (Member) commented May 17, 2022:

If PostgREST treats PATCH (and DELETE) as semantically idempotent it should be fine, but maybe safer to do it only for reads for now just to limit the impact, and note it as a TODO.

The linked issue seems odd though - I would've expected res.ok to handle it so that error is not null/undefined. I may need to repro it first

Member:

If PostgREST treats PATCH (and DELETE) as semantically idempotent it should be fine

@soedirgo They are idempotent if we stick to the definition in rfc7231:

A request method is considered "idempotent" if the intended EFFECT ON THE SERVER of multiple identical requests with that method is the same as the effect for a single such request.
It knows that repeating the request will have the same intended effect, even if the original request succeeded, though the response might differ.

The last paragraph (the response might be different) applies when using Prefer: return=representation: a DELETE would return contents the first time, but an empty array the next time it is tried on the same resource. Not sure if it would be safe to retry in those cases, as it could cause application errors?

The way around that would be to not force Prefer: return=representation (which, IIRC, is also needed for RLS policies to work as intended without return=minimal) on postgrest-js, and then retry only for cases which do not specify that header.

PATCH should be safe to retry on, even with Prefer: return=representation.

Member:

retry only for cases which do not specify that header

Yup sounds good

PATCH should be safe to retry on, even with Prefer: return=representation

Wouldn't it return different rows? I'm assuming return=representation returns the old rows

Member:

I guess it still may not be idempotent even with return=minimal, because you might have a trigger that itself isn't idempotent (extreme example: on DELETE charge a Stripe payment w/ pg_net)

Member:

Wouldn't it return different rows? I'm assuming return=representation returns the old rows

Hm, return=representation returns the new updated row. Or do you mean the row would have been modified by another PATCH request?

Member:

return=representation returns the new updated row

Ah I see, you can ignore that part
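The rule the thread converged on — always retry reads, and retry PATCH/DELETE only when the request does not specify Prefer: return=representation — could be sketched as the following hypothetical predicate. This is not code from the PR; the function name and signature are assumptions.

```typescript
// Hypothetical predicate capturing the rule discussed above: retry reads
// unconditionally; retry PATCH/DELETE only when the request does not ask
// for the affected rows back, since under Prefer: return=representation a
// repeated DELETE returns a different body than the first attempt.
function isRetryable(method: string, headers: Record<string, string>): boolean {
  const m = method.toUpperCase()
  // GET/HEAD are safe methods and always retryable.
  if (m === 'GET' || m === 'HEAD') return true
  // PATCH/DELETE are treated as semantically idempotent in the thread.
  if (m === 'PATCH' || m === 'DELETE') {
    const prefer = headers['Prefer'] ?? headers['prefer'] ?? ''
    return !prefer.includes('return=representation')
  }
  // POST and everything else: never retry automatically.
  return false
}
```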

steve-chavez (Member):

An update here: the PostgREST main branch has a db-pool-acquisition-timeout config which can be configured to convert all Cloudflare (or other proxy) timeout errors to a PostgREST error. Right now it responds with a generic 504 Gateway Timeout, but I'm thinking of changing it to a 529 Site is overloaded.
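For reference, that setting goes in the PostgREST configuration file; the value is in seconds, and the one shown here is purely illustrative:

```
# Give up waiting for a pool connection after 10 seconds (illustrative value)
db-pool-acquisition-timeout = 10
```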

References:

@darora darora closed this Sep 14, 2022
@soedirgo soedirgo deleted the da/retries branch September 14, 2022 05:54
3 participants