Delete content-encoding header after decompressing #190

Conversation

tanguilp
Contributor

When a response is decompressed, we should remove the content-encoding header, because downstream steps may be confused by an inconsistent %Req.Response{} struct where:

  • the body is uncompressed
  • but the content-encoding header says otherwise
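
For illustration, here is a minimal sketch of what the step could do, handling only the gzip case and assuming headers are stored as a list of {name, value} tuples (not the actual implementation):

# Sketch only: gunzip the body and drop the now-stale content-encoding header
# so downstream steps see a consistent response.
defp decompress_gzip(%Req.Response{} = response) do
  case List.keyfind(response.headers, "content-encoding", 0) do
    {"content-encoding", "gzip"} ->
      %{
        response
        | body: :zlib.gunzip(response.body),
          headers: List.keydelete(response.headers, "content-encoding", 0)
      }

    _other ->
      response
  end
end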

@wojtekmach
Owner

Thank you for the PR. This is the first time a built-in step changes headers, so I want to be super careful before setting that precedent. I think it's fine conceptually; after all, we have steps that change response.body.

Do you know if any other http clients do this?

Does decode_body remove content-encoding?

@tanguilp
Contributor Author

Tesla doesn't seem to do this. I'm not a Rubyist but I've found a library that does so.

And you might have noticed that we forgot to recalculate the content-length! So yes, changing headers seems to be normal.

I'll submit a PR for updating content-length.
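
Roughly, that step could look like this (an illustrative sketch, not the actual PR; it again assumes headers stored as a list of tuples):

# Sketch only: keep content-length consistent with a body that a step rewrote.
defp put_content_length(%Req.Response{} = response) do
  length = response.body |> IO.iodata_length() |> Integer.to_string()
  headers = List.keystore(response.headers, "content-length", 0, {"content-length", length})
  %{response | headers: headers}
end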

I'm trying to implement something similar to https://github.com/tanguilp/tesla_http_cache for Req, and this issue with the content-encoding not matching the body encoding somehow confuses http_cache, but I'm not sure why. Going to investigate.

Does decode_body remove content-encoding?
It doesn't, as far as I can tell, but format relies on content-type rather than content-encoding, so that's a different issue.

I'd be cautious about decompressing compressed content by default: https://fly.io/phoenix-files/can-phoenix-safely-use-the-zip-module/
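
One way to bound it is to inflate in chunks and give up past a size limit, for example with :zlib.safeInflate/2 (a sketch only; the 50 MB cap is arbitrary):

# Sketch: decompress gzip incrementally and abort once the output exceeds a limit,
# so a malicious "gzip bomb" response cannot exhaust memory.
defp gunzip_bounded(compressed, limit \\ 50_000_000) do
  z = :zlib.open()
  # window bits 16 + 15 tells zlib to expect a gzip wrapper
  :ok = :zlib.inflateInit(z, 16 + 15)

  try do
    inflate_loop(z, :zlib.safeInflate(z, compressed), [], 0, limit)
  after
    :zlib.close(z)
  end
end

defp inflate_loop(_z, {:finished, output}, acc, size, limit) do
  if size + IO.iodata_length(output) > limit,
    do: {:error, :too_large},
    else: {:ok, IO.iodata_to_binary([acc, output])}
end

defp inflate_loop(z, {:continue, output}, acc, size, limit) do
  size = size + IO.iodata_length(output)

  if size > limit,
    do: {:error, :too_large},
    else: inflate_loop(z, :zlib.safeInflate(z, []), [acc, output], size, limit)
end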

@tanguilp
Contributor Author

I can confirm the issue:

  • on the first request, the response is returned uncompressed and then cached by the custom ReqHTTPCache plugin I'm currently working on. The body is uncompressed, but content-encoding is still set to gzip
  • on the second request, the cached response is returned by ReqHTTPCache and the response steps start executing
  • &Req.Steps.decompress_body/1 is called and tries to decompress the already-uncompressed body, because content-encoding is still set to gzip

What I could do is halt processing with Req.Request.halt/1. However, that might not return the expected result (for instance, when the response is returned from cache, the JSON payload would not be decoded).
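
For context, halting would look roughly like this in the cache-read request step (a simplified sketch; lookup_cache/1 is a placeholder for the actual http_cache lookup):

# Simplified sketch of the read_from_http_cache request step. Returning a halted
# request together with the cached response would skip *all* remaining steps,
# which is why e.g. decode_body would not run on the cached payload.
defp read_from_http_cache(request) do
  case lookup_cache(request) do
    {:ok, cached_response} ->
      {Req.Request.halt(request), cached_response}

    :miss ->
      request
  end
end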

For reference, here is the request struct with its steps:

%Req.Request{
  method: :get,
  url: URI.parse(""),
  headers: [],
  body: nil,
  options: %{http_cache: %{store: :http_cache_store_process}},
  registered_options: MapSet.new([:finch, :location_trusted, :path_params,
   :pool_timeout, :raw, :user_agent, :cache, :form, :redirect_log_level,
   :max_redirects, :http_errors, :range, :decode_json, :follow_redirects,
   :output, :base_url, :compress_body, :compressed, :retry, :plug, :retry_delay,
   :retry_log_level, :finch_request, :decode_body, :json, :http_cache,
   :max_retries, :receive_timeout, :cache_dir, :params, :connect_options,
   :extract, :unix_socket, :auth]),
  halted: false,
  adapter: &Req.Steps.run_finch/1,
  request_steps: [
    put_user_agent: &Req.Steps.put_user_agent/1,
    compressed: &Req.Steps.compressed/1,
    encode_body: &Req.Steps.encode_body/1,
    put_base_url: &Req.Steps.put_base_url/1,
    auth: &Req.Steps.auth/1,
    put_params: &Req.Steps.put_params/1,
    put_path_params: &Req.Steps.put_path_params/1,
    put_range: &Req.Steps.put_range/1,
    cache: &Req.Steps.cache/1,
    put_plug: &Req.Steps.put_plug/1,
    compress_body: &Req.Steps.compress_body/1,
    read_from_http_cache: #Function<0.104066892/1 in ReqHTTPCache."-fun.read_from_http_cache/1-">
  ],
  response_steps: [
    retry: &Req.Steps.retry/1,
    follow_redirects: &Req.Steps.follow_redirects/1, 
    decompress_body: &Req.Steps.decompress_body/1,
    decode_body: &Req.Steps.decode_body/1,
    handle_http_errors: &Req.Steps.handle_http_errors/1,
    output: &Req.Steps.output/1,
    cache_response: #Function<1.104066892/1 in ReqHTTPCache."-fun.cache_response/1-">
  ],
  error_steps: [retry: &Req.Steps.retry/1],
  private: %{}
}

I'll update this PR later.

@tanguilp tanguilp marked this pull request as draft June 20, 2023 10:12
@wojtekmach
Owner

Thank you! Please un-draft it and mention the behaviour in docs. :)

@tanguilp tanguilp force-pushed the fix_decompress_and_delete_content-encoding branch from 639c035 to 2fe1380 on June 23, 2023 13:45
@tanguilp
Contributor Author

I've rebased this branch on top of the one to be merged in #192, which is why I'll wait for that to be merged before marking this PR as ready.

@wojtekmach
Owner

@sneako @ericmj I have a favour to ask: do you think this (and #192) makes sense, i.e. some steps changing some headers? I think that in these particular cases, adjusting content-length and removing content-encoding makes sense. But I wouldn't change content-type or any other headers at the moment. Just double-checking. And no rush. :)

@tanguilp tanguilp marked this pull request as ready for review June 26, 2023 08:38
@tanguilp tanguilp marked this pull request as draft June 26, 2023 08:38
@sneako
Contributor

sneako commented Jun 26, 2023

Hey! I can see the reasoning behind this idea, but it looks like curl does not do this, and IIRC we have been following its conventions when in doubt.

curl https://api.github.com/repos/wojtekmach/req -v
Returns the uncompressed body with content-length: 5340 and no content-encoding

curl -H 'accept-encoding: gzip' https://api.github.com/repos/wojtekmach/req -v
Returns the compressed body with content-length: 1366 and content-encoding: gzip

curl https://api.github.com/repos/wojtekmach/req -v --compressed
Returns the decompressed body with content-length: 1366 and content-encoding: gzip

@yordis

yordis commented Aug 10, 2023

Hey @teamon, just wanted to give you an update on the situation in Tesla PR 606.

I've already created a PR, but I need your help with updating the headers, since the middleware modifies the response body. It's really important that the headers stay consistent with the transformed body.

I know @sneako mentioned using curl, but we have to keep in mind that it's not on the same level of composition as Req and Tesla. We shouldn't compromise on data integrity.

@wojtekmach
Owner

I couldn't find a definitive reference for this behaviour (if someone does, please chime in!), but empirically it is also found in the Fetch API:

$ deno
> let res = await fetch("http://httpbin.org/gzip"); [res.headers, await res.json()]
[
  Headers {
  "access-control-allow-credentials": "true",
  "access-control-allow-origin": "*",
  connection: "keep-alive",
  "content-type": "application/json",
  date: "Fri, 18 Aug 2023 08:31:38 GMT",
  server: "gunicorn/19.9.0"
},
  {
    gzipped: true,
    headers: {
      Accept: "*/*",
      "Accept-Encoding": "gzip, br",
      "Accept-Language": "*",
      Host: "httpbin.org",
      "User-Agent": "Deno/1.36.1",
      "X-Amzn-Trace-Id": "Root=1-64df2c69-7b747ad735dcd8e0696ade8f"
    },
    method: "GET",
    origin: "89.73.7.111"
  }
]

And in reqwest:

fn main() -> Result<(), reqwest::Error> {
    let res = reqwest::blocking::Client::builder()
        .gzip(true)
        .build()?
        .get("http://httpbin.org/gzip")
        .send()?;

    eprintln!("Response: {:?} {}", res.version(), res.status());
    eprintln!("Headers: {:#?}\n", res.headers());
    println!("{}", res.text()?);
    Ok(())
}
$ cargo run --features gzip,blocking --example simple http://httpbin.org/gzip
Response: HTTP/1.1 200 OK
Headers: {
    "date": "Fri, 18 Aug 2023 08:31:09 GMT",
    "content-type": "application/json",
    "access-control-allow-origin": "*",
    "connection": "keep-alive",
    "server": "gunicorn/19.9.0",
    "access-control-allow-credentials": "true",
}

{
  "gzipped": true,
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip",
    "Host": "httpbin.org",
    "X-Amzn-Trace-Id": "Root=1-64df2c4b-69c3bc432e4a14781b0d8a33"
  },
  "method": "GET",
  "origin": "89.73.7.111"
}

It's even documented as such in reqwest's gzip() method:

Enable auto gzip decompression by checking the Content-Encoding response header.

If auto gzip decompression is turned on:

When sending a request and if the request’s headers do not already contain an Accept-Encoding and Range values, the Accept-Encoding header is set to gzip. The request body is not automatically compressed.

When receiving a response, if its headers contain a Content-Encoding value of gzip, both Content-Encoding and Content-Length are removed from the headers’ set. The response body is automatically decompressed.

The last bit in the docs is very interesting: both Fetch and reqwest remove the content-length header. I was able to find some reference for this in RFC 9112 § 6.3:

An intermediary that chooses to forward the message MUST first remove the received Content-Length field and process the Transfer-Encoding (as described below) prior to forwarding the message downstream.

It's about Transfer-Encoding, not Content-Encoding, but I'd expect the semantics to be similar; it of course needs double-checking. Obviously this is relevant for #192.
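
Concretely, with both changes in place I'd expect something along these lines (illustrative, not verified):

# Expected behaviour sketch: the body arrives gzipped, decompress_body inflates it,
# and no stale content-encoding header is left behind.
resp = Req.get!("http://httpbin.org/gzip")

resp.body["gzipped"]
#=> true

List.keyfind(resp.headers, "content-encoding", 0)
#=> nil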

@tanguilp I'm keen on moving forward with this. Apologies for the delay, but it might have been worth it; some of the references mentioned here might be useful to others. Please un-draft this!

@tanguilp
Contributor Author

@tanguilp I'm keen on moving forward with this. Apologies for the delay, but it might have been worth it; some of the references mentioned here might be useful to others. Please un-draft this!

No worries, I'm on holiday myself. I'll take a look next week. Interesting references, thanks 🙏

@wojtekmach
Owner

Yup no rush, enjoy your holiday!

@wojtekmach
Owner

@tanguilp I'm clearing the backlog and have cherry-picked this branch with minor modifications as ac45a34. I've also added 1947528 on top. Thank you!

@wojtekmach wojtekmach closed this Aug 25, 2023
@tanguilp
Contributor Author

Thanks! Funny thing is we were working on it at the exact same moment this morning, up to the same function names. Imagine my surprise when I tried to rebase :D

Thanks again and keep up the great work!

@wojtekmach
Owner

oh no haha, sorry about that!
