Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BOM in Response::text_with_charset #1897

Closed
ollyswanson opened this issue Jul 5, 2023 · 5 comments · Fixed by #1898
Closed

BOM in Response::text_with_charset #1897

ollyswanson opened this issue Jul 5, 2023 · 5 comments · Fixed by #1898
Labels
E-pr-welcome The feature is welcome to be added, instruction should be found in the issue.

Comments

@ollyswanson
Copy link
Contributor

ollyswanson commented Jul 5, 2023

Hi, just seeking a bit of clarification on whether including the byte order mark in decoded UTF-8 is intentional or not when calling Reponse::text & Response::text_with_charset.

        let (text, _, _) = encoding.decode(&full);
        if let Cow::Owned(s) = text {
            return Ok(s);
        }
        unsafe {
            // decoding returned Cow::Borrowed, meaning these bytes
            // are already valid utf8
            Ok(String::from_utf8_unchecked(full.to_vec()))
        }

By ignoring the Cow::Borrowed branch we're returning full which can include a BOM, whereas the str slice in the Cow::Borrowed returned by the call to decode strips the BOM.

This tripped us up when trying to deserialize an XML response after calling .text() and I'm unclear on where the responsibility for removing the BOM should live.

@liskin
Copy link

liskin commented Jul 5, 2023

This seems like a bug to me. Consider the following:

fn main() {
    for s in [&b"\xEF\xBB\xBFhelloo", &b"\xEF\xBB\xBFhello\xFF"] {
        let mut server = mockito::Server::new();
        let _mock = server.mock("GET", "/").with_body(s).create();

        println!(
            "{:?}",
            reqwest::blocking::get(server.url())
                .unwrap()
                .error_for_status()
                .unwrap()
                .text()
        );
    }
}

(cargo init, cargo add --features blocking reqwest , cargo add mockito, cargo run)

This outputs:

Ok("\u{feff}helloo")
Ok("hello�")

Surely whether the BOM gets or doesn't get stripped shouldn't depend on whether there were any replacements for invalid codepoints or not.

Here's a proposed fix:

diff --git a/src/async_impl/response.rs b/src/async_impl/response.rs
index 340e541..2251452 100644
--- a/src/async_impl/response.rs
+++ b/src/async_impl/response.rs
@@ -185,13 +185,9 @@ pub async fn text_with_charset(self, default_encoding: &str) -> crate::Result<St
         let full = self.bytes().await?;
 
         let (text, _, _) = encoding.decode(&full);
-        if let Cow::Owned(s) = text {
-            return Ok(s);
-        }
-        unsafe {
-            // decoding returned Cow::Borrowed, meaning these bytes
-            // are already valid utf8
-            Ok(String::from_utf8_unchecked(full.to_vec()))
+        match text {
+            Cow::Owned(s) => Ok(s),
+            Cow::Borrowed(s) => Ok(s.to_owned()),
         }
     }
 

I wonder why did this code ever bother doing the from_utf8_unchecked itself. encoding.decode has done it for at least 7 years. It's not like it was avoiding a copy either, .to_vec() does the copy just as .to_owned() does. I'm probably missing some Rust knowledge here though… 🙂

@liskin
Copy link

liskin commented Jul 5, 2023

Actually, why not just:

diff --git a/src/async_impl/response.rs b/src/async_impl/response.rs
index 340e541..eaa5b20 100644
--- a/src/async_impl/response.rs
+++ b/src/async_impl/response.rs
@@ -1,4 +1,3 @@
-use std::borrow::Cow;
 use std::fmt;
 use std::net::SocketAddr;
 use std::pin::Pin;
@@ -185,14 +184,7 @@ pub async fn text_with_charset(self, default_encoding: &str) -> crate::Result<St
         let full = self.bytes().await?;
 
         let (text, _, _) = encoding.decode(&full);
-        if let Cow::Owned(s) = text {
-            return Ok(s);
-        }
-        unsafe {
-            // decoding returned Cow::Borrowed, meaning these bytes
-            // are already valid utf8
-            Ok(String::from_utf8_unchecked(full.to_vec()))
-        }
+        Ok(text.into_owned())
     }
 
     /// Try to deserialize the response body as JSON.

That's the same thing as the match above.

@seanmonstar
Copy link
Owner

This was the original pull request: #256

I cannot tell why it was written that way. If there's no significant reason to leave it alone, then seems worth fixing.

@liskin
Copy link

liskin commented Jul 5, 2023

This was the original pull request: #256

I cannot tell why it was written that way. If there's no significant reason to leave it alone, then seems worth fixing.

It looks like you suggested that piece of code back then and the reason for doing so was to avoid a copy, and it indeed did avoid a copy back then because content was a Vec and String::from_utf8_unchecked took ownership of it. But it introduced the BOM bug unfortunately, and certainly it doesn't seem the BOM behaviour was intentional. Just an accident of copy avoidance done without the knowledge that the Cow::Borrowed may not be the full slice, but just a part of it.

A bunch of refactorings later, the current code no longer avoids a copy because it does full.to_vec() and to_vec() copies, so fixing the BOM bug won't introduce any copying. Whether there is still a way to avoid it, I don't know (yet), but we can sort that out later.

@seanmonstar
Copy link
Owner

Oh, indeed I did. I should have read a bit farther down that PR. I agree with your assessment.

@seanmonstar seanmonstar added the E-pr-welcome The feature is welcome to be added, instruction should be found in the issue. label Jul 5, 2023
ollyswanson added a commit to ollyswanson/reqwest that referenced this issue Jul 5, 2023
The byte order mark (BOM) is now stripped from utf-8 encoded response
bodies when calling `Response::text` and `Response::text_with_charset`.
This should prevent surprising behaviour when trying to use the returned
String.

Closes seanmonstar#1897
ollyswanson added a commit to ollyswanson/reqwest that referenced this issue Jul 5, 2023
The byte order mark (BOM) is now stripped from utf-8 encoded response
bodies when calling `Response::text` and `Response::text_with_charset`.
This should prevent surprising behaviour when trying to use the returned
String.

Closes seanmonstar#1897
ollyswanson added a commit to ollyswanson/reqwest that referenced this issue Jul 5, 2023
The byte order mark (BOM) is now stripped from utf-8 encoded response
bodies when calling `Response::text` and `Response::text_with_charset`.
This should prevent surprising behaviour when trying to use the returned
String.

Closes seanmonstar#1897
ollyswanson added a commit to ollyswanson/reqwest that referenced this issue Jul 6, 2023
The byte order mark (BOM) is now stripped from utf-8 encoded response
bodies when calling `Response::text` and `Response::text_with_charset`.
This should prevent surprising behaviour when trying to use the returned
String.

Closes seanmonstar#1897
seanmonstar pushed a commit that referenced this issue Jul 6, 2023
The byte order mark (BOM) is now stripped from utf-8 encoded response
bodies when calling `Response::text` and `Response::text_with_charset`.
This should prevent surprising behaviour when trying to use the returned
String.

Closes #1897
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
E-pr-welcome The feature is welcome to be added, instruction should be found in the issue.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants