BOM in `Response::text_with_charset` #1897

ollyswanson · 2023-07-05T11:23:49Z

Hi, just seeking a bit of clarification on whether including the byte order mark in decoded UTF-8 is intentional or not when calling Reponse::text & Response::text_with_charset.

        let (text, _, _) = encoding.decode(&full);
        if let Cow::Owned(s) = text {
            return Ok(s);
        }
        unsafe {
            // decoding returned Cow::Borrowed, meaning these bytes
            // are already valid utf8
            Ok(String::from_utf8_unchecked(full.to_vec()))
        }

By ignoring the Cow::Borrowed branch we're returning full which can include a BOM, whereas the str slice in the Cow::Borrowed returned by the call to decode strips the BOM.

This tripped us up when trying to deserialize an XML response after calling .text() and I'm unclear on where the responsibility for removing the BOM should live.

The text was updated successfully, but these errors were encountered:

liskin · 2023-07-05T12:24:55Z

This seems like a bug to me. Consider the following:

fn main() {
    for s in [&b"\xEF\xBB\xBFhelloo", &b"\xEF\xBB\xBFhello\xFF"] {
        let mut server = mockito::Server::new();
        let _mock = server.mock("GET", "/").with_body(s).create();

        println!(
            "{:?}",
            reqwest::blocking::get(server.url())
                .unwrap()
                .error_for_status()
                .unwrap()
                .text()
        );
    }
}

(cargo init, cargo add --features blocking reqwest , cargo add mockito, cargo run)

This outputs:

Ok("\u{feff}helloo")
Ok("hello�")

Surely whether the BOM gets or doesn't get stripped shouldn't depend on whether there were any replacements for invalid codepoints or not.

Here's a proposed fix:

diff --git a/src/async_impl/response.rs b/src/async_impl/response.rs
index 340e541..2251452 100644
--- a/src/async_impl/response.rs
+++ b/src/async_impl/response.rs
@@ -185,13 +185,9 @@ pub async fn text_with_charset(self, default_encoding: &str) -> crate::Result<St
         let full = self.bytes().await?;
 
         let (text, _, _) = encoding.decode(&full);
-        if let Cow::Owned(s) = text {
-            return Ok(s);
-        }
-        unsafe {
-            // decoding returned Cow::Borrowed, meaning these bytes
-            // are already valid utf8
-            Ok(String::from_utf8_unchecked(full.to_vec()))
+        match text {
+            Cow::Owned(s) => Ok(s),
+            Cow::Borrowed(s) => Ok(s.to_owned()),
         }
     }

I wonder why did this code ever bother doing the from_utf8_unchecked itself. encoding.decode has done it for at least 7 years. It's not like it was avoiding a copy either, .to_vec() does the copy just as .to_owned() does. I'm probably missing some Rust knowledge here though… 🙂

liskin · 2023-07-05T12:27:09Z

Actually, why not just:

diff --git a/src/async_impl/response.rs b/src/async_impl/response.rs
index 340e541..eaa5b20 100644
--- a/src/async_impl/response.rs
+++ b/src/async_impl/response.rs
@@ -1,4 +1,3 @@
-use std::borrow::Cow;
 use std::fmt;
 use std::net::SocketAddr;
 use std::pin::Pin;
@@ -185,14 +184,7 @@ pub async fn text_with_charset(self, default_encoding: &str) -> crate::Result<St
         let full = self.bytes().await?;
 
         let (text, _, _) = encoding.decode(&full);
-        if let Cow::Owned(s) = text {
-            return Ok(s);
-        }
-        unsafe {
-            // decoding returned Cow::Borrowed, meaning these bytes
-            // are already valid utf8
-            Ok(String::from_utf8_unchecked(full.to_vec()))
-        }
+        Ok(text.into_owned())
     }
 
     /// Try to deserialize the response body as JSON.

That's the same thing as the match above.

seanmonstar · 2023-07-05T12:32:21Z

This was the original pull request: #256

I cannot tell why it was written that way. If there's no significant reason to leave it alone, then seems worth fixing.

liskin · 2023-07-05T12:55:05Z

This was the original pull request: #256

I cannot tell why it was written that way. If there's no significant reason to leave it alone, then seems worth fixing.

It looks like you suggested that piece of code back then and the reason for doing so was to avoid a copy, and it indeed did avoid a copy back then because content was a Vec and String::from_utf8_unchecked took ownership of it. But it introduced the BOM bug unfortunately, and certainly it doesn't seem the BOM behaviour was intentional. Just an accident of copy avoidance done without the knowledge that the Cow::Borrowed may not be the full slice, but just a part of it.

A bunch of refactorings later, the current code no longer avoids a copy because it does full.to_vec() and to_vec() copies, so fixing the BOM bug won't introduce any copying. Whether there is still a way to avoid it, I don't know (yet), but we can sort that out later.

seanmonstar · 2023-07-05T13:05:58Z

Oh, indeed I did. I should have read a bit farther down that PR. I agree with your assessment.

The byte order mark (BOM) is now stripped from utf-8 encoded response bodies when calling `Response::text` and `Response::text_with_charset`. This should prevent surprising behaviour when trying to use the returned String. Closes seanmonstar#1897

The byte order mark (BOM) is now stripped from utf-8 encoded response bodies when calling `Response::text` and `Response::text_with_charset`. This should prevent surprising behaviour when trying to use the returned String. Closes #1897

seanmonstar added the E-pr-welcome The feature is welcome to be added, instruction should be found in the issue. label Jul 5, 2023

ollyswanson mentioned this issue Jul 5, 2023

fix: strip BOM in Response::text_with_charset #1898

Merged

seanmonstar closed this as completed in #1898 Jul 6, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BOM in `Response::text_with_charset` #1897

BOM in `Response::text_with_charset` #1897

ollyswanson commented Jul 5, 2023 •

edited

liskin commented Jul 5, 2023

liskin commented Jul 5, 2023

seanmonstar commented Jul 5, 2023

liskin commented Jul 5, 2023

seanmonstar commented Jul 5, 2023

BOM in Response::text_with_charset #1897

BOM in Response::text_with_charset #1897

Comments

ollyswanson commented Jul 5, 2023 • edited

liskin commented Jul 5, 2023

liskin commented Jul 5, 2023

seanmonstar commented Jul 5, 2023

liskin commented Jul 5, 2023

seanmonstar commented Jul 5, 2023

BOM in `Response::text_with_charset` #1897

BOM in `Response::text_with_charset` #1897

ollyswanson commented Jul 5, 2023 •

edited