I couldn't figure out how to nicely convert a streamable `reqwest::Response` into a `Readable` #48

asottile · 2022-09-11T19:21:13Z

this is likely because I am bad at Rust, but I struggled to get this working (though in theory it shouldn't be too difficult?)

I was hoping to be able to do something like:

let resp = reqwest::get(u).await?;

for token in html5gum::Tokenizer::new(&resp) { ... }

or even with `bytes_stream`:

let resp = reqwest::get(u).await?.bytes_stream();

for token in html5gum::Tokenizer::new(&resp) { ... }

I fell back on:

let resp = reqwest::get(u).await?.text().await?;

for token in html5gum::Tokenizer::new(&resp) { ... }

but I'm pretty sure that's not going to stream the response, and it's going to convert the body back and forth (Bytes -> String -> Vec -> String) unnecessarily

untitaker · 2022-09-11T19:59:11Z

yup that's a known problem. html5gum ideally would accept:

- async streams (or whatever other async support there is, see "impl Reader for async?" #3)
- iterators of characters, bytes, string chunks, whatever

then additionally, html5gum should ideally return String tokens (not Bytes), if the input was already guaranteed valid utf-8. however, that's a lot of trait magic i have to do.
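For the "iterators of bytes / string chunks" case, one possible bridge is a small adapter that exposes an iterator of byte chunks as a blocking `std::io::Read`. This is only a sketch: `ChunkReader` is a hypothetical name and not part of html5gum or reqwest, the chunks are plain `Vec<u8>` values standing in for whatever a network stream would yield, and whether the tokenizer accepts an arbitrary `Read` in your version is an assumption to check against the crate docs.

```rust
use std::io::{self, Read};

/// Hypothetical adapter (not part of html5gum or reqwest): exposes an
/// iterator of byte chunks as a blocking `std::io::Read`.
struct ChunkReader<I: Iterator<Item = Vec<u8>>> {
    chunks: I,
    current: Vec<u8>, // chunk currently being drained
    pos: usize,       // read offset into `current`
}

impl<I: Iterator<Item = Vec<u8>>> ChunkReader<I> {
    fn new(chunks: I) -> Self {
        ChunkReader { chunks, current: Vec::new(), pos: 0 }
    }
}

impl<I: Iterator<Item = Vec<u8>>> Read for ChunkReader<I> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        // Skip exhausted (or empty) chunks until there is data or the source ends.
        while self.pos == self.current.len() {
            match self.chunks.next() {
                Some(next) => {
                    self.current = next;
                    self.pos = 0;
                }
                None => return Ok(0), // end of input
            }
        }
        // Copy as much of the current chunk as fits into the caller's buffer.
        let n = buf.len().min(self.current.len() - self.pos);
        buf[..n].copy_from_slice(&self.current[self.pos..self.pos + n]);
        self.pos += n;
        Ok(n)
    }
}

fn main() -> io::Result<()> {
    // Stand-in for chunks arriving from the network; the empty chunk is deliberate.
    let chunks = vec![b"<p>hel".to_vec(), Vec::new(), b"lo</p>".to_vec()];
    let mut reader = ChunkReader::new(chunks.into_iter());
    let mut html = String::new();
    reader.read_to_string(&mut html)?;
    println!("{html}");
    Ok(())
}
```

Note that this only covers the synchronous case. An async stream would still need something to drive it (for example, a blocking bridge or a dedicated async reader), which is the part the issue linked above is about.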

i think it's likely that, if #47 #21 ever lands, i'll revamp the I/O setup significantly. a lot of the inflexibility i introduced in html5gum was based around performance improvements i can make if the entire input stream is available as a contiguous block of bytes in memory

i don't think buffering up all input in memory is strictly worse for performance. most html documents should fit into your I/O buffer, and in my experience you save quite a bit of branching when passing a string into the tokenizer vs passing in a File object (even with a massive I/O buffer size)
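A rough sketch of the "buffer everything first" approach, using only the standard library: collect the chunks into one contiguous `Vec<u8>` and validate UTF-8 once at the end instead of per chunk. `buffer_chunks` is a hypothetical helper, and the hard-coded chunks stand in for a response body. With reqwest, the already-buffered equivalent would be `Response::bytes()` rather than `text()`.

```rust
/// Hypothetical helper: concatenates byte chunks into one contiguous buffer.
fn buffer_chunks<I>(chunks: I) -> Vec<u8>
where
    I: IntoIterator<Item = Vec<u8>>,
{
    let mut buf = Vec::new();
    for chunk in chunks {
        buf.extend_from_slice(&chunk);
    }
    buf
}

fn main() {
    // The two bytes of "é" (0xC3 0xA9) are deliberately split across chunks:
    // neither chunk is valid UTF-8 on its own, but the joined buffer is.
    let chunks = vec![b"<p>caf\xC3".to_vec(), b"\xA9</p>".to_vec()];
    let body = buffer_chunks(chunks);

    // Validate once over the whole buffer; no copy is made on success.
    match std::str::from_utf8(&body) {
        Ok(text) => println!("valid UTF-8: {text}"),
        Err(err) => println!("not valid UTF-8: {err}"),
    }
}
```

Validating after buffering avoids having to handle multi-byte characters that straddle a chunk boundary, and the resulting `&str` (or the raw byte slice) is the kind of contiguous in-memory input described above.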

i was curious what you were working on. seems like it's some sort of improvement to pip? I think pip already does not use a fully spec-compliant HTML5 parser. And since you're parsing literally only one webpage from a single party, I suspect your range of possible HTML "dialects" you have to deal with will be very limited, so a custom parser or quick-xml might work fine (and probably quicker too)

asottile · 2022-09-13T16:15:13Z

thanks for the advice!

untitaker added the `question` (Further information is requested) label · Sep 12, 2022

asottile closed this as completed Sep 13, 2022
