I couldn't figure out how to nicely convert a streamable reqwest::Response into a Readable #48

Closed
asottile opened this issue Sep 11, 2022 · 2 comments
Labels
question Further information is requested

Comments

@asottile

this is likely because I am bad at Rust, but I struggled to get this working (though in theory it shouldn't be too difficult?)

I was hoping to be able to do something like:

let resp = reqwest::get(u).await?;

for token in html5gum::Tokenizer::new(&resp) { ... }

or even with bytes_stream:

let resp = reqwest::get(u).await?.bytes_stream();

for token in html5gum::Tokenizer::new(&resp) { ... }

I fell back on:

let resp = reqwest::get(u).await?.text().await?;

for token in html5gum::Tokenizer::new(&resp) { ... }

but I'm pretty sure that's not going to stream the response, and it's going to convert the body back and forth (Bytes -> String -> Vec -> String) unnecessarily
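
For reference, the buffering fallback does not have to round-trip through several owned types. Below is a minimal sketch using only the standard library, assuming the body arrives as a sequence of byte chunks. `buffer_chunks` and `as_text` are hypothetical helper names, and the `chunks` vector merely stands in for whatever the HTTP client yields; it is not reqwest's API. The bytes are copied once into a contiguous buffer and then borrowed as `&str` after a single UTF-8 validation pass.

```rust
use std::str::Utf8Error;

/// Concatenate byte chunks into one contiguous buffer (each byte is copied once).
/// Hypothetical helper, not part of reqwest or html5gum.
fn buffer_chunks<I>(chunks: I) -> Vec<u8>
where
    I: IntoIterator<Item = Vec<u8>>,
{
    let mut buf = Vec::new();
    for chunk in chunks {
        buf.extend_from_slice(&chunk);
    }
    buf
}

/// Borrow the buffer as `&str`: validates UTF-8 once and does not copy.
fn as_text(buf: &[u8]) -> Result<&str, Utf8Error> {
    std::str::from_utf8(buf)
}

fn main() {
    // Made-up chunks standing in for a response body.
    let chunks = vec![b"<p>hel".to_vec(), b"lo</p>".to_vec()];
    let buf = buffer_chunks(chunks);
    let text = as_text(&buf).expect("body was not valid UTF-8");
    assert_eq!(text, "<p>hello</p>");
}
```

The resulting `&str` could then be handed to a tokenizer that accepts string slices, as the fallback snippet above does with `&resp`. This still buffers the whole body; it only avoids the extra conversions.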

@untitaker
Owner

yup that's a known problem. html5gum ideally would accept:

  1. async streams (or whatever other async support there is, see #3, "impl Reader for async?")
  2. iterators of characters, bytes, string chunks, whatever

then additionally, html5gum should ideally return String tokens (not Bytes), if the input was already guaranteed valid utf-8. however, that's a lot of trait magic i have to do.
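
As a rough illustration of item 2 (an iterator of byte chunks), here is a sketch of adapting such an iterator to `std::io::Read` using only the standard library. `ChunkReader` is a hypothetical type, not something html5gum or reqwest provides, and whether the tokenizer can consume a `Read` implementor directly or only through a wrapper type is an assumption to check against html5gum's documentation. It covers the blocking case only; an async stream (item 1) would need a different bridge.

```rust
use std::io::{self, Read};

/// Adapts an iterator of byte chunks to `std::io::Read`.
/// Hypothetical type for illustration only.
struct ChunkReader<I> {
    chunks: I,
    current: Vec<u8>, // chunk currently being drained
    pos: usize,       // read offset into `current`
}

impl<I: Iterator<Item = Vec<u8>>> ChunkReader<I> {
    fn new(chunks: I) -> Self {
        ChunkReader { chunks, current: Vec::new(), pos: 0 }
    }
}

impl<I: Iterator<Item = Vec<u8>>> Read for ChunkReader<I> {
    fn read(&mut self, out: &mut [u8]) -> io::Result<usize> {
        // Pull chunks until one has unread bytes; empty chunks are skipped.
        while self.pos == self.current.len() {
            match self.chunks.next() {
                Some(chunk) => {
                    self.current = chunk;
                    self.pos = 0;
                }
                None => return Ok(0), // end of input
            }
        }
        // Copy as much of the current chunk as fits into `out`.
        let n = out.len().min(self.current.len() - self.pos);
        out[..n].copy_from_slice(&self.current[self.pos..self.pos + n]);
        self.pos += n;
        Ok(n)
    }
}

fn main() -> io::Result<()> {
    // Made-up chunks, including an empty one, standing in for a response body.
    let chunks = vec![b"<p>hel".to_vec(), Vec::new(), b"lo</p>".to_vec()];
    let mut reader = ChunkReader::new(chunks.into_iter());
    let mut body = String::new();
    reader.read_to_string(&mut body)?;
    assert_eq!(body, "<p>hello</p>");
    Ok(())
}
```

Only one chunk is held in memory at a time, so the reader never needs the whole document as a contiguous block. That is the trade-off against the contiguous-buffer approach discussed in this thread.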

i think it's likely that, if #47 #21 ever lands, i'll revamp the I/O setup significantly. a lot of the inflexibility i introduced in html5gum was based around performance improvements i can make if the entire input stream is available as a contiguous block of bytes in memory

i don't think buffering up all input in memory is strictly worse for performance. most html documents should fit into your I/O buffer, and in my experience you save quite a bit of branching when passing a string into the tokenizer vs passing in a File object (even with a massive I/O buffer size)

i was curious what you were working on. seems like it's some sort of improvement to pip? I think pip already does not use a fully spec-compliant HTML5 parser. And since you're parsing literally only one webpage from a single party, I suspect your range of possible HTML "dialects" you have to deal with will be very limited, so a custom parser or quick-xml might work fine (and probably quicker too)

@untitaker added the "question" (Further information is requested) label on Sep 12, 2022
@asottile
Author

thanks for the advice !
