#21467: Process blob url in chunks instead of only at end #21553
Conversation
highfive commented Aug 31, 2018

Thanks for the pull request, and welcome! The Servo team is excited to review your changes, and you should hear from @nox (or someone else) soon.
highfive commented Aug 31, 2018

Heads up! This PR modifies the following files:
jdm commented Aug 31, 2018

We'll want to make use of the same setup that I described in my review in #21560.
So I'm working through this right now and am noticing that most of the steps you listed in the other review are already done, albeit in a different way. A new channel is created and a new thread is spawned, although the channel isn't stored in done_chan and the response body is never changed to Receiving. In addition, if I'm interpreting the get_blob_buf method in filemanager_thread.rs correctly, it seems like it's sending the blob buffer every 8 KB. If I appropriately change the response body to Receiving and store the channel in done_chan, is that all that needs to be done? Am I incorrectly reading the code?

(Referenced: servo/components/net/blob_loader.rs, lines 31 to 35 at 6aab6dd; servo/components/net/filemanager_thread.rs, lines 74 to 86 at 6aab6dd.)

Edit: Also, what's a good way to test my changes?
@jdm How is the code currently not working in an async manner? I'm worried that if I rewrite the load_blob_sync function in the way you outlined in #21560, it will keep the same issues that are currently in place, as every step you gave except storing in done_chan is already being done. In addition, should it even be a rewrite of the load_blob_sync function, or a modification of get_blob_buf in filemanager_thread.rs? Cutting out the call to read_file in load_blob_sync would create a lot of dead code, yeah?
Terribly sorry, I meant to reply to your previous comment and got distracted. You're not wrong that the code that load_blob_sync calls is written to support asynchronous loading, but as written we need to perform some surgery to extend that async loading to the fetch code. If you try to store the channel in done_chan, rustc will complain that the types don't match, since Filemanager::read_file sends blob-specific messages. The tricky bit is the end of load_blob_sync, which has the synchronous loop that receives the blob parts and stores them in the response body. We have two options: either we create a new thread which contains that loop and sends the appropriate data over done_chan as it is received, or we remove most of load_blob_sync and add a new API to Filemanager that sends the done_chan messages directly. I'm inclined toward the second option, which will be more efficient.
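The second option could look roughly like the following. This is a hedged, self-contained sketch using std's mpsc channel and a generic reader; `Data`, `fetch_chunks`, and the 8 KB chunk size are stand-ins for Servo's actual types and constants, not the real definitions:

```rust
use std::io::Read;
use std::sync::mpsc;
use std::thread;

// Stand-in for the message type carried by done_chan.
enum Data {
    Payload(Vec<u8>),
    Done,
}

const CHUNK_SIZE: usize = 8 * 1024;

// Sketch of a fetch_file-style API: a reader thread sends each chunk
// directly over the sender instead of accumulating them into the
// response body at the end.
fn fetch_chunks<R: Read + Send + 'static>(mut source: R, sender: mpsc::Sender<Data>) {
    thread::spawn(move || loop {
        let mut buf = vec![0; CHUNK_SIZE];
        match source.read(&mut buf) {
            Ok(0) => {
                // End of input: signal completion and stop.
                let _ = sender.send(Data::Done);
                break;
            }
            Ok(n) => {
                buf.truncate(n);
                let _ = sender.send(Data::Payload(buf));
            }
            Err(_) => break,
        }
    });
}
```

The fetch side would then stash the receiver in done_chan and mark the body Receiving, leaving the reader thread to stream chunks as they arrive.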
To test your changes, running
Ok awesome, thanks for the response! The second option definitely seems like the better way to go ... in that scenario then there'd only be like a

Edit: Wait, you mean the `filemanager.read_file(...)` stays, but there's a method somewhere in the filemanager that sends the done_chan messages? So then while in chunked_read() / get_blob_buf(), as it's reading through the blob, it's also sending messages to the done_chan?
I was imagining that instead of calling filemanager.read_file we would call a new filemanager.fetch_file which would accept a sender that matches the receiver for done_chan.
It's possible we might be able to create a trait (or enum) that encompasses both kinds of channels to avoid duplicating a bunch of code.
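The trait idea might be sketched like this. `ChunkSink`, `VecSink`, and `read_in_chunks` are invented names for illustration; in Servo, one impl would presumably wrap the IpcSender used by cross-process callers and another the in-process done_chan sender:

```rust
use std::io::Read;
use std::sync::{mpsc, Mutex};

// Hypothetical trait unifying the two sender kinds, so one read loop
// can feed either channel.
trait ChunkSink {
    fn send_chunk(&self, bytes: Vec<u8>) -> Result<(), ()>;
}

// In-process case: an mpsc sender (Servo's done_chan side).
impl ChunkSink for mpsc::Sender<Vec<u8>> {
    fn send_chunk(&self, bytes: Vec<u8>) -> Result<(), ()> {
        self.send(bytes).map_err(|_| ())
    }
}

// A second sink, used here only to show that one loop serves multiple
// channel kinds; the real second impl would wrap an IpcSender.
struct VecSink(Mutex<Vec<u8>>);

impl ChunkSink for VecSink {
    fn send_chunk(&self, bytes: Vec<u8>) -> Result<(), ()> {
        self.0.lock().map_err(|_| ())?.extend(bytes);
        Ok(())
    }
}

// One chunked read loop, generic over the sink kind.
fn read_in_chunks<R: Read, S: ChunkSink>(mut src: R, sink: &S) -> Result<(), ()> {
    let mut buf = [0u8; 8192];
    loop {
        match src.read(&mut buf) {
            Ok(0) => return Ok(()),
            Ok(n) => sink.send_chunk(buf[..n].to_vec())?,
            Err(_) => return Err(()),
        }
    }
}
```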
Ok, I'll look into creating an enum that works for accepting both types of channels. If the refactoring work gets to be a bit much, I'll let you know. Why are two different types of channels being used again? Is there a specific reason the filemanager uses IPC channels and done_chan uses mpsc channels?
The filemanager is also used by code that can exist in other processes, like https://github.com/servo/servo/blob/master/components/script/dom/blob.rs#L304-L329, so it needs to support IPC communication. The network loading code lives in the same process, so mpsc channels are fine.
Sorry for the week-long silence; I've been moving since last weekend. I tried to refactor the code to use a trait for the .read_file(...) call, but it's not working, since I don't see a way to avoid passing the response into the read_file(...) call as well (since chunked_read will eventually be the one to move the response from Receiving -> Done now). Should I just create the fetch_file(..., response) function and duplicate a lot of code? That feels like the lazy way to do this, but I don't see another way right now.
I would start with duplicating the code; then either it will be easier for you to see ways to reduce the duplication, or it will be easier for your reviewer to see them.
Ok, much later than I thought, but initial changes for asynchronous loading are in. Currently
First of all, I want to thank you for tackling this. It's definitely more complex than I assumed based on the original issue that I filed, and I can see why making the fetch model integrate with the existing implementation is awkward. I've made some comments about the state of these changes today; I see two paths forward. In one, we fully embrace code duplication and create a model like this:

I think this will leave us with an implementation that is easier to read when integrating it with the fetch code. I have ideas about how to unify this new async blob fetching with the existing code, but it requires redesigning some of the existing code. I'll file a separate issue describing my proposed design for whoever would like to work on it.
```rust
        let _ = sender.send(Data::Payload(blob_buf.bytes));
    }
    Err(_) => {
        *response = Response::network_error(NetworkError::Internal("Opening file failed".into()));
```

jdm (Oct 9, 2018, Member) suggested propagating the error instead:

```rust
Err(err) => {
    *response = Response::network_error(NetworkError::Internal(format!("Initial file read failed: {}", err)));
}
```
```rust
match file.read(&mut buf) {
    Ok(n) => {
        buf.truncate(n);
        let blob_buf = BlobBuf {
```

(review comment on this excerpt has been minimized)
```rust
        let _ = sender.send(Data::Payload(buf));
    }
    Err(_) => {
        *response = Response::network_error(NetworkError::Internal("Opening file failed".into()));
```

(review comment on this excerpt has been minimized)
```rust
fn chunked_fetch(sender: &servo_channel::Sender<Data>,
                 file: &mut File, size: usize, opt_filename: Option<String>,
                 type_string: String, response: &mut Response, bytes: &mut Vec<u8>) {
    // First chunk
```

jdm (Oct 9, 2018, Member): It's not clear to me why we need an initial read that's separate from the loop.
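Folding that initial read into the loop could look like the following hedged sketch; `Msg` and its fields are invented stand-ins for the metadata-carrying BlobBuf first chunk, with a `first` flag inside a single loop replacing the separate read:

```rust
use std::io::Read;
use std::sync::mpsc::Sender;

// Invented message type: the first chunk carries metadata (as the real
// BlobBuf does), later chunks carry only bytes.
enum Msg {
    First { type_string: String, bytes: Vec<u8> },
    More(Vec<u8>),
    Done,
}

// One loop handles both the metadata-bearing first chunk and the rest.
// (An empty input produces only Done; the real code would decide
// whether metadata must still be sent in that case.)
fn chunked_fetch<R: Read>(mut file: R, type_string: String, sender: &Sender<Msg>) {
    let mut first = true;
    loop {
        let mut buf = vec![0; 8192];
        match file.read(&mut buf) {
            Ok(0) => {
                let _ = sender.send(Msg::Done);
                return;
            }
            Ok(n) => {
                buf.truncate(n);
                let msg = if first {
                    first = false;
                    Msg::First { type_string: type_string.clone(), bytes: buf }
                } else {
                    Msg::More(buf)
                };
                let _ = sender.send(msg);
            }
            Err(_) => return,
        }
    }
}
```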
```rust
               origin: FileOrigin,
               response: &mut Response) {
    let store = self.store.clone();
    let mut r2 = response.clone();
```

jdm (Oct 9, 2018, Member): This cloned response is modified by try_fetch_file, but no other code observes the modifications, and r2 is discarded.
```rust
// Basic fetch, Step 4.
headers.set(ContentLength(blob_buf.size as u64));
// Basic fetch, Step 5.
headers.set(ContentType(content_type.clone()));
```

jdm (Oct 9, 2018, Member): The blob data received by network consumers no longer has the correct headers because of these changes and my later comments about the cloned Response value.
```diff
 let check_url_validity = true;
-filemanager.read_file(sender, id, check_url_validity, origin);
+filemanager.fetch_file(sender, id, check_url_validity, origin, &mut response);
```

jdm (Oct 9, 2018, Member): Since the only caller of this method is passing true for check_url_validity, let's remove the argument.
```rust
bytes.extend_from_slice(&blob_buf.bytes);
// ...
response.headers = headers;
*response.body.lock().unwrap() = ResponseBody::Done(bytes);
```

jdm (Oct 9, 2018, Member): This needs to send bytes over sender instead of modifying the response object's body.
```rust
let store = self.store.clone();
let mut r2 = response.clone();
thread::Builder::new().name("read file".to_owned()).spawn(move || {
    store.try_fetch_file(&sender, id, check_url_validity, origin, &mut r2)
```

jdm (Oct 9, 2018, Member): If try_fetch_file returns an Err, we shouldn't ignore it. I propose we check for an Err and set the response's body to Done, like https://github.com/servo/servo/blob/master/components/net/http_loader.rs#L1160-L1163, instead.
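That proposed handling might be sketched as follows. This is a self-contained illustration: `ResponseBody` and `spawn_fetch` are simplified stand-ins for Servo's types, and the fallible fetch is passed in as a closure:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Simplified stand-in for Servo's ResponseBody states.
#[derive(Debug, PartialEq)]
enum ResponseBody {
    Receiving(Vec<u8>),
    Done(Vec<u8>),
}

// If the fetch thread fails, mark the body Done so consumers waiting
// on the response don't block forever (the pattern jdm points to in
// http_loader.rs).
fn spawn_fetch<F>(body: Arc<Mutex<ResponseBody>>, try_fetch: F) -> thread::JoinHandle<()>
where
    F: FnOnce() -> Result<(), String> + Send + 'static,
{
    thread::spawn(move || {
        if try_fetch().is_err() {
            let mut guard = body.lock().unwrap();
            // Keep whatever bytes were received, but finalize the body.
            let bytes = match &mut *guard {
                ResponseBody::Receiving(b) => std::mem::take(b),
                ResponseBody::Done(b) => std::mem::take(b),
            };
            *guard = ResponseBody::Done(bytes);
        }
    })
}
```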
Sorry for the long delay! Thanks again for tackling this!
The delay is no problem at all! Unfortunately, school has started up for me and I'm getting really swamped with work right now. I'd love to finish this, or even attempt to fully implement the fix (now that I'm learning more about Rust), but to be fair to you, honestly the next time I'll be able to complete this is Thanksgiving / late November. If the ticket or current pull request is still needed in two months, I'd jump at the chance to finish it; if not, I apologize for, in a sense, failing to get the ticket done.
Ok! We'll close this PR for now and mark you as unassigned from the issue; if it's still open at that time, you're welcome to resume working on it :)
ms2300 commented Aug 31, 2018 (edited by SimonSapin)

I changed the way that load_blob_sync works so that it takes the target and processes the blob_url incrementally, as opposed to simply appending the bytes to a vec. I used the same ideas as specified in the comments of #21466 to determine the correct way to process incrementally, and I'm passing the tests and checklist, but I'm not 100% sure this is the best or correct way to do this; any feedback would be greatly appreciated (it seems like too easy a solution).

- [x] ./mach build -d does not report any errors
- [x] ./mach test-tidy does not report any errors
- [x] (first pass, I think this does) These changes fix #21467 (github issue number if applicable).
- [ ] There are tests for these changes OR these changes do not require tests because _____
- [ ] What's the best way to bulletproof test this?