Join GitHub today
GitHub is home to over 50 million developers working together to host and review code, manage projects, and build software together.
Sign upAdded Async HTML Tokenizer #17037
Added Async HTML Tokenizer #17037
Conversation
highfive
commented
May 25, 2017
|
Heads up! This PR modifies the following files:
|
highfive
commented
May 25, 2017
|
|
|
Initial partial review. |
| @@ -91,6 +92,7 @@ pub struct ServoParser { | |||
| aborted: Cell<bool>, | |||
| /// https://html.spec.whatwg.org/multipage/#script-created-parser | |||
| script_created_parser: bool, | |||
| async: bool, | |||
This comment has been minimized.
This comment has been minimized.
nox
May 31, 2017
Member
You don't need this, the provided tokeniser already lets you know whether it's async or not, please remove.
| @@ -54,6 +54,7 @@ use std::mem; | |||
| use style::context::QuirksMode as ServoQuirksMode; | |||
|
|
|||
| mod html; | |||
| mod async_html; | |||
This comment has been minimized.
This comment has been minimized.
| @@ -1364,7 +1364,7 @@ impl Element { | |||
| // Steps 1-2. | |||
| let context_document = document_from_node(self); | |||
| // TODO(#11995): XML case. | |||
| let new_children = ServoParser::parse_html_fragment(self, markup); | |||
| let new_children = ServoParser::parse_html_fragment(self, markup, true); | |||
This comment has been minimized.
This comment has been minimized.
| } | ||
|
|
||
| #[allow(unsafe_code)] | ||
| unsafe impl JSTraceable for HtmlTokenizer<TreeBuilder<ParseNode, Sink>> { |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
cynicaldevil
Jun 15, 2017
Author
Contributor
This is not possible, I tried to make the outer impl generic, but it seems that the inner impl (the one inside trace) cannot pick up the generic type.
| id: usize, | ||
| } | ||
|
|
||
| enum ParserNodeOrText { |
This comment has been minimized.
This comment has been minimized.
| node_data.data = Some(String::from(data)); | ||
| self.enqueue(ParseOperation::CreatePI(node.id)); | ||
| }, | ||
| None => panic!("Element was just created using new_parse_node() and should exist!"), |
This comment has been minimized.
This comment has been minimized.
| } | ||
|
|
||
| fn has_parent_node(&self, node: &Self::Handle) -> bool { | ||
| match self.parse_node_data.borrow().get(&node.id) { |
This comment has been minimized.
This comment has been minimized.
| parser.parse_chunk(String::from(input)); | ||
| } | ||
|
|
||
| // https://html.spec.whatwg.org/multipage/#parsing-html-fragments | ||
| pub fn parse_html_fragment(context: &Element, input: DOMString) -> FragmentParsingResult { | ||
| pub fn parse_html_fragment(context: &Element, input: DOMString, async: bool) -> FragmentParsingResult { |
This comment has been minimized.
This comment has been minimized.
| @@ -157,11 +171,18 @@ impl ServoParser { | |||
| } | |||
| } | |||
|
|
|||
| pub fn parse_html_script_input(document: &Document, url: ServoUrl, type_: &str) { | |||
| pub fn parse_html_script_input(document: &Document, url: ServoUrl, type_: &str, async: bool) { | |||
This comment has been minimized.
This comment has been minimized.
| pub fn parse_html_document(document: &Document, input: DOMString, url: ServoUrl) { | ||
| let parser = ServoParser::new(document, | ||
| Tokenizer::Html(self::html::Tokenizer::new(document, url, None)), | ||
| pub fn parse_html_document(document: &Document, input: DOMString, url: ServoUrl, async: bool) { |
This comment has been minimized.
This comment has been minimized.
nox
May 31, 2017
Member
No boolean argument for that please, and I think there is no reason an HTML document should be parsed in a sync way.
0dee74f
to
22627fa
|
@nox Right! I changed most of the things you requested, and got the wikipedia main page and my github profile page (also a few pages I selected from |
| let mut parse_node_data = self.parse_node_data.borrow_mut(); | ||
| let mut data = parse_node_data.get_mut(&target.id).expect("Expected data of parse node"); | ||
| let contents = match data.contents { | ||
| Some(ref node) => node.clone(), |
This comment has been minimized.
This comment has been minimized.
cynicaldevil
Jun 1, 2017
•
Author
Contributor
IIUC, we can safely clone the ParseNode to be returned from this function. I am not able to see the problem with it, but I would like your advice on this one.
| #[derive(JSTraceable, Clone, HeapSizeOf)] | ||
| pub struct ParseNode { | ||
| id: usize, | ||
| qual_name: Option<QualName>, |
This comment has been minimized.
This comment has been minimized.
cynicaldevil
Jun 1, 2017
Author
Contributor
I decided to move this field here from ParseNodeData, because that was the only way I could get elem_name() to work, given its lifetime constraints.
| } | ||
|
|
||
| // TODO: Handle this | ||
| fn complete_script(&mut self, node: &Self::Handle) -> NextParserState { |
This comment has been minimized.
This comment has been minimized.
|
|
||
| /// https://html.spec.whatwg.org/multipage/#html-integration-point | ||
| /// Specifically, the <annotation-xml> cases. | ||
| fn is_mathml_annotation_xml_integration_point(&self, handle: &Self::Handle) -> bool { |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
|
@bors-servo try |
[WIP] Added Async HTML Tokenizer Design: I realized having two different parsers for sync and async was wrong, because the API for both was fundamentally the same. All I needed to do was create another Tokenizer, because `ParseNode` ( representation for nodes which are yet to be created) is used by the TreeBuilder and the Sink, and the `Tokenizer` is the 'lowermost' type concerned with these two types. Therefore, I created one and placed it in `async_html.rs`, and also created a new Sink which deals with `ParseNode`s. I changed the methods in ServoParser to take an `async` argument too, which decides which Tokenizer will be used. The Tokenizer isn't exactly *async* for now, but this PR separates action creation from execution, which allows the async behaviour to be implemented later. Right now, all actions are executed soon after they are created. The Sink consists of two Hashmaps, `nodes`, which contains the actual nodes, with the key being their corresponding `ParseNode`'s id, and `parse_node_data`, which contains metadata about the nodes. It's still a bit rough, (I can't figure out how to deal with `complete_script` and `is_mathml_annotation_xml_integration_point`, along with some other parts I wrote in a hurry), but I believe the overall design is sound. I'd like to hear what you think about it. <!-- Reviewable:start --> --- This change is [<img src="https://reviewable.io/review_button.svg" height="34" align="absmiddle" alt="Reviewable"/>](https://reviewable.io/reviews/servo/servo/17037) <!-- Reviewable:end -->
|
|
| } | ||
|
|
||
| fn set_current_line(&mut self, line_number: u64) { | ||
| self.current_line = line_number; |
This comment has been minimized.
This comment has been minimized.
jdm
Jun 5, 2017
Member
This won't work as written - the code that creates elements relies on the line number that's stored, and it will be incorrect when elements are created after the synchronous section of the tokenizer.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
jdm
Jun 13, 2017
Member
I misread the changes. Currently this works fine, but when we stop calling process_operation immediately after creating parse nodes it will no longer work.
|
@bors-servo try |
[WIP] Added Async HTML Tokenizer Design: I realized having two different parsers for sync and async was wrong, because the API for both was fundamentally the same. All I needed to do was create another Tokenizer, because `ParseNode` ( representation for nodes which are yet to be created) is used by the TreeBuilder and the Sink, and the `Tokenizer` is the 'lowermost' type concerned with these two types. Therefore, I created one and placed it in `async_html.rs`, and also created a new Sink which deals with `ParseNode`s. I changed the methods in ServoParser to take an `async` argument too, which decides which Tokenizer will be used. The Tokenizer isn't exactly *async* for now, but this PR separates action creation from execution, which allows the async behaviour to be implemented later. Right now, all actions are executed soon after they are created. The Sink consists of two Hashmaps, `nodes`, which contains the actual nodes, with the key being their corresponding `ParseNode`'s id, and `parse_node_data`, which contains metadata about the nodes. It's still a bit rough, (I can't figure out how to deal with `complete_script` and `is_mathml_annotation_xml_integration_point`, along with some other parts I wrote in a hurry), but I believe the overall design is sound. I'd like to hear what you think about it. <!-- Reviewable:start --> --- This change is [<img src="https://reviewable.io/review_button.svg" height="34" align="absmiddle" alt="Reviewable"/>](https://reviewable.io/reviews/servo/servo/17037) <!-- Reviewable:end -->
|
|
|
@bors-servo try |
Added Async HTML Tokenizer Design: I realized having two different parsers for sync and async was wrong, because the API for both was fundamentally the same. All I needed to do was create another Tokenizer, because `ParseNode` ( representation for nodes which are yet to be created) is used by the TreeBuilder and the Sink, and the `Tokenizer` is the 'lowermost' type concerned with these two types. Therefore, I created one and placed it in `async_html.rs`, and also created a new Sink which deals with `ParseNode`s. I changed the methods in ServoParser to take an `async` argument too, which decides which Tokenizer will be used. The Tokenizer isn't exactly *async* for now, but this PR separates action creation from execution, which allows the async behaviour to be implemented later. Right now, all actions are executed soon after they are created. The Sink consists of two Hashmaps, `nodes`, which contains the actual nodes, with the key being their corresponding `ParseNode`'s id, and `parse_node_data`, which contains metadata about the nodes. It's still a bit rough, (I can't figure out how to deal with `complete_script` and `is_mathml_annotation_xml_integration_point`, along with some other parts I wrote in a hurry), but I believe the overall design is sound. I'd like to hear what you think about it. <!-- Reviewable:start --> --- This change is [<img src="https://reviewable.io/review_button.svg" height="34" align="absmiddle" alt="Reviewable"/>](https://reviewable.io/reviews/servo/servo/17037) <!-- Reviewable:end -->
|
@nox I think you can review it in-depth now, it's mostly complete aside from a few warnings which I'll remove later. I am still unsure of what to do with |
| ParseOperation::Insert(parent, reference_child, new_node) => { | ||
| let nodes = self.nodes.borrow(); | ||
| let parent = nodes.get(&parent).unwrap(); | ||
| let reference_child = reference_child.map(|n| nodes.get(&n).unwrap().deref()); |
This comment has been minimized.
This comment has been minimized.
|
You are doing great! |
| } | ||
|
|
||
| enum ParseOperation { | ||
| GetTemplateContents(usize, usize), |
This comment has been minimized.
This comment has been minimized.
nox
Jun 13, 2017
Member
Please use named fields or document what each usize is. Named fields would be better of course.
|
|
||
| fn new_parse_node(&self) -> ParseNode { | ||
| let id = self.next_parse_node_id.get(); | ||
| let data: ParseNodeData = Default::default(); |
This comment has been minimized.
This comment has been minimized.
| document: JS<Document>, | ||
| current_line: u64, | ||
| script: MutNullableJS<HTMLScriptElement>, | ||
| parse_node_data: DOMRefCell<HashMap<usize, ParseNodeData>>, |
This comment has been minimized.
This comment has been minimized.
nox
Jun 13, 2017
Member
Why does this need to be in a DOMRefCell given all TreeSink methods take a &mut self?
This comment has been minimized.
This comment has been minimized.
cynicaldevil
Jun 13, 2017
Author
Contributor
This is a tricky one, parse_node_data needs to be DOMRefCell because I want to use it mutably in the same_tree method, where I update the tree-related info based on the data I get from the updated Nodes (same tree borrows self immutably).
The nodes field needs it because it is used in process_operation, which takes &self. If process_operation begins taking &mut self, it would cause problems with the TreeSink methods, as many methods already immutably borrow parse_node_data there, when they call process_operation. Therefore, the process_operation must borrow immutably too, to satisfy the borrow checker.
However, I might be able to make nodes a normal Hashmap in the next PR, because it will exist in main thread altogether, and therefore there will be no conflicts with Sink then.
| } | ||
| } | ||
|
|
||
| fn enqueue(&self, op: ParseOperation) { |
This comment has been minimized.
This comment has been minimized.
| } | ||
|
|
||
| fn process_operation(&self, op: ParseOperation) { | ||
| let document = Root::from_ref(&**self.nodes.borrow().get(&0).expect("document node should exist!")); |
This comment has been minimized.
This comment has been minimized.
nox
Jun 13, 2017
Member
I wonder if the data for the document node shouldn't live outside the nodes map.
| let mut node_data = parse_node_data.get_mut(&node.id).expect("Element does not exist!"); | ||
| node_data.is_integration_point = attrs.iter() | ||
| .any(|attr| { | ||
| let attr_value = &String::from(attr.value.clone()); |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
cynicaldevil
Jun 14, 2017
Author
Contributor
I found that I could compute this value once and store it when I'm creating a new element (the value's needed in is_mathml_annotation_xml_integration_point). But now that I think about it, this could be a problem if the element's attributes related to this value are changed at any point of time after the element's created. Do you think I should change it to be synchronous ? (i.e, get the actual Node from the nodes map and compute the value from it).
|
|
||
| fn create_comment(&mut self, text: StrTendril) -> Self::Handle { | ||
| let node = self.new_parse_node(); | ||
| self.enqueue(ParseOperation::CreateComment(String::from(text), node.id)); |
This comment has been minimized.
This comment has been minimized.
| QuirksMode::Quirks => ServoQuirksMode::Quirks, | ||
| QuirksMode::LimitedQuirks => ServoQuirksMode::LimitedQuirks, | ||
| QuirksMode::NoQuirks => ServoQuirksMode::NoQuirks, | ||
| }; |
This comment has been minimized.
This comment has been minimized.
| } | ||
|
|
||
| fn complete_script(&mut self, _: &Self::Handle) -> NextParserState { | ||
| panic!("complete_script should not be called here!"); |
This comment has been minimized.
This comment has been minimized.
| } | ||
|
|
||
| fn set_current_line(&mut self, line_number: u64) { | ||
| self.current_line = line_number; |
This comment has been minimized.
This comment has been minimized.
|
|
||
| fn create_pi(&mut self, target: StrTendril, data: StrTendril) -> ParseNode { | ||
| let node = self.new_parse_node(); | ||
| { |
This comment has been minimized.
This comment has been minimized.
cynicaldevil
Jun 14, 2017
Author
Contributor
Quite a few of these scopes appear in the TreeSink methods. It's needed because self needs to be borrowed for parse_node_data operations, and then in the end needs to be borrowed mutably for process_operation.
db900e8
to
1361216
|
@bors-servo try This is looking good. @jdm Do we want to make the use of the async tokenizer settable through some config? |
Added Async HTML Tokenizer Design: I realized having two different parsers for sync and async was wrong, because the API for both was fundamentally the same. All I needed to do was create another Tokenizer, because `ParseNode` ( representation for nodes which are yet to be created) is used by the TreeBuilder and the Sink, and the `Tokenizer` is the 'lowermost' type concerned with these two types. Therefore, I created one and placed it in `async_html.rs`, and also created a new Sink which deals with `ParseNode`s. I changed the methods in ServoParser to take an `async` argument too, which decides which Tokenizer will be used. The Tokenizer isn't exactly *async* for now, but this PR separates action creation from execution, which allows the async behaviour to be implemented later. Right now, all actions are executed soon after they are created. The Sink consists of two Hashmaps, `nodes`, which contains the actual nodes, with the key being their corresponding `ParseNode`'s id, and `parse_node_data`, which contains metadata about the nodes. It's still a bit rough, (I can't figure out how to deal with `complete_script` and `is_mathml_annotation_xml_integration_point`, along with some other parts I wrote in a hurry), but I believe the overall design is sound. I'd like to hear what you think about it. <!-- Reviewable:start --> --- This change is [<img src="https://reviewable.io/review_button.svg" height="34" align="absmiddle" alt="Reviewable"/>](https://reviewable.io/reviews/servo/servo/17037) <!-- Reviewable:end -->
|
A preference probably makes the most sense. |
|
How should I implement it? |
|
|
|
Probably check the pref in parse_html_document and disregard Async if the preference is disabled. |
|
Added preference. |
|
@bors-servo r+ |
|
|
Added Async HTML Tokenizer Design: I realized having two different parsers for sync and async was wrong, because the API for both was fundamentally the same. All I needed to do was create another Tokenizer, because `ParseNode` ( representation for nodes which are yet to be created) is used by the TreeBuilder and the Sink, and the `Tokenizer` is the 'lowermost' type concerned with these two types. Therefore, I created one and placed it in `async_html.rs`, and also created a new Sink which deals with `ParseNode`s. I changed the methods in ServoParser to take an `async` argument too, which decides which Tokenizer will be used. The Tokenizer isn't exactly *async* for now, but this PR separates action creation from execution, which allows the async behaviour to be implemented later. Right now, all actions are executed soon after they are created. The Sink consists of two Hashmaps, `nodes`, which contains the actual nodes, with the key being their corresponding `ParseNode`'s id, and `parse_node_data`, which contains metadata about the nodes. It's still a bit rough, (I can't figure out how to deal with `complete_script` and `is_mathml_annotation_xml_integration_point`, along with some other parts I wrote in a hurry), but I believe the overall design is sound. I'd like to hear what you think about it. <!-- Reviewable:start --> --- This change is [<img src="https://reviewable.io/review_button.svg" height="34" align="absmiddle" alt="Reviewable"/>](https://reviewable.io/reviews/servo/servo/17037) <!-- Reviewable:end -->
|
|
cynicaldevil commentedMay 25, 2017
•
edited
Design: I realized having two different parsers for sync and async was wrong, because the API for both was fundamentally the same. All I needed to do was create another Tokenizer, because
ParseNode( representation for nodes which are yet to be created) is used by the TreeBuilder and the Sink, and theTokenizeris the 'lowermost' type concerned with these two types.Therefore, I created one and placed it in
async_html.rs, and also created a new Sink which deals withParseNodes. I changed the methods in ServoParser to take anasyncargument too, which decides which Tokenizer will be used. The Tokenizer isn't exactly async for now, but this PR separates action creation from execution, which allows the async behaviour to be implemented later. Right now, all actions are executed soon after they are created.The Sink consists of two Hashmaps,
nodes, which contains the actual nodes, with the key being their correspondingParseNode's id, andparse_node_data, which contains metadata about the nodes.It's still a bit rough, (I can't figure out how to deal with
complete_scriptandis_mathml_annotation_xml_integration_point, along with some other parts I wrote in a hurry), but I believe the overall design is sound. I'd like to hear what you think about it.This change is