Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added Async HTML Tokenizer #17037

Merged
merged 2 commits into from Jun 17, 2017
Merged

Added Async HTML Tokenizer #17037

merged 2 commits into from Jun 17, 2017

Conversation

@cynicaldevil
Copy link
Contributor

cynicaldevil commented May 25, 2017

Design: I realized having two different parsers for sync and async was wrong, because the API for both was fundamentally the same. All I needed to do was create another Tokenizer, because ParseNode ( representation for nodes which are yet to be created) is used by the TreeBuilder and the Sink, and the Tokenizer is the 'lowermost' type concerned with these two types.

Therefore, I created one and placed it in async_html.rs, and also created a new Sink which deals with ParseNodes. I changed the methods in ServoParser to take an async argument too, which decides which Tokenizer will be used. The Tokenizer isn't exactly async for now, but this PR separates action creation from execution, which allows the async behaviour to be implemented later. Right now, all actions are executed soon after they are created.

The Sink consists of two Hashmaps, nodes, which contains the actual nodes, with the key being their corresponding ParseNode's id, and parse_node_data, which contains metadata about the nodes.

It's still a bit rough, (I can't figure out how to deal with complete_script and is_mathml_annotation_xml_integration_point, along with some other parts I wrote in a hurry), but I believe the overall design is sound. I'd like to hear what you think about it.


This change is Reviewable

@highfive
Copy link

highfive commented May 25, 2017

Heads up! This PR modifies the following files:

  • @fitzgen: components/script/dom/document.rs, components/script/dom/servoparser/async_html.rs, components/script/dom/element.rs, components/script/dom/xmlhttprequest.rs, components/script/dom/servoparser/mod.rs and 2 more
  • @KiChjang: components/script/dom/document.rs, components/script/dom/servoparser/async_html.rs, components/script/dom/element.rs, components/script/dom/xmlhttprequest.rs, components/script/dom/servoparser/mod.rs and 2 more
@highfive
Copy link

highfive commented May 25, 2017

warning Warning warning

  • These commits modify unsafe code. Please review it carefully!
  • These commits modify script code, but no tests are modified. Please consider adding a test!
@cynicaldevil
Copy link
Contributor Author

cynicaldevil commented May 25, 2017

cc @jdm @nox

@bors-servo
Copy link
Contributor

bors-servo commented May 27, 2017

The latest upstream changes (presumably #16989) made this pull request unmergeable. Please resolve the merge conflicts.

Copy link
Member

nox left a comment

Initial partial review.

@@ -91,6 +92,7 @@ pub struct ServoParser {
aborted: Cell<bool>,
/// https://html.spec.whatwg.org/multipage/#script-created-parser
script_created_parser: bool,
async: bool,

This comment has been minimized.

Copy link
@nox

nox May 31, 2017

Member

You don't need this, the provided tokeniser already lets you know whether it's async or not, please remove.

@@ -54,6 +54,7 @@ use std::mem;
use style::context::QuirksMode as ServoQuirksMode;

mod html;
mod async_html;

This comment has been minimized.

Copy link
@nox

nox May 31, 2017

Member

Nit: alphabetic order.

@@ -1364,7 +1364,7 @@ impl Element {
// Steps 1-2.
let context_document = document_from_node(self);
// TODO(#11995): XML case.
let new_children = ServoParser::parse_html_fragment(self, markup);
let new_children = ServoParser::parse_html_fragment(self, markup, true);

This comment has been minimized.

Copy link
@nox

nox May 31, 2017

Member

Why async? The only input here comes from document.write.

}

#[allow(unsafe_code)]
unsafe impl JSTraceable for HtmlTokenizer<TreeBuilder<ParseNode, Sink>> {

This comment has been minimized.

Copy link
@nox

nox May 31, 2017

Member

This impl can be shared with the sync tokeniser.

This comment has been minimized.

Copy link
@cynicaldevil

cynicaldevil Jun 14, 2017

Author Contributor

Forgot about this one, trying it now.

This comment has been minimized.

Copy link
@cynicaldevil

cynicaldevil Jun 15, 2017

Author Contributor

This is not possible, I tried to make the outer impl generic, but it seems that the inner impl (the one inside trace) cannot pick up the generic type.

id: usize,
}

enum ParserNodeOrText {

This comment has been minimized.

Copy link
@nox

nox May 31, 2017

Member

Why do you need this new enum?

node_data.data = Some(String::from(data));
self.enqueue(ParseOperation::CreatePI(node.id));
},
None => panic!("Element was just created using new_parse_node() and should exist!"),

This comment has been minimized.

Copy link
@nox

nox May 31, 2017

Member

What is this for?

}

fn has_parent_node(&self, node: &Self::Handle) -> bool {
match self.parse_node_data.borrow().get(&node.id) {

This comment has been minimized.

Copy link
@nox

nox May 31, 2017

Member

Use expect on that, instead of matching explicitly.

parser.parse_chunk(String::from(input));
}

// https://html.spec.whatwg.org/multipage/#parsing-html-fragments
pub fn parse_html_fragment(context: &Element, input: DOMString) -> FragmentParsingResult {
pub fn parse_html_fragment(context: &Element, input: DOMString, async: bool) -> FragmentParsingResult {

This comment has been minimized.

Copy link
@nox

nox May 31, 2017

Member

Why would you parse a fragment in async mode?

@@ -157,11 +171,18 @@ impl ServoParser {
}
}

pub fn parse_html_script_input(document: &Document, url: ServoUrl, type_: &str) {
pub fn parse_html_script_input(document: &Document, url: ServoUrl, type_: &str, async: bool) {

This comment has been minimized.

Copy link
@nox

nox May 31, 2017

Member

Why would you parse script input in async mode?

pub fn parse_html_document(document: &Document, input: DOMString, url: ServoUrl) {
let parser = ServoParser::new(document,
Tokenizer::Html(self::html::Tokenizer::new(document, url, None)),
pub fn parse_html_document(document: &Document, input: DOMString, url: ServoUrl, async: bool) {

This comment has been minimized.

Copy link
@nox

nox May 31, 2017

Member

No boolean argument for that please, and I think there is no reason an HTML document should be parsed in a sync way.

@cynicaldevil cynicaldevil force-pushed the cynicaldevil:impl-Sink branch 2 times, most recently from 0dee74f to 22627fa Jun 1, 2017
Copy link
Contributor Author

cynicaldevil left a comment

@nox Right! I changed most of the things you requested, and got the wikipedia main page and my github profile page (also a few pages I selected from tests) to run successfully at last! I'm still not sure how to implement a couple of methods from TreeSink, though.

let mut parse_node_data = self.parse_node_data.borrow_mut();
let mut data = parse_node_data.get_mut(&target.id).expect("Expected data of parse node");
let contents = match data.contents {
Some(ref node) => node.clone(),

This comment has been minimized.

Copy link
@cynicaldevil

cynicaldevil Jun 1, 2017

Author Contributor

IIUC, we can safely clone the ParseNode to be returned from this function. I am not able to see the problem with it, but I would like your advice on this one.

#[derive(JSTraceable, Clone, HeapSizeOf)]
pub struct ParseNode {
id: usize,
qual_name: Option<QualName>,

This comment has been minimized.

Copy link
@cynicaldevil

cynicaldevil Jun 1, 2017

Author Contributor

I decided to move this field here from ParseNodeData, because that was the only way I could get elem_name() to work, given its lifetime constraints.

}

// TODO: Handle this
fn complete_script(&mut self, node: &Self::Handle) -> NextParserState {

This comment has been minimized.

Copy link
@cynicaldevil

cynicaldevil Jun 1, 2017

Author Contributor

I'm still not exactly sure how to handle this.


/// https://html.spec.whatwg.org/multipage/#html-integration-point
/// Specifically, the <annotation-xml> cases.
fn is_mathml_annotation_xml_integration_point(&self, handle: &Self::Handle) -> bool {

This comment has been minimized.

Copy link
@cynicaldevil

cynicaldevil Jun 1, 2017

Author Contributor

This one too.

This comment has been minimized.

Copy link
@cynicaldevil

cynicaldevil Jun 2, 2017

Author Contributor

Fixed this one today.

@nox nox removed the S-needs-rebase label Jun 2, 2017
@cynicaldevil
Copy link
Contributor Author

cynicaldevil commented Jun 4, 2017

@bors-servo
Copy link
Contributor

bors-servo commented Jun 4, 2017

Trying commit 438f421 with merge f3ad8d8...

bors-servo added a commit that referenced this pull request Jun 4, 2017
[WIP] Added Async HTML Tokenizer

Design: I realized having two different parsers for sync and async was wrong, because the API for both was fundamentally the same. All I needed to do was create another Tokenizer, because `ParseNode` ( representation for nodes which are yet to be created) is used by the TreeBuilder and the Sink, and the `Tokenizer` is the 'lowermost' type concerned with these two types.

Therefore, I created one and placed it in `async_html.rs`, and also created a new Sink which deals with `ParseNode`s. I changed the methods in ServoParser to take an `async` argument too, which decides which Tokenizer will be used. The Tokenizer isn't exactly *async* for now, but this PR separates action creation from execution, which allows the async behaviour to be implemented later. Right now, all actions are executed soon after they are created.

The Sink consists of two Hashmaps, `nodes`, which contains the actual nodes, with the key being their corresponding `ParseNode`'s id, and `parse_node_data`, which contains metadata about the nodes.

It's still a bit rough, (I can't figure out how to deal with `complete_script` and `is_mathml_annotation_xml_integration_point`, along with some other parts I wrote in a hurry), but I believe the overall design is sound. I'd like to hear what you think about it.

<!-- Reviewable:start -->
---
This change is [<img src="https://reviewable.io/review_button.svg" height="34" align="absmiddle" alt="Reviewable"/>](https://reviewable.io/reviews/servo/servo/17037)
<!-- Reviewable:end -->
@bors-servo
Copy link
Contributor

bors-servo commented Jun 4, 2017

💔 Test failed - linux-rel-wpt

}

fn set_current_line(&mut self, line_number: u64) {
self.current_line = line_number;

This comment has been minimized.

Copy link
@jdm

jdm Jun 5, 2017

Member

This won't work as written - the code that creates elements relies on the line number that's stored, and it will be incorrect when elements are created after the synchronous section of the tokenizer.

This comment has been minimized.

Copy link
@nox

nox Jun 13, 2017

Member

This stills stands, right @cynicaldevil?

This comment has been minimized.

Copy link
@jdm

jdm Jun 13, 2017

Member

I misread the changes. Currently this works fine, but when we stop calling process_operation immediately after creating parse nodes it will no longer work.

@cynicaldevil cynicaldevil force-pushed the cynicaldevil:impl-Sink branch from 438f421 to 24093e9 Jun 7, 2017
@highfive highfive removed the S-tests-failed label Jun 7, 2017
@cynicaldevil
Copy link
Contributor Author

cynicaldevil commented Jun 7, 2017

@bors-servo
Copy link
Contributor

bors-servo commented Jun 7, 2017

Trying commit 24093e9 with merge 5ffd5c1...

bors-servo added a commit that referenced this pull request Jun 7, 2017
[WIP] Added Async HTML Tokenizer

Design: I realized having two different parsers for sync and async was wrong, because the API for both was fundamentally the same. All I needed to do was create another Tokenizer, because `ParseNode` ( representation for nodes which are yet to be created) is used by the TreeBuilder and the Sink, and the `Tokenizer` is the 'lowermost' type concerned with these two types.

Therefore, I created one and placed it in `async_html.rs`, and also created a new Sink which deals with `ParseNode`s. I changed the methods in ServoParser to take an `async` argument too, which decides which Tokenizer will be used. The Tokenizer isn't exactly *async* for now, but this PR separates action creation from execution, which allows the async behaviour to be implemented later. Right now, all actions are executed soon after they are created.

The Sink consists of two Hashmaps, `nodes`, which contains the actual nodes, with the key being their corresponding `ParseNode`'s id, and `parse_node_data`, which contains metadata about the nodes.

It's still a bit rough, (I can't figure out how to deal with `complete_script` and `is_mathml_annotation_xml_integration_point`, along with some other parts I wrote in a hurry), but I believe the overall design is sound. I'd like to hear what you think about it.

<!-- Reviewable:start -->
---
This change is [<img src="https://reviewable.io/review_button.svg" height="34" align="absmiddle" alt="Reviewable"/>](https://reviewable.io/reviews/servo/servo/17037)
<!-- Reviewable:end -->
@bors-servo
Copy link
Contributor

bors-servo commented Jun 7, 2017

💔 Test failed - linux-rel-wpt

@cynicaldevil cynicaldevil changed the title [WIP] Added Async HTML Tokenizer Added Async HTML Tokenizer Jun 7, 2017
@highfive highfive removed the S-tests-failed label Jun 7, 2017
@cynicaldevil
Copy link
Contributor Author

cynicaldevil commented Jun 7, 2017

@bors-servo
Copy link
Contributor

bors-servo commented Jun 7, 2017

Trying commit 725965f with merge 3598f78...

bors-servo added a commit that referenced this pull request Jun 7, 2017
Added Async HTML Tokenizer

Design: I realized having two different parsers for sync and async was wrong, because the API for both was fundamentally the same. All I needed to do was create another Tokenizer, because `ParseNode` ( representation for nodes which are yet to be created) is used by the TreeBuilder and the Sink, and the `Tokenizer` is the 'lowermost' type concerned with these two types.

Therefore, I created one and placed it in `async_html.rs`, and also created a new Sink which deals with `ParseNode`s. I changed the methods in ServoParser to take an `async` argument too, which decides which Tokenizer will be used. The Tokenizer isn't exactly *async* for now, but this PR separates action creation from execution, which allows the async behaviour to be implemented later. Right now, all actions are executed soon after they are created.

The Sink consists of two Hashmaps, `nodes`, which contains the actual nodes, with the key being their corresponding `ParseNode`'s id, and `parse_node_data`, which contains metadata about the nodes.

It's still a bit rough, (I can't figure out how to deal with `complete_script` and `is_mathml_annotation_xml_integration_point`, along with some other parts I wrote in a hurry), but I believe the overall design is sound. I'd like to hear what you think about it.

<!-- Reviewable:start -->
---
This change is [<img src="https://reviewable.io/review_button.svg" height="34" align="absmiddle" alt="Reviewable"/>](https://reviewable.io/reviews/servo/servo/17037)
<!-- Reviewable:end -->
@cynicaldevil
Copy link
Contributor Author

cynicaldevil commented Jun 8, 2017

@nox I think you can review it in-depth now, it's mostly complete aside from a few warnings which I'll remove later. I am still unsure of what to do with complete_script, though.

@cynicaldevil cynicaldevil force-pushed the cynicaldevil:impl-Sink branch from 8af96fb to 8775737 Jun 12, 2017
ParseOperation::Insert(parent, reference_child, new_node) => {
let nodes = self.nodes.borrow();
let parent = nodes.get(&parent).unwrap();
let reference_child = reference_child.map(|n| nodes.get(&n).unwrap().deref());

This comment has been minimized.

Copy link
@nox

nox Jun 13, 2017

Member

Why the explicit deref call?

reference_child.map(|n| nodes.get(&n))
Copy link
Member

nox left a comment

You are doing great!

}

enum ParseOperation {
GetTemplateContents(usize, usize),

This comment has been minimized.

Copy link
@nox

nox Jun 13, 2017

Member

Please use named fields or document what each usize is. Named fields would be better of course.


fn new_parse_node(&self) -> ParseNode {
let id = self.next_parse_node_id.get();
let data: ParseNodeData = Default::default();

This comment has been minimized.

Copy link
@nox

nox Jun 13, 2017

Member

Nit: that could be ParseNodeData::default().

document: JS<Document>,
current_line: u64,
script: MutNullableJS<HTMLScriptElement>,
parse_node_data: DOMRefCell<HashMap<usize, ParseNodeData>>,

This comment has been minimized.

Copy link
@nox

nox Jun 13, 2017

Member

Why does this need to be in a DOMRefCell given all TreeSink methods take a &mut self?

This comment has been minimized.

Copy link
@cynicaldevil

cynicaldevil Jun 13, 2017

Author Contributor

This is a tricky one, parse_node_data needs to be DOMRefCell because I want to use it mutably in the same_tree method, where I update the tree-related info based on the data I get from the updated Nodes (same tree borrows self immutably).

The nodes field needs it because it is used in process_operation, which takes &self. If process_operation begins taking &mut self, it would cause problems with the TreeSink methods, as many methods already immutably borrow parse_node_data there, when they call process_operation. Therefore, the process_operation must borrow immutably too, to satisfy the borrow checker.

However, I might be able to make nodes a normal Hashmap in the next PR, because it will exist in main thread altogether, and therefore there will be no conflicts with Sink then.

}
}

fn enqueue(&self, op: ParseOperation) {

This comment has been minimized.

Copy link
@nox

nox Jun 13, 2017

Member

I think you can remove this method or process_operation. :)

}

fn process_operation(&self, op: ParseOperation) {
let document = Root::from_ref(&**self.nodes.borrow().get(&0).expect("document node should exist!"));

This comment has been minimized.

Copy link
@nox

nox Jun 13, 2017

Member

I wonder if the data for the document node shouldn't live outside the nodes map.

let mut node_data = parse_node_data.get_mut(&node.id).expect("Element does not exist!");
node_data.is_integration_point = attrs.iter()
.any(|attr| {
let attr_value = &String::from(attr.value.clone());

This comment has been minimized.

Copy link
@nox

nox Jun 13, 2017

Member

Why is this needed?

This comment has been minimized.

Copy link
@cynicaldevil

cynicaldevil Jun 14, 2017

Author Contributor

I found that I could compute this value once and store it when I'm creating a new element (the value's needed in is_mathml_annotation_xml_integration_point). But now that I think about it, this could be a problem if the element's attributes related to this value are changed at any point of time after the element's created. Do you think I should change it to be synchronous ? (i.e, get the actual Node from the nodes map and compute the value from it).


fn create_comment(&mut self, text: StrTendril) -> Self::Handle {
let node = self.new_parse_node();
self.enqueue(ParseOperation::CreateComment(String::from(text), node.id));

This comment has been minimized.

Copy link
@nox

nox Jun 13, 2017

Member

Why aren't we just storing the StrTendril value here?

QuirksMode::Quirks => ServoQuirksMode::Quirks,
QuirksMode::LimitedQuirks => ServoQuirksMode::LimitedQuirks,
QuirksMode::NoQuirks => ServoQuirksMode::NoQuirks,
};

This comment has been minimized.

Copy link
@nox

nox Jun 13, 2017

Member

Why do we have 2 quirks mode enum again?

}

fn complete_script(&mut self, _: &Self::Handle) -> NextParserState {
panic!("complete_script should not be called here!");

This comment has been minimized.

Copy link
@nox

nox Jun 13, 2017

Member

Please file a follow-up about killing this altogether.

}

fn set_current_line(&mut self, line_number: u64) {
self.current_line = line_number;

This comment has been minimized.

Copy link
@nox

nox Jun 13, 2017

Member

This stills stands, right @cynicaldevil?

@cynicaldevil cynicaldevil force-pushed the cynicaldevil:impl-Sink branch from 8775737 to 79ffa13 Jun 14, 2017

fn create_pi(&mut self, target: StrTendril, data: StrTendril) -> ParseNode {
let node = self.new_parse_node();
{

This comment has been minimized.

Copy link
@cynicaldevil

cynicaldevil Jun 14, 2017

Author Contributor

Quite a few of these scopes appear in the TreeSink methods. It's needed because self needs to be borrowed for parse_node_data operations, and then in the end needs to be borrowed mutably for process_operation.

@cynicaldevil cynicaldevil force-pushed the cynicaldevil:impl-Sink branch 2 times, most recently from db900e8 to 1361216 Jun 15, 2017
@nox
Copy link
Member

nox commented Jun 16, 2017

@bors-servo try

This is looking good.

@jdm Do we want to make the use of the async tokenizer settable through some config?

@bors-servo
Copy link
Contributor

bors-servo commented Jun 16, 2017

Trying commit 1361216 with merge 212b0b2...

bors-servo added a commit that referenced this pull request Jun 16, 2017
Added Async HTML Tokenizer

Design: I realized having two different parsers for sync and async was wrong, because the API for both was fundamentally the same. All I needed to do was create another Tokenizer, because `ParseNode` ( representation for nodes which are yet to be created) is used by the TreeBuilder and the Sink, and the `Tokenizer` is the 'lowermost' type concerned with these two types.

Therefore, I created one and placed it in `async_html.rs`, and also created a new Sink which deals with `ParseNode`s. I changed the methods in ServoParser to take an `async` argument too, which decides which Tokenizer will be used. The Tokenizer isn't exactly *async* for now, but this PR separates action creation from execution, which allows the async behaviour to be implemented later. Right now, all actions are executed soon after they are created.

The Sink consists of two Hashmaps, `nodes`, which contains the actual nodes, with the key being their corresponding `ParseNode`'s id, and `parse_node_data`, which contains metadata about the nodes.

It's still a bit rough, (I can't figure out how to deal with `complete_script` and `is_mathml_annotation_xml_integration_point`, along with some other parts I wrote in a hurry), but I believe the overall design is sound. I'd like to hear what you think about it.

<!-- Reviewable:start -->
---
This change is [<img src="https://reviewable.io/review_button.svg" height="34" align="absmiddle" alt="Reviewable"/>](https://reviewable.io/reviews/servo/servo/17037)
<!-- Reviewable:end -->
@jdm
Copy link
Member

jdm commented Jun 16, 2017

A preference probably makes the most sense.

@cynicaldevil
Copy link
Contributor Author

cynicaldevil commented Jun 16, 2017

How should I implement it?

@bors-servo
Copy link
Contributor

bors-servo commented Jun 16, 2017

@jdm
Copy link
Member

jdm commented Jun 16, 2017

Probably check the pref in parse_html_document and disregard Async if the preference is disabled.

@cynicaldevil cynicaldevil force-pushed the cynicaldevil:impl-Sink branch from 1361216 to ad649bb Jun 17, 2017
@cynicaldevil
Copy link
Contributor Author

cynicaldevil commented Jun 17, 2017

Added preference.

@nox
nox approved these changes Jun 17, 2017
@nox
Copy link
Member

nox commented Jun 17, 2017

@bors-servo
Copy link
Contributor

bors-servo commented Jun 17, 2017

📌 Commit ad649bb has been approved by nox

@bors-servo
Copy link
Contributor

bors-servo commented Jun 17, 2017

Testing commit ad649bb with merge 9c2dffd...

bors-servo added a commit that referenced this pull request Jun 17, 2017
Added Async HTML Tokenizer

Design: I realized having two different parsers for sync and async was wrong, because the API for both was fundamentally the same. All I needed to do was create another Tokenizer, because `ParseNode` ( representation for nodes which are yet to be created) is used by the TreeBuilder and the Sink, and the `Tokenizer` is the 'lowermost' type concerned with these two types.

Therefore, I created one and placed it in `async_html.rs`, and also created a new Sink which deals with `ParseNode`s. I changed the methods in ServoParser to take an `async` argument too, which decides which Tokenizer will be used. The Tokenizer isn't exactly *async* for now, but this PR separates action creation from execution, which allows the async behaviour to be implemented later. Right now, all actions are executed soon after they are created.

The Sink consists of two Hashmaps, `nodes`, which contains the actual nodes, with the key being their corresponding `ParseNode`'s id, and `parse_node_data`, which contains metadata about the nodes.

It's still a bit rough, (I can't figure out how to deal with `complete_script` and `is_mathml_annotation_xml_integration_point`, along with some other parts I wrote in a hurry), but I believe the overall design is sound. I'd like to hear what you think about it.

<!-- Reviewable:start -->
---
This change is [<img src="https://reviewable.io/review_button.svg" height="34" align="absmiddle" alt="Reviewable"/>](https://reviewable.io/reviews/servo/servo/17037)
<!-- Reviewable:end -->
@bors-servo
Copy link
Contributor

bors-servo commented Jun 17, 2017

@bors-servo bors-servo merged commit ad649bb into servo:master Jun 17, 2017
3 checks passed
3 checks passed
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
homu Test successful
Details
@cynicaldevil cynicaldevil deleted the cynicaldevil:impl-Sink branch Jun 17, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

5 participants
You can’t perform that action at this time.