Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new tests for syntax and ill-formed parser errors and fix... emm... errors #684

Merged
merged 12 commits into from
Nov 22, 2023

Conversation

Mingun
Copy link
Collaborator

@Mingun Mingun commented Nov 21, 2023

This PR is a prepare step for rewriting parser from scratch. First of all I want to guarantee that behaviour will not be changed, especially conditions on which errors is returned and a final state after the error.

Thanks to that tests, I found several bugs related to result of .buffer_position() that should report error position.

This PR changes how some ill-formed XML constructs are processed:

  • <?xml?> now will be threated as syntactically correct XML declaration, although it is not well-formed. Note, however, that some online validators allows such declaration. Previously it was reported as Event::PI. According to the specification, a version attribute is mandatory, but I follow the quick-xml ideology, that we do not do checks that could slow down parsing, but that checks are performed by demand. In that case when you call BytesDecl::version(). Maybe that is not the better way to do things, but this PR not about it.
  • <??> now will be parsed as syntactically correct processing instruction, although it is not well-formed, previously error is returned. Again, some online validators allows that. Currently there is no way to enforce well-formedless check for PI, but I think, that could be done in Processing instructions should have a dedicated type #650

@Mingun Mingun added the bug label Nov 21, 2023
@Mingun Mingun requested a review from dralley November 21, 2023 20:21
- `eof_1` and `bad_1` replaced by `syntax::decl` tests.
- `test_start` replaced by `syntax::tag::normal` tests.
- `test_offset_err_comment` and `test_offset_err_comment_trim_text` replaced by `syntax::comment` tests.
  Checks of position of preceding start tag is not meaningful for that tests, so such replacement is justified.

failures (150):
    syntax::cdata::unclosed1::async_tokio
    syntax::cdata::unclosed1::borrowed
    syntax::cdata::unclosed1::buffered
    syntax::cdata::unclosed2::async_tokio
    syntax::cdata::unclosed2::borrowed
    syntax::cdata::unclosed2::buffered
    syntax::cdata::unclosed3::async_tokio
    syntax::cdata::unclosed3::borrowed
    syntax::cdata::unclosed3::buffered
    syntax::cdata::unclosed4::async_tokio
    syntax::cdata::unclosed4::borrowed
    syntax::cdata::unclosed4::buffered
    syntax::cdata::unclosed5::async_tokio
    syntax::cdata::unclosed5::borrowed
    syntax::cdata::unclosed5::buffered
    syntax::cdata::unclosed6::async_tokio
    syntax::cdata::unclosed6::borrowed
    syntax::cdata::unclosed6::buffered
    syntax::cdata::unclosed7::async_tokio
    syntax::cdata::unclosed7::borrowed
    syntax::cdata::unclosed7::buffered
    syntax::cdata::unclosed8::async_tokio
    syntax::cdata::unclosed8::borrowed
    syntax::cdata::unclosed8::buffered
    syntax::cdata::unclosed9::async_tokio
    syntax::cdata::unclosed9::borrowed
    syntax::cdata::unclosed9::buffered
    syntax::comment::unclosed1::async_tokio
    syntax::comment::unclosed1::borrowed
    syntax::comment::unclosed1::buffered
    syntax::comment::unclosed2::async_tokio
    syntax::comment::unclosed2::borrowed
    syntax::comment::unclosed2::buffered
    syntax::comment::unclosed3::async_tokio
    syntax::comment::unclosed3::borrowed
    syntax::comment::unclosed3::buffered
    syntax::comment::unclosed4::async_tokio
    syntax::comment::unclosed4::borrowed
    syntax::comment::unclosed4::buffered
    syntax::comment::unclosed5::async_tokio
    syntax::comment::unclosed5::borrowed
    syntax::comment::unclosed5::buffered
    syntax::comment::unclosed6::async_tokio
    syntax::comment::unclosed6::borrowed
    syntax::comment::unclosed6::buffered
    syntax::comment::unclosed7::async_tokio
    syntax::comment::unclosed7::borrowed
    syntax::comment::unclosed7::buffered
    syntax::decl::normal1::async_tokio
    syntax::decl::normal1::borrowed
    syntax::decl::normal1::buffered
    syntax::decl::unclosed1::async_tokio
    syntax::decl::unclosed1::borrowed
    syntax::decl::unclosed1::buffered
    syntax::decl::unclosed2::async_tokio
    syntax::decl::unclosed2::borrowed
    syntax::decl::unclosed2::buffered
    syntax::decl::unclosed3::async_tokio
    syntax::decl::unclosed3::borrowed
    syntax::decl::unclosed3::buffered
    syntax::decl::unclosed4::async_tokio
    syntax::decl::unclosed4::borrowed
    syntax::decl::unclosed4::buffered
    syntax::doctype::unclosed1::async_tokio
    syntax::doctype::unclosed1::borrowed
    syntax::doctype::unclosed1::buffered
    syntax::doctype::unclosed2::async_tokio
    syntax::doctype::unclosed2::borrowed
    syntax::doctype::unclosed2::buffered
    syntax::doctype::unclosed3::async_tokio
    syntax::doctype::unclosed3::borrowed
    syntax::doctype::unclosed3::buffered
    syntax::doctype::unclosed4::async_tokio
    syntax::doctype::unclosed4::borrowed
    syntax::doctype::unclosed4::buffered
    syntax::doctype::unclosed5::async_tokio
    syntax::doctype::unclosed5::borrowed
    syntax::doctype::unclosed5::buffered
    syntax::doctype::unclosed6::async_tokio
    syntax::doctype::unclosed6::borrowed
    syntax::doctype::unclosed6::buffered
    syntax::doctype::unclosed7::async_tokio
    syntax::doctype::unclosed7::borrowed
    syntax::doctype::unclosed7::buffered
    syntax::doctype::unclosed8::async_tokio
    syntax::doctype::unclosed8::borrowed
    syntax::doctype::unclosed8::buffered
    syntax::doctype::unclosed9::async_tokio
    syntax::doctype::unclosed9::borrowed
    syntax::doctype::unclosed9::buffered
    syntax::pi::normal_empty::async_tokio
    syntax::pi::normal_empty::borrowed
    syntax::pi::normal_empty::buffered
    syntax::pi::unclosed10::async_tokio
    syntax::pi::unclosed10::borrowed
    syntax::pi::unclosed10::buffered
    syntax::pi::unclosed1::async_tokio
    syntax::pi::unclosed1::borrowed
    syntax::pi::unclosed1::buffered
    syntax::pi::unclosed2::async_tokio
    syntax::pi::unclosed2::borrowed
    syntax::pi::unclosed2::buffered
    syntax::pi::unclosed3::async_tokio
    syntax::pi::unclosed3::borrowed
    syntax::pi::unclosed3::buffered
    syntax::pi::unclosed4::async_tokio
    syntax::pi::unclosed4::borrowed
    syntax::pi::unclosed4::buffered
    syntax::pi::unclosed5::async_tokio
    syntax::pi::unclosed5::borrowed
    syntax::pi::unclosed5::buffered
    syntax::pi::unclosed6::async_tokio
    syntax::pi::unclosed6::borrowed
    syntax::pi::unclosed6::buffered
    syntax::pi::unclosed7::async_tokio
    syntax::pi::unclosed7::borrowed
    syntax::pi::unclosed7::buffered
    syntax::pi::unclosed8::async_tokio
    syntax::pi::unclosed8::borrowed
    syntax::pi::unclosed8::buffered
    syntax::pi::unclosed9::async_tokio
    syntax::pi::unclosed9::borrowed
    syntax::pi::unclosed9::buffered
    syntax::tag::unclosed1::async_tokio
    syntax::tag::unclosed1::borrowed
    syntax::tag::unclosed1::buffered
    syntax::tag::unclosed2::async_tokio
    syntax::tag::unclosed2::borrowed
    syntax::tag::unclosed2::buffered
    syntax::tag::unclosed3::async_tokio
    syntax::tag::unclosed3::borrowed
    syntax::tag::unclosed3::buffered
    syntax::tag::unclosed4::async_tokio
    syntax::tag::unclosed4::borrowed
    syntax::tag::unclosed4::buffered
    syntax::tag::unclosed5::async_tokio
    syntax::tag::unclosed5::borrowed
    syntax::tag::unclosed5::buffered
    syntax::tag::unclosed6::async_tokio
    syntax::tag::unclosed6::borrowed
    syntax::tag::unclosed6::buffered
    syntax::tag::unclosed7::async_tokio
    syntax::tag::unclosed7::borrowed
    syntax::tag::unclosed7::buffered
    syntax::unclosed_bang1::async_tokio
    syntax::unclosed_bang1::borrowed
    syntax::unclosed_bang1::buffered
    syntax::unclosed_bang2::async_tokio
    syntax::unclosed_bang2::borrowed
    syntax::unclosed_bang2::buffered
…nted to `<`

Some code become ugly, but anyway it will be rewritten soon

failures (6):
    syntax::decl::normal1::async_tokio
    syntax::decl::normal1::borrowed
    syntax::decl::normal1::buffered
    syntax::pi::normal_empty::async_tokio
    syntax::pi::normal_empty::borrowed
    syntax::pi::normal_empty::buffered
Although empty target name isnot allowed, we would check that (or change parsing)
the same way as currently we check the presence of mandatory `version` attribute
in XML declaration. This requires introducing separate type for PI content which
will be done in tafia#650

Fixed (3):
    syntax::pi::normal_empty::async_tokio
    syntax::pi::normal_empty::borrowed
    syntax::pi::normal_empty::buffered

failures (3):
    syntax::decl::normal1::async_tokio
    syntax::decl::normal1::borrowed
    syntax::decl::normal1::buffered
This is more consistent with the way how we check presence of the mandatory
`version` attribute in XML declaration. That check currently only performed
when you try to read version. Also, technically in the grammar space is
a part of version info: https://www.w3.org/TR/xml11/#NT-XMLDecl

Fixed (3):
    syntax::decl::normal1::async_tokio
    syntax::decl::normal1::borrowed
    syntax::decl::normal1::buffered
failures (15):
    ill_formed::double_hyphen_in_comment2::async_tokio
    ill_formed::double_hyphen_in_comment2::borrowed
    ill_formed::double_hyphen_in_comment2::buffered
    ill_formed::double_hyphen_in_comment3::async_tokio
    ill_formed::double_hyphen_in_comment3::borrowed
    ill_formed::double_hyphen_in_comment3::buffered
    ill_formed::double_hyphen_in_comment4::async_tokio
    ill_formed::double_hyphen_in_comment4::borrowed
    ill_formed::double_hyphen_in_comment4::buffered
    ill_formed::missed_doctype_name1::async_tokio
    ill_formed::missed_doctype_name1::borrowed
    ill_formed::missed_doctype_name1::buffered
    ill_formed::missed_doctype_name2::async_tokio
    ill_formed::missed_doctype_name2::borrowed
    ill_formed::missed_doctype_name2::buffered
…omment)`

`p` in the original code was the number of found occurrence and not a byte offset into `buf`

Fixed (9):
    ill_formed::double_hyphen_in_comment2::async_tokio
    ill_formed::double_hyphen_in_comment2::borrowed
    ill_formed::double_hyphen_in_comment2::buffered
    ill_formed::double_hyphen_in_comment3::async_tokio
    ill_formed::double_hyphen_in_comment3::borrowed
    ill_formed::double_hyphen_in_comment3::buffered
    ill_formed::double_hyphen_in_comment4::async_tokio
    ill_formed::double_hyphen_in_comment4::borrowed
    ill_formed::double_hyphen_in_comment4::buffered

failures (6):
    ill_formed::missed_doctype_name1::async_tokio
    ill_formed::missed_doctype_name1::borrowed
    ill_formed::missed_doctype_name1::buffered
    ill_formed::missed_doctype_name2::async_tokio
    ill_formed::missed_doctype_name2::borrowed
    ill_formed::missed_doctype_name2::buffered
…me)`

Fixed (6):
    ill_formed::missed_doctype_name1::async_tokio
    ill_formed::missed_doctype_name1::borrowed
    ill_formed::missed_doctype_name1::buffered
    ill_formed::missed_doctype_name2::async_tokio
    ill_formed::missed_doctype_name2::borrowed
    ill_formed::missed_doctype_name2::buffered
src/errors.rs Outdated
@@ -103,6 +114,12 @@ pub enum IllFormedError {
impl fmt::Display for IllFormedError {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
match self {
Self::MissedVersion(None) => {
Copy link
Collaborator

@dralley dralley Nov 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, DeclWithoutVersion makes more sense as the name. MissedVersion doesn't provide very much context.

The existing tags using that pattern e.g. MissedEnd are a little better, enough to make it OK in my opinion, but on reflection maybe it would be better to be more explicit e.g. MissingEndTag.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to keep errors with a similar meaning named similar. What about

  • MissingDeclVersion
  • MissingDoctypeName
  • MissingEndTag
  • MismatchedEndTag
  • UnmatchedEndTag
  • DoubleHyphenInComment (not changed)

?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MissingDeclVersion is fine, I have no issue with those names.

return Err(Error::IllFormed(IllFormedError::MissedDoctypeName));
match buf[8..].iter().position(|&b| !is_whitespace(b)) {
Some(start) => Ok(Event::DocType(BytesText::wrap(
&buf[8 + start..],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be kind of nice if we could document these constants here and above. With a bit of thought most of them are not so difficult to figure out, but they're not obvious at first clance.

e.g.

            BangType::DocType if uncased_starts_with(buf, b"!DOCTYPE") => {
                let doctype_tag_len = 8;
                match buf[doctype_tag_len..].iter().position(|&b| !is_whitespace(b)) {
                    Some(start) => Ok(Event::DocType(BytesText::wrap(
                        &buf[doctype_tag_len + start..],

Feel free to tweak the details (such as const vs let, doctype_tag_len vs DOCTYPE_TAG.len())

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lines 87 - 113 could use the same treatment.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that code already slightly rewritten (I'll push draft PR soon) and does not contain magic numbers (or they documented)

// ^= 9
err!(missed_doctype_name2("<!DOCTYPE \t\r\n>") => 13: IllFormedError::MissedDoctypeName);
// ^= 13
ok!(missed_doctype_name3("<!DOCTYPE \t\r\nx>") => Event::DocType(BytesText::new("x")));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you intend to leave these names as missed_doctype_name3 (and below, mismatched_end1, and double_hyphen_in_comment1) for testing variants that are well-formed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. The main motivation is that these names of the "test about MissedDoctypeName error number one... number two... and so on". The final test is a canary test that ensures that after all we able to parse correct representation without mentioned error.

I could wrap that tests into module named missed_doctype_name (snake cased error name), but I decide, that indent level would be not very comfort. The similar names of tests for the particular error is useful for running of only related tests via cargo test ... -- missed_doctype_name.

In the end, I thought that just a common name with a number is quite convenient and concise.

Copy link
Collaborator

@dralley dralley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Improvements are suggested. This looks OK otherwise.

Co-authored-by: Daniel Alley <dalley@redhat.com>
@codecov-commenter
Copy link

Codecov Report

Attention: 20 lines in your changes are missing coverage. Please review.

Comparison is base (64c4249) 65.14% compared to head (2c55638) 65.31%.

Files Patch % Lines
src/errors.rs 9.09% 10 Missing ⚠️
src/reader/mod.rs 93.81% 6 Missing ⚠️
src/events/mod.rs 0.00% 4 Missing ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #684      +/-   ##
==========================================
+ Coverage   65.14%   65.31%   +0.16%     
==========================================
  Files          38       38              
  Lines       17885    17920      +35     
==========================================
+ Hits        11651    11704      +53     
+ Misses       6234     6216      -18     
Flag Coverage Δ
unittests 65.31% <88.09%> (+0.16%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Mingun Mingun merged commit 9fb181a into tafia:master Nov 22, 2023
6 checks passed
@Mingun Mingun deleted the errors branch November 22, 2023 15:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants