Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: handle namespaced XML #73

Merged
merged 5 commits into from
Mar 23, 2017
Merged

WIP: handle namespaced XML #73

merged 5 commits into from
Mar 23, 2017

Conversation

pfernie
Copy link
Contributor

@pfernie pfernie commented Mar 16, 2017

When putting together a test case for #72 , I created a .xlsx using Open XML SDK to be able to manually create both shared string and inline string data. It turns out that the xml generated by that will use namespaced XML tags (an example of such can be seen here, where, e.g., <x:sheets> is used instead of <sheets>. A sample .xlsx showing this behavior was added (richtext-namespaced.xlsx).

This PR introduces a trait LocalName which is used to strip namespace prefixes from tags for Bytes{Start,End}. However, such a method probably makes more sense added to quick-xml? (Or possibly this functionality already exists there and I missed it). So consider this PR a WIP; I'm happy to retool it to work w/ an updated quick-xml or another approach.

I also tried to make Event::Eof handling consistent by making it an error and explicitly checking for various closing tags.

@tafia
Copy link
Owner

tafia commented Mar 17, 2017

I believe you could probably use read_namespaced_event, ignoring the resolved namespace?

@tafia
Copy link
Owner

tafia commented Mar 17, 2017

Could you please also rebase?

@tafia
Copy link
Owner

tafia commented Mar 17, 2017

Alternatively, if you don't want to use quick_xml namespaces resolution (which has some overhead), do you mind quickly benchmarking this alternative:

fn compare_name(node_name: &[u8], search_name: &[u8]) -> bool {
    node_name.ends_with(search_name)
        && (node_name.len() == search_name.len() 
            || node_name[node_name.len() - search_name.len() - 1] == b':') // as we are sure node_name.len() > search_name.len()
}

We could indeed probably amend quick-xml and provide this fn in BytesStart directly.

@pfernie
Copy link
Contributor Author

pfernie commented Mar 17, 2017

I've rebased. Perhaps I'm misunderstanding the read_namespaced_event fn, but for a simple test case it doesn't seem to do what is needed (namely, strip the namespace prefix):

    let xml = r#"<?xml version="1.0" encoding="utf-8"?>
<x:workbook xmlns:x="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
    <x:sheets>
        <x:sheet name="mySheet" sheetId="1" r:id="Rddc7711f116045e5" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" />
    </x:sheets>
</x:workbook>"#;
    let mut rdr = Reader::from_str(xml);

    let mut buf = Vec::new();
    loop {
        match rdr.read_namespaced_event(&mut buf).unwrap() {
            (_, Event::Eof) => break,
            (_, Event::Start(ref e)) => println!("{}", ::std::str::from_utf8(e.name()).unwrap()),
            (_, Event::End(ref e)) => println!("{}", ::std::str::from_utf8(e.name()).unwrap()),
            _ => {}
        }
    }

still yields names with the namespace:

x:workbook
x:sheets
x:sheets
x:workbook

I did a simple (repeated runs w/ time) comparison of the existing local_name vs. the suggested compare_name that simply iterates over a worksheet with 416k cells, and there was no discernable difference (both run ~0.98s on my machine); they seem quite comparable.

The local_name approach feels slightly more ergonomic as it allows for e.g.

match e.local_name() {
  b"is" => {}
  b"v" => {}
  ...
}

But either way does seem like it should end up in quick-xml.

@tafia
Copy link
Owner

tafia commented Mar 17, 2017

but for a simple test case it doesn't seem to do what is needed

Oh really! I'll check on quick-xml but it might be a regression then.

@tafia
Copy link
Owner

tafia commented Mar 17, 2017

Just opened an issue on quick-xml.

I am fine merging this PR, I'll update quick-xml later. Except if you want to make that change on quick-xml directly, I cannot do it at the moment.

@pfernie
Copy link
Contributor Author

pfernie commented Mar 17, 2017

After reviewing read_namespaced_event and realizing I had missed the extra step of resolve_namespace, I have submitted a PR tafia/quick-xml#60 which implements local_name there. I have an update of this branch that uses that, but obviously it will not build here until the quick-xml PR is resolved.

@tafia
Copy link
Owner

tafia commented Mar 21, 2017

Should we close this PR and wait for merging the one on quick-xml first?

@pfernie
Copy link
Contributor Author

pfernie commented Mar 21, 2017

I have a rebase of this PR which works with the quick-xml PR, so I can just update this PR if/once that is accepted.

@tafia
Copy link
Owner

tafia commented Mar 22, 2017

Just published quick-xml 0.7.0

@pfernie
Copy link
Contributor Author

pfernie commented Mar 22, 2017

PR updated to use quick-xml 0.7.0 and local_name. There seems to be some spurious failure on the AppVeyor build; it failed trying to run curl? Travis run went fine...

Copy link
Owner

@tafia tafia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to bother you again, but I am not sure you need read_namespaced_event if you are dealing with namespaces manually via the local_name function anyway.

src/xlsx.rs Outdated
match xml.read_event(&mut buf) {
Ok(Event::Start(ref e)) if e.name() == b"si" => {
if let Some(s) = read_string(&mut xml, b"si")? {
match xml.read_namespaced_event(&mut buf).map(|t| t.1) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't you use read_event directly since the local_name is on BytesStart ?

@tafia
Copy link
Owner

tafia commented Mar 23, 2017

There seems to be some spurious failure on the AppVeyor build; it failed trying to run curl?

We're not the only ones:
conda-forge/conda-forge.github.io#357

Looking at the case on AppVeyor could you try adding this to the appveyor file:

install:
- set PATH=C:\Program Files\Git\mingw64\bin;%PATH%

EDIT: I've just did it in #74, can you rebase?

(cherry picked from commit 9f029bc)
tests both namespaced xml content and inline v. shared richtext strings

(cherry picked from commit 3cbe855)
* introduces LocalName trait
* impl `LocalName` for `Bytes{Start,End}`, which returns the name
  with any prefix removed
* use `local_name()` to handle namespacing

(cherry picked from commit 50176dd)
(cherry picked from commit a294fbf)
@pfernie
Copy link
Contributor Author

pfernie commented Mar 23, 2017

You're right; read_namespaced_event not needed with local_name; I switched them back to read_event and tidied up the branch. AppVeyor build is passing w/ your changes as well.

Thanks!

@tafia tafia merged commit d149214 into tafia:master Mar 23, 2017
@tafia
Copy link
Owner

tafia commented Mar 23, 2017

Thanks a lot for your patience!

@tafia
Copy link
Owner

tafia commented Mar 23, 2017

Just published version 0.7.0

dhbradshaw added a commit to dhbradshaw/calamine that referenced this pull request Nov 10, 2021
This continues using a pattern for parsing namespaced xlsx files introduced in pr
tafia#73 and thereby allows parsing of namespaced .xlsx
files with definedName values.

`xml.read_text` requires the fully namespaced name.  Since that isn't
always equal to the local_name, the pattern introduced has been to
use `e.local_name` for making comparisons to the raw value of interest and then `e.name()`
instead of that raw value for xml.read_text .
dhbradshaw added a commit to dhbradshaw/calamine that referenced this pull request Nov 10, 2021
This continues using a pattern for parsing namespaced xlsx files introduced in pr
tafia#73 and thereby allows parsing of namespaced .xlsx
files with definedName values.

`xml.read_text` requires the fully namespaced name.  Since that isn't
always equal to the local_name, the pattern introduced has been to
use `e.local_name` for making comparisons to the raw value of interest and then `e.name()`
instead of that raw value for xml.read_text .
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants