Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can't read mzML files written by OpenChrom because of missing namespace declaration #17

Closed
ethanbass opened this issue Apr 24, 2023 · 5 comments

Comments

@ethanbass
Copy link
Contributor

ethanbass commented Apr 24, 2023

Hi,
The mzML files written by OpenChrom are missing a namespace declaration that is required for RaMS to parse the files correctly. It throws an error: Error in xml_find_first.xml_node(xml_data, paste0("//d1:", node_to_check)) : xmlXPathEval: evaluation failed". If I manually add the namespace declaration to the file, it parses fine.

I'm not sure what the best fix would be here. I'm not too familiar with mzML so I don't know if these namespace declarations are "required" for the mzML files to be valid. In any case I am wondering if there is a way to add a check in RaMS for the namespace declaration and maybe modify the xpaths accordingly so that RaMS can read the OpenChrom files without modification. Alternatively, I guess it would be possible to add the namespace declaration to the XML after importing it to R. I'm not sure which way would be a better approach here?

I'm attaching an example file that was produced by OpenChrom (with the .txt extension because GitHub wouldn't let me upload mzML).
alkanes 0.25 mg-ml_2-10-2021.txt

Thanks!
Ethan

@wkumler
Copy link
Owner

wkumler commented Apr 24, 2023

Hi Ethan, thanks for the bug report - it's good to know that this is happening. Namespaces are definitely one of those things I don't understand about XML very well so I'm not super surprised that we ran into an issue with it at some point. It looks like the main issue is that I wrote RaMS with the expectation that the default namespaces (d1/d2) are the only ones used, but the OpenChrom mzMLs only have the xsi namespace:

> xml_ns(xml_data_blk) # A random blank file from msconvert, read in using read_xml
d1   <-> http://psi.hupo.org/ms/mzml
d2   <-> http://psi.hupo.org/ms/mzml
xsi  <-> http://www.w3.org/2001/XMLSchema-instance
xsi1 <-> http://www.w3.org/2001/XMLSchema-instance
> xml_ns(xml_data_alkanes) # Your alkanes file, read in using read_xml
xsi <-> http://www.w3.org/2001/XMLSchema-instance

The good news is that the fix is relatively simple: we just need to drop the namespaces from the xpath expressions:

# Throws an error:
xml_find_first(xml_data_alkanes, '//d1:cvParam[@accession="MS:1000574"]')
# Works great:
xml_find_first(xml_data_alkanes, '//cvParam[@accession="MS:1000574"]')

I've implemented this in the namespace_free branch and it works perfectly on the demo file you shared, but is then broken for mzMLs from other places.

The bad news is that this is super pervasive - I've basically hard-coded these namespaces everywhere in the code and it won't be easy to carefully strip them out and create a new switch that handles both types. I think the real solution is more along the lines of the second strategy you propos and will probably be for me to learn more about XML namespaces and do the proper handling when the file is initially read in, but it may be a little while before I can set aside time for that. In the meantime, I hope the namespace_free branch will work for your purposes. You can install it with

devtools::install_github("https://github.com/wkumler/RaMS/tree/namespace_free")

@ethanbass
Copy link
Contributor Author

ethanbass commented Apr 25, 2023

Thanks for looking into this! It took me stupidly long to figure this out, but I think I actually found a pretty easy solution. It turns out you can just set the namespace as an attribute of the xml. I think a function like this will fix the problem (to be called right after reading in the xml with read_xml:

checkNamespace <- function(xml_data){
  if (is.na(xml2::xml_attr(xml_data,"xmlns"))){
    xml2::xml_attr(xml_data,"xmlns") <- "http://psi.hupo.org/ms/mzml"
  }
}

I could probably put together a pull request later this afternoon if you like

@wkumler
Copy link
Owner

wkumler commented Sep 25, 2023

Hi Ethan, just wanted to let you know I haven't forgotten about this! I'm a bit swamped with other tasks right now and have had to de-prioritize package development but I'm hoping to get v1.4 our by the end of this year or early 2024 and this would be the perfect thing to include at that point. Your PR looks great and I'll probably just implement it as-is, I just don't have the time to go through the updates and versioning process at the moment.

@ethanbass
Copy link
Contributor Author

Sounds good. Thanks for the update and no worries about the delay -- I am quite familiar with being swamped.

@wkumler wkumler mentioned this issue Nov 13, 2023
9 tasks
@wkumler
Copy link
Owner

wkumler commented Nov 30, 2023

Completed with #24

@wkumler wkumler closed this as completed Nov 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants