How to process/generate ShEx reports #115

Open
andrawaag opened this issue Mar 24, 2021 · 2 comments

I am applying ShEx validation on a large scale on Wikidata items, but I am struggling to aggregate the results in a sensible report. I am looking for best practices here. This is the approach I have followed so far, which is working for this specific use case.

Take for example the following use case:

  • I want to validate all items in Wikidata that have a statement with the Disease Ontology Property (P699):
    https://w.wiki/387j.
  • For those 13054 Wikidata items, I would like to know whether they conform to the Shape Expression E3.
  • From the resulting 13054 ShEx validation reports, I would like an aggregated report that summarizes the issues observed overall.

I have developed a script using WikidataIntegrator and PyShEx to do the validation. These are the steps:

  • Do the validation: script
  • Capture the generated reports: log
  • Parse the log and generate an aggregated report: notebook

The current aggregated report is sufficient for its task, i.e. showing where the issues are. Getting there, however, requires fragile parsing of output strings and some arbitrary clustering of error types.

{'No matching triples found for predicate p:P2888': 6608,
 'No matching triples found for predicate ps:P279': 6217,
 '2 triples exceeds max {1,1}': 3666,
 'No matching triples found for predicate prov:wasDerivedFrom': 2632,
 '{"values": ["http://www.wikidata.org/entity/Q5282129"], "typ...': 534,
 '{"values": ["http://www.wikidata.org/entity/Q27468140"], "ty...': 1304,
 'No matching triples found for predicate pr:P699': 772,
 '3 triples exceeds max {1,1}': 9,
 'No matching triples found for predicate pr:P5270': 1}
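
For illustration, the clustering step could be made less arbitrary by matching each reason string against a small set of explicit patterns. This is only a sketch: the category names and the `classify`/`aggregate` helpers are hypothetical, and the regular expressions are assumptions derived from the message shapes visible in the report above, not from any guaranteed PyShEx output format.

```python
import re
from collections import Counter
from typing import Iterable, Tuple

# Hypothetical categories; patterns are guessed from the messages in the report.
PATTERNS = [
    ("missing_predicate", re.compile(r"No matching triples found for predicate (\S+)")),
    ("cardinality", re.compile(r"(\d+) triples exceeds max \{\d+,(?:\d+|\*)\}")),
    ("value_set", re.compile(r'^\{"values":')),
]


def classify(reason: str) -> Tuple[str, str]:
    """Return (category, detail) for a single reason string."""
    text = reason.strip()
    for category, pattern in PATTERNS:
        match = pattern.search(text)
        if match:
            # Detail is the captured predicate or triple count, if any.
            detail = match.group(1) if match.groups() else ""
            return category, detail
    # Anything unrecognised is kept visible instead of being dropped.
    return "other", text[:60]


def aggregate(reasons: Iterable[str]):
    """Count reasons per category and per (category, detail) pair."""
    by_category = Counter()
    by_detail = Counter()
    for reason in reasons:
        category, detail = classify(reason)
        by_category[category] += 1
        by_detail[(category, detail)] += 1
    return by_category, by_detail
```

Keeping an explicit "other" bucket makes it easy to spot message shapes the patterns do not yet cover.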

I am looking for:

  1. Suggestions for improving the pipeline, or alternatives to it.
  2. A standard output from ShEx validation pipelines from which reports can be generated. For example, can there be a finite set of error types, e.g. "No matching triples", "cardinality issue", etc.?
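
To make the second request concrete, here is a minimal sketch of what a structured, machine-readable validation record could look like if it were written out at validation time instead of being recovered from log text. Everything in it is an assumption for discussion: the `IssueType` members, the `ValidationIssue` fields, and the JSON-lines serialization are hypothetical and are not taken from the ShEx specification or from any existing implementation.

```python
import json
from dataclasses import dataclass, asdict
from enum import Enum


class IssueType(str, Enum):
    # Illustrative set only; whether such a set can be finite is the open question.
    MISSING_PREDICATE = "missing_predicate"
    CARDINALITY = "cardinality"
    VALUE_MISMATCH = "value_mismatch"
    OTHER = "other"


@dataclass
class ValidationIssue:
    focus: str             # IRI of the node that was validated
    shape: str             # label of the shape it was validated against
    issue_type: IssueType  # coarse category used for aggregation
    predicate: str = ""    # offending predicate, when one applies
    message: str = ""      # original free-text reason, kept for debugging

    def to_json(self) -> str:
        """Serialize to one JSON line, storing the enum as its string value."""
        record = asdict(self)
        record["issue_type"] = self.issue_type.value
        return json.dumps(record, sort_keys=True)


def from_json(line: str) -> ValidationIssue:
    """Rebuild a ValidationIssue from a line produced by to_json."""
    record = json.loads(line)
    record["issue_type"] = IssueType(record["issue_type"])
    return ValidationIssue(**record)
```

With one such record per line, an aggregated report becomes a simple count over `issue_type` and `predicate` rather than a string-parsing exercise.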

andrawaag commented Mar 24, 2021

@hsolbrig @ericprud @labra If I am not mistaken, you have each implemented output detailing the possible errors in a ShEx validation. Is the number of possible error types finite? Would you mind listing the error types your implementations report?

goodb commented Mar 24, 2021

Hi @andrawaag, when faced with a similar situation for the Gene Ontology folks, I opened up the Java ShEx implementation so I could access failure information directly rather than parsing output files. You might have more luck that way, as the information you need is by definition in there somewhere.

For that project, I returned 1) the node(s) that failed and 2) the properties that disagreed with the schema applied. I think this was enough to make a start on human-readable error reporting, but maybe touch base with team GO to see how that is going now. The hard part, which we haven't really worked on yet, is the error cascade, e.g. when one node fails and that failure then causes another node to fail, and so on. In those models, which are pretty small, a human can usually find the root of a problem, but this will be an issue for larger models.

Work was related to this issue: geneontology/minerva#212
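
One possible way to approach the error cascade mentioned above is to report only those failed nodes whose failure is not explained by another failed node. The sketch below is a simplification and not how the Java ShEx code or the GO tooling works: the `root_failures` helper and the `depends_on` mapping (node to the nodes its shape references) are hypothetical inputs that would have to be extracted from the validator.

```python
def root_failures(failed, depends_on):
    """Return failed nodes with no failed upstream dependency.

    failed:     iterable of node identifiers that did not validate
    depends_on: dict mapping a node to the nodes its validation relied on
    """
    failed = set(failed)
    roots = set()
    for node in failed:
        # Only upstream nodes that also failed can explain this failure.
        upstream = set(depends_on.get(node, ())) & failed
        upstream.discard(node)  # ignore self-references
        if not upstream:
            roots.add(node)
    return roots
```

Note that failures which depend on each other in a cycle yield no root under this rule, so such groups would need separate handling.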
