reml-generated EML should include metadata stating so #22

Closed
1 of 3 tasks
cboettig opened this issue Jun 28, 2013 · 22 comments

Comments

@cboettig
Member

That way if someone doesn't like the EML, they know who to blame ;-)

  • e.g. should have plain-text description of REML, contact info & bug report info. Have to figure out the best syntax for this.
  • A richer implementation could document the R function calls used to generate the EML.
  • Include citation to REML (e.g. as a software node and/or literature node)
@cboettig
Member Author

cboettig commented Jul 6, 2013

@mbjones Is this a standard thing to do? Recommendation for how we encode it?

@mbjones
Member

mbjones commented Jul 6, 2013

I think the best place to add it is in /eml/dataset/methods/methodStep/software, and in the sibling description element describe the role that reml played in generating the metadata. You might also want to add the citation in that subtree to REML. EML is pretty flexible, so there are other options as well, but I think this is the most appropriate.

@cboettig
Member Author

cboettig commented Jul 7, 2013

Excellent. Since this will create a software node, we may as well write eml_software first, and eml_R_software, then we can create the software node with eml_software("reml"), see #32

@mbjones Um, I'm not spotting the documentation for how a sibling description element should be constructed?

@mbjones
Member

mbjones commented Jul 7, 2013

Schema diagram is here:
http://knb.ecoinformatics.org/software/eml/eml-2.1.1/eml-methods.png
I'm referring to the three sibling elements:
/eml/dataset/methods/methodStep/software
/eml/dataset/methods/methodStep/description
/eml/dataset/methods/methodStep/citation
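
A minimal R sketch of those three siblings, built with the XML package that the reml code in this thread already uses. It assumes the order description, citation, software; the description text is a placeholder and the empty citation node is only a stand-in, so the fragment illustrates the layout and is not expected to validate as it stands.

```r
library(XML)

# Build <methods>/<methodStep> holding the three sibling elements.
methods_node <- newXMLNode("methods")
method_step  <- newXMLNode("methodStep", parent = methods_node)

# description first, then citation, then software.
newXMLNode("description",
           "Metadata generated by the reml R package.",  # placeholder text
           parent = method_step)
newXMLNode("citation", parent = method_step)              # stand-in, left empty
software <- newXMLNode("software", parent = method_step)
newXMLNode("title", "reml", parent = software)

# Print the resulting subtree.
cat(saveXML(methods_node))
```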

cboettig added a commit that referenced this issue Jul 7, 2013
Contains commands to generate a software node using eml_R_software (#32)
and contains commands to generate the description node based on package description text (custom text might be preferred?)
Still needs citation node #27, along with map between R citation() function and eml citation nodes
@cboettig
Member Author

cboettig commented Jul 7, 2013

@mbjones Thanks, this makes sense. First two are done, but trying to wrap my head around EML citation objects.

I assume we cite software as <generic>?

I'm a bit confused why I don't see things like title and author listed under fields such as Article: http://knb.ecoinformatics.org/software/eml/eml-2.1.1/eml-literature.html#Article, I guess that's because we have <title> and <creator> defined elsewhere?

I see that the citation object is built around the endnote format. R's citation tools (and lots of other tools) can return citations in bibtex format; I'm wondering if there's anything clever that can be done here in place of just mapping each term by hand...
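
One possible shape for such a mapping, sketched in R rather than taken from reml: R's `bibentry` and `person` objects already carry the fields, so they can be walked and written out as `title`, `creator` and `pubDate` children. The citation below is hypothetical (the title in particular is a placeholder; real code would presumably start from `citation("reml")`), and the type-specific element such as `<generic>` is left out, so this fragment is not expected to validate on its own.

```r
library(XML)
library(utils)

# Hypothetical citation for reml, used only for illustration.
ref <- bibentry(
  bibtype = "Manual",
  title   = "reml: R tools for the Ecological Metadata Language",  # placeholder
  author  = c(person("Carl", "Boettiger"), person("Karthik", "Ram")),
  year    = "2013",
  url     = "https://github.com/ropensci/reml"
)

# Map bibentry fields onto title/creator/pubDate children of <citation>.
citation_node <- function(bib) {
  node <- newXMLNode("citation")
  newXMLNode("title", bib$title, parent = node)
  authors <- bib$author
  for (i in seq_along(authors)) {
    p       <- authors[[i]]
    creator <- newXMLNode("creator", parent = node)
    name    <- newXMLNode("individualName", parent = creator)
    newXMLNode("givenName", paste(p$given, collapse = " "), parent = name)
    newXMLNode("surName", paste(p$family, collapse = " "), parent = name)
  }
  newXMLNode("pubDate", as.character(bib$year), parent = node)
  node
}

node <- citation_node(ref)
cat(saveXML(node))
```

The same `ref` object can be passed to `toBibtex()` to get the BibTeX form, which is why starting from `bibentry` avoids parsing BibTeX text by hand.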

@mbjones
Member

mbjones commented Jul 7, 2013

@cboettig Yeah, software is probably best listed as <generic>. At the time we did this, EndNote was massively predominant, and the likes of Mendeley and Zotero were far off on the horizon yet. In retrospect I wish I had known more about Bibtex, as it seems to have survived the test of time, but in 1998-2000 there simply weren't any XML-based citation schemas available. So, we ported endnote. There must be a decent mapping to convert to more modern standards like Bibtex or Bibo, but I haven't looked carefully for it. I think Dryad uses a subset of Bibo, but they had to define their own xml schema for that too, as Bibo doesn't have a schema doc.

Regarding why <title> and <creator> are not shown in the spec, it's because they are part of another module that is included by reference, specifically res:ResourceGroup. In XSD, you can include schema portions by reference to a group of elements, and as we need the bibliographic fields that describe resources in many places, we created res:ResourceGroup to be the common group for these fields. So, near the top of the CitationType definition (http://knb.ecoinformatics.org/software/eml/eml-2.1.1/eml-literature.html#CitationType), you will see this:

A sequence of (
res:ResourceGroup
contact optional unbounded
...
)

which is a group inclusion. Follow the link to res:ResourceGroup and you'll see all of the fields. If you look at the diagram for eml-literature, you'll see that those fields in the group have been included by parsing the XSD (http://knb.ecoinformatics.org/software/eml/eml-2.1.1/eml-literature.png).
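
To make the group inclusion concrete, here is a small R sketch that resolves an `xs:group` reference by hand. The schema is a toy written for this example, not the real EML XSD (which uses prefixed references such as `res:ResourceGroup` and many more fields), and `expand_type` is a hypothetical helper name.

```r
library(XML)

# Toy schema: CitationType pulls in title/creator through a group reference.
xsd <- '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:group name="ResourceGroup">
    <xs:sequence>
      <xs:element name="title"/>
      <xs:element name="creator"/>
    </xs:sequence>
  </xs:group>
  <xs:complexType name="CitationType">
    <xs:sequence>
      <xs:group ref="ResourceGroup"/>
      <xs:element name="contact" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>'

doc <- xmlParse(xsd, asText = TRUE)
xs  <- c(xs = "http://www.w3.org/2001/XMLSchema")

# List the element names a complex type allows, expanding group references.
expand_type <- function(doc, type_name) {
  path  <- sprintf("//xs:complexType[@name='%s']/xs:sequence/*", type_name)
  parts <- getNodeSet(doc, path, namespaces = xs)
  unlist(lapply(parts, function(part) {
    if (xmlName(part) == "group") {
      # Look up the referenced group and take its element names instead.
      group_path <- sprintf("//xs:group[@name='%s']/xs:sequence/xs:element",
                            xmlGetAttr(part, "ref"))
      group_elements <- getNodeSet(doc, group_path, namespaces = xs)
      unlist(lapply(group_elements, xmlGetAttr, "name"))
    } else {
      xmlGetAttr(part, "name")
    }
  }))
}

fields <- expand_type(doc, "CitationType")
print(fields)
```

In this toy case the expanded list starts with the group's `title` and `creator` and then continues with `contact`, which mirrors why those fields are missing from the per-type listing but present in the diagram.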

@cboettig
Member Author

cboettig commented Jul 8, 2013

Sounds good. Thanks for explaining the group inclusion with res:ResourceGroup; clearly I'm brand new to XSD, so these pointers help me get up to speed.

Yeah, Dryad uses a custom schema:
https://raw.github.com/datadryad/dryad-repo/dryad-master/dspace/modules/xmlui/src/main/webapp/themes/Dryad/meta/schema/v3.1/bibo.xsd

I was hoping Shotton's group might be persuaded to make an XSD file for fabio, an alternative to bibo with some advantages, including being OWL 2 DL instead of OWL Full; see
http://semanticpublishing.wordpress.com/2011/06/29/comparison-of-bibo-and-fabio/
(I have only a fuzzy understanding of the differences, but it probably means more to you). A moot point, since they don't seem to have an XSD file either at this time, and even if they did, this would mean changing the EML schema? Or is it trivial to extend if you had such a bibliographic schema?


@mbjones
Member

mbjones commented Jul 8, 2013

Definitely not trivial to extend -- many groups around the world use the EML schema and have written software to generate it and consume it -- any schema changes, especially backwards incompatible ones, have a ripple effect on the community. So, we try to avoid changes that break existing EML documents. Adding something as an optional new field is more acceptable and can generally get approved by the EML community fairly quickly.

@cboettig
Member Author

@mbjones Okay, I failed to write this methodsStep (which states that reml created the EML) correctly:

My R code creates a node that looks like this:

<methods>
  <methodsStep>
    <software>
      <license>CC0</license>
      <version>0.0-1</version>
      <implementation>
        <distribution>
          <online>
            <url>https://github.com/ropensci/reml</url>
          </online>
        </distribution>
      </implementation>
    </software>
    <description>An R package for reading, writing, integrating and publishing data
    using the Ecological Metadata Language (EML) format.</description>
  </methodsStep>
</methods> 

And the validator complains:

[1] "cvc-complex-type.2.4.a: Invalid content starting with element 'methodsStep'. The content must match '(((\"\":methodStep){1-UNBOUNDED},(\"\":sampling){0-1}),(\"\":qualityControl){0-UNBOUNDED}){1-UNBOUNDED}'."

Um, does this mean I need sampling and qualityControl in a methods step? I'm confused.

@karthik
Member

karthik commented Jul 20, 2013

My R code creates a node that looks like this

What function generated that metadata?

@cboettig
Member Author

eml_write, which called eml_dataset, which calls

  methodsStep <- newXMLNode("methodsStep", parent = methods_node)
  addChildren(methodsStep, eml_R_software("reml"))
  addChildren(methodsStep,
              newXMLNode("description",
                         packageDescription("reml", fields="Description")))

where eml_R_software creates the software node (via eml_software).


@mbjones
Member

mbjones commented Jul 20, 2013

@cboettig -- Just pushed a fix -- "methodsStep" should have been "methodStep".

@cboettig
Member Author

thanks!


@cboettig
Member Author

oh, validator still unhappy:

[1] "cvc-complex-type.2.4.a: Invalid content starting with element 'software'. The content must match '((((((\"\":description),((\"\":citation)|(\"\":protocol)){0-UNBOUNDED}),(\"\":instrumentation){0-UNBOUNDED}),(\"\":software){0-UNBOUNDED}),(\"\":subStep){0-UNBOUNDED}),(\"\":dataSource){0-UNBOUNDED})'."

@mbjones
Member

mbjones commented Jul 20, 2013

What does the methodStep snippet look like now? The error message is just relating the schema rules, which are somewhat easier to grok in this image:

http://knb.ecoinformatics.org/software/eml/eml-2.1.1/eml-methods.png

@cboettig
Member Author

now it is:

<methods>
  <methodStep>
    <software>
      <license>CC0</license>
      <version>0.0-1</version>
      <implementation>
        <distribution>
          <online>
            <url>https://github.com/ropensci/reml</url>
          </online>
        </distribution>
      </implementation>
    </software>
    <description>An R package for reading, writing, integrating and publishing data
    using the Ecological Metadata Language (EML) format.</description>
  </methodStep>
</methods> 

@mbjones
Member

mbjones commented Jul 20, 2013

Elements need to be in a different order to be valid. In addition, you are missing required fields from the software module, including title and creator. See:
http://knb.ecoinformatics.org/software/eml/eml-2.1.1/eml-software.png

Something like this might validate (I didn't try it, just tried to follow the schema):

<methods>
  <methodStep>
    <description>An R package for reading, writing, integrating and publishing data
    using the Ecological Metadata Language (EML) format.</description>
    <software>
      <title>reml</title>
      <creator>
        <individualName>
          <givenName>Carl</givenName>
          <surName>Boettiger</surName>
        </individualName>
      </creator>
      <creator>
        <individualName>
          <givenName>Karthik</givenName>
          <surName>Ram</surName>
        </individualName>
      </creator>
      <implementation>
        <distribution>
          <online>
            <url>https://github.com/ropensci/reml</url>
          </online>
        </distribution>
      </implementation>
      <license>CC0</license>
      <version>0.0-1</version>
    </software>
  </methodStep>
</methods>
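
The ordering rule can also be applied mechanically. The R sketch below sorts the children of a `methodStep` node into the order given by the content model in the validator message quoted earlier in this thread; `reorder_children` and `method_step_rank` are hypothetical names, and the function only reorders, it does not add missing required elements such as `title` or `creator`.

```r
library(XML)

# Position of each child in the methodStep content model; citation and
# protocol share a slot because either may appear at that point.
method_step_rank <- c(description = 1, citation = 2, protocol = 2,
                      instrumentation = 3, software = 4,
                      subStep = 5, dataSource = 6)

# Return a copy of `node` whose children are sorted into schema order.
# Children with equal rank keep their original relative order, and
# children with unknown names are placed last.
reorder_children <- function(node, rank = method_step_rank) {
  kids <- xmlChildren(node)
  idx  <- order(rank[names(kids)])
  newXMLNode(xmlName(node),
             .children = unname(lapply(kids[idx], xmlClone)))
}

# Example: software was written before description, as in the failing output.
bad <- newXMLNode("methodStep",
                  newXMLNode("software", newXMLNode("title", "reml")),
                  newXMLNode("description", "placeholder description"))
good <- reorder_children(bad)
cat(saveXML(good))
```

Working on cloned children leaves the original node untouched, which makes it easy to compare the two while debugging.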

@cboettig
Member Author

Thanks for clarifying; sorry, I still haven't got the hang of reading the spec. I keep forgetting resourceGroup and forgetting to pay attention to node ordering. Once the XMLSchema package is fully running, we will be able to automate the creation of S4 objects corresponding to the schema, so it will just be a matter of writing coercion methods (e.g. like eml_R_software) that can extract information from native R formats (e.g. the R package DESCRIPTION) into the S4 object, which will reduce errors like this!
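
A rough sketch of what such a coercion could look like, written by hand rather than generated from the schema. The class `software_sketch` and the function `software_node` are hypothetical, not reml's API; the Version, License and URL values are the ones from the XML in this thread, while the Author line is assumed for illustration.

```r
library(XML)
library(methods)

# Hypothetical S4 class holding a few software fields.
setClass("software_sketch",
         representation(title = "character", creator = "character",
                        url = "character", license = "character",
                        version = "character"))

# A DESCRIPTION-style record, read the same way R reads package metadata.
con <- textConnection(c("Package: reml",
                        "Version: 0.0-1",
                        "License: CC0",
                        "Author: Carl Boettiger, Karthik Ram",  # assumed
                        "URL: https://github.com/ropensci/reml"))
desc <- read.dcf(con)
close(con)

# Coerce the DESCRIPTION fields into the S4 object.
sw <- new("software_sketch",
          title   = unname(desc[1, "Package"]),
          creator = strsplit(desc[1, "Author"], ",\\s*")[[1]],
          url     = unname(desc[1, "URL"]),
          license = unname(desc[1, "License"]),
          version = unname(desc[1, "Version"]))

# Serialise with children in the order used in the corrected XML above:
# title, creator(s), implementation, license, version.
software_node <- function(obj) {
  creators <- lapply(obj@creator, function(full_name) {
    parts <- strsplit(full_name, " ", fixed = TRUE)[[1]]
    # Naive split: first word is the given name, last word the surname.
    newXMLNode("creator",
               newXMLNode("individualName",
                          newXMLNode("givenName", parts[1]),
                          newXMLNode("surName", parts[length(parts)])))
  })
  implementation <- newXMLNode("implementation",
                      newXMLNode("distribution",
                        newXMLNode("online", newXMLNode("url", obj@url))))
  newXMLNode("software",
             .children = c(list(newXMLNode("title", obj@title)),
                           creators,
                           list(implementation,
                                newXMLNode("license", obj@license),
                                newXMLNode("version", obj@version))))
}

node <- software_node(sw)
cat(saveXML(node))
```

Because the element order lives in one function, the calling code cannot emit the children in the wrong sequence, which is the class of error the validator caught here.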

@karthik
Member

karthik commented Jul 21, 2013

I don't have a complete understanding of the spec either. @mbjones can you suggest some readings that will allow me to get up to speed?

@cboettig
Member Author

@karthikram The pngs are pretty handy once you get the hang of them, e.g.
http://knb.ecoinformatics.org/software/eml/eml-2.1.1/eml-software.png
I think ordering of nodes matters, top down as shown. Dashed lines are optional nodes. I don't get the symbols with boxes on lines (either linear ones or stacked). And of course a vector graphic would be easier to read without squinting...

Otherwise I find the descriptions in the 'normative technical documents' reasonably readable:
http://knb.ecoinformatics.org/software/eml/eml-2.1.1/eml-software.html

As I commented above, a working XMLSchema will be a big help in automating a lot of this, and letting us focus on the UI. But writing a few nodes out by hand is pretty instructive.


@mbjones
Member

mbjones commented Jul 21, 2013

The diagrams are the best way to understand the spec, although note that the diagrams do not show XML attributes. This was a shortcoming of the software used to generate the schemas. There is a nice explanation of the diagrams here: http://www.diversitycampus.net/projects/tdwg-sdd/minutes/SchemaDocu/SchemaDesignElements.html

I'm not sure what you are referring to with the 'symbols with boxes on lines' comment. Sorry. Getting up to speed on the EML spec (or any other) just takes time -- the EML schema itself is the most useful. We wrote a paper describing use of the spec a while ago targeted at ecologists, but it doesn't get into the technical details of the spec -- see Fegraus et al. 2005: http://www.mendeley.com/download/public/1825821/4445891665/2d411d3da6a51fecf34b4f0061a68d250c486eaa/dl.pdf

These diagrams were built with XML Spy, an XML editor that can display XML Schema. There are several others that will produce diagrams. If you are trying to understand the EML schema, it can be useful to open the eml.xsd schema in one of these editors so that you can explore the schema tree more dynamically -- the images are just static screenshots of the diagrams for certain subtrees.

@cboettig
Member Author

reml-generated methods node now exists, so I think we can close this issue. software and citation nodes will be written/improved upon once we get this S4 thing nailed down, and then we can revisit this.
