Extract & generate metadata from data objects (e.g. spatial sp, raster, etc) #144

Closed
lwasser opened this issue Mar 8, 2016 · 16 comments

@lwasser

lwasser commented Mar 8, 2016

Hey @cboettig - I'm trying to wrap up a lesson and think you probably know the quick fix to this. I'm still confused about accessing slots. I have created a new, smaller EML file:

eml_HARV <- read_eml("http://neon-workwithdata.github.io/NEON-R-Spatio-Temporal-Data-and-Management-Intro/hf001-revised.xml")

Previously I could access the x,y values using:

YCoord <- eml_HARV@dataset@coverage@geographicCoverage@boundingCoordinates@northBoundingCoordinate

You removed the resource group component but now this still isn't working.

I have tried get_attributes(); however, I'm a bit confused about how to index it properly to grab an x,y location.

get_attributes(eml_HARV@dataset@coverage[[1]]@attributeList, join = TRUE)

What am I doing wrong?
Thank you!

@cboettig
Member

cboettig commented Mar 8, 2016

Sorry about that, quick answer first:

eml_HARV@dataset@coverage@geographicCoverage[[1]]@boundingCoordinates@westBoundingCoordinate

The trick here is that EML permits multiple geographicCoverage elements, so eml_HARV@dataset@coverage@geographicCoverage is really a list of geographicCoverage elements, hence the [[1]]. (The earlier version of EML ignored this fact, since most people use only one such element anyway, but that's not technically correct and cannot be assumed of all EML).

We need a better way to document & let the user discover when an element is a list, but for now, try this. If you use RStudio or have tab-completion enabled in your editor, you can start typing the coverage@geogr and hit tab (or hit tab after each @) and you'll see the possible completions. If you don't see any completions other than .Data, just add a [[1]]@ and hit tab and see if that works.
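To see why tab completion can show only .Data, here is a toy sketch using base R's S4 system. The toy* class names and the coordinate values are made up for illustration and are not the EML package's real class definitions; only the slot names mirror the ones used in this thread.

```r
library(methods)

# Hypothetical toy classes, NOT the EML package's real definitions.
# A coordinate is a character vector wrapped in an S4 class.
setClass("toyCoordinate", contains = "character")
setClass("toyBoundingCoordinates",
         slots = c(westBoundingCoordinate = "toyCoordinate",
                   northBoundingCoordinate = "toyCoordinate"))
setClass("toyGeographicCoverage",
         slots = c(boundingCoordinates = "toyBoundingCoordinates"))
# A repeatable element is modelled as an S4 class that extends list.
setClass("toyListOfGeographicCoverage", contains = "list")
setClass("toyCoverage",
         slots = c(geographicCoverage = "toyListOfGeographicCoverage"))

# Made-up coordinates for a single site
site <- new("toyGeographicCoverage",
            boundingCoordinates = new("toyBoundingCoordinates",
              westBoundingCoordinate = new("toyCoordinate", "-72.3"),
              northBoundingCoordinate = new("toyCoordinate", "42.5")))
toy_cov <- new("toyCoverage",
               geographicCoverage = new("toyListOfGeographicCoverage",
                                        list(site)))

# The list-like class exposes only a .Data slot
list_slots <- slotNames("toyListOfGeographicCoverage")

# Index into the list first, then keep following slots
west <- toy_cov@geographicCoverage[[1]]@boundingCoordinates@westBoundingCoordinate
west_value <- west@.Data
```

In this sketch, seeing only .Data is the hint that the object is list-like and needs a [[1]] before the next @.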

I know that's not elegant, we'll try and think of a better way (maybe with a higher-level get method). But that's the challenge right now of being flexible.

Re get_attributes(eml_HARV@dataset@coverage[[1]]@attributeList, join = TRUE): I'm not sure what this is trying to do, but there is no such thing as an attributeList in coverage. Keep in mind an attributeList means something very specific in EML -- it defines the units in tabular data inside a dataTable object. The get_attributes() function isn't a generic get function; it only knows how to work with the EML attributeList object. If you do getSlots("coverage") you'll see coverage doesn't have an attributeList, it just has temporalCoverage, geographicCoverage, etc.

Sorry this is confusing, your questions are great. You are just a bit ahead of me still. I'm hoping to get some finished examples and real documentation up soon, and then work out some more get_ methods. (So far I have been focusing on the methods to create EML before tackling those to parse it easily.)

Thanks again!

@mbjones
Member

mbjones commented Mar 8, 2016

And just a quick note that many sites and people use the repeating geographic, temporal, and taxonomic coverage elements in EML, so it's great to have this support in the R package now. It's common for people to give bounding boxes for all of their sampling sites, as well as specific temporal ranges for when they did sampling, rather than overall boxes, especially when they only sampled a small part of the larger region.

@lwasser
Author

lwasser commented Mar 8, 2016

ok thank you, @cboettig for your patience in answering these questions. the .Data part when using tab complete was throwing me off. So this means the information is stored / accessed via a list.

Re: attributeList - i was confused. It makes perfect sense that it defines data table units. I thought for some reason it was a means to index eml elements. My mistake!

@lwasser
Author

lwasser commented Mar 8, 2016

+1 on @mbjones's comment above as well! NEON is VERY likely to have those use cases.

@lwasser
Author

lwasser commented Mar 8, 2016

ok - to clarify - to grab the actual coordinate value, I need:

eml_HARV@dataset@coverage@geographicCoverage[[1]]@boundingCoordinates@westBoundingCoordinate@.Data

I'm attempting to demonstrate how EML can be ingested and used in an automated workflow, so I need the coordinate values.

@cboettig
Member

cboettig commented Mar 8, 2016

@lwasser yup that'll work. If you don't like @.Data you can just do [[1]] instead, e.g.

eml_HARV@dataset@coverage@geographicCoverage[[1]]@boundingCoordinates@westBoundingCoordinate[[1]]

It might be nice for us to hear/think more about the big picture workflow you have in mind too. While a user can always subset in this way, there may be a role for a helper function that could, say, extract all the bounding boxes out of a coverage node like @mbjones describes and return the information in a concise data.frame or sp spatial object or some such.
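A rough sketch of what such a helper could look like, using only base R. Both bounding_boxes() and the coverages list are hypothetical: the list is a stand-in for values already pulled out of the geographicCoverage slots, and all coordinates are made up.

```r
# Hypothetical stand-in for parsed EML: one list per geographicCoverage,
# with coordinates stored as strings. All values are made up.
coverages <- list(
  list(description = "plot A", west = "-72.30", east = "-72.20",
       north = "42.60", south = "42.50"),
  list(description = "plot B", west = "-72.10", east = "-72.00",
       north = "42.40", south = "42.30")
)

# Hypothetical helper: one row per bounding box, coordinates as numbers
bounding_boxes <- function(coverages) {
  rows <- lapply(coverages, function(gc) {
    data.frame(
      description = gc$description,
      west  = as.numeric(gc$west),
      east  = as.numeric(gc$east),
      north = as.numeric(gc$north),
      south = as.numeric(gc$south),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

bb <- bounding_boxes(coverages)
```

The result is a plain data.frame, so it could be handed to plotting code or converted to a spatial object from there.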

@lwasser
Author

lwasser commented Mar 8, 2016

@cboettig Great - Thank you! i'm still sorting that out but i have some ideas!

The first use case I came up with was wanting to quickly create a base map of the site location, pulling from an EML file. This would be part of early site exploration, where you collect a bunch of base files and need to look at things spatially.

In this case, I wanted to plot the extent of the site. So if there are several coverage elements, it would be good to be able to extract either the x,y point or the x,y extent box, convert it to a number with as.numeric(), and then plot it or use this information for something else.

Here is an example in this lesson of that use case where i created a map.

http://neoninc.github.io/NEON-Data-Skills-Development/R/why-metadata-are-important/

if the data were spatial - and in .asc or some other format like H5 where the extent may or may not come in automatically, i might use that numeric information to spatially "place" the grid itself! coverage seems like an important element (of course i am biased being a spatial science type :) )

@mbjones
Member

mbjones commented Mar 8, 2016

@cboettig @lwasser It strikes me that, once the core part of this package is done, it might be really useful to have a little mini-hackathon where EML/R users could propose use cases, and we could review and code solutions to make those use cases for metadata creation straightforward. This would probably really rapidly advance the cause. Towards that end, maybe we should start a use case markdown document, or maybe collect issues that are labeled as 'use_case'.

@lwasser
Author

lwasser commented Mar 8, 2016

I think that's a great idea! It would also allow you to ID the more common use cases of interest to the community! :)

@cboettig
Member

cboettig commented Mar 9, 2016

Definitely! Using the issues tracker for this sounds good to me.

This also seems like an area where having some good examples of what can be done and how could be essential (a la the Henry Ford quote, "ask people what they want and they say 'faster horses'"). @mbjones I really like the idea of a hackathon as a way to bridge that gap in connecting what is possible to what needs doing.

@lwasser Thanks for the description; that's definitely helpful. Like you say, this highlights an interesting line between what is "data" (read: files described by but external to EML) and what is "metadata" (read: inside an EML doc itself). For instance, the use case: "I have this data file and I want to visualize its extent" is probably a common one, but it may not be obvious what role EML plays in this picture. Best to leverage a standard spatial data format and a standard visualization tool for that data than re-invent the wheel. If we can just automate the map between the EML representation of this information (i.e. in geographicCoverage, but see also the spatialVector and spatialRaster modules in EML) and the standard formats then the visualization or other applications become easier.

Still, to me the use cases for EML that really shine happen only once we start considering more than one EML file at a time; particularly EML files documenting very different kinds of data. Consider the example @mbjones & team have built with the KNB data repository, where a user can search for a data term and see where on the map the available data comes from; or search for all data files falling within a particular region. I think this kind of use case really highlights the advantage of having, say, spatial coverage described in EML rather than only available in specialized data files, even if those files are generally more standard and compatible with existing visualization & other tools. So I am interested in developing use cases that show this kind of application over a whole repository of data, even when the underlying information described in each of the data files may be very different. Anyway, just brainstorming use cases here.
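As a minimal illustration of the "all data within a region" idea, here is a base R sketch. The index data.frame, its dataset names and coordinates, and the overlaps_region() helper are all made up; in practice the rows would come from the geographicCoverage of many EML files. It also ignores boxes that cross the antimeridian.

```r
# Made-up index: one row per dataset, bounding box in decimal degrees
index <- data.frame(
  dataset = c("soil_cores", "bird_counts", "stream_temp"),
  west    = c(-72.3, -120.5, -71.0),
  east    = c(-72.1, -120.0, -70.5),
  south   = c(42.4, 35.0, 43.0),
  north   = c(42.6, 35.5, 43.5),
  stringsAsFactors = FALSE
)

# Hypothetical helper: two boxes overlap when they overlap on both axes
overlaps_region <- function(index, west, east, south, north) {
  index$west <= east & index$east >= west &
    index$south <= north & index$north >= south
}

hits <- index$dataset[overlaps_region(index, west = -73, east = -70,
                                      south = 42, north = 44)]
```

The same filter works regardless of what kind of data each file describes, which is the point of keeping the coverage in the metadata.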

@lwasser
Author

lwasser commented Mar 9, 2016 via email

@ivanhanigan
Contributor

I like where this is going. I want to voice support for using R to generate EML from the data, leveraging from sp or raster for the spatial, table for nominal, levels for ordinal variables, summary for numeric, Date for temporal coverage... Possibly from taxize for the taxonomy stuff.
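A minimal base R sketch of that idea: derive a per-column summary straight from the data. describe_column() is a hypothetical helper, the type labels are informal rather than the EML schema's exact vocabulary, and the data are made up.

```r
# Hypothetical helper: summarise one column according to its R type
describe_column <- function(x) {
  if (is.ordered(x)) {
    list(type = "ordinal", levels = levels(x))
  } else if (is.factor(x) || is.character(x)) {
    list(type = "nominal", counts = table(x, useNA = "ifany"))
  } else if (inherits(x, "Date")) {
    list(type = "dateTime",
         begin = min(x, na.rm = TRUE), end = max(x, na.rm = TRUE))
  } else if (is.numeric(x)) {
    list(type = "numeric", range = range(x, na.rm = TRUE))
  } else {
    list(type = "unknown")
  }
}

# Made-up example data
dat <- data.frame(
  site    = c("A", "B", "A"),
  cover   = factor(c("low", "high", "medium"),
                   levels = c("low", "medium", "high"), ordered = TRUE),
  height  = c(1.2, 3.4, 2.1),
  sampled = as.Date(c("2016-03-01", "2016-03-08", "2016-03-05")),
  stringsAsFactors = FALSE
)

meta <- lapply(dat, describe_column)
```

Each entry of meta could then be mapped onto the matching EML element, for example the begin and end dates onto the temporal coverage.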

I strongly feel that, unlike in Morpho, the R/EML environment allows data objects to not be so separated from their documentation. When I worked on a metacat portal for ecological plot-based collections, the data providers often sent data that was inconsistent with the metadata they provided. For example, an ordinal variable would be described with its value labels but not its missing-value codes, or some values were entered as if the data entry operator was equivocating between medium and high ("medium/high"), while the documentation the researchers provided said the values were low, medium or high.
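That kind of mismatch can be caught mechanically. A small base R sketch, where undocumented_values() is a hypothetical helper and the values are made up to mirror the example above:

```r
# Levels the documentation claims, and values actually found in the data
documented <- c("low", "medium", "high")
observed   <- c("low", "medium/high", "high", NA, "medium")

# Hypothetical helper: values present in the data but not in the documentation
undocumented_values <- function(observed, documented) {
  setdiff(unique(observed), documented)
}

problems <- undocumented_values(observed, documented)
```

Here problems holds both the equivocal "medium/high" entry and the undocumented missing value.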

Another often found case was when there were errors in what they provided for the geographicCoverage, and the end result was EML spatial references on the portal that were nowhere near the data.

I feel the need to strongly emphasize automatically extracting metadata from the data.
For example, one of the solutions I worked on prior to leaving the portal is shown below. It aimed to give the data librarians who created our EMLs in Morpho the coordinates from the spatial file rather than what was given by the researchers, so for the geographicCoverage they could just paste this into Morpho. If I'd had time and the EML package had been available, I'd have actually returned the EML coverage rather than just this matrix for Morpho.

morpho_bounding_box <- function(x){
  ## TODO Check if spatial obj and proj4string is valid first
  # bbox slot of the spatial object: a 2x2 matrix, rows x/y, columns min/max
  bb <- x@bbox
  # TODO the following is only for southern hemisphere (Oz)
  # 3x3 layout: north on top, west/east in the middle row, south at the bottom
  loc <- data.frame(
    rbind(
      c(NA, round(abs(bb[2,2]), 5), NA),
      c(round(bb[1,1], 5),  NA, round(bb[1,2],5)),
      c(NA, round(abs(bb[2,1]),5), NA)
      )
    )
  # make something to print in the shape Morpho wants it
  # latitudes are labelled S and longitudes E (see the TODO above)
  loc$X2[c(1,3)] <- sprintf("%s S", abs(loc$X2[c(1,3)]))
  loc[2,c(1,3)] <- sprintf("%s E", abs(loc[2,c(1,3)]))

  return(loc)
  }
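For comparison, a hemisphere-agnostic sketch that keeps signed decimal degrees instead of S/E labels. bbox_to_bounding_coordinates() is a hypothetical helper, the matrix is a made-up stand-in for the bbox slot of an sp object (rows x/y, columns min/max), and it assumes the object is in unprojected longitude/latitude.

```r
# Made-up stand-in for x@bbox from an sp object
bb <- matrix(c(150.1, -34.2, 151.3, -33.5), nrow = 2,
             dimnames = list(c("x", "y"), c("min", "max")))

# Hypothetical helper: name the four corners after the EML bounding elements
bbox_to_bounding_coordinates <- function(bb, digits = 5) {
  stopifnot(is.matrix(bb), all(dim(bb) == c(2, 2)))
  list(
    westBoundingCoordinate  = round(bb[1, 1], digits),
    eastBoundingCoordinate  = round(bb[1, 2], digits),
    northBoundingCoordinate = round(bb[2, 2], digits),
    southBoundingCoordinate = round(bb[2, 1], digits)
  )
}

coords <- bbox_to_bounding_coordinates(bb)
```

Keeping the sign means the same function works north or south of the equator, and the named list lines up with the elements of a geographicCoverage.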

HTH

@ivanhanigan
Contributor

@mbjones re #144 (comment)
I raised this opportunity for the rOpenSci 2016 unconference in Brisbane, Australia:
ropensci/auunconf#11

@ivanhanigan
Contributor

The Aussie ropensci unconference is in two weeks and led to this idea re attributes for functions/data that would be "retrievable as first-class objects via some method, or printable" ropensci/auunconf#18 (comment).

I wonder if you have thoughts on that? I am not sure if this is compatible with the EML package approach. I won't be able to attend Brisbane in person but instead will try to engage remotely and set aside the two days to work on implementing EML functions into the public health observatory at my university.

@cboettig
Member

merging this into issue #150

@cboettig changed the title from "Pulling X,Y coordinates from the coverage" to "Extract & generate metadata from data objects (e.g. spatial sp, raster, etc)" on Feb 28, 2017
@cboettig cboettig reopened this Feb 28, 2017
@cboettig
Member

closing in favor of 150
