
Label Extension #362

Merged
merged 43 commits into from Jun 19, 2019

9 participants
@matthewhanson
Collaborator

commented Nov 27, 2018

No description provided.

| Field Name | Type | name | description |
|-----------------|-----------------|----------------------------|--------------------------------------------------------------------------------------------------|
| td:title | string | Title | A human readable title of the dataset for display |
| td:description | string | Description | **REQUIRED.** A description of the training data, how it was created, and what it is recommended for |
| td:classes | [string] | Classes | **REQUIRED.** A list of keywords representing the nature of the labels (e.g., tree, building, car, hippo) |

@lossyrob

lossyrob Dec 7, 2018

Collaborator

Have you thought about supporting non-classification tasks? e.g. "estimate count of buildings in this 500x500 pixel grid". Requiring classes here wouldn't fit with regression problems. An easy workaround would be to just give a single "value" class, but you may want to bake that flexibility into the data model instead.

@matthewhanson

matthewhanson Mar 5, 2019

Author Collaborator

It's a good idea @lossyrob, so we could:

  • have a "value" property in the asset features rather than a "label" property
  • td:classes renamed to td:values, which could contain the list of possible values. Although for something like 'count' this would be unwieldy.

How would you define the task? Right now we have this "label_type", which is: "One of 'classification', 'detection', or 'segmentation'", but I agree this is unnecessarily limiting. Going to write some more thoughts below.

@daveluo

commented Mar 1, 2019

Hi,

To add more ML use cases and sample data for consideration in this STAC extension proposal, here's a small sample set of ML training inputs (source imagery from UAV & building footprint labels) and outputs (raster and polygons of segmented buildings) for building segmentation in Zanzibar.

catalog: https://github.com/daveluo/stac4ml-demo
stac browser (temp url): https://zen-turing-2069dc.netlify.com/?t=catalogs

Data provided is for illustrative purposes only and I manually defined much of the metadata so any errors/inconsistencies with the official data sources or the schema are mine.

I organized the source imagery into its own collection with rel links within the training and output items pointing to them as rel : source and td:assets : rgb. Redundantly, I also added the source imagery COGs as assets within each training and output item which nicely renders them as previews in the stac-browser. Any better way to do this?

Also still looking for good ideas to organize ML outputs. Currently I have job... items under the outputs collection and all the artifacts as assets within the item. This is inspired by the spacenet-stac example with ml_exports: https://spacenet-stac.netlify.com/ml-exports

@cholmes

Contributor

commented Mar 1, 2019

Awesome demo @daveluo! Cool to see it in action in stac browser.

It'll take some modifications to STAC Browser, but for training data we talked about just referencing a single 'master' COG and making the bounding box meaningful, so that a renderer would render only the portion of the source image that falls within the bounding box.
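As a rough illustration of the bounding-box idea above, here is a minimal sketch of how a renderer might turn an Item bounding box into a pixel window of the master image. It assumes a north-up raster whose bounding box is expressed in the raster's own coordinate system; the function name `bbox_to_pixel_window` and all of its parameters are hypothetical and are not part of STAC Browser.

```python
import math
from typing import Tuple


def bbox_to_pixel_window(
    bbox: Tuple[float, float, float, float],
    origin_x: float,
    origin_y: float,
    pixel_width: float,
    pixel_height: float,
    raster_cols: int,
    raster_rows: int,
) -> Tuple[int, int, int, int]:
    """Return (col_off, row_off, width, height) covering bbox.

    Assumptions (hypothetical, for illustration only):
    - bbox is (min_x, min_y, max_x, max_y) in the raster's CRS
    - (origin_x, origin_y) is the top-left corner of the raster
    - pixel_width and pixel_height are positive sizes; the raster is north-up
    """
    min_x, min_y, max_x, max_y = bbox

    # Columns grow to the east, rows grow to the south.
    col_start = math.floor((min_x - origin_x) / pixel_width)
    col_stop = math.ceil((max_x - origin_x) / pixel_width)
    row_start = math.floor((origin_y - max_y) / pixel_height)
    row_stop = math.ceil((origin_y - min_y) / pixel_height)

    # Clamp the window to the raster extent.
    col_start = max(col_start, 0)
    row_start = max(row_start, 0)
    col_stop = min(col_stop, raster_cols)
    row_stop = min(row_stop, raster_rows)

    if col_start >= col_stop or row_start >= row_stop:
        raise ValueError("bbox does not overlap the raster")

    return col_start, row_start, col_stop - col_start, row_stop - row_start
```

A window like this could then be handed to whatever COG reader the renderer uses, so only the relevant part of the master image is fetched.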

The other cool change to make to STAC Browser is to render the geojson asset in addition to the COG. I'm tempted to try to make that PR for STAC Browser, but my time is way too tight these days... It's nice you've got the gist link now so that anyone can click and see it. To have STAC Browser render it, it'd probably have to be a link to the 'raw' gist https://gist.githubusercontent.com/daveluo/c743c6b0f99795336636a1b0084786b5/raw/2d28c8638a1018e5f581cb7e390f0859ac538810/znz-example-labels.json or else we'd teach STAC Browser to get the raw from a gist geojson.
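The "get the raw from a gist" step could look roughly like the sketch below. It assumes that a gist page lives at `https://gist.github.com/<user>/<gist_id>` and that appending `/raw` on the `gist.githubusercontent.com` host serves the file of a single-file gist; both are assumptions about GitHub's URL layout, and `gist_raw_url` is a hypothetical helper, not an existing STAC Browser function.

```python
from urllib.parse import urlparse


def gist_raw_url(gist_url: str) -> str:
    """Map a gist page URL to a raw-content URL (assumed URL layout)."""
    parsed = urlparse(gist_url)
    if parsed.netloc != "gist.github.com":
        raise ValueError(f"not a gist page URL: {gist_url}")

    # The path is expected to look like /<user>/<gist_id>[/...]
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"expected /<user>/<gist_id> in: {gist_url}")

    user, gist_id = parts[0], parts[1]
    return f"https://gist.githubusercontent.com/{user}/{gist_id}/raw"
```

For example, `gist_raw_url("https://gist.github.com/example-user/0123abcd")` yields `https://gist.githubusercontent.com/example-user/0123abcd/raw`, where the user and gist ID are placeholders.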

cc @mojodna

@matthewhanson

Collaborator Author

commented Mar 5, 2019

Thanks for the example @daveluo, great seeing it up and with a STAC Browser deployed.

I think the way that you have organized things is fine. If your source imagery was already represented in a STAC catalog somewhere you could skip the "imagery" collection and just point to that Item directly.

Here's another example catalog where we simply point to a record in the Sentinel STAC catalog as our source rather than having a separate catalog:
http://mlhub-earth.s3.amazonaws.com/catalog.json

Although note those links are currently incorrect :-(, I've got to fix that.

It's a good point about having the COGs as assets in multiple Items, this does make it easier to preview them.

What we are doing with the catalog above (I just pushed some changes to this PR to reflect it) is providing an optional "rendered image" asset in the Item, because in many cases the training data was generated not from the original source imagery as is, but from imagery rendered in some way. For example, for Sentinel there's no RGB true color image available, so we create one and include it as an asset in the Item alongside our training data labels asset.

@matthewhanson

Collaborator Author

commented Mar 5, 2019

@lossyrob brought up a good point about supporting regression; it would be nice if the extension could handle both regression and classification tasks.

So perhaps the following changes:

  • Change "td:label_type" to "td:type" to include "regression", so options would be ['regression', 'classification', 'detection', 'segmentation']
  • Change td:label_property to be td:value_property, which will be the property name containing the value (numeric, string) based on the type.
  • Change td:classes to be optional; it is not used with regression tasks
  • Add an optional td:value_range field, which is a 2-element numeric field containing the range of values for regression tasks
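To make the proposed changes concrete, here is a minimal sketch of a checker for properties shaped like the list above. It is only an illustration of the proposal as written in this comment: the function name `check_td_properties` is hypothetical, `td:type` is assumed to be a single string, and none of this is taken from the merged extension.

```python
from typing import Any, Dict, List

# Task types named in the proposal above.
TASK_TYPES = {"regression", "classification", "detection", "segmentation"}


def check_td_properties(props: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the sketch's rules pass."""
    problems: List[str] = []

    task = props.get("td:type")
    if task not in TASK_TYPES:
        problems.append(f"td:type must be one of {sorted(TASK_TYPES)}")

    # td:value_property names the feature property that holds the value.
    if not isinstance(props.get("td:value_property"), str):
        problems.append("td:value_property must be a property name (string)")

    if task == "regression":
        # td:classes is optional overall and unused for regression.
        if "td:classes" in props:
            problems.append("td:classes is not used with regression tasks")

        # td:value_range is optional; if present it must be [min, max].
        value_range = props.get("td:value_range")
        if value_range is not None:
            valid = (
                isinstance(value_range, (list, tuple))
                and len(value_range) == 2
                and all(isinstance(v, (int, float)) for v in value_range)
                and value_range[0] <= value_range[1]
            )
            if not valid:
                problems.append("td:value_range must be two numbers [min, max]")

    return problems
```

With this sketch, a building-count Item with `"td:type": "regression"`, `"td:value_property": "count"`, and `"td:value_range": [0, 250]` produces no problems, while the same Item with a `td:classes` list would be flagged.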

Thoughts?

@daveluo

commented Apr 8, 2019

Thanks for the thoughts!

@matthewhanson
Agreed on having optional "rendered image" asset items to capture intermediary steps that get us to train-able datasets. In addition to your example of rendering RGB or other band combos out of source data, I could see other pre-processing steps as rendered assets as well, e.g. realigning/correcting polygons to base imagery. My initial thought is to keep the spec open/flexible and leave it up to the catalog/item maker to decide what is defined and shared as a rendered asset during preprocessing. Could range from everything (rendered assets + notebooks/scripts demonstrating the pipeline) to nothing (just sharing the finalized rendered inputs for ML).

Also agreed with the idea of generalizing labels beyond classification. I need to think more about how generic this should be w.r.t. different ML tasks. There could be more than one label type for a task, i.e. object detection with both regression (of bounding box coordinates) and classification (of object within each bbox) labels. Maybe regression and classification are two primary label types and multiple types are allowed within an item to flexibly suit the particular ML task? I'll try out some examples to see what may work well.

@cholmes, @mojodna
Would be awesome to render geojson directly in the preview, either as a new layer on top of base COG imagery or separately displayed. Having a bounding box to render portions of a master COG would be cool too, although I'd love to see the geojson preview first!

@daveluo

commented Apr 24, 2019

Updated stac4ml demo catalog & browser at https://zen-turing-2069dc.netlify.com with some new things for consideration:

  • added a "chips" asset to each item which points to a folder of preprocessed image/mask chips. Another example of an asset rendered from raw source imagery+label. Could also include in description the details of how chipping was done, i.e. "512x512 z19@2x tile jpgs and 3-channel RGB mask pngs"
  • this training data presents both segmentation and classification (of building condition) tasks so I've listed both within td:type. Right now, the class labels are contained within the geojson for each building polygon but I can make that into another asset, i.e. a csv of building_id,building_condition,WKTstring which also would show a different way of presenting training labels
  • modified geometry of items & stac-browser leaflet properties to render all the building polygons as one nicely preview-able MultiPolygon. Not the ideal way to do this of course, just for illustrative purposes atm.
    EDIT:
  • one more thought: we find that there may be at least 2 potential licensors and licenses for training data assets: one set for the imagery (i.e. CC-BY-4.0 for imagery on OpenAerialMap), another for the labels (ODbL for OSM), and maybe another for the processor. It seems there's only space in the schema for 1 license per item. In this case, it should probably defer to the licensor/license of the labels since that's the novel part of this extension but ideally we'd be able to list all the respective licenses somewhere on the item and asset levels.
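The CSV idea from the list above (a file of building_id,building_condition,WKTstring) could be produced from the existing GeoJSON labels along these lines. This is a sketch under stated assumptions: the property keys `building_id` and `building_condition` are taken from the proposed CSV header and may not match the real GeoJSON property names, only Polygon geometries are handled, and both function names are hypothetical.

```python
import csv
import io
from typing import Any, Dict


def polygon_to_wkt(geometry: Dict[str, Any]) -> str:
    """Serialize a GeoJSON Polygon geometry as a WKT string."""
    if geometry["type"] != "Polygon":
        raise ValueError("this sketch handles Polygon geometries only")
    rings = []
    for ring in geometry["coordinates"]:
        # Keep x and y, ignore any extra dimensions such as elevation.
        points = ", ".join(f"{x} {y}" for x, y, *_ in ring)
        rings.append(f"({points})")
    return "POLYGON (" + ", ".join(rings) + ")"


def labels_to_csv(
    feature_collection: Dict[str, Any],
    id_key: str = "building_id",            # assumed property name
    label_key: str = "building_condition",  # assumed property name
) -> str:
    """Write one CSV row per feature: id, class label, WKT geometry."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["building_id", "building_condition", "WKTstring"])
    for feature in feature_collection["features"]:
        props = feature["properties"]
        writer.writerow(
            [props[id_key], props[label_key], polygon_to_wkt(feature["geometry"])]
        )
    return out.getvalue()
```

The resulting file could then be listed as one more asset on the Item, next to the GeoJSON, so consumers can pick whichever label format suits their pipeline.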

philvarner and others added some commits Jun 5, 2019

@cholmes

Contributor

commented Jun 13, 2019

Thanks for trying @aaronxsu - I just busted out the github web editor to resolve it (and it actually loaded it this time), so we should be good.

Though looks like we need more reviewers now.

@cholmes
Contributor

left a comment

Looks great to get in and iterating, thanks everyone for their work on this!

@cholmes changed the title from "[WIP] Label Extension" to "Label Extension" Jun 13, 2019

@matthewhanson

Collaborator Author

commented Jun 15, 2019

I have reviewed this, but I am unable to officially add my review since I'm the one who opened the PR, so we'll need someone else to add theirs.

Just a few things:

  • The Examples linked to in the Label README need to be fixed. The link for the second one is incorrect, it's using .json rather than .geojson as the extension.
  • Not all of the examples in the directory are linked to from the README
  • I think the names of the examples can be cleaned up - for instance they don't need to have "sprint4" in the titles, they should be shorter and simpler
  • I'm not sure label:version belongs here. Not because labels don't have a version, but because that's not specific to Labels. I think we need to talk about versions in Items (or assets) as a separate issue; until then, versions should be included in the Item ID if you want to track versions.
@cholmes

Contributor

commented Jun 17, 2019

Discussed on call: make an issue for versioning and remove it from here. Ideally fix Matt's suggestions, but we are also OK to merge.

@cholmes

Contributor

commented Jun 18, 2019

@daveluo @aaronxsu and @nrweir - are any of you able to fix the things Matt raised?

And please raise the 'version' issue, as it's something we should propose at the Item level, as it should be consistent across extensions. Unless you have a reason that it is special to labeling.

@nrweir

commented Jun 18, 2019

Yes, I'm happy to do the fixes, will work on them now - sorry about the delay, I've been tied up at CVPR.

@nrweir

commented Jun 18, 2019

@matthewhanson I created a PR against this branch to incorporate the discussed changes.

m-mohr added some commits Jun 18, 2019

Merge pull request #502 from nrweir/extension/label_nw
Cleaning up examples and removing label:version
@m-mohr
Collaborator

left a comment

The PR by @nrweir looked good (thanks), so I merged it. Based on that, I fixed errors in the examples and made changes to the data types in the spec to be consistent with the other specs/extensions. Also, added two comments for consideration below (for context: #499).


| Field Name | Type | name | description |
|-----------------|------------|----------------------------|--------------------------------------------------------------------------------------------------|
| stat_name | string | Stat Name | The name of the statistic being reported. |

@m-mohr

m-mohr Jun 18, 2019

Collaborator

Shorten the name from stat_name to name or id? (see #499)

@nrweir

nrweir Jun 19, 2019

I'm good with that. I think id would be confusing to data scientists...what’s an id with relation to a stat? Would prefer name.


| Field Name | Type | name | description |
|-----------------|-----------------|----------------------------|--------------------------------------------------------------------------------------------------|
| class_name | string | Class Name | The different possible classes within the property `name`. |

@m-mohr

m-mohr Jun 18, 2019

Collaborator

Shorten the name from class_name to name or id? (see #499)

@nrweir

nrweir Jun 19, 2019

Same as above: I think name would probably make more sense, but no problem changing it from class_name.

I'll make the updates.

@nrweir nrweir referenced this pull request Jun 19, 2019

Closed

Extension/label nw #504

@nrweir

commented Jun 19, 2019

@matthewhanson @m-mohr I changed those two labels to "name" and also merged in dev to address the branch being out of date with base, then created another PR into this branch: #504.

m-mohr and others added some commits Jun 19, 2019

@m-mohr

m-mohr approved these changes Jun 19, 2019

@m-mohr

Collaborator

commented Jun 19, 2019

Should be ready for the merge. Can you confirm, @matthewhanson (and others)?

@matthewhanson

Collaborator Author

commented Jun 19, 2019

Looks good to me, and it's got approvals so I'm merging.

@matthewhanson matthewhanson merged commit 7c1a665 into dev Jun 19, 2019

1 check passed

ci/circleci: build Your tests passed on CircleCI!

@matthewhanson matthewhanson deleted the extension/training_data branch Jun 19, 2019

@mojodna

Member

commented on extensions/label/README.md in 0b87748 Jun 26, 2019

@daveluo (et al): the Zanzibar example has a list of classes (because there are multiple label properties); should this always be [Class Object] vs. needing to check list or object?

replied Jun 27, 2019

Good thought, I think this should always be [Class Object] even when it reduces to a single label and single class dataset like in the spacenet-buildings example. It's also consistent with how we use [Count Object] and [Stats Object] in label:overview.

Related: I'll update the zanzibar examples to add the Count Object for the "condition" label property so it's a complete example of multi-label multi-class data.
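Until every catalog follows the "always [Class Object]" convention, a reader that wants to avoid the list-or-object check could normalize on load. The sketch below shows one way to do that; `as_class_list` is a hypothetical helper, and the `name` and `classes` keys in the example are assumptions about the Class Object layout based on the renaming discussed earlier in this thread.

```python
from typing import Any, Dict, List, Union

ClassObject = Dict[str, Any]


def as_class_list(
    value: Union[ClassObject, List[ClassObject], None]
) -> List[ClassObject]:
    """Always return a list of Class Objects, whatever shape was stored."""
    if value is None:
        # Field absent, e.g. no class information was provided.
        return []
    if isinstance(value, dict):
        # A single Class Object was stored bare; wrap it.
        return [value]
    return list(value)


# Example with assumed keys: one label property, one class.
single = {"name": "label", "classes": ["building"]}
normalized = as_class_list(single)
```

Writers that always emit a list make this helper a no-op, which is the point of the convention.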
