-
Notifications
You must be signed in to change notification settings - Fork 120
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added support for summaries #264
Conversation
It'd be great to get this in. I just put an issue in stac-fields, to see what they think. @volaya - are there other ways the stats are less than ideal? I was also thinking that some user input would make sense (I'm thinking for CLI, but would also extend to arguments for the code in some way). Like I could see summaries being 'core', 'extended' and 'all'. Core would be just common metadata, extended would include the stable extensions and then 'all' would attempt to summarize everything. And then I could also see being able to specify additional fields that you want summarized for core/extended options, or fields you want to leave off for 'all'. It does occur to me it'd also be good to let people specify their own fields.json. Not sure if it was intended that way, but I was thinking it'd be good to let users be able to specify a json file with the summary 'rules' they wanted. But seems like a natural path for many would be to start with the fields.json and then customize it. That could be a new field adding summary info there, or it could just be marking more as summary = false for users deciding they don't want that info in their summary. Where there other things that were less ideal to summarize @volaya ? I'll try to test things out, but could be good to update this PR to the latest (like stats object is now minimum and maximum I think), and get it building. |
pystac/collection.py
Outdated
@@ -9,6 +12,16 @@ | |||
from pystac.utils import datetime_to_str | |||
|
|||
|
|||
fieldsfilename = os.path.join(os.path.dirname(__file__), | |||
"resources", "fields-normalized.json") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to load it from a CDN by default? That file is expected to change pretty often (mostly to add new fields)...
The proposal made in stac-fields looks good to me. That is, to not just have true or false for the summary property, but some details about how to create the summary. About how to handle the fields.json file, I think we can easily implement some flexible approach like this:
Let me sketch something like that in this same branch, so we can discuss based on it. |
…aries # Conflicts: # pystac/collection.py
I adapted the code that I had done before (which basically computes and updates the summaries when an item is added to a collection), to the recent changes in the collection module. The code that takes care of summaries is now in its own module. It accepts a file with field definitions (for now, those fields are not assumed to contain information on how to summarize their values, but just whether or not to do it, as it happened before). If no file is passed, it picks the code from the CDN, or uses a local file if that's not available. CI fails, but gives no information about the test that is failing. I am not seeing that issue when running tests locally, so I will try to figure out what is going on. |
@volya are you not able to see the github action logs? Clicking on the "Details" for one of the fails of the checks brings me to the GitHub Actions build logs, where I can see the "Execute linters and test suite" failed, and the bottom of the logs have:
Are you running |
Yes, I can see the log, but it doesn't tell me which test is failing or the full error stack trace. And I cannot figure out how a RangeSummary object might arrive to that line... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One high level comment is that I'm not sure Summarizer should be a concept that either Collection or Summaries know about - the default create_summaries
is destructive to any manual summaries that may have been added, and it's not clear how to flexibly use the summarizer with methods against the objects. There's also the issue with multiple transparent fetches mentioned below.
I think generating summaries from the properties JSON is a specific enough task that having some independent flexible API around it might be in order. I.e. having the way to generate Summaries is to use a Summarizer directly, instead of having that functionality wrapped in methods.
e.g. an API that is used like:
import pystac
from pystac.summaries import Summarizer
collection: pystac.Collection = ...
new_items: List[pytsac.Item] = ...
summerizer = Summarizer()
# Example 1: Create from strach
collection.summaries = summerizer.summerize(collection)
# Example 2: Create summaries and merge (generated have precedent)
collection.summaires.update(summarizer.summerize(collection)
# Example 3: Create summaries and merge (existing have precedent)
collection.summaries = summarizer.summerize(collection).update(collection.summaries)
# Example 4: Update summaries with new items before adding to collection
new_summaries = summarizer.summerize(new_items)
collection.summaries.merge(new_summaries)
collection.add_items(new_items)
Here summarizer has a summarize
method:
def summerize(source: Union[pystac.Collection, Iterable[pystac.Item]) -> Summaries:
...
There's a difference between update
(where the summary for each summary key is overriding any existing summary) and merge
(which expands the summary to include the information from the incoming summaries for overlapping keys). I'm not sure these would be the best name, but identifying the different use cases to me means the API is complex enough that we might want to keep the generation of summaries and the update/merging of summaries as independent steps (Summarizer only generates summaries, summaries know how to update themselves or merge in new summaries).
I think also breaking out the Summaries merge/update methods to be independent of the stac-fields
based summary generation logic would allow those methods to be used independently, if a user has their own method of creating summaries that they want to merge together.
pystac/summaries.py
Outdated
fields(str): the path to the json file with field descriptions. | ||
If no file is passed, a default one will be used. | ||
''' | ||
def __init__(self, fields: str = None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def __init__(self, fields: str = None): | |
def __init__(self, fields: Optional[str] = None): |
pystac/collection.py
Outdated
@@ -598,6 +518,14 @@ def __init__( | |||
def __repr__(self) -> str: | |||
return "<Collection id={}>".format(self.id) | |||
|
|||
def create_summaries(self, summarizer: Summarizer = None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def create_summaries(self, summarizer: Summarizer = None): | |
def create_summaries(self, summarizer: Optional[Summarizer] = None) -> None: |
pystac/summaries.py
Outdated
def __init__(self, summaries: Dict[str, Any], summarizer: Summarizer = None) -> None: | ||
self._summaries = summaries | ||
|
||
self.summarizer = summarizer or Summarizer() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's currently no caching for the properties json so this would fetch a new document for every collection, which has performance implications.
Yes, you are right, better to have something independent for summaries. My idea was to have some method of automatically creating summaries when adding elements to a collection, but it might be a better approach to let the user choose how to do that, and just provide the functionality to do it. Let me update the PR with your ideas and we can continue the discussion. Thanks for the comments! |
@volaya the failure is not on a unit test, but on a static type checking. There's linting and type checking in
In summaries.py on line 22, there's a call to Being able to specify that generic types are sortable is apparently been a pain; see python/typing#59 and this stackoverflow answer for a solution |
Ah, ok. Thanks for that hint. I didn't know that there were type checks, and I was a bit puzzled with that error message... |
@lossyrob I have Updated the PR with your ideas. Collections do not know about summaries, and the updated/combine operation is done by the Summaries class. Creating a summary is done by the Summarizer class. One comment about that: I think the Summarizer should take care of performing the merge/combine, since I believe that those operations (or at least the combine one) must depend on the stac-fields json. Otherwise, you can combine summaries that do not have the same type of summary for a given value (for instance one in which a numerical field is summarized with a range, and another one where it is summarized with more detailed stats). If that's ok, then I guess we need some to add some checking in the corresponding methods. But probably it's better to move them to the Summarizer class. Without knowing how to summarize a given field (which is at the moment known only by the Summarizer class, thanks to the json file with filed definitions), one cannot know how to combine existing summaries for that field. Also, notice that, in the current design of the Summaries class, you can use the same id for summaries of different type (let's say, for a range and for a list). That might happen if trying to update/combine two summaries that have summarized a given field in different ways. Maybe it can be a good idea to not store lists, ranges, schemas and other in separate dicts as instance variables of the Summaries class, but just keep a single dictionary of summaries and have several classes for them (like we have the RangeSummary class now). Let me know what you think and I can start adding some unit tests, once we have the approach for this classes more or less ready. |
CI still has issues with the typing. The solution you linked to doesn't seem to work (however, I tried locally with mypy, and that fix makes it happy...) I will see if I can manage to make it work somehow |
I also added the idea mentioned in #178 about having a limit for the number of elements in list summaries |
@volaya stac-fields will return more details on summaries once the CDN picks up the changes (usually within 24hours), see also stac-utils/stac-fields#6 |
@m-mohr I have implemented here the ideas that you added in stac-utils/stac-fields#6 |
pystac/summaries.py
Outdated
RANGE = "r" | ||
SCHEMA = "s" | ||
DONT_SUMMARIZE = False | ||
UNDEFINED = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UNDEFINED = True | |
UNDEFINED = True |
I think "default" or "detect" would be better choices as name. There are rules defined in the fields json readme for the case it's not set, which depends on the data type mostly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed
…maries # Conflicts: # tests/test_summaries.py
@lossyrob Good suggestions. I added the changes that you proposed. |
Codecov Report
@@ Coverage Diff @@
## main #264 +/- ##
==========================================
- Coverage 89.80% 89.43% -0.38%
==========================================
Files 38 39 +1
Lines 4953 5063 +110
==========================================
+ Hits 4448 4528 +80
- Misses 505 535 +30
Continue to review full report at Codecov.
|
Looks like Protocol is new in 3.8 . Not sure how the windows CI builds are passing for pre-3.8 builds. I'll look into another solution that works across python versions for the |
This commit adds an extra dependency to PySTAC when installed in environments with versions of Python before 3.8. The 'Protocol' class in `typing` was made available in the core module as of 3.8, but previous versions can use it from the `typing_extensions` package. setup.py is changed to install typing_extensions for pre-3.8 versions, and a conditional import is used to import the Protocol class from the correct package.
I just pushed a commit that will conditionally install |
Merged in upstream/main, and made some changes to pass the mypy CI check that was added. This included removing the |
Changes approved, but this does need a CHANGELOG entry @volaya. Should probably change the PR title and description too to reflect that it's ready. Other then that I think this is good to merge! |
pystac/summaries.py
Outdated
return cls(minimum=minimum, maximum=maximum) | ||
|
||
|
||
FIELDS_JSON_URL = "https://cdn.jsdelivr.net/npm/@radiantearth/stac-fields/fields.json" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FIELDS_JSON_URL = "https://cdn.jsdelivr.net/npm/@radiantearth/stac-fields/fields.json" | |
FIELDS_JSON_URL = "https://cdn.jsdelivr.net/npm/@radiantearth/stac-fields/fields-normalized.json" |
This must be the normalized version, otherwise, you'll run into bugs here as some fields may not be resolved yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@volaya Thanks for all the work on this, it looks great! I had a couple of minor comments and it looks like there is one comment from @m-mohr that still needs to be addressed regarding the URL for the fields file.
We should also add all of the classes from pystac.summaries
to the API docs in docs/api.rst
.
pystac/summaries.py
Outdated
if hasattr(source, "get_all_items"): | ||
for item in source.get_all_items(): # type: ignore[union-attr] | ||
self._update_with_item(summaries, item) | ||
else: | ||
for item in source: # type: ignore[union-attr] | ||
self._update_with_item(summaries, item) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This satisfies the type checker without needing to ignore the error.
if hasattr(source, "get_all_items"): | |
for item in source.get_all_items(): # type: ignore[union-attr] | |
self._update_with_item(summaries, item) | |
else: | |
for item in source: # type: ignore[union-attr] | |
self._update_with_item(summaries, item) | |
if isinstance(source, pystac.Collection): | |
for item in source.get_all_items(): | |
self._update_with_item(summaries, item) | |
else: | |
for item in source: | |
self._update_with_item(summaries, item) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about this...what happens if we are dealing with a Catalog instead of a Collection? It also has the get_all_items' method. Is that a valid type of element for this? Should we extend that
isinstance` statement to check for both classes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like as of 1.0.0-rc.4 Catalogs no longer support summaries, so I think it makes sense to restrict this to only Collections.
@volaya The CI failures are a little perplexing since it's failing on a file that doesn't even exist in your branch 🤷 . I've merged in the latest from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again for this. I think the only remaining items are:
- Update the PR title to reflect the new status
- Add a CHANGELOG entry
Title changed and changelog entry added already |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, thanks again. This is will be a big addition to the next release!
[PR not to be merged yet]
This PR implements the mechanism for generating summaries automatically for Collection objects.
It uses the fields.json file mentioned in stac-utils/stactools#45 to know which attributes should be summarized.
However, that file doesnt provide enough information, and it should be improved before merging this PR. Summaries generated with the current code in this branch are not optimal.
For instance, for numeric values, there should be some information about whether to calculate stats or just collect all existing values (as it happens with epsg codes)