Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added support for summaries #264

Merged
merged 33 commits into from
Jun 2, 2021
Merged

Added support for summaries #264

merged 33 commits into from
Jun 2, 2021

Conversation

volaya
Copy link
Contributor

@volaya volaya commented Feb 21, 2021

[PR not to be merged yet]

This PR implements the mechanism for generating summaries automatically for Collection objects.

It uses the fields.json file mentioned in stac-utils/stactools#45 to know which attributes should be summarized.

However, that file doesnt provide enough information, and it should be improved before merging this PR. Summaries generated with the current code in this branch are not optimal.

For instance, for numeric values, there should be some information about whether to calculate stats or just collect all existing values (as it happens with epsg codes)

@cholmes
Copy link

cholmes commented May 10, 2021

It'd be great to get this in. I just put an issue in stac-fields, to see what they think.

@volaya - are there other ways the stats are less than ideal?

I was also thinking that some user input would make sense (I'm thinking for CLI, but would also extend to arguments for the code in some way). Like I could see summaries being 'core', 'extended' and 'all'. Core would be just common metadata, extended would include the stable extensions and then 'all' would attempt to summarize everything.

And then I could also see being able to specify additional fields that you want summarized for core/extended options, or fields you want to leave off for 'all'.

It does occur to me it'd also be good to let people specify their own fields.json. Not sure if it was intended that way, but I was thinking it'd be good to let users be able to specify a json file with the summary 'rules' they wanted. But seems like a natural path for many would be to start with the fields.json and then customize it. That could be a new field adding summary info there, or it could just be marking more as summary = false for users deciding they don't want that info in their summary.

Where there other things that were less ideal to summarize @volaya ? I'll try to test things out, but could be good to update this PR to the latest (like stats object is now minimum and maximum I think), and get it building.

@@ -9,6 +12,16 @@
from pystac.utils import datetime_to_str


fieldsfilename = os.path.join(os.path.dirname(__file__),
"resources", "fields-normalized.json")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to load it from a CDN by default? That file is expected to change pretty often (mostly to add new fields)...

@volaya
Copy link
Contributor Author

volaya commented May 11, 2021

The proposal made in stac-fields looks good to me. That is, to not just have true or false for the summary property, but some details about how to create the summary.

About how to handle the fields.json file, I think we can easily implement some flexible approach like this:

  • If the user does not provide a path to a custom fields.json file, a remote common one is used (or a local one shipped with the library, as a fallback option in case there is no connection)
  • If a path to a json.file is provided, that is used instead, so there is a way to easily have a custom way of summarizing values
  • We can provide some predefined files with subgroups of parameters (like those 'core', 'extended' and 'all' that @cholmes mentioned), and have a simple way of using them (in the command-line tool, that can be by defining some specific argument for those hardcoded files)

Let me sketch something like that in this same branch, so we can discuss based on it.

@volaya
Copy link
Contributor Author

volaya commented May 12, 2021

I adapted the code that I had done before (which basically computes and updates the summaries when an item is added to a collection), to the recent changes in the collection module. The code that takes care of summaries is now in its own module.

It accepts a file with field definitions (for now, those fields are not assumed to contain information on how to summarize their values, but just whether or not to do it, as it happened before). If no file is passed, it picks the code from the CDN, or uses a local file if that's not available.

CI fails, but gives no information about the test that is failing. I am not seeing that issue when running tests locally, so I will try to figure out what is going on.

@lossyrob
Copy link
Member

@volya are you not able to see the github action logs? Clicking on the "Details" for one of the fails of the checks brings me to the GitHub Actions build logs, where I can see the "Execute linters and test suite" failed, and the bottom of the logs have:

+ echo ' -- CHECKING TYPES WITH PYRIGHT --'
+ echo
+ scripts/pyright pystac tests
Checking node version...
Checking node_modules dir...
Checking pyright exists...
...installing pyright
/usr/local/bin/pyright -> /usr/local/lib/node_modules/pyright/index.js
/usr/local/bin/pyright-langserver -> /usr/local/lib/node_modules/pyright/langserver.index.js
+ pyright@1.1.138
added 1 package from 1 contributor in 0.886s
done.
No configuration file found.
No pyproject.toml file found.
stubPath /home/runner/work/pystac/pystac/typings is not a valid directory.
Assuming Python platform Linux
Searching for source files
Found 73 source files
/home/runner/work/pystac/pystac/pystac/summaries.py
  /home/runner/work/pystac/pystac/pystac/summaries.py:22:24 - error: No overloads for "min" match the provided arguments
    Argument types: (T@RangeSummary, T@RangeSummary) (reportGeneralTypeIssues)
  /home/runner/work/pystac/pystac/pystac/summaries.py:23:24 - error: No overloads for "min" match the provided arguments
    Argument types: (T@RangeSummary, T@RangeSummary) (reportGeneralTypeIssues)
2 errors, 0 warnings, 0 infos 
Completed in 6.079sec
Error: Process completed with exit code 1.

Are you running scripts/test locally, which runs pyright for type checking? I would assume this should also fail locally.

@volaya
Copy link
Contributor Author

volaya commented May 12, 2021

Yes, I can see the log, but it doesn't tell me which test is failing or the full error stack trace. And I cannot figure out how a RangeSummary object might arrive to that line...

Copy link
Member

@lossyrob lossyrob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One high level comment is that I'm not sure Summarizer should be a concept that either Collection or Summaries know about - the default create_summaries is destructive to any manual summaries that may have been added, and it's not clear how to flexibly use the summarizer with methods against the objects. There's also the issue with multiple transparent fetches mentioned below.

I think generating summaries from the properties JSON is a specific enough task that having some independent flexible API around it might be in order. I.e. having the way to generate Summaries is to use a Summarizer directly, instead of having that functionality wrapped in methods.

e.g. an API that is used like:

import pystac
from pystac.summaries import Summarizer

collection: pystac.Collection = ...
new_items: List[pytsac.Item] = ...

summerizer = Summarizer()

# Example 1: Create from strach
collection.summaries = summerizer.summerize(collection)

# Example 2: Create summaries and merge (generated have precedent)
collection.summaires.update(summarizer.summerize(collection)

# Example 3: Create summaries and merge (existing have precedent)
collection.summaries = summarizer.summerize(collection).update(collection.summaries)

# Example 4: Update summaries with new items before adding to collection
new_summaries = summarizer.summerize(new_items)
collection.summaries.merge(new_summaries)
collection.add_items(new_items)

Here summarizer has a summarize method:

def summerize(source: Union[pystac.Collection, Iterable[pystac.Item]) -> Summaries:
   ...

There's a difference between update (where the summary for each summary key is overriding any existing summary) and merge (which expands the summary to include the information from the incoming summaries for overlapping keys). I'm not sure these would be the best name, but identifying the different use cases to me means the API is complex enough that we might want to keep the generation of summaries and the update/merging of summaries as independent steps (Summarizer only generates summaries, summaries know how to update themselves or merge in new summaries).

I think also breaking out the Summaries merge/update methods to be independent of the stac-fields based summary generation logic would allow those methods to be used independently, if a user has their own method of creating summaries that they want to merge together.

fields(str): the path to the json file with field descriptions.
If no file is passed, a default one will be used.
'''
def __init__(self, fields: str = None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def __init__(self, fields: str = None):
def __init__(self, fields: Optional[str] = None):

@@ -598,6 +518,14 @@ def __init__(
def __repr__(self) -> str:
return "<Collection id={}>".format(self.id)

def create_summaries(self, summarizer: Summarizer = None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def create_summaries(self, summarizer: Summarizer = None):
def create_summaries(self, summarizer: Optional[Summarizer] = None) -> None:

def __init__(self, summaries: Dict[str, Any], summarizer: Summarizer = None) -> None:
self._summaries = summaries

self.summarizer = summarizer or Summarizer()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's currently no caching for the properties json so this would fetch a new document for every collection, which has performance implications.

@volaya
Copy link
Contributor Author

volaya commented May 12, 2021

Yes, you are right, better to have something independent for summaries. My idea was to have some method of automatically creating summaries when adding elements to a collection, but it might be a better approach to let the user choose how to do that, and just provide the functionality to do it.

Let me update the PR with your ideas and we can continue the discussion. Thanks for the comments!

@lossyrob
Copy link
Member

@volaya the failure is not on a unit test, but on a static type checking. There's linting and type checking in scripts/test, it doesn't only run unit tests.

 /home/runner/work/pystac/pystac/pystac/summaries.py:22:24 - error: No overloads for "min" match the provided arguments
    Argument types: (T@RangeSummary, T@RangeSummary) (reportGeneralTypeIssues)
  /home/runner/work/pystac/pystac/pystac/summaries.py:23:24 - error: No overloads for "min" match the provided arguments
    Argument types: (T@RangeSummary, T@RangeSummary) (reportGeneralTypeIssues)

In summaries.py on line 22, there's a call to min(...) that deals with values of type T. Since T is generic, there's no guarantee that the type of v or self.minimum would be something you could put into min. Solving this would require putting a type constraint on T that ensures the types it allows would have the ability to be ordered/compared.

Being able to specify that generic types are sortable is apparently been a pain; see python/typing#59 and this stackoverflow answer for a solution

@volaya
Copy link
Contributor Author

volaya commented May 12, 2021

Ah, ok. Thanks for that hint. I didn't know that there were type checks, and I was a bit puzzled with that error message...

@volaya
Copy link
Contributor Author

volaya commented May 13, 2021

@lossyrob I have Updated the PR with your ideas.

Collections do not know about summaries, and the updated/combine operation is done by the Summaries class. Creating a summary is done by the Summarizer class.

One comment about that: I think the Summarizer should take care of performing the merge/combine, since I believe that those operations (or at least the combine one) must depend on the stac-fields json. Otherwise, you can combine summaries that do not have the same type of summary for a given value (for instance one in which a numerical field is summarized with a range, and another one where it is summarized with more detailed stats). If that's ok, then I guess we need some to add some checking in the corresponding methods. But probably it's better to move them to the Summarizer class. Without knowing how to summarize a given field (which is at the moment known only by the Summarizer class, thanks to the json file with filed definitions), one cannot know how to combine existing summaries for that field.

Also, notice that, in the current design of the Summaries class, you can use the same id for summaries of different type (let's say, for a range and for a list). That might happen if trying to update/combine two summaries that have summarized a given field in different ways.

Maybe it can be a good idea to not store lists, ranges, schemas and other in separate dicts as instance variables of the Summaries class, but just keep a single dictionary of summaries and have several classes for them (like we have the RangeSummary class now).

Let me know what you think and I can start adding some unit tests, once we have the approach for this classes more or less ready.

@volaya
Copy link
Contributor Author

volaya commented May 13, 2021

CI still has issues with the typing. The solution you linked to doesn't seem to work (however, I tried locally with mypy, and that fix makes it happy...) I will see if I can manage to make it work somehow

@volaya
Copy link
Contributor Author

volaya commented May 13, 2021

I also added the idea mentioned in #178 about having a limit for the number of elements in list summaries

@m-mohr
Copy link
Contributor

m-mohr commented May 13, 2021

@volaya stac-fields will return more details on summaries once the CDN picks up the changes (usually within 24hours), see also stac-utils/stac-fields#6

@volaya
Copy link
Contributor Author

volaya commented May 14, 2021

@m-mohr I have implemented here the ideas that you added in stac-utils/stac-fields#6

RANGE = "r"
SCHEMA = "s"
DONT_SUMMARIZE = False
UNDEFINED = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
UNDEFINED = True
UNDEFINED = True

I think "default" or "detect" would be better choices as name. There are rules defined in the fields json readme for the case it's not set, which depends on the data type mostly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed

@volaya
Copy link
Contributor Author

volaya commented May 21, 2021

@lossyrob Good suggestions. I added the changes that you proposed.

@codecov-commenter
Copy link

codecov-commenter commented May 21, 2021

Codecov Report

Merging #264 (fa8767e) into main (6f4ca38) will decrease coverage by 0.37%.
The diff coverage is 75.46%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #264      +/-   ##
==========================================
- Coverage   89.80%   89.43%   -0.38%     
==========================================
  Files          38       39       +1     
  Lines        4953     5063     +110     
==========================================
+ Hits         4448     4528      +80     
- Misses        505      535      +30     
Impacted Files Coverage Δ
pystac/extensions/file.py 83.57% <0.00%> (ø)
pystac/extensions/raster.py 90.21% <ø> (ø)
pystac/summaries.py 75.64% <75.64%> (ø)
pystac/__init__.py 97.22% <100.00%> (+0.07%) ⬆️
pystac/collection.py 94.56% <100.00%> (+1.55%) ⬆️
pystac/extensions/eo.py 97.33% <100.00%> (ø)
pystac/cache.py 94.16% <0.00%> (ø)
pystac/layout.py 92.44% <0.00%> (ø)
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6f4ca38...fa8767e. Read the comment docs.

@lossyrob
Copy link
Member

Looks like Protocol is new in 3.8 . Not sure how the windows CI builds are passing for pre-3.8 builds. I'll look into another solution that works across python versions for the Comparable issue

This commit adds an extra dependency to PySTAC when installed in
environments with versions of Python before 3.8. The 'Protocol' class
in `typing` was made available in the core module as of 3.8, but
previous versions can use it from the `typing_extensions`
package. setup.py is changed to install typing_extensions for pre-3.8
versions, and a conditional import is used to import the Protocol
class from the correct package.
@lossyrob
Copy link
Member

I just pushed a commit that will conditionally install typing_extensions, which allows the import of Protocol in pre-3.8 versions of Python. This adds an extra dependency in those situations, but does not add any new dependencies for Python 3.8 or later.

@lossyrob
Copy link
Member

Merged in upstream/main, and made some changes to pass the mypy CI check that was added. This included removing the typ parameter from RangeSummary.from_dict.

@lossyrob
Copy link
Member

Changes approved, but this does need a CHANGELOG entry @volaya. Should probably change the PR title and description too to reflect that it's ready. Other then that I think this is good to merge!

return cls(minimum=minimum, maximum=maximum)


FIELDS_JSON_URL = "https://cdn.jsdelivr.net/npm/@radiantearth/stac-fields/fields.json"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
FIELDS_JSON_URL = "https://cdn.jsdelivr.net/npm/@radiantearth/stac-fields/fields.json"
FIELDS_JSON_URL = "https://cdn.jsdelivr.net/npm/@radiantearth/stac-fields/fields-normalized.json"

This must be the normalized version, otherwise, you'll run into bugs here as some fields may not be resolved yet.

Copy link
Contributor

@duckontheweb duckontheweb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@volaya Thanks for all the work on this, it looks great! I had a couple of minor comments and it looks like there is one comment from @m-mohr that still needs to be addressed regarding the URL for the fields file.

We should also add all of the classes from pystac.summaries to the API docs in docs/api.rst.

pystac/summaries.py Outdated Show resolved Hide resolved
Comment on lines 164 to 169
if hasattr(source, "get_all_items"):
for item in source.get_all_items(): # type: ignore[union-attr]
self._update_with_item(summaries, item)
else:
for item in source: # type: ignore[union-attr]
self._update_with_item(summaries, item)
Copy link
Contributor

@duckontheweb duckontheweb May 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This satisfies the type checker without needing to ignore the error.

Suggested change
if hasattr(source, "get_all_items"):
for item in source.get_all_items(): # type: ignore[union-attr]
self._update_with_item(summaries, item)
else:
for item in source: # type: ignore[union-attr]
self._update_with_item(summaries, item)
if isinstance(source, pystac.Collection):
for item in source.get_all_items():
self._update_with_item(summaries, item)
else:
for item in source:
self._update_with_item(summaries, item)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this...what happens if we are dealing with a Catalog instead of a Collection? It also has the get_all_items' method. Is that a valid type of element for this? Should we extend that isinstance` statement to check for both classes?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like as of 1.0.0-rc.4 Catalogs no longer support summaries, so I think it makes sense to restrict this to only Collections.

@duckontheweb
Copy link
Contributor

@volaya The CI failures are a little perplexing since it's failing on a file that doesn't even exist in your branch 🤷 . I've merged in the latest from main and fixed those issues.

Copy link
Contributor

@duckontheweb duckontheweb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again for this. I think the only remaining items are:

  • Update the PR title to reflect the new status
  • Add a CHANGELOG entry

@volaya volaya changed the title added summaries (WIP) Added support for summaries Jun 2, 2021
@volaya
Copy link
Contributor Author

volaya commented Jun 2, 2021

Title changed and changelog entry added already

@duckontheweb duckontheweb self-requested a review June 2, 2021 13:17
Copy link
Contributor

@duckontheweb duckontheweb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thanks again. This is will be a big addition to the next release!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants