accept __geo_interface__ attribute as data #1664

Merged
merged 10 commits into from
Aug 29, 2019

mattijn
Contributor

@mattijn mattijn commented Aug 18, 2019

I'm not sure if I've understood you correctly here: #588 (comment)

In this PR I tried to integrate serialisation of data objects that have a __geo_interface__ attribute.
It works for Python packages that handle geographical data types and support the __geo_interface__. I tried shapely, pyshp, geojson and geopandas.

It works for:

  • InlineData
  • json data transformer and
  • data_server data transformer.

See the animated GIF:
[image: ezgif com-optimize-2]

After I got this working, I compared it to https://github.com/altair-viz/altair/pull/818/files; the idea is very much the same. Maybe @iliatimofeev can shed some light on this proof of concept as well.
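For readers unfamiliar with the protocol, here is a minimal sketch of what "supporting the __geo_interface__" means in practice. The `Point` class and the `supports_geo_interface` helper are hypothetical names used for illustration only; they are not part of Altair or of the packages listed above.

```python
class Point:
    """Hypothetical geometry class exposing the __geo_interface__ protocol."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    @property
    def __geo_interface__(self):
        # A GeoJSON-like mapping that describes this object.
        return {"type": "Point", "coordinates": (self.x, self.y)}


def supports_geo_interface(obj):
    """Duck-typing check: does the object expose __geo_interface__?"""
    return hasattr(obj, "__geo_interface__")


point = Point(6.90, 53.48)
print(supports_geo_interface(point))           # True
print(supports_geo_interface([6.90, 53.48]))   # False
```

A consumer only needs the attribute lookup, so it can accept objects from any library without importing that library.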

@jakevdp
Collaborator

jakevdp commented Aug 20, 2019

Looks good - do you have an example of how using this would look in practice?

@mattijn
Contributor Author

mattijn commented Aug 21, 2019

Here are some examples of possible tests. Not all tests are passing yet. Some advice on how and where to implement these tests within Altair would also be appreciated.

import altair as alt

def geom_obj(geom):
    class Geom(object):
        pass
    geom_obj = Geom()
    setattr(geom_obj, '__geo_interface__', geom)
    return geom_obj
geom_a = {
    "coordinates": [[
        (0, 0), 
        (0, 2), 
        (2, 2), 
        (2, 0), 
        (0, 0)
    ]],
    "type": "Polygon"
}
feat_a = geom_obj(geom_a)

# correct translation of Polygon geometry to Feature type
alt.Chart(feat_a).mark_geoshape(tooltip={"content": "data"})

output_1_0

geom_b = {
    "geometry": {
        "coordinates": [[
            [6.90, 53.48],
            [5.98, 51.85],
            [6.07, 53.51],
            [6.90, 53.48]
        ]], 
        "type": "Polygon"
    }, 
    "id": None, 
    "properties": {}, 
    "type": "Feature"
}
feat_b = geom_obj(geom_b)

# removal of empty `properties` key
alt.Chart(feat_b).mark_geoshape(tooltip={"content": "data"})

output_2_0

geom_c = {
    "geometry": {
        "coordinates": [[
            [6.90, 53.48],
            [5.98, 51.85],
            [6.07, 53.51],
            [6.90, 53.48]
        ]], 
        "type": "Polygon"
    }, 
    "id": None, 
    "properties": {"country": "Spain"}, 
    "type": "Feature"
}
feat_c = geom_obj(geom_c)

# correct registration of `country` as foreign member
alt.Chart(feat_c).mark_geoshape(tooltip={"content": "data"})

output_3_0

import array as arr
geom_d = {
    "bbox": arr.array('d', [1.1, 3.5, 4.5]),    
    "geometry": {
        "coordinates": [tuple((
            tuple((6.90, 53.48)),
            tuple((5.98, 51.85)),
            tuple((6.07, 53.51)),
            tuple((6.90, 53.48))
        ))], 
        "type": "Polygon"
    }, 
    "id": 27, 
    "properties": {}, 
    "type": "Feature"
}
feat_d = geom_obj(geom_d)

# serializing of arrays to lists
# serializing of (nested) tuples to (nested) lists
# removal of empty `properties` key
alt.Chart(feat_d).mark_geoshape(tooltip={"content": "data"})

output_4_0
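The array and tuple handling exercised by the example above can be sketched in plain Python. This is an assumption about one possible approach, not necessarily the code in this PR; `to_json_safe` is a hypothetical helper name, and it only converts `array.array` values found at the top level of the mapping (such as `bbox`).

```python
import array
import json


def to_json_safe(geo):
    """Return a copy of a __geo_interface__ mapping built from plain JSON types."""
    geo = dict(geo)  # shallow copy so the caller's mapping is left untouched
    for key, value in geo.items():
        if isinstance(value, array.array):
            geo[key] = value.tolist()
    # json.dumps writes tuples as JSON arrays, so loading the string back
    # yields (nested) lists in place of (nested) tuples.
    return json.loads(json.dumps(geo))


geo = {
    "bbox": array.array("d", [1.1, 3.5, 4.5]),
    "type": "Polygon",
    "coordinates": [((6.90, 53.48), (5.98, 51.85), (6.07, 53.51), (6.90, 53.48))],
}
clean = to_json_safe(geo)
print(clean["bbox"])            # [1.1, 3.5, 4.5]
print(clean["coordinates"][0][0])  # [6.9, 53.48]
```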

geom_d = {
    "geometry": {
        "coordinates": [[
            [6.90, 53.48],
            [5.98, 51.85],
            [6.07, 53.51],
            [6.90, 53.48]
        ]], 
        "type": "Polygon"
    }, 
    "id": 27, 
    "properties": {"type": "foo"}, 
    "type": "Feature"
}

feat_d = geom_obj(geom_d)

# cannot draw geoshape
# incorrect registration of `type` as foreign member
alt.Chart(feat_d).mark_geoshape(tooltip={"content": "data"})

output_5_0

# geopandas can handle unicode characters in __geo_interface__
import geopandas as gpd
fp_earth = gpd.datasets.get_path('naturalearth_lowres')
gdf = gpd.read_file(fp_earth)
gpd_geo_interface = gdf.__geo_interface__
# shapefile cannot handle unicode characters in __geo_interface__
# not the problem of altair
# pip install pyshp
import shapefile
sf = shapefile.Reader(fp_earth)
sf_geo_interface = sf.__geo_interface__
---------------------------------------------------------------------------

UnicodeDecodeError                        Traceback (most recent call last)

<ipython-input-8-b6bca3bd9a1a> in <module>
      3 import shapefile
      4 sf = shapefile.Reader(fp_earth)
----> 5 sf_geo_interface = sf.__geo_interface__


/usr/local/lib/python3.7/site-packages/shapefile.py in __geo_interface__(self)
    620         fieldnames = [f[0] for f in self.fields]
    621         features = []
--> 622         for feat in self.iterShapeRecords():
    623             fdict = {'type': 'Feature',
    624                      'properties': dict(zip(fieldnames,feat.record)),


/usr/local/lib/python3.7/site-packages/shapefile.py in iterShapeRecords(self)
   1042         """Returns a generator of combination geometry/attribute records for
   1043         all records in a shapefile."""
-> 1044         for shape, record in izip(self.iterShapes(), self.iterRecords()):
   1045             yield ShapeRecord(shape=shape, record=record)
   1046 


/usr/local/lib/python3.7/site-packages/shapefile.py in iterRecords(self)
   1023         f.seek(self.__dbfHdrLength)
   1024         for i in xrange(self.numRecords):
-> 1025             r = self.__record()
   1026             if r:
   1027                 yield r


/usr/local/lib/python3.7/site-packages/shapefile.py in __record(self, oid)
    985             else:
    986                 # anything else is forced to string/unicode
--> 987                 value = u(value, self.encoding, self.encodingErrors)
    988                 value = value.strip()
    989             record.append(value)


/usr/local/lib/python3.7/site-packages/shapefile.py in u(v, encoding, encodingErrors)
    102         if isinstance(v, bytes):
    103             # For python 3 decode bytes to str.
--> 104             return v.decode(encoding, encodingErrors)
    105         elif isinstance(v, str):
    106             # Already str.


UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf4 in position 1: invalid continuation byte
# GeoJSON's RFC 7946 winding order is the opposite of the one d3-geo uses (vega, vega-lite, altair)
# use geopandas if this is problematic
# pip install Shapely==1.7a2
from shapely.ops import orient
gdf_sa = gdf[gdf.name=='South Africa']
gdf_sa_ccw = gdf_sa.geometry.apply(orient, args=(-1,))

# correct winding order
# exterior shell is counterclockwise and interior rings are clockwise
alt.Chart(gdf_sa_ccw).mark_geoshape().project(type='mercator')

output_9_0

gdf_sa_cw = gdf_sa.geometry.apply(orient, args=(1,))

# incorrect winding order
# exterior shell is clockwise and interior rings are counterclockwise
alt.Chart(gdf_sa_cw).mark_geoshape(tooltip={"content": "data"}).project(type='mercator', reflectY=True)

output_10_0
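A quick way to check which way a ring winds is the planar signed area (shoelace formula). This is a sketch with hypothetical helper names; it treats coordinates as flat x/y values with y pointing up, whereas d3-geo evaluates winding on the sphere, so it is a sanity check only.

```python
def signed_area(ring):
    """Planar signed area of a closed ring (first point repeated at the end)."""
    total = 0.0
    for (x0, y0), (x1, y1) in zip(ring, ring[1:]):
        total += x0 * y1 - x1 * y0
    return total / 2.0


def is_counterclockwise(ring):
    """A positive signed area means the ring runs counterclockwise."""
    return signed_area(ring) > 0


# The exterior ring of the first Polygon example in this thread.
ring = [(0, 0), (0, 2), (2, 2), (2, 0), (0, 0)]
print(signed_area(ring))                # -4.0
print(is_counterclockwise(ring))        # False
print(is_counterclockwise(ring[::-1]))  # True
```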

# test using json data transformer
alt.data_transformers.enable('json')
gdf_na = gdf[gdf.continent=='North America']
# serialize Feature Collections
# using data transformer json
alt.Chart(gdf_na).mark_geoshape().project(type='mercator')

output_12_0

alt.data_transformers.enable('data_server')
# using data transformer data_server
alt.Chart(gdf_na).mark_geoshape().project(type='mercator')

output_14_0

@jakevdp
Collaborator

jakevdp commented Aug 21, 2019

None of our tests do any visual output comparison: they're all based on the generated chart specifications. So the right way to test these, I think, is to create a series of objects with __geo_interface__ methods that mock the various geo libraries, and assert that chart.to_dict() (1) passes schema validation, and (2) outputs something reasonable given the input.
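A sketch of what such a test could look like, under stated assumptions: `MockGeo`, `chart_spec` and `check_polygon_geometry` are hypothetical names, Altair is imported lazily so the mock can be used without it, and the final assertion is only one example of "something reasonable". Under pytest the check function would be named `test_...` so that it is collected.

```python
class MockGeo:
    """Hypothetical stand-in for an object from a geo library."""

    def __init__(self, mapping):
        self._mapping = mapping

    @property
    def __geo_interface__(self):
        return self._mapping


def chart_spec(data):
    """Build a geoshape chart and return its specification as a dict.

    to_dict() validates against the schema by default, so an invalid
    specification raises here and fails the calling test.
    """
    import altair as alt  # lazy import; requires Altair to be installed

    return alt.Chart(data).mark_geoshape().to_dict()


def check_polygon_geometry():
    polygon = MockGeo({
        "type": "Polygon",
        "coordinates": [[[0, 0], [0, 2], [2, 2], [2, 0], [0, 0]]],
    })
    spec = chart_spec(polygon)   # (1) schema validation happens in to_dict()
    assert "data" in spec        # (2) the spec carries a data entry
```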

@mattijn
Contributor Author

mattijn commented Aug 24, 2019

@jakevdp I've added several tests plus some paragraphs of documentation. If some parts are confusing, issue vega/vega#1319 on the Vega repo explains more about the adopted approach to serialization.

Basically it's an attempt to implement this comment in the linked issue:

Rather than have two different Vega styles (for geo vs. non-geo data), one solution may be to change your data output scheme. Instead of generating pure GeoJSON, can you get Pandas to output row-oriented JSON that includes a single GeoJSON feature as a property? Then you have a flat table and no namespace issues, as your geometry is included under its own property. That plays well with Vega and, while not adhering to a pure GeoJSON format, is still a completely valid JSON data file.

And generally speaking this works, as long as you don't use the column names type, geometry, id and properties.

Please feel free to share your thoughts.
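To make the name clash concrete, here is a minimal sketch of un-nesting `properties` to the top level of a Feature. It is an illustration under assumptions, not the code in this PR: `unnest_feature` and `RESERVED` are hypothetical names, and raising a `ValueError` is simply one way to make the clash visible.

```python
RESERVED = ("type", "geometry", "id", "properties")


def unnest_feature(feature):
    """Flatten one GeoJSON Feature into a single row (hypothetical helper)."""
    properties = feature.get("properties") or {}
    clashes = sorted(set(properties) & set(RESERVED))
    if clashes:
        raise ValueError(f"property names clash with Feature members: {clashes}")
    # Keep the Feature members, drop the nested container, lift its entries.
    row = {key: value for key, value in feature.items() if key != "properties"}
    row.update(properties)
    return row


feature = {
    "type": "Feature",
    "id": None,
    "geometry": {"type": "Point", "coordinates": [6.90, 53.48]},
    "properties": {"country": "Spain"},
}
print(sorted(unnest_feature(feature)))  # ['country', 'geometry', 'id', 'type']
```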

@mattijn
Contributor Author

mattijn commented Aug 27, 2019

The more I think about it, the less I like it. Instead of un-nesting the entries of properties to the top level, it’s probably better to take the entries of properties as the basis and add type and geometry to them. This means that a __geo_interface__ originating from geopandas only has to avoid type as a column name, since the geometries are already kept in the geometry column. Moreover, this aligns with the serialization of standard DataFrames, which does not include the index.
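A minimal sketch of this alternative for a whole FeatureCollection, assuming a hypothetical helper named `collection_to_rows`. Each row starts from the feature's `properties`, then `type` and `geometry` are added; the `id` is dropped, much like a DataFrame index.

```python
def collection_to_rows(collection):
    """Turn a FeatureCollection mapping into a list of flat rows."""
    rows = []
    for feature in collection["features"]:
        # Copy the properties; they may be missing or None.
        row = dict(feature.get("properties") or {})
        row["type"] = feature["type"]
        row["geometry"] = feature["geometry"]
        rows.append(row)
    return rows


point = {"type": "Point", "coordinates": [6.90, 53.48]}
collection = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "id": 27, "geometry": point,
         "properties": {"country": "Spain"}},
        {"type": "Feature", "id": 28, "geometry": point,
         "properties": {"type": "foo"}},
    ],
}
rows = collection_to_rows(collection)
print(sorted(rows[0]))   # ['country', 'geometry', 'type']
print(rows[1]["type"])   # Feature
```

In this sketch a property named `type` is silently overwritten by the Feature's own `type`, which is why that column name has to be avoided.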

@jakevdp
Collaborator

jakevdp commented Aug 27, 2019

OK - happy to review that version if you think it would be better 😁

@mattijn mattijn changed the title from "WIP accept __geo_interface__ attribute as data" to "accept __geo_interface__ attribute as data" Aug 28, 2019
@mattijn
Contributor Author

mattijn commented Aug 28, 2019

@jakevdp this is ready for review.

Collaborator

@jakevdp jakevdp left a comment

Looks great! Thanks for the work on this. A few comments inline

"""

try:
    feat['properties'].update({k: feat[k] for k in ('type', 'geometry')})
Collaborator

It looks like {k: feat[k] for k in ('type', 'geometry')} could be done above the try/except, since it happens in both blocks.
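A sketch of the suggested refactor. Only the `try` line is visible in the excerpt above, so the function name, the `except` branch and the exception types are assumptions about the surrounding code.

```python
def merge_props_geom(feat):
    """Merge a Feature's `type` and `geometry` into its properties."""
    # Built once, before the try/except, because both branches need it.
    geom = {k: feat[k] for k in ('type', 'geometry')}
    try:
        feat['properties'].update(geom)
        props_geom = feat['properties']
    except (AttributeError, KeyError):
        # AttributeError: 'properties' is None.
        # KeyError: 'properties' is missing altogether.
        props_geom = geom
    return props_geom


point = {"type": "Point", "coordinates": [0, 0]}
print(merge_props_geom({"type": "Feature", "geometry": point, "properties": None}))
# {'type': 'Feature', 'geometry': {'type': 'Point', 'coordinates': [0, 0]}}
```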



@contextmanager
def not_raises(ExpectedException):
Collaborator

I would remove this context manager and all its uses, and let the unit test framework handle any exceptions that are raised.
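To illustrate the point, here is a sketch comparing the two styles. The body of `not_raises` is an assumption (only its signature appears in the excerpt above), and `parse_type` is a hypothetical function under test.

```python
from contextlib import contextmanager


@contextmanager
def not_raises(ExpectedException):
    # Assumed body: turn the exception into an explicit test failure.
    try:
        yield
    except ExpectedException as err:
        raise AssertionError(f"unexpected {type(err).__name__}: {err}")


def parse_type(geo):
    """Hypothetical function under test."""
    return geo["type"]


def test_with_wrapper():
    with not_raises(KeyError):
        assert parse_type({"type": "Point"}) == "Point"


def test_plain():
    # An exception raised here already fails the test in the test runner,
    # so the wrapper adds nothing.
    assert parse_type({"type": "Point"}) == "Point"
```

Both tests fail when `parse_type` raises; the plain version also keeps the original traceback.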


- as a `Pandas DataFrame <http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html>`_
- as a :class:`Data` or related object (i.e. :class:`UrlData`, :class:`InlineData`, :class:`NamedData`)
- as a url string pointing to a ``json`` or ``csv`` formatted text file
- as an object that supports the `__geo_interface__`
Collaborator

List some examples of data types that define ``__geo_interface__``.

doc/user_guide/data.rst (review thread resolved)
@mattijn
Copy link
Contributor Author

mattijn commented Aug 29, 2019

Updated as requested

@jakevdp
Collaborator

jakevdp commented Aug 29, 2019

Looks good. One final comment, if you want to make the change: you can avoid having to look up ds_key each time if you turn off dataset consolidation:

with alt.data_transformers.enable(consolidate_datasets=False):
    spec = chart.to_dict()
data = spec['data']
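For context, the two specification shapes differ in where the values live. The sketch below uses hand-written dictionaries rather than real Altair output; the dataset name `data-abc123` is made up and `dataset_values` is a hypothetical helper.

```python
def dataset_values(spec):
    """Return the data values from either specification shape."""
    data = spec["data"]
    if "values" in data:
        # Consolidation off: the values are inlined under "data".
        return data["values"]
    # Consolidation on: "data" holds a name that keys into "datasets".
    ds_key = data["name"]
    return spec["datasets"][ds_key]


feature = {"type": "Feature", "geometry": {"type": "Point", "coordinates": [0, 0]}}
consolidated = {
    "data": {"name": "data-abc123"},
    "datasets": {"data-abc123": [feature]},
}
inline = {"data": {"values": [feature]}}

print(dataset_values(consolidated) == dataset_values(inline))  # True
```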

@mattijn
Contributor Author

mattijn commented Aug 29, 2019

👍 done!

@jakevdp
Collaborator

jakevdp commented Aug 29, 2019

Awesome! Thanks for all your work on this!

@jakevdp jakevdp merged commit f5f6032 into vega:master Aug 29, 2019
@mattijn
Contributor Author

mattijn commented Aug 29, 2019

Let me enjoy this enlightening experience by closing #588

@mattijn mattijn deleted the add-__geo_interface__-attribute-as-accepted-data-type branch August 29, 2019 21:15
@kannes

kannes commented Sep 4, 2019

oh boy oh boy oh boy oh boy

thank you!

@kannes

kannes commented Sep 5, 2019

Could you clarify exactly which parts of the geo interface spec are used and how the data must be structured? I assume you must pass full Features, not just Geometries, so that properties/attributes for classification and coloring are included?

Does the data object passed to alt.Chart() have to have a __geo_interface__ of its own (the GeoPandas approach, which is sadly off-spec and not really documented, if I read it right), or can you pass a sequence of objects that each have their own __geo_interface__ (a more traditional approach that would be easier to work with using lower-level modules like Fiona)?

@mattijn
Contributor Author

mattijn commented Sep 5, 2019

Any object that supports the __geo_interface__ is serialized and sanitized to long-form structured data in JSON format that plays well with Altair.

The mark_geoshape of Altair understands a record with a Feature

{"type": "Feature", "geometry": []}

or a list of Features

[
    {"type": "Feature", "geometry": []}, 
    {"type": "Feature", "geometry": []}
]

Since a single geometry is valid under the __geo_interface__, it is sanitized to a Feature without properties (covered by this test).
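The three input shapes can be sketched as a small dispatch on the `type` member. This is an illustration under assumptions rather than the code in Altair; `to_features` and `GEOMETRY_TYPES` are hypothetical names.

```python
GEOMETRY_TYPES = {
    "Point", "MultiPoint", "LineString", "MultiLineString",
    "Polygon", "MultiPolygon", "GeometryCollection",
}


def to_features(geo):
    """Normalise a __geo_interface__ mapping to a Feature or a list of Features."""
    kind = geo["type"]
    if kind == "FeatureCollection":
        return list(geo["features"])
    if kind == "Feature":
        return geo
    if kind in GEOMETRY_TYPES:
        # A bare geometry becomes a Feature without properties.
        return {"type": "Feature", "geometry": geo}
    raise ValueError(f"unsupported type: {kind!r}")


point = {"type": "Point", "coordinates": [6.90, 53.48]}
print(to_features(point))
# {'type': 'Feature', 'geometry': {'type': 'Point', 'coordinates': [6.9, 53.48]}}
```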

A list of objects in which only the individual items have the __geo_interface__ attribute would not work. The object itself, in this example the list, needs to have the __geo_interface__ attribute to be serialized properly¹. Consider converting to a geojson FeatureCollection first.
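One way to do that conversion without extra dependencies is a thin wrapper that exposes the whole sequence as a FeatureCollection. This is a sketch: `Geom` and `FeatureCollection` are hypothetical classes written for this example (the geojson package ships its own `FeatureCollection`), and the wrapper assumes each item exposes either a geometry or a Feature.

```python
class Geom:
    """Hypothetical item exposing a mapping through __geo_interface__."""

    def __init__(self, mapping):
        self._mapping = mapping

    @property
    def __geo_interface__(self):
        return self._mapping


class FeatureCollection:
    """Wrap a sequence of __geo_interface__ objects as one data object."""

    def __init__(self, items):
        self._items = list(items)

    @property
    def __geo_interface__(self):
        features = []
        for item in self._items:
            geo = item.__geo_interface__
            if geo["type"] != "Feature":
                # Bare geometry: wrap it in a Feature with empty properties.
                geo = {"type": "Feature", "properties": {}, "geometry": geo}
            features.append(geo)
        return {"type": "FeatureCollection", "features": features}


items = [
    Geom({"type": "Point", "coordinates": [6.90, 53.48]}),
    Geom({"type": "Point", "coordinates": [5.98, 51.85]}),
]
collection = FeatureCollection(items)
print(len(collection.__geo_interface__["features"]))  # 2
```

The wrapper itself now has the attribute, so it can be passed where a single data object is expected.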

Fiona does not (yet) support the __geo_interface__ attribute and, as documented on its website, is not (yet) intended for creating JSON objects:

In what cases would you not benefit from using Fiona?

  • If your data is in or destined for a JSON document you should use Python’s json or simplejson modules.

This might change in the future since you opened an issue on the fiona repo.

¹ in topojson I do accept lists where only the objects support the __geo_interface__, but in my opinion this is not appropriate for Altair. What about lists of pandas DataFrames?
