
Run data transformers on all input types #887

Closed
wants to merge 7 commits

Conversation

saulshanabrook
Contributor

This closes #843 by running the data transformers on all input types.

This will allow users to specify transformations for custom input types, like Ibis expressions.

It changes the existing transformers to raise a warning, rather than an exception, when they are called on types they can't handle.

warnings.warn("data of type {0} not recognized".format(type(data)))
return data
if isinstance(data, six.string_types):
    data = core.UrlData(data)
Collaborator

What if we made this part of the default data transformer... the first stage in the transformer pipeline could be a function that, given a string, outputs a URL data object, and otherwise lets things pass through.

That way we give the user complete control if, for example, they want to create a data transformer that does something different with strings.
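
Roughly, I'm imagining the first stage as something like this minimal sketch (the function name and the core import path are illustrative assumptions, reusing the UrlData and six.string_types usage from the snippet above):

import six
from altair.vegalite.v2.schema import core  # assumed import path for the schema wrappers

def to_url_data(data):
    """Treat string inputs as URLs; pass anything else through untouched."""
    if isinstance(data, six.string_types):
        return core.UrlData(data)
    return data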

Contributor Author

That makes sense to me. But how would the transformer indicate that some things should be passed through the pipeline and others should be returned exactly as is? In other words, how would the string transformer conditionally execute the rest of the default transformations, depending on the type of the argument?

Contributor Author

Seems like a good idea.

How should the string transformer conditionally execute the rest of the pipeline?

Collaborator

Currently transforms are made up of a number of units, defined here: https://github.com/altair-viz/altair/blob/master/altair/utils/data.py

They are strung together in a pipeline, defined here: https://github.com/altair-viz/altair/blob/master/altair/vegalite/data.py#L10-L12

I'd imagine you'd add a string transformer to this pipeline that says "if the input is a string I'll transform it, otherwise I'll return the input unchanged". We'd have to have similar logic for the other transforms to pass types through unchanged if they don't know how to transform them.

In other words, each step in the pipeline knows how to deal with a specific type of data.
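
Roughly, the resulting pipeline could have this shape, using toolz.pipe as the linked data.py does (the step names here are illustrative, not the actual Altair functions):

import pandas as pd
from toolz import pipe

def handle_string(data):
    # Interpret strings as URLs; pass everything else through.
    return {'url': data} if isinstance(data, str) else data

def handle_dataframe(data):
    # Inline DataFrames as records; pass everything else through.
    if isinstance(data, pd.DataFrame):
        return {'values': data.to_dict(orient='records')}
    return data

def default_data_transformer(data):
    return pipe(data, handle_string, handle_dataframe)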

Contributor Author

But then won't the rest of the pipeline be executed on the transformed string and raise warnings?

Collaborator

@jakevdp jakevdp May 23, 2018

As currently written, yes. My proposal is to change all steps in the pipeline such that they transform data types they know about, and silently pass through anything else.

Contributor Author

OK, I have updated it to remove the warnings and to make the URL conversion into a transformer.

@jakevdp jakevdp mentioned this pull request May 23, 2018
@jakevdp
Collaborator

jakevdp commented May 24, 2018

Looks like the Travis failure is just an unused import: remove the import pandas line in api.py and things should pass.

try:
    check_data_type(data)
except TypeError:
    return data
Collaborator

I think it would be cleaner to get rid of the check_data_type function, because it doesn't really help now that more data types are supported. For example, here the type checks are already accomplished below... all that's needed is an else statement that returns the data.

Contributor Author

Sounds good, updated.

@jakevdp
Collaborator

jakevdp commented May 25, 2018

Looks great, thanks for the updates.

I'm pretty happy with this right now – I'd like to get @ellisonbg's review before we merge this, since the data transformer registry was his creation.

@saulshanabrook
Contributor Author

saulshanabrook commented May 25, 2018

Thanks for reviewing. I fixed the import and removed the check_data_type function.

@jakevdp jakevdp requested a review from ellisonbg May 25, 2018 18:35
@jakevdp
Collaborator

jakevdp commented May 25, 2018

Brian – in case you haven't followed the discussion: the idea here is to put all data transformations within the configurable registry, and not hard-code any logic about what input data types are valid. This will allow users (and us!) to more cleanly create transformers to properly handle arbitrary data sources, such as SQL queries, xarray objects, geodataframes, etc.

@jakevdp jakevdp added this to the 2.2 milestone Jun 6, 2018
@jakevdp jakevdp mentioned this pull request Jun 10, 2018
@iliatimofeev
Contributor

I've looked at the changes. It is a good idea to introduce a pattern of passing unknown types through transformations. I did exactly the same thing with limit_rows in my #818.

But I'm afraid of cases where the pipeline will silently do nothing. Imagine a user sets up to_csv in a pipeline and then passes a GeoJSON object as the Chart's data. A GeoJSON object cannot be saved in CSV format, so to_csv will silently do nothing and the user will get a SchemaValidationError with "'data' is a required property" or "is not valid under any of the given schemas". Personally, I would interpret that message as "something is broken in Altair", because the data is set and the documentation says I can pass GeoJSON objects as data.

Can you suggest some patterns for diagnosing an incorrect pipeline setup?

@jakevdp
Collaborator

jakevdp commented Jun 10, 2018

I think it's as simple as having the more specialized step in the pipeline come first. Try processing the data as geojson, and if it's invalid then pass it through to the next step which treats it as a regular csv.

I also think geojson is specialized enough that it probably shouldn't be enabled by default, and this PR makes it easier to have plugins that do everything the geojson transformer needs to do.

Once this is merged, I think we have a better framework in which to think about the details of how to handle geojson.

@saulshanabrook
Contributor Author

saulshanabrook commented Jun 10, 2018

@iliatimofeev I can imagine a couple of ways to support things like to_csv with geodataframes using this PR.

The first would be to create a transformer that converts geodataframes into pandas dataframes and put it at the beginning of the transformation pipeline. It could silently pass through other types. Then a user could enable this by default (or you could enable it on import), and their existing data would be passed through untouched while their geodataframes would be transformed first.

However, it seems like you do some custom logic to save to a CSV that isn't just converting to a dataframe and using the existing implementation. In that case, when the user imports your package, you could re-register the csv transformer with a wrapped version of that function that also handles the geodataframe case. That way, as long as the user imports your library, when they enable csv it will call your wrapped version, which would check whether the data is a geodataframe and either do the special logic or pass it on to the original implementation.
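
As a hedged sketch of that second option (the to_csv import location and the geometry handling are assumptions for illustration, not something this PR provides):

import altair as alt
from altair.utils.data import to_csv  # assumed location of the built-in csv transformer

def geo_aware_to_csv(data):
    try:
        import geopandas
    except ImportError:
        geopandas = None
    if geopandas is not None and isinstance(data, geopandas.GeoDataFrame):
        # Custom geodataframe handling would go here; as a placeholder,
        # drop the geometry column before deferring to the original transformer.
        data = data.drop(columns='geometry')
    return to_csv(data)

alt.data_transformers.register('csv', geo_aware_to_csv)
alt.data_transformers.enable('csv')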

@saulshanabrook
Contributor Author

By allowing custom data transformers to operate on any type, third-party libraries can extend the process however they like. That would mean your PR could instead live in the geopandas repo or in another repo.

I could see a more complicated version of this system, where every data transformer is really a single-dispatch function, so that anyone could easily register their own type-specific versions of limit_rows or csv. That's a bigger change, though, and I have not tried anything like that to see whether it would work completely.
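
For illustration only, a single-dispatch limit_rows might look roughly like this, using functools.singledispatch from the standard library (not what this PR implements):

from functools import singledispatch
import pandas as pd

@singledispatch
def limit_rows(data, max_rows=5000):
    # Default: unknown types pass through unchanged.
    return data

@limit_rows.register(pd.DataFrame)
def _(data, max_rows=5000):
    # DataFrames are truncated to the first max_rows rows.
    return data.head(max_rows)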

@iliatimofeev
Contributor

Do I understand your logic correctly that to_geojson is a separate transformation that has to be added to the pipeline in order to work? Then to_geojson has to return something that to_values and to_json will accept as data to store (like {"values": array of GeoJSON}). But in that case to_csv will accept it as well, and we will get an empty chart without any warning from Altair. Otherwise we would need to add an extra to_geojson_values that returns an alt.InlineData object, have to_geojson return alt.UrlData and repeat all the file-naming logic separately, then add to_flat_geojson and to_prop_geojson with corresponding *_values versions. That would certainly work, but it looks strange to my taste.

In any case, I suggest checking that the pipeline returns a valid alt.Data object; that will help users understand that the error is in the pipeline definition and not in the Chart.

@saulshanabrook
Contributor Author

In any case, I suggest checking that the pipeline returns a valid alt.Data object; that will help users understand that the error is in the pipeline definition and not in the Chart.

So this is what the error looks like now in the PR, for what I think is the use case you are talking about:

$ ipython
Python 3.6.5 | packaged by conda-forge | (default, Apr  6 2018, 13:44:09)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import altair as alt

In [2]: alt.data_transformers.register('new_thing', lambda data: ['not right'])
Out[2]: <function __main__.<lambda>(data)>

In [3]: alt.data_transformers.enable('new_thing')
Out[3]: DataTransformerRegistry.enable('new_thing')

In [4]: alt.Chart('something')
Out[4]: ---------------------------------------------------------------------------
SchemaValidationError                     Traceback (most recent call last)
~/p/altair/altair/vegalite/v2/api.py in to_dict(self, *args, **kwargs)
    327         if dct is None:
    328             kwargs['validate'] = 'deep'
--> 329             dct = super(TopLevelMixin, copy).to_dict(*args, **kwargs)
    330
    331         if is_top_level:

~/p/altair/altair/utils/schemapi.py in to_dict(self, validate, ignore, context)
    252                 self.validate(result)
    253             except jsonschema.ValidationError as err:
--> 254                 raise SchemaValidationError(self, err)
    255         return result
    256

SchemaValidationError: Invalid specification

        altair.vegalite.v2.api.Chart->data, validating 'anyOf'

        ['not right'] is not valid under any of the given schemas


Chart({
  data: 'something'
})

We could include the transformer that produced that data in the error message, somewhere around https://github.com/saulshanabrook/altair/blob/data-transformers/altair/vegalite/v2/api.py#L329.

But that seems to require a little bit of re-organization there. I think that could be done in a separate PR after this one.

@iliatimofeev
Contributor

My problem is that not all data sources can be stored in all formats. For example, GeoJSON cannot be stored in CSV format; it would not be recognized by Vega as geometry. And I'm sure that in such a case we should somehow warn the user about the problem.

I think the real problem with the current abstraction is that we in fact have two types of transformations: the first are really preprocessing steps like sample, and the other are storing steps like to_values or to_csv. Preprocessing transformations can be stacked, but for storing we need to choose a format.

@iliatimofeev
Contributor

@saulshanabrook

So this is what the error looks like now in the PR, for what I think is the use case you are talking about:

Yep, it is quite readable in your sample, but in real life it will be less understandable. (Just add return data at the beginning of to_values and run test_examples.)

self = Chart({
  data:    x          y  yerr
  0  1  10.882026   0.2
  1  2  10.200079   0.2
  2  3  10.489369   0.2
  3  4  ...}),
      type: 'quantitative'
    })
  }),
  mark: MarkDef({
    filled: True,
    size: 50,
    type: 'point'
  })
})
validate = 'deep', ignore = []
context = {'data':    x          y  yerr
0  1  10.882026   0.2
1  2  10.200079   0.2
2  3  10.489369   0.2
3  4  11.120447   0.2
4  5  10.933779   0.2, 'top_level': False}
            except jsonschema.ValidationError as err:
>               raise SchemaValidationError(self, err)
E               altair.utils.schemapi.SchemaValidationError: Invalid specification
E               
E                       altair.vegalite.v2.api.Chart->data, validating 'anyOf'
E               
E                          x          y  yerr
E               0  1  10.882026   0.2
E               1  2  10.200079   0.2
E               2  3  10.489369   0.2
E               3  4  11.120447   0.2
E               4  5  10.933779   0.2 is not valid under any of the given schemas

altair/altair/utils/schemapi.py:254: SchemaValidationError

@iliatimofeev
Contributor

To sum up the discussion, I only want to warn that the strategy of passing unknown inputs through without any warnings could lead to unexpected results in the future. In the current PR it has more benefits than risks, but it would be great to somehow limit the scope of the unknown with appropriate checks, either by adding a check to the API or by overriding the pipe function.

All other questions we can discuss when it is time to integrate my PR, once this one is accepted into the codebase.
@ellisonbg this PR has become a blocker for GeoJSON support for now; could you give us some timeline for reviewing it?

else:
    warnings.warn("data of type {0} not recognized".format(type(data)))
    return data
return pipe(data, data_transformers.get())
Contributor

@iliatimofeev iliatimofeev Jun 10, 2018

I think this is a good place to check isinstance(pipe_result, (dict, core.Data, core.InlineData, core.UrlData, core.NamedData)) and raise an error saying the pipeline returned an unexpected data object, so the user knows to check the current pipeline definition.
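
Something like the following sketch, reusing pipe, data_transformers, and core from the snippet above (the wrapper function name and the choice of TypeError are just for illustration):

def run_data_transformers(data):
    pipe_result = pipe(data, data_transformers.get())
    if not isinstance(pipe_result, (dict, core.Data, core.InlineData,
                                    core.UrlData, core.NamedData)):
        raise TypeError(
            "Data transformer pipeline returned an unexpected object of type "
            "{0}; check the current data_transformers definition."
            .format(type(pipe_result)))
    return pipe_result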

@ellisonbg
Collaborator

I posted a note on the related issue (#843). Quick summary:

  • I think it would be useful to always run the data transformers on all data.
  • Let's use multipledispatch to handle the type switching and make it easy to implement (see the sketch below).
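
A rough sketch of how that could look with the multipledispatch package (the function and its output format are illustrative, not part of this PR):

import pandas as pd
from multipledispatch import dispatch

@dispatch(pd.DataFrame)
def to_values(data):
    # DataFrames are inlined as records.
    return {'values': data.to_dict(orient='records')}

@dispatch(object)
def to_values(data):
    # Fallback: anything unrecognized passes through unchanged.
    return data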

@jakevdp
Collaborator

jakevdp commented Aug 18, 2018

Let's use multipledispatch to handle the type switching and make it easy to implement.

That's a really good idea.

@stonebig
Contributor

Any news?
