Run data transformers on all input types #887
Conversation
altair/vegalite/v2/api.py
Outdated
warnings.warn("data of type {0} not recognized".format(type(data))) | ||
return data | ||
if isinstance(data, six.string_types): | ||
data = core.UrlData(data) |
What if we made this part of the default data transformer... the first stage in the transformer pipeline could be a function that, given a string, outputs a URL data object, and otherwise lets things pass through.
That way we give the user complete control if, for example, they want to create a data transformer that does something different with strings.
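A minimal sketch of that first stage (the function name is hypothetical; six.string_types and core.UrlData are taken from the diff above, and the import path for core is an assumption):

    import six
    from altair.vegalite.v2.schema import core  # path assumed; `core` as used in api.py

    def string_to_url(data):
        # First pipeline stage: given a string, wrap it as URL data;
        # let any other type pass through unchanged.
        if isinstance(data, six.string_types):
            return core.UrlData(data)
        return data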
That makes sense to me. But how would the transformer indicate that some things should be passed through the pipeline and others should just return exactly as is? I mean, how would the string transformers conditionally execute the rest of the default transformations, depending on the type of the argument?
Seems like a good idea.
How should the string transformer conditionally execute the rest of the pipeline?
Currently transforms are made up of a number of units, defined here: https://github.com/altair-viz/altair/blob/master/altair/utils/data.py
They are strung together in a pipeline, defined here: https://github.com/altair-viz/altair/blob/master/altair/vegalite/data.py#L10-L12
I'd imagine you'd add a string transformer to this pipeline that says "if the input is a string I'll transform it, otherwise I'll return the input unchanged". We'd have to have similar logic for the other transforms to pass types through unchanged if they don't know how to transform them.
In other words, each step in the pipeline knows how to deal with a specific type of data.
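A sketch of that pass-through pattern, strung together with toolz's pipe (which the pipeline linked above is built on); the stage bodies here are illustrative, not Altair's actual implementations:

    import pandas as pd
    from toolz import pipe

    def handle_string(data):
        # This stage only knows about strings; everything else passes through.
        if isinstance(data, str):
            return {'url': data}
        return data

    def handle_dataframe(data):
        # This stage only knows about DataFrames; everything else passes through.
        if isinstance(data, pd.DataFrame):
            return {'values': data.to_dict(orient='records')}
        return data

    pipe('data/cars.json', handle_string, handle_dataframe)
    # -> {'url': 'data/cars.json'}: the string stage transformed the input,
    #    and the DataFrame stage passed the resulting dict through untouched.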
But then won't the rest of the pipeline be executed on the transformed string and raise warnings?
As currently written, yes. My proposal is to change all steps in the pipeline such that they transform data types they know about, and silently pass through anything else.
OK I have updated it to remove the warnings and made the url conversion into a transformer.
Looks like the Travis failure is just an unused import: remove the …
altair/utils/data.py
Outdated
try:
    check_data_type(data)
except TypeError:
    return data
I think it would be cleaner to get rid of the check_data_type function, because it doesn't really help now that more data types are supported. For example, here the type checks are already accomplished below... all that's needed is an else statement that returns the data.
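A minimal sketch of that shape (the transformer body is illustrative): the explicit type checks do the work, and a plain else returns unknown data unchanged.

    import pandas as pd

    def to_values(data):
        # Handle the types this stage knows about explicitly...
        if isinstance(data, pd.DataFrame):
            return {'values': data.to_dict(orient='records')}
        elif isinstance(data, dict):
            return data
        else:
            # ...and let everything else fall through unchanged,
            # with no check_data_type() call and no warning.
            return data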
Sounds good, updated.
Looks great, thanks for the updates. I'm pretty happy with this right now – I'd like to get @ellisonbg's review before we merge this, since the data transformer registry was his creation.
Thanks for reviewing. I fixed the import and removed the …
Brian – in case you haven't followed the discussion: the idea here is to put all data transformations within the configurable registry, and not hard-code any logic about what input data types are valid. This will allow users (and us!) to more cleanly create transformers to properly handle arbitrary data sources, such as SQL queries, xarray objects, geodataframes, etc.
I've looked at the changes. It's a good idea to introduce a pattern of passing unknown types through the transformation. I did exactly the same thing with … But I'm afraid of cases where the pipeline will silently do nothing. Let's imagine a user sets up … Can you suggest some patterns for diagnosing an incorrect pipeline setup?
I think it's as simple as having the more specialized step in the pipeline come first. Try processing the data as geojson, and if it's invalid then pass it through to the next step, which treats it as a regular csv. I also think geojson is specialized enough that it probably shouldn't be enabled by default, and this PR makes it easier to have plugins that do everything the geojson transformer needs to do. Once this is merged, I think we have a better framework in which to think about the details of how to handle geojson.
@iliatimofeev I can imagine a couple of ways to support things like … The first would be to create a transformer that converts geodataframes into pandas dataframes and put it at the beginning of the transformation pipeline. It could silently pass on other types. Then, a user could enable this by default (or you could enable it on import) and their existing data would pass through untouched, but their geodataframes would be transformed first. However, it seems like you do some custom logic to save to a CSV that isn't just converting to a dataframe and using their implementation. In that case, when the user imports your package, you could re-register the …
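A hedged sketch of that first approach (the geopandas handling is an assumption; the registry calls follow the public data_transformers API shown later in this thread). Note that enabling this transformer replaces the default pipeline, so in practice you would compose it with the default stages:

    import altair as alt
    import pandas as pd

    def geopandas_transformer(data):
        # Convert GeoDataFrames to plain DataFrames; silently pass other types on.
        try:
            import geopandas
        except ImportError:
            return data
        if isinstance(data, geopandas.GeoDataFrame):
            return pd.DataFrame(data.drop(columns='geometry'))
        return data

    alt.data_transformers.register('geopandas', geopandas_transformer)
    alt.data_transformers.enable('geopandas')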
By allowing custom data transformers to operate on any type, third-party libraries can extend the process however they like. That would mean your PR could instead live in the geopandas repo or in another repo. I could see a more complicated version of this system, where every data transformer is really a single-dispatch function, so that anyone could easily register their own type-specific versions of …
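That single-dispatch version might look roughly like this (a sketch only; nothing in Altair works this way today):

    from functools import singledispatch
    import pandas as pd

    @singledispatch
    def transform(data):
        # Default case: unknown types pass through untouched.
        return data

    @transform.register(pd.DataFrame)
    def _(data):
        return {'values': data.to_dict(orient='records')}

    # A third-party package could then register its own type without
    # touching Altair at all, e.g.:
    #   @transform.register(geopandas.GeoDataFrame)
    #   def _(data): ...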
Am I understanding your logic correctly, that …? In any case, I suggest checking that the pipeline returns a valid …
So this is what the error looks like now in the PR, for what I think is the use case you are talking about:

$ ipython
Python 3.6.5 | packaged by conda-forge | (default, Apr 6 2018, 13:44:09)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.4.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import altair as alt
In [2]: alt.data_transformers.register('new_thing', lambda data: ['not right'])
Out[2]: <function __main__.<lambda>(data)>
In [3]: alt.data_transformers.enable('new_thing')
Out[3]: DataTransformerRegistry.enable('new_thing')
In [4]: alt.Chart('something')
Out[4]: ---------------------------------------------------------------------------
SchemaValidationError Traceback (most recent call last)
~/p/altair/altair/vegalite/v2/api.py in to_dict(self, *args, **kwargs)
327 if dct is None:
328 kwargs['validate'] = 'deep'
--> 329 dct = super(TopLevelMixin, copy).to_dict(*args, **kwargs)
330
331 if is_top_level:
~/p/altair/altair/utils/schemapi.py in to_dict(self, validate, ignore, context)
252 self.validate(result)
253 except jsonschema.ValidationError as err:
--> 254 raise SchemaValidationError(self, err)
255 return result
256
SchemaValidationError: Invalid specification
altair.vegalite.v2.api.Chart->data, validating 'anyOf'
['not right'] is not valid under any of the given schemas
Chart({
data: 'something'
})

We could add the transformer that produced that data to the error message, somewhere around https://github.com/saulshanabrook/altair/blob/data-transformers/altair/vegalite/v2/api.py#L329, but that seems to require a little bit of re-organization there. I think that could be done in a separate PR after this one.
My problem is that not all data sources can be stored in all formats. For example, geojson cannot be stored in CSV format, since it would not be recognized by Vega as geometry. I'm sure that in such cases we should somehow warn the user about the problem. I think the real problem with the current abstraction is that we in fact have two kinds of transformations: the first is really preprocessing steps like …
Yep, it is quite readable in your sample, but in real life it will be less understandable (just add …):

self = Chart({
data: x y yerr
0 1 10.882026 0.2
1 2 10.200079 0.2
2 3 10.489369 0.2
3 4 ...}),
type: 'quantitative'
})
}),
mark: MarkDef({
filled: True,
size: 50,
type: 'point'
})
})
validate = 'deep', ignore = []
context = {'data': x y yerr
0 1 10.882026 0.2
1 2 10.200079 0.2
2 3 10.489369 0.2
3 4 11.120447 0.2
4 5 10.933779 0.2, 'top_level': False}
except jsonschema.ValidationError as err:
> raise SchemaValidationError(self, err)
E altair.utils.schemapi.SchemaValidationError: Invalid specification
E
E altair.vegalite.v2.api.Chart->data, validating 'anyOf'
E
E x y yerr
E 0 1 10.882026 0.2
E 1 2 10.200079 0.2
E 2 3 10.489369 0.2
E 3 4 11.120447 0.2
E 4 5 10.933779 0.2 is not valid under any of the given schemas
altair/altair/utils/schemapi.py:254: SchemaValidationError
To summarize the discussion: I only want to warn that the strategy of passing unknown inputs through without any warnings could lead to unexpected results in the future. In the current PR it has more benefits than risks, but it would be great to somehow limit the scope of the unknown with appropriate checks, either by adding a check to the API or by wrapping the pipe function. All other questions we can discuss when integrating my PR, once this one is accepted into the codebase.
else:
    warnings.warn("data of type {0} not recognized".format(type(data)))
    return data
return pipe(data, data_transformers.get())
I think this is a good place to check isinstance(pipe_result, (dict, core.Data, core.InlineData, core.UrlData, core.NamedData)) and raise an error saying that the pipeline returned an unexpected data object, so the user knows to check the current pipeline definition.
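For instance, a minimal sketch of that check (pipe, core, and data_transformers are the names from the surrounding code; the error message wording is illustrative):

    result = pipe(data, data_transformers.get())
    if not isinstance(result, (dict, core.Data, core.InlineData,
                               core.UrlData, core.NamedData)):
        # Fail loudly here rather than letting schema validation fail later.
        raise TypeError(
            "data transformer pipeline returned unexpected type {0}; "
            "check the current pipeline definition".format(type(result)))
    return result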
I posted a note on the related issue (#843). Quick summary: …
That's a really good idea.
Any news?
This closes #843 by running the data transformers on all input types.
This will allow users to specify transformations for custom input types, like Ibis expressions.
It changes the existing transformations to just raise a warning if they are called on types they can't handle, instead of raising an exception.