
Run data transformers on all input types #887

Closed
wants to merge 7 commits

Conversation

saulshanabrook
Contributor

This closes #843 by running the data transformers on all input types.

This will allow users to specify transformations for custom input types, like Ibis expressions.

It changes the existing transformers to raise a warning, rather than an exception, when they are called on types they can't handle.

warnings.warn("data of type {0} not recognized".format(type(data)))
return data
if isinstance(data, six.string_types):
    data = core.UrlData(data)
Collaborator

What if we made this part of the default data transformer... the first stage in the transformer pipeline could be a function that, given a string, outputs a URL data object, and otherwise lets things pass through.

That way we give the user complete control if, for example, they want to create a data transformer that does something different with strings.
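
Roughly, I'm imagining the first stage as something like this minimal sketch (the function name and the core import path are illustrative assumptions, reusing the UrlData and six.string_types usage from the snippet above):

import six
from altair.vegalite.v2.schema import core  # assumed import path for the schema wrappers

def to_url_data(data):
    """Treat string inputs as URLs; pass anything else through untouched."""
    if isinstance(data, six.string_types):
        return core.UrlData(data)
    return data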

Contributor Author

That makes sense to me. But how would the transformer indicate that some things should be passed through the pipeline and others should be returned exactly as is? In other words, how would the string transformer conditionally execute the rest of the default transformations, depending on the type of the argument?

Contributor Author

Seems like a good idea.

How should the string transformer conditionally execute the rest of the pipeline?

Collaborator

Currently transforms are made up of a number of units, defined here: https://github.com/altair-viz/altair/blob/master/altair/utils/data.py

They are strung together in a pipeline, defined here: https://github.com/altair-viz/altair/blob/master/altair/vegalite/data.py#L10-L12

I'd imagine you'd add a string transformer to this pipeline that says "if the input is a string I'll transform it, otherwise I'll return the input unchanged". We'd have to have similar logic for the other transforms to pass types through unchanged if they don't know how to transform them.

In other words, each step in the pipeline knows how to deal with a specific type of data.
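
Roughly, the resulting pipeline could have this shape, using toolz.pipe as the linked data.py does (the step names here are illustrative, not the actual Altair functions):

import pandas as pd
from toolz import pipe

def handle_string(data):
    # Interpret strings as URLs; pass everything else through.
    return {'url': data} if isinstance(data, str) else data

def handle_dataframe(data):
    # Inline DataFrames as records; pass everything else through.
    if isinstance(data, pd.DataFrame):
        return {'values': data.to_dict(orient='records')}
    return data

def default_data_transformer(data):
    return pipe(data, handle_string, handle_dataframe)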

Contributor Author

But then won't the rest of the pipeline be executed on the transformed string and raise warnings?

Collaborator

@jakevdp jakevdp May 23, 2018

As currently written, yes. My proposal is to change all steps in the pipeline such that they transform data types they know about, and silently pass through anything else.

Contributor Author

OK, I have updated it to remove the warnings and to make the URL conversion into a transformer.

@jakevdp jakevdp mentioned this pull request May 23, 2018
@jakevdp
Collaborator

jakevdp commented May 24, 2018

Looks like the Travis failure is just an unused import: remove the import pandas line in api.py and things should pass.

try:
    check_data_type(data)
except TypeError:
    return data
Collaborator

I think it would be cleaner to get rid of the check_data_type function, because it doesn't really help now that more data types are supported. For example, here the type checks are already accomplished below... all that's needed is an else statement that returns the data.

Contributor Author

Sounds good, updated.

@jakevdp
Collaborator

jakevdp commented May 25, 2018

Looks great, thanks for the updates.

I'm pretty happy with this right now – I'd like to get @ellisonbg's review before we merge this, since the data transformer registry was his creation.

@saulshanabrook
Contributor Author

saulshanabrook commented May 25, 2018

Thanks for reviewing. I fixed the import and removed the check_data_type function.

@jakevdp jakevdp requested a review from ellisonbg May 25, 2018 18:35
@jakevdp
Collaborator

jakevdp commented May 25, 2018

Brian – in case you haven't followed the discussion: the idea here is to put all data transformations within the configurable registry, and not hard-code any logic about what input data types are valid. This will allow users (and us!) to more cleanly create transformers to properly handle arbitrary data sources, such as SQL queries, xarray objects, geodataframes, etc.

@jakevdp jakevdp added this to the 2.2 milestone Jun 6, 2018
@jakevdp jakevdp mentioned this pull request Jun 10, 2018
@iliatimofeev
Contributor

I've looked at the changes. It is a good idea to introduce a pattern of passing unknown types through transformations. I did exactly the same thing with limit_rows in my #818.

But I'm afraid of cases where the pipeline will silently do nothing. Imagine a user sets up to_csv in a pipeline and then passes a GeoJSON object as the Chart's data. A GeoJSON object cannot be saved in CSV format, so to_csv will silently do nothing and the user will get a SchemaValidationError with "'data' is a required property" or "is not valid under any of the given schemas". Personally, I would interpret that message as "something is broken in Altair", because the data is set and the documentation says I can pass GeoJSON objects as data.

Can you suggest some patterns for diagnosing an incorrect pipeline setup?

@jakevdp
Collaborator

jakevdp commented Jun 10, 2018

I think it's as simple as having the more specialized step in the pipeline come first. Try processing the data as geojson, and if it's invalid then pass it through to the next step which treats it as a regular csv.

I also think geojson is specialized enough that it probably shouldn't be enabled by default, and this PR makes it easier to have plugins that do everything the geojson transformer needs to do.

Once this is merged, I think we have a better framework in which to think about the details of how to handle geojson.

@saulshanabrook
Contributor Author

saulshanabrook commented Jun 10, 2018

@iliatimofeev I can imagine a couple of ways to support things like to_csv with geodataframes using this PR.

The first would be to create a transformer that converts geodataframes into pandas dataframes and put it at the beginning of the transformation pipeline. It could silently pass through other types. Then a user could enable this by default (or you could enable it on import), and their existing data would be passed through untouched while their geodataframes would be transformed first.

However, it seems like you do some custom logic to save to a CSV that isn't just converting to a dataframe and using the existing implementation. In that case, when the user imports your package, you could re-register the csv transformer with a wrapped version of that function that also handles the geodataframe case. That way, as long as the user imports your library, when they enable csv it will call your wrapped version, which would check whether the data is a geodataframe and either do the special logic or pass it on to the original implementation.
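
As a hedged sketch of that second option (the to_csv import location and the geometry handling are assumptions for illustration, not something this PR provides):

import altair as alt
from altair.utils.data import to_csv  # assumed location of the built-in csv transformer

def geo_aware_to_csv(data):
    try:
        import geopandas
    except ImportError:
        geopandas = None
    if geopandas is not None and isinstance(data, geopandas.GeoDataFrame):
        # Custom geodataframe handling would go here; as a placeholder,
        # drop the geometry column before deferring to the original transformer.
        data = data.drop(columns='geometry')
    return to_csv(data)

alt.data_transformers.register('csv', geo_aware_to_csv)
alt.data_transformers.enable('csv')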

@saulshanabrook
Contributor Author

By allowing custom data transformers to operate on any type, third-party libraries can extend the process however they like. That would mean your PR could instead live in the geopandas repo or in another repo.

I could see a more complicated version of this system, where every data transformer is really a single-dispatch function, so that anyone could easily register their own type-specific versions of limit_rows or csv. That's a bigger change, though, and I have not tried anything like that to see whether it would work completely.
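
For illustration only, a single-dispatch limit_rows might look roughly like this, using functools.singledispatch from the standard library (not what this PR implements):

from functools import singledispatch
import pandas as pd

@singledispatch
def limit_rows(data, max_rows=5000):
    # Default: unknown types pass through unchanged.
    return data

@limit_rows.register(pd.DataFrame)
def _(data, max_rows=5000):
    # DataFrames are truncated to the first max_rows rows.
    return data.head(max_rows)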

@iliatimofeev
Contributor

Do I understand your logic correctly that to_geojson is a separate transformation that has to be added to the pipeline in order to work? Then to_geojson has to return something that to_values and to_json will accept as data to store (like {"values": array of GeoJSON}). But in that case to_csv will accept it as well, and we will get an empty chart without any warning from Altair. Otherwise we would need to add an extra to_geojson_values that returns an alt.InlineData object, have to_geojson return alt.UrlData and repeat all the file-naming logic separately, then add to_flat_geojson and to_prop_geojson with corresponding *_values versions. That would certainly work, but it looks strange to my taste.

In any case, I suggest checking that the pipeline returns a valid alt.Data object; that will help users understand that the error is in the pipeline definition and not in the Chart.

@saulshanabrook
Contributor Author

In any case, I suggest checking that the pipeline returns a valid alt.Data object; that will help users understand that the error is in the pipeline definition and not in the Chart.

So this is what the error looks like now in the PR, for what I think is the use case you are talking about:

$ ipython
Python 3.6.5 | packaged by conda-forge | (default, Apr  6 2018, 13:44:09)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import altair as alt

In [2]: alt.data_transformers.register('new_thing', lambda data: ['not right'])
Out[2]: <function __main__.<lambda>(data)>

In [3]: alt.data_transformers.enable('new_thing')
Out[3]: DataTransformerRegistry.enable('new_thing')

In [4]: alt.Chart('something')
Out[4]: ---------------------------------------------------------------------------
SchemaValidationError                     Traceback (most recent call last)
~/p/altair/altair/vegalite/v2/api.py in to_dict(self, *args, **kwargs)
    327         if dct is None:
    328             kwargs['validate'] = 'deep'
--> 329             dct = super(TopLevelMixin, copy).to_dict(*args, **kwargs)
    330
    331         if is_top_level:

~/p/altair/altair/utils/schemapi.py in to_dict(self, validate, ignore, context)
    252                 self.validate(result)
    253             except jsonschema.ValidationError as err:
--> 254                 raise SchemaValidationError(self, err)
    255         return result
    256

SchemaValidationError: Invalid specification

        altair.vegalite.v2.api.Chart->data, validating 'anyOf'

        ['not right'] is not valid under any of the given schemas


Chart({
  data: 'something'
})

We could include the transformer that produced that data in the error message, somewhere around https://github.com/saulshanabrook/altair/blob/data-transformers/altair/vegalite/v2/api.py#L329.

But that seems to require a little bit of re-organization there. I think that could be done in a separate PR after this one.

@iliatimofeev
Contributor

My problem is that not all data sources can be stored in all formats. For example, GeoJSON cannot be stored in CSV format; it would not be recognized by Vega as geometry. And I'm sure that in such a case we should somehow warn the user about the problem.

I think the real problem with the current abstraction is that we in fact have two types of transformations: the first are really preprocessing steps like sample, and the other are storing steps like to_values or to_csv. Preprocessing transformations can be stacked, but for storing we need to choose a format.

@iliatimofeev
Contributor

@saulshanabrook

So this is what the error looks like now in the PR, for what I think is the use case you are talking about:

Yep, it is quite readable in your sample, but in real life it will be less understandable. (Just add return data at the beginning of to_values and run test_examples.)

self = Chart({
  data:    x          y  yerr
  0  1  10.882026   0.2
  1  2  10.200079   0.2
  2  3  10.489369   0.2
  3  4  ...}),
      type: 'quantitative'
    })
  }),
  mark: MarkDef({
    filled: True,
    size: 50,
    type: 'point'
  })
})
validate = 'deep', ignore = []
context = {'data':    x          y  yerr
0  1  10.882026   0.2
1  2  10.200079   0.2
2  3  10.489369   0.2
3  4  11.120447   0.2
4  5  10.933779   0.2, 'top_level': False}
            except jsonschema.ValidationError as err:
>               raise SchemaValidationError(self, err)
E               altair.utils.schemapi.SchemaValidationError: Invalid specification
E               
E                       altair.vegalite.v2.api.Chart->data, validating 'anyOf'
E               
E                          x          y  yerr
E               0  1  10.882026   0.2
E               1  2  10.200079   0.2
E               2  3  10.489369   0.2
E               3  4  11.120447   0.2
E               4  5  10.933779   0.2 is not valid under any of the given schemas

altair/altair/utils/schemapi.py:254: SchemaValidationError

@iliatimofeev
Contributor

To sum up the discussion, I only want to warn that the strategy of passing unknown inputs through without any warnings could lead to unexpected results in the future. In the current PR it has more benefits than risks, but it would be great to somehow limit the scope of the unknown with appropriate checks, either by adding a check to the API or by overriding the pipe function.

All other questions we can discuss when it is time to integrate my PR, once this one is accepted into the codebase.
@ellisonbg this PR has become a blocker for GeoJSON support for now; could you give us some timeline for reviewing it?

else:
    warnings.warn("data of type {0} not recognized".format(type(data)))
    return data
return pipe(data, data_transformers.get())
Contributor

@iliatimofeev iliatimofeev Jun 10, 2018

I think this is a good place to check isinstance(pipe_result, (dict, core.Data, core.InlineData, core.UrlData, core.NamedData)) and raise an error saying the pipeline returned an unexpected data object, so the user knows to check the current pipeline definition.
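
Something like the following sketch, reusing pipe, data_transformers, and core from the snippet above (the wrapper function name and the choice of TypeError are just for illustration):

def run_data_transformers(data):
    pipe_result = pipe(data, data_transformers.get())
    if not isinstance(pipe_result, (dict, core.Data, core.InlineData,
                                    core.UrlData, core.NamedData)):
        raise TypeError(
            "Data transformer pipeline returned an unexpected object of type "
            "{0}; check the current data_transformers definition."
            .format(type(pipe_result)))
    return pipe_result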

@ellisonbg
Collaborator

I posted a note on the related issue (#843). Quick summary:

  • I think it would be useful to always run the data transformers on all data.
  • Let's use multipledispatch to handle the type switching and make it easy to implement (see the sketch below).
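
A rough sketch of how that could look with the multipledispatch package (the function and its output format are illustrative, not part of this PR):

import pandas as pd
from multipledispatch import dispatch

@dispatch(pd.DataFrame)
def to_values(data):
    # DataFrames are inlined as records.
    return {'values': data.to_dict(orient='records')}

@dispatch(object)
def to_values(data):
    # Fallback: anything unrecognized passes through unchanged.
    return data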

@jakevdp
Collaborator

jakevdp commented Aug 18, 2018

Let's use multipledispatch to handle the type switching and make it easy to implement.

That's a really good idea.

@stonebig
Contributor

Any news?
