ENH: Build vega_datasets into altair? #796

jakevdp · 2018-04-30T17:06:18Z

This is something that @palewire mentioned a while ago, and the idea has been growing on me.

Currently, many of our examples are like this:

import altair as alt
from vega_datasets import data

cars = data.cars()
alt.Chart(cars)...

What if we were to import vega_datasets into Altair's namespace by default, so instead it would be

import altair as alt

cars = alt.datasets.cars()
alt.Chart(cars)...

The advantage is fewer imports and less boilerplate.

A minor disadvantage is that vega_datasets would become a hard requirement for Altair (unless we did some kind of lazy import mechanism, which would add complication).

A more major disadvantage is that it would obscure the fact that vega_datasets is a separate package, rather than a part of Altair, which might confuse people.

Thoughts?


palewire · 2018-04-30T17:19:55Z

I like it!

pagpires · 2018-04-30T18:28:29Z

I don't like it... Personally, I consider Altair a plotting tool (like matplotlib/ggplot2), so it's better to separate the tool from the data, especially since vega_datasets is mostly for tutorials and a user should be able to get the idea of how to use Altair in 2-3 days. After that, users should be good to go and never use vega_datasets again.

ellisonbg · 2018-05-01T16:59:11Z

I am probably 50/50 - see both sides.


palewire · 2018-05-01T17:02:53Z

IMHO, this is a step that has a big benefit for newbies (examples that are easy to read, copy, and understand) and little cost to experts (a slightly larger package).

My view is that any cheap way Altair can welcome in new people is worth the price.

mattijn · 2018-05-01T18:33:44Z

I'm +1. sklearn also includes some datasets, like:

from sklearn import datasets
iris = datasets.load_iris()

It's convenient for toy examples.

I'm +0 if it increases the size of the package tenfold (as there are many datasets).

mattijn · 2018-05-01T18:44:23Z

Not sure what happens under the hood in sklearn, but maybe it downloads a dataset the first time it is loaded. That is also an interesting approach.

jakevdp · 2018-05-01T18:47:27Z

sklearn downloads the data the first time it is accessed and caches it locally in $HOME/scikit_learn_data/

vega_datasets includes a few smaller datasets within the installation, but downloads the rest on demand without any local caching.

The current vega_datasets release, including these bundled datasets, is about 200KB.
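As a rough illustration of the download-once-then-cache behavior described above, here is a minimal sketch. It is not the actual code of sklearn or vega_datasets: the names load_cached and fake_fetch, the file name, and the use of a temporary directory as the cache are all assumptions made for the example, and the fetcher is a stand-in so that nothing touches the network.

```python
import os
import tempfile


def load_cached(name, fetch, cache_dir):
    """Return the bytes for `name`, calling `fetch(name)` only on a cache miss."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, name)
    if not os.path.exists(path):
        # First access: fetch the data and store it locally.
        with open(path, "wb") as f:
            f.write(fetch(name))
    # Every access (first or later) reads from the local copy.
    with open(path, "rb") as f:
        return f.read()


# Hypothetical fetcher standing in for a real download.
calls = []


def fake_fetch(name):
    calls.append(name)
    return b"a,b\n1,2\n"


with tempfile.TemporaryDirectory() as cache_dir:
    first = load_cached("cars.csv", fake_fetch, cache_dir)
    second = load_cached("cars.csv", fake_fetch, cache_dir)
```

In this sketch the second call is served from disk, so fake_fetch runs only once. An uncached loader, by contrast, would call the fetcher on every access.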

mattijn · 2018-05-01T19:06:42Z

OK, then I'm +1. Since Altair is around 250 KB, it will almost double in size, but in comparison with matplotlib (~8.5 MB) it's still super lightweight.

jakevdp · 2018-05-01T19:08:31Z

Just to be clear: I wouldn't suggest actually bundling vega_datasets into Altair; rather, I'm suggesting making it a hard dependency and importing it by default in Altair's namespace.

mattijn · 2018-05-01T19:31:12Z

I just noticed that you created the vega_datasets package under the altair-viz organization. Because of the name, I thought it came from the vega organization, like ipyvega.
Still +1 though.

jakevdp · 2018-05-01T21:42:06Z

One place that it would make sense to have tighter integration, even beyond intro tutorials, is in the geographic datasets used for map backgrounds.

palewire · 2018-05-02T15:41:04Z

If you've decided you want this I'd be happy to help with the implementation.

sebastianneubauer · 2018-05-03T09:20:40Z

I would object to this for the following reasons:

Some companies/organizations have very strict rules for software in production; each and every hard dependency increases the probability that Altair cannot be used.
Each library should try to minimize its number of hard dependencies. That improves security and, not least, saves a lot of time and resources when the package is installed millions of times. The beginners' tutorials will only be relevant in a tiny fraction of the installations.
Having the ability to install this requirement with the very convenient extra requirements, pip install altair[dev], gives nearly the same convenience for beginners (just document in the beginners' tutorials that it should be installed with the dev requirements) while at the same time avoiding the hard dependency, with the advantages pointed out above. I fully agree it should be super easy to get started, but I think extra dev requirements are a very good practice for reaching that goal. Furthermore (in line with the Zen of Python), needing an explicit import statement when an external package is used is better than an implicit hard dependency that covers up the fact that it is an external package (e.g. if you need to look up the code on GitHub because the CIO requests proof of the licence of each and every data set used ;-) ).

pagpires · 2018-05-03T17:05:09Z

Since the main developers of vega/vega-lite/altair also teach in academia, maybe someone has a sense of how long students need vega_datasets while learning Altair, and how often they have to come back to it (to refamiliarize themselves) after they have learned the package? I assume students need very little time to learn it and very few will come back, but I could be wrong.

Also, the case of altair is a bit different from sklearn because of 3 factors:

Just as mentioned above, the data/package ratio is much smaller for sklearn (minor factor);
Sklearn is a very versatile tool, and it's preferable to have a consistent, well-crafted dataset to test out all the combinations of APIs (and for users to refamiliarize themselves with them), while Altair has a smaller and (thus?) clearer API structure.
It's usually harder to know whether one is using an ML tool correctly or whether the underlying mechanisms are behaving normally (i.e. it is harder to get feedback during the process), so a designated dataset can be useful as a control. However, in the process of learning/doing viz, you get feedback immediately (you will also most likely take advantage of the Vega-Lite editor to explore APIs, because I assume it's very rare for someone to be familiar with Altair but not Vega-Lite), again making a designated dataset less useful.

palewire · 2018-05-03T17:30:48Z

This is of course a debatable thing, without a clear answer.

I am of the opinion that handlebars for newbies should be a top design priority. Especially for a package like this that aims to be used by groups of programmers with less experience and expertise. The journalists I frequently train are often writing Python for the first time when they open a Jupyter Notebook in a class like First Python Notebook.

In my opinion, this package has the potential to break through and draw thousands of people into Python. But to do that, it needs to do more than convert matplotlib experts and Python developers. It also needs to draw in people who are writing code for the very first time. I see this idea as one of many steps that can reduce technical and conceptual hurdles.

"What is Vega?" is a question a newbie likely will need to answer to read even the most basic examples right now. So is "Why are there two different ways to import things?" I'm sure those concepts are obvious to everyone reading this thread, but they are not apparent to the beginner and can be enough to stop someone from adopting the tool. I've seen it time and time again.

For that reason, I think that the trade off is worth the price.

There also might be some way to modify vega_datasets to reduce its file size and lower the burden it brings.

arokem · 2018-05-03T17:58:24Z

For the time being, I would recommend at least mentioning this in the documentation on the front page of the project. The example on that page does not currently work out of the box without an additional pip install vega_datasets.

jakevdp · 2018-05-03T18:04:08Z

I really appreciate everyone weighing in here. Overall, I think the key points on either side are:

Pros

fewer imports would lead to an easier learning curve for beginners
even beyond beginners, some datasets (e.g. geoshape data for choropleth maps) would be useful to have more readily available/built-in to the Altair API

Cons

more hard dependencies could add friction for companies hoping to use Altair
marginally larger installation footprint (200KB)
obscuring the existence of the vega_datasets package makes it less clear how to debug issues that arise, particularly for beginners

With those in mind, how about a compromise: make vega_datasets a soft requirement accessible from the Altair namespace. We could add a datasets object at the top level of the module which provides access to everything in vega_datasets (if it is installed) and raises an informative error (if it is not). I'd imagine something like this:

# in altair/__init__.py

class VegaDatasetsUnavailable(object):
    """Placeholder that raises an informative error on attribute access or call."""
    def _raise(self, *args, **kwargs):
        raise ImportError("Using datasets in Altair requires the vega_datasets package. "
                          "See https://github.com/altair-viz/vega_datasets")
    # Accept any arguments so that both datasets.cars and datasets() raise ImportError
    __getattr__ = _raise
    __call__ = _raise

try:
    from vega_datasets import data as datasets
except ImportError:
    datasets = VegaDatasetsUnavailable()

Then the hard dependencies of Altair would not change, but it would make the following available to users who do have vega_datasets installed:

import altair as alt
cars = alt.datasets.cars()
alt.Chart(cars).mark_point()  # etc.

If vega_datasets is installed, this will work as expected. If not, it will give the user an informative error.

We could then adjust all our "getting started" installation instructions to include installation of vega_datasets (as they probably already should).
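To see how this kind of soft dependency behaves when the package is missing, the pattern can be tried outside of Altair. The following is only a sketch under stated assumptions: MissingDependency, optional_attr, and the package name no_such_package_for_demo are hypothetical names invented for the example, and the standard-library json module stands in for a dependency that is installed.

```python
import importlib


class MissingDependency(object):
    """Stand-in object that raises ImportError on attribute access or call."""

    def __init__(self, package):
        self._package = package

    def _fail(self, *args, **kwargs):
        raise ImportError(
            "This feature requires the {0} package; "
            "try `pip install {0}`.".format(self._package)
        )

    # Any unknown attribute lookup, or calling the object itself, raises.
    __getattr__ = _fail
    __call__ = _fail


def optional_attr(module_name, attr_name):
    """Return module_name.attr_name, or a MissingDependency placeholder."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return MissingDependency(module_name)
    return getattr(module, attr_name)


# Installed case: the real attribute is returned unchanged.
dumps = optional_attr("json", "dumps")

# Missing case: the package name is assumed not to exist on this machine.
datasets = optional_attr("no_such_package_for_demo", "data")

try:
    datasets.cars()
except ImportError as err:
    message = str(err)
```

The import of the host package itself never fails; the error only surfaces at the point where the user actually touches the missing feature, and the message tells them what to install.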

jakevdp · 2018-05-03T18:09:07Z

> For the time being, I would recommend at least mentioning this in the documentation on the front page of the project. The example on that page does not currently work out of the box without an additional pip install vega_datasets.

Yeah, I'll fix that. Up until recently, vega_datasets was listed as a hard requirement. Now that it's not, we need to update the installation instructions.

Edit: I fixed this in 6846d79 and pushed a new doc build

sebastianneubauer · 2018-05-03T21:07:15Z

I want to mention this once again, because the documentation is now more complicated than it needs to be:
pip install altair
installs Altair and all hard dependencies.
pip install altair[dev]
installs all hard dependencies plus all dev dependencies, including ipython and vega_datasets: everything that is listed in requirements_dev.txt. We can also add anything there that is needed for interactive work such as tutorials.
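For readers unfamiliar with how an extra such as altair[dev] is declared, here is a hedged sketch of the usual setuptools approach: read a requirements file and pass its entries as extras_require. The helper parse_requirements and the file contents shown are assumptions for illustration; the real requirements_dev.txt and setup.py in Altair may differ.

```python
def parse_requirements(text):
    """Return requirement specifiers from requirements-file text.

    Blank lines and '#' comments are skipped. This simple split would
    mangle URL requirements containing '#', which the sketch ignores.
    """
    requirements = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            requirements.append(line)
    return requirements


# Illustrative contents only; not the actual requirements_dev.txt.
REQUIREMENTS_DEV = """
# documentation and linting tools
sphinx
flake8
# used by examples and tutorials
ipython
vega_datasets
"""

extras_require = {"dev": parse_requirements(REQUIREMENTS_DEV)}

# In setup.py, this dictionary would be passed to setuptools, e.g.:
# setup(name="altair", install_requires=[...], extras_require=extras_require)
```

With a declaration like this, pip install altair installs only the install_requires entries, while pip install altair[dev] additionally installs everything in the "dev" list.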

jakevdp · 2018-05-03T21:09:51Z

@sebastianneubauer – I want to avoid recommending that new users install all the dev dependencies (users don't need jinja, sphinx, m2r, docutils, flake8, etc.)

Additionally, I think it's much simpler to understand what's going on with pip install altair vega_datasets than it is with pip install altair[dev].

sebastianneubauer · 2018-05-03T21:14:44Z

I fully agree; in fact, I also like the explicit approach more. Sometimes people complain about things being "not convenient enough", but then, if things fail, they complain about the complexity buried under the convenience layer and about the magic that is happening below the surface ;-)

jakevdp · 2018-05-08T21:21:01Z

The more I think about it, the more I like the compromise I mentioned above, particularly as I develop the Altair tutorial.

After PyCon, I'm going to look at implementing this unless people have objections (@ellisonbg – I'd love to hear your thoughts)

jakevdp · 2018-05-18T20:38:14Z

I have an implementation of this at #872.

palewire · 2018-07-11T16:55:28Z

I believe this ticket can be closed, per the resolution in #872.

AliciaSchep mentioned this issue May 1, 2018

Importing Vega datasets vegawidget/altair#25

Closed

ijlyttle mentioned this issue May 8, 2018

Keep an eye on vega_datasets being imported into altair vegawidget/altair#51

Closed

jakevdp added this to the 2.1 milestone May 18, 2018

jakevdp mentioned this issue May 18, 2018

ENH: make vega_datasets accessible from alt.datasets #872

Closed

jakevdp closed this as completed Jul 11, 2018
