ENH: Build vega_datasets into altair? #796

Closed
jakevdp opened this issue Apr 30, 2018 · 24 comments

@jakevdp
Collaborator

jakevdp commented Apr 30, 2018

This is something that @palewire mentioned a while ago, and the idea has been growing on me.

Currently, many of our examples are like this:

import altair as alt
from vega_datasets import data

cars = data.cars()
alt.Chart(cars)...

What if we were to import vega_datasets into Altair's namespace by default, so instead it would be

import altair as alt

cars = alt.datasets.cars()
alt.Chart(cars)...

The advantage is fewer imports and less boilerplate.

A minor disadvantage is that vega_datasets would become a hard requirement for Altair (unless we did some kind of lazy import mechanism, which would add complication).
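For reference, the lazy import idea could be sketched roughly as follows. This is only an illustration: the `LazyModule` class is hypothetical and not part of Altair, and the standard-library `json` module stands in for vega_datasets so that the snippet runs anywhere.

```python
import importlib


class LazyModule:
    """Proxy that imports the named module on first attribute access."""

    def __init__(self, name):
        self._name = name
        self._module = None

    def __getattr__(self, attr):
        # Only reached for attributes that normal lookup does not find,
        # so _name and _module never pass through here.
        if self._module is None:
            self._module = importlib.import_module(self._name)
        return getattr(self._module, attr)


# "json" is a stand-in; the real target would be the optional package.
lazy_json = LazyModule("json")
print(lazy_json.dumps({"a": 1}))
```

The import cost, and any ImportError, is deferred until the first attribute access, which is the extra complication mentioned above.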

A more major disadvantage is that it would obscure the fact that vega_datasets is a separate package, rather than a part of Altair, which might confuse people.

Thoughts?

@palewire
Contributor

I like it!

@pagpires
Contributor

I don't like it... Personally I consider altair a plotting tool (like matplotlib/ggplot2), so it's better to separate the tool from the data, especially considering that vega_datasets is mostly for tutorials and the user should be able to get the idea of how to use altair in 2-3 days. After that, users should be good to go and never use vega_datasets again.

@ellisonbg
Collaborator

ellisonbg commented May 1, 2018 via email

@palewire
Contributor

palewire commented May 1, 2018

IMHO, this is a step that has a big benefit for newbies (Easy to read, copy and understand examples), and little cost to experts (A slightly larger package).

My view is that any cheap way Altair can welcome in new people is worth the price.

@mattijn
Contributor

mattijn commented May 1, 2018

I'm +1. sklearn also includes some datasets, like:

from sklearn import datasets
iris = datasets.load_iris()

It's convenient for toy examples.

I'm +0 if it increases the size of the package tenfold (as there are many datasets).

@mattijn
Contributor

mattijn commented May 1, 2018

Not sure what happens under the hood in sklearn, but maybe it downloads a dataset the first time that dataset is loaded. That's also an interesting approach.

@jakevdp
Collaborator Author

jakevdp commented May 1, 2018

sklearn downloads the data the first time it is accessed and caches it locally in $HOME/scikit_learn_data/

vega_datasets includes a few smaller datasets within the installation, but downloads the rest on demand without any local caching.

The current vega_datasets release, including these bundled datasets, is about 200KB.
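To make the difference concrete, a download-once-then-cache loader can be sketched as below. This is not how sklearn or vega_datasets implement it; `load_cached`, `fake_fetch`, and the file name are all made up for illustration, and a temporary directory stands in for a cache folder in the user's home directory.

```python
import os
import tempfile


def load_cached(name, fetch, cache_dir):
    """Return the bytes for `name`, calling `fetch` only if no cached copy exists."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, name)
    if not os.path.exists(path):
        # First access: fetch the data and store it locally.
        with open(path, "wb") as f:
            f.write(fetch(name))
    with open(path, "rb") as f:
        return f.read()


calls = []


def fake_fetch(name):
    # Stand-in for a network download; records each call so caching is visible.
    calls.append(name)
    return b"a,b\n1,2\n"


with tempfile.TemporaryDirectory() as cache_dir:
    first = load_cached("toy.csv", fake_fetch, cache_dir)
    second = load_cached("toy.csv", fake_fetch, cache_dir)

print(len(calls))
```

Without the cache check, every call would go back to the network, which is the on-demand behaviour described for the non-bundled vega_datasets files.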

@mattijn
Contributor

mattijn commented May 1, 2018

OK, then I'm +1. Since altair is around 250 KB, it will almost double in size, but in comparison with matplotlib (~8.5 MB) it's still super lightweight.

@jakevdp
Collaborator Author

jakevdp commented May 1, 2018

Just to be clear: I wouldn't suggest actually bundling vega_datasets into altair; rather, I'm suggesting making it a hard dependency and importing it by default in Altair's namespace.

@mattijn
Contributor

mattijn commented May 1, 2018

I just noticed that you created the vega_datasets package under the altair-viz repository. Because of the name, I thought it came from the vega repository, like ipyvega.
Still +1 though.

@jakevdp
Collaborator Author

jakevdp commented May 1, 2018

One place that it would make sense to have tighter integration, even beyond intro tutorials, is in the geographic datasets used for map backgrounds.

@palewire
Contributor

palewire commented May 2, 2018

If you've decided you want this I'd be happy to help with the implementation.

@sebastianneubauer
Contributor

sebastianneubauer commented May 3, 2018

I would object to this for the following reasons:

  • Some companies/organizations have very strict rules for software in production; each and every hard dependency increases the probability that altair cannot be used.
  • Each library should try to minimize the number of hard dependencies. Doing so increases security and, last but not least, saves a lot of time and resources when the package is installed millions of times. The beginner tutorials will only be relevant in a tiny fraction of the installations.
  • Being able to install this requirement through the very convenient extra requirements, pip install altair[dev], gives nearly the same convenience for beginners (just document in the beginner tutorials that it should be installed with the dev requirements) while avoiding the hard dependency, with the advantages pointed out above. I fully agree it should be super easy to get started, but I think extra dev requirements are a very good practice to reach that goal. Furthermore, (in line with the zen of python) needing an explicit import statement when an external package is used is better than an implicit hard dependency covering up the fact that it is an external package (e.g. if you need to look up the code on github, because the CIO requests proof of the licence of each and every data set used ;-) ).

@pagpires
Contributor

pagpires commented May 3, 2018

Since the main developers of vega/vega-lite/altair are also teaching in academia, maybe someone has a sense of how long students need vega_datasets to learn altair, and how frequently they have to come back to it (to refamiliarize themselves) after they learn this package? I assume it takes students a very short time to learn and very few will come back, but I could be wrong.

Also, the case of altair is a bit different from sklearn because of 3 factors:

  1. Just as mentioned above, the data/package ratio is much smaller for sklearn (minor factor);
  2. Sklearn is a very versatile tool and it's more preferable to have a consistent/well-crafted dataset to test out all the combinations of APIs (and for users to re-get familiar with them), while altair has a smaller and (thus?) clearer structure of APIs.
  3. It's usually harder to know whether one is using a ML tool correctly or whether the underlying mechanisms are behaving normally (i.e. harder to get feedback during the process), thus a designated dataset can be useful as the control. However, in the process of learning/doing viz, you get feedback immediately (you will also most likely take advantage of the vega-lite editor to explore APIs, because I assume it's very rare for someone to be familiar with altair but not vega-lite), again making a designated dataset less useful.

@palewire
Contributor

palewire commented May 3, 2018

This is of course a debatable thing, without a clear answer.

I am of the opinion that handlebars for newbies should be a top design priority. Especially for a package like this that aims to be used by groups of programmers with less experience and expertise. The journalists I frequently train are often writing Python for the first time when they open a Jupyter Notebook in a class like First Python Notebook.

In my opinion, this package has the potential to break through and draw thousands of people into Python. But to do that it needs to not just convert matplotlib experts and Python developers. It also needs to draw in people who are writing code for the very first time. I see this idea as one of many steps that can reduce technical and conceptual hurdles.

"What is Vega?" is a question a newbie likely will need to answer to read even the most basic examples right now. So is "Why are there two different ways to import things?" I'm sure those concepts are obvious to everyone reading this thread, but they are not apparent to the beginner and can be enough to stop someone from adopting the tool. I've seen it time and time again.

For that reason, I think that the trade-off is worth the price.

There also might be some way to modify vega_datasets to reduce its file size and lower the burden it brings.

@arokem

arokem commented May 3, 2018

For the time being, I would recommend to at least mention this in the documentation on the front page of the project. The example on that page does not currently work out of the box without an additional pip install vega_datasets

@jakevdp
Collaborator Author

jakevdp commented May 3, 2018

I really appreciate everyone weighing in here. Overall, I think the key points on either side are:

Pros

  • fewer imports would lead to an easier learning curve for beginners
  • even beyond beginners, some datasets (e.g. geoshape data for choropleth maps) would be useful to have more readily available/built-in to the Altair API

Cons

  • more hard dependencies could add friction for companies hoping to use Altair
  • marginally larger installation footprint (200KB)
  • obscuring the existence of the vega_datasets package makes it less clear how to debug issues that arise, particularly for beginners

With those in mind, how about a compromise: make vega_datasets a soft requirement accessible from the Altair namespace. We could add a datasets object at the top level of the module which provides access to everything in vega_datasets (if it is installed) and raises an informative error (if it is not). I'd imagine something like this:

# in altair/__init__.py

class VegaDatasetsUnavailable(object):
    def __getattr__(self, attr):
        raise ImportError("To use datasets in Altair requires installing the vega_datasets package: "
                          "See https://github.com/altair-viz/vega_datasets")
    __call__ = __getattr__

try:
    from vega_datasets import data as datasets
except ImportError:
    datasets = VegaDatasetsUnavailable()

Then the hard dependencies of Altair would not change, but it would make the following available to users who do have vega_datasets installed:

import altair as alt
cars = alt.datasets.cars()
alt.Chart(cars).mark_point()  # etc.

If vega_datasets is installed, this will work as expected. If not, it will give the user an informative error.
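The fallback path can be tried without touching an Altair installation. The following is a generic variant of the sketch above, not the code that ended up in Altair: `MissingPackage` and `optional_import` are hypothetical helpers, `__call__` is written to accept any arguments, and `no_such_datasets_pkg` is a deliberately nonexistent module name so that the except branch always runs.

```python
import importlib


class MissingPackage:
    """Placeholder that raises an informative ImportError on any use."""

    def __init__(self, package, url):
        self._package = package
        self._url = url

    def _fail(self):
        raise ImportError(
            "This feature requires the {} package. See {}".format(
                self._package, self._url
            )
        )

    def __getattr__(self, attr):
        # Reached only for attributes not set in __init__.
        self._fail()

    def __call__(self, *args, **kwargs):
        self._fail()


def optional_import(module_name, attr, url):
    """Return module_name.attr, or a MissingPackage placeholder if the import fails."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return MissingPackage(module_name, url)
    return getattr(module, attr)


# Made-up module name and URL, chosen so the import is guaranteed to fail.
datasets = optional_import("no_such_datasets_pkg", "data", "https://example.invalid")

try:
    datasets.cars()
except ImportError as exc:
    message = str(exc)
    print(message)
```

The error surfaces only when the placeholder is actually used, so importing the parent package stays safe for users who never touch the datasets.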

We could then adjust all our "getting started" installation instructions to include installation of vega_datasets (as they probably already should).

@jakevdp
Collaborator Author

jakevdp commented May 3, 2018

> For the time being, I would recommend to at least mention this in the documentation on the front page of the project. The example on that page does not currently work out of the box without an additional pip install vega_datasets

Yeah, I'll fix that. Up until recently, vega_datasets was listed as a hard requirement. Now that it's not, we need to update the installation instructions.

Edit: I fixed this in 6846d79 and pushed a new doc build

@sebastianneubauer
Contributor

sebastianneubauer commented May 3, 2018

I want to mention once again, because the documentation now is more complicated than it could be:
pip install altair
installs altair and all hard dependencies
pip install altair[dev]
installs all hard + all dev dependencies, including ipython and vega_datasets, i.e. everything that is listed in requirements_dev.txt. We can also add anything there that is needed for interactive work like tutorials.
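As a rough sketch of how such an extra is typically wired up with setuptools: the snippet below is a hypothetical excerpt, not Altair's actual setup.py, and the `read_requirements` helper and the requirement list are illustrative stand-ins for the contents of requirements_dev.txt.

```python
def read_requirements(lines):
    """Return requirement specifiers, skipping blank lines and # comments."""
    return [
        line.strip()
        for line in lines
        if line.strip() and not line.strip().startswith("#")
    ]


# Stand-in for the contents of requirements_dev.txt; entries are examples only.
dev_requirements_txt = """\
# documentation and linting
sphinx
flake8

# interactive work
ipython
vega_datasets
"""

extras_require = {"dev": read_requirements(dev_requirements_txt.splitlines())}

# Passing extras_require=extras_require to setuptools.setup() is what lets
# `pip install altair[dev]` add these packages on top of the hard dependencies.
print(extras_require)
```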

@jakevdp
Collaborator Author

jakevdp commented May 3, 2018

@sebastianneubauer – I want to avoid recommending that new users install all the dev dependencies (users don't need jinja, sphinx, m2r, docutils, flake8, etc.)

Additionally, I think it's much simpler to understand what's going on with pip install altair vega_datasets than it is with pip install altair[dev].

@sebastianneubauer
Contributor

I fully agree; in fact, I also prefer the explicit approach. Sometimes people complain about things being "not convenient enough", but then if things fail they complain about the complexity buried under the convenience layer, about the magic that is happening below the surface ;-)

@jakevdp
Collaborator Author

jakevdp commented May 8, 2018

The more I think about it, the more I like the compromise I mentioned above, particularly as I develop the Altair tutorial.

After PyCon, I'm going to look at implementing this unless people have objections (@ellisonbg – I'd love to hear your thoughts)

@jakevdp
Collaborator Author

jakevdp commented May 18, 2018

I have an implementation of this at #872.

@palewire
Contributor

I believe this ticket can be closed, per the resolution in #872.

@jakevdp jakevdp closed this as completed Jul 11, 2018