# Intro to Interactive Data Visualization with Altair

by Nicholas Vadivelu (@nicvadivelu, nicholas.vadivelu@gmail.com)


Through this workshop, we'll explore data using `altair`, create simple dashboards, and deploy them for others to use. This was heavily inspired by [Jake VanderPlas's workshop at PyCon 2018](https://www.youtube.com/watch?v=ms29ZPUKxbU). If this workshop piques your interest, definitely check that video out for a longer, more in-depth version.

Let's start by installing `altair`:

In [1]:
!pip install altair



For this workshop, we need `altair` and `pandas`. `pandas` is a data analysis library in Python that provides a `DataFrame` to represent tabular data. Don't worry if you aren't familiar with it--we'll cover what you need as we go.

In [2]:
import altair as alt
import pandas as pd

We'll use a Pokemon dataset to demonstrate the library. This dataset contains data about Pokemon stats, moves, and competitive tiers from Gen VI. Don't worry if you don't understand Pokemon for this workshop!

In [3]:
csv_url = 'https://raw.githubusercontent.com/n2cholas/dsc-workshops/master/Intro_to_Interactive_Data_Viz_with_Altair/pokemon-data-cleaned.csv'
# Turn off na_filter so blanks aren't read as NA
df = pd.read_csv(csv_url, na_filter=False)

## Preliminaries

We're working in an environment called Google Colab, which is a Jupyter notebook environment hosted by Google. It runs in your browser, and you optionally have access to hardware accelerators like GPUs or TPUs.

A Jupyter Notebook is a web-based application that allows you to create documents of live code, visualizations, equations, and markdown text. Its interactive nature makes it great for data analysis. Before moving on, here are some useful tricks:

In [4]:
?pd.DataFrame # using one question mark gives you the function/class signature with the description

In [5]:
??pd.DataFrame # two question marks gives you the actual code for that function

Commands prefaced by “%” or “%%” are called magic commands. You can read more about them [here](https://ipython.readthedocs.io/en/stable/interactive/magics.html).

## Preprocessing

We'll start by processing the data a bit to make it easier to work with. Don't worry about these steps for this workshop. If you're curious, check out [this blog post](https://nicholasvadivelu.com/2019/09/27/intro-to-pandas/) on using `pandas` to clean up this dataset. 

On a high level, we are consolidating one of the categorical features so that it is easier to visualize.

In [6]:
# Narrow scope of data
df.loc[df['Tier'] == 'OUBL','Tier'] = 'Uber'
df.loc[df['Tier'] == 'UUBL','Tier'] = 'OU'
df.loc[df['Tier'] == 'RUBL','Tier'] = 'UU'
df.loc[df['Tier'] == 'NUBL','Tier'] = 'RU'
df.loc[df['Tier'] == 'PUBL','Tier'] = 'NU'
df = df[df['Tier'].isin(['Uber', 'OU', 'UU', 'NU', 'RU', 'PU'])]
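
For the curious, the same consolidation can be sketched with a single mapping instead of one `df.loc` assignment per tier. This is only an alternative phrasing, not the approach the workshop depends on; the miniature `tiers` table and the `LC` value are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of the Tier column (not the real dataset)
tiers = pd.DataFrame({'Tier': ['OUBL', 'OU', 'PUBL', 'LC', 'Uber']})

# Each borderline ("BL") tier is folded into the tier above it
borderline_to_parent = {
    'OUBL': 'Uber', 'UUBL': 'OU', 'RUBL': 'UU', 'NUBL': 'RU', 'PUBL': 'NU',
}
tiers['Tier'] = tiers['Tier'].replace(borderline_to_parent)

# Keep only the six main tiers; .copy() gives an independent DataFrame
main_tiers = ['Uber', 'OU', 'UU', 'RU', 'NU', 'PU']
tiers = tiers[tiers['Tier'].isin(main_tiers)].copy()

consolidated = tiers['Tier'].tolist()
print(consolidated)
```

Keeping the mapping in one dictionary makes it easy to see at a glance which tiers are merged.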

## Visualization

Before visualizing data, make sure your data is in a standard format:

In [7]:
df.sample()

Unnamed: 0,Name,Tier,Num Types,Type 1,Type 2,Num Abilities,Ability 1,Ability 2,Ability 3,Has Negative Ability,HP,Attack,Defense,Special Attack,Special Defense,Speed,Base Stat Total,Next Evolution(s),Evolutionary Stage,Num Evolutionary Stages,Evolutionary Progress,Is Mega Evolution,Is Alternate Form,Num Moves,Moves,Defensive Boost Moves,Offensive Boost Moves,Max Defensive Boost Amount,Max Offensive Boost Amount,Recovery Moves,Priority STAB Attacks,Entry Hazards,Hazard Clearing Moves,Phazing Moves,Switch Attacks,High Prob Side FX Attacks,Constant Damage Attacks,Trapping Moves
84,Bruxish,RU,2,Psychic,Water,3,Dazzling,Strong Jaw,Wonder Skin,0,68,105,70,70,70,92,475,[],1,1,1.0,0,0,54,"{'Toxic', 'Light Screen', 'Blizzard', 'Screech...","{'Bulk Up', 'Calm Mind'}","{'Bulk Up', 'Swords Dance', 'Calm Mind'}",1,2,set(),{'Aqua Jet'},set(),set(),set(),set(),set(),set(),set()


Each row should be an observation (here, a Pokemon) and each column should be a feature (i.e. a property of the observation). Features can be continuous variables, a category, etc. 

We'll remove some columns we don't need for this workshop to make this notebook more lightweight:

In [8]:
df.drop(['Num Types', 'Type 1', 'Type 2', 'Num Abilities',
        'Ability 1', 'Ability 2', 'Ability 3', 'Has Negative Ability', 'HP',
        'Attack', 'Defense', 'Special Attack', 'Special Defense', 'Speed',
        'Next Evolution(s)', 'Evolutionary Stage',
        'Num Evolutionary Stages', 'Evolutionary Progress', 'Is Mega Evolution',
        'Is Alternate Form', 'Moves', 'Defensive Boost Moves',
        'Offensive Boost Moves', 'Max Defensive Boost Amount',
        'Max Offensive Boost Amount', 'Recovery Moves', 'Priority STAB Attacks',
        'Entry Hazards', 'Hazard Clearing Moves', 'Phazing Moves',
        'Switch Attacks', 'High Prob Side FX Attacks',
        'Constant Damage Attacks', 'Trapping Moves'], axis=1, inplace=True)

df.sample()

|     | Name   | Tier | Base Stat Total | Num Moves |
|-----|--------|------|-----------------|-----------|
| 193 | Drampa | PU   | 485             | 65        |



Through this workshop, we'll be working towards creating a plot like this: 

In [9]:
multi = alt.selection_multi(fields=['Tier'], empty='all')
interval = alt.selection_interval(encodings=['y'])

scatter = alt.Chart(df).mark_point().encode(
    x='Base Stat Total',
    y='Num Moves',
    color=alt.condition(multi & interval, 'Tier', alt.value('lightgray'))
).properties(
    selection=interval
)

bar = alt.Chart(df).mark_bar().encode(
    x='average(Base Stat Total)',
    y='Tier',
    color=alt.condition(multi, 'Tier', alt.value('lightgray'))
).properties(
    selection=multi
).transform_filter(
    interval
)

plot = scatter & bar
plot

Above, we have a scatter plot showing the relationship between the base stat total of a Pokemon (an indicator of its strength) and the number of moves it can learn. Below it, we have the average base stat total by tier. We can select a y-interval on the scatter plot, which controls which data points are used to compute the averages in the bar plot. We can click individual bars (or shift-click multiple bars) to control which points are highlighted in the scatter plot.

Let's start with something simpler, which is just the scatter plot without interaction:

In [10]:
alt.Chart(df).mark_point().encode(
    x='Base Stat Total',
    y='Num Moves',
    color='Tier',
    tooltip='Name'
).properties(
    title='Pokemon Data'
).interactive()

Let's break down how each line works.

Altair produces plots that follow a specification called Vega-Lite, which we'll cover in more detail later. Essentially, Altair creates and validates a JSON specification that describes how a plot should be constructed. This specification is used by a JavaScript front-end to render our plot.

In [11]:
alt.Chart(df)

SchemaValidationError: ignored

alt.Chart(...)

We get a `SchemaValidationError` because the chart needs a mark type, which determines how our data is drawn. As mentioned, an `alt.Chart` both creates and verifies the schema.

In our opening example, we saw `mark_point()`, which creates a scatter plot. There are many others, such as `mark_bar()`, `mark_tick()`, and more. You can read about them all [here](https://altair-viz.github.io/user_guide/marks.html).

In [12]:
alt.Chart(df).mark_point(size=10)

We made a scatter plot! There is a point for every row in the dataset, but you can't tell because they're all stacked on top of each other. We also adjusted the size of our points by passing `size=10` to `mark_point()`. This is not useful yet, as it doesn't portray any information. We need to tell Altair how to encode these points to get a real scatter plot.

In [13]:
alt.Chart(df).mark_point().encode(
    x='Base Stat Total'
)

We're getting there: we are now encoding the points in one dimension. To be precise, we told `altair` to encode Base Stat Total along the x-axis.

It's a bit hard to see, so let's change the type of mark to see things more clearly:

In [14]:
alt.Chart(df).mark_tick().encode(
    x='Base Stat Total'
)

But we digress; we wanted a scatter plot. Let's encode the number of moves on the y-axis too.

In [15]:
alt.Chart(df).mark_point().encode(
    x='Base Stat Total',
    y='Num Moves',
)

Let's work towards making a histogram of `Base Stat Total`, starting with the scatter plot we already made. If you're used to other plotting frameworks, this may seem strange to you--scatter plots and histograms are completely different. But Altair gives us a declarative visualization grammar to build plots. We can use basic building blocks to make a wide variety of plots. Don't worry if that didn't make sense--let's see it in action.

First, let's count the number of Pokemon that take on each value of `Base Stat Total`:

In [16]:
alt.Chart(df).mark_point().encode(
    x='Base Stat Total',
    y='count()',
)

We used the string `count()` to indicate that we want to count the number of points. If we didn't encode Base Stat Total on the x-axis, we would just count the total number of points, as shown below:

In [17]:
alt.Chart(df).mark_point().encode(
    y='count()',
)

Going back to the plot of counts per `Base Stat Total`, it's pretty interesting that so many Pokemon have a `Base Stat Total` of 600. That plot is not so useful for understanding other trends, though, since the base stat totals take on too many distinct values. Let's bin the x-axis:

In [18]:
alt.Chart(df).mark_point().encode(
    x=alt.X('Base Stat Total', bin=True),
    y='count()',
)
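
As a rough mental model of what binning plus `count()` computes, here is the same idea in plain pandas. The values and bin edges are hand-picked for illustration; Altair chooses its own edges when you pass `bin=True`, so the numbers will not match the chart.

```python
import pandas as pd

# Hypothetical Base Stat Totals, not taken from the real dataset
bst = pd.Series([405, 430, 485, 490, 600, 600, 630], name='Base Stat Total')

# Hand-picked bin edges of width 100 (an assumption, not Altair's choice)
edges = [400, 500, 600, 700]
binned = pd.cut(bst, bins=edges, right=False)  # [400, 500), [500, 600), [600, 700)

# Counting rows per bin mirrors y='count()' evaluated for each binned x value
counts = binned.value_counts(sort=False)
print(counts)
```

Each row lands in exactly one bin, so the counts always add up to the number of rows.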

Points are not the standard choice for understanding the distribution of a variable, so let's use bars:

In [19]:
alt.Chart(df).mark_bar().encode(
    x=alt.X('Base Stat Total', bin=True),
    y='count()',
)

Ta-da, a histogram! You can typically rely on Altair to pick good defaults, but if you want more fine-tuned control, you can always tweak things. For example, let's increase the number of bins:

In [20]:
alt.Chart(df).mark_bar().encode(
    x=alt.X('Base Stat Total', bin=alt.Bin(maxbins=30)),
    y='count()',
    color='Tier',
)

Nice, we have a pretty handsome histogram! Using Altair's expressive API, we don't need to worry about specific calls to make histograms or scatter plots: we just need to remember the basic building blocks.

We also added the color to show the distribution by Tier, but this is not so clear. Let's try to make a heatmap to illustrate this instead:

In [21]:
alt.Chart(df).mark_bar().encode(
    x=alt.X('Base Stat Total', bin=alt.Bin(maxbins=30)),
    color='count()',
    y='Tier',
)

Just by switching `y` and `color`, we were able to make a heatmap using the expressive building blocks Altair provides. Notice that Altair chooses an appropriate colour scale for this continuous variable (whereas before it was choosing a scale for discrete categories).

But this still isn't great for understanding distributions, so let's split this into multiple histograms by tier:

In [22]:
alt.Chart(df).mark_bar().encode(
    x=alt.X('Base Stat Total', bin=alt.Bin(maxbins=30)),
    y='count()',
    color='Tier',
    column='Tier',
)

Great! Now we can clearly see the distributions of Base Stat Totals by tier. 

Let's move onto a different sort of plot. What if we're interested in the average BST by tier?

In [23]:
alt.Chart(df).mark_bar().encode(
    x='average(Base Stat Total)',
    y='Tier',
    color='Tier'
)

`'count()'` and `'average()'` are examples of data aggregations. Since we're taking the average by tier, we can think of it as a split-apply-combine operation, illustrated below:

![](https://blog.dask.org/images/split-apply-combine.png)

We split by some key (`x` in the diagram, `Tier` in our example), apply some aggregation (e.g. `average`), then combine the data back together.

In pandas, you would do this as follows:


In [24]:
df.groupby('Tier')['Base Stat Total'].mean()

Tier
NU      495.132353
OU      565.896104
PU      464.165919
RU      524.486111
UU      538.181818
Uber    657.042553
Name: Base Stat Total, dtype: float64
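
To make the three steps explicit, here is a sketch that performs the split, apply, and combine stages by hand and compares the result with the `groupby` one-liner. The four-row `toy_df` is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; the values are illustrative only
toy_df = pd.DataFrame({
    'Tier': ['OU', 'PU', 'OU', 'PU'],
    'Base Stat Total': [600, 400, 500, 450],
})

# Split: one sub-DataFrame per distinct Tier
groups = {tier: rows for tier, rows in toy_df.groupby('Tier')}

# Apply: aggregate each piece independently
means = {tier: rows['Base Stat Total'].mean() for tier, rows in groups.items()}

# Combine: reassemble the per-group results into a single Series
manual = pd.Series(means, name='Base Stat Total').sort_index()

# The one-liner does all three steps at once
shortcut = toy_df.groupby('Tier')['Base Stat Total'].mean()
print(manual)
```

In practice you would always use the one-liner; the manual version is only there to show what it does under the hood.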

Before moving onto interaction, let's briefly discuss datatypes.

|Data Type    | Shorthand Code | Description                     |
|-------------|----------------|---------------------------------|
|quantitative | Q              |a continuous real-valued quantity|
|ordinal      | O              |a discrete ordered quantity      |
|nominal      | N              |a discrete unordered category    |
|temporal     | T              |a time or date value             |
|geojson      | G              |a geographic shape               |

These are typically inferred from your pandas DataFrame. When you don't use a DataFrame, however, you need to specify them. The type helps Altair choose the appropriate encoding and scales. For example, if you use a color scale for a quantitative variable, you likely want a smooth gradient. For a nominal variable, you likely want discrete colours. For an ordinal variable, you probably want discrete colours that increase in intensity to represent the order. You can read more about them [here](https://altair-viz.github.io/user_guide/encoding.html#encoding-data-types).

I'd recommend always explicitly specifying them, as shown below. 

In [25]:
alt.Chart(csv_url).mark_point().encode(
    x='Base Stat Total:Q',
    y='Num Moves:Q',
    color='Tier:N',
    tooltip='Name:N'
)

Since we specified the data types, we were able to consume the CSV straight from the web. Without the data types, we get an error:

In [26]:
alt.Chart(csv_url).mark_point().encode(
    x='Base Stat Total',
    y='Num Moves',
    color='Tier',
    tooltip='Name'
)

ValueError: ignored

alt.Chart(...)

For the sake of this (live-coded) workshop, we will skip out on specifying the data type since they are inferred in our case.

## Interaction

So far, we've only seen basic panning and zooming within plots, but we can build far more interesting interactions. Let's start with an interval selection.

In [27]:
interval = alt.selection_interval()

alt.Chart(df).mark_point().encode(
    x='Base Stat Total',
    y='Num Moves',
    color='Tier'
).properties(
    selection=interval
)

Now we can make selection rectangles in our plot. Let's make it highlight the selected points:

In [28]:
interval = alt.selection_interval()

alt.Chart(df).mark_point().encode(
    x='Base Stat Total',
    y='Num Moves',
    color=alt.condition(interval, 'Tier', alt.value('lightgray'))
).properties(
    selection=interval
)

For all the types of selections, you can use `alt.condition` to control various aspects of the plot. The first argument is the predicate (here, the selection object), the second argument is the value the encoded data assumes when the condition is true (i.e. when selected), the third is the value they assume when not selected.

When you make a selection, this gives the JavaScript a signal about what points are inside vs. outside the selection, which drives actions such as `alt.condition`.

Since the selections act on the data points themselves, we can tie together data on multiple plots:

In [29]:
interval = alt.selection_interval(encodings=['y'])

scatter = alt.Chart(df).mark_point().encode(
    x='Base Stat Total',
    y='Num Moves',
    color=alt.condition(interval, 'Tier', alt.value('lightgray'))
).properties(
    selection=interval
)

bar = alt.Chart(df).mark_bar().encode(
    x='average(Base Stat Total)',
    y='Tier',
    color='Tier'
).transform_filter(
    interval
)

scatter & bar  # vertically concatenates plots
# scatter | bar  # horizontally concatenates plots

We did a bunch of things there, so let's break it down. First, we constrained our `selection_interval` to only operate on the y encoding, so we can't make freeform boxes. Next, we created a bar chart like above, with a `transform_filter` to only consider points selected by the interval. To be clear, the `transform_filter` takes in a predicate, which in this case is whether or not a point is in the selection (represented by the selection object). Finally, we used `scatter & bar` to vertically concatenate both plots (we could also use `alt.vconcat(scatter, bar)`).

The result is a plot of the average base stat total by tier, where we can filter the data being averaged by the number of moves each Pokemon learns.

Let's try using a multiselect to control what's shown on the scatterplot with the barplot.

In [30]:
multi = alt.selection_multi(fields=['Tier'])

scatter = alt.Chart(df).mark_point().encode(
    x='Base Stat Total',
    y='Num Moves',
    color=alt.condition(multi, 'Tier', alt.value('lightgray'))
)

bar = alt.Chart(df).mark_bar().encode(
    x='average(Base Stat Total)',
    y='Tier',
    color=alt.condition(multi, 'Tier', alt.value('lightgray'))
).properties(
    selection=multi
)

scatter & bar

Now, our selection is on the Tier in the bar plot, and we're altering the colour of the scatter plot depending on what's selected. Only the Pokemon belonging to the selected bars will be highlighted in our plot. If you hold shift, you can select multiple bars.

Let's add two-way interaction and finalize our plot:

In [31]:
multi = alt.selection_multi(fields=['Tier'], empty='all')
interval = alt.selection_interval(encodings=['y'])

scatter = alt.Chart(df).mark_point().encode(
    x='Base Stat Total',
    y='Num Moves',
    color=alt.condition(multi & interval, 'Tier', alt.value('lightgray'))
).properties(
    selection=interval
)

bar = alt.Chart(df).mark_bar().encode(
    x='average(Base Stat Total)',
    y='Tier',
    color=alt.condition(multi, 'Tier', alt.value('lightgray'))
).properties(
    selection=multi
).transform_filter(
    interval
)

plot = scatter & bar
plot

Now we've combined the multi-selection and the interval selection in our plots.

We only looked at one way of using interactions. Read more about the types of selections, selection defaults, how to trigger those selections, and what you can do with those selections [here](https://altair-viz.github.io/user_guide/interactions.html#selection-types-interval-single-multi).

## Deploying Your Visualization

Altair creates a Vega-Lite specification. Let's take a small subset of our data to understand exactly what that is:

In [32]:
small_df = df.sample(2)
small_df

|     | Name          | Tier | Base Stat Total | Num Moves |
|-----|---------------|------|-----------------|-----------|
| 600 | Plusle        | PU   | 405             | 76        |
| 71  | Blaziken-Mega | Uber | 630             | 107       |


We take a random sample of two points in our dataset.

In [33]:
multi = alt.selection_multi(fields=['Tier'], empty='all')
interval = alt.selection_interval(encodings=['y'])

scatter = alt.Chart(small_df).mark_point().encode(
    x='Base Stat Total',
    y='Num Moves',
    color=alt.condition(multi & interval, 'Tier', alt.value('lightgray'))
).properties(
    selection=interval
)

print(scatter.to_json())

{
  "$schema": "https://vega.github.io/schema/vega-lite/v4.8.1.json",
  "config": {
    "view": {
      "continuousHeight": 300,
      "continuousWidth": 400
    }
  },
  "data": {
    "name": "data-5ff54935db2d127b1e7ec247ebc34012"
  },
  "datasets": {
    "data-5ff54935db2d127b1e7ec247ebc34012": [
      {
        "Base Stat Total": 405,
        "Name": "Plusle",
        "Num Moves": 76,
        "Tier": "PU"
      },
      {
        "Base Stat Total": 630,
        "Name": "Blaziken-Mega",
        "Num Moves": 107,
        "Tier": "Uber"
      }
    ]
  },
  "encoding": {
    "color": {
      "condition": {
        "field": "Tier",
        "selection": {
          "and": [
            "selector010",
            "selector011"
          ]
        },
        "type": "nominal"
      },
      "value": "lightgray"
    },
    "x": {
      "field": "Base Stat Total",
      "type": "quantitative"
    },
    "y": {
      "field": "Num Moves",
      "type": "quantitative"
    }
  },
  "mark": "po

Each plot has a JSON representation containing information about the plot as well as every data point. For this reason, `altair` doesn't work well with huge datasets--the package recommends using datasets with under 5000 points. There are ways to visualize larger datasets using `altair`, described in the [docs](https://altair-viz.github.io/user_guide/faq.html#why-does-altair-lead-to-such-extremely-large-notebooks).

The Vega-Lite specification is becoming a standard on the web (e.g. you can upload a Vega-Lite JSON to Wikipedia for interactive plots!). This Vega-Lite specification is compiled to a more complex Vega representation, which is then rendered in the browser with JavaScript (Vega builds on D3.js).

This makes Altair plots easy to embed in web pages, independent of Python.

In [34]:
print(plot.to_html())

<!DOCTYPE html>
<html>
<head>
  <style>
    .error {
        color: red;
    }
  </style>
  <script type="text/javascript" src="https://cdn.jsdelivr.net/npm//vega@5"></script>
  <script type="text/javascript" src="https://cdn.jsdelivr.net/npm//vega-lite@4.8.1"></script>
  <script type="text/javascript" src="https://cdn.jsdelivr.net/npm//vega-embed@6"></script>
</head>
<body>
  <div id="vis"></div>
  <script>
    (function(vegaEmbed) {
      var spec = {"config": {"view": {"continuousWidth": 400, "continuousHeight": 300}}, "vconcat": [{"mark": "point", "encoding": {"color": {"condition": {"type": "nominal", "field": "Tier", "selection": {"and": ["selector008", "selector009"]}}, "value": "lightgray"}, "x": {"type": "quantitative", "field": "Base Stat Total"}, "y": {"type": "quantitative", "field": "Num Moves"}}, "selection": {"selector009": {"type": "interval", "encodings": ["y"]}}}, {"mark": "bar", "encoding": {"color": {"condition": {"type": "nominal", "field": "Tier", "selection": "se

Paste that into a text file, save as `file.html`, then open in your browser. Voilà!

## Exercises 

We've only just scratched the surface of what we can do with Altair. In this next portion of the workshop, we'll try to create some other types of visualizations.

For data, you can use the Pokemon data at this URL:

In [35]:
csv_url

'https://raw.githubusercontent.com/n2cholas/dsc-workshops/master/Intro_to_Interactive_Data_Viz_with_Altair/pokemon-data-cleaned.csv'

Or load data from `vega_datasets`, which is a package with datasets well suited for Altair:

In [36]:
!pip install vega_datasets



In [37]:
from vega_datasets import data

In [38]:
cars_df = data.cars()

cars_df.sample()

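|     | Name          | Miles_per_Gallon | Cylinders | Displacement | Horsepower | Weight_in_lbs | Acceleration | Year       | Origin |
|-----|---------------|------------------|-----------|--------------|------------|---------------|--------------|------------|--------|
| 362 | honda prelude | 33.7             | 4         | 107.0        | 75.0       | 2210          | 14.4         | 1982-01-01 | Japan  |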


You can find a full list of datasets [here](https://github.com/vega/vega-datasets/blob/master/SOURCES.md). Check out the Altair [example gallery](https://altair-viz.github.io/gallery/index.html) for inspiration, or Altair's [tutorial notebooks](https://github.com/altair-viz/altair_notebooks) for more in-depth explanations.

## Conclusion & Next Steps

We covered the basics of building interactive visualizations in Altair. For a more in-depth treatment of everything we covered today, check out [Jake VanderPlas's workshop at PyCon 2018](https://www.youtube.com/watch?v=ms29ZPUKxbU). As linked before, the [example gallery](https://altair-viz.github.io/gallery/index.html) is a great place to explore what's possible with Altair. And of course, the best way to learn is by doing, so enjoy!