# Part 1 - Data Types, Graphical Marks, and Visual Encoding Channels

- **Objective**: Learn the basic syntax of Vega-Altair, understand how to encode data into visual properties, and customize chart appearances.
- **Key Topics**:
  - Basic Altair syntax
  - Encoding data dimensions
  - Customizing chart appearance

## Imports

In [1]:
import altair as alt
import pandas as pd
from vega_datasets import data

print("The installed Vega-Altair version is " + alt.__version__)

The installed Vega-Altair version is 5.3.0


## Preview: Encoding the same data in different charts

To produce visualizations using Vega-Altair (or Seaborn, or Plotly Express, or ggplot2 in R...) we provide:

* Data
* A choice of mark type (lines or bars or ...)
* A choice of how to *encode* values from the data into visual properties of the chart.

Consider the following five Altair charts, which all start with the same little 4-row by 2-column pandas DataFrame.  Can you recognize how these charts might be built from this data?

In [2]:
source = pd.DataFrame({"category": [1, 2, 3, 4], "value": [2, 2, 10, 4]})

In [3]:
source

|   | category | value |
|---|----------|-------|
| 0 | 1        | 2     |
| 1 | 2        | 2     |
| 2 | 3        | 10    |
| 3 | 4        | 4     |


#### A line chart
<details>
  <summary>Show code</summary>

  ```python
alt.Chart(source).mark_line(stroke='black').encode(
    alt.Y("value").axis(None),
    alt.X("category").axis(None)
)
  ```
</details>

![Line Chart](../resources/images/part1/c_line.png)

#### A bar chart
<details>
  <summary>Show code</summary>

  ```python
alt.Chart(source).mark_bar().encode(
    alt.X("value").axis(None),
    alt.Y("category:O").axis(None)
)
  ```
</details>

![Bar Chart](../resources/images/part1/c_bar.png)

#### Another bar chart
<details>
  <summary>Show code</summary>

  ```python
alt.Chart(source).mark_bar().encode(
    alt.X("value").axis(None),
    alt.Color("category").legend(None),
)
  ```
</details>

![Another Bar Chart](../resources/images/part1/c_bar2.png)

#### A pie/arc chart
<details>
  <summary>Show code</summary>

  ```python
alt.Chart(source).mark_arc().encode(
    alt.Theta("value"),
    alt.Color("category:N").legend(None)
)
  ```
</details>

![A Pie Chart](../resources/images/part1/c_arc.png)

#### A rect chart
<details>
  <summary>Show code</summary>

  ```python
alt.Chart(source).mark_rect().encode(
    alt.X("category:O").axis(None),
    alt.Y("value:O").axis(None)
)
  ```
</details>

![A Rect Chart](../resources/images/part1/c_rect.png)

Keep the above examples in mind as we build up to these sorts of charts more systematically below.

## Creating a Chart with Vega-Altair

Here is a minimal example of an Altair `Chart`.

**Comment**.  As with many examples in this tutorial, the following will not work with certain earlier versions of Vega-Altair (earlier versions required explicit data to be provided, even if it was not being used in the specification).  We are using Altair version 5.3.0.

To create meaningful visualizations, we need to provide data to Altair and to tell Altair how to encode different features from that data into visual properties of the chart.  We will take our data from the Python `vega_datasets` package.  These datasets come in the form expected by Altair: each row in the dataset corresponds to an observation (or data point) and each column corresponds to a feature (variable, dimension, field).

We'll start with the cars dataset.

In [4]:
df = data.cars()

First let's take a look at the data types of the columns in the pandas DataFrame.

In [5]:
df.dtypes

Name                        object
Miles_per_Gallon           float64
Cylinders                    int64
Displacement               float64
Horsepower                 float64
Weight_in_lbs                int64
Acceleration               float64
Year                datetime64[ns]
Origin                      object
dtype: object

Let's create a chart using the `circle` mark type, and one encoding channel.  Here we will encode the `"Weight_in_lbs"` feature in the visual channel `X`, which specifies the position along the x-axis direction.

**Comment**: In online examples, you will typically see the syntax `x="Weight_in_lbs"` instead of `alt.X("Weight_in_lbs")` for this kind of basic usage, but the latter is necessary for customizing the encoding, so we start with it directly.

Now, we add a second encoding channel, with the `"Miles_per_Gallon"` feature getting encoded in the y-coordinate.

Vega-Altair cannot make three-dimensional charts, so you might think we're done, but there are many more encoding channels that can be used.  One of the most frequently and effectively used is the color channel.  Here we use the `"Origin"` feature for the color encoding.

The `tooltip` channel is a little different, but it is one of the most frequently used channels. Each displayed point typically corresponds to a row in the DataFrame. Using the `tooltip` channel, we can access any fields we want from that row.

For example, by supplying the list of column names `["Name", "Weight_in_lbs", "Miles_per_Gallon"]` to the `tooltip` channel, we can learn the name, weight, and miles-per-gallon of the given car.

Less is more. Even if we encode 'five dimensions' of the data, for example supplying `"Horsepower"` for the size and `"Year"` for the opacity, the resulting chart does not readily convey much additional information.

Let's go back to our tiny chart that did not encode the `x` or `y` channels, but this time we will pass the cars DataFrame.  It looks the same as when we did not pass any data, but this time, let's add a tooltip.

Compare the tooltip value to the final rows of our DataFrame. What do you notice?

As another example of the same phenomenon, of marks being displayed "on top of each other", consider a bar chart (so we switch from circle marks to bar marks) using `"Miles_per_Gallon"` for the x-axis encoding.  Here Altair tries to resolve the above issue for us, by stacking the bars.

But here's what it looks like if we manually disable stacking.  We also add a tooltip showing "Name" and "Miles_per_Gallon" to help us see what's going on.

Try hovering right around the extreme ends of the bar (all the way to the left and to the right). Compare the tooltip at the right end with the vehicle that has the maximum fuel efficiency.

In [None]:
df.loc[df["Miles_per_Gallon"].idxmax()]

Why are things different on the left end of the bar?

In [None]:
df.loc[df["Miles_per_Gallon"].idxmin()]

## The effect of data types

Let's go back to our earlier chart, which included a color encoding and a tooltip.

In [1]:
alt.Chart(df).mark_circle().encode(
    alt.X("Weight_in_lbs"),
    alt.Y("Miles_per_Gallon"),
    alt.Color("Origin"),
    alt.Tooltip(["Name", "Weight_in_lbs", "Miles_per_Gallon"]),
)

Notice the drastic change when we switch to the `"Cylinders"` field for the color encoding.

Why is there such a significant difference?  By default, Altair tries to predict how you want the visualization to appear.  Let's return to the earlier chart and consider all the small decisions that are being made.

In [None]:
alt.Chart(df).mark_circle().encode(
    alt.X("Weight_in_lbs"), alt.Y("Miles_per_Gallon"), alt.Color("Origin")
)

Usually, we don't have to consciously think about these decisions, but just to get some appreciation for everything that's going on "under the hood", here is the Vega code corresponding to the above chart.  (I got this by clicking on the three dots to the right of the chart, and going to the Vega Editor, and scrolling down to the "Compiled Vega" section.  I've deleted most of the data, which is written out explicitly in the corresponding Vega specification.)

We will never have to explicitly engage with this Vega code, but it gives a sense for the vast amount of customization that is possible (and that must be somehow "defaulted" by Altair when not explicitly specified by us).
```
{
  "$schema": "https://vega.github.io/schema/vega/v5.json",
  "background": "white",
  "padding": 5,
  "width": 300,
  "height": 300,
  "style": "cell",
  "data": [
    {
      "name": "data-583e73726c1545c56c203344161a975c",
      "values": [
        {
          "Name": "chevrolet chevelle malibu",
          "Miles_per_Gallon": 18,
          "Cylinders": 8,
          "Displacement": 307,
          "Horsepower": 130,
          "Weight_in_lbs": 3504,
          "Acceleration": 12,
          "Year": "1970-01-01T00:00:00",
          "Origin": "USA"
        },
        {
          "Name": "buick skylark 320",
          "Miles_per_Gallon": 15,
          "Cylinders": 8,
          "Displacement": 350,
          "Horsepower": 165,
          "Weight_in_lbs": 3693,
          "Acceleration": 11.5,
          "Year": "1970-01-01T00:00:00",
          "Origin": "USA"
        },
        ...
      ]
    },
    {
      "name": "data_0",
      "source": "data-583e73726c1545c56c203344161a975c",
      "transform": [
        {
          "type": "filter",
          "expr": "isValid(datum[\"Weight_in_lbs\"]) && isFinite(+datum[\"Weight_in_lbs\"]) && isValid(datum[\"Miles_per_Gallon\"]) && isFinite(+datum[\"Miles_per_Gallon\"])"
        }
      ]
    }
  ],
  "marks": [
    {
      "name": "marks",
      "type": "symbol",
      "style": ["circle"],
      "from": {"data": "data_0"},
      "encode": {
        "update": {
          "opacity": {"value": 0.7},
          "fill": {"scale": "color", "field": "Origin"},
          "ariaRoleDescription": {"value": "circle"},
          "description": {
            "signal": "\"Weight_in_lbs: \" + (format(datum[\"Weight_in_lbs\"], \"\")) + \"; Miles_per_Gallon: \" + (format(datum[\"Miles_per_Gallon\"], \"\")) + \"; Origin: \" + (isValid(datum[\"Origin\"]) ? datum[\"Origin\"] : \"\"+datum[\"Origin\"])"
          },
          "x": {"scale": "x", "field": "Weight_in_lbs"},
          "y": {"scale": "y", "field": "Miles_per_Gallon"},
          "shape": {"value": "circle"}
        }
      }
    }
  ],
  "scales": [
    {
      "name": "x",
      "type": "linear",
      "domain": {"data": "data_0", "field": "Weight_in_lbs"},
      "range": [0, {"signal": "width"}],
      "nice": true,
      "zero": true
    },
    {
      "name": "y",
      "type": "linear",
      "domain": {"data": "data_0", "field": "Miles_per_Gallon"},
      "range": [{"signal": "height"}, 0],
      "nice": true,
      "zero": true
    },
    {
      "name": "color",
      "type": "ordinal",
      "domain": {"data": "data_0", "field": "Origin", "sort": true},
      "range": "category"
    }
  ],
  "axes": [
    {
      "scale": "x",
      "orient": "bottom",
      "gridScale": "y",
      "grid": true,
      "tickCount": {"signal": "ceil(width/40)"},
      "domain": false,
      "labels": false,
      "aria": false,
      "maxExtent": 0,
      "minExtent": 0,
      "ticks": false,
      "zindex": 0
    },
    {
      "scale": "y",
      "orient": "left",
      "gridScale": "x",
      "grid": true,
      "tickCount": {"signal": "ceil(height/40)"},
      "domain": false,
      "labels": false,
      "aria": false,
      "maxExtent": 0,
      "minExtent": 0,
      "ticks": false,
      "zindex": 0
    },
    {
      "scale": "x",
      "orient": "bottom",
      "grid": false,
      "title": "Weight_in_lbs",
      "labelFlush": true,
      "labelOverlap": true,
      "tickCount": {"signal": "ceil(width/40)"},
      "zindex": 0
    },
    {
      "scale": "y",
      "orient": "left",
      "grid": false,
      "title": "Miles_per_Gallon",
      "labelOverlap": true,
      "tickCount": {"signal": "ceil(height/40)"},
      "zindex": 0
    }
  ],
  "legends": [
    {
      "fill": "color",
      "symbolType": "circle",
      "title": "Origin",
      "encode": {"symbols": {"update": {"opacity": {"value": 0.7}}}}
    }
  ]
}
```

Back to our original question of why the colors looked so different. The reason is that when Altair is presented with strings (as in the "Origin" column), it defaults to a *Nominal* encoding data type. That Nominal encoding data type is what is responsible for the colors we saw above.

In contrast, when Altair is presented with numeric values, it defaults to a *Quantitative* data type.  In our case, there are only five possible values for the number of cylinders (and three of those values are by far the most prevalent), but the quantitative color encoding chosen by Altair adapts well to any number of values.

In [None]:
alt.Chart(df).mark_circle().encode(
    alt.X("Weight_in_lbs"), alt.Y("Miles_per_Gallon"), alt.Color("Cylinders")
)

Altair provides a convenient shorthand `:N` for specifying a Nominal encoding data type.  Let's try manually changing the color encoding so that it specifies a Nominal encoding data type.

**Warning**.  If Altair cannot access the data directly (for example, because it is provided with a URL instead of a DataFrame), then these encoding shorthands must be provided, as Altair cannot infer a data type directly. 

[Reference](https://altair-viz.github.io/user_guide/encodings/index.html#encoding-data-types) for the five available encoding types.

There is a third reasonable choice of encoding type for these "Cylinders" values, which is *Ordinal*. The distinction between Nominal (:N) and Ordinal (:O) is that the order is important in the 'Ordinal' case, and by default, Altair chooses a color scheme that reflects this ordering. The resulting colors look quite similar to the colors used in the default Quantitative (:Q) encoding, but notice the difference in legends.

Here is an example of switching to the "plasma" color scheme, this time going back to the default Quantitative encoding type.  See the Vega documentation for the [possible choices of color scheme](https://vega.github.io/vega/docs/schemes/).

## Faceting vs Concatenating

Here we will consider two fundamentally different scenarios in which we will display multiple charts using the same data side-by-side. Both scenarios will reinforce the concepts considered above.

### Scenario 1: Using different fields in the encoding
We first define a helper function `make_chart` that takes a string `y` as input and returns an Altair Chart for the cars data, using `"Weight_in_lbs"` for the x-axis encoding, `"Cylinders"` for the color encoding, and `y` for the y-axis encoding.  We also use the "plasma" color scheme.

In [None]:
def make_chart(y):
    return (
        alt.Chart(df)
        .mark_circle()
        .encode(
            alt.X("Weight_in_lbs"),
            alt.Y(y),
            alt.Color("Cylinders").scale(scheme="plasma"),
        )
    )

What if we want to display `make_chart("Miles_per_Gallon")`, `make_chart("Acceleration")`, and `make_chart("Displacement")` side-by-side?  To accomplish this, we use Altair's `hconcat` function.

In [None]:
chart_list = [
    make_chart(y) for y in ["Miles_per_Gallon", "Acceleration", "Displacement"]
]

**Comment**.  Version 5 of Vega-Altair was released in May 2023.  As mentioned above, much of the code used in this tutorial, including the `scale` method just used, does not work with Altair versions 4 and earlier.  In versions 4 and earlier, the so-called *attribute syntax* would have been necessary, which in this case takes the form:

`alt.Color("Cylinders", scale=alt.Scale(scheme="plasma"))`

### Scenario 2: Using different subsets of the data

Notice that each car in the dataset occurs in all three charts above.  In the following scenario we again have three charts displayed side-by-side, but each car appears in exactly one of the three.  Here we *facet* the data into three distinct subsets, and display the corresponding subsets side-by-side.

To accomplish this, we actually use a new encoding channel, the *column* encoding channel.  In the following, the "Origin" value of the DataFrame gets encoded in the column visual property of the chart.  (Which column the point is displayed in depends on the "Origin" value of the corresponding car.)  Here we use the `"Miles_per_Gallon"` field for the y-axis encoding in all three charts.  Unlike the above example, in the following, the same fields are used in all three charts.

## Scale vs Axis

Quoting the [Vega-Lite documentation](https://vega.github.io/vega-lite/docs/scale.html):

> Scales are functions that transform a domain of data values (numbers, dates, strings, etc.) to a range of visual values (pixels, colors, sizes).

Again quoting the [Vega-Lite documentation](https://vega.github.io/vega-lite/docs/axis.html):

> Axes provide axis lines, ticks, and labels to convey how a positional range represents a data range. Simply put, axes visualize scales.

Let's see some examples of this in Altair.  For this, we will work with a small four-row DataFrame containing only those rows from the cars dataset corresponding to "amc hornet".  We extract this 4-row DataFrame using pandas.  (See Part 2 for a method to accomplish something similar within Altair.)

In [None]:
df_hornet = df[df["Name"] == "amc hornet"]
df_hornet

As an example of the types of options that relate to an "axis", in the following we adjust the `labelAngle` and the `title` of the axis.

> **Aside**.  Why the non-Pythonic capitalization of `labelAngle`?  Most of these options are defined in Vega or Vega-Lite, which are written in JavaScript and TypeScript respectively, so many options are named using the JavaScript camel case capitalization style.

In [None]:
alt.Chart(df_hornet).mark_bar().encode(
    alt.X("year(Year):O").axis(labelAngle=-90, title="Year"), alt.Y("Miles_per_Gallon")
)

As an example of what is meant by transforming data values to visual values (as in the above description of `scale` from the Vega-Lite documentation), we explicitly specify the `domain` (four explicit years) and the `range` (three explicit colors, after which it will start cycling).

In [None]:
alt.Chart(df_hornet).mark_bar().encode(
    alt.X("year(Year):O").axis(labelAngle=-90, title="Year"),
    alt.Y("Miles_per_Gallon"),
    alt.Color("year(Year):O")
    .scale(domain=[1970, 1973, 1974, 1976], range=["black", "blue", "orange"])
    .legend(title="Year"),
)

For another example of customizing scale, we will switch to a different dataset from vega_datasets, this time the gapminder dataset, and we will restrict to the year 2005.

In [None]:
# Restrict to year 2005
df = data.gapminder()
df = df[df["year"] == 2005].copy()

Let's first try to plot the data using columns from the DataFrame `df`:
* "country" for the x-encoding
* "pop" for the y-encoding
* "pop" also for the color

What could we do to improve the resulting chart?

In [None]:
alt.Chart(df).mark_bar().encode(alt.X("country"), alt.Y("pop"), alt.Color("pop"))

Here we will make three changes.
* Sort the bars by decreasing population value.
* Use a log scale for the y-encoding.  (Notice how now we can differentiate between the populations of Barbados and Jamaica, for example.)
* In fact, we will also use a log scale for the color encoding and switch to the 'viridis' color scheme.

In [None]:
alt.Chart(df).mark_bar().encode(
    alt.X("country").sort("-y"),
    alt.Y("pop").scale(type="log"),
    alt.Color("pop").scale(type="log", scheme="viridis"),
)

## Needing data in long form / tidy form

In the following example, we again use the gapminder data, but this time with all years, not just 2005.

In [None]:
# Include all years
df = data.gapminder()

Note that the following correlation data is not in *tidy* form ([reference](https://www.jstatsoft.org/article/view/v059i10)).  When data is in tidy form, "each variable is a column, each observation is a row".  Here, on the other hand, every column corresponds to the same variable (correlation in this case).

In [None]:
df_temp = df.corr(numeric_only=True)
df_temp

As an intermediate step, we convert this DataFrame into a pandas Series with two levels of index, using the `stack` method (see the pandas [stack documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.stack.html)).

In [None]:
df_temp.stack()

Next we convert back into a pandas DataFrame.  (Often when I use `reset_index`, I use it with `reset_index(drop=True)`, but in this instance we absolutely *do not* want to drop the index labels, since without them, we have no way of knowing to what the correlation values correspond.)

You probably agree the following DataFrame is harder to read than the original, but the advantage is that it is now in a *tidy* format and can be easily used by Altair to construct a chart.  This sort of presentation of the data is also said to be in "long form"; note that it is now 25 rows instead of 5 rows.

In [None]:
dfc = df_temp.stack().reset_index()
dfc.columns = ["Predictor 1", "Predictor 2", "Correlation"]

dfc

Here we make a chart using `mark_rect` instead of `mark_circle`.  The resulting rectangle chart is a visualization of the pandas correlation DataFrame we started with.

In [None]:
alt.Chart(dfc).mark_rect().encode(
    alt.X("Predictor 1"), alt.Y("Predictor 2"), alt.Color("Correlation")
).properties(width=200, height=200)

A diverging color scheme is more appropriate here as it makes it easy to identify positive and negative correlations. Adding `domainMid=0` to the scale ensures that the middle of the color scale is located at 0.0.

In [None]:
alt.Chart(dfc).mark_rect().encode(
    alt.X("Predictor 1"),
    alt.Y("Predictor 2"),
    alt.Color("Correlation").scale(scheme="purpleorange", domainMid=0.0),
).properties(width=200, height=200)