**SA433 &#x25aa; Data Wrangling and Visualization &#x25aa; Fall 2024**

# Lesson 3. Introduction to Data Visualization with Altair

## In this lesson...

- What is Altair?


- A small example: data types, graphical marks, encoding channels

- In-depth:
    - Data types
    - Encoding channels
    - Graphical marks

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## What is Altair?

- [__Altair__](https://altair-viz.github.io/) is a statistical visualization library for Python 


- Altair offers a powerful declarative *grammar* for quickly building a wide range of statistical graphics


- You specify *what* you want the visualization to include, instead of *how* to implement the visualization in terms of for-loops, low-level drawing commands, etc.

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## A small example

- Let's start by importing Pandas as `pd` and Altair as `alt`:

In [None]:
import pandas as pd
import altair as alt

- For this example, we'll use a small weather dataset:

In [None]:
weather_df = pd.read_csv('data/weather.csv')

In [None]:
# This dataset is so small, we don't need to use .head()
weather_df

- Each row in this DataFrame represents 1 observation:
    - `precip` contains the average precipitation for a given `city` and `month`

### The Chart object

- The fundamental object in Altair is the `Chart`, which takes a DataFrame as a single argument:

- So far, we have defined the `Chart` object and passed it the `weather_df` DataFrame we created above


- We have not yet told the `Chart` to do anything with the data

### Graphical marks

- With a `Chart` object in hand, we can now specify how we would like the data to be visualized


- First, we indicate what kind of **graphical mark** (e.g., points, lines, bars) we want to use to represent the data


- For example, we can show the data as points using the `.mark_point()` method of the `Chart` object:

- Hm... this isn't very interesting. What's going on? 🤔


- In this chart, we have one point for each row in `weather_df`


- All of these points are plotted in the same place, on top of each other, since we didn't specify anything about how they should be positioned

### Encoding channels and data types

- A mark has various attributes: e.g. position, shape, size, color


- We connect the data we want to visualize to these attributes using **encoding channels**


- To visually separate the points in our example, we can **encode** the variable `city` using the `Y` channel, which represents the y-axis position of the points


- To do this, we use the `.encode()` method:

- `:N` tells Altair that the variable `city` contains **nominal** (categorical) data


- We still have multiple points overlapping within each category


- Let's further separate these points by adding an `X` encoding channel, mapped to `precip`:

- `:Q` tells Altair that `precip` contains **quantitative** (continuous real-valued) data

- ❗️ You may find examples in the wild that use keyword arguments in the `encode` method, like this:

    ```python
    alt.Chart(weather_df).mark_point().encode(
        x='precip:Q',  # This is a keyword argument: key=value
        y='city:N'
    )
    ```
    <br>

    - This code is equivalent, but it becomes less flexible once you add options to an encoding
    - For this course, we'll stick with specifying encodings with their channel classes (e.g. `alt.X`, `alt.Y`)

❓ **Exercise 1.** 
Create a scatter plot that shows the relationship between the month and average monthly precipitation.

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Going more in depth &mdash; another data set

- For the remainder of this lesson, we'll use the global health and population data we looked at in Lesson 2, filtered to contain data only for the year 2000:

In [None]:
df = pd.read_csv('data/gapminder_2000.csv')

df.head()

- Recall: For each `country` and `year`, this data contains:
    - region of the world (`cluster`)
    - total population (`pop`)
    - average life expectancy in years (`life_expect`)
    - number of children per woman (`fertility`)

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Data types

- Data usually contains values of different types


- These data types determine the kinds of comparisons we can make, and therefore dictate how they can be effectively visualized


- Altair recognizes 5 main data types:
    1. Nominal (N)
    2. Ordinal (O)
    3. Temporal (T)
    4. Quantitative (Q)
    5. GeoJSON (G)

- We'll talk about the first 4 data types now, and save the 5th for a later lesson...

### Nominal (N)

- **Nominal** data (also called **categorical** data) consist of category names 


- With nominal data, we can compare the equality of values: does $A = B$?


- For example, in the dataset above, we can treat `country` as nominal

- When visualizing nominal data, we should be able to see if values are the same or different
    - This can be achieved using different positions, color hues (e.g. red vs. green vs. blue), or shapes
    - Using different sizes to encode nominal data might be misleading, suggesting an ordering or magnitude differences among values that do not exist!

### Ordinal (O)

- **Ordinal** data consist of values that have a specific ordering


- With ordinal data, we can compare the rank-ordering of values: is $A < B$?


- For example, in the dataset above, we can treat `year` as ordinal

- When visualizing ordinal data, we should be able to see a sense of rank-order
    - Position, size, or color brightness might be appropriate
    - Color hue (which is not perceptually ordered) would be less appropriate

### Quantitative (Q)

- **Quantitative** data consist of continuous, real-valued quantities


- With quantitative data we can measure numerical differences among values, either absolute ($A - B$) or relative ($A / B$)


- For example, in the dataset above, we can treat `fertility` as quantitative


- Quantitative values can be visualized using position, size, or color, for example

### Temporal (T)

- **Temporal** data measure time points or intervals


- This data type is a special case of quantitative data with rich semantics and conventions (e.g., the [Gregorian calendar](https://en.wikipedia.org/wiki/Gregorian_calendar))


- For example, temporal values include date strings such as `'2019-01-04'` and `'Jan 04 2019'`

- In the dataset above, we can treat `year` as temporal
    - However, it may be easier to treat `year` as ordinal, since it is simply encoded as an integer

- We'll learn more about using temporal values in Altair later

### Values can have multiple data types

- Note: these data types do _not_ provide a fixed categorization


- For example: just because a data field is represented using a number doesn't mean we have to treat it as a quantitative type! 


- We might interpret a set of ages (10 years old, 20 years old, etc) as nominal (underage or overage), ordinal (grouped by year), or quantitative (calculate average age).

❓ **Exercise 2.** 
What are appropriate data types for `life_expect` in the dataset above?

*Write your answer here. Double-click to edit.*

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Encoding channels

- Now let's examine how to visually encode these data types!


- Altair supports a wide variety of visual encoding channels: [here's the documentation](https://altair-viz.github.io/user_guide/encodings/index.html)

- In this lesson, we'll learn about some of the more commonly used encoding channels:
    - `X`: Horizontal (x-axis) position of the mark
    - `Y`: Vertical (y-axis) position of the mark
    - `Size`: Size of the mark. May correspond to area or length, depending on the mark type
    - `Color`: Mark color, specified as a [legal CSS color](https://developer.mozilla.org/en-US/docs/Web/CSS/color_value)
    - `Shape`: Symbol shape for `point` marks

### X

- The `X` encoding channel sets a mark's horizontal position (x-coordinate)


- Choices of axis and title are made automatically, depending on the data type 


- In the chart below, the choice of a quantitative data type results in a continuous linear axis scale:

### Y

- The `Y` encoding channel sets a mark's vertical position (y-coordinate)


- If we add `cluster` on the y-axis as an ordinal (`O`) variable, the result is a horizontal band of points for each unique value:

❓ **Exercise 3.**
What happens to the chart above if you treat `fertility` as an ordinal (`O`) variable?

*Write your answer here. Double-click to edit.*

- If we instead add `life_expect` on the y-axis as a quantitative (`Q`) variable, the result is a scatter plot with linear scales for both axes:

### Size

- The `Size` encoding channel sets a mark's size


- The meaning of "size" can vary based on the mark type


- For point marks, the `Size` channel roughly maps to the area of the point


- Let's augment our scatter plot by encoding population (`pop`) on the `Size` channel:

- Note that using the `Size` channel results in a legend for interpreting the size values


- Encodings can be customized through the use of method chaining or keyword arguments


- For example, we can adjust the size encoding so that the areas of the points range between 0 and 1000 square pixels:

In [None]:
alt.Chart(df).mark_point().encode(
    alt.X('fertility:Q'),
    alt.Y('life_expect:Q'),
    alt.Size('pop:Q').scale(range=[0, 1000])
)

- ❗️ You may find examples in the wild that use keyword arguments to specify encoding channel options, like this:

    ```python
    alt.Chart(df).mark_point().encode(
        alt.X('fertility:Q'),
        alt.Y('life_expect:Q'),
        alt.Size('pop:Q', scale=alt.Scale(range=[0, 1000]))  # attribute-based syntax
    )
    ```
    <br>

    - This is called the __attribute-based syntax__, which is equivalent, but more verbose 
    - For this course, we'll stick with the __method-based syntax__, using chained methods to specify encoding channel options (e.g., `.scale(...)`)

### Color

- The `Color` encoding channel sets a mark's color

- The default style of color encoding depends on the data type: 
    - Nominal data will map to a multi-hued qualitative color scheme
    - Ordinal and quantitative data will map to perceptually ordered color gradients

- If we add `cluster` to our scatter plot as a nominal (`N`) data type by encoding it with the `Color` channel, we get a distinct hue for each cluster value:

- To get filled points, we can pass the keyword argument `filled=True` to the `.mark_point()` method:

### Shape

- The `Shape` encoding channel sets the geometric shape used for point marks


- Unlike the other channels we have seen so far, the `Shape` channel cannot be used by other mark types

- Let's encode `cluster` using `Shape` as well as `Color`
    - Using multiple channels for the same underlying data field is known as **redundant encoding**

- The resulting chart combines both color and shape information into a single symbol legend:

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Graphical marks

- So far, we've only used point marks to visualize data


- Altair has many built-in mark types: [here's the documentation](https://altair-viz.github.io/user_guide/marks/index.html)

- In this lesson, we'll learn about some of the more commonly used mark types:
    - __point__: scatter plot points with configurable shapes
    - __bar__: rectangular bars
    - __line__: connected line segments

### Point marks

- As we've seen above, the __point__ mark type conveys specific points, as in *scatter plots* and *dot plots*


- We use the `.mark_point()` method to create a chart with point marks


- In addition to `X` and `Y` encoding channels (to specify 2D point positions), point marks can use `Color`, `Size`, and `Shape` encodings to convey additional variables


- Here is a dot plot of `fertility`, with `cluster` redundantly encoded using both the `Y` and `Shape` channels:

❓ **Exercise 4.**
Look at the Altair documentation on marks (see link above). What do the `.mark_circle()` and `.mark_square()` methods do? Try changing the plot above to use these different marks and see what happens. Do the results make sense? Why?

*Write your answer here. Double-click to edit.*

- Marks can be styled by passing parameters to the `.mark_...()` methods


- For example, we can change the size of a default point and whether a point is filled like this:

### Bar marks

- The __bar__ mark type draws a rectangle with a position, width, and height


- We use the `.mark_bar()` method to create a chart with bar marks


- Here is a simple bar chart of the population (`pop`) of each country:

❓ **Exercise 5.**
Create the same chart as above, except with horizontal bars and the countries listed from top to bottom.

- The bar width is set to a default size


- Bars can also be stacked


- Let's change the `X` encoding to use the `cluster` field, and encode `country` using the `Color` channel:

In [None]:
alt.Chart(df).mark_bar().encode(
    alt.X('cluster:N'),
    alt.Y('pop:Q'),
    alt.Color('country:N')
)

- In the chart above, the use of the `Color` encoding channel causes Altair to automatically stack the bar marks


- Otherwise, bars would be drawn on top of each other! 


- See what happens if we don't apply stacking: try adding `.stack(None)` to the `Y` encoding channel, like this:

    ```python
    alt.Y('pop:Q').stack(None)
    ```

- Note: we can disable the legend (which can't even fit all the countries by default!) by adding `.legend(None)` to the `Color` encoding. Try it out!

### Line marks

- The __line__ mark type connects plotted points with line segments


- We use the `.mark_line()` method to create a chart with line marks


- This is helpful, for example, when you want to convey information about the rate of change


- Let's plot a line chart of fertility per country over the years, using only data for the South Asia cluster:

In [None]:
southasia_df = pd.read_csv('data/gapminder_southasia.csv')

In [None]:
alt.Chart(southasia_df).mark_line().encode(
    alt.X('year:O'),
    alt.Y('fertility:Q'),
    alt.Color('country:N')
).properties(
    width=400
)

- Note that we set a custom width of 400 pixels. Try changing the width and see what happens!

❓ **Exercise 6.**
Let's change some of the default mark parameters to customize the plot above. Pass the following keyword arguments to `.mark_line()`:

- `strokeWidth=3` to change the thickness of the line to 3 pixels
- `opacity=0.5` to add some transparency

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## What's next?

- You now know the basics of creating data visualizations using Altair!

- Over the next several lessons, we will gradually add more to your Altair toolbelt:
    - Transforming the data to create visualizations that summarize data or visualize newly derived variables
    - Configuring titles, axes, scales, colors
    - Adding user interactivity to visualizations
    - Visualizing cartographic data (i.e., geographic maps)

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Problems

### Problem 1

The file `data/gapminder_2000.csv` contains the Gapminder data filtered for the year 2000 that we used above.

Create a scatter plot of life expectancy vs. population. Use shape to differentiate points representing different clusters. Your chart should look like this:

![](img/scatter.svg)

Do you think this is an effective visualization? Why or why not?

*Write your answer here. Double-click to edit.*

### Problem 2

The file `data/gapminder_americas_2000.csv` contains the Gapminder data filtered for countries in the Americas cluster and the year 2000.

Create a bar chart comparing the fertility of each country. Each country should be represented by a bar of a different color. Do not include a legend for the colors. Your chart should look like this:

![](img/bar.svg)

### Problem 3

The line mark can be used to create *slope graphs*, charts that highlight the change in value between two comparison points using lines.

The file `data/gapminder_eap_1955_2005.csv` contains the Gapminder data, filtered for countries in the East Asia & Pacific cluster and the years 1955 and 2005.

Create a slope graph comparing the populations of each country in 1955 and 2005, like this:

![](img/slope.svg)

Set the opacity of the `line` mark to 0.5, and the width of the chart to 200.

### Problem 4

The code cell below contains code for the line chart we created above, which plots the fertility of the countries in South Asia over time.

Take a look at the [Altair documentation on encoding channels](https://altair-viz.github.io/user_guide/encodings/channels.html). What does the `StrokeDash` encoding channel do? Modify the chart below so that each country is represented with a line of a different color and dash style.

In [None]:
southasia_df = pd.read_csv('data/gapminder_southasia.csv')

alt.Chart(southasia_df).mark_line().encode(
    alt.X('year:O'),
    alt.Y('fertility:Q'),
    alt.Color('country:N'),
).properties(
    width=400
)

_Write your answer here. Double-click to edit._

### Problem 5

As in Problem 4, the code cell below contains code for the line chart we created above, which plots the fertility of the countries in South Asia over time.

Again, take a look at the [Altair documentation on encoding channels](https://altair-viz.github.io/user_guide/encodings/channels.html). What does the `StrokeOpacity` *encoding channel* do? Modify the chart below so that each country is represented with a line of a different color and opacity. How is this different from setting the stroke opacity in Exercise 6?

In [None]:
southasia_df = pd.read_csv('data/gapminder_southasia.csv')

alt.Chart(southasia_df).mark_line().encode(
    alt.X('year:O'),
    alt.Y('fertility:Q'),
    alt.Color('country:N'),
).properties(
    width=400
)

*Write your answer here. Double-click to edit.*

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Notes and sources

- These lesson notes are based on the [Visualization Curriculum](https://uwdata.github.io/visualization-curriculum/) by the University of Washington


- The systematic study of marks, visual encodings, and backing data types was initiated by [Jacques Bertin](https://en.wikipedia.org/wiki/Jacques_Bertin) in his pioneering 1967 work [_Sémiologie Graphique (The Semiology of Graphics)_](https://books.google.com/books/about/Semiology_of_Graphics.html?id=X5caQwAACAAJ)


- The identification of nominal, ordinal, interval, and ratio types dates at least as far back as S. S. Stevens's 1946 article [_On the theory of scales of measurement_](https://scholar.google.com/scholar?cluster=14356809180080326415)