[Data Visualization](https://infovis.fh-potsdam.de/tutorials/) · FH Potsdam · Summer 2023

# Tutorial 2: Visual encoding

Welcome back! This tutorial is all about turning data into visual form. This essential mechanism of visualization is called *visual encoding*. For the purpose of this tutorial, we assume that the data is already prepared as a Pandas DataFrame just waiting to be visualized. In reality, getting data into shape often takes a lot of work. 

*As in the first tutorial, you should be able to run the notebook yourself, edit the contents of all cells, and try things out. Here and there particular opportunities for such edits are marked with a pencil. ✏️*

To be able to edit this document on Deepnote, you need to copy the tutorial into your own workspace by clicking on the three dots (⋯) on the very top right and selecting "Duplicate project". For this you need your own free Deepnote account. 

We start by importing the libraries that we plan to use:

In [1]:
import pandas as pd
import altair as alt

# we use sample data from this package
from vega_datasets import data

We start with Gapminder data containing income, health, and population statistics for 187 countries. We use `df` as the variable name for the DataFrame containing the Gapminder statistics. Many Pandas tutorials also use this variable name, while the Altair documentation tends to use `source`.

In [2]:
df = data.gapminder_health_income()
df.head()

| | country | income | health | population |
|---|---|---|---|---|
| 0 | Afghanistan | 1925 | 57.63 | 32526562 |
| 1 | Albania | 10620 | 76.0 | 2896679 |
| 2 | Algeria | 13434 | 76.5 | 39666519 |
| 3 | Andorra | 46577 | 84.1 | 70473 |
| 4 | Angola | 7615 | 61.0 | 25021974 |


## Visual marks

Let's start with the graphical elements that represent individual data points or groups of data points. Visual marks are the basic building blocks of any visualization and they can take several forms, such as points, lines, and areas.


### Points

Points are perhaps the most basic mark type for visualizing data elements. Typically they are used to create scatterplots. When using Altair, we specify the type of visual mark with the method called immediately after `Chart()`, e.g., `mark_point()`.

We may want to see the relationship between `income` and `health` statistics:

In [3]:
alt.Chart(df).mark_point().encode(
    x='income',
    y='health'
)

### Lines

The line is the graphical mark that is most often used for visualizing temporal data. For this we now use another dataset, about energy production in Iowa, and load it into a second DataFrame that we call `df2`:

In [4]:
df2 = data.iowa_electricity()

# while head displays the first rows, sample gives us a random selection:
df2.sample(5)

| | year | source | net_generation |
|---|---|---|---|
| 26 | 2010-01-01 | Nuclear Energy | 4451 |
| 36 | 2003-01-01 | Renewables | 1885 |
| 27 | 2011-01-01 | Nuclear Energy | 5215 |
| 6 | 2007-01-01 | Fossil Fuels | 41389 |
| 45 | 2012-01-01 | Renewables | 14949 |


Below is a simple line chart showing the steady increase of renewable energy production from 2001 to 2017. To focus just on the renewables, we first need to apply a query (we will get to queries in the next tutorial):

In [5]:
renewables = df2[df2["source"] == "Renewables"]

alt.Chart(renewables).mark_line().encode(
    x="year",
    y="net_generation",
)
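
The query in this cell is a boolean mask: the comparison yields one `True`/`False` value per row, and indexing with it keeps only the rows marked `True`. Here is a minimal sketch on a made-up three-row table (the values are invented for illustration and are not taken from the Iowa dataset):

```python
import pandas as pd

# Toy data for illustration only; these numbers are invented.
toy = pd.DataFrame({
    "source": ["Fossil Fuels", "Renewables", "Renewables"],
    "net_generation": [100, 10, 20],
})

# The comparison produces a boolean Series with one entry per row.
mask = toy["source"] == "Renewables"

# Indexing with the mask keeps only the rows where it is True.
renewables_toy = toy[mask]

print(mask.tolist())
print(renewables_toy)
```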

### Slices

If we wanted to visualize the share that each energy source contributes in a given year, we might be tempted to draw a pie chart. For this we can use the `arc` mark:

In [6]:
alt.Chart(df2[df2["year"]=="2017"]).mark_arc().encode(    
    theta="net_generation",
    color="source"
)
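
What the pie chart shows can also be worked out by hand: each slice's angle is proportional to its share of the total. The sketch below computes these shares for a small invented table (the numbers are made up and chosen to add up to 100 for readability):

```python
import pandas as pd

# Toy data for illustration only; these numbers are invented.
toy_year = pd.DataFrame({
    "source": ["Fossil Fuels", "Nuclear Energy", "Renewables"],
    "net_generation": [50, 10, 40],
})

# Share of each source in the total for this (hypothetical) year.
toy_year["share"] = toy_year["net_generation"] / toy_year["net_generation"].sum()

# The same share expressed as the angle of a slice in a full circle.
toy_year["degrees"] = toy_year["share"] * 360

print(toy_year)
```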

### Areas

Area charts let us see how data in multiple categories add up over a timespan. We might want to see how different energy sources contribute to an overall temporal trend, i.e., the overall energy production in Iowa:

In [7]:
alt.Chart(df2).mark_area().encode(
    x="year",
    y="net_generation",
    color="source"
)

### Bars

Altair provides a few more marks, some of which we will get to know in later tutorials, but there is one that we do not want to miss here: the bar. Let's also create a bar chart, but this one I will leave for you to draw.

✏️ *Consider plotting the population data in `df` as a bar chart. Hint: The method call is `mark_bar()`.*


## Visual variables

Now that we have different marks at our disposal, we can vary their placement and appearance according to the dataset. We need to decide how to translate the attributes in the dataset to graphical attributes of the marks. In other words: We want the graphical marks to vary in the same way as the data elements vary in the dataset. In Altair, this is done by explicitly connecting data dimensions and visual variables in the `encode()` method. We will go through the most important visual variables and see how they can be adjusted in Altair visualizations.

Altair offers two ways of defining a visual variable for a given data dimension: a short and a long form. The short form is concise and makes some assumptions about scales and other aspects, while the long form gives us much more control over the encoding. The examples above all use the short form; below we will see a few more long-form examples.

The short form is written as a named parameter of the `encode()` method:

` x='income' `

The long form is a chained method call passed as an unnamed (positional) argument to `encode()`:

` alt.X('income').scale(type='log') `

Okay, now let's get to it!


### Position

Perceptually speaking, position is the most powerful variable, and in Altair it can be referred to simply as `x` and `y`. The short form connects visual variables with data attributes by passing the respective parameters to the `encode` method. The parameter name is the visual variable (such as `x` and `y`) and the value is the column name. This is what we have already done when we introduced the `point` mark above.

However, the default scatterplot we created to map income and health statistics for 187 countries actually resulted in quite a crowded arrangement. To clean this up a bit, we can adjust how the positioning unfolds to better see differences among countries and patterns between the dimensions. To do this we can change the x-axis to a logarithmic scale and let the y-axis start at a value other than 0. To specify such adjustments, we use the long form to define the visual encoding of position:

In [8]:
alt.Chart(df).mark_point().encode(
    alt.X('income').scale(type='log'),
    alt.Y('health').scale(zero=False),
)
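
Why a logarithmic scale helps here can be seen from the five incomes in the `df.head()` output near the top: on a linear axis the largest value pushes the smaller ones toward zero, while their base-10 logarithms lie much closer together. This is a small sketch using only the standard library:

```python
import math

# Incomes of the first five countries shown in the df.head() output above.
incomes = {
    "Afghanistan": 1925,
    "Albania": 10620,
    "Algeria": 13434,
    "Andorra": 46577,
    "Angola": 7615,
}

largest = max(incomes.values())
for country, income in incomes.items():
    linear_pos = income / largest   # position as a fraction of the largest value
    log_pos = math.log10(income)    # position on a base-10 logarithmic axis
    print(f"{country:12s} linear={linear_pos:.2f}  log10={log_pos:.2f}")
```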

### Size

The size of marks can also be varied. How about extending the chart above so that the size of each point represents the population of its country?

✏️ *Try adding the visual variable `size` to encode `population` and see what happens. You can use either the long form or the short form of defining the encoding.*

### Color

Color is a particularly precious variable as it lets us distinguish categories, but we cannot tell very many colors apart or remember their categorical associations. So when you use colors to encode categories or groupings, try to avoid going beyond ten.

As an example, we will revisit the area chart above and change the color scale to a custom scale. The parameter `domain` lists the data values, which are mapped in the same order to the colors listed in `range`. Note: The long-form encoding specifications need to come before the short-form ones, because Python requires positional arguments to precede keyword arguments.

In [9]:
alt.Chart(df2).mark_area().encode(
    alt.Color("source").scale(
        domain=["Renewables", "Nuclear Energy", "Fossil Fuels"],
        range=['green', 'red', 'purple']
    ),
    x="year",
    y="net_generation",
)
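
The ordering rule in the note above is not specific to Altair: Python itself requires positional arguments to come before keyword arguments in any call. A minimal sketch with a hypothetical stand-in function `encode_demo` (not part of Altair) shows the difference:

```python
def encode_demo(*channels, **named_channels):
    """Hypothetical stand-in for encode(); it only reports what it received."""
    return len(channels), sorted(named_channels)

# Positional (long form) first, keyword (short form) after: valid Python.
print(encode_demo("long-form color", x="year", y="net_generation"))

# The reverse order is rejected when Python parses the call.
try:
    compile('encode_demo(x="year", "long-form color")', "<demo>", "eval")
except SyntaxError as err:
    print("SyntaxError:", err.msg)
```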

By adding color as a visual variable to the line chart above and passing the entire energy DataFrame, we can juxtapose the developments of the individual sources instead of stacking them as in the area chart.

✏️ *How would you customize the colors to match the chart above?*

In [10]:
alt.Chart(df2).mark_line().encode(
    color="source",
    x="year",
    y="net_generation"
)

### Opacity

Closely connected with color, we can also vary the opacity of marks to encode data:

In [11]:
alt.Chart(df).mark_point().encode(
    alt.X('income').scale(type='log'),
    alt.Y('health').scale(zero=False),
    alt.Opacity('population')
)

### Shape

We can also adjust the shapes used to display data elements. This can be done on the basis of data attributes or by simply changing the default shape, which is what we are doing below. In this case, we actually specify a different shape in the `mark_point()` call:


In [12]:
alt.Chart(df).mark_point(shape="square").encode(
    alt.X('income').scale(type='log'),
    alt.Y('health').scale(zero=False),
    alt.Opacity('population').scale(zero=False),
)

✏️ *What other visual properties would you want to change? Hint: The Altair documentation lists the [general mark properties](https://altair-viz.github.io/user_guide/marks/index.html#general-mark-properties) as well as the properties specific to certain mark types, such as [points](https://altair-viz.github.io/user_guide/marks/point.html#point-mark-properties).*

## Sources

- [Marks — Vega-Altair documentation](https://altair-viz.github.io/user_guide/marks/index.html)
- [Encodings — Vega-Altair documentation](https://altair-viz.github.io/user_guide/encodings/index.html)