This notebook is adapted from the UW [data visualization curriculum](https://github.com/uwdata/visualization-curriculum).

# Data Types, Graphical Marks, and Visual Encoding Channels

A visualization represents data using a collection of _graphical marks_ (bars, lines, points, etc.). The attributes of a mark &mdash; such as its position, shape, size, or color &mdash; serve as _channels_ through which we can encode underlying data values.

In [1]:
import pandas as pd
import altair as alt

## Global Development Data


In [2]:
data = pd.read_csv('data/gapminder.tsv', sep='\t')
data.head(5)

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.85303
2,Afghanistan,Asia,1962,31.997,10267083,853.10071
3,Afghanistan,Asia,1967,34.02,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106


## Data Types
![](figures/data_types.jpeg)


In [3]:
#shape = ['Circle', 'Square', 'Triangle']

In [4]:
#sentiment = ['VN', 'N', '0', 'P', 'VP']

The first ingredient in effective visualization is the input data. Data values can represent different forms of measurement. What kinds of comparisons do those measurements support? And what kinds of visual encodings then support those comparisons?

We will start by looking at the basic data types that Altair uses to inform visual encoding choices. These data types determine the kinds of comparisons we can make, and thereby guide our visualization design decisions.

### Nominal (N)

*Nominal* data (also called *categorical* data) consist of category names. 

With nominal data we can compare the equality of values: *is value A the same or different than value B? (A = B)*, supporting statements like “A is equal to B” or “A is not equal to B”.
In the dataset above, the `country` field is nominal.

When visualizing nominal data we should readily be able to see if values are the same or different: position, color hue (blue, red, green, *etc.*), and shape can help. However, using a size channel to encode nominal data might mislead us, suggesting rank-order or magnitude differences among values that do not exist!

### Ordinal (O)

*Ordinal* data consist of values that have a specific ordering.

With ordinal data we can compare the rank-ordering of values: *does value A come before or after value B? (A < B)*, supporting statements like “A is less than B” or “A is greater than B”.
In the dataset above, we can treat the `year` field as ordinal.

When visualizing ordinal data, we should perceive a sense of rank-order. Position, size, or color value (brightness) might be appropriate, whereas color hue (which is not perceptually ordered) would be less appropriate.
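
As a small sketch of rank-order comparison, the snippet below reuses the five sentiment labels from the commented-out cell above. Reading them as "very negative" through "very positive" is an assumption, and the helper `comes_before` is hypothetical, written only for illustration. The point is that ordinal comparisons depend on position in the ordering, not on the labels themselves.

```python
# Ordered sentiment labels; interpreting them as very negative ... very
# positive is an assumption made for this illustration.
sentiment = ['VN', 'N', '0', 'P', 'VP']

# Map each label to its rank (its position in the ordering).
rank = {label: i for i, label in enumerate(sentiment)}

def comes_before(a, b):
    """Return True if label `a` ranks lower than label `b`."""
    return rank[a] < rank[b]

print(comes_before('N', 'P'))                    # True
print(sorted(['P', 'VN', '0'], key=rank.get))    # ['VN', '0', 'P']
print(sorted(['P', 'VN', '0']))                  # ['0', 'P', 'VN'] (alphabetical, not rank order)
```

The last line shows why an explicit ordering is needed: a plain string sort does not know the intended ranks.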

### Quantitative (Q)

With *quantitative* data we can measure numerical differences among values. There are multiple sub-types of quantitative data:

For *interval* data we can measure the distance (interval) between points: *what is the distance to value A from value B? (A - B)*, supporting statements such as “A is 12 units away from B”.

For *ratio* data the zero-point is meaningful and so we can also measure proportions or scale factors: *value A is what proportion of value B? (A / B)*, supporting statements such as “A is 10% of B” or “B is 7 times larger than A”.

In the dataset above, `year` is a quantitative interval field (the value of year "zero" is subjective), whereas `gdpPercap` and `lifeExp` are quantitative ratio fields (zero is meaningful for calculating proportions).
Vega-Lite represents quantitative data, but does not make a distinction between interval and ratio types.

Quantitative values can be visualized using position, size, or color value, among other channels. An axis with a zero baseline is essential for proportional comparisons of ratio values, but can be safely omitted for interval comparisons.
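
To make the interval/ratio distinction concrete, here is a minimal sketch using two Afghanistan rows copied from the table shown earlier. The variable names are chosen for illustration only.

```python
# Two rows for Afghanistan, copied from the table shown earlier.
year_a, gdp_a = 1952, 779.445314
year_b, gdp_b = 1972, 739.981106

# Interval comparison: the difference between two years is meaningful.
print(year_b - year_a)        # 20 (years apart)

# A ratio of years is not meaningful, because year "zero" is an
# arbitrary reference point rather than a true absence of quantity.

# Ratio comparison: gdpPercap has a true zero, so proportions make sense.
ratio = gdp_b / gdp_a
print(round(ratio, 3))        # 0.949, i.e. about 95% of the 1952 value
```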

### Temporal (T)

*Temporal* values measure time points or intervals. This type is a special case of quantitative values (timestamps) with rich semantics and conventions (i.e., the [Gregorian calendar](https://en.wikipedia.org/wiki/Gregorian_calendar)). The temporal type in Vega-Lite supports reasoning about time units (year, month, day, hour, etc.), and provides methods for requesting specific time intervals.

Example temporal values include date strings such as `“2019-01-04”` and `“Jan 04 2019”`, as well as standardized date-times such as the [ISO date-time format](https://en.wikipedia.org/wiki/ISO_8601): `“2019-01-04T17:50:35.643Z”`.

There are no temporal values in our global development dataset above, as the `year` field is simply encoded as an integer. For more details about using temporal data in Altair, see the [Times and Dates documentation](https://altair-viz.github.io/user_guide/times_and_dates.html).
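
If we did want to treat `year` as temporal, the integers would first need to become dates. The sketch below uses only the Python standard library; the helper `year_to_date` is hypothetical, and mapping a year to January 1 is an assumption made for illustration. With pandas, something like `pd.to_datetime(data['year'].astype(str), format='%Y')` is one way to do the same conversion on a whole column.

```python
from datetime import date, datetime

# The two date strings from the text describe the same calendar day.
d1 = date.fromisoformat('2019-01-04')
d2 = datetime.strptime('Jan 04 2019', '%b %d %Y').date()
print(d1 == d2)                          # True

def year_to_date(year):
    """Represent an integer year as January 1 of that year (an assumption)."""
    return date(year, 1, 1)

print(year_to_date(1952).isoformat())    # 1952-01-01
```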

### Summary

These data types are not mutually exclusive, but rather form a hierarchy: ordinal data support nominal (equality) comparisons, while quantitative data support ordinal (rank-order) comparisons.

Moreover, these data types do _not_ provide a fixed categorization. Just because a data field is represented using a number doesn't mean we have to treat it as a quantitative type! For example, we might interpret a set of ages (10 years old, 20 years old, etc.) as nominal (underage or overage), ordinal (grouped by year), or quantitative (calculate average age).
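
The ages example can be sketched in a few lines of plain Python. The ages, the cutoff of 18, and the choice of decade-sized groups are all made up for illustration.

```python
from statistics import mean

ages = [10, 20, 35, 17, 42]   # made-up ages, for illustration only

# Nominal: two unordered categories (the cutoff of 18 is an assumption).
labels = ['underage' if a < 18 else 'overage' for a in ages]

# Ordinal: ordered groups, here decades (0 = 0-9, 1 = 10-19, ...).
bins = [a // 10 for a in ages]

# Quantitative: arithmetic on the raw values.
print(labels)      # ['underage', 'overage', 'overage', 'underage', 'overage']
print(bins)        # [1, 2, 3, 1, 4]
print(mean(ages))  # 24.8
```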

Now let's examine how to visually encode these data types!


In [5]:
data.dtypes

country       object
continent     object
year           int64
lifeExp      float64
pop            int64
gdpPercap    float64
dtype: object

In [6]:
alt.Chart(data).mark_point().encode(
    alt.X('lifeExp:Q')
)

In [7]:
alt.Chart(data).mark_point().encode(
    alt.X('lifeExp:N')
)

In [8]:
alt.Chart(data).mark_point().encode(
    alt.X('lifeExp:T')
)

## Encoding Channels

At the heart of Altair is the use of *encodings* that bind data fields (with a given data type) to available encoding *channels* of a chosen *mark* type. In this notebook we'll examine the following encoding channels:

- `x`: Horizontal (x-axis) position of the mark.
- `y`: Vertical (y-axis) position of the mark.
- `size`: Size of the mark. May correspond to area or length, depending on the mark type.
- `color`: Mark color, specified as a [legal CSS color](https://developer.mozilla.org/en-US/docs/Web/CSS/color_value).
- `opacity`: Mark opacity, ranging from 0 (fully transparent) to 1 (fully opaque).
- `shape`: Plotting symbol shape for `point` marks.
- `tooltip`: Tooltip text to display upon mouse hover over the mark.
- `order`: Mark ordering, determines line/area point order and drawing order.
- `column`: Facet the data into horizontally-aligned subplots.
- `row`: Facet the data into vertically-aligned subplots.

For a complete list of available channels, see the [Altair encoding documentation](https://altair-viz.github.io/user_guide/encoding.html).

### X

The `x` encoding channel sets a mark's horizontal position (x-coordinate). In addition, default choices of axis and title are made automatically. In the chart below, the choice of a quantitative data type results in a continuous linear axis scale:

In [9]:
alt.Chart(data).mark_point().encode(
    #alt.X('lifeExp:Q')
    x='lifeExp:Q'
)

### Y

The `y` encoding channel sets a mark's vertical position (y-coordinate). Here we've added the `continent` field using an ordinal (`O`) data type. The result is a discrete axis that includes a sized band, with a default step size, for each unique value:

In [10]:
alt.Chart(data).mark_point().encode(
    alt.X('lifeExp:Q'),
    alt.Y('continent:O')
)

In [11]:
alt.Chart(data).mark_point().encode(
    alt.X('lifeExp:Q'),
    alt.Y('pop:Q')
)

In [12]:
alt.Chart(data).mark_point().encode(
    # zero=False drops the requirement that each axis include a zero baseline
    alt.X('lifeExp:Q', scale=alt.Scale(zero=False)),
    alt.Y('pop:Q', scale=alt.Scale(zero=False))
)

### Size

In [13]:
alt.Chart(data).mark_point().encode(
    alt.X('gdpPercap:Q'),
    alt.Y('lifeExp:Q'),
    # Same as alt.Size('pop:Q'): the type is inferred from the int64 column
    size='pop'
)

In [14]:
alt.Chart(data).mark_point().encode(
    alt.X('gdpPercap:Q'),
    alt.Y('lifeExp:Q'),
    # range sets the smallest and largest mark sizes (area, in square pixels)
    alt.Size('pop:Q', scale=alt.Scale(range=[0,1000]))
)

### Color and Opacity

In [15]:
alt.Chart(data).mark_point().encode(
    alt.X('lifeExp:Q'),
    alt.Y('pop:Q'),
    alt.Color('continent')
)

If we prefer filled shapes, we can pass a `filled=True` parameter to the `mark_point` method, as in the second cell below:

In [16]:
alt.Chart(data).mark_point().encode(
    alt.X('lifeExp:Q'),
    alt.Y('pop:Q'),
    color='continent'
    #alt.Color('continent:N')
)

In [17]:
alt.Chart(data).mark_point(filled=True).encode(
    alt.X('lifeExp:Q'),
    alt.Y('pop:Q'),
    alt.Color('continent:N'),
    #opacity = 'country'
    #alt.OpacityValue(0.5)
)

### Shape

In [18]:
alt.Chart(data).mark_point().encode(
    alt.X('lifeExp:Q'),
    alt.Y('pop:Q'),
    alt.Shape('continent:N'),
    #alt.OpacityValue(0.5)
)

### Tooltips & Ordering

In [19]:
alt.Chart(data).mark_point(filled=True, opacity=0.5).encode(
    alt.X('lifeExp:Q'),
    alt.Y('pop:Q'),
    alt.Color('continent:N'),
    tooltip='country'
)

In [20]:
alt.Chart(data).mark_point(filled=True).encode(
    alt.X('gdpPercap:Q'),
    alt.Y('lifeExp:Q'),
    alt.Size('pop:Q', scale=alt.Scale(range=[0,1000])),
    alt.Color('continent:N'),
    alt.OpacityValue(1),
    alt.Tooltip('country:N'),
    # Draw large populations first so that smaller circles end up on top
    alt.Order('pop:Q', sort='descending')
)

In [21]:
alt.Chart(data).mark_point(filled=True).encode(
    alt.X('gdpPercap:Q'),
    alt.Y('lifeExp:Q'),
    alt.Size('pop:Q', scale=alt.Scale(range=[0,1000])),
    alt.Color('continent:N'),
    alt.OpacityValue(1),
    alt.Order('pop:Q', sort='descending'),
    tooltip = [
        alt.Tooltip('country:N'),
        alt.Tooltip('year:O')
    ], 
).interactive()


### Column and Row Facets

In [22]:
alt.Chart(data).mark_point(filled=True).encode(
    alt.X('gdpPercap:Q'),
    alt.Y('lifeExp:Q'),
    alt.Size('pop:Q', scale=alt.Scale(range=[0,1000])),
    alt.Color('continent:N'),
    alt.OpacityValue(1),
    alt.Order('pop:Q', sort='descending'),
    column = 'year' 
)

In [23]:
alt.Chart(data).mark_point(filled=True).encode(
    alt.X('gdpPercap:Q'),
    alt.Y('lifeExp:Q'),
    alt.Size('pop:Q', scale=alt.Scale(range=[0,1000])),
    alt.Color('continent:N'),
    alt.OpacityValue(1),
    alt.Order('pop:Q', sort='descending'),
    column = 'year' 
).properties(width=135, height=135)  # size of each faceted subplot, in pixels

## Graphical Marks

Our exploration of encoding channels above exclusively uses `point` marks to visualize the data. However, the `point` mark type is only one of the many geometric shapes that can be used to visually represent data. Altair includes a number of built-in mark types, including:

- `mark_area()` - Filled areas defined by a top-line and a baseline.
- `mark_bar()` - Rectangular bars.
- `mark_circle()` - Scatter plot points as filled circles.
- `mark_line()` - Connected line segments.
- `mark_point()` - Scatter plot points with configurable shapes.
- `mark_rect()` - Filled rectangles, useful for heatmaps.
- `mark_rule()` - Vertical or horizontal lines spanning the axis.
- `mark_square()` - Scatter plot points as filled squares.
- `mark_text()` - Scatter plot points represented by text.
- `mark_tick()` - Vertical or horizontal tick marks.

For a complete list, and links to examples, see the [Altair marks documentation](https://altair-viz.github.io/user_guide/marks.html). Next, we will step through a number of the most commonly used mark types for statistical graphics.

### Point Marks

The `point` mark type conveys specific points, as in *scatter plots* and *dot plots*. In addition to `x` and `y` encoding channels (to specify 2D point positions), point marks can use `color`, `size`, and `shape` encodings to convey additional data fields.

Below is a dot plot of `lifeExp`, with the `continent` field redundantly encoded using both the `y` and `shape` channels.



In [24]:
alt.Chart(data).mark_point().encode(
    alt.X('lifeExp:Q'),
    alt.Y('continent:N'),
    alt.Shape('continent:N')
)

In addition to encoding channels, marks can be stylized by providing values to the `mark_*()` methods.

For example: point marks are drawn with stroked outlines by default, but can be specified to use `filled` shapes instead. Similarly, you can set a default `size` to set the total pixel area of the point mark.


In [25]:
alt.Chart(data).mark_point(filled=True, size=100).encode(
    alt.X('lifeExp:Q'),
    alt.Y('continent:N'),
    alt.Shape('continent:N')
)

### Circle Marks

The `circle` mark type is a convenient shorthand for `point` marks drawn as filled circles.

In [26]:
alt.Chart(data).mark_circle().encode(
    alt.X('lifeExp:Q'),
    alt.Y('continent:N'),
    alt.Shape('continent:N')
)

### Square Marks

The `square` mark type is a convenient shorthand for `point` marks drawn as filled squares.

In [27]:
alt.Chart(data).mark_square().encode(
    alt.X('lifeExp:Q'),
    alt.Y('continent:N'),
    alt.Shape('continent:N')
)

### Tick Marks

The `tick` mark type conveys a data point using a short line segment or "tick". These are particularly useful for comparing values along a single dimension with minimal overlap. A *dot plot* drawn with tick marks is sometimes referred to as a *strip plot*.

In [28]:
alt.Chart(data).mark_tick().encode(
    alt.X('lifeExp:Q'),
    alt.Y('continent:N'),
    alt.Shape('continent:N')
)

### Bar Marks

The `bar` mark type draws a rectangle with a position, width, and height.

The plot below draws bars of life expectancy (`lifeExp`) for each continent.

In [29]:
alt.Chart(data).mark_bar().encode(
    alt.X('lifeExp:Q'),
    alt.Y('continent:N'),
    #color='country'
)

In [30]:
data

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.853030
2,Afghanistan,Asia,1962,31.997,10267083,853.100710
3,Afghanistan,Asia,1967,34.020,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,Africa,1987,62.351,9216418,706.157306
1700,Zimbabwe,Africa,1992,60.377,10704340,693.420786
1701,Zimbabwe,Africa,1997,46.809,11404948,792.449960
1702,Zimbabwe,Africa,2002,39.989,11926563,672.038623


In [31]:
alt.Chart(data).mark_bar().encode(
    alt.X('median(lifeExp)'),
    alt.Y('continent:N'),
    #color='year'
    
)

In [32]:
alt.Chart(data).mark_bar().encode(
    alt.X('lifeExp:Q'),
    alt.Y('continent:N'),
    color='country'
)

In [33]:
alt.Chart(data).mark_bar().encode(
    alt.Y('lifeExp:Q'),
    alt.X('year:O'),
    color='continent'
)

In [34]:
alt.Chart(data).mark_bar().encode(
    # x and x2 give the start and end of each bar: the min-to-max range of lifeExp
    alt.X('min(lifeExp):Q'),
    alt.X2('max(lifeExp):Q'),
    alt.Y('continent:N')
)

### Line Marks

The `line` mark type connects plotted points with line segments, for example so that a line's slope conveys information about the rate of change.

Let's plot a line chart of average life expectancy (`lifeExp`) per continent over the years, using the full, unfiltered global development data frame.


In [35]:
alt.Chart(data).mark_line().encode(
    alt.X('year:O'),
    alt.Y('average(lifeExp):Q'),
    alt.Color('continent:N'),
).properties(
    width=400
)

We can see variation across continents, but an overall trend toward higher life expectancy over time. Also note that we set a custom width of 400 pixels. _Try changing (or removing) the width and see what happens!_

Let's change some of the default mark parameters to customize the plot, this time drawing one line per country and hiding the legend (`legend=None`). We can set the `strokeWidth` to determine the thickness of the lines and the `opacity` to add some transparency.

By default, the `line` mark uses straight line segments to connect data points. In some cases we might want to smooth the lines. We can adjust the interpolation used to connect data points by setting the `interpolate` mark parameter. Let's use `'monotone'` interpolation to provide smooth lines that are also guaranteed not to inadvertently generate "false" minimum or maximum values as a result of the interpolation.

In [36]:
alt.Chart(data).mark_line(
    strokeWidth=3,
    opacity=0.5,
    interpolate='monotone'
).encode(
    alt.X('year:O'),
    alt.Y('lifeExp:Q'),
    alt.Color('country:N', legend=None),
).properties(
    width=400
)


### Area Marks

The `area` mark type combines aspects of `line` and `bar` marks: it visualizes connections (slopes) among data points, but also shows a filled region, with one edge defaulting to a zero-valued baseline.

The chart below is an area chart of GDP per capita (`gdpPercap`) over time for just the United States:

In [37]:
dataUS = data.loc[data['country'] == 'United States']

alt.Chart(dataUS).mark_area().encode(
    alt.X('year:O'),
    alt.Y('gdpPercap:Q')
)

Similar to `line` marks, `area` marks support an `interpolate` parameter.

In [38]:
alt.Chart(dataUS).mark_area(interpolate='monotone').encode(
    alt.X('year:O'),
    alt.Y('gdpPercap:Q')
)

Similar to `bar` marks, `area` marks also support stacking. Here we create a new data frame with data for the three North American countries, then plot them using an `area` mark and a `color` encoding channel to stack by country.

In [39]:
# Keep only the rows for the three North American countries
dataNA = data.loc[data['country'].isin(['United States', 'Canada', 'Mexico'])]

alt.Chart(dataNA).mark_area().encode(
    alt.X('year:O'),
    alt.Y('gdpPercap:Q'),
    alt.Color('country:N')
)

By default, stacking is performed relative to a zero baseline. However, other `stack` options are available:

* `center` - to stack relative to a baseline in the center of the chart, creating a *streamgraph* visualization, and
* `normalize` - to normalize the summed data at each stacking point to 100%, enabling percentage comparisons.

Below we adapt the chart by setting the `y` encoding `stack` attribute to `center`. What happens if you instead set it to `normalize`?

In [40]:
alt.Chart(dataNA).mark_area().encode(
    alt.X('year:O'),
    alt.Y('gdpPercap:Q', stack='center'),
    alt.Color('country:N')
)

To disable stacking altogether, set the `stack` attribute to `None`. We can also add `opacity` as a default mark parameter to ensure we see the overlapping areas!

In [41]:
alt.Chart(dataNA).mark_area(opacity=0.5).encode(
    alt.X('year:O'),
    alt.Y('gdpPercap:Q', stack=None),
    alt.Color('country:N')
)

The `area` mark type also supports data-driven baselines, with both the upper and lower series determined by data fields. As with `bar` marks, we can use the `x` and `x2` (or `y` and `y2`) channels to provide end points for the area mark.

The chart below visualizes the range of minimum and maximum gdpPercap, per year, for North American countries:

In [42]:
alt.Chart(dataNA).mark_area().encode(
    alt.X('year:O'),
    alt.Y('min(gdpPercap):Q'),
    alt.Y2('max(gdpPercap):Q')
)