**SA433 &#x25aa; Data Wrangling and Visualization &#x25aa; Fall 2024**

# Lesson 6. Customizing Visualizations &mdash; Scales, Axes, Legends, Titles

## In this lesson...

- Once you have the basics of a useful visualization, you'll often want to customize it


- For example, you may want to adjust the **scale** used by an encoding channel to map the data values to visual values (e.g. position, color)


- You may also want to tweak the guides that allow your readers to decode your visualization, such as the **axes**, **legends**, and **title**


- In addition, you may want to adjust the **size** of your visualization


- In this lesson, we will explore how to perform these customizations in Altair

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## The antibiotics dataset

- First, let's import Pandas and Altair:

In [None]:
import pandas as pd
import altair as alt

- After World War II, antibiotics were considered "wonder drugs", as they were an easy remedy for what had been intractable ailments


- In the `data` folder next to this notebook, the CSV file `data/antibiotics.csv` contains the performance of the 3 most popular antibiotics on 16 bacteria


- Let's read in the data:

In [None]:
df = pd.read_csv('data/antibiotics.csv')

- This dataset is small, so let's just look at it in its entirety:

In [None]:
df

- Each row corresponds to one `Bacteria` strain:
    - We have the [minimum inhibitory concentration (MIC)](https://en.wikipedia.org/wiki/Minimum_inhibitory_concentration) for `Penicillin`, `Streptomycin`, and `Neomycin`
        - MIC measures the concentration of antibiotic (in micrograms per milliliter) required to prevent growth in vitro
        - So, smaller is better
    - We also have the reaction of the bacteria to `Gram_Staining`, [described here](https://en.wikipedia.org/wiki/Gram_stain)
    - Finally, we have the `Genus` of the bacteria

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Adjusting the scale type of an encoding channel

- Let's start by looking at a simple dot plot of the MIC for Neomycin:

In [None]:
alt.Chart(df).mark_circle().encode(
    alt.X('Neomycin:Q')
)

- We can see that the MIC values cluster on the left, with a few large outliers to the right


- By default, Altair uses a linear mapping between the data values (in this case, MIC) and the visual values (pixels)


- To get a better view of the data, we can tell Altair to use a different scale


- For example, we could use a **square root scale**, which compresses larger numbers into a smaller amount of space:

<img src="img/sqrt.svg" width=650/>


- We can tell an encoding channel to use a different scale with the `.scale()` method
    - `.scale()` takes the same arguments as `alt.Scale()`
    - `alt.Scale()` is an object that defines various scale properties
    - [Here is the documentation for `alt.Scale()`](https://altair-viz.github.io/user_guide/generated/core/altair.Scale.html)

- So, to tell the `X` encoding channel to use a square root (`sqrt`) scale `type`, we can do this:

- The points on the left are now better differentiated, but the data is still heavily skewed to the left
 

- We could try using a **logarithmic scale** instead, which has a similar but more pronounced effect than a square root scale:

<img src="img/log.svg" width=700 />

- To have the `X` encoding channel use the logarithmic (`log`) scale `type`, we can do this:

- Now the data is much more evenly distributed visually


- As a result, we can better see the differences in Neomycin concentrations required for different bacteria

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Changing the sort order of an encoding channel

- Remember that *lower* MIC indicates higher effectiveness


- However, some people may expect "better" values to be "up and to the right"

- We can specify a **sort order** for an encoding channel with `.sort('ascending')` or `.sort('descending')`
    - The `.sort()` method can take other arguments &mdash; we'll see these later

- So, in our dot plot, we can ask the `X` encoding to sort values in descending order, like this:

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Specifying the title of an encoding channel

- One could argue that our chart is starting to get confusing:
    - the axis uses a logarithmic scale
    - the axis is in the reverse direction
    - the axis does not clearly indicate what the units are

- We can specify a title for an encoding channel with the `.title()` method


- The title of an encoding channel is used as the title for any corresponding axes or legends


- For example, let's make the axis title of our dot plot more informative:

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Adjusting axis gridlines, ticks and labels

- By default, Altair places the x-axis along the bottom of the chart

- We can tell an encoding channel to change an axis's gridlines, ticks, or labels with the `.axis()` method 
    - `.axis()` takes the same arguments as `alt.Axis()`
    - `alt.Axis()` is an object that defines various axis properties, such as gridlines, ticks, labels
    - [Here's the documentation for `alt.Axis()`](https://altair-viz.github.io/user_guide/generated/core/altair.Axis.html)

- To change the placement of the x-axis, we can use the `orient='top'` keyword argument of `.axis()` like this:

- Similarly, the y-axis defaults to a `'left'` orientation, but can be set to `'right'`


- Now suppose we want to compare neomycin with another antibiotic, like penicillin


- We can create a scatter plot by adding a `Y` encoding for penicillin that mirrors the design of our x-axis for neomycin, like this:

In [None]:
alt.Chart(df).mark_circle().encode(
    alt.X('Neomycin:Q')
        .sort('descending')
        .scale(type='log')
        .title('Neomycin MIC (micrograms/ml, reverse log scale)'),
    alt.Y('Penicillin:Q')
        .sort('descending')
        .scale(type='log')
        .title('Penicillin MIC (micrograms/ml, reverse log scale)')
)

- We can see a differentiated response: some bacteria respond well to neomycin but not penicillin, and vice versa


- While this plot is useful, we can make it better


- For example, the grid lines are rather dense...


- If we want to remove grid lines altogether, we can specify `grid=False` in `.axis()`

- We can *reduce* the number of grid lines and tick marks, by specifying a target `tickCount` in `.axis()`
    - `tickCount` is a *suggestion* to Altair, considered alongside other aspects that make a nice visualization
    - We may not get *exactly* the number of tick marks we request, but we should get something close

- For example, if we want roughly 5 grid lines and tick marks, we can do so like this:

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Adjusting the domain

- Note that in our chart of neomycin MIC vs. penicillin MIC above, the x and y axes use the same units, but have different domains
    - This could be misleading...

- We can adjust the scale of the `X` and `Y` encoding channels so that they have matching domains using the `domain=...` keyword argument of `.scale()`, like this:

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Configuring color scales

- In the scatter plots we created above, we saw that neomycin is more effective for some bacteria, while penicillin is more effective for others


- However, we can't tell *which* bacteria respond better to neomycin vs. penicillin!


- Gram staining serves as a diagnostic for discriminating classes of bacteria, so let's take our existing scatter plot and map `Gram_Staining` to the `Color` encoding channel:

- Now we can see clearly that Gram-positive bacteria seem most susceptible to penicillin, whereas Gram-negative bacteria are more susceptible to neomycin!


- Note that we specified `Gram_Staining` as a nominal data type


- The color scheme above was automatically chosen by Altair to provide perceptually distinguishable colors for nominal comparisons (i.e., equal or not equal)


- However, we might wish to customize the colors used


- This is another *scale* adjustment: how we map data values to visual values


- It turns out that Gram staining results in [distinctive physical colorings: pink for Gram-negative, purple for Gram-positive](https://en.wikipedia.org/wiki/Gram_stain#/media/File:Gram_stain_01.jpg)

- Let's use those colors by specifying an explicit discrete scale mapping with the `domain=...` and `range=...` keyword arguments of `.scale()`
    - `domain` will be a list of data values
    - `range` will be a list of corresponding colors

- Any valid **CSS color string** can be used to specify a color to Altair
    - For example, `'hotpink'` is a CSS color string
    - `'#ff69b4'` and `'rgb(255, 105, 180)'` are also CSS color strings that produce the same `'hotpink'` color
    - [Here's a tutorial on CSS color strings](https://www.w3schools.com/colors/default.asp)
    - [Here's a website that lets you pick a color and get the corresponding CSS color string](https://hashtagcolor.com/)

- By default, legends are placed on the right side of the chart

- We can configure the legend for an encoding channel with the `.legend()` method
    - `.legend()` takes the same arguments as `alt.Legend()`
    - `alt.Legend()` is an object defining properties of the legend
    - [Here's the documentation for `alt.Legend()`](https://altair-viz.github.io/user_guide/generated/core/altair.Legend.html)

- We can change the legend orientation using the `orient=...` keyword argument of `.legend()`:

- We can also remove an encoding channel's legend entirely by passing `None` to the `.legend()` method:

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Using built-in color schemes

- Let's modify our scatter plot to encode `Bacteria` with the `Color` channel, so that we can directly see the effects of neomycin and penicillin on the different bacteria in the dataset:

In [None]:
alt.Chart(df).mark_circle().encode(
    alt.X('Neomycin:Q')
        .sort('descending')
        .scale(type='log', domain=[0.001, 1000])
        .axis(tickCount=5)
        .title('Neomycin MIC (micrograms/ml, reverse log scale)'),
    alt.Y('Penicillin:Q')
        .sort('descending')
        .scale(type='log', domain=[0.001, 1000])
        .axis(tickCount=5)
        .title('Penicillin MIC (micrograms/ml, reverse log scale)'),
    alt.Color('Bacteria:N')
)

- This is not good... if we look carefully at the color legend, we see that the colors start to repeat!


- This is because Altair's default color scheme for nominal data only has 10 colors


- We can fix this by explicitly specifying `domain` and `range` values in the `.scale()` method for the `Color` encoding, like we did above


- An easier option is to use an alternative **color scheme**

- Altair includes a variety of built-in color schemes, derived from Vega
    - For a complete list, see the [Vega color scheme documentation](https://vega.github.io/vega/docs/schemes/#reference)

- Let's try switching to a built-in 20-color scheme, `tableau20`, and set that using the `scheme=...` keyword argument of `.scale()`

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Chart-wide properties

- There are some chart-wide properties that are good to know about


- These are changed with the `.properties()` method of the `Chart` object


- We've seen the `width=...` and `height=...` keyword arguments before


- We can also set a chart-wide title with the `title=...` keyword argument, which can then be further customized using the `.configure_title()` method of the `Chart` object

- Let's put some finishing touches on our scatter plot:
    - Set the width and height of the chart to 300 pixels
    - Add a title, left-justified, with a slightly bigger font, and slightly higher up than the default placement

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## More on the sort order of an encoding channel

- Consider the following bar chart, showing the effects of streptomycin on the different bacteria strains in the dataset:

In [None]:
alt.Chart(df).mark_bar().encode(
    alt.Y('Bacteria:N'),
    alt.X('Streptomycin:Q')
)

- Visually, it would be nice to order the bacteria so that those streptomycin is most effective against (lowest streptomycin MIC) are at the top, and those it is least effective against (highest streptomycin MIC) are at the bottom

- We can tell an encoding channel to sort by passing a _dictionary_ to the `.sort()` method with two key-value pairs:
    1. `'encoding': ...` specifies the encoding to sort by (use *lowercase* strings to specify the encoding)
    2. `'order': ...` specifies whether to sort in ascending or descending order

- We can sort the `Y` encoding by ascending values of the `X` encoding, like this:

- We can also tell an encoding channel to sort by a variable (a **field**, in Altair parlance) by passing a dictionary to the `.sort()` method with two key-value pairs:
    1. `'field': ...` specifies the variable to sort by
    2. `'order': ...` specifies whether to sort in ascending or descending order

- So, we can also sort the `Y` encoding based on descending values of `Streptomycin`, like this:

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Problems

### Problem 1

The file `data/gapminder.csv` contains the Gapminder data we used in Lessons 2 and 3.

In Problem 1 of Lesson 3, you were asked to create a scatter plot of life expectancy vs. population in the year 2000. Recreate that scatter plot, first using a filter transform to only include data from the year 2000. Then, use the techniques in this lesson to:

- Change the scale of the horizontal axis so that the points are easier to discern.
- Provide descriptive titles for the axes and legend.
- Put the legend at the bottom of the chart. *Hint.* Look at the documentation for `alt.Legend()` to see what values are valid for the `orient=...` keyword argument.

### Problem 2

Modify the scatter plot you created in Problem 1 so that the different regions are differentiated by color instead of shape. Use the following colors:

| Region                     | CSS color string |
| :-                         | :-               |
| Americas                   | `#7fc97f`        |
| East Asia & Pacific        | `#beaed4`        |
| Europe & Central Asia      | `#fdc086`        |
| Middle East & North Africa | `#ffff99`        |
| South Asia                 | `#386cb0`        |
| Sub-Saharan Africa         | `#f0027f`        |

*FYI.* This color scheme is generated by [this website](https://colorbrewer2.org/). Poke around and play with different color schemes!

### Problem 3

The file `data/movies.csv` in this folder contains the movies dataset we used in Lesson 5.

In Lesson 5, we created a bar chart showing the top 20 directors by total worldwide gross revenues, with the bar lengths corresponding to these revenue values. Recreate that bar chart. Use the techniques in this lesson to:

- Sort the directors by total worldwide gross revenues in descending order.
- Provide descriptive titles for the axes.
- Provide a descriptive title for the chart overall.

### Problem 4

Look at the Vega color scheme documentation linked above. Note that the categorical color schemes are more appropriate for nominal data types, while the sequential single-hue, sequential multi-hue, diverging, and cyclical color schemes are more appropriate for ordinal data types.

Take the scatter plot of `Neomycin` vs `Penicillin` we created in this lesson (code copied-and-pasted below) and play around with the different color schemes. Try `Bacteria` as both a nominal and ordinal data type.

In [None]:
# Play around with this code!
alt.Chart(df).mark_circle().encode(
    alt.X('Neomycin:Q')
        .sort('descending')
        .scale(type='log', domain=[0.001, 1000])
        .axis(tickCount=5)
        .title('Neomycin MIC (micrograms/ml, reverse log scale)'),
    alt.Y('Penicillin:Q')
        .sort('descending')
        .scale(type='log', domain=[0.001, 1000])
        .axis(tickCount=5)
        .title('Penicillin MIC (micrograms/ml, reverse log scale)'),
    alt.Color('Bacteria:N').scale(scheme='tableau20')
).properties(
    width=300,
    height=300,
    title='MIC values of penicillin and neomycin'
).configure_title(
    fontSize=16,
    anchor='start',
    dy=-10
)

### Problem 5

Let's go back to the movies dataset. In Problem 4 of Lesson 4, you were asked to create a heatmap that shows the relationship between the running times and the IMDB ratings of the movies in the dataset. Recreate this heatmap. Use the techniques in this lesson to:

- Provide descriptive titles for the axes and legend.
- Use an alternative built-in color scheme. Take a look at the Vega color scheme documentation linked above, and try a few out. Which type of color scheme seems most appropriate: sequential single-hue, sequential multi-hue, diverging, or cyclical? 

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Notes and sources

- These lesson notes are based on the [Visualization Curriculum](https://uwdata.github.io/visualization-curriculum/) by the University of Washington


- [Customizing Visualizations](https://altair-viz.github.io/user_guide/customization.html) from the Altair documentation