In [1]:
%matplotlib inline

In [2]:
import matplotlib.pyplot as plt

# Expressing Quantitative Relationships with Charts

We use charts to express quantitative relationships in our data (or find them!) by giving shape to those relationships. Surprisingly, there are only eight of them:


1. Time Series
2. Ranking
3. Part/Whole
4. Deviation
5. Distribution
6. Correlation
7. Geospatial
8. Nominal

Note that these relationships are essentially *messages* or *stories* we want to convey about our data. We might want to tell several stories about a particular variable, which implies *one chart per message*. Do not make a chart tell more than one story. This is the problem with the Playfair chart: it tells a Time Series story about the Exports and Imports of England and a Deviation story about the Trade Balance. These should be two charts.

Many of these concepts will be clearer when we start Exploratory Data Analysis (EDA), where we will take up these approaches in detail. This section is mostly to plant seeds for later.

For each of these basic relationships we're going to look at the concepts we use around them and talk about some chart designs for each one. For each, we will consider the following options:

1. (horizontal|vertical) bars
2. (horizontal|vertical) dots
3. points
4. lines
5. points and lines

These are combinations of the basic building blocks. We're going to look at the Nominal relationship first.

## Nominal

The concepts we normally associate with Nominal relationships include:

* categories
    
Generally, the following options are good for Nominal relationships:

1. horizontal bars|dots
2. vertical bars|dots
3. points

Sometimes horizontal bars are preferred to vertical bars if the labels are overly long. When plotting multiple Nominal relationships we will often add color or plot "small multiples" (to be discussed later). Do not add color to a single nominal relationship.
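
As a minimal sketch of the horizontal-bar option, the cell below plots a single nominal relationship with long category labels. The department names and counts are made up for illustration, and every bar uses the same neutral color.

```python
import matplotlib.pyplot as plt

# Made-up counts per category, for illustration only.
categories = ["Customer Support", "Research and Development", "Sales", "Operations"]
counts = [23, 41, 35, 18]

fig, ax = plt.subplots(figsize=(6, 3))
# Horizontal bars leave room for long labels; one color for all bars
# because this is a single nominal relationship.
bars = ax.barh(categories, counts, color="dimgray")
ax.set_xlabel("Count")
ax.set_title("Headcount by department (illustrative data)")
fig.tight_layout()
```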

## Time Series

The concepts we normally associate with Time Series include:

* change
* rise
* increase
* fluctuate
* grow
* decline
* decrease
* trend
    
For Time Series data, you can use:

1. vertical bars
2. points and lines

You can also use *boxes* which we have not yet shown. Boxes do not start at the x-axis as bars do. Instead they start at some minimum value and end at some maximum value. Boxes are good for showing low and high values for a particular observation (say, a stock market price per day).

If there is an obvious continuity in the data, then it is best to use points *and* lines. If the continuity is broken, you may need to use just points or a combination of points and lines. Sometimes it's better to use vertical bars.

*Never use horizontal bars or dots for Time Series*
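
The two Time Series options can be sketched side by side. The left panel uses points and lines for a continuous series; the right panel draws boxes from a low to a high value by passing `bottom=` to `ax.bar`, which is one possible way to build them. The daily prices are invented for illustration.

```python
import matplotlib.pyplot as plt

# Invented daily prices, for illustration only.
days = [1, 2, 3, 4, 5, 6]
close = [10.2, 10.8, 10.5, 11.4, 11.1, 11.9]
low = [9.8, 10.1, 10.0, 10.6, 10.7, 11.0]
high = [10.6, 11.0, 10.9, 11.6, 11.5, 12.1]

fig, (ax_line, ax_box) = plt.subplots(1, 2, figsize=(9, 3), sharex=True)

# Points and lines: time runs left to right on the x-axis.
ax_line.plot(days, close, marker="o", color="dimgray")
ax_line.set_title("Closing price (points and lines)")
ax_line.set_xlabel("Day")

# Boxes: each bar starts at the day's low and ends at the day's high,
# rather than starting at the x-axis.
heights = [h - l for h, l in zip(high, low)]
boxes = ax_box.bar(days, heights, bottom=low, width=0.5, color="dimgray")
ax_box.set_title("Low to high range (boxes)")
ax_box.set_xlabel("Day")
fig.tight_layout()
```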

## Ranking

The concepts we normally associate with Ranking include:

* larger than
* smaller than
* equal to
* greater than
* less than
    
Generally speaking, bar charts are the best choice for Rankings. You can sort the data either from highest to lowest (to emphasize the largest values) or from lowest to highest (to emphasize the smallest values).
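
A ranking chart is mostly a matter of sorting before plotting. The sketch below sorts made-up product values from highest to lowest and draws them as vertical bars; flipping `reverse` gives the opposite ordering.

```python
import matplotlib.pyplot as plt

# Made-up values per product, for illustration only.
products = ["A", "B", "C", "D", "E"]
values = [12, 30, 7, 22, 18]

# Sort (label, value) pairs by value, highest first.
ranked = sorted(zip(products, values), key=lambda pair: pair[1], reverse=True)
labels = [label for label, _ in ranked]
heights = [value for _, value in ranked]

fig, ax = plt.subplots(figsize=(5, 3))
ax.bar(labels, heights, color="dimgray")
ax.set_ylabel("Units sold")
ax.set_title("Products ranked by units sold (illustrative data)")
fig.tight_layout()
```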

## Part/Whole

The concepts we normally associate with Part/Whole include:

* rate or rate of total
* percent or percentage of total
* share
* accounts for *x* percent
    
We've already ruled out Pie Charts for showing Part/Whole relationships. Noting our problems with comparing non-aligned positions, we're going to suggest bar charts as the best way to represent Part/Whole relationships, either vertical or horizontal.
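
As a sketch of the bar-chart alternative to a pie chart, the cell below converts made-up regional sales into percentages of the total and plots them as bars on a common baseline, so the shares can be compared by aligned position.

```python
import matplotlib.pyplot as plt

# Made-up sales per region, for illustration only.
regions = ["North", "South", "East", "West"]
sales = [120, 80, 60, 140]

# Each part expressed as a percentage of the whole.
total = sum(sales)
shares = [100 * s / total for s in sales]

fig, ax = plt.subplots(figsize=(5, 3))
ax.bar(regions, shares, color="dimgray")
ax.set_ylabel("Percent of total sales")
ax.set_title("Share of sales by region (illustrative data)")
fig.tight_layout()
```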

## Deviation

The concepts we normally associate with Deviation include:

* plus or minus
* variance
* difference
* relative to
    
Deviation stories should always be told *as the deviation*: plot the difference itself rather than making the viewer calculate it in their head.

The effective displays for deviations are:

* bars (either horizontal or vertical)
* lines (when the deviations are from a Time Series)
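
To illustrate plotting the deviation itself, the sketch below subtracts a reference value from the observed values and charts only the differences around a zero line. The monthly actual and budget figures are invented for illustration.

```python
import matplotlib.pyplot as plt

# Invented monthly figures, for illustration only.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
actual = [105, 98, 110, 93, 101, 89]
budget = [100, 100, 100, 100, 100, 100]

# Compute the deviation up front so the viewer does not have to.
deviations = [a - b for a, b in zip(actual, budget)]

fig, ax = plt.subplots(figsize=(6, 3))
ax.bar(months, deviations, color="dimgray")
ax.axhline(0, color="black", linewidth=0.8)  # reference line at zero deviation
ax.set_ylabel("Actual minus budget")
ax.set_title("Deviation from budget (illustrative data)")
fig.tight_layout()
```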

## Distribution

The concepts we normally associate with Distribution include:

* frequency
* distribution
* range
* concentration
* normal curve/distribution (and others)
    
We will have a lot to say about distributions later, in the EDA module. In general, we can use:

* vertical bars (as a *histogram*)
* lines (as a *frequency polygon*)
* dots
* boxes (as a *box and whiskers* plot)

as well as combinations of the above to effectively display distributions.
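
Two of these options can be sketched with matplotlib's built-in `hist` and `boxplot` methods. The data here is simulated (500 draws from a normal distribution with an arbitrary mean and spread), so the shapes are only illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated measurements, for illustration only.
rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=500)

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(9, 3))

# Vertical bars as a histogram: counts of observations per bin.
counts, bin_edges, _ = ax_hist.hist(values, bins=20, color="dimgray")
ax_hist.set_title("Histogram")
ax_hist.set_xlabel("Value")
ax_hist.set_ylabel("Frequency")

# Boxes as a box-and-whiskers plot of the same data.
ax_box.boxplot(values)
ax_box.set_title("Box and whiskers")
ax_box.set_ylabel("Value")
fig.tight_layout()
```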

## Correlation

The concepts we normally associate with Correlation include:

* increases with
* decreases with
* changes with
* varies with
* caused by
* affected by
* follows

The canonical representation of a correlation is to plot the numerical values on a scatter plot (XY-plot) and add either a linear or LOESS trend line. Another approach is to plot two side-by-side horizontal bar charts, called a *table lens*.
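
A minimal sketch of the scatter plot with a linear trend line follows. The data is simulated with a known positive relationship, and the line is fit with `np.polyfit`, which is one simple choice; a LOESS line would need an additional library and is not shown.

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated data with a positive linear relationship, for illustration only.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + rng.normal(0, 1.0, size=50)

# Least-squares fit of a degree-1 polynomial: y is approximately slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)
x_line = np.linspace(x.min(), x.max(), 100)

fig, ax = plt.subplots(figsize=(5, 4))
ax.scatter(x, y, color="dimgray", alpha=0.7)
ax.plot(x_line, slope * x_line + intercept, color="black")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Scatter plot with linear trend (simulated data)")
fig.tight_layout()
```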

## Geospatial

The concepts we normally associate with Geospatial relationships include:

* geography
* location
* where
* region
* territory
* country
* state/province
* county/borough/parish
* city
    
In this particular case, the forms (dots, lines, bars) are generally drawn or plotted on a map. When working with location data, you should consider whether or not the *location* is actually important to the story. For example, do people already know where the 50 US states are in relation to each other? Is it really necessary to put this on a map rather than in a table? Unless geography is part of the story, consider whether your visualization really requires a map.

If you determine that a map is integral to the story, then possible solutions include:

* points of varying size
* points or areas of varying color *intensity*
* color intensities applied directly to geographical areas (this was the option Cleveland disliked...notice that not everyone is in agreement!)
* lines of varying thickness or color intensity.

One type of chart that we haven't talked about is the *Heat Map*. A Heat Map can be considered a Geospatial chart in an abstract space where color is often used to represent a 3rd dimension.
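
As a rough sketch of a Heat Map, the cell below uses `imshow` to color each cell of a small grid by its value, so that color carries the third dimension. The grid values and the row and column labels are invented for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Invented values on a 3 x 4 grid, for illustration only.
grid = np.array([
    [1, 3, 5, 7],
    [2, 4, 6, 8],
    [3, 5, 7, 9],
])
row_labels = ["Low", "Medium", "High"]
col_labels = ["Q1", "Q2", "Q3", "Q4"]

fig, ax = plt.subplots(figsize=(5, 3))
# Position encodes the two grid dimensions; color intensity encodes the value.
image = ax.imshow(grid, cmap="viridis")
fig.colorbar(image, ax=ax, label="Value")
ax.set_xticks(range(len(col_labels)))
ax.set_xticklabels(col_labels)
ax.set_yticks(range(len(row_labels)))
ax.set_yticklabels(row_labels)
ax.set_title("Heat map (illustrative data)")
fig.tight_layout()
```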