# Case Study: Anscombe's Quartet

In this case study, we take a look at the famous [Anscombe's quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet) example. It was constructed by the English statistician Francis Anscombe. The dataset consists of 4 data series that are nearly identical in typical summary statistics such as mean and variance. Yet, when plotted, these 4 data series are visually distinct.

We will start from scratch and generate a faceted plot of Anscombe's quartet using [Altair](https://altair-viz.github.io/index.html). Let's start by loading the dataset!

## Data

In [1]:
from vega_datasets import data
import altair as alt

In [2]:
# Set up the raw data for Anscombe's quartet
anscombe = data.anscombe()
anscombe.info()
anscombe.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44 entries, 0 to 43
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Series  44 non-null     object 
 1   X       44 non-null     int64  
 2   Y       44 non-null     float64
dtypes: float64(1), int64(1), object(1)
memory usage: 1.2+ KB


|   | Series | X  | Y    |
|---|--------|----|------|
| 0 | I      | 10 | 8.04 |
| 1 | I      | 8  | 6.95 |
| 2 | I      | 13 | 7.58 |
| 3 | I      | 9  | 8.81 |
| 4 | I      | 11 | 8.33 |


Note that the dataset consists of 3 columns: "Series", "X" and "Y". Each row represents a data point (a.k.a. an observation). The "Series" column groups the data points into 4 series. Let's take a closer look at each series.

In [3]:
series_I = anscombe[anscombe.Series == 'I']
series_II = anscombe[anscombe.Series == 'II']
print(series_I.describe())
print(series_II.describe())

               X         Y
count  11.000000  11.00000
mean    9.000000   7.50000
std     3.316625   2.03289
min     4.000000   4.26000
25%     6.500000   6.31500
50%     9.000000   7.58000
75%    11.500000   8.57000
max    14.000000  10.84000
               X          Y
count  11.000000  11.000000
mean    9.000000   7.500909
std     3.316625   2.031657
min     4.000000   3.100000
25%     6.500000   6.695000
50%     9.000000   8.140000
75%    11.500000   8.950000
max    14.000000   9.260000


Looking at the statistics for the first 2 series, we can see that their means and standard deviations are indeed nearly identical. Feel free to modify the code and check the other 2 series.
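
Rather than repeating the filter for each series, the same comparison can be made for all four at once with a group-by. The following is a minimal sketch, assuming a DataFrame with the same "Series", "X" and "Y" columns as `anscombe`. The helper name `summarize_by_series` and the toy data are made up for illustration.

```python
import pandas as pd


def summarize_by_series(df):
    """Return the mean and standard deviation of X and Y for each series.

    Assumes `df` has the columns "Series", "X" and "Y", like `anscombe`.
    """
    return df.groupby("Series")[["X", "Y"]].agg(["mean", "std"])


# Toy data (hypothetical, NOT the real quartet), just to show the output shape.
toy = pd.DataFrame({
    "Series": ["A", "A", "A", "B", "B", "B"],
    "X": [1, 2, 3, 1, 2, 3],
    "Y": [2.0, 4.0, 6.0, 6.0, 4.0, 2.0],
})
summary = summarize_by_series(toy)
print(summary)
```

In the notebook, `summarize_by_series(anscombe)` should give one row per series, which makes it easy to see how close the four sets of statistics are.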

## Visualization

Now, let's plot the data. Since the dataset contains an "X" column and a "Y" column, as a first step we can plot the data points as dots (i.e. as a scatter plot) and see how they look. In the language of the grammar of graphics, that means we select the circle as the "mark" and encode the "X" and "Y" columns as the x and y coordinates.

### First try

In [4]:
chart = alt.Chart(anscombe)
chart = chart.mark_circle(size=50)
chart = chart.encode(x='X', y='Y')
chart

Here we have the scatter plot! Notice that I passed a `size=50` argument to the `mark_circle` method to adjust the size property of the circles. You can find out more about the other mark-related properties we can set [here](https://altair-viz.github.io/user_guide/marks.html#mark-properties). Feel free to play around!

While this plot is good, it contains no information about the series! Let's fix that by encoding series information as color:

In [5]:
chart = chart.encode(color='Series:N')
chart

### Faceted chart

So far so good! However, Anscombe's quartet is often visualized as 4 subplots sharing the same scale in x and y. This is an example of a common visualization technique called [small multiples](https://en.wikipedia.org/wiki/Small_multiple) that we will be using very often in this course. Altair can achieve this using a [faceted chart](https://altair-viz.github.io/user_guide/compound_charts.html#facet-chart):

In [6]:
# Create the faceted chart.
# Note: width and height must be set before faceting.
chart_faceted = chart.properties(width=200, height=200)
chart_faceted = chart_faceted.facet(columns=2, facet=alt.Facet('Series'))
chart_faceted

With a faceted chart, each data series has its own subplot. All subplots share the same scale in x and y, making it easy to compare across subplots. Altair also provides a label for each subplot based on the value of the "Series" column in our dataset.

### With regression

To really illustrate the statistical properties of the Anscombe's quartet dataset, let's fit a line to each of the data series. Altair supports this with the [regression transform](https://altair-viz.github.io/user_guide/transform/regression.html#user-guide-regression-transform). Let's start by adding a regression line to the simple scatter plot we had before.

In [7]:
chart_regression = chart
chart_regression = chart_regression.transform_regression('X', 'Y', 
                                                         method='linear', 
                                                         extent=[0,20], 
                                                         groupby=['Series'])
chart_regression = chart_regression.mark_line()
chart + chart_regression

In the code above, `chart` is still the good old scatter plot. `chart_regression` is the line plot generated by fitting a line to each data series. It really consists of 4 lines. Because each data series has nearly identical statistical properties, all 4 lines look the same. Note that Altair allows us to overlay the line plot (`chart_regression`) on top of the scatter plot (`chart`) by "addition" (i.e. `chart + chart_regression`)! This is called a [layered chart](https://altair-viz.github.io/user_guide/compound_charts.html#layer-chart). Now, let's add facets to this layered chart.

In [8]:
layered_chart = chart + chart_regression
layered_chart = layered_chart.properties(width=200, height=200) # Adjust size before faceting.
layered_chart = layered_chart.facet(columns=2, facet=alt.Facet('Series'))
layered_chart
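
The fitted lines can also be checked numerically. The sketch below assumes that the `linear` regression method corresponds to an ordinary least-squares fit and reproduces that fit per series with NumPy. The helper name `fit_line_by_series` and the toy data are made up for illustration.

```python
import numpy as np
import pandas as pd


def fit_line_by_series(df):
    """Fit Y = slope * X + intercept to each series by ordinary least squares.

    Assumes `df` has the columns "Series", "X" and "Y", like `anscombe`.
    Returns a dict mapping series name -> (slope, intercept).
    """
    fits = {}
    for name, group in df.groupby("Series"):
        # A degree-1 polynomial fit returns the coefficients (slope, intercept).
        slope, intercept = np.polyfit(group["X"], group["Y"], deg=1)
        fits[name] = (float(slope), float(intercept))
    return fits


# Toy data (hypothetical, NOT the real quartet): two exact straight lines.
toy = pd.DataFrame({
    "Series": ["A"] * 3 + ["B"] * 3,
    "X": [0, 1, 2, 0, 1, 2],
    "Y": [1.0, 3.0, 5.0, 4.0, 3.0, 2.0],
})
fits = fit_line_by_series(toy)
print(fits)
```

Calling `fit_line_by_series(anscombe)` in the notebook should return four nearly identical pairs, close to the slope of 0.5 and intercept of 3 usually quoted for the quartet.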

## Summary

Bravo! We have completed this case study. Here is a quick summary of the key points we touched on:
* [Anscombe's quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet) is an interesting dataset. The summary statistics of the data series are nearly the same, yet the series look very different when plotted.
* [Altair](https://altair-viz.github.io/index.html) (and the grammar of graphics in general) is very flexible and allows one to build up a chart incrementally.
* Altair's [faceted chart](https://altair-viz.github.io/user_guide/compound_charts.html#faceted-charts) can be used to create small multiple visualizations.
* Altair's [layered chart](https://altair-viz.github.io/user_guide/compound_charts.html#layered-charts) allows one to layer one chart on top of another.