# Case Study: Anscombe's Quartet

In this case study, we illustrate some common dataframe manipulation techniques using Anscombe's quartet as a toy dataset. Anscombe's quartet is a collection of four series of 2D points. All four series have nearly the same mean and standard deviation in both the X and Y coordinates, yet they look very different when plotted. You can find more about Anscombe's quartet [here](https://en.wikipedia.org/wiki/Anscombe%27s_quartet).

## First look at the data

Let us first take a look at the data of Anscombe's quartet provided by [vega datasets](https://github.com/vega/vega-datasets).

In [113]:
from vega_datasets import data
import pandas as pd

anscombe = data.anscombe()
anscombe

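| | Series | X | Y |
|---|---|---|---|
| 0 | I | 10 | 8.04 |
| 1 | I | 8 | 6.95 |
| 2 | I | 13 | 7.58 |
| 3 | I | 9 | 8.81 |
| 4 | I | 11 | 8.33 |
| 5 | I | 14 | 9.96 |
| 6 | I | 6 | 7.24 |
| 7 | I | 4 | 4.26 |
| 8 | I | 12 | 10.84 |
| 9 | I | 7 | 4.81 |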


Here, Anscombe's quartet is in long form. The "Series" column labels the series each point belongs to, and the "X" and "Y" columns provide the x and y coordinates. Let us verify the claimed property on mean and standard deviation for each series.

In [114]:
anscombe.groupby("Series").mean()

| Series | X | Y |
|---|---|---|
| I | 9.0 | 7.5 |
| II | 9.0 | 7.500909 |
| III | 9.0 | 7.5 |
| IV | 9.0 | 7.500909 |


In [115]:
anscombe.groupby("Series").std()

| Series | X | Y |
|---|---|---|
| I | 3.316625 | 2.03289 |
| II | 3.316625 | 2.031657 |
| III | 3.316625 | 2.030424 |
| IV | 3.316625 | 2.030579 |


We used the `groupby` method to group the rows of each series and compute the corresponding mean and standard deviation. Indeed, the X statistics are identical across series, and the Y statistics agree to about two decimal places.
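
Both statistics can also be computed in a single call by passing a list of aggregations to `agg`. The sketch below uses a small made-up data frame (`toy`, not the real Anscombe values) with the same long-form layout, so it runs without `vega_datasets`.

```python
import pandas as pd

# Hypothetical toy data in the same long form as `anscombe` (values are made up).
toy = pd.DataFrame({
    "Series": ["A", "A", "A", "B", "B", "B"],
    "X": [1.0, 2.0, 3.0, 2.0, 4.0, 6.0],
    "Y": [2.0, 4.0, 6.0, 1.0, 1.0, 4.0],
})

# One groupby call, two aggregations applied to every remaining column.
summary = toy.groupby("Series").agg(["mean", "std"])
print(summary)
```

The result has one row per series and multi-level columns such as `("X", "mean")` and `("Y", "std")`, which makes it easy to compare the series side by side.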

## Splitting the data frame

It is easy to extract the data frame associated with a single series by row filtering. For example, the following extracts series I. One can repeat this step for each of the other series.

In [116]:
anscombe_I = anscombe[anscombe.Series == 'I']
anscombe_I

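| | Series | X | Y |
|---|---|---|---|
| 0 | I | 10 | 8.04 |
| 1 | I | 8 | 6.95 |
| 2 | I | 13 | 7.58 |
| 3 | I | 9 | 8.81 |
| 4 | I | 11 | 8.33 |
| 5 | I | 14 | 9.96 |
| 6 | I | 6 | 7.24 |
| 7 | I | 4 | 4.26 |
| 8 | I | 12 | 10.84 |
| 9 | I | 7 | 4.81 |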
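
Rather than repeating the filter by hand for every series, one possible shortcut is to iterate over a `groupby` object, which yields one sub-frame per label. This is a minimal sketch on a made-up frame (`toy` and `parts` are illustrative names, not part of the original notebook).

```python
import pandas as pd

# Hypothetical toy data in long form (values are made up).
toy = pd.DataFrame({
    "Series": ["A", "A", "B", "B", "B"],
    "X": [1, 2, 3, 4, 5],
    "Y": [0.5, 1.5, 2.5, 3.5, 4.5],
})

# Iterating over a groupby yields (label, sub-frame) pairs,
# so a dict comprehension collects every series in one pass.
parts = {label: frame for label, frame in toy.groupby("Series")}

print(parts["B"])
```

Each value in `parts` holds the same rows that the corresponding row filter, e.g. `toy[toy.Series == "B"]`, would select.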


## Converting to wide form

While long form is often preferred for data visualization, it is sometimes necessary to convert the data frame to wide form. This is done with the `pivot` function from pandas. Let's give it a try.

In [117]:
anscombe.pivot(columns="Series")

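| | X | X | X | X | Y | Y | Y | Y |
|---|---|---|---|---|---|---|---|---|
| Series | I | II | III | IV | I | II | III | IV |
| 0 | 10.0 | NaN | NaN | NaN | 8.04 | NaN | NaN | NaN |
| 1 | 8.0 | NaN | NaN | NaN | 6.95 | NaN | NaN | NaN |
| 2 | 13.0 | NaN | NaN | NaN | 7.58 | NaN | NaN | NaN |
| 3 | 9.0 | NaN | NaN | NaN | 8.81 | NaN | NaN | NaN |
| 4 | 11.0 | NaN | NaN | NaN | 8.33 | NaN | NaN | NaN |
| 5 | 14.0 | NaN | NaN | NaN | 9.96 | NaN | NaN | NaN |
| 6 | 6.0 | NaN | NaN | NaN | 7.24 | NaN | NaN | NaN |
| 7 | 4.0 | NaN | NaN | NaN | 4.26 | NaN | NaN | NaN |
| 8 | 12.0 | NaN | NaN | NaN | 10.84 | NaN | NaN | NaN |
| 9 | 7.0 | NaN | NaN | NaN | 4.81 | NaN | NaN | NaN |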


Hmm, that is not what we expected: the output is full of `NaN`s. On the positive side, the columns look right. The columns use a pandas [multi-level index](https://pandas.pydata.org/docs/user_guide/advanced.html#), where the first level determines whether the data is associated with the "X" or "Y" coordinates, and the second level determines the series the data come from.

Reading the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.pivot.html#pandas.pivot) of `pivot` in more detail, we notice that `pivot` relies on the `index` argument to determine whether two rows of the original data frame should be mapped to the same row in the output data frame. By default, it uses the row index of the input data frame (i.e. `anscombe.index`). Since the row indices are all unique in the `anscombe` dataset, `pivot` mapped each input row to its own output row.
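
The effect of the `index` argument is easy to reproduce on a tiny example. The sketch below uses a made-up four-row frame (`toy`, `sparse`, and `dense` are illustrative names) and passes `values="X"` only to keep the output columns flat.

```python
import pandas as pd

# Hypothetical toy data: two series ("A", "B") with two points each.
toy = pd.DataFrame({
    "Series": ["A", "A", "B", "B"],
    "X": [1, 2, 3, 4],
})

# Default index: the row labels 0..3 are all distinct, so every input row
# becomes its own output row and the other series' cell is left as NaN.
sparse = toy.pivot(columns="Series", values="X")
print(sparse.shape)  # (4, 2)

# Explicit index: rows that share an index label are merged into one output row.
toy["id"] = [0, 1, 0, 1]
dense = toy.pivot(index="id", columns="Series", values="X")
print(dense.shape)  # (2, 2)
```

With the default index, half of the cells in `sparse` are `NaN`; with the shared `id` labels, `dense` has no missing cells.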

To fix this, let us generate a better index. Since each series contains 11 data points and the rows are stored series by series, we can add an `id` column that provides the per-series data point ids ranging from 0 to 10.

In [118]:
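anscombe_wide = anscombe.copy()  # copy, so that adding "id" does not modify the original `anscombe`
anscombe_wide["id"] = anscombe_wide.index % 11  # position of each point within its series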
anscombe_wide

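| | Series | X | Y | id |
|---|---|---|---|---|
| 0 | I | 10 | 8.04 | 0 |
| 1 | I | 8 | 6.95 | 1 |
| 2 | I | 13 | 7.58 | 2 |
| 3 | I | 9 | 8.81 | 3 |
| 4 | I | 11 | 8.33 | 4 |
| 5 | I | 14 | 9.96 | 5 |
| 6 | I | 6 | 7.24 | 6 |
| 7 | I | 4 | 4.26 | 7 |
| 8 | I | 12 | 10.84 | 8 |
| 9 | I | 7 | 4.81 | 9 |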
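
The `index % 11` trick works here because every series has exactly 11 points stored contiguously. When that is not guaranteed, one alternative (a swapped-in technique, not what the notebook above uses) is `groupby(...).cumcount()`, which numbers the rows within each group regardless of ordering or group size. A minimal sketch on made-up data:

```python
import pandas as pd

# Hypothetical toy data: rows are interleaved and the groups have unequal sizes.
toy = pd.DataFrame({
    "Series": ["A", "B", "A", "B", "A"],
    "X": [1, 2, 3, 4, 5],
})

# cumcount() numbers the rows within each group: 0, 1, 2, ...
toy["id"] = toy.groupby("Series").cumcount()
print(toy["id"].tolist())  # [0, 0, 1, 1, 2]

# The per-series id can then serve as the pivot index.
wide = toy.pivot(index="id", columns="Series", values="X")
print(wide)
```

Because series "B" has only two points in this toy frame, the cell at `id` 2 in column "B" is `NaN`.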


Now, we can call pivot again with `id` as the index.

In [119]:
anscombe_wide = anscombe_wide.pivot(columns="Series", index="id")
anscombe_wide

| | X | X | X | X | Y | Y | Y | Y |
|---|---|---|---|---|---|---|---|---|
| Series | I | II | III | IV | I | II | III | IV |
| id | | | | | | | | |
| 0 | 10 | 10 | 10 | 8 | 8.04 | 9.14 | 7.46 | 6.58 |
| 1 | 8 | 8 | 8 | 8 | 6.95 | 8.14 | 6.77 | 5.76 |
| 2 | 13 | 13 | 13 | 8 | 7.58 | 8.74 | 12.74 | 7.71 |
| 3 | 9 | 9 | 9 | 8 | 8.81 | 8.77 | 7.11 | 8.84 |
| 4 | 11 | 11 | 11 | 8 | 8.33 | 9.26 | 7.81 | 8.47 |
| 5 | 14 | 14 | 14 | 8 | 9.96 | 8.1 | 8.84 | 7.04 |
| 6 | 6 | 6 | 6 | 8 | 7.24 | 6.13 | 6.08 | 5.25 |
| 7 | 4 | 4 | 4 | 19 | 4.26 | 3.1 | 5.39 | 12.5 |
| 8 | 12 | 12 | 12 | 8 | 10.84 | 9.13 | 8.15 | 5.56 |
| 9 | 7 | 7 | 7 | 8 | 4.81 | 7.26 | 6.42 | 7.91 |


This time, we get what we expect! As before, let us validate the mean and standard deviation properties for each series.

In [125]:
anscombe_wide.mean()

   Series
X  I         9.000000
   II        9.000000
   III       9.000000
   IV        9.000000
Y  I         7.500000
   II        7.500909
   III       7.500000
   IV        7.500909
dtype: float64

In [124]:
anscombe_wide.std()

   Series
X  I         3.316625
   II        3.316625
   III       3.316625
   IV        3.316625
Y  I         2.032890
   II        2.031657
   III       2.030424
   IV        2.030579
dtype: float64

Note that we no longer need to use the `groupby` method as the data we need to group together are already organized in separate columns.
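
Since the wide form uses multi-level columns, pulling out one coordinate or one series takes slightly different syntax than in the long form. The sketch below builds a small wide frame from made-up data (all names and values are illustrative) and shows three common selections.

```python
import pandas as pd

# Hypothetical toy data in long form, already carrying a per-series id.
long_form = pd.DataFrame({
    "Series": ["A", "A", "B", "B"],
    "id": [0, 1, 0, 1],
    "X": [1, 2, 3, 4],
    "Y": [10.0, 20.0, 30.0, 40.0],
})
wide = long_form.pivot(index="id", columns="Series")

# First level only: all X columns, one per series.
x_only = wide["X"]

# Both levels: a single column, addressed by a tuple.
y_b = wide[("Y", "B")]

# Second level only: the X and Y columns of series "A".
series_a = wide.xs("A", axis=1, level="Series")

print(series_a)
```

Here `x_only` and `series_a` are data frames with single-level columns, while `y_b` is a pandas Series.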

## Summary

In this case study, we demonstrated how to split a pandas dataframe into multiple sub-dataframes using row filtering, and how to convert a dataframe from long form to wide form using `pivot`. Here is a list of the functions we used:
* `pd.pivot` [doc](https://pandas.pydata.org/docs/reference/api/pandas.pivot.html#pandas.pivot)
* `pd.DataFrame.groupby` [doc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html)
* `pd.DataFrame.mean` [doc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html)
* `pd.DataFrame.std` [doc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.std.html)