# Groups and visualization with Anscombe’s quartet

by Koenraad De Smedt at UiB

---
This notebook shows:

1. How to read a JSON file into a dataframe
2. How to group rows by the values in a column
3. How to plot the data in each group

This is illustrated with *Anscombe’s quartet*, a well-known example from statistics. The dataset for this example, `anscombe.json`, is stored as JSON, a structured text format whose objects resemble Python *dict*s.
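
To make the JSON/dict resemblance concrete, here is a minimal sketch. It assumes the file is a list of records with the keys `Series`, `X` and `Y`; the three records are typed in by hand for illustration, and the names `json_text`, `records` and `sample` are not part of the notebook.

```python
import json

import pandas as pd

# Three hand-typed records in the assumed shape of anscombe.json.
json_text = """
[
  {"Series": "I", "X": 10.0, "Y": 8.04},
  {"Series": "I", "X": 8.0, "Y": 6.95},
  {"Series": "I", "X": 13.0, "Y": 7.58}
]
"""

# json.loads turns the JSON text into a list of ordinary Python dicts.
records = json.loads(json_text)
print(type(records[0]), records[0])

# A list of dicts becomes a dataframe with one row per record.
sample = pd.DataFrame(records)
print(sample)
```

`pd.read_json` performs both steps at once when given a file path or URL.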

---

Use *pandas* to read the JSON dataset from the Google Colab sample data into a dataframe. If you are not using Colab, comment out the first `read_json` line and uncomment one of the lines that read the same file from a URL.

In [None]:
import pandas as pd
# In Colab the file is part of the sample data; elsewhere, use one of the URLs below.
df = pd.read_json('/content/sample_data/anscombe.json')
# df = pd.read_json('https://huggingface.co/datasets/merve/test-dataset/raw/4a3883db6cdc57e61f202c981f6924a87bece781/anscombe.json')
# df = pd.read_json('https://raw.githubusercontent.com/vega/vega/main/docs/data/anscombe.json')
df

Group the rows by the values in the `Series` column and describe the groups. The groups have the same number of data points and the same (or very similar) summary statistics, such as the means and standard deviations of X and Y.

In [None]:
quartet = df.groupby('Series')
quartet.describe()
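
The full `describe()` table is wide. As a sketch, the comparison can be narrowed to the means and standard deviations with `agg`. So that the cell runs without the JSON file, the quartet is typed in by hand here; the names `x123`, `data`, `demo` and `summary` are illustrative, not part of the notebook.

```python
import pandas as pd

# Anscombe's quartet, typed in by hand (series I-III share the same X values).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
data = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# One small dataframe per series, stacked into a single long table.
demo = pd.concat(
    [pd.DataFrame({"Series": name, "X": xs, "Y": ys}) for name, (xs, ys) in data.items()],
    ignore_index=True,
)

# Mean and standard deviation of X and Y within each group.
summary = demo.groupby("Series")[["X", "Y"]].agg(["mean", "std"]).round(2)
print(summary)
```

After rounding to two decimals, the four rows should be practically indistinguishable.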

Each group has the same (or very close) correlation between X and Y.

In [None]:
quartet.corr()
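
`corr()` on a grouped dataframe returns one small correlation matrix per group. The sketch below shows one way to reduce that to a single X-Y correlation per group. To stay short it uses only series I and IV, typed in by hand; the names `demo`, `matrices` and `r_xy` are illustrative.

```python
import pandas as pd

# Series I and IV of Anscombe's quartet, typed in by hand.
demo = pd.concat(
    [
        pd.DataFrame({
            "Series": "I",
            "X": [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
            "Y": [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
        }),
        pd.DataFrame({
            "Series": "IV",
            "X": [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            "Y": [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
        }),
    ],
    ignore_index=True,
)

# One 2x2 correlation matrix per group, stacked under a two-level index.
matrices = demo.groupby("Series").corr()

# Keep the X row of each matrix and read off its Y column:
# the correlation between X and Y in each group.
r_xy = matrices.xs("X", level=1)["Y"]
print(r_xy.round(3))
```

Both values should come out at roughly 0.82, even though the two series look nothing alike when plotted.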

The surprising part comes when we plot the data points in each group. This illustrates the importance of data visualization, as pointed out by [Anscombe](https://en.wikipedia.org/wiki/Anscombe%27s_quartet).

In [None]:
quartet.plot.scatter(x='X', y='Y')

Here is an alternative way to plot the data using Seaborn.

In [None]:
import seaborn as sns
sns.relplot(data=df, x='X', y='Y', col='Series', col_wrap=2)
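
If Seaborn is not available, a similar grid can be sketched with matplotlib alone by iterating over the groups and drawing each one on its own axes. The quartet is again typed in by hand so the cell runs on its own; the names `demo`, `fig` and `axes` are illustrative, not part of the notebook.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Anscombe's quartet, typed in by hand (series I-III share the same X values).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
data = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}
demo = pd.concat(
    [pd.DataFrame({"Series": name, "X": xs, "Y": ys}) for name, (xs, ys) in data.items()],
    ignore_index=True,
)

# A 2x2 grid with shared axes, so the four panels are directly comparable.
fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharex=True, sharey=True)

# Iterating over a groupby yields (group name, sub-dataframe) pairs.
for ax, (name, group) in zip(axes.flat, demo.groupby("Series")):
    ax.scatter(group["X"], group["Y"])
    ax.set_title(f"Series {name}")
    ax.set_xlabel("X")
    ax.set_ylabel("Y")

fig.tight_layout()
```

Sharing the axes is a design choice: it keeps the scales identical across panels, which makes the different shapes easier to compare.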

### Exercises

1. (optional) Read [the article about the Datasaurus dozen](https://www.autodesk.com/research/publications/same-stats-different-graphs), download the data, move the TSV file to your Google Drive (or upload to Colab), [read the table from file into a dataframe](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) and plot the data.
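
As a starting point for the reading step only, here is a minimal sketch of loading tab-separated text with `read_csv`. The rows are placeholders, not the real Datasaurus values, and the column names `dataset`, `x` and `y` are an assumption; check the header of the file you download.

```python
import io

import pandas as pd

# Placeholder rows for illustration only, NOT the real Datasaurus data.
# The column names are an assumption about the downloaded file.
tsv_text = "dataset\tx\ty\nA\t1.0\t2.0\nA\t2.0\t4.0\nB\t1.0\t5.0\n"

# sep="\t" tells read_csv that the columns are separated by tabs.
# Replace io.StringIO(tsv_text) with the path to your uploaded TSV file.
dozen = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Group by the column that names each dataset, as was done with 'Series' above.
groups = dozen.groupby("dataset")
print(groups.size())
```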