# Anscombe's Quartet

In 1973, the English statistician Francis John Anscombe published a short paper on the importance of graphs in statistical analysis. According to Anscombe, plotting data provides insights that summary statistics alone cannot: it helps identify broad features of a dataset and reveals whether the assumptions of a statistical method are met or violated.

Anscombe created a fictitious dataset of four x-y variable pairs that share nearly identical summary statistics:

- Number of observations: 11
- Mean of the x variable: 9.0
- Mean of the y variable: 7.5
- Equation of linear regression model: $y=3+0.5x$
- Coefficient of determination ($r^2$): 0.667
- Coefficient of correlation ($r$): 0.817

Question: Despite their nearly identical statistical metrics, can these four datasets be considered similar?



In [11]:
# Import modules
import pandas as pd
from scipy import stats
from bokeh.plotting import figure, show, output_notebook, gridplot
output_notebook()

In [2]:
# Load Anscombe's dataset
df = pd.read_csv('../datasets/anscombe_quartet.csv')
df.head(11) # The entire dataset


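| Unnamed: 0 | obs | x1 | y1 | x2 | y2 | x3 | y3 | x4 | y4 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 10 | 8.04 | 10 | 9.14 | 10 | 7.46 | 8 | 6.58 |
| 1 | 2 | 8 | 6.95 | 8 | 8.14 | 8 | 6.77 | 8 | 5.76 |
| 2 | 3 | 13 | 7.58 | 13 | 8.74 | 13 | 12.74 | 8 | 7.71 |
| 3 | 4 | 9 | 8.81 | 9 | 8.77 | 9 | 7.11 | 8 | 8.84 |
| 4 | 5 | 11 | 8.33 | 11 | 9.26 | 11 | 7.81 | 8 | 8.47 |
| 5 | 6 | 14 | 9.96 | 14 | 8.1 | 14 | 8.84 | 8 | 7.04 |
| 6 | 7 | 6 | 7.24 | 6 | 6.13 | 6 | 6.08 | 8 | 5.25 |
| 7 | 8 | 4 | 4.26 | 4 | 3.1 | 4 | 5.39 | 19 | 12.5 |
| 8 | 9 | 12 | 10.84 | 12 | 9.13 | 12 | 8.15 | 8 | 5.56 |
| 9 | 10 | 7 | 4.82 | 7 | 7.26 | 7 | 6.42 | 8 | 7.91 |
| 10 | 11 | 5 | 5.68 | 5 | 4.74 | 5 | 5.73 | 8 | 6.89 |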


In [3]:
# Define a linear model
lm = lambda x,intercept,slope: intercept + slope*x


In [4]:
# Fit linear model to each dataset
slope_1, intercept_1, r_val_1, p_val_1, std_err_1 = stats.linregress(df.x1, df.y1)
slope_2, intercept_2, r_val_2, p_val_2, std_err_2 = stats.linregress(df.x2, df.y2)
slope_3, intercept_3, r_val_3, p_val_3, std_err_3 = stats.linregress(df.x3, df.y3)
slope_4, intercept_4, r_val_4, p_val_4, std_err_4 = stats.linregress(df.x4, df.y4)


In [5]:
# Print the coefficient of determination (r^2) for each regression to check that they are nearly equal
print('A:', round(r_val_1**2,3))
print('B:', round(r_val_2**2,3))
print('C:', round(r_val_3**2,3))
print('D:', round(r_val_4**2,3))


A: 0.667
B: 0.666
C: 0.666
D: 0.667
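
The $r^2$ values cover only one of the shared statistics. The means, the regression coefficients, and the coefficient of correlation listed at the top can be checked in a similar way. The following is a minimal sketch that uses plain Python formulas instead of `scipy`, and hard-codes the quartet values from Anscombe (1973) so that it runs without the CSV file. The names `quartet` and `summarize` are illustrative and are not part of the original notebook.

```python
from math import sqrt

# Anscombe's quartet, hard-coded (assumption: same values as the CSV file)
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    'A': (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    'B': (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    'C': (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    'D': ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
          [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(x, y):
    """Return means, least-squares line, and Pearson r for one x-y pair."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products about the means
    sxx = sum((xi - mean_x)**2 for xi in x)
    syy = sum((yi - mean_y)**2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return {'n': n, 'mean_x': mean_x, 'mean_y': mean_y,
            'intercept': intercept, 'slope': slope, 'r': r}

summary = {label: summarize(x, y) for label, (x, y) in quartet.items()}
for label, s in summary.items():
    print(label, {key: round(value, 2) for key, value in s.items()})
```

Rounded to two decimals, the four summaries should be indistinguishable, which is exactly why the statistics alone cannot answer the question posed above.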


In [32]:
# Plot points and fitted line
p1 = figure(x_range=(0,20), y_range=(0,15))
p1.scatter(x='x1', y='y1', source=df, size=5, color='black')
p1.line(df['x1'], lm(df['x1'],intercept_1,slope_1), color='tomato')
p1.xaxis.axis_label ='x'
p1.yaxis.axis_label ='y'

p2 = figure(x_range=(0,20), y_range=(0,15))
p2.scatter(x='x2', y='y2', source=df, size=5, color='black')
p2.line(df['x2'], lm(df['x2'],intercept_2,slope_2), color='purple')
p2.xaxis.axis_label ='x'
p2.yaxis.axis_label ='y'

p3 = figure(x_range=(0,20), y_range=(0,15))
p3.scatter(x='x3', y='y3', source=df, size=5, color='black')
p3.line(df['x3'], lm(df['x3'],intercept_3,slope_3), color='royalblue')
p3.xaxis.axis_label ='x'
p3.yaxis.axis_label ='y'

p4 = figure(x_range=(0,20), y_range=(0,15))
p4.scatter(x='x4', y='y4', source=df, size=5, color='black')
p4.line(df['x4'], lm(df['x4'],intercept_4,slope_4), color='green')
p4.xaxis.axis_label ='x'
p4.yaxis.axis_label ='y'

# Bokeh 3.x takes width/height; older releases used plot_width/plot_height
grid = gridplot([[p1,p2],[p3,p4]], width=400, height=400)
show(grid)


## Practice

- Can you think of any pair of independent and dependent variables that could resemble each dataset regardless of the axes magnitudes?

- Use the NumPy module to create a set of eleven uniformly distributed random x,y values. Then, fit a linear model and compute the coefficient of correlation. Finally, create a figure showing the scatter points and the fitted linear model, and include the coefficient of correlation as a plot annotation.

- Compute the root mean square error, mean absolute error, and mean bias error for each dataset and the fitted linear model. Can any of these error metrics overcome the limitations of the coefficient of correlation?

- Instead of creating the subplots and regression lines one by one, can you write a for loop that iterates over each dataset, fits a linear model, and then populates its corresponding subplot?

- A linear model does not seem to be the best model for Figure B. Can you think of an alternative model that could better approximate the set in Figure B? Define the model as a lambda function and fit the model to the dataset in Figure B.
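
As a possible starting point for the loop and error-metric exercises, here is a hedged sketch that iterates over the four datasets, fits a least-squares line to each, and computes the root mean square error, mean absolute error, and mean bias error. It uses plain Python rather than `scipy` and hard-codes the data so that it runs on its own. The helper names `fit_line` and `error_metrics` are hypothetical, and the errors are defined here as predicted minus observed, which is one common convention among several.

```python
from math import sqrt

# Anscombe's quartet, hard-coded (assumption: same values as the CSV file)
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    'A': (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    'B': (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    'C': (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    'D': ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
          [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def fit_line(x, y):
    """Ordinary least-squares fit; returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x)**2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def error_metrics(y_obs, y_pred):
    """Return (RMSE, MAE, MBE) with errors defined as predicted - observed."""
    errors = [p - o for o, p in zip(y_obs, y_pred)]
    n = len(errors)
    rmse = sqrt(sum(e**2 for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    mbe = sum(errors) / n
    return rmse, mae, mbe

metrics = {}
for label, (x, y) in quartet.items():
    intercept, slope = fit_line(x, y)
    y_pred = [intercept + slope * xi for xi in x]
    metrics[label] = error_metrics(y, y_pred)
    rmse, mae, mbe = metrics[label]
    print(f'{label}: RMSE={rmse:.3f}  MAE={mae:.3f}  MBE={mbe:.3f}')
```

Note that the mean bias error of a least-squares line with an intercept is zero by construction, up to floating-point rounding, so it cannot separate the four datasets. The same loop could populate one Bokeh figure per dataset instead of printing the metrics.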


## References

Anscombe, F.J., 1973. Graphs in statistical analysis. The American Statistician, 27(1), pp.17-21.