# Data visualization ― week 3

<img src="images/job_titles.png" alt="Drawing" style="width: 800px;"/>

*Source:* NYT

# Exploratory statistical charts ― agenda

- frequencies
- univariate distributions
- bivariate distributions
- multidimensional distributions

# Exploratory statistical charts

* **when** do we use statistical charts?
    - exploration: understanding the dataset at hand
    - presentation: highlighting patterns in the variables

* **who** are the 'consumers' of statistical charts?
  - difficult to isolate one single type of consumer
  - the audience spans virtually every social category

* **what** are the most common forms of statistical charts?

  - there is no shortage of alternatives...

# Design principles for statistical charts

<img src="images/tufte_book.jpg" alt="Drawing" style="width:500px;"/>

_"The conditions under which many data graphics are produced―the lack of substantive and quantitative skills of the illustrators, dislike of quantitative evidence, and the contempt for the intelligence of the audience―guarantee graphic mediocrity."_

_"These conditions engender graphics that (1) lie; (2) employ only the simplest designs; (3) miss the real news actually in the data"_


Tufte (2001, page 87)

# Graphical excellence according to Tufte

_"Excellence in statistical graphics consists of complex ideas communicated with clarity, precision, and efficiency"._

Graphical displays should:

* induce the viewer to think about the substance rather than about the methodology, graphic design [omitted] or something else
* serve a reasonably clear purpose: description, exploration, tabulation or decoration
* show the data
* avoid distorting what the data have to say
* reveal the data at several levels of detail, from a broad overview to the fine structure
* be closely integrated with the statistical and verbal description of the dataset

# Show data variation, not design variation

<img src="images/design_variation.png" alt="Drawing" style="width: 800px;"/>

*Source:* NSF, Science Indicators, 1974 (1976, page 15)

# Avoid distorting what the data have to say: Published chart

<img src="images/nys_exp_0.png" alt="Drawing" style="width: 800px;"/>

*Source:* NYT (February 1, 1976)

# Avoid distorting what the data have to say: Spot the not-so-clever tricks

<img src="images/nys_exp_1.png" alt="Drawing" style="width: 800px;"/>

*Source:* NYT (February 1, 1976)

# Avoid distorting what the data have to say: For real

<img src="images/nys_exp_2.png" alt="Drawing" style="width: 800px;"/>

*Source:* Tufte (2001)

# Avoid distorting what the data have to say: The lie factor

\begin{equation}
\mbox{Lie factor} = \frac{\mbox{size of the effect shown in graphic}}{\mbox{size of the effect in data}}
\end{equation}

<img src="images/oil_price.png" alt="Drawing" style="width: 800px;"/>
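The ratio above can be computed directly. The sketch below is a minimal illustration, assuming the "size of the effect" is measured as a relative change between two values; the function names and the numbers are hypothetical and are not taken from the chart shown.

```python
def relative_change(first, second):
    """Size of an effect, measured as the relative change from first to second."""
    return abs(second - first) / abs(first)


def lie_factor(graphic_first, graphic_second, data_first, data_second):
    """Ratio of the effect shown in the graphic to the effect in the data."""
    shown = relative_change(graphic_first, graphic_second)
    actual = relative_change(data_first, data_second)
    return shown / actual


# Hypothetical numbers: the data value doubles (10 -> 20), while the
# bar drawn for it grows from 1 cm to 5 cm.
factor = lie_factor(graphic_first=1.0, graphic_second=5.0,
                    data_first=10.0, data_second=20.0)
print(factor)
```

A lie factor of 1 means the graphic shows the effect at its true size; the further the value is from 1, the stronger the distortion.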

# Tufte's conclusions about n-dimensional charts

"The use of two (or three) varying dimensions to show one-dimensional data is a weak and inefficient technique, capable of handling very small datasets, often with error in design and ambiguity in perception."

"The number of information-carrying (variable) dimensions should not exceed the number of dimensions in the data"

# How to use n-dimensional charts consistently

<img src="images/postal_saving_books.png" alt="Drawing" style="width: 800px;"/>

*Source:* Gabaglio (1888)

# Five principles to reveal the data

<center>Above all else show the data.</center>

<center>Maximize the data-ink ratio.</center>

<center>Erase non-data-ink.</center>

<center>Erase redundant-data-ink.</center>

<center>Revise and edit.</center>

# Maximize data-ink ratio: Redesigning the boxplot

![](images/boxplot_0.png)

![](images/boxplot_1.png)

# Maximize data-ink ratio: Redesigning the barchart

![](images/barchart_0.png)

![](images/barchart_1.png)

# Maximize data-ink ratio: Redesigning the scatter

![](images/scatter_0.png)

![](images/scatter_1.png)

# Are Tufte-adjusted charts 'better'?

<img src="images/is_tufte_style_better_0.png" alt="Drawing" style="width: 800px;"/>

<img src="images/is_tufte_style_better_1.png" alt="Drawing" style="width: 800px;"/>

# Looking for some sources of inspiration?

![](images/journal_of_visualization.jpg)

![](images/ieee.jpg)

# Visualizing frequencies

* candidate visual forms:
    - bar chart
    - pie/donut chart

* premises:
    - data should have a grouping structure (e.g., reviews nested in products)
    - the grouping structure should be meaningful

* common mistakes:
    - plotting frequencies when a continuous variable (e.g., product price) is available
    - in fact, many bar charts rely on ad hoc, discretized variables (e.g., price bins such as £1-5)

Group | Frequency
-----:|----------
a     | 2
b     | 4
c     | 1

That is:

```{python}
{'a': 2, 'b': 4, 'c': 1}
```
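In practice the frequency table is usually derived from raw, case-level data. The following is a minimal sketch of that step using only the standard library; the `observations` list is a hypothetical example constructed to reproduce the table above.

```python
from collections import Counter

# Hypothetical raw observations: one group label per case.
observations = ['b', 'a', 'b', 'c', 'b', 'a', 'b']

# Count how many cases fall in each group, sorted by group label.
frequencies = dict(sorted(Counter(observations).items()))
print(frequencies)

# A quick text-only bar chart: one '#' per case.
for group, count in frequencies.items():
    print(f"{group} | {'#' * count}")
```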

# Visualizing univariate distributions

* candidate visual forms:
    - histograms
    - boxplots 
    - log-log plots

* premises:
    - data on a continuous variable (e.g., product prices) are available

* common mistakes:
    - histograms based on too many or too few bins 
    - outliers/long tails obscuring the shape of the overall distribution

Case | Variable
-----|---------
1    | 0.59966121
2    | -0.3114587
3    | 1.22467437

That is:

```{python}
[0.59966121, -0.3114587, 1.22467437]
```
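The choice of the number of bins can be made less arbitrary with a rule of thumb. The sketch below uses Sturges' rule, which is one common option among several and is an assumption here rather than a recommendation from these slides; `sturges_bins` and `histogram_counts` are illustrative names. It assumes the values are not all identical.

```python
import math


def sturges_bins(n):
    """Number of bins suggested by Sturges' rule for n observations."""
    return math.ceil(math.log2(n)) + 1


def histogram_counts(values, bins):
    """Count values in `bins` equal-width intervals spanning min..max."""
    low, high = min(values), max(values)
    width = (high - low) / bins  # assumes high > low
    counts = [0] * bins
    for value in values:
        # The maximum sits on the right edge, so clamp it into the last bin.
        index = min(int((value - low) / width), bins - 1)
        counts[index] += 1
    return counts


values = [0.59966121, -0.3114587, 1.22467437]
bins = sturges_bins(len(values))
print(bins, histogram_counts(values, bins))
```

Rerunning `histogram_counts` with a much smaller or larger `bins` value is a quick way to see how strongly the apparent shape of a distribution depends on that choice.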

# Visualizing bivariate distributions

* candidate visual forms:
    - scatter diagram
    - 2d histogram

* premises:
    - data on two variables (e.g., product price and number of reviews) are available

* common mistakes:
    - overplotting―there are too many datapoints; the visualization is too dense
    - imbalance between the fine structure of the data and the larger pattern

Case | Variable X  | Variable Y
-----|-------------|-----------
1    | 0.59966121  | 23
2    | -0.3114587  | 56
3    | 1.22467437  | 12

That is:

```{python}
[[0.59966121, -0.3114587, 1.22467437],
 [23, 56, 12]]
```
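One way to cope with overplotting is to replace individual points with counts per cell, which is the idea behind a 2d histogram. The following is a minimal sketch under that assumption; `bin_2d` and the cell sizes are hypothetical choices for illustration.

```python
import math
from collections import Counter

# Column-wise layout, as above: one list per variable.
x = [0.59966121, -0.3114587, 1.22467437]
y = [23, 56, 12]

# Pair up the columns so that each tuple is one case (x, y).
points = list(zip(x, y))


def bin_2d(points, x_width, y_width):
    """Count points per rectangular cell of size x_width by y_width."""
    cells = Counter()
    for px, py in points:
        # Each cell is identified by its integer grid coordinates.
        cell = (math.floor(px / x_width), math.floor(py / y_width))
        cells[cell] += 1
    return dict(cells)


print(bin_2d(points, x_width=1.0, y_width=25))
```

With thousands of points, the cell counts can be mapped to colour intensity instead of drawing every marker.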

# Visualizing n-dimensional distributions

* candidate visual forms:
    - contour plots
    - surface plots
    - spider charts

* premises:
    - data on three or more variables (e.g., product price, number of reviews and rating) are available

* common mistakes:
    - overplotting―there are too many datapoints; the visualization is too dense
    - imbalance between the fine structure of the data and the larger pattern
    - misfit between the visual form and the audience

Case | Variable X  | Variable Y | Variable Z
-----|-------------|------------|-----------
1    | 0.59966121  | 23         | 1000
2    | -0.3114587  | 56         | 1200
3    | 1.22467437  | 12         | 1300

That is:

```{python}
[[0.59966121, -0.3114587, 1.22467437],
 [23, 56, 12],
 [1000, 1200, 1300]]
```