# Data Morph
## A Cautionary Tale of Summary Statistics

<br/>

### Stefanie Molin

## Bio

- &#128105;&#127995;&#8205;&#128187; Software engineer at Bloomberg in New York City
- &#10024; Founding member of Bloomberg's Data Science Community
- &#9997;&#127996; Author of <em>[Hands-On Data Analysis with Pandas](https://www.amazon.com/Hands-Data-Analysis-Pandas-visualization-dp-1800563450/dp/1800563450/)</em> (currently in its second edition; translated into Korean and Chinese)
- &#127891; BS in operations research from Columbia University
- &#127891; MS in computer science (ML specialization) from Georgia Tech

## Talk outline

- Why summary statistics aren't enough
- Introduction to Data Morph
- How Data Morph works
- Limitations and areas for future work
- Lessons learned and challenges faced

## Summary statistics aren't enough

These datasets are clearly different:

![example datasets](media/example_datasets.png)

<hr align="left" style="width: 33%;" />
<div style="text-indent: -10px; padding-left: 40px; width: 90%;">
    <small>$^*$ The Python logo is a <a href="https://www.python.org/psf/trademarks/" target="_blank" rel="noopener noreferrer">trademark of the Python Software Foundation (PSF)</a>, used with permission from the Foundation.</small>
</div>

However, we would not know that if we were to only look at the summary statistics:

![summary statistics are the same](media/stats.gif)

What we call *summary statistics* summarize only part of the distribution. We need many **moments**$^*$ to describe the shape of a distribution (and distinguish between these datasets):

![moments](media/moments.gif)

<hr align="left" style="width: 33%;" />
<div style="text-indent: -10px; padding-left: 40px; width: 90%;">
    <small>$^*$ The first moment is the center of mass of the distribution (the mean); here, we have central moments, which are independent of translation, so our first moment is zero (we subtract the mean). The second moment is the variance, which these datasets also share, but once we get to the third moment (skewness), we can differentiate between them. Higher moments, like kurtosis (4th moment), provide even more information.</small>
</div>
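
To make the role of moments concrete, here is a minimal sketch in plain Python. The two toy datasets (`symmetric` and `skewed`) and the helper names are made up for illustration and are not part of Data Morph. Both datasets have a mean of 0 and a variance of 1, yet the standardized third moment tells them apart:

```python
def central_moment(values, k):
    """Return the k-th central moment (population version)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** k for v in values) / len(values)


def standardized_moment(values, k):
    """Divide the k-th central moment by the standard deviation raised to k."""
    return central_moment(values, k) / central_moment(values, 2) ** (k / 2)


# Hypothetical toy data: same mean (0) and variance (1), different shapes
symmetric = [-1.0, -1.0, 1.0, 1.0]
skewed = [-0.5, -0.5, -0.5, -0.5, 2.0]

for name, data in [('symmetric', symmetric), ('skewed', skewed)]:
    print(
        name,
        'variance:', central_moment(data, 2),
        'skewness:', standardized_moment(data, 3),
        'kurtosis:', standardized_moment(data, 4),
    )
```

The variance column matches for both datasets, while the skewness and kurtosis columns do not.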

Adding in histograms for the marginal distributions, we can see the distributions of both *x* and *y* are indeed quite different across datasets. Some of these differences are captured in the third moment (**skewness**) and the fourth moment (**kurtosis**), which measure the asymmetry and weight in the tails of the distribution, respectively:

![marginal distributions](media/with_marginals.png)

However, the moments don't capture the relationship between *x* and *y*. If we suspect a linear relationship, we may use the Pearson correlation coefficient, which is the same for all three datasets below. Here, the visualization tells us far more about the relationship between the variables:

![summary statistics static](media/stats.png)

The Pearson correlation coefficient measures *linear* correlation, so if we don't visualize our data, then we have another problem: a high correlation (close in absolute value to 1) does not mean the relationship is actually linear. Without a visualization to contextualize the summary statistics, we do not have an accurate understanding of the data.

For example, all four datasets in **Anscombe's Quartet** (constructed in 1973) have strong correlations, but only **I** and **III** have linear relationships:

![Anscombe's Quartet](media/anscombe.png)

<div style="text-align: center; margin-top: -20px;">
    <small>This visual was created by Stefanie Molin using the Anscombe's Quartet dataset as provided in
        <a href="https://github.com/mwaskom/seaborn" target="_blank" rel="noopener noreferrer">seaborn.</a></small>
</div>
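
As a rough check of that claim, the sketch below recomputes the mean of *y* and the Pearson correlation coefficient for each of the four datasets using only the standard library. The values are Anscombe's published data; the `pearson()` helper is written here for illustration.

```python
import math

# Anscombe's Quartet: datasets I-III share the same x values
x_123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    'I': (x_123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    'II': (x_123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    'III': (x_123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    'IV': (
        [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
    ),
}


def pearson(x, y):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(spread_x * spread_y)


# (mean of y, correlation) per dataset, rounded to two decimals
summary = {
    name: (round(sum(y) / len(y), 2), round(pearson(x, y), 2))
    for name, (x, y) in quartet.items()
}
for name, (mean_y, r) in summary.items():
    print(f'{name}: mean(y)={mean_y}, r={r}')
```

Rounded to two decimals, all four datasets report the same mean and correlation, even though only a plot reveals how different they are.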

### Visualization is an essential part of any data analysis.

In their 2020 paper, *[A hypothesis is a liability](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02133-w)*, researchers Yanai and Lercher argue that **simply approaching a dataset with a hypothesis may limit the thoroughness with which the data is explored**.

Let's take a look at their experiment.

#### The experiment

Students in a statistical data analysis course were split into two groups. One group was given the open-ended task of exploring the data, while the other group was instructed to test the following hypotheses:

1. There is a difference in the mean number of steps between women and men.
2. The correlation coefficient between steps and BMI is negative for women.
3. The correlation coefficient between steps and BMI is positive for men.

<div style="text-align: right;">
    <small>
        <a href="https://doi.org/10.1101/2020.07.30.228916" rel="noopener noreferrer" target="_blank">
            (Yanai & Lercher, 2020)
        </a>
    </small>
</div>

Here's what that dataset looked like:

<div style="text-align: center;">
    <img
        width=450
        alt="Figure 1 from 'A hypothesis is a liability' by Itai Yanai & Martin Lercher"
        src="https://media.springernature.com/lw685/springer-static/image/art%3A10.1186%2Fs13059-020-02133-w/MediaObjects/13059_2020_2133_Fig1_HTML.png?as=webp" />
</div>

<div style="text-align: center; margin-top: -10px;">
    <small>Figure 1 from <em><a href="https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02133-w" target="_blank" rel="noopener noreferrer">A hypothesis is a liability</a></em> by Itai Yanai & Martin Lercher (<a href="http://creativecommons.org/licenses/by/4.0/" target="_blank" rel="noopener noreferrer">Creative Commons Attribution 4.0 International License</a>).</small>
</div>

#### How can we encourage students and practitioners alike to be more thorough in their analyses?

### Create more memorable teaching aids

In 2017, Autodesk researchers created the **Datasaurus Dozen**, building upon the idea of Anscombe's Quartet to make a more impactful example:

<div style="text-align: center;">
    <img src="media/datasaurus.png" alt="Datasaurus Dozen" width="500px" style="margin: -10px auto;">
    <br/>
    <div style="margin: auto 22%;">
    <small>This visual was created by Stefanie Molin using the Datasaurus Dozen dataset as provided by
        <a href="https://github.com/jmatejka/same-stats-different-graphs" target="_blank" rel="noopener noreferrer">jmatejka/same-stats-different-graphs.</a></small></div>
</div>

They also employed animation, which is even more impactful. Every shape along the transition from the Datasaurus to the circle shares the same summary statistics:

<div style="text-align: center;">
    <img src="media/dino_to_circle.gif" alt="Datasaurus to circle (Data Morph)">
    <br/>
    <small>This visual was created by Stefanie Molin using Data Morph.</small>
</div>

But now we have a new problem...

### What's so special about the Datasaurus?

#### NOTHING!

Since there was no easy way to do this for arbitrary datasets, people assumed that this capability was a property of the Datasaurus and were shocked to see it work with other shapes. The more ways people see this, and the more memorable those examples are, the better the concept will stick – repetition is key to learning. This is why I built [Data Morph](https://stefaniemolin.com/data-morph/stable/index.html).

### Data Morph addresses the limitations of previous methods

- installable Python package that can be used without hacking at the codebase
- animated results right out of the box
- additional datasets (built-in and custom), so you aren't limited to the aforementioned examples
- easy experimentation with your own datasets and various target shapes
- number of possible examples is no longer frozen

## Data Morph (2023)

<div style="text-align: center;">
    <img src="media/Python_to_heart.gif" alt="morphing the Python logo into a heart">
</div>

Here's the code to create that example:
```console
$ python -m pip install data-morph-ai
$ data-morph --start-shape Python --target-shape heart
```

Here's what's going on behind the scenes:
```python
from data_morph.data.loader import DataLoader
from data_morph.morpher import DataMorpher
from data_morph.shapes.factory import ShapeFactory


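# Load a built-in dataset (wrapped with automatically-calculated bounds)
dataset = DataLoader.load_dataset('Python')
# Generate a target shape that is sized and positioned for this dataset
target_shape = ShapeFactory(dataset).generate_shape('heart')

# Morph the dataset into the shape, keeping summary statistics equal to 2 decimals
morpher = DataMorpher(decimals=2, in_notebook=False)
_ = morpher.morph(dataset, target_shape)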
```

## How it works
A high-level overview.

### 1. Select a starting dataset
```python
from data_morph.data.loader import DataLoader  # <--
from data_morph.morpher import DataMorpher
from data_morph.shapes.factory import ShapeFactory


dataset = DataLoader.load_dataset('Python')  # <--
target_shape = ShapeFactory(dataset).generate_shape('heart')

morpher = DataMorpher(decimals=2, in_notebook=False)
_ = morpher.morph(dataset, target_shape)
```

#### Automatically-calculated bounds

Data Morph provides the `Dataset` class, which wraps the data (stored as a `pandas.DataFrame`) with information about bounds for the data, the morphing process, and plotting. These bounds make it possible to calculate target shapes for arbitrary datasets – no more hardcoded values.

<div style="text-align: center;">
    <img src="media/bounds.png" alt="automatically-calculated bounds" width="400px">
</div>
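
The idea of deriving bounds from the data can be sketched as follows. This is not Data Morph's implementation: `padded_bounds()` and its `padding` parameter are hypothetical names, and padding each side by a fraction of the data range is an assumption made for illustration.

```python
def padded_bounds(values, padding=0.2):
    """Return (low, high) bounds that extend the data range by a fraction on each side."""
    low, high = min(values), max(values)
    margin = (high - low) * padding
    return low - margin, high + margin


# Hypothetical (x, y) points standing in for a dataset
points = [(10.0, 20.0), (30.0, 80.0), (50.0, 40.0)]
x_bounds = padded_bounds([x for x, _ in points])
y_bounds = padded_bounds([y for _, y in points])
print('x bounds:', x_bounds)
print('y bounds:', y_bounds)
```

Because the bounds follow from the data, nothing needs to be hardcoded when a different dataset is swapped in.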

#### Built-in datasets

To spark creativity, Data Morph ships with built-in datasets:

<div style="text-align: center;">
    <img src="media/available_datasets.png" alt="built-in datasets" width="450px">
    <br/>
    <small>Note: Currently displaying what's available as of the v0.2.0 release. All logos are used with <a href="https://stefaniemolin.com/data-morph/stable/api/data_morph.data.loader.html#id1" target="_blank" rel="noopener noreferrer">permission</a>.</small>
</div>

### 2. Generate a target shape based on the dataset

```python
from data_morph.data.loader import DataLoader
from data_morph.morpher import DataMorpher
from data_morph.shapes.factory import ShapeFactory  # <--


dataset = DataLoader.load_dataset('Python')
target_shape = ShapeFactory(dataset).generate_shape('heart')  # <--

morpher = DataMorpher(decimals=2, in_notebook=False)
_ = morpher.morph(dataset, target_shape)
```

#### Scaling and translating target shapes

Depending on the target shape, bounds and/or statistics from the dataset are used to generate a custom target shape for the dataset to morph into.

![shapes are calculated based on input data](media/fitting_shapes.png)
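
For instance, a circle target could be placed and sized from the data's own statistics. The sketch below assumes the center is the mean of the points and the radius is the average of the standard deviations of *x* and *y*; both choices, and the `fit_circle()` helper itself, are illustrative assumptions rather than Data Morph's actual rules.

```python
import statistics


def fit_circle(points):
    """Return a (center, radius) pair derived from the points' summary statistics."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    center = (statistics.fmean(xs), statistics.fmean(ys))
    radius = (statistics.pstdev(xs) + statistics.pstdev(ys)) / 2
    return center, radius


# Hypothetical data: the corners of a square, then the same square scaled up 10x
small = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
large = [(x * 10, y * 10) for x, y in small]

small_center, small_radius = fit_circle(small)
large_center, large_radius = fit_circle(large)
print(small_center, small_radius)
print(large_center, large_radius)
```

The target circle moves and grows with the data, so the same shape definition works at any scale.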


#### Built-in target shapes

The following target shapes are currently available:

<div style="text-align: center;">
    <img src="media/available_shapes.png" alt="built-in target shapes" width="500px">
    <br/>
    <small>Note: Currently displaying what's available as of the v0.2.0 release.</small>
</div>

#### The `Shape` class hierarchy

In Data Morph, shapes are structured as a hierarchy of classes, which must provide a `distance()` method. This makes them interchangeable in the morphing logic.

<div style="text-align: center;">
    <img src="media/uml/shapes_uml.svg" alt="hierarchy of shapes">
    <br/>
    <small>Note: The ... boxes represent classes omitted for space.</small>
</div>
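
A minimal sketch of that idea, assuming nothing about Data Morph's real classes beyond the requirement of a `distance()` method: the `Shape`, `Point`, and `Circle` classes below are simplified stand-ins written for this example.

```python
import math
from abc import ABC, abstractmethod


class Shape(ABC):
    """Base class: every shape must report how far a point is from it."""

    @abstractmethod
    def distance(self, x, y):
        """Return the distance from the point (x, y) to the shape."""


class Point(Shape):
    """A single target point."""

    def __init__(self, x, y):
        self.x, self.y = x, y

    def distance(self, x, y):
        return math.hypot(x - self.x, y - self.y)


class Circle(Shape):
    """The outline of a circle (not the filled disk)."""

    def __init__(self, cx, cy, radius):
        self.cx, self.cy, self.radius = cx, cy, radius

    def distance(self, x, y):
        # Distance to the outline: how far the point is from being exactly one radius away
        return abs(math.hypot(x - self.cx, y - self.cy) - self.radius)


# Code that consumes shapes only needs distance(), so they are interchangeable
shapes = [Point(0, 0), Circle(0, 0, 5)]
distances = [shape.distance(3, 4) for shape in shapes]
print(distances)
```

Since the calling code only relies on `distance()`, a new shape can be added without touching the logic that uses it.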

### 3. Morph the dataset into the target shape
```python
from data_morph.data.loader import DataLoader
from data_morph.morpher import DataMorpher  # <--
from data_morph.shapes.factory import ShapeFactory


dataset = DataLoader.load_dataset('Python')
target_shape = ShapeFactory(dataset).generate_shape('heart')

morpher = DataMorpher(decimals=2, in_notebook=False)  # <--
_ = morpher.morph(dataset, target_shape)  # <--
```

#### Simulated annealing
A point is selected at random (blue) and moved a small, random amount to a new location (red), preserving summary statistics. This part of the codebase comes from the Autodesk research and is mostly unchanged:
<div style="text-align: center;">
    <img src="media/simulated_annealing.gif" alt="example point movement">
</div>
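
The statistic-preserving part of a single step can be sketched like this. It is a simplified illustration, not the Autodesk or Data Morph code: the helper names are hypothetical, the shift is drawn uniformly, and comparing statistics truncated to `decimals` places is an assumption. Moving toward the target shape is left out here to keep the focus on the statistics check.

```python
import math
import random
import statistics


def summary_stats(points):
    """Return (mean x, mean y, std x, std y, Pearson correlation)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    std_x, std_y = statistics.pstdev(xs), statistics.pstdev(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in points) / len(points)
    return mean_x, mean_y, std_x, std_y, covariance / (std_x * std_y)


def stats_match(before, after, decimals=2):
    """Check that every statistic agrees once truncated to `decimals` places."""
    scale = 10 ** decimals
    return all(
        math.floor(a * scale) == math.floor(b * scale) for a, b in zip(before, after)
    )


def perturb(points, max_shift, rng, decimals=2):
    """Move one random point a small amount; keep the move only if the stats survive."""
    index = rng.randrange(len(points))
    x, y = points[index]
    candidate = list(points)
    candidate[index] = (
        x + rng.uniform(-max_shift, max_shift),
        y + rng.uniform(-max_shift, max_shift),
    )
    if stats_match(summary_stats(points), summary_stats(candidate), decimals):
        return candidate
    return points  # reject: the move would change the summary statistics


rng = random.Random(0)
start = [(rng.uniform(0, 100), rng.uniform(0, 100)) for _ in range(200)]
points = start
for _ in range(1000):
    points = perturb(points, max_shift=0.1, rng=rng)

print(stats_match(summary_stats(start), summary_stats(points)))
```

Every accepted move keeps the truncated statistics identical to the previous state, so they stay identical to the starting dataset no matter how many moves are applied.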


#### Avoiding local optima

Sometimes, the algorithm will move a point away from the target shape, while still preserving summary statistics. This helps to avoid getting stuck:

<div style="text-align: center;">
    <img src="media/avoiding_local_optima.gif" alt="example point movement">
    <br/>
    <small>These iterations were isolated from the animation on the previous slide.</small>
</div>


The likelihood of doing this decreases over time and is governed by the **temperature** of the simulated annealing process:

<div style="text-align: center;">
    <img src="media/temperature_over_time.png" alt="temperature over time">
    <br/>
    <small>The temperature falls to zero as we near the final iterations, meaning we become more strict about moving toward the target shape to finalize the output.</small>
</div>
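
One way to picture the role of temperature is the acceptance rule below. It is a sketch under stated assumptions: the linear cooling schedule, the `max_temp` and `min_temp` defaults, and both function names are illustrative choices, not values taken from Data Morph.

```python
import random


def temperature(iteration, total_iterations, max_temp=0.4, min_temp=0.0):
    """Cool linearly from max_temp at the start to min_temp at the end."""
    remaining = (total_iterations - iteration) / total_iterations
    return (max_temp - min_temp) * remaining + min_temp


def accept_move(old_distance, new_distance, temp, rng):
    """Always accept a move toward the shape; accept a worse one with probability temp."""
    return new_distance < old_distance or rng.random() < temp


rng = random.Random(42)
total = 1000
for iteration in (0, 500, 1000):
    temp = temperature(iteration, total)
    # Try the same "worse" move many times to estimate how often it is accepted
    accepted = sum(accept_move(1.0, 2.0, temp, rng) for _ in range(10_000))
    print(f'iteration {iteration}: temperature={temp:.2f}, worse moves accepted={accepted}')
```

Early on, a fair share of moves away from the target get through; once the temperature reaches zero, only moves toward the target are accepted.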

#### Decreasing point movement over time

The maximum amount that a point can move at a given iteration decreases over time for a better visual effect. This makes points move faster when the morphing starts and slow down as we approach the target shape:

<div style="text-align: center;">
    <img src="media/Python_to_heart_forward_only.gif" alt="morphing the Python logo into a heart">
</div>

<hr align="left" style="width: 33%;" />
<div style="text-indent: -10px; padding-left: 40px; width: 90%;">
    <small>$^\dagger$ Varying point movement over time is not part of the Autodesk implementation.</small>
</div>

In simulated annealing, we are decreasing temperature over time, so we can think of the earlier iterations as matter in a gaseous state (the points are moving fast). As the temperature decreases, we transition into a liquid and eventually a solid state, with the point movement decreasing.

Unlike temperature, we don't allow this value to fall to zero, since we don't want to halt movement:

<div style="text-align: center;">
    <img src="media/maximum_movement_over_time.png" alt="easing movement over time">
    <br/>
    <small>Maximum point movement decreases over time just as temperature does.</small>
</div>
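
A schedule with this property might look like the sketch below. The linear shape, the `start` and `end` values, and the `max_movement()` name are assumptions for illustration; the only point carried over from the talk is that the limit shrinks but stays above zero.

```python
def max_movement(iteration, total_iterations, start=1.0, end=0.3):
    """Shrink the per-step movement limit linearly, stopping at a positive floor."""
    if end <= 0:
        raise ValueError('end must be positive so that points never stop moving')
    progress = iteration / total_iterations
    return start + (end - start) * progress


total = 1000
limits = [max_movement(iteration, total) for iteration in range(0, total + 1, 250)]
print([round(limit, 3) for limit in limits])
```

Each point's random shift would then be drawn from within the current limit, giving large jumps early and fine adjustments near the end.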

## Limitations and areas for future work

### &ldquo;Bald spots&rdquo;

How do we encourage points to fill out the target shape and not just clump together?

<div style="text-align: center;">
    <img src="media/bald_spots.png" alt="bald spots limitation">
</div>

### Morphing direction

Currently, we can only morph from dataset to shape (and shape to dataset by playing the animation in reverse). I would like to support dataset to dataset and shape to shape morphing, but there are challenges to both:

| Goal | Challenges |
| --- | --- |
| shape &rarr; shape | determining the initial sizing and possibly aligning scale across the shapes, and solving the bald spot problem |
| dataset &rarr; dataset | defining a distance metric, determining scale and position of target, and solving the bald spot problem |

### Speed

The algorithm from the original research is largely untouched, and parts of it could potentially be vectorized to speed up the morphing process.

### Data scale affects morphing time

Smaller values (left subplot) morph in fewer iterations than larger values (right subplot) since we only move small amounts at a time:

<div style="text-align: center;">
    <img src="media/scale.png" alt="scale">
    <br/>
    <small>Converting each of these into the circle shape takes ~25K iterations for the half-scale, ~50K iterations for the actual scale, and ~77.5K iterations for the scaled-up version.</small>
</div>

### Convergence

- **Currently**: The user specifies the number of iterations to run. For datasets with small values, convergence might happen earlier; for datasets with larger values, this might happen well after this number of iterations.
- **Goal**: The user would specify the maximum number of iterations and the algorithm would stop early if the dataset had converged to the target shape.
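
The goal above could be approached with a loop like the following sketch. Everything here is hypothetical: `morph_until_converged()`, the mean-distance convergence test, and the `tolerance` value are assumptions, and the toy `step` function simply pulls points toward a circle without preserving any statistics.

```python
import math


def morph_until_converged(points, step, distance, max_iterations, tolerance=0.5):
    """Apply `step` until the mean distance to the target is within `tolerance`."""
    for iteration in range(max_iterations):
        mean_distance = sum(distance(x, y) for x, y in points) / len(points)
        if mean_distance <= tolerance:
            return points, iteration  # converged early
        points = step(points)
    return points, max_iterations


RADIUS = 5.0  # toy target: a circle of radius 5 centered on the origin


def distance_to_circle(x, y):
    return abs(math.hypot(x, y) - RADIUS)


def halve_gap(points):
    """Toy stand-in for a morphing step: move each point halfway to the circle."""
    moved = []
    for x, y in points:
        r = math.hypot(x, y)
        factor = ((r + RADIUS) / 2) / r
        moved.append((x * factor, y * factor))
    return moved


start = [(10.0, 0.0), (0.0, 10.0), (-10.0, 0.0)]
final, iterations = morph_until_converged(
    start, halve_gap, distance_to_circle, max_iterations=1000
)
print(f'stopped after {iterations} of 1000 iterations')
```

With a check like this, small-scale datasets would stop as soon as they reach the shape, while `max_iterations` still caps the run for larger ones.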

## Lessons learned and challenges faced

### Repeating research is hard

My first step was to use the [Autodesk researchers' code](https://github.com/jmatejka/same-stats-different-graphs) to recreate the conversion of the Datasaurus into a circle and figure out how the code worked.

Challenges at this stage:
- Limited or no code documentation
- Partial codebase with unused variables and functions
- Generic variable names

TIME TAKEN: 4 hours

### Extending research is harder

From there, I tried to get it to work with a panda-shaped dataset, reworked to have similar statistics to the Datasaurus.

Challenges at this stage:
- Limited or no code documentation
- Partial codebase with unused variables and functions
- Hardcoded values (some of which were related to the data)

TIME TAKEN: 1.75 days

### Building and distributing a package is a lot of work

Once I got the transformation working with the panda (my original goal), I realized this would be a helpful teaching tool and decided to make a package.

Challenges at this stage:
- Purging unused variables and functions
- Refactoring a monolithic codebase
- Writing a pre-commit hook to validate numpydoc-style docstrings ([PR 454](https://github.com/numpy/numpydoc/pull/454))
- Building and hosting documentation
- Creating a robust test suite from scratch
- Publishing to PyPI and conda-forge
- Automating workflows with GitHub Actions

TIME TAKEN: 2 months (v0.1.0)

### Side note: Don't completely trust the docs
Here are some cases I bumped into while building Data Morph:

- Error in version switcher config example for pydata-sphinx-theme ([PR 1279](https://github.com/pydata/pydata-sphinx-theme/pull/1279)).
- Unable to report code coverage broken out by package and tests in PRs when using a codecov configuration like Matplotlib's ([PR 25698](https://github.com/matplotlib/matplotlib/pull/25698)).

## Helpful resources

- [Configuring setuptools using pyproject.toml files](https://setuptools.pypa.io/en/latest/userguide/pyproject_config.html) – Python Packaging Authority
- [Packaging Python Packages](https://packaging.python.org/en/latest/tutorials/packaging-projects/) – Python Packaging Authority
- [Building and hosting documentation on GitHub Pages](https://olgarithms.github.io/sphinx-tutorial/docs/7-hosting-on-github-pages.html) – Aya Elsayed and Olga Matoula
- [Python Packaging Tutorial: The Conda Way](https://hackmd.io/ElBrRQ6rT4K_dfzjY6pAFQ#3-Time-to-pack-%F0%9F%93%A6) – Bianca Henderson, Mahe Iram Khan, Valerio Maggio, and Dave Clements
- [Building and testing Python](https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python) – GitHub Actions docs

## Closing remarks
- Summary statistics alone aren't enough to describe your data.
- Visualization is essential, but a single plot won't suffice.
- Try out Data Morph!
  - `python -m pip install data-morph-ai`
  - `conda install -c conda-forge data-morph-ai`
  - docs: [tinyurl.com/data-morph-docs](https://tinyurl.com/data-morph-docs)
  - repo: [github.com/stefmolin/data-morph](https://github.com/stefmolin/data-morph)

## References
- Anscombe, F.J. (1973). Graphs in Statistical Analysis. *The American Statistician 27*, 1, 17–21. https://www.tandfonline.com/doi/abs/10.1080/00031305.1973.10478966
- Matejka, J., Fitzmaurice, G. (2017). Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (CHI '17). Association for Computing Machinery, New York, NY, USA, 1290–1294. https://doi.org/10.1145/3025453.3025912
- Yanai, I., Lercher, M. (2020). A hypothesis is a liability. *Genome Biol 21*, 231. https://doi.org/10.1186/s13059-020-02133-w
- Yanai, I., Lercher, M. (2020). Selective attention in hypothesis-driven data analysis. BioRxiv. https://doi.org/10.1101/2020.07.30.228916

## Thank you!

*I hope you enjoyed the session. You can follow my work on the following platforms:*

  <div
    style="
      display: flex;
      justify-content: space-evenly;
      align-items: center;
    ">
    <div style="text-align: center; width: 30%">
      <img
        class="qr-code"
        src="https://raw.githubusercontent.com/stefmolin/pandas-workshop/main/media/qr-code.png" />
    </div>
    <div style="font-size: 1.5em">
      <div
        style="
          display: flex;
          justify-content: flex-start;
          align-items: center;
        ">
        <i class="fa fa-globe fa-fw" style="padding-right: 4px"></i>
        <a href="https://stefaniemolin.com" rel="noopener noreferrer">
          stefaniemolin.com
        </a>
      </div>
      <div
        style="
          display: flex;
          justify-content: flex-start;
          align-items: center;
        ">
        <i class="fab fa-github fa-fw" style="padding-right: 4px"></i>
        <a
          href="https://github.com/stefmolin"
          rel="noopener noreferrer"
          target="_blank">
          github.com/stefmolin
        </a>
      </div>
      <div
        style="
          display: flex;
          justify-content: flex-start;
          align-items: center;
        ">
        <i class="fab fa-twitter fa-fw" style="padding-right: 4px"></i>
        <a
          href="https://twitter.com/StefanieMolin"
          rel="noopener noreferrer"
          target="_blank">
          twitter.com/StefanieMolin
        </a>
      </div>
      <div
        style="
          display: flex;
          justify-content: flex-start;
          align-items: center;
        ">
        <i class="fab fa-linkedin fa-fw" style="padding-right: 4px"></i>
        <a
          href="https://linkedin.com/in/stefanie-molin"
          rel="noopener noreferrer"
          target="_blank">
          linkedin.com/in/stefanie-molin
        </a>
      </div>
    </div>
  </div>