# Data Visualization

A good data visualization helps us to see patterns and trends in our data that we might not have noticed otherwise. It also helps us to communicate our findings to others.

You'll sometimes see these two purposes referred to as "Exploratory Data Analysis" and "Explanatory Visualizations". The former is about finding patterns in your data; the latter is about communicating those patterns to others.

Many of the tools are the same, but you may find that an interactive tool is better for exploring your data, while a crisp graphic with a clear message is better for communicating your findings. This is not a hard and fast rule, of course: plenty of great interactive visualizations exist that let your audience explore complex datasets.

## Before You Start (Explanatory Visualizations)

Unless you're using a quick and dirty plot to explore your data, you'll want to have your data as clean as possible, in a format that is easy to work with.

Generally speaking, you'll want to have your data in a tabular format, with one row per observation and one column per variable.

You might use a SQL query or Pandas to get your data into this format and save the result to a file, and then have the functions or programs that generate your visualizations read from that file.
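
As a minimal sketch of what "one row per observation, one column per variable" can look like in practice, the example below reshapes a small, made-up "wide" table with `pandas`. The column names and values are hypothetical and only serve as illustration.

```python
import pandas as pd

# Hypothetical "wide" table: one row per fruit, one column per year.
wide = pd.DataFrame({
    "fruit": ["apple", "cherry"],
    "2022": [40, 30],
    "2023": [45, 28],
})

# Reshape so that each row is one observation (a fruit in a year)
# and each column is one variable (fruit, year, count).
tidy = wide.melt(id_vars="fruit", var_name="year", value_name="count")
tidy["year"] = tidy["year"].astype(int)

print(tidy)
```

Most plotting libraries covered here can map the columns of a table shaped like `tidy` directly to axes, colors, or facets.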

When considering the full pipeline (extraction, cleaning, matching, etc.), a visualization should be a terminal step.

This will allow you to work on it independently of the rest of your data processing, and also to easily update your visualizations as your data changes.
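
One way to keep the visualization as a terminal step is to put it in a function that only reads already-cleaned data. The sketch below assumes a CSV with `fruit` and `count` columns; the function name `plot_counts` and the data are hypothetical, and an in-memory buffer stands in for a file written by an earlier step.

```python
import io

import matplotlib.pyplot as plt
import pandas as pd


def plot_counts(source):
    """Read cleaned data from a path or file-like object and return a bar chart."""
    df = pd.read_csv(source)
    fig, ax = plt.subplots()
    ax.bar(df["fruit"], df["count"])
    ax.set_ylabel("count")
    return fig


# Stand-in for a file produced by an earlier pipeline step (made-up data).
cleaned = io.StringIO("fruit,count\napple,40\ncherry,30\n")
fig = plot_counts(cleaned)
```

When the upstream data changes, only the file changes; the plotting function can be rerun as-is.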

If you're building an explanatory visualization, you'll want to have a clear message in mind before you start.
Ask yourself these questions:

* What is the story I want to tell?
* Who is the audience I'm trying to reach?
* What kinds of mistakes might that audience make when interpreting my visualization?

With these in mind, you can start to think about what kind of visualization will best communicate your message.

## Data Viz Best Practices

### Keep your "story" and audience in mind

There's a good chance you have a lot of interesting data, but data visualization is rarely the place to show it off.

By keeping your story or key question in mind, you can usually find 1-3 visualizations that tell the story best.

Consider the famous "hockey stick" graph.
It is a very simple graph, but it was the most effective way to communicate the key message of the climate report it appeared in.

![](https://upload.wikimedia.org/wikipedia/commons/thumb/2/2d/T_comp_61-90.pdf/page1-1590px-T_comp_61-90.pdf.jpg)

### Simpler is usually better

It is easy to get lost in making the wild visualizations that modern tools offer: a network analysis graph, a 3D scatter plot, a map with multiple translucent layers showing ten different variables overlaid.

Again, with your audience in mind, consider that most people can interpret a simple bar graph or line graph fairly well, but a radar chart or hierarchical treemap is better suited to a more technical audience.

Even with technical audiences, having a simple graph that shows the key message is often better than a complex graph that shows everything.
You can always have a secondary visualization that shows more detail for those that wish to explore further.

![](https://matplotlib.org/stable/_images/sphx_glr_scatter3d_001_2_00x.png)

### Reduce "chart junk"

Chart junk is any visual element that is not directly related to the data. This includes:

* 3D effects & shadows
* Gradients
* All extraneous lines including gridlines, tick marks, and axes
* Unnecessary labels

Once you begin this process, you realize just how much stuff on a chart doesn't add any value.
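
As a rough illustration, here is one way to strip some of these elements from a `matplotlib` bar chart. The data is made up, and which elements count as junk is a judgment call for each chart; this is a sketch, not a recipe.

```python
import matplotlib.pyplot as plt

fruits = ["apple", "blueberry", "cherry", "orange"]
counts = [40, 100, 30, 55]

fig, ax = plt.subplots()
ax.bar(fruits, counts, color="tab:gray")

# Remove the box around the plot; keep only the left and bottom axis lines.
for side in ("top", "right"):
    ax.spines[side].set_visible(False)

ax.grid(False)            # no gridlines
ax.tick_params(length=0)  # hide tick marks but keep the tick labels
ax.set_ylabel("fruit supply")
```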

![](https://matplotlib.org/stable/_images/sphx_glr_gradient_bar_001_2_00x.png)

Edward Tufte has written extensively on this topic, and his book "The Visual Display of Quantitative Information" is a great resource.

* [The Visual Display of Quantitative Information](https://www.edwardtufte.com/tufte/books_vdqi)

![](https://www.datapine.com/blog/wp-content/uploads/2014/06/Unclear-Data-Visualization-.gif)

![](https://www.datapine.com/blog/wp-content/uploads/2014/06/Improved-Data-Visualization.gif)

### Use the right chart type

There are many different chart types, and each one is suited to a different purpose.

![](https://scc.ms.unimelb.edu.au/__data/assets/image/0007/3217291/badpie.png)

vs.

![](https://scc.ms.unimelb.edu.au/__data/assets/image/0003/3217287/bargraph.png)

If in doubt, start with a simple bar chart or line graph.
These are the most common chart types for good reason, and are usually the best place to start.

Avoid using pie charts; they are almost never the best way to visualize data.

![](https://scc.ms.unimelb.edu.au/__data/assets/image/0010/3217285/poster.png)

Credit: Why You Shouldn't Use Pie Charts (https://scc.ms.unimelb.edu.au/resources/data-visualisation-and-exploration/no_pie-charts)
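
A sorted bar chart is a common alternative to a pie chart. The sketch below uses hypothetical shares that would be hard to rank as pie slices because the values are close together.

```python
import matplotlib.pyplot as plt

# Hypothetical shares (percent); the names and numbers are made up.
shares = {"A": 23, "B": 21, "C": 20, "D": 19, "E": 17}

# Sort ascending: barh draws the first item at the bottom,
# so the largest share ends up at the top.
ordered = sorted(shares.items(), key=lambda item: item[1])
labels = [name for name, _ in ordered]
values = [value for _, value in ordered]

fig, ax = plt.subplots()
ax.barh(labels, values)
ax.set_xlabel("share (%)")
```

With a shared baseline, small differences between categories become differences in bar length, which most readers compare more easily than angles.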

Familiarizing yourself with the different chart types available in your tool(s) of choice is a good idea that may inspire creative ways to visualize your data.
Keep readability in mind, especially if you are using a chart type that is not common.

Ask a friend or colleague to interpret your data without any explanation, and see if they can make sense of it.

If they take a long time, or come away with the wrong impression, that's a good sign that you should adjust your approach.

There's not a ton of value in a visualization that someone needs to explain.

### Use color effectively

Color can be used to make elements stand out from one another, but doing so can also lead to confusion.

It may be tempting to use a different color for each bar in a bar chart, but consider whether the colors could carry meaning instead.

Colors are a great way to group similar data, or highlight outliers or important values.
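
For example, a single accent color can draw attention to one value while the rest stay neutral. This is a minimal sketch with made-up data; highlighting the maximum is just one possible rule.

```python
import matplotlib.pyplot as plt

fruits = ["apple", "blueberry", "cherry", "orange"]
counts = [40, 100, 30, 55]

# Neutral gray for every bar, one accent color for the largest value.
top = max(counts)
colors = ["tab:orange" if c == top else "lightgray" for c in counts]

fig, ax = plt.subplots()
ax.bar(fruits, counts, color=colors)
ax.set_ylabel("fruit supply")
```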

Also keep in mind that about 4% of the population is colorblind.
If you are using color as the **only** way to differentiate between elements, you may want to consider using a different method as well.
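
One such method is to vary marker shape or line style along with color, so that the series can still be told apart without it. The series names and values below are hypothetical.

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
series = {
    "north": [3, 4, 6, 7],
    "south": [5, 5, 4, 3],
}

# Each series gets its own marker and line style in addition to its color.
styles = {
    "north": {"marker": "o", "linestyle": "-"},
    "south": {"marker": "s", "linestyle": "--"},
}

fig, ax = plt.subplots()
for name, values in series.items():
    ax.plot(x, values, label=name, **styles[name])
ax.legend()
```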

The most common form of colorblindness is red-green colorblindness, so you may want to avoid using red and green together.

(Consider how many charts use red for negative values and green for positive values anyway!)

At the very least, take a look at your visualization with a colorblind filter to see if your point is still clear.

* [Color Oracle](http://colororacle.org/)
* [Color Blindness Simulator](https://www.color-blindness.com/coblis-color-blindness-simulator/)
* [Colorblind Web Page Filter](https://www.toptal.com/designers/colorfilter/)

## Data Viz in Python

* [Matplotlib](https://matplotlib.org/)
* [Seaborn](https://seaborn.pydata.org/)
* [Pandas](https://pandas.pydata.org/pandas-docs/stable/visualization.html)
* [Altair](https://altair-viz.github.io/)
* [plotnine](https://plotnine.readthedocs.io/en/stable/)

### matplotlib

<https://matplotlib.org/>

Gallery: <https://matplotlib.org/stable/gallery/index.html>

Matplotlib is the oldest and most widely used data visualization library in Python.

It is a low-level library, and exposes a lot of the underlying details. This is both the source of its longevity, and also its greatest weakness.

It also predates `pandas`, so lacks many of the conveniences that come with having a DataFrame as your data source.

#### Examples

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# `matplotlib` expects values to be in parallel lists
fruits = ['apple', 'blueberry', 'cherry', 'orange']
counts = [40, 100, 30, 55]
# labels that start with an underscore are left out of the legend
bar_labels = ['red', 'blue', '_red', 'orange']
bar_colors = ['tab:red', 'tab:blue', 'tab:red', 'tab:orange']

ax.bar(fruits, counts, label=bar_labels, color=bar_colors)

ax.set_ylabel('fruit supply')
ax.set_title('Fruit supply by kind and color')
ax.legend(title='Fruit color')

# to save to a file (do this before plt.show(); afterwards the
# figure may be gone and the saved file would be blank)
fig.savefig('fruit.png')
# in a notebook
plt.show()

Most `matplotlib` plots start with a call to `plt.subplots()` to create a figure and axes object.

Once those are created, you will call methods on the axes object to add elements to the plot.

Notice that to actually show the plot, you need to call `plt.show()`, which is not given the figure as an argument.
`matplotlib.pyplot` relies on global state to keep track of the current figure and axes.
When you were told to avoid global variables because they make code confusing, this is what they were talking about.
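
The global state can be observed directly. In this small sketch, `plt.gcf()` and `plt.gca()` return the "current" figure and axes, which are simply the ones created most recently.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# pyplot remembers the most recently created figure and axes...
assert plt.gcf() is fig
assert plt.gca() is ax

# ...so both of these calls modify the same axes, and the last one wins.
plt.title("set through global state")
ax.set_title("set through the axes object")
print(ax.get_title())  # prints "set through the axes object"
```

Calling methods on `fig` and `ax` directly, as the example above does, makes it explicit which plot is being changed.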

### plotting from pandas

`pandas` has a `plot` method that will create a `matplotlib` plot for you.
This is a useful shortcut, particularly well suited to exploratory data analysis.

In [None]:
import pandas as pd

df = pd.DataFrame({
    'fruit': ['apple', 'blueberry', 'cherry', 'orange'],
    'count': [40, 100, 30, 55],
    'color': ['red', 'blue', 'red', 'orange']
})
df.plot(kind='bar', x='fruit', y='count', color=df['color'])
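
Because the result is an ordinary `matplotlib` plot, it can be adjusted afterwards. The sketch below relies on `DataFrame.plot` returning the `matplotlib` axes object it drew on; the labels are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "fruit": ["apple", "blueberry", "cherry", "orange"],
    "count": [40, 100, 30, 55],
})

# Keep the returned axes object so the plot can be customized further.
ax = df.plot(kind="bar", x="fruit", y="count", legend=False)
ax.set_ylabel("fruit supply")
ax.set_title("Fruit supply by kind")
```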

### seaborn

<https://seaborn.pydata.org/>

Gallery: <https://seaborn.pydata.org/examples/index.html>

Seaborn is a higher-level library built on top of `matplotlib`. 

It is designed to make it easier to create common types of plots, and to make them look good with minimal modification.

It also integrates well with `pandas`, and can be used to create plots directly from a DataFrame instead of needing to extract the data into parallel lists.

#### Examples

In [None]:
import seaborn as sns

# seaborn has built in themes
sns.set_theme(style="whitegrid")

# it also has example datasets that you can use for practice
tips = sns.load_dataset("tips")

sns.barplot(x="day", y="total_bill", data=tips)

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(style="whitegrid")
iris = sns.load_dataset("iris")

# "Melt" the dataset to "long-form" or "tidy" representation
iris = pd.melt(iris, "species", var_name="measurement")

# Initialize the figure: notice that doing this requires some `matplotlib`
f, ax = plt.subplots()
sns.despine(bottom=True, left=True)

# As before, plot methods take dataframes and labels
sns.stripplot(
    data=iris, x="value", y="measurement", hue="species",
    dodge=True, alpha=.25, zorder=1, legend=False
)

# Show the conditional means, aligning each pointplot in the
# center of the strips by adjusting the width allotted to each
# category (.8 by default) by the number of hue levels.
# Note: `join` and `scale` are deprecated as of seaborn 0.13;
# newer versions use `linestyle="none"` and `markersize` instead.
sns.pointplot(
    data=iris, x="value", y="measurement", hue="species",
    join=False, dodge=.8 - .8 / 3, palette="dark",
    markers="d", scale=.75, errorbar=None
)

# Improve the legend
sns.move_legend(
    ax, loc="lower right", ncol=3, frameon=True, columnspacing=1, handletextpad=0
)

#### Why is seaborn imported as sns?

Most data science libraries are imported with an alias, to make it easier to type in interactive notebooks.

The most common aliases are:

* `pd` for `pandas`
* `np` for `numpy`
* `sns` for `seaborn`
* `plt` for `matplotlib.pyplot`
* `alt` for `altair`

Why `sns`?

![](https://static.wikia.nocookie.net/westwing/images/b/b3/3sam.png/revision/latest?cb=20191111155050)

**Samuel Norman Seaborn**

(Seriously, the library is named after a West Wing character.)

### altair

<https://altair-viz.github.io/>

Gallery: <https://altair-viz.github.io/gallery/index.html>

Altair is a declarative visualization library, meaning that you describe the data and the visual elements you want to use.

It is based on Vega, a JavaScript library:

"Vega is a visualization grammar, a declarative language for creating, saving, and sharing interactive visualization designs. With Vega, you can describe the visual appearance and interactive behavior of a visualization in a JSON format, and generate web-based views using Canvas or SVG."

In [None]:
import altair as alt

# load a simple example dataset as a pandas DataFrame
from vega_datasets import data
cars = data.cars()

alt.Chart(cars).mark_point().encode(
    x='Horsepower',
    y='Miles_per_Gallon',
    color='Origin',
).interactive()

The concept of a "grammar of graphics" is a popular one; the most famous implementation is R's `ggplot2`.

https://speakerdeck.com/jakevdp/altair-tutorial-intro-pycon-2018

### plotnine

<https://plotnine.readthedocs.io/en/stable/>

Plotnine is a Python implementation of the grammar of graphics, modeled on R's `ggplot2`. If you've worked with `ggplot2` before, you'll find it very familiar.

In [None]:
from plotnine import ggplot, geom_point, aes, stat_smooth, facet_wrap
from plotnine.data import mtcars

(ggplot(mtcars, aes('wt', 'mpg', color='factor(gear)'))
 + geom_point()
 + stat_smooth(method='lm')
 + facet_wrap('~gear'))

## Interactive Data Viz

These come in two forms: 

1) Extensions to Jupyter notebooks meant for exploratory data analysis, e.g. being able to zoom/pan around a plot instead of regenerating it with different parameters.

2) Libraries that leverage the browser to create interactive visualizations.  These will all rely on JavaScript libraries so they can run in the browser.

### Plotly Express and Dash

Plotly is an interactive visualization library written in JavaScript, with bindings for Python, R, Julia, MATLAB, and more.

* [Plotly](https://plot.ly/python/)
* [Dash](https://dash.plot.ly/)

Note: Unlike everything else we've mentioned so far, Plotly is a company selling a product.
The libraries we're talking about are open source, but it is worth noting that the company does offer paid services.

Both libraries are well suited to building interactives. `dash` takes this a step further and allows you to build full web applications (mainly targeted at dashboards).

Take a look at last year's projects for some examples of what you can do with `plotly` and `dash`.

### Bokeh

[Bokeh](https://bokeh.pydata.org/en/latest/)

Gallery: <https://docs.bokeh.org/en/latest/docs/gallery.html>

Bokeh is similar in many ways to plotly, with a focus on interactive visualizations on the web.

## Common Data Viz Mistakes

![](https://www.datapine.com/blog/wp-content/uploads/2014/06/Truncated-Y-Axis-Data-Visualizations-Designed-To-Mislead.jpg)

![](https://www.datapine.com/blog/wp-content/uploads/2014/06/Same-Data-Different-Y-Axis-Data-Visualization-Designed-to-Mislead.png)
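
A truncated y-axis is one of the most common of these mistakes. The sketch below plots the same two hypothetical values twice, once with a truncated axis and once with an axis that starts at zero, so the effect can be compared side by side.

```python
import matplotlib.pyplot as plt

labels = ["before", "after"]
values = [35.0, 39.6]  # hypothetical numbers

fig, (ax_trunc, ax_zero) = plt.subplots(1, 2, figsize=(8, 3))

ax_trunc.bar(labels, values)
ax_trunc.set_ylim(34, 40)  # truncated: a small difference looks dramatic
ax_trunc.set_title("y-axis starts at 34")

ax_zero.bar(labels, values)
ax_zero.set_ylim(0, 40)    # zero-based: bar heights are proportional
ax_zero.set_title("y-axis starts at 0")
```

For bar charts in particular, the bar length is what readers compare, so a baseline other than zero distorts the comparison.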

![](https://assets.website-files.com/60078f9b9c5ea6f60974b74b/61bdf49e384b99811891b691_Blog%2049.8..png)

![](https://assets.website-files.com/60078f9b9c5ea6f60974b74b/61bdf4dfb54a53fa47b4eba6_Blog%2049.11..png)

![](https://assets.website-files.com/60078f9b9c5ea6f60974b74b/61bdf4c6ea186da5af170265_Blog%2049.10..png)

## Specialized Visualizations

* Geographic Data
    * https://geoviews.org/
    * https://geopandas.org/
    * https://python-visualization.github.io/folium/
    * Plotly/Dash have a map component
* Network Graphs
    * NetworkX: https://networkx.org/
    * PyVis: https://pyvis.readthedocs.io/en/latest/index.html
    * Plotly/Dash have a graph component
* Time Series: https://github.com/ozlerhakan/datacamp/blob/master/Visualizing%20Time%20Series%20Data%20in%20Python/Visualizing-Time-Series-Data-in-Python.ipynb

## Conclusion

This is an area where picking a library that you like and then sticking with it is often a good idea, especially if you're working solo and have the choice.

The choice of tool then becomes part of your initial analysis as you consider your question, audience, and what you want to communicate.

Do you want interactivity?

Is a clean graph with minimal code most appropriate?

Are you working with other data scientists that will be comfortable with Jupyter notebooks?

### Credits

Examples in this notebook are adapted from the official library documentation and tutorials.

https://matplotlib.org/stable/tutorials/index

https://seaborn.pydata.org/tutorial.html

Jake VanderPlas's PyCon talk "Exploratory Data Visualization with Altair" https://altair-viz.github.io/altair-tutorial/README.html

Examples from: https://www.datapine.com/blog/misleading-data-visualization-examples/ and https://www.syntaxtechs.com/blog/data-visualization-examples#h6