# Examples

This notebook contains a few examples of how you can use the plotting functions in Aly.
In general, 
the first argument passed to each function is the data,
and the second is the name of a categorical column to color by.
I've only written out the parameter names
when they are specified out of order
from how they are defined in the function signature.

Several of these plots have default interactions that you can try out directly on this page.

In [1]:
import altair_ally as aly
import pandas as pd
from vega_datasets import data


# Either disable the max rows warning or use the data server backend
aly.alt.data_transformers.disable_max_rows()
# aly.alt.data_transformers.enable('data_server')


penguins = (
    pd.read_csv('https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv')
    .assign(year = lambda df: df['year'].astype('object')))
penguins.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
 7   year               344 non-null    object 
dtypes: float64(4), object(4)
memory usage: 21.6+ KB


## Missing values

Visualizing missing values can reveal patterns that would influence downstream analysis
and might reflect upstream wrangling or data collection issues.
It can also indicate which variables are codependent in the data collection process,
such as the IMDB votes and ratings in the movies dataset plotted further down.

Selecting an interval in the heatmap of individual NaNs 
will automatically update the bar plot with the NaN counts.

In [2]:
aly.nan(penguins)

There are not that many missing values in the penguins data set,
so we'll have a look at a dataset of movies as well.
Here we can see patterns in the missing values,
such as that the movies that are missing IMDB votes
also don't have an average IMDB rating,
which makes sense since there are no ratings to average for those movies.

In [3]:
aly.nan(data.movies().sample(400))
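
The same kind of pattern can be checked numerically with plain pandas.
The following is a minimal sketch on a small made-up dataframe
(the column names and values are invented for illustration and are not taken from the movies data);
it counts the NaNs per column and checks whether two columns tend to be missing in the same rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'votes' and 'rating' are missing in the same rows
toy = pd.DataFrame({
    'title': ['a', 'b', 'c', 'd', 'e'],
    'votes': [120, np.nan, 45, np.nan, 300],
    'rating': [7.1, np.nan, 6.3, np.nan, 8.0],
    'budget': [1e6, 2e6, np.nan, 5e5, 3e6],
})

# Boolean mask of missing values, one column per variable
is_missing = toy.isna()

# Number of NaNs per column
nan_counts = is_missing.sum()

# Fraction of rows where 'rating' is missing, given that 'votes' is missing
rating_missing_given_votes = is_missing.loc[is_missing['votes'], 'rating'].mean()

print(nan_counts)
print(rating_missing_given_votes)
```

A conditional fraction close to 1 suggests that the two columns are missing together,
which is the kind of codependence the heatmap makes visible at a glance.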

## Heatmaps

Heatmaps can be useful to quickly get an overview of all the observations across all columns
in the dataset.

In [4]:
aly.heatmap(penguins)

By default each column's values are rescaled according to its min and max value
so that all columns have values between 0 and 1,
which makes the heatmap more effective.
The `rescale` parameter also takes a custom function
or can be turned off to use the raw values directly.
However, 
this is usually not very insightful
since columns with high values will dominate the colorscale
so that it is hard to pick out structure in the data.

In [5]:
aly.heatmap(penguins, rescale=None)
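
To make the default rescaling concrete,
here is a minimal sketch of min-max rescaling written in plain pandas.
It is not Aly's internal code,
and the toy dataframe and the `min_max` helper are made up for illustration.
The source does not state what signature a custom `rescale` function must have,
so this helper is only meant to show the arithmetic.

```python
import pandas as pd

# Hypothetical toy data with columns on very different scales
toy = pd.DataFrame({
    'length_mm': [35.0, 40.0, 50.0],
    'mass_g': [3000.0, 4500.0, 6000.0],
})


def min_max(column):
    """Rescale a numeric column so its smallest value is 0 and its largest is 1."""
    return (column - column.min()) / (column.max() - column.min())


# Apply the rescaling to each column separately
rescaled = toy.apply(min_max)
print(rescaled)
```

After rescaling, both columns span the same range,
so neither of them dominates a shared color scale.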

The 0-1 rescaled heatmap shows the structure of the data in each column
much more clearly than the heatmap with the raw values.
To understand the origins of structure in the data,
we can add an additional row below the numerical heatmap,
which is colored according to a categorical column.
In the plot below,
we can see that the penguin species membership
seems to coincide with the variation in the numerical columns.
You can hover over both the numerical and categorical observations
to view their exact value.

In [6]:
aly.heatmap(penguins, color='species')

The previous plot was sorted by species
because this happened to be the default order of the observations in our dataframe.
We can also explicitly specify a column in the dataframe to sort by.

In [7]:
aly.heatmap(penguins, color='species', sort='bill_depth_mm')

Multiple columns can be used both for coloring and sorting,
which further supports exploration of differences between categorical groups.
Since `color` and `sort` are the first two parameters to `heatmap` after the data,
we can leave out the parameter names to save some typing as we do below.

In [8]:
aly.heatmap(penguins, ['species', 'island'], ['species', 'bill_depth_mm'])
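
Sorting by several columns means that the first column decides the order
and later columns break ties within it.
The sketch below shows this with pandas `sort_values` on made-up data;
that the heatmap orders its rows in the same way is an assumption on my part
and not something this notebook verifies.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'species': ['Gentoo', 'Adelie', 'Gentoo', 'Adelie'],
    'bill_depth_mm': [15.0, 18.5, 14.2, 17.1],
})

# Sort by species first, then by bill depth within each species
ordered = toy.sort_values(['species', 'bill_depth_mm'])
print(ordered)
```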

If you want to look at all non-numerical columns,
but don't like the idea of typing them out by hand,
you can use the dataframe method `select_dtypes`.

Viewing all the categorical values at once can be a bit confusing at first glance,
but if we study the plot carefully we can see some patterns,
e.g. it appears that almost all the heaviest penguins
are males of the Gentoo species living on the Biscoe island.

In [9]:
aly.heatmap(penguins, penguins.select_dtypes(exclude='number').columns, 'body_mass_g')
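
If it is unclear which columns `select_dtypes` picks up,
it can help to inspect the result on its own before passing it to a plotting function.
A small sketch on made-up data:

```python
import pandas as pd

# Hypothetical toy data with two text columns and two numeric columns
toy = pd.DataFrame({
    'species': ['Adelie', 'Gentoo'],
    'island': ['Torgersen', 'Biscoe'],
    'body_mass_g': [3750.0, 5000.0],
    'flipper_length_mm': [181.0, 210.0],
})

# Column labels of everything that is not numeric
non_numeric = toy.select_dtypes(exclude='number').columns

# Column labels of the numeric columns, for comparison
numeric = toy.select_dtypes(include='number').columns

print(list(non_numeric))
print(list(numeric))
```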

## Distributions

Distributions are visualized as densities by default,
and the subplots are laid out in a square grid if possible.
Since density plots can be misleadingly smooth for small datasets,
they include a rug plot to indicate the number of observations.

In [10]:
aly.dist(penguins)
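
To see why a density curve alone can hide a small sample size,
here is a minimal sketch of a generic Gaussian kernel density estimate in NumPy.
The five observations are made up,
and the bandwidth rule (Silverman's rule of thumb) is my choice for the illustration;
it is not necessarily what Aly or Altair computes.

```python
import numpy as np

# Hypothetical sample with only five observations
obs = np.array([1.0, 1.5, 2.0, 4.0, 4.2])

# Silverman's rule-of-thumb bandwidth for a Gaussian kernel
n = obs.size
sigma = obs.std(ddof=1)
iqr = np.subtract(*np.percentile(obs, [75, 25]))
bandwidth = 0.9 * min(sigma, iqr / 1.34) * n ** (-1 / 5)

# Evaluate the density on a regular grid that extends well past the data
grid = np.linspace(obs.min() - 5 * bandwidth, obs.max() + 5 * bandwidth, 500)
z = (grid[:, None] - obs[None, :]) / bandwidth
density = np.exp(-0.5 * z ** 2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))

# Approximate the area under the curve with a simple Riemann sum
step = grid[1] - grid[0]
area = (density * step).sum()
print(area)
```

The resulting curve is smooth and has an area of about 1 regardless of how few points went into it,
which is why a rug plot of the individual observations is a useful companion.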

To compare multiple distributions
a categorical column can be used to group the data
and color the density areas accordingly,
similar to how we used it with the heatmaps above.

In [11]:
aly.dist(penguins, 'species')

It is also possible to use a line mark instead of an area,
which can facilitate comparisons between many distributions.

In [12]:
aly.dist(penguins, 'species', mark='line')

Histograms can be made with the `'bar'` mark.
These could also be grouped by a color variable,
but it is often more effective to use density plots with grouped data.

In [13]:
aly.dist(penguins, mark='bar')

Setting the `dtype` parameter to a categorical pandas dtype
such as `'object'` or `'category'`
allows us to visualize the distribution of non-quantitative variables
by plotting the counts of observations per category in a bar chart.

In [14]:
aly.dist(penguins, 'sex', dtype='object')
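
The numbers behind such a bar chart are plain category counts.
As a rough sketch on made-up data,
pandas `value_counts` gives the counts for one column
and `crosstab` splits them by a second categorical column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in 'sex'
toy = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo', 'Gentoo', 'Gentoo', 'Chinstrap'],
    'sex': ['male', 'female', 'male', np.nan, 'female', 'male'],
})

# Number of observations per species
species_counts = toy['species'].value_counts()

# Number of observations per species, split by sex
# (rows with a missing value in either column are dropped by default)
by_sex = pd.crosstab(toy['species'], toy['sex'])

print(species_counts)
print(by_sex)
```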

## Pairwise variable relationships

Pairplots (also called scatter plot matrices) give an overview of the pairwise relationships
of all quantitative columns in the data.
Clicking and dragging with the mouse in one plot
highlights the same points across all subplots.

In [15]:
aly.pair(penguins)

It looks like there are 2-3 different groups
within most of the individual scatter plots above.
Again we can use color to investigate 
if these groups appear to coincide with a categorical variable,
such as the penguin species.

In [16]:
aly.pair(penguins, 'species')

## Pairwise variable correlations

A pairwise correlation plot can complement a pairplot
and provide a quantitative measurement of correlation between column pairs.
By default the Pearson and Spearman correlations are shown
to reveal both linear and monotonic non-linear (exponential, logarithmic, etc.) relationships.
Note that neither of these correlation metrics
would pick up column relationships that aren't monotonic (e.g. quadratic),
so it is a good idea to use correlation plots in tandem with pairplots.

Hovering over a point shows the exact coefficient
and highlights the point across all subplots
(double clicking clears the highlight).

In [17]:
aly.corr(penguins)

Correlation plots are very useful when there are many columns in the dataframe,
which is the case for the full movies data.

In [18]:
aly.corr(data.movies())
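
The difference between the two coefficients,
and their shared blind spot for non-monotonic relationships,
can be illustrated with pandas `DataFrame.corr` on synthetic data.
The columns below are constructed for the illustration and have nothing to do with the movies data.

```python
import numpy as np
import pandas as pd

# Synthetic data: x is symmetric around zero
x = np.arange(-10, 11, dtype=float)
toy = pd.DataFrame({
    'x': x,
    'exponential': np.exp(x),  # monotonic but non-linear in x
    'quadratic': x ** 2,       # not monotonic in x
})

# Pearson measures linear association, Spearman measures monotonic association
pearson = toy.corr(method='pearson')
spearman = toy.corr(method='spearman')

print(pearson['x'])
print(spearman['x'])
```

Spearman reports a perfect correlation between `x` and the exponential column
while Pearson reports a noticeably weaker one,
and both coefficients are essentially zero for the quadratic column
even though it is fully determined by `x`.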

## Parallel coordinates

Similar to heatmaps,
parallel coordinate plots give an overview
of how individual observations are distributed
across all quantitative columns in the data.
Coloring by a categorical variable can help reveal groupings in the data
and is also effective to qualitatively assess clustering results
from using unsupervised learning algorithms.

Click the legend to hide and show groups.

In [19]:
aly.parcoord(penguins, 'species')