# Examples

This notebook contains a few examples of how you can use the plotting functions in Aly.
In general, 
the first argument passed to each function is the data,
and the second is the name of a categorical column to color by.
I've only written out the parameter names
when arguments are passed in a different order
than the one defined in the function signature.

Several of these plots have default interactions that you can try out directly on this page.

In [1]:
import altair_ally as aly
from vega_datasets import data


aly.alt.data_transformers.disable_max_rows()

movies = (
    data
    .movies()
    .sample(400, random_state=234890)
    .query('`MPAA Rating` in ["G", "PG", "PG-13", "R"]')
    [['IMDB Votes', 'IMDB Rating', 'Rotten Tomatoes Rating',
      'Running Time min', 'MPAA Rating', 'Creative Type']])
movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 319 entries, 1634 to 2098
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   IMDB Votes              302 non-null    float64
 1   IMDB Rating             302 non-null    float64
 2   Rotten Tomatoes Rating  242 non-null    float64
 3   Running Time min        156 non-null    float64
 4   MPAA Rating             319 non-null    object 
 5   Creative Type           303 non-null    object 
dtypes: float64(4), object(2)
memory usage: 17.4+ KB


In [2]:
import pandas as pd

penguins = (
    pd.read_csv('https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv')
    .assign(year = lambda df: df['year'].astype('object')))
penguins.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
 7   year               344 non-null    object 
dtypes: float64(4), object(4)
memory usage: 21.6+ KB


## Missing values

Visualizing missing values can reveal patterns that would influence downstream analysis
and might reflect upstream wrangling or data collection issues.
It can also indicate which variables are codependent in the data collection process,
such as the IMDB Votes and IMDB Rating columns in the movies plot further down.

Selecting an interval in the heatmap of individual NaNs 
will automatically update the bar plot with the NaN counts.

In [3]:
aly.nan(penguins)

There are not that many missing values in the penguins data set,
so we'll have a look at the movies one as well.
Here we can see patterns in the missing values:
for example, the movies that are missing votes
also lack a rating,
which makes sense.

In [4]:
aly.nan(movies)
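
The pattern described above can also be checked numerically with plain pandas.
This is a minimal sketch on a small made-up dataframe (the column names and values are hypothetical, not the movies data),
and it is not meant to show how Aly builds its plot internally.

```python
import pandas as pd

# Small made-up dataframe; names and values are for illustration only
toy = pd.DataFrame({
    'votes': [120.0, None, 45.0, None, 300.0],
    'rating': [7.1, None, 6.3, None, 8.0],
    'runtime': [None, 95.0, None, 110.0, 102.0],
})

# Boolean mask with one cell per observation and column
is_missing = toy.isna()

# Number of NaNs per column, similar to what a bar plot of NaN counts summarizes
nan_counts = is_missing.sum()

# Number of rows where 'votes' and 'rating' are missing together
both_missing = (is_missing['votes'] & is_missing['rating']).sum()

print(nan_counts)
print(both_missing)
```

When the joint count equals the per-column counts, as it does here, the two columns are always missing together.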

## Heatmaps of all observations

Heatmaps can be useful to get an overview of the observed values across all columns.

In [5]:
aly.heatmap(penguins)

By default, all column values are rescaled to lie between 0 and 1
so that they can be visualized together in the same heatmap.
There are different presets for rescaling the data,
and a custom function can be passed instead.
The raw values can also be used directly,
but this is usually not very useful.

In [6]:
aly.heatmap(penguins, rescale=None)
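
To make the idea of rescaling to the 0 to 1 range concrete,
here is a minimal min-max scaling sketch in plain pandas.
It uses a made-up dataframe, and it is an assumption that this resembles Aly's default preset;
the library's actual implementation may differ.

```python
import pandas as pd

# Made-up values on very different scales
toy = pd.DataFrame({
    'bill_length_mm': [39.1, 46.5, 50.0],
    'body_mass_g': [3750.0, 5000.0, 6250.0],
})

# Min-max scaling: subtract each column's minimum and divide by its range,
# so that every column ends up between 0 and 1
rescaled = (toy - toy.min()) / (toy.max() - toy.min())

print(rescaled)
```

After rescaling, both columns span the same range and can share one color scale.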

Heatmaps can be colored by a categorical column.
Here we can see that the penguin species
explains much of the variation
in the different measures.
You can hover over both the numerical and categorical observations
to view their exact value.

In [7]:
aly.heatmap(penguins, 'species')

The previous plot was sorted by species
because this is the default order of the observations in our dataframe.
We can also explicitly specify a column in the dataframe to sort by.

In [8]:
aly.heatmap(penguins, 'species', 'bill_depth_mm')

Multiple columns can be used both for coloring and sorting,
which further supports exploration of differences between groups.

In [9]:
aly.heatmap(penguins, ['species', 'island'], ['species', 'bill_depth_mm'])

If you want to look at all non-numerical columns,
but don't like the idea of typing them out by hand,
you can use the dataframe method `select_dtypes`.

All the categorical values can be a bit confusing at first glance,
but if we study the graph we can see some patterns,
e.g. it appears that almost all the heaviest penguins
are males of the Gentoo species living on the Biscoe island.

In [10]:
aly.heatmap(penguins, penguins.select_dtypes(exclude='number').columns, ['body_mass_g'])
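
As a small illustration of what `select_dtypes(exclude='number').columns` returns,
here is a sketch on a made-up dataframe (not the penguins data).
The `year` column is stored as `object` here, mirroring the conversion done when the penguins data was loaded,
which is why it counts as non-numerical.

```python
import pandas as pd

# Made-up dataframe with a mix of column types
toy = pd.DataFrame({
    'species': ['Adelie', 'Gentoo'],
    'island': ['Torgersen', 'Biscoe'],
    'body_mass_g': [3750.0, 5000.0],
    'year': pd.Series([2007, 2008], dtype='object'),
})

# Keep every column whose dtype is not numeric and grab the column labels
non_numeric = toy.select_dtypes(exclude='number').columns

print(list(non_numeric))
```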

## Univariate distributions

Distributions are shown as densities by default,
and the subplots are laid out in square grids.
Densities can be drawn as areas or lines,
and a rug plot is included to indicate the number of observations,
since densities can look misleadingly smooth even for small datasets.

Setting the `dtype` makes it possible to visualize the distribution of non-quantitative variables
by plotting the counts of observations per category.

In [11]:
aly.dist(penguins)

A categorical column can be used to group the data
and compare multiple distributions
within the same variable.

In [12]:
aly.dist(penguins, 'species')

Histograms can be made with the `'bar'` mark.
These could also be grouped by a color variable,
but it is often more effective to use density plots with grouped data.

In [13]:
aly.dist(penguins, mark='bar')

Setting the `dtype` to a categorical value such as `'object'`
makes it possible to visualize the distribution of non-quantitative variables
by plotting the counts of observations per category.

In [14]:
aly.dist(penguins, dtype='object', color_col='sex')
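
The counts per category, split by a grouping column, can also be tabulated directly in pandas.
This is a minimal sketch on made-up data, offered only as a numerical counterpart to the plot;
it is not a description of how Aly computes the counts.

```python
import pandas as pd

# Made-up observations for illustration
toy = pd.DataFrame({
    'island': ['Biscoe', 'Biscoe', 'Dream', 'Dream', 'Dream'],
    'sex': ['male', 'female', 'male', 'male', 'female'],
})

# Rows are islands, columns are sexes, cells are observation counts
counts = pd.crosstab(toy['island'], toy['sex'])

print(counts)
```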

## Pairwise variable relationships

Pairplots (also called scatter plot matrices) give an overview of the pairwise relationships
of all quantitative columns in the data.
Selecting in one plot highlights the same points across all subplots.

In [15]:
aly.pair(penguins)

In [16]:
aly.pair(penguins, 'island')

## Pairwise variable correlations

A pairwise correlation plot can complement a pairplot
and provide a quantitative measurement of correlation between column pairs.
By default, the Pearson and Spearman correlations are shown
to reveal both linear and monotonic non-linear (exponential, logarithmic, etc.) relationships.
Note that none of these correlations would pick up more complex
column relationships (e.g. quadratic),
so it is a good idea to use these in tandem with the pairplot.

Hovering over a point shows the exact coefficient
and highlights the point across all subplots;
double-clicking clears the highlight.

In [17]:
aly.corr(penguins)
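
To illustrate the difference between the two coefficients,
the sketch below computes them by hand on synthetic data using only the standard library.
The helper functions are hypothetical names written for this example and are unrelated to Aly's own code.
An exponential relationship is monotonic, so Spearman sees it as perfect while Pearson does not;
a symmetric quadratic relationship is missed by both.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks where tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend `end` over the run of values tied with the one at `start`
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for i in order[start:end + 1]:
            ranks[i] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Monotonic but non-linear relationship
x = list(range(10))
exponential = [math.exp(v) for v in x]
pearson_exp = pearson(x, exponential)
spearman_exp = spearman(x, exponential)

# Symmetric, non-monotonic relationship
x_sym = list(range(-5, 6))
quadratic = [v ** 2 for v in x_sym]
pearson_quad = pearson(x_sym, quadratic)
spearman_quad = spearman(x_sym, quadratic)

print(pearson_exp, spearman_exp)
print(pearson_quad, spearman_quad)
```

A scatter plot of the quadratic data would show the relationship immediately even though both coefficients are close to zero.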

Correlation plots are very useful when there are many columns in the dataframe,
which is the case for the full movies data.

In [18]:
aly.corr(data.movies())

## Parallel coordinates

Similar to heatmaps,
parallel coordinate plots give an overview
of how individual observations are distributed
across all quantitative columns in the data.
Coloring by a categorical variable can help reveal groupings in the data
and is also an effective way to qualitatively assess clustering results
from unsupervised learning algorithms.

Click the legend to hide and show groups.

In [19]:
aly.parcoord(penguins, 'species')