# Examples

Several of these plots have default interactions that you can try out directly on this page.

In [1]:
!pip freeze

In [None]:
import altair_ally as aly
from vega_datasets import data


# Lift Altair's default limit on the number of rows embedded in a chart
aly.alt.data_transformers.disable_max_rows()

# Load the dataset once, keep the four most common MPAA ratings,
# draw a reproducible sample, and select the columns used on this page
all_movies = data.movies()
movies = (
    all_movies[all_movies['MPAA Rating'].isin(["G", "PG", "PG-13", "R"])]
    .sample(400, random_state=234890)
    [['IMDB Votes', 'IMDB Rating', 'Rotten Tomatoes Rating',
      'Running Time min', 'MPAA Rating', 'Creative Type']])
movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 319 entries, 1634 to 2098
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   IMDB Votes              302 non-null    float64
 1   IMDB Rating             302 non-null    float64
 2   Rotten Tomatoes Rating  242 non-null    float64
 3   Running Time min        156 non-null    float64
 4   MPAA Rating             319 non-null    object 
 5   Creative Type           303 non-null    object 
dtypes: float64(4), object(2)
memory usage: 17.4+ KB

## Missing values

A missing-value plot can reveal patterns that would influence downstream analysis
and point to upstream wrangling issues.
It is also useful for indicating which variables are codependent in the data collection process,
such as the IMDB Votes and IMDB Rating in the plot below.

Selecting an interval in the heatmap of individual NaNs 
will automatically update the bar plot with the NaN counts.
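
The quantities behind such a plot can also be inspected directly in pandas.
The following is a minimal sketch on a small toy frame
(the column names and values are hypothetical and are not taken from the movies data);
it illustrates the idea and is not how `aly.nan` is implemented.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): "votes" and "rating" are missing in the
# same rows, while "running_time" is missing in different rows.
toy = pd.DataFrame({
    "votes": [120.0, np.nan, 45.0, np.nan, 300.0],
    "rating": [7.1, np.nan, 5.4, np.nan, 8.0],
    "running_time": [np.nan, 95.0, np.nan, 110.0, 102.0],
})

# Boolean mask of missing cells: the per-observation view of the NaNs.
nan_mask = toy.isna()

# Number of missing values per column: the aggregated view of the NaNs.
nan_counts = nan_mask.sum()

# Columns that are always missing together have identical masks,
# which hints at a shared step in the data collection process.
missing_together = nan_mask["votes"].equals(nan_mask["rating"])

print(nan_counts)
print(missing_together)
```

Identical masks for two columns are the tabular counterpart
of two columns whose missing cells line up in the heatmap.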

In [2]:
aly.nan(movies)

## Heatmaps

Heatmaps can be useful to get an overview of the observed values across all columns.

In [3]:
aly.heatmap(movies)

Heatmaps can be colored by a categorical column
and sorted to explore structure in the data.

In [4]:
aly.heatmap(movies, 'MPAA Rating', 'IMDB Rating')

Sorting by the categorical color column can aid exploration of differences between groups.
Two built-in rescaling methods are available
to keep the colors comparable across columns
(a custom function can be passed as well).
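
To illustrate what rescaling does, here is a minimal pandas sketch of two common column-wise rescalings.
It assumes that the method names used on this page, `'min-max'` and `'mean-sd'`, refer to the usual definitions;
the toy data and the choice of the sample standard deviation are assumptions,
and the exact computation inside the library may differ.

```python
import pandas as pd

# Toy data (hypothetical values) with columns on very different scales.
toy = pd.DataFrame({
    "votes": [100.0, 200.0, 400.0],
    "rating": [2.0, 6.0, 10.0],
})

# Min-max rescaling: every column is mapped onto the range [0, 1].
min_max = (toy - toy.min()) / (toy.max() - toy.min())

# Mean-sd rescaling (standardization): every column gets mean 0 and
# standard deviation 1. The sample standard deviation (pandas default)
# is an assumption made for this sketch.
mean_sd = (toy - toy.mean()) / toy.std()

print(min_max)
print(mean_sd)
```

After either rescaling, a given color corresponds to the same relative position within each column,
which is what makes columns with different units comparable in one heatmap.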

In [5]:
aly.heatmap(movies, 'MPAA Rating', 'MPAA Rating', 'mean-sd')

## Univariate distributions

Distributions are shown as densities by default,
and the subplots are laid out in square grids.
Densities can be drawn as areas or lines,
and a rug plot is included to indicate the number of observations,
since densities can look misleadingly smooth even for small datasets.

In [6]:
aly.dist(movies)

A categorical column can be used to group the data
and compare multiple distributions
within the same variable.

In [7]:
aly.dist(movies, 'MPAA Rating')

Histograms can be made with the `'bar'` mark.

In [8]:
aly.dist(movies, mark='bar')

Setting the `dtype` lets you visualize the distribution of non-quantitative variables
by plotting the counts of observations per category.
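
The counts per category can be cross-checked in pandas.
This is a small sketch on toy data (the column name and values are hypothetical),
not the computation the library performs.

```python
import pandas as pd

# Toy categorical column (hypothetical values).
toy = pd.DataFrame({
    "rating_label": ["PG", "R", "R", "PG-13", "R", "PG"],
})

# Number of observations per category, sorted from most to least frequent.
counts = toy["rating_label"].value_counts()

print(counts)
```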

In [9]:
aly.dist(movies, dtype='object')

## Pairwise variable relationships

Pairplots (also called scatter plot matrices) give an overview of the pairwise relationships
of all quantitative columns in the data.
Selecting points in one plot highlights the same points across all subplots.

In [10]:
aly.pair(movies)

In [11]:
aly.pair(movies, 'MPAA Rating')

## Pairwise variable correlation

A pairwise correlation plot can complement a pairplot
by providing a quantitative measure of the correlation between column pairs.
By default, both the Pearson and Spearman correlations are shown
to reveal linear as well as monotonic non-linear (exponential, logarithmic, etc.) relationships.
Note that neither of these coefficients picks up more complex,
non-monotonic column relationships (e.g. a symmetric quadratic),
so it is a good idea to use them in tandem with the pairplot.
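
The difference between the two coefficients can be seen on synthetic data.
The sketch below uses made-up columns (not the movies data) and plain pandas:
an exponential relationship is perfectly monotonic but not linear,
while a quadratic relationship that is symmetric around zero is neither.

```python
import numpy as np
import pandas as pd

# Synthetic x values, symmetric around zero: -3.0, -2.9, ..., 3.0
x = np.arange(-30, 31) / 10

toy = pd.DataFrame({
    "x": x,
    "exponential": np.exp(x),  # monotonic, non-linear
    "quadratic": x ** 2,       # non-monotonic
})

# Correlation of every column with x under both methods.
pearson = toy.corr(method="pearson")["x"]
spearman = toy.corr(method="spearman")["x"]

print(pd.DataFrame({"pearson": pearson, "spearman": spearman}))
```

For the exponential column, the Spearman coefficient is 1 while the Pearson coefficient is clearly lower;
for the quadratic column, both are close to 0 even though the columns are perfectly dependent.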

Hovering over a point shows the exact coefficient
and highlights the point across all subplots.

In [12]:
aly.corr(movies)

## Parallel coordinates

Similar to heatmaps,
parallel coordinate plots give an overview
of how individual observations are distributed
across all quantitative columns in the data.
Coloring by a categorical variable can help reveal groupings in the data
and is also an effective way to qualitatively assess clustering results
from unsupervised learning algorithms.

Click the legend to hide and show groups.

In [13]:
aly.parcoord(movies, 'MPAA Rating', 'min-max')