`superheat()`

A "super" correlation heatmap you can't find elsewhere

Traditional correlation heatmaps generated from Seaborn or Plotly are good at using color to distinguish positive from negative correlations. However, they aren't very good at drawing your eyes towards the strongest correlations. Especially with large datasets containing many variables, the chart gets cluttered with weak correlations that you don't care about.

Dynamically sized marks make large heatmaps like the one below far easier to read. Your eyes go straight to the most significant correlations.

import chart_tools as ct
df = ct.load_data('ames_mini')
ct.set_style(15)
ct.superheat(df.corr(), half_mask=False, mark_scale=8, grid=False);

Unfortunately, this chart cannot be built with the "heatmap" (or equivalent) function provided in most popular graphing libraries. Instead, we have to build a scatterplot that looks and works like a heatmap, which is awkward and takes more code than anyone would want to write in a typical visualization workflow. Source code can be found here.
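To give a feel for the approach, here is a minimal sketch of a scatterplot that mimics a heatmap. It is not the chart-tools source: the function names `heatmap_points` and `scatter_heatmap`, the `coolwarm` colormap, and the size scaling are assumptions made for illustration only.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def heatmap_points(corr: pd.DataFrame):
    """Flatten a square correlation matrix into scatter coordinates."""
    n = len(corr.columns)
    # One (x, y) pair per cell: x is the column index, y is the row index.
    xs, ys = np.meshgrid(np.arange(n), np.arange(n))
    return xs.ravel(), ys.ravel(), corr.to_numpy().ravel()


def scatter_heatmap(corr: pd.DataFrame, mark_scale: float = 5.0):
    """Draw each correlation as a square whose area grows with |r|."""
    xs, ys, values = heatmap_points(corr)
    fig, ax = plt.subplots(figsize=(6, 6))
    points = ax.scatter(
        xs, ys,
        s=np.abs(values) * mark_scale * 100,  # assumed scaling, tune to taste
        c=values, cmap="coolwarm", vmin=-1, vmax=1,
        marker="s",
    )
    ax.set_xticks(range(len(corr.columns)))
    ax.set_xticklabels(corr.columns, rotation=90)
    ax.set_yticks(range(len(corr.index)))
    ax.set_yticklabels(corr.index)
    ax.invert_yaxis()  # first row at the top, like a regular heatmap
    fig.colorbar(points, ax=ax)
    return fig, ax


# Tiny made-up dataset, used only to exercise the helpers.
demo = pd.DataFrame(
    {"a": [1, 2, 3, 4], "b": [2, 4, 6, 9], "c": [4, 3, 2, 0]}
).corr()
xs, ys, values = heatmap_points(demo)
```

Calling `scatter_heatmap(demo)` would draw the chart. Because every cell is an ordinary scatter point, weak correlations shrink toward nothing while strong ones stay large.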

Documentation

Important: Remember to place a semicolon at the end of the function call to avoid the "Figure size ..." annotation printout.

Required Parameters

corr: Correlation dataframe (use df.corr()). Must have an equal number of rows and columns.

Optional Parameters

title - str: Chart title. Default: None
thresh_avg - float: Removes any variable whose average correlation to all others is below threshold. Default: None
thresh_mask - float: Masks any individual correlations that are below threshold. Default: None
half_mask - bool: Masks half the chart, hiding duplicate correlations. Default: True
self_mask - bool: Masks correlations between variables and themselves. Default: True
cbar - bool: Include colorbar. Default: True
mark_scale - int: Change the scale of all marks. Default: 5
grid - bool: Show grid. Default: True
palette - sns.diverging_palette: Color palette to use on marks. Default: (20, 220, n_colors)
size - int: Set chart height and width. Default: None
marker - char: Marker shape. Default: 's'. See the matplotlib markers documentation for a list of all marker shapes.
bar_ticks - int: Number of tick marks on color bar. Default: 5
n_colors - int: Number of colors to include in color palette. Default: 128
**kwargs: Any additional keyword arguments will go to the matplotlib plt.scatter function

`set_style()`

Wrapper for seaborn.set_theme() that applies defaults to save you time

Required Parameters - None

Optional Parameters

size - int or tuple: Declares chart size. Int will set width and height to the same value. Use tuple, (width, height) to set custom values. Default: 12

Parameters passed to sns.set_theme(), but with defaults

palette - str: Default: "pastel"
style - str: Default: "whitegrid"
font_scale - float: Default: 1.5
**kwargs: Any additional keyword arguments will go into sns.set_theme()
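As a rough sketch of how such a wrapper could assemble its arguments, the hypothetical helper below turns `size` into a figure size and fills in the defaults listed above. The name `theme_kwargs` and the use of the `figure.figsize` rc setting are assumptions, not the actual chart-tools code.

```python
def theme_kwargs(size=12, palette="pastel", style="whitegrid",
                 font_scale=1.5, **kwargs):
    """Build keyword arguments for seaborn.set_theme() (hypothetical helper)."""
    # An int means a square chart; a tuple is taken as (width, height).
    figsize = (size, size) if isinstance(size, int) else tuple(size)
    # Merge with any rc settings the caller supplied, letting theirs win.
    rc = {"figure.figsize": figsize, **(kwargs.pop("rc", None) or {})}
    return dict(palette=palette, style=style, font_scale=font_scale,
                rc=rc, **kwargs)


square = theme_kwargs(10)
wide = theme_kwargs((8, 4), style="darkgrid")
```

The resulting dictionary could then be passed along as `sns.set_theme(**theme_kwargs(10))`.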

Instructions & Examples

All of the following examples will start with this code:

import chart_tools as ct
df = ct.load_data('ames_mini').drop(columns=['YrSold', 'Id', 'GarageCars', 'Fireplaces', 'ScreenPorch', 'BsmtUnfSF', 'Bathrooms'])

set_style(): Easiest way to set chart size and apply a color preset. Pass an integer (like in the above example) to create a square, or pass a tuple, (width, height) for custom dimensions.

Default

ct.set_style(10) # Sets charts to 10x10 square, with chart-tools default styling
ct.superheat(df.corr());

Remove variables whose average correlation is below threshold

ct.set_style(6) # Decrease chart size to keep proportions
ct.superheat(df.corr(), thresh_avg=0.19);
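For intuition, the filtering behind `thresh_avg` might look something like the sketch below. It assumes the average is taken over absolute correlations and excludes each variable's correlation with itself; the real function may differ on both points, and `drop_weak_variables` is a made-up name.

```python
import numpy as np
import pandas as pd


def drop_weak_variables(corr: pd.DataFrame, thresh_avg: float) -> pd.DataFrame:
    """Keep variables whose mean |r| against the others meets the threshold."""
    strength = corr.abs().to_numpy(dtype=float).copy()
    np.fill_diagonal(strength, np.nan)  # ignore self-correlation
    avg = pd.Series(np.nanmean(strength, axis=1), index=corr.index)
    keep = avg[avg >= thresh_avg].index
    return corr.loc[keep, keep]


# Hand-written correlation matrix, for illustration only.
names = ["a", "b", "c"]
demo = pd.DataFrame(
    [[1.0, 0.8, 0.1], [0.8, 1.0, 0.05], [0.1, 0.05, 1.0]],
    index=names, columns=names,
)
filtered = drop_weak_variables(demo, thresh_avg=0.19)
```

Here `c` averages 0.075 against the other two variables, so it is dropped and only `a` and `b` remain.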

Mask individual correlations below threshold

ct.superheat(df.corr(), thresh_mask=0.19);
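Masking individual cells can be sketched with pandas alone. The helper below is hypothetical and assumes the threshold is compared against the absolute value of each correlation, which the parameter description does not state explicitly.

```python
import pandas as pd


def mask_weak_cells(corr: pd.DataFrame, thresh_mask: float) -> pd.DataFrame:
    """Replace correlations weaker than the threshold with NaN."""
    # DataFrame.where keeps cells where the condition holds, NaN elsewhere.
    return corr.where(corr.abs() >= thresh_mask)


# Hand-written correlation matrix, for illustration only.
names = ["a", "b", "c"]
demo = pd.DataFrame(
    [[1.0, 0.8, 0.1], [0.8, 1.0, 0.05], [0.1, 0.05, 1.0]],
    index=names, columns=names,
)
masked = mask_weak_cells(demo, thresh_mask=0.19)
```

Unlike `thresh_avg`, this keeps every variable on the axes and only blanks out the weak pairs.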

Include all real correlations

ct.superheat(df.corr(), half_mask=False);

Include self-on-self correlations

ct.superheat(df.corr(), self_mask=False);
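The two masks can be pictured as boolean grids laid over the correlation matrix. The sketch below is an illustration under assumptions: `mask_duplicates` is a hypothetical name, and hiding the upper triangle (rather than the lower) is an arbitrary choice.

```python
import numpy as np
import pandas as pd


def mask_duplicates(corr: pd.DataFrame, half_mask: bool = True,
                    self_mask: bool = True) -> pd.DataFrame:
    """Hide mirrored and/or diagonal cells of a square correlation matrix."""
    n = len(corr)
    hidden = np.zeros((n, n), dtype=bool)
    if half_mask:
        # Cells above the diagonal repeat the ones below it.
        hidden |= np.triu(np.ones((n, n), dtype=bool), k=1)
    if self_mask:
        # The diagonal holds each variable against itself (always 1).
        hidden |= np.eye(n, dtype=bool)
    return corr.mask(hidden)


# Hand-written correlation matrix, for illustration only.
names = ["a", "b", "c"]
demo = pd.DataFrame(
    [[1.0, 0.8, 0.1], [0.8, 1.0, 0.05], [0.1, 0.05, 1.0]],
    index=names, columns=names,
)
both = mask_duplicates(demo)
full = mask_duplicates(demo, half_mask=False)
with_self = mask_duplicates(demo, self_mask=False)
```

With both masks on, a chart of n variables shows n(n-1)/2 cells, one per unique pair.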

Change marker type

ct.superheat(df.corr(), marker='o');

Change marker scale

ct.superheat(df.corr(), mark_scale=8); # Marks are slightly larger than with the default of 5

Use fewer colors

ct.superheat(df.corr(), n_colors=12); # Look at colorbar to see what's changed
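One way to picture what `n_colors` controls: each correlation is snapped to one of `n_colors` positions along the palette, so fewer colors means coarser steps. The mapping below is an assumed, simplified version of that idea, and `color_index` is a hypothetical helper rather than part of chart-tools.

```python
import numpy as np


def color_index(values, n_colors: int = 128) -> np.ndarray:
    """Map correlations in [-1, 1] to integer positions in a palette."""
    values = np.clip(np.asarray(values, dtype=float), -1.0, 1.0)
    position = (values + 1.0) / 2.0  # rescale to [0, 1]
    return np.rint(position * (n_colors - 1)).astype(int)


coarse = color_index([-1.0, 0.0, 1.0], n_colors=11)
ends = color_index([-1.0, 1.0], n_colors=12)
```

With 12 colors, a correlation of -1 lands on palette entry 0 and +1 on entry 11, and everything in between shares the remaining ten entries.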

Hide grid

ct.superheat(df.corr(), grid=False, marker='o');

Why is superheat great for large datasets?

Because with a dataset this large, the less you see, the better.

This function is based on Drazen Zaric's "Better Heatmaps" in this article, and his heatmaps package.
