superheat.md

superheat()

A "super" correlation heatmap you can't find elsewhere

Documentation

Instructions & Examples

Traditional correlation heatmaps generated from Seaborn or Plotly are good at using color to distinguish positive from negative correlations. However, they aren't very good at drawing your eyes towards the strongest correlations. Especially with large datasets containing many variables, the chart gets cluttered with weak correlations that you don't care about.

Dynamically sized marks make large heatmaps like the one below far easier to read. Your eyes go straight to the most significant correlations.

import chart_tools as ct
df = ct.load_data('ames_mini')
ct.set_style(15)
ct.superheat(df.corr(), half_mask=False, mark_scale=8, grid=False);

Screen Shot 2022-04-24 at 3 09 06 AM

Unfortunately, the "heatmap" function (or its equivalent) in most popular graphing libraries can't produce this chart, so we have to build a scatterplot that looks and works like a heatmap instead. That is awkward to do and takes more code than most people want to write in a typical visualization workflow. Source code can be found here.
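To make the idea concrete, here is a minimal sketch of a scatterplot laid out like a heatmap, using plain pandas and matplotlib. It is not the chart-tools source code: the helper name `scatter_heatmap`, the "coolwarm" colormap, the size multiplier, and the toy data are all assumptions made for illustration. Each cell of the correlation matrix becomes one square marker whose color encodes the correlation and whose area grows with its absolute value.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def scatter_heatmap(corr, mark_scale=5):
    """Hypothetical helper: draw a correlation matrix as sized scatter marks."""
    labels = list(corr.columns)
    n = len(labels)
    # One (x, y) grid position per cell: x is the column, y is the row
    xs, ys = np.meshgrid(np.arange(n), np.arange(n))
    values = corr.to_numpy()

    fig, ax = plt.subplots(figsize=(6, 6))
    points = ax.scatter(
        xs.ravel(),
        ys.ravel(),
        c=values.ravel(),  # color encodes the correlation value
        s=np.abs(values).ravel() * mark_scale * 100,  # area encodes strength
        cmap="coolwarm",
        vmin=-1,
        vmax=1,
        marker="s",
    )
    ax.set_xticks(range(n))
    ax.set_xticklabels(labels, rotation=90)
    ax.set_yticks(range(n))
    ax.set_yticklabels(labels)
    ax.set_xlim(-0.5, n - 0.5)
    ax.set_ylim(n - 0.5, -0.5)  # flip the y-axis so row 0 sits on top
    fig.colorbar(points, ax=ax)
    return fig, ax


# Toy data standing in for a real dataframe
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 4)), columns=list("abcd"))
fig, ax = scatter_heatmap(toy.corr())
```

Because weak correlations get tiny marks, they fade into the background without being removed from the chart.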


Documentation

Important: Remember to place a semicolon at the end of the function call to avoid the "Figure size ..." annotation printout.

Required Parameters

  • corr: Correlation dataframe (use df.corr()). Must be square, with an equal number of rows and columns.

Optional Parameters

  • title - str: Chart title. Default: None
  • thresh_avg - float: Removes any variable whose average correlation to all others is below threshold. Default: None
  • thresh_mask - float: Masks any individual correlations that are below threshold. Default: None
  • half_mask - bool: Masks half the chart, hiding duplicate correlations. Default: True
  • self_mask - bool: Masks correlations between variables and themselves. Default: True
  • cbar - bool: Include colorbar. Default: True
  • mark_scale - int: Change the scale of all marks. Default: 5
  • grid - bool: Show grid. Default: True
  • palette - sns.diverging_palette: Color palette to use on marks. Default: (20, 220, n_colors)
  • size - int: Set chart height and width. Default: None
  • marker - str: Marker shape. Default: 's'. Click here for a list of all marker shapes.
  • bar_ticks - int: Number of tick marks on color bar. Default: 5
  • n_colors - int: Number of colors to include in color palette. Default: 128
  • **kwargs: Any additional keyword arguments will go to the matplotlib plt.scatter function
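The filtering and masking options can be approximated with plain pandas. The sketch below is an assumption about the behavior, not the library's implementation: `filter_corr` is a hypothetical helper, it compares absolute correlations against the thresholds, and it hides the upper triangle when `half_mask` is on. The real function may differ on any of these points.

```python
import numpy as np
import pandas as pd


def filter_corr(corr, thresh_avg=None, thresh_mask=None,
                half_mask=True, self_mask=True):
    """Hypothetical helper approximating the masking options listed above."""
    corr = corr.copy()

    if thresh_avg is not None:
        # Average absolute correlation of each variable to all *other* variables
        off_diag = corr.abs().where(~np.eye(len(corr), dtype=bool))
        avg = off_diag.mean(axis=1)
        kept = avg[avg >= thresh_avg].index
        corr = corr.loc[kept, kept]

    mask = np.zeros(corr.shape, dtype=bool)
    if thresh_mask is not None:
        mask |= (corr.abs() < thresh_mask).to_numpy()  # hide weak cells
    if half_mask:
        mask |= np.triu(np.ones(corr.shape, dtype=bool), k=1)  # hide upper triangle
    if self_mask:
        mask |= np.eye(len(corr), dtype=bool)  # hide the diagonal

    return corr.mask(mask)  # masked cells become NaN


corr = pd.DataFrame(
    [[1.0, 0.8, 0.1],
     [0.8, 1.0, -0.05],
     [0.1, -0.05, 1.0]],
    index=list("abc"), columns=list("abc"),
)
masked = filter_corr(corr, thresh_mask=0.19)   # only the a-b correlation survives
trimmed = filter_corr(corr, thresh_avg=0.19)   # variable "c" is dropped entirely
```

Note the difference between the two thresholds: `thresh_avg` removes whole rows and columns, while `thresh_mask` hides individual cells and leaves the chart's shape unchanged.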

set_style()

Wrapper for seaborn.set_theme() that applies defaults to save you time

Required Parameters - None

Optional Parameters

  • size - int or tuple: Sets chart size. An int sets width and height to the same value. Use a tuple, (width, height), to set custom values. Default: 12

Parameters passed to sns.set_theme(), but with defaults

  • palette - str: Default: "pastel"
  • style - str: Default: "whitegrid"
  • font_scale - float: Default: 1.5
  • **kwargs: Any additional keyword arguments will go into sns.set_theme()
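A wrapper like this can be sketched in a few lines. The function below is a hypothetical stand-in, not the chart-tools source: the name `set_style_sketch` and the use of the `figure.figsize` rc setting to apply the size are assumptions. The defaults mirror the ones listed above.

```python
import matplotlib.pyplot as plt
import seaborn as sns


def set_style_sketch(size=12, palette="pastel", style="whitegrid",
                     font_scale=1.5, **kwargs):
    """Hypothetical stand-in for ct.set_style(), built on sns.set_theme()."""
    # An int means a square figure; a tuple is taken as (width, height)
    figsize = (size, size) if isinstance(size, int) else tuple(size)
    rc = {"figure.figsize": figsize}
    rc.update(kwargs.pop("rc", None) or {})  # let callers add their own rc overrides
    sns.set_theme(palette=palette, style=style, font_scale=font_scale,
                  rc=rc, **kwargs)


set_style_sketch(10)        # 10x10 square
set_style_sketch((12, 6))   # 12 wide, 6 tall
```

After the second call, new figures default to 12 by 6 inches until the style is set again.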

Instructions & Examples

All of the following examples will start with this code:

import chart_tools as ct
df = ct.load_data('ames_mini').drop(columns=['YrSold', 'Id', 'GarageCars', 'Fireplaces', 'ScreenPorch', 'BsmtUnfSF', 'Bathrooms'])

set_style(): The easiest way to set chart size and apply a color preset. Pass an integer (as in the example above) to create a square chart, or pass a tuple, (width, height), for custom dimensions.

Default

ct.set_style(10) # Sets charts to a 10x10 square, with chart-tools default styling
ct.superheat(df.corr());

Screen Shot 2022-04-24 at 2 56 01 AM

Remove variables whose average correlation is below threshold

ct.set_style(6) # Decrease chart size to keep proportions
ct.superheat(df.corr(), thresh_avg=0.19);

Screen Shot 2022-04-24 at 2 58 45 AM

Mask individual correlations below threshold

ct.superheat(df.corr(), thresh_mask=0.19);

Screen Shot 2022-04-24 at 2 59 27 AM

Show both halves of the chart (include duplicate correlations)

ct.superheat(df.corr(), half_mask=False);

Screen Shot 2022-04-24 at 3 00 13 AM

Include correlations between variables and themselves

ct.superheat(df.corr(), self_mask=False);

Screen Shot 2022-04-24 at 3 00 52 AM

Change marker type

ct.superheat(df.corr(), marker='o');

Screen Shot 2022-04-24 at 3 01 40 AM

Change marker scale

ct.superheat(df.corr(), mark_scale=8); # Marks are slightly larger; the default is 5

Screen Shot 2022-04-24 at 3 02 22 AM

Use fewer colors

ct.superheat(df.corr(), n_colors=12); # Look at colorbar to see what's changed

Screen Shot 2022-04-24 at 3 02 57 AM
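To see why fewer colors produce visible bands in the colorbar, it can help to build the palette by hand. This sketch assumes the documented default, (20, 220, n_colors), corresponds to the arguments of `sns.diverging_palette`. The `color_for` lookup is a hypothetical helper showing one way a correlation could be assigned to a palette bin, not necessarily how superheat does it.

```python
import seaborn as sns

n_colors = 12
palette = sns.diverging_palette(20, 220, n=n_colors)  # list of n_colors RGB tuples


def color_for(value, palette):
    """Hypothetical helper: pick the palette bin for a correlation in [-1, 1]."""
    position = (value + 1) / 2  # rescale from [-1, 1] to [0, 1]
    index = int(round(position * (len(palette) - 1)))
    return palette[index]


strong_negative = color_for(-1.0, palette)  # first color in the palette
strong_positive = color_for(1.0, palette)   # last color in the palette
```

With only 12 bins, correlations that are close in value share the same color, which is what makes the colorbar look stepped rather than smooth.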

Hide grid

ct.superheat(df.corr(), grid=False, marker='o');

Screen Shot 2022-04-24 at 3 03 29 AM

Why is superheat great for large datasets?

Because with a dataset this large, the less you see, the better.

Screen Shot 2022-04-24 at 3 09 06 AM



This function is based on Drazen Zaric's "Better Heatmaps" in this article, and his heatmaps package.