# 04 - Heat Maps (2d histograms)

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

Let's look at a type of bivariate plot called a heatmap. It can be thought of as a 2D histogram, or even as a color-coded table.

Let's start by loading in our data.

In [None]:
df = pd.read_csv('../data/fuel-econ.csv')
df.shape

In [None]:
df.head(5)

## Introduction to heatmaps

In a heatmap, the plotting area is divided into a grid of cells, much like a 2D histogram. Each cell is assigned a color based on the number of data points that fall inside it.

Which end of the color scale means "more" depends on the colormap. With Matplotlib's default `viridis`, cells with more data points are brighter; with the reversed `viridis_r` used later in this notebook, they are darker. You can think of a basic heatmap as a 2D version of a histogram, looking at the data from a top-down perspective.

In [None]:
plt.hist2d(data=df, x='displ', y='comb')
plt.colorbar()
plt.xlabel('Displacement (l)')
plt.ylabel('Combined Fuel Eff. (mpg)');
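
To make the "count per cell" idea concrete, here is a minimal sketch using `np.histogram2d`, the NumPy function that `plt.hist2d` relies on for its binning. The data is synthetic (an assumption for illustration, not the fuel-economy dataset), and the 10 x 10 grid is simply the default.

```python
import numpy as np

# Synthetic stand-in data (an assumption; not the fuel-economy dataset).
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=1.0, size=500)
y = rng.normal(loc=25.0, scale=5.0, size=500)

# Count the points in each cell of a 10 x 10 grid spanning the data range.
counts, x_edges, y_edges = np.histogram2d(x, y, bins=10)

print(counts.shape)       # one row per x bin, one column per y bin
print(int(counts.sum()))  # every point lands in exactly one cell
```

The heatmap is just this array of counts drawn with one colored rectangle per cell.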

## Heatmap as a histogram

As with a histogram, you can specify the bin edges for a heatmap, and the choice of bin size deserves the same care.

You can also set the `cmin` parameter: cells whose count falls below `cmin` are left uncolored.

In [None]:
# Specify bin edges 
bins_x = np.arange(0.6, 7+0.3, 0.3)
bins_y = np.arange(12, 58+3, 3)

plt.hist2d(data=df, x='displ', y='comb', cmin=0.5, cmap='viridis_r', bins=[bins_x, bins_y])
plt.colorbar()
plt.xlabel('Displacement (l)')
plt.ylabel('Combined Fuel Eff. (mpg)');
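
The sketch below looks more closely at the two ideas in this cell: why the upper limit in `np.arange` gets one extra bin width, and what `cmin` does to the returned counts. It uses synthetic data (an assumption, not the fuel-economy dataset); the names `edges_x`, `edges_y`, `n_empty`, and `n_total` are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in data (an assumption; not the fuel-economy dataset).
rng = np.random.default_rng(1)
x = rng.uniform(1.0, 7.0, size=300)
y = rng.uniform(12.0, 58.0, size=300)

# np.arange excludes its stop value, so adding one bin width to the
# largest expected value guarantees that the last edge reaches it.
edges_x = np.arange(0.6, 7 + 0.3, 0.3)
edges_y = np.arange(12, 58 + 3, 3)

fig, ax = plt.subplots()
counts, _, _, image = ax.hist2d(x, y, bins=[edges_x, edges_y], cmin=0.5)
fig.colorbar(image, ax=ax)

# With cmin set, cells below the threshold come back as NaN and are not drawn.
n_empty = int(np.isnan(counts).sum())
n_total = int(np.nansum(counts))
print(n_empty, n_total)
```

Because blanked cells are NaN rather than 0, sums over the counts need `np.nansum` (or a NaN check) instead of a plain `sum`.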

## Adding cell annotations and summary

You can also annotate each cell with its value count. In this view, the heatmap is also like a table, with additional color coding for emphasis.

For annotated 2d histograms, you have to loop through each cell to display the value. This can be tedious, but it's okay to re-use this code when making other plots!

In [None]:
# Specify bin edges
bins_x = np.arange(0.6, 7+0.7, 0.7)
bins_y = np.arange(12, 58+7, 7)
# cmin=0.5 leaves cells with a count below 0.5 (i.e., empty cells) uncolored
# cmap='viridis_r' reverses the colormap so that fuller cells are darker
h2d = plt.hist2d(data=df, x='displ', y='comb', cmin=0.5, cmap='viridis_r', bins=[bins_x, bins_y])

plt.colorbar()
plt.xlabel('Displacement (l)')
plt.ylabel('Combined Fuel Eff. (mpg)');

# The first element returned by hist2d is the 2D array of cell counts.
# Values in x are histogrammed along the first dimension and
# values in y are histogrammed along the second dimension.
counts = h2d[0]

# Loop through the cell counts and add a text annotation to each cell.
# The offsets 0.35 and 3.5 are half the bin widths (0.7 and 7),
# so each label sits at the center of its cell.
for i in range(counts.shape[0]):
    for j in range(counts.shape[1]):
        c = counts[i,j]
        if c >= 100: # increase visibility on darker cells
            plt.text(bins_x[i]+0.35, bins_y[j]+3.5, int(c),
                     ha = 'center', va = 'center', color = 'white')
        elif c > 0: # cells blanked by cmin are NaN and are skipped
            plt.text(bins_x[i]+0.35, bins_y[j]+3.5, int(c),
                     ha = 'center', va = 'center', color = 'black')
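
Since this annotation loop is meant to be re-used, one option is to wrap it in a function that computes the cell centers from the bin edges, so the hard-coded offsets (0.35 and 3.5) do not need updating when the bin widths change. The helper below, `annotate_counts`, is a hypothetical name and only a sketch; it is run here on synthetic data (an assumption, not the fuel-economy dataset).

```python
import numpy as np
import matplotlib.pyplot as plt

def annotate_counts(ax, counts, x_edges, y_edges, dark_threshold=100):
    """Write each nonzero count at the center of its cell (hypothetical helper)."""
    # Cell centers are the midpoints between consecutive bin edges.
    x_centers = (x_edges[:-1] + x_edges[1:]) / 2
    y_centers = (y_edges[:-1] + y_edges[1:]) / 2
    n_labels = 0
    for i, xc in enumerate(x_centers):
        for j, yc in enumerate(y_centers):
            c = counts[i, j]
            if np.isnan(c) or c <= 0:
                continue  # skip empty cells, including those blanked by cmin
            # White text on cells at or above the threshold, black otherwise.
            color = 'white' if c >= dark_threshold else 'black'
            ax.text(xc, yc, int(c), ha='center', va='center', color=color)
            n_labels += 1
    return n_labels

# Synthetic stand-in data (an assumption; not the fuel-economy dataset).
rng = np.random.default_rng(2)
x = rng.uniform(0.6, 7.6, size=400)
y = rng.uniform(12.0, 61.0, size=400)
edges_x = np.arange(0.6, 7 + 0.7, 0.7)
edges_y = np.arange(12, 58 + 7, 7)

fig, ax = plt.subplots()
counts, x_edges, y_edges, image = ax.hist2d(
    x, y, bins=[edges_x, edges_y], cmin=0.5, cmap='viridis_r')
fig.colorbar(image, ax=ax)
n_labels = annotate_counts(ax, counts, x_edges, y_edges)
```

The `dark_threshold` default of 100 mirrors the cutoff used in the loop above; a different colormap or data size would likely call for a different value.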

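A heatmap is often preferable to a scatter plot when both variables are discrete. The scatter plot would need jitter to separate overlapping points, which already makes point positions imprecise, whereas a heatmap shows the count for each combination of values directly.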
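
As a rough illustration of the discrete case, the sketch below places bin edges at half-integers so that every distinct integer value gets its own cell. The two variables `a` and `b` are synthetic (an assumption for illustration, not columns from the fuel-economy dataset).

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic discrete data (an assumption): two integer-valued variables.
rng = np.random.default_rng(3)
a = rng.integers(1, 7, size=1000)  # values 1 through 6
b = rng.integers(0, 4, size=1000)  # values 0 through 3

# Edges at half-integers center each cell on one integer value.
edges_a = np.arange(a.min() - 0.5, a.max() + 1.5, 1)
edges_b = np.arange(b.min() - 0.5, b.max() + 1.5, 1)

fig, ax = plt.subplots()
counts, _, _, image = ax.hist2d(a, b, bins=[edges_a, edges_b], cmap='viridis_r')
fig.colorbar(image, ax=ax)
ax.set_xlabel('a')
ax.set_ylabel('b')
```

Each entry of `counts` is then the exact number of rows with that pair of values, which a jittered scatter plot can only suggest.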