# Plotting Heatmaps and Histograms in Python

This notebook shows how to reproduce in Python some of the plots you have seen in the R practice notebooks. Study the notation, syntax, and output of each plot to see how the R and Python implementations differ.

## Heatmaps

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib
%matplotlib inline

matplotlib.style.use('ggplot')       # Use ggplot-style plots

In [None]:
# Generate random 2D data for plotting a heatmap. The command below creates a matrix of 10 rows and 12 columns.
uniform_data = np.random.rand(10, 12)

# Display first two rows of data
uniform_data[0:2,:]

Seaborn's heatmap plots rectangular data as a color-encoded matrix. It is an axes-level function and will draw the heatmap into the currently active axes if none is provided to the ```ax``` argument. Below, the axes object that ```heatmap()``` returns is assigned to ```ax```.

In [None]:
ax = sns.heatmap(uniform_data)

In the plot above, the colormap limits were inferred from the data. To anchor the colormap to fixed values instead, pass the ```vmin``` and ```vmax``` arguments.

In [None]:
ax = sns.heatmap(uniform_data, vmin=0, vmax=1)
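Anchoring matters most when several heatmaps need to be compared. The sketch below is a minimal illustration, assuming two made-up matrices (`low` and `high` are hypothetical names, not part of this notebook's data) that cover different value ranges. With the same ```vmin``` and ```vmax``` on both panels, a given color stands for the same value in each.

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(0)             # seeded so the figure is reproducible
low = rng.random((10, 12)) * 0.5           # hypothetical data in [0, 0.5)
high = 0.5 + rng.random((10, 12)) * 0.5    # hypothetical data in [0.5, 1)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Fixed limits: both panels share one color scale
sns.heatmap(low, vmin=0, vmax=1, ax=axes[0])
sns.heatmap(high, vmin=0, vmax=1, ax=axes[1])

# Color limits actually used by each heatmap
low_limits = axes[0].collections[0].get_clim()
high_limits = axes[1].collections[0].get_clim()
print(low_limits, high_limits)
```

Without ```vmin``` and ```vmax```, each panel would stretch its own data range over the full colormap, and the two plots would look alike even though their values differ.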

Here is an example of plotting a heatmap from the ```flights``` dataset that ships with the Seaborn package. First load the data, then pivot it into a table with months as rows and years as columns. This pivot table is the matrix that the ```heatmap()``` function draws.

In [None]:
flights = sns.load_dataset("flights")
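# Keyword arguments are used because recent pandas versions no longer accept these positionally
flightsp = flights.pivot(index="month", columns="year", values="passengers")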
ax = sns.heatmap(flightsp, annot=True, fmt="d")

The above plot is an **annotated heatmap** because we can see the numbers in the cells. 
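To make the pivot step concrete, here is a small sketch on a hand-made table. The frame `long_df` and its four rows are hypothetical stand-ins for the long-format ```flights``` data, which has one row per month and year.

```python
import pandas as pd

# Hypothetical long-format data: one row per (month, year) pair
long_df = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb"],
    "year": [1949, 1950, 1949, 1950],
    "passengers": [112, 115, 118, 126],
})

# Rows come from `month`, columns from `year`, cell values from `passengers`
wide_df = long_df.pivot(index="month", columns="year", values="passengers")
print(wide_df)
```

Because `month` is a plain string column in this sketch, the rows are sorted alphabetically (Feb before Jan). In the Seaborn dataset the month column is categorical, so the rows keep calendar order.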

For more examples, check the [Seaborn documentation](https://seaborn.pydata.org/generated/seaborn.heatmap.html) for heatmaps.


**We can create a similar plot with plotnine as follows:**

In [None]:
from plotnine import *

# We use geom_tile to create the heatmap and geom_text to annotate it.
# We use 'factor(year)' instead of 'year' so that the x-axis shows discrete values.
# A tile width and height of .95 leaves a small gap between neighboring tiles.

(ggplot(flights, aes('factor(year)', 'month', fill='passengers'))
 + geom_tile(aes(width=.95, height=.95))
 + geom_text(aes(label='passengers'), size=10)
)


**See the [plotnine geom_tile documentation](https://plotnine.readthedocs.io/en/stable/generated/plotnine.geoms.geom_tile.html#annotated-heatmap) for other things you can do with geom_tile.**

Let's load the King County house price data. We will plot a heatmap that shows how houses are distributed across the number of floors and the year in which each house was built. Blue tiles represent combinations with many houses, and white tiles represent combinations with few.


In [None]:
kc_house_data = pd.read_csv("/dsa/data/all_datasets/house_sales_in_king_county/kc_house_data.csv")
yr_built_floors = kc_house_data[["yr_built", "floors"]]

# Get the frequency of each combination of yr_built and floors.
# size() counts the rows per group; reset_index() leaves the counts in a column named 0.
yr_built_floors = yr_built_floors.groupby(["yr_built", "floors"]).size().reset_index()
yr_built_floors = yr_built_floors.rename({0: "Freq"}, axis=1)

p = ggplot(yr_built_floors, aes(x="floors", y="yr_built")) + geom_tile(aes(fill="Freq")) 
p = p + scale_fill_gradient(name = "Yr built and floors", low = "white", high="blue") 
p = p + xlab("Floors") + ylab("Built Year")+labs(title="Houses built from 1900 to 2015") + coord_fixed(ratio = 0.05) + theme_minimal()
p
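The counting step that feeds this heatmap can be checked on a few rows by hand. The sketch below uses a made-up frame (`houses` is a hypothetical stand-in for the two house-data columns) and passes `name="Freq"` to `reset_index()`, which names the count column in one step instead of renaming it afterwards.

```python
import pandas as pd

# Hypothetical stand-in for the yr_built and floors columns
houses = pd.DataFrame({
    "yr_built": [1990, 1990, 1990, 2005, 2005],
    "floors": [1.0, 1.0, 2.0, 2.0, 2.0],
})

# One output row per (yr_built, floors) combination, with its row count
freq = (
    houses.groupby(["yr_built", "floors"])
    .size()
    .reset_index(name="Freq")
)
print(freq)
```

Each row of `freq` corresponds to one tile in a heatmap like the one above, and `Freq` is the value mapped to the fill color.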


## Histograms



In [None]:
usda_data=pd.read_csv("/dsa/data/all_datasets/USDA.csv")
usda_data.head()

In [None]:
usda_data.describe()

### Qplot in ggplot

Qplot is a convenient wrapper for ggplot to create a number of different types of plots using a consistent calling scheme that is similar to the base graphics capability of R. It is also referred to as quick plot. 

In the plot below, a histogram is drawn by passing the string **`histogram`** to the **`geom`** parameter. The **`binwidth`** argument tells ggplot how wide each bin should be. With a binwidth of 10, each bin covers a range of calories such as 50-59 on the x-axis; the data items that fall within a range are counted, and that count becomes the height (frequency) of the corresponding bar.

In [None]:
qplot(data=usda_data, x="Calories",  geom="histogram", binwidth=10)
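The counting behind a histogram can be reproduced with NumPy. This is a minimal sketch on made-up numbers (the `calories` array is hypothetical, and the bin edges are chosen by hand), so it illustrates the idea of fixed-width bins rather than the exact edges plotnine picks.

```python
import numpy as np

calories = np.array([52, 55, 59, 60, 61, 75, 98])   # hypothetical values
binwidth = 10

# Bin edges at 50, 60, ..., 100; every bin is `binwidth` wide
edges = np.arange(50, 100 + binwidth, binwidth)

# counts[i] is the number of values in [edges[i], edges[i + 1]);
# the last bin also includes its right edge
counts, _ = np.histogram(calories, bins=edges)
print(counts)
```

Three of the values fall in the 50-59 range, so the first bar has height 3.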

The **`weight`** aesthetic when used with histograms or bar charts can be used to create weighted histograms and bar charts. Here the height of the bar no longer represents count of observations, but a sum over some other variable.

In [None]:
qplot(data=usda_data, x="Calories", geom="histogram", weight="Protein", binwidth=10)
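The effect of a weight can be sketched with NumPy as well. The arrays below are made up for illustration (`calories` and `protein` are hypothetical values, not taken from the USDA file): with weights, each bar becomes the sum of `protein` over the items in that bin rather than the number of items.

```python
import numpy as np

calories = np.array([52, 55, 59, 60, 61])        # hypothetical values
protein = np.array([1.0, 2.5, 0.5, 3.0, 4.0])    # hypothetical weights
edges = np.array([50, 60, 70])                   # two bins of width 10

# Plain histogram: number of items per bin
counts, _ = np.histogram(calories, bins=edges)

# Weighted histogram: sum of protein per bin
weighted, _ = np.histogram(calories, bins=edges, weights=protein)
print(counts, weighted)
```

The first bin holds three items whose protein values add up to 4.0, so its bar height changes from 3 to 4.0 once the weight is applied.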

#### Layered Grammar of ggplot

We can use the ggplot syntax instead of qplot to create plots that follow the layered grammar convention of ggplot. The same histogram can also be plotted like this:

In [None]:
ggplot(usda_data, aes(x="Calories")) + geom_histogram(binwidth=10, fill="lightblue") + ylab("Frequency")

#### Density Curve on Histogram

A density curve, which estimates the probability density function of a variable, can be drawn over its histogram. The density is overlaid as a semi-transparent layer, and the alpha value controls the level of transparency, as shown in the example below. This demonstrates the layered structure of ggplot: two layers (histogram and density) are drawn on the same plot. Here **..density..** is a derived variable that ggplot computes on the fly; it rescales the bar heights so that the histogram is on the same scale as the density curve. Newer plotnine versions write the same thing as `after_stat("density")`.

In [None]:
ggplot(usda_data, aes(x="Calories")) + \
    geom_histogram(aes(y="..density.."), colour = "black", fill="lightblue", binwidth=10) +\
    geom_density(alpha = .2, fill="red")
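What the density scale means for the bars can be worked out by hand. The sketch below uses made-up numbers (the `calories` array and the bin edges are hypothetical) and assumes the usual definition of a density histogram: each count is divided by the total count times the bin width, so the bar areas sum to 1.

```python
import numpy as np

calories = np.array([52, 55, 59, 60, 61, 75, 98])   # hypothetical values
binwidth = 10
edges = np.arange(50, 100 + binwidth, binwidth)

counts, _ = np.histogram(calories, bins=edges)

# Density per bin: count / (total count * bin width)
density = counts / (counts.sum() * binwidth)

# Total area of the bars (height * width, summed over bins)
area = (density * binwidth).sum()
print(density, area)
```

Because the bars now enclose a total area of 1, they are directly comparable to a density curve, which also integrates to 1.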