# Data Visualization in Python

Python has many libraries for making charts and graphs from data.  The oldest, and probably the most powerful, is Matplotlib.  It offers fine-grained control over nearly every aspect of a chart, but that flexibility also makes it one of the more complex libraries to use.  We will spend much of this session learning how to use Matplotlib, and then switch to exploring tutorials for a newer Python library, Altair.

To do the Altair tutorials, you'll need to install the library first.
At the command prompt, install it with this command:
conda install altair --channel conda-forge

We will come back to this later in the class session.
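
To confirm that the install worked before starting the Altair tutorials, a quick check along these lines may help. This is a minimal sketch using only the standard library; the helper name `is_installed` is made up for illustration.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the named top-level package can be found for import."""
    return importlib.util.find_spec(package_name) is not None


# Reports whether Altair is available in the current environment
print("altair installed:", is_installed("altair"))
```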

Objectives for today's session:

* Learn how to use Matplotlib to make basic charts:
    * Line charts
    * Histograms
    * Scatterplots
* Learn how to add titles and labels for the X and Y axes
* Learn how to control the styling of the charts
* Learn how to create a composite figure with subplots
* Learn how to export figures to PNG to use elsewhere

* Introduce Altair charting library

In addition, we will explore Census data and Craigslist rentals to visualize them.


## Visualizing Census Data

Here we load 2010 Census Block data to set up some initial analysis and data visualization. Below are the tables pulled from Summary File 1 (SF1) for Census Blocks in the San Francisco Bay Area.

In [None]:
import pandas as pd

sf1store = pd.HDFStore('bay_sf1_small.h5')
sf1 = sf1store['sf1_extract']
print(sf1[:5])
print(sf1.shape)

Let's calculate some basic information about each census block in the Bay Area.

In [None]:
# Each share is a group count divided by its table total, times 100
sf1['pct_rent'] = sf1['H0040004'] / sf1['H0040001'] * 100
sf1['pct_black'] = sf1['P0030003'] / sf1['P0030001'] * 100
sf1['pct_asian'] = sf1['P0030005'] / sf1['P0030001'] * 100
sf1['pct_white'] = sf1['P0030002'] / sf1['P0030001'] * 100
sf1['pct_hisp'] = sf1['P0040003'] / sf1['P0040001'] * 100
sf1['pct_vacant'] = sf1['H0050001'] / sf1['H00010001'] * 100
# Assuming arealand is in square meters (2,589,988 square meters per square mile)
sf1['pop_sqmi'] = sf1['P0010001'] / (sf1['arealand'] / 2589988)
sf1 = sf1[sf1['P0030001']>0]
print(sf1.head())
print(sf1.shape)
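
The filter on blocks with a population above zero matters because a block with no residents has a zero denominator. The toy example below is a sketch with made-up column names (`total_pop`, `group_pop`) rather than the SF1 fields; it shows the percentage coming out as NaN for an empty block and the filter removing that row.

```python
import pandas as pd

# Three made-up blocks; the second one has no residents
toy = pd.DataFrame({
    'total_pop': [120, 0, 45],
    'group_pop': [30, 0, 45],
})

# Same pattern as the cell above: group count / total * 100
toy['pct_group'] = toy['group_pop'] / toy['total_pop'] * 100
print(toy)

# Keep only blocks with at least one resident
populated = toy[toy['total_pop'] > 0]
print(populated)
```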

In [None]:
sf1.describe()

In [None]:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
#from altair import Chart, X, Y, Axis, SortField


## Basic Graphs With Matplotlib

Matplotlib is the oldest and most widely used of the charting libraries for Python, and integrates seamlessly into IPython notebooks.  Making simple charts of all kinds is very straightforward.  Matplotlib also has a tremendous number of options that enable a user to carefully control the appearance of charts.  This power is at once one of Matplotlib's best and worst features: the options give fine control, but using them is complicated.

Here we will stick to basic plots that Matplotlib makes easy, and then switch to a newer charting library that handles both simple and more complex charts well.

The first example shows that you can use NumPy functions like sort to create a revealing line graph, in this case showing how few census blocks have relatively large numbers of people in them.

In [None]:
plt.plot(np.sort(sf1['P0010001']))
plt.show()

Next we look at a cumulative sum of population across blocks.  Blocks are added in the order in which they appear in the DataFrame, so the shape of this curve is somewhat arbitrary.

In [None]:
plt.plot(np.cumsum(sf1['P0010001']))
plt.show()

Combining sort and cumulative sum produces a more interesting curve: it shows how much of the total population lives in the smallest blocks.

In [None]:
plt.plot(np.cumsum(np.sort(sf1['P0010001'])))
plt.show()
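
To see what the sort and cumulative sum steps do to the numbers before they are plotted, here is a small sketch on a made-up array of seven block populations (the values are illustrative, not Census data).

```python
import numpy as np

# Made-up block populations, deliberately skewed
pop = np.array([5, 250, 12, 1, 40, 900, 3])

sorted_pop = np.sort(pop)              # ascending order
running_total = np.cumsum(sorted_pop)  # cumulative population, smallest blocks first

# Share of the total population held by the n smallest blocks
share = running_total / running_total[-1]

print(sorted_pop)
print(running_total)
print(share.round(3))
```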

As we have seen before, simple histograms are also easy to produce.  Here is the same population data displayed as a histogram. You can control things like the number of bins, or the color, easily.

In [None]:
plt.hist(sf1['P0010001'], bins=25, color='red')
plt.show()
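
The bins argument splits the range of the data into equal-width intervals and counts the values in each. The sketch below uses `np.histogram` on a few made-up values to show the counts and bin edges behind a histogram with four bins.

```python
import numpy as np

# Five made-up values with one large outlier
values = np.array([1, 2, 2, 3, 10])

# Four equal-width bins spanning the minimum to the maximum
counts, edges = np.histogram(values, bins=4)

print(edges)   # five edges define four bins
print(counts)  # number of values falling in each bin
```

With skewed data like this, most values land in the first bin, which is the same effect visible in the population histogram above.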

The distribution of population is very skewed, with most blocks having relatively small population counts.  To zoom in on those and get a better sense of the distribution, we can slice the blocks to isolate those below the long tail.  Let's plot the distribution again after truncating the top one percent.  Increasing the number of bins provides greater detail as well.  We can see that 99% of the blocks have fewer than 700 people, and that there are still a significant number of blocks with only a few people in them.

In [None]:
print(sf1['P0010001'].quantile(.99))
small_pop = sf1[sf1['P0010001'] < sf1['P0010001'].quantile(.99)]
plt.hist(small_pop['P0010001'], bins=100, color='green')
plt.show()
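
The truncation step can be checked on a tiny example. This sketch uses a made-up Series of the integers 1 through 100 to show how the 99th percentile is computed and how the boolean filter drops the values at or above it.

```python
import pandas as pd

# Made-up data: the integers 1 through 100
pop = pd.Series(range(1, 101))

# 99th percentile, using pandas' default linear interpolation
cutoff = pop.quantile(.99)
print(cutoff)

# Keep only the values strictly below the cutoff
trimmed = pop[pop < cutoff]
print(len(trimmed), trimmed.max())
```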

### Exploring Racial and Ethnic Concentration in the Bay Area

Let's use the 2010 population by race and ethnicity to explore the spatial concentration of people of different races and ethnicities in the Bay Area.  First, compute a regional percentage to use as a frame of reference for the block-level profiles.

In [None]:
print('Regional Pct White: '+"{0:.1f}%".format(sf1['P0030002'].sum()/sf1['P0030001'].sum()*100))
print('Regional Pct Black: '+"{0:.1f}%".format(sf1['P0030003'].sum()/sf1['P0030001'].sum()*100))
print('Regional Pct Asian: '+"{0:.1f}%".format(sf1['P0030005'].sum()/sf1['P0030001'].sum()*100))
print('Regional Pct Hispanic: '+"{0:.1f}%".format(sf1['P0040003'].sum()/sf1['P0040001'].sum()*100))
print('Note that these add up to more than 100% since Hispanic is not broken out by race in this calculation')
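
Note that the regional percentage is computed by summing counts across blocks before dividing, which is not the same as averaging the block-level percentages. The sketch below uses two made-up blocks (column names `total` and `group` are illustrative) to show how far apart the two numbers can be when block sizes differ.

```python
import pandas as pd

# Two made-up blocks: one large, one very small
toy = pd.DataFrame({
    'total': [1000, 10],
    'group': [100, 9],
})

# Regional share: sum the counts first, then divide
regional_pct = toy['group'].sum() / toy['total'].sum() * 100

# Unweighted mean of the block-level percentages
mean_block_pct = (toy['group'] / toy['total'] * 100).mean()

print('Regional Pct: {0:.1f}%'.format(regional_pct))
print('Mean of block Pcts: {0:.1f}%'.format(mean_block_pct))
```

The mean from `describe()` on a block-level percentage column is the second kind of number, so it will generally differ from the regional figure.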

Now create a statistical profile and generate histograms of the distribution of each of these population groups.  What kinds of descriptive conclusions could you draw about how they differ from each other?

In [None]:
print(sf1['pct_asian'].describe())
plt.hist(sf1['pct_asian'], bins=50)
plt.show()

In [None]:
print(sf1['pct_hisp'].describe())
plt.hist(sf1['pct_hisp'], bins=50)
plt.show()

In [None]:
print(sf1['pct_black'].describe())
plt.hist(sf1['pct_black'], bins=50)
plt.show()

In [None]:
print(sf1['pct_white'].describe())
plt.hist(sf1['pct_white'], bins=50)
plt.show()

To make figures easier to compare side-by-side, Matplotlib enables creating a composite figure using subplots.

In [None]:
plt.figure(1)
plt.subplot(221)
plt.hist(sf1['pct_asian'], bins=50)

plt.subplot(222)
plt.hist(sf1['pct_black'], bins=50)

plt.subplot(223)
plt.hist(sf1['pct_hisp'], bins=50)

plt.subplot(224)
plt.hist(sf1['pct_white'], bins=50)

plt.show()
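
The three-digit argument to `plt.subplot` packs three numbers together: rows, columns, and the position of the subplot, counted from 1 across each row. So 221 means a 2 by 2 grid, first position. The sketch below labels each cell of such a grid with its position, using the equivalent three-argument form.

```python
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(6, 4))

# Positions are numbered 1..4, left to right, then top to bottom
for index in range(1, 5):
    ax = fig.add_subplot(2, 2, index)  # same grid cell as plt.subplot(220 + index)
    ax.set_title('position {}'.format(index))
    ax.set_xticks([])
    ax.set_yticks([])

plt.show()
```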

The figures by default are a bit too small, and now it would really help to add some titles and axis labels to make this easier to read.

In [None]:
plt.figure(1, figsize=(10, 8))
plt.suptitle("2010 Racial and Ethnic Distribution by Census Block, San Francisco Bay Area", fontsize=16)

ax = plt.subplot(221)
ax.set_title("Asian")
ax.set_xlabel('Percent of Population in Census Block')
ax.set_ylabel('Number of Census Blocks')
ax.hist(sf1['pct_asian'], bins=50)

ax = plt.subplot(222)
ax.set_title("Black")
ax.set_xlabel('Percent of Population in Census Block')
ax.set_ylabel('Number of Census Blocks')
ax.hist(sf1['pct_black'], bins=50)

ax = plt.subplot(223)
ax.set_title("Hispanic")
ax.set_xlabel('Percent of Population in Census Block')
ax.set_ylabel('Number of Census Blocks')
ax.hist(sf1['pct_hisp'], bins=50)

ax = plt.subplot(224)
ax.set_title("White")
ax.set_xlabel('Percent of Population in Census Block')
ax.set_ylabel('Number of Census Blocks')
ax.hist(sf1['pct_white'], bins=50)

plt.subplots_adjust(wspace=.5, hspace=.5)
plt.show()

But notice that the y axis scales are different for each subplot.  That makes direct comparisons potentially distorted by the difference in the scales.  Below we add consistent scales for the y axis on each subplot.

In [None]:
plt.figure(1, figsize=(10, 8))
plt.suptitle("2010 Racial and Ethnic Distribution by Census Block, San Francisco Bay Area", fontsize=16)

ax = plt.subplot(221)
ax.set_title("Asian")
ax.set_xlabel('Percent of Population in Census Block')
ax.set_ylabel('Number of Census Blocks')
ax.set_ylim(0,45000)
ax.hist(sf1['pct_asian'], bins=50)

ax = plt.subplot(222)
ax.set_title("Black")
ax.set_xlabel('Percent of Population in Census Block')
ax.set_ylabel('Number of Census Blocks')
ax.set_ylim(0,45000)
ax.hist(sf1['pct_black'], bins=50)

ax = plt.subplot(223)
ax.set_title("Hispanic")
ax.set_xlabel('Percent of Population in Census Block')
ax.set_ylabel('Number of Census Blocks')
ax.set_ylim(0,45000)
ax.hist(sf1['pct_hisp'], bins=50)

ax = plt.subplot(224)
ax.set_title("White")
ax.set_xlabel('Percent of Population in Census Block')
ax.set_ylabel('Number of Census Blocks')
ax.set_ylim(0,45000)
ax.hist(sf1['pct_white'], bins=50)

plt.subplots_adjust(wspace=.5, hspace=.5)
plt.show()
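
As an alternative to repeating `set_ylim` in every subplot, Matplotlib can share the y axis across a grid. The sketch below is not the approach used in the cell above; it uses `plt.subplots` with `sharey=True` and a loop, on synthetic percentages generated for illustration (the group names and beta parameters are made up).

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic percentage data standing in for the block-level columns
rng = np.random.default_rng(0)
samples = {
    'Group A': rng.beta(1, 8, size=2000) * 100,
    'Group B': rng.beta(1, 20, size=2000) * 100,
    'Group C': rng.beta(2, 6, size=2000) * 100,
    'Group D': rng.beta(4, 3, size=2000) * 100,
}

# sharey=True keeps the y axis limits identical across all four subplots
fig, axes = plt.subplots(2, 2, figsize=(10, 8), sharey=True)
fig.suptitle('Synthetic Percentages with a Shared Y Axis', fontsize=16)

for ax, (label, values) in zip(axes.flat, samples.items()):
    ax.hist(values, bins=50, range=(0, 100))
    ax.set_title(label)
    ax.set_xlabel('Percent')
    ax.set_ylabel('Count')

fig.subplots_adjust(wspace=.5, hspace=.5)
plt.show()
```

The trade-off is that the shared limits are chosen automatically from the data, whereas `set_ylim` lets you pick a fixed ceiling such as 45000.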

Finally, a bit more tweaking of the plots: we change the color and edgecolor of the bars and decrease their alpha (opacity).  We also save the figure to a PNG file at a specified dpi.

In [None]:
plt.figure(1, figsize=(10, 8))
plt.suptitle("2010 Racial and Ethnic Distribution by Census Block, San Francisco Bay Area", fontsize=16)

ax = plt.subplot(221)
ax.set_title("Asian")
ax.set_xlabel('Percent of Population in Census Block')
ax.set_ylabel('Number of Census Blocks')
ax.set_ylim(0,45000)
ax.hist(sf1['pct_asian'], bins=50, alpha=.6, color='r', edgecolor='r')

ax = plt.subplot(222)
ax.set_title("Black")
ax.set_xlabel('Percent of Population in Census Block')
ax.set_ylabel('Number of Census Blocks')
ax.set_ylim(0,45000)
ax.hist(sf1['pct_black'], bins=50, alpha=.6, color='b', edgecolor='b')

ax = plt.subplot(223)
ax.set_title("Hispanic")
ax.set_xlabel('Percent of Population in Census Block')
ax.set_ylabel('Number of Census Blocks')
ax.set_ylim(0,45000)
ax.hist(sf1['pct_hisp'], bins=50, alpha=.6, color='g', edgecolor='g')

ax = plt.subplot(224)
ax.set_title("White")
ax.set_xlabel('Percent of Population in Census Block')
ax.set_ylabel('Number of Census Blocks')
ax.set_ylim(0,45000)
ax.hist(sf1['pct_white'], bins=50, alpha=.6, color='k', edgecolor='k')

plt.subplots_adjust(wspace=.5, hspace=.5)
plt.savefig("2010_racial_distribution_bay_area.png", dpi=150)
plt.show()
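
The dpi setting, together with figsize, determines the pixel dimensions of the saved image: width and height in inches multiplied by dots per inch. The sketch below saves a small placeholder figure to a temporary file (the file name is made up) and reads it back to check its size.

```python
import os
import tempfile

import matplotlib.pyplot as plt

# A placeholder figure with the same figsize as the composite above
fig, ax = plt.subplots(figsize=(10, 8))
ax.plot([0, 1, 2], [0, 1, 4])

dpi = 150
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'example_figure.png')
    fig.savefig(path, dpi=dpi)
    image = plt.imread(path)  # array of shape (height, width, channels)

height_px, width_px = image.shape[:2]
print('Saved image is {} x {} pixels'.format(width_px, height_px))
plt.close(fig)
```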


## Visualizing Craigslist Data

Next we read Craigslist rental listings for the San Francisco Bay Area: over 70,000 listings scraped during 2014, cleaned, and reverse geocoded to attach a Census Block ID that we can use to merge with census data.

In [None]:
rentals = pd.read_csv('sfbay_geocoded.csv', dtype={'fips_block': str})
rentals = rentals.iloc[:,2:]
print(rentals.head())
print(rentals.shape)
print(rentals['rent'].describe())
print(rentals['sqft'].describe())

Start by creating a default histogram of rents.

In [None]:
plt.hist(rentals['rent'])
plt.show()

Now make it look nicer, adding titles, axis labels, and setting the color and alpha.

In [None]:
plt.figure(1, figsize=(8, 6))
plt.suptitle('2014 Bay Area Rents from Craigslist Listings', fontsize=14)
plt.xlabel('Monthly Rent')
plt.ylabel('Number of Listings')
plt.hist(rentals['rent'], bins=25, alpha=.6, color='g')
plt.show()

Now let's generate a scatter plot of two variables: sqft and rent.

In [None]:
plt.scatter(rentals['sqft'], rentals['rent'])
plt.show()

And a somewhat nicer version:

In [None]:
plt.figure(1, figsize=(8, 6))
plt.suptitle('2014 Bay Area Rents from Craigslist Listings', fontsize=14)
plt.xlabel('Square Feet')
plt.ylabel('Rent')
plt.xlim(0,5000)
plt.ylim(0,10000)
plt.scatter(rentals['sqft'], rentals['rent'], color='g', alpha=.5, edgecolor='g', s=.2)
plt.show()
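
Setting axis limits only changes the visible window; points outside it stay in the data and are simply not drawn. The sketch below illustrates this on synthetic square footage and rent values (the numbers and the linear relationship are made up), using the Axes methods `set_xlim` and `set_ylim` that correspond to `plt.xlim` and `plt.ylim`.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic listings: rent loosely increasing with square footage
rng = np.random.default_rng(42)
sqft = rng.uniform(300, 8000, size=500)
rent = 1000 + 1.5 * sqft + rng.normal(0, 400, size=500)

fig, ax = plt.subplots(figsize=(8, 6))
points = ax.scatter(sqft, rent, color='g', alpha=.5, s=2)

# Restrict the view; listings above 5000 sqft are hidden, not deleted
ax.set_xlim(0, 5000)
ax.set_ylim(0, 10000)
ax.set_xlabel('Square Feet')
ax.set_ylabel('Rent')
plt.show()

# All 500 points are still attached to the plot
print(len(points.get_offsets()))
```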

And here we look at a scatter plot of rents against number of bedrooms.

In [None]:
plt.figure(1, figsize=(8, 6))
plt.suptitle('2014 Bay Area Rents from Craigslist Listings', fontsize=14)
plt.xlabel('Bedrooms')
plt.ylabel('Rent')
plt.scatter(rentals['bedrooms'], rentals['rent'], color='w', edgecolor='r', alpha=.6, s=50)
plt.show()

## Exploring the Altair Python Visualization Library

For the remainder of today's session, experiment with visualizing the two datasets above (or load your own data) to become more comfortable making data charts in Python.

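I also suggest that you explore the tutorials for the Altair library. If you didn't install it earlier, do so before running the cell below, which loads a series of tutorials into this notebook. The tutorial() helper shipped with early Altair releases (1.x); if your installed version no longer provides it, work through the tutorials in the Altair documentation instead.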

In [None]:
from altair import *
tutorial(overwrite=True)