# Maps part II

files needed = ('cb_2017_55_tract_500k.shp', 'ACS_16_5YR_B27001_with_ann.csv')

We continue working with maps. Today we learn about Census tracts and think about how to call out missing data on a map. Along the way, we will look into the American Community Survey (ACS) data.

In [None]:
import pandas as pd                         # pandas for data management
import geopandas                            # geopandas for maps work
import matplotlib.pyplot as plt             # matplotlib for plotting details                     

### Census tracts
Let's take our level of analysis down to the [Census tract](https://factfinder.census.gov/help/en/census_tract.htm). A tract is about the equivalent of a neighborhood. Its shape is 'relatively permanent' but will change occasionally. A tract holds about 4,000 inhabitants on average.   

We can find the shapefiles for Census tracts at [https://www.census.gov/geo/maps-data/data/cbf/cbf_tracts.html](https://www.census.gov/geo/maps-data/data/cbf/cbf_tracts.html).

Download the Wisconsin tract shapefile and extract the zipped file into your current working directory.

In [None]:
# Read the file into a GeoDataFrame. Note that Wisconsin's FIPS code is 55.
tracts = geopandas.read_file('cb_2017_55_tract_500k/cb_2017_55_tract_500k.shp')
print(tracts.head(3))
print(tracts.geometry.name)
print(tracts.shape)


We have several codes (state, county, tract), square meters of land (ALAND) and water (AWATER), and the geometry data for each tract. There are 1,396 tracts in the data. Let's take a look.

In [None]:
fig, gax = plt.subplots(figsize = (20,10))

tracts.plot(ax=gax, edgecolor='black', color='white')

gax.set_title('Census tracts in Wisconsin')

gax.axis('off')

plt.show()

Just looking at the map with the tracts gives you a pretty good idea of where most of the population lives.
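
The state, county, and tract codes mentioned above are also packed into each tract's GEOID: two digits for the state, three for the county, and six for the tract. Here is a minimal sketch of pulling them apart. It uses made-up GEOIDs rather than the shapefile, and the column names `STATEFP`, `COUNTYFP`, and `TRACTCE` are assumptions, so check them against `tracts.head()`.

```python
import pandas as pd

# Hypothetical 11-digit tract GEOIDs, stored as strings as in the shapefile.
geoids = pd.Series(['55025000100', '55079000101'])

# Slice by position: state (2 digits), county (3 digits), tract (6 digits).
parts = pd.DataFrame({
    'STATEFP': geoids.str[:2],     # assumed column name
    'COUNTYFP': geoids.str[2:5],   # assumed column name
    'TRACTCE': geoids.str[5:],     # assumed column name
})
print(parts)
```

Keeping the codes as strings preserves leading zeros, which matters for county codes such as 025.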

## Practice

Take a few minutes and try the following. Feel free to chat with those around you if you get stuck. The TA and I are here, too.

1. Zoom in on Dane County: plot only the tracts in Dane County. Dane's county code is 025.

2. Madison's defining geographic feature is its lakes. Color the lakes in blue. \[Hint: Look for tracts with no land area.\]

### American Community Survey
The ACS is a survey that collects data that used to be collected on the 'long-form' U.S. census. This includes data about a household's demographics, income, ancestry, education, and more.

Data are collected on individual households, but the data are confidential. The individual data are aggregated to provide confidentiality when reported. For large areas (states, large counties) data are available at one-year intervals. For smaller areas, such as census tracts or block groups, data are available at five-year intervals.

An easy place to find the ACS data is the Census 'American FactFinder.' Go to [https://factfinder.census.gov/faces/nav/jsf/pages/index.xhtml](https://factfinder.census.gov/faces/nav/jsf/pages/index.xhtml) and choose 'advanced search.'

I have already downloaded files on insurance coverage by age and gender. The file 'ACS_16_5YR_B27001_with_ann.csv' has the data.

In [None]:
ins = pd.read_csv('ACS_16_5YR_B27001/ACS_16_5YR_B27001_with_ann.csv')
print(ins.head())
print(ins.shape)

That is a lot of data. We can find the variable names in 'ACS_16_5YR_B27001_metadata.csv'.


Let's just keep the data for 18-24 year olds. Keep the total count and those with insurance. We can work out the count without insurance. 

In [None]:
# Keep just these variables
vars_to_keep = ['GEO.id2', 'HD01_VD09', 'HD01_VD10', 'HD01_VD37', 'HD01_VD38']
ins = ins[vars_to_keep]

In [None]:
# Rename them to something sensible. I can imagine setting up this data frame with a separate column for gender 
# and simplifying the variable names. 
ins = ins.rename(columns={'HD01_VD09':'m1824_total', 'HD01_VD10':'m1824_wins',
                          'HD01_VD37':'f1824_total', 'HD01_VD38':'f1824_wins'})
ins.head(3)
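
The comment above imagines a layout with a separate column for gender. One possible way to get there is sketched below on made-up numbers (not the ACS file), using `melt` and `pivot_table`. The names `wide`, `stacked`, `tidy`, and `unins_count` are hypothetical.

```python
import pandas as pd

# Toy data with the renamed columns; the counts are made up for illustration.
wide = pd.DataFrame({
    'GEO.id2': [55025000100, 55025000201],
    'm1824_total': [100, 50],
    'm1824_wins': [90, 40],
    'f1824_total': [120, 60],
    'f1824_wins': [114, 45],
})

# Stack the four count columns into (tract, variable, count) rows.
stacked = wide.melt(id_vars='GEO.id2', var_name='variable', value_name='count')

# The first character is the gender; the text after the underscore is the measure.
stacked['gender'] = stacked['variable'].str[0]
stacked['measure'] = stacked['variable'].str.split('_').str[1]

# One row per tract and gender, with 'total' and 'wins' as columns.
tidy = stacked.pivot_table(index=['GEO.id2', 'gender'], columns='measure',
                           values='count', aggfunc='sum').reset_index()
tidy.columns.name = None

# The count without insurance is the total minus those with insurance.
tidy['unins_count'] = tidy['total'] - tidy['wins']
print(tidy)
```

The rest of this notebook keeps the wide layout, so this is only an alternative to consider.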

### Merge the geography data with the ACS data
We can merge these data with the shapefiles from earlier. The GEOID fields make this very easy. First, let's make sure the data are ready to be merged.

In [None]:
ins.info()

In [None]:
tracts.info()

In [None]:
# GEOID is stored as a string in the shapefile. Convert it to a number so it can be matched with GEO.id2.
tracts['GEOID'] = tracts['GEOID'].astype(float)

# Now merge. I am using a right merge so that every tract is kept, even if the
# insurance data and the shapefile do not cover exactly the same set of tracts.
ins_tracts = pd.merge(left=ins, right=tracts, left_on='GEO.id2', right_on='GEOID', how='right')
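
If you are unsure which table has rows the other lacks, `pd.merge` can report it through its `indicator` argument. A minimal sketch on made-up tables follows. The GEOIDs and counts are invented, the `toy_` names are hypothetical, and the conversion uses `int64` rather than `float`, which is an assumption about how the ACS key is stored.

```python
import pandas as pd

# Toy stand-ins for the two tables; GEOIDs and counts are made up.
toy_tracts = pd.DataFrame({'GEOID': ['55025000100', '55025000201', '55025991702']})
toy_ins = pd.DataFrame({'GEO.id2': [55025000100, 55025000201, 55079000101],
                        'm1824_total': [100, 50, 75]})

# Match the key types before merging: strings of digits become 64-bit integers.
toy_tracts['GEOID'] = toy_tracts['GEOID'].astype('int64')

# An outer merge keeps unmatched rows from both sides, and indicator=True
# adds a '_merge' column recording where each row came from.
check = pd.merge(left=toy_ins, right=toy_tracts, left_on='GEO.id2',
                 right_on='GEOID', how='outer', indicator=True)
print(check['_merge'].value_counts())
```

Rows marked `right_only` are tracts with no insurance data, and rows marked `left_only` are insurance records with no matching tract.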

Create a variable that holds the share of people aged 18-24 without insurance. 

In [None]:
ins_tracts['unins'] =  1-(ins_tracts['m1824_wins'] + ins_tracts['f1824_wins']) / (ins_tracts['m1824_total'] + ins_tracts['f1824_total'])
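
It is worth knowing how this formula behaves at the edges. The sketch below uses three made-up tracts (not the ACS data): one ordinary tract, one with nobody aged 18-24, and one with no insurance data at all. The name `toy` is hypothetical.

```python
import pandas as pd

# Row 0: an ordinary tract. Row 1: no residents aged 18-24. Row 2: no data.
toy = pd.DataFrame({
    'm1824_total': [100, 0, None],
    'm1824_wins': [90, 0, None],
    'f1824_total': [100, 0, None],
    'f1824_wins': [80, 0, None],
})

# Same formula as above, applied to the toy data.
toy['unins'] = 1 - (toy['m1824_wins'] + toy['f1824_wins']) / (toy['m1824_total'] + toy['f1824_total'])
print(toy['unins'])

# Zero divided by zero and missing inputs both come out as NaN.
print(toy['unins'].isna().sum())

# Comparisons with NaN are False, so a filter like this drops those rows.
print(toy[toy['unins'] >= 0])
```

In this sketch a tract can end up with a missing rate either because it has no matching ACS row or because its 18-24 population is zero.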

Before we can plot, we need to turn the DataFrame into a GeoDataFrame.

In [None]:
# What do we have?
print(type(ins_tracts))        

# Create a GeoDataFrame by passing a DataFrame and calling out a geometry
ins_tracts = geopandas.GeoDataFrame(ins_tracts, geometry = 'geometry')

# What do we have now?
print(type(ins_tracts))

In [None]:
ins_tracts['unins'].describe()

## Practice

1. Plot the uninsured rate for each census tract in Wisconsin. 


Does your map look correct? Did you get an error message? The problem is that a few of the tracts in the shapefile do not have data associated with them. \[How could you figure that out?\] These missing values are causing problems.

Try plotting only the census tracts that have `ins_tracts['unins']>=0`.

Go back and fix up your code.

We want to mark the census tracts with missing data with hatched lines. If we do not mark them, they will have the background color, and could be mistaken for tracts with `unins = 0`.

2. Start by plotting all the tracts with `ins_tracts['unins']>=0` for Dane county. 

3. Now, go back and add a 'hatched' grey to your plot for the areas without data. Something like
```python
.plot(ax=gax, hatch='///', color='grey')
```
called on the set of tracts in Dane county that do not have data \[More subsetting!\]. Why might you want to make the hatched areas blue?