Project 1 - California Water Usage
=================

Welcome to the first project in DSC 10! We will be exploring possible connections between water usage, geography, and income in California. The water data for this project was procured from the [California State Water Resources Control Board](http://www2.pacinst.org/gpcd/table.html) and curated by the [Pacific Institute](http://pacinst.org/). The map data includes [US topography](https://github.com/jgoodall/us-maps), [California counties](https://github.com/johan/world.geo.json/tree/master/countries/USA/CA), and [ZIP codes](http://bl.ocks.org/jefffriesen/6892860).

The income data comes from the IRS ([documentation](http://www.irs.gov/pub/irs-soi/13zpdoc.doc)). We have identified some interesting columns in the dataset. A full description of all the columns, along with a definition of the population covered and the anonymization procedures the IRS used, is available in this [description](irs_info.pdf).

As usual, **run the cell below** to prepare the automatic tests. **Passing the automatic tests does not guarantee full credit on any question.** The tests are provided to help catch some common errors, but it is *your* responsibility to answer the questions correctly.

In [2]:
# Run this cell, but please don't change it.

import numpy as np
import math
from datascience import *

# These lines set up the plotting functionality and formatting.
import matplotlib
# The `warn` keyword was removed in matplotlib 3.3, so it is omitted here.
matplotlib.use('Agg')
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

# These lines load the tests.
from client.api.assignment import load_assignment 
from client.api.notebook import Notebook
project1 = Notebook('project1.ok')

Assignment: Project 1
OK, version v1.13.5



First, load the data. Loading may take some time.

In [4]:
# Run this cell, but please don't change it.

districts = Map.read_geojson('water_districts.geojson')
zips = Map.read_geojson('ca_zips.geojson.gz')
usage_raw = Table.read_table('water_usage.csv', dtype={'pwsid': str})
income_raw = Table.read_table('ca_income_by_zip.csv', dtype={'ZIP': str}).drop(['STATEFIPS', 'STATE', 'agi_stub'])
wd_vs_zip = Table.read_table('wd_vs_zip.csv', dtype={'PWSID': str, 'ZIP': str}).set_format([2, 3], PercentFormatter)

Part 0: Maps
======

The `districts` and `zips` data sets are `Map` objects. Documentation on mapping in the `datascience` package can be found [here](http://data8.org/datascience/maps.html). To view a map of California's water districts, run the cell below. Click on a district to see its description.

In [5]:
districts.format(width=400, height=200)



A `Map` is a collection of regions and other features such as points and markers, each of which has a **string** `id` and various properties. You can view the features of the `districts` map as a table using `Table.from_records`.

In [8]:
district_table = Table.from_records(districts.features)
district_table.show(3)

PWSID,feature,id,popupContent
110001,,0,Alameda County Water District
110003,,1,California Water Service Company Livermore
110005,,2,East Bay Municipal Utilities District


Part 2: Water Usage
======
We will now investigate water usage in California. The `usage` table contains three columns:

- `PWSID`: The Public Water Supply Identifier of the district
- `Population`: Estimate of average population served in 2015
- `Water`: Average residential water use (gallons per person per day) in 2014-2015

In [9]:
# Run this cell to create the usage table

usage_raw.set_format(4, NumberFormatter)
max_pop = usage_raw.select([0, 'population']).group(0, max).relabeled(1, 'Population')
avg_water = usage_raw.select([0, 'res_gpcd']).group(0, np.mean).relabeled(1, 'Water')
usage = max_pop.join('pwsid', avg_water).relabeled(0, 'PWSID')
usage

PWSID,Population,Water
110001,340000,70.7
110003,57450,90.2727
110005,1390000,76.0
110006,151037,57.1818
110008,73067,96.6364
110009,79547,68.6364
110011,31994,85.8182
310003,23347,82.8182
410002,101447,142.0
410005,11208,88.8182
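
To make the aggregation in the cell above more concrete, here is a minimal plain-Python sketch of the same idea: collect the rows that share a district id, take the maximum population and the mean residential use for each district, then combine the two results. The rows in `sample_rows` are made up for illustration and are not taken from `water_usage.csv`, and the helper name `summarize_usage` is hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (pwsid, population, res_gpcd) rows; NOT real data from the project.
sample_rows = [
    ('A', 100, 50.0),
    ('A', 120, 70.0),
    ('B', 300, 90.0),
]

def summarize_usage(rows):
    """Return {pwsid: (max population, mean res_gpcd)} for the given rows."""
    populations = defaultdict(list)
    water = defaultdict(list)
    for pwsid, population, res_gpcd in rows:
        populations[pwsid].append(population)
        water[pwsid].append(res_gpcd)
    # Roughly mirrors group(0, max) and group(0, np.mean), then a join on pwsid.
    return {pwsid: (max(populations[pwsid]), mean(water[pwsid]))
            for pwsid in populations}

summary = summarize_usage(sample_rows)
print(summary)
```

Each district ends up with exactly one row, which is what the map-coloring step later relies on.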


**Question 2.1.** Draw a map of the water districts, colored by the per capita water usage in each district.

Use the `districts.color(...)` method to generate the map. It takes as its first argument a two-column table with one row per district that has the district `PWSID` as its first column. The label of the second column is used in the legend of the map, and the values are used to color each region.
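
As a rough illustration of how a two-column table can drive the coloring, the sketch below matches each region to a value through its `PWSID`. It uses plain Python dictionaries rather than the `datascience` `Map` class, and the regions, values, and the `values_by_region` helper are all hypothetical; the real `color` method may work differently internally.

```python
# Hypothetical regions, shaped loosely like map features; NOT the real districts.
regions = [
    {'id': '0', 'properties': {'PWSID': 'A'}},
    {'id': '1', 'properties': {'PWSID': 'B'}},
    {'id': '2', 'properties': {'PWSID': 'C'}},
]

# A two-column "table": the first column is the key, the second is the value.
rows = [('A', 60.0), ('B', 90.0)]

def values_by_region(regions, rows):
    """Look up each region's value using its PWSID property as the key."""
    lookup = dict(rows)
    # Regions whose PWSID is missing from the table get None (no color value).
    return {region['id']: lookup.get(region['properties']['PWSID'])
            for region in regions}

matched = values_by_region(regions, rows)
print(matched)
```

The point of the sketch is that the first column must hold the same identifiers that the regions carry, otherwise no values line up.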

In [1]:
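# 'Water' is already measured per person, so no division by population is needed.
# color() expects a two-column table: PWSID first, then the values to plot.
per_capita_usage = usage.select(['PWSID', 'Water'])
districts.color(per_capita_usage, key_on='feature.properties.PWSID')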
