# Solar installations


To get things started, let's load the data for this project.

We've provided a single CSV file which combines census-tract-level data from Google's [Project Sunroof](https://sunroof.withgoogle.com/data-explorer/) and the US Census Bureau's 2013 [American Community Survey (ACS)](https://www.census.gov/data/developers/data-sets/acs-5year.2013.html).  Downloading the raw ACS data turns out to be a bit tricky because there is so much information; at the end of this project is a challenge problem to use web requests to download other bits of data that you might want.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

data = pd.read_csv('CensusSunroofMerged.csv')

data.columns

## Exploring the data

What is in the data?

There are several columns identifying the census tract:
* `line`  Line number from the CSV file
* `region_id_str`  Identifier for the geographic region, unique within the entire dataset
* `tract_number`  Tract ID number, only unique within a county
* `state_name`  Full name of the state that this tract is part of
* `county_name`  Full name of the county that this tract is part of.  Includes the word "county"

Then we have the Project Sunroof data (you can read the full description of the data columns in the Sunroof [metadata.csv](https://storage.googleapis.com/project-sunroof/csv/latest/metadata.csv) file):
* `percent_covered`  Percentage of buildings in Google Maps covered by Project Sunroof
* `existing_installs_count`  Number of buildings estimated to have a solar installation at time of data collection
* `count_qualified`  Number of buildings in Google Maps that are suitable for solar

And finally we have demographic data from the US Census Bureau:
* `Med_HHD_Inc_ACS_09_13`  Median household income
* `Renter_Occp_HU_ACS_09_13`  *Number* of renter-occupied housing units
* `Owner_Occp_HU_ACS_09_13` *Number* of owner-occupied housing units
* `pct_NH_Asian_alone_ACS_09_13` *Percentage* of population that identifies as Asian (and non-Hispanic)
* `pct_NH_Blk_alone_ACS_09_13` ... Black (and non-Hispanic)
* `pct_Hispanic_ACS_09_13` ... Hispanic
* `pct_NH_White_alone_ACS_09_13` ... White (and non-Hispanic)

## Preliminary analysis
1. There are approximately 73000 census tracts in the US (50 states plus Washington DC).  How many census tracts does Project Sunroof have data for?

In [None]:
# Your code here...

2. Calculate the solar installation rate (i.e., the number of existing solar installations divided by the number of rooftops which are qualified).  Which census tract has the highest installation rate?

*Hint 1: you will need to filter the data in order to get a realistic answer.  There isn't a unique right answer, so you should include a brief explanation of why your answer is correct.*

*Hint 2: The [pandas `idxmax()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.idxmax.html#pandas.Series.idxmax) method may be useful.  You can use [pandas `loc[]`](https://pandas.pydata.org/docs/user_guide/indexing.html#indexing-integer) to grab a particular row based on the label returned by `idxmax()`.*
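
As a minimal sketch of how `idxmax()` and `loc[]` fit together, here is an example on a tiny toy table. The column names (`town`, `rate`), the values, and the index labels are all made up for illustration and are not from the project CSV.

```python
import pandas as pd

# Toy data, NOT from the project CSV: hypothetical column names and values.
toy = pd.DataFrame(
    {"town": ["A", "B", "C", "D"], "rate": [0.10, 0.45, 0.30, 0.05]},
    index=[10, 20, 30, 40],
)

best_label = toy["rate"].idxmax()  # index label of the row with the largest rate
best_row = toy.loc[best_label]     # look that row up by its label

print(best_label)
print(best_row["town"])
```

Note that `idxmax()` returns an index *label* rather than a position, which is why it pairs with `loc[]`.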

*Challenge*: See if you can find this place on Google Maps and look at the aerial imagery.  Do you see all the solar panels?

In [None]:
# Your code here...

3. Make a scatter plot showing the solar installation rate (as calculated above) as a function of the median household income for each census tract.

In [None]:
# Your code here...

## Evaluating equity

In this part of the project, you'll examine the data to see if there are racial/ethnic disparities in who is installing solar.  Obviously solar adoption is dependent on income, and there are major income disparities between racial groups, so our goal here will be to see if there is still a disparity *after controlling for income*.  We'll start with the state of California, but we encourage you to try other states as well.

### Cleaning the data
To do this, start by "cleaning" the data.  You've already seen some of the wacky stuff in the data; we'll need to remove any obviously erroneous data points before we run the analysis. 

* Remove any tracts (data rows) where the Project Sunroof coverage is less than 95% or over 100%.
* Remove any tracts where the median income is less than $23834 (the 2013 poverty threshold) or is NaN.  (The `~` operator may be useful here; it inverts all the `True`/`False` values in a NumPy/pandas array.)
* Remove any tracts where there are no qualified buildings for solar (`count_qualified` is 0).
* Remove any tracts where there are more installations than qualified buildings.
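
One common way to do this kind of filtering is with boolean masks. The sketch below uses a toy table with hypothetical column names (`income`, `coverage`) and arbitrary thresholds, so it shows the pattern rather than the answer for this dataset.

```python
import numpy as np
import pandas as pd

# Toy data, NOT from the project CSV: hypothetical columns and values.
toy = pd.DataFrame(
    {
        "income": [52000.0, np.nan, 18000.0, 75000.0],
        "coverage": [99.0, 97.0, 98.0, 120.0],
    }
)

# Build one True/False mask per "bad" condition.
# Parentheses around the comparison are needed because | binds tighter than <.
bad_income = toy["income"].isna() | (toy["income"] < 20000)  # arbitrary threshold
bad_coverage = toy["coverage"] > 100

# ~ flips True/False, so this keeps rows that are bad in neither way.
cleaned = toy[~bad_income & ~bad_coverage]

print(cleaned)
```

Naming each mask separately makes it easy to check how many rows each condition removes (for example with `bad_income.sum()`).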



In [None]:
data = pd.read_csv('CensusSunroofMerged.csv')
data = data[data["state_name"] == "California"] # Feel free to select another state here

# Your code here...

### Solar adoption by race
1. Calculate the solar installation fraction for your cleaned data set.
2. For each of the four racial/ethnic groups (White/Black/Hispanic/Asian):
    * Select the tracts where the racial group comprises a majority of the population (> 50%)
    * Find a polynomial fit that estimates the solar installation fraction as a function of household income for the selected tracts.
    
        * You can choose the polynomial degree.  Once you get your code working, experiment with different degree polynomials and see if it affects your answer.  Try small numbers like 1 and big numbers like 20.
        * You may want to look back at lecture 16 for the curve-fitting example code.
        * *Hint: Jupyter may hang if you try to curve-fit with data containing `Inf` or `NaN` values.  Make sure you've removed all of these in your data-cleaning step!*

3. Plot all four polynomials on the same graph.
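
For reference, here is a minimal sketch of a polynomial fit using NumPy's `polyfit` and `polyval`. This assumes a NumPy-based approach, which may differ from the lecture code, and the data is synthetic (a noisy straight line), not taken from the project CSV.

```python
import numpy as np

# Synthetic data for illustration only: a noisy line y = 0.5*x + 2.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 0.5 * x + 2.0 + rng.normal(scale=0.1, size=x.size)

degree = 1                          # try other degrees here
coeffs = np.polyfit(x, y, degree)   # coefficients, highest power first

# Evaluate the fitted polynomial on a smooth grid, ready for plotting.
x_smooth = np.linspace(x.min(), x.max(), 200)
y_fit = np.polyval(coeffs, x_smooth)

print(coeffs)
```

With large `x` values (such as incomes in dollars) and a high degree, the fit can become numerically ill-conditioned; rescaling `x` (for example to thousands of dollars) may help.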


In [None]:
# Your code here...

### Analysis

Based on your results, do you think there is sufficient evidence that there are racial/ethnic disparities in rooftop solar adoption in California when correcting for median household income? Why or why not?

To read more, take a look at the article [Disparities in rooftop photovoltaics deployment in the United States by race and ethnicity](https://escholarship.org/content/qt5sz1j52z/qt5sz1j52z.pdf) by Deborah Sunter et al.  They do a much more careful analysis of the data than we've done here.  While the statistical analysis techniques are beyond what we've talked about in class, you should at least look at the pretty graphs in the paper and read their conclusions.  (As a side note, Deborah Sunter is a [professor in the CEE department here at Tufts](https://engineering.tufts.edu/cee/people/faculty/deborah-sunter).)


*Replace this markdown block with your answer.  Double-click to edit.*

# Challenge: retrieving US census data from the web

The Census Bureau collects a lot more data than we've given you here, and there are some tracts that don't appear in the dataset.

There are several questions you could try to answer with additional data:

* What fraction of the US population is covered by Project Sunroof?
* Are there demographic differences between the areas covered by Project Sunroof and not?  I.e., is Project Sunroof coverage representative of the US population as a whole?

The US Census Bureau has an extensive API, but it is not particularly well documented.  Here are a couple of API URLs to get you started:
`https://api.census.gov/data/2013/acs/acs5?get=NAME,B01001_001E&for=tract:*&in=state:25`

You can get a list of state IDs here: `https://api.census.gov/data/2013/acs/acs5?get=NAME,B01001_001E&for=state:*` 


Note: please be careful about how many requests you make to the US Census API.  If your code makes more than a dozen requests at a time, please put in a `sleep()` statement to pause for a couple of seconds between requests.
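
A rough sketch of how the requests and the pause might be organized is shown below, using only the standard library plus pandas. The helper names (`tract_url`, `rows_to_frame`, `fetch_states`) are hypothetical, and the code assumes the API responds with a JSON list of rows whose first row holds the column names. The sample rows at the bottom are made-up placeholders so the parsing step can be tried without touching the network.

```python
import json
import time
import urllib.request

import pandas as pd

BASE_URL = "https://api.census.gov/data/2013/acs/acs5"


def tract_url(state_id):
    """Build the tract-level query URL shown above for one state ID."""
    return f"{BASE_URL}?get=NAME,B01001_001E&for=tract:*&in=state:{state_id}"


def rows_to_frame(rows):
    """Assumption: rows is a list of lists, with the header as the first row."""
    return pd.DataFrame(rows[1:], columns=rows[0])


def fetch_states(state_ids, pause_seconds=2.0):
    """Fetch tracts for several states, pausing between requests."""
    frames = []
    for i, state_id in enumerate(state_ids):
        if i > 0:
            time.sleep(pause_seconds)  # be polite to the API
        with urllib.request.urlopen(tract_url(state_id)) as response:
            rows = json.load(response)
        frames.append(rows_to_frame(rows))
    return pd.concat(frames, ignore_index=True)


# Offline check of the parsing step, using made-up placeholder rows.
sample = [
    ["NAME", "B01001_001E", "state", "county", "tract"],
    ["Placeholder tract", "1234", "25", "001", "000100"],
]
frame = rows_to_frame(sample)
print(frame)
```

Calling `fetch_states(["25"])` would make a real request. The returned values may arrive as strings, so you might need to convert numeric columns (for example with `pd.to_numeric`) before doing arithmetic.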
