## ETL of Census Data

The "Escape the Bay" project was a success, but it generated 11 separate CSV files!

Because each analysis was performed separately, it was difficult to draw correlations between the datasets.

The purpose of this ETL homework is to create a database where all the data can be stored and queries can be written across datasets.

Thus the tasks will be:

1) Extract data from 8 CSVs (there is some duplication of information) and import into Pandas

2) Transform
   A. Eliminate unneeded data and missing data
   B. Harmonize the naming of the key cities in the analysis so the tables can be joined more easily
   C. Based on the dataset size, determine whether to join the data in Pandas or in SQL
   
3) Create the schema for the Escape-The-Bay database

4) Create the tables in SQL

5) Upload the CSVs into a PostgreSQL database

6) Check the database and write a few sample queries using SQLAlchemy

7) Using SQLAlchemy, plot with Matplotlib to see new visualizations that were previously not possible

8) Document!
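
The payoff of the load steps can be sketched in a few lines. This is a minimal illustration, not the project's schema: it assumes two hypothetical tables (`migration`, `housing`) with made-up values, and it uses an in-memory SQLite database as a stand-in for PostgreSQL so that it runs without a server.

```python
import sqlite3

import pandas as pd

# Toy rows with hypothetical values, standing in for two of the cleaned tables.
migration = pd.DataFrame({"county": ["Alameda", "San Mateo"], "migrated": [100, 80]})
housing = pd.DataFrame({"county": ["Alameda", "San Mateo"], "med_rent": [1500, 2000]})

# In-memory SQLite stands in for PostgreSQL so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
migration.to_sql("migration", conn, index=False)
housing.to_sql("housing", conn, index=False)

# Once both tables share a harmonized "county" key, one query spans datasets.
query = """
    SELECT m.county, m.migrated, h.med_rent
    FROM migration AS m
    JOIN housing AS h ON h.county = m.county
    ORDER BY m.migrated DESC
"""
joined = pd.read_sql_query(query, conn)
conn.close()
print(joined)
```

The join only works because both tables spell the county names the same way, which is why the harmonization step in the Transform stage matters.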
   
References:  The original data comes from Vanessa Oakes, Emily Todd, Stefan Zobrist and Rebecca Mih

## Transform

In [1]:
import csv
from matplotlib.ticker import FuncFormatter
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import requests
import json

In [2]:
# Census Quick Facts CSV
out_df = pd.read_csv('Quick Facts- Outside CA.csv')
california_df = pd.read_csv('Quick Facts-California.csv')
marital_ca_df = pd.read_csv('California - Marital 3.csv')
marital_out_df = pd.read_csv('Out of State - Marital.csv')

# Census API call and CSV creation
# The API calls later in this notebook build the datasets for median home values, median gross rents, and homeownership rates
base_url = "https://api.census.gov/data/2017/acs/acs1/profile"
csv_path = "SF_All_OUT.csv"
sf_out_df = pd.read_csv(csv_path)

# Census Advanced Fact Finder CSVs.
# The "counties" CSVs contain information regarding:
# population demographics, number of owner-occupied housing units, median value of owner-occupied housing, median gross rent,
# median income, total number of employer establishments, total annual payroll, FIPS code

# The "income and mortgage" CSV is a unique dataset that contains the distributions (bins) of the total household income,
# the mortgage values, and the debt-to-income ratio, by county

CA_counties = pd.read_csv("CA_counties.csv")
non_CA_counties = pd.read_csv("non_CA_counties.csv")
ACS_data = pd.read_csv("2017_income_mortgage.csv")


In [3]:
#  First determine where people migrate to,  from San Francisco.  Look at those moving within CA

#Top 5 in-CA counties
ranked_sf_out_df = sf_out_df.sort_values(by='Total', ascending=False)
sf_to_ca = ranked_sf_out_df.loc[ranked_sf_out_df["State Name"] == "California",:]
sf_to_ca.head(5)

# Transform the data -- drop the leading columns (including the State FIPS code) and keep the top 5 rows

sf_to_ca = sf_to_ca.rename(columns={"Total": "# Migrated from SF County (2017)"})

# .copy() makes the summary independent of sf_to_ca, so the assignment below does not raise SettingWithCopyWarning
sftoca_summary = sf_to_ca.iloc[:5,3:7].copy()

# Shorten the county names so they match the names used in the other tables
sftoca_summary['County Name'] = sftoca_summary['County Name'].replace(
    {'Alameda County': 'Alameda', 'San Mateo County': 'San Mateo', "Contra Costa County":'Contra Costa',
     "Los Angeles County":'Los Angeles', 'Santa Clara County':'Santa Clara'})

sftoca_summary.to_csv('ca_destinations.csv')
sftoca_summary = sftoca_summary.set_index(["County Name"])
sftoca_summary.head()

County Name,State Name,# Migrated from SF County (2017),Margin of Error (+/-)
Alameda,California,10791,1127
San Mateo,California,8995,1054
Contra Costa,California,4085,631
Los Angeles,California,3726,547
Santa Clara,California,3383,447


In [4]:
#  Top destinations moving outside of CA

#Top 5 non-CA counties
sf_not_ca = ranked_sf_out_df.loc[ranked_sf_out_df["State Name"] != "California",:]
sf_not_ca.head(5)

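# Transform the data -- drop the leading columns (including the State FIPS code) and keep the top 5 rows

sf_not_ca = sf_not_ca.rename(columns={"Total": "# Migrated from SF County (2017)"})

# .copy() makes the summary independent of sf_not_ca, so the assignment below does not raise SettingWithCopyWarning
sfnotca_summary = sf_not_ca.iloc[:5,3:7].copy()

# Shorten the county names so they match the names used in the other tables
sfnotca_summary['County Name'] = sfnotca_summary['County Name'].replace(
    {'New York County': 'NY (Manhattan)', 'King County': 'King', "Multnomah County":'Multnomah',
     "Kings County":'Kings (Brooklyn)', 'Cook County':'Cook'})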

sfnotca_summary = sfnotca_summary.set_index(["County Name"])

sfnotca_summary.to_csv('non_ca_destinations.csv')
sfnotca_summary.head()

County Name,State Name,# Migrated from SF County (2017),Margin of Error (+/-)
NY (Manhattan),New York,1419,657
King,Washington,1293,336
Multnomah,Oregon,1094,282
Kings (Brooklyn),New York,887,300
Cook,Illinois,635,223


In [5]:
#  Extract the QuickFacts data in CA and outside CA
california_df = pd.read_csv('Quick Facts-California.csv')

#column_titles = ['Fact',
#                'San Francisco County, California',
#                'Alameda County, California', 
#                 'San Mateo County, California',
#                'Contra Costa County, California', 
#                 'Los Angeles County, California',
#                'Santa Clara County, California']

#california_df = california_df.reindex(columns = column_titles)
california_df



,Fact,Fact Note,"San Francisco County, California","Alameda County, California","San Mateo County, California","Contra Costa County, California","Los Angeles County, California","Santa Clara County, California"
0,"Population estimates, July 1, 2017, (V2017)",,884363,1663190,771410,1147439,10163507,1938153
1,"Population estimates base, April 1, 2010, (V2...",,805193,1510261,718500,1049200,9818696,1781671
2,"Population, percent change - April 1, 2010 (es...",,9.80%,10.10%,7.40%,9.40%,3.50%,8.80%
3,"Population, Census, April 1, 2010",,805235,1510271,718451,1049025,9818605,1781642
4,"Persons under 5 years, percent",,4.50%,5.90%,5.70%,5.70%,6.10%,6.10%
5,"Persons under 18 years, percent",,13.40%,20.70%,20.80%,22.80%,21.90%,22.20%
6,"Persons 65 years and over, percent",,15.40%,13.50%,15.80%,15.30%,13.20%,13.10%
7,"Female persons, percent",,49.00%,50.80%,50.70%,51.10%,50.70%,49.50%
8,"White alone, percent",,53.10%,50.20%,60.60%,65.90%,70.90%,53.80%
9,"Black or African American alone, percent",(a),5.50%,11.30%,2.80%,9.50%,9.00%,2.80%


In [None]:
out_df = pd.read_csv('Quick Facts- Outside CA.csv')
out_df.drop(columns = ["Fact Note"], inplace=True) 

column_titles = ['Fact',
                'San Francisco County, California',
                'New York County (Manhattan Borough), New York', 
                 'King County, Washington',
                'Multnomah County, Oregon', 
                 'Kings County (Brooklyn Borough), New York',
                'Cook County, Illinois']

out_df.reindex(columns = column_titles)

In [None]:
#dictionary for in-CA counties

base_url = "https://api.census.gov/data/2017/acs/acs1/profile"

ca_cty_name = ["San Francisco","Alameda","San Mateo","Contra Costa","Los Angeles","Santa Clara"]
ca_st_fips = ["06","06","06","06","06","06"]
ca_cty_fips = ["075","001","081","013","037","085"]

in_ca_dict = {
    "County Name": ca_cty_name,
    "State_FIPS": ca_st_fips,
    "County_FIPS": ca_cty_fips
}

in_ca_df = pd.DataFrame(in_ca_dict)
print(in_ca_df)

#dictionary for non-CA counties

nonca_cty_name = ["San Francisco","NY (Manhattan), NY","King, WA","Multnomah, OR","Kings (Brooklyn), NY","Cook, IL"]
nonca_st_fips = ["06","36","53","41","36","17"]
nonca_cty_fips = ["075","061","033","051","047","031"]

non_ca_dict = {
    "County Name": nonca_cty_name,
    "State_FIPS": nonca_st_fips,
    "County_FIPS": nonca_cty_fips
}

non_ca_df = pd.DataFrame(non_ca_dict)
print(non_ca_df)

In [None]:
#collect median home values by county and add to data frames

ca_med_home_val = []
med_home_var = "DP04_0089E"
    
for county_id, state_id in zip(ca_cty_fips, ca_st_fips):
    med_home_val = requests.get(f"{base_url}?get={med_home_var}&for=county:{county_id}&in=state:{state_id}").json()
    ca_med_home_val.append(int(med_home_val[1][0]))
    
in_ca_df["Med_Home_Value"] = ca_med_home_val
in_ca_df.to_csv('ca_home_value.csv')
print(in_ca_df)

non_ca_med_home_val = []
med_home_var = "DP04_0089E"

for county_id, state_id in zip(nonca_cty_fips, nonca_st_fips):
    med_home_val = requests.get(f"{base_url}?get={med_home_var}&for=county:{county_id}&in=state:{state_id}").json()
    non_ca_med_home_val.append(int(med_home_val[1][0]))
    
non_ca_df["Med_Home_Value"] = non_ca_med_home_val
non_ca_df.to_csv('nonca_home_value.csv')
print(non_ca_df)

In [None]:
#follow the same process for median gross rents

ca_med_rent = []
med_rent_var = "DP04_0134E"
    
for county_id, state_id in zip(ca_cty_fips, ca_st_fips):
    med_rent = requests.get(f"{base_url}?get={med_rent_var}&for=county:{county_id}&in=state:{state_id}").json()
    ca_med_rent.append(int(med_rent[1][0]))
    
in_ca_df["Med_Rent"] = ca_med_rent

in_ca_df.to_csv('ca_rents.csv')
print(in_ca_df)

non_ca_med_rent = []
med_rent_var = "DP04_0134E"

for county_id, state_id in zip(nonca_cty_fips, nonca_st_fips):
    med_rent= requests.get(f"{base_url}?get={med_rent_var}&for=county:{county_id}&in=state:{state_id}").json()
    non_ca_med_rent.append(int(med_rent[1][0]))
    
non_ca_df["Med_Rent"] = non_ca_med_rent
non_ca_df.to_csv('nonca_rents.csv')
print(non_ca_df)

In [None]:
#follow the same process for home owner rate

ca_own_rate = []
home_own_var = "DP04_0046PE"
    
for county_id, state_id in zip(ca_cty_fips, ca_st_fips):
    own_rate = requests.get(f"{base_url}?get={home_own_var}&for=county:{county_id}&in=state:{state_id}").json()
    ca_own_rate.append(float(own_rate[1][0]))
    
in_ca_df["Home Own Rate"] = ca_own_rate
in_ca_df.to_csv('ca_homeowner_rates.csv')
print(in_ca_df)

nonca_own_rate = []
home_own_var = "DP04_0046PE"
    
for county_id, state_id in zip(nonca_cty_fips, nonca_st_fips):
    own_rate = requests.get(f"{base_url}?get={home_own_var}&for=county:{county_id}&in=state:{state_id}").json()
    nonca_own_rate.append(float(own_rate[1][0]))
    
non_ca_df["Home Own Rate"] = nonca_own_rate
non_ca_df.to_csv('nonca_homeowner_rates.csv')
print(non_ca_df)
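
The three cells above repeat the same request loop for each variable. One possible consolidation is sketched below; the helper names (`build_query_url`, `parse_value`, `fetch_variable`) are hypothetical and not part of the original notebook, and the response is assumed to have the same `[header_row, data_row]` shape that the cells above index with `[1][0]`.

```python
BASE_URL = "https://api.census.gov/data/2017/acs/acs1/profile"


def build_query_url(variable, county_fips, state_fips, base_url=BASE_URL):
    """Build the same query string that the cells above assemble inline."""
    return f"{base_url}?get={variable}&for=county:{county_fips}&in=state:{state_fips}"


def parse_value(payload, cast=int):
    """Extract the value from a response shaped like [header_row, data_row]."""
    return cast(payload[1][0])


def fetch_variable(variable, county_fips_list, state_fips_list, cast=int):
    """Request one ACS variable for each (county, state) pair."""
    import requests  # imported lazily so the pure helpers need no network library

    values = []
    for county_fips, state_fips in zip(county_fips_list, state_fips_list):
        url = build_query_url(variable, county_fips, state_fips)
        response = requests.get(url, timeout=30)
        response.raise_for_status()  # fail loudly on HTTP errors
        values.append(parse_value(response.json(), cast))
    return values
```

With these helpers, a column could be filled with a call such as `fetch_variable("DP04_0134E", ca_cty_fips, ca_st_fips)`, passing `cast=float` for percentage variables like the homeownership rate.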

## Data Cleanup Steps
Reduce the data size and clean up the naming (for easier reference later on):

df.drop() - Drop all the columns that have no important data
df.dropna() - Drop rows with NaN
df.reset_index() - Reset the index, because a few rows were dropped
df[:x] - Keep only the first x rows and drop the rest
df.rename() - Rename the columns with shorter names so the plots look OK
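
The cleanup calls listed above can be seen end to end on a toy table. This is only a sketch: the rows and values are made up, and the column names merely imitate the QuickFacts layout.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of a QuickFacts-style table (values are made up).
df = pd.DataFrame({
    "Fact": ["Population", "Median rent", "Footnote", "Median income"],
    "Fact Note": [np.nan, np.nan, "(a)", np.nan],
    "San Francisco County, California": ["100", "$1,500", np.nan, "$90,000"],
})

df = df.drop(columns=["Fact Note"])   # drop a column with no useful data
df = df.dropna()                      # drop rows that contain NaN
df = df.reset_index(drop=True)        # renumber the rows after the drop
df = df[:2]                           # keep only the first 2 rows
df = df.rename(columns={"San Francisco County, California": "San Francisco"})
print(df)
```

Dropping the mostly empty "Fact Note" column before calling `dropna()` matters here: otherwise every row with an empty note would be removed.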

In [None]:
# Rename the columns to shorten the names for plotting, look at the DataFrame
CA_counties = CA_counties.rename(columns={"San Francisco County, California": "San Francisco",
                                 "Alameda County, California":"Alameda",
                                 "San Mateo County, California":"San Mateo",
                                 "Contra Costa County, California":"Contra Costa",
                                "Los Angeles County, California":"Los Angeles",
                                "Santa Clara County, California":"Santa Clara"
                                            })

non_CA_counties = non_CA_counties.rename(columns={"San Francisco County, California":"San Francisco",
                                 "New York County (Manhattan Borough), New York":"NY (Manhattan), NY",
                                 "King County, Washington":"King, WA",
                                "Multnomah County, Oregon":"Multnomah, OR",
                                "Kings County (Brooklyn Borough), New York":"Kings (Brooklyn), NY",
                                "Cook County, Illinois":"Cook, IL",
                                })


# Move the Facts into the index to get them out of the way, since that column has no numbers to clean
# Making new DataFrames (ca_data, non_ca_data), so you can always refer back to CA_counties / non_CA_counties to see the row number
ca_data = CA_counties.set_index("Fact")
non_ca_data = non_CA_counties.set_index("Fact")
non_ca_data.head()

## Transform to numerical values
Because the raw data in the CSV is formatted with $, %, or ',' characters, Pandas reads every column into the DataFrame as objects (strings).

Clean the entire table; the user can then pick the specific data fact (row) they wish to use.
df.replace() - Replace the %, $, and ',' characters in the data with an empty string
df.apply(pd.to_numeric) - Convert the objects in each column to numerics; "apply" runs the conversion on every column
Use errors='coerce' to force values to numbers. Any text that cannot be parsed becomes NaN, so the text is lost. If you need to keep the text, use errors='ignore' (note that recent pandas releases deprecate this option).
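
As a small illustration of the two steps described above, the sketch below strips the formatting characters and then coerces to numbers. The table and its values are hypothetical; only the formatting mimics the CSV.

```python
import pandas as pd

# Hypothetical values formatted the way the CSV exports them.
raw = pd.DataFrame(
    {"San Francisco": ["$1,500", "9.80%", "(a)"], "Alameda": ["$1,200", "10.10%", "X"]},
    index=["Median rent", "Population change", "Footnote"],
)

# Strip $, comma, % and double-quote characters in one regex pass.
stripped = raw.replace(r'[$,%"]', "", regex=True)

# Convert each column; unparseable text such as "(a)" becomes NaN.
clean = stripped.apply(pd.to_numeric, errors="coerce")
print(clean)
```

Because `errors="coerce"` silently turns text into NaN, it is worth checking `clean.isna().sum()` afterwards to see how many cells were lost.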


In [None]:
# First remove all the $, % or ,  characters from the dataframe

cols = ca_data.columns
cols_nonca = non_ca_data.columns

# Raw strings keep the regex escape intact (a plain '\$' is an invalid string escape in Python)
ca_data[cols] = ca_data[cols].replace({r'\$': '', ',': '', '%': '', '"': ''}, regex=True)
non_ca_data[cols_nonca] = non_ca_data[cols_nonca].replace({r'\$': '', ',': '', '%': '', '"': ''}, regex=True)


# convert all objects to numerics
# reference:  https://stackoverflow.com/questions/36814100/pandas-to-numeric-for-multiple-columns

clean_ca = ca_data.apply(pd.to_numeric, errors='coerce')
clean_ca.head()


non_ca_data = non_ca_data[cols_nonca].apply(pd.to_numeric, errors='coerce')
non_ca_data.head()

## Income and Mortgage data from the Census FactFinder Advanced Search utility
Reference:
Use the US Census FactFinder Advanced Search functionality to get detailed data in the areas of employment (including income), housing (including mortgage information), and population demographics.

The utility is fairly easy to use, but be warned: there is a LOT of data, and the data is often repeated.

Recommendations: Use the filtering and editing functions in the Advanced Search BEFORE creating your CSV file.

1) Select all counties of interest first as a filter.
   A. Use the graphical interface to input as many locations as you want (by city, county, state, etc.)
   B. Once you have selected all the key locations, you can save the query, which saves time if you are going to do other analyses later on

2) Edit the data down to the minimum you need.
   A. The census provides the data with calculated error, or percent error. Those columns can be filtered out
   B. Remove any columns of data you don't need. It's difficult to change column names with very large datasets, so it is better to minimize the number of columns if possible

The Income and Mortgage CSV used in this notebook was filtered and edited on the Census website, with some additional description cleaning in Excel.
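
If error columns do end up in an export, they can also be removed in Pandas after loading. The sketch below assumes hypothetical column headers that contain the phrase "Margin of Error"; the real export's headers may differ, so the pattern would need adjusting.

```python
import pandas as pd

# Hypothetical column names and values; the real export's headers may differ.
acs = pd.DataFrame({
    "County": ["A", "B"],
    "Estimate; Median income": [100, 200],
    "Margin of Error; Median income": [5, 8],
})

# Keep every column whose name does not mention a margin of error.
keep = ~acs.columns.str.contains("Margin of Error", case=False)
acs_estimates = acs.loc[:, keep]
print(acs_estimates.columns.tolist())
```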

In [None]:
# Files to load
ACS_data = pd.read_csv("2017_income_mortgage.csv")

ACS_data.head()