## ETL of Census Data

The "Escape the Bay" project was a success, but it generated 11 separate CSV files.

Because each analysis was performed separately, it was difficult to draw correlations between the datasets.

The purpose of this ETL homework is to create a database where all the data can be stored, so that queries can be written across datasets.

Thus the Tasks will be:

### 1) Extract data from 6 CSVs (there is some duplication of information) and import into Pandas

### 2) Transform
  #### A. Eliminate un-needed data and missing data
  #### B. Harmonize the naming of the key cities in the analysis so the tables can be joined more easily
  #### C. Based on the dataset size, determine whether to join the data in Pandas,  or in SQL
  #### D. Output csv files into the SQL_data folder
  
### 3) Load
   
 #### A. Create the Schema for the Escape-The-Bay Database using quickDBD

 #### B.  Create the tables in SQL

 #### C. Upload the transformed CSV files (from the SQL_data folder) into a PostgreSQL database

 #### D. Check the database and write a few sample queries using SQLAlchemy

### Document!
   
References: The original data sources come from Vanessa Oakes, Emily Todd, Stefan Zobrist and Rebecca Mih

## Extract

In [1]:
import csv
from matplotlib.ticker import FuncFormatter
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import requests
import json

In [2]:
# Census Flows Mapper outputs
csv_path = "./Resources/SF_All_OUT.csv"
sf_out_df = pd.read_csv(csv_path)

# Census Quick Facts CSV
# The "counties" CSVs contain information regarding:
# Population demographics, Number of owner occupied housing, Median Value of owner occupied housing, Median Gross rent
# Median Income, Total # of Employer Establishments, Total annual payroll, FIPS code

CA_counties = pd.read_csv("./Resources/CA_counties.csv")
non_CA_counties = pd.read_csv("./Resources/non_CA_counties.csv")
marital_ca_df = pd.read_csv('./Resources/California - Marital 3.csv')
marital_out_df = pd.read_csv('./Resources/Out of State - Marital.csv')

# Census Advanced Fact Finder CSV from the American Community Survey (ACS)
# The "income and mortgage" CSV is a unique dataset that contains the distributions (bins) of the Total Household Income,
# the mortgage values, and the debt to income ratio, by county
ACS_data = pd.read_csv("./Resources/2017_income_mortgage.csv")

# Census API call and CSV creation
# This API call creates the dataset for Median home values, Median Rental costs
base_url = "https://api.census.gov/data/2017/acs/acs1/profile"
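The cell above stops at the base URL. A minimal sketch of how the request could be completed is shown here. The variable codes `DP04_0089E` (median home value) and `DP04_0134E` (median gross rent), the helper names, and the sample payload are all assumptions for illustration and should be checked against the API's variable listing. No network call is made in the sketch itself.

```python
import pandas as pd

BASE_URL = "https://api.census.gov/data/2017/acs/acs1/profile"

# Assumed variable codes; verify them against the API's variables listing.
VARIABLES = {"DP04_0089E": "Median home value ($)",
             "DP04_0134E": "Median gross rent ($)"}


def build_params(state_fips, county_fips, api_key=None):
    """Build the query parameters for one county (hypothetical helper)."""
    params = {
        "get": ",".join(["NAME", *VARIABLES]),
        "for": f"county:{county_fips}",
        "in": f"state:{state_fips}",
    }
    if api_key:
        params["key"] = api_key
    return params


def response_to_frame(payload):
    """Convert list-of-lists JSON (header row first) into a DataFrame."""
    header, *rows = payload
    df = pd.DataFrame(rows, columns=header)
    # Rebuild the 5-digit FIPS code as a string so leading zeros survive.
    df["State/County FIPS"] = df["state"] + df["county"]
    return df.rename(columns=VARIABLES)


# Made-up payload in the assumed response shape; the values are placeholders.
sample_payload = [
    ["NAME", "DP04_0089E", "DP04_0134E", "state", "county"],
    ["Example County, California", "100", "10", "06", "075"],
]
params = build_params("06", "075")
housing_df = response_to_frame(sample_payload)
```

With network access, `requests.get(BASE_URL, params=params).json()` would supply the payload that `response_to_frame` expects.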



# Transform 

## Census Flows Mapper Data

The U.S. Census has a very handy tool called Census Flows Mapper, which automatically determines the outbound and inbound migrants for any given county.

https://flowsmapper.geo.census.gov/

The data in the CSVs came from the output of that website.

Activities
- Keep in mind that this CSV will supply the primary key (the FIPS code) for the SQL database



In [3]:
#  First determine where people migrate to,  from San Francisco.  Look at those moving within CA

#Top 5 in-CA counties
ranked_sf_out_df = sf_out_df.sort_values(by='Total', ascending=False)
sf_to_ca = ranked_sf_out_df.loc[ranked_sf_out_df["State Name"] == "California",:]


#Transform the data -- Only take the top 5 destinations

sftoca = sf_to_ca.iloc[:5,]

sftoca = sftoca.rename(columns={"Total": "# Migrated from SF County (2017)"})

sftoca['County Name'] = sftoca['County Name'].replace(
    {'Alameda County': 'Alameda', 'San Mateo County': 'San Mateo', "Contra Costa County":'Contra Costa',
     "Los Angeles County":'Los Angeles', 'Santa Clara County':'Santa Clara'})


#sftoca = sftoca.set_index(["County Name"])

sftoca.head()


Unnamed: 0,State/County FIPS,State FIPS,County FIPS,County Name,State Name,# Migrated from SF County (2017),Margin of Error (+/-)
19,'06001','06','001',Alameda,California,10791,1127
52,'06081','06','081',San Mateo,California,8995,1054
25,'06013','06','013',Contra Costa,California,4085,631
34,'06037','06','037',Los Angeles,California,3726,547
54,'06085','06','085',Santa Clara,California,3383,447


In [4]:
#  Top destinations moving outside of CA

#Top 5 non-CA counties
sf_not_ca = ranked_sf_out_df.loc[ranked_sf_out_df["State Name"] != "California",:]
sf_not_ca.head(5)

#Transform the data -- only take top 5 destinations
sfnotca = sf_not_ca.iloc[:5,]
sfnotca = sfnotca.rename(columns={"Total": "# Migrated from SF County (2017)"})

# No filtering needed, keep the FIPS code for the SQL database
# sfnotca_summary = sf_not_ca.iloc[:5,3:7]
sfnotca['County Name'] = sfnotca['County Name'].replace(
    {'New York County': 'NY (Manhattan)', 'King County': 'King', "Multnomah County":'Multnomah',
     "Kings County":'Kings (Brooklyn)', 'Cook County':'Cook'})

#sfnotca = sfnotca.set_index(["County Name"])

sfnotca.head()

Unnamed: 0,State/County FIPS,State FIPS,County FIPS,County Name,State Name,# Migrated from SF County (2017),Margin of Error (+/-)
259,'36061','36','061',NY (Manhattan),New York,1419,657
413,'53033','53','033',King,Washington,1293,336
326,'41051','41','051',Multnomah,Oregon,1094,282
255,'36047','36','047',Kings (Brooklyn),New York,887,300
131,'17031','17','031',Cook,Illinois,635,223


In [5]:
# Join the two tables together for a single destinations file

Destinations = pd.merge(sftoca,sfnotca,how='outer')

#Destinations = Destinations.set_index(["State/County FIPs"])

# Keep the FIPs ids as strings, so that they don't lose the 0 at the beginning

# Add in San Francisco County to the table ??

Destinations.to_csv('./SQL_data/Destinations.csv')

Destinations.head(10)

Unnamed: 0,State/County FIPS,State FIPS,County FIPS,County Name,State Name,# Migrated from SF County (2017),Margin of Error (+/-)
0,'06001','06','001',Alameda,California,10791,1127
1,'06081','06','081',San Mateo,California,8995,1054
2,'06013','06','013',Contra Costa,California,4085,631
3,'06037','06','037',Los Angeles,California,3726,547
4,'06085','06','085',Santa Clara,California,3383,447
5,'36061','36','061',NY (Manhattan),New York,1419,657
6,'53033','53','033',King,Washington,1293,336
7,'41051','41','051',Multnomah,Oregon,1094,282
8,'36047','36','047',Kings (Brooklyn),New York,887,300
9,'17031','17','031',Cook,Illinois,635,223
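The note about keeping FIPS ids as strings matters most when the CSV is read back in. The sketch below uses small stand-in tables (two rows each, with names and codes taken from the output above) to show the difference between the default parse and forcing the column to `str`. It also uses `pd.concat`, an alternative to the outer merge for stacking tables that share the same columns. The in-memory buffer is an assumption made to avoid writing to `./SQL_data`.

```python
import io
import pandas as pd

# Stand-ins for the two top-5 tables (two rows each).
in_ca = pd.DataFrame({"State/County FIPS": ["06001", "06081"],
                      "County Name": ["Alameda", "San Mateo"]})
out_ca = pd.DataFrame({"State/County FIPS": ["36061", "53033"],
                       "County Name": ["NY (Manhattan)", "King"]})

# Stack the tables; ignore_index renumbers the rows 0..n-1.
destinations = pd.concat([in_ca, out_ca], ignore_index=True)

# Write to an in-memory buffer so the sketch has no side effects.
buffer = io.StringIO()
destinations.to_csv(buffer, index=False)

# Default parsing infers integers and drops the leading zero.
buffer.seek(0)
as_numbers = pd.read_csv(buffer)

# Forcing the column to str keeps the 5-character code intact.
buffer.seek(0)
as_text = pd.read_csv(buffer, dtype={"State/County FIPS": str})

print(as_numbers["State/County FIPS"].tolist())
print(as_text["State/County FIPS"].tolist())
```

The source CSV wraps each code in apostrophes (`'06001'`), which also forces a text parse, but the apostrophes then have to be stripped before the codes can be matched against other tables.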


# Transform From US Census "QuickFacts Utility"

## Demographics, Age, Income, Median Housing, Median Rents, Commute, Employers


Documentation of the Journey

1. Use the graphical interface to input up to 6 locations (by city, county, state, etc) 
Reference:  https://www.census.gov/quickfacts/fact/table/US/PST045218

** The only data cleaning done in Excel was for the out-of-state data: Travis County, TX (where Austin is located) was added manually to the "non_CA_counties.csv" file.

### Transform Data CleanUp Steps

* Reduce the data size and clean up the naming (for easier reference later on)

* df.drop(columns = ['column name'], inplace = True) - Drop columns which have no important data
* df.dropna() - Drop rows with NaN
* df.reset_index() - Reset the index because a few rows were dropped
* df[:x] - Keep only the first x rows and drop the rest
* df.rename() - Rename the columns with shorter names so the plots look OK
* pd.merge() - Join all the destinations where people move to from San Francisco County, both within CA and outside of CA

Output to SQL_data
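The cleanup steps listed above can be sketched on a small stand-in table. The column names and values below are placeholders, not Census data. The order matters: the annotation columns are entirely NaN, so calling `dropna()` before dropping them would remove every row.

```python
import numpy as np
import pandas as pd

# Toy QuickFacts-style table; the values are placeholders.
raw = pd.DataFrame({
    "Fact": ["Population", "Footnote row", "Median rent"],
    "Fact Note": [np.nan, np.nan, np.nan],
    "County A, State": ["1,000", np.nan, "$1,500"],
    "Value Note for County A, State": [np.nan, np.nan, np.nan],
})

# Drop every annotation column in one call instead of naming each one.
note_cols = [c for c in raw.columns
             if c == "Fact Note" or c.startswith("Value Note")]
clean = raw.drop(columns=note_cols)

# Drop rows with missing values, then renumber the index from 0.
clean = clean.dropna().reset_index(drop=True)

# Shorten the column name for easier reference.
clean = clean.rename(columns={"County A, State": "County A"})
```

Selecting the note columns by prefix is a design choice in this sketch; listing each column explicitly, as the notebook does, is safer when a column that happens to share the prefix must be kept.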

In [6]:
CAcols = list(CA_counties.columns.values)
CAcols 

['Fact',
 'Fact Note',
 'San Francisco County, California',
 'Value Note for San Francisco County, California',
 'Alameda County, California',
 'Value Note for Alameda County, California',
 'San Mateo County, California',
 'Value Note for San Mateo County, California',
 'Contra Costa County, California',
 'Value Note for Contra Costa County, California',
 'Los Angeles County, California',
 'Value Note for Los Angeles County, California',
 'Santa Clara County, California',
 'Value Note for Santa Clara County, California']

In [7]:
# Clean up the raw data 
# Select the columns wanted

#CA_df = CA_counties[['Fact',  'San Francisco County, California','Alameda County, California',
#                    'San Mateo County, California', 'Contra Costa County, California',
#                    'Los Angeles County, California','Santa Clara County, California',]]

# Drop the annotation columns ("Fact Note" and every "Value Note for ..." column)
CA_counties.drop(columns=[
    'Fact Note',
    'Value Note for San Francisco County, California',
    'Value Note for Alameda County, California',
    'Value Note for San Mateo County, California',
    'Value Note for Contra Costa County, California',
    'Value Note for Los Angeles County, California',
    'Value Note for Santa Clara County, California',
], inplace=True)

# For the out-of-state table, also drop the San Francisco and Travis County data columns
non_CA_counties.drop(columns=[
    'Fact Note',
    'San Francisco County, California',
    'Value Note for San Francisco County, California',
    'Value Note for King County, Washington',
    'Value Note for New York County (Manhattan Borough), New York',
    'Value Note for Multnomah County, Oregon',
    'Value Note for Kings County (Brooklyn Borough), New York',
    'Value Note for Cook County, Illinois',
    'Value Note for Travis County, Texas',
    'Travis County, Texas',
], inplace=True)


# Remove the rows which have NaNs; inplace=True is needed so the original DataFrame is modified
CA_counties.dropna(inplace=True)
non_CA_counties.dropna(inplace=True)

# Reset the index to keep everything in order, drop = True means that the original index will be discarded
# Do this because we need to have one DF that shows the row number as a reference (later code)
# Reference:  https://stackoverflow.com/questions/33165734/update-index-after-sorting-data-frame

CA_counties.reset_index(drop=True, inplace=True)
non_CA_counties.reset_index(drop=True, inplace=True)


# Only keep the top 62 rows of data

CA_counties = CA_counties[:62]
non_CA_counties = non_CA_counties[:62]

non_CA_counties

Unnamed: 0,Fact,"New York County (Manhattan Borough), New York","King County, Washington","Multnomah County, Oregon","Kings County (Brooklyn Borough), New York","Cook County, Illinois"
0,"Population estimates, July 1, 2017, (V2017)",1664727,2188649,807555,2648771,5211263
1,"Population estimates base, April 1, 2010, (V2...",1586184,1931281,735169,2504706,5195075
2,"Population, percent change - April 1, 2010 (es...",5.00%,13.30%,9.80%,5.80%,0.30%
3,"Population, Census, April 1, 2010",1585873,1931249,735334,2504700,5194675
4,"Persons under 5 years, percent",4.80%,5.90%,5.60%,7.30%,6.20%
5,"Persons under 18 years, percent",14.40%,20.40%,19.10%,22.90%,22.00%
6,"Persons 65 years and over, percent",16.00%,13.00%,13.00%,13.50%,14.30%
7,"Female persons, percent",52.60%,49.90%,50.50%,52.60%,51.40%
8,"White alone, percent",64.40%,68.00%,79.50%,49.20%,65.60%
9,"Black or African American alone, percent",17.90%,6.80%,6.00%,34.30%,24.00%


In [8]:
CA_counties.head()

Unnamed: 0,Fact,"San Francisco County, California","Alameda County, California","San Mateo County, California","Contra Costa County, California","Los Angeles County, California","Santa Clara County, California"
0,"Population estimates, July 1, 2017, (V2017)",884363,1663190,771410,1147439,10163507,1938153
1,"Population estimates base, April 1, 2010, (V2...",805193,1510261,718500,1049200,9818696,1781671
2,"Population, percent change - April 1, 2010 (es...",9.80%,10.10%,7.40%,9.40%,3.50%,8.80%
3,"Population, Census, April 1, 2010",805235,1510271,718451,1049025,9818605,1781642
4,"Persons under 5 years, percent",4.50%,5.90%,5.70%,5.70%,6.10%,6.10%


In [9]:
cols = list(non_CA_counties.columns.values)
cols

['Fact',
 'New York County (Manhattan Borough), New York',
 'King County, Washington',
 'Multnomah County, Oregon',
 'Kings County (Brooklyn Borough), New York',
 'Cook County, Illinois']

In [10]:
CA_df = CA_counties.rename(columns = { 'San Francisco County, California': 'San Francisco',
                                      'Alameda County, California': 'Alameda',
                                      'San Mateo County, California': 'San Mateo', 
                                      'Contra Costa County, California': 'Contra Costa',
                                      'Los Angeles County, California':'Los Angeles',
                                      'Santa Clara County, California':'Santa Clara',})


non_CA_df = non_CA_counties.rename(columns = {'New York County (Manhattan Borough), New York': 'NY (Manhattan)',
                                             'King County, Washington': 'King', 'Multnomah County, Oregon': 'Multnomah',
                                             'Kings County (Brooklyn Borough), New York': 'Kings (Brooklyn)',
                                             'Cook County, Illinois': 'Cook', })


CA_df

Unnamed: 0,Fact,San Francisco,Alameda,San Mateo,Contra Costa,Los Angeles,Santa Clara
0,"Population estimates, July 1, 2017, (V2017)",884363,1663190,771410,1147439,10163507,1938153
1,"Population estimates base, April 1, 2010, (V2...",805193,1510261,718500,1049200,9818696,1781671
2,"Population, percent change - April 1, 2010 (es...",9.80%,10.10%,7.40%,9.40%,3.50%,8.80%
3,"Population, Census, April 1, 2010",805235,1510271,718451,1049025,9818605,1781642
4,"Persons under 5 years, percent",4.50%,5.90%,5.70%,5.70%,6.10%,6.10%
5,"Persons under 18 years, percent",13.40%,20.70%,20.80%,22.80%,21.90%,22.20%
6,"Persons 65 years and over, percent",15.40%,13.50%,15.80%,15.30%,13.20%,13.10%
7,"Female persons, percent",49.00%,50.80%,50.70%,51.10%,50.70%,49.50%
8,"White alone, percent",53.10%,50.20%,60.60%,65.90%,70.90%,53.80%
9,"Black or African American alone, percent",5.50%,11.30%,2.80%,9.50%,9.00%,2.80%


In [11]:
non_CA_df.head()

Unnamed: 0,Fact,NY (Manhattan),King,Multnomah,Kings (Brooklyn),Cook
0,"Population estimates, July 1, 2017, (V2017)",1664727,2188649,807555,2648771,5211263
1,"Population estimates base, April 1, 2010, (V2...",1586184,1931281,735169,2504706,5195075
2,"Population, percent change - April 1, 2010 (es...",5.00%,13.30%,9.80%,5.80%,0.30%
3,"Population, Census, April 1, 2010",1585873,1931249,735334,2504700,5194675
4,"Persons under 5 years, percent",4.80%,5.90%,5.60%,7.30%,6.20%


## Inspect the DataFrame to see what data to remove

In [12]:
Demo_data = pd.merge(CA_df,non_CA_df, how = "outer" )
#Demo_data2 = Demo_data.set_index(['Fact'])
print(Demo_data)


                                                 Fact San Francisco  \
0        Population estimates, July 1, 2017,  (V2017)       884,363   
1   Population estimates base, April 1, 2010,  (V2...       805,193   
2   Population, percent change - April 1, 2010 (es...         9.80%   
3                   Population, Census, April 1, 2010       805,235   
4                      Persons under 5 years, percent         4.50%   
5                     Persons under 18 years, percent        13.40%   
6                  Persons 65 years and over, percent        15.40%   
7                             Female persons, percent        49.00%   
8                                White alone, percent        53.10%   
9            Black or African American alone, percent         5.50%   
10   American Indian and Alaska Native alone, percent         0.70%   
11                               Asian alone, percent        35.90%   
12  Native Hawaiian and Other Pacific Islander 

In [13]:
#Drop Rows
# Reference https://chrisalbon.com/python/data_wrangling/pandas_dropping_column_and_rows/
# proper usage of reset_index https://stackoverflow.com/questions/40755680/how-to-reset-index-pandas-dataframe-after-dropna-pandas-dataframe
# drop = True re-assigns the same dataframe the values, with a new index

Demo_summary = Demo_data.drop(Demo_data.index[[1,2,7,8,9,10,11,12,13,14,15,16,17,18,19,24,25,26,27,28,
                                               29,30,31,32,33,34,37,38,39,40,41,42,45,46,53,54,55,56,57,58,59,60]])

Demo_summary = Demo_summary.reset_index(drop=True)

Demo_summary.head()




Unnamed: 0,Fact,San Francisco,Alameda,San Mateo,Contra Costa,Los Angeles,Santa Clara,NY (Manhattan),King,Multnomah,Kings (Brooklyn),Cook
0,"Population estimates, July 1, 2017, (V2017)",884363,1663190,771410,1147439,10163507,1938153,1664727,2188649,807555,2648771,5211263
1,"Population, Census, April 1, 2010",805235,1510271,718451,1049025,9818605,1781642,1585873,1931249,735334,2504700,5194675
2,"Persons under 5 years, percent",4.50%,5.90%,5.70%,5.70%,6.10%,6.10%,4.80%,5.90%,5.60%,7.30%,6.20%
3,"Persons under 18 years, percent",13.40%,20.70%,20.80%,22.80%,21.90%,22.20%,14.40%,20.40%,19.10%,22.90%,22.00%
4,"Persons 65 years and over, percent",15.40%,13.50%,15.80%,15.30%,13.20%,13.10%,16.00%,13.00%,13.00%,13.50%,14.30%
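The cell above drops rows by hard-coded position, which breaks silently if the source CSV gains or loses a row. A hedged alternative is to select the rows to keep by their `Fact` label. The sketch below uses a small stand-in frame with placeholder values; only the label strings mirror the notebook, and the names `demo`, `keep_facts`, and `summary` are hypothetical.

```python
import pandas as pd

# Toy frame with placeholder values; only the 'Fact' labels mirror the notebook.
demo = pd.DataFrame({
    "Fact": ["Population, Census, April 1, 2010",
             "Female persons, percent",
             "Persons under 5 years, percent"],
    "County A": ["100", "50%", "5%"],
})

# Facts to keep, listed by name rather than by row position.
keep_facts = ["Population, Census, April 1, 2010",
              "Persons under 5 years, percent"]

summary = demo[demo["Fact"].isin(keep_facts)].reset_index(drop=True)

# Fail loudly if a requested label is missing (for example, because of a typo).
missing = set(keep_facts) - set(summary["Fact"])
assert not missing, f"Facts not found: {missing}"
```

Label-based selection is longer to write than an index list, but it documents which facts survive and does not depend on row order.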


In [14]:
# Use the 'Fact' column as the index
Demographics = Demo_summary.set_index('Fact')

Demographics.head()



Fact,San Francisco,Alameda,San Mateo,Contra Costa,Los Angeles,Santa Clara,NY (Manhattan),King,Multnomah,Kings (Brooklyn),Cook
"Population estimates, July 1, 2017, (V2017)",884363,1663190,771410,1147439,10163507,1938153,1664727,2188649,807555,2648771,5211263
"Population, Census, April 1, 2010",805235,1510271,718451,1049025,9818605,1781642,1585873,1931249,735334,2504700,5194675
"Persons under 5 years, percent",4.50%,5.90%,5.70%,5.70%,6.10%,6.10%,4.80%,5.90%,5.60%,7.30%,6.20%
"Persons under 18 years, percent",13.40%,20.70%,20.80%,22.80%,21.90%,22.20%,14.40%,20.40%,19.10%,22.90%,22.00%
"Persons 65 years and over, percent",15.40%,13.50%,15.80%,15.30%,13.20%,13.10%,16.00%,13.00%,13.00%,13.50%,14.30%


In [15]:
demoT = Demographics.T
demoT
demoT_cols=list(demoT.columns.values)
demoT_cols


['Population estimates, July 1, 2017,  (V2017)',
 'Population, Census, April 1, 2010',
 'Persons under 5 years, percent',
 'Persons under 18 years, percent',
 'Persons 65 years and over, percent',
 'Median value of owner-occupied housing units, 2013-2017',
 'Median selected monthly owner costs -with a mortgage, 2013-2017',
 'Median selected monthly owner costs -without a mortgage, 2013-2017',
 'Median gross rent, 2013-2017',
 'In civilian labor force, total, percent of population age 16 years+, 2013-2017',
 'In civilian labor force, female, percent of population age 16 years+, 2013-2017',
 'Mean travel time to work (minutes), workers age 16 years+, 2013-2017',
 'Median household income (in 2017 dollars), 2013-2017',
 'Total employer establishments, 2016',
 'Total employment, 2016',
 'Total annual payroll, 2016 ($1,000)',
 'Total employment, percent change, 2015-2016',
 'Total nonemployer establishments, 2016',
 'All firms, 2012',
 'FIPS Code']

In [16]:
# Update the column titles to be shorter and easier to work with

demoT_summary = demoT.rename(columns={'Population estimates, July 1, 2017,  (V2017)': 'Population estimate, 2017', 
                                        'Population, Census, April 1, 2010': 'Population, Census, 2010',
                                        'Persons under 5 years, percent': 'Age <5 yrs, %',
                                        'Persons under 18 years, percent': 'Age <18 yrs, %',
                                        'Persons 65 years and over, percent': 'Age 65 yrs+, %',
                                        'Median value of owner-occupied housing units, 2013-2017': 'Home Median Value, $',
                                        'Median selected monthly owner costs -with a mortgage, 2013-2017': 'Med. Monthly Costs with mortgage, $',
                                        'Median selected monthly owner costs -without a mortgage, 2013-2017': 'Monthly Costs, no mortgage, $',
                                        'Median gross rent, 2013-2017': 'Median gross rent, $',
                                        'In civilian labor force, total, percent of population age 16 years+, 2013-2017': 'Employment, %',
                                        'In civilian labor force, female, percent of population age 16 years+, 2013-2017': 'Employment, Females, %',
                                        'Mean travel time to work (minutes), workers age 16 years+, 2013-2017': "Travel time to work, mean",
                                        'Median household income (in 2017 dollars), 2013-2017': 'Median Household income $',
                                        'Total employer establishments, 2016': 'Total # of employers, 2016',
                                        'Total employment, 2016':'Total Employed, 2016',
                                        'Total annual payroll, 2016 ($1,000)':'Annual payroll, 2016 ($1K)',
                                        'Total nonemployer establishments, 2016': 'Total nonemployers',
                                        'All firms, 2012':'Total # of employers, 2012',
                                        'FIPS Code': 'State/County FIPS'
                          
                                            })
demoT_summary

Fact,"Population estimate, 2017","Population, Census, 2010","Age <5 yrs, %","Age <18 yrs, %","Age 65 yrs+, %","Home Median Value, $","Med. Monthly Costs with mortgage, $","Monthly Costs, no mortgage, $","Median gross rent, $","Employment, %","Employment, Females, %","Travel time to work, mean",Median Household income $,"Total # of employers, 2016","Total Employed, 2016","Annual payroll, 2016 ($1K)","Total employment, percent change, 2015-2016",Total nonemployers,"Total # of employers, 2012",State/County FIPS
San Francisco,884363,805235,4.50%,13.40%,15.40%,"$927,400","$3,332",$615,"$1,709",70.30%,66.20%,32.8,"$96,265",34314,627915,60475855,2.70%,99307,116803,"""06075"""
Alameda,1663190,1510271,5.90%,20.70%,13.50%,"$649,100","$2,675",$608,"$1,547",66.60%,60.60%,32.5,"$85,743",39242,662511,45907712,3.20%,143612,150564,"""06001"""
San Mateo,771410,718451,5.70%,20.80%,15.80%,"$917,700","$3,227",$689,"$1,973",68.70%,63.10%,28.2,"$105,667",21199,374251,41060198,-0.70%,71514,75507,"""06081"""
Contra Costa,1147439,1049025,5.70%,22.80%,15.30%,"$522,300","$2,527",$634,"$1,600",64.40%,58.30%,37.1,"$88,456",23591,325864,21468960,3.30%,94711,93083,"""06013"""
Los Angeles,10163507,9818605,6.10%,21.90%,13.20%,"$495,800","$2,336",$556,"$1,322",64.30%,57.80%,30.9,"$61,015",269489,3871716,212488786,-3.40%,1046426,1146701,"""06037"""
Santa Clara,1938153,1781642,6.10%,22.20%,13.10%,"$829,600","$3,081",$709,"$1,955",67.30%,59.90%,28.0,"$106,761",48278,1021748,114930448,2.20%,143480,163130,"""06085"""
NY (Manhattan),1664727,1585873,4.80%,14.40%,16.00%,"$915,300","$3,112",$946,"$1,615",67.20%,62.40%,31.8,"$79,781",104691,2245903,241159256,1.80%,226631,315399,"""36061"""
King,2188649,1931249,5.90%,20.40%,13.00%,"$446,600","$2,271",$732,"$1,379",69.50%,63.40%,29.1,"$83,571",68079,1167201,87675700,3.00%,172297,201404,"""53033"""
Multnomah,807555,735334,5.60%,19.10%,13.00%,"$330,900","$1,799",$616,"$1,094",68.80%,65.00%,26.6,"$60,369",27246,434205,22672822,3.80%,71087,85366,"""41051"""
Kings (Brooklyn),2648771,2504700,7.30%,22.90%,13.50%,"$623,900","$2,723",$843,"$1,314",63.40%,59.00%,42.4,"$52,782",57621,606738,24611877,4.90%,269471,296858,"""36047"""


## Transform to numerical values
Because the raw data in the CSV is formatted with $, %, or commas, Pandas reads every value into the DataFrame as an object (string).

Clean the entire table, so the user can pick whichever data fact they wish to use:

* df.replace - Replace the %, $ and comma characters in the data with an empty string
* df.apply(pd.to_numeric) - Convert the objects in each column into numerics; "apply" runs the conversion on every column
* Use errors = 'coerce' to force conversion to a number. Values that still contain non-numeric characters become NaN and the text is lost; to keep the text, use errors = 'ignore' instead


In [17]:
# Remove non-numerics in the dataframe
cols = demoT_summary.columns

demoT_summary[cols] = demoT_summary[cols].replace({r'\$': '', ',': '', '%': '', '"': ''}, regex=True)


In [18]:
Demographics = demoT_summary.apply(pd.to_numeric, errors='coerce')

Demographics.index.name ='County'

Demographics

# Reference for how to set the index name https://stackoverflow.com/questions/18022845/pandas-index-column-title-or-name

County,"Population estimate, 2017","Population, Census, 2010","Age <5 yrs, %","Age <18 yrs, %","Age 65 yrs+, %","Home Median Value, $","Med. Monthly Costs with mortgage, $","Monthly Costs, no mortgage, $","Median gross rent, $","Employment, %","Employment, Females, %","Travel time to work, mean",Median Household income $,"Total # of employers, 2016","Total Employed, 2016","Annual payroll, 2016 ($1K)","Total employment, percent change, 2015-2016",Total nonemployers,"Total # of employers, 2012",State/County FIPS
San Francisco,884363,805235,4.5,13.4,15.4,927400,3332,615,1709,70.3,66.2,32.8,96265,34314,627915,60475855,2.7,99307,116803,6075
Alameda,1663190,1510271,5.9,20.7,13.5,649100,2675,608,1547,66.6,60.6,32.5,85743,39242,662511,45907712,3.2,143612,150564,6001
San Mateo,771410,718451,5.7,20.8,15.8,917700,3227,689,1973,68.7,63.1,28.2,105667,21199,374251,41060198,-0.7,71514,75507,6081
Contra Costa,1147439,1049025,5.7,22.8,15.3,522300,2527,634,1600,64.4,58.3,37.1,88456,23591,325864,21468960,3.3,94711,93083,6013
Los Angeles,10163507,9818605,6.1,21.9,13.2,495800,2336,556,1322,64.3,57.8,30.9,61015,269489,3871716,212488786,-3.4,1046426,1146701,6037
Santa Clara,1938153,1781642,6.1,22.2,13.1,829600,3081,709,1955,67.3,59.9,28.0,106761,48278,1021748,114930448,2.2,143480,163130,6085
NY (Manhattan),1664727,1585873,4.8,14.4,16.0,915300,3112,946,1615,67.2,62.4,31.8,79781,104691,2245903,241159256,1.8,226631,315399,36061
King,2188649,1931249,5.9,20.4,13.0,446600,2271,732,1379,69.5,63.4,29.1,83571,68079,1167201,87675700,3.0,172297,201404,53033
Multnomah,807555,735334,5.6,19.1,13.0,330900,1799,616,1094,68.8,65.0,26.6,60369,27246,434205,22672822,3.8,71087,85366,41051
Kings (Brooklyn),2648771,2504700,7.3,22.9,13.5,623900,2723,843,1314,63.4,59.0,42.4,52782,57621,606738,24611877,4.9,269471,296858,36047
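The conversion above also turns the FIPS column into integers, so `06075` becomes `6075` and no longer matches the string codes kept in the Destinations table. A minimal sketch of one way around this converts only the value columns and keeps the FIPS code as zero-padded text. The two rows are taken from the output above; the variable names are hypothetical.

```python
import pandas as pd

# Small frame shaped like demoT_summary (two rows, three columns).
df = pd.DataFrame(
    {"Median gross rent, $": ["$1,709", "$1,547"],
     "Employment, %": ["70.30%", "66.60%"],
     "State/County FIPS": ['"06075"', '"06001"']},
    index=["San Francisco", "Alameda"],
)

fips_col = "State/County FIPS"
value_cols = df.columns.drop(fips_col)

# Strip $ , % from the value columns only, then convert them to numbers.
stripped = df[value_cols].replace({r"[$,%]": ""}, regex=True)
numeric = stripped.apply(pd.to_numeric, errors="coerce")

# Keep the FIPS code as text: remove the quotes and pad to five characters.
numeric[fips_col] = df[fips_col].str.strip('"').str.zfill(5)
```

If the column has already been coerced to integers, `astype(str).str.zfill(5)` restores the five-character form.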


In [19]:
Demographics.to_csv('./SQL_data/Demographics.csv')

## Income and Mortgage data from census FactFinder Advanced Search utility
Reference:
Use the US Census FactFinder Advanced Search functionality to get detailed data in the areas of Employment (including income), Housing (including mortgage information), and Population demographics.

The utility is fairly easy to use, but be warned: there is a LOT of data, and the data is often repeated.

Recommendations: Use the filtering and editing functions in the Advanced Search BEFORE creating your CSV file.

1. Select all counties of interest first as a filter.

* Use the graphical interface to input as many locations as you want (by city, county, state, etc.)
* Once you have selected all the key locations, you can save the query, which saves time if you are going to do other analyses later on

2. Edit the output down to the minimal data you need.

* The Census provides the data with calculated error, or percent error. Those columns can be filtered out
* Remove any columns of data you don't need. It's difficult to change column names with very large datasets, so it is better to minimize the number of columns if possible

The Income and Mortgage CSV used in this notebook was filtered and edited on the Census website, with some additional cleaning of the descriptions in Excel.

In [65]:
# Files to load
ACS_data = pd.read_csv("./Resources/2017_income_mortgage.csv")

ACS_data.head()



Unnamed: 0,GEO.id,GEO.display-label,HC02_EST_VC03,HC02_EST_VC04,HC02_EST_VC05,HC02_EST_VC06,HC02_EST_VC07,HC02_EST_VC08,HC02_EST_VC09,HC01_EST_VC10,...,HC02_EST_VC24,HC02_EST_VC25,HC02_EST_VC26,HC02_EST_VC27,HC01_EST_VC28,HC02_EST_VC31,HC02_EST_VC32,HC02_EST_VC33,HC02_EST_VC34,GEO.id2
0,Id,Geography,"Mortgate Value - Less than $50,000","Mortgage value - $50,000 to $99,999","Mortgage Value - $100,000 to $299,999","Mortgage Value- $300,000 to $499,999","Mortgage Value $500,000 to $749,999","Mortgage Value - $750,000 to $999,999","Mortgage Value $1,000,000 or more",mortgage; Estimate; VALUE - Median (dollars),...,"% Household Income - $50,000 to $74,999","% Household Income - $75,000 to $99,999","% Household Income - $100,000 to $149,999","% Household Income - $150,000 or more",Median household income (dollars),% RATIO OF VALUE TO HOUSEHOLD INCOME Less th...,% RATIO OF VALUE TO HOUSEHOLD INCOME 2.0 to 2.9,%RATIO OF VALUE TO HOUSEHOLD INCOME - 3.0 to 3.9,%RATIO OF VALUE TO HOUSEHOLD INCOME - 4.0 or ...,Id2
1,0500000US06001,"Alameda County, California",1.2,0.6,7,20.8,31.3,21.8,17.2,662100,...,10,11.2,23.1,43.7,134673,7.7,15,17.9,59,6001
2,0500000US06013,"Contra Costa County, California",1.3,0.8,14.2,29.8,22.5,14.6,16.8,543400,...,11.9,12.2,22.9,39.2,124265,10.8,17.8,18.6,52.4,6013
3,0500000US06037,"Los Angeles County, California",1.6,0.9,13.1,34.7,25.9,10.7,13.1,498400,...,15.1,14.2,22,29,101782,8.6,14.6,16.4,59.8,6037
4,0500000US06075,"San Francisco County, California",1,0.4,2,5.8,21.8,24.4,44.5,943700,...,8.3,9.5,19.3,51.8,155398,5.8,9.8,13.1,70.9,6075
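Row 0 of the output above is a second header row of descriptions, which means every column is read as text. A hedged sketch of one way to handle this layout reads the description row separately as a lookup and skips it when loading the data. The CSV text below is a placeholder with the same two-header shape, and the names `labels` and `acs` are hypothetical.

```python
import io
import pandas as pd

# Toy CSV with the same two-header layout; values are placeholders.
csv_text = (
    "GEO.id2,GEO.display-label,HC01_EST_VC10\n"
    "Id2,Geography,Median value (dollars)\n"
    "06001,County A,100\n"
    "06075,County B,200\n"
)

# Read only the description row to build a code -> description lookup.
labels = pd.read_csv(io.StringIO(csv_text), nrows=1).iloc[0].to_dict()

# Skip the description row (line index 1) so numeric columns parse as numbers,
# and keep the FIPS id as text to preserve its leading zero.
acs = pd.read_csv(io.StringIO(csv_text), skiprows=[1], dtype={"GEO.id2": str})
```

With the description row skipped, the value columns arrive as numbers and no separate `pd.to_numeric` pass is needed for them.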


In [66]:
ACS_data.columns = ['ID',
                    'County',
                    '% of Mortgages Valued at <$50K',
                    '% of Mortgages Valued at $50-$99K',
                    '% of Mortgages Valued at $100K-$299K',
                    '% of Mortgages Valued at $300K-$499K',
                    '% of Mortgages Valued at $500K-$749K',
                    '% of Mortgages Valued at $750K-$999K',
                    '% of Mortgages Valued at >$1M',
                    'Median Value of Mortgages ($)',
                    '% Household income <$10K',
                    '% Household income $10K-$24K',
                    '% Household income $25K-34K',
                    '% Household income $35K-$49K',
                    '% Household income $50K-$74K',
                    '% Household income $75K-$99K',
                    '% Household income $100K-$150K',
                    '% Household income >$150K',
                    '2017 Household Median Income ($)',
                    'Ratio of Mortgage Value to Income, % <2',
                    'Ratio of Mortgage Value to Income, % 2-2.9',
                    'Ratio of Mortgage Value to Income %, 3-3.9',
                    'Ratio of Mortgage Value to Income, % > 4.0',
                    'ID2']
ACS_data.drop(columns = ['ID'], inplace=True)  
ACS_data.set_index('County', inplace=True)                 

ACS_data.head()

County,% of Mortgages Valued at <$50K,% of Mortgages Valued at $50-$99K,% of Mortgages Valued at $100K-$299K,% of Mortgages Valued at $300K-$499K,% of Mortgages Valued at $500K-$749K,% of Mortgages Valued at $750K-$999K,% of Mortgages Valued at >$1M,Median Value of Mortgages ($),% Household income <$10K,% Household income $10K-$24K,...,% Household income $50K-$74K,% Household income $75K-$99K,% Household income $100K-$150K,% Household income >$150K,2017 Household Median Income ($),"Ratio of Mortgage Value to Income, % <2","Ratio of Mortgage Value to Income, % 2-2.9","Ratio of Mortgage Value to Income %, 3-3.9","Ratio of Mortgage Value to Income, % > 4.0",ID2
Geography,"Mortgate Value - Less than $50,000","Mortgage value - $50,000 to $99,999","Mortgage Value - $100,000 to $299,999","Mortgage Value- $300,000 to $499,999","Mortgage Value $500,000 to $749,999","Mortgage Value - $750,000 to $999,999","Mortgage Value $1,000,000 or more",mortgage; Estimate; VALUE - Median (dollars),"Percent HOUSEHOLD INCOME Less than $10,000","%Household Income $10,000 to $24,999",...,"% Household Income - $50,000 to $74,999","% Household Income - $75,000 to $99,999","% Household Income - $100,000 to $149,999","% Household Income - $150,000 or more",Median household income (dollars),% RATIO OF VALUE TO HOUSEHOLD INCOME Less th...,% RATIO OF VALUE TO HOUSEHOLD INCOME 2.0 to 2.9,%RATIO OF VALUE TO HOUSEHOLD INCOME - 3.0 to 3.9,%RATIO OF VALUE TO HOUSEHOLD INCOME - 4.0 or ...,Id2
"Alameda County, California",1.2,0.6,7,20.8,31.3,21.8,17.2,662100,1.3,3.1,...,10,11.2,23.1,43.7,134673,7.7,15,17.9,59,6001
"Contra Costa County, California",1.3,0.8,14.2,29.8,22.5,14.6,16.8,543400,1.4,3.4,...,11.9,12.2,22.9,39.2,124265,10.8,17.8,18.6,52.4,6013
"Los Angeles County, California",1.6,0.9,13.1,34.7,25.9,10.7,13.1,498400,2,5.1,...,15.1,14.2,22,29,101782,8.6,14.6,16.4,59.8,6037
"San Francisco County, California",1,0.4,2,5.8,21.8,24.4,44.5,943700,1.4,3.2,...,8.3,9.5,19.3,51.8,155398,5.8,9.8,13.1,70.9,6075


In [67]:
ACS_dataT = ACS_data.T

# Optional, currently disabled: reset the index so a later step can refer to rows by number.
# Passing drop=True to reset_index would discard the original index instead of keeping it as a column.
# Reference:  https://stackoverflow.com/questions/33165734/update-index-after-sorting-data-frame

#ACS_dataT.reset_index(inplace=True)

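# Remove the rows that contain NaNs (in place)
ACS_dataT.dropna(inplace=True)

# Keep only the first 25 rows; .copy() avoids a possible SettingWithCopyWarning on the drop below
ACS_dataT = ACS_dataT.iloc[:25].copy()

# The 'Geography' column holds the original Census header text and is no longer needed
ACS_dataT.drop(columns=['Geography'], inplace=True)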

ACS_dataT.head()

County,"Alameda County, California","Contra Costa County, California","Los Angeles County, California","San Francisco County, California","San Mateo County, California","Santa Clara County, California","Cook County, Illinois","Kings County, New York","New York County, New York","Multnomah County, Oregon","Travis County, Texas","King County, Washington"
% of Mortgages Valued at <$50K,1.2,1.3,1.6,1.0,1.0,1.1,2.9,1.7,1.0,1.9,2.2,1.5
% of Mortgages Valued at $50-$99K,0.6,0.8,0.9,0.4,0.5,0.6,8.6,1.0,0.8,0.8,3.2,0.7
% of Mortgages Valued at $100K-$299K,7.0,14.2,13.1,2.0,2.0,3.1,53.1,8.5,4.2,39.2,48.9,21.5
% of Mortgages Valued at $300K-$499K,20.8,29.8,34.7,5.8,6.7,10.9,21.8,23.5,11.7,36.2,26.2,33.5
% of Mortgages Valued at $500K-$749K,31.3,22.5,25.9,21.8,22.8,26.6,7.8,29.2,20.5,14.9,11.9,23.7


In [68]:
# Drop Travis County, shorten the county names, and look at the DataFrame

ACS_cleanup = ACS_dataT.drop(columns=['Travis County, Texas'])
ACS_cleanup = ACS_cleanup.rename(columns={
    "San Francisco County, California": "San Francisco",
    "Alameda County, California": "Alameda",
    "San Mateo County, California": "San Mateo",
    "Contra Costa County, California": "Contra Costa",
    "Los Angeles County, California": "Los Angeles",
    "Santa Clara County, California": "Santa Clara",
    "New York County, New York": "NY (Manhattan)",
    "King County, Washington": "King",
    "Multnomah County, Oregon": "Multnomah",
    "Kings County, New York": "Kings (Brooklyn)",
    "Cook County, Illinois": "Cook",
})


ACS_cleanup.head()

County,Alameda,Contra Costa,Los Angeles,San Francisco,San Mateo,Santa Clara,Cook,Kings (Brooklyn),NY (Manhattan),Multnomah,King
% of Mortgages Valued at <$50K,1.2,1.3,1.6,1.0,1.0,1.1,2.9,1.7,1.0,1.9,1.5
% of Mortgages Valued at $50-$99K,0.6,0.8,0.9,0.4,0.5,0.6,8.6,1.0,0.8,0.8,0.7
% of Mortgages Valued at $100K-$299K,7.0,14.2,13.1,2.0,2.0,3.1,53.1,8.5,4.2,39.2,21.5
% of Mortgages Valued at $300K-$499K,20.8,29.8,34.7,5.8,6.7,10.9,21.8,23.5,11.7,36.2,33.5
% of Mortgages Valued at $500K-$749K,31.3,22.5,25.9,21.8,22.8,26.6,7.8,29.2,20.5,14.9,23.7


In [69]:
Financial = ACS_cleanup.T
Financial.head()

County,% of Mortgages Valued at <$50K,% of Mortgages Valued at $50-$99K,% of Mortgages Valued at $100K-$299K,% of Mortgages Valued at $300K-$499K,% of Mortgages Valued at $500K-$749K,% of Mortgages Valued at $750K-$999K,% of Mortgages Valued at >$1M,Median Value of Mortgages ($),% Household income <$10K,% Household income $10K-$24K,...,% Household income $50K-$74K,% Household income $75K-$99K,% Household income $100K-$150K,% Household income >$150K,2017 Household Median Income ($),"Ratio of Mortgage Value to Income, % <2","Ratio of Mortgage Value to Income, % 2-2.9","Ratio of Mortgage Value to Income %, 3-3.9","Ratio of Mortgage Value to Income, % > 4.0",ID2
Alameda,1.2,0.6,7.0,20.8,31.3,21.8,17.2,662100,1.3,3.1,...,10.0,11.2,23.1,43.7,134673,7.7,15.0,17.9,59.0,6001
Contra Costa,1.3,0.8,14.2,29.8,22.5,14.6,16.8,543400,1.4,3.4,...,11.9,12.2,22.9,39.2,124265,10.8,17.8,18.6,52.4,6013
Los Angeles,1.6,0.9,13.1,34.7,25.9,10.7,13.1,498400,2.0,5.1,...,15.1,14.2,22.0,29.0,101782,8.6,14.6,16.4,59.8,6037
San Francisco,1.0,0.4,2.0,5.8,21.8,24.4,44.5,943700,1.4,3.2,...,8.3,9.5,19.3,51.8,155398,5.8,9.8,13.1,70.9,6075
San Mateo,1.0,0.5,2.0,6.7,22.8,23.5,43.4,930200,1.2,2.9,...,8.5,9.8,20.4,51.1,153124,4.9,8.4,13.3,73.0,6081


In [70]:
Financial.to_csv('./SQL_data/Financial.csv')
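
Cells 67 to 69 transpose the table so that counties can be dropped and renamed as columns, then transpose it back. As a possible alternative, the same result can usually be reached by working on the row index directly with `drop(index=...)` and `rename(index=...)`. The sketch below is not the notebook's method; it uses a small stand-in frame (`demo`, with three counties and two of the real columns, values copied from the output above) rather than the full `ACS_data`.

```python
import pandas as pd

# Small stand-in for ACS_data: counties as the row index, two of the real columns.
demo = pd.DataFrame(
    {
        "% of Mortgages Valued at <$50K": [1.2, 1.0, 2.2],
        "% of Mortgages Valued at $50-$99K": [0.6, 0.4, 3.2],
    },
    index=pd.Index(
        [
            "Alameda County, California",
            "San Francisco County, California",
            "Travis County, Texas",
        ],
        name="County",
    ),
)

# Short names, mirroring part of the rename mapping used in the notebook.
short_names = {
    "Alameda County, California": "Alameda",
    "San Francisco County, California": "San Francisco",
}

# Drop the unwanted county and rename the rest without transposing.
financial_demo = demo.drop(index=["Travis County, Texas"]).rename(index=short_names)

print(financial_demo)
```

A side benefit of skipping the transposes is that numeric columns keep their dtype; transposing a frame whose columns have different dtypes generally produces `object` columns.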

## Using US Census API to extract Home Ownership and Home Rental data


In [71]:
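# Base URL for the 2017 ACS 1-year Data Profile tables
base_url = "https://api.census.gov/data/2017/acs/acs1/profile"

# Lookup lists for the in-CA counties: names, state FIPS codes, county FIPS codes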

ca_cty_name = ["San Francisco","Alameda","San Mateo","Contra Costa","Los Angeles","Santa Clara"]
ca_st_fips = ["06","06","06","06","06","06"]
ca_cty_fips = ["075","001","081","013","037","085"]


in_ca_dict = {
    "County Name": ca_cty_name,
    "State_FIPS": ca_st_fips,
    "County_FIPS": ca_cty_fips
}

in_ca_df = pd.DataFrame(in_ca_dict)

#dictionary for non-CA counties

nonca_cty_name = ['NY (Manhattan)',"King","Multnomah","Kings (Brooklyn)","Cook"]
nonca_st_fips = ["36","53","41","36","17"]
nonca_cty_fips = ["061","033","051","047","031"]

non_ca_dict = {
    "County Name": nonca_cty_name,
    "State_FIPS": nonca_st_fips,
    "County_FIPS": nonca_cty_fips
}

non_ca_df = pd.DataFrame(non_ca_dict)
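
The `ID2` column in the Financial table appears to be the five-digit county FIPS code stored as an integer (for example 6001 for Alameda, which is state `06` plus county `001` with the leading zero lost). Assuming that reading is correct, a join key could be derived from the two FIPS columns as sketched below. The `lookup` frame and its `FIPS` and `ID2` columns are illustrative additions, not part of the original notebook.

```python
import pandas as pd

# Same structure as in_ca_df, shortened to two counties.
lookup = pd.DataFrame(
    {
        "County Name": ["San Francisco", "Alameda"],
        "State_FIPS": ["06", "06"],
        "County_FIPS": ["075", "001"],
    }
)

# Concatenate the zero-padded strings to get the 5-character FIPS code.
lookup["FIPS"] = lookup["State_FIPS"] + lookup["County_FIPS"]

# Integer form, which drops the leading zero and should line up with ID2.
lookup["ID2"] = lookup["FIPS"].astype(int)

print(lookup)
```

Keeping the FIPS parts as strings preserves the leading zeros that the API query needs, while the integer form is only used for matching.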



In [None]:
# Collect median home values by county and add them to the data frames

ca_med_home_val = []
med_home_var = "DP04_0089E"

for county_id, state_id in zip(ca_cty_fips, ca_st_fips):
    med_home_val = requests.get(f"{base_url}?get={med_home_var}&for=county:{county_id}&in=state:{state_id}").json()
    # Row 0 of the response is the header; the requested value is the first item of row 1
    ca_med_home_val.append(int(med_home_val[1][0]))
    
in_ca_df["Med_Home_Value"] = ca_med_home_val
in_ca_df.to_csv('ca_home_value.csv')
print(in_ca_df)

non_ca_med_home_val = []
med_home_var = "DP04_0089E"

for county_id, state_id in zip(nonca_cty_fips, nonca_st_fips):
    med_home_val = requests.get(f"{base_url}?get={med_home_var}&for=county:{county_id}&in=state:{state_id}").json()
    non_ca_med_home_val.append(int(med_home_val[1][0]))
    
non_ca_df["Med_Home_Value"] = non_ca_med_home_val
non_ca_df.to_csv('nonca_home_value.csv')
print(non_ca_df)



In [None]:
#follow the same process for median gross rents

ca_med_rent = []
med_rent_var = "DP04_0134E"
    
for county_id, state_id in zip(ca_cty_fips, ca_st_fips):
    med_rent = requests.get(f"{base_url}?get={med_rent_var}&for=county:{county_id}&in=state:{state_id}").json()
    ca_med_rent.append(int(med_rent[1][0]))
    
in_ca_df["Med_Rent"] = ca_med_rent

in_ca_df.to_csv('ca_rents.csv')
print(in_ca_df)

non_ca_med_rent = []
med_rent_var = "DP04_0134E"

for county_id, state_id in zip(nonca_cty_fips, nonca_st_fips):
    med_rent = requests.get(f"{base_url}?get={med_rent_var}&for=county:{county_id}&in=state:{state_id}").json()
    non_ca_med_rent.append(int(med_rent[1][0]))
    
non_ca_df["Med_Rent"] = non_ca_med_rent
non_ca_df.to_csv('nonca_rents.csv')
print(non_ca_df)

In [None]:
# Follow the same process for the homeownership rate (a percentage, so values are parsed as float)

ca_own_rate = []
home_own_var = "DP04_0046PE"
    
for county_id, state_id in zip(ca_cty_fips, ca_st_fips):
    own_rate = requests.get(f"{base_url}?get={home_own_var}&for=county:{county_id}&in=state:{state_id}").json()
    ca_own_rate.append(float(own_rate[1][0]))
    
in_ca_df["Home Own Rate"] = ca_own_rate
in_ca_df.to_csv('ca_homeowner_rates.csv')
print(in_ca_df)

nonca_own_rate = []
home_own_var = "DP04_0046PE"
    
for county_id, state_id in zip(nonca_cty_fips, nonca_st_fips):
    own_rate = requests.get(f"{base_url}?get={home_own_var}&for=county:{county_id}&in=state:{state_id}").json()
    nonca_own_rate.append(float(own_rate[1][0]))
    
non_ca_df["Home Own Rate"] = nonca_own_rate
non_ca_df.to_csv('nonca_homeowner_rates.csv')
print(non_ca_df)
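
The three cells above repeat the same request loop with a different variable code each time. One way to reduce the duplication is a small helper, sketched here under a few assumptions: the names `build_census_url`, `parse_single_value`, `fetch_variable` and `fake_fetch` are hypothetical, and the response is assumed to be a JSON list whose first row is the header and whose second row holds the values, which is what the `[1][0]` indexing in the notebook implies. The `fetch` argument is injectable so the parsing can be checked without a network call.

```python
BASE_URL = "https://api.census.gov/data/2017/acs/acs1/profile"


def build_census_url(variable, county_fips, state_fips, base_url=BASE_URL):
    """Build the query URL for one variable in one county."""
    return f"{base_url}?get={variable}&for=county:{county_fips}&in=state:{state_fips}"


def parse_single_value(payload, cast=int):
    """Return the first value of the data row (row 0 is assumed to be the header)."""
    return cast(payload[1][0])


def _http_get_json(url):
    """Default fetcher: GET the URL and decode the JSON body, raising on HTTP errors."""
    # Imported here so the pure helpers above work without the dependency installed.
    import requests

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()


def fetch_variable(variable, county_fips_list, state_fips_list, cast=int, fetch=_http_get_json):
    """Collect one value per county, in the same order as the FIPS lists."""
    values = []
    for county_id, state_id in zip(county_fips_list, state_fips_list):
        payload = fetch(build_census_url(variable, county_id, state_id))
        values.append(parse_single_value(payload, cast))
    return values


# Offline demonstration with a stand-in fetcher. The number is a placeholder,
# not a real Census figure.
def fake_fetch(url):
    return [["DP04_0089E", "state", "county"], ["100", "06", "075"]]


demo_values = fetch_variable("DP04_0089E", ["075"], ["06"], fetch=fake_fetch)
print(demo_values)
```

With such a helper, each cell would reduce to a single assignment such as `in_ca_df["Med_Home_Value"] = fetch_variable(med_home_var, ca_cty_fips, ca_st_fips)`, passing `cast=float` for the homeownership rate.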