# Master Table Constructor

The code in this section creates the data for the San Francisco Planning Department's Neighborhood Profiles Interactive Tool (SFNP). SFNP provides data and information about communities in the city, ranging from socio-economic profiles derived from the American Community Survey (ACS) to lists of community organizations and planning projects in different areas of San Francisco. This notebook creates a master data table that contains every data point required by SFNP. The code is based on methods created by Michael Webster, Jason Sherba, and others. Run the notebook to:

- Download ACS data using the Census API
- Calculate socio-economic summary data by geography (Analysis Neighborhood, Census Tract) and by race/ethnicity group from the latest ACS 5-year estimates and some past surveys.
- Integrate non-ACS data on community organizations and planning efforts with the ACS summary data.
- Export the final data as a CSV file.

## Import packages

In [None]:
import numpy as np
np.__version__
np.__path__
import sys
sys.version_info

In [None]:
import requests, json, os
import pandas as pd
import numpy as np
import geopandas as gpd
import sodapy
from sodapy import Socrata
from collections import defaultdict
from collections import OrderedDict
from arcgis.gis import GIS

# Part 1. ACS 5-year data

The code in this section creates the socio-economic profile data for Analysis Neighborhoods and census tracts in SF. The data is derived from the [American Community Survey](https://www.census.gov/programs-surveys/acs) 5-year data and consists of four groups:

- Data of Total Population (current year)
    - Language Spoken Data (Detailed)
- Data of Race/Ethnicity Groups (current year) 

- Data of Total Population (10 years prior to current year)
- Data of Race/Ethnicity Groups (10 years prior to current year)

## Set analysis year

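Set the three years below: `year` is the current ACS 5-year vintage, `year_past` is the vintage 10 years earlier, and `year_language` is the latest vintage for which the detailed language-spoken data is available.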

In [None]:
year = 2020  # the current year
year_past = 2010  # 10 years prior to the current year
year_language = 2015  # the latest year for which the detailed language-spoken data is available

## Retrieve data from Census API
All socio-economic data comes from the Census ACS 5-year estimates and is available at the tract level through the Census API. API documentation and data for the 2020 ACS and previous years are available [here](https://www.census.gov/data/developers/data-sets/acs-5year.html).
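
The API answers with a JSON array of rows whose first row holds the column names, which is how `make_census_api_call` further down consumes it. The following is a minimal offline sketch of that conversion; the payload and its values are invented for illustration and are not real ACS estimates.

```python
import pandas as pd

# Hypothetical payload shaped like a Census API response: header row first.
# The estimates are made up for illustration.
sample_payload = [
    ["B01001_001E", "state", "county", "tract"],
    ["4000", "06", "075", "010100"],
    ["2500", "06", "075", "010200"],
]

headers, *rows = sample_payload
sample_df = pd.DataFrame(rows, columns=headers)

# Estimates arrive as strings; geography codes stay strings to keep leading zeros.
value_cols = [c for c in sample_df.columns if c not in ("state", "county", "tract")]
sample_df[value_cols] = sample_df[value_cols].apply(pd.to_numeric)

print(sample_df.dtypes)
```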

### Census Attribute IDs
The Census API returns ACS attribute values for the attribute IDs provided. The list of attribute IDs needed for calculating the socio-economic profile data is compiled below from IDs stored in a series of CSV files called attribute lookup tables. The pairs of data groups and lookup tables are:

- Data of Total Population (current year): [attribute_lookup.csv](https://github.com/jsherba/socio-economic-profiles/blob/main/lookup_tables/attribute_lookup.csv)
    - Language Spoken Data (Detailed): [language_attribute_lookup.csv](https://github.com/jsherba/socio-economic-profiles/blob/main/lookup_tables/language_attribute_lookup.csv) 
- Data of Race/Ethnicity Groups (current year): [race_attribute_lookup.csv](https://github.com/jsherba/socio-economic-profiles/blob/main/lookup_tables/language_attribute_lookup.csv) 
- Data of Total Population (10 years prior to current year): [attribute_lookup_past.csv](https://github.com/jsherba/socio-economic-profiles/blob/main/lookup_tables/language_attribute_lookup.csv) 
- Data of Race/Ethnicity Groups (10 years prior to current year): [race_attribute_lookup_past.csv](https://github.com/jsherba/socio-economic-profiles/blob/main/lookup_tables/language_attribute_lookup.csv) 

For a full list of ACS attribute IDs and their meanings visit the API docs [here](https://api.census.gov/data/2019/acs/acs5/variables.html).
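
As a rough sketch of how a lookup table column turns into a list of API variable names, the snippet below splits comma-separated cells, appends the estimate suffix `E`, and removes duplicates. The helper name `expand_attribute_ids` and the sample cells are hypothetical, and the sketch assumes the cells hold IDs without the suffix. De-duplicating with `OrderedDict.fromkeys` keeps the first-seen order, which a plain `set` does not guarantee.

```python
from collections import OrderedDict


def expand_attribute_ids(raw_cells, suffix="E"):
    """Split comma-separated cells, append the suffix, drop duplicates in order."""
    expanded = []
    for cell in raw_cells:
        expanded.extend(part.strip() + suffix for part in cell.split(","))
    return list(OrderedDict.fromkeys(expanded))


# Hypothetical cells, shaped like the 'attribute_id' column of a lookup table.
example_cells = ["B01001_026, B01001_001", "B01001_001"]
print(expand_attribute_ids(example_cells))
# ['B01001_026E', 'B01001_001E']
```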

In [None]:
# Total Population
# Create list of attribute IDs from the attribute lookup table (attribute_lookup_sh.csv)

attribute_lookup_df = pd.read_csv(r'./lookup_tables/attribute_lookup_sh.csv', dtype=str)

attribute_ids_extracted = attribute_lookup_df['attribute_id'].tolist()
attribute_ids = []
for attribute_id in attribute_ids_extracted:
    attribute_ids.extend(attribute_id.split(", "))
attribute_ids = list(set([x+"E" for x in attribute_ids]))
print(len(attribute_ids))
attribute_ids[:10]

In [None]:
# Language Spoken (Detailed)
# Create list of attribute IDs from the language lookup table (language_attribute_lookup_sh.csv)

language_attribute_lookup_df = pd.read_csv(r'./lookup_tables/language_attribute_lookup_sh.csv', dtype=str)

language_attribute_ids_extracted = language_attribute_lookup_df['attribute_id'].tolist()
language_attribute_ids = []
for language_attribute_id in language_attribute_ids_extracted:
    language_attribute_ids.extend(language_attribute_id.split(", "))
language_attribute_ids = list(set([x+"E" for x in language_attribute_ids]))
print(len(language_attribute_ids))
language_attribute_ids[:10]

In [None]:
# Race/Ethnicity Groups
# Create list of attribute IDs from the race/ethnicity lookup table (race_attribute_lookup_sh.csv)

race_attribute_lookup_df = pd.read_csv(r'./lookup_tables/race_attribute_lookup_sh.csv', dtype=str)

race_attribute_ids_extracted = race_attribute_lookup_df['attribute_id'].tolist()
race_attribute_ids = []
for race_attribute_id in race_attribute_ids_extracted:
    race_attribute_ids.extend(race_attribute_id.split(", "))
race_attribute_ids = list(OrderedDict.fromkeys(race_attribute_ids))
print(len(race_attribute_ids))
race_attribute_ids[:10]

### Build Census API URL and Make Query
The code below builds the URL for the Census API call to get relevant ACS attribute data at the tract level for San Francisco County. The Census API accepts up to 50 attributes at a time, so the attribute list is first split into sublists of 45 attribute IDs. An API call is made for each sublist and the results are merged into a single dataframe. The calls use the following parameters:
- Tract code is '*' to collect all tracts
- State code is '06' for CA
- County code is '075' for San Francisco County
- Attributes are defined by the attribute ID list, which includes all relevant attributes for the socio-economic data calculations
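
The split-and-merge pattern described above can be sketched without touching the network. In the snippet below, `chunk`, `fetch_in_chunks`, and `fake_fetch` are hypothetical names: `fake_fetch` stands in for a real API call and returns invented values for two made-up tracts, so only the chunking and merging logic is being illustrated.

```python
import pandas as pd


def chunk(items, size=45):
    """Yield consecutive sublists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_in_chunks(attribute_ids, fetch, size=45):
    """Call `fetch` once per chunk and left-merge the results on 'tract'."""
    merged = None
    for ids in chunk(attribute_ids, size):
        part = fetch(ids)
        if merged is None:
            merged = part
        else:
            part = part.drop(columns=["state", "county"])
            merged = merged.merge(part, on="tract", how="left")
    return merged


def fake_fetch(ids):
    """Stand-in for an API call: two invented tracts, one column per ID."""
    data = {
        "state": ["06", "06"],
        "county": ["075", "075"],
        "tract": ["010100", "010200"],
    }
    data.update({attr: [1, 2] for attr in ids})
    return pd.DataFrame(data)


ids = ["A{:03d}E".format(i) for i in range(100)]  # 100 placeholder IDs
result = fetch_in_chunks(ids, fake_fetch)
print(result.shape)  # (2, 103): 3 geography columns plus 100 attributes
```

Passing the fetch function in as an argument keeps the merge logic testable offline; in the notebook the equivalent step is a call to `make_census_api_call` on the URL from `build_census_url`.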

#### Set up functions for API call

In [None]:
# function builds the api URL from tract_code, state_code, county_code, and attribute ids. 
def build_census_url(tract_code, state_code, county_code, attribute_ids, year):
    attributes = ','.join(attribute_ids)
    census_url = r'https://api.census.gov/data/{}/acs/acs5?get={}&for=tract:{}&in=state:{}&in=county:{}'\
                .format(year, attributes, tract_code, state_code, county_code)
    return census_url
    

In [None]:
# function makes a single api call and collects results in a pandas dataframe
def make_census_api_call(census_url):
    # make API call to Census
    resp = requests.get(census_url)
    # raise an HTTPError if the request failed (4xx or 5xx response)
    resp.raise_for_status()
       
    # retrieve data as json and convert to Pandas Dataframe
    data = resp.json()
    headers = data.pop(0)
    df = pd.DataFrame(data, columns=headers)

    # convert values that are not state, county, or tract to numeric type
    cols = [i for i in df.columns if i not in ["state", "county", "tract"]]
    for col in cols:
        df[col] = pd.to_numeric(df[col])
        
    return df

#### Set geo variables for api call

In [None]:
tract_code = "*"
state_code = "06"
county_code = "075"

#### Compile data: Total Population

In [None]:
# split attributes into groups of 45, run a census query for each, merge outputs into a single df
split_attribute_ids = [attribute_ids[i:i+45] for i in range(0, len(attribute_ids), 45)]
df=None
first = True
for ids in split_attribute_ids:
    census_url = build_census_url(tract_code, state_code, county_code, ids, year)
    returned_df = make_census_api_call(census_url)
    if first:
        df = returned_df
        first = False
    else:
        returned_df = returned_df.drop(columns=['state', 'county'])
        df = pd.merge(df, returned_df, on='tract', how='left')

df.head()

#### Compile data: Language Spoken

In [None]:
# language: run a census query for each group of IDs and merge the outputs into the existing df
split_language_attribute_ids = [language_attribute_ids[i:i+45] for i in range(0, len(language_attribute_ids), 45)]

for ids in split_language_attribute_ids:
    census_url = build_census_url(tract_code, state_code, county_code, ids, year_language)
    returned_df = make_census_api_call(census_url)
    # df already holds the total population data, so every result is merged into it
    returned_df = returned_df.drop(columns=['state', 'county'])
    df = pd.merge(df, returned_df, on='tract', how='left')

df.head()

#### Compile data: Race/Ethnicity Groups

In [None]:
# race/ethnicity: run a census query for each group of IDs and merge the outputs into the existing df
split_race_attribute_ids = [race_attribute_ids[i:i+45] for i in range(0, len(race_attribute_ids), 45)]

for ids in split_race_attribute_ids:
    census_url = build_census_url(tract_code, state_code, county_code, ids, year)
    returned_df = make_census_api_call(census_url)
    # df already holds the earlier results, so every result is merged into it
    returned_df = returned_df.drop(columns=['state', 'county'])
    df = pd.merge(df, returned_df, on='tract', how='left')

df.head()

In [None]:
df['B15002_010E']

## Prepare Lookup Dictionaries and Helper Functions

### Tract/Neighborhood Lookup
The lookup dictionaries created below relate each geography (tract, neighborhood, supervisor district) to its tracts. Each dictionary is used to subset the census dataframe so that calculations can be run on each set of tracts. The lookup dictionaries are created from [geo_lookup.csv]() in the repository.
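
As a small illustration of the structure these dictionaries take, the sketch below groups tract IDs by neighborhood with a `defaultdict` and adds a citywide entry. The neighborhood names and tract IDs are placeholders, not rows from the actual lookup file.

```python
from collections import defaultdict

# Hypothetical (neighborhood, tract) pairs standing in for lookup table rows.
pairs = [("Mission", "020100"), ("Mission", "020200"), ("Sunset", "030100")]

tract_lookup = defaultdict(list)
for neighborhood, tract in pairs:
    tract_lookup[neighborhood].append(tract)

# A citywide entry lets the same calculations run on every tract at once.
tract_lookup["sf"] = sorted({tract for _, tract in pairs})

print(dict(tract_lookup))
```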

In [None]:
# import geo_lookup csv
geo_lookup_df = pd.read_csv(r'./lookup_tables/geo_lookup.csv', dtype=str)

tract_tr_lookup = defaultdict(list)
tract_nb_lookup = defaultdict(list)
tract_sd_lookup = defaultdict(list)
all_tracts = list(set(df['tract'].tolist()))
# create tract lookup dictionary for tracts
for i in all_tracts:
    tract_tr_lookup[i].append(i)
tract_tr_lookup["sf"] = all_tracts
# create tract lookup dictionary for neighborhoods
for i, j in zip(geo_lookup_df['neighborhood'], geo_lookup_df['tractid']):
    tract_nb_lookup[i].append(j)
tract_nb_lookup["sf"] = all_tracts
# create tract lookup dictionary for supervisor districts
for i, j in zip(geo_lookup_df['supervisor_district'], geo_lookup_df['tractid']):
    tract_sd_lookup[i].append(j)
tract_sd_lookup["sf"] = all_tracts

# preview the first four neighborhood entries
first_4 = list(tract_nb_lookup.items())[:4]
first_4


### Calculating Medians
To calculate the median value of an aggregated geography, you cannot average the medians of its component geographies. Instead, a statistical approximation of the median must be calculated from range tables.

Range variables in the ACS have a unique ID like any other Census variable. Each represents the count of a variable within a given range, e.g. the number of households with household incomes between $45,000 and $50,000.

Range variable IDs and range information are stored in the [median_ranges.csv]() and [median_ranges_race.csv]() files in the repository. These range variables and ranges are needed for calculating the median at the neighborhood level.

The function below calculates a median from range data. The method follows the published guidance for [calculating a median](https://www.dof.ca.gov/Forecasting/Demographics/Census_Data_Center_Network/documents/How_to_Recalculate_a_Median.pdf).
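
To make the arithmetic concrete, here is a small worked sketch of the interpolation idea: find the range that contains the midpoint household, then move into that range in proportion to how many of its households are needed to reach the midpoint. The function name `interpolated_median` and the three brackets are invented for illustration, and this standalone version is a simplification rather than the notebook's `calc_median`.

```python
def interpolated_median(brackets):
    """Approximate a median from (range_start, range_end, count) tuples.

    Brackets must be sorted by range_start. Returns 0 when there are no counts.
    """
    total = sum(count for _, _, count in brackets)
    if total == 0:
        return 0
    midpoint = total / 2

    cumulative = 0
    for start, end, count in brackets:
        if cumulative + count >= midpoint:
            # share of this range's households needed to reach the midpoint
            share = (midpoint - cumulative) / count
            width = end - start + 1
            return start + share * width
        cumulative += count


# Invented rent brackets: 10, 30, and 60 households.
brackets = [(0, 999, 10), (1000, 1999, 30), (2000, 2999, 60)]
median = interpolated_median(brackets)
print(round(median, 2))  # 2166.67
```

With 100 households the midpoint is 50. The first two brackets hold 40 households, so 10 of the 60 households in the third bracket are needed, and the median lands one sixth of the way into a 1,000-wide range: 2000 + 1000/6.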


#### Medians for Total Population

In [None]:
# all population: import median tables from median_ranges csv and add empty 'households' and 'cumulative_total' columns
range_df = pd.read_csv(r'./lookup_tables/median_ranges.csv')
range_df['households']=0
range_df['cumulative_total']=0
range_df.head()

#### Medians for Race/Ethnicity groups

In [None]:
# race/ethnicity: import median tables from median_ranges_race csv and add empty 'households' and 'cumulative_total' columns
range_race_df = pd.read_csv(r'./lookup_tables/median_ranges_race.csv')
range_race_df['households']=0
range_race_df['cumulative_total']=0
range_race_df.head()

#### Define Median Helper Function

In [None]:
# define median helper function
def calc_median(tract_df, range_df, median_to_calc):
    
    # subset range df for current median variable to calc; work on a copy so the
    # column assignments below never write into the shared lookup table
    range_df = range_df[range_df['name']==median_to_calc].copy()
    
    # sort dataframe low to high by range start column
    range_df = range_df.sort_values(by=['range_start'])
    
    # calculate households as sum of tract level households for each row based on range id
    range_df['households'] = range_df.apply(lambda row : tract_df[row['id']].sum(), axis = 1)
    
    # calculate the cumulative total of households
    range_df['cumulative_total'] = range_df['households'].cumsum()
    
    # calculate total households and return 0 if total households is 0
    total_households = range_df['households'].sum()
    
    # if total households is 0 set median to 0
    if total_households == 0:
        return 0
    
    # calculate midpoint
    midpoint = total_households/2

    # if midpoint is below first range return median as end of first range value
    if midpoint < range_df['cumulative_total'].min():
        new_median = range_df['range_end'].min()
        return new_median
    
    # if midpoint is above last range set median to end of last range value
    if midpoint > range_df['cumulative_total'].max():
        new_median = range_df['range_end'].max()
        return new_median
    
    # extract rows at or below the midpoint; using <= keeps the lookup below from
    # failing when the midpoint falls exactly on a range boundary
    less_midpoint_df = range_df[range_df['cumulative_total']<=midpoint]
    
    # get the single row containing the range just below the mid range by getting the row with the max range start from the subsetted median df
    range_below_mid_range_df = less_midpoint_df[less_midpoint_df['range_start'] == less_midpoint_df['range_start'].max()]
    
    # get the cumulative total value for the first row of the range below mid range dataframe
    total_hh_previous_range = range_below_mid_range_df['cumulative_total'].iloc[0]
    hh_to_mid_range = midpoint - total_hh_previous_range
    
    # extract rows above midrange by subsetting median df for rows with cumulative total greater than midpoint.
    greater_midpoint_df = range_df[range_df['cumulative_total']>midpoint]
    
    # get the single row containing the mid range by getting the row with the min range start from the subsetted median df
    mid_range_df = greater_midpoint_df[greater_midpoint_df['range_start'] == greater_midpoint_df['range_start'].min()]
    
    # get the households value for the first row of the mid range dataframe
    hh_in_mid_range = mid_range_df['households'].iloc[0]
    
    # calculate proportion of number of households in the mid range that would be needed to get to the mid-point
    prop_of_hh = hh_to_mid_range/hh_in_mid_range
    
    # calculate width of the mid range
    width = (mid_range_df['range_end'].iloc[0]-mid_range_df['range_start'].iloc[0])+1
    
    # apply proportion to width of mid range
    prop_to_width = prop_of_hh*width
    beginning_of_mid_range = mid_range_df['range_start'].iloc[0]
    
    # calculate new median
    new_median = beginning_of_mid_range + prop_to_width
    
    return new_median

## Define functions for calculating socio-economic data


The `calc_socio_economic_data` function family takes the tract-level data from the API call and a tract/neighborhood lookup dictionary. These functions run all of the socio-economic data calculations and return a dictionary. The calculations are derived from [Data_Items_and_Sources.xlsx](https://github.com/jsherba/socio-economic-profiles/raw/main/Data_Items_And_Sources_2019.xlsx).

### Calculation helper functions
- `calc_sum(df, attribute_id)`: sums the values of the given attribute across tracts
- `calc_normalized(df, attribute_id, attribute_id2)`: divides the summed 1st attribute by the summed 2nd attribute
- `calc_sum_normalized(df, attribute_list, attribute_id2)`: divides the sum of the attributes in the list by the summed 2nd attribute
- `calc_sum_sum_normalized(df, attribute_list1, attribute_list2)`: divides the sum of the 1st set of attributes by the sum of the 2nd set of attributes
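
The normalized helpers all compute a ratio of sums across the tracts of a geography rather than an average of per-tract ratios. The sketch below shows the difference on a toy dataframe; `ratio_of_sums` is a hypothetical stand-in written for this example, and the counts are invented.

```python
import pandas as pd


def ratio_of_sums(df, numerator_cols, denominator_col):
    """Sum the numerator columns over all rows and divide by the summed denominator."""
    denominator = df[denominator_col].sum()
    if denominator == 0:
        return 0
    return df[numerator_cols].sum().sum() / denominator


# Two invented tracts: total population plus males and females under 5.
toy_df = pd.DataFrame({
    "B01001_001E": [1000, 3000],
    "B01001_003E": [40, 60],
    "B01001_027E": [30, 70],
})

share_under_5 = ratio_of_sums(toy_df, ["B01001_003E", "B01001_027E"], "B01001_001E")
print(share_under_5)  # 0.05

# Averaging the per-tract shares instead would weight both tracts equally.
per_tract = (toy_df["B01001_003E"] + toy_df["B01001_027E"]) / toy_df["B01001_001E"]
print(per_tract.mean())
```

The ratio of sums weights each tract by its population, which is why the neighborhood share (200 of 4,000 people) differs from the unweighted mean of the two tract shares.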

In [None]:
# define calculation helper functions
def calc_sum(df, attribute_id):
    return df[attribute_id].sum()

def calc_normalized(df, attribute_id, attribute_id2):
    if df[attribute_id2].sum() == 0:
        return 0
    else:
        return (df[attribute_id].sum()/df[attribute_id2].sum())

def calc_sum_normalized(df, attribute_list, attribute_id2):
    if df[attribute_id2].sum()==0:
        return 0
    else:
        sum_of_attributes = 0
        for attribute_id in attribute_list:
            sum_of_attributes+=df[attribute_id].sum()
        return sum_of_attributes/df[attribute_id2].sum()
    
def calc_sum_sum_normalized(df, attribute_list1, attribute_list2):
    sum_of_attributes2 = 0
    for attribute_id in attribute_list2:
        sum_of_attributes2+=df[attribute_id].sum()
    if sum_of_attributes2==0:
        return 0
    else:
        sum_of_attributes1 = 0
        for attribute_id in attribute_list1:
            sum_of_attributes1+=df[attribute_id].sum()
        return sum_of_attributes1/sum_of_attributes2


### Attribute value calculation function

In [None]:
# function runs all calcs for each neighborhood or census tract, driven by the attribute lookup table
def calc_socio_economic_data_simple(df, attribute_lookup_df, tract_lookup):

    all_calc_data = defaultdict(dict) 
    attribute_ids_extracted = attribute_lookup_df['attribute_id'].tolist()
    attribute_names = attribute_lookup_df['attribute_name'].tolist()
    calc_types = attribute_lookup_df['calc_type'].tolist()
    median_calc_types = attribute_lookup_df['median_type'].tolist()

    for nb_name, tracts in tract_lookup.items():
        # extract attribute information for tracts associated with a neighborhood
        tract_df = df[df['tract'].isin(tracts)]
        
        # build dictionary with all stats for a neighborhood
        all_calc_data_nb = all_calc_data[nb_name]

        for i in range(0, len(attribute_lookup_df)):
            name = attribute_names[i]
            calc = calc_types[i]
            # keep the IDs in lookup-table order: the calcs below select them by position
            attribute_ids = [x+"E" for x in attribute_ids_extracted[i].split(", ")]

            if calc == 'sum':
                all_calc_data_nb[name] = calc_sum(tract_df, attribute_ids[0])
            elif calc == 'sum_normalized':
                # the last ID is the denominator
                all_calc_data_nb[name] = calc_sum_normalized(tract_df, attribute_ids[:-1], attribute_ids[-1])
            elif calc == 'normalized':
                all_calc_data_nb[name] = calc_normalized(tract_df, attribute_ids[0], attribute_ids[1])
            elif calc == 'sum_sum_normalized':
                # the last two IDs are treated as the denominator set
                all_calc_data_nb[name] = calc_sum_sum_normalized(tract_df, attribute_ids[:-2], attribute_ids[-2:])
            elif calc == 'median':
                # median_type names the entry in the range table to use
                median_calc = median_calc_types[i]
                all_calc_data_nb[name] = calc_median(tract_df, range_df, median_calc)
            elif calc == 'none':
                all_calc_data_nb[name] = np.nan

    return all_calc_data

### Attribute value calculation function: ACS 5 Years for All Population


In [None]:
# function runs all calcs for each neighborhood or supervisor district
def calc_socio_economic_data(df, tract_lookup):
    # create empty dictionary to add calculated attribute information to
    all_calc_data = defaultdict(dict) 
    # calculate all stats for each neighborhood
    for nb_name, tracts in tract_lookup.items():
        # extract attribute information for tracts associated with a neighborhood
        tract_df = df[df['tract'].isin(tracts)]
        # build dictionary with all stats for a neighborhood
        all_calc_data_nb = all_calc_data[nb_name]
        
        # population
        all_calc_data_nb["Total Population"] = calc_sum(tract_df, 'B01001_001E')
        all_calc_data_nb["Group Quarter Population"] = calc_sum(tract_df, 'B26001_001E')
        all_calc_data_nb["Percent Female"] = calc_normalized(tract_df, 'B01001_026E', 'B01001_001E')
        all_calc_data_nb["Percent with Any Disabilities"] = calc_normalized(tract_df, 'B10052_002E', 'B10052_001E')
        
        # household stats
        all_calc_data_nb["Housholds"] = calc_sum(tract_df, 'B11001_001E')
        all_calc_data_nb["Family Households"] = calc_normalized(tract_df, 'B11001_002E', 'B11001_001E')
        all_calc_data_nb["Non-Family Households"] = calc_normalized(tract_df, 'B11001_007E', 'B11001_001E')
        all_calc_data_nb["Single Person Households"] = calc_normalized(tract_df, 'B11001_008E', 'B11001_001E')
        all_calc_data_nb["Households with Children"] = calc_normalized(tract_df, 'B11005_002E', 'B11001_001E')
        all_calc_data_nb["Households with 60 years and older"] = calc_normalized(tract_df, 'B11006_002E', 'B11001_001E')
        all_calc_data_nb["Senior (65+) living alone"] = calc_sum_normalized(tract_df, ['B09020_015E', 'B09020_018E'], 'B09020_001E')
        all_calc_data_nb["Average Household Size"] = calc_normalized(tract_df, 'B11002_001E', 'B11001_001E')
        all_calc_data_nb["Average Family Household Size"] = calc_normalized(tract_df, 'B11002_002E', 'B11001_002E')
        
        # race and ethnicity stats
        all_calc_data_nb["Asian"] = calc_normalized(tract_df, 'B02001_005E', 'B02001_001E')
        all_calc_data_nb["Black/African American"] = calc_normalized(tract_df, 'B02001_003E', 'B02001_001E')
        all_calc_data_nb["White"] = calc_normalized(tract_df, 'B02001_002E', 'B02001_001E')
        all_calc_data_nb["American Indian/Alaska Native"] = calc_normalized(tract_df, 'B02010_001E', 'B02001_001E')
        all_calc_data_nb["Native Hawaiian/Pacific Islander"] = calc_normalized(tract_df, 'B02001_006E', 'B02001_001E')
        all_calc_data_nb["Not-listed/Multi-racial"] = calc_sum_normalized(tract_df, ['B02001_008E', 'B02001_007E'], 'B02001_001E')
        all_calc_data_nb["Hispanic/Latinx of Any Race"] = calc_normalized(tract_df, 'B03001_003E', 'B03001_001E')
       
        # age
        all_calc_data_nb["0-4 Years"] = calc_sum_normalized(tract_df, ['B01001_003E', 'B01001_027E'], 'B01001_001E')
        all_calc_data_nb["5-17 Years"] = calc_sum_normalized(tract_df, ['B01001_004E', 'B01001_005E', 'B01001_006E', 'B01001_028E', 'B01001_029E', 'B01001_030E'],'B01001_001E')
        all_calc_data_nb["18-34 Years"] = calc_sum_normalized(tract_df, ['B01001_007E','B01001_008E','B01001_009E', 'B01001_010E', 'B01001_011E', 'B01001_012E','B01001_031E','B01001_032E','B01001_033E','B01001_034E','B01001_035E','B01001_036E'], 'B01001_001E')
        all_calc_data_nb["35-59 Years"] = calc_sum_normalized(tract_df, ['B01001_013E', 'B01001_014E', 'B01001_015E', 'B01001_016E', 'B01001_017E', 'B01001_037E', 'B01001_038E', 'B01001_039E', 'B01001_040E', 'B01001_041E'], 'B01001_001E')
        all_calc_data_nb["60-64 Years"] = calc_sum_normalized(tract_df, ['B01001_018E', 'B01001_019E', 'B01001_042E', 'B01001_043E'], 'B01001_001E')
        all_calc_data_nb["65 Years or older"] = calc_sum_normalized(tract_df, ['B01001_020E', 'B01001_021E', 'B01001_022E', 'B01001_023E', 'B01001_024E', 'B01001_025E', 'B01001_044E', 'B01001_045E', 'B01001_046E', 'B01001_047E', 'B01001_048E', 'B01001_049E'], 'B01001_001E') 
        
        # educational attainment
        all_calc_data_nb["Less than high school degree"] = calc_sum_normalized(tract_df, ['B15003_002E', 'B15003_003E', 'B15003_004E', 'B15003_005E', 'B15003_006E', 'B15003_007E', 'B15003_008E', 'B15003_009E', 'B15003_010E', 'B15003_011E', 'B15003_012E', 'B15003_013E', 'B15003_014E', 'B15003_015E', 'B15003_016E'], 'B15003_001E')
        all_calc_data_nb["High school degree or equivalent"] = calc_sum_normalized(tract_df, ['B15003_017E', 'B15003_018E'], 'B15003_001E')
        all_calc_data_nb["Some college or Associates degree"] = calc_sum_normalized(tract_df, ['B15003_019E', 'B15003_020E', 'B15003_021E'], 'B15003_001E')
        all_calc_data_nb["Bachelors degree or higher"] = calc_sum_normalized(tract_df, ['B15003_022E','B15003_023E', 'B15003_024E', 'B15003_025E'], 'B15003_001E')
        
        # nativity
        all_calc_data_nb["Foreign Born"] = calc_normalized(tract_df, 'B05002_013E', 'B05002_001E')
        all_calc_data_nb["Naturalized"] = calc_normalized(tract_df, 'B05002_014E', 'B05002_001E')
        
        # language spoken at home
        all_calc_data_nb["English Only"] = calc_sum_normalized(tract_df, ['B16007_003E', 'B16007_009E', 'B16007_015E'], 'B16007_001E')
        all_calc_data_nb["Spanish Only"] = calc_sum_normalized(tract_df, ['B16007_004E', 'B16007_010E', 'B16007_016E'], 'B16007_001E')
        all_calc_data_nb["Asian/Pacific Islander"] = calc_sum_normalized(tract_df, ['B16007_006E', 'B16007_012E', 'B16007_018E'], 'B16007_001E')
        all_calc_data_nb["Other European Languages"] = calc_sum_normalized(tract_df, ['B16007_005E', 'B16007_011E', 'B16007_017E'], 'B16007_001E')
        all_calc_data_nb["Other Languages"] = calc_sum_normalized(tract_df, ['B16007_007E', 'B16007_013E', 'B16007_019E'], 'B16007_001E')
        
        # Language spoken at home: fine grained 
        all_calc_data_nb["Spanish"] = calc_normalized(tract_df, 'B16001_003E', 'B16001_001E')
        all_calc_data_nb["Spanish with Limited English Proficiency"] = calc_normalized(tract_df, 'B16001_005E', 'B16001_001E')
        all_calc_data_nb["French"] = calc_normalized(tract_df, 'B16001_006E', 'B16001_001E')
        all_calc_data_nb["French with Limited English Proficiency"] = calc_normalized(tract_df, 'B16001_008E', 'B16001_001E')
        all_calc_data_nb["Haitian"] = calc_normalized(tract_df, 'B16001_009E', 'B16001_001E')
        all_calc_data_nb["Haitian with Limited English Proficiency"] = calc_normalized(tract_df, 'B16001_011E', 'B16001_001E')
        all_calc_data_nb["Italian"] = calc_normalized(tract_df, 'B16001_012E', 'B16001_001E')
        all_calc_data_nb["Italian with Limited English Proficiency"] = calc_normalized(tract_df, 'B16001_014E', 'B16001_001E')
        all_calc_data_nb["German"] = calc_normalized(tract_df, 'B16001_018E', 'B16001_001E')
        all_calc_data_nb["German with Limited English Proficiency"] = calc_normalized(tract_df, 'B16001_020E', 'B16001_001E')
        all_calc_data_nb["Russian"] = calc_normalized(tract_df, 'B16001_027E', 'B16001_001E')
        all_calc_data_nb["Russian with Limited English Proficiency"] = calc_normalized(tract_df, 'B16001_029E', 'B16001_001E')
        all_calc_data_nb["Hindi"] = calc_normalized(tract_df, 'B16001_048E', 'B16001_001E')
        all_calc_data_nb["Hindi with Limited English Proficiency"] = calc_normalized(tract_df, 'B16001_050E', 'B16001_001E')
        all_calc_data_nb["Punjabi"] = calc_normalized(tract_df, 'B16001_054E', 'B16001_001E')
        all_calc_data_nb["Punjabi with Limited English Proficiency"] = calc_normalized(tract_df, 'B16001_056E', 'B16001_001E')
        all_calc_data_nb["Bengali"] = calc_normalized(tract_df, 'B16001_057E', 'B16001_001E')
        all_calc_data_nb["Bengali with Limited English Proficiency"] = calc_normalized(tract_df, 'B16001_059E', 'B16001_001E')
        all_calc_data_nb["Chinese"] = calc_normalized(tract_df, 'B16001_075E', 'B16001_001E')
        all_calc_data_nb["Chinese with Limited English Proficiency"] = calc_normalized(tract_df, 'B16001_077E', 'B16001_001E')
        all_calc_data_nb["Japanese"] = calc_normalized(tract_df, 'B16001_078E', 'B16001_001E')
        all_calc_data_nb["Japanese with Limited English Proficiency"] = calc_normalized(tract_df, 'B16001_080E', 'B16001_001E')
        all_calc_data_nb["Korean"] = calc_normalized(tract_df, 'B16001_081E', 'B16001_001E')
        all_calc_data_nb["Korean with Limited English Proficiency"] = calc_normalized(tract_df, 'B16001_083E', 'B16001_001E')
        all_calc_data_nb["Hmong"] = calc_normalized(tract_df, 'B16001_084E', 'B16001_001E')
        all_calc_data_nb["Hmong with Limited English Proficiency"] = calc_normalized(tract_df, 'B16001_086E', 'B16001_001E')
        all_calc_data_nb["Vietnamese"] = calc_normalized(tract_df, 'B16001_087E', 'B16001_001E')
        all_calc_data_nb["Vietnamese with Limited English Proficiency"] = calc_normalized(tract_df, 'B16001_089E', 'B16001_001E')
        all_calc_data_nb["Tagalog"] = calc_normalized(tract_df, 'B16001_099E', 'B16001_001E')
        all_calc_data_nb["Tagalog with Limited English Proficiency"] = calc_normalized(tract_df, 'B16001_101E', 'B16001_001E')
        all_calc_data_nb["Arabic"] = calc_normalized(tract_df, 'B16001_105E', 'B16001_001E')
        all_calc_data_nb["Arabic with Limited English Proficiency"] = calc_normalized(tract_df, 'B16001_107E', 'B16001_001E')
        
        # linguistic isolation
        all_calc_data_nb["% of All Households"] = calc_sum_normalized(tract_df, ['B16003_002E', 'B16003_008E'], 'B16004_001E')
        all_calc_data_nb["% of Spanish-Speaking Households"] = calc_sum_normalized(tract_df, ['B16003_004E', 'B16003_009E'], 'B16004_001E')
        all_calc_data_nb["% of Asian-Speaking Households"] = calc_sum_normalized(tract_df, ['B16003_006E', 'B16003_011E'], 'B16004_001E')
        all_calc_data_nb["% of Other European-Speaking Households"] = calc_sum_normalized(tract_df, ['B16003_005E', 'B16003_010E'], 'B16004_001E')
        all_calc_data_nb["% of Households Speaking Other Languages"] = calc_sum_normalized(tract_df, ['B16003_007E', 'B16003_012E'], 'B16004_001E')
        
        # English proficiency
        all_calc_data_nb["% Speaking English Very Well of Total"] = calc_sum_normalized(tract_df, ['B06007_002E', 'B06007_004E', 'B06007_007E'], 'B06007_001E')
        all_calc_data_nb["% Speaking English Very Well of Foreign Born"] = calc_sum_normalized(tract_df, ['B06007_034E', 'B06007_036E', 'B06007_039E'], 'B06007_033E')
        
        # housing
        all_calc_data_nb["Total Number of Units"] = calc_sum(tract_df, 'B25001_001E')
        all_calc_data_nb["Units built in 2010 or later"] = calc_sum_normalized(tract_df, ['B25034_002E', 'B25034_003E'], 'B25034_001E')
        all_calc_data_nb["Occupied Units"] = calc_normalized(tract_df, 'B25003_001E', 'B25001_001E')
        all_calc_data_nb["Occupied Units (Number)"] = calc_sum(tract_df, 'B25003_001E')
        all_calc_data_nb["Owner Occupied"] = calc_normalized(tract_df, 'B25007_002E', 'B25007_001E')
        all_calc_data_nb["Renter Occupied"] = calc_normalized(tract_df, 'B25007_012E', 'B25007_001E')
        all_calc_data_nb["Vacant Units"] = calc_normalized(tract_df, 'B25004_001E', 'B25001_001E')
        all_calc_data_nb["Percent in Same House Last Year"] = calc_normalized(tract_df, 'B07001_017E', 'B07001_001E')
        all_calc_data_nb["Percent Housing Overcrowding (> 1 person per room)"] = calc_sum_normalized(tract_df, ['B25014_005E', 'B25014_006E', 'B25014_007E', 'B25014_011E', 'B25014_012E', 'B25014_013E'], 'B25014_001E')
        # structure type
        all_calc_data_nb["Single Family Housing"] = calc_sum_normalized(tract_df, ['B25024_002E', 'B25024_003E'], 'B25024_001E')
        all_calc_data_nb["2-4 Units"] = calc_sum_normalized(tract_df, ['B25024_004E', 'B25024_005E'], 'B25024_001E')
        all_calc_data_nb["5-9 Units"] = calc_normalized(tract_df, 'B25024_006E', 'B25024_001E')
        all_calc_data_nb["10-19 Units"] = calc_normalized(tract_df, 'B25024_007E', 'B25024_001E')
        all_calc_data_nb["20 Units or More"] = calc_sum_normalized(tract_df, ['B25024_008E', 'B25024_009E'], 'B25024_001E')
        all_calc_data_nb["Other Type"] = calc_sum_normalized(tract_df, ['B25024_010E', 'B25024_011E'], 'B25024_001E')
        # unit size - number 
        all_calc_data_nb["No Bedroom (Number)"] = calc_sum(tract_df,'B25041_002E')
        all_calc_data_nb["1 Bedroom (Number)"] = calc_sum(tract_df, 'B25041_003E')
        all_calc_data_nb["2 Bedrooms (Number)"] = calc_sum(tract_df, 'B25041_004E')
        all_calc_data_nb["3 Bedrooms (Number)"] = calc_sum(tract_df, 'B25041_005E')
        all_calc_data_nb["4 Bedrooms (Number)"] = calc_sum(tract_df, 'B25041_006E')
        all_calc_data_nb["5 or More Bedrooms (Number)"] = calc_sum(tract_df, 'B25041_007E')
        # unit size
        all_calc_data_nb["No Bedroom"] = calc_normalized(tract_df,'B25041_002E', 'B25041_001E')
        all_calc_data_nb["1 Bedroom"] = calc_normalized(tract_df, 'B25041_003E', 'B25041_001E')
        all_calc_data_nb["2 Bedrooms"] = calc_normalized(tract_df, 'B25041_004E', 'B25041_001E')
        all_calc_data_nb["3-4 Bedrooms"] = calc_sum_normalized(tract_df, ['B25041_005E', 'B25041_006E'], 'B25041_001E')
        all_calc_data_nb["5 or More Bedrooms"] = calc_normalized(tract_df, 'B25041_007E', 'B25041_001E')
        # housing prices
        all_calc_data_nb["Median Rent"] = calc_median(tract_df, range_df, 'median_rent')
        #all_calc_data_nb["Median Contract Rent"] = calc_median(tract_df, range_df, 'median_rent_contract')
        all_calc_data_nb["Median Rent as % of Household Income"] = calc_median(tract_df, range_df, 'median_rent_percent_of_income')
        all_calc_data_nb["Median Home Value"] = calc_median(tract_df, range_df, 'median_home_value')
        all_calc_data_nb["Rent-Burdened Population"] = calc_sum_normalized(tract_df, ['B25074_006E', 'B25074_007E', 'B25074_008E', 'B25074_009E', 'B25074_015E', 'B25074_016E', 'B25074_017E', 'B25074_018E'], 'B25074_001E' )
        # vehicles available
        all_calc_data_nb["Vehicles Available"] = calc_sum(tract_df, 'B25046_001E')
        all_calc_data_nb["Vehicles Homeowners"] = calc_normalized(tract_df, 'B25046_002E', 'B25046_001E')
        all_calc_data_nb["Vehicles Renters"] = calc_normalized(tract_df, 'B25046_003E', 'B25046_001E')
        all_calc_data_nb["Vehicles Per Capita"] = calc_normalized(tract_df, 'B25046_001E', 'B01001_001E')
        all_calc_data_nb["Households with no Vehicle"] = calc_sum_normalized(tract_df, ['B25044_003E', 'B25044_010E'], 'B25044_001E')
        all_calc_data_nb["Percent of Homeowning Households"] = calc_normalized(tract_df, 'B25044_003E', 'B25044_002E')
        all_calc_data_nb["Percent of Renting Households"] = calc_normalized(tract_df, 'B25044_010E', 'B25044_009E')
        # income
        all_calc_data_nb["Median Household Income (B19013_001)"] = calc_median(tract_df, range_df, 'median_household_income')
        all_calc_data_nb["Median Family Income (B19113_001)"] = calc_median(tract_df, range_df, 'median_family_income')
        all_calc_data_nb["Per Capita Income"] = calc_normalized(tract_df, 'B19025_001E', 'B01001_001E')
        all_calc_data_nb["Percent in Poverty"] = calc_normalized(tract_df, 'B17001_002E', 'B17001_001E')
        all_calc_data_nb["Household Income (less than 25K)"] = calc_sum_normalized(tract_df, ['B19001_002E', 'B19001_003E', 'B19001_004E', 'B19001_005E'], 'B19001_001E')
        all_calc_data_nb["Household Income (25K-50K)"] = calc_sum_normalized(tract_df, ['B19001_006E', 'B19001_007E', 'B19001_008E', 'B19001_009E', 'B19001_010E'], 'B19001_001E')
        all_calc_data_nb["Household Income (5OK-75K)"] = calc_sum_normalized(tract_df, ['B19001_011E', 'B19001_012E'], 'B19001_001E')
        all_calc_data_nb["Household Income (75K-100K)"] = calc_normalized(tract_df, 'B19001_013E', 'B19001_001E')
        all_calc_data_nb["Household Income (100K-125K)"] = calc_normalized(tract_df, 'B19001_014E', 'B19001_001E')
        all_calc_data_nb["Household Income (more than 125K)"] = calc_sum_normalized(tract_df, ['B19001_015E', 'B19001_016E', 'B19001_017E'], 'B19001_001E')

        # employment
        all_calc_data_nb["Unemployment Rate"] = calc_normalized(tract_df, 'B23025_005E', 'B23025_002E')
        all_calc_data_nb["Percent Unemployment Female"] = calc_sum_normalized(tract_df, ['B23001_094E', 'B23001_101E', 'B23001_108E', 'B23001_115E', 'B23001_122E', 'B23001_129E', 'B23001_136E', 'B23001_143E', 'B23001_150E', 'B23001_157E', 'B23001_162E', 'B23001_167E', 'B23001_172E', 'B23001_090E', 'B23001_097E', 'B23001_104E', 'B23001_111E', 'B23001_118E', 'B23001_125E', 'B23001_132E', 'B23001_139E', 'B23001_146E', 'B23001_153E', 'B23001_160E', 'B23001_165E', 'B23001_170E'], 'B23001_088E')
        all_calc_data_nb["Percent Unemployment Male"] = calc_sum_normalized(tract_df, ['B23001_008E', 'B23001_015E', 'B23001_022E', 'B23001_029E', 'B23001_036E', 'B23001_043E', 'B23001_050E', 'B23001_057E', 'B23001_064E', 'B23001_071E', 'B23001_076E', 'B23001_081E', 'B23001_086E', 'B23001_004E', 'B23001_011E', 'B23001_018E', 'B23001_025E', 'B23001_032E', 'B23001_039E', 'B23001_046E', 'B23001_053E', 'B23001_060E', 'B23001_067E', 'B23001_074E', 'B23001_079E', 'B23001_084E'], 'B23001_002E')
        
        # journey to work
        all_calc_data_nb["Workers 16 Years and Older"] = calc_sum(tract_df, 'B08006_001E')
        all_calc_data_nb["Car"] = calc_normalized(tract_df, 'B08006_002E', 'B08006_001E')
        all_calc_data_nb["Drove Alone"] = calc_normalized(tract_df, 'B08006_003E', 'B08006_001E')
        all_calc_data_nb["Carpooled"] = calc_normalized(tract_df, 'B08006_004E', 'B08006_001E')
        all_calc_data_nb["Transit"] = calc_normalized(tract_df, 'B08006_008E', 'B08006_001E')
        all_calc_data_nb["Bike"] = calc_normalized(tract_df, 'B08006_014E', 'B08006_001E')
        all_calc_data_nb["Walk"] = calc_normalized(tract_df, 'B08006_015E', 'B08006_001E')
        all_calc_data_nb["Other Journey Type"] = calc_normalized(tract_df, 'B08006_016E', 'B08006_001E')
        all_calc_data_nb["Worked at Home"] = calc_normalized(tract_df, 'B08006_017E', 'B08006_001E')
        
        # population density
        all_calc_data_nb["Population Density per Acre"] = calc_sum(tract_df, 'B01001_001E')
       
    #return calc dictionary
    return all_calc_data
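
The shares in the function above are built from tract-level counts for all tracts in a neighborhood. As a minimal sketch of why counts are pooled before dividing, the made-up two-tract table below compares a ratio of sums with a mean of tract-level ratios. `ratio_of_sums` is a hypothetical stand-in, not the notebook's `calc_normalized`, and the numbers are invented for illustration.

```python
import pandas as pd

# Toy tract table (made-up numbers): renter-occupied units and all occupied units.
toy_tracts = pd.DataFrame({
    "tract": ["000100", "000200"],
    "renter_units": [90, 100],
    "occupied_units": [100, 1000],
})

def ratio_of_sums(frame, numerator, denominator):
    """Pool counts across rows, then divide (hypothetical helper)."""
    total = frame[denominator].sum()
    return frame[numerator].sum() / total if total else float("nan")

# Pooled share weights each tract by its size.
pooled_share = ratio_of_sums(toy_tracts, "renter_units", "occupied_units")

# Averaging tract shares treats a 100-unit tract the same as a 1000-unit tract.
mean_of_shares = (toy_tracts["renter_units"] / toy_tracts["occupied_units"]).mean()

print(round(pooled_share, 4))    # 190 / 1100
print(round(mean_of_shares, 4))  # (0.9 + 0.1) / 2
```

The two numbers differ noticeably whenever tracts differ in size, which is the usual reason to aggregate counts first.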

### Attribute value calculation function: ACS 5 Years for Race/Ethnicity groups 

In [None]:
# function runs all calcs for race/Hispanic breakdowns for each neighborhood or supervisor district
def calc_socio_economic_data_race(df, tract_lookup):
    
    # race/ethnicity breakdown 
    racecode = {
        "White Alone": "A",
        "Black or African American Alone": "B",
        "American Indian And Alaska Native Alone": "C",
        "Asian Alone": "D",
        #"Native Hawaiian and Other Pacific Islander Alone", "E"
        "White Non-Hispanic": "H",
        "Hispanic or Latino": "I"
    }

    all_calc_data = defaultdict(dict) 
    # calculate all stats for each neighborhood
    for nb_name, tracts in tract_lookup.items():
        # extract attribute information for tracts associated with a neighborhood
        tract_df = df[df['tract'].isin(tracts)]
        # build dictionary with all stats for a neighborhood
        all_calc_data_nb = all_calc_data[nb_name]
        # population
        
        for race, code in racecode.items():
            # population
            all_calc_data_nb[code+"_Total Population"] = calc_sum(tract_df, 'B01001'+code+'_001E')
            all_calc_data_nb[code+"_Group Quarter Population"] = calc_sum(tract_df, 'B26103'+code+'_002E')
            all_calc_data_nb[code+"_Percent Female"] = calc_normalized(tract_df, 'B01001'+code+'_017E', 'B01001'+code+'_001E')

            # nativity
            all_calc_data_nb[code+"_Foreign Born"] = calc_normalized(tract_df, 'B06004'+code+'_005E', 'B06004'+code+'_001E')
            all_calc_data_nb[code+"_Naturalized"] = calc_sum_normalized(tract_df, ['B05003'+code+'_006E', 'B05003'+code+'_011E', 'B05003'+code+'_017E', 'B05003'+code+'_022E'], 'B05003'+code+'_001E')
            
            # disability population
            all_calc_data_nb[code+"_Population with Disability"] = calc_sum_normalized(tract_df, ['B18101'+code+'_003E', 'B18101'+code+'_006E', 'B18101'+code+'_009E'], 'B18101'+code+'_001E')
            
            # household stats
            all_calc_data_nb[code+"_Housholds"] = calc_sum(tract_df, 'B11001'+code+'_001E')
            all_calc_data_nb[code+"_Family Households"] = calc_normalized(tract_df, 'B11001'+code+'_002E', 'B11001'+code+'_001E')
            all_calc_data_nb[code+"_Non-Family Households"] = calc_normalized(tract_df, 'B11001'+code+'_007E', 'B11001'+code+'_001E')
            all_calc_data_nb[code+"_Single Person Households, % of Total"] = calc_normalized(tract_df, 'B11001'+code+'_008E', 'B11001'+code+'_001E')
            #all_calc_data_nb["Households with Children, % of Total"] = calc_normalized(tract_df, 'B11005_002E', 'B11001A_001E')
            #all_calc_data_nb["Households with 60 years and older, % of Total"] = calc_normalized(tract_df, 'B11006_002E', 'B11001_001E')
            all_calc_data_nb[code+"_Average Household Size"] = calc_normalized(tract_df, 'B11002'+code+'_001E', 'B11001'+code+'_001E')
            all_calc_data_nb[code+"_Average Family Household Size"] = calc_normalized(tract_df, 'B11002'+code+'_002E', 'B11001'+code+'_002E')
            
            # age 
            all_calc_data_nb[code+"_0-4 Years"] = calc_sum_normalized(tract_df, ['B01001'+code+'_003E', 'B01001'+code+'_018E'], 'B01001'+code+'_001E')
            all_calc_data_nb[code+"_5-17 Years"] = calc_sum_normalized(tract_df, ['B01001'+code+'_004E', 'B01001'+code+'_005E', 'B01001'+code+'_006E', 'B01001'+code+'_019E', 'B01001'+code+'_020E', 'B01001'+code+'_021E'],'B01001'+code+'_001E')
            all_calc_data_nb[code+"_18-34 Years"] = calc_sum_normalized(tract_df, ['B01001'+code+'_007E','B01001'+code+'_008E','B01001'+code+'_009E', 'B01001'+code+'_010E', 'B01001'+code+'_022E','B01001'+code+'_023E','B01001'+code+'_024E','B01001'+code+'_025E'], 'B01001'+code+'_001E')
            all_calc_data_nb[code+"_35-64 Years"] = calc_sum_normalized(tract_df, ['B01001'+code+'_011E', 'B01001'+code+'_012E', 'B01001'+code+'_013E', 'B01001'+code+'_026E', 'B01001'+code+'_027E', 'B01001'+code+'_028E'], 'B01001'+code+'_001E')
            all_calc_data_nb[code+"_65 Years or older"] = calc_sum_normalized(tract_df, ['B01001'+code+'_014E', 'B01001'+code+'_015E', 'B01001'+code+'_016E', 'B01001'+code+'_029E', 'B01001'+code+'_030E', 'B01001'+code+'_031E'], 'B01001'+code+'_001E') 

            # educational attainment
            all_calc_data_nb[code+"_Less than high school degree"] = calc_sum_normalized(tract_df, ['C15002'+code+'_003E', 'C15002'+code+'_008E'], 'C15002'+code+'_001E')
            all_calc_data_nb[code+"_High school degree or equivalent"] = calc_sum_normalized(tract_df, ['C15002'+code+'_004E', 'C15002'+code+'_009E'], 'C15002'+code+'_001E')
            all_calc_data_nb[code+"_Some college or Associate's degree"] = calc_sum_normalized(tract_df, ['C15002'+code+'_005E', 'C15002'+code+'_010E'], 'C15002'+code+'_001E')
            all_calc_data_nb[code+"_Bachelor's degree or higher"] = calc_sum_normalized(tract_df, ['C15002'+code+'_006E', 'C15002'+code+'_011E'], 'C15002'+code+'_001E')

            # income
            all_calc_data_nb[code+"_Median Household Income"] = calc_median(tract_df, range_race_df, code+'_median_household_income')
            all_calc_data_nb[code+"_Median Family Income"] = calc_median(tract_df, range_race_df, code+'_median_family_income')
            all_calc_data_nb[code+"_Per Capita Income"] = calc_normalized(tract_df, 'B19025'+code+'_001E', 'B01001'+code+'_001E')
            all_calc_data_nb[code+"_Percent in Poverty"] = calc_normalized(tract_df, 'B17001'+code+'_002E', 'B17001'+code+'_001E')

            # employment
            all_calc_data_nb[code+"_Unemployment Rate"] = calc_sum_sum_normalized(tract_df, ['C23002'+code+'_008E', 'C23002'+code+'_013E', 'C23002'+code+'_021E', 'C23002'+code+'_026E'], ['C23002'+code+'_004E', 'C23002'+code+'_011E', 'C23002'+code+'_017E', 'C23002'+code+'_024E'])
            all_calc_data_nb[code+"_Owner Occupied"] = calc_normalized(tract_df, 'B25003'+code+'_002E', 'B25003'+code+'_001E')
            all_calc_data_nb[code+"_Renter Occupied"] = calc_normalized(tract_df, 'B25003'+code+'_003E', 'B25003'+code+'_001E')
            all_calc_data_nb[code+"_Percent in Same House Last Year"] = calc_normalized(tract_df, 'B07004'+code+'_002E', 'B07004'+code+'_001E')

    return all_calc_data
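
The function above builds each race-specific column ID by inserting a one-letter suffix between the table ID and the field number (for example `'B01001'+code+'_001E'`). A minimal sketch of that naming pattern follows; `race_column` is a hypothetical helper introduced here for illustration, not something the notebook defines.

```python
# Race/ethnicity suffixes, copied from the racecode dictionary above.
racecode = {
    "White Alone": "A",
    "Black or African American Alone": "B",
    "American Indian And Alaska Native Alone": "C",
    "Asian Alone": "D",
    "White Non-Hispanic": "H",
    "Hispanic or Latino": "I",
}

def race_column(table, code, field):
    """Build an estimate column ID such as 'B01001A_001E' (hypothetical helper)."""
    return f"{table}{code}_{field:03d}E"

# One total-population column per race/ethnicity group.
total_population_columns = {
    race: race_column("B01001", code, 1) for race, code in racecode.items()
}

print(total_population_columns["Asian Alone"])
```

Routing every ID through one helper like this makes it harder to leave a hard-coded suffix behind when a line is copied from one group to another.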

### Attribute value calculation: PUMS for Race/Ethnicity groups

In [None]:
pums_url = 'https://api.census.gov/data/2020/acs/acs5/pums?get=WGTP,GRPIP,RAC1P,HISP&ucgid=7950000US0607501,7950000US0607502,7950000US0607503,7950000US0607504,7950000US0607505,7950000US0607506,7950000US0607507'
resp = requests.get(pums_url)
if resp.status_code != 200:
    # this means something went wrong
    resp.raise_for_status()

# retrieve data as json and convert to Pandas Dataframe
data = resp.json()
headers = data.pop(0)
pums_df = pd.DataFrame(data, columns=headers)
for col in headers:
    pums_df[col] = pd.to_numeric(pums_df[col])

pums_df

In [None]:
puma_to_tract = pd.read_csv('https://www2.census.gov/geo/docs/maps-data/data/rel2020/2020_Census_Tract_to_2020_PUMA.txt')
print(len(puma_to_tract))
puma_to_tract = puma_to_tract.apply(pd.to_numeric)
puma_to_tract[puma_to_tract['PUMA5CE']==7504].head()


In [None]:
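# fold RAC1P codes 4 and 5 into code 3 so that all American Indian and Alaska Native records share one group
pums_df.loc[pums_df.RAC1P == 4, 'RAC1P'] = 3
pums_df.loc[pums_df.RAC1P == 5, 'RAC1P'] = 3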

puma_list = [7501, 7502, 7503, 7504, 7505, 7506, 7507]

racecode_pums = {
        1: "A", #White Alone
        2: "B", #Black or African American Alone
        3: "C", #American Indian And Alaska Native Alone
        6: "D", #Asian Alone
        #"Native Hawaiian and Other Pacific Islander Alone", "E"
        #"White Non-Hispanic": "H",
        #Hispanic or Latino
    }


pums_data = defaultdict(dict) 

for puma in puma_list:
    # build dictionary with all stats for a PUMA
    pums_data_puma = pums_data[puma]
    
    # calculate stats for races
    for race, code in racecode_pums.items():
        # non-Hispanic households (HISP == 1) of this race in this PUMA
        sub_df_hisp_n = pums_df[(pums_df['PUMA']==puma) & (pums_df['RAC1P']==race) & (pums_df['HISP']==1)].reset_index()
        # household-weighted mean of GRPIP, computed the same way as for Latino households below
        weight_total = sub_df_hisp_n['WGTP'].sum()
        pums_data_puma[code+'_GRPIP'] = (sub_df_hisp_n['GRPIP'] * sub_df_hisp_n['WGTP']).sum() / weight_total if weight_total else np.nan

        
        
    # calculate stats for Latino
    sub_df_hisp_y = pums_df[(pums_df['PUMA']==puma) & (pums_df['HISP']!=1)].reset_index()
    sub_df_hisp_y['GRPIP_wg'] = sub_df_hisp_y['GRPIP'] * sub_df_hisp_y['WGTP']
    value = sum(sub_df_hisp_y['GRPIP_wg'])/sum(sub_df_hisp_y['WGTP'])
    pums_data_puma['I_GRPIP'] = value
    
pums_data_df = pd.DataFrame.from_dict(pums_data).reset_index()
pums_data_df.rename(columns = {'index':'Attribute'}, inplace = True)
#df_race_calcs.head()
pums_data_df.head()
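
The Latino calculation above weights each household's `GRPIP` (gross rent as a percentage of household income) by its household weight `WGTP`. A minimal sketch of that weighted mean on made-up records follows; `weighted_mean` and the toy values are assumptions for illustration only.

```python
import pandas as pd

# Made-up household records, not real PUMS data.
toy_pums = pd.DataFrame({
    "PUMA": [7501, 7501, 7501, 7502],
    "GRPIP": [20, 40, 30, 50],
    "WGTP": [10, 30, 60, 5],
})

def weighted_mean(frame, value_col, weight_col):
    """Weighted mean of value_col; NaN when the weights sum to zero (hypothetical helper)."""
    weights = frame[weight_col]
    if weights.sum() == 0:
        return float("nan")
    return (frame[value_col] * weights).sum() / weights.sum()

# One weighted mean per PUMA.
by_puma = {
    int(puma): weighted_mean(group, "GRPIP", "WGTP")
    for puma, group in toy_pums.groupby("PUMA")
}

print(by_puma)
```

Guarding against a zero weight total matters here because some race and PUMA combinations may have no sampled households.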
            




In [None]:
pums_data_df.rename(columns = {'index':'Attribute'}, inplace = True)
pums_data_df_tp = pums_data_df.T.reset_index()
pums_data_df_tp.columns = pums_data_df_tp.iloc[0]
pums_data_df_tp = pums_data_df_tp[1:].rename(columns={'Attribute': 'PUMA'})
pums_data_df_tp = pums_data_df_tp.sort_values(by=['PUMA'])
pums_data_df_tp.head()

In [None]:
pums_data_df_tp = pums_data_df_tp.apply(pd.to_numeric)
pums_data_df_tp = pd.merge(pums_data_df_tp, puma_to_tract, left_on='PUMA', right_on='PUMA5CE', how='left').drop(['STATEFP', 'COUNTYFP'], axis=1)



In [None]:

# copy the slice so the conversion below does not modify a view of geo_lookup_df
simple_geo_lookup = geo_lookup_df[['tractid', 'neighborhood']].copy()
simple_geo_lookup['tractid'] = pd.to_numeric(simple_geo_lookup['tractid'])

pums_data_df_tp = pd.merge(pums_data_df_tp, simple_geo_lookup, left_on = 'TRACTCE', right_on = 'tractid', how='left')
pums_data_df_tp.head()
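
The merge above only matches when the tract keys on both sides have the same type, which is why `tractid` is converted to numeric first. As a sketch under made-up data, the `indicator` option of `DataFrame.merge` can list tracts that found no neighborhood. The frames and values below are hypothetical.

```python
import pandas as pd

# Made-up PUMS-by-tract rows with numeric tract codes.
pums_by_tract = pd.DataFrame({
    "TRACTCE": [10100, 10200, 99999],
    "I_GRPIP": [30.0, 35.0, 40.0],
})

# Made-up lookup with tract IDs stored as zero-padded strings.
lookup = pd.DataFrame({
    "tractid": ["010100", "010200"],
    "neighborhood": ["Area A", "Area B"],
})
lookup = lookup.assign(tractid=pd.to_numeric(lookup["tractid"]))

# indicator=True adds a '_merge' column marking where each row was found.
merged = pums_by_tract.merge(
    lookup, left_on="TRACTCE", right_on="tractid", how="left", indicator=True
)
unmatched = merged.loc[merged["_merge"] == "left_only", "TRACTCE"].tolist()

print(unmatched)
```

An empty `unmatched` list is a quick check that every tract was assigned a neighborhood before moving on.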


## Calculate Socioeconomic Profiles

### Set Summary Variables (Neighborhood or Census Tract) and Output Paths

Now you are ready to calculate attribute values summarized at the geographic level of your choice. First set `geo_summary_variable` below to `Neighborhood` and run the cells that follow. After exporting the final data table as a csv file, return here, set `geo_summary_variable` to `Tract`, and rerun the cells through the export step.

In [None]:
# set the geography to summarize by: 'Neighborhood' or 'Tract' (the two options handled below)
geo_summary_variable = 'Neighborhood'  # or 'Tract'

# set path to download csvs
download_path = r"./output"

# sets geo variables based on above choice
if geo_summary_variable == 'Tract':
    tract_lookup = tract_tr_lookup
    geo_path = r'./shps/tracts/tracts_sf.shp'
    geo_merge_variable = 'tractce'
elif geo_summary_variable == 'Neighborhood':
    tract_lookup = tract_nb_lookup
    geo_path = r'./shps/neighborhoods/neighborhoods5/neighborhoods5.shp'
    geo_merge_variable = 'nhood'

### Run Socioeconomic Profiles Calcs

#### Total Population

In [None]:
# run functions to calculate all stats and convert calc dictionary to pandas dataframe
all_calc_data = calc_socio_economic_data(df, tract_lookup)
print(len(all_calc_data))
df_all_calcs = pd.DataFrame.from_dict(all_calc_data).reset_index()
df_all_calcs.rename(columns = {'index':'Attribute'}, inplace = True) 
df_all_calcs.head()


#### Race/Ethnicity Groups

In [None]:
# run functions to calculate all race/ethnicity stats and convert the calc dictionary to a pandas dataframe (combined with df_all_calcs in the next step)
race_calc_data = calc_socio_economic_data_race(df, tract_lookup)
df_race_calcs = pd.DataFrame.from_dict(race_calc_data).reset_index()
df_race_calcs.rename(columns = {'index':'Attribute'}, inplace = True)
df_race_calcs.head()



#### Combine the two data groups & Arrange data by geographies

In [None]:
df_all_calcs_fin = pd.concat([df_all_calcs, df_race_calcs]).reset_index(drop = True)
df_all_calcs_fin.head()

In [None]:
# transpose dataset so that each row represents one geographic area
df_all_calcs_fin_tp = df_all_calcs_fin.T.reset_index()
df_all_calcs_fin_tp.columns = df_all_calcs_fin_tp.iloc[0]
df_all_calcs_fin_tp = df_all_calcs_fin_tp[1:].rename(columns={'Attribute': geo_summary_variable})
df_all_calcs_fin_tp.head()

In [None]:
df_all_calcs_fin_tp = df_all_calcs_fin_tp.sort_values(by=[geo_summary_variable]).reset_index(drop = True)
df_all_calcs_fin_tp.head()
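
The cells above turn a dictionary keyed by geography into a table with one row per geographic area. A minimal sketch of the same reshaping on made-up values follows. It uses `set_index` before transposing rather than promoting the first row to a header, which is an alternative route to the same layout; the area names and numbers are invented.

```python
from collections import defaultdict

import pandas as pd

# Made-up calc dictionary: {geography: {attribute: value}}.
calc = defaultdict(dict)
calc["Area A"]["Total Population"] = 100
calc["Area A"]["Percent Female"] = 0.5
calc["Area B"]["Total Population"] = 200
calc["Area B"]["Percent Female"] = 0.25

# Attributes as rows, geographies as columns (same shape as df_all_calcs).
wide = pd.DataFrame.from_dict(calc).reset_index().rename(columns={"index": "Attribute"})

# One row per geography, one column per attribute.
tidy = wide.set_index("Attribute").T.rename_axis("Neighborhood").reset_index()
tidy.columns.name = None
tidy = tidy.sort_values(by=["Neighborhood"]).reset_index(drop=True)

print(tidy)
```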

#### Midpoint export

In [None]:
# export dataset to csv
df_all_calcs_fin_tp.to_csv(os.path.join(download_path,geo_summary_variable+"_"+'profiles_by_geo_{}.csv'.format(year)), index = False)

## Repeat: 10 years ago

The code below repeats the data compilation process above using data from `year_past`, which is set at the beginning of this notebook. It needs lookup tables whose attribute IDs correspond to the past ACS data:

- [attribute_lookup_past.csv]() 
- [race_attribute_lookup_past.csv]()
- [median_ranges_past.csv]()
- [race_median_ranges_past.csv]()

### Census Attribute IDs

In [None]:
# Create list of attribute IDs from attribute_lookup_past_sh.csv
attribute_lookup_df = pd.read_csv(r'./lookup_tables/attribute_lookup_past_sh.csv', dtype=str)
attribute_ids_extracted = attribute_lookup_df['attribute_id'].tolist()
attribute_ids = []
for attribute_id in attribute_ids_extracted:
    attribute_ids.extend(attribute_id.split(", "))
attribute_ids = list(set([x+"E" for x in attribute_ids]))
attribute_ids[:10]

In [None]:
# Create list of attribute IDs from race_attribute_lookup_past_sh.csv
attribute_lookup_df = pd.read_csv(r'./lookup_tables/race_attribute_lookup_past_sh.csv', dtype=str)
attribute_ids_extracted = attribute_lookup_df['attribute_id'].tolist()
attribute_ids = []
for attribute_id in attribute_ids_extracted:
    attribute_ids.extend(attribute_id.split(", "))
attribute_ids = list(set([x+"E" for x in attribute_ids]))
attribute_ids[:10]
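
The two cells above split comma-separated lookup cells into individual attribute IDs, append the `E` (estimate) suffix, and drop duplicates. A minimal sketch of that parsing on made-up cell values follows; `attribute_id_cells` is a hypothetical stand-in for the `attribute_id` column.

```python
# Hypothetical cells from an 'attribute_id' column; some hold several IDs.
attribute_id_cells = ["B01001_001", "B01001_003, B01001_027", "B01001_001"]

# Split on commas, trim whitespace, add the estimate suffix, and de-duplicate.
attribute_ids = sorted({
    part.strip() + "E"
    for cell in attribute_id_cells
    for part in cell.split(",")
    if part.strip()
})

print(attribute_ids)
```

Sorting the set gives a stable order between runs, and stripping whitespace tolerates cells that separate IDs with a bare comma.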

In [None]:
# import median tables from the median_ranges_past csv and add zero-filled 'households' and 'cumulative_total' columns
range_df = pd.read_csv(r'./lookup_tables/median_ranges_past.csv')
range_df['households']=0
range_df['cumulative_total']=0
range_df.head()

In [None]:
# test simple
test = calc_socio_economic_data_simple(df, tract_lookup)

### Attribute Calculation Function: Total Population - Past year

In [None]:
def calc_socio_economic_data_past(df, tract_lookup):
    # create empty dictionary to add calculated attribute information to
    all_calc_data = defaultdict(dict) 
    # calculate all stats for each neighborhood
    for nb_name, tracts in tract_lookup.items():
        # extract attribute information for tracts associated with a neighborhood
        tract_df = df[df['tract'].isin(tracts)]
        # build dictionary with all stats for a neighborhood
        all_calc_data_nb = all_calc_data[nb_name]
        # population
        all_calc_data_nb["Total Population"] = calc_sum(tract_df, 'B01001_001E')
        all_calc_data_nb["Group Quarter Population"] = calc_sum(tract_df, 'B26001_001E')
        all_calc_data_nb["Percent Female"] = calc_normalized(tract_df, 'B01001_026E', 'B01001_001E')
        # household stats
        all_calc_data_nb["Housholds"] = calc_sum(tract_df, 'B11001_001E')
        all_calc_data_nb["Family Households"] = calc_normalized(tract_df, 'B11001_002E', 'B11001_001E')
        all_calc_data_nb["Non-Family Households"] = calc_normalized(tract_df, 'B11001_007E', 'B11001_001E')
        all_calc_data_nb["Single Person Households"] = calc_normalized(tract_df, 'B11001_008E', 'B11001_001E')
        all_calc_data_nb["Households with Children"] = calc_normalized(tract_df, 'B11005_002E', 'B11001_001E')
        all_calc_data_nb["Households with 60 years and older"] = calc_normalized(tract_df, 'B11006_002E', 'B11001_001E')
        all_calc_data_nb["Senior (65+) living alone"] = np.nan
        all_calc_data_nb["Median Household Size"] = calc_normalized(tract_df, 'B11002_001E', 'B11001_001E')
        all_calc_data_nb["Median Family Household Size"] = calc_normalized(tract_df, 'B11002_002E', 'B11001_002E')
        
        # age
        ## from Jason's original 
        all_calc_data_nb["0-4 Years"] = calc_sum_normalized(tract_df, ['B01001_003E', 'B01001_027E'], 'B01001_001E')
        all_calc_data_nb["5-17 Years"] = calc_sum_normalized(tract_df, ['B01001_004E', 'B01001_005E', 'B01001_006E', 'B01001_028E', 'B01001_029E', 'B01001_030E'],'B01001_001E')
        all_calc_data_nb["18-34 Years"] = calc_sum_normalized(tract_df, ['B01001_007E','B01001_008E','B01001_009E', 'B01001_010E', 'B01001_011E', 'B01001_012E','B01001_031E','B01001_032E','B01001_033E','B01001_034E','B01001_035E','B01001_036E'], 'B01001_001E')
        all_calc_data_nb["35-59 Years"] = calc_sum_normalized(tract_df, ['B01001_013E', 'B01001_014E', 'B01001_015E', 'B01001_016E', 'B01001_017E', 'B01001_037E', 'B01001_038E', 'B01001_039E', 'B01001_040E', 'B01001_041E'], 'B01001_001E')
        all_calc_data_nb["60-64 Years"] = calc_sum_normalized(tract_df, ['B01001_018E', 'B01001_019E', 'B01001_042E', 'B01001_043E'], 'B01001_001E')
        all_calc_data_nb["65 Years or older"] = calc_sum_normalized(tract_df, ['B01001_020E', 'B01001_021E', 'B01001_022E', 'B01001_023E', 'B01001_024E', 'B01001_025E', 'B01001_044E', 'B01001_045E', 'B01001_046E', 'B01001_047E', 'B01001_048E', 'B01001_049E'], 'B01001_001E') 
        # educational attainment
        all_calc_data_nb["Less than high school degree"] = calc_sum_normalized(tract_df, ['B15002_003E', 'B15002_004E', 'B15002_005E', 'B15002_006E', 'B15002_007E', 'B15002_008E', 'B15002_009E', 'B15002_010E', 'B15002_020E', 'B15002_021E', 'B15002_022E', 'B15002_023E', 'B15002_024E', 'B15002_025E', 'B15002_026E', 'B15002_027E'], 'B15002_001E')
        all_calc_data_nb["High school degree or equivalent"] = calc_sum_normalized(tract_df, ['B15002_011E', 'B15002_028E'], 'B15002_001E')
        all_calc_data_nb["Some college or Associates degree"] = calc_sum_normalized(tract_df, ['B15002_012E', 'B15002_013E', 'B15002_014E', 'B15002_029E', 'B15002_030E', 'B15002_031E'], 'B15002_001E')
        all_calc_data_nb["Bachelors degree or higher"] = calc_sum_normalized(tract_df, ['B15002_015E', 'B15002_016E', 'B15002_017E', 'B15002_018E', 'B15002_032E', 'B15002_033E', 'B15002_034E', 'B15002_035E'], 'B15002_001E')
        # nativity
        all_calc_data_nb["Foreign Born"] = calc_normalized(tract_df, 'B05002_013E', 'B05002_001E')
        all_calc_data_nb["Naturalized"] = calc_normalized(tract_df, 'B05002_014E', 'B05002_001E')
        
        # language spoken at home
        all_calc_data_nb["English Only"] = calc_sum_normalized(tract_df, ['B16007_003E', 'B16007_009E', 'B16007_015E'], 'B16007_001E')
        all_calc_data_nb["Spanish Only"] = calc_sum_normalized(tract_df, ['B16007_004E', 'B16007_010E', 'B16007_016E'], 'B16007_001E')
        all_calc_data_nb["Asian/Pacific Islander"] = calc_sum_normalized(tract_df, ['B16007_006E', 'B16007_012E', 'B16007_018E'], 'B16007_001E')
        all_calc_data_nb["Other European Languages"] = calc_sum_normalized(tract_df, ['B16007_005E', 'B16007_011E', 'B16007_017E'], 'B16007_001E')
        all_calc_data_nb["Other Languages"] = calc_sum_normalized(tract_df, ['B16007_007E', 'B16007_013E', 'B16007_019E'], 'B16007_001E')
       
        # linguistic isolation
        all_calc_data_nb["% of All Households"] = calc_sum_normalized(tract_df, ['B16003_002E', 'B16003_008E'], 'B16004_001E')
        all_calc_data_nb["% of Spanish-Speaking Households"] = calc_sum_normalized(tract_df, ['B16003_004E', 'B16003_009E'], 'B16004_001E')
        all_calc_data_nb["% of Asian-Speaking Households"] = calc_sum_normalized(tract_df, ['B16003_006E', 'B16003_011E'], 'B16004_001E')
        all_calc_data_nb["% of Other European-Speaking Households"] = calc_sum_normalized(tract_df, ['B16003_005E', 'B16003_010E'], 'B16004_001E')
        all_calc_data_nb["% of Households Speaking Other Languages"] = calc_sum_normalized(tract_df, ['B16003_007E', 'B16003_012E'], 'B16004_001E')
        # English proficiency
        all_calc_data_nb["% Speaking English Very Well of Total"] = calc_sum_normalized(tract_df, ['B06007_002E', 'B06007_004E', 'B06007_007E'], 'B06007_001E')
        all_calc_data_nb["% Speaking English Very Well of Foreign Born"] = calc_sum_normalized(tract_df, ['B06007_034E', 'B06007_036E', 'B06007_039E'], 'B06007_033E')
        # housing
        all_calc_data_nb["Total Number of Units"] = calc_sum(tract_df, 'B25001_001E')
        all_calc_data_nb["Median Year Structure Built"] = calc_median(tract_df, range_df, 'median_year_structure_built')
        all_calc_data_nb["Units built in 2010 or later"] = calc_sum_normalized(tract_df, ['B25034_002E', 'B25034_003E'], 'B25034_001E')
        # occupied share = occupied units (B25003_001E) / all housing units (B25001_001E)
        all_calc_data_nb["Occupied Units"] = calc_normalized(tract_df, 'B25003_001E', 'B25001_001E')
        all_calc_data_nb["Occupied Units (Number)"] = calc_sum(tract_df, 'B25003_001E')
        all_calc_data_nb["Owner Occupied"] = calc_normalized(tract_df, 'B25007_002E', 'B25007_001E')
        all_calc_data_nb["Renter Occupied"] = calc_normalized(tract_df, 'B25007_012E', 'B25007_001E')
        #all_calc_data_nb["Vacant Units"] = calc_normalized(tract_df, 'B25004_001E', 'B25001_001E')
        all_calc_data_nb["Percent in Same House Last Year"] = calc_normalized(tract_df, 'B07001_017E', 'B07001_001E')
        #all_calc_data_nb["Percent Abroad Last Year"] = calc_normalized(tract_df, 'B07003_016E', 'B07003_001E')
        all_calc_data_nb["Percent Housing Overcrowding (> 1 person per room)"] = calc_sum_normalized(tract_df, ['B25014_005E', 'B25014_006E', 'B25014_007E', 'B25014_011E', 'B25014_012E', 'B25014_013E'], 'B25014_001E')
        # unit size - number 
        all_calc_data_nb["No Bedroom (Number)"] = calc_sum(tract_df,'B25041_002E')
        all_calc_data_nb["1 Bedroom (Number)"] = calc_sum(tract_df, 'B25041_003E')
        all_calc_data_nb["2 Bedrooms (Number)"] = calc_sum(tract_df, 'B25041_004E')
        all_calc_data_nb["3 Bedrooms (Number)"] = calc_sum(tract_df, 'B25041_005E')
        all_calc_data_nb["4 Bedrooms (Number)"] = calc_sum(tract_df, 'B25041_006E')
        all_calc_data_nb["5 or More Bedrooms (Number)"] = calc_sum(tract_df, 'B25041_007E')
        # unit size
        all_calc_data_nb["No Bedroom"] = calc_normalized(tract_df,'B25041_002E', 'B25041_001E')
        all_calc_data_nb["1 Bedroom"] = calc_normalized(tract_df, 'B25041_003E', 'B25041_001E')
        all_calc_data_nb["2 Bedrooms"] = calc_normalized(tract_df, 'B25041_004E', 'B25041_001E')
        all_calc_data_nb["3-4 Bedrooms"] = calc_sum_normalized(tract_df, ['B25041_005E', 'B25041_006E'], 'B25041_001E')
        all_calc_data_nb["5 or More Bedrooms"] = calc_normalized(tract_df, 'B25041_007E', 'B25041_001E')
        # housing prices
        all_calc_data_nb["Median Rent"] = calc_median(tract_df, range_df, 'median_rent')
        all_calc_data_nb["Median Contract Rent"] = calc_median(tract_df, range_df, 'median_rent_contract')
        all_calc_data_nb["Median Rent as % of Household Income"] = calc_median(tract_df, range_df, 'median_rent_percent_of_income')
        all_calc_data_nb["Median Home Value"] = calc_median(tract_df, range_df, 'median_home_value')
        # vehicles available
        all_calc_data_nb["Vehicles Available"] = calc_sum(tract_df, 'B25046_001E')
        all_calc_data_nb["Vehicles Homeowners"] = calc_normalized(tract_df, 'B25046_002E', 'B25046_001E')
        all_calc_data_nb["Vehicles Renters"] = calc_normalized(tract_df, 'B25046_003E', 'B25046_001E')
        all_calc_data_nb["Vehicles Per Capita"] = calc_normalized(tract_df, 'B25046_001E', 'B01001_001E')
        all_calc_data_nb["Households with no Vehicle"] = calc_sum_normalized(tract_df, ['B25044_003E', 'B25044_010E'], 'B25044_001E')
        all_calc_data_nb["Percent of Homeowning Households"] = calc_normalized(tract_df, 'B25044_003E', 'B25044_002E')
        all_calc_data_nb["Percent of Renting Households"] = calc_normalized(tract_df, 'B25044_010E', 'B25044_009E')
        # income
        all_calc_data_nb["Median Household Income"] = calc_median(tract_df, range_df, 'median_household_income')
        all_calc_data_nb["Median Family Income"] = calc_median(tract_df, range_df, 'median_family_income')
        all_calc_data_nb["Per Capita Income"] = calc_normalized(tract_df, 'B19025_001E', 'B01001_001E')
        all_calc_data_nb["Percent in Poverty"] = calc_normalized(tract_df, 'B17001_002E', 'B17001_001E')
        all_calc_data_nb["Household Income (less than 25K)"] = calc_sum_normalized(tract_df, ['B19001_002E', 'B19001_003E', 'B19001_004E', 'B19001_005E'], 'B19001_001E')
        all_calc_data_nb["Household Income (25K-50K)"] = calc_sum_normalized(tract_df, ['B19001_006E', 'B19001_007E', 'B19001_008E', 'B19001_009E', 'B19001_010E'], 'B19001_001E')
        all_calc_data_nb["Household Income (5OK-75K)"] = calc_sum_normalized(tract_df, ['B19001_011E', 'B19001_012E'], 'B19001_001E')
        all_calc_data_nb["Household Income (75K-100K)"] = calc_normalized(tract_df, 'B19001_013E', 'B19001_001E')
        all_calc_data_nb["Household Income (100K-125K)"] = calc_normalized(tract_df, 'B19001_014E', 'B19001_001E')
        all_calc_data_nb["Household Income (more than 125K)"] = calc_sum_normalized(tract_df, ['B19001_015E', 'B19001_016E', 'B19001_017E'], 'B19001_001E')

        # employment
        all_calc_data_nb["Unemployment Rate"] = calc_sum_sum_normalized(tract_df, ['B23001_094E', 'B23001_101E', 'B23001_108E', 'B23001_115E', 'B23001_122E', 'B23001_129E', 'B23001_136E', 'B23001_143E', 'B23001_150E', 'B23001_157E', 'B23001_162E', 'B23001_167E', 'B23001_172E', 'B23001_090E', 'B23001_097E', 'B23001_104E', 'B23001_111E', 'B23001_118E', 'B23001_125E', 'B23001_132E', 'B23001_139E', 'B23001_146E', 'B23001_153E', 'B23001_160E', 'B23001_165E', 'B23001_170E', 'B23001_008E', 'B23001_015E', 'B23001_022E', 'B23001_029E', 'B23001_036E', 'B23001_043E', 'B23001_050E', 'B23001_057E', 'B23001_064E', 'B23001_071E', 'B23001_076E', 'B23001_081E', 'B23001_086E', 'B23001_004E', 'B23001_011E', 'B23001_018E', 'B23001_025E', 'B23001_032E', 'B23001_039E', 'B23001_046E', 'B23001_053E', 'B23001_060E', 'B23001_067E', 'B23001_074E', 'B23001_079E', 'B23001_084E'], ['B23001_088E', 'B23001_002E']) 
        all_calc_data_nb["Percent Unemployment Female"] = calc_sum_normalized(tract_df, ['B23001_094E', 'B23001_101E', 'B23001_108E', 'B23001_115E', 'B23001_122E', 'B23001_129E', 'B23001_136E', 'B23001_143E', 'B23001_150E', 'B23001_157E', 'B23001_162E', 'B23001_167E', 'B23001_172E', 'B23001_090E', 'B23001_097E', 'B23001_104E', 'B23001_111E', 'B23001_118E', 'B23001_125E', 'B23001_132E', 'B23001_139E', 'B23001_146E', 'B23001_153E', 'B23001_160E', 'B23001_165E', 'B23001_170E'], 'B23001_088E')
        all_calc_data_nb["Percent Unemployment Male"] = calc_sum_normalized(tract_df, ['B23001_008E', 'B23001_015E', 'B23001_022E', 'B23001_029E', 'B23001_036E', 'B23001_043E', 'B23001_050E', 'B23001_057E', 'B23001_064E', 'B23001_071E', 'B23001_076E', 'B23001_081E', 'B23001_086E', 'B23001_004E', 'B23001_011E', 'B23001_018E', 'B23001_025E', 'B23001_032E', 'B23001_039E', 'B23001_046E', 'B23001_053E', 'B23001_060E', 'B23001_067E', 'B23001_074E', 'B23001_079E', 'B23001_084E'], 'B23001_002E')
        #all_calc_data_nb["Employed Residents"] = calc_sum(tract_df, 'C24050_001E')
        #journey to work
        all_calc_data_nb["Workers 16 Years and Older"] = calc_sum(tract_df, 'B08006_001E')
        all_calc_data_nb["Car"] = calc_normalized(tract_df, 'B08006_002E', 'B08006_001E')
        all_calc_data_nb["Drove Alone"] = calc_normalized(tract_df, 'B08006_003E', 'B08006_001E')
        all_calc_data_nb["Carpooled"] = calc_normalized(tract_df, 'B08006_004E', 'B08006_001E')
        all_calc_data_nb["Transit"] = calc_normalized(tract_df, 'B08006_008E', 'B08006_001E')
        all_calc_data_nb["Bike"] = calc_normalized(tract_df, 'B08006_014E', 'B08006_001E')
        all_calc_data_nb["Walk"] = calc_normalized(tract_df, 'B08006_015E', 'B08006_001E')
        all_calc_data_nb["Other Journey Type"] = calc_normalized(tract_df, 'B08006_016E', 'B08006_001E')
        all_calc_data_nb["Worked at Home"] = calc_normalized(tract_df, 'B08006_017E', 'B08006_001E')
        # population density
        all_calc_data_nb["Population Density per Acre"] = calc_sum(tract_df, 'B01001_001E')
       
    #return calc dictionary
    return all_calc_data

### Attribute value calculation function: Race/Ethnicity Group - Past year

In [None]:
def calc_socio_economic_data_race_past(df, tract_lookup):
    racecode = {
        "White Alone": "A",
        "Black or African American Alone": "B",
        "American Indian And Alaska Native Alone": "C",
        "Asian Alone": "D",
        # "Native Hawaiian and Other Pacific Islander Alone": "E",
        "White Non-Hispanic": "H",
        "Hispanic or Latino": "I"
    }

    all_calc_data = defaultdict(dict) 
    
    # calculate all stats for each neighborhood
    for nb_name, tracts in tract_lookup.items():
        # extract attribute information for tracts associated with a neighborhood
        tract_df = df[df['tract'].isin(tracts)]
        # build dictionary with all stats for a neighborhood
        all_calc_data_nb = all_calc_data[nb_name]
        
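        # NOTE: `race` and `code` are not used in the body below, so every pass
        # recomputes the same all-population values under the same keys
        for race, code in racecode.items():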
            
            # population
            all_calc_data_nb["Total Population"] = calc_sum(tract_df, 'B01001_001E')
            all_calc_data_nb["Group Quarter Population"] = calc_sum(tract_df, 'B26001_001E')
            all_calc_data_nb["Percent Female"] = calc_normalized(tract_df, 'B01001_026E', 'B01001_001E')
            # household stats
            all_calc_data_nb["Housholds"] = calc_sum(tract_df, 'B11001_001E')
            all_calc_data_nb["Family Households"] = calc_normalized(tract_df, 'B11001_002E', 'B11001_001E')
            all_calc_data_nb["Non-Family Households"] = calc_normalized(tract_df, 'B11001_007E', 'B11001_001E')
            all_calc_data_nb["Single Person Households"] = calc_normalized(tract_df, 'B11001_008E', 'B11001_001E')
            #all_calc_data_nb["Households with Children"] = calc_normalized(tract_df, 'B11005_002E', 'B11001_001E')
            #all_calc_data_nb["Households with 60 years and older"] = calc_normalized(tract_df, 'B11006_002E', 'B11001_001E')
            all_calc_data_nb["Senior (65+) living alone"] = np.nan
            all_calc_data_nb["Average Household Size"] = calc_normalized(tract_df, 'B11002_001E', 'B11001_001E')
            all_calc_data_nb["Average Family Household Size"] = calc_normalized(tract_df, 'B11002_002E', 'B11001_002E')
          
            # age
            ## from Jason's original 
            all_calc_data_nb["0-4 Years"] = calc_sum_normalized(tract_df, ['B01001_003E', 'B01001_027E'], 'B01001_001E')
            all_calc_data_nb["5-17 Years"] = calc_sum_normalized(tract_df, ['B01001_004E', 'B01001_005E', 'B01001_006E', 'B01001_028E', 'B01001_029E', 'B01001_030E'],'B01001_001E')
            all_calc_data_nb["18-34 Years"] = calc_sum_normalized(tract_df, ['B01001_007E','B01001_008E','B01001_009E', 'B01001_010E', 'B01001_011E', 'B01001_012E','B01001_031E','B01001_032E','B01001_033E','B01001_034E','B01001_035E','B01001_036E'], 'B01001_001E')
            all_calc_data_nb["35-59 Years"] = calc_sum_normalized(tract_df, ['B01001_013E', 'B01001_014E', 'B01001_015E', 'B01001_016E', 'B01001_017E', 'B01001_037E', 'B01001_038E', 'B01001_039E', 'B01001_040E', 'B01001_041E'], 'B01001_001E')
            all_calc_data_nb["60-64 Years"] = calc_sum_normalized(tract_df, ['B01001_018E', 'B01001_019E', 'B01001_042E', 'B01001_043E'], 'B01001_001E')
            all_calc_data_nb["65 Years or older"] = calc_sum_normalized(tract_df, ['B01001_020E', 'B01001_021E', 'B01001_022E', 'B01001_023E', 'B01001_024E', 'B01001_025E', 'B01001_044E', 'B01001_045E', 'B01001_046E', 'B01001_047E', 'B01001_048E', 'B01001_049E'], 'B01001_001E') 
            # educational attainment
            all_calc_data_nb["Less than high school degree"] = calc_sum_normalized(tract_df, ['B15002_003E', 'B15002_004E', 'B15002_005E', 'B15002_006E', 'B15002_007E', 'B15002_008E', 'B15002_009E', 'B15002_010E', 'B15002_020E', 'B15002_021E', 'B15002_022E', 'B15002_023E', 'B15002_024E', 'B15002_025E', 'B15002_026E', 'B15002_027E'], 'B15002_001E')
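            # B15002_011E / B15002_028E: male / female high school graduates (027E is female 12th grade, no diploma)
            all_calc_data_nb["High school degree or equivalent"] = calc_sum_normalized(tract_df, ['B15002_011E', 'B15002_028E'], 'B15002_001E')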
            all_calc_data_nb["Some college or Associates degree"] = calc_sum_normalized(tract_df, ['B15002_012E', 'B15002_013E', 'B15002_014E', 'B15002_029E', 'B15002_030E', 'B15002_031E'], 'B15002_001E')
            all_calc_data_nb["Bachelors degree or higher"] = calc_sum_normalized(tract_df, ['B15002_015E', 'B15002_016E', 'B15002_017E', 'B15002_018E', 'B15002_032E', 'B15002_033E', 'B15002_034E', 'B15002_035E'], 'B15002_001E')
            # nativity
            all_calc_data_nb["Foreign Born"] = calc_normalized(tract_df, 'B05002_013E', 'B05002_001E')
            all_calc_data_nb["Naturalized"] = calc_normalized(tract_df, 'B05002_014E', 'B05002_001E')
            # language spoken at home
            all_calc_data_nb["English Only"] = calc_sum_normalized(tract_df, ['B16007_003E', 'B16007_009E', 'B16007_015E'], 'B16007_001E')
            all_calc_data_nb["Spanish Only"] = calc_sum_normalized(tract_df, ['B16007_004E', 'B16007_010E', 'B16007_016E'], 'B16007_001E')
            all_calc_data_nb["Asian/Pacific Islander"] = calc_sum_normalized(tract_df, ['B16007_006E', 'B16007_012E', 'B16007_018E'], 'B16007_001E')
            all_calc_data_nb["Other European Languages"] = calc_sum_normalized(tract_df, ['B16007_005E', 'B16007_011E', 'B16007_017E'], 'B16007_001E')
            all_calc_data_nb["Other Languages"] = calc_sum_normalized(tract_df, ['B16007_007E', 'B16007_013E', 'B16007_019E'], 'B16007_001E')
            # linguistic isolation
            all_calc_data_nb["% of All Households"] = calc_sum_normalized(tract_df, ['B16003_002E', 'B16003_008E'], 'B16004_001E')
            all_calc_data_nb["% of Spanish-Speaking Households"] = calc_sum_normalized(tract_df, ['B16003_004E', 'B16003_009E'], 'B16004_001E')
            all_calc_data_nb["% of Asian-Speaking Households"] = calc_sum_normalized(tract_df, ['B16003_006E', 'B16003_011E'], 'B16004_001E')
            all_calc_data_nb["% of Other European-Speaking Households"] = calc_sum_normalized(tract_df, ['B16003_005E', 'B16003_010E'], 'B16004_001E')
            all_calc_data_nb["% of Households Speaking Other Languages"] = calc_sum_normalized(tract_df, ['B16003_007E', 'B16003_012E'], 'B16004_001E')
            # English proficiency
            all_calc_data_nb["% Speaking English Very Well of Total"] = calc_sum_normalized(tract_df, ['B06007_002E', 'B06007_004E', 'B06007_007E'], 'B06007_001E')
            all_calc_data_nb["% Speaking English Very Well of Foreign Born"] = calc_sum_normalized(tract_df, ['B06007_034E', 'B06007_036E', 'B06007_039E'], 'B06007_033E')
            # housing
            all_calc_data_nb["Total Number of Units"] = calc_sum(tract_df, 'B25001_001E')
            #all_calc_data_nb["Median Year Structure Built"] = calc_median(tract_df, range_df, 'median_year_structure_built')
            all_calc_data_nb["Units built in 2010 or later"] = calc_sum_normalized(tract_df, ['B25034_002E', 'B25034_003E'], 'B25034_001E')
            all_calc_data_nb["Occupied Units"] = calc_normalized(tract_df, 'B25003_001E', 'B25007_001E')
            all_calc_data_nb["Occupied Units (Number)"] = calc_sum(tract_df, 'B25007_002E')
            all_calc_data_nb["Owner Occupied"] = calc_normalized(tract_df, 'B25007_002E', 'B25007_001E')
            all_calc_data_nb["Renter Occupied"] = calc_normalized(tract_df, 'B25007_012E', 'B25007_001E')
            all_calc_data_nb["Vacant Units"] = calc_normalized(tract_df, 'B25004_001E', 'B25001_001E')
            all_calc_data_nb["Percent in Same House Last Year"] = calc_normalized(tract_df, 'B07001_017E', 'B07001_001E')
            #all_calc_data_nb["Percent Abroad Last Year"] = calc_normalized(tract_df, 'B07003_016E', 'B07003_001E')
            all_calc_data_nb["Percent Housing Overcrowding (> 1 person per room)"] = calc_sum_normalized(tract_df, ['B25014_005E', 'B25014_006E', 'B25014_007E', 'B25014_011E', 'B25014_012E', 'B25014_013E'], 'B25014_001E')
            # unit size - number 
            all_calc_data_nb["No Bedroom (Number)"] = calc_sum(tract_df,'B25041_002E')
            all_calc_data_nb["1 Bedroom (Number)"] = calc_sum(tract_df, 'B25041_003E')
            all_calc_data_nb["2 Bedrooms (Number)"] = calc_sum(tract_df, 'B25041_004E')
            all_calc_data_nb["3 Bedrooms (Number)"] = calc_sum(tract_df, 'B25041_005E')
            all_calc_data_nb["4 Bedrooms (Number)"] = calc_sum(tract_df, 'B25041_006E')
            all_calc_data_nb["5 or More Bedrooms (Number)"] = calc_sum(tract_df, 'B25041_007E')
            # unit size
            all_calc_data_nb["No Bedroom"] = calc_normalized(tract_df,'B25041_002E', 'B25041_001E')
            all_calc_data_nb["1 Bedroom"] = calc_normalized(tract_df, 'B25041_003E', 'B25041_001E')
            all_calc_data_nb["2 Bedrooms"] = calc_normalized(tract_df, 'B25041_004E', 'B25041_001E')
            all_calc_data_nb["3-4 Bedrooms"] = calc_sum_normalized(tract_df, ['B25041_005E', 'B25041_006E'], 'B25041_001E')
            all_calc_data_nb["5 or More Bedrooms"] = calc_normalized(tract_df, 'B25041_007E', 'B25041_001E')
            # housing prices
            all_calc_data_nb["Median Rent"] = calc_median(tract_df, range_df, 'median_rent')
            #all_calc_data_nb["Median Contract Rent"] = calc_median(tract_df, range_df, 'median_rent_contract')
            all_calc_data_nb["Median Rent as % of Household Income"] = calc_median(tract_df, range_df, 'median_rent_percent_of_income')
            all_calc_data_nb["Median Home Value"] = calc_median(tract_df, range_df, 'median_home_value')
            # vehicles available
            all_calc_data_nb["Vehicles Available"] = calc_sum(tract_df, 'B25046_001E')
            all_calc_data_nb["Vehicles Homeowners"] = calc_normalized(tract_df, 'B25046_002E', 'B25046_001E')
            all_calc_data_nb["Vehicles Renters"] = calc_normalized(tract_df, 'B25046_003E', 'B25046_001E')
            all_calc_data_nb["Vehicles Per Capita"] = calc_normalized(tract_df, 'B25046_001E', 'B01001_001E')
            all_calc_data_nb["Households with no Vehicle"] = calc_sum_normalized(tract_df, ['B25044_003E', 'B25044_010E'], 'B25044_001E')
            all_calc_data_nb["Percent of Homeowning Households"] = calc_normalized(tract_df, 'B25044_003E', 'B25044_002E')
            all_calc_data_nb["Percent of Renting Households"] = calc_normalized(tract_df, 'B25044_010E', 'B25044_009E')
            # income
            all_calc_data_nb["Median Household Income (B19013_001)"] = calc_median(tract_df, range_df, 'median_household_income')
            all_calc_data_nb["Median Family Income (B19113_001)"] = calc_median(tract_df, range_df, 'median_family_income')
            all_calc_data_nb["Per Capita Income"] = calc_normalized(tract_df, 'B19025_001E', 'B01001_001E')
            all_calc_data_nb["Percent in Poverty"] = calc_normalized(tract_df, 'B17001_002E', 'B17001_001E')
            all_calc_data_nb["Household Income (less than 25K)"] = calc_sum_normalized(tract_df, ['B19001_002E', 'B19001_003E', 'B19001_004E', 'B19001_005E'], 'B19001_001E')
            all_calc_data_nb["Household Income (25K-50K)"] = calc_sum_normalized(tract_df, ['B19001_006E', 'B19001_007E', 'B19001_008E', 'B19001_009E', 'B19001_010E'], 'B19001_001E')
            all_calc_data_nb["Household Income (5OK-75K)"] = calc_sum_normalized(tract_df, ['B19001_011E', 'B19001_012E'], 'B19001_001E')
            all_calc_data_nb["Household Income (75K-100K)"] = calc_normalized(tract_df, 'B19001_013E', 'B19001_001E')
            all_calc_data_nb["Household Income (100K-125K)"] = calc_normalized(tract_df, 'B19001_014E', 'B19001_001E')
            all_calc_data_nb["Household Income (more than 125K)"] = calc_sum_normalized(tract_df, ['B19001_015E', 'B19001_016E', 'B19001_017E'], 'B19001_001E')

            # employment
            all_calc_data_nb["Unemployment Rate"] = calc_sum_sum_normalized(tract_df, ['B23001_094E', 'B23001_101E', 'B23001_108E', 'B23001_115E', 'B23001_122E', 'B23001_129E', 'B23001_136E', 'B23001_143E', 'B23001_150E', 'B23001_157E', 'B23001_162E', 'B23001_167E', 'B23001_172E', 'B23001_090E', 'B23001_097E', 'B23001_104E', 'B23001_111E', 'B23001_118E', 'B23001_125E', 'B23001_132E', 'B23001_139E', 'B23001_146E', 'B23001_153E', 'B23001_160E', 'B23001_165E', 'B23001_170E', 'B23001_008E', 'B23001_015E', 'B23001_022E', 'B23001_029E', 'B23001_036E', 'B23001_043E', 'B23001_050E', 'B23001_057E', 'B23001_064E', 'B23001_071E', 'B23001_076E', 'B23001_081E', 'B23001_086E', 'B23001_004E', 'B23001_011E', 'B23001_018E', 'B23001_025E', 'B23001_032E', 'B23001_039E', 'B23001_046E', 'B23001_053E', 'B23001_060E', 'B23001_067E', 'B23001_074E', 'B23001_079E', 'B23001_084E'], ['B23001_088E', 'B23001_002E']) 
            all_calc_data_nb["Percent Unemployment Female"] = calc_sum_normalized(tract_df, ['B23001_094E', 'B23001_101E', 'B23001_108E', 'B23001_115E', 'B23001_122E', 'B23001_129E', 'B23001_136E', 'B23001_143E', 'B23001_150E', 'B23001_157E', 'B23001_162E', 'B23001_167E', 'B23001_172E', 'B23001_090E', 'B23001_097E', 'B23001_104E', 'B23001_111E', 'B23001_118E', 'B23001_125E', 'B23001_132E', 'B23001_139E', 'B23001_146E', 'B23001_153E', 'B23001_160E', 'B23001_165E', 'B23001_170E'], 'B23001_088E')
            all_calc_data_nb["Percent Unemployment Male"] = calc_sum_normalized(tract_df, ['B23001_008E', 'B23001_015E', 'B23001_022E', 'B23001_029E', 'B23001_036E', 'B23001_043E', 'B23001_050E', 'B23001_057E', 'B23001_064E', 'B23001_071E', 'B23001_076E', 'B23001_081E', 'B23001_086E', 'B23001_004E', 'B23001_011E', 'B23001_018E', 'B23001_025E', 'B23001_032E', 'B23001_039E', 'B23001_046E', 'B23001_053E', 'B23001_060E', 'B23001_067E', 'B23001_074E', 'B23001_079E', 'B23001_084E'], 'B23001_002E')
            #all_calc_data_nb["Employed Residents"] = calc_sum(tract_df, 'C24050_001E')
            # journey to work
            all_calc_data_nb["Workers 16 Years and Older"] = calc_sum(tract_df, 'B08006_001E')
            all_calc_data_nb["Car"] = calc_normalized(tract_df, 'B08006_002E', 'B08006_001E')
            all_calc_data_nb["Drove Alone"] = calc_normalized(tract_df, 'B08006_003E', 'B08006_001E')
            all_calc_data_nb["Carpooled"] = calc_normalized(tract_df, 'B08006_004E', 'B08006_001E')
            all_calc_data_nb["Transit"] = calc_normalized(tract_df, 'B08006_008E', 'B08006_001E')
            all_calc_data_nb["Bike"] = calc_normalized(tract_df, 'B08006_014E', 'B08006_001E')
            all_calc_data_nb["Walk"] = calc_normalized(tract_df, 'B08006_015E', 'B08006_001E')
            all_calc_data_nb["Other Journey Type"] = calc_normalized(tract_df, 'B08006_016E', 'B08006_001E')
            all_calc_data_nb["Worked at Home"] = calc_normalized(tract_df, 'B08006_017E', 'B08006_001E')
            # population density
            all_calc_data_nb["Population Density per Acre"] = calc_sum(tract_df, 'B01001_001E')

    #return calc dictionary
    return all_calc_data
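
The letters in `racecode` match the suffix the Census Bureau appends to a table ID for its race-iterated tables (for example, `B01001A` is the White Alone version of `B01001`). The sketch below shows how a race-specific variable name could be derived from a base name. `race_variable` is a hypothetical helper that does not exist in this notebook, and not every ACS table has race iterations, so each derived name would still need to be checked against the variable list for the survey year.

```python
def race_variable(base_variable, race_code):
    """Insert a race-iteration letter after the table ID.

    Hypothetical helper for illustration: 'B01001_001E' + 'A' -> 'B01001A_001E'.
    """
    table_id, suffix = base_variable.split("_", 1)
    return f"{table_id}{race_code}_{suffix}"


# Two entries copied from the racecode dictionary above.
racecode = {"White Alone": "A", "Hispanic or Latino": "I"}
for race, code in racecode.items():
    print(race, race_variable("B01001_001E", code))
```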

In [None]:
# set geo variables for api call
tract_code = "*"
state_code = "06"
county_code = "075"

# drop blank attribute ids, split the rest into groups of 45, run a census query for each, merge outputs into a single df
attribute_ids = [value for value in attribute_ids if str(value).strip()]
split_attribute_ids = [attribute_ids[i:i+45] for i in range(0, len(attribute_ids), 45)]

df=None
first = True
for ids in split_attribute_ids:
    census_url = build_census_url(tract_code, state_code, county_code, ids, year_past)
    returned_df = make_census_api_call(census_url)
    if first:
        df = returned_df
        first = False
    else:
        returned_df = returned_df.drop(columns=['state', 'county'])
        df = pd.merge(df, returned_df, on='tract', how='left')

df.head()
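
The Census API limits how many variables one request may ask for, which is why the cell above queries 45 attributes at a time. The chunking step can be sketched on its own as follows. `chunk_ids` is a hypothetical helper and the variable names are generated only for illustration.

```python
def chunk_ids(attribute_ids, size=45):
    """Drop blank ids, then split the rest into lists of at most `size` items."""
    cleaned = [a.strip() for a in attribute_ids if a and a.strip()]
    return [cleaned[i:i + size] for i in range(0, len(cleaned), size)]


# 100 made-up variable names plus one blank entry that should be ignored.
example_ids = [f"B01001_{n:03d}E" for n in range(1, 101)] + [" "]
chunks = chunk_ids(example_ids)
print([len(chunk) for chunk in chunks])
```

Each chunk can then be passed to `build_census_url` in turn, as the loop above does.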

In [None]:
# import geo_lookup csv
geo_lookup_df = pd.read_csv (r'./lookup_tables/geo_lookup_{}.csv'.format(year), dtype=str)

tract_tr_lookup = defaultdict(list)
tract_nb_lookup = defaultdict(list)
tract_sd_lookup = defaultdict(list)
all_tracts = list(set(df['tract'].tolist()))
# create tract lookup dictionary for tracts 
for i in all_tracts:
    tract_tr_lookup[i].append(i)
tract_tr_lookup["sf"] = all_tracts
# create tract lookup dictionary for neighborhoods
for i, j in zip(geo_lookup_df['neighborhood'], geo_lookup_df['tractid']):
    tract_nb_lookup[i].append(j)
tract_nb_lookup["sf"]= all_tracts
# create tract lookup dictionary for supervisor districts
for i, j in zip(geo_lookup_df['supervisor_district'], geo_lookup_df['tractid']):
    tract_sd_lookup[i].append(j)
tract_sd_lookup["sf"]= all_tracts
    
first_4 = list(tract_nb_lookup.items())[:4]
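
Each lookup maps a geography name to the list of tracts it contains, plus an `"sf"` entry that covers the whole city. A minimal sketch with made-up neighborhood names and tract ids standing in for `geo_lookup_df`:

```python
from collections import defaultdict

# Toy rows standing in for geo_lookup_df; the names and tract ids are made up.
rows = [("Mission", "020100"), ("Mission", "020200"), ("Sunset", "030100")]
all_tracts = sorted({tract for _, tract in rows})

tract_nb_lookup = defaultdict(list)
for neighborhood, tract in rows:
    tract_nb_lookup[neighborhood].append(tract)
tract_nb_lookup["sf"] = all_tracts  # citywide entry covers every tract

print(dict(tract_nb_lookup))
```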


In [None]:
# set geography to summarize by: 'Tract', 'Neighborhood', or 'Supervisor District'
geo_summary_variable = 'Neighborhood'

# set path to download csvs
download_path = r"./output"

# sets geo variables based on above choice
if geo_summary_variable == 'Tract':
    tract_lookup = tract_tr_lookup
    geo_path = r'./shps/tracts/tracts_sf.shp'
    geo_merge_variable = 'tractce'
elif geo_summary_variable == 'Neighborhood':
    tract_lookup = tract_nb_lookup
    geo_path = r'./shps/neighborhoods/neighborhoods5/neighborhoods5.shp'
    geo_merge_variable = 'nhood'
elif geo_summary_variable == 'Supervisor District':
    tract_lookup = tract_sd_lookup
    geo_path = r'./shps/supervisor_districts/supervisor_districts2/supervisor_districtss.shp'
    geo_merge_variable = 'sup_dist'

In [None]:
# run functions to calculate all stats and convert calc dictionary to pandas dataframe
all_calc_data = calc_socio_economic_data_past(df, tract_lookup)
df_all_calcs = pd.DataFrame.from_dict(all_calc_data).reset_index()

df_all_calcs.rename(columns = {'index':'Attribute'}, inplace = True) 
df_all_calcs.head()


In [None]:
print(df_all_calcs.columns)

In [None]:
# transpose dataset for second geo view of dataset
df_all_calcs_tp = df_all_calcs.T.reset_index()
df_all_calcs_tp.columns = df_all_calcs_tp.iloc[0]
df_all_calcs_tp = df_all_calcs_tp[1:].rename(columns={'Attribute': geo_summary_variable})
df_all_calcs_tp = df_all_calcs_tp.sort_values(by=[geo_summary_variable])
df_all_calcs_tp.head()

In [None]:
# export both dataset views to csv
#df_all_calcs.to_csv(os.path.join(download_path,geo_summary_variable+"_"+'profiles_by_attribute_{}.csv'.format(year)), index = False)
df_all_calcs_tp.to_csv(os.path.join(download_path,geo_summary_variable+"_"+'profiles_by_geo_{}.csv'.format(year_past)), index = False)

In [None]:
# get both dataset views for the current year and 10 years ago 
data_year_df = pd.read_csv (r'./output/'+geo_summary_variable+"_"+'profiles_by_geo_{}.csv'.format(year), dtype=str)
data_year2_df = pd.read_csv (r'./output/'+geo_summary_variable+"_"+'profiles_by_geo_{}.csv'.format(year_past), dtype=str)

data_joined_years_df = data_year_df.merge(data_year2_df, on=geo_summary_variable, suffixes=("", "_10"))


In [None]:
# export the final data view 
data_joined_years_df.to_csv(os.path.join(download_path,geo_summary_variable+"_"+'profiles_by_geo_{}_{}.csv'.format(year, year_past)), index = False)
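
The join of the two survey years relies on the `suffixes` argument of `merge`: columns from the current year keep their names and columns from the earlier year gain `_10`. A small sketch with made-up numbers:

```python
import pandas as pd

# Made-up values standing in for the current-year and past-year profile tables.
current = pd.DataFrame({"Neighborhood": ["Mission", "sf"], "Total Population": [100, 1000]})
past = pd.DataFrame({"Neighborhood": ["Mission", "sf"], "Total Population": [80, 900]})

# Overlapping column names from the right-hand table receive the "_10" suffix.
joined = current.merge(past, on="Neighborhood", suffixes=("", "_10"))
print(list(joined.columns))
```

Because the notebook reads both csv files with `dtype=str`, the values would need converting (for example with `pd.to_numeric`) before any arithmetic between the two years.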


# Part 2. Non-Census data

In SFNP, there are four groups of data that are derived from sources other than ACS: 
- Affordable Housing
- Eviction 
- Equity Geographies/Project Boundaries 
- Built Environments 

Unlike the ACS data, these data are either 1) manually compiled by staff in SF Planning and located under the resources folder in this repository; or 2) derived from [DataSF](https://datasf.org/opendata/). 

The code below:
- loads the csv files
- aggregates the data by geographic area
- adds the results as new attributes to the socio-economic profile data created and saved by the ACS 5-year code in Part 1.
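
The three steps follow the same pattern for every dataset in this part. A minimal sketch with made-up records is shown here. The column names `units`, `site_count`, and `unit_sum` are illustrative assumptions, not the names used in the resource files.

```python
import pandas as pd

# Made-up point records, one row per site, standing in for a resources csv.
sites = pd.DataFrame({
    "neighborhood": ["Mission", "Mission", "Sunset"],
    "units": [10, 30, 5],
})
# Made-up profile table standing in for the Part 1 output.
profiles = pd.DataFrame({"Neighborhood": ["Mission", "Sunset", "Marina"]})

# Aggregate to one row per neighborhood: number of sites and total units.
by_nb = (
    sites.groupby("neighborhood")
    .agg(site_count=("units", "count"), unit_sum=("units", "sum"))
    .reset_index()
)

# A left join keeps every profile row; areas without sites get NaN.
merged = profiles.merge(by_nb, how="left", left_on="Neighborhood", right_on="neighborhood")
merged = merged.drop(columns=["neighborhood"])
print(merged)
```

The left join matters here: a neighborhood with no records in a dataset stays in the master table with missing values instead of being dropped.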



### Load the socio-economic profile data created by the Part 1 code

In [None]:
# open the census data 
df_all_calcs_tp = pd.read_csv (r'./output/'+geo_summary_variable+"_"+'profiles_by_geo_{}_{}.csv'.format(year, year_past), dtype=str)

## Affordable Housing (Affordable units + SROs + Rent-controlled)

This data was compiled by James Papas and Michael Webster in the SF Planning Department. While the dataset provides data for 2020, there is no maintenance plan for this data yet. Three csv files were derived from the original dataset:
- [affordable_housing_2021_for_NP.csv]()
- [SRO_Points_for_NP.csv]()
- [rent_controlled_2019_for_NP.csv]()

### Affordable Housing

In [None]:
# load the affordable housing data 
aff_df = pd.read_csv (r'./resources/affordable_housing_2021_for_NP.csv')
# where total units is recorded as 0, fall back to the number of affordable units
aff_df['tot_units'] = aff_df.tot_units + (aff_df.tot_units == 0) * (aff_df.aff_unit)
 
aff_df['aff_unit_ratio'] = aff_df['aff_unit']/aff_df['tot_units']
aff_df[['aff_unit', 'tot_units', 'aff_unit_ratio']] = aff_df[['aff_unit', 'tot_units', 'aff_unit_ratio']].astype(float)
print(aff_df['neighborhood'])
aff_df.head()

In [None]:
# aggregate the number of affordable units into geographic areas (neighborhood, census tract)
aff_df_by_neighborhood = aff_df.groupby("neighborhood").agg({'aff_unit':['count','sum'],'tot_units':'sum','aff_unit_ratio':'mean'})
aff_df_by_neighborhood.columns = ['aff_count', 'aff_unit_sum', 'aff_tot_units_sum', 'aff_mean_aff_ratio']
aff_df_by_neighborhood = aff_df_by_neighborhood.reset_index()

aff_df_by_tract = aff_df.groupby("tractce").agg({'aff_unit':['count','sum'],'tot_units':'sum','aff_unit_ratio':'mean'})
aff_df_by_tract.columns = ['aff_count', 'aff_unit_sum', 'aff_tot_units_sum', 'aff_mean_aff_ratio']
aff_df_by_tract = aff_df_by_tract.reset_index()

aff_df_by_neighborhood.head()

In [None]:
# combine the data to the socioeconomic profiles data 
if geo_summary_variable == 'Neighborhood':
    df_all_calcs_tp = df_all_calcs_tp.merge(aff_df_by_neighborhood, how= 'left', left_on = 'Neighborhood', right_on = 'neighborhood')
    df_all_calcs_tp= df_all_calcs_tp.drop(['neighborhood'], axis=1)
elif geo_summary_variable == 'Tract':
    df_all_calcs_tp = df_all_calcs_tp.merge(aff_df_by_tract, how='left', left_on = 'Tract', right_on = 'tractce')
    df_all_calcs_tp= df_all_calcs_tp.drop(['tractce'], axis=1)
df_all_calcs_tp.head()

### SROs

In [None]:
# load the SRO data 
sro_df = pd.read_csv (r'./resources/SRO_Points_for_NP.csv')
print(sro_df.columns)
sro_df['residential_unit_ratio'] = sro_df['CERT_RESID']/(sro_df['CERT_RESID']+sro_df['CERT_TOURI'])

In [None]:
# aggregate the raw data into geographic units (neighborhood, census tract)
sro_df_by_neighborhood = sro_df.groupby("NHOOD").agg({'CERT_RESID':['count','sum'],'CERT_TOURI':'sum','residential_unit_ratio':'mean'})
sro_df_by_neighborhood.columns = ['sro_count', 'sro_residential_unit', 'sro_tourist_unit', 'sro_mean_residential_ratio']
sro_df_by_neighborhood = sro_df_by_neighborhood.reset_index()

sro_df_by_tract = sro_df.groupby("tractce").agg({'CERT_RESID':['count','sum'],'CERT_TOURI':'sum','residential_unit_ratio':'mean'})
sro_df_by_tract.columns = ['sro_count', 'sro_residential_unit', 'sro_tourist_unit', 'sro_mean_residential_ratio']
sro_df_by_tract = sro_df_by_tract.reset_index()

In [None]:
# combine the data to the socioeconomic profiles data 
if geo_summary_variable == 'Neighborhood':
    df_all_calcs_tp = df_all_calcs_tp.merge(sro_df_by_neighborhood, how= 'left', left_on = 'Neighborhood', right_on = 'NHOOD')
    df_all_calcs_tp= df_all_calcs_tp.drop(['NHOOD'], axis=1)
elif geo_summary_variable == 'Tract':
    df_all_calcs_tp = df_all_calcs_tp.merge(sro_df_by_tract, how='left', left_on = 'Tract', right_on = 'tractce')
    df_all_calcs_tp= df_all_calcs_tp.drop(['tractce'], axis=1)
df_all_calcs_tp.head()

### Rent-controlled

In [None]:
# load the rent-controlled data
rc_df = pd.read_csv (r'./resources/rent_controlled_2019_for_NP.csv')
print(rc_df.columns)

In [None]:
# aggregate the raw data into geographic units (neighborhood, census tract)
rc_df_by_neighborhood = rc_df.groupby("NHOOD").agg({'RESUNITS':['count','sum']})
rc_df_by_neighborhood.columns = ['rc_count', 'rc_residential_unit']
rc_df_by_neighborhood = rc_df_by_neighborhood.reset_index()

rc_df_by_tract = rc_df.groupby("tractce").agg({'RESUNITS':['count','sum']})
rc_df_by_tract.columns = ['rc_count', 'rc_residential_unit']
rc_df_by_tract = rc_df_by_tract.reset_index()

In [None]:
# combine the data to the socioeconomic profiles data 
if geo_summary_variable == 'Neighborhood':
    df_all_calcs_tp = df_all_calcs_tp.merge(rc_df_by_neighborhood, how= 'left', left_on = 'Neighborhood', right_on = 'NHOOD')
    df_all_calcs_tp= df_all_calcs_tp.drop(['NHOOD'], axis=1)
elif geo_summary_variable == 'Tract':
    df_all_calcs_tp = df_all_calcs_tp.merge(rc_df_by_tract, how='left', left_on = 'Tract', right_on = 'tractce')
    df_all_calcs_tp= df_all_calcs_tp.drop(['tractce'], axis=1)
df_all_calcs_tp.head()

## Eviction

Eviction data is derived from [xxxx dataset]() on DataSF (SF's open data portal). The code below adds the number of evictions by category as new attributes to the master data table.
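
Counting evictions by category amounts to summing true/false flags per neighborhood and once more for the city as a whole. A minimal sketch using plain Python follows. The records and the field names `non_payment` and `nuisance` are illustrative and have not been checked against the live dataset.

```python
from collections import Counter, defaultdict

# Made-up records; the field names are illustrative assumptions.
records = [
    {"neighborhood": "Mission", "non_payment": True, "nuisance": False},
    {"neighborhood": "Mission", "non_payment": False, "nuisance": True},
    {"neighborhood": "Sunset", "non_payment": True, "nuisance": False},
]
reason_columns = ["non_payment", "nuisance"]

counts = defaultdict(Counter)
for record in records:
    for column in reason_columns:
        # True counts as 1 and False as 0, so the sum is the number of notices.
        counts[record["neighborhood"]][column] += int(record[column])
        counts["sf"][column] += int(record[column])  # citywide total

print({area: dict(counter) for area, counter in counts.items()})
```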

In [None]:
#load the eviction data from DataSF: url = 'https://data.sfgov.org/resource/5cei-gny5.csv'
client = Socrata("data.sfgov.org", None)
results = client.get("5cei-gny5", limit=45000)
eviction_df_raw = pd.DataFrame.from_records(results)
print(eviction_df_raw.columns)
print(len(eviction_df_raw.index))

In [None]:
# aggregate the data by neighborhoods 
eviction_df = eviction_df_raw.iloc[:,6:28]

# create a dictionary for aggregation 
eviction_keys = eviction_df.columns[0:19]
eviction_values = ['sum']*19
res = dict(zip(eviction_keys, eviction_values))


eviction_df_by_neighborhood = eviction_df.groupby("neighborhood").agg(res).astype(float)
eviction_df_by_neighborhood.columns = eviction_keys
eviction_df_by_neighborhood = eviction_df_by_neighborhood.reset_index()
eviction_df_by_neighborhood.head()

In [None]:
# add a row for SF 
eviction_df_sf = eviction_df.agg(res).astype(float)
# DataFrame.append and Series.append were removed in pandas 2.0, so build a one-row frame and concatenate
eviction_sf_row = pd.DataFrame([{'neighborhood': 'sf', **eviction_df_sf.to_dict()}])
eviction_df_by_neighborhood = pd.concat([eviction_df_by_neighborhood, eviction_sf_row], ignore_index=True)
eviction_df_by_neighborhood.tail()

In [None]:
# combine the data to the census calc data 
if geo_summary_variable == 'Neighborhood':
    df_all_calcs_tp = df_all_calcs_tp.merge(eviction_df_by_neighborhood, how= 'left', left_on = 'Neighborhood', right_on = 'neighborhood')
    df_all_calcs_tp= df_all_calcs_tp.drop(['neighborhood'], axis=1)
elif geo_summary_variable == 'Tract':
    # eviction_df_by_tract is never built above (the eviction data is only aggregated by neighborhood),
    # so fail with a clear message instead of a NameError
    raise NotImplementedError("Eviction data is only aggregated by neighborhood in this notebook.")
df_all_calcs_tp.head()

## Equity Geographies/Project Boundaries

The `np_boundaries.csv` file in this repository shows whether various city-led projects and equity geographies apply to each SF neighborhood. City staff compiled the data using GIS software; each column in the file represents one equity geography or project boundary. The code below loads the data and joins it to the master table.


In [None]:
# load the np_boundaries table 
np_boundaries = pd.read_csv (r'./resources/np_boundaries.csv')

In [None]:
# join np_boundaries to the master table 
df_all_calcs_tp = df_all_calcs_tp.merge(np_boundaries, how= 'left', left_on = 'Neighborhood', right_on = 'nhood')

## Built Environment

SFNP also provides indicators that summarize the quality of the built environment and the number of community amenities in the neighborhoods. The code below calculates three indicators, using datasets from [DataSF](https://datasf.org/opendata/) and SF Planning ArcGIS Online, together with the neighborhood boundaries shapefile in this repository.
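
One such indicator is a length-weighted share, for example the fraction of a neighborhood's street length that scores above a threshold. The sketch below assumes the indicator is meant per neighborhood (each neighborhood's high-scoring length divided by that neighborhood's total length). The segments and the helper `high_pci_share` are made up for illustration.

```python
from collections import defaultdict

# Made-up street segments: (neighborhood, segment length, PCI score).
segments = [
    ("Mission", 100.0, 90),
    ("Mission", 300.0, 70),
    ("Sunset", 200.0, 88),
]


def high_pci_share(segments, threshold=85):
    """Return, per neighborhood, the share of street length scoring above threshold."""
    total = defaultdict(float)
    high = defaultdict(float)
    for nhood, length, score in segments:
        total[nhood] += length
        if score > threshold:
            high[nhood] += length
    # Skip neighborhoods with no length to avoid dividing by zero.
    return {nhood: high[nhood] / total[nhood] for nhood in total if total[nhood] > 0}


print(high_pci_share(segments))
```

Weighting by length keeps a handful of very short segments from dominating the result, which a simple count of segments would allow.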

In [None]:
# import the neighborhood boundaries shapefile
neigh_df = gpd.read_file(geo_path)
neighborhood_list = neigh_df['nhood'].tolist()

### Pavement Condition Index

In [None]:
# load the PCI data from DataSF

url = "https://data.sfgov.org/resource/5aye-4rtt.geojson?$limit=45000"
pci = gpd.read_file(url)
pci['length'] = pci['geometry'].length

# keep only items with length > 0 (street segments)
pci = pci[pci['length'] > 0]
pci.dtypes

In [None]:
# run spatial join between neighborhood boundaries and pci street segment 

neigh_pci = neigh_df.sjoin(pci, how="right", predicate='intersects')
neigh_pci = pd.DataFrame(neigh_pci.drop(columns='geometry'))

#geo_pci['pci_weight'] = geo_pci['pci_score']*geo_pci['length']

In [None]:
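# calculate high_pci_ratio: share of each neighborhood's street length with PCI > 85

high_pci_ratio = list()
for neighborhood in neighborhood_list: 
    nb_pci = neigh_pci[neigh_pci["nhood"]==neighborhood]
    # total street length within this neighborhood (not the whole city)
    total_len = nb_pci['length'].sum()
    high_len = nb_pci.loc[nb_pci["pci_score"].astype(float)>85, 'length'].sum()
    # leave the ratio missing when a neighborhood has no joined street segments
    high_pci_ratio.append(high_len/total_len if total_len > 0 else np.nan)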
        
        
high_pci_df = pd.DataFrame({'nhood': neighborhood_list, 'high_pci_ratio':high_pci_ratio})

In [None]:
# add the 'high_pci_ratio' as a column to the master table
df_all_calcs_tp = df_all_calcs_tp.merge(high_pci_df, how= 'left', left_on = 'Neighborhood', right_on = 'nhood')

### High Injury Network

In [None]:
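# use your own ArcGIS Online credentials for SF ArcGIS Online; never commit them to the repository
# AGOL_USERNAME and AGOL_PASSWORD are assumed environment variable names; set them before running
gis = GIS("https://sfgov.maps.arcgis.com/", os.environ["AGOL_USERNAME"], os.environ["AGOL_PASSWORD"])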
print(f"Connected to {gis.properties.portalHostname} as {gis.users.me.username}")


In [None]:
# load the Vision Zero 2017 high injury network data from SF ArcGIS Online 
# service directory: https://services.arcgis.com/Zs2aNLFN00jrS4gG/arcgis/rest/services/vz_hin_2017_single_line/FeatureServer
vz_2017_id = '25d06501f18e458491ca7c6d4e3813b4'
vz_2017 = gis.content.get(vz_2017_id)
vz_2017

In [None]:
# load the first layer of vz_2017 as 'layer'
layer = vz_2017.layers[0]
for f in layer.properties.fields:
    print(f['name'])

In [None]:
# export the features in the layer as a shapefile
features = layer.query(where = 'length >0')
features.sdf.spatial.to_featureclass('./resources/vz_2017.shp')

In [None]:
# load the shapefile as a geopandas dataframe
vz_2017_shp = gpd.read_file('./resources/vz_2017.shp')
vz_2017_shp = vz_2017_shp.to_crs("EPSG:4326")

neigh_vz_2017 = neigh_df.sjoin(vz_2017_shp, how="right", predicate='intersects')
neigh_vz_2017 = pd.DataFrame(neigh_vz_2017.drop(columns='geometry'))

# calculate the total length of high injury network segments in each neighborhood
vz_length = list()
for neighborhood in neighborhood_list: 
    sub = neigh_vz_2017[(neigh_vz_2017["nhood"]==neighborhood)]
    total_len = sub['length'].sum()
    vz_length.append(total_len)
    
        
neigh_vz_df = pd.DataFrame({'nhood': neighborhood_list, 'vz_length':vz_length})

In [None]:
# add the 'vz_length' as a column to the master table
df_all_calcs_tp = df_all_calcs_tp.merge(neigh_vz_df.rename(columns={'nhood': 'Neighborhood'}), how='left', on='Neighborhood')

### Community Facilities

In [None]:
# load the recreation and park facilities data from DataSF 
# https://data.sfgov.org/resource/gtr9-ntp6.geojson 
url = "https://data.sfgov.org/resource/gtr9-ntp6.geojson?$limit=45000"
rec = gpd.read_file(url)
rec = rec.to_crs("EPSG:4326")
rec.dtypes

In [None]:
# run spatial join between neighborhood boundaries and recreation and park facilities
neigh_rec = neigh_df.sjoin(rec[['objectid','propertytype', 'geometry']], how="right", predicate='intersects')
neigh_rec = pd.DataFrame(neigh_rec.drop(columns='geometry'))

propertytype_list = neigh_rec['propertytype'].unique()

# count the facilities of each property type in each neighborhood
rec_count = pd.DataFrame({'propertytype':propertytype_list})
for neighborhood in neighborhood_list: 
    sub = neigh_rec[(neigh_rec["nhood"]==neighborhood)].groupby('propertytype').agg({'nhood':'count'})
    sub.columns = [neighborhood]
    sub = sub.reset_index()
    rec_count = rec_count.merge(sub, on = 'propertytype', how='left') 

rec_count_df_tp = rec_count.T.reset_index()
rec_count_df_tp.columns = rec_count_df_tp.iloc[0]
rec_count_df_tp = rec_count_df_tp[1:].rename(columns={'propertytype': 'Neighborhood'}).replace(np.nan, 0)
rec_count_df_tp.head()
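
The merge-and-transpose steps above amount to a cross-tabulation of neighborhoods against property types. The cell below is a minimal sketch of that alternative using `pd.crosstab` on made-up rows; `toy_join` and `all_neighborhoods` are hypothetical stand-ins for `neigh_rec` and `neighborhood_list`.

```python
import pandas as pd

# toy result of the spatial join (hypothetical rows)
toy_join = pd.DataFrame({
    "nhood": ["Mission", "Mission", "Sunset", "Mission"],
    "propertytype": ["Mini Park", "Mini Park", "Regional Park", "Regional Park"],
})
all_neighborhoods = ["Mission", "Sunset", "Marina"]

counts = (
    pd.crosstab(toy_join["nhood"], toy_join["propertytype"])
    # add neighborhoods without any facility as rows of zeros
    .reindex(all_neighborhoods, fill_value=0)
    .rename_axis(index="Neighborhood", columns=None)
    .reset_index()
)
print(counts)
```

The result has one row per neighborhood and one integer column per property type, so it can be merged onto the master table on `Neighborhood` directly.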

In [None]:
# add the facility counts by property type as columns to the master table
df_all_calcs_tp = df_all_calcs_tp.merge(rec_count_df_tp, how= 'left', on = 'Neighborhood')

In [None]:
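# load the school data from DataSF 
# https://data.sfgov.org/resource/rxa4-qmcf.json 
url = "https://data.sfgov.org/resource/rxa4-qmcf.json"

school = pd.read_json(url)
# build point geometries from the coordinate columns
# (assumes the dataset exposes lowercase 'longitude' and 'latitude' fields)
school_df = gpd.GeoDataFrame(
    school,
    geometry=gpd.points_from_xy(school['longitude'].astype(float), school['latitude'].astype(float)),
    crs="EPSG:4326",
)
# keep only high, middle, and elementary schools
school_df = school_df[school_df['common_name'].str.contains('High|Elementary|Middle', na=False)]
school_df.dtypes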

In [None]:
neigh_school = neigh_df.sjoin(school_df[['facility_id', 'common_name', 'geometry']], how="right", predicate='intersects')
neigh_school = pd.DataFrame(neigh_school.drop(columns='geometry'))

school_list = ['High', 'Middle', 'Elementary']

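# derive the school type from the school name
neigh_school['schooltype'] = neigh_school['common_name'].str.extract('(High|Middle|Elementary)', expand=False)

# count the schools of each type in each neighborhood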
school_count = pd.DataFrame({'schooltype':school_list})
for neighborhood in neighborhood_list: 
    sub = neigh_school[(neigh_school["nhood"]==neighborhood)].groupby('schooltype').agg({'nhood':'count'})
    sub.columns = [neighborhood]
    sub = sub.reset_index()
    school_count = school_count.merge(sub, on = 'schooltype', how='left') 

school_count_df_tp = school_count.T.reset_index()
school_count_df_tp.columns = school_count_df_tp.iloc[0]
school_count_df_tp = school_count_df_tp[1:].rename(columns={'schooltype': 'Neighborhood'}).replace(np.nan, 0)
school_count_df_tp.head()
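
Counting by school type requires a type label for each school, which here has to come from the school name. The cell below is a minimal sketch of that classification on made-up names, assuming the type keyword appears as a whole word in `common_name`; the pattern and the toy rows are illustrative only.

```python
import pandas as pd

# toy school names (hypothetical)
toy_schools = pd.DataFrame({
    "common_name": [
        "Alpha High School",
        "Beta Middle School",
        "Gamma Elementary School",
        "Highland Elementary School",
        "Delta Preschool",
    ],
})

# word boundaries keep "Highland" from being read as a high school
pattern = r"\b(High|Middle|Elementary)\b"
toy_schools["schooltype"] = toy_schools["common_name"].str.extract(pattern, expand=False)

# names without a recognised keyword get a missing value and are dropped
classified = toy_schools.dropna(subset=["schooltype"])
print(classified)
```

A name that contains more than one keyword is labelled with the first match, so unusual names may need a manual check.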

In [None]:
# add the school counts by school type as columns to the master table
df_all_calcs_tp = df_all_calcs_tp.merge(school_count_df_tp, how= 'left', on = 'Neighborhood')

## Final: Export the Master Table

In [None]:
# export the master table as csv 
df_all_calcs_tp.to_csv(os.path.join(download_path,geo_summary_variable+"_"+'master_table_by_geo_{}_{}.csv'.format(year, year_past)), index = False)
# export only for the 'sf' row 
df_all_calc_tp_sf = df_all_calcs_tp[df_all_calcs_tp['Neighborhood']=='sf']
df_all_calc_tp_sf.to_csv(os.path.join(download_path,geo_summary_variable+"_"+'master_table_by_geo_SF_{}_{}.csv'.format(year, year_past)), index = False)
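
Because the master table is assembled through a long chain of merges, a quick check before exporting can catch leftovers such as `_x`/`_y` columns (created when the same column is merged twice) or repeated geography rows. The cell below is a minimal sketch on made-up data; `merge_artifacts` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

def merge_artifacts(table: pd.DataFrame, key: str = "Neighborhood") -> dict:
    """List columns carrying pandas' default merge suffixes and repeated key values."""
    suffixed = [c for c in table.columns if c.endswith(("_x", "_y"))]
    duplicated_keys = table.loc[table[key].duplicated(), key].tolist()
    return {"suffixed_columns": suffixed, "duplicated_keys": duplicated_keys}

# toy master table with both kinds of problem (hypothetical values)
toy_master = pd.DataFrame({
    "Neighborhood": ["Mission", "Sunset", "Sunset"],
    "nhood_x": ["Mission", "Sunset", "Sunset"],
    "nhood_y": ["Mission", "Sunset", "Sunset"],
    "vz_length": [10.0, 5.0, 5.0],
})
report = merge_artifacts(toy_master)
print(report)
```

An empty list for both entries suggests the table has one row per geography and no accidentally duplicated columns.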

## Publish Data to ArcGIS Online 

In [None]:
import arcgis 
from arcgis.features import FeatureLayer
from arcgis.gis import GIS

In [None]:
%conda install -c esri arcpy

In [None]:
# set up ArcGIS Online connection
print("ArcGIS Online Org account")    
gis = GIS('home')
print("Logged in as " + str(gis.properties.user.username))

In [None]:
# set geography to summarize by. If summarizing by supervisor districts, set geo_summary_variable to "Supervisor District"
geo_summary_variable = 'Neighborhood' # one of 'Tract', 'Neighborhood', 'Supervisor District'

# set path to download csvs
download_path = r"./output"

# sets geo variables based on above choice
if geo_summary_variable == 'Tract':
    geo_path = r'./shps/tracts/tracts_sf.shp'
    geo_merge_variable = 'tractce'
elif geo_summary_variable == 'Neighborhood':
    geo_path = r'./shps/neighborhoods/neighborhoods5/neighborhoods5.shp'
    geo_merge_variable = 'nhood'
elif geo_summary_variable == 'Supervisor District':
    geo_path = r'./shps/supervisor_districts/supervisor_districts2/supervisor_districtss.shp'
    geo_merge_variable = 'sup_dist'

In [None]:
#load the master table
master_df = pd.read_csv(os.path.join(download_path,geo_summary_variable+"_"+'master_table_by_geo_{}_{}.csv'.format(year, year_past)))

# import the geography shapefile (reprojection to EPSG:4326 is currently disabled)
geo_df = gpd.read_file(geo_path)
#geo_df = geo_df.to_crs("EPSG:4326")

# join dataframe to neighborhoods geodataframe by neighborhood name
neighborhoods_var_df= geo_df.merge(master_df,left_on=geo_merge_variable, right_on=geo_summary_variable)
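
The join above is an inner merge, so any geography whose name does not match exactly between the shapefile and the master table is silently dropped (as is the citywide `sf` row, which has no polygon). The cell below is a minimal sketch of checking for such mismatches with the `indicator` option of `merge`; the toy frames are hypothetical stand-ins for `geo_df` and `master_df`.

```python
import pandas as pd

# toy stand-ins for the shapefile attributes and the master table (hypothetical values)
toy_geo = pd.DataFrame({"nhood": ["Mission", "Sunset", "Marina"]})
toy_master = pd.DataFrame({"Neighborhood": ["Mission", "Sunset", "sf"], "value": [1, 2, 3]})

# an outer merge with indicator=True labels where each row came from
check = toy_geo.merge(
    toy_master, left_on="nhood", right_on="Neighborhood", how="outer", indicator=True
)
only_in_geo = check.loc[check["_merge"] == "left_only", "nhood"].tolist()
only_in_table = check.loc[check["_merge"] == "right_only", "Neighborhood"].tolist()
print("polygons without data:", only_in_geo)
print("rows without a polygon:", only_in_table)
```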