<a href="https://colab.research.google.com/github/vgaurav-umich/siads592/blob/master/get_demographic_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Download ACS Race/Ethnicity Data from US Census

This notebook downloads the 2018 ACS 5-Year race/ethnicity estimates (table DP05) from the US Census data portal. Link to source [Data table](https://data.census.gov/cedsci/table?d=ACS%205-Year%20Estimates%20Data%20Profiles&table=DP05&tid=ACSDP5Y2018.DP05).

## Purpose

The purpose of this notebook is to download US Census data that gives us baseline information about the demographic makeup of the US population at the county level.
  

In [0]:
# Import required libraries
import pandas as pd
import numpy as np

## Download Data

Unfortunately, the US Census data portal does not provide an easy way to download data from the site. It provides an interactive tool for selecting custom data topics, which can then be downloaded as a zip file.

There is a bit of manual work involved here: we have to download this zip file, extract it, and then upload it to a Google Drive folder so that this notebook can pick it up.

The good news is that ACS estimates are released once a year, so once the files are downloaded we do not have to repeat the manual download process.

In [11]:
# We are going to make use of the parameterized notebook feature in Google Colab
# acs_data_file_path = "drive/My Drive/Colab Notebooks/assets/data/ACSDP1Y2018.DP05_2020-06-06T132235/ACSDP5Y2018.DP05_data_with_overlays_2020-06-09T020759.csv" #@param {type:"raw"}
acs_data_file_path = 'https://raw.githubusercontent.com/vgaurav-umich/siads592/master/data/ACSDP5Y2018.DP05_data_with_overlays_2020-06-09T020759.csv'
# acs_data_file_path

'https://raw.githubusercontent.com/vgaurav-umich/siads592/master/data/ACSDP5Y2018.DP05_data_with_overlays_2020-06-09T020759.csv'

## Recipe for Data Transformation
1. Read data from the download path
2. Keep only the columns of interest
3. Rename columns to user-friendly names
4. Remove the first row, as it contains descriptive column labels rather than data
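
As a minimal sketch of these four steps, the example below runs them on a tiny in-memory CSV. The two data rows and the label text in the second line are made up for illustration and are not real ACS values; only the `GEO_ID`, `NAME`, and `DP05_0071PE` column codes are taken from this notebook.

```python
import io

import pandas as pd

# Hypothetical extract: a header row of codes, a row of descriptive labels,
# then the data rows. All values are illustrative.
raw_csv = io.StringIO(
    "GEO_ID,NAME,DP05_0071PE,DP05_0071PM\n"
    "id,Geographic Area Name,Percent label text,Margin label text\n"
    '0500000US01001,"County A, State A",2.8,0.1\n'
    '0500000US01003,"County B, State A",4.6,0.1\n'
)

toy_df = pd.read_csv(raw_csv)                                          # 1. read
toy_df = toy_df[["GEO_ID", "NAME", "DP05_0071PE"]]                     # 2. filter
toy_df.columns = ["fips", "geo_name", "hispanic_latino_any_race_PE"]   # 3. rename
toy_df = toy_df.iloc[1:].reset_index(drop=True)                        # 4. drop label row

# The label row forces the column to be read as text, so convert it afterwards.
toy_df["hispanic_latino_any_race_PE"] = toy_df["hispanic_latino_any_race_PE"].astype(float)
print(toy_df)
```

The conversion to `float` only succeeds after the label row is gone, which is why the row removal has to happen before any numeric cast.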

In [0]:
def read_file(data_file_path):
  # read the file downloaded from the US Census Bureau website
  # ACS 2018 5-year data profile estimates from https://data.census.gov/cedsci/table?d=ACS%205-Year%20Estimates%20Data%20Profiles&table=DP05&tid=ACSDP5Y2018.DP05
  return pd.read_csv(data_file_path)

def filter_columns(df):
  # there are many columns, let's only pick ones that are of our interest for this analysis, i.e. race/ethnicity info 
  # col_types = ['PE', 'PM'] # Only interested in PE = percentage estimate and PM = percentage margin of error
  col_types = ['PE']# Only interested in PE = percentage estimate 
  subset = ['GEO_ID','NAME'] + [f'DP05_00{i}{col_type}' for i in [71,76,77,78,79,80,81,82,83] for col_type in col_types]
  # Filter only subset of columns along with Geography name
  return df[subset]

def rename_columns(df):
  # generate user friendly names for columns
  # col_types = ['PE', 'PM'] # Only interested in PE = percentage estimate and PM = percentage margin of error
  col_types = ['PE'] # Only interested in PE = percentage estimate
  dic = {71: 'hispanic_latino_any_race', 76:'non_hispanic_latino_any_race',  77: 'white_alone', 78: 'black_african_american_alone', 79: 'american_indian_alaska_native_alone', 80: 'asian_alone', 81: 'native_hawaiian_pacific_islander_alone', 82: 'some_other_race_alone', 83: 'two_or_more_races'}
  numeric_cols = [f'{dic[item]}_{col_type}' for item in dic for col_type in col_types]
  subset_names = ['fips','geo_name'] + numeric_cols
  df.columns = subset_names # rename columns to a user friendly name
  df = df.replace('*****', np.nan)
  df = df.replace('N', np.nan)
  df[numeric_cols] = df[numeric_cols].astype(float)
  return df

def clean_and_enrich(df):
  # drop the first row (descriptive labels); copy so later assignments do not act on a slice
  df = df.iloc[1:].copy()
  df = rename_columns(df)
  df['fips'] = df['fips'].str.extract(".+US[0]?(.+)").astype(float)
  df['state'] = df['geo_name'].apply(lambda x: x.split(",")[1].strip() if "," in x else x )
  return df
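
To see what the regular expression in `clean_and_enrich` extracts, here is a small sketch on hand-written `GEO_ID` strings. The sample IDs are assumptions that follow the `...US<code>` shape the pattern expects (nation, state, and county levels); they are not read from the data file.

```python
import pandas as pd

# Hypothetical GEO_ID samples: one nation-level, two state-level, two county-level.
geo_ids = pd.Series([
    "0100000US",
    "0400000US01",
    "0400000US36",
    "0500000US01001",
    "0500000US36061",
])

# Same pattern as in clean_and_enrich: skip everything up to "US",
# drop one optional leading zero, and capture the rest.
codes = geo_ids.str.extract(r".+US[0]?(.+)", expand=False)
fips = codes.astype(float)

print(pd.DataFrame({"geo_id": geo_ids, "code": codes, "fips": fips}))
```

Under these assumptions a county ID such as `0500000US01001` yields `1001`, a state ID such as `0400000US01` yields `1`, and the nation-level ID has nothing after `US`, so it becomes a missing value. That would be consistent with the single null `fips` in the `info()` output of the cleaned dataset. State and county codes end up in the same `fips` column, which is what lets the same frame be joined to both state-level and county-level geography.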

In [13]:
race_ethnicity_county_df = clean_and_enrich(filter_columns(read_file(acs_data_file_path)))



In [0]:
# Test if everything is all right. 
# acs_race_ethnicity_df.tail()
print("\n============================== race_ethnicity_county_df: Cleaned Race Ethnicity dataset ===========================================")
race_ethnicity_county_df.info()
print("\n ============= First 52 rows contain data for US states, and the last row contains data for the whole US ============")


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3273 entries, 1 to 3273
Data columns (total 12 columns):
 #   Column                                     Non-Null Count  Dtype  
---  ------                                     --------------  -----  
 0   fips                                       3272 non-null   float64
 1   geo_name                                   3273 non-null   object 
 2   hispanic_latino_any_race_PE                3273 non-null   float64
 3   non_hispanic_latino_any_race_PE            3273 non-null   float64
 4   white_alone_PE                             3273 non-null   float64
 5   black_african_american_alone_PE            3273 non-null   float64
 6   american_indian_alaska_native_alone_PE     3273 non-null   float64
 7   asian_alone_PE                             3273 non-null   float64
 8   native_hawaiian_pacific_islander_alone_PE  3273 non-null   float64
 9   some_other_race_alone_PE                   3273 non-null   float64
 10  two_or_more_races_PE   

#### Notes
Population and demographic data are based on analysis of the Census Bureau’s American Community Survey (ACS) and may differ from other population estimates published yearly by the Census Bureau.

Persons of Hispanic origin may be of any race; all other racial/ethnic groups are non-Hispanic.
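
Because the race categories kept here are non-Hispanic, the Hispanic share plus the seven race shares should add up to roughly 100 for each geography, while `non_hispanic_latino_any_race_PE` is left out because it is the parent total of those seven categories. The sketch below illustrates that check; the two rows are made-up percentages, not real ACS values, and the reading of the column layout is an assumption based on the column names used in this notebook.

```python
import pandas as pd

# Illustrative percentages only; each row is constructed to sum to 100.
example = pd.DataFrame({
    "geo_name": ["County A", "County B"],
    "hispanic_latino_any_race_PE": [10.0, 25.5],
    "white_alone_PE": [70.0, 50.0],
    "black_african_american_alone_PE": [12.0, 15.0],
    "american_indian_alaska_native_alone_PE": [0.5, 1.0],
    "asian_alone_PE": [4.0, 5.0],
    "native_hawaiian_pacific_islander_alone_PE": [0.1, 0.2],
    "some_other_race_alone_PE": [0.4, 0.3],
    "two_or_more_races_PE": [3.0, 3.0],
})

# Sum every percentage column; the non-Hispanic parent total is not included here.
component_cols = [col for col in example.columns if col.endswith("_PE")]
example["total"] = example[component_cols].sum(axis=1)

print(example[["geo_name", "total"]])
```

On real data the total may drift slightly from 100 because each published percentage is rounded independently.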

## Enrich with Geographic Data

The data we just downloaded gives us detailed demographics for each US county.

Since we have already downloaded geographic data, we can further enrich the demographic data with geographic features such as the CBSA.

The good thing is that the ACS data contains a FIPS code for each geography, which makes an easy key for joining to the geographic data.
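
Before relying on a FIPS join, it can help to see which rows fail to match. The sketch below uses two small made-up frames (`demo` and `geo` are hypothetical names, and all codes and values are illustrative) and the `indicator=True` option of `DataFrame.merge` to label each row by where it came from.

```python
import pandas as pd

# Hypothetical demographic rows; fips is float with one missing value,
# mirroring how the cleaned ACS frame stores it.
demo = pd.DataFrame({
    "fips": [1001.0, 1003.0, 99999.0, float("nan")],
    "white_alone_PE": [74.0, 83.0, 50.0, 60.0],
})

# Hypothetical geographic rows; fips is an integer here.
geo = pd.DataFrame({
    "fips": [1001, 1003, 1005],
    "cbsa": ["CBSA A", "CBSA B", None],
})

# Drop rows without a code and align the key dtype before joining.
demo_keyed = demo.dropna(subset=["fips"]).assign(
    fips=lambda d: d["fips"].astype("int64")
)

# An outer join with indicator=True adds a _merge column:
# "both", "left_only", or "right_only".
merged = demo_keyed.merge(geo, on="fips", how="outer", indicator=True)
print(merged["_merge"].value_counts())
```

Rows marked `left_only` or `right_only` are the ones an inner join would silently drop, so inspecting them is a quick way to explain a row-count difference between the input and the merged dataset.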

In [0]:
%run 'drive/My Drive/Colab Notebooks/get_geographic_data.ipynb'


<class 'pandas.core.frame.DataFrame'>
Int64Index: 3234 entries, 0 to 3233
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   county      3234 non-null   object
 1   state       3234 non-null   object
 2   population  3234 non-null   int64 
 3   cbsa        1874 non-null   object
 4   cbsa_type   1874 non-null   object
 5   fips        3234 non-null   int64 
dtypes: int64(2), object(4)
memory usage: 176.9+ KB

<class 'pandas.core.frame.DataFrame'>
Int64Index: 57 entries, 0 to 56
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   state       57 non-null     object
 1   population  57 non-null     int64 
 2   state_code  57 non-null     object
 3   fips        57 non-null     int64 
dtypes: int64(2), object(2)
memory usage: 2.2+ KB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 854 entries, 0 to 853
Data columns (total 2 columns):
 #   Column      Non-Nu

In [0]:
county_cbsa_demographic_df = race_ethnicity_county_df.merge(county_cbsa_full_df, on = 'fips').rename({'state_x': 'state'},axis = 1).drop('state_y',axis = 1)
print("\n============================== county_cbsa_demographic_df: CBSA County Level Demographic Dataset ===========================================")
county_cbsa_demographic_df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 3220 entries, 0 to 3219
Data columns (total 16 columns):
 #   Column                                     Non-Null Count  Dtype  
---  ------                                     --------------  -----  
 0   fips                                       3220 non-null   float64
 1   geo_name                                   3220 non-null   object 
 2   hispanic_latino_any_race_PE                3220 non-null   float64
 3   non_hispanic_latino_any_race_PE            3220 non-null   float64
 4   white_alone_PE                             3220 non-null   float64
 5   black_african_american_alone_PE            3220 non-null   float64
 6   american_indian_alaska_native_alone_PE     3220 non-null   float64
 7   asian_alone_PE                             3220 non-null   float64
 8   native_hawaiian_pacific_islander_alone_PE  3220 non-null   float64
 9   some_other_race_alone_PE                   3220 non-null   float64
 10  two_or_more_races_PE   

In [0]:
state_demographic_df = state_full_df.merge(race_ethnicity_county_df, on =  'fips').drop('state_y', axis = 1).rename({'state_x': 'state'}, axis = 1)
print("\n============================== state_demographic_df: State Level Demographic Dataset ===========================================")
state_demographic_df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 52 entries, 0 to 51
Data columns (total 14 columns):
 #   Column                                     Non-Null Count  Dtype  
---  ------                                     --------------  -----  
 0   state                                      52 non-null     object 
 1   population                                 52 non-null     int64  
 2   state_code                                 52 non-null     object 
 3   fips                                       52 non-null     int64  
 4   geo_name                                   52 non-null     object 
 5   hispanic_latino_any_race_PE                52 non-null     float64
 6   non_hispanic_latino_any_race_PE            52 non-null     float64
 7   white_alone_PE                             52 non-null     float64
 8   black_african_american_alone_PE            52 non-null     float64
 9   american_indian_alaska_native_alone_PE     52 non-null     float64
 10  asian_alone_PE             

In [0]:
cbsa_demographic_df = county_cbsa_demographic_df.groupby(['cbsa', 'cbsa_type']).agg({'population': 'sum', 
                                                'hispanic_latino_any_race_PE': 'mean', 
                                                'white_alone_PE' : 'mean', 
                                                'black_african_american_alone_PE' : 'mean', 
                                                'american_indian_alaska_native_alone_PE': 'mean', 
                                                'asian_alone_PE': 'mean', 
                                                'native_hawaiian_pacific_islander_alone_PE': 'mean', 
                                                'some_other_race_alone_PE': 'mean', 'two_or_more_races_PE': 'mean'}).reset_index()
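
The cell above averages county percentages with equal weight, so a small county counts as much as a large one. One possible alternative, which is not what this notebook does, is a population-weighted mean. The sketch below compares the two on made-up numbers; the frame `counties`, the CBSA names, and all values are hypothetical.

```python
import pandas as pd

# Illustrative county rows: Metro X has one large and one small county.
counties = pd.DataFrame({
    "cbsa": ["Metro X", "Metro X", "Metro Y"],
    "population": [900_000, 100_000, 50_000],
    "asian_alone_PE": [10.0, 2.0, 4.0],
})

# Unweighted mean, as in the aggregation above.
simple = counties.groupby("cbsa")["asian_alone_PE"].mean()

# Population-weighted mean: sum(pct * population) / sum(population) per CBSA.
counties["pct_times_pop"] = counties["asian_alone_PE"] * counties["population"]
sums = counties.groupby("cbsa")[["pct_times_pop", "population"]].sum()
weighted = sums["pct_times_pop"] / sums["population"]

print(pd.DataFrame({"simple_mean": simple, "weighted_mean": weighted}))
```

For Metro X the unweighted mean is 6.0 while the weighted mean is 9.2, because the larger county dominates. For a CBSA with a single county the two agree.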


In [0]:
# Sum only the percentage columns; cbsa and cbsa_type are strings and population is a count
pct_cols = [col for col in cbsa_demographic_df.columns if col.endswith('_PE')]
cbsa_demographic_df['total'] = cbsa_demographic_df[pct_cols].sum(axis = 1).round()
print("\n============================== cbsa_demographic_df: CBSA Level Demographic Dataset ===========================================")
cbsa_demographic_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 875 entries, 0 to 874
Data columns (total 12 columns):
 #   Column                                     Non-Null Count  Dtype  
---  ------                                     --------------  -----  
 0   cbsa                                       875 non-null    object 
 1   cbsa_type                                  875 non-null    object 
 2   population                                 875 non-null    int64  
 3   hispanic_latino_any_race_PE                875 non-null    float64
 4   white_alone_PE                             875 non-null    float64
 5   black_african_american_alone_PE            875 non-null    float64
 6   american_indian_alaska_native_alone_PE     875 non-null    float64
 7   asian_alone_PE                             875 non-null    float64
 8   native_hawaiian_pacific_islander_alone_PE  875 non-null    float64
 9   some_other_race_alone_PE                   875 non-null    float64
 10  two_or_more_races_PE     

## Save Data

In [0]:
county_cbsa_demographic_df.to_csv('drive/My Drive/Colab Notebooks/assets/county_cbsa_demographic.csv')
cbsa_demographic_df.to_csv('drive/My Drive/Colab Notebooks/assets/cbsa_demographic.csv')
state_demographic_df.to_csv('drive/My Drive/Colab Notebooks/assets/state_demographic.csv')