The data used for these visualizations are the 2018 ACS 5-Year Estimates downloaded from the U.S. Census Bureau. The data set contains census tract-level data for Los Angeles County and includes the following variables relevant to our analysis:

Employment rate for persons over 16;
Unemployment rate; 
Median household income;
Population;
Percent of white residents;
Percent of Asian residents;
Percent of residents of other races;
Total housing units;
Median home value;
Percent of residents with health insurance coverage;
Percent of residents under the poverty level.


In [1]:
# import libraries

import pandas as pd
import numpy as np
import seaborn as sns

In [2]:
# load ACS data
df = pd.read_csv('tractdata.csv', dtype={'GEOID11':str})
df.shape

FileNotFoundError: [Errno 2] File b'tractdata.csv' does not exist: b'tractdata.csv'
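
The `FileNotFoundError` above means `tractdata.csv` was not in the notebook's working directory when the cell ran, which is why the later cells show no output. Separately, the `dtype={'GEOID11': str}` argument matters because California tract GEOIDs begin with `0`, and that leading zero is lost if the column is parsed as an integer. A minimal sketch using a made-up two-row CSV (the rows and `med_hhinc` values are hypothetical, not taken from the real file):

```python
import io
import pandas as pd

# Hypothetical stand-in for tractdata.csv; only the GEOID format matches the notebook.
csv_text = "GEOID11,med_hhinc\n06037183520,50000\n06037206010,40000\n"

# Default parsing treats GEOID11 as an integer and drops the leading zero
as_int = pd.read_csv(io.StringIO(csv_text))

# Forcing str keeps the full 11-character identifier
as_str = pd.read_csv(io.StringIO(csv_text), dtype={'GEOID11': str})

print(as_int['GEOID11'].iloc[0])
print(as_str['GEOID11'].iloc[0])
```

Keeping the column as text is also what makes the later `.str.slice(...)` and `.isin(...)` calls work on it.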

In [None]:
df.columns

In [None]:
# view the dataframe's "head"
df.head()

In [None]:
# Extract the census tracts through which the Metro Gold Line passes
ex_tracts = ['06037183520', '06037183610', '06037183620', '06037183701', '06037183702', '06037183810', '06037185202', '06037185310', 
             '06037199000', '06037199400', '06037206010', '06037206020', '06037207102', '06037207103', '06037400602', '06037400800', 
             '06037404600', '06037430002', '06037430101', '06037430721', '06037430801', '06037430901', '06037431100', '06037461901', 
             '06037461902', '06037462201', '06037462302', '06037462700', '06037462800', '06037462900', '06037463000', '06037463602', 
             '06037464000', '06037480600', '06037480703']
mask = df['GEOID11'].isin(ex_tracts)
df_extracts = df[mask].copy()  # copy so later column assignments act on an independent dataframe
df_extracts.shape
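
Because `isin` silently ignores IDs that are not in the table, the shape alone does not confirm that every listed tract was matched. One possible check, sketched on a toy dataframe (`df_demo`, `wanted`, and the `06037999999` ID are hypothetical and only for illustration):

```python
import pandas as pd

# Toy frame standing in for the ACS table
df_demo = pd.DataFrame({'GEOID11': ['06037183520', '06037183610', '06037206010']})
wanted = ['06037183520', '06037206010', '06037999999']  # last ID is deliberately absent

mask = df_demo['GEOID11'].isin(wanted)
selected = df_demo[mask].copy()

# IDs that were requested but do not appear in the data
missing = sorted(set(wanted) - set(df_demo['GEOID11']))
print(len(selected), missing)
```

An empty `missing` list would indicate that every requested tract is present.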

In [None]:
# clean data: keep only the 6-digit tract code from each GEOID and use it as the index
df['GEOID11'] = df['GEOID11'].str.slice(-6)  # last 6 characters are the tract code
df = df.set_index('GEOID11')  # set tract codes as index
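
For reference, an 11-character tract GEOID is a 2-digit state code, a 3-digit county code, and a 6-digit tract code, which is why slicing the last six characters isolates the tract. A small sketch (the helper name `split_tract_geoid` is made up for this example):

```python
def split_tract_geoid(geoid):
    """Split an 11-digit tract GEOID into (state, county, tract)."""
    if len(geoid) != 11 or not geoid.isdigit():
        raise ValueError(f"expected 11 digits, got {geoid!r}")
    return geoid[:2], geoid[2:5], geoid[5:]

state, county, tract = split_tract_geoid('06037183520')
print(state, county, tract)  # 06 037 183520
```

A 6-digit tract code is only unique within one county, so using it as the index is safe here only because the data set is limited to Los Angeles County.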

In [None]:
#clean med_hhinc column
df['med_hhinc'] = df['med_hhinc'].replace({'250,000+':'250000','-':'0'})
df['med_hhinc'] = df['med_hhinc'].astype(float)
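
Note that mapping `'-'` to `'0'` treats tracts with no reported income as having an income of zero, which can pull medians and means downward. A possible alternative, shown here only as a sketch on hypothetical values, is to coerce non-numeric entries to `NaN` so pandas skips them in summary statistics:

```python
import pandas as pd

# Hypothetical raw values in the style of the ACS export
raw = pd.Series(['54321', '250,000+', '-', '80000'])

# Approach used in the cell above: '-' becomes 0
as_zero = raw.replace({'250,000+': '250000', '-': '0'}).astype(float)

# Alternative: recode the top-coded bin, then coerce anything non-numeric to NaN
as_nan = pd.to_numeric(raw.replace({'250,000+': '250000'}), errors='coerce')

print(as_zero.median(), as_nan.median())
```

Which treatment is appropriate depends on what `'-'` means in the source table; the sketch assumes it marks a missing estimate.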

In [None]:
#convert med_value to float
df['med_value'] = df['med_value'].astype(float)

In [None]:
#convert med_value from thousands of dollars to dollars
df['med_value'] = df['med_value'] * 1000

In [None]:
#create new dataframe for all station tracts
station_tracts = ['207103','206010','207102','206020',
                  '183610','183620','183701','183810','183520',
                  '199000','185202','185310','199400',
                  '480600','480703']
df_station = df.loc[station_tracts]

In [None]:
#create new dataframe that excludes station tracts
df_others = df.drop(station_tracts, axis='index')
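
Both `.loc` with a list of labels and `.drop` raise a `KeyError` if any listed tract is missing from the index. If that is a concern, a boolean mask built with `Index.isin` produces the same two-way split while tolerating missing labels. A sketch on a toy table (all names and values below are hypothetical):

```python
import pandas as pd

# Toy table indexed by 6-digit tract code
df_demo = pd.DataFrame(
    {'med_hhinc': [50000.0, 62000.0, 71000.0, 45000.0]},
    index=pd.Index(['207103', '206010', '123456', '654321'], name='GEOID11'),
)
station_ids = ['207103', '206010', '480600']  # '480600' is absent from the toy table

is_station = df_demo.index.isin(station_ids)
demo_station = df_demo[is_station]    # rows whose tract is in the list
demo_others = df_demo[~is_station]    # every other row

print(len(demo_station), len(demo_others))
```

The trade-off is that a mistyped tract code goes unnoticed with this approach, whereas `.loc` would fail loudly.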

Descriptive Stats

In [None]:
# What is the median household income in the station tracts?
df_station['med_hhinc'].median()

In [None]:
# What is the mean of the tract-level median household incomes in the station tracts?
df_station['med_hhinc'].mean()

In [None]:
# What is the median household income in all other tracts?
df_others['med_hhinc'].median()

In [None]:
# What is the median home value in the station tracts?
df_station['med_value'].median()

In [None]:
# What is the mean of the tract-level median home values in the station tracts?
df_station['med_value'].mean()

In [None]:
# What is the median home value in all other tracts?
df_others['med_value'].median()
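
The six single-number cells above could also be collected into one comparison table. A rough sketch of that idea with `groupby` and `agg`, using a made-up `group` label column and hypothetical income values rather than the real tract data:

```python
import pandas as pd

# Hypothetical data: each row is a tract, labeled by whether it is near a station
demo = pd.DataFrame({
    'group': ['station', 'station', 'station', 'other', 'other', 'other', 'other'],
    'med_hhinc': [40000.0, 50000.0, 90000.0, 30000.0, 60000.0, 70000.0, 120000.0],
})

# One row per group, one column per statistic
summary = demo.groupby('group')['med_hhinc'].agg(['count', 'median', 'mean'])
print(summary)
```

Including `count` alongside the median and mean makes it easier to see how small the station group is relative to the rest of the county.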

Visualizations

In [None]:
# Home value density plot (smoothed histogram)
sns.set()
sns.set_style("whitegrid")

valhist = sns.kdeplot(df_station['med_value'].dropna(), label='Near Gold Line stations',
                    shade=True)
valhist = sns.kdeplot(df_others['med_value'].dropna(), label='All other tracts',
                    shade=True)
valhist.set_xlim(left=0, right=1000000)
valhist.set_title('Smoothed histogram of tract-level median home values')
valhist.set_xlabel('')
valhist.set_ylabel('')
valhist.get_yaxis().set_visible(False)
valhist.legend()  # make sure the two series labels are shown
sns.despine()  # remove top/right spines; only has an effect once the axes exist

# format x tick labels as dollar amounts
new_labels = ['${:,.0f}'.format(x) for x in valhist.get_xticks()]
valhist.set_xticklabels(new_labels)
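
The tick labels rely on the format spec `'${:,.0f}'`: a literal dollar sign, a comma as the thousands separator, and zero decimal places. A standalone illustration with hand-picked tick values (the `ticks` list is an assumption, not what matplotlib necessarily chooses):

```python
# Hand-picked tick positions for illustration
ticks = [0, 250000, 500000, 1000000]

# Same format string as in the plotting cell above
labels = ['${:,.0f}'.format(x) for x in ticks]
print(labels)  # ['$0', '$250,000', '$500,000', '$1,000,000']
```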

Median home values in census tracts near Metro Gold Line stations are higher than in Los Angeles County census tracts overall. Because this is only 2018 data, it will be interesting to compare it with 2000 and 2010 data to see how median home values have changed over time.

In [None]:
# Income density plot (smoothed histogram)
sns.set()
sns.set_style("whitegrid")

inchist = sns.kdeplot(df_station['med_hhinc'], label='Near Gold Line stations',
                    shade=True)
inchist = sns.kdeplot(df_others['med_hhinc'], label='All other tracts',
                    shade=True)
inchist.set_xlim(left=0, right=150000)
inchist.set_title('Smoothed histogram of tract-level median income')
inchist.set_xlabel('')
inchist.set_ylabel('')
inchist.get_yaxis().set_visible(False)
inchist.legend()  # make sure the two series labels are shown
sns.despine()  # remove top/right spines; only has an effect once the axes exist

# format x tick labels as dollar amounts
inc_labels = ['${:,.0f}'.format(x) for x in inchist.get_xticks()]
inchist.set_xticklabels(inc_labels, rotation=45, horizontalalignment='right')


Median incomes in the census tracts near Metro Gold Line stations are similar to the median income across all Los Angeles County census tracts. However, median incomes in the Gold Line-adjacent tracts are more tightly concentrated than in county census tracts overall. It will be interesting to compare this with data from previous years to see how median household incomes have changed over time.
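
The "more concentrated" observation is read off the density plot; it could be backed by a spread statistic such as the interquartile range (IQR). The sketch below uses two hypothetical income series with the same median but different spread, and the `iqr` helper is defined here for illustration:

```python
import pandas as pd

def iqr(values):
    """Interquartile range: 75th percentile minus 25th percentile."""
    return values.quantile(0.75) - values.quantile(0.25)

# Hypothetical incomes, not the real tract data
station_inc = pd.Series([48000.0, 52000.0, 55000.0, 58000.0, 62000.0])
other_inc = pd.Series([20000.0, 40000.0, 55000.0, 90000.0, 150000.0])

print(station_inc.median(), other_inc.median())
print(iqr(station_inc), iqr(other_inc))
```

On the real data, the same comparison would be `iqr(df_station['med_hhinc'])` against `iqr(df_others['med_hhinc'])`, assuming those columns have been cleaned as above.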

In [None]:
# Identify which city each Gold Line-adjacent census tract is in
df_extracts['city'] = df_extracts['GEOID'].str.slice(14,20)
fips = {'183520' : 'Los Angeles', '183610' : 'Los Angeles', '183620' : 'Los Angeles', '183701' : 'Los Angeles', '183702' : 'Los Angeles', 
        '183810' : 'Los Angeles', 
        '185202' : 'Los Angeles', '185310' : 'Los Angeles', '199000' : 'Los Angeles', '199400' : 'Los Angeles', '206010' : 'Los Angeles', 
        '206020' : 'Los Angeles', '207102' : 'Los Angeles', '207103' : 'Los Angeles', '400602' : 'Azusa', '400800' : 'Azusa', 
        '404600' : 'Irwindale', '430002' : 'Duarte', '430101' : 'Duarte', '430721' : 'Arcadia', '430801': 'Arcadia', '430901' : 'Monrovia', 
        '431100' : 'Monrovia', '461901' : 'Pasadena', '461902' : 'Pasadena', '462201' : 'Pasadena', '462301' : 'Pasadena', 
        '462302' : 'Pasadena', '462700' : 'Pasadena', '462800' : 'Pasadena', '462900' : 'Pasadena', 
        '463000' : 'Pasadena', '463602' : 'Pasadena', '464000' : 'Pasadena', '480600' : 'South Pasadena', '480703' : 'South Pasadena'}
df_extracts['city'] = df_extracts['city'].replace(fips)
df_extracts.set_index('GEOID11')
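
One caveat with `replace` and a lookup dictionary: a tract code that is missing from the dictionary passes through unchanged and ends up in the `city` column as if it were a city name. `Series.map` turns unmapped codes into `NaN` instead, which makes gaps easier to spot. A short sketch with a two-entry excerpt of the lookup (the `999999` code is made up):

```python
import pandas as pd

tract_to_city = {'183520': 'Los Angeles', '400602': 'Azusa'}  # excerpt of the lookup above
tracts = pd.Series(['183520', '400602', '999999'])  # last code is made up and unmapped

city_replace = tracts.replace(tract_to_city)  # unmapped code passes through unchanged
city_map = tracts.map(tract_to_city)          # unmapped code becomes NaN
unmapped = tracts[city_map.isna()].tolist()

print(city_replace.tolist())
print(unmapped)
```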

In [None]:
# Create bar chart that shows how many selected census tracts are in each city through which the Metro Gold Line passes to understand distribution of data
sns.set()
order = df_extracts['city'].value_counts().index
kx = sns.countplot(x=df_extracts['city'], order=order, alpha=0.7)
# rotate the tick labels, set x and y axis labels, then save
kx.set_xticklabels(kx.get_xticklabels(), rotation=45, horizontalalignment='right')
kx.set_xlabel('Cities the Metro Gold Line passes through')
kx.set_ylabel('Number of census tracts')
kx.get_figure().savefig('city-tracts-countplot.png', dpi=600, bbox_inches='tight')

This graph shows how the census tracts used for this analysis are distributed across the cities the Metro Gold Line passes through. While it is only a subsample of all census tracts near the Metro Gold Line, it provides context for the descriptive statistics calculated earlier.

In [None]:
# Identify the stations within the area
df_extracts['station'] = df_extracts['GEOID'].str.slice(14,20)
fips = {'183520' : 'Highland Park','183610':'Highland Park','183620':'Highland Park','183701':'Highland Park/Southwest Museum','183702':'Southwest Museum',
        '183810':'Highland Park','185202':'Heritage Square','185310':'Heritage Square','199000':'Heritage Square',
        '199400':'Heritage Square','206010':'Chinatown','206020':'Chinatown','207102':'Chinatown','207103':'Chinatown',
        '400602':'Azusa Downtown','400800':'APU/Citrus','404600':'Irwindale','430002':'Duarte/City of Hope',
        '430101':'Duarte/City of Hope','430721':'Arcadia','430801':'Arcadia','430901':'Monrovia','431100':'Monrovia',
        '461901':'Memorial Park','461902':'Monrovia','462201':'Lake Avenue','462302':'Allen','462700':'Allen','462800':'Allen',
        '462900':'Sierra Madre Villa','463000':'Sierra Madre Villa','463602':'Del Mar','464000':'Fillmore','480600':'South Pasadena','480703':'South Pasadena'}
df_extracts['station'] = df_extracts['station'].replace(fips)
df_extracts.set_index('GEOID11')

In [None]:
#clean med_hhinc column in df_extracts
df_extracts['med_hhinc'] = df_extracts['med_hhinc'].replace({'250,000+':'250000','-':'0'})
df_extracts['med_hhinc'] = df_extracts['med_hhinc'].astype(float)

In [None]:
# Create a box plot of median household income in the census tracts by the Metro Gold Line station they are closest to

sns.set()  # reset to seaborn defaults first so the style and context below take effect
sns.set_style('whitegrid')
sns.set_context('paper')
ax = sns.boxplot(x=df_extracts['med_hhinc'], y=df_extracts['station'], fliersize=1, boxprops={'alpha':0.7})
ax.set_xlim(left=0)
ax.set_title('Box plot of median household income around Metro Gold Line stations')
ax.set_xlabel('Median household income')
ax.set_ylabel('Stations')

This box plot shows median household income grouped by the closest Metro Gold Line station. While it covers only a subsample of current Metro Gold Line stations, it begins to show the differences in median household income among these census tracts. For example, the median household income in the census tracts near the Sierra Madre Villa station in Pasadena is about twice that in the census tracts near the Highland Park station. It will be interesting to compare this with previous years' data to see how this variable has changed over time.
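
A comparison like "about twice" can be checked numerically by taking the median per station and dividing. The sketch below uses two placeholder stations, `A` and `B`, with hypothetical incomes; it does not reproduce the notebook's actual figures:

```python
import pandas as pd

# Hypothetical tract-level incomes for two placeholder stations
demo = pd.DataFrame({
    'station': ['A', 'A', 'A', 'B', 'B', 'B'],
    'med_hhinc': [40000.0, 50000.0, 60000.0, 90000.0, 100000.0, 110000.0],
})

# Median of tract medians per station, lowest first
station_median = demo.groupby('station')['med_hhinc'].median().sort_values()
ratio = station_median['B'] / station_median['A']

print(station_median)
print(ratio)
```

With the real data, the same `groupby` on `df_extracts` would give the per-station medians that the box plot draws as center lines.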

Ryan - Created 5 visualizations (the income and home value histograms are in the final deliverable); created a partial list of station-adjacent tracts.
Cason - Created two race/ethnicity histograms; created a pairplot scatter graph to detect variable relationships; coordinated questions/meeting with Geoff/Kurt.
Serena - Created a health insurance coverage histogram; discussed with teammates.
Vesna - Created two histograms and a pair plot; compiled final code from team members, wrote descriptions, and submitted the final deliverable.
Minghang - Provided basic tract-level data; made the bar chart; discussed with teammates.
Fiona - Independently created 4 visualizations, including a scatter plot, regplot, and boxplot; the boxplot showing median household income around the different stations along the Gold Line is included in the final deliverable; discussed with teammates during the group meeting.
