## Methods


## Data Collection

To investigate whether walkability affects other aspects of people's lives, we compared socioeconomic outcomes across neighborhoods with different levels of walkability. This required walkability data and socioeconomic outcome data with neighborhood-level granularity. We used the following datasets:

- [**EPA Walkability Index**](https://catalog.data.gov/dataset/walkability-index): This dataset compiles features of each census tract<sup>1</sup> and computes a walkability score for it. Some D.C. neighborhoods are composed of several census tracts, depending on the neighborhood's population density.<sup>2</sup>
- [**U.S. Census Bureau Community Resilience Estimates**](https://www.census.gov/programs-surveys/community-resilience-estimates/data/supplement.html): This dataset compiles socioeconomic factors for each census tract. The ones we chose to analyze were the percentage of households below the poverty threshold, high school completion rates, income inequality, and vehicle ownership.
- [**U.S. Census Tract D.C. GeoJSON**](https://raw.githubusercontent.com/arcee123/GIS_GEOJSON_CENSUS_TRACTS/master/11.geojson): This file is the GeoJSON map file created by the U.S. Census Bureau for Washington, D.C. It includes geographic information about the boundaries of each census tract in Washington, D.C.

<sup>1</sup> A census tract is a geographic region defined for the purpose of taking a census. There are 179 census tracts in Washington, D.C.

<sup>2</sup> Census tracts generally have a population size between 1,200 and 8,000 people, with an optimum size of 4,000 people. A census tract usually covers a contiguous area; however, the spatial size of census tracts varies widely depending on the density of settlement.
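
Since every dataset above is keyed by census tract, it may help to see how a tract identifier is structured. The sketch below splits an 11-digit tract GEOID into its state, county, and tract parts. The helper name `split_tract_geoid` and the sample value are illustrative only and are not part of the project code.

```python
from typing import NamedTuple


class TractGeoid(NamedTuple):
    state: str   # 2-digit state FIPS code ("11" is the District of Columbia)
    county: str  # 3-digit county FIPS code
    tract: str   # 6-digit census tract code


def split_tract_geoid(geoid: str) -> TractGeoid:
    """Split an 11-digit census tract GEOID into state, county, and tract codes."""
    geoid = str(geoid)
    if len(geoid) != 11 or not geoid.isdigit():
        raise ValueError(f"expected an 11-digit tract GEOID, got {geoid!r}")
    return TractGeoid(state=geoid[:2], county=geoid[2:5], tract=geoid[5:])


# illustrative GEOID, used here only to show the layout
example = split_tract_geoid("11001000100")
print(example.state, example.county, example.tract)
```

Keeping the identifier as a string, rather than an integer, avoids losing leading zeros for states whose FIPS code starts with `0`.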

## Data Cleaning

To produce a clean dataset, we wrote a script that subsets each raw data source to D.C. and keeps only the columns of interest. We then joined the files on the census tract ID column. Most columns were percentages, but some, such as the walkability column, were scores on a different numeric range. To make the factors easy to compare in the visualizations, each outcome column was rescaled to a 0-100 range. Because the datasets use negative values to flag missing or dirty data, all negative values were converted to nulls. The GeoJSON data with the geographic boundaries for each D.C. census tract did not require any cleaning.
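
The rescaling and null-handling described above can be sketched in plain Python as follows. This is a minimal illustration, not the notebook's pandas implementation: the function name `rescale` and the sample inputs are assumptions, while the 1-20 and 0-1 source ranges are the ones noted in the notebook's comments.

```python
import math


def rescale(value, old_min, old_max, new_min=0.0, new_max=100.0):
    """Linearly map a value from [old_min, old_max] onto [new_min, new_max].

    Negative or missing inputs are treated as missing data and returned as NaN.
    """
    if value is None or math.isnan(value) or value < 0:
        return math.nan
    return (value - old_min) / (old_max - old_min) * (new_max - new_min) + new_min


# walkability scores run from 1 to 20, so 10.5 is the midpoint of that range
walkability_scaled = rescale(10.5, old_min=1, old_max=20)

# the Gini index runs from 0 to 1
gini_scaled = rescale(0.25, old_min=0, old_max=1)

# a negative sentinel value is treated as missing
missing_scaled = rescale(-1, old_min=1, old_max=20)

print(walkability_scaled, gini_scaled, missing_scaled)
```

Mapping everything onto the same 0-100 range is what lets the walkability score share an axis with the percentage-based outcomes.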


In [None]:
import altair as alt
import pandas as pd
import geopandas as gpd
from pathlib import Path
import requests
import numpy as np

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

"""
IMPORT DATA
"""

# define data directory
data_dir = Path().absolute().parent.absolute().parent/"data"
write_data_dir = Path().absolute().parent.absolute().parent/"cleaned_data"
img_dir = Path().absolute().parent.absolute().parent/"data"/"img"

# import data
walkability = pd.read_csv(data_dir/"joined_depression_cre_walkability.csv")
# cast the tract ID to a string so it matches the GEOID type in the GeoJSON
walkability['geoid_tract_20'] = walkability['geoid_tract_20'].astype(str)
nation = pd.read_csv(data_dir/"cleaned_data"/"nation-joined_depression_cre_walkability.csv")

# Ingest GEOJSON file of census tracts in DC and grab json
req_dc = requests.get('https://raw.githubusercontent.com/arcee123/GIS_GEOJSON_CENSUS_TRACTS/master/11.geojson')
json_dc = req_dc.json()

# create geopandas dataframe and add in the walkability / outcomes data
geo_df = gpd.GeoDataFrame.from_features(json_dc)
merged_df = geo_df.merge(walkability,
                         how = 'left',
                         left_on = 'GEOID',
                         right_on='geoid_tract_20')



"""
NORMALIZE SCORES ACROSS ALL METRICS
"""

def rescale(series, old_min, old_max, new_min=0, new_max=100):
    """Linearly map values from [old_min, old_max] onto [new_min, new_max]."""
    return (series - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

# convert the walkability score to a 0-100 scale to make it easier to interpret
# original range: 1-20
for df in (merged_df, nation):
    df['walkability_score_scaled'] = rescale(df['walkability_score'], 1, 20)

# convert the income inequality (Gini) index to a 0-100 scale to make it easier to interpret
# original range: 0-1; negative values flag missing data, so null them out first
for df in (merged_df, nation):
    gini = df['income_inequality_gini_index']
    df['income_inequality_gini_index'] = gini.where(gini >= 0)
    df['income_inequality_gini_index_scaled'] = rescale(df['income_inequality_gini_index'], 0, 1)


# define columns to report
outcomes_cols = ['walkability_score_scaled',
                 'below_poverty_level_perc',
                 'income_inequality_gini_index_scaled',
                 'hs_grad_perc',
                 'households_no_vehicle_perc']

# negative values flag missing or dirty data, so convert them to nulls
for col in outcomes_cols:
    merged_df[col] = merged_df[col].where(merged_df[col] >= 0)
    nation[col] = nation[col].where(nation[col] >= 0)

# flip metric to be percent of households with a car
merged_df.loc[:, 'households_w_vehicle'] = 100 - merged_df['households_no_vehicle_perc']
nation.loc[:, 'households_w_vehicle'] = 100 - nation['households_no_vehicle_perc']



"""
CLEAN COLUMN NAMES
"""
col_mapping = {'below_poverty_level_perc': '% Below Poverty Level',
               'income_inequality_gini_index_scaled': 'Income Inequality Gini Score',
               'hs_grad_perc': '% HS or Higher Degree',
               'households_w_vehicle': '% with a Vehicle',
               'walkability_score_scaled': 'Walkability Score',
               'neighborhood_name': 'Neighborhood Name'}

merged_df = merged_df.rename(col_mapping, axis='columns')


"""
RE-FORMAT DATA
"""
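# turn the dataframe into long data so that the bar chart can be created with each outcome as a bar
# (the neighborhood name is the id column, so it is left out of the melted value columns)
neighborhood_df = pd.melt(merged_df,
                          id_vars = 'Neighborhood Name',
                          value_vars = [v for k, v in col_mapping.items() if k != 'neighborhood_name'])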

neighborhood_df = neighborhood_df.groupby(['Neighborhood Name', 'variable'])['value'].mean().reset_index()
walk_scores = dict(zip(list(neighborhood_df[neighborhood_df.variable=='Walkability Score']['Neighborhood Name']),
                       list(neighborhood_df[neighborhood_df.variable=='Walkability Score']['value'])
                      ))
neighborhood_df.loc[:, 'Walkability Score'] = neighborhood_df['Neighborhood Name'].map(walk_scores)

# reformat to get the averages
# keep the reported columns, swapping the "no vehicle" metric for its flipped version
nation = nation[outcomes_cols+['households_w_vehicle']].drop('households_no_vehicle_perc', axis='columns')
nation_avg = pd.melt(nation,
                     value_vars = [i for i in col_mapping.keys() if 'neighborhood_name' not in i])
nation_avg = nation_avg.groupby('variable')['value'].mean().reset_index()

# create cleaned column for plotting the national averages
nation_avg['National Average'] = nation_avg['variable'].map(col_mapping)

# create DC average walkability score
neighborhood_df['dc_avg_walk'] = merged_df['Walkability Score'].mean()

# add URL to the american flag icon
nation_avg['flag_url'] = 'https://upload.wikimedia.org/wikipedia/commons/d/de/Flag_of_the_United_States.png'

# write data to CSV
neighborhood_df.to_csv(write_data_dir/"neighborhood_walkability.csv", index = False)
nation_avg.to_csv(write_data_dir/"national_walkability.csv", index = False)
merged_df.to_csv(write_data_dir/"cleaned_walkability.csv", index = False)

## Data Visualization

The visualization below is an `Altair` choropleth of walkability scores across all 179 D.C. census tracts, alongside a bar graph of several social outcomes averaged across the entire District.

Hovering over a census tract displays the name of its neighborhood and the walkability score for that tract. Clicking on a neighborhood (which may be composed of more than one census tract) highlights it on the map and updates the bar graph on the right with the socioeconomic outcomes averaged across just that neighborhood. The color of the bars encodes the neighborhood's walkability score, averaged across its census tracts when there is more than one. The national average for each outcome is overlaid on its bar, marked by an American flag icon and a horizontal line. Hovering over a bar shows the value of that outcome averaged across all census tracts in the neighborhood, and hovering over a flag shows the national average for that outcome.
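
The per-neighborhood averaging behind the bar graph can be sketched without pandas as shown below. In the notebook this corresponds to the `groupby(...).mean()` step. The neighborhood names and scores here are made-up placeholders, and `average_by_neighborhood` is a hypothetical helper, not a function from the project.

```python
from collections import defaultdict
from statistics import mean

# hypothetical tract-level records: (neighborhood name, scaled walkability score)
tract_scores = [
    ("Neighborhood A", 80.0),
    ("Neighborhood A", 90.0),
    ("Neighborhood B", 60.0),
]


def average_by_neighborhood(records):
    """Average tract-level values over all tracts that share a neighborhood."""
    grouped = defaultdict(list)
    for name, score in records:
        grouped[name].append(score)
    return {name: mean(scores) for name, scores in grouped.items()}


neighborhood_avg = average_by_neighborhood(tract_scores)
print(neighborhood_avg)
```

A neighborhood with a single tract simply keeps that tract's value, which is why the tooltip on the map (tract level) and the bar color (neighborhood level) can differ only for multi-tract neighborhoods.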

Overall, D.C. is a highly walkable city, especially in comparison to the rest of the United States: its walkability score is almost double the national average. Consistent with that, far fewer households in D.C. have vehicles than the national average. Interestingly, D.C. fares about average on the other social outcomes reported. The most walkable parts of the city are concentrated in the city center around downtown, and walkability decreases as one ventures out from the center. Although walkability decreases toward all edges of the city, the topmost edges of the map (Wards 3 and 4) show higher car ownership, very low rates of poverty, and higher high school attainment. The lower edges (Wards 7 and 8) also have lower walkability scores, but they have lower rates of car ownership, higher poverty, and lower high school attainment than Wards 3 and 4. This pattern suggests that car ownership may be a key factor in economic success in less walkable areas. In contrast, highly walkable neighborhoods such as Logan Circle / Shaw have significantly lower car ownership, even compared with the D.C. average, yet lower rates of poverty and higher rates of high school attainment.


In [None]:
# define data directories
data_dir = Path().absolute().parent.absolute().parent/"data"
write_data_dir = Path().absolute().parent.absolute().parent/"cleaned_data"
img_dir = Path().absolute().parent.absolute().parent/"data"/"img"

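# read in cleaned data
neighborhood_df = pd.read_csv(write_data_dir/"neighborhood_walkability.csv")
nation_avg = pd.read_csv(write_data_dir/"national_walkability.csv")
merged_df = pd.read_csv(write_data_dir/"cleaned_walkability.csv")

# the CSV stores the tract geometry as WKT text, so rebuild a GeoDataFrame
# before drawing it with mark_geoshape (assumes geopandas >= 0.9 for from_wkt)
merged_df['geometry'] = gpd.GeoSeries.from_wkt(merged_df['geometry'])
merged_df = gpd.GeoDataFrame(merged_df, geometry='geometry')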

"""
CREATE VISUALIZATION
"""

# define a click on the choropleth map so that it can filter the bar chart
click = alt.selection_multi(fields=['Neighborhood Name'])

# create the choropleth map
choropleth = (alt.Chart(merged_df,
                        title = "Walkability of DC Census Tracts"
                       )
              .mark_geoshape(stroke='white')
              .transform_lookup(
                                lookup='geoid_tract_20',
                                from_=alt.LookupData(merged_df,
                                                     'geoid_tract_20',
                                                     ['Walkability Score', 'Neighborhood Name'])
              ).encode(
                    alt.Color('Walkability Score:Q',
                              scale=alt.Scale(scheme='redyellowblue',
                                              reverse=True
                                             ),
                              title = "DC Walkability"
                             ),
                    opacity=alt.condition(click,
                                          alt.value(1),
                                          alt.value(0.2)),
                    tooltip=['Neighborhood Name:N', 'Walkability Score:Q'])
              .add_selection(click)
             )

bars = (
    alt.Chart(neighborhood_df,
              title='Outcomes of DC Neighborhoods')
    .mark_bar()
    .encode(
        x = alt.X('variable:N',
                  axis=alt.Axis(labelAngle=-45)),
        color = 'mean(Walkability Score):Q',
        y = alt.Y('mean(value):Q',
                  sort='x',
                  scale = alt.Scale(domain = [0, 100])
                 ),
        tooltip = [
                 'variable:N',
                 'mean(value):Q'
                ]
    ).properties(
        width = 200,
        height = 300
    ).transform_filter(click))

# modify the axes and title labels
bars.encoding.y.title = 'Avg. Value Across All Census Tracts'
bars.encoding.x.title = 'Outcome'

nation_avg_lines = (alt.Chart(nation_avg)
                    .mark_tick(
                        color="black",
                        thickness=3,
                        size=39,  # controls width of tick
                        strokeDash=[1,2]
                    )
                    .encode(
                        x = 'National Average:N',
                        y='value:Q'
                    ))

nation_avg_img = (alt.Chart(nation_avg)
                    .mark_image(
                        width=15,
                        height=15)
                    .encode(
                        x='National Average:N',
                        y='value:Q',
                        url='flag_url',
                        tooltip = ['National Average', 'value:Q']
                    ))

# plot the two graphs together
alt.hconcat(choropleth, (bars+nation_avg_lines+nation_avg_img))