Urban Data Science & Smart Cities <br>
URSP688Y Spring 2025<br>
Instructor: Chester Harvey <br>
Urban Studies & Planning <br>
National Center for Smart Growth <br>
University of Maryland

# Exercise04

This last exercise is an opportunity for you to get started on your final project. Please identify a portion of your project to get started on and submit a notebook (and any other related files) where you:

1. State the question you are aiming to address with this portion of your analysis
2. Outline the approach you will use to answer that question (pseudocode or you can start to more formally outline the approach section for your final narrative)
3. Operationalize your approach with data and code that you can later slot into your final analysis

## Submitting

Please make a pull request with all of your code and reasonably-sized data in a folder with your first name. See the example with my name in the `exercise04` directory.

If you have datasets that are too large for GitHub or should not be made public, please upload them to a cloud location (e.g., Google Drive) to which I (and ideally your classmates) have access. Please also provide instructions for how someone running your code should properly locate or connect to these files so the analysis will run properly. For example, should they copy and paste the files into the same directory as your notebook, or a provided `data` directory? Best practice is to include these instructions in a separate ReadMe.md or ReadMe.txt file, or at the top of your notebook.

In [None]:
#Analyzing Public Transportation Access and Housing Affordability for Hispanic
#Communities in Washington, D.C.

#This study seeks to address the following questions:
#1. How does the spatial distribution of affordable housing units in Washington, D.C.
#relate to access to public transit services for Hispanic populations?
#2. To what extent are Hispanic-majority census tracts located near high-frequency or
#multimodal transit services (e.g., Metrorail, Metrobus)?
#3. How has gentrification affected the proximity of Hispanic residents to public
#transit infrastructure over time?

In [None]:
# 1. First, install the necessary packages

!pip install pandas
!pip install geopandas
!pip install matplotlib

In [None]:
# 2. Import libraries
import pandas as pd
import os

# Import geopandas
import geopandas as gpd
import matplotlib.pyplot as plt

In [None]:
# 3. Check the working directory and its contents to verify that the data files are available
# (print both, since a notebook cell only displays its last expression)
print(os.getcwd())
print(os.listdir())

In [None]:
# 4. Load the CSV file without a header so that both header rows are read in as data
hisp_pop_cbg_2023_df = pd.read_csv('hisp_pop_cbg_2023.csv', header=None)

# 5. Get row 1 to use as the new column names
new_column_names = hisp_pop_cbg_2023_df.iloc[1].copy()

# 6. Drop the two header rows (0 and 1) so that only data rows remain
hisp_pop_cbg_2023_df = hisp_pop_cbg_2023_df.drop(index=[0, 1])

# 7. Assign the new headers
hisp_pop_cbg_2023_df.columns = new_column_names

# 8. Rename "GIS Join Match Code" to "GISJOIN"
hisp_pop_cbg_2023_df = hisp_pop_cbg_2023_df.rename(columns={"GIS Join Match Code": "GISJOIN"})

# 9. Reset the index, discarding the old one rather than keeping it as a column
hisp_pop_cbg_2023_df = hisp_pop_cbg_2023_df.reset_index(drop=True)

In [None]:
# Preview the DataFrame
hisp_pop_cbg_2023_df.head()
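The header handling can be checked on a small in-memory table before running it on the real file. The sketch below assumes the CSV has exactly two header rows (short codes, then descriptive labels), which the steps above imply but do not confirm; the codes and values are made up for illustration. It uses `header=1` in `pd.read_csv`, which takes row 1 as the column names and discards row 0 in one step.

```python
import io

import pandas as pd

# Toy stand-in for the real CSV: row 0 holds short codes, row 1 holds
# descriptive labels, and the data starts at row 2 (all values are made up).
raw_csv = io.StringIO(
    "GISJOIN,CODE001,CODE002\n"
    "GIS Join Match Code,Estimates: Total,Estimates: Hispanic or Latino\n"
    "G0001,1200,150\n"
    "G0002,900,45\n"
)

# header=1 uses row 1 as the column names and skips the row above it
toy_df = pd.read_csv(raw_csv, header=1)
toy_df = toy_df.rename(columns={"GIS Join Match Code": "GISJOIN"})

print(toy_df)
```

A side benefit of this route is that pandas infers numeric types for the estimate columns, because the label row is never mixed in with the data.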

In [None]:
# 10. Load the shapefile
wdc_cbg_2023_gdf = gpd.read_file('US_blck_grp_2023.shp')

# Filter only for Washington, D.C. (STATEFP == '11')
wdc_cbg_2023_gdf = wdc_cbg_2023_gdf[wdc_cbg_2023_gdf['STATEFP'] == "11"].reset_index(drop=True)

In [None]:
# View the first few rows
wdc_cbg_2023_gdf.head()

In [None]:
# 11. Call the function join_df_to_gdf(df, gdf, label) 
# from the module data_prep.py 
import data_prep

# call the function
wdc_cbg_2023_gdf_merged = data_prep.join_df_to_gdf(hisp_pop_cbg_2023_df, wdc_cbg_2023_gdf, "GISJOIN")

In [None]:
# View the first few rows
wdc_cbg_2023_gdf_merged.head()
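The `data_prep` module is not shown in this notebook, so the following is only a hypothetical sketch of what a function like `join_df_to_gdf(df, gdf, label)` might do: a left attribute join on a shared key column. The body and the toy tables are assumptions, and plain DataFrames stand in for the GeoDataFrame so the example runs without spatial data.

```python
import pandas as pd


def join_df_to_gdf(df, gdf, label):
    """Hypothetical sketch: attach the columns of df to gdf using a shared key column."""
    df = df.copy()
    gdf = gdf.copy()
    # Compare keys as strings so numeric-looking IDs still match
    df[label] = df[label].astype(str)
    gdf[label] = gdf[label].astype(str)
    # Merging from the geometry table keeps every shape, even without a match
    return gdf.merge(df, on=label, how="left")


# Toy inputs (made up); the "geometry" column is only a placeholder here
shapes = pd.DataFrame({"GISJOIN": ["G0001", "G0002"], "geometry": ["poly1", "poly2"]})
attrs = pd.DataFrame({"GISJOIN": ["G0001"], "Estimates: Total": [1200]})

merged = join_df_to_gdf(attrs, shapes, "GISJOIN")
print(merged)
```

Using `how="left"` from the shape table means unmatched block groups show up with missing values instead of silently disappearing, which makes join problems easier to spot.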

In [None]:
# 12. Load the transit stops shapefile (stop-level frequency data, kept in case it is needed for further analysis)
transit_stops_frequency = gpd.read_file('aggregated_stop_loading.shp')

# Check results
print(transit_stops_frequency.shape)
transit_stops_frequency.head()

In [None]:
# 13. Load Washington DC boundary shapefile
dc_boundary = gpd.read_file('Washington_DC_Boundary.shp')

# Check results
print(dc_boundary.shape)
dc_boundary.head()

In [None]:
# 14. Load Affordable_Housing shapefile
affordable_housing = gpd.read_file('Affordable_Housing.shp')

In [None]:
# 15. Make sure the coordinate reference system is the same across the GeoDataFrames

# Use NAD83 / UTM zone 18N (EPSG:26918), which covers Washington, D.C. and has
# units of meters, in case of distance analysis

wdc_cbg_2023_gdf_merged = wdc_cbg_2023_gdf_merged.to_crs(epsg=26918)
transit_stops_frequency = transit_stops_frequency.to_crs(epsg=26918)
affordable_housing = affordable_housing.to_crs(epsg=26918)
dc_boundary = dc_boundary.to_crs(epsg=26918)
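The choice of projected CRS matters once distances enter the analysis. Web Mercator (EPSG:3857) is not a UTM system, and it stretches lengths by roughly 1/cos(latitude). The sketch below uses only that spherical Mercator scale formula and an approximate latitude for Washington, D.C.; both are simplifying assumptions, and the 800 m walking distance is an arbitrary illustration.

```python
import math

DC_LATITUDE_DEG = 38.9  # approximate latitude of Washington, D.C. (assumption)


def mercator_scale_factor(lat_deg):
    """Approximate linear scale factor of a spherical Mercator projection."""
    return 1.0 / math.cos(math.radians(lat_deg))


scale = mercator_scale_factor(DC_LATITUDE_DEG)

# A true ground distance of 800 m, as it would measure in Web Mercator units
true_distance_m = 800
apparent_distance_m = true_distance_m * scale

print(f"Scale factor: {scale:.3f}")
print(f"{true_distance_m} m on the ground measures about {apparent_distance_m:.0f} units")
```

A buffer or nearest-stop distance computed in Web Mercator would therefore be overstated by more than a quarter at this latitude, whereas a UTM zone that covers the study area keeps the error very small.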

In [None]:
# 16. Clip the transit_stops_frequency to dc_boundary
transit_stops_frequency_clipped = gpd.clip(transit_stops_frequency, dc_boundary)

# Check results
transit_stops_frequency_clipped.head()

In [None]:
#CHECK THE MAP 
# Create a plot with a nice figure size
fig, ax = plt.subplots(figsize=(12, 12))

# Plot DC boundary first (background layer)
dc_boundary.plot(ax=ax, color='white', edgecolor='black')

# Plot CBGs (maybe with transparent fill so we can still see stops)
wdc_cbg_2023_gdf_merged.plot(ax=ax, facecolor="none", edgecolor="gray", linewidth=0.5)

# Plot Transit Stops
transit_stops_frequency_clipped.plot(ax=ax, markersize=5, color='red', alpha=0.7)

# Plot Affordable Housing projects
affordable_housing.plot(ax=ax, markersize=10, color='blue', alpha=1)

# Add title
ax.set_title('Washington, D.C.: CBGs, Transit Stops, and Affordable Housing', fontsize=16)

# Remove axis
ax.set_axis_off()

# Show the plot
plt.show()

In [None]:
# 17. List the column names of affordable_housing
affordable_housing.columns.tolist()

In [None]:
# 18. Rename specific columns
affordable_housing = affordable_housing.rename(columns={
    'MAR_WARD': "Ward Number",
    'PROJECT_NA': "Project Name",
    'STATUS_PUB': "Project Status",
    'AGENCY_CAL': "Agency Name",
    'TOTAL_AFFO': "Total Number of Affordable Housing Units",
    'AFFORDABLE': "Number of Affordable Housing Units with 30% AMI or Lower",
    'AFFORDAB_1': "Number of Affordable Housing Units with 31-50% AMI",
    'AFFORDAB_2': "Number of Affordable Housing Units with 51-60% AMI",
    'AFFORDAB_3': "Number of Affordable Housing Units with 61-80% AMI",
    'AFFORDAB_4': "Number of Affordable Housing Units with 81% AMI or Higher"
})

# Check results
affordable_housing.head()

In [None]:
# 19. Spatially join wdc_cbg_2023_gdf_merged with the affordable housing projects
wdc_cbg_2023_aff_hous_sjoin = gpd.sjoin(
    wdc_cbg_2023_gdf_merged,      # left GeoDataFrame (CBGs)
    affordable_housing,           # right GeoDataFrame (affordable housing projects)
    how="left",                   # keep all block groups, even those with no affordable housing
    predicate="intersects"        # CBG intersects project ('contains' would exclude projects on a CBG boundary)
)

# Optional: Check result
wdc_cbg_2023_aff_hous_sjoin.head()

In [None]:
# 20. Drop the old 'index_right' if it exists
if 'index_right' in wdc_cbg_2023_aff_hous_sjoin.columns:
    wdc_cbg_2023_aff_hous_sjoin = wdc_cbg_2023_aff_hous_sjoin.drop(columns=['index_right'])

# Spatially join wdc_cbg_2023_aff_hous_sjoin with the clipped transit stops
wdc_cbg_2023_aff_hous_transit_sjoin = gpd.sjoin(
    wdc_cbg_2023_aff_hous_sjoin,       # left GeoDataFrame (CBGs with housing attributes)
    transit_stops_frequency_clipped,   # right GeoDataFrame (transit stops within D.C.)
    how="left",                        # keep all block groups, even those with no transit stops
    predicate="intersects"             # CBG intersects stop ('contains' would exclude stops on a CBG boundary)
)

# Check results
wdc_cbg_2023_aff_hous_transit_sjoin.head()
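A left spatial join returns one row per matching pair, so a block group that contains several housing projects or transit stops appears several times in the result, and chaining two joins multiplies those rows. One way to get back to a single row per block group is to aggregate by `GISJOIN` before mapping. The sketch below uses a toy table in place of the real join output; the values and the `affordable_units` column name are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for a left spatial join result: block group "A" matched two
# projects, "B" matched one, and "C" matched none (missing value).
joined = pd.DataFrame({
    "GISJOIN": ["A", "A", "B", "C"],
    "Total Number of Affordable Housing Units": [10, 25, 40, None],
})

# Sum units within each block group; missing values are skipped, so a block
# group with no projects ends up with a total of 0
units_per_cbg = (
    joined.groupby("GISJOIN")["Total Number of Affordable Housing Units"]
    .sum()
    .rename("affordable_units")
    .reset_index()
)

print(units_per_cbg)
```

The aggregated table can then be merged back onto the block group geometries, so that quantiles and map colors are computed once per block group instead of once per matched pair.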

In [None]:
wdc_cbg_2023_aff_hous_transit_sjoin.columns.tolist()

In [None]:
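# Force unique column names by keeping only the first of any duplicates (e.g., 'Estimates: Total');
# .copy() makes the result an independent table before new columns are added below
wdc_cbg_2023_aff_hous_transit_sjoin = wdc_cbg_2023_aff_hous_transit_sjoin.loc[:, ~wdc_cbg_2023_aff_hous_transit_sjoin.columns.duplicated()].copy()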

# Calculate Hispanic Population Percentage
wdc_cbg_2023_aff_hous_transit_sjoin['Hispanic Population Percentage'] = (
    wdc_cbg_2023_aff_hous_transit_sjoin['Estimates: Hispanic or Latino'].astype(float) /
    wdc_cbg_2023_aff_hous_transit_sjoin['Estimates: Total'].astype(float)
) * 100

wdc_cbg_2023_aff_hous_transit_sjoin.head()

In [None]:
#CREATE A BIVARIATE CHOROPLETH MAP

import matplotlib.pyplot as plt
import numpy as np

# 1. Classify each variable into quantiles
wdc_cbg_2023_aff_hous_transit_sjoin['hispanic_quantile'] = pd.qcut(
    wdc_cbg_2023_aff_hous_transit_sjoin['Hispanic Population Percentage'],
    q=3,  # 3 classes: low, medium, high
    labels=[1, 2, 3]
)

wdc_cbg_2023_aff_hous_transit_sjoin['housing_quantile'] = pd.qcut(
    wdc_cbg_2023_aff_hous_transit_sjoin['Total Number of Affordable Housing Units'].fillna(0),
    q=3,  # 3 classes
    labels=[1, 2, 3]
)

# 2. Combine the two quantiles to create a bivariate class
wdc_cbg_2023_aff_hous_transit_sjoin['bivariate_class'] = (
    wdc_cbg_2023_aff_hous_transit_sjoin['hispanic_quantile'].astype(str) +
    "-" +
    wdc_cbg_2023_aff_hous_transit_sjoin['housing_quantile'].astype(str)
)
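One caveat with quantile classes: `pd.qcut` raises a `ValueError` when two bin edges coincide, which can happen when many block groups share the same value (for example, zero affordable units). The sketch below shows one common workaround, ranking the values first so that ties are broken by order of appearance. The data are made up, and the tie-breaking is arbitrary, so tied values can land in different classes.

```python
import pandas as pd

# Toy data with many tied zeros (made up for illustration)
units = pd.Series([0, 0, 0, 0, 0, 0, 5, 10, 40])

# With this many ties, the lower quantile edges coincide and qcut refuses
try:
    pd.qcut(units, q=3, labels=[1, 2, 3])
    qcut_failed = False
except ValueError:
    qcut_failed = True

# Ranking with method="first" gives every row a distinct rank, so the
# three classes can always be formed
ranked_classes = pd.qcut(units.rank(method="first"), q=3, labels=[1, 2, 3])

print("Plain qcut failed:", qcut_failed)
print(ranked_classes.tolist())
```

If splitting tied values across classes is not acceptable, manually chosen breakpoints with `pd.cut` are an alternative worth considering.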

In [None]:
# 3. Get the breakpoints for Hispanic Population Percentage
hispanic_breaks = wdc_cbg_2023_aff_hous_transit_sjoin['Hispanic Population Percentage'].quantile([0, 1/3, 2/3, 1])
print("Hispanic Population % quantiles:")
print(hispanic_breaks)

# 4. Get the breakpoints for Total Number of Affordable Housing Units
housing_breaks = wdc_cbg_2023_aff_hous_transit_sjoin['Total Number of Affordable Housing Units'].fillna(0).quantile([0, 1/3, 2/3, 1])
print("\nAffordable Housing Units quantiles:")
print(housing_breaks)

In [None]:
# 5. Define a manual color palette for 3x3 combinations
bivariate_color_dict = {
    '1-1': '#e8e8e8',  # low hispanic, low housing
    '1-2': '#ace4e4',
    '1-3': '#5ac8c8',
    '2-1': '#dfb0d6',
    '2-2': '#a5add3',
    '2-3': '#5698b9',
    '3-1': '#be64ac',
    '3-2': '#8c62aa',
    '3-3': '#3b4994',  # high hispanic, high housing
}

In [None]:
# Set colors
wdc_cbg_2023_aff_hous_transit_sjoin['color'] = wdc_cbg_2023_aff_hous_transit_sjoin['bivariate_class'].map(bivariate_color_dict)

In [None]:
# 6. Drop rows without a color (where either variable was missing)
plot_data = wdc_cbg_2023_aff_hous_transit_sjoin.dropna(subset=['color'])

In [None]:
import matplotlib.patches as mpatches

# 1. Create the figure for the map
fig, ax = plt.subplots(1, 1, figsize=(17, 17))

# Plot your data
plot_data.plot(
    color=plot_data['color'],
    linewidth=0.1,
    edgecolor='white',
    ax=ax
)

# Set title and remove axes
ax.set_title('Bivariate Choropleth:\nHispanic Population % vs Affordable Housing Units', fontsize=16)
ax.set_axis_off()

# 2. Build the legend manually

# Define the legend label for each bivariate class
legend_labels = {
    '3-3': "Hispanic HIGH 10.6%-67.6% / Housing HIGH 26-741 units",
    '3-2': "Hispanic HIGH 10.6%-67.6% / Housing MEDIUM 2-25 units",
    '3-1': "Hispanic HIGH 10.6%-67.6% / Housing LOW 0-1 units",
    '2-3': "Hispanic MEDIUM 3.3%-10.6% / Housing HIGH 26-741 units",
    '2-2': "Hispanic MEDIUM 3.3%-10.6% / Housing MEDIUM 2-25 units",
    '2-1': "Hispanic MEDIUM 3.3%-10.6% / Housing LOW 0-1 units",
    '1-3': "Hispanic LOW 0%-3.3% / Housing HIGH 26-741 units",
    '1-2': "Hispanic LOW 0%-3.3% / Housing MEDIUM 2-25 units",
    '1-1': "Hispanic LOW 0%-3.3% / Housing LOW 0-1 units"
}

# Create patch rectangles
patches = [mpatches.Patch(color=bivariate_color_dict[key], label=label)
           for key, label in legend_labels.items()]

# Add legend to the plot
plt.legend(
    handles=patches,
    loc='lower left',
    title='Legend',
    fontsize='small',
    title_fontsize='medium',
    frameon=True,
    framealpha=1
)

# Show the map
plt.show()

In [None]:
# Next steps: create an automatic way to build the legend and explore interactive map options such as folium
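As a starting point for the automatic legend, the labels could be generated from the quantile breakpoints instead of being typed by hand. This is a minimal sketch: the function name, the label format, and the example breakpoints are assumptions for illustration. In the notebook, the breakpoints could come from `hispanic_breaks.tolist()` and `housing_breaks.tolist()`.

```python
def legend_labels_from_breaks(hispanic_breaks, housing_breaks):
    """Build a label for each of the 3x3 bivariate classes from four breakpoints per variable."""
    level_names = ["LOW", "MEDIUM", "HIGH"]
    labels = {}
    # Loop from high to low so the legend lists the darkest classes first
    for i in range(3, 0, -1):
        for j in range(3, 0, -1):
            hisp_range = f"{hispanic_breaks[i - 1]:.1f}%-{hispanic_breaks[i]:.1f}%"
            hous_range = f"{housing_breaks[j - 1]:.0f}-{housing_breaks[j]:.0f} units"
            labels[f"{i}-{j}"] = (
                f"Hispanic {level_names[i - 1]} {hisp_range} / "
                f"Housing {level_names[j - 1]} {hous_range}"
            )
    return labels


# Illustrative breakpoints (minimum, two tercile cut points, maximum)
example_hispanic_breaks = [0.0, 3.3, 10.6, 67.6]
example_housing_breaks = [0, 1, 25, 741]

auto_labels = legend_labels_from_breaks(example_hispanic_breaks, example_housing_breaks)
for key, label in auto_labels.items():
    print(key, label)
```

The resulting dictionary has the same keys as `bivariate_color_dict`, so it could be passed to the patch-building step in place of a hand-written `legend_labels`.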