# Access KPIs to Evaluate City-level Environmental and Social Equity

![Population density near Omaha, US](https://i.imgur.com/HkmpEJT.png)

In this notebook, I present two KPIs for measuring equitable access to essential services within each city in the CDP report.

The level and distribution of access a population has to the services in a city has a significant impact on social factors like [mental health](https://pubmed.ncbi.nlm.nih.gov/23994648/) and [commute time](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1253792/), as well as environmental factors like [natural hazard resilience](https://onlinelibrary.wiley.com/doi/full/10.1111/risa.13492) and [air quality](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3108136/). However, because of the difficulty of gathering and analysing geospatial data, cities can be slow to take a data-driven approach to urban planning.


The KPIs developed in this notebook represent the ease and equity of access that people have to amenities in their city. When considering access, it is important to consider equity as well as average levels of access, as the disadvantaged people at the tails of the access distribution are those most likely to be impacted by climate change and to suffer adverse social outcomes.

Just as CDP's mission is to produce a global dataset of environmental disclosure, I use datasets that have global coverage. There is a disproportionate amount of research focused on North America and Europe due to the availability of rich data there, which can result in important research overlooking the regions that need it most.

To generate these global KPIs I use geospatial methods and datasets. In particular I rely heavily on OpenStreetMap (OSM), a global, open, crowdsourced dataset of geospatial road and amenity data.

In [None]:
!pip install --upgrade seaborn

In [None]:
# Libraries.
import pandas as pd
import geopandas as gpd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats

In [None]:
# Import city table.
df_disclosing = pd.read_csv('../input/cdp-unlocking-climate-solutions/Cities/Cities Disclosing/2020_Cities_Disclosing_to_CDP.csv')
df_disclosing = df_disclosing.drop(['Access', 'Last update', 'Year Reported to CDP'], axis=1)
df_disclosing = df_disclosing.drop(['Reporting Authority', 'First Time Discloser'], axis=1)
df_disclosing['City'] = df_disclosing['City'].fillna('')
df_disclosing['Organization'] = df_disclosing['Organization'].fillna('')

## Executive Summary

* To begin, I build exact boundaries for each city that match the reported population and extent.
* The first KPI measures access to hospitals. Cities can improve this KPI both by improving average access to hospitals and by reducing the disparity between areas of high and low access. Hospitals are used as an example of an access metric that could be extended to anything in the OSM database, such as food or greenspace.
* The second KPI measures poverty-adjusted access to hospitals. This KPI rewards cities which have similar hospital access levels for areas with low and high infant mortality.
* I finish up with a short discussion of the findings, and suggestions for further work.

Some of the datasets I'm using are fairly large. To keep this notebook manageable, in each section I give an overview of any data processing steps, then I add a link to the data as a Kaggle dataset plus any code used to generate the data.

## City Boundaries

It's important to know the extent of each city. A city has the most power to enact change within its borders, and when cross-referencing other responses within the questionnaire it's important to be consistent about the region in question.

I produced a [dataset](https://www.kaggle.com/ajnisbet/cdp-city-boundaries-100m) of boundary polygons, one for each city.

In [None]:
df_boundaries = gpd.read_file('../input/cdp-city-boundaries-100m/city_boundaries_100m.geojson')

First, polygons were found by querying the OSM database for the administrative boundary of each city. For cases where the city name wasn't provided, or where the city name couldn't be found in OSM, I built a parser to extract the city from the name of the reporting organization.
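The parser itself is not shown in this notebook. As a rough illustration only, a minimal sketch of such a parser might strip common administrative prefixes and suffixes from the organization name. The prefix and suffix lists and the function name `city_from_org_name` are hypothetical and are not taken from the actual dataset code.

```python
import re

# Hypothetical prefix/suffix lists: the rules of the real parser are not
# shown in this notebook, so these are illustrative assumptions only.
_PREFIXES = [
    r"city of", r"town of", r"village of", r"municipality of",
    r"prefeitura de", r"municipalidad de", r"ayuntamiento de",
]
_PREFIX_RE = re.compile(r"^\s*(?:" + "|".join(_PREFIXES) + r")\s+", re.IGNORECASE)
_SUFFIX_RE = re.compile(r"\s+(?:city council|municipality|city)\s*$", re.IGNORECASE)


def city_from_org_name(org_name: str) -> str:
    """Guess a city name from a reporting organization's name."""
    name = _PREFIX_RE.sub("", org_name)
    name = _SUFFIX_RE.sub("", name)
    # Drop trailing qualifiers after a comma, e.g. a state or country.
    name = name.split(",")[0]
    return name.strip()


for org in ["City of Adelaide", "Bristol City Council", "Prefeitura de Curitiba"]:
    print(org, "->", city_from_org_name(org))
```

A rule-based approach like this is easy to audit, but it will miss unusual naming patterns, which is one reason a fallback to the CDP location point is still needed.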

In [None]:
df_city_name = pd.DataFrame({'org_name': df_disclosing.Organization, 'city_from_org_name': df_boundaries.city_name_from_org})
df_city_name = df_city_name[~df_city_name.org_name.isnull()]
df_city_name = df_city_name[df_city_name.org_name != df_city_name.city_from_org_name].reset_index(drop=True)
df_city_name.head(10)

The disclosure report lists a location point for many cities, but I found these points were often incorrect. When there were multiple OSM results for the same city name, I prioritised any result that contained the location point. However, manually checking the cases where no OSM result was found near the provided point suggested that OSM was more likely to be correct.

This discrepancy between OSM and the location point provides a dataset of points that are likely incorrect. This could be used as feedback for both cities and CDP to improve accuracy for future reports.
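One simple way to quantify such a discrepancy is the great-circle distance between the CDP point and a representative OSM point. The sketch below uses the haversine formula; whether the `city_location_error` column was computed this way, and in which units, is an assumption rather than something the dataset documents here.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# One degree of longitude along the equator is about 111 km.
print(round(haversine_km(0.0, 0.0, 0.0, 1.0), 1))
```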

Here are the largest discrepancies:

In [None]:
# Table of city point error.
df_city_error = df_boundaries[['account_number', 'city_location_error']].copy()
df_city_error['city'] = df_disclosing.City.values
df_city_error = df_city_error[~df_city_error.city_location_error.isnull()]
df_city_error = df_city_error.sort_values('city_location_error', ascending=False)
df_city_error.head(10).reset_index(drop=True)

In all, 515 cities had a matching administrative area in OSM, 45 cities had a matching representative point in OSM, and for 6 cities the CDP point was used as a fallback.

In [None]:
df_boundaries.city_geom_source.value_counts()

To validate the areas, I checked the reported population against the residential population inside each boundary, using the [WorldPop 100m global population raster](https://www.worldpop.org/) dataset.

In [None]:
df_pop = df_boundaries.copy()
df_pop['cdp_population'] = df_disclosing.Population

fig, ax = plt.subplots()
ax.scatter(df_pop.population_corrected, df_pop.raw_geom_population, alpha=0.3)
ax.set_yscale('log')
ax.set_xscale('log')
ax.set_xlim(10**3, 10**8)
ax.set_ylim(10**3, 10**8)
ax.set_xlabel('CDP reported population')
ax.set_ylabel('Population in OSM boundary');

There is mostly good agreement between the two measures, suggesting that OSM is a good approximation of city boundaries. However, there are a number of outliers.

In some cases, there can be ambiguity between administrative levels with the same name in the same location: a familiar example of this is New York City vs New York State. To correct for this (and to convert the points into areas), boundaries were rescaled so the residential population contained within the boundary matched the population on the report. Only boundaries with a population discrepancy of more than 25% were rescaled.

When the OSM boundary had a smaller population than the report, the boundary was scaled outwards in all directions.

![](https://i.imgur.com/ZEVj4Qk.png)

When the OSM boundary had a larger population than the report, it was cropped to a circle. If a correct city location was provided, the cropping circle was centered at the city location point.

![](https://i.imgur.com/fUPLFgM.png) 
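The rescaling logic described above can be sketched on a toy population grid. This is a simplified illustration, not the code used to build the dataset (see Appendix A for that). It assumes the discrepancy is measured relative to the reported population, that the raster is available as a 2D NumPy array, and that distances are measured in grid cells; the function names are hypothetical.

```python
import numpy as np


def rescale_action(reported_pop, boundary_pop, tolerance=0.25):
    """Decide whether a boundary should be kept, grown, or cropped."""
    # Assumption: discrepancy is relative to the reported population.
    discrepancy = abs(boundary_pop - reported_pop) / reported_pop
    if discrepancy <= tolerance:
        return "keep"
    return "grow" if boundary_pop < reported_pop else "crop"


def crop_radius_cells(pop_grid, center_rc, target_pop):
    """Smallest radius (in cells) around center_rc whose cells hold target_pop people."""
    rows, cols = np.indices(pop_grid.shape)
    dist = np.hypot(rows - center_rc[0], cols - center_rc[1]).ravel()
    order = np.argsort(dist, kind="stable")
    # Cumulative population when cells are added from nearest to furthest.
    cum_pop = np.cumsum(pop_grid.ravel()[order])
    idx = np.searchsorted(cum_pop, target_pop)
    if idx >= cum_pop.size:
        raise ValueError("target population exceeds the population in the grid")
    return float(dist[order][idx])


toy_grid = np.ones((5, 5))  # one person per cell
print(rescale_action(1_000_000, 2_000_000))   # boundary holds too many people
print(crop_radius_cells(toy_grid, (2, 2), 5))  # radius covering 5 people
```

Because all cells at the same distance are included together, the cropped circle can slightly overshoot the target population.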

The resulting [dataset](https://www.kaggle.com/ajnisbet/cdp-city-boundaries-100m) has a population-adjusted polygon for each city. These will be used to calculate the KPIs in this report, but may also be of use to other researchers wishing to perform geospatial analysis on the CDP report.

For code, plots, and more details about the creation of the dataset, see [Appendix A - Border Dataset](https://www.kaggle.com/ajnisbet/cdp-appendix-a-population-dataset).

## KPI 1 - Hospital Access


For this KPI I am evaluating access to hospitals. Hospitals are important to health outcomes and they are also consistently tagged globally in OSM. This KPI could be extended to consider any category of information in the OSM database, such as supermarkets, greenspaces, and public transit stations.

First, I built a [dataset](https://www.kaggle.com/ajnisbet/cdp-osrm-hospital-access) of hospital locations in each city by querying OSM for amenities tagged as `hospital`, and also with the `emergency=yes` tag to exclude small clinics and hospital suppliers. I used [this code](https://www.kaggle.com/ajnisbet/cdp-appendix-hospital-locations).
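For readers who want to reproduce a similar query, one common route is the Overpass API. The sketch below only builds an Overpass QL query string for a bounding box; using Overpass at all, and the function name `hospital_query`, are assumptions here, and the linked code may query OSM differently.

```python
def hospital_query(south, west, north, east, timeout=60):
    """Build an Overpass QL query for emergency hospitals inside a bounding box."""
    # Overpass bounding boxes are ordered (south, west, north, east).
    bbox = f"({south},{west},{north},{east})"
    tags = '["amenity"="hospital"]["emergency"="yes"]'
    return (
        f"[out:json][timeout:{timeout}];\n"
        "(\n"
        f"  node{tags}{bbox};\n"
        f"  way{tags}{bbox};\n"
        f"  relation{tags}{bbox};\n"
        ");\n"
        # `out center` returns a single point for ways and relations.
        "out center;"
    )


# Illustrative bounding box only.
print(hospital_query(36.0, -87.0, 36.4, -86.5))
```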

In [None]:
df_hosp = pd.read_csv('../input/cdp-osrm-hospital-access/osrm_hospital_location.csv')
print(f'Number of cities with at least one hospital: {df_hosp.city_iloc.nunique()}')

438 cities have at least one hospital within the bounds found in the previous step. Some of the cities absent from the hospital dataset don't have hospitals because they are too small. Others may have hospitals that are incorrectly tagged in OSM or simply not tagged.

For large cities without tagged hospitals in OSM, ensuring these are tagged correctly will not only help access research, but more importantly will improve actual access to hospitals due to OSM's integration with popular routing apps and software.

Here are the largest cities without a hospital tagged in OSM:

In [None]:
df_tmp = df_disclosing.copy()
df_tmp['area_km2'] = df_boundaries.area_km2
df_with_hosp = df_tmp.iloc[df_hosp.city_iloc.unique()]
df_no_hosp = df_tmp[~df_tmp.index.isin(df_with_hosp.index)]
df_no_hosp.sort_values('area_km2', ascending=False).head()[['Account Number', 'Organization', 'City', 'Country', 'Population', 'area_km2']]

For each city with a hospital, I selected 500 locations within the boundary. The locations were randomized and weighted according to population density, so each location represents around 0.2% of the city's population. [This code] was used.
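Population-weighted sampling of this kind can be sketched as follows. The example assumes the population raster is a 2D NumPy array, draws cells with probability proportional to their population, and jitters each point uniformly within its cell. Sampling with replacement and the function name `sample_locations` are assumptions, not details taken from the original code.

```python
import numpy as np


def sample_locations(pop_grid, n_points=500, seed=0):
    """Draw (row, col) points with probability proportional to cell population."""
    rng = np.random.default_rng(seed)
    weights = pop_grid.ravel().astype(float)
    probs = weights / weights.sum()
    # Assumption: sampling with replacement, so dense cells can be drawn repeatedly.
    flat_idx = rng.choice(weights.size, size=n_points, replace=True, p=probs)
    rows, cols = np.unravel_index(flat_idx, pop_grid.shape)
    # Jitter uniformly inside each cell so points do not stack on cell corners.
    rows = rows + rng.random(n_points)
    cols = cols + rng.random(n_points)
    return np.column_stack([rows, cols])


toy_grid = np.array([[0, 5], [20, 75]])
points = sample_locations(toy_grid)
print(points.shape)
```

With 500 points per city, each point stands for 1/500 = 0.2% of the population, which is why unweighted statistics over the points behave like population-weighted statistics over the city.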

I then used [OSRM](http://project-osrm.org/) (an open-source routing service that operates on OSM) to calculate the driving time from each point to each of the hospitals. OSRM accounts for driving distance, traffic lights and road speed limits, so it gives a representation of access that is more accurate, and more comparable between cities, than Euclidean distance.
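OSRM exposes a `table` service that returns a matrix of durations between sets of coordinates. The sketch below only builds the request URL and makes no network call. The coordinates are illustrative, the default `base_url` is the public demo server, and a self-hosted OSRM instance is an equally plausible setup for a workload of this size.

```python
def osrm_table_url(sources, destinations, base_url="http://router.project-osrm.org"):
    """Build an OSRM table request from lists of (lat, lon) tuples."""
    coords = list(sources) + list(destinations)
    # OSRM expects coordinates as lon,lat pairs separated by semicolons.
    coord_str = ";".join(f"{lon:.6f},{lat:.6f}" for lat, lon in coords)
    src_idx = ";".join(str(i) for i in range(len(sources)))
    dst_idx = ";".join(str(i) for i in range(len(sources), len(coords)))
    return (
        f"{base_url}/table/v1/driving/{coord_str}"
        f"?sources={src_idx}&destinations={dst_idx}&annotations=duration"
    )


# One sampled location and two hospitals (illustrative coordinates).
url = osrm_table_url([(36.16, -86.78)], [(36.14, -86.80), (36.20, -86.75)])
print(url)
```

The JSON response contains a `durations` matrix in seconds, with one row per source and one column per destination; taking the minimum of each row gives the drive time to the nearest hospital.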

The drive time to the closest hospital can be selected for each point, and a cumulative distribution can be calculated which will be the basis of this KPI. Shown below is the cumulative access distribution for Nashville, USA:

In [None]:
df_access = pd.read_csv('../input/cdp-osrm-hospital-access/osrm_hospital_duration.csv')
df_access = df_access.groupby(['source_lat', 'source_lon', 'city_iloc']).duration.min().reset_index()
df_access['duration_mins'] = df_access.duration / 60

df_access_0 = df_access[df_access.city_iloc == 0]

fig, ax = plt.subplots()
sns.ecdfplot(df_access_0.duration_mins, ax=ax)
ax.set_xlabel('Drive time to nearest hospital [mins]')
ax.set_ylabel('Proportion of population');
ax.set_title('Hospital access distribution in Nashville, USA');

Because the points were weighted by population, the curve represents how the people in a city access hospitals. For example, we can see that about half the people in Nashville live within 10 minutes of a hospital, but there is a tail of people who live more than twice as far away.

To quantify these curves, it is important to consider both the location (average) and scale (variation, spread) of the data. The proposed KPI takes values in the range (0, 100) and is composed of two components: a location component (median) and a scale component (Gini coefficient):

```
kpi_location = 8000 / median(drivetime_in_seconds)
kpi_scale = 8 / (gini(drivetime_in_seconds) + 0.01)
kpi_access = kpi_location.clip(0, 50) + kpi_scale.clip(0, 50)
```

The constants in the formulas are chosen qualitatively to give an even distribution of values within each range. The scale and location components are each capped at 50, so a city must score well on both to achieve a good score on the overall KPI.
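To make the arithmetic concrete, here is a small worked example using scalar inputs. It mirrors the formula above, including the `0.01` offset that the notebook's own code adds to the Gini coefficient before dividing. The function name `access_kpi` and the input values are illustrative.

```python
def access_kpi(median_seconds, gini_value, gini_offset=0.01):
    """Return (location score, scale score, overall KPI), each part capped at 50."""
    kpi_loc = min(max(8000 / median_seconds, 0), 50)
    kpi_scale = min(max(8 / (gini_value + gini_offset), 0), 50)
    return kpi_loc, kpi_scale, kpi_loc + kpi_scale


# A 10-minute median drive time with moderately unequal access.
print(access_kpi(600, 0.25))
# A 2-minute median with very even access hits both caps.
print(access_kpi(120, 0.05))
```

A 10-minute median with a Gini coefficient of 0.25 scores roughly 13.3 + 30.8 ≈ 44.1, while the second city reaches the maximum of 100. The location cap is reached at a median of 160 seconds and the scale cap at a Gini coefficient of 0.15.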

In [None]:
def gini(array, weights=None):
    # Based on the bottom equation at:
    # http://www.statsdirect.com/help/generatedimages/equations/equation154.svg
    # from:
    # http://www.statsdirect.com/help/default.htm#nonparametric_methods/gini.htm
    # Work on a flat float copy so the caller's data is never modified
    # and integer input does not break the in-place arithmetic below.
    array = np.array(array, dtype=float).flatten()
    if weights is not None:
        # Integer weights: repeat each value weights[i] times.
        array = np.repeat(array, weights)
    if np.amin(array) < 0:
        # Values cannot be negative:
        array -= np.amin(array)
    # Values cannot be 0:
    array += 0.0000001
    # Values must be sorted:
    array = np.sort(array)
    # Number of array elements:
    n = array.shape[0]
    # Index per array element:
    index = np.arange(1, n + 1)
    # Gini coefficient:
    return np.sum((2 * index - n - 1) * array) / (n * np.sum(array))


records = []
for iloc in df_access.city_iloc.unique():
    df_iloc = df_access[df_access.city_iloc == iloc]
    records.append({
        'm_gini': gini(df_iloc.duration),
        'm_median': df_iloc.duration.median(),
        'city_iloc': iloc,
        'account_number': df_disclosing['Account Number'].iloc[iloc],
        'city': df_disclosing.City.iloc[iloc],
        'organization': df_disclosing.Organization.iloc[iloc],
    })
    
df_meta = pd.DataFrame(records)
df_meta['kpi_scale'] = 8 / (df_meta.m_gini + 0.01)
df_meta['kpi_loc'] = 8000 / df_meta.m_median
df_meta['kpi_scale'] = df_meta['kpi_scale'].clip(0, 50)
df_meta['kpi_loc'] = df_meta['kpi_loc'].clip(0, 50)
df_meta['kpi'] = df_meta['kpi_loc'] + df_meta['kpi_scale']

df_meta = df_meta.sort_values('kpi', ascending=False)
df_meta.head()[['city', 'organization', 'account_number', 'kpi_scale', 'kpi_loc', 'kpi']]

The top-ranked city is Adelaide, Australia. Here's what Adelaide's access curve looks like:

In [None]:
df_access_0 = df_access[df_access.city_iloc == 93]

fig, ax = plt.subplots()
sns.ecdfplot(df_access_0.duration_mins, ax=ax)
ax.set_xlabel('Drive time to nearest hospital [mins]')
ax.set_ylabel('Proportion of population');
ax.set_title('Hospital access distribution in Adelaide, Australia');

On average, the population of Adelaide is very close to a hospital, and there is little difference between those who live the closest and the furthest.

## KPI 2 - Equity-Adjusted Hospital Access

Fast access to critical services is important for reducing emissions, improving air quality, and social welfare.

But if cities have limited resources to improve access, those efforts should prioritize the people most in need. To capture social equity of access and to motivate helping disadvantaged regions of a city, the second KPI measures how access differs between advantaged and disadvantaged populations within a city.

For this notebook I'm again using hospitals as the access service. As an example of disadvantaged groups, I am using [infant mortality rates](https://sedac.ciesin.columbia.edu/data/set/povmap-global-subnational-infant-mortality-rates-v2) due to the availability of a high-quality global grid dataset, but this analysis could be extended:

* Adjust access to gas stations with flood risk.
* Adjust access to greenspace with low mental health.
* Adjust access to supermarkets with obesity.


For each city with a hospital, I built two access curves: one curve for the areas of the city with higher than median infant mortality, and one curve for the areas of the city with lower than median infant mortality. Again the curves were weighted by population, so each represents the same number of people.
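The split can be illustrated with a small sketch. It assumes each sampled location already carries an infant mortality value looked up from the grid, and it divides an existing set of points at the median. The published dataset may instead sample each group separately, so treat this as a simplification; the function name `split_by_median` is hypothetical.

```python
import numpy as np


def split_by_median(infant_mortality, drive_time):
    """Split drive times into (low, high) groups at the median infant mortality."""
    im = np.asarray(infant_mortality, dtype=float)
    times = np.asarray(drive_time, dtype=float)
    cutoff = np.median(im)
    # Assumption: points exactly at the median go to the low-mortality group.
    return times[im <= cutoff], times[im > cutoff]


low, high = split_by_median([10, 40, 20, 30], [300, 900, 420, 780])
print(low, high)
```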

The data is available as a [Kaggle dataset](https://www.kaggle.com/ajnisbet/cdp-osrm-hospital-access).


In [None]:
df_access_low = pd.read_csv('../input/cdp-osrm-hospital-access/osrm_hospital_duration_low_im.csv')
df_access_low = df_access_low.groupby(['source_lat', 'source_lon', 'city_iloc']).duration.min().reset_index()
df_access_low['duration_mins'] = df_access_low.duration / 60
df_access_high = pd.read_csv('../input/cdp-osrm-hospital-access/osrm_hospital_duration_high_im.csv')
df_access_high = df_access_high.groupby(['source_lat', 'source_lon', 'city_iloc']).duration.min().reset_index()
df_access_high['duration_mins'] = df_access_high.duration / 60

For the KPI, I'm using a one-sided [Kolmogorov–Smirnov test](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test), which measures how different two distributions are.

The KPI is one minus the Kolmogorov–Smirnov statistic, scaled to the range (0, 100), where higher is better. A value of 100 means the one-sided statistic is zero: at every drive time, the access curve for areas with high infant mortality is at least as good as the curve for areas with low infant mortality.
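To show what the one-sided statistic measures, the sketch below computes it directly from the two empirical CDFs with NumPy. It is intended to match the statistic returned by `scipy.stats.ks_2samp(low, high, alternative='greater')`, which is the largest amount by which the first sample's CDF exceeds the second's, but it is a hand-rolled illustration rather than the library code. The function names are hypothetical.

```python
import numpy as np


def ks_greater(sample_1, sample_2):
    """One-sided KS statistic: max over x of ECDF_1(x) - ECDF_2(x)."""
    s1 = np.sort(np.asarray(sample_1, dtype=float))
    s2 = np.sort(np.asarray(sample_2, dtype=float))
    pooled = np.concatenate([s1, s2])
    # Evaluate both empirical CDFs at every observed value.
    cdf1 = np.searchsorted(s1, pooled, side="right") / s1.size
    cdf2 = np.searchsorted(s2, pooled, side="right") / s2.size
    return float(np.max(cdf1 - cdf2))


def equity_kpi(low_im_times, high_im_times):
    """Scale the one-sided KS statistic to (0, 100), higher is better."""
    d_plus = ks_greater(low_im_times, high_im_times)
    return 100 * min(max(1 - d_plus, 0), 1)


# Low-mortality areas are much closer to hospitals: worst possible score.
print(equity_kpi([60, 120, 180], [600, 1200, 1800]))
# High-mortality areas are closer: the one-sided statistic does not penalise this.
print(equity_kpi([600, 1200, 1800], [60, 120, 180]))
```

Because the test is one-sided, the KPI only drops when areas with low infant mortality have better access than areas with high infant mortality, not the other way around.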

In [None]:
records = []
for iloc in set(df_access_low.city_iloc.tolist()) & set(df_access_high.city_iloc.tolist()):
    df_iloc_low = df_access_low[df_access_low.city_iloc == iloc]
    df_iloc_high = df_access_high[df_access_high.city_iloc == iloc]
    
    low = sorted(df_iloc_low.duration)
    high = sorted(df_iloc_high.duration)
    ks, p = scipy.stats.ks_2samp(low, high, alternative='greater')
    records.append({
        'm_ks': ks,
        'm_p': p,
        'low_med': np.median(low),
        'high_med': np.median(high),
        'city_iloc': iloc,
        'account_number': df_disclosing['Account Number'].iloc[iloc],
        'city': df_disclosing.City.iloc[iloc],
        'organization': df_disclosing.Organization.iloc[iloc],
        'country': df_disclosing.Country.iloc[iloc],
    })
    
df_meta = pd.DataFrame(records)

df_meta['kpi'] = 100 * np.clip(1 - df_meta.m_ks, 0, 1)
df_meta = df_meta.sort_values('kpi', ascending=False)
df_meta.head(10)[['city', 'organization', 'account_number', 'country', 'kpi']]

Many cities achieve a perfect score for this KPI, within the statistical margin of error! For example, the curves for Helsinki, Finland are nearly identical due to the overall low levels of infant mortality. (In more developed cities, demographic measures such as relative income or development-agnostic measures like wildfire risk might be more illuminating than this marker of poverty.)

In [None]:
iloc = 138

df_access_low_0 = df_access_low[df_access_low.city_iloc == iloc]
fig, ax = plt.subplots()
sns.ecdfplot(df_access_low_0.duration_mins, ax=ax, label='Low infant mortality')

df_access_high_0 = df_access_high[df_access_high.city_iloc == iloc]
sns.ecdfplot(df_access_high_0.duration_mins, ax=ax, label='High infant mortality')

ax.set_title('Hospital access distribution in Helsinki, Finland');
ax.set_xlabel('Drive time to nearest hospital [mins]')
ax.set_ylabel('Proportion of population');
ax.legend();

Low values of the KPI reveal large differences in hospital access. In the example below, those living in areas of high infant mortality take on average twice as long to get to a hospital, and not even the closest 1% live within a few minutes.

In [None]:
iloc = 371

df_access_low_0 = df_access_low[df_access_low.city_iloc == iloc]
fig, ax = plt.subplots()
sns.ecdfplot(df_access_low_0.duration_mins, ax=ax, label='Low infant mortality')

df_access_high_0 = df_access_high[df_access_high.city_iloc == iloc]
sns.ecdfplot(df_access_high_0.duration_mins, ax=ax, label='High infant mortality')

ax.set_title('Hospital access distribution in city #371');
ax.set_xlabel('Drive time to nearest hospital [mins]')
ax.set_ylabel('Proportion of population');
ax.legend();

## Limitations and Suggestions for Further Research

There are a number of things I didn't get around to that would improve the value of these KPIs:
* A disclosure KPI representing OSM quality: [this paper](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0180698) suggests a viable method for estimating how accurate OSM is in a given region.
* I tried as much as possible to automate the extraction of city boundaries, but it may be more time efficient and accurate to correct them manually.
* OSRM supports non-driving modes such as walking and cycling, so similar KPIs could be developed for walkability.
* There is a growing number of climate-risk raster datasets that would further investigate the ties between environmental and social issues.