> Martin Fridrich 08/2019

**The task:** Evil zombies have destroyed all central rescue stations in the city. The state has statistics on where emergencies occurred over the last year (data above) and is not sure whether the emergency stations were placed efficiently. Note that ambulances were sent only to EMS problems. Therefore, they asked us to find places where they should build 5 emergency stations (of the same size). What will their positions be?

Thus, we aim to find suitable locations for five emergency facilities based on historical data with integer linear programming. The notebook is organized as follows:

* Housekeepin',
* Data Processing,
* Exploratory Data Analysis,
* Capacitated Facility Location Problem,
* What's Next?

# Housekeepin'

In [None]:
!pip install pulp

In [None]:
import os
import urllib.request as ur
import zipfile
import math
import numpy as np
import pandas as pd
import geopandas as gp
import shapely
import matplotlib
import matplotlib.pyplot as plt
import contextily
import pulp

# Data Processing

We identify the most critical columns and ensure features are in proper data types. The resulting dataset will contain only EMS distress calls and the variables: `lat`, `lng`, `timeStamp`, `eventType`, `eventDetail`.

In [None]:
df = pd.read_csv("/kaggle/input/montcoalert/911.csv")
df.info()

In [None]:
df.head(3)

There are ~ 660k rows & 9 columns.

The coordinates of distress events are in the `lat` and `lng` cols, loaded as `float64`; the required precision might be discussed later. The `desc` col holds details (possibly the responding crew, township, responding station, and timestamp); the values are loaded as `object` and require parsing. The `zip` col is loaded as `float64`, even though it is discrete, and approx. 10 % of its values are missing. The `title` col encodes the type of response and the nature of the call; it is loaded as `object` and needs parsing. The `timeStamp` col describes the date and time of the call; it is loaded as `object` and needs to be coerced to a proper data type. The `twp` col stands for the township; it is loaded as `object` and might be transformed into the `category` type. The `addr` col represents the address and is loaded as `object`. The last feature, `e`, is loaded as `int64` and is 1 for all observations; hence it will be omitted.

In [None]:
df_clean = df.copy() # copy df
df_clean = df_clean.loc[:,["lat","lng", "title", "timeStamp"]]

# convert to datetime
df_clean.timeStamp = pd.to_datetime(df_clean.timeStamp)

# parse the title
df_title = df_clean.title.str.extract(r'(?P<eventType>.+):\s*(?P<eventDetail>.+)')
df_title = df_title.apply(lambda x: x.astype('category'), axis=0)

# append new cols and drop title
df_clean = pd.concat([df_clean.loc[:,["lat","lng", "timeStamp"]], df_title],axis=1)
df_clean.info()
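
To make the `title` parsing step concrete, here is a minimal sketch that applies the same regular expression to a few strings with the standard library's `re` module instead of pandas. The sample titles and the helper name `parse_title` are illustrative assumptions, not part of the notebook.

```python
import re

# Same pattern as in the cell above. `.+` is greedy, so the split happens at
# the LAST colon that still leaves a non-empty detail part.
TITLE_RE = re.compile(r'(?P<eventType>.+):\s*(?P<eventDetail>.+)')

def parse_title(title):
    """Return (eventType, eventDetail), or (None, None) if there is no match."""
    match = TITLE_RE.search(title)
    if match is None:
        return (None, None)
    return (match.group("eventType"), match.group("eventDetail"))

# Illustrative titles in the style of the dataset, plus two edge cases.
samples = ["EMS: BACK PAINS/INJURY", "Fire: GAS-ODOR/LEAK",
           "no separator here", "a:b:c"]
parsed = {s: parse_title(s) for s in samples}

for title, (event_type, event_detail) in parsed.items():
    print(title, "->", event_type, "|", event_detail)
```

Rows that do not match the pattern end up with missing values in both new columns, which is why the subsequent check of NAs per column is worth running.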

In [None]:
# check number of NAs per column
print("NAs per column")
display(df_clean.apply(lambda x: x.isna().sum(), axis=0))

# check number of distinct values per column
print("nunique per column")
display(df_clean.apply(lambda x: len(x.unique()), axis=0))

In [None]:
# filter out irrelevant eventType
df_ems = df_clean[df_clean.eventType=='EMS']
df_ems.info()

# Exploratory Data Analysis

Over the next code chunks, we focus on a brief examination of EMS calls, specifically on frequent incident types, trends, and geospatial properties.

In [None]:
plt.figure(figsize=(10, 5))
_ = df_ems.eventDetail.value_counts().head(10).plot.barh()
_.set_xlabel("n")
_.invert_yaxis();

The most common incidents are `FALL VICTIM`, `RESPIRATORY EMERGENCY`, and `CARDIAC EMERGENCY`.

In [None]:
plt.figure(figsize=(10, 5))
_ = df_ems.timeStamp.dt.date.value_counts().sort_index().plot()
_.set_xlabel("time")
_.set_ylabel("n");

There is a somewhat linear trend, with a significant drop in 2020 caused by the pandemic. With a different level of aggregation, we might see other interesting patterns. However, that is not the goal of this endeavor.

In [None]:
# geospatial data
world = gp.read_file(gp.datasets.get_path('naturalearth_lowres'))
gdf = gp.GeoDataFrame(df_ems.loc[:,["timeStamp","eventDetail"]],
        geometry=[shapely.geometry.Point(x, y) for x, y in zip(df_ems.lng, df_ems.lat)])
gdf.crs = "EPSG:4326"  # WGS 84 lng/lat; the {'init': ...} form is deprecated
#world.crs

# spatial outliers?
ax = world.plot(color='white', edgecolor='black', figsize=(15, 10))
gdf.plot(ax=ax, color='red', markersize=7.5);

There are multiple outlying points. To resolve that, we use a US Census (TIGER) county shapefile to obtain the borders of Montgomery County.

In [None]:
# get county shapes
if not os.path.exists('shape/'):
    u = "https://www2.census.gov/geo/tiger/TIGER2016/COUNTY/tl_2016_us_county.zip"
    ur.urlretrieve(u, "shape.zip")  # replaces the deprecated URLopener
    shz_file = zipfile.ZipFile("shape.zip")
    shz_file.extractall(path="shape")
    shz_file.close()
    os.remove("shape.zip")

filenames = [y for y in sorted(os.listdir("shape/"))
                 for ending in ['dbf', 'prj', 'shp', 'shx'] if y.endswith(ending)] 
dbf, prj, shp, shx = [filename for filename in filenames]
county = gp.read_file("shape/"+shp)
#county.crs

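# build montco polygon; the name alone is ambiguous (several states have a
# Montgomery County), so also match Pennsylvania's state FIPS code "42"
montco = county[(county.NAMELSAD=="Montgomery County") &
                (county.STATEFP=="42")].unary_union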

# Capacitated Facility Location Problem

## Data Transformation

First, we aggregate observations on rounded coordinates to speed up the optimization. This yields a significantly smaller dataset while keeping the spatial error within a few kilometers (the grid step of 0.025° is roughly 2.8 km in latitude). Geometries are then reconstructed so they can be combined with the `montco` polygon.

In [None]:
# aggregate the observations
df = gdf.copy()
df['x'] = np.round(df.geometry.x/0.025)*0.025
df['y'] = np.round(df.geometry.y/0.025)*0.025
df = df.groupby(['x', 'y'], as_index=False).\
        geometry.count().rename({'geometry':'weight'}, axis=1)

df = gp.GeoDataFrame(df,
        geometry=[shapely.geometry.Point(x, y) for x, y in zip(df.x, df.y)])

df.crs = "EPSG:4326"  # WGS 84 lng/lat; the {'init': ...} form is deprecated
df.info()
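
The aggregation above can be illustrated without pandas. The sketch below snaps a few hypothetical (lng, lat) pairs to the same 0.025° grid and counts how many fall into each cell; the helper `snap` and the sample coordinates are assumptions made for illustration only.

```python
from collections import Counter

STEP = 0.025  # grid step in degrees, as in the cell above

def snap(value, step=STEP):
    """Round a coordinate to the nearest multiple of `step`."""
    return round(value / step) * step

# Hypothetical (lng, lat) pairs; the first two fall into the same grid cell.
points = [(-75.5812, 40.2978), (-75.5790, 40.2951), (-75.3435, 40.1211)]

# The outer round() only removes floating-point noise such as 40.300000000000004.
weights = Counter((round(snap(x), 3), round(snap(y), 3)) for x, y in points)

for cell, weight in weights.items():
    print(cell, weight)
```

Each key plays the role of one row of the aggregated `df`, and the count corresponds to its `weight`.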

In [None]:
# get montco points
within_vec = df.geometry.within(montco)
df = df.loc[np.logical_and(within_vec,df.x>-80),:]
tot_inc_served = df.weight.sum()
df.shape

## Distance Function

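Here, we define a distance function based on the Manhattan (L1) metric. Degrees of latitude and longitude are converted to kilometers with the same factor (Earth's equatorial circumference divided by 360), which overestimates east-west distances at this latitude.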

In [None]:
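# take the first two EMS points by position, not by index label
point_1 = gdf.geometry.iloc[0]
point_2 = gdf.geometry.iloc[1]

def get_manhattan(p1, p2):
    """Calculate the Manhattan (L1) distance between two lng/lat points.
    args>
    p1, p2 - shapely.geometry.Point objects (x = lng, y = lat, in degrees)
    returns>
    distance in km, scalar"""

    e_circ = 40075  # Earth's equatorial circumference in km
    # one degree is treated as e_circ/360 km along both axes
    distance = e_circ*(math.fabs(p1.x-p2.x) +
                       math.fabs(p1.y-p2.y))/360
    return distance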

get_manhattan(point_1, point_2)
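
`get_manhattan` converts degrees to kilometers with one factor for both axes. The sketch below reproduces that computation on plain `(lng, lat)` tuples and compares it with a latitude-corrected variant. The corrected variant is an assumption added for comparison, not the method used in this notebook, and the two sample points are hypothetical.

```python
import math

KM_PER_DEGREE = 40075 / 360  # Earth's equatorial circumference / 360 degrees

def manhattan_km(p1, p2):
    """L1 distance as in get_manhattan: every degree counts as ~111.3 km."""
    (x1, y1), (x2, y2) = p1, p2
    return KM_PER_DEGREE * (abs(x1 - x2) + abs(y1 - y2))

def manhattan_km_lat_corrected(p1, p2):
    """Variant (an assumption, not the notebook's method): shrink the
    east-west leg by cos(latitude), because meridians converge poleward."""
    (x1, y1), (x2, y2) = p1, p2
    mean_lat = math.radians((y1 + y2) / 2)
    return KM_PER_DEGREE * (abs(x1 - x2) * math.cos(mean_lat) + abs(y1 - y2))

# Hypothetical (lng, lat) points at roughly the latitude of the study area.
a, b = (-75.50, 40.20), (-75.30, 40.10)
plain = manhattan_km(a, b)
corrected = manhattan_km_lat_corrected(a, b)
print(round(plain, 1), round(corrected, 1))
```

For these two points the plain version gives about 33.4 km and the corrected one about 28.1 km. Since the same factor is applied to every pair, the uncorrected distance mainly over-weights east-west separation relative to north-south separation.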


## Problem Formulation

Now, it is time to formulate the CFLP. Let $x_{ij}$ be a binary variable for all $i = 1, ..., m$ and $j = 1, ..., n$, where $m$ is the number of incident locations and $n$ is the number of candidate facility locations. If $x_{ij} = 1$, then incident location $i$ is served from facility $j$; otherwise $x_{ij} = 0$. Let $y_{j}$ be another binary variable for all $j = 1, ..., n$. If $y_{j} = 1$, then facility $j$ is open; otherwise $y_{j} = 0$. The cost parameters are $d_{ij}$, the Manhattan distance between incident location $i$ and facility $j$, and $w_{i}$, the number of incidents at location $i$.

We introduce the problem in the following manner.

**Objective function:** 

$\min \sum_{i=1}^{m} \sum_{j=1}^{n} d_{ij} w_{i} x_{ij}$

**Constraints:** 

$(1)\space\space\sum_{j=1}^{n} x_{ij} = 1, \forall i = 1, ..., m$  
$(2)\space\space\space x_{ij} \leq y_{j}, \forall i = 1, ..., m, j = 1, ..., n$  
$(3)\space\space\sum_{j=1}^{n} y_{j} = facility\_limit$  
$(4)\space\space\sum_{i=1}^{m} x_{ij} w_{i} \leq capacity\_limit,\forall j = 1, ..., n$  

The objective function minimizes the sum of weighted distances between incident locations and the facilities serving them. Constraint (1) ensures that every incident location is served by exactly one facility; constraint (2) ensures that if incident location $i$ is served from facility $j$, that facility is open; constraint (3) fixes the number of facilities that are opened; constraint (4) limits the capacity of each facility. We chose the capacity limit arbitrarily, as 105 % of an equal share of all incidents.
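
To see how the constraints interact, the sketch below solves a tiny hypothetical instance by plain enumeration. This is deliberately not the solver-based approach used in the notebook: it only works for a handful of locations, and the function name, distances, and weights are assumptions chosen for illustration.

```python
from itertools import combinations, product

def solve_cflp_brute_force(d, w, facility_limit, capacity_limit):
    """Enumerate every feasible solution of a tiny CFLP instance.

    d[i][j] - distance between incident location i and candidate facility j
    w[i]    - number of incidents at location i
    Returns (cost, opened, assignment), where assignment[i] is the facility
    serving location i, or None if no feasible solution exists.
    """
    m, n = len(d), len(d[0])
    best = None
    # Open exactly `facility_limit` facilities.
    for opened in combinations(range(n), facility_limit):
        # Each location picks one OPEN facility: it is served exactly once,
        # and only from a facility that exists.
        for assignment in product(opened, repeat=m):
            load = {j: 0 for j in opened}
            for i, j in enumerate(assignment):
                load[j] += w[i]
            if max(load.values()) > capacity_limit:  # capacity constraint
                continue
            cost = sum(d[i][j] * w[i] for i, j in enumerate(assignment))
            if best is None or cost < best[0]:
                best = (cost, opened, assignment)
    return best

# Hypothetical instance: four locations on a line at positions 0..3, each of
# which is also a candidate facility site.
positions = [0, 1, 2, 3]
d = [[abs(p - q) for q in positions] for p in positions]
w = [10, 3, 3, 4]

uncapacitated = solve_cflp_brute_force(d, w, facility_limit=2, capacity_limit=20)
capacitated = solve_cflp_brute_force(d, w, facility_limit=2, capacity_limit=10)
print(uncapacitated)
print(capacitated)
```

With a capacity equal to the total weight, the limit never binds and sites 0 and 3 are opened at a cost of 6. Tightening the capacity to 10 forces the heavy location 0 to be served alone, which moves the second facility to site 2 and raises the cost to 7.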

In [None]:
# remap vars
incidents = df.geometry
weights = df.weight
facilities = df.geometry
inc_i = incidents.index
fac_i = facilities.index

# initialize object
os.chdir("/")
model = pulp.LpProblem("CFLModel", pulp.LpMinimize)

# initialize facility vector
f_vec = pulp.LpVariable.dicts("f_vec",
            [j for j in fac_i], cat="Binary")

# initialize binary variable for incident-facility mapping
if_mat = pulp.LpVariable.dicts("if_mat",
            [(i,j) for i in inc_i for j in fac_i], cat="Binary")

# objective function - weighted manhattan sum
model += (pulp.lpSum([get_manhattan(incidents.loc[i],
            facilities.loc[j]) * weights[i] * if_mat[(i,j)] 
            for i in inc_i for j in fac_i]))

# every incident must be served
for i in inc_i:
    model += pulp.lpSum(if_mat[(i, j)] for j in fac_i)==1
    
# an incident can only be mapped to a facility that is open
for i in inc_i:
    for j in fac_i:
        model += if_mat[(i, j)] <= f_vec[j]

# we are limited to 5 facilities
model += pulp.lpSum(f_vec[j] for j in fac_i)==5

# facilities can serve incidents within limited capacity
for j in fac_i:
    model += pulp.lpSum(if_mat[(i, j)] * weights[i] for i in inc_i)\
        <=np.round(1.05*tot_inc_served/5,0)

## Solution

In [None]:
# try to solve it
model.solve()

print("Solving the model results in **{}** status and objective function of {}.".\
    format(pulp.LpStatus[model.status].lower(), np.round(pulp.value(model.objective),2)))

The `optimal` status indicates that a solution satisfying all constraints was found.

In [None]:
# solver values are floats, so compare against 0.5 rather than testing == 1
fac_loc = pd.DataFrame(index=[j for j in fac_i if f_vec[j].varValue > 0.5])
cust_loc = pd.DataFrame([{'fac_id':j } for i in inc_i
                for j in fac_i if if_mat[(i,j)].varValue > 0.5], index=inc_i)

for i in range(fac_loc.shape[0]):
    print("Suitable coordinates for facility {} are lat: {}, lng: {}.".\
        format(i,facilities[fac_loc.index[i]].y, facilities[fac_loc.index[i]].x))
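
Solver values for binary variables are returned as floats, and depending on the solver and its tolerances they may be only approximately 0 or 1. The sketch below uses hypothetical values (not output from this model) to show how an exact comparison can miss an assignment that a threshold recovers; `extract_assignment` is an illustrative helper.

```python
# Hypothetical solver output: values of x_ij keyed by (incident, facility).
x_values = {
    (0, "a"): 0.9999999997, (0, "b"): 2.1e-10,
    (1, "a"): 0.0,          (1, "b"): 1.0,
}

def extract_assignment(values, threshold=0.5):
    """Map each incident to the facility whose variable is switched on."""
    assignment = {}
    for (i, j), value in values.items():
        if value is not None and value > threshold:
            assignment[i] = j
    return assignment

# An exact test drops incident 0 because 0.9999999997 != 1.
exact = {i: j for (i, j), v in x_values.items() if v == 1}
robust = extract_assignment(x_values)
print(exact, robust)
```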

In [None]:
cols = ['#f032e6', '#000075', '#ffe119', '#e6194B', '#3cb44b']
dict_cols = dict(zip(fac_loc.index, cols))
inc_cols = cust_loc.fac_id.map(dict_cols)
fac_cols = fac_loc.index.map(dict_cols)

# plot out incidents
ax = incidents.to_crs(epsg=3857).plot(
    color=inc_cols, markersize=weights/75,
    alpha=.5,figsize=(15, 10))

# plot out facilities
facilities[fac_loc.index].to_crs(epsg=3857).plot(
    color=fac_cols, markersize=200, marker="D", ax=ax)

# add contextily tiles
contextily.add_basemap(ax, zoom=11,
    source=contextily.providers.Stamen.TonerLite)
ax.set_axis_off()

The figure presents the optimization results and serves as a sanity check. We depict the facilities as large diamond markers; serviced areas share the color; the size of the incident/location points depends on the incident frequency. 

We can see that there is a larger, probably rural, region with a lower density of events in the upper-left part of the figure. Other areas appear to be more compact, serving areas with a higher density of the events.

## Further Analysis

In [None]:
# get data about shadow prices and slack
const_df = pd.DataFrame([{'con_name':name,
                          'shadow_price':c.pi,
                          'slack_value':c.slack}
                         for name, c in model.constraints.items()])

const_df.shadow_price.value_counts().sort_index()

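The shadow prices give no indication that relaxing any single constraint would improve the objective function. Because the model contains binary variables, shadow prices should be read with caution here.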

In [None]:
const_df.slack_value.value_counts().sort_index()

From the printout above, we understand that most of the constraints are binding (slack is equal to 0); hence a change in these constraints may lead to a different solution. Let us take a look at one of the constraints with the highest slack.

In [None]:
const_df.sort_values('slack_value', ascending=False).head(3)

In [None]:
str(model.constraints['_C44944'])[-285:]

This is constraint (4), the capacity limit, for the facility location with index 401. The slack identified here is not very informative, as this particular location is not among the selected ones.
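
The point about uninformative slack can be shown with simple arithmetic. In the sketch below, the slack of a capacity constraint is taken as the unused capacity (limit minus assigned load); the site names and numbers are hypothetical, and the sign convention is an assumption of this sketch rather than a statement about PuLP's `slack` attribute.

```python
def capacity_slack(assigned_weights, capacity_limit):
    """Slack of a `load <= capacity_limit` constraint: the unused capacity."""
    return capacity_limit - sum(assigned_weights)

# Hypothetical numbers: three candidate sites sharing a capacity limit of 100.
capacity_limit = 100
loads = {
    "site_open_full": [60, 40],   # opened and used up to the limit
    "site_open_spare": [70],      # opened, 30 units unused
    "site_closed": [],            # not opened, nothing assigned
}

slack = {name: capacity_slack(ws, capacity_limit) for name, ws in loads.items()}
binding = [name for name, s in slack.items() if s == 0]
print(slack, binding)
```

A site that is never opened carries no load, so its capacity constraint trivially shows the largest possible slack. Sorting by slack therefore surfaces unselected locations first, which says little about the chosen facilities.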

# What's Next?

As presented in the previous section, we find a suitable solution for the capacitated facility location problem. However, the solution is somewhat limited. In this section, we describe some of the difficulties & further steps.

**Problem definition** - we might align the solution with the real world a bit more. Different objective functions can be applied to reflect that (average time to location/maximum distance traveled/...). Moreover, the feasible region can be narrowed with further details (time/capacity/location constraints).

**Data** - we build the model on the raster representation (not precise) & use every incident location as a potential facility location (not necessarily good). Both choices might be improved with prior information. Moreover, boundary problems are expected (e.g., there are calls and hospitals adjacent to Montgomery County, which are not acknowledged).

**Sensitivity analysis** - at the end of the CFLP section, we examine selected components of the model, one at a time; this part of the study can be more thorough. To further challenge our solution, we can run simulations that add noise to the data/constraints and analyze the results, leading to better insights.
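
A minimal sketch of such a noise simulation, under strong simplifying assumptions: it re-solves a hypothetical single-facility problem on four locations by enumeration instead of re-running the full model, perturbs the weights by up to 20 %, and counts how often each site wins. The data are chosen so that two sites are tied in the unperturbed case.

```python
import random
from collections import Counter

positions = [0, 1, 2, 3]       # hypothetical locations on a line
base_weights = [10, 3, 3, 4]   # hypothetical incident counts

def best_single_site(weights):
    """Index of the site minimizing the weighted L1 distance to all locations."""
    costs = [sum(w * abs(p - q) for p, w in zip(positions, weights))
             for q in positions]
    return costs.index(min(costs))  # ties resolve to the lowest index

rng = random.Random(0)  # fixed seed so the experiment is repeatable
winners = Counter()
for _ in range(1000):
    # multiply every weight by an independent factor in [0.8, 1.2]
    noisy = [w * rng.uniform(0.8, 1.2) for w in base_weights]
    winners[best_single_site(noisy)] += 1

baseline = best_single_site(base_weights)
print(baseline, dict(winners))
```

If the winning site changes frequently under small perturbations, as it does for this tied toy instance, the reported optimum should be treated as one of several near-equivalent choices rather than a clear-cut answer.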
