# SAMUR Emergency Frequencies

This notebook explores how the frequency of each type of emergency varies over time (hour of the day, day of the week, month of the year) and across locations in Madrid. The results will be used to build a realistic emergency generator for the city simulation.

Let's start with some imports and setup, and then read the table.

In [1]:
import pandas as pd
import datetime
import matplotlib.pyplot as plt
import yaml
%matplotlib inline

In [2]:
df = pd.read_csv("../data/emergency_data.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,Año,Mes,Distrito,Hospital,Devuelto,Solicitud,Intervención,Dia de la semana,Tiempo de recorrido,Gravedad,IBC,Coordenadas,Mes_num,Hora_del_dia,Dia_desde_2017,Semana
0,0,2017,ENERO,Centro,Concepción (Fund. J. Díaz),False,2017-01-01 00:23:19,2017-01-01 00:28:59,6,0 days 00:05:40,2,0.0,POINT (-1084.985671688465 315.1856405139253),0,23,0,0
1,1,2017,ENERO,Carabanchel,No derivado,False,2017-01-01 00:27:35,2017-01-01 00:35:44,6,0 days 00:08:09,3,256.0,POINT (-1420.483085081759 -2942.78224949467),0,27,0,0
2,2,2017,ENERO,Salamanca,No derivado,False,2017-01-01 00:47:26,2017-01-01 00:55:49,6,0 days 00:08:23,2,1191.0,POINT (1662.930700416665 1816.00590750245),0,47,0,0
3,3,2017,ENERO,Centro,Doce de Octubre,False,2017-01-01 00:55:13,2017-01-01 01:02:23,6,0 days 00:07:10,3,467.0,POINT (-895.6722010203512 -101.6973139135703),0,55,0,0
4,4,2017,ENERO,Villa de Vallecas,No derivado,False,2017-01-01 01:07:11,2017-01-01 01:19:44,6,0 days 00:12:33,4,718.0,POINT (8340.751952136179 -5598.25045023518),0,67,0,0


The column for the time of the call is a string, so let's change that into a timestamp.

In [3]:
df["time_call"] = pd.to_datetime(df["Solicitud"])

We will also need to assign a numerical code to each district of the city, in order to properly vectorize the distribution and make it easier to work with other parts of the project.

In [4]:
district_codes = {
    'Centro': 1, 
    'Arganzuela': 2, 
    'Retiro': 3, 
    'Salamanca': 4, 
    'Chamartín': 5, 
    'Tetuán': 6, 
    'Chamberí': 7, 
    'Fuencarral - El Pardo': 8, 
    'Moncloa - Aravaca': 9, 
    'Latina': 10, 
    'Carabanchel': 11, 
    'Usera': 12, 
    'Puente de Vallecas': 13, 
    'Moratalaz': 14, 
    'Ciudad Lineal': 15, 
    'Hortaleza': 16, 
    'Villaverde': 17, 
    'Villa de Vallecas': 18, 
    'Vicálvaro': 19, 
    'San Blas - Canillejas': 20, 
    'Barajas': 21,
    }

df["district_code"] = df.Distrito.apply(lambda x: district_codes[x])

Each emergency has already been assigned a severity level, depending on the nature of the reported emergency.

In [5]:
df["severity"] = df["Gravedad"]

We also need the hour, weekday and month of each event in order to place it in the corresponding distributions.

In [6]:
df["hour"] = df["time_call"].dt.hour  # From 0 to 23
df["weekday"] = df["time_call"].dt.weekday + 1  # From 1 (Mon) to 7 (Sun)
df["month"] = df["time_call"].dt.month  # From 1 (Jan) to 12 (Dec)
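As a quick sanity check of the weekday convention, the sketch below applies the same three extractions to the first call in the table (2017-01-01 00:23:19), using only the standard library's `datetime` rather than pandas. The variable names are illustrative.

```python
from datetime import datetime

# First call shown in the table above
ts = datetime(2017, 1, 1, 0, 23, 19)

hour = ts.hour              # 0..23
weekday = ts.weekday() + 1  # weekday() is 0 (Mon)..6 (Sun), so +1 gives 1..7
month = ts.month            # 1..12

print(hour, weekday, month)
```

1 January 2017 was a Sunday, so the weekday comes out as 7. `ts.isoweekday()` follows the same 1 (Mon) to 7 (Sun) convention and would give the same value without the `+ 1`.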

Let's also strip down the dataset to just the columns we need right now.

In [7]:
df = df[["district_code", "severity", "time_call", "hour", "weekday", "month"]]
df.head()

Unnamed: 0,district_code,severity,time_call,hour,weekday,month
0,1,2,2017-01-01 00:23:19,0,7,1
1,11,3,2017-01-01 00:27:35,0,7,1
2,4,2,2017-01-01 00:47:26,0,7,1
3,1,3,2017-01-01 00:55:13,0,7,1
4,18,4,2017-01-01 01:07:11,1,7,1


We are going to group the distributions by severity.

In [8]:
emergencies_per_grav = df.severity.value_counts().sort_index().rename("total_emergencies")
emergencies_per_grav

1    107570
2    107186
3     40759
4    115137
5     42778
Name: total_emergencies, dtype: int64

We will also need the global frequency of emergencies for each severity level, expressed in emergencies per second over the whole time span of the dataset:

In [9]:
total_seconds = (df.time_call.max()-df.time_call.min()).total_seconds()
frequencies_per_grav = (emergencies_per_grav / total_seconds).rename("emergency_frequencies")
frequencies_per_grav

1    0.001137
2    0.001133
3    0.000431
4    0.001217
5    0.000452
Name: emergency_frequencies, dtype: float64
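Per-second values are hard to read at a glance, so it may help to convert them into more familiar units. The sketch below copies the rounded frequencies from the output above into a plain dict (the names `frequencies`, `per_day` and `mean_interval` are illustrative) and derives the expected number of emergencies per day and the average time between two emergencies of the same severity.

```python
# Rounded per-second frequencies copied from the output above (severity -> emergencies/second)
frequencies = {1: 0.001137, 2: 0.001133, 3: 0.000431, 4: 0.001217, 5: 0.000452}

SECONDS_PER_DAY = 24 * 60 * 60

# Expected emergencies per day for each severity level
per_day = {sev: freq * SECONDS_PER_DAY for sev, freq in frequencies.items()}

# Average number of seconds between two consecutive emergencies of the same severity
mean_interval = {sev: 1 / freq for sev, freq in frequencies.items()}

for sev in frequencies:
    print(f"severity {sev}: {per_day[sev]:.1f} per day, one every {mean_interval[sev] / 60:.1f} min")
```

With these rounded figures, severity 1 comes out at roughly 98 emergencies per day and severity 3 at about one every 39 minutes.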

Each emergency will need to be assigned a district. Assuming that the district of an emergency is independent of the time at which it occurs, each one will be assigned to a district according to a global probability computed from this dataset, as follows.

In [10]:
prob_per_district = (df.district_code.value_counts().sort_index()/df.district_code.value_counts().sum()).rename("distric_weight")
prob_per_district

1     0.153830
2     0.052923
3     0.035832
4     0.057035
5     0.046039
6     0.055666
7     0.049578
8     0.044593
9     0.059466
10    0.057932
11    0.067799
12    0.037438
13    0.059103
14    0.019031
15    0.047171
16    0.029693
17    0.031178
18    0.025385
19    0.014430
20    0.041925
21    0.013952
Name: distric_weight, dtype: float64
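One way a generator could use these weights is to draw a district code at random, with probability proportional to its weight. The following is a minimal sketch using only the standard library; the weights are copied from the output above, while `sample_district` and the fixed seed are hypothetical and not part of the project.

```python
import random
from collections import Counter

# District weights copied from the output above (district code -> probability)
district_weights = {
    1: 0.153830, 2: 0.052923, 3: 0.035832, 4: 0.057035, 5: 0.046039,
    6: 0.055666, 7: 0.049578, 8: 0.044593, 9: 0.059466, 10: 0.057932,
    11: 0.067799, 12: 0.037438, 13: 0.059103, 14: 0.019031, 15: 0.047171,
    16: 0.029693, 17: 0.031178, 18: 0.025385, 19: 0.014430, 20: 0.041925,
    21: 0.013952,
}

def sample_district(weights, rng):
    """Draw one district code with probability proportional to its weight."""
    codes = list(weights)
    return rng.choices(codes, weights=list(weights.values()), k=1)[0]

rng = random.Random(42)  # fixed seed only to make the sketch reproducible
draws = Counter(sample_district(district_weights, rng) for _ in range(20_000))
most_common_code = draws.most_common(1)[0][0]
print(most_common_code, draws[most_common_code] / 20_000)
```

`random.choices` normalizes the weights internally, so the small rounding error in their sum does not matter. Centro (code 1) should dominate the draws, with a share close to its 15% weight.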

To simplify the generation of emergencies, we are going to assume that the distributions of emergencies per hour, per weekday and per month are independent of each other. This is obviously not fully true, but it is a good approximation for the chosen time-frames.

In [11]:
hourly_dist = (df.hour.value_counts()/df.hour.value_counts().mean()).sort_index().rename("hourly_distribution")
daily_dist = (df.weekday.value_counts()/df.weekday.value_counts().mean()).sort_index().rename("daily_distribution")
monthly_dist = (df.month.value_counts()/df.month.value_counts().mean()).sort_index().rename("monthly_distribution")
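To make the normalization concrete: dividing each count by the mean count yields factors that average to 1, so a factor above 1 marks a busier-than-average period and a factor below 1 a quieter one. Below is a minimal sketch on a made-up toy sample, written in plain Python instead of pandas; `toy_hours` is hypothetical and not taken from the dataset.

```python
from collections import Counter
from statistics import mean

# Hypothetical toy sample of call hours (not real data)
toy_hours = [0, 0, 1, 2, 2, 2, 2, 3]

counts = Counter(toy_hours)          # calls observed in each hour
mean_count = mean(counts.values())   # average calls per observed hour

# Same idea as value_counts() / value_counts().mean(), sorted by hour
factors = {hour: n / mean_count for hour, n in sorted(counts.items())}
print(factors)
print(mean(factors.values()))
```

Note that a period with no calls at all would simply be absent from the result, both in this sketch and in `value_counts`, so a lookup for it would fail instead of returning 0.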

We will actually make one set of these three distributions per severity level.

This will allow us to modify the base emergency density of a given severity as follows:

In [12]:
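def emergency_density(severity, hour, weekday, month):
    """Return the expected emergencies per second for a severity level and time slot."""
    base_density = frequencies_per_grav[severity]
    # Scale the base rate by the relative weight of each time period
    density = base_density * hourly_dist[hour] * daily_dist[weekday] * monthly_dist[month]
    return density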

In [13]:
emergency_density(3, 12, 4, 5)  # Emergency density for severity level 3 at 12:00 on a Thursday in May

0.0007160038372819694
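To illustrate how a generator might consume such a density, the sketch below treats emergencies as a Poisson process with a constant rate over a one-hour window and counts the simulated arrivals. The Poisson assumption, the `count_arrivals` helper and the seed are illustrative choices of ours, not something the notebook prescribes.

```python
import random

def count_arrivals(density, window_seconds, rng):
    """Count events of a constant-rate Poisson process inside one time window."""
    t = rng.expovariate(density)  # waiting time until the first event
    n = 0
    while t < window_seconds:
        n += 1
        t += rng.expovariate(density)  # waiting time until the next event
    return n

density = 0.0007160038372819694  # value returned by emergency_density(3, 12, 4, 5) above
rng = random.Random(1)           # fixed seed only to make the sketch reproducible

counts = [count_arrivals(density, 3600, rng) for _ in range(5_000)]
average = sum(counts) / len(counts)
print(average, density * 3600)
```

The simulated average should land close to `density * 3600`, that is, about 2.6 severity-3 emergencies in that hour.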

In order for the model to read these distributions, we need to store them in a dict-like format. We use YAML, which is easy for both humans and machines to read.

In [14]:
dists = {}
for severity in range(1, 6):
    sub_df = df[df["severity"] == severity]
    
    frequency = float(frequencies_per_grav.round(8)[severity])
    
    # Names prefixed with sev_ avoid overwriting the global hourly_dist, daily_dist and monthly_dist
    sev_hourly = (sub_df.hour.value_counts() / sub_df.hour.value_counts().mean()).sort_index().round(5).to_dict()
    sev_daily = (sub_df.weekday.value_counts() / sub_df.weekday.value_counts().mean()).sort_index().round(5).to_dict()
    sev_monthly = (sub_df.month.value_counts() / sub_df.month.value_counts().mean()).sort_index().round(5).to_dict()

    district_prob = (sub_df.district_code.value_counts() / sub_df.district_code.value_counts().sum()).sort_index().round(5).to_dict()

    dists[severity] = {"frequency": frequency,
                       "hourly_dist": sev_hourly,
                       "daily_dist": sev_daily,
                       "monthly_dist": sev_monthly,
                       "district_prob": district_prob}
    

In [15]:
# The context manager closes (and flushes) the file before we read it back
with open("../data/distributions.yaml", "w") as f:
    yaml.dump(dists, f, allow_unicode=True)

We can now check that the dictionary stored in the YAML file is the same one we have created.

In [16]:
with open("../data/distributions.yaml") as dist_file:
    yaml_dict = yaml.safe_load(dist_file)

In [17]:
yaml_dict == dists

True
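On the consuming side, the loaded dictionary can be used much like `emergency_density` above, except that every factor now comes from the entry of the requested severity. The sketch below assumes the structure written in cell 14; `density_from_dists` is a hypothetical helper, and `toy_dists` is a truncated stand-in with made-up numbers so that the example runs without the YAML file.

```python
def density_from_dists(dists, severity, hour, weekday, month):
    """Per-second density from a dict shaped like the one stored in distributions.yaml."""
    d = dists[severity]
    return (d["frequency"]
            * d["hourly_dist"][hour]
            * d["daily_dist"][weekday]
            * d["monthly_dist"][month])

# Hypothetical, heavily truncated stand-in for the loaded YAML (made-up values)
toy_dists = {
    3: {
        "frequency": 0.0004,
        "hourly_dist": {12: 1.5},
        "daily_dist": {4: 1.0},
        "monthly_dist": {5: 0.5},
        "district_prob": {1: 1.0},
    }
}

result = density_from_dists(toy_dists, 3, 12, 4, 5)
print(result)  # approximately 0.0003 with these made-up factors
```

Since the comparison above returns `True`, the integer keys survive the YAML round trip, so lookups such as `yaml_dict[3]["hourly_dist"][12]` work with plain integers.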