# PREPROCESSING

This notebook processes the original data and keeps only the information needed for modeling. The data is then segmented by city, because later steps compute distance matrices and run the max coverage algorithm for each city separately.

The dataset containing polygons for 15 Belgian cities is imported, but only 8 of them are selected: Antwerpen, Brugge, Brussels, Charleroi, Gent, Leuven, Liege, and Oostende. These cities provide public street locations suitable for potentially placing AEDs.

## Libraries, Constants, Functions

Importing libraries, constants, and functions from separate .py files, so that the same code can be reused across multiple notebooks.

In [1]:
from libraries import *
from constants import *
from functions import *

## Importing data

Importing the original datasets for AEDs, interventions, and vehicles.

In [2]:
os.chdir(data_path + original_path)

for filename in os.listdir():
    if filename.endswith(".parquet.gzip"):
        df_name = filename.split('.')[0]
        globals()[df_name] = pd.read_parquet(filename)
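
The loop above binds each file to a global variable named after the part of the filename before the first dot. A minimal sketch of that naming rule follows; the filenames in it are hypothetical, since the real directory listing is not shown in this notebook.

```python
def frame_names(filenames, suffix=".parquet.gzip"):
    """Map the variable name the loop would create to the file it comes from."""
    return {name.split(".")[0]: name for name in filenames if name.endswith(suffix)}

# Hypothetical directory listing, for illustration only.
listing = ["aed_locations.parquet.gzip", "cad9.parquet.gzip", "README.md"]
names = frame_names(listing)
print(names)
```

Keeping the loaded frames in a dictionary keyed by these names would be an alternative to writing into `globals()`, at the cost of changing how later cells refer to them.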

Importing the dataset with the city polygons and keeping the 8 selected cities.

In [3]:
cities = gpd.read_file(data_path + belgium_polygons_path)

objectid_list = [1, 2, 3, 4, 5, 6, 8, 11]
# .copy() avoids a SettingWithCopyWarning when 'CityName' is modified below
cities = cities[cities['OBJECTID'].isin(objectid_list)].copy()

city_name_mapping = {
    'Bruxelles / Brussel (greater city)': 'Brussels',
    'Charleroi (greater city)': 'Charleroi',
    'Liège (greater city)': 'Liege'
}

cities['CityName'] = cities['CityName'].replace(city_name_mapping)

# Reprojecting to EPSG:4326 (WGS 84 latitude/longitude)
cities = cities.to_crs(epsg=4326)

## Preprocessing the original data

### AED locations

Converting the 'public' column to a binary flag (0 or 1). Missing values are treated as 0.

In [4]:
aed_locations['public'] = aed_locations['public'].fillna("0")
aed_locations['public'] = aed_locations['public'].apply(lambda x: 
    1 if x.lower().startswith(('o', 'j', 'y')) else 0
)
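
The rule in the cell above can be tried on a few values in isolation. This is a minimal sketch: the sample answers are hypothetical (the raw column values are not shown here), and reading the prefixes 'o', 'j' and 'y' as "oui", "ja" and "yes" is an assumption.

```python
def is_public(value):
    """Return 1 when the answer starts with 'o', 'j' or 'y' (case-insensitive), else 0."""
    if value is None:
        return 0
    return 1 if str(value).lower().startswith(("o", "j", "y")) else 0

# Hypothetical raw answers, for illustration only.
samples = ["Oui", "Ja", "Y", "Non", "Nee", "N", None]
flags = [is_public(v) for v in samples]
print(list(zip(samples, flags)))
```

Because only the first letter is checked, any other answer that happens to begin with 'o', 'j' or 'y' is also mapped to 1.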

Geocoding AED addresses to obtain coordinates using the Google Geocode API. To avoid unwanted costs, the cell first asks the user to confirm the roughly 15,000 API requests (one per AED address). If the user answers "yes", it retrieves the coordinates; otherwise, it sets all coordinates to 0.

In [5]:
addresses = (
    aed_locations['address'].astype(str) + ", " +
    aed_locations['number'].astype(str) + ", " +
    aed_locations['postal_code'].astype(str) + ", " +
    aed_locations['municipality'].astype(str) + ", " +
    aed_locations['province'].astype(str)
)

confirmation = input(f"This will initialize {len(addresses)} API requests. Are you sure? (yes/no): ")
if confirmation == "yes":
    print("OK. Geocoding...")
    geocoded = addresses.apply(lambda x: gmaps.geocode(x))
    
    latitude = geocoded.apply(lambda x: x[0]['geometry']['location']['lat'] if x else None)
    longitude = geocoded.apply(lambda x: x[0]['geometry']['location']['lng'] if x else None)
    coordinates = pd.DataFrame({'latitude': latitude, 'longitude': longitude})

    aed_locations = pd.concat([aed_locations, coordinates], axis=1)
else:
    print("OK. All coordinates are set to 0.")
    aed_locations['latitude'] = 0
    aed_locations['longitude'] = 0

This will initialize 15227 API requests. Are you sure? (yes/no):  no


OK. All coordinates are set to 0.
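
The geocoding cell reads the latitude and longitude from the first result and falls back to `None` when the result list is empty. A minimal sketch of that lookup as a single helper is shown below; `extract_lat_lng` is a hypothetical name, and the response is a hand-written stand-in with the same nesting the cell above indexes into, not real API output.

```python
def extract_lat_lng(results):
    """Return (lat, lng) from the first geocoding result, or (None, None) if empty."""
    if not results:
        return (None, None)
    location = results[0]["geometry"]["location"]
    return (location["lat"], location["lng"])

# Hand-written stand-in for a geocoding response, for illustration only.
fake_hit = [{"geometry": {"location": {"lat": 50.8466, "lng": 4.3528}}}]

print(extract_lat_lng(fake_hit))
print(extract_lat_lng([]))
```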


Renaming the 'municipality' column to 'city'.

In [6]:
aed_locations.rename(columns={'municipality': 'city'}, inplace=True)

Filtering the dataframe to keep only the processed columns, discarding the others.

In [7]:
aeds = aed_locations[['public', 'latitude', 'longitude', 'city']].copy()

### Interventions (Cards)

In [8]:
interventions = pd.concat([interventions1, interventions2, interventions3], ignore_index=True)
del interventions1, interventions2, interventions3

Keeping only the interventions related to cardiac events (codes P003, P011, P039) in each dataset.

In [9]:
cardiac_codes_string = '|'.join(cardiac_codes)

interventions = interventions[
    interventions['EventType Firstcall'].str.contains(cardiac_codes_string) |
    interventions['EventType Trip'].str.contains(cardiac_codes_string)
]

cad9['EventType Trip'] = cad9['EventType Trip'].fillna("unknown")
cad9 = cad9[cad9['EventType Trip'].str.contains(cardiac_codes_string)]

interventions_bxl = interventions_bxl[
    interventions_bxl['eventtype_firstcall'].str.contains(cardiac_codes_string) |
    interventions_bxl['eventtype_trip'].str.contains(cardiac_codes_string)
]

interventions_bxl2['EventType and EventLevel'] = interventions_bxl2['EventType and EventLevel'].fillna("unknown")
interventions_bxl2 = interventions_bxl2[
    interventions_bxl2['EventType and EventLevel'].str.contains(cardiac_codes_string)
]
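
The filter relies on joining the codes with `|`, which `str.contains` interprets as a regular expression matching any one of them anywhere in the string. A minimal sketch of the same matching with the standard `re` module follows. It assumes `cardiac_codes` holds the three codes named above, and the event type labels are hypothetical.

```python
import re

cardiac_codes = ["P003", "P011", "P039"]  # assumed content of the constant
pattern = re.compile("|".join(cardiac_codes))

def is_cardiac(event_type):
    """True when any cardiac code occurs anywhere in the event type string."""
    return isinstance(event_type, str) and pattern.search(event_type) is not None

# Hypothetical event type labels, for illustration only.
events = ["P003 - Cardiac arrest", "P011 - Chest pain", "P020 - Intoxication", None]
matches = [is_cardiac(e) for e in events]
print(matches)
```

Missing values are treated as non-matches here, which mirrors the effect of filling them with "unknown" before filtering.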

Filtering the dataframes to keep only the columns necessary for the model and ensuring they are named consistently across datasets for merging purposes.

In [10]:
selected_columns = ["Latitude intervention", "Longitude intervention", "CityName intervention"]
interventions = interventions[selected_columns]

cad9 = cad9[selected_columns]

selected_columns = ["latitude_intervention", "longitude_intervention", "cityname_intervention"]
interventions_bxl = interventions_bxl[selected_columns]

selected_columns = ["Latitude intervention", "Longitude intervention", "Cityname Intervention"]
interventions_bxl2 = interventions_bxl2[selected_columns]

colnames = ["latitude", "longitude", "city"]

interventions.columns = colnames
cad9.columns = colnames
interventions_bxl.columns = colnames
interventions_bxl2.columns = colnames

cards = pd.concat([interventions, cad9, interventions_bxl, interventions_bxl2], ignore_index=True)

del interventions, cad9, interventions_bxl, interventions_bxl2, colnames, selected_columns

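For some observations, the decimal point in the coordinates is misplaced, for example a latitude stored as 517.89 instead of 51.789. Consequently, the dataset is divided into two groups: one whose coordinates fall inside Belgium's bounding box and one whose coordinates fall outside it. The out-of-range values are rescaled, using the user-defined function insert_decimal for values of 1000 or more. Rows that are still out of range afterwards are dropped, as are duplicate coordinate pairs.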

In [11]:
# Rows whose coordinates already fall inside Belgium's bounding box
cards2 = cards[
    (cards['latitude'] >= BELGIUM_SOUTH) & (cards['latitude'] <= BELGIUM_NORTH) &
    (cards['longitude'] >= BELGIUM_WEST) & (cards['longitude'] <= BELGIUM_EAST)
]

# Rows outside the bounding box (misplaced decimal point), excluding missing values
cards3 = cards[
    (cards['latitude'] < BELGIUM_SOUTH) | (cards['latitude'] > BELGIUM_NORTH) |
    (cards['longitude'] < BELGIUM_WEST) | (cards['longitude'] > BELGIUM_EAST)
]
# .copy() so the fixes below modify an independent frame, not a view of cards
cards3 = cards3[cards3['latitude'].notna() & cards3['longitude'].notna()].copy()

# Fixing cards3
cards3['latitude'] = cards3['latitude'].apply(lambda x: x / 10 if 100 <= x < 1000 else x)
cards3['latitude'] = cards3['latitude'].apply(lambda x: insert_decimal(x, 2) if x >= 1000 else x)

cards3['longitude'] = cards3['longitude'].apply(lambda x: x / 10 if 10 <= x < 100 else (x / 100 if 100 <= x < 1000 else x))
cards3['longitude'] = cards3['longitude'].apply(lambda x: insert_decimal(x, 1) if x >= 1000 else x)

# Concatenate
cards = pd.concat([cards2, cards3])

# Filter outlying values
cards = cards[
    (cards['latitude'] >= BELGIUM_SOUTH) & (cards['latitude'] <= BELGIUM_NORTH) &
    (cards['longitude'] >= BELGIUM_WEST) & (cards['longitude'] <= BELGIUM_EAST)
]

cards['latitude'] = pd.to_numeric(cards['latitude'], errors='coerce')
cards['longitude'] = pd.to_numeric(cards['longitude'], errors='coerce')
cards = cards.drop_duplicates(subset=['latitude', 'longitude'], keep='last')
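
The helper `insert_decimal` is defined in functions.py and is not shown in this notebook. The sketch below is one possible implementation, assuming the second argument is the number of digits that should precede the decimal point; the actual function may differ.

```python
def insert_decimal(value, position):
    """Re-place the decimal point after `position` leading digits of `value`."""
    sign = "-" if value < 0 else ""
    # Fixed-point text without the decimal point, e.g. 51789 -> "51789000000"
    digits = f"{abs(value):f}".replace(".", "")
    return float(f"{sign}{digits[:position]}.{digits[position:]}")

print(insert_decimal(51789, 2))   # latitude stored without a decimal point
print(insert_decimal(4356.0, 1))  # longitude stored without a decimal point
```

With two leading digits for latitude and one for longitude, the rescaled values land in the range expected for Belgium, which is why the cell above calls the helper with 2 and 1 respectively.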

## Segmenting data by city

Segmenting both the AEDs and cards datasets by city. Furthermore, the cards dataset is randomly split into training and test data in a 75:25 ratio.

In [12]:
os.chdir(data_path + clean_path)

for city_name, city_polygon in cities[['CityName', 'geometry']].values:
    print("Segmenting " + city_name + "...")
    # aeds
    city_aeds = filter_points_within_polygon(aeds, city_polygon)
    city_aeds.to_csv(f'{city_name}_aeds.csv', index=False)
    
    # cards - split into train and test sets
    city_cards = filter_points_within_polygon(cards, city_polygon)
    cards_train, cards_test = train_test_split(city_cards, test_size=TEST_SIZE, random_state=SEED)    
    cards_train.to_csv(f'{city_name}_cards_train.csv', index=False)
    cards_test.to_csv(f'{city_name}_cards_test.csv', index=False)

Segmenting Brussels...
Segmenting Antwerpen...
Segmenting Gent...
Segmenting Charleroi...
Segmenting Liege...
Segmenting Brugge...
Segmenting Leuven...
Segmenting Oostende...
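
The helper `filter_points_within_polygon` also lives in functions.py and is not shown. As an illustration of what keeping the points inside a city polygon involves, the sketch below uses a plain ray-casting test on a single ring of vertices. This is a stand-in technique, not the notebook's implementation, and it ignores holes and multi-part polygons. The rectangle and points are hypothetical.

```python
def point_in_polygon(lon, lat, ring):
    """Ray-casting test: True if (lon, lat) lies inside the ring of (lon, lat) vertices."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that line
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

# Hypothetical rectangle and points as (lon, lat), for illustration only.
ring = [(4.30, 50.80), (4.45, 50.80), (4.45, 50.90), (4.30, 50.90)]
points = [(4.35, 50.85), (3.72, 51.05)]
kept = [p for p in points if point_in_polygon(p[0], p[1], ring)]
print(kept)
```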


## Calculating possible AED locations

Calculating coordinates for potential locations on public streets where AEDs could be placed. Because the free credits of the Google API are limited, only a random sample of the candidate locations is kept. Among locations that lie too close to each other, only one is then retained, so that the remaining candidates are spread more evenly.

In [14]:
os.chdir(data_path + possible_locations_path)

for city_name, city_polygon in cities[['CityName', 'geometry']].values:
    print("Calculating possible AED locations for " + city_name + "...")
    
    streets = get_streets_within_polygon(city_polygon)
    points = sample_points_on_streets(streets, num_points = 3)
    
    possible_locations = gpd.GeoDataFrame(geometry = points, crs = streets.crs)
    possible_locations = possible_locations.sample(frac = SAMPLE_SIZE, random_state = SEED)
    possible_locations = remove_close_points(possible_locations, min_distance = MIN_DISTANCE)
    
    possible_locations.to_csv(f'{city_name}_possible_locations.csv', index=False)
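
The thinning step is performed by `remove_close_points`, which is defined in functions.py and not shown here. One way such a step could work is a greedy pass that keeps a point only if it is far enough from every point already kept. The sketch below assumes the minimum distance is given in metres and uses the haversine formula; the function names, the unit, and the sample coordinates are all assumptions.

```python
import math

def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000 * math.asin(math.sqrt(a))

def thin_points(points, min_distance_m):
    """Greedily keep a point only if it is at least min_distance_m from all kept points."""
    kept = []
    for p in points:
        if all(haversine_m(p, q) >= min_distance_m for q in kept):
            kept.append(p)
    return kept

# Hypothetical candidates as (lat, lon); the first two are only a few metres apart.
candidates = [(50.84660, 4.35280), (50.84663, 4.35283), (50.85500, 4.36000)]
spread = thin_points(candidates, min_distance_m=50)
print(spread)
```

The result of a greedy pass depends on the order of the input, so sampling with a fixed seed beforehand, as the cell above does, keeps the outcome reproducible.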

Calculating possible AED locations for Brussels...
Calculating possible AED locations for Antwerpen...
Calculating possible AED locations for Gent...
Calculating possible AED locations for Charleroi...
Calculating possible AED locations for Liege...
Calculating possible AED locations for Brugge...
Calculating possible AED locations for Leuven...
Calculating possible AED locations for Oostende...
