# Location Mapping

The raw data contains more than a thousand distinct locations, many of which appear in only a handful of listings. Such sparse categories can considerably reduce the quality of our model, so we must consolidate the locations without shrinking the dataset. To do so, we will associate every location with its closest population center.
\
\
This requires a list of population centers. These are areas populous enough to yield a large number of listings under a single location label. In high-density regions such as Montreal, a population center can be as small as a borough (e.g. Plateau Mont-Royal, ~10 000 listings). In medium-density regions, we will typically choose a city (e.g. Granby, ~10 000 listings). In low-density regions, we might pick an entire region (e.g. Gaspésie, ~1000 listings).
\
\
I built the final list of population centers by hand, listing all boroughs, cities and regions with a population of more than 5000. I tried several lists found on Wikipedia and elsewhere, sometimes with my own tweaks. In the end, my handmade list gave the best results.
\
\
Here's how the list is built:
1. Rows 1 to 20: Montreal Boroughs
2. Rows 21 to 29: Cities on the island of Montreal (e.g. Kirkland, Dorval etc.)
3. Rows 30 to 44: Laval Boroughs 
4. Rows 45 to 82: Major cities in Montérégie (e.g. Brossard, Saint-Jean, Granby etc.)
5. Rows 83 to 92: Major cities in Laurentides (e.g. Mirabel, Blainville etc.)
6. Rows 93 to 97: Major cities in Lanaudière (e.g. Joliette, Terrebonne etc.)
7. Rows 98 to 101: Major cities in Chaudière-Appalaches (e.g. Lévis, Saint-Georges etc.)
8. Rows 102 to 105: Major cities in Estrie (e.g. Sherbrooke, Magog etc.)
9. Rows 106 to 110: Major cities in Saguenay (e.g. Alma, Chicoutimi etc.)
10. Rows 111 to 116: Major cities in Capitale-Nationale (e.g. Quebec City, Portneuf etc.)
11. Rows 117 to 135: Major cities in Outaouais, Mauricie, Centre-du-Québec, Bas-Saint-Laurent, Abitibi-Témiscamingue, Gaspésie and Côte-Nord
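
Because the list is ordered by group, the group of any entry can be recovered from its row number alone. The sketch below shows one way to do that lookup with `bisect`. The `GROUPS` table and the `group_for_row` helper are illustrative names that do not exist in the notebook, the last label ("Other regions") is shorthand for item 11, and rows are assumed to be numbered from 1.

```python
from bisect import bisect_left

# (last row of the group, group label), copied from the list above.
GROUPS = [
    (20, "Montreal boroughs"),
    (29, "Cities on the island of Montreal"),
    (44, "Laval boroughs"),
    (82, "Montérégie"),
    (92, "Laurentides"),
    (97, "Lanaudière"),
    (101, "Chaudière-Appalaches"),
    (105, "Estrie"),
    (110, "Saguenay"),
    (116, "Capitale-Nationale"),
    (135, "Other regions"),  # shorthand for the seven regions of item 11
]
LAST_ROWS = [last for last, _ in GROUPS]


def group_for_row(row: int) -> str:
    """Return the group label for a 1-based row number of the list."""
    if not 1 <= row <= LAST_ROWS[-1]:
        raise ValueError(f"row must be between 1 and {LAST_ROWS[-1]}, got {row}")
    # The first group whose last row is >= row contains that row.
    return GROUPS[bisect_left(LAST_ROWS, row)][1]


print(group_for_row(16))
print(group_for_row(51))
```

With 1-based rows, the zero-based index 50 printed later in this notebook (Granby) corresponds to row 51, which falls in the Montérégie block as expected.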

### Imports

In [11]:
import numpy as np
import pandas as pd
import pickle

from geopy.distance import geodesic
from geopy.geocoders import Nominatim
from tqdm import tqdm
from os import path

### Methods & Resources

In [12]:
geolocator = Nominatim(user_agent='housing-qc')

In [13]:
# Find the geographical coordinates of every location in the given list.
# Returns a DataFrame of resolved locations and a list of the names that could not be geocoded.
def compute_coordinates(locations: list):
    rows = []
    unknown_locations = []

    for location in tqdm(locations, desc='Finding Location Coordinates'):
        # Try to geocode the location; treat any error (timeout, service failure) as "not found".
        try:
            geocode = geolocator.geocode(location + ', QC')
        except Exception:
            geocode = None

        if geocode is None:
            unknown_locations.append(location)
            print(location + ' is an unknown location')
        else:
            rows.append({'Name': location, 'Latitude': geocode.latitude, 'Longitude': geocode.longitude})

    return pd.DataFrame(rows), unknown_locations
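
The public Nominatim service asks clients to stay at or below one request per second, while `compute_coordinates` sends its requests back to back. A minimal sketch of a throttling wrapper follows. `throttle` and its parameters are hypothetical names introduced here for illustration; geopy also ships `geopy.extra.rate_limiter.RateLimiter`, which is likely the more complete option in practice.

```python
import time


def throttle(func, min_delay_seconds=1.0, clock=time.monotonic, sleep=time.sleep):
    """Wrap `func` so that consecutive calls start at least `min_delay_seconds` apart.

    `clock` and `sleep` are injectable so the behaviour can be checked without waiting.
    """
    last_call = [None]  # start time of the previous call, None before the first call

    def wrapper(*args, **kwargs):
        if last_call[0] is not None:
            wait = min_delay_seconds - (clock() - last_call[0])
            if wait > 0:
                sleep(wait)
        last_call[0] = clock()
        return func(*args, **kwargs)

    return wrapper


# Hypothetical usage inside the notebook (not executed here):
#   geocode = throttle(geolocator.geocode)
#   geocode(location + ', QC')
```

Wrapping the geocoder rather than adding `time.sleep` inside the loop keeps `compute_coordinates` unchanged apart from the function it calls.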

In [14]:
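# Load cached coordinates from `output` if the file exists; otherwise geocode the list and cache the result.
# Note: on a cache hit the list of unknown locations is empty, because it is not stored in the cache.
def compute_list_coordinates(locations: list, output: str):
    if path.exists(output):
        unknown_locations = []
        coordinates = pd.read_csv(output)
    else:
        coordinates, unknown_locations = compute_coordinates(locations)
        coordinates.to_csv(output, index=False)

    return coordinates, unknown_locations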
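
One consequence of this caching is that a cache hit always reports an empty `unknown_locations`, so a later "All coordinates successfully calculated." message may only mean that the cache file was found. A possible variation, sketched here with the standard library only, stores the unknown names alongside the coordinates. `load_or_compute`, the JSON layout and `fake_compute` are assumptions made for illustration; the notebook itself caches a CSV through pandas.

```python
import json
import os
import tempfile


def load_or_compute(cache_path, compute):
    """Return (coordinates, unknown_locations), reusing the cache file when present."""
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            cached = json.load(f)
        return cached["coordinates"], cached["unknown_locations"]

    coordinates, unknown_locations = compute()
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump({"coordinates": coordinates, "unknown_locations": unknown_locations}, f)
    return coordinates, unknown_locations


calls = []  # records how many times the expensive step actually ran


def fake_compute():
    # Stand-in for the geocoding step; "Sheenboro" is used only as an example of an unresolved name.
    calls.append(1)
    coordinates = [{"Name": "Beauport", "Latitude": 46.907111, "Longitude": -71.212797}]
    return coordinates, ["Sheenboro"]


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "coordinates.json")
    first = load_or_compute(cache_path, fake_compute)   # computes and writes the cache
    second = load_or_compute(cache_path, fake_compute)  # reads the cache back
```

In this sketch the second call returns the same unknown names as the first without running the expensive step again.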

In [15]:
# Given a latitude and longitude, find the closest location in the given list.
def find_closest_location(latitude: float, longitude: float, locations: pd.DataFrame):
    # Geodesic distance (in km) from the query point to every candidate location.
    distances = [
        geodesic((latitude, longitude), (location['Latitude'], location['Longitude'])).km
        for _, location in locations.iterrows()
    ]

    return locations.iloc[int(np.argmin(distances))]['Name']
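
To make the nearest-center rule concrete, here is a dependency-free sketch of the same idea. It swaps geopy's `geodesic` for the haversine great-circle formula, which is an approximation and not what the notebook uses. The helper names are hypothetical, and the candidate list is limited to the first five population centers printed later in this notebook, so the answer differs from the full 135-entry run, where Mercier maps to itself.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


# First five population centers, with the coordinates printed later in this notebook.
CENTERS = [
    ("Côte-des-Neiges-Notre-Dame-de-Grâce", 45.467967, -73.628922),
    ("Villeray-Saint-Michel-Parc-Extension", 45.537006, -73.625796),
    ("Rosemont-La Petite-Patrie", 45.553384, -73.576036),
    ("Mercier-Hochelaga-Maisonneuve", 45.574106, -73.525846),
    ("Ahuntsic-Cartierville", 45.541892, -73.680319),
]


def closest_center(latitude, longitude, centers=CENTERS):
    """Return the name of the center with the smallest haversine distance."""
    return min(centers, key=lambda c: haversine_km(latitude, longitude, c[1], c[2]))[0]


# Raw location "Mercier", coordinates as printed later in this notebook.
nearest = closest_center(45.310444, -73.746051)
print(nearest)
```

Among these five candidates only, the closest one is Côte-des-Neiges-Notre-Dame-de-Grâce, roughly 20 km away.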

In [16]:
# Map all locations to a reference location based on closest distance.
def build_location_mapper(locations: pd.DataFrame, reference_locations: pd.DataFrame):
    location_mapper = {}
    
    for _, location in tqdm(locations.iterrows(), desc="Building Location Mapper", total=locations.shape[0]):
        closest_location = find_closest_location(location['Latitude'], location['Longitude'], reference_locations)
        location_mapper[location['Name']] = closest_location
        
    return location_mapper

### Read Data

Read unique raw locations

In [17]:
home_df = pd.read_csv('../data/raw/home_listings.csv')
condo_df = pd.read_csv('../data/raw/condo_listings.csv')
raw_df = pd.concat([home_df, condo_df], axis=0, ignore_index=True)

In [18]:
locations = list(raw_df['location'].unique())
print('Raw Locations: ' + str(locations[0]) + ', ' + str(locations[1]) + ', ' + str(locations[2]) + ' etc.')

Raw Locations: Beauport, Deschambault, Mercier etc.


Read list of population centers

In [19]:
population_centers = pd.read_csv('../data/references/handmade/qc-population-centers.csv')['Name'].to_list()
print('Population Centers: ' + str(population_centers[15]) + ', ' + str(population_centers[50]) + ', ' + str(population_centers[80]) + ' etc.')

Population Centers: Lachine, Granby, Saint-Bruno-de-Montarville etc.


### Format Raw Location Strings

Locations unknown to geopy (most likely due to name changes over the years)

In [20]:
# Manual replacements for names that geopy could not resolve.
unknown_locations_dict = {
    "L'Ile Des Soeurs": "Ile des soeurs",
    "St-Denis-sur-Mer": "Saint-Denis",
    "St-Simon-De-Rimouski": "Saint-Simon",
    "St-Guillaume-D'Upton": "Saint-Guillaume",
    "St-Joseph-De-Ham-Sud": "Ham-Sud",
    "St-Adelphe-De-Champlain": "Saint-Adelphe",
    "St-Mathieu-De-Laprairie": "Saint-Mathieu",
    "St-Isidore-De-Laprairie": "Saint-Isidore",
    "St-Stanislas-De-Champlain": "Saint-Stanislas",
    "St-Sebastien-De-Frontenac": "Saint-Sebastien",
    "Ste-Francoise-De-Lotbiniere": "Sainte-Francoise",
    "Sheenboro": "Pontiac"
}

locations = [unknown_locations_dict.get(x, x) for x in locations]

Abbreviations

In [21]:
# Convert every entry to a string (missing values become 'nan') before expanding abbreviations.
locations = [str(i) for i in locations]

locations = [location.replace('St-', 'Saint-') for location in locations]
locations = [location.replace('Ste-', 'Sainte-') for location in locations]
locations = [location.replace('Sts-', 'Saints-') for location in locations]
locations = [location.replace('ND-', 'Notre-Dame-') for location in locations]
locations = [location.replace('JC', 'Jacques-Cartier') for location in locations]
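
The abbreviation expansion uses plain `str.replace`, which matches the abbreviation anywhere in the string (for example, any name that happens to contain the letters `JC`). A slightly stricter variant, sketched below, anchors each abbreviation at a word boundary with `re`. The `ABBREVIATIONS` table and `expand_abbreviations` are illustrative names, and the last sample input is a hypothetical name used only to exercise the `JC` rule.

```python
import re

# Same abbreviations as in the notebook, anchored at word boundaries.
ABBREVIATIONS = [
    (re.compile(r"\bSt-"), "Saint-"),
    (re.compile(r"\bSte-"), "Sainte-"),
    (re.compile(r"\bSts-"), "Saints-"),
    (re.compile(r"\bND-"), "Notre-Dame-"),
    (re.compile(r"\bJC\b"), "Jacques-Cartier"),
]


def expand_abbreviations(name: str) -> str:
    """Expand the known abbreviations in a location name."""
    for pattern, replacement in ABBREVIATIONS:
        name = pattern.sub(replacement, name)
    return name


samples = [
    "St-Simon-De-Rimouski",
    "Ste-Francoise-De-Lotbiniere",
    "Ste-Catherine-de-la-JC",  # hypothetical input, not taken from the dataset
]
expanded = [expand_abbreviations(s) for s in samples]
print(expanded)
```

Whether the stricter matching changes any result on the real data is not something this sketch can show. It only guards against accidental matches inside longer words.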

### Compute Geographical Coordinates

For Raw Locations

In [22]:
output = '../data/processed/raw_location_coordinates.csv'
raw_location_coordinates, unknown_locations = compute_list_coordinates(locations=locations, output=output)

if len(unknown_locations) == 0:
    print("All coordinates successfully calculated.")

raw_location_coordinates.head()

All coordinates successfully calculated.


| | Name | Latitude | Longitude |
|---|---|---|---|
| 0 | Beauport | 46.907111 | -71.212797 |
| 1 | Deschambault | 46.662647 | -71.944288 |
| 2 | Mercier | 45.310444 | -73.746051 |
| 3 | Stoneham | 46.999608 | -71.369475 |
| 4 | Trois-Rivières | 46.371592 | -72.600502 |


For Population Centers

In [23]:
output = '../data/processed/population_centers_coordinates.csv'
population_centers_coordinates, unknown_locations = compute_list_coordinates(locations=population_centers, output=output)

if len(unknown_locations) == 0:
    print("All coordinates successfully calculated.")

population_centers_coordinates.head()

All coordinates successfully calculated.


| | Name | Latitude | Longitude |
|---|---|---|---|
| 0 | Côte-des-Neiges-Notre-Dame-de-Grâce | 45.467967 | -73.628922 |
| 1 | Villeray-Saint-Michel-Parc-Extension | 45.537006 | -73.625796 |
| 2 | Rosemont-La Petite-Patrie | 45.553384 | -73.576036 |
| 3 | Mercier-Hochelaga-Maisonneuve | 45.574106 | -73.525846 |
| 4 | Ahuntsic-Cartierville | 45.541892 | -73.680319 |


### Map Locations

In [24]:
location_mapper = build_location_mapper(raw_location_coordinates, population_centers_coordinates)

Building Location Mapper: 100%|██████████| 1168/1168 [00:31<00:00, 37.29it/s]


Save to .csv

In [25]:
location_mapper_df = pd.DataFrame(location_mapper.items(), columns=['location', 'mapping'])
location_mapper_df.to_csv('../data/processed/location_mapper.csv', index=False)
display(location_mapper_df.head())

| | location | mapping |
|---|---|---|
| 0 | Beauport | Quebec City |
| 1 | Deschambault | Portneuf |
| 2 | Mercier | Mercier |
| 3 | Stoneham | Stoneham-et-Tewkesbury |
| 4 | Trois-Rivières | Trois-Rivières |


Save to .pkl

In [26]:
with open('../data/processed/location_mapper.pkl', 'wb') as f:
    pickle.dump(location_mapper, f)
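
For reference, here is a small sketch of how the saved mapper might be applied downstream. It round-trips a five-entry sample (the rows displayed above) through pickle in memory, where the notebook writes a file. `map_location`, the `"Unknown"` default and the last listing name are hypothetical. Note that the mapper keys are the formatted names produced in the "Format Raw Location Strings" step, so raw names would presumably need the same formatting before lookup.

```python
import pickle

# Sample of the mapper, taken from the rows displayed above.
location_mapper = {
    "Beauport": "Quebec City",
    "Deschambault": "Portneuf",
    "Mercier": "Mercier",
    "Stoneham": "Stoneham-et-Tewkesbury",
    "Trois-Rivières": "Trois-Rivières",
}

# Round-trip through pickle in memory; the notebook writes to a .pkl file.
restored = pickle.loads(pickle.dumps(location_mapper))


def map_location(name, mapper, default=None):
    """Look up the population center for a formatted location name."""
    return mapper.get(str(name), default)


# The last entry is a hypothetical name that is absent from the mapper.
listings = ["Beauport", "Stoneham", "Beauport", "Somewhere-Else"]
mapped = [map_location(name, restored, default="Unknown") for name in listings]
print(mapped)
```

Returning an explicit default makes unmapped names visible, which a direct `mapper[name]` lookup would turn into a `KeyError`.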

In [27]:
len(location_mapper_df['mapping'].unique())

129