# Location Mapping Exploration
This notebook is just to explore the best solution for location mapping. Final mapping is done in location_mapping.ipynb
\
\
The raw data contains over a thousand different locations, a lot of them with very low occurrences. This can considerably reduce the quality of our model. We must filter our locations carefully without reducing the size of our dataset. To do so, we will associate every location to its closest population center.
\
\
This notebook goes through all location values and maps them to their closest population center. Population centers are defined as either:
1. Quebec Administrative Regions Main Cities (Estrie: Sherbrooke, Magog, Montérégie: Brossard, Granby etc.)
2. Quebec Administrative Regions (Estrie, Outaouais, Montérégie etc.)
3. Quebec Administrative Regions Main Cities + MTL Boroughs
4. Quebec Administrative Regions + MTL Boroughs
5. Quebec's 112 Biggest Cities + MTL Boroughs

In [41]:
import numpy as np
import pandas as pd
import pickle

from geopy.distance import geodesic
from geopy.location import Location
from geopy.geocoders import Nominatim
from tqdm import tqdm

from os import path
from IPython.display import display

## Useful Methods & Resources

### Geolocation Methods

In [42]:
geolocator = Nominatim(user_agent='housing-qc')

In [43]:
# Find geographical coordinates of all locations in given list
def find_coordinates(locations):
    location_dict = []
    unknown_locations = []

    for location in tqdm(locations, desc='Finding Location Coordinates'):
        try:
            geocode = geolocator.geocode(location + ', QC')
        except:
            geocode = None
        
        if geocode is None:
            unknown_locations.append(location)
            print(location + ' is an unknown location')
        else:
            location_dict.append({'Name': location, 'Latitude': geocode.latitude, 'Longitude': geocode.longitude})

    return pd.DataFrame(location_dict), unknown_locations

In [44]:
# Given geographical latitude and longitude, find the closest location in the given list
def find_closest_location(latitude, longitude, locations: pd.DataFrame):
    distances = []

    for _, location in locations.iterrows():
        distances.append(geodesic((latitude, longitude), (location['Latitude'], location['Longitude'])))

    return locations.iloc[np.argmin(distances)]['Name']

In [45]:
# Map all locations to a reference location based on distance
def build_location_mapper(locations: pd.DataFrame, reference_locations: pd.DataFrame):
    location_mapper = {}
    
    for _, location in tqdm(locations.iterrows(), desc="Building Location Mapper", total=locations.shape[0]):
        closest_location = find_closest_location(location['Latitude'], location['Longitude'], reference_locations)
        location_mapper[location['Name']] = closest_location
        
    return location_mapper

### Formatting Unknown Locations

In [46]:
# Dictionnary for unknown locations
unknown_locations_dict = {}
unknown_locations_dict["L'Ile Des Soeurs"] = "Ile des soeurs"
unknown_locations_dict["St-Joseph-De-Ham-Sud"] = "Ham-Sud"
unknown_locations_dict["St-Mathieu-De-Laprairie"] = "Saint-Mathieu"
unknown_locations_dict["St-Denis-sur-Mer"] = "Saint-Denis"
unknown_locations_dict["St-Isidore-De-Laprairie"] = "Saint-Isidore"
unknown_locations_dict["St-Stanislas-De-Champlain"] = "Saint-Denis"
unknown_locations_dict["St-Sebastien-De-Frontenac"] = "Saint-Sebastien"
unknown_locations_dict["St-Simon-De-Rimouski"] = "Saint-Simon"
unknown_locations_dict["Ste-Francoise-De-Lotbiniere"] = "Saint-Francoise"
unknown_locations_dict["St-Guillaume-D'Upton"] = "Saint-Guillaume"
unknown_locations_dict["St-Adelphe-De-Champlain"] = "Saint-Adelphe"

In [47]:
def fix_unknown_location(unknown_location: str):
    if unknown_location in unknown_locations_dict.keys():
        return unknown_locations_dict[unknown_location]
    elif "ND-" in unknown_location:
        return unknown_location.replace("ND-", "Notre-Dame-")
    elif "JC" in unknown_location:
        return unknown_location.replace("JC", "Jacques-Cartier")
    elif "St-" in unknown_location:
        return unknown_location.replace("St-", "Saint-")
    elif "Ste-" in unknown_location:
        return unknown_location.replace("Ste-", "Sainte-")
    elif "Sts-" in unknown_location:
        return unknown_location.replace("Sts-", "Saints-")
    else:
        return unknown_location

## Raw Locations

### Get Locations from Raw Listings

In [48]:
home_df = pd.read_csv('../data/raw/home_listings.csv')
condo_df = pd.read_csv('../data/raw/condo_listings.csv')
raw_df = pd.concat([home_df, condo_df], axis=0, ignore_index=True)

raw_locations = list(raw_df['location'].value_counts().to_dict().keys())
print('Raw Locations: ' + str(raw_locations[0]) + ', ' + str(raw_locations[1]) + ', ' + str(raw_locations[2]) + ' etc.')

Raw Locations: Gatineau, Trois-Rivières, Beauport etc.


### Get Raw Locations Coordinates

In [49]:
if path.exists('../data/processed/mappers exploration/raw_location_coordinates.csv'):
    raw_location_coordinates = pd.read_csv('../data/processed/mappers exploration/raw_location_coordinates.csv')
else:
    raw_location_coordinates, unknown_locations = find_coordinates(raw_locations)

    fixed_locations = list(map(fix_unknown_location, unknown_locations))
    unknown_locations_mapper = {fixed_locations[i]: unknown_locations[i] for i in range(len(unknown_locations))}

    raw_location_coordinates2, unknown_locations2 = find_coordinates(fixed_locations)
    if len(unknown_locations2) == 0:
        raw_location_coordinates2 = { unknown_locations_mapper.get(k, k): v for k, v in raw_location_coordinates2.items() }
        raw_location_coordinates.update(raw_location_coordinates2)

### Save Final Raw Locations Coordinates

In [50]:
raw_location_coordinates.to_csv('../data/processed/mappers exploration/raw_location_coordinates.csv', index=False)
raw_location_coordinates.head()

Unnamed: 0,Name,Latitude,Longitude
0,Ste-Catherine-de-la-Jacques-Cartier,46.844381,-71.615023
1,Notre-Dame-De-L'Ile-Perrot,45.351663,-73.902969
2,Saint-Mathieu,45.312563,-73.518448
3,Saint-Isidore,46.585058,-71.090469
4,Saint-Magloire-De-Bellechasse,46.592524,-70.440777


## 1. Quebec Administrative Regions Main Cities

### Read Data

In [51]:
regions_df = pd.read_csv('../data/references/wikipedia/qc-administrative-regions.csv')
regions_dict = dict(zip(regions_df['City'], regions_df['Region']))

display(regions_df.head())

Unnamed: 0,City,Region
0,Rimouski,Bas-Saint-Laurent
1,Rivière-du-Loup,Bas-Saint-Laurent
2,Matane,Bas-Saint-Laurent
3,Alma,Saguenay-Lac-Saint-Jean
4,Saguenay,Saguenay-Lac-Saint-Jean


In [52]:
mtl_boroughs = pd.read_csv('../data/references/wikipedia/mtl-boroughs.csv')
for borough in mtl_boroughs['Borough']:
    regions_dict[borough] = borough

### Get Region City Coordinates

In [53]:
if path.exists('../data/processed/mappers exploration/city_region_coordinates.csv'):
    city_region_coordinates = pd.read_csv('../data/processed/mappers exploration/city_region_coordinates.csv')
else:
    city_region_coordinates, unknown_cities = find_coordinates(regions_dict.keys())

    city_region_coordinates.to_csv('../data/processed/mappers exploration/city_region_coordinates.csv', index=False)
    city_region_coordinates.head()

### Map Raw Locations to City Regions

In [54]:
city_region_mapper = build_location_mapper(raw_location_coordinates, city_region_coordinates)

city_region_mapper_df = pd.DataFrame(city_region_mapper.items(), columns=['location', 'mapping'])
city_region_mapper_df.to_csv('../data/processed/mappers exploration/city_region_mapper.csv', index=False)
display(city_region_mapper_df.head())

with open('../data/processed/mappers exploration/city_region_mapper.pkl', 'wb') as f:
    pickle.dump(city_region_mapper, f)

Building Location Mapper: 100%|██████████| 1142/1142 [00:09<00:00, 121.56it/s]


Unnamed: 0,location,mapping
0,Ste-Catherine-de-la-Jacques-Cartier,Quebec City
1,Notre-Dame-De-L'Ile-Perrot,Vaudreuil-Dorion
2,Saint-Mathieu,Candiac
3,Saint-Isidore,Lévis
4,Saint-Magloire-De-Bellechasse,Saint-Georges


## 2. Quebec Administrative Regions

### Map Raw Locations to Regions

In [55]:
region_mapper = {}
for location in city_region_mapper.keys():
    region_city = city_region_mapper[location]
    region_mapper[location] = regions_dict[region_city]

with open('../data/processed/mappers exploration/region_mapper.pkl', 'wb') as f:
    pickle.dump(region_mapper, f)

## 3. Quebec Administrative Regions Main Cities + MTL Boroughs

### Get Coordinates

In [56]:
if path.exists('../data/processed/mappers exploration/city_region_mtl_coordinates.csv'):
    city_region_mtl_coordinates = pd.read_csv('../data/processed/mappers exploration/city_region_mtl_coordinates.csv')
else:
    city_region_mtl_coordinates, unknown_cities = find_coordinates(regions_dict.keys())

    city_region_mtl_coordinates.to_csv('../data/processed/mappers exploration/city_region_mtl_coordinates.csv', index=False)

display(city_region_mtl_coordinates.head())

Unnamed: 0,Name,Latitude,Longitude
0,Rimouski,48.450155,-68.529968
1,Rivière-du-Loup,47.835816,-69.536802
2,Matane,48.846877,-67.52955
3,Alma,48.548887,-71.651459
4,Saguenay,48.405959,-71.069183


### Map Raw Locations to Regions Main Cities + MTL Boroughs

In [57]:
city_region_mtl_mapper = build_location_mapper(raw_location_coordinates, city_region_mtl_coordinates)

city_region_mtl_mapper_df = pd.DataFrame(city_region_mtl_mapper.items(), columns=['location', 'mapping'])
city_region_mtl_mapper_df.to_csv('../data/processed/mappers exploration/city_region_mtl_mapper.csv', index=False)
city_region_mtl_mapper_df.head()

with open('../data/processed/mappers exploration/city_region_mtl_mapper.pkl', 'wb') as f:
    pickle.dump(city_region_mtl_mapper, f)

Building Location Mapper: 100%|██████████| 1142/1142 [00:13<00:00, 83.93it/s]


## 4. Quebec Administrative Regions + MTL Boroughs

### Map Raw Locations to Regions + MTL Boroughs

In [58]:
region_mtl_mapper = {}
for location in city_region_mtl_mapper.keys():
    region_city = city_region_mtl_mapper[location]
    region_mtl_mapper[location] = regions_dict[region_city]

with open('../data/processed/mappers exploration/region_mtl_mapper.pkl', 'wb') as f:
    pickle.dump(region_mtl_mapper, f)

## 5. Quebec's 112 Biggest Cities + MTL Boroughs

### Read Data

In [59]:
big_cities = pd.read_csv('../data/references/wikipedia/qc-cities.csv')['Name'].to_list()
big_cities.remove('Montréal')
big_cities = big_cities + mtl_boroughs['Borough'].to_list()

### Find Coordinates

In [60]:
if path.exists('../data/processed/mappers exploration/big_city_coordinates.csv'):
    big_city_coordinates = pd.read_csv('../data/processed/mappers exploration/big_city_coordinates.csv')
else:
    big_city_coordinates, unknown_locations = find_coordinates(big_cities)

    big_city_coordinates.to_csv('../data/processed/mappers exploration/big_city_coordinates.csv', index=False)

display(big_city_coordinates.head())

Unnamed: 0,Name,Latitude,Longitude
0,Québec,46.813743,-71.208406
1,Laval,45.605589,-73.734417
2,Longueuil,45.533339,-73.420032
3,Gatineau,45.484121,-75.681373
4,Chicoutimi,48.337025,-71.123526


### Map Raw Locations to 112 Biggest Cities + MTL Boroughs

In [61]:
big_city_mapper = build_location_mapper(raw_location_coordinates, big_city_coordinates)

big_city_mapper_df = pd.DataFrame(big_city_mapper.items(), columns=['location', 'mapping'])
big_city_mapper_df.to_csv('../data/processed/mappers exploration/big_city_mapper.csv', index=False)
big_city_mapper_df.head()

with open('../data/processed/mappers exploration/big_city_mapper.pkl', 'wb') as f:
    pickle.dump(big_city_mapper, f)

Building Location Mapper: 100%|██████████| 1142/1142 [00:30<00:00, 37.88it/s]
