<br/>
<a target="_blank" href="https://colab.research.google.com/github/tabt/geo_notebooks/blob/main/geocoders.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a><br/>

## This notebook is an attempt to find a good open geocoder for Russian addresses

I checked three open geocoders (or close equivalents) and compared their batch geocoding quality. Some of them can be deployed in Docker, but I sent requests to their public web endpoints instead: it is faster, and it lets you decide whether a service is worth deploying at all.

An important part of the task is the ability to process many addresses. Nominatim, the default choice of geocoder, can work well, but it is not designed for frequent requests.
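
To illustrate why request frequency matters for batch geocoding, here is a minimal sketch of a client-side throttle that spaces out calls to a public endpoint (the public Nominatim instance, for example, asks for at most one request per second). The names `MinIntervalThrottle` and `geocode_batch` are hypothetical helpers introduced for this example; they are not part of the notebook or of any library.

```python
import time


class MinIntervalThrottle:
    """Blocks so that consecutive calls are at least `min_interval` seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable for testing
        self._sleep = sleep      # injectable for testing
        self._last_call = None

    def wait(self):
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


def geocode_batch(addresses, geocode, throttle):
    """Calls `geocode` for every address, pausing between calls as needed."""
    results = []
    for address in addresses:
        throttle.wait()
        results.append(geocode(address))
    return results
```

With a one-second interval, 1000 addresses take at least about 17 minutes per geocoder, which is why a self-hosted instance becomes attractive for larger batches.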

I took a dataset of Moscow addresses with coordinates as ground truth. After processing a sample of 1000 items, the geocoded points were compared with their original coordinates.

**Caveats**
- this is an experiment whose purpose is to find something suitable for geocoding addresses in bulk, so it contains many inaccuracies
- addresses are not preprocessed
- addresses in the Moscow dataset are quite complicated; preprocessing could improve the results. Still, Yandex geocodes them easily
- mingkh contains only multi-apartment houses and works badly with private ones


**Results**
- Photon shows the worst results (in both time and quality)
- Pelias rarely outputs nothing, but often returns wrong coordinates
- Mingkh is the best in practice because it is designed for the Russian language, but it still has its caveats and misses private houses

All in all, none of the tested geocoders is good enough. If I find something better, I'll add it here.

---
*I'm also starting to work with NLP and might want to write some smart address preprocessing...*



# imports

In [None]:
!pip -q install mapclassify fuzzywuzzy transliterate


In [None]:
import pandas as pd
import geopandas as gpd
from shapely import Point
import numpy as np
import json
import requests
from io import StringIO
import glob
from tqdm import tqdm
import urllib.parse
from fuzzywuzzy import process, fuzz
from geopy.geocoders import Nominatim
from bs4 import BeautifulSoup
import time
from random import randint
from transliterate import translit



# utils

Sometimes a geocoder finds the right address, but not as the top-1 result. This function finds the returned address that is closest to the query address, which in some cases avoids picking the wrong coordinates.

In [None]:
def find_closest_string_fuzzy(test_string, string_list):
    """
    Returns the index of the string in string_list that is most similar
    to test_string, as scored by fuzz.ratio (a Levenshtein-based similarity).
    """
    ratios = [fuzz.ratio(test_string, string) for string in string_list]
    closest_string_index = np.argmax(np.array(ratios))

    return closest_string_index
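
A small illustration of the idea, using only the standard library: `difflib.SequenceMatcher` stands in for `fuzz.ratio` here, so the scores are an approximation of what the notebook's function computes, not the same numbers. The candidate addresses are made up for the example.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity score in [0, 100]; a stand-in for fuzz.ratio."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def closest_index(query, candidates):
    """Index of the candidate most similar to the query (first one on ties)."""
    scores = [similarity(query, candidate) for candidate in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


query = "Москва, Олонецкий проезд, дом 8"
candidates = [  # hypothetical geocoder output, top-1 is not the right house
    "Москва, Олонецкая улица, 4",
    "Москва, Олонецкий проезд, 8",
    "Санкт-Петербург, Олонецкая улица, 8",
]

best = closest_index(query, candidates)
print(best, candidates[best])
```

The second candidate wins even though it is not ranked first by the (hypothetical) geocoder, which is the situation the helper is meant to handle.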

# photon

Source: https://github.com/komoot/photon

In [None]:
def photon_geocode(query):
    safe_string = urllib.parse.quote_plus(query)
    photon_url = f'https://photon.komoot.io/api/?q={safe_string}&limit=5'

    response = requests.get(photon_url, headers={'User-Agent': 'test'})

    if response.status_code == 200:
        response = json.loads(response.text)
    else:
        print(f'API request error: {response.text}')
        return gpd.GeoDataFrame({"query": [query]})

    if not response["features"]:
        return gpd.GeoDataFrame({"query": [query]})

    geocoded_result = gpd.GeoDataFrame.from_features(response["features"], crs='EPSG:4326')

    string_list = geocoded_result.apply(
        lambda row: ', '.join([str(row.get('postcode', '')), str(row.get('county', '')), str(row.get('city', '')), str(row.get('name', '')), str(row.get('district', '')), str(row.get('street', '')), str(row.get('housenumber', ''))]).replace('nan', ''),
        axis=1
    ).tolist()

    closest_address_index = find_closest_string_fuzzy(query, string_list)

    if closest_address_index is not None:
        geocoded_result['query'] = query
        return geocoded_result.iloc[closest_address_index : closest_address_index + 1]
    else:
        return gpd.GeoDataFrame({"query": [query]})

# dom.mingkh

Source: http://dom.mingkh.ru

In [None]:
def get_house_data_urls(query):
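    safe_string = urllib.parse.quote_plus(query)  # escape spaces, '&', '#' and similar characters
    url = f"https://dom.mingkh.ru/search?address={safe_string}&searchtype=house"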

    headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/143.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'ru-RU,ru;q=0.8,en-US;q=0.5,en;q=0.3',
    'Accept-Encoding': 'gzip, deflate, br, zstd',
    'Connection': 'keep-alive',
    'Referer': 'https://dom.mingkh.ru/search?address=1234&searchtype=house',
    'Upgrade-Insecure-Requests': '1',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-User': '?1',
    'Priority': 'u=0, i'
    }

    response = requests.request("GET", url, headers=headers)
    addresses_table = response.text

    soup = BeautifulSoup(addresses_table)
    table = pd.read_html(StringIO(str(addresses_table)))[0]

    closest_match_index = find_closest_string_fuzzy(query, (table['Город'] + ' ' + table['Адрес']).tolist())

    house_api_url = f'https://dom.mingkh.ru/api/map/house/{soup.select("tr a")[closest_match_index]["href"].split("/")[-1]}'
    house_page_url = f'https://dom.mingkh.ru{soup.select("tr a")[closest_match_index]["href"]}'

    return house_api_url, house_page_url

def get_house_coords(house_api_url, house_page_url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/143.0.0.0 Safari/537.36',
        'Accept': '*/*',
        'Accept-Language': 'ru-RU,ru;q=0.8,en-US;q=0.5,en;q=0.3',
        'Accept-Encoding': 'gzip, deflate, br, zstd',
        'X-Requested-With': 'XMLHttpRequest',
        'Connection': 'keep-alive',
        'Referer': house_page_url,
        'Sec-Fetch-Dest': 'empty',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Site': 'same-origin',
        'TE': 'trailers',
    }

    response = requests.request("GET", house_api_url, headers=headers)
    response = json.loads(response.text)

    if not response['features']:
        return gpd.GeoDataFrame()

    response['features'][0]['geometry']['coordinates'] = response['features'][0]['geometry']['coordinates'][::-1]
    geocoded_result = gpd.GeoDataFrame.from_features(response, crs='EPSG:4326')

    return geocoded_result


def get_house_additional_data(house_page_url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/143.0.0.0 Safari/537.36',
        'Accept': '*/*',
        'Accept-Language': 'ru-RU,ru;q=0.8,en-US;q=0.5,en;q=0.3',
        'Accept-Encoding': 'gzip, deflate, br, zstd',
        'X-Requested-With': 'XMLHttpRequest',
        'Connection': 'keep-alive',
        'Referer': house_page_url,
        'Sec-Fetch-Dest': 'empty',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Site': 'same-origin',
        'TE': 'trailers',
    }

    response = requests.request("GET", house_page_url, headers=headers)

    soup = BeautifulSoup(response.text)

    properties = {
        row.select("td")[0].text.replace('\xa0', ''): row.select("td")[-1].text.replace('\xa0', '') for row in soup.select("tr") if len(row.select("td")) > 1 and row.select("td")[0].text[0].isupper()
    }

    return pd.DataFrame([properties])

def mingkh_geocode(query, get_additional_data=True):
    house_api_url, house_page_url = get_house_data_urls(query)

    geocoded_result = get_house_coords(house_api_url, house_page_url)
    if geocoded_result.empty:
        return gpd.GeoDataFrame({"query": [query]})

    if get_additional_data:
        additional_data = get_house_additional_data(house_page_url)
        geocoded_result = pd.concat([geocoded_result, additional_data], axis=1)

    geocoded_result = geocoded_result[[col for col in geocoded_result.columns if 'ремонт' not in col.lower()]]
    geocoded_result['url'] = house_page_url
    geocoded_result['query'] = query

    return geocoded_result

# pelias

Source: https://github.com/pelias/pelias

In [None]:
def pelias_geocode(query):
    api_key = "ge-3bad5ca64898357d"  # technical key from geocode.earth (to update it, find this request in F12 -> Network)
    safe_string = urllib.parse.quote_plus(query)
    response = requests.get(f'https://api.geocode.earth/v1/autocomplete?api_key={api_key}&text={safe_string}&size=5')

    if response.status_code == 200:
        response = json.loads(response.text)
    else:
        print(f'API request error: {response.text}')
        return gpd.GeoDataFrame({"query": [query]})

    if response["features"]:
        geocoded_result = gpd.GeoDataFrame.from_features(response["features"], crs='EPSG:4326')
    else:
        return gpd.GeoDataFrame({"query": [query]})

    closest_address_index = find_closest_string_fuzzy(query, [
        str(row.get("country", "")) + " " + translit(str(row.get("region", "")), "ru") + " " + str(row.get("name", "")) for _, row in geocoded_result.iterrows()
    ])

    time.sleep(0.01)  # a short pause to prevent exceeding the API limit

    if closest_address_index is not None:
        geocoded_result['query'] = query
        return geocoded_result.iloc[closest_address_index : closest_address_index + 1]
    else:
        return gpd.GeoDataFrame({"query": [query]})

# quality test

Dataset source: https://data.mos.ru/opendata/60562?pageSize=10&pageIndex=0&version=3&release=1581

In [None]:
!wget -O data.zip https://data.mos.ru/odata/export/catalog?idFile=366094
!unzip "data.zip"

Archive:  data.zip
  inflating: data-60562-2026-01-25.csv  


In [None]:
csv_filename = glob.glob("*.csv")[0]
moscow_houses = pd.read_csv(csv_filename, sep=';').drop(index=0, axis=0)
moscow_houses = moscow_houses[moscow_houses['geodata_center'].notna()]



In [None]:
moscow_houses_sample = moscow_houses.sample(1000)

In [None]:
moscow_houses_sample = moscow_houses_sample[['SIMPLE_ADDRESS', 'geodata_center']]
moscow_houses_sample['coords_str'] = moscow_houses_sample['geodata_center'].str.extract(r'=\[([^\]]+)\]')
moscow_houses_sample['geometry'] = moscow_houses_sample['coords_str'].apply(lambda x: Point(json.loads("[" + x + "]")))
moscow_houses_sample = moscow_houses_sample[['SIMPLE_ADDRESS', 'geometry']]
moscow_houses_sample['SIMPLE_ADDRESS'] = 'Москва, ' + moscow_houses_sample['SIMPLE_ADDRESS']
moscow_houses_sample = gpd.GeoDataFrame(moscow_houses_sample, geometry='geometry', crs='EPSG:4326').reset_index(drop=True)
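
The coordinate extraction in the cell above can be tried on a single value. This is a sketch under an assumption: the sample string below only imitates the shape that the regex `=\[([^\]]+)\]` expects, and the real `geodata_center` values may contain other fields. `parse_center` is a hypothetical helper, not part of the notebook.

```python
import json
import re

COORDS_PATTERN = re.compile(r"=\[([^\]]+)\]")  # same pattern as in the cell above


def parse_center(raw):
    """Returns (lon, lat) from a geodata_center-like string, or None if absent."""
    if not isinstance(raw, str):  # e.g. NaN for missing values
        return None
    match = COORDS_PATTERN.search(raw)
    if match is None:
        return None
    lon, lat = json.loads("[" + match.group(1) + "]")
    return lon, lat


sample = "{coordinates=[37.66124, 55.87387], type=Point}"  # assumed format
print(parse_center(sample))
```

The order is longitude first, then latitude, which matches the `POINT (37.66124 55.87387)` values shown in the sample table.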

In [None]:
moscow_houses_sample

Unnamed: 0,SIMPLE_ADDRESS,geometry
0,"Москва, муниципальный округ Внуково, посёлок Д...",POINT (37.33163 55.65943)
1,"Москва, Олонецкий проезд, дом 8, строение 2",POINT (37.66124 55.87387)
2,"Москва, муниципальный округ Краснопахорский, д...",POINT (37.3077 55.42442)
3,"Москва, муниципальный округ Щербинка, квартал ...",POINT (37.51125 55.45246)
4,"Москва, муниципальный округ Краснопахорский, к...",POINT (37.40173 55.34676)
...,...,...
995,"Москва, муниципальный округ Внуково, посёлок С...",POINT (37.11453 55.59823)
996,"Москва, муниципальный округ Коммунарка, деревн...",POINT (37.3966 55.61612)
997,"Москва, улица Академика Королёва, владение 2А",POINT (37.63038 55.82186)
998,"Москва, Поморская улица, земельный участок 5",POINT (37.57718 55.86907)


In [None]:
geocoded_addresses_photon = []
geocoded_addresses_mingkh = []
geocoded_addresses_pelias = []

for _, row in tqdm(moscow_houses_sample.iterrows(), total=moscow_houses_sample.shape[0]):
    geocoded_address_photon = photon_geocode(row["SIMPLE_ADDRESS"])
    geocoded_address_mingkh = mingkh_geocode(row["SIMPLE_ADDRESS"])
    geocoded_address_pelias = pelias_geocode(row["SIMPLE_ADDRESS"])

    geocoded_addresses_photon.append(geocoded_address_photon)
    geocoded_addresses_mingkh.append(geocoded_address_mingkh)
    geocoded_addresses_pelias.append(geocoded_address_pelias)

geocoded_addresses_photon = pd.concat(geocoded_addresses_photon)
geocoded_addresses_mingkh = pd.concat(geocoded_addresses_mingkh)
geocoded_addresses_pelias = pd.concat(geocoded_addresses_pelias)

geocoded_addresses_photon = geocoded_addresses_photon.set_geometry('geometry').reset_index(drop=True)
geocoded_addresses_mingkh = geocoded_addresses_mingkh.set_geometry('geometry').reset_index(drop=True)
geocoded_addresses_pelias = geocoded_addresses_pelias.set_geometry('geometry').reset_index(drop=True)

100%|██████████| 1000/1000 [51:14<00:00,  3.07s/it]


In [None]:
print("Photon:", geocoded_addresses_photon['geometry'].isna().sum() / 10, "% missed")
print("Mingkh:", geocoded_addresses_mingkh['geometry'].isna().sum() / 10, "% missed")
print("Pelias:", geocoded_addresses_pelias['geometry'].isna().sum() / 10, "% missed")

Photon: 58.4 % missed
Mingkh: 3.6 % missed
Pelias: 15.9 % missed


In [None]:
threshold = 80  # metres (EPSG:6933 coordinates are in metres)

distances_photon = moscow_houses_sample['geometry'].to_crs('EPSG:6933').distance(geocoded_addresses_photon['geometry'].to_crs('EPSG:6933')).dropna()
distances_mingkh = moscow_houses_sample['geometry'].to_crs('EPSG:6933').distance(geocoded_addresses_mingkh['geometry'].to_crs('EPSG:6933')).dropna()
distances_pelias = moscow_houses_sample['geometry'].to_crs('EPSG:6933').distance(geocoded_addresses_pelias['geometry'].to_crs('EPSG:6933')).dropna()

print("Photon:", round(len(distances_photon[distances_photon > threshold]) / len(distances_photon) * 100, 2), "% of non-empty geometries are geocoded close enough")
print("Mingkh:", round(len(distances_mingkh[distances_mingkh > threshold]) / len(distances_mingkh) * 100, 2), "% of non-empty geometries are geocoded close enough")
print("Pelias:", round(len(distances_pelias[distances_pelias > threshold]) / len(distances_pelias) * 100, 2), "% of non-empty geometries are geocoded close enough")

Photon: 31.97 % of non-empty geometries are geocoded close enough
Mingkh: 85.17 % of non-empty geometries are geocoded close enough
Pelias: 92.15 % of non-empty geometries are geocoded close enough


**Comparison results for a sample of 1000 addresses**

Parameter|Photon|MinGKH|Pelias
---|---|---|---
Time for geocoding|16m 20s|30m 16s|2m 5s
Missed geometries|58.4 %|3.6 %|15.9 %
Points close to ground truth (for non-empty geometries)|31.97 %|85.17 %|92.15 %
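
As a cross-check of the distance comparison, here is a minimal sketch that uses the great-circle (haversine) distance instead of a projected CRS. This is a different technique from the notebook's EPSG:6933 approach: an equal-area projection preserves areas rather than distances, so an independent measure can be useful. `haversine_m` and `share_within` are hypothetical helpers; the two coordinate pairs are taken from the usage example in this notebook (the Saint Petersburg and Adygea addresses with their Photon results).

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius in metres


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two lon/lat points (degrees)."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def share_within(pairs, threshold_m):
    """Percentage of (truth, geocoded) pairs no farther apart than threshold_m."""
    distances = [haversine_m(*truth, *found) for truth, found in pairs]
    return 100 * sum(d <= threshold_m for d in distances) / len(distances)


pairs = [
    ((30.298243, 59.959967), (30.29819, 59.95998)),  # Saint Petersburg: a few metres off
    ((38.921944, 45.002301), (38.93208, 45.02888)),  # Adygea: wrong street, kilometres off
]

for truth, found in pairs:
    print(round(haversine_m(*truth, *found)), "m")
print(share_within(pairs, threshold_m=80), "% within 80 m")
```

Here "close enough" means a distance less than or equal to the threshold; only pairs with a non-empty geocoded point should be passed in.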

In [None]:
# preview points
m = geocoded_addresses_mingkh[['address', 'query', 'geometry']].explore(color='red', legend=True)
moscow_houses_sample.explore(color='blue', m=m, legend=True)
m

# usage example

In [None]:
test_addresses = gpd.GeoDataFrame([
    ["Стрельнинская улица, 12, Санкт-Петербург, 197198", Point(30.298243, 59.959967)],
    ["300045, г. Тула, Новомосковская улица, 1", Point(37.636715, 54.173474)],
    ["Черниговская ул., 12Б, Нижний Новгород", Point(43.973836, 56.321331)],
    ["улица 50 лет ВЛКСМ, 6А, Сургут, Ханты-Мансийский автономный округ — Югра", Point(73.404827, 61.254130)],
    ["улица Тургеневское шоссе, 33/3к22, аул Новая Адыгея, Старобжегокайское сельское поселение, Тахтамукайский район, Республика Адыгея", Point(38.921944, 45.002301)]

], columns=['address', 'geometry'], geometry='geometry', crs='EPSG:4326')

In [None]:
geocoded_addresses_photon = []
geocoded_addresses_mingkh = []
geocoded_addresses_pelias = []

for _, row in tqdm(test_addresses.iterrows(), total=test_addresses.shape[0]):
    geocoded_address_photon = photon_geocode(row["address"])
    geocoded_address_mingkh = mingkh_geocode(row["address"])
    geocoded_address_pelias = pelias_geocode(row["address"])

    geocoded_addresses_photon.append(geocoded_address_photon)
    geocoded_addresses_mingkh.append(geocoded_address_mingkh)
    geocoded_addresses_pelias.append(geocoded_address_pelias)
    time.sleep(0.3)

geocoded_addresses_photon = pd.concat(geocoded_addresses_photon)
geocoded_addresses_mingkh = pd.concat(geocoded_addresses_mingkh)
geocoded_addresses_pelias = pd.concat(geocoded_addresses_pelias)

100%|██████████| 5/5 [00:24<00:00,  4.91s/it]


In [None]:
geocoded_addresses_photon

Unnamed: 0,geometry,osm_type,osm_id,osm_key,osm_value,type,postcode,housenumber,countrycode,country,city,district,street,state,extent,county,name
0,POINT (30.29819 59.95998),W,688869313,building,apartments,house,197198,12,RU,Россия,Санкт-Петербург,Петровский округ,Стрельнинская улица,Санкт-Петербург,"[30.297809, 59.9601394, 30.2985724, 59.9598079]",,
0,POINT (37.63682 54.17345),R,7672107,building,apartments,house,300045,1,RU,Россия,Тула,Красный Перекоп,Новомосковская улица,Тульская область,"[37.6363753, 54.1737111, 37.6372678, 54.1731396]",городской округ Тула,
0,POINT (43.97386 56.32134),W,147551285,historic,building,house,603950,12Б,RU,Россия,Нижний Новгород,,Черниговская улица,Нижегородская область,"[43.9736746, 56.3214582, 43.9740483, 56.3212118]",городской округ Нижний Новгород,
0,POINT (73.40513 61.25418),W,1420030355,building,apartments,house,628403,6А,RU,Россия,Сургут,,улица 50 лет ВЛКСМ,Ханты-Мансийский автономный округ — Югра,"[73.403548, 61.2545539, 73.4064974, 61.2537987]",,
3,POINT (38.93208 45.02888),W,740656258,building,yes,house,385132,33,RU,Россия,Новая Адыгея,,улица Дружбы,Адыгея,"[38.9319876, 45.0289481, 38.9321696, 45.0288195]",Тахтамукайский район,
