# Semantic Map of the Czech Republic: how "close" are our cities semantically?

This project visualizes the distance between Czech cities in a semantic rather than a geographical sense. I scraped the Wikipedia article of each Czech city (there are 610 of them according to Wikipedia), created vector embeddings of those descriptions, and visualized them in 2D space. Such an analysis can uncover interesting patterns and similarities between cities that are nowhere near each other on the map!

To use this notebook, run each cell step by step.

- created by Yevhenii Karpizenkov


In [1]:
# -- data analysis --
import pandas as pd

# -- wiki api calls -- 
import requests
import wikipediaapi
from tqdm import tqdm
tqdm.pandas()

# -- other helper libs --
import time
import random
import math
import os
import json
import re


# I. Data Extraction

**[IMPORTANT]** **If you have the dataset stored in the files (cities_dataset.csv), you can skip the Data Extraction section and go directly to II. Building the Semantic Map**


In this section we are going to:

1. Scrape the name, population and region of every city in the Czech Republic (610 of them are on Wikipedia).
2. Use the Wikipedia API to extract a description of every city, fetching several sections of its article.*
3. Save the extracted descriptions to a JSON file (its file name is stored in the variable CACHE).

- If you already have the JSON file with the scraped descriptions, keep it in the same directory as this notebook. The fetching function will then find the file and will not scrape the descriptions all over again. If you decide to scrape them, be ready for some waiting time (it took around 10 minutes for me).


* Some sections (such as references) are excluded so that they do not pollute the data.

## Scraping all cities' names from the Wikipedia list

In [47]:
# this is the URL of the Wikipedia page where we can get all cities' names
url = "https://cs.wikipedia.org/wiki/Seznam_m%C4%9Bst_v_%C4%8Cesku_podle_po%C4%8Dtu_obyvatel"

# defining headers to pass the wikipedia check
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/122.0.0.0 Safari/537.36"
}

# downloading the page HTML with a plain HTTP GET request
html = requests.get(url, headers=headers).text

# extracting the tables and cities
# (the HTML string is wrapped in StringIO because passing literal HTML to read_html is deprecated)
from io import StringIO
tables = pd.read_html(StringIO(html))
df = tables[0]
cities = df['Město'].tolist()

# printing what we got
print(f"List of cities:\n{', '.join(cities[:10])}...")

List of cities:
Praha, Brno, Ostrava, Plzeň, Liberec, Olomouc, České Budějovice, Hradec Králové, Pardubice, Ústí nad Labem...



In [65]:
# initialize the pandas dataframe with the name, population and region (okres) columns
df_city_descriptions = pd.DataFrame()
df_city_descriptions[['name', 'population', 'region']] = df[['Město', 'Počet obyvatel', 'Okres']]
df_city_descriptions.tail()

Unnamed: 0,name,population,region
605,Janov,256,okres Bruntál
606,Boží Dar,255,okres Karlovy Vary
607,Rejštejn,240,okres Klatovy
608,Loučná pod Klínovcem,199,okres Chomutov
609,Přebuz,70,okres Sokolov


## Scraping the descriptions from Wikipedia pages of each city

In [3]:
# initialize wikipedia API

wiki = wikipediaapi.Wikipedia(
    language='cs',
    user_agent="CityMapper/1.0 (contact: yevheniikarpizenkov@gmail.com)"
)


In [4]:
# we need a function that limits the amount of text kept for each city; this text is later used to represent the city in the vector space.
# descriptions that are too short may not carry enough information about a city; descriptions that are too long can pollute the embeddings.
# the current limit is 3,000 characters

def truncate(text, max_chars=3000):
    """
    Cuts the given text, considering the max_chars parameter. 
    Searches for the last sentence ending before the limit in order to not cut words.
    """
    if not isinstance(text, str):
        return ""

    # if text is already short enough → return it
    if len(text) <= max_chars:
        return text.strip()

    # cut roughly at the max_chars boundary
    cut = text[:max_chars]

    # look for the last sentence-ending punctuation (.?!) before the limit
    match = re.search(r'([\.!?])[^\.!?]*$', cut)
    if match:
        end_pos = match.end(1)
        return cut[:end_pos].strip()

    # if no punctuation found, fallback to the last space
    if " " in cut:
        return cut.rsplit(" ", 1)[0].strip()

    # if no space, return the raw cut
    return cut.strip()
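To see which branch of `truncate` fires in practice, here is a small check. The function is repeated in condensed form so the snippet runs on its own, and the sample sentences and the tiny `max_chars` values are made up purely for illustration.

```python
import re

def truncate(text, max_chars=3000):
    # condensed copy of the notebook's truncate(), repeated so the example is self-contained
    if not isinstance(text, str):
        return ""
    if len(text) <= max_chars:
        return text.strip()
    cut = text[:max_chars]
    # last sentence-ending punctuation before the limit
    match = re.search(r'([\.!?])[^\.!?]*$', cut)
    if match:
        return cut[:match.end(1)].strip()
    # no punctuation: fall back to the last space
    if " " in cut:
        return cut.rsplit(" ", 1)[0].strip()
    return cut.strip()

# illustrative text, not taken from the dataset
sample = "Brno je město. Leží na Moravě. Protéká jím Svratka."

# the cut lands inside the third sentence, so the result ends at the second full stop
print(truncate(sample, max_chars=35))

# no punctuation inside the cut: the function falls back to the last space
print(truncate("aaa bbb ccc", max_chars=6))
```

With the real 3,000-character limit the same logic applies, which is why the Brno description further down is slightly shorter than the limit.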


In [5]:
# initialize the Wikipedia API client (this replaces the client created in the earlier cell)
wiki = wikipediaapi.Wikipedia(
    language="cs",
    user_agent="CitySemanticMap/2.0 (https://github.com/yourname)"
)

# define the json file with cached docs to store the results inside it 
CACHE = "city_docs.json"
cache = {}

if os.path.exists(CACHE):
    with open(CACHE, "r", encoding="utf-8") as f:
        cache = json.load(f)

# section titles to skip in every Wikipedia article
UNWANTED_SECTIONS = {
    "Reference", "Odkazy", "Literatura",
    "Externí odkazy", "Poznámky", "Související články"
}

def clean_text(text: str) -> str:
    """Normalizes Wikipedia text: removes citation markers like [1], [2] and collapses whitespace.
    TODO: add more cleaning (image captions etc.)."""
    if not isinstance(text, str):
        return ""
    text = re.sub(r"\[\d+\]", "", text) # remove [1], [12], etc.
    text = re.sub(r"\s+", " ", text) # normalize spaces
    return text.strip()


def extract_all_sections(page):
    """
    Extracts all sections of the page filtering out the unwanted sections.
    """
    collected = []
    # include lead summary
    if page.summary:
        collected.append(clean_text(page.summary))

    def recurse(sections):
        for s in sections:
            if s.title.strip() in UNWANTED_SECTIONS:
                continue
            if s.text:
                collected.append(clean_text(s.text))
            recurse(s.sections)

    recurse(page.sections)
    return "\n".join(collected)


def fetch_city_text(city_name, max_retries: int = 5):
    """
    Main function to fetch the description of a city.
    Retries several times if the API call raises an error.
    """
    # cached?
    if city_name in cache:
        return cache[city_name]

    # try the wiki.page request several times
    for attempt in range(max_retries):
        try:
            # wait in between the attempts
            time.sleep(0.4 + random.random()*0.4)

            page = wiki.page(city_name)
            if page.exists():
                full_text = truncate(extract_all_sections(page))
                if len(full_text) > 80:
                    cache[city_name] = full_text
                    with open(CACHE, "w", encoding="utf-8") as f:
                        json.dump(cache, f, ensure_ascii=False, indent=2)
                    return full_text

            # the page is missing or its text is too short: cache the miss and stop retrying
            cache[city_name] = None
            with open(CACHE, "w", encoding="utf-8") as f:
                json.dump(cache, f, ensure_ascii=False, indent=2)
            return None

        except Exception as e:
            print(f"attempt {attempt+1} failed for {city_name}: {str(e)}")
            time.sleep(0.6 + random.random()*0.5)

    return None
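The cache-then-retry pattern used by `fetch_city_text` can be tried offline. The sketch below is only an illustration under stated assumptions: `cached_fetch`, `flaky_fetch`, the demo file name and the returned text are all hypothetical and stand in for the real Wikipedia call.

```python
import json
import os
import tempfile

def cached_fetch(key, fetch_fn, cache, cache_path, max_retries=3):
    """Return cache[key] if present; otherwise call fetch_fn with retries and persist the result."""
    if key in cache:
        return cache[key]

    result = None
    for attempt in range(max_retries):
        try:
            result = fetch_fn(key)
            break  # success: stop retrying
        except Exception as e:
            print(f"attempt {attempt + 1} failed for {key}: {e}")

    # store the outcome (None marks "nothing usable found") and write the cache to disk
    cache[key] = result
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(cache, f, ensure_ascii=False, indent=2)
    return result


# hypothetical fetcher that fails on the first call and succeeds afterwards
calls = {"n": 0}

def flaky_fetch(name):
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("simulated network error")
    return f"popis města {name}"

demo_cache = {}
demo_path = os.path.join(tempfile.mkdtemp(), "demo_docs.json")

first = cached_fetch("Brno", flaky_fetch, demo_cache, demo_path)   # one failure, then success
second = cached_fetch("Brno", flaky_fetch, demo_cache, demo_path)  # served from the cache
print(first, "| fetcher calls:", calls["n"])
```

Because the second call is answered from the cache, rerunning the description cell is nearly instant, which matches the very high iterations-per-second figure in the progress bar below.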


In [66]:
# get description for each city
df_city_descriptions['description'] = df_city_descriptions['name'].progress_apply(fetch_city_text) 
df_city_descriptions.head()

100%|██████████| 610/610 [00:00<00:00, 607870.14it/s]


Unnamed: 0,name,population,region,description
0,Praha,1 397 880,Praha,Praha je hlavní město a současně největší měst...
1,Brno,402 739,okres Brno-město,"Brno (německy Brünn) je statutární město, počt..."
2,Ostrava,283 187,okres Ostrava-město,"Ostrava (polsky Ostrawa, německy Ostrau) je st..."
3,Plzeň,187 928,okres Plzeň-město,Plzeň (v němčině a dalších jazycích Pilsen) je...
4,Liberec,108 090,okres Liberec,Liberec (německy Reichenberg) je statutární mě...


In [10]:
# print an example of a city description
desc = df_city_descriptions.loc[1, 'description']
print(f"Length of an example description: {len(desc)} characters.")
desc

Length of an example description: 2985 characters.


'Brno (německy Brünn) je statutární město, počtem obyvatel i rozlohou druhé největší město v České republice, největší město na Moravě a bývalé hlavní město Moravy. Je sídlem Jihomoravského kraje, v jehož centrální části tvoří samostatný okres Brno-město. Město o rozloze 230,18 km² má přibližně 403 tisíc obyvatel a v jeho metropolitní oblasti žije asi 700 tisíc obyvatel. Brnem protékají řeky Svratka a Svitava, které se v jižní části města slévají. Město se stalo centrem soudní moci České republiky, neboť je sídlem Ústavního soudu, Nejvyššího soudu, Nejvyššího správního soudu i Nejvyššího státního zastupitelství. Také je významným administrativním střediskem, protože zde sídlí státní orgány s celostátní kontrolní působností a další důležité instituce, jako veřejný ochránce práv, Úřad pro ochranu hospodářské soutěže nebo Státní zemědělská a potravinářská inspekce. V Brně je zákonem zřízeno studio České televize a Českého rozhlasu. Od roku 1777 je Brno také sídlem římskokatolické brněnské

In [13]:
# uncovering NaNs
print("NaNs per column: ")
print(df_city_descriptions.isna().sum())

NaNs per column: 
name           0
population     0
region         0
description    0
dtype: int64


## Handling cases where the city name is ambiguous or Wikipedia didn't return any description
A manual check showed that some city names (e.g. Lom, Solnice) also refer to other things on Wikipedia. 
Some other cities share their name with other geographical objects (e.g. there are villages with the same name).
We need to identify such cases and create valid descriptions for them. To do this, we add the region (okres) to the Wikipedia query.

In [67]:
# patterns that match the opening text of a disambiguation page (i.e. the result is not a real city description)
DISAMBIG_REGEX = re.compile(
    r"^("
    r"název\s+\S+?\s+(má|nese)\s+více"              
    r"|název\s+\S+?\s+mají?\s+t(yto|ato)\s+\S+"      
    r"|\S+?\s+může\s+(být|označovat|znamenat)"       
    r")",
    re.IGNORECASE
)

AMBIGUOUS_CITIES = ["Solnice", "Lom", "Úterý"]

def is_disambiguation_text(text):
    """
    Detects disambiguation texts (the ones that are not real descriptions)
    """
    if not isinstance(text, str):
        return False

    # check only the beginning of the article
    start = text.strip()[:200]

    return bool(DISAMBIG_REGEX.search(start))


def fetch_city_by_okres(city, okres):
    """
    Fetches the summary of a city from the page whose title includes the region (okres), e.g. Lom_(okres_Most)
    """
    city = city.replace(" ", "_")
    okres = okres.replace(" ", "_")
        
    
    title = f"{city}_({okres})"
    page = wiki.page(title)
    if page.exists():
        print(f"extracted valid page for {title}")
        return page.summary.strip()
    else:
        print(f"page with name {title} doesn't exist")
    return None


def fix_ambiguous_descriptions(df):
    """
    Detects ambiguous cities in the df and makes another API call to fetch it using region
    """
    fixed_count = 0
    
    for idx, row in tqdm(df.iterrows(), total=len(df)):
        desc = row["description"]

        is_name_ambiguous = row["name"] in AMBIGUOUS_CITIES
        
        # skip rows whose description is not a disambiguation text (missing descriptions are skipped too), unless the name is listed as ambiguous
        if not is_disambiguation_text(desc) and not is_name_ambiguous:
            continue
        
        city = row["name"]
        okres = row["region"]
        
        print(f"[Ambiguous] {city} -> retrying with okres = {okres}")
        
        # fetch correct page
        corrected = fetch_city_by_okres(city, okres)
        
        # replace the description if a valid page was found
        if corrected:
            df.at[idx, "description"] = corrected
            fixed_count += 1
        else:
            print(f"[fail] could not resolve {city} with okres {okres}")
    print(f"fixed {fixed_count} city names")
    return df
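The two building blocks above (detecting a disambiguation text and building the region-qualified title) can be checked in isolation. In this sketch the regex is copied from the cell, while `build_okres_title` is a hypothetical helper that only mirrors the title format used in `fetch_city_by_okres`, and the sample openings are invented examples rather than scraped text.

```python
import re

# same pattern as DISAMBIG_REGEX above, repeated so the snippet runs on its own
DISAMBIG_REGEX = re.compile(
    r"^("
    r"název\s+\S+?\s+(má|nese)\s+více"
    r"|název\s+\S+?\s+mají?\s+t(yto|ato)\s+\S+"
    r"|\S+?\s+může\s+(být|označovat|znamenat)"
    r")",
    re.IGNORECASE,
)

def looks_like_disambiguation(text):
    # only the first 200 characters are inspected, as in the notebook
    return bool(DISAMBIG_REGEX.search(text.strip()[:200]))

def build_okres_title(city, okres):
    # hypothetical helper: spaces become underscores and the okres goes in parentheses
    return f"{city.replace(' ', '_')}_({okres.replace(' ', '_')})"

# invented openings for illustration
print(looks_like_disambiguation("Lom může být:"))                          # disambiguation-style opening
print(looks_like_disambiguation("Název Jesenice má více sídel:"))          # disambiguation-style opening
print(looks_like_disambiguation("Brno (německy Brünn) je statutární město."))  # regular article opening

print(build_okres_title("Cvikov", "okres Česká Lípa"))
```

The pattern is anchored at the start of the text, so a regular article that merely contains the word "může" later on is not flagged.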


In [68]:
# fix the city descriptions
df_city_descriptions = fix_ambiguous_descriptions(df_city_descriptions)

  0%|          | 0/610 [00:00<?, ?it/s]

[Ambiguous] Bílina -> retrying with okres = okres Teplice
extracted valid page for Bílina_(okres_Teplice)


 15%|█▍        | 91/610 [00:00<00:02, 208.72it/s]

[Ambiguous] Jesenice -> retrying with okres = okres Praha-západ
extracted valid page for Jesenice_(okres_Praha-západ)


 21%|██        | 127/610 [00:00<00:03, 133.60it/s]

[Ambiguous] Petřvald -> retrying with okres = okres Karviná
extracted valid page for Petřvald_(okres_Karviná)


 30%|██▉       | 181/610 [00:01<00:02, 144.39it/s]

[Ambiguous] Rudná -> retrying with okres = okres Praha-západ
extracted valid page for Rudná_(okres_Praha-západ)


 40%|████      | 244/610 [00:01<00:02, 142.64it/s]

[Ambiguous] Bystřice -> retrying with okres = okres Benešov
extracted valid page for Bystřice_(okres_Benešov)


 48%|████▊     | 291/610 [00:02<00:02, 131.01it/s]

[Ambiguous] Cvikov -> retrying with okres = okres Česká Lípa
extracted valid page for Cvikov_(okres_Česká_Lípa)
[Ambiguous] Osek -> retrying with okres = okres Teplice
extracted valid page for Osek_(okres_Teplice)


 50%|█████     | 305/610 [00:02<00:03, 76.54it/s] 

[Ambiguous] Lom -> retrying with okres = okres Most
extracted valid page for Lom_(okres_Most)


 56%|█████▋    | 344/610 [00:03<00:03, 86.66it/s]

[Ambiguous] Štěpánov -> retrying with okres = okres Olomouc
extracted valid page for Štěpánov_(okres_Olomouc)


 60%|█████▉    | 364/610 [00:03<00:03, 78.34it/s]

[Ambiguous] Březová -> retrying with okres = okres Sokolov
extracted valid page for Březová_(okres_Sokolov)


 72%|███████▏  | 439/610 [00:03<00:01, 115.59it/s]

[Ambiguous] Solnice -> retrying with okres = okres Rychnov nad Kněžnou
extracted valid page for Solnice_(okres_Rychnov_nad_Kněžnou)


 76%|███████▌  | 462/610 [00:04<00:01, 102.16it/s]

[Ambiguous] Ralsko -> retrying with okres = okres Česká Lípa
extracted valid page for Ralsko_(okres_Česká_Lípa)


 78%|███████▊  | 475/610 [00:04<00:01, 83.77it/s] 

[Ambiguous] Hostomice -> retrying with okres = okres Beroun
extracted valid page for Hostomice_(okres_Beroun)


 82%|████████▏ | 501/610 [00:05<00:01, 71.78it/s]

[Ambiguous] Třebenice -> retrying with okres = okres Litoměřice
extracted valid page for Třebenice_(okres_Litoměřice)


 83%|████████▎ | 509/610 [00:05<00:01, 58.27it/s]

[Ambiguous] Všeruby -> retrying with okres = okres Plzeň-sever
extracted valid page for Všeruby_(okres_Plzeň-sever)


 85%|████████▌ | 520/610 [00:05<00:01, 51.10it/s]

[Ambiguous] Kladruby -> retrying with okres = okres Tachov
extracted valid page for Kladruby_(okres_Tachov)


 86%|████████▌ | 526/610 [00:06<00:02, 38.99it/s]

[Ambiguous] Jesenice -> retrying with okres = okres Rakovník
extracted valid page for Jesenice_(okres_Rakovník)


 87%|████████▋ | 531/610 [00:06<00:02, 30.10it/s]

[Ambiguous] Bystré -> retrying with okres = okres Svitavy
extracted valid page for Bystré_(okres_Svitavy)


 89%|████████▊ | 541/610 [00:06<00:02, 29.94it/s]

[Ambiguous] Husinec -> retrying with okres = okres Prachatice
extracted valid page for Husinec_(okres_Prachatice)


 90%|████████▉ | 548/610 [00:07<00:02, 27.65it/s]

[Ambiguous] Sedlice -> retrying with okres = okres Strakonice
extracted valid page for Sedlice_(okres_Strakonice)


 92%|█████████▏| 560/610 [00:07<00:01, 29.30it/s]

[Ambiguous] Deštná -> retrying with okres = okres Jindřichův Hradec
extracted valid page for Deštná_(okres_Jindřichův_Hradec)


 97%|█████████▋| 589/610 [00:08<00:00, 45.04it/s]

[Ambiguous] Úterý -> retrying with okres = okres Plzeň-sever
extracted valid page for Úterý_(okres_Plzeň-sever)


100%|██████████| 610/610 [00:08<00:00, 71.41it/s]

fixed 22 city names





In [160]:
# processing the population column to make the dtype int
df_city_descriptions["population"] = (
    df_city_descriptions["population"]
        .astype(str)
        .replace({"\xa0": ""}, regex=True)   
        .str.replace(r"\D", "", regex=True)  
        .astype(int)
)

print("summary statistics of the population column:", df_city_descriptions['population'].describe())
df_city_descriptions['population'].head()

summary statistics of the population column: count    6.100000e+02
mean     1.226920e+04
std      6.137550e+04
min      7.000000e+01
25%      2.464750e+03
50%      4.458000e+03
75%      8.619000e+03
max      1.397880e+06
Name: population, dtype: float64


0    1397880
1     402739
2     283187
3     187928
4     108090
Name: population, dtype: int64
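The cleaning step boils down to one regular expression, `\D`, which removes every non-digit character, including the non-breaking spaces (`\xa0`) that Wikipedia uses as thousands separators. A minimal standard-library sketch of the same idea follows; `parse_population` is a hypothetical helper name and the raw strings are examples in the format of the scraped table.

```python
import re

def parse_population(raw):
    # drop every non-digit character (regular spaces, non-breaking spaces, etc.) and convert to int
    return int(re.sub(r"\D", "", str(raw)))

# example raw values: non-breaking spaces, regular spaces, and a plain number
raw_values = ["1\xa0397\xa0880", "402 739", "70"]
print([parse_population(v) for v in raw_values])
```

Since `\D` already covers `\xa0`, the separate `.replace({"\xa0": ""}, regex=True)` step in the cell above is redundant but harmless.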

In [81]:
df_city_descriptions.head()

Unnamed: 0,name,population,region,description,keywords
0,Praha,1397880,Praha,Praha je hlavní město a současně největší měst...,"[praha, hlavní, evropské, měst, je]"
1,Brno,402739,okres Brno-město,"Brno (německy Brünn) je statutární město, počt...","[brno, nejvyššího, je, mezinárodní, soudu]"
2,Ostrava,283187,okres Ostrava-město,"Ostrava (polsky Ostrawa, německy Ostrau) je st...","[ostrava, ostravy, ostravice, slezské, největš..."
3,Plzeň,187928,okres Plzeň-město,Plzeň (v němčině a dalších jazycích Pilsen) je...,"[plzeň, na, soutoku řek, teorie, bartoloměje]"
4,Liberec,108090,okres Liberec,Liberec (německy Reichenberg) je statutární mě...,"[liberec, nad nisou, nisou, je, stráž]"


In [82]:
# save the dataset into a csv file
# df_city_descriptions = df_city_descriptions.drop(columns=df.filter(regex="^Unnamed.*").columns)
df_city_descriptions.to_csv("cities_dataset.csv", encoding="utf-8", index=False)

# II. Building the Semantic Map of the cities

If you already have the dataset (cities_dataset.csv), load it from disk with the cell below. If it is already stored in the variable df_city_descriptions, skip the cell below and proceed to the Keyword extraction section.

In [19]:
# if the cities dataset is not yet stored in a variable, load it from your files
df_city_descriptions = pd.read_csv("cities_dataset.csv", encoding="utf-8")
df_city_descriptions.head()

Unnamed: 0,name,population,region,description
0,Praha,1 397 880,Praha,Praha je hlavní město a současně největší měst...
1,Brno,402 739,okres Brno-město,"Brno (německy Brünn) je statutární město, počt..."
2,Ostrava,283 187,okres Ostrava-město,"Ostrava (polsky Ostrawa, německy Ostrau) je st..."
3,Plzeň,187 928,okres Plzeň-město,Plzeň (v němčině a dalších jazycích Pilsen) je...
4,Liberec,108 090,okres Liberec,Liberec (německy Reichenberg) je statutární mě...


## Keyword extraction from city descriptions

As an experiment, let us try to extract the keywords that best characterize each city, i.e. terms that occur more often in its description than in the descriptions of the other cities. For this we use a TF-IDF model.

In [79]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    max_features=5000,
    stop_words=None,
    ngram_range=(1,2),
)

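tfidf_matrix = tfidf.fit_transform(df_city_descriptions["description"])
feature_names = tfidf.get_feature_names_out()  # vocabulary, fetched once instead of once per keyword

def tfidf_keywords(index, top_k=5):
    """Returns the top_k terms with the highest TF-IDF score for the description at the given row position."""
    row = tfidf_matrix[index].toarray()[0]
    top_idx = row.argsort()[::-1][:top_k]
    return [feature_names[i] for i in top_idx]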

In [80]:
# add the keyword column to the dataframe
df_city_descriptions["keywords"] = df_city_descriptions.index.map(
    lambda idx: tfidf_keywords(idx, top_k=5)
)
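To make the idea behind the keyword scores concrete, here is a simplified TF-IDF computed by hand on three invented one-line descriptions. It is only a sketch of the principle: it uses the plain formula tf × log(N / df), whereas scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2-normalizes each row by default, so the actual scores in the notebook differ.

```python
import math
from collections import Counter

# invented mini-corpus, one "description" per city
docs = [
    "praha je hlavní město",
    "brno je soudní město",
    "ostrava je hornické město",
]
tokenized = [d.split() for d in docs]
n_docs = len(docs)

# document frequency: in how many documents each term appears
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def top_keywords(doc_index, top_k=2):
    tokens = tokenized[doc_index]
    tf = Counter(tokens)
    scores = {
        term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
        for term, count in tf.items()
    }
    # sort by score, highest first (ties keep their order of appearance)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

for i, doc in enumerate(docs):
    print(doc, "->", top_keywords(i))
```

Words that occur in every document ("je", "město") get a score of zero here. In the notebook they can still surface as keywords, because no stop words are removed and the smoothed IDF never drops to zero.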

In [29]:
# let's see what keywords TF-IDF extracted.
# interesting examples:
    # years appear often: 1295 for Pardubice, 1219 for Litoměřice etc.
    # "evropské" for Prague
    # "jezero" for Chomutov
df_city_descriptions[['name', 'keywords']].head(30)

Unnamed: 0,name,keywords
0,Praha,"[praha, hlavní, evropské, měst, je]"
1,Brno,"[brno, nejvyššího, je, mezinárodní, soudu]"
2,Ostrava,"[ostrava, ostravy, ostravice, slezské, největš..."
3,Plzeň,"[plzeň, na, soutoku řek, teorie, bartoloměje]"
4,Liberec,"[liberec, nad nisou, nisou, je, stráž]"
5,Olomouc,"[olomouc, festival, olomouci, vojenské, moravě]"
6,České Budějovice,"[budějovice, české budějovice, böhmisch, němči..."
7,Hradec Králové,"[králové, hradec, hradec králové, labe, univer..."
8,Pardubice,"[pardubice, 1295, pardubic, jsou, na]"
9,Ústí nad Labem,"[ústí, ústí nad, nad labem, labem, název]"


## Creating embeddings with the Sentence Transformer

In [84]:
from sentence_transformers import SentenceTransformer
import numpy as np

# initialize the model
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# create a list of texts
texts = df_city_descriptions["description"].tolist()

# generate embeddings (inputs longer than model.max_seq_length tokens are truncated by the model)
embeddings = model.encode(texts, show_progress_bar=True)


Batches: 100%|██████████| 20/20 [00:26<00:00,  1.31s/it]


In [85]:
# save the embeddings to a JSON file
emb_list = embeddings.tolist()  

data = {
    "cities": df_city_descriptions["name"].tolist(),
    "descriptions": df_city_descriptions["description"].tolist(),
    "embeddings": emb_list
}

with open("city_embeddings.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)
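Reading the file back is the mirror image of the cell above. The sketch below writes a tiny stand-in file with the same three keys and loads it again. The file location, the placeholder descriptions and the 3-dimensional vectors are made up for illustration; the real vectors are much longer.

```python
import json
import os
import tempfile

# tiny stand-in for city_embeddings.json with the same structure
demo = {
    "cities": ["Praha", "Brno"],
    "descriptions": ["popis Prahy", "popis Brna"],
    "embeddings": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
}
path = os.path.join(tempfile.mkdtemp(), "city_embeddings_demo.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(demo, f, ensure_ascii=False, indent=2)

# load the file and pair each city with its vector
with open(path, "r", encoding="utf-8") as f:
    loaded = json.load(f)

city_to_vector = dict(zip(loaded["cities"], loaded["embeddings"]))
print(city_to_vector["Brno"])
```

If the vectors are needed as an array again, passing `loaded["embeddings"]` to `np.array` restores the shape that `model.encode` returned.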


## Dimensionality reduction with UMAP

In [86]:
import umap
import textwrap

# UMAP reduces the vector embeddings to 2D for simple visualization
reducer = umap.UMAP( 
    n_neighbors=15,
    min_dist=0.1,
    metric="cosine",
    random_state=42
)

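embeddings_2d = reducer.fit_transform(embeddings)
print("example of embeddings after reduction: ", embeddings_2d[:2])

# store the 2D coordinates in the dataframe; the Plotly scatter plot reads them from the "x" and "y" columns
df_city_descriptions["x"] = embeddings_2d[:, 0]
df_city_descriptions["y"] = embeddings_2d[:, 1]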



n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.



example of embeddings after reduction:  [[13.093946  8.338982]
 [11.909134  8.420516]]


## Visualization with Plotly

In [113]:
# a function that compresses the population so it can be used as the marker size of each city on the map
def transform_value(x):
    """Returns the squared natural logarithm of the population"""
    power = 2
    return (math.log(x)**power)


In [114]:
# transform the population so it can be used as the marker size on the scatter plot
df_city_descriptions['population_transformed'] = df_city_descriptions['population'].apply(lambda x: round(transform_value(x)))
print(df_city_descriptions[['name', 'population', 'population_transformed']].head(5))
print(df_city_descriptions[['name', 'population', 'population_transformed']].tail(5))

      name  population  population_transformed
0    Praha     1397880                     200
1     Brno      402739                     167
2  Ostrava      283187                     158
3    Plzeň      187928                     147
4  Liberec      108090                     134
                     name  population  population_transformed
605                 Janov         256                      31
606              Boží Dar         255                      31
607              Rejštejn         240                      30
608  Loučná pod Klínovcem         199                      28
609                Přebuz          70                      18
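The numbers in the table can be reproduced by hand, which also shows how strongly the transform compresses the range. The snippet repeats `transform_value` with `power` exposed as a parameter (a small assumption made for the example) and uses three populations from the table above.

```python
import math

def transform_value(x, power=2):
    # squared natural logarithm, as in the notebook
    return math.log(x) ** power

# populations taken from the table above
for name, population in [("Praha", 1397880), ("Brno", 402739), ("Přebuz", 70)]:
    print(name, population, round(transform_value(population)))
```

Praha has roughly 20,000 times as many inhabitants as Přebuz, yet its marker value is only about 11 times larger (200 versus 18), so small towns stay visible next to the capital.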


In [150]:
# make text wrappings for better visualization in Plotly
def wrap_text(text, width=60, max_chars=400):
    if not isinstance(text, str):
        return ""
    
    # cut the text to a safe length for the Plotly hover box; add "..." only if something was cut off
    short = text if len(text) <= max_chars else text[:max_chars] + "..."

    # wrap into lines of `width` characters
    return "<br>".join(textwrap.wrap(short, width=width))



In [103]:
# run this if the plot doesn't render
import plotly.io as pio
pio.renderers.default = "notebook"
# pio.renderers.default = "notebook_connected" # or this

In [157]:
import plotly.express as px
# wrapping the text and cutting it to make the visualization work
df_city_descriptions["description_wrapped"] = df_city_descriptions["description"].apply(wrap_text).astype(str)

fig = px.scatter(
    df_city_descriptions,
    x="x", 
    y="y",
    hover_name="name",
    hover_data={"description_wrapped": True},
    color="region",
    text="name",
    size="population_transformed"
)

fig.update_traces(
    mode="markers+text",
    textposition="top center",
    textfont=dict(size=10)
)

fig.show()


## Nearest Neighbour Search Tool: Which cities are semantically "closest" to yours?

In [158]:
from sklearn.metrics.pairwise import cosine_similarity

# find the cosine similarities between the embeddings
similarity_matrix = cosine_similarity(embeddings)

def find_similar_cities(city_name, n=10):
    if city_name not in df_city_descriptions["name"].values:
        raise ValueError(f"City '{city_name}' not found.")
        
    idx = df_city_descriptions.index[df_city_descriptions["name"] == city_name][0]
    sims = similarity_matrix[idx]

    # indices sorted by similarity, highest first; position 0 is the city itself, so it is skipped
    top_indices = np.argsort(sims)[::-1][1:n+1]

    result = []
    for rank, i in enumerate(top_indices, start=1):
        result.append({
            "rank": rank,
            "city": df_city_descriptions.loc[i, "name"],
            "similarity": float(sims[i]),
            "description": df_city_descriptions.loc[i, "description"]
        })

    return result
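The ranking logic can be illustrated without the model. The sketch below computes cosine similarity with the standard library on three made-up 2D vectors. The names `toy`, `cosine` and `nearest` are hypothetical and are not part of the notebook.

```python
import math

def cosine(u, v):
    # cosine similarity: dot product divided by the product of the vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# made-up 2D "embeddings": A and B point in almost the same direction, C is orthogonal to A
toy = {"A": [1.0, 0.0], "B": [0.9, 0.1], "C": [0.0, 1.0]}

def nearest(name, n=2):
    query = toy[name]
    # score every other item, then sort by similarity (highest first)
    scored = [(other, cosine(query, vec)) for other, vec in toy.items() if other != name]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]

for other, score in nearest("A"):
    print(other, round(score, 2))
```

This variant excludes the query by name instead of dropping the first sorted index. The two approaches agree as long as the query's own similarity is the unique maximum, and filtering by name simply avoids relying on that.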


In [159]:
# try it out!
similar_cities = find_similar_cities("Brno", n=8)
for city in similar_cities:
    print(city['rank'],'#', city["city"], round(city["similarity"], 2))

# interestingly, in a lot of cases it returns cities that are geographically close (which makes sense, because their descriptions mention the same regions)


1 # Kroměříž 0.8
2 # Újezd u Brna 0.75
3 # Moravská Třebová 0.74
4 # Praha 0.73
5 # Moravské Budějovice 0.72
6 # Rudolfov 0.72
7 # Česká Kamenice 0.72
8 # Bruntál 0.72
