# Comparison of random city names with the unified geonames

**Project description:** The career center needs to match arbitrary, free-form city names against unified GeoNames entries for internal use. The cities may be located in Russia, Belarus, Armenia, Kazakhstan, Kyrgyzstan, Georgia, or Serbia. For each query, the operator should receive a list of recommended names, each with its geonameid, region, country, and cosine similarity score.

**Task:** To create a solution that selects the most suitable names from GeoNames for a given query.

**Data description:** datasets from download.geonames.org that contain geonameids, region and country info, etc. (admin1CodesASCII, alternateNamesV2, cities15000, countryInfo).

**Work plan:**

1 DATA PREPROCESSING

1.1 Importing necessary libraries

1.2 Loading the datasets into corresponding variables, putting them and info about them on screen

1.3 Creating a working dataset

1.4 Preprocessing the working dataset

1.5 Summary

2 APPLYING THE SENTENCE TRANSFORMER

2.1 Applying the Sentence Transformer

2.2 Summary

3 EXPANDING THE RESULTING DATASET

3.1 Expanding the resulting dataset

3.2 Creating a function for geoname queries

3.3 Summary

4 TESTING THE SOLUTION

4.1 Testing the solution

4.2 Summary

5 CONCLUSION

## Data preprocessing

### Importing necessary libraries

In [1]:
# !pip install SQLAlchemy
# !pip install --pre SQLAlchemy
# !pip install psycopg2

In [2]:
# !pip install -U sentence-transformers

In [3]:
import pandas as pd
import re

from sentence_transformers import SentenceTransformer, util

from sqlalchemy import create_engine
from sqlalchemy.engine.url import URL

Creating a connection to PostgreSQL:

In [4]:
DATABASE = {
    'drivername': 'postgresql',
    'username': 'postgres',
    'password': 'bathack73',
    'host': 'localhost',
    'port': 5432,
    'database': 'postgres',
    'query': {}
}

In [5]:
engine = create_engine(URL.create(**DATABASE))
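The settings above embed the database password directly in the notebook. A minimal sketch of one alternative, assuming the credentials are exported as environment variables (the `PG_*` variable names and the `database_settings` helper are hypothetical and not part of the original notebook):

```python
import os

def database_settings(env=os.environ):
    """Build the connection settings from environment variables (names are assumed)."""
    return {
        'drivername': 'postgresql',
        'username': env.get('PG_USER', 'postgres'),
        'password': env['PG_PASSWORD'],            # no default: fail loudly if unset
        'host': env.get('PG_HOST', 'localhost'),
        'port': int(env.get('PG_PORT', '5432')),
        'database': env.get('PG_DATABASE', 'postgres'),
        'query': {},
    }

# Example with an explicit mapping instead of the real environment
settings = database_settings({'PG_PASSWORD': 'example-secret'})
print(settings['host'], settings['port'])
```

The resulting dictionary has the same keys as `DATABASE`, so it could be passed to `URL.create(**settings)` in the same way.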

### Loading the datasets into corresponding variables, putting them and info about them on screen

In [6]:
admin_divisions = pd.read_csv(
    'C:/Users/ASUS/Downloads/admin1CodesASCII.txt',
    delimiter='\t',
    encoding='utf-16',   # setting the unicode type
    header=None,
    names=[              # naming the columns
        'code',
        'region',
        'region_ascii',
        'geonameid_admin'
    ]
)
admin_divisions.head()

Unnamed: 0,code,region,region_ascii,geonameid_admin
0,AD.06,Sant Julià de Loria,Sant Julia de Loria,3039162
1,AD.05,Ordino,Ordino,3039676
2,AD.04,La Massana,La Massana,3040131
3,AD.03,Encamp,Encamp,3040684
4,AD.02,Canillo,Canillo,3041203


In [7]:
admin_divisions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3880 entries, 0 to 3879
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   code             3880 non-null   object
 1   region           3880 non-null   object
 2   region_ascii     3880 non-null   object
 3   geonameid_admin  3880 non-null   int64 
dtypes: int64(1), object(3)
memory usage: 121.4+ KB


In [8]:
alternate_names = pd.read_csv(
    'C:/Users/ASUS/Downloads/alternateNamesV2.txt',
    delimiter='\t',
    header=None,
    dtype={'from': object, 'to': object},   # setting the columns datatype
    names=[
        'alternatenameid',
        'geonameid',
        'iso_language',
        'alternate_name',
        'is_preferred_name',
        'is_short_name',
        'is_colloquial',
        'is_historic',
        'from',
        'to'
    ]
)
alternate_names.head()

Unnamed: 0,alternatenameid,geonameid,iso_language,alternate_name,is_preferred_name,is_short_name,is_colloquial,is_historic,from,to
0,1284819,2994701,,Roc Mélé,,,,,,
1,1284820,2994701,,Roc Meler,,,,,,
2,4285256,3007683,,Pic des Langounelles,,,,,,
3,1291197,3017832,,Pic de les Abelletes,,,,,,
4,4290387,3017832,,Pic de la Font-Nègre,,,,,,


In [9]:
alternate_names.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16035922 entries, 0 to 16035921
Data columns (total 10 columns):
 #   Column             Dtype  
---  ------             -----  
 0   alternatenameid    int64  
 1   geonameid          int64  
 2   iso_language       object 
 3   alternate_name     object 
 4   is_preferred_name  float64
 5   is_short_name      float64
 6   is_colloquial      float64
 7   is_historic        float64
 8   from               object 
 9   to                 object 
dtypes: float64(4), int64(2), object(4)
memory usage: 1.2+ GB
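The full alternate names table has about 16 million rows and takes more than 1.2 GB in memory, while only the rows that belong to cities in `cities15000` are used later. A sketch of streaming the file and keeping only the wanted `geonameid` values, using the standard `csv` module on a small in-memory sample (the `filter_alternate_names` helper is illustrative; with pandas, `read_csv(..., chunksize=...)` would allow a similar approach):

```python
import csv
import io

def filter_alternate_names(lines, wanted_ids):
    """Yield (geonameid, iso_language, alternate_name) for wanted geonameids only."""
    reader = csv.reader(lines, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in reader:
        geonameid = int(row[1])          # second column holds the geonameid
        if geonameid in wanted_ids:
            yield geonameid, row[2], row[3]

# Two rows in the alternateNamesV2 layout (ten tab-separated fields)
sample = io.StringIO(
    "1284819\t2994701\t\tRoc Mélé\t\t\t\t\t\t\n"
    "1297907\t3040051\tca\tLes Escaldes\t\t\t\t\t\t\n"
)
print(list(filter_alternate_names(sample, {3040051})))
```

In practice `lines` would be an open file handle and `wanted_ids` the set of `geonameid` values from `cities`.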


In [10]:
cities = pd.read_csv(
    'C:/Users/ASUS/Downloads/cities15000.txt',
    delimiter='\t',
    header=None,
    names=[
        'geonameid',
        'name',
        'name_ascii',
        'alternate_names',
        'latitude',
        'longitude',
        'feature_class',
        'feature_code',
        'ISO',
        'cc2',
        'admin1 code',
        'admin2 code',
        'admin3 code',
        'admin4 code',
        'population',
        'elevation',
        'dem',
        'timezone',
        'modification_date'
    ]
)
cities.head()

Unnamed: 0,geonameid,name,name_ascii,alternate_names,latitude,longitude,feature_class,feature_code,ISO,cc2,admin1 code,admin2 code,admin3 code,admin4 code,population,elevation,dem,timezone,modification_date
0,3040051,les Escaldes,les Escaldes,"Ehskal'des-Ehndzhordani,Escaldes,Escaldes-Engo...",42.50729,1.53414,P,PPLA,AD,,8,,,,15853,,1033,Europe/Andorra,2008-10-15
1,3041563,Andorra la Vella,Andorra la Vella,"ALV,Ando-la-Vyey,Andora,Andora la Vela,Andora ...",42.50779,1.52109,P,PPLC,AD,,7,,,,20430,,1037,Europe/Andorra,2020-03-03
2,290594,Umm Al Quwain City,Umm Al Quwain City,"Oumm al Qaiwain,Oumm al Qaïwaïn,Um al Kawain,U...",25.56473,55.55517,P,PPLA,AE,,7,,,,62747,,2,Asia/Dubai,2019-10-24
3,291074,Ras Al Khaimah City,Ras Al Khaimah City,"Julfa,Khaimah,RAK City,RKT,Ra's al Khaymah,Ra'...",25.78953,55.9432,P,PPLA,AE,,5,,,,351943,,2,Asia/Dubai,2019-09-09
4,291580,Zayed City,Zayed City,"Bid' Zayed,Bid’ Zayed,Madinat Za'id,Madinat Za...",23.65416,53.70522,P,PPL,AE,,1,103.0,,,63482,,118,Asia/Dubai,2019-10-24


In [11]:
cities.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27167 entries, 0 to 27166
Data columns (total 19 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   geonameid          27167 non-null  int64  
 1   name               27167 non-null  object 
 2   name_ascii         27167 non-null  object 
 3   alternate_names    24838 non-null  object 
 4   latitude           27167 non-null  float64
 5   longitude          27167 non-null  float64
 6   feature_class      27167 non-null  object 
 7   feature_code       27167 non-null  object 
 8   ISO                27153 non-null  object 
 9   cc2                13 non-null     object 
 10  admin1 code        27159 non-null  object 
 11  admin2 code        22094 non-null  object 
 12  admin3 code        8510 non-null   object 
 13  admin4 code        2628 non-null   object 
 14  population         27167 non-null  int64  
 15  elevation          4256 non-null   float64
 16  dem                271
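The `ISO` column reports 27153 non-null values out of 27167. A likely cause (an assumption, not verified against the file) is that pandas reads the literal string `NA`, which is also Namibia's ISO code, as a missing value by default. None of the seven target countries is affected, but the sketch below reproduces the effect on a tiny inline sample and shows one way to avoid it. The sample rows and their ids are placeholders.

```python
import io
import pandas as pd

# Two placeholder rows; the second uses Namibia's ISO code "NA"
sample = "1\tAndorra la Vella\tAD\n2\tWindhoek\tNA\n"
cols = ['geonameid', 'name', 'ISO']

# Default behaviour: the string "NA" is treated as a missing value
default = pd.read_csv(io.StringIO(sample), delimiter='\t', header=None, names=cols)

# Only empty fields are treated as missing; "NA" stays a string
literal = pd.read_csv(io.StringIO(sample), delimiter='\t', header=None, names=cols,
                      keep_default_na=False, na_values=[''])

print(default['ISO'].isna().sum())
print(literal['ISO'].tolist())
```

The same options could be passed when loading `countryInfo.txt`, where `ISO` and `Continent` may be affected in the same way.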

Creating a "code" column for further merge:

In [12]:
cities['code'] = cities['ISO'] + '.' + cities['admin1 code']
cities.head()

Unnamed: 0,geonameid,name,name_ascii,alternate_names,latitude,longitude,feature_class,feature_code,ISO,cc2,admin1 code,admin2 code,admin3 code,admin4 code,population,elevation,dem,timezone,modification_date,code
0,3040051,les Escaldes,les Escaldes,"Ehskal'des-Ehndzhordani,Escaldes,Escaldes-Engo...",42.50729,1.53414,P,PPLA,AD,,8,,,,15853,,1033,Europe/Andorra,2008-10-15,AD.08
1,3041563,Andorra la Vella,Andorra la Vella,"ALV,Ando-la-Vyey,Andora,Andora la Vela,Andora ...",42.50779,1.52109,P,PPLC,AD,,7,,,,20430,,1037,Europe/Andorra,2020-03-03,AD.07
2,290594,Umm Al Quwain City,Umm Al Quwain City,"Oumm al Qaiwain,Oumm al Qaïwaïn,Um al Kawain,U...",25.56473,55.55517,P,PPLA,AE,,7,,,,62747,,2,Asia/Dubai,2019-10-24,AE.07
3,291074,Ras Al Khaimah City,Ras Al Khaimah City,"Julfa,Khaimah,RAK City,RKT,Ra's al Khaymah,Ra'...",25.78953,55.9432,P,PPLA,AE,,5,,,,351943,,2,Asia/Dubai,2019-09-09,AE.05
4,291580,Zayed City,Zayed City,"Bid' Zayed,Bid’ Zayed,Madinat Za'id,Madinat Za...",23.65416,53.70522,P,PPL,AE,,1,103.0,,,63482,,118,Asia/Dubai,2019-10-24,AE.01


Getting info about the updated dataset:

In [13]:
cities.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27167 entries, 0 to 27166
Data columns (total 20 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   geonameid          27167 non-null  int64  
 1   name               27167 non-null  object 
 2   name_ascii         27167 non-null  object 
 3   alternate_names    24838 non-null  object 
 4   latitude           27167 non-null  float64
 5   longitude          27167 non-null  float64
 6   feature_class      27167 non-null  object 
 7   feature_code       27167 non-null  object 
 8   ISO                27153 non-null  object 
 9   cc2                13 non-null     object 
 10  admin1 code        27159 non-null  object 
 11  admin2 code        22094 non-null  object 
 12  admin3 code        8510 non-null   object 
 13  admin4 code        2628 non-null   object 
 14  population         27167 non-null  int64  
 15  elevation          4256 non-null   float64
 16  dem                271

In [14]:
countries = pd.read_csv(
    'C:/Users/ASUS/Downloads/countryInfo.txt',
    delimiter='\t'
    )
countries.head()

Unnamed: 0,ISO,ISO3,ISO-Numeric,fips,Country,Capital,Area(in sq km),Population,Continent,tld,CurrencyCode,CurrencyName,Phone,Postal Code Format,Postal Code Regex,Languages,geonameid,neighbours,EquivalentFipsCode
0,AD,AND,20,AN,Andorra,Andorra la Vella,468.0,77006,EU,.ad,EUR,Euro,376,AD###,^(?:AD)*(\d{3})$,ca,3041565,"ES,FR",
1,AE,ARE,784,AE,United Arab Emirates,Abu Dhabi,82880.0,9630959,AS,.ae,AED,Dirham,971,,,"ar-AE,fa,en,hi,ur",290557,"SA,OM",
2,AF,AFG,4,AF,Afghanistan,Kabul,647500.0,37172386,AS,.af,AFN,Afghani,93,,,"fa-AF,ps,uz-AF,tk",1149361,"TM,CN,IR,TJ,PK,UZ",
3,AG,ATG,28,AC,Antigua and Barbuda,St. John's,443.0,96286,,.ag,XCD,Dollar,+1-268,,,en-AG,3576396,,
4,AI,AIA,660,AV,Anguilla,The Valley,102.0,13254,,.ai,XCD,Dollar,+1-264,,,en-AI,3573511,,


In [15]:
countries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252 entries, 0 to 251
Data columns (total 19 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ISO                 251 non-null    object 
 1   ISO3                252 non-null    object 
 2   ISO-Numeric         252 non-null    int64  
 3   fips                249 non-null    object 
 4   Country             252 non-null    object 
 5   Capital             246 non-null    object 
 6   Area(in sq km)      252 non-null    float64
 7   Population          252 non-null    int64  
 8   Continent           210 non-null    object 
 9   tld                 251 non-null    object 
 10  CurrencyCode        251 non-null    object 
 11  CurrencyName        251 non-null    object 
 12  Phone               247 non-null    object 
 13  Postal Code Format  162 non-null    object 
 14  Postal Code Regex   162 non-null    object 
 15  Languages           249 non-null    object 
 16  geonamei

Renaming the "geonameid" column, so as not to confuse it with the "cities" column of the same name:

In [16]:
countries.rename(columns = {'geonameid':'geonameid_country'}, inplace = True)

Getting info about the updated dataset:

In [17]:
countries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252 entries, 0 to 251
Data columns (total 19 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ISO                 251 non-null    object 
 1   ISO3                252 non-null    object 
 2   ISO-Numeric         252 non-null    int64  
 3   fips                249 non-null    object 
 4   Country             252 non-null    object 
 5   Capital             246 non-null    object 
 6   Area(in sq km)      252 non-null    float64
 7   Population          252 non-null    int64  
 8   Continent           210 non-null    object 
 9   tld                 251 non-null    object 
 10  CurrencyCode        251 non-null    object 
 11  CurrencyName        251 non-null    object 
 12  Phone               247 non-null    object 
 13  Postal Code Format  162 non-null    object 
 14  Postal Code Regex   162 non-null    object 
 15  Languages           249 non-null    object 
 16  geonamei

### Creating a working dataset

Merging the "alternate_names" and "cities" datasets on the "geonameid" column:

In [18]:
df_1 = pd.merge(alternate_names, cities, on="geonameid")
df_1.head()

Unnamed: 0,alternatenameid,geonameid,iso_language,alternate_name,is_preferred_name,is_short_name,is_colloquial,is_historic,from,to,...,admin1 code,admin2 code,admin3 code,admin4 code,population,elevation,dem,timezone,modification_date,code
0,1297907,3040051,ca,Les Escaldes,,,,,,,...,8,,,,15853,,1033,Europe/Andorra,2008-10-15,AD.08
1,1297908,3040051,ca,Escaldes,,,,,,,...,8,,,,15853,,1033,Europe/Andorra,2008-10-15,AD.08
2,1904145,3040051,fr,Escaldes-Engordany,,,,,,,...,8,,,,15853,,1033,Europe/Andorra,2008-10-15,AD.08
3,1904146,3040051,pl,Escaldes-Engordany,,,,,,,...,8,,,,15853,,1033,Europe/Andorra,2008-10-15,AD.08
4,1904147,3040051,es,Escaldes-Engordany,,,,,,,...,8,,,,15853,,1033,Europe/Andorra,2008-10-15,AD.08


In [19]:
df_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 341198 entries, 0 to 341197
Data columns (total 29 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   alternatenameid    341198 non-null  int64  
 1   geonameid          341198 non-null  int64  
 2   iso_language       289221 non-null  object 
 3   alternate_name     341198 non-null  object 
 4   is_preferred_name  8214 non-null    float64
 5   is_short_name      655 non-null     float64
 6   is_colloquial      302 non-null     float64
 7   is_historic        2096 non-null    float64
 8   from               309 non-null     object 
 9   to                 286 non-null     object 
 10  name               341198 non-null  object 
 11  name_ascii         341198 non-null  object 
 12  alternate_names    338040 non-null  object 
 13  latitude           341198 non-null  float64
 14  longitude          341198 non-null  float64
 15  feature_class      341198 non-null  object 
 16  fe

Merging the received dataset with the "admin_divisions" dataset on the "code" column:

In [20]:
df_2 = pd.merge(df_1, admin_divisions, on="code")
df_2.head()

Unnamed: 0,alternatenameid,geonameid,iso_language,alternate_name,is_preferred_name,is_short_name,is_colloquial,is_historic,from,to,...,admin4 code,population,elevation,dem,timezone,modification_date,code,region,region_ascii,geonameid_admin
0,1297907,3040051,ca,Les Escaldes,,,,,,,...,,15853,,1033,Europe/Andorra,2008-10-15,AD.08,Escaldes-Engordany,Escaldes-Engordany,3338529
1,1297908,3040051,ca,Escaldes,,,,,,,...,,15853,,1033,Europe/Andorra,2008-10-15,AD.08,Escaldes-Engordany,Escaldes-Engordany,3338529
2,1904145,3040051,fr,Escaldes-Engordany,,,,,,,...,,15853,,1033,Europe/Andorra,2008-10-15,AD.08,Escaldes-Engordany,Escaldes-Engordany,3338529
3,1904146,3040051,pl,Escaldes-Engordany,,,,,,,...,,15853,,1033,Europe/Andorra,2008-10-15,AD.08,Escaldes-Engordany,Escaldes-Engordany,3338529
4,1904147,3040051,es,Escaldes-Engordany,,,,,,,...,,15853,,1033,Europe/Andorra,2008-10-15,AD.08,Escaldes-Engordany,Escaldes-Engordany,3338529


In [21]:
df_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 340031 entries, 0 to 340030
Data columns (total 32 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   alternatenameid    340031 non-null  int64  
 1   geonameid          340031 non-null  int64  
 2   iso_language       288127 non-null  object 
 3   alternate_name     340031 non-null  object 
 4   is_preferred_name  8176 non-null    float64
 5   is_short_name      653 non-null     float64
 6   is_colloquial      291 non-null     float64
 7   is_historic        2089 non-null    float64
 8   from               309 non-null     object 
 9   to                 286 non-null     object 
 10  name               340031 non-null  object 
 11  name_ascii         340031 non-null  object 
 12  alternate_names    336874 non-null  object 
 13  latitude           340031 non-null  float64
 14  longitude          340031 non-null  float64
 15  feature_class      340031 non-null  object 
 16  fe
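The inner merge on `code` reduces the row count from 341198 to 340031, so some rows have a `code` that is missing or has no counterpart in `admin_divisions`. A minimal sketch of how such keys could be listed, using plain Python sets and made-up sample codes (with pandas, `merge(..., how='left', indicator=True)` would serve the same purpose):

```python
def unmatched_keys(left_keys, right_keys):
    """Return (keys an inner join would drop, number of missing keys) for the left side."""
    right = set(right_keys)
    # None stands in for a missing key (NaN in pandas), which never matches
    dropped = sorted({k for k in left_keys if k is not None and k not in right})
    null_count = sum(k is None for k in left_keys)
    return dropped, null_count

city_codes = ['AD.08', 'AD.07', 'XX.00', None, 'AD.08']   # illustrative values
admin_codes = ['AD.07', 'AD.08']

missing, null_count = unmatched_keys(city_codes, admin_codes)
print(missing, null_count)
```

Inspecting the dropped keys before filtering would show whether any of the target countries lose rows at this step.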

Merging the received dataset with the "countries" dataset on the "ISO" column:

In [22]:
df_full = pd.merge(df_2, countries, on="ISO")
df_full.head()

Unnamed: 0,alternatenameid,geonameid,iso_language,alternate_name,is_preferred_name,is_short_name,is_colloquial,is_historic,from,to,...,tld,CurrencyCode,CurrencyName,Phone,Postal Code Format,Postal Code Regex,Languages,geonameid_country,neighbours,EquivalentFipsCode
0,1297907,3040051,ca,Les Escaldes,,,,,,,...,.ad,EUR,Euro,376,AD###,^(?:AD)*(\d{3})$,ca,3041565,"ES,FR",
1,1297908,3040051,ca,Escaldes,,,,,,,...,.ad,EUR,Euro,376,AD###,^(?:AD)*(\d{3})$,ca,3041565,"ES,FR",
2,1904145,3040051,fr,Escaldes-Engordany,,,,,,,...,.ad,EUR,Euro,376,AD###,^(?:AD)*(\d{3})$,ca,3041565,"ES,FR",
3,1904146,3040051,pl,Escaldes-Engordany,,,,,,,...,.ad,EUR,Euro,376,AD###,^(?:AD)*(\d{3})$,ca,3041565,"ES,FR",
4,1904147,3040051,es,Escaldes-Engordany,,,,,,,...,.ad,EUR,Euro,376,AD###,^(?:AD)*(\d{3})$,ca,3041565,"ES,FR",


In [23]:
df_full.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 340031 entries, 0 to 340030
Data columns (total 50 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   alternatenameid     340031 non-null  int64  
 1   geonameid           340031 non-null  int64  
 2   iso_language        288127 non-null  object 
 3   alternate_name      340031 non-null  object 
 4   is_preferred_name   8176 non-null    float64
 5   is_short_name       653 non-null     float64
 6   is_colloquial       291 non-null     float64
 7   is_historic         2089 non-null    float64
 8   from                309 non-null     object 
 9   to                  286 non-null     object 
 10  name                340031 non-null  object 
 11  name_ascii          340031 non-null  object 
 12  alternate_names     336874 non-null  object 
 13  latitude            340031 non-null  float64
 14  longitude           340031 non-null  float64
 15  feature_class       340031 non-nul

Keeping only the required countries in a new dataset "df" and displaying it:

In [24]:
target_countries = ['RU', 'BY', 'KG', 'KZ', 'AM', 'GE', 'RS']
df = df_full[df_full['ISO'].isin(target_countries)]
df.head()

Unnamed: 0,alternatenameid,geonameid,iso_language,alternate_name,is_preferred_name,is_short_name,is_colloquial,is_historic,from,to,...,tld,CurrencyCode,CurrencyName,Phone,Postal Code Format,Postal Code Regex,Languages,geonameid_country,neighbours,EquivalentFipsCode
2160,135616,174875,,Qafan,,,,,,,...,.am,AMD,Dram,374,######,^(\d{6})$,hy,174982,"GE,IR,AZ,TR",
2161,1925363,174875,es,Kapan,,,,,,,...,.am,AMD,Dram,374,######,^(\d{6})$,hy,174982,"GE,IR,AZ,TR",
2162,1925364,174875,en,Kapan,,,,,,,...,.am,AMD,Dram,374,######,^(\d{6})$,hy,174982,"GE,IR,AZ,TR",
2163,1925365,174875,de,Kapan,,,,,,,...,.am,AMD,Dram,374,######,^(\d{6})$,hy,174982,"GE,IR,AZ,TR",
2164,1925366,174875,fa,کاپان,,,,,,,...,.am,AMD,Dram,374,######,^(\d{6})$,hy,174982,"GE,IR,AZ,TR",


Resetting the index:

In [25]:
df = df.reset_index(drop=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24848 entries, 0 to 24847
Data columns (total 50 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   alternatenameid     24848 non-null  int64  
 1   geonameid           24848 non-null  int64  
 2   iso_language        21118 non-null  object 
 3   alternate_name      24848 non-null  object 
 4   is_preferred_name   1310 non-null   float64
 5   is_short_name       31 non-null     float64
 6   is_colloquial       24 non-null     float64
 7   is_historic         359 non-null    float64
 8   from                215 non-null    object 
 9   to                  144 non-null    object 
 10  name                24848 non-null  object 
 11  name_ascii          24848 non-null  object 
 12  alternate_names     24844 non-null  object 
 13  latitude            24848 non-null  float64
 14  longitude           24848 non-null  float64
 15  feature_class       24848 non-null  object 
 16  feat

Exporting the new dataset to PostgreSQL as "cities_extended":

In [26]:
# df.to_sql('cities_extended', con=engine)

Importing the required columns from PostgreSQL and creating a working dataset as "corpus":

In [27]:
query = '''SELECT geonameid,
                  alternate_name AS name,
                  region,
                  country,
                  population
           FROM cities_extended;
'''

corpus = pd.read_sql_query(query, con=engine)
corpus.head()

Unnamed: 0,geonameid,name,region,country,population
0,174875,Qafan,Syunik,Armenia,33160
1,174875,Kapan,Syunik,Armenia,33160
2,174875,Kapan,Syunik,Armenia,33160
3,174875,Kapan,Syunik,Armenia,33160
4,174875,کاپان,Syunik,Armenia,33160


In [28]:
corpus.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24848 entries, 0 to 24847
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   geonameid   24848 non-null  int64 
 1   name        24848 non-null  object
 2   region      24848 non-null  object
 3   country     24848 non-null  object
 4   population  24848 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 970.8+ KB


### Preprocessing the working dataset

Creating and applying a function that strips every character other than Latin letters from the "name" column:

In [29]:
def clear_text(text):
    # Replace everything except Latin letters with spaces,
    # then collapse repeated whitespace into single spaces
    cleaned = re.sub(r'[^a-zA-Z]', ' ', text)
    return " ".join(cleaned.split())

corpus['name'] = corpus['name'].apply(clear_text)

corpus.head()

Unnamed: 0,geonameid,name,region,country,population
0,174875,Qafan,Syunik,Armenia,33160
1,174875,Kapan,Syunik,Armenia,33160
2,174875,Kapan,Syunik,Armenia,33160
3,174875,Kapan,Syunik,Armenia,33160
4,174875,,Syunik,Armenia,33160
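Note that the pattern `[^a-zA-Z]` replaces every non-Latin character, so a name written in another script (row 4 above) becomes an empty string, and accented letters turn into spaces. A minimal sketch of a gentler variant, assuming that folding accented letters to their base letters is acceptable for this task (the `fold_to_ascii` helper and the sample inputs are illustrative):

```python
import re
import unicodedata

def fold_to_ascii(text):
    """Strip diacritics, then keep only Latin letters separated by single spaces."""
    # NFKD splits an accented letter into base letter + combining mark;
    # the combining mark is dropped when encoding to ASCII
    decomposed = unicodedata.normalize('NFKD', text)
    ascii_only = decomposed.encode('ascii', 'ignore').decode('ascii')
    return ' '.join(re.sub(r'[^a-zA-Z]', ' ', ascii_only).split())

print(fold_to_ascii('Anadýr'))
print(fold_to_ascii('Sant Julià de Loria'))
print(repr(fold_to_ascii('کاپان')))   # non-Latin scripts still become ''
```

Names that end up empty could be dropped before encoding, since an empty string carries no information for the search.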


Removing the remnants of Wikipedia links from the "name" column:

In [30]:
corpus['name'] = corpus['name'].str.replace('https en wikipedia org wiki ','')
corpus['name'] = corpus['name'].str.replace('https ru wikipedia org wiki ','')
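The replacements above remove the Wikipedia prefixes after cleaning, but remnants of percent-encoded links and short codes can still remain among the names. In the GeoNames alternate names table, the `iso_language` field also carries pseudo-codes such as `link` (URLs), `post` (postal codes), and `iata`/`icao` (airport codes). A sketch of filtering on that field before cleaning; the `NON_NAME_CODES` set and the `is_place_name` helper are hypothetical, and the set is an assumed subset that may need extending:

```python
# Pseudo-language codes marking entries that are not place names (assumed subset)
NON_NAME_CODES = {'link', 'post', 'iata', 'icao', 'faac', 'wkdt', 'unlc'}

def is_place_name(iso_language):
    """True if the alternate-name row should be kept as a real name."""
    return iso_language not in NON_NAME_CODES

rows = [                                   # illustrative rows
    ('en', 'Anadyr'),
    ('link', 'https://en.wikipedia.org/wiki/Anadyr'),
    ('iata', 'DYR'),
    (None, 'Anadyr town'),                 # a missing language code is kept
]
kept = [name for lang, name in rows if is_place_name(lang)]
print(kept)
```

In this notebook the filter would have to be applied to `df` before the export, because the query that builds `corpus` does not select `iso_language`.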

### **Summary:**

The datasets are loaded with the necessary adjustments (column names, data types, and a "code" key for merging). The four datasets are merged into one full dataset, which is then filtered for the required countries and exported to PostgreSQL. The working dataset is then imported from PostgreSQL as "corpus".

Non-Latin characters and the remnants of Wikipedia links are removed from the "name" column of the "corpus".

## Applying the Sentence Transformer

Dropping duplicates from the "name" column and extracting the remaining names as a NumPy array via the "values" attribute:

In [31]:
names = corpus.name.drop_duplicates().values
names[-10:]

array(['Anadir', 'Anad r', 'An dyr', 'Anadyr town',
       'D D BD D B D B D B D D C D B D BE D D BE D B', 'Anadyris',
       'Anadira', 'DYR', 'Qagiirgiin', 'RUDYR'], dtype=object)

Creating embeddings using LaBSE:

In [32]:
labse = SentenceTransformer('sentence-transformers/LaBSE')
embeddings = labse.encode(names)
embeddings.shape

(9082, 768)

Applying the semantic search to a random city name and saving the results as a new dataset:

In [33]:
result = pd.DataFrame(util.semantic_search(labse.encode('Киров'), embeddings)[0])
result = result.assign(name=names[result.corpus_id])

In [34]:
result.head()

Unnamed: 0,corpus_id,score,name
0,5635,0.904827,Kirov
1,7007,0.85726,Kirovo
2,5637,0.825394,Kirovas
3,4552,0.821535,Kirovsk
4,5623,0.82097,Kirow
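The `score` returned by `util.semantic_search` is, by default, the cosine similarity between the query embedding and each corpus embedding. A minimal pure-Python sketch of that computation and of the top-k selection, using toy 3-dimensional vectors in place of the 768-dimensional LaBSE embeddings (`cosine` and `top_k` are illustrative helpers, not library functions):

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query, corpus, k=2):
    """Return (corpus_id, score) pairs for the k most similar corpus vectors."""
    scored = [(i, cosine(query, vec)) for i, vec in enumerate(corpus)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

toy_corpus = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
print(top_k([1.0, 0.0, 0.0], toy_corpus))
```

A score close to 1 means the two vectors point in nearly the same direction; the library performs the same ranking in batched form.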


**Summary:**

The "name" column is deduplicated and converted into a NumPy array of 9082 unique names, from which embeddings are created using LaBSE. A semantic search for the sample query 'Киров' returns 'Kirov' as the top match with a score of about 0.90.

## Expanding the resulting dataset

Merging the "result" and "corpus" datasets on the "name" column to add the required columns, then dropping duplicates from the results:

In [35]:
result = pd.merge(result, corpus, on="name")
result = result.drop_duplicates()
result

Unnamed: 0,corpus_id,score,name,geonameid,region,country,population
0,5635,0.904827,Kirov,548408,Kirov Oblast,Russia,507155
13,5635,0.904827,Kirov,548410,Kaluga Oblast,Russia,39319
16,7007,0.85726,Kirovo,548410,Kaluga Oblast,Russia,39319
17,5637,0.825394,Kirovas,548408,Kirov Oblast,Russia,507155
18,4552,0.821535,Kirovsk,548391,Murmansk,Russia,29605
21,4552,0.821535,Kirovsk,548392,Leningradskaya Oblast',Russia,24678
24,5623,0.82097,Kirow,548408,Kirov Oblast,Russia,507155
28,5623,0.82097,Kirow,548410,Kaluga Oblast,Russia,39319
31,7317,0.81597,Kirovgrad,1503335,Sverdlovsk Oblast,Russia,22685
35,5642,0.802906,Kirova,548408,Kirov Oblast,Russia,507155


Sorting the results by score and then by population, so that when several locations share the same name (and therefore the same score), the more populous one is recommended first:

In [36]:
result = result.sort_values(by = ["score", "population"], ascending = False)
result

Unnamed: 0,corpus_id,score,name,geonameid,region,country,population
0,5635,0.904827,Kirov,548408,Kirov Oblast,Russia,507155
13,5635,0.904827,Kirov,548410,Kaluga Oblast,Russia,39319
16,7007,0.85726,Kirovo,548410,Kaluga Oblast,Russia,39319
17,5637,0.825394,Kirovas,548408,Kirov Oblast,Russia,507155
18,4552,0.821535,Kirovsk,548391,Murmansk,Russia,29605
21,4552,0.821535,Kirovsk,548392,Leningradskaya Oblast',Russia,24678
24,5623,0.82097,Kirow,548408,Kirov Oblast,Russia,507155
28,5623,0.82097,Kirow,548410,Kaluga Oblast,Russia,39319
31,7317,0.81597,Kirovgrad,1503335,Sverdlovsk Oblast,Russia,22685
35,5642,0.802906,Kirova,548408,Kirov Oblast,Russia,507155
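The two-key sort uses `population` only to break ties between rows with an identical `score`. A sketch of the same ordering on plain dictionaries, with values taken from the table above:

```python
rows = [
    {'score': 0.85726,  'name': 'Kirovo', 'region': 'Kaluga Oblast', 'population': 39319},
    {'score': 0.904827, 'name': 'Kirov',  'region': 'Kaluga Oblast', 'population': 39319},
    {'score': 0.904827, 'name': 'Kirov',  'region': 'Kirov Oblast',  'population': 507155},
]

# Descending by score first, then by population among equal scores
ordered = sorted(rows, key=lambda r: (r['score'], r['population']), reverse=True)
print([(r['name'], r['region']) for r in ordered])
```

Both 'Kirov' rows share the top score, so the larger city in Kirov Oblast comes first.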


Dropping the unnecessary columns from the resulting dataset:

In [37]:
result = result.drop(['corpus_id', 'population'], axis=1)
result

Unnamed: 0,score,name,geonameid,region,country
0,0.904827,Kirov,548408,Kirov Oblast,Russia
13,0.904827,Kirov,548410,Kaluga Oblast,Russia
16,0.85726,Kirovo,548410,Kaluga Oblast,Russia
17,0.825394,Kirovas,548408,Kirov Oblast,Russia
18,0.821535,Kirovsk,548391,Murmansk,Russia
21,0.821535,Kirovsk,548392,Leningradskaya Oblast',Russia
24,0.82097,Kirow,548408,Kirov Oblast,Russia
28,0.82097,Kirow,548410,Kaluga Oblast,Russia
31,0.81597,Kirovgrad,1503335,Sverdlovsk Oblast,Russia
35,0.802906,Kirova,548408,Kirov Oblast,Russia


Converting the resulting dataset to a list of dictionaries, one per recommended name:

In [38]:
result.to_dict(orient='records')

[{'score': 0.9048265218734741,
  'name': 'Kirov',
  'geonameid': 548408,
  'region': 'Kirov Oblast',
  'country': 'Russia'},
 {'score': 0.9048265218734741,
  'name': 'Kirov',
  'geonameid': 548410,
  'region': 'Kaluga Oblast',
  'country': 'Russia'},
 {'score': 0.8572602868080139,
  'name': 'Kirovo',
  'geonameid': 548410,
  'region': 'Kaluga Oblast',
  'country': 'Russia'},
 {'score': 0.8253940343856812,
  'name': 'Kirovas',
  'geonameid': 548408,
  'region': 'Kirov Oblast',
  'country': 'Russia'},
 {'score': 0.8215348720550537,
  'name': 'Kirovsk',
  'geonameid': 548391,
  'region': 'Murmansk',
  'country': 'Russia'},
 {'score': 0.8215348720550537,
  'name': 'Kirovsk',
  'geonameid': 548392,
  'region': "Leningradskaya Oblast'",
  'country': 'Russia'},
 {'score': 0.8209702968597412,
  'name': 'Kirow',
  'geonameid': 548408,
  'region': 'Kirov Oblast',
  'country': 'Russia'},
 {'score': 0.8209702968597412,
  'name': 'Kirow',
  'geonameid': 548410,
  'region': 'Kaluga Oblast',
  'count

Creating a function for geoname queries:

In [39]:
def geoname(query):
    result = pd.DataFrame(util.semantic_search(labse.encode(query), embeddings, top_k=5)[0])
    result = result.assign(name=names[result.corpus_id])
    result = pd.merge(result, corpus, on="name")
    result = result.drop_duplicates()
    result = result.sort_values(by = ["score", "population"], ascending = False)
    result = result.drop(['corpus_id', 'population'], axis=1)
    result = result.to_dict(orient='records')
    
    return result
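Because several spellings can point to the same place, the returned list may repeat a `geonameid`. If the operator needs only one line per location, a possible post-processing step (an assumption about the desired output, not part of the original function) keeps the best-ranked record for each `geonameid`. The `best_per_location` helper is hypothetical:

```python
def best_per_location(records):
    """Keep the first (highest-ranked) record for every geonameid."""
    seen = set()
    unique = []
    for record in records:            # records are assumed to be sorted best-first
        if record['geonameid'] not in seen:
            seen.add(record['geonameid'])
            unique.append(record)
    return unique

sample = [   # abbreviated records in the shape returned by geoname()
    {'score': 0.9048, 'name': 'Kirov',   'geonameid': 548408},
    {'score': 0.9048, 'name': 'Kirov',   'geonameid': 548410},
    {'score': 0.8573, 'name': 'Kirovo',  'geonameid': 548410},
    {'score': 0.8254, 'name': 'Kirovas', 'geonameid': 548408},
]
print(best_per_location(sample))
```

Whether to collapse the list this way depends on whether the operator wants to see alternative spellings or only distinct places.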

Testing the function:

In [40]:
geoname('Киров')

[{'score': 0.9048265218734741,
  'name': 'Kirov',
  'geonameid': 548408,
  'region': 'Kirov Oblast',
  'country': 'Russia'},
 {'score': 0.9048265218734741,
  'name': 'Kirov',
  'geonameid': 548410,
  'region': 'Kaluga Oblast',
  'country': 'Russia'},
 {'score': 0.8572602868080139,
  'name': 'Kirovo',
  'geonameid': 548410,
  'region': 'Kaluga Oblast',
  'country': 'Russia'},
 {'score': 0.8253940343856812,
  'name': 'Kirovas',
  'geonameid': 548408,
  'region': 'Kirov Oblast',
  'country': 'Russia'},
 {'score': 0.8215348720550537,
  'name': 'Kirovsk',
  'geonameid': 548391,
  'region': 'Murmansk',
  'country': 'Russia'},
 {'score': 0.8215348720550537,
  'name': 'Kirovsk',
  'geonameid': 548392,
  'region': "Leningradskaya Oblast'",
  'country': 'Russia'},
 {'score': 0.8209702968597412,
  'name': 'Kirow',
  'geonameid': 548408,
  'region': 'Kirov Oblast',
  'country': 'Russia'},
 {'score': 0.8209702968597412,
  'name': 'Kirow',
  'geonameid': 548410,
  'region': 'Kaluga Oblast',
  'count

**Summary:**

To obtain all the required columns, the search results are merged with the "corpus" on "name" and deduplicated. The dataset is then sorted so that, when several locations share the same name, the more populous one is recommended first. Columns that are no longer needed are dropped, and the "result" dataset is converted into a list of dictionaries.

The "geoname" function wraps these steps so that the solution can be reused for any query.

## Testing the solution

Loading the test dataset:

In [41]:
geo_test = pd.read_csv(
    'C:/Users/ASUS/Downloads/geo_test.csv',
    sep=';'   # plain semicolon; '\;' is an invalid escape sequence in a normal string
)
geo_test.head()

Unnamed: 0,query,name,region,country
0,Смоленск,Smolensk,Smolensk Oblast,Russia
1,Кемерово,Kemerovo,Kuzbass,Russia
2,Бишкек,Bishkek,Bishkek,Kyrgyzstan
3,Москва,Moscow,Moscow,Russia
4,Алматы,Almaty,Almaty,Kazakhstan


Getting five random queries for further testing:

In [44]:
geo_test.sample(n=5)

Unnamed: 0,query,name,region,country
160,Каспийск,Kaspiysk,Dagestan,Russia
238,Ахалцихе,Akhaltsikhe,Samtskhe-Javakheti,Georgia
70,Павловск,Pavlovsk,St.-Petersburg,Russia
130,Уссурийск,Ussuriysk,Primorye,Russia
220,Новоалтайск,Novoaltaysk,Altai Krai,Russia
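Spot checks on five samples give only an impression of quality. A sketch of how a top-k hit rate over the whole test set could be computed, assuming a search function that returns records in the same shape as `geoname`. The `hit_rate` helper is hypothetical, and the `fake_search` stub with its canned data stands in for the real model so that the example runs on its own:

```python
def hit_rate(test_rows, search, k=1):
    """Share of queries whose expected name appears among the first k results."""
    hits = 0
    for query, expected_name in test_rows:
        names = [rec['name'] for rec in search(query)[:k]]
        hits += expected_name in names
    return hits / len(test_rows)

# Stub standing in for geoname(); the canned results are abbreviated
def fake_search(query):
    canned = {
        'Каспийск': [{'name': 'Kaspijsk'}, {'name': 'Kaspiysk'}],
        'Новоалтайск': [{'name': 'Novoaltaysk'}, {'name': 'Novoaltajsk'}],
    }
    return canned.get(query, [])

test_rows = [('Каспийск', 'Kaspiysk'), ('Новоалтайск', 'Novoaltaysk')]
print(hit_rate(test_rows, fake_search, k=1))
print(hit_rate(test_rows, fake_search, k=2))
```

Comparing on `region` and `country` (or on `geonameid`, if the test set provided it) rather than on the exact spelling would be less sensitive to transliteration variants. Passing `random_state` to `sample` would also make the five spot checks reproducible.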


Testing the queries:

In [45]:
geoname('Каспийск')

[{'score': 0.918549120426178,
  'name': 'Kaspijsk',
  'geonameid': 551847,
  'region': 'Dagestan',
  'country': 'Russia'},
 {'score': 0.7944949269294739,
  'name': 'Kaspiysk',
  'geonameid': 551847,
  'region': 'Dagestan',
  'country': 'Russia'},
 {'score': 0.7252591848373413,
  'name': 'Kopejsk',
  'geonameid': 1502603,
  'region': 'Chelyabinsk',
  'country': 'Russia'},
 {'score': 0.6966314315795898,
  'name': 'Kabakovsk',
  'geonameid': 1492663,
  'region': 'Sverdlovsk Oblast',
  'country': 'Russia'},
 {'score': 0.6928472518920898,
  'name': 'Karpinsk',
  'geonameid': 1504343,
  'region': 'Sverdlovsk Oblast',
  'country': 'Russia'}]

In [46]:
geoname('Ахалцихе')

[{'score': 0.8862635493278503,
  'name': 'Achalciche',
  'geonameid': 615860,
  'region': 'Samtskhe-Javakheti',
  'country': 'Georgia'},
 {'score': 0.8687085509300232,
  'name': 'Achalcich',
  'geonameid': 615860,
  'region': 'Samtskhe-Javakheti',
  'country': 'Georgia'},
 {'score': 0.8383264541625977,
  'name': 'Achaltsiche',
  'geonameid': 615860,
  'region': 'Samtskhe-Javakheti',
  'country': 'Georgia'},
 {'score': 0.8310538530349731,
  'name': 'Achalziche',
  'geonameid': 615860,
  'region': 'Samtskhe-Javakheti',
  'country': 'Georgia'},
 {'score': 0.8116089105606079,
  'name': 'Achalzych',
  'geonameid': 615860,
  'region': 'Samtskhe-Javakheti',
  'country': 'Georgia'}]

In [47]:
geoname('Павловск')

[{'score': 0.9115622043609619,
  'name': 'Pavlovszk',
  'geonameid': 512052,
  'region': 'St.-Petersburg',
  'country': 'Russia'},
 {'score': 0.9028421640396118,
  'name': 'Pavlovsk',
  'geonameid': 512053,
  'region': 'Voronezh Oblast',
  'country': 'Russia'},
 {'score': 0.9028421640396118,
  'name': 'Pavlovsk',
  'geonameid': 512052,
  'region': 'St.-Petersburg',
  'country': 'Russia'},
 {'score': 0.8789234757423401,
  'name': 'Pawlowsk',
  'geonameid': 512052,
  'region': 'St.-Petersburg',
  'country': 'Russia'},
 {'score': 0.8559516668319702,
  'name': 'Pavlovska',
  'geonameid': 512052,
  'region': 'St.-Petersburg',
  'country': 'Russia'},
 {'score': 0.8551025390625,
  'name': 'Pavlovskaya',
  'geonameid': 512051,
  'region': 'Krasnodar Krai',
  'country': 'Russia'}]

In [48]:
geoname('Уссурийск')

[{'score': 0.9266659021377563,
  'name': 'Ussuryjsk',
  'geonameid': 2014006,
  'region': 'Primorye',
  'country': 'Russia'},
 {'score': 0.9204007983207703,
  'name': 'Ussurijsk',
  'geonameid': 2014006,
  'region': 'Primorye',
  'country': 'Russia'},
 {'score': 0.8803470134735107,
  'name': 'Ussuriisk',
  'geonameid': 2014006,
  'region': 'Primorye',
  'country': 'Russia'},
 {'score': 0.8572003841400146,
  'name': 'Usszurijszk',
  'geonameid': 2014006,
  'region': 'Primorye',
  'country': 'Russia'},
 {'score': 0.837356448173523,
  'name': 'Ussuriysk',
  'geonameid': 2014006,
  'region': 'Primorye',
  'country': 'Russia'}]

In [49]:
geoname('Новоалтайск')

[{'score': 0.9462506771087646,
  'name': 'Novoaltaysk',
  'geonameid': 1497173,
  'region': 'Altai Krai',
  'country': 'Russia'},
 {'score': 0.9332069158554077,
  'name': 'Novoaltajsk',
  'geonameid': 1497173,
  'region': 'Altai Krai',
  'country': 'Russia'},
 {'score': 0.8715115785598755,
  'name': 'Nowoaltaisk',
  'geonameid': 1497173,
  'region': 'Altai Krai',
  'country': 'Russia'},
 {'score': 0.8325079679489136,
  'name': 'Novonikolayevsk',
  'geonameid': 1496747,
  'region': 'Novosibirsk Oblast',
  'country': 'Russia'},
 {'score': 0.8051725625991821,
  'name': 'Novoulyanovsk',
  'geonameid': 517766,
  'region': 'Ulyanovsk',
  'country': 'Russia'}]
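
The manual checks above could also be run over the whole test set. The following is a minimal sketch, assuming the search function returns a list of dicts with a 'name' key, as in the outputs above. The names `top_k_hit_rate` and `fake_geoname` are hypothetical and introduced only for illustration; the stub stands in for the real "geoname" function so that the example runs on its own.

```python
def top_k_hit_rate(rows, search_fn, k=5):
    """Share of (query, expected_name) pairs whose expected name
    appears among the first k results returned by search_fn."""
    if not rows:
        return 0.0
    hits = 0
    for query, expected in rows:
        names = [r['name'] for r in search_fn(query)[:k]]
        if expected in names:
            hits += 1
    return hits / len(rows)


# Hypothetical stand-in for geoname(); it returns results in the same shape.
def fake_geoname(query):
    canned = {
        'Каспийск': [{'score': 0.92, 'name': 'Kaspijsk'},
                     {'score': 0.79, 'name': 'Kaspiysk'}],
        'Ахалцихе': [{'score': 0.89, 'name': 'Achalciche'}],
    }
    return canned.get(query, [])


rows = [('Каспийск', 'Kaspiysk'), ('Ахалцихе', 'Akhaltsikhe')]
rate = top_k_hit_rate(rows, fake_geoname, k=5)
print(rate)
```

With the real data, the pairs could be built as `list(zip(geo_test['query'], geo_test['name']))` and "geoname" passed as `search_fn`. Matching on the exact name is strict: a query such as 'Ахалцихе', where only alternative transliterations are returned, counts as a miss even though the city itself is found.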

**Summary:**

The test dataset is loaded. Five random queries are drawn from it and tested with the "geoname" function. For the Russian city names, the expected name appears within the top 5 results, with scores ranging from 0.79 to 0.95. For the Georgian name 'Ахалцихе', the function returns the correct geonameid in the first line, but only alternative transliterations such as 'Achalciche' are returned rather than the expected 'Akhaltsikhe'; the name is still recognizable.
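
In several of the outputs above, the same geonameid occupies more than one line because alternative spellings of one city are scored separately. If the operator only needs one line per city, the results could be collapsed as in the sketch below. This is an optional post-processing idea, not part of the solution described here; `best_per_geonameid` is a hypothetical helper, and the sample list is a shortened copy of the 'Каспийск' output with rounded scores.

```python
def best_per_geonameid(results):
    """Keep only the highest-scoring entry for each geonameid
    and return the survivors in descending score order."""
    best = {}
    for r in results:
        gid = r['geonameid']
        if gid not in best or r['score'] > best[gid]['score']:
            best[gid] = r
    return sorted(best.values(), key=lambda r: r['score'], reverse=True)


sample = [
    {'score': 0.9186, 'name': 'Kaspijsk', 'geonameid': 551847},
    {'score': 0.7945, 'name': 'Kaspiysk', 'geonameid': 551847},
    {'score': 0.7253, 'name': 'Kopejsk', 'geonameid': 1502603},
]
collapsed = best_per_geonameid(sample)
print([r['name'] for r in collapsed])
```

The trade-off is that the retained spelling is the best-scoring one, which is not necessarily the canonical name of the city.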

## Conclusion

During preprocessing, the required datasets (admin1CodesASCII, alternateNamesV2, cities15000, countryInfo) are loaded with all necessary tweaks and additions. The four datasets are merged into one full dataset, which is then filtered for the required countries and exported to PostgreSQL. From PostgreSQL, the working dataset is imported as "corpus". The "name" column of the "corpus" is cleaned of unnecessary symbols and internet links.

Then the "name" column is deduplicated and converted into values. The Sentence Transformer is chosen for the task, and embeddings are created from the values using LaBSE. A semantic search for the randomly chosen name 'Киров' returns a match in the first line with a score of 0.9.
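
The scores reported throughout are cosine similarities between the query embedding and the corpus embeddings. The ranking step can be sketched in plain Python as follows. The three-dimensional vectors are made-up toy values, not LaBSE embeddings, and `cosine` and `rank` are hypothetical helper names used only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank(query_vec, corpus, top_k=5):
    """Return the top_k (score, name) pairs, best score first."""
    scored = [(cosine(query_vec, vec), name) for name, vec in corpus.items()]
    scored.sort(reverse=True)
    return scored[:top_k]


# Toy vectors standing in for real embeddings.
toy_corpus = {
    'Kirov': [0.9, 0.1, 0.0],
    'Kirovsk': [0.7, 0.3, 0.1],
    'Perm': [0.0, 0.2, 0.9],
}
ranked = rank([1.0, 0.0, 0.0], toy_corpus, top_k=2)
print([name for _, name in ranked])
```

In the notebook itself this work is done by the sentence-transformers library on LaBSE embeddings; the sketch only shows what a cosine-based ranking computes.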

To obtain all the required columns, the resulting dataset is merged with the "corpus" on "name", deduplicated and sorted by population. The resulting dataset is converted into a list of tuples, as required. The "geoname" function is created to simplify further use of the solution.

Five random queries are drawn from the "geo_test" dataset and tested with the "geoname" function. For the Russian city names, the expected name appears within the top 5 results, with scores ranging from 0.79 to 0.95. For the Georgian name 'Ахалцихе', the function returns the correct geonameid in the first line, but only alternative transliterations such as 'Achalciche' are returned rather than the expected 'Akhaltsikhe'; the name is still recognizable.

All in all, considering that the solution will be used as a recommendation system for a human operator, the Sentence Transformer works quite well for this task and will be suitable for the customer's needs.