# Indexing

## Direct Indexing

## Example Data from geonames

From http://download.geonames.org/export/dump/

```
The main 'geoname' table has the following fields :
---------------------------------------------------
geonameid         : integer id of record in geonames database
name              : name of geographical point (utf8) varchar(200)
asciiname         : name of geographical point in plain ascii characters, varchar(200)
alternatenames    : alternatenames, comma separated, ascii names automatically transliterated, convenience attribute from alternatename table, varchar(10000)
latitude          : latitude in decimal degrees (wgs84)
longitude         : longitude in decimal degrees (wgs84)
feature class     : see http://www.geonames.org/export/codes.html, char(1)
feature code      : see http://www.geonames.org/export/codes.html, varchar(10)
country code      : ISO-3166 2-letter country code, 2 characters
cc2               : alternate country codes, comma separated, ISO-3166 2-letter country code, 200 characters
admin1 code       : fipscode (subject to change to iso code), see exceptions below, see file admin1Codes.txt for display names of this code; varchar(20)
admin2 code       : code for the second administrative division, a county in the US, see file admin2Codes.txt; varchar(80) 
admin3 code       : code for third level administrative division, varchar(20)
admin4 code       : code for fourth level administrative division, varchar(20)
population        : bigint (8 byte int) 
elevation         : in meters, integer
dem               : digital elevation model, srtm3 or gtopo30, average elevation of 3''x3'' (ca 90mx90m) or 30''x30'' (ca 900mx900m) area in meters, integer. srtm processed by cgiar/ciat.
timezone          : the iana timezone id (see file timeZone.txt) varchar(40)
modification date : date of last modification in yyyy-MM-dd format
```



In [108]:
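# Download the full dump into data/ (the directory must already exist).
# The smaller cities500.zip used below is available from the same directory.
!curl http://download.geonames.org/export/dump/allCountries.zip --output data/allCountries.zip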


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  361M  100  361M    0     0  11.0M      0  0:00:32  0:00:32 --:--:-- 11.0M


In [72]:

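import csv
import zipfile
from io import TextIOWrapper

# Column names in file order (see the field list above).
FIELDNAMES = ['geonameid', 'name', 'asciiname', 'alternatenames', 'latitude', 'longitude', 'feature class', 'feature code', 'country code', 'cc2', 'admin1 code', 'admin2 code', 'admin3 code', 'admin4 code', 'population', 'elevation', 'dem', 'timezone', 'modification date']

def read_geonames(filename):
    """Read data/<filename>.zip and return {geonameid (int): record (dict of str)}."""
    result = {}
    with zipfile.ZipFile(f'data/{filename}.zip') as myzip:
        with myzip.open(f'{filename}.txt', 'r') as csv_file:
            # The dump is plain tab-separated text, so quote handling is disabled;
            # otherwise a stray '"' inside a name can merge several fields.
            text = TextIOWrapper(csv_file, encoding='utf-8', newline='')
            reader = csv.DictReader(text, delimiter='\t', quoting=csv.QUOTE_NONE, fieldnames=FIELDNAMES)
            for data in reader:
                # Key each record by its numeric id: this is the primary index.
                result[int(data['geonameid'])] = data

    return result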

cities = read_geonames('cities500')
places = read_geonames('allCountries')

## Visualize

Print the number of records and look at the first entry. We deliberately avoid Pandas because the goal is to code retrieval ourselves.

In [64]:
print(len(cities))
next(iter(cities.values()))



199727


{'geonameid': '3038999',
 'name': 'Soldeu',
 'asciiname': 'Soldeu',
 'alternatenames': '',
 'latitude': '42.57688',
 'longitude': '1.66769',
 'feature class': 'P',
 'feature code': 'PPL',
 'country code': 'AD',
 'cc2': '',
 'admin1 code': '02',
 'admin2 code': '',
 'admin3 code': '',
 'admin4 code': '',
 'population': '602',
 'elevation': '',
 'dem': '1832',
 'timezone': 'Europe/Andorra',
 'modification date': '2017-11-06'}
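
Note that every value in the record is a string, because `csv.DictReader` performs no type conversion. A minimal sketch of how a record could be converted to typed values, following the field list above. The helper name `to_typed` and the choice of `None` for empty fields are assumptions made for illustration, not part of the original notebook:

```python
def to_typed(record):
    """Return a copy of a geonames record with numeric fields converted."""
    typed = dict(record)
    typed['geonameid'] = int(record['geonameid'])
    typed['latitude'] = float(record['latitude'])
    typed['longitude'] = float(record['longitude'])
    # These integer fields can be empty in the dump; map '' to None.
    for field in ('population', 'elevation', 'dem'):
        typed[field] = int(record[field]) if record[field] else None
    return typed

# Abbreviated copy of the record printed above.
soldeu = {'geonameid': '3038999', 'name': 'Soldeu', 'latitude': '42.57688',
          'longitude': '1.66769', 'population': '602', 'elevation': '', 'dem': '1832'}
print(to_typed(soldeu))
```

The rest of this notebook keeps the raw string records and converts values only where an index needs them.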

## Lookup
Let's find London in the data - oh, we have more than one?

In [76]:
import time

def lin_search(db, attribute, query):
    # Linear scan: compare the attribute of every record with the query.
    for town in db.values():
        if town[attribute] == query:
            yield town

start = time.time()
for result in lin_search(cities, "name", "London"): print(result)
elapsed = time.time() - start
print(f"took {elapsed:.2g}s")


{'geonameid': '6058560', 'name': 'London', 'asciiname': 'London', 'alternatenames': 'Landona,London,Londonas,Londono,YXU,leondeon,lndn,lndn  antaryw,londoni,lun dui,lun dun,lwndwn,rondon,Лондон,לונדון,لندن,لندن، انتاریو,لندن، اونٹاریو,ლონდონი,ロンドン,伦敦,런던', 'latitude': '42.98339', 'longitude': '-81.23304', 'feature class': 'P', 'feature code': 'PPL', 'country code': 'CA', 'cc2': '', 'admin1 code': '08', 'admin2 code': '', 'admin3 code': '', 'admin4 code': '', 'population': '346765', 'elevation': '', 'dem': '252', 'timezone': 'America/Toronto', 'modification date': '2012-08-19'}
{'geonameid': '2643743', 'name': 'London', 'asciiname': 'London', 'alternatenames': 'ILondon,LON,Lakana,Landan,Landen,Ljondan,Llundain,Lodoni,Londain,Londan,Londar,Londe,Londen,Londin,Londinium,Londino,Londn,London,London osh,Londona,Londonas,Londoni,Londono,Londons,Londonu,Londra,Londres,Londrez,Londri,Londro,Londye,Londyn,Londýn,Lonn,Lontoo,Loundres,Luan GJon,Lun-tun,Lunden,Lundra,Lundun,Lundunir,Lundúnir,Lung-d

### Direct Lookup

When reading the data, we stored the records in a dictionary keyed by `geonameid`. If we know the id, the lookup is a single hash-table access instead of a scan over all records, which is roughly 100x faster than looking through the entire dictionary:

In [82]:
import time
start = time.time()
cities[2643743]
elapsed = time.time() - start
print(f"took {elapsed:.2g}s")

took 3.1e-05s
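
A single dictionary access is so fast that one `time.time()` measurement largely reflects timer overhead. As a sketch of a more robust comparison, the following uses `timeit` to average over repeated calls. The synthetic records, the sizes, and the helper names `direct_lookup` and `linear_lookup` are assumptions chosen for illustration, not the geonames data:

```python
import timeit

# Synthetic stand-in for the cities dictionary: {id: record}.
db = {i: {'geonameid': str(i), 'name': f'town{i}'} for i in range(50_000)}

def direct_lookup(db, key):
    # Average O(1): one hash-table access.
    return db[key]

def linear_lookup(db, key):
    # O(n): inspect records one by one until the id matches.
    for record in db.values():
        if int(record['geonameid']) == key:
            return record
    return None

target = 49_999  # the last record inserted, i.e. the worst case for the scan
runs = 20
t_direct = timeit.timeit(lambda: direct_lookup(db, target), number=runs) / runs
t_linear = timeit.timeit(lambda: linear_lookup(db, target), number=runs) / runs
print(f"direct: {t_direct:.2g}s per lookup, linear: {t_linear:.2g}s per lookup")
```

The exact ratio depends on the machine and on where the record sits in the scan order, so treat any measured speedup as indicative only.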


### Secondary Index

We want fast lookups by town name, so we create a secondary index: a dictionary that maps each town name to the primary, unique key (geonameid) of every matching record. Because names are not unique, each index value is a set of ids. We store the ids as integers rather than strings to save space.

Execution is roughly 100x faster than with linear search.

In [86]:
import time

name_idx = {}
for town in cities.values():
    geoname_id = int(town['geonameid'])  # avoid shadowing the built-in id()
    # setdefault retrieves the value for key or adds
    # the given default value if the key does not exist.
    name_idx.setdefault(town['name'], set()).add(geoname_id)

def index_search(db, idx, query):
    # Unknown names yield no results instead of raising KeyError.
    for geoname_id in idx.get(query, ()):
        yield db[geoname_id]

start = time.time()
for result in index_search(cities, name_idx, "London"): print(result)
elapsed = time.time() - start
print(f"took {elapsed:.2g}s")


{'geonameid': '6058560', 'name': 'London', 'asciiname': 'London', 'alternatenames': 'Landona,London,Londonas,Londono,YXU,leondeon,lndn,lndn  antaryw,londoni,lun dui,lun dun,lwndwn,rondon,Лондон,לונדון,لندن,لندن، انتاریو,لندن، اونٹاریو,ლონდონი,ロンドン,伦敦,런던', 'latitude': '42.98339', 'longitude': '-81.23304', 'feature class': 'P', 'feature code': 'PPL', 'country code': 'CA', 'cc2': '', 'admin1 code': '08', 'admin2 code': '', 'admin3 code': '', 'admin4 code': '', 'population': '346765', 'elevation': '', 'dem': '252', 'timezone': 'America/Toronto', 'modification date': '2012-08-19'}
{'geonameid': '4119617', 'name': 'London', 'asciiname': 'London', 'alternatenames': 'Haddoxburg,London', 'latitude': '35.32897', 'longitude': '-93.25296', 'feature class': 'P', 'feature code': 'PPL', 'country code': 'US', 'cc2': '', 'admin1 code': 'AR', 'admin2 code': '115', 'admin3 code': '90813', 'admin4 code': '', 'population': '1046', 'elevation': '116', 'dem': '121', 'timezone': 'America/Chicago', 'modificati
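
To make the shape of the index visible, here is a minimal sketch on three made-up records (the ids and the names `toy` and `toy_idx` are illustrative only). It uses `collections.defaultdict` as an alternative to `setdefault`; both build the same name-to-ids mapping:

```python
from collections import defaultdict

# Toy records keyed by id; two of them share a name.
toy = {
    1: {'geonameid': '1', 'name': 'London'},
    2: {'geonameid': '2', 'name': 'Paris'},
    3: {'geonameid': '3', 'name': 'London'},
}

# Secondary index: name -> set of primary keys.
toy_idx = defaultdict(set)
for key, town in toy.items():
    toy_idx[town['name']].add(key)

print(dict(toy_idx))

# Resolve a name via the index, then fetch the records by primary key.
# .get() avoids creating an empty entry for names that are not indexed.
matches = [toy[key] for key in sorted(toy_idx.get('London', ()))]
print([m['geonameid'] for m in matches])
```

The index stores only keys, not copies of the records, so each record still lives in exactly one place.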

## Range Queries

We want to find all towns with a population between 10M and 15M without a costly linear pass.

Solution: use a list of (population, geonameid) tuples, sorted by population. Using binary search, we can find the boundaries of the requested range in O(log n) steps.

In [107]:
# Create the index.
population_idx = []
for town in cities.values():
    population_idx.append((int(town['population']), int(town['geonameid'])))
population_idx.sort()  # Sort first by population, then id

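# Instead of implementing binary search ourselves, use the standard library.
# Note: the key= argument of the bisect functions requires Python 3.10 or newer.
import bisect
# Use operator.itemgetter(0) to extract the population from the (pop, id) tuple.
import operator
import time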
start = time.time()

lower = bisect.bisect_left(population_idx, 10e6, key=operator.itemgetter(0))
upper = bisect.bisect_right(population_idx, 15e6, key=operator.itemgetter(0))
print(f"Found {upper - lower} cities.")
for idx in range(lower, upper):
    megacity = cities[population_idx[idx][1]]
    print(megacity['name'], megacity['population'])

elapsed = time.time() - start
print(f"took {elapsed:.2g}s")

Found 12 cities.
Seoul 10349312
Dhaka 10356500
Moscow 10381222
Wuhan 10392693
Delhi 10927986
Tianjin 11090314
Karachi 11624219
Mexico City 12294193
São Paulo 12400232
Mumbai 12691836
Chengdu 13568357
Istanbul 14804116
took 0.00023s
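
For reference, here is a sketch of what the two `bisect` calls do, written out by hand. The helper names `lower_bound` and `upper_bound` and the toy index are assumptions for illustration. Because this variant does not rely on the `key=` argument, it also runs on Python versions before 3.10:

```python
def lower_bound(pairs, value):
    """Index of the first pair whose population is >= value."""
    lo, hi = 0, len(pairs)
    while lo < hi:
        mid = (lo + hi) // 2
        if pairs[mid][0] < value:
            lo = mid + 1   # everything up to mid is too small
        else:
            hi = mid       # mid may be the first match; keep it in range
    return lo

def upper_bound(pairs, value):
    """Index just past the last pair whose population is <= value."""
    lo, hi = 0, len(pairs)
    while lo < hi:
        mid = (lo + hi) // 2
        if pairs[mid][0] <= value:
            lo = mid + 1   # mid is still inside the range; move right
        else:
            hi = mid
    return lo

# Toy (population, id) index, already sorted by population.
toy_population_idx = [(500, 7), (1_000, 3), (1_000, 9), (2_500, 1), (4_000, 4)]
lower = lower_bound(toy_population_idx, 1_000)
upper = upper_bound(toy_population_idx, 2_500)
in_range = toy_population_idx[lower:upper]
print(in_range)
```

Both bounds are inclusive here, matching the `bisect_left`/`bisect_right` pair used above. Each search halves the remaining interval per step, so a range query costs two binary searches plus the size of the result.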
