# Information Retrieval & Indexing

The goal of _information retrieval_ is to select a small number of results from a much larger _dataset_, according to a _query_. In the case of an internet search engine, the dataset is the collection of all web pages, and the query consists of the search terms entered by the user.

The data structure used to quickly find the matching search results is called an **index**.

Besides finding all matching elements in the dataset, we typically expect the results to be ordered by some measure, for example by relevance. We will only touch on the subject of _ranking_ here, however.

## Datasets
For this tutorial we will use one of three example datasets for testing our code and measuring performance. All cover geographic entities that come with textual information such as their name, but also numeric properties such as their population and coordinates.

### Toy Dataset
For developing and testing our code, using a small toy dataset can be useful:

In [1]:
# Toy dataset
thurgau = {
    0: {'name': 'Romanshorn', 'population': 11556, 'latitude': 47.56586, 'longitude': 9.37869},
    1: {'name': 'Amriswil', 'population': 14313, 'latitude': 47.54814, 'longitude': 9.30327},
    2: {'name': 'Arbon', 'population': 15459, 'latitude': 47.51360, 'longitude': 9.42999},
    3: {'name': 'Weinfelden', 'population': 11893, 'latitude': 47.56638, 'longitude': 9.10588},
    4: {'name': 'Frauenfeld', 'population': 26093, 'latitude': 47.55856, 'longitude': 8.89685},
    5: {'name': 'Kreuzlingen', 'population': 22788, 'latitude': 47.645837, 'longitude': 9.178608},
    6: {'name': 'Egnach', 'population': 4897, 'latitude': 47.54565, 'longitude': 9.37864},
}

### Data from geonames.org

For more serious performance testing, we import two datasets from geonames.org:
  * cities500: About 200k localities (towns etc.) from around the world.
  * allCountries: About 12M entities (towns, lakes, establishments...)

They follow the same layout:

From http://download.geonames.org/export/dump/

```
The main 'geoname' table has the following fields :
---------------------------------------------------
geonameid         : integer id of record in geonames database
name              : name of geographical point (utf8) varchar(200)
asciiname         : name of geographical point in plain ascii characters, varchar(200)
alternatenames    : alternatenames, comma separated, ascii names automatically transliterated, convenience attribute from alternatename table, varchar(10000)
latitude          : latitude in decimal degrees (wgs84)
longitude         : longitude in decimal degrees (wgs84)
feature class     : see http://www.geonames.org/export/codes.html, char(1)
feature code      : see http://www.geonames.org/export/codes.html, varchar(10)
country code      : ISO-3166 2-letter country code, 2 characters
cc2               : alternate country codes, comma separated, ISO-3166 2-letter country code, 200 characters
admin1 code       : fipscode (subject to change to iso code), see exceptions below, see file admin1Codes.txt for display names of this code; varchar(20)
admin2 code       : code for the second administrative division, a county in the US, see file admin2Codes.txt; varchar(80) 
admin3 code       : code for third level administrative division, varchar(20)
admin4 code       : code for fourth level administrative division, varchar(20)
population        : bigint (8 byte int) 
elevation         : in meters, integer
dem               : digital elevation model, srtm3 or gtopo30, average elevation of 3''x3'' (ca 90mx90m) or 30''x30'' (ca 900mx900m) area in meters, integer. srtm processed by cgiar/ciat.
timezone          : the iana timezone id (see file timeZone.txt) varchar(40)
modification date : date of last modification in yyyy-MM-dd format
```

We need to download each of the datasets once:

In [2]:
# Download
#!curl http://download.geonames.org/export/dump/cities500.zip --output data/cities500.zip
#!curl http://download.geonames.org/export/dump/allCountries.zip --output data/allCountries.zip

We read the tab-separated text files inside the downloaded zip archives and produce outputs similar to the toy dataset above, except that all values remain strings.

In [None]:
%pip install tqdm
%pip install ipywidgets

In [4]:
from tqdm.auto import tqdm

def read_geonames(filename):
    """Read the given filename and produce a dictionary from geonameid to a dict per entity."""
    from io import TextIOWrapper
    import zipfile
    import csv
    result = {}
    with zipfile.ZipFile(f'data/{filename}.zip') as myzip:
        with myzip.open(f'{filename}.txt', 'r') as csv_file:
            reader = csv.DictReader(TextIOWrapper(csv_file, 'utf-8'), delimiter='\t', fieldnames=['geonameid', 'name', 'asciiname', 'alternatenames', 'latitude', 'longitude', 'feature class', 'feature code', 'country code', 'cc2', 'admin1 code', 'admin2 code', 'admin3 code', 'admin4 code', 'population', 'elevation', 'dem', 'timezone', 'modification date'])
            for data in tqdm(reader, desc=filename):
                result[int(data['geonameid'])] = data

    return result

cities = read_geonames('cities500')  # 200k
all_entities = read_geonames('allCountries')  # 12M
places = all_entities

cities500: 0it [00:00, ?it/s]

allCountries: 0it [00:00, ?it/s]

### Visualize

Print the size and look at the first entry. We refrain from using Pandas, as we want to code retrieval ourselves.

In [5]:
print(len(places))
next(iter(places.values()))

12410889


{'geonameid': '2986043',
 'name': 'Pic de Font Blanca',
 'asciiname': 'Pic de Font Blanca',
 'alternatenames': 'Pic de Font Blanca,Pic du Port',
 'latitude': '42.64991',
 'longitude': '1.53335',
 'feature class': 'T',
 'feature code': 'PK',
 'country code': 'AD',
 'cc2': '',
 'admin1 code': '00',
 'admin2 code': '',
 'admin3 code': '',
 'admin4 code': '',
 'population': '0',
 'elevation': '',
 'dem': '2860',
 'timezone': 'Europe/Andorra',
 'modification date': '2014-11-05'}
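
All values come back from `csv.DictReader` as strings, unlike the numbers in the toy dataset. A minimal sketch of converting the numeric fields could look as follows; the helper `to_typed` and its choice of fields are assumptions made for this example, not part of the notebook's code.

```python
def to_typed(record):
    """Return a copy of a geonames record with numeric fields converted.

    Hypothetical helper for illustration: empty or missing values become None.
    """
    typed = dict(record)
    for field, convert in (('population', int), ('latitude', float), ('longitude', float)):
        raw = record.get(field, '')
        typed[field] = convert(raw) if raw != '' else None
    return typed


# A shortened version of the record printed above.
sample = {'name': 'Pic de Font Blanca', 'latitude': '42.64991',
          'longitude': '1.53335', 'population': '0', 'elevation': ''}
typed_sample = to_typed(sample)
print(typed_sample['population'], typed_sample['latitude'])
```

The rest of this tutorial keeps the string values and converts them only where needed.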

## Linear Search
Let's find Romanshorn in the data. How long does that take? Note that the larger `places` dataset contains more than one match!

Notes:
  * We can use the `tqdm` library to provide nice progress bars for long-running code.
  * Alternatively, use the `%time` Jupyter magic to measure the time it takes (only within Jupyter).

In [14]:
from tqdm.auto import tqdm

def lin_search(db, attribute, query):
    """Linear search through db, looking for entities with an attribute equal to query."""
    for town in tqdm(db.values()):
        if town[attribute] == query:
            yield town

%time for result in lin_search(cities, "name", "Romanshorn"): print(result)

  0%|          | 0/199727 [00:00<?, ?it/s]

{'geonameid': '2658985', 'name': 'Romanshorn', 'asciiname': 'Romanshorn', 'alternatenames': 'Romanshorn,Romanskhorn,luo man si huo en,Романсхорн,羅曼斯霍恩', 'latitude': '47.56586', 'longitude': '9.37869', 'feature class': 'P', 'feature code': 'PPLA3', 'country code': 'CH', 'cc2': '', 'admin1 code': 'TG', 'admin2 code': '2011', 'admin3 code': '4436', 'admin4 code': '', 'population': '8956', 'elevation': '', 'dem': '401', 'timezone': 'Europe/Zurich', 'modification date': '2013-04-02'}
CPU times: user 45.4 ms, sys: 134 ms, total: 179 ms
Wall time: 590 ms
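
The same scan can be tried on the toy data without `tqdm`. This is a sketch that defines its own small `toy` dictionary and a `lin_search_plain` function (both names are assumptions for this example) so that it runs on its own. Because the function is a generator, no work happens until the results are consumed.

```python
toy = {
    0: {'name': 'Romanshorn', 'population': 11556},
    1: {'name': 'Amriswil', 'population': 14313},
    2: {'name': 'Arbon', 'population': 15459},
}


def lin_search_plain(db, attribute, query):
    """Yield every entity whose attribute equals query, visiting each entity once."""
    for entity in db.values():
        if entity[attribute] == query:
            yield entity


matches = list(lin_search_plain(toy, 'name', 'Arbon'))
print(matches)
no_matches = list(lin_search_plain(toy, 'name', 'Bern'))
print(no_matches)
```

The cost grows linearly with the size of the dataset: every entity is inspected, even if a match was already found early on.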


## Direct Lookup

When reading the data, we conveniently recorded a dictionary with the `geonameid` as key. If we know it, lookup is quick.

Note the use of the `%timeit` magic, which runs the line of code many times to measure accurately how long it takes (only within Jupyter). Variables assigned inside `%timeit` are not kept afterwards, so we perform the lookup once more to print the result.

In [7]:
%timeit places[2658985]
result = places[2658985]
print(result)

21.6 ns ± 0.265 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
{'geonameid': '2658985', 'name': 'Romanshorn', 'asciiname': 'Romanshorn', 'alternatenames': 'Romanshorn,Romanskhorn,luo man si huo en,Романсхорн,羅曼斯霍恩', 'latitude': '47.56586', 'longitude': '9.37869', 'feature class': 'P', 'feature code': 'PPLA3', 'country code': 'CH', 'cc2': '', 'admin1 code': 'TG', 'admin2 code': '2011', 'admin3 code': '4436', 'admin4 code': '', 'population': '8956', 'elevation': '', 'dem': '401', 'timezone': 'Europe/Zurich', 'modification date': '2013-04-02'}
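
Outside Jupyter, the standard library's `timeit` module gives comparable measurements. The sketch below uses a small stand-in dictionary (`lookup_table` is an assumed name) rather than the full `places` data.

```python
import timeit

# Stand-in for the places dictionary: integer ids mapped to small records.
lookup_table = {i: {'geonameid': str(i)} for i in range(100_000)}

# timeit.timeit calls the function `number` times and returns the total seconds.
runs = 100_000
total_seconds = timeit.timeit(lambda: lookup_table[54321], number=runs)
per_lookup_ns = total_seconds / runs * 1e9
print(f'{per_lookup_ns:.1f} ns per lookup')
```

The figure includes the overhead of calling the lambda, so it will likely be higher than what `%timeit` reports for the bare expression.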


## Secondary Index

To get fast lookups by place name, we can create a _secondary index_: a dictionary that maps from place name to the primary, unique key (`geonameid`). Since several places can share a name, the dictionary values are _sets_ of ids.

In [19]:
from tqdm.auto import tqdm

def build_attribute_index(dataset, attribute):
    """Build a secondary index mapping each attribute value to the set of ids having it."""
    index = {}
    for entity_id, entity in tqdm(dataset.items()):
        # Long form of the setdefault call below:
        # if entity[attribute] in index:
        #     matching_ids = index[entity[attribute]]
        # else:
        #     matching_ids = set()
        #     index[entity[attribute]] = matching_ids
        # matching_ids.add(entity_id)

        # setdefault retrieves the value for key or adds
        # the given default value if the key does not exist.
        index.setdefault(entity[attribute], set()).add(entity_id)
    return index

def index_search(dataset, idx, query):
    """Search using the secondary index 'idx', then look up each matching entity in the dataset."""
    # Use .get so that a query without matches yields nothing instead of raising KeyError.
    for entity_id in idx.get(query, ()):
        yield dataset[entity_id]

%time name_idx = build_attribute_index(places, 'name')


  0%|          | 0/12410889 [00:00<?, ?it/s]

CPU times: user 20 s, sys: 1min 1s, total: 1min 21s
Wall time: 2min 23s

To see what such an index looks like, the next two cells show the toy dataset and a name index for it. In this run, entry 5 carries the name 'Frauenfeld' instead of 'Kreuzlingen', so that one name maps to two ids.


In [18]:
thurgau

{0: {'name': 'Romanshorn',
  'population': 11556,
  'latitude': 47.56586,
  'longitude': 9.37869},
 1: {'name': 'Amriswil',
  'population': 14313,
  'latitude': 47.54814,
  'longitude': 9.30327},
 2: {'name': 'Arbon',
  'population': 15459,
  'latitude': 47.5136,
  'longitude': 9.42999},
 3: {'name': 'Weinfelden',
  'population': 11893,
  'latitude': 47.56638,
  'longitude': 9.10588},
 4: {'name': 'Frauenfeld',
  'population': 26093,
  'latitude': 47.55856,
  'longitude': 8.89685},
 5: {'name': 'Frauenfeld',
  'population': 22788,
  'latitude': 47.645837,
  'longitude': 9.178608},
 6: {'name': 'Egnach',
  'population': 4897,
  'latitude': 47.54565,
  'longitude': 9.37864}}

In [17]:
name_idx

{'Romanshorn': {0},
 'Amriswil': {1},
 'Arbon': {2},
 'Weinfelden': {3},
 'Frauenfeld': {4, 5},
 'Egnach': {6}}
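
An equivalent way to build such an index is `collections.defaultdict`, which creates the empty set on first access. This is a sketch of an alternative, not the notebook's implementation; `build_index_defaultdict` and `toy_names` are names made up for this example.

```python
from collections import defaultdict


def build_index_defaultdict(dataset, attribute):
    """Map each attribute value to the set of ids of the entities carrying it."""
    index = defaultdict(set)
    for entity_id, entity in dataset.items():
        index[entity[attribute]].add(entity_id)
    # Convert to a plain dict so that looking up an unknown name raises
    # KeyError instead of silently inserting an empty set.
    return dict(index)


toy_names = {
    4: {'name': 'Frauenfeld'},
    5: {'name': 'Frauenfeld'},
    6: {'name': 'Egnach'},
}
toy_name_idx = build_index_defaultdict(toy_names, 'name')
print(toy_name_idx)
```

Whether `setdefault` or `defaultdict` reads better is largely a matter of taste; both perform one dictionary access per entity.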

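Index creation takes a long time (more than two minutes here), but a lookup then takes well under a millisecond, compared to more than half a second for the linear search through the much smaller `cities` dataset.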

In [20]:
%time results = list(index_search(places, name_idx, "Romanshorn"))
for place in results: print(place)

CPU times: user 12 µs, sys: 160 µs, total: 172 µs
Wall time: 460 µs
{'geonameid': '2658985', 'name': 'Romanshorn', 'asciiname': 'Romanshorn', 'alternatenames': 'Romanshorn,Romanskhorn,luo man si huo en,Романсхорн,羅曼斯霍恩', 'latitude': '47.56586', 'longitude': '9.37869', 'feature class': 'P', 'feature code': 'PPLA3', 'country code': 'CH', 'cc2': '', 'admin1 code': 'TG', 'admin2 code': '2011', 'admin3 code': '4436', 'admin4 code': '', 'population': '8956', 'elevation': '', 'dem': '401', 'timezone': 'Europe/Zurich', 'modification date': '2013-04-02'}
{'geonameid': '7286940', 'name': 'Romanshorn', 'asciiname': 'Romanshorn', 'alternatenames': 'CH4436', 'latitude': '47.56354', 'longitude': '9.35639', 'feature class': 'A', 'feature code': 'ADM3', 'country code': 'CH', 'cc2': '', 'admin1 code': 'TG', 'admin2 code': '2011', 'admin3 code': '4436', 'admin4 code': '', 'population': '11269', 'elevation': '', 'dem': '428', 'timezone': 'Europe/Zurich', 'modification date': '2021-12-03'}
{'geonameid': '

## Range Queries

We want to find all towns with a population between 10M and 15M, but avoid a costly linear pass.

Solution: use a list of (population, geonameid) tuples, sorted by population. Using binary search, we can find the boundaries of the requested range.
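
Before applying this to the full dataset, the idea can be tried on a handful of values. The sketch below uses made-up populations and ids, and avoids the `key=` argument of `bisect` (available from Python 3.10) by comparing against tuples instead; that tuple trick and the name `range_ids` are choices made for this example.

```python
import bisect

# (population, id) tuples, sorted by population and then by id.
pairs = sorted([(11556, 0), (14313, 1), (15459, 2), (11893, 3), (26093, 4), (4897, 6)])


def range_ids(sorted_pairs, lower, upper):
    """Return the ids whose population lies in the closed interval [lower, upper]."""
    # (lower,) sorts before every (lower, id), so bisect_left lands on the
    # first pair with population >= lower.
    start = bisect.bisect_left(sorted_pairs, (lower,))
    # (upper, inf) sorts after every (upper, id), so bisect_right lands just
    # past the last pair with population <= upper.
    end = bisect.bisect_right(sorted_pairs, (upper, float('inf')))
    return [entity_id for _, entity_id in sorted_pairs[start:end]]


print(range_ids(pairs, 11000, 15000))
```

Each of the two binary searches needs only about log2(n) comparisons, so even 12M entries require roughly 24 steps per boundary.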

In [13]:
from tqdm.auto import tqdm

# Create the index.
def build_integer_range_index(dataset, attribute):
    """Build a list of (value, id) tuples for an integer attribute, sorted by value."""
    index = []
    for entity_id, entity in tqdm(dataset.items()):
        # Only include entities that specify a population (no mountains and lakes, please)
        try:
            value = int(entity[attribute])
            if value > 0:
                index.append((value, entity_id))
        except ValueError:
            # Skip entities whose attribute is empty or not an integer.
            pass
    index.sort()  # Sort first by population, then id
    return index


def query_numeric_index(dataset, idx, lower, upper):
    """Yield all entities whose indexed value lies between lower and upper (inclusive)."""
    # Use bisect instead of implementing binary search ourselves.
    # Note: the key= argument of the bisect functions requires Python 3.10 or newer.
    import bisect
    # Use operator.itemgetter(0) to extract the population from the (pop, id) tuple.
    import operator

    # The position of the first entry in idx with a value greater than or equal to lower.
    lower_index = bisect.bisect_left(idx, lower, key=operator.itemgetter(0))
    # The position of the first entry in idx with a value greater than upper.
    upper_index = bisect.bisect_right(idx, upper, key=operator.itemgetter(0))
    for index_key in range(lower_index, upper_index):
        yield dataset[idx[index_key][1]]

%time population_idx = build_integer_range_index(places, 'population')

  0%|          | 0/12410889 [00:00<?, ?it/s]

CPU times: user 3.14 s, sys: 4.19 s, total: 7.33 s
Wall time: 14.2 s


Again, building the index takes a while (about 14 seconds here), but a query then completes in a fraction of a millisecond.

In [25]:
%time results = list(query_numeric_index(places, population_idx, 10e6, 15e6))
for place in results:
    # Only report places, not countries or administrative areas.
    if place['feature class'] == 'P':
        print(place['name'], place['population'])

CPU times: user 150 µs, sys: 28 µs, total: 178 µs
Wall time: 216 µs
Seoul 10349312
Dhaka 10356500
Moscow 10381222
Wuhan 10392693
Delhi 10927986
Tianjin 11090314
Karachi 11624219
Mexico City 12294193
São Paulo 12400232
Mumbai 12691836
Chengdu 13568357
Istanbul 14804116
