# Text Indexing

*Information Retrieval* is about *selecting* the documents from a (possibly large) collection that match the *query*, and *ranking* the result set according to relevance. One obvious use case is internet search engines, but efficient and precise search is needed in many other applications.

## Selection

The first part of retrieval is finding the documents that match the query, ignoring their relative relevance for now.

One obvious way to do this is what we have already done for the secondary indexing case with the global city database. We created an *inverted index* or *inverted file*: a dictionary that records the set of documents (cities) that match a search term.

This is similar to the index at the back of a book, which maps index terms to pages:

<img src="figures/book-index.png" alt="Index" width="400"/>

In [5]:
documents = {
    1: "a donut on a glass plate",
    2: "only the donut",
    3: "listen to the drum machine",
}

def extract_terms(document):
    yield from document.split(' ')

def build_full_text_index(corpus):
    """Map each term to the set of ids of the documents that contain it."""
    index = dict()
    for doc_id, document in corpus.items():
        for term in extract_terms(document):
            index.setdefault(term, set()).add(doc_id)
    return index

index = build_full_text_index(documents)
index

{'a': {1},
 'donut': {1, 2},
 'on': {1},
 'glass': {1},
 'plate': {1},
 'only': {2},
 'the': {2, 3},
 'listen': {3},
 'to': {3},
 'drum': {3},
 'machine': {3}}

### Querying

We can query the index using a collection of search terms. Conveniently, we use the same term extractor on the query as we used for indexing.

In [13]:
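def query_index(index, query):
    """Return the ids of the documents that contain every term of the query."""
    result_set = None
    for term in extract_terms(query):
        results = index.get(term, set())
        # The first term seeds the result; each further term narrows it down.
        result_set = results if result_set is None else result_set.intersection(results)
    # If the extractor yields no terms at all, return an empty set rather than None.
    return result_set if result_set is not None else set()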

query_index(index, "the")

{2, 3}
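
With more than one search term, the loop intersects the sets of matching documents, so only documents that contain every term survive. The following is a minimal, self-contained sketch of that behaviour. It builds a small index by hand instead of reusing the notebook state, and the name `query_all_terms` is hypothetical.

```python
# Hand-built inverted index covering a few terms of the example documents.
index = {
    "donut": {1, 2},
    "the": {2, 3},
    "drum": {3},
}

def query_all_terms(index, query):
    """Return the ids of documents that contain every term of the query."""
    result_set = None
    for term in query.split(" "):
        postings = index.get(term, set())
        # The first term seeds the result; later terms can only shrink it.
        result_set = postings if result_set is None else result_set & postings
    return result_set if result_set is not None else set()

print(query_all_terms(index, "the donut"))  # {2}
print(query_all_terms(index, "the piano"))  # set()
```

A term that is missing from the index contributes an empty set, so a single unknown word empties the whole result.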

### Term Extraction: Filtering & Expansion

We are probably not very interested in documents containing words that appear everywhere, such as `a` and `the`. Also, we'd like our index to treat upper- and lowercase variants of a word as the same term.

A more sophisticated term extractor would likely perform additional filtering and expansions:
  * transform words to their stems ("donuts" --> "donut")
  * add synonyms ("donut" --> "pastry")
  * expand acronyms ("HTML" --> "hypertext markup language")
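
The expansion steps listed above could be sketched roughly as follows. The stemming rule (strip one trailing "s") and the `SYNONYMS` and `ACRONYMS` tables are toy assumptions for illustration only; a real system would rely on a proper stemmer and curated dictionaries. The names `naive_stem` and `expand_terms` are hypothetical.

```python
# Toy lookup tables; purely illustrative assumptions.
SYNONYMS = {"donut": ["pastry"]}
ACRONYMS = {"HTML": ["hypertext", "markup", "language"]}

def naive_stem(word):
    """Very crude stemming: strip a single trailing 's' from longer words."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word

def expand_terms(document):
    """Yield stemmed terms plus any acronym or synonym expansions."""
    for word in document.split():
        # Acronyms are looked up on the raw word, since case matters here.
        yield from ACRONYMS.get(word, [])
        stem = naive_stem(word)
        yield stem
        yield from SYNONYMS.get(stem, [])

print(list(expand_terms("two donuts and HTML")))
# ['two', 'donut', 'pastry', 'and', 'hypertext', 'markup', 'language', 'HTML']
```

The crude rule also mangles words such as "glass" (which becomes "glas"), which is one reason real stemmers use more careful rules.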

#### Exercise A

Add stop-word removal and case folding to the term extraction function!

## Index Wikipedia

Can we build an index of English Wikipedia in a reasonable amount of time? Let's download the abstracts of all articles from https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-abstract1.xml.gz.

In [14]:
!curl https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-abstract1.xml.gz -o data/wiki-abstracts.xml.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  118M  100  118M    0     0  4010k      0  0:00:30  0:00:30 --:--:-- 4020k


Let's look at a few lines of the downloaded archive. On Unix-like systems, this is straightforward: `gzcat` (called `zcat` on most Linux distributions) streams a gzipped file to standard output, and `head` limits the output to the first few lines.
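
Where neither tool is available, the same peek can be done with Python's standard library. This is a minimal sketch; the helper name `head_gzip` is hypothetical, and the path in the usage comment is the one written by the `curl` command above.

```python
import gzip
from itertools import islice

def head_gzip(path, n=10):
    """Return the first n lines of a gzipped text file without reading it all."""
    with gzip.open(path, mode="rt", encoding="utf-8") as lines:
        # islice stops after n lines, so only the start of the file is decompressed.
        return [line.rstrip("\n") for line in islice(lines, n)]

# Usage, assuming the archive was saved as data/wiki-abstracts.xml.gz:
# for line in head_gzip("data/wiki-abstracts.xml.gz"):
#     print(line)
```

The notebook itself uses the shell pipeline: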

In [25]:
!gzcat data/wiki-abstracts.xml.gz | head

<feed>
<doc>
<title>Wikipedia: Anarchism</title>
<url>https://en.wikipedia.org/wiki/Anarchism</url>
<abstract>Anarchism is a political philosophy and movement that is skeptical of all justifications for authority and seeks to abolish the institutions it claims maintain unnecessary coercion and hierarchy, typically including nation-states, and capitalism. Anarchism advocates for the replacement of the state with stateless societies and voluntary free associations.</abstract>
<links>
<sublink linktype="nav"><anchor>Etymology, terminology, and definition</anchor><link>https://en.wikipedia.org/wiki/Anarchism#Etymology,_terminology,_and_definition</link></sublink>
<sublink linktype="nav"><anchor>History</anchor><link>https://en.wikipedia.org/wiki/Anarchism#History</link></sublink>
<sublink linktype="nav"><anchor>Pre-modern era</anchor><link>https://en.wikipedia.org/wiki/Anarchism#Pre-modern_era</link></sublink>
<sublink linktype="nav"><anchor>Modern era</anchor><link>https://en.wikipedia.or

Aha, we are interested in the contents of the `<abstract></abstract>` element. Let's read those. The article URL could also serve as a useful identifier for each document. Below, we use a bit of Python XML magic to generate all (url, abstract) tuples in the downloaded archive. You don't need to understand the details (but feel free to dig in).

In [34]:
def read_wikipedia_abstracts(filename):
    """Read data/<filename>.gz and yield a (url, abstract) tuple for each article."""
    import gzip
    from xml.dom import pulldom

    with gzip.open(f'data/{filename}.gz', mode='rt') as xml:
        doc = pulldom.parse(xml)
        url = None
        abstract = None
        for event, node in doc:
            try:
                if event == pulldom.START_ELEMENT and node.tagName == 'url':
                    doc.expandNode(node)
                    url = node.firstChild.data
                elif event == pulldom.START_ELEMENT and node.tagName == 'abstract':
                    doc.expandNode(node)
                    abstract = node.firstChild.data
                    yield url, abstract
            except Exception as e:
                print(f'Error around {url}: {e}')

next(read_wikipedia_abstracts("wiki-abstracts.xml"))


('https://en.wikipedia.org/wiki/Anarchism',
 'Anarchism is a political philosophy and movement that is skeptical of all justifications for authority and seeks to abolish the institutions it claims maintain unnecessary coercion and hierarchy, typically including nation-states, and capitalism. Anarchism advocates for the replacement of the state with stateless societies and voluntary free associations.')

### Exercise 2

Create an index of all Wikipedia abstracts!