# Week 6: Information Retrieval

For this week's notebook, we'll implement a (very) simple search engine with ranked retrieval. Our scoring model will be based on TF-IDF:

$$ \text{TF-IDF}_{t,d} = \left(1 + \log_{10}(\text{TF}_{t,d})\right) \cdot \log_{10}\left( \frac{N}{\text{DF}_{t}}\right) $$

A term that does not appear in a document contributes zero. For each document, we then sum the TF-IDF scores over the terms in the query:

$$ \text{score}(q,d) = \sum_{t \in q} \text{TF-IDF}_{t,d} $$
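As a quick sanity check of the two formulas, here is a minimal sketch in plain Python. The helper names `tfidf` and `score` are illustrative assumptions rather than part of the notebook's code; the `barley` figures (74 matching documents out of 11,887) are taken from the cell outputs further down.

```python
import math

def tfidf(tf, df, n_docs):
    """Log-scaled TF-IDF for one (term, document) pair; 0 if the term is absent."""
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(float(n_docs) / df)

def score(query_terms, term_counts, doc_freqs, n_docs):
    """Sum TF-IDF over the query terms for a single document."""
    return sum(tfidf(term_counts.get(t, 0), doc_freqs[t], n_docs)
               for t in query_terms if t in doc_freqs)

# A term seen once in a document, and present in 74 of 11887 documents
single = tfidf(1, 74, 11887)   # about 2.206
# The same term seen twice in a document
double = tfidf(2, 74, 11887)   # about 2.870
```

Doubling the term count raises the score by a factor of only about 1.3, which is the point of the log scaling on TF.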

For a small corpus, this is easy to implement in memory. To better reflect real-world use, where many machines are needed to build and serve the index, we'll structure our model as a MapReduce pipeline in three stages:

1. **Map stage**: extract `(word, doc_id)` pairs
2. **Reduce stage 1**: count identical tuples to get `(word, doc_id, `$\text{TF}_{t,d}$`)`
3. **Reduce stage 2**: build word index of sparse TF-IDF vectors

We'll just use Python functions for the basic `map`, `groupby`, and `reduce` primitives, but the same control flow can be adapted to run as a large-scale distributed system.
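To make the control flow concrete before running it on real data, the three stages can be sketched on a toy corpus. Everything here (`toy_docs`, `pairs`, `counts`, `postings`) is a hypothetical illustration, and the last stage keeps raw counts rather than TF-IDF weights to stay short.

```python
import itertools

# Hypothetical toy corpus: doc_id -> list of tokens
toy_docs = {0: ["wheat", "barley", "wheat"], 1: ["barley", "oil"]}

# Map: emit one (word, doc_id) pair per token
pairs = [(w, d) for d, words in sorted(toy_docs.items()) for w in words]

# Shuffle: sort so identical keys are adjacent (groupby only merges runs)
pairs.sort()

# Reduce 1: count identical pairs to get (word, doc_id, tf)
counts = [(w, d, len(list(g)))
          for (w, d), g in itertools.groupby(pairs)]

# Reduce 2: group by word into posting lists of (doc_id, tf)
postings = {w: [(d, tf) for _, d, tf in g]
            for w, g in itertools.groupby(counts, key=lambda t: t[0])}
```

In a distributed setting the sort would be replaced by a shuffle that routes every tuple with the same key to the same reducer.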

In [1]:
import sys, os, time, re
import collections, itertools

import numpy as np
import nltk

import utils; reload(utils)

from IPython.display import display, HTML

For our corpus, we'll use the [Reuters corpus](http://www.nltk.org/book/ch02.html#reuters-corpus) included with NLTK. This corpus consists of about 12,000 short news articles from Reuters in 1987.

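The `paras()` function returns the text paragraph by paragraph, each paragraph being a list of sentences (themselves lists of words). We'll treat each one as an individual article to index.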

In [2]:
corpus = nltk.corpus.reuters
documents = corpus.paras()  # lazy iterator! streams from disk.

## Map Stage

In this stage, we'll make a single pass through the documents. We'll build a document index `doc_id -> text`, and emit tuples of `(word, doc_id)` that we'll use to build the term index.

In [3]:
doc_index = {}
map_output = []

word_count = 0
doc_count = 0
t0 = time.time()
for doc_id, document in enumerate(documents):
  # Flatten the nested sentence lists into a single list of tokens (once)
  words = list(utils.flatten(document))
  for word in words:
    map_output.append((word.lower(), doc_id))
    word_count += 1
  # Store document text
  doc_index[doc_id] = words
  doc_count += 1

print "Map stage completed in %s" % utils.pretty_timedelta(since=t0)
print "Emitted %d words from %d documents" % (word_count, doc_count)

Map stage completed in 0:00:06
Emitted 1720917 words from 11887 documents


In [4]:
map_output[:10]

[(u'asian', 0),
 (u'exporters', 0),
 (u'fear', 0),
 (u'damage', 0),
 (u'from', 0),
 (u'u', 0),
 (u'.', 0),
 (u's', 0),
 (u'.-', 0),
 (u'japan', 0)]

## Reduce Stage 1

In the first reduce stage, we'll collapse duplicate tuples into counts, giving the term frequency of each word in each document. So:

`("foo", 42)`  
`("foo", 42)`  
`("bar", 7)`

will reduce to:

`("foo", 42, 2)`  
`("bar", 7, 1)`
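On a single machine, the same reduction can be sketched with `collections.Counter`, using the three tuples from the example above. This is a shortcut for illustration, not the sort-and-group approach used in the next cell; the names `emitted`, `tf_counts`, and `reduced` are assumptions.

```python
from collections import Counter

emitted = [("foo", 42), ("foo", 42), ("bar", 7)]

# Counter tallies identical (word, doc_id) tuples in one pass, no sort needed
tf_counts = Counter(emitted)

# Flatten back into (word, doc_id, tf) triples, sorted for a stable order
reduced = sorted((w, d, tf) for (w, d), tf in tf_counts.items())
```

The sort-and-group version is closer to what a distributed reducer sees, since each reducer receives all values for a key together.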

In [5]:
reduce_output = []

t0 = time.time()
output_count = 0

keyfunc = lambda (w,d): (w,d)
sorted_map_output = sorted(map_output, key=keyfunc)
for (w,d), group in itertools.groupby(sorted_map_output, key=keyfunc):
  reduce_output.append((w,d,len(list(group))))
  output_count += 1
  
print "Reduce stage 1 completed in %s" % utils.pretty_timedelta(since=t0)
print "Emitted %d tuples" % (output_count,)

Reduce stage 1 completed in 0:00:04
Emitted 904955 tuples


In [6]:
reduce_output[:10]

[(u'!', 5327, 1),
 (u'!', 5354, 1),
 (u'!', 11541, 1),
 (u'"', 0, 7),
 (u'"', 6, 2),
 (u'"', 9, 9),
 (u'"', 10, 2),
 (u'"', 12, 1),
 (u'"', 13, 2),
 (u'"', 14, 3)]

## Reduce Stage 2

In the second reduce stage, we'll build our actual word index. We'll store TF-IDF values, which we can use to compute relevance to a particular query.
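A minimal sketch of this stage on hypothetical data shows how the document frequency falls out of the posting list length. The `triples`, `n_docs`, `raw`, and `index` names are made up for illustration, and a `defaultdict` stands in for the sort-and-group step.

```python
import math
from collections import defaultdict

# Hypothetical stage-1 output: (word, doc_id, tf) triples for a 4-document corpus
triples = [("barley", 0, 1), ("barley", 2, 10), ("oil", 1, 1),
           ("oil", 2, 1), ("oil", 3, 1), ("oil", 0, 1)]
n_docs = 4

# Collect each word's postings; document frequency is the posting list length
raw = defaultdict(list)
for word, doc_id, tf in triples:
    raw[word].append((doc_id, tf))

# Convert raw counts into sparse TF-IDF vectors, ordered by doc_id
index = {}
for word, postings in raw.items():
    idf = math.log10(float(n_docs) / len(postings))
    index[word] = [(d, (1 + math.log10(tf)) * idf) for d, tf in sorted(postings)]
```

Here `oil` appears in all four documents, so its IDF is zero and it adds nothing to any score, while `barley` (two documents) keeps a positive weight.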

In [7]:
doc_frequencies = {}
tfidf_vectors = {}

def tfidf(tf, df, N=doc_count):
  """Compute log-scaled TF-IDF."""
  return (1 + np.log10(tf)) * np.log10(float(N)/df)

t0 = time.time()
output_count = 0

keyfunc = lambda (word, doc_id, tf): word
sorted_reduce_output = sorted(reduce_output, key=keyfunc)
for word, group in itertools.groupby(sorted_reduce_output, key=keyfunc):
  posting_list = sorted([(d,tf) for (w,d,tf) in group])
  df = len(posting_list)
  
  # Convert to TF-IDF score
  tfidf_vec = [(d, tfidf(tf, df)) for (d,tf) in posting_list]
  tfidf_vectors[word] = tfidf_vec
  doc_frequencies[word] = df
  
  output_count += 1
  
print "Reduce stage 2 completed in %s" % utils.pretty_timedelta(since=t0)
print "Index size: %d words" % (output_count,)

Reduce stage 2 completed in 0:00:03
Index size: 31077 words


In [8]:
term = "barley"
print "%d documents for term %s" % (len(tfidf_vectors[term]), term)
print "Sample entries (doc_id, tfidf_score):"
print "\n".join(map(str, tfidf_vectors[term][:10]))

74 documents for term barley
Sample entries (doc_id, tfidf_score):
(460, 3.2582939505509438)
(482, 2.2058405429751424)
(483, 2.2058405429751424)
(497, 2.8698647120623835)
(529, 2.2058405429751424)
(612, 2.2058405429751424)
(615, 4.0699920624274197)
(675, 2.8698647120623835)
(1486, 2.2058405429751424)
(1487, 2.2058405429751424)


## Query Index

Now that we have our TF-IDF vectors, we need an efficient way to query the index. We'll exploit the fact that the vectors are very sparse: rather than scoring every document, we only consider documents that match at least one term in the query.
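The idea can be sketched on a tiny hand-built index. The `toy_index` contents and the `top_k` helper are hypothetical, and `heapq.nlargest` is swapped in for a full sort as one possible way to pick the best `k` candidates.

```python
import heapq
from collections import defaultdict

# Hypothetical sparse index: word -> [(doc_id, tfidf_score), ...]
toy_index = {
    "barley": [(0, 2.0), (3, 1.0)],
    "cyprus": [(3, 3.0), (7, 0.5)],
}

def top_k(query_words, index, k=10):
    """Score only documents that appear in some query term's posting list."""
    scores = defaultdict(float)
    for word in query_words:
        for doc_id, s in index.get(word, []):  # unknown words match nothing
            scores[doc_id] += s
    # nlargest avoids sorting every candidate when k is small
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

results = top_k(["barley", "cyprus", "missing"], toy_index, k=2)
```

Document 3 matches both terms and ranks first; documents that match no query term are never touched at all.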

In [9]:
def get_candidates(query_words):
  candidate_docs = collections.defaultdict(float)
  print "Searching!"
  for word in query_words:
    # Terms missing from the index match no documents; skip them
    # rather than raising a KeyError.
    if word not in tfidf_vectors:
      print "- term \"%s\": not in index" % (word,)
      continue
    matches = tfidf_vectors[word]
    idf = np.log10(float(doc_count)/doc_frequencies[word])
    print "- term \"%s\": %d documents, idf = %.03f" % (word, len(matches), idf)
    # Increment score for each matching doc
    for (doc_id, score) in matches:
      candidate_docs[doc_id] += score

  # Sort by most relevant
  keyfunc = lambda (doc_id, score): score
  return sorted(candidate_docs.iteritems(), key=keyfunc, reverse=True)

## Ten Blue Links

Finally, we'll add some formatting code to make the output look a bit more... [familiar](https://www.google.com/).

In [10]:
results_dir = "results"
if not os.path.isdir(results_dir):
  os.mkdir(results_dir)

def ten_blue_links(query, k=10):
  # Lowercase the query so it matches the lowercased index terms
  query_words = query.lower().split()
  candidates = get_candidates(query_words)
  for i, (doc_id, score) in enumerate(candidates[:k]):
    document = doc_index[doc_id]
    # Write temp result file
    fname = "%s/result_%d.html" % (results_dir, i)
    with open(fname, 'w') as fd:
      print >> fd, " ".join(document)
    # Display nice link format
    link_text = " ".join(document[:20]) + " ..."
    utils.show_search_result(i, doc_id, score, fname, link_text)
    
ten_blue_links("barley cyprus")

Searching!
- term "barley": 74 documents, idf = 2.206
- term "cyprus": 11 documents, idf = 3.034
