# Information Retrieval

This notebook demonstrates how to put together a simple retrieval engine in Python. Here we'll focus on boolean retrieval, which works over bags of words. We won't bother with any optimisations, and will just use Python dicts, sets and lists.

Our dataset is from NLTK, using the Reuters document collection.

In [1]:
import nltk
corpus = nltk.corpus.reuters

Take a look at the data, which are text documents like the one below.

In [2]:
print corpus.raw(corpus.fileids()[0])[:1000]

ASIAN EXPORTERS FEAR DAMAGE FROM U.S.-JAPAN RIFT
  Mounting trade friction between the
  U.S. And Japan has raised fears among many of Asia's exporting
  nations that the row could inflict far-reaching economic
  damage, businessmen and officials said.
      They told Reuter correspondents in Asian capitals a U.S.
  Move against Japan might boost protectionist sentiment in the
  U.S. And lead to curbs on American imports of their products.
      But some exporters said that while the conflict would hurt
  them in the long-run, in the short-term Tokyo's loss might be
  their gain.
      The U.S. Has said it will impose 300 mln dlrs of tariffs on
  imports of Japanese electronics goods on April 17, in
  retaliation for Japan's alleged failure to stick to a pact not
  to sell semiconductors on world markets at below cost.
      Unofficial Japanese estimates put the impact of the tariffs
  at 10 billion dlrs and spokesmen for major electronics firms
  said they would virtually halt exports

We'll need to tokenise the documents, remove stop-words and stem the words to form our bag-of-words representation. Here we'll use the PorterStemmer (but be aware there are others in NLTK). 

In [3]:
stopwords = set(nltk.corpus.stopwords.words('english')) # wrap in a set() (see below)
stemmer = nltk.stem.PorterStemmer() 

def extract_terms(doc):
    terms = set()
    for token in nltk.word_tokenize(doc):
        if token not in stopwords: # 'in' and 'not in' operations are much faster over sets than lists
            terms.add(stemmer.stem(token.lower()))
    return terms

Let's test the method using the first document.

In [4]:
doc = corpus.raw(corpus.fileids()[0])
print sorted(extract_terms(doc))

[u'&', u"''", u"'s", u'(', u')', u',', u'.', u'10', u'15.6', u'17', u'1985', u'30', u'300', u'4.9', u'53', u'7.1', u'95', u';', u'>', u'``', u'a', u'abl', u'account', u'action', u'advantag', u'alleg', u'allow', u'also', u'american', u'among', u'analyst', u'and', u'april', u'asia', u'asian', u'ask', u'associ', u'australia', u'australian', u'avow', u'await', u'awar', u'barrier', u'beef', u'below-cost', u'beyond', u'biggest', u'billion', u'block', u'boost', u'broker', u'budget', u'busi', u'businessmen', u'but', u'button', u'call', u'canberra', u'capel', u'capit', u'centr', u'chairman', u'chief', u'co', u'coal', u'commerci', u'complet', u'concern', u'conflict', u'continu', u'correspond', u'cost', u'could', u'countri', u'curb', u'cut', u'damag', u'day', u'defus', u'democrat', u'deputi', u'despit', u'deterior', u'diplomat', u'director-gener', u'disadvantag', u'disput', u'dlr', u'domest', u'due', u'econom', u'economi', u'effort', u'electr', u'electron', u'emerg', u'end', u'eros', u'estim', u'

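We probably want to remove numbers and punctuation, which aren't being caught by the stop list. Some stop words (such as 'and' and 'but') also slip through, because the stop-word test in `extract_terms` runs before the token is lower-cased, so capitalised forms like 'And' and 'But' are not matched. We may also want to be a bit more aggressive about splitting hyphenated words (although take care, as some hyphenated terms might be important). Have a go yourself and see if you can improve the preprocessing to correct for these issues.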
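
As a starting point for that exercise, here is a minimal sketch of a token filter. The helper name `clean_tokens` and its `split_hyphens` flag are illustrative assumptions rather than part of NLTK, and the rule "keep a token only if it contains an ASCII letter" is just one possible heuristic. Lower-casing first also means that a later stop-word test would see 'the' rather than 'The'.

```python
import re

# Assumption: a token is worth keeping only if it contains an ASCII letter.
_HAS_LETTER = re.compile(r'[a-z]')

def clean_tokens(tokens, split_hyphens=False):
    """Lower-case tokens and drop those with no letters (numbers, punctuation)."""
    cleaned = []
    for token in tokens:
        token = token.lower()
        # Optionally break 'below-cost' into 'below' and 'cost'.
        parts = token.split('-') if split_hyphens else [token]
        for part in parts:
            if _HAS_LETTER.search(part):
                cleaned.append(part)
    return cleaned

# A hand-picked sample of tokens like those in the output above.
sample = ['The', 'U.S.', '300', 'mln', '15.6', ',', '``', 'below-cost', "'s"]
print(clean_tokens(sample))
print(clean_tokens(sample, split_hyphens=True))
```

The filtered tokens could then be passed through the stop list and the stemmer as before. Note that a possessive marker like `'s` still survives this filter, so it would need its own rule.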

Now we can apply the term extraction method to all documents in our corpus. (This may take a minute or two.) 

In [5]:
docs = {}
for docid in corpus.fileids():
    terms = extract_terms(corpus.raw(docid))
    docs[docid] = terms

Next we build an inverted index, which transposes the above data structure so that each term is a key and its value is the sorted list of identifiers of the documents containing that term (its postings list).

In [6]:
from collections import defaultdict

inverted_index = defaultdict(list)
for docid, terms in docs.items():
    for term in terms:
        inverted_index[term].append(docid)
        
# need to keep doc lists in sorted order
for term, docids in inverted_index.items():
    docids.sort()
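
To see the transposition on a small scale, here is a sketch using three made-up documents (the identifiers and terms in `toy_docs` are invented for illustration). It also shows one possible variation: visiting the document ids in sorted order means each postings list is built already sorted, so no separate sorting pass is needed.

```python
from collections import defaultdict

# Made-up documents, standing in for the docs dictionary built earlier.
toy_docs = {
    'd1': {'beef', 'export'},
    'd2': {'beef', 'tariff'},
    'd3': {'export', 'tariff', 'taiwan'},
}

toy_index = defaultdict(list)
for docid in sorted(toy_docs):      # ids arrive in order, so each list stays sorted
    for term in toy_docs[docid]:
        toy_index[term].append(docid)

for term in sorted(toy_index):
    print(term, toy_index[term])
```

Each term now maps to the documents that contain it; for instance 'beef' maps to 'd1' and 'd2'.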

Let's try a query term, say *'Taiwanese'*. We can retrieve all documents containing this term (being careful to process the query in the same way as the documents).

In [7]:
inverted_index[stemmer.stem('Taiwanese'.lower())]

[u'test/14826',
 u'test/16214',
 u'test/19040',
 u'training/10299',
 u'training/11813',
 u'training/354',
 u'training/6976',
 u'training/7135',
 u'training/7531',
 u'training/8063',
 u'training/9007',
 u'training/9184']
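
One detail worth knowing when querying a `defaultdict`: indexing with a term that was never seen creates a new empty entry for it as a side effect. A minimal sketch of a lookup that avoids this is below; the helper name `lookup` and the toy index are assumptions made for illustration.

```python
from collections import defaultdict

# A tiny made-up index with a single term.
toy_index = defaultdict(list)
toy_index['beef'].extend(['d1', 'd2'])

def lookup(index, term):
    """Return the postings for term, or an empty list, without modifying index."""
    return index.get(term, [])

print(lookup(toy_index, 'beef'))
print(lookup(toy_index, 'unicorn'))
print('unicorn' in toy_index)   # .get() does not create an entry for the term
```

This keeps the index from filling up with empty postings lists for misspelt or out-of-vocabulary query terms.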

How about a multiple term query? Consider *Taiwanese beef* as our query.

In [8]:
postings1 = inverted_index[stemmer.stem('Taiwanese'.lower())]
postings2 = inverted_index[stemmer.stem('beef'.lower())]
print len(postings1), len(postings2)

12 67


We now have to intersect the two postings lists to implement the conjunctive (AND) query. As the postings are in sorted order, we can do this efficiently with a single merge-style pass over both lists.

In [10]:
def intersect_postings(postings1, postings2):
    i = j = 0
    results = []
    while i < len(postings1) and j < len(postings2):
        if postings1[i] < postings2[j]:
            i += 1
        elif postings1[i] > postings2[j]:
            j += 1
        else:
            results.append(postings1[i])
            i += 1
            j += 1
    return results

Can you see why the postings lists need to be sorted? Now we can test it on our query.

In [11]:
intersect_postings(postings1, postings2)

[u'test/14826']

You might want to think about how to process queries that include more terms, and disjunctions (OR) or negations.
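
As one possible way into that exercise, the sketch below extends the same merge idea to disjunction (OR), to negation in the form `A AND NOT B`, and to conjunctions of more than two terms. The function names and the toy postings lists are assumptions made for illustration, and all inputs are assumed to be sorted lists without duplicates. A bare `NOT B` is not covered: it would need the list of all document ids to subtract from.

```python
from functools import reduce

def intersect(p1, p2):
    """AND: ids present in both sorted lists."""
    i = j = 0
    results = []
    while i < len(p1) and j < len(p2):
        if p1[i] < p2[j]:
            i += 1
        elif p1[i] > p2[j]:
            j += 1
        else:
            results.append(p1[i])
            i += 1
            j += 1
    return results

def union(p1, p2):
    """OR: ids present in either sorted list, without duplicates."""
    i = j = 0
    results = []
    while i < len(p1) and j < len(p2):
        if p1[i] < p2[j]:
            results.append(p1[i])
            i += 1
        elif p1[i] > p2[j]:
            results.append(p2[j])
            j += 1
        else:
            results.append(p1[i])
            i += 1
            j += 1
    # One list is exhausted; whatever remains of the other belongs in the result.
    results.extend(p1[i:])
    results.extend(p2[j:])
    return results

def difference(p1, p2):
    """AND NOT: ids in p1 that do not appear in p2."""
    i = j = 0
    results = []
    while i < len(p1) and j < len(p2):
        if p1[i] < p2[j]:
            results.append(p1[i])
            i += 1
        elif p1[i] > p2[j]:
            j += 1
        else:
            i += 1
            j += 1
    results.extend(p1[i:])
    return results

def intersect_many(postings_lists):
    """AND over any number of lists, starting from the shortest ones."""
    ordered = sorted(postings_lists, key=len)
    return reduce(intersect, ordered) if ordered else []

# Made-up postings lists for three hypothetical terms.
a = ['d1', 'd2', 'd4', 'd7']
b = ['d2', 'd3', 'd7']
c = ['d2', 'd7', 'd9']
print(union(a, b))
print(difference(a, b))
print(intersect_many([a, b, c]))
```

Processing the shortest lists first in `intersect_many` keeps the intermediate results small, since an intersection can never be longer than its shortest input.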