<a href="https://colab.research.google.com/github/rahiakela/nlp-research-and-practice/blob/main/getting-started-with-nlp/03-information-retrieval/information_retrieval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Information Retrieval

You might have come across the term information retrieval in the context of
search engines; for example, Google famously started its business by providing a
powerful search algorithm that kept improving over time. The search for information,
however, is a basic need that you may face beyond searching online.

For instance, every time you search for files on your computer, you are
performing a form of information retrieval.

Information search rests on the idea that a document, or a set of documents, is
relevant to a query when its content matches the content of that query. The search
algorithm therefore needs a `documents` data structure that holds the contents of
all available documents to select from.
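
To make the idea of a `documents` data structure concrete, here is a minimal sketch. It is not the search algorithm developed later in this notebook: it assumes a toy dictionary `toy_documents` with made-up contents, and the helper name `rank_by_overlap` is hypothetical. It simply counts how many query words each document shares with the query.

```python
# Toy collection: document IDs mapped to their text (made-up contents).
toy_documents = {
    "1": "history of the dewey decimal classification",
    "2": "automatic retrieval of articles from titles",
    "3": "growth of librarianship in this country",
}


def rank_by_overlap(query, documents):
    """Return (doc_id, score) pairs for documents sharing words with the query."""
    query_words = set(query.lower().split())
    scores = []
    for doc_id, text in documents.items():
        overlap = query_words & set(text.lower().split())
        if overlap:
            scores.append((doc_id, len(overlap)))
    # Highest overlap first; ties are broken by document ID for stable output.
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))


print(rank_by_overlap("retrieval of articles", toy_documents))
```

In this toy example the common word "of" matches every document, which hints at why very frequent words usually need special handling before matching.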



##Setup

In [1]:
import math
import string
from operator import itemgetter

import nltk
from nltk import WordNetLemmatizer, pos_tag, word_tokenize
from nltk.corpus import stopwords
from nltk.stem.lancaster import LancasterStemmer

In [None]:
%%shell

wget https://github.com/ekochmar/Getting-Started-with-NLP/raw/master/cisi.zip

unzip cisi.zip
rm -rf cisi.zip

##Dataset

There are three components to this data:

* documents with their IDs and content (there are 1460 of those, to be precise);
* questions/queries with their IDs and content (there are 112 of those);
* a mapping between the queries and their relevant documents.

First, let's read in the documents from the `CISI.ALL` file and store the result in the `documents` data structure, a dictionary that maps document IDs to their contents:

In [5]:
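# populate the documents dictionary
def read_documents():
  # First pass: join each marker line (".I", ".X", ...) with its continuation lines
  merged = ""
  with open("cisi/CISI.ALL") as f:
    for line in f:
      if line.startswith("."):
        merged += "\n" + line.strip()
      else:
        merged += " " + line.strip()

  # Second pass: collect the text between a ".I" line and the next ".X" line
  documents = {}
  content = ""
  doc_id = ""
  for line in merged.split("\n"):
    if line.startswith(".I"):
      doc_id = line.split(" ")[1].strip()
    elif line.startswith(".X"):
      documents[doc_id] = content
      content = ""
      doc_id = ""
    else:
      # drop the two-character marker and the space after it
      content += line.strip()[3:] + " "

  # Store a trailing record only if one is still open; an unconditional
  # assignment here would add a spurious entry under the empty key ""
  if doc_id:
    documents[doc_id] = content
  return documents

In [6]:
# let's do a sanity check
documents = read_documents()
print(len(documents))
print(documents.get("1"))

1460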
 18 Editions of the Dewey Decimal Classifications Comaromi, J.P. The present study is a history of the DEWEY Decimal Classification.  The first edition of the DDC was published in 1876, the eighteenth edition in 1971, and future editions will continue to appear as needed.  In spite of the DDC's long and healthy life, however, its full story has never been told.  There have been biographies of Dewey that briefly describe his system, but this is the first attempt to provide a detailed history of the work that more than any other has spurred the growth of librarianship in this country and abroad. 
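
The first pass of `read_documents` is easier to follow on a small sample. The sketch below assumes the layout implied by the parsing code: each field starts on a line beginning with a dot marker (`.I`, `.X`, and so on), and any unmarked lines that follow continue that field. The sample record is abridged from the output above; the `.A` marker and the numbers after `.X` are illustrative assumptions, and `merge_marker_lines` is a hypothetical helper name.

```python
import io

# Abridged sample record in the assumed layout (values after .X are made up).
sample = io.StringIO(
    ".I 1\n"
    ".T\n"
    "18 Editions of the Dewey Decimal Classifications\n"
    ".A\n"
    "Comaromi, J.P.\n"
    ".W\n"
    "The present study is a history of the\n"
    "DEWEY Decimal Classification.\n"
    ".X\n"
    "1 5 1\n"
)


def merge_marker_lines(lines):
    """Join each marker line with the unmarked lines that follow it."""
    merged = []
    for line in lines:
        line = line.strip()
        if line.startswith(".") or not merged:
            merged.append(line)       # a new field starts here
        else:
            merged[-1] += " " + line  # continuation of the current field
    return merged


fields = merge_marker_lines(sample)
for field in fields:
    print(field)
```

After merging, every field sits on a single line, so slicing off the first three characters (`line[3:]`) removes the marker and the space after it.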


Second, let's read in the queries from the `CISI.QRY` file and store the result in the `queries` data structure, a dictionary that maps query IDs to their contents:

In [7]:
# populate the queries dictionary
def read_queries():
  # First pass: join each marker line with its continuation lines
  merged = ""
  with open("cisi/CISI.QRY") as f:
    for line in f:
      if line.startswith("."):
        merged += "\n" + line.strip()
      else:
        merged += " " + line.strip()

  # Second pass: keep only the ".T" and ".W" fields of each query
  queries = {}
  content = ""
  query_id = ""
  for line in merged.split("\n"):
    if line.startswith(".I"):
      # a new query starts: store the previous one, if any
      if content:
        queries[query_id] = content
        content = ""
      query_id = line.split(" ")[1].strip()
    elif line.startswith(".W") or line.startswith(".T"):
      # drop the two-character marker and the space after it
      content += line.strip()[3:] + " "

  # store the last query, which has no following ".I" line
  queries[query_id] = content
  return queries

In [8]:
# let's do a sanity check
queries = read_queries()
print(len(queries))
print(queries.get("1"))

112
What problems and concerns are there in making up descriptive titles? What difficulties are involved in automatically retrieving articles from approximate titles? What is the usual relevance of the content of articles to their titles? 


Finally, let's read in the mapping between the queries and the documents. We'll keep it in the `mappings` data structure, a dictionary in which each query ID (key) corresponds to a list of one or more relevant document IDs (value):

In [9]:
# populate the mappings dictionary
def read_mappings():
  mappings = {}
  with open("cisi/CISI.REL") as f:
    for line in f:
      # only the first two columns are used: query ID, then document ID
      voc = line.strip().split()
      query_id = voc[0]
      doc_id = voc[1]
      # append to the query's list, creating the list the first time the ID is seen
      mappings.setdefault(query_id, []).append(doc_id)
  return mappings

In [10]:
# let's do a sanity check
mappings = read_mappings()
print(len(mappings))
print(mappings.keys())
print(mappings.get("1"))

76
dict_keys(['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '37', '39', '41', '42', '43', '44', '45', '46', '49', '50', '52', '54', '55', '56', '57', '58', '61', '62', '65', '66', '67', '69', '71', '76', '79', '81', '82', '84', '90', '92', '95', '96', '97', '98', '99', '100', '101', '102', '104', '109', '111'])
['28', '35', '38', '42', '43', '52', '65', '76', '86', '150', '189', '192', '193', '195', '215', '269', '291', '320', '429', '465', '466', '482', '483', '510', '524', '541', '576', '582', '589', '603', '650', '680', '711', '722', '726', '783', '813', '820', '868', '869', '894', '1162', '1164', '1195', '1196', '1281']


That’s it! You have successfully initialized one dictionary, `documents`, that links
document IDs to the articles' content; another dictionary, `queries`, that links query
IDs to their corresponding texts; and the `mappings` dictionary, which matches query IDs
to the lists of relevant document IDs.
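
As a rough illustration of how the three dictionaries fit together, the sketch below looks up the texts judged relevant to a query and lists the queries that have no relevance judgments. The `toy_*` dictionaries hold made-up contents, and both helper names are hypothetical; with the real data you would pass `documents`, `queries`, and `mappings` instead.

```python
# Toy stand-ins for the three dictionaries (made-up contents).
toy_documents = {
    "10": "cataloguing rules",
    "11": "title indexing",
    "12": "library budgets",
}
toy_queries = {
    "1": "problems with descriptive titles",
    "2": "funding for libraries",
    "3": "a query nobody judged",
}
toy_mappings = {"1": ["11"], "2": ["12", "10"]}


def relevant_texts(query_id, mappings, documents):
    """Return the texts of the documents judged relevant to a query."""
    doc_ids = mappings.get(query_id, [])  # no judgments -> empty list
    return [documents[doc_id] for doc_id in doc_ids if doc_id in documents]


def queries_without_judgments(queries, mappings):
    """Return the IDs of queries that have no entry in the mappings."""
    # Assumes IDs are numeric strings, so they can be sorted as integers.
    return sorted(set(queries) - set(mappings), key=int)


print(relevant_texts("2", toy_mappings, toy_documents))
print(queries_without_judgments(toy_queries, toy_mappings))
```

On the real data, the counts printed above (112 queries but only 76 keys in `mappings`) suggest that some queries have no relevance judgments, which is why the lookup falls back to an empty list.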

##Boolean search algorithm