# Information Retrieval

Many tasks in information retrieval (IR) and natural language processing (NLP) involve comparing documents for similarity. These tasks include document clustering, retrieving the most relevant documents for a given query, and finding document translation pairs in a large multilingual collection.

Most practical applications of document similarity represent documents in a common vector space that captures document features. Representing documents in a feature space abstracts away from the specific sequence of words used in each document and, with appropriate representations, can also facilitate the analysis of relationships between documents written using different vocabularies.

Here, we will cover one of the fundamental retrieval models, which uses the tf-idf representation of documents. Queries and documents are represented in a space whose dimensions are the vocabulary terms (words, n-grams, stems, phrases, etc.). More specifically, for a document $d$ and a vocabulary $V$ of index terms, this representation is a vector whose components are the tf-idf weights of the index terms.

The tf-idf representation of documents is typically sparse, because each document contains only a subset of the words in the vocabulary. This allows the whole document collection to be represented as a sparse matrix, where the rows are the documents and the columns correspond to the indexed terms.

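Here we will use the following form of the tf-idf representation, where $f_{wd}$ is the number of times term $w$ occurs in document $d$ and $df_w$ is the number of documents that contain $w$: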
$$
\textrm{weight}_{wd} = \frac{f_{wd}}{\sum_{w^\prime} f_{w^\prime d}} \times \log\left(\frac{\textrm{number of documents}}{df_w}\right).
$$
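
To make the formula concrete, here is a minimal sketch that computes these weights by hand in plain Python. The toy documents and the names `docs`, `df`, and `tfidf` are made up for illustration and are not part of the session's data or code.

```python
import math
from collections import Counter

# Toy collection (hypothetical documents, for illustration only)
docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog sat".split(),
    "d3": "the cats and dogs".split(),
}

n_docs = len(docs)
# Document frequency df_w: number of documents that contain term w
df = Counter(w for tokens in docs.values() for w in set(tokens))

def tfidf(tokens):
    """Return a {term: weight} dictionary following the formula above."""
    counts = Counter(tokens)       # raw counts f_wd
    total = sum(counts.values())   # sum of f_w'd over all terms w' in d
    return {w: (f / total) * math.log(n_docs / df[w]) for w, f in counts.items()}

weights = {name: tfidf(tokens) for name, tokens in docs.items()}
print(weights["d1"])
```

In this toy collection, "the" occurs in every document, so its idf factor is $\log(1) = 0$ and its weight vanishes, while a term that appears in a single document, such as "cat", receives the largest idf factor.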

Let's first import the packages that we will use in this session. In particular, we will use `TfidfVectorizer` from scikit-learn, which allows us to obtain the tf-idf representation.

In [None]:
import os
import string
import sklearn.metrics.pairwise
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

## Data

Let's now read the data. We will use a small subset of books extracted from the [Project Gutenberg website](http://www.gutenberg.org/ebooks/).

By default, the code below reads the books in the folder `./gutenberg_subset`. Don't forget to replace the path with the appropriate path on your computer!

In [None]:
datapath = './gutenberg_subset'  # <-- Replace with the appropriate path
books = {}
book_names = []
exclude = set(string.punctuation)
for filename in sorted(os.listdir(datapath)):
    if filename.endswith(".txt"):
        try:
            # Open the text and convert to lower case
            with open(os.path.join(datapath, filename)) as f:
                book = f.read().lower()
            # Remove punctuation
            book = ''.join(ch for ch in book if ch not in exclude)
            # Keep the book in a dictionary indexed by its filename
            books[filename] = book
            book_names.append(filename)
        except (OSError, UnicodeDecodeError):
            # Skip files that cannot be read or decoded
            pass

**[Task]** Find out the number of books in the collection.

In [None]:
print('There are {} books'.format(len(books)))

**[Task]** Print the first 2000 characters of the book whose filename is `12381.txt`.

In [None]:
print(books['12381.txt'][0:2000])

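**Compute tf-idf representation.** For that, we use scikit-learn's `TfidfVectorizer`, which can also remove English stop words automatically (via `stop_words='english'`).

Note that the method `fit_transform` learns the vocabulary from a list of books and returns their tf-idf representation. With its default settings, `TfidfVectorizer` uses a smoothed idf and L2-normalizes each document vector, so the values it produces differ numerically from the formula above.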

In [None]:
# Convert the collection of books into tf-idf vectors:
tfidf_converter = TfidfVectorizer(stop_words='english')
books_tfidf = tfidf_converter.fit_transform([books[name] for name in sorted(books.keys())])

**[Task]** Use the cell below to print the vocabulary size.

In [None]:
print("Number of books: {}".format(books_tfidf.shape[0]))
print("Vocabulary size: {}".format(books_tfidf.shape[1]))

**[Task]** Use the cell below to print a block of the tf-idf matrix (e.g., rows `0:4` and columns `0:1000`). What do all these numbers indicate?

In [None]:
print(books_tfidf[0:4,0:1000])

## Queries

Now let's prepare a query and convert it to tf-idf.

In [None]:
# A query (feel free to type your own text!)
query = ['''how to teach education
           teacher educational'''.lower().translate(str.maketrans('', '', string.punctuation))]
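
Note that `str.translate` expects a translation table rather than a plain string of characters to delete. The following minimal sketch builds such a table with `str.maketrans`. The names `remove_punct`, `raw_query`, and `clean_query` are hypothetical and used only for illustration.

```python
import string

# Translation table that deletes every punctuation character
remove_punct = str.maketrans('', '', string.punctuation)

raw_query = "How to teach? Education, teacher; educational!"  # hypothetical query text
clean_query = raw_query.lower().translate(remove_punct)
print(clean_query)  # how to teach education teacher educational
```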

We now convert the query into its tf-idf representation using `transform`. Important: we now have to use `transform` instead of `fit_transform`. The latter learns the vocabulary and the transformation, while the former simply applies an existing transformation.

**[Task]** Create the variable `query_tfidf` with the tf-idf representation of the query.

In [None]:
# Transform the query to tf-idf representation
query_tfidf = tfidf_converter.transform(query)
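
The difference between the two methods can be seen on a toy collection. This is only a sketch: the texts and the names `toy_docs`, `toy_converter`, `toy_tfidf`, and `toy_query` are made up for illustration, and the vectorizer is used with its default settings (no stop-word removal).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy collection (hypothetical texts, for illustration only)
toy_docs = ["the cat sat on the mat", "the dog chased the cat"]
toy_converter = TfidfVectorizer()
# fit_transform learns the vocabulary and the idf weights from toy_docs
toy_tfidf = toy_converter.fit_transform(toy_docs)

# transform reuses that vocabulary: "zebra" never occurred in toy_docs,
# so it has no column and is ignored
toy_query = toy_converter.transform(["cat zebra"])

print(toy_tfidf.shape)  # (number of documents, vocabulary size)
print(toy_query.shape)  # (1, same vocabulary size)
print(toy_query.nnz)    # number of non-zero entries in the query vector
```

Because the query vector has the same number of columns as the book matrix, the two can be compared directly.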

Now we compute the cosine distance between the query and the vector representation of all books. Then we sort the results by distance and print the top 10 retrieved books.

In [None]:
# Compute cosine distance between the query and the books
cosine_distances = sklearn.metrics.pairwise.cosine_distances(books_tfidf, query_tfidf)
# Flatten the distances into an array
cosine_distances = cosine_distances.ravel()
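
For reference, the cosine distance between two vectors is one minus the cosine of the angle between them, so it is 0 for vectors that point in the same direction and 1 for orthogonal ones. A minimal plain-Python sketch follows. The function `cosine_distance` and the toy vectors are hypothetical, and the sketch assumes both vectors are non-zero.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v) for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical toy vectors
query_vec = [1.0, 0.0, 1.0]
print(cosine_distance(query_vec, [2.0, 0.0, 2.0]))  # same direction: close to 0
print(cosine_distance(query_vec, [0.0, 3.0, 0.0]))  # orthogonal: 1
```

Since the distance depends only on direction, a short query can be close to a long book as long as their tf-idf weights are concentrated on the same terms.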

**[Task]** Create the variable `cosine_distances_sorted_idx` with the indices needed to sort the cosine distances.

In [None]:
# Sort cosine distances (ascending, so the closest books come first):
cosine_distances_sorted_idx = np.argsort(cosine_distances)
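
As a small illustration of what `np.argsort` returns, consider the following sketch. The array `toy_distances` is hypothetical.

```python
import numpy as np

toy_distances = np.array([0.9, 0.2, 0.5])  # hypothetical distances
order = np.argsort(toy_distances)          # indices from smallest to largest distance
print(order)                 # [1 2 0]
print(toy_distances[order])  # [0.2 0.5 0.9]
```

The result holds positions rather than values, which is why it can be used to index both the array of distances and the list of book names.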

Now we print the first 10 retrieved documents:

In [None]:
# Print the first 10 retrieved books
for i in range(10):
    print('{:02d}.\t File "{:s}" \t(Distance = {:.2f})'.format(i+1, book_names[cosine_distances_sorted_idx[i]],
                                                               cosine_distances[cosine_distances_sorted_idx[i]]))

**[Task]** Open the first two retrieved books using a text editor. Check the titles of the books.

**[Task]** Use the code above to retrieve the Declaration of Independence of the United States.

**[Solution]** You may use the following query:

```python
query = ['''declaration independence united states
           america justice welfare liberty'''.lower().translate(str.maketrans('', '', string.punctuation))]
```