# Latent Semantic Analysis

We will use a subset of articles from the New York Times dataset (downloaded from the [UCI repository](https://archive.ics.uci.edu/ml/datasets/Bag+of+Words)).

We start with a set of imports of the packages that we will need.

In [None]:
import gensim
import os
import numpy as np
import matplotlib.pyplot as plt
%matplotlib notebook

## Load the Data

The data file contains one line per article (document). Each line contains a list of the words that are contained in that document, sorted in alphabetical order. The data has already been preprocessed by removing *stopwords* and *punctuation marks*, and it has been converted to *lower case*.

We load the data into the variable `nytimes` using the following piece of code:

In [None]:
datapath = './dat'   # replace with your own data path
nytimes = []
with open(os.path.join(datapath, 'nytimes_30000docs.txt')) as inputfile:
    for line in inputfile:
        nytimes.append(line.lower().split())

**[Task]** How many documents are there in the collection?

In [None]:
print("There are {} documents".format(len(nytimes)))

## Preprocess the Data

As mentioned above, the data has already been preprocessed: stopwords and punctuation marks have been removed, and upper case characters have been converted to lower case. Little preprocessing therefore remains to be done.

**Create the dictionary**

We use the `Dictionary` class from `gensim.corpora` to create the dictionary. A `gensim` dictionary encapsulates the mapping between normalized words and their integer ids. It can be created from a corpus and later pruned according to document frequency (removing rare or very common words via the `filter_extremes()` and `filter_n_most_frequent()` methods), saved to and loaded from disk (via the `save()` and `load()` methods), merged with another dictionary (`merge_with()`), etc. Another important method is `doc2bow()`, which converts a list of words into its bag-of-words representation.

In [None]:
# Create the dictionary
dictionary = gensim.corpora.Dictionary(nytimes)

**[Task]** How many unique tokens are there?

In [None]:
print('The vocabulary size is {}'.format(len(dictionary)))

**Remove high and low-frequency words**

As mentioned above, we can filter out rare and very common words in the collection. As an example, we will remove words that appear in fewer than 4 documents and words that appear in more than $80\%$ of the documents.

In [None]:
dictionary.filter_extremes(no_below=4, no_above=0.8)
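
To make the filtering rule concrete, here is a minimal pure-Python sketch of document-frequency pruning on a toy corpus. The data and the variable names (`docs`, `doc_freq`, `kept`) are made up for illustration, and the sketch ignores the additional `keep_n` cap that `filter_extremes()` also applies.

```python
from collections import Counter

# Toy corpus (hypothetical data): each document is a list of tokens
docs = [
    ["cat", "dog", "fish"],
    ["cat", "dog"],
    ["cat", "bird"],
    ["cat", "dog", "horse"],
]

# Document frequency: number of documents in which each token appears
doc_freq = Counter(token for doc in docs for token in set(doc))

no_below = 2     # keep tokens that appear in at least 2 documents
no_above = 0.8   # keep tokens that appear in at most 80% of the documents

n_docs = len(docs)
kept = sorted(token for token, df in doc_freq.items()
              if df >= no_below and df / n_docs <= no_above)
print(kept)
```

Here "cat" is dropped because it appears in every document, while "fish", "bird" and "horse" are dropped because they appear in only one.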

**[Task]** How many unique tokens are there after the preprocessing step?

In [None]:
print('The vocabulary size is now {}'.format(len(dictionary)))

In `gensim.corpora.Dictionary`, the attribute `token2id` is a Python dictionary that maps each vocabulary word to its token id.

**[Task]** Find out the id assigned to the word "chromosomal".

In [None]:
print('The token id of word "chromosomal" is: {}'.format(dictionary.token2id['chromosomal']))

Similarly, we can recover the word corresponding to a particular token id with `dictionary[token_id]`.

**[Task]** Use the cell below to print the word corresponding to id 6178.

In [None]:
print('The word corresponding to id 6178 is: {}'.format(dictionary[6178]))

Note that we could also use the `gensim.corpora.Dictionary` object to perform other tasks, such as removing the most frequent words. For example, we could use `dictionary.filter_n_most_frequent(25)` to remove the 25 most frequent words.

**Create bag-of-words representation**

To create the bag-of-words (BOW) representation, we use the method `doc2bow()` of `gensim.corpora.Dictionary`.

In [None]:
corpus = [dictionary.doc2bow(doc) for doc in nytimes]

Note that `corpus` is now a list with one entry per document, and each entry is a list of (token id, word count) pairs. For instance, this is the BOW representation of the first document in the corpus:

In [None]:
print(corpus[0])
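
The structure of a BOW vector can be reproduced with a few lines of plain Python. This is only a sketch of the idea: the mapping `token2id` below is hypothetical toy data, and `to_bow` is an illustrative helper, not part of `gensim`.

```python
from collections import Counter

# Hypothetical token-to-id mapping (in the notebook this role is played by dictionary.token2id)
token2id = {"doctor": 0, "infection": 1, "medicine": 2}

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping out-of-vocabulary tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

doc = ["doctor", "medicine", "doctor", "unknownword"]
bow = to_bow(doc, token2id)
print(bow)
```

Only tokens present in the vocabulary contribute, so words that were pruned from the dictionary simply disappear from the representation.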

**Obtain tf-idf representation**

The package `gensim` also provides the class `gensim.models.TfidfModel`, which computes the tf-idf representation as follows.

In [None]:
tfidf_converter = gensim.models.TfidfModel(dictionary=dictionary)
corpus_tfidf = tfidf_converter[corpus]
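
The following sketch shows one common tf-idf weighting on a toy BOW corpus: raw count times $\log_2(N/\mathrm{df})$, followed by scaling to unit L2 norm. We assume this matches the default behaviour of `TfidfModel`, but check the `gensim` documentation before relying on it. The data and the helper `tfidf` are hypothetical.

```python
import math

# Toy BOW corpus (hypothetical data): one list of (token_id, count) pairs per document
bow_corpus = [
    [(0, 2), (1, 1)],
    [(0, 1), (2, 3)],
    [(0, 1), (1, 1), (2, 1), (3, 1)],
    [(3, 2)],
]
n_docs = len(bow_corpus)

# Document frequency of each token id
doc_freq = {}
for doc in bow_corpus:
    for token_id, _ in doc:
        doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

def tfidf(doc):
    """Weight counts by log2(N / df), then scale the vector to unit L2 norm."""
    weights = [(tid, cnt * math.log2(n_docs / doc_freq[tid])) for tid, cnt in doc]
    norm = math.sqrt(sum(w * w for _, w in weights))
    if norm == 0:
        return []
    return [(tid, w / norm) for tid, w in weights if w > 0]

print(tfidf(bow_corpus[0]))
```

In the first toy document, token 1 ends up with a larger weight than token 0 even though it occurs less often, because it appears in fewer documents.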

**[Task]** Print below the tf-idf representation of the first document in the corpus, and compare it to the BOW representation.

In [None]:
print(corpus_tfidf[0])

## Apply LSA

The package `gensim` provides the class `gensim.models.LsiModel`, which internally performs the SVD computations. We set the number of topics to $100$.
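
As a rough illustration of what the SVD step does, the sketch below applies a truncated SVD to a tiny term-document matrix with NumPy. The matrix `X` and the choice `k = 2` are toy assumptions. `LsiModel` uses its own algorithm suited to large sparse corpora, so this is a conceptual sketch rather than its implementation.

```python
import numpy as np

# Toy term-document matrix (hypothetical data): rows are terms, columns are documents
X = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 2.0, 1.0],
])

k = 2  # number of latent topics to keep

# Thin SVD: X = U @ diag(s) @ Vt, with singular values in decreasing order
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the k strongest components
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Each column of doc_topics is one document expressed in the k-dimensional topic space
doc_topics = np.diag(s_k) @ Vt_k

# Rank-k approximation of the original matrix
X_k = U_k @ doc_topics

print(np.round(s, 3))
print(np.round(doc_topics, 3))
```

The columns of `U_k` play the role of topics (weights over terms), and the discarded singular values determine how much of `X` is lost in the rank-`k` approximation.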

We now fit the LSI model to the data.

In [None]:
lsi_model = gensim.models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=100)

**[Warning]** Fitting LSI to this dataset on a laptop should take about one minute at most. If it takes longer than 5 minutes on your computer, you may use the dataset with 5000 documents instead of the dataset with 30000 documents. Just replace the file name in the data-loading cell and re-run the notebook.

**[Note]** We can alternatively factorize the term-document matrix of raw counts, rather than the tf-idf representation. Both approaches are valid.

**[Task]** Print the $5$ dominant topics using the method `print_topics(num_topics=num_topics, num_words=num_words)` of the `lsi_model` object. You may display $10$ words per topic. Read the words in the displayed topics. Does the result make sense? Can you think of a "title" that summarizes each topic?

In [None]:
lsi_model.print_topics(num_topics=5, num_words=10)

**[Note]** We will see in the next lab session that the topics from LDA are more interpretable.

## Document Retrieval with LSI

We now explore how to compute similarities to a given query in the LSI space. To prepare for (cosine) similarity queries, we first need to index all the documents that we want to compare against subsequent queries. We can do that using `gensim.similarities.MatrixSimilarity` as follows:

In [None]:
corpus_lsi = lsi_model[corpus_tfidf]
doc_similarities = gensim.similarities.MatrixSimilarity(corpus_lsi)
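
For reference, cosine similarity between a query and a set of document vectors can be computed directly with NumPy. The vectors below are hypothetical toy data in a 3-dimensional topic space, and `cosine_similarities` is an illustrative helper, not a `gensim` function.

```python
import numpy as np

# Hypothetical document vectors in a 3-dimensional topic space (one row per document)
toy_docs = np.array([
    [1.0, 0.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 2.0],
])
toy_query = np.array([2.0, 0.0, 0.0])

def cosine_similarities(matrix, vector):
    """Cosine similarity between each row of `matrix` and `vector`."""
    row_norms = np.linalg.norm(matrix, axis=1)
    return matrix @ vector / (row_norms * np.linalg.norm(vector))

toy_sims = cosine_similarities(toy_docs, toy_query)
print(np.round(toy_sims, 3))
```

Cosine similarity depends only on the direction of the vectors, so the first toy document is a perfect match for the query even though their lengths differ.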

We now write a query and then obtain its LSI representation.

In [None]:
# Feel free to replace the text with your own query
query = '''infection doctor medicine antibiotic'''.lower()

# Find the LSI representation of the query (doc2bow --> tf-idf --> LSI)
query_tfidf = tfidf_converter[dictionary.doc2bow(query.split())]
query_lsi = lsi_model[query_tfidf]

We can plot the representation of the query in the LSI space.

In [None]:
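# Plot the LSI representation of the query
# sparse2full returns a dense vector even if some near-zero components were dropped
aux = gensim.matutils.sparse2full(query_lsi, lsi_model.num_topics)
plt.bar(np.arange(len(aux)), aux)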

Now we can compute the similarity of the query to all documents in the collection as follows:

In [None]:
sims = doc_similarities[query_lsi]

Note that `sims` is now an array containing the similarity of the query to every document in the collection.

**[Task]** Print the indices of the 10 documents with the highest similarity to the query, as well as their similarity to the query. *Hint:* You may use `np.argsort` to obtain the indices that would sort an array. However, keep in mind that `argsort` sorts in ascending order!

In [None]:
doc_sorted = np.argsort(-sims)  # Minus sign because argsort sorts in ascending order
for d in range(10):
    print("{:02d}.\t Document {:d}\t (Similarity {:.2f})".format(d+1,
                                                                 doc_sorted[d],
                                                                 sims[doc_sorted[d]]))

**[Task]** Print the first retrieved document and check the words that it contains so that you can get a sense of what the document is about.

In [None]:
print(nytimes[doc_sorted[0]])