# Gensim
Gensim is a Python library for topic modelling and for processing large corpora (collections of documents).
To install Gensim, first run the following command in a terminal:
```
pip install -U gensim
```
Alternatively, see the gensim GitHub page for instructions on building and installing from source [here](https://github.com/RaRe-Technologies/gensim).

To get things started we must first import gensim and any other libraries we wish to use later on.

Note that if you use pip to install gensim, you may see the following error if you're using Anaconda:

```
Intel MKL FATAL ERROR: Cannot load libmkl_avx2.so or libmkl_def.so.
```
The Python kernel may also fail to start.

This is caused by numpy being incorrectly configured with MKL (Intel's Math Kernel Library) on some installations.
To remedy this, run:
```
conda install mkl
conda install -f numpy
```

In [1]:
import gensim as gs
from gensim import models
from collections import defaultdict as dd

Now we're going to define a corpus to use as an example. I use an excerpt from East of Eden. 

In [2]:
our_corpus = ["When a child first catches adults out when it first walks into his grave little head that adults do not always have divine intelligence, that their judgments are not always wise, their thinking true, their sentences just his world falls into panic desolation.",
              "The gods are fallen and all safety gone.",
              "And there is one sure thing about the fall of gods:",
              "they do not fall a little;",
              "they crash and shatter or sink deeply into green muck.",
              "It is a tedious job to build them up again;",
              "they never quite shine.",
              "And the child world is never quite whole again.",
              "It is an aching kind of growing."]

We can then do some simple text processing, much as we did earlier using nltk.

In [3]:
# Our self-defined stoplist, small for the purpose of example.
stoplist = ["the", "a", "it", "is", "an", "are", "and", "or", "to", "of", "that"]
# Then we make each word lowercase and "tokenize" words using whitespace
tokens_list = [[word for word in document.lower().split() if word not in stoplist] for document in our_corpus]


Normally we would want to clean up our corpus so that we only keep words that occur more than once. We can
define a function to do so, but we won't use it for our example since our corpus is so small.

In [4]:
# A naive approach to cleaning a corpus, including naive tokenization.
# In practice we would need to process the tokens further.
def clean_corpus(corpus, stoplist=[]):
    naive_tokens =  [[word for word in document.lower().split() if word not in stoplist] for document in corpus]
    # Count token frequencies; alternatively we could use collections.Counter
    freq = dd(int)
    for tokens in naive_tokens:
        for token in tokens:
            freq[token] += 1
    return [[token for token in tokens if freq[token] > 1] for tokens in naive_tokens]

# We won't actually use cleaned_corpus later
cleaned_corpus =  clean_corpus(our_corpus, stoplist=stoplist)
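
The comment in `clean_corpus` mentions `collections.Counter` as an alternative to the `defaultdict`. A minimal sketch of that variant follows; the name `clean_corpus_counter` and the three-sentence `sample` corpus are made up for illustration and are not part of the original notebook.

```python
from collections import Counter

def clean_corpus_counter(corpus, stoplist=()):
    """Hypothetical variant of clean_corpus that counts tokens with Counter."""
    stopset = set(stoplist)
    # Same naive tokenization as before: lowercase, split on whitespace
    tokenized = [[word for word in doc.lower().split() if word not in stopset]
                 for doc in corpus]
    # Counter tallies every token across all documents in one pass
    freq = Counter(token for tokens in tokenized for token in tokens)
    # Keep only tokens that occur more than once in the whole corpus
    return [[token for token in tokens if freq[token] > 1] for tokens in tokenized]

# Illustrative mini-corpus (an assumption, not the East of Eden excerpt)
sample = ["The gods are fallen", "the fall of gods", "they do not fall"]
print(clean_corpus_counter(sample, stoplist=["the", "are", "of"]))
```

Only "gods" and "fall" occur more than once in `sample`, so every other token is dropped.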

Using gensim, we can use the `corpora.Dictionary` class to determine how many unique words are in our corpus.

Note that this uses gensim's dictionary to process words.

In [5]:
our_dict = gs.corpora.Dictionary(tokens_list)
print our_dict

# Compare with the total number of tokens in tokens_list (counting repeats and similar words)
print "Total tokens: {}".format(sum(sum(1 for token in tokens) for tokens in tokens_list))

Dictionary(63 unique tokens: [u'all', u'adults', u'judgments', u'wise,', u'into']...)
Total tokens: 82


Before we continue, here are a couple of notes on the `gensim.corpora.dictionary.Dictionary` class.

The `Dictionary` class is essentially a wrapper around the bag-of-words model; with that in mind, the following methods make sense. Suppose we have a dictionary that we initialized using:
```
c_dict = gensim.corpora.Dictionary(documents, prune_at=20000)
```
* The `prune_at` argument dictates the maximum number of unique words in the instance of the dictionary. Once this limit has been reached, it begins to expel the least frequent words.
* `c_dict.add_documents(documents, prune_at = 20000)` allows us to add more documents to our dictionary and change the `prune_at` parameter.
* `c_dict.doc2bow(document, allow_update=False)` creates a bag-of-words representation from a list of words (tokenized and normalized) and updates the given dictionary if `allow_update` is set to `True`. The bag of words is a list of 2-tuples of the form (token_id, token_count). Note that in gensim this is thought of as a "vector", and these vectors are in a sparse format.
* The Dictionary class implements the methods get(), items(), iteritems(), iterkeys(), and itervalues().
* For a complete list of functions see [here](https://radimrehurek.com/gensim/corpora/dictionary.html)
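
To make the (token_id, token_count) format concrete, here is a minimal pure-Python sketch of the idea behind `doc2bow`. It is not gensim's implementation: `doc2bow_sketch` and `vocab` are hypothetical names, and ids are assigned here in order of first appearance, which may differ from the ids gensim assigns.

```python
def doc2bow_sketch(tokens, token2id, allow_update=False):
    """Hypothetical stand-in for the idea behind Dictionary.doc2bow."""
    counts = {}
    for token in tokens:
        if token not in token2id:
            if not allow_update:
                continue  # unknown tokens are skipped when we may not update
            token2id[token] = len(token2id)  # assign the next free id
        token_id = token2id[token]
        counts[token_id] = counts.get(token_id, 0) + 1
    # Sparse vector: only ids that actually occur, sorted by id
    return sorted(counts.items())

vocab = {}
tokens = "radio is older than television but radio is cheaper".split()
print(doc2bow_sketch(tokens, vocab))                     # empty vocabulary, no update
print(doc2bow_sketch(tokens, vocab, allow_update=True))  # vocabulary grows as we go
```

The first call returns an empty list because the vocabulary knows no tokens yet; the second fills `vocab` and returns one pair per distinct token.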

As an example, if we want to create a bag of words using a different example string we would do the following.

In [6]:
test_doc = "The quick brown fox jumped over the lazy dog. Television, Radio and the Internet are all modern things. Radio is oldest."
test_doc_tokens = [word for word in test_doc.lower().split() if word not in stoplist]
c_dict = gs.corpora.Dictionary()

# bow1 will be empty since we're not updating the dictionary (False is default)
bow1 = c_dict.doc2bow(test_doc_tokens, allow_update=False)
print "Empty list since no update: {}".format(bow1)

# bow2 will not be empty since we are updating the dictionary
bow2 = c_dict.doc2bow(test_doc_tokens, allow_update=True)
print "\n(token_id, token_count) list: \n{}".format(bow2)

# For visual confirmation, we can replace each token_id with the token itself
visual_bow2 = [(c_dict.get(k), v) for (k,v) in bow2]
print "\nA visual bag of words (word, count): \n{}".format(visual_bow2)

Empty list since no update: []

(token_id, token_count) list: 
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 2), (10, 1), (11, 1), (12, 1), (13, 1)]

A visual bag of words (word, count): 
[(u'dog.', 1), (u'brown', 1), (u'lazy', 1), (u'things.', 1), (u'jumped', 1), (u'over', 1), (u'modern', 1), (u'fox', 1), (u'all', 1), (u'radio', 2), (u'internet', 1), (u'quick', 1), (u'oldest.', 1), (u'television,', 1)]


Indeed, the `Dictionary` is more useful than the frequency tracker we built in `clean_corpus` above.
A better implementation (though still lacking) would therefore replace the frequency dictionary with a `corpora.Dictionary`.

In [7]:
# As an aside, the token2id and id2token mappings are available as attributes of the dictionary
print c_dict.token2id
print "\n"
print c_dict.id2token

{u'dog.': 0, u'brown': 1, u'lazy': 2, u'things.': 3, u'jumped': 4, u'over': 5, u'modern': 6, u'fox': 7, u'all': 8, u'radio': 9, u'internet': 10, u'quick': 11, u'oldest.': 12, u'television,': 13}


{0: u'dog.', 1: u'brown', 2: u'lazy', 3: u'things.', 4: u'jumped', 5: u'over', 6: u'modern', 7: u'fox', 8: u'all', 9: u'radio', 10: u'internet', 11: u'quick', 12: u'oldest.', 13: u'television,'}


# Streaming
One of the reasons people use gensim is that it supports streaming documents in.
For example, if we had a corpus with many documents (on the order of millions), it would be far too memory-intensive, and perhaps impossible, to load all the documents into memory to process our corpus of vectors. Instead, it is much more feasible to stream documents in and process them in a queue-like fashion. This is possible because Gensim uses sparse matrix implementations whenever it can.
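
The core of the streaming idea can be sketched with a plain Python generator, independent of gensim. In this sketch `iter_documents` is a hypothetical helper, and `io.StringIO` stands in for a large file on disk; only one line is held in memory at a time while the vocabulary is built.

```python
import io

def iter_documents(fileobj):
    """Yield one whitespace-tokenized document per non-empty line."""
    for line in fileobj:
        tokens = line.lower().split()
        if tokens:  # skip blank lines
            yield tokens

# io.StringIO is an assumption standing in for open('some_large_file.txt')
fake_file = io.StringIO(u"The gods are fallen\n\nAll safety gone\n")

# Consume the stream lazily; the full text is never materialized as a list
vocab_size = len({token for doc in iter_documents(fake_file) for token in doc})
print(vocab_size)
```

Because the generator yields documents one by one, the same loop works unchanged whether the file holds ten lines or ten million.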

To make this simple to see, let's define an iterable that takes in a text file whose documents are separated by newline characters. (We will be using Alice in Wonderland, sourced from [here](https://www.gutenberg.org/ebooks/11).)

Our class will use the concept of a dictionary as discussed above. Additionally, we will use the handy `gensim.parsing.preprocessing` module to process our text a bit.

In [8]:
import gensim.parsing.preprocessing as preproc

class StreamCorpus:
    def __init__(self, filename):
        """
           Takes in a path to a file and builds a gensim.corpora.Dictionary from it
        """
        self.filename = filename
        self.dictionary = self.populate_dictionary()
        
    def __iter__(self):
        with open(self.filename) as fd:
            paragraphs = self.iter_paragraphs(fd)
            for paragraph in paragraphs:
                tokens = self.process_text(paragraph).lower().split(' ')
                yield self.dictionary.doc2bow(tokens)
                
    # Create entries in our dictionary so we can generate a bag-of-words vector later on
    def populate_dictionary(self):
        class_dict = gs.corpora.Dictionary()
        with open(self.filename) as fd:
            paragraphs = self.iter_paragraphs(fd)
            for paragraph in paragraphs:
                tokens = self.process_text(paragraph).lower().split(' ')
                class_dict.add_documents([tokens])
        return class_dict
   
    # Function to give paragraph
    def iter_paragraphs(self, fileobj):
        lines = []
        for line in fileobj:
            if (line == "\n") and lines:
                yield ' '.join(lines)
                lines = []
            else:
                lines.append(line)
        yield ' '.join(lines)

    # Feel free to change how we process text for different results
    def process_text(self, text):
        # Remove stopwords using gensim lib
        text = preproc.remove_stopwords(text)
        # Strip punctuation
        text = preproc.strip_punctuation(text)
        # Get rid of multiple whitespaces
        text = preproc.strip_multiple_whitespaces(text)
        return text
        

In [9]:
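# We can now create our corpus in vector representation, one document at a time.
# Collecting the vectors into a list is fine for a single book; for a very large
# corpus, iterate over `streamed` directly instead.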
streamed = StreamCorpus('alice_in_wonderland.txt')
alice_corpus = [vector for vector in streamed]

# Models
You may be wondering at this point: why use gensim?
Well, instead of switching between two libraries to create a model, we can stay within the gensim environment!

One simple model we can train is the term frequency-inverse document frequency model, tfidf for short.

We will want to take our corpus and make a vector for each document. In the end we will have a list of vectors, which we will pass on to the tfidf model.

In [10]:
# Remember we have already defined our_dict with the corpus, our_corpus
# We just want to create the bag of words lists (list of vectors), and not update the dictionary.
bow_corpus = [our_dict.doc2bow(text) for text in tokens_list]

# Then we can pass this bag-of-words corpus into our tfidf model.
tfidf = models.TfidfModel(bow_corpus)

As an example, we can transform the phrase "tedious aching" by tokenizing it, creating the bag of words vector for "tedious aching", then passing it to our model.

In [11]:
tfidf_pred = tfidf[our_dict.doc2bow("tedious aching".lower().split())]
print "Output is a list of (token_id, tfidf weight): \n{}".format(tfidf_pred)

Output is a list of (token_id, tfidf weight): 
[(49, 0.7071067811865475), (62, 0.7071067811865475)]
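
Both weights come out as roughly 0.7071, that is 1/sqrt(2). Here is a hedged sketch of why, assuming the default weighting is the raw term count times log base 2 of (number of documents / document frequency), followed by normalization to unit length. The function `tfidf_sketch` and the hard-coded document frequencies are illustrative assumptions: "tedious" and "aching" each appear in exactly one of our nine sentences.

```python
import math

def tfidf_sketch(bow, doc_freq, num_docs):
    """Weight a bag-of-words vector as tf * log2(N / df), then L2-normalize."""
    weighted = [(token_id, count * math.log(num_docs / float(doc_freq[token_id]), 2))
                for token_id, count in bow]
    norm = math.sqrt(sum(weight * weight for _, weight in weighted))
    if norm == 0.0:
        return []  # every term occurs in every document, so nothing is informative
    return [(token_id, weight / norm) for token_id, weight in weighted]

# Assumed document frequencies: both tokens appear in 1 of 9 documents
doc_freq = {49: 1, 62: 1}
result = tfidf_sketch([(49, 1), (62, 1)], doc_freq, num_docs=9)
print(result)
```

Since the two tokens have identical counts and document frequencies, they receive equal weight, and normalizing two equal weights to unit length gives 1/sqrt(2) each.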


In [12]:
# We can also use our example from before for Alice in Wonderland
alice_tfidf = models.TfidfModel(alice_corpus)
print alice_tfidf

TfidfModel(num_docs=831, num_nnz=14395)


Gensim also supports a wide variety of other models to use, many of which are more complex than TF-IDF.


You can check them out [here](https://radimrehurek.com/gensim/apiref.html)

# Similarity
Gensim can also help us determine how similar texts are to one another, which is often useful.
Gensim uses cosine similarity for this.

Suppose we wanted to compare the sentence:

"Tiny alice is a mouse in wonderland, running from the red queen. She has a friend called the Mad Hatter and likes cards"


to all of the documents we defined from Alice in Wonderland. We will use the Latent Semantic Indexing (LSI) model as an example of Gensim's support for other models, while showing how this similarity is "queried".

In [13]:
# Our Streamed corpus of Alice in Wonderland already has a dictionary, so we will just use that.
work_dict = streamed.dictionary
query_doc = "Tiny alice is a mouse in wonderland, running from the red queen. She has a friend called the Mad Hatter and likes cards"
# Create our LSI model
alice_lsi = models.LsiModel(alice_corpus, id2word = work_dict, num_topics = 250)
# Process our text using the code we wrote before
tokens = streamed.process_text(query_doc).lower().split()
# Turn our query into a vector using our dictionary (b.o.w style)
query_vector_bow = work_dict.doc2bow(tokens)
# Now use our LSI model to change vector types (almost there!)
query_vector_lsi = alice_lsi[query_vector_bow]


Gensim uses a class called `gensim.similarities.docsim.MatrixSimilarity` to work out similarity between a collection of documents and a query document in memory:

In [14]:
import gensim.similarities as sim

In [15]:
# Set up the similarity query handler
sim_calc = sim.MatrixSimilarity(alice_lsi[alice_corpus])
# Perform a similarity query against the entire alice_corpus
similarities = sim_calc[query_vector_lsi]
similarities = sorted(enumerate(similarities), key = lambda item: -item[1])
print similarities[:5]

[(535, 0.39417034), (55, 0.37248236), (708, 0.35933661), (447, 0.35756776), (438, 0.32280397)]


The above output lists the five highest-ranked documents by similarity, as (document number, cosine similarity) tuples. Recall that cosine similarity lies between -1 and 1.
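
As a reminder of what is being computed, here is a minimal pure-Python sketch of cosine similarity on sparse (id, weight) vectors. It illustrates the formula only and is not how gensim computes it internally; the function name and the toy vectors are made up for illustration.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse vectors given as (id, weight) lists."""
    a, b = dict(vec_a), dict(vec_b)
    # Only ids present in both vectors contribute to the dot product
    dot = sum(weight * b.get(idx, 0.0) for idx, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)

query = [(0, 1.0), (2, 1.0)]
same_direction = [(0, 2.0), (2, 2.0)]  # same direction, different length
unrelated = [(1, 3.0)]                 # shares no ids with the query

print(cosine_similarity(query, same_direction))
print(cosine_similarity(query, unrelated))
```

Vectors pointing in the same direction score close to 1 regardless of their length, while vectors that share no ids score 0.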

If you want to see what documents these are you would have to pull up the corresponding document for our corpus (the non-vector one) since our vector corpus is just a bag of words vector. We have included a (wasteful) function to find the document below for our document-per-paragraph model:

In [18]:
class DocFinder:
    def __init__(self, filename):
        self.filename = filename
    # Function to give paragraph
    def iter_paragraphs(self, fileobj):
        lines = []
        for line in fileobj:
            if (line == "\n") and lines:
                yield ' '.join(lines)
                lines = []
            else:
                lines.append(line)
        yield ' '.join(lines)
    def get_doc(self, num):
        with open(self.filename) as fd:
            paragraphs = self.iter_paragraphs(fd)
            i = 0
            for paragraph in paragraphs:
                if i == num:
                    return paragraph
                i += 1

In [17]:
finder = DocFinder('alice_in_wonderland.txt')
print finder.get_doc(535)

‘Let’s go on with the game,’ the Queen said to Alice; and Alice was
 too much frightened to say a word, but slowly followed her back to the
 croquet-ground.



One use for this type of similarity ranking is to see how similar a user's query is to a collection of documents. We could then suggest certain documents according to the user's query. This isn't as good as Google's PageRank for web searching, but may be passable when searching through collections of documents.