We choose the number of topics ahead of time, even though we don't know what the topics are.
The idea is to identify what is being discussed in a set of documents using Latent Dirichlet Allocation (LDA).

Use LDA to convert a set of research papers into a set of topics. 

In [1]:
!python -m spacy download en_core_web_sm
import spacy

Collecting https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz (37.4MB)
    100% |████████████████████████████████| 37.4MB 8.8MB/s

    Linking successful
    /Users/shivangisareen/anaconda3/lib/python3.6/site-packages/en_core_web_sm
    -->
    /Users/shivangisareen/anaconda3/lib/python3.6/site-packages/spacy/data/en_core_web_sm

    You can now load the model via spacy.load('en_core_web_sm')



In [2]:
nlp=spacy.load('en_core_web_sm')   #load english tokenizer, tagger, parser

In [3]:
!python -m spacy download en

Collecting https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz (37.4MB)
    100% |████████████████████████████████| 37.4MB 3.5MB/s

    Linking successful
    /Users/shivangisareen/anaconda3/lib/python3.6/site-packages/en_core_web_sm
    -->
    /Users/shivangisareen/anaconda3/lib/python3.6/site-packages/spacy/data/en

    You can now load the model via spacy.load('en')



In [4]:
#cleaning text

spacy.load('en')   #loads the model, but the result is not used; the blank English() pipeline below does the tokenizing
from spacy.lang.en import English
parser=English()

def tokenize(text):
    lda_tokens=[]
    tokens=parser(text)
    for token in tokens:
        if token.orth_.isspace():   #spaCy keeps whitespace tokens, so filter them out
            continue
        elif token.like_url:   #does the token resemble a URL?
            lda_tokens.append('URL')
        elif token.orth_.startswith('@'):   #handles such as @username
            lda_tokens.append('SCREEN_NAME')
        else:
            lda_tokens.append(token.lower_)  #attributes ending in an underscore return the string form rather than the integer ID
    return lda_tokens
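
The three rules in tokenize can be tried out without spaCy. The following is only a rough stand-in: it splits on whitespace and uses a simplified URL pattern, so it will not match spaCy's tokenization or its like_url check exactly. The names simple_tokenize and URL_PATTERN are made up for this illustration.

```python
import re

# Simplified URL test (an assumption, much cruder than spaCy's like_url)
URL_PATTERN = re.compile(r'^(https?://|www\.)\S+$', re.IGNORECASE)

def simple_tokenize(text):
    tokens = []
    for raw in text.split():            # split() already discards whitespace
        if URL_PATTERN.match(raw):
            tokens.append('URL')        # replace URLs with a placeholder
        elif raw.startswith('@'):
            tokens.append('SCREEN_NAME')  # replace handles with a placeholder
        else:
            tokens.append(raw.lower())  # everything else is lowercased
    return tokens

example = simple_tokenize('Thanks @yoda see https://example.com NOW')
print(example)
```

Replacing URLs and handles with placeholders keeps them from becoming thousands of one-off vocabulary entries in the topic model.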

We will now use NLTK's WordNet interface, which provides word meanings, synonyms, antonyms, etc.
We also use WordNetLemmatizer to find the root form (lemma) of a word.

In [5]:
import nltk
nltk.download('wordnet')


[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/shivangisareen/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [6]:
from nltk.corpus import wordnet as wm


#WordNet stores only base forms (lemmas), so morphy is applied to the search string to generate a form that is present in WordNet
def get_lemma(word):
    lemma=wm.morphy(word)
    if lemma is None:
        return word
    else:
        return lemma
    
from nltk.stem.wordnet import WordNetLemmatizer
def get_lemma2(word):
    return WordNetLemmatizer().lemmatize(word)
    

In [7]:
#filtering out stopwords

nltk.download('stopwords')
en_stop=set(nltk.corpus.stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/shivangisareen/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [8]:
#define a function to prepare the text for topic modelling

def prepare_text_for_lda(text):
    tokens=tokenize(text)
    tokens=[token for token in tokens if len(token)>4]
    tokens=[token for token in tokens if token not in en_stop]
    tokens=[get_lemma(token) for token in tokens]
    return tokens
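
To see what each filtering step in prepare_text_for_lda does, here is a small sketch that runs without NLTK. STOPWORDS and LEMMAS are tiny made-up stand-ins for the NLTK stopword list and WordNet, and toy_lemma and toy_prepare are hypothetical names that mirror get_lemma and prepare_text_for_lda.

```python
# Made-up stand-ins so the sketch runs without downloading NLTK data
STOPWORDS = {'about', 'their', 'there', 'which', 'would'}
LEMMAS = {'rebels': 'rebel', 'planets': 'planet', 'destroyed': 'destroy'}

def toy_lemma(word):
    # Same fallback as get_lemma: keep the word itself when no lemma is known
    return LEMMAS.get(word, word)

def toy_prepare(tokens):
    tokens = [t for t in tokens if len(t) > 4]          # drop tokens of 4 characters or fewer
    tokens = [t for t in tokens if t not in STOPWORDS]  # drop stopwords
    return [toy_lemma(t) for t in tokens]               # reduce to root forms

sample = ['the', 'rebels', 'would', 'have', 'destroyed',
          'their', 'planets', 'death', 'star']
result = toy_prepare(sample)
print(result)
```

Note that the length filter also removes short content words such as 'star', so the threshold of 4 is worth revisiting for a given data set.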

In [9]:
#Open the data, read it line by line, prepare each line for LDA and add the sampled lines to a list

import random
text_data=[]
with open('starwars.txt') as f:
    for line in f:
        tokens=prepare_text_for_lda(line)
        #random.random() returns a float in [0.0, 1.0), so this keeps roughly 1% of the lines
        #to restrict the amount of data; on a short file it may keep no lines at all
        if random.random() >.99:
            print(tokens)
            text_data.append(tokens)
            

LDA with Gensim:

Now we create a dictionary from the data, convert the data to a bag-of-words (BoW) corpus, and save the dictionary and corpus.

In [10]:
from gensim import corpora
dictionary=corpora.Dictionary(text_data)   #maps each unique token to an integer id
corpus=[dictionary.doc2bow(text) for text in text_data] #convert each token list to a bag-of-words vector of (token_id, count) pairs

import pickle   #the pickle module implements an algorithm for serializing and deserializing a Python object structure.
#pickling converts a Python object hierarchy into a byte stream, and unpickling is the reverse.

with open('corpus.pkl', 'wb') as f:
    pickle.dump(corpus, f)   #write a pickled representation of corpus to the file corpus.pkl (opened in binary write mode)
dictionary.save('dictionary.gensim')
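
The bag-of-words conversion can be pictured with plain Python. This is only an illustration of the idea, not gensim's implementation: build_token_ids and to_bow are hypothetical helpers, and the ids they assign (in order of first appearance) may differ from the ids gensim would choose.

```python
from collections import Counter

def build_token_ids(documents):
    # Assign each unique token an integer id, in order of first appearance
    token2id = {}
    for doc in documents:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(doc, token2id):
    # Count how often each known token occurs; return sorted (token_id, count) pairs
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

docs = [['rebel', 'planet', 'rebel'], ['planet', 'empire']]
token2id = build_token_ids(docs)
bow_corpus = [to_bow(d, token2id) for d in docs]
print(token2id)
print(bow_corpus)
```

Word order is discarded in this representation; only the counts per document remain, which is all LDA uses.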

Suppose we want to find 5 topics in the data:

In [12]:
import gensim
NUM_TOPICS=5
ldamodel=gensim.models.ldamodel.LdaModel(corpus,num_topics=NUM_TOPICS, id2word=dictionary, passes=15)
ldamodel.save('model5.gensim')
topics=ldamodel.print_topics(num_words=5)
for topic in topics:
    print(topic)

ValueError: cannot compute LDA over an empty collection (no terms)
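
This error is raised when the dictionary handed to LdaModel contains no terms. That is the case when text_data is empty or holds only empty token lists, which can happen here if the 1% random sampling keeps no lines from a short file. A minimal check to run before building the dictionary, assuming text_data is a list of token lists as above; has_terms is a hypothetical helper and not part of gensim:

```python
def has_terms(text_data):
    # True if at least one document contains at least one token
    return any(len(doc) > 0 for doc in text_data)

empty_sample = []                         # nothing survived the sampling
small_sample = [['rebel', 'planet'], []]  # one usable document, one empty

print(has_terms(empty_sample))
print(has_terms(small_sample))
```

If the check fails, loosening the sampling threshold or the token length filter are the first things to try.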

pyLDAvis is designed to help users interpret the topics in a topic model. The package extracts information from a fitted LDA topic model to present an interactive web-based visualization.

In [13]:
dictionary=gensim.corpora.Dictionary.load('dictionary.gensim')
with open('corpus.pkl', 'rb') as f:   #close the file once the corpus has been loaded
    corpus=pickle.load(f)
lda=gensim.models.ldamodel.LdaModel.load('model5.gensim')

import pyLDAvis.gensim
lda_display=pyLDAvis.gensim.prepare(lda, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)

ValidationError: 
 * Not all rows (distributions) in topic_term_dists sum to 1.

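Saliency- a measure of how much a term tells you about the topics, i.e. how informative seeing the term is for identifying a topic.

Relevance- a weighted average of the log probability of the word given the topic and the log of that probability normalised by the overall probability of the word (its lift). The weight, lambda, lies between 0 and 1.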
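
A small sketch of the relevance score, assuming the usual LDAvis definition: lambda * log p(w|t) + (1 - lambda) * log(p(w|t) / p(w)), where p(w) is the overall probability of the word. The probabilities below are made up, and the default lam=0.6 is just an illustrative choice.

```python
import math

def relevance(p_word_given_topic, p_word, lam=0.6):
    # Weighted average of the log probability and the log lift of a term
    log_prob = math.log(p_word_given_topic)
    log_lift = math.log(p_word_given_topic / p_word)
    return lam * log_prob + (1 - lam) * log_lift

# A term that is frequent everywhere (lift 1) versus a rarer,
# topic-specific term (lift 10), both within the same topic
common = relevance(p_word_given_topic=0.05, p_word=0.05)
specific = relevance(p_word_given_topic=0.02, p_word=0.002)
print(common, specific)

# With lam=1.0 only the probability within the topic counts
print(relevance(0.05, 0.05, lam=1.0), relevance(0.02, 0.002, lam=1.0))
```

Lowering lambda pushes topic-specific terms up the ranking, while lambda close to 1 ranks terms by their probability within the topic alone.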