## Extracting Topics from Text

In this recipe, we discuss how to identify the topics in a set of documents. Say, for example, there is an online library with multiple departments organized by the kind of book. As a new book comes in, you want to look at its unique keywords or topics, decide which department the book belongs to, and place it accordingly. Topic modeling is handy in this kind of situation.

In [2]:
doc_complete=['I am learning NLP, it is very interesting and exiting it include machine learning and Deep learning',
             'My father is data scientist and he is NLP expert',
             'My sister has good exposure into android development']
doc_complete

['I am learning NLP, it is very interesting and exiting it include machine learning and Deep learning',
 'My father is data scientist and he is NLP expert',
 'My sister has good exposure into android development']

In [3]:
!pip install gensim

Collecting gensim
  Downloading https://files.pythonhosted.org/packages/c2/db/677f0c8a1c49b44e7a999c2fdbcba576017c10d3d77d11c29ee3fa1b291e/gensim-3.8.1-cp35-cp35m-manylinux1_x86_64.whl (24.2MB)
Collecting smart-open>=1.8.1
  Downloading https://files.pythonhosted.org/packages/0c/09/735f2786dfac9bbf39d244ce75c0313d27d4962e71e0774750dc809f2395/smart_open-1.9.0.tar.gz (70kB)
Building wheels for collected packages: smart-open
  Building wheel for smart-open (setup.py) ... done
  Created wheel for smart-open: filename=smart_open-1.9.0-cp35-none-any.whl size=79335 sha256=2f18107b52fd17dda1203cb227cba7d5f54d164482f883879bc847de1deb2891

In [6]:
import nltk
nltk.download('stopwords')
!python -m textblob.download_corpora

[nltk_data] Downloading package stopwords to /home/nbuser/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package brown to /home/nbuser/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt to /home/nbuser/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/nbuser/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/nbuser/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package conll2000 to /home/nbuser/nltk_data...
[nltk_data]   Unzipping corpora/conll2000.zip.
[nltk_data] Downloading package movie_reviews to
[nltk_data]     /home/nbuser/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
Finished.


In [9]:
# preprocessing
from nltk.corpus import stopwords
from textblob import Word
import string

stop = set(stopwords.words('english'))
exclude = set(string.punctuation)

def clean(doc):
    # lowercase, split on whitespace, and drop English stop words
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    # strip punctuation character by character
    punc_free = "".join([j for j in stop_free if j not in exclude])
    # lemmatize each remaining token (TextBlob treats it as a noun by default)
    normalized = " ".join([Word(k).lemmatize() for k in punc_free.split()])
    return normalized
    
doc_clean = [clean(doc).split() for doc in doc_complete]
doc_clean


[['learning',
  'nlp',
  'interesting',
  'exiting',
  'include',
  'machine',
  'learning',
  'deep',
  'learning'],
 ['father', 'data', 'scientist', 'nlp', 'expert'],
 ['sister', 'good', 'exposure', 'android', 'development']]
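
To see what each stage of `clean` contributes, the sketch below reproduces its first two stages with the standard library only. The short `STOP` set is a hypothetical stand-in for NLTK's English stop word list, lemmatization is left out, and `clean_stages` is an illustrative name, not part of the recipe.

```python
import string

# Hypothetical, tiny stop list for illustration only;
# the recipe itself uses NLTK's full English stop word list.
STOP = {"i", "am", "it", "is", "very", "and", "my", "he", "has", "into"}
PUNCT = set(string.punctuation)

def clean_stages(doc):
    """Return the intermediate strings after stop word and punctuation removal."""
    # Stage 1: lowercase, split on whitespace, drop stop words.
    stop_free = " ".join(w for w in doc.lower().split() if w not in STOP)
    # Stage 2: drop punctuation character by character.
    punc_free = "".join(ch for ch in stop_free if ch not in PUNCT)
    return stop_free, punc_free

# Second document from the recipe.
print(clean_stages("My father is data scientist and he is NLP expert")[1])
# father data scientist nlp expert

# A made-up sentence where punctuation is attached to a stop word.
print(clean_stages("NLP is fun, is it?")[1])
# nlp fun it
```

Because stop words are filtered before punctuation is stripped, a stop word with punctuation attached (`it?`) is not recognized and survives as `it`. Stripping punctuation first would be one way to avoid this, if it matters for your corpus.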

In [43]:
# preparing document term matrix
import gensim 
from gensim import corpora

# build the term dictionary of the corpus, where every unique term is assigned an integer id
dictionary = corpora.Dictionary(doc_clean)
# show the (id, token) pairs
print(list(dictionary.items()))
# convert the list of documents (corpus) into a document-term matrix using the dictionary prepared above
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
doc_term_matrix

[(11, 'android'), (13, 'exposure'), (4, 'learning'), (2, 'include'), (8, 'expert'), (6, 'nlp'), (10, 'scientist'), (5, 'machine'), (3, 'interesting'), (14, 'good'), (1, 'exiting'), (15, 'sister'), (9, 'father'), (0, 'deep'), (7, 'data'), (12, 'development')]


[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 3), (5, 1), (6, 1)],
 [(6, 1), (7, 1), (8, 1), (9, 1), (10, 1)],
 [(11, 1), (12, 1), (13, 1), (14, 1), (15, 1)]]
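
The `(id, count)` pairs above can be reproduced in plain Python, which may make the encoding easier to follow. This is a sketch of the idea, not gensim's implementation: assigning ids in sorted order within each document is an assumption chosen because it matches the ids printed above, and `to_bow` is an illustrative helper name.

```python
from collections import Counter

# Cleaned documents, copied from the preprocessing output.
doc_clean = [
    ['learning', 'nlp', 'interesting', 'exiting', 'include',
     'machine', 'learning', 'deep', 'learning'],
    ['father', 'data', 'scientist', 'nlp', 'expert'],
    ['sister', 'good', 'exposure', 'android', 'development'],
]

# Assign each new token the next free integer id
# (assumption: sorted order within each document).
token2id = {}
for doc in doc_clean:
    for token in sorted(set(doc)):
        if token not in token2id:
            token2id[token] = len(token2id)

def to_bow(doc):
    """Encode one document as a sorted list of (token_id, count) pairs."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

bow_corpus = [to_bow(doc) for doc in doc_clean]
for row in bow_corpus:
    print(row)
# [(0, 1), (1, 1), (2, 1), (3, 1), (4, 3), (5, 1), (6, 1)]
# [(6, 1), (7, 1), (8, 1), (9, 1), (10, 1)]
# [(11, 1), (12, 1), (13, 1), (14, 1), (15, 1)]
```

The pair `(4, 3)` in the first row says that token 4 (`learning`) occurs three times in the first document. Token 6 (`nlp`) appears in both the first and second rows because both documents contain it.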

In [25]:
help(dictionary)

Help on Dictionary in module gensim.corpora.dictionary object:

class Dictionary(gensim.utils.SaveLoad, collections.abc.Mapping)
 |  Dictionary encapsulates the mapping between normalized words and their integer ids.
 |  
 |  Notable instance attributes:
 |  
 |  Attributes
 |  ----------
 |  token2id : dict of (str, int)
 |      token -> tokenId.
 |  id2token : dict of (int, str)
 |      Reverse mapping for token2id, initialized in a lazy manner to save memory (not created until needed).
 |  cfs : dict of (int, int)
 |      Collection frequencies: token_id -> how many instances of this token are contained in the documents.
 |  dfs : dict of (int, int)
 |      Document frequencies: token_id -> how many documents contain this token.
 |  num_docs : int
 |      Number of documents processed.
 |  num_pos : int
 |      Total number of corpus positions (number of processed words).
 |  num_nnz : int
 |      Total number of non-zeroes in the BOW matrix (sum of the number of unique
 |      word

In [52]:
# alias the LDA model class from the gensim library
Lda = gensim.models.ldamodel.LdaModel

# train the LDA model on the document-term matrix with 3 topics and 50 passes over the corpus
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word=dictionary, passes=50)

# each topic is printed as a weighted combination of its top words
print(ldamodel.print_topics())

[(0, '0.129*"sister" + 0.129*"good" + 0.129*"exposure" + 0.129*"development" + 0.129*"android" + 0.032*"nlp" + 0.032*"father" + 0.032*"scientist" + 0.032*"data" + 0.032*"expert"'), (1, '0.233*"learning" + 0.093*"deep" + 0.093*"include" + 0.093*"interesting" + 0.093*"machine" + 0.093*"exiting" + 0.093*"nlp" + 0.023*"scientist" + 0.023*"data" + 0.023*"father"'), (2, '0.129*"nlp" + 0.129*"father" + 0.129*"data" + 0.129*"scientist" + 0.129*"expert" + 0.032*"exposure" + 0.032*"good" + 0.032*"development" + 0.032*"android" + 0.032*"sister"')]
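
Each topic string above is a weighted list of words, and the higher-weighted words are the ones that characterize the topic. As a rough, dependency-free way to read them, the sketch below parses the printed strings and keeps the words above a cutoff. The `0.05` cutoff and the helper name `dominant_words` are assumptions made for illustration; in a live session, gensim's `show_topic` returns the `(word, weight)` pairs directly without any parsing.

```python
import re

# Topic strings copied from the print_topics() output above.
topics = [
    (0, '0.129*"sister" + 0.129*"good" + 0.129*"exposure" + 0.129*"development" + '
        '0.129*"android" + 0.032*"nlp" + 0.032*"father" + 0.032*"scientist" + '
        '0.032*"data" + 0.032*"expert"'),
    (1, '0.233*"learning" + 0.093*"deep" + 0.093*"include" + 0.093*"interesting" + '
        '0.093*"machine" + 0.093*"exiting" + 0.093*"nlp" + 0.023*"scientist" + '
        '0.023*"data" + 0.023*"father"'),
    (2, '0.129*"nlp" + 0.129*"father" + 0.129*"data" + 0.129*"scientist" + '
        '0.129*"expert" + 0.032*"exposure" + 0.032*"good" + 0.032*"development" + '
        '0.032*"android" + 0.032*"sister"'),
]

# Matches one term of the form 0.129*"word".
TERM = re.compile(r'([0-9.]+)\*"([^"]+)"')

def dominant_words(topic_string, min_weight=0.05):
    """Return the words whose weight is at least min_weight (an arbitrary cutoff)."""
    pairs = [(word, float(weight)) for weight, word in TERM.findall(topic_string)]
    return [word for word, weight in pairs if weight >= min_weight]

dominant = {topic_id: dominant_words(text) for topic_id, text in topics}
for topic_id, words in dominant.items():
    print(topic_id, words)
```

With this cutoff, topic 0 keeps the words of the third document, topic 1 those of the first, and topic 2 those of the second, so on this toy corpus each topic lines up with one document. LDA training is randomized, so the topic numbering and weights can differ from run to run unless a seed such as `random_state` is fixed.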


In [46]:
help(ldamodel)

Help on LdaModel in module gensim.models.ldamodel object:

class LdaModel(gensim.interfaces.TransformationABC, gensim.models.basemodel.BaseTopicModel)
 |  Train and use Online Latent Dirichlet Allocation (OLDA) models as presented in
 |  `Hoffman et al. :"Online Learning for Latent Dirichlet Allocation" <https://www.di.ens.fr/~fbach/mdhnips2010.pdf>`_.
 |  
 |  Examples
 |  -------
 |  Initialize a model using a Gensim corpus
 |  
 |  .. sourcecode:: pycon
 |  
 |      >>> from gensim.test.utils import common_corpus
 |      >>>
 |      >>> lda = LdaModel(common_corpus, num_topics=10)
 |  
 |  You can then infer topic distributions on new, unseen documents.
 |  
 |  .. sourcecode:: pycon
 |  
 |      >>> doc_bow = [(1, 0.3), (2, 0.1), (0, 0.09)]
 |      >>> doc_lda = lda[doc_bow]
 |  
 |  The model can be updated (trained) with new documents.
 |  
 |  .. sourcecode:: pycon
 |  
 |      >>> # In practice (corpus =/= initial training corpus), but we use the same here for simplicity.
 |   