In [7]:
# Why do we need topic modeling?

In [8]:
Why do we need topic modeling? A huge amount of textual data surrounds us in unstructured form: news articles,
research papers, social media posts, and so on. We need a way to understand, organize, and label this data
to make informed decisions. Topic modeling is used in applications such as finding similar questions on
Stack Overflow, news flow aggregation and analysis, and recommender systems. All of these focus on finding
the hidden thematic structure in text, on the assumption that every text we write, be it a tweet, a post, or a
research paper, is composed of themes such as sports, physics, or aerospace.


# How to do topic modeling? Latent Dirichlet Allocation

In [None]:
There are many ways to do topic modeling, but we will discuss a probabilistic modeling approach called
Latent Dirichlet Allocation (LDA), introduced by David M. Blei, Andrew Y. Ng, and Michael I. Jordan in 2003.
It extends Probabilistic Latent Semantic Analysis (PLSA), developed in 1999 by Thomas Hofmann. The main
difference is how the per-document topic distribution is treated: LDA places a Dirichlet prior on it.
So let’s jump straight into how LDA works.

In [None]:
Latent: This refers to everything that we don’t know a priori and that is hidden in the data. Here,
the themes or topics that a document consists of are unknown, but they are believed to be present because
the text is generated from those topics.

In [None]:
Dirichlet: It is a ‘distribution of distributions’. Yes, you read that right. But what does this mean? Let’s work
through an example. Suppose a machine produces dice, and we can control whether it always produces a die with
equal weight on all sides or a die biased toward some sides. The machine is a distribution, because it produces
dice of different types. Each die is itself a distribution, because rolling it yields different values. This is
what it means to be a distribution of distributions, and this is what the Dirichlet is. In topic modeling,
Dirichlet distributions govern the distribution of topics in each document and the distribution of words in each
topic. It might not be very clear at this point, but that’s fine, as we will look at it in more detail in a while.
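
The dice-machine analogy can be sketched in a few lines. This is a minimal illustration, not part of the original notebook: the helper `sample_dirichlet` and the concentration values are assumptions chosen for the example, and the Dirichlet draw uses the standard trick of normalizing independent Gamma samples.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)  # fixed seed so repeated runs give the same dice

# Large concentration values: the "machine" makes nearly fair six-sided dice.
fair_machine = [50.0] * 6
# Small concentration values: the "machine" makes dice biased toward a few faces.
biased_machine = [0.1] * 6

for name, alpha in [("fair", fair_machine), ("biased", biased_machine)]:
    die = sample_dirichlet(alpha, rng)                    # one die = one distribution over faces
    rolls = rng.choices(range(1, 7), weights=die, k=10)   # rolling that particular die
    print(name, [round(p, 2) for p in die], rolls)
```

Each call to `sample_dirichlet` returns a different die, and each die is itself a probability distribution over faces, which is the "distribution of distributions" idea.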

In [None]:
Allocation: This means that once we have the Dirichlet distributions, we allocate topics to the documents and
words of the documents to topics.

In [None]:
What LDA essentially says is that each word in each document comes from a topic and the topic is 
selected from a per-document distribution over topics. 
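
The generative story in the sentence above can be sketched as follows. The topics, words, and probabilities are made up for illustration (they are not estimated from any data), and the per-document topic distribution is fixed by hand here, whereas LDA would draw it from a Dirichlet.

```python
import random

rng = random.Random(42)

# Hypothetical topics: each is a distribution over words (values invented for illustration).
topics = {
    "food": {"bacon": 0.4, "avocado": 0.3, "toast": 0.3},
    "mood": {"suddenly": 0.5, "taste": 0.5},
}
# Hypothetical per-document distribution over topics.
doc_topic_dist = {"food": 0.7, "mood": 0.3}

def generate_document(n_words):
    """Generate (topic, word) pairs: pick a topic per word, then a word from that topic."""
    words = []
    for _ in range(n_words):
        # Step 1: pick a topic from the per-document distribution.
        topic = rng.choices(list(doc_topic_dist), weights=list(doc_topic_dist.values()))[0]
        # Step 2: pick a word from that topic's distribution over words.
        word_dist = topics[topic]
        word = rng.choices(list(word_dist), weights=list(word_dist.values()))[0]
        words.append((topic, word))
    return words

print(generate_document(6))
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic assignments and the two sets of distributions.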

In [None]:
One could apply LDA to DNA and nucleotides, pizzas and toppings, molecules and atoms, employees and skills, 
or keyboards and crumbs.

The probabilistic topic model estimated by LDA consists of two tables (matrices). The first table describes 
the probability or chance of selecting a particular part when sampling a particular topic (category).

The second table describes the chance of selecting a particular topic when sampling a particular 
document or composite.


Let’s take an example: *I suddenly have a taste for bacon avocado toast.*

In [None]:

A number of words is associated with each topic, and a number of topics is associated with each document.
From the word frequencies we can say which domain a topic belongs to; similarly,
from the topic frequencies we can say which domain a document belongs to.


# The word/document table factors as word/topic * topic/document
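
The two tables described above can be combined by matrix multiplication: the probability of a word in a document is the sum, over topics, of P(topic | document) times P(word | topic). The numbers below are hypothetical and only illustrate the arithmetic; `matmul` is a small helper written for this sketch.

```python
# Hypothetical numbers for illustration only.
words = ["bacon", "avocado", "taste"]

# Rows: topics, columns: words -> P(word | topic)
word_given_topic = [
    [0.5, 0.4, 0.1],   # topic 0
    [0.1, 0.1, 0.8],   # topic 1
]
# Rows: documents, columns: topics -> P(topic | document)
topic_given_doc = [
    [0.9, 0.1],        # document 0
    [0.2, 0.8],        # document 1
]

def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

# P(word | document) = sum over topics of P(topic | document) * P(word | topic)
word_given_doc = matmul(topic_given_doc, word_given_topic)
for d, row in enumerate(word_given_doc):
    print(d, dict(zip(words, (round(p, 3) for p in row))))
```

Because every row of both input tables sums to 1, every row of the product also sums to 1, so each document still has a valid distribution over words.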

In [9]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to C:\Users\Windows
[nltk_data]     10\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [10]:
! pip install pyLDAvis



In [11]:
! pip install spacy



In [12]:
! pip install biopython



In [13]:
import re
import numpy as np
import pandas as pd
from pprint import pprint

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# spacy for lemmatization
import spacy

# Plotting tools
import pyLDAvis
import pyLDAvis.gensim  # newer pyLDAvis releases rename this module to pyLDAvis.gensim_models
import matplotlib.pyplot as plt
%matplotlib inline

# Enable logging for gensim - optional
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)

import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

from Bio import Medline

In [None]:
A topic is nothing but a collection of dominant keywords that are typical representatives of it. Just by looking
at the keywords, you can identify what the topic is about.

The following are key factors in obtaining well-separated topics:

1. The quality of text processing.
2. The variety of topics the text talks about.
3. The choice of topic modeling algorithm.
4. The number of topics fed to the algorithm.
5. The algorithm’s tuning parameters.

In [14]:
# NLTK Stop words
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

In [15]:
# Function that uses the Medline module from
# the Biopython library to parse and read MEDLINE
# formatted files. Results are stored in a Pandas
# DataFrame
def read_medline_data(filename):
    rows = []
    # Assumes the file is UTF-8 encoded (author names contain accented characters)
    with open(filename, 'r', encoding='utf-8') as handle:
        for rec in Medline.parse(handle):
            try:
                rows.append({"title": rec["TI"],
                             "authors": rec["AU"],
                             "abstract": rec["AB"]})
            except KeyError:
                # Skip records missing a title, author list, or abstract
                continue
    # Build the DataFrame once; DataFrame.append was removed in pandas 2.0
    return pd.DataFrame(rows, columns=["title", "authors", "abstract"])

In [16]:
# Read in MEDLINE formatted text
papers = read_medline_data("1.txt")

In [17]:
papers

Unnamed: 0,title,authors,abstract
0,Depression and Mania in Bipolar Disorder.,"[Tondo L, Vázquez GH, Baldessarini RJ]","BACKGROUND: Episode duration, recurrence rates..."
1,Cognitive Impairment in Bipolar Disorder: Trea...,"[Solé B, Jiménez E, Torrent C, Reinares M, B...","Over the last decade, there has been a growing..."
2,Bipolar disorder: clinical overview.,"[Müller JK, Leweke FM]",Bipolar disorder is a severe psychiatric disor...
3,Diagnosis and treatment of patients with bipol...,"[McCormick U, Murray B, McNew B]",PURPOSE: This review article provides an overv...
4,Bipolar Disorder: Its Etiology and How to Mode...,"[Freund N, Juckel G]",Characterized by the switch of manic and depre...
5,Borderline personality disorder and bipolar di...,"[Paris J, Black DW]",Borderline personality disorder (BPD) and bipo...
6,Bipolar disorder.,"[Smith DJ, Whitham EA, Ghaemi SN]",Bipolar disorder is a serious disorder of mood...
7,Older Age Bipolar Disorder.,"[Dols A, Beekman A]",Further understanding of older age bipolar dis...
8,The relationship between borderline personalit...,"[Zimmerman M, Morgan TA]",It is clinically important to recognize both b...
9,"Update on the Epidemiology, Diagnosis, and Tre...","[Chen P, Dols A, Rej S, Sajatovic M]",PURPOSE OF REVIEW: The population over age 60 ...


In [18]:
# Convert to list
data = papers.title.values.tolist()

# Remove emails (raw strings avoid invalid-escape warnings)
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]

# Collapse runs of whitespace, including new line characters
data = [re.sub(r'\s+', ' ', sent) for sent in data]

# Remove distracting single quotes
data = [re.sub(r"'", "", sent) for sent in data]
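
To see what these three substitutions do, here is a small standalone sketch. The sample string and the `clean_text` wrapper are invented for illustration; the patterns are the same ones used in the cell above.

```python
import re

def clean_text(text):
    """Apply the same three regex clean-up steps to a single string."""
    text = re.sub(r'\S*@\S*\s?', '', text)  # drop email-like tokens
    text = re.sub(r'\s+', ' ', text)        # collapse whitespace, including newlines
    text = re.sub(r"'", "", text)           # drop single quotes
    return text

sample = "Contact jane@example.org   about\nthe patient's  results."
print(clean_text(sample))
# Contact about the patients results.
```

Note that removing the apostrophe turns "patient's" into "patients", which is acceptable here because the later tokenization and lemmatization steps normalize such forms anyway.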

In [19]:
print(data[:1])

['Depression and Mania in Bipolar Disorder.']


In [20]:
print(data[10:11])

['Bipolar disorder.']


In [None]:
After removing emails and extra spaces, the text is still not ready for LDA to consume. You need to break each
sentence down into a list of words through tokenization, clearing up any remaining messy text in the process.

In [21]:
def sent_to_words(sentences):
    for sentence in sentences:
        # deacc=True strips accent marks; tokenization itself drops punctuation
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

data_words = list(sent_to_words(data))

print(data_words[:5])

[['depression', 'and', 'mania', 'in', 'bipolar', 'disorder'], ['cognitive', 'impairment', 'in', 'bipolar', 'disorder', 'treatment', 'and', 'prevention', 'strategies'], ['bipolar', 'disorder', 'clinical', 'overview'], ['diagnosis', 'and', 'treatment', 'of', 'patients', 'with', 'bipolar', 'disorder', 'review', 'for', 'advanced', 'practice', 'nurses'], ['bipolar', 'disorder', 'its', 'etiology', 'and', 'how', 'to', 'model', 'in', 'rodents']]


In [None]:
Gensim’s Phrases model can build bigrams, trigrams, quadgrams, and more. The two important arguments to
Phrases are min_count and threshold. The higher the values of these parameters, the harder it is for words to be
combined into bigrams.
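
To make the effect of min_count and threshold concrete, the sketch below computes a bigram score by hand. The formula is an assumption intended to mirror gensim's default scorer as I understand it (co-occurrence count minus min_count, divided by both word counts, scaled by vocabulary size); the counts are hypothetical, and `bigram_score` is a helper written for this example, not a gensim function.

```python
def bigram_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Phrase score assumed to mirror gensim's default scorer (not the library code itself)."""
    return (count_ab - min_count) / count_a / count_b * vocab_size

# Hypothetical counts for the pair ("bipolar", "disorder")
counts = dict(count_a=40, count_b=45, count_ab=38, vocab_size=500)

for min_count, threshold in [(5, 1), (5, 100), (37, 1)]:
    score = bigram_score(min_count=min_count, **counts)
    # A pair is merged into one token only if its score exceeds the threshold.
    print(min_count, threshold, round(score, 2), score > threshold)
```

With these counts the pair is merged for min_count=5 and threshold=1, but not once either parameter is raised far enough, which matches the statement that higher values make bigrams harder to form.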

In [22]:
# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)  

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

# See trigram example
print(trigram_mod[bigram_mod[data_words[0]]])

['depression', 'and', 'mania', 'in', 'bipolar', 'disorder']


In [23]:
# Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

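def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Lemmatize each document, keeping only tokens whose POS tag is in allowed_postags.

    Uses the global spaCy pipeline `nlp`, which must be loaded before this is called.
    See https://spacy.io/api/annotation
    """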
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

In [25]:
# Remove Stop Words
data_words_nostops = remove_stopwords(data_words)

# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

# Initialize spaCy's 'en_core_web_sm' model, keeping only the tagger component (for efficiency)

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# Do lemmatization keeping only noun, adj, vb, adv
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(data_lemmatized[:1])

[['depression', 'bipolar', 'disorder']]


# The two main inputs to the LDA topic model are the dictionary (id2word) and the corpus.

In [26]:
# Create Dictionary
id2word = corpora.Dictionary(data_lemmatized)

# Create Corpus
texts = data_lemmatized

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# View
print(corpus[:11])

[[(0, 1), (1, 1), (2, 1)], [(0, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)], [(0, 1), (2, 1), (8, 1), (9, 1)], [(0, 1), (2, 1), (7, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1)], [(0, 1), (2, 1), (16, 1), (17, 1), (18, 1)], [(0, 1), (2, 2), (19, 1), (20, 1), (21, 1)], [(0, 1), (2, 1)], [(0, 1), (2, 1), (22, 1), (23, 1)], [(0, 1), (2, 2), (24, 1), (25, 1)], [(0, 1), (2, 1), (7, 1), (11, 1), (22, 1), (23, 1), (26, 1)], [(0, 1), (2, 1)]]


In [27]:
[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]]

[[('bipolar', 1), ('depression', 1), ('disorder', 1)]]
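
The (id, count) pairs above are a bag-of-words encoding. The following plain-Python sketch shows the idea without gensim; `build_bow` is a hypothetical helper, and assigning ids to unseen tokens in sorted order is an assumption chosen so the ids resemble the output above.

```python
from collections import Counter

def build_bow(docs):
    """Map tokens to integer ids and encode each document as sorted (id, count) pairs."""
    token2id = {}
    corpus = []
    for doc in docs:
        # Assign ids to unseen tokens in sorted order (assumption, to resemble the ids above).
        for token in sorted(set(doc)):
            token2id.setdefault(token, len(token2id))
        counts = Counter(token2id[t] for t in doc)
        corpus.append(sorted(counts.items()))
    return token2id, corpus

docs = [["depression", "bipolar", "disorder"],
        ["bipolar", "disorder", "clinical", "overview"]]
token2id, corpus_sketch = build_bow(docs)
print(token2id)
print(corpus_sketch)
# [[(0, 1), (1, 1), (2, 1)], [(0, 1), (2, 1), (3, 1), (4, 1)]]
```

Word order is discarded in this representation; only which words occur and how often is kept, which is all LDA needs.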

In [28]:
# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=20, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)


In [None]:
The above LDA model is built with 20 topics, where each topic is a combination of keywords and each keyword
contributes a certain weight to the topic.

You can see the keywords for each topic and the weight (importance) of each keyword using lda_model.print_topics().

In [29]:
# Print the top keywords of each topic
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

[(0,
  '0.013*"meta" + 0.013*"feature" + 0.013*"antecedent" + 0.013*"stage" + '
  '0.013*"geriatric" + 0.013*"neuroimage" + 0.013*"status" + 0.013*"biomarker" '
  '+ 0.013*"ahead" + 0.013*"early"'),
 (1,
  '0.066*"prospective" + 0.066*"narrative" + 0.066*"study" + '
  '0.066*"antecedent" + 0.066*"course" + 0.066*"early" + 0.066*"longitudinal" '
  '+ 0.066*"manif" + 0.066*"change" + 0.066*"review"'),
 (2,
  '0.227*"disorder" + 0.227*"bipolar" + 0.066*"patient" + 0.034*"treatment" + '
  '0.034*"diagnosis" + 0.034*"review" + 0.034*"nurse" + 0.034*"advanced" + '
  '0.034*"practice" + 0.034*"age"'),
 (3,
  '0.117*"differentiation" + 0.117*"clinician" + 0.117*"era" + '
  '0.117*"challenge" + 0.117*"disorder" + 0.006*"neuroimage" + 0.006*"status" '
  '+ 0.006*"ahead" + 0.006*"stage" + 0.006*"geriatric"'),
 (4,
  '0.129*"bipolar" + 0.129*"disorder" + 0.066*"impact" + 0.066*"study" + '
  '0.066*"machine" + 0.066*"technique" + 0.066*"learning" + 0.066*"review" + '
  '0.066*"systematic" + 0.066*"

In [None]:
How to interpret this?

Topic 2 is represented as 0.227*"disorder" + 0.227*"bipolar" + 0.066*"patient" + 0.034*"treatment" + 0.034*"diagnosis" + 0.034*"review" + 0.034*"nurse" + 0.034*"advanced" + 0.034*"practice" + 0.034*"age".

It means the top 10 keywords that contribute to this topic are ‘disorder’, ‘bipolar’, ‘patient’, and so on, and the weight of ‘disorder’ on topic 2 is 0.227.

The weights reflect how important a keyword is to that topic.

Looking at these keywords, can you guess what this topic could be? You might summarise it as ‘bipolar disorder’, or more specifically its diagnosis and treatment in clinical practice.
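
The topic strings printed by print_topics() can be split into (keyword, weight) pairs for further inspection. This is a small sketch using a shortened copy of the topic 2 string from the output above; `parse_topic` is a hypothetical helper, not a gensim function.

```python
import re

def parse_topic(topic_string):
    """Turn '0.227*"disorder" + 0.227*"bipolar"' into [('disorder', 0.227), ('bipolar', 0.227)]."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

topic_2 = ('0.227*"disorder" + 0.227*"bipolar" + 0.066*"patient" + 0.034*"treatment" + '
           '0.034*"diagnosis"')
keywords = parse_topic(topic_2)
print(keywords)

# The highest-weighted keyword is the one that matters most to the topic.
print(max(keywords, key=lambda kw: kw[1]))
```

When the model object is available, asking it for the pairs directly is usually preferable to parsing strings; the parsing route is shown only because it works on saved output.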

In [32]:
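# Visualize the topics
pyLDAvis.enable_notebook()
# Note: on a corpus this small the default projection can yield complex-valued coordinates,
# which causes the TypeError shown below. Passing mds='mmds' to prepare() is a commonly
# suggested workaround (an assumption here, not verified against this data).
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
vis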

TypeError: Object of type complex is not JSON serializable

PreparedData(topic_coordinates=                        x                   y  topics  cluster       Freq
topic                                                                    
2     -0.222833+0.000000j -0.017957+0.000000j       1        1  16.410395
8     -0.076619+0.000000j  0.032467+0.000000j       2        1  14.109157
17    -0.150372+0.000000j  0.053891+0.000000j       3        1  13.152353
12    -0.022269+0.000000j -0.052152+0.000000j       4        1   9.088848
1      0.041718+0.000000j -0.165875+0.000000j       5        1   8.103252
4     -0.065239+0.000000j -0.127950+0.000000j       6        1   7.814704
9     -0.084513+0.000000j -0.006990+0.000000j       7        1   7.120166
15    -0.035493+0.000000j  0.112688+0.000000j       8        1   6.098638
6     -0.111993+0.000000j -0.009868+0.000000j       9        1   5.778699
16    -0.048924+0.000000j  0.062548+0.000000j      10        1   4.766218
3      0.083931+0.000000j  0.027516+0.000000j      11        1   3.440933
5      

In [33]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to C:\Users\Windows
[nltk_data]     10\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [34]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Windows 10\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


True

In [35]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

test_string = 'European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices'

def preprocess(sent):
    sent = nltk.word_tokenize(sent)
    sent = nltk.pos_tag(sent)
    return sent

sent = preprocess(test_string)
sent

[('European', 'JJ'),
 ('authorities', 'NNS'),
 ('fined', 'VBD'),
 ('Google', 'NNP'),
 ('a', 'DT'),
 ('record', 'NN'),
 ('$', '$'),
 ('5.1', 'CD'),
 ('billion', 'CD'),
 ('on', 'IN'),
 ('Wednesday', 'NNP'),
 ('for', 'IN'),
 ('abusing', 'VBG'),
 ('its', 'PRP$'),
 ('power', 'NN'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('mobile', 'JJ'),
 ('phone', 'NN'),
 ('market', 'NN'),
 ('and', 'CC'),
 ('ordered', 'VBD'),
 ('the', 'DT'),
 ('company', 'NN'),
 ('to', 'TO'),
 ('alter', 'VB'),
 ('its', 'PRP$'),
 ('practices', 'NNS')]

# SpaCy’s named entity recognition has been trained on the OntoNotes 5 corpus

In [36]:
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()

In [37]:
doc = nlp('European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices')
print([(X.text, X.label_) for X in doc.ents])

[('European', 'NORP'), ('Google', 'ORG'), ('$5.1 billion', 'MONEY'), ('Wednesday', 'DATE')]


In [38]:
from bs4 import BeautifulSoup
import requests
import re
def url_to_string(url):
    res = requests.get(url)
    html = res.text
    soup = BeautifulSoup(html, 'html5lib')
    for script in soup(["script", "style", 'aside']):
        script.extract()
    return " ".join(re.split(r'[\n\t]+', soup.get_text()))
ny_bb = url_to_string('https://www.nytimes.com/2018/08/13/us/politics/peter-strzok-fired-fbi.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=first-column-region&region=top-news&WT.nav=top-news')
article = nlp(ny_bb)
len(article.ents)

158

In [39]:
labels = [x.label_ for x in article.ents]
Counter(labels)

Counter({'PERSON': 75,
         'ORG': 39,
         'PRODUCT': 1,
         'CARDINAL': 5,
         'GPE': 11,
         'DATE': 23,
         'NORP': 3,
         'ORDINAL': 1})

# The following are the three most frequent entities.

In [40]:
items = [x.text for x in article.ents]
Counter(items).most_common(3)

[('Strzok', 28), ('F.B.I.', 18), ('Trump', 11)]

In [41]:
sentences = [x for x in article.sents]
print(sentences[9])

Trump and his allies seized on the texts — exchanged during the 2016 campaign with a former F.B.I. lawyer, Lisa Page — in assailing the Russia investigation as an illegitimate “witch hunt.”


In [42]:
displacy.render(nlp(str(sentences[9])), jupyter=True, style='ent')

In [43]:
displacy.render(nlp(str(sentences[9])), style='dep', jupyter = True, options = {'distance': 120})

In [44]:
[(tok.orth_, tok.pos_, tok.lemma_)
 for tok in nlp(str(sentences[10]))
 if not tok.is_stop and tok.pos_ != 'PUNCT']

[('Mr.', 'PROPN', 'Mr.'),
 ('Strzok', 'PROPN', 'Strzok'),
 ('rose', 'VERB', 'rise'),
 ('20', 'NUM', '20'),
 ('years', 'NOUN', 'year'),
 ('F.B.I.', 'PROPN', 'F.B.I.'),
 ('experienced', 'ADJ', 'experienced'),
 ('counterintelligence', 'NOUN', 'counterintelligence'),
 ('agents', 'NOUN', 'agent'),
 ('key', 'ADJ', 'key'),
 ('figure', 'NOUN', 'figure'),
 ('early', 'ADJ', 'early'),
 ('months', 'NOUN', 'month'),
 ('inquiry', 'NOUN', 'inquiry')]

In [45]:
dict([(str(x), x.label_) for x in nlp(str(sentences[10])).ents])
print([(x, x.ent_iob_, x.ent_type_) for x in sentences[10]])

[(Mr., 'O', ''), (Strzok, 'B', 'PERSON'), (,, 'O', ''), (who, 'O', ''), (rose, 'O', ''), (over, 'O', ''), (20, 'B', 'DATE'), (years, 'I', 'DATE'), (at, 'O', ''), (the, 'O', ''), (F.B.I., 'B', 'ORG'), (to, 'O', ''), (become, 'O', ''), (one, 'O', ''), (of, 'O', ''), (its, 'O', ''), (most, 'O', ''), (experienced, 'O', ''), (counterintelligence, 'O', ''), (agents, 'O', ''), (,, 'O', ''), (was, 'O', ''), (a, 'O', ''), (key, 'O', ''), (figure, 'O', ''), (in, 'O', ''), (the, 'B', 'DATE'), (early, 'I', 'DATE'), (months, 'I', 'DATE'), (of, 'O', ''), (the, 'O', ''), (inquiry, 'O', ''), (., 'O', '')]
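
The B/I/O tags above mark where each entity begins (B), continues (I), or is absent (O). A minimal sketch of merging such tags back into entity spans follows; `group_entities` is a hypothetical helper, and the input is a hand-copied subset of the tokens printed above, with plain strings in place of spaCy tokens.

```python
def group_entities(tagged_tokens):
    """Merge (text, iob, ent_type) triples into (entity_text, ent_type) spans."""
    entities, current, current_type = [], [], None
    for text, iob, ent_type in tagged_tokens:
        if iob == "B":                       # a new entity begins
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [text], ent_type
        elif iob == "I" and current:         # continue the open entity
            current.append(text)
        else:                                # "O": outside any entity
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:                              # flush an entity that ends the sequence
        entities.append((" ".join(current), current_type))
    return entities

tagged = [("Mr.", "O", ""), ("Strzok", "B", "PERSON"), (",", "O", ""),
          ("over", "O", ""), ("20", "B", "DATE"), ("years", "I", "DATE"),
          ("at", "O", ""), ("the", "O", ""), ("F.B.I.", "B", "ORG")]
print(group_entities(tagged))
# [('Strzok', 'PERSON'), ('20 years', 'DATE'), ('F.B.I.', 'ORG')]
```

In practice `doc.ents` already provides these spans; the sketch only shows how the token-level tags relate to them.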


In [46]:
# Highlight the entities across the whole article
displacy.render(article, jupyter=True, style='ent')