# Lesson 1 : Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) for Topic Modeling
ASSUMPTIONS
1. Documents with similar topics use similar groups of words
2. Latent topics can then be found by searching for groups of words that
frequently occur together in documents across the corpus

- Each document is a probability distribution over topics
- Each topic is a probability distribution over words
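
To make the two distributions concrete, here is a minimal NumPy sketch. The vocabulary, the two topics, and all the numbers are hypothetical and are not taken from the NPR data. It only illustrates that mixing topic-word distributions by a document's topic weights gives a distribution over words.

```python
import numpy as np

# Hypothetical toy values, not taken from the NPR data.
vocab = ["health", "care", "vote", "election"]

# Each row is one topic's probability distribution over the vocabulary.
topic_word = np.array([
    [0.50, 0.40, 0.05, 0.05],   # a "health"-like topic
    [0.05, 0.05, 0.40, 0.50],   # a "politics"-like topic
])

# Each row is one document's probability distribution over the topics.
doc_topic = np.array([
    [0.9, 0.1],   # mostly the first topic
    [0.2, 0.8],   # mostly the second topic
])

# Mixing the topics by each document's weights gives a word distribution per document.
doc_word = doc_topic @ topic_word

print(doc_word.round(3))
print(doc_word.sum(axis=1))   # each row still sums to 1
```

LDA works in the opposite direction: it only sees the word counts and tries to infer both matrices.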

In [1]:
import pandas as pd
import sklearn
import numpy as np

In [3]:
npr=pd.read_csv('UPDATED_NLP_COURSE/05-Topic-Modeling/npr.csv')

In [4]:
npr.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


In [6]:
# Unsupervised learning: the articles come with no topic labels
# Number of articles
len(npr)

11992

In [5]:
npr["Article"][0]

'In the Washington of 2016, even when the policy can be bipartisan, the politics cannot. And in that sense, this year shows little sign of ending on Dec. 31. When President Obama moved to sanction Russia over its alleged interference in the U. S. election just concluded, some Republicans who had long called for similar or more severe measures could scarcely bring themselves to approve. House Speaker Paul Ryan called the Obama measures ”appropriate” but also ”overdue” and ”a prime example of this administration’s ineffective foreign policy that has left America weaker in the eyes of the world.” Other GOP leaders sounded much the same theme. ”[We have] been urging President Obama for years to take strong action to deter Russia’s worldwide aggression, including its   operations,” wrote Rep. Devin Nunes,  . chairman of the House Intelligence Committee. ”Now with just a few weeks left in office, the president has suddenly decided that some stronger measures are indeed warranted.” Appearing 

In [9]:
# Number of characters (not words) in the first article
len(npr["Article"][0])

7646

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

In [9]:
cv=CountVectorizer(max_df=0.9,min_df=2,stop_words='english')

In [10]:
dtm=cv.fit_transform(npr['Article'])

In [14]:
dtm

<11992x54777 sparse matrix of type '<class 'numpy.int64'>'
	with 3033388 stored elements in Compressed Sparse Row format>

The document-term matrix has 11992 documents (rows) and 54777 terms (columns).
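
The effect of `max_df=0.9` and `min_df=2` is easier to see on a tiny corpus. The four documents below are made up for illustration. A term is dropped if it appears in more than 90% of the documents or in fewer than 2 documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not the NPR articles.
toy_docs = [
    "apple banana cherry",
    "apple banana",
    "apple cherry date",
    "apple elderberry",
]

# Same settings as the notebook's vectorizer.
toy_cv = CountVectorizer(max_df=0.9, min_df=2, stop_words='english')
toy_dtm = toy_cv.fit_transform(toy_docs)

# "apple" is in every document (above max_df); "date" and "elderberry"
# are each in one document (below min_df). Only the rest survive.
kept_terms = sorted(toy_cv.vocabulary_)
print(kept_terms)
print(toy_dtm.shape)
```

A document whose words are all filtered out, such as the last one here, ends up as an all-zero row.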

In [15]:
from sklearn.decomposition import LatentDirichletAllocation

In [16]:
LDA=LatentDirichletAllocation(n_components=7,random_state=42)

In [17]:
LDA.fit(dtm)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='batch', learning_offset=10.0,
             max_doc_update_iter=100, max_iter=10, mean_change_tol=0.001,
             n_components=7, n_jobs=None, n_topics=None, perp_tol=0.1,
             random_state=42, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)

In [25]:
# Grab the vocabulary of words
print(len(cv.get_feature_names()))
type(cv.get_feature_names())

54777


list
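
A note on versions: this notebook calls `get_feature_names()`, which was removed in scikit-learn 1.2. From scikit-learn 1.0 onward the accessor is `get_feature_names_out()`, which returns a NumPy array rather than a list. The helper below is a hypothetical convenience, not part of scikit-learn, that works with either accessor. The toy corpus is also made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

def feature_names(vectorizer):
    """Return the fitted vocabulary as a list, using whichever accessor exists."""
    if hasattr(vectorizer, "get_feature_names_out"):
        return list(vectorizer.get_feature_names_out())
    return list(vectorizer.get_feature_names())

# Hypothetical toy corpus.
toy_cv = CountVectorizer()
toy_cv.fit(["health care policy", "election campaign policy"])

names = feature_names(toy_cv)   # call once and reuse the list
print(names)
```

Storing the result in a variable also avoids rebuilding the whole vocabulary list on every lookup inside a loop.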

In [22]:
cv.get_feature_names()[50000]

'transcribe'

In [22]:
cv.get_feature_names()[41000]

'reproductive'

In [31]:
import random 

random_word_id=random.randint(0,54776)  # randint includes both endpoints; valid indices are 0..54776
cv.get_feature_names()[random_word_id]

'edgewood'

In [32]:
# Grab the Topics
len(LDA.components_)

7

In [34]:
type(LDA.components_)

numpy.ndarray

In [35]:
LDA.components_.shape

(7, 54777)

In [31]:
LDA.components_

array([[8.64332806e+00, 2.38014333e+03, 1.42900522e-01, ...,
        1.43006821e-01, 1.42902042e-01, 1.42861626e-01],
       [2.76191749e+01, 5.36394437e+02, 1.42857148e-01, ...,
        1.42861973e-01, 1.42857147e-01, 1.42906875e-01],
       [7.22783888e+00, 8.24033986e+02, 1.42857148e-01, ...,
        6.14236247e+00, 2.14061364e+00, 1.42923753e-01],
       ...,
       [3.11488651e+00, 3.50409655e+02, 1.42857147e-01, ...,
        1.42859912e-01, 1.42857146e-01, 1.42866614e-01],
       [4.61486388e+01, 5.14408600e+01, 3.14281373e+00, ...,
        1.43107628e-01, 1.43902481e-01, 2.14271779e+00],
       [4.93991422e-01, 4.18841042e+02, 1.42857151e-01, ...,
        1.42857146e-01, 1.43760101e-01, 1.42866201e-01]])
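
The values in `components_` are not probabilities. They behave like pseudo-counts of how strongly each word is tied to each topic. Many entries sit near 0.142857, which is 1/7. That is consistent with scikit-learn's default topic-word prior of `1/n_components`, so those words most likely received almost no weight in that topic. Dividing each row by its sum turns it into a distribution over words. The array below is a hypothetical stand-in for `LDA.components_`.

```python
import numpy as np

# Hypothetical stand-in for LDA.components_ (2 topics x 4 words).
components = np.array([
    [8.0, 2380.0, 0.14, 0.15],
    [27.0, 536.0, 0.14, 6.0],
])

# Divide each row by its sum so every topic becomes a distribution over words.
topic_word_dist = components / components.sum(axis=1, keepdims=True)

print(topic_word_dist.round(4))
print(topic_word_dist.sum(axis=1))   # each row sums to 1
```

Normalizing does not change the ranking of words within a topic, so `argsort()` gives the same top words either way.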

In [38]:
LDA.components_[0]

array([8.64332806e+00, 2.38014333e+03, 1.42900522e-01, ...,
       1.43006821e-01, 1.42902042e-01, 1.42861626e-01])

In [40]:
single_topic=LDA.components_[0]

In [45]:
# Get the index positions of the words, sorted by their weight in this topic
# Least to greatest, so the highest-weight words come last
single_topic.argsort()

array([ 2475, 18302, 35285, ..., 22673, 42561, 42993], dtype=int64)
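
Because `argsort()` orders indices from the smallest value to the largest, the top words are at the end of the result. A small sketch with made-up weights and words shows the slicing, plus a reversal if the strongest word should come first:

```python
import numpy as np

# Hypothetical word weights for a single topic.
weights = np.array([0.2, 5.0, 1.5, 9.0, 0.1])
words = ["alpha", "beta", "gamma", "delta", "epsilon"]

order = weights.argsort()          # indices from smallest to largest weight
top_three = order[-3:]             # last three indices = three largest weights
top_three_desc = top_three[::-1]   # reverse so the largest comes first

print(order)
print([words[i] for i in top_three])
print([words[i] for i in top_three_desc])
```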

In [46]:
top_ten_words=single_topic.argsort()[-10:]
# top_twenty_words=single_topic.argsort()[-20:]

In [47]:
for index in top_ten_words:
    print(cv.get_feature_names()[index])

new
percent
government
company
million
care
people
health
said
says


In [48]:
for topic_id,topic in enumerate(LDA.components_):
    print(f'The TOP 15 Words for this #{topic_id}')
    print([cv.get_feature_names()[i] for i in topic.argsort()[-15:]])
    print('\n')
    print('\n')

The TOP 15 Words for this #0
['companies', 'money', 'year', 'federal', '000', 'new', 'percent', 'government', 'company', 'million', 'care', 'people', 'health', 'said', 'says']




The TOP 15 Words for this #1
['military', 'house', 'security', 'russia', 'government', 'npr', 'reports', 'says', 'news', 'people', 'told', 'police', 'president', 'trump', 'said']




The TOP 15 Words for this #2
['way', 'world', 'family', 'home', 'day', 'time', 'water', 'city', 'new', 'years', 'food', 'just', 'people', 'like', 'says']




The TOP 15 Words for this #3
['time', 'new', 'don', 'years', 'medical', 'disease', 'patients', 'just', 'children', 'study', 'like', 'women', 'health', 'people', 'says']




The TOP 15 Words for this #4
['voters', 'vote', 'election', 'party', 'new', 'obama', 'court', 'republican', 'campaign', 'people', 'state', 'president', 'clinton', 'said', 'trump']




The TOP 15 Words for this #5
['years', 'going', 've', 'life', 'don', 'new', 'way', 'music', 'really', 'time', 'know', 'thi

In [None]:
# Grab the highest-probability words per topic

In [51]:
for index,topic in enumerate(LDA.components_):
    print(f"The TOP 15 words for Topic{index}")
    print([cv.get_feature_names()[i] for i in topic.argsort()[-15:]])
    print('\n')

The TOP 15 words for Topic0
['companies', 'money', 'year', 'federal', '000', 'new', 'percent', 'government', 'company', 'million', 'care', 'people', 'health', 'said', 'says']


The TOP 15 words for Topic1
['military', 'house', 'security', 'russia', 'government', 'npr', 'reports', 'says', 'news', 'people', 'told', 'police', 'president', 'trump', 'said']


The TOP 15 words for Topic2
['way', 'world', 'family', 'home', 'day', 'time', 'water', 'city', 'new', 'years', 'food', 'just', 'people', 'like', 'says']


The TOP 15 words for Topic3
['time', 'new', 'don', 'years', 'medical', 'disease', 'patients', 'just', 'children', 'study', 'like', 'women', 'health', 'people', 'says']


The TOP 15 words for Topic4
['voters', 'vote', 'election', 'party', 'new', 'obama', 'court', 'republican', 'campaign', 'people', 'state', 'president', 'clinton', 'said', 'trump']


The TOP 15 words for Topic5
['years', 'going', 've', 'life', 'don', 'new', 'way', 'music', 'really', 'time', 'know', 'think', 'people', '

In [52]:
npr

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."
5,I did not want to join yoga class. I hated tho...
6,With a who has publicly supported the debunk...
7,"I was standing by the airport exit, debating w..."
8,"If movies were trying to be more realistic, pe..."
9,"Eighteen years ago, on New Year’s Eve, David F..."


In [53]:
topic_results=LDA.transform(dtm)

In [55]:
topic_results.shape

(11992, 7)

In [57]:
topic_results[0].argmax()

1

In [69]:
topic_pred=[i.argmax() for i in topic_results]

In [66]:
npr['topic']=topic_results.argmax(axis=1)
npr

Unnamed: 0,Article,topic
0,"In the Washington of 2016, even when the polic...",1
1,Donald Trump has used Twitter — his prefe...,1
2,Donald Trump is unabashedly praising Russian...,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",1
4,"From photography, illustration and video, to d...",2
5,I did not want to join yoga class. I hated tho...,3
6,With a who has publicly supported the debunk...,3
7,"I was standing by the airport exit, debating w...",2
8,"If movies were trying to be more realistic, pe...",3
9,"Eighteen years ago, on New Year’s Eve, David F...",2
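
Taking the `argmax` keeps only the most likely topic and hides how mixed a document is. A possible refinement, sketched here on a hypothetical stand-in for `topic_results`, is to keep the winning probability next to the winning topic so that weak assignments can be spotted.

```python
import numpy as np

# Hypothetical stand-in for LDA.transform(dtm): 3 documents x 4 topics.
topic_results = np.array([
    [0.05, 0.80, 0.10, 0.05],   # clearly topic 1
    [0.40, 0.10, 0.45, 0.05],   # topic 2 barely beats topic 0
    [0.25, 0.25, 0.25, 0.25],   # no dominant topic; argmax returns the first index
])

best_topic = topic_results.argmax(axis=1)   # index of the most likely topic per document
best_prob = topic_results.max(axis=1)       # how much weight that topic actually has

print(best_topic)
print(best_prob)
```

In the notebook, `best_prob` could be stored as an extra column and used to filter out documents with no clear topic. The cutoff to use is a judgment call.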


# Lesson 2 : Non-Negative Matrix Factorization

In [71]:
import numpy as np
import pandas as pd
import sklearn

In [73]:
npr=pd.read_csv('UPDATED_NLP_COURSE/05-Topic-Modeling/npr.csv')

In [74]:
npr.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


In [75]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [76]:
tfidf=TfidfVectorizer(max_df=0.95,min_df=2,stop_words='english')

In [77]:
dtm=tfidf.fit_transform(npr['Article'])

In [78]:
dtm

<11992x54777 sparse matrix of type '<class 'numpy.float64'>'
	with 3033388 stored elements in Compressed Sparse Row format>

In [79]:
from sklearn.decomposition import NMF

In [80]:
nmf_model=NMF(n_components=7,random_state=42)

In [81]:
nmf_model.fit(dtm)

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
  n_components=7, random_state=42, shuffle=False, solver='cd', tol=0.0001,
  verbose=0)
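
NMF approximates the non-negative document-term matrix as a product of two smaller non-negative matrices: document-topic weights `W` and topic-word weights `H` (`components_`). Unlike LDA's output, these weights are not probabilities and rows do not need to sum to 1. The 4x4 matrix below is a made-up example, and `max_iter=500` is chosen only to give the toy fit room to converge.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical toy matrix: rows are documents, columns are terms.
X = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 1.0],
    [0.0, 0.0, 1.0, 4.0],
])

toy_nmf = NMF(n_components=2, random_state=42, max_iter=500)
W = toy_nmf.fit_transform(X)   # document-topic weights, shape (4, 2)
H = toy_nmf.components_        # topic-word weights, shape (2, 4)

print(W.shape, H.shape)
print(np.round(W @ H, 1))      # approximate reconstruction of X
```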

In [82]:
tfidf.get_feature_names()[2300]

'albala'

In [84]:
for index,topic in enumerate(nmf_model.components_):
    print(f'The top 15 words for topic :{index}')
    print([tfidf.get_feature_names()[i] for i in topic.argsort()[-15:]])
    print('\n')

The top 15 words for topic :0
['new', 'research', 'like', 'patients', 'health', 'disease', 'percent', 'women', 'virus', 'study', 'water', 'food', 'people', 'zika', 'says']


The top 15 words for topic :1
['gop', 'pence', 'presidential', 'russia', 'administration', 'election', 'republican', 'obama', 'white', 'house', 'donald', 'campaign', 'said', 'president', 'trump']


The top 15 words for topic :2
['senate', 'house', 'people', 'act', 'law', 'tax', 'plan', 'republicans', 'affordable', 'obamacare', 'coverage', 'medicaid', 'insurance', 'care', 'health']


The top 15 words for topic :3
['officers', 'syria', 'security', 'department', 'law', 'isis', 'russia', 'government', 'state', 'attack', 'president', 'reports', 'court', 'said', 'police']


The top 15 words for topic :4
['primary', 'cruz', 'election', 'democrats', 'percent', 'party', 'delegates', 'vote', 'state', 'democratic', 'hillary', 'campaign', 'voters', 'sanders', 'clinton']


The top 15 words for topic :5
['love', 've', 'don', 'al

In [85]:
topic_results=nmf_model.transform(dtm)

In [90]:
npr['topic']=topic_results.argmax(axis=1)

In [96]:
mytopicdict={0:'health',1:'election',2:'legis',3:'poli',4:'election',5:'music',6:'edu'}
npr['Topic Label']=npr['topic'].map(mytopicdict)
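
One thing to watch with `Series.map` and a dictionary: any topic id missing from the dictionary becomes `NaN` instead of raising an error. The small frame and labels below are hypothetical.

```python
import pandas as pd

# Hypothetical topic ids; 7 has no entry in the dictionary below.
toy = pd.DataFrame({'topic': [0, 1, 7]})
toy_labels = {0: 'health', 1: 'election'}

toy['Topic Label'] = toy['topic'].map(toy_labels)
print(toy)

# Count the rows that did not receive a label.
missing = int(toy['Topic Label'].isna().sum())
print(missing)
```

Checking for missing labels after mapping is a cheap way to catch a dictionary that does not cover every topic.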

In [97]:
npr

Unnamed: 0,Article,topic,Topic Label
0,"In the Washington of 2016, even when the polic...",1,election
1,Donald Trump has used Twitter — his prefe...,1,election
2,Donald Trump is unabashedly praising Russian...,1,election
3,"Updated at 2:50 p. m. ET, Russian President Vl...",3,poli
4,"From photography, illustration and video, to d...",6,edu
5,I did not want to join yoga class. I hated tho...,5,music
6,With a who has publicly supported the debunk...,0,health
7,"I was standing by the airport exit, debating w...",0,health
8,"If movies were trying to be more realistic, pe...",0,health
9,"Eighteen years ago, on New Year’s Eve, David F...",5,music
