## Topic-Modeling

**Topic modeling lets us efficiently analyze large volumes of text by clustering documents into topics.**

A large volume of text data is unlabeled, which means we cannot apply supervised learning approaches to build machine learning models for it.

- If we have **unlabeled** data, we can attempt to "discover" labels. For text data, this means attempting to discover clusters of documents, grouped together by topic.

- A very important idea to keep in mind here is that we don't know the "correct" topic or "right answer". All we know is that the documents clustered together share similar topic ideas. It is up to the user to identify what these topics represent.

## <font color = green > Latent Dirichlet Allocation (LDA) </font>
Johann Peter Gustav Lejeune Dirichlet was a German mathematician in the 1800s who contributed widely to modern mathematics. A probability distribution is named after him, the **Dirichlet distribution**, and LDA is built on it. LDA was published in 2003 as a graphical model for topic discovery in the ***Journal of Machine Learning Research*** by David Blei, Andrew Ng and Michael I. Jordan.

- **Assumptions of LDA for Topic Modeling**
    1. Documents with similar topics use similar groups of words.
    2. Latent topics can then be found by searching for groups of words that frequently occur together in documents across the corpus
    3. Documents are probability distributions over latent topics.
    4. Topics themselves are probability distributions over words.
    
    
 **We can imagine that any particular document has a probability distribution over a given number of topics.**
 
 *We are not saying that doc1 belongs to topic2 outright; rather, it has a probability distribution over many topics.*
    
 <img src='5.PNG'>
 
 **Topics themselves are probability distributions over words.**
    
 <img src='6.PNG'>
    
Now, we can assume that Topic1 deals with pets. It is up to us to interpret the topic.

So, LDA represents documents as mixtures of topics that spit out words with certain probabilities.
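The "mixtures of topics that spit out words" idea can be sketched as a tiny generative simulation. This is only an illustration under stated assumptions: the six-word vocabulary, the two topics, and the Dirichlet parameter of 0.5 are made up for the example and are not taken from the NPR data.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["cat", "dog", "vet", "vote", "party", "senate"]  # hypothetical vocabulary
n_topics = 2                                               # hypothetical topic count

# Each topic is a probability distribution over the vocabulary.
topic_word = rng.dirichlet(alpha=np.full(len(vocab), 0.5), size=n_topics)

# Each document is a probability distribution over the topics.
doc_topic = rng.dirichlet(alpha=np.full(n_topics, 0.5))

# Generate a 10-word document: pick a topic for each word,
# then pick a word from that topic's distribution.
words = []
for _ in range(10):
    z = rng.choice(n_topics, p=doc_topic)
    w = rng.choice(len(vocab), p=topic_word[z])
    words.append(vocab[w])

print(doc_topic.round(2))
print(words)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the document-topic and topic-word distributions that plausibly produced them.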

In [2]:
import pandas as pd

In [3]:
npr = pd.read_csv('npr.csv')

In [4]:
npr.head()

| | Article |
|---|---|
| 0 | In the Washington of 2016, even when the polic... |
| 1 | Donald Trump has used Twitter — his prefe... |
| 2 | Donald Trump is unabashedly praising Russian... |
| 3 | Updated at 2:50 p. m. ET, Russian President Vl... |
| 4 | From photography, illustration and video, to d... |


In [5]:
npr['Article'][0]

'In the Washington of 2016, even when the policy can be bipartisan, the politics cannot. And in that sense, this year shows little sign of ending on Dec. 31. When President Obama moved to sanction Russia over its alleged interference in the U. S. election just concluded, some Republicans who had long called for similar or more severe measures could scarcely bring themselves to approve. House Speaker Paul Ryan called the Obama measures ”appropriate” but also ”overdue” and ”a prime example of this administration’s ineffective foreign policy that has left America weaker in the eyes of the world.” Other GOP leaders sounded much the same theme. ”[We have] been urging President Obama for years to take strong action to deter Russia’s worldwide aggression, including its   operations,” wrote Rep. Devin Nunes,  . chairman of the House Intelligence Committee. ”Now with just a few weeks left in office, the president has suddenly decided that some stronger measures are indeed warranted.” Appearing 

In [6]:
len(npr)

11992

In [7]:
npr['Article'][789]

'Just how bad is the state of the nation’s highway infrastructure? So bad, tires on FedEx trucks last only half as long as they did 20 years ago, as they deteriorate rapidly from crumbling pavement and get more flats from gaping potholes. ”We’re using almost 100 percent more tires to produce the same mileage of transportation,” FedEx Chairman and CEO Fred Smith told the U. S. House Transportation and Infrastructure Committee Wednesday. ”Why is that? Because the road infrastructure has so many potholes in it, it’s tearing up tires faster than before.” In addition, ”the cost of congestion is getting worse,” Smith continued. ”It’s preventing   deliveries” for his and other businesses whose growth is dependent upon   orders and rapid,    delivery. The House committee is beginning to lay the groundwork for what is expected to be a massive infrastructure proposal from the Trump administration in coming months. The president has said he will spend up to $1 trillion rebuilding the nation’s roa

In [9]:
from sklearn.feature_extraction.text import CountVectorizer

In [10]:
cv = CountVectorizer(max_df=0.9, min_df=2, stop_words='english')

`max_df` accepts either a float between 0 and 1 (a proportion of documents) or an integer (an absolute document count). `max_df=0.9` discards words that appear in more than 90% of the documents. `min_df` accepts the same two kinds of value; `min_df=2` discards words that appear in fewer than two documents. Together with `stop_words='english'`, this removes very common words (`max_df`), rare terms such as most one-off misspellings (`min_df`), and English stop words.
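To see the two thresholds at work, here is a small sketch on a four-document toy corpus. The corpus is invented for illustration, and `stop_words` is left out on purpose so that the effect of `max_df` is visible by itself.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: "the" is in every document, "cat" and "ran" are in two,
# and every other word is in exactly one.
toy_docs = ["the cat sat", "the cat ran", "the dog ran", "the bird flew"]

toy_cv = CountVectorizer(max_df=0.9, min_df=2)
toy_dtm = toy_cv.fit_transform(toy_docs)

# vocabulary_ maps each kept term to its column index.
kept = sorted(toy_cv.vocabulary_)
print(kept)           # ['cat', 'ran']
print(toy_dtm.shape)  # (4, 2)
```

"the" is dropped because it appears in 100% of the documents (more than 90%), and "sat", "dog", "bird" and "flew" are dropped because each appears in only one document.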

In [11]:
dtm = cv.fit_transform(npr['Article'])

In [12]:
dtm

<11992x54777 sparse matrix of type '<class 'numpy.int64'>'
	with 3033388 stored elements in Compressed Sparse Row format>
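The numbers in that summary explain why the matrix is stored in sparse form. A quick back-of-the-envelope check, using only the figures printed above:

```python
n_docs, n_terms = 11992, 54777   # shape reported for dtm
stored = 3033388                 # stored (non-zero) elements reported for dtm

density = stored / (n_docs * n_terms)
avg_terms_per_doc = stored / n_docs

print(f"{density:.2%} of the cells are non-zero")
print(f"about {avg_terms_per_doc:.0f} distinct vocabulary terms per article")
```

Fewer than 1% of the cells hold a count, so a dense array would spend almost all of its memory on zeros.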

In [13]:
from sklearn.decomposition import LatentDirichletAllocation

In [14]:
LDA = LatentDirichletAllocation(n_components=7,random_state=42)

In [15]:
LDA.fit(dtm)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='batch', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=10,
                          mean_change_tol=0.001, n_components=7, n_jobs=None,
                          perp_tol=0.1, random_state=42, topic_word_prior=None,
                          total_samples=1000000.0, verbose=0)
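The same fit can be tried on a corpus small enough to inspect by eye. This is a minimal sketch, assuming a made-up four-document corpus and two topics; with so little data the topics themselves are not meaningful, so the example only checks the shapes that the fitted model produces.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus with 8 distinct words.
toy_docs = [
    "cat dog vet pet cat",
    "dog pet vet dog",
    "vote party senate vote",
    "senate party election vote",
]

toy_dtm = CountVectorizer().fit_transform(toy_docs)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=42)
toy_doc_topic = toy_lda.fit_transform(toy_dtm)

print(toy_lda.components_.shape)  # (n_topics, n_vocabulary_terms)
print(toy_doc_topic.shape)        # (n_documents, n_topics)
print(toy_doc_topic.sum(axis=1))  # each row is a distribution over topics
```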

In [17]:
#grab the vocabulary of words
len(cv.get_feature_names())

54777

In [18]:
type(cv.get_feature_names())

list

`cv.get_feature_names()` returns a list of every unique word in the vocabulary that `CountVectorizer` kept, one entry per column of `dtm`.
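One caveat for anyone re-running this notebook: scikit-learn 1.0 introduced `get_feature_names_out()` and later releases removed `get_feature_names()`. The helper below is a hypothetical convenience wrapper (the name `feature_names` is not part of scikit-learn) that works with either accessor.

```python
from sklearn.feature_extraction.text import CountVectorizer

def feature_names(vectorizer):
    """Return the vocabulary as a list, using whichever accessor this version offers."""
    if hasattr(vectorizer, "get_feature_names_out"):
        # Newer scikit-learn: returns a NumPy array, so convert to a list.
        return list(vectorizer.get_feature_names_out())
    # Older scikit-learn: already returns a list.
    return vectorizer.get_feature_names()

demo_cv = CountVectorizer().fit(["topic models group words", "words form topics"])
print(feature_names(demo_cv))
```

Note that on newer versions `type(cv.get_feature_names_out())` is a NumPy array rather than a `list`, so the `type(...)` output shown in this notebook would differ.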

In [22]:
cv.get_feature_names()[3628]

'arrange'

In [25]:
import random
random_word_id = random.randint(0, 54776)  # randint includes both endpoints, so stop at the last valid index
cv.get_feature_names()[random_word_id]

'discovers'

In [26]:
#grab the topics
len(LDA.components_)

7

In [27]:
type(LDA.components_)

numpy.ndarray

In [28]:
LDA.components_.shape

(7, 54777)

In [29]:
LDA.components_

array([[8.64332806e+00, 2.38014333e+03, 1.42900522e-01, ...,
        1.43006821e-01, 1.42902042e-01, 1.42861626e-01],
       [2.76191749e+01, 5.36394437e+02, 1.42857148e-01, ...,
        1.42861973e-01, 1.42857147e-01, 1.42906875e-01],
       [7.22783888e+00, 8.24033986e+02, 1.42857148e-01, ...,
        6.14236247e+00, 2.14061364e+00, 1.42923753e-01],
       ...,
       [3.11488651e+00, 3.50409655e+02, 1.42857147e-01, ...,
        1.42859912e-01, 1.42857146e-01, 1.42866614e-01],
       [4.61486388e+01, 5.14408600e+01, 3.14281373e+00, ...,
        1.43107628e-01, 1.43902481e-01, 2.14271779e+00],
       [4.93991422e-01, 4.18841042e+02, 1.42857151e-01, ...,
        1.42857146e-01, 1.43760101e-01, 1.42866201e-01]])
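The rows of `components_` are word weights (pseudo-counts), not probabilities, so they do not sum to 1. Dividing each row by its sum turns it into a distribution over words. The sketch below uses a made-up 2x3 array in place of the real `LDA.components_`.

```python
import numpy as np

# Hypothetical stand-in for LDA.components_: 2 topics, 3 vocabulary terms.
toy_components = np.array([[8.0, 1.0, 1.0],
                           [0.5, 0.5, 4.0]])

# Normalize each row so it sums to 1.
topic_word_probs = toy_components / toy_components.sum(axis=1, keepdims=True)

print(topic_word_probs)
print(topic_word_probs.sum(axis=1))
```

The many values near 0.1429 in the real output are consistent with scikit-learn's default topic-word prior of `1 / n_components` (here 1/7), which is what a word keeps when a topic assigns it essentially no counts.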

In [30]:
single_topic = LDA.components_[0]

In [31]:
type(single_topic)

numpy.ndarray

In [32]:
single_topic.argsort()

array([ 2475, 18302, 35285, ..., 22673, 42561, 42993], dtype=int64)

In [34]:
import numpy as np

In [35]:
arr = np.array([10,200,1])

In [36]:
arr

array([ 10, 200,   1])

In [37]:
arr.argsort()

array([2, 0, 1], dtype=int64)

**`argsort()` returns the index positions that would sort the array from least to greatest, so the last positions point to the highest-weight words for the topic.**
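If the largest values are wanted first, the ascending order can be reversed before slicing. A short sketch on the same three-element array:

```python
import numpy as np

arr = np.array([10, 200, 1])

order = arr.argsort()    # indices from smallest to largest value
top2 = order[::-1][:2]   # reverse, then keep the two largest

print(top2)       # [1 0]
print(arr[top2])  # [200  10]
```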

In [38]:
#ARGSORT ---> INDEX POSITIONS SORTED FROM LEAST TO GREATEST
#TOP 10 VALUES (10 GREATEST VALUES)
#LAST 10 VALUES OF ARGSORT()
single_topic.argsort()[-10:] #grab the last 10 values of .argsort()

array([33390, 36310, 21228, 10425, 31464,  8149, 36283, 22673, 42561,
       42993], dtype=int64)

In [42]:
top_twenty_words_index = single_topic.argsort()[-20:]

In [43]:
for index in top_twenty_words_index:
    print(cv.get_feature_names()[index])

president
state
tax
insurance
trump
companies
money
year
federal
000
new
percent
government
company
million
care
people
health
said
says


In [44]:
#setting up a loop to give 15 top words for each topic

for i, topic in enumerate(LDA.components_):
    print(f"The Top 15 Words for Topic #{i}")
    print([cv.get_feature_names()[index] for index in topic.argsort()[-15:]])
    print('\n')
    print('\n')

The Top 15 Words for Topic #0
['companies', 'money', 'year', 'federal', '000', 'new', 'percent', 'government', 'company', 'million', 'care', 'people', 'health', 'said', 'says']




The Top 15 Words for Topic #1
['military', 'house', 'security', 'russia', 'government', 'npr', 'reports', 'says', 'news', 'people', 'told', 'police', 'president', 'trump', 'said']




The Top 15 Words for Topic #2
['way', 'world', 'family', 'home', 'day', 'time', 'water', 'city', 'new', 'years', 'food', 'just', 'people', 'like', 'says']




The Top 15 Words for Topic #3
['time', 'new', 'don', 'years', 'medical', 'disease', 'patients', 'just', 'children', 'study', 'like', 'women', 'health', 'people', 'says']




The Top 15 Words for Topic #4
['voters', 'vote', 'election', 'party', 'new', 'obama', 'court', 'republican', 'campaign', 'people', 'state', 'president', 'clinton', 'said', 'trump']




The Top 15 Words for Topic #5
['years', 'going', 've', 'life', 'don', 'new', 'way', 'music', 'really', 'time', 'know'
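The loop above looks up the vocabulary once per word. As a sketch of an alternative, the hypothetical helper `top_words` below converts the vocabulary to an array once and indexes it in bulk. It is shown on made-up weights and names rather than on the fitted model, and it lists the heaviest word first, whereas the loop above lists it last.

```python
import numpy as np

def top_words(components, feature_names, n=15):
    """Return, for each topic row, its n highest-weight words, heaviest first."""
    names = np.asarray(feature_names)
    return [names[row.argsort()[::-1][:n]].tolist() for row in components]

# Hypothetical stand-ins for LDA.components_ and the vocabulary.
toy_components = np.array([[0.1, 5.0, 2.0, 0.3],
                           [4.0, 0.2, 0.1, 3.0]])
toy_names = ["court", "health", "care", "vote"]

for i, words in enumerate(top_words(toy_components, toy_names, n=2)):
    print(f"Topic #{i}: {words}")
```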

In [45]:
dtm

<11992x54777 sparse matrix of type '<class 'numpy.int64'>'
	with 3033388 stored elements in Compressed Sparse Row format>

In [46]:
npr.head()

| | Article |
|---|---|
| 0 | In the Washington of 2016, even when the polic... |
| 1 | Donald Trump has used Twitter — his prefe... |
| 2 | Donald Trump is unabashedly praising Russian... |
| 3 | Updated at 2:50 p. m. ET, Russian President Vl... |
| 4 | From photography, illustration and video, to d... |


In [47]:
topic_results = LDA.transform(dtm)

In [49]:
topic_results.shape

(11992, 7)

This holds our original 11992 articles, each with its probabilities for the 7 topics. Let's see that in action.

In [50]:
topic_results[1]

array([3.63424997e-02, 8.86130697e-01, 4.40751747e-04, 4.40668674e-04,
       7.57636804e-02, 4.40866779e-04, 4.40835574e-04])

So, these are the seven topic probabilities for that particular article.

In [51]:
topic_results[1].round(2)

array([0.04, 0.89, 0.  , 0.  , 0.08, 0.  , 0.  ])

So, article number 2 (index 1) has its highest probability, 0.89, for topic 1.
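Because each article is a mixture, the runner-up topic can be informative too. A small sketch using the rounded probabilities printed above:

```python
import numpy as np

# Rounded topic probabilities for the article at index 1, copied from the output above.
row = np.array([0.04, 0.89, 0.0, 0.0, 0.08, 0.0, 0.0])

ranked = row.argsort()[::-1]  # topic indices from most to least probable
best_two = ranked[:2]

print(best_two)       # [1 4]
print(row[best_two])  # [0.89 0.08]
```

Here topic 1 dominates, with topic 4 a distant second; the remaining topics contribute essentially nothing.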

In [52]:
npr['Article'][1]

'  Donald Trump has used Twitter  —   his preferred means of communication  —   to weigh in on a swath of foreign policy issues over the past few weeks. His comments give a glimpse into how his incoming administration will deal with pressing foreign matters  —   but also highlight how reactionary comments on social media can immediately spur international concern and attention. And his staff has indicated that taking to Twitter to air his concerns or, often, grievances, won’t end once he enters the Oval Office. On Wednesday, Trump blasted the U. S.’s abstention from the U. N. Security Council vote on Israeli settlements earlier this month. The tweets came just hours before Secretary of State John Kerry gave a speech defending the decision and calling the continued building of settlements on Palestinian territory in the West Bank a threat to the   solution in the region. Trump’s support for Israel and Prime Minister Benjamin Netanyahu  —   who has had a fraught relationship with Preside

In [54]:
topic_results[1].argmax() #returns index position of highest probability

1

In [55]:
npr['Topic'] = topic_results.argmax(axis=1)

In [56]:
npr

| | Article | Topic |
|---|---|---|
| 0 | In the Washington of 2016, even when the polic... | 1 |
| 1 | Donald Trump has used Twitter — his prefe... | 1 |
| 2 | Donald Trump is unabashedly praising Russian... | 1 |
| 3 | Updated at 2:50 p. m. ET, Russian President Vl... | 1 |
| 4 | From photography, illustration and video, to d... | 2 |
| ... | ... | ... |
| 11987 | The number of law enforcement officers shot an... | 1 |
| 11988 | Trump is busy these days with victory tours,... | 4 |
| 11989 | It’s always interesting for the Goats and Soda... | 3 |
| 11990 | The election of Donald Trump was a surprise to... | 4 |
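Since LDA only returns topic numbers, the last step is for the user to attach names. A minimal sketch of that step follows; the labels are one hypothetical reading of the top words listed earlier, not something the model produces, and the small frame stands in for `npr`.

```python
import pandas as pd

# Hypothetical, hand-written labels based on a reading of each topic's top words.
topic_labels = {
    0: "health care and economy",
    1: "national security",
    4: "elections",
}

# Stand-in for the npr DataFrame with its Topic column.
toy = pd.DataFrame({"Topic": [1, 4, 0, 3]})

# Topics without a hand-written label are marked explicitly.
toy["Topic Label"] = toy["Topic"].map(topic_labels).fillna("unlabeled")
print(toy)
```

Leaving unlabeled topics visible, rather than guessing, keeps it clear which clusters still need a human interpretation.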
