# Non-Negative Matrix Factorization

## Data

We will be using articles scraped from NPR (National Public Radio), obtained from their website [www.npr.org](http://www.npr.org).

In [1]:
import pandas as pd

npr = pd.read_csv("D:\\ML-Datasets\\NLP-Datasets\\npr.csv")

# Check the data load
npr.head()

| | Article |
|---|---|
| 0 | In the Washington of 2016, even when the polic... |
| 1 | Donald Trump has used Twitter — his prefe... |
| 2 | Donald Trump is unabashedly praising Russian... |
| 3 | Updated at 2:50 p. m. ET, Russian President Vl... |
| 4 | From photography, illustration and video, to d... |


In [2]:
# Let's look at the second article
npr['Article'][1]

'  Donald Trump has used Twitter  —   his preferred means of communication  —   to weigh in on a swath of foreign policy issues over the past few weeks. His comments give a glimpse into how his incoming administration will deal with pressing foreign matters  —   but also highlight how reactionary comments on social media can immediately spur international concern and attention. And his staff has indicated that taking to Twitter to air his concerns or, often, grievances, won’t end once he enters the Oval Office. On Wednesday, Trump blasted the U. S.’s abstention from the U. N. Security Council vote on Israeli settlements earlier this month. The tweets came just hours before Secretary of State John Kerry gave a speech defending the decision and calling the continued building of settlements on Palestinian territory in the West Bank a threat to the   solution in the region. Trump’s support for Israel and Prime Minister Benjamin Netanyahu  —   who has had a fraught relationship with Preside

In [3]:
# Check on number of articles
len(npr)

11992

## Preprocessing

**`max_df`**` : float in range [0.0, 1.0] or int, default=1.0`<br>
When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

**`min_df`**` : float in range [0.0, 1.0] or int, default=1`<br>
When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
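To make the two thresholds concrete, here is a minimal pure-Python sketch of the document-frequency rule described above. It is an illustration only, not scikit-learn's implementation; the toy corpus and the helper name `filter_by_document_frequency` are made up for this example.

```python
from collections import Counter


def filter_by_document_frequency(docs, max_df=0.95, min_df=2):
    """Return the sorted terms kept under the max_df / min_df rule.

    Assumes max_df is a proportion of documents (float) and min_df is an
    absolute document count (int), matching the settings used in this notebook.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    max_count = max_df * n_docs  # proportion converted to a document count
    return sorted(
        term for term, df in doc_freq.items()
        if min_df <= df <= max_count
    )


toy_docs = [
    "radio cat sat",
    "radio dog sat",
    "radio bird flew",
    "radio cat flew",
]

kept = filter_by_document_frequency(toy_docs)
print(kept)  # ['cat', 'flew', 'sat']
```

Here `radio` appears in all four documents (above the 95% ceiling), while `dog` and `bird` appear in only one document each (below the floor of 2), so all three are dropped.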

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')

In [5]:
# Apply this tfidf to create the document term matrix

dtm = tfidf.fit_transform(npr['Article'])

# Check the matrix
dtm

<11992x54777 sparse matrix of type '<class 'numpy.float64'>'
	with 3033388 stored elements in Compressed Sparse Row format>
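
The repr above already shows how sparse the matrix is. As a quick check, the arithmetic below uses only the numbers printed in that output; on the live object the same ratio should be available as `dtm.nnz / (dtm.shape[0] * dtm.shape[1])`.

```python
# Numbers copied from the sparse-matrix repr above
n_articles, n_terms = 11992, 54777
stored_elements = 3033388

# Fraction of cells in the document-term matrix that are non-zero
density = stored_elements / (n_articles * n_terms)
print(f"{density:.4%} of the entries are non-zero")
```

Fewer than half a percent of the cells hold a value, which is why a sparse format is used here.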

## NMF

In [6]:
from sklearn.decomposition import NMF

nmf_model = NMF(n_components=7, random_state=42)

# Apply this model to the document term matrix
nmf_model.fit(dtm)

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
    n_components=7, random_state=42, shuffle=False, solver='cd', tol=0.0001,
    verbose=0)
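
Conceptually, NMF approximates the non-negative document-term matrix `V` by a product `W @ H` of two smaller non-negative matrices: `W` (documents x topics) and `H` (topics x words, which scikit-learn exposes as `components_`). The sketch below illustrates that idea on a tiny made-up matrix using the classic multiplicative update rules for the Frobenius loss. This is a swapped-in technique for illustration: the fitted model above reports `solver='cd'` (coordinate descent), and the function name `nmf_multiplicative` and the toy data are assumptions.

```python
import numpy as np


def nmf_multiplicative(V, n_components, n_iter=200, seed=42, eps=1e-10):
    """Factorize V ~= W @ H with multiplicative updates (illustrative only)."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = V.shape
    # Random non-negative starting point
    W = rng.random((n_rows, n_components))
    H = rng.random((n_components, n_cols))
    initial_error = np.linalg.norm(V - W @ H)
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio, so W and H stay >= 0
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    final_error = np.linalg.norm(V - W @ H)
    return W, H, initial_error, final_error


# Toy "document-term" matrix: 4 documents, 5 terms, two obvious groups
V = np.array([
    [2.0, 1.0, 0.0, 0.0, 0.0],
    [3.0, 2.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 2.0, 3.0, 0.0],
    [0.0, 0.0, 1.0, 2.0, 0.0],
])

W, H, initial_error, final_error = nmf_multiplicative(V, n_components=2)
print(W.shape, H.shape)  # (4, 2) (2, 5)
print(initial_error, final_error)
```

The reconstruction error after the updates should be lower than at the random start, and neither factor ever contains a negative entry.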

## Interpret the NMF results

In [7]:
# Get the vocabulary of words
# (Note: get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out())

len(tfidf.get_feature_names()) # This matches the number of columns in our sparse matrix

54777

In [8]:
# Look at a single feature name by its index
tfidf.get_feature_names()[2300]

'albala'

In [9]:
# Print 20 random words from the vocabulary

import random

# Fetch the vocabulary once instead of on every loop iteration
feature_names = tfidf.get_feature_names()

for _ in range(20):
    random_word_id = random.randint(0, len(feature_names) - 1)
    print(feature_names[random_word_id])

generally
premises
inner
threads
buyer
irresponsibly
pierre
sensor
quail
combs
nyc
strives
folklife
punish
trawl
timing
wends
banjul
taxing
replied


In [11]:
# Grab the topics

len(nmf_model.components_) # This should match with the number of topics we specified during training

7

In [12]:
# components_ is a NumPy array with one row per topic; each entry is the non-negative weight (not a probability) of a word in that topic
type(nmf_model.components_)

numpy.ndarray

In [13]:
# Check the shape of the numpy array
nmf_model.components_.shape

(7, 54777)

In [14]:
# Now let's grab the highest-weighted words in a topic (we will do it for the first topic)

# Get the weights of all words for this topic
# (NMF weights are non-negative coefficients, not probabilities)
single_topic = nmf_model.components_[0]

# argsort() returns the index positions that would sort the array in ascending order,
# so the last 10 positions are the 10 words with the highest weights for this topic

top_ten_words = single_topic.argsort()[-10:]
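
Because `argsort()` sorts in ascending order, the slice `[-10:]` lists the top words from lowest to highest weight. The small sketch below, on a made-up weight vector, shows the same slicing and how reversing it gives a highest-first order; the array values are assumptions for illustration.

```python
import numpy as np

# Made-up topic weights for a four-word vocabulary
weights = np.array([0.1, 0.7, 0.0, 0.3])

ascending = weights.argsort()   # indices from smallest to largest weight
top_two = ascending[-2:]        # two largest, still in ascending order
top_two_desc = top_two[::-1]    # same two, largest first

print(top_two.tolist())       # [3, 1]
print(top_two_desc.tolist())  # [1, 3]
```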

In [16]:
for index in top_ten_words:
    print(tfidf.get_feature_names()[index])

disease
percent
women
virus
study
water
food
people
zika
says


#### Now we will set up a loop that prints the 15 highest-weighted words for each of the seven topics

In [17]:
for index, topic in enumerate(nmf_model.components_):
    print(f"THE TOP 15 WORDS FOR TOPIC #{index}")
    # Use a separate name for the word index so it does not shadow the topic index
    print([tfidf.get_feature_names()[word_id] for word_id in topic.argsort()[-15:]])
    print("\n")

THE TOP 15 WORDS FOR TOPIC #0
['new', 'research', 'like', 'patients', 'health', 'disease', 'percent', 'women', 'virus', 'study', 'water', 'food', 'people', 'zika', 'says']


THE TOP 15 WORDS FOR TOPIC #1
['gop', 'pence', 'presidential', 'russia', 'administration', 'election', 'republican', 'obama', 'white', 'house', 'donald', 'campaign', 'said', 'president', 'trump']


THE TOP 15 WORDS FOR TOPIC #2
['senate', 'house', 'people', 'act', 'law', 'tax', 'plan', 'republicans', 'affordable', 'obamacare', 'coverage', 'medicaid', 'insurance', 'care', 'health']


THE TOP 15 WORDS FOR TOPIC #3
['officers', 'syria', 'security', 'department', 'law', 'isis', 'russia', 'government', 'state', 'attack', 'president', 'reports', 'court', 'said', 'police']


THE TOP 15 WORDS FOR TOPIC #4
['primary', 'cruz', 'election', 'democrats', 'percent', 'party', 'delegates', 'vote', 'state', 'democratic', 'hillary', 'campaign', 'voters', 'sanders', 'clinton']


THE TOP 15 WORDS FOR TOPIC #5
['love', 've', 'don', 'al

In [18]:
# Now we will attach the topic numbers to the original articles

# To do this, we apply the transform method of our NMF model to the DTM, which gives
# the weight of each topic for every article
topic_results = nmf_model.transform(dtm)

# Now we will check the shape
topic_results.shape

(11992, 7)

In [19]:
# The first row holds the weights of the 7 topics for the first article (NMF coefficients, not probabilities, so they need not sum to 1)
topic_results[0]

array([0.        , 0.12075603, 0.00140297, 0.05919954, 0.01518909,
       0.        , 0.        ])

In [20]:
# Round to two decimals for readability
topic_results[0].round(2)

array([0.  , 0.12, 0.  , 0.06, 0.02, 0.  , 0.  ])
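
The values returned by `transform` are NMF weights rather than probabilities, so a row does not generally sum to 1. If relative shares are easier to read, one option is to divide each row by its sum. The sketch below does this for the first row printed above; the normalization step is an optional addition, not part of the original workflow.

```python
import numpy as np

# First row of topic_results, copied from the output above
first_article = np.array([0.0, 0.12075603, 0.00140297, 0.05919954,
                          0.01518909, 0.0, 0.0])

total = first_article.sum()     # not 1: these are weights, not probabilities
shares = first_article / total  # rescale so the row sums to 1

print(round(float(total), 4))   # 0.1965
print(shares.round(2))
print(int(shares.argmax()))     # 1, the same dominant topic as before
```

Rescaling a row by a positive constant does not change which topic has the largest value, so the `argmax` assignment is unaffected.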

In [21]:
# We will grab the index position of the highest weight, i.e. the dominant topic for this article
topic_results[0].argmax()

1

In [22]:
# Assign each article the topic with the highest weight

npr['Topic'] = topic_results.argmax(axis=1)

npr.head(10)

| | Article | Topic |
|---|---|---|
| 0 | In the Washington of 2016, even when the polic... | 1 |
| 1 | Donald Trump has used Twitter — his prefe... | 1 |
| 2 | Donald Trump is unabashedly praising Russian... | 1 |
| 3 | Updated at 2:50 p. m. ET, Russian President Vl... | 3 |
| 4 | From photography, illustration and video, to d... | 6 |
| 5 | I did not want to join yoga class. I hated tho... | 5 |
| 6 | With a who has publicly supported the debunk... | 0 |
| 7 | I was standing by the airport exit, debating w... | 0 |
| 8 | If movies were trying to be more realistic, pe... | 0 |
| 9 | Eighteen years ago, on New Year’s Eve, David F... | 5 |
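
Once every article has a topic, a natural sanity check is to see how the articles are spread across topics. The sketch below counts the ten assignments shown in the table above using NumPy; on the full DataFrame, `npr['Topic'].value_counts()` should give the same kind of summary.

```python
import numpy as np

# Topic labels of the first ten articles, copied from the table above
first_ten_topics = np.array([1, 1, 1, 3, 6, 5, 0, 0, 0, 5])

# Count how many articles fall into each of the 7 topics
counts = np.bincount(first_ten_topics, minlength=7)
print(counts.tolist())  # [3, 3, 0, 1, 0, 2, 1]
```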
