<a href="https://colab.research.google.com/github/sujayrittikar/NLP/blob/main/Non_negative_Matrix_Factorization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

- Non-negative Matrix Factorization is an unsupervised algorithm that simultaneously performs Dimensionality Reduction and Clustering.
- We use it in conjunction with TF-IDF to model topics across documents.

# Process
- Given a non-negative matrix A, find a k-dimensional approximation in terms of non-negative factors W and H.

  - A ≈ W . H
  - A -> n x m Data Matrix
  - W -> n x k Basis Vectors
  - H -> k x m Coefficient Matrix
- Approximate each object (i.e., each column of A) by a linear combination of the k reduced dimensions, or "basis vectors", in W.
- Each basis vector can be interpreted as a cluster. The memberships of objects in these clusters are encoded by H.

- Input:
  - Non-negative Data Matrix (A) - TF-IDF Matrix
  - Number of basis vectors (k) - Number of Topics
  - Initial values for factors W and H

- Objective Function: Some measure of reconstruction error between A and the approximation WH.

- Expectation-maximization-style optimization refines W and H in order to minimize the objective function: alternately update H and W until the objective converges.
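
The alternating refinement described above can be sketched on a tiny random matrix. This is a minimal illustration that assumes the Frobenius norm as the reconstruction error and uses the Lee and Seung multiplicative update rules, which are one common choice and not necessarily the solver used later in this notebook. The names `nmf_multiplicative`, `n_iter` and `eps` are made up for this example.

```python
import numpy as np

def nmf_multiplicative(A, k, n_iter=200, eps=1e-10, seed=0):
    """Factor a non-negative A (n x m) into W (n x k) and H (k x m)."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    # Random non-negative starting values for both factors
    W = rng.random((n, k))
    H = rng.random((k, m))
    errors = []
    for _ in range(n_iter):
        # Update H with W fixed, then W with H fixed.
        # Multiplying by non-negative ratios keeps both factors non-negative.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
        # Track the objective: Frobenius norm of the residual A - WH
        errors.append(np.linalg.norm(A - W @ H, "fro"))
    return W, H, errors

A = np.random.default_rng(1).random((6, 5))   # toy non-negative data matrix
W, H, errors = nmf_multiplicative(A, k=2)
print(W.shape, H.shape)                        # (6, 2) (2, 5)
print(errors[0], errors[-1])                   # the error should shrink over the iterations
```

In practice one would stop once the change in the error falls below a tolerance rather than running a fixed number of iterations.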

# Steps
1. Construct a vector space model for the documents (after stopword filtering), resulting in a term-document matrix A.
2. Apply TF-IDF term weight normalization to A.
3. Normalize TF-IDF vectors to unit length.
4. Initialize factors using NNDSVD on A.
5. Apply Projected Gradient NMF to A.
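
As a rough sketch, the steps above map onto scikit-learn as follows. The four-sentence corpus is invented for illustration. Two assumptions are worth flagging: scikit-learn builds a document-term matrix (documents as rows), which is the transpose of the term-document layout in step 1, and its `NMF` class uses a coordinate descent solver by default rather than projected gradient.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, only for illustration
docs = [
    "the senate passed the health care bill",
    "health insurance coverage and medicaid",
    "the band released a new album and song",
    "a love song from the new album",
]

# Steps 1-3: stopword filtering, TF-IDF weighting, unit-length (L2) rows
vectorizer = TfidfVectorizer(stop_words="english", norm="l2")
A = vectorizer.fit_transform(docs)            # shape: documents x terms

# Every document vector should have length 1 after L2 normalization
row_norms = np.asarray(np.sqrt(A.multiply(A).sum(axis=1))).ravel()

# Steps 4-5: NNDSVD initialization, then fit the factorization
model = NMF(n_components=2, init="nndsvd", random_state=1, max_iter=500)
W = model.fit_transform(A)                    # documents x topics
H = model.components_                         # topics x terms

print(row_norms)
print(W.shape, H.shape)
```

Because documents are rows here, the document memberships end up in `W` and the topic-term weights in `H`, the reverse of the roles in the column-oriented description above.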

In [1]:
import pandas as pd

In [3]:
npr = pd.read_csv('npr.csv')

In [4]:
npr.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [6]:
tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')

In [7]:
dtm = tfidf.fit_transform(npr['Article'])

In [8]:
dtm

<11992x54777 sparse matrix of type '<class 'numpy.float64'>'
	with 3033388 stored elements in Compressed Sparse Row format>

In [9]:
from sklearn.decomposition import NMF

In [10]:
nmf_model = NMF(n_components=7, random_state=1)

In [11]:
nmf_model.fit(dtm)



NMF(n_components=7, random_state=1)

In [12]:
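# get_feature_names() was removed in scikit-learn 1.2; fetch the vocabulary once, outside the loop
feature_names = tfidf.get_feature_names_out()
for index, topic in enumerate(nmf_model.components_):
  print(f"The TOP 15 words for Topic # {index}")
  # argsort is ascending, so the last 15 indices are the highest-weighted terms
  print(feature_names[topic.argsort()[-15:]].tolist())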

The TOP 15 words for Topic # 0
['new', 'research', 'like', 'patients', 'health', 'disease', 'percent', 'women', 'virus', 'study', 'water', 'food', 'people', 'zika', 'says']
The TOP 15 words for Topic # 1
['gop', 'pence', 'presidential', 'russia', 'administration', 'election', 'republican', 'obama', 'white', 'house', 'donald', 'campaign', 'said', 'president', 'trump']
The TOP 15 words for Topic # 2
['senate', 'house', 'people', 'act', 'law', 'tax', 'plan', 'republicans', 'affordable', 'obamacare', 'coverage', 'medicaid', 'insurance', 'care', 'health']
The TOP 15 words for Topic # 3
['officers', 'syria', 'security', 'department', 'law', 'isis', 'russia', 'government', 'state', 'attack', 'president', 'reports', 'court', 'said', 'police']
The TOP 15 words for Topic # 4
['primary', 'cruz', 'election', 'democrats', 'percent', 'party', 'delegates', 'vote', 'state', 'democratic', 'hillary', 'campaign', 'voters', 'sanders', 'clinton']
The TOP 15 words for Topic # 5
['love', 've', 'don', 'album', 'way', 'time', 'song', 'life'

In [13]:
topic_results = nmf_model.transform(dtm)

In [14]:
topic_results[0].argmax()

1

In [15]:
npr['Topic'] = topic_results.argmax(axis=1)
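
To make the `argmax(axis=1)` step concrete, here is a small sketch on a made-up weight matrix (three documents, three topics). The values in `toy_weights` are invented, and `share` and `confidence` are illustrative names rather than anything the notebook computes.

```python
import numpy as np

# Hypothetical document-topic weights, shaped like the output of nmf_model.transform
toy_weights = np.array([
    [0.00, 0.12, 0.03],
    [0.40, 0.05, 0.00],
    [0.01, 0.02, 0.30],
])

# Index of the largest weight in each row = the dominant topic of each document
dominant = toy_weights.argmax(axis=1)
print(dominant)                     # [1 0 2]

# Optional: the fraction of a document's total weight that falls on its dominant topic
share = toy_weights / toy_weights.sum(axis=1, keepdims=True)
confidence = share.max(axis=1)
print(confidence.round(2))
```

A low `confidence` value would suggest that a document mixes several topics, which the single label from `argmax` hides.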

In [16]:
mytopic_dict = {0: 'health', 1: 'elections', 2: 'legislation', 3: 'politics', 4: 'elections', 5: 'music', 6: 'education'}

In [17]:
npr['Topic_Label'] = npr['Topic'].map(mytopic_dict)

In [18]:
npr.head()

Unnamed: 0,Article,Topic,Topic_Label
0,"In the Washington of 2016, even when the polic...",1,elections
1,Donald Trump has used Twitter — his prefe...,1,elections
2,Donald Trump is unabashedly praising Russian...,1,elections
3,"Updated at 2:50 p. m. ET, Russian President Vl...",3,politics
4,"From photography, illustration and video, to d...",6,education
