
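Non-negative matrix factorization (NMF) is an unsupervised algorithm that performs dimensionality reduction and clustering at the same time. It approximates a non-negative matrix as the product of two smaller non-negative matrices. Applied to a TF-IDF document-term matrix, one factor gives the topic weights of each document and the other gives the term weights of each topic, which makes NMF usable for topic modelling across documents.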
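
To make the factorization concrete, here is a minimal NumPy sketch on a toy matrix. It uses Lee-Seung multiplicative updates for the Frobenius loss, which is an assumption chosen for brevity and not the solver used later in this notebook (scikit-learn's `NMF` defaults to coordinate descent). The function name `nmf_multiplicative` and the toy matrix `V` are hypothetical.

```python
import numpy as np

def nmf_multiplicative(V, n_components, n_iter=500, seed=0, eps=1e-10):
    """Approximate a non-negative matrix V (docs x terms) as W @ H.

    W (docs x topics) and H (topics x terms) stay non-negative because
    each update multiplies them by a ratio of non-negative quantities.
    """
    rng = np.random.default_rng(seed)
    n_docs, n_terms = V.shape
    W = rng.random((n_docs, n_components))
    H = rng.random((n_components, n_terms))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ (H @ H.T) + eps)
    return W, H

# Toy "document-term" matrix: documents 0-1 use only terms 0-1,
# documents 2-3 use only terms 2-3.
V = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [4.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 3.0],
    [0.0, 0.0, 2.0, 6.0],
])

W, H = nmf_multiplicative(V, n_components=2)

# Dimensionality reduction: each document is now described by 2 topic weights.
print(np.round(W, 2))
# Clustering: the largest weight in each row of W is the document's topic.
doc_topics = W.argmax(axis=1)
print(doc_topics)
```

The rows of `H` play the same role as `nmf.components_` further down, and `W` corresponds to the output of `nmf.transform`.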

### Load Libraries & Dataset:

In [1]:
import numpy as np
import pandas as pd

!wget https://media.githubusercontent.com/media/souparnabose99/topic-modelling-lda-nnmf/main/npr.csv

--2021-06-27 10:14:20--  https://media.githubusercontent.com/media/souparnabose99/topic-modelling-lda-nnmf/main/npr.csv
Resolving media.githubusercontent.com (media.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to media.githubusercontent.com (media.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 55511119 (53M) [text/plain]
Saving to: ‘npr.csv’


2021-06-27 10:14:21 (178 MB/s) - ‘npr.csv’ saved [55511119/55511119]



In [2]:
df = pd.read_csv('npr.csv')
df.head(10)

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."
5,I did not want to join yoga class. I hated tho...
6,With a who has publicly supported the debunk...
7,"I was standing by the airport exit, debating w..."
8,"If movies were trying to be more realistic, pe..."
9,"Eighteen years ago, on New Year’s Eve, David F..."


### Perform TF-IDF vectorization:

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')

In [4]:
doc_term_mat = tfidf.fit_transform(df['Article'])
doc_term_mat

<11992x54777 sparse matrix of type '<class 'numpy.float64'>'
	with 3033388 stored elements in Compressed Sparse Row format>
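
The effect of `max_df` and `min_df` is easier to see on a tiny corpus. The sketch below reuses the same vectorizer settings on four made-up sentences (`toy_docs` is hypothetical, not taken from the NPR data): a term found in every document is dropped by `max_df=0.95`, and terms found in only one document are dropped by `min_df=2`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, for illustration only.
toy_docs = [
    "congress senate passed health plan",
    "congress senate debated tax cuts",
    "congress funds police training",
    "congress health insurance stalled",
]

toy_tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
toy_matrix = toy_tfidf.fit_transform(toy_docs)

# "congress" appears in 4/4 documents (> 95%), so max_df removes it.
# Words that appear in a single document are removed by min_df=2.
kept_terms = sorted(toy_tfidf.vocabulary_)
print(kept_terms)
print(toy_matrix.shape)
```

Note that the third toy document keeps none of its terms, so its row in the matrix is entirely zero. The same can happen to very short or unusual articles in a real corpus.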

### Perform NMF:

In [5]:
from sklearn.decomposition import NMF

In [6]:
nmf = NMF(n_components=8, random_state=100)
nmf.fit(doc_term_mat)

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
    n_components=8, random_state=100, shuffle=False, solver='cd', tol=0.0001,
    verbose=0)

In [10]:
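# On scikit-learn >= 1.0, use tfidf.get_feature_names_out() instead;
# get_feature_names() was removed in version 1.2.
tfidf.get_feature_names()[1800]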

'adolf'

In [11]:
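# Fetch the vocabulary once instead of on every loop iteration.
# On scikit-learn >= 1.0, use tfidf.get_feature_names_out() here.
feature_names = tfidf.get_feature_names()
for index, topic in enumerate(nmf.components_):
  # argsort is ascending, so the last 15 indices are the highest-weighted terms
  print(f"The top 15 words for the Topic : {index+1}")
  print([feature_names[i] for i in topic.argsort()[-15:]])
  print('\n')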

The top 15 words for the Topic : 1
['year', 'workers', '000', 'china', 'company', 'study', 'just', 'years', 'new', 'percent', 'like', 'water', 'food', 'people', 'says']


The top 15 words for the Topic : 2
['gop', 'pence', 'russia', 'presidential', 'administration', 'election', 'republican', 'obama', 'white', 'house', 'donald', 'campaign', 'said', 'president', 'trump']


The top 15 words for the Topic : 3
['senate', 'house', 'act', 'people', 'tax', 'law', 'plan', 'republicans', 'affordable', 'obamacare', 'coverage', 'medicaid', 'insurance', 'care', 'health']


The top 15 words for the Topic : 4
['officers', 'security', 'syria', 'government', 'department', 'state', 'law', 'isis', 'russia', 'president', 'attack', 'reports', 'court', 'said', 'police']


The top 15 words for the Topic : 5
['cruz', 'primary', 'election', 'democrats', 'percent', 'party', 'delegates', 'vote', 'state', 'democratic', 'hillary', 'campaign', 'voters', 'sanders', 'clinton']


The top 15 words for the Topic : 6
['b

### Add Topic Labels:

In [15]:
topic_results = nmf.transform(doc_term_mat)
topic_results.argmax(axis=1)

array([1, 1, 1, ..., 7, 4, 0])
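
`argmax` keeps only the strongest topic per document and discards how strong it was. A small NumPy sketch, using a made-up stand-in for `topic_results` (the array `toy_topic_results` is hypothetical), shows two cases worth checking for: ties, and documents with no topic weight at all, which `argmax` silently assigns to topic 0.

```python
import numpy as np

# Hypothetical stand-in for `topic_results`: 3 documents x 4 topics.
toy_topic_results = np.array([
    [0.02, 0.10, 0.00, 0.01],  # clearly dominated by topic 1
    [0.00, 0.03, 0.03, 0.00],  # tie between topics 1 and 2
    [0.00, 0.00, 0.00, 0.00],  # no weight on any topic
])

# argmax returns the first maximum, so ties and all-zero rows
# resolve to the lowest matching index.
dominant = toy_topic_results.argmax(axis=1)
print(dominant)  # [1 1 0]

# Share of each document's total weight held by its dominant topic;
# rows that sum to zero get a share of 0 instead of dividing by zero.
row_sums = toy_topic_results.sum(axis=1, keepdims=True)
shares = np.divide(
    toy_topic_results, row_sums,
    out=np.zeros_like(toy_topic_results), where=row_sums > 0,
)
dominant_share = shares.max(axis=1)
print(np.round(dominant_share, 2))
```

A low `dominant_share` suggests the document mixes several topics, so its single label should be read with some caution.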

In [16]:
df['Topic'] = topic_results.argmax(axis=1)
df.head(10)

Unnamed: 0,Article,Topic
0,"In the Washington of 2016, even when the polic...",1
1,Donald Trump has used Twitter — his prefe...,1
2,Donald Trump is unabashedly praising Russian...,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",3
4,"From photography, illustration and video, to d...",6
5,I did not want to join yoga class. I hated tho...,5
6,With a who has publicly supported the debunk...,0
7,"I was standing by the airport exit, debating w...",0
8,"If movies were trying to be more realistic, pe...",0
9,"Eighteen years ago, on New Year’s Eve, David F...",0
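
The `Topic` column holds 0-based indices, so "Topic 2" in the printout above corresponds to index 1. One way to make the column easier to read is to map indices to names. The sketch below does this on a toy frame; the names in `topic_labels` are hypothetical guesses based on the top words printed earlier, and topics without an entry are left as "unlabeled".

```python
import pandas as pd

# Hypothetical labels chosen by reading the top-15 word lists;
# keys are 0-based topic indices. Adjust or extend as needed.
topic_labels = {
    1: "Trump administration",
    2: "Health care",
    3: "Security and policing",
    4: "2016 primaries",
}

# Toy stand-in for `df` with an already-assigned Topic column.
toy_df = pd.DataFrame({
    "Article": ["first article", "second article", "third article"],
    "Topic": [1, 3, 6],
})

# map() looks up each index; indices missing from the dict become NaN,
# which fillna() replaces with a placeholder.
toy_df["Topic Label"] = toy_df["Topic"].map(topic_labels).fillna("unlabeled")
print(toy_df)
```

The same two lines can be applied to `df` once every topic has been inspected and named.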
