## Objective

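For a document-term matrix (DTM) $A$, find non-negative matrices $W$ and $H$ of rank $k$ (selected ahead of time) such that

$$ A \approx W \cdot H$$

where one factor holds the basis vectors (topics) and the other holds the membership coefficients of the documents. With documents as the rows of $A$, as in the code below, $W$ holds the document memberships and the rows of $H$ are the basis vectors; if $A$ is stored with terms as rows, the roles swap.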
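
Written out as an optimization problem, the objective might look like the following. This is a sketch that assumes the unregularized Frobenius loss (the default `beta_loss` in scikit-learn's `NMF`) and a matrix with documents as rows; the symbols $n$ (number of documents) and $m$ (number of terms) are introduced here for illustration and are not part of the original notes.

$$ \min_{W \ge 0,\; H \ge 0} \; \tfrac{1}{2}\,\lVert A - W H \rVert_F^2, \qquad A \in \mathbb{R}_{\ge 0}^{n \times m},\quad W \in \mathbb{R}_{\ge 0}^{n \times k},\quad H \in \mathbb{R}_{\ge 0}^{k \times m} $$

Because $k$ is much smaller than both $n$ and $m$, the product $W H$ can only approximate $A$, and the structure it is forced to keep is what gets read as topics.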

## Process

1. Build the document-term matrix $A$ by constructing a vector space model of the documents (after removing stopwords).
2. Apply TF-IDF term weighting to $A$.
3. Normalize the TF-IDF vectors to unit length.
4. Initialize the factors using NNDSVD (Non-negative Double Singular Value Decomposition) on $A$.
5. Apply projected gradient NMF to $A$.
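
The steps above can be sketched end to end on a toy corpus. This is a minimal illustration, not the exact procedure from the list: the four sentences are made up, and since scikit-learn's `NMF` offers coordinate descent (`'cd'`) and multiplicative update (`'mu'`) solvers, the sketch uses the default solver in place of projected gradient.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# Hypothetical toy corpus, used only for illustration
docs = [
    "the senate passed the health insurance bill",
    "insurance coverage and health care costs rise",
    "the band released a new album and a new song",
    "fans love the song on the new album",
]

# Steps 1-3: stopword removal, TF-IDF weighting, and L2 normalization
# (norm='l2' is the TfidfVectorizer default)
vectorizer = TfidfVectorizer(stop_words="english")
A = vectorizer.fit_transform(docs)

# Step 4: NNDSVD initialization; step 5: the library's default solver
# stands in for projected gradient (an assumption of this sketch)
model = NMF(n_components=2, init="nndsvd", random_state=0, max_iter=500)
W = model.fit_transform(A)   # shape (n_documents, k): document memberships
H = model.components_        # shape (k, n_terms): topic basis vectors

print(A.shape, W.shape, H.shape)
```

Note the orientation: with documents as rows, `fit_transform` returns the membership weights and `components_` holds the topics.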

## Result

* Basis Vectors: topics (clusters) in the data
* Coefficient Matrix: membership weights for documents relative to each topic (cluster)




# NMF in Python

In [1]:
import pandas as pd

In [2]:
npr = pd.read_csv('npr.csv')

In [3]:
npr.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11992 entries, 0 to 11991
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Article  11992 non-null  object
dtypes: object(1)
memory usage: 93.8+ KB


In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [5]:
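# Drop terms that appear in more than 95% of articles or in fewer than 2 articles,
# and remove English stopwords
tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')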

In [7]:
# holds TF-IDF weights rather than raw term counts; named `dtm` only to line up with previous code
dtm = tfidf.fit_transform(npr['Article'])

In [8]:
dtm

<11992x54777 sparse matrix of type '<class 'numpy.float64'>'
	with 3033388 stored elements in Compressed Sparse Row format>
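
The repr above is enough to work out how sparse the matrix is. The sketch below only does the arithmetic on the reported figures; the helper name `density` is made up for illustration.

```python
def density(shape, stored):
    """Fraction of matrix cells that hold an explicitly stored value."""
    n_rows, n_cols = shape
    return stored / (n_rows * n_cols)

# Figures copied from the sparse-matrix repr above
shape = (11992, 54777)
stored = 3033388

d = density(shape, stored)
per_doc = stored / shape[0]   # average number of stored terms per article
print(f"density: {d:.4%}, stored terms per article: {per_doc:.1f}")
```

On the live object the same inputs are available as `dtm.shape` and `dtm.nnz`.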

In [2]:
from sklearn.decomposition import NMF

In [3]:
help(NMF)

Help on class NMF in module sklearn.decomposition._nmf:

class NMF(sklearn.base.TransformerMixin, sklearn.base.BaseEstimator)
 |  NMF(n_components=None, *, init='warn', solver='cd', beta_loss='frobenius', tol=0.0001, max_iter=200, random_state=None, alpha='deprecated', alpha_W=0.0, alpha_H='same', l1_ratio=0.0, verbose=0, shuffle=False, regularization='deprecated')
 |  
 |  Non-Negative Matrix Factorization (NMF).
 |  
 |  Find two non-negative matrices (W, H) whose product approximates the non-
 |  negative matrix X. This factorization can be used for example for
 |  dimensionality reduction, source separation or topic extraction.
 |  
 |  The objective function is:
 |  
 |      .. math::
 |  
 |          0.5 * ||X - WH||_{loss}^2
 |  
 |          + alpha\_W * l1_{ratio} * n\_features * ||vec(W)||_1
 |  
 |          + alpha\_H * l1_{ratio} * n\_samples * ||vec(H)||_1
 |  
 |          + 0.5 * alpha\_W * (1 - l1_{ratio}) * n\_features * ||W||_{Fro}^2
 |  
 |          + 0.5 * alpha\_H * 

In [10]:
nmf_model = NMF(n_components=7, random_state=42)

In [11]:
nmf_model.fit(dtm)

NMF(n_components=7, random_state=42)

In [12]:
# get_feature_names() is deprecated in scikit-learn 1.0 and removed in 1.2
tfidf.get_feature_names_out()[2300]

'albala'

In [14]:
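m = 15
# Fetch the vocabulary once instead of once per printed term
feature_names = tfidf.get_feature_names_out()
print(f'TOP {m} words for each topic\n\n')
for idx, topic in enumerate(nmf_model.components_):
    print(f'TOPIC # {idx}:')
    # argsort is ascending, so the last m indices are the m highest-weighted terms
    print([feature_names[index] for index in topic.argsort()[-m:]])
    print('\n')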

TOP 15 words for each topic


TOPIC # 0:
['new', 'research', 'like', 'patients', 'health', 'disease', 'percent', 'women', 'virus', 'study', 'water', 'food', 'people', 'zika', 'says']


TOPIC # 1:
['gop', 'pence', 'presidential', 'russia', 'administration', 'election', 'republican', 'obama', 'white', 'house', 'donald', 'campaign', 'said', 'president', 'trump']


TOPIC # 2:
['senate', 'house', 'people', 'act', 'law', 'tax', 'plan', 'republicans', 'affordable', 'obamacare', 'coverage', 'medicaid', 'insurance', 'care', 'health']


TOPIC # 3:
['officers', 'syria', 'security', 'department', 'law', 'isis', 'russia', 'government', 'state', 'attack', 'president', 'reports', 'court', 'said', 'police']


TOPIC # 4:
['primary', 'cruz', 'election', 'democrats', 'percent', 'party', 'delegates', 'vote', 'state', 'democratic', 'hillary', 'campaign', 'voters', 'sanders', 'clinton']


TOPIC # 5:
['love', 've', 'don', 'album', 'way', 'time', 'song', 'life', 'really', 'know', 'people', 'think', 'just', 'm

In [15]:
topic_results = nmf_model.transform(dtm)

In [17]:
topic_results[0].round(2), topic_results[0].argmax()

(array([0.  , 0.12, 0.  , 0.06, 0.02, 0.  , 0.  ]), 1)
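
The row above can be read two ways: as a hard assignment (the index of the largest weight) or as relative shares across topics. The sketch below shows both on the rounded weights from that output. Rescaling to shares is an assumption made here for readability; NMF weights are not probabilities.

```python
import numpy as np

# Weights copied from the rounded output above (article 0, seven topics)
weights = np.array([0.0, 0.12, 0.0, 0.06, 0.02, 0.0, 0.0])

dominant = int(weights.argmax())   # index of the largest weight
shares = weights / weights.sum()   # rescale so the weights sum to 1

print(dominant, shares.round(2))
```

Article 0 leans mostly on topic 1 but also carries weight on topics 3 and 4, which a plain `argmax` discards.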

In [18]:
# Assign each article the index of its highest-weighted topic
npr['topic'] = topic_results.argmax(axis=1)

In [19]:
npr.head(20)

| | Article | topic |
|---|---|---|
| 0 | In the Washington of 2016, even when the polic... | 1 |
| 1 | Donald Trump has used Twitter — his prefe... | 1 |
| 2 | Donald Trump is unabashedly praising Russian... | 1 |
| 3 | Updated at 2:50 p. m. ET, Russian President Vl... | 3 |
| 4 | From photography, illustration and video, to d... | 6 |
| 5 | I did not want to join yoga class. I hated tho... | 5 |
| 6 | With a who has publicly supported the debunk... | 0 |
| 7 | I was standing by the airport exit, debating w... | 0 |
| 8 | If movies were trying to be more realistic, pe... | 0 |
| 9 | Eighteen years ago, on New Year’s Eve, David F... | 5 |


In [20]:
my_topic_dictionary = {
    6:'education',
    5:'lifestyle',
    4:'elections',
    3:'geopolitics',
    2:'legislation',
    1:'election',
    0:'public health'
}


npr['topic_label'] = npr['topic'].map(my_topic_dictionary)

In [21]:
npr.head(20)

| | Article | topic | topic_label |
|---|---|---|---|
| 0 | In the Washington of 2016, even when the polic... | 1 | election |
| 1 | Donald Trump has used Twitter — his prefe... | 1 | election |
| 2 | Donald Trump is unabashedly praising Russian... | 1 | election |
| 3 | Updated at 2:50 p. m. ET, Russian President Vl... | 3 | geopolitics |
| 4 | From photography, illustration and video, to d... | 6 | education |
| 5 | I did not want to join yoga class. I hated tho... | 5 | lifestyle |
| 6 | With a who has publicly supported the debunk... | 0 | public health |
| 7 | I was standing by the airport exit, debating w... | 0 | public health |
| 8 | If movies were trying to be more realistic, pe... | 0 | public health |
| 9 | Eighteen years ago, on New Year’s Eve, David F... | 5 | lifestyle |
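
After mapping indices to labels, a quick sanity check can confirm that every index received a label and show how the articles spread across topics. The sketch below uses a stand-in frame (the name `toy` is hypothetical) built from the ten topic indices shown in the output above, with the same dictionary.

```python
import pandas as pd

# Stand-in for the `topic` column, copied from the ten rows shown above
toy = pd.DataFrame({"topic": [1, 1, 1, 3, 6, 5, 0, 0, 0, 5]})

my_topic_dictionary = {
    6: "education",
    5: "lifestyle",
    4: "elections",
    3: "geopolitics",
    2: "legislation",
    1: "election",
    0: "public health",
}

toy["topic_label"] = toy["topic"].map(my_topic_dictionary)

# Series.map leaves NaN wherever an index is missing from the dictionary
unmapped = int(toy["topic_label"].isna().sum())

# Number of articles per label, largest first
counts = toy["topic_label"].value_counts()

print(unmapped)
print(counts)
```

Running the same two checks on `npr` would show whether any single topic absorbs most of the corpus, which is often a hint to revisit `n_components`.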
