Using scikit-learn dataset: Newsgroups

In [3]:
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn import decomposition
from scipy import linalg
import matplotlib.pyplot as plt

In [4]:
%matplotlib inline
np.set_printoptions(suppress=True)

Load the training and test subsets

In [6]:
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
remove = ('headers', 'footers', 'quotes')
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, remove=remove)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories, remove=remove)

In [13]:
num_topics, num_top_words = 6, 8
print("\n".join(newsgroups_test.data[:5]))

TRry the SKywatch project in  Arizona.
The Vatican library recently made a tour of the US.
 Can anyone help me in finding a FTP site where this collection is 
 available.
Hi there,

I am here looking for some help.

My friend is a interior decor designer. He is from Thailand. He is
trying to find some graphics software on PC. Any suggestion on which
software to buy,where to buy and how much it costs ? He likes the most
sophisticated 
software(the more features it has,the better)
RFD
                          Request For Discussion
                                for the
                          OPEN  TELEMATIC GROUP

                                  OTG

I have proposed the forming of a consortium/task force for the
promotion of NAPLPS/JPEG, FIF to openly discuss ways, method,
procedures,algorythms, applications, implementation, extensions of
NAPLPS/JPEG standards.  These standards should facilitate the creation
of REAL_TIME Online applications that make use of Voice, Video,
Telecomm

Preprocess the data

In [14]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import nltk

In [19]:
vectorizer = CountVectorizer(stop_words='english')
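# toarray() returns a plain ndarray; recent scikit-learn versions reject the np.matrix that todense() returns
vectors = vectorizer.fit_transform(newsgroups_train.data).toarray()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
vocab = np.array(vectorizer.get_feature_names_out(), dtype=str)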
vocab[10000:10020]

array(['factors', 'factory', 'facts', 'factsnet', 'factual', 'factually',
       'faculty', 'fade', 'fades', 'fading', 'fag', 'faget', 'faggots',
       'fahrenheit', 'fai', 'fail', 'failed', 'failing', 'fails',
       'failsafe'], dtype='<U80')

Using SVD

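Singular value decomposition (SVD) factors a matrix into the product of three matrices:
    
    A = U S V^T
    
where, for the reduced SVD computed below (full_matrices=False) and k = min(m, n):
    
    A is an m × n matrix
    U is an m × k matrix with orthonormal columns
    S is a k × k diagonal matrix of nonnegative singular values, sorted in descending order
    V^T is a k × n matrix with orthonormal rows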
    
    
https://towardsdatascience.com/understanding-singular-value-decomposition-and-its-application-in-data-science-388a54be95d
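
To make the shapes concrete, here is a minimal sketch on a made-up 3 × 5 matrix. It uses np.linalg.svd, which takes the same full_matrices argument as the scipy.linalg.svd call used in this notebook; the matrix values are invented for illustration only.

```python
import numpy as np

# Toy "document-term" matrix (made-up values): 3 documents (rows), 5 terms (columns).
A = np.array([
    [2.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 3.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0, 0.0, 4.0],
])
m, n = A.shape
k = min(m, n)

# Reduced SVD: U is m x k, s holds the k singular values, Vh is k x n.
U, s, Vh = np.linalg.svd(A, full_matrices=False)
print(U.shape, s.shape, Vh.shape)  # (3, 3) (3,) (3, 5)

# The columns of U and the rows of Vh are orthonormal.
assert np.allclose(U.T @ U, np.eye(k))
assert np.allclose(Vh @ Vh.T, np.eye(k))

# Singular values are nonnegative and sorted in descending order.
assert np.all(s >= 0) and np.all(np.diff(s) <= 0)

# Multiplying the three factors back together recovers A.
A_rebuilt = U @ np.diag(s) @ Vh
assert np.allclose(A_rebuilt, A)
```

Note that s comes back as a 1-D array of singular values rather than a diagonal matrix, which is why np.diag(s) is needed for the reconstruction.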

Describe each topic by its highest-weighted words

In [25]:
%time U, s, Vh = linalg.svd(vectors, full_matrices=False)

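num_top_words = 5  # overrides the value of 8 set earlier

def show_topics(a):
    # For each row (topic) of a, take the indices of the num_top_words largest
    # weights, in descending order, and map them to vocabulary words.
    top_words = lambda t: [vocab[i] for i in np.argsort(t)[:-num_top_words-1:-1]]
    topic_words = [top_words(t) for t in a]
    return [' '.join(t) for t in topic_words]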

Wall time: 15.5 s
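
The slice np.argsort(t)[:-num_top_words-1:-1] inside show_topics is compact but easy to misread. The sketch below walks through it on a tiny example; toy_vocab, weights, and k are hypothetical names and values that do not come from the dataset.

```python
import numpy as np

toy_vocab = np.array(['apple', 'banana', 'cherry', 'date', 'elder'])  # hypothetical vocabulary
weights = np.array([0.1, 0.9, 0.3, 0.7, 0.5])                         # hypothetical topic row
k = 3                                                                 # plays the role of num_top_words

order = np.argsort(weights)   # indices sorted by ascending weight: [0, 2, 4, 3, 1]
top = order[:-k - 1:-1]       # walk backwards from the end, taking k indices: [1, 3, 4]

top_words = [str(w) for w in toy_vocab[top]]
print(top_words)  # ['banana', 'date', 'elder']

# With no tied weights, sorting the negated weights gives the same indices.
assert list(top) == list(np.argsort(-weights)[:k])
```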


In [26]:
show_topics(Vh[:10])

['ditto critus propagandist surname galacticentric',
 'jpeg gif file color quality',
 'graphics edu pub mail 128',
 'jesus god matthew people atheists',
 'image data processing analysis software',
 'god atheists atheism religious believe',
 'space nasa lunar mars probe',
 'image probe surface lunar mars',
 'argument fallacy conclusion example true',
 'space larson image theory universe']

Using NMF

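NMF (nonnegative matrix factorization) is a matrix factorization method in which the factor matrices are constrained to be nonnegative: a nonnegative m × n matrix is approximated by the product W H, where W is m × d, H is d × n, and d is the number of topics. To understand NMF, it helps to first clarify the intuition behind matrix factorization; the links below cover it.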


https://blog.acolyer.org/2019/02/18/the-why-and-how-of-nonnegative-matrix-factorization/

https://mlexplained.com/2017/12/28/a-practical-introduction-to-nmf-nonnegative-matrix-factorization/

The paper:

https://papers.nips.cc/paper/1861-algorithms-for-non-negative-matrix-factorization.pdf
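
As a rough illustration of the idea, here is a minimal NumPy sketch of the multiplicative update rules for the squared-error objective described in the paper linked above. It is not what decomposition.NMF runs by default (scikit-learn's default solver is coordinate descent), and the toy matrix, the iteration count, and the eps guard are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy nonnegative matrix: 6 "documents" x 8 "terms", built from 2 hidden topics.
W_true = rng.random((6, 2))
H_true = rng.random((2, 8))
V = W_true @ H_true

d = 2          # number of topics to recover
eps = 1e-10    # assumption: small constant to avoid division by zero

# Random nonnegative starting point.
W = rng.random((V.shape[0], d))
H = rng.random((d, V.shape[1]))

initial_error = np.linalg.norm(V - W @ H)

for _ in range(500):
    # Each update multiplies by a nonnegative ratio, so W and H stay nonnegative.
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)

final_error = np.linalg.norm(V - W @ H)
print(initial_error, final_error)
```

The rows of H play the same role as clf.components_ in the cells below: each row holds one topic's weights over the vocabulary.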




In [29]:
m,n=vectors.shape
d=6  # Number of topics

In [30]:
clf = decomposition.NMF(n_components=d, random_state=1)

W1 = clf.fit_transform(vectors)
H1 = clf.components_

In [31]:
show_topics(H1)

['jpeg image gif file color',
 'edu graphics pub mail 128',
 'space launch satellite nasa commercial',
 'jesus matthew prophecy people said',
 'image data available software processing',
 'god atheists atheism religious believe']

Create topics with data preprocessed using TF-IDF

In [32]:
vectorizer_tfidf = TfidfVectorizer(stop_words='english')
vectors_tfidf = vectorizer_tfidf.fit_transform(newsgroups_train.data) 
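
For intuition about what the TF-IDF weighting does to raw counts, the sketch below reproduces by hand what TfidfVectorizer is documented to compute with its default settings (smoothed IDF followed by L2 row normalization). The count matrix is made up, and the formula is written out here as an assumption about those defaults rather than taken from this notebook.

```python
import numpy as np

# Toy term counts (made-up values): 3 documents (rows) x 4 terms (columns).
counts = np.array([
    [3, 0, 1, 1],
    [0, 2, 0, 1],
    [1, 0, 0, 1],
], dtype=float)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)   # number of documents containing each term

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts by IDF, then scale each document row to unit L2 norm.
tfidf = counts * idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)

print(idf)
print(tfidf)
```

The last term appears in every document, so it receives the smallest IDF weight; words that occur almost everywhere therefore contribute less to each document vector than they do with raw counts.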

In [33]:
W1 = clf.fit_transform(vectors_tfidf)
H1 = clf.components_

In [34]:
show_topics(H1)

['don people just think like',
 'thanks graphics files image file',
 'space nasa launch shuttle orbit',
 'ico bobbe tek beauchaine bronx',
 'god jesus bible believe atheism',
 'objective morality values moral subjective']