### NMF example

Rather than constraining our factors to be orthogonal, another idea would be to constrain them to be non-negative. NMF is a factorization of a non-negative data set 𝑉:

𝑉 = 𝑊𝐻

into non-negative matrices 𝑊 and 𝐻. In practice the product 𝑊𝐻 only approximates 𝑉. Non-negative factors are often easier to interpret, which is the reason behind NMF's popularity.
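
To make the factorization concrete, here is a minimal sketch of NMF using the classic multiplicative update rules for the Frobenius-norm objective. This is an illustration only: the function name `nmf_multiplicative` and the toy matrix are assumptions made for this example, and scikit-learn's `NMF` (used later in this notebook) uses a different solver by default.

```python
import numpy as np

def nmf_multiplicative(V, d, n_iter=200, eps=1e-10, seed=0):
    """Approximate a non-negative matrix V (m x n) as W @ H with W, H >= 0."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, d))
    H = rng.random((d, n))
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio, so no entry
        # of W or H can ever become negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy non-negative matrix of rank 2 (hypothetical data).
V = np.array([[1.0, 2.0, 0.0],
              [2.0, 4.0, 0.0],
              [0.0, 1.0, 3.0],
              [0.0, 2.0, 6.0]])

W, H = nmf_multiplicative(V, d=2)
rel_error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(W.shape, H.shape, rel_error)
```

The relative error shows how closely 𝑊𝐻 reproduces 𝑉; more iterations or a different random seed may change it.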

A good video to watch on this topic:

1. https://www.youtube.com/watch?v=o4pPTwsd-5M


In [41]:
import pandas as pd
import numpy as np
import json
import zipfile
import os
import time

from sklearn import preprocessing
import matplotlib.pyplot as plt

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.datasets import fetch_20newsgroups
from sklearn import decomposition


from sklearn.datasets import load_iris
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from scipy import linalg

### Download the data to run NMF and TF_IDF

In [3]:
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
remove = ('headers', 'footers', 'quotes')
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, remove=remove)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories, remove=remove)

In [4]:
newsgroups_train.filenames.shape, newsgroups_train.target.shape

((2034,), (2034,))

In [5]:
print(newsgroups_train.target_names)
print(np.unique(newsgroups_train.target))

['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']
[0 1 2 3]


In [6]:
print("\n".join(newsgroups_train.data[:3]))

Hi,

I've noticed that if you only save a model (with all your mapping planes
positioned carefully) to a .3DS file that when you reload it after restarting
3DS, they are given a default position and orientation.  But if you save
to a .PRJ file their positions/orientation are preserved.  Does anyone
know why this information is not stored in the .3DS file?  Nothing is
explicitly said in the manual about saving texture rules in the .PRJ file. 
I'd like to be able to read the texture rule information, does anyone have 
the format for the .PRJ file?

Is the .CEL file format available from somewhere?

Rych


Seems to be, barring evidence to the contrary, that Koresh was simply
another deranged fanatic who thought it neccessary to take a whole bunch of
folks with him, children and all, to satisfy his delusional mania. Jim
Jones, circa 1993.


Nope - fruitcakes like Koresh have been demonstrating such evil corruption
for centuries.

 >In article <1993Apr19.020359.26996@sq.sq.com>, msb@sq.sq.c

In [7]:
np.array(newsgroups_train.target_names)[newsgroups_train.target[:3]]

array(['comp.graphics', 'talk.religion.misc', 'sci.space'], dtype='<U18')

In [8]:
vectorizer = CountVectorizer(stop_words='english')
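# .toarray() returns a plain ndarray; .todense() returns np.matrix, which recent scikit-learn versions reject
x_input = vectorizer.fit_transform(newsgroups_train.data).toarray()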
x_input.shape

(2034, 26576)
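
To show what the document-term matrix contains, here is a hand-rolled bag-of-words sketch using only the standard library. It mirrors two `CountVectorizer` defaults (lowercasing, and tokens of two or more word characters) but skips stop-word removal. The three-document corpus is hypothetical.

```python
import re
from collections import Counter

# Hypothetical three-document corpus, used only for illustration.
docs = [
    "The launch of the space shuttle",
    "GIF and JPEG image formats",
    "Space probe images of the moon",
]

# Lowercase, then keep tokens made of at least two word characters.
token_pattern = re.compile(r"(?u)\b\w\w+\b")
tokenized = [token_pattern.findall(doc.lower()) for doc in docs]

# Column order follows the alphabetically sorted vocabulary.
vocab = sorted({tok for toks in tokenized for tok in toks})
index = {tok: j for j, tok in enumerate(vocab)}

# One row per document, one column per vocabulary term.
counts = [[0] * len(vocab) for _ in docs]
for i, toks in enumerate(tokenized):
    for tok, c in Counter(toks).items():
        counts[i][index[tok]] = c

print(vocab)
for row in counts:
    print(row)
```

Each row is the word-count vector of one document, which is what a row of `x_input` holds for the real corpus.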

In [9]:
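# get_feature_names() was removed in scikit-learn 1.2; on newer versions use get_feature_names_out()
vocab = np.array(vectorizer.get_feature_names())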
vocab.shape

(26576,)

In [10]:
vocab[7000:7020]

array(['cosmonauts', 'cosmos', 'cosponsored', 'cost', 'costa', 'costar',
       'costing', 'costly', 'costruction', 'costs', 'cosy', 'cote',
       'couched', 'couldn', 'council', 'councils', 'counsel',
       'counselees', 'counselor', 'count'], dtype='<U80')

In [36]:
np.unique(newsgroups_train.target, return_counts=True)

(array([0, 1, 2, 3]), array([480, 584, 593, 377]))

In [37]:
newsgroups_train.target_names

['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']

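_The classes are not uniformly distributed: sci.space (593) and comp.graphics (584) have the most documents, while talk.religion.misc (377) has the fewest._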
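
To check how balanced the classes are, the counts printed above can be turned into proportions. A small sketch, with the counts copied from the output of `np.unique(..., return_counts=True)`:

```python
import numpy as np

target_names = ['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']
counts = np.array([480, 584, 593, 377])  # copied from the cell output above

# Share of the training set that falls in each class.
proportions = counts / counts.sum()
for name, p in zip(target_names, proportions):
    print(f"{name:20s} {p:.3f}")

smallest_class = target_names[int(np.argmin(proportions))]
```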

#### Calculate NMF

In [12]:
m,n = x_input.shape
d=10  # num topics

In [16]:
clf = decomposition.NMF(n_components=d, random_state=1)

W1 = clf.fit_transform(x_input)
H1 = clf.components_

In [24]:
W1.shape

(2034, 10)

In [18]:
H1.shape

(10, 26576)
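
The shapes above can be read as `W1`: (documents, topics) and `H1`: (topics, terms). The following sketch repeats the same calls on a tiny count matrix and picks the heaviest topic for each document. The matrix and the choice of `init="nndsvda"` and `max_iter=500` are assumptions made for this example, not settings used in the notebook.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical 4-document, 5-term count matrix.
X = np.array([[3, 2, 0, 0, 0],
              [4, 1, 0, 0, 1],
              [0, 0, 2, 3, 0],
              [0, 1, 3, 4, 0]], dtype=float)

model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(X)   # (documents, topics): topic weights per document
H = model.components_        # (topics, terms): term weights per topic

# Index of the topic with the largest weight in each document.
dominant_topic = W.argmax(axis=1)
print(W.shape, H.shape)
print(dominant_topic, model.reconstruction_err_)
```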

### Find the top words in each of the 10 components

In [20]:
num_top_words=8

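def show_topics(a):
    # The slice [-num_top_words-1:] keeps the last num_top_words + 1 indices of the
    # ascending argsort, so each topic lists 9 words, heaviest last.
    top_words = lambda t: [vocab[i] for i in np.argsort(t)[-num_top_words-1:]]
    topic_words = ([top_words(t) for t in a])
    return [' '.join(t) for t in topic_words]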
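
The slice in `show_topics` returns `num_top_words + 1` words per topic, ordered from lightest to heaviest. If exactly `k` words with the heaviest first are wanted, a variant could look like the sketch below. The function name `top_words_per_topic` and the toy vocabulary and weights are hypothetical.

```python
import numpy as np

def top_words_per_topic(H, vocab, k):
    """Return the k highest-weighted words of each row of H, heaviest first."""
    topics = []
    for row in H:
        top_idx = np.argsort(row)[::-1][:k]   # descending order, exactly k indices
        topics.append(' '.join(vocab[top_idx]))
    return topics

# Hypothetical vocabulary and topic-term weights.
toy_vocab = np.array(['image', 'jpeg', 'nasa', 'orbit', 'space'])
toy_H = np.array([[0.9, 0.5, 0.0, 0.1, 0.3],
                  [0.0, 0.1, 0.6, 0.2, 0.7]])

toy_topics = top_words_per_topic(toy_H, toy_vocab, k=3)
print(toy_topics)
```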

In [21]:
### H1 is non-negative for NMF, so np.abs is a no-op here; it only matters for signed factorizations such as SVD
show_topics(np.abs(H1[0:10]))

['version quality format images color file gif image jpeg',
 'ftp 3d send 128 ray mail pub graphics edu',
 'services data year satellites market commercial satellite space launch',
 'psalm isaiah david messiah said people prophecy matthew jesus',
 'tool tools edu images analysis software processing data image',
 'just does people religion believe religious atheism atheists god',
 'sci list propulsion available information center shuttle nasa space',
 'space earth orbit probes surface moon mars lunar probe',
 'false premises argumentum ad true example conclusion fallacy argument',
 'contact sgi pub graphics image edu ftp available data']

#### Find some documents that contain these words

In [25]:
### First convert the document to lowercase

newsgroups_train["lower_data"] = []
for item in newsgroups_train.data:
    newsgroups_train["lower_data"].append(item.lower())
len(newsgroups_train["lower_data"])

2034

In [39]:
### search for the document that has the text
a = np.char.find(newsgroups_train.lower_data,'gif')
idx_document = np.where(a != -1)
idx_document

(array([  53,   90,  119,  154,  184,  205,  239,  312,  351,  358,  459,
         481,  499,  553,  566,  570,  581,  697,  698,  715,  722,  839,
         873,  874,  927, 1018, 1028, 1042, 1139, 1141, 1154, 1217, 1238,
        1258, 1278, 1290, 1312, 1322, 1332, 1406, 1425, 1426, 1454, 1459,
        1518, 1527, 1578, 1590, 1600, 1628, 1634, 1661, 1679, 1682, 1691,
        1718, 1723, 1733, 1762, 1775, 1786, 1799, 1803, 1912, 1936, 2008,
        2011]),)

In [33]:
for label_id in newsgroups_train.target[idx_document]:
    print(newsgroups_train.target_names[label_id])

sci.space
comp.graphics
alt.atheism
talk.religion.misc
comp.graphics
comp.graphics
comp.graphics
comp.graphics
comp.graphics
comp.graphics
comp.graphics
sci.space
comp.graphics
sci.space
talk.religion.misc
comp.graphics
comp.graphics
comp.graphics
sci.space
comp.graphics
comp.graphics
sci.space
sci.space
comp.graphics
comp.graphics
comp.graphics
comp.graphics
alt.atheism
comp.graphics
sci.space
talk.religion.misc
comp.graphics
comp.graphics
comp.graphics
comp.graphics
comp.graphics
comp.graphics
comp.graphics
comp.graphics
comp.graphics
alt.atheism
comp.graphics
comp.graphics
comp.graphics
talk.religion.misc
comp.graphics
comp.graphics
sci.space
comp.graphics
comp.graphics
comp.graphics
comp.graphics
sci.space
sci.space
sci.space
comp.graphics
comp.graphics
comp.graphics
comp.graphics
comp.graphics
comp.graphics
comp.graphics
comp.graphics
comp.graphics
comp.graphics
comp.graphics
talk.religion.misc


_Most documents containing "gif" are labelled comp.graphics, which suggests that this component leans towards comp.graphics._
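
Two caveats about the search above: `np.char.find` matches substrings, so "gif" also matches words such as "gifts", and a long printed list of labels is hard to read. The sketch below compares substring matching with whole-word matching and tallies the labels with `Counter`. The documents and labels are hypothetical stand-ins for the newsgroup posts.

```python
import re
from collections import Counter

# Hypothetical lowercase documents and their labels.
docs = ["a gif viewer for dos", "gifts for the crew",
        "jpeg vs. gif quality", "shuttle launch"]
labels = ["comp.graphics", "sci.space", "comp.graphics", "sci.space"]

# Substring match: also catches "gifts".
substring_hits = [i for i, d in enumerate(docs) if 'gif' in d]

# Whole-word match: \b marks a word boundary on both sides.
word_hits = [i for i, d in enumerate(docs) if re.search(r"\bgif\b", d)]

# Count labels of the matching documents instead of printing each one.
label_counts = Counter(labels[i] for i in word_hits)
print(substring_hits, word_hits, label_counts.most_common())
```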

### Calculate tf_idf for the documents

In [47]:
vectorizer_tfidf = TfidfVectorizer(stop_words='english')
#vectorizer_tfidf = TfidfVectorizer()
vectors_tfidf = vectorizer_tfidf.fit_transform(newsgroups_train.data) # (documents, vocab)

In [48]:
vectors_tfidf.shape

(2034, 26576)

In [49]:
print(vectors_tfidf)

  (0, 21025)	0.1458931413020125
  (0, 3998)	0.07036762966055367
  (0, 5546)	0.1458931413020125
  (0, 10605)	0.16717988448915072
  (0, 20973)	0.09485258934884024
  (0, 19841)	0.06346990252225734
  (0, 2408)	0.0740617740444856
  (0, 14706)	0.046643803845554874
  (0, 20977)	0.09029017643192268
  (0, 23828)	0.2258173261964949
  (0, 21208)	0.10761272733713331
  (0, 15301)	0.1064966894139308
  (0, 21084)	0.06206087717654851
  (0, 9848)	0.1064966894139308
  (0, 22878)	0.11143515786946494
  (0, 13023)	0.13924444936699948
  (0, 14154)	0.04763446792799959
  (0, 8554)	0.10162931867025875
  (0, 18949)	0.13313300331371947
  (0, 18704)	0.10260670288726481
  (0, 19066)	0.4152868172245007
  (0, 17464)	0.22287031573892988
  (0, 18699)	0.0823823626717928
  (0, 7698)	0.11451045447542962
  (0, 11203)	0.07006583621645067
  :	:
  (2032, 5893)	0.0865825827883191
  (2032, 11856)	0.058644602949810304
  (2032, 3463)	0.07548272575460073
  (2032, 3834)	0.06627380676073108
  (2032, 11291)	0.04827424514898305
  (20

#### Analyze idf_ values

In [51]:
vectorizer_tfidf.idf_.shape

(26576,)

#### Analyze the top 5 and bottom 5 idf values

In [58]:
### words with highest idf_ values (i.e. less frequent words)
np.argsort(vectorizer_tfidf.idf_)[-5:]

array([ 6692, 16226, 16227, 16213, 13287])

In [66]:
word_id_to_review = np.argsort(vectorizer_tfidf.idf_)[-5:][0]
word_id_to_review

6692

In [67]:
### Let's look at 6692
cx = vectors_tfidf.tocoo()    
for i,j,v in zip(cx.row, cx.col, cx.data):
    if j == word_id_to_review:
        print (i,j,v)

746 6692 0.04517145895198725
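
Looping over every non-zero entry works, but it scans the whole matrix to inspect a single column. A possible alternative is to slice the column directly from a CSC copy of the matrix. The small matrix below is a hypothetical stand-in for `vectors_tfidf`.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 4 x 3 document-term matrix.
X = csr_matrix(np.array([[0.0, 0.2, 0.0],
                         [0.5, 0.0, 0.0],
                         [0.0, 0.7, 0.1],
                         [0.0, 0.0, 0.0]]))
j = 1  # column (term id) to inspect

# CSC storage makes column slicing cheap; COO exposes row indices and values.
col = X.tocsc()[:, j].tocoo()
for i, v in zip(col.row, col.data):
    print(i, j, v)

doc_ids = sorted(int(i) for i in col.row)
```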


In [69]:
reverse_key = {}
for key, value in vectorizer_tfidf.vocabulary_.items():
    reverse_key[value] = key
reverse_key[6692]

'consign'

In [70]:
newsgroups_train.data[746]

'In <lsjc8cINNmc1@saltillo.cs.utexas.edu> turpin@cs.utexas.edu (Russell Turpin)\n\n\nI regard love as no more or less "benign" than any other Christian does.\nYou are merely expressing "approval" of the consequences I find therein.\nWhich says more about our politics and cultural trappings than about my\n(or any) religion.  "Love" is a highly ambiguous word, of which Christians\ncan write both the "gentle" words Paul uses of it in 1 Corinthians -- in\na passage that even the "conservatives" will quote at you :-) -- and the\nwords of T. S. Eliot in his Pentacost Hymn, "Love is the unfamiliar Name\nthat wove the intolerable shirt of flame ..."\n\nThis is in any case rather to the side of what I was attempting to raise\nin my note, as will become more evident below.\n\n\nblechhh.  I think you are misreading me, rather seriously.  Though,\ngiven my principle that one CANNOT force one\'s own notion of "sin" on\nanother, and my unshakeable "disestablishmentarianism", Russel Turpin\nand other

In [53]:
np.argsort(vectorizer_tfidf.idf_)[:5]

array([ 8591, 14706, 13905, 14154, 23927])

In [77]:
reverse_key = {}
for key, value in vectorizer_tfidf.vocabulary_.items():
    reverse_key[value] = key
for word_id in np.argsort(vectorizer_tfidf.idf_)[:5]:
    print (reverse_key[word_id])

don
like
just
know
think


In [74]:
word_id_to_review = np.argsort(vectorizer_tfidf.idf_)[:5][1]
word_id_to_review

14706

In [75]:
### Let's look at 14706 ("like")
cx = vectors_tfidf.tocoo()    
for i,j,v in zip(cx.row, cx.col, cx.data):
    if j == word_id_to_review:
        print (i,j,v)

0 14706 0.046643803845554874
1 14706 0.07764856119560923
3 14706 0.06738610255088694
5 14706 0.056766835811944166
6 14706 0.0395390203672734
8 14706 0.07328892238622065
9 14706 0.07648041797367917
11 14706 0.05402000537456859
13 14706 0.14912631949286906
14 14706 0.05266926742022202
16 14706 0.01636521415981106
17 14706 0.12388912053325497
18 14706 0.10922351848885939
23 14706 0.014409194085636037
33 14706 0.030327474421354653
39 14706 0.044282748562448106
41 14706 0.03554206913208134
42 14706 0.10638233860222919
52 14706 0.009110339761786249
55 14706 0.04156449811094116
72 14706 0.10025728632064736
78 14706 0.08575693037677251
82 14706 0.017853524086736027
88 14706 0.06424280124365672
90 14706 0.038790884078079506
95 14706 0.03670631062739414
100 14706 0.0696718031585248
106 14706 0.03476101514567444
107 14706 0.06076603354193711
119 14706 0.019194402452927112
123 14706 0.04856595001077073
129 14706 0.06491117814778462
131 14706 0.004797379656969984
148 14706 0.0777282856437142
151 14

1944 14706 0.12700575215431004
1950 14706 0.05782506498576146
1964 14706 0.05276497970482318
1972 14706 0.029392537708145875
1975 14706 0.03660314529378269
1987 14706 0.060783303409795326
1988 14706 0.012198315222441086
1992 14706 0.10234501882881485
1993 14706 0.022960721261637018
1994 14706 0.04689352912446415
2002 14706 0.023280277981438984
2003 14706 0.06888432776588563
2008 14706 0.06923702463534868
2010 14706 0.01737109245222705
2011 14706 0.009638833561243914
2018 14706 0.10539791856194394
2020 14706 0.0476326026347598
2021 14706 0.0696109992662231
2023 14706 0.15537138822166874
2028 14706 0.041605357493379076
2032 14706 0.03711866463525761


_As can be seen, the words with low `idf_` values are common words that appear in many documents._
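
The link between document frequency and idf can be sketched directly. With its default `smooth_idf=True`, `TfidfVectorizer` computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. The helper name `smoothed_idf` and the document frequencies tried below are hypothetical.

```python
import math

def smoothed_idf(df, n_docs):
    """Smoothed idf of a term that appears in df out of n_docs documents."""
    return math.log((1 + n_docs) / (1 + df)) + 1

n_docs = 2034  # number of training documents in this notebook

# Hypothetical document frequencies, from very rare to present everywhere.
for df in (1, 10, 100, 1000, n_docs):
    print(df, round(smoothed_idf(df, n_docs), 3))
```

A term found in a single document gets the largest value, while a term found in every document gets the minimum value of 1.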