ZHU Fangda & ZHANG Bolong

In [1]:
from time import time
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups
import numpy as np
import scipy.sparse as sp

# TOPIC EXTRACTION FROM DOCUMENTS

The goal is to study the use of nonnegative matrix factorisation (NMF) for topic extraction from a dataset of text documents. The rationale is to interpret each extracted NMF component as being associated with a specific topic.

Study and test the following script (introduced on http://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html) :

In [2]:
n_samples = 2000
n_features = 1000
n_components = 10
n_top_words = 20

In [3]:
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()

In [4]:
def preprocess(vectorizer='tf_idf', verbose=False):
    print("Loading dataset...")
    t0 = time()
    dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                                 remove=('headers', 'footers', 'quotes'))
    data_samples = dataset.data[:n_samples]
    if(verbose):
        print("Loading dataset done in %0.3fs." % (time() - t0))

    if vectorizer == 'tf_idf':
        # Use tf-idf features for NMF.
        _vectorizer = TfidfVectorizer(input = "content", max_df=0.95,                                            max_features=n_features,
                                           stop_words='english')
    elif vectorizer == 'tf':
        # Use tf features for NMF.
        _vectorizer = CountVectorizer(input = "content", max_df=0.95,
                                max_features=n_features,
                                stop_words='english')
    else:
        raise ValueError("Excepted value of vectorizer is tf_idf or tf.")
        
    t0 = time()
    features = _vectorizer.fit_transform(data_samples)
    feature_names = _vectorizer.get_feature_names()
    if(verbose):
        print(" for LDA...")
        print("Extracting" + vectorizer +  "features done in %0.3fs." % (time() - t0))
    return features, feature_names

In [5]:
def NMF_SK(features, _vectorizerName=None, W=None, H=None, K = None
             ,random_state=None
             ,solver= 'cd', beta_loss = 'frobenius', init='random',verbose = False ):

    t0 = time()
    nmf = NMF(n_components, init, solver, beta_loss, 
                  random_state=random_state,
                  alpha=.1, l1_ratio=.5, verbose = verbose).fit(features)
    if init =='random':
        nmf = nmf.fit(features)
    else:
        nmf = nmf.fit_transform(features, W=_W, H=_H)

    if(verbose):
         print("NMF done in %0.3fs." % (time() - t0))
    return nmf, n_top_words

## Test and comment on the effect of varying the initialisation, especially using random
nonnegative values as initial guesses (for W and H coefficients, using the notations introduced during the lecture)

In [6]:
features, feature_names = preprocess()
nmf, n_top_words = NMF_SK(features)
print_top_words(nmf, feature_names, n_top_words)

Loading dataset...
Topic #0: just people don think like know time good make way really say ve right want did use ll new years
Topic #1: god jesus bible faith christian christ christians does heaven sin believe lord life church mary atheism love belief human religion
Topic #2: drive drives hard disk software floppy card mac 00 power computer scsi controller apple mb pc sale rom monitor memory
Topic #3: car cars tires miles new engine insurance 00 price condition oil speed power 000 good brake models bought year used
Topic #4: game team games year win play season players nhl runs goal toronto hockey division flyers player defense leafs bad teams
Topic #5: edu soon send com university internet mit ftp mail cc article pub information hope email contact home blood mac program
Topic #6: thanks know does mail advance hi info interested email anybody looking card help like appreciated information list send video need
Topic #7: windows file dos files program use using window problem help os run

In [7]:
features, feature_names = preprocess()
nmf, n_top_words = NMF_SK(features, random_state = 26)
print_top_words(nmf, feature_names, n_top_words)

Loading dataset...
Topic #0: just people don think like know time good way make really say ve right did ll new want going years
Topic #1: god jesus bible faith christian christ christians does heaven sin believe lord life church mary atheism love belief human religion
Topic #2: car cars tires miles new engine insurance price 00 oil condition power speed good 000 brake year models bought used
Topic #3: windows file dos files program using problem window os help running drivers ftp ms version pc application screen work available
Topic #4: key chip clipper keys encryption government public enforcement secure phone law nsa communications security privacy clinton used message user going
Topic #5: thanks know does mail advance hi info interested email anybody looking card help like appreciated information list send video need
Topic #6: drive drives disk hard software card floppy 00 mac computer power scsi controller apple pc mb sale monitor rom memory
Topic #7: game team games year win play 

According to the result, we can find the initial value of W and H have a influence to the final results. So we can find the alogo is not stable. The result depends on the initialisation, we may say that the results are similar, the order of topic is different with different initial value.

## Compare and comment on the difference between the results obtained with $l_2$ cost compared to the generalised Kullback-Liebler cost

In [8]:
features, feature_names = preprocess()
nmf, n_top_words = NMF_SK(features,solver = 'mu', beta_loss='kullback-leibler')
print_top_words(nmf, feature_names, n_top_words)

Loading dataset...




Topic #0: thanks using windows need use know help hi does file software problem work advance version info pc mail drive video
Topic #1: work people heard state small different write going able news tell unless gets idea order common law given right look
Topic #2: want time make sure things let got good hard stuff real like way look need nice long just new pretty
Topic #3: used use guess public general wouldn years key using light government course rest currently second control times national doing nasa
Topic #4: wrong support way believe usually people says did matter reason set word far com time instead fact said god called
Topic #5: year post won mail send working thanks said posting check number don reply runs lot mentioned case net bad edu
Topic #6: years team new 20 ago states play women 11 possible 40 13 second started jewish 1993 total 10 white day
Topic #7: looking interested new world price sale university good sell couple buy offer cost weeks edu source email 10 bike phone
To

The topics found seem similar, but not exactly, for example, there is no topic about religion for Kullback-Liebler cost which may not be very precise. Also, l2 cost may be more efficient, it extracts more information regarding the topics. And l2 cost converge more fast, the results of kullback-leibler is very larger. So with kullback-leibler, we can get WH which are more close to V for the same number of steps. 

## Test and comment on the results obtained using a simpler term-frequency representation as input (as opposed to the TF-IDF representation considered in the code above) when considering the Kullback-Liebler cost.

In [9]:
features, feature_names = preprocess('tf')
nmf, n_top_words = NMF_SK(features,solver = 'mu', beta_loss='kullback-leibler')
print_top_words(nmf, feature_names, n_top_words)

Loading dataset...
Topic #0: don just like think people know good make way ve want really say going sure ll doesn right need things
Topic #1: didn car people said just know went like did time don came going home old got come right started bike
Topic #2: edu com mail graphics send pub file ftp server files code faq list message image cs format mit available xfree86
Topic #3: government key use law state public israel encryption clipper chip keys section gun used security weapons person military insurance enforcement
Topic #4: 10 drive 55 disk 16 11 hard drives 25 15 17 controller 18 12 rom 21 card 20 23 13
Topic #5: space year game team play years earth points moon surface probe season games flyers lunar players new mission orbit 10
Topic #6: god does people jesus bible law believe true church point fact life christian time did jews say world book religion
Topic #7: people 000 new hiv health children research president 1993 said aids april national states turkish program information car

Neither the simple Term Frequency representation and the simple Count of tokens has a better result. For example, topic 4, there is no effective or much useful information/word. With tf_idf, it is more easy to distinguish the similar topic.

# Custom NMF Implementation

In [10]:
def _special_sparse_dot(W, H, X):
    """Computes np.dot(W, H), only where X is non zero."""
    if sp.issparse(X):
        ii, jj = X.nonzero()
        dot_vals = np.multiply(W[ii, :], H.T[jj, :]).sum(axis=1)
        WH = sp.coo_matrix((dot_vals, (ii, jj)), shape=X.shape)
        return WH.tocsr()
    else:
        return np.dot(W, H)

In [11]:
def _beta_divergence(X, Y, beta):
    if beta == 0:
        return np.sum(X/Y - np.ma.log(X/Y) - 1)
    elif beta == 1:
        item = np.ma.log( np.ma.divide(X,Y))
        item = item.filled(0)
        return np.sum(np.multiply(X,item) - X + Y)
    else:
        term1 = X**beta
        term2 = (beta - 1) * Y**beta
        term3 = beta * np.multiply(X,Y**(beta-1))
        term = (term1 + term2 - term3) / (beta*(beta-1))
        return np.sum(term)
    
def custom_NMF(V, K, W=None, H=None, beta = 1, steps=50, show_loss=False):
    if (V.ndim != 2):
        raise ValueError('The dim of V should be 2 but found ' + str(V.ndim))
    if (K < 2):
        raise ValueError('The K should a integer bigger then 2 but found ' +
                         str(K))
    
    F, N = V.shape
    if (W == None):
        W = np.random.rand(F, K)
    if (H == None):
        H = np.random.rand(K, N)
     
    pre_error = 0
    error = 0
    for step in range(steps):
        WH = W.dot(H)
        H_num = W.T.dot(np.multiply( np.power(WH,beta-2),V))
        H_den =  W.T.dot(np.power(W.dot(H), beta-1))
        H = np.multiply(H, np.ma.divide(H_num, H_den))
        WH = W.dot(H)
        W_num = np.multiply(WH**(beta-2),V).dot(H.T) 
        W_den =  np.dot(WH**(beta-1), H.T)
        W = np.multiply(W, np.ma.divide(W_num, W_den))
        
        H = np.clip(H, 10**-150, None)
        W = np.clip(W, 10**-150, None)
        
        if(show_loss and (step+1) %25 == 0):
            pre_error = error
            WH = _special_sparse_dot(W, H, V)
            error = _beta_divergence(V, W.dot(H), beta)
            
            print("Iteration %d Error: %.3f" % (step + 1,error) )
            print("Iteration %d Relative Error: %.3f" % (step,pre_error - error) )
     
    return np.asarray(W), np.asarray(H)

In [12]:
features, feature_names = preprocess()

Loading dataset...


In [13]:
W, H = custom_NMF(features.toarray(), 10, beta = 10, show_loss=True, steps = 100)

Iteration 25 Error: 0.430
Iteration 24 Relative Error: -0.430
Iteration 50 Error: 0.403
Iteration 49 Relative Error: 0.027
Iteration 75 Error: 0.402
Iteration 74 Relative Error: 0.001
Iteration 100 Error: 0.402
Iteration 99 Relative Error: 0.000


In [14]:
W, H = custom_NMF(features.toarray(), 10, beta = 2, show_loss=True, steps = 1000)

Iteration 25 Error: 887.589
Iteration 24 Relative Error: -887.589
Iteration 50 Error: 885.227
Iteration 49 Relative Error: 2.362
Iteration 75 Error: 884.746
Iteration 74 Relative Error: 0.481
Iteration 100 Error: 884.470
Iteration 99 Relative Error: 0.276
Iteration 125 Error: 884.352
Iteration 124 Relative Error: 0.117
Iteration 150 Error: 884.294
Iteration 149 Relative Error: 0.058
Iteration 175 Error: 884.276
Iteration 174 Relative Error: 0.018
Iteration 200 Error: 884.257
Iteration 199 Relative Error: 0.019
Iteration 225 Error: 884.245
Iteration 224 Relative Error: 0.012
Iteration 250 Error: 884.239
Iteration 249 Relative Error: 0.006
Iteration 275 Error: 884.236
Iteration 274 Relative Error: 0.003
Iteration 300 Error: 884.232
Iteration 299 Relative Error: 0.004
Iteration 325 Error: 884.230
Iteration 324 Relative Error: 0.002
Iteration 350 Error: 884.227
Iteration 349 Relative Error: 0.002
Iteration 375 Error: 884.225
Iteration 374 Relative Error: 0.002
Iteration 400 Error: 884.221


In [15]:
print("Custome MNF:")
for topic_idx in range(n_components):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in H[topic_idx,:].argsort()[:-n_top_words - 1:-1]])
        print(message)
print()

Custome MNF:
Topic #0: windows file dos using program use window files problem help os application running drivers version screen ms ftp available work
Topic #1: god jesus bible faith does christian christians christ believe heaven life sin lord church mary religion love true atheism human
Topic #2: people just don know like say time right did make ve really said law government things israel way let want
Topic #3: think don just use good like need pretty extra make yes sure bible early try reading going wasn course wrong
Topic #4: key chip clipper keys encryption government use public phone secure enforcement data law nsa security used communications going standard privacy
Topic #5: drive drives hard disk card software floppy pc mac apple power scsi computer controller memory problem board monitor mb video
Topic #6: car new 00 10 bike price good year sale cars space power engine years cost miles condition like used 12
Topic #7: thanks know does mail advance hi info interested anybody l

In [16]:
print("sklearn MNF:")
features, feature_names = preprocess()

sklearn MNF:
Loading dataset...


In [17]:
nmf, n_top_words = NMF_SK(features, solver='mu',  beta_loss='kullback-leibler', verbose = True)
print_top_words(nmf, feature_names, n_top_words)

Epoch 10 reached after 0.421 seconds, error: 218.052710
Epoch 20 reached after 0.846 seconds, error: 214.712050
Epoch 30 reached after 1.272 seconds, error: 213.776898
Epoch 40 reached after 1.701 seconds, error: 213.329420
Epoch 50 reached after 2.145 seconds, error: 213.059091
Epoch 60 reached after 2.588 seconds, error: 212.872114
Epoch 70 reached after 3.033 seconds, error: 212.729480
Epoch 80 reached after 3.469 seconds, error: 212.608326
Epoch 90 reached after 3.903 seconds, error: 212.508895
Epoch 100 reached after 4.334 seconds, error: 212.448747
Epoch 110 reached after 4.778 seconds, error: 212.385838
Epoch 120 reached after 5.227 seconds, error: 212.313436
Epoch 130 reached after 5.642 seconds, error: 212.263410
Epoch 140 reached after 6.078 seconds, error: 212.226382
Epoch 150 reached after 6.508 seconds, error: 212.190192
Epoch 160 reached after 6.937 seconds, error: 212.148670
Epoch 170 reached after 7.384 seconds, error: 212.115660
Epoch 180 reached after 7.824 seconds, e



Epoch 10 reached after 0.455 seconds, error: 218.389820
Epoch 20 reached after 0.890 seconds, error: 214.904369
Epoch 30 reached after 1.309 seconds, error: 213.869140
Epoch 40 reached after 1.740 seconds, error: 213.353410
Epoch 50 reached after 2.184 seconds, error: 213.074355
Epoch 60 reached after 2.613 seconds, error: 212.886821
Epoch 70 reached after 3.048 seconds, error: 212.752822
Epoch 80 reached after 3.486 seconds, error: 212.647430
Epoch 90 reached after 3.929 seconds, error: 212.544264
Epoch 100 reached after 4.355 seconds, error: 212.481188
Epoch 110 reached after 4.793 seconds, error: 212.412379
Epoch 120 reached after 5.209 seconds, error: 212.360674
Epoch 130 reached after 5.649 seconds, error: 212.296555
Epoch 140 reached after 6.068 seconds, error: 212.246282
Epoch 150 reached after 6.501 seconds, error: 212.206117
Epoch 160 reached after 6.931 seconds, error: 212.165737
Epoch 170 reached after 7.369 seconds, error: 212.129729
Epoch 180 reached after 7.815 seconds, e

The result of our custome MNF is pretty good. Comparing the implementation with the one offered by scikit-learn, the sklearn MNF seems better and fastern the beta_loss is smaller and converge fast. It extracts more effective topics and information.