<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="cognitiveclass.ai logo">
</center>


# Machine Learning Foundation

## Course 4, Part e: Non-Negative Matrix Factorization DEMO


This exercise illustrates the use of Non-Negative Matrix Factorization (NMF) and covers techniques related to sparse matrices and some basic Natural Language Processing. We will use NMF to look at the top words for given topics.


## Data


We'll be using the BBC dataset. These are articles collected from 5 different topics, with the data pre-processed. 

These data are available in the data folder (or online [here](http://mlg.ucd.ie/files/datasets/bbc.zip?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork821-2023-01-01)). The data consists of a few files:

* *bbc.mtx* is a list: first column is **wordID**, second is **articleID** and the third is the number of times that word appeared in that article.
* *bbc.terms* is just a list of words.
* *bbc.docs* is a list of articles listed by topic.

At a high level, we're going to:

1. Turn the `bbc.mtx` file into a sparse matrix (a [sparse matrix](https://docs.scipy.org/doc/scipy/reference/sparse.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork821-2023-01-01) format is useful for matrices in which most values are 0, because it saves space by storing only the positions and values of the non-zero elements).
1. Decompose that sparse matrix using NMF.
1. Use the resulting components of NMF to analyze the topics that result.


## Data Setup


Note: This lab has been updated to work in skillsnetwork for your convenience.


In [27]:
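# `urllib.request` must be imported explicitly; a bare `import urllib` does not guarantee it is loaded
import urllib.request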

In [28]:
with urllib.request.urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML0187EN-SkillsNetwork/labs/module%203/data/bbc.mtx') as r:
    content = r.readlines()[2:]

## Part 1

Here, we will turn the lines read from `bbc.mtx` into a list of tuples representing a [sparse matrix](https://docs.scipy.org/doc/scipy/reference/sparse.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork821-2023-01-01). The file format is:

* *bbc.mtx* is a list: first column is **wordID**, second is **articleID** and the third is the number of times that word appeared in that article.

So, if word 1 appears in article 3, 2 times, one element of our list will be:

`(1, 3, 2)`
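As a minimal sketch of this parsing step, the snippet below converts a few invented lines into tuples. The sample lines and the helper name `parse_triple` are assumptions made for illustration; the sketch also assumes that values may be written as floats (for example `2.0`), which is why it goes through `float` before `int`.

```python
# Toy stand-in for lines read from bbc.mtx (bytes, as urlopen returns them).
# These values are invented for illustration only.
sample_lines = [b"1 1 1.0\n", b"1 7 2.0\n", b"3 2 5.0\n"]

def parse_triple(line):
    """Split one 'wordID articleID count' line into a tuple of ints."""
    word_id, article_id, count = line.split()
    # Go through float first so that values such as b"2.0" parse cleanly
    return int(float(word_id)), int(float(article_id)), int(float(count))

triples = [parse_triple(line) for line in sample_lines]
print(triples)
```

Each tuple keeps the 1-based IDs from the file; shifting to 0-based indices is left to the step that builds the matrix.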


In [29]:
sparsemat = [tuple(map(int,map(float,c.split()))) for c in content]
# Let's examine the first few elements
sparsemat[:8]

[(1, 1, 1),
 (1, 7, 2),
 (1, 11, 1),
 (1, 14, 1),
 (1, 15, 2),
 (1, 19, 2),
 (1, 21, 1),
 (1, 29, 1)]

In [43]:
len(sparsemat)

286774

## Part 2: Preparing Sparse Matrix data for NMF 


We will use the [coo matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork821-2023-01-01) class to turn the list of tuples into a sparse matrix with one row per article and one column per word.


In [62]:
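import numpy as np
from scipy.sparse import coo_matrix
# IDs in the file are 1-based; shift them to 0-based indices
rows = [x[0]-1 for x in sparsemat]  # word indices
cols = [x[1]-1 for x in sparsemat]  # article indices
values = [x[2] for x in sparsemat]  # word counts
# Passing (cols, rows) makes articles the rows and words the columns
coo = coo_matrix((values, (cols, rows)))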

In [63]:
coo

<2225x9635 sparse matrix of type '<class 'numpy.int64'>'
	with 286774 stored elements in COOrdinate format>
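The same construction can be checked on a tiny example where the dense result is easy to read. This is a sketch only: the triples below are invented, and the names `word_idx`, `doc_idx` and `doc_word` are not part of the lab.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Invented (wordID, articleID, count) triples with 1-based IDs
triples = [(1, 1, 2), (2, 1, 1), (3, 2, 4), (1, 3, 1)]

# Shift the 1-based IDs to 0-based indices
word_idx = np.array([t[0] - 1 for t in triples])
doc_idx = np.array([t[1] - 1 for t in triples])
counts = np.array([t[2] for t in triples])

# Row indices come first, so documents become rows and words become columns
doc_word = coo_matrix((counts, (doc_idx, word_idx)))
dense = doc_word.toarray()

print(doc_word.shape)
print(dense)
```

Only the four non-zero counts are stored; `toarray()` is used here purely for inspection and would be wasteful on the full dataset.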

In [46]:
len(set(rows)), len(set(cols))

(9635, 2225)

In [53]:
set(cols)

{1,
 2,
 3,
 ...
 184,
 185

## NMF


NMF is a way of decomposing a matrix of documents and words so that one of the matrices can be interpreted as the "loadings" or "weights" of each word on a topic. 
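In symbols, the decomposition can be sketched as follows. The names $V$, $W$, $H$, $n$, $m$ and $k$ are introduced here for illustration and are not used elsewhere in this lab: $V$ is the documents-by-words count matrix, $W$ holds the weight of each topic in each document, and $H$ holds the weight of each word on each topic.

$$
V \approx W H, \qquad
V \in \mathbb{R}_{\ge 0}^{n \times m}, \quad
W \in \mathbb{R}_{\ge 0}^{n \times k}, \quad
H \in \mathbb{R}_{\ge 0}^{k \times m}
$$

Here $n$ is the number of documents, $m$ the number of words, and $k$ the number of topics chosen in advance. In scikit-learn terms, `fit_transform` returns $W$ and `components_` stores $H$.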


Check out [the NMF documentation](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork821-2023-01-01) and the [examples of topic extraction using NMF and LDA](http://scikit-learn.org/0.18/auto_examples/applications/topics_extraction_with_nmf_lda.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork821-2023-01-01).


**Note:** Just like we read in the data above, we'll have to read in the words from the `bbc.terms` file.


In [32]:
with urllib.request.urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML0187EN-SkillsNetwork/labs/module%203/data/bbc.terms') as r:
    content = r.readlines()
words = [c.split()[0] for c in content]

In [33]:
with urllib.request.urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML0187EN-SkillsNetwork/labs/module%203/data/bbc.docs') as r:
    doc_content = r.readlines()

doc_content[:8]


[b'business.001\n',
 b'business.002\n',
 b'business.003\n',
 b'business.004\n',
 b'business.005\n',
 b'business.006\n',
 b'business.007\n',
 b'business.008\n']

In [40]:

docs = [c.split()[0] for c in doc_content]
docs

[b'business.001',
 b'business.002',
 b'business.003',
 b'business.004',
 b'business.005',
 b'business.006',
 b'business.007',
 b'business.008',
 b'business.009',
 b'business.010',
 b'business.011',
 b'business.012',
 b'business.013',
 b'business.014',
 b'business.015',
 b'business.016',
 b'business.017',
 b'business.018',
 b'business.019',
 b'business.020',
 b'business.021',
 b'business.022',
 b'business.023',
 b'business.024',
 b'business.025',
 b'business.026',
 b'business.027',
 b'business.028',
 b'business.029',
 b'business.030',
 b'business.031',
 b'business.032',
 b'business.033',
 b'business.034',
 b'business.035',
 b'business.036',
 b'business.037',
 b'business.038',
 b'business.039',
 b'business.040',
 b'business.041',
 b'business.042',
 b'business.043',
 b'business.044',
 b'business.045',
 b'business.046',
 b'business.047',
 b'business.048',
 b'business.049',
 b'business.050',
 b'business.051',
 b'business.052',
 b'business.053',
 b'business.054',
 b'business.055',
 b'busines

## Part 3

Here, we will import `NMF`, define a model object with 5 components, and `fit_transform` the data created above.


In [34]:
# Suppress warnings from using an older version of sklearn:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

from sklearn.decomposition import NMF
model = NMF(n_components=5, init='random', random_state=818)
doc_topic = model.fit_transform(coo)

doc_topic.shape
# one row per observation passed to the model, one column per latent feature (five here)

(9636, 5)

In [35]:
# find feature with highest value per doc
np.argmax(doc_topic, axis=1)

array([0, 0, 2, ..., 4, 4, 4])
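To make the shapes concrete, here is a small sketch on an invented 4-document by 6-word matrix with two obvious word groups. The data, the variable names, and the choice of `init="nndsvda"` are assumptions for this illustration and differ from the settings used in the lab.

```python
import numpy as np
from sklearn.decomposition import NMF

# Invented count matrix: rows are documents, columns are words.
# Documents 0-1 use only the first three words, documents 2-3 only the last three.
X = np.array([
    [3.0, 2.0, 4.0, 0.0, 0.0, 0.0],
    [2.0, 3.0, 3.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 4.0, 2.0, 3.0],
    [0.0, 0.0, 0.0, 3.0, 3.0, 2.0],
])

toy_model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
toy_doc_topic = toy_model.fit_transform(X)   # shape: (n_documents, n_topics)
toy_topic_word = toy_model.components_       # shape: (n_topics, n_words)

# Index of the strongest topic for each document
dominant = np.argmax(toy_doc_topic, axis=1)

print(toy_doc_topic.shape, toy_topic_word.shape)
print(dominant)
```

The number of rows returned by `fit_transform` always equals the number of rows in the input, so checking that shape is a quick way to confirm that documents, and not words, were passed as observations.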

## Part 4: 

Check out the `components` of this model:


In [36]:
model.components_.shape

(5, 2226)

This is five rows, each of which is a "topic" containing the weights of each word on that topic. The exercise is to _get a list of the top 12 words for each topic_. We can just store this in a list of lists.


In [37]:
topic_words = []
for r in model.components_:
    a = sorted([(v,i) for i,v in enumerate(r)],reverse=True)[0:12]
    topic_words.append([words[e[1]] for e in a])

In [38]:
# Here, each list holds the top words for one recovered topic; the topic order need not match the order of the original labels
topic_words[:5]

[[b'bondi',
  b'stanlei',
  b'continent',
  b'mortgag',
  b'bare',
  b'least',
  b'extent',
  b'200',
  b'leav',
  b'frustrat',
  b'yuan',
  b'industri'],
 [b'manipul',
  b'teenag',
  b'drawn',
  b'go',
  b'prosecutor',
  b'herbert',
  b'host',
  b'protest',
  b'hike',
  b'nation',
  b'calcul',
  b'power'],
 [b'dimens',
  b'hous',
  b'march',
  b'wider',
  b'owner',
  b'intend',
  b'declin',
  b'forc',
  b'posit',
  b'founder',
  b'york',
  b'unavail'],
 [b'rome',
  b'ft',
  b'regain',
  b'lawmak',
  b'outright',
  b'resum',
  b'childhood',
  b'greatest',
  b'citi',
  b'stagnat',
  b'crown',
  b'bodi'],
 [b'build',
  b'empir',
  b'isol',
  b'\xc2\xa312',
  b'restructur',
  b'closer',
  b'plung',
  b'depreci',
  b'durham',
  b'race',
  b'juli',
  b'segreg']]
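An alternative way to pick the top words is `np.argsort`, sketched below on invented weights. The helper `top_words`, the vocabulary, and the numbers are hypothetical; the shape check reflects the fact that the lookup is only meaningful when each column of the components matrix corresponds to one vocabulary entry.

```python
import numpy as np

# Invented vocabulary and topic-by-word weights, for illustration only
vocab = ["profit", "market", "film", "goal", "match", "award"]
components = np.array([
    [0.9, 0.7, 0.0, 0.1, 0.0, 0.2],
    [0.0, 0.1, 0.8, 0.0, 0.1, 0.6],
])

def top_words(components, vocab, n_top):
    """Return the n_top highest-weighted words for each row (topic)."""
    if components.shape[1] != len(vocab):
        raise ValueError("components must have one column per vocabulary word")
    result = []
    for row in components:
        # argsort is ascending, so reverse it and keep the first n_top indices
        top_idx = np.argsort(row)[::-1][:n_top]
        result.append([vocab[i] for i in top_idx])
    return result

top = top_words(components, vocab, n_top=2)
print(top)
```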

The original data had 5 topics, as listed in `bbc.docs` (which these topic words relate to). 

```
Business
Entertainment
Politics
Sport
Tech
```

In "real life", we would have found a way to use these labels to inform the model. For this little demo, though, we can just compare the recovered topics to the original ones, and they seem to match reasonably well. The order is different, which is to be expected in this kind of model.
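One way to make that comparison less of an eyeball judgment is a cross-tabulation of the original label against the dominant recovered topic. The sketch below uses invented document IDs and topic assignments; only the `label.number` naming pattern is taken from `bbc.docs`.

```python
import numpy as np
import pandas as pd

# Invented document IDs (bytes, following the bbc.docs pattern) and
# invented dominant-topic indices, for illustration only
doc_ids = [b"business.001", b"business.002", b"sport.001", b"sport.002", b"tech.001"]
dominant_topic = np.array([2, 2, 0, 0, 1])

# The original label is the part of the ID before the dot
labels = [d.decode("utf-8").split(".")[0] for d in doc_ids]

# Rows: original labels; columns: recovered topic index; cells: document counts
table = pd.crosstab(
    pd.Series(labels, name="label"),
    pd.Series(dominant_topic, name="topic"),
)
print(table)
```

If the recovered topics line up with the original ones, each row of the table should have most of its count in a single column, with a different column for each label.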


In [45]:
len(words), len(docs), coo.shape, len(words) * len(docs), coo.shape[0]*coo.shape[1]

(9635, 2225, (9636, 2226), 21437875, 21449736)

In [49]:
arr = coo.toarray()
len(arr), len(arr[0])

(9636, 2226)

In [64]:
import pandas as pd

df = pd.DataFrame(coo.toarray(), columns=words, index=docs)

df

Unnamed: 0,b'ad',b'sale',b'boost',b'time',b'warner',b'profit',b'quarterli',b'media',b'giant',b'jump',...,b'\xc2\xa3339',b'denialofservic',b'ddo',b'seagrav',b'bot',b'wirelessli',b'streamcast',b'peripher',b'headphon',b'flavour'
b'business.001',1,5,2,3,4,10,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
b'business.002',0,0,1,2,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
b'business.003',0,4,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
b'business.004',0,1,0,0,0,4,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
b'business.005',0,0,0,1,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
b'tech.397',0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
b'tech.398',0,0,0,1,0,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
b'tech.399',0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
b'tech.400',0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [90]:
components_arr = model.components_.round(3)
newlist = []
for l in components_arr:
    newlist.append(np.delete(l, [0]))

arr = np.array(newlist)
arr.shape

(5, 2225)

In [94]:
topic_doc = pd.DataFrame(arr.T, columns=['topic_1','topic_2','topic_3','topic_4','topic_5'], index=docs)
topic_doc

Unnamed: 0,topic_1,topic_2,topic_3,topic_4,topic_5
b'business.001',0.000,0.000,0.242,0.009,0.117
b'business.002',0.014,0.000,0.271,0.004,0.000
b'business.003',0.017,0.000,0.157,0.000,0.000
b'business.004',0.000,0.007,0.351,0.000,0.000
b'business.005',0.000,0.005,0.127,0.011,0.051
...,...,...,...,...,...
b'tech.397',0.000,0.000,0.090,0.000,0.523
b'tech.398',0.010,0.000,0.052,0.000,0.410
b'tech.399',0.116,0.000,0.195,0.048,0.375
b'tech.400',0.027,0.003,0.086,0.006,0.216


In [96]:
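# Note: topic_doc is already a DataFrame, so `index=` and `columns=` select existing
# labels rather than rename them; none of them match here, so every cell comes out as NaN.
topic_word = pd.DataFrame(topic_doc.round(5), index=['topic_1','topic_2','topic_3','topic_4','topic_5'], columns=words)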
topic_word

Unnamed: 0,b'ad',b'sale',b'boost',b'time',b'warner',b'profit',b'quarterli',b'media',b'giant',b'jump',...,b'\xc2\xa3339',b'denialofservic',b'ddo',b'seagrav',b'bot',b'wirelessli',b'streamcast',b'peripher',b'headphon',b'flavour'
topic_1,,,,,,,,,,,...,,,,,,,,,,
topic_2,,,,,,,,,,,...,,,,,,,,,,
topic_3,,,,,,,,,,,...,,,,,,,,,,
topic_4,,,,,,,,,,,...,,,,,,,,,,
topic_5,,,,,,,,,,,...,,,,,,,,,,
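The difference between labelling an array and re-labelling an existing DataFrame can be seen on a small invented example. All names and values below are hypothetical.

```python
import numpy as np
import pandas as pd

# Invented topic-by-word weights, for illustration only
vocab = ["profit", "film", "goal"]
topics = ["topic_1", "topic_2"]
weights = np.array([[0.9, 0.0, 0.1],
                    [0.0, 0.8, 0.2]])

# From a NumPy array, index and columns are attached as labels
labelled = pd.DataFrame(weights, index=topics, columns=vocab)

# From an existing DataFrame, index and columns act as a selection:
# labels that are not already present produce NaN
other = pd.DataFrame(weights.T, index=["doc_a", "doc_b", "doc_c"], columns=topics)
reindexed = pd.DataFrame(other, index=topics, columns=vocab)

print(labelled)
print(reindexed)
```

For a topic-by-word table, passing the raw array (for a fitted model, `model.components_`) with the topic names as `index` and the vocabulary as `columns` gives the first behaviour, provided the array has one column per word.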


---
### Machine Learning Foundation (C) 2020 IBM Corporation
