## Demo: extracting topic information from a text dataset with non-negative matrix factorization (NMF)

### A topic is a coarse way of describing semantics

![alt text](figure/NMF.png)

**Schematic of the non-negative matrix factorization of a dataset with 2 topics, 4 words, and 6 documents**

* **V**: term-document matrix; `V[i][j]` is the frequency of word i in document j
* **W**: term-topic matrix; `W[i][j]` measures how strongly word i is associated with topic j
* **H**: topic-document matrix; `H[i][j]` is the weight of topic i in document j
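
To make the shapes in the figure concrete, here is a minimal sketch that factorizes a small made-up matrix of the same size (4 words, 6 documents, 2 topics). The numbers in `V` and the `init`/`max_iter` settings are illustrative assumptions, not values taken from the figure. Note that scikit-learn expects one document per row, so the sketch passes `V.T` and transposes the results back to match the notation above.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical term-document matrix V: 4 words (rows) x 6 documents (columns).
V = np.array([
    [3, 2, 4, 0, 0, 1],
    [2, 3, 3, 0, 1, 0],
    [0, 0, 1, 4, 3, 2],
    [0, 1, 0, 3, 4, 3],
], dtype=float)

# scikit-learn wants samples (documents) in rows, so we factorize V.T.
model = NMF(n_components=2, init='nndsvda', random_state=0, max_iter=1000)
doc_topic = model.fit_transform(V.T)   # shape (6, 2): documents x topics
topic_term = model.components_         # shape (2, 4): topics x words

# Transpose back to the notation used above: V ~ W @ H.
W = topic_term.T   # term-topic matrix, shape (4, 2)
H = doc_topic.T    # topic-document matrix, shape (2, 6)

print("W shape:", W.shape, " H shape:", H.shape)
print("Reconstruction W @ H:")
print(np.round(W @ H, 1))
```

Both factors contain only non-negative entries, which is what makes each topic readable as an additive mix of words.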

In [1]:
from time import time

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.datasets import fetch_20newsgroups

import numpy as np

In [2]:
def print_top_words(model, feature_names, n_top_words):
    """Print the "n_top_words" highest-weighted words of each topic.
    """
    
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(", ".join([feature_names[i]
                         for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()
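
The slice `topic.argsort()[:-n_top_words - 1:-1]` used in `print_top_words` is compact but easy to misread. The sketch below applies the same expression to a small hypothetical weight vector (the values are made up for illustration): `argsort()` returns indices in ascending order of weight, and the reversed slice walks back from the end to pick the indices of the largest weights, largest first.

```python
import numpy as np

# Hypothetical topic weights for a 5-word vocabulary.
weights = np.array([0.1, 0.7, 0.0, 0.4, 0.2])
n_top = 3

ascending = weights.argsort()            # indices sorted by increasing weight
top_idx = ascending[:-n_top - 1:-1]      # last n_top indices, in reverse order

print("ascending order:", ascending)
print("top-%d indices:" % n_top, top_idx)
print("top-%d weights:" % n_top, weights[top_idx])
```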


In [3]:
# Fetch the 20newsgroups dataset
print("Loading data ...")
t0 = time()
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'))
print("Done in %0.3f s." % (time() - t0))

n_samples = 2000
print("%d documents in total; we use %d of them for this demo" % (len(dataset['data']), n_samples))
data_samples = dataset.data[:n_samples]

Loading data ...
Done in 1.261 s.
11314 documents in total; we use 2000 of them for this demo


In [4]:
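# Print a few sample documents
# (newlines are stripped without inserting a space, so words at line breaks run together)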
for i in range(3):
    print("%s\n" % dataset['data'][i].replace('\n', ''))

Well i'm not sure about the story nad it did seem biased. WhatI disagree with is your statement that the U.S. Media is out toruin Israels reputation. That is rediculous. The U.S. media isthe most pro-israeli media in the world. Having lived in EuropeI realize that incidences such as the one described in theletter have occured. The U.S. media as a whole seem to try toignore them. The U.S. is subsidizing Israels existance and theEuropeans are not (at least not to the same degree). So I thinkthat might be a reason they report more clearly on theatrocities.	What is a shame is that in Austria, daily reports ofthe inhuman acts commited by Israeli soldiers and the blessingreceived from the Government makes some of the Holocaust guiltgo away. After all, look how the Jews are treating other raceswhen they got power. It is unfortunate.

Yeah, do you expect people to read the FAQ, etc. and actually accept hardatheism?  No, you need a little leap of faith, Jimmy.  Your logic runs outof steam!Jim,S

![alt text](figure/nmf_bow.png)

In [6]:
n_features = 1000

print("Extracting tf-idf features ...")

# Compute the tf-idf of the documents as the input features of the NMF model.
# max_df=0.95, min_df=2: heuristic pruning of the vocabulary;
#   drop words that appear in only one document or in more than 95% of the documents
# max_features=n_features: after pruning, keep only the n_features most frequent remaining words
# stop_words='english': also drop common English stop words
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=n_features,
                                   stop_words='english')
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(data_samples)
print("Done in %0.3f s." % (time() - t0))

Extracting tf-idf features ...
Done in 0.296 s.
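
To see what the `min_df` and `max_df` pruning does, here is a minimal sketch on four made-up sentences (the toy corpus is an assumption for illustration and has nothing to do with 20newsgroups). A word that occurs in every document exceeds `max_df=0.95` and is dropped, and words that occur in a single document fall below `min_df=2` and are dropped as well.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "the" occurs in all four documents,
# while "mat", "log", "chased", "bird" and "sang" each occur in only one.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
    "the bird sang",
]

vec = TfidfVectorizer(min_df=2, max_df=0.95)
X = vec.fit_transform(docs)

# vocabulary_ maps each surviving word to its column index.
vocab = sorted(vec.vocabulary_)
print("kept words:", vocab)
print("tf-idf matrix shape (documents x words):", X.shape)
```

The last document contains none of the surviving words, so its row in `X` is all zeros. This is one reason aggressive pruning should be checked against the corpus size.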


In [7]:
print("Bag-of-words (tf-idf) representation of the 2000 documents:")
tfidf.shape

Bag-of-words (tf-idf) representation of the 2000 documents:


(2000, 1000)

In [8]:
# Fit the NMF model
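n_topics = 10
print("Fitting NMF to the tf-idf feature matrix with %d topics ..."
      % (n_topics))
t0 = time()
# alpha and l1_ratio add regularization to the factorization.
# Note: `alpha` belongs to older scikit-learn releases; newer releases
# split it into `alpha_W` and `alpha_H`.
nmf = NMF(n_components=n_topics, random_state=1,
          alpha=.1, l1_ratio=.5).fit(tfidf)
print("Done in %0.3f s." % (time() - t0))

Fitting NMF to the tf-idf feature matrix with 10 topics ...
Done in 0.934 s.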


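## Two basic parts of a machine learning model: structure and parameters
### 1. The structure defined by a human (e.g. the computational graph in a TensorFlow implementation)
### 2. The knowledge the model learns by itself, stored in its parameters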


In [9]:
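# What the NMF model has learned about the text is stored in the
# parameter matrix nmf.components_ (one row per topic, one column per word)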

nmf.components_.shape

(10, 1000)
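
`components_` holds the topic-word weights only. The matching document-topic weights come from `fit_transform` (or `transform`) on the same model. The sketch below shows how the two matrices relate, using a small random matrix as a hypothetical stand-in for the tf-idf features. The sizes and the `init`/`max_iter` settings are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical stand-in for a (documents x words) tf-idf matrix.
rng = np.random.RandomState(0)
X = rng.rand(20, 8)

model = NMF(n_components=3, init='nndsvda', random_state=0, max_iter=500)
doc_topic = model.fit_transform(X)    # shape (20, 3): weight of each topic in each document
topic_word = model.components_        # shape (3, 8): weight of each word in each topic

# The topic with the largest weight gives a crude label for each document.
dominant_topic = doc_topic.argmax(axis=1)

print("doc_topic shape:", doc_topic.shape)
print("topic_word shape:", topic_word.shape)
print("dominant topic of the first 5 documents:", dominant_topic[:5])
```

In the notebook, the same pattern would give a `(2000, 10)` document-topic matrix next to the `(10, 1000)` `components_` shown above.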

## A qualitative look at the semantic information in the NMF parameters

### 1. First, the most important words in each topic

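In [10]:
print("\nRepresentative words of each topic:")
n_top_words = 10

# Note: newer scikit-learn releases replace get_feature_names() with
# get_feature_names_out(), which returns a NumPy array rather than a list.
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(nmf, tfidf_feature_names, n_top_words)


Representative words of each topic: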
Topic #0:
just, people, don, think, like, know, time, good, make, way
Topic #1:
windows, use, dos, using, window, program, os, drivers, application, help
Topic #2:
god, jesus, bible, faith, christian, christ, christians, does, heaven, sin
Topic #3:
thanks, know, does, mail, advance, hi, info, interested, email, anybody
Topic #4:
car, cars, tires, miles, 00, new, engine, insurance, price, condition
Topic #5:
edu, soon, com, send, university, internet, mit, ftp, mail, cc
Topic #6:
file, problem, files, format, win, sound, ftp, pub, read, save
Topic #7:
game, team, games, year, win, play, season, players, nhl, runs
Topic #8:
drive, drives, hard, disk, floppy, software, card, mac, computer, power
Topic #9:
key, chip, clipper, keys, encryption, government, public, use, secure, enforcement



### 2. Next, which topics a few individual words belong to

In [11]:
# Look up the column index of each word (avoids shadowing the built-in `id`)
words = ['software', 'computer', 'faith', 'bible']
word_ids = [tfidf_feature_names.index(w) for w in words]

# Each column of nmf.components_ is one word's weight in every topic
xs = [nmf.components_[:, j] for j in word_ids]


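In [12]:
print("Coordinates / representation / features of the four words in topic space\n")

print("Each word is a %d-dimensional word vector:" % (n_topics))

# Rows are topics 0..9; columns are software, computer, faith, bible
np.array(xs).T


Coordinates / representation / features of the four words in topic space

Each word is a 10-dimensional word vector: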


array([[ 0.03246496,  0.09201278,  0.02325284,  0.05181221],
       [ 0.26150334,  0.05643936,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.30131641,  0.42290532],
       [ 0.01995936,  0.01684022,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.01892334,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.31628097,  0.20540135,  0.        ,  0.        ],
       [ 0.        ,  0.00533712,  0.        ,  0.        ]])
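
One way to read these word vectors quantitatively is to compare them with cosine similarity. The sketch below copies the values from the array printed above (columns: software, computer, faith, bible) and compares a few pairs. The helper `cosine` is a name introduced here for illustration and is not part of the original notebook.

```python
import numpy as np

# Topic-space coordinates copied from the output above.
# Rows: topics 0..9; columns: software, computer, faith, bible.
topic_word = np.array([
    [0.03246496, 0.09201278, 0.02325284, 0.05181221],
    [0.26150334, 0.05643936, 0.0,        0.0],
    [0.0,        0.0,        0.30131641, 0.42290532],
    [0.01995936, 0.01684022, 0.0,        0.0],
    [0.0,        0.0,        0.0,        0.0],
    [0.0,        0.01892334, 0.0,        0.0],
    [0.0,        0.0,        0.0,        0.0],
    [0.0,        0.0,        0.0,        0.0],
    [0.31628097, 0.20540135, 0.0,        0.0],
    [0.0,        0.00533712, 0.0,        0.0],
])
words = ['software', 'computer', 'faith', 'bible']
vectors = {w: topic_word[:, k] for k, w in enumerate(words)}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


for a, b in [('software', 'computer'), ('faith', 'bible'), ('software', 'faith')]:
    print("%s vs %s: %.3f" % (a, b, cosine(vectors[a], vectors[b])))
```

The two same-domain pairs score close to 1, while the cross-domain pair scores close to 0, because "software" and "faith" share only a small weight on the generic Topic #0.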