# 1. TF-IDF Principles

<font face="微软雅黑" size=4>
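    TF-IDF stands for Term Frequency - Inverse Document Frequency. TF measures how often a term occurs within a single document; IDF measures how rare a term is across the corpus, so the more documents contain the term, the lower its IDF.
$$IDF(x)=\log\frac{N}{N(x)}$$
    Here $N$ is the total number of documents in the corpus and $N(x)$ is the number of documents that contain the term $x$. Smoothing the formula gives:
$$IDF(x)=\log\frac{N+1}{N(x)+1}+1$$
    The TF-IDF score is then:
    $$\text{TF-IDF}(x)=TF(x)\times IDF(x)$$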
 </font>
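
As a worked check of the smoothed formula, the following minimal sketch evaluates it with Python's natural logarithm for a corpus of $N=4$ documents. The helper name `smoothed_idf` and the use of a raw count as TF are assumptions made for illustration; they are chosen to mirror what scikit-learn's `TfidfTransformer` does with its default `smooth_idf=True`.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Smoothed IDF: log((N + 1) / (N(x) + 1)) + 1, using the natural log."""
    return math.log((n_docs + 1) / (doc_freq + 1)) + 1

n_docs = 4
# IDF shrinks as the term appears in more documents.
for doc_freq in (1, 2, 3, 4):
    print(doc_freq, round(smoothed_idf(n_docs, doc_freq), 4))

# Unnormalized TF-IDF of a term that occurs twice in one document
# and appears in 2 of the 4 documents.
tf = 2
print(round(tf * smoothed_idf(n_docs, 2), 4))
```

A term found in every document gets an IDF of exactly 1 rather than 0, so the smoothed version never discards a term completely.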

In [111]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I come to China to travel with my gril friend but my gril freind is a boy",
          "This is a car polupar in China which named hongqi",
          "I love tea and Apple ",
          "The work is to write some papers in science or acta"]

vectorizer = CountVectorizer()
transformer = TfidfTransformer()
v = vectorizer.fit_transform(corpus)  # raw count of each word in each document
print("Count of each word in each document:")
print(v)
print("=================")
print("TF-IDF value of each word: ")
tfidf = transformer.fit_transform(v)  # sparse matrix; call .toarray() for a dense view
print(tfidf)

Count of each word in each document:
  (0, 3)	1
  (0, 13)	1
  (0, 8)	1
  (0, 4)	1
  (0, 9)	1
  (0, 10)	2
  (0, 15)	2
  (0, 28)	1
  (0, 26)	1
  (0, 6)	1
  (0, 25)	2
  (0, 7)	1
  (1, 11)	1
  (1, 16)	1
  (1, 27)	1
  (1, 12)	1
  (1, 19)	1
  (1, 5)	1
  (1, 24)	1
  (1, 13)	1
  (1, 6)	1
  (2, 2)	1
  (2, 1)	1
  (2, 22)	1
  (2, 14)	1
  (3, 0)	1
  (3, 17)	1
  (3, 20)	1
  (3, 18)	1
  (3, 21)	1
  (3, 30)	1
  (3, 29)	1
  (3, 23)	1
  (3, 12)	1
  (3, 13)	1
  (3, 25)	1
TF-IDF value of each word: 
  (0, 7)	0.2323987345417178
  (0, 25)	0.3664516633475799
  (0, 6)	0.18322583167378995
  (0, 26)	0.2323987345417178
  (0, 28)	0.2323987345417178
  (0, 15)	0.4647974690834356
  (0, 10)	0.4647974690834356
  (0, 9)	0.2323987345417178
  (0, 4)	0.2323987345417178
  (0, 8)	0.2323987345417178
  (0, 13)	0.1483371018604668
  (0, 3)	0.2323987345417178
  (1, 6)	0.28503967675464414
  (1, 13)	0.23076418416976147
  (1, 24)	0.361536687086221
  (1, 5)	0.361536687086221
  (1, 19)	0.361536687086221
  (1, 12)	0.28503967675464414
  (1, 27)	0.361536687086221
  (1

# 2. Turning the Text into a Matrix

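The documents and their distinct words form a matrix with one row per document and one column per word. Each entry $x_{ij}$ is the TF-IDF value of word $j$ in document $i$.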
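The matrix can also be built by hand, which makes the row and column layout explicit. The sketch below is a plain-Python approximation, not scikit-learn's implementation. The regular expression is assumed to mimic `CountVectorizer`'s default tokenization (lowercased tokens of two or more word characters), the IDF uses the smoothed formula from section 1, and each row is divided by its Euclidean length because `TfidfTransformer` applies L2 normalization by default.

```python
import math
import re

corpus = [
    "I come to China to travel with my gril friend but my gril freind is a boy",
    "This is a car polupar in China which named hongqi",
    "I love tea and Apple ",
    "The work is to write some papers in science or acta",
]

# Assumed tokenization: lowercase, keep tokens of 2+ word characters.
docs = [re.findall(r"\b\w\w+\b", text.lower()) for text in corpus]
vocab = sorted({word for doc in docs for word in doc})  # one column per word

n_docs = len(docs)
doc_freq = {w: sum(w in doc for doc in docs) for w in vocab}
idf = {w: math.log((n_docs + 1) / (doc_freq[w] + 1)) + 1 for w in vocab}

matrix = []
for doc in docs:  # one row per document
    row = [doc.count(w) * idf[w] for w in vocab]
    norm = math.sqrt(sum(value * value for value in row))
    matrix.append([value / norm for value in row])  # L2-normalize the row

print(len(matrix), len(matrix[0]))
print([(w, round(x, 4)) for w, x in zip(vocab, matrix[2]) if x])
```

The third document keeps four words that occur nowhere else, so after normalization each of them gets the same weight of 0.5.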

In [112]:
arraytfidf = tfidf.toarray()  # dense array: one row per document, one column per word
print(arraytfidf)
print("===============")
print("Number of documents:", len(arraytfidf), "Number of words:", len(arraytfidf[0]))

[[0.         0.         0.         0.23239873 0.23239873 0.
  0.18322583 0.23239873 0.23239873 0.23239873 0.46479747 0.
  0.         0.1483371  0.         0.46479747 0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.36645166 0.23239873 0.         0.23239873 0.
  0.        ]
 [0.         0.         0.         0.         0.         0.36153669
  0.28503968 0.         0.         0.         0.         0.36153669
  0.28503968 0.23076418 0.         0.         0.36153669 0.
  0.         0.36153669 0.         0.         0.         0.
  0.36153669 0.         0.         0.36153669 0.         0.
  0.        ]
 [0.         0.5        0.5        0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.5        0.         0.         0.
  0.         0.         0.         0.         0.5        0.
  0.         0.         0.         0.         0.         0.
  0.        ]
 [0.32190145 0.         0.         0.     

# 3. Principles of Mutual Information

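Mutual information is a measure from information theory. It can be viewed as the amount of information that one random variable contains about another, or equivalently as the reduction in uncertainty about one random variable once the other is known.
Let two random variables $(X,Y)$ have joint distribution $p(x,y)$ and marginal distributions $p(x)$ and $p(y)$. The mutual information $I(X;Y)$ is the relative entropy between the joint distribution $p(x,y)$ and the product distribution $p(x)p(y)$:
$$I(X;Y)=\sum_{x\in X,\,y\in Y}p(x,y)\log\frac{p(x,y)}{p(x)p(y)}$$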
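
The definition can be checked on two small joint distributions. This is a minimal sketch: the function name `mutual_information` and the table layout (rows index $x$, columns index $y$) are choices made for illustration, and the natural logarithm is assumed.

```python
import math

def mutual_information(joint):
    """I(X;Y) for a joint probability table; rows index x, columns index y."""
    p_x = [sum(row) for row in joint]          # marginal p(x)
    p_y = [sum(col) for col in zip(*joint)]    # marginal p(y)
    total = 0.0
    for i, row in enumerate(joint):
        for j, p_xy in enumerate(row):
            if p_xy > 0:  # terms with p(x,y) = 0 contribute nothing
                total += p_xy * math.log(p_xy / (p_x[i] * p_y[j]))
    return total

independent = [[0.25, 0.25], [0.25, 0.25]]  # p(x,y) = p(x)p(y) everywhere
identical = [[0.5, 0.0], [0.0, 0.5]]        # Y always equals X

print(mutual_information(independent))
print(mutual_information(identical))
```

Independent variables give zero, since knowing one says nothing about the other. When $Y$ always equals $X$ and both are fair binary variables, the result is $\log 2$, the full entropy of either one.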

# 4. Feature Selection

In [113]:
from sklearn.feature_selection import mutual_info_classif
from sklearn.datasets import load_iris
from sklearn import metrics as mr
import numpy as np
from sklearn import datasets

In [126]:
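# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use get_feature_names() instead
eachwords = vectorizer.get_feature_names_out()
eachwords = np.asarray(eachwords)
eachwords = eachwords.T  # no-op for a 1-D array
print(type(eachwords))
print("Each word as a feature:")
print(eachwords)

<class 'numpy.ndarray'>
Each word as a feature: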
['acta' 'and' 'apple' 'boy' 'but' 'car' 'china' 'come' 'freind' 'friend'
 'gril' 'hongqi' 'in' 'is' 'love' 'my' 'named' 'or' 'papers' 'polupar'
 'science' 'some' 'tea' 'the' 'this' 'to' 'travel' 'which' 'with' 'work'
 'write']


In [129]:
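#arraytfidf = arraytfidf.T
#mutual_info = mutual_info_classif(arraytfidf, eachwords, discrete_features= False)
#print(np.mat(arraytfidf))
# Each row of arraytfidf is a document (the transpose above is commented out),
# so this compares document 0 with documents 1 to 3.
# Note: mutual_info_score treats every distinct value as a discrete label.
print("Mutual information between the first document and the other three documents:")
print(mr.mutual_info_score(arraytfidf[0], arraytfidf[1]))
print(mr.mutual_info_score(arraytfidf[0], arraytfidf[2]))
print(mr.mutual_info_score(arraytfidf[0], arraytfidf[3]))
#mr.mutual_info_score(arraytfidf,eachwords)

Mutual information between the first document and the other three documents: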
0.3528681999138815
0.0691103344933365
0.3946507687941262
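
To see what `mutual_info_score` is measuring here, the sketch below recomputes the second value by hand. It assumes that every distinct number is treated as a category label and that the natural logarithm is used. `label_mutual_info` is a hypothetical helper written for this illustration, not a scikit-learn function, and the two rows are copied from the dense matrix printed in section 2.

```python
import math
from collections import Counter

def label_mutual_info(a, b):
    """Mutual information (natural log) of two equally long label sequences."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    count_ab = Counter(zip(a, b))  # contingency counts of label pairs
    return sum(
        (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in count_ab.items()
    )

# Row 0 of the printed matrix: nonzero entries by column index.
doc0 = [0.0] * 31
doc0_values = {3: 0.23239873, 4: 0.23239873, 6: 0.18322583, 7: 0.23239873,
               8: 0.23239873, 9: 0.23239873, 10: 0.46479747, 13: 0.1483371,
               15: 0.46479747, 25: 0.36645166, 26: 0.23239873, 28: 0.23239873}
for index, value in doc0_values.items():
    doc0[index] = value

# Row 2 of the printed matrix: 0.5 at four columns, 0 elsewhere.
doc2 = [0.0] * 31
for index in (1, 2, 14, 22):
    doc2[index] = 0.5

mi_02 = label_mutual_info(doc0, doc2)
print(mi_02)
```

The result is about 0.0691, in line with the second value printed above. The score therefore reflects which columns share identical values in the two rows, not how close the continuous TF-IDF weights are, which is worth keeping in mind before using it to rank features.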
