## Term Frequency-Inverse Document Frequency (TF-IDF) ##

In fact, the idea behind IDF is [the relative entropy (Kullback-Leibler divergence) of a keyword's probability distribution under a specific condition](https://www.cnblogs.com/ZisZ/p/9087921.html).

[TF-IDF implementation](https://www.geeksforgeeks.org/tf-idf-model-for-page-ranking/)

[Applications of TF-IDF and cosine similarity (part 1)](http://www.ruanyifeng.com/blog/2013/03/tf-idf.html)

### Term Frequency (TF) ###

$$tf = \frac{\text{number of times the term appears in the document}}{\text{total number of terms in the document}}$$
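
The ratio above can be sketched in a few lines of Python. This is a minimal illustration, not a library implementation: the function name `term_frequency`, the example sentence, and the whitespace tokenization are all assumptions made for the example.

```python
from collections import Counter


def term_frequency(tokens):
    """Return {term: occurrences / total number of tokens} for one document."""
    total = len(tokens)
    if total == 0:
        return {}
    counts = Counter(tokens)
    return {term: count / total for term, count in counts.items()}


# Hypothetical example document, lowercased and split on whitespace.
doc = "i come to china to travel".split()
tf = term_frequency(doc)
print(tf["to"])  # "to" occurs 2 times among 6 tokens
```

Because every count is divided by the same document length, the values for one document always sum to 1.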

### Inverse Document Frequency (IDF) ###

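$$idf = \log\frac{\text{total number of documents}}{\text{number of documents containing the term}}$$

In practice the ratio is usually smoothed as well, for example by adding 1 to the denominator so that a term appearing in no document does not cause a division by zero. Why the logarithm? It dampens the ratio, so a term that is 1000 times rarer does not receive 1000 times the weight, and it lets IDF be read as an information content: $-\log$ of the fraction of documents that contain the term.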
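
To make the formula and the smoothing concrete, here is a small sketch. The function name `idf`, the `smooth` flag, and the toy documents are assumptions for illustration. The smoothing shown (adding 1 to both the numerator and the denominator) is just one common variant among several.

```python
import math


def idf(term, documents, smooth=False):
    """Inverse document frequency of `term` over a list of token sets."""
    n_docs = len(documents)
    n_containing = sum(1 for doc in documents if term in doc)
    if smooth:
        # One common smoothing: behave as if one extra document
        # contained every term, so the denominator is never zero.
        return math.log((1 + n_docs) / (1 + n_containing))
    if n_containing == 0:
        raise ValueError("term appears in no document; use smooth=True")
    return math.log(n_docs / n_containing)


# Hypothetical toy corpus, each document already reduced to a set of tokens.
docs = [
    {"come", "to", "china", "travel"},
    {"this", "is", "car", "popular", "in", "china"},
    {"love", "tea", "and", "apple"},
    {"the", "work", "is", "to", "write", "some", "papers", "in", "science"},
]

print(idf("china", docs))                # appears in 2 of 4 documents: ln(4/2)
print(idf("tea", docs))                  # appears in 1 of 4 documents: ln(4/1)
print(idf("unseen", docs, smooth=True))  # no division by zero: ln(5/1)
```

Rarer terms get a larger value, and a term that occurs in every document gets an unsmoothed IDF of zero.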

### TF-IDF ###

$$\text{tf-idf} = tf \times idf$$

In [14]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer


In [15]:
corpus = [ "I come to China to travel", 
    "This is a car popular in China",
    "I love tea and Apple ",   
    "The work is to write some papers in science"]

vectorizer = CountVectorizer()
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))

In [16]:
print(tfidf)

  (0, 16)	0.4424621378947393
  (0, 15)	0.697684463383976
  (0, 4)	0.4424621378947393
  (0, 3)	0.348842231691988
  (1, 14)	0.45338639737285463
  (1, 9)	0.45338639737285463
  (1, 6)	0.3574550433419527
  (1, 5)	0.3574550433419527
  (1, 3)	0.3574550433419527
  (1, 2)	0.45338639737285463
  (2, 12)	0.5
  (2, 7)	0.5
  (2, 1)	0.5
  (2, 0)	0.5
  (3, 18)	0.3565798233381452
  (3, 17)	0.3565798233381452
  (3, 15)	0.2811316284405006
  (3, 13)	0.3565798233381452
  (3, 11)	0.3565798233381452
  (3, 10)	0.3565798233381452
  (3, 8)	0.3565798233381452
  (3, 6)	0.2811316284405006
  (3, 5)	0.2811316284405006
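
The printed weights are not plain $tf \times idf$ values. With its default settings, scikit-learn uses raw counts as $tf$, a smoothed $idf = \ln\frac{1 + n}{1 + df} + 1$, and then scales each row to unit L2 norm. The sketch below reproduces the numbers under those assumptions using only the standard library. The helper names `tokenize` and `tfidf_row` are made up for this example, and the regular expression approximates the default token pattern (lowercased tokens of two or more word characters).

```python
import math
import re
from collections import Counter

corpus = [
    "I come to China to travel",
    "This is a car popular in China",
    "I love tea and Apple ",
    "The work is to write some papers in science",
]


def tokenize(text):
    # Lowercase and keep tokens of 2+ word characters (drops "I" and "a").
    return re.findall(r"\b\w\w+\b", text.lower())


docs = [tokenize(text) for text in corpus]
n_docs = len(docs)

# Document frequency: number of documents that contain each term.
df = Counter(term for doc in docs for term in set(doc))


def tfidf_row(tokens):
    counts = Counter(tokens)  # raw counts play the role of tf
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, count in counts.items()
    }
    # Scale the row to unit Euclidean (L2) length.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


rows = [tfidf_row(doc) for doc in docs]
print(rows[0]["to"])  # compare with entry (0, 15) in the output above
print(rows[2])        # four equally weighted terms, compare with row 2
```

The third sentence shows the effect of the normalization clearly: its four terms each occur once in the sentence and in one document only, so they share the same raw weight and each ends up at $1/\sqrt{4} = 0.5$.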


In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf2 = TfidfVectorizer()
res = tfidf2.fit_transform(corpus)
print(res)

  (0, 4)	0.4424621378947393
  (0, 15)	0.697684463383976
  (0, 3)	0.348842231691988
  (0, 16)	0.4424621378947393
  (1, 3)	0.3574550433419527
  (1, 14)	0.45338639737285463
  (1, 6)	0.3574550433419527
  (1, 2)	0.45338639737285463
  (1, 9)	0.45338639737285463
  (1, 5)	0.3574550433419527
  (2, 7)	0.5
  (2, 12)	0.5
  (2, 0)	0.5
  (2, 1)	0.5
  (3, 15)	0.2811316284405006
  (3, 6)	0.2811316284405006
  (3, 5)	0.2811316284405006
  (3, 13)	0.3565798233381452
  (3, 17)	0.3565798233381452
  (3, 18)	0.3565798233381452
  (3, 11)	0.3565798233381452
  (3, 8)	0.3565798233381452
  (3, 10)	0.3565798233381452


## Mutual Information (MI) ##

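Consider the joint probability distribution $p(x,y)$ of two sets of variables $x$ and $y$. If the two sets are independent, their joint distribution factorizes into the product of the marginals, $p(x, y) = p(x)p(y)$. If they are not independent, we can judge how "close" they are to being independent by considering the Kullback-Leibler divergence between the joint distribution and the product of the marginals. This quantity is called the mutual information between $x$ and $y$. From the properties of the Kullback-Leibler divergence, $I[x, y] \ge 0$, with equality if and only if $x$ and $y$ are independent.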

$$ I[x, y] = KL(p(x,y) \,\|\, p(x)p(y)) = -\int\int p(x,y)\ln\left(\frac{p(x)p(y)}{p(x,y)}\right)dx\,dy $$
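
For discrete variables the double integral becomes a double sum over a joint probability table, which makes the two cases described above easy to check numerically. The following is a sketch under that discrete assumption. The function name `mutual_information` and both example tables are hypothetical.

```python
import math


def mutual_information(joint):
    """I[x, y] in nats for a table with joint[i][j] = p(x=i, y=j)."""
    px = [sum(row) for row in joint]        # marginal p(x)
    py = [sum(col) for col in zip(*joint)]  # marginal p(y)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:  # terms with p(x, y) = 0 contribute nothing
                mi += pxy * math.log(pxy / (px[i] * py[j]))
    return mi


# Hypothetical joint tables over two binary variables.
independent = [[0.25, 0.25], [0.25, 0.25]]  # p(x, y) = p(x) p(y)
dependent = [[0.5, 0.0], [0.0, 0.5]]        # y always equals x

print(mutual_information(independent))  # zero: knowing y says nothing about x
print(mutual_information(dependent))    # ln 2: knowing y fully determines x
```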

Using the sum and product rules of probability, we see that mutual information is related to conditional entropy through:

$$I[x, y] = H[x] - H[x \mid y] = H[y] - H[y \mid x]$$
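
The identity can be checked numerically on a small discrete example. This is only a sanity check under assumed values: the joint table `joint` is made up, and the helper `entropy` is a hypothetical name.

```python
import math


def entropy(probs):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical joint table with joint[i][j] = p(x=i, y=j).
joint = [[0.4, 0.1], [0.2, 0.3]]
px = [sum(row) for row in joint]
py = [sum(col) for col in zip(*joint)]

# Conditional entropies, using p(x | y) = p(x, y) / p(y) and vice versa.
h_x_given_y = -sum(
    p * math.log(p / py[j])
    for row in joint for j, p in enumerate(row) if p > 0
)
h_y_given_x = -sum(
    p * math.log(p / px[i])
    for i, row in enumerate(joint) for p in row if p > 0
)

# Mutual information computed directly as the KL divergence.
mi_kl = sum(
    p * math.log(p / (px[i] * py[j]))
    for i, row in enumerate(joint) for j, p in enumerate(row) if p > 0
)

print(mi_kl)
print(entropy(px) - h_x_given_y)
print(entropy(py) - h_y_given_x)
```

All three printed values agree up to floating-point rounding.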

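We can therefore view mutual information as the reduction in uncertainty about $x$ that results from knowing the value of $y$ (or vice versa). From a Bayesian perspective, we can view $p(x)$ as the prior distribution for $x$ and $p(x \mid y)$ as the posterior distribution after we have observed new data $y$. The mutual information therefore represents the reduction in uncertainty about $x$ as a consequence of the new observation $y$.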

![Mutual_Information](Mutual_Information.png)

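In clustering evaluation, mutual information measures the similarity between two labelings $U$ and $V$ of the same $N$ samples. With $|U_i|$ the number of samples in cluster $U_i$ and $|V_j|$ the number of samples in cluster $V_j$: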

$$ MI(U,V)=\sum_{i=1}^{|U|} \sum_{j=1}^{|V|} \frac{|U_i\cap V_j|}{N}
        \log\frac{N|U_i \cap V_j|}{|U_i||V_j|} $$

In [6]:
from sklearn import metrics 

In [9]:
metrics.mutual_info_score([2, 4], [7,9])

0.6931471805599453
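
The result above equals $\ln 2$: each labeling splits the two samples into two singleton clusters that match one-to-one, so the sum has two terms of $\frac{1}{2}\ln\frac{2 \cdot 1}{1 \cdot 1}$. The sketch below applies the contingency formula directly with the standard library. The function name `mutual_info_from_labels` is hypothetical, and this is an illustration of the formula, not scikit-learn's implementation.

```python
import math
from collections import Counter


def mutual_info_from_labels(labels_u, labels_v):
    """Mutual information (in nats) between two labelings of the same samples."""
    if len(labels_u) != len(labels_v):
        raise ValueError("both labelings must cover the same samples")
    n = len(labels_u)
    count_u = Counter(labels_u)                 # |U_i|
    count_v = Counter(labels_v)                 # |V_j|
    count_uv = Counter(zip(labels_u, labels_v)) # |U_i ∩ V_j|, nonzero cells only
    return sum(
        (n_ij / n) * math.log(n * n_ij / (count_u[u] * count_v[v]))
        for (u, v), n_ij in count_uv.items()
    )


print(mutual_info_from_labels([2, 4], [7, 9]))              # ln 2
print(mutual_info_from_labels([0, 0, 1, 1], [0, 1, 0, 1]))  # unrelated labelings give 0
```

Only the grouping matters, not the label values themselves, which is why relabeling `2, 4` as `7, 9` still gives the maximum score for two clusters.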