# TF-IDF
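TF-IDF (Term Frequency-Inverse Document Frequency).  
> A weighting technique widely used in information retrieval and text mining. TF-IDF is a statistical measure of how important a word is to one document in a collection or corpus. A word's importance increases in proportion to the number of times it appears in the document, but is offset by how frequently the word appears across the corpus.  

In short, the more often a word occurs in one document and the fewer documents in the corpus contain it, the better it represents that document.  

**Application**: keyword extraction. The higher a word's TF-IDF value in a document, the stronger a keyword candidate it is.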

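### 1. Basic principles
**Term frequency (TF)** is the number of times a given word occurs in a document. The count is usually normalized (typically by dividing it by the total number of words in the document) so that it is not biased toward long documents: the same word tends to have a higher raw count in a long document than in a short one, regardless of how important it is.    
- The frequency of keyword $w$ in document $D_i$:  
$$TF_{w,D_i}=\dfrac{\mathrm{count}(w)}{|D_i|}$$  
where $\mathrm{count}(w)$ is the number of occurrences of $w$ in $D_i$, and $|D_i|$ is the total number of words in $D_i$.  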
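
The normalized term frequency can be sketched in a few lines of Python. This is a minimal illustration that assumes whitespace tokenization; the helper name `term_frequency` is introduced here for illustration only, and the sample sentence is borrowed from the Python example later in this document.

```python
from collections import Counter

def term_frequency(word, document):
    """Return count(word) / |D| for a whitespace-tokenized document."""
    tokens = document.split()
    if not tokens:  # an empty document has no words to count
        return 0.0
    return Counter(tokens)[word] / len(tokens)

doc = "Julie loves me more than Linda loves me"
print(term_frequency("loves", doc))  # 2 occurrences out of 8 words
print(term_frequency("Linda", doc))  # 1 occurrence out of 8 words
```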


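**Inverse document frequency (IDF)** captures the idea that the fewer documents contain a term $t$, the larger its IDF and the better the term discriminates between categories. The IDF of a word is obtained by dividing the total number of documents by the number of documents that contain the word, then taking the logarithm of the quotient.  
- It reflects how common a keyword is: the more common a word (that is, the more documents contain it), the lower its IDF; the rarer the word, the higher its IDF. It is defined as:  
$$IDF_w=\log\dfrac{N}{\sum_{i=1}^N I(w,D_i)}$$  
where $N$ is the total number of documents and $I(w,D_i)$ indicates whether document $D_i$ contains keyword $w$: 1 if it does, 0 otherwise. If $w$ appears in none of the documents, the denominator is 0, so the IDF is smoothed:  
$$IDF_w=\log\dfrac{N}{1+\sum_{i=1}^N I(w,D_i)}$$  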
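
Both IDF variants can be sketched as follows. This is an illustrative sketch, assuming whitespace tokenization and the natural logarithm; the names `document_frequency` and `idf` and the `smooth` flag are choices made for this example, and the corpus is the one used in the Python example later in this document.

```python
import math

def document_frequency(word, corpus):
    """Number of documents whose whitespace tokens include `word`."""
    return sum(1 for doc in corpus if word in doc.split())

def idf(word, corpus, smooth=True):
    """log(N / df), or log(N / (1 + df)) when smooth is True."""
    n_docs = len(corpus)
    df = document_frequency(word, corpus)
    if smooth:
        return math.log(n_docs / (1 + df))
    if df == 0:
        raise ValueError(f"{word!r} occurs in no document; unsmoothed IDF is undefined")
    return math.log(n_docs / df)

corpus = [
    "Julie loves me more than Linda loves me",
    "Jane likes me more than Julie loves me",
    "He likes basketball more than baseball",
]
for word in ("Linda", "loves", "more", "unseen"):
    print(word, round(idf(word, corpus), 6))
```

Note that with this smoothing a word contained in every document gets a slightly negative IDF, $\log\dfrac{N}{N+1}$, and a word contained in $N-1$ documents gets exactly 0.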


The **TF-IDF** value of keyword $w$ in document $D_i$ is:   

$$\text{TF-IDF}_{w,D_i}=TF_{w,D_i}\times IDF_w$$  
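
As a worked example, assume the natural logarithm, the smoothed IDF, and a corpus of $N=3$ documents in which the word "Linda" occurs once in an eight-word document and in no other document:

$$TF=\dfrac{1}{8}=0.125,\qquad IDF=\log\dfrac{3}{1+1}\approx 0.405,\qquad TF\times IDF\approx 0.125\times 0.405\approx 0.051$$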
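### Summary
1. The more frequent a word is within a document and the rarer it is across the corpus (that is, the less common it is), the higher its TF-IDF value;  
2. TF-IDF balances term frequency against rarity: it filters out common words and keeps the important words that carry more information.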
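
To illustrate the keyword-extraction use case, the sketch below ranks the words of each document by TF-IDF. It is a minimal illustration, assuming whitespace tokenization, the normalized TF and the smoothed IDF defined above; the function names `tf_idf_scores` and `top_keywords` are hypothetical.

```python
import math
from collections import Counter

def tf_idf_scores(corpus):
    """Return one {word: tf-idf} dict per document (smoothed IDF)."""
    tokenized = [doc.split() for doc in corpus]
    n_docs = len(tokenized)
    # document frequency: in how many documents each word occurs
    df = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            word: (count / len(tokens)) * math.log(n_docs / (1 + df[word]))
            for word, count in counts.items()
        })
    return scores

def top_keywords(scores, k=3):
    """The k words with the highest TF-IDF value in one document."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

corpus = [
    "Julie loves me more than Linda loves me",
    "Jane likes me more than Julie loves me",
    "He likes basketball more than baseball",
]
all_scores = tf_idf_scores(corpus)
for doc, scores in zip(corpus, all_scores):
    print(doc, "->", top_keywords(scores))
```

On such a tiny corpus many words tie at zero or below, so only the words unique to one document ("Linda", "Jane", "He", "basketball", "baseball") stand out.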

### 2. Python example
### Basic term frequencies

In [30]:
# examples taken from here: http://stackoverflow.com/a/1750187
# compute term frequencies
mydoclist = ['Julie loves me more than Linda loves me',
           'Jane likes me more than Julie loves me',
           'He likes basketball more than baseball']

from collections import Counter

for doc in mydoclist:
    tf = Counter()
    for word in doc.split():
        tf[word] += 1
    print(tf.items())

dict_items([('loves', 2), ('more', 1), ('Linda', 1), ('than', 1), ('me', 2), ('Julie', 1)])
dict_items([('loves', 1), ('more', 1), ('Jane', 1), ('than', 1), ('likes', 1), ('me', 2), ('Julie', 1)])
dict_items([('baseball', 1), ('more', 1), ('than', 1), ('basketball', 1), ('He', 1), ('likes', 1)])


In [32]:
import string

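# build the set of all distinct words in the corpus
def build_lexicon(corpus):
    lexicon=set()
    for doc in corpus:
        for word in doc.split():
            # add() stores the whole word; update(word) would add its characters one by one
            lexicon.add(word)
    return lexicon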


# frequency of a term within one document
def tf(term,document):
    return freq(term,document)

def freq(term,document):
    return document.split().count(term)

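# vocabulary: the distinct words of the corpus, used as the column keys
vocabulary = build_lexicon(mydoclist)
print('Our vocabulary vector is ['+','.join(list(vocabulary))+']')

# collects one term-frequency vector per document
doc_term_matrix = []
for doc in mydoclist: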
    print('The doc is "' +doc+ '"')
    tf_vector =[tf(word,doc) for word in vocabulary]
    tf_vector_string = ','.join(format(freq,'d') for freq in tf_vector)
    print ('the tf vector for Document %d is [%s]'%((mydoclist.index(doc)+1),tf_vector_string))
    doc_term_matrix.append(tf_vector)

print('All combined, here is our master document term matrix:')
print(doc_term_matrix)

Our vocabulary vector is [baseball,more,Jane,He,Linda,loves,than,basketball,likes,me,Julie]
The doc is "Julie loves me more than Linda loves me"
the tf vector for Document 1 is [0,1,0,0,1,2,1,0,0,2,1]
The doc is "Jane likes me more than Julie loves me"
the tf vector for Document 2 is [0,1,1,0,0,1,1,0,1,2,1]
The doc is "He likes basketball more than baseball"
the tf vector for Document 3 is [1,1,0,1,0,0,1,1,1,0,0]
All combined, here is our master document term matrix:
[[0, 1, 0, 0, 1, 2, 1, 0, 0, 2, 1], [0, 1, 1, 0, 0, 1, 1, 0, 1, 2, 1], [1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0]]


### Normalizing vectors to L2 Norm

In [20]:
import math
import numpy as np

def l2_normalizer(vec):
    denom= np.sum([el**2 for el in vec])
    return [(el/math.sqrt(denom)) for el in vec]

doc_term_matrix_l2 = []
for vec in doc_term_matrix:
    doc_term_matrix_l2.append(l2_normalizer(vec))
    
print('A regular old document term matrix:')
print(np.matrix(doc_term_matrix))

print('\nA document term matrix with row-wide L2 norms of 1')
print(np.matrix(doc_term_matrix_l2))

A regular old document term matrix:
[[0 1 0 0 1 2 1 0 0 2 1]
 [0 1 1 0 0 1 1 0 1 2 1]
 [1 1 0 1 0 0 1 1 1 0 0]]

A document term matrix with row-wide L2 norms of 1
[[ 0.          0.28867513  0.          0.          0.28867513  0.57735027
   0.28867513  0.          0.          0.57735027  0.28867513]
 [ 0.          0.31622777  0.31622777  0.          0.          0.31622777
   0.31622777  0.          0.31622777  0.63245553  0.31622777]
 [ 0.40824829  0.40824829  0.          0.40824829  0.          0.
   0.40824829  0.40824829  0.40824829  0.          0.        ]]
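
One caveat: `l2_normalizer` divides by zero when a row is all zeros (for example, for an empty document). Below is a minimal sketch of a guarded variant that uses only the standard library; the name `l2_normalize_safe` is introduced here for illustration.

```python
import math

def l2_normalize_safe(vec):
    """Scale vec to unit L2 norm; map an all-zero vector to zeros."""
    norm = math.sqrt(sum(el * el for el in vec))
    if norm == 0:
        return [0.0 for _ in vec]
    return [el / norm for el in vec]

row = [0, 1, 0, 0, 1, 2, 1, 0, 0, 2, 1]  # first row of the matrix above
unit = l2_normalize_safe(row)
print([round(el, 8) for el in unit])
print(math.sqrt(sum(el * el for el in unit)))  # approximately 1.0
print(l2_normalize_safe([0, 0, 0]))
```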


### IDF frequency weighting

In [64]:
def numDocsContaining(word, doclist):
    doccount=0
    for doc in doclist:
        if freq(word,doc)>0:
            doccount += 1
    return doccount

def idf(word, doclist):
    n_samples=len(doclist)
    df = numDocsContaining(word,doclist)
    return np.log(n_samples/(1+df))
my_idf_vector = [idf(word,mydoclist) for word in vocabulary]

print('Our vocabulary vector is ['+','.join(list(vocabulary))+']')
print('The inverse document frequency vector is ['+','.join(format(freq,'f') for freq in
                                                           my_idf_vector)+']')

Our vocabulary vector is [baseball,more,Jane,He,Linda,loves,than,basketball,likes,me,Julie]
The inverse document frequency vector is [0.405465,-0.287682,0.405465,0.405465,0.405465,0.000000,-0.287682,0.405465,0.000000,0.000000,0.000000]


In [25]:
# turn idf_vector into idf_matrix: a diagonal matrix whose diagonal holds the vector's elements
def build_idf_matrix(idf_vector):
    idf_mat=np.zeros((len(idf_vector),len(idf_vector)))
    np.fill_diagonal(idf_mat,idf_vector)
    return idf_mat
my_idf_matrix=build_idf_matrix(my_idf_vector)
#print(my_idf_matrix)

### TF-IDF calculation

In [81]:
doc_term_matrix_tfidf = np.dot(doc_term_matrix_l2,my_idf_matrix)
print('TF_IDF matrix is:',doc_term_matrix_tfidf)

TF_IDF matrix is: [[ 0.         -0.08304666  0.          0.          0.11704769  0.
  -0.08304666  0.          0.          0.          0.        ]
 [ 0.         -0.09097306  0.12821933  0.          0.          0.
  -0.09097306  0.          0.          0.          0.        ]
 [ 0.16553044 -0.11744571  0.          0.16553044  0.          0.
  -0.11744571  0.16553044  0.          0.          0.        ]]
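
Multiplying by a diagonal matrix scales column j by the j-th IDF value, so the same result can be obtained with NumPy broadcasting, without building the square matrix. The sketch below copies the term counts and the (rounded) IDF values from the outputs above; the variable names `l2`, `via_diagonal` and `via_broadcast` are chosen for this illustration.

```python
import numpy as np

# term counts and IDF values copied from the outputs above
doc_term_matrix = np.array([[0, 1, 0, 0, 1, 2, 1, 0, 0, 2, 1],
                            [0, 1, 1, 0, 0, 1, 1, 0, 1, 2, 1],
                            [1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0]], dtype=float)
idf_vector = np.array([0.405465, -0.287682, 0.405465, 0.405465, 0.405465, 0.0,
                       -0.287682, 0.405465, 0.0, 0.0, 0.0])

# row-wise L2 normalization in one step
l2 = doc_term_matrix / np.linalg.norm(doc_term_matrix, axis=1, keepdims=True)

via_diagonal = l2 @ np.diag(idf_vector)  # the diagonal-matrix approach
via_broadcast = l2 * idf_vector          # scales column j by idf_vector[j]

print(np.allclose(via_diagonal, via_broadcast))
print(np.round(via_broadcast, 6))
```

Broadcasting avoids allocating a vocabulary-by-vocabulary matrix, which matters once the vocabulary grows beyond a toy example.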


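### 3. Implementing TF-IDF with sklearn  
**Note**:   
The difference from the procedure above lies in how IDF is computed. With its default setting `smooth_idf=True`, sklearn uses the formula:  

$$IDF_w=\log\dfrac{1+N}{1+\sum_{i=1}^N I(w,D_i)}+1$$  
With `smooth_idf=False` the formula becomes $\log\dfrac{N}{\sum_{i=1}^N I(w,D_i)}+1$. sklearn also lowercases the text and sorts the vocabulary alphabetically by default, so the column order below differs from the one used above.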

In [49]:
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer(min_df = 1)
term_freq_matrix = count_vectorizer.fit_transform(mydoclist)
print('Vocabulary:', count_vectorizer.vocabulary_)  

from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(norm = 'l2')
tfidf.fit(term_freq_matrix)

tf_idf_matrix = tfidf.transform(term_freq_matrix)
print(tf_idf_matrix.toarray())

Vocabulary: {'baseball': 0, 'more': 9, 'julie': 4, 'jane': 3, 'linda': 6, 'loves': 7, 'than': 10, 'basketball': 1, 'he': 2, 'likes': 5, 'me': 8}
[[ 0.          0.          0.          0.          0.28945906  0.
   0.38060387  0.57891811  0.57891811  0.22479078  0.22479078]
 [ 0.          0.          0.          0.41715759  0.3172591   0.3172591
   0.          0.3172591   0.6345182   0.24637999  0.24637999]
 [ 0.48359121  0.48359121  0.48359121  0.          0.          0.36778358
   0.          0.          0.          0.28561676  0.28561676]]
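
The third row of this output can be reproduced by hand, which also shows which IDF variant is in effect. The sketch below assumes scikit-learn's default `smooth_idf=True`, that is $IDF_w=\ln\dfrac{1+N}{1+df_w}+1$, applied to raw counts and followed by L2 normalization; the helper name `sklearn_style_idf` is made up for this illustration.

```python
import math

def sklearn_style_idf(n_docs, df):
    """Smoothed IDF: ln((1 + N) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + df)) + 1

n_docs = 3
# Document 3 is "He likes basketball more than baseball": each word occurs once.
# Document frequencies of its words across the whole corpus:
df = {"baseball": 1, "basketball": 1, "he": 1, "likes": 2, "more": 3, "than": 3}

# raw count (1 for every word here) times IDF, then L2-normalize the row
weights = {word: 1 * sklearn_style_idf(n_docs, d) for word, d in df.items()}
norm = math.sqrt(sum(w * w for w in weights.values()))
row = {word: w / norm for word, w in weights.items()}

for word, value in row.items():
    print(f"{word:<10} {value:.8f}")
```

The printed values agree with the 0.48359121, 0.36778358 and 0.28561676 in the third row above, whereas the unsmoothed formula $\ln\dfrac{N}{df_w}+1$ would give different numbers.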


In [56]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(min_df = 1)
tfidf_matrix = tfidf_vectorizer.fit_transform(mydoclist)

print(tfidf_matrix.toarray())

[[ 0.          0.          0.          0.          0.28945906  0.
   0.38060387  0.57891811  0.57891811  0.22479078  0.22479078]
 [ 0.          0.          0.          0.41715759  0.3172591   0.3172591
   0.          0.3172591   0.6345182   0.24637999  0.24637999]
 [ 0.48359121  0.48359121  0.48359121  0.          0.          0.36778358
   0.          0.          0.          0.28561676  0.28561676]]
