# Term Weighting, Vector Space Model

* Document frequency (df): the document frequency $df_t$ of term $t$ is the number of documents that contain $t$.
* Term frequency (tf): the term frequency $tf_{t,d}$ of term $t$ in document $d$ is the number of times that $t$ occurs in $d$.
* Inverse document frequency (idf)
	* Define the idf weight of $t$ by $idf_t = \log_{10}(N/df_t)$, where $N$ is the total number of documents.
	* Use $\log(N/df_t)$ instead of $N/df_t$ to 'dampen' the effect of idf.
* tf-idf: combine the definitions of term frequency and inverse document frequency to produce a composite weight for each term in each document: $tf\text{-}idf_{t,d} = tf_{t,d} \times idf_t$.
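
The definitions above can be tried out on a few made-up numbers. The following is a minimal sketch, not part of the exercise: the collection size `n_docs` and the function names `idf` and `tf_idf` are hypothetical choices for illustration.

```python
import math

def idf(n_docs, df):
    # idf_t = log10(N / df_t)
    return math.log10(n_docs / df)

def tf_idf(tf, n_docs, df):
    # tf-idf_{t,d} = tf_{t,d} * idf_t
    return tf * idf(n_docs, df)

n_docs = 1000  # hypothetical collection size, chosen for easy arithmetic

rare_idf = idf(n_docs, 10)        # term found in 10 of 1000 documents
common_idf = idf(n_docs, 1000)    # term found in every document
weight = tf_idf(3, n_docs, 10)    # the rare term occurring 3 times in one document

print(rare_idf, common_idf, weight)
```

The raw ratio $N/df$ for the rare term is 100, while its idf is only 2, which is the dampening effect of the logarithm. A term that appears in every document gets an idf of 0 and therefore contributes nothing to the tf-idf weight.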

## Vector Space Model
* Cosine similarity
  * $sim(d_1, d_2) = \frac{\pmb V(d_1) \cdot \pmb V(d_2)}{||\pmb V(d_1)|| \cdot ||\pmb V(d_2)||}$

  * The same measure compares a query with a document, which gives a way to rank documents against the query.
  * Compute the cosine of the angle between $\vec q$ and $\vec d$.
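
Ranking documents against a query by cosine similarity might look like the sketch below. The toy vectors and the names `cosine_similarity`, `doc_a`, and `doc_b` are assumptions made for illustration, and the function assumes neither vector is all zeros.

```python
import numpy as np

def cosine_similarity(u, v):
    # Dot product divided by the product of the Euclidean lengths.
    # Assumes both vectors are non-zero.
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical term-weight vectors over a three-term vocabulary.
query = [1, 0, 1]
doc_a = [2, 0, 2]   # same direction as the query, only longer
doc_b = [1, 1, 0]   # shares one term with the query

scores = {
    "doc_a": cosine_similarity(query, doc_a),
    "doc_b": cosine_similarity(query, doc_b),
}

# Rank documents from most to least similar to the query.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

Because the cosine depends only on the angle, `doc_a` scores the same as an exact copy of the query even though its weights are twice as large. Document length does not affect the ranking.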


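# Exercise

Suppose $N = 806{,}791$. The table below gives the term frequency of each term in Doc1 to Doc3, together with its document frequency (DF). Compute the idf of each term from its DF, then the tf-idf weight of each term in each document.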


| term      | Doc1 | Doc2 | Doc3 | DF     |
|-----------|------|------|------|--------|
| car       | 27   | 4    | 24   | 18,165 |
| auto      | 3    | 33   | 0    | 6,723  |
| insurance | 0    | 33   | 29   | 19,241 |
| best      | 14   | 0    | 17   | 25,235 |
	

```python
import numpy as np
import pandas as pd

def idf(n, DF):
    # idf_t = log10(N / df_t) for each term's document frequency.
    return np.log10(n / np.asarray(DF))

def tfidf(TF, IDF):
    # Scale each document's term-frequency row by the idf vector.
    return np.asarray(TF) * IDF

def show(data):
    columns = ["car", "auto", "insurance", "best"]
    return pd.DataFrame(data, columns=columns)

N = 806791
DF = [18165, 6723, 19241, 25235]
# One row per document (Doc1, Doc2, Doc3); columns follow the term order above.
TF = [[27, 3, 0, 14], [4, 33, 33, 0], [24, 0, 29, 17]]

IDF = idf(N, DF)
TFIDF = tfidf(TF, IDF)
show(TFIDF)
```

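Output (rows 0, 1, 2 correspond to Doc1, Doc2, Doc3):

|   | car       | auto      | insurance | best      |
|---|-----------|-----------|-----------|-----------|
| 0 | 44.483192 | 6.237594  | 0.0       | 21.066608 |
| 1 | 6.590103  | 68.613532 | 53.543602 | 0.0       |
| 2 | 39.540615 | 0.0       | 47.053469 | 25.580882 |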


Now compute the Euclidean-normalized document vectors, that is, divide each document's tf-idf vector by its Euclidean length.

```python
def doc_length(vec):
    # Euclidean (L2) length of a single document vector.
    return np.sqrt(sum(e**2 for e in vec))

def euclidean_norm(tfidf):
    # Divide each document's tf-idf vector by its Euclidean length.
    result = np.empty_like(tfidf)
    for index, vec in enumerate(tfidf):
        result[index] = vec / doc_length(vec)
    return result

EUCLIDEAN_NORM = euclidean_norm(TFIDF)
show(EUCLIDEAN_NORM)
```

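Output (rows 0, 1, 2 correspond to Doc1, Doc2, Doc3):

|   | car      | auto     | insurance | best     |
|---|----------|----------|-----------|----------|
| 0 | 0.896601 | 0.125725 | 0.0       | 0.424617 |
| 1 | 0.075503 | 0.786112 | 0.613455  | 0.0      |
| 2 | 0.59395  | 0.0      | 0.706803  | 0.384257 |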
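
As a possible next step, the normalized vectors can be compared with each other. The sketch below is an addition to the exercise, not part of it: it copies the rounded values shown above into a hypothetical array `docs` and relies on the fact that, for unit-length vectors, cosine similarity reduces to a plain dot product.

```python
import numpy as np

# Euclidean-normalized tf-idf vectors copied from the output above
# (rows: Doc1, Doc2, Doc3; columns: car, auto, insurance, best).
docs = np.array([
    [0.896601, 0.125725, 0.0,      0.424617],
    [0.075503, 0.786112, 0.613455, 0.0],
    [0.59395,  0.0,      0.706803, 0.384257],
])

# For unit-length vectors the cosine similarity is just the dot product,
# so a single matrix product yields every pairwise similarity.
similarity = docs @ docs.T

print(np.round(similarity, 3))
```

The diagonal comes out as approximately 1, since each vector is compared with itself. Among the off-diagonal entries, Doc1 and Doc3 form the closest pair and Doc1 and Doc2 the most distant.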
