
# Term Weighting, Vector Space Model

## Weighting
* Document Frequency ($df_t$): the number of documents in the collection that contain term $t$.
* Term Frequency ($tf_{t,d}$): the number of times term $t$ occurs in document $d$.
* Inverse Document Frequency (idf)
	* The idf weight of term $t$ is defined as $$idf_t = \log_{10}(\frac {N}{df_t})$$ where $N$ is the total number of documents in the collection.
	* Use $\log(\frac {N}{df_t})$ instead of the raw ratio $\frac {N}{df_t}$ to 'dampen' the effect of idf.
* **tf-idf**: combines term frequency and inverse document frequency to produce a composite weight for each term in each document: $$\text{tf-idf}_{t,d} = tf_{t,d} \times idf_t$$
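
As a rough illustration of the definitions above, the sketch below computes idf and tf-idf for a single term using only the standard library. The helper names `idf` and `tf_idf` and the counts `n_docs`, `df_t`, `tf_td` are hypothetical, chosen so the arithmetic is easy to follow.

```python
import math

def idf(n_docs: int, df_t: int) -> float:
    """Inverse document frequency with a base-10 logarithm."""
    return math.log10(n_docs / df_t)

def tf_idf(tf_td: int, n_docs: int, df_t: int) -> float:
    """Composite weight: raw term frequency times idf."""
    return tf_td * idf(n_docs, df_t)

# Illustrative numbers only: 1,000,000 documents, a term found in 1,000 of them,
# occurring 5 times in the document of interest.
n_docs, df_t, tf_td = 1_000_000, 1_000, 5

print(idf(n_docs, df_t))            # about 3, since log10(1000) = 3
print(tf_idf(tf_td, n_docs, df_t))  # about 15
```

A term that appears in every document gets idf = log10(1) = 0, so its tf-idf weight is 0 no matter how often it occurs.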


## Vector Space Model
### Consider documents as vectors
  * Represent the term-document weight matrix as a set of vectors, one per document.
  * Terms are axes of the space. V terms means V dimensions.
  * Documents are points or vectors in this space.
  * Cosine similarity $$sim(d_1, d_2) = \cos(\pmb V(d_1), \pmb V(d_2)) = \frac {\pmb V(d_1) \cdot \pmb V(d_2)}{||\pmb V(d_1)|| \cdot ||\pmb V(d_2)||} = \frac {\pmb V(d_1)}{||\pmb V(d_1)||} \cdot \frac {\pmb V(d_2)}{||\pmb V(d_2)||} $$
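
The cosine formula can be sketched in a few lines of NumPy. The function name `cosine_similarity` and the three-term vectors `d1`, `d2`, `d3` are made up for illustration; the sketch assumes neither vector is all zeros.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical weight vectors over a three-term vocabulary.
d1 = np.array([1.0, 2.0, 0.0])
d2 = np.array([2.0, 4.0, 0.0])  # same direction as d1, twice as long
d3 = np.array([0.0, 0.0, 3.0])  # shares no terms with d1

print(cosine_similarity(d1, d2))  # close to 1: vector length does not matter
print(cosine_similarity(d1, d3))  # 0.0: the vectors are orthogonal
```

Because the measure depends only on direction, a long document and a short one with the same term proportions are treated as equally similar.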

### Consider queries as vectors
  * Do the same for queries: represent them as vectors in the space
  * Rank documents according to their proximity to the query in this space

### Compare query and documents - rank documents
  * Compute the cosine similarity between the query and each document: cosine(query, document)
    * Cosine similarity $$sim(q, d) = \cos(\pmb V(q), \pmb V(d)) = \frac {\pmb V(q) \cdot \pmb V(d)}{||\pmb V(q)|| \cdot ||\pmb V(d)||}$$
  * With unit vectors, the cosine of the angle between the two vectors is simply their scalar product.

### Practical Calculation
  * For length-normalized vectors, cosine similarity is simply the dot product (or scalar product) $$\cos(\overrightarrow{q}, \overrightarrow{d}) = \overrightarrow{q} \cdot \overrightarrow{d} = \textstyle\sum_{i=1}^{V} q_i d_i $$ where $V$ is the number of terms (dimensions).
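
A minimal sketch of ranking with length-normalized vectors follows. The weights in `docs` and `query` are invented, and `normalize_rows` is a hypothetical helper; the sketch assumes no document vector is all zeros.

```python
import numpy as np

def normalize_rows(m: np.ndarray) -> np.ndarray:
    """Scale each row to unit Euclidean length (rows must be non-zero)."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)

# Hypothetical weights: 3 documents x 3 terms, plus one query.
docs = np.array([[3.0, 0.0, 4.0],
                 [0.0, 5.0, 0.0],
                 [1.0, 1.0, 1.0]])
query = np.array([1.0, 0.0, 1.0])

docs_unit = normalize_rows(docs)
query_unit = query / np.linalg.norm(query)

scores = docs_unit @ query_unit  # one dot product (= cosine) per document
ranking = np.argsort(-scores)    # document indices, best match first

print(scores)
print(ranking)
```

Normalizing once up front means each query costs only one matrix-vector product, with no per-pair division by vector lengths.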




## Exercise 1

Suppose N = 806,791. Using the idf values derived from the DF column, compute the tf-idf weight of each term in each document.


| term      | Doc1 | Doc2 | Doc3 | DF     |
|-----------|------|------|------|--------|
| car       | 27   | 4    | 24   | 18,165 |
| auto      | 3    | 33   | 0    | 6,723  |
| insurance | 0    | 33   | 29   | 19,241 |
| best      | 14   | 0    | 17   | 25,235 |
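
As a worked check of a single cell, take `car` in Doc1 (tf = 27, df = 18,165), with values rounded:

$$idf_{car} = \log_{10}(\frac {806791}{18165}) \approx 1.6475, \qquad \text{tf-idf}_{car, Doc1} = 27 \times 1.6475 \approx 44.48$$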
	

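```python
import numpy as np
import pandas as pd

def idf(n: int, df: np.ndarray) -> np.ndarray:
    """Return log10(n / df) for every document frequency in df."""
    return np.log10(n / np.asarray(df))

def tfidf(tf: np.ndarray, idf: np.ndarray) -> np.ndarray:
    """Multiply each document's term-frequency row by the idf vector."""
    return np.asarray(tf) * idf

def show(data: np.ndarray) -> pd.DataFrame:
    """Label the columns with the four terms for display."""
    columns = ["car", "auto", "insurance", "best"]
    return pd.DataFrame(data, columns=columns)

N = 806791
# One entry per term, in the order car, auto, insurance, best.
DF = np.array([18165, 6723, 19241, 25235])
# One row per document (Doc1, Doc2, Doc3), one column per term.
TF = np.array([[27, 3, 0, 14], [4, 33, 33, 0], [24, 0, 29, 17]])

IDF = idf(N, DF)
TFIDF = tfidf(TF, IDF)
show(TFIDF)
```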


|   | car       | auto      | insurance | best      |
|---|-----------|-----------|-----------|-----------|
| 0 | 44.483192 | 6.237594  | 0.0       | 21.066608 |
| 1 | 6.590103  | 68.613532 | 53.543602 | 0.0       |
| 2 | 39.540615 | 0.0       | 47.053469 | 25.580882 |


Now compute the Euclidean-normalized document vector for each document; each vector has four components, one for each of the four terms.

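```python
def doc_vector(tfidf: np.ndarray) -> float:
    """Euclidean length (L2 norm) of one document's tf-idf vector."""
    return float(np.sqrt(sum(e**2 for e in tfidf)))

def euclidean_norm(tfidf: np.ndarray) -> np.ndarray:
    """Divide every row by its Euclidean length so each row has length 1."""
    result = np.empty_like(tfidf, dtype=float)
    for index, e in enumerate(tfidf):
        length = doc_vector(e)
        # tf-idf weights are non-negative here, so abs() leaves them unchanged.
        result[index] = [abs(comp) / length for comp in e]
    return result

EUCLIDEAN_NORM = euclidean_norm(TFIDF)
show(EUCLIDEAN_NORM)
```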

|   | car      | auto     | insurance | best     |
|---|----------|----------|-----------|----------|
| 0 | 0.896601 | 0.125725 | 0.0       | 0.424617 |
| 1 | 0.075503 | 0.786112 | 0.613455  | 0.0      |
| 2 | 0.59395  | 0.0      | 0.706803  | 0.384257 |
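
The same normalization can also be written without explicit loops. The sketch below is self-contained: it recomputes the tf-idf matrix from the exercise table with NumPy broadcasting and checks that every normalized row has length 1. The names `tfidf_matrix`, `row_lengths`, and `unit_rows` are chosen here for illustration and do not appear in the cells above.

```python
import numpy as np

N = 806791
DF = np.array([18165, 6723, 19241, 25235])                        # per term
TF = np.array([[27, 3, 0, 14], [4, 33, 33, 0], [24, 0, 29, 17]])  # per document

# Broadcasting multiplies every row of TF by the idf vector.
tfidf_matrix = TF * np.log10(N / DF)

# keepdims=True keeps shape (3, 1) so the division applies row by row.
row_lengths = np.linalg.norm(tfidf_matrix, axis=1, keepdims=True)
unit_rows = tfidf_matrix / row_lengths

print(np.round(unit_rows, 6))
print(np.linalg.norm(unit_rows, axis=1))  # each entry should be 1 up to rounding
```

Checking the row lengths is a cheap sanity test: if any of them is far from 1, the normalization was probably applied along the wrong axis.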
