### Introduction
TF-IDF (Term Frequency - Inverse Document Frequency) measures how important a word is to a document within a collection of documents. It is widely used in information retrieval and machine learning.

### Definition
The TF-IDF is the element-wise product of two statistics: the term frequency and the inverse document frequency.

#### Term Frequency

The $tf(t, d)$ is the term frequency of term `t` in document `d`: $tf(t, d) = f_{t, d} / |d|$

__Where:__
- $f_{t, d}$ is the number of times term `t` occurs in document `d`
- $|d|$ is the total number of terms in document `d`

For example:
- Given document `d` = the cats are in the house, so $|d| = 6$
- $tf(\text{the}, d) = 2 / 6$
- $tf(\text{cats}, d) = 1 / 6$
- $tf(\text{are}, d) = 1 / 6$
- $tf(\text{in}, d) = 1 / 6$
- $tf(\text{house}, d) = 1 / 6$
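
As a quick check of the example above, here is a minimal sketch that computes the term frequencies with the standard library. The function name `term_frequencies` and the whitespace tokenization are assumptions made for illustration, not part of the notebook code further down.

```python
from collections import Counter


def term_frequencies(doc: str) -> dict:
    """Return tf(t, d) = f_{t,d} / |d| for every term of a whitespace-tokenized document."""
    tokens = doc.split()
    counts = Counter(tokens)
    # An empty document yields an empty dict, so no division by zero occurs.
    return {term: count / len(tokens) for term, count in counts.items()}


tf_example = term_frequencies("the cats are in the house")
print(tf_example["the"])   # 2 / 6
print(tf_example["cats"])  # 1 / 6
```

The frequencies of one document always sum to 1, which is a handy sanity check.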


#### Inverse Document Frequency

The $idf(t, D)$ is the inverse document frequency of term `t` in the collection `D`: $idf(t, D) = \log(|D| / |\{d \in D : t \in d\}|)$

__Where:__
- $|D|$ is the number of documents in the collection
- $|\{d \in D : t \in d\}|$ is the number of documents in the collection `D` in which term `t` appears
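
To make the definition concrete, the sketch below evaluates the inverse document frequency on the three-document collection used later in this notebook. It applies the natural logarithm to the ratio, as the `compute_idf` cell further down does. The names `inverse_document_frequency` and `example_corpus` are hypothetical, and raising an error for unseen terms is just one possible convention.

```python
import math


def inverse_document_frequency(term: str, corpus: list) -> float:
    """idf(t, D) = log(|D| / number of documents in D that contain t)."""
    n_containing = sum(1 for doc in corpus if term in doc.split())
    if n_containing == 0:
        # Assumption: treat a term that never occurs as an error rather than returning infinity.
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log(len(corpus) / n_containing)


example_corpus = [
    "the cats are in the house",
    "the dogs are in the house and outside",
    "the cats and the dogs are friends",
]
print(inverse_document_frequency("the", example_corpus))      # log(3 / 3) = 0.0
print(inverse_document_frequency("cats", example_corpus))     # log(3 / 2)
print(inverse_document_frequency("outside", example_corpus))  # log(3 / 1)
```

A term that appears in every document, such as "the", gets a weight of zero, while rarer terms get larger weights.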


#### TF-IDF

To compute the TF-IDF of a given document `d` in collection `D`, with vocabulary $t_1, ..., t_N$:
1. Compute $tf(d) = [tf(t_1, d), tf(t_2, d), ..., tf(t_N, d)]$
2. Compute $idf(D) = [idf(t_1, D), idf(t_2, D), ..., idf(t_N, D)]$
3. Compute $\text{tf-idf}(d, D) = tf(d) \odot idf(D)$, the element-wise product
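
The three steps above can be sketched for a whole collection at once with NumPy. This is an illustrative sketch, not the implementation used in the cells below: the names `docs`, `vocab`, `index` and `counts` are assumptions, the vocabulary is sorted alphabetically (so column order differs from the notebook's), and the idf uses the natural logarithm.

```python
import numpy as np

docs = [
    "the cats are in the house",
    "the dogs are in the house and outside",
    "the cats and the dogs are friends",
]

# Hypothetical vocabulary: every distinct word, sorted alphabetically.
vocab = sorted({w for doc in docs for w in doc.split()})
index = {w: i for i, w in enumerate(vocab)}

# counts[i, j] = number of times word j occurs in document i.
counts = np.zeros((len(docs), len(vocab)))
for row, doc in enumerate(docs):
    for w in doc.split():
        counts[row, index[w]] += 1

tf = counts / counts.sum(axis=1, keepdims=True)     # step 1: one tf vector per row
idf = np.log(len(docs) / (counts > 0).sum(axis=0))  # step 2: one idf value per word
tfidf = tf * idf                                    # step 3: element-wise, idf broadcast over rows

print(tfidf.shape)  # (3, 9)
```

Broadcasting multiplies every row of `tf` by the same `idf` vector, so each row of `tfidf` is the TF-IDF vector of one document.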

Reference: [Wikipedia](https://en.wikipedia.org/wiki/Tf–idf)

---

### Compute TF-IDF

In [2]:
import numpy as np

In [10]:
from typing import Dict, List, Tuple

def build_vocab(D: List[str]) -> Tuple[Dict[str, int], List[str]]:
    # Map each distinct word to an index, in order of first appearance.
    word2idx = {}
    idx2word = []
    for d in D:
        for w in d.split(" "):
            if w not in word2idx:
                idx = len(idx2word)
                word2idx[w] = idx
                idx2word.append(w)
    return word2idx, idx2word

D = [
    "the cats are in the house",
    "the dogs are in the house and outside",
    "the cats and the dogs are friends",
]

word2idx, idx2word = build_vocab(D)

In [20]:
def compute_freq(doc: str, word2idx: Dict[str, int]) -> np.ndarray:
    # Count how often each vocabulary word occurs in the document.
    # https://numpy.org/doc/stable/reference/generated/numpy.zeros.html
    res = np.zeros(len(word2idx), dtype=np.int32)
    for w in doc.split(" "):
        # Skip words that are not in the vocabulary.
        if w not in word2idx:
            continue

        idx = word2idx[w]
        res[idx] += 1

    return res

freq = compute_freq(D[0], word2idx)
print(f"Document: {D[0]}")
print(f"Encoded document: {freq}")

Document: the cats are in the house
Encoded document: [2 1 1 1 1 0 0 0 0]


In [22]:
for i, v in enumerate(freq):
    print(i, v/np.sum(freq))

0 0.3333333333333333
1 0.16666666666666666
2 0.16666666666666666
3 0.16666666666666666
4 0.16666666666666666
5 0.0
6 0.0
7 0.0
8 0.0


In [25]:
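def compute_tf(d: str, word2idx: Dict[str, int]) -> np.ndarray:
    # Term frequency: raw counts divided by the number of in-vocabulary words.
    freq = compute_freq(d, word2idx)
    size = np.sum(freq)
    if size == 0:
        # No known words: return zeros instead of dividing by zero.
        return np.zeros_like(freq, dtype=np.float32)
    return (freq / size).astype(np.float32)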

tf = compute_tf(D[0], word2idx)
print(f"Document: {D[0]}")
print(f"TF: {tf}")

tf = compute_tf(D[1], word2idx)
print(f"Document: {D[1]}")
print(f"TF: {tf}")

Document: the cats are in the house
TF: [0.33333334 0.16666667 0.16666667 0.16666667 0.16666667 0.
 0.         0.         0.        ]
Document: the dogs are in the house and outside
TF: [0.25  0.    0.125 0.125 0.125 0.125 0.125 0.125 0.   ]


In [41]:
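def compute_idf(D: List[str], word2idx: Dict[str, int]) -> np.ndarray:
    N = len(D)
    # df[i] counts the documents that contain word i.
    df = np.zeros(len(word2idx))
    for d in D:
        f = compute_freq(d, word2idx)
        df = df + (f > 0)
    # Assumes every vocabulary word occurs in at least one document of D;
    # otherwise this divides by zero.
    return np.log(N / df)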

In [42]:
idf = compute_idf(D, word2idx)
tf = compute_tf(D[0], word2idx)

print(f"Document: {D[0]}")
print(f"TF-IDF: {tf * idf}")

Document: the cats are in the house
TF-IDF: [0.         0.06757752 0.         0.06757752 0.06757752 0.
 0.         0.         0.        ]


### Compare two TF-IDF vectors

In [51]:
idf = compute_idf(D, word2idx)
v0 = compute_tf(D[0], word2idx) * idf
v1 = compute_tf(D[1], word2idx) * idf
v2 = compute_tf(D[2], word2idx) * idf

#### Scalar Product

In [55]:
print(f"D0: {D[0]}")
print(f"D1: {D[1]}")
print(f"Scalar Product: {v0.dot(v1)}")
print()
print(f"D0: {D[0]}")
print(f"D2: {D[2]}")
print(f"Scalar Product: {v0.dot(v2)}")
print()
print(f"D1: {D[1]}")
print(f"D2: {D[2]}")
print(f"Scalar Product: {v1.dot(v2)}")

D0: the cats are in the house
D1: the dogs are in the house and outside
Scalar Product: 0.006850081616363562

D0: the cats are in the house
D2: the cats and the dogs are friends
Scalar Product: 0.003914332527192041

D1: the dogs are in the house and outside
D2: the cats and the dogs are friends
Scalar Product: 0.0058714986158037675


#### Cosine Similarity

In [53]:
v0_norm = v0 / np.linalg.norm(v0)
v1_norm = v1 / np.linalg.norm(v1)
v2_norm = v2 / np.linalg.norm(v2)


print(f"D0: {D[0]}")
print(f"D1: {D[1]}")
print(f"Cosine Similarity: {v0_norm.dot(v1_norm)}")
print()
print(f"D0: {D[0]}")
print(f"D2: {D[2]}")
print(f"Cosine Similarity: {v0_norm.dot(v2_norm)}")
print()
print(f"D1: {D[1]}")
print(f"D2: {D[2]}")
print(f"Cosine Similarity: {v1_norm.dot(v2_norm)}")

D0: the cats are in the house
D1: the dogs are in the house and outside
Cosine Similarity: 0.34287439123039537

D0: the cats are in the house
D2: the cats and the dogs are friends
Cosine Similarity: 0.17953479253880886

D1: the dogs are in the house and outside
D2: the cats and the dogs are friends
Cosine Similarity: 0.1846736480892583
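
The cell above normalizes each vector and then takes the scalar product, which is the cosine similarity $\cos(a, b) = a \cdot b / (\|a\| \|b\|)$. Unlike the raw scalar product, it does not depend on document length. A minimal sketch of the same computation as a reusable helper follows; the function name `cosine_similarity` and the choice to return 0.0 for an all-zero vector are assumptions made here, not something the notebook defines.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(a, b) = a . b / (||a|| * ||b||)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        # Assumed convention: an all-zero vector is similar to nothing.
        return 0.0
    return float(np.dot(a, b) / denom)


print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 0.0])))  # 1.0
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 0.0
```

The zero-vector guard matters for TF-IDF in particular: a document made only of words that occur in every document has an all-zero vector, and dividing by its norm would produce NaN.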
