<a href="https://colab.research.google.com/github/rahiakela/natural-language-processing-research-and-practice/blob/main/nlp-fundamental-works/tf_idf_fundamentals.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## TF-IDF Fundamentals

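The term frequency-inverse document frequency (`tf-idf`) score is a statistic
that quantifies the intuition that a term tells us more about a document when it
occurs often in that document but rarely across the rest of the corpus.

* One of the most popular term-weighting schemes used today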
* Let $t$ be a term (n-gram), $d$ be a document, and $D$ be a
corpus (collection of documents) under consideration
* The `tf-idf` score of term $t$ in document $d$ with respect to
corpus $D$ is

   $$tfidf(t, d, D) = tf(t, d) * idf(t, D) $$

* Many different methods for quantifying `tf` and `idf`


<img src='https://github.com/rahiakela/natural-language-processing-research-and-practice/blob/main/nlp-fundamental-works/images/tf-idf-corpus.png?raw=1' width='400'/>

* Term frequency $tf(t, d)$: Typically the fraction of terms in document $d$
which are term $t$

   * Letting $f_{t,d}$ be the number of occurrences of $t$ in $d$,

    $$tf(t, d) = \frac{f_{t,d}}{\sum_{\hat t} f_{\hat t, d}}$$

* Inverse document frequency $idf(t, D)$: A measure of how
rare term $t$ is across the corpus $D$ (i.e., how much information
it provides about a document it appears in)

    * Letting $N=|D|$ be the number of documents in the corpus and $n_t$
 be the number of documents where $t$ occurs, it is typically quantified as

  $$idf(t, D) = \log_{10} \left[ \left( \frac{n_t}{N} \right)^{-1} \right] = \log_{10} \frac{N}{n_t} $$
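
The definitions above can be sketched directly in plain Python. This is a minimal illustration rather than a reference implementation: the function names, the `ValueError` raised for terms that occur nowhere, and the two-document toy corpus are all assumptions made for the example.

```python
import math
from collections import Counter


def tf(term, doc):
    """Fraction of the tokens in `doc` that are `term`."""
    counts = Counter(doc)
    return counts[term] / len(doc)


def idf(term, corpus):
    """log10(N / n_t), where n_t is the number of documents containing `term`."""
    n_t = sum(1 for doc in corpus if term in doc)
    if n_t == 0:
        # The formula is undefined for a term that appears nowhere
        # (raising here is an assumed way of handling that case).
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log10(len(corpus) / n_t)


def tfidf(term, doc, corpus):
    """Product of term frequency and inverse document frequency."""
    return tf(term, doc) * idf(term, corpus)


# Hypothetical toy corpus: each document is a list of tokens.
corpus = [["apple", "banana"], ["apple", "cherry", "cherry"]]

# "apple" occurs in every document, so idf = log10(2 / 2) = 0.
print(tfidf("apple", corpus[0], corpus))
# "cherry" occurs in one of two documents: (2 / 3) * log10(2 / 1).
print(tfidf("cherry", corpus[1], corpus))
```

Note that a term present in every document scores zero no matter how often it occurs, which is exactly the behaviour the `idf` factor is meant to provide.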




## Example

Dataset: Take the following four strings to be (very small) documents
comprising a (very small) corpus:

```txt
1. “The sky is blue.”
2. “The sun is bright today.”
3. “The sun in the sky is bright.”
4. “We can see the shining sun, the bright sun.”
```

Task: Filter out obvious stopwords, and determine the tf-idf scores of each
term in each document.

After stopword filtering:

```txt
(1) "sky blue", 
(2) "sun bright today", 
(3) "sun sky bright", 
(4) "can see shining sun bright sun"
```
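
The filtering step can be reproduced with a few lines of Python. The stopword set below is an assumption: it is simply the smallest set that turns the four sentences into the filtered strings shown above, not a standard stopword list.

```python
import re

documents = [
    "The sky is blue.",
    "The sun is bright today.",
    "The sun in the sky is bright.",
    "We can see the shining sun, the bright sun.",
]

# Assumed stopword set, chosen only to match the filtered output in this example.
STOPWORDS = {"the", "is", "in", "we"}


def remove_stopwords(text):
    # Lowercase, keep alphabetic runs only (this drops punctuation), then filter.
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(t for t in tokens if t not in STOPWORDS)


filtered = [remove_stopwords(doc) for doc in documents]
for doc in filtered:
    print(doc)
```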

Let's define the document-to-word matrix, with one row per document and one column per term.

<img src='https://github.com/rahiakela/natural-language-processing-research-and-practice/blob/main/nlp-fundamental-works/images/tf.png?raw=1' width='800'/>

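```python
import numpy as np

# Raw term counts: one row per document, one column per term.
# The columns appear to follow alphabetical order:
# blue, bright, can, see, shining, sky, sun, today
tf_mat = np.array([
  [1, 0, 0, 0, 0, 1, 0, 0],
  [0, 1, 0, 0, 0, 0, 1, 1],
  [0, 1, 0, 0, 0, 1, 1, 0],
  [0, 1, 1, 1, 1, 0, 2, 0],
])
```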
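
Rather than typing the counts by hand, the same matrix can be derived from the filtered documents. The sketch below assumes the columns are ordered alphabetically by term, which is consistent with the hand-written matrix; the names `docs`, `vocab`, and `counts` are illustrative.

```python
import numpy as np

docs = [
    "sky blue",
    "sun bright today",
    "sun sky bright",
    "can see shining sun bright sun",
]

# Assumed column order: the vocabulary sorted alphabetically.
vocab = sorted({word for doc in docs for word in doc.split()})
column = {word: j for j, word in enumerate(vocab)}

# counts[i, j] = number of times vocab[j] occurs in document i.
counts = np.zeros((len(docs), len(vocab)), dtype=int)
for i, doc in enumerate(docs):
    for word in doc.split():
        counts[i, column[word]] += 1

print(vocab)
print(counts)
```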

## Term-frequency calculation

Starting from the document-to-word count matrix, normalize each row so that it sums to 1.

$$tf(t, d) = \frac{f_{t, d}}{\sum_{\hat t} f_{\hat t, d}} $$

```python
# Total number of terms in each document (row sums)
row_sum = tf_mat.sum(axis=1)
row_sum
```

```txt
array([2, 3, 3, 6])
```

Let's define the `tf` function.

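```python
def tf(mat, r_sum):
  # Divide each row of counts by that row's total so every row sums to 1.
  # r_sum[:, np.newaxis] has shape (n_docs, 1) and broadcasts across the columns.
  return mat / r_sum[:, np.newaxis]
```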

```python
tf_mat_tmp = tf(tf_mat, row_sum)
print(tf_mat_tmp)
```

```txt
[[0.5        0.         0.         0.         0.         0.5
  0.         0.        ]
 [0.         0.33333333 0.         0.         0.         0.
  0.33333333 0.33333333]
 [0.         0.33333333 0.         0.         0.         0.33333333
  0.33333333 0.        ]
 [0.         0.16666667 0.16666667 0.16666667 0.16666667 0.
  0.33333333 0.        ]]
```
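
As a quick sanity check, each row of the normalized matrix should sum to 1, and pairing the values with term labels makes the result easier to read. This sketch is self-contained and assumes the alphabetical column order suggested by the count matrix; `vocab`, `counts`, and `tf_values` are illustrative names.

```python
import numpy as np

# Assumed column labels (alphabetical order of the filtered vocabulary).
vocab = ["blue", "bright", "can", "see", "shining", "sky", "sun", "today"]

counts = np.array([
    [1, 0, 0, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 0, 1, 1],
    [0, 1, 0, 0, 0, 1, 1, 0],
    [0, 1, 1, 1, 1, 0, 2, 0],
])

# keepdims=True keeps the row sums as a column vector so they broadcast per row.
tf_values = counts / counts.sum(axis=1, keepdims=True)

# Every document's term frequencies should add up to 1.
assert np.allclose(tf_values.sum(axis=1), 1.0)

# Show only the terms that actually occur in each document.
for doc_id, row in enumerate(tf_values, start=1):
    present = {term: round(float(v), 3) for term, v in zip(vocab, row) if v > 0}
    print(doc_id, present)
```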


## Inverse-document-frequency calculation