# TF (Term Frequency)

- It is defined for a word w<sub>i</sub> in a **document d<sub>i</sub>**.
- It is defined as:

> $$\text{TF}(w_i, d_i) = \frac{\text{Number of occurrences of the word } w_i}{\text{Total number of words in the document } d_i}$$

- Term frequency of a word w<sub>i</sub> in a document d<sub>i</sub> lies between 0 and 1, i.e.,
> 0 <= TF(w<sub>i</sub>, d<sub>i</sub>) <= 1

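- Since TF(w<sub>i</sub>, d<sub>i</sub>) lies between 0 and 1, and the TF values of all distinct words in d<sub>i</sub> sum to 1, it can be interpreted as the probability of getting the word w<sub>i</sub> when a word is drawn at random from document d<sub>i</sub>.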
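
The definition above can be sketched in a few lines of Python. This is a minimal illustration, assuming the document has already been lowercased and split into words; the function name `term_frequency` and the sample sentence are hypothetical and chosen only for this example.

```python
from collections import Counter


def term_frequency(word, document_tokens):
    """TF(w, d): occurrences of `word` divided by the total number of words in d."""
    if not document_tokens:
        return 0.0  # assumption: an empty document gives every word a TF of 0
    return Counter(document_tokens)[word] / len(document_tokens)


# Hypothetical document, already lowercased and split on whitespace
tokens = "this document is the second document".split()

tf_document = term_frequency("document", tokens)  # 2 occurrences out of 6 words
tf_missing = term_frequency("first", tokens)      # word absent from the document

# Summing TF over the distinct words of the document gives 1,
# which is why TF can be read as a probability.
tf_total = sum(term_frequency(w, set_word_tokens := tokens) for w in set(tokens))
```

Here `tf_document` is 2/6, `tf_missing` is 0, and `tf_total` is 1 up to floating-point rounding.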

# IDF (Inverse Document Frequency)

- It is defined for a word w<sub>i</sub> in the **document corpus D<sub>c</sub>**.
- It is defined as:

> $$\text{IDF}(w_i, D_c) = \log{\frac{N}{n_i}}$$
<center>where N = Total number of documents</center>

<center>$n_{i}$ = Number of documents in which the word w<sub>i</sub> occurs</center>

- Since $n_{i}$ <= N (always), this implies $\frac{N}{n_{i}}$ >= 1 (always). Hence, 
> $$\text{IDF}(w_i, D_c) = \log{\frac{N}{n_i}} \geq 0 \text{ (always)}$$

- From the above relations we can see that **as $n_{i}$ increases, IDF decreases, and vice versa**.
- That means if the word $w_{i}$ occurs in many documents of the corpus, its IDF is small, and if it occurs in only a few documents, its IDF is large.
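
The relation between $n_{i}$ and IDF can be checked with a small sketch. It assumes the natural logarithm (the formula above does not fix a base) and that the word occurs in at least one document; the function name `inverse_document_frequency` and the tokenized corpus are hypothetical.

```python
import math


def inverse_document_frequency(word, corpus_tokens):
    """IDF(w, D_c) = log(N / n_i), using the natural logarithm (an assumption)."""
    n_docs = len(corpus_tokens)
    n_containing = sum(1 for doc in corpus_tokens if word in doc)
    if n_containing == 0:
        # The formula is undefined when the word occurs in no document
        raise ValueError(f"{word!r} does not occur in the corpus")
    return math.log(n_docs / n_containing)


# Hypothetical corpus, already lowercased and split into words
corpus_tokens = [
    "this is the first document".split(),
    "this document is the second document".split(),
    "and this is the third one".split(),
    "is this the first document".split(),
]

idf_this = inverse_document_frequency("this", corpus_tokens)    # in all 4 documents
idf_first = inverse_document_frequency("first", corpus_tokens)  # in 2 documents
idf_third = inverse_document_frequency("third", corpus_tokens)  # in 1 document
```

A word that occurs in every document ("this") gets an IDF of 0, while the rarest word ("third") gets the largest value, log 4.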

# TF-IDF

- In this scheme, each dimension of the vector $v_{i}$ representing a document $d_{i}$ corresponds to one word, and its value is calculated as:

> $$\text{TF}(w_i, d_i)*\text{IDF}(w_i, D_c)$$

- Usage example: Let there be 6 dimensions in a vector $v_{i}$ and every dimension represents a unique word as depicted below:

| w<sub>1</sub> |  w<sub>2</sub>  |  w<sub>3</sub>  | w<sub>4</sub>  | w<sub>5</sub>  |w<sub>6</sub>  |
| --- |--- | --- | --- |--- |--- |
|  |  |  |  |  |  |

- Then the value of the dimension corresponding to, say w<sub>4</sub>, is calculated as:

$$\text{TF}(w_4, d_i)*\text{IDF}(w_4, D_c)$$

- TF-IDF gives more **importance to rarer words** in the **document corpus** because of the presence of **IDF** in the formula.
- TF-IDF also gives more **importance to frequent words** in a **document** because of the presence of **TF** in the formula.
- This scheme doesn't consider the semantic meaning of words. For example, the words 'tasty' and 'delicious' are assigned separate dimensions, even though they mean nearly the same thing.
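
The points above can be illustrated with a small sketch that builds TF-IDF vectors directly from the two formulas. It assumes the natural logarithm, non-empty documents, and a vocabulary sorted alphabetically; the function name `tf_idf_vectors` and the three review sentences are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(corpus_tokens):
    """Return (vocabulary, vectors) with one TF * IDF value per word and document.

    Assumes every document contains at least one word.
    """
    n_docs = len(corpus_tokens)
    vocabulary = sorted({w for doc in corpus_tokens for w in doc})
    doc_freq = {w: sum(1 for doc in corpus_tokens if w in doc) for w in vocabulary}
    vectors = []
    for doc in corpus_tokens:
        counts = Counter(doc)
        vectors.append([
            (counts[w] / len(doc)) * math.log(n_docs / doc_freq[w])
            for w in vocabulary
        ])
    return vocabulary, vectors


# Hypothetical corpus, already lowercased and split into words
corpus_tokens = [
    "the pizza was tasty".split(),
    "the pasta was delicious".split(),
    "the service was slow".split(),
]
vocabulary, vectors = tf_idf_vectors(corpus_tokens)

tasty_weight = vectors[0][vocabulary.index("tasty")]  # rare word, non-zero weight
the_weight = vectors[0][vocabulary.index("the")]      # in every document, weight 0

# 'tasty' and 'delicious' sit in different dimensions, so the first two
# documents share no weighted dimension and their dot product is 0.
dot_product = sum(a * b for a, b in zip(vectors[0], vectors[1]))
```

In this toy corpus the first two reviews say nearly the same thing, yet their vectors have a dot product of 0 because the scheme treats 'tasty' and 'delicious' as unrelated words.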

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

vectorizer = TfidfVectorizer()

X = vectorizer.fit_transform(corpus)

vectorizer.get_feature_names_out()

array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], dtype=object)

In [None]:
print(X.toarray())

[[0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]
 [0.         0.6876236  0.         0.28108867 0.         0.53864762
  0.28108867 0.         0.28108867]
 [0.51184851 0.         0.         0.26710379 0.51184851 0.
  0.26710379 0.51184851 0.26710379]
 [0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]]


In [None]:
import pandas as pd
# Label each column with its vocabulary term
df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
df



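|  | and | document | first | is | one | second | the | third | this |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.0 | 0.469791 | 0.580286 | 0.384085 | 0.0 | 0.0 | 0.384085 | 0.0 | 0.384085 |
| 1 | 0.0 | 0.687624 | 0.0 | 0.281089 | 0.0 | 0.538648 | 0.281089 | 0.0 | 0.281089 |
| 2 | 0.511849 | 0.0 | 0.0 | 0.267104 | 0.511849 | 0.0 | 0.267104 | 0.511849 | 0.267104 |
| 3 | 0.0 | 0.469791 | 0.580286 | 0.384085 | 0.0 | 0.0 | 0.384085 | 0.0 | 0.384085 |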
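
The values returned by `TfidfVectorizer` differ from what the plain formulas in the earlier sections would give. With its default settings, scikit-learn uses the raw count as TF, a smoothed IDF of $\ln\frac{1 + N}{1 + n_i} + 1$, and then scales each row to unit Euclidean length. The sketch below reproduces the numbers by hand under those assumptions; the regular expression is only an approximation of the default tokenizer (lowercased tokens of two or more word characters), and the variable names are hypothetical.

```python
import math
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Approximation of the default tokenization: lowercase, tokens of 2+ word characters
docs = [re.findall(r"\b\w\w+\b", text.lower()) for text in corpus]
vocabulary = sorted({w for doc in docs for w in doc})
n_docs = len(docs)

# Smoothed IDF: ln((1 + N) / (1 + n_i)) + 1
idf = {
    w: math.log((1 + n_docs) / (1 + sum(1 for doc in docs if w in doc))) + 1
    for w in vocabulary
}

rows = []
for doc in docs:
    counts = Counter(doc)
    weights = [counts[w] * idf[w] for w in vocabulary]  # raw count times IDF
    norm = math.sqrt(sum(x * x for x in weights))       # L2 norm of the row
    rows.append([x / norm for x in weights])

for row in rows:
    print([round(x, 6) for x in row])
```

Because of the `+ 1` term, a word that occurs in every document (such as 'the') still gets a non-zero weight here, unlike with the unsmoothed $\log\frac{N}{n_i}$.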
