## Term Frequency - Inverse Document Frequency (TF-IDF)

TF-IDF is a text-mining technique that converts text into a numeric table.
In the resulting table, rows represent documents and columns represent words.
Each cell holds a value that indicates how strongly the word is associated with the document.

##### _**Creating the count table:**_

| Document | sample | good | word | again | same | real | hurt |
| -------- | ------ | ---- | ---- | ----- | ---- | ---- | ---- |
| Doc1     | 1      | 1    | 1    |       |      |      |      |
| Doc2     |        |      | 2    | 2     | 1    |      |      |
| Doc3     |        |      | 1    |       |      | 1    | 1    |

##### _**Finding term frequency:**_

| Document | sample | good | word | again | same | real | hurt |
| -------- | ------ | ---- | ---- | ----- | ---- | ---- | ---- |
| Doc1     | 0.33   | 0.33 | 0.33 |       |      |      |      |
| Doc2     |        |      | 0.4  | 0.4   | 0.2  |      |      |
| Doc3     |        |      | 0.33 |       |      | 0.33 | 0.33 |
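
The two tables above can be reproduced with a short sketch. The token lists below are hypothetical: the original documents are not shown, so they were chosen only to match the count table. Term frequency is assumed here to be a word's count divided by the total number of words in the document, which is consistent with the values shown (for example, 2/5 = 0.4 for Doc2).

```python
from collections import Counter

# Hypothetical token lists, chosen only to match the count table above
docs = {
    "Doc1": ["sample", "good", "word"],
    "Doc2": ["word", "again", "word", "again", "same"],
    "Doc3": ["word", "real", "hurt"],
}


def term_frequency(tokens):
    """Return each term's count divided by the document's total token count."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


tf = {name: term_frequency(tokens) for name, tokens in docs.items()}

for name, weights in tf.items():
    print(name, {term: round(weight, 2) for term, weight in weights.items()})
```

Words that do not occur in a document are simply absent from its dictionary, which corresponds to the empty cells in the tables.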

##### _**Finding inverse document frequency:**_

$$\text{IDF} = \log_e\left(\frac{\text{total docs}}{\text{docs with the word}}\right)$$

| Term | sample | good  | word | again | same  | real  | hurt  |
| ---- | ------ | ----- | ---- | ----- | ----- | ----- | ----- |
| IDF  | 1.099  | 1.099 | 0    | 1.099 | 1.099 | 1.099 | 1.099 |

IDF highlights words that are distinctive, meaning they appear in only a few documents.  
The fewer documents contain a word, the higher its IDF. A word that appears in every document, such as "word" here, gets an IDF of $\log_e(3/3) = 0$.
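
As a minimal sketch of the formula above, the snippet below computes the IDF of every word in the example. The per-document word sets are hypothetical and were chosen only to match the tables in this section.

```python
import math

# Hypothetical documents, reduced to their sets of distinct words
doc_terms = {
    "Doc1": {"sample", "good", "word"},
    "Doc2": {"word", "again", "same"},
    "Doc3": {"word", "real", "hurt"},
}

total_docs = len(doc_terms)
vocabulary = sorted(set().union(*doc_terms.values()))

idf = {}
for term in vocabulary:
    # Number of documents in which the term occurs at least once
    docs_with_term = sum(term in terms for terms in doc_terms.values())
    idf[term] = math.log(total_docs / docs_with_term)  # natural logarithm

for term, value in idf.items():
    print(f"{term:>6}: {value:.3f}")
```

Every word that occurs in a single document gets $\log_e(3/1) \approx 1.0986$, while "word" gets 0 because it occurs in all three.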

##### _**Finding TF-IDF:**_

$$\text{TF-IDF} = \text{TF} \times \text{IDF}$$

| Document | sample | good | word | again | same | real | hurt |
| -------- | ------ | ---- | ---- | ----- | ---- | ---- | ---- |
| Doc1     | 0.36   | 0.36 | 0    |       |      |      |      |
| Doc2     |        |      | 0    | 0.44  | 0.22 |      |      |
| Doc3     |        |      | 0    |       |      | 0.36 | 0.36 |
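
Putting the steps together, the following sketch computes TF-IDF from raw tokens in one pass. The token lists are again hypothetical stand-ins that match the count table, and the definitions of TF (count divided by document length) and IDF (natural log of total documents over documents containing the word) are the ones used in this section, not those of any particular library.

```python
import math
from collections import Counter

# Hypothetical token lists, chosen only to match the count table
docs = {
    "Doc1": ["sample", "good", "word"],
    "Doc2": ["word", "again", "word", "again", "same"],
    "Doc3": ["word", "real", "hurt"],
}

n_docs = len(docs)

# Document frequency: in how many documents each term occurs
doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))

tfidf = {}
for name, tokens in docs.items():
    counts = Counter(tokens)
    tfidf[name] = {
        term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }

for name, weights in tfidf.items():
    print(name, {term: round(weight, 2) for term, weight in weights.items()})
```

Because this code keeps full precision, the Doc1 and Doc3 scores come out at about 0.37 (one third of 1.0986), whereas the table's 0.36 results from multiplying the already-rounded TF of 0.33 by the IDF. In every document "word" scores 0, since a term present in all documents carries no distinguishing information.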


## Build TF-IDF matrix

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# use a small corpus for easy visualization
vector_corpus = [
    'NBA is a Basketball league',
    'Basketball is popular in America',
    'TV in America telecast BasketBall'
]

# create a vectorizer that drops common English stop words
vectorizer = TfidfVectorizer(stop_words='english')

# fit the vectorizer and transform the corpus into a sparse TF-IDF matrix
tfidf = vectorizer.fit_transform(vector_corpus)

print("Tokens used as features:")
print(vectorizer.get_feature_names_out())
print(f"\nSize of array: {tfidf.shape}. Each row represents a document. Each column represents a feature/token")
print("\nActual TF-IDF array:")
print(tfidf.toarray())
```

```
Tokens used as features:
['america' 'basketball' 'league' 'nba' 'popular' 'telecast' 'tv']

Size of array: (3, 7). Each row represents a document. Each column represents a feature/token

Actual TF-IDF array:
[[0.         0.38537163 0.65249088 0.65249088 0.         0.
  0.        ]
 [0.54783215 0.42544054 0.         0.         0.72033345 0.
  0.        ]
 [0.44451431 0.34520502 0.         0.         0.         0.5844829
  0.5844829 ]]
```
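
These numbers do not follow the plain TF × IDF formula from the previous section: "basketball" occurs in every sentence yet has a non-zero score. The sketch below reproduces the array by hand, assuming `TfidfVectorizer` is running with its default settings (`smooth_idf=True`, `norm='l2'`, raw counts as term frequency). The count matrix is typed in manually from the corpus after stop-word removal, and the variable names are illustrative only.

```python
import numpy as np

# Raw term counts per sentence after English stop-word removal.
# Columns: america, basketball, league, nba, popular, telecast, tv
counts = np.array([
    [0, 1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1, 0, 0],
    [1, 1, 0, 0, 0, 1, 1],
], dtype=float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # documents containing each term

# Smoothed IDF: add 1 to numerator and denominator, then add 1 to the log
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the raw counts, then scale each row to unit Euclidean (L2) length
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

print(np.round(manual_tfidf, 8))
```

With the smoothed variant, a term found in every document gets an IDF of 1 rather than 0, so it is down-weighted but not discarded. The row-wise L2 normalization makes scores comparable across documents of different lengths.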
