#### TF-IDF (Term Frequency-Inverse Document Frequency)
TF-IDF is a statistical measure used in **natural language processing (NLP)** and information retrieval to evaluate how important a word is to a document relative to a larger collection of documents.
1. Term Frequency (TF): measures how often a word appears in a document. A higher frequency suggests greater importance within that document.
   $$ TF(t, d) = \frac{\text{Number of times term t appears in document d}}{\text{Total number of terms in document d}} $$
2. Inverse Document Frequency (IDF): reduces the weight of words that are common across many documents and increases the weight of rare words. If a term appears in fewer documents, it is more likely to be meaningful and specific.
   $$ IDF(t, D) = \log\frac{\text{Total number of documents in corpus D}}{\text{Number of documents containing term t}} $$

   Here, a corpus is a collection of documents.

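The TF-IDF score of a term is the product of the two measures:

$$ TFIDF(t, d, D) = TF(t, d) \times IDF(t, D) $$

This balance lets TF-IDF highlight terms that are both frequent within a specific document and distinctive across the corpus.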
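To make the definitions concrete, here is a minimal from-scratch sketch in plain Python. It assumes the score is the product TF × IDF, uses the natural logarithm, and tokenizes by whitespace. The toy corpus and the helper names `tf`, `idf`, and `tf_idf` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus; each document is tokenized by whitespace.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs are pets".split(),
]

def tf(term, doc):
    """Share of the tokens in `doc` that equal `term`."""
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    """log(N / df); assumes `term` occurs in at least one document."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    """Product of term frequency and inverse document frequency."""
    return tf(term, doc) * idf(term, docs)

# Score every distinct term of the first document.
first = corpus[0]
scores = {term: tf_idf(term, first, corpus) for term in set(first)}
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term:>4}: {score:.4f}")
```

In this toy corpus, "cat" and "mat" score higher than "the" even though "the" occurs twice in the first document, because "the" also appears in a second document and so has a lower IDF.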

Typical tasks: search ranking, text classification, and keyword extraction.

##### Converting Text into Vectors

**Implementing TF-IDF in Python**

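```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three tiny example documents
d0 = 'Geeks for geeks'
d1 = 'Geeks'
d2 = 'r2j'
docs = [d0, d1, d2]

# Learn the vocabulary and IDF weights, then build the document-term matrix
tfidf = TfidfVectorizer()
result = tfidf.fit_transform(docs)

# IDF weight learned for each vocabulary term
print('\nidf values:')
for term, idf in zip(tfidf.get_feature_names_out(), tfidf.idf_):
    print(term, ':', idf)

# Mapping from each term to its column index
print('\nWord indexes:')
print(tfidf.vocabulary_)

# Sparse representation: (row, column) coordinates with their TF-IDF values
print('\ntf-idf value:')
print(result)

# Dense representation: one row per document, one column per term
print('\ntf-idf values in matrix form:')
print(result.toarray())
```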


```text
idf values:
for : 1.6931471805599454
geeks : 1.2876820724517808
r2j : 1.6931471805599454

Word indexes:
{'geeks': 1, 'for': 0, 'r2j': 2}

tf-idf value:
<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 4 stored elements and shape (3, 3)>
  Coords	Values
  (0, 1)	0.8355915419449176
  (0, 0)	0.5493512310263033
  (1, 1)	1.0
  (2, 2)	1.0

tf-idf values in matrix form:
[[0.54935123 0.83559154 0.        ]
 [0.         1.         0.        ]
 [0.         0.         1.        ]]
```
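These numbers do not match the plain formulas above, because `TfidfVectorizer` with its default settings (`smooth_idf=True`, `norm='l2'`) uses a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, multiplies it by the raw term count rather than a relative frequency, and then scales each document row to unit Euclidean length. The sketch below reproduces the printed values by hand using only the standard library. The document frequencies and counts are read off the three example documents, and the variable names are illustrative.

```python
import math

n_docs = 3
# Number of documents containing each term (after lowercasing).
doc_freq = {"for": 1, "geeks": 2, "r2j": 1}

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

# Raw term counts for d0 = 'Geeks for geeks'.
counts_d0 = {"for": 1, "geeks": 2}

# Unnormalized weights: raw count times smoothed IDF.
raw = {t: c * idf[t] for t, c in counts_d0.items()}

# L2-normalize the row so that its Euclidean length is 1.
norm = math.sqrt(sum(v * v for v in raw.values()))
row_d0 = {t: v / norm for t, v in raw.items()}

print(idf)
print(row_d0)
```

The L2 normalization also explains why the second and third documents each contain a single value of 1.0: a row with only one non-zero entry is scaled to exactly 1 regardless of its IDF.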
