**-----------------------------------------------------------------------------------------------------------------**

*In this lecture we are going to explore:*

1. Why Use TF-IDF (Term Frequency-Inverse Document Frequency) in NLP?
2. How Does TF-IDF Work?
3. Python implementation of TF-IDF in NLP.

**-----------------------------------------------------------------------------------------------------------------**

# 5.3 TF-IDF

* TF-IDF is a numerical statistic that reflects the importance of a word in a document.
* It is commonly used in NLP to represent the relevance of a term to a document or a corpus of documents.
* TF-IDF combines two factors: how often a word occurs in a document (term frequency, TF) and how rare the word is across all documents in the corpus (inverse document frequency, IDF).

![5.3_1_Tf IDF.png](attachment:f28963db-ddc5-41be-bd7d-e4bc5e412e94.png)

Below are the terms you’ll need to understand to create a TF-IDF model.

t: term (word).

d: document (set of words).

N: number of documents in the corpus.

corpus: the total document set.

* Term Frequency (TF): Measures how often a word appears in a document. A term that appears frequently in a document is likely relevant to that document’s content.

* Limitations of TF alone: TF does not account for how common a term is across the entire corpus. Common words like “the” or “and” may have high TF scores but do not help distinguish one document from another.

* Inverse Document Frequency (IDF): Reduces the weight of words that appear in many documents and increases the weight of words that appear in only a few.
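
The definitions above can be written in symbols. The following is one common textbook formulation, given here as an assumption because several variants exist (different logarithm bases, smoothing terms, or normalization). The symbols $f_{t,d}$ (the number of times term $t$ occurs in document $d$) and $\mathrm{df}(t)$ (the number of documents that contain $t$) are introduced for illustration.

$$
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t)}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$

Under this formulation, a term that occurs in every document has $\mathrm{df}(t) = N$, so $\mathrm{idf}(t) = \log 1 = 0$ and its TF-IDF score is zero regardless of how often it appears.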

![5.3_2_Tf IDF.png](attachment:97643481-2c9c-4c5f-98c3-bef02b9e1ace.png)

![5.3_3_Tf IDF.png](attachment:7b96548b-d25a-41cc-9dfb-94c3e83f5b9f.png)
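
To make the idea concrete before turning to scikit-learn, here is a minimal from-scratch sketch. It assumes the plain textbook weighting `tf * log(N / df)`, where `tf` is a term's count divided by the document length and `df` is the number of documents containing the term. The names `tf_idf`, `toy_corpus`, and `toy_scores` are hypothetical and chosen for illustration. Tokenization is a simple whitespace split.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {term: score} dict per document, using tf * log(N / df).

    Assumption: plain textbook weighting with whitespace tokenization,
    no smoothing and no normalization.
    """
    docs = [doc.lower().split() for doc in corpus]
    n_docs = len(docs)

    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))

    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical three-document corpus for illustration.
toy_corpus = ["the cat sat", "the dog barked", "the cat purred"]
toy_scores = tf_idf(toy_corpus)
```

In this toy corpus, "the" occurs in every document, so its score is zero. "sat" occurs in only one document, so it scores higher than "cat", which occurs in two.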

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
data = ['Most shark attacks occur about 10 feet from the beach since that is where the people are',
        'the efficiency with which he paired the socks in the drawer was quite admirable',
        'carol drank the blood as if she were a vampire',
        'giving directions that the mountains are to the west only works when you can see them',
        'the sign said there was road work ahead so he decided to speed up',
        'the gruff old man sat in the back of the bait shop grumbling to himself as he scooped out a handful of worms']

In [3]:
# Create the vectorizer with default settings (lowercases text, splits into word tokens)
tfidfvec = TfidfVectorizer()

In [4]:
# Learn the vocabulary and IDF weights, then return the sparse document-term matrix
tfidfvec_fit = tfidfvec.fit_transform(data)

In [5]:
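# Convert the sparse matrix to a DataFrame with one row per document and one column per vocabulary term
tfidf_bag = pd.DataFrame(tfidfvec_fit.toarray(), columns=tfidfvec.get_feature_names_out())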

In [6]:
print(tfidf_bag)

         10     about  admirable     ahead       are        as   attacks  \
0  0.257061  0.257061   0.000000  0.000000  0.210794  0.000000  0.257061   
1  0.000000  0.000000   0.293641  0.000000  0.000000  0.000000  0.000000   
2  0.000000  0.000000   0.000000  0.000000  0.000000  0.292313  0.000000   
3  0.000000  0.000000   0.000000  0.000000  0.222257  0.000000  0.000000   
4  0.000000  0.000000   0.000000  0.290766  0.000000  0.000000  0.000000   
5  0.000000  0.000000   0.000000  0.000000  0.000000  0.178615  0.000000   

      back     bait     beach  ...      were     west     when     where  \
0  0.00000  0.00000  0.257061  ...  0.000000  0.00000  0.00000  0.257061   
1  0.00000  0.00000  0.000000  ...  0.000000  0.00000  0.00000  0.000000   
2  0.00000  0.00000  0.000000  ...  0.356474  0.00000  0.00000  0.000000   
3  0.00000  0.00000  0.000000  ...  0.000000  0.27104  0.27104  0.000000   
4  0.00000  0.00000  0.000000  ...  0.000000  0.00000  0.00000  0.000000   
5  0.21782 