## Understanding TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF is a statistical measure used in natural language processing and information retrieval to score how important a word is to a document relative to a collection of documents (the corpus).

Unlike raw word frequency, TF-IDF down-weights words that are common across the whole corpus and up-weights words that are concentrated in a few documents, so the most distinctive terms stand out.



### How TF-IDF Works

TF-IDF combines two components: **Term Frequency (TF)** and **Inverse Document Frequency (IDF)**.

#### Term Frequency (TF):
Measures how often a word appears in a document. A term that appears frequently in a document is likely to be relevant to that document’s content.

**Formula:**

TF(term, document) = (Number of times term appears in document) / (Total number of terms in the document)
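The formula can be sketched in a few lines of Python. The function name `term_frequency` and the sample token list are illustrative assumptions, and the sketch presumes the text has already been lowercased and split into tokens.

```python
from collections import Counter

def term_frequency(term, tokens):
    """Return the share of `tokens` that equal `term` (hypothetical helper)."""
    if not tokens:
        return 0.0  # assumption: an empty document gives TF = 0
    return Counter(tokens)[term] / len(tokens)

# Illustrative, already-tokenized document
tokens = ["the", "cat", "sat", "on", "the", "mat"]
print(term_frequency("cat", tokens))  # 1 occurrence out of 6 tokens
print(term_frequency("the", tokens))  # 2 occurrences out of 6 tokens
```

Note that “the” scores higher than “cat” here, which is exactly the weakness of using TF on its own.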

#### Limitations of TF Alone:
- TF does not account for the global importance of a term across the entire corpus.
- Common words like “the” or “and” may have high TF scores but are not meaningful in distinguishing documents.



#### Inverse Document Frequency (IDF):
Reduces the weight of common words across multiple documents while increasing the weight of rare words. If a term appears in fewer documents, it is more likely to be meaningful and specific.

**Formula:**

IDF(term, corpus) = log(Total number of documents / Number of documents containing the term)

The logarithm dampens the ratio, so a term found in 1 of 1,000,000 documents is not weighted a thousand times more heavily than a term found in 1 of 1,000. A term that appears in every document gets IDF = log(1) = 0.
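A minimal sketch of the IDF formula follows. The helper name `inverse_document_frequency`, the `base` parameter, and the small sample corpus are assumptions made for illustration. The formula does not fix a logarithm base, so base 10 is used here as a default; returning 0 for a term that appears in no document is one possible convention, not part of the formula.

```python
import math

def inverse_document_frequency(term, tokenized_docs, base=10):
    """log(N / df) for `term` over a list of token lists (hypothetical helper)."""
    n_docs = len(tokenized_docs)
    n_containing = sum(1 for doc in tokenized_docs if term in doc)
    if n_containing == 0:
        return 0.0  # assumption: the formula is undefined here; 0 is one convention
    return math.log(n_docs / n_containing, base)

# Illustrative corpus of four tiny, already-tokenized documents
docs = [["red", "apple"], ["green", "apple"], ["ripe", "apple"], ["red", "car"]]
print(inverse_document_frequency("apple", docs))  # appears in 3 of 4 documents
print(inverse_document_frequency("car", docs))    # appears in 1 of 4 documents
```

The rarer term (“car”) receives the larger IDF, while the term present in most documents (“apple”) is pushed toward 0.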

#### Limitations of IDF Alone:
- IDF does not consider how often a term appears within a specific document.
- A term might be rare across the corpus (high IDF) but irrelevant in a specific document (low TF).



### Converting Text into Vectors with TF-IDF: Example

Imagine we have a corpus with three documents:

- **Document 1**: "The cat sat on the mat."
- **Document 2**: "The dog played in the park."
- **Document 3**: "Cats and dogs are great pets."

Our goal is to calculate the TF-IDF score for the word "cat" in these documents.

#### Step 1: Calculate Term Frequency (TF)
For Document 1:
- The word “cat” appears 1 time.
- Total number of terms in Document 1 is 6 (“the”, “cat”, “sat”, “on”, “the”, “mat”).
- TF(cat, Document 1) = 1/6

For Document 2:
- The word “cat” does not appear.
- TF(cat, Document 2) = 0

For Document 3:
- The word “cat” appears 1 time (as “cats”, which matches “cat” only after stemming or lemmatization).
- Total number of terms in Document 3 is 6 (“cats”, “and”, “dogs”, “are”, “great”, “pets”).
- TF(cat, Document 3) = 1/6

#### Step 2: Calculate Inverse Document Frequency (IDF)
- Total number of documents in the corpus (D): 3
- Number of documents containing the term “cat”: 2 (Document 1 and Document 3, again counting “cats” as “cat”).

Using a base-10 logarithm:

IDF(cat, D) = log(3 / 2) ≈ 0.176

#### Step 3: Calculate TF-IDF
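TF-IDF is the product of the two values, TF × IDF:

- TF-IDF(cat, Document 1) = 1/6 × 0.176 ≈ 0.029
- TF-IDF(cat, Document 2) = 0 × 0.176 = 0
- TF-IDF(cat, Document 3) = 1/6 × 0.176 ≈ 0.029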
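The three steps can be checked with a short script. This is a sketch under stated assumptions: the token lists are written out by hand (lowercased, with plurals reduced to the singular so that “cats” counts as “cat”), the helper names `tf` and `idf` are hypothetical, and the logarithm is base 10 to match the value 0.176 above.

```python
import math

# Hand-written tokens: lowercased, plurals reduced to singular (assumption)
docs = {
    "Document 1": ["the", "cat", "sat", "on", "the", "mat"],
    "Document 2": ["the", "dog", "played", "in", "the", "park"],
    "Document 3": ["cat", "and", "dog", "are", "great", "pet"],
}

def tf(term, tokens):
    """Occurrences of `term` divided by the document length."""
    return tokens.count(term) / len(tokens)

def idf(term, all_docs):
    """Base-10 log of (number of documents / documents containing `term`)."""
    containing = sum(1 for tokens in all_docs if term in tokens)
    return math.log10(len(all_docs) / containing)

idf_cat = idf("cat", list(docs.values()))
scores = {name: tf("cat", tokens) * idf_cat for name, tokens in docs.items()}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
# Document 1: 0.029
# Document 2: 0.000
# Document 3: 0.029
```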


### Why is TF-IDF Useful in This Example?

1. **Identifying Important Terms**: TF-IDF helps us understand that “cat” is important in Document 1 and Document 3 but irrelevant in Document 2. This helps in ranking documents for search engines.

2. **Filtering Common Words**: In a realistic corpus, words like “the” or “and” appear in almost every document, so their IDF is close to 0 and their TF-IDF stays small even when their TF is high. In this three-document toy corpus the effect is muted: “the” appears in two of the three documents, the same as “cat”.

3. **Highlighting Unique Terms**: The term “mat” appears only in Document 1, so its IDF is log(3 / 1) ≈ 0.477, higher than the 0.176 for “cat”. Its TF-IDF in Document 1 is therefore 1/6 × 0.477 ≈ 0.080, which marks it as more distinctive of that document than “cat”.



In [None]:
import nltk
nltk.download('stopwords')
nltk.download('punkt_tab')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [None]:
import re
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

class TextPreprocessor:
    def __init__(self):
        # Initialize the stemmer and stopwords
        self.stemmer = PorterStemmer()
        self.stop_words = set(stopwords.words('english'))

    def lowercase(self, text):
        """Convert text to lowercase"""
        return text.lower()

    def remove_punctuation(self, text):
        """Remove punctuation from the text"""
        return text.translate(str.maketrans('', '', string.punctuation))

    def remove_stopwords(self, text):
        """Remove common stop words from the text"""
        words = word_tokenize(text)
        return ' '.join([word for word in words if word not in self.stop_words])

    def tokenize(self, text):
        """Tokenize text into words"""
        return word_tokenize(text)

    def stemming(self, text):
        """Apply stemming to the text"""
        words = word_tokenize(text)
        return ' '.join([self.stemmer.stem(word) for word in words])

    def remove_numbers(self, text):
        """Remove numbers from the text"""
        return re.sub(r'\d+', '', text)

    def pre_process(self, text):
        """Pre-process the text by applying all steps"""
        text = self.lowercase(text)
        text = self.remove_punctuation(text)
        text = self.remove_stopwords(text)
        text = self.remove_numbers(text)
        text = self.stemming(text)
        return text


d0 = 'The cat sat on the mat.'
d1 = 'The dog played in the park'
d2 = 'Cats and dogs are great pets'

# Merge documents into a single corpus
corpus = [d0, d1, d2]

# Initialize the TextPreprocessor
preprocessor = TextPreprocessor()

transformed_corpus = list(map(preprocessor.pre_process, corpus))

In [None]:
corpus

['The cat sat on the mat.',
 'The dog played in the park',
 'Cats and dogs are great pets']

In [None]:
transformed_corpus

['cat sat mat', 'dog play park', 'cat dog great pet']

In [None]:
# Implementing TF-IDF with scikit-learn

# Import the required module
from sklearn.feature_extraction.text import TfidfVectorizer


# Create a TfidfVectorizer object
tfidf = TfidfVectorizer()

# Get tf-idf values
result = tfidf.fit_transform(transformed_corpus)

# Display IDF values
print("\nIDF values:")
for word, idf in zip(tfidf.get_feature_names_out(), tfidf.idf_):
    print(word, ":", idf)



IDF values:
cat : 1.2876820724517808
dog : 1.2876820724517808
great : 1.6931471805599454
mat : 1.6931471805599454
park : 1.6931471805599454
pet : 1.6931471805599454
play : 1.6931471805599454
sat : 1.6931471805599454
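These values differ from the hand-computed IDF of 0.176 for “cat” because `TfidfVectorizer`, with its default `smooth_idf=True`, uses the natural logarithm and a smoothed formula, ln((1 + N) / (1 + df)) + 1, rather than log(N / df). The sketch below reproduces the printed numbers with the standard library only. The `doc_freq` counts are read off the preprocessed corpus shown earlier, and the helper name `smoothed_idf` is hypothetical.

```python
import math

n_docs = 3
# Number of preprocessed documents that contain each term
doc_freq = {"cat": 2, "dog": 2, "great": 1, "mat": 1,
            "park": 1, "pet": 1, "play": 1, "sat": 1}

def smoothed_idf(df, n):
    """scikit-learn's default IDF: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n) / (1 + df)) + 1

idf_values = {term: smoothed_idf(df, n_docs) for term, df in doc_freq.items()}
for term, value in idf_values.items():
    print(term, ":", value)
```

The added 1 means a term that occurs in every document still keeps a non-zero weight, unlike the plain formula where its IDF would be 0.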


In [None]:
# get indexing
print('\nWord indexes:')
print(tfidf.vocabulary_)

# display tf-idf values
print('\ntf-idf value:')
print(result)

# in matrix form
print('\ntf-idf values in matrix form:')
print(result.toarray())


Word indexes:
{'cat': 0, 'sat': 7, 'mat': 3, 'dog': 1, 'play': 6, 'park': 4, 'great': 2, 'pet': 5}

tf-idf value:
  (0, 0)	0.4736296010332684
  (0, 7)	0.6227660078332259
  (0, 3)	0.6227660078332259
  (1, 1)	0.4736296010332684
  (1, 6)	0.6227660078332259
  (1, 4)	0.6227660078332259
  (2, 0)	0.4280460350631185
  (2, 1)	0.4280460350631185
  (2, 2)	0.5628290964997665
  (2, 5)	0.5628290964997665

tf-idf values in matrix form:
[[0.4736296  0.         0.         0.62276601 0.         0.
  0.         0.62276601]
 [0.         0.4736296  0.         0.         0.62276601 0.
  0.62276601 0.        ]
 [0.42804604 0.42804604 0.5628291  0.         0.         0.5628291
  0.         0.        ]]
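Each row of this matrix is the raw term count multiplied by the smoothed IDF, then scaled to unit length, since `TfidfVectorizer` applies L2 normalization by default (`norm='l2'`). The sketch below reproduces the first and third rows by hand. The helper names are hypothetical, and the counts and document frequencies are taken from the preprocessed corpus shown earlier.

```python
import math

def smoothed_idf(df, n_docs=3):
    """scikit-learn's default IDF: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + df)) + 1

def l2_normalize(weights):
    """Scale a {term: weight} mapping so its Euclidean norm is 1."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

# 'cat sat mat': every term occurs once; 'cat' is in 2 documents, the others in 1
row0 = l2_normalize({
    "cat": 1 * smoothed_idf(2),
    "sat": 1 * smoothed_idf(1),
    "mat": 1 * smoothed_idf(1),
})

# 'cat dog great pet': 'cat' and 'dog' are in 2 documents, the others in 1
row2 = l2_normalize({
    "cat": 1 * smoothed_idf(2),
    "dog": 1 * smoothed_idf(2),
    "great": 1 * smoothed_idf(1),
    "pet": 1 * smoothed_idf(1),
})

print(row0)
print(row2)
```

Because of the normalization, “cat” gets a slightly lower weight in the third document (four terms) than in the first (three terms), even though its count and IDF are the same in both.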


In [None]:
result.toarray().shape

(3, 8)

In [None]:
len(tfidf.vocabulary_)

8
