In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
import pandas as pd

In [2]:
# Sample text data (small corpus)
corpus = [
    "I love machine learning and deep learning",
    "Machine learning is fascinating and powerful",
    "Deep learning is a subset of machine learning",
]
texts = [' '.join([word for word in text.split() if word.lower() not in ENGLISH_STOP_WORDS]) for text in corpus]

print("Clean Corpus:")
texts

Clean Corpus:


['love machine learning deep learning',
 'Machine learning fascinating powerful',
 'Deep learning subset machine learning']

In [3]:
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the documents into a TF-IDF matrix
tfidf_matrix = tfidf_vectorizer.fit_transform(texts)

# Get the feature names (terms) from the vectorizer
feature_names = tfidf_vectorizer.get_feature_names_out()

# Convert the sparse TF-IDF matrix to a dense NumPy array
# (toarray() returns an ndarray; todense() returns the legacy np.matrix type)
tfidf_dense = tfidf_matrix.toarray()

In [4]:
# Displaying the results

# Vocabulary - maps each term to its column index in the TF-IDF matrix
print("\nVocabulary:")
for word, index in sorted(tfidf_vectorizer.vocabulary_.items(), key=lambda item: item[1]):
    print(f"{index}: {word}")


Vocabulary:
0: deep
1: fascinating
2: learning
3: love
4: machine
5: powerful
6: subset


In [5]:
# Feature names - the actual words corresponding to the columns in the matrix
print("\nFeature Names:")
print(feature_names)


Feature Names:
['deep' 'fascinating' 'learning' 'love' 'machine' 'powerful' 'subset']


In [6]:
texts

['love machine learning deep learning',
 'Machine learning fascinating powerful',
 'Deep learning subset machine learning']

In [7]:
# TF-IDF Matrix - dense representation of the TF-IDF weights for each word in the documents
pd.DataFrame(tfidf_dense, columns=feature_names)

       deep  fascinating  learning      love   machine  powerful    subset
0  0.417233     0.000000  0.648038  0.548612  0.324019  0.000000  0.000000
1  0.000000     0.608845  0.359594  0.000000  0.359594  0.608845  0.000000
2  0.417233     0.000000  0.648038  0.000000  0.324019  0.000000  0.548612


The rest of this notebook walks through how scikit-learn arrives at these TF-IDF (Term Frequency-Inverse Document Frequency) values, step by step, using the documents, vocabulary, and matrix above.

### **Vocabulary and Documents**

**Vocabulary Indexing:**

| Index | Term         |
|-------|--------------|
| 0     | deep         |
| 1     | fascinating  |
| 2     | learning     |
| 3     | love         |
| 4     | machine      |
| 5     | powerful     |
| 6     | subset       |

**Documents:**

1. Document 1: "love machine learning deep learning"
2. Document 2: "Machine learning fascinating powerful"
3. Document 3: "Deep learning subset machine learning"

### **Step 1: Tokenization and Term Frequency (TF)**

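**Tokenize** each document and count how often each vocabulary term occurs. `TfidfVectorizer` lowercases text by default (`lowercase=True`), so "Machine" and "machine" count as the same term. The raw count is the term frequency, tf(t, d).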

#### **Document 1:**

- Tokens: love, machine, learning, deep, learning
- Term Counts:

  | Term        | Count |
  |-------------|-------|
  | deep        | 1     |
  | fascinating | 0     |
  | learning    | 2     |
  | love        | 1     |
  | machine     | 1     |
  | powerful    | 0     |
  | subset      | 0     |

#### **Document 2:**

- Tokens: machine, learning, fascinating, powerful
- Term Counts:

  | Term        | Count |
  |-------------|-------|
  | deep        | 0     |
  | fascinating | 1     |
  | learning    | 1     |
  | love        | 0     |
  | machine     | 1     |
  | powerful    | 1     |
  | subset      | 0     |

#### **Document 3:**

- Tokens: deep, learning, subset, machine, learning
- Term Counts:

  | Term        | Count |
  |-------------|-------|
  | deep        | 1     |
  | fascinating | 0     |
  | learning    | 2     |
  | love        | 0     |
  | machine     | 1     |
  | powerful    | 0     |
  | subset      | 1     |

### **Step 2: Calculate Document Frequency (DF) and Inverse Document Frequency (IDF)**

**Document Frequency (DF):** Number of documents containing the term.

| Term        | DF (Number of Documents Containing the Term) |
|-------------|----------------------------------------------|
| deep        | 2                                            |
| fascinating | 1                                            |
| learning    | 3                                            |
| love        | 1                                            |
| machine     | 3                                            |
| powerful    | 1                                            |
| subset      | 1                                            |

**Total Number of Documents (N):** 3
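
As a rough check, the document frequencies can be derived from the Step 1 counts with NumPy. The count matrix is typed in by hand here (an assumption made to keep the snippet standalone), and the names `counts`, `n_docs`, and `df` are illustrative.

```python
import numpy as np

# Term counts from Step 1 (rows: documents, columns: vocabulary order)
counts = np.array([
    [1, 0, 2, 1, 1, 0, 0],
    [0, 1, 1, 0, 1, 1, 0],
    [1, 0, 2, 0, 1, 0, 1],
])

n_docs = counts.shape[0]       # N, the number of documents
df = (counts > 0).sum(axis=0)  # number of documents in which each term appears

print(n_docs)
print(df)
```

Note that DF counts documents, not occurrences: "learning" appears twice in Document 1 but contributes only 1 to its DF.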

**Inverse Document Frequency (IDF):**

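Using scikit-learn's default formula (`smooth_idf=True`), which adds 1 to both the numerator and the denominator, as if one extra document contained every term: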

$$
\text{idf}(t) = \ln\left(\frac{1 + N}{1 + \text{DF}(t)}\right) + 1
$$

Compute IDF for each term:

1. **deep:**

   $$
   \text{idf}(\text{deep}) = \ln\left(\frac{1 + 3}{1 + 2}\right) + 1 = \ln\left(\frac{4}{3}\right) + 1 \approx 1.28768
   $$


2. **fascinating, love, powerful, subset:**

   $$
   \text{idf} = \ln\left(\frac{1 + 3}{1 + 1}\right) + 1 = \ln(2) + 1 \approx 1.69315
   $$

3. **learning, machine:**

   $$
   \text{idf} = \ln\left(\frac{1 + 3}{1 + 3}\right) + 1 = \ln(1) + 1 = 1
   $$
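
The smoothed IDF formula can be evaluated for the whole vocabulary in one step. The sketch below hard-codes the document frequencies from the table above; `terms`, `df`, `n_docs`, and `idf` are illustrative names. A fitted `TfidfVectorizer` exposes the same values through its `idf_` attribute, which can be used for comparison.

```python
import numpy as np

terms = ["deep", "fascinating", "learning", "love", "machine", "powerful", "subset"]
df = np.array([2, 1, 3, 1, 3, 1, 1])  # document frequencies from the DF table
n_docs = 3                            # N

# Smoothed IDF: ln((1 + N) / (1 + DF)) + 1
idf = np.log((1 + n_docs) / (1 + df)) + 1

for term, value in zip(terms, idf):
    print(f"{term}: {value:.5f}")
```

Terms that occur in every document ("learning", "machine") get the minimum IDF of 1 rather than 0, so they are down-weighted but not discarded.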

### **Step 3: Calculate TF-IDF Scores**

For each term in each document, multiply the raw term count from Step 1 by the IDF from Step 2:

$$
\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t)
$$

#### **Document 1 TF-IDF Scores:**

| Term        | TF (Count) | IDF     | TF-IDF                         |
|-------------|------------|---------|--------------------------------|
| deep        | 1          | 1.28768 | $$1 \times 1.28768 = 1.28768$$ |
| fascinating | 0          | 1.69315 | $$0 \times 1.69315 = 0$$       |
| learning    | 2          | 1       | $$2 \times 1 = 2$$             |
| love        | 1          | 1.69315 | $$1 \times 1.69315 = 1.69315$$ |
| machine     | 1          | 1       | $$1 \times 1 = 1$$             |
| powerful    | 0          | 1.69315 | $$0 \times 1.69315 = 0$$       |
| subset      | 0          | 1.69315 | $$0 \times 1.69315 = 0$$       |

#### **Document 2 TF-IDF Scores:**

| Term        | TF (Count) | IDF     | TF-IDF                         |
|-------------|------------|---------|--------------------------------|
| deep        | 0          | 1.28768 | $$0 \times 1.28768 = 0$$       |
| fascinating | 1          | 1.69315 | $$1 \times 1.69315 = 1.69315$$ |
| learning    | 1          | 1       | $$1 \times 1 = 1$$             |
| love        | 0          | 1.69315 | $$0 \times 1.69315 = 0$$       |
| machine     | 1          | 1       | $$1 \times 1 = 1$$             |
| powerful    | 1          | 1.69315 | $$1 \times 1.69315 = 1.69315$$ |
| subset      | 0          | 1.69315 | $$0 \times 1.69315 = 0$$       |

#### **Document 3 TF-IDF Scores:**

| Term        | TF (Count) | IDF     | TF-IDF                         |
|-------------|------------|---------|--------------------------------|
| deep        | 1          | 1.28768 | $$1 \times 1.28768 = 1.28768$$ |
| fascinating | 0          | 1.69315 | $$0 \times 1.69315 = 0$$       |
| learning    | 2          | 1       | $$2 \times 1 = 2$$             |
| love        | 0          | 1.69315 | $$0 \times 1.69315 = 0$$       |
| machine     | 1          | 1       | $$1 \times 1 = 1$$             |
| powerful    | 0          | 1.69315 | $$0 \times 1.69315 = 0$$       |
| subset      | 1          | 1.69315 | $$1 \times 1.69315 = 1.69315$$ |

### **Step 4: Normalize TF-IDF Vectors**

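By default (`norm='l2'`), scikit-learn scales each document's TF-IDF vector to unit Euclidean length (**L2 normalization**). Intermediate values below are shown to five decimal places; the final weights are computed at full precision.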

**Formula for L2 Normalization:**

$$
\text{Normalized TF-IDF}(t, d) = \frac{\text{TF-IDF}(t, d)}{\sqrt{\sum_{t'} \text{TF-IDF}(t', d)^2}}
$$

#### **Document 1 Normalization:**

- **TF-IDF Vector Before Normalization:**
  $$
  [1.28768, 0, 2, 1.69315, 1, 0, 0]
  $$

- **Calculate L2 Norm:**
  $$
  \sqrt{(1.28768)^2 + 0^2 + 2^2 + (1.69315)^2 + 1^2 + 0^2 + 0^2} \approx 3.08624
  $$

- **Normalized TF-IDF Vector:**

  | Term        | Normalized TF-IDF                            |
  |-------------|----------------------------------------------|
  | deep        | $$\frac{1.28768}{3.08624} \approx 0.417233$$ |
  | fascinating | $$0$$                                        |
  | learning    | $$\frac{2}{3.08624} \approx 0.648038$$       |
  | love        | $$\frac{1.69315}{3.08624} \approx 0.548612$$ |
  | machine     | $$\frac{1}{3.08624} \approx 0.324019$$       |
  | powerful    | $$0$$                                        |
  | subset      | $$0$$                                        |

#### **Document 2 Normalization:**

- **TF-IDF Vector Before Normalization:**
  $$
  [0, 1.69315, 1, 0, 1, 1.69315, 0]
  $$

- **Calculate L2 Norm:**
  $$
  \sqrt{0^2 + (1.69315)^2 + 1^2 + 0^2 + 1^2 + (1.69315)^2 + 0^2} \approx 2.78092
  $$

- **Normalized TF-IDF Vector:**

  | Term        | Normalized TF-IDF                            |
  |-------------|----------------------------------------------|
  | deep        | $$0$$                                        |
  | fascinating | $$\frac{1.69315}{2.78092} \approx 0.608845$$ |
  | learning    | $$\frac{1}{2.78092} \approx 0.359594$$       |
  | love        | $$0$$                                        |
  | machine     | $$\frac{1}{2.78092} \approx 0.359594$$       |
  | powerful    | $$\frac{1.69315}{2.78092} \approx 0.608845$$ |
  | subset      | $$0$$                                        |

#### **Document 3 Normalization:**

- **TF-IDF Vector Before Normalization:**
  $$
  [1.28768, 0, 2, 0, 1, 0, 1.69315]
  $$

- **Calculate L2 Norm:**
  $$
  \sqrt{(1.28768)^2 + 0^2 + 2^2 + 0^2 + 1^2 + 0^2 + (1.69315)^2} \approx 3.08624
  $$

- **Normalized TF-IDF Vector:**

  | Term        | Normalized TF-IDF                            |
  |-------------|----------------------------------------------|
  | deep        | $$\frac{1.28768}{3.08624} \approx 0.417233$$ |
  | fascinating | $$0$$                                        |
  | learning    | $$\frac{2}{3.08624} \approx 0.648038$$       |
  | love        | $$0$$                                        |
  | machine     | $$\frac{1}{3.08624} \approx 0.324019$$       |
  | powerful    | $$0$$                                        |
  | subset      | $$\frac{1.69315}{3.08624} \approx 0.548612$$ |
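
The row-wise normalization can be sketched with NumPy as follows. The counts and document frequencies are hard-coded from the earlier steps (an assumption to keep the snippet standalone), and the names `raw`, `norms`, and `normalized` are illustrative.

```python
import numpy as np

# Term counts (rows: documents, columns: vocabulary order) and document frequencies
counts = np.array([
    [1, 0, 2, 1, 1, 0, 0],
    [0, 1, 1, 0, 1, 1, 0],
    [1, 0, 2, 0, 1, 0, 1],
])
df = np.array([2, 1, 3, 1, 3, 1, 1])
n_docs = counts.shape[0]

idf = np.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF
raw = counts * idf                         # unnormalized TF-IDF, one row per document

# L2 norm of each row; keepdims=True keeps shape (3, 1) so the division broadcasts row-wise
norms = np.linalg.norm(raw, axis=1, keepdims=True)
normalized = raw / norms

print(np.round(norms.ravel(), 5))
print(np.round(normalized, 6))
```

After this step every document vector has length 1, so a longer document does not get larger weights simply because it contains more words.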

### **Final TF-IDF Matrix:**

|                | deep     | fascinating | learning | love     | machine  | powerful | subset   |
|----------------|----------|-------------|----------|----------|----------|----------|----------|
| **Document 1** | 0.417233 | 0.000000    | 0.648038 | 0.548612 | 0.324019 | 0.000000 | 0.000000 |
| **Document 2** | 0.000000 | 0.608845    | 0.359594 | 0.000000 | 0.359594 | 0.608845 | 0.000000 |
| **Document 3** | 0.417233 | 0.000000    | 0.648038 | 0.000000 | 0.324019 | 0.000000 | 0.548612 |
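
To confirm that the manual procedure matches the library, the sketch below recomputes the matrix from raw counts and compares it with the output of a default `TfidfVectorizer`. The names `manual`, `reference`, and `matches` are illustrative, and `texts` repeats the cleaned corpus so the snippet is self-contained.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = [
    "love machine learning deep learning",
    "Machine learning fascinating powerful",
    "Deep learning subset machine learning",
]

# Manual pipeline: counts -> DF -> smoothed IDF -> TF-IDF -> L2 normalization
counts = CountVectorizer().fit_transform(texts).toarray()
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + df)) + 1
raw = counts * idf
manual = raw / np.linalg.norm(raw, axis=1, keepdims=True)

# Reference: scikit-learn with default settings
reference = TfidfVectorizer().fit_transform(texts).toarray()

matches = np.allclose(manual, reference)
print(matches)
```

If `matches` is `True`, the four steps above account for every value in the matrix.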

### **Summary of Formulas Used:**

1. **Inverse Document Frequency (IDF):**

   $$
   \text{idf}(t) = \ln\left(\frac{1 + N}{1 + \text{DF}(t)}\right) + 1
   $$

2. **Term Frequency-Inverse Document Frequency (TF-IDF):**

   $$
   \text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t)
   $$

3. **L2 Normalization:**

   $$
   \text{Normalized TF-IDF}(t, d) = \frac{\text{TF-IDF}(t, d)}{\sqrt{\sum_{t'} \text{TF-IDF}(t', d)^2}}
   $$

Following these steps reproduces scikit-learn's output: a row-normalized TF-IDF matrix in which each weight reflects how often a term occurs in a document, discounted by how common that term is across the corpus.
