# Task 2: TF-IDF Representation

## Overview
In this task, you will enhance the **Bag of Words (BoW) representation** by applying **Term Frequency-Inverse Document Frequency (TF-IDF) weighting**. This transformation helps improve text classification by reducing the influence of common words while highlighting more important terms.

## Why TF-IDF?

The **TF-IDF model** is an improvement over BoW. Instead of just counting word occurrences, it:

1. Assigns **higher weights** to words that appear frequently in a document but **rarely in other documents** (important keywords).
2. Assigns **lower weights** to words that appear in **many documents** (e.g., "the", "is", "and"), as they contribute less to distinguishing meaning.

## How TF-IDF Works

### **1. Term Frequency (TF)**

Term Frequency measures how often a term (word) appears in a document. Words that appear more frequently in a document are considered more important.

The formula for calculating TF is:

$$
\text{TF}(t) = \frac{\text{Number of times term } t \text{ appears in a document}}{\text{Total number of terms in the document}}
$$
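As a minimal sketch of the TF formula, the snippet below computes term frequencies for a single document. The function name `term_frequency` and the whitespace tokenization are assumptions made for illustration; they are not part of this task's pipeline.

```python
from collections import Counter


def term_frequency(tokens):
    """Return {term: count / total number of tokens} for one tokenized document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


# Hypothetical toy document, tokenized by splitting on whitespace
doc = "the cat sat on the mat".split()
tf = term_frequency(doc)
print(tf["the"])  # "the" accounts for 2 of the 6 tokens
print(tf["cat"])  # "cat" accounts for 1 of the 6 tokens
```

Because each count is divided by the document length, the TF values of one document sum to 1.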

### **2. Inverse Document Frequency (IDF)**

Inverse Document Frequency measures how important a term is across an entire corpus (collection of documents). Terms that appear in many documents have lower IDF scores because they are less specific to any single document.

The formula for IDF is:

$$
\text{IDF}(t) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing term } t}\right)
$$
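The basic IDF formula can be sketched in a few lines. The helper name `inverse_document_frequency`, the toy corpus, and the choice of the natural logarithm are assumptions for illustration; the formula above does not fix the log base.

```python
import math


def inverse_document_frequency(corpus):
    """Return {term: log(N / df)} for a list of tokenized documents."""
    n_docs = len(corpus)
    doc_freq = {}
    for tokens in corpus:
        # set() ensures each document is counted at most once per term
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


# Hypothetical toy corpus of three short documents
corpus = [
    "the cat sat".split(),
    "the dog barked".split(),
    "the cat purred".split(),
]
idf = inverse_document_frequency(corpus)
print(idf["the"])  # appears in all 3 documents, so log(3 / 3)
print(idf["cat"])  # appears in 2 documents, so log(3 / 2)
print(idf["dog"])  # appears in 1 document, so log(3 / 1)
```

The rarer a term is across the corpus, the larger its IDF; a term present in every document gets an IDF of zero under this basic formula.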

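With the basic formula, a term that appears in all documents gets an IDF of $\log(1) = 0$, which removes it from the representation entirely. To keep such weights positive, some implementations add 1 inside the logarithm:

$$
\text{IDF}(t) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing term } t} + 1\right)
$$

Other variants instead add 1 to the denominator, which guards against division by zero for a term that appears in no documents.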
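Smoothing conventions differ between libraries, so it is worth checking which one is in use. As one example, scikit-learn's `TfidfTransformer` (the class used later in this task) by default computes $\ln\left(\frac{1 + N}{1 + \text{df}(t)}\right) + 1$ and then L2-normalizes each row. The sketch below checks this on a small made-up count matrix; the matrix values are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical BoW counts: 3 documents (rows) x 3 terms (columns)
counts = np.array([
    [2, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
])

transformer = TfidfTransformer()  # defaults: smooth_idf=True, norm="l2"
tfidf = transformer.fit_transform(counts).toarray()

# Recompute the smoothed IDF by hand and compare with the fitted values
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
expected_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

print(np.allclose(transformer.idf_, expected_idf))
print(np.linalg.norm(tfidf, axis=1))  # each row has unit L2 norm
```

Because of the smoothing and the normalization, the values produced by scikit-learn will not match a hand calculation based on the basic formulas in this section.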

### **3. TF-IDF Score**

The TF-IDF score combines Term Frequency and Inverse Document Frequency to assign importance to terms. It gives higher importance to terms that are frequent in a specific document but rare across the corpus.

The formula for calculating TF-IDF is:

$$
\text{TF-IDF}(t) = \text{TF}(t) \times \text{IDF}(t)
$$
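Putting the two parts together, the following sketch computes unsmoothed TF-IDF scores for every document in a toy corpus. The function name `tf_idf`, the corpus, and the natural logarithm are assumptions for illustration, and the result will differ from scikit-learn's output, which applies its own smoothing and normalization.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {term: TF * IDF} dict per tokenized document in the corpus."""
    n_docs = len(corpus)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in corpus for term in set(tokens))
    scores = []
    for tokens in corpus:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical toy corpus of three short documents
corpus = [
    "the cat sat".split(),
    "the dog barked".split(),
    "the cat purred".split(),
]
scores = tf_idf(corpus)
print(scores[1]["the"])  # zero: "the" occurs in every document
print(scores[1]["dog"])  # positive: "dog" occurs only in this document
```

In the second document, "the" and "dog" have the same term frequency, but only "dog" receives a nonzero score because it is the term that distinguishes this document from the others.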

### **Why Use TF-IDF?**

TF-IDF is widely used to represent textual data as numerical vectors where each dimension corresponds to a word in the corpus. It helps identify significant words within a document while reducing the influence of commonly occurring words across all documents.

### **Applications of TF-IDF**

- Text classification
- Clustering
- Search engine indexing
- Information retrieval systems
- Document similarity analysis


In [1]:
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

In [2]:
tfidf_transformer = TfidfTransformer()

# Fit the IDF weights on the training BoW matrix and transform it to TF-IDF
X_train = np.load('data/bow_train.npy')
x_train_tfidf = tfidf_transformer.fit_transform(X_train).toarray()
np.save('data/tfidf_train.npy', x_train_tfidf)

In [3]:
# Transform the test BoW matrix using the IDF weights learned from the training set
X_test = np.load('data/bow_test.npy')
x_test_tfidf = tfidf_transformer.transform(X_test).toarray()
np.save('data/tfidf_test.npy', x_test_tfidf)

In [4]:
# Display the dimensions of both TF-IDF matrices
print(f"TF-IDF train shape: {x_train_tfidf.shape[0]} x {x_train_tfidf.shape[1]}")
print(f"TF-IDF test shape: {x_test_tfidf.shape[0]} x {x_test_tfidf.shape[1]}")

TF-IDF train shape: 11314 x 1000
TF-IDF test shape: 7532 x 1000
