# Manual Relevance Calculation Using TF-IDF

We have three documents and a query. We’ll use TF-IDF (term frequency-inverse document frequency) to weight each term, then compare the resulting vectors to measure how relevant each document is to the query.

## Documents and Query

**Documents:**
1. "Machine learning is a subset of artificial intelligence."
2. "Deep learning is a type of machine learning."
3. "Natural language processing is used in AI applications."

**Query:**  
"Tell me about machine learning."

## Step 1: Preprocess Text

To simplify, we’ll strip punctuation and remove common stopwords (such as "is," "a," "of," "in," "me," and "about") so that only meaningful terms remain.

After preprocessing, we have:

- **Document 1**: `["machine", "learning", "subset", "artificial", "intelligence"]`
- **Document 2**: `["deep", "learning", "type", "machine", "learning"]`
- **Document 3**: `["natural", "language", "processing", "used", "AI", "applications"]`
- **Query**: `["tell", "machine", "learning"]`
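
The preprocessing above can be sketched in Python as follows. The stopword set is an assumption chosen only to reproduce the token lists in this example, and `preprocess` is a hypothetical helper name. Note that this sketch lowercases every token, so "AI" becomes "ai".

```python
import re

# Assumed stopword list: just enough to reproduce the token lists above.
STOPWORDS = {"is", "a", "of", "in", "me", "about"}

def preprocess(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in STOPWORDS]

print(preprocess("Deep learning is a type of machine learning."))
# ['deep', 'learning', 'type', 'machine', 'learning']
print(preprocess("Tell me about machine learning."))
# ['tell', 'machine', 'learning']
```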

## Step 2: Calculate Term Frequencies (TF)

The **TF** of a term in a document is the number of times the term appears in the document divided by the total number of terms in that document.
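
As a minimal sketch of this definition, the hypothetical helper `term_frequencies` below divides each term's count by the length of the token list. It is shown on the preprocessed tokens of Document 2.

```python
from collections import Counter

def term_frequencies(tokens):
    """Return {term: count / total number of tokens} for one token list."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

doc2 = ["deep", "learning", "type", "machine", "learning"]
tf_doc2 = term_frequencies(doc2)
print(tf_doc2["learning"])  # 0.4
print(tf_doc2["machine"])   # 0.2
```

Terms that do not occur in a document are simply absent from the dictionary, which corresponds to a TF of 0.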

| Term          | Doc 1 TF                  | Doc 2 TF                  | Doc 3 TF                  | Query TF                |
|---------------|---------------------------|---------------------------|---------------------------|--------------------------|
| machine       | 1/5 = 0.20                | 1/5 = 0.20                | 0                         | 1/3 = 0.33               |
| learning      | 1/5 = 0.20                | 2/5 = 0.40                | 0                         | 1/3 = 0.33               |
| subset        | 1/5 = 0.20                | 0                         | 0                         | 0                        |
| artificial    | 1/5 = 0.20                | 0                         | 0                         | 0                        |
| intelligence  | 1/5 = 0.20                | 0                         | 0                         | 0                        |
| deep          | 0                         | 1/5 = 0.20                | 0                         | 0                        |
| type          | 0                         | 1/5 = 0.20                | 0                         | 0                        |
| natural       | 0                         | 0                         | 1/6 ≈ 0.17               | 0                        |
| language      | 0                         | 0                         | 1/6 ≈ 0.17               | 0                        |
| processing    | 0                         | 0                         | 1/6 ≈ 0.17               | 0                        |
| used          | 0                         | 0                         | 1/6 ≈ 0.17               | 0                        |
| AI            | 0                         | 0                         | 1/6 ≈ 0.17               | 0                        |
| applications  | 0                         | 0                         | 1/6 ≈ 0.17               | 0                        |
| tell          | 0                         | 0                         | 0                         | 1/3 = 0.33               |

## Step 3: Calculate Inverse Document Frequency (IDF)

The **IDF** of each term is calculated as:
$$\text{IDF} = \log \left( \frac{N + 1}{\text{DF} + 1} \right) + 1$$
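Where:
- N is the total number of documents (here, N = 3).
- DF is the number of documents containing the term.
- log is the natural logarithm.

Adding 1 to both the numerator and the denominator avoids division by zero for terms, such as "tell," that appear in no document.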
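
The formula can be checked with a short Python sketch. The function name `smoothed_idf` is hypothetical, and the natural logarithm (`math.log`) is assumed.

```python
import math

# Preprocessed documents from Step 1.
docs = [
    ["machine", "learning", "subset", "artificial", "intelligence"],
    ["deep", "learning", "type", "machine", "learning"],
    ["natural", "language", "processing", "used", "AI", "applications"],
]

def smoothed_idf(term, tokenized_docs):
    """IDF = ln((N + 1) / (DF + 1)) + 1, with DF counted over the documents."""
    n_docs = len(tokenized_docs)
    doc_freq = sum(1 for doc in tokenized_docs if term in doc)
    return math.log((n_docs + 1) / (doc_freq + 1)) + 1

for term in ("machine", "subset", "tell"):
    print(term, round(smoothed_idf(term, docs), 2))
# machine 1.29
# subset 1.69
# tell 2.39
```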



Using this formula, we get the following IDF values:

| Term          | DF (Documents with Term) | IDF                      |
|---------------|--------------------------|---------------------------|
| machine       | 2                        | log((3 + 1) / (2 + 1)) + 1 ≈ 1.29 |
| learning      | 2                        | log((3 + 1) / (2 + 1)) + 1 ≈ 1.29 |
| subset        | 1                        | log((3 + 1) / (1 + 1)) + 1 ≈ 1.69 |
| artificial    | 1                        | log((3 + 1) / (1 + 1)) + 1 ≈ 1.69 |
| intelligence  | 1                        | log((3 + 1) / (1 + 1)) + 1 ≈ 1.69 |
| deep          | 1                        | log((3 + 1) / (1 + 1)) + 1 ≈ 1.69 |
| type          | 1                        | log((3 + 1) / (1 + 1)) + 1 ≈ 1.69 |
| natural       | 1                        | log((3 + 1) / (1 + 1)) + 1 ≈ 1.69 |
| language      | 1                        | log((3 + 1) / (1 + 1)) + 1 ≈ 1.69 |
| processing    | 1                        | log((3 + 1) / (1 + 1)) + 1 ≈ 1.69 |
| used          | 1                        | log((3 + 1) / (1 + 1)) + 1 ≈ 1.69 |
| AI            | 1                        | log((3 + 1) / (1 + 1)) + 1 ≈ 1.69 |
| applications  | 1                        | log((3 + 1) / (1 + 1)) + 1 ≈ 1.69 |
| tell          | 0                        | log((3 + 1) / (0 + 1)) + 1 ≈ 2.39 |

## Step 4: Calculate TF-IDF for Each Term

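For each term in each document, and in the query, multiply the TF value by the corresponding IDF value to get the **TF-IDF** score. A term that does not occur in a text has a TF of 0, so its TF-IDF score there is also 0.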
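
A minimal sketch of this multiplication is shown below. It assumes the preprocessed tokens from Step 1 and the natural logarithm in the IDF formula. The helper names `tf`, `idf`, and `tfidf` are hypothetical.

```python
import math
from collections import Counter

docs = {
    "Document 1": ["machine", "learning", "subset", "artificial", "intelligence"],
    "Document 2": ["deep", "learning", "type", "machine", "learning"],
    "Document 3": ["natural", "language", "processing", "used", "AI", "applications"],
}
query = ["tell", "machine", "learning"]

def tf(tokens):
    """Term count divided by the number of tokens in the text."""
    total = len(tokens)
    return {term: count / total for term, count in Counter(tokens).items()}

def idf(term, corpus):
    """Smoothed IDF: ln((N + 1) / (DF + 1)) + 1."""
    doc_freq = sum(1 for doc in corpus if term in doc)
    return math.log((len(corpus) + 1) / (doc_freq + 1)) + 1

def tfidf(tokens, corpus):
    """Sparse TF-IDF vector: only terms present in `tokens` get an entry."""
    return {term: freq * idf(term, corpus) for term, freq in tf(tokens).items()}

corpus = list(docs.values())  # IDF is computed over the documents only
doc_vectors = {name: tfidf(tokens, corpus) for name, tokens in docs.items()}
query_vector = tfidf(query, corpus)

print(round(doc_vectors["Document 2"]["learning"], 2))  # 0.52
print(round(query_vector["tell"], 2))                   # 0.8
```

Storing each vector as a dictionary keeps only the non-zero entries, which is convenient when most terms are absent from any given text.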

## Step 5: Measure Similarity with Cosine Similarity

Once we have the TF-IDF vectors for each document and the query, we can calculate the cosine similarity between the query vector and each document vector. The document with the highest similarity score will be the most relevant to the query.
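
Cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below applies it to the example. The weights are an assumption in the sense that they are TF × IDF values from Steps 2 and 3, computed with the natural logarithm and rounded to three decimals. The function name `cosine` is hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

# TF x IDF weights, rounded to three decimals (natural log assumed).
query_vec = {"tell": 0.795, "machine": 0.429, "learning": 0.429}
doc_vecs = {
    "Document 1": {"machine": 0.258, "learning": 0.258, "subset": 0.339,
                   "artificial": 0.339, "intelligence": 0.339},
    "Document 2": {"deep": 0.339, "learning": 0.515, "type": 0.339,
                   "machine": 0.258},
    "Document 3": {"natural": 0.282, "language": 0.282, "processing": 0.282,
                   "used": 0.282, "AI": 0.282, "applications": 0.282},
}

scores = {name: cosine(query_vec, vec) for name, vec in doc_vecs.items()}
for name, score in scores.items():
    print(name, round(score, 2))
# Document 1 0.32
# Document 2 0.44
# Document 3 0.0

best = max(scores, key=scores.get)
print(best)  # Document 2
```

Document 2 ranks first because "learning" occurs twice in it, and Document 3 scores 0 because it shares no terms with the query.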


In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [2]:

documents = [
    "Machine learning is a subset of artificial intelligence.",
    "Deep learning is a type of machine learning.",
    "Natural language processing is used in AI applications.",
]


In [3]:
# User query
query = "Tell me about machine learning."


In [4]:
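# Fit on the query plus the documents so all four texts share one vocabulary.
# Note: the default TfidfVectorizer keeps stopwords, uses raw counts for TF,
# counts the query as a fourth document for IDF, and L2-normalizes each row,
# so its weights differ from the manual tables above.
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform([query] + documents)

# Calculate cosine similarity between the query (row 0) and the documents (remaining rows)
cosine_similarities = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:]).flatten()

# Pick the document with the highest similarity score
most_similar_document = documents[cosine_similarities.argmax()]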


In [5]:
# Print the most relevant document
print("Most Relevant Document:", most_similar_document)

Most Relevant Document: Deep learning is a type of machine learning.
