# Bag-of-Words Model (BoW) - Manual Calculation

We manually calculate the similarity between the query and each document using the Bag-of-Words (BoW) model and cosine similarity.

## Step 1: Define Documents and Query

**Documents:**
1. "Machine learning is a subset of artificial intelligence."
2. "Deep learning is a type of machine learning."
3. "Natural language processing is used in AI applications."

**Query:**  
"Tell me about machine learning."

## Step 2: Create Vocabulary

We build the vocabulary from the unique words in all documents and the query, ignoring case and punctuation and omitting the function words `is`, `a`, `of`, and `in`:
['machine', 'learning', 'subset', 'artificial', 'intelligence', 'deep', 'type', 'natural', 'language', 'processing', 'used', 'AI', 'applications', 'tell', 'me', 'about']
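
The vocabulary step can be sketched in a few lines of Python. This is a minimal illustration, not the method used to produce the list above: the `STOP_WORDS` set and the helper names `tokenize` and `build_vocabulary` are assumptions, chosen so that the result has the same 16 terms in the same order (all lowercased, so `AI` appears as `ai`).

```python
import re

# Assumed stop list: the four function words missing from the manual vocabulary.
STOP_WORDS = {"is", "a", "of", "in"}

def tokenize(text):
    """Lowercase the text, keep alphabetic runs, and drop the stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def build_vocabulary(texts):
    """Collect unique tokens in order of first appearance."""
    vocabulary = []
    for text in texts:
        for token in tokenize(text):
            if token not in vocabulary:
                vocabulary.append(token)
    return vocabulary

documents = [
    "Machine learning is a subset of artificial intelligence.",
    "Deep learning is a type of machine learning.",
    "Natural language processing is used in AI applications.",
]
query = "Tell me about machine learning."

vocabulary = build_vocabulary(documents + [query])
print(len(vocabulary))  # 16
print(vocabulary)
```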



## Step 3: Create Bag-of-Words Vectors

Count the occurrence of each term from the vocabulary in each document and the query:

| Term          | Doc 1 | Doc 2 | Doc 3 | Query |
|---------------|-------|-------|-------|-------|
| machine       | 1     | 1     | 0     | 1     |
| learning      | 1     | 2     | 0     | 1     |
| subset        | 1     | 0     | 0     | 0     |
| artificial    | 1     | 0     | 0     | 0     |
| intelligence  | 1     | 0     | 0     | 0     |
| deep          | 0     | 1     | 0     | 0     |
| type          | 0     | 1     | 0     | 0     |
| natural       | 0     | 0     | 1     | 0     |
| language      | 0     | 0     | 1     | 0     |
| processing    | 0     | 0     | 1     | 0     |
| used          | 0     | 0     | 1     | 0     |
| AI            | 0     | 0     | 1     | 0     |
| applications  | 0     | 0     | 1     | 0     |
| tell          | 0     | 0     | 0     | 1     |
| me            | 0     | 0     | 0     | 1     |
| about         | 0     | 0     | 0     | 1     |
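
The columns of this table can be reproduced by counting, for each text, how often every vocabulary term occurs. The sketch below assumes the lowercased Step 2 vocabulary; `bow_vector` is a hypothetical helper name. Words outside the vocabulary (`is`, `a`, `of`, `in`) are never looked up, so they do not contribute.

```python
import re
from collections import Counter

# Step 2 vocabulary, lowercased so that lookups are case-insensitive.
VOCABULARY = [
    "machine", "learning", "subset", "artificial", "intelligence", "deep",
    "type", "natural", "language", "processing", "used", "ai",
    "applications", "tell", "me", "about",
]

def bow_vector(text, vocabulary=VOCABULARY):
    """Return the count of each vocabulary term in the text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[term] for term in vocabulary]

doc2 = "Deep learning is a type of machine learning."
query = "Tell me about machine learning."

print(bow_vector(doc2))   # [1, 2, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(bow_vector(query))  # [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
```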

## Step 4: Calculate Cosine Similarity

The cosine similarity between the query vector \( A \) and a document vector \( B \) is their dot product divided by the product of their magnitudes:

$$
\text{Cosine Similarity} = \frac{A \cdot B}{\|A\| \|B\|}
$$
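
Written out in components for two count vectors over the same vocabulary of \( n \) terms, the formula becomes:

$$
\frac{A \cdot B}{\|A\| \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}
$$

As a small illustration with made-up vectors (not taken from the documents above), let \( A = (1, 1, 0) \) and \( B = (1, 2, 1) \):

$$
\frac{1 \cdot 1 + 1 \cdot 2 + 0 \cdot 1}{\sqrt{1^2 + 1^2 + 0^2} \, \sqrt{1^2 + 2^2 + 1^2}} = \frac{3}{\sqrt{2} \, \sqrt{6}} \approx 0.866
$$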

### Cosine Similarity Calculations

#### Document 1
- **Dot Product:** \( 1 \cdot 1 + 1 \cdot 1 = 2 \) (shared terms: `machine`, `learning`)
- **Magnitudes:** \( \|A\| = \sqrt{5} \approx 2.24 \) (query), \( \|B\| = \sqrt{5} \approx 2.24 \) (document)
- **Similarity:** 
$$
\frac{2}{\sqrt{5} \times \sqrt{5}} = 0.400
$$

#### Document 2
- **Dot Product:** \( 1 \cdot 1 + 1 \cdot 2 = 3 \) (`learning` occurs twice in the document)
- **Magnitudes:** \( \|A\| = \sqrt{5} \approx 2.24 \), \( \|B\| = \sqrt{7} \approx 2.65 \)
- **Similarity:** 
$$
\frac{3}{\sqrt{5} \times \sqrt{7}} \approx 0.507
$$

#### Document 3
- **Dot Product:** \( 0 \) (no shared terms)
- **Magnitudes:** \( \|A\| = \sqrt{5} \approx 2.24 \), \( \|B\| = \sqrt{6} \approx 2.45 \)
- **Similarity:** 
$$
\frac{0}{\sqrt{5} \times \sqrt{6}} = 0
$$

## Summary of Similarity Scores

| Document       | Cosine Similarity with Query |
|----------------|------------------------------|
| Document 1     | 0.400                        |
| Document 2     | 0.507                        |
| Document 3     | 0.000                        |

## Conclusion

The query is most similar to **Document 2** (cosine similarity 0.507), followed by **Document 1** (0.400). Document 3 shares no vocabulary terms with the query, so its similarity is 0.
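
As a quick check on the arithmetic, the scores can be recomputed directly from the count vectors in the Step 3 table. The sketch below uses only the standard library; the function name `cosine` and the dictionary layout are illustrative choices, and the vectors are copied by hand from the table columns.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention: an all-zero vector is similar to nothing
    return dot / (norm_a * norm_b)

# Columns of the Step 3 table, in vocabulary order.
doc_vectors = {
    "Document 1": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "Document 2": [1, 2, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "Document 3": [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0],
}
query_vector = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

scores = {name: cosine(query_vector, vec) for name, vec in doc_vectors.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
# Document 1: 0.400
# Document 2: 0.507
# Document 3: 0.000
```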


## Verification with scikit-learn

The same comparison can be run with scikit-learn's `CountVectorizer` and `cosine_similarity`:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Define the documents and the query
documents = [
    "Machine learning is a subset of artificial intelligence.",
    "Deep learning is a type of machine learning.",
    "Natural language processing is used in AI applications."
]
query = "Tell me about machine learning."

# Combine documents and query into a single list
all_texts = documents + [query]

# Initialize CountVectorizer to convert texts into BoW vectors
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(all_texts)

# Split the BoW matrix into document vectors and query vector
doc_vectors = bow_matrix[:-1]  # All rows except the last one
query_vector = bow_matrix[-1]  # The last row (query vector)

# Calculate cosine similarity between query and each document
cosine_similarities = cosine_similarity(query_vector, doc_vectors).flatten()

# Print similarity scores
for i, score in enumerate(cosine_similarities):
    print(f"Similarity between query and document {i+1}: {score}")
```

```
Similarity between query and document 1: 0.3380617018914066
Similarity between query and document 2: 0.44721359549995787
Similarity between query and document 3: 0.0
```

These scores differ from the manual calculation because `CountVectorizer` with default settings keeps `is`, `of`, and `in` as vocabulary terms; its default token pattern drops only the single-character token `a`. The document vectors therefore have larger magnitudes: \( \sqrt{7} \) for Document 1 and \( 3 \) for Document 2, which gives \( 2 / (\sqrt{5} \times \sqrt{7}) \approx 0.338 \) and \( 3 / (\sqrt{5} \times 3) \approx 0.447 \). The ranking of the documents is unchanged.
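
One way to make scikit-learn use the Step 2 vocabulary is to pass the four omitted words as a custom `stop_words` list. This is a sketch under that assumption, not part of the original notebook run; the list itself is hypothetical and was chosen only to mirror the manual vocabulary. With it, the scores are computed from the same count vectors as the Step 3 table.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Machine learning is a subset of artificial intelligence.",
    "Deep learning is a type of machine learning.",
    "Natural language processing is used in AI applications.",
]
query = "Tell me about machine learning."

# Assumed stop list mirroring the words left out of the manual vocabulary.
# ("a" is already dropped by the default token pattern; listed for clarity.)
manual_stop_words = ["is", "a", "of", "in"]

vectorizer = CountVectorizer(stop_words=manual_stop_words)
bow_matrix = vectorizer.fit_transform(documents + [query])

# Last row is the query; the rows before it are the documents.
scores = cosine_similarity(bow_matrix[-1], bow_matrix[:-1]).flatten()

print(sorted(vectorizer.vocabulary_))  # the 16 Step 2 terms, lowercased
for i, score in enumerate(scores, start=1):
    print(f"Document {i}: {score:.3f}")
# Document 1: 0.400
# Document 2: 0.507
# Document 3: 0.000
```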
