
## Cosine similarity
Cosine similarity measures how closely two vectors point in the same direction, regardless of their magnitude: it is the cosine of the angle between them in a multi-dimensional space. It is commonly used in natural language processing, information retrieval, and computer vision.

Mathematically, cosine similarity between two vectors A and B is defined as:
$$ \text{Cosine Similarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|} $$

Where:
- A⋅B is the dot product of the vectors.
- ∥A∥ and ∥B∥ are the magnitudes (Euclidean norms) of the vectors.

The result ranges from −1 to 1:

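- 1 means the vectors point in exactly the same direction.
- 0 means the vectors are perpendicular (orthogonal) and share no similarity.
- −1 means the vectors point in exactly opposite directions.

Word counts and TF-IDF weights are never negative, so the similarities computed below always fall between 0 and 1.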
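
The formula and the three reference values above can be checked with a few lines of NumPy. This is a minimal sketch: the helper name `cosine_sim` and the sample vectors are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity of two 1-D vectors: (a . b) / (||a|| * ||b||)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

same_direction = cosine_sim([1, 2], [2, 4])   # parallel, different magnitudes
perpendicular = cosine_sim([1, 0], [0, 1])    # orthogonal
opposite = cosine_sim([1, 2], [-1, -2])       # opposite directions

print(same_direction, perpendicular, opposite)
```

The three results come out at approximately 1, 0, and −1 (up to floating-point rounding). A zero vector has no direction, so the formula is undefined for it and the sketch does not handle that case.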

### Step 1: Import libraries and read data

In [1]:
# Import required libraries
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [6]:
# Define the sample documents
documents = [
    "The pedestrian detection model performs well on the validation set.",
    "Pedestrian detection is a key task in computer vision projects.",
    "The SSD model is trained on the CityPersons dataset for pedestrian detection."
]

documents

['The pedestrian detection model performs well on the validation set.',
 'Pedestrian detection is a key task in computer vision projects.',
 'The SSD model is trained on the CityPersons dataset for pedestrian detection.']

### Step 2: Preprocess Text

In [7]:
# Lowercase the text (punctuation is left in place; the vectorizers' default tokenizer ignores it)
clean_texts = [text.lower() for text in documents]

clean_texts

['the pedestrian detection model performs well on the validation set.',
 'pedestrian detection is a key task in computer vision projects.',
 'the ssd model is trained on the citypersons dataset for pedestrian detection.']
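
The cell above only lowercases the text, and the trailing periods remain in the output. If explicit punctuation removal is wanted, one possible sketch using only the standard library follows. The names `strip_punctuation` and `sample_texts` are hypothetical. The later cells do not depend on this step, because `CountVectorizer` and `TfidfVectorizer` treat punctuation as a token separator in their default tokenization.

```python
import string

def strip_punctuation(text):
    """Lowercase the text and delete ASCII punctuation characters."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table)

sample_texts = [
    "The pedestrian detection model performs well on the validation set.",
    "Pedestrian detection is a key task in computer vision projects.",
]
cleaned = [strip_punctuation(t) for t in sample_texts]
print(cleaned)
```

Note that this approach also deletes hyphens, so a token such as `TF-IDF` would become `tfidf`. Whether that is desirable depends on the corpus.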

### Step 3: Convert Text into Vectors
- Bag of Words (BoW)
- TF-IDF

In [9]:
# Convert Text to Vectors using Bag of Words (BOW)
vectorizer_bow = CountVectorizer()
X_bow = vectorizer_bow.fit_transform(clean_texts)

print(X_bow, "\n")

# Convert Text to Vectors using TF-IDF
vectorizer_tfidf = TfidfVectorizer()
X_tfidf = vectorizer_tfidf.fit_transform(clean_texts)

print(X_tfidf)

  (0, 16)	2
  (0, 10)	1
  (0, 3)	1
  (0, 8)	1
  (0, 11)	1
  (0, 20)	1
  (0, 9)	1
  (0, 18)	1
  (0, 13)	1
  (1, 10)	1
  (1, 3)	1
  (1, 6)	1
  (1, 7)	1
  (1, 15)	1
  (1, 5)	1
  (1, 1)	1
  (1, 19)	1
  (1, 12)	1
  (2, 16)	2
  (2, 10)	1
  (2, 3)	1
  (2, 8)	1
  (2, 9)	1
  (2, 6)	1
  (2, 14)	1
  (2, 17)	1
  (2, 0)	1
  (2, 2)	1
  (2, 4)	1 

  (0, 16)	0.5322120451441776
  (0, 10)	0.20665506514773163
  (0, 3)	0.20665506514773163
  (0, 8)	0.2661060225720888
  (0, 11)	0.34989744090331365
  (0, 20)	0.34989744090331365
  (0, 9)	0.2661060225720888
  (0, 18)	0.34989744090331365
  (0, 13)	0.34989744090331365
  (1, 10)	0.2189562389630259
  (1, 3)	0.2189562389630259
  (1, 6)	0.28194602356415654
  (1, 7)	0.37072513866625695
  (1, 15)	0.37072513866625695
  (1, 5)	0.37072513866625695
  (1, 1)	0.37072513866625695
  (1, 19)	0.37072513866625695
  (1, 12)	0.37072513866625695
  (2, 16)	0.48721503551795814
  (2, 10)	0.1891829691277319
  (2, 3)	0.1891829691277319
  (2, 8)	0.24360751775897907
  (2, 9)	0.24360751775

### Step 4: Compute Cosine Similarity

In [10]:
# BOW Cosine Similarity
cosine_sim_bow = cosine_similarity(X_bow, X_bow)

print(cosine_sim_bow, "\n")

# TF-IDF Cosine Similarity
cosine_sim_tfidf = cosine_similarity(X_tfidf, X_tfidf)

print(cosine_sim_tfidf)

[[1.         0.19245009 0.6172134 ]
 [0.19245009 1.         0.26726124]
 [0.6172134  0.26726124 1.        ]] 

[[1.         0.09049683 0.4671438 ]
 [0.09049683 1.         0.15152975]
 [0.4671438  0.15152975 1.        ]]
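
As a check on the Bag of Words matrix, the entry for the first and second documents can be reproduced by hand from the counts printed in Step 3. The notation $d_0$ and $d_1$ for the two count vectors is introduced here for illustration. The first document contains `the` twice and eight other words once. The second contains nine words once. The only shared words are `pedestrian` and `detection`:

$$ \text{Cosine Similarity}(d_0, d_1) = \frac{1 \cdot 1 + 1 \cdot 1}{\sqrt{2^2 + 8 \cdot 1^2} \, \sqrt{9 \cdot 1^2}} = \frac{2}{\sqrt{12} \cdot 3} \approx 0.192 $$

The same two shared words occur in all three documents, so TF-IDF assigns them its lowest weights. That is consistent with the TF-IDF similarity for this pair dropping to about 0.09.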


### Step 5: Interpretation & Application

In [11]:
# Print both similarity matrices rounded to two decimals for easier reading
print("Cosine Similarity (Bag of Words):")
print(np.round(cosine_sim_bow, 2))

print("\nCosine Similarity (TF-IDF):")
print(np.round(cosine_sim_tfidf, 2))

Cosine Similarity (Bag of Words):
[[1.   0.19 0.62]
 [0.19 1.   0.27]
 [0.62 0.27 1.  ]]

Cosine Similarity (TF-IDF):
[[1.   0.09 0.47]
 [0.09 1.   0.15]
 [0.47 0.15 1.  ]]
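
To read off which documents are closest, the off-diagonal entries can be ranked. The following is a minimal sketch with the rounded Bag of Words values copied from the output above. The names `sim`, `pairs`, and `ranked_pairs` are illustrative.

```python
import numpy as np

# Rounded Bag of Words similarities from the output above
sim = np.array([
    [1.00, 0.19, 0.62],
    [0.19, 1.00, 0.27],
    [0.62, 0.27, 1.00],
])

# Upper triangle without the diagonal: each document pair exactly once
rows, cols = np.triu_indices_from(sim, k=1)
pairs = [(int(i), int(j), float(sim[i, j])) for i, j in zip(rows, cols)]

# Most similar pair first
ranked_pairs = sorted(pairs, key=lambda p: p[2], reverse=True)
for i, j, score in ranked_pairs:
    print(f"doc {i} vs doc {j}: {score:.2f}")
```

The first and third documents (indices 0 and 2) rank highest at 0.62, and the first and second rank lowest at 0.19. The TF-IDF matrix gives the same ordering (0.47, 0.15, 0.09) with lower scores overall.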
