# Similarity Measures

In vector databases, similarity measures determine how close or related two vectors are. When working with embeddings, whether they represent words, products, or images, we need a way to quantify the distance or similarity between vectors in order to retrieve similar items or cluster related objects. Two common measures are **Euclidean Distance** and **Cosine Similarity**. Each is suited to different kinds of data and use cases.

### 1. Euclidean Distance

#### Definition
Euclidean distance is the straight-line distance between two points in a multi-dimensional space. It generalizes the familiar distance formula for two points on a 2D plane to any number of dimensions.

#### Mathematical Formula
For two vectors **A** and **B** with `n` dimensions, the Euclidean distance is calculated as:

$$
d(A, B) = \sqrt{(A_1 - B_1)^2 + (A_2 - B_2)^2 + \dots + (A_n - B_n)^2}
$$
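As a worked instance of the formula, take the vectors $A = (1, 2, 3)$ and $B = (4, 6, 8)$, the same values used in the Python example in this section:

$$
d(A, B) = \sqrt{(1 - 4)^2 + (2 - 6)^2 + (3 - 8)^2} = \sqrt{9 + 16 + 25} = \sqrt{50} \approx 7.071
$$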

#### When to Use It
Use Euclidean distance when the magnitude of the vectors matters, that is, when you want to know how far apart two items are in a continuous space. For example, it is often used in image recognition and in clustering algorithms like k-means, where the geometric distance between points carries meaning.
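To make the k-means mention concrete, here is a minimal sketch of the assignment step only, where each point is attached to its closest centroid by Euclidean distance. It uses the standard library's `math.dist` (Python 3.8+). The helper name `nearest_centroid` and the sample data are illustrative assumptions, not part of any library.

```python
import math

def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

# Hypothetical 2-D data: two cluster centers and three points to assign.
centroids = [[0.0, 0.0], [10.0, 10.0]]
points = [[1.0, 2.0], [9.0, 8.0], [4.0, 4.0]]

assignments = [nearest_centroid(p, centroids) for p in points]
print(assignments)  # [0, 1, 0]
```

A full k-means implementation would alternate this step with recomputing each centroid as the mean of its assigned points. That part is omitted here.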


#### Python Function (from Scratch)

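```python
import math

def euclidean_distance(A, B):
    """Return the straight-line distance between vectors A and B."""
    if len(A) != len(B):
        raise ValueError("Vectors must be of same length.")
    # Sum the squared per-dimension differences, then take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(A, B)))

# Example Usage
A = [1, 2, 3]
B = [4, 6, 8]
print(euclidean_distance(A, B))
```

Output:

```
7.0710678118654755
```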



### 2. Cosine Similarity

#### Definition
Cosine similarity measures the cosine of the angle between two vectors. Unlike Euclidean distance, it focuses on the direction of the vectors rather than their magnitude, making it useful for comparing the orientation of vectors in space.

#### Mathematical Formula
For two vectors **A** and **B**, the cosine similarity is calculated as:

$$
\text{cosine similarity}(A, B) = \frac{A \cdot B}{\|A\| \|B\|}
$$

Where $A \cdot B$ is the dot product of the vectors, and $\|A\|$ and $\|B\|$ are the magnitudes (lengths) of vectors $A$ and $B$.
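As a worked instance of the formula, take $A = (1, 2, 3)$ and $B = (4, 5, 6)$, the same values used in the Python example in this section:

$$
A \cdot B = 1 \cdot 4 + 2 \cdot 5 + 3 \cdot 6 = 32, \qquad \|A\| = \sqrt{1 + 4 + 9} = \sqrt{14}, \qquad \|B\| = \sqrt{16 + 25 + 36} = \sqrt{77}
$$

$$
\text{cosine similarity}(A, B) = \frac{32}{\sqrt{14}\,\sqrt{77}} = \frac{32}{\sqrt{1078}} \approx 0.9746
$$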

#### When to Use It
Cosine similarity is commonly used when you care about the direction rather than the magnitude. It’s ideal for textual data where the length of the document (or vector) may vary, but the focus is on how similar the content (direction) is. It’s widely used in NLP tasks like document similarity, sentence similarity, and information retrieval.

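#### Python Function (from Scratch)

```python
import math

def cosine_similarity(A, B):
    """Return the cosine of the angle between vectors A and B."""
    if len(A) != len(B):
        raise ValueError("Vectors must be of same length.")
    dot_product = sum(a * b for a, b in zip(A, B))
    magnitude_A = math.sqrt(sum(a ** 2 for a in A))
    magnitude_B = math.sqrt(sum(b ** 2 for b in B))
    # A zero vector has no direction, so the similarity is undefined.
    if magnitude_A == 0 or magnitude_B == 0:
        raise ValueError("Cosine similarity is undefined for zero vectors.")
    return dot_product / (magnitude_A * magnitude_B)

# Example Usage
A = [1, 2, 3]
B = [4, 5, 6]
print(cosine_similarity(A, B))
```

Output:

```
0.9746318461970762
```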


Cosine similarity will give you a value between -1 and 1, where 1 means the vectors point in the same direction (are very similar), 0 means they are orthogonal (unrelated), and -1 means they point in completely opposite directions (are very dissimilar).
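The three reference values can be checked with axis-aligned vectors. The sketch below redefines a compact version of the function so the block runs on its own. The input vectors are chosen only for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same = cosine_similarity([1, 0], [2, 0])        # same direction
orthogonal = cosine_similarity([1, 0], [0, 3])  # perpendicular
opposite = cosine_similarity([1, 0], [-4, 0])   # opposite direction

print(same, orthogonal, opposite)  # 1.0 0.0 -1.0
```

Note that the lengths of the second vectors (2, 3, and 4) have no effect on the results.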

```python
vec1 = [1, 2]
vec2 = [100, 200]

print("Euclidean Distance:", euclidean_distance(vec1, vec2))
print("Cosine Similarity:", cosine_similarity(vec1, vec2))
```

Output:

```
Euclidean Distance: 221.37072977247917
Cosine Similarity: 1.0
```
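The choice of measure can change which item a search returns. The following sketch, with hypothetical item names and vectors, picks the best match for a query under each measure. The helpers are redefined in compact form so the block is self-contained.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query = [1, 1]
# Hypothetical stored items: one far away but aligned with the query,
# one close by but pointing in a different direction.
items = {
    "same_direction_far": [100, 100],
    "different_direction_near": [2, 0],
}

# Smallest distance wins for Euclidean; largest similarity wins for cosine.
best_by_euclidean = min(items, key=lambda name: euclidean_distance(query, items[name]))
best_by_cosine = max(items, key=lambda name: cosine_similarity(query, items[name]))

print("Best by Euclidean distance:", best_by_euclidean)
print("Best by cosine similarity:", best_by_cosine)
```

Euclidean distance prefers the nearby item, while cosine similarity prefers the item that points the same way as the query, regardless of how far away it is.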


## Why Cosine Similarity Over Euclidean Distance in Vector DBs?

Cosine similarity is often used over Euclidean distance in vector databases because it focuses on the **angle** between vectors, not their **magnitude**. This is useful when the direction matters more than the size. For example, in text similarity (e.g., document search), two documents might have different lengths but similar content. Cosine similarity can measure how similar the documents are based on their word usage patterns, ignoring how many words they contain.

Euclidean distance, on the other hand, measures the actual distance between points in space, which can be misleading when vectors point in similar directions but differ in magnitude.
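The document-length point can be illustrated with simple word-count vectors. This is a sketch under stated assumptions: the texts are made up, `count_vector` is a hypothetical helper, and raw word counts stand in for real embeddings.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def count_vector(text, vocabulary):
    """Map a text to word counts, one dimension per vocabulary word."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

short_doc = "vector databases store embeddings"
long_doc = " ".join([short_doc] * 3)  # same content, three times as long
other_doc = "cats chase mice"         # no words in common with short_doc

vocabulary = sorted(set((short_doc + " " + other_doc).split()))
v_short = count_vector(short_doc, vocabulary)
v_long = count_vector(long_doc, vocabulary)
v_other = count_vector(other_doc, vocabulary)

sim_same = cosine_similarity(v_short, v_long)
sim_diff = cosine_similarity(v_short, v_other)
dist_same = math.dist(v_short, v_long)  # Euclidean distance (Python 3.8+)

print("Cosine, same content:", sim_same)
print("Cosine, different content:", sim_diff)
print("Euclidean, same content:", dist_same)
```

The repeated document scores a cosine similarity of 1 against the short one, yet their Euclidean distance is non-zero purely because of the length difference.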

### Simple Python Example

```python
vec1 = [1, 1]   # small magnitude, same direction
vec2 = [100, 100]  # large magnitude, same direction

euclid = euclidean_distance(vec1, vec2)
cosine_sim = cosine_similarity(vec1, vec2)

print("Euclidean Distance:", euclid)
print("Cosine Similarity:", cosine_sim)
```

Output:

```
Euclidean Distance: 140.0071426749364
Cosine Similarity: 0.9999999999999999
```


Here, despite the large difference in magnitude, cosine similarity is effectively **1** (the printed `0.9999999999999999` differs from 1 only by floating-point rounding), while Euclidean distance is large (about 140). Cosine similarity captures the **direction** the vectors share, not their size.
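The two measures become closely linked once vectors are scaled to unit length. As a short derivation, assuming $\|A\| = \|B\| = 1$ (an assumption about preprocessing, not something the examples in this section do):

$$
d(A, B)^2 = \|A - B\|^2 = \|A\|^2 + \|B\|^2 - 2\,(A \cdot B) = 2 - 2\,\text{cosine similarity}(A, B)
$$

Under that assumption, a smaller Euclidean distance always corresponds to a larger cosine similarity, so both measures rank neighbors in the same order.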