<div id="BBox" class="alert alert-info" style="font-family:courier;color:black;justify-content:left;">
<h1> Distance Operations </h1>
NumPy does not ship a dedicated distance module, but its vectorized operations make distance calculations efficient. Distances can be computed with the functions in scipy.spatial.distance or directly with NumPy operations. The sections below build the most common distance and similarity measures from NumPy primitives:
</div>

In [7]:
from IPython.display import display, Math, Latex
import numpy as np

<div id="BBox" class="alert alert-info" style="font-family:courier;color:black;justify-content:left;">
<h3> Euclidean Distance</h3>
<strong>What It Is:</strong><br>
Euclidean distance is the straight-line distance between two points in Euclidean space. It is computed using the formula given below.
<br><br>
<strong>How It’s Used:</strong> <br>
point1 = np.array([1, 2]) <br>
point2 = np.array([4, 6]) <br>
distance = np.linalg.norm(point1 - point2)<br>
<br>
<strong>Why it is important:</strong> <br>
Euclidean distance is widely used in machine learning for similarity measurement in clustering algorithms (like K-Means) and in nearest-neighbor methods (like KNN).
</div>


In [8]:
display(Math(r'd\left( p,q\right)   = \sqrt {\sum _{i=1}^{n}  \left( q_{i}-p_{i}\right)^2 }'))

<IPython.core.display.Math object>

In [9]:
point1 = np.array([1, 2])
point2 = np.array([4, 6])
distance = np.linalg.norm(point1 - point2)
print(distance)


5.0
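
The same calculation extends from one pair of points to every pair in a dataset. The sketch below is one way to do that with NumPy broadcasting; the array `points` is a hypothetical sample, not data from this notebook.

```python
import numpy as np

# Hypothetical sample points, one per row (an assumption for illustration)
points = np.array([[1, 2], [4, 6], [1, 6]])

# Broadcasting: (n, 1, d) - (1, n, d) -> (n, n, d) array of coordinate differences
diffs = points[:, None, :] - points[None, :, :]

# Taking the Euclidean norm along the last axis yields the (n, n) distance matrix
pairwise = np.linalg.norm(diffs, axis=-1)
print(pairwise)
```

Entry `pairwise[i, j]` holds the distance between rows `i` and `j`, so the matrix is symmetric with zeros on the diagonal. For large datasets the intermediate `(n, n, d)` array can be memory-hungry, which is one reason to consider scipy.spatial.distance instead.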


<div id="BBox" class="alert alert-info" style="font-family:courier;color:black;justify-content:left;">
<h3> Manhattan Distance</h3>
<strong>What It Is:</strong><br>
Manhattan distance, also known as L1 distance, measures the sum of absolute differences between the coordinates of two points.
<br><br>
<strong>How It’s Used:</strong> <br>
point1 = np.array([1, 2]) <br>
point2 = np.array([4, 6]) <br>
distance = np.sum(np.abs(point1 - point2))<br>
<br>
<strong>Why it is important:</strong> <br>
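Manhattan distance is a natural fit for grid-like path calculations, where movement is restricted to axis-aligned steps. In ML it is often used with high-dimensional or sparse data: because the differences are not squared, a single large coordinate difference weighs less than it does under Euclidean distance.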
</div>


In [11]:
display(Math(r'd_{\text{Manhattan}} = \sum_{i=1}^n |x_i - y_i|'))

<IPython.core.display.Math object>

In [10]:
point1 = np.array([1, 2])
point2 = np.array([4, 6])
distance = np.sum(np.abs(point1 - point2))
print(distance)

7
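
As a cross-check, the explicit sum of absolute differences should agree with `np.linalg.norm` when its `ord` argument is set to 1. This is a minimal sketch reusing the two points from the cell above.

```python
import numpy as np

point1 = np.array([1, 2])
point2 = np.array([4, 6])

# Explicit form: sum of absolute coordinate differences
manual = np.sum(np.abs(point1 - point2))

# Equivalent form: the L1 norm of the difference vector
via_norm = np.linalg.norm(point1 - point2, ord=1)

print(manual, via_norm)
```

Note that `np.linalg.norm` returns a float even for integer input, while the explicit sum keeps the integer dtype.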


<div id="BBox" class="alert alert-info" style="font-family:courier;color:black;justify-content:left;">
<h3> Cosine Similarity</h3>
<strong>What It Is:</strong><br>
Cosine similarity measures the cosine of the angle between two vectors, representing how similar the two vectors are regardless of their magnitudes.
<br><br>
<strong>How It’s Used:</strong> <br>
point1 = np.array([1, 2]) <br>
point2 = np.array([4, 6]) <br>
cosine_similarity = np.dot(point1, point2) / (np.linalg.norm(point1) * np.linalg.norm(point2))
<br><br>
<strong>Why it is important:</strong> <br>
Cosine similarity is crucial in NLP and recommendation systems, where the direction rather than the magnitude of vectors is important, like in comparing text documents or user preferences.
</div>


In [13]:
display(Math(r'\text{cosine\_similarity} = \cos(\theta) = \frac{\vec{A} \cdot \vec{B}}{\|\vec{A}\| \|\vec{B}\|}'))

<IPython.core.display.Math object>

In [15]:
point1 = np.array([1, 2])
point2 = np.array([4, 6])
print(point1)
print(point2)
dot_product = np.dot(point1, point2)
print("Dot Product: ",dot_product)
norm_point1 = np.linalg.norm(point1)
print("Norm of Point 1: ", norm_point1)
norm_point2 = np.linalg.norm(point2)
print("Norm of Point 2: ", norm_point2)
cosine_similarity = dot_product / ( norm_point1 * norm_point2)
print("Cosine Similarity: ",cosine_similarity)

[1 2]
[4 6]
Dot Product:  16
Norm of Point 1:  2.23606797749979
Norm of Point 2:  7.211102550927978
Cosine Similarity:  0.9922778767136677
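
To illustrate the claim that cosine similarity ignores magnitude, the sketch below wraps the calculation in a helper and rescales one of the vectors. The function name `cosine_similarity` and the decision to raise an error for a zero vector are assumptions made for this example, not part of NumPy.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (hypothetical helper)."""
    norm_product = np.linalg.norm(a) * np.linalg.norm(b)
    if norm_product == 0:
        # Assumption: treat a zero vector as an error, since the angle is undefined
        raise ValueError("cosine similarity is undefined for a zero vector")
    return np.dot(a, b) / norm_product

point1 = np.array([1, 2])
point2 = np.array([4, 6])

base = cosine_similarity(point1, point2)
scaled = cosine_similarity(10 * point1, point2)  # same direction, 10x the length

print(np.isclose(base, scaled))  # scaling a vector leaves the similarity unchanged

# A common derived quantity: cosine distance
cosine_distance = 1 - base
print(cosine_distance)
```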


<div id="BBox" class="alert alert-info" style="font-family:courier;color:black;justify-content:left;">
<h3> Minkowski Distance</h3>
<strong>What It Is:</strong><br>
Minkowski distance is a generalization of Euclidean and Manhattan distances, parameterized by a value \( p \), where \( p = 1 \) gives Manhattan distance and \( p = 2 \) gives Euclidean distance.
<br><br>
<strong>How It’s Used:</strong> <br>
p = 3 <br>
point1 = np.array([1, 2]) <br>
point2 = np.array([4, 6]) <br>
distance = np.sum(np.abs(point1 - point2) ** p) ** (1/p)
<br><br>
<strong>Why it is important:</strong> <br>
Minkowski distance lets one formula cover a whole family of metrics. Raising \( p \) makes large coordinate differences dominate the result more strongly, so \( p \) can be treated as a tunable parameter in distance-based methods such as KNN.
</div>



In [16]:
display(Math(r'd_{\text{Minkowski}} = \left( \sum_{i=1}^n |x_i - y_i|^p \right)^{\frac{1}{p}}'))

<IPython.core.display.Math object>

In [17]:
p = 3
point1 = np.array([1, 2])
point2 = np.array([4, 6])
distance = np.sum(np.abs(point1 - point2) ** p) ** (1/p)
print(distance)

4.497941445275415
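
The relationship between Minkowski, Manhattan, and Euclidean distance can be checked numerically. The helper below, `minkowski_distance`, is a hypothetical name for this sketch; handling `p = np.inf` as the largest absolute difference (Chebyshev distance, the limiting case) is an addition that the cells above do not cover.

```python
import numpy as np

def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two vectors (hypothetical helper)."""
    diff = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))
    if np.isinf(p):
        # Limiting case p -> infinity: the largest absolute difference
        return diff.max()
    return np.sum(diff ** p) ** (1 / p)

point1 = np.array([1, 2])
point2 = np.array([4, 6])

print(minkowski_distance(point1, point2, 1))       # matches the Manhattan result, 7
print(minkowski_distance(point1, point2, 2))       # matches the Euclidean result, 5
print(minkowski_distance(point1, point2, 3))
print(minkowski_distance(point1, point2, np.inf))  # largest coordinate difference, 4
```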


<div id="BBox" class="alert alert-info" style="font-family:courier;color:black;justify-content:left;">
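<h3> Hamming Distance</h3>
<strong>What It Is:</strong><br>
Hamming distance counts the positions at which two equal-length binary or categorical vectors differ. The normalized form used here divides that count by the vector length, giving the proportion of differing elements.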
<br><br>
<strong>How It’s Used:</strong> <br>
vector1 = np.array([1, 0, 1, 1]) <br>
vector2 = np.array([1, 1, 0, 1]) <br>
hamming_distance = np.mean(vector1 != vector2)
<br><br>
<strong>Why it is important:</strong> <br>
Hamming distance is widely used in error detection, information retrieval, and in ML for comparing binary strings or categorical data.
</div>



In [19]:
display(Math(r'd_{\text{Hamming}} = \frac{1}{n} \sum_{i=1}^n \delta(x_i, y_i)'))
display(Math(r'\text{where}'))
display(Math(r'''
                \delta(x_i, y_i) = 
                \begin{cases} 
                    0 & \text{if } x_i = y_i \\
                    1 & \text{if } x_i \neq y_i 
                \end{cases}
             '''))

<IPython.core.display.Math object>

<IPython.core.display.Math object>

<IPython.core.display.Math object>

In [20]:
vector1 = np.array([1, 0, 1, 1])
vector2 = np.array([1, 1, 0, 1])
hamming_distance = np.mean(vector1 != vector2)
print(hamming_distance)

0.5
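
Hamming distance is reported either as a raw count of mismatches or as a proportion of the vector length. The sketch below computes both from the same boolean mask, using the vectors from the cell above.

```python
import numpy as np

vector1 = np.array([1, 0, 1, 1])
vector2 = np.array([1, 1, 0, 1])

# Boolean mask: True wherever the two vectors disagree
mismatches = vector1 != vector2

# Raw count of differing positions
hamming_count = np.count_nonzero(mismatches)

# Proportion of differing positions (the count divided by the vector length)
hamming_proportion = mismatches.mean()

print(hamming_count, hamming_proportion)
```

The two forms differ only by the factor `len(vector1)`, so either can be recovered from the other as long as the vector length is known.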


<div id="BBox" class="alert alert-info" style="font-family:courier;color:black;justify-content:left;">
<h3> Jaccard Similarity</h3>
<strong>What It Is:</strong><br>
Jaccard similarity treats two binary arrays as sets, where a 1 marks membership, and is calculated as the size of their intersection divided by the size of their union.
<br><br>
<strong>How It’s Used:</strong> <br>
vector1 = np.array([1, 0, 1, 1]) <br>
vector2 = np.array([1, 1, 0, 1]) <br>
intersection = np.logical_and(vector1, vector2).sum() <br>
union = np.logical_or(vector1, vector2).sum() <br>
jaccard_similarity = intersection / union
<br><br>
<strong>Why it is important:</strong> <br>
Jaccard similarity is useful in document similarity, clustering, and other ML tasks involving categorical or binary data, as it focuses on the degree of overlap.
</div>



In [22]:
display(Math(r'\text{Jaccard\_similarity} = \frac{|A \cap B|}{|A \cup B|}'))

<IPython.core.display.Math object>

In [23]:
vector1 = np.array([1, 0, 1, 1])
vector2 = np.array([1, 1, 0, 1])
intersection = np.logical_and(vector1, vector2).sum()
union = np.logical_or(vector1, vector2).sum()
jaccard_similarity = intersection / union

print(jaccard_similarity)

0.5
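
The ratio above is undefined when both arrays are all zeros, because the union is empty. The sketch below shows one way to guard against that; the helper name `jaccard_similarity` and the choice to return 1.0 for two empty sets are assumptions for illustration, and other conventions exist.

```python
import numpy as np

def jaccard_similarity(a, b):
    """Jaccard similarity of two binary arrays (hypothetical helper)."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        # Assumption: two empty sets are treated as identical
        return 1.0
    return np.logical_and(a, b).sum() / union

vector1 = np.array([1, 0, 1, 1])
vector2 = np.array([1, 1, 0, 1])

similarity = jaccard_similarity(vector1, vector2)
jaccard_distance = 1 - similarity  # the corresponding dissimilarity

print(similarity, jaccard_distance)
print(jaccard_similarity(np.zeros(4), np.zeros(4)))  # empty-union case
```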


<div id="BBox" class="alert alert-info" style="font-family:courier;color:black;justify-content:left;">
<h3> Mahalanobis Distance</h3>
<strong>What It Is:</strong><br>
Mahalanobis distance is a measure of distance between two points in a multivariate space. Unlike Euclidean distance, which assumes all features have the same scale and no correlation, Mahalanobis distance takes into account the correlations between features by using the covariance matrix of the data. It is particularly useful in identifying outliers and measuring distances in high-dimensional data with dependencies.
<br><br>
<strong>How It’s Used:</strong> <br>
To compute Mahalanobis distance in Python with NumPy, we need: <br>
1. The data vectors 𝑥 and 𝑦. <br>
2. The covariance matrix Σ of the dataset. <br>
3. The inverse of the covariance matrix, Σ<sup>-1</sup>. <br>
<br><br>
<strong>Why it is important:</strong> <br>
<ul>
<li><strong>Handles Correlated Data:</strong> Mahalanobis distance considers the correlation between features, making it suitable for multivariate data where variables are dependent.</li>
<li><strong>Outlier Detection: </strong>This distance metric is effective for identifying outliers in data by measuring how far a point deviates from a distribution.</li>
<li><strong>Invariant to Scale:</strong> Mahalanobis distance is scale-invariant, meaning it accounts for differences in units across features, which is helpful when working with features of varying scales.</li>
</ul>
</div>



In [29]:
display(Math(r'd_{\text{Mahalanobis}} = \sqrt{(\vec{x} - \vec{y})^T \Sigma^{-1} (\vec{x} - \vec{y})}'))

<IPython.core.display.Math object>

In [None]:
# Sample data
data = np.array([
    [4, 2, 0],
    [4, 5, 6],
    [10, 3, 2],
    [3, 7, 4],
    [6, 5, 5],
])

# Vectors between which we calculate the Mahalanobis distance
x = np.array([4, 2, 0])
y = np.array([6, 5, 5])

# Calculate the covariance matrix of the dataset
cov_matrix = np.cov(data, rowvar=False)
# Invert the covariance matrix
cov_matrix_inv = np.linalg.inv(cov_matrix)
# Calculate the difference vector
delta = x - y
# Calculate the Mahalanobis distance
distance = np.sqrt(delta.T @ cov_matrix_inv @ delta)  # @ is for matrix multiplication

print(distance)

2.455357119316824
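
Explicitly inverting the covariance matrix is not strictly required. The sketch below is an alternative, assuming the same sample data as above: it solves the linear system with `np.linalg.solve`, which is generally preferred numerically. It also checks that an identity covariance matrix reduces Mahalanobis distance to Euclidean distance.

```python
import numpy as np

# Same sample data and vectors as in the cell above
data = np.array([
    [4, 2, 0],
    [4, 5, 6],
    [10, 3, 2],
    [3, 7, 4],
    [6, 5, 5],
])
x = np.array([4, 2, 0])
y = np.array([6, 5, 5])

cov_matrix = np.cov(data, rowvar=False)
delta = x - y

# Solve cov_matrix @ z = delta instead of forming the inverse explicitly
distance = np.sqrt(delta @ np.linalg.solve(cov_matrix, delta))
print(distance)

# With an identity covariance (unit variance, no correlation) the result
# equals the ordinary Euclidean distance
identity_case = np.sqrt(delta @ np.linalg.solve(np.eye(3), delta))
print(np.isclose(identity_case, np.linalg.norm(delta)))
```

Both approaches fail if the covariance matrix is singular, which can happen when there are fewer samples than features or when features are perfectly correlated.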
