Choose a dataset from a source you like and define two different metrics on it, that is, two methods for computing dissimilarities between samples that take all of their features (the columns of the dataset) into account. The objective is that:
- The closest pair of samples in the dataset under metric 1 differs from the closest pair under metric 2.
- The pair of samples that are farthest apart under metric 1 differs from the farthest pair under metric 2.
The units of measurement (kg, cm, ...) should be taken into account when computing the metrics. Explicitly compute the most similar and most dissimilar samples for each metric and discuss the results by commenting on the balance of the features in each metric.

In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from itertools import combinations

iris = load_iris()
X = pd.DataFrame(
    iris.data,
    columns=iris.feature_names
)

X.head()

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


In [2]:
def euclidean_distance(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

In [3]:
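# Standardize each column to zero mean and unit variance
# (pandas .std() uses the sample standard deviation, ddof=1)
X_std = (X - X.mean()) / X.std()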

def manhattan_distance(a, b):
    return np.sum(np.abs(a - b))

In [4]:
pairs = list(combinations(range(len(X)), 2))

results = []

# Compute both distances for every unordered pair of samples (i < j)
for i, j in pairs:
    d1 = euclidean_distance(X.iloc[i], X.iloc[j])
    d2 = manhattan_distance(X_std.iloc[i], X_std.iloc[j])
    results.append((i, j, d1, d2))

df_dist = pd.DataFrame(
    results,
    columns=["i", "j", "euclidean", "manhattan_std"]
)


In [5]:
closest_euclid = df_dist.loc[df_dist["euclidean"].idxmin()]
farthest_euclid = df_dist.loc[df_dist["euclidean"].idxmax()]

closest_manhattan = df_dist.loc[df_dist["manhattan_std"].idxmin()]
farthest_manhattan = df_dist.loc[df_dist["manhattan_std"].idxmax()]

closest_euclid, farthest_euclid, closest_manhattan, farthest_manhattan


(i                101.0
 j                142.0
 euclidean          0.0
 manhattan_std      0.0
 Name: 10039, dtype: float64,
 i                 13.000000
 j                118.000000
 euclidean          7.085196
 manhattan_std     11.195468
 Name: 1963, dtype: float64,
 i                101.0
 j                142.0
 euclidean          0.0
 manhattan_std      0.0
 Name: 10039, dtype: float64,
 i                 41.000000
 j                117.000000
 euclidean          6.727555
 manhattan_std     12.857482
 Name: 5364, dtype: float64)

# 📐 Exercise 2: Comparison of Distance Metrics

## 📊 Dataset Description

The Iris dataset contains physical measurements of flowers:

- Sepal length (cm)
- Sepal width (cm)
- Petal length (cm)
- Petal width (cm)

Each sample represents one flower, and all features are expressed in centimeters, although their numerical ranges differ.

---

## 📏 Defined Metrics

### Metric 1: Euclidean Distance (Raw Units)

This metric computes the straight-line distance between two samples using the original measurements.

**Characteristics:**
- Sensitive to large numerical differences
- Dominated by features with larger ranges
- Preserves physical units
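
The pair-by-pair loop in the notebook can also be written with NumPy broadcasting. The following is a minimal sketch on toy data, not the notebook's code; the helper names `pairwise_euclidean` and `extreme_pairs` are hypothetical and introduced only for illustration.

```python
import numpy as np

def pairwise_euclidean(X):
    """Return the (n, n) matrix of Euclidean distances between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]      # shape (n, n, d)
    return np.sqrt((diff ** 2).sum(axis=-1))

def extreme_pairs(D):
    """Return the closest and the farthest pair (i, j), with i < j."""
    rows, cols = np.triu_indices(D.shape[0], k=1)   # unordered pairs only
    vals = D[rows, cols]
    a, b = vals.argmin(), vals.argmax()
    return (int(rows[a]), int(cols[a])), (int(rows[b]), int(cols[b]))

# Toy data (made up for this example): three points in the plane.
toy = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]])
D = pairwise_euclidean(toy)
closest, farthest = extreme_pairs(D)
print(closest, farthest)
```

On the notebook's data, `pairwise_euclidean(X.to_numpy())` would give the same distances as the `euclidean` column of `df_dist`, arranged as a matrix.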

---

### Metric 2: Standardized Manhattan Distance

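Each feature is standardized (centered on its mean and divided by its sample standard deviation) before computing the Manhattan distance, which is the sum of absolute differences across features. The centering cancels in the difference between two samples, so only the division by the standard deviation affects the distances.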

**Characteristics:**
- Removes unit and scale dominance: standardized values are dimensionless
- Balances feature contributions
- Sensitive to cumulative deviations: moderate differences on several features add up
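
For reference, the two metrics implemented in the notebook can be written as follows. The notation is introduced here and is not used elsewhere in the notebook: $x_{ik}$ is feature $k$ of sample $i$, $p = 4$ is the number of features, and $s_k$ is the sample standard deviation of feature $k$.

$$
d_1(i, j) = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2},
\qquad
d_2(i, j) = \sum_{k=1}^{p} \frac{|x_{ik} - x_{jk}|}{s_k}
$$

$d_1$ is expressed in centimeters, while $d_2$ is dimensionless because each difference in centimeters is divided by a standard deviation in centimeters.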

---

## 🔍 Similarity Analysis

### Closest Samples

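- Under **Euclidean distance**, the closest pair is samples 101 and 142, at distance 0: the two rows have identical values for all four features.
- Under **standardized Manhattan distance**, the closest pair is the same (samples 101 and 142, distance 0). Identical samples are at distance 0 under any metric, so duplicate rows are always a closest pair.

The first objective (different closest pairs) is therefore not met on the raw dataset. A meaningful comparison of the closest pairs requires removing duplicate rows first.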
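
In the output above, pair (101, 142) has distance 0 under both metrics, which means the two rows are identical. One way to obtain an informative closest pair is to discard duplicate rows before the search. The sketch below illustrates this on toy data and is not the notebook's code; the function name `closest_pair_without_duplicates` is hypothetical.

```python
import numpy as np

def closest_pair_without_duplicates(X):
    """Closest pair (i, j, distance) among the distinct rows of X (Euclidean).

    Assumes X contains at least two distinct rows.
    """
    # np.unique with axis=0 keeps one copy of each distinct row;
    # first_idx maps each kept row back to its first position in X.
    unique_rows, first_idx = np.unique(X, axis=0, return_index=True)
    diff = unique_rows[:, None, :] - unique_rows[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(D, np.inf)               # ignore self-distances
    a, b = np.unravel_index(D.argmin(), D.shape)
    i, j = sorted((int(first_idx[a]), int(first_idx[b])))
    return i, j, float(D[a, b])

# Toy data (made up for this example): rows 0 and 2 are duplicates.
toy = np.array([[1.0, 1.0], [5.0, 5.0], [1.0, 1.0], [1.5, 1.0]])
result = closest_pair_without_duplicates(toy)
print(result)
```

In the notebook, a comparable effect could be obtained by calling `X.drop_duplicates()` before building `pairs`. The same idea applies to metric 2 by summing absolute differences of the standardized values instead.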

---

### Most Dissimilar Samples

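- Under **Euclidean distance**, the most distant pair is samples 13 and 118 (distance ≈ 7.09 cm). Because differences are squared in raw units, the feature with the largest absolute difference, petal length, accounts for most of this distance.
- Under **standardized Manhattan distance**, the most distant pair is samples 41 and 117 (distance ≈ 12.86, dimensionless). Each difference is divided by the feature's standard deviation, so a feature with a small spread, such as sepal width, can contribute as much as petal length.

The farthest pair therefore changes with the metric, so the second objective is met.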
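
To check which features drive a given distance, each distance can be split into per-feature terms. The sketch below is a minimal illustration on toy data; the function names `euclidean_shares` and `std_manhattan_terms` are hypothetical.

```python
import numpy as np

def euclidean_shares(X, i, j):
    """Fraction of the squared Euclidean distance between rows i and j
    that comes from each feature (undefined if the rows are identical)."""
    sq = (X[i] - X[j]) ** 2
    return sq / sq.sum()

def std_manhattan_terms(X, i, j):
    """Per-feature terms |x_ik - x_jk| / s_k of the standardized Manhattan
    distance; their sum is the distance itself."""
    s = X.std(axis=0, ddof=1)                 # sample std, as in pandas
    return np.abs(X[i] - X[j]) / s

# Toy data (made up for this example): the second feature has a larger scale.
toy = np.array([[1.0, 10.0], [2.0, 30.0], [3.0, 20.0]])
shares = euclidean_shares(toy, 0, 1)
terms = std_manhattan_terms(toy, 0, 1)
print(shares, terms)
```

On the notebook's data, the calls would be `euclidean_shares(X.to_numpy(), 13, 118)` and `std_manhattan_terms(X.to_numpy(), 41, 117)` for the two farthest pairs reported above.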

---

## ⚖️ Feature Balance Discussion

| Metric | Feature Influence | Interpretation |
|------|------------------|----------------|
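| Euclidean (raw) | Dominated by the features with the largest spread in cm (petal length in Iris) | Emphasizes absolute physical differences |
| Manhattan (standardized) | Each feature weighted by the inverse of its standard deviation | Emphasizes differences relative to each feature's typical spread |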
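
The balance of features can be summarized with variances. Summed over all pairs, the squared Euclidean distance splits across features in proportion to their variances, so the variance share of a feature indicates how much it weighs in metric 1. For the Manhattan distance this is only a rough indicator, since absolute rather than squared differences are summed. The sketch below uses toy data, and the function name `variance_shares` is hypothetical.

```python
import numpy as np

def variance_shares(X):
    """Share of each feature (column) in the total variance of X."""
    v = X.var(axis=0, ddof=1)
    return v / v.sum()

# Toy data (made up for this example): the second feature has a larger scale.
toy = np.array([[1.0, 10.0], [2.0, 30.0], [3.0, 20.0]])

raw_shares = variance_shares(toy)

# After standardization every column has unit variance, so shares are equal.
toy_std = (toy - toy.mean(axis=0)) / toy.std(axis=0, ddof=1)
std_shares = variance_shares(toy_std)
print(raw_shares, std_shares)
```

On the notebook's data, `variance_shares(X.to_numpy())` would show how unevenly the four measurements weigh in the raw Euclidean distance, while the standardized data gives equal shares by construction.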

---

## 🧠 Conclusion

Distance metrics encode assumptions about feature importance and scale. Choosing an appropriate metric is crucial, as it directly impacts similarity-based methods such as nearest neighbors, clustering, and classification.
