<a href="https://colab.research.google.com/github/ysuter/FHNW-BSUD-Part2/blob/main/L6-InformationExtraction/similarity_distance_metrics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🧮 Exploring Similarity and Distance Metrics in NLP and Data Science
This notebook compares different similarity and distance metrics on **numeric**, **binary**, and **text** data.

**Metrics included:**
- Cosine similarity
- Euclidean distance
- Manhattan distance
- Hamming distance
- Jaccard similarity
- Pearson correlation

**Goal:** Understand what kind of relationship each metric measures and when to use it.
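
As a quick orientation, here is a minimal sketch that computes several of the listed metrics by hand with NumPy. The vectors `a`, `b` and the binary pair `x`, `y` are illustrative values chosen for this sketch; the cells below use scikit-learn instead.

```python
import numpy as np

# Illustrative inputs (assumed for this sketch, not taken from the cells below).
a = np.array([1.0, 2.0, 3.0])
b = np.array([3.0, 2.0, 1.0])

# Cosine similarity: dot product divided by the product of the vector norms.
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Euclidean distance: square root of the summed squared differences.
euclidean = np.sqrt(np.sum((a - b) ** 2))

# Manhattan distance: sum of absolute differences.
manhattan = np.sum(np.abs(a - b))

# Pearson correlation: cosine similarity of the mean-centered vectors.
ac, bc = a - a.mean(), b - b.mean()
pearson = ac @ bc / (np.linalg.norm(ac) * np.linalg.norm(bc))

# Hamming distance (normalized): fraction of positions that differ.
x = np.array([1, 0, 0, 1, 1])
y = np.array([1, 1, 0, 1, 0])
hamming = np.mean(x != y)

print(f"cosine={cosine:.3f}  euclidean={euclidean:.3f}  manhattan={manhattan:.1f}")
print(f"pearson={pearson:.3f}  hamming={hamming:.1f}")
```

Note that `a` and `b` have a positive cosine similarity but a Pearson correlation of -1: the two metrics differ only in the mean-centering step.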

In [None]:
!pip install nltk scikit-learn seaborn matplotlib sentence-transformers -q
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances, manhattan_distances, pairwise_distances
from scipy.spatial.distance import hamming, jaccard
import seaborn as sns
import matplotlib.pyplot as plt
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
print('✅ Setup complete.')

In [None]:
# 1️⃣ Numeric Example — Comparing Distance Metrics
vectors = np.array([[1, 2, 3], [2, 3, 4], [10, 10, 10]])
labels = ['Vector A', 'Vector B', 'Vector C']
def show_heatmap(matrix, title, labels=labels):
    sns.heatmap(matrix, annot=True, cmap='YlGnBu', xticklabels=labels, yticklabels=labels)
    plt.title(title)
    plt.show()
cos_sim = cosine_similarity(vectors)
show_heatmap(cos_sim, 'Cosine Similarity (Numeric)')
eucl_dist = euclidean_distances(vectors)
show_heatmap(eucl_dist, 'Euclidean Distance (Numeric)')
man_dist = manhattan_distances(vectors)
show_heatmap(man_dist, 'Manhattan Distance (Numeric)')
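# Pearson correlation = 1 - correlation distance.
# Caveat: Vector C is constant (zero variance), so its correlation with A and B
# is undefined and will likely show up as NaN (blank cells) in the heatmap.
pearson_corr = 1 - pairwise_distances(vectors, metric='correlation')
show_heatmap(pearson_corr, 'Pearson Correlation (1 - Distance)')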

In [None]:
# 2️⃣ Binary Example — Hamming Distance and Jaccard Similarity
binary_vectors = np.array([[1,0,0,1,1],[1,1,0,1,0],[0,0,1,0,1]])
labels_bin = ['Bin A', 'Bin B', 'Bin C']
ham_dist = pairwise_distances(binary_vectors, metric='hamming')
show_heatmap(ham_dist, 'Hamming Distance (Binary)', labels_bin)
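# Jaccard is defined on boolean data; casting explicitly avoids scikit-learn's conversion warning.
jacc_dist = pairwise_distances(binary_vectors.astype(bool), metric='jaccard')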
jacc_sim = 1 - jacc_dist
show_heatmap(jacc_sim, 'Jaccard Similarity (Binary)', labels_bin)

In [None]:
# 3️⃣ Text Example — Bag-of-Words & TF-IDF Similarity
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
stop_words = set(stopwords.words('english'))
def preprocess(text):
    tokens = word_tokenize(text.lower())
    return ' '.join([t for t in tokens if t.isalpha() and t not in stop_words])
texts = [
    'AI improves medical diagnosis through image analysis.',
    'Artificial intelligence helps doctors analyze radiology images.',
    'The weather today is sunny and warm.'
]
clean_texts = [preprocess(t) for t in texts]
vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(clean_texts)
cosine_bow = cosine_similarity(X_bow)
show_heatmap(cosine_bow, 'Cosine Similarity (Bag-of-Words)', [f'Text {i+1}' for i in range(len(texts))])
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(clean_texts)
cosine_tfidf = cosine_similarity(X_tfidf)
show_heatmap(cosine_tfidf, 'Cosine Similarity (TF-IDF)', [f'Text {i+1}' for i in range(len(texts))])

In [None]:
# 4️⃣ Semantic Embeddings — Sentence Transformers
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(texts)
cosine_semantic = cosine_similarity(embeddings)
show_heatmap(cosine_semantic, 'Cosine Similarity (Sentence Embeddings)', [f'Text {i+1}' for i in range(len(texts))])

## 🧩 Student Exercises
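1. Modify the input vectors or texts. How do the similarity and distance values change?
2. Look into additional distance metrics and test them on the examples above.
3. Replace one of the related texts with an unrelated one. Which approach (Bag-of-Words, TF-IDF, or sentence embeddings) reflects the change best?
4. Which metrics are sensitive to the scale (magnitude) of the vectors?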
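
As one possible starting point for exercise 2, the sketch below computes pairwise Minkowski and Chebyshev distances with plain NumPy on the three numeric vectors from the first example. The helper names `pairwise_minkowski` and `pairwise_chebyshev` are hypothetical and written for this sketch; they are not part of any library.

```python
import numpy as np

# Same three vectors as in the numeric example (A, B, C).
vectors = np.array([[1, 2, 3], [2, 3, 4], [10, 10, 10]], dtype=float)

def pairwise_minkowski(X, p):
    """Pairwise Minkowski distance of order p between the rows of X."""
    # Broadcasting yields all row-by-row differences, shape (n, n, d).
    diff = np.abs(X[:, None, :] - X[None, :, :])
    return (diff ** p).sum(axis=-1) ** (1.0 / p)

def pairwise_chebyshev(X):
    """Pairwise Chebyshev distance: the largest absolute coordinate difference."""
    diff = np.abs(X[:, None, :] - X[None, :, :])
    return diff.max(axis=-1)

mink3 = pairwise_minkowski(vectors, p=3)
cheb = pairwise_chebyshev(vectors)

print(np.round(mink3, 3))
print(cheb)
```

With `p=1` and `p=2`, `pairwise_minkowski` reproduces the Manhattan and Euclidean distances from the first example. Either matrix can be passed to the notebook's `show_heatmap` helper, for instance `show_heatmap(cheb, 'Chebyshev Distance (Numeric)')`.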