# Embedding, Similarity, and Clustering

This notebook provides a practical introduction to embedding, similarity, and clustering, three concepts that are foundational to many machine learning applications.

## Objectives:
1. Understand and implement **embedding** to represent real-world data in numerical formats.
2. Learn and calculate **similarity** between embeddings using metrics like Euclidean distance and cosine similarity.
3. Perform **clustering** using the K-means algorithm to group similar data points.

## Table of Contents:
1. Embedding
2. Similarity
3. Clustering
4. K-means Clustering Implementation
5. Conclusion

## 1.0 Embedding
Embedding is the process of representing real-world objects as numerical vectors so that machine learning algorithms can analyze them. Different applications call for different embedding methods. For example, text can be encoded character by character as ASCII values, and images can be represented as arrays of pixel intensities.

In [None]:
# Example: Embedding a sentence using ASCII values
sentence = "Hello, ML!"
ascii_embedding = [ord(char) for char in sentence]
print(f"Original Sentence: {sentence}")
print(f"ASCII Embedding: {ascii_embedding}")
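The image case mentioned above can be sketched the same way. This is a minimal illustration, assuming a made-up 3x3 grayscale image stored as nested lists; the name `pixel_embedding` and the scaling to [0, 1] are choices made for this example, not part of the original notebook.

```python
# Hypothetical 3x3 grayscale image; each value is a pixel intensity from 0 to 255.
image = [
    [0, 128, 255],
    [64, 192, 32],
    [255, 0, 128],
]

# Flatten the image row by row into one vector and scale each value to [0, 1].
pixel_embedding = [pixel / 255 for row in image for pixel in row]

print(f"Embedding length: {len(pixel_embedding)}")
print(f"First three values: {pixel_embedding[:3]}")
```

Flattening discards the 2D layout of the pixels, which is acceptable for a toy example but is one reason real image models usually keep the grid structure.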

## 2.0 Similarity
Similarity measures how close two embeddings are. Common metrics include:
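- **Euclidean Distance**: Measures the straight-line distance between two points in space. Smaller values mean the points are more similar.
- **Cosine Similarity**: Measures the cosine of the angle between two vectors, focusing on their orientation rather than magnitude. Values range from -1 to 1, and 1 means the vectors point in the same direction.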

In [None]:
import numpy as np

# Define two example vectors
vector_a = np.array([1, 2])
vector_b = np.array([2, 3])

# Euclidean distance (smaller means more similar)
euclidean_distance = np.linalg.norm(vector_a - vector_b)
print(f"Euclidean Distance: {euclidean_distance}")

# Cosine Similarity
cosine_similarity = np.dot(vector_a, vector_b) / (np.linalg.norm(vector_a) * np.linalg.norm(vector_b))
print(f"Cosine Similarity: {cosine_similarity}")
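To see why the two metrics can disagree, it may help to compare a vector with a scaled copy of itself. The sketch below is illustrative only; the helper names `euclid_dist` and `cosine_sim` and the vectors `v` and `v_scaled` are assumptions introduced for this example.

```python
import numpy as np

def euclid_dist(u, v):
    """Straight-line distance between two vectors."""
    return float(np.linalg.norm(u - v))

def cosine_sim(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

v = np.array([1.0, 2.0])
v_scaled = 10 * v  # same direction, ten times the magnitude

# The distance is large because the magnitudes differ,
# while the cosine similarity stays at (approximately) 1.
print(f"Euclidean distance: {euclid_dist(v, v_scaled):.4f}")
print(f"Cosine similarity:  {cosine_sim(v, v_scaled):.4f}")
```

Which metric is more appropriate depends on whether magnitude carries meaning in the embedding at hand.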

## 3.0 Clustering
Clustering groups similar data points together. It helps uncover hidden structures in data, such as identifying genres of books based on features like theme and writing style.

In [None]:
# Example Data: Book themes represented as points in a 2D space
import matplotlib.pyplot as plt

data = np.array([
    [1, 2],  # Book A
    [1, 3],  # Book B
    [6, 7],  # Book C
    [7, 6]   # Book D
])

# Plot data points
plt.scatter(data[:, 0], data[:, 1], c="blue", label="Books")
plt.title("Books in 2D Feature Space")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.show()

## 4.0 K-means Clustering Implementation
K-means clustering assigns each data point to one of a predefined number of clusters by minimizing the sum of squared distances between the points and their cluster centroids.

In [None]:
from sklearn.cluster import KMeans

# Apply K-means clustering (n_init is set explicitly so results do not depend on the library's default)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(data)
labels = kmeans.labels_
centroids = kmeans.cluster_centers_

# Visualize clusters
plt.scatter(data[:, 0], data[:, 1], c=labels, cmap="viridis", label="Books")
plt.scatter(centroids[:, 0], centroids[:, 1], c="red", marker="x", label="Centroids")
plt.title("K-means Clustering")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.show()
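For intuition about what happens inside `KMeans`, the following is a rough NumPy sketch of the classic assign-then-update loop (often called Lloyd's algorithm). It is not scikit-learn's implementation, which uses a smarter initialization and multiple restarts; the function `kmeans_sketch` and the variable names are hypothetical.

```python
import numpy as np

def kmeans_sketch(points, k, n_iters=100, seed=0):
    """Simplified K-means: random initialization, then alternate assign and update."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct data points at random.
    start = rng.choice(len(points), size=k, replace=False)
    centroids = points[start].astype(float)

    for _ in range(n_iters):
        # Assignment step: each point joins its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Update step: move each centroid to the mean of its points.
        # An empty cluster keeps its previous centroid.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return labels, centroids

books = np.array([[1, 2], [1, 3], [6, 7], [7, 6]])
sketch_labels, sketch_centroids = kmeans_sketch(books, k=2)
print(f"Labels: {sketch_labels}")
print(f"Centroids:\n{sketch_centroids}")
```

On this four-point dataset the loop settles on one cluster for Books A and B and another for Books C and D, though which cluster receives label 0 depends on the random initialization.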

### 4.1 Elbow Method to Find Optimal Clusters
The Elbow Method helps identify a suitable number of clusters by plotting inertia (the within-cluster sum of squared distances) against the number of clusters and finding the point where adding more clusters no longer reduces it significantly.

In [None]:
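# Calculate inertia for different numbers of clusters.
# K-means cannot use more clusters than there are data points, so k stops at len(data).
k_values = range(1, len(data) + 1)
inertia = []
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(data)
    inertia.append(km.inertia_)

# Plot Elbow Curve
plt.plot(k_values, inertia, marker="o")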
plt.title("Elbow Method")
plt.xlabel("Number of Clusters")
plt.ylabel("Inertia")
plt.show()
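The quantity on the vertical axis can also be computed by hand, which may make the elbow easier to interpret. The sketch below assumes the same four book points and hard-codes the one-cluster and two-cluster groupings; `inertia_for` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np

def inertia_for(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    diffs = points - centroids[labels]
    return float(np.sum(diffs ** 2))

books = np.array([[1, 2], [1, 3], [6, 7], [7, 6]])

# k = 1: every point belongs to a single cluster centered on the overall mean.
one_cluster = inertia_for(
    books,
    np.zeros(len(books), dtype=int),
    books.mean(axis=0, keepdims=True),
)

# k = 2: Books A and B in one cluster, Books C and D in the other.
two_clusters = inertia_for(
    books,
    np.array([0, 0, 1, 1]),
    np.array([[1.0, 2.5], [6.5, 6.5]]),
)

print(f"Inertia with 1 cluster:  {one_cluster}")   # 47.75
print(f"Inertia with 2 clusters: {two_clusters}")  # 1.5
```

The steep drop from one cluster to two, followed by little room left to improve, is the kind of bend the elbow plot is meant to reveal.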

## 5.0 Conclusion
This notebook demonstrated:
- The importance of embeddings in representing real-world data.
- How measures like Euclidean distance and cosine similarity help analyze relationships between data points.
- How clustering methods like K-means reveal structures in data.

You can now explore more advanced clustering techniques or apply these methods to real-world datasets!