### Clustering arXiv dataset (50k) w/ Cohere Embedv3

Clustering is a fundamental task in unsupervised learning: the goal is to group unlabeled examples into meaningful categories. At its core, clustering relies on finding similar examples, where similarity is defined by a chosen metric. Embeddings play a critical role here, because distances between embedding vectors determine how similar two examples are.

This notebook demonstrates how to combine the [Cohere Embedv3 model](https://txt.cohere.com/introducing-embed-v3/) with the [K-Means](https://en.wikipedia.org/wiki/K-means_clustering) and [HDBSCAN](https://en.wikipedia.org/wiki/HDBSCAN) clustering algorithms. The dataset is a collection of 50,000 arXiv research articles from Artificial Intelligence, Computation and Language, Computer Vision and Multiagent Systems. To assess clustering quality with Embedv3, we visualize the clustering outcomes (after [UMAP](https://en.wikipedia.org/wiki/Uniform_Manifold_Approximation_and_Projection) dimensionality reduction) and compare them against the category labels that come with the arXiv dataset.


#### 1. Dataset

In [2]:
%pip install --upgrade datasets python-dotenv scikit-learn --quiet

Note: you may need to restart the kernel to use updated packages.


In [3]:
from sklearn.cluster import KMeans
from datasets import load_dataset



Our dataset is available at [HuggingFace](https://huggingface.co/datasets/dcarpintero/arXiv.cs.AI.CL.CV.LG.MA.NE.embedv3). It contains the metadata of the 50K most recent arXiv articles (up to 17 November 2023) in Artificial Intelligence, Computation and Language, Computer Vision and Multiagent Systems.

Each article's metadata entry has been enriched with embeddings for the 'title' and 'summary' (abstract), generated using Cohere's Embed-v3. These embeddings will enable us to establish semantic connections among the articles for our clustering task.

In [4]:
dataset = load_dataset("dcarpintero/arXiv.cs.AI.CL.CV.LG.MA.NE.embedv3")

Downloading readme: 100%|██████████| 662/662 [00:00<00:00, 660kB/s]
Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

#### 2. K-Means w/ Cohere Embedv3

The K-Means algorithm begins by randomly selecting `k` points as initial `centroids`, where `k` is the number of clusters we want to identify. The algorithm then iterates through two main steps: assignment and update. In the assignment step, each data point is assigned to the nearest centroid, creating clusters based on proximity. In the update step, the centroids are recalculated as the mean of all points in each cluster. This process of assignment and update continues until the `centroids` stabilize, at which point the algorithm has converged. Convergence is to a local optimum that depends on the initial centroids, so the resulting partition is not guaranteed to be the best possible one.
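The assignment and update steps described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used by scikit-learn's `KMeans`: the function name `kmeans`, its parameters, and the 2-D toy data (standing in for the much higher-dimensional embeddings) are all assumptions made for this example.

```python
import numpy as np

def kmeans(points, k, n_iters=100, seed=0):
    """Plain K-Means loop: alternate assignment and update steps."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the starting centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: label each point with its nearest centroid.
        distances = np.linalg.norm(
            points[:, None, :] - centroids[None, :, :], axis=2
        )
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy data (hypothetical): two well-separated 2-D blobs of 50 points each.
rng = np.random.default_rng(42)
toy_points = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=10.0, scale=0.5, size=(50, 2)),
])
labels, centroids = kmeans(toy_points, k=2)
print(centroids)
```

On data this cleanly separated, the loop should settle after a handful of iterations with one centroid per blob. On real embeddings the outcome depends more heavily on the initial centroids, which is why library implementations typically use smarter seeding and several restarts.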

#### 3. HDBSCAN w/ Cohere Embedv3

HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is an advanced clustering algorithm that extends DBSCAN by adapting to clusters of varying density. Unlike K-Means, HDBSCAN does not require pre-specifying the number of clusters. It has only one important hyperparameter, `n`, which establishes the minimum number of examples needed to form a cluster (common implementations expose it as `min_cluster_size`).

In practice, it works by first transforming the space according to the density of the data points, making denser regions (areas where data points are close together in high numbers) more attractive for cluster formation. The algorithm then builds a hierarchy of clusters based on the minimum cluster size established by the hyperparameter `n`, allowing it to distinguish between noise (sparse areas) and dense regions (potential clusters). Finally, HDBSCAN condenses this hierarchy to derive the most persistent clusters, efficiently identifying clusters of different densities and shapes.
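In the standard formulation of HDBSCAN, the space transformation mentioned above is the mutual reachability distance: the distance between two points is raised to at least the core distance of each, where a point's core distance is the distance to its k-th nearest neighbour. The NumPy sketch below computes it for a tiny example. The function name, the `min_samples` parameter name, and the convention of not counting a point as its own neighbour are assumptions for this illustration, and the later hierarchy-building and condensing steps are not shown.

```python
import numpy as np

def mutual_reachability(points, min_samples):
    """Return core distances and the mutual reachability distance matrix."""
    # Pairwise Euclidean distances between all points.
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    # Core distance: distance to the min_samples-th nearest neighbour.
    # Index 0 of each sorted row is the point itself (distance 0).
    core = np.sort(dists, axis=1)[:, min_samples]
    # Mutual reachability: max(core(a), core(b), d(a, b)).
    mreach = np.maximum(dists, np.maximum(core[:, None], core[None, :]))
    np.fill_diagonal(mreach, 0.0)
    return core, mreach

# Three nearby points and one isolated point (hypothetical toy data).
pts = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0]])
core, mreach = mutual_reachability(pts, min_samples=2)
print(core)
print(mreach)
```

Points in sparse regions have large core distances, so they are pushed further away from everything else, while distances inside dense regions change little. This is what makes dense regions stand out when the cluster hierarchy is built.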

#### 4. Visualization