### Clustering and Topic Modeling of arXiv dataset (10k) w/ Cohere Embedv3 | Pydantic | OpenAI | LangChain

This notebook demonstrates how to combine [Cohere Embed v3](https://txt.cohere.com/introducing-embed-v3/) embeddings and [GPT-4](https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo) with [HDBSCAN](https://en.wikipedia.org/wiki/HDBSCAN), [Pydantic](https://docs.pydantic.dev/) and [LangChain](https://www.langchain.com/) for [Clustering](https://en.wikipedia.org/wiki/Cluster_analysis) and [Topic Modeling](https://en.wikipedia.org/wiki/Topic_model). Our playground is a [dataset of 10,000 arXiv research documents](https://huggingface.co/datasets/dcarpintero/arxiv.cs.CL.embedv3.clustering.medium) from Computation and Language (Natural Language Processing) published between 2019 and 2023, enriched with `title` and `abstract` clustering embeddings generated with [Cohere Embedv3](https://txt.cohere.com/introducing-embed-v3/). To assess the clustering and topic modeling results qualitatively, we visualize the outcomes after applying [UMAP](https://en.wikipedia.org/wiki/Uniform_Manifold_Approximation_and_Projection) dimensionality reduction.

[Clustering](https://en.wikipedia.org/wiki/Cluster_analysis) is a fundamental task in unsupervised learning, where the goal is to group unlabeled examples into meaningful categories. At its core, the clustering problem relies on finding similar examples. Embeddings make this possible: they represent each example as a vector, so that similarity between examples can be measured as closeness between vectors.
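
As a rough illustration of how embeddings expose similarity, the sketch below compares toy vectors with cosine similarity. The three-dimensional vectors and their names are hypothetical stand-ins; they are not taken from the dataset, whose embeddings have far more dimensions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-dimensional "embeddings" for illustration only.
parsing_a = np.array([0.9, 0.1, 0.0])
parsing_b = np.array([0.8, 0.2, 0.1])
translation = np.array([0.1, 0.9, 0.3])

sim_related = cosine_similarity(parsing_a, parsing_b)
sim_unrelated = cosine_similarity(parsing_a, translation)
print(f"related: {sim_related:.3f}, unrelated: {sim_unrelated:.3f}")
```

Vectors that point in similar directions score close to 1, which is the kind of signal a clustering algorithm can group on.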

[Topic Modeling](https://en.wikipedia.org/wiki/Topic_model) automatically identifies thematic structures within a large collection of text documents. In our case, topic modeling is applied at the cluster level, using a sample of document `titles` to represent each cluster. This process combines [LangChain](https://www.langchain.com/) and [Pydantic](https://docs.pydantic.dev/) with [GPT-4](https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo) to define a topic pipeline that generates structured output.

In [3]:
%pip install --upgrade altair datasets hdbscan scikit-learn umap-learn --quiet

#### 1. arXiv Dataset w/ Embeddings

Our dataset is available on the [Hugging Face Hub](https://huggingface.co/datasets/dcarpintero/arxiv.cs.CL.embedv3.clustering.medium). It comprises the metadata of 10K arXiv articles in Computation and Language. Each metadata entry has been enriched with embeddings of the `title` and `abstract` fields, generated with Cohere's Embed-v3. These embeddings enable us to establish semantic connections among the articles for our clustering task.

In [5]:
from datasets import load_dataset
import tqdm as notebook_tqdm

ds = load_dataset("dcarpintero/arxiv.cs.CL.embedv3.clustering.medium", split="train")

In [6]:
ds

Dataset({
    features: ['url', 'url_pdf', 'title', 'authors', 'primary_category', 'categories', 'abstract', 'updated', 'published', 'embeddings_title', 'embeddings_abstract'],
    num_rows: 10000
})

#### 2. HDBSCAN Clustering w/ Cohere Embedv3

[HDBSCAN](https://en.wikipedia.org/wiki/HDBSCAN) (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is an advanced clustering algorithm that extends DBSCAN by adapting to clusters of varying density. **Unlike K-Means, HDBSCAN does not require pre-specifying the number of clusters. Its most important hyperparameter is the minimum cluster size (`min_cluster_size` in the `hdbscan` library), which sets the minimum number of examples needed to form a cluster**. As a density-based method, it can also detect outliers in the data.

In practice, it works by first transforming the space according to the density of the data points, so that points in sparse regions are pushed further away from everything else while dense regions (areas where many data points lie close together) stay compact. The algorithm then builds a hierarchy of clusters based on the minimum cluster size, allowing it to distinguish between noise (sparse areas) and dense regions (potential clusters). Finally, HDBSCAN condenses this hierarchy to derive the most persistent clusters, which lets it identify clusters of different densities and shapes.
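
One common formulation of the density transform described above is the mutual reachability distance, `max(core_k(a), core_k(b), d(a, b))`, where `core_k(p)` is the distance from `p` to its k-th nearest neighbour. The following is a minimal NumPy sketch of that formula on toy data; the function name and the brute-force approach are illustrative assumptions, and library implementations differ in details and are far more efficient.

```python
import numpy as np

def mutual_reachability(points: np.ndarray, k: int) -> np.ndarray:
    """Pairwise mutual reachability distances for a small point set.

    core_k(p) is the distance from p to its k-th nearest neighbour;
    d_mreach(a, b) = max(core_k(a), core_k(b), d(a, b)).
    """
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    # Column 0 of each sorted row is the point itself (distance 0),
    # so column k holds the distance to the k-th nearest neighbour.
    core = np.sort(dist, axis=1)[:, k]
    mreach = np.maximum(dist, np.maximum(core[:, None], core[None, :]))
    np.fill_diagonal(mreach, 0.0)
    return mreach

# Three tightly packed points plus one isolated point (toy data).
pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
mreach = mutual_reachability(pts, k=2)
print(np.round(mreach, 3))
```

In this toy example the three nearby points remain close to each other, while every distance involving the isolated point stays large, which is what later allows sparse points to be treated as noise.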

##### 2.1 Dimensionality Reduction

We perform [dimensionality reduction](https://en.wikipedia.org/wiki/Dimensionality_reduction) over the `abstracts embeddings` to reduce the computational complexity and memory usage of the clustering process. In this regard, [UMAP](https://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction#Uniform_manifold_approximation_and_projection) is a popular technique known for its effectiveness in preserving both the local and global structure of the data, making it a preferred choice for complex datasets, including those with high-dimensional embeddings.

Although HDBSCAN clustering allows for dense (micro)-clusters to be found, for simplicity we specify a minimum of `100` related documents to form a cluster. We might try other sizes such as `50` and `75` to see the results.

In [7]:
import umap

umap_reducer = umap.UMAP(n_neighbors=100, n_components=5, min_dist=0.1, metric='cosine')
umap_embedding = umap_reducer.fit_transform(ds['embeddings_abstract'])

##### 2.2 Clustering Abstract Embeddings

The next step is to cluster the reduced embeddings. In particular, the [HDBSCAN](https://en.wikipedia.org/wiki/HDBSCAN) instance is configured to require a minimum cluster size (note that we set the same value as the number of neighbors in UMAP), and uses the Euclidean metric for measuring the distance between data points. The cluster selection is performed using the 'Excess of Mass' (EOM) method, which balances the density and size of clusters.

In [8]:
import hdbscan

hdbscan_model = hdbscan.HDBSCAN(min_cluster_size=100, metric='euclidean', cluster_selection_method='eom')
clusters = hdbscan_model.fit_predict(umap_embedding)
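
Before visualizing, it can help to check how many clusters were found and how much of the data was labelled as noise (label `-1`). The helper below is a small sketch on a toy label array; `summarize_labels` and `toy_labels` are hypothetical names, with `toy_labels` standing in for the `clusters` array above.

```python
import numpy as np

def summarize_labels(labels: np.ndarray) -> dict:
    """Count points per cluster and report the share labelled as noise (-1)."""
    ids, counts = np.unique(labels, return_counts=True)
    sizes = {int(i): int(c) for i, c in zip(ids, counts) if i != -1}
    noise_share = float(np.mean(labels == -1))
    return {"n_clusters": len(sizes), "sizes": sizes, "noise_share": noise_share}

# Toy labels standing in for the array returned by fit_predict.
toy_labels = np.array([0, 0, 1, -1, 1, 1, -1, 0, 2, 2])
summary = summarize_labels(toy_labels)
print(summary)
```

A large noise share may suggest trying a smaller minimum cluster size, such as the `50` or `75` mentioned earlier.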

After clustering, we prepare the dataset for visualization by projecting the abstract embeddings to two dimensions with a second UMAP run, and we attach the cluster labels obtained above.

In [9]:
import pandas as pd

reduced_embeddings = umap.UMAP(n_neighbors=100, n_components=2, min_dist=0.1, metric='cosine').fit_transform(ds['embeddings_abstract'])
df = pd.DataFrame(reduced_embeddings, columns=['x', 'y'])
df['cluster'] = clusters
df['title'] = ds['title']

df = df[df['cluster'] != -1] # remove outliers

In [10]:
df.head()

| | x | y | cluster | title |
|---|---|---|---|---|
| 1 | 5.88633 | 2.620106 | 8 | Modelling Users, Intentions, and Structure in ... |
| 2 | 11.286769 | 4.433658 | 10 | A Lexicalized Tree Adjoining Grammar for English |
| 4 | 11.406404 | 4.353072 | 10 | Conditions on Consistency of Probabilistic Tre... |
| 5 | 11.453904 | 4.395419 | 10 | Separating Dependency from Constituency in a T... |
| 6 | 11.561894 | 4.47651 | 10 | Incremental Parser Generation for Tree Adjoini... |


#### 3. Topic Modeling w/ OpenAI, Pydantic, and LangChain

In this section, we illustrate how to identify the topic of each cluster by combining an LLM such as GPT-4 with Pydantic and LangChain for creating a topic modeling pipeline.

In [11]:
%pip install langchain openai --quiet

from langchain.chat_models.openai import ChatOpenAI
from langchain.output_parsers import PydanticOutputParser
from langchain.prompts import PromptTemplate
from pydantic import BaseModel, Field
from typing import List

##### 3.1 Pydantic Model for Topic

[Pydantic Models](https://docs.pydantic.dev/latest/concepts/models/) are classes that derive from `pydantic.BaseModel`, defining fields as type-annotated attributes. They bear a strong resemblance to `Python` dataclasses. However, they have been designed with subtle but significant differences that optimize various operations such as validation, serialization, and `JSON` schema generation. Our `Topic` class defines a field named `category`. This enables the large language model to generate structured output. In our case, we will receive a `Topic` object as response, instead of a block of text.

In [12]:
class Topic(BaseModel):
    """
    Pydantic Model to generate a structured Topic Model
    """
    category: str = Field(..., description="Identified topic")
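
To make the benefit of a typed model concrete, the sketch below parses a JSON reply into a `Topic` and shows that a malformed reply is rejected. The JSON strings are made-up examples, and the class is re-declared only to keep the snippet self-contained.

```python
import json
from pydantic import BaseModel, Field, ValidationError

class Topic(BaseModel):
    """Same shape as the Topic model above."""
    category: str = Field(..., description="Identified topic")

# A well-formed reply parses into a typed object.
raw = '{"category": "Machine Translation"}'
topic = Topic(**json.loads(raw))
print(topic.category)

# A reply missing the required field is rejected instead of silently accepted.
try:
    Topic(**json.loads('{"label": "Machine Translation"}'))
    rejected = False
except ValidationError:
    rejected = True
print(rejected)
```

Constructing the model from a parsed dictionary is used here because it behaves the same way across Pydantic major versions.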

##### 3.2 LangChain Prompt Template for Topic Identification

[LangChain Prompt Templates](https://python.langchain.com/docs/modules/model_io/prompts/prompt_templates/) are pre-defined recipes for generating prompts for language models.
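
As a rough picture of what filling such a template involves, the sketch below uses plain `str.format` as a stand-in, assuming the template uses f-string-style placeholders as the notebook's `topic_prompt` does. The shortened template and the paper titles are hypothetical. Note the doubled braces, which produce literal braces in the output.

```python
# Shortened stand-in for a topic prompt: one placeholder, one escaped JSON example.
template = 'TITLES:\n{titles}\n\nEXPECTED OUTPUT:\n{{"category": "Topic Name"}}'

# Hypothetical titles for illustration.
titles = ["Paper A on parsing", "Paper B on parsing"]

# {titles} is substituted; {{ and }} collapse to literal { and }.
filled = template.format(titles="\n".join(titles))
print(filled)
```

Joining the titles with newlines is a design choice for readability; passing the list directly would insert its Python representation instead.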

In [13]:
topic_prompt = """
    Your task is to analyze a set of research paper titles related to Natural Language Processing and determine the overarching topic of the cluster.
    Based on the titles provided, you should identify and label the most relevant topic. The response should be concise, clearly stating the single
    identified topic in JSON format. No additional information or follow-up questions are needed.

    TITLES:
    {titles}

    EXPECTED OUTPUT:
    {{"category": "Topic Name"}}
    """

##### 3.3 Implement Topic Identification w/ LangChain

This section illustrates how to compose a topic pipeline using the [LangChain Expression Language (LCEL)](https://python.langchain.com/docs/expression_language/).

In [20]:
from google.colab import userdata

def TopicModeling(titles: List[str]) -> Topic:
    """
    Infer the common topic of the given titles w/ LangChain, Pydantic, OpenAI
    """
    openai_api_key = userdata.get('OPENAI_API_KEY')
    llm = ChatOpenAI(model='gpt-4-1106-preview', temperature=0.1, max_tokens=100, openai_api_key=openai_api_key)
    prompt = PromptTemplate.from_template(topic_prompt)
    parser = PydanticOutputParser(pydantic_object=Topic)

    topic_chain = prompt | llm | parser
    return topic_chain.invoke({"titles": titles})
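
The `|` operator in `prompt | llm | parser` chains components so that each one's output feeds the next. The toy class below sketches that idea with operator overloading; `Step` is a hypothetical stand-in and not LangChain's actual implementation, and the "LLM" stage returns a canned reply rather than calling an API.

```python
import json
from typing import Any, Callable

class Step:
    """Hypothetical stand-in for a chain component; not LangChain's Runnable."""

    def __init__(self, fn: Callable[[Any], Any]):
        self.fn = fn

    def __or__(self, other: "Step") -> "Step":
        # a | b yields a step that runs a, then feeds its output to b.
        return Step(lambda x: other.fn(self.fn(x)))

    def invoke(self, x: Any) -> Any:
        return self.fn(x)

# Toy stages mirroring prompt | llm | parser.
prompt = Step(lambda d: f"TITLES: {d['titles']}")
fake_llm = Step(lambda text: '{"category": "Parsing"}')  # canned reply, no API call
parser = Step(json.loads)

chain = prompt | fake_llm | parser
result = chain.invoke({"titles": ["Paper A", "Paper B"]})
print(result["category"])
```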

We found that a representation of 10 paper titles from each cluster enables the model to infer the cluster's topic. Note that the loop below takes the first 10 titles of each cluster via `head(10)` rather than a random draw.

In [21]:
topics = []
for i, cluster in df.groupby('cluster'):
    titles = cluster['title'].head(10).tolist()
    topic = TopicModeling(titles)
    topics.append(topic.category)
    print(f"Cluster {i}: {topic.category}")

Cluster 0: Text Summarization
Cluster 1: Sentiment Analysis
Cluster 2: Question Answering Systems in Natural Language Processing
Cluster 3: Machine Translation
Cluster 4: Named Entity Recognition
Cluster 5: Information Extraction
Cluster 6: Biomedical Natural Language Processing
Cluster 7: Natural Language Generation
Cluster 8: Dialogue Systems and Speech Processing
Cluster 9: Speech Processing and Recognition
Cluster 10: Syntactic Parsing and Grammar Formalisms
Cluster 11: Computational Morphology
Cluster 12: Word Sense Disambiguation
Cluster 13: Natural Language Processing Techniques and Models
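
The loop above selects titles with `head(10)`, that is, the first ten rows of each cluster. If a random draw is preferred, one option is pandas' `sample` with a fixed seed for reproducibility. The sketch below uses a toy frame in place of `df`; `sample_titles` and the seed value are illustrative assumptions.

```python
import pandas as pd

# Toy frame standing in for the notebook's `df` (cluster id and title columns).
toy = pd.DataFrame({
    "cluster": [0] * 5 + [1] * 3,
    "title": [f"Paper {i}" for i in range(8)],
})

def sample_titles(frame: pd.DataFrame, n: int, seed: int = 42) -> dict:
    """Return up to `n` randomly chosen titles per cluster, reproducibly."""
    out = {}
    for cluster_id, group in frame.groupby("cluster"):
        k = min(n, len(group))  # small clusters may hold fewer than n titles
        out[cluster_id] = group["title"].sample(n=k, random_state=seed).tolist()
    return out

samples = sample_titles(toy, n=4)
print(samples)
```

Different samples may lead the model to different labels, so fixing the seed keeps runs comparable.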


In [23]:
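n_clusters = len(df['cluster'].unique())

# Pair each cluster label with its topic; groupby above iterated labels in sorted order.
topic_map = dict(zip(sorted(df['cluster'].unique()), topics))
df['topic'] = df['cluster'].map(topic_map)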

#### 4. Visualization

In [24]:
# Quote the requirement so the shell does not treat '>=' as an output redirect.
%pip install "vegafusion[embed]>=1.5.0" --quiet

import altair as alt
alt.data_transformers.enable("vegafusion")

DataTransformerRegistry.enable('vegafusion')

In [25]:
chart = alt.Chart(df).mark_circle(size=5).encode(
    x='x',
    y='y',
    color='topic:N',
    tooltip=['title', 'topic']
).interactive().properties(
    title='10K arXiv Abstracts in NLP | Cohere Embedv3 | UMAP | HDBSCAN | OpenAI',
    width=600,
    height=400,
)
chart.display()

##### 4.1 Top 15 Topics

In [26]:
df['topic'].value_counts().head(15)

topic
Sentiment Analysis                                           1180
Machine Translation                                          1131
Dialogue Systems and Speech Processing                        608
Question Answering Systems in Natural Language Processing     547
Word Sense Disambiguation                                     528
Syntactic Parsing and Grammar Formalisms                      406
Natural Language Processing Techniques and Models             406
Text Summarization                                            351
Speech Processing and Recognition                             290
Computational Morphology                                      256
Biomedical Natural Language Processing                        227
Natural Language Generation                                   221
Named Entity Recognition                                      203
Information Extraction                                        154
Name: count, dtype: int64