### Clustering and Topic Modeling of arXiv dataset (50k) w/ Cohere Embedv3 | Pydantic | OpenAI | LangChain

[Clustering](https://en.wikipedia.org/wiki/Cluster_analysis) is a fundamental task in unsupervised learning, where the goal is to group unlabeled examples into meaningful categories. At its core, clustering relies on finding similar examples, and embeddings provide the representation in which that similarity is measured.

[Topic Modeling]

This notebook demonstrates how to combine the [Cohere Embedv3 model](https://txt.cohere.com/introducing-embed-v3/) with [HDBSCAN](https://en.wikipedia.org/wiki/HDBSCAN) clustering. Our playground is an arXiv dataset comprising 50,000 research articles from Computational Linguistics (Natural Language Processing). To assess the clustering and topic modeling results qualitatively, we visualize them after applying [UMAP](https://en.wikipedia.org/wiki/Uniform_Manifold_Approximation_and_Projection) dimensionality reduction.


#### 1. arXiv Dataset w/ Embeddings

In [3]:
%pip install --upgrade altair datasets hdbscan scikit-learn umap-learn --quiet

Note: you may need to restart the kernel to use updated packages.


Our dataset is available on the [HuggingFace Hub](https://huggingface.co/datasets/dcarpintero/arXiv.cs.AI.CL.CV.LG.MA.NE.embedv3). It comprises the metadata of the 50K most recent arXiv articles (up to 17 November 2023) in Computation and Language. Each metadata entry has been enriched with embeddings for the 'title' and 'abstract', generated using Cohere's Embed-v3. These embeddings let us establish semantic connections among the articles for our clustering task. Note that the cell below loads the `arxiv.cs.CL.embedv3.clustering.mini` variant, which contains 5,000 rows.
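
To make "semantic connections" concrete, here is a minimal sketch of cosine similarity, the metric the UMAP step later in this notebook is configured with. The three vectors are made-up toy values standing in for real Embed-v3 embeddings, and `cosine_similarity` is an illustrative helper, not part of any library used here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors (made-up values, not real embeddings).
parsing_paper_1 = [0.9, 0.1, 0.0]
parsing_paper_2 = [0.8, 0.2, 0.1]
translation_paper = [0.1, 0.9, 0.3]

sim_same_topic = cosine_similarity(parsing_paper_1, parsing_paper_2)
sim_diff_topic = cosine_similarity(parsing_paper_1, translation_paper)
print(f"same topic: {sim_same_topic:.3f}, different topic: {sim_diff_topic:.3f}")
```

Vectors pointing in similar directions score close to 1, which is the property a clustering algorithm can exploit to group related articles.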

In [6]:
from datasets import load_dataset

# Load the 5,000-row "mini" variant with precomputed title and abstract embeddings
ds = load_dataset("dcarpintero/arxiv.cs.CL.embedv3.clustering.mini", split="train")



In [7]:
ds

Dataset({
    features: ['url', 'url_pdf', 'title', 'authors', 'primary_category', 'categories', 'abstract', 'updated', 'published', 'embeddings_title', 'embeddings_abstract'],
    num_rows: 5000
})

#### 2. HDBSCAN Clustering w/ Cohere Embedv3

[HDBSCAN](https://en.wikipedia.org/wiki/HDBSCAN) (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that extends DBSCAN by adapting to clusters of varying density. Unlike K-Means, HDBSCAN does not require pre-specifying the number of clusters. Its main hyperparameter is the minimum cluster size `n` (passed as `min_cluster_size` in the code below), which sets the minimum number of examples needed to form a cluster.

In practice, it works by first transforming the space according to the density of the data points, making denser regions (areas where data points are close together in high numbers) more attractive for cluster formation. The algorithm then builds a hierarchy of clusters based on the minimum cluster size established by the hyperparameter `n`, allowing it to distinguish between noise (sparse areas) and dense regions (potential clusters). Finally, HDBSCAN condenses this hierarchy to derive the most persistent clusters, efficiently identifying clusters of different densities and shapes.
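
The "transforming the space" step is commonly described in terms of the mutual reachability distance: each point gets a core distance (the distance to its k-th nearest neighbor), and the distance between two points becomes the maximum of their two core distances and their original distance. The sketch below illustrates this on toy 2D points. The helper names and the parameter `k` are illustrative assumptions, and the exact neighbor-counting convention inside the `hdbscan` library may differ.

```python
import math

def core_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbor (excluding itself)."""
    dists = sorted(math.dist(points[i], p) for j, p in enumerate(points) if j != i)
    return dists[k - 1]

def mutual_reachability(points, i, j, k):
    """max(core_k(i), core_k(j), d(i, j)): points in sparse regions are pushed apart."""
    return max(
        core_distance(points, i, k),
        core_distance(points, j, k),
        math.dist(points[i], points[j]),
    )

# Three tightly packed points plus one isolated point (toy data).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0)]
k = 2

dense_pair = mutual_reachability(points, 0, 1, k)
sparse_pair = mutual_reachability(points, 0, 3, k)
print(f"dense pair: {dense_pair:.3f}, pair involving the isolated point: {sparse_pair:.3f}")
```

Pairs inside the dense group keep a small distance, while any pair involving the isolated point is at least as far apart as that point's core distance, which is what lets the later hierarchy treat it as noise.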

##### 2.1 Dimensionality Reduction

In [12]:
import umap

umap_reducer = umap.UMAP(n_neighbors=25, n_components=5, min_dist=0.1, metric='cosine')
umap_embedding = umap_reducer.fit_transform(ds['embeddings_abstract'])

##### 2.2 Clustering

In [13]:
import hdbscan

hdbscan_model = hdbscan.HDBSCAN(min_cluster_size=25, metric='euclidean', cluster_selection_method='eom')
clusters = hdbscan_model.fit_predict(umap_embedding)
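
Before plotting, it can help to check how many clusters were found and how many points were labeled as noise (HDBSCAN assigns the label `-1` to outliers). The following is a small sketch on a toy label list. `summarize_labels` is a hypothetical helper, not part of `hdbscan`.

```python
from collections import Counter

def summarize_labels(labels):
    """Return (number of clusters, noise fraction, cluster sizes) for HDBSCAN-style labels."""
    labels = [int(label) for label in labels]
    # Count members per cluster, ignoring the noise label -1
    sizes = Counter(label for label in labels if label != -1)
    noise_fraction = labels.count(-1) / len(labels)
    return len(sizes), noise_fraction, dict(sizes)

# Toy labels standing in for the `clusters` array produced above.
toy_labels = [0, 0, 1, -1, 1, 1, -1, 2]
n_found, noise_fraction, sizes = summarize_labels(toy_labels)
print(f"{n_found} clusters, {noise_fraction:.0%} noise, sizes: {sizes}")
```

In the notebook, the same helper could be called as `summarize_labels(clusters)`. A very high noise fraction would suggest revisiting `min_cluster_size` or the UMAP settings.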

In [60]:
import pandas as pd

# Separate 2D projection used only for plotting; the cluster labels come from the 5D embedding above
reduced_embeddings = umap.UMAP(n_neighbors=25, n_components=2, min_dist=0.1, metric='cosine').fit_transform(ds['embeddings_abstract'])
df = pd.DataFrame(reduced_embeddings, columns=['x', 'y'])
df['cluster'] = clusters
df['title'] = ds['title']

df = df[df['cluster'] != -1]  # drop points that HDBSCAN labeled as noise (-1)

In [61]:
df.head()

|   | x | y | cluster | title |
|---|---|---|---|---|
| 1 | 4.725616 | 6.950188 | 3 | Modelling Users, Intentions, and Structure in ... |
| 2 | 8.453406 | 4.316781 | 14 | A Lexicalized Tree Adjoining Grammar for English |
| 4 | 8.890798 | 4.382251 | 14 | Conditions on Consistency of Probabilistic Tre... |
| 5 | 8.591914 | 4.438171 | 14 | Separating Dependency from Constituency in a T... |
| 6 | 8.623578 | 4.425744 | 14 | Incremental Parser Generation for Tree Adjoini... |


#### 3. Topic Modeling w/ OpenAI, Pydantic, and LangChain

In [53]:

from langchain.chat_models.openai import ChatOpenAI
from langchain.output_parsers import PydanticOutputParser
from langchain.prompts import PromptTemplate
from pydantic import BaseModel, Field
from typing import List

##### 3.1 Pydantic Model for Topic

In [54]:
class Topic(BaseModel):
    """
    Pydantic model for a structured topic
    """
    category: str = Field(..., description="Identified topic")

##### 3.2 Prompt Template for Topic Identification

In [55]:
topic_prompt = """
    Your task is to analyze a set of research paper titles related to Natural Language Processing and determine the overarching topic of the cluster.
    Based on the titles provided, you should identify and label the most relevant topic. The response should be concise, clearly stating the single 
    identified topic in JSON format. No additional information or follow-up questions are needed.

    TITLES:
    {titles}

    EXPECTED OUTPUT:
    {{"category": "Topic Name"}}
    """

##### 3.3 Implement Topic Identification Pipeline w/ LangChain

In [56]:
def TopicModeling(titles: List[str]) -> Topic:
    """
    Infer the common topic of the given titles w/ LangChain, Pydantic, OpenAI
    """
    llm = ChatOpenAI(model='gpt-4-1106-preview', temperature=0.1, max_tokens=100)
    prompt = PromptTemplate.from_template(topic_prompt)
    parser = PydanticOutputParser(pydantic_object=Topic)

    # Compose the pipeline: fill the prompt, call the model, parse the reply into a Topic
    topic_chain = prompt | llm | parser
    return topic_chain.invoke({"titles": titles})
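
To see what the two non-LLM stages of `prompt | llm | parser` have to accomplish, here is a stdlib-only sketch that renders a shortened version of the template and parses a hand-written reply. It is a stand-in for illustration, not LangChain's implementation: `render_prompt` and `parse_topic` are hypothetical helpers, the reply string is invented, and the rendering assumes the template uses `str.format`-style braces (so `{{` and `}}` become literal braces).

```python
import json
import re

# Shortened stand-in for the notebook's template (same placeholder, same brace escaping).
demo_prompt = """
TITLES:
{titles}

EXPECTED OUTPUT:
{{"category": "Topic Name"}}
"""

def render_prompt(titles):
    """Fill the {titles} placeholder with one title per line."""
    titles_block = "\n".join(f"- {title}" for title in titles)
    return demo_prompt.format(titles=titles_block)

def parse_topic(raw_reply):
    """Extract the 'category' string from a reply containing one JSON object."""
    match = re.search(r"\{.*\}", raw_reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in reply")
    payload = json.loads(match.group(0))
    category = payload.get("category")
    if not isinstance(category, str) or not category.strip():
        raise ValueError("'category' must be a non-empty string")
    return category.strip()

rendered = render_prompt([
    "A Lexicalized Tree Adjoining Grammar for English",
    "A second, made-up title",
])
fake_reply = 'Sure!\n{"category": "Syntactic Parsing"}'  # hand-written, not real model output
topic_name = parse_topic(fake_reply)

print(rendered)
print(topic_name)
```

Joining the titles into a bulleted block is an optional design choice. The notebook passes the Python list directly, which would presumably be rendered using the list's string representation.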

In [57]:
topics = []

for i, cluster in df.groupby('cluster'):
    titles = cluster['title'].head(10).tolist()
    topic = TopicModeling(titles)
    topics.append(topic.category)
    print(f"Cluster {i}: {topic.category}")

Cluster 0: Text Summarization
Cluster 1: Sentiment Analysis
Cluster 2: Question Answering Systems
Cluster 3: Dialogue Systems and Speech Processing
Cluster 4: Temporal Information Processing in Natural Language Processing
Cluster 5: Speech Processing and Analysis
Cluster 6: Chinese Word Segmentation and Part-of-Speech Tagging in Natural Language Processing
Cluster 7: Named Entity Recognition
Cluster 8: Knowledge Graphs and Entity Relationship Extraction in NLP
Cluster 9: Biomedical Natural Language Processing
Cluster 10: Stylometry and Authorship Attribution
Cluster 11: Machine Translation
Cluster 12: Discourse Analysis in Natural Language Processing
Cluster 13: Semantic Role Labeling
Cluster 14: Syntactic Parsing and Grammar Formalisms in Natural Language Processing
Cluster 15: Statistical Parsing and Grammar Models in Natural Language Processing
Cluster 16: Semantic Parsing
Cluster 17: Arabic Natural Language Processing
Cluster 18: Part-of-Speech Tagging
Cluster 19: Quantitative Ling

In [64]:
# Cluster ids in sorted order, matching the order in which groupby produced `topics`
cluster_ids = sorted(df['cluster'].unique())

topic_map = dict(zip(cluster_ids, topics))
df['topic'] = df['cluster'].map(topic_map)

#### 4. Visualization

In [65]:
import altair as alt

chart = alt.Chart(df).mark_circle(size=5).encode(
    x='x',
    y='y',
    color='topic:N',
    tooltip=['title', 'topic']
).interactive().properties(
    title='50K arXiv Abstracts in NLP | Cohere Embedv3 | UMAP | HDBSCAN | OpenAI',
    width=600,
    height=400,
)
chart.display()

##### 4.1 Top 10 Topics

In [48]:
df['topic'].value_counts().head(10)

topic
Machine Translation                                                         569
Sentiment Analysis                                                          560
Dialogue Systems and Speech Processing                                      394
Question Answering Systems in Natural Language Processing                   241
Word Embeddings and Representations in Natural Language Processing          195
Speech Processing                                                           195
Text Summarization                                                          135
Statistical Parsing and Grammar Induction in Natural Language Processing    119
Biomedical Natural Language Processing                                      118
Part-of-Speech Tagging                                                      112
Name: count, dtype: int64

----