<a href="https://colab.research.google.com/github/subhashpolisetti/Clustering-Techniques-and-Embeddings/blob/main/7_Document_Clustering_with_Sentence_Transformers_and_KMeans.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Clustering Documents with Sentence Transformers and KMeans

This notebook demonstrates how to perform document clustering using **Sentence Transformers** for generating embeddings and **KMeans** for clustering. We will use a set of example documents to show how the embeddings are generated, and then apply clustering techniques to categorize the documents into groups.

### Steps in the Notebook:
1. **Generate Document Embeddings**:
   - We use the **SentenceTransformer** model (`all-MiniLM-L6-v2`) to generate embeddings for a collection of documents.
   - Each embedding is a 384-dimensional vector that captures the semantic meaning of its document, so documents with similar meaning map to nearby vectors.

2. **Clustering with KMeans**:
   - The embeddings are then clustered using the **KMeans** algorithm from scikit-learn, which partitions the documents into clusters by assigning each embedding to its nearest cluster centroid.
   - The number of clusters is specified (`num_clusters = 3` in this case), but it can be adjusted as needed.

3. **Displaying Clustering Results**:
   - For each document, we display the assigned cluster and a preview of the content.
   - This helps in understanding which documents are grouped together and the semantic similarities between them.

4. **Optional: Cluster Summarization**:
   - After clustering, we can generate summaries for each cluster to provide a high-level overview of the documents in each group.
   - This feature can be enhanced by using a language model to generate custom summaries for each cluster.
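
The number of clusters in step 2 is set by hand. One common way to compare candidate values is the mean silhouette score. The sketch below is an illustration on synthetic vectors rather than part of the original notebook; `pick_num_clusters`, `toy_vectors`, and the candidate range are hypothetical names and values chosen for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def pick_num_clusters(vectors, candidates, seed=42):
    """Return (best_k, scores), where scores maps k to the mean silhouette score."""
    scores = {}
    for k in candidates:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(vectors)
        scores[k] = float(silhouette_score(vectors, labels))
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Synthetic stand-in for document embeddings: three tight, well-separated groups.
rng = np.random.default_rng(0)
centers = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [10.0, 0.0, 0.0, 0.0],
    [0.0, 10.0, 0.0, 0.0],
])
toy_vectors = np.vstack([c + 0.1 * rng.standard_normal((20, 4)) for c in centers])

best_k, scores = pick_num_clusters(toy_vectors, candidates=range(2, 6))
print(best_k, scores)
```

In the notebook, the same function could be called on the array returned by `model.encode(documents)`. With only ten documents the scores are likely to be noisy, so treat the result as a hint rather than a definitive answer.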

### Example Documents (first five of the ten used):
- "Machine learning is a subset of artificial intelligence."
- "Natural language processing deals with the interaction between computers and human language."
- "Deep learning uses neural networks with multiple layers."
- "Reinforcement learning is learning what to do to maximize a reward."
- "Computer vision is the field of AI that trains computers to interpret visual information."

### Expected Output:
- The notebook will output the cluster assignments for each document, along with the content.
- You can also optionally generate summaries for each cluster, which can help in understanding the themes of each group.

This notebook provides a simple way to cluster text documents using sentence embeddings and KMeans. It can be extended to other natural language processing tasks such as topic modeling, document retrieval, and summarization.
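
As one example of such an extension, document retrieval can be sketched as a cosine-similarity search over the embedding matrix. The code below uses small hand-written vectors instead of real embeddings; `top_k_similar`, `toy_doc_vecs`, and `toy_query` are hypothetical names introduced only for this illustration.

```python
import numpy as np


def top_k_similar(query_vec, doc_vecs, k=2):
    """Return (indices, similarities) of the k rows most cosine-similar to query_vec."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q                      # cosine similarity of every document to the query
    order = np.argsort(-sims)[:k]     # highest similarity first
    return order.tolist(), sims[order].tolist()


# Toy vectors standing in for document embeddings.
toy_doc_vecs = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
toy_query = np.array([1.0, 0.05, 0.0])

indices, sims = top_k_similar(toy_query, toy_doc_vecs, k=2)
print(indices, sims)
```

With real data, `doc_vecs` would be the output of `model.encode(documents)` and `query_vec` the encoding of a single query string.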


In [None]:
!pip install llm


In [None]:
import click
import json
import llm
import numpy as np
import sklearn.cluster
import sqlite_utils
import textwrap

DEFAULT_SUMMARY_PROMPT = """
Short, concise title for this cluster of related documents.
""".strip()


@llm.hookimpl
def register_commands(cli):
    @cli.command()
    @click.argument("collection")
    @click.argument("n", type=int)
    @click.option(
        "--truncate",
        type=int,
        default=100,
        help="Truncate content to this many characters - 0 for no truncation",
    )
    @click.option(
        "-d",
        "--database",
        type=click.Path(
            file_okay=True, allow_dash=False, dir_okay=False, writable=True
        ),
        envvar="LLM_EMBEDDINGS_DB",
        help="SQLite database file containing embeddings",
    )
    @click.option(
        "--summary", is_flag=True, help="Generate summary title for each cluster"
    )
    @click.option("-m", "--model", help="LLM model to use for the summary")
    @click.option("--prompt", help="Custom prompt to use for the summary")
    def cluster(collection, n, truncate, database, summary, model, prompt):
        """
        Generate clusters from embeddings in a collection

        Example usage, to create 10 clusters:

        \b
            llm cluster my_collection 10

        Outputs a JSON array of {"id": "cluster_id", "items": [list of items]}

        Pass --summary to generate a summary for each cluster, using the default
        language model or the model you specify with --model.
        """
        from llm.cli import get_default_model, get_key

        clustering_model = sklearn.cluster.MiniBatchKMeans(n_clusters=n, n_init="auto")
        if database:
            db = sqlite_utils.Database(database)
        else:
            db = sqlite_utils.Database(llm.user_dir() / "embeddings.db")
        rows = [
            (row[0], llm.decode(row[1]), row[2])
            for row in db.execute(
                """
            select id, embedding, content from embeddings
            where collection_id = (
                select id from collections where name = ?
            )
        """,
                [collection],
            ).fetchall()
        ]
        to_cluster = np.array([item[1] for item in rows])
        clustering_model.fit(to_cluster)
        assignments = clustering_model.labels_

        def truncate_text(text):
            if not text:
                return None
            if truncate > 0:
                return text[:truncate]
            else:
                return text

        # Each one corresponds to an ID
        clusters = {}
        for (id, _, content), cluster in zip(rows, assignments):
            clusters.setdefault(str(cluster), []).append(
                {"id": str(id), "content": truncate_text(content)}
            )
        # Re-arrange into a list
        output_clusters = [{"id": k, "items": v} for k, v in clusters.items()]

        # Do we need to generate summaries?
        if summary:
            model = llm.get_model(model or get_default_model())
            if model.needs_key:
                model.key = get_key("", model.needs_key, model.key_env_var)
            prompt = prompt or DEFAULT_SUMMARY_PROMPT
            click.echo("[")
            for cluster, is_last in zip(
                output_clusters, [False] * (len(output_clusters) - 1) + [True]
            ):
                click.echo("  {")
                click.echo('    "id": {},'.format(json.dumps(cluster["id"])))
                click.echo(
                    '    "items": '
                    + textwrap.indent(
                        json.dumps(cluster["items"], indent=2), "    "
                    ).lstrip()
                    + ","
                )
                prompt_content = "\n".join(
                    [item["content"] for item in cluster["items"] if item["content"]]
                )
                if prompt_content.strip():
                    summary = model.prompt(
                        prompt_content,
                        system=prompt,
                    ).text()
                else:
                    summary = None
                click.echo('    "summary": {}'.format(json.dumps(summary)))
                click.echo("  }" + ("," if not is_last else ""))
            click.echo("]")
        else:
            click.echo(json.dumps(output_clusters, indent=4))
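
The regrouping step in the plugin code above, which turns per-row cluster labels into the JSON structure described in the docstring, can be tried in isolation. This is a simplified sketch that needs no database or embedding model; `group_by_cluster` and the sample rows are hypothetical and only mirror the logic of the `cluster` command.

```python
import json


def group_by_cluster(rows, assignments, truncate=100):
    """Group (id, content) rows by cluster label into [{"id": ..., "items": [...]}]."""
    clusters = {}
    for (doc_id, content), label in zip(rows, assignments):
        if content and truncate > 0:
            content = content[:truncate]
        clusters.setdefault(str(label), []).append(
            {"id": str(doc_id), "content": content or None}
        )
    # Re-arrange the dict into a list, in order of first appearance of each label
    return [{"id": key, "items": items} for key, items in clusters.items()]


sample_rows = [
    ("doc_0", "Machine learning is a subset of artificial intelligence."),
    ("doc_1", "Regression predicts continuous values."),
    ("doc_2", "Deep learning uses neural networks with multiple layers."),
]
sample_assignments = [1, 0, 1]

grouped = group_by_cluster(sample_rows, sample_assignments, truncate=20)
print(json.dumps(grouped, indent=4))
```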

In [None]:
import numpy as np
import llm
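from sklearn.cluster import KMeans

# Sample documents (the same ten sentences are used throughout the notebook)
documents = [
    "Machine learning is a subset of artificial intelligence.",
    "Natural language processing deals with the interaction between computers and human language.",
    "Deep learning uses neural networks with multiple layers.",
    "Reinforcement learning is learning what to do to maximize a reward.",
    "Computer vision is the field of AI that trains computers to interpret visual information.",
    "Clustering is an unsupervised learning technique.",
    "Classification is a supervised learning task.",
    "Regression predicts continuous values.",
    "Neural networks are inspired by the human brain.",
    "Support vector machines are used for classification and regression tasks."
]

# Create a collection and embed the documents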
collection = llm.Collection("documents", model_id="sentence-transformers/all-MiniLM-L6-v2")
embeddings = []
valid_documents = []

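# Collection.embed() stores the vector but returns None, so fetch the vector
# from the embedding model directly to keep an in-memory copy.
# Assumes the llm-sentence-transformers plugin is installed so this model ID resolves.
embedding_model = llm.get_embedding_model("sentence-transformers/all-MiniLM-L6-v2")

for i, doc in enumerate(documents):
    try:
        collection.embed(f"doc_{i}", doc, store=True)
        embedding = embedding_model.embed(doc)
        if embedding is not None:
            embeddings.append(embedding)
            valid_documents.append(doc)
        else:
            print(f"Warning: Embedding for document {i} is None")
    except Exception as e:
        print(f"Error embedding document {i}: {str(e)}")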

# Print the number of embeddings
print(f"Number of embeddings: {len(embeddings)}")

# Convert embeddings to numpy array
if embeddings:
    embeddings_array = np.array(embeddings)
    print(f"Shape of embeddings array: {embeddings_array.shape}")

    # Check for NaN values
    nan_count = np.isnan(embeddings_array).sum()
    print(f"Number of NaN values: {nan_count}")

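    # Remove any rows containing NaN values, keeping the documents aligned with the rows
    valid_rows = ~np.isnan(embeddings_array).any(axis=1)
    embeddings_array = embeddings_array[valid_rows]
    valid_documents = [doc for doc, keep in zip(valid_documents, valid_rows) if keep]
    print(f"Shape after removing NaNs: {embeddings_array.shape}")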

    # Check if we have any valid embeddings
    if embeddings_array.size > 0:
        # Perform K-means clustering
        num_clusters = min(3, len(embeddings_array))  # Ensure we don't have more clusters than data points
        kmeans = KMeans(n_clusters=num_clusters, random_state=42)
        cluster_labels = kmeans.fit_predict(embeddings_array)

        # Print the clustering results
        for i, (doc, label) in enumerate(zip(valid_documents, cluster_labels)):
            print(f"Document {i}: Cluster {label}")
            print(f"Content: {doc}")
            print()

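        # Try to generate summaries for each cluster with llm-cluster.
        # NOTE: the plugin code shown earlier only registers the `llm cluster` CLI
        # command and defines no `cluster_embeddings` function, so this import is
        # expected to fail and the except branch below reports the error.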
        try:
            from llm_cluster import cluster_embeddings
            cluster_results = cluster_embeddings(collection, num_clusters, summary=True)

            # Print the cluster summaries
            for cluster in cluster_results:
                print(f"Cluster {cluster['id']}:")
                print(f"Summary: {cluster['summary']}")
                print("Items:")
                for item in cluster['items']:
                    print(f"- {item['content']}")
                print()
        except Exception as e:
            print(f"An error occurred while generating cluster summaries: {str(e)}")
    else:
        print("No valid embeddings after removing NaNs. Cannot perform clustering.")
else:
    print("No valid embeddings. Cannot perform clustering.")

Number of embeddings: 0
No valid embeddings. Cannot perform clustering.


In [None]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')

In [None]:
for i, doc in enumerate(documents):
    print(f"Document {i}: {doc}")

Document 0: Machine learning is a subset of artificial intelligence.
Document 1: Natural language processing deals with the interaction between computers and human language.
Document 2: Deep learning uses neural networks with multiple layers.
Document 3: Reinforcement learning is learning what to do to maximize a reward.
Document 4: Computer vision is the field of AI that trains computers to interpret visual information.
Document 5: Clustering is an unsupervised learning technique.
Document 6: Classification is a supervised learning task.
Document 7: Regression predicts continuous values.
Document 8: Neural networks are inspired by the human brain.
Document 9: Support vector machines are used for classification and regression tasks.


In [None]:
embeddings = model.encode(documents)
print(f"Shape of embeddings: {embeddings.shape}")

Shape of embeddings: (10, 384)


In [None]:
embeddings = []
for i, doc in enumerate(documents):
    try:
        embedding = model.encode(doc)
        embeddings.append(embedding)
    except Exception as e:
        print(f"Error embedding document {i}: {str(e)}")

embeddings_array = np.array(embeddings)

In [None]:
batch_size = 10
all_embeddings = []
for i in range(0, len(documents), batch_size):
    batch = documents[i:i+batch_size]
    batch_embeddings = model.encode(batch)
    all_embeddings.extend(batch_embeddings)

embeddings_array = np.array(all_embeddings)
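
KMeans groups points by Euclidean distance, while embedding similarity is often judged by cosine similarity. For unit-length vectors the two agree, because the squared Euclidean distance equals `2 - 2 * cosine`. The sketch below checks this on toy vectors; `l2_normalize` is a hypothetical helper, and whether a given model already returns unit-length vectors is an assumption worth checking before relying on it.

```python
import numpy as np


def l2_normalize(vectors, eps=1e-12):
    """Scale each row to unit length (eps guards against division by zero)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, eps)


toy = np.array([[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]])
unit = l2_normalize(toy)

a, b = unit[0], unit[1]
squared_distance = float(np.sum((a - b) ** 2))
cosine = float(a @ b)

# For unit vectors: ||a - b||^2 == 2 - 2 * cos(a, b)
print(squared_distance, 2 - 2 * cosine)
```

In the notebook this would be applied as `embeddings_array = l2_normalize(embeddings_array)` before clustering. `SentenceTransformer.encode` also accepts `normalize_embeddings=True`, which should have the same effect.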

In [None]:
import numpy as np
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

# Load the sentence transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Create a sample dataset of documents
documents = [
    "Machine learning is a subset of artificial intelligence.",
    "Natural language processing deals with the interaction between computers and human language.",
    "Deep learning uses neural networks with multiple layers.",
    "Reinforcement learning is learning what to do to maximize a reward.",
    "Computer vision is the field of AI that trains computers to interpret visual information.",
    "Clustering is an unsupervised learning technique.",
    "Classification is a supervised learning task.",
    "Regression predicts continuous values.",
    "Neural networks are inspired by the human brain.",
    "Support vector machines are used for classification and regression tasks."
]

# Generate embeddings
embeddings = model.encode(documents)

# Perform K-means clustering
num_clusters = 3  # Adjust as needed
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
cluster_labels = kmeans.fit_predict(embeddings)

# Print the clustering results
for i, (doc, label) in enumerate(zip(documents, cluster_labels)):
    print(f"Document {i}: Cluster {label}")
    print(f"Content: {doc[:100]}...")  # Print first 100 characters
    print()

# Optional: Generate summaries for each cluster
# This part depends on how you want to summarize the clusters
# You might need to implement a custom summarization method

Document 0: Cluster 2
Content: Machine learning is a subset of artificial intelligence....

Document 1: Cluster 1
Content: Natural language processing deals with the interaction between computers and human language....

Document 2: Cluster 2
Content: Deep learning uses neural networks with multiple layers....

Document 3: Cluster 0
Content: Reinforcement learning is learning what to do to maximize a reward....

Document 4: Cluster 2
Content: Computer vision is the field of AI that trains computers to interpret visual information....

Document 5: Cluster 1
Content: Clustering is an unsupervised learning technique....

Document 6: Cluster 1
Content: Classification is a supervised learning task....

Document 7: Cluster 1
Content: Regression predicts continuous values....

Document 8: Cluster 2
Content: Neural networks are inspired by the human brain....

Document 9: Cluster 1
Content: Support vector machines are used for classification and regression tasks....
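
The final cell leaves cluster summarization open. One simple option that needs no language model is to pick, for each cluster, the document whose embedding lies closest to the cluster mean and show it as a representative. This is a sketch of that idea on toy vectors, not the notebook's method; `representative_indices`, `toy_vectors`, and `toy_labels` are hypothetical names.

```python
import numpy as np


def representative_indices(vectors, labels):
    """Map each cluster label to the index of the member closest to the cluster mean."""
    reps = {}
    for label in np.unique(labels):
        members = np.flatnonzero(labels == label)        # row indices in this cluster
        centroid = vectors[members].mean(axis=0)
        distances = np.linalg.norm(vectors[members] - centroid, axis=1)
        reps[int(label)] = int(members[np.argmin(distances)])
    return reps


# Toy data: two clusters of three points each.
toy_vectors = np.array([
    [0.0, 0.0], [0.2, 0.0], [1.0, 0.0],
    [5.0, 5.0], [5.0, 6.0], [5.0, 5.4],
])
toy_labels = np.array([0, 0, 0, 1, 1, 1])

reps = representative_indices(toy_vectors, toy_labels)
print(reps)
```

With the notebook's variables, `representative_indices(embeddings, cluster_labels)` would return one index per cluster, and `documents[index]` would give the representative text.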

