<a target="_blank" href="https://colab.research.google.com/github/rapidsai-community/showcase/blob/main/getting_started_tutorials/rapids-pip-colab-template.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Install RAPIDS into Colab"/>
</a>

# RAPIDS cuDF is already on your Colab instance!
RAPIDS cuDF is preinstalled on Google Colab and accelerates pandas with zero code changes. [You can quickly get started with our tutorial notebook](https://nvda.ws/rapids-cudf). This notebook template is for users who want to use the full suite of RAPIDS libraries in their Colab workflows.

# Environment Sanity Check #

Click the _Runtime_ dropdown at the top of the page, then _Change Runtime Type_ and confirm the instance type is _GPU_.

Run `!nvidia-smi` to see which GPU you have; uncomment the cell below to do so. Currently, RAPIDS runs on all available Colab GPU instances.

In [1]:
# !nvidia-smi
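
If you would rather check the GPU from Python than read the `nvidia-smi` table, the following is a minimal sketch. The helper name `query_gpu_names` is hypothetical (it is not part of RAPIDS or Colab); it only wraps the `nvidia-smi` command mentioned above and returns an empty list when the tool is missing, which usually means the runtime type is not set to GPU.

```python
import shutil
import subprocess


def query_gpu_names():
    """Return the GPU names reported by nvidia-smi, or [] if none are visible."""
    if shutil.which("nvidia-smi") is None:
        return []  # nvidia-smi is not on PATH, so this is probably a CPU runtime
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit is handled below instead of raising
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpu_names = query_gpu_names()
print(gpu_names if gpu_names else "No GPU detected; switch the runtime type to GPU.")
```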

# Setup
This setup script:

1. Checks that the GPU is RAPIDS compatible
1. Pip-installs the RAPIDS libraries, which are:
   1. cuDF
   1. cuML
   1. cuGraph
   1. cuSpatial
   1. cuxFilter
   1. cuCIM
   1. XGBoost

# Controlling Which RAPIDS Version is Installed
This line in the cell below, `!python rapidsai-csp-utils/colab/pip-install.py`, kicks off the RAPIDS installation script.  You can control which RAPIDS version is installed by appending `latest` or `nightlies`, or by leaving the option blank for the default.  Example:

`!python rapidsai-csp-utils/colab/pip-install.py <option>`

You can now tell the script to install:
1. **RAPIDS + Colab default version**: leave the install script option blank (or give an invalid option). This adds the rest of the RAPIDS libraries to the RAPIDS cuDF library preinstalled on Colab.  **This is the default and recommended version.**  Example: `!python rapidsai-csp-utils/colab/pip-install.py`
1. **Latest known working RAPIDS stable version**: use the option `latest` to upgrade all RAPIDS libraries to the latest working RAPIDS stable version.  This usually gives early access to future RAPIDS+Colab functionality; some functionality may not work, and it can be the same as the default version.  Example: `!python rapidsai-csp-utils/colab/pip-install.py latest`
1. **Current nightlies version**: use the option `nightlies` to install the current RAPIDS nightlies version.  For RAPIDS developer use; **not recommended/untested**.  Example: `!python rapidsai-csp-utils/colab/pip-install.py nightlies`
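
The option handling described in the list above can be sketched as follows. This only illustrates the documented behaviour (blank or unrecognised options fall back to the default); it is not the source of `pip-install.py`, and the names `resolve_install_mode` and `build_install_command` are hypothetical.

```python
# Options the text above says the install script recognises.
RECOGNISED_OPTIONS = {"latest", "nightlies"}
BASE_COMMAND = "python rapidsai-csp-utils/colab/pip-install.py"


def resolve_install_mode(option=""):
    """Map a user-supplied option to 'latest', 'nightlies', or 'default'."""
    option = option.strip()
    # Blank or unrecognised options fall back to the default install.
    return option if option in RECOGNISED_OPTIONS else "default"


def build_install_command(option=""):
    """Build the command line that corresponds to the chosen install mode."""
    mode = resolve_install_mode(option)
    return BASE_COMMAND if mode == "default" else f"{BASE_COMMAND} {mode}"


for choice in ["", "latest", "nightlies", "not-an-option"]:
    print(f"{choice!r:>16} -> {build_install_command(choice)}")
```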


**This will complete in about 5-6 minutes**

In [2]:
# This gets the RAPIDS-Colab install files and checks your GPU.  Run this and the next cell only.
# Please read the output of this cell.  If your Colab Instance is not RAPIDS compatible, it will warn you and give you remediation steps.
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!python rapidsai-csp-utils/colab/pip-install.py


Cloning into 'rapidsai-csp-utils'...
remote: Enumerating objects: 592, done.
remote: Counting objects: 100% (158/158), done.
remote: Compressing objects: 100% (76/76), done.
remote: Total 592 (delta 125), reused 82 (delta 82), pack-reused 434 (from 3)
Receiving objects: 100% (592/592), 194.79 KiB | 21.64 MiB/s, done.
Resolving deltas: 100% (299/299), done.
Installing RAPIDS remaining 25.04 libraries
Using Python 3.11.12 environment at: /usr
Resolved 173 packages in 11.89s
Downloading libcuml-cu12 (404.9MiB)
Downloading cudf-cu12 (1.7MiB)
Downloading cugraph-cu12 (3.0MiB)
Downloading rmm-cu12 (1.5MiB)
Downloading ucx-py-cu12 (2.2MiB)
Downloading dask (1.3MiB)
Downloading datashader (17.5MiB)
Downloading bokeh (6.6MiB)
Downloading libcuspatial-cu12 (31.1MiB)
Downloading libcuvs-cu12 (1.1GiB)
Downloading pylibcudf-cu12 (26.4MiB)
Downloading librmm-cu12 (2.9MiB)
Downloading libcudf-cu12 (538.8MiB)
Downloading shapely (2.4MiB)
Downloading libcugraph-cu12 (1.4GiB)
Downloading raf

# RAPIDS is now installed on Colab.  
You can copy your code into the cells below, or run them to validate your RAPIDS installation and version.
# Enjoy!

In [3]:
import cudf
cudf.__version__

'25.04.00'

In [4]:
import cuml
cuml.__version__

'25.04.00'

In [None]:
import cugraph
cugraph.__version__

'24.04.00'

In [None]:
import cuspatial
cuspatial.__version__

'24.04.00'

In [None]:
import cuxfilter
cuxfilter.__version__

'24.04.01'
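
The five version cells above can also be collapsed into one loop. The sketch below is an assumption-laden convenience, not part of the RAPIDS tooling: `collect_versions` is a hypothetical helper that imports each module by name and reads its `__version__` attribute. It can help spot mismatches such as the mix of `25.04` and `24.04` strings shown above, where the last three cells have no execution count and their outputs may predate the current install.

```python
import importlib

RAPIDS_MODULES = ["cudf", "cuml", "cugraph", "cuspatial", "cuxfilter"]


def collect_versions(module_names):
    """Return {module name: version string, or None if the import fails}."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except Exception:
            # Imports can fail because the package is missing or because
            # no usable GPU is attached; both cases are reported as None.
            versions[name] = None
            continue
        versions[name] = getattr(module, "__version__", "unknown")
    return versions


for name, version in collect_versions(RAPIDS_MODULES).items():
    print(f"{name:<10} {version or 'not importable'}")
```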

# Next Steps #

For an overview of how you can access and work with your own datasets in Colab, check out [this guide](https://towardsdatascience.com/3-ways-to-load-csv-files-into-colab-7c14fcbdcb92).

For more RAPIDS examples, check out our RAPIDS notebooks repos:
1. https://github.com/rapidsai/notebooks
2. https://github.com/rapidsai/notebooks-contrib

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
import cuml
from cuml.cluster import HDBSCAN
import h5py
import numpy as np

# Load embeddings from mounted drive
with h5py.File('/content/drive/MyDrive/large_scale_embeddings.h5', 'r') as f:
    embeddings = f['embeddings'][:]
    conversation_ids = [cid.decode('utf-8') for cid in f['conversation_ids'][:]]

print(f"Loaded {len(embeddings):,} embeddings")

Loaded 563,166 embeddings
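
Before clustering, it can be worth confirming that the loaded arrays line up. The following is a minimal sketch with a hypothetical helper, `validate_embeddings`, run here on synthetic data rather than the real file. The cast to `float32` is an assumption made to halve memory relative to `float64`; check the cuML documentation for the dtypes a given estimator expects.

```python
import numpy as np


def validate_embeddings(embeddings, ids):
    """Check shape, ID count, and finiteness; return a contiguous float32 copy."""
    embeddings = np.asarray(embeddings)
    if embeddings.ndim != 2:
        raise ValueError(f"expected a 2-D array, got {embeddings.ndim} dimension(s)")
    if len(ids) != embeddings.shape[0]:
        raise ValueError(
            f"{len(ids)} IDs do not match {embeddings.shape[0]} embedding rows"
        )
    if not np.isfinite(embeddings).all():
        raise ValueError("embeddings contain NaN or infinite values")
    return np.ascontiguousarray(embeddings, dtype=np.float32)


# Synthetic stand-in for the arrays loaded from the HDF5 file.
rng = np.random.default_rng(0)
demo_embeddings = rng.normal(size=(4, 3))
demo_ids = ["a", "b", "c", "d"]

checked = validate_embeddings(demo_embeddings, demo_ids)
print(checked.shape, checked.dtype)
```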


In [11]:
# Fit cuML HDBSCAN
clusterer = HDBSCAN(
    min_cluster_size=100,
    min_samples=25,
    cluster_selection_epsilon=0.0,
    metric='euclidean',
    prediction_data=True,
    compute_core_distances=True  # Ignored: cuML logs this as an unused keyword parameter (see the output below)
)

[2025-05-30 03:12:42.253] [CUML] [info] Unused keyword parameter: compute_core_distances during cuML estimator initialization


In [12]:
print("Fitting HDBSCAN...")
clusterer.fit(embeddings)

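# Count distinct cluster labels, excluding the noise label (-1) only when it is present
n_clusters = int(np.sum(np.unique(clusterer.labels_) != -1))
print(f"Found {n_clusters} clusters")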

Fitting HDBSCAN...
Found 400 clusters
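
A cluster count alone says little about how the points are distributed. The sketch below, using a hypothetical `summarize_labels` helper and toy labels, shows one way to report the noise fraction and cluster-size range from a label array in which `-1` marks noise, as in the HDBSCAN output above.

```python
import numpy as np


def summarize_labels(labels):
    """Summarise a label array where -1 marks noise points."""
    labels = np.asarray(labels)
    cluster_ids, sizes = np.unique(labels[labels != -1], return_counts=True)
    n_noise = int(np.sum(labels == -1))
    return {
        "n_clusters": int(cluster_ids.size),
        "n_noise": n_noise,
        "noise_fraction": n_noise / labels.size if labels.size else 0.0,
        "largest": int(sizes.max()) if sizes.size else 0,
        "smallest": int(sizes.min()) if sizes.size else 0,
    }


# Toy labels: three clusters (0, 1, 2) and two noise points.
toy_labels = np.array([0, 0, 1, -1, 1, 1, -1, 2])
print(summarize_labels(toy_labels))
```

With the fitted model, the same call would be `summarize_labels(clusterer.labels_)`.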


In [13]:
import numpy as np
import h5py
import pickle
import json
import time
from pathlib import Path

def convert_numpy_types_for_json(obj):
    """
    Convert numpy types to Python native types for JSON serialization
    Based on: https://www.geeksforgeeks.org/fix-type-error-numpy-array-is-not-json-serializable/
    """
    if isinstance(obj, np.integer):
        return int(obj)
    elif isinstance(obj, np.floating):
        return float(obj)
    elif isinstance(obj, np.ndarray):
        return obj.tolist()
    elif isinstance(obj, dict):
        return {key: convert_numpy_types_for_json(value) for key, value in obj.items()}
    elif isinstance(obj, list):
        return [convert_numpy_types_for_json(item) for item in obj]
    elif isinstance(obj, tuple):
        return tuple(convert_numpy_types_for_json(item) for item in obj)
    else:
        return obj

def save_cuml_hdbscan_results(clusterer, embeddings, conversation_ids, output_dir="hdbscan_results"):
    """
    Save all necessary results from cuML HDBSCAN for local processing

    Args:
        clusterer: Fitted cuML HDBSCAN object
        embeddings: Original embeddings array
        conversation_ids: List of conversation IDs
        output_dir: Directory to save results
    """

    print("💾 Saving cuML HDBSCAN results for local import...")
    Path(output_dir).mkdir(parents=True, exist_ok=True)

    # 1. CORE CLUSTERING RESULTS
    print("📊 Saving core clustering results...")

    # Handle cuML-specific objects that may not have .copy() method
    def safe_copy_or_convert(obj, attr_name):
        """Safely copy or convert an attribute of a cuML object to a serializable format"""
        # Use the object passed in rather than the enclosing `clusterer`
        if not hasattr(obj, attr_name):
            return None

        attr = getattr(obj, attr_name)
        if attr is None:
            return None

        # Try to copy first
        if hasattr(attr, 'copy'):
            try:
                return attr.copy()
            except Exception:
                pass

        # Try to convert to numpy array
        if hasattr(attr, 'to_numpy'):
            try:
                return attr.to_numpy()
            except Exception:
                pass

        # Try to convert to pandas DataFrame then to dict
        if hasattr(attr, 'to_pandas'):
            try:
                df = attr.to_pandas()
                return df.to_dict()
            except Exception:
                pass

        # For CondensedTree objects, try to extract meaningful data
        if hasattr(attr, '__dict__'):
            try:
                # Try to serialize the object's attributes
                result = {}
                for key, value in attr.__dict__.items():
                    if isinstance(value, (int, float, str, bool, type(None))):
                        result[key] = value
                    elif hasattr(value, 'tolist'):  # numpy arrays
                        result[key] = value.tolist()
                return result
            except Exception:
                pass

        # If all else fails, return a placeholder string (callers check for str)
        print(f"⚠️ Could not serialize {attr_name}, skipping...")
        return f"cuml_object_{attr_name}_not_serializable"

    core_results = {
        # Essential cluster assignments
        'labels_': clusterer.labels_.copy() if hasattr(clusterer.labels_, 'copy') else clusterer.labels_,
        'probabilities_': safe_copy_or_convert(clusterer, 'probabilities_'),

        # Cluster statistics
        'cluster_persistence_': safe_copy_or_convert(clusterer, 'cluster_persistence_'),
        'condensed_tree_': safe_copy_or_convert(clusterer, 'condensed_tree_'),
        'single_linkage_tree_': safe_copy_or_convert(clusterer, 'single_linkage_tree_'),

        # Exemplars and representatives
        'exemplars_': safe_copy_or_convert(clusterer, 'exemplars_'),
        'outlier_scores_': safe_copy_or_convert(clusterer, 'outlier_scores_'),

        # Configuration used
        'min_cluster_size': clusterer.min_cluster_size,
        'min_samples': clusterer.min_samples,
        'cluster_selection_epsilon': clusterer.cluster_selection_epsilon,
        'algorithm': clusterer.algorithm if hasattr(clusterer, 'algorithm') else 'auto',
        'metric': clusterer.metric if hasattr(clusterer, 'metric') else 'euclidean'
    }

    # Save core results
    with open(f"{output_dir}/core_results.pkl", 'wb') as f:
        pickle.dump(core_results, f)

    # 2. CONVERSATION-TO-CLUSTER MAPPING
    print("🗺️ Saving conversation-to-cluster mapping...")

    conversation_mapping = {
        'conversation_ids': conversation_ids,
        'cluster_labels': clusterer.labels_.copy() if hasattr(clusterer.labels_, 'copy') else clusterer.labels_,
        'soft_probabilities': safe_copy_or_convert(clusterer, 'probabilities_'),
        'outlier_scores': safe_copy_or_convert(clusterer, 'outlier_scores_'),
        'total_conversations': len(conversation_ids)
    }

    with open(f"{output_dir}/conversation_mapping.pkl", 'wb') as f:
        pickle.dump(conversation_mapping, f)

    # 3. CLUSTER STATISTICS AND PROTOTYPES
    print("📈 Computing and saving cluster statistics...")

    unique_labels = np.unique(clusterer.labels_)
    valid_clusters = unique_labels[unique_labels != -1]  # Exclude noise

    # Cache converted objects for efficiency
    outlier_scores = safe_copy_or_convert(clusterer, 'outlier_scores_')
    probabilities = safe_copy_or_convert(clusterer, 'probabilities_')
    cluster_persistence = safe_copy_or_convert(clusterer, 'cluster_persistence_')

    cluster_stats = {}
    cluster_prototypes = {}

    for cluster_id in valid_clusters:
        cluster_mask = clusterer.labels_ == cluster_id
        cluster_indices = np.where(cluster_mask)[0]
        cluster_embeddings = embeddings[cluster_indices]

        # Safely compute statistics (a failure leaves the value as None)
        mean_outlier_score = None
        if outlier_scores is not None and not isinstance(outlier_scores, str):
            try:
                mean_outlier_score = float(np.mean(outlier_scores[cluster_indices]))
            except Exception:
                mean_outlier_score = None

        mean_probability = None
        if probabilities is not None and not isinstance(probabilities, str):
            try:
                mean_probability = float(np.mean(probabilities[cluster_indices]))
            except Exception:
                mean_probability = None

        persistence_score = None
        if cluster_persistence is not None and not isinstance(cluster_persistence, str):
            try:
                if cluster_id < len(cluster_persistence):
                    persistence_score = float(cluster_persistence[cluster_id])
            except Exception:
                persistence_score = None

        # Basic statistics
        cluster_stats[int(cluster_id)] = {
            'size': len(cluster_indices),
            'conversation_indices': cluster_indices.tolist(),
            'mean_outlier_score': mean_outlier_score,
            'mean_probability': mean_probability,
            'persistence': persistence_score
        }

        # Prototype selection (multiple methods for robustness)
        prototypes = select_cluster_prototypes(
            cluster_embeddings,
            cluster_indices,
            conversation_ids,
            max_prototypes=15
        )

        cluster_prototypes[int(cluster_id)] = prototypes

    # Save cluster analysis
    with open(f"{output_dir}/cluster_statistics.pkl", 'wb') as f:
        pickle.dump(cluster_stats, f)

    with open(f"{output_dir}/cluster_prototypes.pkl", 'wb') as f:
        pickle.dump(cluster_prototypes, f)

    # 4. SOFT CLUSTERING RESULTS (if available)
    if probabilities is not None and not isinstance(probabilities, str):
        print("🎯 Saving soft clustering results...")

        try:
            # Calculate uncertainty scores safely
            if probabilities.ndim > 1:
                uncertainty_scores = 1 - np.max(probabilities, axis=1)
            else:
                uncertainty_scores = 1 - probabilities

            soft_clustering = {
                'probabilities': probabilities.tolist() if hasattr(probabilities, 'tolist') else probabilities,
                'membership_vectors': get_soft_membership_vectors(clusterer.labels_, probabilities),
                'uncertainty_scores': uncertainty_scores.tolist() if hasattr(uncertainty_scores, 'tolist') else uncertainty_scores
            }

            with open(f"{output_dir}/soft_clustering.pkl", 'wb') as f:
                pickle.dump(soft_clustering, f)
        except Exception as e:
            print(f"⚠️ Could not save soft clustering results: {e}")
    else:
        print("ℹ️ No soft clustering probabilities available")

    # 5. HIERARCHY INFORMATION (if available)
    condensed_tree = safe_copy_or_convert(clusterer, 'condensed_tree_')
    single_linkage_tree = safe_copy_or_convert(clusterer, 'single_linkage_tree_')

    if condensed_tree is not None and not isinstance(condensed_tree, str):
        print("🌳 Saving cluster hierarchy...")

        try:
            hierarchy_info = {
                'condensed_tree': condensed_tree,
                'single_linkage_tree': single_linkage_tree if single_linkage_tree is not None and not isinstance(single_linkage_tree, str) else None,
                'cluster_hierarchy': extract_cluster_hierarchy(clusterer)
            }

            with open(f"{output_dir}/hierarchy_info.pkl", 'wb') as f:
                pickle.dump(hierarchy_info, f)
        except Exception as e:
            print(f"⚠️ Could not save hierarchy information: {e}")
    else:
        print("ℹ️ No hierarchy information available")

    # 6. SUMMARY METADATA
    print("📋 Saving summary metadata...")

    summary = {
        'timestamp': time.time(),
        'total_conversations': len(conversation_ids),
        'total_clusters': len(valid_clusters),
        'noise_points': int(np.sum(clusterer.labels_ == -1)),
        'noise_percentage': float(np.sum(clusterer.labels_ == -1) / len(clusterer.labels_) * 100),
        'largest_cluster_size': int(max([stats['size'] for stats in cluster_stats.values()])) if cluster_stats else 0,
        'smallest_cluster_size': int(min([stats['size'] for stats in cluster_stats.values()])) if cluster_stats else 0,
        'mean_cluster_size': float(np.mean([stats['size'] for stats in cluster_stats.values()])) if cluster_stats else 0.0,
        'has_soft_clustering': probabilities is not None and not isinstance(probabilities, str),
        'has_hierarchy': condensed_tree is not None and not isinstance(condensed_tree, str),
        'algorithm_params': {
            'min_cluster_size': clusterer.min_cluster_size,
            'min_samples': clusterer.min_samples,
            'cluster_selection_epsilon': clusterer.cluster_selection_epsilon
        }
    }

    with open(f"{output_dir}/summary.json", 'w') as f:
        json.dump(convert_numpy_types_for_json(summary), f, indent=2)

    print("✅ All cuML HDBSCAN results saved successfully!")
    print(f"📁 Results saved to: {output_dir}/")
    print(f"📊 {len(valid_clusters)} clusters, {np.sum(clusterer.labels_ == -1)} noise points")

    return output_dir

def select_cluster_prototypes(cluster_embeddings, cluster_indices, conversation_ids, max_prototypes=15):
    """Select diverse prototypes from a cluster using multiple methods"""

    if len(cluster_embeddings) <= max_prototypes:
        return {
            'all_indices': cluster_indices.tolist(),
            'all_conversation_ids': [conversation_ids[i] for i in cluster_indices],
            'selection_method': 'all_points'
        }

    # Each method contributes up to a third of the prototype budget
    n_per_method = max_prototypes // 3

    # Method 1: Centroid-nearest
    centroid = np.mean(cluster_embeddings, axis=0)
    distances_to_centroid = np.linalg.norm(cluster_embeddings - centroid, axis=1)
    order = np.argsort(distances_to_centroid)
    centroid_nearest_idx = order[:n_per_method]

    # Method 2: Diverse exemplars (farthest-first traversal)
    diverse_idx = farthest_first_traversal(cluster_embeddings, n_per_method)

    # Method 3: Boundary cases (points farthest from the centroid).
    # Slicing from an explicit start avoids `-max_prototypes//3`, which floors
    # the negative number and returns every point when the count is 0.
    boundary_idx = order[len(order) - n_per_method:]

    # Combine and deduplicate
    all_prototype_idx = np.unique(np.concatenate([centroid_nearest_idx, diverse_idx, boundary_idx]))

    # Convert to original indices
    prototype_indices = cluster_indices[all_prototype_idx]

    return {
        'prototype_indices': prototype_indices.tolist(),
        'prototype_conversation_ids': [conversation_ids[i] for i in prototype_indices],
        'centroid_nearest': cluster_indices[centroid_nearest_idx].tolist(),
        'diverse_exemplars': cluster_indices[diverse_idx].tolist(),
        'boundary_cases': cluster_indices[boundary_idx].tolist(),
        'selection_methods': ['centroid_nearest', 'diverse_exemplars', 'boundary_cases']
    }

def farthest_first_traversal(embeddings, n_exemplars):
    """Select diverse exemplars using farthest-first traversal"""
    if n_exemplars >= len(embeddings):
        return np.arange(len(embeddings))

    selected = [0]  # Start with first point

    for _ in range(n_exemplars - 1):
        max_min_distance = -1
        best_candidate = -1

        for candidate in range(len(embeddings)):
            if candidate in selected:
                continue

            # Find minimum distance to any selected point
            min_distance = float('inf')
            for selected_idx in selected:
                distance = np.linalg.norm(embeddings[candidate] - embeddings[selected_idx])
                min_distance = min(min_distance, distance)

            # Update best candidate if this has larger minimum distance
            if min_distance > max_min_distance:
                max_min_distance = min_distance
                best_candidate = candidate

        if best_candidate != -1:
            selected.append(best_candidate)

    return np.array(selected)

def get_soft_membership_vectors(labels, probabilities):
    """Extract soft membership vectors for each cluster"""
    unique_labels = np.unique(labels)
    valid_clusters = unique_labels[unique_labels != -1]

    if probabilities.ndim == 1:
        # Single probability per point
        membership = {}
        for cluster_id in valid_clusters:
            cluster_mask = labels == cluster_id
            membership[int(cluster_id)] = probabilities[cluster_mask].tolist()
        return membership
    else:
        # Probability matrix
        membership = {}
        for i, cluster_id in enumerate(valid_clusters):
            membership[int(cluster_id)] = probabilities[:, i].tolist()
        return membership

def extract_cluster_hierarchy(clusterer):
    """Extract interpretable cluster hierarchy information"""
    if not hasattr(clusterer, 'condensed_tree_'):
        return None

    # This would need to be customized based on cuML's specific tree structure
    # For now, return basic information
    return {
        'tree_structure': 'condensed_tree_available',
        'note': 'Detailed hierarchy extraction depends on cuML tree format'
    }
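
The nested-loop `farthest_first_traversal` above recomputes every candidate-to-selected distance in Python on each round, which can be slow for large clusters. The sketch below is one possible vectorised variant, not a drop-in guaranteed to match in every edge case: it keeps a running array of each point's distance to its nearest selected point and updates it with NumPy. The name `farthest_first_vectorized` is hypothetical.

```python
import numpy as np


def farthest_first_vectorized(points, n_select):
    """Greedy farthest-first traversal starting from index 0."""
    points = np.asarray(points, dtype=float)
    n_points = len(points)
    if n_select >= n_points:
        return np.arange(n_points)

    selected = [0]  # Start with the first point, as the loop version does
    # Distance from every point to its nearest selected point so far.
    min_dist = np.linalg.norm(points - points[0], axis=1)
    min_dist[0] = -1.0  # Mark selected points so argmax never picks them again

    for _ in range(n_select - 1):
        next_idx = int(np.argmax(min_dist))
        selected.append(next_idx)
        dist_to_new = np.linalg.norm(points - points[next_idx], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
        min_dist[next_idx] = -1.0

    return np.array(selected)


# Five points on a line at x = 0, 1, 10, 11, 5.
line_points = np.array([[0.0], [1.0], [10.0], [11.0], [5.0]])
print(farthest_first_vectorized(line_points, 3))
```

Each round costs one pass over the cluster instead of one pass per already-selected point, so the total work is roughly proportional to the cluster size times `n_select`.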

In [14]:
output_dir = save_cuml_hdbscan_results(
    clusterer,
    embeddings,
    conversation_ids,
    output_dir="/content/drive/MyDrive/hdbscan_results"
)

💾 Saving cuML HDBSCAN results for local import...
📊 Saving core clustering results...
🗺️ Saving conversation-to-cluster mapping...
📈 Computing and saving cluster statistics...
🎯 Saving soft clustering results...
🌳 Saving cluster hierarchy...
📋 Saving summary metadata...


TypeError: Object of type int64 is not JSON serializable
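
The `TypeError` above is what the `json` module raises when a NumPy scalar such as `np.int64` reaches the encoder; the truncated traceback does not show which value triggered it in this run. As an alternative to converting the structure up front, the sketch below passes a `default` hook to `json.dumps`, which the encoder calls for any value it cannot serialise natively. The function name `numpy_json_default` is hypothetical. Note that the hook applies to values only: NumPy integers used as dictionary keys still need converting with `int(...)` first.

```python
import json

import numpy as np


def numpy_json_default(obj):
    """Fallback for json.dump(s): convert NumPy values to native Python types."""
    if isinstance(obj, np.generic):  # np.int64, np.float32, np.bool_, ...
        return obj.item()
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    # Mirror the standard error for anything else we do not know how to convert.
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


# Toy stand-in for the summary dictionary built above.
sample_summary = {
    "total_clusters": np.int64(400),
    "mean_cluster_size": np.float32(12.5),
    "has_hierarchy": np.bool_(True),
    "sizes": np.array([3, 2]),
}

encoded = json.dumps(sample_summary, default=numpy_json_default, indent=2)
print(encoded)
```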