### Cluster-based downsampling: Notes

#### Mean shift clustering

**Idea:** Mean Shift identifies clusters by iteratively shifting each data point towards the nearest mode (local peak) of the data's estimated density. Points that converge to the same mode form one cluster.

**Pros:** It doesn't require specifying the number of clusters in advance and can discover clusters of arbitrary shapes.

**Cons:** It can be computationally intensive, and the result depends strongly on the bandwidth parameter: too large a bandwidth merges distinct clusters, too small a bandwidth splits them into spurious ones.

**Bandwidth observations under mean shift clustering:**

bandwidth=0.00000475 (4.75e-6) was found to be suitable for THIS instance of AFM data.

bandwidth=0.000005 (5e-6) was too large: it eradicated the level-2 terrace in the sample data.

bandwidth<0.00000475 was too small: it led to monatomic-width pseudo-levels between terraces.
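The sensitivity to bandwidth noted above can be reproduced on a toy example. The following is a minimal sketch, not the implementation used for these notes (which is not recorded here): it runs a flat-kernel mean shift on height values only, and the function names `mean_shift_1d` and `distinct_levels`, the sample heights, and the two bandwidths are all hypothetical choices made for illustration.

```python
def mean_shift_1d(heights, bandwidth, tol=1e-12, max_iter=200):
    """Flat-kernel mean shift: move each height to the mean of its window until it settles."""
    modes = []
    for start in heights:
        x = start
        for _ in range(max_iter):
            # All samples within one bandwidth of the current position.
            window = [h for h in heights if abs(h - x) <= bandwidth]
            shifted = sum(window) / len(window)
            if abs(shifted - x) < tol:
                break
            x = shifted
        modes.append(x)
    return modes


def distinct_levels(modes, bandwidth):
    """Merge converged modes that lie within one bandwidth of each other into one level."""
    groups = []
    for m in sorted(modes):
        if groups and m - groups[-1][-1] <= bandwidth:
            groups[-1].append(m)
        else:
            groups.append([m])
    return [sum(g) / len(g) for g in groups]


# Hypothetical terrace heights in arbitrary units (three terraces near 0, 1 and 2).
heights = [0.00, 0.02, -0.01, 0.01, 1.00, 1.03, 0.98, 1.01, 2.00, 1.99, 2.02]

narrow = distinct_levels(mean_shift_1d(heights, bandwidth=0.3), bandwidth=0.3)
wide = distinct_levels(mean_shift_1d(heights, bandwidth=3.0), bandwidth=3.0)
print(len(narrow), len(wide))
```

With the narrow bandwidth each terrace keeps its own level, while the wide bandwidth pulls every height to a single mode, which is the same kind of terrace loss seen when the bandwidth was raised on the sample data.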

#### DBSCAN clustering

**DBSCAN (Density-Based Spatial Clustering of Applications with Noise):**

**Idea:** DBSCAN groups together data points that are closely packed and labels outliers as noise. It doesn't require specifying the number of clusters in advance.

**Pros:** It's robust to varying cluster shapes and sizes and can identify noise points.

**Cons:** It might not work well when clusters have markedly different densities, because a single epsilon applies to the whole dataset.

**Overview:** DBSCAN separates dense regions from sparse ones. A point with at least min_samples neighbors within distance epsilon is a core point; clusters grow outward from core points, and points that cannot be reached from any core point are labeled as noise.

**Advantages:** DBSCAN can retain complex structures, including curved terraces, by focusing on the density of points. It automatically identifies clusters of varying sizes and shapes.

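**How to Use:** You can use DBSCAN to cluster your point cloud and then select representative points from each cluster as your downsampled points. Adjust the epsilon (`eps` in scikit-learn) and min_samples parameters to control the clustering sensitivity: a larger epsilon merges neighboring clusters, while a larger min_samples marks more points as noise.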
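Selecting representatives is independent of which algorithm produced the labels. The sketch below assumes the labels are already available as one integer per point, with -1 marking noise (the convention scikit-learn uses), and keeps the member closest to each cluster's centroid. The function name `representatives` and the sample points are hypothetical; keeping only one point per cluster is the most aggressive choice, and several points per cluster could be kept instead.

```python
from collections import defaultdict
from math import dist


def representatives(points, labels, noise_label=-1):
    """Return {cluster label: member point closest to that cluster's centroid}."""
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        if label != noise_label:  # noise points are dropped from the downsample
            clusters[label].append(point)

    reps = {}
    for label, members in clusters.items():
        dim = len(members[0])
        centroid = tuple(sum(m[k] for m in members) / len(members) for k in range(dim))
        # Pick an actual data point rather than the centroid itself,
        # so the downsampled cloud contains only measured positions.
        reps[label] = min(members, key=lambda m: dist(m, centroid))
    return reps


# Hypothetical (x, y, z) points with labels as a clustering step might return them.
points = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (10, 10, 1), (10, 11, 1), (10, 12, 1), (50, 50, 5)]
labels = [0, 0, 0, 1, 1, 1, -1]
downsampled = representatives(points, labels)
```

Returning a measured point instead of the centroid avoids introducing heights that never occurred in the scan, which matters when the clusters are curved.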

#### OPTICS clustering

**OPTICS (Ordering Points To Identify the Clustering Structure):**

**Overview:** OPTICS is an extension of DBSCAN that produces a hierarchical clustering structure. It orders data points based on their reachability distance.

**Advantages:** OPTICS can capture clusters of varying densities and sizes, making it suitable for retaining complex structures. It provides a hierarchy of clusters.

**How to Use:** Apply OPTICS to your point cloud and select representative points from different levels of the hierarchy. Adjust min_samples, the maximum neighborhood radius (epsilon), and the cluster extraction parameters to control the results.

**OPTICS** is a density-based clustering algorithm that extends DBSCAN. Instead of committing to a single epsilon, it records how the density structure of the data changes across scales, which is what allows a hierarchy of clusters to be read off afterwards. It works as follows:

**Core Distance and Reachability Distance:**

OPTICS starts by computing two important distances for each data point:

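**Core Distance (core_dist):** This is the smallest radius at which a point qualifies as a core point, that is, the distance from the point to its min_samples-th nearest neighbor. It is undefined if the point has fewer than min_samples neighbors within the maximum search radius.

**Reachability Distance (reach_dist):** The reachability distance of a point p from a core point o is the larger of o's core distance and the distance between o and p. OPTICS keeps the smallest such value found for each point, so it measures how easily the point can be reached from a dense region that has already been processed.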

**Building the Reachability Plot:**

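OPTICS visits the points in a specific order, always moving next to the unprocessed point with the smallest reachability distance, and records each point's reachability distance as it goes. The reachability plot shows these distances in visiting order, not sorted by value: dense clusters appear as valleys, and the gaps between clusters appear as peaks.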
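To make the ordering concrete, here is a minimal brute-force sketch of the OPTICS ordering loop. It is an illustration under stated assumptions, not a library implementation: the names `core_distances` and `optics_order` are hypothetical, the core distance counts the point itself among its neighbors, ties are broken by lowest index, and every pairwise distance is computed directly, which is only practical for small inputs.

```python
from math import dist, inf


def core_distances(points, min_samples):
    """Distance from each point to its min_samples-th nearest neighbor (itself included)."""
    result = []
    for p in points:
        d = sorted(dist(p, q) for q in points)
        result.append(d[min_samples - 1] if len(d) >= min_samples else inf)
    return result


def optics_order(points, min_samples, max_eps=inf):
    """Return (visiting order, reachability in visiting order, core distances)."""
    n = len(points)
    core = core_distances(points, min_samples)
    reach = [inf] * n
    done = [False] * n
    order = []
    for _ in range(n):
        # Visit the unprocessed point that is currently easiest to reach.
        i = min((j for j in range(n) if not done[j]), key=lambda j: (reach[j], j))
        done[i] = True
        order.append(i)
        if core[i] > max_eps:
            continue  # not a core point at this scale: it cannot reach others
        for j in range(n):
            if not done[j]:
                d = dist(points[i], points[j])
                if d <= max_eps:
                    # Reachability from i is never smaller than i's core distance.
                    reach[j] = min(reach[j], max(core[i], d))
    return order, [reach[i] for i in order], core


# Two hypothetical groups of three points each, far apart along x.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (11.0, 0.0), (12.0, 0.0)]
order, reach_plot, core = optics_order(points, min_samples=2)
```

For these points the reachability values in visiting order are infinite for the first point, then 1, 1, 8, 1, 1. The single peak of 8 is the jump between the two groups, and the flat stretches on either side are the two valleys.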

**Extracting Clusters and Hierarchical Structure:**

OPTICS extracts clusters and captures the hierarchical structure of the data from the reachability plot. It does this by looking for valleys in the plot, which correspond to dense groups of points.

A steep rise in reachability from one point to the next marks a potential cluster boundary: the next point is much harder to reach than the points before it.

A steep drop in reachability signifies the start of a new cluster: the ordering has entered another dense region.

The hierarchy of clusters is captured by these transitions: a valley nested inside a wider valley is a denser sub-cluster of a larger cluster.
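The simplest way to read clusters off a reachability plot is a horizontal cut at a chosen epsilon, which gives a flat, DBSCAN-like labeling rather than a hierarchy. The sketch below assumes the reachability and core distances are already arranged in OPTICS visiting order; the function name `cut_reachability` and the sample values are hypothetical. The steepness-based extraction controlled by xi is more involved and is not shown here.

```python
from math import inf


def cut_reachability(reachability, core_dist, eps):
    """Label points in visiting order by cutting the reachability plot at height eps.

    A point above the cut starts a new cluster if it is itself a core point
    at radius eps; otherwise it is noise (-1). A point below the cut joins
    the cluster currently being built.
    """
    labels = []
    current = -1
    for reach, core in zip(reachability, core_dist):
        if reach > eps:
            if core <= eps:
                current += 1
                labels.append(current)
            else:
                labels.append(-1)
        else:
            labels.append(current)
    return labels


# Hypothetical reachability plot with one peak (8.0) separating two valleys.
two_valleys = cut_reachability([inf, 1.0, 1.0, 8.0, 1.0, 1.0], [1.0] * 6, eps=2.0)

# Hypothetical plot whose last two points are isolated, so they become noise.
with_noise = cut_reachability([inf, 1.0, 1.0, 8.0, 9.0], [1.0, 1.0, 1.0, 8.0, 9.0], eps=2.0)
```

Raising the cut above the peak merges the two valleys into one cluster, which is the reachability-plot view of why a single global epsilon cannot separate clusters of very different densities.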

**Hierarchical Cluster Ordering:** 

OPTICS produces a hierarchical ordering of clusters, where clusters at different levels of the hierarchy represent different granularities of grouping.

Clusters at the top of the hierarchy are larger, encompassing more points, while clusters lower in the hierarchy are smaller and more specific.

**Cluster Extraction:**

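After constructing the hierarchy, you can extract clusters at different levels based on your specific needs. The xi parameter (a value between 0 and 1) sets the minimum relative change in reachability between consecutive points that counts as a steep region, and therefore as a cluster boundary.

Lower values of xi accept gentler slopes as boundaries and result in more detailed clusters, while higher values keep only sharply separated clusters and so simplify the hierarchy.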

**Noise and Undefined Points:**

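Points that do not belong to any extracted cluster are considered noise (scikit-learn labels them -1). A point with fewer than min_samples neighbors within the maximum search radius has an undefined core distance and cannot act as a core point.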

**Visualization and Analysis:**

The hierarchical structure of OPTICS allows for the exploration of clusters at different levels, which can provide insights into the data's underlying structure.

**In summary**:

OPTICS is a density-based clustering algorithm that captures the hierarchical structure of data by computing reachability distances and identifying transitions in the reachability plot. It allows for the identification of clusters of varying sizes and shapes, making it suitable for datasets with complex structures. The hierarchical representation provides flexibility in analyzing data at different levels of granularity.

**OPTICS parameters:** Several parameters can be tuned to control the behavior of the algorithm and adapt it to the characteristics of your data; min_samples is the primary one. The parameter names below match those of scikit-learn's `OPTICS` class. Here's an explanation of the main parameters:

_**min_samples:**_

**Description:** This parameter defines the minimum number of samples (points) required to form a dense region or cluster.

**Impact:** A higher min_samples value will require a denser region to be recognized as a cluster, potentially resulting in fewer and larger clusters. Conversely, a lower value can lead to more but smaller clusters.

**Tuning:** You can experiment with different values for min_samples to control the sensitivity of cluster identification. Smaller values make the algorithm more sensitive to small, dense regions, while larger values require larger, denser regions to be recognized as clusters.

_**xi:**_

**Description:** The xi parameter sets the minimum steepness in the reachability plot that counts as a cluster boundary. It takes values between 0 and 1.

**Impact:** Higher values of xi require sharper changes in reachability, resulting in a flatter hierarchy with fewer levels of clusters. Lower values accept gentler changes and create a more hierarchical structure with more levels.

**Tuning:** Adjust xi to control the level of granularity in the hierarchy. Larger values simplify the hierarchy, while smaller values provide more detail.

_**min_cluster_size:**_

**Description:** This parameter specifies the minimum number of samples required for a cluster to be recognized.

**Impact:** Clusters with fewer than min_cluster_size samples will not be labeled as clusters.

**Tuning:** You can set min_cluster_size to control the minimum size of clusters you want to identify. Smaller values will result in the identification of smaller clusters.

_**metric:**_

**Description:** The metric parameter defines the distance metric used to measure the similarity or reachability between points.

**Impact:** The choice of distance metric can affect how points are connected and clustered.

**Tuning:** Select an appropriate distance metric based on the characteristics of your data. Common choices include Euclidean distance (euclidean), Manhattan distance (manhattan), and others depending on your data's nature.

_**cluster_method:**_

**Description:** This parameter determines the method used to extract clusters from the OPTICS result. In scikit-learn the options are "xi" and "dbscan" ("leaf" is a cluster selection option of HDBSCAN, not of OPTICS).

**Impact:** The choice of cluster_method can affect how clusters are defined and represented in the output.

**Tuning:** Depending on your analysis goals, you can choose a suitable cluster extraction method. "xi" detects clusters from steep changes in the reachability plot and can recover clusters of differing densities, including nested ones. "dbscan" cuts the plot at a single epsilon and gives a flat, DBSCAN-like result.

#### HDBSCAN

**HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise):**

**Advantages:**

**Hierarchical Clustering:** HDBSCAN, like OPTICS, provides a hierarchical view of the data's clustering structure.

**Robustness:** It is robust to varying cluster densities and noisy data points.

**Fewer Parameters:** HDBSCAN does not need an epsilon. Its main parameter is min_cluster_size, and in common implementations min_samples defaults to the same value when left unset, reducing the need for parameter tuning.

**Limitations:**

**Computational Complexity:** HDBSCAN can also be computationally intensive, especially for large datasets.


#### Choosing Between OPTICS, DBSCAN, and HDBSCAN:

**OPTICS and HDBSCAN** share the advantage of providing a hierarchical view of clusters, which can be valuable when analyzing complex structures like atomic terraces.

**HDBSCAN** offers the additional benefit of needing no epsilon: min_cluster_size is its main parameter and min_samples can usually be left at its default, which can simplify the clustering process.

Consider the size of your dataset and computational resources. Both OPTICS and HDBSCAN can be computationally intensive, so it's important to assess whether your dataset can be processed within reasonable time constraints.

**Experimentation:** Since the suitability of a clustering algorithm depends on the specific characteristics of your data, it's often advisable to experiment with both OPTICS and HDBSCAN, as well as potentially DBSCAN, to see which one provides the best results in terms of capturing the shape and structure of your atomic force micrographs.