### Clustering and Downsampling: Notes

#### Mean shift clustering

**Idea:** Mean Shift identifies clusters by iteratively shifting each data point towards the mode (peak) of the data's density distribution.

**Pros:** It doesn't require specifying the number of clusters in advance and can discover clusters of arbitrary shapes.

**Cons:** It can be computationally intensive, and the resulting clusters are sensitive to the choice of the bandwidth parameter.

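**Bandwidth observations under mean shift clustering:**

bandwidth=0.00000475 was found to be suitable for this particular instance of AFM data; other datasets will likely need their own value.

bandwidth=0.000005 caused the level-2 terrace to disappear from the clustering of the sample data.

bandwidth<0.00000475 produced spurious pseudo-levels, one atom wide, between terraces.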
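The sensitivity to bandwidth can be illustrated with a small NumPy-only sketch of flat-kernel mean shift applied to the height (Z) values alone. This is not the code used for the observations above: the helper `mean_shift_1d`, the terrace heights, and the noise level are all hypothetical, and a library implementation such as scikit-learn's `MeanShift` would normally be used instead.

```python
import numpy as np

def mean_shift_1d(z, bandwidth, n_iter=100, tol=1e-15):
    """Flat-kernel mean shift on 1-D heights. Returns (labels, centers)."""
    z = np.asarray(z, dtype=float)
    modes = z.copy()
    for _ in range(n_iter):
        # Which samples lie within `bandwidth` of each current estimate.
        within = np.abs(z[None, :] - modes[:, None]) <= bandwidth
        counts = within.sum(axis=1)
        sums = (within * z[None, :]).sum(axis=1)
        # Shift each estimate to the mean of its window (keep it if the window is empty).
        shifted = np.where(counts > 0, sums / np.maximum(counts, 1), modes)
        converged = np.max(np.abs(shifted - modes)) < tol
        modes = shifted
        if converged:
            break
    # Estimates that end up within one bandwidth of each other share a cluster.
    labels = np.empty(z.size, dtype=int)
    centers = []
    for i in np.argsort(modes):
        if not centers or modes[i] - centers[-1] > bandwidth:
            centers.append(modes[i])
        labels[i] = len(centers) - 1
    return labels, np.array(centers)

# Synthetic heights (hypothetical): three terraces plus small bounded noise.
rng = np.random.default_rng(0)
terrace_heights = np.array([0.0, 2.5e-5, 5.0e-5])
z = np.repeat(terrace_heights, 100) + rng.uniform(-2e-6, 2e-6, size=300)

labels, centers = mean_shift_1d(z, bandwidth=4.75e-6)
_, wide_centers = mean_shift_1d(z, bandwidth=3e-5)
print(len(centers), "clusters at bandwidth 4.75e-6")
print(len(wide_centers), "cluster(s) at bandwidth 3e-5")
```

In this toy data the terraces are far apart relative to the noise, so the effect only shows up for a much larger bandwidth; on real scans the usable window can evidently be much narrower.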

#### Downsampling/interpolation

**Methods:**

**Linear Interpolation:** Linear interpolation connects data points with straight lines. It's a simple method that assumes a constant rate of change between adjacent points.

**Nearest Neighbor Interpolation:** Nearest neighbor interpolation assigns a new point the value of its nearest existing neighbor. It's straightforward but can result in blocky results, as it doesn't consider gradients.

**Cubic Spline Interpolation:** Cubic spline interpolation fits a piecewise cubic polynomial to the data. It provides smoother results than linear interpolation and is suitable for capturing gradual changes in the data.

**B-spline Interpolation:** B-spline interpolation uses piecewise polynomial functions (splines) to represent data. It offers versatility with adjustable degrees, allowing control over the level of smoothness while preserving local detail.
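The four methods can be compared on a one-dimensional step profile. The sketch below uses made-up data (ten samples with a single step) and standard NumPy/SciPy interpolators; it is only meant to show the qualitative differences described above, not to reproduce any particular AFM workflow.

```python
import numpy as np
from scipy.interpolate import CubicSpline, interp1d, make_interp_spline

# Hypothetical 1-D height profile with a single step (two terrace levels).
x = np.arange(10, dtype=float)
y = np.where(x < 5, 0.0, 1.0)
xq = np.linspace(0.0, 9.0, 91)  # query positions between the samples

linear = np.interp(xq, x, y)                    # straight lines between samples
nearest = interp1d(x, y, kind="nearest")(xq)    # value of the closest sample
cubic = CubicSpline(x, y)(xq)                   # piecewise cubic, smooth
bspline_k1 = make_interp_spline(x, y, k=1)(xq)  # degree-1 B-spline
bspline_k3 = make_interp_spline(x, y, k=3)(xq)  # degree-3 B-spline

results = {
    "linear": linear,
    "nearest": nearest,
    "cubic": cubic,
    "bspline k=1": bspline_k1,
    "bspline k=3": bspline_k3,
}
for name, vals in results.items():
    print(f"{name:12s} min={vals.min():+.3f} max={vals.max():+.3f}")
```

Nearest neighbor returns only the original levels, linear stays within their range, and the cubic variants overshoot on both sides of the step. The degree-1 B-spline coincides with linear interpolation, which illustrates how the B-spline degree controls smoothness.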

Using griddata for downsampling in the X and Y dimensions while preserving the Z values:

**Advantages:**

**Preservation of Z Values:** This method effectively preserves the Z values, which is crucial for maintaining the structure of the atomic force microscopy (AFM) data. It ensures that the fine details and variations in the Z dimension are retained.

**Shape Preservation:** Like the previous method, this approach also helps preserve the overall shape of the point cloud. It achieves downsampling in the X and Y dimensions while minimizing the loss of structural information.

**Customizable:** You can choose the interpolation method, such as 'linear' as used in the example, or other interpolation methods based on your data characteristics. This allows you to adapt the method to the specific requirements of your data.

**Disadvantages:**

**Complexity:** The use of griddata for interpolation introduces some complexity into the code. You need to specify interpolation methods and handle potential missing values (e.g., using fill_value). This complexity might make the code slightly harder to understand and maintain.

**Potential for Artifacts:** Depending on the choice of interpolation method and the nature of the data, there is a possibility of introducing artifacts or smoothing that could affect the fine details. Choosing the appropriate interpolation method and parameters is important to mitigate this issue.

**Computational Cost:** Interpolation, especially for large datasets, can be computationally expensive. While this may not be a significant concern for relatively small AFM datasets, it's worth considering for larger datasets.

In summary, this alternative interpolation method is advantageous because it preserves Z values and maintains the shape of the point cloud. However, it requires careful selection of interpolation methods and parameters to minimize potential downsides, such as artifacts or computational cost. Ultimately, the choice between this method and the previous one should be based on your specific data characteristics and the trade-offs you are willing to make.
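A minimal sketch of this kind of downsampling, assuming a scattered (x, y, z) point cloud and a 32 x 32 target grid (both made up for illustration). It also shows one possible way to handle the missing values that 'linear' leaves outside the convex hull of the samples, here by falling back to 'nearest'.

```python
import numpy as np
from scipy.interpolate import griddata

rng = np.random.default_rng(0)

# Hypothetical scattered point cloud: 4000 (x, y) samples of a two-terrace surface.
xy = rng.uniform(0.0, 1.0, size=(4000, 2))
z = np.where(xy[:, 0] < 0.5, 0.0, 1.0)

# Coarser regular grid in X and Y (32 x 32) spanning the full scan area.
gx, gy = np.meshgrid(np.linspace(0.0, 1.0, 32), np.linspace(0.0, 1.0, 32))

# Linear interpolation; grid nodes outside the convex hull of the samples
# receive fill_value (NaN here).
z_linear = griddata(xy, z, (gx, gy), method="linear", fill_value=np.nan)

# One way to handle those gaps: fall back to the nearest sample.
z_nearest = griddata(xy, z, (gx, gy), method="nearest")
z_down = np.where(np.isnan(z_linear), z_nearest, z_linear)

# Back to an (N, 3) point cloud on the coarse grid.
downsampled = np.column_stack([gx.ravel(), gy.ravel(), z_down.ravel()])
print(downsampled.shape, int(np.isnan(z_linear).sum()), "nodes filled from nearest")
```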

**Interpolation methods using griddata:**

**'linear' (Linear Interpolation):**

**Advantages:** Simple and fast. It assumes a constant rate of change between neighboring data points, making it suitable for data with gradual and continuous variations. Linear interpolation is predictable and easy to understand, which makes it a common choice for many applications.

**Disadvantages:** The interpolated surface is continuous but has kinks (abrupt changes in slope) at the data points, so it does not reproduce smooth curvature. Across sharp features or discontinuities it draws straight-line ramps between points, which introduces intermediate values that are not present in the data.

**'nearest' (Nearest-Neighbor Interpolation):**

**Advantages:** Simple and fast. Preserves discrete values and sharp transitions in the data.

**Disadvantages:** Can create blocky results, not suitable for capturing gradual variations.

**'cubic' (Cubic Spline Interpolation):**

**Advantages:** Provides smoother results than linear interpolation. Suitable for capturing gradual changes in the data.

**Disadvantages:** May introduce some smoothing, and can overshoot next to very sharp features such as terrace steps, producing values outside the range of the neighboring data.

Choosing the appropriate interpolation method depends on the nature of your data and your goals. If your data has well-defined and distinct z-levels (terraces) and you want to preserve these levels without generating noisy intermediate values, you might consider using the 'nearest' interpolation method. It will assign the value of the nearest data point to each interpolation point, which can help preserve the distinct levels.
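The difference between 'nearest' and 'linear' on terraced data can be checked directly. The sketch below uses a synthetic two-terrace point cloud (hypothetical data, with the step placed at x = 0.5) and counts how many interpolated values fall strictly between the two levels.

```python
import numpy as np
from scipy.interpolate import griddata

rng = np.random.default_rng(1)

# Hypothetical scattered samples of two flat terraces with a step at x = 0.5.
xy = rng.uniform(0.0, 1.0, size=(4000, 2))
z = np.where(xy[:, 0] < 0.5, 0.0, 1.0)

# Target grid well inside the sampled area; one grid column sits on the step.
axis = np.linspace(0.05, 0.95, 31)
gx, gy = np.meshgrid(axis, axis)

z_nearest = griddata(xy, z, (gx, gy), method="nearest")
z_linear = griddata(xy, z, (gx, gy), method="linear")

# 'nearest' can only return values that already exist in the data.
print("levels after nearest:", sorted(set(np.unique(z_nearest))))

# 'linear' blends across the step and creates in-between heights.
n_between = int(np.sum((z_linear > 1e-6) & (z_linear < 1 - 1e-6)))
print("linear values strictly between the terraces:", n_between)
```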

#### Terrace Smoothing - Gaussian Filter

Gaussian filtering is a technique used for smoothing data or reducing noise in a dataset. It's based on the Gaussian distribution (bell curve) and works by giving more weight to data points near the center of the distribution while reducing the influence of data points far from the center. Here's a breakdown of how the gaussian_filter function operates on your data:

**Data Preparation:** Your data is typically represented as a set of points with x, y, and z coordinates. In your case, you want to smooth the x and y coordinates while preserving the z coordinates (terrace levels).

**Sorting:** To apply smoothing at the terrace edges, it's helpful to sort your data based on the z coordinates. This ensures that when you apply the smoothing, it will affect each terrace level separately without mixing them.

**Iterating Through Terrace Levels:** The data is separated into different groups based on the unique z-levels (terrace levels). For each terrace level, the process is applied independently.

**Smoothing in X and Y:** Within each terrace level, Gaussian smoothing is applied to the x and y coordinates of the points. This smoothing involves convolving a Gaussian kernel with the data, where points near the center of the kernel have a higher influence on the smoothed result than points at the edges of the kernel. This effectively smooths out irregularities or noise in the x and y directions.

**Updating Coordinates:** After smoothing, new x and y coordinates are generated for the points at the current terrace level while keeping the original z coordinates unchanged.

**Combining Results:** The smoothed points from each terrace level are combined to form a single, smoothed point cloud.

The **sigma parameter** in the gaussian_filter function is the standard deviation of the Gaussian kernel and so controls its width. A larger sigma gives a broader kernel and stronger smoothing, while a smaller sigma gives a narrower kernel that retains more detail but removes less noise.

By applying this process to your data, you can effectively smooth the terrace edges without overlapping successive terraces and without impacting the bulk terrace and box edge, as you desired. This allows you to maintain the overall shape and structure of the terraces while reducing noise or jagged edges.
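The steps above can be sketched as follows. The helper name `smooth_terrace_edges` and the synthetic edge points are hypothetical, and the sketch assumes that the points within each terrace level are already ordered along the edge, so that neighbors in the array are also neighbors in space; without that ordering, filtering the coordinate arrays would mix unrelated points.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def smooth_terrace_edges(points, sigma=2.0):
    """Smooth x and y within each z-level; z is left untouched.

    points: (N, 3) array of x, y, z. Rows within one level are assumed
    to be ordered along the terrace edge.
    """
    smoothed = []
    # np.unique returns the levels sorted, so terraces are processed in z order.
    for level in np.unique(points[:, 2]):
        group = points[points[:, 2] == level]
        out = group.copy()
        out[:, 0] = gaussian_filter(group[:, 0], sigma=sigma, mode="nearest")
        out[:, 1] = gaussian_filter(group[:, 1], sigma=sigma, mode="nearest")
        smoothed.append(out)  # z column is copied unchanged
    return np.vstack(smoothed)

# Hypothetical data: two jagged terrace edges running along y.
rng = np.random.default_rng(0)
y = np.linspace(0.0, 1.0, 100)
edge0 = np.column_stack([0.3 + rng.normal(0, 0.01, 100), y, np.zeros(100)])
edge1 = np.column_stack([0.7 + rng.normal(0, 0.01, 100), y, np.ones(100)])
points = np.vstack([edge0, edge1])

smoothed = smooth_terrace_edges(points, sigma=2.0)

# Point-to-point jitter in x along the first edge, before and after.
rough_before = np.std(np.diff(points[:100, 0]))
rough_after = np.std(np.diff(smoothed[:100, 0]))
print(f"x jitter before: {rough_before:.4f}, after: {rough_after:.4f}")
```

Because each level is filtered separately, points from one terrace never influence another, which is what keeps successive terraces from overlapping.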

#### Normalisation

If you're comparing two point clouds (or structures) using topological data analysis (TDA), normalization is crucial, especially if the structures are being compared in terms of their shapes, densities, and connections rather than their exact spatial locations. The actual coordinate values can be arbitrary, and thus scaling or normalization provides a common ground to compare the structures.

Given the context:

**AFM Data:** AFM (Atomic Force Microscopy) data is high-resolution, capturing surface topography in real-world units (like nanometers). If you're using TDA, you're likely more interested in the "shape" and "features" of the topography rather than the exact nanometer-level details.

**Centroid Coordinates:** These are already a simplified representation, effectively "averaging" out the data into voxel centroids. Their exact coordinate values in real units might not be as critical, especially if you're doing a comparison using TDA.

Considering these points, scaling both datasets to a common range (like 0-32 for x, y, and 0-n for z) can be a sound strategy:

**Pros:**

**Uniformity:** Both datasets would be in a similar coordinate system, making it easier to compare.

**Simplification:** Operations, like the boundary-filling you're trying to do, become much easier when you're working within a known range.

**Better TDA Comparisons:** Normalizing or scaling data can lead to better TDA results since the algorithms will be comparing shape features at a similar scale.

**Cons:**

**Loss of Real-World Units:** While the relative positions remain, the exact "real-world" spatial locations are lost. However, they can be easily transformed back using the scaling relations if needed.

**Resolution Concerns:** If the dataset is scaled without care, there can be resolution concerns where certain features might get exaggerated or diminished. However, since we're already working with voxelized data, this might not be a big concern.

Given these points, if the main aim is TDA-based comparison, scaling both datasets to a similar coordinate system makes a lot of sense. If, however, there are other analyses where the exact real-world locations are essential, then keeping a copy of the original data in real-world units alongside the normalized copy would be ideal.
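The scaling described in this section can be sketched as a per-axis min-max rescale that also returns the scaling relations, so real-world units can be recovered later. The function names, the synthetic coordinates, and the choice of `n_levels = 3` for the upper z bound are assumptions made for illustration.

```python
import numpy as np

def scale_to_range(points, upper):
    """Rescale each axis of an (N, 3) array to [0, upper[axis]].

    Returns the scaled points and the parameters needed to undo the scaling.
    """
    points = np.asarray(points, dtype=float)
    upper = np.asarray(upper, dtype=float)
    lo = points.min(axis=0)
    span = points.max(axis=0) - lo
    span = np.where(span == 0, 1.0, span)  # avoid division by zero on flat axes
    scaled = (points - lo) / span * upper
    return scaled, (lo, span, upper)

def restore(scaled, params):
    """Invert scale_to_range using the stored scaling relations."""
    lo, span, upper = params
    return scaled / upper * span + lo

# Hypothetical AFM-like coordinates in nanometers.
rng = np.random.default_rng(0)
afm = rng.uniform([0.0, 0.0, 0.0], [500.0, 500.0, 3.0], size=(500, 3))

n_levels = 3  # hypothetical upper bound for the z axis
scaled, params = scale_to_range(afm, upper=(32, 32, n_levels))
restored = restore(scaled, params)

print("scaled min:", scaled.min(axis=0), "scaled max:", scaled.max(axis=0))
```

Applying the same target range to both datasets puts them in a common coordinate system, while the stored parameters cover the case where real-world positions are needed again.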