### Clustering and Downsampling: Notes

#### Mean shift clustering

**Idea:** Mean shift identifies clusters by iteratively shifting each data point towards the nearest mode (local peak) of the estimated data density. Points that converge to the same mode form one cluster.

**Pros:** It doesn't require specifying the number of clusters in advance and can discover clusters of arbitrary shapes.

**Cons:** It can be computationally intensive, and both the runtime and the number of clusters found depend on the choice of the bandwidth parameter (the kernel radius).
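To make the "shift towards the mode" idea concrete, here is a minimal NumPy sketch of mean shift on 1-D values with a flat (top-hat) kernel. It is an illustration only, not the implementation used for the AFM data: the function name `mean_shift_1d`, the synthetic values, and the bandwidth of 1.0 are all assumptions made for this example.

```python
import numpy as np

def mean_shift_1d(values, bandwidth, n_iter=50, tol=1e-9):
    """Flat-kernel mean shift: move every point to the mean of the
    data lying within `bandwidth` of its current position."""
    values = np.asarray(values, dtype=float)
    modes = values.copy()
    for _ in range(n_iter):
        # within[i, j] is True when data point j lies inside the window of mode i
        within = np.abs(values[None, :] - modes[:, None]) <= bandwidth
        shifted = (within * values[None, :]).sum(axis=1) / within.sum(axis=1)
        converged = np.max(np.abs(shifted - modes)) < tol
        modes = shifted
        if converged:
            break
    return modes

# Synthetic data (assumption): two well-separated groups of values
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(0.0, 0.2, 100), rng.normal(5.0, 0.2, 100)])

modes = mean_shift_1d(values, bandwidth=1.0)
found = np.unique(np.round(modes, 6))  # points sharing a mode share a cluster
print("distinct modes:", found)
```

Every point ends up at the mode of its own group, so the number of distinct modes is the number of clusters. No cluster count was supplied; only the bandwidth was.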

**Bandwidth observations (mean shift on the AFM sample data):**

bandwidth = 0.00000475 (4.75e-6) was found to be suitable for THIS instance of AFM data; it is not a general recommendation.

bandwidth = 0.000005 (5e-6) eliminated the level-2 terrace in the sample data.

bandwidth < 0.00000475 produced spurious monatomic-width pseudo-levels between terraces.
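The bandwidth sensitivity noted above can be reproduced on synthetic data with scikit-learn's `MeanShift`. This is a sketch under stated assumptions: the three terrace levels, the noise level, and the bandwidth values are invented for illustration and use arbitrary height units, not the AFM values quoted above. The helper `count_terraces` is hypothetical.

```python
import numpy as np
from sklearn.cluster import MeanShift

# Synthetic heights (assumption): three terraces at 0, 1 and 2 with small noise
rng = np.random.default_rng(0)
levels = [0.0, 1.0, 2.0]
heights = np.concatenate([rng.normal(level, 0.05, size=300) for level in levels])

def count_terraces(z, bandwidth):
    """Cluster 1-D heights with mean shift and return the number of clusters."""
    ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
    ms.fit(z.reshape(-1, 1))  # scikit-learn expects shape (n_samples, n_features)
    return len(ms.cluster_centers_)

# Too small, reasonable, and too large relative to the terrace spacing of 1.0
for bw in (0.02, 0.4, 3.0):
    print(f"bandwidth={bw}: {count_terraces(heights, bw)} clusters")
```

A bandwidth well below the terrace spacing but above the noise recovers the three levels, and one larger than the spacing merges them into a single cluster. The count for the smallest bandwidth depends on the noise realisation, so it is worth inspecting rather than assuming.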

#### Regular grid interpolation

Interpolation estimates data values at positions between the original sample points, based on the known values around them. Resampling onto a new grid this way can follow finer details and curved edges than simple block averaging or decimation. The main points are:

**Interpolation Methods:**

Interpolation builds a function that passes through the original data points and approximates the values between them.
Different methods, such as linear, cubic, or spline interpolation, use polynomials of different order to estimate those values.
Higher-order methods fit a smooth curve or surface through the data points, so they can follow curved edges more closely than linear interpolation.

**Higher Resolution:**

By defining a new grid with a higher resolution (more points) for interpolation, you increase the density of estimation points.
The denser grid samples the interpolating function more finely, so subtle variations and curved shapes are rendered more smoothly. It cannot recover detail that was absent from the original measurements.
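As a minimal sketch of resampling onto a finer grid, the example below uses `scipy.interpolate.RegularGridInterpolator` on a synthetic height map. The 16 x 16 input, the 61 x 61 target grid, and the sine/cosine surface are assumptions chosen for illustration, not AFM data.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

# Original coarse grid (assumption: synthetic smooth surface on a 16 x 16 grid)
x = np.linspace(0.0, 1.0, 16)
y = np.linspace(0.0, 1.0, 16)
X, Y = np.meshgrid(x, y, indexing="ij")
z = np.sin(2 * np.pi * X) * np.cos(2 * np.pi * Y)

# Build the interpolator from the grid axes and the values on that grid
interp = RegularGridInterpolator((x, y), z, method="linear")

# New, denser grid covering the same area
x_fine = np.linspace(0.0, 1.0, 61)
y_fine = np.linspace(0.0, 1.0, 61)
Xf, Yf = np.meshgrid(x_fine, y_fine, indexing="ij")

# The interpolator takes an (n_points, 2) array of query coordinates
points = np.stack([Xf.ravel(), Yf.ravel()], axis=-1)
z_fine = interp(points).reshape(Xf.shape)

print(z.shape, "->", z_fine.shape)
```

Passing a coarser target grid instead of a denser one downsamples with the same code. Linear interpolation never leaves the range of the original values, which makes it a conservative first choice.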

**Local Detail Preservation:**

Many interpolation methods, especially spline-based techniques, are good at preserving local detail such as curvature.
They create a smooth transition between known data points, which helps maintain the integrity of features like curved edges. At very sharp changes, such as step edges, higher-order splines can overshoot, whereas linear interpolation stays within the range of the neighbouring values.

Interpolation can be effective for preserving details, but it comes with trade-offs:

**Advantages of Interpolation:**

Preserves fine details and features, including curved edges.
Allows resampling to any target resolution: a denser grid upsamples the data, a coarser grid downsamples it.
Relatively straightforward to implement and use with SciPy's RegularGridInterpolator (scipy.interpolate).

**Disadvantages of Interpolation:**

Can introduce artificial smoothness: Some interpolation methods might oversmooth the data, potentially reducing the ability to capture sharp features.
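Sensitive to grid resolution: The effectiveness of interpolation depends on the density of the grid. A grid much finer than the original sampling adds no new information and can make noise or interpolation artifacts look like real features.
May not handle outliers well: Interpolation passes through every data point, so outliers and noise in the data are reproduced in the result rather than averaged out.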
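How the choice of method affects sharp features can be checked on a 1-D step profile, a rough stand-in for a terrace edge. This is an illustrative sketch: the step data are synthetic, and `CubicSpline` with natural boundary conditions stands in for "a higher-order method" in general.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Ideal step profile (assumption): height 0 for x < 10, height 1 otherwise
x = np.arange(20, dtype=float)
z = np.where(x < 10, 0.0, 1.0)

# Evaluate both interpolants on a much finer grid
x_fine = np.linspace(x[0], x[-1], 381)
z_linear = np.interp(x_fine, x, z)                       # piecewise linear
z_cubic = CubicSpline(x, z, bc_type="natural")(x_fine)   # smooth cubic spline

print("linear range:", z_linear.min(), z_linear.max())
print("cubic range: ", z_cubic.min(), z_cubic.max())
```

The linear result stays between 0 and 1, while the cubic spline rings around the step and leaves that range on both sides. For data dominated by flat terraces and sharp edges, that ringing can look like real height variation, so it is worth comparing methods before settling on one.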

#### t-SNE (t-distributed stochastic neighbor embedding)

t-SNE is an unsupervised dimensionality reduction technique, used mainly for visualization. It needs no labeled training data and operates on the input data directly. Key points:

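**Data Structure Preservation:** t-SNE maps high-dimensional data points to a lower-dimensional space (usually 2-D or 3-D) while preserving the pairwise similarity structure of the data as much as possible, with the emphasis on local neighbourhoods. It's effective at revealing clusters, local structures, and non-linear relationships, but distances between well-separated clusters in the embedding should not be over-interpreted.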

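**Parameter Tuning:** t-SNE has a few parameters, most importantly the perplexity, which roughly sets the effective number of neighbours each point considers and so controls the trade-off between preserving local and global structure. Values between 5 and 50 are typical, and the perplexity must be smaller than the number of samples. The best value depends on the specific dataset and use case.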

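**No Pre-trained Models:** t-SNE needs no pre-trained model; the embedding is fitted directly on your data, making it suitable for a wide range of applications. The flip side is that standard t-SNE learns no reusable mapping, so new points cannot be added to an existing embedding without re-running it.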

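**Computational Cost:** t-SNE can be computationally expensive, especially for large datasets, because the exact algorithm scales quadratically with the number of points. The Barnes-Hut approximation (the default method in scikit-learn) reduces this cost, and "MulticoreTSNE" is a parallel implementation of Barnes-Hut t-SNE that can speed up the process further.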

t-SNE can be effective with relatively small to moderate-sized datasets. It doesn't require a large amount of data, but more data points give a better-populated picture of each neighbourhood, which helps when you want to capture subtle relationships and structures.

If you have a specific application where you want to use t-SNE to visualize or reduce the dimensionality of your data, you can start with a reasonable subset of your AFM data to see how t-SNE performs. You can then adjust the perplexity and other parameters to fine-tune the results.
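A minimal sketch of that workflow with scikit-learn's `TSNE` follows. The feature matrix is synthetic (three Gaussian blobs in 10 dimensions); how feature vectors would be built from AFM data is not specified in these notes, so `features`, the subset size of 200, and the two perplexity values are all assumptions for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for per-sample feature vectors (assumption):
# three groups of 400 points each in a 10-dimensional feature space
rng = np.random.default_rng(0)
centers = rng.normal(0.0, 5.0, size=(3, 10))
features = np.vstack([c + rng.normal(0.0, 1.0, size=(400, 10)) for c in centers])

# Start with a random subset to keep the run time short
subset_idx = rng.choice(len(features), size=200, replace=False)
subset = features[subset_idx]

# Fit one embedding per perplexity value and keep them for comparison
embeddings = {}
for perplexity in (5, 30):
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=0)
    embeddings[perplexity] = tsne.fit_transform(subset)
    print(perplexity, embeddings[perplexity].shape)
```

Each embedding has one 2-D coordinate per input row, so it can be passed straight to a scatter plot. Comparing the plots across perplexity values (and across a few `random_state` seeds) is a quick way to see which structures are stable before scaling up to more data.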

You can find various implementations of t-SNE in Python libraries like scikit-learn and MulticoreTSNE, which provide straightforward ways to apply t-SNE to your data.