*Clustering*

    The goal is to group similar instances together into clusters. Clustering is a great tool for data analysis, customer segmentation, recommender systems, search engines, image segmentation, semisupervised learning, dimensionality reduction, and more.

*Anomaly detection*

    The objective is to learn what “normal” data looks like, and then use that to detect abnormal instances, such as defective items on a production line or a new trend in a time series.

*Density estimation*

    This is the task of estimating the probability density function (PDF) of the random process that generated the dataset. Density estimation is commonly used for anomaly detection: instances located in very low-density regions are likely to be anomalies. It is also useful for data analysis and visualization.
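
As a rough illustration of density-based anomaly detection, the sketch below fits a kernel density estimator to synthetic data and flags the lowest-density instances. The choice of `KernelDensity`, the bandwidth, the 2% cutoff, and the data itself are all assumptions made for this example, not something these notes prescribe.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Synthetic data (assumption): 500 "normal" points plus one far-away point.
rng = np.random.default_rng(0)
X = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
X = np.vstack([X, [[8.0, 8.0]]])

# Estimate the density; score_samples returns the log of the estimated PDF.
kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X)
log_density = kde.score_samples(X)

# Flag the instances in the lowest 2% of estimated density (arbitrary cutoff).
threshold = np.percentile(log_density, 2)
anomalies = X[log_density < threshold]
print(f"{len(anomalies)} instances flagged as anomalies")
```

The cutoff is a tuning knob: lowering it flags fewer instances, raising it flags more.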

*Affinity* is any measure of how well an instance fits into a cluster.

Anomaly detection is particularly useful for detecting manufacturing defects and for *fraud detection*.

Some clustering algorithms look for instances centered around a particular point, called a *centroid*.

Assigning each instance to a single cluster is called *hard clustering*. It can be useful instead to give each instance a score per cluster, which is called *soft clustering*. In K-Means, the score can be the distance between the instance and each centroid.
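
A minimal sketch of the difference, assuming Scikit-Learn's `KMeans` and a synthetic dataset from `make_blobs` (both chosen only for illustration): `predict` gives a hard assignment, while `transform` gives one distance per centroid, which can serve as a soft score.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data (assumption): 300 points around 3 blob centers.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Hard clustering: one cluster index per instance.
hard_labels = kmeans.predict(X)

# Soft clustering: one score per cluster, here the distance to each centroid.
distances = kmeans.transform(X)

print(hard_labels[:5])
print(distances[:5].round(2))
```

The hard label of an instance is simply the cluster whose centroid is closest, so it can be recovered from the soft scores with `distances.argmin(axis=1)`.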

K-Means uses a performance metric called the model's *inertia*, which is the sum of the squared distances between each instance and its closest centroid.
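
Scikit-Learn exposes this metric as the `inertia_` attribute of a fitted `KMeans` model, and it is computed as a sum (not a mean) of squared distances. The sketch below recomputes it by hand on synthetic data; the dataset and parameter values are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data (assumption): 300 points around 3 blob centers.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Squared distance from each instance to its closest centroid.
closest_sq_dist = kmeans.transform(X).min(axis=1) ** 2

manual_inertia = closest_sq_dist.sum()   # matches kmeans.inertia_
mean_sq_dist = closest_sq_dist.mean()    # inertia divided by the number of instances

print(manual_inertia, kmeans.inertia_, mean_sq_dist)
```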

The "elbow" method for choosing the number of clusters (plotting the inertia against the number of clusters and picking the point where the curve bends) is rather coarse. A more precise but more computationally expensive approach is to use the *silhouette score*, which is the mean *silhouette coefficient* over all the instances.

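An instance's silhouette coefficient is:

$(b - a)/\max(a, b)$

where $a$ is the mean distance to the other instances in the same cluster and $b$ is the mean distance to the instances of the nearest other cluster. The coefficient ranges from -1 to +1: a value close to +1 means the instance is well inside its own cluster, a value close to 0 means it is near a cluster boundary, and a value close to -1 means it may have been assigned to the wrong cluster.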
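
To make the formula concrete, the sketch below computes the silhouette coefficient of one instance directly from the definition and compares it with Scikit-Learn's `silhouette_samples` and `silhouette_score`. The helper `silhouette_coefficient` and the synthetic data are hypothetical, written only for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic data (assumption): 200 points around 3 blob centers.
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

def silhouette_coefficient(i, X, labels):
    """Silhouette coefficient of instance i, straight from (b - a) / max(a, b)."""
    dists = np.linalg.norm(X - X[i], axis=1)
    same = labels == labels[i]
    same[i] = False  # exclude the instance itself from its own cluster mean
    a = dists[same].mean()
    # b: smallest mean distance to the instances of any other cluster
    b = min(dists[labels == k].mean() for k in np.unique(labels) if k != labels[i])
    return (b - a) / max(a, b)

manual = silhouette_coefficient(0, X, labels)
per_instance = silhouette_samples(X, labels)  # one coefficient per instance
score = silhouette_score(X, labels)           # mean over all instances

print(manual, per_instance[0], score)
```

Repeating the last few lines for several values of `n_clusters` and keeping the value with the highest score is one way to apply the method.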

*Image segmentation* is the task of partitioning an image into multiple segments. In *semantic segmentation*, all pixels that are part of the same object type get assigned to the same segment. In *instance segmentation*, all pixels that are part of the same individual object are assigned to the same segment.

*Label propagation* is a semi-supervised machine learning algorithm that assigns labels to previously unlabeled data points. At the start of the algorithm, a (generally small) subset of the data points have labels (or classifications). These labels are propagated to the unlabeled points throughout the course of the algorithm.
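
A small sketch of this idea, assuming Scikit-Learn's `LabelPropagation` class, in which unlabeled points are marked with `-1`. The dataset, the number of labeled points, and the `knn` kernel settings are arbitrary choices for the example.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.semi_supervised import LabelPropagation

# Synthetic data (assumption): 200 points in 2 classes.
X, y_true = make_blobs(n_samples=200, centers=2, random_state=0)

# Keep the labels of only 5 points per class; mark everything else as unlabeled (-1).
labeled_idx = np.concatenate([np.flatnonzero(y_true == c)[:5] for c in (0, 1)])
y_partial = np.full_like(y_true, -1)
y_partial[labeled_idx] = y_true[labeled_idx]

# Propagate the known labels through a k-nearest-neighbors graph.
model = LabelPropagation(kernel="knn", n_neighbors=7).fit(X, y_partial)
y_propagated = model.transduction_  # one label per instance, no -1 left

accuracy = (y_propagated == y_true).mean()
print(f"Labels given: {len(labeled_idx)}, accuracy after propagation: {accuracy:.2f}")
```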

*DBSCAN* defines clusters as continuous regions of high density. Here is how it works:

    • For each instance, the algorithm counts how many instances are located within a small distance ε (epsilon) from it. This region is called the instance’s ε-neighborhood.

    • If an instance has at least min_samples instances in its ε-neighborhood (including itself), then it is considered a core instance. In other words, core instances are those that are located in dense regions.

    • All instances in the neighborhood of a core instance belong to the same cluster. This neighborhood may include other core instances; therefore, a long sequence of neighboring core instances forms a single cluster.

    • Any instance that is not a core instance and does not have one in its neighborhood is considered an anomaly.
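
The steps above can be tried out with Scikit-Learn's `DBSCAN` class, whose `eps` and `min_samples` parameters correspond to ε and `min_samples` in the description. The moons dataset and the parameter values below are assumptions picked for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Synthetic data (assumption): two interleaved half-circles with a little noise.
X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)

dbscan = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = dbscan.labels_  # cluster index per instance; -1 marks an anomaly

# Core instances: at least min_samples instances in their eps-neighborhood.
core_mask = np.zeros(len(X), dtype=bool)
core_mask[dbscan.core_sample_indices_] = True

# Non-core instances that still fall in a core instance's neighborhood.
border_mask = (labels != -1) & ~core_mask

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_anomalies = int((labels == -1).sum())
print(n_clusters, int(core_mask.sum()), int(border_mask.sum()), n_anomalies)
```

Increasing `eps` tends to merge clusters and reduce the number of anomalies, while decreasing it has the opposite effect.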

Scikit-Learn implements several more clustering algorithms that are worth a look. Here are a few: *Agglomerative clustering, BIRCH, Mean-Shift, Affinity propagation, Spectral clustering*.

A *Gaussian mixture model* (GMM) is a probabilistic model that assumes that the instances were generated from a mixture of several Gaussian distributions whose parameters are unknown.

*Observed variables* (sometimes called observable variables or measured variables) are actually measured by the researcher. The opposite of an observed variable is a *latent variable*, also referred to as a *factor* or *construct*. A latent variable is hidden, and therefore can’t be observed.

An *expectation–maximization* (EM) algorithm is an iterative method to find (local) maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables.
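
Scikit-Learn's `GaussianMixture` class fits a GMM with EM. In the sketch below, the latent variable is the component that generated each instance. The synthetic dataset and the hyperparameter values are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic data (assumption): 600 points around 3 blob centers.
X, _ = make_blobs(n_samples=600, centers=3, random_state=42)

# n_init=10 reruns EM from 10 initializations, since EM only finds a local optimum.
gm = GaussianMixture(n_components=3, n_init=10, random_state=42).fit(X)

print(gm.converged_, gm.n_iter_)  # did EM converge, and in how many iterations
print(gm.weights_)                # estimated mixture weights
print(gm.means_)                  # estimated mean of each Gaussian

hard = gm.predict(X)        # most likely component per instance
soft = gm.predict_proba(X)  # probability of each component per instance
```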

*Generative models* learn the distribution of the data, so they can generate new instances by sampling from it. A GMM is one example.
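
For instance, a fitted Scikit-Learn `GaussianMixture` can draw new instances with its `sample` method. This is a minimal sketch on synthetic data, with all parameter values assumed for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic data (assumption): 600 points around 3 blob centers.
X, _ = make_blobs(n_samples=600, centers=3, random_state=42)
gm = GaussianMixture(n_components=3, n_init=10, random_state=42).fit(X)

# Draw 5 new instances, along with the index of the component each came from.
X_new, y_new = gm.sample(5)
print(X_new)
print(y_new)
```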

To select the number of clusters for a GMM, see pp. 267-270.
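
These notes do not say which method those pages describe. One common approach, shown here purely as an assumption, is to fit a GMM for several candidate values and keep the one that minimizes an information criterion such as BIC or AIC, both of which Scikit-Learn's `GaussianMixture` exposes.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic data (assumption): 600 points around 3 blob centers.
X, _ = make_blobs(n_samples=600, centers=3, random_state=42)

# Fit one model per candidate number of components (the range is arbitrary).
candidate_ks = range(1, 7)
models = [
    GaussianMixture(n_components=k, n_init=5, random_state=42).fit(X)
    for k in candidate_ks
]

bics = [m.bic(X) for m in models]  # lower is better
aics = [m.aic(X) for m in models]  # lower is better

best_k = candidate_ks[int(np.argmin(bics))]
print(f"Number of components with the lowest BIC: {best_k}")
```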