# <a id='toc1_'></a>[Unsupervised Learning](#toc0_)

**Table of contents**<a id='toc0_'></a>    
- [Unsupervised Learning](#toc1_)    
  - [Optimization Based](#toc1_1_)    
    - [K-means](#toc1_1_1_)    
    - [K-medians](#toc1_1_2_)    
  - [Hierarchical](#toc1_2_)    
    - [Agglomerative](#toc1_2_1_)    
    - [Divisive](#toc1_2_2_)    
  - [Density Based](#toc1_3_)    
    - [DBSCAN](#toc1_3_1_)    
    - [OPTICS](#toc1_3_2_)    
    - [Spectral Clustering](#toc1_3_3_)    
  - [Model Based](#toc1_4_)    
    - [Gaussian Mixture](#toc1_4_1_)    
    - [Finite Mixture](#toc1_4_2_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

Unsupervised learning is a machine learning approach where the model learns patterns and structures in unlabeled data without explicit guidance or supervision from labeled examples. Unlike supervised learning, unsupervised learning does not have a specific target variable to predict. Instead, it focuses on discovering underlying patterns, relationships, or groupings within the data.

Real-world examples of unsupervised learning include:

1. **Clustering Customer Segmentation**: In marketing, clustering techniques can be used to segment customers based on their purchasing behavior, demographics, or browsing patterns. This helps businesses identify distinct customer groups for targeted marketing campaigns or personalized recommendations.

2. **Anomaly Detection in Network Traffic**: Unsupervised learning algorithms can be employed to detect anomalies or unusual patterns in network traffic data. By analyzing network behavior, the algorithms can identify potential security threats or abnormal activities that deviate from normal network behavior.

Evaluating the performance of clustering ML models generally involves the following approaches:

 1. **Intrinsic Evaluation**: This involves assessing the quality of clustering directly using intrinsic measures. Examples include silhouette coefficient, Dunn index, or Calinski-Harabasz index, which evaluate the compactness and separation of clusters based on internal cluster characteristics.

 2. **Extrinsic Evaluation**: Here, the clustering results are evaluated based on external criteria or domain-specific knowledge. This may involve comparing the clustering results with known ground truth labels or expert interpretations to measure the accuracy or agreement.

 3. **Visual Inspection**: Visualizing the clustering results can provide insights into the quality and meaningfulness of the clusters. Techniques like scatter plots, heatmaps, or dendrograms can help assess the coherence and separability of the clusters.

 4. **Application-Specific Evaluation**: Evaluation can also be based on the specific application or use case. For example, in customer segmentation, the effectiveness of the clusters can be evaluated by analyzing their impact on marketing campaign performance or customer satisfaction metrics.

It's important to note that evaluation in unsupervised learning is often more subjective and challenging compared to supervised learning, as there is no ground truth or explicit target variable. Therefore, a combination of evaluation methods and domain expertise is often employed to assess the performance and usefulness of clustering ML models.
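
The intrinsic and extrinsic measures listed above can be computed with `sklearn.metrics`. The following is a minimal sketch, assuming synthetic data from `make_blobs` (chosen only for illustration, so that ground-truth labels exist for the extrinsic score); the Dunn index is left out because scikit-learn does not provide it.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    adjusted_rand_score,
    calinski_harabasz_score,
    silhouette_score,
)

# Synthetic data with known labels (an assumption for illustration only).
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Intrinsic: needs only the data and the predicted labels.
sil = silhouette_score(X, labels)        # in [-1, 1], higher is better
ch = calinski_harabasz_score(X, labels)  # unbounded, higher is better

# Extrinsic: compares against ground truth when it happens to be available.
ari = adjusted_rand_score(y_true, labels)  # 1.0 means identical partitions

print(f"silhouette={sil:.3f}  calinski_harabasz={ch:.1f}  ARI={ari:.3f}")
```

In practice the extrinsic score is rarely available, which is why the intrinsic scores and visual inspection carry most of the weight.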

## <a id='toc1_1_'></a>[Optimization Based](#toc0_)

Optimization-based clustering methods aim to optimize a certain objective function to create the best possible clusters. The most common optimization-based clustering method is K-means, which minimizes the within-cluster sum of squares.

**General Intuition:** The algorithm starts with a random initialization of cluster centers and iteratively updates the centers and the cluster assignments until the objective function converges.
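
The iteration described above can be sketched in plain NumPy. This is an illustrative version of the standard Lloyd iteration for K-means, not scikit-learn's implementation; the function name `lloyd_kmeans` and the toy data are assumptions made for this example.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd iteration: assign points, then recompute centers."""
    rng = np.random.default_rng(seed)
    # Random initialization: pick k distinct data points as starting centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: squared Euclidean distance to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # converged: the centers stopped moving
        centers = new_centers
    return centers, labels

# Toy data: two well-separated blobs of 50 points each.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
centers, labels = lloyd_kmeans(X, k=2)
print(centers)
```

Changing `seed` changes the starting centers, which is the source of the initialization sensitivity listed under the limitations.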

**Limitations:** 
- It assumes that clusters are convex and isotropic, which is not always the case in real-world data.
- It's sensitive to initialization, and different initializations can lead to different results.
- It requires the number of clusters to be specified beforehand.

### <a id='toc1_1_1_'></a>[K-means](#toc0_)

- **Description and Intuition of Test:** K-means is a centroid-based clustering algorithm that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.

- **Use Case for Test:** K-means is used when you have unlabeled data and you're trying to identify groups of similar instances within the data.

- **The Intuition for Using it for Classification:** K-means is not typically used for classification tasks as it is an unsupervised learning algorithm. However, the clusters formed by K-means can be used as labels to train a separate supervised learning model.

- **The Intuition for Using it for Regression:** K-means is not used for regression tasks as it is a clustering algorithm and does not predict continuous outcomes.

- **The Formula for Probability:** Not applicable for K-means as it is not a probabilistic model.

- **The Formula for the Cost Function:** The cost function for K-means, also known as the inertia, is the sum of squared distances of samples to their closest cluster center.

- **How to Code it:**

```python
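from sklearn.cluster import KMeans

# X: array-like of shape (n_samples, n_features), assumed to be defined already
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(X)
labels = kmeans.labels_  # cluster index assigned to each sample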
```
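
With $\mu_{c(i)}$ denoting the center assigned to sample $x_i$, the inertia described above can be written as $J = \sum_{i=1}^{n} \lVert x_i - \mu_{c(i)} \rVert^2$. As a small check, the sketch below recomputes it by hand and compares it with the fitted model's `inertia_` attribute; the `make_blobs` data is an assumption used only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)  # illustrative data
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Squared Euclidean distance from each sample to its assigned center.
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]
manual_inertia = ((X - assigned_centers) ** 2).sum()

print(manual_inertia, kmeans.inertia_)  # the two values should agree closely
```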

- **Most Important Hyperparameters to Tune:**

  - `n_clusters`: The number of clusters to form and the number of centroids to generate.
  - `init`: Method for initialization (for example `'k-means++'` or `'random'`).
  - `n_init`: Number of times the k-means algorithm will be run with different centroid seeds.

- **Assumptions of the ML Model:** The K-means clustering technique, a centroid-based clustering algorithm, relies on several assumptions that influence its clustering process. These assumptions include:
  - The variance of the distribution of each attribute (variable) is spherical.
  - All clusters have the same variance.
  - The prior probability for all k clusters are the same, i.e., each cluster has roughly equal number of observations.

  1. **Initial Centroid Assumption**: K-means requires an initial placement of centroids within the feature space. The algorithm begins by choosing the initial centroids (randomly, or with a seeding scheme such as k-means++), and the final clusters can depend on this choice.

  2. **Euclidean Distance Assumption**: K-means assumes the use of the Euclidean distance metric to measure the similarity or dissimilarity between data points and centroids. This assumption implies that the clustering process aims to minimize the Euclidean distances between data points and the centroids of their assigned clusters.

  3. **Cluster Membership Assumption**: K-means assumes that each data point exclusively belongs to only one cluster. This assumption implies that data points are assigned to the cluster with the closest centroid based on the Euclidean distance measure.

  4. **Cluster Centroid Optimization Assumption**: K-means assumes that the optimal clustering solution is achieved when the sum of squared distances between data points and their respective cluster centroids is minimized. This assumption drives the iterative optimization process of adjusting the centroids to minimize the within-cluster variation.

  5. **Cluster Size Assumption**: K-means assumes that clusters in the dataset have approximately equal sizes or variances. This assumption implies that each cluster contains a similar number of data points, contributing to a balanced representation of the dataset.

  These assumptions guide the K-means algorithm during the iterative optimization process, where the centroids are updated to minimize the within-cluster sum of squared distances. They shape the behavior of the algorithm and influence the resulting cluster assignments.

- **How to Interpret the Model's Coefficients:** K-means does not produce model coefficients as it is a clustering algorithm and not a regression model.
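
Since `n_clusters` has to be chosen up front, a common approach is to fit several values and compare inertia and silhouette score. The sketch below assumes synthetic `make_blobs` data and an arbitrary search range of 2 to 7 clusters; both are illustrative choices rather than recommendations.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

scores = {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(X)
    scores[k] = (km.inertia_, silhouette_score(X, km.labels_))
    print(f"k={k}  inertia={scores[k][0]:.1f}  silhouette={scores[k][1]:.3f}")

# Inertia tends to keep falling as k grows, so it is usually read via an "elbow";
# the silhouette score can be maximized directly.
best_k = max(scores, key=lambda k: scores[k][1])
print("best k by silhouette:", best_k)
```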



### <a id='toc1_1_2_'></a>[K-medians](#toc0_)

- **Description and Intuition of Test:** K-medians is similar to K-means but instead of the mean, it uses the median of the cluster. It is more robust to outliers than K-means.

- **Use Case for Test:** K-medians is used when you have unlabeled data and you're trying to identify groups of similar instances within the data, especially when the data contains outliers.

- **The Intuition for Using it for Classification:** Like K-means, K-medians is not typically used for classification tasks as it is an unsupervised learning algorithm. However, the clusters formed by K-medians can be used as labels to train a separate supervised learning model.

- **The Intuition for Using it for Regression:** K-medians is not used for regression tasks as it is a clustering algorithm and does not predict continuous outcomes.

- **The Formula for Probability:** Not applicable for K-medians as it is not a probabilistic model.

- **The Formula for the Cost Function:** The cost function for K-medians is the sum of absolute deviations of samples to their closest cluster center.

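- **How to Code it:** Sklearn does not have a direct implementation of K-medians, and its `KMeans` estimator only supports Euclidean distance, so there is no metric option that switches it to Manhattan distance. K-medians therefore needs a third-party library or custom code that assigns points by Manhattan distance and updates each center to the coordinate-wise median of its cluster.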

- **Most Important Hyperparameters to Tune:** Same as K-means.

- **Assumptions of the ML Model:** Same as K-means. The K-median clustering technique, an alternative to K-means, also relies on specific assumptions that affect its clustering process. These assumptions include:

    1. **Initial Median Assumption**: K-median assumes the initial placement of medians within the feature space. The algorithm begins by randomly assigning the initial medians to represent the cluster centers.

    2. **Manhattan Distance Assumption**: K-median assumes the use of the Manhattan distance metric (also known as L1 distance) to measure the similarity or dissimilarity between data points and medians. This assumption implies that the clustering process aims to minimize the Manhattan distances between data points and the medians of their assigned clusters.

    3. **Cluster Membership Assumption**: K-median assumes that each data point exclusively belongs to only one cluster. This assumption implies that data points are assigned to the cluster with the closest median based on the Manhattan distance measure.

    4. **Cluster Median Optimization Assumption**: K-median assumes that the optimal clustering solution is achieved when the sum of absolute distances (Manhattan distances) between data points and their respective cluster medians is minimized. This assumption drives the iterative optimization process of adjusting the medians to minimize the within-cluster variation.

    5. **Cluster Size Assumption**: K-median assumes that clusters in the dataset have approximately equal sizes or variances, similar to K-means. This assumption implies that each cluster contains a similar number of data points, contributing to a balanced representation of the dataset.

    These assumptions guide the K-median algorithm during the iterative optimization process, where the medians are updated to minimize the within-cluster sum of absolute distances. They shape the behavior of the algorithm and influence the resulting cluster assignments.


- **How to Interpret the Model's Coefficients:** K-medians does not produce model coefficients as it is a clustering algorithm and not a regression model.
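
Because scikit-learn has no K-medians estimator, the algorithm is sketched here in plain NumPy. The function name `k_medians`, its parameters, and the toy data are assumptions made for this example; it follows the description above (Manhattan assignment, median update) and is not a tuned implementation.

```python
import numpy as np

def k_medians(X, k, n_iter=100, seed=0):
    """K-medians sketch: L1 assignment step, coordinate-wise median update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: Manhattan (L1) distance to every center.
        d1 = np.abs(X[:, None, :] - centers[None, :, :]).sum(axis=2)
        labels = d1.argmin(axis=1)
        # Update step: coordinate-wise median of each cluster
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            np.median(X[labels == j], axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    cost = np.abs(X - centers[labels]).sum()  # sum of absolute deviations
    return centers, labels, cost

# Toy data: two blobs plus one extreme outlier.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 0.5, (50, 2)),
    rng.normal(6, 0.5, (50, 2)),
    [[0.0, 40.0]],
])
centers, labels, cost = k_medians(X, k=2)
print(centers, cost)
```

The median update is what gives the robustness to outliers: a single extreme point shifts a cluster's median far less than it shifts its mean.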



## <a id='toc1_2_'></a>[Hierarchical](#toc0_)

Hierarchical clustering methods build a hierarchy of clusters either in a bottom-up (agglomerative) or top-down (divisive) fashion.

**General Intuition:** Agglomerative clustering starts with each data point as a separate cluster and merges the closest pair of clusters at each step. Divisive clustering starts with all data points in one cluster and splits the cluster at each step.

**Limitations:** 
- They can be computationally expensive for large datasets.
- Once a decision is made to combine two clusters, it cannot be undone.
- They may suffer from noise and outliers.




### <a id='toc1_2_1_'></a>[Agglomerative](#toc0_)

- **Description and Intuition of Test:** Agglomerative clustering is a hierarchical clustering method that starts with each observation in its own cluster, and then merges the clusters iteratively based on their distance until only one cluster (or k clusters) remains.

- **Use Case for Test:** Agglomerative clustering is used when you want to build a hierarchy of clusters or when you don't know the number of clusters in advance.

- **The Intuition for Using it for Classification:** Agglomerative clustering is not typically used for classification tasks as it is an unsupervised learning algorithm. However, the clusters formed by agglomerative clustering can be used as labels to train a separate supervised learning model.

- **The Intuition for Using it for Regression:** Agglomerative clustering is not used for regression tasks as it is a clustering algorithm and does not predict continuous outcomes.

- **The Formula for Probability:** Not applicable for agglomerative clustering as it is not a probabilistic model.

- **The Formula for the Cost Function:** There is no explicit cost function for agglomerative clustering. The algorithm is driven by the linkage criteria, which can be single linkage (minimum distance), complete linkage (maximum distance), average linkage, etc.

- **How to Code it:**

```python
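from sklearn.cluster import AgglomerativeClustering

# X: array-like of shape (n_samples, n_features), assumed to be defined already
agg = AgglomerativeClustering(n_clusters=3)
agg.fit(X)
labels = agg.labels_  # cluster index assigned to each sample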
```
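
To see the full merge hierarchy and the effect of the linkage criterion, SciPy's `scipy.cluster.hierarchy` module can be used alongside scikit-learn. The sketch below builds the linkage matrix for several criteria and cuts each tree into three clusters; the toy blobs are an assumption for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
# Three illustrative blobs of 20 points each.
X = np.vstack([rng.normal(c, 0.3, (20, 2)) for c in (0.0, 4.0, 8.0)])

for method in ("single", "complete", "average", "ward"):
    Z = linkage(X, method=method)  # one row per merge step, n - 1 rows in total
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut into at most 3 clusters
    print(method, np.bincount(labels)[1:])  # cluster sizes (labels start at 1)
```

The same matrix `Z` can be passed to `scipy.cluster.hierarchy.dendrogram` to draw the tree, which is the usual way to pick a cut height when the number of clusters is not known in advance.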

- **Most Important Hyperparameters to Tune:**
  - `n_clusters`: The number of clusters to find.
  - `metric`: Metric used to compute the linkage (this parameter was called `affinity` before scikit-learn 1.2).
  - `linkage`: Which linkage criterion to use.

- **Assumptions of the ML Model:** The agglomerative hierarchical clustering technique, also known as bottom-up clustering, is based on several assumptions that govern its clustering process. These assumptions include:

  - The data is not too noisy.
  - The clusters are isotropic (same in all directions), and not elongated or irregularly shaped.

  1. **Singleton Assumption**: Agglomerative clustering assumes that each object initially forms a separate cluster. This means that at the beginning of the clustering process, each object is considered as an individual cluster.

  2. **Proximity Assumption**: Agglomerative clustering assumes that the proximity or similarity between two clusters can be measured based on the distances or similarities between their constituent objects. This assumption allows the algorithm to determine the merging order of clusters based on their proximity.

  3. **Continuity Assumption**: Agglomerative clustering assumes that objects that are close to each other in the feature space tend to belong to the same cluster. This assumption is based on the notion that nearby objects are more likely to share similar characteristics or attributes.

  4. **Agglomeration Criterion Assumption**: Agglomerative clustering assumes the use of a specific criterion to determine the similarity or dissimilarity between clusters during the merging process. Common agglomeration criteria include single linkage, complete linkage, and average linkage, which define the distance or similarity measure between clusters.

  5. **Hierarchy Assumption**: Agglomerative clustering assumes that clusters can be organized in a hierarchical structure, where smaller clusters are successively merged to form larger clusters. This hierarchical representation allows for the exploration of clusters at different levels of granularity.

  These assumptions guide the agglomerative hierarchical clustering algorithm and shape the construction of the cluster hierarchy. They define the rules for merging clusters based on proximity measures and determine the overall clustering behavior of the algorithm.
  
- **How to Interpret the Model's Coefficients:** Agglomerative clustering does not produce model coefficients as it is a clustering algorithm and not a regression model.



### <a id='toc1_2_2_'></a>[Divisive](#toc0_)

- **Description and Intuition of Test:** Divisive clustering is a hierarchical clustering method that starts with all observations in one cluster and then splits the clusters iteratively based on their distance until each observation is in its own cluster.

- **Use Case for Test:** Divisive clustering is used when you want to build a hierarchy of clusters or when you don't know the number of clusters in advance.

- **The Intuition for Using it for Classification:** Divisive clustering is not typically used for classification tasks as it is an unsupervised learning algorithm. However, the clusters formed by divisive clustering can be used as labels to train a separate supervised learning model.

- **The Intuition for Using it for Regression:** Divisive clustering is not used for regression tasks as it is a clustering algorithm and does not predict continuous outcomes.

- **The Formula for Probability:** Not applicable for divisive clustering as it is not a probabilistic model.

- **The Formula for the Cost Function:** There is no explicit cost function for divisive clustering. The algorithm is driven by the distance metric and the splitting criteria.

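- **How to Code it:** Sklearn does not have a general implementation of divisive hierarchical clustering. The closest built-in option is `BisectingKMeans` (available from scikit-learn 1.1), which repeatedly splits a cluster in two with K-means; other divisive schemes require other libraries or custom code.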

- **Most Important Hyperparameters to Tune:** The choice of distance metric and the splitting criteria are the key factors to tune in divisive clustering.

- **Assumptions of the ML Model:** Same as agglomerative clustering. The divisive hierarchical clustering technique, also known as top-down clustering, makes certain assumptions during the clustering process. These assumptions include:


  1. **Prototype Assumption**: Divisive clustering assumes that each cluster can be represented by a single prototype or centroid. This means that all objects within a cluster are similar to the prototype but differ from objects in other clusters.

  2. **Nesting Assumption**: Divisive clustering assumes that clusters are nested within each other in a hierarchical structure. This implies that at each level of the hierarchy, a cluster can be further divided into subclusters.

  3. **Non-overlapping Assumption**: Divisive clustering assumes that clusters do not overlap, meaning that each object belongs to only one cluster in the hierarchy. This assumption helps create a clear distinction between clusters.

  4. **Homogeneity Assumption**: Divisive clustering assumes that objects within a cluster are more similar to each other than to objects in other clusters. This assumption is based on the idea that clusters should be internally homogeneous and externally dissimilar.

  5. **Complete Linkage Assumption**: Divisive clustering typically uses complete linkage as the proximity measure between clusters. Complete linkage calculates the dissimilarity between two clusters based on the maximum dissimilarity between any pair of objects from each cluster.

  These assumptions guide the divisive hierarchical clustering algorithm and shape the resulting hierarchy of clusters. They help define the criteria for splitting clusters and the similarity measures used to determine cluster relationships.

- **How to Interpret the Model's Coefficients:** Divisive clustering does not produce model coefficients as it is a clustering algorithm and not a regression model.
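
One simple way to get top-down behavior with custom code is a bisecting scheme: start with one cluster and repeatedly split one cluster in two. The sketch below is only one possible divisive strategy, not a reference algorithm; the function name `bisecting_divisive`, the use of 2-means for each split, and the "split the largest cluster" rule are all assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_divisive(X, n_clusters, seed=0):
    """Top-down clustering: repeatedly split the largest cluster in two."""
    labels = np.zeros(len(X), dtype=int)  # start with everything in cluster 0
    for new_label in range(1, n_clusters):
        # Splitting criterion (an arbitrary choice here): pick the largest cluster.
        target = np.bincount(labels).argmax()
        idx = np.flatnonzero(labels == target)
        halves = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X[idx])
        labels[idx[halves == 1]] = new_label  # the other half keeps the old label
    return labels

# Toy data: three blobs of 30 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, (30, 2)) for c in (0.0, 5.0, 10.0)])
labels = bisecting_divisive(X, n_clusters=3)
print(np.bincount(labels))
```

Stopping after `n_clusters - 1` splits gives a flat clustering; recording each split instead would give the full hierarchy.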




## <a id='toc1_3_'></a>[Density Based](#toc0_)

Density-based clustering methods group together data points that are in regions of high density and separate data points that are in regions of low density. The most common density-based clustering method is DBSCAN.

**General Intuition:** The algorithm starts with an arbitrary data point, and if there are at least 'minPts' nearby, a new cluster is started. The algorithm then iteratively adds all directly reachable points to the cluster. If a point is not directly reachable, the algorithm checks the next point.

**Limitations:** 
- It assumes that clusters are dense regions in the data space separated by regions of lower object density.
- It may have difficulty finding clusters of varying densities.
- It requires the specification of the 'eps' parameter, which can be difficult to estimate.
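
A common heuristic for the `eps` difficulty noted above is the k-distance curve: sort every point's distance to its k-th nearest neighbor and look for a sharp bend. The sketch below assumes `make_moons` data and `min_samples = 5` purely for illustration, and it prints a few percentiles rather than plotting the curve.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
min_samples = 5

# Distance from every point to its min_samples-th nearest neighbor
# (the query point itself counts as the first neighbor, at distance 0).
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)
k_dist = np.sort(distances[:, -1])

# A sharp bend ("knee") in the sorted curve is a common candidate for eps.
print(np.percentile(k_dist, [50, 90, 99]))
```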





### <a id='toc1_3_1_'></a>[DBSCAN](#toc0_)

- **Description and Intuition of Test:** DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that groups together points that are packed closely together (points with many nearby neighbors), marking as outliers points that lie alone in low-density regions.

- **Use Case for Test:** DBSCAN is used when the clusters are of arbitrary shape (not necessarily spherical), and when there is noise in the data.

- **The Intuition for Using it for Classification:** DBSCAN is not typically used for classification tasks as it is an unsupervised learning algorithm. However, the clusters formed by DBSCAN can be used as labels to train a separate supervised learning model.

- **The Intuition for Using it for Regression:** DBSCAN is not used for regression tasks as it is a clustering algorithm and does not predict continuous outcomes.

- **The Formula for Probability:** Not applicable for DBSCAN as it is not a probabilistic model.

- **The Formula for the Cost Function:** There is no explicit cost function for DBSCAN. The algorithm is driven by the density estimation of the dataset.

- **How to Code it:**

```python
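from sklearn.cluster import DBSCAN

# X: array-like of shape (n_samples, n_features), assumed to be defined already
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan.fit(X)
labels = dbscan.labels_  # cluster index per sample; -1 marks noise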
```
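
The sketch below shows how noise and core points surface in the fitted model. It assumes two-moons data with one extra far-away point appended by hand, and `eps=0.2` is an illustrative value for this particular dataset.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
X = np.vstack([X, [[3.0, 3.0]]])  # one far-away point that should end up as noise

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_  # -1 marks noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
# Core points have at least min_samples points (themselves included) within eps.
n_core = len(db.core_sample_indices_)

print(f"clusters={n_clusters}  core points={n_core}  noise={np.sum(labels == -1)}")
```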

- **Most Important Hyperparameters to Tune:**
  - `eps`: The maximum distance between two samples for them to be considered as in the same neighborhood.
  - `min_samples`: The number of samples in a neighborhood for a point to be considered as a core point.

- **Assumptions of the ML Model:** DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular density-based clustering algorithm. Here are some of its key assumptions:
  - The density within each cluster is uniform.
  - Noise is present in the data.


  1. **Density Reachability:** In DBSCAN, a point p is directly density-reachable from a point q if p is within the 'eps' distance of q and q has a sufficient number of points in its neighborhood, as determined by 'minPts' (that is, q is a core point).

  2. **Density Connectivity:** A point p is density-connected to a point q if there's a point o such that both p and q are density-reachable from o.

  3. **Density-Based Clustering:** Clusters are dense regions in the data space separated by regions of lower object density. A cluster is defined as a maximal set of density-connected points.

  4. **Noise Points:** DBSCAN explicitly identifies points that are not part of any cluster. These points are classified as noise.

  5. **Arbitrary Shapes:** DBSCAN can discover clusters of arbitrary shape, not just spherical clusters like K-means.

  6. **Distance Metric:** DBSCAN uses a distance metric (such as Euclidean distance) to compute the distance between points. The choice of distance metric can have a significant impact on the results.

  7. **Parameter Selection:** The quality of DBSCAN depends on the distance measure used (Euclidean for numerical data, Hamming for categorical data, etc.) and it also depends heavily on the parameters eps and minPts. Different parameter settings can lead to significantly different clustering results.


- **How to Interpret the Model's Coefficients:** DBSCAN does not produce model coefficients as it is a clustering algorithm and not a regression model.



### <a id='toc1_3_2_'></a>[OPTICS](#toc0_)

- **Description and Intuition of Test:** OPTICS (Ordering Points To Identify the Clustering Structure) is a density-based clustering algorithm similar to DBSCAN, but it addresses one of DBSCAN's major weaknesses: the problem of choosing an appropriate value for the eps parameter. OPTICS generates an augmented ordering of the database representing its density-based clustering structure, which can be visualized and analyzed with various tools.

- **Use Case for Test:** OPTICS is used when the clusters are of arbitrary shape and when there is noise in the data. It's also used when the density varies across the clusters.

- **The Intuition for Using it for Classification:** OPTICS is not typically used for classification tasks as it is an unsupervised learning algorithm. However, the clusters formed by OPTICS can be used as labels to train a separate supervised learning model.

- **The Intuition for Using it for Regression:** OPTICS is not used for regression tasks as it is a clustering algorithm and does not predict continuous outcomes.

- **The Formula for Probability:** Not applicable for OPTICS as it is not a probabilistic model.

- **The Formula for the Cost Function:** There is no explicit cost function for OPTICS. The algorithm is driven by the density estimation of the dataset.

- **How to Code it:**

```python
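from sklearn.cluster import OPTICS

# X: array-like of shape (n_samples, n_features), assumed to be defined already
optics = OPTICS(min_samples=5)
optics.fit(X)
labels = optics.labels_  # cluster index per sample; -1 marks noise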
```
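
The "augmented ordering" mentioned above is exposed through the fitted model's `ordering_` and `reachability_` attributes. The sketch below assumes two synthetic blobs of different density and an illustrative `eps=0.5`; it also shows `cluster_optics_dbscan`, which extracts DBSCAN-style labels for a chosen `eps` without refitting.

```python
import numpy as np
from sklearn.cluster import OPTICS, cluster_optics_dbscan
from sklearn.datasets import make_blobs

# Two blobs with different spreads, i.e. different densities (illustrative data).
X, _ = make_blobs(
    n_samples=[150, 150],
    centers=[[0, 0], [6, 6]],
    cluster_std=[0.3, 1.0],
    random_state=0,
)

optics = OPTICS(min_samples=5, xi=0.05).fit(X)

# Reachability distances in cluster order: valleys correspond to clusters.
reachability = optics.reachability_[optics.ordering_]

# DBSCAN-like labels for a chosen eps can be extracted without refitting.
labels_eps = cluster_optics_dbscan(
    reachability=optics.reachability_,
    core_distances=optics.core_distances_,
    ordering=optics.ordering_,
    eps=0.5,
)
print(len(set(optics.labels_)), len(set(labels_eps)))
```

Plotting `reachability` against its index gives the reachability plot that the `xi` parameter operates on.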

- **Most Important Hyperparameters to Tune:**
  - `min_samples`: The number of samples in a neighborhood for a point to be considered as a core point.
  - `xi`: Determines the minimum steepness on the reachability plot that constitutes a cluster boundary.

- **Assumptions of the ML Model:**
  - The density within each cluster is not necessarily uniform, allowing clusters of varying density to be found.




### <a id='toc1_3_3_'></a>[Spectral Clustering](#toc0_)

- **Description and Intuition of Test:** Spectral Clustering is a technique that makes use of the spectrum (eigenvalues) of the similarity matrix of the data to perform dimensionality reduction before clustering in fewer dimensions. The similarity matrix is provided as input and consists of measurements of the similarity of each pair of points in the dataset.

- **Use Case for Test:** Spectral Clustering is used when the data is not linearly separable or when the clusters are of arbitrary shapes. It's also used when the structure of the data is complex and cannot be captured by traditional clustering methods.

- **The Intuition for Using it for Classification:** Spectral Clustering is not typically used for classification tasks as it is an unsupervised learning algorithm. However, the clusters formed by Spectral Clustering can be used as labels to train a separate supervised learning model.

- **The Intuition for Using it for Regression:** Spectral Clustering is not used for regression tasks as it is a clustering algorithm and does not predict continuous outcomes.

- **The Formula for Probability:** Not applicable for Spectral Clustering as it is not a probabilistic model.

- **The Formula for the Cost Function:** There is no explicit cost function for Spectral Clustering. The algorithm is driven by the eigenvalues of the Laplacian of the similarity matrix of the data.

- **How to Code it:**

```python
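from sklearn.cluster import SpectralClustering

# X: array-like of shape (n_samples, n_features), assumed to be defined already
spectral = SpectralClustering(n_clusters=3, random_state=0)
spectral.fit(X)
labels = spectral.labels_  # cluster index assigned to each sample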
```
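
A standard illustration of the non-convex case is two concentric circles, where K-means struggles and a nearest-neighbor affinity graph separates the rings. The sketch below assumes `make_circles` data and `n_neighbors=10`, both chosen only for this example, and compares the two methods by adjusted Rand index against the known ring labels.

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_circles
from sklearn.metrics import adjusted_rand_score

X, y_true = make_circles(n_samples=300, factor=0.4, noise=0.04, random_state=0)

spectral = SpectralClustering(
    n_clusters=2, affinity="nearest_neighbors", n_neighbors=10, random_state=0
)
ari_spectral = adjusted_rand_score(y_true, spectral.fit_predict(X))
ari_kmeans = adjusted_rand_score(
    y_true, KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
)
print(f"spectral ARI={ari_spectral:.2f}  k-means ARI={ari_kmeans:.2f}")
```

scikit-learn may warn that the neighbor graph is not fully connected on data like this; here that disconnection is exactly what makes the two rings easy to separate.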

- **Most Important Hyperparameters to Tune:**
  - `n_clusters`: The number of clusters to form.
  - `affinity`: How to construct the affinity matrix. Options include `'rbf'` (the default), `'nearest_neighbors'`, `'precomputed'`, or a callable.

- **Assumptions of the ML Model:**
  - The data is not necessarily linearly separable.
  - The clusters are connected and each cluster is a connected graph.

- **How to Interpret the Model's Coefficients:** Spectral Clustering does not produce model coefficients as it is a clustering algorithm and not a regression model.

## <a id='toc1_4_'></a>[Model Based](#toc0_)

Model-based clustering methods assume that the data is generated from a mixture of underlying probability distributions. The most common model-based clustering method is the Gaussian Mixture Model (GMM).

**General Intuition:** The algorithm estimates the parameters of the underlying distributions (such as the mean and variance for a Gaussian distribution) to maximize the likelihood of the observed data.

**Limitations:** 
- It assumes that the data is generated from a mixture of specific types of distributions, which may not always be the case.
- It can be sensitive to initialization.
- It may converge to local optima, depending on the initial parameter estimates.

### <a id='toc1_4_1_'></a>[Gaussian Mixture](#toc0_)

- **Description and Intuition of Test:** A Gaussian Mixture Model (GMM) is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters.

- **Use Case for Test:** GMMs are used when the data appears to be generated from a few different "groups" or "clusters", each of which can be modeled as a Gaussian distribution.

- **The Intuition for Using it for Classification:** GMMs can be used for classification tasks by treating each component of the mixture as a class. The posterior probability of each component given a data point can be used as a measure of the likelihood of that data point belonging to the corresponding class.

- **The Intuition for Using it for Regression:** GMMs are not typically used for regression tasks as they are a clustering algorithm and do not predict continuous outcomes.

- **The Formula for Probability:** The probability of a data point in a GMM is given by the weighted sum of the probabilities of that data point in each of the Gaussian distributions.

- **The Formula for the Cost Function:** The cost function for GMMs is the negative log-likelihood of the data given the parameters of the model.

- **How to Code it:**

```python
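from sklearn.mixture import GaussianMixture

# X: array-like of shape (n_samples, n_features), assumed to be defined already
gmm = GaussianMixture(n_components=3, random_state=0)
gmm.fit(X)
labels = gmm.predict(X)  # most probable component for each sample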
```
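
The probability and cost function described above can be written out explicitly. The notation is an assumption introduced here: $K$ components with mixing weights $\pi_k$, means $\mu_k$, and covariances $\Sigma_k$, and $n$ data points $x_1, \dots, x_n$.

$$
p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1
$$

$$
\mathcal{L}(\pi, \mu, \Sigma) = -\sum_{i=1}^{n} \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)
$$

The posterior probability that point $x_i$ came from component $k$, which is the quantity used for soft assignment, follows from Bayes' rule:

$$
\gamma_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}
$$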

- **Most Important Hyperparameters to Tune:**
  - `n_components`: The number of mixture components.
  - `covariance_type`: The type of covariance parameters to use.

- **Assumptions of the ML Model:**
  - The data is generated from a mixture of Gaussian distributions.
  - Each Gaussian component can be uniquely identified with a set of parameters (mean and covariance).

- **How to Interpret the Model's Coefficients:** The coefficients of a GMM are the means and covariances of the Gaussian components, and the weights of each component. They can be interpreted as the parameters of the Gaussian distributions from which the data is assumed to be generated.
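
The fitted parameters and posterior probabilities described above are available as attributes and methods of `GaussianMixture`. The sketch below assumes synthetic `make_blobs` data, and uses BIC over 1 to 5 components as one common (but not the only) way to compare values of `n_components`.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)
print(gmm.weights_)            # mixing proportions, one per component, summing to 1
print(gmm.means_.shape)        # (3, 2): one mean vector per component
print(gmm.covariances_.shape)  # (3, 2, 2) for covariance_type="full"

resp = gmm.predict_proba(X)  # posterior probability of each component per point
hard = resp.argmax(axis=1)   # hard assignment: most probable component

# BIC (lower is better) is one common way to compare n_components values.
bic = {
    k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
    for k in range(1, 6)
}
print(min(bic, key=bic.get))
```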



### <a id='toc1_4_2_'></a>[Finite Mixture](#toc0_)

- **Description and Intuition of Test:** A Finite Mixture Model (FMM) is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of distributions with unknown parameters. These distributions can be of any type, not necessarily Gaussian.

- **Use Case for Test:** FMMs are used when the data appears to be generated from a few different "groups" or "clusters", each of which can be modeled as a specific distribution.

- **The Intuition for Using it for Classification:** FMMs can be used for classification tasks by treating each component of the mixture as a class. The posterior probability of each component given a data point can be used as a measure of the likelihood of that data point belonging to the corresponding class.

- **The Intuition for Using it for Regression:** FMMs are not typically used for regression tasks as they are a clustering algorithm and do not predict continuous outcomes.

- **The Formula for Probability:** The probability of a data point in an FMM is given by the weighted sum of the probabilities of that data point in each of the distributions.

- **The Formula for the Cost Function:** The cost function for FMMs is the negative log-likelihood of the data given the parameters of the model.

- **How to Code it:** Sklearn does not have a direct implementation of FMMs. However, they can be implemented using other libraries or custom code.

- **Most Important Hyperparameters to Tune:** The number of components and the type of distributions used are the key factors to tune in FMMs.

- **Assumptions**: The assumptions of a Finite Mixture Model (FMM) include:

  1. **Component Distributions:** The data is generated from a finite mixture of different distributions. These distributions can be of any type, not necessarily Gaussian.

  2. **Independence:** Each data point is independently drawn from one of the component distributions.

  3. **Parameter Estimation:** The parameters of each component distribution (such as mean and variance for a Gaussian distribution) are unknown and are to be estimated from the data.

  4. **Component Membership:** Each data point belongs to one of the component distributions, but the exact membership is unknown and is treated as a latent variable.

  5. **Homogeneity within Components:** Data points belonging to the same component distribution are homogeneous, i.e., they follow the same distribution.

- **Model Coefficient Interpretation:**  In a Finite Mixture Model (FMM), the coefficients represent the parameters of the component distributions and the mixing proportions of these distributions. Here's how to interpret these coefficients:

  1. **Component Distribution Parameters:** These are the parameters of each component distribution in the mixture. For example, if the component distributions are Gaussian, each one will have a mean and a variance. These parameters describe the characteristics of each cluster in the data. For instance, the mean of a Gaussian component represents the center of a cluster, and the variance represents the spread of the cluster.

  2. **Mixing Proportions:** These are the proportions of the overall data that each component distribution accounts for. They sum to 1 across all components. A higher mixing proportion for a component means that a larger fraction of the data is generated from that component.

  It's important to note that the coefficients of an FMM do not have a direct interpretation in terms of the influence on a response variable like in a regression model, because FMM is an unsupervised learning method used for clustering rather than prediction.

  Also, the assignment of data points to clusters is not as straightforward as in some other clustering methods. Each data point has a probability of belonging to each cluster (component), rather than a definite assignment. The cluster with the highest probability is usually taken as the assigned cluster for each data point.
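
As an example of a non-Gaussian finite mixture written with custom code, the sketch below fits a mixture of Poisson distributions to count data with the EM algorithm. The function name `poisson_mixture_em`, the initialization scheme, and the simulated data are assumptions made for this example; it is a bare-bones illustration without convergence checks.

```python
import numpy as np

def poisson_mixture_em(x, k, n_iter=200, seed=0):
    """EM for a k-component Poisson mixture on non-negative integer counts."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    # Initial rates: k distinct observed values, nudged away from zero.
    lam = rng.choice(np.unique(x), size=k, replace=False) + 0.5
    pi = np.full(k, 1.0 / k)  # mixing proportions
    for _ in range(n_iter):
        # E-step: responsibilities. log(x!) is the same for every component,
        # so it cancels in the normalization and is omitted.
        log_w = np.log(pi) + x[:, None] * np.log(lam) - lam
        log_w -= log_w.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(log_w)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: weighted maximum-likelihood updates.
        nk = resp.sum(axis=0)
        lam = np.maximum((resp * x[:, None]).sum(axis=0) / nk, 1e-8)
        pi = nk / len(x)
    return pi, lam, resp

# Simulated counts from two Poisson components (rates 2 and 15).
rng = np.random.default_rng(1)
x = np.concatenate([rng.poisson(2.0, 300), rng.poisson(15.0, 200)])
pi, lam, resp = poisson_mixture_em(x, k=2)
print(pi, lam)
hard = resp.argmax(axis=1)  # most probable component for each observation
```

Here `lam` plays the role of the component distribution parameters and `pi` the mixing proportions, while each row of `resp` holds the membership probabilities for one data point.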

## Comparison of Algorithms

So, why would we choose GMM or hierarchical clustering over DBSCAN or K-Means?

K-Means and Gaussian Mixture Models allow us to make fast predictions on new data. Hierarchical clustering can help determine distances between points and sub-clusters at multiple levels. DBSCAN allows us to fit data with strangely shaped patterns and to categorize certain points as noise.

Beyond these differences, K-means is the quickest on large data sets (especially its optimized version, `MiniBatchKMeans`). K-means is probably the most widely used as well, but it never hurts to try a couple of other methods for comparison.
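
The difference in predicting on new data shows up directly in the scikit-learn API. The sketch below assumes synthetic `make_blobs` data and two hypothetical unseen points, `X_new`, chosen only for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.6, random_state=0)
X_new = np.array([[0.0, 4.0], [2.0, 1.0]])  # hypothetical unseen points

for model in (
    KMeans(n_clusters=3, n_init=10, random_state=0),
    MiniBatchKMeans(n_clusters=3, n_init=3, random_state=0),
    GaussianMixture(n_components=3, random_state=0),
):
    model.fit(X)
    print(type(model).__name__, model.predict(X_new))

# DBSCAN and AgglomerativeClustering label only the data they were fitted on,
# so their estimators do not expose a predict method.
has_predict = {
    "DBSCAN": hasattr(DBSCAN(), "predict"),
    "AgglomerativeClustering": hasattr(AgglomerativeClustering(), "predict"),
}
print(has_predict)
```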

Measuring clustering algorithms is not easy. If we don't know the 'ground truth', it mostly comes down to using some internal measure of coherence or variance within a cluster, or eyeballing a graph across a few dimensions. For example, we can compare the performance of the different algorithms across all the datasets we have seen thus far: