# Module 1: Introduction to Scikit-Learn

## Part 1: Gaussian Mixture Models (GMM)

In this part, we explore Gaussian Mixture Models (GMMs), probabilistic models used in machine learning to represent complex data distributions. They are particularly useful for clustering and density estimation tasks.

### 1.1 Understanding Gaussian Mixture Models (GMM)

A Gaussian Mixture Model (GMM) is a probabilistic model that represents the data distribution as a weighted combination of Gaussian distributions. It assumes that the dataset was generated from a mixture of underlying Gaussian distributions, where each Gaussian component corresponds to a cluster. Rather than placing each point in exactly one cluster, a GMM computes the probability that each data point belongs to each component, which gives a soft assignment of data points to clusters.
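To make soft assignment concrete, here is a minimal sketch using scikit-learn's `GaussianMixture.predict_proba`. The dataset (two overlapping blobs) and the names `X_demo`, `gmm_demo`, `probs`, and `hard_labels` are illustrative assumptions, not part of the example used later in this module.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Illustrative data (assumption): two overlapping blobs in 2D.
X_demo, _ = make_blobs(n_samples=200, centers=[(0, 0), (3, 0)],
                       cluster_std=1.0, random_state=0)

gmm_demo = GaussianMixture(n_components=2, random_state=0).fit(X_demo)

# Soft assignment: one row per point, one column per component.
# Each row holds the membership probabilities and sums to 1.
probs = gmm_demo.predict_proba(X_demo)

# A hard assignment can be derived by taking the most probable component.
hard_labels = probs.argmax(axis=1)

print(probs[:5].round(3))
print(hard_labels[:5])
```

Points lying between the two blobs should show probabilities well away from 0 and 1, which is the information a hard clustering would discard.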

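The key idea behind GMM is to estimate the parameters of the Gaussian components: their means, covariances, and mixture weights. These parameters are learned with the expectation-maximization (EM) algorithm, which alternates between computing how responsible each component is for each data point (E-step) and updating the parameters from those responsibilities (M-step). No iteration decreases the likelihood of the observed data, so EM converges to a local maximum of the likelihood.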
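After fitting, the learned parameters can be inspected directly. The sketch below assumes a small synthetic dataset and the default `covariance_type="full"`; the variable names are illustrative.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Illustrative data (assumption): three blobs in 2D.
X_demo, _ = make_blobs(n_samples=300, n_features=2, centers=3, random_state=0)

gmm_demo = GaussianMixture(n_components=3, covariance_type="full",
                           random_state=0).fit(X_demo)

# Mixture weights: one per component, summing to 1.
print("weights:", gmm_demo.weights_.round(3))
# Means: one row per component, one column per feature.
print("means shape:", gmm_demo.means_.shape)
# With covariance_type="full", each component has its own
# (n_features x n_features) covariance matrix.
print("covariances shape:", gmm_demo.covariances_.shape)
# EM diagnostics: whether the run converged and how many iterations it took.
print("converged:", gmm_demo.converged_, "after", gmm_demo.n_iter_, "iterations")
```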

Choosing the appropriate number of components (clusters) in GMM is an important consideration. It can be determined with criteria such as the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC), for which lower values are better, or through cross-validation. These methods help select a number of components that balances model complexity against data likelihood.
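One possible way to apply these criteria is to fit a model for each candidate number of components and compare the values returned by `GaussianMixture.bic` and `GaussianMixture.aic`. The candidate range and the dataset below are assumptions chosen for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Illustrative data (assumption): three well-separated blobs.
X_demo, _ = make_blobs(n_samples=300,
                       centers=[(-5, -5), (0, 5), (5, -5)],
                       cluster_std=1.0, random_state=0)

candidate_ks = range(1, 7)  # hypothetical search range
bics, aics = {}, {}
for k in candidate_ks:
    model = GaussianMixture(n_components=k, random_state=0).fit(X_demo)
    bics[k] = model.bic(X_demo)  # lower is better
    aics[k] = model.aic(X_demo)  # lower is better

best_k_bic = min(bics, key=bics.get)
best_k_aic = min(aics, key=aics.get)
print("Best k by BIC:", best_k_bic)
print("Best k by AIC:", best_k_aic)
```

BIC penalizes extra parameters more heavily than AIC, so the two criteria can disagree; when they do, BIC tends to favor the smaller model.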

It is recommended to scale the features before applying GMM clustering to ensure that all features contribute equally to the clustering process. StandardScaler or MinMaxScaler can be used to scale the features appropriately.
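A minimal sketch of scaling before clustering, assuming a pipeline is acceptable for your workflow: `make_pipeline` chains `StandardScaler` and `GaussianMixture` so that the same scaling is applied at fit and predict time. The dataset and the artificial rescaling of the second feature are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data (assumption): three blobs, with the second feature
# artificially stretched onto a much larger scale than the first.
X_demo, _ = make_blobs(n_samples=300, centers=3, random_state=0)
X_demo = X_demo * np.array([1.0, 1000.0])

# The pipeline standardizes the features, then fits the GMM on the result.
scaled_gmm = make_pipeline(StandardScaler(),
                           GaussianMixture(n_components=3, random_state=0))
scaled_gmm.fit(X_demo)
labels_demo = scaled_gmm.predict(X_demo)

# After standardization each feature has mean 0 and unit variance.
X_scaled = scaled_gmm.named_steps["standardscaler"].transform(X_demo)
print(X_scaled.mean(axis=0).round(3), X_scaled.std(axis=0).round(3))
```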

### 1.2 Training and Evaluation

To apply GMM, we need an unlabeled dataset. The algorithm estimates the parameters of the Gaussian distributions based on the observed data. It then assigns probabilities to each data point belonging to each cluster.

Once trained, we can use the GMM model to predict the cluster labels for new, unseen data points. The model assigns each data point to the most probable cluster based on the computed probabilities.
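As a sketch of prediction on unseen data, the example below fits a model and then labels three hypothetical new points, each placed near one of the assumed blob centers. Note that component indices are arbitrary: the label a cluster receives can differ between runs or seeds.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Illustrative training data (assumption): three well-separated blobs.
centers = [(-5, -5), (0, 5), (5, -5)]
X_train, _ = make_blobs(n_samples=300, centers=centers,
                        cluster_std=1.0, random_state=0)
gmm_demo = GaussianMixture(n_components=3, random_state=0).fit(X_train)

# Hypothetical new points, one close to each blob center.
X_new = np.array([[-5.2, -4.8], [0.3, 5.1], [4.9, -5.3]])

new_labels = gmm_demo.predict(X_new)       # most probable component per point
new_probs = gmm_demo.predict_proba(X_new)  # full membership probabilities

print(new_labels)
print(new_probs.round(3))
```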

Evaluating the quality of a clustering can be done using internal or external metrics, depending on whether you have ground truth labels (external evaluation) or not (internal evaluation). In our case, since we generate synthetic data and discard its ground truth labels, we'll focus on internal evaluation metrics. Here are a few commonly used metrics for evaluating clustering results:
- Silhouette Score: The silhouette score measures how similar each data point is to its own cluster (cohesion) compared to other clusters (separation). It ranges from -1 to 1.
    - A score close to 1 indicates that data points within the same cluster are close to each other, and clusters are well-separated.
    - A score close to 0 suggests overlapping clusters or that the data might be too noisy.
    - A negative score indicates that data points may have been assigned to the wrong clusters.
- Davies-Bouldin Index: This index measures the average similarity of each cluster to the cluster most similar to it, where similarity is the ratio of within-cluster scatter to between-cluster separation. Its minimum is 0 and it has no upper bound.
    - Lower values indicate better clustering. The closer the index is to zero, the better the clustering.
- Calinski-Harabasz Index (Variance Ratio Criterion): This index measures the ratio of between-cluster variance to within-cluster variance. It has no fixed range of values.
    - Higher values indicate better clustering. You typically compare this index across different clustering solutions, where a higher score implies better separation between clusters.
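Because these metrics are most informative when compared across clustering solutions, one option is to compute all three for several candidate numbers of components. The following is a sketch under assumed data and an assumed candidate range, not a definitive selection procedure.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)
from sklearn.mixture import GaussianMixture

# Illustrative data (assumption): three blobs in 2D.
X_demo, _ = make_blobs(n_samples=300,
                       centers=[(-5, -5), (0, 5), (5, -5)],
                       cluster_std=1.5, random_state=0)

results = {}
for k in range(2, 6):  # these metrics need at least 2 clusters
    labels_k = GaussianMixture(n_components=k, random_state=0).fit_predict(X_demo)
    results[k] = (
        silhouette_score(X_demo, labels_k),         # higher is better
        davies_bouldin_score(X_demo, labels_k),     # lower is better
        calinski_harabasz_score(X_demo, labels_k),  # higher is better
    )

for k, (sil, db, ch) in results.items():
    print(f"k={k}: silhouette={sil:.2f}, Davies-Bouldin={db:.2f}, "
          f"Calinski-Harabasz={ch:.1f}")
```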

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
from sklearn.metrics import davies_bouldin_score
from sklearn.metrics import calinski_harabasz_score

# Generate synthetic 2D data; the true labels are discarded
X, _ = make_blobs(n_samples=400, n_features=2, centers=3, cluster_std=2.0, random_state=42)

# Fit a 3-component GMM and assign each point to its most probable component
gmm = GaussianMixture(n_components=3, random_state=42)
gmm.fit(X)
labels = gmm.predict(X)

# Plot the points colored by cluster, with the fitted component means marked
plt.figure(figsize=(6,4))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=20)
plt.scatter(gmm.means_[:, 0], gmm.means_[:, 1], c='red', s=40, marker='X', label='Cluster Centers')
plt.title('Gaussian Mixture Model Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()

# Internal evaluation metrics (no ground truth labels needed)
silhouette_avg = silhouette_score(X, labels)
print(f"Silhouette Score: {silhouette_avg:.2f}")
db_index = davies_bouldin_score(X, labels)
print(f"Davies-Bouldin Index: {db_index:.2f}")
ch_score = calinski_harabasz_score(X, labels)
print(f"Calinski-Harabasz Index: {ch_score:.2f}")
```

This code snippet demonstrates the application of Gaussian Mixture Model (GMM) clustering to synthetic data and evaluates the clustering quality using three commonly used internal clustering evaluation metrics.

The code generates synthetic data with 400 samples, two features, and three clusters, each with a standard deviation of 2.0. It then fits a GMM to this data, predicts cluster labels, and visualizes the results, with the fitted component means shown as red 'X' markers.

To assess the clustering quality, three metrics are calculated:

1. Silhouette Score: This metric quantifies how well-separated and cohesive the clusters are. A higher score (in this case, 0.58) suggests that the clusters are relatively well-defined.

2. Davies-Bouldin Index: This index measures the average similarity ratio between clusters, with a lower value (0.60) indicating better clustering.

3. Calinski-Harabasz Index: This index assesses the ratio of between-cluster variance to within-cluster variance. The higher the value (647.91), the better the separation between clusters.

These metrics provide insights into the quality of the clustering results: higher Silhouette and Calinski-Harabasz values and lower Davies-Bouldin values generally indicate better clustering. However, it's important to interpret these scores alongside visual inspection and domain knowledge to make informed decisions about the clustering outcome.

### 1.3 Summary

The Gaussian Mixture Model (GMM) is a powerful probabilistic clustering algorithm for discovering clusters in a dataset. It models the data distribution as a combination of Gaussian distributions and computes, for each data point, the probability that it belongs to each cluster.

GMM assumes that the underlying clusters are Gaussian distributions, which may not hold true for all datasets. It can also be sensitive to the initialization of parameters and may converge to a local optimum. GMM is also computationally more expensive than some other clustering algorithms. It is recommended to scale the features before applying GMM clustering to ensure that all features contribute equally to the clustering process.
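One common way to reduce the sensitivity to initialization is to run EM from several starting points and keep the best run, which `GaussianMixture` supports through its `n_init` parameter. The sketch below uses `init_params="random"` to make the effect easier to see; the dataset and parameter values are assumptions.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Illustrative data (assumption): four blobs in 2D.
X_demo, _ = make_blobs(n_samples=400, centers=4, cluster_std=1.5, random_state=1)

# A single random initialization may end in a poor local optimum.
single = GaussianMixture(n_components=4, init_params="random", n_init=1,
                         max_iter=500, random_state=0).fit(X_demo)

# Ten initializations; scikit-learn keeps the run with the highest
# lower bound on the log-likelihood.
multi = GaussianMixture(n_components=4, init_params="random", n_init=10,
                        max_iter=500, random_state=0).fit(X_demo)

print("lower bound, n_init=1: ", single.lower_bound_)
print("lower bound, n_init=10:", multi.lower_bound_)
```

The cost is proportional to `n_init`, since each initialization is a full EM run.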

In summary, Gaussian Mixture Models in scikit-learn are a versatile tool for clustering, density estimation, and generative modeling. They can capture complex data distributions by modeling data as a combination of Gaussian components.
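The density estimation and generative uses can be sketched with `score_samples`, which returns the log-density of each point under the fitted mixture, and `sample`, which draws new points from it. The dataset and query points below are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Illustrative data (assumption): three well-separated blobs.
centers = [(-5, -5), (0, 5), (5, -5)]
X_demo, _ = make_blobs(n_samples=300, centers=centers,
                       cluster_std=1.0, random_state=0)
gmm_demo = GaussianMixture(n_components=3, random_state=0).fit(X_demo)

# Density estimation: log-density of a point near a blob center
# versus a point far from all the data.
query_points = np.array([[0.0, 5.0], [20.0, 20.0]])
log_density = gmm_demo.score_samples(query_points)
print("log-density:", log_density.round(2))

# Generative modeling: draw new points from the fitted mixture.
# sample() returns the points and the component each one came from.
X_sampled, component_ids = gmm_demo.sample(n_samples=100)
print(X_sampled.shape, component_ids.shape)
```

A very low log-density, as for the far-away point here, is the basis for using a GMM as a simple anomaly detector.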