# Module 1: Introduction to Scikit-Learn

## Part 3: Affinity Propagation Clustering

In this part, we explore Affinity Propagation, an unsupervised clustering algorithm that doesn't require you to specify the number of clusters beforehand. Instead, it discovers clusters by finding "exemplar" data points that best represent each cluster.

### 3.1 Understanding Affinity Propagation

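Unlike clustering methods such as k-means, which require the number of clusters in advance, Affinity Propagation identifies clusters through a "message-passing" approach: pairs of data points exchange messages about how well one point would serve as the representative of the other.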

In affinity propagation, data points are considered as potential "exemplars" that can represent clusters. Exemplars are data points that are most representative of their respective clusters.

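Affinity propagation can discover clusters of varying sizes. Because every point is assigned to a single exemplar, each cluster is the group of points most similar to that exemplar, so strongly elongated or irregular structures may end up split across several exemplars.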

### 3.2 Training

To apply the Affinity Propagation algorithm, we need an unlabeled dataset.

The algorithm starts with a similarity matrix that quantifies the similarity between pairs of data points, where larger values mean "more similar". A common choice, and the default in scikit-learn (`affinity='euclidean'`), is the negative squared Euclidean distance. A custom similarity matrix can also be supplied with `affinity='precomputed'`.
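As a minimal sketch of this step, the snippet below builds a similarity matrix from the negative squared Euclidean distance between every pair of points. The toy array `X_demo` is an illustrative assumption and is not the dataset used in the rest of this part.

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

# Hypothetical toy data: four 2-D points forming two tight pairs.
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])

# Similarity = negative squared Euclidean distance (larger means more similar).
S_demo = -euclidean_distances(X_demo, squared=True)

print(S_demo)
```

Points in the same pair get a similarity close to zero, while points in different pairs get a strongly negative value.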

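Affinity propagation iteratively updates two matrices: the responsibility matrix and the availability matrix. The responsibility r(i, k) represents how well-suited point k is to serve as the exemplar for point i, compared with the other candidate exemplars for i. The availability a(i, k) represents the accumulated evidence that point i should choose point k as its exemplar, based on the support k receives from other points. This can be computationally expensive, especially for large datasets, because the similarity, responsibility, and availability matrices each hold one value per pair of points, so memory grows quadratically with the number of samples.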
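For reference, the two updates in the standard formulation of the algorithm can be written as follows, where s(i, k) is the similarity between points i and k, r is the responsibility, and a is the availability. The symbols are introduced here for illustration.

$$
r(i,k) \leftarrow s(i,k) - \max_{k' \neq k} \bigl[ a(i,k') + s(i,k') \bigr]
$$

$$
a(i,k) \leftarrow \min\Bigl(0,\; r(k,k) + \sum_{i' \notin \{i,k\}} \max\bigl(0,\, r(i',k)\bigr)\Bigr), \quad i \neq k
$$

$$
a(k,k) \leftarrow \sum_{i' \neq k} \max\bigl(0,\, r(i',k)\bigr)
$$

The diagonal entries s(k, k) hold the preference of point k. With a damping factor λ, each message is set to λ times its previous value plus (1 − λ) times the newly computed value.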

Affinity propagation identifies clusters by repeatedly updating the responsibility and availability matrices until a stopping criterion is met. The final exemplars represent the cluster centers.
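The loop described above can be sketched in plain NumPy. This is a simplified illustration under stated assumptions, not scikit-learn's implementation: the function name `affinity_propagation_sketch`, the fixed iteration count `n_iter`, and the toy data are all hypothetical, and the sketch omits the convergence check that a real implementation would use as its stopping criterion.

```python
import numpy as np

def affinity_propagation_sketch(S, preference=None, damping=0.5, n_iter=200):
    """Illustrative message passing on an (n x n) similarity matrix S."""
    S = np.array(S, dtype=float)          # work on a copy
    n = S.shape[0]
    if preference is None:
        preference = np.median(S)         # assumed default: median similarity
    np.fill_diagonal(S, preference)       # s(k, k) holds the preference
    R = np.zeros((n, n))                  # responsibilities r(i, k)
    A = np.zeros((n, n))                  # availabilities a(i, k)
    rows = np.arange(n)

    for _ in range(n_iter):
        # Responsibility: s(i,k) minus the best competing a(i,k') + s(i,k').
        AS = A + S
        best = np.argmax(AS, axis=1)
        best_val = AS[rows, best]
        AS[rows, best] = -np.inf
        second_val = AS.max(axis=1)
        R_new = S - best_val[:, None]
        R_new[rows, best] = S[rows, best] - second_val
        R = damping * R + (1 - damping) * R_new

        # Availability: positive support for k from other points, plus r(k,k).
        Rp = np.maximum(R, 0)
        Rp[rows, rows] = R[rows, rows]    # r(k,k) enters unclipped
        A_new = Rp.sum(axis=0)[None, :] - Rp
        diag = A_new[rows, rows].copy()   # a(k,k) is not capped at zero
        A_new = np.minimum(A_new, 0)
        A_new[rows, rows] = diag
        A = damping * A + (1 - damping) * A_new

    # A point is treated as an exemplar when r(k,k) + a(k,k) > 0.
    exemplars = np.flatnonzero(np.diag(A + R) > 0)
    if exemplars.size == 0:               # no exemplar emerged
        return exemplars, np.full(n, -1)
    labels = np.argmax(S[:, exemplars], axis=1)
    labels[exemplars] = np.arange(exemplars.size)
    return exemplars, labels

# Hypothetical toy data: two well-separated groups of ten points each.
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0, 0.5, (10, 2)), rng.normal(10, 0.5, (10, 2))])
sq_dists = ((X_toy[:, None, :] - X_toy[None, :, :]) ** 2).sum(axis=-1)

exemplars_toy, labels_toy = affinity_propagation_sketch(-sq_dists)
print("exemplar indices:", exemplars_toy)
print("labels:", labels_toy)
```

Each returned label is the position of the point's exemplar within `exemplars_toy`, so the exemplars themselves are rows of the input data rather than computed averages.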

Affinity propagation may produce a different number of clusters than you expect, so it's essential to interpret the results carefully.

In [None]:
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

X, _ = make_blobs(n_samples=300, n_features=2, centers=3, cluster_std=3.0, random_state=42)

affinity_propagation = AffinityPropagation()
affinity_propagation.fit(X)

labels = affinity_propagation.labels_
exemplars = affinity_propagation.cluster_centers_indices_

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=20)
plt.scatter(X[exemplars, 0], X[exemplars, 1], c='red', s=200, marker='X', label='Exemplars')
plt.title('Affinity Propagation Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()

In this example, we applied Affinity Propagation to synthetic data generated around three centers with a fairly wide spread (`cluster_std=3.0`). We did not set the damping factor or the preference value, so scikit-learn used its defaults: `damping=0.5` and a preference equal to the median of the input similarities. The algorithm therefore chose both the number of clusters and the exemplars on its own. The plot shows the data points color-coded by assigned cluster, with red 'X' markers indicating the exemplars, the representative point of each cluster. Keep in mind that the number of clusters found depends on the data and on these parameter settings, so it may not match the three centers used to generate the data, and parameter tuning is often needed to achieve the desired clustering outcome.
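To interpret the result beyond the plot, it can help to inspect the fitted estimator directly. The sketch below refits the same synthetic data and reports how many exemplars were found and how many iterations were run. Passing `random_state=0` is an assumption added for reproducibility (the parameter is available in scikit-learn 0.23 and later), and the variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, n_features=2, centers=3, cluster_std=3.0, random_state=42)

# random_state=0 is an assumption added here for reproducibility.
model = AffinityPropagation(random_state=0).fit(X)

exemplar_idx = np.asarray(model.cluster_centers_indices_, dtype=int)
n_clusters = len(exemplar_idx)

print("clusters found:", n_clusters)
print("iterations run:", model.n_iter_)
# Exemplars are actual rows of X, not averaged centroids.
print("exemplars are data points:", np.allclose(model.cluster_centers_, X[exemplar_idx]))
```

Comparing `n_clusters` with the number of centers used to generate the data is a quick way to see whether the default preference suits the dataset.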

### 3.3 Hyperparameter Tuning

Affinity Propagation is known for its sensitivity to input parameters, particularly the damping factor (damping) and the preference value (preference). These parameters can significantly influence the number of clusters generated.

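The damping factor, which must lie in the range [0.5, 1.0), controls how much of the previous availability and responsibility values is kept at each update. Its purpose is to avoid numerical oscillations: a higher damping factor (e.g., 0.9) makes the updates slower but more stable, which can help when the algorithm fails to converge with the default of 0.5. It does not set the number of clusters directly.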
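A minimal sketch for checking how damping affects a fit: the loop below trains the model with a few damping values and records how many iterations each run used. The specific values, `max_iter=500`, and `random_state=0` are illustrative assumptions.

```python
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=400, n_features=2, centers=3, cluster_std=2.0, random_state=42)

iterations = {}
for damping in (0.5, 0.7, 0.9):  # illustrative values within [0.5, 1.0)
    model = AffinityPropagation(damping=damping, max_iter=500, random_state=0).fit(X)
    iterations[damping] = model.n_iter_
    print(f"damping={damping}: {model.n_iter_} iterations, "
          f"{len(model.cluster_centers_indices_)} clusters")
```

If a run uses all `max_iter` iterations, the algorithm did not converge, and a higher damping factor or a larger `max_iter` is worth trying.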

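The preference value expresses how suitable each point is as an exemplar and is the main control over the number of clusters. A higher preference value will result in more exemplars and, consequently, more clusters, while a lower (more negative) value will result in fewer. If no preference is given, scikit-learn uses the median of the input similarities.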
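One way to explore this, sketched under assumptions: compute the median similarity that scikit-learn would use by default, then fit the model with a few multiples of it and count the resulting clusters. The multipliers, `damping=0.9`, `max_iter=500`, and `random_state=0` are illustrative choices.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import euclidean_distances

X, _ = make_blobs(n_samples=400, n_features=2, centers=3, cluster_std=2.0, random_state=42)

# Median of the negative squared Euclidean similarities (the default preference).
S = -euclidean_distances(X, squared=True)
median_pref = np.median(S)

cluster_counts = {}
for factor in (1, 5, 20):  # hypothetical multipliers; larger means more negative
    pref = factor * median_pref
    model = AffinityPropagation(damping=0.9, preference=pref,
                                max_iter=500, random_state=0).fit(X)
    cluster_counts[factor] = len(model.cluster_centers_indices_)
    print(f"preference={pref:.1f}: {cluster_counts[factor]} clusters")
```

Because the similarities are negative, multiplying the median by a larger factor lowers the preference, which is the direction to move in when the default yields too many clusters.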

After adjusting the parameters, it's crucial to visualize the clustering results, including the exemplars, to assess the quality and number of clusters effectively.

In [None]:
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

X, _ = make_blobs(n_samples=400, n_features=2, centers=3, cluster_std=2.0, random_state=42)

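# Damping must lie in [0.5, 1.0); higher values make the updates more stable.
damping = 0.5
# Lower (more negative) preferences yield fewer exemplars; -500 is an illustrative value.
preference = -500

affinity_propagation = AffinityPropagation(damping=damping, preference=preference)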
affinity_propagation.fit(X)
labels = affinity_propagation.labels_
exemplars = affinity_propagation.cluster_centers_indices_

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=20)
plt.scatter(X[exemplars, 0], X[exemplars, 1], c='red', s=40, marker='X', label='Exemplars')
plt.title('Affinity Propagation Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()

In this example, we applied Affinity Propagation to synthetic data with three underlying clusters, using a damping factor of 0.5 and a preference value of -500. The preference influences how many points become exemplars and therefore how many clusters are formed. The plot shows the data points color-coded by assigned cluster, with red 'X' markers indicating the exemplars (cluster representatives). Affinity Propagation's sensitivity to its parameters is evident here: the algorithm can return more clusters than the three used to generate the data. Careful parameter selection is essential when using this algorithm to achieve meaningful clustering results tailored to your specific dataset and objectives.

### 3.4 Summary

Affinity Propagation is a distinctive clustering algorithm that discovers clusters by selecting exemplar data points based on similarity scores. It doesn't require specifying the number of clusters in advance, which makes it useful when that number is unknown. In exchange, it demands careful tuning of `damping` and `preference`, its memory use grows quadratically with the number of samples, and it may produce uneven cluster sizes. Affinity Propagation is particularly useful when the true number of clusters is unclear or when you need to identify meaningful exemplars within your data.