# Clustering

# Anomaly Detection

# Density Estimation

# K-Means

- does not behave very well when the blobs have very different diameters

- hard clustering => assign each instance to a single cluster

- soft clustering => give each instance a score per cluster, e.g. the distance between the instance and each centroid

In [None]:
import numpy as np
from sklearn.cluster import KMeans

# X is assumed to be an (n_samples, 2) array of instances defined earlier
k = 5
kmeans = KMeans(n_clusters=k)
y_pred = kmeans.fit_predict(X)

kmeans.cluster_centers_  # coordinates of the k centroids

X_new = np.array([[0, 2], [3, 2], [-3, 3], [-3, 2.5]])
kmeans.predict(X_new)    # hard clustering: index of the closest cluster for each instance
kmeans.transform(X_new)  # soft clustering: distance from each instance to every centroid

## Mitigating convergence to a suboptimal solution

- Set the centroid initialization manually (if you know approximately where the centroids should be)

    ```
    good_init = np.array([[-3, 3], [-3, 2], [-3, 1], [-1, 2], [0, 2]])
    kmeans = KMeans(n_clusters=5, init=good_init, n_init=1)
    ```

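- run the algorithm multiple times with different random initializations (the `n_init` parameter) and keep the best solution

    - performance measure: `kmeans.inertia_`, the sum of squared distances between each instance and its closest centroid (lower is better)

    - score: `kmeans.score(X)` returns the negative inertia, because a score must follow the "greater is better" rule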
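
The relation between `n_init`, `inertia_` and `score()` can be checked with a small sketch. The synthetic blobs from `make_blobs` and the names `X_demo`, `kmeans_multi` and `manual_inertia` are assumptions made for illustration only; they are not part of the notes above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with 5 blobs (an assumption, used only for illustration)
X_demo, _ = make_blobs(n_samples=500, centers=5, random_state=42)

# n_init=10 runs K-Means 10 times with different random initializations
# and keeps the run with the lowest inertia
kmeans_multi = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X_demo)

# Sum of squared distances from each instance to its closest centroid
inertia = kmeans_multi.inertia_

# score() returns the negative inertia ("greater is better")
score = kmeans_multi.score(X_demo)

# Recompute the inertia by hand from the distances returned by transform()
dist = kmeans_multi.transform(X_demo)            # shape (n_samples, n_clusters)
manual_inertia = np.sum(dist.min(axis=1) ** 2)

print(inertia, score, manual_inertia)
```

The three printed values should agree up to sign and floating-point error, which is a quick way to confirm what each attribute measures.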

## Mini-batch K-Means

`from sklearn.cluster import MiniBatchKMeans`
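
`MiniBatchKMeans` updates the centroids from small random batches instead of the full dataset at each iteration, which is usually faster at the cost of a slightly higher inertia. A minimal usage sketch follows; the synthetic data, the `batch_size=256` value and the variable names are assumptions chosen for illustration.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic data (an assumption, used only for illustration)
X_demo, _ = make_blobs(n_samples=2000, centers=5, random_state=42)

# Same interface as KMeans; batch_size controls how many instances
# are used for each centroid update
minibatch_kmeans = MiniBatchKMeans(n_clusters=5, batch_size=256,
                                   n_init=3, random_state=42)
minibatch_kmeans.fit(X_demo)

centers = minibatch_kmeans.cluster_centers_   # shape (5, 2)
labels = minibatch_kmeans.predict(X_demo)     # cluster index per instance
```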

## Find the optimal # of clusters

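- plot the inertia as a function of the number of clusters k => find the elbow

- use the `silhouette_score` function (mean silhouette coefficient over all instances, between -1 and +1; higher is better)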

    ```
    from sklearn.metrics import silhouette_score
    silhouette_score(X, kmeans.labels_)
    ```

- check the silhouette diagram, looking at both the silhouette coefficients and the size of each cluster
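
The first two approaches can be sketched as a loop over candidate values of k. The synthetic data, the range of k and the names `inertias`, `silhouettes` and `best_k` are assumptions for illustration; on real data the best k will depend on the dataset.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 blobs (an assumption, used only for illustration)
X_demo, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.6, random_state=42)

k_values = list(range(2, 9))   # silhouette_score needs at least 2 clusters
inertias, silhouettes = [], []
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_demo)
    inertias.append(km.inertia_)                               # for the elbow plot
    silhouettes.append(silhouette_score(X_demo, km.labels_))   # mean silhouette coefficient

# k with the highest mean silhouette coefficient
best_k = k_values[silhouettes.index(max(silhouettes))]
print(best_k)
```

Plotting `inertias` against `k_values` gives the elbow curve; the silhouette scores are usually the more informative of the two, since inertia keeps decreasing as k grows.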

## Usage

- Image / color segmentation

    ```
    # image is assumed to be a (height, width, 3) RGB array
    X = image.reshape(-1, 3)
    kmeans = KMeans(n_clusters=8).fit(X)
    segmented_img = kmeans.cluster_centers_[kmeans.labels_]
    segmented_img = segmented_img.reshape(image.shape)
    ```

- Preprocessing

    ```
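    from sklearn.cluster import KMeans
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline

    # cluster first, then feed the distances to the centroids to the classifier
    pipeline = Pipeline([
        ('kmeans', KMeans()),
        ('log_reg', LogisticRegression())
    ])

    # search for the number of clusters that gives the best classification score
    param_grid = dict(kmeans__n_clusters=range(2, 100))
    grid_clf = GridSearchCV(pipeline, param_grid, cv=3, verbose=2)
    grid_clf.fit(X_train, y_train)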

- semi-supervised learning (plenty of unlabeled instances, few labeled ones)

    - cluster the training set, then manually label the representative instance of each cluster (the one closest to the centroid)

    - label propagation: propagate the labels to all the other instances in the same cluster

    - or propagate the labels only to the 20% of instances closest to the centroids

In [None]:
# for the semi-supervised learning

# cluster the training set and pick the instance closest to each centroid
k = 50
kmeans = KMeans(n_clusters=k)
X_digits_dist = kmeans.fit_transform(X_train)
representative_digit_idx = np.argmin(X_digits_dist, axis=0)
X_representative_digits = X_train[representative_digit_idx]

# label manually
y_representative_digits = np.array([4, 8, 0, 6, 8, 3, ..., 7, 6, 2, 3, 1, 1])

# label propagation
y_train_propagated = np.empty(len(X_train), dtype=np.int32)
for i in range(k):
    y_train_propagated[kmeans.labels_==i] = y_representative_digits[i]

# keep only the 20% of instances closest to their centroid in each cluster
percentile_closest = 20

X_cluster_dist = X_digits_dist[np.arange(len(X_train)), kmeans.labels_]
for i in range(k):
    in_cluster = (kmeans.labels_ == i)
    cluster_dist = X_cluster_dist[in_cluster]
    cutoff_distance = np.percentile(cluster_dist, percentile_closest)
    above_cutoff = (X_cluster_dist > cutoff_distance)
    X_cluster_dist[in_cluster & above_cutoff] = -1

partially_propagated = (X_cluster_dist != -1)
X_train_partially_propagated = X_train[partially_propagated]
y_train_partially_propagated = y_train_propagated[partially_propagated]
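
The cell above stops before a model is trained on the propagated labels. The sketch below runs the whole procedure end to end under some assumptions that are not in the notes: it uses scikit-learn's `load_digits` dataset, it looks up the true labels of the 50 representatives as a stand-in for manual labeling, and it trains a `LogisticRegression` at the end. All variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_digits, y_digits = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_digits, y_digits, random_state=42)

n_clusters = 50
km = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
dist = km.fit_transform(X_tr)         # distance of each instance to each centroid
rep_idx = np.argmin(dist, axis=0)     # instance closest to each centroid

# Stand-in for manual labeling: only these 50 true labels are used
y_rep = y_tr[rep_idx]

# Propagate each representative's label to every instance of its cluster
y_prop = y_rep[km.labels_]

# Keep only the 20% of instances closest to their own centroid
own_dist = dist[np.arange(len(X_tr)), km.labels_]
keep = np.zeros(len(X_tr), dtype=bool)
for i in range(n_clusters):
    in_cluster = km.labels_ == i
    cutoff = np.percentile(own_dist[in_cluster], 20)
    keep |= in_cluster & (own_dist <= cutoff)

# Train on the partially propagated labels, evaluate on true test labels
clf = LogisticRegression(max_iter=5000)
clf.fit(X_tr[keep], y_prop[keep])
accuracy = clf.score(X_te, y_te)
print(accuracy)
```

Comparing `accuracy` with a model trained on `X_tr[rep_idx], y_rep` alone is one way to see whether propagation helps on a given dataset.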

| Clustering algorithm | Pros | Cons |
|---|---|---|
| K-Means | fast, scalable | may converge to a suboptimal solution => run several times |
| - | - | need to specify the number of clusters |
| - | - | does not behave well when clusters have varying sizes |

# DBSCAN

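DBSCAN has no `predict()` method, so it cannot assign new instances to a cluster by itself, but you can train a classifier such as KNN on its results and predict with that.

The `eps` parameter (the neighborhood radius) needs to be set properly.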
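
The effect of the neighborhood radius can be seen by fitting DBSCAN with two values of `eps` on the same data and counting clusters and anomalies. This is a sketch: the second value `0.2`, the fixed `random_state` and the `results` dictionary are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X_moons, _ = make_moons(n_samples=1000, noise=0.05, random_state=42)

results = {}
for eps in (0.05, 0.2):
    db = DBSCAN(eps=eps, min_samples=5).fit(X_moons)
    n_clusters = len(set(db.labels_) - {-1})        # label -1 marks anomalies
    n_anomalies = int(np.sum(db.labels_ == -1))
    results[eps] = (n_clusters, n_anomalies)
    print(f"eps={eps}: {n_clusters} clusters, {n_anomalies} anomalies")
```

With the same `min_samples`, a larger `eps` can only turn anomalies into cluster members, never the reverse, so a radius that is too small tends to fragment the data into many clusters with many anomalies.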

In [None]:
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=1000, noise=0.05)
dbscan = DBSCAN(eps=0.05, min_samples=5)
dbscan.fit(X)

dbscan.labels_               # cluster index of every instance (-1 marks an anomaly)
dbscan.core_sample_indices_  # indices of the core instances
dbscan.components_           # coordinates of the core instances

# train a KNN classifier on the core instances, using their cluster indices as labels
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=50)
knn.fit(dbscan.components_, dbscan.labels_[dbscan.core_sample_indices_])

X_new = np.array([[-0.5, 0], [0, 0.5], [1, -0.1], [2, 1]])
knn.predict(X_new)

# the classifier always picks a cluster, even when a new instance is far from all clusters and may be an anomaly
knn.predict_proba(X_new)

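# to flag anomalies, look up the nearest core instance and mark as -1 any new instance farther than 0.2 from it
y_dist, y_pred_idx = knn.kneighbors(X_new, n_neighbors=1)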

y_pred = dbscan.labels_[dbscan.core_sample_indices_][y_pred_idx]

y_pred[y_dist > 0.2] = -1

y_pred.ravel()