# Clustering

When we know only the features and not the target (the usual real-world situation), we are doing UNSUPERVISED LEARNING.

The goal of clustering is to identify latent groupings of observations, which, if done well, allows us to predict the class of observations even without a target vector.

# K-Means

Splitting the data into k groups.

The algorithm attempts to group observations into k groups, with each group having roughly equal variance.

k is selected by the user as a hyperparameter.

1-. k cluster center points are created at random locations.

2-. For each observation: the distance between the observation and each of the k center points is calculated, and the observation is assigned to the cluster of the nearest center point.

3-. The center points are moved to the means (centers) of their respective clusters.

4-. Steps 2 and 3 are repeated until no observation changes cluster membership.

At that point the model is considered converged and the algorithm stops.
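The four steps above can be sketched in plain NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name `k_means_sketch` and the toy data are hypothetical, and the initial centers are assumed to be drawn from the observations themselves rather than from arbitrary random locations.

```python
import numpy as np

def k_means_sketch(X, k, max_iter=100, seed=0):
    """Toy k-means following the four steps above (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Step 1 (assumption): use k distinct observations as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Step 2: distance from every observation to every center, then pick the nearest
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Step 4: stop once no observation changes cluster membership
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each center to the mean of its members
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return labels, centers

# Toy data: two well-separated blobs of 50 points each
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centers = k_means_sketch(X, k=2)
print(centers)
```

scikit-learn's `KMeans` follows the same idea but, by default, uses a smarter initialization (`init='k-means++'`) and several restarts.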


K-means assumes that:

- Clusters are convex shaped.
- All features are equally scaled (that's why we standardize the data).
- The groups are balanced.

If we can't fulfil these requirements, it would be a good idea to try another algorithm.
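As a quick check of the scaling assumption, the sketch below standardizes the iris features and confirms that every column ends up with mean 0 and standard deviation 1. It is only a sanity check on the preprocessing step; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

features = load_iris().data

# Raw features are on different scales
print(features.std(axis=0))

# After standardizing, each feature has mean 0 and unit standard deviation
features_standardized = StandardScaler().fit_transform(features)
means = features_standardized.mean(axis=0)
stds = features_standardized.std(axis=0)
print(means.round(6), stds.round(6))
```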

In [1]:
# Load libraries

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

In [2]:
# Load data

iris = load_iris()

features = iris.data

In [3]:
# Creating scaler

scaler = StandardScaler()

In [4]:
# Standardize features

features_standardized = scaler.fit_transform(features)

In [5]:
# Create k-means object
# (n_jobs was removed from KMeans in scikit-learn 1.0; drop it on newer versions)

cluster = KMeans(n_clusters = 3, random_state = 0, n_jobs = -1)

In [6]:
# Train the model

model = cluster.fit(features_standardized)
model

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=3, n_init=10, n_jobs=-1, precompute_distances='auto',
       random_state=0, tol=0.0001, verbose=0)

In [7]:
# View predicted classes

model.labels_

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 0, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 0,
       2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 0, 0, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0,
       0, 0, 0, 2, 2, 0, 0, 0, 0, 2, 0, 2, 0, 2, 0, 0, 2, 0, 0, 0, 0, 0,
       0, 2, 2, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 2], dtype=int32)

If we compare this to the observations' true classes, we can see that, despite the difference in class labels (the cluster ids are arbitrary), k-means did reasonably well.

In [8]:
iris.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
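Because cluster ids are arbitrary, comparing the two arrays by eye is awkward. One possible way to quantify the agreement, shown here as a sketch, is a contingency table plus the adjusted Rand index from `sklearn.metrics`, which ignores how the clusters are numbered. The model is refit here so the block is self-contained; `n_init=10` is set explicitly to match the defaults shown above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score, confusion_matrix
from sklearn.preprocessing import StandardScaler

iris = load_iris()
features_standardized = StandardScaler().fit_transform(iris.data)
model = KMeans(n_clusters=3, random_state=0, n_init=10).fit(features_standardized)

# Rows: true species, columns: cluster ids (column order is arbitrary)
table = confusion_matrix(iris.target, model.labels_)
print(table)

# 1.0 means identical groupings, values near 0 mean chance-level agreement
ari = adjusted_rand_score(iris.target, model.labels_)
print(ari)
```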

In [9]:
# Create new observation (treated as already standardized, since the model was fit on standardized features)

new_observation = [[0.8, 0.8, 0.8, 0.8]]

In [10]:
# Predict observation class

model.predict(new_observation)

array([0], dtype=int32)

In [11]:
# View cluster centers

model.cluster_centers_

array([[ 1.13597027,  0.08842168,  0.99615451,  1.01752612],
       [-1.01457897,  0.85326268, -1.30498732, -1.25489349],
       [-0.05021989, -0.88337647,  0.34773781,  0.2815273 ]])
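Prediction in k-means amounts to finding the nearest cluster center. The sketch below checks this by hand for the new observation: it computes the Euclidean distance to each center and compares the closest one with what `predict` returns. The model is refit so the block runs on its own.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

features_standardized = StandardScaler().fit_transform(load_iris().data)
model = KMeans(n_clusters=3, random_state=0, n_init=10).fit(features_standardized)

new_observation = np.array([[0.8, 0.8, 0.8, 0.8]])

# Distance from the new observation to each of the three cluster centers
distances = np.linalg.norm(model.cluster_centers_ - new_observation, axis=1)
nearest = int(distances.argmin())

predicted = int(model.predict(new_observation)[0])
print(nearest, predicted)
```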

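The silhouette coefficient measures the quality of the clusters: how similar each observation is to its own cluster compared with the nearest other cluster. It ranges from -1 to 1, with higher values indicating denser, better-separated clusters.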
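A minimal sketch of computing the mean silhouette coefficient with `sklearn.metrics.silhouette_score` for the k-means model on the standardized iris data. The model is refit here so the block is self-contained.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

features_standardized = StandardScaler().fit_transform(load_iris().data)
model = KMeans(n_clusters=3, random_state=0, n_init=10).fit(features_standardized)

# Mean silhouette coefficient over all observations
score = silhouette_score(features_standardized, model.labels_)
print(score)
```

Repeating this for several values of `n_clusters` is one common way to compare candidate values of k when no target vector is available.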

# Speeding Up K-Means using Minibatch

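If k-means takes too long, mini-batch k-means reduces the time required at a small cost in quality.

It works similarly to k-means, except that the most computationally costly step is run on only a random sample of observations. batch_size sets the number of observations sampled per batch: the larger the batch, the more time training takes.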
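To get a feel for the quality trade-off, the sketch below fits both estimators on a larger synthetic dataset and compares their inertia (the within-cluster sum of squared distances, lower is better). The dataset from `make_blobs` and its size are assumptions chosen for illustration; they are not part of the iris example.

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic data: 10,000 observations around 3 centers (illustrative assumption)
X, _ = make_blobs(n_samples=10_000, centers=3, random_state=0)

full = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X)
mini = MiniBatchKMeans(n_clusters=3, random_state=0, batch_size=100, n_init=3).fit(X)

# Mini-batch inertia is typically close to, but not better than, full k-means
print(full.inertia_, mini.inertia_)
```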


In [12]:
# Load libraries

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import MiniBatchKMeans

In [13]:
# Load data

iris = load_iris()

features = iris.data

In [14]:
# Creating scaler

scaler = StandardScaler()

In [15]:
# Standardize features

features_standardized = scaler.fit_transform(features)

In [16]:
# Create mini-batch k-means object

cluster = MiniBatchKMeans(n_clusters = 3, random_state = 0, batch_size = 100)

In [17]:
# Train model

model = cluster.fit(features_standardized)
model

MiniBatchKMeans(batch_size=100, compute_labels=True, init='k-means++',
                init_size=None, max_iter=100, max_no_improvement=10,
                n_clusters=3, n_init=3, random_state=0, reassignment_ratio=0.01,
                tol=0.0, verbose=0)

# Clustering using Meanshift

Group observations without assuming the number of clusters or their shape.

You don't have to select k.

---

**Concept:**

Imagine you are on a very foggy football field (a two-dimensional feature space) with 100 people standing on it (the observations).

Every minute, each person takes a step in the direction where they can see the most people.

As time goes on, people start to group up as they repeatedly take steps toward larger and larger crowds.

The end result is clusters of people around the field, and each person is assigned to the cluster they end up in.

---
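The foggy-field analogy can be sketched with a few lines of NumPy. This is a simplified flat-kernel version written for illustration, not scikit-learn's implementation: the function name `mean_shift_step`, the toy data, and the `bandwidth` value (how far a person can see) are all assumptions.

```python
import numpy as np

def mean_shift_step(points, data, bandwidth):
    """Move every point to the mean of the observations it can 'see'."""
    shifted = np.empty_like(points)
    for i, p in enumerate(points):
        # Observations within the visibility radius of this point
        in_view = data[np.linalg.norm(data - p, axis=1) <= bandwidth]
        shifted[i] = in_view.mean(axis=0) if len(in_view) > 0 else p
    return shifted

# Toy field: two crowds of 50 people each
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 0.2, (50, 2)), rng.normal(4, 0.2, (50, 2))])

# Everyone starts at their own position and repeatedly steps toward the crowd
points = data.copy()
for _ in range(20):
    points = mean_shift_step(points, data, bandwidth=1.0)

# People from the same crowd end up at (nearly) the same spot
print(points[0], points[-1])
```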

bandwidth: sets the radius of the area (the kernel) an observation uses to determine the direction to shift.

In the analogy, this is how far a person can see in the fog.

Sometimes there are no other observations within an observation's kernel. By default, MeanShift assigns all of these orphan observations to the kernel of the closest observation.

cluster_all=False: orphan observations are instead given the label -1.
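A short sketch of setting both parameters explicitly. `estimate_bandwidth` is scikit-learn's helper for picking a bandwidth from the data; the `quantile=0.2` value is an arbitrary choice for illustration, not a recommendation.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

features_standardized = StandardScaler().fit_transform(load_iris().data)

# Estimate a bandwidth from pairwise distances (quantile is an assumption)
bandwidth = estimate_bandwidth(features_standardized, quantile=0.2)

# cluster_all=False leaves orphan observations labeled -1
model = MeanShift(bandwidth=bandwidth, cluster_all=False).fit(features_standardized)

n_clusters = len(set(model.labels_) - {-1})
n_orphans = int(np.sum(model.labels_ == -1))
print(bandwidth, n_clusters, n_orphans)
```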

In [18]:
# Load libraries

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import MeanShift

In [19]:
# Load data

iris = load_iris()

features = iris.data

In [20]:
# Creating scaler

scaler = StandardScaler()

In [21]:
# Standardize features

features_standardized = scaler.fit_transform(features)

In [22]:
# Create mean shift object

cluster = MeanShift(n_jobs =-1)

In [23]:
# Train model

model = cluster.fit(features_standardized)
model

MeanShift(bandwidth=None, bin_seeding=False, cluster_all=True, max_iter=300,
          min_bin_freq=1, n_jobs=-1, seeds=None)

# Clustering using DBSCAN

Group observations into clusters of high density.

### DBSCAN

Motivated by the idea that clusters are areas where many observations are densely packed together; it makes no assumption about cluster shape.

- 1-. A random observation, x, is chosen.

- 2-. If x has a minimum number of close neighbors, we consider it to be part of a cluster.

- 3-. The previous step is repeated recursively for all of x's neighbors, then the neighbors' neighbors, and so on. These are the cluster's core observations.

- 4-. Once step 3 runs out of nearby observations, a new random point is chosen (restarting step 1).

Any observation close to a cluster but not a core sample is considered part of that cluster. Any observation not close to any cluster is labeled an outlier (label -1 in scikit-learn).
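The procedure above can be sketched in NumPy as follows. This is a simplified illustration, not scikit-learn's implementation: the name `dbscan_sketch` is hypothetical, observations are visited in index order rather than at random, and an observation counts itself among its neighbors.

```python
import numpy as np

def dbscan_sketch(X, eps=0.5, min_samples=5):
    """Toy DBSCAN: returns one label per observation, -1 for outliers."""
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Neighbors within eps (each observation is its own neighbor too)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    is_core = np.array([len(nb) >= min_samples for nb in neighbors])
    labels = np.full(n, -1)
    cluster_id = 0
    for i in range(n):
        # Only an unlabeled core observation can start a new cluster
        if labels[i] != -1 or not is_core[i]:
            continue
        labels[i] = cluster_id
        stack = [i]
        while stack:
            j = stack.pop()  # j is always a core observation
            for nb in neighbors[j]:
                if labels[nb] == -1:
                    labels[nb] = cluster_id  # core or border member
                    if is_core[nb]:
                        stack.append(nb)     # keep expanding through core points
        cluster_id += 1
    return labels

# Toy data: two dense 5x5 grids plus one isolated observation
grid = np.array([[x, y] for x in range(5) for y in range(5)], dtype=float) * 0.3
X = np.vstack([grid, grid + 10.0, [[5.0, 5.0]]])
labels = dbscan_sketch(X, eps=0.5, min_samples=5)
print(labels)
```

In this toy data the grid corners have too few neighbors to be core observations, so they join their cluster as border members, while the isolated observation stays labeled -1.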

**Parameters**

eps: Maximum distance from an observation for another observation to be considered its neighbor.

min_samples: The minimum number of observations within eps distance of an observation for it to be considered a core observation.

metric: The distance metric used by eps, for example minkowski (with the p parameter setting its power) or euclidean.
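A minimal sketch that sets these parameters explicitly (the values shown are scikit-learn's defaults) and then counts the clusters, core observations, and outliers that result on the standardized iris data.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

features_standardized = StandardScaler().fit_transform(load_iris().data)

model = DBSCAN(eps=0.5, min_samples=5, metric="euclidean").fit(features_standardized)
labels = model.labels_

n_clusters = len(set(labels) - {-1})       # -1 marks outliers, not a cluster
n_noise = int(np.sum(labels == -1))
n_core = len(model.core_sample_indices_)   # indices of core observations
print(n_clusters, n_core, n_noise)
```

Increasing `eps` or lowering `min_samples` generally makes DBSCAN more permissive, so fewer observations are labeled -1.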

In [24]:
# Load libraries

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

In [25]:
# Load data

iris = load_iris()

features = iris.data

In [26]:
# Creating scaler

scaler = StandardScaler()

In [27]:
# Standardize features

features_standardized = scaler.fit_transform(features)

In [28]:
# Create DBSCAN object

cluster = DBSCAN(n_jobs = -1)

In [29]:
# Train model

model = cluster.fit(features_standardized)
model

DBSCAN(algorithm='auto', eps=0.5, leaf_size=30, metric='euclidean',
       metric_params=None, min_samples=5, n_jobs=-1, p=None)

In [30]:
model.labels_

array([ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, -1, -1,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, -1, -1,
        0,  0,  0,  0,  0,  0,  0, -1,  0,  0,  0,  0,  0,  0,  0,  0,  1,
        1,  1,  1,  1,  1, -1, -1,  1, -1, -1,  1, -1,  1,  1,  1,  1,  1,
       -1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
       -1,  1, -1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1, -1,  1, -1,  1,
        1,  1,  1, -1, -1, -1, -1, -1,  1,  1,  1,  1, -1,  1,  1, -1, -1,
       -1,  1,  1, -1,  1,  1, -1,  1,  1,  1, -1, -1, -1,  1,  1,  1, -1,
       -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1])

# Clustering Using Hierarchical Merging

Group observations using a hierarchy of clusters.

### Agglomerative Clustering.

A powerful, flexible hierarchical clustering algorithm.

All observations start as their own clusters.

Clusters meeting some criteria are merged together.

This process is repeated, growing clusters until some end point is reached. Use the linkage parameter to choose the merging strategy, which minimizes one of the following:

- Variance of merged clusters (ward)
- Average distance between observations from pairs of clusters (average)
- Maximum distance between observations from pairs of clusters (complete)

affinity: determines the distance metric used for linkage, for example minkowski or euclidean (newer scikit-learn releases call this parameter metric).

n_clusters: the number of clusters the algorithm will attempt to find.
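To see how the merging strategy changes the result, the sketch below fits one model per linkage option on the standardized iris data and records the resulting cluster sizes. The dictionary name `sizes` is illustrative.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

features_standardized = StandardScaler().fit_transform(load_iris().data)

sizes = {}
for linkage in ("ward", "average", "complete"):
    model = AgglomerativeClustering(n_clusters=3, linkage=linkage)
    labels = model.fit_predict(features_standardized)
    # Number of observations assigned to each of the three clusters
    sizes[linkage] = np.bincount(labels)

for linkage, counts in sizes.items():
    print(linkage, counts)
```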

In [32]:
# Load libraries

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering

In [33]:
# Load data

iris = load_iris()

features = iris.data

In [34]:
# Creating scaler

scaler = StandardScaler()

In [35]:
# Standardize features

features_standardized = scaler.fit_transform(features)

In [36]:
# Create agglomerative clustering object

cluster = AgglomerativeClustering(n_clusters = 3)

In [37]:
# Train model

model = cluster.fit(features_standardized)
model

AgglomerativeClustering(affinity='euclidean', compute_full_tree='auto',
                        connectivity=None, distance_threshold=None,
                        linkage='ward', memory=None, n_clusters=3)

In [38]:
model.labels_

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 0, 0, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 0, 2, 0,
       2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 0, 2, 0, 0, 2,
       2, 2, 2, 0, 2, 2, 2, 2, 2, 0, 2, 2, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])