
# Comparing Clustering Algorithm Effectiveness

_Authors: Kiefer Katovich (SF)_

---

In this lab, you'll test three of the clustering algorithms we've covered on seven data sets that are specifically designed to evaluate clustering algorithm effectiveness.

This lab is exploratory and heavy on data visualization.


In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.cluster import KMeans, k_means
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs
# hdbscan is a third-party package (installed separately from scikit-learn).
import hdbscan
from sklearn.cluster import DBSCAN

plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

---

### 1) Load the data sets.

Each of the seven data sets has three columns:

    x
    y
    label
    
Because they each only have two variables, they're easy to examine visually. You’ll compare the “true” labels for the data to the clusters the algorithms find.

> Remember that in unsupervised learning methods like clustering, you will generally _not_ have "true labels."  They are provided here simply as a convenience to give you some behind-the-scenes insight.
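
When true labels do happen to be available, one way to compare them with the clusters an algorithm finds is the adjusted Rand index. The sketch below is only an illustration: it uses synthetic blobs from `make_blobs` as a stand-in for the lab's data sets (an assumption, not the lab's data), and `adjusted_rand_score` is not otherwise used in this lab.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for one of the lab's data sets (assumption):
# three well-separated blobs with known labels.
X, true_labels = make_blobs(n_samples=300,
                            centers=[[0, 0], [10, 10], [-10, 10]],
                            cluster_std=0.5, random_state=0)

predicted = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# The adjusted Rand index ignores how the clusters are numbered;
# a value of 1.0 means the two partitions are identical.
ari = adjusted_rand_score(true_labels, predicted)
print(round(ari, 3))
```

Because the score does not depend on cluster numbering, it can be used even when the algorithm calls a cluster `0` and the true label calls it `2`.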

In [None]:
flame = pd.read_csv('./datasets/flame.csv')
agg = pd.read_csv('./datasets/aggregation.csv')
comp = pd.read_csv('./datasets/compound.csv')
jain = pd.read_csv('./datasets/jain.csv')
path = pd.read_csv('./datasets/pathbased.csv')
r15 = pd.read_csv('./datasets/r15.csv')
spiral = pd.read_csv('./datasets/spiral.csv')

In [None]:
flame.head()

In [None]:
flame.label.value_counts()

In [None]:
agg.label.value_counts()

In [None]:
comp.label.value_counts()

In [None]:
jain.label.value_counts()

In [None]:
path.label.value_counts()

In [None]:
r15.label.value_counts()

In [None]:
spiral.label.value_counts()

---

### 2) Plot each of the data sets with colored true labels.

The data sets have different numbers of unique labels, so you'll need to figure out how many there are for each and color the clusters accordingly (for example, `r15` has 15 different clusters).
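
A minimal sketch of counting the unique labels per data set is shown below. The two small DataFrames are hypothetical stand-ins for the lab's CSV files; only the `label` column name comes from the lab.

```python
import pandas as pd

# Tiny hypothetical frames standing in for the lab's CSV files (assumption).
toy_a = pd.DataFrame({'x': [0.1, 0.2, 5.0, 5.1],
                      'y': [1.0, 1.1, 7.0, 7.2],
                      'label': [1, 1, 2, 2]})
toy_b = pd.DataFrame({'x': [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
                      'y': [0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
                      'label': [1, 2, 3, 1, 2, 3]})

datasets = {'toy_a': toy_a, 'toy_b': toy_b}

# nunique() counts the distinct values in the label column.
n_labels = {name: df['label'].nunique() for name, df in datasets.items()}
print(n_labels)
```

The same dictionary pattern should work with the seven real DataFrames in place of the toy ones.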

In [None]:
# Plotting function.
def plot_clusters(df, title):
    fig, ax = plt.subplots(figsize=(5,5))
    
    colors = plt.cm.Spectral(np.linspace(0, 1, len(df.label.unique())))
    
    for label, color in zip(df.label.unique(), colors):
        X = df[df.label == label]
        ax.scatter(X.iloc[:,0], X.iloc[:,1], s=70, 
                   color=color, label=label, alpha=0.9)
        
    ax.set_title(title, fontsize=20)
    ax.legend(loc='lower right')
    
    plt.show()

In [None]:
# Plot each data set with the true cluster labels.
plot_clusters(flame, 'flame')

In [None]:
plot_clusters(agg, 'aggregation')

In [None]:
plot_clusters(comp, 'compound')

In [None]:
plot_clusters(jain, 'jain')

In [None]:
plot_clusters(path, 'pathbased')

In [None]:
plot_clusters(r15, 'r15')

In [None]:
plot_clusters(spiral, 'spiral')

---

### 3) Write a plotting function (or functions) to compare the performance of the three clustering algorithms.

Load in the three clustering algorithms we covered earlier in the class.

    K-means: k-means clustering.
    Agglomerative clustering: hierarchical clustering (bottom up).
    DBSCAN: density-based clustering.
    
Your function(s) should allow you to visually examine the effects of changing different parameters in the clustering algorithms. The parameters you should explore, at minimum, are:

    K-means:
        n_clusters
    Agglomerative clustering:
        n_clusters
    DBSCAN:
        eps
        min_samples
        
Feel free to explore other parameters for these models.
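
To get a feel for how sensitive DBSCAN is to `eps`, the sketch below counts clusters and noise points for a few values. It uses `make_moons` as synthetic stand-in data, and `summarize_dbscan` is a hypothetical helper written for this illustration; neither is part of the lab.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Synthetic two-moons data as a stand-in for the lab's data sets (assumption).
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

def summarize_dbscan(X, eps, min_samples):
    """Return (number of clusters, number of noise points) for one setting."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    # DBSCAN marks noise points with the label -1, which is not a cluster.
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int((labels == -1).sum())
    return n_clusters, n_noise

# A tiny eps leaves every point as noise; a huge eps merges everything.
results = {eps: summarize_dbscan(X, eps, min_samples=5)
           for eps in (0.001, 0.2, 5.0)}

for eps, (n_clusters, n_noise) in results.items():
    print('eps =', eps, '| clusters:', n_clusters, '| noise points:', n_noise)
```

The useful range of `eps` depends on the scale of the data, so values that work on one data set will likely need adjusting on another.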


In [None]:
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

In [None]:
# Plot the true labels next to the clusters found by one chosen model.
# `model` is one of 'kmeans', 'agglomerative', or 'dbscan'.
def plot_single_model(df, title, model, n_clusters=3, eps=1, min_samples=5):

    fig, axarr = plt.subplots(1, 2, figsize=(12,7))

    # True:
    colors = plt.cm.Spectral(np.linspace(0, 1, len(df.label.unique())))

    for label, color in zip(df.label.unique(), colors):
        X_ = df[df.label == label]
        axarr[0].scatter(X_.iloc[:,0], X_.iloc[:,1], s=70,
                         color=color, label=label, alpha=0.9)

    axarr[0].set_title(title+' '+'true', fontsize=20)
    axarr[0].legend(loc='lower right')

    # Fit on the x and y columns only, not on the true labels.
    X = df.iloc[:, 0:2]

    if model.lower() == 'kmeans':
        estimator = KMeans(n_clusters=n_clusters)
    elif model.lower() == 'agglomerative':
        estimator = AgglomerativeClustering(n_clusters=n_clusters)
    elif model.lower() == 'dbscan':
        estimator = DBSCAN(eps=eps, min_samples=min_samples)
    else:
        raise ValueError('give a valid model')

    predicted = estimator.fit_predict(X)

    # Predicted:
    colors = plt.cm.Spectral(np.linspace(0, 1, len(np.unique(predicted))))

    for label, color in zip(np.unique(predicted), colors):
        X_ = X[predicted == label]
        axarr[1].scatter(X_.iloc[:,0], X_.iloc[:,1], s=70,
                         color=color, label=label, alpha=0.9)

    axarr[1].set_title(title+' '+model, fontsize=20)
    axarr[1].legend(loc='lower right')

    plt.show()

In [None]:
def plotting(df, title, n_clusters_kmeans=3, n_clusters_agg=3,
                         dbscan_eps=3, dbscan_min_samples=5):
    
    fig, axarr = plt.subplots(1,4, figsize=(24,7))
    
    # True:
    colors = plt.cm.Spectral(np.linspace(0, 1, len(df.label.unique())))
    
    for label, color in zip(df.label.unique(), colors):
        X_ = df[df.label == label]
        axarr[0].scatter(X_.iloc[:,0], X_.iloc[:,1], s=70, 
                         color=color, label=label, alpha=0.9)
        
    axarr[0].set_title(title+' '+'true', fontsize=20)
    axarr[0].legend(loc='lower right')
    
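    # Set up x. Copy so the label columns added below don't modify df
    # or trigger a SettingWithCopyWarning.
    X = df.iloc[:, 0:2].copy()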
    
    # K-means:
    kmeans = KMeans(n_clusters=n_clusters_kmeans)
    kmeans.fit(X.iloc[:, 0:2])
    
    X['kmeans_label'] = kmeans.labels_
    
    colors = plt.cm.Spectral(np.linspace(0, 1, len(X.kmeans_label.unique())))
    
    for label, color in zip(X.kmeans_label.unique(), colors):
        X_ = X[X.kmeans_label == label]
        axarr[1].scatter(X_.iloc[:,0], X_.iloc[:,1], s=70, 
                         color=color, label=label, alpha=0.9)
        
    axarr[1].set_title(title+' '+'kmeans', fontsize=20)
    axarr[1].legend(loc='lower right')
    
    
    # Hierarchical/agglomerative:
    aggclust = AgglomerativeClustering(n_clusters=n_clusters_agg)
    aggclust.fit(X.iloc[:, 0:2])
    
    X['aggclust_label'] = aggclust.labels_
    
    colors = plt.cm.Spectral(np.linspace(0, 1, len(X.aggclust_label.unique())))
    
    for label, color in zip(X.aggclust_label.unique(), colors):
        X_ = X[X.aggclust_label == label]
        axarr[2].scatter(X_.iloc[:,0], X_.iloc[:,1], s=70, 
                         color=color, label=label, alpha=0.9)
        
    axarr[2].set_title(title+' '+'agglomerative', fontsize=20)
    axarr[2].legend(loc='lower right')
    
    
    # DBSCAN:
    dbscan = DBSCAN(eps=dbscan_eps, min_samples=dbscan_min_samples)
    dbscan.fit(X.iloc[:, 0:2])
    
    X['dbscan_label'] = dbscan.labels_
    
    colors = plt.cm.Spectral(np.linspace(0, 1, len(X.dbscan_label.unique())))
    
    for label, color in zip(X.dbscan_label.unique(), colors):
        X_ = X[X.dbscan_label == label]
        axarr[3].scatter(X_.iloc[:,0], X_.iloc[:,1], s=70, 
                         color=color, label=label, alpha=0.9)
        
    axarr[3].set_title(title+' '+'DBSCAN', fontsize=20)
    axarr[3].legend(loc='lower right')

In [None]:
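from scipy.cluster.hierarchy import linkage, fcluster

# Hierarchical clustering on flame: build a Ward linkage on x and y,
# then cut the dendrogram at distance `dist`.
dist = 10
Z = linkage(flame[['x', 'y']], 'ward')
clusters = fcluster(Z, dist, criterion='distance')

fig, ax =  plt.subplots(figsize=(5,5))

ax.scatter(flame['x'], flame['y'], c=clusters, cmap='Spectral')
plt.title('Hierarchical Clustering with '+str(dist) + ' Dist')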

In [None]:
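# Fit DBSCAN on the x and y columns only, not on the true labels.
eps = .4
min_samples = 3
dbscn = DBSCAN(eps=eps, min_samples=min_samples)
dbscn.fit(flame[['x', 'y']])
core_samples = dbscn.core_sample_indices_

fig, ax =  plt.subplots(figsize=(5,5))
# Color by cluster label; noise points have the label -1.
ax.scatter(flame['x'], flame['y'], c=dbscn.labels_, cmap='Spectral')
plt.title('DBSCAN Clustering with '+str(eps) + ' Eps, and ' + str(min_samples) + ' Min. samples')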

In [None]:

# Fit k-means on the x and y columns only, not on the true labels.
model = KMeans(n_clusters=5)
model.fit(flame[['x', 'y']])

predicted = model.labels_
flame['predicted'] = predicted
centroids = model.cluster_centers_

fig, ax = plt.subplots(figsize=(10,10))

# Draw the points and the centroids on the same axes.
flame.plot(x="x", y="y", kind="scatter", c='predicted', colormap='gist_rainbow', alpha=.7, ax=ax)
ax.scatter(centroids[:,0], centroids[:,1], marker='x', s=150, alpha=.7, c='black')

In [None]:
centroids[:,:1].shape, centroids[:,-1:].shape

In [None]:
centroids

In [None]:
# Plot the clusters found by an already fitted KMeans model.
# Example: plot_kmeans(flame, KMeans(n_clusters=2).fit(flame[['x', 'y']]))
def plot_kmeans(df, kmeans, title='kmeans'):

    fig, ax = plt.subplots(figsize=(5,5))

    labels = kmeans.labels_
    colors = plt.cm.Spectral(np.linspace(0, 1, len(np.unique(labels))))

    for label, color in zip(np.unique(labels), colors):
        # Select the rows assigned to this cluster.
        X = df[labels == label]
        ax.scatter(X.iloc[:,0], X.iloc[:,1], s=70,
                   color=color, label=label, alpha=0.9)

    ax.set_title(title, fontsize=20)
    ax.legend(loc='lower right')

    plt.show()

In [None]:
# Assumption: work on a copy of flame with its two feature columns.
df = flame.copy()
features = ['x', 'y']
k = 3

# 1. Run k-means against our two features with three clusters.
model = KMeans(n_clusters=k, random_state=42)
model.fit(df[features].values)

# 2. Assign clusters back to our DataFrame.
df['cluster'] = model.labels_

# 3. Get our centroids.
centroids = model.cluster_centers_
cc = pd.DataFrame(centroids)

# 4. Plot the scatter of our points with calculated centroids.
fig, ax = plt.subplots(figsize=(10,8))

ax.scatter(df[features[0]], df[features[1]], c=df['cluster'])
ax.scatter(cc[0], cc[1], marker='x', s=100)

In [None]:
# Write a function that will plot the results of the three
# clustering algorithms for comparison.
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.preprocessing import StandardScaler

def compare_clustering(df, n_clusters=3, dist=10, eps=1, min_samples=1):
    # Plot df with actual labels.
    plot_clusters(df, 'Actual Labels')

    # Cluster on the two features only, never on the true labels.
    X = df[['x', 'y']]

    # Plot and score k-means clustering.
    kmeans_model = KMeans(n_clusters=n_clusters).fit(X)
    print('KMeans Inertia: ', kmeans_model.inertia_)
    # The silhouette score is only defined for two or more clusters.
    if n_clusters > 1:
        print('KMeans Silhouette: ', silhouette_score(X, kmeans_model.labels_))

    kmeans_centroids = kmeans_model.cluster_centers_
    kmeans_cc = pd.DataFrame(kmeans_centroids)

    fig, ax =  plt.subplots(figsize=(5,5))

    # Color points by cluster label and mark the centroids.
    ax.scatter(df['x'], df['y'], c=kmeans_model.labels_, cmap='Spectral')
    ax.scatter(kmeans_cc[0], kmeans_cc[1], marker='x', s=100, c='black')
    ax.set_title('KMeans Clustering with '+str(n_clusters) + ' clusters')

    # Plot hierarchical clustering, cutting the Ward dendrogram at `dist`.
    Z = linkage(X, 'ward')
    clusters = fcluster(Z, dist, criterion='distance')

    fig, ax =  plt.subplots(figsize=(5,5))

    ax.scatter(df['x'], df['y'], c=clusters, cmap='Spectral')
    ax.set_title('Hierarchical Clustering with '+str(dist) + ' Dist')

    # Plot DBSCAN clustering on standardized features.
    ss = StandardScaler()
    Xs = ss.fit_transform(X)

    dbscn = DBSCAN(eps=eps, min_samples=min_samples)
    dbscn.fit(Xs)

    fig, ax =  plt.subplots(figsize=(5,5))

    # Noise points carry the label -1.
    ax.scatter(df['x'], df['y'], c=dbscn.labels_, cmap='Spectral')
    ax.set_title('DBSCAN Clustering with '+str(eps) + ' Eps, and ' + str(min_samples) + ' Min. samples')

    plt.show()
    


### 4) Tinkering with clustering parameters.

In the following sections, look at how the parameters affect the clustering algorithms and try to get clusters that make sense. There is no right answer here, as these are unsupervised techniques.
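
Alongside visual inspection, a numeric heuristic can help narrow down `n_clusters`. The sketch below sweeps `n_clusters` for K-means and keeps the value with the highest silhouette score. It runs on synthetic blobs (an assumption, not one of the lab's data sets). Treat the result as a hint rather than an answer: the silhouette score favors compact, roughly round clusters, so it may be misleading on shapes like `spiral`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (assumption): four well-separated blobs.
X, _ = make_blobs(n_samples=400,
                  centers=[[0, 0], [0, 10], [10, 0], [10, 10]],
                  cluster_std=0.5, random_state=0)

# Fit K-means for each candidate k and record its silhouette score.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Keep the k with the highest score.
best_k = max(scores, key=scores.get)

for k, score in scores.items():
    print('k =', k, '| silhouette:', round(score, 3))
print('best k:', best_k)
```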

---

### 4.A) Find good parameters for the `flame` data set.

Which algorithm (visually) performs best?

In [None]:
# A:

---

### 4.B) Find good parameters for the `agg` data set.

Which algorithm (visually) performs best?

In [None]:
# A:

---

### 4.C) Find good parameters for the `comp` data set.

Which algorithm (visually) performs best?

In [None]:
# A:

---

### 4.D) Find good parameters for the `jain` data set.

Which algorithm (visually) performs best?

In [None]:
# A:

---

### 4.E) Find good parameters for the `pathbased` data set.

Which algorithm (visually) performs best?

In [None]:
# A:

---

### 4.F) Find good parameters for the `r15` data set.

Which algorithm (visually) performs best?

In [None]:
# A:

---

### 4.G) Find good parameters for the `spiral` data set.

Which algorithm (visually) performs best?

In [None]:
# A:

---

### 5) [Bonus] Explore other clustering algorithms.

Scikit-learn comes with a variety of unsupervised clustering algorithms, some of which we haven’t covered in class. Two algorithms that may be particularly interesting to you are:

1) [Affinity propagation](http://scikit-learn.org/dev/modules/clustering.html#affinity-propagation) finds clusters by passing messages between pairs of points until a set of exemplars (representative points) emerges; each point is then assigned to an exemplar. As with DBSCAN, you don't specify the number of clusters. It is driven mainly by the `preference` parameter, while the "damping factor" controls how quickly the messages are updated.
2) [Birch](http://scikit-learn.org/dev/modules/clustering.html#birch) builds a tree of compact subcluster summaries (a clustering feature tree) as it passes over the data, then groups those subclusters into the final clusters. The tree structure is loosely reminiscent of decision trees, but it summarizes nearby points rather than splitting on feature values.
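
As a starting point for this bonus section, here is a minimal sketch that fits both algorithms. It uses synthetic blobs rather than the lab's data sets, and the parameter values (`damping=0.9`, `threshold=0.5`) are arbitrary assumptions for illustration, not recommendations.

```python
from sklearn.cluster import AffinityPropagation, Birch
from sklearn.datasets import make_blobs

# Synthetic stand-in data (assumption): three well-separated blobs.
X, _ = make_blobs(n_samples=150,
                  centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=0)

# Affinity propagation chooses the number of clusters itself.
ap = AffinityPropagation(damping=0.9, random_state=0).fit(X)
n_ap_clusters = len(ap.cluster_centers_indices_)

# Birch builds its tree of subclusters, then merges them into n_clusters groups.
birch = Birch(n_clusters=3, threshold=0.5).fit(X)
n_birch_clusters = len(set(birch.labels_))

print('Affinity propagation clusters:', n_ap_clusters)
print('Birch clusters:', n_birch_clusters)
```

Both estimators expose `labels_` after fitting, so they should slot into the same plotting code used for the other three algorithms.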

In [None]:
# A: