In [6]:
from bokeh.io import output_notebook, show
from bokeh.layouts import column, row, widgetbox
from bokeh.plotting import figure
from bokeh.models import HoverTool, ColumnDataSource, LabelSet, CustomJS, Slider, Range1d
from bokeh.models.widgets import Select, Panel, Tabs
import pandas as pd
import os
import numpy as np
from scipy import stats
from scipy.optimize import curve_fit
import math
import copy
import csv
from sklearn.cluster import KMeans, DBSCAN
from sklearn import preprocessing

In [7]:
output_notebook()

### Clustering Data Set 1
<br>
The data has been normalized by <b>rescaling every column to a range of 0 to 100</b> (min-max scaling).<br>
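As a rough illustration of this rescaling, the sketch below applies the min-max formula column by column with plain NumPy. The array `X` is made-up data, not the wholesale data set, and the helper name `minmax_0_100` is hypothetical. The notebook itself uses `preprocessing.MinMaxScaler(feature_range=(0, 100))`, which should give the same result for columns that are not constant.

```python
import numpy as np

def minmax_0_100(X):
    """Rescale each column of X linearly so its minimum maps to 0 and its maximum to 100."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    # Assumption: no column is constant (col_max > col_min); otherwise this divides by zero.
    return (X - col_min) / (col_max - col_min) * 100.0

# Made-up example data: two columns on very different scales.
X = np.array([[1.0, 200.0],
              [3.0, 400.0],
              [5.0, 1000.0]])
X_scaled = minmax_0_100(X)
print(X_scaled)
```

After rescaling, every column spans the same 0 to 100 range, so no single column dominates the Euclidean distances used by both clustering algorithms.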
The two clustering algorithms selected are: <br>
1. K-Means: partitions the n observations into k clusters <b>(k is 2 here, chosen after some visual experimentation)</b>; each observation belongs to the cluster with the nearest mean, which serves as the prototype of that cluster.<br>
2. DBScan: a density-based clustering algorithm. Given a set of points in some space, it groups together points that are closely packed (points with many nearby neighbors) and marks as outliers the points that lie alone in low-density regions (whose nearest neighbors are too far away).<br>
For DBScan, eps is the maximum distance between two samples for them to be considered part of the same neighborhood. <b>This has been set to 10.</b><br>
min_samples is the number of samples (or total weight) in a neighborhood for a point to be considered a core point, including the point itself. <b>This has also been set to 10.</b>
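
To make the roles of eps and min_samples more concrete, here is a minimal sketch on made-up 2-D points (two tight groups and one isolated point). The parameter values are chosen for this toy data and are not the notebook's 10 and 10. It also shows why adding 1 to the labels, as the notebook does, moves the outliers to index 0.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up 2-D points: two tight groups and one isolated point.
points = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.5, 0.5],            # group A
    [50.0, 50.0], [50.0, 51.0], [51.0, 50.0], [51.0, 51.0], [50.5, 50.5],  # group B
    [100.0, 0.0],                                                          # isolated point
])

# eps and min_samples are picked for this toy data only.
labels = DBSCAN(eps=2.0, min_samples=3).fit(points).labels_

n_noise = int(np.sum(labels == -1))    # scikit-learn marks noise points with -1
n_clusters = len(set(labels) - {-1})   # number of real clusters, noise excluded

# Shifting by one, as the notebook does, puts noise at index 0.
shifted = labels + 1
print(n_clusters, n_noise, shifted.tolist())
```

With the shift applied, index 0 holds the noise points, which is why outliers end up drawn with the first entry of a color or legend lookup keyed from 0.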

In [16]:
with open('Wholesale customers data.csv', 'rt') as csvfile:
    original_data = csv.reader(csvfile, delimiter=',')
    original_data = np.array(list(original_data))

#separating column names from data
columns = original_data[0]
original_data = original_data[1:,:]
original_data = original_data.astype(float)  # np.float was removed in NumPy 1.24; the builtin float is equivalent
min_max_scaler = preprocessing.MinMaxScaler(feature_range=(0, 100))
original_data_minmax = min_max_scaler.fit_transform(original_data)
kmeans = KMeans(n_clusters=2, max_iter=750).fit(original_data_minmax[:,2:])
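# NOTE: DBSCAN is fit on columns 3 onward, while KMeans above uses columns 2 onward;
# use original_data_minmax[:,2:] here if both algorithms should see the same features.
DBScan = DBSCAN(eps=10, min_samples=10, metric='euclidean', algorithm='auto', leaf_size=30, p=None, n_jobs=-1).fit(original_data_minmax[:,3:])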
labels_DBScan = DBScan.labels_ + 1
labels_KMeans = kmeans.labels_

colorscheme = {
    0: 'green',
    1: 'red',
    2: 'yellow',
    3: 'navy'
}
    
legendScheme = {
    0: 'Cluster 1',
    1: 'Cluster 2',
    2: 'Cluster 3',
    3: 'Cluster 4'
}
    
plots_KMeans = []
plots_DBScan = []
plotNum = 0
actualPlotNum = 1
ToPlot = [1,3,5,6,8,10,11,12,13,14,15,18,24]
for i in range(2,len(original_data_minmax[0])):
    for j in range(2,len(original_data_minmax[0])):
        if j == i:
            continue
        if plotNum not in ToPlot:
            plotNum+=1
            continue
        #KMeans
        plots_KMeans.append(figure(plot_width = 400, plot_height = 400, title='KMeans Plot '+str(actualPlotNum)+' - X: Column '+str(i+1)+', Y: Column '+str(j+1)))
        for k in range(max(labels_KMeans)+1):
            plots_KMeans[-1].scatter(x=[original_data_minmax[ind,i] for ind,val in enumerate(labels_KMeans) if val == k], y=[original_data_minmax[ind,j] for ind,val in enumerate(labels_KMeans) if val == k], marker='x', size=5, line_color = colorscheme[k], legend=legendScheme[k])
            plots_KMeans[-1].circle(x=kmeans.cluster_centers_[k][i-2],y=kmeans.cluster_centers_[k][j-2], size=10)
        plots_KMeans[-1].legend.click_policy="hide"
        plots_KMeans[-1].xaxis.axis_label = columns[i]
        plots_KMeans[-1].yaxis.axis_label = columns[j]
        #DBScan
        plots_DBScan.append(figure(plot_width = 400, plot_height = 400, title='DBScan Plot '+str(actualPlotNum)+' - X: Column '+str(i+1)+', Y: Column '+str(j+1), x_range=plots_KMeans[-1].x_range, y_range=plots_KMeans[-1].y_range))
        for k in range(max(labels_DBScan)+1):
            plots_DBScan[-1].scatter(x=[original_data_minmax[ind,i] for ind,val in enumerate(labels_DBScan) if val == k], y=[original_data_minmax[ind,j] for ind,val in enumerate(labels_DBScan) if val == k], marker='x', size=5, line_color = colorscheme[k], legend=legendScheme[k])
        plots_DBScan[-1].legend.click_policy="hide"
        plots_DBScan[-1].xaxis.axis_label = columns[i]
        plots_DBScan[-1].yaxis.axis_label = columns[j]
        
        show(row(plots_KMeans[-1],plots_DBScan[-1]))
        plotNum+=1
        actualPlotNum+=1

### Discussion
The above were selected from all the possible combinations of plots as they were deemed meaningful to the discussion in some way.<br>
On the left is the KMeans implementation and on the right is the same graph clustered by DBScan.<br>
Each plot is linked to the one beside it in panning and zooming, so the same area of the data can be observed under both clustering techniques at once. The legend entries can also be clicked to hide or show individual clusters. In the KMeans plots, the cluster centers are shown as blue circles. The clusters are otherwise color coded.<br>
Some important comparisons and observations from these visualizations:
1. K-Means was run with K = 2, 3, 4 and 5, but for every K > 2 the clusters formed seemingly at random and changed between re-runs. This suggests that either the data does not inherently contain many clusters or K-Means is simply unable to cluster it well. DBScan was run with various parameter values, but because the data is of varying density, it produced either many very small clusters (with small min_samples or eps) or one big cluster plus some outliers (with larger min_samples or eps). Setting both min_samples and eps to 10 produced clusters similar to those of K-Means.
2. DBScan returns just 1 cluster; all the other points are outliers that lie in no cluster. These outliers are shown as Cluster 1 (green) in all the plots. DBScan is therefore not a good clustering algorithm for this data set.
3. Plots 1, 2, 3, 6, 8 and 9 look very similar with respect to the clusters from both K-Means and DBScan. Most of them have a dense cluster at lower values of x and y, and another sparse cluster at higher values of x (3, 6, 8), y (1, 2), or both (9).
4. Plots 4, 5, 7, 10, 12 and 13 look similar to one another but demonstrate the difference between the two algorithms. K-Means assigns points by distance to the cluster centers, so the perpendicular bisector between the two centers effectively divides the space in two with a line. DBScan's Cluster 1, however, contains points that are not separated by this imaginary bisector and that surround the denser Cluster 2 in many of the plots.
5. Plot 11 has been included to show that not every pair of dimensions visualizes the clusters effectively. The cluster centers are near each other, and DBScan also shows overlapping clusters. This visualization therefore provides little information; a different perspective, an additional dimension or an entirely different axis is needed to reveal the clusters.
6. Some interpretations of the clusters can be made from these plots. From plot 3, for example, one can say that there is a region where a lot of milk is bought but not many fresh vegetables, and another region where less milk is bought irrespective of fresh vegetables. It is not clear, however, how most of the plots should be interpreted. It seems that clustering cannot be successfully employed on this dataset to gain valuable new information, as no obvious cluster shapes emerge even when all dimensions are plotted against each other. There is, however, the matter of columns 1 and 2, channel and region. They were not used in clustering, but when plotted it is clear that they already form clusters. This makes them labels in the data, which is exactly why they were excluded from the unsupervised clustering.
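
The perpendicular-bisector remark in point 4 can be checked numerically. With two centers, a point is nearer to the second center exactly when it lies on that center's side of the hyperplane that passes through the midpoint of the two centers and is perpendicular to the line joining them. The sketch below compares the two rules on random points; the centers `c0` and `c1` and the points are made up for illustration and are not taken from the fitted model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical cluster centers and random points in the 0..100 scaled space.
c0 = np.array([10.0, 15.0])
c1 = np.array([60.0, 40.0])
points = rng.uniform(0, 100, size=(500, 2))

# Rule 1: assign each point to its nearest center (the K-Means assignment step).
d0 = np.linalg.norm(points - c0, axis=1)
d1 = np.linalg.norm(points - c1, axis=1)
nearest = (d1 < d0).astype(int)

# Rule 2: test which side of the perpendicular bisector each point lies on.
midpoint = (c0 + c1) / 2
normal = c1 - c0
side = ((points - midpoint) @ normal > 0).astype(int)

# Both rules should agree for every point (barring points exactly on the bisector).
print(np.array_equal(nearest, side))
```

Because the boundary between two K-Means clusters is always this straight line (a hyperplane in higher dimensions), K-Means cannot produce one cluster that wraps around another, whereas a density-based method such as DBSCAN can.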