We use matplotlib to display graphs and segmented images.

In [2]:
import matplotlib.pyplot as plt

# Display all plots inline
%matplotlib inline

## 1. Density Peak (DP) Clustering with Test Data
First we read in Excel data that has been set up manually to illustrate how DP clustering works on a small test data set. All code for the DP algorithm was tested on this very small data set first to ensure that the results are sound. 
It is relatively easy to identify and fix bugs when the code runs on some 20 data points, whereas similar checks on the thousands of pixels in an image are nearly impossible. 

In [None]:
# Note: xlrd 2.0 and later read only .xls files; opening this .xlsx file requires xlrd < 2.0
import xlrd


def scatter_plot(x,y): 
    plt.figure(figsize=(10, 6))    
    plt.xlabel("x-axis")
    plt.ylabel("y-axis")
    plt.title("Scatterplot of points")
    plt.scatter(x, y, marker= 'o')
    plt.show()


inputs = list()
with xlrd.open_workbook('testdata/testdata.xlsx') as workbook:
    worksheet = workbook.sheet_by_name('test')
    for row_idx in range(1, worksheet.nrows):
        record = [worksheet.cell(row_idx, col_idx).value for col_idx in range(worksheet.ncols)]
        # append new case (as tuple with x and y component)
        inputs.append((record[0], record[1]))

# unzip list of tuples and display scatterplot
x, y = zip(*inputs)
scatter_plot(x,y)  


Figure 1-1. Points visualised in scatterplot

Intuitively, we would divide the points into three groups: one at the top centre, a second at the bottom left and a third at the middle right. 

We test the newly developed DP clustering algorithm against this data and visualise the results by displaying the points in different colours. We use the Euclidean metric, as it is a plausible measure of the distance between points in the diagram. 
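As a reminder of what the Euclidean metric computes, here is a minimal sketch. The helper name `euclidean` and the two sample points are illustrative assumptions; they are not taken from the `dp` module or from the Excel file.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two points given as coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Illustrative points only (hypothetical, not from the test data)
print(euclidean((0.0, 0.0), (3.0, 4.0)))  # 5.0
```

The same helper works unchanged for three-dimensional points such as RGB pixels, because it iterates over however many coordinates the tuples contain.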

First we import the DP class definition. Then we instantiate the class and run the code.

In [4]:
from dp import DPPoints 

Now we run the DP algorithm with some specific hyper-parameter settings that work best on such a small data set. It is not surprising that those settings have to be modified when we run the same code against actual images (with hundreds of thousands of pixels in three-dimensional space). 

We set granularity (cube length) to 1, as there is no need to slice the data space: the Excel file contains only some 20 records. 

In [5]:
dp = DPPoints()
dp.GRANULARITY = 1 # work on each single point, without pre-clustering 
dp.D_SCALING = 7.0 # this scaling factor produced good results for the data set 
dp.run(inputs)     # next run dp clustering 


We display the density of each point in a scatterplot.

In [None]:
def scatter_plot_by_cluster(): 
    plt.figure(figsize=(10, 6))    
    plt.xlabel("x-axis")
    plt.ylabel("y-axis")
    plt.title("Scatterplot of points")
    
    for p in dp.pnts:
        plt.annotate(str(round(dp.dens[p],1)), p)
        plt.scatter(p[0], p[1], marker= 'o')
    
    plt.show()


# visualise the density of each point in a scatterplot
scatter_plot_by_cluster() 

Figure 1-2. Points in the scatterplot annotated with their calculated density (rho)

The highest density (100) belongs to a point in the top cluster. The cluster on the right and the cluster at the bottom left each contain a high-density point as well. 

In the next step, we convert the data to a Decision Graph, with density (rho) as x-axis and distance (delta) as y-axis.
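To make the two axes concrete, the following is a minimal sketch of how density (rho) and distance (delta) are usually defined in density peak clustering with a cut-off kernel: rho counts the neighbours closer than a cut-off distance, and delta is the distance to the nearest point of higher density. This is an assumption about the general technique, not a copy of `DPPoints`; the function name, the toy points and the value of `dc` are hypothetical, and the sketch does not rescale densities to a maximum of 100 as the notebook output suggests.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def density_and_distance(points, dc):
    """Return (rho, delta) dictionaries for a list of distinct coordinate tuples."""
    # rho: number of other points closer than the cut-off distance dc
    rho = {p: sum(1 for q in points if q != p and euclidean(p, q) < dc)
           for p in points}
    delta = {}
    for p in points:
        # distances from p to every point that is denser than p
        higher = [euclidean(p, q) for q in points if rho[q] > rho[p]]
        if higher:
            delta[p] = min(higher)
        else:
            # densest point: use its largest distance to any other point
            delta[p] = max(euclidean(p, q) for q in points)
    return rho, delta


# Hypothetical toy data: a tight group of three points and one isolated point
toy = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
rho, delta = density_and_distance(toy, dc=1.2)
for p in toy:
    print(p, rho[p], round(delta[p], 2))
```

In this toy example the point (0, 0) has both the highest rho and the highest delta, so it would stand out in a Decision Graph. The isolated point also has a large delta but a density of zero, which is why a density threshold is needed in addition to a distance threshold.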

In [None]:
def decision_graph(): 
    plt.figure(figsize=(10, 6))    
    plt.xlabel("Density")
    plt.ylabel("Distance")
    plt.title("Decision Graph")
    
    for p in dp.pnts:
        plt.scatter(dp.dens[p], dp.dst[p], marker= 'o')
    plt.show()


# visualise data in the decision graph
decision_graph() 

Figure 1-3. Decision Graph with test data

The Decision Graph shows 3 points that stand out as outliers and can be designated as the centroids of the new clusters. 
Now we highlight the outliers in the Decision Graph. 

In [None]:
def decision_graph_with_outliers(): 
    plt.figure(figsize=(10, 6))    
    plt.xlabel("Density")
    plt.ylabel("Distance")
    plt.title("Decision Graph with Outliers")
    
    # Plot the density/distance pairs into the Decision Graph
    for key in dp.pnts.keys():
        if key in dp.centroids:
            plt.scatter(dp.dens[key], dp.dst[key], marker= 'o', color = 'red')
        else:
            plt.scatter(dp.dens[key], dp.dst[key], marker= 'o', color = 'blue')

    plt.fill([dp.dens_threshold,dp.DG_SCALING+10,dp.DG_SCALING+10,dp.dens_threshold], 
             [dp.dst_threshold,dp.dst_threshold,dp.DG_SCALING+10,dp.DG_SCALING+10], 'b', alpha=0.1)
    plt.annotate('Greyed area contains the outliers', [dp.dens_threshold+30,dp.DG_SCALING+13])
    # show the graph 
    plt.show()

# run the code in this function 
decision_graph_with_outliers() 

Figure 1-4. Decision Graph with highlighted area for outliers

Points in the greyed area are classified as outliers and marked as the cluster centres. As the minimum value for the density we take 5% of the density, and as the minimum value for the distance we take a value modelled by the exponential distribution. The resulting outlier area is shaded in grey. 
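The thresholding can be sketched as follows. This is one plausible reading under stated assumptions, not the code in the `dp` module: we assume that the density threshold is 5% of the maximum density, and that the distance threshold is an upper quantile of an exponential distribution whose mean equals the mean of all distances. The function name, the parameter `tail_prob` and the sample values are hypothetical.

```python
import math


def select_centroids(rho, delta, dens_fraction=0.05, tail_prob=0.01):
    """Pick points whose density and distance both exceed their thresholds."""
    # Assumption: minimum density is a fraction of the maximum density
    dens_threshold = dens_fraction * max(rho.values())
    # Assumption: distances follow an exponential distribution with this mean
    mean_delta = sum(delta.values()) / len(delta)
    # Solve P(X > t) = exp(-t / mean) = tail_prob for t
    dst_threshold = -mean_delta * math.log(tail_prob)
    centroids = [p for p in rho
                 if rho[p] > dens_threshold and delta[p] > dst_threshold]
    return centroids, dens_threshold, dst_threshold


# Hypothetical values: 18 ordinary points, two centres and one sparse far-away point
rho = {"p%d" % i: 50.0 for i in range(18)}
delta = {"p%d" % i: 1.0 for i in range(18)}
rho.update({"c1": 100.0, "c2": 80.0, "noise": 2.0})
delta.update({"c1": 30.0, "c2": 30.0, "noise": 30.0})

centroids, dens_threshold, dst_threshold = select_centroids(rho, delta)
print(sorted(centroids), dens_threshold, round(dst_threshold, 2))
```

The point labelled `noise` is far from everything else, yet it is rejected because its density lies below the density threshold; only `c1` and `c2` satisfy both conditions.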

Finally, we visualise the outliers in the original scatterplot. 

In [None]:
def scatter_plot_by_cluster(): 
    plt.figure(figsize=(10, 6))    
    plt.xlabel("x-axis")
    plt.ylabel("y-axis")
    plt.title("Scatterplot annotated with cluster groups and centroids of each cluster in red")
    # Plot the points into the scatterplot and annotate with their respective groups (1, ... k)
    points_per_group = dp.get_data() 
    colors = ['red','green','blue']
    for p in dp.pnts:
        for i,c in enumerate(dp.centroids):
            if dp.assigned_group[dp.p_map[p]] == c:
                if c == p:
                    plt.annotate('<--Centroid', [p[0]+0.5,p[1]-0.7])
                plt.scatter(p[0],p[1], marker= 'o', color = colors[i])
    plt.show()
#visualise data in scatterplot 
scatter_plot_by_cluster()

Figure 1-5. Scatterplot with cluster groups and centroids

The three groups are coloured and match the intuitive grouping predicted when the data was first inspected (see the comments on Figure 1-1). The centroids lie roughly in the centres of their respective groups. 
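The notebook does not show how `dp.assigned_group` is filled. A common rule in density peak clustering, sketched here as an assumption rather than as the actual implementation, is that every remaining point joins the group of its nearest neighbour of higher density. The function name and the sample data are hypothetical.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def assign_groups(points, rho, centroids):
    """Label each point with the centroid reached by following denser neighbours."""
    group = {c: c for c in centroids}
    # Visit points from high to low density so denser neighbours are labelled first
    for p in sorted(points, key=lambda q: rho[q], reverse=True):
        if p in group:
            continue
        denser = [q for q in points if rho[q] > rho[p]]
        # Fall back to the centroids if no point is denser than p
        candidates = denser or centroids
        nearest = min(candidates, key=lambda q: euclidean(p, q))
        group[p] = group[nearest]
    return group


# Hypothetical points on a line with made-up densities
points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)]
rho = {(0, 0): 3.0, (1, 0): 2.0, (2, 0): 1.0, (10, 0): 2.5, (11, 0): 1.5}
groups = assign_groups(points, rho, centroids=[(0, 0), (10, 0)])
for p in points:
    print(p, "->", groups[p])
```

Note that (2, 0) reaches the centroid (0, 0) indirectly through its denser neighbour (1, 0), which is why the points are processed in order of decreasing density.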

This concludes the test of the DP algorithm for the test data set in the Excel file. Now we apply the same algorithm to images. 

## 2. Density-Peak (DP) Segmentation for Images

We start by defining a support function for the inline display of pictures in this notebook. The datetime library is imported so that we can measure the runtime of the Density-Peak (DP) process for various configuration settings and images. 

We import the class DPImage, which has been developed for clustering images with DP. The PIL library (Python Imaging Library) is used to exchange pixels between images (before and after the pixel manipulations).


In [10]:
from dp import DPImage
from PIL import Image

We run the DP algorithm.

In [None]:
dist = 'Euclidean'    
dp = DPImage(dist)
imgname = "images/k.jpeg"
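img1 = Image.open(imgname)   # original image, passed to the DP algorithm
img2 = Image.open(imgname)   # second copy that receives the clustered pixels
dp.run_img(img1)
img2.putdata(dp.get_data())  # overwrite the pixels of the copy with the DP result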
f, axarr = plt.subplots(1,2)
f.suptitle("DP - Granularity = "+str(dp.GRANULARITY)+", Number of Distinct Pixels = "+str(len(dp.pnts.keys())), fontsize=16)
axarr[0].set_title('Original')
axarr[0].imshow(img1)
axarr[0].axis('off')
axarr[1].set_title('Clustered')
axarr[1].imshow(img2)   
axarr[1].axis('off') 
plt.show() 

print('Granularity:                         '+str(dp.GRANULARITY))
print('Number of pixel clusters:            '+str(len(dp.pnts.keys())))
print('Number of centroids:                 '+str(len(dp.centroids)))
print('Total Number of pixels in image:     '+str(sum(dp.pnts.values())))
print('Max distance between two pixels:     '+str(round(dp.max_dist,1)))
print('Percentage DC to max pixel distance: '+str(round(100*dp.dc/dp.max_dist,1))+'%')
print('Runtime (in seconds):                '+str(dp.seconds))


Figure 1-6. Output for Clustered Image with Details and Runtime Performance 

For the percentage of DC relative to the maximum pixel distance, the paper by Zhenshong et al. recommends 0.5% of the maximum distance between two pixels. In my own tests on pre-clustered images (with a cube length of 16 pixels), 3% worked well on most images. 
This slightly larger value has a plausible explanation: pre-clustering introduces larger 'void' areas between pixels, since the minimum distance between two pixel cube centres is now 16. DC therefore needs to be increased slightly so that pixels outside a point's own cube still contribute to its density. 
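The pre-clustering step can be pictured with the following minimal sketch. It assumes that each RGB pixel is mapped to the centre of the cube that contains it and that the number of pixels per cube is counted; the function name `precluster` and the sample pixels are hypothetical, and `DPImage` may implement this step differently.

```python
from collections import Counter


def precluster(pixels, granularity=16):
    """Map each RGB pixel to the centre of its cube and count the pixels per cube."""
    half = granularity // 2
    counts = Counter()
    for pixel in pixels:
        # integer division snaps each channel to the lower edge of its cube
        centre = tuple((c // granularity) * granularity + half for c in pixel)
        counts[centre] += 1
    return counts


# Hypothetical pixels: two dark, one slightly redder, two near white
pixels = [(0, 0, 0), (5, 10, 15), (16, 0, 0), (250, 250, 250), (255, 255, 255)]
cubes = precluster(pixels)
for centre, n in cubes.items():
    print(centre, n)
```

The five pixels collapse into three cube centres, and the two neighbouring centres (8, 8, 8) and (24, 8, 8) are exactly 16 apart, which is the minimum spacing mentioned above. With `granularity=1` every pixel value maps to itself, so no pre-clustering takes place.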

Runtime on my MacBook Air with 8GB RAM is excellent (just 4 seconds). 

Now we plot the data in the decision graph. 

In [None]:
def decision_graph_with_outliers(): 
    plt.figure(figsize=(10, 6))    
    plt.xlabel("Density")
    plt.ylabel("Distance")
    plt.title("Decision Graph with Outliers")
    
    # Plot the density/distance pairs into the Decision Graph
    for key in dp.pnts.keys():
        if key in dp.centroids:
            plt.scatter(dp.dens[key], dp.dst[key], marker= 'o', color = 'red')
        else:
            plt.scatter(dp.dens[key], dp.dst[key], marker= 'o', color = 'blue')

    plt.fill([dp.dens_threshold,dp.DG_SCALING+10,dp.DG_SCALING+10,dp.dens_threshold], 
             [dp.dst_threshold,dp.dst_threshold,dp.DG_SCALING+10,dp.DG_SCALING+10], 'b', alpha=0.1)
    plt.annotate('Greyed area contains the outliers', [dp.dens_threshold+30,dp.DG_SCALING+13])
    # show the graph 
    plt.show()

# run the code in this function 
decision_graph_with_outliers() 

Figure 1-7. Decision Graph with Outliers for Clustered Image at Figure 1-6

The program identifies 3 clusters (outliers) in the Decision Graph. All other pixels lie close to the x-axis (density), with a distance close to 0. 



This concludes the coding section of the Project Report for the Density Peak (DP) Algorithm.  