### **IMAGE DESCRIPTOR - COLOR HISTOGRAMS**
* Unlike the **mean and standard deviation which attempt to summarize the pixel intensity distribution**, a **color histogram explicitly represents it**.
* In fact, a **color histogram is the color distribution**.
* The underlying assumption is that **images with similar color distributions contain similar visual content**.
* In this **example**, we take a small dataset of images, but instead of ranking them, we cluster them into two distinct groups using color histograms.

### **COLOR HISTOGRAMS:**
* **A color histogram counts the number of times a given pixel intensity occurs in an image**.
* Using a color histogram we can express the **actual distribution or “amount” of each color in an image**.
* The **counts for each color/color range are used as feature vectors**.
* If we decided to utilize a **3D color histogram with 8 bins per channel**, we could represent any image of any size using only **8 x 8 x 8 = 512 bins, that is, a 512-d feature vector**.
* The **size of an image has no effect on the length of our output color histogram**, and once the histogram is normalized it has no effect on the values either. It is still wise to resize large images to more manageable dimensions to speed up the histogram computation.
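
The two points above (a fixed 512-d vector and size independence) can be checked with a small sketch. It uses NumPy's `np.histogramdd` on a synthetic random image rather than OpenCV, and it normalizes by dividing by the total count (L1), which is an assumption made here for simplicity. The helper name `color_histogram_3d` is hypothetical.

```python
import numpy as np

def color_histogram_3d(pixels, bins=8):
    """Return a flattened 3D color histogram whose entries sum to 1.

    `pixels` is an (H, W, 3) uint8 array. Hypothetical helper for illustration.
    """
    # one row per pixel, one column per channel
    flat = pixels.reshape(-1, 3).astype(np.float64)
    hist, _ = np.histogramdd(flat, bins=(bins, bins, bins), range=((0, 256),) * 3)
    hist = hist.flatten()
    # divide by the pixel count so the image size cancels out
    return hist / hist.sum()

rng = np.random.default_rng(0)
small = rng.integers(0, 256, size=(20, 30, 3), dtype=np.uint8)
large = np.tile(small, (4, 4, 1))  # same content repeated, 16x the pixels

h_small = color_histogram_3d(small)
h_large = color_histogram_3d(large)

print(h_small.shape)                  # (512,)
print(np.allclose(h_small, h_large))  # True
```

Without the final division, every bin of `h_large` would be 16 times larger than the matching bin of `h_small`, even though both images show the same colors in the same proportions.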

### **k-means Clustering Algorithm:**
* k-means is a **clustering algorithm**.
* The goal of k-means is to **partition n data points into k clusters**.
* **Each of the n data points is assigned to the cluster with the nearest mean**.
* The **mean of each cluster** is called its “**centroid**” or “**center**”.
* **Applying k-means yields k separate clusters of the original n data points**.
* Data points inside a particular cluster are considered to be “more similar” to each other than data points that belong to other clusters.
* In this particular **program**, we will be **clustering the color histograms extracted from the images in our dataset**, but in practice you could cluster any type of feature vector.
* **Histograms that belong to a given cluster will be more similar in color distribution** than histograms belonging to a separate cluster.
* One **caveat of k-means** is that we **need to specify the number of clusters** we want to generate ahead of time.
* There are **methods for estimating a suitable value of k from the data**, for example the elbow method or silhouette analysis.
* For the time being, we will be manually supplying a value of k=2 to separate the two classes of images.
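
The assign-then-update loop described above can be sketched in a few lines of NumPy. This is a minimal illustration of the standard Lloyd iteration, not the scikit-learn implementation used later. The function name `kmeans`, the random initialization from data points, and the synthetic two-blob data are all assumptions made for the example.

```python
import numpy as np

def kmeans(points, k, n_iters=100, seed=0):
    """Minimal k-means sketch: returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # start from k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # assignment step: index of the nearest centroid for each point
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each centroid moves to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# two well-separated synthetic blobs of five 2-d points each
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(5, 2))
blob_b = rng.normal(loc=5.0, scale=0.1, size=(5, 2))
points = np.vstack([blob_a, blob_b])

labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

The cluster IDs themselves carry no meaning: which blob ends up as cluster 0 depends on the initialization.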

**Program**: Cluster the vacation photo dataset into two distinct groups. First, we need to extract a color histogram from each of the 10 images in the dataset. We will define a `describe` function that extracts color histograms from images in the L*a*b* color space.


In [12]:
# import the necessary packages
from sklearn.cluster import KMeans
from imutils import paths
from google.colab.patches import cv2_imshow
import numpy as np
import cv2
from matplotlib import pyplot as plt

In [13]:
def describe(image, mask=None):
    # Convert the image to the L*a*b* color space, compute a 3D histogram, and normalize it.
    # Euclidean distance between two colors in L*a*b* is perceptually meaningful, and since
    # the k-means clustering algorithm assumes a Euclidean space, we will get better clusters
    # by using the L*a*b* color space than RGB or HSV.
    # If we do not normalize, then images with the exact same contents but different sizes would
    # have dramatically different histograms. By normalizing our histogram we ensure that
    # the width and height of our input image have no effect on the output histogram.
    lab = cv2.cvtColor(image, cv2.COLOR_BGR2LAB)
    hist = cv2.calcHist([lab], [0, 1, 2], mask, [8, 8, 8], [0, 256, 0, 256, 0, 256])
    hist = cv2.normalize(hist, hist).flatten()
    # return the histogram
    return hist

In [14]:
# initialize a list, data, to store the color histograms extracted from our images
data = []

# grab the image paths from the dataset directory (upload the dataset to the specified location)
imagePaths = list(paths.list_images("/content/sample_data/dataset"))
imagePaths = np.array(sorted(imagePaths))
print(imagePaths)

['/content/sample_data/dataset/antelopecanyon_01.png'
 '/content/sample_data/dataset/antelopecanyon_02.png'
 '/content/sample_data/dataset/antelopecanyon_03.png'
 '/content/sample_data/dataset/antelopecanyon_04.png'
 '/content/sample_data/dataset/antelopecanyon_05.png'
 '/content/sample_data/dataset/grandcanyon_01.png'
 '/content/sample_data/dataset/grandcanyon_02.png'
 '/content/sample_data/dataset/grandcanyon_03.png'
 '/content/sample_data/dataset/grandcanyon_04.png'
 '/content/sample_data/dataset/grandcanyon_05.png']


In [15]:
# loop over the input dataset of images
for imagePath in imagePaths:
	# load the image, describe the image, then update the list of data
	image = cv2.imread(imagePath)
	hist = describe(image)
	data.append(hist)

#print (data)

In [23]:
# Now that we have all of our color features extracted, we can cluster the feature vectors using
# the k-means algorithm. We initialize k-means with the number of clusters hard-coded to 2.
# The call to clt.fit_predict not only performs the actual clustering, but also predicts which
# histogram (and thus which associated image) belongs to which of the 2 clusters.
# n_init is the number of times the k-means algorithm is run with different centroid seeds.
# The final result is the best output of the n_init consecutive runs in terms of inertia.
clt = KMeans(n_clusters=2, n_init=10)
labels = clt.fit_predict(data)
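
To make "best in terms of inertia" concrete: inertia is the sum of squared distances from each point to the mean of its cluster, and scikit-learn exposes it as `inertia_` on a fitted `KMeans` object. The sketch below computes it by hand for two candidate clusterings of four toy points. The function name `inertia` and the data are assumptions made for illustration.

```python
import numpy as np

def inertia(points, labels):
    """Sum of squared distances from each point to its cluster mean."""
    total = 0.0
    for label in np.unique(labels):
        members = points[labels == label]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return total

# four points forming a wide, flat rectangle
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good = np.array([0, 0, 1, 1])  # left pair vs. right pair
bad = np.array([0, 1, 0, 1])   # bottom pair vs. top pair

print(inertia(points, good))  # 1.0
print(inertia(points, bad))   # 100.0
```

Both partitions are stable under the k-means update, so a single run with an unlucky initialization can settle on the second one. Running several initializations and keeping the lowest inertia guards against that.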

In [24]:
# Now that we have our color histograms clustered, we need to grab the unique IDs for each
# cluster. This is handled by making a call to np.unique, which returns the sorted unique values
# in the labels array. For each unique label, we grab the image paths that belong to the cluster,
# then load and display each of those images.
# loop over the unique labels
for label in np.unique(labels):
    # grab all image paths that are assigned to the current label
    labelPaths = imagePaths[np.where(labels == label)]
    # loop over the image paths that belong to the current label
    for (i, path) in enumerate(labelPaths):
        # load the image, print its cluster and position, then display it inline
        image = cv2.imread(path)
        print("Cluster {}, image {}".format(label + 1, i + 1))
        cv2_imshow(image)
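
The grouping step relies on `np.unique` plus boolean indexing into a NumPy array of paths. A minimal sketch with made-up file names and labels (both hypothetical) shows the same pattern without loading any images:

```python
import numpy as np

file_names = np.array(["a_01.png", "a_02.png", "b_01.png", "b_02.png"])  # hypothetical
labels = np.array([1, 1, 0, 0])  # hypothetical cluster assignments

# map each cluster ID to the file names assigned to it
groups = {
    int(label): file_names[labels == label].tolist()
    for label in np.unique(labels)
}
print(groups)  # {0: ['b_01.png', 'b_02.png'], 1: ['a_01.png', 'a_02.png']}
```

This only works because the paths are stored in a NumPy array. A plain Python list cannot be indexed with a boolean mask.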
