###### Clustering

Clustering is a branch of unsupervised machine learning in which we partition a
dataset into groups of similar items. ‘Unsupervised’ means that the dataset is
unlabeled and the model is never trained on known answers. Common clustering
techniques include K-means, Mean Shift, DBSCAN, and hierarchical clustering.
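
To make the idea concrete, here is a minimal sketch of the K-means loop in plain Python. It is not scikit-learn's implementation: the function name `kmeans`, the toy points, and the choice to start from the first `k` points are all assumptions made for illustration (real libraries pick starting centroids more carefully, for example with k-means++).

```python
import math


def kmeans(points, k, n_iter=20):
    """Cluster `points` (a list of equal-length tuples) into `k` groups."""
    # Simplification: use the first k points as the starting centroids.
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Two well-separated toy groups in 2-D (hypothetical data).
toy_points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
toy_labels, toy_centroids = kmeans(toy_points, k=2)
print(toy_labels)
```

The cluster ids that come back are arbitrary: only the grouping matters, not which group is called 0 or 1.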

###### Image clustering


Image clustering is an essential data analysis tool in machine learning and
computer vision. Many applications, such as content-based image annotation and
image retrieval, can be viewed as different instances of image clustering.
Technically, image clustering is the process of grouping images into clusters
such that the images within the same cluster are similar to each other, while
those in different clusters are dissimilar.

In [30]:
import os
import keras
# Code: import the KMeans class from sklearn (1 point)
from sklearn.cluster import KMeans

###### VGG 

VGG is a convolutional neural network model for image recognition proposed by the Visual Geometry Group at the University of Oxford. VGG16 refers to a VGG model with 16 weight layers, and VGG19 to one with 19 weight layers. In the VGG16 architecture, the input layer takes an image of size (224 x 224 x 3), and the output layer is a softmax prediction over 1000 classes. Everything from the input layer to the last max pooling layer (labeled 7 x 7 x 512) is regarded as the feature extraction part of the model.
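
The 7 x 7 x 512 figure follows from simple arithmetic, sketched below. The sketch assumes the standard VGG16 layout: the 3 x 3 convolutions preserve spatial size, and each of the five 2 x 2 max-pooling layers halves height and width. The function name `vgg16_feature_shape` is made up for this illustration.

```python
def vgg16_feature_shape(height=224, width=224, n_pool=5, channels=512):
    """Shape of the last pooling output after `n_pool` halvings of the input size."""
    for _ in range(n_pool):
        height //= 2  # a 2x2 max pool with stride 2 halves each spatial dimension
        width //= 2
    return (height, width, channels)


shape = vgg16_feature_shape()
flat_length = shape[0] * shape[1] * shape[2]
print(shape, flat_length)  # (7, 7, 512) 25088
```

The flattened length, 25088, is the size of the feature vector each image contributes when the pooling output is flattened for clustering.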

In [31]:
from keras.preprocessing import image
# Code: import VGG feature extraction from keras application as VGG16 (1 point)
from keras.applications.vgg16 import VGG16
from keras.applications.vgg16 import preprocess_input
import numpy as np

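# include_top=False drops the fully connected classifier and keeps the convolutional feature extractor
model = VGG16(weights='imagenet', include_top=False)

# Code: Specify path of the random image from the training dataset. (1 point)
img_path = "dataset/train_dataset/_83930440_lion-think-976.jpg"
img = image.load_img(img_path, target_size=(224, 224))
img_data = image.img_to_array(img)
img_data = np.expand_dims(img_data, axis=0)  # add a batch dimension: (1, 224, 224, 3)
img_data = preprocess_input(img_data)  # apply the same preprocessing that extract_feature uses

vgg16_feature = model.predict(img_data)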

# Code: print the shape of the vgg16_feature  (1 point)
# the shape of feature extracted by VGG16
vgg16_feature.shape

(1, 7, 7, 512)

In [32]:
# The given function will extract the features from the images.
def extract_feature(directory):
    """Return an array with one flattened VGG16 feature vector per file in `directory`."""
    vgg16_feature_list = []

    # Rows follow os.listdir order; every entry must be a readable image.
    for filename in os.listdir(directory):
        img = image.load_img(os.path.join(directory, filename), target_size=(224, 224))
        img_data = image.img_to_array(img)
        img_data = np.expand_dims(img_data, axis=0)  # add a batch dimension
        img_data = preprocess_input(img_data)

        vgg16_feature = model.predict(img_data)  # shape (1, 7, 7, 512)
        vgg16_feature_list.append(np.asarray(vgg16_feature).flatten())  # length 25088

    return np.array(vgg16_feature_list)
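
Because `extract_feature` reads whatever `os.listdir` returns, a stray non-image file (such as a `.DS_Store` entry) would make the image loader fail, and the row order depends on the file system. One possible safeguard, sketched here under assumptions, is to filter and sort the names first. The helper `list_image_files` and the extension list are hypothetical and not part of the original notebook; if such a helper were adopted, it would need to be used everywhere the file names are listed so that rows and names stay aligned.

```python
import os
import tempfile

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")  # assumed set of accepted extensions


def list_image_files(directory):
    """Return image file names in `directory`, sorted, skipping hidden files."""
    names = []
    for name in sorted(os.listdir(directory)):
        if name.startswith("."):
            continue  # hidden files such as .DS_Store
        if name.lower().endswith(IMAGE_EXTENSIONS):
            names.append(name)
    return names


# Small demonstration on a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.jpg", "a.JPG", ".DS_Store", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    demo_files = list_image_files(tmp)
print(demo_files)
```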

The given dataset has three classes: Lion, Fish, and Zebra. We provide no
supervision to the model, i.e. we do not specify which image belongs to which
class or cluster. Instead, we use unsupervised image clustering to form the clusters.

In [33]:
# pass the path of the folder where you have the training dataset
train_feature_vector = extract_feature("dataset/train_dataset")

# Code: create the kmeans object and initialize it with the number_of_clusters = 3   (2 point)
kmeans_model = KMeans(n_clusters=3, random_state=0)
kmeans_model.fit(train_feature_vector)


KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=0, tol=0.0001, verbose=0)

In [34]:
# create a test vector using extract_feature function. It will return a feature vector of size 
# number of images * size of the feature vector

test_vector  = extract_feature("dataset/test_dataset")  # (1 point)

In [35]:
# Code: print the shape of the test vector   # (1 point)
test_vector.shape

(33, 25088)

In [38]:

# Code: use the kmeans model to predict the labels for the test vector (1 point)
labels = kmeans_model.predict(test_vector)

In [39]:
labels

array([2, 2, 2, 2, 2, 2, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 2, 1, 2, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
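
A quick tally of the predicted labels shows how the 33 test images are spread across the three clusters. This is a small sketch using the standard library; the list `labels_from_run` simply copies the array printed above so that the snippet runs on its own.

```python
from collections import Counter

# Labels copied from the notebook output above.
labels_from_run = [2, 2, 2, 2, 2, 2, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 2, 1, 2, 1, 1,
                   1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

cluster_sizes = Counter(labels_from_run)
for cluster_id in sorted(cluster_sizes):
    print(cluster_id, cluster_sizes[cluster_id])
# 0 4
# 1 21
# 2 8
```

The ids 0, 1, and 2 carry no meaning by themselves; which animal each cluster mostly contains has to be checked by looking at the images assigned to it.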

In [40]:
def createDirectory(directoryName):
    # Create the target directory if it doesn't exist
    if not os.path.exists(directoryName):
        os.mkdir(directoryName)
        print("Directory " , directoryName ,  " Created ")
    else:
        print("Directory " , directoryName ,  " already exists")

In [41]:
createDirectory("dataset/output/Zebra")
createDirectory("dataset/output/Lion")
createDirectory("dataset/output/Fish")

Directory  dataset/output/Zebra  already exists
Directory  dataset/output/Lion  already exists
Directory  dataset/output/Fish  already exists
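
Note that `os.mkdir` creates a single directory level and raises an error if the parent (here `dataset/output`) does not exist yet. A possible alternative, sketched below, uses `os.makedirs` with `exist_ok=True`, which creates missing parents and tolerates an existing directory. The helper name `ensure_directory` is hypothetical, and the demonstration runs in a temporary directory rather than in `dataset/output`.

```python
import os
import tempfile


def ensure_directory(path):
    """Create `path` and any missing parents; return True if it was newly created."""
    already_there = os.path.isdir(path)
    os.makedirs(path, exist_ok=True)
    return not already_there


# Demonstration in a throwaway location.
with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "output", "Zebra")
    first_call = ensure_directory(target)   # parent "output" is created too
    second_call = ensure_directory(target)  # nothing to do the second time
print(first_call, second_call)  # True False
```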


In [29]:
# Code: Using the labels and the images, save the test images in the different folders in respective
# clusters.   (2 point)
from shutil import copyfile

length_array = len(labels)
test_dataset_dir = "dataset/test_dataset/"
output_zebra_dir = "dataset/output/Zebra/"
output_fish_dir = "dataset/output/Fish/"
output_lion_dir = "dataset/output/Lion/"

# Collect the test file names, skipping the macOS .DS_Store metadata file.
# The order must match the order in which extract_feature read the files.
file_arr = []
for filenames in os.listdir(test_dataset_dir):
    if not filenames.startswith('.DS_Store'):
        file_arr.append(filenames)

print("The length of the Array is ", length_array)
print("The length of the files in the directory is",len(file_arr))

# K-means cluster ids are arbitrary; the id-to-folder mapping below is specific to this run.
for i in range(len(labels)):
    print("Processing ", file_arr[i] , "label -->", labels[i], " Iterator i ", i)

    if labels[i] == 2:
        copyfile(test_dataset_dir + file_arr[i], output_lion_dir + file_arr[i])
    elif labels[i] == 1:
        copyfile(test_dataset_dir + file_arr[i], output_fish_dir + file_arr[i])
    else:
        copyfile(test_dataset_dir + file_arr[i], output_zebra_dir + file_arr[i])


The length of the Array is  33
The length of the files in the directory is 33
Processing  african-lionadapt19001JPG.jpg label --> 2  Iterator i  0
Processing  africanlion-001.jpg label --> 2  Iterator i  1
Processing  africanlion-005.jpg label --> 2  Iterator i  2
Processing  animals_hero_lions_0.jpg label --> 2  Iterator i  3
Processing  asiatic-lion_thumbJPG.jpg label --> 2  Iterator i  4
Processing  black-maned-lion-shem-compion-590x390.jpg label --> 2  Iterator i  5
Processing  dangers-of-uneaten-fish-food.jpg label --> 1  Iterator i  6
Processing  DCTM_Penguin_UK_DK_AL458223_sjvgvt.jpg label --> 0  Iterator i  7
Processing  DCTM_Penguin_UK_DK_AL644648_p7nd0z.jpg label --> 1  Iterator i  8
Processing  discus-fish-1943755__340.jpg label --> 1  Iterator i  9
Processing  DlCOrbzYTw4.jpg label --> 1  Iterator i  10
Processing  e06dc834cacfac12b5f0c00f3af93845.jpg label --> 0  Iterator i  11
Processing  Equus_quagga.jpg label --> 0  Iterator i  12
Processing  Equus_quagga_burchellii_-_E