# Exploring Superclass Content with Clustering
[Google Landmark Recognition 2021](https://www.kaggle.com/c/landmark-recognition-2021)

This notebook explores the class imbalance among target landmark IDs in this challenge. It then applies k-means clustering to the images within a target superclass (i.e., a target class with a very large number of images in the training set) to see how superclasses are organized.

In [None]:
import math, os, re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.image as img
from keras.preprocessing.image import load_img 
from keras.preprocessing.image import img_to_array 
from keras.applications.vgg16 import preprocess_input 
from keras.applications.vgg16 import VGG16 
from keras.models import Model
from sklearn.cluster import KMeans
from sklearn import metrics
from sklearn.decomposition import PCA

print(os.listdir('../input/landmark-recognition-2021'))

In [None]:
path = '/kaggle/input/landmark-recognition-2021'
train_images = f'{path}/train'
train_df = pd.read_csv(f'{path}/train.csv')
# images live in nested folders named after the first three characters of the image id
train_df['path'] = train_df['id'].apply(lambda f: os.path.join(train_images, f[0], f[1], f[2], f + '.jpg'))
test_images = f'{path}/test'
test_df = pd.read_csv(f'{path}/sample_submission.csv')
test_df['path'] = test_df['id'].apply(lambda f: os.path.join(test_images, f[0], f[1], f[2], f + '.jpg'))

num_classes = train_df['landmark_id'].nunique()
print('number of target classes:', num_classes)
print('number of images in training set:', len(train_df))

With so many unique landmark IDs relative to the size of the training set, class imbalance is likely to be an issue. The distribution of per-class image counts confirms the imbalance.

In [None]:
counts = train_df['landmark_id'].value_counts()
counts.describe()
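
Beyond the summary statistics, two numbers make the imbalance concrete: the share of classes with only a handful of images and the share of all images held by the largest classes. The sketch below uses synthetic, heavy-tailed labels as a stand-in for `train_df['landmark_id']`; the thresholds (fewer than 5 images, top 10 classes) and the `toy_*` names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for train_df['landmark_id'] (assumption: labels are long-tailed)
rng = np.random.default_rng(0)
toy_ids = pd.Series(rng.zipf(a=2.0, size=10_000))

# value_counts() sorts classes from most to least frequent
toy_counts = toy_ids.value_counts()

# fraction of classes that have fewer than 5 images
rare_share = (toy_counts < 5).mean()
# fraction of all images that belong to the 10 largest classes
top10_share = toy_counts.head(10).sum() / toy_counts.sum()

print(f'classes with <5 images: {rare_share:.1%}')
print(f'images in the 10 largest classes: {top10_share:.1%}')
```

The same two expressions can be applied to `counts` from the cell above.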

In [None]:
# take the landmark ID with the 6th-highest image count (position 5, zero-based)
superclass = train_df[train_df['landmark_id'] == counts.index[5]]
images = superclass.path.to_list()
len(images)

Looking at a sample of images from a superclass shows a lot of diversity. The same landmark ID covers different buildings and landscapes.

In [None]:
sample = superclass.sample(n=12, replace=False)

plt.subplots(3, 4, figsize=(16, 16))
for i in range(len(sample)):
    plt.subplot(3, 4, i + 1)
    plt.axis('off')
    row = sample.iloc[i]
    # look up columns by name rather than by position
    image = img.imread(row['path'])
    plt.imshow(image)
    plt.title(f"landmark id: {row['landmark_id']}", fontsize=12)

Landmarks with many images in the training set contain many different representations. Clustering can group these images by the similarity of their feature vectors.
Steps:
1. Preprocess images into arrays with a batch dimension
2. Extract a feature vector for each image from the VGG16 model's predictions
3. Apply PCA to keep the most important components
4. Apply k-means clustering to group images (number of clusters chosen with the SSE elbow method)
5. View the images in each cluster to compare
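
As a rough illustration of how steps 3 and 4 fit together, the sketch below chains PCA and k-means on synthetic data. The blobs stand in for CNN feature vectors, and the sizes (300 samples, 64 dimensions, 10 components, 3 clusters) are arbitrary assumptions, not values used in this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

# synthetic stand-in for image feature vectors: 300 "images", 64 dims, 3 true groups
toy_feat, _ = make_blobs(n_samples=300, n_features=64, centers=3, random_state=0)

# steps 3 and 4 chained: reduce dimensionality, then cluster the reduced vectors
toy_pipeline = make_pipeline(
    PCA(n_components=10, random_state=0),
    KMeans(n_clusters=3, n_init=10, random_state=0),
)
toy_labels = toy_pipeline.fit_predict(toy_feat)

# number of "images" assigned to each cluster
print(np.bincount(toy_labels))
```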

### Image preprocessing and feature extraction

In [None]:
model = VGG16()
# drop the final classification layer so the model outputs the 4,096-d activations of the last hidden layer
model = Model(inputs=model.inputs, outputs=model.layers[-2].output)

def extract_features(file, model):
    # VGG16 expects 224x224 RGB input
    image = load_img(file, target_size=(224, 224))
    image = img_to_array(image)
    # add a batch dimension: (batch size, height, width, channels)
    batch = np.expand_dims(image, axis=0)
    # apply the VGG16 preprocessing (channel reordering and mean subtraction)
    batch = preprocess_input(batch)
    # get the feature vector, shape (1, 4096)
    features = model.predict(batch, verbose=0)
    return features

In [None]:
extract_features(images[0], model)

In [None]:
%%time
features = {}

# loop through each image in the dataset
for image in images:
    # extract features and update dictionary (filepath=key)
    feature = extract_features(image, model)
    features[image] = feature
       
# get a list of the filenames
filenames = np.array(list(features.keys()))

# get a list of just the features
feat = np.array(list(features.values()))
feat.shape
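
The loop above calls the model once per image, which can be slow for a superclass with thousands of images. A possible alternative is to stack several preprocessed images into one batch per call. The sketch below is model-agnostic: `extract_features_batched`, `batched`, and `predict_fn` are hypothetical names, and `fake_predict` is a toy stand-in so the example runs without loading VGG16.

```python
import numpy as np

def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def extract_features_batched(arrays, predict_fn, batch_size=32):
    """Stack per-image arrays into batches and call `predict_fn` once per batch.

    arrays: list of (H, W, C) arrays, already resized and preprocessed
    predict_fn: callable mapping (N, H, W, C) -> (N, D); a hypothetical
                stand-in for something like `model.predict`
    """
    outputs = [predict_fn(np.stack(chunk)) for chunk in batched(arrays, batch_size)]
    return np.concatenate(outputs, axis=0)

# toy "model": average each channel over all pixels, (N, H, W, C) -> (N, C)
fake_predict = lambda batch: batch.mean(axis=(1, 2))
toy_arrays = [np.full((4, 4, 3), i, dtype=np.float32) for i in range(70)]

toy_out = extract_features_batched(toy_arrays, fake_predict, batch_size=32)
print(toy_out.shape)  # (70, 3)
```

With this layout the result is already two-dimensional (one row per image), so no reshape is needed afterwards.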

In [None]:
# flatten (n_images, 1, 4096) to (n_images, 4096): one 4,096-d vector per image
feat = feat.reshape(-1,4096)
feat.shape

### PCA to identify important components
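`n_components` can be at most the smaller of the number of images in the superclass and the number of features (4,096). With fewer images than features, the data can vary along at most as many directions as there are images, so no more components than that can be estimated.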

In [None]:
# PCA to project the 4,096-d features onto the leading principal components
if len(images) >= 4096:
    pca = PCA(n_components=len(images)//100, random_state=888)
else:
    pca = PCA(n_components=len(images)//60, random_state=888)
    
pca.fit(feat)
x = pca.transform(feat)
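
One way to check whether a chosen `n_components` is reasonable is to look at the cumulative explained variance. The sketch below does this on synthetic data whose variance is concentrated in a few latent directions; the data, the 95% threshold, and the `toy_*` names are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(888)
# synthetic stand-in for `feat`: 200 samples, 50 dims, variance mostly in 5 latent directions
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 50))
toy_feat = latent @ mixing + 0.1 * rng.normal(size=(200, 50))

# fit with all components kept, then accumulate the variance ratios
toy_pca = PCA(random_state=888).fit(toy_feat)
cumulative = np.cumsum(toy_pca.explained_variance_ratio_)

# smallest number of components that explains at least 95% of the variance
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
print(n_keep, round(float(cumulative[n_keep - 1]), 3))
```

Applied to the real features, the same check would use `pca.explained_variance_ratio_` from the cell above.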

In [None]:
# k-means clustering
# find a good number of clusters with the 'elbow method' on SSE (inertia)
np.random.seed(888)
ks = range(2, 52)
s = np.zeros(len(ks))

for i, k in enumerate(ks):
    est = KMeans(n_clusters=k)
    est.fit(x)
    s[i] = est.inertia_

# plot against the actual cluster counts (2 to 51), not the loop index
plt.plot(ks, s)
plt.xlabel('number of clusters')
plt.ylabel('SSE (inertia)')
plt.title('Elbow plot');

There is no clear elbow in the cluster count plot, so I will choose 15 clusters to balance model complexity against the SSE cost function.
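
When the elbow is ambiguous, the silhouette score is another common heuristic for comparing cluster counts. The sketch below is an illustration on synthetic blobs, not a result from this dataset; the range of `k` values and the `toy_*` names are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# synthetic stand-in for the PCA-reduced features `x`
toy_x, _ = make_blobs(n_samples=300, n_features=10, centers=4, random_state=0)

scores = {}
for k in range(2, 9):
    toy_labels = KMeans(n_clusters=k, n_init=10, random_state=888).fit_predict(toy_x)
    # mean silhouette over all samples; values lie in [-1, 1], higher is better
    scores[k] = silhouette_score(toy_x, toy_labels)

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

On real image features the silhouette curve can be as flat as the elbow plot, so it is best treated as one more piece of evidence rather than a definitive answer.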


In [None]:
kmeans = KMeans(n_clusters=15, random_state=888).fit(x)

In [None]:
groups = {}
# map each cluster label to the list of image paths assigned to it
for file, cluster in zip(images, kmeans.labels_):
    groups.setdefault(cluster, []).append(file)

In [None]:
def view_images_in_cluster(cluster):
    plt.figure(figsize=(25, 25))
    files = groups[cluster]
    # show at most 50 images from the cluster
    if len(files) > 50:
        print(f'showing 50 of {len(files)} images in cluster')
        files = files[:50]
    for index, file in enumerate(files):
        plt.subplot(10, 10, index + 1)
        image = np.array(load_img(file))
        plt.imshow(image)
        plt.axis('off')

In [None]:
view_images_in_cluster(2)

In [None]:
view_images_in_cluster(4)

In [None]:
view_images_in_cluster(12)
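
Besides browsing whole clusters, it can help to list each cluster's size and its most central image, i.e. the member closest to the cluster centre. The sketch below shows one way to do this on synthetic data; `toy_names` holds hypothetical filenames and all `toy_*` names are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

toy_x, _ = make_blobs(n_samples=200, n_features=8, centers=3, random_state=0)
toy_names = np.array([f'img_{i}.jpg' for i in range(len(toy_x))])  # hypothetical filenames

toy_kmeans = KMeans(n_clusters=3, n_init=10, random_state=888).fit(toy_x)

# transform() gives each sample's distance to every cluster centre: (n_samples, n_clusters)
dist = toy_kmeans.transform(toy_x)
sizes = np.bincount(toy_kmeans.labels_, minlength=3)

central_by_cluster = {}
for c in range(3):
    # indices of the samples assigned to cluster c
    members = np.flatnonzero(toy_kmeans.labels_ == c)
    # among those members, the one closest to the centre of cluster c
    central_by_cluster[c] = int(members[dist[members, c].argmin()])
    print(f'cluster {c}: {sizes[c]} images, most central: {toy_names[central_by_cluster[c]]}')
```

In the notebook, the same idea would use `kmeans.transform(x)` together with the image paths.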