<h2>Image Similarity App.</h2>
<h3>Part #1. Feature Extraction</h3>

In [4]:
# Import modules and packages
import numpy as np
from numpy.linalg import norm
import pickle
from tqdm import tqdm, tqdm_notebook
import os
import time
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input

<p>Load the <i>ResNet-50</i> model without the top classification layer, so that we get only the <i>bottleneck features</i>. Then define a function that takes an image path, loads the image, resizes it to the 224 × 224 input size expected by <i>ResNet-50</i>, extracts the features, and L2-normalizes them.</p>

In [5]:
model = ResNet50(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

Instructions for updating:
If using Keras pass *_constraint arguments to layers.
Downloading data from https://github.com/keras-team/keras-applications/releases/download/resnet/resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5


In [16]:
def extract_features(img_path, model):
    input_shape = (224, 224, 3)
    # Load the image and resize it to the input size the model expects
    img = image.load_img(img_path, target_size=(input_shape[0], input_shape[1]))
    img_array = image.img_to_array(img)
    # Add a batch dimension: (224, 224, 3) -> (1, 224, 224, 3)
    expanded_img_array = np.expand_dims(img_array, axis=0)
    preprocessed_img = preprocess_input(expanded_img_array)
    features = model.predict(preprocessed_img)
    # Flatten to a 1-D vector and scale it to unit L2 norm
    flattened_features = features.flatten()
    normalized_features = flattened_features / norm(flattened_features)
    return normalized_features

In [18]:
# Let's see the feature length that the model generates
features = extract_features('./caltech101/ant/image_0020.jpg', model)
print('Length of features for a given sample image is {}.'.format(len(features)))

Length of features for a given sample image is 100352.


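<p>With a 224 × 224 input, the <i>ResNet-50</i> model without its top layer outputs a 7 × 7 × 2048 tensor, which flattens to <i>100352</i> features per image. Because the final activations are non-negative (ReLU) and the vector is L2-normalized, each feature is a floating-point value between <code>0</code> and <code>1</code>.</p>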
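<p>The flatten-and-normalize step can be checked in isolation, without loading the model. The sketch below uses a random non-negative array as a stand-in for the model output; the name <code>fake_features</code> and the random values are assumptions for illustration only, not real <i>ResNet-50</i> activations.</p>

```python
import numpy as np
from numpy.linalg import norm

# Stand-in for model.predict(...): a batch of one 7 x 7 x 2048 tensor
# filled with random non-negative values (NOT real ResNet-50 output).
rng = np.random.default_rng(0)
fake_features = rng.random((1, 7, 7, 2048), dtype=np.float32)

# Same two steps as in extract_features: flatten, then divide by the L2 norm
flattened = fake_features.flatten()
normalized = flattened / norm(flattened)

print(flattened.shape)                                # (100352,)
print(float(norm(normalized)))                        # approximately 1.0
print(normalized.min() >= 0, normalized.max() <= 1)   # every value lies in [0, 1]
```

<p>Since all vectors have unit length after this step, the dot product of two feature vectors equals their cosine similarity.</p>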

<p>We need to extract features for the <b>entire dataset</b>. First, we collect all the filenames with a helper function that recursively looks for image files (identified by their extensions) under a directory.</p>

In [20]:
# Define possible extensions of image files
EXTENSIONS = ['.jpg', '.JPG', '.jpeg', '.JPEG', '.png', '.PNG']

def get_file_list(root_dir):
    file_list = []
    for root, directories, filenames in os.walk(root_dir):
        for filename in filenames:
            # Match on the actual file extension, not on any substring of the name
            if os.path.splitext(filename)[1] in EXTENSIONS:
                file_list.append(os.path.join(root, filename))
    return file_list

In [23]:
# Path to the dataset
ROOT_DIR = './caltech101'
filenames = sorted(get_file_list(ROOT_DIR))
print('We found {} image files for a given directory (including subcategories).'.format(len(filenames)))

We found 8677 image files for a given directory (including subcategories).
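<p>As an alternative, the same recursive search can be sketched with <code>pathlib</code>. The helper name <code>list_image_files</code> and the set <code>IMAGE_SUFFIXES</code> are hypothetical and not part of the notebook; comparing the lowercased suffix means upper- and lowercase extensions do not have to be listed separately. The demonstration runs on a throwaway directory, not on the Caltech-101 data.</p>

```python
import tempfile
from pathlib import Path

# Hypothetical names, introduced for this sketch only
IMAGE_SUFFIXES = {'.jpg', '.jpeg', '.png'}

def list_image_files(root_dir):
    """Recursively collect files whose lowercased suffix is an image extension."""
    return sorted(
        str(path) for path in Path(root_dir).rglob('*')
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    )

# Small demonstration on a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / 'ant').mkdir()
    (Path(tmp) / 'ant' / 'image_0001.JPG').touch()
    (Path(tmp) / 'ant' / 'notes.jpg.txt').touch()   # not an image: suffix is '.txt'
    (Path(tmp) / 'README.md').touch()
    found = list_image_files(tmp)

print([Path(p).name for p in found])  # ['image_0001.JPG']
```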


<p>Then we define a list that will store all of the features, go through every filename in the dataset, extract its features, and append them to that list.</p>

In [24]:
feature_list = []
for filename in tqdm_notebook(filenames):
    feature_list.append(extract_features(filename, model))

<p>Finally, write these features to a pickle file so that we can use them in the future without having to recalculate them.</p>

In [None]:
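# Use context managers so that both files are flushed and closed properly
with open('./data/features-caltech101-resnet.pickle', 'wb') as f:
    pickle.dump(feature_list, f)
with open('./data/filenames-caltech101.pickle', 'wb') as f:
    pickle.dump(filenames, f)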
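<p>Reading the files back later follows the same pattern in reverse. The sketch below is a minimal round trip on toy data in a temporary directory; the variable names and file names are placeholders chosen for illustration, and in practice you would open the two pickle files written above instead.</p>

```python
import os
import pickle
import tempfile

# Toy stand-ins for the real feature_list and filenames
toy_features = [[0.6, 0.8], [1.0, 0.0]]
toy_filenames = ['a.jpg', 'b.jpg']

with tempfile.TemporaryDirectory() as tmp:
    features_path = os.path.join(tmp, 'features.pickle')
    filenames_path = os.path.join(tmp, 'filenames.pickle')

    # Write both objects to disk
    with open(features_path, 'wb') as f:
        pickle.dump(toy_features, f)
    with open(filenames_path, 'wb') as f:
        pickle.dump(toy_filenames, f)

    # Read them back
    with open(features_path, 'rb') as f:
        loaded_features = pickle.load(f)
    with open(filenames_path, 'rb') as f:
        loaded_filenames = pickle.load(f)

# The two lists stay aligned: entry i of one describes entry i of the other
print(len(loaded_features) == len(loaded_filenames))  # True
```

<p>Only unpickle files that you created yourself or otherwise trust, because loading a pickle can execute arbitrary code.</p>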

<p>The generated files take up a considerable amount of disk space:
<ul>
<li><b>File #1.</b> <u>features-caltech101-resnet.pickle</u> - 3.48 GB</li>
<li><b>File #2.</b> <u>filenames-caltech101.pickle</u> - 398 KB</li>
</ul>
</p>
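<p>The size of the features file can be sanity-checked with simple arithmetic. The sketch below assumes that each feature is stored as a 4-byte <code>float32</code> value and ignores pickle overhead; both are assumptions, not something the notebook states.</p>

```python
num_images = 8677             # number of image files found above
features_per_image = 100352   # length of one feature vector
bytes_per_value = 4           # assumption: float32 values

total_bytes = num_images * features_per_image * bytes_per_value
print('{:.2f} GB'.format(total_bytes / 1e9))  # 3.48 GB
```

<p>Under these assumptions the estimate matches the reported 3.48 GB, so nearly all of the space goes to the raw feature values.</p>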