**Featurize Data**

*Summary of this notebook:*  
Obtain a low-dimensional feature vector for each image in an input dataset using an ImageNet-pretrained model (MobileNet, here). Load the dataset through a generator object, preprocess it as the model expects, and run predict on every image to obtain its feature vector. Save the feature vectors and the filenames in separate pickle files.

*Definition of Done:*

In [None]:
import os
import tensorflow
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import math
import pickle
import rasterio
import numpy as np

In [None]:
from google.colab import drive 
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [None]:
os.chdir("/content/gdrive/Shared drives/2020_FDLUSA_Earth Science_Knowledge Discovery Framework/Code")

In [None]:
tensorflow.test.gpu_device_name()

'/device:GPU:0'

In [None]:
## FIX

dataset = "training_set_tmp"
dataPath = ("Datasets/" + dataset + "/np_arrays")
modelName = "KDF_modis"
files = os.listdir(dataPath)

['subset_5000_200_500.npy', 'subset_0_0_500.npy', 'subset_0_1000_500.npy']


Data Generator
1. Get input            : input_path -> image
2. Get output           : input_path -> label
3. Pre-process input    : image -> pre-processing step -> image
4. Get generator output : ( batch_input, batch_labels )


In [None]:
def get_input(path):
    
    # Load one image; assumes it is stored as a NumPy array (.npy), as in dataPath
    img = np.load(path)
    
    return img

In [None]:
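def get_output(path, label_file):
    
    # label_file is assumed to be a pandas DataFrame indexed by image id
    # (the file name without its extension)
    img_id = os.path.basename(path).split('.')[0]
    labels = label_file.loc[img_id].values
    
    return labels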


In [None]:
def preprocess_input(image):
    
    # Placeholder: no preprocessing is applied yet. Candidate steps:
    #   - Rescale image
    #   - Rotate image
    #   - Resize image
    #   - Flip image
    #   - PCA etc.
    
    return image
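
As one illustration of the placeholder steps above, the sketch below rescales 8-bit pixel values to [-1, 1] (the range MobileNet's standard preprocessing produces) and optionally flips the image horizontally. The function name `rescale_and_flip` and its `flip` parameter are hypothetical, and the notebook's own data may well need a different scaling.

```python
import numpy as np

def rescale_and_flip(image, flip=False):
    """Hypothetical preprocessing step: scale [0, 255] pixels to [-1, 1], optionally flip."""
    image = np.asarray(image, dtype=np.float32)
    image = image / 127.5 - 1.0      # 0 -> -1.0, 255 -> 1.0
    if flip:
        image = image[:, ::-1, :]    # reverse the width axis of an (H, W, C) array
    return image

# Small synthetic (H, W, C) image, used only to show the effect
demo = np.arange(2 * 3 * 3, dtype=np.uint8).reshape(2, 3, 3)
out = rescale_and_flip(demo, flip=True)
print(out.shape, out.min(), out.max())
```

A function of this shape (one image in, one image of the same shape out) is what `ImageDataGenerator` expects for its `preprocessing_function` argument.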

In [None]:
def image_generator(files, label_file, batch_size = 64):
    
    while True:
        # Select files (paths) for the batch; np.random.choice samples with replacement
        batch_paths = np.random.choice(a = files, size = batch_size)
        batch_input = []
        batch_output = []
        
        # Read in each input, perform preprocessing and get labels
        for input_path in batch_paths:
            image = get_input(input_path)
            label = get_output(input_path, label_file = label_file)
            
            image = preprocess_input(image = image)
            batch_input.append(image)
            batch_output.append(label)
        
        # Yield a tuple of (input, output) to feed the network
        batch_x = np.array(batch_input)
        batch_y = np.array(batch_output)
        
        yield (batch_x, batch_y)
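
Because `np.random.choice` samples with replacement, a generator like the one above can repeat some files and skip others, which suits training more than featurization. A minimal sketch of an ordered alternative follows, assuming the inputs are `.npy` files of identical shape. `sequential_batches` is a hypothetical helper, not part of this notebook.

```python
import os
import tempfile
import numpy as np

def sequential_batches(paths, batch_size=32):
    """Yield (batch_array, batch_paths), visiting every path exactly once, in order."""
    for start in range(0, len(paths), batch_size):
        batch_paths = paths[start:start + batch_size]
        batch = np.stack([np.load(p) for p in batch_paths])
        yield batch, batch_paths

# Demo on five tiny synthetic arrays written to a temporary directory
tmp_dir = tempfile.mkdtemp()
demo_paths = []
for i in range(5):
    p = os.path.join(tmp_dir, f"tile_{i}.npy")
    np.save(p, np.full((4, 4, 3), i, dtype=np.float32))
    demo_paths.append(p)

batch_shapes = [b.shape for b, _ in sequential_batches(demo_paths, batch_size=2)]
print(batch_shapes)  # the last batch is smaller because 5 is not a multiple of 2
```

Keeping the paths alongside each batch makes it straightforward to save the filenames in the same order as the feature vectors.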

In [None]:
dataGenerator = ImageDataGenerator(
    preprocessing_function = preprocess_input
)

In [None]:
batch_size = 32
trainGenerator = dataGenerator.flow_from_directory(
        dataPath,
        target_size=(224, 224),
        batch_size= batch_size,
        class_mode= None, 
        shuffle = False)

Found 2100 images belonging to 21 classes.


Generate Feature Vector from User-defined dataset

In [None]:
nImages = len(trainGenerator.filenames)
nLoops = int(math.ceil(nImages / batch_size))
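
The step count is rounded up so that the final, partial batch is still processed. A quick check of the arithmetic with the 2100 images and batch size of 32 reported above (the variable names here are illustrative):

```python
import math

n_images = 2100   # image count reported by the generator above
batch_size = 32

n_loops = math.ceil(n_images / batch_size)           # 2100 / 32 = 65.625, rounded up
last_batch = n_images - (n_loops - 1) * batch_size   # images left for the final step

print(n_loops, last_batch)  # 66 20
```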

In [None]:
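# `model` must already be loaded; pass the step count via `steps` (the second positional argument is batch_size)
bottleneckFeaturesTrain = model.predict(trainGenerator, steps = nLoops, verbose = 1)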



In [None]:
print(bottleneckFeaturesTrain.shape)

(2100, 1024)


Dump features and filenames into GDrive folder


In [None]:
with open("Features/" + modelName + "_" + dataset + "_features.pkl", mode = 'wb') as f:
    pickle.dump(bottleneckFeaturesTrain, file = f)
with open("Features/" + modelName + "_" + dataset + "_filenames.pkl", mode = 'wb') as f:
    pickle.dump(trainGenerator.filenames, file = f)
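
To confirm that the two pickle files stay aligned, they can be read back and compared row for row. The sketch below is self-contained: it first writes small synthetic stand-ins to a temporary directory, so the file names and array contents are placeholders rather than this notebook's real outputs.

```python
import os
import pickle
import tempfile
import numpy as np

# Synthetic stand-ins for the feature matrix and the filename list
features = np.random.rand(3, 1024).astype(np.float32)
filenames = ["a/img0.png", "a/img1.png", "b/img2.png"]

out_dir = tempfile.mkdtemp()
features_path = os.path.join(out_dir, "demo_features.pkl")
filenames_path = os.path.join(out_dir, "demo_filenames.pkl")

with open(features_path, "wb") as f:
    pickle.dump(features, f)
with open(filenames_path, "wb") as f:
    pickle.dump(filenames, f)

# Read both files back and check that row i still corresponds to filename i
with open(features_path, "rb") as f:
    loaded_features = pickle.load(f)
with open(filenames_path, "rb") as f:
    loaded_filenames = pickle.load(f)

assert loaded_features.shape[0] == len(loaded_filenames)
print(loaded_features.shape, len(loaded_filenames))
```

The alignment holds for the real outputs only if the generator was created with `shuffle = False`, as it is above.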