
# Feature Extraction
- The next few chapters cover transfer learning: using a pretrained model as a shortcut to learn patterns from data it was not originally trained on.
- There are two types of transfer learning when applied to deep learning for computer vision:
  - Treating networks as arbitrary feature extractors.
  - Removing the fully-connected (FC) layers of an existing network, placing a new set of FC layers on top of the CNN, and fine-tuning these weights (and optionally those of previous layers) to recognize object classes.
- This chapter discusses the first method, i.e., treating networks as feature extractors.

## Extracting features with a Pre-trained CNN
- Usually, we treat CNNs as end-to-end image classifiers:
  - We input an image to the network.
  - The image forward propagates through the network.
  - We obtain the final classification probabilities from the end of the network.
- We can instead stop the propagation at an arbitrary layer and extract the activations at that point, using them as a feature vector that quantifies the contents of the image.
- If this is repeated for an entire dataset, we can train an SVM, logistic regression, or random forest on top of these features to obtain a classifier that recognizes new classes of images.
- The trick is extracting these features and storing them efficiently. To accomplish this task, we’ll need HDF5.
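
The idea of stopping forward propagation early can be sketched with Keras. This is only an illustrative sketch, not the chapter's final implementation: the helper name `extract_features` and its `weights` parameter are assumptions introduced here, and the input is assumed to be a batch of 224x224 RGB images. Passing `include_top=False` removes the FC layers, so the network output is the final POOL volume, which is then flattened into one vector per image.

```python
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input


def extract_features(images, weights="imagenet"):
    """Return one flattened feature vector per image (hypothetical helper).

    images: float array of shape (N, 224, 224, 3), RGB, values in [0, 255].
    weights: "imagenet" for pretrained weights (downloaded on first use),
             or None for random weights (useful only for checking shapes).
    """
    # include_top=False drops the FC layers, so the forward pass stops
    # at the final POOL layer, whose output is 7 x 7 x 512 per image.
    model = VGG16(weights=weights, include_top=False,
                  input_shape=(224, 224, 3))

    # Apply the preprocessing VGG16 expects; copy so the caller's
    # array is not modified in place.
    activations = model.predict(preprocess_input(images.copy()), verbose=0)

    # Flatten each 7 x 7 x 512 volume into a vector of length 25088.
    return activations.reshape((activations.shape[0], -1))
```

The resulting `(N, 25088)` array is what a standard classifier such as logistic regression would be trained on.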

## HDF5
- HDF5 is a binary data format for storing very large datasets on disk while still allowing easy access to, and computation on, rows of the datasets.
- The HDF5 library is written in C, but the h5py Python library gives us access to the C API, allowing us to store huge amounts of data in an HDF5 dataset and manipulate the data in a NumPy-like fashion.
- Datasets stored in HDF5 format are portable and can be accessed from other languages such as C, MATLAB, and Java.
- Below we write a custom Python class that efficiently accepts input data and writes it to an HDF5 dataset. It will:
  - Facilitate transfer learning by taking extracted features from VGG16 and writing them to HDF5.
  - Allow us to generate HDF5 datasets from raw images to facilitate faster training.
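
Before the full class, the NumPy-like access pattern can be sketched directly with h5py. This is a minimal sketch under stated assumptions: the file name, the dataset keys (`features`, `labels`), and the sizes are made up for illustration, and the data is random stand-in values instead of real extracted features.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file name and sizes, chosen only for illustration.
path = os.path.join(tempfile.mkdtemp(), "features.hdf5")
num_rows, feat_dim, chunk = 100, 512, 25

# Write: allocate the datasets up front, then fill them slice by slice.
with h5py.File(path, "w") as db:
    data = db.create_dataset("features", (num_rows, feat_dim), dtype="float32")
    labels = db.create_dataset("labels", (num_rows,), dtype="int64")
    for start in range(0, num_rows, chunk):
        end = start + chunk
        data[start:end] = np.random.rand(chunk, feat_dim)  # stand-in features
        labels[start:end] = np.arange(start, end) % 2      # stand-in labels

# Read: slicing loads only the requested rows into memory.
with h5py.File(path, "r") as db:
    stored_shape = db["features"].shape
    batch = db["features"][10:20]
    batch_labels = db["labels"][10:20]
```

Writing in slices is the same pattern the buffered writer class relies on: rows accumulate in memory and are copied to disk in blocks.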




In [None]:
!pip install h5py



In [None]:
import os
import h5py
from tensorflow.keras.utils import to_categorical
import numpy as np

In [None]:
class HDF5DatasetWriter:
  # dims: controls the dimensions of the data that will be stored in the dataset.
  #     If we are storing flattened raw pixel intensities of 28x28=784 for 70000 examples, then dims=(70000, 784)
  #     If we store raw CIFAR10 (unflattened), dims=(60000, 32, 32, 3)
  #     In the context of feature extraction, if the final POOL layer is 512x7x7, it flattens to a feature vector of length 25088, so we set dims=(N, 25088)
  #     where N is the number of images in the dataset
  # dataKey: name of the dataset that will store the data (defaults to "images"); a different key can indicate, for example, that we are storing features extracted from a CNN.
  # bufSize: controls the size of our in-memory buffer, which defaults to 1,000 feature vectors/images. Once we reach bufSize, we flush the buffer to the HDF5 dataset.

  def __init__(self, dims, outputPath, dataKey="images", bufSize=1000):
    #check to see if output path exists
    if os.path.exists(outputPath):
      raise ValueError("Path exists", outputPath)

    # open the HDF5 database for writing and create two datasets:
    # one to store the images/features and another to store the class labels
    self.db = h5py.File(outputPath, "w")
    self.data = self.db.create_dataset(dataKey, dims, dtype="float")
    self.labels = self.db.create_dataset("labels", (dims[0],), dtype="int")

    self.bufSize = bufSize
    self.buffer = {"data":[], "labels":[]}
    self.idx = 0
  
  def add(self, rows, labels):
    # add rows and labels to the buffer
    self.buffer["data"].extend(rows)
    self.buffer["labels"].extend(labels)

    if len(self.buffer["data"]) >= self.bufSize:
      self.flush()
  
  def flush(self):
    # write buffers to disk then reset buffer
    i = self.idx + len(self.buffer["data"])
    self.data[self.idx:i] = self.buffer["data"]
    self.labels[self.idx:i] = self.buffer["labels"]
    self.idx = i
    self.buffer = {"data":[], "labels":[]}

  def storeClassLabels(self, classLabels):
    # create a dataset to store the actual class label  names,
    # then store the class labels
    # `unicode` exists only in Python 2; use `str` for variable-length strings in Python 3
    dt = h5py.special_dtype(vlen=str)
    labelSet = self.db.create_dataset("label_names", (len(classLabels),), dtype=dt)
    labelSet[:] = classLabels

  def close(self):
    # check if there are any entries in buffer 
    # that need to be flushed to disk
    if len(self.buffer["data"])>0:
      self.flush()

    # always close the file, even when the buffer was already empty
    self.db.close()

In [None]:
class HDF5DatasetGenerator:
  def __init__(self, dbPath, batchSize, preprocessors=None,
    aug=None, binarize=True, classes=2):
    # the parameter is named `binarize` so that it matches the assignment below
    self.batchSize = batchSize
    self.preprocessors = preprocessors
    self.aug = aug
    self.binarize = binarize
    self.classes = classes
        
    self.db = h5py.File(dbPath,"r")
    self.numImages = self.db["labels"].shape[0]
        
  def generator(self, passes=np.inf):
    epochs = 0
    while epochs<passes:
      for i in np.arange(0, self.numImages, self.batchSize):
        images = self.db["images"][i:i+self.batchSize]
        labels = self.db["labels"][i:i+self.batchSize]

        if self.binarize:
          labels = to_categorical(labels, self.classes)

        if self.preprocessors is not None:
          procImages = []
          for image in images:
            for p in self.preprocessors:
              image = p.preprocess(image)

            procImages.append(image)
          images = np.array(procImages)

        if self.aug is not None:
          (images, labels) = next(self.aug.flow(images, labels,
                                                batch_size=self.batchSize))

        yield (images, labels)

      epochs += 1
    
  def close(self):
    self.db.close()