<a name='Image-feature-extraction'></a>
# **Image Feature Extraction With Pre-Trained CNN Models**

Image feature extraction is a critical step in image captioning: it converts raw image data into a set of numerical features that can be used as input to the captioning model. One way to do this is to use a pretrained deep learning CNN model such as VGG16, ResNet, DenseNet201 or Inception. These models are trained on large datasets such as ImageNet and have learned to extract meaningful and discriminative features from images.

**The process of using a pretrained model for image feature extraction involves the following steps:**

1. Load the Pretrained Model
2. Remove the Classification Layers: These layers are designed to predict the class label of an image, which is not relevant for image feature extraction.
3. Extract Image Features: Pass each image through the modified model and take the output of one of its later layers. The output of this layer is a vector of high-level features that capture the visual content of the image.
4. Save the Extracted Features

Overall, using a pretrained CNN model for image feature extraction is a powerful technique that can significantly improve the performance of image captioning models. By leveraging the power of these models, we can obtain a rich representation of the visual content of images, which can be used to generate accurate and informative captions.

<a name='Models'></a>
## **Models**
Here we will use DenseNet201 and VGG16 to extract features from images.
VGG16 is a relatively simple architecture compared to DenseNet201: 16 weight layers (13 convolutional and 3 fully connected) interleaved with max-pooling operations. It has been widely used for transfer learning in image classification tasks and achieved strong results on benchmark datasets such as ImageNet. However, VGG16 is computationally expensive and can be slow to train.

DenseNet201, on the other hand, has more layers and a more complex architecture in which each layer receives the outputs of all preceding layers in its block. This allows for better feature reuse and improves the flow of information through the network. It has also shown good performance on a variety of image classification tasks, and with roughly 20 million parameters (against roughly 138 million for VGG16) it is relatively computationally efficient compared to some other deep neural network architectures.

In general, if you have a large dataset and ample computational resources, DenseNet201 may be the better choice, as it can capture more complex features and can potentially achieve better performance. If you want a simple, well-understood baseline, VGG16 may be the better choice, as its architecture is more straightforward, although (as noted above) it is not cheap to run.
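
One practical difference between the two models is the size of the feature vector each one yields when it is cut at its second-to-last layer, as is done later in this notebook: 1920 values for DenseNet201 (its global average pooling layer) and 4096 values for VGG16 (its `fc2` layer). The sketch below is only a rough estimate of the storage needed for the raw float32 features; the image count is a hypothetical placeholder, and pickle overhead is ignored.

```python
# Feature-vector length at the second-to-last layer of each model
FEATURE_DIMS = {"DenseNet201": 1920, "VGG16": 4096}
BYTES_PER_FLOAT32 = 4


def feature_store_megabytes(n_images, feature_dim):
    """Approximate size in MB of n_images float32 vectors of length feature_dim."""
    return n_images * feature_dim * BYTES_PER_FLOAT32 / 1e6


n_images = 8000  # hypothetical dataset size, not taken from this notebook

for model_name, dim in FEATURE_DIMS.items():
    print(f"{model_name}: {feature_store_megabytes(n_images, dim):.1f} MB")
# DenseNet201: 61.4 MB
# VGG16: 131.1 MB
```
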

## **Contents:**
- [Image Feature Extraction With Pre-Trained CNN Models](#Image-feature-extraction)
- [Models](#Models)
- [Imports](#Imports)
- [Functions](#Functions)
- [Pre-Trained DenseNet201 Model](#Densenet)
- [Pre-Trained VGG16 Model](#VGG16)
- [Image Feature Extraction Summary](#Image-Feature-Extraction-Summary)

<a name='Imports'></a>
## **Imports**

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import pickle
from tensorflow import keras
from tensorflow.keras.preprocessing.image import load_img, img_to_array
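from tensorflow.keras.applications import DenseNet201
from tensorflow.keras.applications.densenet import preprocess_input
from tensorflow.keras.models import Model, Sequential

# VGG16 model and its preprocessing function.
# Caution: this import rebinds the name `preprocess_input`, replacing the
# DenseNet version imported above, so every later call to preprocess_input()
# applies VGG16 preprocessing.
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input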

pd.set_option("display.max_colwidth", None)

In [None]:
# Mount to drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### **Load Data**

In [None]:
# Read cleaned caption file
df = pd.read_csv('drive/MyDrive/Colab Notebooks/Capstone_Project/data/cleaned_caption.csv')

# Path to the folder that contains the images
image_path = 'drive/MyDrive/Colab Notebooks/Capstone_Project/data/Images'

# Path to the folder where the extracted features are saved
data_path = 'drive/MyDrive/Colab Notebooks/Capstone_Project/data/extracted_image_features'

<a name='Functions'></a>
## **Functions**

The function below runs a given pre-trained model over every image and saves the extracted features.

In [None]:
# Function for extracting image features
def extracting_features(img_size, image, model, data, name):
  # Maps each image file name to its extracted feature array
  features = {}
  # `image` is the name of the column in `data` that holds the image file names
  images = data[image].unique().tolist()

  for image_name in images:

    # Load the image from file and resize it to the model's input size
    path = image_path + '/' + image_name
    img = load_img(path, target_size=(img_size, img_size))

    # Convert image to an array of shape (height, width, channels)
    img = np.array(img)

    # Add a leading batch dimension: (1, height, width, channels)
    img = img.reshape((1, img.shape[0], img.shape[1], img.shape[2]))

    # Scale the pixel values with the imported preprocess_input function
    img = preprocess_input(img)

    # Extract features
    img_feat = model.predict(img, verbose=0)

    # Store feature
    features[image_name] = img_feat

  # Save extracted features as <data_path>/<name>.pkl
  with open(data_path + '/' + name + '.pkl', 'wb') as pickle_out:
    pickle.dump(features, pickle_out)

  return features
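
The preprocessing step is model-specific. As an illustration only, the sketch below re-implements in plain NumPy the two conventions that Keras applies for these model families: the VGG16 `preprocess_input` converts RGB to BGR and subtracts the ImageNet channel means without scaling, while the DenseNet `preprocess_input` scales pixels to [0, 1] and then standardises each channel with the ImageNet mean and standard deviation. The function names here are hypothetical stand-ins written for this example; in practice the library's own `preprocess_input` functions should be used.

```python
import numpy as np


def vgg16_style_preprocess(rgb):
    """Hand-written approximation: RGB -> BGR, subtract ImageNet channel means."""
    bgr = rgb[..., ::-1].astype("float32")
    return bgr - np.array([103.939, 116.779, 123.68], dtype="float32")


def densenet_style_preprocess(rgb):
    """Hand-written approximation: scale to [0, 1], then standardise per channel."""
    scaled = rgb.astype("float32") / 255.0
    mean = np.array([0.485, 0.456, 0.406], dtype="float32")
    std = np.array([0.229, 0.224, 0.225], dtype="float32")
    return (scaled - mean) / std


# A single mid-grey pixel with a batch dimension: shape (1, 1, 1, 3)
pixel = np.full((1, 1, 1, 3), 128, dtype="uint8")

vgg_out = vgg16_style_preprocess(pixel)
densenet_out = densenet_style_preprocess(pixel)

print("VGG16-style:   ", vgg_out.ravel())
print("DenseNet-style:", densenet_out.ravel())
```

Because the two conventions produce very different value ranges for the same pixel, it is worth confirming which `preprocess_input` is in scope before running the extraction for a given model.
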

<a name='Densenet'></a>
## **Pre-Trained DenseNet201 Model**

We set the input layer to be the same as the input layer of the original DenseNet201 model, and the output layer to be the second-to-last layer of the original model. This removes the final classification layer, which was responsible for predicting the class labels of the input images.
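
As a minimal sketch of this restructuring idea, the example below applies the same `layers[-2]` technique to a small toy classifier. The toy network is a hypothetical stand-in chosen so the example runs quickly without downloading weights; its layer sizes and input shape are assumptions and have nothing to do with DenseNet201 itself.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Toy classifier: conv -> global average pooling -> softmax classification head
inputs = keras.Input(shape=(32, 32, 3))
x = layers.Conv2D(8, 3, activation="relu")(inputs)
x = layers.GlobalAveragePooling2D(name="avg_pool")(x)
outputs = layers.Dense(10, activation="softmax", name="predictions")(x)
classifier = keras.Model(inputs, outputs)

# Keep the original input, but stop at the second-to-last layer,
# which drops the classification head
feature_extractor = keras.Model(
    inputs=classifier.input,
    outputs=classifier.layers[-2].output,
)

# Two random "images" in, one 8-dimensional feature vector per image out
batch = np.random.rand(2, 32, 32, 3).astype("float32")
features = feature_extractor.predict(batch, verbose=0)
print(features.shape)  # (2, 8)
```
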

In [None]:
# Instantiate DenseNet201() model for image feature extraction
densenet_model = DenseNet201()

# Restructure model
densenet_model = Model(inputs=densenet_model.input, outputs=densenet_model.layers[-2].output)

# Summarize (summary() prints the table itself and returns None)
densenet_model.summary()

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/densenet/densenet201_weights_tf_dim_ordering_tf_kernels.h5
Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_1 (InputLayer)           [(None, 224, 224, 3  0           []                               
                                )]                                                                
                                                                                                  
 zero_padding2d (ZeroPadding2D)  (None, 230, 230, 3)  0          ['input_1[0][0]']                
                                                                                                  
 conv1/conv (Conv2D)            (None, 112, 112, 64  9408        ['zero_padding2d[0][0]']         
                                )                  

- Execute feature extraction function for DenseNet201 model and save the extracted features.

In [None]:
# Execute feature extraction function for DenseNet201 model
extracting_features(img_size=224, image='image', model=densenet_model, data=df, name='densenet_img_feature_ex')

{'1000268201_693b08cb0e.jpg': array([[ 0.        ,  0.        ,  0.02089716, ..., 17.095543  ,
          0.6510734 ,  0.14316882]], dtype=float32),
 '1001773457_577c3a7d70.jpg': array([[1.1297526e-06, 0.0000000e+00, 2.4402274e-03, ..., 1.1449293e+00,
         6.3626361e+00, 1.9069627e-01]], dtype=float32),
 '1002674143_1b742ab4b8.jpg': array([[ 0.        ,  0.        ,  0.02022222, ...,  5.542681  ,
         10.543776  ,  0.83629465]], dtype=float32),
 '1003163366_44323f5815.jpg': array([[4.0896884e-06, 0.0000000e+00, 9.7770081e-04, ..., 1.6475077e+00,
         7.7980608e-01, 9.0657169e-01]], dtype=float32),
 '1007129816_e794419615.jpg': array([[0.0000000e+00, 0.0000000e+00, 1.7467703e-03, ..., 6.5489321e+00,
         4.7779069e+00, 4.4508930e-02]], dtype=float32),
 '1007320043_627395c3d8.jpg': array([[1.2575913e-05, 0.0000000e+00, 1.0925459e-02, ..., 5.4353004e+00,
         8.5426968e-01, 2.7488060e+00]], dtype=float32),
 '1009434119_febe49276a.jpg': array([[0.        , 0.        , 0.

<a name='VGG16'></a>
## **Pre-Trained VGG16 Model**

We set the input layer to be the same as the input layer of the original VGG16 model, and the output layer to be the second-to-last layer of the original model (the 4096-unit fully connected layer `fc2`). This removes the final classification layer, which was responsible for predicting the class labels of the input images.

In [None]:
# Instantiate VGG16() model for image feature extraction
vgg16_model = VGG16()

# Restructure model
vgg16_model = Model(inputs=vgg16_model.inputs, outputs=vgg16_model.layers[-2].output)

# Summarize (summary() prints the table itself and returns None)
vgg16_model.summary()

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels.h5
Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 224, 224, 3)]     0         
                                                                 
 block1_conv1 (Conv2D)       (None, 224, 224, 64)      1792      
                                                                 
 block1_conv2 (Conv2D)       (None, 224, 224, 64)      36928     
                                                                 
 block1_pool (MaxPooling2D)  (None, 112, 112, 64)      0         
                                                                 
 block2_conv1 (Conv2D)       (None, 112, 112, 128)     73856     
                                                                 
 block2_conv2 (Conv2D)       (None, 112, 112, 128)     147

- Execute feature extraction function for VGG16 model and save the extracted features.

In [None]:
# Execute feature extraction function for VGG16 model
extracting_features(img_size=224, image='image', model=vgg16_model, data=df, name='vgg16_img_feature_ex')

{'1000268201_693b08cb0e.jpg': array([[2.5076475, 0.       , 0.       , ..., 0.       , 0.       ,
         0.       ]], dtype=float32),
 '1001773457_577c3a7d70.jpg': array([[0.        , 0.        , 0.49410808, ..., 0.        , 0.        ,
         0.        ]], dtype=float32),
 '1002674143_1b742ab4b8.jpg': array([[1.4937091, 0.       , 0.5356834, ..., 2.315413 , 3.7418401,
         0.       ]], dtype=float32),
 '1003163366_44323f5815.jpg': array([[0., 0., 0., ..., 0., 0., 0.]], dtype=float32),
 '1007129816_e794419615.jpg': array([[0.        , 0.09227633, 0.        , ..., 0.        , 0.        ,
         0.0652895 ]], dtype=float32),
 '1007320043_627395c3d8.jpg': array([[0.       , 0.       , 0.       , ..., 0.       , 3.3386393,
         0.       ]], dtype=float32),
 '1009434119_febe49276a.jpg': array([[2.096294  , 2.1193202 , 3.5624332 , ..., 0.64263886, 2.714652  ,
         0.        ]], dtype=float32),
 '1012212859_01547e3f17.jpg': array([[0.       , 0.       , 0.9873711, ..., 0.   

<a name='Image-Feature-Extraction-Summary'></a>
## **Image Feature Extraction Summary**

- Loaded two pre-trained models, VGG16 and DenseNet201.
- Removed the classification layer from both models.
- Passed each image in our dataset through each of these models to obtain a set of features for the image.
- Saved the extracted features for each model in a separate pickle file.

**These extracted features will be used as input to train an image captioning model.**
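
As a rough sketch of how the saved features might be read back before training, the example below round-trips a dictionary with the same layout the extraction function produces (`{image_name: array of shape (1, n_features)}`) and stacks it into a single matrix. The dictionary contents, the image names and the file name `example_features.pkl` are dummy placeholders, not the real extracted features.

```python
import os
import pickle
import tempfile

import numpy as np

# Dummy stand-in for the extracted features: one (1, 1920) array per image
dummy_features = {
    "img_a.jpg": np.zeros((1, 1920), dtype="float32"),
    "img_b.jpg": np.ones((1, 1920), dtype="float32"),
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_features.pkl")  # hypothetical file name

    # Save and reload, mirroring the pickle format used for the real features
    with open(path, "wb") as pickle_out:
        pickle.dump(dummy_features, pickle_out)
    with open(path, "rb") as pickle_in:
        loaded = pickle.load(pickle_in)

# Fix an image order, then stack the (1, n_features) rows into one matrix
image_names = sorted(loaded)
feature_matrix = np.vstack([loaded[name] for name in image_names])

print(image_names)
print(feature_matrix.shape)  # (2, 1920)
```

Keeping `image_names` alongside `feature_matrix` makes it possible to match each row back to the captions of the corresponding image.
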