# Vector Databases Assignment (Graded): Pinecone

Welcome to your programming assignment on Vector Databases! You will create a Pinecone Database Instance to store the feature vectors.

## Problem Description

- In this assignment, you will develop a machine learning system that extracts visual features from the Flickr8k dataset with a pre-trained ResNet-50 model and stores them in a Pinecone vector database.

- The system should efficiently process and store image features to enable fast similarity search and retrieval for downstream computer vision tasks.

## Dataset Description

The Flickr8k dataset consists of:
- 8,000 unique images from Flickr
- 5 different caption descriptions per image (40,000 total captions)
- Images with varying visual content, perspectives, and lighting conditions
- Standard image format (.jpg files)

Key characteristics:
- Training set: 6,000 images
- Validation set: 1,000 images
- Test set: 1,000 images

For more information, refer to this link: [Flickr-8k Dataset](https://www.kaggle.com/datasets/dibyansudiptiman/flickr-8k)

## Assignment Tasks

1. **Feature Extraction Setup**
- Initialize and configure the pre-trained ResNet-50 model
- Remove the final classification layer to obtain 2048-dimensional feature vectors
- Set up Pinecone database with appropriate index configuration

2. **Image Processing Pipeline**
- Implement image preprocessing functions (resizing, normalization)
- Create batch processing mechanism for efficient feature extraction
- Handle potential errors and edge cases in image processing

3. **Database Integration**
- Configure Pinecone connection with proper authentication
- Design vector storage schema including metadata
- Implement batch upsert functionality for efficient data storage

## Instructions

- Only write code where you see one of the prompts below:

    ```
    # YOUR CODE GOES HERE
    # YOUR CODE ENDS HERE
    # TODO
    ```

- Do not modify any other section of the code unless stated otherwise in the comments.

# Code Section

In [None]:
import tensorflow as tf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
from pinecone import Pinecone, ServerlessSpec
from dataclasses import dataclass
import os
import requests
import tarfile
import zipfile
from helpers.methods import download_dataset, verify_dataset, visualize_sample_images, visualize_feature_distribution, visualize_pinecone_results
from tests.test_methods import test_initialize_feature_extractor, test_setup_pinecone, test_batch_process_images

## Task: Set up the configuration (do this iteratively as you work through the code)

**Task Hints:**

Complete the configuration of the system by setting up the missing fields in the configuration class.

- Update **Model configuration** for ResNet50 model (e.g., specify any other required parameters).
- Update **Pinecone configuration** with your environment and key settings.

- `image_size`: Specify the size of the images for model processing.
- `batch_size`: Define the number of images processed in a single batch.

Complete the fields marked as "Your CODE GOES HERE" with appropriate values.
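
As a rough illustration of the kind of values these fields take, here is a minimal sketch. It assumes a 224x224 input (the size ResNet-50 is commonly used with on ImageNet) and a batch size of 32; both are assumptions, not required answers. The class name `ExampleConfig`, the environment variable `PINECONE_API_KEY`, and the region `us-east-1` are hypothetical and chosen only for this example.

```python
import os
from dataclasses import dataclass, field


@dataclass
class ExampleConfig:
    # Hypothetical values for illustration; choose your own for the assignment.
    image_size: tuple = (224, 224)   # (height, width) fed to the model
    batch_size: int = 32             # images processed per batch
    # Reading the key from an environment variable keeps it out of the notebook.
    pinecone_api_key: str = field(
        default_factory=lambda: os.environ.get("PINECONE_API_KEY", "")
    )
    pinecone_env: str = "us-east-1"  # assumed region name
    index_name: str = "flickr8k-features"


example_config = ExampleConfig()
# The model input shape adds a channel dimension to the image size.
input_shape = (*example_config.image_size, 3)
print(input_shape)
```

Keeping the API key in an environment variable is one option for avoiding a hard-coded secret; a plain string in the dataclass also works for local experiments.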

In [None]:
# Configuration
@dataclass
class Config:
    # Dataset configuration
    # Do not change the code below
    data_root: str = 'data'
    flickr8k_path: str = 'data/Flickr8k'
    image_dir: str = 'data/Flickr8k/Flicker8k_Dataset'
    caption_file: str = 'data/Flickr8k/Flickr8k.token.txt'
    train_images_file: str = 'data/Flickr8k/Flickr_8k.trainImages.txt'
    test_images_file: str = 'data/Flickr8k/Flickr_8k.testImages.txt'
    # Do not change the code above
    
    # Model configuration (For ResNet50) : Your CODE GOES HERE
    image_size: tuple = 
    batch_size: int = 
    
    # Pinecone configuration: Your CODE GOES HERE
    pinecone_api_key: str = ''
    pinecone_env: str = ''
    index_name: str = 'flickr8k-features'

In [None]:
# Initialize configuration
config = Config()

## Downloading the dataset

In [None]:
# Download and verify dataset: Do not change the code below
print("Downloading dataset...")
download_dataset(config)

if verify_dataset(config):
    print("Dataset verified successfully!")
else:
    raise Exception("Dataset verification failed!")

## Task: Loading and exploring the dataset

**Task Hints:**

Complete the `load_and_analyze_captions` function to analyze the captions dataset.

- **Loading Data**: Ensure the captions file is read correctly using `pd.read_csv()`.
- **Image Parsing**: Split the `image_and_caption_id` column on `#` and extract the image filename into a new column called `image`.
  - Use `apply()` with a lambda function to achieve this.
  
#### Analysis Tasks:
- Print the total number of captions.
- Print the total number of unique images.
- Print the average number of captions per image.

#### Additional Analysis:
- Create a new column `caption_length` to store the length of each caption (in characters).
- Create a new column `word_count` to store the number of words in each caption.
  - Use `apply()` with a lambda function to count the number of words.

#### Visualization:
- Create two subplots:
  - The first subplot should display the distribution of caption lengths (number of characters).
  - The second subplot should display the distribution of word counts (number of words).

- Use `sns.histplot()` for visualization.

Make sure to handle exceptions during file loading and print the first line of the file if an error occurs.
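
The parsing and counting steps above can be sketched on plain Python data before applying them to the DataFrame. The sample rows and file names below are made up for illustration; the same expressions can be used inside the lambda passed to `apply()`.

```python
# Hypothetical rows in the "<filename>#<caption index>" format described above.
sample_rows = [
    ("example_image_1.jpg#0", "A dog runs across a field ."),
    ("example_image_1.jpg#1", "A brown dog is running ."),
    ("example_image_2.jpg#0", "Two children play on a beach ."),
]

records = []
for image_and_caption_id, caption in sample_rows:
    records.append({
        # Everything before '#' is the image filename.
        "image": image_and_caption_id.split("#")[0],
        "caption": caption,
        "caption_length": len(caption),        # characters
        "word_count": len(caption.split()),    # whitespace-separated tokens
    })

unique_images = {record["image"] for record in records}
average_captions = len(records) / len(unique_images)

print(f"Total number of captions: {len(records)}")
print(f"Unique images: {len(unique_images)}")
print(f"Average captions per image: {average_captions:.2f}")
```

Note that `split()` counts the trailing `.` as a word because the captions in this sketch separate punctuation with a space.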

In [None]:
def load_and_analyze_captions(config: Config):
    """Load and perform EDA on captions"""
    # Read the token file with proper format
    try:
        df = pd.read_csv(config.caption_file, sep='\t', header=None, 
                        names=['image_and_caption_id', 'caption'])
        
        # Your code goes here
        
        # Task: Split the 'image_and_caption_id' column on '#' and create a new column 'image' with the image filename
        # Hint: Use the apply() function with a lambda function to split the column
        df['image'] = 
        
        print("\n## Caption Analysis")
        # Task: Print the total number of captions
        print(f"Total number of captions: {}")
        
        # Task: Print the total number of unique images
        print(f"Unique images: {}")
        
        # Task: Print the average number of captions per image
        print(f"Average captions per image: {}")
        
        # Calculate caption statistics
        # Task: Create a new column 'caption_length' with the length of each caption
        df['caption_length'] = 
        
        # Task: Create a new column 'word_count' with the number of words in each caption
        # Hint: Use the apply() function with a lambda function to split the caption
        df['word_count'] = 
        
        # Visualize distributions
        plt.figure(figsize=(15, 5))
        
        plt.subplot(1, 2, 1)
        sns.histplot(data=df['caption_length'], bins=50)
        plt.title('Distribution of Caption Lengths')
        plt.xlabel('Number of Characters')
        
        plt.subplot(1, 2, 2)
        sns.histplot(data=df['word_count'], bins=30)
        plt.title('Distribution of Word Counts')
        plt.xlabel('Number of Words')
        
        plt.tight_layout()
        plt.show()
        
        return df
        
    except Exception as e:
        print(f"Error loading captions: {str(e)}")
        print("\nAttempting to read file contents:")
        try:
            with open(config.caption_file, 'r', encoding='utf-8') as f:
                print(f.readline())  # Print first line to see format
        except Exception as read_error:
            print(f"Could not read file: {str(read_error)}")
        raise
    
    
# Load and analyze dataset: Do not change the code below
print("\nAnalyzing dataset...")
df = load_and_analyze_captions(config)

In [None]:
# Visualize sample images: Do not change the code below
config = Config()
visualize_sample_images(config, df, num_samples=5)

## Task: Model Building

**Task Hints:**

Complete the `initialize_feature_extractor` function to set up the ResNet-50 model for feature extraction.

- **Loading the ResNet50 Model**: Load the ResNet50 model with the following parameters:
  - `include_top=False`: Exclude the top fully connected layers.
  - `weights='imagenet'`: Use pre-trained weights from ImageNet.
  - `input_shape=(*config.image_size, 3)`: Set the input shape according to the configured image size.

- **Building the Model**:
  - Create a Sequential model that includes:
    1. The ResNet50 base model.
    2. A `GlobalAveragePooling2D` layer for down-sampling the feature maps.

- **Model Configuration**:
  - Use `tf.keras.Sequential()` to construct the model.
  - Use `tf.keras.layers.GlobalAveragePooling2D()` to add the pooling layer.
  - The ResNet50 model should be the first layer in the sequence.
  - The model should not be compiled or trained at this stage.

- **Testing**:
  - After setting up the model, it will be tested automatically using `test_initialize_feature_extractor()`. You do not need to change the test code.

Ensure the model is returned from the function at the end.
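
To see why the pooling layer yields a 2048-dimensional vector, the sketch below reproduces the arithmetic of global average pooling with NumPy only. It assumes a 7x7x2048 feature map, which is what ResNet-50 without its top layers produces for a 224x224 input; the random array is a stand-in for real activations.

```python
import numpy as np

# Stand-in for the base model output: (batch, height, width, channels).
rng = np.random.default_rng(0)
feature_map = rng.random((1, 7, 7, 2048), dtype=np.float32)

# Global average pooling takes the mean over the two spatial axes,
# leaving one value per channel.
pooled = feature_map.mean(axis=(1, 2))

print(feature_map.shape)  # (1, 7, 7, 2048)
print(pooled.shape)       # (1, 2048)
```

This is only the pooling arithmetic; in the assignment the pooling is done by the `GlobalAveragePooling2D` layer inside the model.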

In [None]:
def initialize_feature_extractor(config: Config):
    """Initialize and configure ResNet-50 model"""
    
    # Your code goes here
    # Task: Load the ResNet50 model with the following parameters:
    # - include_top=False
    # - weights='imagenet'
    # - input_shape=(*config.image_size, 3)
    
    base_model = 
    
    # Task: Create a Sequential model with the ResNet50 base model and a GlobalAveragePooling2D layer
    # Hint: Use tf.keras.Sequential() with a list of layers
    # Hint: The GlobalAveragePooling2D layer can be added using tf.keras.layers.GlobalAveragePooling2D()
    # Hint: The base model should be the first layer in the list
    # Hint: The model should not be compiled or trained
    model = 
    
    # Test the model: Do not change the code below
    test_initialize_feature_extractor(model, config)
    
    # Return the model
    return model

**Task Hints:**

Complete the `preprocess_image` function to prepare an image for feature extraction using the ResNet-50 model.

- **Loading the Image**:
  - Load the image from the provided file path.
  - Resize the image to the target size specified in the `config.image_size`.
  - Use `tf.keras.preprocessing.image.load_img()` with the `target_size` parameter to resize the image.

- **Converting Image to Array**:
  - Convert the loaded image to a NumPy array for further processing.
  - Use `tf.keras.preprocessing.image.img_to_array()` to perform the conversion.

- **Preprocessing for ResNet-50**:
  - Preprocess the image array to match the input format required by ResNet-50.
  - Use `tf.keras.applications.resnet50.preprocess_input()` to preprocess the image for ResNet-50.

- **Return**:
  - Ensure that the function returns the preprocessed image array.

Make sure to follow the image preprocessing steps precisely for correct feature extraction.
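
For intuition about what the ResNet-50 preprocessing step does, here is a NumPy-only sketch. It assumes the "caffe"-style behaviour described in the Keras documentation: channels are reordered from RGB to BGR and the ImageNet per-channel mean is subtracted, with no rescaling. The function name `caffe_style_preprocess` is hypothetical; in the assignment, use the Keras function named in the hints.

```python
import numpy as np

# ImageNet channel means in BGR order (assumed values used by caffe-style preprocessing).
IMAGENET_BGR_MEAN = np.array([103.939, 116.779, 123.68], dtype=np.float32)


def caffe_style_preprocess(rgb_image):
    """Illustrative re-implementation: RGB -> BGR, then zero-center each channel."""
    bgr = rgb_image[..., ::-1].astype(np.float32)
    return bgr - IMAGENET_BGR_MEAN


# A pure-red dummy image with pixel values in [0, 255].
rgb = np.zeros((224, 224, 3), dtype=np.float32)
rgb[..., 0] = 255.0

out = caffe_style_preprocess(rgb)
print(out.shape)   # (224, 224, 3)
print(out[0, 0])   # the red value now sits in the last (BGR) position
```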

In [None]:
# Task: Preprocess image for feature extraction
def preprocess_image(image_path, config: Config):
    """Preprocess single image"""
    # Your code goes here
    
    # Task: Load the image from the file path and resize it to the target size
    # Hint: Use tf.keras.preprocessing.image.load_img() with the target_size parameter
    
    img = 
    
    # Task: Convert the image to a NumPy array
    # Hint: Use tf.keras.preprocessing.image.img_to_array()
    img_array = 
    
    # Task: Preprocess the image array for the ResNet50 model
    # Hint: Use tf.keras.applications.resnet50.preprocess_input()
    img_array = 
    
    # Return the preprocessed image array
    return img_array


**Task Hints:**

Complete the `extract_features` function to extract image features from a random sample of images using the model.

- **Sampling Images**:
  - A random sample of images is selected from the `config.image_dir`. This part of the code is already provided and should not be changed.
  
- **Feature Extraction**:
  - For each sampled image:
    1. **Preprocess the Image**: 
        - Preprocess each image using the `preprocess_image()` function. 
        - Ensure the image is correctly processed for the ResNet-50 model.
    2. **Predict Features**:
        - Use `model.predict()` to extract features from the preprocessed image.
        - The input to `model.predict()` must be expanded with `np.expand_dims()` to add a batch dimension.
    3. **Store Features**:
        - Extracted features should be appended to a list.
        - Use `.squeeze()` to remove the unnecessary dimensions from the feature array.

- **Error Handling**:
  - If any exception occurs while processing an image, print an error message and continue processing the rest of the images.

- **Return**:
  - Return the features as a NumPy array.

The function should iterate through the randomly sampled images and return the extracted feature representations.
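
The shape handling in the steps above can be checked without loading the model. In this sketch, `fake_predict` is a hypothetical stand-in for `model.predict()` that returns one 2048-dimensional vector per input image; the zero-filled arrays stand in for preprocessed images.

```python
import numpy as np


def fake_predict(batch):
    """Hypothetical stand-in for model.predict(): one 2048-d vector per image."""
    return np.zeros((batch.shape[0], 2048), dtype=np.float32)


features = []
for _ in range(3):
    img_array = np.zeros((224, 224, 3), dtype=np.float32)  # one preprocessed image
    batched = np.expand_dims(img_array, axis=0)            # add batch axis -> (1, 224, 224, 3)
    vector = fake_predict(batched).squeeze()               # drop batch axis -> (2048,)
    features.append(vector)

stacked = np.array(features)
print(batched.shape)  # (1, 224, 224, 3)
print(stacked.shape)  # (3, 2048)
```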

In [None]:
def extract_features(model, config: Config, num_samples=100):
    """Extract features from random sample of images using the model"""
    # Get random sample of images: Do not change the code below
    image_files = os.listdir(config.image_dir)
    np.random.shuffle(image_files)
    sample_files = image_files[:num_samples]
    
    # Your code goes here
    
    # Task: Extract features from the sample images using the model
    # Hint: Preprocess each image using the preprocess_image() function
    # Hint: Use model.predict() to extract features from the preprocessed image
    # Hint: Append the features to a list
    features = []
    for img_file in tqdm(sample_files, desc="Extracting features"):
        try:
            img_path = os.path.join(config.image_dir, img_file)
            # YOUR CODE GOES HERE
            
        except Exception as e:
            print(f"Error processing {img_file}: {str(e)}")
            continue
    
    return np.array(features)

## Task: Setting up Pinecone

**Step-by-Step Instructions**:

1. **Initialize Pinecone with API Key**:
   - Use the API key provided in the `config` to initialize Pinecone.
   - **Hint**: Use the `Pinecone(api_key=config.pinecone_api_key)` command to achieve this.

2. **Check for Index Existence**:
   - Check if the index with the name `config.index_name` already exists in Pinecone.
   - **Hint**: Use `pc.list_indexes()` to get a list of existing indexes.
   - If the index does not exist, proceed to create one.

3. **Create a New Pinecone Index** (if it doesn’t exist):
   - If the index doesn't exist, create a new one using the following configuration:
     - `name`: Use the index name from `config.index_name`.
     - `dimension`: Use `2048` as the feature dimension (since features from ResNet-50 are of size 2048).
     - `metric`: Set the metric to `cosine` for similarity search.
     - `spec`: Set up the serverless specifications for the index, using AWS as the cloud provider and the region from `config.pinecone_env`.
   - **Hint**: Use `pc.create_index()` for creating the new index.

4. **Get the Index Object**:
   - Whether the index is newly created or already exists, retrieve the index object for further operations.
   - **Hint**: Use `pc.Index(config.index_name)` to get the specific index.

5. **Test Pinecone Setup**:
   - A test function `test_setup_pinecone` is provided to check if everything is set up correctly. You don’t need to modify this part of the code.


### Return:

- The function returns the initialized index object that will be used for further operations.
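
The `cosine` metric chosen for the index compares the direction of two vectors and ignores their magnitude. The following sketch computes it in plain Python on toy vectors; it illustrates the metric itself and is not how Pinecone implements it internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])   # scaled copy
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # unrelated directions

print(same_direction, orthogonal)
```

Two feature vectors that point the same way score close to 1 even if one has larger values, which is often a reasonable property for image embeddings.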

In [None]:
def setup_pinecone(config: Config):
    """Initialize Pinecone database with new API"""
    try:
        # Task: Initialize Pinecone with the API key from the configuration
        # Hint: Use Pinecone() with the api_key parameter
        pc = 
        
        # Create index if it doesn't exist
        if config.index_name not in pc.list_indexes().names():
            # Task: Create a new index with the specified name, dimension, metric, and serverless spec
            # YOUR CODE GOES HERE
            
            
            print(f"Created new index: {config.index_name}")
        else:
            print(f"Using existing index: {config.index_name}")
        
        # Task: Get the index object for the specified index name
        # Hint: Use pc.Index() with the index name
        index = 
        
        # Test the Pinecone setup: Do not change the code below
        test_setup_pinecone(index, config, pc)
        
        return index
        
    except Exception as e:
        print(f"Error setting up Pinecone: {str(e)}")
        print("\nPlease ensure:")
        print("1. You have installed the latest pinecone-client package")
        print("2. Your API key is correct")
        print("3. You have sufficient permissions")
        raise

## Batch Process images to store into the database

**Task Hints**

- Step 1: Get Unique Images
Retrieve unique image names from the DataFrame.

- Step 2: Processing Images in Batches
Iterate over unique images in increments defined by `config.batch_size`.

- Step 3: Get a Batch of Unique Images
Extract a batch of unique images using slicing within the loop.

- Step 4: Extract Features from the Batch of Images
Process each image in the current batch to extract features.

- Step 5: Preprocess Image
Prepare the image using the `preprocess_image()` function.

- Step 6: Extract Features from the Preprocessed Image
Use `model.predict()` with `np.expand_dims()` to obtain image features.

- Step 7: Get Captions for the Image
Retrieve the associated captions for the current image from the DataFrame.

- Step 8: Create a Vector Dictionary
Construct a dictionary with keys `'id'`, `'values'`, and `'metadata'`.

- Step 9: Append the Vector to the List of Vectors
Add the vector dictionary to the `vectors` list.

- Step 10: Upsert the Batch of Vectors to the Pinecone Index
Upsert the prepared batch of vectors into the Pinecone index using `index.upsert()`.

- Step 11: Handle Exceptions
Implement error handling to catch and report exceptions during processing.

- Final Step: Testing the Batch Processing
Keep the test code unchanged at the end of the function for validation.
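
The batching and record-building steps above can be sketched in plain Python. The helper names `iter_batches` and `build_vector` are hypothetical, as are the file names and the choice of the filename as the vector `id`; the provided tests may expect a different id scheme, so treat this as a shape illustration only.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def build_vector(image_name, features, captions):
    """Build one record with the 'id', 'values', and 'metadata' keys."""
    return {
        "id": image_name,                          # assumed: filename as id
        "values": [float(v) for v in features],    # plain list of floats
        "metadata": {"captions": captions, "filename": image_name},
    }


image_names = [f"example_{n}.jpg" for n in range(5)]
batches = list(iter_batches(image_names, batch_size=2))
vector = build_vector(image_names[0], [0.1, 0.2, 0.3], ["a made-up caption"])

print([len(batch) for batch in batches])  # [2, 2, 1]
print(sorted(vector.keys()))              # ['id', 'metadata', 'values']
```

The last batch may be smaller than `batch_size`, which slicing handles without extra checks.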

In [None]:
def batch_process_images(model, config: Config, df: pd.DataFrame, index):
    """Process images in batches and store features"""
    
    # Get unique images
    unique_images = df['image'].unique()
    
    # Your code goes here
    
    # Process images in batches
    for i in tqdm(range(0, len(unique_images), config.batch_size)):
        
        # Task: Get a batch of unique images
        batch_images = 
        
        # Task: Extract features from the batch of images using the model
        
        # Task: Create a list of vectors with the image features and metadata
        vectors = []
        
        for img_name in batch_images:
            try:
                img_path = os.path.join(config.image_dir, img_name)
                if not os.path.exists(img_path):
                    print(f"Image not found: {img_path}")
                    continue
                
                # Task: Preprocess image
                # Hint: Use the preprocess_image() function
                img_array = 
                
                # Task: Extract features from the preprocessed image
                # Hint: Use model.predict() with np.expand_dims() to add a batch dimension
                features = 
                
                # Task: Get captions for the image
                # Hint: Use the DataFrame to filter captions for the image
                captions = 
                
                # Task: Create a vector dictionary with the image features and metadata
                # Hint: The vector should have keys 'id', 'values', and 'metadata'
                # Hint: The 'values' should be the features squeezed and converted to a list
                # Hint: The 'metadata' should include the 'captions' and 'filename'
                vector = 
                
                
                # Task: Append the vector to the list of vectors
                
                
            except Exception as e:
                print(f"Error processing {img_name}: {str(e)}")
                continue
        
        if vectors:
            try:
                # Task: Upsert the batch of vectors to the Pinecone index
                # Hint: Use the index.upsert() method with the vectors parameter
                
            except Exception as e:
                print(f"Error upserting batch to Pinecone: {str(e)}")
                
    # Test the batch processing: Do not change the code below
    test_batch_process_images(config, df, model, index)

In [None]:
def main():
    
    # Initialize model
    print("\nInitializing feature extractor...")
    model = initialize_feature_extractor(config)
    
    # Extract features
    features = extract_features(model, config, num_samples=100)
    
    # Visualize distributions
    if len(features) > 0:
        visualize_feature_distribution(features)
    else:
        print("No features were extracted successfully")
    
    # Setup Pinecone and process images
    print("\nSetting up Pinecone and processing images...")
    index = setup_pinecone(config)
    
    # Process images
    batch_process_images(model, config, df, index)
    
    print("\nProcessing complete!")

In [None]:
if __name__ == "__main__":
    main()

In [None]:
# Do not modify the code below
# Visualize Pinecone results: Fetches feature vectors for any 5 random images and visualizes them
visualize_pinecone_results(config)