# Generate Embeddings
---
### This notebook is used to generate embeddings for a given folder with a specific structure

The dataset folder should contain two sub-folders, one per class. For example:

    i) dataset/anomaly
    ii) dataset/non-anomaly

**The code takes every video ending in `.mp4` from both folders and generates embeddings for each one.**

### Importing necessary Libraries

To generate the embeddings of a video, we first create a `Window` object and then build a `WindowEmbedded` object from it, which holds the embeddings. This requires the following libraries:

In [1]:
import cv2
import numpy as np

import torch
import torch.nn as nn
import torch.nn.functional as F

from random import shuffle
from glob import glob
from tqdm import tqdm

### Device Selection

We use a GPU when one is available. The code first checks for the Metal accelerator (`mps`, used on Apple Silicon MacBooks such as the M1), then for `cuda`, and falls back to `cpu` if no accelerator is found. When `cuda` is selected, we also print the GPU properties using `nvidia-smi` and `torch.cuda.get_device_capability`.

In [2]:
# Select the device according to availability
device = torch.device("mps"
                      if torch.backends.mps.is_available()
                      else ("cuda" if torch.cuda.is_available() else "cpu")
                      )

print("Device selected:", device)

if device.type == "cuda":
    !nvidia-smi
    print()
    print("Device type:", device.type)
    print("Capability:", torch.cuda.get_device_capability(device))
else:
    print("Device capabilities are limited on MPSs and CPUs.")

Device selected: cuda
Mon Feb 19 13:30:44 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla P100-PCIE-16GB           Off | 00000000:00:04.0 Off |                    0 |
| N/A   39C    P0              27W / 250W |      2MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                              

### The Window Class

This class takes a video path and returns a `Window` object. A `Window` object is an iterator that steps through the video frames in a sliding-window fashion.

In [30]:
class Window:
    def __init__(self, video_path,
                 class_label_index,
                 true_class_name='anomaly', 
                 window_size=4, 
                 stride=2,
                 frame_average_count=5):
        """
        A class to manage sliding windows over video frames.

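        :param video_path (str): The path to the video file.
        :param class_label_index (int): The index of the class folder name in video_path.split('/').
        :param true_class_name (str): The folder name whose videos get class label 1 (default is 'anomaly').
        :param window_size (int): The size of the sliding window (default is 4).
        :param stride (int): The step size for moving the window (default is 2).
        :param frame_average_count (int): The number of frames to be averaged for each group (default is 5).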
        """
        # The duration of the video in seconds
        self.__video_duration = None
        # The size of the sliding window
        self.__window_size = window_size
        # The step size for moving the window
        self.__stride = stride
        # The number of frames to be averaged for each group
        self.__frame_average_count = frame_average_count
        # The path to the video file
        self.__video_path = video_path
        # An array containing processed frames from the video
        self.__total_frames = self.__get_frames()
        # The average frame of the video
        self.__averaged_frames = self.__get_average_frame()
        # A sliding window of frames
        self.__windows = self.__prepare_window()
        # The index of the current window in the windowed clip
        self.__window_index = 0
        # Class label of the video path
        self.class_label = 1 if video_path.split('/')[class_label_index] == true_class_name else 0

    def next(self):
        """
        This method strides to the next window in the windowed clip.
        :return: numpy.ndarray: The next window in the windowed clip.
        """
        if self.has_next():
            self.__window_index += 1
            return self.__windows[self.__window_index - 1]
        else:
            print("No next window")
            return None

    def has_next(self):
        """
        This method checks if there is a next window in the windowed clip.
        :return: bool: True if there is a next window, False otherwise.
        """
        return self.__window_index < len(self.__windows)

    def current_window(self):
        """
        This method returns the current window in the windowed clip.
        :return: numpy.ndarray: The current window in the windowed clip.
        """
        return self.__windows[self.__window_index]

    def previous(self):
        """
        This method strides to the previous window in the windowed clip.
        :return: numpy.ndarray: The previous window in the windowed clip.
        """
        if self.__window_index > 0:
            self.__window_index -= 1
            return self.__windows[self.__window_index]
        else:
            print("No previous window")
            return None

    def reset(self):
        """
        This method resets the window index to the beginning of the windowed clip.
        """
        self.__window_index = 0

    def __prepare_window(self):
        """
        Prepares a sliding window of frames from the average frame of the video.

        :returns:
            numpy.ndarray: A sliding window of frames.

        Notes:
            This method divides the average frame into overlapping windows of 'window_size' frames.
            The stride parameter determines the step size for moving the window.
            The number of steps is calculated based on the average frame shape and video duration.
            The resulting windows are stored in a numpy array.
        """
        window = []
        fps = self.__averaged_frames.shape[0] // self.__video_duration
        steps = int(round((fps * self.__window_size) // self.__stride))

        for i in range(0, len(self.__averaged_frames) - steps, steps):
            window.append(self.__averaged_frames[i: i + steps * 2])
        return np.array(window)

    def __get_average_frame(self):
        """
        Computes the average frame of the video by taking the mean of consecutive frames.

        :returns:
            numpy.ndarray: The average frame of the video.

        Notes:
            This method divides the video frames into groups, each containing 'frame_average_count' frames
            (the last group may be smaller).
            For each group, it computes the mean frame by averaging the pixel values of all frames in the group.
        """
        reduced_frames = []
        for i in range(0, len(self.__total_frames), self.__frame_average_count):
            frames = self.__total_frames[i:i + self.__frame_average_count]
            mean = np.mean(frames, axis=0)
            reduced_frames.append(mean)
        return np.array(reduced_frames)

    def __get_frames(self):
        """
        This method reads the video file and returns the frames

        :return:
            numpy.ndarray: An array containing processed frames from the video.
        :raises:
            IOError: If the video file cannot be read or does not exist.
        """
        video = cv2.VideoCapture(self.__video_path)
        if not video.isOpened():
            raise IOError("Error reading video file")

        frame_count = int(video.get(cv2.CAP_PROP_FRAME_COUNT))
        fps = video.get(cv2.CAP_PROP_FPS)
        self.__video_duration = frame_count // fps
        frames = []
        for i in range(0, frame_count):
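            # Seek to frame i (cv2.CAP_PROP_POS_FRAMES is the named constant for property id 1)
            video.set(cv2.CAP_PROP_POS_FRAMES, i)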
            ret, frame = video.read()
            if ret:
                frame = cv2.resize(frame, (224, 224))
                frame = frame.astype("float32") / 255.0
                frames.append(frame)
        video.release()
        return np.array(frames)

    def __iter__(self):
        """
        This method makes the object iterable.
        :return: window: The object itself.
        """
        return self  # Return self to make the object iterable

    def __next__(self):
        """
        This method returns the next window in the windowed clip. (used for iteration)
        :return: numpy.ndarray: The next window in the windowed clip.
        """
        if self.has_next():
            self.__window_index += 1
            return self.__windows[self.__window_index - 1]
        else:
            raise StopIteration

    def get_current_window_stats(self):
        """
        This method returns the stats of the current window.

        :return:
            dict: A dictionary containing the statistics of the current window.
        """
        return {
            "index": self.__window_index,
            "frame_count": len(self.__windows[self.__window_index]),
            "frame_shape": self.__windows[self.__window_index].shape,
            "frame_dtype": self.__windows[self.__window_index].dtype,
        }

    def __repr__(self):
        return (f"Window (\n\tVideo Path = {self.__video_path}\n"
                f"\tWindow Size = {self.__window_size}\n"
                f"\tStride = {self.__stride}\n"
                f"\tNo. Frames Averaged = {self.__frame_average_count}\n"
                f"\tTotal Frames = {len(self.__total_frames)}\n"
                f"\tAveraged Frames = {len(self.__averaged_frames)}\n"
                f"\tShape = {self.__windows.shape}\n)")

    @property
    def shape(self):
        return self.__windows.shape
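
The windowing arithmetic in `__prepare_window` can be traced with plain integers. The sketch below is a minimal re-implementation of just the index calculation, not the class itself. The helper name `window_bounds` and the 10-second, 30 fps clip are assumptions chosen for illustration.

```python
def window_bounds(total_frames, video_fps, window_size=4, stride=2, frame_average_count=5):
    """Return (start, end) indices into the averaged frames for each window.

    Hypothetical helper that mirrors the arithmetic used by the Window class.
    """
    # One averaged frame per group of `frame_average_count` frames (ceil division,
    # because the last group may be smaller)
    averaged = -(-total_frames // frame_average_count)
    # Duration in whole seconds, as computed in __get_frames
    duration = total_frames // video_fps
    # Averaged frames per second, as computed in __prepare_window
    fps = averaged // duration
    steps = int(round((fps * window_size) // stride))
    # Each window spans 2 * steps averaged frames and starts `steps` after the previous one
    return [(i, min(i + steps * 2, averaged)) for i in range(0, averaged - steps, steps)]


# Assumed example clip: 300 frames at 30 fps (10 seconds)
bounds = window_bounds(300, 30.0)
print(bounds)
```

For this assumed clip the 300 frames reduce to 60 averaged frames, `steps` is 12, and the result is 4 overlapping windows of 24 averaged frames each.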

### The Embedding Model


The embeddings are generated by a simple CNN with 3 convolutional layers and 1 fully connected layer, defined with PyTorch in the `CNN` class.

The model is defined as follows:
1. The first convolutional layer has 3 input channels, 32 output channels, a kernel size of 3, a stride of 1, and no padding.
2. The second convolutional layer has 32 input channels, 64 output channels, a kernel size of 3, a stride of 1, and no padding.
3. The third convolutional layer has 64 input channels, 128 output channels, a kernel size of 3, a stride of 1, and no padding.
4. The pooling layer, applied after each convolutional layer, is a max pooling layer with a kernel size of 2 and a stride of 2.
5. The fully connected layer has 128 * 26 * 26 input features and 1024 output features.
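
The `128 * 26 * 26` input size of the fully connected layer follows from the 224 x 224 frame size and the layer settings listed above. The sketch below checks it with the standard output-size formulas for convolution and pooling. The helper names `conv_out` and `pool_out` are illustrative and not part of the notebook.

```python
def conv_out(size, kernel=3, stride=1, padding=0):
    # Output size of a convolution along one spatial dimension
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel=2, stride=2):
    # Output size of max pooling along one spatial dimension
    return (size - kernel) // stride + 1


size = 224  # frames are resized to 224 x 224 in the Window class
for _ in range(3):  # three conv + pool stages
    size = pool_out(conv_out(size))

flattened = 128 * size * size  # 128 channels after the third convolution
print(size, flattened)
```

The spatial size goes 224, 222, 111, 109, 54, 52, 26, so a different frame size would require a different input size for `fc1`.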


**NOTE:
THIS IS A SIMPLE AND EXPERIMENTAL CNN MODEL FOR THE EMBEDDINGS. DON'T USE THIS MODEL FOR PRODUCTION. USE A BETTER MODEL FOR PRODUCTION**

In [31]:
class CNN(nn.Module):
    """
    A simple convolutional neural network (CNN) for generating embeddings from frames.

    Methods:
        forward(self, x): Defines the forward pass of the model.
    """
    def __init__(self):
        super(CNN, self).__init__()
        self.conv1 = nn.Conv2d(3, 32, 3, 1, 0, device=device)
        self.conv2 = nn.Conv2d(32, 64, 3, 1, 0, device=device)
        # Dropout layer (defined here but not applied in forward())
        self.d = nn.Dropout(0.5)
        self.conv3 = nn.Conv2d(64, 128, 3, 1, 0, device=device)
        # Pooling layer, shared by all three convolutional stages
        self.pool = nn.MaxPool2d(2, 2)
        # Fully connected layer
        self.fc1 = nn.Linear(128 * 26 * 26, 1024, device=device)  # Adjust input size based on your frame size

    def forward(self, x):
        """
        Forward pass of the CNN.

        Takes an input batch of images `x` and performs the following:
            1. Converts the input to the device (e.g., GPU).
            2. Applies ReLU activation to the output of the first convolutional layer.
            3. Performs max pooling with a kernel size of 2.
            4. Repeats steps 2 and 3 for the second and third convolutional layers.
            5. Flattens the output of the last pooling layer.
            6. Applies ReLU activation to the output of the fully-connected layer.
            7. Returns the final feature vector.

        Args:
            x (torch.Tensor): Input batch of images of shape (batch_size, channels, height, width).

        Returns:
            torch.Tensor: Output feature vector of shape (batch_size, 1024).
        """
        x = x.to(device)
        x = F.relu(self.conv1(x))
        x = self.pool(x)
        x = F.relu(self.conv2(x))
        x = self.pool(x)
        x = F.relu(self.conv3(x))
        x = self.pool(x)
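        # Flatten to (batch_size, 128 * 26 * 26); this works for any batch size
        x = x.view(-1, 128 * 26 * 26)
        # Equivalent alternative that does not hard-code the feature count:
        # x = x.view(x.size(0), -1)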
        x = F.relu(self.fc1(x))
        return x

### The WindowEmbedded Class

This class takes a `Window` object and computes its embeddings using the CNN model. It iterates over the sliding windows and passes every frame of each window through the CNN, producing one 1024-dimensional embedding per frame.

In [5]:
class WindowEmbedded:
    def __init__(self, windows: Window):
        self.__windows = windows
        self.__embedding_model = CNN().to(device)
        self.window_embeddings = self.__get_embeddings()
        self.class_label = windows.class_label

    def __get_embeddings(self):
        embeddings_list = []
        # Inference only, so gradient tracking is disabled
        with torch.no_grad():
            # for window in tqdm(self.__windows, desc="Extracting Embeddings", total=self.__windows.shape[0]):
            for window in self.__windows:
                # (frames, H, W, C) -> (frames, C, H, W)
                frames = torch.tensor(window).permute(0, 3, 1, 2).to(device)
                frame_embeddings_list = []
                for frame in frames:
                    # Add a batch dimension: (C, H, W) -> (1, C, H, W)
                    frame = frame.unsqueeze(0)
                    frame_embeddings = self.__embedding_model(frame)
                    frame_embeddings_list.append(frame_embeddings.flatten().cpu().numpy())
                embeddings_list.append(frame_embeddings_list)
        self.window_embeddings = np.array(embeddings_list)
        return self.window_embeddings

### Creating the Dataset

This function creates the dataset, which consists of two files:

- `/kaggle/working/embeddings.npy`
- `/kaggle/working/labels.npy`

`embeddings.npy` contains the embeddings of all the videos, and `labels.npy` contains the corresponding labels. Both are saved to the current working directory, which is `/kaggle/working` on Kaggle. We will feed these two files into our model for training.


In [27]:
def create_dataset(file_path: str,
                   class_label_index: int,
                   true_class_name: str = 'anomaly',
                   shuffle_data: bool = True,
                   video_ext: str = 'mp4',
                   save: bool = True,
                   checkpoints_count: int = 0):
    """
    This function will take the folder path and class_label_index as input and will return the embeddings and
    labels. We can also specify the true_class_name, shuffle_data and video_ext.

    Parameters
    ----------
    file_path : str
        The folder path containing the videos.
    class_label_index : int
        The index of the class name in the file path. For example, if the file path is
        'dataset/anomaly/1.mp4', then the class_label_index will be 1. The index starts from 0.
    true_class_name : str, optional
        The true class name, where the class label is 1.
        Default: 'anomaly'
    shuffle_data : bool, optional
        Whether to shuffle the data or not.
        Default: True
    video_ext : str, optional
        The extension of the videos.
        Default: 'mp4'
    save : bool, optional
        Whether to save the embeddings and labels or not.
        Default: True
    checkpoints_count : int, optional
        The number of successfully processed videos between checkpoints. 0 means no checkpoints.
        Default: 0

    Notes
    ----------
    The embeddings will be of shape [x, 4, 24, 1024] (videos, windows, frames per window, features) and labels
    will be of shape [x]. The labels will be 0 for non-anomaly and 1 for anomaly. The embeddings and labels will
    be saved in the current working directory as embeddings.npy and labels.npy. Some videos may have fewer than
    4 windows or 24 frames per window. This problem is solved by padding their embeddings with zeros to make
    them all the same shape using np.pad.

    Returns
    ----------
    all_embeddings : np.ndarray
        The embeddings of the videos.
    all_labels : np.ndarray
        The labels of the videos.
    """
    # Create glob path for the videos
    rgx = file_path + f'/*/*.{video_ext}'
    # extract the paths of the videos
    paths = glob(rgx)
    # Randomize the order of the videos
    if shuffle_data:
        shuffle(paths)

    # Embeddings and Labels
    all_embeddings = []
    all_labels = []

    # Variables for checkpoint
    # Save the embeddings and labels after every `checkpoints_count` videos
    count = 0        # videos processed since the last checkpoint
    checkpoints = 0  # number of checkpoints written so far
    # Iterate over the paths and extract the embeddings
    for video_path in tqdm(paths, desc="Extracting Embeddings"):
        if checkpoints_count != 0:
            if count == checkpoints_count:
                # create checkpoint
                np.save(f'embeddings_{checkpoints}.npy', all_embeddings)
                np.save(f'labels_{checkpoints}.npy', all_labels)
                checkpoints += 1
                count = 0
        try:
            window = Window(video_path, class_label_index, true_class_name=true_class_name)
            window_embed_object = WindowEmbedded(window)
            embeddings = window_embed_object.window_embeddings
            # Try to append the embeddings and labels
            all_embeddings.append(embeddings)
            all_labels.append(window.class_label)
            count += 1
        except ValueError:
            # print error in red
            print(f"\n\033[91mError windowing video: {video_path}\033[0m")
            continue

    # Validate the embeddings and labels
    assert len(all_embeddings) == len(all_labels)
    
    """
    Now we have created the list of embeddings and the list of labels. But, there might be cases where the number of
    windows, frames, or features in the embeddings is not the same for all the embeddings. Normally the shape should
    be [x, 4, 24, 1024], where x is the number of videos.
    
    To solve this problem, we will pad the embeddings with 0 where the number of windows, frames, or features is less
    than the maximum. This will make all the embeddings of the same shape.
    
    We will use np.pad to pad the embeddings with 0. keras.preprocessing.sequence.pad_sequences can also be used but
    it is slower than np.pad and also it is not recommended for 3D arrays.
    """


    # Pad the embeddings and labels
    max_windows = max([len(embeddings) for embeddings in all_embeddings])
    max_frames = max([len(embeddings[0]) for embeddings in all_embeddings])
    max_features = max([len(embeddings[0][0]) for embeddings in all_embeddings])

    # Pad the embeddings with 0 where the number of windows, frames, or features is less than the maximum
    padded_embeddings = []
    for embedding in tqdm(all_embeddings, desc="Padding Embeddings where necessary"):
        if embedding.shape[0] < max_windows or embedding.shape[1] < max_frames or embedding.shape[2] < max_features:
            pad_widths = [(0, max_windows - embedding.shape[0]),
                          (0, max_frames - embedding.shape[1]),
                          (0, max_features - embedding.shape[2])]
            padded_embedding = np.pad(embedding, pad_widths, mode='constant', constant_values=0.0)
            padded_embeddings.append(padded_embedding)
        else:
            padded_embeddings.append(embedding)

    padded_arrays = np.asarray(padded_embeddings, dtype='float32')

    # Save the embeddings and labels and return
    all_embeddings = np.array(padded_arrays)
    all_labels = np.array(all_labels)
    if save:
        np.save('embeddings.npy', all_embeddings)
        np.save('labels.npy', all_labels)
    return all_embeddings, all_labels
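
The padding step can be seen in isolation on small arrays. The sketch below is a minimal illustration with made-up shapes (8 features instead of 1024), and the helper name `pad_to` is an assumption. It pads a shorter embedding with trailing zeros so that it can be stacked with a full-size one.

```python
import numpy as np


def pad_to(embedding, target_shape):
    # Pad only at the end of each axis, up to the target size
    pad_widths = [(0, t - s) for s, t in zip(embedding.shape, target_shape)]
    return np.pad(embedding, pad_widths, mode="constant", constant_values=0.0)


# Illustrative embeddings: (windows, frames per window, features)
full = np.ones((4, 24, 8), dtype="float32")
short = np.ones((3, 20, 8), dtype="float32")

# Largest size along each axis across all embeddings
target = tuple(max(dims) for dims in zip(full.shape, short.shape))
batch = np.stack([pad_to(e, target) for e in (full, short)])
print(batch.shape)
```

After padding, the missing window and the missing frames of the shorter embedding are all zeros, while its original values are unchanged.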

#### This cell will initiate the dataset creation process

In [28]:
embeddings, labels = create_dataset(
    '/kaggle/input/fydp-dataset-v1/data',
    class_label_index=5,
    true_class_name='anomaly'
)

Extracting Embeddings:  10%|█         | 369/3655 [07:38<1:06:31,  1.21s/it]
Error windowing video: /kaggle/input/fydp-dataset-v1/data/anomaly/714_p2_Arson007_x264_016.mp4
Extracting Embeddings:  15%|█▍        | 541/3655 [11:19<1:09:38,  1.34s/it]
Error windowing video: /kaggle/input/fydp-dataset-v1/data/nonanomaly/Normal_Videos581_x264_008.mp4
Extracting Embeddings:  19%|█▉        | 692/3655 [14:38<1:05:03,  1.32s/it]
Error windowing video: /kaggle/input/fydp-dataset-v1/data/nonanomaly/Normal_Videos039_x264_005.mp4
Extracting Embeddings:  33%|███▎      | 1204/3655 [26:07<52:03,  1.27s/it]
Error windowing video: /kaggle/input/fydp-dataset-v1/data/anomaly/721_p2_Arson011_x264_000.mp4
Extracting Embeddings:  40%|███▉      | 1453/3655 [31:44<46:46,  1.27s/it]
Error windowing video: /kaggle/input/fydp-dataset-v1/data/anomaly/854_p3_Arson013_x264_006.mp4
Extracting Embeddings:  51%|█████     | 1868/3655 [41:03<36:32,  1.23s/it]
Error windowing video: /kaggle/input/fydp-dataset-v1/data/anomaly/74_Abuse004_x264_055.mp4
Extracting Embeddings:  52%|█████▏    | 1910/3655 [42:00<37:00,  1.27s/it]
Error windowing video: /kaggle/input/fydp-dataset-v1/data/anomaly/33_Arson016_x264_005.mp4
Extracting Embeddings:  54%|█████▍    | 1987/3655 [43:45<35:02,  1.26s/it]
Error windowing video: /kaggle/input/fydp-dataset-v1/data/anomaly/Fighting002_x264_008.mp4
Extracting Embeddings:  55%|█████▍    | 1993/3655 [43:53<34:19,  1.24s/it]
Error windowing video: /kaggle/input/fydp-dataset-v1/data/anomaly/RoadAccidents028_x264_001.mp4
Extracting Embeddings:  55%|█████▌    | 2027/3655 [44:38<33:49,  1.25s/it]
Error windowing video: /kaggle/input/fydp-dataset-v1/data/nonanomaly/Normal_Videos140_x264_005.mp4
Extracting Embeddings:  61%|██████    | 2237/3655 [49:22<30:40,  1.30s/it]
Error windowing video: /kaggle/input/fydp-dataset-v1/data/anomaly/785_p2_Arson026_x264_018.mp4
Extracting Embeddings:  71%|███████   | 2578/3655 [57:01<22:32,  1.26s/it]
Error windowing video: /kaggle/input/fydp-dataset-v1/data/anomaly/Shooting046_x264_016.mp4
Extracting Embeddings:  75%|███████▍  | 2724/3655 [1:00:16<19:24,  1.25s/it]
Error windowing video: /kaggle/input/fydp-dataset-v1/data/nonanomaly/890_p4_Arson013_x264_001.mp4
Extracting Embeddings:  78%|███████▊  | 2847/3655 [1:03:02<16:42,  1.24s/it]
Error windowing video: /kaggle/input/fydp-dataset-v1/data/anomaly/865_p3_Arson031_x264_003.mp4
Extracting Embeddings:  82%|████████▏ | 2989/3655 [1:06:11<13:21,  1.20s/it]
Error windowing video: /kaggle/input/fydp-dataset-v1/data/anomaly/155_p14_Abuse011_x264_001.mp4
Extracting Embeddings:  84%|████████▍ | 3079/3655 [1:08:12<13:06,  1.37s/it]
Error windowing video: /kaggle/input/fydp-dataset-v1/data/anomaly/RoadAccidents033_x264_001.mp4
Extracting Embeddings:  94%|█████████▎| 3422/3655 [1:15:52<05:07,  1.32s/it]
Error windowing video: /kaggle/input/fydp-dataset-v1/data/nonanomaly/Normal_Videos061_x264_058.mp4
Extracting Embeddings:  94%|█████████▍| 3448/3655 [1:16:27<04:14,  1.23s/it]
Error windowing video: /kaggle/input/fydp-dataset-v1/data/nonanomaly/623_Abuse045_x264_016.mp4
Extracting Embeddings:  97%|█████████▋| 3545/3655 [1:18:39<02:28,  1.35s/it]
Error windowing video: /kaggle/input/fydp-dataset-v1/data/anomaly/306_p1_Abuse025_x264_000.mp4
Extracting Embeddings: 100%|██████████| 3655/3655 [1:21:11<00:00,  1.33s/it]
Padding Embeddings where necessary: 100%|██████████| 3636/3636 [00:00<00:00, 475522.73it/s]
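
The value `class_label_index=5` used in the call above follows from how the `Window` class splits the video path on `/`. The sketch below reproduces only that labeling rule in a standalone helper. The name `class_label` is an assumption, and the example path is one of the paths shown in the output above.

```python
def class_label(video_path, class_label_index, true_class_name="anomaly"):
    # Same rule as in Window.__init__: label 1 if the folder name matches, else 0
    return 1 if video_path.split("/")[class_label_index] == true_class_name else 0


path = "/kaggle/input/fydp-dataset-v1/data/anomaly/Fighting002_x264_008.mp4"

# The leading "/" produces an empty first component at index 0
parts = path.split("/")
for index, part in enumerate(parts):
    print(index, repr(part))

label = class_label(path, class_label_index=5)
print("label:", label)
```

Because absolute paths start with `/`, the first component is an empty string, which shifts every folder name one position to the right. Any folder whose name differs from `true_class_name` is labeled 0.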


### Checking the shapes

This cell checks that the embeddings and labels arrays have the expected shapes.

In [34]:
print("Video Embeddings Shape: ", embeddings.shape)
print("Video Labels Shape: ", labels.shape)

Video Embeddings Shape:  (3636, 4, 24, 1024)
Video Labels Shape:  (3636,)
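
The cell above inspects the arrays returned in memory. To check the saved files themselves, they can be reloaded with `np.load`. The sketch below is a minimal example under stated assumptions: the helper name `check_saved_dataset` is hypothetical, and it runs on small placeholder arrays written to a temporary directory rather than on the real `embeddings.npy` and `labels.npy`.

```python
import os
import tempfile

import numpy as np


def check_saved_dataset(embeddings_path, labels_path):
    # Reload both files and confirm that they line up
    embeddings = np.load(embeddings_path)
    labels = np.load(labels_path)
    # Expect (videos, windows, frames per window, features)
    assert embeddings.ndim == 4
    # One label per video
    assert embeddings.shape[0] == labels.shape[0]
    # Labels are binary: 0 or 1
    assert set(np.unique(labels)) <= {0, 1}
    return embeddings.shape, labels.shape


# Placeholder data standing in for the real files
with tempfile.TemporaryDirectory() as tmp:
    emb_file = os.path.join(tmp, "embeddings.npy")
    lab_file = os.path.join(tmp, "labels.npy")
    np.save(emb_file, np.zeros((3, 4, 24, 16), dtype="float32"))
    np.save(lab_file, np.array([0, 1, 1]))
    shapes = check_saved_dataset(emb_file, lab_file)

print(shapes)
```

On the real dataset the same helper could be pointed at the two files saved by `create_dataset`.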
