## 1. Introduction

Video classification is a fundamental problem in computer vision with
applications in action recognition, surveillance, sports analytics, and
human–computer interaction. The task involves assigning semantic labels to
video sequences by analyzing spatial and temporal information present across
frames.

This assignment presents a comparative study of classical and deep learning
approaches for video classification. Classical methods rely on hand-crafted
features combined with traditional machine learning algorithms, while modern
deep learning approaches learn spatiotemporal representations directly from
data.

The objective of this work is to analyze the evolution from classical to
deep learning–based video classification techniques and to understand the
trade-offs in terms of performance, interpretability, and computational cost.


## 2. Dataset Description

A subset of the HMDB51 dataset was used for this study. HMDB51 is a widely
used benchmark for human action recognition, containing video clips in 51
action categories drawn from diverse real-world scenarios.

For this assignment, a binary classification task was formulated using the
following action classes:
- Walking
- Running

Each video is represented as a sequence of RGB frames stored in
class-specific folders. A reduced subset was chosen to keep training and
evaluation feasible under limited computational resources.
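
The folder-based storage described above could be indexed roughly as follows.
This is a minimal sketch: the layout `root/<class>/<video_id>/<frame>.jpg`, the
class folder names, and the helper `index_videos` are assumptions made for
illustration and are not taken from the study.

```python
from pathlib import Path

# Hypothetical class folder names; the real names depend on how the subset was exported.
CLASS_TO_LABEL = {"walk": 0, "run": 1}


def index_videos(root):
    """Return a list of (frame_paths, label) pairs, one per video folder.

    Assumes the layout root/<class>/<video_id>/<frame>.jpg.
    """
    samples = []
    for class_name, label in CLASS_TO_LABEL.items():
        class_dir = Path(root) / class_name
        if not class_dir.is_dir():
            continue
        for video_dir in sorted(p for p in class_dir.iterdir() if p.is_dir()):
            # Sorting by file name keeps frames in temporal order,
            # provided the frame names are zero-padded.
            frames = sorted(video_dir.glob("*.jpg"))
            if frames:
                samples.append((frames, label))
    return samples
```

Each returned pair holds the ordered frame paths of one video and its integer
class label, which is enough to drive either pipeline described in Section 3.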


## 3. Methodology

### 3.1 Classical Video Classification Approach

The classical approach follows a traditional video analytics pipeline
consisting of feature extraction, feature aggregation, and classification
using machine learning algorithms.

#### 3.1.1 Feature Extraction

Three types of features were extracted from video frames:

- **Color Features:** RGB color histograms were computed to capture global
  color distribution across frames.
- **Texture Features:** Local Binary Pattern (LBP) descriptors were used to
  capture texture information from grayscale frames.
- **Motion Features:** Frame differencing was applied between consecutive
  frames to estimate motion intensity.
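
The three frame-level descriptors can be sketched with NumPy alone. The
specific choices below (8 histogram bins per channel, the basic 8-neighbor LBP
with radius 1, and mean absolute difference as the motion measure) are
assumptions for illustration; the report does not state the exact parameters
used.

```python
import numpy as np


def color_histogram(frame, bins=8):
    """Per-channel RGB histogram, L1-normalized. frame: (H, W, 3) uint8."""
    hists = [
        np.histogram(frame[..., c], bins=bins, range=(0, 256))[0]
        for c in range(3)
    ]
    hist = np.concatenate(hists).astype(np.float64)
    return hist / hist.sum()


def to_gray(frame):
    """Luminance approximation using ITU-R BT.601 weights."""
    return frame.astype(np.float64) @ np.array([0.299, 0.587, 0.114])


def lbp_histogram(gray):
    """Basic 8-neighbor LBP (radius 1, no interpolation) as a 256-bin histogram."""
    h, w = gray.shape
    center = gray[1:-1, 1:-1]
    # Neighbor offsets in clockwise order, starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros(center.shape, dtype=np.int64)
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = gray[1 + dy : h - 1 + dy, 1 + dx : w - 1 + dx]
        codes |= (neighbor >= center).astype(np.int64) << bit
    hist = np.bincount(codes.ravel(), minlength=256).astype(np.float64)
    return hist / hist.sum()


def motion_intensity(prev_gray, gray):
    """Mean absolute difference between two consecutive grayscale frames."""
    return float(np.mean(np.abs(gray - prev_gray)))


def frame_features(frames):
    """Stack per-frame features for one video. frames: (T, H, W, 3) uint8.

    Returns an array of shape (T, 3 * 8 + 256 + 1). The motion value of the
    first frame is set to 0 because it has no predecessor.
    """
    feats, prev = [], None
    for frame in frames:
        gray = to_gray(frame)
        motion = 0.0 if prev is None else motion_intensity(prev, gray)
        feats.append(
            np.concatenate([color_histogram(frame), lbp_histogram(gray), [motion]])
        )
        prev = gray
    return np.stack(feats)
```

With these settings each frame yields a 281-dimensional vector: 24 color bins,
256 LBP bins, and one motion value.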

Frame-level features were aggregated temporally using statistical measures
such as mean and standard deviation to form a single video-level feature
vector.
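
As a minimal sketch of this aggregation step, assuming the frame-level features
of one video are available as a `(T, D)` array (the helper name
`aggregate_video` is hypothetical):

```python
import numpy as np


def aggregate_video(frame_feats):
    """Collapse (T, D) frame-level features into one (2 * D,) video-level vector."""
    frame_feats = np.asarray(frame_feats, dtype=np.float64)
    # Mean captures the typical appearance; standard deviation captures how
    # much each feature varies over the course of the video.
    return np.concatenate([frame_feats.mean(axis=0), frame_feats.std(axis=0)])
```

Videos of any length map to a vector of the same size, at the cost of
discarding the order in which frames occur.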


### 3.2 Deep Learning–Based Approach

Deep learning approaches were implemented to automatically learn
spatiotemporal representations from video data.

#### 3.2.1 2D CNN with Temporal Pooling

A pre-trained ResNet-18 model was used as a frame-level feature extractor.
Features from multiple frames were aggregated using temporal average pooling
to obtain a video-level representation. Transfer learning was employed by
initializing the model with ImageNet weights.

#### 3.2.2 3D Convolutional Neural Network

A lightweight 3D CNN was implemented to demonstrate joint spatial and
temporal feature learning using 3D convolutions. Due to computational and GPU
session constraints, the model was not trained at full scale; only its
architecture was demonstrated, with limited evaluation.
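
A lightweight 3D CNN of this kind might look like the following PyTorch sketch.
The layer sizes, the pooling layout, and the class name `Tiny3DCNN` are
illustrative assumptions and do not reproduce the architecture used in the
study.

```python
import torch
import torch.nn as nn


class Tiny3DCNN(nn.Module):
    """Two 3D convolution blocks followed by global pooling and a linear layer."""

    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            # Each 3x3x3 kernel spans time, height, and width at once.
            nn.Conv3d(3, 16, kernel_size=3, padding=1),
            nn.BatchNorm3d(16),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),  # downsample space only
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm3d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=2),  # downsample time and space
            nn.AdaptiveAvgPool3d(1),  # global spatiotemporal pooling
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, clips):
        # clips: (B, 3, T, H, W), channels before time as Conv3d expects
        x = self.features(clips).flatten(1)  # (B, 32)
        return self.classifier(x)  # (B, num_classes)
```

Unlike frame-wise 2D convolution followed by pooling, the kernels here mix
information from neighboring frames in every layer, so motion patterns can be
learned directly.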


## 4. Experimental Setup

The dataset was split into training and testing sets using an 80:20 ratio.
All video sequences were temporally normalized to a fixed number of frames
to enable batch processing.
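
Both steps can be sketched as follows. Uniform index sampling is one common way
to fix the clip length, and the split below is stratified by class; the report
states only the 80:20 ratio, so the sampling scheme, the stratification, and
both helper names are assumptions.

```python
import numpy as np


def sample_frame_indices(num_frames, target_len):
    """Pick target_len indices spread uniformly over a video of num_frames frames.

    Longer videos are subsampled; shorter videos repeat some frames.
    """
    if num_frames < 1:
        raise ValueError("video has no frames")
    return np.linspace(0, num_frames - 1, target_len).round().astype(int)


def split_indices(labels, test_fraction=0.2, seed=0):
    """Stratified shuffle split; returns (train_idx, test_idx)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train, test = [], []
    for cls in np.unique(labels):
        # Shuffle the videos of each class separately so that both splits
        # keep the original class balance.
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_test = int(round(test_fraction * len(idx)))
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return np.array(sorted(train)), np.array(sorted(test))
```

Splitting at the video level, before any frames are sampled, keeps frames of
the same video from appearing in both the training and the testing set.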

For classical methods, features were standardized before training machine
learning models including Support Vector Machines, Random Forests, and
Logistic Regression.
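
A possible scikit-learn formulation is sketched below. The use of
scikit-learn, the hyperparameters, and the helper names are assumptions; the
report names only the three model families and the standardization step.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def build_classifiers(seed=0):
    """Classical models, each preceded by feature standardization."""
    return {
        "svm": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
        "random_forest": make_pipeline(
            StandardScaler(),
            RandomForestClassifier(n_estimators=100, random_state=seed),
        ),
        "logistic_regression": make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000)
        ),
    }


def evaluate(models, X_train, y_train, X_test, y_test):
    """Fit each pipeline on the training split and return its test accuracy."""
    scores = {}
    for name, model in models.items():
        # The scaler inside the pipeline is fitted on the training data only,
        # so no test-set statistics leak into training.
        model.fit(X_train, y_train)
        scores[name] = model.score(X_test, y_test)
    return scores
```

Wrapping the scaler and the classifier in one pipeline ensures that the same
training-set mean and variance are applied to the test features.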

For deep learning models, training was performed using the Adam optimizer
and cross-entropy loss. Due to computational constraints, training was
limited to a small number of epochs.
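
The training procedure can be sketched in PyTorch as below. The framework, the
learning rate, batch size, epoch count, and the helper name `train` are all
assumptions, since the report does not list its hyperparameters.

```python
import torch
import torch.nn as nn


def train(model, inputs, labels, epochs=5, lr=1e-3, batch_size=8):
    """Minimal training loop; returns the mean loss of each epoch."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()  # expects raw logits and integer labels
    history = []
    model.train()
    for _ in range(epochs):
        perm = torch.randperm(len(inputs))  # reshuffle the data every epoch
        total = 0.0
        for start in range(0, len(inputs), batch_size):
            batch = perm[start : start + batch_size]
            optimizer.zero_grad()
            loss = criterion(model(inputs[batch]), labels[batch])
            loss.backward()
            optimizer.step()
            total += loss.item() * len(batch)
        history.append(total / len(inputs))
    return history
```

The returned per-epoch losses give a quick check that optimization is making
progress even when only a few epochs are affordable.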
