1. R(2+1)D (Residual (2+1)D ConvNet)
The R(2+1)D model is a convolutional neural network (CNN) for video that improves upon standard 3D CNNs by factorizing the 3D convolution operation.

Concept and Architecture
A standard 3D convolution (t, h, w) learns spatial and temporal features simultaneously from a video volume. The core idea of R(2+1)D is to break this down (factorize) into two separate, sequential steps:

2D Spatial Convolution: A convolution is applied across the height and width of each frame individually. This captures spatial features, similar to an image CNN. (Kernel size: (1, k, k))
1D Temporal Convolution: A second convolution is applied across the time axis. This captures how the spatial features change over time, learning motion. (Kernel size: (k, 1, 1))
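This "(2+1)D" block places an extra nonlinearity between the two convolutions and, according to the original paper, is easier to optimize than a single 3D convolution with a comparable number of parameters, which yields better accuracy. These blocks are then inserted into a standard ResNet (Residual Network) architecture, giving it the name R(2+1)D.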
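
The factorization can be sketched as a small PyTorch module. This is a minimal illustration, not the torchvision implementation: the class name `Conv2Plus1D`, the helper `midplanes`, and the batch norm plus ReLU placed between the two convolutions are assumptions made here. The intermediate channel count follows the parameter-matching rule described in the R(2+1)D paper, M = floor(t * d^2 * N_in * N_out / (d^2 * N_in + t * N_out)).

```python
import torch
import torch.nn as nn


def midplanes(in_ch: int, out_ch: int, t: int = 3, d: int = 3) -> int:
    """Intermediate channels chosen so the (2+1)D pair has roughly as many
    weights as one full t x d x d 3D convolution."""
    return (t * d * d * in_ch * out_ch) // (d * d * in_ch + t * out_ch)


class Conv2Plus1D(nn.Module):
    """Spatial (1, d, d) convolution followed by a temporal (t, 1, 1) convolution."""

    def __init__(self, in_ch: int, out_ch: int, t: int = 3, d: int = 3):
        super().__init__()
        mid = midplanes(in_ch, out_ch, t, d)
        self.spatial = nn.Conv3d(in_ch, mid, kernel_size=(1, d, d),
                                 padding=(0, d // 2, d // 2), bias=False)
        self.bn = nn.BatchNorm3d(mid)
        self.relu = nn.ReLU(inplace=True)
        self.temporal = nn.Conv3d(mid, out_ch, kernel_size=(t, 1, 1),
                                  padding=(t // 2, 0, 0), bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.relu(self.bn(self.spatial(x)))  # what is in each frame
        return self.temporal(x)                  # how it changes over time


block = Conv2Plus1D(64, 64)
full_3d = nn.Conv3d(64, 64, kernel_size=3, padding=1, bias=False)

clip = torch.randn(1, 64, 8, 28, 28)  # [Batch, Channels, Time, Height, Width]
out = block(clip)

conv_weights_2plus1d = block.spatial.weight.numel() + block.temporal.weight.numel()
conv_weights_3d = full_3d.weight.numel()
print(out.shape, conv_weights_2plus1d, conv_weights_3d)
```

With 64 input and 64 output channels the rule gives 144 intermediate channels, and both variants hold 110592 convolution weights, so any difference in accuracy does not come from extra capacity.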

Step-by-Step Workflow
Input: A video clip is provided as a tensor, typically in the shape [Batch, Channels, Time, Height, Width].
Factorized Convolutions: The video passes through a series of (2+1)D residual blocks. In each block, the model first learns spatial features (what's in the frame) and then temporal features (how it's moving).
Downsampling: Between stages, strided convolutions reduce the Time, Height, and Width dimensions, creating a compact feature representation.
Classification: Finally, the features are globally averaged and passed to a fully connected layer, which outputs a prediction (e.g., a probability for each action class).
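
The shape changes along this workflow can be traced with plain arithmetic. The sketch below assumes the stride layout of an 18-layer variant (a spatial-only stride of 2 in the stem, then a stride of 2 in time, height, and width at the start of each of the last three stages). These kernel, stride, and padding values are assumptions for illustration and should be checked against the model actually used.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_shapes(t: int, h: int, w: int):
    """Return a list of (stage name, (T, H, W)) for an assumed stride layout."""
    shapes = [("input", (t, h, w))]
    # Stem: 7x7 spatial kernel with stride 2; the time axis is untouched.
    h, w = conv_out(h, 7, 2, 3), conv_out(w, 7, 2, 3)
    shapes.append(("stem", (t, h, w)))
    # Stage 1 keeps the resolution; stages 2-4 each halve T, H and W.
    shapes.append(("stage1", (t, h, w)))
    for name in ("stage2", "stage3", "stage4"):
        t, h, w = (conv_out(n, 3, 2, 1) for n in (t, h, w))
        shapes.append((name, (t, h, w)))
    # Global average pooling collapses whatever is left to one vector.
    shapes.append(("global_avg_pool", (1, 1, 1)))
    return shapes


for name, shape in trace_shapes(16, 112, 112):
    print(f"{name:>16}: T x H x W = {shape}")
```

For a 16-frame 112x112 clip this layout ends at 2 x 7 x 7 before global averaging.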
PyTorch Example
This example uses a pre-trained R(2+1)D model from torchvision to classify a dummy video tensor.

Reusability and Fine-tuning
Feature Extractor: You can use the pre-trained R(2+1)D model to extract powerful spatio-temporal features from your videos. Simply remove the final classification layer (model.fc). The output features can then be used to train a simpler classifier (like an SVM or a small neural network) for a different task.
Fine-tuning: To adapt the model to your own dataset (e.g., for a specific set of actions), you can replace the final layer (model.fc) with a new one that matches your number of classes. You can then "fine-tune" the model by training it on your data with a small learning rate. You might choose to freeze the early layers and only train the later ones, or train the entire network.
2. SlowFast Network
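The SlowFast network is a dual-pathway architecture loosely inspired by the primate visual system, in which some cells respond to fine spatial detail but react slowly, while others respond to rapid temporal change but carry little spatial detail.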

Concept and Architecture
The model uses two parallel CNNs that process the video at different frame rates:

Slow Pathway: This is a deep and heavy 3D CNN (e.g., a 3D ResNet) that runs at a low frame rate. It processes a few sparsely sampled frames (e.g., 4 or 8 frames from a 32-frame clip). Its purpose is to capture the fine-grained spatial details and semantics of the scene (the "what").
Fast Pathway: This is a lightweight CNN with fewer channels and higher temporal resolution. It runs at a high frame rate, processing many frames (e.g., all 32 frames). Its purpose is to capture fast-changing motion (the "how" or "when").
Features from the high-motion Fast pathway are periodically fused into the Slow pathway via lateral connections, allowing the model to build a representation that understands both detail and motion.
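
One way to build such a lateral connection, described in the SlowFast paper as a time-strided convolution, is to convolve the Fast features with a temporal stride of alpha so that their time axis matches the Slow pathway, then concatenate along the channel axis. The sketch below is an illustration under assumptions: the class name `LateralConnection`, the channel counts, and the temporal kernel size of 5 are chosen here and are not taken from a specific library.

```python
import torch
import torch.nn as nn


class LateralConnection(nn.Module):
    """Fuse Fast-pathway features into the Slow pathway."""

    def __init__(self, fast_ch: int, alpha: int = 4, kernel_t: int = 5):
        super().__init__()
        # Temporal stride alpha shrinks the Fast time axis to the Slow one.
        self.conv = nn.Conv3d(fast_ch, 2 * fast_ch,
                              kernel_size=(kernel_t, 1, 1),
                              stride=(alpha, 1, 1),
                              padding=(kernel_t // 2, 0, 0),
                              bias=False)

    def forward(self, slow: torch.Tensor, fast: torch.Tensor) -> torch.Tensor:
        fused = self.conv(fast)
        # Channel-wise concatenation; the Fast pathway itself is left unchanged.
        return torch.cat([slow, fused], dim=1)


slow_feat = torch.randn(1, 64, 8, 14, 14)   # few frames, many channels
fast_feat = torch.randn(1, 8, 32, 14, 14)   # many frames, few channels
fused_slow = LateralConnection(fast_ch=8, alpha=4)(slow_feat, fast_feat)
print(fused_slow.shape)
```

The fusion is one-directional in this sketch: only the Slow pathway receives extra channels.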

Step-by-Step Workflow
Input Sampling: A raw video clip is sampled into two streams: a sparse stream for the Slow path and a dense stream for the Fast path.
Parallel Processing: Both streams are fed into their respective CNN backbones simultaneously.
Lateral Fusion: At several points in the network, the features from the Fast pathway are fused with the features from the Slow pathway.
Pooling and Concatenation: After the final stage, the feature maps of each pathway are globally average-pooled, and the two pooled feature vectors are concatenated.
Classification: The concatenated vector is passed to a fully connected classifier to make the final prediction.
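
The input sampling step can be illustrated with frame indices alone. This is a minimal sketch assuming uniform strided sampling; the function name `sample_pathway_indices` is hypothetical, and real data loaders may pick the Slow frames differently.

```python
def sample_pathway_indices(num_frames: int, alpha: int):
    """Return (slow_indices, fast_indices) into a clip of num_frames frames.

    The Fast pathway sees every frame; the Slow pathway sees every
    alpha-th frame, so it receives num_frames / alpha frames.
    """
    fast = list(range(num_frames))
    slow = list(range(0, num_frames, alpha))
    return slow, fast


slow_idx, fast_idx = sample_pathway_indices(num_frames=32, alpha=4)
print("slow:", slow_idx)
print("fast:", len(fast_idx), "frames")
```

With 32 frames and alpha = 4 the Slow pathway receives 8 frames, which matches the 8 and 32 frame inputs used for the two pathways.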
PyTorch Example
This example uses a pre-trained SlowFast model from PyTorch Hub. The key difference is that the model expects a list of two tensors as input: one for the Slow path and one for the Fast path.

In [26]:
import torch
from torchvision.models.video import r2plus1d_18, R2Plus1D_18_Weights

print("--- R(2+1)D Example ---")

# 1. Load a pre-trained R(2+1)D model
# This model was trained on the Kinetics-400 dataset.
weights = R2Plus1D_18_Weights.KINETICS400_V1
model = r2plus1d_18(weights=weights)
model.eval() # Set the model to evaluation mode

# 2. Get the model-specific preprocessing function
# This handles normalization, resizing, etc.
preprocess = weights.transforms()

# 3. Create a dummy video tensor
# The transform expects a tensor of shape (T, C, H, W)
# Let's simulate a clip of 16 frames, 3 channels (RGB), and 180x180 resolution.
dummy_video_frames = torch.randint(0, 256, (16, 3, 180, 180), dtype=torch.uint8)

# 4. Preprocess the video
# The transform resizes, center-crops to 112x112, normalizes,
# and changes the shape to (C, T, H, W)
processed_video = preprocess(dummy_video_frames)

# Add a batch dimension to create a batch of 1 video
processed_video = processed_video.unsqueeze(0) # Shape: [1, 3, 16, 112, 112]

# 5. Make a prediction
with torch.no_grad(): # Disable gradient calculation for inference
    prediction = model(processed_video)

# 6. Interpret the output
# The output is a tensor of logits for the 400 Kinetics classes
predicted_class_id = torch.argmax(prediction, dim=1).item()
kinetics_classes = weights.meta["categories"]
predicted_class_name = kinetics_classes[predicted_class_id]

print(f"Input video shape after preprocessing: {processed_video.shape}")
print(f"Predicted Class ID: {predicted_class_id}")
print(f"Predicted Class Name (from a random tensor): {predicted_class_name}\n")


--- R(2+1)D Example ---
Input video shape after preprocessing: torch.Size([1, 3, 16, 112, 112])
Predicted Class ID: 37
Predicted Class Name (from a random tensor): brushing teeth



In [8]:
!pip install fvcore

Collecting fvcore
  Using cached fvcore-0.1.5.post20221221.tar.gz (50 kB)
  Preparing metadata (setup.py) ... done
Collecting yacs>=0.1.6 (from fvcore)
  Using cached yacs-0.1.8-py3-none-any.whl.metadata (639 bytes)
Collecting termcolor>=1.1 (from fvcore)
  Using cached termcolor-3.1.0-py3-none-any.whl.metadata (6.4 kB)
Collecting iopath>=0.1.7 (from fvcore)
  Using cached iopath-0.1.10.tar.gz (42 kB)
  Preparing metadata (setup.py) ... done
Collecting portalocker (from iopath>=0.1.7->fvcore)
  Using cached portalocker-3.2.0-py3-none-any.whl.metadata (8.7 kB)
Using cached termcolor-3.1.0-py3-none-any.whl (7.7 kB)
Using cached yacs-0.1.8-py3-none-any.whl (14 kB)
Using cached portalocker-3.2.0-py3-none-any.whl (22 kB)
Building wheels for collected packages: fvcore, iopath
  Building wheel for fvcore (setup.py) ... done
  Created wheel for fvcore: filename=fvcore-0.1.5.post20221221-py3-none-any.whl size=61396 sha256=fec48ec237a07dae6f8d69b586b1dea0cf1e9cd4fce59612b

In [24]:
import torch

print("--- SlowFast Example ---")

# 1. Load a pre-trained SlowFast model from PyTorch Hub
# This model was also trained on Kinetics-400
model = torch.hub.load('facebookresearch/pytorchvideo', 'slowfast_r50', pretrained=True)
model.eval() # Set the model to evaluation mode

# 2. Create dummy video tensors for the two pathways
# In a real application, a special dataloader would sample a video clip
# into these two tensors. Here, we create them manually.

# Slow pathway input: low frame rate. e.g., 8 frames.
# Shape: [Batch, Channels, Time, Height, Width]
slow_path_video = torch.randn(1, 3, 8, 256, 256)

# Fast pathway input: high frame rate. The temporal dimension is `alpha` times
# the slow path's. Here, alpha is 4, so 8 * 4 = 32 frames.
fast_path_video = torch.randn(1, 3, 32, 256, 256)

# The model expects a list of tensors: [slow_input, fast_input]
list_of_tensors = [slow_path_video, fast_path_video]

# 3. Make a prediction
with torch.no_grad():
    prediction = model(list_of_tensors)

# 4. Interpret the output
predicted_class_id = torch.argmax(prediction, dim=1).item()

print(f"Input video shapes: {[t.shape for t in list_of_tensors]}")
print(f"Predicted Class ID (from a random tensor): {predicted_class_id}\n")


--- SlowFast Example ---


Using cache found in /Users/sanjeev/.cache/torch/hub/facebookresearch_pytorchvideo_main


Input video shapes: [torch.Size([1, 3, 8, 256, 256]), torch.Size([1, 3, 32, 256, 256])]
Predicted Class ID (from a random tensor): 358



Reusability and Fine-tuning
Feature Extractor: The concatenated features from both pathways (before the final classifier) provide a very rich video representation that captures both appearance and motion. This is excellent for downstream tasks.
Fine-tuning: You can replace the final classification head (model.blocks[6].proj) and fine-tune the model on a new dataset. Given its two-pathway nature, you could experiment with freezing one path while training the other, depending on whether your task is more motion- or appearance-dependent.
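
Replacing the head can be sketched as a small helper. This is an illustration under assumptions: `replace_head` and the `_StandIn` model are hypothetical, and the stand-in only mimics the `blocks[6].proj` attribute path so the sketch runs without downloading the real network. With the real model, pass the object returned by `torch.hub.load` instead.

```python
import torch
import torch.nn as nn


def replace_head(model: nn.Module, num_classes: int,
                 freeze_backbone: bool = True) -> nn.Module:
    """Swap model.blocks[6].proj for a new linear layer with num_classes outputs."""
    if freeze_backbone:
        for param in model.parameters():
            param.requires_grad = False
    old_proj = model.blocks[6].proj
    # The new layer is created after freezing, so it stays trainable.
    model.blocks[6].proj = nn.Linear(old_proj.in_features, num_classes)
    return model


class _Head(nn.Module):
    """Stand-in for the classification head (hypothetical sizes)."""

    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(2304, 400)


class _StandIn(nn.Module):
    """Stand-in exposing the same attribute path as the hub model."""

    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList([nn.Identity() for _ in range(6)] + [_Head()])


model = replace_head(_StandIn(), num_classes=5)
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(model.blocks[6].proj, trainable)
```

To train one pathway while freezing the other, the same `requires_grad` switch can be applied to the parameters of the corresponding sub-modules once their names have been inspected with `named_parameters()`.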
3. TimeSformer (Time-Space Transformer)
TimeSformer adapts the highly successful Vision Transformer (ViT) architecture from images to videos, using self-attention instead of convolutions.

Concept and Architecture
The main challenge in applying a Transformer to video is the computational cost of self-attention, which grows quadratically with the number of tokens when every patch in every frame attends to every other one. TimeSformer introduces an efficient scheme called Divided Space-Time Attention.

Patching: Each frame of the input video is broken down into a grid of non-overlapping 2D patches (e.g., 16x16 pixels), so the clip becomes a sequence of patches indexed by frame and spatial position.
Embedding: Each patch is flattened and linearly embedded into a vector. A learnable positional embedding is added to retain the patch's location in space and time.
Divided Attention: Instead of one massive self-attention calculation, each Transformer block performs two sequential, more manageable steps:
Temporal Attention: First, self-attention is calculated only along the time axis. Each patch attends to other patches at the same spatial location but in different frames. This captures how patches evolve over time.
Spatial Attention: Second, self-attention is calculated across all spatial locations within the same frame. This captures spatial relationships within a frame and, unlike a convolution's local receptive field, spans the whole frame in a single layer.
This factorization makes the self-attention mechanism computationally feasible for video.
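
The saving can be estimated by counting query-key comparisons. The sketch below assumes each frame is split into 16x16 patches and ignores the classification token; the function name `attention_pairs` and the clip size are illustrative choices.

```python
def attention_pairs(frames: int, height: int, width: int, patch: int):
    """Count query-key comparisons per layer for joint vs. divided attention."""
    n = (height // patch) * (width // patch)  # patches per frame
    tokens = frames * n                       # patches in the whole clip
    joint = tokens * tokens                   # every token attends to every token
    divided = tokens * (frames + n)           # F temporal + N spatial keys per token
    return n, tokens, joint, divided


n, tokens, joint, divided = attention_pairs(frames=8, height=224, width=224, patch=16)
print(f"patches per frame: {n}, tokens: {tokens}")
print(f"joint: {joint}, divided: {divided}, ratio: {joint / divided:.1f}x")
```

For 8 frames at 224x224 this gives 1568 tokens, 2458624 comparisons for joint attention and 319872 for divided attention, and the gap widens as the number of frames or the resolution grows.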

Step-by-Step Workflow
Tokenization: A video clip is converted into a sequence of patch embeddings. A special [CLS] (classification) token is added to the beginning of the sequence.
Transformer Encoder: The sequence of embeddings is processed by a series of Transformer blocks, each applying temporal attention followed by spatial attention.
Aggregation: As the sequence passes through the layers, the [CLS] token aggregates information from all other patch tokens.
Classification: The final output embedding corresponding to the [CLS] token is fed into a small MLP head to produce the classification result.
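
The temporal-then-spatial step inside each encoder block can be sketched with two standard attention layers and some reshaping. This is a simplified illustration, not the reference implementation: the class name `DividedSpaceTimeAttention` is chosen here, and the [CLS] token, the MLP sub-layer, and dropout are left out.

```python
import torch
import torch.nn as nn


class DividedSpaceTimeAttention(nn.Module):
    """Temporal attention followed by spatial attention over patch tokens."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.norm_t = nn.LayerNorm(dim)
        self.attn_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.attn_s = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [Batch, Frames, Patches per frame, Embedding dim]
        b, f, n, d = x.shape

        # Temporal attention: one length-F sequence per spatial location.
        xt = x.permute(0, 2, 1, 3).reshape(b * n, f, d)
        h = self.norm_t(xt)
        xt = xt + self.attn_t(h, h, h, need_weights=False)[0]
        x = xt.reshape(b, n, f, d).permute(0, 2, 1, 3)

        # Spatial attention: one length-N sequence per frame.
        xs = x.reshape(b * f, n, d)
        h = self.norm_s(xs)
        xs = xs + self.attn_s(h, h, h, need_weights=False)[0]
        return xs.reshape(b, f, n, d)


tokens = torch.randn(2, 4, 9, 32)  # 2 clips, 4 frames, 3x3 patches, 32-dim embeddings
block = DividedSpaceTimeAttention(dim=32, heads=4)
out = block(tokens)
print(out.shape)
```

Folding the patch axis into the batch for the temporal step, and the frame axis into the batch for the spatial step, is what restricts each attention call to a single axis.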
PyTorch Example
This example uses the timm (PyTorch Image Models) library, which provides an easy-to-use implementation of TimeSformer.

Note on Input Shape: torchvision models typically use (B, C, T, H, W). The timm implementation of TimeSformer expects (B, T, C, H, W) to be more consistent with NLP Transformers where the sequence length (Time) is the second dimension.