# Convolutional Neural Networks (CNNs)

## Table of Contents

1. [Computer Vision](#1-Computer-Vision)  
   &nbsp;&nbsp;1.1. [Image Data](#11-Image-Data)  
2. [Convolutional Filters](#2-Convolutional-Filters)  
   &nbsp;&nbsp;2.1. [Feature Detectors](#21-Feature-Detectors)  
   &nbsp;&nbsp;2.2. [Translation Equivariance](#22-Translation-Equivariance)  
   &nbsp;&nbsp;2.3. [Padding](#23-Padding)  
   &nbsp;&nbsp;2.4. [Strided Convolutions](#24-Strided-Convolutions)  
   &nbsp;&nbsp;2.5. [Multi-dimensional Convolutions](#25-Multi-dimensional-Convolutions)  
   &nbsp;&nbsp;2.6. [Pooling](#26-Pooling)  
   &nbsp;&nbsp;2.7. [Multilayer Convolutions](#27-Multilayer-Convolutions)  
   &nbsp;&nbsp;2.8. [Example Network Architectures](#28-Example-Network-Architectures)  
3. [Visualizing Trained CNNs](#3-Visualizing-Trained-CNNs)  
   &nbsp;&nbsp;3.1. [Visual Cortex](#31-Visual-Cortex)  
   &nbsp;&nbsp;3.2. [Visualizing Trained Filters](#32-Visualizing-Trained-Filters)  
   &nbsp;&nbsp;3.3. [Saliency Maps](#33-Saliency-Maps)   
4. [Object Detection](#4-Object-Detection)  
   &nbsp;&nbsp;4.1. [Bounding Boxes](#41-Bounding-Boxes)  
   &nbsp;&nbsp;4.2. [Intersection-over-Union (IoU)](#42-Intersection-over-Union-IoU)  
   &nbsp;&nbsp;4.3. [Sliding Windows](#43-Sliding-Windows)  
   &nbsp;&nbsp;4.4. [Detection Across Scales](#44-Detection-Across-Scales)  
   &nbsp;&nbsp;4.5. [Non-Max Suppression (NMS)](#45-Non-Max-Suppression-NMS)  
   &nbsp;&nbsp;4.6. [Fast Region CNNs (R-CNN family)](#46-Fast-Region-CNNs-R-CNN-family)  
5. [Image Segmentation](#5-Image-Segmentation)  
   &nbsp;&nbsp;5.1. [Convolutional Segmentation](#51-Convolutional-Segmentation)  
   &nbsp;&nbsp;5.2. [Up-sampling](#52-Up-sampling)  
   &nbsp;&nbsp;5.3. [Fully Convolutional Networks (FCNs) & Transpose Convolution](#53-Fully-Convolutional-Networks-FCNs--Transpose-Convolution)  
   &nbsp;&nbsp;5.4. [The U-Net Architecture](#54-The-U-Net-Architecture)  

---


## 1. Computer Vision
<a id='1-Computer-Vision'></a>

* **What is it?** Computer vision is a field focused on enabling computers to "see" and interpret images and videos, much like humans do. It is one of the areas where machine learning, and deep learning in particular, has had the greatest impact.

* **Why CNNs?** While traditional methods involved hand-crafting features, CNNs learn these features automatically from data, leading to better performance in many tasks.

* **Common Applications in Biomedical Engineering (and beyond):**

  * **Image Classification:** E.g., classifying medical scans as benign or malignant.

  * **Object Detection:** E.g., identifying cells or anomalies in microscopy images.

  * **Image Segmentation:** E.g., outlining organs or tumors in MRI/CT scans.

  * **Caption Generation:** Describing medical images.

  * **Image Synthesis:** Generating synthetic medical images for training or data augmentation.

  * **Inpainting:** Reconstructing missing parts of an image (e.g., removing artifacts).

  * **Super-resolution:** Enhancing the resolution of medical images.

  * **Depth Prediction:** Estimating 3D structure from 2D medical images.

  * **Scene Reconstruction:** Building 3D models from multiple 2D scans.


### 1.1 Image Data
<a id='11-Image-Data'></a>

* **Structure:** Images are grids of pixels. Each pixel has intensity values (e.g., grey-scale) or color values (e.g., Red, Green, Blue - RGB channels).

  * Medical images can also be 3D (voxels, like in MRI) or sequences over time (videos).

* **Challenges with Standard Neural Networks:**

  * **High Dimensionality:** A fully connected layer needs one weight per input value per hidden unit, so the parameter count grows with the number of pixels and quickly becomes impractical.

  * **Ignoring Structure:** Standard networks treat pixels as unstructured data, missing the crucial spatial relationships (nearby pixels are often correlated). Randomly shuffling pixels makes an image unrecognizable.

* **CNNs Exploit Image Structure:** CNNs are designed with inductive biases that leverage the structured nature of images, leading to:

  * Fewer parameters.

  * Better generalization (performing well on unseen data).
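
To make the dimensionality problem concrete, the sketch below counts the parameters of a single fully connected layer applied to a flattened image. The image sizes and the 1,000 hidden units are illustrative assumptions, not values from these notes.

```python
def fully_connected_params(height, width, channels, hidden_units):
    """Weights plus biases for one dense layer applied to a flattened image."""
    inputs = height * width * channels
    return (inputs + 1) * hidden_units  # +1 for each unit's bias


# Illustrative image sizes (assumptions), each feeding 1,000 hidden units.
for side in (32, 224, 1000):
    count = fully_connected_params(side, side, channels=3, hidden_units=1000)
    print(f"{side}x{side} RGB image -> {count:,} parameters")
```

Even the first layer of such a network reaches billions of parameters for a megapixel image, which is the motivation for the local, shared weights described next.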


## 2. Convolutional Filters
<a id='2-Convolutional-Filters'></a>

* **Motivation:** Standard fully connected networks are impractical for large images due to the huge number of parameters and the need to learn invariances (e.g., an object is the same regardless of its position) from scratch.

* **Key Concepts for CNNs:**

  * **Hierarchy:** Features are learned in a hierarchy – simple features (edges) combine to form more complex features (shapes, parts of objects), which then form objects.

  * **Locality:** Initial feature detection happens in small local regions of an image.

  * **Equivariance:** If an input shifts, the feature representation also shifts.

  * **Invariance:** The final output (e.g., classification) should be unchanged by certain transformations (like small shifts).


### 2.1 Feature Detectors
<a id='21-Feature-Detectors'></a>

* **Receptive Field:** A unit in a CNN's first layer looks at a small patch of the input image, called its receptive field.

  <img src="image/Figure_1_a.png" width="250px">

  *((a) A hidden unit receives input from a 3x3 patch, its receptive field.)*

* **Kernel/Filter:** The weights for this unit also form a small grid, called a filter or kernel. The filter acts as a detector for a specific low-level feature (e.g., an edge, a corner); in a CNN its weights are learned from data rather than designed by hand.

  <img src="image/Figure_1_b.png" width="250px">

  *((b) The weights (kernel) for the hidden unit.)*

* **How it works:** The unit calculates a weighted sum of the pixel values in its receptive field (plus a bias) and applies an activation function (commonly ReLU).

  * $z = \text{ReLU}(w^T x + w_0)$

* **Feature Detection:** The unit gives a high output when the image patch it's looking at "matches" its kernel.
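
A minimal sketch of a single feature-detecting unit, computing $z = \text{ReLU}(w^T x + w_0)$ on one 3x3 patch. The kernel and the two patches are hypothetical values chosen so that one patch matches the kernel and the other does not.

```python
def relu(value):
    """Rectified linear unit: max(0, value)."""
    return max(0.0, value)


def unit_output(patch, kernel, bias=0.0):
    """Weighted sum of a patch with a same-sized kernel, plus bias, then ReLU."""
    total = bias
    for patch_row, kernel_row in zip(patch, kernel):
        for pixel, weight in zip(patch_row, kernel_row):
            total += pixel * weight
    return relu(total)


# Hypothetical kernel that responds to a dark-to-bright vertical edge.
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

edge_patch = [[0, 0, 1],
              [0, 0, 1],
              [0, 0, 1]]  # matches the kernel
flat_patch = [[1, 1, 1],
              [1, 1, 1],
              [1, 1, 1]]  # uniform region, no edge

print(unit_output(edge_patch, kernel))  # 3.0
print(unit_output(flat_patch, kernel))  # 0.0
```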


### 2.2 Translation Equivariance
<a id='22-Translation-Equivariance'></a>

* **The Idea:** If a feature (like an eye) appears in one part of an image, the same pattern of pixels in another part should also represent an eye.

* **Weight Sharing:** To achieve this, the same filter (set of weights) is applied across different locations in the image. This is the core idea of **convolution**.

  * Units in a layer that share the same weights form a **feature map**.

  * This drastically reduces the number of parameters to learn.

  * Connections are **sparse** (each unit only connects to a local patch).

  <img src="image/Figure_2.png" width="250px">

  *(Illustration of 1D convolution with shared weights (same color = same weight).)*

* **2D Convolution:** For an image $I$ and a filter $K$, the feature map $C$ is given by:

  * $C(j,k) = \sum_l \sum_m I(j+l, k+m) K(l,m)$ 
  
  (this is technically cross-correlation but called convolution in ML).

  <img src="image/Figure_3.png" width="550px">

  *(Example of a 3x3 image convolved with a 2x2 filter, producing a 2x2 feature map.)*

* **Example: Edge Detection:**

  * Hand-crafted filters can detect edges:

    * Vertical edge filter:

      | -1 |  0 |  1 |
      |----|----|----|
      | -1 |  0 |  1 |
      | -1 |  0 |  1 |

    * Horizontal edge filter:

      | -1 | -1 | -1 |
      |----|----|----|
      |  0 |  0 |  0 |
      |  1 |  1 |  1 |

  <img src="image/Figure_4.png" width="550px">

  *((a) Original image, (b) vertical edge detection, (c) horizontal edge detection.)*

* **Advantages of Convolutional Structure:**

  * Sparse connections (fewer weights).

  * Shared weights (fewer parameters, less data needed).

  * Can be applied to images of different sizes.

  * Efficient computation on GPUs.
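
The 2D convolution formula and the two hand-crafted edge filters above can be sketched in plain Python. This is a deliberately simple, unoptimized illustration (no padding, stride 1); the small test image with a dark left half and a bright right half is an assumption made for the example.

```python
def conv2d(image, kernel):
    """Valid cross-correlation: C(j, k) = sum_l sum_m I(j+l, k+m) K(l, m)."""
    rows, cols = len(image), len(image[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    feature_map = []
    for j in range(rows - k_rows + 1):
        out_row = []
        for k in range(cols - k_cols + 1):
            out_row.append(sum(image[j + l][k + m] * kernel[l][m]
                               for l in range(k_rows)
                               for m in range(k_cols)))
        feature_map.append(out_row)
    return feature_map


vertical_edge = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]
horizontal_edge = [[-1, -1, -1],
                   [0, 0, 0],
                   [1, 1, 1]]

# Toy 4x6 image: dark left half, bright right half (one vertical edge).
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]

print(conv2d(image, vertical_edge))    # [[0, 3, 3, 0], [0, 3, 3, 0]]
print(conv2d(image, horizontal_edge))  # [[0, 0, 0, 0], [0, 0, 0, 0]]
```

The same nine weights are reused at every position, so the vertical-edge response appears wherever the edge is, and the horizontal filter stays silent because the image contains no horizontal edge.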


### 2.3 Padding
<a id='23-Padding'></a>

* **Problem:** Without padding, convolution shrinks the feature map: a $J \times K$ image convolved with an $M \times M$ filter gives a $(J-M+1) \times (K-M+1)$ map.

* **Solution:** Add extra pixels (padding) around the image border.

  * **Valid Convolution:** No padding ($P=0$).

  * **Same Convolution:** Padding chosen so output map has the same size as input ($P=(M-1)/2$ for odd $M$). Commonly padded with zeros.

  <img src="image/Figure_5.png" width="350px">

  *(A 4x4 image padded to become a 6x6 image.)*
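
A small sketch of zero padding, assuming a stride of 1 and an odd filter size. The helper names `zero_pad` and `same_padding` are hypothetical and introduced only for this example.

```python
def same_padding(kernel_size):
    """Padding P = (M - 1) / 2 that preserves size for an odd M x M filter."""
    if kernel_size % 2 == 0:
        raise ValueError("this simple rule assumes an odd filter size")
    return (kernel_size - 1) // 2


def zero_pad(image, pad):
    """Surround a 2D image (list of rows) with `pad` rows/columns of zeros."""
    padded_width = len(image[0]) + 2 * pad
    padded = [[0] * padded_width for _ in range(pad)]
    for row in image:
        padded.append([0] * pad + list(row) + [0] * pad)
    padded.extend([0] * padded_width for _ in range(pad))
    return padded


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]

padded = zero_pad(image, same_padding(3))
print(len(padded), len(padded[0]))  # 6 6
print(padded[1])                    # [0, 1, 2, 3, 4, 0]
```

A valid 3x3 convolution on the padded 6x6 image yields a 4x4 map (6 - 3 + 1 = 4), the same size as the original input.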


### 2.4 Strided Convolutions
<a id='24-Strided-Convolutions'></a>

* **Purpose:** To reduce the size of feature maps significantly.

* **How it works:** Instead of moving the filter one pixel at a time, move it in larger steps (stride $S$).

* Output size with stride $S$ and padding $P$:

  * $\lfloor (J+2P-M)/S + 1 \rfloor \times \lfloor (K+2P-M)/S + 1 \rfloor$ 
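
The output-size formula can be checked with a few lines of code. The function name and the example sizes are illustrative assumptions; the arithmetic is the formula above applied to one spatial dimension.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """floor((J + 2P - M) / S) + 1 for one spatial dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


print(conv_output_size(224, kernel_size=3, stride=1, padding=1))  # 224 ("same")
print(conv_output_size(224, kernel_size=3, stride=1, padding=0))  # 222 ("valid")
print(conv_output_size(224, kernel_size=3, stride=2, padding=1))  # 112
print(conv_output_size(7, kernel_size=3, stride=2, padding=0))    # 3
```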


### 2.5 Multi-dimensional Convolutions
<a id='25-Multi-dimensional-Convolutions'></a>

* **Color Images:** Have multiple channels (e.g., 3 for RGB).

* **Filters for Multiple Input Channels:** The filter also has a depth matching the number of input channels (e.g., $M \times M \times C_{IN}$). It processes all input channels simultaneously to produce a single 2D feature map.

  <img src="image/Figure_6_a.png" width="350px">

  *((a) Filter processing R, G, B channels.)*

  <img src="image/Figure_6_b.png" width="350px">

  *((b) Kernel visualization for a 3x3x3 filter.)*

* **Multiple Output Channels:** To detect various features, a convolutional layer uses multiple filters. Each filter produces its own 2D feature map (output channel).

  * If there are $C_{OUT}$ filters, the output is a stack of $C_{OUT}$ feature maps.

  * Filter tensor dimensionality: $M \times M \times C_{IN} \times C_{OUT}$.

  * Total parameters: $(M^2 C_{IN} + 1) C_{OUT}$ (including bias for each output channel).

  <img src="image/Figure_7.png" width="350px">

  *(Extending to multiple independent filter channels.)*

* **$1 \times 1$ Convolutions:**

  * Filter size is $1 \times 1 \times C_{IN}$.

  * Used to change the number of channels (depth) without changing spatial dimensions.

  * Can reduce dimensionality or introduce more non-linearity.
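
The parameter count and the $1 \times 1$ convolution can be illustrated with a short sketch. Feature maps are stored as nested lists indexed `[channel][row][column]`; that layout, the function names, and the toy values are assumptions for this example. The activation function that would normally follow is omitted.

```python
def conv_layer_params(kernel_size, c_in, c_out):
    """(M^2 * C_IN + 1) * C_OUT, counting one bias per output channel."""
    return (kernel_size ** 2 * c_in + 1) * c_out


def conv1x1(feature_maps, weights, biases):
    """1x1 convolution: mix channels at each pixel, keep spatial size.

    feature_maps: [c_in][H][W], weights: [c_out][c_in], biases: [c_out].
    """
    c_in = len(feature_maps)
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    output = []
    for filter_weights, bias in zip(weights, biases):
        channel = [[bias + sum(filter_weights[c] * feature_maps[c][r][col]
                               for c in range(c_in))
                    for col in range(width)]
                   for r in range(height)]
        output.append(channel)
    return output


print(conv_layer_params(3, c_in=3, c_out=64))    # 1792
print(conv_layer_params(1, c_in=256, c_out=64))  # 16448

# Three 2x2 input channels reduced to two output channels.
maps = [[[1, 2], [3, 4]],
        [[10, 20], [30, 40]],
        [[100, 200], [300, 400]]]
mixed = conv1x1(maps, weights=[[1, 1, 1], [0, 1, 0]], biases=[0, 0])
print(mixed)  # [[[111, 222], [333, 444]], [[10, 20], [30, 40]]]
```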


### 2.6 Pooling
<a id='26-Pooling'></a>

* **Purpose:**

  * Introduce some local translation invariance (small shifts in feature location don't change output much).

  * Reduce dimensionality (down-sample feature maps).

* **How it works:** Applied to outputs of convolutional layers. A pooling unit takes input from a receptive field and applies a fixed function (no learnable parameters).

* **Max-Pooling:** Outputs the maximum value from the receptive field.

  * Preserves information about feature presence and strength, discards some precise positional info.

  <img src="image/Figure_8.png" width="450px">

  *(Max-pooling with 2x2 blocks and stride 2.)*

* **Average Pooling:** Outputs the average value.

* **Stride:** Often, stride = filter size for non-overlapping pooling.

* **Other Invariances:** Pooling across channels can help learn other invariances (e.g., rotation if different channels detect features at different orientations).

* **Variable Input Sizes:** Pooling can help adapt CNNs to process images of varying sizes by adjusting stride to keep pooled output size constant.
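
Max-pooling and average pooling differ only in the fixed function applied to each window, which the sketch below makes explicit. The function names and the 4x4 feature map are illustrative assumptions.

```python
def mean(values):
    return sum(values) / len(values)


def pool2d(feature_map, size=2, stride=2, reducer=max):
    """Apply `reducer` (e.g. max or mean) to each size x size window."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, stride):
        pooled_row = []
        for c in range(0, cols - size + 1, stride):
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            pooled_row.append(reducer(window))
        pooled.append(pooled_row)
    return pooled


feature_map = [[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 0, 5, 6],
               [1, 2, 7, 8]]

print(pool2d(feature_map, reducer=max))   # [[4, 2], [2, 8]]
print(pool2d(feature_map, reducer=mean))  # [[2.5, 1.0], [0.75, 6.5]]
```

There are no learnable parameters here: the only choices are the window size, the stride, and the reducing function.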


### 2.7 Multilayer Convolutions
<a id='27-Multilayer-Convolutions'></a>

* **Deep CNNs:** Stack multiple layers of \[Convolution -> Activation -> Pooling\].

* **Hierarchical Feature Learning:**

  * Early layers learn simple features (edges, textures).

  * Later layers combine these to learn more complex features (parts of objects, objects).

* **Effective Receptive Field:** The receptive field of a unit in a deeper layer, when mapped back to the original input image, is larger than that of units in earlier layers.

  <img src="image/Figure_9.png" width="450px">

  *(Effective receptive field grows with network depth.)*

* **Fully Connected Layers:** Often, the final stages of a CNN for classification are one or more standard fully connected layers. These combine information from across the entire (now down-sampled) feature map.
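
The growth of the effective receptive field can be computed layer by layer. The sketch below uses the standard recurrence (each layer widens the field by `(kernel_size - 1)` times the spacing between neighbouring units measured in input pixels); the layer stack is a made-up example, and the recurrence itself is not stated in these notes.

```python
def receptive_fields(layers):
    """Effective receptive field (in input pixels) after each layer.

    layers: list of (kernel_size, stride) pairs for conv or pooling layers.
    """
    field = 1  # a single input pixel sees itself
    jump = 1   # distance in input pixels between neighbouring units
    sizes = []
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump
        jump *= stride
        sizes.append(field)
    return sizes


# Hypothetical stack: conv3x3, conv3x3, 2x2 max-pool (stride 2), conv3x3.
stack = [(3, 1), (3, 1), (2, 2), (3, 1)]
print(receptive_fields(stack))  # [3, 5, 6, 10]
```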


### 2.8 Example Network Architectures
<a id='28-Example-Network-Architectures'></a>

* **LeNet (LeCun et al., 1989, 1998):** Early CNN for handwritten digit recognition.

* **ImageNet Challenge:** ImageNet is a large labelled dataset (about 14 million images in roughly 22,000 categories). The associated challenge, run on a 1,000-category subset, spurred CNN development.

* **AlexNet (Krizhevsky et al., 2012):** Breakthrough CNN that won ImageNet 2012. Used ReLUs, GPUs, dropout.

* **VGG-16 (Simonyan & Zisserman, 2014):**

  * Known for its simple and uniform architecture (16 learnable layers).

  * Uses small $3 \times 3$ convolutional filters exclusively.

  * Stacks multiple convolutional layers before each max-pooling layer.

  * Input: $224 \times 224 \times 3$ image.

  * **Architecture:**
    * Sequence of blocks:
      * (Conv3x3 → ReLU) layers (multiple per block)
      * MaxPool2x2 after each group of conv layers
    * The number of channels doubles from block to block until it reaches 512, then stays there:
      * 64 → 128 → 256 → 512 → 512
    * Ends with fully connected (dense) layers

  <img src="image/Figure_10.png" width="650px">

  *(VGG-16 architecture.)*

  * Total parameters: \~138 million (mostly in fully connected layers).
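
The figure of roughly 138 million parameters can be reproduced with a short count. The per-layer channel widths and the fully connected sizes (4096, 4096, 1000) are taken from the commonly published VGG-16 configuration and are an assumption here, since these notes do not list them in full.

```python
# "M" marks a 2x2 max-pool; numbers are output channels of 3x3 conv layers.
CONV_CONFIG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
               512, 512, 512, "M", 512, 512, 512, "M"]
FC_SIZES = [4096, 4096, 1000]


def count_vgg16_params(input_size=224, in_channels=3):
    """Return (conv_params, fc_params) for the configuration above."""
    conv_params, channels, size = 0, in_channels, input_size
    for item in CONV_CONFIG:
        if item == "M":
            size //= 2  # pooling halves the spatial size
        else:
            conv_params += (3 * 3 * channels + 1) * item
            channels = item
    fc_params, features = 0, channels * size * size
    for units in FC_SIZES:
        fc_params += (features + 1) * units
        features = units
    return conv_params, fc_params


conv_params, fc_params = count_vgg16_params()
print(f"conv layers: {conv_params:,}")              # 14,714,688
print(f"fully connected: {fc_params:,}")            # 123,642,856
print(f"total: {conv_params + fc_params:,}")        # 138,357,544
```

Under these assumptions about 89% of the parameters sit in the three fully connected layers.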


## 3. Visualizing Trained CNNs
<a id='3-Visualizing-Trained-CNNs'></a>

Understanding what a trained CNN has learned.


### 3.1 Visual Cortex
<a id='31-Visual-Cortex'></a>

* **Inspiration for CNNs:** Early neuroscience research on the mammalian visual cortex.

  * **Simple Cells (Hubel & Wiesel, 1959):** Respond to edges at specific orientations and locations. Modeled by **Gabor filters**.

  <img src="image/Figure_11.png" width="450px">

  *(Examples of Gabor filters.)*

  * **Complex Cells:** Respond to more complex stimuli, combine outputs of simple cells, show some invariance to small shifts (like pooling).

* **Neocognitron (Fukushima, 1980):** Forerunner of CNNs, inspired by visual cortex, had local receptive fields, shared weights, and pooling, but lacked end-to-end training.


### 3.2 Visualizing Trained Filters
<a id='32-Visualizing-Trained-Filters'></a>

* **First Layer Filters:** Easy to visualize as they operate directly on image patches.

  * When trained on natural images (like ImageNet), these filters often resemble Gabor filters (detecting edges and textures) and color blobs.

  <img src="image/Figure_12.png" width="450px">

  *(Learned filters from AlexNet's first layer. Note similarity to Gabor filters.)*

* **Deeper Layer Filters:** Harder to interpret directly as they operate on feature maps from previous layers.

  * **Method 1: Finding Maximally Activating Patches:** Present many image patches and see which ones cause the highest activation for a given unit/channel.

    * Shows hierarchical structure: Layer 1 (edges), Layer 2 (textures, simple shapes), Layer 3 (object parts), Layer 5 (entire objects).

  <img src="image/Figure_13.png" width="450px">

  *(Image patches causing strongest activations in different layers of a CNN.)*

  * **Method 2: Optimization:** Numerically optimize an input image (starting from noise or a base image) to maximize the activation of a specific unit or class score. Requires regularization to look like natural images.

  <img src="image/Figure_14.png" width="450px">

  *(Synthetic images generated by maximizing class probability.)*


### 3.3 Saliency Maps
<a id='33-Saliency-Maps'></a>

* **Purpose:** Identify regions of an input image most influential for a particular classification.

* **Grad-CAM (Gradient-weighted Class Activation Mapping):**

  * Focuses on the final convolutional layer (high-level semantics + spatial info).

  * Calculates gradients of the class score with respect to feature map activations in this layer.

  * These gradients are averaged per channel to get weights ($\alpha_k$).

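  * A weighted sum of feature maps ($L = \sum_k \alpha_k A^{(k)}$) creates a heatmap showing important regions. In the original Grad-CAM formulation the sum is passed through a ReLU so that only regions with a positive influence on the class remain.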

  <img src="image/Figure_15.png" width="450px">

  *(Saliency maps for 'dog' and 'cat' classes using Grad-CAM.)*
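
The Grad-CAM steps above can be sketched without any deep learning library, assuming the final-layer activations and the gradients of the class score with respect to them have already been obtained from a framework. Both are passed in as nested lists indexed `[channel][row][column]`; the toy numbers are made up. The optional final ReLU follows the original Grad-CAM paper and is not part of the formula as written in these notes.

```python
def grad_cam(activations, gradients, apply_relu=True):
    """Heatmap L = sum_k alpha_k * A^(k), with alpha_k the mean gradient of channel k."""
    height, width = len(activations[0]), len(activations[0][0])
    # alpha_k: gradient of the class score averaged over all spatial positions.
    alphas = [sum(sum(row) for row in grad) / (height * width)
              for grad in gradients]
    heatmap = [[0.0] * width for _ in range(height)]
    for alpha, feature_map in zip(alphas, activations):
        for r in range(height):
            for c in range(width):
                heatmap[r][c] += alpha * feature_map[r][c]
    if apply_relu:
        heatmap = [[max(0.0, value) for value in row] for row in heatmap]
    return heatmap


# Two hypothetical 2x2 channels: the first supports the class, the second opposes it.
activations = [[[1, 0], [0, 0]],
               [[0, 0], [0, 2]]]
gradients = [[[1, 1], [1, 1]],
             [[-2, -2], [-2, -2]]]

print(grad_cam(activations, gradients))                    # [[1.0, 0.0], [0.0, 0.0]]
print(grad_cam(activations, gradients, apply_relu=False))  # [[1.0, 0.0], [0.0, -4.0]]
```

In practice the low-resolution heatmap is then up-sampled to the input size and overlaid on the image.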


## 4. Object Detection
<a id='4-Object-Detection'></a>

Beyond classifying an entire image, finding multiple objects and their locations.


### 4.1 Bounding Boxes
<a id='41-Bounding-Boxes'></a>

* **Definition:** A rectangle tightly enclosing an object in an image.

* **Representation:** Typically $(b_x, b_y, b_W, b_H)$ - center coordinates, width, and height.

* **CNN for Localization:** A CNN can be trained to output not only class probabilities but also 4 continuous values for bounding box coordinates.

  <img src="image/Figure_19.png" width="450px">

  *(Image with objects and their bounding boxes.)*
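
Many box computations are easier in corner coordinates than in the center/size form $(b_x, b_y, b_W, b_H)$. The two hypothetical helpers below convert between the representations; the example values are arbitrary.

```python
def center_to_corners(b_x, b_y, b_w, b_h):
    """(center x, center y, width, height) -> (x_min, y_min, x_max, y_max)."""
    return (b_x - b_w / 2, b_y - b_h / 2, b_x + b_w / 2, b_y + b_h / 2)


def corners_to_center(x_min, y_min, x_max, y_max):
    """(x_min, y_min, x_max, y_max) -> (center x, center y, width, height)."""
    return ((x_min + x_max) / 2, (y_min + y_max) / 2,
            x_max - x_min, y_max - y_min)


corners = center_to_corners(50, 40, 20, 10)
print(corners)                      # (40.0, 35.0, 60.0, 45.0)
print(corners_to_center(*corners))  # (50.0, 40.0, 20.0, 10.0)
```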


### 4.2 Intersection-over-Union (IoU)
<a id='42-Intersection-over-Union-IoU'></a>

* **Purpose:** A metric to measure the accuracy of a predicted bounding box against a ground truth box.

* **Definition:** $IoU = \frac{\text{Area of Intersection}}{\text{Area of Union}}$

  <img src="image/Figure_20.png" width="450px">

  *(Illustration of Intersection-over-Union.)*

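* **Usage:** Values range from 0 (no overlap) to 1 (perfect match). A prediction is often considered correct if IoU > 0.5. IoU is rarely used directly as a loss function because it is hard to optimize with gradient descent; for example, its gradient is zero whenever the two boxes do not overlap.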
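
A minimal IoU sketch for axis-aligned boxes. It assumes boxes are given in corner form `(x_min, y_min, x_max, y_max)` rather than the center/size form used earlier; the example boxes are made up.

```python
def box_area(box):
    return (box[2] - box[0]) * (box[3] - box[1])


def iou(box_a, box_b):
    """Intersection-over-Union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    union = box_area(box_a) + box_area(box_b) - intersection
    return intersection / union if union > 0 else 0.0


print(iou((0, 0, 2, 2), (0, 0, 2, 2)))  # 1.0 (identical boxes)
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7, about 0.143
print(iou((0, 0, 2, 2), (5, 5, 6, 6)))  # 0.0 (no overlap)
```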


### 4.3 Sliding Windows
<a id='43-Sliding-Windows'></a>

* **Traditional Approach:**

  1. Train a classifier on cropped object examples and background examples.

  2. Slide this classifier (window) across a new image at all possible locations and scales.

  3. Where the classifier gives a high score, an object is detected.

* **Drawbacks:** Computationally expensive, especially with deep CNNs and many scales/locations.

* **Efficient Implementation in CNNs:**

  * Convolutional layers inherently perform a sliding operation.

  * A CNN trained on fixed-size inputs can be applied to a larger image by converting its fully connected layers into equivalent convolutional layers. This processes the entire image in one pass, effectively evaluating all window positions simultaneously.

  <img src="image/Figure_21.png" width="350px">

  *(Overlapping sliding windows lead to redundant computations.)*

  <img src="image/Figure_22.png" width="450px">

  *(Simple CNN for object detection.)*

  <img src="image/Figure_23.png" width="450px">

  *(Applying the CNN to a larger image efficiently by extending convolutional/pooling layers.)*
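
The cost of the traditional approach is easy to see by enumerating the window positions. The generator below is an illustrative sketch for a single scale; the image size, window size, and stride are arbitrary assumptions.

```python
def sliding_windows(image_height, image_width, window, stride):
    """Yield (top, left, bottom, right) for every window position."""
    for top in range(0, image_height - window + 1, stride):
        for left in range(0, image_width - window + 1, stride):
            yield (top, left, top + window, left + window)


windows = list(sliding_windows(image_height=8, image_width=8, window=4, stride=2))
print(len(windows))  # 9
print(windows[:2])   # [(0, 0, 4, 4), (0, 2, 4, 6)]

# A realistic case: 64x64 windows with stride 8 on a 512x512 image.
print(sum(1 for _ in sliding_windows(512, 512, window=64, stride=8)))  # 3249
```

Neighbouring windows share most of their pixels, so running a separate CNN forward pass per window repeats almost all of the convolution work; the single-pass convolutional implementation avoids that.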


### 4.4 Detection Across Scales
<a id='44-Detection-Across-Scales'></a>

* **Challenge:** Objects appear at different sizes and aspect ratios.

* **Method:** Instead of different sized detectors, create multiple scaled/resized versions of the input image. Apply a fixed-size detector to each. Transform detected bounding boxes back to original image coordinates.

  <img src="image/Figure_24.png" width="450px">

  *(Detecting objects at multiple scales by scaling the image.)*
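
Mapping detections back to the original image is a matter of dividing by the resize factor. The sketch below assumes a hypothetical fixed detector window of 32 pixels and a few example scales, with boxes in `(x_min, y_min, x_max, y_max)` form.

```python
def to_original_coords(box, scale):
    """Map a box found in an image resized by `scale` back to the original image."""
    return tuple(coordinate / scale for coordinate in box)


DETECTOR_SIZE = 32  # hypothetical fixed window size in pixels

for scale in (1.0, 0.5, 0.25):
    # A detection covering the top-left detector window of the resized image.
    box_in_resized = (0, 0, DETECTOR_SIZE, DETECTOR_SIZE)
    print(scale, to_original_coords(box_in_resized, scale))
```

Shrinking the image by a factor of 4 lets the same 32-pixel detector cover a 128-pixel region of the original, so large objects are found in the small copies and small objects in the large ones.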


### 4.5 Non-Max Suppression (NMS)
<a id='45-Non-Max-Suppression-NMS'></a>

* **Problem:** Sliding window approaches often detect the same object multiple times with slightly different, overlapping bounding boxes.

* **Solution (for each class):**

  1. Get all predicted bounding boxes with their confidence scores.

  2. Discard boxes below a confidence threshold.

  3. Select the box with the highest confidence. This is a detection.

  4. Discard any other boxes that have a high IoU (e.g., > 0.5) with this selected box.

  5. Repeat from step 3 with the remaining boxes until no boxes are left.

  <img src="image/Figure_25.png" width="450px">

  *(Non-max suppression removes redundant overlapping detections.)*
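
The five steps above translate almost directly into code. This is a simple greedy sketch for a single class, assuming boxes in corner form `(x_min, y_min, x_max, y_max)`; the boxes, scores, and both thresholds are example values.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, score_threshold=0.5, iou_threshold=0.5):
    """Return indices of the boxes kept, highest confidence first."""
    # Steps 1-2: keep boxes above the confidence threshold, best first.
    candidates = sorted((i for i, score in enumerate(scores)
                         if score >= score_threshold),
                        key=lambda i: scores[i], reverse=True)
    kept = []
    while candidates:
        best = candidates.pop(0)  # Step 3: most confident remaining box.
        kept.append(best)
        # Step 4: drop boxes that overlap the selected box too much.
        candidates = [i for i in candidates
                      if iou(boxes[best], boxes[i]) <= iou_threshold]
    return kept  # Step 5 is the loop itself.


boxes = [(0, 0, 10, 10), (1, 1, 11, 11),
         (20, 20, 30, 30), (21, 21, 31, 31),
         (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7, 0.95, 0.3]

print(non_max_suppression(boxes, scores))  # [3, 0]
```

Box 4 falls below the confidence threshold, and boxes 1 and 2 are suppressed because each overlaps a more confident box with IoU of about 0.68.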


### 4.6 Fast Region CNNs (R-CNN family)
<a id='46-Fast-Region-CNNs-R-CNN-family'></a>

* **Idea:** Instead of densely scanning, first use a cheaper method (region proposal network or algorithm like Selective Search) to identify a smaller set of candidate regions in the image that likely contain objects. Then, apply the powerful CNN classifier only to these regions.

* **Examples:** R-CNN, Fast R-CNN, Faster R-CNN (integrates region proposal into the network for end-to-end training).


## 5. Image Segmentation
<a id='5-Image-Segmentation'></a>

Assigning a class label to *every pixel* in an image.

* **Semantic Segmentation:** Divides image into regions corresponding to semantic classes (e.g., "cell nucleus", "cytoplasm", "background" in a microscopy image).

  <img src="image/Figure_26_a.png" width="450px"> <img src="image/Figure_26_b.png" width="450px">

  *(Example of an image and its semantic segmentation.)*


### 5.1 Convolutional Segmentation
<a id='51-Convolutional-Segmentation'></a>

* **Naive Approach:** Train a CNN to classify the central pixel of an input patch. Slide this patch over the image. Very inefficient due to redundant computations.

* **Efficient Approach (Fully Convolutional):** Use a CNN where all layers are convolutional (no fully connected layers that fix the input size). With no pooling or striding, and with "same" padding, the output map is the same size as the input. This is still costly if many layers/channels are needed for high accuracy.


### 5.2 Up-sampling
<a id='52-Up-sampling'></a>

* **Problem:** Typical CNNs use pooling/striding, which reduces feature map resolution. For segmentation, we need a full-resolution output.

* **Solution:** Add up-sampling layers to an encoder-decoder architecture.

  * **Encoder:** Standard CNN that down-samples, extracting features.

  * **Decoder:** Up-samples the low-resolution feature maps back to original image resolution, producing the segmentation map.

  <img src="image/Figure_27.png" width="650px">

  *(Encoder-decoder architecture for segmentation.)*

* **Unpooling (analogue of pooling):**

  * **Average Unpooling:** Copy input value to all corresponding output units in a block.

  <img src="image/Figure_28_a.png" width="450px">

  *(Average unpooling.)*

  * **Max-Unpooling:** Copy input value to one location in the output block (e.g., top-left or, better, the location of the max from the corresponding max-pooling layer during encoding), others zero.

  <img src="image/Figure_28_b.png" width="450px">

  *(Max unpooling - basic.)*

  <img src="image/Figure_29.png" width="650px">

  *(Max unpooling preserving spatial information from encoder's max-pooling indices.)*
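
The two unpooling variants can be sketched as follows, assuming non-overlapping 2x2 blocks and input dimensions divisible by the block size. The function names and the example feature map are hypothetical.

```python
def average_unpool(pooled, size=2):
    """Copy each value into every cell of its size x size output block."""
    output = []
    for row in pooled:
        expanded = [value for value in row for _ in range(size)]
        output.extend(list(expanded) for _ in range(size))
    return output


def max_pool_with_indices(feature_map, size=2):
    """Non-overlapping max-pooling that also records where each maximum was."""
    pooled, indices = [], []
    for r in range(0, len(feature_map), size):
        pooled_row, index_row = [], []
        for c in range(0, len(feature_map[0]), size):
            cells = [(r + i, c + j) for i in range(size) for j in range(size)]
            best = max(cells, key=lambda cell: feature_map[cell[0]][cell[1]])
            pooled_row.append(feature_map[best[0]][best[1]])
            index_row.append(best)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def max_unpool(pooled, indices, out_rows, out_cols):
    """Place each value back at its recorded location; everything else is zero."""
    output = [[0] * out_cols for _ in range(out_rows)]
    for pooled_row, index_row in zip(pooled, indices):
        for value, (r, c) in zip(pooled_row, index_row):
            output[r][c] = value
    return output


feature_map = [[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 0, 5, 6],
               [1, 2, 7, 8]]

pooled, indices = max_pool_with_indices(feature_map)
print(pooled)                              # [[4, 2], [2, 8]]
print(max_unpool(pooled, indices, 4, 4))   # maxima restored in place
print(average_unpool(pooled))              # each value copied to a 2x2 block
```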


### 5.3 Fully Convolutional Networks (FCNs) & Transpose Convolution
<a id='53-Fully-Convolutional-Networks-FCNs--Transpose-Convolution'></a>

* **Transpose Convolution (or "deconvolution", "fractionally strided convolution"):** A learnable up-sampling layer.

  * Each input unit contributes to a patch in the output layer via a filter.

  * Stride is effectively < 1 (e.g., moving 1 step in input moves >1 step in output).

  * Overlapping contributions are summed or averaged.

  <img src="image/Figure_30.png" width="250px">

  *(Transpose convolution for up-sampling.)*

* **Fully Convolutional Network (Long et al., 2014):** A network built only from convolutional layers (including transpose convolutions for up-sampling), with no fully connected layers that would fix the input size. It can therefore process images of arbitrary size.
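
A transpose convolution can be written as a scatter operation: every input value stamps a scaled copy of the kernel into the output, and overlapping stamps are added. The sketch below assumes no padding and sums (rather than averages) the overlaps; the input, kernel, and stride are example values. In a real network the kernel weights would be learned.

```python
def transpose_conv2d(feature_map, kernel, stride=2):
    """Up-sample by letting each input unit contribute a kernel-sized patch."""
    rows, cols = len(feature_map), len(feature_map[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = (rows - 1) * stride + k_rows
    out_cols = (cols - 1) * stride + k_cols
    output = [[0] * out_cols for _ in range(out_rows)]
    for r in range(rows):
        for c in range(cols):
            for i in range(k_rows):
                for j in range(k_cols):
                    # One input step moves `stride` output steps.
                    output[r * stride + i][c * stride + j] += (
                        feature_map[r][c] * kernel[i][j])
    return output


small = [[1, 2],
         [3, 4]]
ones = [[1, 1, 1],
        [1, 1, 1],
        [1, 1, 1]]

for row in transpose_conv2d(small, ones, stride=2):
    print(row)
# [1, 1, 3, 2, 2]
# [1, 1, 3, 2, 2]
# [4, 4, 10, 6, 6]
# [3, 3, 7, 4, 4]
# [3, 3, 7, 4, 4]
```

The 2x2 input becomes a 5x5 output, and the middle row and column show where neighbouring patches overlap and are summed.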


### 5.4 The U-Net Architecture
<a id='54-The-U-Net-Architecture'></a>

* **Key Idea (Ronneberger et al., 2015):** An encoder-decoder architecture with "skip connections."

  * **Symmetric Structure:** Down-sampling path (encoder) and up-sampling path (decoder).

  * **Skip Connections:** Feature maps from the encoder are concatenated with the corresponding feature maps in the decoder. This provides the up-sampling path with high-resolution spatial information from earlier layers, which helps in precise localization for segmentation.

  * Very successful in biomedical image segmentation.

  <img src="image/Figure_31.png" width="650px">

  *(U-Net architecture with skip connections.)*
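
A skip connection is simply a concatenation along the channel axis. The sketch below stores feature maps as nested lists indexed `[channel][row][column]` and assumes the encoder and decoder maps already have the same spatial size; the helper name and toy values are hypothetical.

```python
def concat_channels(encoder_maps, decoder_maps):
    """Stack encoder and decoder feature maps along the channel axis."""
    enc_shape = (len(encoder_maps[0]), len(encoder_maps[0][0]))
    dec_shape = (len(decoder_maps[0]), len(decoder_maps[0][0]))
    if enc_shape != dec_shape:
        raise ValueError("spatial sizes must match before concatenation")
    return encoder_maps + decoder_maps


# Two high-resolution encoder channels and one up-sampled decoder channel.
encoder_maps = [[[1, 0], [0, 1]],
                [[0, 2], [2, 0]]]
decoder_maps = [[[5, 5], [5, 5]]]

merged = concat_channels(encoder_maps, decoder_maps)
print(len(merged))  # 3 channels, each still 2x2
```

The convolution that follows the concatenation sees both the fine spatial detail from the encoder and the coarser semantic features from the decoder.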


### 5.5 Frameworks Built on PyTorch
<a id='55-Frameworks-Built-on-PyTorch'></a>

| Framework/Library                                    | Link                                                                                           | GitHub Stars | Description                                                                                   |
|------------------------------------------------------|------------------------------------------------------------------------------------------------|--------------|-----------------------------------------------------------------------------------------------|
| MMDetection                                          | [https://github.com/open-mmlab/mmdetection](https://github.com/open-mmlab/mmdetection)         | ~27k         | OpenMMLab’s toolbox for object detection and instance segmentation based on PyTorch.          |
| PyTorch Lightning                                    | [https://www.pytorchlightning.ai/](https://www.pytorchlightning.ai/)                           | ~26k         | Lightweight PyTorch wrapper for scalable, reproducible, and organized deep learning research. |
| Detectron2                                           | [https://github.com/facebookresearch/detectron2](https://github.com/facebookresearch/detectron2) | ~26k         | Facebook AI Research's next-gen library for object detection and segmentation in PyTorch.     |
| fastai                                               | [https://www.fast.ai/](https://www.fast.ai/)                                                   | ~25k         | High-level library built on PyTorch for rapid prototyping and research, especially in vision, text, and tabular data. |
| PyTorch Image Models (timm)                          | [https://github.com/huggingface/pytorch-image-models](https://github.com/huggingface/pytorch-image-models) | ~21k         | Collection of image models, layers, utilities, and pretrained weights for PyTorch.            |
| torchvision                                          | [https://pytorch.org/vision/stable/index.html](https://pytorch.org/vision/stable/index.html)   | ~17k         | Official PyTorch vision library with datasets, model architectures, and image transformations.|
| Kornia                                               | [https://kornia.org/](https://kornia.org/)                                                     | ~8k          | Computer vision library for PyTorch, with differentiable image processing and geometry ops.   |
| segmentation_models.pytorch                          | [https://github.com/qubvel/segmentation_models.pytorch](https://github.com/qubvel/segmentation_models.pytorch) | ~6k          | Collection of image segmentation architectures and pretrained weights for PyTorch.            |
| MONAI                                                | [https://monai.io/](https://monai.io/)                                                         | ~5k          | Medical Open Network for AI: PyTorch-based framework for deep learning in healthcare imaging. |
| TorchIO                                              | [https://torchio.readthedocs.io/](https://torchio.readthedocs.io/)                             | ~2k          | Medical imaging preprocessing and augmentation for deep learning in PyTorch.                  |

---
**Reference:**

Bishop, C. M., & Bishop, H. (2024). *Deep Learning: Foundations and Concepts*. Springer. (Chapter 10: Convolutional Neural Networks).
