<h1><center>Laboratory work 4.</center></h1>
<h2><center>PyTorch Computer Vision Exercises</center></h2>

**Completed:** Last name and First name

**Variant:** #__

<a class="anchor" id="4"></a>

## Content

1. [Task 1. Load the dataset.](#4.1)
2. [Task 2. Turn the loaded dataset into `torch.utils.data.DataLoader`.](#4.2)
3. [Task 3. Create a CNN model.](#4.3)
4. [Task 4. Train the model on the corresponding dataset.](#4.4)
5. [Task 5. Make predictions.](#4.5)

```python
# Import torch
import torch

# Exercises require PyTorch > 1.10.0
print(torch.__version__)

# TODO: Setup device agnostic code
```

```
1.13.1
```



<a class="anchor" id="4.1"></a>

## <span style="color:red; font-size:1.5em;">Task 1. Load the dataset</span>

[Go back to the content](#4)

**Variant 1:**
Load the **Stanford Dogs** dataset from a custom directory structure in which each dog breed has its own subfolder. Resize images to 128×128 and normalize them using the ImageNet mean and std. Split the data into train and validation sets so that every breed is well-represented in both.

*Technical note:*
- Use `torchvision.datasets.ImageFolder` pointing to your dataset directory.
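- Apply `transforms.Resize((128, 128))`, then `transforms.ToTensor()` and `transforms.Normalize(mean, std)` for preprocessing. Note that `transforms.Resize(128)` with a single integer only scales the shorter side to 128 and keeps the aspect ratio.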
- Maintain an 80/20 split for train/val to ensure coverage of all breeds.
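
A stratified 80/20 split like the one described above can be sketched in plain Python. This is a minimal illustration, not the required solution: `stratified_split` and `toy_labels` are hypothetical names, and the toy labels stand in for the `targets` list that `ImageFolder` exposes.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.2, seed=42):
    """Return (train_indices, val_indices), taking val_fraction from each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Keep at least one validation sample per class when the class has 2+ images.
        n_val = max(1, round(len(indices) * val_fraction)) if len(indices) > 1 else 0
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Toy labels (hypothetical): three classes with 10, 5 and 20 images.
toy_labels = [0] * 10 + [1] * 5 + [2] * 20
train_idx, val_idx = stratified_split(toy_labels)
```

The resulting index lists can then be wrapped with `torch.utils.data.Subset(dataset, train_idx)` and `Subset(dataset, val_idx)`.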

---
**Variant 2:**
Download the **Tiny ImageNet** dataset (200 classes) and store it locally. Use random transformations like random rotation (±15 degrees) and color jitter to boost variation. Separate the dataset into train/val/test sets, ensuring each set has the same class distribution.

*Technical note:*
- Rely on `transforms.RandomRotation` and `transforms.ColorJitter` for data augmentation.
- Keep the image size at 64×64.
- Confirm the ratio splits (e.g., 70% train, 15% val, 15% test).

---
**Variant 3:**
Prepare a dataset of **flower images** (e.g., from the Oxford 102 Flowers dataset) by manually downloading and organizing them. Convert all images to grayscale to mimic single-channel data. Then create train/val splits with stratified sampling based on flower type.

*Technical note:*
- Use `PIL.Image.convert("L")` or `transforms.Grayscale(num_output_channels=1)` for conversion.
- Keep a record of class names vs. folder structure.
- Possibly reduce the image size to 64×64 to handle memory constraints.

---
**Variant 4:**
Obtain **satellite imagery** from a public dataset like EuroSAT. Use separate channels (e.g., RGB + near-infrared if available) and stack them into a 4-channel input. Resize them consistently to 128×128. Split data by region: training images from certain zones, validation/test from others.

*Technical note:*
- Use a custom transform or `Dataset` to stack the bands into one 4-channel array; `transforms.ToTensor()` accepts an H×W×C NumPy array as well as PIL images.
- Splitting by region gives a realistic “domain split” between train/val/test.
- Normalization can be channel-wise, using means and stds computed from the dataset.

---
**Variant 5:**
Load a custom **medical imaging** dataset in PNG format (e.g., X-ray or CT scans). Apply histogram equalization as a preprocessing step for better contrast. Classify into normal vs. abnormal. Keep training and testing sets in separate folders to enforce no patient overlap.

*Technical note:*
- Implement a custom transform that performs histogram equalization via OpenCV or `PIL`.
- Store data in `ImageFolder`-like structure for normal/abnormal.
- Carefully verify that no patient ID appears in both sets.

---
**Variant 6:**
Use the **Caltech-256** dataset for image classification. Implement a custom sampler that selects 5 random categories for the training set and 5 different categories for the validation set for a zero-shot-like scenario. Keep the rest for standard testing.

*Technical note:*
- Programmatically choose which categories go to train vs. validation.
- `torch.utils.data.Subset` or a custom approach to filter classes.
- Investigate zero-shot performance on the new classes.

---
**Variant 7:**
Load the **Places365** dataset in a reduced form (e.g., only 30 scene classes). Convert all images to LAB color space instead of RGB to see if it aids classification. Then split into train/validation/test in equal proportions for each scene type.

*Technical note:*
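- Add an image conversion step: `PIL.Image.convert("LAB")` does not work directly on RGB images, so custom code is needed.
- Alternatively, transform each image with a library that supports LAB, such as `skimage.color.rgb2lab` or OpenCV's `cv2.cvtColor` with `cv2.COLOR_RGB2LAB`.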
- Keep balanced splits across the 30 selected classes.

---
**Variant 8:**
Create a **synthetic shape dataset** of simple shapes rendered on plain backgrounds, for instance squares, triangles, and circles, each in different colors. Randomly generate 10,000 images (64×64) and label them by shape type. Reserve 2,000 for testing. This tests the model's ability to learn geometric forms.

*Technical note:*
- Use a Python script (e.g., with PIL or OpenCV) to draw shapes on blank backgrounds.
- Each shape class gets random color, position, and orientation.
- Save them as PNG, and load via a custom `Dataset` or `ImageFolder`-style structure.

---
**Variant 9:**
Load the **iNaturalist** dataset for different species identification. Focus on a subset of bird species (say 10 species). Resize images to 224×224. Use an 80/10/10 split for train/val/test. This large and diverse data challenges real biodiversity classification tasks.

*Technical note:*
- iNaturalist typically has a large number of classes, so specify a subset index or name filter.
- Use `transforms.Resize((224, 224))` plus standard normalization.
- Aim for robust coverage of each bird species in all splits.

---
**Variant 10:**
Obtain the **Food-101** dataset and filter out only 20 classes (dishes). Shuffle images to ensure random distribution. Convert them to 3-channel RGB if any are mismatched. Then apply minimal transformations (crop and horizontal flip) before final storage.

*Technical note:*
- Use `glob` or `os.listdir` to gather only the chosen classes.
- Keep `transforms.RandomResizedCrop(224)` plus `transforms.RandomHorizontalFlip()`.
- Store final data objects for subsequent tasks.

---
**Variant 11:**
Use your **own image collection** from a smartphone camera. Label the images in 5 categories (e.g., indoors, outdoors, food, pets, vehicles). Resize each to 256×256. Then use a random 70/30 split for train/val. This yields a personal dataset with unique image properties.

*Technical note:*
- Manually sort your images into subfolders.
- Use `transforms.Resize((256,256))` or adapt the shape as needed.
- Keep a stable random seed for reproducible splits.

---
**Variant 12:**
Download a **GAN-generated** set of faces vs. real faces for a “real vs. fake” classification. Use 5000 total images, half from a generative model, half from real life. Resize to 128×128. Maintain a 75/25 train/test split to see how well the model discerns authenticity.

*Technical note:*
- Store “real” and “fake” in separate subdirs, each with 2500 images.
- Use `transforms.Resize((128, 128))` for a uniform shape; `transforms.Resize(128)` only fixes the shorter side.
- Potential advanced transformations to handle face alignment or lighting conditions.

---
**Variant 13:**
Build a **time-lapse dataset** of sky images over a day, each labeled with “sunrise,” “afternoon,” “sunset,” or “night.” Pre-crop them to remove irrelevant parts. Convert them to grayscale for a simpler channel input. Then do an 80/20 split for training and validation.

*Technical note:*
- Cropping can remove ground features so the model focuses on sky color patterns.
- `transforms.Grayscale()` to reduce complexity.
- This tests whether the model can learn time-of-day from sky visuals.

---
**Variant 14:**
Use **UW Faces** dataset for various face angles. Crop images to the central face region, then scale to 96×96. Partition them by subject ID: 80% of IDs in train, 10% in val, 10% in test, ensuring no subject overlap. The goal is face classification or recognition tasks.

*Technical note:*
- A custom dataset class might handle face cropping with bounding box info if available.
- ID-based splitting ensures the model never sees the same person in multiple sets.
- This is relevant for recognition or verification.
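
The ID-based partition described above can be sketched as follows. It is a minimal illustration under stated assumptions: `split_by_subject` and `toy_subject_ids` are hypothetical names, and each image is assumed to carry a subject ID string.

```python
import random


def split_by_subject(subject_ids, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole subjects to train/val/test and return image indices per split."""
    rng = random.Random(seed)
    unique_ids = sorted(set(subject_ids))
    rng.shuffle(unique_ids)

    n_train = int(len(unique_ids) * fractions[0])
    n_val = int(len(unique_ids) * fractions[1])
    groups = {
        "train": set(unique_ids[:n_train]),
        "val": set(unique_ids[n_train:n_train + n_val]),
        "test": set(unique_ids[n_train + n_val:]),  # remainder goes to test
    }
    # Every image follows its subject, so no subject can appear in two splits.
    return {
        name: [i for i, sid in enumerate(subject_ids) if sid in ids]
        for name, ids in groups.items()
    }


# Toy data (hypothetical): 20 subjects with 3 images each.
toy_subject_ids = [f"s{n:02d}" for n in range(20) for _ in range(3)]
splits = split_by_subject(toy_subject_ids)
```

Splitting the list of unique IDs first, and only then mapping images to splits, is what guarantees that there is no subject overlap.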

---
**Variant 15:**
Collect **microscopy images** of cells (e.g., brightfield vs. fluorescent). Label them with the cell type. Convert each to a 2D single-channel format. Remove images with poor focus (blurry) using a threshold-based filter. Use a random 80/20 train/test approach.

*Technical note:*
- Implement a “focus measure” (e.g., variance of Laplacian) to drop blurry images.
- Convert to single-channel float for accurate intensity representation.
- Potentially do minimal augmentation to maintain cell morphology.

---
**Variant 16:**
Load a **handwritten digit dataset** from multiple languages (e.g., Arabic, Devanagari). Combine them into a single multi-class set. Use uniform resizing to 32×32. Shuffle thoroughly, and keep language distribution the same across splits so each language appears in train/val/test.

*Technical note:*
- Rely on multiple source folders, each representing a different script.
- Merge them, label each digit 0–9 but also track script if needed.
- Ensure balanced splits for each digit–script combination.

---
**Variant 17:**
Utilize the **Flowers-102** dataset from torchvision for image classification. This dataset contains images of 102 different flower categories. Resize images to 128×128 and normalize them using ImageNet mean and std. Split the dataset into training and validation sets, maintaining an 80/20 split and ensuring each flower category is well-represented in both.

*Technical note:*
- Use `torchvision.datasets.Flowers102` to load the dataset.
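- Apply `transforms.Resize((128, 128))`, then `transforms.ToTensor()` and `transforms.Normalize(mean, std)` for preprocessing. Note that `transforms.Resize(128)` with a single integer only scales the shorter side to 128.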
- Implement a stratified 80/20 split to ensure class balance across train and validation sets.

---
**Variant 18:**
Create a **meme classification** dataset from various online memes, labeling them by template type (e.g., Drake Meme, Distracted Boyfriend, etc.). Manually gather ~1,000 images per class. Convert them to 224×224 RGB. Keep a 70/15/15 train/val/test distribution.

*Technical note:*
- Manually scraping requires filtering duplicates.
- Use `ImageFolder` structure with each meme template as a folder.
- Potential strong data augmentation for text or overlay disclaimers.

---
**Variant 19:**
Download **road sign images** from different countries (like GTSRB for Germany, plus other local sign datasets). Merge them into a single dataset with sign type labels. Standardize image size to 64×64. Split by country: use one country for validation, another for testing, rest for training, to test cross-domain generalization.

*Technical note:*
- Ensure that each sign type has consistent labeling across countries.
- Possibly keep separate transforms for color normalization if sign color differs.
- This tests domain adaptation across different sign designs.

---
**Variant 20:**
Gather a **multi-lighting set** of the same objects captured in bright, dim, and normal lighting. Label them by object category (10 categories). Each category has ~300 images across lighting conditions. Partition randomly into train/val/test. This explores illumination invariance.

*Technical note:*
- Acquire or simulate varying light conditions.
- Possibly apply gamma correction during load to unify brightness.
- Balanced splits help ensure each lighting scenario is seen in train/val/test.

---

<a class="anchor" id="4.2"></a>

## <span style="color:red; font-size:1.5em;">Task 2. Turn the loaded dataset into `torch.utils.data.DataLoader`</span>

[Go back to the content](#4)

**Variant 1:**
Implement **oversampling** in the DataLoader for an imbalanced dataset (e.g., some rare class). Use `WeightedRandomSampler` with higher weights for the underrepresented classes, ensuring more frequent sampling of minority samples.

*Technical note:*
- Create a sampler that accounts for class distribution.
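- Pass it as `DataLoader(..., sampler=weighted_sampler, ...)`. A custom sampler replaces shuffling, so leave `shuffle` at its default; combining `shuffle=True` with `sampler` raises an error.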
- Validate if oversampling truly improves minority class performance.
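
A minimal sketch of inverse-frequency oversampling with `WeightedRandomSampler` is shown below. The toy tensors stand in for a real image dataset, and the 90/10 class ratio is an assumption chosen only for illustration.

```python
from collections import Counter

import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)

# Toy imbalanced dataset (hypothetical): 90 samples of class 0, 10 of class 1.
labels = torch.tensor([0] * 90 + [1] * 10)
features = torch.randn(len(labels), 3, 8, 8)
dataset = TensorDataset(features, labels)

# Weight each sample by the inverse frequency of its class.
class_counts = Counter(labels.tolist())
sample_weights = torch.tensor(
    [1.0 / class_counts[int(y)] for y in labels], dtype=torch.double
)

sampler = WeightedRandomSampler(
    sample_weights, num_samples=len(dataset), replacement=True
)
loader = DataLoader(dataset, batch_size=20, sampler=sampler)  # no shuffle=True

# Count how often each class is actually drawn in one epoch.
drawn = Counter()
for _, batch_labels in loader:
    drawn.update(batch_labels.tolist())
```

With these weights both classes carry the same total sampling mass, so the minority class should appear in roughly half of the drawn samples instead of a tenth.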

---
**Variant 2:**
Create **multiple DataLoaders** for a hierarchical classification scenario: one loader for coarse categories (e.g., animals vs. vehicles) and one for fine-grained categories (dog, cat, truck, car). The model can train in two phases, using each DataLoader separately.

*Technical note:*
- Tag each image with both coarse and fine label.
- Set up two separate `Dataset` objects or handle one dataset with different label calls.
- This approach can be used for multi-level classification tasks.

---
**Variant 3:**
Use a **Contrastive Dataloader** that outputs pairs of images (similar or dissimilar). Implement a custom `__getitem__` that picks two images with a label indicating same class or not. This is useful for Siamese networks or metric learning.

*Technical note:*
- Each batch item is (img1, img2, label_is_same).
- `collate_fn` might need customization to stack these pairs properly.
- Great for tasks that rely on embedding or distance-based classification.

---
**Variant 4:**
Adopt a **grouped DataLoader** for a large dataset so that each batch has images from exactly 2 classes, making it easier for the model to learn subtle differences. The `collate_fn` can ensure each batch has only classes A and B, then the next batch has C and D, etc.

*Technical note:*
- Shuffle class pairs, then sample images from those two classes in each batch.
- Helps model focus on fewer categories at once.
- In practice, might require a custom indexing strategy.

---
**Variant 5:**
Generate an **online data augmentation pipeline** that randomly rotates, flips, or color-jitters each image during the DataLoader fetch. Set `pin_memory=True` for faster GPU transfers. Confirm that transformations differ each epoch for better model generalization.

*Technical note:*
- Use `transforms.RandomHorizontalFlip()`, `RandomRotation(degrees=15)`, etc.
- Data augmentation is typically done in `transforms.Compose`.
- `pin_memory=True` can improve performance if using GPU.

---
**Variant 6:**
Implement a **curriculum-based DataLoader** that first yields “easiest” samples (e.g., large objects with high contrast) and gradually includes “harder” samples (small objects, occluded). Keep track of difficulty in the dataset and re-sample accordingly over epochs.

*Technical note:*
- Possibly store a “difficulty score” for each image in the dataset.
- Dynamically adjust the sampling distribution each epoch.
- This approach can help the model learn from easy to challenging examples.

---
**Variant 7:**
Build a **semi-supervised DataLoader** with a small labeled set and a large unlabeled set. The labeled set has standard labels, while the unlabeled set only provides data. Return them in separate batches or combined with special labels indicating unlabeled samples.

*Technical note:*
- Possibly define two datasets, LabeledDataset and UnlabeledDataset, each with their own indexing.
- A custom collate function might put a flag for unlabeled items.
- This approach is crucial for advanced semi-supervised training strategies.

---
**Variant 8:**
Use a **stratified sampling** approach so each batch approximates the overall class distribution. For multi-class with big data, this ensures each mini-batch is representative. This can help avoid epochs where some classes are scarcely sampled.

*Technical note:*
- Compute class frequencies beforehand.
- Implement a sampler that picks from each class in proportion to its frequency.
- Maintain randomization within class sub-samples.

---
**Variant 9:**
Using the **iNaturalist** dataset, explore loading multiple images *per instance* to simulate different views or perspectives of the same species. For each species, instead of loading just one image at a time, design a DataLoader that can load a small set of images (e.g., 2-3 images) associated with the same instance. The task remains image classification for individual images, but the data loading should demonstrate handling multiple images per sample.

*Technical note:*
- Adapt the `ImageFolder` logic or create a custom `Dataset` to group images by instance (species in this case). You might need to restructure or understand how iNaturalist organizes images.
- Instead of a single image tensor per item, each item from the DataLoader will yield a *list* or *tuple* of image tensors (e.g., of length 2 or 3), each of shape [C, H, W].
- For the classification model, you will still process each image individually (e.g., using a CNN), but the data loading step will prepare you for scenarios where multiple views or inputs are available for each instance.

---
**Variant 10:**
Implement a **“sliding window”** DataLoader for medical imaging volumes (e.g., 3D MRIs). Each sample is a sub-volume chunk of shape [depth, height, width]. Slide through the full volume with overlap. Return sub-volumes plus a label (e.g., tumor or not).

*Technical note:*
- Use a custom indexing approach that enumerates sub-volume coordinates.
- Potentially set a stride for the sliding window.
- Returns partial volumes, good for patch-based training.
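
The coordinate enumeration mentioned above can be sketched like this. The function names are hypothetical, the same stride is used on all three axes for brevity, and the window is assumed to be no larger than the volume along any axis.

```python
from itertools import product


def window_starts(size, window, stride):
    """Start offsets along one axis; assumes window <= size."""
    starts = list(range(0, size - window + 1, stride))
    if starts[-1] != size - window:
        starts.append(size - window)  # extra window so the far edge is covered
    return starts


def subvolume_corners(volume_shape, window_shape, stride):
    """All (depth, height, width) corner coordinates of the sliding window."""
    axes = [window_starts(s, w, stride) for s, w in zip(volume_shape, window_shape)]
    return list(product(*axes))


# A 64x128x128 volume scanned with 32x64x64 windows and a stride of 32.
corners = subvolume_corners((64, 128, 128), (32, 64, 64), stride=32)
```

A custom `Dataset` could store `corners` and, in `__getitem__`, slice `volume[d:d+32, h:h+64, w:w+64]` for the requested corner.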

---
**Variant 11:**
Create a DataLoader that **balances** the classes at the *batch* level specifically. For example, each mini-batch of 32 images must have exactly 8 images from each of 4 classes. This ensures constant class balance in every batch.

*Technical note:*
- Possibly implement a custom sampler that cycles through classes in round-robin fashion.
- Good for smaller classes that risk underrepresentation in random sampling.
- Must handle leftover images carefully.
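
One possible way to build class-balanced batches is sketched below in plain Python. `balanced_batches` is a hypothetical helper, and this version simply drops the leftovers once the smallest class can no longer fill a batch.

```python
import random
from collections import defaultdict


def balanced_batches(labels, per_class=8, seed=0):
    """Yield index lists containing exactly `per_class` samples of every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    for indices in by_class.values():
        rng.shuffle(indices)

    # The smallest class limits the number of full batches; leftovers are dropped.
    n_batches = min(len(v) for v in by_class.values()) // per_class
    for b in range(n_batches):
        batch = []
        for indices in by_class.values():
            batch.extend(indices[b * per_class:(b + 1) * per_class])
        rng.shuffle(batch)  # avoid a fixed class order inside the batch
        yield batch


# Toy labels (hypothetical): 4 classes with 20 images each.
toy_labels = [c for c in range(4) for _ in range(20)]
batches = list(balanced_batches(toy_labels))
```

A list of index lists like `batches` can be passed to `DataLoader(dataset, batch_sampler=batches)`; to get different batches each epoch, regenerate the list with a new seed.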

---
**Variant 12:**
Use a DataLoader that includes **metadata** per sample, such as bounding box coordinates or textual captions, in addition to the image. The `__getitem__` returns `(image, label, metadata)`. For instance, some tasks might require bounding boxes for classification awareness.

*Technical note:*
- Extend your dataset class to read annotation files or CSV with bounding boxes/captions.
- The DataLoader can pass these along in each batch for downstream usage.
- Keep the data structure consistent in `collate_fn`.

---
**Variant 13:**
Implement a **lazy-loading** DataLoader that reads large images from disk only when needed, then caches a certain number of them in memory. This approach suits extremely large image datasets that can’t fit fully in RAM.

*Technical note:*
- Use a caching mechanism (e.g., an LRU cache) or partial in-memory for frequently accessed images.
- `__getitem__` only loads from disk if not in cache.
- Speeds up repeated epochs on large sets.
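
The caching idea can be sketched with an `OrderedDict` acting as a small LRU cache. `LazyCachedImages` and `fake_loader` are hypothetical names; in a real dataset the loader would open and decode an image file, and the class would subclass `torch.utils.data.Dataset`.

```python
from collections import OrderedDict


class LazyCachedImages:
    """Loads an item on first access and keeps the most recent `capacity` items."""

    def __init__(self, paths, loader, capacity=2):
        self.paths = list(paths)
        self.loader = loader         # function that reads one item from disk
        self.capacity = capacity
        self._cache = OrderedDict()  # index -> item, least recently used first
        self.disk_reads = 0

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        if index in self._cache:
            self._cache.move_to_end(index)     # mark as most recently used
            return self._cache[index]
        item = self.loader(self.paths[index])  # touch the disk only on a miss
        self.disk_reads += 1
        self._cache[index] = item
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)    # evict the least recently used
        return item


def fake_loader(path):
    # Stand-in for real image decoding (assumption for this sketch).
    return f"pixels of {path}"


images = LazyCachedImages(["a.png", "b.png", "c.png"], fake_loader, capacity=2)
accessed = [images[i] for i in (0, 1, 0, 2, 1)]
```

Note that with `num_workers > 0` each DataLoader worker process holds its own copy of the dataset, so each worker would keep a separate cache.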

---
**Variant 14:**
Construct a **DataLoader with partial label** (multi-label scenario) for each image. For example, each image might have labels [“animal”, “brown”], [“vehicle”, “red”], etc. Return a multi-hot vector for each sample. The model can then handle multi-label classification.

*Technical note:*
- The dataset must store label vectors for each image.
- Collate them so each batch has shape [batch_size, number_of_possible_labels].
- Typically pairs well with a `BCEWithLogitsLoss`.

---
**Variant 15:**
For a **paired image translation** task (e.g., day→night), your DataLoader should return `(img_day, img_night)` pairs. Possibly from a folder structure that matches day and night images by filename. This is common for pix2pix-like training.

*Technical note:*
- Implement a custom dataset class that looks up the matching counterpart for each day image in a “night” folder.
- Return both images in a single sample.
- Sorting filenames or storing pairs in a dictionary can help.

---
**Variant 16:**
Develop a **DataLoader that merges two datasets** on the fly. For instance, 50% of the time sample from dataset A (CIFAR-10) and 50% from dataset B (SVHN). This helps training a network that sees mixed data from two domains.

*Technical note:*
- Inside `__getitem__`, randomly choose which dataset to pull from.
- Keep separate indices for each dataset or unify them in a single index range.
- Could be used for domain adaptation or multi-task learning.

---
**Variant 17:**
Create a **DataLoader for super-resolution** tasks. For each high-resolution image in the dataset, downscale it by a factor (e.g., 4×) to produce a low-res input. Return `(lr_image, hr_image)`. Perfect for training a super-resolution CNN.

*Technical note:*
- Use something like `transforms.Resize` to produce the low-res version.
- The dataset must store the high-res original for the ground-truth.
- This approach is standard in SR research.

---
**Variant 18:**
Implement a **DataLoader for style transfer**. Each batch item includes `(content_image, style_image)`. Randomly select pairs from a “content” folder and a “style” folder. The model can then learn to blend style and content in training.

*Technical note:*
- Partition content images and style images separately.
- `__getitem__` picks one from each set randomly.
- Good for quick prototyping of style transfer training loops.

---
**Variant 19:**
Adopt a **DataLoader that yields partial or corrupted images** to implement inpainting tasks. For instance, zero out a random patch in the image to simulate missing data. The label or “target” is the original uncorrupted image.

*Technical note:*
- The dataset stores the full image.
- On-the-fly transform removes a patch from the input, producing “corrupted_image.”
- Return `(corrupted_image, full_image)` as `(input, target)`.

---
**Variant 20:**
Use a **custom multi-resolution DataLoader** that returns the same image in multiple scales (e.g., 224×224 and 112×112). The model can have a branch for each scale, or you can feed the smaller scale in a different stage. This can enhance multi-scale feature learning.

*Technical note:*
- For each sample, generate two versions with `transforms.Resize((224, 224))` and `transforms.Resize((112, 112))`; a single integer argument would only fix the shorter side.
- Return `(img_224, img_112, label)`.
- Useful for architectures expecting multi-level resolution inputs.

---

<a class="anchor" id="4.3"></a>

## <span style="color:red; font-size:1.5em;">Task 3. Create a CNN model</span>

[Go back to the content](#4)

**Variant 1:**
Develop a **MobileNetV2-inspired** CNN. Use depthwise separable convolutions for efficiency. Have a few inverted residual blocks. Conclude with a global average pooling and a linear layer for classification. This structure ensures a lightweight yet powerful model.

*Technical note:*
- Each inverted residual block expands the channels with a 1×1 conv, applies a 3×3 depthwise conv, then projects back down with a linear 1×1 conv (no activation).
- Use ReLU6 or standard ReLU as activation.
- Output dimension depends on the dataset classes.

---
**Variant 2:**
Build a **DenseNet-like** model with dense connections between convolutional layers. After each small block, concatenate feature maps and apply a transition layer. Final classification layer uses average pooling and a fully connected output.

*Technical note:*
- Each dense layer: `BN -> ReLU -> Conv`, then concatenate its output with its input along the channel dimension.
- A transition block can include `1x1 conv` + average pool to reduce dimensions.
- Suitable for deeper architectures in PyTorch.

---
**Variant 3:**
Implement a **UNet-inspired** model for segmentation but adapt it for classification by replacing the decoder with a classification head. The encoder uses downsampling conv blocks, and skip connections exist but feed into a final FC layer. Good for tasks requiring spatial detail in early layers.

*Technical note:*
- Typical UNet has symmetrical encoder-decoder; here the “decoder” is replaced with a simpler aggregator.
- High-level skip connections might be concatenated or averaged.
- The last output dimension is the number of classes.

---
**Variant 4:**
Recreate `model_2` by including **Squeeze-and-Excitation (SE) blocks**. Each convolutional block is followed by an SE step that learns channel-wise attention. Combine them in your CNN layers, culminating in a final classification layer.

*Technical note:*
- SE block: global pooling → small FC → ReLU → second FC → Sigmoid → multiply with feature maps.
- Improves channel attention for each conv block.
- Helps highlight important channels for each feature map.
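
The SE step listed above can be sketched as a small module. This is a minimal illustration rather than the lecture's `model_2`; the class name `SEBlock` and the reduction ratio of 16 are assumptions.

```python
import torch
from torch import nn


class SEBlock(nn.Module):
    """Squeeze-and-Excitation: rescale each channel by a learned weight in (0, 1)."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: global average pooling
        self.fc = nn.Sequential(             # excitation: FC -> ReLU -> FC -> Sigmoid
            nn.Linear(channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        weights = self.fc(self.pool(x).view(b, c))
        return x * weights.view(b, c, 1, 1)  # broadcast over height and width


se = SEBlock(channels=32)
features = torch.randn(4, 32, 16, 16)
out = se(features)
```

The block keeps the input shape, so it can be placed after any convolutional block without changing the rest of the network.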

---
**Variant 5:**
Construct a **ResNet-18** style network from scratch using basic blocks with 2-layer conv and skip connections. End with a fully connected layer for classification. This trains from initialization or can load partial pretrained weights if available.

*Technical note:*
- BasicBlock: 3×3 conv + BN + ReLU, then 3×3 conv + BN, add the skip input, then apply the final ReLU.
- Four stages with increasing channel depth.
- Use final average pooling and a linear layer for output.

---
**Variant 6:**
Propose a **Vision Transformer (ViT)-inspired** model. Split each image into patches, project them linearly, then feed them through a small transformer encoder. Conclude with an MLP head for classification. This brings transformer ideas into `model_2`.

*Technical note:*
- Each image patch is flattened, then embedded with a linear layer.
- Encoder: multi-head self-attention + feedforward blocks.
- Use a classification token, or pool the final patch embeddings.

---
**Variant 7:**
Create a **lightweight CNN** with only 3 convolutional layers, each followed by batch norm and ReLU. Use a final global average pooling. Suitable for resource-constrained devices. This minimal approach is easily comparable to the lecture’s baseline.

*Technical note:*
- Example: conv(3→16), conv(16→32), conv(32→64).
- BN+ReLU after each conv.
- Final layer: linear from 64 to number_of_classes.

---
**Variant 8:**
Design a **dual-input CNN** (for example, grayscale + edges as separate input channels) within the same forward pass. Concatenate their feature maps mid-network, then proceed with classification. This can replicate multi-modal input in a single model.

*Technical note:*
- Input dimension might be 2 channels: [batch, 2, H, W].
- Or process them separately in two conv branches, then merge.
- Summation or concatenation merges the branches.

---
**Variant 9:**
Implement a **Grouped Convolution** approach: in early layers, split channels into two groups for parallel conv. Then merge. This reduces parameter count and sometimes accelerates inference. Continue with standard conv layers after grouping.

*Technical note:*
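- `nn.Conv2d(in_channels, out_channels, kernel_size=3, groups=2)` splits the channels into two groups.
- Grouped conv is used in ShuffleNet and ResNeXt; the depthwise convs in Xception-like architectures are the extreme case `groups=in_channels`.
- Both `in_channels` and `out_channels` must be divisible by `groups`.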

---
**Variant 10:**
Use a **feature pyramid network** idea: extract features at multiple scales from a backbone CNN (like a simplified ResNet). Combine them into a single representation for classification. This approach helps the model handle objects at different scales.

*Technical note:*
- Feature pyramid merges outputs from multiple resolution stages.
- Possibly upsample lower-resolution features to match higher resolution.
- Then do a final classification from the aggregated feature map.

---
**Variant 11:**
Construct a **ShuffleNet** style model. Use channel shuffling between groups after grouped convolutions. Follow with BN/ReLU. End with average pooling and an FC layer. This aims for high efficiency with relatively few parameters.

*Technical note:*
- The shuffle operation ensures cross-group information exchange.
- Reproduce the “shuffle” pattern in forward pass after group conv.
- Commonly used in mobile device scenarios.

---
**Variant 12:**
Implement a **Dilated CNN** for classification, using dilated convolutions in the later layers to capture a wider receptive field. This model can handle large context without added parameters. Finish with a global pooling or flatten + FC.

*Technical note:*
- Use conv2d with `dilation=2` or `dilation=4` in deeper layers.
- Helps see more context around each pixel.
- Manage kernel size carefully to avoid excessive memory usage.

---
**Variant 13:**
Create a **ResNeXt** style block in place of standard residual blocks. Each block has grouped convolutions (e.g., group=32). Summation with the skip connection. This often yields better performance for the same depth.

*Technical note:*
- The cardinality (number of groups) is a key hyperparameter.
- Use `nn.Conv2d(in_channels, out_channels, kernel_size=3, groups=32)` for the 3×3 conv in each block; both channel counts must be divisible by 32.
- Then add skip input for the final output.

---
**Variant 14:**
Implement a **multi-task** version of `model_2` that outputs both class prediction and an auxiliary regression (e.g., bounding box). The network has a shared backbone but two separate heads. Ideal if you want classification plus location info.

*Technical note:*
- Shared feature extractor with convolutional layers.
- Head 1: linear for classification. Head 2: linear for bounding box coords.
- The forward pass returns two outputs.

---
**Variant 15:**
Build a **CapsNet** style classifier. Each conv block forms primary capsules. Then a dynamic routing mechanism aggregates them into classification capsules. This significantly differs from standard CNN but can be integrated into `model_2` logic.

*Technical note:*
- PrimaryCaps layer = a set of small conv filters that output “capsule” vectors.
- Routing by agreement uses iterative refinement of capsule weights.
- Output capsules match the number of classes.

---
**Variant 16:**
Design a CNN that includes a **spatial transformer** module early on. The transformer learns to warp or align input images. After that module, feed the transformed output to standard conv layers and end with a classification head.

*Technical note:*
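- `nn.Sequential(SpatialTransformer(), conv_block, ...)`, where `SpatialTransformer` is a module you write yourself (it is not part of `torch.nn`).
- A spatial transformer usually consists of a localization network that predicts the transform and a grid sampler, e.g. `F.affine_grid` followed by `F.grid_sample`.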
- Great for tasks where the object might be at varied positions.

---
**Variant 17:**
Propose a **NASNet-like** model with repeated “normal cells” and “reduction cells.” Although manual, mimic the NASNet approach with skip or parallel conv branches. The final aggregator is a global average pool + FC layer.

*Technical note:*
- Each “cell” is an arrangement of conv branches with merges.
- Reduction cells reduce spatial size with stride 2.
- Normal cells keep the same spatial dimension.

---
**Variant 18:**
Implement a **self-attention** block within a CNN. After certain conv layers, apply a self-attention mechanism to highlight important spatial features. Then continue with standard CNN layers. This merges CNN with transformer-like attention.

*Technical note:*
- Flatten the feature map and compute attention scores for each location.
- Weighted sum of feature vectors is the attention output.
- Improves modeling of global dependencies in images.

---
**Variant 19:**
Create a **GAN discriminator**-like architecture for classification, repurposing typical discriminator blocks (Conv→LeakyReLU→Conv→LeakyReLU→...). Then finalize with a Sigmoid or linear output for the classification. This reuses style from generative tasks.

*Technical note:*
- Discriminator blocks usually have fewer pooling layers, but multiple strided convs.
- Keep an eye on dimension reduction as you go deeper.
- The output has 1 unit for binary classification, or one unit per class for multi-class.

---
**Variant 20:**
Use a **hybrid CNN-RNN** approach: a CNN processes the image row by row, then a small RNN scans across the row embeddings. Conclude with a classification layer. This is unconventional but can experiment with capturing row-wise dependencies.

*Technical note:*
- Flatten each row’s conv features into a sequence dimension for the RNN.
- LSTM or GRU can handle the temporal dimension (each row).
- Potentially interesting for text-like structural images.

---

<a class="anchor" id="4.4"></a>

## <span style="color:red; font-size:1.5em;">Task 4. Train the model on the corresponding dataset</span>

[Go back to the content](#4)

**Variant 1:**
Use a **one-cycle learning rate policy**. Start from a small LR, increase it to a maximum during the early part of the training run, then anneal it to a value well below the starting LR. This often yields faster convergence. Log the learning rate and training loss each iteration to confirm that the schedule follows the one-cycle shape.

*Technical note:*
- `torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=...)` is built-in.
- The schedule needs a total_step or epoch × iteration count.
- Monitor if it helps skip local minima.
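
To see the schedule before training, the LR can be recorded step by step. This is a minimal sketch: the tiny `nn.Linear` model, the step counts, and `max_lr=0.1` are placeholder assumptions, and the forward/backward pass is omitted.

```python
import torch

model = torch.nn.Linear(10, 2)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

epochs, steps_per_epoch = 5, 20  # placeholder sizes
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.1, epochs=epochs, steps_per_epoch=steps_per_epoch
)

lr_history = []
for _ in range(epochs * steps_per_epoch):
    lr_history.append(optimizer.param_groups[0]["lr"])
    # A real loop would run the forward pass and loss.backward() here.
    optimizer.step()
    scheduler.step()  # stepped once per batch, not once per epoch

peak_step = lr_history.index(max(lr_history))
```

Plotting `lr_history` should show a single rise to `max_lr` followed by a decay to a value below the starting LR.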

---
**Variant 2:**
Train with **gradient accumulation** to effectively use a large batch size. Accumulate gradients over 4 sub-batches, then update. This helps if GPU memory is limited but you want large-batch benefits like stable gradient estimates.

*Technical note:*
- `for i, (data, target) in enumerate(train_loader): ... loss.backward()`
- If `(i+1) % accumulate_steps == 0: optimizer.step(); optimizer.zero_grad()`
- Divide each sub-batch loss by `accumulate_steps` before calling `backward()` so the accumulated gradient equals the average over the large batch.
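
Put together, the loop might look like the sketch below. The linear model and the list standing in for `train_loader` are placeholders; the loss is scaled by `accumulate_steps` so that four sub-batches of 4 samples behave like one batch of 16.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 3)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

# Stand-in for train_loader: 8 sub-batches of 4 samples each.
train_loader = [(torch.randn(4, 8), torch.randint(0, 3, (4,))) for _ in range(8)]
accumulate_steps = 4
num_updates = 0

optimizer.zero_grad()
for i, (data, target) in enumerate(train_loader):
    loss = criterion(model(data), target)
    # Scale so the summed gradients equal the mean over the large batch.
    (loss / accumulate_steps).backward()
    if (i + 1) % accumulate_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
        num_updates += 1
```

If the number of sub-batches is not a multiple of `accumulate_steps`, the leftover gradients are never applied in this sketch; a final `optimizer.step()` after the loop would handle that case.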

---
**Variant 3:**
Apply **label smoothing**: instead of one-hot labels, use 0.9 for the true class and 0.1 distributed among others. This can reduce overconfidence. Evaluate if your validation accuracy or calibration improves.

*Technical note:*
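- Implement label smoothing manually or use PyTorch’s built-in `nn.CrossEntropyLoss(label_smoothing=0.1)`. The built-in spreads the smoothing mass uniformly over all classes, including the true one.
- Typically helps on large datasets where the model overfits or makes confident misclassifications.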
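
The sketch below compares the built-in option with a manual version on two made-up logit rows. With the built-in, the true class receives `1 - eps + eps / K` rather than exactly 0.9, because the smoothing mass is shared across all `K` classes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])  # made-up values
targets = torch.tensor([0, 2])
eps = 0.1
num_classes = logits.shape[1]

# Built-in label smoothing (available since PyTorch 1.10).
builtin = nn.CrossEntropyLoss(label_smoothing=eps)(logits, targets)

# Manual equivalent: (1 - eps) on the one-hot target plus eps / K on every class.
soft_targets = F.one_hot(targets, num_classes).float() * (1 - eps) + eps / num_classes
manual = -(soft_targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
```

To get exactly 0.9 on the true class and 0.1 split among the others, build `soft_targets` by hand with `eps / (num_classes - 1)` on the wrong classes and reuse the manual loss line.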

---
**Variant 4:**
Perform **mixup** training. For each mini-batch, randomly mix pairs of images and their labels. The model learns from interpolated samples. This often boosts robustness and reduces overfitting.

*Technical note:*
- For images x1, x2, labels y1, y2, do x_m = λx1 + (1−λ)x2, y_m = λy1 + (1−λ)y2.
- Train with MSE or cross-entropy adapted for mixed labels.
- Parameter λ ~ Beta(α, α).
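
A minimal sketch of the mixing step and a cross-entropy that accepts soft labels follows. The helper names `mixup_batch` and `soft_cross_entropy`, and `alpha=0.2`, are illustrative assumptions; pairs are formed by shuffling the batch.

```python
import torch
import torch.nn.functional as F


def mixup_batch(x, y, num_classes, alpha=0.2):
    """Mix each sample with a randomly chosen partner from the same batch."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    y_onehot = F.one_hot(y, num_classes).float()
    x_mixed = lam * x + (1 - lam) * x[perm]
    y_mixed = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mixed, y_mixed


def soft_cross_entropy(logits, soft_targets):
    """Cross-entropy against a probability vector instead of a class index."""
    return -(soft_targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()


torch.manual_seed(0)
x = torch.randn(8, 3, 32, 32)        # placeholder images
y = torch.randint(0, 10, (8,))       # placeholder labels
x_m, y_m = mixup_batch(x, y, num_classes=10)
loss = soft_cross_entropy(torch.randn(8, 10), y_m)  # random logits stand in for model(x_m)
```

This version draws one λ per batch; drawing one per sample is an equally valid variant.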

---
**Variant 5:**
Use **SAM (Sharpness-Aware Minimization)** optimizer for training. This technique updates weights in a way that favors flatter minima. Evaluate if generalization improves compared to standard SGD or Adam.

*Technical note:*
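- SAM takes two forward-backward passes per update: compute the gradients, move the weights a small step ρ along the normalized gradient (ascent) direction, recompute the gradients at the perturbed weights, restore the original weights, then apply the base optimizer step with the new gradients.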
- PyTorch libraries or code snippets for SAM exist externally.
- Keep an eye on extra compute overhead.

---
**Variant 6:**
Integrate **selective backprop**: skip backward pass for easy samples (where the loss is below a threshold) so the model focuses on hard examples. Evaluate if it speeds up or improves final accuracy. Keep standard forward pass for all, but selectively do `.backward()`.

*Technical note:*
- If `loss_item < threshold`, skip the backward pass for that sample.
- The threshold can dynamically adjust each epoch or remain fixed.
- Good for large data with many trivial examples.

---
**Variant 7:**
Adopt **stochastic depth** training. Randomly skip entire residual blocks in each forward pass with some probability. This is known to regularize deeper networks like ResNets, making them behave like an ensemble of shallower subnets.

*Technical note:*
- For each residual block, with probability p, forward = identity skip only.
- Grad is zero for that block if it’s skipped.
- Each forward pass therefore trains a network of a different effective depth; at test time all blocks are kept.
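
One way to wrap a residual branch is sketched below. The wrapper name `StochasticDepthBlock` is an assumption, and the test-time scaling by the survival probability is one common convention; other implementations rescale during training instead.

```python
import torch
import torch.nn as nn


class StochasticDepthBlock(nn.Module):
    """Residual wrapper that randomly skips its inner block during training."""

    def __init__(self, block: nn.Module, drop_prob: float = 0.2):
        super().__init__()
        self.block = block
        self.drop_prob = drop_prob

    def forward(self, x):
        if self.training:
            # With probability drop_prob the whole branch is skipped (identity only).
            if torch.rand(1).item() < self.drop_prob:
                return x
            return x + self.block(x)
        # At test time keep the branch, scaled by its survival probability.
        return x + (1.0 - self.drop_prob) * self.block(x)


torch.manual_seed(0)
x = torch.randn(2, 4)
block = StochasticDepthBlock(nn.Linear(4, 4), drop_prob=0.2)
block.eval()
eval_out = block(x)
```

The skip decision here is made once per batch; a per-sample decision would need a random mask with one entry per batch element.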

---
**Variant 8:**
Perform **multi-task training** for classification plus an auxiliary segmentation or bounding box regression. Combine losses in a weighted sum. See if the shared features help classification. Evaluate both tasks’ metrics.

*Technical note:*
- Suppose total_loss = α * classification_loss + β * bounding_box_loss.
- Adjust α, β to balance tasks.
- Use a data structure (for example, a per-sample mask) that handles images lacking bounding-box or segmentation labels, so that missing targets do not contribute to the auxiliary loss.

---
**Variant 9:**
Use a **frozen backbone** for the first 10 epochs (like a pretrained ResNet). Only train the new classification head. Then unfreeze the backbone for the next 10 epochs to fine-tune all layers. This two-stage approach is standard in transfer learning.

*Technical note:*
- Set `param.requires_grad = False` for backbone initially.
- After the initial stage, set `param.requires_grad = True` for all parameters.
- Helps avoid large gradient updates on pretrained weights early on.
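
The two stages can be sketched as follows. The toy `backbone` stands in for a pretrained ResNet, and the helper names and learning rates are assumptions; the training loops themselves are omitted.

```python
import torch
import torch.nn as nn

# Toy stand-in for a pretrained backbone plus a new classification head.
backbone = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1), nn.Flatten()
)
head = nn.Linear(8, 5)
model = nn.Sequential(backbone, head)


def set_backbone_trainable(flag: bool) -> None:
    for param in backbone.parameters():
        param.requires_grad = flag


def count_trainable(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


# Stage 1: freeze the backbone and optimize only the head.
set_backbone_trainable(False)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
stage1_trainable = count_trainable(model)
# ... train the head for the first 10 epochs ...

# Stage 2: unfreeze and add the backbone to the optimizer with a smaller LR.
set_backbone_trainable(True)
optimizer.add_param_group({"params": list(backbone.parameters()), "lr": 1e-4})
stage2_trainable = count_trainable(model)
# ... fine-tune all layers for the next 10 epochs ...
```

Giving the backbone a lower learning rate than the head is a common choice, not a requirement of the technique.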

---
**Variant 10:**
Train with a **cosine annealing learning rate** schedule over 50 epochs. The LR decreases following a cosine curve from initial LR to a small min LR by the final epoch. This scheduling often stabilizes training for CNNs.

*Technical note:*
- Use `torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50, eta_min=1e-5)`.
- Validate performance after each epoch; the LR will be automatically adjusted.
- Tends to yield smooth convergence.

---
**Variant 11:**
Employ **distributed data parallel** training across multiple GPUs or multiple nodes. Synchronously update the model parameters. Track speedup vs. single-GPU. Confirm that final accuracy matches single-GPU baseline, assuming correct synchronization.

*Technical note:*
- Use `torch.distributed.init_process_group(...)` then `DistributedDataParallel(model)`.
- Partition data across processes with `DistributedSampler`.
- Keep an eye on setup complexities (master address, rank, etc.).

---
**Variant 12:**
Train with an **adversarial training** strategy. Generate small adversarial perturbations (FGSM or PGD) on training samples and train on these adversarially perturbed images. This can improve the model’s robustness.

*Technical note:*
- FGSM: x_adv = x + ε sign(∇_x loss).
- Recompute the gradient for the perturbed input.
- Greatly increases training time but yields more robust classification.
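
A single FGSM training step could look like the sketch below. It assumes inputs scaled to [0, 1] (hence the clamp), an illustrative `epsilon` of 8/255, and a toy linear classifier in place of your CNN.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def fgsm_attack(model, x, y, epsilon):
    """Return x + epsilon * sign(grad_x loss), clamped to the assumed [0, 1] range."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    (grad,) = torch.autograd.grad(loss, x_adv)  # gradient w.r.t. the input only
    return (x_adv.detach() + epsilon * grad.sign()).clamp(0.0, 1.0)


torch.manual_seed(0)
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10))  # toy classifier
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x = torch.rand(4, 3, 8, 8)
y = torch.randint(0, 10, (4,))

# 1) Craft adversarial examples, 2) recompute the loss on them, 3) update.
x_adv = fgsm_attack(model, x, y, epsilon=8 / 255)
optimizer.zero_grad()
adv_loss = F.cross_entropy(model(x_adv), y)
adv_loss.backward()
optimizer.step()
```

If the images are normalized with a dataset mean and std, the clamp range and `epsilon` need to be expressed in the normalized scale instead.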

---
**Variant 13:**
Implement a **knowledge distillation** training loop. Use a larger teacher model (already trained) to provide soft targets for the smaller student. Combine the student’s cross-entropy on the hard labels with the KL divergence between the student’s and teacher’s softened distributions at a chosen temperature.

*Technical note:*
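- `loss = α * CrossEntropy(student_logits, hard_labels) + β * KLDiv(log_softmax(student_logits / T), softmax(teacher_logits / T))`.
- Use a temperature T > 1 to soften both distributions; the KL term is commonly scaled by T² to keep gradient magnitudes comparable.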
- The student sees both the dataset labels and teacher’s distribution.
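
The combined loss can be written as a small function. This is a sketch under common conventions: `F.kl_div` expects log-probabilities for the student and probabilities for the teacher, and the T² factor and the default values of `T`, `alpha`, and `beta` are assumptions.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.5, beta=0.5):
    """alpha * CE(student, hard labels) + beta * T^2 * KL(teacher || student) at temperature T."""
    hard = F.cross_entropy(student_logits, targets)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),  # student: log-probabilities
        F.softmax(teacher_logits / T, dim=1),      # teacher: probabilities
        reduction="batchmean",
    ) * (T * T)
    return alpha * hard + beta * soft


torch.manual_seed(0)
student_logits = torch.randn(4, 5, requires_grad=True)  # stand-in for student(x)
with torch.no_grad():
    teacher_logits = torch.randn(4, 5)                  # stand-in for teacher(x)
targets = torch.randint(0, 5, (4,))
kd_loss = distillation_loss(student_logits, teacher_logits, targets)
kd_loss.backward()  # gradients flow only into the student
```

Keep the teacher in `eval()` mode and under `torch.no_grad()` so it is never updated.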

---
**Variant 14:**
Train with a **reinforcement of correct predictions** approach: if the model is confident and correct on a sample, reduce its loss weight next time. If the model is uncertain or wrong, keep/increase the weight. This is a dynamic weighting scheme.

*Technical note:*
- Maintain a difficulty or correctness score for each sample.
- Gradually adjust each sample’s effective loss weight over epochs.
- Similar to boosting algorithms in classical ML.

---
**Variant 15:**
Use an **auxiliary classifier** in the middle of the network. For instance, at an intermediate layer, branch out a small classifier that predicts the same label. Add its loss to the main classification loss. This can help gradients flow earlier in deep networks (like GoogLeNet’s Inception).

*Technical note:*
- mid_output = mid_classifier(mid_features), compute mid_loss.
- final_output = main_classifier(final_features), compute main_loss.
- total_loss = main_loss + λ * mid_loss.

---
**Variant 16:**
Integrate a **cutout** augmentation approach. In each training image, randomly mask out a square region (e.g., 16×16) by setting it to zero or a mean color. Combine it with random crops and flips. Then proceed with normal training.

*Technical note:*
- For each input image in the batch, pick a random location, set a square patch to 0.
- This helps the model rely on other visual cues, not just one region.
- Implement via a custom transform or inside the training loop.
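
Applied inside the training loop, cutout might look like this sketch. The function name `cutout` is an assumption, and patches centered near a border are clipped, so they can be smaller than `size × size`.

```python
import torch


def cutout(images: torch.Tensor, size: int = 16) -> torch.Tensor:
    """Zero one random square patch in each image of a (B, C, H, W) batch."""
    out = images.clone()  # leave the input batch untouched
    _, _, h, w = out.shape
    for i in range(out.size(0)):
        cy = torch.randint(0, h, (1,)).item()
        cx = torch.randint(0, w, (1,)).item()
        y0, y1 = max(cy - size // 2, 0), min(cy + size // 2, h)
        x0, x1 = max(cx - size // 2, 0), min(cx + size // 2, w)
        out[i, :, y0:y1, x0:x1] = 0.0
    return out


torch.manual_seed(0)
batch = torch.ones(4, 3, 32, 32)  # placeholder batch
masked = cutout(batch, size=16)
```

When the images are already normalized, zero corresponds to the dataset mean color, which matches the "mean color" option in the task.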

---
**Variant 17:**
Use **longer training** with a **warm restarts** schedule. For example, every 10 epochs, reset the LR to a higher value, then decay over subsequent epochs. Compare if this cyclical approach helps avoid local minima for your CNN.

*Technical note:*
- Use `CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=1)` to restart every 10 epochs, or `T_mult=2` for growing periods.
- With `T_mult=2`, the first restart comes after `T_0` epochs and each subsequent period doubles (10, 20, 40, ...).
- Plot LR vs. epoch to confirm.
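
The restart points can be checked without training by stepping the scheduler alone. In this sketch the base LR of 0.1, `eta_min=1e-4`, and the 70-epoch horizon are placeholder assumptions.

```python
import torch

optimizer = torch.optim.SGD(torch.nn.Linear(4, 2).parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=10, T_mult=2, eta_min=1e-4
)

lrs = []
for epoch in range(70):
    lrs.append(optimizer.param_groups[0]["lr"])
    # A real loop would train for one epoch here.
    optimizer.step()
    scheduler.step()

# Epochs where the LR jumps back up mark the warm restarts.
restart_epochs = [e for e in range(1, 70) if lrs[e] > lrs[e - 1]]
print(restart_epochs)  # → [10, 30]
```

The next restart would fall at epoch 70, since the third period lasts 40 epochs.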

---
**Variant 18:**
Train with **fp16 mixed precision** to reduce memory usage and speed up on GPU. Keep careful track of numerical stability. Compare final accuracy with a full fp32 baseline. Mixed precision can accelerate large CNNs significantly.

*Technical note:*
- Use `torch.cuda.amp.autocast()` plus `GradScaler`.
- If instability arises, fall back to fp32 for certain layers or tune `GradScaler` settings such as `growth_interval`.
- Typically no accuracy drop if done carefully.

---
**Variant 19:**
Use **checkpoint averaging**: train the model for 100 epochs, but every 10 epochs, save a checkpoint. At the end, average the weights from these 10 checkpoints. Evaluate if the ensemble effect yields better accuracy than the final single checkpoint.

*Technical note:*
- Implement a function to load each checkpoint’s state_dict, accumulate them, and divide by the number of checkpoints.
- This is simpler than SWA but similar in concept.
- Usually stabilizes final performance.
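
The averaging function from the first note could be sketched as below. It works on in-memory state dicts; with saved files you would first load each one via `torch.load`. Keeping integer buffers from the last checkpoint is an assumption made here to avoid averaging counters.

```python
import copy

import torch
import torch.nn as nn


def average_state_dicts(state_dicts):
    """Element-wise mean of floating-point tensors across checkpoints."""
    avg = copy.deepcopy(state_dicts[-1])
    for key in avg:
        if avg[key].is_floating_point():
            avg[key] = torch.stack([sd[key] for sd in state_dicts]).mean(dim=0)
        # Integer buffers (e.g. BatchNorm's num_batches_tracked) keep the last value.
    return avg


torch.manual_seed(0)
# Three randomly initialized layers stand in for checkpoints saved during training.
checkpoints = [nn.Linear(3, 2).state_dict() for _ in range(3)]
averaged = average_state_dicts(checkpoints)

model = nn.Linear(3, 2)
model.load_state_dict(averaged)
```

For networks with BatchNorm, the running statistics of the averaged model may need to be re-estimated with a pass over the training data.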

---
**Variant 20:**
Train your model with a **large-scale hyperparameter search** across learning rates, momentums, or weight decays. Automate this via a script that tries multiple combinations (grid or random). Evaluate each run’s validation accuracy and pick the best. This helps find an optimal training recipe.

*Technical note:*
- Could integrate libraries like Optuna or Ray Tune for hyperparameter optimization.
- Each hyperparam set is a separate training run.
- Summarize final val accuracy and pick top config.

---

In [2]:
# your code here

<a class="anchor" id="4.5"></a>

## <span style="color:red; font-size:1.5em;">Task 5. Make predictions</span>

[Go back to the content](#4)

**Variant 1:**
Generate **class activation maps** (CAM) for each test image. Visualize which regions contributed most to the predicted class. Overlay the heatmap on the original image. This clarifies model interpretability in classification tasks.

*Technical note:*
- Forward pass with `model`.
- Retrieve feature maps from the final conv layer, multiply by weights from the classifier.
- Upsample to original image resolution, then color-map overlay.
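
For an architecture that ends in global average pooling followed by one linear layer, the map for a class is a weighted sum of the final feature maps. The sketch below assumes you have already captured `features` (shape C × H × W) for one image; the function name and the min-max normalization are illustrative choices.

```python
import torch
import torch.nn.functional as F


def class_activation_map(features, fc_weight, class_idx, out_size):
    """Weight the final conv feature maps by the classifier weights of one class."""
    # features: (C, H, W); fc_weight: (num_classes, C)
    cam = torch.einsum("c,chw->hw", fc_weight[class_idx], features)
    # Upsample to the input resolution for overlaying.
    cam = F.interpolate(
        cam[None, None], size=out_size, mode="bilinear", align_corners=False
    )[0, 0]
    # Normalize to [0, 1] so a color map can be applied.
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)


torch.manual_seed(0)
features = torch.rand(8, 4, 4)   # stand-in for final conv features of one image
fc_weight = torch.randn(5, 8)    # stand-in for the classifier weight matrix
cam = class_activation_map(features, fc_weight, class_idx=2, out_size=(32, 32))
```

The resulting `cam` can be drawn over the image with `matplotlib`, for example `plt.imshow(cam, cmap="jet", alpha=0.5)`.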

---
**Variant 2:**
Implement a **batch-inference** pipeline that processes the test set in large batches. Evaluate speed vs. single-image inference. Then measure accuracy across the entire set. This is standard for quick predictions at scale.

*Technical note:*
- `model.eval()`, then loop over test_loader with large batch_size.
- Use `torch.no_grad()` or `torch.inference_mode()`.
- Compare total time for different batch sizes.

---
**Variant 3:**
Compute **per-class metrics** (precision, recall, F1) for multi-class classification. Summarize them in a table or confusion matrix to see which classes are hardest. Possibly plot them in a bar chart for clarity.

*Technical note:*
- After predictions, gather them in arrays, use `sklearn.metrics.classification_report`.
- For multi-class, the report includes precision, recall, f1 for each label.
- Helps interpret class imbalances in performance.

---
**Variant 4:**
Perform **MC Dropout** during inference by keeping dropout layers active. Run multiple forward passes (e.g., 30 times) for the same image. Compute mean and variance of predictions. This yields an uncertainty estimate.

*Technical note:*
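- Call `model.train()` to activate dropout, but do not run a backward pass. If the model has BatchNorm layers, switch only the dropout modules to train mode so the batch statistics stay frozen.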
- A loop: `for _ in range(n_samples): preds.append(model(x))`.
- Compute standard deviation across those predictions.
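
A sketch that activates only the dropout layers, so BatchNorm keeps using its running statistics, is shown below. The helper names and the toy model are assumptions.

```python
import torch
import torch.nn as nn


def enable_dropout(model: nn.Module) -> None:
    """Put the model in eval mode, then re-enable only the dropout layers."""
    model.eval()
    for module in model.modules():
        if isinstance(module, (nn.Dropout, nn.Dropout2d, nn.Dropout3d)):
            module.train()


@torch.no_grad()
def mc_dropout_predict(model, x, n_samples=30):
    """Mean and standard deviation of softmax outputs over stochastic passes."""
    enable_dropout(model)
    probs = torch.stack([model(x).softmax(dim=1) for _ in range(n_samples)])
    return probs.mean(dim=0), probs.std(dim=0)


torch.manual_seed(0)
model = nn.Sequential(  # toy model with both BatchNorm and Dropout
    nn.Linear(10, 32), nn.BatchNorm1d(32), nn.ReLU(), nn.Dropout(0.5), nn.Linear(32, 4)
)
x = torch.randn(5, 10)
mean_probs, std_probs = mc_dropout_predict(model, x)
```

A high standard deviation on the predicted class suggests the model is uncertain about that input.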

---
**Variant 5:**
Implement a **TTA (Test-Time Augmentation)** approach: for each test image, apply random flips, crops, or color jitters. Average the model’s outputs across these variants. Compare if TTA improves final accuracy.

*Technical note:*
- e.g., do 5 random augmentations + original image, sum probabilities, then divide by 6.
- Common in Kaggle competitions.
- Usually yields a small accuracy boost.
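
A deterministic two-view version (original plus horizontal flip) is sketched below as a starting point; the random crops and color jitter mentioned in the task would be added as further views. The function name `predict_tta` and the toy model are assumptions.

```python
import torch
import torch.nn as nn


@torch.no_grad()
def predict_tta(model, images):
    """Average softmax probabilities over the original and flipped views."""
    model.eval()
    views = [images, torch.flip(images, dims=[3])]  # dims=[3] flips the width axis
    probs = torch.stack([model(view).softmax(dim=1) for view in views])
    return probs.mean(dim=0)


torch.manual_seed(0)
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 4))  # toy classifier
images = torch.rand(6, 3, 8, 8)
tta_probs = predict_tta(model, images)
tta_pred = tta_probs.argmax(dim=1)
```

Averaging probabilities rather than logits keeps each view's contribution bounded, which is the usual choice for TTA.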

---
**Variant 6:**
Predict with a **sliding window** approach for large images or segmentation tasks. Split the test image into overlapping patches, pass each patch through the model, then merge predictions. Good for high-resolution inputs.

*Technical note:*
- For classification, might treat each patch as local ROI. For segmentation, piece together patch outputs.
- Overlap can reduce edge artifacts.
- Potentially large memory usage, so watch out for efficiency.

---
**Variant 7:**
Perform **out-of-distribution detection**. Supply images from a different domain, then measure the model’s confidence. If the predicted probability is low or the feature embedding is far from known classes, label it as OOD. This tests model robustness.

*Technical note:*
- Possibly measure max softmax probability or distance to training set embeddings.
- Evaluate how the model behaves on classes not in training.
- Helps see if the system confuses unknowns with known classes.

---
**Variant 8:**
Use a **threshold calibration** approach after predictions. Instead of the default 0.5 for binary classification, find the best threshold via validation F1 or any chosen metric. Then apply that threshold to the test set predictions.

*Technical note:*
- Sort predicted probabilities on the validation set.
- Evaluate metric at each possible threshold.
- The threshold that yields best metric is used for final test predictions.
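
The search can be sketched in plain Python on a handful of made-up validation scores. Each distinct predicted probability is tried as a threshold, and a sample is labeled positive when its probability is at least the threshold.

```python
def f1_at_threshold(probs, labels, threshold):
    """F1 score when predicting positive for probabilities >= threshold."""
    preds = [int(p >= threshold) for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(probs, labels):
    """Try every distinct predicted probability as a candidate threshold."""
    candidates = sorted(set(probs))
    return max(candidates, key=lambda t: f1_at_threshold(probs, labels, t))


# Made-up validation outputs for illustration only.
val_probs = [0.10, 0.35, 0.40, 0.45, 0.60, 0.80, 0.90]
val_labels = [0, 0, 1, 0, 1, 1, 1]

threshold = best_threshold(val_probs, val_labels)
print(threshold)  # → 0.4
```

On this toy data the tuned threshold gives F1 = 8/9, compared with 6/7 at the default 0.5. The threshold must be chosen on the validation set only and then frozen for the test set.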

---
**Variant 9:**
Visualize predictions in an **embedding space**. Pass each test image through the network up to the last layer before the classifier and collect the embeddings. Then apply TSNE or UMAP to reduce to 2D. Color points by predicted label. This can show cluster separations.

*Technical note:*
- Add an option such as `return_embedding=True` to your model’s `forward` (a custom flag, not a built-in), or extract features from the penultimate layer with a forward hook.
- Use scikit-learn’s `TSNE(n_components=2)` or UMAP.
- Plot with matplotlib, color-coded by predicted class.

---
**Variant 10:**
Perform **multi-crop testing**: for each test image, extract multiple overlapping crops (e.g., 5 or 10) at corners and center. Average the model outputs across these crops for a final prediction. Compare if it improves performance over single-crop.

*Technical note:*
- Typically done with `transforms.FiveCrop` or custom logic.
- Put the model in `eval()` mode and feed each crop through it.
- The final class is the one with the highest average probability.

---
**Variant 11:**
Generate a **model ensemble** for prediction. Combine 3–5 trained networks (potentially with different seeds). For each test image, gather each model’s softmax probabilities and average them to get the final predicted class.

*Technical note:*
- Each model is in `eval()` mode, do a forward pass.
- Probability-level averaging is common.
- Usually yields higher accuracy but more inference cost.

---
**Variant 12:**
Implement **confidence calibration** plotting: produce a reliability diagram showing predicted probability vs. actual accuracy in that probability bin. If the model is well-calibrated, it should align with the diagonal. Then compute ECE or MCE.

*Technical note:*
- Collect `p, y` across the test set, bin by `p`, and compare the average `p` in each bin with the fraction of positives (for multi-class, the fraction of correct predictions).
- Plot the difference from diagonal.
- Tools like `torchmetrics` or scikit-learn can help.
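
ECE is the bin-size-weighted average gap between confidence and accuracy. The sketch below uses equal-width bins; the function name, the bin count, and the toy inputs are assumptions.

```python
import torch


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |mean confidence - accuracy| over equal-width bins."""
    bin_edges = torch.linspace(0, 1, n_bins + 1)
    ece = torch.zeros(())
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = (confidences[in_bin].mean() - correct[in_bin].float().mean()).abs()
            ece += in_bin.float().mean() * gap  # weight by the share of samples in the bin
    return ece.item()


# Toy case: the model is 95% confident but right only half of the time.
confidences = torch.tensor([0.95, 0.95, 0.95, 0.95])
correct = torch.tensor([1, 1, 0, 0])
ece = expected_calibration_error(confidences, correct)  # approximately 0.45
```

In practice `confidences` would be the maximum softmax probability per test sample and `correct` would be `(pred == target)`. MCE replaces the weighted average with the maximum gap over bins.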

---
**Variant 13:**
Provide **per-instance prediction explanations** using LIME or SHAP. After the model predicts for an image, approximate local feature importance. Then highlight which pixels (or super-pixels) contributed most to the classification decision.

*Technical note:*
- LIME runs perturbations around the input to see changes in prediction.
- SHAP estimates Shapley-value attributions, using gradient-based or local surrogate approximations.
- Often used for interpretability with end users.

---
**Variant 14:**
Test the model on **adversarially perturbed images**. Evaluate how predictions differ from the standard images. Plot side-by-side results. This reveals how robust (or brittle) your model is under small input perturbations.

*Technical note:*
- Use an FGSM or PGD approach to create adversarial examples.
- Compare predicted labels on clean vs. adversarial images.
- Summarize the attack success rate or the degradation in accuracy.

---
**Variant 15:**
Compute the **logits** for every test sample, then apply a **softmax** temperature scaling at inference (T > 1 or < 1). Observe how it changes predicted confidence. This is a post-training calibration trick.

*Technical note:*
- If `logits = model(x)`, then `softmax(logits / T)` for a chosen T.
- T>1 flattens the distribution, T<1 sharpens it.
- Evaluate whether calibration improves; accuracy stays the same because dividing by T does not change the arg-max class.
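
The effect of T is easy to see on two made-up logit rows. The helper name `scaled_softmax` and the chosen temperatures are illustrative.

```python
import torch


def scaled_softmax(logits: torch.Tensor, T: float) -> torch.Tensor:
    """Softmax of logits divided by the temperature T."""
    return torch.softmax(logits / T, dim=1)


logits = torch.tensor([[4.0, 1.0, 0.0], [0.5, 2.5, 1.0]])  # made-up logits

probs_sharp = scaled_softmax(logits, T=0.5)  # T < 1: more confident
probs_base = scaled_softmax(logits, T=1.0)   # plain softmax
probs_flat = scaled_softmax(logits, T=2.0)   # T > 1: less confident

for name, probs in [("T=0.5", probs_sharp), ("T=1.0", probs_base), ("T=2.0", probs_flat)]:
    print(name, probs.max(dim=1).values.tolist(), probs.argmax(dim=1).tolist())
```

The predicted class is identical in all three cases, while the top probability shrinks as T grows. For calibration, T is typically fitted on the validation set by minimizing the negative log-likelihood.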

---
**Variant 16:**
Perform a **class-based error analysis**: gather all test samples the model got wrong for each class. Inspect them or generate a mini-HTML gallery. This helps see if there’s a pattern, e.g., certain backgrounds or angles that cause confusion.

*Technical note:*
- Compare `pred_label != true_label`.
- Organize misclassified samples in folders or a webpage for manual inspection.
- Potentially reveal data distribution issues or consistent failure modes.

---
**Variant 17:**
For **multi-label classification**, compute predictions as multiple independent sigmoid outputs. Then apply a chosen threshold for each label or a global threshold. Summarize the average precision for each label. Show examples where only some of an image’s labels are predicted correctly.

*Technical note:*
- `model` returns logit vector. Sigmoid for each dimension.
- Compare each dimension’s thresholded output to ground truth.
- Summarize with per-label F1 or average precision.

---
**Variant 18:**
Evaluate **speed vs. accuracy** trade-offs by dynamically adjusting image resolution at inference time. Predict at 224×224, then at 160×160, etc., measuring any drop in accuracy. Plot a curve of resolution vs. inference speed and accuracy.

*Technical note:*
- Downsample test images to different sizes before feeding the model.
- Typically, lower resolution = faster inference, but lower accuracy.
- Summarize results to find an optimal compromise for deployment.

---
**Variant 19:**
Run a **cross-dataset evaluation**: train on your original dataset, then test on a similar but distinct dataset (e.g., train on CIFAR-10, test on STL-10). Compare how the model generalizes. This clarifies domain transfer ability.

*Technical note:*
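- Put the model in `eval()` mode and run it directly on images from the new dataset, resized to the training resolution; this works best when the label sets match.
- If labels differ (CIFAR-10 and STL-10 share only 9 of their 10 classes), adapt the evaluation metrics or restrict the evaluation to the common classes.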
- Typically results in accuracy drop, quantifying domain gap.

---
**Variant 20:**
Implement a **two-stage prediction** for large label sets. First, predict a super-class or coarse category. If needed, apply a second specialized classifier for the sub-category. Compare if this hierarchical approach improves accuracy or speed.

*Technical note:*
- Stage 1: coarse classification (e.g., animal vs. vehicle).
- Stage 2: specialized model for the sub-group (e.g., dog vs. cat).
- Potentially skip second stage for confident predictions.

---

<a class="anchor" id="4.7"></a>