# Filter Dataset and Train a PyTorch Model
In this Jupyter notebook, we will explore the implementation of a modified version of the `ImageFolder` dataset from the PyTorch `torchvision` package. This modified dataset filters out samples whose filenames are listed in a given CSV file. You obtain the CSV file by running fastdup (see [this notebook](./analyze.ipynb)) or dropping us an email at info@visual-layer.com .

<!--<badge>--><a href="https://colab.research.google.com/github/visual-layer/vl-datasets/blob/master/notebooks/train-clean-pytorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a><!--</badge>-->

## Installation & Setting Up

First, we need to install the fastdup and matplotlib libraries. Run the following command in your Jupyter notebook to install them.

In [None]:
!pip install -U torch torchvision pandas pathlib

## Download foods-101 Dataset
Next, we need to download the dataset. For this tutorial, we will use the foods-101 dataset. Run the following commands in your Jupyter notebook to download and extract the dataset:

In [None]:
!wget http://data.vision.ee.ethz.ch/cvl/food-101.tar.gz
!tar -xf food-101.tar.gz

## Imports
We will use the following libraries in this tutorial. Import them in your Jupyter notebook by running the following commands:

In [1]:
import os
from pathlib import Path

import pandas as pd

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, random_split

import torchvision
from torchvision.datasets import ImageFolder
import torchvision.transforms as transforms

## Define the preprocessing transforms
We define the preprocessing transforms for the dataset. We have two transforms: `train_transform` and `valid_transform`.

In [2]:
train_transform = transforms.Compose(
    [
        transforms.Resize(150),
        transforms.RandomCrop(128),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ]
)

valid_transform = transforms.Compose(
    [
        transforms.Resize(156),
        transforms.CenterCrop(150),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ]
)

## Define a custom class FilteredDataset
We define a custom class `FilteredDataset` that extends `ImageFolder` class. 
This class will allow us to exclude files from the dataset with a `.csv` file. 

Here is the code for the `FilteredDataset` class:

In [3]:
class FilteredDataset(ImageFolder):
    """
    A modified version of torchvision.datasets.ImageFolder that filters out samples whose filenames
    are listed in a given CSV file.

    See: https://pytorch.org/vision/main/generated/torchvision.datasets.ImageFolder.html

    Args:
        root_dir (string): Root directory path of the dataset.
        csv_path (string, optional): Path to a CSV file containing a list of excluded filenames.
                                     Default: None.
        transform (callable, optional): A function/transform that takes in a PIL image and returns a
                                         transformed version. E.g, ``transforms.RandomCrop``
                                         Default: None.
        target_transform (callable, optional): A function/transform that takes in the target and
                                                transforms it. Default: None.

    Example usage:

        # Load the dataset and exclude certain samples
        dataset = FilteredDataset("dataset/images", "files-to-exclude.csv", transform=transforms.ToTensor())

        # Create a dataloader
        dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
    """

    def __init__(self, root_dir, csv_path=None, transform=None, target_transform=None):
        root_dir = Path(root_dir)
        super().__init__(root_dir, transform=transform, target_transform=target_transform)

        if csv_path:
            self.excluded_files = pd.read_csv(csv_path, header=0)
            self.excluded_files['filename'] = self.excluded_files['filename'].apply(lambda x: str(root_dir) + "/" + x )
            self.excluded_filenames = set(self.excluded_files['filename'])
 
            print(f"Original Samples: {len(self.samples)} in {root_dir}")
            print(f"Excluded: {len(self.excluded_filenames)} in {root_dir}")
            excluded_indices = [i for i, (path, _) in enumerate(self.samples) if path in self.excluded_filenames]
            self.samples = [sample for i, sample in enumerate(self.samples) if i not in excluded_indices]
            self.targets = [target for i, target in enumerate(self.targets) if i not in excluded_indices]
            print(f"Cleaned Samples: {len(self.samples)} in {root_dir}")

## Exclude files
Using the custom `FilteredDataset` class, we can conveniently exclude the files specified in the `.csv` files from being loaaded into the dataset.

In [4]:
dataset = FilteredDataset("food-101/images", "food_101_vl-datasets_analysis.csv", transform=train_transform)

Original Samples: 101000 in food-101/images
Excluded: 498 in food-101/images
Cleaned Samples: 100502 in food-101/images


We can also view the exclude files with:

In [5]:
dataset.excluded_files

Unnamed: 0,filename,reason,value,prototype
0,food-101/images/apple_pie/1487150.jpg,Duplicate,0.9662,apple_pie/1486972.jpg
1,food-101/images/apple_pie/3324492.jpg,Duplicate,0.9817,apple_pie/2106005.jpg
2,food-101/images/apple_pie/3670966.jpg,Duplicate,0.9879,apple_pie/3670548.jpg
3,food-101/images/apple_pie/839845.jpg,Duplicate,0.9964,apple_pie/839808.jpg
4,food-101/images/baby_back_ribs/2306066.jpg,Duplicate,0.9862,baby_back_ribs/2306008.jpg
...,...,...,...,...
508,food-101/images/sashimi/241368.jpg,Dark,15.7813,
509,food-101/images/scallops/3314913.jpg,Dark,13.7173,
510,food-101/images/spring_rolls/182658.jpg,Dark,8.9502,
511,food-101/images/bread_pudding/444890.jpg,File-Size,9715.0000,


## Split Dataset
We will split the dataset into 80% train and 20% validation sets.

In [6]:
train_size = int(0.8 * len(dataset))
valid_size = len(dataset) - train_size
train_dataset, valid_dataset = random_split(dataset, [train_size, valid_size])

In [7]:
train_loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=os.cpu_count())
valid_loader = DataLoader(valid_dataset, batch_size=32, shuffle=True, num_workers=os.cpu_count())

## Define the model architecture
Let's construct a basic convolutional model, Resnet18 from Torchvision.

In [8]:
model = torchvision.models.resnet18(weights=torchvision.models.ResNet18_Weights.DEFAULT)
num_ftrs = model.fc.in_features
model.fc = nn.Linear(num_ftrs, len(dataset.classes))

## Define the loss function and optimizer

In [9]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

## Train the model
Now, let's write a simple training loop to train the model for 10 epochs on a GPU or CPU.

In [10]:
num_epochs = 10
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
model.to(device)

for epoch in range(num_epochs):
    running_loss = 0.0
    for i, data in enumerate(train_loader):
        inputs, labels = data
        inputs, labels = inputs.to(device), labels.to(device)

        optimizer.zero_grad()

        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

    print(f"Epoch {epoch+1} - Loss: {running_loss/len(train_loader)}")


Using device: cuda
Epoch 1 - Loss: 2.838328029691167
Epoch 2 - Loss: 2.1036031943755105
Epoch 3 - Loss: 1.8239957706558594
Epoch 4 - Loss: 1.6393448251783753
Epoch 5 - Loss: 1.4989980335677364
Epoch 6 - Loss: 1.3904197197816661
Epoch 7 - Loss: 1.2933714455079592
Epoch 8 - Loss: 1.2070332413568348
Epoch 9 - Loss: 1.1377574345443857
Epoch 10 - Loss: 1.0723884810178532


## Evaluate the model
Finally we evaluate the model on the validation set and prints it's accuracy.

In [12]:
correct = 0
total = 0
with torch.no_grad():
    for data in valid_loader:
        inputs, labels = data
        inputs, labels = inputs.to(device), labels.to(device)
        outputs = model(inputs)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print(f"Accuracy: {100 * correct / total}")


Accuracy: 73.79234863937117
