# Image Classification Pipeline using Remo

<img src="remo_normal.png">

In this tutorial, Remo will be used to accelerate the process of building a transfer learning pipeline for the task of Image Classification.

In [2]:
#Imports
import torch
import torch.nn as nn
import numpy as np
from PIL import Image
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from torchvision import transforms
import torchvision.models as models
import pandas as pd
import os
import tqdm
import torch.optim as optim
import random
import remo
remo.set_viewer('jupyter')
from pprint import pprint
import json

## Adding Data to Remo

- The dataset used in this example is the <a href="lol">Flowers 102 Dataset</a>.
- Run the next cell to download the data from s3, create a new folder and extract the files required.

- The directory structure of the dataset is:

    ```
    ├── flower_dataset
        ├── train                    
            ├── 1
            ├── image_1.jpg
            ├── image_2.jpg
            ├── ...
        ├── test                    
            ├── 1
            ├── image_3.jpg
            ├── image_4.jpg
            ├── ...
        ├── valid                    
            ├── 1
            ├── image_5.jpg
            ├── image_6.jpg
            ├── ...
    ```

In [6]:
# The dataset will be downloaded from s3, and extracted into a new folder
!wget https://s-3.s3-eu-west-1.amazonaws.com/flower_dataset.zip
!unzip flower_dataset.zip

lating: flower_dataset/valid/30/image_03531.jpg  
   creating: flower_dataset/valid/82/
  inflating: flower_dataset/valid/82/image_01683.jpg  
  inflating: flower_dataset/valid/82/image_01693.jpg  
  inflating: flower_dataset/valid/82/image_01625.jpg  
  inflating: flower_dataset/valid/82/image_01643.jpg  
  inflating: flower_dataset/valid/82/image_01596.jpg  
  inflating: flower_dataset/valid/82/image_01658.jpg  
  inflating: flower_dataset/valid/82/image_01592.jpg  
  inflating: flower_dataset/valid/82/image_01671.jpg  
  inflating: flower_dataset/valid/82/image_01649.jpg  
  inflating: flower_dataset/valid/82/image_01659.jpg  
  inflating: flower_dataset/valid/82/image_01690.jpg  
  inflating: flower_dataset/valid/82/image_01691.jpg  
  inflating: flower_dataset/valid/82/image_01640.jpg  
   creating: flower_dataset/valid/28/
  inflating: flower_dataset/valid/28/image_05258.jpg  
  inflating: flower_dataset/valid/28/image_05265.jpg  
  inflating: flower_dataset/valid/28/image_05257.

In [4]:
# The path to the folders
train_path = "flower_dataset/train"
valid_path = "flower_dataset/valid"
test_path = "flower_dataset/test"

**Generating Annotations from Folders**

To generate an annotations file from a folder of folders, the path to the root directory containing the class folders is passed to ```remo.generate_annotations_from_folders()```.

In [6]:
remo.generate_annotations_from_folders(train_path)
remo.generate_annotations_from_folders(test_path)
remo.generate_annotations_from_folders(valid_path)

'flower_dataset/valid/annotations.csv'

In [5]:
# The JSON file is provided in the dataset, and is then converted into a mapping dictionary.
# cat_to_index : mapping between class_index -> class_label

cat_to_index = dict(json.load(open("flower_dataset/cat_to_name.json")))
mapping = { value : key for (key, value) in cat_to_index.items()}

**Adding Data to Remo**

To add a dataset, you can use the ```remo.create_dataset()``` specifying the path to data and annotations. 
The class encoding is passed via a dictionary.

For a complete list of formats supported please refer the <a href="https://remo.ai/docs/annotation-formats/">documentation</a>.


In [12]:
# The annotations.csv is generated in the same path of the sub-folder
training_dataset = remo.create_dataset(name = 'flowers-train',
                    paths_to_upload=[train_path, os.path.join(train_path, "annotations.csv")],
                    annotation_task = "Image classification",
                    class_encoding=cat_to_index)

valid_dataset = remo.create_dataset(name = 'flowers-valid',
                    paths_to_upload=[valid_path, os.path.join(valid_path, "annotations.csv")],
                    annotation_task = "Image classification",
                    class_encoding=cat_to_index)

testing_dataset = remo.create_dataset(name = 'flowers-test',
                    paths_to_upload=[test_path, os.path.join(test_path, "annotations.csv")],
                    annotation_task = "Image classification",
                    class_encoding=cat_to_index)

ges: 2394/6552Processing data - Processing images: 2410/6552Processing data - Processing images: 2416/6552Processing data - Processing images: 2429/6552Processing data - Processing images: 2447/6552Processing data - Processing images: 2461/6552Processing data - Processing images: 2471/6552Processing data - Processing images: 2485/6552Processing data - Processing images: 2500/6552Processing data - Processing images: 2511/6552Processing data - Processing images: 2526/6552Processing data - Processing images: 2546/6552Processing data - Processing images: 2559/6552Processing data - Processing images: 2571/6552Processing data - Processing images: 2584/6552Processing data - Processing images: 2595/6552Processing data - Processing images: 2609/6552Processing data - Processing images: 2625/6552Processing data - Processing images: 2638/6552Processing data - Processing images: 2652/6552Processing data - Processing images: 2667/6552Processing data - Processing images: 2684/6552Processing data - Pr

'\ntesting_dataset = remo.create_dataset(name = \'flowers-test\',\n                    paths_to_upload=[test_path, os.path.join(test_path, "annotations.csv")],\n                    annotation_task = "Image classification",\n                    class_encoding=cat_to_index)\n'

In [13]:
# Method for the getting the IDs of the created datasets
remo.list_datasets()

[Dataset 38 - 'flowers-test',
 Dataset 39 - 'flowers-train',
 Dataset 40 - 'flowers-valid']

**Visualizing the dataset**

To view your data and labels using the Remo visual interface directly in the notebook, call the ```dataset.view()``` method.

In [14]:
# Un-comment to view the other datasets.
training_dataset.view()
#validation_dataset.view()
#testing_dataset.view()

Open http://localhost:8123/datasets/39


**Dataset Statistics**

Remo alleviates the need to write extra boilerplate for accessing dataset properties. 

This can be done either using code, or via the visual interface.

In [16]:
train_stats = training_dataset.get_annotation_statistics()
valid_stats = valid_dataset.get_annotation_statistics()
test_stats = testing_dataset.get_annotation_statistics()

pprint("Training Statistics {}".format(train_stats))
pprint("Validation Statistics {}".format(valid_stats))
pprint("Testing Statistics {}".format(test_stats))

("Training Statistics [{'AnnotationSet ID': 35, 'AnnotationSet name': 'Image "
 "classification', 'n_images': 0, 'n_classes': 102, 'n_objects': 0, "
 "'top_3_classes': [{'name': 'Petunia', 'count': 206}, {'name': 'Passion "
 "flower', 'count': 205}, {'name': 'Wallflower', 'count': 157}], "
 "'creation_date': None, 'last_modified_date': '2020-07-22T11:12:56.197670Z'}]")
("Validation Statistics [{'AnnotationSet ID': 36, 'AnnotationSet name': 'Image "
 "classification', 'n_images': 0, 'n_classes': 102, 'n_objects': 0, "
 "'top_3_classes': [{'name': 'Petunia', 'count': 28}, {'name': 'Cyclamen', "
 "'count': 25}, {'name': 'Passion flower', 'count': 21}], 'creation_date': "
 "None, 'last_modified_date': '2020-07-22T11:18:08.274564Z'}]")
("Testing Statistics [{'AnnotationSet ID': 34, 'AnnotationSet name': 'Image "
 "classification', 'n_images': 0, 'n_classes': 102, 'n_objects': 0, "
 "'top_3_classes': [{'name': 'Water lily', 'count': 28}, {'name': 'Passion "
 "flower', 'count': 25}, {'name': 

In [17]:
# Un-comment the other lines to view the other datasets.

training_dataset.view_annotation_stats()
#validation_dataset.view_annotation_stats()
#testing_dataset.view_annotation_stats()

Open http://localhost:8123/annotation-detail/35/insights


**Export Annotations To File**

Using the ```dataset.export_annotations_to_file()``` method, the annotations from Remo can be exported to a format of your choice.

For a complete list of formats supported please refer the <a href="https://remo.ai/docs/annotation-formats/">documentation</a>.


In [18]:
training_dataset.export_annotations_to_file("training.csv", annotation_format="csv", full_path='true')
testing_dataset.export_annotations_to_file("testing.csv", annotation_format="csv", full_path='true')
valid_dataset.export_annotations_to_file("validation.csv", annotation_format="csv", full_path='true')

## Feeding Data into PyTorch

A custom PyTorch ```Dataset``` object defined below is used to load data.

In order to adapt this to your dataset, the following are required:

- **Path to data:** Path to the Data Folder
- **Path to Annotations:** Path to Annotations CSV File (Format : file_name, class_name)
- **Mapping:** Python dictionary containing mapping of class name and class index (Format : {"class_name" : "class_index"})
- **transforms:** Transforms to be applied to the images before passing it to the network.

In [19]:
class FlowerDataset(Dataset):
    def __init__(self, annotations, data_path, mapping, transform=None, mode="train"):
        
        # Pandas is used to read in the csv file into a DataFrame for data loading
        self.data = pd.read_csv(annotations)
        self.data_path = data_path
        self.mapping = mapping
        self.transform = transform
        self.mode = mode
        
    def __len__(self):
        return len(self.data)
  
    def __getitem__(self, idx):
        
        labels = int(self.mapping[self.data.loc[idx, 'classes'].lower()])
        im_path = self.data_path + "/" + str(labels) + "/" + self.data.loc[idx, 'file_name']
        label_tensor = torch.as_tensor(labels-1, dtype=torch.long)
        im = Image.open(im_path)
        
        if self.transform:
            im = self.transform(im)
        if self.mode == "test":
            # For saving the predicitions, the file name is required
            return {"im" : im, "labels": label_tensor, "im_name" : self.data.loc[idx, 'file_name']}
        else:
            return {"im" : im, "labels" : label_tensor}

In [20]:
# Channel wise mean and standard deviation for normalizing according to ImageNet Statistics
means = [0.485, 0.456, 0.406]
stds = [0.229, 0.224, 0.225]


# Transforms to be applied to Train-Test-Validation
train_transforms = transforms.Compose([
        transforms.RandomRotation(30),
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.ToTensor(),
        transforms.Normalize(means, stds)
    ])

val_transforms = transforms.Compose([
        transforms.Resize(224),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(means, stds)])

test_transform = transforms.Compose([
        transforms.Resize(224),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(means, stds)
    ])

The train, test and validation datasets are instantiated and wrapped around a ```DataLoader``` method.

In [21]:
train_dataset = FlowerDataset(annotations="training.csv",
                                   data_path='./flower_dataset/train',
                                   transform=train_transforms,
                                   mapping=mapping)

val_dataset = FlowerDataset(annotations="validation.csv",
                              data_path='./flower_dataset/valid',
                              transform=val_transforms,
                              mapping=mapping)

test_dataset = FlowerDataset(annotations="testing.csv",
                              data_path='./flower_dataset/test',
                              transform=test_transform,
                             mapping=mapping,
                              mode="test")

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True, num_workers=2)
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=64, shuffle=False, num_workers=2)
test_loader = torch.utils.data.DataLoader(test_dataset,batch_size=1,shuffle=False, num_workers=2)

data_loader = {"train" : train_loader, "valid": val_loader}
len_dict = {"train" : len(train_dataset), "valid" : len(val_dataset)}

## Training the Model

The pre-trained weights of the ```ResNet-18``` model with ImageNet are used in this tutorial.

To train the model, the following details are passed to the ```train_model()``` function

1. **Model:** The edited version of the pre-trained model.
2. **Data Loaders:** The dictionary containing our training and validation dataloaders
3. **Criterion:** The loss function used for training the network
4. **Num_epochs:** The number of epochs for which we would like to train the network.
5. **dataset_size:** an additional parameter which is used to correctly scale the loss, the method for this is specified in the DataLoader cell

In [22]:
model = models.resnet18(pretrained=True)

# Freezing the weights
for param in model.parameters():
    param.required_grad = False


# Replacing the final layer
model.fc = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(256, 102),
                         nn.LogSoftmax(dim=1))

In [23]:
def train_model(model, data_loaders, optimizer, criterion, num_epochs, dataset_size):

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # Checks if GPU is present

    model.to(device) # This method pushes the model to the device.
    
    # The training loop trains the model for the total number of epochs,
    # an epoch is one complete pass over the entire dataset
    for epoch in range(num_epochs):
        
        model.train() # This sets the model back to training after the validation step
        print("Epoch Number {}".format(epoch))

        training_loss = 0.0
        val_loss = 0.0
        val_acc = 0
        correct_preds = 0
        best_acc = 0
        validation = 0.0
        total = 0

        
        data_loader = tqdm.tqdm(data_loaders["train"])
        for x, data in enumerate(data_loader):
            inputs, labels = data["im"].to(device), data["labels"].to(device)
            outputs = model(inputs)
            optimizer.zero_grad()

            loss = criterion(outputs, labels)


            loss.backward()
            optimizer.step()

            training_loss += loss.item()
        
        epoch_loss = training_loss / dataset_size["train"]

        print("Training Loss : {:.5f}".format(epoch_loss))

        val_data_loader = tqdm.tqdm(data_loaders["valid"])
        
        # Validation step after every epoch
        # The gradients are not required at inference time, hence the model is set to eval mode
        with torch.no_grad():
            model.eval()
            for x, data in enumerate(val_data_loader):
                inputs, labels = data["im"].to(device), data["labels"].to(device)
                outputs = model(inputs)

                val_loss = criterion(outputs, labels)
                _, index = torch.max(outputs, 1)

                total += labels.size(0)
                correct_preds += (index == labels).sum().item()

                validation += val_loss.item()

            val_acc = 100 * (correct_preds / total)

            print("Validation Loss : {:.5f}".format(validation / dataset_size["valid"]))
            print("Validation Accuracy is: {:.2f}%".format(val_acc))
            
            # The model is saved only if current validation accuracy is higher than the previous best accuracy
            if best_acc < val_acc:
                best_acc = val_acc
                model_name = "./saved_model_{:.2f}.pt".format(best_acc)
                torch.save(model, model_name)
                
            return model_name 


In [24]:
# We use the Adam optimizer, which inherits the parameters of only the trainable layers.
optimizer = optim.Adam(model.fc.parameters(), lr=0.001)
trained_model = train_model(model=model, data_loaders=data_loader, optimizer=optimizer, num_epochs=1, dataset_size=len_dict, criterion=nn.NLLLoss())

0%|          | 0/103 [00:00<?, ?it/s]Epoch Number 0
100%|██████████| 103/103 [09:17<00:00,  5.42s/it]
  0%|          | 0/13 [00:00<?, ?it/s]Training Loss : 0.06202
100%|██████████| 13/13 [00:23<00:00,  1.79s/it]Validation Loss : 0.04312
Validation Accuracy is: 45.48%



In [None]:
def test_model(dataloader, model):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    test_model = torch.load(model)
    
    test_model.eval()
    tk0 = tqdm.tqdm(dataloader)
    
    total = 0
    correct_preds = 0
    pred_list = {}
    
    with torch.no_grad():
        for x, data in enumerate(tk0):
            single_im, label = data["im"].to(device), data["labels"].to(device)
            im_name = data["im_name"]
            
            pred = test_model(single_im)

            _, index = torch.max(pred, 1)

            total += label.size(0)
            correct_preds += (index == label).sum().item()
            
            pred_list[im_name[0]] = (index+1).item()
            
    df = pd.DataFrame(pred_list.items(), columns=['file_name', 'class_name'])
    with open("results.csv", "w") as f:
        df.to_csv(f, index=False)
    print('Accuracy of the network on the test images: %d %%' % (100 * (correct_preds / total)))

In [None]:
test_model (test_loader, trained_model)

## Visualizing Predictions

For visualizing the predicted v/s original label in Remo, the predictions are added to a CSV, which is then added as an ```AnnotationSet``` to the testing_dataset.

In [25]:
classes = [i for i in mapping.keys()]
annotation_set = remo.create_annotation_set("Image Classification", name="model_predictions", dataset_id=<enter test dataset ID>, classes=classes)
testing_dataset.add_data(local_files=["/home/harsha/Documents/rediscovery/image_classification/flower_classification/results.csv"], annotation_set_id=annotation_set.id, class_encoding=cat_to_index)

Acquiring data - completed                                                                           
Processing data - Processing annotation files: 1/1Processing data - completed                                                                          
Data upload completed


{'session_id': '1b15014d-1f12-43b8-8b37-8288d73bd73c',
 'created_at': '2020-07-22T12:03:30.043775Z',
 'finished_at': '2020-07-22T12:04:33.366098Z',
 'dataset': {'id': 38, 'name': 'flowers-test'},
 'status': 'done',
 'substatus': '',
 'images': {'pending': 0,
  'total': 0,
  'successful': 0,
  'failed': 0,
  'errors': []},
 'annotations': {'pending': 0,
  'total': 1,
  'successful': 1,
  'failed': 0,
  'errors': []},
 'errors': [],
 'uploaded': {'total': {'items': 0, 'size': 0, 'human_size': '0 b'},
  'images': {'items': 0, 'size': 0, 'human_size': '0 b'},
  'annotations': {'items': 0, 'size': 0, 'human_size': '0 b'},
  'archives': {'items': 0, 'size': 0, 'human_size': '0 b'}}}

In [26]:
testing_dataset.view()

Open http://localhost:8123/datasets/38
