# Image Classification using PyTorch and 🔭 Galileo

In this tutorial, we'll train a model with PyTorch and explore the results in Galileo.

This notebook pulls the data from cloud storage (here, a public GCS bucket), which is the suggested way of working with images in Galileo.

**Make sure to select GPU in your Runtime! (Runtime -> Change Runtime type)**

# 1. Install Prerequisites and Log In to Galileo

In [None]:
#@markdown Install `dataquality`
# Upgrade pip
!pip install -U pip &> /dev/null

# Install all dependencies
!pip install -U dataquality torch &> /dev/null

print('👋 Installed necessary libraries!')

In [None]:
#@markdown Check that a GPU is available

import torch
# Check Cuda.
if torch.cuda.is_available():
  print("⚡ You are connected to a GPU!")
else:
  print("❗You are NOT connected to a GPU ❗It is recommended to connect to a GPU before training")
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

In [None]:
import dataquality as dq

dq.login()

# 2. Download the data in the notebook

Pull the data from GCS or S3. 

For enterprise customers, your cluster must have permission to read the data from S3 (on AWS clusters) or GCS (on GCP clusters). Cross-account transfer is not currently supported.

For free users, the data has to be publicly available.

In [None]:
CLOUD_ZIP_PATH = "https://storage.googleapis.com/galileo-public-data/CV_datasets/ImageNet10_animals_train_val.zip"

In [None]:
#@markdown Download the images

LOCAL_DATA_DIR = "tmp/content"
dataset_dir_name = CLOUD_ZIP_PATH.split('/')[-1].split('.zip')[0]

cmd = f"""
mkdir -p {LOCAL_DATA_DIR}
if [ ! -d {LOCAL_DATA_DIR}/{dataset_dir_name} ]
then
  echo "Downloading data"
  curl {CLOUD_ZIP_PATH} -o {LOCAL_DATA_DIR}/{dataset_dir_name}.zip
  unzip {LOCAL_DATA_DIR}/{dataset_dir_name}.zip -d {LOCAL_DATA_DIR}
else
  echo "Data already exists locally. Moving on."
fi
"""
with open('download_images.sh', 'w') as file:
  file.write(cmd)

!bash download_images.sh

# Select a small portion of the dataset for CI.
import os
def _minimize_for_ci() -> bool:
    return os.getenv("MINIMIZE_FOR_CI", "false") == "true"
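
The shell script above can also be written in Python. The following is a minimal sketch rather than part of the original notebook: `download_and_extract` is a hypothetical helper, and like the script it assumes the archive contains a top-level folder named after the zip file.

```python
import urllib.request
import zipfile
from pathlib import Path


def download_and_extract(zip_url: str, local_dir: str) -> Path:
    """Download a zip archive and extract it, skipping work already done.

    Hypothetical helper that mirrors the shell script: it assumes the
    archive unpacks into a folder named like the zip file (minus ".zip").
    """
    target_root = Path(local_dir)
    target_root.mkdir(parents=True, exist_ok=True)

    # Same naming rule as the notebook: the archive name without ".zip".
    dataset_name = zip_url.split("/")[-1].split(".zip")[0]
    dataset_dir = target_root / dataset_name
    if dataset_dir.is_dir():
        print("Data already exists locally. Moving on.")
        return dataset_dir

    print("Downloading data")
    zip_path = target_root / f"{dataset_name}.zip"
    urllib.request.urlretrieve(zip_url, str(zip_path))
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_root)
    return dataset_dir
```

Calling `download_and_extract(CLOUD_ZIP_PATH, LOCAL_DATA_DIR)` would then replace the `!bash download_images.sh` step.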

# 3. Initialize Galileo


In [None]:
DATASET_NAME = "ImageNet10_animals_train_val" # 🔭🌕 used for creating a run name in Galileo

In [None]:
# 🔭🌕 Initializing a new run in Galileo. Each run is part of a project.
dq.init(task_type="image_classification", 
        project_name="image_classification_pytorch", 
        run_name=f"example_run_{DATASET_NAME.replace('/', '-')}")

# 4. Create Dataset and Log Input Data with Galileo

Input data is logged via `log_image_dataset`. This step logs the images, gold labels, data split, and list of all labels. You can achieve this by adding one line of code to the standard PyTorch Dataset class.

To skip uploading the images, provide their location in an (unzipped) folder in the cloud.
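
As a rough illustration of the two kinds of paths involved, the sketch below builds a local path and, optionally, a remote path for each image from its relative path. The helper name `build_image_rows`, the example file names, and the `example.com` URL are hypothetical; only the column names `text` and `imgs_remote_paths` come from this notebook.

```python
from typing import Dict, List, Optional


def build_image_rows(
    rel_paths: List[str],
    local_dir: str,
    cloud_dir: Optional[str] = None,
) -> List[Dict[str, str]]:
    """Build one row per image with its local (and optional remote) path."""
    rows = []
    for rel_path in rel_paths:
        # Local path: used for training on this machine.
        row = {"text": f"{local_dir}/{rel_path}"}
        # Remote path: only set when the unzipped images also live in the cloud.
        if cloud_dir is not None:
            row["imgs_remote_paths"] = f"{cloud_dir}/{rel_path}"
        rows.append(row)
    return rows


rows = build_image_rows(
    ["train/cat_001.jpg", "train/dog_002.jpg"],  # hypothetical file names
    local_dir="./tmp/content/dataset",
    cloud_dir="https://example.com/bucket/dataset",  # hypothetical URL
)
for row in rows:
    print(row)
```

When `cloud_dir` is `None`, the rows carry only the local path, which corresponds to the case where the images would have to be uploaded.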

In [None]:
CLOUD_DATA_DIR = f"https://storage.googleapis.com/galileo-public-data/CV_datasets/ImageNet10_animals_train_val" # 🔭🌕  Set to None if data not available unzipped in the cloud (which would require uploading)
train_csv_relpath = "train.csv"
val_csv_relpath = "val.csv"
# Fix labels: ImageNet labels are WordNet IDs. Convert them to human-readable names.
CLASSES_DICT = {'n02124075' : 'Egyptian_cat', 'n02107574': 'Greater_Swiss_Mountain_dog', 'n02114367': 'Timber_wolf', 'n02085620': 'Chihuahua', 'n02114548': 'White_wolf', 'n02117135': 'Hyena', 'n02108915': 'French_bulldog', 'n02123159': 'Tiger_cat', 'n02114855': 'Coyote', 'n02106550': 'Rottweiler'}

In [None]:
#@markdown Fix a random seed and load helper methods.
from typing import Optional, List
from io import BytesIO
from PIL import Image
import numpy as np
import random

# Fix a random seed.
def seed_all(seed: int) -> None:
    """Set all relevant seeds for training a PyTorch model.

    Based on the following post:
    https://discuss.pytorch.org/t/reproducibility-with-all-the-bells-and-whistles/81097
    """
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.cuda.manual_seed(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


def seed_worker(worker_id: int) -> None:
    """Set seed for dataloader worker.

    Based on the following post:
    https://discuss.pytorch.org/t/reproducibility-with-all-the-bells-and-whistles/81097
    """
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

# Methods for loading the df into a dataset.
def find_label_col_name(col_names: List[str]) -> Optional[str]:
    for col_name in col_names:
        if "label" in col_name:
            return col_name
    return None

def find_imgs_loc_col_name(col_names: List[str]) -> Optional[str]:
    for col_name in col_names:
        if "path" in col_name:
            return col_name
    return None
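
To make the purpose of the seeding helpers concrete, here is a small standard-library sketch (not part of the original notebook). It shows that a fixed seed reproduces the same random draws, and why `seed_worker` reduces the seed modulo `2**32`: NumPy's `np.random.seed` only accepts values that fit in 32 bits. The function `draw_indices` and the example seed values are hypothetical.

```python
import random
from typing import List


def draw_indices(seed: int, dataset_size: int, count: int) -> List[int]:
    """Draw `count` random sample indices from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.randrange(dataset_size) for _ in range(count)]


# The same seed always yields the same sequence of indices.
first = draw_indices(42, dataset_size=1000, count=5)
second = draw_indices(42, dataset_size=1000, count=5)
print(first == second)

# A base seed wider than 32 bits is reduced before being reused,
# as seed_worker does with `% 2**32`.
base_seed = 2**40 + 12345  # hypothetical 64-bit-style seed
worker_seed = base_seed % 2**32
print(worker_seed)
```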

In [None]:
#@markdown Create the Dataset class
from uuid import uuid4
from pathlib import Path
from typing import List

import torch
from torch.utils.data import Dataset as TorchDataset
from torchvision import transforms
import pandas as pd
import os

STANDARD_DATA_COLUMNS_CV = ["id", "text", "label_idx", "path"]

class ImageDatasetFromLocal(TorchDataset):
    def __init__(
        self, 
        split: str,
        imgs_dir: str,
        csv_relpath: str,
        cloud_imgs_dir: str = None,
        transform: transforms.Compose = None, 
        list_of_labels: List[str] = None
    ):  
        """
        Args:
          split: the name of the data split (e.g. train, validation, inference)
          imgs_dir: location of images on this machine (for training)
          csv_relpath: path to the csv, relative to imgs_dir
          cloud_imgs_dir: location of images in the cloud (for reference in console)
          transform [Optional]: a transform to apply to the images dynamically 
            before training
          list_of_labels [Optional]: the list of labels used to convert between 
            label (string) and label_idx (int). To ensure consistency, pass the 
            list_of_labels of the training dataset to the test/val datasets.
        """
        self.imgs_dir = imgs_dir
        self.cloud_imgs_dir = cloud_imgs_dir
        self.transform = transform
        self.split = split
        
        self.ds = pd.read_csv(f"{imgs_dir}/{csv_relpath}")
        
        # Find the label column name: could be label, labels, coarse_label, etc.
        self.label_col_name = find_label_col_name(self.ds.columns)
        if self.label_col_name is None:
            raise ValueError("Could not find the label column in the dataframe")

        if _minimize_for_ci():
            # Keep at most 10 samples per label to speed up CI runs.
            self.ds = (
                self.ds.groupby(self.label_col_name, group_keys=False)
                .apply(lambda x: x.sample(min(len(x), 10)))
                .reset_index(drop=True)
            )

        # Fix labels: ImageNet labels are WordNet IDs. Convert them to human-readable names.
        self.ds[self.label_col_name] = self.ds[self.label_col_name].map(CLASSES_DICT)

        # Set the list of labels for this split.
        self.list_of_labels = list_of_labels
        if self.list_of_labels is None:
            self.list_of_labels = list(self.ds[self.label_col_name].unique())

        # Add a column with the label index (int), used as the training target.
        label_to_labelidx = {label: i for i, label in enumerate(self.list_of_labels)}
        self.ds["label_idx"] = self.ds[self.label_col_name].map(label_to_labelidx)

        # Find the path column name: could be path, relpath, etc (or none).
        self.imgs_location_colname = find_imgs_loc_col_name(self.ds.columns)

        # Get the metadata columns: everything that is not a standard column.
        # The list is built locally so that the module-level constant is not
        # mutated every time a dataset is created.
        standard_cols = STANDARD_DATA_COLUMNS_CV + [
            self.label_col_name,
            self.imgs_location_colname,
        ]
        self.meta_data_cols = [
            column
            for column in self.ds.columns
            if column not in standard_cols
        ]

        # Set the images local paths in the "text" column (for training and smart features)
        self.ds["text"] = self.ds[self.imgs_location_colname].apply(lambda x: f"{self.imgs_dir}/{x}")
        # If a remote location is given, set the remote paths (to skip uploading)
        if cloud_imgs_dir is not None:
            self.ds["imgs_remote_paths"] = self.ds[self.imgs_location_colname].apply(lambda x: f"{self.cloud_imgs_dir}/{x}")

    def __getitem__(self, idx: int):
        row = self.ds.loc[idx]
        img_path = os.path.join(self.imgs_dir, row[self.imgs_location_colname])
        image = Image.open(img_path).convert('RGB')
        label, id = row["label_idx"], row["id"]
        if self.transform is not None:
            image = self.transform(image)
        return {"image": image, "label": label, "id": id}

    def __len__(self) -> int:
        return len(self.ds)
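
The docstring above recommends passing the training `list_of_labels` to the validation dataset. The following plain-Python sketch (not part of the original notebook) shows what can go wrong otherwise: if each split builds its own list in order of first appearance, the same class can end up with a different index in each split. The helper `build_label_index` and the toy label lists are hypothetical.

```python
from typing import Dict, List


def build_label_index(labels: List[str]) -> Dict[str, int]:
    """Map each label to an index, in order of first appearance."""
    index: Dict[str, int] = {}
    for label in labels:
        if label not in index:
            index[label] = len(index)
    return index


train_labels = ["Coyote", "Hyena", "Coyote", "Chihuahua"]
val_labels = ["Hyena", "Chihuahua", "Coyote"]

train_index = build_label_index(train_labels)
val_own_index = build_label_index(val_labels)

# Reusing the training mapping keeps indices consistent across splits.
val_idx_shared = [train_index[label] for label in val_labels]
# Building a separate mapping silently assigns different indices.
val_idx_own = [val_own_index[label] for label in val_labels]

print(train_index)     # {'Coyote': 0, 'Hyena': 1, 'Chihuahua': 2}
print(val_idx_shared)  # [1, 2, 0]
print(val_idx_own)     # [0, 1, 2]
```

With the separate mapping, "Hyena" would be class 1 during training but class 0 during validation, so validation metrics would be meaningless.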

In [None]:
# Create the Dataset and Dataloader + Log input to 🔭🌕 Galileo

# Create the Datasets.
image_crop_size = (224, 224)

val_transforms = transforms.Compose(
    [
        transforms.Resize((image_crop_size[0], image_crop_size[1])),
        transforms.ToTensor()
    ]
)
train_transforms = transforms.Compose(val_transforms.transforms + [transforms.RandomHorizontalFlip()])

TRAIN_SPLIT_NAME = "train"
train_dataset = ImageDatasetFromLocal(
    imgs_dir=f"./{LOCAL_DATA_DIR}/{dataset_dir_name}", 
    cloud_imgs_dir=CLOUD_DATA_DIR, 
    csv_relpath=train_csv_relpath,
    split=TRAIN_SPLIT_NAME, 
    transform=train_transforms)

VAL_SPLIT_NAME = "validation" # this var is needed in dq.set_split down below
val_dataset = ImageDatasetFromLocal(
    imgs_dir=f"./{LOCAL_DATA_DIR}/{dataset_dir_name}",
    cloud_imgs_dir=CLOUD_DATA_DIR, 
    csv_relpath=val_csv_relpath,
    split=VAL_SPLIT_NAME, 
    transform=val_transforms,
    list_of_labels=train_dataset.list_of_labels)

print(f"Loaded {TRAIN_SPLIT_NAME} dataset with {len(train_dataset.ds)} samples and {len(train_dataset.list_of_labels)} labels")
print(f"Loaded {VAL_SPLIT_NAME} dataset with {len(val_dataset.ds)} samples and  {len(val_dataset.list_of_labels)} labels")

# 🔭🌕 Galileo log: Set labels
dq.set_labels_for_run(train_dataset.list_of_labels)

# 🔭🌕 Galileo log: Log dataset
dq.log_image_dataset(
    dataset = train_dataset.ds,
    label = train_dataset.label_col_name,
    split = train_dataset.split,
    meta = train_dataset.meta_data_cols,
    imgs_local_colname = "text",
    imgs_remote="imgs_remote_paths" if CLOUD_DATA_DIR is not None else None
)
dq.log_image_dataset(
    dataset = val_dataset.ds,
    label = val_dataset.label_col_name,
    split = val_dataset.split,
    meta = val_dataset.meta_data_cols,
    imgs_local_colname = "text",
    imgs_remote="imgs_remote_paths" if CLOUD_DATA_DIR is not None else None
)

# Create the DataLoaders.
from torch.utils.data import DataLoader as TorchDataLoader

BATCH_SIZE = 64

NUM_WORKERS = 0 
SEED_WORKER = 42

seed_all(SEED_WORKER)

train_dataloader = TorchDataLoader(
    train_dataset,
    batch_size=BATCH_SIZE,
    shuffle=True,
    num_workers=NUM_WORKERS,
    worker_init_fn=seed_worker,
    pin_memory=True
)
val_dataloader = TorchDataLoader(
    val_dataset,
    batch_size=BATCH_SIZE,
    shuffle=False,
    num_workers=NUM_WORKERS,
    worker_init_fn=seed_worker,
    pin_memory=True
)

In [None]:
#@markdown Visualize the Data.
# Visualizing a few images of the dataset (post-processing/augmentation)
import random
import matplotlib.pyplot as plt
from torchvision.utils import make_grid
idxs = [random.randint(0, len(train_dataset) -1) for _ in range(20)]
grid_img = make_grid([train_dataset[idx]["image"] for idx in idxs], nrow=5)
plt.figure(figsize = (20,10))
plt.imshow(grid_img.permute(1, 2, 0))
plt.show()

# 6. Log model data with Galileo

Model data is logged by wrapping the model with the `watch` function. This step logs the model's logits and embeddings. You can achieve this by adding one line of code to a standard PyTorch model.
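
To give an intuition for what "logits and embeddings" means here, the sketch below captures both from a toy model with a standard PyTorch forward hook: the input to the classifier layer serves as the embedding, and its output as the logits. This is an illustration of the general idea only, under the assumption that a hook on the classifier layer is a reasonable mental model; it is not the `dataquality` implementation, and the toy model and the name `capture_classifier_io` are hypothetical.

```python
import torch
from torch import nn

captured = {}


def capture_classifier_io(module, inputs, output):
    """Forward hook: store the classifier's input (embedding) and output (logits)."""
    captured["embeddings"] = inputs[0].detach()
    captured["logits"] = output.detach()


# Toy model: the last Linear layer plays the role of `model.fc`.
toy_model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(12, 8),
    nn.ReLU(),
    nn.Linear(8, 3),
)
classifier_layer = toy_model[-1]
handle = classifier_layer.register_forward_hook(capture_classifier_io)

with torch.no_grad():
    toy_model(torch.randn(4, 3, 2, 2))  # batch of 4 tiny "images"

handle.remove()  # detach the hook once it is no longer needed
print(captured["embeddings"].shape, captured["logits"].shape)
```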

In [None]:
from torchvision.models import resnet50

EPOCHS = 1

# Load model and replace last layer.
model = resnet50(pretrained=True)
model.fc = torch.nn.Linear(model.fc.in_features, len(train_dataset.list_of_labels))
torch.nn.init.xavier_uniform_(model.fc.weight)

model = model.to(device)

# Set optimizer and loss.
params_1x = [  # get the original weights, they'll be updated with a lower learning rate
    param
    for name, param in model.named_parameters()
    if "fc" not in str(name)
]
lr, weight_decay = 1e-5, 5e-4
optimizer = torch.optim.Adam(
    [
        {"params": params_1x, "lr": lr},
        {"params": model.fc.parameters(), "lr": lr * 10},
    ],
    weight_decay=weight_decay,
)
criterion = torch.nn.CrossEntropyLoss()

from dataquality.integrations.torch import watch, unwatch

# 🔭🌕 Galileo logging -- Watch model
watch(
    model=model,
    classifier_layer=model.fc,
    dataloaders=[train_dataloader, val_dataloader],
)

# 7. Putting into Action: Training a Model

We complete the training pipeline by using a standard PyTorch training setup. While training, we log the current `epoch` and `split`. To complete logging, we call `dq.finish()` after training.

In [None]:
from tqdm import tqdm
from time import sleep, time

# Train !
start = time()
print(f"Training for {EPOCHS} epochs on {device}")

for epoch in range(1, EPOCHS + 1):
    print(f"Epoch {epoch}/{EPOCHS}")
    dq.set_epoch(epoch)  # 🔭🌕 Galileo -- Set epoch

    model.train()
    train_loss = torch.tensor(0.0, device=device)
    train_correct = torch.tensor(0, device=device)
    
    dq.set_split(TRAIN_SPLIT_NAME)
    with tqdm(train_dataloader, unit="batch") as train_minibatchs:
        for train_minibatch in train_minibatchs:
            train_minibatchs.set_description(f"Epoch {epoch}")

            images = train_minibatch["image"].to(device)
            labels = train_minibatch["label"].to(device)

            preds = model(images)
            loss = criterion(preds, labels)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            with torch.no_grad():
                train_loss += loss
                train_batch_correct = (torch.argmax(preds, dim=1) == labels).sum()
                train_correct += train_batch_correct

            train_minibatchs.set_postfix(batch_loss=loss.item(), batch_accuracy=float(train_batch_correct) / labels.size(0))
            sleep(0.01)

    print(f"Training loss: {train_loss:.2f}")
    print(f"Training accuracy: {100 * float(train_correct) / len(train_dataloader.dataset):.2f}")
    
    dq.set_split(VAL_SPLIT_NAME)  # 🔭🌕 Galileo -- Set split
    if val_dataloader is not None:
        model.eval()
        val_loss = torch.tensor(0.0, device=device)
        val_correct = torch.tensor(0, device=device)

        with torch.no_grad():
            for val_minibatch in tqdm(val_dataloader):
                images = val_minibatch["image"].to(device)
                labels = val_minibatch["label"].to(device)
                
                preds = model(images)
                loss = criterion(preds, labels)

                val_loss += loss
                val_correct += (torch.argmax(preds, dim=1) == labels).sum()

        print(f"{VAL_SPLIT_NAME} loss: {val_loss:.2f}")
        print(f"{VAL_SPLIT_NAME} accuracy: {100*val_correct/len(val_dataloader.dataset):.2f}")

end = time()
print(f"Total training time: {end-start:.1f} seconds")
dq.finish()

unwatch(model)
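
A note on the accuracy figures: per-batch accuracy should be computed against the number of samples actually in the batch, because the last batch is usually smaller than `BATCH_SIZE`. The arithmetic below is a small sketch with hypothetical numbers (150 samples, made-up correct counts) showing the difference; `batch_sizes` is a hypothetical helper.

```python
from typing import List


def batch_sizes(dataset_size: int, batch_size: int) -> List[int]:
    """Sizes of the batches a non-dropping DataLoader would yield."""
    full, rest = divmod(dataset_size, batch_size)
    return [batch_size] * full + ([rest] if rest else [])


sizes = batch_sizes(150, 64)      # [64, 64, 22]
correct_per_batch = [48, 56, 22]  # hypothetical counts of correct predictions

# Dividing by the actual batch size gives the true per-batch accuracy.
actual = [c / n for c, n in zip(correct_per_batch, sizes)]
# Dividing by the nominal batch size understates the last, smaller batch.
nominal_last = correct_per_batch[-1] / 64

# Epoch accuracy: total correct over total samples.
epoch_accuracy = sum(correct_per_batch) / sum(sizes)

print(sizes)           # [64, 64, 22]
print(actual)          # [0.75, 0.875, 1.0]
print(nominal_last)    # 0.34375
print(epoch_accuracy)  # 0.84
```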

# 8. Monitoring: Inference on Production data

After training, continue monitoring the model's performance by logging predictions on production (unlabeled) data. The integration of the `dataquality` client is very similar.

In [None]:
INF_SPLIT_NAME = "inference"
INF_NAME = "inference_run1"

inf_dataset = ImageDatasetFromLocal(
    imgs_dir=f"./{LOCAL_DATA_DIR}/{dataset_dir_name}", 
    cloud_imgs_dir=CLOUD_DATA_DIR, 
    csv_relpath="inf.csv",
    split=INF_SPLIT_NAME, 
    transform=val_transforms)

print(f"Loaded {INF_SPLIT_NAME} dataset with {len(inf_dataset.ds)} samples and {len(train_dataset.list_of_labels)} labels")

# 🔭🌕 Galileo log: Set labels
dq.set_labels_for_run(train_dataset.list_of_labels)

# 🔭🌕 Galileo log: Log dataset
dq.log_image_dataset(
    dataset = inf_dataset.ds,
    split = INF_SPLIT_NAME,
    inference_name = INF_NAME,
    imgs_local_colname = "text",
    imgs_remote="imgs_remote_paths" if CLOUD_DATA_DIR is not None else None
)

inf_dataloader = TorchDataLoader(
    inf_dataset,
    batch_size=BATCH_SIZE,
    shuffle=False,
    num_workers=NUM_WORKERS,
    worker_init_fn=seed_worker,
    pin_memory=True
)

# 🔭🌕 Galileo logging -- Watch model
watch(
    model=model,
    classifier_layer=model.fc,
    dataloaders=[inf_dataloader],
)

In [None]:
dq.set_split(INF_SPLIT_NAME, inference_name = INF_NAME)  # 🔭🌕 Galileo -- Set split
model.eval()

with torch.no_grad():
    for inf_minibatch in tqdm(inf_dataloader):
        images = inf_minibatch["image"].to(device)
        preds = model(images)

dq.finish()

# General Help and Docs
- To get help with your task's requirements, call `dq.get_data_logger().doc()`
- To see more general data and model logging docs, run `dq.docs()`

In [None]:
dq.get_data_logger().doc()
help(dq.log_dataset)