In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from PIL import Image
from torch.utils.data import DataLoader, Dataset
from torchvision import transforms
import torch.nn as nn
import torch
import torch.optim as optim

In [2]:

!wget https://github.com/brendenlake/omniglot/raw/master/python/images_evaluation.zip

!wget https://github.com/brendenlake/omniglot/raw/master/python/images_background.zip

--2024-06-24 17:25:00--  https://github.com/brendenlake/omniglot/raw/master/python/images_evaluation.zip
Resolving github.com (github.com)... 140.82.121.3
Connecting to github.com (github.com)|140.82.121.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/brendenlake/omniglot/master/python/images_evaluation.zip [following]
--2024-06-24 17:25:00--  https://raw.githubusercontent.com/brendenlake/omniglot/master/python/images_evaluation.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6462886 (6.2M) [application/zip]
Saving to: 'images_evaluation.zip'


2024-06-24 17:25:00 (136 MB/s) - 'images_evaluation.zip' saved [6462886/6462886]

--2024-06-24 17:25:01--  https://github.com/brendenlake/omniglot

In [3]:

!unzip -qq images_background.zip
!unzip -qq images_evaluation.zip

# Two-input dataset

Building a multi-input model starts with crafting a custom dataset that can supply all the inputs to the model. In this exercise, you will build the Omniglot dataset that serves triplets consisting of:

* The image of a character to be classified,
* The one-hot encoded alphabet vector of length 30, with zeros everywhere except for a single one marking the ID of the alphabet the character comes from,
* The target label, an integer between 0 and 963.

You are provided with train_samples, a list of 3-tuples, each comprising an image's file path, its alphabet vector, and the target label.

* Assign transform and samples to class attributes with the same names.
* Implement the .__len__() method such that it returns the number of samples stored in the class's samples attribute.
* Unpack the sample at index idx, assigning its contents to img_path, alphabet, and label.
* Transform the loaded image with self.transform() and assign it to img_transformed.

Nicely done! With your implementation of OmniglotDataset ready, you can create the dataset and DataLoader, just as you did before.

In [3]:
class OmniglotDataset(Dataset):
    def __init__(self, transform, samples):
        # Assign transform and samples to class attributes
        self.transform = transform
        self.samples = samples

    def __len__(self):
        # Return number of samples
        return len(self.samples)

    def __getitem__(self, idx):
        # Unpack the sample at index idx
        img_path, alphabet, label = self.samples[idx]
        img = Image.open(img_path).convert('L')
        # Transform the image 
        img_transformed = self.transform(img)
        return img_transformed, alphabet, label
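
Because each alphabet vector is stored as a plain Python list, the DataLoader's default collate function batches it element by element, so a batch arrives as a list of 30 tensors of shape [B] rather than one [B, 30] tensor. The sketch below illustrates this with a hypothetical ToyTripletDataset (a stand-in that needs no image files, not part of the original notebook) and shows that returning the vector as a tensor avoids the issue.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyTripletDataset(Dataset):
    """Hypothetical stand-in for OmniglotDataset that serves synthetic triplets."""

    def __init__(self, as_tensor):
        # If True, the one-hot vector is returned as a tensor instead of a list
        self.as_tensor = as_tensor

    def __len__(self):
        return 8

    def __getitem__(self, idx):
        img = torch.zeros(1, 105, 105)  # blank image placeholder
        one_hot = [0] * 30
        one_hot[idx % 30] = 1
        if self.as_tensor:
            one_hot = torch.tensor(one_hot, dtype=torch.float32)
        return img, one_hot, idx


# One-hot vectors stored as lists: the batch is a list of 30 tensors of shape [4]
_, alpha_list, _ = next(iter(DataLoader(ToyTripletDataset(False), batch_size=4)))
print(type(alpha_list).__name__, len(alpha_list), alpha_list[0].shape)

# One-hot vectors stored as tensors: the batch is a single [4, 30] tensor
_, alpha_tensor, _ = next(iter(DataLoader(ToyTripletDataset(True), batch_size=4)))
print(alpha_tensor.shape)

# The list form can be repaired after the fact by stacking along dim=1
alpha_fixed = torch.stack(alpha_list, dim=1).float()
print(alpha_fixed.shape)
```

Converting the vector to a tensor inside __getitem__ is one option; stacking the list in the training loop is another.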

**Tensor Concatenation**

In [5]:
x = torch.tensor([[1,2,3],])

y = torch.tensor([[4,5,6],])

In [6]:
# Concatenation along axis 0
torch.cat((x, y), dim = 0)



tensor([[1, 2, 3],
        [4, 5, 6]])

In [7]:
# Concatenation along axis 1
torch.cat((x, y), dim = 1)

tensor([[1, 2, 3, 4, 5, 6]])
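
In the two-input model, concatenation along dim=1 is what joins the per-sample feature vectors of the two branches. A minimal sketch with random placeholder tensors (the sizes 128 and 8 are taken from the model defined in this notebook; the batch size of 32 is only an example):

```python
import torch

image_features = torch.randn(32, 128)    # placeholder for the image branch output
alphabet_features = torch.randn(32, 8)   # placeholder for the alphabet branch output

# dim=1 keeps the batch dimension and joins the feature dimensions
combined = torch.cat((image_features, alphabet_features), dim=1)
print(combined.shape)  # torch.Size([32, 136])
```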

**Two-input model**

With the data ready, it's time to build the two-input model architecture! To do so, you will set up a model class with the following methods:

* .__init__(), in which you will define sub-networks by grouping layers; this is where you define the two sub-networks for processing the two inputs, and the classifier that returns a classification score for each class.
* forward(), in which you will pass both inputs through the corresponding pre-defined sub-networks, concatenate the outputs, and pass them to the classifier.

* Define image, alphabet and classifier sub-networks as sequential models, assigning them to self.image_layer, self.alphabet_layer and self.classifier, respectively.
* Pass the image and alphabet through the appropriate model layers.
* Concatenate the outputs from image and alphabet layers and assign the result to x.

In [8]:
# class Net(nn.Module):
#     def __init__(self):
#         super(Net, self).__init__()
#         # Define sub-networks as sequential models
#         self.image_layer = nn.Sequential(
#             nn.Conv2d(1, 16, kernel_size=3, padding=1),
#             nn.MaxPool2d(kernel_size=2),
#             nn.ELU(),
#             nn.Flatten(),
#             nn.Linear(16*32*32, 128)
#         )
#         self.alphabet_layer = nn.Sequential(
#             nn.Linear(30, 8),
#             nn.ELU(), 
#         )
#         self.classifier = nn.Sequential(
#             nn.Linear(128 + 8, 964), 
#         )
        
#     def forward(self, x_image, x_alphabet):
#         # Pass the x_image and x_alphabet through appropriate layers
#         x_image = self.image_layer(x_image)
#         x_alphabet = self.alphabet_layer(x_alphabet)
#         # Concatenate x_image and x_alphabet
#         x = torch.cat((x_image, x_alphabet), dim=1)
#         return self.classifier(x)

In [52]:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # Define sub-networks as sequential models
        self.image_layer = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # Input: [B, 1, 105, 105], Output: [B, 16, 105, 105]
            nn.MaxPool2d(kernel_size=2),                 # Output: [B, 16, 52, 52]
            nn.ELU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), # Output: [B, 32, 52, 52]
            nn.MaxPool2d(kernel_size=2),                 # Output: [B, 32, 26, 26]
            nn.ELU(),
            nn.Flatten(),                                # Output: [B, 32*26*26]
            nn.Linear(32*26*26, 128)                     # Adjust input size based on flattened output
        )
        self.alphabet_layer = nn.Sequential(
            nn.Linear(30, 8),
            nn.ELU(), 
        )
        self.classifier = nn.Sequential(
            nn.Linear(128 + 8, 964), 
        )
        
    def forward(self, x_image, x_alphabet):
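        # Ensure x_alphabet is a float tensor of shape [B, 30]. The default collate
        # function turns list-valued one-hot vectors into a list of 30 tensors of
        # shape [B], so they are stacked along dim=1 (dim=0 would give [30, B]).
        if isinstance(x_alphabet, list):
            x_alphabet = torch.stack(x_alphabet, dim=1)
        x_alphabet = x_alphabet.float()  # Convert to FloatTensor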
        
        # Pass the x_image and x_alphabet through appropriate layers
        x_image = self.image_layer(x_image)
        
        # Debugging shapes
#         print(f"x_image shape after image_layer: {x_image.shape}")
#         print(f"x_alphabet shape before view: {x_alphabet.shape}")
        
        x_alphabet = x_alphabet.view(x_alphabet.size(0), -1)  # Flatten x_alphabet
        
        # Debugging shapes
#         print(f"x_alphabet shape after view: {x_alphabet.shape}")
        
        x_alphabet = self.alphabet_layer(x_alphabet)
        
        # Concatenate x_image and x_alphabet
        x = torch.cat((x_image, x_alphabet), dim=1)
        return self.classifier(x)
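
The input size of the first linear layer depends on the image resolution, so it has to be recomputed whenever the resize transform changes. One way to avoid doing the arithmetic by hand is to push a dummy image through the convolutional part and read off the flattened size. This is a sketch that repeats the layers above in a standalone conv_stack (a name introduced here for illustration):

```python
import torch
import torch.nn as nn

# Convolutional part of image_layer, without the final linear layer
conv_stack = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),
    nn.MaxPool2d(kernel_size=2),
    nn.ELU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.MaxPool2d(kernel_size=2),
    nn.ELU(),
    nn.Flatten(),
)

# A single blank 105x105 grayscale image is enough to infer the shape
with torch.no_grad():
    flat_features = conv_stack(torch.zeros(1, 1, 105, 105)).shape[1]

print(flat_features)  # 21632, i.e. 32 * 26 * 26
```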

**Training Loop**

In [16]:
import os

def collect_samples(root_dir, alphabet_map):
    samples = []
    # Running character ID, unique across alphabets. The directory names
    # (character01, character02, ...) restart in every alphabet, so they cannot
    # serve as the global 0-963 target on their own.
    label = 0
    
    # Traverse the root directory
    for alphabet in sorted(os.listdir(root_dir)):
        alphabet_path = os.path.join(root_dir, alphabet)
        
        if os.path.isdir(alphabet_path):
            for character in sorted(os.listdir(alphabet_path)):
                character_path = os.path.join(alphabet_path, character)
                
                if os.path.isdir(character_path) and character.startswith('character'):
                    for image in sorted(os.listdir(character_path)):
                        image_path = os.path.join(character_path, image)
                        
                        # Create one-hot encoded alphabet vector
                        alphabet_vector = [0] * len(alphabet_map)
                        alphabet_vector[alphabet_map[alphabet]] = 1
                        
                        samples.append((image_path, alphabet_vector, label))
                    
                    # All images of this character share one label; advance to the next ID
                    label += 1
    
    return samples

# Create a map for the alphabets
root_dir = '/kaggle/input/omniglot/images_background/images_background'
alphabet_map = {alphabet: idx for idx, alphabet in enumerate(sorted(os.listdir(root_dir)))}

# Collect samples from images_background and images_evaluation
background_samples = collect_samples(root_dir, alphabet_map)
# evaluation_samples = collect_samples('/kaggle/input/omniglot/images_evaluation', alphabet_map)

# Combine all samples
all_samples = background_samples

# Print the number of samples collected for verification
print(f"Number of samples collected: {len(all_samples)}")


Number of samples collected: 19280


In [4]:
# Define the transformation for the images
transform = transforms.Compose([
    transforms.Resize((105, 105)),  # Resize images to 105x105
    transforms.ToTensor(),  # Convert images to tensors
    transforms.Normalize((0.5,), (0.5,))  # Normalize images to [-1, 1]
])

# Create the dataset instance
omniglot_dataset = OmniglotDataset(transform=transform, samples=all_samples)

# Create the DataLoader
dataloader_train = DataLoader(omniglot_dataset, batch_size=32, shuffle=True, num_workers=4)

# Iterate through the DataLoader
for batch in dataloader_train:
    images, alphabets, labels = batch
    
    # Now you can use images, alphabets, and labels for your training
    print(images.shape, len(alphabets), labels.shape)
    break


In [53]:
net = Net()
print(net)

Net(
  (image_layer): Sequential(
    (0): Conv2d(1, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (2): ELU(alpha=1.0)
    (3): Conv2d(16, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (5): ELU(alpha=1.0)
    (6): Flatten(start_dim=1, end_dim=-1)
    (7): Linear(in_features=21632, out_features=128, bias=True)
  )
  (alphabet_layer): Sequential(
    (0): Linear(in_features=30, out_features=8, bias=True)
    (1): ELU(alpha=1.0)
  )
  (classifier): Sequential(
    (0): Linear(in_features=136, out_features=964, bias=True)
  )
)


In [54]:
# Assuming a batch of images and one-hot encoded alphabets
images = torch.randn(32, 1, 105, 105)
alphabets = torch.randn(32, 30)  # Simulating a tensor of shape (batch_size, 30)

# Forward pass to check dimensions
outputs = net(images, alphabets)
print(outputs.shape)  # Expected output: [32, 964]

torch.Size([32, 964])


In [58]:
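# Loss and optimizer (not defined in the cells shown above; assumed here to
# match the settings used for the two-output model later in this notebook)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.05)

for epoch in range(1):
    for img, alpha, labels in dataloader_train:
        # The default collate function returns the one-hot vectors as a list of
        # 30 tensors of shape [B]; stack them into a single [B, 30] tensor
        if isinstance(alpha, list):
            alpha = torch.stack(alpha, dim=1)
        
        optimizer.zero_grad()
        outputs = net(img, alpha)
        
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
    
    # Report the loss of the last batch in the epoch
    print(f"Loss for epoch {epoch + 1}: {loss.item()}")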
       

Loss for epoch 1: 6.856453895568848
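
As a rough sanity check on this value: a classifier that assigns equal probability to all 964 classes has a cross-entropy of ln(964), so a loss close to that number suggests the model is still near chance level after one epoch. The reference value can be computed directly:

```python
import math

num_classes = 964

# Cross-entropy of a uniform prediction over num_classes classes
chance_loss = math.log(num_classes)
print(f"{chance_loss:.3f}")  # 6.871
```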


# Multi-output models

**Two-output Dataset and DataLoader**

In this and the following exercises, you will build a two-output model to predict both the character and the alphabet it comes from based on the character's image. As always, you will start with getting the data ready.

The OmniglotDataset class you have created before is available for you to use along with updated samples. Let's use it to build the Dataset and the DataLoader.



In [8]:
class OmniglotDataset(Dataset):
    def __init__(self, transform, samples):
        # Assign transform and samples to class attributes
        self.transform = transform
        self.samples = samples

    def __len__(self):
        # Return number of samples
        return len(self.samples)

    def __getitem__(self, idx):
        # Unpack the sample at index idx
        img_path, alphabet, label =  self.samples[idx]
        img = Image.open(img_path).convert('L')
        # Transform the image 
        img_transformed = self.transform(img)
        return img_transformed, alphabet, label

**Two-output model architecture**

In this exercise, you will construct a multi-output neural network architecture capable of predicting the character and the alphabet.

Recall the general structure: in the .__init__() method, you define layers to be used in the forward pass later. In the forward() method, you will first pass the input image through a couple of layers to obtain its embedding, which in turn is fed into two separate classifier layers, one for each output.

* Define self.classifier_alpha and self.classifier_char as linear layers with input shapes matching the output of image_layer, and output shapes corresponding to the number of alphabets (30) and the number of characters (964), respectively.
* Pass the image embedding x_image separately through each of the classifiers, assigning the results to output_alpha and output_char, respectively, and return them in this order.

In [9]:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # Shared image sub-network; the shape comments assume 64x64 inputs,
        # matching the Resize((64, 64)) transform applied to dataset_train
        self.image_layer = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # Output: [B, 16, 64, 64]
            nn.MaxPool2d(kernel_size=2),                 # Output: [B, 16, 32, 32]
            nn.Conv2d(16, 32, kernel_size=3, padding=1), # Output: [B, 32, 32, 32]
            nn.MaxPool2d(kernel_size=2),                 # Output: [B, 32, 16, 16]
            nn.ELU(),
            nn.Flatten(),                                # Output: [B, 32*16*16]
            nn.Linear(32*16*16, 128)                     # Adjust input size based on flattened output
        )
        
        # One classifier head per output
        self.classifier_alpha = nn.Linear(128, 30)
        self.classifier_char = nn.Linear(128, 964)
        
    def forward(self, x_image):
        # Compute the shared image embedding
        x_image = self.image_layer(x_image)
        
        # Pass the embedding separately through each classifier
        output_alpha = self.classifier_alpha(x_image)
        output_char = self.classifier_char(x_image)
        return output_alpha, output_char

In [None]:
# Print the sample at index 100
print(samples[100])

# Create dataset_train
dataset_train = OmniglotDataset(
    transform=transforms.Compose([
        transforms.ToTensor(),
        transforms.Resize((64, 64)),
    ]),
    samples=samples,
)

# Create dataloader_train
dataloader_train = DataLoader(
    dataset_train, shuffle=True, batch_size=32,
)

Notice how samples now contain, next to the image path, the target labels for the character and the alphabet. In the next exercise, you will examine the architecture of the two-output model.
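
The updated samples list is provided by the exercise and is not built in this notebook. As a minimal sketch, assuming the same root/alphabet/character/image directory layout that collect_samples traverses, such a list could be assembled as follows. The helper name collect_two_output_samples is hypothetical, and the tuple order (path, alphabet label, character label) is chosen to match how the training loop unpacks each batch.

```python
import os


def collect_two_output_samples(root_dir):
    """Return (image_path, alphabet_label, character_label) tuples.

    Hypothetical helper; assumes a root_dir/<alphabet>/<character>/<image> layout.
    """
    samples = []
    char_label = 0  # running character ID, unique across alphabets

    alphabets = sorted(
        name for name in os.listdir(root_dir)
        if os.path.isdir(os.path.join(root_dir, name))
    )
    for alpha_label, alphabet in enumerate(alphabets):
        alphabet_path = os.path.join(root_dir, alphabet)
        for character in sorted(os.listdir(alphabet_path)):
            character_path = os.path.join(alphabet_path, character)
            if not os.path.isdir(character_path):
                continue
            for image in sorted(os.listdir(character_path)):
                image_path = os.path.join(character_path, image)
                samples.append((image_path, alpha_label, char_label))
            # All images of one character share the same character label
            char_label += 1
    return samples
```

Both labels are plain integers here, so the default collate function turns each of them into a tensor of shape [B], which is the form nn.CrossEntropyLoss expects for class-index targets.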

**Training multi-output models**

When training models with multiple outputs, it is crucial to ensure that the loss function is defined correctly.

In this case, the model produces two outputs: predictions for the alphabet and the character. For each of these, there are corresponding ground truth labels, which allow you to calculate two separate losses: one incurred from incorrect alphabet classifications, and the other from incorrect character classifications. Since in both cases you are dealing with a multi-class classification task, the Cross-Entropy loss can be applied each time.

Gradient descent can optimize only one loss function, however. You will thus define the total loss as the sum of alphabet and character losses.

1. Calculate the alphabet classification loss and assign it to loss_alpha.
2. Calculate the character classification loss and assign it to loss_char.
3. Compute the total loss as the sum of the two partial losses and assign it to loss.

In [None]:
net = Net()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.05)

for epoch in range(1):
    for images, labels_alpha, labels_char in dataloader_train:
        optimizer.zero_grad()
        outputs_alpha, outputs_char = net(images)
        # Compute alphabet classification loss
        loss_alpha = criterion(outputs_alpha, labels_alpha)
        # Compute character classification loss
        loss_char = criterion(outputs_char, labels_char)
        # Compute total loss
        loss = loss_alpha + loss_char
        loss.backward()
        optimizer.step()

Defining the total loss as the sum of the two task-specific losses is a simple way to obtain the single optimization objective required by gradient descent. There are, however, other ways to combine the partial losses.
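
One common alternative is a weighted sum, which lets one task count more than the other. The sketch below uses random placeholder logits and labels instead of the real model and data, and the weight of 0.67 is a purely hypothetical value chosen for illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
criterion = nn.CrossEntropyLoss()

# Placeholder logits and labels for a batch of 4 samples
outputs_alpha = torch.randn(4, 30, requires_grad=True)
outputs_char = torch.randn(4, 964, requires_grad=True)
labels_alpha = torch.randint(0, 30, (4,))
labels_char = torch.randint(0, 964, (4,))

loss_alpha = criterion(outputs_alpha, labels_alpha)
loss_char = criterion(outputs_char, labels_char)

# Hypothetical weighting: emphasize the character task over the alphabet task
weight_char = 0.67
loss = (1 - weight_char) * loss_alpha + weight_char * loss_char

# The weighted total is still a single scalar, so backward() works as before
loss.backward()
print(loss.item())
```

Keeping the weights summing to one leaves the total on a scale comparable to the individual losses, which can make learning-rate choices easier to carry over.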