Matching Network Implementation on Indian Sign Language Classification

### Ayush Muralidharan: PES1UG22AM912
### Tejas V Bhat: PES1UG22AM909
### Atharv Revankar: PES1UG22AM920
### Prarthana Kini: PES1UG22AM119

First, we import the required libraries: PyTorch for deep learning, torchvision for the pretrained model and image transforms, PIL for loading images, and NumPy for random sampling.


In [None]:
import os
import numpy as np
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
from PIL import Image
from torch.utils.data import Dataset

Next, we select the compute device. The code uses CUDA if a GPU is available and falls back to the CPU otherwise.

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

Using device: cuda


This is our custom dataset class for Indian Sign Language. It:

- Loads `.jpg` images from a `class/user/image` directory structure

- Organizes them by class (one class per sign)

- Applies the image transformations, if any are given

- Currently covers 20 sign classes

In [None]:
class SignLanguageDataset(Dataset):
    def __init__(self, root_dir, transform=None):
        self.root_dir = root_dir
        self.transform = transform
        # Sort the class folders so label indices are stable across runs and platforms
        self.classes = sorted(d for d in os.listdir(root_dir) if os.path.isdir(os.path.join(root_dir, d)))
        self.class_to_idx = {cls_name: i for i, cls_name in enumerate(self.classes)}

        self.image_paths = []
        self.labels = []

        for class_name in self.classes:
            class_dir = os.path.join(root_dir, class_name)
            for user_dir in os.listdir(class_dir):
                user_path = os.path.join(class_dir, user_dir)
                if os.path.isdir(user_path):
                    for img_name in os.listdir(user_path):
                        if img_name.endswith('.jpg'):
                            self.image_paths.append(os.path.join(user_path, img_name))
                            self.labels.append(self.class_to_idx[class_name])

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img_path = self.image_paths[idx]
        image = Image.open(img_path).convert('RGB')
        label = self.labels[idx]

        if self.transform:
            image = self.transform(image)

        return image, label
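
To make the expected folder layout concrete, here is a minimal sketch that builds a toy `class/user/image.jpg` tree in a temporary directory and indexes it the same way the loop in `__init__` does. The helper name `index_images` and the folder and file names are hypothetical, the image files are empty placeholders, and sorting the listings is a choice of this sketch so that label indices are reproducible.

```python
import os
import tempfile

def index_images(root_dir):
    """Collect (path, label) pairs from a root/class/user/*.jpg layout."""
    classes = sorted(
        d for d in os.listdir(root_dir)
        if os.path.isdir(os.path.join(root_dir, d))
    )
    class_to_idx = {name: i for i, name in enumerate(classes)}
    samples = []
    for class_name in classes:
        class_dir = os.path.join(root_dir, class_name)
        for user_dir in sorted(os.listdir(class_dir)):
            user_path = os.path.join(class_dir, user_dir)
            if not os.path.isdir(user_path):
                continue
            for img_name in sorted(os.listdir(user_path)):
                if img_name.endswith('.jpg'):
                    samples.append((os.path.join(user_path, img_name),
                                    class_to_idx[class_name]))
    return classes, samples

# Toy tree: 2 classes x 2 users x 3 placeholder images, plus one non-image file per user
with tempfile.TemporaryDirectory() as root:
    for class_name in ('agree', 'afraid'):
        for user in ('user_1', 'user_2'):
            user_path = os.path.join(root, class_name, user)
            os.makedirs(user_path)
            for i in range(3):
                open(os.path.join(user_path, f'{i}.jpg'), 'w').close()
            open(os.path.join(user_path, 'notes.txt'), 'w').close()

    toy_classes, toy_samples = index_images(root)

print(toy_classes)       # class folders in sorted order
print(len(toy_samples))  # number of indexed .jpg files (the .txt files are skipped)
```

Note that only files ending in lowercase `.jpg` are picked up, so images saved as `.JPG`, `.jpeg`, or `.png` would be silently ignored by this kind of loop.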

For preprocessing our images, we:

- Resize all images to 224x224 pixels

- Convert them to tensors

- Normalize them using the ImageNet channel means and standard deviations

This ensures consistent input to our model.
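
As a worked example of the normalization step, the sketch below applies to a single RGB pixel the two operations that `ToTensor` and `Normalize` perform on pixel values: scaling 8-bit values to [0, 1], then computing (x - mean) / std per channel. It is plain Python rather than torchvision, and the pixel value and the helper name `normalize_pixel` are made up for illustration.

```python
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

def normalize_pixel(rgb):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    scaled = [v / 255.0 for v in rgb]
    return [(x - m) / s for x, m, s in zip(scaled, IMAGENET_MEAN, IMAGENET_STD)]

pixel = (128, 128, 128)  # a mid-gray pixel (illustrative value)
normalized = normalize_pixel(pixel)
print([round(v, 3) for v in normalized])  # roughly [0.074, 0.205, 0.426]
```

The same gray value maps to a different number in each channel because each channel has its own mean and standard deviation.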

In [None]:
# Define the transforms
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                       std=[0.229, 0.224, 0.225])
])

# Create dataset instance
dataset_path = "./Indian-sign-Language-Real-life-Words"
dataset = SignLanguageDataset(root_dir=dataset_path, transform=transform)

# Basic information about the dataset
print(f"Number of classes: {len(dataset.classes)}")
print("Classes:", dataset.classes)

Number of classes: 20
Classes: ['afraid', 'agree', 'assistance', 'bad', 'become', 'college', 'doctor', 'from', 'pain', 'pray', 'secondary', 'skin', 'small', 'specific', 'stand', 'today', 'warn', 'which', 'work', 'you']


We set up the few-shot learning split by:

- Splitting our 20 classes into 16 meta-training and 4 meta-testing classes

- Using random selection with a fixed seed for reproducibility

Holding out entire classes tests the model's ability to recognize new signs from only a few examples.

In [None]:
# Split classes into meta-train and meta-test
np.random.seed(42)  # for reproducibility
n_classes = len(dataset.classes)
n_meta_test_classes = 4  # 20% of classes
meta_test_classes = np.random.choice(dataset.classes, n_meta_test_classes, replace=False)
meta_train_classes = [c for c in dataset.classes if c not in meta_test_classes]

print("\nMeta-learning split:")
print(f"Meta-training classes ({len(meta_train_classes)}): {meta_train_classes}")
print(f"Meta-testing classes ({len(meta_test_classes)}): {meta_test_classes}")



Meta-learning split:
Meta-training classes (16): ['assistance', 'bad', 'become', 'college', 'doctor', 'from', 'pain', 'pray', 'secondary', 'skin', 'small', 'specific', 'stand', 'warn', 'work', 'you']
Meta-testing classes (4): ['afraid' 'which' 'today' 'agree']
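
A quick sanity check on the split printed above: the meta-training and meta-testing classes should be disjoint and together cover all 20 classes. The sketch below hard-codes the class names from the printed output rather than reading them from `dataset`, so it runs on its own.

```python
all_classes = ['afraid', 'agree', 'assistance', 'bad', 'become', 'college',
               'doctor', 'from', 'pain', 'pray', 'secondary', 'skin', 'small',
               'specific', 'stand', 'today', 'warn', 'which', 'work', 'you']
meta_train = ['assistance', 'bad', 'become', 'college', 'doctor', 'from',
              'pain', 'pray', 'secondary', 'skin', 'small', 'specific',
              'stand', 'warn', 'work', 'you']
meta_test = ['afraid', 'which', 'today', 'agree']

# No class may appear on both sides, and no class may be left out
overlap = set(meta_train) & set(meta_test)
missing = set(all_classes) - set(meta_train) - set(meta_test)

print(f"Overlap: {sorted(overlap)}")
print(f"Missing: {sorted(missing)}")
print(f"Held-out fraction: {len(meta_test) / len(all_classes):.0%}")
```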


In [None]:
class EmbeddingNetwork(nn.Module):
    def __init__(self):
        super(EmbeddingNetwork, self).__init__()
        resnet = torchvision.models.resnet18(weights=torchvision.models.ResNet18_Weights.IMAGENET1K_V1)
        self.features = nn.Sequential(*list(resnet.children())[:-1])

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)
        return x

class MatchingNetwork(nn.Module):
    def __init__(self):
        super(MatchingNetwork, self).__init__()
        self.embedding_network = EmbeddingNetwork()

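    def forward(self, support_images, support_labels, query_images):
        # Embed both sets with the shared ResNet18 backbone
        support_embeddings = self.embedding_network(support_images)
        query_embeddings = self.embedding_network(query_images)

        # Attention over the support set: softmax of cosine similarities
        similarities = self.cosine_similarity(query_embeddings, support_embeddings)
        attention = torch.softmax(similarities, dim=1)

        # One-hot support labels, created on the same device as the labels
        n_episode_classes = len(support_labels.unique())
        one_hot = torch.eye(n_episode_classes, device=support_labels.device)[support_labels]

        # Attention-weighted label average: each row is a distribution over classes
        predicted_logits = torch.matmul(attention, one_hot)
        return predicted_logits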

    def cosine_similarity(self, query, support):
        query_norm = torch.norm(query, dim=1, keepdim=True)
        support_norm = torch.norm(support, dim=1, keepdim=True)

        query_normalized = query / query_norm
        support_normalized = support / support_norm

        similarities = torch.matmul(query_normalized, support_normalized.t())
        return similarities

Our model architecture consists of:

- An embedding network using ResNet18 for feature extraction

- A matching network that compares query images with support images

- Cosine similarity to measure how close images are to each other
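
To illustrate the matching step in isolation, here is a framework-free sketch of what `MatchingNetwork.forward` computes once embeddings are available: the cosine similarity between a query and each support embedding, a softmax over those similarities, and an attention-weighted sum of one-hot support labels. The 2-D embeddings are made-up toy values standing in for the 512-dimensional ResNet18 features, and the function names are illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def match(query, support_embeddings, support_labels, n_classes):
    """Return class probabilities for one query given a labeled support set."""
    sims = [cosine_similarity(query, s) for s in support_embeddings]
    attention = softmax(sims)
    class_probs = [0.0] * n_classes
    for weight, label in zip(attention, support_labels):
        class_probs[label] += weight  # same as attention @ one_hot(labels)
    return class_probs

# Toy support set: two examples per class, class 0 near the x-axis, class 1 near the y-axis
support_embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
support_labels = [0, 0, 1, 1]
query = [0.8, 0.3]  # points mostly along the x-axis, so it should match class 0

class_probs = match(query, support_embeddings, support_labels, n_classes=2)
predicted = class_probs.index(max(class_probs))
print([round(p, 3) for p in class_probs], predicted)
```

Because cosine similarities are bounded in [-1, 1], the softmax over them is fairly flat, so the winning class gets a clear majority of the probability mass but far from all of it.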

In [None]:
def create_episode(dataset, classes, n_way=5, n_support=5, n_query=5):
    episode_classes = np.random.choice(classes, n_way, replace=False)

    support_images = []
    support_labels = []
    query_images = []
    query_labels = []

    for label, class_name in enumerate(episode_classes):
        class_indices = [i for i, l in enumerate(dataset.labels)
                        if dataset.classes[l] == class_name]

        selected_indices = np.random.choice(class_indices,
                                          n_support + n_query,
                                          replace=False)

        support_idx = selected_indices[:n_support]
        query_idx = selected_indices[n_support:n_support + n_query]

        for idx in support_idx:
            img, _ = dataset[idx]
            support_images.append(img)
            support_labels.append(label)

        for idx in query_idx:
            img, _ = dataset[idx]
            query_images.append(img)
            query_labels.append(label)

    support_images = torch.stack(support_images).to(device)
    support_labels = torch.tensor(support_labels).to(device)
    query_images = torch.stack(query_images).to(device)
    query_labels = torch.tensor(query_labels).to(device)

    return support_images, support_labels, query_images, query_labels

For training, we:

- Create episodes for n-way, k-shot learning

- Give each episode support (reference) and query (evaluation) images

- Use 5 support and 5 query images per class
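
The episode construction can be sketched without any images: given a mapping from class name to sample indices, pick `n_way` classes, then draw `n_support + n_query` distinct indices per class and split them. This is a simplified stand-in for `create_episode` that uses Python's `random` module instead of NumPy; `sample_episode` and the toy `indices_by_class` mapping are hypothetical.

```python
import random

def sample_episode(indices_by_class, n_way, n_support, n_query, rng):
    """Return (support, query) lists of (sample_index, episode_label) pairs."""
    episode_classes = rng.sample(sorted(indices_by_class), n_way)
    support, query = [], []
    for label, class_name in enumerate(episode_classes):
        # Draw without replacement so support and query never share a sample
        chosen = rng.sample(indices_by_class[class_name], n_support + n_query)
        support += [(idx, label) for idx in chosen[:n_support]]
        query += [(idx, label) for idx in chosen[n_support:]]
    return support, query

# Toy dataset: 6 classes with 30 sample indices each
indices_by_class = {f'class_{c}': list(range(c * 30, (c + 1) * 30)) for c in range(6)}

rng = random.Random(0)
support, query = sample_episode(indices_by_class, n_way=4, n_support=5, n_query=5, rng=rng)
print(len(support), len(query))  # 20 20
```

The labels are local to the episode (0 to `n_way - 1`) rather than the dataset-wide class indices, which is what lets the same model handle classes it has never been trained on.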

In [None]:
# Training parameters
n_way = 4  # changed from 5 to 4-way classification
n_support = 5  # 5-shot
n_query = 5
n_episodes = 1000

# Initialize model, optimizer and loss function
model = MatchingNetwork().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

Our training configuration uses:

- 4-way classification (4 classes at a time)

- 5-shot learning (5 examples per class)

- 1000 training episodes
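
For a sense of scale, these settings imply the following per-episode counts. This is simple arithmetic on the stated values; the chance-level figure assumes balanced query sets, which holds here because every class contributes the same number of query images.

```python
n_way, n_support, n_query, n_episodes = 4, 5, 5, 1000

support_per_episode = n_way * n_support   # images the model compares against
query_per_episode = n_way * n_query       # images that are classified and scored
images_per_episode = support_per_episode + query_per_episode
total_images_sampled = images_per_episode * n_episodes  # counts repeats across episodes
chance_accuracy = 1 / n_way               # expected accuracy of random guessing

print(support_per_episode, query_per_episode, images_per_episode)  # 20 20 40
print(total_images_sampled)                                        # 40000
print(f"{chance_accuracy:.2f}")                                    # 0.25
```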

In [None]:
def train_episode():
    model.train()
    optimizer.zero_grad()

    support_images, support_labels, query_images, query_labels = create_episode(
        dataset, classes=meta_train_classes, n_way=n_way, n_support=n_support, n_query=n_query
    )

    predicted_logits = model(support_images, support_labels, query_images)
    loss = criterion(predicted_logits, query_labels)

    loss.backward()
    optimizer.step()

    _, predicted = torch.max(predicted_logits, 1)
    accuracy = (predicted == query_labels).float().mean()

    return loss.item(), accuracy.item()

# Training and model saving
if not os.path.exists('matching_network.pth'):
    print("Starting training...")
    for episode in range(n_episodes):
        loss, accuracy = train_episode()

        if (episode + 1) % 100 == 0:
            print(f"Episode {episode + 1}/{n_episodes}")
            print(f"Loss: {loss:.4f}")
            print(f"Accuracy: {accuracy:.4f}")

    torch.save(model.state_dict(), 'matching_network.pth')
else:
    # map_location lets weights saved on a GPU load on a CPU-only machine
    model.load_state_dict(torch.load('matching_network.pth', map_location=device))
    model.eval()

def evaluate(n_test_episodes=100):
    model.eval()
    total_accuracy = 0

    with torch.no_grad():
        for episode in range(n_test_episodes):
            support_images, support_labels, query_images, query_labels = create_episode(
                dataset,
                classes=meta_test_classes,
                n_way=2,  # 2-way episodes drawn from the 4 meta-test classes
                n_support=5,
                n_query=5
            )

            predicted_logits = model(support_images, support_labels, query_images)
            _, predicted = torch.max(predicted_logits, 1)
            accuracy = (predicted == query_labels).float().mean()
            total_accuracy += accuracy.item()

    return total_accuracy / n_test_episodes

# Reload the saved weights before evaluation
model.load_state_dict(torch.load('matching_network.pth', map_location=device))


The training process:

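- Trains the model episode by episode

- Uses the Adam optimizer and cross-entropy loss

- Saves the trained weights to `matching_network.pth` once training finishes

- Achieves 71.64% average accuracy over 500 test episodes on the held-out classes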
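
One detail worth keeping in mind when reading the printed loss values: `MatchingNetwork.forward` returns attention-weighted averages of one-hot labels, so each output row is already a probability distribution, and `nn.CrossEntropyLoss` applies a log-softmax on top of it. The sketch below reproduces that computation in plain Python for a 4-way episode. It suggests that, under this setup, the loss cannot fall to zero even for a perfectly confident correct prediction. The helper name `cross_entropy_on_probs` is illustrative.

```python
import math

def cross_entropy_on_probs(probs, target):
    """Cross-entropy with log-softmax applied to inputs that are already probabilities."""
    log_sum_exp = math.log(sum(math.exp(p) for p in probs))
    return log_sum_exp - probs[target]

# Three model outputs for a query whose true class is 0
perfect = cross_entropy_on_probs([1.0, 0.0, 0.0, 0.0], target=0)      # all mass on the right class
uniform = cross_entropy_on_probs([0.25, 0.25, 0.25, 0.25], target=0)  # no preference
wrong = cross_entropy_on_probs([0.0, 1.0, 0.0, 0.0], target=0)        # all mass on a wrong class

print(f"{perfect:.4f} {uniform:.4f} {wrong:.4f}")
```

The three values come out at roughly 0.744, 1.386, and 1.744, so the loss moves within a narrow band. Accuracy is unaffected, since the argmax of the outputs does not change; this mainly matters for interpreting the loss values.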

Finally, we have a prediction function that:

- Takes a single image as input

- Compares it with support examples drawn from the meta-test classes

- Predicts the sign class

- Can handle signs not seen during training, given a few support examples per class

In [None]:
# Evaluate the model
n_test_episodes = 500
print("\nEvaluating model...")
test_accuracy = evaluate(n_test_episodes=n_test_episodes)
print(f"Test Accuracy over {n_test_episodes} episodes: {test_accuracy:.4f}")

def predict_single_image(image_path, support_size=5):
    test_image = Image.open(image_path).convert('RGB')
    test_image = transform(test_image).unsqueeze(0).to(device)

    support_images = []
    support_labels = []

    for label, class_name in enumerate(meta_test_classes[:4]):  # Only use 4 classes
        class_indices = [i for i, l in enumerate(dataset.labels)
                        if dataset.classes[l] == class_name]
        selected_indices = np.random.choice(class_indices, support_size, replace=False)

        for idx in selected_indices:
            img, _ = dataset[idx]
            support_images.append(img)
            support_labels.append(label)

    support_images = torch.stack(support_images).to(device)
    support_labels = torch.tensor(support_labels).to(device)

    model.eval()
    with torch.no_grad():
        predicted_logits = model(support_images, support_labels, test_image)
        _, predicted = torch.max(predicted_logits, 1)
        predicted_class = meta_test_classes[predicted.item()]

    return predicted_class



Evaluating model...
Test Accuracy over 500 episodes: 0.7164


This notebook demonstrates a practical application of few-shot learning for sign language recognition, reaching 71.64% accuracy on sign classes that were held out from training.