# Pseudo-Labeling for Unlabeled Datasets with Zero-Shot Image-Text Embeddings

This notebook introduces a pseudo-labeling feature for unlabeled datasets. It uses image-text embeddings from the [CLIP (Contrastive Language-Image Pre-training)](https://github.com/openai/CLIP) model, applied zero-shot, to propose labels for unlabeled data. Hash values are computed for each dataset item and each predefined label, and every item is assigned the label whose hash is closest to its own.

### Key Features:
- Utilizes CLIP to compute hash representations of both dataset items and labels.
- Computes similarity between dataset items and a set of predefined labels to assign pseudo-labels.
- Facilitates pseudo-labeling in semi-supervised learning tasks, particularly useful for unlabeled or partially labeled datasets.
- Applies hashing techniques to efficiently compare embeddings.

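When labeled data is limited, this method supplies additional, automatically labeled examples by using the relationship between images and text to pick the best-fitting label. The pseudo-labels are proposals rather than ground truth, so their quality should be checked before they are used for training.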
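
To make the hash comparison concrete, here is a minimal sketch. It is not Datumaro's implementation: it assumes an embedding is turned into a hash by taking the sign of each component, and that hashes are compared by Hamming distance. The toy vectors stand in for real CLIP embeddings, and `binarize`, `hamming_distance`, and `nearest_label` are hypothetical helper names.

```python
from typing import Sequence


def binarize(embedding: Sequence[float]) -> int:
    """Pack the signs of an embedding into an integer hash (an assumed scheme)."""
    code = 0
    for value in embedding:
        code = (code << 1) | (1 if value >= 0 else 0)
    return code


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions where two hashes differ."""
    return bin(a ^ b).count("1")


def nearest_label(item_hash: int, label_hashes: Sequence[int]) -> int:
    """Return the index of the label hash closest to the item hash."""
    distances = [hamming_distance(item_hash, h) for h in label_hashes]
    return distances.index(min(distances))


# Toy embeddings standing in for CLIP outputs (values are made up).
label_embeddings = [
    [0.9, -0.2, 0.4, -0.7],  # label 0
    [-0.5, 0.8, -0.1, 0.3],  # label 1
]
item_embedding = [0.6, -0.4, 0.1, -0.2]

label_hashes = [binarize(e) for e in label_embeddings]
pseudo_label = nearest_label(binarize(item_embedding), label_hashes)
print(pseudo_label)
```

Comparing short binary codes is cheaper than comparing full floating-point embeddings, which is the efficiency argument made in the feature list.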

## Feature Overview

The `PseudoLabeling` class is designed to assign pseudo-labels to items in a dataset based on their similarity to a set of predefined labels. It does this by extracting hash keys for both the dataset items and the labels using a CLIP-based Explorer object, and then calculating the similarity between these hashes.

### Key Methods and Attributes:

- **`__init__(self, extractor, labels=None, explorer=None)`**: Initializes the pseudo-labeling system. Takes an extractor (which provides dataset access), an optional list of labels, and an optional `Explorer` object for hashing. If labels are not provided, all available labels in the dataset are used.

- **`transform_item(self, item)`**: Transforms a single dataset item by computing its hash key, comparing it to the label hashes, and assigning the most similar label as a pseudo-label.

- **Attributes**:
    - **`extractor`**: Provides access to dataset items and annotations.
    - **`labels`**: Optional list of predefined label names.
    - **`explorer`**: Optional CLIP-based Explorer object to compute hash keys for items and labels.
    - **`_label_hashkeys`**: Stores hash keys for predefined labels, computed during initialization.

This feature is especially useful for semi-supervised learning tasks where some items are unlabeled, and it can be easily integrated into workflows that involve large-scale datasets.
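
The structure described above can be illustrated with a toy stand-in. This is a sketch under stated assumptions, not the actual `PseudoLabeling` class: `ToyExplorer` and `ToyPseudoLabeler` are hypothetical, the hashes are hard-coded integers instead of CLIP outputs, and items are represented by plain string keys.

```python
from typing import Dict, List, Sequence


class ToyExplorer:
    """Hypothetical stand-in for a CLIP-based explorer: maps a key to an integer hash."""

    def __init__(self, table: Dict[str, int]):
        self._table = table

    def hash_of(self, key: str) -> int:
        return self._table[key]


class ToyPseudoLabeler:
    """Mirrors the shape described above; not Datumaro's implementation."""

    def __init__(self, labels: Sequence[str], explorer: ToyExplorer):
        self.labels = list(labels)
        self.explorer = explorer
        # Label hashes are computed once, at construction time.
        self._label_hashkeys: List[int] = [explorer.hash_of(name) for name in self.labels]

    def transform_item(self, item_key: str) -> str:
        # Hash the item, then pick the label with the smallest Hamming distance.
        item_hash = self.explorer.hash_of(item_key)
        distances = [bin(item_hash ^ h).count("1") for h in self._label_hashkeys]
        return self.labels[distances.index(min(distances))]


# Made-up 4-bit hashes for two labels and one item.
explorer = ToyExplorer({"cat": 0b1100, "ship": 0b0011, "img_001": 0b1101})
labeler = ToyPseudoLabeler(["cat", "ship"], explorer)
assigned = labeler.transform_item("img_001")
print(assigned)
```

Computing the label hashes once in the constructor means each call to `transform_item` only has to hash the item itself.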


## Prerequisites

Before applying the pseudo-labeling feature, we need a suitable dataset to work with. This notebook uses the **CIFAR-10** dataset, which consists of 60,000 images across 10 classes. CIFAR-10 is a commonly used image classification benchmark, and because it ships with ground truth labels, it lets us measure how accurate the pseudo-labels are.

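You can download and extract CIFAR-10 with the following commands. The leading `!` runs them as shell commands from a notebook cell; drop it when working in a regular terminal.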

```bash
# Download CIFAR-10 dataset
!curl -o cifar-10-python.tar.gz https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
# Extract the dataset
!tar -xvzf cifar-10-python.tar.gz
```
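
If you prefer to extract the archive from Python, the sketch below uses only the standard library. `extract_archive` is a hypothetical helper, not part of Datumaro; it returns the top-level entry names so you can see which directory the archive unpacked into. Only use it on archives you trust.

```python
import tarfile
from pathlib import Path
from typing import List


def extract_archive(archive_path: str, dest_dir: str) -> List[str]:
    """Extract a .tar.gz archive and return the sorted top-level entry names."""
    with tarfile.open(archive_path, "r:gz") as archive:
        top_level = sorted(
            {
                Path(member.name).parts[0]
                for member in archive.getmembers()
                if Path(member.name).parts  # skip entries such as "."
            }
        )
        archive.extractall(dest_dir)
    return top_level


# Usage, after the download above:
# print(extract_archive("cifar-10-python.tar.gz", "."))
```

Printing the returned names is a quick way to confirm which directory to pass to `Dataset.import_from()`.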


## Basic Usage

Now that we have the dataset ready, let's walk through a basic example of how to use the `PseudoLabeling` feature to assign pseudo-labels to unlabeled items in the dataset.

### Steps:
1. Load the CIFAR-10 dataset using `Dataset.import_from()`.
2. Define a list of 10 labels to use for pseudo-labeling.
3. Apply the `pseudo_labeling` transformation to the dataset.
4. View the transformed dataset and check the assigned pseudo-labels.


In [1]:
# Copyright (C) 2024 Intel Corporation
#
# SPDX-License-Identifier: MIT

from copy import deepcopy
from datumaro.components.dataset import Dataset

# Assuming the CIFAR-10 archive has been downloaded and extracted.
# The archive unpacks into `cifar-10-batches-py`; adjust the path if your copy lives elsewhere.
dataset = Dataset.import_from("cifar-10-batches-py", format="cifar")

# Define the label list (CIFAR-10 classes)
label_list = [
    "airplane",
    "automobile",
    "bird",
    "cat",
    "deer",
    "dog",
    "frog",
    "horse",
    "ship",
    "truck",
]


def get_ground_truth_label(item):
    # Return the ground truth label index, assuming it is stored
    # as the item's first annotation
    return item.annotations[0].label


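# Keep an untouched copy so the ground truth labels stay available after the transform
original = deepcopy(dataset)
# Apply the pseudo-labeling transform with the predefined labels
result = dataset.transform("pseudo_labeling", labels=label_list)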

# Compare the pseudo-label and ground truth label
for item in result.take(5):
    # Get the pseudo-label from transformed dataset
    pseudo_label = item.annotations[0].label

    # Get the ground truth label from the untouched copy (look up by ID and subset)
    original_item = original.get(item.id, item.subset)
    ground_truth_label = get_ground_truth_label(original_item)

    print(f"Item ID: {item.id}")
    print(f"Ground Truth Label: {label_list[ground_truth_label]}")
    print(f"Pseudo-label: {label_list[pseudo_label]}")
    print("=" * 30)


### Explanation:

1. **Load the dataset**: We use `Dataset.import_from()` to load CIFAR-10 in the appropriate format.
2. **Define labels**: A predefined list of 10 labels is passed, matching the CIFAR-10 classes.
3. **Apply transformation**: The `pseudo_labeling` transformation is applied, generating pseudo-labels based on the similarity between the images and the label embeddings.
4. **View results**: We print the first five items from the resulting dataset, showing each item's ground truth label next to its newly assigned pseudo-label.

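This basic usage demonstrates the feature on CIFAR-10. Because CIFAR-10 is fully labeled, its existing labels serve as ground truth for checking the pseudo-labels; on unlabeled or partially labeled data, the same transformation supplies the missing labels.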

## Evaluation: Pseudo-Labeling Accuracy

After assigning pseudo-labels to the dataset, it’s important to evaluate how well these pseudo-labels match the actual ground truth labels. One simple way to do this is by calculating the accuracy of the pseudo-labels compared to the true labels.


In [None]:
# Function to evaluate the accuracy of pseudo-labeling
def evaluate_pseudo_labeling_accuracy(dataset, pseudo_labeled_dataset, label_list):
    total_items = 0
    correct_predictions = 0

    # Iterate through the original and pseudo-labeled datasets
    for item in pseudo_labeled_dataset:
        total_items += 1

        # Get pseudo-label
        pseudo_label = item.annotations[0].label

        # Get ground truth label (look up the same item by ID and subset)
        original_item = dataset.get(item.id, item.subset)
        ground_truth_label = original_item.annotations[0].label

        # Compare and count correct predictions
        if pseudo_label == ground_truth_label:
            correct_predictions += 1

    # Calculate accuracy (0.0 for an empty dataset, to avoid dividing by zero)
    accuracy = correct_predictions / total_items if total_items else 0.0
    return accuracy


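# Calculate accuracy against `original`, the copy made before the transform,
# so the ground truth labels are guaranteed to be the untouched ones
accuracy = evaluate_pseudo_labeling_accuracy(original, result, label_list)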
print(f"Pseudo-Labeling Accuracy: {accuracy * 100:.2f}%")
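
Overall accuracy can hide classes that are labeled poorly. As a complement, the sketch below breaks accuracy down per class. It works on plain `(ground_truth, pseudo_label)` index pairs rather than on Datumaro objects; `per_class_accuracy` is a hypothetical helper and the example pairs are made up.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple


def per_class_accuracy(
    pairs: Iterable[Tuple[int, int]], label_names: Sequence[str]
) -> Dict[str, float]:
    """Map each class name to the fraction of its items whose pseudo-label matches."""
    totals: Counter = Counter()
    correct: Counter = Counter()
    for truth, predicted in pairs:
        totals[truth] += 1
        if truth == predicted:
            correct[truth] += 1
    # Only classes that actually occur as ground truth are reported.
    return {label_names[idx]: correct[idx] / totals[idx] for idx in sorted(totals)}


# Made-up (ground truth, pseudo-label) index pairs for illustration only.
example_pairs = [(0, 0), (0, 1), (1, 1), (1, 1)]
breakdown = per_class_accuracy(example_pairs, ["airplane", "automobile"])
print(breakdown)
```

To use it with the datasets above, collect one pair per item while iterating, in the same way `evaluate_pseudo_labeling_accuracy` reads the two labels.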

## Next Steps

- **Fine-tuning pseudo-labeling**: Depending on the results, you can adjust the pseudo-labeling process, such as changing the label list, experimenting with different datasets, or modifying the hashing method.
- **Advanced applications**: Try applying this pseudo-labeling feature to a semi-supervised learning pipeline where some labels are available, but others need to be generated.
- **Custom Explorers**: Experiment with custom `Explorer` implementations that use different techniques for hashing or similarity calculation to see how they impact pseudo-labeling performance.
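
For the semi-supervised case mentioned above, one simple policy is to keep every existing label and use a pseudo-label only where none is available. The sketch below shows that policy on plain dictionaries; `fill_missing_labels` is a hypothetical helper, and the item IDs and label indices are made up.

```python
from typing import Dict, Optional


def fill_missing_labels(
    known: Dict[str, Optional[int]], pseudo: Dict[str, int]
) -> Dict[str, int]:
    """Keep existing labels and use pseudo-labels only where the label is None."""
    merged = {}
    for item_id, label in known.items():
        merged[item_id] = label if label is not None else pseudo[item_id]
    return merged


# "img_b" has no label, so it is the only item that receives a pseudo-label.
known_labels = {"img_a": 3, "img_b": None, "img_c": 7}
pseudo_labels = {"img_a": 5, "img_b": 2, "img_c": 7}
merged_labels = fill_missing_labels(known_labels, pseudo_labels)
print(merged_labels)
```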
