# Dataset Report: CIFAR10 Example

Here is an example of how to run the [DatasetReport](https://betterwidhdata.github.io/deepview/api/deepview/introspectors.html#deepview.introspectors.DatasetReport) on the CIFAR-10 dataset. This notebook demonstrates how to build the data for the report using DeepView, and how to visualize the report output with the Canvas UI framework.

For a deeper understanding of the Dataset Report, please see the [doc page](https://betterwidhdata.github.io/deepview/introspectors/data_introspection/dataset_report.html).

Before proceeding, please review the "How to Use" section in the docs, starting with the [How to Load A Model](https://betterwidhdata.github.io/deepview/how_to/connect_model.html).

## Dataset Report: (1) Setup

Here, group together required imports and set desired paths.

In [None]:
import logging
logging.basicConfig(level=logging.INFO)
logging.getLogger("numba").setLevel(logging.INFO)

In [None]:
from watermark import watermark
print(watermark(packages="deepview,deepview_tensorflow,deepview_data,canvas_ux,canvas_summary,canvas_list,canvas_scatterplot,canvas_duplicates,canvas_familiarity"))

In [None]:
from dataclasses import dataclass
import os
from pathlib import Path
import typing as t

# OpenCV will be used to interact with the CIFAR10 dataset, since it contains images.
import cv2
import numpy as np

# This notebook pieces together a pipeline to run a dataset through a model, with pre- and post- processing,
#    and then feeds it into the Dataset Report for full analysis
from deepview.introspectors import DatasetReport, ReportConfig
from deepview.base import Batch, Producer, pipeline, ImageFormat
from deepview.processors import Cacher, FieldRenamer, ImageResizer, Pooler, Processor
from deepview_tensorflow import load_tf_model_from_path

# For future protection, any deprecated DeepView features will be treated as errors
from deepview.exceptions import enable_deprecation_warnings
enable_deprecation_warnings()

# Use a pre-trained Keras MobileNet model to analyze the CIFAR10 dataset
import keras
from keras.applications.mobilenet_v2 import preprocess_input as mobilenet_preprocessing
from keras.datasets import cifar10

In [None]:
data_path = "./cifar/"

### Dataset Report: (1) Setup - Download Model

Download a [MobileNet](https://keras.io/api/applications/mobilenet/) model from Keras that has been pre-trained on the ImageNet dataset, which is similar to the [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html) dataset used here. [TFModelExamples](https://apple.github.io/deepview/api/deepview_tensorflow/index.html#deepview_tensorflow.TFModelExamples) is used to load MobileNet, but any model can be loaded, as described [here](https://apple.github.io/deepview/how_to/connect_model.html).

In [None]:
from deepview_tensorflow import TFModelExamples

mobilenet = TFModelExamples.MobileNet()
mobilenet_preprocessor = mobilenet.preprocessing
assert mobilenet_preprocessor is not None

## Dataset Report: (2) DeepView Producer

This is the chunk of the work that will change for new datasets. This step teaches DeepView how to load data as batches, by creating a custom [Producer](https://apple.github.io/deepview/api/deepview/base.html#deepview.base.Producer) for the CIFAR-10 dataset.

[TFDatasetExamples.CIFAR10](https://apple.github.io/deepview/api/deepview_tensorflow/index.html#deepview_tensorflow.TFDatasetExamples.CIFAR10) could be used to instantiate the data producer, but for this example, the full extent of creating a custom DeepView [Producer](https://apple.github.io/deepview/api/deepview/base.html#deepview.base.Producer) is shown.

DeepView operates on datasets in batches, so that it can handle large-scale datasets without loading everything into memory at once. For each batch, metadata can be attached to provide a more thorough exploration in the report. [Batch.StdKeys](https://apple.github.io/deepview/api/deepview/base.html#deepview.base.Batch.StdKeys) metadata keys will be used to attach:

- Identifier: unique identifier for each data sample (in this case, the path to the file)
- Label:
    - Class label: airplane, automobile, etc.
    - Dataset label: train vs. test
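
The batching pattern described above can be sketched in plain Python, independent of DeepView. This is only an illustration under stated assumptions: `iter_batches` and the dictionary keys `"identifier"` and `"labels"` are hypothetical stand-ins for a Producer and for `Batch.StdKeys`, not DeepView API.

```python
from typing import Any, Dict, Iterator, Sequence


def iter_batches(
    samples: Sequence[Any],
    identifiers: Sequence[str],
    class_labels: Sequence[str],
    dataset_labels: Sequence[str],
    batch_size: int,
    max_data: int = -1,
) -> Iterator[Dict[str, Any]]:
    """Yield slices of the data with per-sample metadata attached.

    Hypothetical stand-in for a DeepView Producer: only one batch is
    built at a time, so the caller never holds every batch at once.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # A non-positive max_data means "use everything"
    limit = len(samples) if max_data <= 0 else min(max_data, len(samples))
    for start in range(0, limit, batch_size):
        stop = min(start + batch_size, limit)
        yield {
            "images": list(samples[start:stop]),
            "identifier": list(identifiers[start:stop]),
            "labels": {
                "class": list(class_labels[start:stop]),
                "dataset": list(dataset_labels[start:stop]),
            },
        }


# Five toy samples with batch size 2 give batches of 2, 2 and 1 samples
toy = ["img0", "img1", "img2", "img3", "img4"]
batches = list(
    iter_batches(
        samples=toy,
        identifiers=[f"train/cat/image{i}.png" for i in range(5)],
        class_labels=["cat"] * 5,
        dataset_labels=["train"] * 3 + ["test"] * 2,
        batch_size=2,
    )
)
batch_sizes = [len(b["images"]) for b in batches]
```

The last batch is allowed to be smaller than `batch_size`, which mirrors the `min(ii+batch_size, self.max_data)` bound used in the `Cifar10Producer` below.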

**Note**: The following `Cifar10Producer` is more complicated than the typical custom [Producer](https://apple.github.io/deepview/api/deepview/base.html#deepview.base.Producer). The data is spread across a couple of different NumPy arrays, but in order to have the option of exporting the report and sharing the zip of files, the raw data must be written out as image files. If the files were already on disk, the Producer would be simpler.

DeepView has a couple of built-in Producers: [ImageProducer](https://apple.github.io/deepview/api/deepview/base.html#deepview.base.ImageProducer), which loads directly from files, and [TorchProducer](https://apple.github.io/deepview/api/torch/index.html#deepview_torch.TorchProducer), which is created directly from a PyTorch dataset.

Follow the comments in the following `Cifar10Producer` code block to see how to create a custom [Producer](https://apple.github.io/deepview/api/deepview/base.html#deepview.base.Producer) for a dataset, and refer to the doc on [loading data](https://apple.github.io/deepview/how_to/connect_data.html) for more information.

In [None]:
# First, download data
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
class_to_name = ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']

# Concatenate the train and test into one array, as well as the train/test labels, and the class labels
full_dataset = np.concatenate((x_train, x_test))
dataset_labels = ['train']*len(x_train) + ['test']*len(x_test)
class_labels = np.squeeze(np.concatenate((y_train, y_test)))

In [None]:
from typing import List

@dataclass
class Cifar10Producer(Producer):
    dataset: np.ndarray
    """All raw image data to load into the Dataset Report"""

    dataset_labels: t.Sequence[str]
    """Labels to distinguish what dataset the data came from (e.g. train / test)"""

    class_labels: np.ndarray
    """Integer class indices, mapped to names through `class_to_name`"""

    data_path: str
    """Where data will be written to be packaged up with Dataset Report"""

    max_data: int = -1
    """Max data samples to pull from. This is helpful for local debugging."""

    def __post_init__(self) -> None:
        if self.max_data <=0:
            self.max_data = len(self.dataset)

    def _class_path(self, index: int) -> str:
        return f"{self.dataset_labels[index]}/{class_to_name[int(self.class_labels[index])]}"

    def _write_images_to_disk(self, ii: int, jj: int) -> List[str]:
        file_paths = []
        for idx in range(ii, jj):
            base_path = os.path.join(self.data_path, self._class_path(idx))
            Path(base_path).mkdir(exist_ok=True, parents=True)
            filename = os.path.join(base_path, f"image{idx}.png")
            # Write to disk after converting to BGR format, used by opencv
            cv2.imwrite(filename, cv2.cvtColor(self.dataset[idx, ...], cv2.COLOR_RGB2BGR))
            file_paths.append(filename)
        
        return file_paths
        
    def __call__(self, batch_size: int) -> t.Iterable[Batch]:
        """The important function... yield a batch of data from the downloaded dataset"""
        logger = logging.getLogger(__name__)
        # Iteratively loop over the data samples and yield it in batches
        for ii in range(0, self.max_data, batch_size):
            jj = min(ii+batch_size, self.max_data)

            # Create batch from data already in memory
            builder = Batch.Builder(
                fields={"images": self.dataset[ii:jj, ...]}
            )

            # Write this batch's images to disk and use each file path as the sample identifier
            builder.metadata[Batch.StdKeys.IDENTIFIER] = self._write_images_to_disk(ii, jj)
            
            # Add class and dataset labels
            builder.metadata[Batch.StdKeys.LABELS] = {
                "class": [class_to_name[int(lbl_idx)] for lbl_idx in self.class_labels[ii:jj]],
                "dataset": self.dataset_labels[ii:jj]
            }

            batch = builder.make_batch()

            yield batch

In [None]:
# Now instantiate the producer from the loaded CIFAR-10 data
cifar10_producer = Cifar10Producer(
    dataset=full_dataset,
    dataset_labels=dataset_labels,
    class_labels=class_labels,
    data_path=data_path,

    # "max_data" limits how many samples are used; a value <= 0 means the whole dataset.
    #    Set it to a small positive number to run this notebook quickly.
    max_data=-1
)

## Dataset Report: (3) Model Inference w/ Pre + Post Processing

First, load the saved TF Keras model into DeepView using [load_tf_model_from_path](https://apple.github.io/deepview/api/tensorflow/index.html#deepview_tensorflow.load_tf_model_from_path). (This notebook uses the MobileNet model already loaded through `TFModelExamples` above.)

Then, apply pre and post processing steps around model inference. This consists of the following steps:

- MobileNet preprocessing: Keras has its own preprocessing for MobileNet; this function is turned into a DeepView [Processor](https://apple.github.io/deepview/api/deepview/processors.html#deepview.processors.Processor) so it can be chained together with other pre / post processing stages
- resize images to fit the input of MobileNet, (224, 224) using an [ImageResizer](https://apple.github.io/deepview/api/deepview/processors.html#deepview.processors.ImageResizer)
- rename the data, which has been stored under "images", to match the input layer of MobileNet. To learn how to read input and output layers from a loaded DeepView model, please read through the [Dataset Errors and Rare Samples example notebook](familiarity_for_rare_data_discovery.ipynb).
- run inference and extract intermediate embeddings (this time, just `conv_pw_13`; other layers can be added, e.g. those found when inspecting the `deepview_model`). Again, please see the [Dataset Errors and Rare Samples example notebook](familiarity_for_rare_data_discovery.ipynb) for more of a guide on this piece.
- max pool the responses before DeepView processing using a DeepView [Pooler](https://apple.github.io/deepview/api/deepview/processors.html#deepview.processors.Pooler)
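
To make the final pooling step concrete, here is a minimal pure-Python sketch of a spatial max pool. It assumes the `conv_pw_13` response is laid out as (N, H, W, C) and that `dim=(1, 2)` refers to the two spatial axes; `spatial_max_pool` is a hypothetical helper written for illustration, not the DeepView `Pooler` implementation.

```python
from typing import List

# One response is a nested list with shape (height, width, channels)
Response = List[List[List[float]]]


def spatial_max_pool(batch: List[Response]) -> List[List[float]]:
    """Reduce (N, H, W, C) responses to (N, C) by taking the max over H and W."""
    pooled = []
    for response in batch:
        num_channels = len(response[0][0])
        pooled.append([
            max(pixel[c] for row in response for pixel in row)
            for c in range(num_channels)
        ])
    return pooled


# One sample, a 2x2 spatial grid, 3 channels
toy_batch = [[
    [[1.0, 5.0, 0.0], [2.0, 4.0, 9.0]],
    [[3.0, 0.5, 1.0], [0.0, 6.0, 2.0]],
]]
pooled = spatial_max_pool(toy_batch)  # one 3-element vector per sample
```

With a NumPy array the same reduction is `responses.max(axis=(1, 2))`. Either way, each sample ends up as a single vector with one value per channel, which is a much smaller input for the later analysis than the full spatial response.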

In [None]:
# Chain together all operations around running the data through the model
model_stages = (
    mobilenet_preprocessor,
    
    ImageResizer(pixel_format=ImageFormat.HWC, size=(224, 224)),
    
    # Run inference with MobileNet and extract intermediate embeddings
    # (this time, just `conv_pw_13`, but other layers can be added)
    # :: Note: This auto-detects the input layer and connects up 'images' to it:
    mobilenet.model(requested_responses=['conv_pw_13']),
    
    Pooler(dim=(1, 2), method=Pooler.Method.MAX)
)

## Dataset Report: (4) Chain together pipeline stages

Create the DeepView [pipeline](https://apple.github.io/deepview/api/deepview/base.html#deepview.base.pipeline) from the CIFAR-10 Producer that was written earlier, followed by the unpacked tuple of model-related [PipelineStages](https://apple.github.io/deepview/api/deepview/base.html#deepview.base.PipelineStage) defined in the prior cell.

No processing is done at this point; the stages only run when the Dataset Report "introspects".
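
This deferred execution can be illustrated with ordinary Python generators. The sketch below is an analogy only: `toy_producer`, `toy_stage` and `toy_pipeline` are hypothetical names, and nothing here claims to reflect how DeepView implements `pipeline` internally.

```python
from typing import Callable, Iterable, Iterator, List

calls: List[str] = []  # records when each stage actually runs


def toy_producer(n: int) -> Iterator[int]:
    for i in range(n):
        calls.append(f"produce {i}")
        yield i


def toy_stage(
    name: str, fn: Callable[[int], int]
) -> Callable[[Iterable[int]], Iterator[int]]:
    def run(items: Iterable[int]) -> Iterator[int]:
        for item in items:
            calls.append(f"{name} {item}")
            yield fn(item)
    return run


def toy_pipeline(source: Iterable[int], *stages) -> Iterable[int]:
    """Chain stages lazily; nothing runs until the result is iterated."""
    for stage in stages:
        source = stage(source)
    return source


chained = toy_pipeline(
    toy_producer(2),
    toy_stage("double", lambda x: x * 2),
    toy_stage("increment", lambda x: x + 1),
)
calls_before = list(calls)  # still empty: building the pipeline did no work
results = list(chained)     # iterating pulls each item through every stage
```

Each item travels through all stages before the next one is produced, so only one item (or, in DeepView's case, one batch) is in flight at a time.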


In [None]:
# Finally put it all together!
producer = pipeline(
    # Original data producer that will yield batches
    cifar10_producer,

    # unwrap the tuple of pipeline stages that contain model inference, and pre/post-processing
    *model_stages,

    # Cache responses to play around with data in future cells
    Cacher()
)

## Dataset Report: (5) Run Dataset Report. Introspect!

All compute is performed in this step by pulling batches through the entire pipeline.

[DatasetReport](https://betterwidhdata.github.io/deepview/api/deepview/introspectors.html#deepview.DatasetReport) [introspect](https://betterwidhdata.github.io/deepview/api/deepview/introspectors.html#deepview.introspectors.DatasetReport.introspect) takes the data through the pipeline and gives us a report object. This object contains `data`, which is a pandas dataframe table with metadata about each data sample like familiarity, duplicates, overall summary, and projection that can be passed to Canvas for visualization.

A [ReportConfig](https://betterwidhdata.github.io/deepview/api/deepview/introspectors.html#deepview.introspectors.ReportConfig) is defined in the next cell that skips the projection and duplicates components, which speeds up the report. It is commented out in the call to `DatasetReport.introspect`, so all components run by default; uncomment the `config` argument to use the custom config instead.

Displayed in the next cell are the first five rows of the resulting `report.data` that is interpretable by the Canvas UI.

In [None]:
%%time
# This is the most time-consuming step, since all compute is done here
# Data is passed through DeepView in batches to produce the backend data table that will be displayed by Canvas

custom_config = ReportConfig(
    projection=None,
    duplicates=None
)


report = DatasetReport.introspect(
    producer,
    # config=custom_config  # uncomment to run with the custom config
)

In [None]:
report.data.head()

In [None]:
# Save the report for future use.
# It can be loaded back in later with:

# import pandas as pd
# df = pd.read_pickle("./cifar_report/report_save_data.pkl")
# import canvas_ux
# canva = canvas_ux.Canvas(df)

report.save("cifar_report")

## Dataset Report: (6) Visualization

Remember that this example operates on the full dataset. To run on a smaller subset, change:
- the `max_data` param of `Cifar10Producer`, to limit the number of samples
- the `config` param of `DatasetReport.introspect`, to build only the required components of the Dataset Report

To visualize the results, the resulting `report` can be fed into the Canvas UI framework.

Let's use Canvas to explore this dataset in a Jupyter notebook.

In [None]:
import canvas_ux

canva = canvas_ux.Canvas(report.data)

To use the different Canvas widgets, you can import them independently. Let's first look at the Summary widget to see the overall distributions of our dataset.

In [None]:
from canvas_summary import CanvasSummary

canva.widget(CanvasSummary)

Instead of a summary, if we want to browse through the data we can use the List widget.

In [None]:
from canvas_list import CanvasList

canva.widget(CanvasList)

It's common to use dimensionality reduction techniques to summarize and find patterns in ML datasets. DeepView already ran a reduction and saved it when running the Dataset Report. We can use the Scatterplot widget to visualize this embedding.

In [None]:
from canvas_scatterplot import CanvasScatterplot

canva.widget(CanvasScatterplot)

Some datasets can contain duplicates: data instances that are the same or very similar to others. These can be hard to find, and become especially problematic if the same data instance is in both the training and testing splits. We can look for them using the Duplicates widget.

Hint: Take a look at the `automobile` class, where there are duplicates across train and test data!
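
As a rough illustration of why cross-split duplicates matter, the following sketch flags byte-identical samples that appear in more than one split. This is a deliberately simpler, swapped-in technique: it only catches exact copies, whereas the Duplicates component is described above as also covering very similar samples. The helper name `cross_split_duplicates` is hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def cross_split_duplicates(
    samples: Sequence[bytes],
    dataset_labels: Sequence[str],
) -> List[Tuple[int, ...]]:
    """Return index groups with identical bytes that span more than one split."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for index, raw in enumerate(samples):
        groups[hashlib.sha256(raw).hexdigest()].append(index)

    leaked = []
    for indices in groups.values():
        splits = {dataset_labels[i] for i in indices}
        if len(indices) > 1 and len(splits) > 1:
            leaked.append(tuple(indices))
    return leaked


# Samples 0 and 2 are identical and sit in different splits;
# samples 1 and 3 are identical but both belong to "train".
toy_pixels = [b"\x00\x01\x02", b"\x10\x11\x12", b"\x00\x01\x02", b"\x10\x11\x12"]
toy_splits = ["train", "train", "test", "train"]
leaks = cross_split_duplicates(toy_pixels, toy_splits)
```

For the CIFAR-10 arrays used in this notebook, `full_dataset[i].tobytes()` would supply the raw bytes for sample `i`.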

In [None]:
from canvas_duplicates import CanvasDuplicates

canva.widget(CanvasDuplicates)

Lastly, we can use advanced ML metrics and the Familiarity widget to find the most and least representative data instances from a given dataset, which can help identify model biases and annotation errors.

In [None]:
from canvas_familiarity import CanvasFamiliarity

canva.widget(CanvasFamiliarity)

## Visualization as a Standalone Export

The report can also be exported as a standalone static site to be shared with others or hosted. To explore this example in a web browser, export the report to a local folder.


In [None]:
canva.export('./canvas_report', name="Canvas CIFAR10 Visualization")

You can now serve the dataset report. For example, from the `canvas_report` folder, run a simple server from the command line:

```bash
python -m http.server
```

And navigate to http://localhost:8000/.