# Dataset Report: Custom Dataset

Here is an example of how to run the [DatasetReport](https://satishlokkoju.github.io/deepview/api/deepview/introspectors.html#deepview.introspectors.DatasetReport) on a custom dataset. This notebook demonstrates how to build the data for the report using DeepView and how to visualize the report output with the Canvas UI framework.

For a deeper understanding of the Dataset Report, please see the [doc page](https://satishlokkoju.github.io/deepview/introspectors/data_introspection/dataset_report.html).

Before proceeding, please review the "How to Use" section in the docs, starting with the [How to Load A Model](https://satishlokkoju.github.io/deepview/how_to/connect_model.html).

## Dataset Report: (1) Setup

Here, group together required imports and set desired paths.

In [None]:
import logging
logging.basicConfig(level=logging.INFO)
logging.getLogger("numba").setLevel(logging.INFO)

In [None]:
from watermark import watermark
print(watermark(packages="deepview,deepview_tensorflow,deepview_data,canvas_ux,canvas_summary,canvas_list,canvas_scatterplot,canvas_duplicates,canvas_familiarity"))

In [None]:


# This notebook pieces together a pipeline to run a dataset through a model, with pre- and post- processing,
#    and then feeds it into the Dataset Report for full analysis
from deepview.introspectors import DatasetReport, ReportConfig
from deepview.base import Batch, Producer, pipeline, ImageFormat
from deepview.processors import Cacher, FieldRenamer, ImageResizer, Pooler, Processor
from deepview_tensorflow import load_tf_model_from_path

# For future protection, any deprecated DeepView features will be treated as errors
from deepview.exceptions import enable_deprecation_warnings
enable_deprecation_warnings()


### Dataset Report: (1) Setup - Download Model

Download a [MobileNet](https://keras.io/api/applications/mobilenet/) model from Keras that has been pre-trained on the ImageNet dataset. It is used here to extract embeddings for the clothing images. [TFModelExamples](https://satishlokkoju.github.io/deepview/api/deepview_tensorflow/index.html#deepview_tensorflow.TFModelExamples) is used to load MobileNet, but any model can be loaded, as described [here](https://satishlokkoju.github.io/deepview/how_to/connect_model.html).

In [None]:
from deepview_tensorflow import TFModelExamples

mobilenet = TFModelExamples.MobileNet()
mobilenet_preprocessor = mobilenet.preprocessing
assert mobilenet_preprocessor is not None

## Dataset Report: (2) DeepView Producer

To use the Kaggle API, you need to:

* Create a Kaggle account if you haven't already
* Generate an API token
* Download the kaggle.json file
* Place the file in the correct directory

Here are the step-by-step instructions:

1. Go to your Kaggle account settings: https://www.kaggle.com/account
2. Scroll down to the "API" section
3. Click on "Create New API Token"
4. This will download a kaggle.json file
5. Place this file at `~/.kaggle/kaggle.json`, which is where the Kaggle API client looks for it by default.
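
The last step can also be scripted. The following is a minimal sketch, assuming the token has already been downloaded; the `install_kaggle_token` helper and the example download location are hypothetical and not part of the Kaggle API.

```python
import shutil
from pathlib import Path


def install_kaggle_token(downloaded: Path, kaggle_dir: Path = Path.home() / ".kaggle") -> Path:
    """Copy a downloaded kaggle.json into kaggle_dir and restrict its permissions."""
    kaggle_dir.mkdir(parents=True, exist_ok=True)
    target = kaggle_dir / "kaggle.json"
    shutil.copy2(downloaded, target)
    # Owner read/write only; the Kaggle client warns when the token is readable by others
    target.chmod(0o600)
    return target


# Example (hypothetical download location):
# install_kaggle_token(Path.home() / "Downloads" / "kaggle.json")
```

Restricting the file to mode `600` keeps the API key private on shared machines.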

In [None]:
import os
from pathlib import Path
from kaggle.api.kaggle_api_extended import KaggleApi

def download_kaggle_dataset(dataset_id: str, output_dir: str = "data") -> None:
    """
    Download a dataset from Kaggle.
    
    Parameters
    ----------
    dataset_id : str
        Kaggle dataset identifier (e.g., 'username/dataset-name')
    output_dir : str, optional
        Directory where the dataset should be downloaded (default: 'data')
    """
    # Create output directory if it doesn't exist
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    
    # Initialize the Kaggle API
    api = KaggleApi()
    api.authenticate()
    
    # Download the dataset
    print(f"Downloading dataset {dataset_id} to {output_dir}...")
    api.dataset_download_files(
        dataset_id,
        path=output_dir,
        unzip=True
    )
    print("Download complete!")


In [None]:
# download_kaggle_dataset(dataset_id="agrigorev/clothing-dataset-full", output_dir="data/clothing-dataset")

In [None]:
import pandas as pd
import os
import shutil
from pathlib import Path

def organize_images(dataset_dir, output_dir):
    # Read the CSV file
    df = pd.read_csv(os.path.join(dataset_dir, 'images.csv'))
    # Create the base output directory if it doesn't exist
    base_output_dir = os.path.join(dataset_dir, output_dir)
    os.makedirs(base_output_dir, exist_ok=True)
    source_dir = os.path.join(dataset_dir, 'images_original')
    # Get unique labels
    unique_labels = df['label'].unique()

    # Create subdirectories for each label
    for label in unique_labels:
        label_dir = os.path.join(base_output_dir, label)
        os.makedirs(label_dir, exist_ok=True)

    # Copy files to their respective directories (the originals are left in place)
    for index, row in df.iterrows():
        image_filename = f"{row['image']}.jpg"
        source_path = os.path.join(source_dir, image_filename)
        destination_path = os.path.join(base_output_dir, row['label'], image_filename)

        try:
            if os.path.exists(source_path):
                shutil.copy2(source_path, destination_path)
            else:
                print(f"Warning: Source file not found: {source_path}")
        except Exception as e:
            print(f"Error copying {image_filename}: {str(e)}")

    print("\nOrganization complete!")
    print(f"Images have been organized into subfolders in the '{base_output_dir}' directory")
    return base_output_dir 

In [None]:
dataset_path = organize_images("./data/clothing-dataset", 'images_by_label')

In [None]:
'''
    Example directory structure:
    root_folder/
        class1/
            image1.jpg
            image2.jpg
        class2/
            image3.jpg
            image4.jpg

'''
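
Before building the producer, it can help to confirm that the folder really follows this layout. The sketch below counts the image files found in each class subfolder; the `count_images_per_class` helper and the set of accepted extensions are assumptions made for illustration, not part of DeepView.

```python
from collections import Counter
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumption: extensions worth counting


def count_images_per_class(root_folder: str) -> Counter:
    """Count image files in each immediate subfolder (one subfolder per class)."""
    counts = Counter()
    for class_dir in sorted(Path(root_folder).iterdir()):
        if not class_dir.is_dir():
            continue  # ignore stray files next to the class folders
        counts[class_dir.name] = sum(
            1 for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts


# Example: print(count_images_per_class(dataset_path))
```

A class with a count of zero usually points to a label in `images.csv` whose image files were not found during copying.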

In [None]:
from deepview_data import CustomDatasets
dataset_producer = CustomDatasets.ImageFolderDataset(
    root_folder=dataset_path,
    image_size=(224, 224),
    max_samples=100000,
)

## Dataset Report: (3) Model Inference w/ Pre + Post Processing

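The cell below uses the MobileNet model loaded earlier through `TFModelExamples`. To use a saved TF Keras model of your own instead, load it into DeepView using [load_tf_model_from_path](https://satishlokkoju.github.io/deepview/api/tensorflow/index.html#deepview_tensorflow.load_tf_model_from_path).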

Then, apply pre and post processing steps around model inference. This consists of the following steps:

- mobilenet preprocessing: Keras has its own preprocessing for MobileNet. This function is turned into a DeepView [Processor](https://satishlokkoju.github.io/deepview/api/deepview/processors.html#deepview.processors.Processor) so it can be chained together with other pre / post processing stages
- resize images to fit the input of MobileNet, (224, 224), using an [ImageResizer](https://satishlokkoju.github.io/deepview/api/deepview/processors.html#deepview.processors.ImageResizer)
- connect the data, which have been stored under "images", to the input layer of MobileNet. In the cell below the input layer is auto-detected, so no explicit rename stage appears. To learn about how to read input and output layers from a loaded deepview model, please read through the [Dataset Errors and Rare Samples example notebook](familiarity_for_rare_data_discovery.ipynb).
- run inference and extract intermediate embeddings (this time, just `conv_pw_13`; other layers can be added, e.g. those found when inspecting the `deepview_model`). Again, please see the [Dataset Errors and Rare Samples example notebook](familiarity_for_rare_data_discovery.ipynb) for more of a guide on this piece.
- max pool the responses before DeepView processing using a DeepView [Pooler](https://satishlokkoju.github.io/deepview/api/deepview/processors.html#deepview.processors.Pooler)
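
To make the final pooling step concrete, here is a dependency-free sketch of max pooling over the spatial dimensions. It assumes a single response laid out as (height, width, channels), which is what pooling over `dim=(1, 2)` would cover if the batch axis sits at index 0. The `max_pool_spatial` function is illustrative only and is not DeepView's `Pooler`.

```python
def max_pool_spatial(response):
    """Max-pool one (H, W, C) response, given as nested lists, down to C values."""
    channels = len(response[0][0])
    return [
        max(pixel[c] for row in response for pixel in row)
        for c in range(channels)
    ]


# A toy 2x2 response with 3 channels
toy_response = [
    [[1.0, 0.5, 0.0], [0.2, 0.9, 0.1]],
    [[0.3, 0.4, 0.8], [0.7, 0.6, 0.2]],
]
pooled = max_pool_spatial(toy_response)
print(pooled)  # [1.0, 0.9, 0.8]
```

Each sample is reduced to one value per channel, which keeps the embeddings small enough for the later report components to process.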

In [None]:
# Chain together all operations around running the data through the model
model_stages = (
    mobilenet_preprocessor,
    
    ImageResizer(pixel_format=ImageFormat.HWC, size=(224, 224)),
    
    # Run inference with MobileNet and extract intermediate embeddings
    # (this time, just `conv_pw_13`, but other layers can be added)
    # :: Note: This auto-detects the input layer and connects up 'images' to it:
    mobilenet.model(requested_responses=['conv_pw_13']),
    
    Pooler(dim=(1, 2), method=Pooler.Method.MAX)
)

## Dataset Report: (4) Chain together pipeline stages

Create the DeepView [pipeline](https://satishlokkoju.github.io/deepview/api/deepview/base.html#deepview.base.pipeline) from the image-folder Producer created earlier, followed by the unwrapped tuple of model-related [PipelineStages](https://satishlokkoju.github.io/deepview/api/deepview/base.html#deepview.base.PipelineStage) defined in the prior cell.

No processing is done at this point. The stages only run when the Dataset Report "introspects".
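
This deferred execution can be illustrated with plain Python generators. The sketch below is an analogy only and says nothing about how DeepView implements its pipeline; the `produce` and `double` stages are made up for the example.

```python
def produce(n, log):
    """Yield n fake samples, recording when each one is produced."""
    for i in range(n):
        log.append(f"produce {i}")
        yield i


def double(samples, log):
    """A stand-in processing stage that doubles each sample."""
    for s in samples:
        log.append(f"double {s}")
        yield s * 2


log = []
stages = double(produce(3, log), log)  # building the chain runs nothing
assert log == []

results = list(stages)  # consuming the chain triggers all the work
print(results)  # [0, 2, 4]
print(log[:2])  # ['produce 0', 'double 0']
```

The log shows that each sample flows through every stage only once something downstream asks for it.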


In [None]:
# Finally put it all together!
producer = pipeline(
    # Original data producer that will yield batches
    dataset_producer,

    # unwrap the tuple of pipeline stages that contain model inference, and pre/post-processing
    *model_stages,

    # Cache responses to play around with data in future cells
    Cacher()
)

## Dataset Report: (5) Run Dataset Report. Introspect!

All compute is performed in this step by pulling batches through the entire pipeline.

[DatasetReport](https://betterwidhdata.github.io/deepview/api/deepview/introspectors.html#deepview.DatasetReport) [introspect](https://betterwidhdata.github.io/deepview/api/deepview/introspectors.html#deepview.introspectors.DatasetReport.introspect) takes the data through the pipeline and returns a report object. This object contains `data`, a pandas DataFrame with metadata about each data sample (familiarity, duplicates, overall summary, and projection) that can be passed to Canvas for visualization.

A [ReportConfig](https://betterwidhdata.github.io/deepview/api/deepview/introspectors.html#deepview.introspectors.ReportConfig) is defined in the next cell that skips the projection and duplicates components, which shortens the run time of this example notebook. It is only applied if the `config` argument is passed to `DatasetReport.introspect`; with that argument commented out or omitted, all components run.

Displayed in the next cell are the first five rows of the resulting `report.data` that is interpretable by the Canvas UI.

In [None]:
%%time
# This is the most time-consuming step, since all compute is done here
# Data is passed through DeepView in batches to produce the backend data table that will be displayed by Canvas

custom_config = ReportConfig(
    projection=None,
    duplicates=None
)


report = DatasetReport.introspect(
    producer,
    # config=custom_config  # uncomment to run with the custom config
)

In [None]:
report.data.head()

## Dataset Report: (6) Visualization

To visualize the results, the resulting ``report`` can be fed into the Canvas UI framework. To save ``report`` to disk, run:

```
report.to_disk('./report_files/')
```
where ``./report_files`` can be any path. This Pandas DataFrame can then be loaded and fed into Canvas UI.

In [None]:
report.to_disk('./report_files_clothes/', overwrite=True)

## Canvas in Jupyter Notebooks

Let's use Canvas to explore this dataset in a Jupyter notebook.

To use Canvas, we'll import the main library and instantiate a Canvas object, passing it the pandas DataFrame from the report.

In [None]:
import canvas_ux

canva = canvas_ux.Canvas(report.data)

To use the different Canvas widgets, you can import them independently. Let's first look at the Summary widget to see the overall distributions of our dataset.

In [None]:
from canvas_summary import CanvasSummary

canva.widget(CanvasSummary)

Instead of a summary, if we want to browse through the data we can use the List widget.

In [None]:
from canvas_list import CanvasList

canva.widget(CanvasList)

It's common to use dimensionality reduction techniques to summarize and find patterns in ML datasets. DeepView already runs a reduction and saves it when running a Dataset Report. We can use the Scatterplot widget to visualize this embedding.

In [None]:
from canvas_scatterplot import CanvasScatterplot

canva.widget(CanvasScatterplot)

Some datasets can contain duplicates: data instances that are the same or very similar to others. These can be hard to find, and become especially problematic if the same data instance is in the training and testing splits. We can look for them using the Duplicates widget.

Hint: In the CIFAR-10 version of this report, the `automobile` class contains duplicates across train and test data. Look for similar clusters within each label of this dataset.

In [None]:
from canvas_duplicates import CanvasDuplicates

canva.widget(CanvasDuplicates)

Lastly, we can use advanced ML metrics and the Familiarity widget to find the most and least representative data instances from a given dataset, which can help identify model biases and annotation errors.

In [None]:
from canvas_familiarity import CanvasFamiliarity

canva.widget(CanvasFamiliarity)

## Visualization as a Standalone Export

The report can also be exported as a standalone static bundle to be shared with others or hosted. To explore this example in a web browser, export the report to a local folder.


In [None]:
canva.export('./canvas_report_clothes', name="Canvas ClothesDataset Visualization")

You can now serve the dataset report. For example, from the `canvas_report_clothes` folder created by the export, run a simple server from the command line:

```bash
python -m http.server
```

And navigate to http://localhost:8000/.
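
The same thing can be done from Python, which avoids changing directories first. This is a minimal sketch using only the standard library; the `make_server` helper is hypothetical, and the folder name assumes the export path used in the cell above.

```python
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def make_server(directory: str, port: int = 8000) -> ThreadingHTTPServer:
    """Create an HTTP server that serves files from `directory` on this machine only."""
    handler = partial(SimpleHTTPRequestHandler, directory=directory)
    return ThreadingHTTPServer(("127.0.0.1", port), handler)


# Example:
# server = make_server("./canvas_report_clothes")
# server.serve_forever()  # stop with Ctrl+C
```

Binding to `127.0.0.1` keeps the report reachable only from the local machine. Hosting it for others would need a different address and, most likely, a proper web server.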