# Deeplake Tutorial

# **Step 1**: _Hello World_

## Installing Hub

Hub can be installed via `pip`.

In [None]:
from IPython.display import clear_output
!pip3 install hub
clear_output()

**By default, Hub does not install dependencies for audio, video, and google-cloud (GCS) support. They can be installed using:**

In [None]:
#pip install hub[audio]  -> Audio support via miniaudio

#pip install hub[video]  -> Video support via pyav

#pip install hub[gcp]    -> GCS support via google-* dependencies

#pip install hub[all]    -> Installs everything - audio, video and GCS support

## Fetching your first Hub dataset

Begin by loading in [MNIST](https://en.wikipedia.org/wiki/MNIST_database), the hello world dataset of machine learning. 

First, load the `Dataset` by pointing to its storage location. Datasets hosted on the Activeloop Platform are typically identified by the namespace of the organization followed by the dataset name: `activeloop/mnist-train`.

In [None]:
import hub

dataset_path = 'hub://activeloop/mnist-train'
ds = hub.load(dataset_path) # Returns a Hub Dataset but does not download data locally

## Reading Samples From a Hub Dataset

Data is not immediately read into memory because Hub operates [lazily](https://en.wikipedia.org/wiki/Lazy_evaluation). You can fetch data by calling the `.numpy()` method, which reads data into a NumPy array.


In [None]:
# Indexing
W = ds.images[0].numpy() # Fetch an image and return a NumPy array
X = ds.labels[0].numpy(aslist=True) # Fetch a label and store it as a list of NumPy arrays

# Slicing
Y = ds.images[0:100].numpy() # Fetch 100 images and return a NumPy array if possible
                               # This method produces an exception if
                               # the shape of the images is not equal
Z = ds.labels[0:100].numpy(aslist=True) # Fetch 100 labels and store as list of 
                                           # NumPy arrays

In [None]:
print('X is {}'.format(X))

Congratulations, you've got Hub working on your local machine! 🤓
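To make the lazy behavior described in this step concrete, here is a minimal sketch in plain Python. `LazyColumn`, `load_sample`, and `fetch` are hypothetical names invented for this illustration and are not part of Hub. The point is only that indexing narrows a view, and data is read when it is explicitly requested.

```python
class LazyColumn:
    """Toy stand-in for a lazily evaluated tensor (not Hub's implementation)."""

    def __init__(self, loader, indices):
        self._loader = loader          # function: index -> sample
        self._indices = list(indices)  # which samples this view refers to

    def __getitem__(self, key):
        # Indexing or slicing only narrows the view; nothing is loaded yet.
        if isinstance(key, slice):
            return LazyColumn(self._loader, self._indices[key])
        return LazyColumn(self._loader, [self._indices[key]])

    def fetch(self):
        # Data is read only here, mirroring the role of `.numpy()`.
        return [self._loader(i) for i in self._indices]


reads = []  # records every index that was actually loaded


def load_sample(i):
    reads.append(i)
    return i * i  # pretend this is an expensive read


column = LazyColumn(load_sample, range(1000))
view = column[10:13]              # no reads happen here
reads_before_fetch = list(reads)
values = view.fetch()             # reads happen now
print(reads_before_fetch, values)
```

Only the three requested samples are ever loaded, even though the column nominally holds 1000.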

# **Step 2**: _Creating Hub Datasets_
*Creating and storing Hub Datasets manually.*

Creating Hub datasets is simple: you have full control over connecting your source data (files, images, etc.) to specific tensors in the Hub Dataset.

## Manual Creation

Let's follow along with the example below to create our first dataset. First, download and unzip the small classification dataset below called the *animals dataset*.

In [None]:
# Download dataset
from IPython.display import clear_output
!wget https://github.com/activeloopai/examples/raw/main/colabs/starting_data/animals.zip
clear_output()

In [None]:
# Unzip to './animals' folder
!unzip -qq /content/animals.zip

The dataset has the following folder structure:

animals
- cats
  - image_1.jpg
  - image_2.jpg
- dogs
  - image_3.jpg
  - image_4.jpg

Now that you have the data, you can **create a Hub `Dataset`** and initialize its tensors. Running the following code will create a Hub dataset inside of the `./animals_hub` folder.


In [None]:
import hub
from PIL import Image
import numpy as np
import os

ds = hub.empty('./animals_hub') # Creates the dataset

Next, let's inspect the folder structure for the source dataset `'./animals'` to find the class names and the files that need to be uploaded to the Hub dataset.

In [None]:
# Find the class_names and list of files that need to be uploaded
dataset_folder = './animals'

class_names = os.listdir(dataset_folder)

files_list = []
for dirpath, dirnames, filenames in os.walk(dataset_folder):
    for filename in filenames:
        files_list.append(os.path.join(dirpath, filename))

Next, let's **create the dataset tensors and upload metadata**. Check out our page on [Storage Synchronization](https://docs.activeloop.ai/how-hub-works/storage-synchronization) for details about the `with` syntax below.


In [None]:
with ds:
  # Create the tensors with names of your choice.
  ds.create_tensor('images', htype = 'image', sample_compression = 'jpeg')
  ds.create_tensor('labels', htype = 'class_label', class_names = class_names)

  # Add arbitrary metadata - Optional
  ds.info.update(description = 'My first Hub dataset')
  ds.images.info.update(camera_type = 'SLR')

**Note:** Specifying `htype` and `dtype` is not required, but it is highly recommended in order to optimize performance, especially for large datasets. Use `dtype` to specify the numeric type of tensor data, and use `htype` to specify the underlying data structure. More information on `htype` can be found [here](https://api-docs.activeloop.ai/htypes.html).

Finally, let's **populate the data** in the tensors.         

In [None]:
with ds:
    # Iterate through the files and append to hub dataset
    for file in files_list:
        label_text = os.path.basename(os.path.dirname(file))
        label_num = class_names.index(label_text)
        
        #Append data to the tensors
        ds.append({'images': hub.read(file), 'labels': np.uint32(label_num)})

**Note:** `ds.append({'images': hub.read(path)})` is functionally equivalent to `ds.append({'images': np.array(PIL.Image.open(path))})`. However, the `hub.read()` method is significantly faster because it does not decompress and recompress the image if the compression matches the `sample_compression` for that tensor. Further details are available in **Step 3**.

**Note:** In order to maintain proper indexing across tensors, `ds.append({...})` requires that you append to all tensors in the dataset. If you wish to skip tensors during appending, please use `ds.append({...}, skip_ok = True)` or append to a single tensor using `ds.tensor_name.append(...)`.
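The label lookup used when populating the dataset depends on the position of each folder name in `class_names`. Since `os.listdir` returns entries in arbitrary order and may also return stray files, one possible refinement is to keep directories only and sort them. The sketch below rebuilds the folder layout shown earlier in a temporary directory (with empty placeholder files) and derives the labels without touching Hub. The temporary folder and the `labels` dictionary are assumptions made for the illustration.

```python
import os
import tempfile

# Build a throwaway copy of the folder layout shown earlier (empty files only).
root = tempfile.mkdtemp()
layout = {"cats": ["image_1.jpg", "image_2.jpg"],
          "dogs": ["image_3.jpg", "image_4.jpg"]}
for class_dir, names in layout.items():
    os.makedirs(os.path.join(root, class_dir))
    for name in names:
        open(os.path.join(root, class_dir, name), "w").close()

# Keep directories only and sort them, so each class index is reproducible.
class_names = sorted(
    entry for entry in os.listdir(root)
    if os.path.isdir(os.path.join(root, entry))
)

# Map every file to the index of its parent folder in class_names.
labels = {}
for dirpath, _dirnames, filenames in os.walk(root):
    for filename in filenames:
        label_text = os.path.basename(dirpath)
        labels[filename] = class_names.index(label_text)

print(class_names)
print(sorted(labels.items()))
```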

Check out the first image from this dataset. More details about Accessing Data are available in **Step 4**.

In [None]:
Image.fromarray(ds.images[0].numpy())

## Creating Tensor Hierarchies

Often it's important to create tensors hierarchically, because information between tensors may be inherently coupled—such as bounding boxes and their corresponding labels. Hierarchy can be created using tensor `groups`:

In [None]:
ds = hub.empty('./groups_test') # Creates the dataset

# Create tensor hierarchies
ds.create_group('my_group')
ds.my_group.create_tensor('my_tensor')

# Alternatively, a group can be created using create_tensor with '/'
ds.create_tensor('my_group_2/my_tensor') # Automatically creates the group 'my_group_2'

Tensors in groups are accessed via:

In [None]:
ds.my_group.my_tensor

#OR

ds['my_group/my_tensor']

For more detailed information regarding accessing datasets and their tensors, check out **Step 4**.

# **Step 3**: _Understanding Compression_

*Using compression to achieve optimal performance.*

**Data in Hub can be stored in raw uncompressed format. However, compression is highly recommended for achieving optimal performance in terms of speed and storage.**


Compression is specified separately for each tensor, and it can occur at the `sample` or `chunk` level. For example, when creating a tensor for storing images, you can choose the compression technique for the image samples using the `sample_compression` input:

In [None]:
import hub

# Set overwrite = True for re-runability
ds = hub.empty('./compression_test', overwrite = True)

ds.create_tensor('images', htype = 'image', sample_compression = 'jpeg')

In this example, every image added in subsequent `.append(...)` calls is compressed using the specified `sample_compression` method. 

### **Choosing the Right Compression**

There is no single answer for choosing the right compression, and the tradeoffs are described in detail in the next section. However, good rules of thumb are:



1.   For data that has application-specific compressors (`image`, `audio`, `video`,...), choose the sample_compression technique that is native to the application such as `jpg`, `mp3`, `mp4`,...
2.   For other data containing large samples (i.e. large arrays with >100 values), `lz4` is a generic compressor that works well in most applications. `lz4` can be used as a `sample_compression` or `chunk_compression`. In most cases, `sample_compression` is sufficient, but in theory, `chunk_compression` produces slightly smaller data.
3.   For other data containing small samples (i.e. labels with <100 values), it is not necessary to use compression.
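The difference between sample-level and chunk-level compression in rule 2 can be illustrated with the standard library. This sketch uses `zlib` as a stand-in for `lz4` (which is not in the standard library), and the synthetic byte samples are made up for the example. It shows the general effect, not Hub's storage format.

```python
import zlib

# 100 small, similar samples (200 bytes each), standing in for array data.
samples = [bytes((i + j) % 7 for j in range(200)) for i in range(100)]

# Sample-level: each sample is compressed on its own.
sample_level_size = sum(len(zlib.compress(s)) for s in samples)

# Chunk-level: the samples are concatenated and compressed together.
chunk_level_size = len(zlib.compress(b"".join(samples)))

raw_size = sum(len(s) for s in samples)
print(raw_size, sample_level_size, chunk_level_size)
```

Compressing the samples together avoids the fixed per-sample overhead and lets the compressor reuse patterns across samples, which is why the chunk-level result is smaller here.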

### **Compression Tradeoffs**

**Lossiness -** Certain compression techniques are lossy, meaning that there is irreversible information loss when compressing the data. Lossless compression is less important for data such as images and videos, but it is critical for label data such as numerical labels, binary masks, and segmentation data.
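As a small illustration of why lossless storage matters for labels, the sketch below compares a lossless round trip through `zlib` with a toy "lossy" step. The rounding is a made-up stand-in for the kind of approximation a lossy codec applies to pixel values. It is not how any particular codec works.

```python
import zlib

labels = bytes([0, 1, 2, 3, 2, 1, 0, 3] * 32)  # class ids, one byte each

# Lossless: decompressing returns exactly the original bytes.
restored = zlib.decompress(zlib.compress(labels))

# Toy lossy step: rounding every value down to an even number.
# For class ids this silently merges class 1 into 0 and class 3 into 2.
quantized = bytes(v - (v % 2) for v in labels)

print(restored == labels, sorted(set(labels)), sorted(set(quantized)))
```

A one-unit error is invisible in an image but changes the meaning of a label.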


**Memory -** Different compression techniques have substantially different memory footprints. For instance, png vs jpeg compression may result in a 10X difference in the size of a Hub dataset. 


**Runtime -** The primary variables affecting download and upload speeds for generating usable data are the network speed and the available compute power for processing the data. In most cases, the network speed is the limiting factor. Therefore, the highest end-to-end throughput for non-local applications is achieved by maximizing compression and utilizing compute power to decompress/convert the data to formats that are consumed by deep learning models (i.e. arrays).
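The runtime argument can be checked with back-of-the-envelope arithmetic. The function below is a rough model that assumes download and decompression happen one after the other. All figures (data size, compression ratio, network and decompression speeds) are hypothetical.

```python
def end_to_end_seconds(raw_mb, compression_ratio, network_mb_per_s, decompress_mb_per_s):
    """Time to download compressed data and decompress it back to raw arrays."""
    compressed_mb = raw_mb / compression_ratio
    download = compressed_mb / network_mb_per_s
    # Decompression speed is measured in MB of raw output per second.
    decompress = raw_mb / decompress_mb_per_s if compression_ratio > 1 else 0.0
    return download + decompress


# Hypothetical figures: 1000 MB raw, 50 MB/s network, 500 MB/s decompression.
uncompressed = end_to_end_seconds(1000, 1, 50, 500)
compressed = end_to_end_seconds(1000, 10, 50, 500)
print(uncompressed, compressed)
```

With these assumed numbers, 10x compression cuts the total from 20 seconds to 4 seconds, because the network transfer dominates the cost.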


**Upload Considerations -** When applicable, the highest upload speeds can be achieved when the `sample_compression` input matches the compression of the source data, such as:

In [None]:
# sample_compression is "jpg" and appended image is "jpeg"
ds.create_tensor('images_jpg', htype = 'image', sample_compression = 'jpg')
ds.images_jpg.append(hub.read('./animals/dogs/image_3.jpg'))

In this case, the input data is a `.jpg`, and the hub `sample_compression` is `jpg`. 

However, a mismatch between compression of the source data and sample_compression in Hub results in significantly slower upload speeds, because Hub must decompress the source data and recompress it using the specified `sample_compression` before saving.

In [None]:
# sample_compression is "png" and appended image is "jpeg"
ds.create_tensor('images_png', htype = 'image', sample_compression = 'png')
ds.images_png.append(hub.read('./animals/dogs/image_3.jpg'))

**NOTE:** Due to the computational costs associated with decompressing and recompressing data, it is important that you consider the runtime implications of uploading source data that is compressed differently than the specified sample_compression. 
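Before a large upload, it can help to estimate how many source files would need recompression. The helper below is a hypothetical sketch, not part of Hub. It guesses the source format from the file extension (the file's actual encoding could differ) and treats `jpg` and `jpeg` as the same format.

```python
import os

# Extensions that name the same format are grouped together (illustrative subset).
_ALIASES = {"jpg": "jpeg", "jpeg": "jpeg", "png": "png"}


def needs_recompression(path, sample_compression):
    """Return True when the file's format differs from the tensor's compression."""
    extension = os.path.splitext(path)[1].lstrip(".").lower()
    source_format = _ALIASES.get(extension, extension)
    target = sample_compression.lower()
    target_format = _ALIASES.get(target, target)
    return source_format != target_format


print(needs_recompression("./animals/dogs/image_3.jpg", "jpeg"))  # formats match
print(needs_recompression("./animals/dogs/image_3.jpg", "png"))   # formats differ
```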

# **Step 4**: _Accessing Data_
_Accessing and loading Hub Datasets._

## Loading Datasets

Hub Datasets can be loaded and created in a variety of storage locations with minimal configuration. 

In [None]:
import hub

In [None]:
# Local Filepath
ds = hub.load('./animals_hub') # Dataset created in Step 2 in this Colab Notebook

In [None]:
# S3
# ds = hub.load('s3://my_dataset_bucket', creds={...})

In [None]:
# Public Dataset hosted by Activeloop
## Activeloop Storage - See Step 6
ds = hub.load('hub://activeloop/k49-train')

In [None]:
# Dataset in another workspace on Activeloop Platform
# ds = hub.load('hub://workspace_name/dataset_name')

**Note:** Since `ds = hub.dataset(path)` can be used to both create and load datasets, you may accidentally create a new dataset if there is a typo in the path you provided while intending to load a dataset. If that occurs, simply use `ds.delete()` to remove the unintended dataset permanently.

## Referencing Tensors

Hub allows you to reference specific tensors using keys or via the `.` notation outlined below. 


**Note:** data is still not loaded by these commands.

In [None]:
ds = hub.dataset('hub://activeloop/k49-train')

In [None]:
### NO HIERARCHY ###
ds.images # is equivalent to
ds['images']

ds.labels # is equivalent to
ds['labels']

### WITH HIERARCHY ###
# ds.localization.boxes # is equivalent to
# ds['localization/boxes']

# ds.localization.labels # is equivalent to
# ds['localization/labels']

## Accessing Data

Data within the tensors is loaded and accessed using the `.numpy()` command:

In [None]:
# Indexing
ds = hub.dataset('hub://activeloop/k49-train')

W = ds.images[0].numpy() # Fetch an image and return a NumPy array
X = ds.labels[0].numpy(aslist=True) # Fetch a label and store it as a 
                                    # list of NumPy arrays

# Slicing
Y = ds.images[0:100].numpy() # Fetch 100 images and return a NumPy array
                             # The method above produces an exception if 
                             # the images are not all the same size

Z = ds.labels[0:100].numpy(aslist=True) # Fetch 100 labels and store 
                                        # them as a list of NumPy arrays

**Note:** The `.numpy()` method will produce an exception if all samples in the requested tensor do not have a uniform shape. If that's the case, running `.numpy(aslist=True)` solves the problem by returning a list of NumPy arrays, where the indices of the list correspond to different samples. 
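The reason behind the two return modes can be sketched without NumPy, using nested lists in place of arrays. `shape_of` and `fetch` are hypothetical helpers written for this illustration and do not reflect Hub's implementation.

```python
def shape_of(array):
    """Shape of a rectangular nested list, e.g. [[1, 2], [3, 4]] -> (2, 2)."""
    shape = []
    while isinstance(array, list):
        shape.append(len(array))
        array = array[0] if array else None
    return tuple(shape)


def fetch(samples, aslist=False):
    """Toy stand-in for the two return modes."""
    if aslist:
        return list(samples)  # one entry per sample, shapes may differ
    shapes = {shape_of(s) for s in samples}
    if len(shapes) > 1:
        raise ValueError("samples have different shapes: {}".format(sorted(shapes)))
    return samples  # uniform shapes could be stacked into a single array


uniform = [[[0, 0], [0, 0]], [[1, 1], [1, 1]]]  # two 2x2 samples
ragged = [[[0, 0], [0, 0]], [[1, 1, 1]]]        # a 2x2 and a 1x3 sample

print(shape_of(uniform[0]), len(fetch(ragged, aslist=True)))
try:
    fetch(ragged)
    stacking_failed = False
except ValueError:
    stacking_failed = True
print(stacking_failed)
```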

# **Step 5**: *Visualizing Datasets*

One of Hub's core features is to enable users to visualize and interpret large amounts of data. Let's load the COCO dataset, which is one of the most popular datasets in computer vision.

In [None]:
import hub

ds = hub.load('hub://activeloop/coco-train')

The tensor layout for this dataset can be inspected using:

In [None]:
ds.summary()

The dataset can be [visualized in Platform](https://app.activeloop.ai/activeloop/coco-train), or using an iframe in a Jupyter notebook:

In [None]:
ds.visualize()

**Note:** Visualizing datasets in [Activeloop Platform](https://app.activeloop.ai/) will unlock more features and faster performance compared to visualization in Jupyter notebooks.

## Visualizing your own datasets

Any Hub dataset can be visualized using the methods above as long as it follows the conventions necessary for the visualization engine to interpret and parse the data. These conventions [are explained here](https://docs.activeloop.ai/dataset-visualization).

# **Step 6**: _Using Activeloop Storage_ (optional -> needs account!)

_Storing and loading datasets from Activeloop Platform Storage._

## Register

You can store your Hub Datasets with Activeloop by first creating an account in [Activeloop Platform](https://app.activeloop.ai/) or in the CLI using:

In [None]:
!activeloop register

## Login

In order for the Python API to authenticate with the Activeloop Platform, you should log in from the CLI using:

In [None]:
!activeloop login  # prompts for inputting username and password will follow ...

# Alternatively, you can directly input your username and password in the same line:
# !activeloop login -u my_username -p my_password


You can then access or create Hub Datasets by passing the Activeloop Platform path to `hub.dataset()`.

In [None]:
import hub

# Replace with your own path in the form 'hub://workspace_name/dataset_name',
# e.g. 'hub://jane_smith/my_awesome_dataset'
platform_path = 'hub://workspace_name/dataset_name'

ds = hub.dataset(platform_path)

**Note**: When you create an account in Activeloop Platform, a default workspace is created that has the same name as your username. You are also able to create other workspaces that represent organizations, teams, or other collections of multiple users. 

Public datasets such as `hub://activeloop/mnist-train` can be accessed without logging in.
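Since Platform paths follow the pattern `hub://workspace_name/dataset_name`, a small validation step can catch malformed paths before they reach `hub.dataset()`. The helper below, `split_platform_path`, is a hypothetical convenience function written for this sketch and is not part of Hub.

```python
def split_platform_path(path):
    """Split 'hub://workspace/dataset' into (workspace, dataset_name)."""
    prefix = "hub://"
    if not path.startswith(prefix):
        raise ValueError("expected a path starting with 'hub://'")
    workspace, _, dataset_name = path[len(prefix):].partition("/")
    if not workspace or not dataset_name:
        raise ValueError("expected 'hub://workspace_name/dataset_name'")
    return workspace, dataset_name


print(split_platform_path("hub://activeloop/mnist-train"))
print(split_platform_path("hub://jane_smith/my_awesome_dataset"))
```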

## Tokens

Once you have an Activeloop account, you can create tokens in [Activeloop Platform](https://app.activeloop.ai/) (Organization Details -> API Tokens) and pass them to Python commands that require authentication using:

In [None]:
#ds = hub.load(platform_path, token = 'xyz')

# **Step 7**: _Connecting Hub Datasets to ML Frameworks_

_Connecting Hub Datasets to machine learning frameworks such as PyTorch and TensorFlow._

You can connect Hub Datasets to popular ML frameworks such as PyTorch and TensorFlow using minimal boilerplate code, and Hub takes care of the parallel processing!

## PyTorch

You can train a model by creating a PyTorch DataLoader from a Hub Dataset using `ds.pytorch()`.

In [None]:
import hub
from torchvision import datasets, transforms, models

ds = hub.dataset('hub://activeloop/cifar100-train') # Hub Dataset

The transform parameter in `ds.pytorch()` is a dictionary where the `key` is the tensor name and the `value` is the transformation function that should be applied to that tensor. If a specific tensor's data does not need to be returned, it should be omitted from the keys. If a tensor's data does not need to be modified during preprocessing, the transformation function is set as `None`.
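The semantics of the transform dictionary can be pictured with a few lines of plain Python. This is a conceptual sketch only. `apply_transform` and the sample contents are invented for the illustration and do not reflect how `ds.pytorch()` is implemented.

```python
def apply_transform(sample, transform):
    """Apply a per-tensor transform dictionary to one sample (a dict of tensors)."""
    output = {}
    for name, function in transform.items():
        value = sample[name]
        # None means "return this tensor unchanged".
        output[name] = value if function is None else function(value)
    return output  # tensors that are not listed in `transform` are dropped


sample = {"images": [0, 128, 255], "labels": 7, "filenames": "img_0.png"}
transform = {"images": lambda pixels: [p / 255 for p in pixels], "labels": None}

result = apply_transform(sample, transform)
print(result)
```

Here `images` is rescaled, `labels` passes through untouched, and `filenames` is omitted from the output because it is not a key in the dictionary.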

In [None]:
tform = transforms.Compose([
    transforms.ToPILImage(), # Must convert to PIL image for subsequent operations to run
    transforms.RandomRotation(20), # Image augmentation
    transforms.ToTensor(), # Must convert to pytorch tensor for subsequent operations to run
    transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),
])

#PyTorch Dataloader
dataloader = ds.pytorch(batch_size = 16, num_workers = 2, 
    transform = {'images': tform, 'labels': None}, shuffle = True)

You can iterate through the Hub DataLoader just like you would for a PyTorch DataLoader. Loading the first batch of data takes the longest time because the shuffle buffer is filled before any data is returned.
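The start-up delay caused by a shuffle buffer can be reproduced with a small generator. This is a generic sketch of buffer-based shuffling and makes no claim about Hub's internals. The buffer size, the seed, and the `counting_source` helper are arbitrary choices for the illustration.

```python
import random


def shuffled(stream, buffer_size, seed=0):
    """Yield items from `stream` in pseudo-random order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        buffer.append(item)
        if len(buffer) < buffer_size:
            continue  # still filling: nothing is yielded yet
        # Buffer is full: hand out one random item, leaving room for the next.
        yield buffer.pop(rng.randrange(len(buffer)))
    # Source exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


consumed = []  # source items read so far


def counting_source(n):
    for i in range(n):
        consumed.append(i)
        yield i


iterator = shuffled(counting_source(20), buffer_size=8)
first = next(iterator)
items_read_before_first_output = len(consumed)
rest = list(iterator)
print(items_read_before_first_output, sorted([first] + rest) == list(range(20)))
```

The first output only appears after `buffer_size` items have been read from the source, which is the same reason the first batch is the slowest.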

In [None]:
for data in dataloader:
    print(data)
    break
    # Training Loop

**Note:** Some datasets such as ImageNet contain both grayscale and color images, which can cause errors when the transformed images are passed to the model. To convert only the grayscale images to color format, you can add this Torchvision transform to your pipeline:

In [None]:
# transforms.Lambda(lambda x: x.repeat(int(3/x.shape[0]), 1, 1))

## TensorFlow

Similarly, you can convert a Hub Dataset to a TensorFlow Dataset via the `tf.data` API.

In [None]:
ds # Hub Dataset object, to be used for training
ds_tf = ds.tensorflow() # A TensorFlow Dataset

# **Step 8**: _Parallel Computing_

_Running computations and processing data in parallel._

Hub enables you to easily run computations in parallel and significantly accelerate your data processing workflows. This example primarily focuses on parallel dataset uploading, and other use cases such as dataset transformations can be found in [this tutorial](https://docs.activeloop.ai/tutorials/data-processing-using-parallel-computing).

Parallel compute using Hub has two core elements: #1. defining a function or pipeline that will run in parallel and #2. evaluating it using the appropriate inputs and outputs. Let's start with #1 by defining a function that processes files and appends their data to the labels and images tensors. 

**Defining the parallel computing function**

The first step for running parallel computations is to define a function that will run in parallel by decorating it using `@hub.compute`. In the example below, `file_to_hub` converts data from files into Hub format, just like in **Step 2: Creating Hub Datasets Manually**. If you have not completed Step 2, please complete the section that downloads and unzips the *animals* dataset.

In [None]:
import hub
from PIL import Image
import numpy as np
import os

@hub.compute
def file_to_hub(file_name, sample_out, class_names):
    ## First two arguments are always default arguments containing:
    #     1st argument is an element of the input iterable (list, dataset, array,...)
    #     2nd argument is a dataset sample
    # Other arguments are optional
    
    # Find the label number corresponding to the file
    label_text = os.path.basename(os.path.dirname(file_name))
    label_num = class_names.index(label_text)
    
    # Append the label and image to the output sample
    sample_out.labels.append(np.uint32(label_num))
    sample_out.images.append(hub.read(file_name))
    
    return sample_out

In all functions decorated using `@hub.compute`, the first argument must be a single element of any input iterable that is being processed in parallel. In this case, that is a filename `file_name`, because `file_to_hub` reads image files and populates data in the dataset's tensors.

The second argument is a dataset sample `sample_out`, which can be operated on using similar syntax to dataset objects, such as `sample_out.append(...)`, `sample_out.extend(...)`, etc.

The function decorated using `@hub.compute` must return `sample_out`, which represents the data that is added or modified by that function.

**Executing the transform**

To execute the transform, you must define the dataset that will be modified by the parallel computation.

In [None]:
ds = hub.empty('./animals_hub_transform') # Creates the dataset

Next, you define the input iterable that describes the information that will be operated on in parallel. In this case, that is a list of files `files_list` from the animals dataset in Step 2.

In [None]:
# Find the class_names and list of files that need to be uploaded
dataset_folder = './animals'

class_names = os.listdir(dataset_folder)

files_list = []
for dirpath, dirnames, filenames in os.walk(dataset_folder):
    for filename in filenames:
        files_list.append(os.path.join(dirpath, filename))

You can now create the tensors for the dataset and **run the parallel computation** using the `.eval` syntax. Pass only the optional input arguments to `file_to_hub`; the first two default arguments, `file_name` and `sample_out`, are skipped.

The input iterable `files_list` and the output dataset `ds` are passed to the `.eval` method as the first and second arguments, respectively.
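The calling pattern described above (bind the optional arguments first, supply the iterable afterwards) can be pictured with `functools.partial` and a thread pool from the standard library. This is only an analogy for the two-step call. `file_to_label` is a simplified, hypothetical counterpart of `file_to_hub` that returns a label instead of writing to a dataset.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def file_to_label(file_name, class_names):
    """Per-element work: map a file path to its numeric class label."""
    label_text = os.path.basename(os.path.dirname(file_name))
    return class_names.index(label_text)


files_list = [
    "./animals/cats/image_1.jpg",
    "./animals/cats/image_2.jpg",
    "./animals/dogs/image_3.jpg",
    "./animals/dogs/image_4.jpg",
]

# Step 1: bind the optional argument now; the per-element argument stays open.
bound = partial(file_to_label, class_names=["cats", "dogs"])

# Step 2: supply the iterable later and let a pool of workers process it.
with ThreadPoolExecutor(max_workers=2) as pool:
    labels = list(pool.map(bound, files_list))  # map preserves input order

print(labels)
```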

In [None]:
with ds:
    ds.create_tensor('images', htype = 'image', sample_compression = 'jpeg')
    ds.create_tensor('labels', htype = 'class_label', class_names = class_names)
    
    file_to_hub(class_names=class_names).eval(files_list, ds, num_workers = 2)

In [None]:
Image.fromarray(ds.images[0].numpy())

Congrats! You just created a dataset using parallel computing! 🎈

# **Step 9**: _Dataset Version Control_

*Managing changes to your datasets using Version Control.*

Hub dataset version control allows you to manage changes to datasets with commands very similar to Git. It provides critical insights into how your data is evolving, and it works with datasets of any size!

Let's check out how dataset version control works in Hub! If you haven't done so already, please download and unzip the *animals* dataset from **Step 2**. 

First let's create a hub dataset in the `./version_control_hub` folder.

In [None]:
import hub
import numpy as np
from PIL import Image

# Set overwrite = True for re-runability
ds = hub.dataset('./version_control_hub', overwrite = True)

# Create a tensor and add an image
with ds:
    ds.create_tensor('images', htype = 'image', sample_compression = 'jpeg')
    ds.images.append(hub.read('./animals/cats/image_1.jpg'))

The first image in this dataset is a picture of a cat:

In [None]:
Image.fromarray(ds.images[0].numpy())

## Commit

To commit the data added above, simply run `ds.commit`:


In [None]:
first_commit_id = ds.commit('Added image of a cat')

print('Dataset in commit {} has {} samples'.format(first_commit_id, len(ds)))

Next, let's add another image and commit the update:

In [None]:
with ds:
    ds.images.append(hub.read('./animals/dogs/image_3.jpg'))
    
second_commit_id = ds.commit('Added an image of a dog')

print('Dataset in commit {} has {} samples'.format(second_commit_id, len(ds)))

The second image in this dataset is a picture of a dog:

In [None]:
Image.fromarray(ds.images[1].numpy())

## Log

The commit history starting from the current commit can be shown using `ds.log`:


In [None]:
log = ds.log()

This command prints the log to the console and also assigns it to the specified variable `log`. The author of the commit is the username of the [Activeloop account](https://docs.activeloop.ai/getting-started/using-activeloop-storage) that is logged in on the machine.

## Branch

Branching takes place by running the `ds.checkout` command with the parameter `create = True`. Let's create a new branch `dog_flipped`, flip the second image (dog), and create a new commit on that branch.

In [None]:
ds.checkout('dog_flipped', create = True)

with ds:
    ds.images[1] = np.transpose(ds.images[1].numpy(), axes=[1,0,2])

flipped_commit_id = ds.commit('Flipped the dog image')

The dog image is now flipped and the log shows a commit on the `dog_flipped` branch as well as the previous commits on `main`: 

In [None]:
Image.fromarray(ds.images[1].numpy())

In [None]:
ds.log()

## Checkout

A previous commit or branch can be checked out using `ds.checkout`:

In [None]:
ds.checkout('main')

Image.fromarray(ds.images[1].numpy())

As expected, the dog image on `main` is not flipped.
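The branch behavior seen in this section can be pictured with a toy model in which every branch holds its own list of samples. `ToyVersionedList` is a made-up class for illustration. It copies data on branching for simplicity, which says nothing about how Hub stores versions internally.

```python
import copy


class ToyVersionedList:
    """Toy model of branches and commits (not Hub's implementation)."""

    def __init__(self):
        self.branches = {"main": []}  # branch name -> list of samples
        self.current = "main"
        self.commits = {}             # commit id -> (message, snapshot)

    @property
    def samples(self):
        return self.branches[self.current]

    def commit(self, message):
        commit_id = "c{}".format(len(self.commits) + 1)
        self.commits[commit_id] = (message, copy.deepcopy(self.samples))
        return commit_id

    def checkout(self, branch, create=False):
        if create:
            # A new branch starts from the current state of the old one.
            self.branches[branch] = copy.deepcopy(self.samples)
        self.current = branch


ds_toy = ToyVersionedList()
ds_toy.samples.append("cat")
ds_toy.commit("Added image of a cat")
ds_toy.samples.append("dog")
ds_toy.commit("Added an image of a dog")

ds_toy.checkout("dog_flipped", create=True)
ds_toy.samples[1] = "dog (flipped)"
ds_toy.commit("Flipped the dog image")
on_branch = list(ds_toy.samples)

ds_toy.checkout("main")
on_main = list(ds_toy.samples)
print(on_branch, on_main)
```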

## Diff

Understanding changes between commits is critical for managing the evolution of datasets. Hub's `ds.diff` function enables users to determine the number of samples that were added, removed, or updated for each tensor. The function can be used in 3 ways:

In [None]:
ds.diff() # Diff between the current state and the last commit

In [None]:
ds.diff(first_commit_id) # Diff between the current state and a specific commit

In [None]:
ds.diff(second_commit_id, first_commit_id) # Diff between two specific commits

## HEAD Commit


Unlike Git, Hub's version control does not have a staging area because changes to datasets are not stored locally before they are committed. All changes are automatically reflected in the dataset's permanent storage (local or cloud). **Therefore, any changes to a dataset are automatically stored in a HEAD commit on the current branch**. This means that the uncommitted changes do not appear on other branches. Let's see how this works:

You should currently be on the `main` branch, which has 2 samples. Let's add another image:


In [None]:
print('Dataset on {} branch has {} samples'.format('main', len(ds)))

with ds:
    ds.images.append(hub.read('./animals/dogs/image_4.jpg'))
    
print('After updating, the HEAD commit on {} branch has {} samples'.format('main', len(ds)))

The 3rd sample is also an image of a dog:

In [None]:
Image.fromarray(ds.images[2].numpy())

Next, if you check out the `dog_flipped` branch, the dataset contains 2 samples, which is the sample count from when that branch was created. Therefore, the additional uncommitted third sample that was added to the `main` branch above is not reflected when other branches or commits are checked out.

In [None]:
ds.checkout('dog_flipped')

print('Dataset in {} branch has {} samples'.format('dog_flipped', len(ds)))

Finally, when checking out the `main` branch again, the prior uncommitted changes are visible, and they are stored in the `HEAD` commit on `main`:

In [None]:
ds.checkout('main')

print('Dataset in {} branch has {} samples'.format('main', len(ds)))

The dataset now contains 3 samples and the uncommitted dog image is visible:

In [None]:
Image.fromarray(ds.images[2].numpy())

## Merge - Coming Soon


Merging is a critical feature for collaborating on datasets, and Activeloop is currently working on an implementation.

Congrats! You are now an expert in dataset version control!🎓

# **Step 10:** *Dataset Filtering*

Filtering and querying is an important aspect of data engineering because it enables users to focus on subsets of their datasets in order to obtain important insights, perform quality control, and train models on parts of their data. 

Hub enables you to perform queries using user-defined functions or Hub's Pythonic query language, all of which can be parallelized using our simple multi-processing API.

## Filtering with user-defined functions

The first step for querying using UDFs is to define a function that returns a boolean depending on whether an input sample in a dataset meets the user-defined condition. In this example, we define a function that returns `True` if the label of a sample is in the desired `labels_list`. If there are inputs to the filtering function other than `sample_in`, it must be decorated with `@hub.compute`.

In [None]:
@hub.compute
def filter_labels(sample_in, labels_list, class_names):
    text_label = class_names[sample_in.labels.numpy()[0]]
    
    return text_label in labels_list

Let's load a dataset and specify the `labels_list` that we want to filter for.

In [None]:
import hub
from PIL import Image

ds = hub.load('hub://activeloop/cifar10-test')

labels_list = ['automobile', 'ship'] # Desired labels for filtering
class_names = ds.labels.info.class_names # Mapping from numeric to text labels

The filtering function is executed using the `ds.filter()` command below, and it returns a virtual view of the dataset (`dataset_view`) that only contains the indices that met the filtering condition. Just like in the Parallel Computing API, the `sample_in` parameter does not need to be passed into the filter function when evaluating it, and multi-processing can be specified using the `scheduler` and `num_workers` parameters.

In [None]:
ds_view = ds.filter(filter_labels(labels_list, class_names), scheduler = 'threaded', num_workers = 0)

The data in the returned `ds_view` can be accessed just like a regular dataset.

In [None]:
Image.fromarray(ds_view.images[0].numpy())

**Note:** in most cases, multi-processing is not necessary for queries that involve simple data such as labels or bounding boxes. However, multi-processing significantly accelerates queries that must load rich data types such as images and videos.
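Conceptually, a filter of this kind evaluates a boolean function on every sample and keeps the indices that pass, which is why the result behaves like a view rather than a copy. The sketch below imitates that idea with plain dictionaries. `filter_indices`, the sample data, and the shortened class list are all made up for the illustration.

```python
from functools import partial


def filter_labels(sample, labels_list, class_names):
    """Return True when the sample's text label is one of the desired labels."""
    return class_names[sample["labels"]] in labels_list


def filter_indices(samples, predicate):
    """Keep only the indices whose sample satisfies the predicate."""
    return [i for i, sample in enumerate(samples) if predicate(sample)]


class_names = ["airplane", "automobile", "bird", "ship"]  # made-up class list
samples = [{"labels": 0}, {"labels": 1}, {"labels": 3}, {"labels": 2}, {"labels": 1}]

# Bind the extra inputs first; the sample itself is supplied during filtering.
predicate = partial(filter_labels, labels_list=["automobile", "ship"],
                    class_names=class_names)
view_indices = filter_indices(samples, predicate)
print(view_indices)
```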

## Filtering using our Pythonic query language

Queries can also be executed using Hub's Pythonic query language. This UX is primarily intended for use in [Activeloop Platform](https://app.activeloop.ai/), but it can also be applied programmatically in Python.

In [None]:
ds_view = ds.filter("labels == 'automobile' or labels == 'ship'", scheduler = 'threaded', num_workers = 0)

Tensors can be referred to by name, the language supports common logical operations (`in, ==, !=, >, <, >=, <=`), and numpy-like operators and indexing can be applied such as `'images.min > 5'`, `'images.shape[2]==1'`, and others.

Congrats! You just learned to filter data with Hub! 🎈