# 🚀 Getting Started with DataChain

<img src="static/images/datachain-overview.png" alt="DataChain Overview" style="width: 600px;"/>

DataChain is a powerful tool for managing datasets and ML workflows. This tutorial explores how **DataChain** helps Computer Vision projects:
- 🗂️ **Manage and version datasets and annotations** effectively.
- 🔍 **Handle large-scale operations**, applying complex filters and transformations to millions of data entries.
- ⏰ **Save valuable time and resources** by avoiding redundant computations for previously processed samples.
- 🌊 **Directly stream curated data into PyTorch**, eliminating the need for intermediate resharding.

## 📋 Agenda

- 🖼️ Create a `fashion-product-images` dataset from an image directory
- 📂 Load the dataset
- 🔍 Explore filtering techniques
  
## 🛠 Prerequisites

Before you begin, ensure you have:
- ⚙️ DataChain installed in your environment (follow the instructions in `examples/fashion-product-images/README.md`)

## Imports

In [2]:
%load_ext autoreload
%autoreload 2

# Import datachain 
from datachain.lib.dc import DataChain, C
from datachain.query import udf

import pandas as pd
from typing import Iterator, Tuple

# 🆕 Create a DataChain

There are multiple ways to create a DataChain and persist it as a dataset. After importing the necessary modules, you can create one with the `DataChain` class:

- From cloud storage (AWS S3, GCP, Azure...) or a local directory
- From a previously saved dataset version
- From values

Here are a few examples: 

```python
# from cloud storage such as S3, GCS, or Azure:
DataChain.from_storage("s3://my-bucket/my-dir/")

# from previously saved dataset: 
DataChain.from_dataset("name", version=1)

# from values: 
DataChain.from_features(fib=[1, 2, 3, 5, 8])
```

Data in DataChain is represented as Python classes with an arbitrary set of fields,
including nested classes. These data classes must inherit from the `Feature` class, which is based on Pydantic.

<img src="static/images/dataset-1.png" alt="Dataset" style="width: 600px;"/>

**Note:** DataChain represents file samples as pointers to their respective storage locations. This means a newly created dataset version does not duplicate files in storage, and storage remains the single source of truth for the original samples.
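
To make the idea of nested data classes and file pointers more concrete, here is a minimal sketch in plain Python. It uses standard-library `dataclasses` rather than DataChain's `Feature` class, and the names `FileRef`, `ImageSample`, `label`, and `flatten` are hypothetical. It only illustrates how a nested record that points at a stored file could be flattened into column names such as `file__name`, which is the naming visible in the `show()` output in this notebook.

```python
from dataclasses import dataclass, fields, is_dataclass
from typing import Any


@dataclass
class FileRef:
    """Hypothetical pointer to a stored file: it holds a location, not the bytes."""
    source: str
    parent: str
    name: str
    size: int


@dataclass
class ImageSample:
    """Hypothetical record with a nested file pointer plus one annotation."""
    file: FileRef
    label: str


def flatten(obj: Any, prefix: str = "") -> dict[str, Any]:
    """Flatten nested dataclass fields into `outer__inner` column names."""
    flat: dict[str, Any] = {}
    for field in fields(obj):
        value = getattr(obj, field.name)
        key = f"{prefix}{field.name}"
        if is_dataclass(value):
            flat.update(flatten(value, prefix=f"{key}__"))
        else:
            flat[key] = value
    return flat


sample = ImageSample(
    file=FileRef(
        source="gs://datachain-demo",
        parent="fashion-product-images/images",
        name="10000.jpg",
        size=1030,
    ),
    label="Flip Flops",
)
columns = flatten(sample)
print(list(columns))
```

Because the record stores only `source`, `parent`, and `name`, creating another version of such a dataset copies these small pointers and leaves the image files where they are.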

## Create a DataChain from a GCP bucket

In [5]:
# Create a DataChain

ds = (
    DataChain.from_storage("gs://datachain-demo/fashion-product-images", type="image")
    .filter(C.name.glob("*.jpg"))
    .save()
)
ds.show(3)

Listing gs://datachain-demo: 49505 objects [00:51, 960.67 objects/s] 
Processed: 44441 rows [00:01, 29300.38 rows/s]

   id vtype  dir_type                         parent       name  \
0   1               0  fashion-product-images/images  10000.jpg   
1   2               0  fashion-product-images/images  10001.jpg   
2   3               0  fashion-product-images/images  10002.jpg   

               etag           version  is_latest  \
0  CPzf74/e+4YDEAE=  1719489653370876          1   
1  CKaGwIne+4YDEAE=  1719489640006438          1   
2  CKTW55fe+4YDEAE=  1719489670015780          1   

                     last_modified  size  ...         file__source  \
0 2024-06-27 12:00:53.421000+00:00  1030  ...  gs://datachain-demo   
1 2024-06-27 12:00:40.056000+00:00  1210  ...  gs://datachain-demo   
2 2024-06-27 12:01:10.067000+00:00   807  ...  gs://datachain-demo   

                    file__parent  file__name file__size     file__version  \
0  fashion-product-images/images   10000.jpg       1030  1719489653370876   
1  fashion-product-images/images   10001.jpg       1210  1719489640006438   
2  fashion




## Create a DataChain from a local directory of images

**(OPTIONAL) You may skip this and work with data in our public dataset.**

You may create a DataChain from a directory if images are stored locally. Download data from Kaggle to follow the example. 

**Manually**
- Download the [Fashion Product Images (Small) dataset](https://www.kaggle.com/datasets/paramaggarwal/fashion-product-images-small/data) from Kaggle, contributed by Param Aggarwal.
- Unzip the data into the `data` directory in `examples/fashion-product-images`.

**Using the script below:**
1. Obtain your Kaggle credentials file (`kaggle.json`) and save it to the `~/.kaggle` directory so that it's available at `~/.kaggle/kaggle.json`.
2. Download the desired dataset from Kaggle.
3. Unzip the downloaded data into the `data` directory.

In [2]:
## Prepare credentials 
# !mkdir -p ~/.kaggle
# !cp kaggle.json ~/.kaggle/
# !chmod 600 ~/.kaggle/kaggle.json

## Download data 
# !pip install -q kaggle
# !kaggle datasets download -d paramaggarwal/fashion-product-images-small

## Unzip files 
# !unzip fashion-product-images-small.zip "images/*" -d data
# !unzip fashion-product-images-small.zip "styles.csv" -d data

## (optional) Remove unnecessary redundant directory in the source data 
# ![ -d "data/myntradataset" ] && rm -r "data/myntradataset" 

In [4]:
# Create a DataChain

# DATA_PATH = "data/images"

# ds = (
#     DataChain.from_storage(DATA_PATH, type="image")
#     .filter(C.name.glob("*.jpg"))
# )
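
Both the GCP and the local variant keep only files whose name matches the pattern `*.jpg`. The sketch below shows what such a glob pattern selects, using the standard-library `fnmatch` module on a hypothetical list of file names. It is not DataChain's implementation, and whether `C.name.glob()` treats letter case the same way is an assumption here.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, similar to what a storage listing might return.
names = ["10000.jpg", "10001.jpg", "styles.csv", "cover.JPG", "notes.jpg.txt"]

# `*` matches any run of characters, so "*.jpg" keeps names that end in ".jpg".
# fnmatchcase compares case-sensitively on every platform.
jpg_only = [name for name in names if fnmatchcase(name, "*.jpg")]
print(jpg_only)
```

With case-sensitive matching, `cover.JPG` is dropped along with the non-image files, so it is worth checking how image extensions are spelled in your own bucket.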

## Preview DataChain content

In [4]:
# Preview with `.show()`

ds.show(3)

Processed: 44441 rows [00:01, 27796.21 rows/s]

   id vtype  dir_type                                             parent  \
0   1               0  Users/mikhailrozhkov/dev/products/dvcx/dvcx/ex...   
1   2               0  Users/mikhailrozhkov/dev/products/dvcx/dvcx/ex...   
2   3               0  Users/mikhailrozhkov/dev/products/dvcx/dvcx/ex...   

        name                   etag version  is_latest  \
0   9733.jpg  0x1.76bbcf3800000p+30                  1   
1  14147.jpg  0x1.76bbcc8800000p+30                  1   
2  52112.jpg  0x1.76bbcf0000000p+30                  1   

              last_modified   size  ... file__source  \
0 2019-10-22 12:19:26+00:00   1694  ...     file:///   
1 2019-10-22 12:16:34+00:00  10078  ...     file:///   
2 2019-10-22 12:19:12+00:00  24511  ...     file:///   

                                        file__parent  file__name file__size  \
0  Users/mikhailrozhkov/dev/products/dvcx/dvcx/ex...    9733.jpg       1694   
1  Users/mikhailrozhkov/dev/products/dvcx/dvcx/ex...   14147.jpg      10078   





In [5]:
# Preview with Pandas

df = ds.to_pandas()

print(df.shape)
df.head()

Processed: 44441 rows [00:01, 27464.91 rows/s]


(44441, 25)


Unnamed: 0,id,vtype,dir_type,parent,name,etag,version,is_latest,last_modified,size,...,file__source,file__parent,file__name,file__size,file__version,file__etag,file__is_latest,file__last_modified,file__location,file__vtype
0,1,,0,Users/mikhailrozhkov/dev/products/dvcx/dvcx/ex...,9733.jpg,0x1.76bbcf3800000p+30,,1,2019-10-22 12:19:26+00:00,1694,...,file:///,Users/mikhailrozhkov/dev/products/dvcx/dvcx/ex...,9733.jpg,1694,,0x1.76bbcf3800000p+30,1,1970-01-01 00:00:00+00:00,,
1,2,,0,Users/mikhailrozhkov/dev/products/dvcx/dvcx/ex...,14147.jpg,0x1.76bbcc8800000p+30,,1,2019-10-22 12:16:34+00:00,10078,...,file:///,Users/mikhailrozhkov/dev/products/dvcx/dvcx/ex...,14147.jpg,10078,,0x1.76bbcc8800000p+30,1,1970-01-01 00:00:00+00:00,,
2,3,,0,Users/mikhailrozhkov/dev/products/dvcx/dvcx/ex...,52112.jpg,0x1.76bbcf0000000p+30,,1,2019-10-22 12:19:12+00:00,24511,...,file:///,Users/mikhailrozhkov/dev/products/dvcx/dvcx/ex...,52112.jpg,24511,,0x1.76bbcf0000000p+30,1,1970-01-01 00:00:00+00:00,,
3,4,,0,Users/mikhailrozhkov/dev/products/dvcx/dvcx/ex...,6400.jpg,0x1.76bbcf3000000p+30,,1,2019-10-22 12:19:24+00:00,20042,...,file:///,Users/mikhailrozhkov/dev/products/dvcx/dvcx/ex...,6400.jpg,20042,,0x1.76bbcf3000000p+30,1,1970-01-01 00:00:00+00:00,,
4,5,,0,Users/mikhailrozhkov/dev/products/dvcx/dvcx/ex...,34297.jpg,0x1.76bbce9000000p+30,,1,2019-10-22 12:18:44+00:00,15617,...,file:///,Users/mikhailrozhkov/dev/products/dvcx/dvcx/ex...,34297.jpg,15617,,0x1.76bbce9000000p+30,1,1970-01-01 00:00:00+00:00,,


# 🏷️ Add Metadata

In DataChain, you can add annotations and attributes to files. In the following steps, you'll add metadata from a CSV file:
1. Load/prepare annotations
2. Define a mapping function or UDF
3. Apply the function to generate new columns
4. Save an annotated dataset

<img src="static/images/dataset-2.png" alt="Dataset" style="width: 600px;"/>

## Load metadata from CSV in GCP

- With DataChain, you can create a chain from a single CSV, JSON, or Parquet file, or parse multiple files at once.
- The example below parses a single CSV metadata file with the `parse_csv()` method.

In [22]:
# Load metadata from CSV 
ds_meta = (
    DataChain.from_storage("gs://datachain-demo/fashion-product-images/styles_clean.csv")
    .parse_csv()
    .save()
)

ds_meta.show(3)

Processed: 1 rows [00:00, 1128.71 rows/s]


Inferred tabular data schema: {'source': <class 'datachain.lib.arrow.Source'>, 'c0': <class 'int'>, 'id': <class 'int'>, 'gender': <class 'str'>, 'mastercategory': <class 'str'>, 'subcategory': <class 'str'>, 'articletype': <class 'str'>, 'basecolour': <class 'str'>, 'season': <class 'str'>, 'year': <class 'float'>, 'usage': <class 'str'>, 'productdisplayname': <class 'str'>}


Processed: 1 rows [00:00, 1538.63 rows/s]
Processed: 1 rows [00:01,  1.98s/ rows]
Generated: 44446 rows [00:01, 22407.51 rows/s]

   id source__file__source    source__file__parent source__file__name  \
0   1  gs://datachain-demo  fashion-product-images   styles_clean.csv   
1   2  gs://datachain-demo  fashion-product-images   styles_clean.csv   
2   3  gs://datachain-demo  fashion-product-images   styles_clean.csv   

   source__file__size source__file__version source__file__etag  \
0             4675018      1719830629903847   COfbk67UhYcDEAE=   
1             4675018      1719830629903847   COfbk67UhYcDEAE=   
2             4675018      1719830629903847   COfbk67UhYcDEAE=   

   source__file__is_latest source__file__last_modified source__file__location  \
0                        1   1970-01-01 00:00:00+00:00                   None   
1                        1   1970-01-01 00:00:00+00:00                   None   
2                        1   1970-01-01 00:00:00+00:00                   None   

   ...     c0  gender  mastercategory subcategory articletype basecolour  \
0  ...  12904     Men         Apparel    




In [23]:
# Add a "filename" column to map each image file to its corresponding metadata

ds_meta = ds_meta.map(filename=lambda c0: str(c0) + '.jpg', output=str)
ds_meta.show(3)

Processed: 44446 rows [00:00, 66907.51 rows/s]

   id source__file__source    source__file__parent source__file__name  \
0   1  gs://datachain-demo  fashion-product-images   styles_clean.csv   
1   2  gs://datachain-demo  fashion-product-images   styles_clean.csv   
2   3  gs://datachain-demo  fashion-product-images   styles_clean.csv   

   source__file__size source__file__version source__file__etag  \
0             4675018      1719830629903847   COfbk67UhYcDEAE=   
1             4675018      1719830629903847   COfbk67UhYcDEAE=   
2             4675018      1719830629903847   COfbk67UhYcDEAE=   

   source__file__is_latest source__file__last_modified source__file__location  \
0                        1   1970-01-01 00:00:00+00:00                   None   
1                        1   1970-01-01 00:00:00+00:00                   None   
2                        1   1970-01-01 00:00:00+00:00                   None   

   ... gender  mastercategory  subcategory articletype basecolour  season  \
0  ...    Men         Apparel      Topwe




## Load metadata from a local CSV file

**(OPTIONAL) You may skip this and work with data in our public dataset.**

- In this example, you load the metadata from a CSV file and prepare the annotations.
- Use the image `filename` to map each image file to its corresponding metadata.

In [18]:
# # Load Annotations from 'data/styles.csv'

# ANNOTATIONS_PATH = "data/styles.csv"

# annotations = pd.read_csv(
#     ANNOTATIONS_PATH,
#     usecols=["id", "gender", "masterCategory", "subCategory", "articleType", "baseColour", "season", "year", "usage", "productDisplayName"],
# )

# annotations.head(3)

In [19]:
# # Preprocess columns
# annotations["baseColour"] = annotations["baseColour"].fillna('')
# annotations["season"] = annotations["season"].fillna('')
# annotations["usage"] = annotations["usage"].fillna('')
# annotations["productDisplayName"] = annotations["productDisplayName"].fillna('')

# # Add 'filename' column for each image
# annotations["filename"] = annotations["id"].apply(lambda s: str(s) + ".jpg")
# annotations = annotations.drop("id", axis=1)

In [20]:
# # Create a metadata chain (DataChain can generate a chain from a Pandas DataFrame)

# ds_meta = DataChain.from_pandas(annotations)
# ds_meta.show(3)

## Merge the original image and metadata datachains

- The `merge` method joins two chains based on the specified criteria.
- Parameters:
  - `right_ds`: Chain to join with.
  - `on`: Predicate or list of predicates to join on. If both chains use the same predicates, `on` alone is enough for the join. Otherwise, the `right_on` parameter has to specify the predicates for the other chain.
  - `right_on`: Optional predicate or list of predicates for `right_ds` to join on.
  - `inner`: Whether to run an inner join (`True`) or an outer join (`False`). Default: `False`.
  - `rname`: Name prefix for conflicting signal names. Default: "{name}_right"
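
The difference between the two join modes can be sketched in plain Python. This is not DataChain's implementation: `merge_rows` is a hypothetical helper, the rows are a tiny hand-made sample, and the sketch assumes that `inner=False` keeps every row of the left chain (left-join behaviour). That assumption is at least consistent with the row counts in this notebook, where 44441 images and 44446 metadata rows produce a 44441-row dataset.

```python
from typing import Any

Row = dict[str, Any]

images: list[Row] = [
    {"name": "10000.jpg"},
    {"name": "10001.jpg"},
    {"name": "99999.jpg"},  # hypothetical image with no metadata row
]
metadata: list[Row] = [
    {"c0": 10000, "gender": "Unisex"},
    {"c0": 10001, "gender": "Men"},
    {"c0": 10002, "gender": "Unisex"},  # metadata without a matching image here
]

# Same idea as the `map` step: derive the join key from the numeric ID.
for row in metadata:
    row["filename"] = f"{row['c0']}.jpg"


def merge_rows(
    left: list[Row], right: list[Row], on: str, right_on: str, inner: bool = False
) -> list[Row]:
    """Join `left` to `right` where left[on] == right[right_on]."""
    index = {row[right_on]: row for row in right}  # assumes unique right keys
    empty = {key: None for key in right[0]} if right else {}
    merged: list[Row] = []
    for row in left:
        match = index.get(row[on])
        if match is None:
            if inner:
                continue  # inner join: drop left rows without a match
            match = empty  # otherwise keep the row with empty metadata
        merged.append({**row, **match})
    return merged


outer = merge_rows(images, metadata, on="name", right_on="filename")
inner = merge_rows(images, metadata, on="name", right_on="filename", inner=True)
print(len(outer), len(inner))
```

In this sketch the unmatched image survives the default join with empty metadata fields and disappears from the inner join, which is the trade-off to keep in mind when choosing `inner`.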

In [24]:
ds_annotated = ds.merge(ds_meta, on="name", right_on="filename")
ds_annotated.show(3)

Processed: 44441 rows [00:01, 26269.21 rows/s]
Processed: 44446 rows [00:00, 65820.94 rows/s]


   id vtype  dir_type                         parent       name  \
0   1               0  fashion-product-images/images  10000.jpg   
1   2               0  fashion-product-images/images  10001.jpg   
2   3               0  fashion-product-images/images  10002.jpg   

               etag           version  is_latest  \
0  CPzf74/e+4YDEAE=  1719489653370876          1   
1  CKaGwIne+4YDEAE=  1719489640006438          1   
2  CKTW55fe+4YDEAE=  1719489670015780          1   

                     last_modified  size  ...  gender mastercategory  \
0 2024-06-27 12:00:53.421000+00:00  1030  ...  Unisex       Footwear   
1 2024-06-27 12:00:40.056000+00:00  1210  ...     Men       Footwear   
2 2024-06-27 12:01:10.067000+00:00   807  ...  Unisex    Accessories   

   subcategory   articletype basecolour  season    year   usage  \
0   Flip Flops    Flip Flops  Navy Blue  Winter  2012.0  Casual   
1        Shoes  Casual Shoes      Black  Summer  2013.0  Casual   
2        Socks         Socks    

# 💾 Save Dataset

Saving datasets in DataChain allows you to:

- Persist the dataset and its metadata for future use
- Version the dataset to track changes over time
- Share the dataset with others in your team or organization
- Easily load the dataset in other DataChain workflows or notebooks

By saving the annotated dataset, you ensure the metadata is stored alongside the image data, making it convenient to access and use the enriched dataset in your DataChain projects.

To save the annotated dataset, call the `.save()` method on the `ds_annotated` chain.

<img src="static/images/dataset-3.png" alt="Dataset" style="width: 600px;"/>

In [25]:
ds_annotated.save("fashion-product-images")

Processed: 44441 rows [00:01, 27774.44 rows/s]
Processed: 44446 rows [00:00, 66249.91 rows/s]


<datachain.lib.dc.DataChain at 0x332501b90>

This line of code saves the `ds_annotated` dataset as a new dataset named "fashion-product-images" in DataChain.

The `.save()` method takes the dataset name as a parameter and creates a new dataset with that name in DataChain. The saved dataset includes all the data and metadata from the original dataset, as well as the metadata signals merged in from the CSV file.

After executing this code, you will have a new dataset named "fashion-product-images" in your DataChain workspace, which contains the annotated image data. You can later load this dataset using `DataChain.from_dataset("fashion-product-images")` to access the annotated data in your DataChain workflows.

# 🔍 Explore Data

The dataset contains metadata about the images. We can view this metadata in two ways:
- using the `.show()` method
- using the `.to_pandas()` method to review it as a Pandas DataFrame

In [12]:
# Load Image Catalog

ds = DataChain.from_dataset(name="fashion-product-images")

This line creates a DataChain object named `ds` that refers to the previously saved dataset named `fashion-product-images`.

## Use DataChain API `show()`

In [13]:
ds.show(3)

   id vtype  dir_type                                             parent  \
0   1               0  Users/mikhailrozhkov/dev/products/dvcx/dvcx/ex...   
1   2               0  Users/mikhailrozhkov/dev/products/dvcx/dvcx/ex...   
2   3               0  Users/mikhailrozhkov/dev/products/dvcx/dvcx/ex...   

        name                   etag version  is_latest  \
0   9733.jpg  0x1.76bbcf3800000p+30                  1   
1  14147.jpg  0x1.76bbcc8800000p+30                  1   
2  52112.jpg  0x1.76bbcf0000000p+30                  1   

              last_modified   size  ... gender mastercategory  subcategory  \
0 2019-10-22 12:19:26+00:00   1694  ...    Men        Apparel      Topwear   
1 2019-10-22 12:16:34+00:00  10078  ...    Men    Accessories    Cufflinks   
2 2019-10-22 12:19:12+00:00  24511  ...  Women        Apparel      Topwear   

  articletype basecolour  season    year   usage  \
0      Shirts      Green    Fall  2011.0  Casual   
1   Cufflinks      Steel    Fall  2011.0  For

## Convert to Pandas DataFrame

The next cell converts the DataChain dataset (`ds`) into a pandas DataFrame (`df`), making it easier to explore the data with familiar pandas functionality.
- For example, you can review the distribution of values in the category columns.

In [14]:
df = ds.to_pandas()

print(df.shape)
print(df.columns)
df.head(3)

(44441, 36)
Index(['id', 'vtype', 'dir_type', 'parent', 'name', 'etag', 'version',
       'is_latest', 'last_modified', 'size', 'owner_name', 'owner_id',
       'random', 'location', 'source', 'file__source', 'file__parent',
       'file__name', 'file__size', 'file__version', 'file__etag',
       'file__is_latest', 'file__last_modified', 'file__location',
       'file__vtype', 'right_id', 'gender', 'mastercategory', 'subcategory',
       'articletype', 'basecolour', 'season', 'year', 'usage',
       'productdisplayname', 'filename'],
      dtype='object')


Unnamed: 0,id,vtype,dir_type,parent,name,etag,version,is_latest,last_modified,size,...,gender,mastercategory,subcategory,articletype,basecolour,season,year,usage,productdisplayname,filename
0,1,,0,Users/mikhailrozhkov/dev/products/dvcx/dvcx/ex...,9733.jpg,0x1.76bbcf3800000p+30,,1,2019-10-22 12:19:26+00:00,1694,...,Men,Apparel,Topwear,Shirts,Green,Fall,2011.0,Casual,Indian Terrain Men Chase Green Shirts,9733.jpg
1,2,,0,Users/mikhailrozhkov/dev/products/dvcx/dvcx/ex...,14147.jpg,0x1.76bbcc8800000p+30,,1,2019-10-22 12:16:34+00:00,10078,...,Men,Accessories,Cufflinks,Cufflinks,Steel,Fall,2011.0,Formal,Belmonte Men Bright Assorted Steel Cufflinks,14147.jpg
2,3,,0,Users/mikhailrozhkov/dev/products/dvcx/dvcx/ex...,52112.jpg,0x1.76bbcf0000000p+30,,1,2019-10-22 12:19:12+00:00,24511,...,Women,Apparel,Topwear,Kurtis,Multi,Summer,2012.0,Ethnic,Myntra Women Multi Coloured Kurti,52112.jpg


In [15]:
print(df.mastercategory.value_counts())
print(df.subcategory.value_counts())

mastercategory
Apparel           21395
Accessories       11289
Footwear           9222
Personal Care      2404
Free Items          105
Sporting Goods       25
Home                  1
Name: count, dtype: int64
subcategory
Topwear                     15401
Shoes                        7344
Bags                         3055
Bottomwear                   2693
Watches                      2542
Innerwear                    1808
Jewellery                    1080
Eyewear                      1073
Fragrance                    1012
Sandal                        963
Wallets                       933
Flip Flops                    915
Belts                         811
Socks                         698
Lips                          527
Dress                         478
Loungewear and Nightwear      470
Saree                         427
Nails                         329
Makeup                        307
Headwear                      293
Ties                          258
Accessories                   1

These snippets show how to use DataChain to load a dataset and get a basic understanding of it with `pandas`.

**Note**: DataChain offers functionalities beyond pandas conversion. Explore the documentation for more advanced data manipulation techniques!

# 🕵️‍♀️ Filtering Data

DataChain allows you to filter the dataset based on specific conditions.
- The `.filter()` method applies query expressions to columns.
- Use the `C` object to refer to a dataset column by name, as in `C.NAME` (e.g. `C.mastercategory`).

## Show only images with `Apparel` category

In [16]:
(
    ds.filter(C.mastercategory == "Apparel").show(3)
)


   id vtype  dir_type                                             parent  \
0   1               0  Users/mikhailrozhkov/dev/products/dvcx/dvcx/ex...   
1   3               0  Users/mikhailrozhkov/dev/products/dvcx/dvcx/ex...   
2   5               0  Users/mikhailrozhkov/dev/products/dvcx/dvcx/ex...   

        name                   etag version  is_latest  \
0   9733.jpg  0x1.76bbcf3800000p+30                  1   
1  52112.jpg  0x1.76bbcf0000000p+30                  1   
2  34297.jpg  0x1.76bbce9000000p+30                  1   

              last_modified   size  ... gender mastercategory  subcategory  \
0 2019-10-22 12:19:26+00:00   1694  ...    Men        Apparel      Topwear   
1 2019-10-22 12:19:12+00:00  24511  ...  Women        Apparel      Topwear   
2 2019-10-22 12:18:44+00:00  15617  ...  Women        Apparel      Topwear   

  articletype basecolour  season    year   usage  \
0      Shirts      Green    Fall  2011.0  Casual   
1      Kurtis      Multi  Summer  2012.0  Eth

## Show only `Topwear` products

In [17]:
(
    ds.filter(C.subcategory == "Topwear").show(3)
)

   id vtype  dir_type                                             parent  \
0   1               0  Users/mikhailrozhkov/dev/products/dvcx/dvcx/ex...   
1   3               0  Users/mikhailrozhkov/dev/products/dvcx/dvcx/ex...   
2   5               0  Users/mikhailrozhkov/dev/products/dvcx/dvcx/ex...   

        name                   etag version  is_latest  \
0   9733.jpg  0x1.76bbcf3800000p+30                  1   
1  52112.jpg  0x1.76bbcf0000000p+30                  1   
2  34297.jpg  0x1.76bbce9000000p+30                  1   

              last_modified   size  ... gender mastercategory  subcategory  \
0 2019-10-22 12:19:26+00:00   1694  ...    Men        Apparel      Topwear   
1 2019-10-22 12:19:12+00:00  24511  ...  Women        Apparel      Topwear   
2 2019-10-22 12:18:44+00:00  15617  ...  Women        Apparel      Topwear   

  articletype basecolour  season    year   usage  \
0      Shirts      Green    Fall  2011.0  Casual   
1      Kurtis      Multi  Summer  2012.0  Eth

## Chain multiple filters together

Show only `Topwear` apparel products for the `Summer` season.

In [18]:
(
    ds
    .filter(C.mastercategory == "Apparel")
    .filter(C.subcategory == "Topwear")
    .filter(C.season == "Summer")
    .to_pandas().shape
    # .show(3)
 )

(8830, 36)

You can also use a single filter with multiple expressions joined by logical operators such as `&` (AND) and `|` (OR).

In [19]:
(
    ds
    .filter((C.mastercategory == "Apparel") & (C.subcategory == "Topwear") & (C.season == "Summer"))
    .to_pandas().shape
    # .show(3)
 )

(8830, 36)
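
Chained `.filter()` calls and a single `&` expression select the same rows, which is why both cells report the same shape. The sketch below shows one way such column expressions can be built in plain Python with operator overloading. The classes `Expr` and `Column`, the `col` object, and `apply_filter` are hypothetical stand-ins, not DataChain's `C` or its query engine, and the three rows are made up for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable

Row = dict[str, Any]


@dataclass(frozen=True)
class Expr:
    """A row predicate that can be combined with & (AND) and | (OR)."""
    test: Callable[[Row], bool]

    def __and__(self, other: "Expr") -> "Expr":
        return Expr(lambda row: self.test(row) and other.test(row))

    def __or__(self, other: "Expr") -> "Expr":
        return Expr(lambda row: self.test(row) or other.test(row))


class Column:
    """Hypothetical column reference; `==` builds an Expr instead of a bool."""

    def __init__(self, name: str) -> None:
        self.name = name

    def __eq__(self, value: object) -> Expr:  # type: ignore[override]
        return Expr(lambda row: row.get(self.name) == value)


class _Columns:
    """Attribute access creates a Column, so `col.season` refers to "season"."""

    def __getattr__(self, name: str) -> Column:
        return Column(name)


col = _Columns()


def apply_filter(rows: list[Row], expr: Expr) -> list[Row]:
    return [row for row in rows if expr.test(row)]


rows: list[Row] = [
    {"name": "a.jpg", "mastercategory": "Apparel", "subcategory": "Topwear", "season": "Summer"},
    {"name": "b.jpg", "mastercategory": "Apparel", "subcategory": "Topwear", "season": "Fall"},
    {"name": "c.jpg", "mastercategory": "Footwear", "subcategory": "Shoes", "season": "Summer"},
]

# Three chained filters, applied one after another.
chained = apply_filter(rows, col.mastercategory == "Apparel")
chained = apply_filter(chained, col.subcategory == "Topwear")
chained = apply_filter(chained, col.season == "Summer")

# One filter with the same conditions joined by &.
combined = apply_filter(
    rows,
    (col.mastercategory == "Apparel")
    & (col.subcategory == "Topwear")
    & (col.season == "Summer"),
)
print([row["name"] for row in combined])
```

The parentheses around each comparison matter in any library that overloads these operators, because Python's `&` and `|` bind more tightly than `==`.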

## Save Dataset

Let's save the filtered chain as `fashion-topwear` to make it versioned and reusable.

In [20]:
(
    DataChain(name="fashion-product-images")
    .filter(C.mastercategory == "Apparel")
    .filter(C.subcategory == "Topwear")
    .save("fashion-topwear")
    .to_pandas().shape
)

(15401, 36)

# ☁️ Run in Studio (SaaS)

<a href="https://dvc.ai/">
    <img src="static/images/studio.png" alt="DataChain Studio SaaS" style="width: 600px;"/>
</a>

To run these examples in Studio, follow this guide:

1. Open Studio / YOUR_TEAM / `datasets` workspace
2. Create a new Python Script
3. Copy/paste the script from `scripts/1-quick-start.py`
4. Click the Run button


# 🎉 Summary 

👏 **Congratulations on completing this tutorial! You're a DataChain superstar! 🌟** You've taken the first steps in harnessing the power of DataChain for your computer vision projects. In this tutorial, we covered:
- Creating the `fashion-product-images` dataset from existing images
- Filtering the dataset based on specific conditions
- Essential DataChain methods:
    - `.show()` for displaying dataset samples
    - `.to_pandas()` for converting datasets to Pandas DataFrames
    - `.filter()` for applying custom filters to datasets
    - `.map()` for generating new columns such as `filename`
    - `.merge()` for attaching metadata to images

But this is just the beginning! DataChain offers many features for streamlining your ML workflows, including data transformations, versioning, and much more. 🚀

## What's Next?

Excited to learn more? Check out the next parts of our tutorial series:
- 📂 Saving and Versioning Datasets 
- 🧩 Splitting Datasets for Training, Validation, and Testing
- 🎨 Generating and Managing Embeddings
- 🔍 Performing Similarity Search
- 🧹 Finding and Removing Redundant Images
- 🧠 Training Models
- 🔮 Running Inference and Saving Predictions
- 📊 Analyzing Predictions

By mastering these techniques, you'll be well on your way to building powerful and efficient computer vision pipelines with DataChain.

## 🤝 Get Involved

We'd love to have you join our growing community of DataChain users and contributors! Here's how you can get involved:

- ⭐ Give us a star on [GitHub](https://github.com/iterative/dvcx) to show your support
- 🌐 Visit the [dvc.ai website](https://dvc.ai/) to learn more about our products and services
- 📞 Contact us to discuss scaling 🚀 DataChain for your project!
- 🙌 Follow us on [LinkedIn](https://www.linkedin.com/company/dvc-ai/) and [Twitter](https://x.com/DVCorg) for the latest updates and insights

Thanks for choosing DataChain, and happy coding! 😄