<a href="https://colab.research.google.com/github/star-whale/starwhale/blob/main/example/notebooks/dataset-sdk.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Installing Starwhale

Starwhale can be installed with `pip`. By default, the base package does not install the optional audio and image dependencies; the extras listed below add them.

In [None]:
%%bash

pip install "starwhale[all]"  # install Starwhale with all optional dependencies: audio and image
# pip install "starwhale[image]"   # --> install image dependencies: pillow
# pip install "starwhale[audio]"   # --> install audio dependencies: soundfile
# pip install starwhale     # --> install basic dependencies

# 2. Building CIFAR10 Dataset

The [CIFAR10](https://www.cs.toronto.edu/~kriz/cifar.html) dataset is an image dataset of 60,000 32x32 color images in 10 classes.

## 2.1 Downloading the original dataset

In [None]:
%%bash

rm -rf data && mkdir data
curl -o data/cifar.tar.gz https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz 
tar -xzf data/cifar.tar.gz -C data
rm -rf data/cifar.tar.gz
ls data/cifar-10-batches-py

## 2.2 Building Starwhale Dataset

### 2.2.1 Creating dataset object

In [None]:
from starwhale import dataset

ds = dataset("cifar10", create="empty")
print(ds)

### 2.2.2 Loading original dataset content

In [None]:
import os
import pickle
from pathlib import Path

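root_dir = Path(os.path.abspath('')) / "data" / "cifar-10-batches-py"
# Default encoding: keys of batches.meta load as str (e.g. "label_names").
with (root_dir / "batches.meta").open("rb") as f:
    meta = pickle.load(f)
# encoding="bytes": keys of each training batch load as bytes (e.g. b"data").
train_data_contents = []
for i in range(1, 6):
    with (root_dir / f"data_batch_{i}").open("rb") as f:
        train_data_contents.append(pickle.load(f, encoding="bytes"))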
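
The two `pickle.load` calls above use different `encoding` values, which is why the metadata is later indexed with `"label_names"` while the batches are indexed with `b"data"`. The sketch below illustrates the effect on a tiny hand-written pickle in the legacy Python 2 text format. The payload `PY2_PICKLE` is made up for illustration and is not part of the CIFAR10 archive.

```python
import pickle

# Hand-written protocol-0 pickle of {'data': 1} as Python 2 would emit it.
# The key is stored with the legacy STRING opcode (a Python 2 `str`).
# This payload is a hypothetical stand-in, not a file from CIFAR10.
PY2_PICKLE = b"(dp0\nS'data'\np1\nI1\ns."

# Default encoding ("ASCII"): legacy strings are decoded to str.
as_text = pickle.loads(PY2_PICKLE)

# encoding="bytes": legacy strings are returned as raw bytes objects.
as_bytes = pickle.loads(PY2_PICKLE, encoding="bytes")

print(as_text)   # {'data': 1}
print(as_bytes)  # {b'data': 1}
```

Either encoding can work as long as the dictionary keys used afterwards match the one chosen at load time.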

### 2.2.3 Appending to dataset

In [None]:
import io
from PIL import Image as PILImage
from starwhale import Image, MIMEType

for content in train_data_contents:
    for data, label, filename in zip(content[b"data"], content[b"labels"], content[b"filenames"]):
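        # Each row stores 3072 values channel-first; convert to height x width x channel.
        image_array = data.reshape(3, 32, 32).transpose(1, 2, 0)
        # Encode the array as PNG bytes in memory.
        image_bytes = io.BytesIO()
        PILImage.fromarray(image_array).save(image_bytes, format="PNG")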

        image_data = Image(fp=image_bytes.getvalue(), display_name=filename.decode(), shape=image_array.shape, mime_type=MIMEType.PNG)
        ds.append({"label": label, "display_name": meta["label_names"][label], "image": image_data})
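
The `reshape(3, 32, 32).transpose(1, 2, 0)` step works because each CIFAR10 row stores 1024 red values, then 1024 green, then 1024 blue, each plane in row-major order. The following pure-Python sketch spells out the same index mapping without NumPy. The helpers `flat_index` and `to_hwc` are hypothetical names introduced only for this illustration.

```python
CHANNELS, HEIGHT, WIDTH = 3, 32, 32


def flat_index(channel: int, row: int, col: int) -> int:
    """Offset of one channel value inside a 3072-value channel-first row."""
    return channel * HEIGHT * WIDTH + row * WIDTH + col


def to_hwc(flat):
    """Rearrange a channel-first flat row into nested [row][col][channel] lists."""
    return [
        [[flat[flat_index(c, r, w)] for c in range(CHANNELS)] for w in range(WIDTH)]
        for r in range(HEIGHT)
    ]


# Synthetic row in which every value equals its own offset,
# so the rearrangement is easy to read off.
flat = list(range(CHANNELS * HEIGHT * WIDTH))
pixels = to_hwc(flat)

print(pixels[0][0])    # [0, 1024, 2048]
print(pixels[31][31])  # [1023, 2047, 3071]
```

The top-left pixel draws its three channel values from offsets 0, 1024, and 2048, which is the height x width x channel layout Pillow expects for an RGB image.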

### 2.2.4 Committing and closing the dataset

In [None]:
ds.commit()
ds.close()

## 2.3 Using swcli to find CIFAR10 dataset

In [None]:
!swcli dataset list

# 3. Loading Starwhale Dataset

In [None]:
from starwhale import dataset

ds = dataset("cifar10/version/latest")
print(ds)

## 3.1 Showing dataset summary

In [None]:
# get dataset summary
ds.summary()

In [None]:
# get dataset rows count
len(ds)

## 3.2 Fetching data rows

In [None]:
# get first dataset row
ds[0]

In [None]:
# get pillow object
ds[0].features.image.to_pil()
# or ds[0].features["image"].to_pil()

In [None]:
# slice the dataset and collect the first 10 rows
rows = list(ds[:10])
len(rows)

## 3.3 To PyTorch Dataset

A Starwhale Dataset can be converted into a PyTorch dataset automatically. PyTorch is not a dependency of the Starwhale package, so install it with pip before running the code.

In [None]:
!pip install torch

In [None]:
torch_ds = ds.to_pytorch()
print(torch_ds)

In [None]:
import torch.utils.data
torch_loader = torch.utils.data.DataLoader(torch_ds, batch_size=5)
item = next(iter(torch_loader))
print(item)

## 3.4 To TensorFlow Dataset

A Starwhale Dataset can be converted into a TensorFlow dataset automatically. TensorFlow is not a dependency of the Starwhale package, so install it with pip before running the code.

In [None]:
!pip install tensorflow

In [None]:
tf_ds = ds.to_tensorflow()
print(tf_ds)

In [None]:
import tensorflow as tf
batch_ds = tf_ds.batch(5, drop_remainder=True)
items = list(batch_ds.take(2))
print(items)
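
The two batching calls in sections 3.3 and 3.4 differ in how they treat a final short batch: `DataLoader` keeps it by default (`drop_last=False`), while `batch(5, drop_remainder=True)` discards it. The sketch below mimics both behaviors in plain Python. The helper `batched` is a hypothetical illustration, not the implementation used by either framework.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(rows: Iterable[T], batch_size: int, drop_remainder: bool = False) -> Iterator[List[T]]:
    """Group rows into lists of batch_size, optionally dropping a short tail."""
    batch: List[T] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # A leftover batch is shorter than batch_size; emit it only if allowed.
    if batch and not drop_remainder:
        yield batch


rows = list(range(12))  # 12 rows do not divide evenly by 5
kept = list(batched(rows, 5))
dropped = list(batched(rows, 5, drop_remainder=True))

print([len(b) for b in kept])     # [5, 5, 2]
print([len(b) for b in dropped])  # [5, 5]
```

When the row count is a multiple of the batch size, both settings produce the same batches.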

🍺 Congratulations! You just learned how to use the Starwhale SDK to build and load a dataset. 👍