# HuggingFace Dataset Setup

This notebook downloads the [Liver Tumor Segmentation dataset images](https://www.kaggle.com/datasets/trpakov/liver-cancer-segmentation) from Kaggle, converts them into a [HuggingFace Dataset object](https://huggingface.co/docs/datasets/index), and pushes the result to the [HuggingFace Hub](https://huggingface.co/docs/hub/index) so that it is easier to use during model fine-tuning.

In [None]:
# Installation of required Python packages
!pip install transformers datasets kaggle

In [None]:
# Importing required Python packages
from pathlib import Path

from huggingface_hub import login
from datasets import Dataset, DatasetDict, Image

In [None]:
# Authentication with HuggingFace
login(token='PASTE_HF_TOKEN_HERE', add_to_git_credential=True)

In [None]:
# Kaggle credentials
%env KAGGLE_USERNAME=ENTER_KAGGLE_USERNAME
%env KAGGLE_KEY=ENTER_KAGGLE_KEY
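
Because the download step fails with an authentication error when the placeholders above are left unchanged, a quick check before calling the CLI can save a round trip. The following is a minimal sketch, not part of the original notebook: the helper name `missing_credentials` is hypothetical, and it assumes that unset credentials are either empty or still start with the `ENTER_` placeholder prefix used in the cell above.

```python
import os


def missing_credentials(names=("KAGGLE_USERNAME", "KAGGLE_KEY"), env=None):
    """Return the credential variables that are unset or still placeholders.

    Hypothetical helper: the "ENTER_" prefix check is an assumption based on
    the placeholder values used in this notebook.
    """
    env = os.environ if env is None else env
    missing = []
    for name in names:
        value = env.get(name, "")
        if not value or value.startswith("ENTER_"):
            missing.append(name)
    return missing


# Report problems early instead of waiting for the CLI to fail
missing = missing_credentials()
if missing:
    print("Set these before downloading:", ", ".join(missing))
```

Passing a plain dict as `env` makes the check easy to try without touching the real environment.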

In [None]:
# Download the dataset using the kaggle CLI tool
!kaggle datasets download trpakov/liver-cancer-segmentation

Downloading liver-cancer-segmentation.zip to /content
100% 4.22G/4.23G [00:33<00:00, 111MB/s]
100% 4.23G/4.23G [00:33<00:00, 137MB/s]


In [None]:
# Unzip the downloaded archive
!unzip liver-cancer-segmentation.zip > /dev/null
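
The cells that follow read from a `liver-segmentation` folder, while the archive is named `liver-cancer-segmentation.zip`, so it may be worth confirming what the archive actually contains. Below is a minimal sketch using the standard-library `zipfile` module; the helper name `top_level_entries` is hypothetical and is not part of the original notebook.

```python
import zipfile


def top_level_entries(archive_path):
    """Return the sorted top-level file and folder names stored in a zip archive."""
    with zipfile.ZipFile(archive_path) as archive:
        # Member names always use "/" as the separator inside a zip file
        return sorted({name.split("/", 1)[0] for name in archive.namelist()})
```

Calling `top_level_entries('liver-cancer-segmentation.zip')` after the download should list the folder name that the path-building cells need to use.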

In [None]:
# Create a dict with the paths of the training images and their segmentation masks
train_dict = {'image': [], 'annotation': []}
for path in Path('liver-segmentation/train/images').glob('*'):
  train_dict['image'].append(path.as_posix())
  train_dict['annotation'].append((path.parent.parent / 'masks' / f'{path.stem}_mask.png').as_posix())

In [None]:
# Create a dict with the paths of the validation images and their segmentation masks
val_dict = {'image': [], 'annotation': []}
for path in Path('liver-segmentation/val/images').glob('*'):
  val_dict['image'].append(path.as_posix())
  val_dict['annotation'].append((path.parent.parent / 'masks' / f'{path.stem}_mask.png').as_posix())

In [None]:
# Create a dict with the paths of the test images and their segmentation masks
test_dict = {'image': [], 'annotation': []}
for path in Path('liver-segmentation/test/images').glob('*'):
  test_dict['image'].append(path.as_posix())
  test_dict['annotation'].append((path.parent.parent / 'masks' / f'{path.stem}_mask.png').as_posix())
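
The three cells above repeat the same loop for each split, so they could be folded into one function. The sketch below is one possible way to do that, not the notebook's own code: `collect_pairs` is a hypothetical name, and it differs from the cells above in two ways that are assumptions about what is useful here. It sorts the image paths so that row order is reproducible, and it skips images whose `<stem>_mask.png` file is missing instead of recording a path that does not exist.

```python
from pathlib import Path


def collect_pairs(split_dir):
    """Build the {'image': [...], 'annotation': [...]} dict for one split folder.

    Hypothetical helper: expects `split_dir` to contain `images/` and `masks/`,
    with each mask named `<image stem>_mask.png`, as in the cells above.
    """
    split_dir = Path(split_dir)
    pairs = {'image': [], 'annotation': []}
    # Sort so that the row order is the same on every run
    for image_path in sorted((split_dir / 'images').glob('*')):
        mask_path = split_dir / 'masks' / f'{image_path.stem}_mask.png'
        if not mask_path.is_file():
            # Skip unpaired images rather than storing a dangling mask path
            print(f'Skipping {image_path.name}: no mask at {mask_path}')
            continue
        pairs['image'].append(image_path.as_posix())
        pairs['annotation'].append(mask_path.as_posix())
    return pairs
```

With this helper, each split reduces to one call, for example `collect_pairs('liver-segmentation/train')`, and the result can be passed to `Dataset.from_dict` in the same way as the dicts built above.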

In [None]:
# Create the Dataset objects for each set
train = Dataset.from_dict(train_dict).cast_column("image", Image()).cast_column("annotation", Image())
val = Dataset.from_dict(val_dict).cast_column("image", Image()).cast_column("annotation", Image())
test = Dataset.from_dict(test_dict).cast_column("image", Image()).cast_column("annotation", Image())

In [None]:
# Create a DatasetDict object that can be used as a common interface for the different data subsets
ds = DatasetDict(train=train, val=val, test=test)

In [None]:
# Push the datasets to HuggingFace
ds.push_to_hub('liver-cancer-segmentation', private=True)