# Data pipeline

This pipeline downloads the original dataset and transforms it into the final dataset used by the project. We also use DVC to version the data at every step.

### 0. Necessary imports and functions

In [1]:
import sys
import os
from pathlib import Path
import yaml


# Make the repository importable; PATH_TO_REPO must be set in the environment
sys.path.append(str(Path(os.environ['PATH_TO_REPO'])))

from person_image_segmentation.utils.dataset_utils import download_dataset, split_dataset, transform_masks, generate_labels
from person_image_segmentation.config import DATASET_LINK, RAW_DATA_DIR, SPLIT_DATA_DIR, TRANSFORM_DATA_DIR, LABELS_DATA_DIR, TRAIN_SIZE, VAL_SIZE, TEST_SIZE, KAGGLE_KEY, KAGGLE_USERNAME

2024-10-30 17:38:26.348 | INFO     | person_image_segmentation.config:<module>:21 - PROJ_ROOT path is: /Users/nachogris/Desktop/UNI/GCED/4_QUART/TAED2/LAB/TAED2_YOLOs


### 1. Variable definition

In [2]:
# Create data directory if it does not exist
RAW_DATA_DIR.mkdir(parents=True, exist_ok=True)

### 2. Working with the original data

The first step is to download the original data from Kaggle, which we do through the Kaggle API.

In [3]:
# Download the dataset
download_dataset(
    dataset_link=DATASET_LINK,
    data_dir=RAW_DATA_DIR,
)

Dataset URL: https://www.kaggle.com/datasets/mariarisques/dataset-person-yolos
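
For readers unfamiliar with the Kaggle client, the following is a minimal sketch of what a helper such as `download_dataset` might do internally. It is not the project's actual implementation. It assumes the official `kaggle` package is installed and that credentials are available (for example through the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables). The names `dataset_slug_from_url` and `download_dataset_sketch` are hypothetical, and `DATASET_LINK` may already be stored as a slug rather than a full URL.

```python
from pathlib import Path
from urllib.parse import urlparse


def dataset_slug_from_url(url: str) -> str:
    """Extract the 'owner/name' slug from a Kaggle dataset URL (hypothetical helper)."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    # Expected path shape: /datasets/<owner>/<name>
    if len(parts) < 3 or parts[0] != "datasets":
        raise ValueError(f"Not a Kaggle dataset URL: {url}")
    return f"{parts[1]}/{parts[2]}"


def download_dataset_sketch(dataset_url: str, data_dir: Path) -> None:
    """Download and unzip a Kaggle dataset into data_dir (hypothetical helper)."""
    # Imported lazily because the kaggle package may try to authenticate on import.
    from kaggle.api.kaggle_api_extended import KaggleApi

    api = KaggleApi()
    api.authenticate()
    data_dir.mkdir(parents=True, exist_ok=True)
    api.dataset_download_files(
        dataset_slug_from_url(dataset_url),
        path=str(data_dir),
        unzip=True,
    )
```

Keeping the URL parsing separate from the network call makes the parsing step easy to check without Kaggle credentials.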


### 3. Split data

In this section we split the data into train, validation and test sets.

In [4]:
# Split the dataset
split_dataset(
    train_size=TRAIN_SIZE,
    val_size=VAL_SIZE,
    data_dir=RAW_DATA_DIR,
    split_dir=SPLIT_DATA_DIR,
)
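
To illustrate the idea behind the split, here is a minimal sketch that shuffles a list of file names and divides it by fractions. It is only an assumption about how a function like `split_dataset` could behave: the helper name `split_items`, the `seed` parameter, and the rule that the test set receives whatever remains after train and validation are all hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def split_items(
    items: Sequence[str],
    train_size: float,
    val_size: float,
    seed: int = 42,
) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle items reproducibly and split them into train/val/test lists."""
    if train_size < 0 or val_size < 0 or train_size + val_size > 1:
        raise ValueError("train_size and val_size must be >= 0 and sum to at most 1")

    shuffled = list(items)
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(shuffled)

    n_train = int(len(shuffled) * train_size)
    n_val = int(len(shuffled) * val_size)

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


# Example usage with made-up file names
names = [f"img_{i}.jpg" for i in range(100)]
train, val, test = split_items(names, train_size=0.8, val_size=0.1)
```

Fixing the seed means that re-running the pipeline produces the same partition, which fits well with versioning the data at each step.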

### 4. Transformations

In this step we convert the original masks into a format that can later be turned into labels that YOLO understands.

In [5]:
# Transform the masks
transform_masks(SPLIT_DATA_DIR, TRANSFORM_DATA_DIR)

### 5. Create labels

In this last step we convert the transformed masks into labels.

In [6]:
# Generate the labels
generate_labels(TRANSFORM_DATA_DIR, LABELS_DATA_DIR, SPLIT_DATA_DIR)
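
As a rough illustration of what a label can look like, the sketch below formats one polygon as a line in the segmentation label format used by Ultralytics YOLO: a class index followed by polygon vertices normalized to the range 0 to 1. This assumes the project produces segmentation labels of that kind; the internals of `generate_labels` are not shown here, and the helper `polygon_to_yolo_line` is hypothetical. Extracting the polygon from a mask (for example with a contour-finding routine) is left out.

```python
from typing import Sequence, Tuple


def polygon_to_yolo_line(
    polygon: Sequence[Tuple[float, float]],
    image_width: int,
    image_height: int,
    class_id: int = 0,
) -> str:
    """Format a pixel-coordinate polygon as one YOLO segmentation label line."""
    if len(polygon) < 3:
        raise ValueError("A polygon needs at least three vertices")

    coords = []
    for x, y in polygon:
        # Normalize x by the image width and y by the image height
        coords.append(f"{x / image_width:.6f}")
        coords.append(f"{y / image_height:.6f}")
    return " ".join([str(class_id), *coords])


# Example usage: a triangle in a 100x200 image, class 0 (assumed to mean "person")
line = polygon_to_yolo_line([(0, 0), (50, 0), (50, 100)], image_width=100, image_height=200)
```

Each image would then get one text file with one such line per object.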