# EDA

In this notebook, we will download a sample of the BDD100K semantic segmentation dataset and use W&B Artifacts and Tables to version and analyze our data.

In [1]:
DEBUG = False # set this flag to True to use a small subset of data for testing

In [2]:
import os
from fastai.vision.all import *

import wandb

In [3]:
class Config:
    WANDB_PROJECT = "mlops-course-001"
    ENTITY = None # set this to team name if working in a team
    BDD_CLASSES = {i: c for i, c in enumerate(['background', 'road', 'traffic light', 'traffic sign',
                                              'person', 'vehicle', 'bicycle'])}
    RAW_DATA_AT = 'bdd_simple_1k'
    PROCESSED_DATA_AT = 'bdd_simple_1k_split'
    
PARAMS = Config()

We have defined some global configuration parameters in the ```Config``` class. Set ```ENTITY``` to your W&B team name if you work in a team; leave it as ```None``` if you work individually.
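As a quick illustration of what ```BDD_CLASSES``` holds, the sketch below rebuilds the same mapping outside the class. The names ```class_names``` and ```bdd_classes``` are illustrative only. It assumes, consistent with how the mapping is later passed alongside the mask data, that each integer key is the pixel value used for that class in the label masks.

```python
# Rebuild the id -> name mapping defined in Config.BDD_CLASSES.
# `class_names` and `bdd_classes` are illustrative names, not part of the notebook.
class_names = ['background', 'road', 'traffic light', 'traffic sign',
               'person', 'vehicle', 'bicycle']
bdd_classes = {i: c for i, c in enumerate(class_names)}

print(bdd_classes[0])    # background
print(bdd_classes[5])    # vehicle
print(len(bdd_classes))  # 7
```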

In the section below, we'll use the ```untar_data``` function from ```fastai``` to download and unzip our dataset.

In [4]:
URL = 'https://storage.googleapis.com/wandb_course/bdd_simple_1k.zip'

In [5]:
path = Path(untar_data(URL, force_download=True))

In [6]:
os.listdir(path)

['images', 'labels', 'LICENSE.txt']

In [7]:
path.ls()

(#3) [Path('/home/studio-lab-user/.fastai/data/bdd_simple_1k/images'),Path('/home/studio-lab-user/.fastai/data/bdd_simple_1k/labels'),Path('/home/studio-lab-user/.fastai/data/bdd_simple_1k/LICENSE.txt')]

Next, we define several functions to help process the data and upload it as a ```Table``` to W&B.

In [8]:
def label_func(fname):
    return (fname.parent.parent/"labels")/f"{fname.stem}_mask.png"


def get_classes_per_image(mask_data, class_labels):
    unique = list(np.unique(mask_data))
    result_dict = {}
    for _class in class_labels.keys():
        result_dict[class_labels[_class]] = int(_class in unique)
    return result_dict


def _create_table(image_files, class_labels):
    "create a table with the dataset"
    labels = [str(class_labels[_lab]) for _lab in list(class_labels)]
    table = wandb.Table(
        columns=['File_Name', 'P1', 'P2', 'Images', 'Dataset'] + labels
    )
    
    for i, image_file in progress_bar(enumerate(image_files), total=len(image_files)):
        image = Image.open(image_file)
        mask_data = np.array(Image.open(label_func(image_file)))
        class_in_image = get_classes_per_image(mask_data, class_labels)
        
        table.add_data(
            str(image_file.name),
            image_file.stem.split('-')[0],
            image_file.stem.split('-')[1],
            wandb.Image(
                image,
                masks={
                    "predictions": {
                        "mask_data": mask_data,
                        "class_labels": class_labels,
                    }
                }
            ),
            "bdd1k", # we don't have a dataset split yet
            *[class_in_image[_lab] for _lab in labels]
        )
        
    return table
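To see what the two helper functions produce without touching the real dataset, here is a minimal sketch that runs them on toy inputs. The file name ```abc-123.jpg```, the ```data/``` prefix, the shortened class mapping, and the tiny mask are all hypothetical. ```get_classes_per_image``` is written slightly more compactly here, but is intended to behave the same as the version above.

```python
from pathlib import Path
import numpy as np


def label_func(fname):
    # Map images/<stem>.jpg to labels/<stem>_mask.png
    return (fname.parent.parent / "labels") / f"{fname.stem}_mask.png"


def get_classes_per_image(mask_data, class_labels):
    # 1 if the class id occurs anywhere in the mask, else 0
    unique = list(np.unique(mask_data))
    return {name: int(idx in unique) for idx, name in class_labels.items()}


# Shortened, hypothetical class mapping for the demo
class_labels = {0: 'background', 1: 'road', 2: 'traffic light'}

# Hypothetical image path
image_file = Path("data/images/abc-123.jpg")
mask_path = label_func(image_file)
print(mask_path.as_posix())  # data/labels/abc-123_mask.png

# Toy 2x3 mask containing only class ids 0 and 1
toy_mask = np.array([[0, 0, 1], [1, 1, 0]], dtype=np.uint8)
presence = get_classes_per_image(toy_mask, class_labels)
print(presence)  # {'background': 1, 'road': 1, 'traffic light': 0}
```

The resulting 0/1 flags are what end up in the per-class columns of the table, which makes it possible to filter images by the classes they contain.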

We will start a new W&B ```run``` and put everything into a raw data Artifact.

In [9]:
run = wandb.init(
    project=PARAMS.WANDB_PROJECT,
    entity=PARAMS.ENTITY,
    job_type="upload"
)

raw_data = wandb.Artifact(
    PARAMS.RAW_DATA_AT,
    type='raw_data'
)

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: samu2505. Use `wandb login --relogin` to force relogin


In [10]:
raw_data.add_file(path/'LICENSE.txt', name='LICENSE.txt')

<ManifestEntry digest: X+6ZFkDOlnKesJCNt20yRg==>

Let's add the images and label masks to the artifact.

In [11]:
raw_data.add_dir(path/'images', name='images')
raw_data.add_dir(path/'labels', name='labels')

wandb: Adding directory to artifact (/home/studio-lab-user/.fastai/data/bdd_simple_1k/images)... Done. 0.8s
wandb: Adding directory to artifact (/home/studio-lab-user/.fastai/data/bdd_simple_1k/labels)... Done. 0.4s


Let's get the file names of the images in our dataset and use the function we defined above to create a W&B Table.

In [12]:
image_files = get_image_files(path/'images', recurse=False)

# sample a subset if DEBUG
if DEBUG:
    image_files = image_files[:10]

In [13]:
table = _create_table(image_files, PARAMS.BDD_CLASSES)
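To make the table layout concrete, the following sketch builds the column list and a single row in plain Python, mirroring the logic of ```_create_table``` without calling W&B. The file name, the shortened class mapping, and the presence flags are hypothetical, and a placeholder string stands in for the ```wandb.Image``` object.

```python
from pathlib import Path

# Shortened, hypothetical class mapping for the demo
class_labels = {0: 'background', 1: 'road', 2: 'person'}
labels = [str(name) for name in class_labels.values()]
columns = ['File_Name', 'P1', 'P2', 'Images', 'Dataset'] + labels

# Hypothetical file name; the stem is split on '-' into P1 and P2
image_file = Path("images/0a1b2c3d-4e5f6a7b.jpg")
p1, p2 = image_file.stem.split('-')[:2]

# Hypothetical presence flags, as get_classes_per_image would return them
class_in_image = {'background': 1, 'road': 1, 'person': 0}

row = [
    image_file.name,
    p1,
    p2,
    "<wandb.Image placeholder>",  # the real table stores a wandb.Image with its mask
    "bdd1k",                      # no dataset split yet
    *[class_in_image[lab] for lab in labels],
]

# Every row must supply exactly one value per column
assert len(row) == len(columns)
print(dict(zip(columns, row)))
```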

Finally, we will add the Table to our artifact, log it to W&B, and finish our run.

In [14]:
raw_data.add(table, "eda_table")

<ManifestEntry digest: KD3UiqjzkaA+ujwEer9dag==>

In [15]:
run.log_artifact(raw_data)
run.finish()

846.591 MB of 846.591 MB uploaded (846.005 MB deduped)