# Data preparation

In this notebook we will prepare the data to later train our deep learning model. To do so, we will:
- upload a copy of our raw dataset as a `wandb.Artifact`
- preprocess the dataset and set up the target column to train a classifier
- split the data and save the splits into a `wandb.Artifact`

In [None]:
import os, json
from pathlib import Path
import pandas as pd

import wandb

We will define some global configuration parameters. `ENTITY` should be your W&B team name if you work in a team; replace it with `None` if you work individually.

In [None]:
PROJECT_NAME = 'lemon-project'
ENTITY = 'wandb_course'
RAW_DATA_FOLDER = 'lemon-dataset/images'
ANNOTATIONS_FILE = 'lemon-dataset/annotations/instances_default.json'
PREFIX = 'lemon_dataset'

In [None]:
RAW_DATA_AT = f'{PREFIX}_raw_data'
RAW_DATA_AT

In [None]:
PROCESSED_DATA_AT = f'{PREFIX}_split_data'
PROCESSED_DATA_AT

## Register raw data as an artifact

It is good practice to save a copy of the raw dataset so that you can reproduce your experiments later and track the model lineage.

In [None]:
run = wandb.init(project=PROJECT_NAME, entity=ENTITY, job_type="upload")

create an artifact for all the raw data

In [None]:
raw_data_at = wandb.Artifact(RAW_DATA_AT, type="raw_data")

add all images in the directory to the artifact

In [None]:
raw_data_at.add_dir(RAW_DATA_FOLDER, name='images')

add annotations file to the artifact

In [None]:
raw_data_at.add_file(ANNOTATIONS_FILE, name='annotations/instances_default.json')

save artifact to W&B

In [None]:
run.log_artifact(raw_data_at)

finalize run

In [None]:
run.finish()

## Pre-process data for binary classification

The image annotations live in the `instances_default.json` file. We will select the relevant columns and create a new file to store our target column.

We will first grab a copy of the raw dataset from the previously logged `wandb.Artifact`. Why do we do this? Don't worry, wandb automatically recognizes that the dataset is already here. Using the artifact in this run links the data lineage of the *new* preprocessed dataset to the raw dataset version that was used to create it. This is very handy for dataset traceability.

We first create a new run. Specifying the `job_type` lets us filter runs later on.

In [None]:
run = wandb.init(project=PROJECT_NAME, entity=ENTITY, job_type="data_split")

find the most recent ("latest") version of the full raw data

In [None]:
raw_data_at = run.use_artifact(f'{RAW_DATA_AT}:latest')
dataset_dir = Path(raw_data_at.download())

In [None]:
# use a context manager so the file handle is closed after reading
with open(dataset_dir / 'annotations/instances_default.json') as f:
    data = json.load(f)

we then convert the `annotations` and `images` entries of the loaded `json` file to pandas `DataFrame`s

In [None]:
annotations = pd.DataFrame.from_dict(data['annotations'])
images = pd.DataFrame.from_dict(data['images'])

In [None]:
annotations.head()

In [None]:
images.head()

In [None]:
df = annotations[['image_id', 'category_id']].groupby('image_id')['category_id'].apply(lambda x: list(set(x))).reset_index()
df.head()

In [None]:
df['mold'] = df['category_id'].apply(lambda x: 4 in x)
df['mold'].value_counts()
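
The two cells above collapse the per-region annotations into one row per image and flag an image as moldy when category id `4` appears among its annotations. As a rough, dependency-free illustration of the same logic, here is a sketch on a toy annotation list. The values in `toy_annotations` and the name `MOLD_CATEGORY_ID` are made up for this example; only the id `4` comes from the cell above.

```python
from collections import defaultdict

MOLD_CATEGORY_ID = 4  # the id the cell above tests for with `4 in x`

# Toy COCO-style annotations (hypothetical values): one entry per annotated region.
toy_annotations = [
    {"image_id": 1, "category_id": 2},
    {"image_id": 1, "category_id": 4},
    {"image_id": 1, "category_id": 4},
    {"image_id": 2, "category_id": 3},
]

# Collect the distinct category ids seen on each image
# (the equivalent of the groupby + list(set(...)) step).
categories_per_image = defaultdict(set)
for ann in toy_annotations:
    categories_per_image[ann["image_id"]].add(ann["category_id"])

# An image is labelled as moldy if any of its annotations has the mold category.
has_mold = {
    image_id: MOLD_CATEGORY_ID in cats
    for image_id, cats in categories_per_image.items()
}
print(has_mold)
```

Note that an image with several mold regions still yields a single `True`, which is what we want for a binary classifier.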

In [None]:
df = pd.merge(df, images[['id', 'file_name']], left_on='image_id', right_on='id')

In [None]:
del df['id']

In [None]:
df['fruit_id'] = df['file_name'].apply(lambda x: x.split('/')[1].split('_')[0])
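
The cell above assumes that `file_name` has the form `<folder>/<fruit id>_<rest>`. A small sketch of that parsing on its own, using a hypothetical file name and a hypothetical helper `fruit_id_from_file_name` (neither comes from the dataset):

```python
def fruit_id_from_file_name(file_name: str) -> str:
    # Same parsing as the cell above: take the second path component,
    # then keep the text before the first underscore.
    return file_name.split('/')[1].split('_')[0]

example_file_name = 'images/0003_A_H_15_A.jpg'  # hypothetical file name
example_fruit_id = fruit_id_from_file_name(example_file_name)
print(example_fruit_id)
```

If a file name has no folder prefix, the `[1]` lookup raises an `IndexError`, so it is worth checking `df['file_name'].head()` before relying on this pattern.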

In [None]:
df.head()

## Train / validation / test split

Now we will split the data into train (80%), validation (10%) and test (10%) sets. As we do that, we need to be careful to: 
- *avoid leakage*: for that reason we are grouping data according to each fruit id (we don't want the classifier to memorize which individual fruit contains mold, but generalize across different lemons)
- handle the *label imbalance*: for that reason we stratify data with our target column

We will use sklearn's `StratifiedGroupKFold` to split the data into 10 folds and assign 1 fold for test, 1 for validation and the rest for training. 

In [None]:
df['fold'] = -1

In [None]:
from sklearn.model_selection import StratifiedGroupKFold

X = df.index.values
y = df.mold.values
groups = df.fruit_id.values

cv = StratifiedGroupKFold(n_splits=10, random_state=42, shuffle=True)
for i, (train_idxs, test_idxs) in enumerate(cv.split(X, y, groups)):
    # assign positionally on the DataFrame itself to avoid chained assignment
    df.iloc[test_idxs, df.columns.get_loc('fold')] = i

In [None]:
df['stage'] = df['fold'].apply(lambda x: 'test' if x == 0 else ('valid' if x == 1 else 'train'))

In [None]:
df.to_csv('data_split.csv', index=False)
df.head()
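
Before logging the split, it can be worth verifying the two properties discussed above: no fruit appears in more than one stage, and the share of moldy images is similar across stages. The following is a minimal sketch of such a check in plain Python. The helper names `find_leaky_groups` and `positive_rate_per_stage` and the toy data are illustrative assumptions, not part of the notebook.

```python
from collections import Counter, defaultdict

def find_leaky_groups(group_ids, stages):
    """Return {group_id: sorted stages} for groups that appear in more than one stage."""
    stages_per_group = defaultdict(set)
    for group_id, stage in zip(group_ids, stages):
        stages_per_group[group_id].add(stage)
    return {g: sorted(s) for g, s in stages_per_group.items() if len(s) > 1}

def positive_rate_per_stage(stages, labels):
    """Return the fraction of positive labels within each stage."""
    totals, positives = Counter(), Counter()
    for stage, label in zip(stages, labels):
        totals[stage] += 1
        positives[stage] += bool(label)
    return {stage: positives[stage] / totals[stage] for stage in totals}

# Toy data (hypothetical): fruit '0003' deliberately leaks across two stages.
toy_fruit_ids = ['0001', '0001', '0002', '0002', '0003', '0003']
toy_stages = ['train', 'train', 'valid', 'valid', 'test', 'train']
toy_mold = [True, False, True, False, False, True]

leaks = find_leaky_groups(toy_fruit_ids, toy_stages)
rates = positive_rate_per_stage(toy_stages, toy_mold)
print(leaks)
print(rates)
```

On the notebook's data, `find_leaky_groups(df['fruit_id'], df['stage'])` should return an empty dict if the grouping worked, and `positive_rate_per_stage(df['stage'], df['mold'])` should give roughly similar rates for the three stages.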

create an artifact for the processed (split) data

In [None]:
processed_data_at = wandb.Artifact(PROCESSED_DATA_AT, type="split_data")

add data split file to the artifact

In [None]:
processed_data_at.add_file('data_split.csv')

add images to the artifact

In [None]:
processed_data_at.add_dir(dataset_dir)

Let's also enrich our EDA table with the split data so we can inspect it visually.

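First, we will fetch the original EDA table. The artifact name in the next cell includes the id of the run that logged the table, so replace it with the name from your own EDA run.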

In [None]:
orig_eda = run.use_artifact("run-v9s47qhv-table_coco_sample:latest")
orig_eda_table = orig_eda.get("table_coco_sample")

create data split table

In [None]:
data_split_table = wandb.Table(dataframe=df[['file_name', 'stage']])

join tables on `file_name`

In [None]:
join_table = wandb.JoinedTable(orig_eda_table, data_split_table, "file_name")

add the joined table to the artifact

In [None]:
processed_data_at.add(join_table, "table_coco_data_split")

save artifact to W&B and finalize run

In [None]:
run.log_artifact(processed_data_at)
run.finish()