# Prepare the dataset

<!--- @wandbcode{pis_course} -->

In this notebook we will prepare the dataset for the model. You will first need to download the lemon dataset:

```bash
$ git clone -qq https://github.com/softwaremill/lemon-dataset.git
$ unzip -q lemon-dataset/data/lemon-dataset.zip
```
    
Then you can run this notebook to prepare the dataset as an artifact and upload it to W&B.

In [3]:
# import the necessary packages
import json
from pathlib import Path

import pandas as pd
from sklearn.model_selection import StratifiedGroupKFold

import params
import wandb

Start a new W&B run.

In [4]:
run = wandb.init(
    project=params.PROJECT_NAME, entity=params.ENTITY, job_type="data_prep")

Get the paths to the raw data folder, the images folder, and the annotations file.

In [5]:
raw_data_folder = Path(params.RAW_DATA_FOLDER)
images_folder = raw_data_folder / params.IMAGES_FOLDER
annotations_file = raw_data_folder / params.ANNOTATIONS_FILE
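
The `params` module is imported above but not shown in this notebook. As a rough sketch of what it might contain: the three path values below are inferred from the cell outputs in this notebook (the `lemon-dataset\images` directory and the `annotations/instances_default.json` entry), while `PROJECT_NAME`, `ENTITY`, and `ARTIFACT_NAME` are hypothetical placeholders that you would replace with your own settings.

```python
from pathlib import Path

# Inferred from the cell outputs in this notebook.
RAW_DATA_FOLDER = "lemon-dataset"
IMAGES_FOLDER = "images"
ANNOTATIONS_FILE = "annotations/instances_default.json"

# Hypothetical placeholders: replace with your own W&B settings.
PROJECT_NAME = "lemon-project"
ENTITY = None  # None lets W&B fall back to your default entity
ARTIFACT_NAME = "lemon_data"

# The same path arithmetic as in the cell above.
images_folder = Path(RAW_DATA_FOLDER) / IMAGES_FOLDER
annotations_file = Path(RAW_DATA_FOLDER) / ANNOTATIONS_FILE
print(images_folder.as_posix())
print(annotations_file.as_posix())
```

Keeping these names in one module means later notebooks can refer to the same project and artifact without repeating string literals.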

Create a new artifact to store the dataset.

In [6]:
dataset_artifact = wandb.Artifact(params.ARTIFACT_NAME, type="dataset")
dataset_artifact.add_dir(images_folder, name=params.IMAGES_FOLDER)
dataset_artifact.add_file(annotations_file, name=params.ANNOTATIONS_FILE)

wandb: Adding directory to artifact (.\lemon-dataset\images)... Done. 1.8s


ArtifactManifestEntry(path='annotations/instances_default.json', digest='1mcsrBLiFwep9+prlDhEtw==', size=41537226, local_path='C:\\Users\\renan\\AppData\\Local\\wandb\\wandb\\artifacts\\staging\\tmp9getx6jv')


Read the annotations data and convert it to DataFrames.

In [7]:
# use a context manager so the file handle is closed after parsing
with open(annotations_file, mode="r", encoding="utf-8") as f:
    data = json.load(f)
annotations = pd.DataFrame.from_dict(data["annotations"])
images = pd.DataFrame.from_dict(data["images"])

In [8]:
annotations.head()

|   | id | iscrowd | area | category_id | image_id | segmentation | bbox |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 539.0 | 9 | 0 | [[179.15200000000914, 641.3920000000107, 179.0... | [157.40800000001036, 603.6640000000098, 27.215... |
| 1 | 2 | 0 | 622.0 | 5 | 0 | [[411.404296875, 458.1650390625, 403.939548160... | [398.2661785600103, 422.66166784001143, 29.561... |
| 2 | 3 | 0 | 809.0 | 5 | 0 | [[299.818359375, 442.6376953125, 293.547719680... | [291.15893248000975, 431.8883584000105, 39.414... |
| 3 | 4 | 0 | 30.0 | 5 | 100 | [[311.98046875, 494.6767578125, 308.9262595362... | [308.92625953626884, 494.6767578125, 6.2998631... |
| 4 | 5 | 0 | 31.0 | 2 | 100 | [[606.7744140625, 489.2041015625, 602.81602149... | [602.8160214904838, 489.2041015625, 7.58732610... |


In [9]:
images.head()

|   | id | date_captured | coco_url | file_name | license | flickr_url | height | width |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 |  | images/0001_A_H_0_A.jpg | 0 |  | 1056 | 1056 |
| 1 | 100 | 0 |  | images/0003_A_V_150_A.jpg | 0 |  | 1056 | 1056 |
| 2 | 101 | 0 |  | images/0003_A_V_15_A.jpg | 0 |  | 1056 | 1056 |
| 3 | 102 | 0 |  | images/0003_A_V_165_A.jpg | 0 |  | 1056 | 1056 |
| 4 | 103 | 0 |  | images/0003_A_V_30_A.jpg | 0 |  | 1056 | 1056 |



Based on our EDA, wrangle the data into a binary classification target (`mold`) and a per-fruit identifier (`fruit_id`).


In [10]:
df = (
    annotations[["image_id", "category_id"]]
    .groupby("image_id")["category_id"]
    .apply(lambda x: list(set(x)))
    .reset_index()
)
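# flag images that contain at least one annotation of category 4 (mold)
df["mold"] = df["category_id"].apply(lambda x: 4 in x)
# attach file names, then drop the duplicate image id column
df = pd.merge(df, images[["id", "file_name"]], left_on="image_id", right_on="id")
del df["id"]
# the fruit id is the first underscore-separated token of the file name, e.g. "0001"
df["fruit_id"] = df["file_name"].apply(lambda x: x.split("/")[1].split("_")[0])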

In [11]:
df.head()

|   | image_id | category_id | mold | file_name | fruit_id |
|---|---|---|---|---|---|
| 0 | 0 | [9, 5] | False | images/0001_A_H_0_A.jpg | 1 |
| 1 | 100 | [2, 5, 7] | False | images/0003_A_V_150_A.jpg | 3 |
| 2 | 101 | [9, 2, 5] | False | images/0003_A_V_15_A.jpg | 3 |
| 3 | 102 | [2, 5, 7] | False | images/0003_A_V_165_A.jpg | 3 |
| 4 | 103 | [9, 5] | False | images/0003_A_V_30_A.jpg | 3 |
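
To make the wrangling step easier to follow, here is a minimal sketch that runs the same sequence of operations on a hypothetical three-image miniature of the COCO tables. The toy rows and the name `toy` are made up for illustration. Only the column names, the category 4 check, and the file name pattern come from the notebook.

```python
import pandas as pd

# Hypothetical miniature: two images of fruit 0001, one image of fruit 0002.
annotations = pd.DataFrame(
    {"image_id": [0, 0, 1, 2, 2], "category_id": [9, 5, 5, 4, 5]}
)
images = pd.DataFrame(
    {
        "id": [0, 1, 2],
        "file_name": [
            "images/0001_A_H_0_A.jpg",
            "images/0001_A_V_15_A.jpg",
            "images/0002_A_H_0_A.jpg",
        ],
    }
)

# One row per image, holding the distinct categories annotated on it.
toy = (
    annotations.groupby("image_id")["category_id"]
    .apply(lambda ids: sorted(set(ids)))
    .reset_index()
)
# Binary target: does the image contain any category 4 annotation?
toy["mold"] = toy["category_id"].apply(lambda ids: 4 in ids)
# Attach file names and derive the fruit id from the file name prefix.
toy = toy.merge(images, left_on="image_id", right_on="id").drop(columns="id")
toy["fruit_id"] = toy["file_name"].str.split("/").str[1].str.split("_").str[0]
print(toy)
```

Note that `fruit_id` is a zero-padded string such as `"0001"` at this point. It only turns into an integer if the table is written to CSV and read back with default type inference.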


## Train / validation / test split

Let's use scikit-learn to split our data into train, validation, and test sets. We'll use `StratifiedGroupKFold`, which does two things at once: it stratifies on the target, so each fold has roughly the same share of moldy images as the whole dataset, and it keeps all images of the same fruit in a single fold, so no fruit leaks between sets.

This matters for a small dataset, where a purely random split can easily produce unrepresentative sets. We'll split the data into 10 folds, then use fold 0 as the test set, fold 1 as the validation set, and the remaining 8 folds as the training set, which gives an approximately 80/10/10 split.

In [12]:

df["fold"] = -1
X = df.index.values
y = df.mold.values  # stratify by our target column
groups = df.fruit_id.values  # group individual fruit to avoid leakage

cv = StratifiedGroupKFold(n_splits=10, random_state=42, shuffle=True)
for i, (_, test_idxs) in enumerate(cv.split(X, y, groups)):
    # assign with a single .loc call; the chained df["fold"].iloc[...] form
    # triggers pandas' SettingWithCopyWarning
    df.loc[df.index[test_idxs], "fold"] = i

df["stage"] = df["fold"].apply(
    lambda x: "test" if x == 0 else ("valid" if x == 1 else "train")
)
df.to_csv("data_split.csv", index=False)

In [13]:
df.head()

|   | image_id | category_id | mold | file_name | fruit_id | fold | stage |
|---|---|---|---|---|---|---|---|
| 0 | 0 | [9, 5] | False | images/0001_A_H_0_A.jpg | 1 | 3 | train |
| 1 | 100 | [2, 5, 7] | False | images/0003_A_V_150_A.jpg | 3 | 7 | train |
| 2 | 101 | [9, 2, 5] | False | images/0003_A_V_15_A.jpg | 3 | 7 | train |
| 3 | 102 | [2, 5, 7] | False | images/0003_A_V_165_A.jpg | 3 | 7 | train |
| 4 | 103 | [9, 5] | False | images/0003_A_V_30_A.jpg | 3 | 7 | train |



Add the CSV containing the processed data split to the artifact.

In [14]:
dataset_artifact.add_file("data_split.csv")

ArtifactManifestEntry(path='data_split.csv', digest='cKtAzBivv/NN5VRTt0YiFw==', size=161453, local_path='C:\\Users\\renan\\AppData\\Local\\wandb\\wandb\\artifacts\\staging\\tmprfy4ymwn')

Log the artifact to W&B and finish the run.

In [15]:
run.log_artifact(dataset_artifact)
run.finish()
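
Once the artifact is logged, a later notebook can pull it back down and read the split. The following is a minimal sketch of how that might look, not part of the original notebook. The helper names `load_split` and `select_stage`, the `job_type="training"` label, and the `latest` alias are assumptions. `load_split` needs a configured W&B account, so it is defined here but not called.

```python
from pathlib import Path

import pandas as pd


def load_split(project, artifact_name, entity=None, alias="latest"):
    """Download the dataset artifact and return (run, local_dir, split table)."""
    import wandb  # imported lazily so this module loads without W&B configured

    run = wandb.init(project=project, entity=entity, job_type="training")
    artifact = run.use_artifact(f"{artifact_name}:{alias}")
    artifact_dir = Path(artifact.download())
    split = pd.read_csv(artifact_dir / "data_split.csv")
    return run, artifact_dir, split


def select_stage(split, stage):
    """Return only the rows of the split table that belong to one stage."""
    return split[split["stage"] == stage].reset_index(drop=True)


# Offline usage example on a hypothetical four-row split table.
demo = pd.DataFrame(
    {
        "file_name": ["a.jpg", "b.jpg", "c.jpg", "d.jpg"],
        "stage": ["train", "valid", "train", "test"],
    }
)
print(select_stage(demo, "train"))
```

Calling `run.use_artifact` also records the dependency between the consuming run and the dataset artifact, so the lineage is visible in W&B.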