<a href="https://colab.research.google.com/github/sayakpaul/Dual-Deployments-on-Vertex-AI/blob/main/notebooks/Dataset_Prep.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook, we will use the [Flowers dataset](https://www.tensorflow.org/datasets/catalog/tf_flowers) and create a `.csv` file out of it so that it can be imported into Vertex AI as a [managed dataset](https://cloud.google.com/vertex-ai/docs/training/using-managed-datasets). 

To follow the rest of the notebook, you need a billing-enabled GCP account.

## Setup

In [None]:
!gcloud init

## Download the original dataset and copy over to a GCS Bucket

In [None]:
!wget -q https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz
!tar -xf flower_photos.tgz

***If you already have a spare bucket, you can use it instead of creating a new one.***

In [None]:
#@title GCS
#@markdown Change these values to suit your setup (bucket names are globally unique, so pick your own). The copy operation can take ~5 minutes.
BUCKET_PATH = "gs://flowers-experimental" #@param {type:"string"}
REGION = "us-central1" #@param {type:"string"}

!gsutil mb -l {REGION} {BUCKET_PATH}
!gsutil -m cp -r flower_photos {BUCKET_PATH}

Verify that the files were copied over.

In [None]:
!gsutil ls {BUCKET_PATH}/flower_photos/

gs://flowers-experimental/flower_photos/LICENSE.txt
gs://flowers-experimental/flower_photos/daisy/
gs://flowers-experimental/flower_photos/dandelion/
gs://flowers-experimental/flower_photos/roses/
gs://flowers-experimental/flower_photos/sunflowers/
gs://flowers-experimental/flower_photos/tulips/


## Imports

In [None]:
import random
random.seed(666)

from google.cloud import storage
from pprint import pprint
import pandas as pd
import os

In [None]:
from google.colab import auth
auth.authenticate_user()

## Preparing a single `.csv` file

Vertex AI can import image datasets from `.jsonl` or `.csv` files. In this notebook, we will use `.csv`. Here's the structure Vertex AI expects ([reference](https://cloud.google.com/vertex-ai/docs/datasets/prepare-image#csv)):

```
[ML_USE],GCS_FILE_PATH,[LABEL]
```

`ML_USE` stands for the data split: `training`, `validation`, or `test`.
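As a small illustration of that layout, the sketch below builds one row in `ML_USE,GCS_FILE_PATH,LABEL` order and rejects split names other than the three this notebook writes. The helper `make_row`, the constant `VALID_ML_USE`, and the `gs://my-bucket/...` URI are hypothetical names chosen for the example; they are not part of any Vertex AI API.

```python
import csv
import io

# Split names this notebook writes into the ML_USE column.
VALID_ML_USE = {"training", "validation", "test"}

def make_row(ml_use, gcs_uri, label):
    """Return one CSV row in ML_USE,GCS_FILE_PATH,LABEL order (hypothetical helper)."""
    if ml_use not in VALID_ML_USE:
        raise ValueError(f"Unknown ML_USE value: {ml_use!r}")
    if not gcs_uri.startswith("gs://"):
        raise ValueError(f"Not a GCS URI: {gcs_uri!r}")
    return [ml_use, gcs_uri, label]

# Write a single row to an in-memory buffer to see the resulting CSV line.
buffer = io.StringIO()
csv.writer(buffer).writerow(
    make_row("training", "gs://my-bucket/flower_photos/daisy/example.jpg", "daisy")
)
print(buffer.getvalue().strip())
```

Validating the split name up front is one way to catch a typo such as `valid` before the file is uploaded, rather than at import time.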

### Derive GCS URIs of the images

In [None]:
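gs_uris = []

storage_client = storage.Client(project="fast-ai-exploration") # Change it accordingly.
# Drop any trailing slash so the bucket name is the last path component.
bucket_root = BUCKET_PATH.rstrip("/")
blobs = storage_client.list_blobs(bucket_root.split("/")[-1])

for blob in blobs:
    # Skip non-image files such as LICENSE.txt.
    if blob.name.endswith(".txt"):
        continue
    # Join with "/" explicitly; os.path.join would use "\" on Windows.
    gs_uris.append(f"{bucket_root}/{blob.name}")

pprint(gs_uris[:5])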

['gs://flowers-experimental/flower_photos/daisy/100080576_f52e8ee070_n.jpg',
 'gs://flowers-experimental/flower_photos/daisy/10140303196_b88d3d6cec.jpg',
 'gs://flowers-experimental/flower_photos/daisy/10172379554_b296050f82_n.jpg',
 'gs://flowers-experimental/flower_photos/daisy/10172567486_2748826a8b.jpg',
 'gs://flowers-experimental/flower_photos/daisy/10172636503_21bededa75_n.jpg']


### Dataset splitting

In [None]:
# Create splits.
random.shuffle(gs_uris)

# Hold out the last 10% of the shuffled URIs for testing.
i = int(len(gs_uris) * 0.9)
train_paths = gs_uris[:i]
test_paths = gs_uris[i:]

# Take 5% of the remaining training URIs for validation.
i = int(len(train_paths) * 0.05)
valid_paths = train_paths[:i]
train_paths = train_paths[i:]

print(len(train_paths), len(valid_paths), len(test_paths))

3138 165 367
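
The split sizes above follow from two truncating multiplications: `int(3670 * 0.9) = 3303` leaves 367 test images, and `int(3303 * 0.05) = 165` validation images leave 3138 for training. The sketch below wraps the same arithmetic in a function so it can be checked without touching the bucket. The name `split_paths`, its parameters, and the dummy URIs are assumptions made for this example, and it uses a local `random.Random` instance instead of the global seed, so the resulting order differs from the cell above.

```python
import random

def split_paths(paths, train_frac=0.9, valid_frac=0.05, seed=666):
    """Shuffle a copy of `paths`, then carve out test and validation subsets."""
    rng = random.Random(seed)  # local RNG, leaves the global random state alone
    shuffled = list(paths)
    rng.shuffle(shuffled)

    # First cut: everything after `train_frac` becomes the test set.
    cut = int(len(shuffled) * train_frac)
    train, test = shuffled[:cut], shuffled[cut:]

    # Second cut: the first `valid_frac` of the training set becomes validation.
    cut = int(len(train) * valid_frac)
    valid, train = train[:cut], train[cut:]
    return train, valid, test

# Dummy URIs standing in for the 3,670 images listed from the bucket.
dummy = [f"gs://my-bucket/img_{n}.jpg" for n in range(3670)]
train, valid, test = split_paths(dummy)
print(len(train), len(valid), len(test))  # 3138 165 367
```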


### Utility for deriving the labels and `ML_USE`

In [None]:
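def derive_labels(gcs_paths, split="training"):
    """Return the class label of each URI and a matching list of `split` values."""
    labels = []
    for gcs_path in gcs_paths:
        # gs://<bucket>/flower_photos/<label>/<file> puts the label at index 4.
        label = gcs_path.split("/")[4]
        labels.append(label)
    return labels, [split] * len(gcs_paths)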
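
The label lookup depends on where the class folder sits in the URI. As a rough illustration, the sketch below prints the component at index 4 for one of the URIs listed earlier and compares it with a hypothetical alternative, `label_from_uri`, which reads the parent folder instead. That helper is not part of the notebook; it is only an assumption about how one might make the lookup independent of nesting depth.

```python
def label_from_uri(gcs_uri):
    """Return the name of the folder that directly contains the image."""
    return gcs_uri.split("/")[-2]

uri = "gs://flowers-experimental/flower_photos/daisy/100080576_f52e8ee070_n.jpg"

parts = uri.split("/")
print(parts[4])             # fixed position, as in derive_labels
print(label_from_uri(uri))  # parent folder, whatever the nesting depth
```

Both lines print `daisy` for this layout. The fixed index would need adjusting if the images were stored one folder deeper, while the parent-folder variant would not.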

### Prepare the lists

In [None]:
# File format reference: https://cloud.google.com/vertex-ai/docs/datasets/prepare-image#csv
train_labels, train_use = derive_labels(train_paths)
val_labels, val_use = derive_labels(valid_paths, split="validation")
test_labels, test_use = derive_labels(test_paths, split="test")

### Create `.csv` file

In [None]:
gcs_uris = []
labels = []
use = []

gcs_uris.extend(train_paths)
gcs_uris.extend(valid_paths)
gcs_uris.extend(test_paths)

labels.extend(train_labels)
labels.extend(val_labels)
labels.extend(test_labels)

use.extend(train_use)
use.extend(val_use)
use.extend(test_use)

In [None]:
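import csv

# newline="" lets the csv module control line endings, as its documentation recommends.
with open("flowers_vertex.csv", "w", newline="") as csvfile:
    csvwriter = csv.writer(csvfile)

    # One row per image, in ML_USE,GCS_FILE_PATH,LABEL order.
    for ml_use, gcs_uri, label in zip(use, gcs_uris, labels):
        row = [ml_use, gcs_uri, label]
        csvwriter.writerow(row)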

In [None]:
!head -5 flowers_vertex.csv

training,gs://flowers-experimental/flower_photos/sunflowers/4895721788_f10208ab77_n.jpg,sunflowers
training,gs://flowers-experimental/flower_photos/sunflowers/8202034834_ee0ee91e04_n.jpg,sunflowers
training,gs://flowers-experimental/flower_photos/daisy/19019544592_b64469bf84_n.jpg,daisy
training,gs://flowers-experimental/flower_photos/dandelion/4634716478_1cbcbee7ca.jpg,dandelion
training,gs://flowers-experimental/flower_photos/tulips/12163418275_bd6a1edd61.jpg,tulips


In [None]:
!tail -5 flowers_vertex.csv

test,gs://flowers-experimental/flower_photos/roses/6363951285_a802238d4e.jpg,roses
test,gs://flowers-experimental/flower_photos/dandelion/4571923094_b9cefa9438_n.jpg,dandelion
test,gs://flowers-experimental/flower_photos/roses/2471103806_87ba53d997_n.jpg,roses
test,gs://flowers-experimental/flower_photos/roses/12238827553_cf427bfd51_n.jpg,roses
test,gs://flowers-experimental/flower_photos/roses/3663244576_97f595cf4a.jpg,roses
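
`head` and `tail` only show the two ends of the file. One further check, sketched below, is to tally rows per split and per label. The function name `summarize` and the three sample rows are made up for the example; to run it on the real file, one could pass `open("flowers_vertex.csv", newline="")` instead of the in-memory sample.

```python
import csv
import io
from collections import Counter

def summarize(csv_lines):
    """Count rows per ML_USE value and per label in a three-column CSV."""
    per_split, per_label = Counter(), Counter()
    for ml_use, _gcs_uri, label in csv.reader(csv_lines):
        per_split[ml_use] += 1
        per_label[label] += 1
    return per_split, per_label

# Tiny in-memory sample that mimics the layout of flowers_vertex.csv.
sample = io.StringIO(
    "training,gs://my-bucket/flower_photos/daisy/a.jpg,daisy\n"
    "training,gs://my-bucket/flower_photos/roses/b.jpg,roses\n"
    "test,gs://my-bucket/flower_photos/roses/c.jpg,roses\n"
)
per_split, per_label = summarize(sample)
print(dict(per_split))  # {'training': 2, 'test': 1}
print(dict(per_label))  # {'daisy': 1, 'roses': 2}
```

On the real file, the per-split counts should match the `3138 165 367` printed in the splitting step.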


## Copy the `.csv` file over to the GCS Bucket

In [None]:
!gsutil cp flowers_vertex.csv {BUCKET_PATH}

Copying file://flowers_vertex.csv [Content-Type=text/csv]...
-
Operation completed over 1 objects/334.7 KiB.                                    
