# Dataset Transform

This notebook reads from a source directory (set in the parameters below), which must contain:
- A file called _annotations.csv containing labels in the retinanet format. The notebook converts these to the Google Cloud API input format.
- A set of square jpg images that all share one size. In my case 416x416 pixels.
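
The two requirements above can be checked before running anything else. The following is a minimal sketch, not part of the original notebook: `check_source_dir` is a hypothetical helper name, and it only checks that the files exist. It does not verify that the images are square or share one size, which would need an image library.

```python
from pathlib import Path


def check_source_dir(src_path):
    """Return a list of problems found in the source directory (empty if none)."""
    src = Path(src_path)
    if not src.is_dir():
        return [f"{src} is not a directory"]
    problems = []
    # The annotations file must sit next to the images.
    if not (src / "_annotations.csv").is_file():
        problems.append("missing _annotations.csv")
    # At least one jpg image is expected in the same folder.
    if not any(src.glob("*.jpg")):
        problems.append("no .jpg images found")
    return problems


# 'Data_Orig/' is the src_path used later in this notebook.
print(check_source_dir("Data_Orig/"))
```

An empty list means both requirements are met.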

---

### Retinanet _annotations.csv input syntax

file_name, x1, y1, x2, y2, label

Where the x and y values are stored as integers.

### Google Cloud _annotations.csv output syntax

gs_file_path, label, x1, y1,,, x2, y2,,

Where the x and y values are now floating point numbers between 0 and 1. The empty fields after y1 and after y2 each stand for two empty columns, so every row has 10 columns in total.

The gs_file_path is where the images are stored in Google Cloud Storage, but it can be set to whatever prefix the file paths
in the resulting csv file should have.
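
To make the two formats concrete, here is a small sketch that converts one box from the input layout to the output layout. The function name `to_gcloud_row` and the pixel values are made up for illustration; the notebook itself does this conversion with pandas further down.

```python
def to_gcloud_row(gs_file_path, label, x1, y1, x2, y2, img_size=416):
    """Format one bounding box as a line in the Google Cloud output syntax."""
    # Scale pixel coordinates to the 0-1 range.
    nx1, ny1, nx2, ny2 = (v / img_size for v in (x1, y1, x2, y2))
    # Empty strings become the empty columns after (x1, y1) and (x2, y2).
    fields = [gs_file_path, label, nx1, ny1, "", "", nx2, ny2, "", ""]
    return ",".join(str(f) for f in fields)


# Hypothetical box on a 416x416 image.
print(to_gcloud_row("gs://capstone_benchmark/img1.jpg", "car", 104, 208, 312, 416))
# prints: gs://capstone_benchmark/img1.jpg,car,0.25,0.5,,,0.75,1.0,,
```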

---

The purpose of this program is to reduce the number of images and reformat the csv file.
The original dataset is only read, never modified. A new dataset is created by writing a new csv file and copying the chosen images to a new destination.

To use this program, tweak the parameters below, and run all cells.

---

In [1]:
# import necessary libraries
import pandas as pd
import shutil

---

## Parameters

- src_path: Path to all source images and their _annotations.csv file.
- dest_path: Path to where the partition of images and the new _annotations.csv file should be sent.
- gs_filepath: Path to files in google cloud storage if using google cloud ML services.

---

- kept: The number of images to keep. (Note that the resulting number of images may be slightly lower than requested. The sample is drawn from annotation rows, so two sampled rows can belong to the same image. For instance, I requested 360 and got 355.)
- move_images: Set to False if you only want the resulting csv file and don't want the images copied to the destination folder.
- img_size: Side length, in pixels, of the square input images.

In [2]:
src_path = 'Data_Orig/'
dest_path = 'Data_Aug/'
gs_filepath = 'gs://capstone_benchmark/'

In [3]:
kept = 360
move_images = False
img_size = 416.

In [4]:
# read in the annotations file and label columns
labels = ['file_name', 'x1', 'y1', 'x2', 'y2', 'label']
ann = pd.read_csv(src_path + '_annotations.csv', sep=',', header=None, names=labels)
ann.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43473 entries, 0 to 43472
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   file_name  43473 non-null  object
 1   x1         43473 non-null  int64 
 2   y1         43473 non-null  int64 
 3   x2         43473 non-null  int64 
 4   y2         43473 non-null  int64 
 5   label      43473 non-null  object
dtypes: int64(4), object(2)
memory usage: 2.0+ MB


In [5]:
# For now we will drop all rows that don't have the label 'car'
ann = ann[ann['label'] == 'car']
print('Labels in df:', ann['label'].unique())
print()
ann.info()

Labels in df: ['car']

<class 'pandas.core.frame.DataFrame'>
Int64Index: 32510 entries, 0 to 43472
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   file_name  32510 non-null  object
 1   x1         32510 non-null  int64 
 2   y1         32510 non-null  int64 
 3   x2         32510 non-null  int64 
 4   y2         32510 non-null  int64 
 5   label      32510 non-null  object
dtypes: int64(4), object(2)
memory usage: 1.7+ MB


In [6]:
# sample 'kept' annotation rows at random; the file names of those rows identify the
# images we will transfer to the destination directory (keeping all of their annotations)
img_subset = list(ann['file_name'].sample(n=kept, random_state=0))
ann_subset = ann[ann['file_name'].isin(img_subset)]
ann_subset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2084 entries, 90 to 43354
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   file_name  2084 non-null   object
 1   x1         2084 non-null   int64 
 2   y1         2084 non-null   int64 
 3   x2         2084 non-null   int64 
 4   y2         2084 non-null   int64 
 5   label      2084 non-null   object
dtypes: int64(4), object(2)
memory usage: 114.0+ KB
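
The cell above samples annotation rows, so an image with many boxes can be drawn more than once, which is why fewer than `kept` images can result. If an exact count matters, one possible alternative (not what this notebook does) is to sample from the unique file names instead. The sketch below uses a made-up toy DataFrame to show the difference.

```python
import pandas as pd

# Hypothetical toy annotations: three images with several boxes each.
toy = pd.DataFrame({
    "file_name": ["a.jpg", "a.jpg", "a.jpg", "b.jpg", "b.jpg", "c.jpg"],
    "label": ["car"] * 6,
})
kept = 2

# Sampling rows (as in the cell above) can pick the same image twice.
by_row = toy["file_name"].sample(n=kept, random_state=0)

# Sampling from the unique file names always yields exactly `kept` images.
by_file = toy["file_name"].drop_duplicates().sample(n=kept, random_state=0)
subset = toy[toy["file_name"].isin(by_file)]

print(by_row.nunique(), "unique image(s) when sampling rows")
print(by_file.nunique(), "unique image(s) when sampling file names")
```

Switching to this approach would change which images are selected, so the outputs shown in the rest of this notebook would no longer match.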


In [7]:
# map each original file name to a short sequential name (img1.jpg, img2.jpg, ...)
name_mapping = {
    name: f'img{count}.jpg'
    for count, name in enumerate(ann_subset['file_name'].unique(), start=1)
}

In [8]:
ann_subset.head(2)

| | file_name | x1 | y1 | x2 | y2 | label |
|---|---|---|---|---|---|---|
| 90 | 1478732982753103095_jpg.rf.3c4c355e77fb688df76... | 349 | 168 | 415 | 292 | car |
| 91 | 1478732982753103095_jpg.rf.3c4c355e77fb688df76... | 28 | 176 | 98 | 260 | car |


In [9]:
# copy each unique source image to the destination folder under its mapped name
if move_images:
    for name in ann_subset['file_name'].unique():
        shutil.copy(src_path + name, dest_path + name_mapping[name])
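
Note that `shutil.copy` does not create missing folders, so the copy step fails if the destination directory does not exist yet. A minimal sketch of a variant that creates it first is shown here; `copy_renamed` is a hypothetical helper name, not part of the notebook.

```python
import shutil
from pathlib import Path


def copy_renamed(src_dir, dest_dir, name_mapping):
    """Copy each source file into dest_dir under its mapped name."""
    dest = Path(dest_dir)
    # Create the destination folder (and any parents) if it is missing.
    dest.mkdir(parents=True, exist_ok=True)
    for old_name, new_name in name_mapping.items():
        # The source file is left untouched; only a copy is written.
        shutil.copy(Path(src_dir) / old_name, dest / new_name)
```

In this notebook it could be called as `copy_renamed(src_path, dest_path, name_mapping)` when `move_images` is True.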

In [10]:
# add the mapped file names to ann_subset as a new_name column
ann_subset = ann_subset.assign(new_name=ann_subset['file_name'].map(name_mapping))

In [11]:
ann_subset

| | file_name | x1 | y1 | x2 | y2 | label | new_name |
|---|---|---|---|---|---|---|---|
| 90 | 1478732982753103095_jpg.rf.3c4c355e77fb688df76... | 349 | 168 | 415 | 292 | car | img1.jpg |
| 91 | 1478732982753103095_jpg.rf.3c4c355e77fb688df76... | 28 | 176 | 98 | 260 | car | img1.jpg |
| 92 | 1478732982753103095_jpg.rf.3c4c355e77fb688df76... | 91 | 180 | 118 | 240 | car | img1.jpg |
| 93 | 1478732982753103095_jpg.rf.3c4c355e77fb688df76... | 95 | 183 | 121 | 245 | car | img1.jpg |
| 94 | 1478732982753103095_jpg.rf.3c4c355e77fb688df76... | 136 | 192 | 159 | 219 | car | img1.jpg |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 43294 | 1478896703276941790_jpg.rf.f38039b06d7f529c250... | 63 | 160 | 302 | 415 | car | img354.jpg |
| 43295 | 1478896703276941790_jpg.rf.f38039b06d7f529c250... | 179 | 121 | 415 | 260 | car | img354.jpg |
| 43352 | 1478896286807642136_jpg.rf.f40e5a7216c1fed70b1... | 362 | 116 | 406 | 135 | car | img355.jpg |
| 43353 | 1478896286807642136_jpg.rf.f40e5a7216c1fed70b1... | 311 | 193 | 328 | 214 | car | img355.jpg |


In [12]:
# prepend gs_filepath for use in Google Cloud AI
ann_subset['new_name'] = gs_filepath + ann_subset['new_name']
ann_subset.head(3)

| | file_name | x1 | y1 | x2 | y2 | label | new_name |
|---|---|---|---|---|---|---|---|
| 90 | 1478732982753103095_jpg.rf.3c4c355e77fb688df76... | 349 | 168 | 415 | 292 | car | gs://capstone_benchmark/img1.jpg |
| 91 | 1478732982753103095_jpg.rf.3c4c355e77fb688df76... | 28 | 176 | 98 | 260 | car | gs://capstone_benchmark/img1.jpg |
| 92 | 1478732982753103095_jpg.rf.3c4c355e77fb688df76... | 91 | 180 | 118 | 240 | car | gs://capstone_benchmark/img1.jpg |


In [13]:
# add an all-NaN column; it is written out as the empty csv fields the output format requires
ann_subset['empty'] = pd.Series(dtype='float64')
# scale the pixel coordinates to the 0-1 range
ann_subset['x1'] /= img_size
ann_subset['y1'] /= img_size
ann_subset['x2'] /= img_size
ann_subset['y2'] /= img_size
ann_subset.head()

| | file_name | x1 | y1 | x2 | y2 | label | new_name | empty |
|---|---|---|---|---|---|---|---|---|
| 90 | 1478732982753103095_jpg.rf.3c4c355e77fb688df76... | 0.838942 | 0.403846 | 0.997596 | 0.701923 | car | gs://capstone_benchmark/img1.jpg | |
| 91 | 1478732982753103095_jpg.rf.3c4c355e77fb688df76... | 0.067308 | 0.423077 | 0.235577 | 0.625 | car | gs://capstone_benchmark/img1.jpg | |
| 92 | 1478732982753103095_jpg.rf.3c4c355e77fb688df76... | 0.21875 | 0.432692 | 0.283654 | 0.576923 | car | gs://capstone_benchmark/img1.jpg | |
| 93 | 1478732982753103095_jpg.rf.3c4c355e77fb688df76... | 0.228365 | 0.439904 | 0.290865 | 0.588942 | car | gs://capstone_benchmark/img1.jpg | |
| 94 | 1478732982753103095_jpg.rf.3c4c355e77fb688df76... | 0.326923 | 0.461538 | 0.382212 | 0.526442 | car | gs://capstone_benchmark/img1.jpg | |
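
After scaling, every coordinate should lie between 0 and 1, and each box should have x1 < x2 and y1 < y2. The sketch below shows one way such a sanity check might look. It is an optional addition rather than part of the original notebook, and it uses the pixel values of the first two rows shown earlier.

```python
import pandas as pd

coords = ["x1", "y1", "x2", "y2"]

# Pixel boxes of rows 90 and 91 from the 416x416 source images.
boxes = pd.DataFrame({"x1": [349, 28], "y1": [168, 176], "x2": [415, 98], "y2": [292, 260]})
normalized = boxes[coords] / 416.0

# Every normalized coordinate must fall inside [0, 1].
in_range = ((normalized >= 0) & (normalized <= 1)).all().all()
# Each box must have its first corner above and to the left of its second corner.
ordered = ((normalized["x1"] < normalized["x2"]) & (normalized["y1"] < normalized["y2"])).all()

print(in_range, ordered)
# prints: True True
```

A False in either check would usually point to a wrong `img_size` or to malformed source annotations.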


In [14]:
# create final dataframe with columns in requested order
final_df = ann_subset[['new_name', 'label', 'x1', 'y1', 'empty', 'empty', 'x2', 'y2', 'empty', 'empty']]

# create an _annotations.csv file in the destination folder
final_df.to_csv(dest_path + '_annotations.csv', header=False, index=False)
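
Before uploading, it can be worth confirming that the written rows really follow the 10-column layout described at the top. Here is a minimal sketch using only the standard library. `row_is_valid` is a hypothetical helper, and the sample line reuses the rounded values displayed earlier rather than the full-precision values the notebook writes.

```python
import csv
import io


def row_is_valid(row):
    """Check one parsed csv row against the 10-column output layout."""
    if len(row) != 10:
        return False
    # Columns 4, 5, 8 and 9 (0-based) are the empty padding columns.
    if any(row[i] != "" for i in (4, 5, 8, 9)):
        return False
    # The four coordinates must parse as floats in [0, 1].
    try:
        return all(0.0 <= float(row[i]) <= 1.0 for i in (2, 3, 6, 7))
    except ValueError:
        return False


# One line in the output format (rounded values, for illustration only).
sample = "gs://capstone_benchmark/img1.jpg,car,0.838942,0.403846,,,0.997596,0.701923,,\n"
rows = list(csv.reader(io.StringIO(sample)))
print(all(row_is_valid(r) for r in rows))
# prints: True
```

To check the real file, the `io.StringIO(sample)` argument could be replaced with an open handle on `dest_path + '_annotations.csv'`.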