# Label generator

This notebook generates a CSV file from the images in the project's dataset folders. For each image, the CSV contains:

* `file_name` - Image file name.
* `label` - The label of the class to which the image belongs (`bee`, `other_insect`, `other_noinsect`, `wasp`).
* `file_path` - Relative path made of the name of the folder that contains the image and the image file name (for example, `bee/10007154554_026417cfd0_n.jpg`). We will need it later when generating the data to train the model.

The notebook also splits the dataset into training, validation and test subsets.

In [1]:
import os
from glob import glob
import pandas as pd
from sklearn.model_selection import train_test_split

In [4]:
DATASET_PATH = '../dataset'
CSV_PATH = '../'

## Import image paths

We create a list with the paths to all the images in our dataset.

In [5]:
image_files = sorted(glob(os.path.join(DATASET_PATH, '**', '*.*'), recursive=True))

In [6]:
len(image_files)

12653

In [7]:
image_files[:5]

['dataset/bee/10007154554_026417cfd0_n.jpg',
 'dataset/bee/10024864894_6dc54d4b34_n.jpg',
 'dataset/bee/10092043833_7306dfd1f0_n.jpg',
 'dataset/bee/1011948979_fc3637e779_w.jpg',
 'dataset/bee/10128235063_dca17db76c_n.jpg']
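Note that the `'*.*'` pattern matches any file whose name contains a dot, so a stray non-image file inside a class folder would also be listed. The following is a minimal sketch of an extension filter. The `IMAGE_EXTENSIONS` set and the `list_images` helper are hypothetical names introduced for illustration, and the set of extensions is an assumption about the dataset. The sketch builds a throwaway folder so that it can run on its own.

```python
import os
import tempfile
from glob import glob

# Assumption: these are the image extensions present in the dataset.
IMAGE_EXTENSIONS = {'.jpg', '.jpeg', '.png'}

def list_images(dataset_path):
    """Return sorted paths under dataset_path whose extension looks like an image."""
    candidates = glob(os.path.join(dataset_path, '**', '*.*'), recursive=True)
    return sorted(
        p for p in candidates
        if os.path.isfile(p) and os.path.splitext(p)[1].lower() in IMAGE_EXTENSIONS
    )

# Throwaway folder that mimics the <dataset>/<label>/<file> layout.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, 'bee'))
    for name in ('a.jpg', 'b.JPG', 'notes.txt'):
        open(os.path.join(root, 'bee', name), 'w').close()
    found = [os.path.basename(p) for p in list_images(root)]

print(found)
```

The comparison is done on the lower-cased extension, so `b.JPG` is kept while `notes.txt` is dropped.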

## Create DataFrame with labels

In [8]:
df = pd.DataFrame(image_files, columns=['path'])
# The file name is the last path component; the label is the name of its parent folder.
df['file_name'] = df['path'].apply(os.path.basename)
df['label'] = df['path'].apply(lambda x: os.path.basename(os.path.dirname(x)))
# Relative path in the form <label>/<file_name>.
df['file_path'] = df['label'] + '/' + df['file_name']
df = df.drop(columns='path')

In [9]:
df

| | file_name | label | file_path |
|---|---|---|---|
| 0 | 10007154554_026417cfd0_n.jpg | bee | bee/10007154554_026417cfd0_n.jpg |
| 1 | 10024864894_6dc54d4b34_n.jpg | bee | bee/10024864894_6dc54d4b34_n.jpg |
| 2 | 10092043833_7306dfd1f0_n.jpg | bee | bee/10092043833_7306dfd1f0_n.jpg |
| 3 | 1011948979_fc3637e779_w.jpg | bee | bee/1011948979_fc3637e779_w.jpg |
| 4 | 10128235063_dca17db76c_n.jpg | bee | bee/10128235063_dca17db76c_n.jpg |
| ... | ... | ... | ... |
| 12648 | wasp_image_93.jpg | wasp | wasp/wasp_image_93.jpg |
| 12649 | wasp_image_94.jpg | wasp | wasp/wasp_image_94.jpg |
| 12650 | wasp_image_95.jpg | wasp | wasp/wasp_image_95.jpg |
| 12651 | wasp_image_97.jpg | wasp | wasp/wasp_image_97.jpg |
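To make the column derivation concrete, here is a small sketch that applies the same logic to a single path taken from the listing earlier in the notebook. It uses `pathlib` instead of `os.path`, and the `describe` helper is a hypothetical name used only for this illustration.

```python
from pathlib import PurePosixPath

def describe(path):
    """Split an image path into the three fields stored in the CSV."""
    p = PurePosixPath(path)
    file_name = p.name      # last path component
    label = p.parent.name   # name of the folder that contains the image
    file_path = f'{label}/{file_name}'
    return {'file_name': file_name, 'label': label, 'file_path': file_path}

row = describe('dataset/bee/10007154554_026417cfd0_n.jpg')
print(row)
```

Because the label is taken from the parent folder name, this only works when every image sits directly inside its class folder.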


We check that we have the labels we need.

In [10]:
df.label.unique()

array(['bee', 'other_insect', 'other_noinsect', 'wasp'], dtype=object)

We check how many images we have of each class.

In [11]:
df.label.value_counts()

wasp              3264
bee               3232
other_noinsect    3218
other_insect      2939
Name: label, dtype: int64
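As a quick way to judge class balance, the counts above can be turned into shares of the whole dataset. This is only a sketch: the counts are copied by hand from the output above, and `imbalance_ratio` is a hypothetical name for the ratio between the largest and the smallest class.

```python
# Counts copied from the value_counts() output above.
counts = {'wasp': 3264, 'bee': 3232, 'other_noinsect': 3218, 'other_insect': 2939}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
# Ratio between the largest and the smallest class; 1.0 means perfectly balanced.
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, share in shares.items():
    print(f'{label:<15} {share:.1%}')
print(f'total={total}, imbalance ratio={imbalance_ratio:.2f}')
```

The total equals the 12653 paths found earlier, so every image received a label.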

## Split the data into training, validation and test sets

In [13]:
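# First split: 80% training, 20% held out for validation and test.
train, holdout = train_test_split(df,
                                  test_size=0.2,
                                  random_state=42,
                                  stratify=df['label'])

In [14]:
# Second split: divide the held-out 20% evenly into validation and test.
val, test = train_test_split(holdout,
                             test_size=0.5,
                             random_state=42,
                             stratify=holdout['label'])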
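The two calls above amount to an 80/10/10 split that keeps the class proportions in every subset. The sketch below reproduces the same two-step pattern on synthetic labels (4 classes of 250 items each, an assumption chosen so the numbers divide evenly) to show the resulting sizes and per-class counts.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the label column: 4 classes with 250 items each.
labels = [c for c in ('bee', 'wasp', 'other_insect', 'other_noinsect') for _ in range(250)]
items = list(range(len(labels)))

# 80% training, 20% held out.
train_x, rest_x, train_y, rest_y = train_test_split(
    items, labels, test_size=0.2, random_state=42, stratify=labels)
# Held-out part split in half: 10% validation, 10% test.
val_x, test_x, val_y, test_y = train_test_split(
    rest_x, rest_y, test_size=0.5, random_state=42, stratify=rest_y)

print(len(train_x), len(val_x), len(test_x))
print(Counter(train_y), Counter(val_y), Counter(test_y))
```

Passing `stratify` in both calls is what keeps the classes evenly represented. Without it, the subsets would be drawn at random and their class shares could drift.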

In [15]:
df_f = pd.concat([train,val,test], keys=['train', 'validation', 'test']).reset_index()
df_f.drop('level_1', axis=1, inplace=True)
df_f = df_f.rename({'level_0': 'subset'}, axis=1)
df_f = df_f[['file_path', 'file_name', 'label', 'subset']]
df_f

| | file_path | file_name | label | subset |
|---|---|---|---|---|
| 0 | wasp/1331787019_ca513a7acf_n.jpg | 1331787019_ca513a7acf_n.jpg | wasp | train |
| 1 | bee/14322267704_2ac34a2af2_n.jpg | 14322267704_2ac34a2af2_n.jpg | bee | train |
| 2 | wasp/7382817412_b5a0f8c899_w.jpg | 7382817412_b5a0f8c899_w.jpg | wasp | train |
| 3 | wasp/4250759545_eb707b1145_n.jpg | 4250759545_eb707b1145_n.jpg | wasp | train |
| 4 | other_insect/insect3.jpg | insect3.jpg | other_insect | train |
| ... | ... | ... | ... | ... |
| 12648 | other_insect/44233106205_938f38751a_n.jpg | 44233106205_938f38751a_n.jpg | other_insect | test |
| 12649 | other_noinsect/animal_image_385.jpg | animal_image_385.jpg | other_noinsect | test |
| 12650 | wasp/J00321.jpg | J00321.jpg | wasp | test |
| 12651 | wasp/39517028_0f3fbfed55_n.jpg | 39517028_0f3fbfed55_n.jpg | wasp | test |


We check how many images we have in each subset.

In [16]:
df_f.value_counts('subset')

subset
train         10122
test           1266
validation     1265
dtype: int64
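The subset sizes are not exact multiples of 10% because 12653 does not divide evenly. The arithmetic below is a sketch of where the numbers come from, assuming that `train_test_split` rounds the requested test fraction up and gives the remainder to the other part.

```python
import math

n_total = 12653  # number of image paths found earlier

# Assumption: the test fraction is rounded up and the remainder goes to the other part.
n_holdout = math.ceil(0.2 * n_total)   # first split: held-out 20%
n_train = n_total - n_holdout
n_test = math.ceil(0.5 * n_holdout)    # second split: half of the held-out part
n_val = n_holdout - n_test

print(n_train, n_val, n_test)
```

Under that assumption the computed sizes agree with the counts shown above: 10122 training, 1265 validation and 1266 test images.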

## Export CSV with data

In [17]:
df_f.to_csv(os.path.join(CSV_PATH, 'bs_labels.csv'), index=False)
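As a sketch of how the exported file might be consumed later, the example below reads a CSV with the same four columns, keeps the training rows and rebuilds full image paths from `file_path`. The sample rows, the `dataset_root` value and the `full_path` column are hypothetical and only serve the illustration. A temporary file stands in for `bs_labels.csv` so the example runs on its own.

```python
import os
import tempfile
import pandas as pd

# Tiny stand-in for bs_labels.csv with the same four columns.
sample = pd.DataFrame({
    'file_path': ['bee/a.jpg', 'wasp/b.jpg', 'bee/c.jpg'],
    'file_name': ['a.jpg', 'b.jpg', 'c.jpg'],
    'label': ['bee', 'wasp', 'bee'],
    'subset': ['train', 'train', 'test'],
})

with tempfile.TemporaryDirectory() as tmp:
    csv_file = os.path.join(tmp, 'bs_labels.csv')
    sample.to_csv(csv_file, index=False)
    labels_df = pd.read_csv(csv_file)

# Hypothetical dataset root; adjust to wherever the images live.
dataset_root = '../dataset'
train_df = labels_df[labels_df['subset'] == 'train'].copy()
train_df['full_path'] = train_df['file_path'].apply(lambda p: os.path.join(dataset_root, p))

print(train_df[['full_path', 'label']])
```

Storing `file_path` relative to the dataset folder keeps the CSV valid when the dataset is moved, since only `dataset_root` has to change.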