# 2 - Data Preprocessing

This is the second notebook of our project. In this notebook, we will process our data following the steps outlined below:

- Filter our data by our label
- Split the data into training and test sets
- Create a test folder and move the test images into it
- Save the CSV files used to train the model

**Let's import the functions to process the data**

In [1]:
# to load, filter and save datasets
from utils.data_eda_viz_preprocessing import load_csv_as_dataset
from utils.data_preprocess import clean_dataframe
from utils.data_preprocess import save_datasets_to_csv

# to create folders and move images
from utils.data_preprocess import create_test_folder
from utils.data_preprocess import move_images_to_test_folder
import glob

# to split our data
from sklearn.model_selection import train_test_split

# others
import numpy as np

In [2]:
df_train = load_csv_as_dataset("data/filtered_csv/train_filtered.csv")
df_train.head()

Unnamed: 0.1,Unnamed: 0,ImageID,Source,LabelName,Confidence,XMin,XMax,YMin,YMax,IsOccluded,...,IsDepiction,IsInside,XClick1X,XClick2X,XClick3X,XClick4X,XClick1Y,XClick2Y,XClick3Y,XClick4Y
0,1035,0000a90019e380dc,xclick,/m/0cmf2,1,0.0,0.922452,0.262697,0.707531,1,...,0,0,0.293308,0.06379,0.0,0.922452,0.262697,0.707531,0.495622,0.567426
1,6097,00042d9c8cb5aad4,xclick,/m/0cmf2,1,0.0,0.207813,0.473437,0.603125,0,...,0,0,0.00625,0.079687,0.207813,0.0,0.473437,0.603125,0.564063,0.548438
2,6098,00042d9c8cb5aad4,xclick,/m/0cmf2,1,0.0,0.659375,0.528125,0.801562,0,...,0,0,0.385937,0.176563,0.0,0.659375,0.528125,0.801562,0.673438,0.66875
3,6099,00042d9c8cb5aad4,xclick,/m/0cmf2,1,0.4375,0.967188,0.48125,0.64375,0,...,0,0,0.478125,0.4375,0.967188,0.907813,0.48125,0.5625,0.589063,0.64375
4,6468,00048f37069b6aa8,xclick,/m/0cmf2,1,0.0,0.922951,0.185751,0.997455,0,...,0,0,0.009836,0.768852,0.922951,0.0,0.997455,0.185751,0.722646,0.653944


In [3]:
df_validation = load_csv_as_dataset("data/filtered_csv/validation_filtered.csv")
df_validation.head()

Unnamed: 0.1,Unnamed: 0,ImageID,Source,LabelName,Confidence,XMin,XMax,YMin,YMax,IsOccluded,IsTruncated,IsGroupOf,IsDepiction,IsInside
0,0,0001eeaf4aed83f9,xclick,/m/0cmf2,1,0.022673,0.964201,0.071038,0.800546,0,0,0,0,0
1,75,0009bad4d8539bb4,xclick,/m/0cmf2,1,0.294551,0.705449,0.340708,0.515487,0,0,0,0,0
2,213,0019e544c79847f5,xclick,/m/0cmf2,1,0.0,0.349558,0.106195,0.396018,0,0,0,0,0
3,214,0019e544c79847f5,xclick,/m/0cmf2,1,0.538348,0.874631,0.688053,0.909292,0,0,0,0,0
4,578,007384da2ed0464f,xclick,/m/0cmf2,1,0.0,1.0,0.372917,0.76875,0,1,0,0,0


In [4]:
df_train.shape, df_validation.shape

((1690, 22), (325, 14))

### Data Split into Test and Train

Since we do not have a separate test dataset, we will split our *df_train* dataset to create the train and test datasets.

In [5]:
train, test = train_test_split(df_train, test_size=0.2, random_state=42)

In [6]:
# Sizes of our new sets
train.shape, test.shape

((1352, 22), (338, 22))
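As a quick sanity check on the sizes above, the arithmetic behind the 80/20 split can be sketched as follows. This is only an illustration of the row counts, not scikit-learn's implementation. It assumes that a fractional `test_size` is rounded up to a whole number of rows and that the remaining rows go to the training set.

```python
import math

n_rows = 1690      # rows in df_train, taken from df_train.shape above
test_size = 0.2    # fraction passed to train_test_split

# Assumption: the fractional test size is rounded up to whole rows,
# and every remaining row goes to the training set.
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

print(n_train, n_test)  # 1352 338
```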

### Create a folder for the test images

Train and test images are currently in the same folder, so we now create a separate folder for our test images. To do this, we grab the ImageIDs of our test dataset, create the folder, and then move the test images into it.

In [7]:
test_img_ids, train_img_ids = create_test_folder(test=test, train=train)

len(train_img_ids), len(test_img_ids)

(1352, 338)

**Some ImageIDs appear in both datasets, because the same image can have several bounding boxes and these may have been split between train and test. If an ID is in both datasets, we copy the image so that it stays available for training. If not, we move it, since only the test set uses it.**

In [8]:
images_moved = move_images_to_test_folder(
    test_img_ids=test_img_ids,
    train_img_ids=train_img_ids,
    source_folder="unzipped/trainImages/train/data/*.jpg",
    dest_folder="unzipped/testImages/data",
)

print(f"\nCopied files: {images_moved[0]}")
print(f"\nMoved files: {images_moved[1]}")


Copied files: 133

Moved files: 127
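The implementation of `move_images_to_test_folder` lives in `utils` and is not shown in this notebook. The copy-or-move rule described above could look roughly like the sketch below. The helper name `copy_or_move_images`, its parameters, and the `<ImageID>.jpg` file naming are assumptions made for illustration only.

```python
import shutil
import tempfile
from pathlib import Path


def copy_or_move_images(test_ids, train_ids, source_dir, dest_dir):
    """Hypothetical helper: copy images shared with train, move test-only ones."""
    source_dir, dest_dir = Path(source_dir), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    shared = set(test_ids) & set(train_ids)
    copied = moved = 0
    for image_id in sorted(set(test_ids)):
        src = source_dir / f"{image_id}.jpg"  # assumed naming: <ImageID>.jpg
        if not src.exists():
            continue  # skip IDs with no file on disk
        if image_id in shared:
            shutil.copy2(src, dest_dir / src.name)  # train still needs this image
            copied += 1
        else:
            shutil.move(str(src), str(dest_dir / src.name))  # test-only image
            moved += 1
    return copied, moved


# Toy run in a temporary directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    src_dir = Path(tmp) / "train"
    src_dir.mkdir()
    for image_id in ("a", "b", "c"):
        (src_dir / f"{image_id}.jpg").touch()
    result = copy_or_move_images(
        test_ids=["b", "c", "c"],
        train_ids=["a", "b"],
        source_dir=src_dir,
        dest_dir=Path(tmp) / "test",
    )
    remaining = sorted(p.name for p in src_dir.glob("*.jpg"))

print(result)     # (1, 1)
print(remaining)  # ['a.jpg', 'b.jpg']
```

In the toy run, image `b` is used by both sets and is copied, while `c` is test-only and is moved, so the train folder keeps `a` and `b`.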


### Checking the number of images for test and train

In [9]:
from pathlib import Path

train_path = "unzipped/trainImages/train/data/*.jpg"
train_folder = glob.glob(train_path)

# Assumes each file is named <ImageID>.jpg, so the file stem is the ImageID
train_ids = [Path(img).stem for img in train_folder]

len(train_ids)

773

In [10]:
from pathlib import Path

test_path = "unzipped/testImages/data/*.jpg"
test_folder = glob.glob(test_path)

# Assumes each file is named <ImageID>.jpg, so the file stem is the ImageID
test_ids = [Path(img).stem for img in test_folder]

len(test_ids)

260

In [11]:
train_df_ids = train.ImageID.values.tolist()
test_df_ids = test.ImageID.values.tolist()


len(train_df_ids), len(test_df_ids)

(1352, 338)

**The dataframes have more rows (1352 and 338) than there are image files (773 and 260) because an ImageID can be repeated: each row describes one bounding box, so an image with several bounding boxes appears in several rows.**
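The difference between rows and images can be seen on the first five rows of `df_train` shown earlier. The sketch below is a small illustration with those five ImageIDs typed in by hand. It uses `collections.Counter` rather than pandas to stay self-contained.

```python
from collections import Counter

# ImageID values copied from the first five rows of df_train shown earlier
image_ids = [
    "0000a90019e380dc",
    "00042d9c8cb5aad4",
    "00042d9c8cb5aad4",
    "00042d9c8cb5aad4",
    "00048f37069b6aa8",
]

boxes_per_image = Counter(image_ids)
n_boxes = len(image_ids)          # one row per bounding box
n_images = len(boxes_per_image)   # one entry per distinct image

print(n_boxes, n_images)                     # 5 3
print(boxes_per_image["00042d9c8cb5aad4"])   # 3
```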

**Next, we want to ensure that the image IDs in our train and test folders match the IDs in our train and test datasets.**

In [12]:
# Compare the IDs in the dataframe with the IDs of the files in the train folder
train_both = set(train_df_ids).intersection(train_ids)
len(train_both)

773

In [13]:
# Compare the IDs in the dataframe with the IDs of the files in the test folder
test_both = set(test_df_ids).intersection(test_ids)
len(test_both)

260
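An intersection count tells us how many IDs match, but not which ones are missing if the numbers ever disagree. A set difference in each direction would show that. The helper `compare_id_sets` and the toy IDs below are hypothetical and only sketch the idea.

```python
def compare_id_sets(df_ids, folder_ids):
    """Hypothetical helper: report IDs that appear on one side only."""
    df_set, folder_set = set(df_ids), set(folder_ids)
    return {
        "missing_files": sorted(df_set - folder_set),     # in the dataframe, no file
        "unlabelled_files": sorted(folder_set - df_set),  # file present, no row
    }


report = compare_id_sets(
    df_ids=["img1", "img1", "img2", "img3"],
    folder_ids=["img1", "img2", "img4"],
)
print(report)
# {'missing_files': ['img3'], 'unlabelled_files': ['img4']}
```

With the notebook's own variables, both lists would be expected to be empty when the folders and datasets agree.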

**Finally, we will save our train and test datasets as CSV files so that we can use them in the rest of the project.**

In [14]:
save_datasets_to_csv(train_df=train, test_df=test, folder_path="data/csv")

Training dataset saved to data/csv/train.csv
Test dataset saved to data/csv/test.csv
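The `Unnamed: 0` columns in the previews at the top of this notebook typically appear when a DataFrame's index is written to CSV and then read back. Whether `save_datasets_to_csv` writes the index is not shown here. The sketch below uses plain pandas and a hypothetical `round_trip` helper to show the effect of `index=False`.

```python
import io

import pandas as pd

df = pd.DataFrame(
    {"ImageID": ["0000a90019e380dc", "00042d9c8cb5aad4"], "XMin": [0.0, 0.4375]}
)


def round_trip(frame, **to_csv_kwargs):
    """Hypothetical helper: write a frame to an in-memory CSV and read it back."""
    buffer = io.StringIO()
    frame.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)
    return pd.read_csv(buffer)


with_index = round_trip(df)                   # index written as an unnamed column
without_index = round_trip(df, index=False)   # index left out of the file

print(list(with_index.columns))     # ['Unnamed: 0', 'ImageID', 'XMin']
print(list(without_index.columns))  # ['ImageID', 'XMin']
```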
