# 3 - Feature Modelling

This is the third notebook of our project. In this notebook we apply feature engineering to our datasets to prepare them for our object detection models built with MXNet.

We use MXNet because it is an established deep learning framework designed for scalability and cross-platform deployment, which makes it a good fit for object detection tasks on AWS. AWS supports MXNet natively, which simplifies integration and resource management. The feature engineering in this notebook tailors our dataset to the requirements of our model: we transform it into the LST label format, a compact representation that holds an image index, a variable-length label, and an image path. This step keeps our data consistent and ready for training within the MXNet environment.

The format of an LST file is:


<code>integer_image_index \t label_of_variable_length \t relative_path_to_image</code>


More information can be found <a href="https://cv.gluon.ai/build/examples_datasets/detection_custom.html#lst-label-for-gluoncv-and-mxnet">here</a>.
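To make the layout concrete, here is a minimal sketch of how a single LST line could be assembled. The helper name `make_lst_line` and the four-decimal formatting are assumptions made for illustration; the header values `[2, 5]` mirror the `header_cols` and `label_width` columns used in this notebook, and each object is assumed to be `[class_id, xmin, ymin, xmax, ymax]`.

```python
def make_lst_line(index, header, objects, image_path):
    """Build one tab-separated LST line (illustrative helper, not a library API).

    header  -- leading label fields, here [header_cols, label_width]
    objects -- one [class_id, xmin, ymin, xmax, ymax] list per bounding box
    """
    # The variable-length label is the header followed by every object, flattened.
    label = [float(v) for v in header]
    for obj in objects:
        label.extend(float(v) for v in obj)

    # index \t label fields ... \t relative image path
    fields = [str(index)] + [f"{v:.4f}" for v in label] + [image_path]
    return "\t".join(fields)


line = make_lst_line(
    index=0,
    header=[2, 5],
    objects=[[0, 0.5375, 0.347092, 0.99875, 0.574109]],
    image_path="airplanes/images/train/0644da39dd206abc.jpg",
)
print(line)
```

With one bounding box the line has nine tab-separated fields: the index, two header values, five object values, and the path. Each additional box on the same image would add five more label fields.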

In [1]:
import pandas as pd
import shutil
import glob
from utils.data_eda_viz_preprocessing import load_csv_as_dataset
from utils.data_preprocess import save_datasets_to_csv
from utils.data_eda_viz_preprocessing import extract_zip_to_folder
from utils.data_engineering import insert_column


import warnings
warnings.filterwarnings('ignore')

### Loading Datasets

In [2]:
# Loading our datasets
train = load_csv_as_dataset('data/csv/train.csv')
test = load_csv_as_dataset('data/csv/test.csv')

# Make copies in order to keep the original datasets
train_df = train.copy()
test_df = test.copy()

In [3]:
train_df.columns

Index(['Unnamed: 0', 'ImageID', 'Source', 'LabelName', 'Confidence', 'XMin',
       'XMax', 'YMin', 'YMax', 'IsOccluded', 'IsTruncated', 'IsGroupOf',
       'IsDepiction', 'IsInside', 'XClick1X', 'XClick2X', 'XClick3X',
       'XClick4X', 'XClick1Y', 'XClick2Y', 'XClick3Y', 'XClick4Y'],
      dtype='object')

In [4]:
test_df.columns

Index(['Unnamed: 0', 'ImageID', 'Source', 'LabelName', 'Confidence', 'XMin',
       'XMax', 'YMin', 'YMax', 'IsOccluded', 'IsTruncated', 'IsGroupOf',
       'IsDepiction', 'IsInside', 'XClick1X', 'XClick2X', 'XClick3X',
       'XClick4X', 'XClick1Y', 'XClick2Y', 'XClick3Y', 'XClick4Y'],
      dtype='object')

### Selecting necessary columns

In [5]:
train_df = train_df[['LabelName', 'XMin','YMin', 'XMax', 'YMax', 'ImageID']]
test_df = test_df[['LabelName', 'XMin','YMin', 'XMax', 'YMax', 'ImageID']]

In [6]:
train_df.head(2)

Unnamed: 0,LabelName,XMin,YMin,XMax,YMax,ImageID
0,/m/0cmf2,0.5375,0.347092,0.99875,0.574109,0644da39dd206abc
1,/m/0cmf2,0.286875,0.628385,0.336875,0.652661,01f0114cacd689a3


In [7]:
test_df.head(2)

Unnamed: 0,LabelName,XMin,YMin,XMax,YMax,ImageID
0,/m/0cmf2,0.444653,0.493125,0.546904,0.585,0289cd0483d2f758
1,/m/0cmf2,0.553125,0.487395,0.6325,0.612512,094d697dbc53a9f2


### Inserting <code>header_cols</code> and <code>label_width</code>

In [8]:
train_df = insert_column(train_df,0,"header_cols", 2)
train_df = insert_column(train_df,1,"label_width", 5)

test_df = insert_column(test_df,0,"header_cols", 2)
test_df = insert_column(test_df,1,"label_width", 5)
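The implementation of `insert_column` lives in our `utils` package and is not shown in this notebook. As a rough sketch, assuming it simply places a constant-valued column at a given position, it could look like the following. The name `insert_constant_column` is hypothetical and chosen to avoid implying this is the actual helper.

```python
import pandas as pd


def insert_constant_column(df, position, name, value):
    """Return a copy of df with a constant-valued column at the given position.

    Hypothetical stand-in for utils.data_engineering.insert_column.
    """
    out = df.copy()                    # leave the caller's dataframe untouched
    out.insert(position, name, value)  # DataFrame.insert works in place on the copy
    return out


demo = pd.DataFrame({"LabelName": ["/m/0cmf2"], "XMin": [0.5375]})
demo = insert_constant_column(demo, 0, "header_cols", 2)
demo = insert_constant_column(demo, 1, "label_width", 5)
print(list(demo.columns))
```

Working on a copy keeps the function free of side effects, which matches how the cell above reassigns the result to `train_df` and `test_df`.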

In [9]:
train_df.head(2)

Unnamed: 0,header_cols,label_width,LabelName,XMin,YMin,XMax,YMax,ImageID
0,2,5,/m/0cmf2,0.5375,0.347092,0.99875,0.574109,0644da39dd206abc
1,2,5,/m/0cmf2,0.286875,0.628385,0.336875,0.652661,01f0114cacd689a3


In [10]:
test_df.head(2)

Unnamed: 0,header_cols,label_width,LabelName,XMin,YMin,XMax,YMax,ImageID
0,2,5,/m/0cmf2,0.444653,0.493125,0.546904,0.585,0289cd0483d2f758
1,2,5,/m/0cmf2,0.553125,0.487395,0.6325,0.612512,094d697dbc53a9f2


### Formatting <code>LabelName</code> column

We rename the column to <code>className</code> and set every value to 0.000, because the LST format expects a numeric class id and all of our annotations belong to a single class.

In [11]:
train_df.rename(columns={"LabelName": "className"}, inplace=True)
test_df.rename(columns={"LabelName": "className"}, inplace=True)

In [12]:
train_df.className = "0.000"
test_df.className = "0.000"

In [13]:
train_df.head(2)

Unnamed: 0,header_cols,label_width,className,XMin,YMin,XMax,YMax,ImageID
0,2,5,0.0,0.5375,0.347092,0.99875,0.574109,0644da39dd206abc
1,2,5,0.0,0.286875,0.628385,0.336875,0.652661,01f0114cacd689a3


In [14]:
test_df.head(2)

Unnamed: 0,header_cols,label_width,className,XMin,YMin,XMax,YMax,ImageID
0,2,5,0.0,0.444653,0.493125,0.546904,0.585,0289cd0483d2f758
1,2,5,0.0,0.553125,0.487395,0.6325,0.612512,094d697dbc53a9f2
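Setting every row to the same value works here because there is only one class. As a hedged sketch of how the same step might generalize, the snippet below maps each distinct label to a zero-based float id. The second label, `/m/hypothetical`, is made up purely for illustration, and zero-based ids are an assumption about what the framework expects.

```python
# Example labels: the first value comes from our data, the second is invented.
labels = ["/m/0cmf2", "/m/0cmf2", "/m/hypothetical"]

# Sort the distinct labels so the mapping is deterministic between runs.
class_ids = {name: float(i) for i, name in enumerate(sorted(set(labels)))}

# Replace each label with its numeric class id.
encoded = [class_ids[name] for name in labels]

print(class_ids)
print(encoded)
```

The same dictionary would have to be applied to both the train and test sets so that a class keeps the same id in each.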


### Formatting <code>ImageID</code> column

As explained above, MXNet requires a relative path to each image. We will create a new folder and move the images there. We rename <code>ImageID</code> to <code>ImagePath</code>, and each value becomes the relative path of the image inside this new folder.


First, we format the column and its values.

In [15]:
# Change column name
train_df.rename(columns={"ImageID": "ImagePath"}, inplace=True)
test_df.rename(columns={"ImageID": "ImagePath"}, inplace=True)

In [16]:
# Format value to point to the new folder
train_df.ImagePath = "airplanes/images/train/" + train_df.ImagePath + ".jpg"
test_df.ImagePath = "airplanes/images/test/" + test_df.ImagePath + ".jpg"

In [17]:
train_df.head(2)

Unnamed: 0,header_cols,label_width,className,XMin,YMin,XMax,YMax,ImagePath
0,2,5,0.0,0.5375,0.347092,0.99875,0.574109,airplanes/images/train/0644da39dd206abc.jpg
1,2,5,0.0,0.286875,0.628385,0.336875,0.652661,airplanes/images/train/01f0114cacd689a3.jpg


In [18]:
test_df.head(2)

Unnamed: 0,header_cols,label_width,className,XMin,YMin,XMax,YMax,ImagePath
0,2,5,0.0,0.444653,0.493125,0.546904,0.585,airplanes/images/test/0289cd0483d2f758.jpg
1,2,5,0.0,0.553125,0.487395,0.6325,0.612512,airplanes/images/test/094d697dbc53a9f2.jpg


Now that our dataset has the column order and values required by the MXNet framework, we move the images from the <code>unzipped</code> folder to our new ***train*** and ***test*** folders.

In [19]:
# Move the extracted images into the new train and test folders
shutil.move("unzipped/trainImages/train/data", "airplanes/images/train/")
shutil.move("unzipped/testImages/data", "airplanes/images/test/")

'airplanes/images/test/'

In [20]:
# Images count in each folder
folder_train = glob.glob("airplanes/images/train/*.jpg")
folder_test = glob.glob("airplanes/images/test/*.jpg")

count_train_images = len(folder_train)
count_test_images = len(folder_test)

print(f"Number of images in train folder: {count_train_images}")
print(f"Number of images in test folder: {count_test_images}")

Number of images in train folder: 773
Number of images in test folder: 260
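Counting files confirms that the move worked, but it does not show whether every path in the dataframes points to a real file. The sketch below is one way such a check could be done. `find_missing_images` is a hypothetical helper, and the temporary directory only stands in for the project folder so that the example runs on its own.

```python
from pathlib import Path
import tempfile


def find_missing_images(image_paths, root="."):
    """Return the relative paths that do not exist as files under root."""
    root = Path(root)
    return [p for p in image_paths if not (root / p).is_file()]


# Self-contained demo: build a tiny folder tree with one real image file.
with tempfile.TemporaryDirectory() as tmp:
    img_dir = Path(tmp) / "airplanes" / "images" / "train"
    img_dir.mkdir(parents=True)
    (img_dir / "0644da39dd206abc.jpg").touch()

    paths = [
        "airplanes/images/train/0644da39dd206abc.jpg",
        "airplanes/images/train/not_there.jpg",  # deliberately missing
    ]
    missing = find_missing_images(paths, root=tmp)

print(missing)
```

In the notebook this could be called as `find_missing_images(train_df.ImagePath.unique())`, where an empty list would mean that every annotated image is present.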


With the images in place, we can save our new dataframes.

In [21]:
save_datasets_to_csv(train_df=train_df, test_df=test_df, folder_path= "data/processed_csv")

Training dataset saved to data/processed_csv/train.csv
Test dataset saved to data/processed_csv/test.csv
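The processed CSV files hold the columns in LST order, so a later step could write them out as tab-separated text. The following is a minimal sketch under the assumption that each row describes one image with one bounding box. It writes to an in-memory buffer instead of a file, and the dataframe is a one-row stand-in for `train_df`.

```python
import io

import pandas as pd

# One-row stand-in for train_df, using the first training example shown above.
df = pd.DataFrame({
    "header_cols": [2],
    "label_width": [5],
    "className": [0.0],
    "XMin": [0.5375],
    "YMin": [0.347092],
    "XMax": [0.99875],
    "YMax": [0.574109],
    "ImagePath": ["airplanes/images/train/0644da39dd206abc.jpg"],
})

buffer = io.StringIO()
# Tab-separated, no header row; the dataframe index becomes the image index.
df.to_csv(buffer, sep="\t", header=False, index=True)
lst_text = buffer.getvalue()
print(lst_text)
```

Images that have several bounding boxes would need their rows grouped into a single line first, since the LST label holds all objects of an image together.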
