# Methodology and downloading datasets

---

> This section defines the problem, chooses a measure of success, decides on an evaluation protocol, and provides a script to download the raw data in the required form.

## 1. Defining the problem and assembling a dataset

> The problem is a multiclass single-label image classification problem. The initial objective is to build models that can accurately classify a vehicle by its type, for example predicting whether a given image shows a bike, a pickup truck, or a minibus.

> The main objective is to build models that are small and efficient, and yet can predict with a relatively high degree of accuracy.

> The main dataset being used is called 'A Dataset Containing Tiny and Low Quality Images for Vehicle Classification' downloaded from this link: [https://zenodo.org/record/6634554](https://zenodo.org/record/6634554). It contains six classes of vehicles and 800 images for each of those classes. This dataset is referred to as 'zenodo' in this project.

> The secondary dataset which will be used for transfer learning is called 'Vehicle Type Image Dataset (Version 2)' downloaded from this link: [https://data.mendeley.com/datasets/htsngg9tpc](https://data.mendeley.com/datasets/htsngg9tpc). This dataset contains 4356 images split between five classes of vehicles. This dataset is referred to as 'vtid2' in this project.

> Both datasets are licensed under 'Creative Commons Attribution 4.0 International'. More information about them can be found in the file **datasets/dataset_sources.md**.
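
> The image counts quoted above can be checked once the data is on disk. The snippet below is a minimal sketch, assuming each dataset is laid out with one sub-folder per class. The helper name `count_images_per_class` and the list of image extensions are illustrative assumptions, not part of either dataset's documentation.

```python
from pathlib import Path

# Assumed set of image extensions; adjust it to match the actual files.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp"}


def count_images_per_class(dataset_root):
    """Return {class_folder_name: image_count} for a class-per-folder dataset."""
    counts = {}
    for class_dir in sorted(Path(dataset_root).iterdir()):
        if not class_dir.is_dir():
            continue  # ignore loose files next to the class folders
        counts[class_dir.name] = sum(
            1
            for f in class_dir.rglob("*")
            if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
        )
    return counts
```

> Calling `count_images_per_class` on a dataset's raw folder returns a dictionary that can be compared against the class and image counts stated for that dataset.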


### 1.1 Downloading the datasets

**Primary dataset (Zenodo)**

> The primary dataset (Zenodo) is already in the *datasets/data/raw* folder, under the name *'Dataset (Vehicles)'*.

**Secondary dataset (VTID2)**

> To download the secondary dataset (VTID2), which will be used for transfer learning, run the code cell below:

In [None]:
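# Download VTID2 dataset

# Remove any previously extracted copy so the unzip below starts from a clean folder
!rm -rf "./datasets/data/raw/htsngg9tpc-2"

# -nc: do not download again if the zip file is already present
!wget -nc "https://md-datasets-cache-zipfiles-prod.s3.eu-west-1.amazonaws.com/htsngg9tpc-2.zip" -O "./datasets/data/raw/htsngg9tpc-2.zip"

# Unzip dataset
!unzip "./datasets/data/raw/htsngg9tpc-2.zip" -d "./datasets/data/raw/"

# Delete the archive once it has been extracted
!rm -rf "./datasets/data/raw/htsngg9tpc-2.zip"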

## 2. Choosing a measure of success

----

Primary metric: **accuracy**

Secondary metrics: *class-wise precision and recall*

----
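
To make the metrics above concrete, here is a minimal pure-Python sketch of how accuracy and class-wise precision and recall can be computed from true and predicted labels. The function names and the toy labels are illustrative assumptions. In practice a library routine would normally be used instead.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def classwise_precision_recall(y_true, y_pred):
    """Return {label: (precision, recall)} for every label that occurs."""
    true_positives = Counter()
    predicted = Counter()  # how often each label was predicted
    actual = Counter()     # how often each label really occurs
    for t, p in zip(y_true, y_pred):
        actual[t] += 1
        predicted[p] += 1
        if t == p:
            true_positives[t] += 1

    result = {}
    for label in sorted(set(actual) | set(predicted)):
        # Report 0.0 when a label was never predicted or never occurs.
        precision = true_positives[label] / predicted[label] if predicted[label] else 0.0
        recall = true_positives[label] / actual[label] if actual[label] else 0.0
        result[label] = (precision, recall)
    return result


# Toy example with made-up labels
y_true = ["bike", "bike", "car", "car", "bus", "bus"]
y_pred = ["bike", "car", "car", "car", "bus", "bike"]
print(accuracy(y_true, y_pred))
print(classwise_precision_recall(y_true, y_pred))
```

Accuracy summarizes overall performance, while the per-class values show which vehicle types are confused with each other.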

## 3. Deciding on an evaluation protocol

----

Evaluation Protocol: **Maintaining a hold-out validation set**

----

Dataset split ratio:

| Split      | Ratio |
|------------|-------|
| Train      | 80%   |
| Validation | 10%   |
| Test       | 10%   |

----
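
The 80/10/10 split can be sketched as follows. This is a minimal illustration, assuming the samples are available as a list (for example, of file paths). The function name `split_dataset` and the fixed seed are assumptions made for reproducibility, not the project's actual splitting code.

```python
import random


def split_dataset(items, train_ratio=0.8, val_ratio=0.1, seed=42):
    """Shuffle items reproducibly and split them into train/validation/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # local RNG, global random state untouched

    n_train = round(len(shuffled) * train_ratio)
    n_val = round(len(shuffled) * val_ratio)

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains goes to the test split
    return train, val, test
```

Applying the function separately to the files of each class would keep the class proportions the same in all three splits.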

## 4. Last layer activation, optimization, and loss function

> This is a multiclass single-label image classification problem, so:

1. Last-layer activation - **softmax**

2. Loss function - **sparse_categorical_crossentropy**

3. Optimization Configuration - **rmsprop**
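
> For a single-label problem, a softmax output layer produces class probabilities that sum to 1, and sparse categorical cross-entropy scores them against an integer class index. The snippet below is a minimal pure-Python sketch of both calculations for a single sample. The function names and the example scores are illustrative assumptions, and a deep learning framework would compute the same quantities in batches.

```python
import math


def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_crossentropy(true_index, probabilities):
    """Loss for one sample: negative log-probability assigned to the true class."""
    return -math.log(probabilities[true_index])


# Made-up raw scores for three classes
probabilities = softmax([2.0, 1.0, 0.1])
print(probabilities)
print(sparse_categorical_crossentropy(0, probabilities))  # low loss: class 0 is most likely
print(sparse_categorical_crossentropy(2, probabilities))  # higher loss: class 2 is least likely
```

> The "sparse" variant takes the label as an integer index, so the labels do not need to be one-hot encoded.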

## 5. Fixing issues before data processing

#### Getting raw paths for the datasets 

In [10]:
raw_zenodo_path = './datasets/data/raw/Zenodo_Dataset'
raw_vtid2_path = './datasets/data/raw/htsngg9tpc-2'

#### Renaming the 'other' folder in the VTID2 dataset, since its images are all clearly bikes

In [11]:
import os

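# Build the source and target paths in a platform-independent way
other_dir = os.path.join(raw_vtid2_path, 'other')
bike_dir = os.path.join(raw_vtid2_path, 'bike')

# If the 'other' directory exists, rename it to 'bike'
if os.path.isdir(other_dir):
    os.rename(other_dir, bike_dir)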