# Methodology and downloading datasets

---

> This section defines the problem, chooses a measure of success, decides on an evaluation protocol, and downloads the raw data in the required form.

## 1. Defining the problem and assembling a dataset

> The problem is a multiclass, single-label image classification problem. The objective is to build models that accurately classify a vehicle by its type, for example predicting whether a given image shows a bike, a pickup truck, or a mini bus.

> A secondary objective is to build models that are small and efficient, and yet can predict with a relatively high degree of accuracy.

> The main dataset being used is called 'A Dataset Containing Tiny and Low Quality Images for Vehicle Classification' downloaded from this link: [https://zenodo.org/record/6634554](https://zenodo.org/record/6634554). It contains six classes of vehicles and 800 images for each of those classes. This dataset is referred to as 'zenodo' in this project.

> The secondary dataset which will be used for transfer learning is called 'Vehicle Type Image Dataset (Version 2)' downloaded from this link: [https://data.mendeley.com/datasets/htsngg9tpc](https://data.mendeley.com/datasets/htsngg9tpc). This dataset contains 4356 images split between five classes of vehicles. This dataset is referred to as 'vtid2' in this project.

> Both datasets are licensed under 'Creative Commons Attribution 4.0 International'. More information about them can be found in the file **datasets/dataset_sources.md** or in the **README.md** file.

> The primary dataset (Zenodo) is already in the **datasets/data/raw** folder. Running the code below downloads the secondary dataset (VTID2) into the same folder.


### 1.1 Downloading the VTID2 dataset

In [2]:
# Download VTID2 dataset
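
> The body of the download cell is not shown here. As a rough sketch only, and not the project's actual script, a zip archive could be fetched and unpacked with the standard library as below. The helper name `download_and_extract`, the temporary archive name `vtid2.zip`, and the `url` argument are all assumptions; the direct download URL for the archive is not given in this section and has to be supplied by the reader.

```python
import shutil
import urllib.request
import zipfile
from pathlib import Path


def download_and_extract(url, raw_dir):
    """Download a zip archive from `url` and unpack it into `raw_dir`."""
    raw_dir = Path(raw_dir)
    raw_dir.mkdir(parents=True, exist_ok=True)
    archive_path = raw_dir / "vtid2.zip"  # hypothetical temporary file name

    # Stream the response to disk instead of holding it all in memory
    with urllib.request.urlopen(url) as response, open(archive_path, "wb") as out:
        shutil.copyfileobj(response, out)

    # Unpack next to the other raw data, then discard the archive
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(raw_dir)
    archive_path.unlink()
    return raw_dir
```

> Calling the helper with the archive URL and `'./datasets/data/raw'` would place the extracted folder alongside the primary dataset, assuming the archive contains a single top-level dataset folder.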

## 2. Choosing a measure of success

> The primary measure of success is **accuracy**, which signals the overall performance of the models. The secondary metrics are the *precision* and *recall* of each individual class, which indicate where a model needs to improve further.

> The main dataset is perfectly balanced (800 images in each of the six classes), so accuracy is not skewed by a majority class and is a suitable primary metric for this multiclass classification problem.

----

Primary metric: **accuracy**

Secondary metrics: *class wise precision and recall*

----
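
> To make the three metrics concrete, here is a minimal pure-Python sketch that derives them from a confusion matrix. The function name `classification_metrics` and the convention of integer class labels starting at 0 are assumptions for illustration; the project may well compute these with a library instead.

```python
def classification_metrics(y_true, y_pred, num_classes):
    """Return overall accuracy plus per-class precision and recall."""
    # confusion[t][p] counts samples of true class t predicted as class p
    confusion = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        confusion[t][p] += 1

    correct = sum(confusion[c][c] for c in range(num_classes))
    accuracy = correct / len(y_true)

    precision, recall = [], []
    for c in range(num_classes):
        predicted_c = sum(confusion[t][c] for t in range(num_classes))
        actual_c = sum(confusion[c])
        # Report 0.0 when a class was never predicted or never present
        precision.append(confusion[c][c] / predicted_c if predicted_c else 0.0)
        recall.append(confusion[c][c] / actual_c if actual_c else 0.0)
    return accuracy, precision, recall
```

> Precision for a class is the share of predictions of that class that were correct; recall is the share of that class's images that were found. A class with low recall is one the model tends to miss, and a class with low precision is one it over-predicts.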

## 3. Deciding on an evaluation protocol

> The evaluation protocol is to maintain a hold-out validation set. 10% of the dataset is used as a validation set to tune hyperparameters and select the best models. The final models are evaluated on the test set, which is also 10% of the dataset, leaving 80% of the dataset for training.

> Since the main dataset has 4800 images split equally among six classes, there will be 480 images in the test set, 480 images in the validation set, and 3840 images in the training set. With 480 images in each evaluation set, the dataset is considered large enough to use the hold-out validation technique.

----

Evaluation Protocol: **Maintaining a hold-out validation set**

----

Dataset split ratio:

| Split      | Ratio |
|------------|-------|
| Train      | 80%   |
| Validation | 10%   |
| Test       | 10%   |

----
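
> The 80/10/10 split can be sketched as follows. This version splits each class separately (a stratified split) so that every subset stays balanced; stratification, the fixed seed, and the helper name `stratified_split` are assumptions, since the section only fixes the ratios.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_frac=0.1, test_frac=0.1, seed=42):
    """Split sample indices into train/val/test, keeping class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * val_frac)
        n_test = round(len(indices) * test_frac)
        val.extend(indices[:n_val])
        test.extend(indices[n_val:n_val + n_test])
        train.extend(indices[n_val + n_test:])
    return train, val, test
```

> For six classes of 800 images each, this gives 80 validation and 80 test images per class, which matches the 480 / 480 / 3840 totals implied by the ratios above.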

## 4. Last layer activation, optimization, and loss function

> This is a multiclass single-label image classification problem, so:

1. Last-layer activation - **softmax**

2. Loss function - **sparse_categorical_crossentropy**

3. Optimization Configuration - **rmsprop**
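
> Sparse categorical cross-entropy takes an integer class label and a vector of per-class probabilities, and penalises a low probability on the true class. The pure-Python sketch below shows that calculation for a single sample, using a softmax so that the six class probabilities sum to 1. The function names are illustrative and the input scores are made up; this is not the framework's implementation.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_crossentropy(label, probs):
    """Loss for one sample whose true class is the integer `label`."""
    return -math.log(probs[label])


# Six made-up scores, one per vehicle class; class 2 has the highest score
probs = softmax([0.5, 0.1, 2.0, -1.0, 0.0, 0.3])
loss_if_class_2 = sparse_categorical_crossentropy(2, probs)
loss_if_class_3 = sparse_categorical_crossentropy(3, probs)
```

> The loss is smaller when the true class is the one the model favours (`loss_if_class_2`) than when it is a class the model considers unlikely (`loss_if_class_3`). The "sparse" variant is used when labels are stored as integers rather than one-hot vectors.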

## 5. Fixing issues before data processing

#### Getting raw paths for the datasets 

In [3]:
raw_zenodo_path = './datasets/data/raw/Dataset (Vehicles)'
raw_vtid2_path = './datasets/data/raw/htsngg9tpc-2'
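
> Before processing, it can help to confirm that the raw folders contain what is expected. The sketch below counts image files per class folder; it assumes each dataset root holds one sub-folder per class and that images use the listed extensions. Both the layout and the helper name `count_images_per_class` are assumptions, not something this section confirms.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed image file types


def count_images_per_class(dataset_dir):
    """Map each class sub-folder name to the number of image files in it."""
    counts = {}
    for class_dir in sorted(Path(dataset_dir).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1 for f in class_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
            )
    return counts
```

> Running `count_images_per_class(raw_zenodo_path)` and `count_images_per_class(raw_vtid2_path)` would show whether the per-class counts line up with the dataset descriptions in section 1.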

#### Renaming the 'other' folder in the VTID2 dataset to 'bike', since its images are all clearly bikes

In [4]:
import os

# Build both paths portably, then rename 'other' to 'bike' if 'other' exists
other_dir = os.path.join(raw_vtid2_path, 'other')
bike_dir = os.path.join(raw_vtid2_path, 'bike')
if os.path.exists(other_dir):
    os.rename(other_dir, bike_dir)