# Vehicle Make Model Recognition Data Preparation

The objective of this project is to train a classifier: a model that receives an image of a vehicle as input and outputs a predicted label indicating the vehicle's **make and model**.

For this experiment, we will train on only the 5 classes that have the most images.

To prepare the dataset, we will split the images in these 5 class folders into **TRAIN**, **TEST** and **VAL** folders, holding roughly 70%, 15% and 15% of each class respectively.

The raw data has one folder per class, for example:
```
├── <vehicle make and model 1>
│   ├── <image of vehicle no. 1>
│   ├── <image of vehicle no. 2>
│   ├── <image of vehicle no. 3>
├── <vehicle make and model 2>
├── <vehicle make and model 3>
```

After the split, the layout looks like this:
```
├── TRAIN
    ├── Toyota Prius
    │   ├── 10_0_0_2.jpg
    │   ├── 40_5_2_5.jpg
    │   ├── 45_0_1_6.jpg
    ├── Toyota Camry
    ├── Honda Civic
├── TEST
    ├── Toyota Camry
    ├── Honda Civic
├── VAL
    ├── Toyota Camry
    ├── Honda Civic
```
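The three-way split described above can be sketched without any third-party dependency. This is only an illustration of the idea, not the method used in this notebook, which relies on `train_test_split` from scikit-learn. The function name `three_way_split`, its parameters and the file names are hypothetical; the 70/15/15 fractions mirror the `test_size` values used later in the notebook.

```python
import random


def three_way_split(items, train_frac=0.70, test_frac=0.15, seed=42):
    """Shuffle items with a fixed seed and cut them into train, test and val lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible

    n_train = round(len(shuffled) * train_frac)
    n_test = round(len(shuffled) * test_frac)

    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    val = shuffled[n_train + n_test:]  # whatever is left goes to validation
    return train, test, val


# Hypothetical file names, used only to show the resulting sizes
names = [f"img_{i}.jpg" for i in range(100)]
train, test, val = three_way_split(names)
print(len(train), len(test), len(val))  # 70 15 15
```

Splitting each class separately, as the notebook does, keeps the class proportions roughly equal across the three folders.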

In [None]:
import os
import pandas as pd
import shutil
import numpy as np
from sklearn.model_selection import train_test_split

In [None]:
root = '' # Insert root directory here

In [None]:
data_root_source = "data_classes_raw-old"
data_dest_root = "data_split"

In [None]:
# Create the destination root; exist_ok=True makes the cell safe to re-run
os.makedirs(data_dest_root, exist_ok=True)

## Analyzing the data distribution

In [None]:
import os
data = dict()
total = 0
for category in os.listdir(data_root_source):
    if category == 'OTHER_CLASSES': continue
    class_num = len(os.listdir(os.path.join(data_root_source, category)))
    data[category] = class_num
    total += class_num

data

In [None]:
import pandas as pd
df = pd.DataFrame.from_dict(data, orient='index', columns=['count'])
df = df.sort_values(by='count', ascending=False)
df
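
To judge how imbalanced the classes are, it can help to look at each class's share of the total rather than raw counts. The sketch below uses made-up counts and a hypothetical helper `class_shares`; in the notebook, the `data` dictionary built above would be passed in instead.

```python
# Hypothetical image counts per class; the real values come from `data`
example_counts = {"CLASS_A": 120, "CLASS_B": 60, "CLASS_C": 20}


def class_shares(counts):
    """Return each class's share of the total image count, in percent."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


shares = class_shares(example_counts)
# Print the classes from largest to smallest share
for name, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {pct}%")
```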

## Creating the TRAIN, TEST, VAL folders


In [None]:
# Load classes
all_classes = df.head(5).index.tolist()
# ['VOLVO_FM12', 'HINO_FN2P', 'SINOTRUK_A7', 'HINO_SH1E', 'MITSUBISHI_FP517']

In [None]:
all_classes

In [None]:
for category in all_classes:
    data_source = os.path.join(data_root_source, category)

    # Obtain all file paths for this category
    img_names = []
    for child in os.listdir(data_source):
        img = os.path.join(data_source, child) 
        img_names.append(img)

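    # Split this category 70/15/15 (shuffled, with a fixed seed for reproducibility)
    # Hold out 30% of the images; the remaining 70% form the training set
    [train, others] = train_test_split(img_names, test_size=0.30, random_state=42)
    # Divide the held-out 30% equally between test and validation
    [test, val] = train_test_split(others, test_size=0.50, random_state=42)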
    
    # Copy the images into the correct folder (the source files are left in place)
    for folder in ['TRAIN', 'TEST', 'VAL']:
        if folder == 'TRAIN': dataset = train
        elif folder == 'TEST': dataset = test
        else: dataset = val

        # Create <dest>/<split>/<category>, including parent folders, if missing
        category_path = os.path.join(data_dest_root, folder, category)
        os.makedirs(category_path, exist_ok=True)

        for item in dataset:
            img_name = os.path.basename(item)  # works with any path separator
            dest_file = os.path.join(data_dest_root, folder, category, img_name)
            # print(item, dest_file)
            
            shutil.copyfile(item, dest_file)
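
After copying, a quick count of the files in each split folder is a cheap way to confirm the split worked as intended. The helper below is a minimal sketch; the name `count_split_images` is hypothetical, and it assumes the `<root>/<split>/<category>` layout created above.

```python
import os


def count_split_images(dest_root, splits=("TRAIN", "TEST", "VAL")):
    """Return {split: {category: number_of_files}} for the split folders that exist."""
    counts = {}
    for split in splits:
        split_dir = os.path.join(dest_root, split)
        if not os.path.isdir(split_dir):
            continue  # skip splits that have not been created
        counts[split] = {}
        for category in sorted(os.listdir(split_dir)):
            category_dir = os.path.join(split_dir, category)
            if os.path.isdir(category_dir):
                counts[split][category] = len(os.listdir(category_dir))
    return counts


# Prints an empty dict if the folder does not exist yet
print(count_split_images("data_split"))
```

For each class, the TRAIN count should be close to 70% of the original folder size, with TEST and VAL sharing the rest about equally.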
        
    
    