# VEHICLE IMAGE CLASSIFICATION
---
# 1. Business Understanding
---
### 1.1 Overview
In collaboration with Kenya Imports Authority (KIA), this project aims to streamline vehicle identification processes at ports of entry. The Kenya Imports Authority oversees the regulation and importation of vehicles into the country, ensuring compliance with national laws and regulations. A machine learning model capable of accurately classifying different types of vehicles (such as Auto Rickshaws, Bikes, Cars, Motorcycles, Planes, Ships and Trains) would enhance operational efficiency, reduce manual errors and improve regulatory compliance.

The goal of this project is to build a vehicle classification model that can automatically identify the type of vehicle from an image. This can be applied in real-time during inspections or integrated into existing digital systems for batch processing.

#### Metrics of Success:
* Accuracy: The percentage of correctly classified vehicle images.
* Precision: For each vehicle class, the proportion of images predicted as that class that actually belong to it.
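
As a rough illustration of these two metrics, the sketch below computes accuracy and per-class precision from lists of true and predicted labels. The labels are toy values made up for the example, and the helper names `accuracy` and `precision_per_class` are hypothetical rather than part of the project code.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def precision_per_class(y_true, y_pred):
    """For each predicted class: true positives / all predictions of that class."""
    predicted = Counter(y_pred)
    true_pos = Counter(p for t, p in zip(y_true, y_pred) if t == p)
    return {label: true_pos[label] / n for label, n in predicted.items()}

# Toy labels for illustration only (not results from this project)
y_true = ["Cars", "Cars", "Bikes", "Ships", "Bikes", "Ships"]
y_pred = ["Cars", "Bikes", "Bikes", "Ships", "Bikes", "Cars"]

acc = accuracy(y_true, y_pred)
prec = precision_per_class(y_true, y_pred)
print(acc)
print(prec)
```

In this toy case four of six predictions are correct, and `Cars` has a precision of 0.5 because only one of the two images predicted as `Cars` is really a car.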

### 1.2 Problem Statement
Manually inspecting and classifying vehicles entering Kenya’s ports is both time-consuming and prone to human error. With thousands of vehicles arriving at the borders daily, it is increasingly difficult to ensure that each one is correctly classified and documented. A machine learning solution is needed to automate this process, allowing for faster, more accurate classification of vehicles. This project seeks to address the challenge of automatically identifying vehicle types using image recognition techniques.

### 1.3 Objectives
* Develop a machine learning model that can classify vehicles into one of seven predefined categories: `Auto Rickshaws`, `Bikes`, `Cars`, `Motorcycles`, `Planes`, `Ships` and `Trains`.
* Achieve high accuracy and precision when classifying images from the dataset.
* Evaluate the performance of the model using test data.
* Deploy the model using Streamlit.
---

# 2. Data Understanding
---

### 2.1 Data Source and Access
The dataset used for this project is sourced from Kaggle and can be accessed [here](https://www.kaggle.com/datasets/mohamedmaher5/vehicle-classification).
The dataset is expected to contain `5,600` images across seven categories, each stored in a subfolder within the `Vehicles` folder. Because the dataset is large, it has been uploaded to AWS S3 for efficient storage and access. The categories are:

* Auto Rickshaws
* Bikes
* Cars
* Motorcycles
* Planes
* Ships
* Trains

Each category is expected to contain 800 images in `.jpg` and `.png` formats, including both uppercase and lowercase extensions (`.JPG`, `.PNG`).
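
Because the extensions appear in mixed case, any filter on file names needs to be case-insensitive. A minimal sketch of such a check follows; the helper name `is_image_key` and the sample keys are hypothetical and only illustrate the idea.

```python
import posixpath

IMAGE_EXTENSIONS = {".jpg", ".png"}  # formats named in the dataset description

def is_image_key(key):
    """Return True if the key ends in a supported image extension, ignoring case."""
    ext = posixpath.splitext(key)[1].lower()
    return ext in IMAGE_EXTENSIONS

# Hypothetical keys for illustration
keys = [
    "Vehicles/Cars/Car (1).jpg",
    "Vehicles/Cars/Car (2).JPG",
    "Vehicles/Trains/Train (97).png",
    "Vehicles/Trains/Train (98).PNG",
    "Vehicles/Cars/notes.txt",
]
image_keys = [k for k in keys if is_image_key(k)]
print(image_keys)
```

Lower-casing the extension before comparing avoids having to list every capitalisation variant explicitly.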

### 2.2 Data Loading
We list the objects in the S3 bucket and store each image's S3 key (file path) and category in a DataFrame. Keeping only the paths, rather than the image data itself, keeps memory usage low and lets us load images on demand.

In [1]:
# Import necessary libraries

import boto3
import pandas as pd
from PIL import Image
import matplotlib.pyplot as plt
from io import BytesIO

# Suppress FutureWarning messages to keep the output readable
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)


In [2]:
# AWS S3 setup
s3_client = boto3.client('s3')
bucket_name = 'vehicle-image-classification'
folder_path = 'Vehicles/' 

# Initialize an empty list to store data
data = []

# Using pagination to retrieve all objects in the Vehicles folder
paginator = s3_client.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket=bucket_name, Prefix=folder_path)

# Iterate through each page of results
for page in pages:
    # Pages with no matching objects have no 'Contents' key
    for obj in page.get('Contents', []):
        key = obj['Key']
        # Keep only image files; the check is case-insensitive (.jpg, .JPG, .png, .PNG)
        if key.lower().endswith(('.jpg', '.png')):
            # Keys look like 'Vehicles/<Category>/<file name>'
            category = key.split('/')[1]
            data.append([key, category])

# Creating a DataFrame to store the image paths and labels
df = pd.DataFrame(data, columns=['File Path', 'Category'])
df

| | File Path | Category |
|---|---|---|
| 0 | Vehicles/Auto Rickshaws/Auto Rickshaw (1).jpg | Auto Rickshaws |
| 1 | Vehicles/Auto Rickshaws/Auto Rickshaw (10).jpg | Auto Rickshaws |
| 2 | Vehicles/Auto Rickshaws/Auto Rickshaw (100).jpg | Auto Rickshaws |
| 3 | Vehicles/Auto Rickshaws/Auto Rickshaw (101).jpg | Auto Rickshaws |
| 4 | Vehicles/Auto Rickshaws/Auto Rickshaw (102).jpg | Auto Rickshaws |
| ... | ... | ... |
| 5582 | Vehicles/Trains/Train (95).jpg | Trains |
| 5583 | Vehicles/Trains/Train (96).jpg | Trains |
| 5584 | Vehicles/Trains/Train (97).png | Trains |
| 5585 | Vehicles/Trains/Train (98).png | Trains |
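
With only the keys stored, an individual image can be fetched when it is needed. The sketch below shows one way this might look, assuming a boto3 S3 client and Pillow are available; the helper names `fetch_image_bytes` and `load_image` are hypothetical and not part of the original notebook.

```python
from io import BytesIO

def fetch_image_bytes(s3_client, bucket, key):
    """Download one object from S3 and return its raw bytes."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return response["Body"].read()

def load_image(s3_client, bucket, key):
    """Fetch an image from S3 and decode it with Pillow."""
    from PIL import Image  # imported here so fetch_image_bytes works without Pillow
    return Image.open(BytesIO(fetch_image_bytes(s3_client, bucket, key)))

# Possible usage with the notebook's objects (requires AWS credentials):
# img = load_image(s3_client, bucket_name, df.loc[0, 'File Path'])
# plt.imshow(img)
# plt.axis('off')
# plt.show()
```

Passing the client in as an argument keeps the helpers easy to test with a stand-in object and avoids creating a new client for every image.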


### 2.3 Class Distribution
We count the number of images per category to check whether the dataset is balanced across all classes.

In [3]:
# Check class distribution
class_distribution = df['Category'].value_counts()
class_distribution

Category
Auto Rickshaws    800
Bikes             800
Motorcycles       800
Ships             800
Trains            800
Planes            797
Cars              790
Name: count, dtype: int64

#### Observation:
The dataset is nearly balanced. Five categories contain the expected 800 images, while `Planes` (797) and `Cars` (790) fall slightly short, giving 5,587 images in total rather than the expected 5,600.
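
The gap between expected and observed counts can be checked with a few lines of arithmetic. This sketch copies the counts from the `value_counts()` output above into a plain dictionary; the variable names are illustrative only.

```python
EXPECTED_PER_CLASS = 800  # from the dataset description

# Counts copied from the value_counts() output
counts = {
    "Auto Rickshaws": 800, "Bikes": 800, "Motorcycles": 800, "Ships": 800,
    "Trains": 800, "Planes": 797, "Cars": 790,
}

# Categories with fewer images than expected, and by how many
shortfall = {c: EXPECTED_PER_CLASS - n for c, n in counts.items() if n < EXPECTED_PER_CLASS}
total = sum(counts.values())
# Ratio of the largest class to the smallest; 1.0 means perfectly balanced
imbalance_ratio = max(counts.values()) / min(counts.values())

print(shortfall)                  # {'Planes': 3, 'Cars': 10}
print(total)                      # 5587
print(round(imbalance_ratio, 3))  # 1.013
```

An imbalance ratio this close to 1 suggests the missing 13 images are unlikely to bias the model noticeably.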

---
# 3. Data Preparation
---
In this step, we will clean and organize the data to make it suitable for analysis and modeling.

### 3.1 Checking for Missing Values
First, let's check if there are any missing values in the dataset. This is important because missing data can lead to errors during analysis or modeling.

In [5]:
# Check for missing values
missing_values = df.isnull().sum()
print("Missing values per column:\n", missing_values)

Missing values per column:
 File Path    0
Category     0
dtype: int64


#### Observation:
* The data has no missing values.
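
A null check does not catch blank strings or unexpected labels, for example a file placed directly under `Vehicles/`, whose "category" would then be the file name. A possible complementary check is sketched here; the helper `find_invalid_rows` and the sample rows are hypothetical.

```python
EXPECTED_CATEGORIES = {
    "Auto Rickshaws", "Bikes", "Cars", "Motorcycles", "Planes", "Ships", "Trains",
}

def find_invalid_rows(rows):
    """Return (path, category) pairs with a blank path or an unexpected category."""
    return [
        (path, category)
        for path, category in rows
        if not path.strip() or category not in EXPECTED_CATEGORIES
    ]

# Hypothetical rows for illustration; in the notebook they could come from
# df[['File Path', 'Category']].itertuples(index=False, name=None)
rows = [
    ("Vehicles/Cars/Car (1).jpg", "Cars"),
    ("Vehicles/Trains/Train (95).jpg", "Trains"),
    ("Vehicles/readme.jpg", "readme.jpg"),  # file placed directly under Vehicles/
    ("", "Bikes"),                           # blank path
]
invalid = find_invalid_rows(rows)
print(invalid)
```

An empty result on the real data would confirm that every row carries one of the seven expected labels.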

### 3.2 Data Type Verification
Next, we verify that the data types are appropriate for the analysis we plan to perform.

In [6]:
print("Data types:\n", df.dtypes)

Data types:
 File Path    object
Category     object
dtype: object
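
Both columns are stored as `object` (string) values, which suits file paths and labels at this stage. Most modeling libraries expect integer class labels, so the category names will need encoding at some point. One possible approach, not necessarily the one used later in this project, is a simple name-to-index mapping; `build_label_index` is a hypothetical helper.

```python
def build_label_index(categories):
    """Map each distinct category name to an integer, sorted for reproducibility."""
    return {name: idx for idx, name in enumerate(sorted(set(categories)))}

# Toy sample standing in for the Category column
categories = ["Trains", "Cars", "Bikes", "Cars", "Ships"]

label_to_index = build_label_index(categories)
encoded = [label_to_index[c] for c in categories]

print(label_to_index)  # {'Bikes': 0, 'Cars': 1, 'Ships': 2, 'Trains': 3}
print(encoded)         # [3, 1, 0, 1, 2]
```

Sorting the names before numbering them keeps the mapping stable across runs, regardless of the order in which files were listed.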


### 3.3 Duplicates Check
Finally, we check for duplicate rows, which could skew the analysis.

In [7]:
# Check for duplicates
duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")

Number of duplicate rows: 0
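
Note that `df.duplicated()` compares rows, so it can only detect repeated paths. It would not notice the same image stored under two different names. A content-level check could hash the file bytes instead, as in this sketch. The helper `find_content_duplicates` and the in-memory "files" are hypothetical; real bytes would have to be downloaded from S3.

```python
import hashlib
from collections import defaultdict

def find_content_duplicates(files):
    """Group paths by the SHA-256 digest of their bytes; return groups of two or more."""
    by_digest = defaultdict(list)
    for path, content in files.items():
        by_digest[hashlib.sha256(content).hexdigest()].append(path)
    return [paths for paths in by_digest.values() if len(paths) > 1]

# Toy in-memory "files" for illustration only
files = {
    "Vehicles/Cars/Car (1).jpg": b"\xff\xd8 image-a",
    "Vehicles/Cars/Car (2).jpg": b"\xff\xd8 image-b",
    "Vehicles/Bikes/Bike (7).jpg": b"\xff\xd8 image-a",  # same bytes as Car (1)
}
duplicate_groups = find_content_duplicates(files)
print(duplicate_groups)
```

Hashing finds only byte-identical files; resized or re-encoded copies of the same picture would need a perceptual comparison, which is outside the scope of this check.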
