# <center>Principal Component Analysis - Vehicle Silhouette</center>
Data Set Information:

The purpose is to classify a given silhouette as one of four types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles. 

HISTORY: 

This data was originally gathered at the TI in 1986-87 by JP Siebert. It was partially financed by Barr and Stroud Ltd. The original purpose was to find a method of distinguishing 3D objects within a 2D image by application of an ensemble of shape feature extractors to the 2D silhouettes of the objects. Measures of shape features extracted from example silhouettes of objects to be discriminated were used to generate a classification rule tree by means of computer induction. 

This object recognition strategy was successfully used to discriminate between silhouettes of model cars, vans and buses viewed from constrained elevation but all angles of rotation. 

The rule tree classification performance compared favourably to MDC (Minimum Distance Classifier) and k-NN (k-Nearest Neighbour) statistical classifiers in terms of both error rate and computational efficiency. An investigation of these rule trees generated by example indicated that the tree structure was heavily influenced by the orientation of the objects, and grouped similar object views into single decisions. 


DESCRIPTION: 

The features were extracted from the silhouettes by the HIPS (Hierarchical Image Processing System) extension BINATTS, which extracts a combination of scale independent features utilising both classical moments based measures such as scaled variance, skewness and kurtosis about the major/minor axes and heuristic measures such as hollows, circularity, rectangularity and compactness. 

Four "Corgie" model vehicles were used for the experiment: a double decker bus, Chevrolet van, Saab 9000 and an Opel Manta 400. This particular combination of vehicles was chosen with the expectation that the bus, van and either one of the cars would be readily distinguishable, but it would be more difficult to distinguish between the cars. 

The images were acquired by a camera looking downwards at the model vehicle from a fixed angle of elevation (34.2 degrees to the horizontal). The vehicles were placed on a diffuse backlit surface (lightbox). The vehicles were painted matte black to minimise highlights. The images were captured using a CRS4000 framestore connected to a VAX 750. All images were captured with a spatial resolution of 128x128 pixels quantised to 64 greylevels. These images were thresholded to produce binary vehicle silhouettes, negated (to comply with the processing requirements of BINATTS) and thereafter subjected to shrink-expand-expand-shrink HIPS modules to remove "salt and pepper" image noise. 
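The thresholding and shrink-expand-expand-shrink clean-up described above can be approximated with ordinary binary morphology. The sketch below is not the HIPS implementation: it assumes that "shrink" and "expand" correspond to erosion and dilation over a 3x3 neighbourhood, that the vehicle is darker than the backlit surface, and that a threshold of 32 (out of 64 grey levels) is suitable. The names `shrink`, `expand` and `clean_silhouette` are illustrative only.

```python
import numpy as np

def _shift_stack(mask):
    """Return the nine 3x3-neighbourhood shifts of a boolean mask."""
    h, w = mask.shape
    padded = np.pad(mask, 1, constant_values=False)
    return [padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)]

def shrink(mask):
    # Erosion: a pixel survives only if its whole 3x3 neighbourhood is set.
    return np.logical_and.reduce(_shift_stack(mask))

def expand(mask):
    # Dilation: a pixel is set if any pixel in its 3x3 neighbourhood is set.
    return np.logical_or.reduce(_shift_stack(mask))

def clean_silhouette(grey, threshold=32):
    """Threshold a grey-level image, then apply shrink-expand-expand-shrink.

    Assumptions (not from the dataset description): dark vehicle on a bright
    background, 3x3 neighbourhood, threshold=32.
    """
    mask = grey < threshold              # dark pixels are treated as vehicle
    mask = expand(shrink(mask))          # removes isolated foreground specks
    mask = shrink(expand(mask))          # fills isolated holes in the object
    return mask

# Synthetic 128x128 image with 64 grey levels: bright background, dark block.
grey = np.full((128, 128), 60, dtype=np.uint8)
grey[40:80, 30:100] = 5      # the "vehicle"
grey[10, 10] = 5             # isolated dark speck outside the vehicle
grey[60, 60] = 60            # isolated bright hole inside the vehicle

silhouette = clean_silhouette(grey)
print(silhouette.sum())      # number of foreground pixels after clean-up
```

In this synthetic case the speck disappears and the hole is filled, leaving the rectangular block intact.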

The vehicles were rotated and their angle of orientation was measured using a radial graticule beneath the vehicle. 0 and 180 degrees corresponded to "head on" and "rear" views respectively while 90 and 270 corresponded to profiles in opposite directions. Two sets of 60 images, each set covering a full 360 degree rotation, were captured for each vehicle. The vehicle was rotated by a fixed angle between images. These datasets are known as e2 and e3 respectively. 

A further two sets of images, e4 and e5, were captured with the camera at elevations of 37.5 degs and 30.8 degs respectively. These sets also contain 60 images per vehicle apart from e4.van which contains only 46 owing to the difficulty of containing the van in the image at some orientations. 


Attribute Information:

ATTRIBUTES 

COMPACTNESS	(average perim)**2/area 

CIRCULARITY	(average radius)**2/area 

DISTANCE CIRCULARITY	area/(av.distance from border)**2 

RADIUS RATIO	(max.rad-min.rad)/av.radius 

PR.AXIS ASPECT RATIO	(minor axis)/(major axis) 

MAX.LENGTH ASPECT RATIO	(length perp. max length)/(max length) 

SCATTER RATIO	(inertia about minor axis)/(inertia about major axis) 

ELONGATEDNESS	area/(shrink width)**2 

PR.AXIS RECTANGULARITY	area/(pr.axis length*pr.axis width) 

MAX.LENGTH RECTANGULARITY	area/(max.length*length perp. to this) 

SCALED VARIANCE ALONG MAJOR AXIS	(2nd order moment about minor axis)/area 

SCALED VARIANCE ALONG MINOR AXIS	(2nd order moment about major axis)/area 

SCALED RADIUS OF GYRATION	(mavar+mivar)/area 

SKEWNESS ABOUT MAJOR AXIS	(3rd order moment about major axis)/sigma_min**3 

SKEWNESS ABOUT MINOR AXIS	(3rd order moment about minor axis)/sigma_maj**3 

KURTOSIS ABOUT MINOR AXIS	(4th order moment about major axis)/sigma_min**4 

KURTOSIS ABOUT MAJOR AXIS	(4th order moment about minor axis)/sigma_maj**4 

HOLLOWS RATIO	(area of hollows)/(area of bounding polygon) 

Where sigma_maj**2 is the variance along the major axis and sigma_min**2 is the variance along the minor axis, and 

area of hollows= area of bounding poly-area of object 

The area of the bounding polygon is found as a side result of the computation to find the maximum length. Each individual length computation yields a pair of calipers to the object orientated at every 5 degrees. The object is propagated into an image containing the union of these calipers to obtain an image of the bounding polygon. 
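The caliper measurement described above can be sketched by projecting the silhouette's pixels onto directions spaced 5 degrees apart and taking the spread of the projections as the caliper width. This is only an illustration of the idea, not the BINATTS code: it measures between pixel centres, assumes 90 is a multiple of the angular step, and the function name `caliper_lengths` is hypothetical.

```python
import numpy as np

def caliper_lengths(mask, step_deg=5):
    """Return (max_length, perpendicular_length) of a binary silhouette.

    Widths are measured between pixel centres along directions spaced
    `step_deg` apart (90 must be a multiple of `step_deg`).
    """
    rows, cols = np.nonzero(mask)
    points = np.column_stack([cols, rows]).astype(float)    # (x, y) centres
    angles = np.deg2rad(np.arange(0, 180, step_deg))
    directions = np.column_stack([np.cos(angles), np.sin(angles)])
    # Project every pixel onto every direction; the caliper width along a
    # direction is the spread of the projected values.
    projections = points @ directions.T          # shape (n_pixels, n_angles)
    widths = projections.max(axis=0) - projections.min(axis=0)
    best = int(np.argmax(widths))
    # The perpendicular direction lies 90 degrees further round (modulo 180).
    perp = (best + 90 // step_deg) % len(angles)
    return widths[best], widths[perp]

# Example: a 10 x 40 pixel rectangle.
mask = np.zeros((64, 64), dtype=bool)
mask[20:30, 10:50] = True

max_len, perp_len = caliper_lengths(mask)
aspect_ratio = perp_len / max_len                      # MAX.LENGTH ASPECT RATIO
rectangularity = mask.sum() / (max_len * perp_len)     # MAX.LENGTH RECTANGULARITY
print(aspect_ratio, rectangularity)
```

Because the perpendicular direction is itself one of the sampled directions, the aspect ratio computed this way never exceeds 1.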

NUMBER OF CLASSES 

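4	OPEL, SAAB, BUS, VAN (the `class` column of the `vehicle.csv` file used in this notebook contains three labels: car, bus and van) 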

---
## Imports and Configurations

In [1]:
# Numerical calculation
import numpy as np

# Data handling
import pandas as pd

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Sample and parameter tuning
from sklearn.model_selection import train_test_split

#Predictive Modeling
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

# Evaluation metrics
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc, accuracy_score, precision_recall_curve

In [2]:
# Configure default settings for the plotting libraries
%matplotlib inline
sns.set(style='darkgrid', palette='deep', font='sans-serif', font_scale=1.2, color_codes=True)

**Comments**
- **``%matplotlib inline``** sets the backend of matplotlib to the 'inline' backend. With this backend, the output of plotting commands is displayed inline, without needing to call ``plt.show()`` every time a plot is drawn.
- ``sns.set(...)`` sets a few of Seaborn's aesthetic parameters (style, palette, font and font scale).

---
## Load the Dataset

In [3]:
# Load the dataset into a Pandas dataframe called vehicle
vehicle = pd.read_csv('vehicle.csv')

In [4]:
# Check the head of the dataset
vehicle.head()

| | compactness | circularity | distance_circularity | radius_ratio | pr.axis_aspect_ratio | max.length_aspect_ratio | scatter_ratio | elongatedness | pr.axis_rectangularity | max.length_rectangularity | scaled_variance | scaled_variance.1 | scaled_radius_of_gyration | scaled_radius_of_gyration.1 | skewness_about | skewness_about.1 | skewness_about.2 | hollows_ratio | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 95 | 48.0 | 83.0 | 178.0 | 72.0 | 10 | 162.0 | 42.0 | 20.0 | 159 | 176.0 | 379.0 | 184.0 | 70.0 | 6.0 | 16.0 | 187.0 | 197 | van |
| 1 | 91 | 41.0 | 84.0 | 141.0 | 57.0 | 9 | 149.0 | 45.0 | 19.0 | 143 | 170.0 | 330.0 | 158.0 | 72.0 | 9.0 | 14.0 | 189.0 | 199 | van |
| 2 | 104 | 50.0 | 106.0 | 209.0 | 66.0 | 10 | 207.0 | 32.0 | 23.0 | 158 | 223.0 | 635.0 | 220.0 | 73.0 | 14.0 | 9.0 | 188.0 | 196 | car |
| 3 | 93 | 41.0 | 82.0 | 159.0 | 63.0 | 9 | 144.0 | 46.0 | 19.0 | 143 | 160.0 | 309.0 | 127.0 | 63.0 | 6.0 | 10.0 | 199.0 | 207 | van |
| 4 | 85 | 44.0 | 70.0 | 205.0 | 103.0 | 52 | 149.0 | 45.0 | 19.0 | 144 | 241.0 | 325.0 | 188.0 | 127.0 | 9.0 | 11.0 | 180.0 | 183 | bus |


In [5]:
# Check the tail of the dataset
vehicle.tail()

| | compactness | circularity | distance_circularity | radius_ratio | pr.axis_aspect_ratio | max.length_aspect_ratio | scatter_ratio | elongatedness | pr.axis_rectangularity | max.length_rectangularity | scaled_variance | scaled_variance.1 | scaled_radius_of_gyration | scaled_radius_of_gyration.1 | skewness_about | skewness_about.1 | skewness_about.2 | hollows_ratio | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 841 | 93 | 39.0 | 87.0 | 183.0 | 64.0 | 8 | 169.0 | 40.0 | 20.0 | 134 | 200.0 | 422.0 | 149.0 | 72.0 | 7.0 | 25.0 | 188.0 | 195 | car |
| 842 | 89 | 46.0 | 84.0 | 163.0 | 66.0 | 11 | 159.0 | 43.0 | 20.0 | 159 | 173.0 | 368.0 | 176.0 | 72.0 | 1.0 | 20.0 | 186.0 | 197 | van |
| 843 | 106 | 54.0 | 101.0 | 222.0 | 67.0 | 12 | 222.0 | 30.0 | 25.0 | 173 | 228.0 | 721.0 | 200.0 | 70.0 | 3.0 | 4.0 | 187.0 | 201 | car |
| 844 | 86 | 36.0 | 78.0 | 146.0 | 58.0 | 7 | 135.0 | 50.0 | 18.0 | 124 | 155.0 | 270.0 | 148.0 | 66.0 | 0.0 | 25.0 | 190.0 | 195 | car |
| 845 | 85 | 36.0 | 66.0 | 123.0 | 55.0 | 5 | 120.0 | 56.0 | 17.0 | 128 | 140.0 | 212.0 | 131.0 | 73.0 | 1.0 | 18.0 | 186.0 | 190 | van |


**Comments**
* To take a closer look at the data, pandas provides the **``.head()``** method, which returns the first five observations, and the **``.tail()``** method, which returns the last five observations of the data set.

---
### Inspect the Dataset
The dataset is divided into two parts, namely, the **feature matrix** and the **response vector**.

- The feature matrix contains all the vectors (rows) of the dataset, where each vector holds the values of the **independent features**. In the above dataset, the features are the 18 shape attributes, from *compactness* through *hollows_ratio*.
- The response vector contains the value of the **target variable** (prediction or output) for each row of the feature matrix. In the above dataset, the target column is named *class*.
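As a minimal sketch of this split, the snippet below separates a small frame into a feature matrix `X` and a response vector `y`. The frame `sample` is a stand-in built from three columns of the first five rows shown earlier, so that the example runs without `vehicle.csv`; the names `sample`, `X` and `y` are assumptions and do not come from the notebook.

```python
import pandas as pd

# A few columns copied from the first five rows displayed by vehicle.head().
sample = pd.DataFrame({
    'compactness': [95, 91, 104, 93, 85],
    'circularity': [48.0, 41.0, 50.0, 41.0, 44.0],
    'hollows_ratio': [197, 199, 196, 207, 183],
    'class': ['van', 'van', 'car', 'van', 'bus'],
})

X = sample.drop(columns='class')   # feature matrix: every column except the target
y = sample['class']                # response vector: the target column only

print(X.shape, y.shape)
```

The same two lines would apply to the full `vehicle` frame, where `X` would hold the 18 shape attributes.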

In [6]:
# Get the shape and size of the dataset
vehicle.shape

(846, 19)

In [7]:
# Get more info on it
# 1. Name of the columns
# 2. Find the data type of each column
# 3. Look for any null/missing values
vehicle.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 846 entries, 0 to 845
Data columns (total 19 columns):
compactness                    846 non-null int64
circularity                    841 non-null float64
distance_circularity           842 non-null float64
radius_ratio                   840 non-null float64
pr.axis_aspect_ratio           844 non-null float64
max.length_aspect_ratio        846 non-null int64
scatter_ratio                  845 non-null float64
elongatedness                  845 non-null float64
pr.axis_rectangularity         843 non-null float64
max.length_rectangularity      846 non-null int64
scaled_variance                843 non-null float64
scaled_variance.1              844 non-null float64
scaled_radius_of_gyration      844 non-null float64
scaled_radius_of_gyration.1    842 non-null float64
skewness_about                 840 non-null float64
skewness_about.1               845 non-null float64
skewness_about.2               845 non-null float64
hollows_ratio    

**Observations**
- The dataset comprises **846 rows** and **19 columns**.
- All the input features are numeric, of type integer or float. Only the class/target feature is of type object, as it is categorical in nature.
- There are **a few null/missing values** in the dataset: of the columns listed above, 14 have between 1 and 6 missing entries each.
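One way to count the missing values per column, and one possible way to treat them, is sketched below. The frame `toy` is made-up data with gaps placed by hand, and median imputation is only an assumption here; the notebook does not say which treatment it uses.

```python
import numpy as np
import pandas as pd

# Toy data: the NaN positions are invented for illustration.
toy = pd.DataFrame({
    'circularity': [48.0, np.nan, 50.0, 41.0, 44.0],
    'radius_ratio': [178.0, 141.0, np.nan, np.nan, 205.0],
    'compactness': [95, 91, 104, 93, 85],
})

# Number of missing entries in each column; keep only columns that have any.
missing_per_column = toy.isnull().sum()
print(missing_per_column[missing_per_column > 0])

# One possible treatment (an assumption): fill each gap with the column median.
filled = toy.fillna(toy.median())
print(filled)
```

The median is less affected by extreme values than the mean, which can matter when a column contains outliers.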

In [8]:
# Describe the dataset with summary statistics
vehicle.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| compactness | 846.0 | 93.678487 | 8.234474 | 73.0 | 87.0 | 93.0 | 100.0 | 119.0 |
| circularity | 841.0 | 44.828775 | 6.152172 | 33.0 | 40.0 | 44.0 | 49.0 | 59.0 |
| distance_circularity | 842.0 | 82.110451 | 15.778292 | 40.0 | 70.0 | 80.0 | 98.0 | 112.0 |
| radius_ratio | 840.0 | 168.888095 | 33.520198 | 104.0 | 141.0 | 167.0 | 195.0 | 333.0 |
| pr.axis_aspect_ratio | 844.0 | 61.67891 | 7.891463 | 47.0 | 57.0 | 61.0 | 65.0 | 138.0 |
| max.length_aspect_ratio | 846.0 | 8.567376 | 4.601217 | 2.0 | 7.0 | 8.0 | 10.0 | 55.0 |
| scatter_ratio | 845.0 | 168.901775 | 33.214848 | 112.0 | 147.0 | 157.0 | 198.0 | 265.0 |
| elongatedness | 845.0 | 40.933728 | 7.816186 | 26.0 | 33.0 | 43.0 | 46.0 | 61.0 |
| pr.axis_rectangularity | 843.0 | 20.582444 | 2.592933 | 17.0 | 19.0 | 20.0 | 23.0 | 29.0 |
| max.length_rectangularity | 846.0 | 147.998818 | 14.515652 | 118.0 | 137.0 | 146.0 | 159.0 | 188.0 |


**Observations:**
- The *count* is below 846 for several columns (for example 840 for *radius_ratio* and 841 for *circularity*), which confirms the missing values seen earlier.
- The features are on very different scales: *max.length_aspect_ratio* has a mean of about 8.6, while *radius_ratio* and *scatter_ratio* have means of about 169. Since PCA is sensitive to scale, the features will likely need standardising before it is applied.
- *radius_ratio* (max 333 against a 75th percentile of 195), *pr.axis_aspect_ratio* (max 138 against 65) and *max.length_aspect_ratio* (max 55 against 10) have maxima far above their upper quartiles, which points to outliers on the high side.
- *scatter_ratio* has a mean (168.9) well above its median (157), suggesting a right-skewed distribution.
- *compactness*, *circularity* and *max.length_rectangularity* have means close to their medians and show no extreme values in this summary.
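A quick way to check the summary statistics for high-side outliers is the common 1.5 x IQR rule, sketched below. The quartiles and maxima are copied from the `describe()` output above for four of the columns; the rule itself and the column name `max_beyond_fence` are assumptions and not something the notebook prescribes.

```python
import pandas as pd

# Quartiles and maxima copied from the describe() output for four columns.
summary = pd.DataFrame(
    {'25%': [87.0, 141.0, 57.0, 7.0],
     '75%': [100.0, 195.0, 65.0, 10.0],
     'max': [119.0, 333.0, 138.0, 55.0]},
    index=['compactness', 'radius_ratio', 'pr.axis_aspect_ratio',
           'max.length_aspect_ratio'],
)

# Upper fence of the 1.5 * IQR rule: values above it are flagged as outliers.
iqr = summary['75%'] - summary['25%']
summary['upper_fence'] = summary['75%'] + 1.5 * iqr
summary['max_beyond_fence'] = summary['max'] > summary['upper_fence']
print(summary)
```

For example, *radius_ratio* has an IQR of 54, giving an upper fence of 276, which its maximum of 333 exceeds; the maximum of *compactness* (119) stays just below its fence of 119.5.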

In [9]:
# Compare class-wise means
pd.pivot_table(vehicle, index='class', aggfunc=['mean']).T

| | class | bus | car | van |
|---|---|---|---|---|
| mean | circularity | 44.981308 | 46.035047 | 42.070352 |
| mean | compactness | 91.591743 | 96.184149 | 90.562814 |
| mean | distance_circularity | 76.767442 | 88.878788 | 73.247475 |
| mean | elongatedness | 40.114679 | 38.093458 | 47.939698 |
| mean | hollows_ratio | 191.325688 | 197.582751 | 196.145729 |
| mean | max.length_aspect_ratio | 7.013761 | 8.825175 | 9.713568 |
| mean | max.length_rectangularity | 146.701835 | 149.967366 | 145.175879 |
| mean | pr.axis_aspect_ratio | 63.414747 | 60.992991 | 61.261307 |
| mean | pr.axis_rectangularity | 20.580645 | 21.511682 | 18.575758 |
| mean | radius_ratio | 165.708333 | 180.591549 | 147.176768 |


**Observations:**
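- Cars have the highest mean *compactness* (96.2), *distance_circularity* (88.9) and *radius_ratio* (180.6) of the three classes.
- Vans have the highest mean *elongatedness* (47.9) and the lowest mean *radius_ratio* (147.2) and *pr.axis_rectangularity* (18.6).
- Buses have the lowest mean *hollows_ratio* (191.3) and *max.length_aspect_ratio* (7.0).
- *max.length_rectangularity* differs little between the classes (roughly 145 to 150), so on its own it is unlikely to separate them well.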

### Understanding the target variable

In [10]:
# Find count of unique target variable
len(vehicle['class'].unique())
# OR
vehicle['class'].nunique()

3

In [11]:
# What are the different values of the dependent (target) variable
vehicle['class'].unique()

array(['van', 'car', 'bus'], dtype=object)

In [12]:
# Find out the value counts in each outcome category
vehicle['class'].value_counts()

car    429
bus    218
van    199
Name: class, dtype: int64

**Observation**
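- The class counts are **429 car**, **218 bus** and **199 van**, so roughly half of the samples are cars and the other two classes account for about a quarter each. The dataset is moderately imbalanced towards the car class.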
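The class shares can be computed directly, and a stratified split is one common way to keep them similar in the training and test sets. The sketch below rebuilds the target column from the value counts shown above so that it runs on its own; `test_size=0.3` and `random_state=42` are arbitrary assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Rebuild the target column from the value counts shown above.
y = pd.Series(['car'] * 429 + ['bus'] * 218 + ['van'] * 199, name='class')

# Share of each class in the full dataset.
shares = y.value_counts(normalize=True)
print(shares.round(3))

# stratify=y keeps the class shares roughly equal in both splits.
y_train, y_test = train_test_split(
    y, test_size=0.3, stratify=y, random_state=42
)
print(y_test.value_counts(normalize=True).round(3))
```

With the full `vehicle` frame, the feature matrix would be passed alongside the target so that both are split consistently.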

---
### Univariate Analysis
Let's explore the spread of data points or the observations for each independent attribute. We will be using the density curve plus histogram and boxplot for numerical features and count plot for discrete features.
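A minimal sketch of such a univariate view is given below, using plain matplotlib to draw a histogram next to a boxplot. It omits the density curve, uses synthetic stand-in data rather than a column of `vehicle`, and the helper name `plot_univariate` is hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_univariate(values, name):
    """Draw a histogram and a boxplot of one numeric feature side by side."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]        # drop missing values before plotting
    fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))
    ax_hist.hist(values, bins=20, density=True)
    ax_hist.set_title(f'Distribution of {name}')
    ax_box.boxplot(values)
    ax_box.set_title(f'Boxplot of {name}')
    return fig

# Synthetic stand-in data with one missing value and one extreme value.
rng = np.random.default_rng(0)
demo = np.append(rng.normal(60, 8, size=200), [np.nan, 138.0])
fig = plot_univariate(demo, 'pr.axis_aspect_ratio')
```

In the notebook, the same helper could be called in a loop over the numeric columns, with the extreme value showing up as an isolated point in the boxplot.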