# Apply Random Forest classification on given dataset (car_evaluation dataset).                                          
**Link to the dataset:** https://www.kaggle.com/elikplim/car-evaluation-data-set                   
                             
**Dataset detail:**                         
- No. of attributes: 7             
- buying: buying price                   
- maint: price of the maintenance                    
- doors: number of doors                                 
- persons: capacity in terms of persons to carry                                               
- lug_boot: the size of luggage boot                                             
- safety: estimated safety of the car                                      
- class: car acceptability (Target attribute)                                 

**Random Forest Classifier:**                            
Random forests is a supervised learning algorithm. It can be used both for classification and regression. It is also the most flexible and easy to use algorithm. A forest is comprised of trees. It is said that the more trees it has, the more robust a forest is. Random forests creates decision trees on randomly selected data samples, gets prediction from each tree and selects the best solution by means of voting. It also provides a pretty good indicator of the feature importance.     

Random forest algorithm combines multiple decision-trees, resulting in a forest of trees, hence the name Random Forest. In the random forest classifier, the higher the number of trees in the forest results in higher accuracy.            

It technically is an ensemble method (based on the divide-and-conquer approach) of decision trees generated on a randomly split dataset. This collection of decision tree classifiers is also known as the forest. The individual decision trees are generated using an attribute selection indicator such as information gain, gain ratio, and Gini index for each attribute. Each tree depends on an independent random sample. In a classification problem, each tree votes and the most popular class is chosen as the final result. In the case of regression, the average of all the tree outputs is considered as the final result. It is simpler and more powerful compared to the other non-linear classification algorithms.           

Random forests has a variety of applications, such as recommendation engines, image classification and feature selection. It can be used to classify loyal loan applicants, identify fraudulent activity and predict diseases. It lies at the base of the Boruta algorithm, which selects important features in a dataset.                         

**Advantages:**                              
- Random forests is considered as a highly accurate and robust method because of the number of decision trees participating in the process.                       
- It does not suffer from the overfitting problem. The main reason is that it takes the average of all the predictions, which cancels out the biases.                                                    
- The algorithm can be used in both classification and regression problems.                             
- Random forests can also handle missing values. There are two ways to handle these: using median values to replace continuous variables, and computing the proximity-weighted average of missing values.                 
- You can get the relative feature importance, which helps in selecting the most contributing features for the classifier.                           

## Importing dataset                      

In [1]:
# Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
# To supress warnings
import warnings
warnings.filterwarnings('ignore')

In [3]:
# Importing dataset from the directory
df = pd.read_csv('car_evaluation.csv', header=None)

## Exploring the dataset                   

In [4]:
# Shape of the dataset
df.shape

(1728, 7)

In [5]:
# First 5 tuples of dataset
df.head()

Unnamed: 0,0,1,2,3,4,5,6
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc


In [6]:
# Renaming the column names as required
col_names = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']
df.columns = col_names
col_names

['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']

In [7]:
# Checking the column names
df.head()

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,class
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc


In [8]:
# Viewing summary
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
buying      1728 non-null object
maint       1728 non-null object
doors       1728 non-null object
persons     1728 non-null object
lug_boot    1728 non-null object
safety      1728 non-null object
class       1728 non-null object
dtypes: object(7)
memory usage: 94.6+ KB


In [9]:
# Frequency distribution of values in variables
col_names = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']
for col in col_names:
 print(df[col].value_counts())

high     432
low      432
med      432
vhigh    432
Name: buying, dtype: int64
high     432
low      432
med      432
vhigh    432
Name: maint, dtype: int64
2        432
3        432
4        432
5more    432
Name: doors, dtype: int64
more    576
2       576
4       576
Name: persons, dtype: int64
small    576
med      576
big      576
Name: lug_boot, dtype: int64
high    576
low     576
med     576
Name: safety, dtype: int64
unacc    1210
acc       384
good       69
vgood      65
Name: class, dtype: int64


We can see that the doors and persons are also categorical in nature. So, we will treat them as categorical variables. There are 7 variables in the dataset. All the variables are of categorical data type. These are given by buying, maint, doors, persons, lug_boot, safety and class. class is the target variable.                     

In [10]:
df['class'].value_counts()

unacc    1210
acc       384
good       69
vgood      65
Name: class, dtype: int64

In [11]:
# check missing values in variables
df.isnull().sum()

buying      0
maint       0
doors       0
persons     0
lug_boot    0
safety      0
class       0
dtype: int64

## Separating the dependent and independent variables                      

In [12]:
# Separating the dependent and independent variables
X = df.drop(['class'], axis=1)
y = df['class']

## Splitting data into training and testing sets                            

In [13]:
# Splitting data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30)

In [14]:
# check the shape of X_train and X_test
X_train.shape, X_test.shape

((1209, 6), (519, 6))

In [15]:
# check data types in X_train
X_train.dtypes

buying      object
maint       object
doors       object
persons     object
lug_boot    object
safety      object
dtype: object

## Encode categorical variables in training set                                       

In [16]:
# Encode categorical variables
X_train.head()

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety
1616,low,med,5more,more,med,high
65,vhigh,vhigh,4,4,small,high
743,high,med,5more,4,med,high
376,vhigh,low,3,more,big,med
266,vhigh,med,3,more,med,high


In [17]:
# pip install --upgrade category_encoders

In [18]:
#import category encoders
import category_encoders as ce

In [19]:
# encode categorical variables with ordinal encoding
encoder = ce.OrdinalEncoder(cols=['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety'])
X_train = encoder.fit_transform(X_train)
X_test = encoder.transform(X_test)

In [20]:
X_train.head()

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety
1616,1,1,1,1,1,1
65,2,2,2,2,2,1
743,3,1,1,2,1,1
376,2,3,3,1,3,2
266,2,1,3,1,1,1


In [21]:
X_test.head()

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety
7,2,2,4,3,3,2
780,3,3,4,1,3,3
1051,4,4,2,1,3,2
749,3,1,1,1,2,1
342,2,3,4,1,2,3


## Implementing Random Forest Classifier                        

In [22]:
# import Random Forest classifier
from sklearn.ensemble import RandomForestClassifier

# instantiate the classifier 
rfc = RandomForestClassifier(n_estimators=100, criterion='entropy')

# fit the model
rfc.fit(X_train, y_train)

# Predict the Test set results
y_pred = rfc.predict(X_test)

# Importing library to check accuracy score 
from sklearn.metrics import accuracy_score

# Print accuracy score
print('Model accuracy score with 10 decision-trees : {0:0.4f}'. format(accuracy_score(y_test,y_pred)))

Model accuracy score with 10 decision-trees : 0.9422


Here, we have build the Random Forest Classifier model with default parameter of n_estimators = 10. So, we have used 10 decision-trees to build the model. Now, we will increase the number of decision-trees and see its effect on accuracy.                        

In [23]:
rfc_100 = RandomForestClassifier(n_estimators=100, criterion = 'entropy', random_state=0)

# fit the model to the training set
rfc_100.fit(X_train, y_train)

# Predict on the test set results
y_pred_100 = rfc_100.predict(X_test)

# Print accuracy score 
print('Model accuracy score with 100 decision-trees : {0:0.4f}'. format(accuracy_score(y_test,y_pred_100)))

Model accuracy score with 100 decision-trees : 0.9422


So, as expected accuracy increases with number of decision-trees in the model.                 

## Confusion Matrix and Classification Report                         

In [24]:
# Print the Confusion Matrix 
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred_100)
print('Confusion matrix:\n\n', cm)

Confusion matrix:

 [[105   3   5   0]
 [  8  11   0   2]
 [  6   0 357   0]
 [  4   2   0  16]]


In [25]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred_100))

              precision    recall  f1-score   support

         acc       0.85      0.93      0.89       113
        good       0.69      0.52      0.59        21
       unacc       0.99      0.98      0.98       363
       vgood       0.89      0.73      0.80        22

    accuracy                           0.94       519
   macro avg       0.85      0.79      0.82       519
weighted avg       0.94      0.94      0.94       519

