<a href="https://www.kaggle.com/code/realshaktigupta/car-acceptability-classifier?scriptVersionId=132428298" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Car Acceptability Classifier with 99% Accuracy 

We want to build a classifier that sorts cars into four categories of consumer acceptability. The dataset is entirely categorical, so this project differs slightly from a typical classification project, where the training data has at least a few numerical features.

* [1. Why we will use CatBoost instead of OneHotEncoding](#one)
* [2. Importing, Examining and Preparing the data](#two)
* [3. Training and Testing the model](#three)

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/car-acceptability-classification-dataset/car.csv
/kaggle/input/car-acceptability-classification-dataset/car.data


## 1. Why we will use CatBoost instead of OneHotEncoding <a class='anchor' id='one'></a>

We will be working on a purely categorical dataset for this classification project. One-hot encoding is a simple and common way to handle categorical attributes, but it scales poorly. When a feature has many categories, one-hot encoding creates one new column per category, so the dimensionality of the data grows with the total number of categories across all features.  
For example, if a column has 50 categories, one-hot encoding replaces it with 50 indicator columns, a net addition of 49 columns for that one feature.
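
The column count in the 50-category example can be checked with a small sketch. The `city` column and its values are made up for illustration and are not part of the car dataset; `pd.get_dummies` is used here as one common way to one-hot encode.

```python
import pandas as pd

# Hypothetical column with 50 distinct categories (not from the car dataset)
df = pd.DataFrame({"city": [f"city_{i}" for i in range(50)]})

# One-hot encode: the single column is replaced by one indicator column per category
encoded = pd.get_dummies(df, columns=["city"])

extra_columns = encoded.shape[1] - df.shape[1]
print(df.shape)        # (50, 1)
print(encoded.shape)   # (50, 50)
print(extra_columns)   # 49
```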

CatBoost is a gradient boosting library that handles categorical features natively, so no manual encoding step is needed. We will use it in this project and, as we will see, it fits this data well and reaches high accuracy on the held-out test set.
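
To give a rough idea of how a category can become a single number instead of many indicator columns, here is a simplified sketch of an ordered target statistic, the kind of encoding CatBoost's documentation describes for categorical features. It is an illustration under assumptions, not the library's implementation: the function name, the `prior` value, the binary target and the toy data are all invented here, and the real library does considerably more (for example, it works over random permutations of the rows).

```python
from collections import defaultdict

def ordered_target_statistic(categories, targets, prior=0.5):
    """Encode each row's category using only the targets of earlier rows."""
    sums = defaultdict(float)   # running sum of targets seen so far, per category
    counts = defaultdict(int)   # running number of rows seen so far, per category
    encoded = []
    for cat, target in zip(categories, targets):
        # Only rows before the current one contribute, so a row never sees its own label
        encoded.append((sums[cat] + prior) / (counts[cat] + 1))
        sums[cat] += target
        counts[cat] += 1
    return encoded

# Toy data, invented for illustration
safety = ["low", "high", "low", "high", "high"]
accepted = [0, 1, 0, 1, 0]

values = ordered_target_statistic(safety, accepted)
print(values)
```

Each categorical column maps to exactly one numeric column this way, regardless of how many categories it has.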

## 2. Importing, Examining and Preparing the data <a class='anchor' id='two'></a>

In [2]:
cars=pd.read_csv("/kaggle/input/car-acceptability-classification-dataset/car.csv")

In [3]:
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Buying_Price       1728 non-null   object
 1   Maintenance_Price  1728 non-null   object
 2   No_of_Doors        1728 non-null   object
 3   Person_Capacity    1728 non-null   object
 4   Size_of_Luggage    1728 non-null   object
 5   Safety             1728 non-null   object
 6   Car_Acceptability  1728 non-null   object
dtypes: object(7)
memory usage: 94.6+ KB


In [4]:
for key in cars.keys():
    print(key,np.unique(cars[key]))

Buying_Price ['high' 'low' 'med' 'vhigh']
Maintenance_Price ['high' 'low' 'med' 'vhigh']
No_of_Doors ['2' '3' '4' '5more']
Person_Capacity ['2' '4' 'more']
Size_of_Luggage ['big' 'med' 'small']
Safety ['high' 'low' 'med']
Car_Acceptability ['acc' 'good' 'unacc' 'vgood']
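
As a quick sanity check on the values printed above, the sketch below counts how many distinct feature combinations are possible. The `levels` dictionary is copied by hand from that output; nothing is read from the CSV.

```python
from itertools import product
from math import prod

# Feature levels copied from the np.unique output above
levels = {
    "Buying_Price": ["high", "low", "med", "vhigh"],
    "Maintenance_Price": ["high", "low", "med", "vhigh"],
    "No_of_Doors": ["2", "3", "4", "5more"],
    "Person_Capacity": ["2", "4", "more"],
    "Size_of_Luggage": ["big", "med", "small"],
    "Safety": ["high", "low", "med"],
}

# Number of possible feature combinations: 4 * 4 * 4 * 3 * 3 * 3
n_combinations = prod(len(values) for values in levels.values())
all_rows = list(product(*levels.values()))

print(n_combinations)  # 1728
```

The count equals the 1728 rows reported by `cars.info()`, which is consistent with each combination appearing exactly once. That is only an inference from the counts; confirming it would require checking the feature columns of `cars` for duplicate rows.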


In [5]:
X=cars.drop("Car_Acceptability",axis=1)
y=cars.Car_Acceptability

In [6]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=100,random_state=42)
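
Note that an integer `test_size` is read by `train_test_split` as an absolute number of test rows, not a fraction. The sketch below uses placeholder arrays (not the car data) with the same row count to show the resulting split sizes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the car dataset
X_demo = np.arange(1728).reshape(-1, 1)
y_demo = np.zeros(1728, dtype=int)

# An integer test_size requests exactly that many test rows
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=100, random_state=42
)

print(len(X_tr), len(X_te))  # 1628 100
```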

In [7]:
X_train.head()

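| | Buying_Price | Maintenance_Price | No_of_Doors | Person_Capacity | Size_of_Luggage | Safety |
|---|---|---|---|---|---|---|
| 70 | vhigh | vhigh | 4 | 4 | big | med |
| 29 | vhigh | vhigh | 3 | 2 | small | high |
| 1540 | low | med | 3 | 2 | small | med |
| 69 | vhigh | vhigh | 4 | 4 | big | low |
| 1228 | med | low | 3 | 4 | med | med |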


## 3. Training and Testing the model <a class='anchor' id='three'></a>

In [8]:
from catboost import CatBoostClassifier
# Default hyperparameters; all six feature columns (indices 0-5) are categorical
clf = CatBoostClassifier()
clf.fit(X_train,y_train,cat_features=list(range(6)),verbose=False)

<catboost.core.CatBoostClassifier at 0x7bb7fbf7af80>

In [9]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,clf.predict(X_test))

0.99
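
Because the test set has 100 rows, an accuracy of 0.99 corresponds to a single misclassified car. Accuracy alone does not show which class that error falls in, so a confusion matrix is a useful companion. The sketch below uses a small set of invented labels, not the notebook's predictions, to show the idea; in the notebook one would pass `y_test` and `clf.predict(X_test)` instead.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

labels = ["unacc", "acc", "good", "vgood"]

# Toy labels, invented for illustration: one "good" car is predicted as "acc"
y_true = ["unacc"] * 6 + ["acc"] * 2 + ["good", "vgood"]
y_pred = ["unacc"] * 6 + ["acc"] * 2 + ["acc", "vgood"]

acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes, both in the order of `labels`
cm = confusion_matrix(y_true, y_pred, labels=labels)

print(acc)
print(cm)
```

In this toy example the overall accuracy looks good, yet the only `good` car is misclassified, which the matrix makes visible in its third row.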