Using the dataset from https://archive.ics.uci.edu/ml/datasets/Car+Evaluation, create a machine learning model to predict the buying price given the following parameters:

Maintenance = High, Number of doors = 4, Lug Boot Size = Big, Safety = High, Class Value = Good

In [1]:
import pandas as pd 
import numpy as np

In [2]:
full_car_list = []

# Read the comma-separated file line by line; the context manager closes the file afterwards
with open('/data/car.data', 'r') as car_data:
    for line in car_data:
        full_car_list.append(line.strip().split(","))


### From the description file:
- buying
    - buying price 
    - vhigh, high, med, low
- maint
    - price of maintenance
    - vhigh, high, med, low
- doors
    - number of doors 
    - 2, 3, 4, 5more
- persons
    - capacity in terms of the persons to carry
    - 2, 4, more 
- lug_boot
    - the size of the luggage boot
    - small, med, big
- safety
    - estimated safety of the car 
    - low, med, high
- car_eval_domain
    - unacc, acc, good, vgood

In [3]:
data_df = pd.DataFrame(full_car_list, columns = ["buying","maint","doors","persons","lug_boot","safety","car_eval_domain"])
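The manual parse and the `DataFrame` construction above could probably be collapsed into a single `pd.read_csv` call. The sketch below assumes the file is a headerless comma-separated file, as the parsing code implies. The helper name `load_car_data` and the in-memory two-row sample (rows 0 and 1724 from the table below) are illustrative stand-ins so the snippet runs without `/data/car.data`.

```python
import io

import pandas as pd

COLUMNS = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "car_eval_domain"]


def load_car_data(source):
    """Read the headerless CSV; dtype=str keeps values such as "2" and "4" as strings."""
    return pd.read_csv(source, header=None, names=COLUMNS, dtype=str)


# In-memory stand-in for the real file (hypothetical sample, two rows only)
sample = io.StringIO(
    "vhigh,vhigh,2,2,small,low,unacc\n"
    "low,low,5more,more,med,high,vgood\n"
)
sample_df = load_car_data(sample)
print(sample_df.shape)
```

With the real file, the call would be `load_car_data('/data/car.data')`.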

In [4]:
data_df

,buying,maint,doors,persons,lug_boot,safety,car_eval_domain
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc
...,...,...,...,...,...,...,...
1723,low,low,5more,more,med,med,good
1724,low,low,5more,more,med,high,vgood
1725,low,low,5more,more,big,low,unacc
1726,low,low,5more,more,big,med,good


## EDA + data processing

In [5]:
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   buying           1728 non-null   object
 1   maint            1728 non-null   object
 2   doors            1728 non-null   object
 3   persons          1728 non-null   object
 4   lug_boot         1728 non-null   object
 5   safety           1728 non-null   object
 6   car_eval_domain  1728 non-null   object
dtypes: object(7)
memory usage: 94.6+ KB


There are no null values, as stated in the description.

In [6]:
data_df["buying"].value_counts()

low      432
vhigh    432
med      432
high     432
Name: buying, dtype: int64

In [7]:
data_df["maint"].value_counts()

low      432
vhigh    432
med      432
high     432
Name: maint, dtype: int64

In [8]:
data_df["doors"].value_counts()

2        432
3        432
5more    432
4        432
Name: doors, dtype: int64

In [9]:
data_df["persons"].value_counts()

2       576
more    576
4       576
Name: persons, dtype: int64

In [10]:
data_df["lug_boot"].value_counts()

big      576
small    576
med      576
Name: lug_boot, dtype: int64

In [11]:
data_df["safety"].value_counts()

low     576
med     576
high    576
Name: safety, dtype: int64

In [12]:
data_df["car_eval_domain"].value_counts()

unacc    1210
acc       384
good       69
vgood      65
Name: car_eval_domain, dtype: int64

All the values have been checked and match the description. Every column is a categorical variable.
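The per-column `value_counts()` checks above can be condensed into one automated comparison against the description file. This is a minimal sketch: the allowed values are copied from the description list earlier in the notebook, while the helper name `unexpected_values` and the two-row demo frame are hypothetical.

```python
import pandas as pd

# Allowed values per column, as listed in the description file
EXPECTED_VALUES = {
    "buying": {"vhigh", "high", "med", "low"},
    "maint": {"vhigh", "high", "med", "low"},
    "doors": {"2", "3", "4", "5more"},
    "persons": {"2", "4", "more"},
    "lug_boot": {"small", "med", "big"},
    "safety": {"low", "med", "high"},
    "car_eval_domain": {"unacc", "acc", "good", "vgood"},
}


def unexpected_values(df, expected=EXPECTED_VALUES):
    """Return {column: values present in df but absent from the description}."""
    problems = {}
    for column, allowed in expected.items():
        extra = set(df[column].unique()) - allowed
        if extra:
            problems[column] = extra
    return problems


# Demo frame: the second row deliberately contains an invalid 'doors' value
demo_df = pd.DataFrame(
    [
        ["vhigh", "vhigh", "2", "2", "small", "low", "unacc"],
        ["low", "low", "5", "more", "big", "med", "good"],
    ],
    columns=list(EXPECTED_VALUES),
)
print(unexpected_values(demo_df))
```

An empty dictionary from `unexpected_values(data_df)` would confirm that every column matches the description.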

In [13]:
from sklearn.preprocessing import LabelEncoder

In [14]:
le = LabelEncoder()

In [15]:
le.fit(data_df["buying"])

LabelEncoder()

In [16]:
le.classes_

array(['high', 'low', 'med', 'vhigh'], dtype=object)
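`LabelEncoder` sorts the classes alphabetically, so the integer codes do not follow the price order (`low` < `med` < `high` < `vhigh`). That is harmless for a classification target, but it helps to keep the code-to-label mapping at hand when reading the reports later on. A small sketch, using a separate encoder named `demo_le` (a hypothetical name) fitted on the four `buying` values:

```python
from sklearn.preprocessing import LabelEncoder

demo_le = LabelEncoder()
demo_le.fit(["vhigh", "high", "med", "low"])

# Map each integer code back to its label
code_to_label = {int(code): str(label) for code, label in enumerate(demo_le.classes_)}
print(code_to_label)  # {0: 'high', 1: 'low', 2: 'med', 3: 'vhigh'}

# inverse_transform does the same lookup for an array of predictions
decoded = demo_le.inverse_transform([1])
print(decoded[0])  # low
```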

In [17]:
y_numpy = le.transform(data_df["buying"])

In [18]:
X_df = data_df.drop(["buying"], axis = 1)

In [19]:
X_df

,maint,doors,persons,lug_boot,safety,car_eval_domain
0,vhigh,2,2,small,low,unacc
1,vhigh,2,2,small,med,unacc
2,vhigh,2,2,small,high,unacc
3,vhigh,2,2,med,low,unacc
4,vhigh,2,2,med,med,unacc
...,...,...,...,...,...,...
1723,low,5more,more,med,med,good
1724,low,5more,more,med,high,vgood
1725,low,5more,more,big,low,unacc
1726,low,5more,more,big,med,good


In [20]:
X_df.columns

Index(['maint', 'doors', 'persons', 'lug_boot', 'safety', 'car_eval_domain'], dtype='object')

In [21]:
final_X_df = pd.get_dummies(X_df)

In [22]:
final_X_df

,maint_high,maint_low,maint_med,maint_vhigh,doors_2,doors_3,doors_4,doors_5more,persons_2,persons_4,...,lug_boot_big,lug_boot_med,lug_boot_small,safety_high,safety_low,safety_med,car_eval_domain_acc,car_eval_domain_good,car_eval_domain_unacc,car_eval_domain_vgood
0,0,0,0,1,1,0,0,0,1,0,...,0,0,1,0,1,0,0,0,1,0
1,0,0,0,1,1,0,0,0,1,0,...,0,0,1,0,0,1,0,0,1,0
2,0,0,0,1,1,0,0,0,1,0,...,0,0,1,1,0,0,0,0,1,0
3,0,0,0,1,1,0,0,0,1,0,...,0,1,0,0,1,0,0,0,1,0
4,0,0,0,1,1,0,0,0,1,0,...,0,1,0,0,0,1,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1723,0,1,0,0,0,0,0,1,0,0,...,0,1,0,0,0,1,0,1,0,0
1724,0,1,0,0,0,0,0,1,0,0,...,0,1,0,1,0,0,0,0,0,1
1725,0,1,0,0,0,0,0,1,0,0,...,1,0,0,0,1,0,0,0,1,0
1726,0,1,0,0,0,0,0,1,0,0,...,1,0,0,0,0,1,0,1,0,0


In [23]:
final_X_df.columns

Index(['maint_high', 'maint_low', 'maint_med', 'maint_vhigh', 'doors_2',
       'doors_3', 'doors_4', 'doors_5more', 'persons_2', 'persons_4',
       'persons_more', 'lug_boot_big', 'lug_boot_med', 'lug_boot_small',
       'safety_high', 'safety_low', 'safety_med', 'car_eval_domain_acc',
       'car_eval_domain_good', 'car_eval_domain_unacc',
       'car_eval_domain_vgood'],
      dtype='object')

In [24]:
X_numpy = final_X_df.to_numpy()

In [25]:
X_numpy

array([[0, 0, 0, ..., 0, 1, 0],
       [0, 0, 0, ..., 0, 1, 0],
       [0, 0, 0, ..., 0, 1, 0],
       ...,
       [0, 1, 0, ..., 0, 1, 0],
       [0, 1, 0, ..., 1, 0, 0],
       [0, 1, 0, ..., 0, 0, 1]], dtype=uint8)

In [26]:
y_numpy

array([3, 3, 3, ..., 1, 1, 1])

In [27]:
from sklearn.model_selection import train_test_split

In [28]:
X_train, X_test, y_train, y_test = train_test_split(X_numpy, y_numpy, test_size = 0.25, random_state = 42)

In [29]:
X_train

array([[1, 0, 0, ..., 0, 1, 0],
       [0, 1, 0, ..., 0, 1, 0],
       [1, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 1, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 1, 0],
       [0, 0, 1, ..., 0, 0, 0]], dtype=uint8)

In [30]:
np.unique(y_train, return_counts = True) 

(array([0, 1, 2, 3]), array([312, 329, 332, 323]))

In [31]:
np.unique(y_test, return_counts = True)

(array([0, 1, 2, 3]), array([120, 103, 100, 109]))

The class distribution is almost balanced in both the training and test sets.
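If an exactly proportional split is wanted, `train_test_split` accepts a `stratify` argument. The sketch below does not use the notebook's arrays: `y_demo` is a stand-in target with 432 samples per class (matching the `buying` counts above), and `X_demo` is a placeholder feature matrix.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in target: four classes, 432 samples each (assumption mirroring the buying counts)
y_demo = np.repeat(np.arange(4), 432)
X_demo = np.zeros((len(y_demo), 1))  # placeholder features, not used for anything here

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# Per-class counts in each part of the split
train_counts = np.unique(y_tr, return_counts=True)[1]
test_counts = np.unique(y_te, return_counts=True)[1]
print(train_counts, test_counts)
```

In the notebook, the equivalent change would be to add `stratify = y_numpy` to the existing `train_test_split` call.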

### Modelling - SVM, logistic regression

In [32]:
from sklearn.svm import LinearSVC, SVC

In [33]:
from sklearn.model_selection import cross_val_score

In [34]:
from sklearn.metrics import classification_report

##### Linear SVC

In [35]:
lin_clf = LinearSVC()

In [36]:
lin_clf.fit(X_train, y_train)

LinearSVC()

In [37]:
y_predict = lin_clf.predict(X_test)

In [38]:
print(classification_report(y_test,y_predict))

              precision    recall  f1-score   support

           0       0.26      0.15      0.19       120
           1       0.34      0.31      0.32       103
           2       0.22      0.22      0.22       100
           3       0.35      0.54      0.42       109

    accuracy                           0.30       432
   macro avg       0.29      0.31      0.29       432
weighted avg       0.29      0.30      0.29       432



In [39]:
kfold_accuracy_scores_lin_svc = cross_val_score(lin_clf, X_train, y_train, scoring = "accuracy")
kfold_accuracy_scores_lin_svc

array([0.31923077, 0.24324324, 0.36679537, 0.35135135, 0.37065637])

In [40]:
# Separate name for the linear SVC average so the RBF SVC cell below does not overwrite it
avg_accuracy_score_lin_svc = sum(kfold_accuracy_scores_lin_svc) / 5
avg_accuracy_score_lin_svc

0.3302554202554202

##### SVC with kernel = rbf

In [41]:
clf = SVC(kernel = "rbf")

In [42]:
clf.fit(X_train, y_train)

SVC()

In [43]:
y_predict = clf.predict(X_test)

In [44]:
print(classification_report(y_test,y_predict))

              precision    recall  f1-score   support

           0       0.13      0.06      0.08       120
           1       0.25      0.31      0.28       103
           2       0.10      0.16      0.13       100
           3       0.21      0.19      0.20       109

    accuracy                           0.18       432
   macro avg       0.18      0.18      0.17       432
weighted avg       0.17      0.18      0.17       432



In [45]:
kfold_accuracy_scores_svc = cross_val_score(clf, X_train, y_train, scoring = "accuracy")
kfold_accuracy_scores_svc

array([0.23461538, 0.18532819, 0.22393822, 0.25868726, 0.26640927])

In [46]:
avg_accuracy_score_svc = sum(kfold_accuracy_scores_svc) / 5
avg_accuracy_score_svc

0.2337956637956638

##### Logistic regression

In [47]:
from sklearn.linear_model import LogisticRegression 

In [48]:
clf = LogisticRegression(multi_class='multinomial', solver='lbfgs')

In [49]:
clf.fit(X_train, y_train)

LogisticRegression(multi_class='multinomial')

In [50]:
y_predict = clf.predict(X_test)

In [51]:
print(classification_report(y_test,y_predict))

              precision    recall  f1-score   support

           0       0.28      0.16      0.20       120
           1       0.33      0.30      0.32       103
           2       0.21      0.23      0.22       100
           3       0.35      0.51      0.41       109

    accuracy                           0.30       432
   macro avg       0.29      0.30      0.29       432
weighted avg       0.29      0.30      0.29       432



In [52]:
kfold_accuracy_scores_lg = cross_val_score(clf, X_train, y_train, scoring = "accuracy")
kfold_accuracy_scores_lg

array([0.30769231, 0.24324324, 0.36679537, 0.35135135, 0.37065637])

In [53]:
avg_accuracy_score_lg = sum(kfold_accuracy_scores_lg) / 5
avg_accuracy_score_lg

0.3279477279477279

Linear SVC has the highest mean cross-validation accuracy (0.330, versus 0.328 for logistic regression and 0.234 for the RBF SVC), so I use it for the final model trained on the whole dataset.
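The model comparison above can also be written as one loop, which makes it easy to add a baseline for reference. This is a sketch under assumptions rather than the notebook's exact procedure: `mean_cv_accuracy` is a hypothetical helper, the `DummyClassifier` baseline and `max_iter=1000` are additions, and the random arrays only stand in for `X_train` and `y_train` so the snippet runs on its own.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC, LinearSVC


def mean_cv_accuracy(models, X, y, cv=5):
    """Return {model name: mean k-fold accuracy} for each candidate model."""
    return {
        name: float(cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean())
        for name, model in models.items()
    }


candidates = {
    "baseline": DummyClassifier(strategy="most_frequent"),  # always predicts one class
    "linear_svc": LinearSVC(),
    "svc_rbf": SVC(kernel="rbf"),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# Random stand-in data (binary features, four classes); replace with X_train, y_train
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(200, 6))
y_demo = rng.integers(0, 4, size=200)

scores = mean_cv_accuracy(candidates, X_demo, y_demo)
best_name = max(scores, key=scores.get)
print(scores, best_name)
```

Comparing each score with the baseline shows how much a model has learned beyond guessing a single class.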

### Final model training - with all the data

In [54]:
lin_clf = LinearSVC()

In [55]:
lin_clf.fit(X_numpy, y_numpy)

LinearSVC()

### Predict the buying price given the following parameters
- Maintenance = High, Number of doors = 4, Lug Boot Size = Big, Safety = High, Class Value = Good

In [56]:
X_df.columns

Index(['maint', 'doors', 'persons', 'lug_boot', 'safety', 'car_eval_domain'], dtype='object')

The `persons` column is not given, so I use `car_eval_domain` (the class value) as a guide to fill it in.
This is because:
1. The other feature columns cannot suggest a value for `persons`: their values appear equally often with every `persons` value, so they carry no information about it. `car_eval_domain` is not balanced in this way, so I can take the `persons` value that occurs most frequently among rows with the given class value.
2. Due to time constraints and data size limits, I did not do clustering. How I would do it:
    - find the clusters in this dataset with the y (`buying`) and `persons` columns removed, e.g. using k-means with k = 3
    - find the cluster that is most similar to this set of parameters
    - add the `persons` column back to the rows belonging to this cluster, find the `persons` value with the highest frequency, and use that for the prediction
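The clustering idea in point 2 could be sketched roughly as follows. This was not run as part of the notebook. The function name `impute_by_cluster`, its parameters, and the six-row toy frame are all hypothetical, and the toy frame uses k = 2 rather than k = 3 because it contains only two distinct feature combinations.

```python
import pandas as pd
from sklearn.cluster import KMeans


def impute_by_cluster(df, query, target_col, missing_col, k=3, random_state=0):
    """Guess `missing_col` for `query` as the most frequent value in its k-means cluster."""
    # Cluster on the one-hot encoded features, without the target and the missing column
    features = df.drop(columns=[target_col, missing_col])
    encoded = pd.get_dummies(features).astype(float)
    km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(encoded)

    # Encode the query with the same dummy columns; categories it lacks become 0
    query_encoded = (
        pd.get_dummies(pd.DataFrame([query]))
        .reindex(columns=encoded.columns, fill_value=0)
        .astype(float)
    )
    cluster = km.predict(query_encoded)[0]

    # Most frequent value of the missing column among rows in that cluster
    members = df.loc[km.labels_ == cluster, missing_col]
    return members.mode().iloc[0]


# Toy data with two well-separated groups (illustrative only, not the car dataset)
toy_df = pd.DataFrame({
    "buying": ["low", "low", "med", "high", "high", "vhigh"],
    "maint": ["high", "high", "high", "low", "low", "low"],
    "safety": ["high", "high", "high", "low", "low", "low"],
    "persons": ["4", "4", "more", "2", "2", "2"],
})
guess = impute_by_cluster(
    toy_df, {"maint": "high", "safety": "high"}, "buying", "persons", k=2
)
print(guess)
```

On the real data the call would pass `data_df`, the five known parameters, `"buying"`, and `"persons"`.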


In [57]:
data_df[data_df["car_eval_domain"] == "good"]["persons"].value_counts()

4       36
more    33
Name: persons, dtype: int64

In [58]:
# use persons = "4", the most frequent value among cars with class value "good"

In [59]:
unknown_row = {
    "maint": "high",
    "doors": "4",
    "persons": "4",
    "lug_boot": "big",
    "safety": "high",
    "car_eval_domain": "good"
}

In [60]:
len(X_df)

1728

In [61]:
unknown_row_df = pd.DataFrame(unknown_row, index = [len(X_df)] )

In [62]:
new_X_df = pd.concat([X_df, unknown_row_df], axis = 0)

In [63]:
final_new_X_df = pd.get_dummies(new_X_df)

In [64]:
# The appended row was given index len(X_df), which is 1728 here
unknown_row_vector = final_new_X_df.loc[len(X_df)].to_numpy()

In [65]:
unknown_row_vector.shape

(21,)

In [66]:
unknown_row_vector

array([1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0],
      dtype=uint8)
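An alternative to appending the new row to `X_df` and re-running `get_dummies` on everything is to encode the row on its own and align it to the training columns with `reindex`. In this minimal sketch, `encode_row` is a hypothetical helper, and the column list is copied from the `final_X_df.columns` output earlier in the notebook.

```python
import pandas as pd

# Dummy columns produced for the training data (copied from final_X_df.columns)
TRAINING_COLUMNS = [
    "maint_high", "maint_low", "maint_med", "maint_vhigh",
    "doors_2", "doors_3", "doors_4", "doors_5more",
    "persons_2", "persons_4", "persons_more",
    "lug_boot_big", "lug_boot_med", "lug_boot_small",
    "safety_high", "safety_low", "safety_med",
    "car_eval_domain_acc", "car_eval_domain_good",
    "car_eval_domain_unacc", "car_eval_domain_vgood",
]


def encode_row(row, training_columns):
    """One-hot encode a single dict-like row against the training dummy columns."""
    row_dummies = pd.get_dummies(pd.DataFrame([row]))
    # Fail loudly on a category that never appeared in training
    unknown = set(row_dummies.columns) - set(training_columns)
    if unknown:
        raise ValueError(f"categories not seen in training: {sorted(unknown)}")
    aligned = row_dummies.reindex(columns=training_columns, fill_value=0)
    return aligned.astype("uint8").to_numpy()  # shape (1, number of columns)


row = {
    "maint": "high", "doors": "4", "persons": "4",
    "lug_boot": "big", "safety": "high", "car_eval_domain": "good",
}
vector = encode_row(row, TRAINING_COLUMNS)
print(vector)
```

The result already has shape `(1, 21)`, so it could be passed to `lin_clf.predict` without the `reshape(1, -1)` step.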

In [67]:
lin_clf.predict(unknown_row_vector.reshape(1, -1))

array([1])

In [68]:
le.classes_[1]

'low'

Given this set of parameters and using persons = "4", my model predicts the buying price as "low".