## Car Evaluation

This data set contains 1,728 records and 6 attributes: buying price, maintenance price, number of doors, passenger capacity, relative size of the luggage boot, and estimated safety of each car. A seventh column holds the target class. There are no missing values in the data set, which is an advantage!

In [None]:
!pip install plotly "flaml[notebook]" auto-sklearn

In [1]:
# Built-in libraries
import pickle
from pathlib import Path

# Data analysis
import pandas as pd
import plotly.express as px

# Machine learning
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from flaml import AutoML
from autosklearn.classification import AutoSklearnClassifier



In [2]:
for path in Path('./datasets').rglob('*'):
    print(path)

datasets/titanic.csv
datasets/car_evaluation.csv


In [3]:
df = pd.read_csv('./datasets/car_evaluation.csv', header=None)
df.head()

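|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | vhigh | vhigh | 2 | 2 | small | low | unacc |
| 1 | vhigh | vhigh | 2 | 2 | small | med | unacc |
| 2 | vhigh | vhigh | 2 | 2 | small | high | unacc |
| 3 | vhigh | vhigh | 2 | 2 | med | low | unacc |
| 4 | vhigh | vhigh | 2 | 2 | med | med | unacc |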


In [4]:
df.rename(columns={
    0: 'buying',
    1: 'maint',
    2: 'doors',
    3: 'persons',
    4: 'lug_boot',
    5: 'safety',
    6: 'class',
}, inplace=True)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   buying    1728 non-null   object
 1   maint     1728 non-null   object
 2   doors     1728 non-null   object
 3   persons   1728 non-null   object
 4   lug_boot  1728 non-null   object
 5   safety    1728 non-null   object
 6   class     1728 non-null   object
dtypes: object(7)
memory usage: 94.6+ KB


In [6]:
df.describe()

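|   | buying | maint | doors | persons | lug_boot | safety | class |
|---|---|---|---|---|---|---|---|
| count | 1728 | 1728 | 1728 | 1728 | 1728 | 1728 | 1728 |
| unique | 4 | 4 | 4 | 3 | 3 | 3 | 4 |
| top | vhigh | vhigh | 2 | 2 | small | low | unacc |
| freq | 432 | 432 | 432 | 576 | 576 | 576 | 1210 |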


In [7]:
df.corr()

`df.corr()` yields no correlations here because every column is non-numeric (dtype `object`); the categories have to be encoded as numbers first.

In [8]:
df.groupby(by=['class']).agg({
    'buying': ['count', 'min', 'max'],
    'maint': ['count', 'min', 'max'],
    'doors': ['count', 'min', 'max'],
    'persons': ['count', 'min', 'max'],
    'lug_boot': ['count', 'min', 'max'],
    'safety': ['count', 'min', 'max'],
})

| class | buying (count) | buying (min) | buying (max) | maint (count) | maint (min) | maint (max) | doors (count) | doors (min) | doors (max) | persons (count) | persons (min) | persons (max) | lug_boot (count) | lug_boot (min) | lug_boot (max) | safety (count) | safety (min) | safety (max) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| acc | 384 | high | vhigh | 384 | high | vhigh | 384 | 2 | 5more | 384 | 4 | more | 384 | big | small | 384 | high | med |
| good | 69 | low | med | 69 | low | med | 69 | 2 | 5more | 69 | 4 | more | 69 | big | small | 69 | high | med |
| unacc | 1210 | high | vhigh | 1210 | high | vhigh | 1210 | 2 | 5more | 1210 | 2 | more | 1210 | big | small | 1210 | high | med |
| vgood | 65 | low | med | 65 | high | med | 65 | 2 | 5more | 65 | 4 | more | 65 | big | med | 65 | high | high |


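This grouping is not meaningful: `min` and `max` on string columns compare values alphabetically (so `high` sorts before `low`), not by their ordinal meaning.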
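
To make the alphabetical comparison concrete, here is a minimal sketch using the four `buying` levels. The `levels` list and the variable names are assumptions introduced for illustration; the intended order (cheapest first) is inferred from the level names, not stated in the data set.

```python
import pandas as pd

# Assumed intended order of the levels, cheapest first (illustrative).
levels = ['low', 'med', 'high', 'vhigh']
buying = pd.Series(['vhigh', 'high', 'med', 'low'])

# Plain strings compare alphabetically: 'high' < 'low' < 'med' < 'vhigh'.
lexicographic_min = buying.min()
lexicographic_max = buying.max()

# An ordered categorical compares by position in `levels` instead.
ordered = buying.astype(pd.CategoricalDtype(categories=levels, ordered=True))
ordered_min = ordered.min()
ordered_max = ordered.max()

print(lexicographic_min, lexicographic_max)
print(ordered_min, ordered_max)
```

With an ordered categorical dtype, the same `groupby(...).agg(['min', 'max'])` call would report the cheapest and most expensive level per class rather than the alphabetically first and last one.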

In [9]:
px.bar(data_frame=df, x='class', y='buying', color='safety',
       title='Is There Any Relationship Between the Safety of the Car and Its Price?')

I am not yet sure how best to plot categorical data; this chart puts a categorical variable on both axes, which makes it hard to read.
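
One possible approach is to count the rows per category combination first and plot the counts. The sketch below uses only the five rows shown by `df.head()` earlier, so the numbers are not representative of the full data set; `sample` and `counts` are names introduced here for illustration.

```python
import pandas as pd

# The `safety` and `class` values of the first five rows from df.head().
sample = pd.DataFrame({
    'safety': ['low', 'med', 'high', 'low', 'med'],
    'class': ['unacc'] * 5,
})

# One row per (class, safety) pair with the number of matching records.
counts = (
    sample.groupby(['class', 'safety'])
    .size()
    .reset_index(name='count')
)
print(counts)

# With plotly installed, the counts can be drawn as grouped bars:
# px.bar(counts, x='class', y='count', color='safety', barmode='group')
```

Because the y-axis is now a numeric count, the bar heights become comparable across classes and safety levels.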

In [10]:
# Encode every column (features and target) as integer codes.
# A fresh encoder per column keeps each fitted mapping separate.
for column in df.columns:
    df[column] = LabelEncoder().fit_transform(df[column])
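
Note that `LabelEncoder` assigns codes in sorted order of the labels, so the resulting integers do not follow the natural order of the levels. The sketch below contrasts this with an explicit mapping; the `order` dictionary is an assumption about the intended ranking (cheapest first), not something defined by the data set.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

buying = pd.Series(['vhigh', 'high', 'med', 'low'])

# LabelEncoder sorts the labels: high=0, low=1, med=2, vhigh=3.
le = LabelEncoder()
alphabetical = le.fit_transform(buying)

# Assumed ranking of the levels, cheapest first (illustrative).
order = {'low': 0, 'med': 1, 'high': 2, 'vhigh': 3}
ordinal = buying.map(order)

print(list(le.classes_))
print(list(alphabetical))
print(ordinal.tolist())
```

Tree-based models can usually cope with either coding, but correlations computed on the alphabetical codes are harder to interpret, since a larger code does not mean a higher level.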

Correlations after encoding the labels:

In [11]:
px.imshow(df.corr())

In [12]:
X = df.drop(columns=['class'])
y = df['class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((1382, 6), (1382,), (346, 6), (346,))
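
The split above passes neither `stratify` nor `random_state`, so the class proportions in the test set can vary between runs. A minimal sketch of a stratified, reproducible split follows, assuming the class counts from the grouping output (encoded 0 to 3 in alphabetical order) and placeholder features; the seed `42` is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Class counts from the grouping output: acc=0, good=1, unacc=2, vgood=3.
class_counts = {0: 384, 1: 69, 2: 1210, 3: 65}
y = np.repeat(list(class_counts.keys()), list(class_counts.values()))
X = np.zeros((len(y), 6))  # placeholder features; only the labels matter here

# stratify=y keeps each class at roughly the same share in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

test_counts = np.bincount(y_test, minlength=4)
print(X_train.shape, X_test.shape)
print(test_counts)
```

With only 65 and 69 records in the two smallest classes, stratification helps ensure that both appear in the test set in proportion to their frequency.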

I want to use two different AutoML frameworks to find the best model for this data set!

The first framework is `FLAML` by Microsoft:

In [None]:
# Optional: reload a previously saved FLAML model instead of retraining.
with open('./exports/car_evaluation_flaml_model.pkl', 'rb') as f:
    automl = pickle.load(f)
automl

In [20]:
automl = AutoML()
automl.fit(X_train, y_train, task='classification', time_budget=1*60)

[flaml.automl: 09-07 11:50:29] {2565} INFO - task = classification
[flaml.automl: 09-07 11:50:29] {2567} INFO - Data split method: stratified
[flaml.automl: 09-07 11:50:29] {2570} INFO - Evaluation method: cv
[flaml.automl: 09-07 11:50:29] {2689} INFO - Minimizing error metric: log_loss
[flaml.automl: 09-07 11:50:29] {2831} INFO - List of ML learners in AutoML Run: ['lgbm', 'rf', 'catboost', 'xgboost', 'extra_tree', 'xgb_limitdepth', 'lrl1']
[flaml.automl: 09-07 11:50:29] {3133} INFO - iteration 0, current learner lgbm
[flaml.automl: 09-07 11:50:29] {3266} INFO - Estimated sufficient time budget=856s. Estimated necessary time budget=21s.
[flaml.automl: 09-07 11:50:29] {3313} INFO -  at 0.1s,	estimator lgbm's best error=0.7532,	best estimator lgbm's best error=0.7532
[flaml.automl: 09-07 11:50:29] {3133} INFO - iteration 1, current learner lgbm
[flaml.automl: 09-07 11:50:29] {3313} INFO -  at 0.2s,	estimator lgbm's best error=0.7532,	best estimator lgbm's best error=0.7532



In [21]:
automl.model.estimator

RandomForestClassifier(criterion='entropy', max_features=1.0,
                       max_leaf_nodes=146, n_estimators=21, n_jobs=-1)

In [22]:
y_pred = automl.predict(X_test)
y_pred

array([0, 3, 0, 0, 2, 2, 2, 2, 2, 2, 0, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 2,
       2, 2, 2, 0, 0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 0, 2, 3, 0, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 2, 2, 2, 0, 2, 2, 2,
       2, 2, 2, 2, 0, 2, 2, 1, 2, 1, 0, 2, 1, 2, 2, 2, 2, 2, 2, 0, 2, 2,
       0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 0, 2,
       2, 2, 2, 2, 2, 0, 2, 0, 2, 2, 2, 2, 2, 2, 0, 0, 2, 0, 2, 0, 1, 2,
       2, 2, 0, 0, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2,
       0, 2, 0, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 3, 0, 2, 0, 3, 2, 0, 2, 2,
       2, 2, 2, 2, 0, 2, 3, 2, 1, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 1, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 1, 2, 0, 1, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 0, 2, 2, 2, 2, 1, 2, 2, 2, 2, 0, 0, 2, 2, 0, 0, 2, 0,
       0, 2, 2, 2, 2, 2, 3, 2, 2, 0, 2, 2, 2, 0, 2, 0, 2, 0, 2, 0, 2, 2,
       2, 2, 0, 2, 2, 2, 2, 0, 2, 0, 0, 0, 3, 1, 2, 2, 2, 2, 2, 2, 0, 2,
       0, 2, 2, 0, 3, 0, 0, 2, 2, 2, 2, 2, 2, 0, 1,

In [23]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.96      0.95      0.95        77
           1       0.93      1.00      0.97        14
           2       0.99      1.00      0.99       242
           3       1.00      0.77      0.87        13

    accuracy                           0.98       346
   macro avg       0.97      0.93      0.95       346
weighted avg       0.98      0.98      0.98       346



In [24]:
with open('./exports/car_evaluation_flaml_model.pkl', 'wb') as f:
    pickle.dump(automl, f)

The second framework is `Auto-Sklearn`:

In [None]:
# Optional: reload a previously saved Auto-Sklearn model instead of retraining.
with open('./exports/car_evaluation_autosklearn_model.pkl', 'rb') as f:
    automl = pickle.load(f)
automl

In [25]:
automl = AutoSklearnClassifier(time_left_for_this_task=1*60)
automl.fit(X_train, y_train)


The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.





AutoSklearnClassifier(per_run_time_limit=6, time_left_for_this_task=60)

In [26]:
automl.leaderboard()

| model_id | rank | ensemble_weight | type | cost | duration |
|---|---|---|---|---|---|
| 3 | 1 | 1.0 | adaboost | 0.299781 | 5.45901 |


In [27]:
automl.show_models()

{3: {'model_id': 3,
  'rank': 1,
  'cost': 0.29978118161925604,
  'ensemble_weight': 1.0,
  'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice at 0x7fd1481a9df0>,
  'balancing': Balancing(random_state=1),
  'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice at 0x7fd1486d2bb0>,
  'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice at 0x7fd1486d2b50>,
  'sklearn_classifier': AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1),
                     learning_rate=0.03743735372990651, n_estimators=475,
                     random_state=1)}}

In [28]:
y_pred = automl.predict(X_test)
y_pred

array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,

In [29]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        77
           1       0.00      0.00      0.00        14
           2       0.70      1.00      0.82       242
           3       0.00      0.00      0.00        13

    accuracy                           0.70       346
   macro avg       0.17      0.25      0.21       346
weighted avg       0.49      0.70      0.58       346




Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
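
The report shows that this model predicts class 2 for every test record, so it is no better than a majority-class baseline. The sketch below recomputes the headline numbers from the `support` column of the report alone; the variable names are introduced here for illustration.

```python
# Per-class support taken from the classification report above.
support = {0: 77, 1: 14, 2: 242, 3: 13}
total = sum(support.values())

# A baseline that always predicts the most frequent class.
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / total

# Recall is 1.0 for the predicted class and 0.0 for all others.
macro_recall = sum(
    1.0 if label == majority_class else 0.0 for label in support
) / len(support)

print(majority_class, round(baseline_accuracy, 2), macro_recall)
```

The baseline accuracy and macro recall match the report, which suggests the 60-second budget was too short for Auto-Sklearn to find a useful model on this run. The warning itself can be controlled by passing `zero_division=0` to `classification_report`.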





In [30]:
with open('./exports/car_evaluation_autosklearn_model.pkl', 'wb') as f:
    pickle.dump(automl, f)