## Car Evaluation

This data set contains 1,728 records described by six attributes: buying price, maintenance price, number of doors, passenger capacity, relative size of the luggage boot, and estimated safety of the car. A seventh column, the class, holds the overall evaluation of each car. The data set has no missing values, which is a big advantage: you can start developing your algorithm without imputing anything.
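The record count is consistent with the data set covering every combination of attribute levels exactly once. The sketch below rebuilds that grid with `itertools.product`; the level names are assumed from the values that appear in the outputs further down, and the full-grid reading is an inference from the counts rather than something this notebook verifies.

```python
from itertools import product

# Attribute levels, assumed from the values visible in the notebook outputs.
levels = {
    "buying": ["vhigh", "high", "med", "low"],
    "maint": ["vhigh", "high", "med", "low"],
    "doors": ["2", "3", "4", "5more"],
    "persons": ["2", "4", "more"],
    "lug_boot": ["small", "med", "big"],
    "safety": ["low", "med", "high"],
}

# Every combination of levels: 4 * 4 * 4 * 3 * 3 * 3 rows.
grid = list(product(*levels.values()))
print(len(grid))
print(grid[0])
```

If the assumption holds, each attribute level appears equally often, which matches the `freq` row reported by `df.describe()`.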

In [None]:
!pip install plotly "flaml[notebook]" auto-sklearn

In [1]:
# Built-in libraries
import pickle
from pathlib import Path

# Data analysis
import pandas as pd
import plotly.express as px

# Machine learning
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from flaml import AutoML
from autosklearn.classification import AutoSklearnClassifier



In [2]:
for path in Path('./datasets').rglob('*'):
    print(path)

datasets/titanic.csv
datasets/car_evaluation.csv


In [3]:
df = pd.read_csv('./datasets/car_evaluation.csv', header=None)
df.head()

Unnamed: 0,0,1,2,3,4,5,6
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc


In [4]:
df.rename(columns={
    0: 'buying',
    1: 'maint',
    2: 'doors',
    3: 'persons',
    4: 'lug_boot',
    5: 'safety',
    6: 'class',
}, inplace=True)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   buying    1728 non-null   object
 1   maint     1728 non-null   object
 2   doors     1728 non-null   object
 3   persons   1728 non-null   object
 4   lug_boot  1728 non-null   object
 5   safety    1728 non-null   object
 6   class     1728 non-null   object
dtypes: object(7)
memory usage: 94.6+ KB


In [6]:
df.describe()

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,class
count,1728,1728,1728,1728,1728,1728,1728
unique,4,4,4,3,3,3,4
top,vhigh,vhigh,2,2,small,low,unacc
freq,432,432,432,576,576,576,1210


In [7]:
df.corr()

No correlations are reported because every column is non-numeric (`object` dtype)!
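For categorical columns, an association measure built on a contingency table can stand in for correlation. The following is a minimal sketch of Cramér's V (without bias correction) using only `pandas` and `numpy`; the helper name `cramers_v` and the toy series are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def cramers_v(a: pd.Series, b: pd.Series) -> float:
    """Cramér's V between two categorical series (0 = no association, 1 = perfect)."""
    table = pd.crosstab(a, b).to_numpy(dtype=float)
    n = table.sum()
    # Expected counts under independence: outer product of the margins over n.
    expected = table.sum(axis=1, keepdims=True) * table.sum(axis=0, keepdims=True) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k))) if k > 0 else 0.0


# Toy data: `b` is fully determined by `a`, while `c` is independent of `a`.
a = pd.Series(["x", "x", "y", "y"])
b = pd.Series(["p", "p", "q", "q"])
c = pd.Series(["p", "q", "p", "q"])
print(cramers_v(a, b), cramers_v(a, c))
```

On the real frame this could be called as `cramers_v(df['safety'], df['class'])` before the columns are label-encoded.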

In [8]:
df.groupby(by=['class']).agg({
    'buying': ['count', 'min', 'max'],
    'maint': ['count', 'min', 'max'],
    'doors': ['count', 'min', 'max'],
    'persons': ['count', 'min', 'max'],
    'lug_boot': ['count', 'min', 'max'],
    'safety': ['count', 'min', 'max'],
})

Unnamed: 0_level_0,buying,buying,buying,maint,maint,maint,doors,doors,doors,persons,persons,persons,lug_boot,lug_boot,lug_boot,safety,safety,safety
Unnamed: 0_level_1,count,min,max,count,min,max,count,min,max,count,min,max,count,min,max,count,min,max
class,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2
acc,384,high,vhigh,384,high,vhigh,384,2,5more,384,4,more,384,big,small,384,high,med
good,69,low,med,69,low,med,69,2,5more,69,4,more,69,big,small,69,high,med
unacc,1210,high,vhigh,1210,high,vhigh,1210,2,5more,1210,2,more,1210,big,small,1210,high,med
vgood,65,low,med,65,high,med,65,2,5more,65,4,more,65,big,med,65,high,high


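A nonsense grouping: the columns hold strings, so `min` and `max` compare them alphabetically (for example, `high` sorts before `low`) rather than by their real order.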
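The `min` and `max` columns above come from string comparison. A small sketch of the difference, assuming the price levels are ordered `low < med < high < vhigh` (an assumption about the intended ordering, not something stated in the notebook):

```python
import pandas as pd

# Assumed ordering of the price levels, from cheapest to most expensive.
levels = ["low", "med", "high", "vhigh"]
prices = pd.Series(["vhigh", "low", "med", "high"])

# Plain strings are compared alphabetically.
lexical_min, lexical_max = prices.min(), prices.max()

# An ordered categorical dtype makes min/max follow the declared order.
ordered = prices.astype(pd.CategoricalDtype(categories=levels, ordered=True))
ordered_min, ordered_max = ordered.min(), ordered.max()

print(lexical_min, lexical_max)
print(ordered_min, ordered_max)
```

Converting the columns this way before grouping would make the aggregated `min` and `max` meaningful.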

In [9]:
px.bar(data_frame=df, x='class', y='buying', color='safety',
       title='Is There Any Relationship Between the Safety of the Car and Its Price?')

At this point, I don't know much about plotting categorical data!
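One common option for two categorical columns is to count the rows in each pair of categories and draw grouped bars of those counts. The sketch below uses hypothetical helpers `count_by` and `plot_counts` on a toy frame; `px.bar` is the same plotly function used above, but the choice of chart is only a suggestion.

```python
import pandas as pd


def count_by(df: pd.DataFrame, x: str, color: str) -> pd.DataFrame:
    """Count rows for every observed (x, color) pair, in long format."""
    return df.groupby([x, color]).size().reset_index(name="count")


def plot_counts(df: pd.DataFrame, x: str, color: str):
    """Grouped bar chart of the counts (needs plotly installed)."""
    # Imported here so that count_by stays usable without plotly.
    import plotly.express as px
    return px.bar(count_by(df, x, color), x=x, y="count", color=color, barmode="group")


# Toy stand-in for the car evaluation frame.
toy = pd.DataFrame({
    "class": ["unacc", "unacc", "acc", "acc", "unacc"],
    "safety": ["low", "low", "med", "high", "med"],
})
counts = count_by(toy, "class", "safety")
print(counts)
```

With the real data, `plot_counts(df, 'class', 'safety')` would show how many cars of each safety level fall into each class.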

In [10]:
# Encode every column as integers. fit_transform refits the encoder on each call,
# so one LabelEncoder instance can be reused; afterwards it stays fitted on 'class'.
le = LabelEncoder()
for column in df.columns:
    df[column] = le.fit_transform(df[column])
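`LabelEncoder` assigns integers in alphabetical order of the labels, so the codes do not follow the natural order of levels such as `low < med < high < vhigh`. As a hedged alternative, scikit-learn's `OrdinalEncoder` accepts an explicit category order; the ordering below is an assumption about what the levels mean.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Assumed natural order of the price levels.
price_order = ["low", "med", "high", "vhigh"]
toy = pd.DataFrame({"buying": ["vhigh", "low", "med", "high"]})

# LabelEncoder sorts the labels alphabetically: high=0, low=1, med=2, vhigh=3.
alphabetical = LabelEncoder().fit_transform(toy["buying"])

# OrdinalEncoder follows the order given in `categories`: low=0 ... vhigh=3.
encoder = OrdinalEncoder(categories=[price_order])
ordered = encoder.fit_transform(toy[["buying"]])

# The fitted encoder can map the codes back to the original labels.
decoded = encoder.inverse_transform(ordered)

print(alphabetical)
print(ordered.ravel())
```

Tree-based models are largely insensitive to this difference, but the correlation heatmap and any linear model depend on the codes being in a meaningful order.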

Correlations after encoding the labels:

In [11]:
px.imshow(df.corr())

In [12]:
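X = df.drop(columns=['class'])
# Note: 'safety' is also still a column of X, so a model can read the target
# straight from the features; use df['class'] to predict the car's evaluation.
y = df['safety']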
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((1382, 6), (1382,), (346, 6), (346,))
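The label of this data set is the `class` column, yet the cell above assigns `safety` to `y` while `safety` remains in `X`. A minimal sketch of a split that targets `class` instead, shown on a toy frame with made-up values; `stratify` and `random_state` are additions that keep class proportions and make the split repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the encoded frame (values are made up).
toy = pd.DataFrame({
    "safety": [0, 1, 2, 0, 1, 2, 0, 1, 2, 0] * 2,
    "persons": [0, 0, 1, 1] * 5,
    "class": [0, 1] * 10,
})

target = "class"
X = toy.drop(columns=[target])  # features: everything except the target
y = toy[target]                 # target: the column that was dropped from X

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(X_train.shape, X_test.shape)
```

Keeping the target name in one variable makes it harder for `X` and `y` to drift apart.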

I want to use two different AutoML frameworks to find the best model for this dataset!

The first framework is `FLAML` by Microsoft:

In [13]:
with open('./exports/car_evaluation_flaml_model.pkl', 'rb') as f:
    automl = pickle.load(f)
automl


Trying to unpickle estimator SimpleImputer from version 1.1.2 when using version 0.24.2. This might lead to breaking code or invalid results. Use at your own risk.


Trying to unpickle estimator ColumnTransformer from version 1.1.2 when using version 0.24.2. This might lead to breaking code or invalid results. Use at your own risk.


Trying to unpickle estimator LabelEncoder from version 1.1.2 when using version 0.24.2. This might lead to breaking code or invalid results. Use at your own risk.


Trying to unpickle estimator DecisionTreeClassifier from version 1.1.2 when using version 0.24.2. This might lead to breaking code or invalid results. Use at your own risk.


Trying to unpickle estimator RandomForestClassifier from version 1.1.2 when using version 0.24.2. This might lead to breaking code or invalid results. Use at your own risk.



AutoML(append_log=False, auto_augment=True, custom_hp={}, early_stop=False,
       ensemble=False, estimator_list='auto', eval_method='auto',
       fit_kwargs_by_estimator={}, hpo_method='auto', keep_search_state=False,
       learner_selector='sample', log_file_name='', log_training_metric=False,
       log_type='better', max_iter=None, mem_thres=4294967296, metric='auto',
       metric_constraints=[], min_sample_size=10000, model_history=False,
       n_concurrent_trials=1, n_jobs=-1, n_splits=5, pred_time_limit=inf,
       retrain_full=True, sample=True, split_ratio=0.1, split_type='auto',
       starting_points='static', task='classification', ...)

In [15]:
automl = AutoML()
automl.fit(X_train, y_train, task='classification', time_budget=30)

[flaml.automl: 09-05 21:47:56] {2565} INFO - task = classification
[flaml.automl: 09-05 21:47:56] {2567} INFO - Data split method: stratified
[flaml.automl: 09-05 21:47:56] {2570} INFO - Evaluation method: cv
[flaml.automl: 09-05 21:47:56] {2689} INFO - Minimizing error metric: log_loss
[flaml.automl: 09-05 21:47:56] {2831} INFO - List of ML learners in AutoML Run: ['lgbm', 'rf', 'catboost', 'xgboost', 'extra_tree', 'xgb_limitdepth', 'lrl1']
[flaml.automl: 09-05 21:47:56] {3133} INFO - iteration 0, current learner lgbm
[flaml.automl: 09-05 21:47:57] {3266} INFO - Estimated sufficient time budget=12698s. Estimated necessary time budget=312s.
[flaml.automl: 09-05 21:47:57] {3313} INFO -  at 1.3s,	estimator lgbm's best error=0.9100,	best estimator lgbm's best error=0.9100
[flaml.automl: 09-05 21:47:57] {3133} INFO - iteration 1, current learner lgbm
[flaml.automl: 09-05 21:47:57] {3313} INFO -  at 1.4s,	estimator lgbm's best error=0.9100,	best estimator lgbm's best error=0.9100
[flaml.aut

In [16]:
automl.model.estimator

RandomForestClassifier(criterion='entropy', max_features=0.859364268853592,
                       max_leaf_nodes=49, n_estimators=7, n_jobs=-1)

In [17]:
y_pred = automl.predict(X_test)
y_pred

array([1, 1, 1, 1, 0, 2, 1, 2, 2, 0, 2, 2, 0, 2, 2, 0, 2, 0, 2, 1, 0, 0,
       0, 1, 1, 1, 0, 1, 2, 2, 1, 2, 1, 2, 2, 0, 2, 1, 0, 1, 2, 2, 2, 2,
       2, 1, 2, 1, 2, 0, 0, 0, 2, 2, 0, 0, 1, 2, 1, 0, 2, 1, 2, 2, 1, 1,
       0, 1, 2, 2, 2, 1, 0, 2, 2, 2, 1, 0, 0, 1, 1, 2, 1, 0, 0, 0, 1, 2,
       1, 0, 2, 0, 1, 0, 2, 2, 0, 2, 1, 0, 1, 2, 2, 2, 2, 2, 0, 0, 2, 1,
       0, 2, 0, 0, 1, 1, 2, 0, 1, 2, 0, 0, 0, 0, 2, 1, 2, 1, 1, 2, 0, 2,
       1, 1, 2, 1, 0, 2, 2, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 2, 0, 2, 2, 0,
       2, 1, 1, 1, 1, 1, 1, 2, 0, 2, 2, 1, 0, 2, 2, 1, 1, 0, 2, 0, 1, 2,
       2, 0, 0, 1, 1, 1, 1, 0, 2, 1, 1, 0, 0, 0, 0, 2, 0, 1, 2, 0, 0, 1,
       2, 1, 1, 0, 2, 2, 0, 0, 0, 0, 0, 0, 1, 0, 2, 0, 0, 1, 2, 0, 2, 1,
       2, 1, 2, 0, 1, 0, 1, 0, 1, 1, 2, 0, 0, 0, 1, 0, 2, 1, 1, 2, 2, 0,
       2, 2, 0, 0, 0, 0, 2, 0, 2, 0, 0, 2, 1, 2, 2, 1, 0, 1, 1, 0, 2, 2,
       1, 2, 2, 1, 2, 0, 1, 1, 0, 2, 2, 1, 1, 0, 1, 2, 0, 1, 2, 0, 2, 2,
       1, 2, 2, 2, 1, 2, 2, 0, 2, 0, 2, 1, 0, 2, 2,

In [18]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       122
           1       1.00      1.00      1.00       102
           2       1.00      1.00      1.00       122

    accuracy                           1.00       346
   macro avg       1.00      1.00      1.00       346
weighted avg       1.00      1.00      1.00       346
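A perfect score on every class is worth a sanity check, since it often means a feature duplicates the target. The helper below, `leaking_columns`, is a hypothetical name for a simple check that lists feature columns identical to the target; it is a sketch on toy data and was not run on the notebook's frames.

```python
import pandas as pd


def leaking_columns(X: pd.DataFrame, y: pd.Series) -> list:
    """Names of feature columns whose values equal the target row by row."""
    target = y.to_numpy()
    return [col for col in X.columns if (X[col].to_numpy() == target).all()]


# Toy features; the first target copies a feature, the second does not.
X_toy = pd.DataFrame({"safety": [0, 1, 2, 1], "doors": [3, 0, 1, 2]})
y_copy = X_toy["safety"].copy()
y_other = pd.Series([1, 1, 0, 2])

print(leaking_columns(X_toy, y_copy))
print(leaking_columns(X_toy, y_other))
```

Calling `leaking_columns(X_test, y_test)` in this notebook would show whether the perfect report comes from such a duplicate.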



In [19]:
with open('./exports/car_evaluation_flaml_model.pkl', 'wb') as f:
    pickle.dump(automl, f)
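The warnings printed when this file was loaded earlier come from unpickling with a different scikit-learn version than the one used for saving. One way to catch that early is to store the library version next to the model. The helpers `save_with_version` and `load_checked` below are hypothetical, and the payload layout is an assumption, not a FLAML or scikit-learn convention.

```python
import pickle
import tempfile
from pathlib import Path


def save_with_version(obj, path, library_version):
    """Pickle the object together with the version of the library that built it."""
    with open(path, "wb") as f:
        pickle.dump({"library_version": library_version, "model": obj}, f)


def load_checked(path, current_version):
    """Load the object and fail loudly if it was saved under another version."""
    with open(path, "rb") as f:
        payload = pickle.load(f)
    saved = payload["library_version"]
    if saved != current_version:
        raise RuntimeError(f"saved with version {saved}, running {current_version}")
    return payload["model"]


# Round trip with a placeholder object standing in for the fitted model.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.pkl"
    save_with_version({"weights": [1, 2, 3]}, path, "1.1.2")
    restored = load_checked(path, "1.1.2")
print(restored)
```

In this notebook the version argument could be `sklearn.__version__` at both save and load time.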

Now let's do the same with `Auto-Sklearn`:

In [34]:
with open('./exports/car_evaluation_autosklearn_model.pkl', 'rb') as f:
    automl = pickle.load(f)
automl

AutoSklearnClassifier(per_run_time_limit=6, time_left_for_this_task=60)

In [28]:
automl = AutoSklearnClassifier(time_left_for_this_task=60)
automl.fit(X_train, y_train)


The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.





AutoSklearnClassifier(per_run_time_limit=6, time_left_for_this_task=60)

In [35]:
automl.leaderboard()

Unnamed: 0_level_0,rank,ensemble_weight,type,cost,duration
model_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2,1,0.2,random_forest,0.0,4.281472
3,2,0.28,passive_aggressive,0.0,3.883725
4,3,0.12,random_forest,0.0,5.17173
5,4,0.16,random_forest,0.0,4.686456
12,5,0.24,gaussian_nb,0.0,2.047975
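The `ensemble_weight` column sums to 1 across the five models. The sketch below shows how such weights could blend per-model class probabilities, assuming the ensemble prediction is a weighted average of its members' probabilities; the probability values are made up for illustration.

```python
import numpy as np

# Ensemble weights copied from the leaderboard above.
weights = np.array([0.20, 0.28, 0.12, 0.16, 0.24])

# Hypothetical class probabilities from each of the five models for one sample.
probas = np.array([
    [0.9, 0.1, 0.0],
    [0.6, 0.3, 0.1],
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.5, 0.4, 0.1],
])

# Weighted average over models gives one probability vector per sample.
blended = weights @ probas
predicted_class = int(np.argmax(blended))
print(blended, predicted_class)
```

Because the weights sum to 1 and each row of `probas` sums to 1, the blended vector is again a valid probability distribution.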


In [36]:
automl.show_models()

{2: {'model_id': 2,
  'rank': 1,
  'cost': 0.0,
  'ensemble_weight': 0.2,
  'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice at 0x7fc7a8a61340>,
  'balancing': Balancing(random_state=1),
  'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice at 0x7fc7afa65ee0>,
  'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice at 0x7fc7b066edf0>,
  'sklearn_classifier': RandomForestClassifier(max_features=2, n_estimators=512, n_jobs=1,
                         random_state=1, warm_start=True)},
 3: {'model_id': 3,
  'rank': 2,
  'cost': 0.0,
  'ensemble_weight': 0.28,
  'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice at 0x7fc7a8b8d670>,
  'balancing': Balancing(random_state=1, strategy='weighting'),
  'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice at 0x7fc7a8b8ba60>,
  '

In [31]:
y_pred = automl.predict(X_test)
y_pred

array([1, 1, 1, 1, 0, 2, 1, 2, 2, 0, 2, 2, 0, 2, 2, 0, 2, 0, 2, 1, 0, 0,
       0, 1, 1, 1, 0, 1, 2, 2, 1, 2, 1, 2, 2, 0, 2, 1, 0, 1, 2, 2, 2, 2,
       2, 1, 2, 1, 2, 0, 0, 0, 2, 2, 0, 0, 1, 2, 1, 0, 2, 1, 2, 2, 1, 1,
       0, 1, 2, 2, 2, 1, 0, 2, 2, 2, 1, 0, 0, 1, 1, 2, 1, 0, 0, 0, 1, 2,
       1, 0, 2, 0, 1, 0, 2, 2, 0, 2, 1, 0, 1, 2, 2, 2, 2, 2, 0, 0, 2, 1,
       0, 2, 0, 0, 1, 1, 2, 0, 1, 2, 0, 0, 0, 0, 2, 1, 2, 1, 1, 2, 0, 2,
       1, 1, 2, 1, 0, 2, 2, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 2, 0, 2, 2, 0,
       2, 1, 1, 1, 1, 1, 1, 2, 0, 2, 2, 1, 0, 2, 2, 1, 1, 0, 2, 0, 1, 2,
       2, 0, 0, 1, 1, 1, 1, 0, 2, 1, 1, 0, 0, 0, 0, 2, 0, 1, 2, 0, 0, 1,
       2, 1, 1, 0, 2, 2, 0, 0, 0, 0, 0, 0, 1, 0, 2, 0, 0, 1, 2, 0, 2, 1,
       2, 1, 2, 0, 1, 0, 1, 0, 1, 1, 2, 0, 0, 0, 1, 0, 2, 1, 1, 2, 2, 0,
       2, 2, 0, 0, 0, 0, 2, 0, 2, 0, 0, 2, 1, 2, 2, 1, 0, 1, 1, 0, 2, 2,
       1, 2, 2, 1, 2, 0, 1, 1, 0, 2, 2, 1, 1, 0, 1, 2, 0, 1, 2, 0, 2, 2,
       1, 2, 2, 2, 1, 2, 2, 0, 2, 0, 2, 1, 0, 2, 2,

In [37]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       122
           1       1.00      1.00      1.00       102
           2       1.00      1.00      1.00       122

    accuracy                           1.00       346
   macro avg       1.00      1.00      1.00       346
weighted avg       1.00      1.00      1.00       346



In [38]:
with open('./exports/car_evaluation_autosklearn_model.pkl', 'wb') as f:
    pickle.dump(automl, f)