# Car Evaluation

1. Title: Car Evaluation Database

2. Sources:
   (a) Creator: Marko Bohanec
   (b) Donors: Marko Bohanec   (marko.bohanec@ijs.si)
               Blaz Zupan      (blaz.zupan@ijs.si)
   (c) Date: June, 1997

3. Past Usage:

   The hierarchical decision model, from which this dataset is derived, was first presented in M. Bohanec and V. Rajkovic: Knowledge acquisition and explanation for multi-attribute decision making. In 8th Intl Workshop on Expert Systems and their Applications, Avignon, France. pages 59-78, 1988.

   Within machine learning, this dataset was used to evaluate HINT (Hierarchy INduction Tool), which was shown to be able to completely reconstruct the original hierarchical model. This, together with a comparison with C4.5, is presented in B. Zupan, M. Bohanec, I. Bratko, J. Demsar: Machine learning by function decomposition. ICML-97, Nashville, TN. 1997 (to appear)

4. Relevant Information Paragraph:

   The Car Evaluation Database was derived from a simple hierarchical decision model originally developed to demonstrate DEX (M. Bohanec, V. Rajkovic: Expert system for decision making. Sistemica 1(1), pp. 145-157, 1990). The model evaluates cars according to the following concept structure:

   ```text
   CAR                      car acceptability
   . PRICE                  overall price
   . . buying               buying price
   . . maint                price of the maintenance
   . TECH                   technical characteristics
   . . COMFORT              comfort
   . . . doors              number of doors
   . . . persons            capacity in terms of persons to carry
   . . . lug_boot           the size of luggage boot
   . . safety               estimated safety of the car
   ```

   Input attributes are printed in lowercase. Besides the target concept (CAR), the model includes three intermediate concepts: PRICE, TECH, and COMFORT. In the original model, every concept is related to its lower-level descendants by a set of examples (for these example sets, see http://www-ai.ijs.si/BlazZupan/car.html).

   The Car Evaluation Database contains examples with the structural information removed, i.e., it directly relates CAR to the six input attributes: buying, maint, doors, persons, lug_boot, and safety.

   Because the underlying concept structure is known, this database may be particularly useful for testing constructive induction and structure discovery methods.

5. Number of Instances: 1728
   (instances completely cover the attribute space)

6. Number of Attributes: 6

7. Attribute Values:

   ```text
   buying       v-high, high, med, low
   maint        v-high, high, med, low
   doors        2, 3, 4, 5-more
   persons      2, 4, more
   lug_boot     small, med, big
   safety       low, med, high
   ```

8. Missing Attribute Values: none

9. Class Distribution (number of instances per class)

   ```text
   class      N          N[%]
   -----------------------------
   unacc     1210     (70.023 %) 
   acc        384     (22.222 %) 
   good        69     ( 3.993 %) 
   v-good      65     ( 3.762 %) 
   ```
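
The figures in items 5, 7, and 9 can be cross-checked with a few lines of standard-library Python. The sketch below is an illustration, not part of the original dataset description: the variable names (`attribute_values`, `class_counts`, and so on) are hypothetical, and the values are copied from the tables above.

```python
from itertools import product

# Attribute values copied from item 7 of the description above
attribute_values = {
    "buying": ["v-high", "high", "med", "low"],
    "maint": ["v-high", "high", "med", "low"],
    "doors": ["2", "3", "4", "5-more"],
    "persons": ["2", "4", "more"],
    "lug_boot": ["small", "med", "big"],
    "safety": ["low", "med", "high"],
}

# Enumerate every combination of attribute values (the full attribute space)
combinations = list(product(*attribute_values.values()))
n_combinations = len(combinations)

# Class counts copied from item 9
class_counts = {"unacc": 1210, "acc": 384, "good": 69, "v-good": 65}
n_instances = sum(class_counts.values())

# Percentage of instances per class, rounded to three decimals as in the table
class_percent = {
    label: round(100 * count / n_instances, 3)
    for label, count in class_counts.items()
}

print(n_combinations, n_instances)
print(class_percent)
```

If the description is internally consistent, both printed counts match the number of instances in item 5. Complete coverage of the attribute space with that many instances implies that each combination of attribute values occurs exactly once.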

In [1]:
# Import required dependencies
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.svm import SVR, SVC 
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, AdaBoostRegressor, RandomForestClassifier
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, confusion_matrix, classification_report, balanced_accuracy_score, roc_auc_score

# Build a Pipeline Function Instead

In [2]:
# Import data
file_path = "https://static.bc-edx.com/ai/ail-v-1-0/m13/lesson_3/datasets/car.csv"
df = pd.read_csv(file_path)
# Get the target variable (the "class" column)
y = df["class"]
# Get the features (everything except the "class" column)
X = df.drop(columns="class")
# Split data into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
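
The target extracted above is a column of strings, so it has to be mapped to integer codes before a numeric model can be fit on it. As a minimal sketch in plain Python (not the notebook's own code), the snippet below mimics sorted-order coding, which is how scikit-learn's `LabelEncoder` assigns codes. The label spellings are taken from the class table in the dataset description and are an assumption about what the CSV actually contains; `encode_labels` is a hypothetical helper.

```python
def encode_labels(labels):
    """Map each distinct label to an integer code, assigning codes in sorted order."""
    classes = sorted(set(labels))
    code_of = {label: code for code, label in enumerate(classes)}
    return classes, [code_of[label] for label in labels]

# Assumed label spellings, listed from worst to best acceptability
ordered_by_quality = ["unacc", "acc", "good", "v-good"]

classes, codes = encode_labels(ordered_by_quality)
print(classes)
print(codes)
```

Sorted-order codes follow the alphabet rather than the acceptability ranking, so they are best read as nominal identifiers. That is worth keeping in mind when interpreting error metrics from regressors fit on an encoded target.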

In [3]:
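def r2_adj(x, y, p):
    """Return the adjusted R-squared of predictions p against targets y for feature matrix x."""
    r2 = r2_score(y, p)
    n_cols = x.shape[1]
    # Penalize R-squared for the number of feature columns
    return 1 - (1 - r2) * (len(y) - 1) / (len(y) - n_cols - 1)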

def run_pipeline(model, X_train, X_test, y_train, y_test):
    # Since the target column is an object, we need to convert the data to numerical classes
    # Encode the y data
    # Create an instance of the label encoder
    le = LabelEncoder()

    # Fit and transform the y training and testing data using the label encoder
    y_train_encoded = le.fit_transform(y_train)
    y_test_encoded = le.transform(y_test)

    # Remember that all of the columns in the DataFrame are objects
    # Use a OneHotEncoder to convert the training data to numerical values
    ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False, dtype='int')
    X_train_encoded = pd.DataFrame(data=ohe.fit_transform(X_train), columns=ohe.get_feature_names_out())
    X_test_encoded = pd.DataFrame(data=ohe.transform(X_test), columns=ohe.get_feature_names_out())

    # Fit the model to the training data
    model.fit(X_train_encoded, y_train_encoded)
    preds = model.predict(X_test_encoded)
    
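    # Validate the model by checking the model metrics
    # Use scikit-learn's own estimator check rather than matching on the class name:
    # the substring "Regress" also matches the LogisticRegression classifier
    from sklearn.base import is_regressor
    if is_regressor(model):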
        mse = mean_squared_error(y_test_encoded, preds)
        r2 = r2_score(y_test_encoded, preds)
        adj_r2 = r2_adj(X_test_encoded, y_test_encoded, preds)
        return pd.DataFrame([{"Mean Squared Error:": mse, "R-Squared:": r2, "Adjusted R-squared:": adj_r2}])
    else:
        train_accuracy = accuracy_score(y_train_encoded, model.predict(X_train_encoded))
        test_accuracy = accuracy_score(y_test_encoded, preds)
        cm = confusion_matrix(y_test_encoded, preds)
        cr = classification_report(y_test_encoded, preds)
        bas = balanced_accuracy_score(y_test_encoded, preds)
        roc = "N/A"
        # Compute the one-vs-rest ROC AUC for any classifier that exposes class probabilities
        if hasattr(model, "predict_proba"):
            preds_proba = model.predict_proba(X_test_encoded)
            roc = roc_auc_score(y_test_encoded, preds_proba, multi_class='ovr')
        return pd.DataFrame([{"Training Accuracy:": train_accuracy, "Testing Accuracy:": test_accuracy, "Balanced Accuracy Score:": bas, "ROC AUC Score:": roc,
            "Confusion Matrix:\n": cm, "Classification Report:\n": cr}])


models = [
    # Regressors
    LinearRegression(),
    KNeighborsRegressor(n_neighbors=9),
    RandomForestRegressor(n_estimators=128, random_state=1),
    ExtraTreesRegressor(n_estimators=128, random_state=1),
    AdaBoostRegressor(n_estimators=128, random_state=1),
    SVR(C=1.0, epsilon=0.2),
    # Classifiers
    LogisticRegression(random_state=1),
    SVC(kernel='poly', probability=True),
    KNeighborsClassifier(n_neighbors=9),
    DecisionTreeClassifier(),
    RandomForestClassifier(n_estimators=128, random_state=1),
]
# Output the metrics to a Markdown file
with open("metrics.md", "w") as f:
    f.write("# Model Metrics\n")
    for m in models:
        metrics_df = run_pipeline(m, X_train, X_test, y_train, y_test)
        print(f"Model: {m}")
        f.write(f"## Model: {m}\n")
        f.write("| Metric | Value |\n")
        f.write("| :--- | ---: |\n")
        for metric, value in metrics_df.iloc[0].items():
            if metric in ("Confusion Matrix:\n", "Classification Report:\n"):
                f.write(f"```\n{metric} {value}\n```\n")
            else:
                f.write(f"|{metric}|{value}| \n")
        f.write("\n")

Model: LinearRegression()
Model: KNeighborsRegressor(n_neighbors=9)
Model: RandomForestRegressor(n_estimators=128, random_state=1)
Model: ExtraTreesRegressor(n_estimators=128, random_state=1)
Model: AdaBoostRegressor(n_estimators=128, random_state=1)
Model: SVR(epsilon=0.2)
Model: LogisticRegression(random_state=1)
Model: SVC(kernel='poly', probability=True)
Model: KNeighborsClassifier(n_neighbors=9)
Model: DecisionTreeClassifier()
Model: RandomForestClassifier(n_estimators=128, random_state=1)
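
The `r2_adj` helper above applies the standard adjusted R-squared correction, 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of rows and p the number of feature columns. The worked example below is a sketch under stated assumptions: the row count assumes `train_test_split` keeps its default 25% test share of the 1728 rows, the feature count assumes one-hot encoding yields one column per attribute value listed in the dataset description, and the R² of 0.5 is a made-up input, not a result from any model here. The name `adjusted_r2` is hypothetical.

```python
def adjusted_r2(r2, n_rows, n_features):
    """Adjusted R-squared: penalizes R-squared for the number of features used."""
    return 1 - (1 - r2) * (n_rows - 1) / (n_rows - n_features - 1)

# Assumed sizes (see the lead-in): 25% of 1728 rows, one column per attribute value
n_rows = 1728 // 4
n_features = 4 + 4 + 4 + 3 + 3 + 3

# Hypothetical R-squared value, chosen only to illustrate the arithmetic
example = adjusted_r2(0.5, n_rows, n_features)
print(n_rows, n_features, round(example, 4))
```

Because the penalty factor (n - 1)/(n - p - 1) is greater than 1 whenever p is positive, the adjusted value is lower than the raw R² whenever R² is below 1, and the gap widens as one-hot encoding adds columns.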
