# Boosted Trees Exercise

This exercise uses the UCI Car Evaluation dataset: http://archive.ics.uci.edu/dataset/19/car+evaluation

Preliminaries

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

In [5]:
file_path_csv = "./dataset/car.data.csv" 
columns = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']
car_data = pd.read_csv(file_path_csv, names=columns)
car_data.head()

  buying  maint doors persons lug_boot safety  class
0  vhigh  vhigh     2       2    small    low  unacc
1  vhigh  vhigh     2       2    small    med  unacc
2  vhigh  vhigh     2       2    small   high  unacc
3  vhigh  vhigh     2       2      med    low  unacc
4  vhigh  vhigh     2       2      med    med  unacc


1. Apply the XGBoost classifier to the dataset using the following parameter:
    objective = 'multi:softmax'
    Print the classification report.
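
For context, XGBoost's `multi:softmax` objective yields the single most likely class per row, while the related `multi:softprob` objective yields a probability for every class. A minimal NumPy sketch of that relationship, using made-up scores for three rows and four classes; the `softmax` helper and all values are illustrative assumptions, not XGBoost internals:

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; subtracting the row max keeps exp() numerically stable."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Made-up raw scores: 3 rows, 4 classes (not taken from the car dataset).
raw_scores = np.array([
    [2.0, 0.1, 0.3, -1.0],
    [0.2, 0.1, 3.5, 0.0],
    [-0.5, 1.5, 0.4, 1.4],
])

probabilities = softmax(raw_scores)              # one probability per class per row
predicted_class = probabilities.argmax(axis=1)   # one class index per row

print(probabilities.round(3))
print(predicted_class)
```

Each row of `probabilities` sums to 1, and `predicted_class` picks the column with the largest value in each row.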

In [6]:
# One-hot encode the six categorical feature columns; 'class' is left as-is
car_data_encoded = pd.get_dummies(car_data, columns=['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety'])

# Features and target variable
X = car_data_encoded.drop('class', axis=1)
y = car_data_encoded['class']

# Label encoding for the target variable
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2, random_state=42)
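
The classification reports list classes by their encoded integers, so it helps to know which label each integer stands for. `LabelEncoder` assigns codes in sorted order of the class names. A small sketch, assuming the four class names documented for the Car Evaluation dataset (`unacc`, `acc`, `good`, `vgood`); the variable names are illustrative:

```python
from sklearn.preprocessing import LabelEncoder

# Assumed class names for the Car Evaluation target column.
class_names = ["unacc", "acc", "good", "vgood"]

encoder = LabelEncoder().fit(class_names)

# classes_ holds the sorted labels; the position of each label is its integer code.
code_to_label = {code: str(name) for code, name in enumerate(encoder.classes_)}
print(code_to_label)

# inverse_transform maps integer codes back to the original labels.
decoded = encoder.inverse_transform([2, 0, 3])
print(list(decoded))
```

In the notebook, `label_encoder.classes_` holds the same mapping, and passing it as `target_names` to `classification_report` should print class names instead of integers.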

In [7]:
xgb_classifier = XGBClassifier(objective='multi:softmax')
xgb_classifier.fit(X_train, y_train)
y_pred_xgb = xgb_classifier.predict(X_test)

print("Classification Report for XGBoost:")
print(classification_report(y_test, y_pred_xgb))

Classification Report for XGBoost:
              precision    recall  f1-score   support

           0       1.00      0.92      0.96        83
           1       0.61      1.00      0.76        11
           2       1.00      1.00      1.00       235
           3       1.00      1.00      1.00        17

    accuracy                           0.98       346
   macro avg       0.90      0.98      0.93       346
weighted avg       0.99      0.98      0.98       346



2. Apply the CatBoost classifier to the dataset using the following parameters:
    iterations = 150
    depth = 5
    learning_rate = 0.3
    loss_function = 'MultiClass'
    verbose = 15 (for the model fit)

In [8]:
cat_classifier = CatBoostClassifier(iterations=150, depth=5, learning_rate=0.3, loss_function='MultiClass', verbose=15)
cat_classifier.fit(X_train, y_train)
y_pred_cat = cat_classifier.predict(X_test)

print("Classification Report for CatBoost:")
print(classification_report(y_test, y_pred_cat))

0:	learn: 0.9474975	total: 143ms	remaining: 21.3s
15:	learn: 0.2026547	total: 181ms	remaining: 1.52s
30:	learn: 0.1136741	total: 221ms	remaining: 848ms
45:	learn: 0.0760837	total: 264ms	remaining: 597ms
60:	learn: 0.0548474	total: 314ms	remaining: 458ms
75:	learn: 0.0425822	total: 348ms	remaining: 339ms
90:	learn: 0.0346446	total: 387ms	remaining: 251ms
105:	learn: 0.0287692	total: 436ms	remaining: 181ms
120:	learn: 0.0247338	total: 473ms	remaining: 113ms
135:	learn: 0.0219709	total: 518ms	remaining: 53.3ms
149:	learn: 0.0195640	total: 592ms	remaining: 0us
Classification Report for CatBoost:
              precision    recall  f1-score   support

           0       1.00      0.92      0.96        83
           1       0.59      0.91      0.71        11
           2       1.00      1.00      1.00       235
           3       0.94      1.00      0.97        17

    accuracy                           0.98       346
   macro avg       0.88      0.96      0.91       346
weighted avg       0.

3. Compare the overall accuracy and the per-class accuracy of the two classifiers.
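
One reading of "individual accuracy" is per-class recall: of the test rows that truly belong to a class, the fraction predicted as that class. A minimal sketch on made-up labels (not the car data); the `per_class_accuracy` helper is hypothetical, and it flattens its inputs so that 1-D and column-shaped predictions are handled alike:

```python
import numpy as np

def per_class_accuracy(y_true, y_pred):
    """Return {class: fraction of that class's rows predicted correctly}."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    result = {}
    for label in np.unique(y_true):
        mask = y_true == label                      # rows whose true class is `label`
        result[int(label)] = float((y_pred[mask] == label).mean())
    return result

# Made-up labels for two hypothetical classifiers A and B.
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred_a = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred_b = [[0], [0], [1], [1], [1], [0], [2], [2], [2], [2]]  # column-shaped

acc_a = per_class_accuracy(y_true, y_pred_a)
acc_b = per_class_accuracy(y_true, y_pred_b)

for label in acc_a:
    print(f"class {label}: A={acc_a[label]:.2f}  B={acc_b[label]:.2f}")
```

Applied to the notebook, the same helper could be called with `y_test` and each of `y_pred_xgb` and `y_pred_cat`; the values should agree with the recall column of the classification reports.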

In [9]:
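# Overall accuracy is the fraction of test rows predicted correctly.
# CatBoost's predict() returns an (n, 1) column for MultiClass, so flatten it
# before comparing; otherwise the comparison broadcasts to an (n, n) matrix.
accuracy_xgb = (y_test == y_pred_xgb).mean()
accuracy_cat = (y_test == y_pred_cat.ravel()).mean()

print(f"Overall Accuracy - XGBoost: {accuracy_xgb}")
print(f"Overall Accuracy - CatBoost: {accuracy_cat}")

Overall Accuracy - XGBoost: 0.9797687861271677
Overall Accuracy - CatBoost: 0.518109525877911

The CatBoost figure recorded above comes from comparing y_test with the unflattened (346, 1) prediction array, so it is the mean of a 346 x 346 broadcast comparison rather than an accuracy. The CatBoost classification report in step 2 gives an accuracy of 0.98, on par with XGBoost.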
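
A gap like the one between the 0.518 printed here and the 0.98 accuracy in the CatBoost classification report is what a shape mismatch produces: comparing a 1-D label array with an `(n, 1)` column broadcasts to an `(n, n)` matrix, and its mean is not an accuracy. The sketch below reproduces the effect with NumPy on made-up labels. It assumes the CatBoost predictions arrive as such a column, which is consistent with the numbers in this notebook:

```python
import numpy as np

# Made-up labels; pred_column mimics predictions returned as an (n, 1) column.
y_true = np.array([0, 1, 2, 2, 2, 3])
pred_column = np.array([[0], [1], [2], [2], [2], [0]])

# (6,) compared with (6, 1) broadcasts to (6, 6): every true label is
# compared against every prediction, not just the one in the same row.
broadcast = (y_true == pred_column)
misleading = broadcast.mean()

# Flattening the column restores the intended row-by-row comparison.
row_wise = (y_true == pred_column.ravel()).mean()

print(broadcast.shape)
print(f"misleading: {misleading:.4f}")
print(f"row-wise:   {row_wise:.4f}")
```

`sklearn.metrics.accuracy_score(y_true, pred_column.ravel())` gives the same row-wise value and avoids hand-written comparisons.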
