**Loading and Preprocessing the Dataset** <br>
The following cells import the required libraries and load the dataset (DT-BrainCancer.csv) into a pandas DataFrame. The first column is dropped because it is redundant. The categorical columns sex, diagnosis, and loc are then one-hot encoded, which replaces each of them with binary columns indicating the presence of each category.

In [90]:
import pandas as pd
import numpy as np

In [91]:
# Load the dataset
data = pd.read_csv('DT-BrainCancer.csv')
# data

In [92]:
# dropping the unnamed column
data = data.drop(data.columns[0], axis=1)
# data

In [93]:
data = pd.get_dummies(data, columns=['sex', 'diagnosis', 'loc'])
# data
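
To make the encoding step concrete, here is a minimal sketch on a hypothetical three-row frame (the values and the `toy` name are illustrative, not taken from DT-BrainCancer.csv). Passing `dtype=int` is an assumption added here so the indicator columns print as 0/1 regardless of the pandas version.

```python
import pandas as pd

# Hypothetical toy rows; the real dataset has more columns and categories
toy = pd.DataFrame({
    "sex": ["Female", "Male", "Female"],
    "loc": ["Infratentorial", "Supratentorial", "Supratentorial"],
    "ki": [90, 80, 70],
})

# Each listed column is replaced by one indicator column per category
encoded = pd.get_dummies(toy, columns=["sex", "loc"], dtype=int)
print(encoded.columns.tolist())
```

Columns that are not listed (here `ki`) are kept unchanged and appear first in the result. Passing `drop_first=True` would remove one redundant indicator per feature; the notebook keeps all of them.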

**Splitting the Dataset into Training, Validation, and Test Sets** <br>
The dataset is split into training, validation, and test sets. First, 30% of the rows are set aside as a temporary set (temp_data), leaving 70% for training. The temporary set is then split in half to give the validation and test sets, roughly 15% each. For every subset, the feature matrix X_* is obtained by dropping the target column (status), and the target vector y_* is the status column itself. The shapes are printed to confirm the split: 61 training, 13 validation, and 14 test rows, each with 10 features.

In [94]:
from sklearn.model_selection import train_test_split

train_data, temp_data = train_test_split(data, test_size=0.3, random_state=42)
val_data, test_data = train_test_split(temp_data, test_size=0.5, random_state=42)

X_train, y_train = train_data.drop('status', axis=1), train_data['status']
X_val, y_val = val_data.drop('status', axis=1), val_data['status']
X_test, y_test = test_data.drop('status', axis=1), test_data['status']

print(X_train.shape)
print(y_train.shape)
print(X_val.shape)
print(y_val.shape)
print(X_test.shape)
print(y_test.shape)

(61, 10)
(61,)
(13, 10)
(13,)
(14, 10)
(14,)
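
The split sizes can be reproduced without the CSV. The sketch below assumes only that the dataset has 88 rows (61 + 13 + 14, as printed above) and uses a stand-in index array in place of the DataFrame. scikit-learn rounds the test share up, which is why the second split yields 13 and 14 rows rather than two equal halves.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(88)  # stand-in for the 88 rows of the DataFrame

train_rows, temp_rows = train_test_split(rows, test_size=0.3, random_state=42)
val_rows, test_rows = train_test_split(temp_rows, test_size=0.5, random_state=42)

print(len(train_rows), len(val_rows), len(test_rows))
```

With so few rows, passing the status column through the `stratify` argument in both calls would keep the class ratio similar across the three sets; the notebook does not do this.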


**Defining the calculate_metrics Function** <br>
This function computes accuracy, precision, recall, and F1 score from the true and predicted labels. Inputs that are DataFrame objects are converted to NumPy arrays, and arrays with more than one dimension are flattened. The function builds the confusion matrix (rows are actual classes, columns are predicted classes), derives per-class precision, recall, and F1 from it, and reports the macro average of each, that is, the unweighted mean across classes. A metric whose denominator is zero is set to 0. All results are printed rather than returned. Class labels are taken from y_true only, so a predicted label that never occurs in y_true raises a KeyError.

In [95]:
def calculate_metrics(y_true, y_pred):
    # Check if y_true is a pandas DataFrame and convert it to a NumPy array for compatibility
    if isinstance(y_true, pd.DataFrame):
        y_true = y_true.to_numpy()  # Convert y_true to a NumPy array

    # Check if y_pred is a pandas DataFrame and convert it to a NumPy array for compatibility
    if isinstance(y_pred, pd.DataFrame):
        y_pred = y_pred.to_numpy()  # Convert y_pred to a NumPy array

    # Check if y_true is already flattened (1D)
    if y_true.ndim > 1:
        y_true = y_true.flatten()  # Flatten if it has more than 1 dimension

    # Check if y_pred is already flattened (1D)
    if y_pred.ndim > 1:
        y_pred = y_pred.flatten()  # Flatten if it has more than 1 dimension

    # Find unique class names in y_true and determine the number of unique classes
    class_names = np.unique(y_true)  # Get unique class names from y_true
    unique_classes = class_names.size  # Count the number of unique classes

    # Initialize a confusion matrix with zeros, sized based on the number of unique classes
    confusion_matrix = np.zeros((unique_classes, unique_classes), dtype=int)

    # Map class names to indices for easy lookup
    class_name_to_index = {class_name: idx for idx, class_name in enumerate(class_names)}

    # Count occurrences of actual vs predicted labels
    for actual, predicted in zip(y_true, y_pred):
        # Ensure actual and predicted are scalar values, not arrays
        actual = actual.item() if isinstance(actual, np.ndarray) else actual
        predicted = predicted.item() if isinstance(predicted, np.ndarray) else predicted
        
        # Find the index for the actual and predicted class
        actual_index = class_name_to_index[actual]
        predicted_index = class_name_to_index[predicted]

        # Increment the appropriate cell in the confusion matrix
        confusion_matrix[actual_index, predicted_index] += 1

    # Print the confusion matrix row by row
    print("\nConfusion matrix:")
    for row in confusion_matrix:
        print(" ".join(map(str, row)))  # Print each row of the confusion matrix

    # --- --- --- --- --- ---
    
    # Accuracy Calculation
    
    # Sum of the diagonal elements (correct predictions)
    correct_predictions = np.trace(confusion_matrix)  # np.trace() gives the sum of diagonal elements

    # Total number of predictions (sum of all elements in the matrix)
    total_predictions = np.sum(confusion_matrix)

    # Calculate accuracy
    accuracy = correct_predictions / total_predictions
    print(f"\nAccuracy: {accuracy}")

    # --- --- --- --- --- ---
        
    # Precision Calculation

    def calculate_precision(confusion_matrix, class_names):
        precision = {}
        
        # Iterate over each class to calculate its precision
        for i, class_name in enumerate(class_names):
            # True Positive (TP) is the value in the diagonal for that class
            true_positive = confusion_matrix[i, i]
            
            # False Positive (FP) is the sum of the column (excluding the diagonal)
            false_positive = np.sum(confusion_matrix[:, i]) - true_positive
            
            # Precision for the current class
            precision[class_name] = true_positive / (true_positive + false_positive) if (true_positive + false_positive) != 0 else 0        
        
        return precision
    
    precision = calculate_precision(confusion_matrix, class_names)
    print("\nPrecision for each class:")
    for class_name, precision_value in precision.items():
        print(f"{class_name}: {precision_value}")

    total_precision = sum(precision.values())
    print(f"\nMacro precision: {total_precision / unique_classes}")

    # --- --- --- --- --- ---
    
    # Recall Calculation
    
    def calculate_recall(confusion_matrix, class_names):
        recall = {}
        
        # Iterate over each class to calculate its recall
        for i, class_name in enumerate(class_names):
            # True Positive (TP) is the value in the diagonal for that class
            true_positive = confusion_matrix[i, i]
            
            # False Negative (FN) is the sum of the row (excluding the diagonal)
            false_negative = np.sum(confusion_matrix[i, :]) - true_positive
            
            # Recall for the current class
            recall[class_name] = true_positive / (true_positive + false_negative) if (true_positive + false_negative) != 0 else 0        
        
        return recall
    
    recall = calculate_recall(confusion_matrix, class_names)
    print("\nRecall for each class:")
    for class_name, recall_value in recall.items():
        print(f"{class_name}: {recall_value:.4f}")

    total_recall = sum(recall.values())
    print(f"\nMacro recall: {total_recall / unique_classes}")

    # --- --- --- --- --- ---

    # F1 Score Calculation
    
    def calculate_f1_score(precision, recall):
        f1_scores = {}
    
        # Calculate F1 score for each class
        for class_name in precision.keys():
            p = precision[class_name]
            r = recall[class_name]
            
            # Calculate F1 score for the class, handling cases where p + r = 0
            f1_scores[class_name] = (2 * p * r) / (p + r) if (p + r) != 0 else 0
        
        return f1_scores
    
    f1_scores = calculate_f1_score(precision, recall)
    print("\nF1 Score for each class:")
    for class_name, f1_value in f1_scores.items():
        print(f"{class_name}: {f1_value}")
        
    # Macro F1 Score Calculation
    total_f1 = sum(f1_scores.values())  # Sum of F1 scores for each class
    macro_f1 = total_f1 / len(f1_scores)  # Average F1 score across all classes
    
    print(f"\nMacro F1 score: {macro_f1}")
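
The per-class loops above can also be written with NumPy array operations. The following is a minimal sketch, not the notebook's function: the helper name `macro_metrics_from_confusion` is hypothetical, it returns values instead of printing them, and it takes a ready-made confusion matrix. The example matrix is the one the notebook reports for the KNN training set.

```python
import numpy as np

def macro_metrics_from_confusion(cm):
    """Hypothetical helper: rows are actual classes, columns are predicted classes."""
    cm = np.asarray(cm, dtype=float)
    true_positive = np.diag(cm)
    predicted_totals = cm.sum(axis=0)  # TP + FP per class
    actual_totals = cm.sum(axis=1)     # TP + FN per class

    # Divide only where the denominator is non-zero; other entries stay 0
    precision = np.divide(true_positive, predicted_totals,
                          out=np.zeros_like(true_positive), where=predicted_totals != 0)
    recall = np.divide(true_positive, actual_totals,
                       out=np.zeros_like(true_positive), where=actual_totals != 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom,
                   out=np.zeros_like(denom), where=denom != 0)

    return {
        "accuracy": np.trace(cm) / cm.sum(),
        "macro_precision": precision.mean(),
        "macro_recall": recall.mean(),
        "macro_f1": f1.mean(),
    }

# Confusion matrix in the same layout the function prints (2 classes)
example = [[34, 1], [16, 10]]
print(macro_metrics_from_confusion(example))
```

Column sums give TP + FP and row sums give TP + FN, so precision and recall are the diagonal divided by the column and row totals respectively.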

**Training a K-Nearest Neighbors (KNN) Classifier** <br>
A K-Nearest Neighbors classifier is initialized with 6 neighbors. The model is trained using the training data (X_train and y_train). Predictions are made for the training, validation, and test sets (X_train, X_val, and X_test). The calculate_metrics function is called to print the performance metrics for each dataset (training, validation, and test).

In [117]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X_train, y_train)

y_train_pred_knn = knn.predict(X_train)
y_val_pred_knn = knn.predict(X_val)
y_test_pred_knn = knn.predict(X_test)

print("------------------------------")
print("Training Set Metrics:")
calculate_metrics(y_train, y_train_pred_knn)

print("------------------------------")
print("Validation Set Metrics:")
calculate_metrics(y_val, y_val_pred_knn)

print("------------------------------")
print("Test Set Metrics:")
calculate_metrics(y_test, y_test_pred_knn)

------------------------------
Training Set Metrics:

Confusion matrix:
34 1
16 10

Accuracy: 0.7213114754098361

Precision for each class:
0: 0.68
1: 0.9090909090909091

Macro precision: 0.7945454545454546

Recall for each class:
0: 0.9714
1: 0.3846

Macro recall: 0.6780219780219781

F1 Score for each class:
0: 0.8
1: 0.5405405405405405

Macro F1 score: 0.6702702702702703
------------------------------
Validation Set Metrics:

Confusion matrix:
7 1
2 3

Accuracy: 0.7692307692307693

Precision for each class:
0: 0.7777777777777778
1: 0.75

Macro precision: 0.7638888888888888

Recall for each class:
0: 0.8750
1: 0.6000

Macro recall: 0.7375

F1 Score for each class:
0: 0.823529411764706
1: 0.6666666666666665

Macro F1 score: 0.7450980392156863
------------------------------
Test Set Metrics:

Confusion matrix:
10 0
2 2

Accuracy: 0.8571428571428571

Precision for each class:
0: 0.8333333333333334
1: 1.0

Macro precision: 0.9166666666666667

Recall for each class:
0: 1.0000
1: 0.5000

Ma
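
The notebook fixes `n_neighbors=6` without showing how it was chosen. One common way to pick k is to compare validation accuracy across candidates, sketched here on synthetic data from `make_classification` (an assumption, so the snippet runs without the CSV). The candidate list and variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; not the brain cancer dataset
X, y = make_classification(n_samples=200, n_features=10, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=42)

candidate_ks = [1, 3, 5, 7, 9]  # odd values avoid tied votes with two classes
val_scores = {}
for k in candidate_ks:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    val_scores[k] = model.score(X_va, y_va)  # mean accuracy on the validation split

best_k = max(val_scores, key=val_scores.get)
print(best_k, val_scores[best_k])
```

The test set stays untouched during this search, so it still gives an unbiased estimate once k is fixed.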

**Training a Naive Bayes Classifier** <br>
A Multinomial Naive Bayes classifier is trained using the training data. The model predicts the target variable for the training, validation, and test datasets. The performance metrics are calculated and displayed for each dataset.

In [97]:
from sklearn.naive_bayes import MultinomialNB

nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)

y_train_pred_nb = nb_classifier.predict(X_train)
y_val_pred_nb = nb_classifier.predict(X_val)
y_test_pred_nb = nb_classifier.predict(X_test)

print("------------------------------")
print("Training Set Metrics:")
calculate_metrics(y_train, y_train_pred_nb)

print("------------------------------")
print("Validation Set Metrics:")
calculate_metrics(y_val, y_val_pred_nb)

print("------------------------------")
print("Test Set Metrics:")
calculate_metrics(y_test, y_test_pred_nb)

------------------------------
Training Set Metrics:

Confusion matrix:
27 8
11 15

Accuracy: 0.6885245901639344

Precision for each class:
0: 0.7105263157894737
1: 0.6521739130434783

Macro precision: 0.681350114416476

Recall for each class:
0: 0.7714
1: 0.5769

Macro recall: 0.6741758241758242

F1 Score for each class:
0: 0.7397260273972601
1: 0.6122448979591837

Macro F1 score: 0.6759854626782219
------------------------------
Validation Set Metrics:

Confusion matrix:
8 0
3 2

Accuracy: 0.7692307692307693

Precision for each class:
0: 0.7272727272727273
1: 1.0

Macro precision: 0.8636363636363636

Recall for each class:
0: 1.0000
1: 0.4000

Macro recall: 0.7

F1 Score for each class:
0: 0.8421052631578948
1: 0.5714285714285715

Macro F1 score: 0.7067669172932332
------------------------------
Test Set Metrics:

Confusion matrix:
9 1
2 2

Accuracy: 0.7857142857142857

Precision for each class:
0: 0.8181818181818182
1: 0.6666666666666666

Macro precision: 0.7424242424242424

Recall 
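
`MultinomialNB` is designed for count-like features and rejects negative values. The cell above ran, so the training features are non-negative, but the constraint matters if the features are later standardized. The toy arrays below are made up for illustration, and `GaussianNB` is shown only as one alternative for continuous features, not as what the notebook uses.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB

X_demo = np.array([[1.0, 0.0], [2.0, 1.0], [-0.5, 1.0], [3.0, 0.0]])  # one negative entry
y_demo = np.array([0, 1, 0, 1])

try:
    MultinomialNB().fit(X_demo, y_demo)
    multinomial_rejected = False
except ValueError as err:
    multinomial_rejected = True
    print("MultinomialNB:", err)

# GaussianNB models each feature as a per-class normal distribution, so negatives are fine
gaussian_pred = GaussianNB().fit(X_demo, y_demo).predict(X_demo)
print("GaussianNB predictions:", gaussian_pred)
```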

**Training a Support Vector Machine (SVM) Classifier** <br>
A Support Vector Machine classifier with a linear kernel is initialized and trained on the training data. Predictions are made for the training, validation, and test sets. Performance metrics are printed for each dataset.

In [98]:
from sklearn.svm import SVC

svm_classifier = SVC(kernel='linear', random_state=42)  # 'linear', 'rbf'
svm_classifier.fit(X_train, y_train)

y_train_pred_svm = svm_classifier.predict(X_train)
y_val_pred_svm = svm_classifier.predict(X_val)
y_test_pred_svm = svm_classifier.predict(X_test)

print("------------------------------")
print("Training Set Metrics:")
calculate_metrics(y_train, y_train_pred_svm)

print("------------------------------")
print("Validation Set Metrics:")
calculate_metrics(y_val, y_val_pred_svm)

print("------------------------------")
print("Test Set Metrics:")
calculate_metrics(y_test, y_test_pred_svm)

------------------------------
Training Set Metrics:

Confusion matrix:
31 4
14 12

Accuracy: 0.7049180327868853

Precision for each class:
0: 0.6888888888888889
1: 0.75

Macro precision: 0.7194444444444444

Recall for each class:
0: 0.8857
1: 0.4615

Macro recall: 0.6736263736263737

F1 Score for each class:
0: 0.775
1: 0.5714285714285714

Macro F1 score: 0.6732142857142858
------------------------------
Validation Set Metrics:

Confusion matrix:
8 0
2 3

Accuracy: 0.8461538461538461

Precision for each class:
0: 0.8
1: 1.0

Macro precision: 0.9

Recall for each class:
0: 1.0000
1: 0.6000

Macro recall: 0.8

F1 Score for each class:
0: 0.888888888888889
1: 0.7499999999999999

Macro F1 score: 0.8194444444444444
------------------------------
Test Set Metrics:

Confusion matrix:
9 1
2 2

Accuracy: 0.7857142857142857

Precision for each class:
0: 0.8181818181818182
1: 0.6666666666666666

Macro precision: 0.7424242424242424

Recall for each class:
0: 0.9000
1: 0.5000

Macro recall: 0.7

F
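
The inline comment in the cell lists 'linear' and 'rbf' as kernel options. A minimal sketch of comparing the two on a validation split follows. It uses synthetic data and a `StandardScaler` step; both are assumptions added here, since the notebook trains on unscaled features.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; not the brain cancer dataset
X, y = make_classification(n_samples=200, n_features=10, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=42)

kernel_scores = {}
for kernel in ("linear", "rbf"):
    # The scaler is fitted on the training split only, inside the pipeline
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    model.fit(X_tr, y_tr)
    kernel_scores[kernel] = model.score(X_va, y_va)

print(kernel_scores)
```

Wrapping the scaler and the classifier in one pipeline keeps validation and test rows out of the scaling statistics.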

**Training a ZeroR Classifier** <br>
The ZeroR classifier, a baseline model, predicts the most frequent class for all instances. It is trained on the training set, and predictions are made for all datasets (training, validation, and test). The model's performance is evaluated by computing the metrics for each dataset.

In [99]:
from sklearn.dummy import DummyClassifier

zero_r_classifier = DummyClassifier(strategy="most_frequent")
zero_r_classifier.fit(X_train, y_train)

y_train_pred_zero_r = zero_r_classifier.predict(X_train)
y_val_pred_zero_r = zero_r_classifier.predict(X_val)
y_test_pred_zero_r = zero_r_classifier.predict(X_test)

print("------------------------------")
print("ZeroR Classifier - Training Set Metrics:")
calculate_metrics(y_train, y_train_pred_zero_r)

print("------------------------------")
print("ZeroR Classifier - Validation Set Metrics:")
calculate_metrics(y_val, y_val_pred_zero_r)

print("------------------------------")
print("ZeroR Classifier - Test Set Metrics:")
calculate_metrics(y_test, y_test_pred_zero_r)

------------------------------
ZeroR Classifier - Training Set Metrics:

Confusion matrix:
35 0
26 0

Accuracy: 0.5737704918032787

Precision for each class:
0: 0.5737704918032787
1: 0

Macro precision: 0.28688524590163933

Recall for each class:
0: 1.0000
1: 0.0000

Macro recall: 0.5

F1 Score for each class:
0: 0.7291666666666666
1: 0

Macro F1 score: 0.3645833333333333
------------------------------
ZeroR Classifier - Validation Set Metrics:

Confusion matrix:
8 0
5 0

Accuracy: 0.6153846153846154

Precision for each class:
0: 0.6153846153846154
1: 0

Macro precision: 0.3076923076923077

Recall for each class:
0: 1.0000
1: 0.0000

Macro recall: 0.5

F1 Score for each class:
0: 0.761904761904762
1: 0

Macro F1 score: 0.380952380952381
------------------------------
ZeroR Classifier - Test Set Metrics:

Confusion matrix:
10 0
4 0

Accuracy: 0.7142857142857143

Precision for each class:
0: 0.7142857142857143
1: 0

Macro precision: 0.35714285714285715

Recall for each class:
0: 1.0000
1
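
ZeroR needs no features at all, which a few lines of plain Python make explicit. The sketch below rebuilds the training labels from the class counts in the training confusion matrix above (35 rows of class 0, 26 of class 1); the variable names are illustrative.

```python
from collections import Counter

# Training labels rebuilt from the class counts reported above: 35 zeros, 26 ones
y_train_counts = [0] * 35 + [1] * 26

majority_class, majority_count = Counter(y_train_counts).most_common(1)[0]
zero_r_predictions = [majority_class] * len(y_train_counts)

accuracy = sum(p == t for p, t in zip(zero_r_predictions, y_train_counts)) / len(y_train_counts)
print(majority_class, accuracy)
```

The accuracy equals the share of the majority class, 35/61, which is the floor any useful classifier should beat on this split.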

**Training a OneR Classifier** <br>
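The OneR classifier is a simple rule-based model that is meant to classify using the single most predictive feature. The custom implementation below does not quite do that: for each feature it stores the feature value whose training rows have the purest class distribution, and predict then returns the most common of these stored values for every row. The prediction is therefore the same constant for all inputs, which is why every confusion matrix in the output has a single non-zero column. Performance metrics are calculated for the training, validation, and test datasets.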

In [100]:
from sklearn.base import BaseEstimator

class OneRClassifier(BaseEstimator):
    def __init__(self):
        self.rules = {}

    def fit(self, X, y):
        X = X.to_numpy() if isinstance(X, pd.DataFrame) else X
        self.rules = {}
        for feature_index in range(X.shape[1]):
            feature_values = X[:, feature_index]
            rule_accuracy = {}
            for value in np.unique(feature_values):
                predicted_class = y[feature_values == value].mode()[0]
                rule_accuracy[value] = (y[feature_values == value] == predicted_class).mean()
            best_value = max(rule_accuracy, key=rule_accuracy.get)
            self.rules[feature_index] = best_value
        return self
        
    def predict(self, X):
        X = X.to_numpy() if isinstance(X, pd.DataFrame) else X
        predictions = []
        for row in X:
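            # self.rules holds one stored feature value per column, so this list is identical for every row
            row_predictions = [self.rules.get(i, None) for i in range(len(row))]
            most_frequent_class = max(set(row_predictions), key=row_predictions.count)  # most common stored value, used as the label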
            predictions.append(most_frequent_class)
        return np.array(predictions)

one_r_classifier = OneRClassifier()
one_r_classifier.fit(X_train, y_train)

y_train_pred_one_r = one_r_classifier.predict(X_train)
y_val_pred_one_r = one_r_classifier.predict(X_val)
y_test_pred_one_r = one_r_classifier.predict(X_test)

print("------------------------------")
print("OneR Classifier - Training Set Metrics:")
calculate_metrics(y_train, y_train_pred_one_r)

print("------------------------------")
print("OneR Classifier - Validation Set Metrics:")
calculate_metrics(y_val, y_val_pred_one_r)

print("------------------------------")
print("OneR Classifier - Test Set Metrics:")
calculate_metrics(y_test, y_test_pred_one_r)

------------------------------
OneR Classifier - Training Set Metrics:

Confusion matrix:
0 35
0 26

Accuracy: 0.4262295081967213

Precision for each class:
0: 0
1: 0.4262295081967213

Macro precision: 0.21311475409836064

Recall for each class:
0: 0.0000
1: 1.0000

Macro recall: 0.5

F1 Score for each class:
0: 0
1: 0.5977011494252873

Macro F1 score: 0.29885057471264365
------------------------------
OneR Classifier - Validation Set Metrics:

Confusion matrix:
0 8
0 5

Accuracy: 0.38461538461538464

Precision for each class:
0: 0
1: 0.38461538461538464

Macro precision: 0.19230769230769232

Recall for each class:
0: 0.0000
1: 1.0000

Macro recall: 0.5

F1 Score for each class:
0: 0
1: 0.5555555555555556

Macro F1 score: 0.2777777777777778
------------------------------
OneR Classifier - Test Set Metrics:

Confusion matrix:
0 10
0 4

Accuracy: 0.2857142857142857

Precision for each class:
0: 0
1: 0.2857142857142857

Macro precision: 0.14285714285714285

Recall for each class:
0: 0.000
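
For comparison, the textbook OneR procedure builds, for every feature, a rule that maps each feature value to its majority class, then keeps the single feature whose rule makes the fewest training errors. The sketch below follows that description under the assumption that features are categorical or binary (continuous columns would need binning first). The class name `OneRSketch`, its attributes, and the toy data are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd


class OneRSketch:
    """Hypothetical OneR variant: one feature, one class label per feature value."""

    def fit(self, X, y):
        X = pd.DataFrame(X).reset_index(drop=True)
        y = pd.Series(np.asarray(y))
        self.default_ = y.mode()[0]  # fallback for feature values unseen in training
        best_errors = None
        for column in X.columns:
            # Majority class among the training rows sharing each feature value
            rule = y.groupby(X[column]).agg(lambda s: s.mode()[0])
            # Training errors made when predicting with this feature's rule alone
            errors = int((X[column].map(rule) != y).sum())
            if best_errors is None or errors < best_errors:
                best_errors = errors
                self.feature_, self.rule_ = column, rule.to_dict()
        return self

    def predict(self, X):
        values = pd.DataFrame(X)[self.feature_]
        return np.array([self.rule_.get(v, self.default_) for v in values])


# Toy data: the second column separates the classes perfectly, the first does not
X_toy = pd.DataFrame({
    "sex_Male": [0, 0, 1, 1, 0, 1],
    "loc_Supratentorial": [1, 1, 1, 0, 0, 0],
})
y_toy = [1, 1, 1, 0, 0, 0]

model = OneRSketch().fit(X_toy, y_toy)
print(model.feature_, model.predict(X_toy))
```

Unlike the class above, this version predicts a class label that depends on the row's value for the chosen feature, so its predictions can differ between rows.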