1. Load breast cancer dataset (**structured data**)

For more details about the data: https://scikit-learn.org/1.5/modules/generated/sklearn.datasets.load_breast_cancer.html

In [2]:

from sklearn.datasets import load_breast_cancer

my_data = load_breast_cancer()

X = my_data.data
y = my_data.target
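
Before visualizing, it can help to confirm what was loaded. The following is a minimal sketch, not part of the original notebook, that inspects the array shape and the class balance. The names `data` and `class_counts` are introduced here only for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target

# Shape is (n_samples, n_features)
print("Shape:", X.shape)

# target_names is indexed by label value, so label 0 maps to target_names[0]
counts = np.bincount(y)
class_counts = {str(name): int(n) for name, n in zip(data.target_names, counts)}
print("Samples per class:", class_counts)
```

The class counts show that the dataset is moderately imbalanced, which is worth keeping in mind when choosing an evaluation metric later.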

2. Visualize the data

- Only **5 points** for visualizing the data
- Use the t-SNE algorithm: sklearn.manifold.TSNE
- A simple, clear code example is available here (it uses PCA instead of t-SNE): https://skp2707.medium.com/pca-on-cancer-dataset-4d7a97f5fdb8

In [None]:
from sklearn.manifold import TSNE
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Draw a reproducible random subset of 100 samples to keep the plot readable
np.random.seed(42)
sample_indices = np.random.choice(X.shape[0], 100, replace=False)
X_sample = X[sample_indices]
y_sample = y[sample_indices]

# Project the 30 features down to 2 dimensions with t-SNE
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_tsne_sample = tsne.fit_transform(X_sample)

# Scatter plot of the 2D embedding, colored by class label
plt.figure(figsize=(8, 6))
plt.scatter(X_tsne_sample[:, 0], X_tsne_sample[:, 1], c=y_sample, cmap='coolwarm', s=100, edgecolor='k')
plt.title('t-SNE Visualization of Breast Cancer Data (100 Points)', fontsize=14)
plt.xlabel('TSNE Component 1')
plt.ylabel('TSNE Component 2')
plt.colorbar(label='Target (0 = Malignant, 1 = Benign)')
plt.show()
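
t-SNE is driven by pairwise distances, so features with large numeric ranges can dominate the embedding. The sketch below is one possible variation, not the notebook's method: it standardizes the features with `StandardScaler` before running t-SNE. It uses `np.random.default_rng`, so the 100 sampled rows differ from the ones drawn above; `idx`, `X_scaled`, and `X_embedded` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Sample 100 rows without replacement (assumption: same subset size as above)
rng = np.random.default_rng(42)
idx = rng.choice(X.shape[0], size=100, replace=False)

# Rescale every feature to zero mean and unit variance before embedding
X_scaled = StandardScaler().fit_transform(X[idx])

# Same t-SNE settings as the cell above
X_embedded = TSNE(n_components=2, random_state=42, perplexity=30).fit_transform(X_scaled)
print(X_embedded.shape)
```

The resulting `X_embedded` array can be passed to the same `plt.scatter` call, with `c=y[idx]`, to compare the two embeddings side by side.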


3. Split **my_data** into train and test sets:

- Define X_train, X_test, Y_train, Y_test
- Choose **test_size** for splitting **my_data**
- Use **train_test_split** (for details: https://scikit-learn.org/dev/modules/generated/sklearn.model_selection.train_test_split.html)

In [4]:

from sklearn.model_selection import train_test_split

# X_train, X_test, Y_train, Y_test = train_test_split(...)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
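
Because the two classes are not equally frequent, a purely random split can shift the class ratio between the train and test sets. As a hedged alternative to the split above, `train_test_split` accepts a `stratify` argument that preserves the ratio. The `_s` suffixed names below are illustrative and chosen so they do not overwrite the notebook's own variables.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# stratify=y keeps the benign/malignant proportion the same in both subsets
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Labels are 0/1, so the mean is the fraction of class 1 (benign)
train_benign = y_train_s.mean()
test_benign = y_test_s.mean()
print(f"Benign fraction: train={train_benign:.3f}, test={test_benign:.3f}")
```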


4. Train **model_decision_tree**

- Library: sklearn.tree.DecisionTreeClassifier
- Data: X_train, Y_train
- **Essential**: explore and optimize DecisionTreeClassifier options   

In [None]:
from sklearn.tree import DecisionTreeClassifier

# model_decision_tree = DecisionTreeClassifier(...)
# model_decision_tree.fit(...)

dt_model = DecisionTreeClassifier(random_state=42, max_depth=5, min_samples_split=10)
dt_model.fit(X_train, y_train)
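
The values `max_depth=5` and `min_samples_split=10` above are fixed by hand. One way to explore the options more systematically is a cross-validated grid search on the training set. This is a sketch under assumptions: the grid values, `cv=5`, and `scoring="f1"` are illustrative choices, not settings taken from the notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Hypothetical search space; widen or narrow it as needed
param_grid = {
    "max_depth": [3, 5, 7, None],
    "min_samples_split": [2, 10, 20],
    "criterion": ["gini", "entropy"],
}

# 5-fold cross-validation on the training data only; the test set stays untouched
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best cross-validated F1:", round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` is a tree refit on the full training set with the best parameters and can be evaluated like `dt_model`.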


5. Train model_random_forest

- Library: sklearn.ensemble.RandomForestClassifier
- Data: X_train, Y_train
- **Essential**: explore and optimize RandomForestClassifier options

In [None]:
from sklearn.ensemble import RandomForestClassifier

# model_random_forest = RandomForestClassifier(...)
# model_random_forest.fit(...)

rf_model = RandomForestClassifier(random_state=42, n_estimators=100, max_depth=7, min_samples_split=10)
rf_model.fit(X_train, y_train)
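
For random forests, the out-of-bag (OOB) score gives a validation estimate without a separate hold-out set, since each tree is evaluated on the training rows left out of its bootstrap sample. The sketch below compares a few forest sizes this way. The candidate values in `candidate_sizes` are an assumption for illustration; the other settings mirror `rf_model` above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

candidate_sizes = [50, 100, 200]  # hypothetical values to compare
oob_scores = {}
for n in candidate_sizes:
    forest = RandomForestClassifier(
        n_estimators=n, max_depth=7, min_samples_split=10,
        oob_score=True, random_state=42,
    )
    forest.fit(X_train, y_train)
    # oob_score_ is the accuracy on samples each tree did not see during training
    oob_scores[n] = forest.oob_score_

for n, score in oob_scores.items():
    print(f"n_estimators={n}: OOB accuracy={score:.3f}")
```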


6. Train model_adaboost

- Library: sklearn.ensemble.AdaBoostClassifier
- Data: X_train, Y_train
- **Essential**: explore and optimize AdaBoostClassifier options

In [None]:
from sklearn.ensemble import AdaBoostClassifier

# model_adaboost = AdaBoostClassifier(...)
# model_adaboost.fit(...)

ab_model = AdaBoostClassifier(random_state=42, n_estimators=50, learning_rate=1.0)
ab_model.fit(X_train, y_train)
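
With boosting, the number of rounds is a key option. `AdaBoostClassifier.staged_score` yields the score after each boosting iteration, which allows one fitted model to be inspected at every ensemble size. The sketch below carves a validation set out of the training data for this purpose. The 25% validation fraction and `n_estimators=200` are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Hold out part of the training data so the test set is not used for tuning
X_fit, X_val, y_fit, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

booster = AdaBoostClassifier(n_estimators=200, learning_rate=1.0, random_state=42)
booster.fit(X_fit, y_fit)

# One validation accuracy per boosting round (boosting may stop before 200 rounds)
val_scores = list(booster.staged_score(X_val, y_val))
best_n = int(np.argmax(val_scores)) + 1
print(f"Best validation accuracy {max(val_scores):.3f} at {best_n} rounds")
```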


7. Evaluate model_decision_tree, model_random_forest, model_adaboost

- Library: sklearn.metrics
- Data: X_test, Y_test
- **Calculate** and **print** results of each classifier
- **Choose** the decisive metric
- **Compare** between the classifiers and declare the winner


In [17]:
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

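def evaluate_model(model, X_test, y_test):
    """Return accuracy, F1, precision and recall of `model` on the test set."""
    y_pred = model.predict(X_test)
    # F1, precision and recall use scikit-learn's default pos_label=1,
    # so they describe the benign class (label 1) in this dataset
    return {
        'Accuracy': accuracy_score(y_test, y_pred),
        'F1 Score': f1_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred),
        'Recall': recall_score(y_test, y_pred)
    }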

dt_metrics = evaluate_model(dt_model, X_test, y_test)
rf_metrics = evaluate_model(rf_model, X_test, y_test)
ab_metrics = evaluate_model(ab_model, X_test, y_test)

print("Decision Tree Metrics:", dt_metrics)
print("Random Forest Metrics:", rf_metrics)
print("AdaBoost Metrics:", ab_metrics)

results = pd.DataFrame([dt_metrics, rf_metrics, ab_metrics],
                         index=['Decision Tree', 'Random Forest', 'AdaBoost'])
print("\nComparison of Models:\n", results)

winner = results['F1 Score'].idxmax()
print(f"\nThe best model based on F1 Score is: {winner}")

Decision Tree Metrics: {'Accuracy': 0.9532163742690059, 'F1 Score': 0.9629629629629629, 'Precision': 0.9629629629629629, 'Recall': 0.9629629629629629}
Random Forest Metrics: {'Accuracy': 0.9649122807017544, 'F1 Score': 0.9724770642201835, 'Precision': 0.9636363636363636, 'Recall': 0.9814814814814815}
AdaBoost Metrics: {'Accuracy': 0.9766081871345029, 'F1 Score': 0.9814814814814815, 'Precision': 0.9814814814814815, 'Recall': 0.9814814814814815}

Comparison of Models:
                Accuracy  F1 Score  Precision    Recall
Decision Tree  0.953216  0.962963   0.962963  0.962963
Random Forest  0.964912  0.972477   0.963636  0.981481
AdaBoost       0.976608  0.981481   0.981481  0.981481

The best model based on F1 Score is: AdaBoost
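
The precision, recall, and F1 values above are computed with scikit-learn's default `pos_label=1`, which in this dataset is the benign class (0 = malignant, 1 = benign). If the decisive question is how many malignant cases are caught, the malignant class can be treated as positive instead. The sketch below shows this for the decision tree only, as an illustration rather than a replacement for the comparison above; `malignant_recall` and `cm` are names introduced here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Same settings as dt_model above
model = DecisionTreeClassifier(random_state=42, max_depth=5, min_samples_split=10)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# pos_label=0 makes "malignant" the positive class
malignant_recall = recall_score(y_test, y_pred, pos_label=0)

# Rows are true labels, columns are predicted labels, ordered [0, 1]
cm = confusion_matrix(y_test, y_pred)
print("Confusion matrix:\n", cm)
print(f"Malignant recall: {malignant_recall:.3f}")
```

Here `cm[0, 1]` counts malignant cases predicted as benign, which is usually the costlier error in a screening setting.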
