<a href="https://colab.research.google.com/github/subhashpolisetti/Decision-Tree-Ensemble-Algorithms/blob/main/Evaluating_Classification_Algorithms_Breast_Cancer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning Model Comparison on Breast Cancer Dataset

This notebook compares the performance of various machine learning algorithms for classification on the **Breast Cancer** dataset. The models tested include:

- **Decision Tree Classifier**
- **Random Forest Classifier**
- **AdaBoost Classifier**
- **Gradient Boosting Classifier**
- **XGBoost Classifier**
- **LightGBM Classifier**
- **CatBoost Classifier**

## Key Steps:
1. **Dataset Loading and Preprocessing**:
   - The Breast Cancer dataset is loaded using `sklearn.datasets.load_breast_cancer()`.
   - The dataset is split into training and test sets, with 80% used for training and 20% for testing.

2. **Model Training**:
   - Each of the seven tree-based models listed above is fit on the training data: a single Decision Tree, plus the ensemble methods Random Forest, AdaBoost, Gradient Boosting, XGBoost, LightGBM, and CatBoost.

3. **Model Evaluation**:
   - After training, each model is evaluated on the test set using accuracy as the performance metric.
   - The accuracy of each model is compared and displayed.
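
The metric in step 3 is plain accuracy: the fraction of test samples whose predicted label matches the true label. The sketch below checks that definition against `accuracy_score`. The label arrays are hypothetical values chosen for illustration; they are not taken from the Breast Cancer dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels for illustration only (not from the dataset)
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])
y_hat = np.array([1, 0, 1, 0, 0, 1, 1, 1])

# Accuracy is the share of positions where the prediction equals the true label
manual_accuracy = float(np.mean(y_true == y_hat))
library_accuracy = accuracy_score(y_true, y_hat)

print(manual_accuracy, library_accuracy)
```

Six of the eight hypothetical predictions match, so both values come out to 0.75.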

## Results:
The accuracy of the models on the test set is as follows:
- **Decision Tree Accuracy**: 0.94
- **Random Forest Accuracy**: 0.96
- **AdaBoost Accuracy**: 0.97
- **Gradient Boosting Accuracy**: 0.96
- **XGBoost Accuracy**: 0.96
- **LightGBM Accuracy**: 0.96
- **CatBoost Accuracy**: 0.97

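On this split, the ensemble methods score between 0.96 and 0.97, slightly above the single Decision Tree at 0.94. The test set holds only 114 samples, so these gaps amount to a handful of predictions and should not be read as a firm ranking of the algorithms.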
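
One way to check whether an ordering holds beyond a single train/test split is k-fold cross-validation. The sketch below is not part of the original notebook: the choice of 5 stratified folds and the restriction to two of the models are assumptions made to keep the example short.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Assumption: 5 stratified, shuffled folds (the notebook uses one 80/20 split)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

cv_scores = {}
for name, model in candidates.items():
    # One accuracy value per fold; the spread indicates how split-dependent the score is
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

If the standard deviation across folds is comparable to the gap between two models, the single-split ranking is probably not meaningful.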


In [1]:
from sklearn.tree import DecisionTreeClassifier  # Import the DecisionTreeClassifier from sklearn
from sklearn.datasets import load_breast_cancer  # Import the breast cancer dataset
from sklearn.model_selection import train_test_split  # Import train_test_split to split the data
from sklearn.metrics import accuracy_score  # Import accuracy_score to evaluate model performance

# Load the breast cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target  # Features (X) and target labels (y)

# Split the dataset into training and test sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Decision Tree classifier with a maximum depth of 5
# Note: no random_state is set, so the accuracy may vary slightly between runs
dt = DecisionTreeClassifier(max_depth=5)
dt.fit(X_train, y_train)  # Fit the model on the training data

# Make predictions on the test set
y_pred = dt.predict(X_test)

# Evaluate the model's performance using accuracy score
print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred))  # Print the accuracy on the test set


Decision Tree Accuracy: 0.9385964912280702


In [2]:
from sklearn.ensemble import RandomForestClassifier  # Import the RandomForestClassifier from sklearn

# Train a Random Forest model with 100 trees (estimators) and a fixed random state for reproducibility
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)  # Fit the model on the training data

# Make predictions on the test set
y_pred = rf.predict(X_test)

# Evaluate the model's performance using accuracy score
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred))  # Print the accuracy on the test set


Random Forest Accuracy: 0.9649122807017544


In [3]:
from sklearn.ensemble import AdaBoostClassifier  # Import the AdaBoostClassifier from sklearn

# Train an AdaBoost model with 50 weak learners (decision stumps) and a fixed random state for reproducibility
ab = AdaBoostClassifier(n_estimators=50, random_state=42)
ab.fit(X_train, y_train)  # Fit the model on the training data

# Make predictions on the test set
y_pred = ab.predict(X_test)

# Evaluate the model's performance using accuracy score
print("AdaBoost Accuracy:", accuracy_score(y_test, y_pred))  # Print the accuracy on the test set




AdaBoost Accuracy: 0.9736842105263158


In [4]:
from sklearn.ensemble import GradientBoostingClassifier  # Import the GradientBoostingClassifier from sklearn

# Train a Gradient Boosting model with 100 estimators, a learning rate of 0.1, and a maximum tree depth of 3
# The random_state ensures reproducibility of the results
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gbm.fit(X_train, y_train)  # Fit the model to the training data

# Make predictions on the test set
y_pred = gbm.predict(X_test)

# Evaluate the model's performance using accuracy score
print("Gradient Boosting Accuracy:", accuracy_score(y_test, y_pred))  # Print the accuracy on the test set


Gradient Boosting Accuracy: 0.956140350877193


In [5]:
from xgboost import XGBClassifier  # Import the XGBClassifier from the xgboost library

# Train an XGBoost model with 100 estimators, a learning rate of 0.1, and a maximum tree depth of 3
# The random_state ensures reproducibility of the results
# eval_metric='logloss' is the standard evaluation metric for binary classification
# (use_label_encoder is omitted: recent XGBoost versions ignore it and emit a warning)
xgb = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42, eval_metric='logloss')
xgb.fit(X_train, y_train)  # Fit the model to the training data

# Make predictions on the test set
y_pred = xgb.predict(X_test)

# Evaluate the model's performance using accuracy score
print("XGBoost Accuracy:", accuracy_score(y_test, y_pred))  # Print the accuracy on the test set


XGBoost Accuracy: 0.956140350877193


In [6]:
import lightgbm as lgb  # Import the LightGBM library for gradient boosting

# Train a LightGBM model with 100 estimators, a learning rate of 0.1, and a maximum tree depth of 3
# The random_state ensures reproducibility of the results
lgbm = lgb.LGBMClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
lgbm.fit(X_train, y_train)  # Fit the model to the training data

# Make predictions on the test set
y_pred = lgbm.predict(X_test)

# Evaluate the model's performance using accuracy score
print("LightGBM Accuracy:", accuracy_score(y_test, y_pred))  # Print the accuracy on the test set


[LightGBM] [Info] Number of positive: 286, number of negative: 169
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.006472 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 4548
[LightGBM] [Info] Number of data points in the train set: 455, number of used features: 30
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.628571 -> initscore=0.526093
[LightGBM] [Info] Start training from score 0.526093
LightGBM Accuracy: 0.956140350877193


In [7]:

%pip install catboost

Collecting catboost
  Downloading catboost-1.2.7-cp310-cp310-manylinux2014_x86_64.whl.metadata (1.2 kB)
Downloading catboost-1.2.7-cp310-cp310-manylinux2014_x86_64.whl (98.7 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 98.7/98.7 MB 10.3 MB/s eta 0:00:00
Installing collected packages: catboost
Successfully installed catboost-1.2.7


In [8]:
from catboost import CatBoostClassifier

# Train CatBoost
catboost = CatBoostClassifier(iterations=100, learning_rate=0.1, depth=3, verbose=0, random_seed=42)
catboost.fit(X_train, y_train)

# Predict and Evaluate
y_pred = catboost.predict(X_test)
print("CatBoost Accuracy:", accuracy_score(y_test, y_pred))

CatBoost Accuracy: 0.9736842105263158


In [9]:
# Dictionary containing the models to evaluate, with model names as keys and model objects as values
models = {
    "Decision Tree": dt,  # Decision Tree model
    "Random Forest": rf,  # Random Forest model
    "AdaBoost": ab,  # AdaBoost model
    "Gradient Boosting": gbm,  # Gradient Boosting model
    "XGBoost": xgb,  # XGBoost model
    "LightGBM": lgbm,  # LightGBM model
    "CatBoost": catboost  # CatBoost model
}

# Loop through each model in the models dictionary
for name, model in models.items():
    # Make predictions on the test set
    y_pred = model.predict(X_test)

    # Calculate the accuracy of the model and print the result
    print(f"{name} Accuracy: {accuracy_score(y_test, y_pred):.2f}")


Decision Tree Accuracy: 0.94
Random Forest Accuracy: 0.96
AdaBoost Accuracy: 0.97
Gradient Boosting Accuracy: 0.96
XGBoost Accuracy: 0.96
LightGBM Accuracy: 0.96
CatBoost Accuracy: 0.97
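
To put the rounded accuracies in perspective, they can be converted back into counts of misclassified test samples. The sketch below copies the unrounded accuracies printed by the individual cells and assumes a test set of 114 samples, which follows from the dataset's 569 samples minus the 455 training samples reported in the LightGBM log.

```python
# Unrounded accuracies as printed by the individual cells above
accuracies = {
    "Decision Tree": 0.9385964912280702,
    "Random Forest": 0.9649122807017544,
    "AdaBoost": 0.9736842105263158,
    "Gradient Boosting": 0.956140350877193,
    "XGBoost": 0.956140350877193,
    "LightGBM": 0.956140350877193,
    "CatBoost": 0.9736842105263158,
}

# 569 samples in the dataset minus the 455 training samples from the LightGBM log
n_test = 569 - 455

# Error count = (1 - accuracy) * number of test samples, rounded to the nearest integer
errors = {name: round((1 - acc) * n_test) for name, acc in accuracies.items()}

for name, n_err in sorted(errors.items(), key=lambda kv: kv[1]):
    print(f"{name}: {n_err} of {n_test} test samples misclassified")
```

On this split the best and worst models differ by four test samples (3 versus 7 errors).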
