## Phase 1: Bagging and Boosting
Concepts:
Ensemble learning, Bagging (Random Forest), Boosting (AdaBoost, XGBoost)


In [None]:
import pandas as pd
import numpy as np

Step 1: Data Understanding

In [None]:
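df = pd.read_csv('/content/breast-cancer.csv')
df.head()            # not rendered: only a cell's last expression is displayed
df.info()            # prints column dtypes and non-null counts
df.isnull().sum()    # not rendered; wrap in print() to see per-column missing counts
df.describe()        # summary statistics, displayed as the cell's result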

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             5

| | id | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | ... | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 |
| mean | 30371830.0 | 14.127292 | 19.289649 | 91.969033 | 654.889104 | 0.09636 | 0.104341 | 0.088799 | 0.048919 | 0.181162 | ... | 16.26919 | 25.677223 | 107.261213 | 880.583128 | 0.132369 | 0.254265 | 0.272188 | 0.114606 | 0.290076 | 0.083946 |
| std | 125020600.0 | 3.524049 | 4.301036 | 24.298981 | 351.914129 | 0.014064 | 0.052813 | 0.07972 | 0.038803 | 0.027414 | ... | 4.833242 | 6.146258 | 33.602542 | 569.356993 | 0.022832 | 0.157336 | 0.208624 | 0.065732 | 0.061867 | 0.018061 |
| min | 8670.0 | 6.981 | 9.71 | 43.79 | 143.5 | 0.05263 | 0.01938 | 0.0 | 0.0 | 0.106 | ... | 7.93 | 12.02 | 50.41 | 185.2 | 0.07117 | 0.02729 | 0.0 | 0.0 | 0.1565 | 0.05504 |
| 25% | 869218.0 | 11.7 | 16.17 | 75.17 | 420.3 | 0.08637 | 0.06492 | 0.02956 | 0.02031 | 0.1619 | ... | 13.01 | 21.08 | 84.11 | 515.3 | 0.1166 | 0.1472 | 0.1145 | 0.06493 | 0.2504 | 0.07146 |
| 50% | 906024.0 | 13.37 | 18.84 | 86.24 | 551.1 | 0.09587 | 0.09263 | 0.06154 | 0.0335 | 0.1792 | ... | 14.97 | 25.41 | 97.66 | 686.5 | 0.1313 | 0.2119 | 0.2267 | 0.09993 | 0.2822 | 0.08004 |
| 75% | 8813129.0 | 15.78 | 21.8 | 104.1 | 782.7 | 0.1053 | 0.1304 | 0.1307 | 0.074 | 0.1957 | ... | 18.79 | 29.72 | 125.4 | 1084.0 | 0.146 | 0.3391 | 0.3829 | 0.1614 | 0.3179 | 0.09208 |
| max | 911320500.0 | 28.11 | 39.28 | 188.5 | 2501.0 | 0.1634 | 0.3454 | 0.4268 | 0.2012 | 0.304 | ... | 36.04 | 49.54 | 251.2 | 4254.0 | 0.2226 | 1.058 | 1.252 | 0.291 | 0.6638 | 0.2075 |


Step 2: Preprocessing the Data

In [None]:
df = df.drop(columns=['id'])    # Drop the 'id' column: it is an identifier, not a feature
df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0})    # Encode the target: malignant = 1, benign = 0
X = df.drop('diagnosis', axis=1)
y = df['diagnosis']
print("Features shape:", X.shape)
print("Target shape:", y.shape)

Features shape: (569, 30)
Target shape: (569,)


Step 3: Train-Test Split

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)

Training set shape: (455, 30)
Test set shape: (114, 30)


Step 4: Train RandomForestClassifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)
rf_preds = rf_model.predict(X_test)
rf_acc = accuracy_score(y_test, rf_preds)
print("Random Forest Accuracy:", rf_acc)
print("\nClassification Report:\n", classification_report(y_test, rf_preds))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, rf_preds))


Random Forest Accuracy: 0.9736842105263158

Classification Report:
               precision    recall  f1-score   support

           0       0.96      1.00      0.98        72
           1       1.00      0.93      0.96        42

    accuracy                           0.97       114
   macro avg       0.98      0.96      0.97       114
weighted avg       0.97      0.97      0.97       114


Confusion Matrix:
 [[72  0]
 [ 3 39]]


Step 5: Train AdaBoostClassifier

In [None]:
from sklearn.ensemble import AdaBoostClassifier

ada_model = AdaBoostClassifier(random_state=42)
ada_model.fit(X_train, y_train)
ada_preds = ada_model.predict(X_test)
ada_acc = accuracy_score(y_test, ada_preds)
print("AdaBoost Accuracy:", ada_acc)
print("\nClassification Report:\n", classification_report(y_test, ada_preds))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, ada_preds))


AdaBoost Accuracy: 0.9824561403508771

Classification Report:
               precision    recall  f1-score   support

           0       0.97      1.00      0.99        72
           1       1.00      0.95      0.98        42

    accuracy                           0.98       114
   macro avg       0.99      0.98      0.98       114
weighted avg       0.98      0.98      0.98       114


Confusion Matrix:
 [[72  0]
 [ 2 40]]


Step 6: Train XGBClassifier

In [None]:
from xgboost import XGBClassifier
# use_label_encoder is omitted: recent XGBoost releases ignore it and warn that the parameter is not used
xgb_model = XGBClassifier(eval_metric='logloss', random_state=42)
xgb_model.fit(X_train, y_train)
xgb_preds = xgb_model.predict(X_test)
xgb_acc = accuracy_score(y_test, xgb_preds)
print("XGBoost Accuracy:", xgb_acc)
print("\nClassification Report:\n", classification_report(y_test, xgb_preds))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, xgb_preds))

XGBoost Accuracy: 0.956140350877193

Classification Report:
               precision    recall  f1-score   support

           0       0.94      1.00      0.97        72
           1       1.00      0.88      0.94        42

    accuracy                           0.96       114
   macro avg       0.97      0.94      0.95       114
weighted avg       0.96      0.96      0.96       114


Confusion Matrix:
 [[72  0]
 [ 5 37]]





## Final Summary: Ensemble Learning on Breast Cancer Dataset

### Overview
Three ensemble learning models were trained and compared:
- **RandomForestClassifier**
- **AdaBoostClassifier**
- **XGBClassifier** (XGBoost)

### Model Performance Comparison

| Model         | Accuracy   | Precision (M) | Recall (M) | F1-Score (M) | False Negatives |
|---------------|------------|---------------|------------|--------------|-----------------|
| Random Forest | 97.37%     | 100%          | 93%        | 96%          | 3               |
| AdaBoost      | **98.25%** | 100%          | 95%        | **98%**      | 2               |
| XGBoost       | 95.61%     | 100%          | 88%        | 94%          | 5               |
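
The malignant-class (M) figures in this table can be recomputed by hand from the confusion matrices printed in Steps 4 to 6. The sketch below does that arithmetic in plain Python; the helper name `malignant_metrics` is introduced here for illustration and is not part of the notebook.

```python
# Confusion matrices copied from the notebook outputs in Steps 4 to 6.
# Layout follows sklearn.metrics.confusion_matrix: rows are true labels,
# columns are predicted labels, with 0 = benign and 1 = malignant.
confusion = {
    "Random Forest": [[72, 0], [3, 39]],
    "AdaBoost": [[72, 0], [2, 40]],
    "XGBoost": [[72, 0], [5, 37]],
}


def malignant_metrics(cm):
    """Return accuracy plus precision, recall, and F1 for the malignant class."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "false_negatives": fn,
    }


summary = {name: malignant_metrics(cm) for name, cm in confusion.items()}
for name, m in summary.items():
    print(
        f"{name:13s} accuracy={m['accuracy']:.2%} precision={m['precision']:.0%} "
        f"recall={m['recall']:.0%} f1={m['f1']:.0%} FN={m['false_negatives']}"
    )
```

Recall is the column that separates the models: precision is 100% for all three because none of them produced a false positive.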

---

### Key Observations:
- All three models reached at least 95.6% test accuracy and 100% precision on the malignant class (no false positives).
- **AdaBoostClassifier** performed best on this test split, with the highest accuracy (98.25%) and the fewest false negatives (2 of 42 malignant cases).

### Basic Error Analysis:

- **False Negatives**
  - Random Forest: 3
  - AdaBoost: 2
  - XGBoost: 5
- All models had **0 false positives**, which helps avoid unnecessary treatments or over-diagnosis.
- However, **false negatives are more dangerous** in medical scenarios, as they can result in missed cancer detection and delayed treatment.
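
Because a false negative costs more than a false positive here, one common option is to classify a sample as malignant at a probability threshold lower than the default 0.5. The sketch below shows the trade-off in plain Python. The labels and probabilities are invented for illustration and do not come from the notebook; in the notebook they would be `y_test` and something like `ada_model.predict_proba(X_test)[:, 1]`. The helper `count_errors` is a hypothetical name.

```python
def count_errors(y_true, malignant_proba, threshold):
    """Count false negatives and false positives at a probability threshold."""
    false_negatives = 0
    false_positives = 0
    for label, proba in zip(y_true, malignant_proba):
        predicted = 1 if proba >= threshold else 0
        if label == 1 and predicted == 0:
            false_negatives += 1
        elif label == 0 and predicted == 1:
            false_positives += 1
    return false_negatives, false_positives


# Hypothetical labels (1 = malignant) and predicted malignant probabilities.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
malignant_proba = [0.95, 0.80, 0.45, 0.35, 0.40, 0.20, 0.10, 0.05]

errors_by_threshold = {
    t: count_errors(y_true, malignant_proba, t) for t in (0.5, 0.3)
}
for t, (fn, fp) in errors_by_threshold.items():
    print(f"threshold={t}: false negatives={fn}, false positives={fp}")
```

On this toy data, lowering the threshold removes both false negatives at the price of one false positive. Whether that trade is acceptable on the real data would need to be checked on a validation set, not on the test set used for the final comparison.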

#### Insight on XGBoost Errors:
XGBoost had the most false negatives (5 of the 42 malignant test samples). A plausible explanation is that these were **borderline cases**: malignant samples whose feature values (such as radius_mean and area_mean) lie close to the benign range and are therefore harder to classify. The notebook does not inspect the misclassified rows, so this remains a hypothesis.
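
One way to check the borderline-case idea is to place the missed malignant rows next to the per-class feature means. The sketch below assumes pandas is available; `compare_false_negatives` is a hypothetical helper, and the small demo frame uses made-up values, not rows from the dataset.

```python
import pandas as pd


def compare_false_negatives(X, y_true, y_pred, features):
    """Return the false-negative rows and the per-class means for `features`."""
    # Re-index labels and predictions on X's index so boolean masks align,
    # even when y_pred is a plain NumPy array (as model.predict returns).
    y_true = pd.Series(list(y_true), index=X.index)
    y_pred = pd.Series(list(y_pred), index=X.index)

    missed = X.loc[(y_true == 1) & (y_pred == 0), features]
    class_means = X[features].groupby(y_true).mean()
    class_means.index = class_means.index.map(
        {0: "benign mean", 1: "malignant mean"}
    )
    return missed, class_means


# Made-up demo data: the last row is malignant but predicted benign.
X_demo = pd.DataFrame(
    {
        "radius_mean": [12.0, 11.5, 20.0, 19.0, 13.5],
        "area_mean": [450.0, 400.0, 1250.0, 1100.0, 560.0],
    }
)
y_demo = [0, 0, 1, 1, 1]
pred_demo = [0, 0, 1, 1, 0]

missed, class_means = compare_false_negatives(
    X_demo, y_demo, pred_demo, ["radius_mean", "area_mean"]
)
print(missed)
print(class_means)
```

In the notebook the call would be `compare_false_negatives(X_test, y_test, xgb_preds, ["radius_mean", "area_mean"])`. If the missed rows sit much nearer the benign means than the malignant means, that supports the hypothesis; if not, another explanation is needed.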


### Final Conclusion:

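- **AdaBoostClassifier** is the most suitable of the three models for this problem, based on this test split:
  - It achieved the best accuracy (98.25%) and malignant-class F1-score (98%).
  - It produced the fewest false negatives (2), which is crucial in medical diagnostics.
  - Its margin over the other models is small (one to three test samples out of 114), and all models were trained with default hyperparameters.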
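
Since this ranking rests on a single 114-sample test split, a cross-validated comparison of malignant recall is one way to see whether it holds up. The sketch below is a starting point under stated assumptions, not a result from the notebook. It assumes scikit-learn's bundled Wisconsin breast cancer data is equivalent to `/content/breast-cancer.csv`, and it covers only the two scikit-learn models. `XGBClassifier` could be added to the `models` dict where xgboost is installed.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Assumption: the bundled dataset stands in for /content/breast-cancer.csv.
# It encodes malignant as 0, so flip the labels to match the notebook's
# convention (malignant = 1, benign = 0).
data = load_breast_cancer()
X, y = data.data, 1 - data.target

models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "AdaBoost": AdaBoostClassifier(random_state=42),
}

# Stratified folds keep the malignant/benign ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

recall_by_model = {}
for name, model in models.items():
    # scoring="recall" measures recall for the positive class (malignant = 1).
    scores = cross_val_score(model, X, y, cv=cv, scoring="recall")
    recall_by_model[name] = scores
    print(f"{name}: malignant recall {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the fold-to-fold spread is larger than the gap between the models' mean recall, the single-split ranking should be treated as provisional.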

