# Portfolio assignment week 7

## 1. Bagging vs Boosting
The scikit-learn library provides several options for bagging and boosting. You can build your own bagging or boosting model on top of a base model; for instance, a tree-based bagging model. scikit-learn also provides AdaBoost. For XGBoost, it is best to use the xgboost library.

Based on the theory in the [accompanying notebook](../Exercises/E_BAGGING_BOOSTING.ipynb), create a bagging, boosting and dummy classifier. Test these classifiers on the [breast cancer dataset](https://www.kaggle.com/datasets/yasserh/breast-cancer-dataset). Go through the data science pipeline as you've done before:

1. Try to understand the dataset globally.
2. Load the data.
3. Exploratory analysis
4. Preprocess data (skewness, normality, etc.)
5. Modeling (cross-validation and training)
6. Evaluation
7. Try to understand why some methods perform better than others. Try different configurations for your bagging and boosting models.

In [28]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns 
import warnings
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.tree import DecisionTreeClassifier

In [8]:
df = pd.read_csv('../Data/breast-cancer.csv')
df.head()

| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             5

In [12]:
# Percentage of null values per column, highest first
df.isnull().sum().sort_values(ascending=False)/len(df)*100
# Drop id, which carries no information for the modelling
df.drop(['id'], inplace=True, axis=1)

In [15]:
# Standardization Scaling
standard_scaler = StandardScaler()
data_standardized = standard_scaler.fit_transform(df)

# Convert the standardized data back to a DataFrame (optional)
scaled_data = pd.DataFrame(data_standardized, columns=df.columns)

In [19]:
# Define the dependent variable (the target to predict)
y = df['diagnosis']

In [20]:
# split the data into train and test
X = scaled_data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

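Bagging (bootstrap aggregating) trains several copies of a base model, each on a bootstrap sample of the training data drawn with replacement, and combines their predictions by majority vote or averaging. This mainly reduces variance, which helps against overfitting.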
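
To make the idea concrete, here is a minimal hand-rolled sketch using only the standard library. It is an illustration under stated assumptions, not the scikit-learn implementation: the helper names (`fit_stump`, `fit_bagging`, `predict_bagging`), the one-feature toy data, and the choice of decision stumps as base model are all hypothetical.

```python
import random
from collections import Counter

def fit_stump(X, y):
    """Return the (feature, threshold, left_label, right_label) stump
    with the fewest training errors on the given sample."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for left, right in ((0, 1), (1, 0)):
                errors = sum((left if row[j] <= t else right) != label
                             for row, label in zip(X, y))
                if best is None or errors < best[0]:
                    best = (errors, j, t, left, right)
    return best[1:]

def predict_stump(stump, row):
    j, t, left, right = stump
    return left if row[j] <= t else right

def fit_bagging(X, y, n_estimators=25, seed=0):
    """Fit one stump per bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return models

def predict_bagging(models, row):
    """Combine the stumps by majority vote."""
    votes = Counter(predict_stump(m, row) for m in models)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: one feature, two classes
X = [[1.0], [2.0], [3.0], [4.0], [6.0], [7.0], [8.0], [9.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

models = fit_bagging(X, y)
preds = [predict_bagging(models, row) for row in X]
print(preds)
```

Each stump sees a slightly different sample, so individual thresholds vary; the vote smooths that variation out. In scikit-learn the same pattern is available through `BaggingClassifier`, and `RandomForestClassifier` adds random feature selection on top of it.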

In [24]:
models = [RandomForestClassifier(random_state=60),
          GradientBoostingClassifier(random_state=60),
          AdaBoostClassifier(random_state=60)]

for model in models:
    score = cross_val_score(model, X, y, cv=10)
    report = ("{0}:\n\tCross validated score\t= {1:.3f} "
           "(+/- {2:.3f})".format(model.__class__.__name__,
                                  score.mean(),
                                  score.std()))
    print(report)

    # Fit the model on the training set, then predict and evaluate on the held-out test set
    model.fit(X_train, y_train)
    prediction = model.predict(X_test)
    acc_eval = accuracy_score(y_test, prediction)
    print("\tAccuracy score\t\t= {0:.3f}".format(acc_eval))

RandomForestClassifier:
	Cross validated score	= 0.998 (+/- 0.005)
	Accuracy score		= 1.000
GradientBoostingClassifier:
	Cross validated score	= 1.000 (+/- 0.000)
	Accuracy score		= 1.000
AdaBoostClassifier:
	Cross validated score	= 1.000 (+/- 0.000)
	Accuracy score		= 1.000


In [29]:
models = [
    RandomForestClassifier(random_state=60),
    GradientBoostingClassifier(random_state=60),
    AdaBoostClassifier(random_state=60),
    DecisionTreeClassifier(random_state=60)
]

for model in models:
    # Cross-validation
    score = cross_val_score(model, X, y, cv=10)
    report = ("{0}:\n\tCross-validated score\t= {1:.3f} (+/- {2:.3f})".format(model.__class__.__name__,
                                                                            score.mean(),
                                                                            score.std()))
    print(report)

    # Fit the model on the training data and evaluate on the test data
    model.fit(X_train, y_train)
    prediction = model.predict(X_test)
    acc_eval = accuracy_score(y_test, prediction)
    print("\tAccuracy score\t\t= {0:.3f}".format(acc_eval))

RandomForestClassifier:
	Cross-validated score	= 0.998 (+/- 0.005)
	Accuracy score		= 1.000
GradientBoostingClassifier:
	Cross-validated score	= 1.000 (+/- 0.000)
	Accuracy score		= 1.000
AdaBoostClassifier:
	Cross-validated score	= 1.000 (+/- 0.000)
	Accuracy score		= 1.000
DecisionTreeClassifier:
	Cross-validated score	= 1.000 (+/- 0.000)
	Accuracy score		= 1.000


Boosting is an ensemble meta-algorithm that mainly reduces bias, and often variance as well, by training weak models sequentially. Each round concentrates on what the previous models got wrong: AdaBoost gives higher weights to misclassified instances, while gradient boosting fits the next model to the remaining error. The weighted models are then combined into a strong learner.
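
The reweighting step can be sketched by hand with the standard library. This is a simplified AdaBoost for labels in {-1, +1} with one-feature decision stumps, offered as an illustration only; the function names and the toy data are hypothetical, and scikit-learn's `AdaBoostClassifier` differs in its details.

```python
import math

def stump_predict(stump, x):
    threshold, left_label = stump
    return left_label if x <= threshold else -left_label

def fit_weighted_stump(X, y, w):
    """Pick the threshold and polarity with the lowest weighted error."""
    best_err, best_stump = None, None
    for threshold in sorted(set(X)):
        for left_label in (1, -1):
            stump = (threshold, left_label)
            err = sum(wi for xi, yi, wi in zip(X, y, w)
                      if stump_predict(stump, xi) != yi)
            if best_err is None or err < best_err:
                best_err, best_stump = err, stump
    return best_stump, best_err

def fit_adaboost(X, y, n_rounds):
    n = len(X)
    w = [1.0 / n] * n  # start with uniform instance weights
    ensemble = []
    for _ in range(n_rounds):
        stump, err = fit_weighted_stump(X, y, w)
        err = min(max(err, 1e-12), 1 - 1e-12)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # vote weight of this stump
        ensemble.append((alpha, stump))
        # Misclassified instances get heavier, correct ones lighter
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict_adaboost(ensemble, x):
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

def accuracy(ensemble, X, y):
    hits = sum(predict_adaboost(ensemble, xi) == yi for xi, yi in zip(X, y))
    return hits / len(X)

# Hypothetical toy data that no single threshold can separate
X = list(range(10))
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

for rounds in (1, 2, 3):
    print(rounds, accuracy(fit_adaboost(X, y, rounds), X, y))
```

On this toy data a single stump classifies 7 of the 10 points correctly, while three boosted stumps fit all 10, because each new stump is chosen under weights that emphasise the earlier mistakes.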

In [31]:
models = [
    RandomForestClassifier(random_state=60),
    GradientBoostingClassifier(random_state=60),
    AdaBoostClassifier(random_state=60), 
    DecisionTreeClassifier(random_state=60)
]

for model in models:
    # Fit the model on the training data
    model.fit(X_train, y_train)

    # Make predictions on the test data
    predictions = model.predict(X_test)

    # Evaluate the model
    accuracy = accuracy_score(y_test, predictions)
    print(f"{model.__class__.__name__}: Accuracy = {accuracy:.3f}")

RandomForestClassifier: Accuracy = 1.000
GradientBoostingClassifier: Accuracy = 1.000
AdaBoostClassifier: Accuracy = 1.000
DecisionTreeClassifier: Accuracy = 1.000


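According to the results above, every estimator reaches an accuracy of 1.000 on the test set, including the single decision tree. Perfect scores across all models more likely point to target leakage than to model quality: `scaled_data` appears to be built from every column of `df`, so `X` still contains `diagnosis`, the column being predicted. These numbers therefore do not show which method performs better.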
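
The assignment also asks for a dummy classifier, which gives a floor to compare the ensembles against. The following is a minimal sketch under several assumptions: it uses scikit-learn's bundled Wisconsin breast cancer data (`load_breast_cancer`) as a stand-in for `../Data/breast-cancer.csv`, which appears to be the same dataset; it keeps features and target separate; and it puts `StandardScaler` inside a pipeline so the scaler is fitted on the training folds only. The model choices, `n_estimators=50`, and `cv=5` are illustrative, not the notebook's configuration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled dataset stands in for the Kaggle CSV.
# X holds only the 30 feature columns; y holds the labels.
X, y = load_breast_cancer(return_X_y=True)

models = {
    # Baseline that always predicts the most frequent class
    "dummy": DummyClassifier(strategy="most_frequent"),
    # Base estimator passed positionally; its keyword name differs across versions
    "bagging": BaggingClassifier(DecisionTreeClassifier(random_state=60),
                                 n_estimators=50, random_state=60),
    "boosting": GradientBoostingClassifier(random_state=60),
}

scores = {}
for name, model in models.items():
    # Scaling lives inside the pipeline, so each fold fits its own scaler
    pipeline = make_pipeline(StandardScaler(), model)
    scores[name] = cross_val_score(pipeline, X, y, cv=5)
    print(f"{name}: {scores[name].mean():.3f} (+/- {scores[name].std():.3f})")
```

Any model worth keeping should clearly beat the dummy score, which simply reflects the share of the majority class.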