# Ensemble Learning
Ensemble learning combines the predictions of several models to achieve better predictive performance than any single model would on its own. It is widely used because a well-built ensemble often generalizes better and can reduce the risk of overfitting. In this article, we introduce ensemble learning and demonstrate how to implement it in Python with scikit-learn.

# Understanding Ensemble Learning
Traditional machine learning involves training a single model on a dataset to make predictions. A single model, however, rarely captures all the complexity and variability of real-world data. Ensemble learning addresses this challenge by aggregating the predictions of multiple models, known as base models (or, in the context of boosting, weak learners).

The fundamental principle behind ensemble learning is the wisdom of the crowd: while individual models may produce errors, combining their predictions tends to give a more robust and accurate final output, provided the models do not all make the same mistakes. By compensating for each other's weaknesses, these models collectively deliver improved performance.
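The wisdom-of-the-crowd argument can be made concrete with a small calculation. The sketch below is illustrative only: it assumes a binary task, an odd number of models, and models that are each correct with the same probability and fully independent of one another, which real models rarely are. The function name `majority_vote_accuracy` is hypothetical.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p: float) -> float:
    """Probability that a strict majority of n independent models is correct.

    Assumes a binary task, an odd n_models, and that each model is correct
    with probability p independently of the others.
    """
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


# Accuracy of the majority vote as the ensemble grows
for n in (1, 3, 11, 51):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

With three models that are each right 60% of the time, the majority is right 64.8% of the time, and the figure keeps rising as models are added. The gain shrinks when the models' errors are correlated, which is why ensembles benefit from diverse base models.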

# Libraries

# 1. Data Handling and Visualization Libraries
numpy (np): A fundamental package for numerical computing in Python, often used for handling arrays and performing mathematical operations.

pandas (pd): A data analysis library used for handling structured data (DataFrames), enabling data manipulation and cleaning.

matplotlib.pyplot (plt): A plotting library that provides tools for creating static, animated, and interactive visualizations.

seaborn (sns): A statistical data visualization library that builds on matplotlib and provides attractive, informative graphs.

# 2. Data Preprocessing and Splitting
train_test_split: A function from sklearn.model_selection that splits a dataset into training and testing sets to evaluate model performance.

# 3. Machine Learning Models
KNeighborsClassifier: Implements the k-Nearest Neighbors (k-NN) algorithm, a non-parametric method used for classification based on the nearest training samples.

LogisticRegression: A linear model that applies the logistic function to solve binary or multi-class classification problems.

DecisionTreeClassifier: A tree-based model that splits data based on feature conditions to classify instances.

SVC (Support Vector Classifier): Implements Support Vector Machines (SVM), which find the best hyperplane to separate data into different classes.

# 4. Ensemble Learning Methods
RandomForestClassifier: An ensemble learning method that creates multiple decision trees and combines their outputs to improve accuracy and reduce overfitting.

AdaBoostClassifier: An adaptive boosting algorithm that combines weak classifiers iteratively to improve model performance.

BaggingClassifier: Implements bootstrap aggregating (bagging), which trains multiple copies of a base model on different bootstrap samples of the data and aggregates their predictions.

ExtraTreesClassifier: A variant of Random Forest that draws split thresholds at random instead of searching for the best one, which adds randomness and typically reduces variance further.

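VotingClassifier: Combines multiple classifiers either by majority vote on the predicted labels (hard voting) or by averaging the predicted class probabilities (soft voting). For regression, scikit-learn provides the separate VotingRegressor, which averages predictions.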

StackingClassifier: A method that stacks multiple models together and uses a meta-model to combine their predictions for better performance.

# 5. Performance Evaluation
accuracy_score: A function that computes the accuracy of classification models by comparing predicted and actual labels.


In [21]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import (
    RandomForestClassifier,
    AdaBoostClassifier,
    BaggingClassifier,
    ExtraTreesClassifier,
    VotingClassifier,
    StackingClassifier,
)
from sklearn.metrics import accuracy_score

The command

df = pd.read_csv('Tshirt Dataset.csv')

loads a CSV file named "Tshirt Dataset.csv" into a Pandas DataFrame (df).

In [22]:
df=pd.read_csv('Tshirt Dataset.csv')

The df.head() function in Pandas displays the first five rows of the DataFrame (df) by default. It helps in quickly inspecting the structure and contents of the dataset.

In [24]:
df.head()

Unnamed: 0,Height (in cms),Weight (in kgs),T Shirt Size
0,158,58,S
1,158,59,S
2,158,63,S
3,160,59,S
4,160,60,S


# Understanding: x = df.iloc[:, 0:2].values

# What it does:

a) Selects all rows (:) and the first two columns (0:2).

b) Extracts their values as a NumPy array.

# Effect:

a) Converts the first two columns of df into an array.

b) Stores it in x.


# Explanation of: y = df.iloc[:, 2].values

# What it does:

a) Selects all rows (:) and only the third column (2).

b) Extracts its values as a NumPy array.

# Effect:

a) Converts the third column into a 1D NumPy array.

b) Stores it in y.

In [25]:
x = df.iloc[: , 0:2].values
y = df.iloc[:,2].values
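To see what these two slices return, here is a minimal sketch on a hypothetical three-row frame. It assumes the frame holds exactly two measurement columns followed by the size label; the variable names `toy`, `x_toy`, and `y_toy` are made up for illustration.

```python
import pandas as pd

# Hypothetical three-row frame that mimics the assumed column layout
toy = pd.DataFrame({
    "Height (in cms)": [158, 160, 170],
    "Weight (in kgs)": [58, 60, 68],
    "T Shirt Size": ["S", "S", "L"],
})

x_toy = toy.iloc[:, 0:2].values  # columns at positions 0 and 1 -> shape (3, 2)
y_toy = toy.iloc[:, 2].values    # column at position 2 -> shape (3,)

print(x_toy.shape, y_toy.shape)
print(y_toy)
```

Because `iloc` selects by position, an extra leading column in the CSV (for example a saved index) would shift every position by one, so it is worth checking `df.columns` before slicing.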

# Purpose of train_test_split
1. The function train_test_split() from sklearn.model_selection is used to split a dataset into training and testing subsets.
2. This is essential in machine learning to evaluate model performance.
3. The dataset is divided into: (a) Training Set (80%): Used to train the model and (b) Test Set (20%): Used to evaluate model performance.

# Breaking Down the Code: X_train, X_test, y_train, y_test

X_train → Training data (features)

X_test → Testing data (features)

y_train → Training labels (target variable)

y_test → Testing labels (target variable)

train_test_split(x, y, test_size=0.20)

x: Feature variables (independent variables).

y: Target variable (dependent variable).

test_size=0.20: 20% of the data will be used for testing, and 80% for training.


# Why Split the Data?

1. Detects Overfitting: The model learns from the training data but is tested on unseen data, so a large gap between training and test performance reveals overfitting.

2. Checks Generalization: If the model performs well on test data, it's more likely to work on real-world data.

3. Evaluates Performance: We compare the predictions with y_test to calculate accuracy.


In [26]:
X_train , X_test , y_train , y_test = train_test_split(x,y,test_size=.20)
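The split above is random, so each run produces different subsets and possibly different scores. As a sketch of one way to make the split repeatable, the example below uses the `random_state` and `stratify` parameters of `train_test_split` on hypothetical data; the arrays `X_demo` and `y_demo` are invented for illustration and are not the T-shirt dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples, 2 features, two balanced classes
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array(["S"] * 10 + ["L"] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.20,
    random_state=42,   # fixed seed -> the same split on every run
    stratify=y_demo,   # keep the S/L ratio the same in both subsets
)

print(X_tr.shape, X_te.shape)
print(np.unique(y_te, return_counts=True))
```

Stratifying matters most on small or imbalanced datasets, where a purely random 20% test set can end up with very few samples of one class.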

# Support Vector Machine (SVM) classifier
The provided code builds a machine learning model using a Support Vector Machine (SVM) classifier to make predictions and evaluate its accuracy.

1. First, an instance of SVC() (Support Vector Classifier) from sklearn.svm is created and assigned to the variable svm.
2. The model is then trained using the fit() method, where X_train (the training features) and y_train (the corresponding labels) are provided.
3. Once the model is trained, it is used to make predictions on unseen test data (X_test) using the predict() method, generating the predicted labels stored in y_pred.
4. Finally, the performance of the model is evaluated using the accuracy_score() function from sklearn.metrics, which compares the predicted labels (y_pred) against the actual test labels (y_test) to calculate the accuracy of the model.
5. This accuracy metric reflects the proportion of correctly classified instances, indicating how well the model generalizes to new data.

In [27]:
# build the model
svm = SVC()
svm.fit(X_train,y_train)

# make predictions
y_pred = svm.predict(X_test)

# get accuracy
accuracy_score(y_test,y_pred)

0.5
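For reference, the accuracy reported above is simply the fraction of test labels predicted correctly. The sketch below reproduces that calculation by hand on hypothetical labels (not taken from the T-shirt dataset) and compares it with `accuracy_score`.

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels, not taken from the T-shirt dataset
y_true_demo = ["S", "S", "L", "L"]
y_pred_demo = ["S", "L", "L", "L"]

# Accuracy by hand: correct predictions divided by total predictions
manual = sum(t == p for t, p in zip(y_true_demo, y_pred_demo)) / len(y_true_demo)

print(manual)
print(accuracy_score(y_true_demo, y_pred_demo))
```

With only four test samples, accuracy can move only in steps of 0.25, so scores measured on a very small test set should be read as rough indications rather than precise estimates.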

# Explanation (next code cell):

In this implementation, multiple machine learning models are created, including Logistic Regression (LR), Decision Tree (DT), Support Vector Machine (SVM), and K-Nearest Neighbors (KNN). However, only three of them (Logistic Regression, Decision Tree, and SVM) are selected to form an ensemble model using VotingClassifier; the KNN model is created but not added to the ensemble.

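The Voting Classifier is an ensemble learning technique that combines the predictions of multiple models to improve overall performance. Here, the models are added to a list (model_list) of (name, estimator) pairs and passed to VotingClassifier, which aggregates their predictions. Because no voting argument is given, the default hard voting is used, so the final label is the one predicted by the majority of the models. The n_jobs=-1 parameter lets scikit-learn fit the individual models in parallel on all available CPU cores.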

Once the voting classifier is initialized, it is trained on the dataset using fit(X_train, y_train), where X_train represents the input features and y_train represents the target variable. After training, the model makes predictions on the test set using predict(X_test). Finally, the accuracy of the ensemble model is calculated using accuracy_score(y_test, y_pred), which compares the predicted labels with the actual labels in y_test. By leveraging multiple models, this ensemble approach helps improve predictive performance and robustness compared to individual classifiers.

In [33]:
## Create several model types
lr = LogisticRegression()
dt = DecisionTreeClassifier()
svm = SVC()
knn = KNeighborsClassifier()


## create a voting classifier
model_list = [('lr',lr),('dt',dt),('svm',svm)]

v = VotingClassifier(
    estimators = model_list ,
    n_jobs=-1
)


# train the voting classifier
v.fit(X_train,y_train)


# make predictions
y_pred = v.predict(X_test)


# get model accuracy
accuracy_score(y_test,y_pred)

0.75
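The cell above relies on the default hard voting. As a point of comparison, the sketch below fits both hard and soft voting ensembles. It uses the Iris dataset bundled with scikit-learn as a stand-in for the T-shirt data, and the helper `make_estimators` is a hypothetical name. Soft voting averages class probabilities, so `SVC` is created with `probability=True`.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Iris is a stand-in dataset bundled with scikit-learn
X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.20, random_state=0, stratify=y_iris
)


def make_estimators():
    # Fresh, unfitted models; SVC needs probability=True for soft voting
    return [
        ("lr", LogisticRegression(max_iter=1000)),
        ("dt", DecisionTreeClassifier(random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),
    ]


scores = {}
for mode in ("hard", "soft"):
    clf = VotingClassifier(estimators=make_estimators(), voting=mode)
    clf.fit(X_tr, y_tr)
    scores[mode] = accuracy_score(y_te, clf.predict(X_te))

print(scores)
```

Which mode performs better depends on the data and on how well calibrated the base models' probabilities are, so it is usually worth trying both.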

# Explanation (next code cell):

In this implementation, BaggingClassifier is used as an ensemble learning technique to improve the performance and stability of a Decision Tree Classifier (DT). Bagging (Bootstrap Aggregating) works by training multiple instances of the same model on different subsets of the training data, which are sampled with replacement. This helps in reducing variance and preventing overfitting, particularly for high-variance models like decision trees.

Here, the BaggingClassifier is initialized with dt (a Decision Tree Classifier) as the base estimator and n_estimators=9, meaning that nine different decision trees will be trained on different bootstrapped subsets of the dataset. The classifier is then trained using bc.fit(X_train, y_train), where X_train represents the input features and y_train represents the target labels.

After training, predictions are made on the test set using bc.predict(X_test), which aggregates the predictions of all the individual decision trees (by averaging their predicted class probabilities, or by majority vote when the base model cannot output probabilities). Finally, the model's performance is evaluated using accuracy_score(y_test, y_pred), which calculates the proportion of correctly classified instances. By combining the predictions of multiple decision trees, bagging helps to enhance generalization and reduce the likelihood of overfitting compared to a single decision tree.

In [29]:
# bagging
bc = BaggingClassifier(
    estimator=dt,  # base model to bag (named `estimator` in scikit-learn >= 1.2, formerly `base_estimator`)
    n_estimators=9
)

# fit the classifier
bc.fit(X_train,y_train)


# make predictions
y_pred = bc.predict(X_test)


# get model accuracy
accuracy_score(y_test,y_pred)

0.75
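The "sampled with replacement" step that bagging relies on can be illustrated with NumPy alone. This is a minimal sketch of drawing one bootstrap sample of row indices, not a reproduction of scikit-learn's internals; the names `boot` and `oob` are chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10
indices = np.arange(n_samples)

# One bootstrap sample: draw n_samples row indices *with* replacement
boot = rng.choice(indices, size=n_samples, replace=True)

# Rows never drawn are "out-of-bag" for this particular model
oob = np.setdiff1d(indices, boot)

print("bootstrap rows :", np.sort(boot))
print("out-of-bag rows:", oob)
```

Each of the nine trees in the ensemble sees a different sample like this one, with some rows repeated and others left out, which is what makes the trees differ from one another.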

# Explanation (next code cell):
In this implementation, a StackingClassifier is used, which is an ensemble method that combines multiple base models and uses a final estimator to make predictions. The base classifiers, consisting of Logistic Regression (lr), K-Nearest Neighbors (knn), Decision Tree Classifier (dt), and Support Vector Machine (svm), are trained on the same dataset. Each base model makes its own predictions based on the input features, and these predictions are then passed to a final estimator, which is another machine learning model. In this case, the final estimator is a Support Vector Machine (SVC), which combines the predictions from the base models to make the final decision.

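The StackingClassifier is initialized with the base models and the final estimator. The cv=3 parameter specifies that 3-fold cross-validation is used to generate out-of-fold predictions from the base models; the final estimator is trained on these predictions rather than on predictions the base models made for data they had already seen, which helps avoid overfitting. The base models themselves are then refitted on the full training set. The classifier is trained using sc.fit(X_train, y_train), where X_train represents the input features and y_train represents the target labels.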

After training, the model makes predictions on the test set using sc.predict(X_test), and its performance is evaluated using accuracy_score(y_test, y_pred), which calculates the proportion of correct predictions. Stacking enhances the performance of individual models by leveraging their diversity and combining their strengths, leading to more accurate predictions.

In [30]:
### Stacking
base_classifiers = [
    ('lr' , LogisticRegression()) ,
    ('knn',KNeighborsClassifier()),
    ('dt' , DecisionTreeClassifier()),
    ('svm' , SVC())
]


# create stacking classifier
sc = StackingClassifier(
    estimators=base_classifiers,
    final_estimator=SVC(),
    cv=3
)


# fit the classifier
sc.fit(X_train,y_train)

# make predictions
y_pred = sc.predict(X_test)

# get model accuracy
accuracy_score(y_test,y_pred)

0.75
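To make the mechanics of stacking more tangible, the sketch below performs the main steps by hand with `cross_val_predict`. It is a simplified illustration, not the exact procedure inside `StackingClassifier`: it uses the bundled Iris dataset as a stand-in, only two base models, and a logistic regression as the final estimator. All of these choices are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.20, random_state=0, stratify=y_iris
)

base_models = [
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(random_state=0),
]

# Step 1: out-of-fold class probabilities become the meta-model's features
meta_train = np.hstack([
    cross_val_predict(m, X_tr, y_tr, cv=3, method="predict_proba")
    for m in base_models
])

# Step 2: refit every base model on the full training set
for m in base_models:
    m.fit(X_tr, y_tr)

# Step 3: train the meta-model on the out-of-fold predictions
meta_model = LogisticRegression(max_iter=1000).fit(meta_train, y_tr)

# Prediction: base-model probabilities on the test set feed the meta-model
meta_test = np.hstack([m.predict_proba(X_te) for m in base_models])
stack_pred = meta_model.predict(meta_test)
print((stack_pred == y_te).mean())
```

The key point is Step 1: the meta-model never sees predictions that a base model made for rows it was trained on, which is the role the `cv` argument plays.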

# Explanation (next code cell):
In this implementation, the StratifiedKFold cross-validation technique is used with the StackingClassifier to improve the training process. StratifiedKFold is a variation of K-fold cross-validation where the data is split into n_splits (in this case, 5) while maintaining the same proportion of target class labels in each fold. This ensures that each fold is representative of the overall class distribution, which is particularly useful when dealing with imbalanced datasets.

The StackingClassifier itself is an ensemble method that combines multiple base models and a final estimator to make predictions. In this case, the base classifiers are Logistic Regression, K-Nearest Neighbors, Decision Tree Classifier, and Support Vector Machine. The final estimator used to combine the predictions of the base models is an SVC (Support Vector Classifier).

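By passing cv=StratifiedKFold(n_splits=5), the stacking model generates the out-of-fold predictions used to train the final estimator from five folds whose class distribution matches that of the full training set. This helps the final estimator generalize and reduces the risk of overfitting. Note that an integer cv (such as cv=3 earlier) already produces stratified folds for classifiers, so the explicit object mainly changes the number of folds and allows further options such as shuffling. The cell below only constructs the classifier; it still has to be trained with sc.fit(X_train, y_train) before it can make predictions.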

In [34]:
from sklearn.model_selection import StratifiedKFold

sc = StackingClassifier(
    estimators=base_classifiers,
    final_estimator=SVC(),
    cv=StratifiedKFold(n_splits=5)  # using StratifiedKFold
)
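The effect of stratification is easiest to see by counting labels per fold. The sketch below uses hypothetical imbalanced labels (twelve samples of one class, six of the other) and three folds; the arrays `X_demo` and `y_demo` are invented for this illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 12 samples of class 0, 6 of class 1
y_demo = np.array([0] * 12 + [1] * 6)
X_demo = np.zeros((len(y_demo), 1))  # features are irrelevant to the split

skf = StratifiedKFold(n_splits=3)
fold_counts = [
    np.bincount(y_demo[test_idx]) for _, test_idx in skf.split(X_demo, y_demo)
]

for i, counts in enumerate(fold_counts):
    print(f"fold {i}: class 0 = {counts[0]}, class 1 = {counts[1]}")
```

Every validation fold keeps the 2:1 class ratio of the full label set, which is the guarantee described above.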

# Explanation (next code cell):

In this code, an AdaBoostClassifier is created with the parameter n_estimators=10, meaning the ensemble is built from at most 10 weak learners (by default, decision trees of depth 1, also called decision stumps). AdaBoost (Adaptive Boosting) is an ensemble learning technique that combines multiple weak learners to form a stronger model. It works by fitting a sequence of models, where each subsequent model is trained to correct the errors made by the previous models. The final prediction is made by taking a weighted vote of the individual model predictions.

The AdaBoostClassifier is then trained on the training data (X_train and y_train) using the fit() method. This process involves adjusting the weights of the training samples so that subsequent classifiers focus more on the misclassified instances.

After the model is trained, predictions are made on the test set (X_test) using the predict() method. The predicted values are then compared to the true values (y_test) to evaluate the model's performance using the accuracy_score() function. This function computes the accuracy by calculating the proportion of correct predictions out of all predictions made, providing a measure of the model's effectiveness on unseen data.

In [32]:
# create adaboost classifier
adm = AdaBoostClassifier(n_estimators=10)

# fit the classifier
adm.fit(X_train,y_train)

# make predictions
y_pred = adm.predict(X_test)

# get model accuracy
accuracy_score(y_test,y_pred)

0.75
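The sample reweighting described above can be sketched for a single boosting round. The example follows the classic binary AdaBoost formulation with labels in {-1, +1}; it is an illustration under that assumption and not scikit-learn's exact multi-class implementation. The arrays are hypothetical.

```python
import numpy as np

# Hypothetical binary problem with labels in {-1, +1}
y_true = np.array([+1, +1, -1, -1, +1])
y_weak = np.array([+1, +1, -1, -1, -1])  # weak learner gets the last one wrong

weights = np.full(len(y_true), 1 / len(y_true))  # start with uniform weights

# Weighted error of the weak learner
err = weights[y_true != y_weak].sum()

# The learner's say in the final vote: larger when its error is smaller
alpha = 0.5 * np.log((1 - err) / err)

# Raise the weights of misclassified samples, lower the rest, renormalise
weights = weights * np.exp(-alpha * y_true * y_weak)
weights /= weights.sum()

print(np.round(weights, 3))
```

After this round the single misclassified sample carries half of the total weight, so the next weak learner is pushed to classify it correctly.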

# Challenging Task

# Problem Statement:

The Heart Disease Dataset (heart_dataset_complete.xlsx) contains data related to individuals' health and medical history, with the goal of predicting the likelihood of a person developing heart disease. This dataset includes various features such as age, sex, blood pressure, cholesterol levels, and other cardiovascular-related metrics.

The objective of this case study is to build a predictive model that can accurately predict whether a person is likely to have heart disease based on these attributes. The task is to use the dataset to:

1. Understand the key factors that contribute to heart disease risk.
2. Develop a machine learning model (e.g., logistic regression, decision trees, support vector machines, ensemble methods) that can predict the presence or absence of heart disease.
3. Evaluate the model’s performance using appropriate metrics such as accuracy, precision, recall, and F1-score.
4. Provide insights on how the identified features contribute to heart disease and what interventions might reduce the risk.

This predictive analysis can be useful for healthcare providers and medical professionals to identify individuals who are at high risk of heart disease, enabling early intervention and better health management.
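As a starting point for the evaluation step, the metrics named in the task can be computed with scikit-learn as sketched below. The labels are hypothetical toy values (1 = heart disease present, 0 = absent), not results from the heart disease dataset, and the dictionary name `metrics` is chosen for illustration.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical labels: 1 = heart disease present, 0 = absent
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = {
    "accuracy": accuracy_score(y_true_demo, y_pred_demo),
    "precision": precision_score(y_true_demo, y_pred_demo),  # TP / (TP + FP)
    "recall": recall_score(y_true_demo, y_pred_demo),        # TP / (TP + FN)
    "f1": f1_score(y_true_demo, y_pred_demo),                # harmonic mean of the two
}

for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

In a medical screening setting, recall is often watched closely because a missed positive case (a false negative) is usually more costly than a false alarm.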

# Expectation:

1. It is expected that you will successfully complete this challenging task and submit your Python notebook for review.

2. I encourage you to attempt this on your own.

3. To complete this challenging task, you need a solid understanding of the fundamentals of AI model development, and I believe you have successfully grasped these concepts.

# Dataset

1. heart_dataset_complete.xlsx is available in Brightspace under Week7 --> Lab --> Dataset