# Feature Selection: Wrapper Methods

## Dataset Overview

The "goal" field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1,2,3,4) from absence (value 0).
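
The presence/absence distinction described above amounts to collapsing the values 1 to 4 into a single positive class. A minimal sketch, assuming a toy series in place of the real `num` column (the values and the name `num_binary` are illustrative only):

```python
import pandas as pd

# Toy stand-in for the 'num' column (values 0-4); not the real data.
num = pd.Series([0, 2, 1, 0, 4, 3, 0], name="num")

# Any value above 0 counts as "disease present" (1); 0 stays "absent" (0).
num_binary = (num > 0).astype(int)

print(num_binary.tolist())
```

The notebook below uses `num` as loaded; a binarization step like this would only be needed if the loaded column still contains the values 2 to 4.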


Attribute information (only 14 attributes are used):

1. #3 (age): Age in years.
2. #4 (sex): Biological sex (1 = male; 0 = female).
3. #9 (cp): Chest pain type (typical angina, atypical angina, non-anginal pain, or asymptomatic).
4. #10 (trestbps): Resting blood pressure (mm Hg on admission to the hospital).
5. #12 (chol): Serum cholesterol (mg/dl).
6. #16 (fbs): Fasting blood sugar > 120 mg/dl (1 = true; 0 = false).
7. #19 (restecg): Resting electrocardiographic results, including ST-T wave abnormalities or left ventricular hypertrophy.
8. #32 (thalach): Maximum heart rate achieved.
9. #38 (exang): Exercise-induced angina (1 = yes; 0 = no).
10. #40 (oldpeak): ST depression induced by exercise relative to rest.
11. #41 (slope): Slope of the peak exercise ST segment.
12. #44 (ca): Number of major vessels (0-3) colored by fluoroscopy.
13. #51 (thal): Thallium stress test result.
14. #58 (num): The target, diagnosis of heart disease:
    - 0: < 50% diameter narrowing (no disease)
    - 1: > 50% diameter narrowing (presence of disease)

https://www.openml.org/search?type=data&sort=runs&status=active&qualities.NumberOfClasses=lte_1&id=204

## Import Libraries

In [13]:
import pandas as pd
import numpy as np
from scipy.io import arff
import matplotlib.pyplot as plt
import seaborn as sns
import time

# sklearn
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

## Forward Selection

## Load Dataset

In [14]:
# pandas has no built-in read_arff() function, but an ARFF file
# can be read into a DataFrame with an external library
# such as SciPy or liac-arff.

PATH_CVS = '/home/ramses2099/Sources/IAProject/machine_learning/data/dataset_2190_cholesterol.arff'

# Load the ARFF file
# loadarff returns a tuple: the first element is the data, the second is the metadata
arff_file = arff.loadarff(PATH_CVS)

# Convert the data part of the result into a pandas DataFrame
df = pd.DataFrame(arff_file[0])

# Name of the target column
TARGET_COLMN = 'num'

In [15]:
# Split the columns into categorical (byte strings from the ARFF file) and numerical
CAT_COLUMNS = df.select_dtypes(include='object').columns.to_list()
NUM_COLUMNS = df.select_dtypes(exclude='object').columns.to_list()

print(f"Categorical Columns: {CAT_COLUMNS}")
print(f"Numerical Columns: {NUM_COLUMNS}")

# Manual label encoding: byte-string categories -> integers.
# b'?' marks a missing value and is mapped to the placeholder code 5.
mapping = {b'0': 0, b'1': 1, b'2': 2, b'3': 3, b'4': 4, b'6': 6, b'7': 7, b'?': 5}

for col in CAT_COLUMNS:
    df[col] = df[col].map(mapping)

# Fill the 4 missing values in column 'ca' with 0,
# leaving the other columns of those rows unchanged
df['ca'] = df['ca'].fillna(0)

Categorical Columns: ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'thal']
Numerical Columns: ['age', 'trestbps', 'thalach', 'oldpeak', 'ca', 'num', 'chol']
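
The categorical columns show up as `object` because `scipy.io.arff.loadarff` returns nominal attributes as byte strings. As an alternative to a hand-written mapping, the bytes can be decoded and converted to numbers, with `?` becoming a real missing value. A minimal sketch on a toy frame (the frame `toy` and its values are made up for illustration):

```python
import pandas as pd

# Toy frame mimicking what loadarff yields for nominal attributes (byte strings).
toy = pd.DataFrame({
    "thal": [b"3", b"7", b"?", b"6"],
    "sex": [b"1", b"0", b"1", b"1"],
})

# Decode bytes to text, column by column.
decoded = toy.apply(lambda col: col.str.decode("utf-8"))

# Convert to numbers; anything non-numeric (here '?') becomes NaN.
numeric = decoded.apply(pd.to_numeric, errors="coerce")

print(numeric)
print(numeric.isna().sum())
```

Keeping missing values as `NaN` makes the imputation choice explicit later on, whereas a placeholder code such as 5 is treated by the model as just another category.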


In [16]:
# Prepare feature and target sets
X = df.drop(TARGET_COLMN, axis=1)
y = df[TARGET_COLMN]

# Note: label encoding was already applied manually in the previous cell,
# so the LabelEncoder version below is kept only for reference.
# Encode categorical features
# le = LabelEncoder()
# for col in CAT_COLUMNS:
#     X[col] = le.fit_transform(X[col])

In [17]:
# define the model
model = DecisionTreeClassifier()

# Forward selection
start_time = time.time()
sfs = SequentialFeatureSelector(estimator=model, n_features_to_select='auto', direction='forward', cv=3)
sfs = sfs.fit(X, y)

end_time = time.time()
print(f"Time taken: {end_time - start_time:.2f} seconds")

Time taken: 0.41 seconds


In [18]:
# Selected features
selected_features_forward = list(X.columns[sfs.get_support()])
print("Selected features by Forward Selection:", selected_features_forward)

Selected features by Forward Selection: ['sex', 'cp', 'fbs', 'slope', 'ca', 'thal']
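
To make the greedy procedure behind forward selection concrete, here is a hand-rolled sketch of the same idea: start with no features and repeatedly add the one that gives the best cross-validated score. It runs on synthetic data from `make_classification`, and the helper `forward_select` is a hypothetical name, not part of scikit-learn or of this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=3, random_state=0
)

def forward_select(X, y, n_select, cv=3):
    """Greedy forward selection: add the feature with the best mean CV score."""
    selected = []
    remaining = list(range(X.shape[1]))
    while len(selected) < n_select:
        scores = {}
        for j in remaining:
            model = DecisionTreeClassifier(random_state=0)
            scores[j] = cross_val_score(model, X[:, selected + [j]], y, cv=cv).mean()
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

chosen = forward_select(X_demo, y_demo, n_select=4)
print("Selected column indices:", chosen)
```

Each round refits the model once per remaining feature and per fold, which is why wrapper methods cost more than filter methods as the feature count grows.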


## Backward Elimination

In [21]:
# define the model
model = DecisionTreeClassifier()

# Backward elimination
start_time = time.time()
sfs = SequentialFeatureSelector(estimator=model, n_features_to_select='auto', direction='backward', cv=3)
sfs = sfs.fit(X, y)

end_time = time.time()
print(f"Time taken: {end_time - start_time:.2f} seconds")

Time taken: 0.48 seconds


In [22]:
# Selected features
selected_features_backward = list(X.columns[sfs.get_support()])
print("Selected features by Backward Elimination:", selected_features_backward)

Selected features by Backward Elimination: ['age', 'cp', 'thalach', 'exang', 'slope', 'ca', 'chol']
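
Two details affect how these results should be read. First, forward selection kept 6 of the 13 features and backward elimination kept 7, which is consistent with scikit-learn's documented behaviour that `n_features_to_select='auto'` with no `tol` selects about half of the features. Second, `DecisionTreeClassifier()` is created without a `random_state`, so the selected subset may change between runs. A hedged sketch on synthetic data that fixes both the subset size and the seed (the helper `backward_support` is hypothetical):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=3, random_state=0
)

def backward_support(seed):
    """Run backward elimination with a seeded tree and an explicit subset size."""
    model = DecisionTreeClassifier(random_state=seed)
    selector = SequentialFeatureSelector(
        model, n_features_to_select=4, direction="backward", cv=3
    )
    return selector.fit(X_demo, y_demo).get_support()

first = backward_support(seed=42)
second = backward_support(seed=42)

print("Features kept:", first.sum())
print("Same subset on both runs:", (first == second).all())
```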


## Recursive Feature Elimination

In [25]:
from sklearn.feature_selection import RFE

# define the model
model = DecisionTreeClassifier()

# Recursive feature elimination
start_time = time.time()
rfe = RFE(estimator=model, n_features_to_select=10)
rfe = rfe.fit(X, y)
end_time = time.time()
print(f"Time taken: {end_time - start_time:.2f} seconds")

# Selected features
selected_features_rfe = X.columns[rfe.get_support()].to_list()
print("Selected features by Recursive Feature Elimination (RFE):" , selected_features_rfe)

Time taken: 0.01 seconds
Selected features by Recursive Feature Elimination (RFE): ['age', 'cp', 'trestbps', 'fbs', 'thalach', 'oldpeak', 'slope', 'ca', 'thal', 'chol']
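
RFE runs much faster than the sequential selectors because it does not cross-validate candidate subsets: it fits the estimator, drops the feature with the lowest importance (`feature_importances_` for a tree), and repeats. The fitted object also exposes the elimination order through `ranking_`. A small sketch on synthetic data (the names `X_demo` and `rfe_demo` are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=3, random_state=0
)

rfe_demo = RFE(
    estimator=DecisionTreeClassifier(random_state=0),
    n_features_to_select=3,
    step=1,  # remove one feature per iteration
)
rfe_demo.fit(X_demo, y_demo)

# ranking_ is 1 for kept features; larger values were eliminated earlier.
for idx, (kept, rank) in enumerate(zip(rfe_demo.support_, rfe_demo.ranking_)):
    print(f"feature {idx}: kept={kept}, rank={rank}")
```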


## Exhaustive Feature Search

In [28]:
import math
subset_size = 5
full_feature_set = 10
print(f'The number of possible subsets of size {subset_size} for a set of {full_feature_set} features is')
math.comb(full_feature_set, subset_size)

The number of possible subsets of size 5 for a set of 10 features is


252
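
The count above covers a single subset size. If every size from 1 to n were searched, the counts would add up to 2^n - 1, which is why exhaustive search is normally restricted to a narrow size range. A quick check with the standard library:

```python
import math

n_features = 10

# Number of subsets for each size k = 1..n
per_size = {k: math.comb(n_features, k) for k in range(1, n_features + 1)}
total = sum(per_size.values())

print(per_size[5])  # 252
print(total)        # 1023, i.e. 2**10 - 1
```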

In [32]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

from mlxtend.feature_selection import ExhaustiveFeatureSelector

# Define the model
model = DecisionTreeClassifier(random_state=42, max_depth=10)

start_time = time.time()
features = ['age', 'cp', 'thalach', 'exang', 'slope', 'ca', 'chol']

efs = ExhaustiveFeatureSelector(model, min_features=5, max_features=5, scoring='accuracy', print_progress=True, cv=3, n_jobs=-1)
efs = efs.fit(X[features], y)
end_time = time.time()
print(f"Time taken: {end_time - start_time:.2f} seconds")

# Selected features
selected_features_exhaustive = list(efs.best_feature_names_)
print("Selected features by Exhaustive Feature Search:", selected_features_exhaustive)

Features: 21/21

Time taken: 3.31 seconds
Selected features by Exhaustive Feature Search: ['age', 'cp', 'exang', 'slope', 'ca']
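
The progress line `Features: 21/21` matches the number of 5-element subsets of the 7 candidate features, C(7, 5) = 21. The same search can be sketched without mlxtend by enumerating subsets with `itertools.combinations` and scoring each one with cross-validation. This is an illustrative equivalent on synthetic data, not mlxtend's implementation:

```python
from itertools import combinations

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 7 features, used only for illustration.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=7, n_informative=3, random_state=0
)

results = {}
for subset in combinations(range(X_demo.shape[1]), 5):
    model = DecisionTreeClassifier(random_state=42, max_depth=10)
    scores = cross_val_score(
        model, X_demo[:, list(subset)], y_demo, cv=3, scoring="accuracy"
    )
    results[subset] = scores.mean()

best_subset = max(results, key=results.get)
print("Subsets evaluated:", len(results))
print("Best subset:", best_subset, "accuracy:", round(results[best_subset], 3))
```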
