## Feature Selection: Embedded Methods

## Dataset Overview

The "goal" field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1,2,3,4) from absence (value 0).
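
The presence-versus-absence simplification described above can be sketched as a one-line binarization. This is only an illustration: `num_demo` is a made-up stand-in for the `num` column, not data from the Cleveland database.

```python
import pandas as pd

# Hypothetical sample of "goal" values in the range 0..4 (not real patient data)
num_demo = pd.Series([0, 2, 1, 0, 4, 3], name="num")

# Collapse values 1-4 into a single "presence" class (1); 0 stays "absence"
target_binary = (num_demo > 0).astype(int)

print(target_binary.tolist())
```

Whether to binarize depends on the experiment; the cells in this notebook pass `num` to the model as it is loaded.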


Attribute information (only 14 attributes are used):

1. #3 (age) Age in years.
2. #4 (sex) Biological sex (1 = male; 0 = female).
3. #9 (cp) Chest pain type: typical angina, atypical angina, non-anginal pain, or asymptomatic.
4. #10 (trestbps) Resting blood pressure (mm Hg on admission to the hospital).
5. #12 (chol) Serum cholesterol (mg/dl).
6. #16 (fbs) Fasting blood sugar > 120 mg/dl (1 = true; 0 = false).
7. #19 (restecg) Resting electrocardiographic results, including ST-T wave abnormality or left ventricular hypertrophy.
8. #32 (thalach) Maximum heart rate achieved.
9. #38 (exang) Exercise-induced angina (1 = yes; 0 = no).
10. #40 (oldpeak) ST depression induced by exercise relative to rest.
11. #41 (slope) Slope of the peak exercise ST segment.
12. #44 (ca) Number of major vessels colored by fluoroscopy.
13. #51 (thal) Thallium stress test results.
14. #58 (num) Target: diagnosis of heart disease (angiographic disease status).
- 0: < 50% diameter narrowing (no disease)
- 1: > 50% diameter narrowing (presence of disease)

https://www.openml.org/search?type=data&sort=runs&status=active&qualities.NumberOfClasses=lte_1&id=204

## Import Libraries

In [5]:
import pandas as pd
import numpy as np
from scipy.io import arff
import matplotlib.pyplot as plt
import seaborn as sns
import time

# sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso, LogisticRegression
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error, explained_variance_score
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold
from sklearn.decomposition import PCA
import statsmodels.api as sm
import scipy.stats as stats
from statsmodels.stats.outliers_influence import variance_inflation_factor
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

## Load Dataset

In [6]:
# While pandas does not have a built-in read_arff()
# function, you can read an ARFF file into a pandas
# DataFrame using external libraries such as SciPy or liac-arff.

PATH_CVS ='/home/ramses2099/Sources/IAProject/machine_learning/data/dataset_2190_cholesterol.arff'

# Load the ARFF file
# loadarff returns a tuple: the first element is the data, the second is the metadata
arff_file = arff.loadarff(PATH_CVS)

# Convert the data part of the result into a pandas DataFrame
df = pd.DataFrame(arff_file[0])

TARGET_COLMN ='num'

## Embedded Methods

In [7]:
# Split the columns into categorical (object dtype) and numerical
CAT_COLUMNS = df.select_dtypes(include='object').columns.to_list()
NUM_COLUMNS = df.select_dtypes(exclude='object').columns.to_list()

print(f"Categorical Columns: {CAT_COLUMNS}")
print(f"Numerical Columns: {NUM_COLUMNS}")

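# Manual label encoding: loadarff returns nominal values as byte strings.
# Note that the missing-value marker b'?' is mapped to the code 5.
mapping = {b'0': 0, b'1': 1, b'2': 2, b'3': 3, b'4': 4, b'6': 6, b'7': 7, b'?': 5}

for col in CAT_COLUMNS:
    df[col] = df[col].map(mapping)

# Fill the 4 missing values in column 'ca' with 0
# (only that column; the original row-wise assignment zeroed every column, including the target)
df['ca'] = df['ca'].fillna(0)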

Categorical Columns: ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'thal']
Numerical Columns: ['age', 'trestbps', 'thalach', 'oldpeak', 'ca', 'num', 'chol']
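
The byte-string keys in the mapping exist because `scipy.io.arff.loadarff` returns nominal attributes as bytes. As an alternative sketch, and not what this notebook does, the bytes can be decoded and the `?` marker turned into `NaN` so that missing values stay visible to a later imputation step. The `demo` frame below is invented for illustration.

```python
import pandas as pd

# Hypothetical column shaped like a nominal attribute returned by loadarff
demo = pd.DataFrame({"thal": [b"3", b"7", b"?", b"6"]})

# Decode bytes to text, then parse as numbers; "?" cannot be parsed and becomes NaN
decoded = demo["thal"].str.decode("utf-8")
demo["thal"] = pd.to_numeric(decoded, errors="coerce")

print(demo["thal"].isna().sum(), "missing value(s)")
```

Keeping `?` as `NaN` avoids treating "missing" as an ordinary category code, at the cost of needing an explicit imputation step before fitting.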


In [8]:
# Prepare feature and target sets
X = df.drop(TARGET_COLMN, axis=1)
y = df[TARGET_COLMN]

# Note: the categorical features were already label-encoded manually in the previous cell,
# so the LabelEncoder version below is kept only for reference.
# le = LabelEncoder()
# for col in CAT_COLUMNS:
#     X[col] = le.fit_transform(X[col])

In [None]:
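# LASSO-style embedded selection: logistic regression with an L1 penalty

lasso = LogisticRegression(penalty='l1', solver='saga', max_iter=1000, random_state=42)

# Record the start time before fitting so the elapsed time can be computed
start_time = time.time()
lasso.fit(X, y)
end_time = time.time()
print(f"Training time: {end_time - start_time:.2f} s")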
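
Once an L1-penalized model is fitted, the embedded selection step is to keep the features whose coefficients were not shrunk to zero. The sketch below shows one way to do that with scikit-learn's `SelectFromModel`. It rests on several assumptions that are not in the notebook: the data is synthetic (`make_classification`), the feature names `f0`..`f12` are placeholders, the features are standardized first because the L1 penalty and the `saga` solver are sensitive to scale, and `C=0.1` is an arbitrary regularization strength.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 13 features (assumption: NOT the heart-disease data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=4, n_redundant=0,
    n_clusters_per_class=1, random_state=42,
)
feature_names = np.array([f"f{i}" for i in range(X_demo.shape[1])])

# Scale, then let SelectFromModel fit the L1 model and keep the
# features whose coefficients are (numerically) non-zero
selector = make_pipeline(
    StandardScaler(),
    SelectFromModel(
        LogisticRegression(penalty="l1", solver="saga", C=0.1,
                           max_iter=5000, random_state=42)
    ),
)
selector.fit(X_demo, y_demo)

# Boolean mask over the input features: True means "kept"
mask = selector[-1].get_support()
selected = feature_names[mask].tolist()
print(f"Kept {len(selected)} of {len(feature_names)} features: {selected}")
```

A smaller `C` means a stronger penalty and fewer surviving features, so in practice `C` would be tuned, for example with cross-validation, rather than fixed by hand.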
