## Use the Heart Disease [Dataset](https://github.com/cksajil/DSAIRP25/blob/main/datasets/heart_disease.csv) to answer the following questions

## 1. Find the top 5 most important features for the target column

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('heart_disease.csv')
df

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,52,1,0,125,212,0,1,168,0,1.0,2,2,3,0
1,53,1,0,140,203,1,0,155,1,3.1,0,0,3,0
2,70,1,0,145,174,0,1,125,1,2.6,0,0,3,0
3,61,1,0,148,203,0,1,161,0,0.0,2,1,3,0
4,62,0,0,138,294,1,1,106,0,1.9,1,3,2,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1020,59,1,1,140,221,0,1,164,1,0.0,2,0,2,1
1021,60,1,0,125,258,0,0,141,1,2.8,1,1,3,0
1022,47,1,0,110,275,0,0,118,1,1.0,1,1,2,0
1023,50,0,0,110,254,0,0,159,0,0.0,2,0,2,1


In [2]:
from sklearn.ensemble import RandomForestClassifier
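# Separate the predictors from the label
X = df.drop('target', axis=1)
y = df['target']
# Fit on the full dataset; feature_importances_ holds impurity-based scores
model = RandomForestClassifier(random_state=42)
model.fit(X, y)
feature_importances = pd.Series(model.feature_importances_, index=X.columns)
# Rank features from most to least important and keep the first five
top_5_features = feature_importances.sort_values(ascending=False).head(5)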
print("Top 5 important features to the target column:")
top_5_features

Top 5 important features to the target column:


feature,importance
cp,0.134201
thalach,0.120473
ca,0.116755
oldpeak,0.116151
thal,0.097043
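
The scores above are impurity-based and were computed on the same rows the forest was trained on, which can favour features with many distinct values. One possible cross-check is permutation importance on a held-out split. The sketch below uses synthetic data rather than the heart disease file; the column names `signal`, `noise_a`, and `noise_b` are hypothetical, and the split size and number of repeats are arbitrary choices.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (hypothetical columns):
# only `signal` carries information about the label.
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    "signal": rng.normal(size=400),
    "noise_a": rng.normal(size=400),
    "noise_b": rng.normal(size=400),
})
y_demo = (X_demo["signal"] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

# Shuffle one column at a time on the held-out rows and measure the score drop.
result = permutation_importance(
    forest, X_test, y_test, n_repeats=10, random_state=42
)
perm_importances = pd.Series(result.importances_mean, index=X_demo.columns)
perm_ranking = perm_importances.sort_values(ascending=False)
print(perm_ranking)
```

On the real dataset, `X_demo` and `y_demo` would be replaced by `X` and `y` from the cell above, and the resulting ranking compared with the impurity-based one.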


## 2. Apply Box-Cox transformations to the relevant features

In [3]:
from scipy import stats
numerical_features = df.select_dtypes(include=np.number).columns.tolist()
# Leave out the target and the categorical/binary codes
excluded = ['target', 'sex', 'fbs', 'exang', 'restecg', 'ca', 'thal']
features_to_transform = [f for f in numerical_features if f not in excluded]
for feature in features_to_transform:
    # Box-Cox is only defined for strictly positive data
    if (df[feature] > 0).all():
        # Overwrites the column in place and discards the fitted lambda
        df[feature], _ = stats.boxcox(df[feature])
    else:
        print(f"Skipping Box-Cox transformation for '{feature}' as it contains non-positive values.")
print("\nDataFrame after applying Box-Cox transformations:")
print(df.head())

Skipping Box-Cox transformation for 'cp' as it contains non-positive values.
Skipping Box-Cox transformation for 'oldpeak' as it contains non-positive values.
Skipping Box-Cox transformation for 'slope' as it contains non-positive values.

DataFrame after applying Box-Cox transformations:
          age  sex  cp  trestbps      chol  fbs  restecg       thalach  exang  \
0  272.372422    1   0  1.313869  4.138422    0        1  31304.507138      0   
1  280.429390    1   0  1.316925  4.113094    1        0  26281.482428      1   
2  429.185698    1   0  1.317821  4.022191    0        1  16473.082758      1   
3  347.725370    1   0  1.318333  4.113094    0        1  28540.954701      0   
4  356.482692    0   0  1.316551  4.325814    1        1  11515.351967      0   

   oldpeak  slope  ca  thal  target  
0      1.0      2   2     3       0  
1      3.1      0   0     3       0  
2      2.6      0   0     3       0  
3      0.0      2   1     3       0  
4      1.9      1   3     2       0  
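
Of the skipped columns, `cp` and `slope` are categorical codes, so leaving them untransformed is reasonable. `oldpeak` is continuous but contains zeros, which Box-Cox cannot handle. A possible alternative for such a column is the Yeo-Johnson transformation, which accepts zero and negative values. The sketch below runs on synthetic right-skewed values, not the real `oldpeak` column, so the fitted lambda and skewness it prints say nothing about the actual data.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed values that include exact zeros, loosely mimicking
# a column such as `oldpeak` (these are not the real data).
rng = np.random.default_rng(0)
values = np.round(rng.exponential(scale=1.0, size=500), 1)

# Box-Cox would raise here because of the zeros; Yeo-Johnson accepts them.
transformed, fitted_lambda = stats.yeojohnson(values)

skew_before = stats.skew(values)
skew_after = stats.skew(transformed)
print(f"lambda={fitted_lambda:.3f}, skew before={skew_before:.3f}, after={skew_after:.3f}")
```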

## 3. Bin the age column and add the result as a new column to the dataset

In [4]:
# Five equal-width bins; labels=False returns integer bin codes 0-4
df['age_binned'] = pd.cut(df['age'], bins=5, labels=False)
print("\nDataFrame with Age Binning column:")
print(df[['age', 'age_binned']].head())


DataFrame with Age Binning column:
          age  age_binned
0  272.372422           2
1  280.429390           2
2  429.185698           4
3  347.725370           3
4  356.482692           3
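
The `age` values printed above are on the Box-Cox scale, because the transformation in section 2 overwrote the column in place, so the five equal-width bins are not equal-width in years. If bins expressed in years are wanted, one option is to bin the raw ages with explicit edges. In the sketch below the ages are the first five rows from the first cell's output, while the edges and labels are illustrative assumptions.

```python
import pandas as pd

# Raw ages from the first five rows shown in the first cell's output.
raw_age = pd.Series([52, 53, 70, 61, 62], name="age")

# Illustrative edges and labels (assumptions, not from the original notebook).
edges = [0, 40, 50, 60, 70, 120]
labels = ["<=40", "41-50", "51-60", "61-70", ">70"]

# Intervals are right-closed by default, so 70 falls in "61-70".
age_group = pd.cut(raw_age, bins=edges, labels=labels)
print(pd.DataFrame({"age": raw_age, "age_group": age_group}))
```

In the notebook itself this would require keeping a copy of the untransformed `age` column before the Box-Cox cell runs.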


## 4. Find the most orthogonal feature to the 'chol' feature

In [6]:
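# Pairwise Pearson correlations; df now also holds age_binned and the transformed columns
correlation_matrix = df.corr()
chol_correlations = correlation_matrix['chol'].drop('chol')
# "Most orthogonal" is taken as the absolute correlation closest to zero
most_orthogonal_feature = chol_correlations.abs().idxmin()
print(f"\nThe most orthogonal feature to 'chol' is: {most_orthogonal_feature}")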


The most orthogonal feature to 'chol' is: slope
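
The cell above reads "orthogonal" as "absolute Pearson correlation closest to zero". This is a reasonable reading because the Pearson coefficient equals the cosine of the angle between the mean-centred columns. The sketch below illustrates that on a toy frame and shows one way to keep derived columns out of the comparison. The helper `most_orthogonal` and the columns `a`, `b`, `c`, and `a_binned` are hypothetical.

```python
import numpy as np
import pandas as pd

def most_orthogonal(frame, column, exclude=()):
    """Return (name, |r|) of the numeric column least correlated with `column`."""
    corr = frame.select_dtypes(include="number").corr()[column]
    abs_corr = corr.drop([column, *exclude]).abs()
    return abs_corr.idxmin(), abs_corr.min()

toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],   # perfectly correlated with a
    "c": [1.0, -1.0, 0.0, -1.0, 1.0],  # uncorrelated with a
    "a_binned": [0, 0, 1, 2, 2],       # derived from a, so excluded below
})

name, abs_r = most_orthogonal(toy, "a", exclude=["a_binned"])

# Pearson r is the cosine of the angle between the mean-centred vectors.
a_centred = toy["a"] - toy["a"].mean()
c_centred = toy["c"] - toy["c"].mean()
cosine = a_centred @ c_centred / (np.linalg.norm(a_centred) * np.linalg.norm(c_centred))
print(name, abs_r, cosine)
```

Applied to the notebook's `df`, a call such as `most_orthogonal(df, 'chol', exclude=['age_binned'])` would leave the derived bin column out of the ranking.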
