![running heart rate](run31.png)

Millions of people develop some sort of heart disease every year, and heart disease is the biggest killer of both men and women in the United States and around the world. Statistical analysis has identified many risk factors associated with heart disease, such as age, blood pressure, total cholesterol, diabetes, hypertension, family history of heart disease, obesity, lack of physical exercise, and more.

In this project, you will run statistical tests and models using the Cleveland heart disease dataset to assess one particular factor: the maximum heart rate achieved during exercise, and how it is associated with the likelihood of heart disease.

Examining how heart rate responds to exercise, along with other factors such as age and sex, may reveal abnormalities that are indicative of heart disease. Let's find out more!

## The Data
Available in `Cleveland_hd.csv`:

| Column     | Type | Description              |
|------------|------|--------------------------|
|`age` | continuous | age in years | 
|`sex` | discrete | 0=female 1=male |
|`cp`| discrete | chest pain type: 1=typical angina, 2=atypical angina, 3=non-anginal pain, 4=asymptomatic |
|`trestbps`| continuous | resting blood pressure (in mm Hg) |
|`chol`| continuous | serum cholesterol in mg/dl |
|`fbs`| discrete | fasting blood sugar > 120 mg/dl: 1=true, 0=false |
|`restecg`| discrete | resting electrocardiogram result (nominal, 3 values): 0=normal, 1=ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), 2=probable or definite left ventricular hypertrophy by Estes' criteria |
|`thalach`| continuous | maximum heart rate achieved |
|`exang`| discrete | exercise induced angina: 1=yes 0=no |
|`oldpeak`| continuous | ST depression induced by exercise relative to rest |
|`slope`| discrete | the slope of the peak exercise ST segment: 1=upsloping, 2=flat, 3=downsloping |
|`ca`| continuous | number of major vessels (0 to 3) colored by fluoroscopy |
|`thal`| discrete | 3=normal 6=fixed defect 7=reversible defect |
|`class`| discrete | diagnosis classes: 0=no presence 1=minor indicators for heart disease 2=>1 3=>2 4=major indicators for heart disease|

In [79]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from scipy.stats import ttest_ind,chi2_contingency
from sklearn.model_selection import train_test_split

In [80]:
heart_df = pd.read_csv('Cleveland_hd.csv')

In [81]:
print(heart_df)

     age  sex  cp  trestbps  chol  fbs  ...  exang  oldpeak  slope   ca  thal  class
0     63    1   1       145   233    1  ...      0      2.3      3  0.0   6.0      0
1     67    1   4       160   286    0  ...      1      1.5      2  3.0   3.0      2
2     67    1   4       120   229    0  ...      1      2.6      2  2.0   7.0      1
3     37    1   3       130   250    0  ...      0      3.5      3  0.0   3.0      0
4     41    0   2       130   204    0  ...      0      1.4      1  0.0   3.0      0
..   ...  ...  ..       ...   ...  ...  ...    ...      ...    ...  ...   ...    ...
298   45    1   1       110   264    0  ...      0      1.2      2  0.0   7.0      1
299   68    1   4       144   193    1  ...      0      3.4      2  2.0   7.0      2
300   57    1   4       130   131    0  ...      1      1.2      2  1.0   7.0      3
301   57    0   2       130   236    0  ...      0      0.0      2  1.0   3.0      1
302   38    1   3       138   175    0  ...      0      0.0      

In [83]:
heart_df["class"] = heart_df["class"].apply(lambda x: 1 if x > 0 else 0)

In [84]:
feature_cols = heart_df.columns.drop("class")

## Using t-tests to determine highly significant features

In [85]:
p_values = {}

# Run Welch's t-test (equal_var=False) between each feature and the class label
for col in feature_cols:
    group0 = heart_df[heart_df["class"] == 0][col]
    group1 = heart_df[heart_df["class"] == 1][col]
    stat, pval = ttest_ind(group0, group1, equal_var=False)
    p_values[col] = pval

p_values

{'age': 7.061439075547293e-05,
 'sex': 5.971301198029792e-07,
 'cp': 3.3250723382584374e-14,
 'trestbps': 0.009409469224173054,
 'chol': 0.1366486884334473,
 'fbs': 0.6627122369665601,
 'restecg': 0.003116492399434188,
 'thalach': 9.106165923728815e-14,
 'exang': 3.2800598495609635e-14,
 'oldpeak': 2.195174995995486e-13,
 'slope': 1.174222623926696e-09,
 'ca': nan,
 'thal': nan}
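
The `nan` p-values for `ca` and `thal` are what `ttest_ind` returns under its default `nan_policy="propagate"` when a group contains missing values. The snippet below is a minimal sketch of the difference, using small made-up arrays rather than the Cleveland data; whether omitting missing rows is appropriate for this analysis is an assumption, not something the notebook establishes.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic stand-in values (not taken from Cleveland_hd.csv); one is missing
group0 = np.array([0.0, 1.0, 0.0, np.nan, 0.0, 1.0])
group1 = np.array([2.0, 1.0, 3.0, 2.0, 1.0, 2.0])

# Default nan_policy="propagate": a single NaN makes the p-value NaN
_, p_default = ttest_ind(group0, group1, equal_var=False)

# nan_policy="omit" ignores missing values before running Welch's t-test
_, p_omit = ttest_ind(group0, group1, equal_var=False, nan_policy="omit")

print("propagate:", p_default)
print("omit:     ", p_omit)
```

With `nan_policy="omit"`, features such as `ca` and `thal` could be ranked alongside the others instead of being silently excluded.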

In [86]:
highly_significant = sorted(p_values, key=p_values.get)[:3]
highly_significant

['exang', 'cp', 'thalach']

In [87]:
X = heart_df[highly_significant]
y = heart_df["class"]
X,y

(     exang  cp  thalach
 0        0   1      150
 1        1   4      108
 2        1   4      129
 3        0   3      187
 4        0   2      172
 ..     ...  ..      ...
 298      0   1      132
 299      0   4      141
 300      1   4      115
 301      0   2      174
 302      0   3      173
 
 [303 rows x 3 columns],
 0      0
 1      1
 2      1
 3      0
 4      0
       ..
 298    1
 299    1
 300    1
 301    1
 302    0
 Name: class, Length: 303, dtype: int64)

In [88]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

In [89]:
model = LogisticRegression(max_iter = 1000)

In [90]:
model.fit(X_train, y_train)

In [91]:
y_pred = (model.predict_proba(X_test)[:,1] > 0.5).astype(int)

In [92]:
print("y_test unique values:", np.unique(y_test))
print("y_pred unique values:", np.unique(y_pred))

y_test unique values: [0 1]
y_pred unique values: [0 1]


In [93]:
print("y_test shape:", y_test.shape)
print("y_pred shape:", y_pred.shape)

y_test shape: (91,)
y_pred shape: (91,)


## Using t-tests, we achieved an accuracy of 0.77

In [94]:
accuracy = accuracy_score(y_test, y_pred)
confusion = confusion_matrix(y_test, y_pred)
print("Accuracy:", accuracy)
print("Confusion Matrix:\n", confusion)

Accuracy: 0.7692307692307693
Confusion Matrix:
 [[41  7]
 [14 29]]
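
As a quick cross-check, the accuracy can be recomputed by hand from the confusion matrix printed above. This sketch assumes scikit-learn's layout, where rows are true labels and columns are predicted labels.

```python
import numpy as np

# Confusion matrix copied from the output above
confusion = np.array([[41, 7],
                      [14, 29]])

# For a binary problem, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion.ravel()

# Accuracy = correct predictions / all predictions = (41 + 29) / 91
accuracy = (tn + tp) / confusion.sum()
print(f"Accuracy: {accuracy:.4f}")  # Accuracy: 0.7692
```

The 14 false negatives (patients with heart disease predicted as healthy) outnumber the 7 false positives, which a single accuracy figure does not show.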


## Using t-tests on continuous columns and chi-square tests on categorical columns to find the three most significant features

In [95]:
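# Continuous (numeric) features are compared with a t-test
continuous_cols = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak', 'ca']
# Categorical features are compared with a chi-square test
categorical_cols = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'thal']

In [96]:
t_pvals = {}
for col in continuous_cols: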
    group0 = heart_df[heart_df["class"] == 0][col]
    group1 = heart_df[heart_df["class"] == 1][col]
    _, pval = ttest_ind(group0, group1, equal_var=False)
    t_pvals[col] = pval

In [97]:
chi_pvals = {}
for col in categorical_cols:
    contingency = pd.crosstab(heart_df[col], heart_df['class'])
    _, pval, _, _ = chi2_contingency(contingency)
    chi_pvals[col] = pval
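
To make the chi-square step concrete, here is a minimal sketch on a tiny made-up table (the values are illustrative and not drawn from the dataset). It shows what `pd.crosstab` produces and what `chi2_contingency` returns.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data: a binary feature and a binary outcome (hypothetical values)
toy = pd.DataFrame({
    "exang": [0, 0, 0, 0, 1, 1, 1, 1, 0, 1],
    "class": [0, 0, 0, 1, 1, 1, 1, 0, 0, 1],
})

# Rows are feature levels, columns are outcome classes, cells are counts
contingency = pd.crosstab(toy["exang"], toy["class"])

# Returns the test statistic, p-value, degrees of freedom, and expected counts
chi2, pval, dof, expected = chi2_contingency(contingency)

print(contingency)
print("degrees of freedom:", dof)
print("p-value:", pval)
```

For a 2x2 table such as this one, `chi2_contingency` applies Yates' continuity correction by default (`correction=True`), which is worth keeping in mind when comparing p-values for binary features like `sex` or `exang` against those of multi-level features.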

In [98]:
all_pvals = {**t_pvals, **chi_pvals}
print(all_pvals)
highly_significant = sorted(all_pvals, key=all_pvals.get)[:3]
print("Top 3 highly significant features:", highly_significant)

{'age': 7.061439075547293e-05, 'trestbps': 0.009409469224173054, 'chol': 0.1366486884334473, 'thalach': 9.106165923728815e-14, 'oldpeak': 2.195174995995486e-13, 'ca': nan, 'sex': 2.666712348180942e-06, 'cp': 1.2517106007837527e-17, 'fbs': 0.7812734067063785, 'restecg': 0.006566523814217354, 'exang': 1.413788096718085e-13, 'slope': 1.1428845467527021e-10, 'thal': 8.201820286056396e-19}
Top 3 highly significant features: ['thal', 'cp', 'thalach']
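
Because `ca` has a `nan` p-value, sorting the dictionary directly depends on how Python compares `nan`, and that does not reliably place it last. The following is a sketch of a NaN-safe ranking, reusing the p-values printed above; using a pandas `Series` here is one possible choice, not the notebook's method.

```python
import numpy as np
import pandas as pd

# p-values copied from the output above
all_pvals = {
    'age': 7.061439075547293e-05, 'trestbps': 0.009409469224173054,
    'chol': 0.1366486884334473, 'thalach': 9.106165923728815e-14,
    'oldpeak': 2.195174995995486e-13, 'ca': np.nan,
    'sex': 2.666712348180942e-06, 'cp': 1.2517106007837527e-17,
    'fbs': 0.7812734067063785, 'restecg': 0.006566523814217354,
    'exang': 1.413788096718085e-13, 'slope': 1.1428845467527021e-10,
    'thal': 8.201820286056396e-19,
}

# Drop undefined p-values explicitly, then take the three smallest
top3 = pd.Series(all_pvals).dropna().nsmallest(3).index.tolist()
print(top3)  # ['thal', 'cp', 'thalach']
```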


In [99]:
X = heart_df[highly_significant]
y = heart_df['class']

In [100]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [101]:
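# Drop rows with missing or infinite feature values, then realign the labels
X_train = X_train.replace([np.inf, -np.inf], np.nan).dropna()
y_train = y_train[X_train.index]

# Apply the same cleaning to the test split
X_test = X_test.replace([np.inf, -np.inf], np.nan).dropna()
y_test = y_test[X_test.index]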

In [102]:
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

In [103]:
y_pred = (model.predict_proba(X_test)[:, 1] > 0.5).astype(int)

In [104]:
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

## Using t-tests and chi-square tests, we achieved an accuracy of 0.82

In [105]:
print(f"\nAccuracy: {accuracy:.4f}")
print("Confusion Matrix:\n", conf_matrix)


Accuracy: 0.8242
Confusion Matrix:
 [[38 10]
 [ 6 37]]


In [106]:
from sklearn.metrics import precision_score, recall_score, f1_score

precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")


Precision: 0.7872
Recall:    0.8605
F1 Score:  0.8222
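
These three scores follow directly from the second confusion matrix. The sketch below recomputes them from the printed counts, again assuming scikit-learn's convention of true labels in rows and predicted labels in columns.

```python
import numpy as np

# Confusion matrix copied from the output of the second model
conf_matrix = np.array([[38, 10],
                        [6, 37]])
tn, fp, fn, tp = conf_matrix.ravel()

precision = tp / (tp + fp)   # 37 / 47: share of predicted positives that are correct
recall = tp / (tp + fn)      # 37 / 43: share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"Precision: {precision:.4f}")  # Precision: 0.7872
print(f"Recall:    {recall:.4f}")     # Recall:    0.8605
print(f"F1 Score:  {f1:.4f}")         # F1 Score:  0.8222
```

Compared with the first model, false negatives drop from 14 to 6, which is why recall is the strongest of the three metrics here.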
