## Utilizing Hypothesis Testing to Gain Insights from (fake) Patient Data on the Relationships between Heart Disease, Cholesterol, and Fasting Blood Sugar (FBS)

Robert Hall (06/29/2024)

Completed for my Codecademy "Data Scientist: Machine Learning Specialist" Career Path Certification.

#### Preparatory loading-in of data and necessary libraries.

In [1]:
import numpy as np 
import pandas as pd
from scipy.stats import ttest_1samp, binom_test

In [2]:
heart = pd.read_csv('heart.csv')
yes_hd = heart[heart.heart_disease == 'presence']
no_hd = heart[heart.heart_disease == 'absence']

#### Isolate the feature representing cholesterol levels in patients into a single variable 'chol_hd'.

In [3]:
chol_hd = yes_hd['chol']
print(chol_hd.head())

1    286.0
2    229.0
6    268.0
8    254.0
9    203.0
Name: chol, dtype: float64


#### Q: Is the mean cholesterol level in the sample higher than the general threshold of "high cholesterol" of 240 mg/dl?

In [6]:
mean_chol = round(chol_hd.mean(), 2)
print(f"Mean Sample Cholesterol Level: {mean_chol}")

Mean Sample Cholesterol Level: 251.47


The mean cholesterol level for the sampled patients with heart disease is approx. 251 mg/dl, which is above the 240 mg/dl threshold and would be considered high.

#### Q: Do those with heart disease, on average, have significantly high levels of cholesterol (chol > 240 mg/dl)?

* Null Hypothesis: "Patients with heart disease have an average cholesterol level equal to 240 mg/dl".
* Alternative Hypothesis: "Patients with heart disease have an average cholesterol level greater than 240 mg/dl".
* Significance Threshold: α = 0.05
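
The one-sample t-test used for this question compares the sample mean with the hypothesized mean of 240 mg/dl, scaled by the standard error of the mean. A minimal sketch of that statistic follows; the helper name `t_statistic` and the toy values are illustrative assumptions, not part of the original analysis.

```python
import math
import statistics

def t_statistic(sample, popmean):
    """One-sample t statistic: (sample mean - popmean) / standard error."""
    n = len(sample)
    mean = statistics.mean(sample)
    # statistics.stdev uses the n - 1 (sample) denominator
    std_err = statistics.stdev(sample) / math.sqrt(n)
    return (mean - popmean) / std_err

# Toy values for illustration only, not taken from heart.csv
toy_chol = [250.0, 262.0, 238.0, 271.0, 244.0]
print(t_statistic(toy_chol, 240))
```

A positive value means the sample mean lies above 240 mg/dl; `ttest_1samp` returns this statistic together with its two-sided p-value.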

In [11]:
ttest, pval = ttest_1samp(chol_hd, 240)
# 'pval' is two-sided; halving it gives the one-sided ("greater") p-value,
# which is valid here because the sample mean (251.47) is above 240.
print(f"p-value: {round(pval/2, 4)}")

p-value: 0.0035


We reject the null hypothesis and conclude that individuals with heart disease have, on average, cholesterol levels significantly above 240 mg/dl (p = 0.0035).
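
Halving a two-sided p-value gives the "greater than" p-value only when the t statistic is positive; when it is negative, the one-sided p-value is 1 minus half the two-sided value. The rule can be sketched as below. The helper `one_sided_greater` is a hypothetical name introduced for illustration; SciPy 1.6 and later can also do this directly with `ttest_1samp(..., alternative='greater')`.

```python
def one_sided_greater(t_stat, p_two_sided):
    """Convert a two-sided t-test p-value to the one-sided 'greater' p-value."""
    half = p_two_sided / 2
    # A positive statistic points in the direction of the alternative,
    # so the one-sided p-value is the smaller tail; otherwise it is the larger one.
    return half if t_stat > 0 else 1 - half

# Illustrative inputs, not results from heart.csv
print(one_sided_greater(2.0, 0.05))
print(one_sided_greater(-2.0, 0.05))
```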

#### Q: Do patients who do NOT have heart disease likewise, on average, have significantly high levels of cholesterol (chol > 240 mg/dl)?

* Null Hypothesis: "Patients without heart disease have an average cholesterol level equal to 240 mg/dl".
* Alternative Hypothesis: "Patients without heart disease have an average cholesterol level greater than 240 mg/dl".
* Significance Threshold: α = 0.05

In [10]:
chol_nhd = no_hd['chol'] # isolated feature measuring cholesterol levels in (fake and anonymous) patients
ttest, pval = ttest_1samp(chol_nhd, 240) 
# 'pval' is two-sided; halving it gives the one-sided ("greater") p-value only if the t statistic is positive (sample mean above 240).
print(f"p-value: {round(pval/2, 4)}")

p-value: 0.264


We fail to reject the null hypothesis: the data do not provide evidence that individuals without heart disease have an average cholesterol level above 240 mg/dl (p = 0.264).

#### Q: How many patients in the data set have a fasting blood sugar (FBS) greater than 120 mg/dl?

In [25]:
# calculate total number of patients (instances) in the dataset
num_patients = len(heart)
print(f" n = {num_patients}")

 n = 303


In [24]:
# count the patients with FBS > 120 mg/dl ('fbs' is 1 when FBS exceeds 120 mg/dl, else 0)
num_highfbs_patients = int(np.sum(heart.fbs))
print(f"Number of patients in dataset with FBS above threshold: {num_highfbs_patients}")

Number of patients in dataset with FBS above threshold: 45


There are 45 patients, out of the 303 in the sample, who have a fasting blood sugar above 120 mg/dl.

#### Q: Approximately 8% of the U.S. population has diabetes. Does the sample come from a population where the rate of FBS > 120 mg/dl equals 8%?

In [21]:
# determine 8% of the sample size (n = 303)
eight_pct_sample = int(np.floor(0.08 * num_patients))
print(f"Eight percent of {num_patients} participants: {eight_pct_sample}") # 24 is just over half of the number of people (45) where fbs > 120 in the sample

Eight percent of 303 participants: 24


In [22]:
pval = binom_test(num_highfbs_patients, num_patients, 0.08, alternative='greater')
print(f"p-value: {round(pval, 6)}")

p-value: 4.7e-05


The p-value (p ≈ 0.000047) is well below the significance threshold of α = 0.05, so we reject the null hypothesis that the sample comes from a population where 8% of people have FBS > 120 mg/dl, in favor of the alternative that the rate is higher. If FBS > 120 mg/dl is taken as a proxy for diabetes, this suggests the sample was drawn from a population with a higher rate of diabetes than the national average.
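
The one-sided binomial p-value is the probability of observing 45 or more patients with FBS > 120 mg/dl among 303 if the true rate were 8%. As a cross-check that does not depend on SciPy, that tail can be summed directly. This is a sketch, and the helper name `binom_tail_greater` is an assumption introduced for illustration. Note that `binom_test` was deprecated in SciPy 1.10 and removed in 1.12 in favor of `binomtest`, so the cell above may need adapting on newer installations.

```python
import math

def binom_tail_greater(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed exactly term by term."""
    return sum(
        math.comb(n, i) * p**i * (1 - p) ** (n - i)
        for i in range(k, n + 1)
    )

# Counts taken from the analysis: 45 of 303 patients, null rate 0.08
pval = binom_tail_greater(45, 303, 0.08)
print(f"p-value: {pval:.2e}")
```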