# Heart Disease Research Part I
In this project, you’ll investigate some data from a sample of patients who were evaluated for heart disease at the Cleveland Clinic Foundation. The data was downloaded from the [UCI Machine Learning Repository](https://www.codecademy.com/journeys/data-scientist-ml/paths/dsmlcj-22-data-science-foundations-ii/tracks/dsmlcj-22-statistics-fundamentals-for-data-science/modules/dsf-hypothesis-testing-for-data-science-ade4b838-f5e5-49a0-b33a-13d9b263c6fd/projects/heart-disease-research-i#:~:text=UCI%20Machine%20Learning%20Repository) and then cleaned for analysis. The principal investigators responsible for data collection were:

1. Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
2. University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
3. University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
4. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.

Note that a **solution.py** file is loaded for you in the workspace, which contains solution code for this project. We highly recommend that you complete the project on your own without checking the solution, but feel free to take a look if you get stuck or want to check your answers when you’re done!

In [1]:
# import libraries
import pandas as pd
import numpy as np
import scipy.stats as stats
from scipy.stats import ttest_1samp
# binomtest replaces binom_test, which was deprecated in SciPy 1.10 and removed in 1.12
from scipy.stats import binomtest

# load data
heart = pd.read_csv('heart_disease.csv')
yes_hd = heart[heart.heart_disease == 'presence']
no_hd = heart[heart.heart_disease == 'absence']

In [2]:
# first inspection of dataset
print(heart.info())
print(heart.heart_disease.value_counts())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   age            303 non-null    float64
 1   sex            303 non-null    object 
 2   trestbps       303 non-null    float64
 3   chol           303 non-null    float64
 4   cp             303 non-null    object 
 5   exang          303 non-null    float64
 6   fbs            303 non-null    float64
 7   thalach        303 non-null    float64
 8   heart_disease  303 non-null    object 
dtypes: float64(6), object(3)
memory usage: 21.4+ KB
None
heart_disease
absence     164
presence    139
Name: count, dtype: int64


**Tasks**

## Cholesterol Analysis

**1.** The full dataset has been loaded for you as `heart`, then split into two subsets:

- `yes_hd`, which contains data for patients **with** heart disease
- `no_hd`, which contains data for patients **without** heart disease

For this project, we’ll investigate the following variables:

- `chol`: serum cholesterol in mg/dl
- `fbs`: An indicator for whether fasting blood sugar is greater than 120 mg/dl (`1` = true; `0` = false)

To start, we’ll investigate cholesterol levels for patients with heart disease. Use the dataset `yes_hd` to save cholesterol levels for patients with heart disease as a variable named `chol_hd`.

<details><summary><i>Hint</i></summary>

Fill in the following code:

>```py
>chol_hd = yes_hd.___
>```

</details>

In [3]:
#1 subset yes_hd cholesterol
chol_yes_hd = yes_hd.chol
chol_yes_hd.head()

1    286.0
2    229.0
6    268.0
8    254.0
9    203.0
Name: chol, dtype: float64

**2.** In general, total cholesterol over 240 mg/dl is considered “high” (and therefore unhealthy). Calculate the mean cholesterol level for patients who were diagnosed with heart disease and print it out. Is it higher than 240 mg/dl?

<details><summary><i>Hint</i></summary>

Use `np.mean` to calculate the mean of `chol_hd` (created in the previous step).

</details>

In [4]:
#2 patients with heart disease have cholesterol higher than 240?
print('Chol yes_hd mean:', np.mean(chol_yes_hd))

Chol yes_hd mean: 251.4748201438849


**3.** Do people with heart disease have high cholesterol levels (greater than or equal to 240 mg/dl) on average? Import the function from `scipy.stats` that you can use to test the following null and alternative hypotheses:

- Null: People with heart disease have an average cholesterol level equal to 240 mg/dl
- Alternative: People with heart disease have an average cholesterol level that is greater than 240 mg/dl

Note: Older versions of the `scipy.stats` function we’ve been using do not have an `alternative` parameter to change the alternative hypothesis for this test (it was added to `ttest_1samp` in SciPy 1.6). If your version lacks it, you’ll have to run a two-sided test. However, since you calculated earlier that the average cholesterol level for heart disease patients is greater than 240 mg/dl, you can calculate the p-value for the one-sided test indicated above simply by dividing the two-sided p-value in half.

<details><summary><i>Hint</i></summary>

For this test, we need a one-sample t-test. To import the function:

>```py
>from scipy.stats import ttest_1samp
>```

</details>

In [5]:
#3 Do people with heart disease have high cholesterol levels (greater than or equal to 240 mg/dl) on average?
# import ttest_1samp from  scipy.stats
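
As a rough illustration of what a one-sample t-test computes, the sketch below works out the t-statistic by hand for a small made-up sample (the values are hypothetical and not taken from `heart_disease.csv`). It uses only the standard library; turning the statistic into a p-value still requires the t-distribution, which `ttest_1samp` handles.

```python
import math
import statistics

# Hypothetical cholesterol readings (mg/dl), invented for illustration only
sample = [250, 260, 230, 245, 255]
null_mean = 240  # value stated in the null hypothesis

n = len(sample)
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)       # sample standard deviation (n - 1 denominator)
standard_error = sample_sd / math.sqrt(n)

# t-statistic: how many standard errors the sample mean lies from the null value
t_stat = (sample_mean - null_mean) / standard_error
print('t-statistic:', round(t_stat, 3))
```

A larger positive t-statistic means the sample mean sits further above the null value relative to the sampling noise, which translates into a smaller one-sided p-value.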

**4.** Run the hypothesis test indicated in task 3 and print out the p-value. Can you conclude that heart disease patients have an average cholesterol level significantly greater than 240 mg/dl? Use a significance threshold of 0.05.

<details><summary><i>Hint</i></summary>

`ttest_1samp` has two inputs: the sample of values (in this case, the cholesterol levels for patients with heart disease) and the null value (in this case, 240). It has two outputs: the t-statistic and a p-value.

When you divide the p-value by two (in order to run the one-sided test), you should get a p-value of `0.0035`. This is less than 0.05, suggesting that heart disease patients have an average cholesterol level significantly higher than 240 mg/dl.

</details>



In [6]:
#4 run null hypothesis test and print p-value
tstat, pval = ttest_1samp(chol_yes_hd, 240)
print('P-value patients with HD:',pval/2)
print('The p-value is below the 0.05 threshold, so we reject the null hypothesis: patients with heart disease have an average cholesterol level significantly greater than 240 mg/dl. (If the null hypothesis were actually true, this rejection would be a Type I error.)')

P-value patients with HD: 0.0035411033905155707
The p-value is below the 0.05 threshold, so we reject the null hypothesis: patients with heart disease have an average cholesterol level significantly greater than 240 mg/dl. (If the null hypothesis were actually true, this rejection would be a Type I error.)
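
For readers on SciPy 1.6 or later, the halving step can be cross-checked with the `alternative` argument of `ttest_1samp`. The sketch below uses a small made-up sample (hypothetical values, not the project data) whose mean is above 240, and compares the two approaches.

```python
from scipy.stats import ttest_1samp

# Hypothetical cholesterol readings (mg/dl); the sample mean (251.25) is above 240
sample = [250, 260, 230, 245, 255, 270, 238, 262]

# Two-sided test, then halve the p-value (valid here because the sample mean exceeds 240)
two_sided = ttest_1samp(sample, 240)
halved_pval = two_sided.pvalue / 2

# One-sided test requested directly (requires SciPy >= 1.6)
one_sided = ttest_1samp(sample, 240, alternative='greater')

print('Halved two-sided p-value:', halved_pval)
print('One-sided p-value:       ', one_sided.pvalue)
```

The two numbers should agree up to floating-point error. Halving is only appropriate when the sample mean falls on the side named in the alternative hypothesis; otherwise the one-sided p-value is one minus half the two-sided value.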


**5.** Repeat steps 1-4 in order to run the same hypothesis test, but for patients in the sample who were not diagnosed with heart disease. Do patients without heart disease have average cholesterol levels significantly above 240 mg/dl?

<details><summary><i>Hint</i></summary>

The syntax should be almost identical, but use the `no_hd` dataset instead of `yes_hd`.

</details>

In [7]:
#5 run same test for null hypothesis with patient not diagnosed with heart disease
# subset no_hd cholesterol
chol_no_hd = no_hd.chol
#2 patients without heart disease have cholesterol higher than 240?
print('Chol no_hd mean:', np.mean(chol_no_hd))
tstat, pval = ttest_1samp(chol_no_hd, 240)
print('P-value patients without HD:',pval/2)
print('The p-value is above the 0.05 threshold, so we fail to reject the null hypothesis: there is no significant evidence that patients without heart disease have an average cholesterol level above 240 mg/dl.')


Chol no_hd mean: 242.640243902439
P-value patients without HD: 0.26397120232220506
The p-value is above the 0.05 threshold, so we fail to reject the null hypothesis: there is no significant evidence that patients without heart disease have an average cholesterol level above 240 mg/dl.


## Fasting Blood Sugar Analysis

**6.** Let’s now return to the full dataset (saved as `heart`). How many patients are there in this dataset? Save the number of patients as `num_patients` and print it out.

<details><summary><i>Hint</i></summary>

Use the `len()` function to calculate the number of rows in `heart`.

</details>

In [8]:
#6 number of patients in the dataset
num_patients = len(heart)
print('Num patients in dataset (num of observ.):', num_patients)

Num patients in dataset (num of observ.): 303


**7.** Remember that the `fbs` column of this dataset indicates whether or not a patient’s fasting blood sugar was greater than 120 mg/dl (`1` means that their fasting blood sugar was greater than 120 mg/dl; `0` means it was less than or equal to 120 mg/dl).

Calculate the number of patients with fasting blood sugar greater than 120. Save this number as `num_highfbs_patients` and print it out.

<details><summary><i>Hint</i></summary>

Since patients have a value of `1` in the `fbs` column if their fasting blood sugar is greater than 120 mg/dl, and `0` otherwise, you can simply add up all the numbers in the `fbs` column of `heart` using `np.sum()`.

</details>

In [9]:
#7 number of patients with fasting blood sugar (fbs) above 120 mg/dl
num_highfbs_patients = np.sum(heart.fbs == 1)
print('Num of patients with high level of fbs:', num_highfbs_patients)

Num of patients with high level of fbs: 45
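
The hint sums the `fbs` column directly, while the cell above sums a boolean comparison (`heart.fbs == 1`). For a 0/1 indicator the two give the same count. A minimal sketch with a made-up list of indicator values (hypothetical, not the project data):

```python
# Hypothetical 0/1 indicator values standing in for the fbs column
fbs_values = [1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0]

# Adding the indicator values counts the 1s
count_by_sum = sum(fbs_values)

# Counting how many entries equal 1 (True counts as 1 when summed)
count_by_mask = sum(value == 1 for value in fbs_values)

print('Count by summing values:', count_by_sum)
print('Count by summing mask:  ', count_by_mask)
```

The first count comes back as a float because the column holds floats; the second is an integer because it sums booleans.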


**8.** Sometimes, part of an analysis will involve comparing a sample to known population values to see if the sample appears to be representative of the general population.

By some estimates, about 8% of the U.S. population had diabetes (diagnosed or undiagnosed) in 1988 when this data was collected. While there are multiple tests that contribute to a diabetes diagnosis, fasting blood sugar levels greater than 120 mg/dl can be indicative of diabetes (or at least, pre-diabetes). If this sample were representative of the population, approximately how many people would you expect to have diabetes? Calculate and print out this number.

Is this value similar to the number of patients with a fasting blood sugar above 120 mg/dl, or different?

<details><summary><i>Hint</i></summary>

We want to calculate 8% of the sample size (which is 303). Therefore, we should multiply `0.08*303`. This comes out to approximately 24 patients, which is almost half the number with fbs > 120 in the sample (45).

</details>

In [10]:
#8 About 8% of US population has diabetes. Calculate 8% of the sample size
print('8% of sample (303 patients):', int(num_patients * .08))

print('Proportion high fbs in sample:', (num_highfbs_patients / num_patients).round(4) *100)
print('In this sample, patients with high fbs make up roughly 15% of all patients, almost double the U.S. estimate of 8%.')

8% of sample (303 patients): 24
Proportion high fbs in sample: 14.85
In this sample, patients with high fbs make up roughly 15% of all patients, almost double the U.S. estimate of 8%.
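
The comparison in this step comes down to a few lines of arithmetic. The sketch below repeats it using the counts printed earlier (303 patients, 45 with fbs > 120 mg/dl) and the 8% estimate quoted in the task; the variable `ratio` is introduced here only for illustration.

```python
num_patients = 303          # sample size printed in task 6
num_highfbs_patients = 45   # count printed in task 7
null_rate = 0.08            # estimated U.S. diabetes rate quoted in task 8

expected_count = null_rate * num_patients              # 8% of the sample
observed_rate = num_highfbs_patients / num_patients    # share of the sample with high fbs
ratio = observed_rate / null_rate                      # observed rate relative to 8%

print('Expected count at 8%:', round(expected_count, 2))
print('Observed rate:', round(observed_rate, 4))
print('Observed / expected rate:', round(ratio, 2))
```

Whether a gap of this size could plausibly arise by chance is the question the binomial test in the next tasks addresses.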


**9.** Does this sample come from a population in which the rate of fbs > 120 mg/dl is equal to 8%? Import the function from `scipy.stats` that you can use to test the following null and alternative hypotheses:

- Null: This sample was drawn from a population where 8% of people have fasting blood sugar > 120 mg/dl
- Alternative: This sample was drawn from a population where more than 8% of people have fasting blood sugar > 120 mg/dl

<details><summary><i>Hint</i></summary>

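This hypothesis test requires a binomial test. In SciPy versions before 1.12, we can import the function for a binomial test as follows (newer versions provide `binomtest` instead, which returns a result object with a `pvalue` attribute):

>```py
>from scipy.stats import binom_test
>```

</details>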

In [11]:
#9 Does this sample come from a population in which the rate of fbs > 120 mg/dl is equal to 8%?
# import binom_test or binomtest from scipy.stats library

**10.** Run the hypothesis test indicated in task 9 and print out the p-value. Using a significance threshold of 0.05, can you conclude that this sample was drawn from a population where the rate of fasting blood sugar > 120 mg/dl is significantly greater than 8%?

<details><summary><i>Hint</i></summary>

The `binom_test()` function takes four parameters (in order):

- The observed number of “successes” (in this case, the number of people in the sample who had fasting blood sugar greater than 120 mg/dl)
- The number of “trials” (in this case, the number of patients)
- The null probability of “success” (in this case, 0.08)
- The `alternative` parameter, which indicates the alternative hypothesis for the test (e.g., `'two-sided'`, `'greater'`, or `'less'`)

The output is the p-value.

If you run the test correctly, you should get a p-value of `4.689471951449078e-05` which is equivalent to `0.0000469` (the `e-5` at the end indicates scientific notation). This is less than 0.05, indicating that this sample likely comes from a population where more than 8% of people have fbs > 120 mg/dl.

</details>

In [12]:
null_p = 0.08
# pval = binom_test(45, num_patients, p=null_p, alternative='greater')
pval = binomtest(num_highfbs_patients, num_patients, p=null_p, alternative='greater').pvalue
print('P-value:', pval)
print('The outcome suggests that this sample likely comes from a population where more than 8% of people have fbs > 120 mg/dl.')

P-value: 4.689471951448875e-05
The outcome suggests that this sample likely comes from a population where more than 8% of people have fbs > 120 mg/dl.
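
To see where this p-value comes from, the one-sided (`'greater'`) exact binomial test can be sketched by hand: it is the probability of observing 45 or more "successes" out of 303 trials when the true rate is 8%. The snippet below is an illustrative standard-library version using the counts from this notebook, not a replacement for `binomtest`.

```python
from math import comb

n = 303      # number of patients (trials)
k = 45       # patients with fbs > 120 mg/dl (observed "successes")
p = 0.08     # null probability of "success"

# P(X >= k) under Binomial(n, p): sum the probability of every outcome
# at least as large as the observed count
pval = sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))
print('One-sided p-value:', pval)
```

The result should land very close to the value printed above; small differences in the last digits are floating-point rounding.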
