# **Heart Disease - Project 3**
## Inference for the Population Proportion

# **Importing Necessary Python Modules**

Python incorporates a variety of open-source add-ins called modules that add extra features to the basic setup. The name of each module follows the import or from statement, and its purpose is given in a comment after the hash symbol (#).
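As a quick illustration of the two import styles, here is a minimal sketch using Python's built-in `math` module. It was chosen only because it needs no installation; it is not one of the modules used in this project, and the alias `m` is an arbitrary name picked for this example.

```python
# Style 1: import the whole module under a short alias.
import math as m              #Math functions

# Style 2: import one specific name from a module.
from math import sqrt         #Square root only

# With style 1, prefix each function with the alias.
root_a = m.sqrt(16)

# With style 2, call the imported name directly.
root_b = sqrt(16)

print(root_a, root_b)
```

Both calls run the same function; the only difference is how much you type each time you use it.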



In [10]:
import pandas as pd                 #Data analysis
import numpy as np                  #Calculations
from IPython.display import Image   #Display images
from scipy.stats import norm        #Confidence Interval

In [11]:
# Assigns the URL of the image to display to the name 'image_url'.
image_url = 'https://my.clevelandclinic.org/-/scassets/images/org/health/articles/24129-heart-disease-illustration'

# Display the image
Image(url=image_url, width = 500)

# **Context**

Coronary heart disease (CHD), also referred to as coronary artery disease (CAD), involves the reduction of blood flow to the heart muscle due to the build-up of plaque in the arteries of the heart. It is the most common form of cardiovascular disease. Currently, invasive coronary angiography represents the gold standard for establishing the presence, location, and severity of CHD. However, this diagnostic method is costly and associated with morbidity (count of people with the disease) and mortality (count of deaths) in CHD patients. Therefore, it would be beneficial to develop a non-invasive alternative to replace the current gold standard.

Other less invasive diagnostic methods have been proposed in the scientific literature, including exercise electrocardiogram, thallium scintigraphy, and fluoroscopy of coronary calcification. However, the diagnostic accuracy of these tests ranges only from 35% to 75%. Therefore, it would be beneficial to develop a computer-aided diagnostic tool that combines the results of these non-invasive tests with other patient attributes to boost their diagnostic power, with the aim of ultimately replacing the current invasive gold standard.

Three hundred three (303) consecutive patients referred for coronary angiography at the Cleveland Clinic between May 1981 and September 1984 participated in the experiment. No patient had a history of or electrocardiographic evidence of prior myocardial infarction or known valvular or cardiomyopathic diseases.


# **About the Dataset**

The dataset comprises 303 observations, 13 features, and 1 target attribute. A feature is a variable that is believed to contribute to CHD, and is also referred to as a predictive variable. A target variable is the variable you want to predict (CHD, in this situation). The 13 features include the results of the aforementioned non-invasive diagnostic tests along with other relevant patient information. The target variable includes the result of the invasive coronary angiogram which represents the presence or absence of coronary heart disease in the patient. The 14 variables (13 features and 1 target attribute) are described below.

| **Variable**| **Description**                                          |
|:------------|:---------------------------------------------------------|
| AGE         | The age of the individual.                               |
| SEX         | Gender of the individual: 0 = female, 1 = male.          |
| CP          | The type of chest pain experienced by the individual:    |
|             | * 0 = typical angina                                     |
|             | * 1 = atypical angina                                    |
|             | * 2 = non-anginal pain                                   |
|             | * 3 = asymptomatic                                       |
| TRESTBPS    | Resting blood pressure of an individual (mmHg)           |
| CHOL        | Serum cholesterol in mg/dL                               |
| FBS         | Compares the fasting blood sugar value with 120 mg/dL:   |
|             | * 0: fasting blood sugar ≤ 120 mg/dL                     |
|             | * 1: fasting blood sugar > 120 mg/dL                     |
| RESTECG     | Resting electrocardiographic results:                    |
|             | * 0 = normal                                             |
|             | * 1 = having ST-T wave abnormality                       |
|             | * 2 = left ventricular hypertrophy                       |
| THALACH     | Max heart rate achieved, in beats per minute (bpm)       |
| EXANG       | Exercise-induced angina:                                 |
|             | * 0 = No                                                 |
|             | * 1 = Yes                                                |
| OLDPEAK     | ST depression (mm) induced by exercise relative to rest. |
| SLOPE       | Peak exercise ST segment:                                |
|             | * 1 = upsloping                                          |
|             | * 2 = flat                                               |
|             | * 3 = downsloping                                        |
| CA          | Number of major vessels (0-3) colored by fluoroscopy.    |
| THAL        | Thalassemia:                                             |
|             | * 1 = normal                                             |
|             | * 2 = fixed defect                                       |
|             | * 3 = reversible defect                                  |
| TARGET      | Whether the individual is suffering from heart disease:  |
|             | * 0 = absence                                            |
|             | * 1 = present                                            |



# **A Snippet of the Data**

Let's take a look at the data. To do this, first we import it directly from the URL.

In [12]:
# Assigns the URL where the data file is stored to 'url'.
url='https://raw.githubusercontent.com/ksuaray/STAT108F24_Projects_Jupyter/main/ProjectDataSets/heart.csv'

# Reads in the CSV data file and assigns it to the DataFrame 'df'.
df=pd.read_csv(url)

Next, we relabel two numerically coded categorical variables, and then display the data by *typing the name* of the DataFrame. To ensure we can see all columns, we'll use the *pd.set_option* function.

In [13]:
# The variables restecg and cp are categorical variables that were coded
# numerically. Convert both to categorical variables.

df['restecg'] = df['restecg'].astype('category')
df['restecg'] = df['restecg'].cat.rename_categories({
    0: 'Normal',
    1: 'ST-T wave abnormality',
    2: 'Left ventricular hypertrophy'
})

df['cp'] = df['cp'].astype('category')
df['cp'] = df['cp'].cat.rename_categories({
    0: 'Typical angina',
    1: 'Atypical angina',
    2: 'Non-anginal pain',
    3: 'Asymptomatic'
})
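To see what this conversion does on a small scale, here is a sketch on a made-up column. The DataFrame `toy`, its column `level`, and the labels are all hypothetical and are not part of the heart disease data.

```python
import pandas as pd

# Hypothetical data: a category stored as the numeric codes 0, 1, and 2.
toy = pd.DataFrame({'level': [0, 1, 1, 2, 0, 1]})

# Step 1: tell pandas the column is categorical rather than numeric.
toy['level'] = toy['level'].astype('category')

# Step 2: swap each numeric code for a readable label.
toy['level'] = toy['level'].cat.rename_categories({
    0: 'Low',
    1: 'Medium',
    2: 'High'
})

# The column now holds labels, so counts are reported by name.
print(toy['level'].value_counts())
```

After the conversion, summaries refer to the categories by label instead of by code, which makes the output easier to read.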

In [14]:
# Set display options to show all columns
pd.set_option('display.max_columns', None)

# When you type the object name, the object gets printed.
df

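|     | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|:----|:----|:----|:---|:---------|:-----|:----|:--------|:--------|:------|:--------|:------|:---|:-----|:-------|
| 0   | 63  | 1   | Asymptomatic     | 145 | 233 | 1 | Normal                | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1   | 37  | 1   | Non-anginal pain | 130 | 250 | 0 | ST-T wave abnormality | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2   | 41  | 0   | Atypical angina  | 130 | 204 | 0 | Normal                | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3   | 56  | 1   | Atypical angina  | 120 | 236 | 0 | ST-T wave abnormality | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4   | 57  | 0   | Typical angina   | 120 | 354 | 0 | ST-T wave abnormality | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |
| ... | ... | ... | ...              | ... | ... | ... | ...                 | ... | ... | ... | ... | ... | ... | ... |
| 298 | 57  | 0   | Typical angina   | 140 | 241 | 0 | ST-T wave abnormality | 123 | 1 | 0.2 | 1 | 0 | 3 | 0 |
| 299 | 45  | 1   | Asymptomatic     | 110 | 264 | 0 | ST-T wave abnormality | 132 | 0 | 1.2 | 1 | 0 | 3 | 0 |
| 300 | 68  | 1   | Typical angina   | 144 | 193 | 1 | ST-T wave abnormality | 141 | 0 | 3.4 | 1 | 2 | 3 | 0 |
| 301 | 57  | 1   | Typical angina   | 130 | 131 | 0 | ST-T wave abnormality | 115 | 1 | 1.2 | 1 | 1 | 3 | 0 |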


# **INSTRUCTIONS**

* Use Python to analyze the data set and complete each of the following.
* Replace each ellipsis (...) with the relevant names or code.
* For problems that require a written response, replace the ellipsis (...) by double-clicking the text box and typing your answer.
* Refer to the tutorial from the activity for assistance.
* If you still need help:
  * Watch the video.
  * Attend office hours.

# **The variable to analyze**
You will analyze a category of a qualitative variable. Based on the first initial of your LAST name, analyze the category of the variable listed in the table. Use this category for the entire project.

| Last Name | Variable = Category |
|-----------|-------------------------------|
| A-L       | Resting ECG Results = Normal  |
| M-Z       | Chest Pain = Atypical angina  |

In [15]:
# Print all the category names.
# Use this list to ensure correct spelling of your category.

print("... category names")                          #Replace ... with the variable name written out
print("--------------------------------")
freq_table = pd.Series(df['...']).value_counts()     #Replace ... with the variable name
print(freq_table)


... category names
--------------------------------


KeyError: '...'

# **QUESTION 1**
## Confidence Interval

**Last Names A-L:** Construct and interpret the 92% confidence interval for the population proportion of people who have normal resting ECG results.

**Last Names M-Z:** Construct and interpret the 92% confidence interval for the population proportion of people who have atypical angina chest pain.

**1.1) Parameter: Define the parameter, using correct notation.**

...

**1.2) Method: Name the method you will use.**

...

**1.3) Assumptions:**

Complete the code below to find out how many individuals fall under the category assigned to you.

In [None]:
# Count total observations
n = len(df)

# Count total successes
# Replace the 1st ... with the variable name
# Replace the 2nd ... with the name of the category to be analyzed
obs_count = df['...'].value_counts().get('...')

print(f"{obs_count} out of {n} individuals have ....") #Replace ... with the name of your variable's category.
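Spelling and capitalization matter when looking up a category. The sketch below uses a made-up Series (`answers` is a hypothetical name, not a column in this data set) to show that `.get()` returns `None`, rather than raising an error, when the requested category is not found.

```python
import pandas as pd

# Hypothetical responses, not taken from the heart disease data.
answers = pd.Series(['Yes', 'No', 'Yes', 'Yes', 'No'])

# Frequency table: one count per distinct value.
counts = answers.value_counts()

# Exact spelling finds the category and returns its count.
found = counts.get('Yes')

# Wrong capitalization does not match any category, so .get() gives None.
missing = counts.get('yes')

print(found, missing)
```

If your own count prints as `None`, compare your spelling against the category names printed earlier.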


**Show that both assumptions are met.**

...

...

**1.4) Calculate: Complete the code below to calculate the sample proportion of individuals that fall under your assigned category, and the confidence interval.**

In [None]:
# Define the confidence level
# Replace the ... with the stated confidence level, as a decimal (ex: 0.83, not 83%)
CL = ...

#Calculate the values needed; p-hat, critical value (CV), and standard error (se).
p_hat = obs_count / n
cv = norm.ppf((1+CL)/2)
se = np.sqrt(p_hat * (1-p_hat) / n)

#Calculate the bounds of the interval
ci_lower = (p_hat - cv * se)
ci_upper = (...)     #Replace ... with the formula for the upper bound.

print(f"p-hat = {obs_count}/{n} = {p_hat.round(5)}")
print(f"The {CL*100}% CI is ({ci_lower.round(5)}, {ci_upper.round(5)})")
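The expression `(1+CL)/2` in the critical value line may look odd at first. A confidence level `CL` leaves `1 - CL` split evenly between the two tails, so the critical value is the z-score with area `CL + (1 - CL)/2 = (1 + CL)/2` to its left. The sketch below checks this for a 95% level, which is an example value and not the level for this project. It uses `NormalDist` from Python's standard library as a stand-in for `norm.ppf`, so it runs without SciPy; the variable names are chosen for illustration only.

```python
from statistics import NormalDist

# Example confidence level (an assumption for illustration only).
example_cl = 0.95

# Area in each tail outside the middle 95%.
tail_area = (1 - example_cl) / 2

# Area to the left of the upper critical value.
left_area = (1 + example_cl) / 2

# inv_cdf plays the role of norm.ppf: it returns the z-score
# that has the given area to its left.
example_cv = NormalDist().inv_cdf(left_area)

print(f"tail area = {tail_area:.3f}, left area = {left_area:.3f}")
print(f"critical value = {example_cv:.3f}")
```

The critical value comes out to about 1.960, the familiar z-score for 95% confidence.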


**1.5) Communicate Results: Interpret the confidence interval calculated in 1.4 above. Round to three (3) decimal places.**

...

**1.6) Show work to calculate the margin of error. Then interpret the margin of error.**

**Calculation:**

...

**Interpretation:**

...




# **QUESTION 2**

## **Hypothesis Test**

**A-L:** After learning that 40% of people have normal blood pressure, an aspiring nurse wonders if the proportion of individuals who have normal resting ECG results is different. Based on this data set, is there convincing evidence that the population proportion of individuals who have normal resting ECG results is different from 40% (0.4)? Use α = 0.08. Write up the solution using the PMACC procedure.

**M-Z:** There are 4 types of chest pain. If each type were equally likely, then 25% of individuals would have atypical angina. Based on this data set, is there convincing evidence that the population proportion of individuals who have atypical angina is different from 25% (0.25)? Use α = 0.08. Write up the solution using the PMACC procedure.

**2.1) Parameter: Define the parameter, using correct notation.**

...

**2.2) Method: Name the method you will use, and write the hypotheses.**

**Method name:**

...

**Hypotheses:**

...

**2.3) Assumptions: Show that both assumptions are met. Do NOT round.**

...

...


**2.4) Calculate: Complete the code below to calculate the values required.**

In [None]:
#Define P0, the value in H0.
p_0 = ... #Replace ... with p0.

#Calculate the values needed; p-hat, and standard error (se).
p_hat = obs_count / n
se = np.sqrt(p_0 * (1-p_0) / n)

#Calculate the z-score of our p-hat, under the assumption H0 is true.
z_score = (p_hat - p_0) / se

#Calculate the p-value for 1- and 2-sided tests
p_value1 = (1 - norm.cdf(abs(z_score)))
p_value2 = 2 * p_value1

print(f"p-hat = {obs_count}/{n} = {p_hat.round(7)}")
print(f"z-score = {z_score.round(7)}")
print(f"1 sided p-value = {p_value1:.11f}")
print(f"2 sided p-value = {p_value2:.11f}")
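To see how the pieces of the test fit together, here is a sketch with made-up numbers: 90 successes in 200 observations, tested against a null value of 0.5. None of these values come from the heart disease data. Note that the standard error uses the null value rather than p-hat, because the z-score is computed under the assumption that H0 is true. `NormalDist` from the standard library stands in for `norm.cdf` so the example runs without SciPy.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical values, for illustration only.
example_n = 200          # sample size
example_count = 90       # number of successes
example_p0 = 0.5         # value of p stated in H0

# Sample proportion.
example_p_hat = example_count / example_n

# Standard error computed from p0, since we assume H0 is true.
example_se = sqrt(example_p0 * (1 - example_p0) / example_n)

# How many standard errors p-hat lies from p0.
example_z = (example_p_hat - example_p0) / example_se

# One tail beyond |z|, then doubled for a two-sided alternative.
one_sided = 1 - NormalDist().cdf(abs(example_z))
two_sided = 2 * one_sided

print(f"z-score = {example_z:.4f}")
print(f"1 sided p-value = {one_sided:.4f}")
print(f"2 sided p-value = {two_sided:.4f}")
```

Here the z-score is about -1.41 and the two-sided p-value is about 0.157. Taking the absolute value of the z-score before finding the tail area means the same code works whether p-hat falls above or below the null value.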


**2.5) Communicate Results: What conclusion is made about the null hypothesis? And what does that mean about the alternate hypothesis?**

...

# **QUESTION 3**

## **Do you make the same conclusion if you use the confidence interval?**

**In question 2 you concluded that we either do have or do not have convincing evidence for the alternate hypothesis. Using your confidence interval from question 1, do you reach the same conclusion?**

...

# **QUESTION 4**

Generate a paragraph of at least 100 words to address one of the following questions. That is, answer only 4a or 4b, but not both.

**4a)** Discuss how analyzing your chosen data set using statistical methods could help you become better prepared for future courses in your major.

...

--OR--

**4b)** Discuss how analyzing your chosen data set using statistical methods could be instrumental in becoming better prepared for your future career.

...


<br><br>
### Once you are done and ready to submit, follow the instructions below to save as a PDF and submit to GradeScope.

### Save as PDF
Note 1: You do not have to select Print Preview. You can print directly from the notebook.

Note 2: Image and graph sizes have been set so you should be able to see them correctly without changing the browser width or the layout (portrait vs. landscape).

1. Run all code one last time and make sure your graphs can be seen.
2. File -> Print (or Ctrl+P / Cmd+P).
3. Change the "Destination" to PDF.
4. Save the PDF, taking note of where it is saved.

### Submit to GradeScope
**Watch the "GradeScope Submission" video for help.**
1. Log in to the Canvas course.
2. Click on GradeScope in the course navigation.
3. If you see multiple courses in GradeScope, click on the STAT 108 course.
4. Click on the name of the assignment that matches your data set.
5. Click on "Submit Work" and select PDF.
6. Select the PDF you just created.
7. You need to tell GradeScope which page each problem answer/output is on. You should see a list of problems on the left, and a display of pages (thumbnails) on the right. Assign pages to questions by clicking on the question number on the left, then clicking on all pages that question is on.
8. After ALL questions have been assigned to their respective page(s), click "Submit".

#### **Still need help? Your STAT 108 team is here to help. Take your laptop to office hours.**
