<a href="https://colab.research.google.com/github/talamo13/Intro-To-Data-Science-Assignments/blob/Heart-Disease-%233/Heart_Disease_3_Key.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Heart Disease Data Set**



##**Context**

Coronary heart disease (CHD), also known as coronary artery disease (CAD), involves the reduction of blood flow to the heart muscle due to the build-up of plaque in the arteries of the heart. It is the most common form of cardiovascular disease. Currently, invasive coronary angiography is the gold standard for establishing the presence, location, and severity of CAD; however, this diagnostic method is costly and is associated with morbidity and mortality in CAD patients. Therefore, it would be beneficial to develop a non-invasive alternative to replace the current gold standard.

Other less invasive diagnostic methods have been proposed in the scientific literature, including exercise electrocardiogram, thallium scintigraphy, and fluoroscopy of coronary calcification. However, the diagnostic accuracy of these tests ranges only from 35% to 75%. Therefore, it would be beneficial to develop a computer-aided diagnostic tool that combines the results of these non-invasive tests with other patient attributes to boost their diagnostic power, with the aim of ultimately replacing the current invasive gold standard.

A total of 303 consecutive patients referred for coronary angiography at the Cleveland Clinic between May 1981 and September 1984 participated in the experiment. No patient had a history or electrocardiographic evidence of prior myocardial infarction or known valvular or cardiomyopathic diseases.

##**About Dataset**

The dataset comprises 303 observations, 13 features, and 1 target attribute. The 13 features include the results of the aforementioned non-invasive diagnostic tests along with other relevant patient information. The target variable records the result of the invasive coronary angiogram, which indicates the presence or absence of coronary artery disease in the patient. The 14 variables (13 features and 1 target attribute) are described below.

1. AGE: displays the age of the individual.

2. SEX: displays the gender of the individual using the following format:
    - 1 = male
    - 0 = female

3. CP: displays the type of chest-pain experienced by the individual using the following format:
    - 0 = typical angina
    - 1 = atypical angina
    - 2 = non-anginal pain
    - 3 = asymptomatic

4. TRESTBPS: displays the resting blood pressure of the individual in mmHg.

5. CHOL: displays the serum cholesterol in mg/dl.

6. FBS: compares the fasting blood sugar value of the individual with 120 mg/dl.
    - 1 = fasting blood sugar > 120 mg/dl
    - 0 = fasting blood sugar ≤ 120 mg/dl

7. RESTECG: displays resting electrocardiographic results
    - 0 = normal
    - 1 = having ST-T wave abnormality
    - 2 = left ventricular hypertrophy

8. THALACH: displays the max heart rate achieved by an individual.

9. EXANG: Exercise induced angina
    - 1 = yes
    - 0 = no

10. OLDPEAK: ST depression induced by exercise relative to rest. The value is an integer or float.

11. SLOPE: Slope of the peak exercise ST segment
    - 1 = upsloping
    - 2 = flat
    - 3 = downsloping

12. CA: Number of major vessels (0-3) colored by fluoroscopy. Displays the value as integer or float.

13. THAL: Displays the thalassemia
    - 1 = normal
    - 2 = fixed defect
    - 3 = reversible defect

14. TARGET: Displays whether the individual has heart disease:
    - 0 = absence
    - 1 = presence

The data was collected by Robert Detrano, M.D., Ph.D. of the Cleveland Clinic Foundation. See the Appendix at the end of this document for more details on why these variables are used to analyze CHD. Attribution: UCI Machine Learning Repository

A snippet of the data is as follows:

In [15]:
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/talamo13/Intro-To-Data-Science-Assignments/Heart-Disease-%233/Heart-Disease-Data.csv')
df.iloc[0:5]

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
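
The categorical columns in the snippet above hold integer codes, not text labels, so any filter has to compare against the code (for example `restecg == 0` for a normal result). The following is a minimal sketch of decoding a few of these columns using the code lists in this document. The names `sample`, `SEX_LABELS`, `CP_LABELS`, `RESTECG_LABELS`, and `decoded` are illustrative and not part of the original notebook; the five rows are copied from the snippet.

```python
import pandas as pd

# First five rows copied from the snippet above (only three coded columns are used here).
sample = pd.DataFrame({
    "sex": [1, 1, 0, 1, 0],
    "cp": [3, 2, 1, 1, 0],
    "restecg": [0, 1, 0, 1, 1],
})

# Code-to-label maps taken from the variable descriptions in this document.
SEX_LABELS = {1: "male", 0: "female"}
CP_LABELS = {
    0: "typical angina",
    1: "atypical angina",
    2: "non-anginal pain",
    3: "asymptomatic",
}
RESTECG_LABELS = {
    0: "normal",
    1: "ST-T wave abnormality",
    2: "left ventricular hypertrophy",
}

# Build a readable copy; the original integer codes stay untouched in `sample`.
decoded = sample.assign(
    sex=sample["sex"].map(SEX_LABELS),
    cp=sample["cp"].map(CP_LABELS),
    restecg=sample["restecg"].map(RESTECG_LABELS),
)
print(decoded)

# Share of these five rows with a normal resting ECG, computed on the integer code.
normal_share = (sample["restecg"] == 0).mean()
print("Share with normal resting ECG in these rows:", normal_share)
```

Keeping the codes in the working data and decoding only for display avoids string comparisons that silently match nothing.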


##**Assignment 3**

**Statistical Inference for One Sample Proportion and Mean**

This assignment is on inferential statistics. Use SPSS to conduct the analysis and complete each of the following questions. As appropriate, copy the SPSS output and paste it into the correct part below. For problems that require a written response, type the answer below.

###**Last Name: A-L**

####**Question 1**

Construct and interpret the 95% confidence interval for the population proportion of patients who displayed normal resting electrocardiographic results. Write up the solution using the PMACC procedure.

In [16]:
from scipy import stats

# restecg is stored as an integer code (0 = normal), so compare against 0 rather than the string 'normal'
normal_ecg_patients = df[df['restecg'] == 0]

# Count the number of patients with normal resting electrocardiographic results
n_normal_ecg = len(normal_ecg_patients)

# Total number of patients
n_total = len(df)

# Sample proportion of patients with normal resting electrocardiographic results
sample_proportion = n_normal_ecg / n_total

# Significance level (alpha)
alpha = 0.05

# Calculate the standard error of the proportion (SE)
SE = (sample_proportion * (1 - sample_proportion) / n_total) ** 0.5

# Find the z-value for a 95% confidence interval
z_value = stats.norm.ppf(1 - alpha/2)

# Calculate the margin of error (ME)
ME = z_value * SE

# Calculate the confidence interval (CI)
confidence_interval = (sample_proportion - ME, sample_proportion + ME)

# Write up the solution
print("PMACC Procedure:")
print("Data Summary:")
print("Sample Proportion (p̂):", sample_proportion)
print("Sample Size (n):", n_total)
print("Significance Level (α):", alpha)
print("\nConfidence Interval:")
print("95% Confidence Interval (CI):", confidence_interval)
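
The interval in this question relies on the normal approximation to the sampling distribution of p̂, which is usually considered reasonable only when the sample contains at least 10 successes and 10 failures. The following is a minimal sketch of a helper that checks this before computing the interval. The function name `proportion_ci` and the counts in the usage lines are hypothetical and are not taken from the heart disease data.

```python
from scipy import stats


def proportion_ci(successes, n, confidence=0.95):
    """Large-sample z-interval for one proportion (normal approximation)."""
    # Success-failure check commonly applied before trusting the approximation.
    if min(successes, n - successes) < 10:
        raise ValueError(
            "fewer than 10 successes or failures; the normal approximation is doubtful"
        )
    p_hat = successes / n
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z_star = stats.norm.ppf(1 - (1 - confidence) / 2)
    margin = z_star * (p_hat * (1 - p_hat) / n) ** 0.5
    return p_hat - margin, p_hat + margin


# Hypothetical counts for illustration only.
low, high = proportion_ci(successes=60, n=150)
print(f"95% CI: ({low:.3f}, {high:.3f})")
```

A sample proportion of exactly 0 or 1 fails this check, which makes it a quick guard against a filter that matched no rows.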


####**Question 2**

Total cholesterol levels under 200 mg/dl are healthy for adults. Doctors treat readings of 200–239 mg/dl as borderline high, and readings of at least 240 mg/dl as high. Is there convincing evidence that the population mean cholesterol level of patients is at least 240 mg/dl? Use α=0.05. Write up the solution using the PMACC procedure.

In [17]:
# Extract the cholesterol levels column
cholesterol_levels = df['chol']

# Calculate the sample mean and standard deviation
sample_mean = cholesterol_levels.mean()
sample_std = cholesterol_levels.std(ddof=1)  # Use ddof=1 for sample standard deviation

# Sample size
n = len(cholesterol_levels)

# Hypothesized population mean
mu_0 = 240

# Significance level (alpha)
alpha = 0.05

# Calculate the t-statistic
t_statistic = (sample_mean - mu_0) / (sample_std / (n ** 0.5))

# Degrees of freedom (named dof so the DataFrame df is not overwritten)
dof = n - 1

# Find the critical t-value for a one-tailed test with alpha=0.05 and dof degrees of freedom
critical_t_value = stats.t.ppf(1 - alpha, dof)

# Write up the solution
print("PMACC Procedure:")
print("Hypotheses:")
print("H0: The population mean cholesterol level of patients is at most 240 mg/dl (μ ≤ 240).")
print("H1: The population mean cholesterol level of patients exceeds 240 mg/dl (μ > 240).")
print("\nData Summary:")
print("Sample Mean (x̄):", sample_mean)
print("Sample Standard Deviation (s):", sample_std)
print("Sample Size (n):", n)
print("Hypothesized Population Mean (μ0):", mu_0)
print("Significance Level (α):", alpha)
print("\nTest Statistic:")
print("t-statistic:", t_statistic)
print("\nCritical t-value:")
print("critical t-value:", critical_t_value)

PMACC Procedure:
Hypotheses:
H0: The population mean cholesterol level of patients is at most 240 mg/dl (μ ≤ 240).
H1: The population mean cholesterol level of patients exceeds 240 mg/dl (μ > 240).

Data Summary:
Sample Mean (x̄): 246.26402640264027
Sample Standard Deviation (s): 51.83075098793003
Sample Size (n): 303
Hypothesized Population Mean (μ0): 240
Significance Level (α): 0.05

Test Statistic:
t-statistic: 2.1037173676209817

Critical t-value:
critical t-value: 1.6499148276145903
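
Because the t-statistic (about 2.10) exceeds the critical value (about 1.65), the null hypothesis is rejected at the 5% level. The same decision can be reached with a p-value. The following is a minimal sketch that recomputes the test from the summary statistics printed above, using the upper tail of the t distribution; the variable names are illustrative.

```python
from scipy import stats

# Summary statistics copied from the output above.
sample_mean = 246.26402640264027
sample_std = 51.83075098793003
n = 303
mu_0 = 240
alpha = 0.05

# One-sample t-statistic: distance of the sample mean from mu_0 in standard errors.
t_statistic = (sample_mean - mu_0) / (sample_std / n ** 0.5)

# Upper-tail p-value P(T >= t) with n - 1 degrees of freedom.
p_value = stats.t.sf(t_statistic, n - 1)

print("t-statistic:", t_statistic)
print("one-sided p-value:", p_value)
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

When the raw column is available, `stats.ttest_1samp(values, popmean=240, alternative="greater")` should give the same statistic and p-value in recent SciPy versions.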


###**Last Name: M-Z**

####**Question 1**

Construct and interpret the 95% confidence interval for the population mean cholesterol level of patients. Write up the solution using the PMACC procedure.

In [18]:
# Calculate the sample mean and standard deviation
sample_mean = cholesterol_levels.mean()
sample_std = cholesterol_levels.std(ddof=1)  # Use ddof=1 for sample standard deviation

# Sample size
n = len(cholesterol_levels)

# Significance level (alpha)
alpha = 0.05

# Calculate the standard error of the mean (SE)
SE = sample_std / (n ** 0.5)

# Degrees of freedom (named dof so the DataFrame df is not overwritten)
dof = n - 1

# Find the t-value for a 95% confidence interval and dof degrees of freedom
t_value = stats.t.ppf(1 - alpha/2, dof)

# Calculate the margin of error (ME)
ME = t_value * SE

# Calculate the confidence interval (CI)
confidence_interval = (sample_mean - ME, sample_mean + ME)

# Write up the solution
print("PMACC Procedure:")
print("Data Summary:")
print("Sample Mean (x̄):", sample_mean)
print("Sample Standard Deviation (s):", sample_std)
print("Sample Size (n):", n)
print("Significance Level (α):", alpha)
print("\nConfidence Interval:")
print("95% Confidence Interval (CI):", confidence_interval)

PMACC Procedure:
Data Summary:
Sample Mean (x̄): 246.26402640264027
Sample Standard Deviation (s): 51.83075098793003
Sample Size (n): 303
Significance Level (α): 0.05

Confidence Interval:
95% Confidence Interval (CI): (240.40455783980744, 252.1234949654731)
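
As a cross-check on the manual calculation above, SciPy can produce the same t-interval directly. The following is a minimal sketch that feeds the summary statistics from the output into `stats.t.interval`; the names `standard_error`, `lower`, and `upper` are illustrative.

```python
from scipy import stats

# Summary statistics copied from the output above.
sample_mean = 246.26402640264027
sample_std = 51.83075098793003
n = 303

standard_error = sample_std / n ** 0.5

# Positional arguments: confidence level, then degrees of freedom.
# loc and scale centre the interval on the sample mean and scale it by the standard error.
lower, upper = stats.t.interval(0.95, n - 1, loc=sample_mean, scale=standard_error)
print("95% CI via stats.t.interval:", (lower, upper))
```

Read as an interpretation, the interval says that we can be 95% confident the population mean cholesterol level of such patients lies between roughly 240.4 and 252.1 mg/dl.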


####**Question 2**

Is there convincing evidence that the population proportion of individuals who displayed normal resting electrocardiographic results exceeds 50%? Use α=0.1. Write up the solution using the PMACC procedure.

In [24]:
import pandas as pd
from scipy import stats

# Load the data under its own name so this cell does not depend on variables from earlier cells
heart = pd.read_csv('https://raw.githubusercontent.com/talamo13/Intro-To-Data-Science-Assignments/Heart-Disease-%233/Heart-Disease-Data.csv')

# Count the number of individuals with normal resting electrocardiographic results (restecg code 0)
n_normal_ecg = (heart['restecg'] == 0).sum()

# Total number of individuals
n_total = len(heart)

# Sample proportion of individuals with normal resting electrocardiographic results
sample_proportion = n_normal_ecg / n_total

# Hypothesized proportion
p_0 = 0.50

# Significance level (alpha)
alpha = 0.10

# Calculate the standard error of the proportion (SE) under the null hypothesis
SE = (p_0 * (1 - p_0) / n_total) ** 0.5

# Calculate the z-statistic
z_statistic = (sample_proportion - p_0) / SE

# Find the critical z-value for a one-tailed test with alpha=0.10
critical_z_value = stats.norm.ppf(1 - alpha)

# Compare the z-statistic with the critical z-value
if z_statistic > critical_z_value:
    decision = "Reject the null hypothesis. There is convincing evidence that the population proportion of individuals who displayed normal resting electrocardiographic results exceeds 50%."
else:
    decision = "Fail to reject the null hypothesis. There is not enough evidence to conclude that the population proportion exceeds 50%."

# Write up the solution
print("PMACC Procedure:")
print("Hypotheses:")
print("H0: The population proportion of individuals who displayed normal resting electrocardiographic results is 50% or less.")
print("H1: The population proportion of individuals who displayed normal resting electrocardiographic results exceeds 50%.")
print("\nData Summary:")
print("Sample Proportion (p̂):", sample_proportion)
print("Sample Size (n):", n_total)
print("Hypothesized Proportion (p0):", p_0)
print("Significance Level (α):", alpha)
print("\nTest Statistic:")
print("z-statistic:", z_statistic)
print("\nCritical z-value:")
print("critical z-value:", critical_z_value)
print("\nConclusion:")
print(decision)
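
The critical-value comparison in this question can also be expressed as a p-value. The following is a minimal sketch of an upper-tailed one-proportion z-test that includes the usual large-sample check under the null hypothesis. The function name `one_proportion_z_test` and the counts in the usage lines are hypothetical and are not taken from the heart disease data.

```python
from scipy import stats


def one_proportion_z_test(successes, n, p_0):
    """Upper-tailed z-test of H0: p <= p_0 against H1: p > p_0."""
    # Large-sample condition under H0: expected successes and failures both at least 10.
    if min(n * p_0, n * (1 - p_0)) < 10:
        raise ValueError("sample too small for the normal approximation")
    p_hat = successes / n
    # The standard error uses p_0 because the test is carried out assuming H0 is true.
    se = (p_0 * (1 - p_0) / n) ** 0.5
    z = (p_hat - p_0) / se
    # Upper-tail probability P(Z >= z) under the standard normal distribution.
    p_value = stats.norm.sf(z)
    return z, p_value


# Hypothetical counts for illustration only.
z, p_value = one_proportion_z_test(successes=60, n=100, p_0=0.5)
print("z-statistic:", z)
print("one-sided p-value:", p_value)
print("Reject H0" if p_value < 0.10 else "Fail to reject H0")
```

Rejecting when the p-value is below α is equivalent to rejecting when the z-statistic exceeds the critical z-value.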


###**Everyone**

Generate a paragraph of at least 100 words to address one of the following questions:

1. Discuss how analyzing your chosen data set using statistical methods could help you become better prepared for future courses in your major.

2. Discuss how analyzing your chosen data set using statistical methods could be instrumental in becoming better prepared for your future career.