<a href="https://colab.research.google.com/github/talamo13/Intro-To-Data-Science-Assignments/blob/Heart-Disease-%232/Heart_Disease_2_Key.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Heart Disease Data Set**

## **Context**

Coronary heart disease (CHD), also called coronary artery disease (CAD), involves the reduction of blood flow to the heart muscle due to the build-up of plaque in the arteries of the heart. It is the most common form of cardiovascular disease. Currently, invasive coronary angiography is the gold standard for establishing the presence, location, and severity of CAD; however, this diagnostic method is costly and is associated with morbidity and mortality in CAD patients. Therefore, it would be beneficial to develop a non-invasive alternative to replace the current gold standard.

Other, less invasive diagnostic methods have been proposed in the scientific literature, including exercise electrocardiogram, thallium scintigraphy, and fluoroscopy of coronary calcification. However, the diagnostic accuracy of these tests ranges only from 35% to 75%. A computer-aided diagnostic tool could combine the results of these non-invasive tests with other patient attributes to boost their diagnostic power, with the aim of ultimately replacing the current invasive gold standard.

A total of 303 consecutive patients referred for coronary angiography at the Cleveland Clinic between May 1981 and September 1984 participated in the study. No patient had a history or electrocardiographic evidence of prior myocardial infarction or known valvular or cardiomyopathic diseases.


## **About The Dataset**

The dataset comprises 303 observations, 13 features, and 1 target attribute. The 13 features include the results of the aforementioned non-invasive diagnostic tests along with other relevant patient information. The target variable includes the result of the invasive coronary angiogram which represents the presence or absence of coronary artery disease in the patient. The 14 variables (13 features and 1 target attribute) are described below.

**1. AGE:** displays the age of the individual.

**2. SEX:** displays the gender of the individual using the following format:
   - 1 = male
   - 0 = female

**3. CP:** displays the type of chest-pain experienced by the individual using the following format:
   - 0 = typical angina
   - 1 = atypical angina
   - 2 = non-anginal pain
   - 3 = asymptomatic

**4. TRESTBPS:** displays the resting blood pressure value of an individual in mmHg (unit)

**5. CHOL:** displays the serum cholesterol in mg/dl (unit)

**6. FBS:** compares the fasting blood sugar value of an individual with 120 mg/dl.
   - 1 = fasting blood sugar > 120 mg/dl
   - 0 = fasting blood sugar ≤ 120 mg/dl

**7. RESTECG:** displays resting electrocardiographic results
   - 0 = normal
   - 1 = having ST-T wave abnormality
   - 2 = left ventricular hypertrophy

**8. THALACH:** displays the max heart rate achieved by an individual.

**9. EXANG:** Exercise induced angina
   - 1 = yes
   - 0 = no

**10. OLDPEAK:** displays the ST depression induced by exercise relative to rest, stored as an integer or float.

**11. SLOPE:** displays the slope of the peak exercise ST segment
   - 1 = upsloping
   - 2 = flat
   - 3 = downsloping

**12. CA:** Number of major vessels (0-3) colored by fluoroscopy. Displays the value as an integer or float.

**13. THAL:** displays the result of the thallium stress test (thallium scintigraphy)
   - 1 = normal
   - 2 = fixed defect
   - 3 = reversible defect

**14. Target:** displays whether the individual has heart disease:
   - 0 = absence
   - 1 = presence


The data was collected by Robert Detrano, M.D., Ph.D. of the Cleveland Clinic Foundation. See the Appendix at the end of this document for more details on why these variables are used to analyze CHD.

Attribution: UCI Machine Learning Repository

A snippet of the data is as follows:

In [6]:
import pandas as pd
import plotly.express as px

df = pd.read_csv("https://raw.githubusercontent.com/talamo13/Intro-To-Data-Science-Assignments/Heart-Disease-%232/Heart-Disease-Data.csv")

# Display the first 5 rows of the dataframe
df.iloc[0:5]

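| Unnamed: 0 | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|------------|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|--------|
| 0          | 63  | 1   | 3  | 145      | 233  | 1   | 0       | 150     | 0     | 2.3     | 0     | 0  | 1    | 1      |
| 1          | 37  | 1   | 2  | 130      | 250  | 0   | 1       | 187     | 0     | 3.5     | 0     | 0  | 2    | 1      |
| 2          | 41  | 0   | 1  | 130      | 204  | 0   | 0       | 172     | 0     | 1.4     | 2     | 0  | 2    | 1      |
| 3          | 56  | 1   | 1  | 120      | 236  | 0   | 1       | 178     | 0     | 0.8     | 2     | 0  | 2    | 1      |
| 4          | 57  | 0   | 0  | 120      | 354  | 0   | 1       | 163     | 1     | 0.6     | 2     | 0  | 2    | 1      |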


# **Correlation and Regression Analysis**

This assignment is intended to explore the univariate linear relationships between quantitative variables in the data. Choose the dependent and independent variables based on your last name. Use SPSS to analyze the relationship between the two variables and complete each of the following questions. As appropriate, copy the SPSS output and paste it into the correct part below. For problems that require a written response, type the answer below.

<div align=center>

| **Last Name** | **Dependent Variable (y)** | **Independent Variable (x)** |
|-----------|-------------------------|---------------------------|
| A-L       | TRESTBPS                | CHOL                      |
| M-Z       | THALACH                 | AGE                       |

</div>


### 1. Construct a scatterplot of the two variables without the line of regression. How would you describe the relationship between the two variables? Is this what you expected?

#### A-L

In [10]:
fig1AM = px.scatter(df, x='chol', y='trestbps', title='Scatter Plot Of trestbps By chol')
fig1AM

*· There is a weak positive linear relationship between the resting blood pressure and cholesterol.*

*· (Explanation of whether it was what the student expected.)*

#### M-Z

In [12]:
fig1NZ = px.scatter(df, x='age', y='thalach', title='Scatter Plot Of thalach By age')
fig1NZ

*· There is a weak negative linear relationship between the max heart rate and age.*

*· (Explanation of whether it was what the student expected.)*

### 2. Compute and interpret the values of the correlation coefficient between the two variables

#### A-L

a. Predictors: (Constant), Serum Cholesterol (mg/dl)

· There is a weak positive linear relationship between the resting blood pressure and cholesterol.

#### M-Z

a. Predictors: (Constant), age

· There is a weak negative linear relationship between the max heart rate and age.

### 3. Compute the least squares regression line describing the relationships between the dependent and independent variable. Add the regression line to the scatterplot. Paste the new scatterplot and the output table below. Then type out the prediction equation.

### 4. Interpret the slope of the least squares regression line in the context of this study.

### 5. Interpret the y-intercept of the least squares regression line in the context of this study. State whether the interpretation is reasonable.
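
Whether the y-intercept has a reasonable interpretation usually depends on whether x = 0 lies inside the range of observed x values. A small check along those lines is sketched here, using the five ages from the data snippet as stand-in data. The function name `zero_is_outside_observed_range` is hypothetical.

```python
# First five ages from the data snippet (illustration only, not the full data).
age = [63, 37, 41, 56, 57]


def zero_is_outside_observed_range(x_values):
    """Return True when interpreting the intercept would require extrapolating to x = 0."""
    return not (min(x_values) <= 0 <= max(x_values))


if zero_is_outside_observed_range(age):
    print("x = 0 is outside the observed range; the intercept is an extrapolation.")
else:
    print("x = 0 is within the observed range; the intercept can be interpreted directly.")
```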

### 6. One person in the data set was 74 years old. Predict the value of your dependent variable from above for this person, using the actual value of the independent variable shown in the table below. Type your work below.

<div align=center>

| Last Name | Independent Variable (x) |
|-----------|---------------------------|
| A-L       | CHOL = 269                |
| M-Z       | AGE = 74                  |

</div>

### 7. Look up, in the Excel or SPSS file, the actual value of your dependent variable. Compare your answer in question 6 above (the predicted dependent variable) to the actual value of your dependent variable.

### 8. Generate a paragraph of at least 100 words to address one of the following questions

#### a. Discuss how analyzing your chosen data set using statistical methods could help you become better prepared for future courses in your major.

#### b. Discuss how analyzing your chosen data set using statistical methods could help you become better prepared for your future career.