# Summary of Preliminary Analysis

This notebook is a temporary working space for thinking through what we should do in our actual analysis. First, we summarize the results of the preliminary analysis from our proposal.

In [1]:
# This cleaning-up procedure is mostly the same as our proposal
# We added some steps to omit rows containing unclassified values
import altair as alt
import numpy as np
import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"

heart_disease_data = pd.read_csv(url, names=[
    "age", 
    "sex", 
    "chest_pain", 
    "resting_blood_pressure", 
    "cholesterol", 
    "fasting_blood_sugar", 
    "resting_electrocardiographic_results", 
    "max_heart_rate",
    "exercise_induced_angina",  
    "st_depression_exercise", 
    "slope_st", 
    "major_vessels", 
    "thal",
    "diagnosis"
])

heart_disease_data = heart_disease_data.dropna(how='all')

# omit rows containing unclassified values
heart_disease_data = heart_disease_data.drop(
    heart_disease_data[heart_disease_data.major_vessels == '?'].index
)
heart_disease_data["major_vessels"] = pd.to_numeric(heart_disease_data["major_vessels"])

heart_disease_data = heart_disease_data.drop(
    heart_disease_data[heart_disease_data.thal == '?'].index
)
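
As a possible alternative to dropping the `'?'` rows column by column, the placeholder can be declared as a missing value when the file is read. The following is a minimal sketch, not the procedure used in our proposal; it parses a hypothetical three-row sample (made-up values in the same layout as the raw file) instead of downloading the data, and the names `sample`, `column_names`, `sample_data`, and `sample_clean` are introduced only for illustration.

```python
import io

import pandas as pd

# Hypothetical three-row sample in the same comma-separated layout as the
# raw file; the values are made up for illustration only.
sample = io.StringIO(
    "63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0\n"
    "67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,?,3.0,2\n"
    "53.0,0.0,3.0,128.0,216.0,0.0,2.0,115.0,0.0,0.0,1.0,0.0,?,0\n"
)

column_names = [
    "age", "sex", "chest_pain", "resting_blood_pressure", "cholesterol",
    "fasting_blood_sugar", "resting_electrocardiographic_results",
    "max_heart_rate", "exercise_induced_angina", "st_depression_exercise",
    "slope_st", "major_vessels", "thal", "diagnosis",
]

# na_values="?" turns the placeholder into NaN while parsing, so both
# affected columns are read as floats rather than strings.
sample_data = pd.read_csv(sample, names=column_names, na_values="?")

# Drop every row with a missing value in either of the two columns.
sample_clean = sample_data.dropna(subset=["major_vessels", "thal"])
print(sample_clean[["major_vessels", "thal"]])
```

With this approach `thal` would hold the floats `3.0`, `6.0`, and `7.0` rather than the strings `'3.0'`, `'6.0'`, and `'7.0'`, so any later relabeling would need numeric keys.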

In [2]:
# replace numeric values with categorical labels for categorical variables
# except for 'resting_electrocardiographic_results' which requires a long description for each label
heart_disease_data["sex"] = heart_disease_data["sex"].apply(lambda x: "male" if (x == 1.0) else "female")

heart_disease_data["chest_pain"] = heart_disease_data["chest_pain"].replace({
    1: 'Typical angina',
    2: 'Atypical angina',
    3: 'Non-anginal pain',
    4: 'Asymptomatic'
})

heart_disease_data["fasting_blood_sugar"] = heart_disease_data["fasting_blood_sugar"].apply(lambda x: True if (x == 1.0) else False)

heart_disease_data["exercise_induced_angina"] = heart_disease_data["exercise_induced_angina"].apply(lambda x: 'Yes' if (x == 1.0) else 'No')

heart_disease_data["slope_st"] = heart_disease_data["slope_st"].replace({
    1.0: 'Upsloping',
    2.0: 'Flat',
    3.0: 'Downsloping'
})

heart_disease_data["thal"] = heart_disease_data["thal"].replace({
    '3.0': 'Normal',
    '6.0': 'Fixed defect',
    '7.0': 'Reversable defect'
})
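
One thing to keep in mind about the `apply`-based recoding of `sex` above is that it labels every value other than `1.0` as "female", including any unexpected code. A hedged alternative is `Series.map` with an explicit dictionary, which leaves unmapped codes as `NaN` so that they can be spotted. The sketch below uses a hypothetical toy series; the code `9.0` is made up to show the behaviour and does not come from our data.

```python
import pandas as pd

# Toy codes for illustration; 9.0 is a hypothetical unexpected value.
codes = pd.Series([1.0, 0.0, 1.0, 9.0], name="sex")

# Every expected code is listed explicitly; anything else becomes NaN.
sex_labels = codes.map({1.0: "male", 0.0: "female"})

# Count the codes that had no entry in the dictionary.
n_unmapped = int(sex_labels.isna().sum())
print(sex_labels)
print("unmapped codes:", n_unmapped)
```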


In [3]:
# define a column, 'heart_disease', based on the 'diagnosis' column and drop 'diagnosis'
heart_disease_data["heart_disease"] = heart_disease_data["diagnosis"].apply(
    lambda x: "undiagnosed" if (x == 0) else "diagnosed")
heart_disease_data = heart_disease_data.drop(columns=["diagnosis"])

heart_disease_data

| | age | sex | chest_pain | resting_blood_pressure | cholesterol | fasting_blood_sugar | resting_electrocardiographic_results | max_heart_rate | exercise_induced_angina | st_depression_exercise | slope_st | major_vessels | thal | heart_disease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63.0 | male | Typical angina | 145.0 | 233.0 | True | 2.0 | 150.0 | No | 2.3 | Downsloping | 0.0 | Fixed defect | undiagnosed |
| 1 | 67.0 | male | Asymptomatic | 160.0 | 286.0 | False | 2.0 | 108.0 | Yes | 1.5 | Flat | 3.0 | Normal | diagnosed |
| 2 | 67.0 | male | Asymptomatic | 120.0 | 229.0 | False | 2.0 | 129.0 | Yes | 2.6 | Flat | 2.0 | Reversable defect | diagnosed |
| 3 | 37.0 | male | Non-anginal pain | 130.0 | 250.0 | False | 0.0 | 187.0 | No | 3.5 | Downsloping | 0.0 | Normal | undiagnosed |
| 4 | 41.0 | female | Atypical angina | 130.0 | 204.0 | False | 2.0 | 172.0 | No | 1.4 | Upsloping | 0.0 | Normal | undiagnosed |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 297 | 57.0 | female | Asymptomatic | 140.0 | 241.0 | False | 0.0 | 123.0 | Yes | 0.2 | Flat | 0.0 | Reversable defect | diagnosed |
| 298 | 45.0 | male | Typical angina | 110.0 | 264.0 | False | 0.0 | 132.0 | No | 1.2 | Flat | 0.0 | Reversable defect | diagnosed |
| 299 | 68.0 | male | Asymptomatic | 144.0 | 193.0 | True | 0.0 | 141.0 | No | 3.4 | Flat | 2.0 | Reversable defect | diagnosed |
| 300 | 57.0 | male | Asymptomatic | 130.0 | 131.0 | False | 0.0 | 115.0 | Yes | 1.2 | Flat | 1.0 | Reversable defect | diagnosed |


Now we have a 14-column dataset that contains six numerical variables and eight categorical variables. The following table summarizes the columns, together with whether our preliminary analysis suggested a possible correlation with the heart disease diagnosis.

| # | Variable | Description | Value | Possible Correlation |
|---|----------|-------------|-------|----------------------|
| 1 | age | Individual's age | numerical | Yes |
| 2 | sex | Individual's sex | male or female | Yes |
| 3 | chest_pain | Chest pain type | Typical angina, Atypical angina, Non-anginal pain, or Asymptomatic | Yes |
| 4 | resting_blood_pressure | Resting blood pressure (in mm Hg on admission to the hospital) | numerical | worth checking again (see Note 1 below) |
| 5 | cholesterol | Serum cholesterol in mg/dL | numerical | needs re-analysis (see Note 2 below) |
| 6 | fasting_blood_sugar | Fasting blood sugar > 120 mg/dL | True or False | No |
| 7 | resting_electrocardiographic_results | Resting electrocardiographic results | <ul><li>0: normal</li><li>1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)</li><li>2: showing probable or definite left ventricular hypertrophy by Estes' criteria</li></ul> | No |
| 8 | max_heart_rate | Maximum heart rate achieved | numerical | Yes |
| 9 | exercise_induced_angina | Exercise induced angina | Yes or No | Yes |
| 10 | st_depression_exercise | ST depression induced by exercise relative to rest | numerical | Yes |
| 11 | slope_st | The slope of the peak exercise ST segment | Upsloping, Flat, or Downsloping | Yes (but too technical) |
| 12 | major_vessels | Number of major vessels (0-3) colored by fluoroscopy | numerical | Yes |
| 13 | thal | Thalassemia | Normal, Fixed defect, or Reversable defect | Yes |
| 14 | heart_disease | Diagnosis of presence of heart disease | diagnosed or undiagnosed | NA |

Note 1: For `resting_blood_pressure`, we concluded in the proposal that the variable does not seem to be correlated with the heart disease diagnosis. However, the visualization does suggest a pattern: the samples diagnosed with heart disease seem to have higher blood pressure. So, we might consider including this variable in our analysis.

Note 2: For `cholesterol`, our preliminary analysis was not done correctly. As the scaling and centering section below shows, the mean serum cholesterol in the training data is about 246 mg/dL. Compared with this value, the visualization in our proposal does not seem to represent the observations of this variable properly. So, we may redo the analysis before the actual analysis.
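
One simple way to re-check `cholesterol` (or `resting_blood_pressure`) before plotting is a group-wise summary by diagnosis. The following is only a sketch on a hypothetical toy frame with made-up values; in the notebook it would presumably be applied to the training data, and the names `toy` and `chol_by_group` are introduced for illustration.

```python
import pandas as pd

# Hypothetical toy data; the cholesterol values are made up.
toy = pd.DataFrame({
    "cholesterol": [200.0, 220.0, 240.0, 250.0, 270.0, 290.0],
    "heart_disease": ["undiagnosed"] * 3 + ["diagnosed"] * 3,
})

# Summary statistics of cholesterol within each diagnosis group.
chol_by_group = (
    toy.groupby("heart_disease")["cholesterol"]
    .agg(["mean", "median", "min", "max"])
)
print(chol_by_group)
```

If the group means or medians differ clearly, a visualization on the same scale should show that difference as well, which gives a quick sanity check for the plot.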

## Training Data and Test Data

We will split the data into a training set and a test set and *use only the training set* from now on.

In [4]:
from sklearn.model_selection import train_test_split

heart_disease_train, heart_disease_test = train_test_split(heart_disease_data, test_size=0.25, random_state=123)
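
The split above is purely random. If we want both sets to keep the same proportion of diagnosed and undiagnosed samples, `train_test_split` also accepts a `stratify` argument. The sketch below illustrates this on a hypothetical toy frame (20 made-up rows); it is an option to consider, not what the cell above does.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: 12 "undiagnosed" and 8 "diagnosed" rows.
toy = pd.DataFrame({
    "age": range(40, 60),
    "heart_disease": ["undiagnosed"] * 12 + ["diagnosed"] * 8,
})

# stratify keeps the 60/40 class ratio in both the training and test sets.
toy_train, toy_test = train_test_split(
    toy,
    test_size=0.25,
    random_state=123,
    stratify=toy["heart_disease"],
)

print(toy_train["heart_disease"].value_counts())
print(toy_test["heart_disease"].value_counts())
```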

## Scaling and Centering

We will show some statistics for all six numeric variables in the training data.

In [5]:
age_stat = heart_disease_train.agg({'age': ['mean', 'median', 'min', 'max']}).round(decimals=1)
rbp_stat = heart_disease_train.agg({'resting_blood_pressure': ['mean', 'median', 'min', 'max']}).round(decimals=2)
chol_stat = heart_disease_train.agg({'cholesterol': ['mean', 'median', 'min', 'max']}).round(decimals=2)
max_hr_stat = heart_disease_train.agg({'max_heart_rate': ['mean', 'median', 'min', 'max']}).round(decimals=2)
st_dep_stat = heart_disease_train.agg({'st_depression_exercise': ['mean', 'median', 'min', 'max']}).round(decimals=2)
vessels_stat = heart_disease_train.agg({'major_vessels': ['mean', 'median', 'min', 'max']}).round(decimals=2)

num_stat = pd.concat([age_stat, rbp_stat, chol_stat, max_hr_stat, st_dep_stat, vessels_stat], axis=1)
num_stat.transpose()

| | mean | median | min | max |
|---|---|---|---|---|
| age | 54.7 | 56.0 | 29.0 | 77.0 |
| resting_blood_pressure | 132.49 | 130.0 | 94.0 | 192.0 |
| cholesterol | 246.02 | 243.0 | 126.0 | 417.0 |
| max_heart_rate | 149.76 | 154.0 | 71.0 | 202.0 |
| st_depression_exercise | 1.02 | 0.6 | 0.0 | 5.6 |
| major_vessels | 0.68 | 0.0 | 0.0 | 3.0 |


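We can see that the scales of the above variables differ considerably (for example, `major_vessels` ranges from 0 to 3 while `cholesterol` ranges from 126 to 417), so we may need to scale the values when using them in our analysis with the K-nearest neighbours algorithm.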
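
A minimal sketch of how such scaling could look with scikit-learn's `StandardScaler`, assuming we standardize each numeric column to mean 0 and standard deviation 1. The two toy frames hold made-up values, not real observations, and the names `toy_train`, `toy_test`, `train_scaled`, and `test_scaled` are introduced only for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy rows with made-up values on very different scales.
toy_train = pd.DataFrame({
    "age": [30.0, 50.0, 70.0],
    "cholesterol": [150.0, 250.0, 400.0],
})
toy_test = pd.DataFrame({"age": [55.0], "cholesterol": [240.0]})

scaler = StandardScaler()

# Learn the mean and standard deviation from the training rows only.
train_scaled = pd.DataFrame(
    scaler.fit_transform(toy_train), columns=toy_train.columns
)

# Reuse the training statistics for the test rows (no refitting).
test_scaled = pd.DataFrame(
    scaler.transform(toy_test), columns=toy_test.columns
)

print(train_scaled)
print(test_scaled)
```

Fitting the scaler on the training rows and only transforming the test rows keeps information from the test set out of the preprocessing step.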

Next, we summarize seven categorical variables in the training data. (We omit `slope_st` because it seems too technical for us.) *We may have to consider the imbalance of the values when we train our models.*

In [6]:
print(heart_disease_train["sex"].value_counts(), end='\n\n')
print(heart_disease_train["chest_pain"].value_counts(), end='\n\n')
print(heart_disease_train["fasting_blood_sugar"].value_counts(), end='\n\n')
print(heart_disease_train["resting_electrocardiographic_results"].value_counts(), end='\n\n')
print(heart_disease_train["exercise_induced_angina"].value_counts(), end='\n\n')
print(heart_disease_train["thal"].value_counts(), end='\n\n')
print(heart_disease_train["heart_disease"].value_counts())

male      153
female     69
Name: sex, dtype: int64

Asymptomatic        101
Non-anginal pain     68
Atypical angina      37
Typical angina       16
Name: chest_pain, dtype: int64

False    191
True      31
Name: fasting_blood_sugar, dtype: int64

0.0    111
2.0    108
1.0      3
Name: resting_electrocardiographic_results, dtype: int64

No     157
Yes     65
Name: exercise_induced_angina, dtype: int64

Normal               126
Reversable defect     82
Fixed defect          14
Name: thal, dtype: int64

undiagnosed    120
diagnosed      102
Name: heart_disease, dtype: int64
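
To judge the imbalance, proportions are often easier to read than raw counts. The sketch below rebuilds a label series from the two `heart_disease` counts printed above (120 and 102) and uses `value_counts(normalize=True)`; the names `labels` and `proportions` are introduced only for illustration.

```python
import pandas as pd

# Rebuild a label series from the counts reported in the output above.
labels = pd.Series(
    ["undiagnosed"] * 120 + ["diagnosed"] * 102, name="heart_disease"
)

# normalize=True returns each class as a share of all rows.
proportions = labels.value_counts(normalize=True).round(3)
print(proportions)
```

The target classes are split roughly 54% to 46%, so a classifier that always predicts "undiagnosed" would already reach about 54% accuracy on the training data, which is a useful baseline to compare our models against.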
