# Summary of Preliminary Analysis

This notebook is a temporary workspace for deciding what we should do in our actual analysis. First, we summarize the results of the preliminary analysis from our proposal.

In [1]:
# This cleaning-up procedure is mostly the same as our proposal
# We added some steps to omit rows containing unclassified values
import altair as alt
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler

In [2]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"

original_data = pd.read_csv(url, names=[
    "age", 
    "sex", 
    "chest_pain", 
    "resting_blood_pressure", 
    "cholesterol", 
    "fasting_blood_sugar", 
    "resting_electrocardiographic_results", 
    "max_heart_rate",
    "exercise_induced_angina",  
    "st_depression_exercise", 
    "slope_st", 
    "major_vessels", 
    "thal",
    "diagnosis"
])

hd_data = original_data.dropna(how='all')

# omit rows containing unclassified values
hd_data = hd_data.drop(
    hd_data[hd_data.major_vessels == '?'].index
)
hd_data["major_vessels"] = pd.to_numeric(hd_data["major_vessels"])

hd_data = hd_data.drop(
    hd_data[hd_data.thal == '?'].index
)
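
As an alternative to dropping the `'?'` rows column by column, the placeholder can be turned into a missing value while the file is parsed. The following is a minimal sketch, not the procedure used in this notebook: the three rows are made-up placeholders trimmed to four columns so that the example runs without downloading the data.

```python
from io import StringIO

import pandas as pd

# Three made-up rows in the same comma-separated layout as the data file,
# trimmed to four columns for brevity (hypothetical values).
sample = StringIO(
    "63.0,1.0,0.0,6.0\n"
    "52.0,1.0,?,3.0\n"
    "48.0,0.0,1.0,?\n"
)
columns = ["age", "sex", "major_vessels", "thal"]

# na_values="?" converts the placeholder to NaN at parse time,
# so both columns are read as floats rather than strings.
df = pd.read_csv(sample, names=columns, na_values="?")

# Drop every row that is missing either value in one step.
clean = df.dropna(subset=["major_vessels", "thal"])

print(clean)
print(clean.dtypes)
```

With this approach `thal` would be numeric rather than a string, so the keys of the label mapping in the next cell would have to be `3.0`, `6.0`, and `7.0` instead of `'3.0'`, `'6.0'`, and `'7.0'`.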

In [3]:
# replace numeric values with categorical labels for categorical variables
# except for 'resting_electrocardiographic_results' which requires a long description for each label
hd_data["sex"] = hd_data["sex"].apply(lambda x: "male" if (x == 1.0) else "female")

hd_data["chest_pain"] = hd_data["chest_pain"].replace({
    1: 'Typical angina',
    2: 'Atypical angina',
    3: 'Non-anginal pain',
    4: 'Asymptomatic'
})

hd_data["fasting_blood_sugar"] = hd_data["fasting_blood_sugar"].apply(lambda x: True if (x == 1.0) else False)

hd_data["exercise_induced_angina"] = hd_data["exercise_induced_angina"].apply(lambda x: 'Yes' if (x == 1.0) else 'No')

hd_data["slope_st"] = hd_data["slope_st"].replace({
    1.0: 'Upsloping',
    2.0: 'Flat',
    3.0: 'Downsloping'
})

hd_data["thal"] = hd_data["thal"].replace({
    '3.0': 'Normal',
    '6.0': 'Fixed defect',
    '7.0': 'Rreversable defect'
})


In [4]:
# define a column, 'heart_disease', based on the 'diagnosis' column and drop 'diagnosis'
hd_data["heart_disease"] = hd_data["diagnosis"].apply(
    lambda x: "undiagnosed" if (x == 0) else "diagnosed")
hd_data = hd_data.drop(columns=["diagnosis"])

hd_data.head()

| | age | sex | chest_pain | resting_blood_pressure | cholesterol | fasting_blood_sugar | resting_electrocardiographic_results | max_heart_rate | exercise_induced_angina | st_depression_exercise | slope_st | major_vessels | thal | heart_disease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63.0 | male | Typical angina | 145.0 | 233.0 | True | 2.0 | 150.0 | No | 2.3 | Downsloping | 0.0 | Fixed defect | undiagnosed |
| 1 | 67.0 | male | Asymptomatic | 160.0 | 286.0 | False | 2.0 | 108.0 | Yes | 1.5 | Flat | 3.0 | Normal | diagnosed |
| 2 | 67.0 | male | Asymptomatic | 120.0 | 229.0 | False | 2.0 | 129.0 | Yes | 2.6 | Flat | 2.0 | Rreversable defect | diagnosed |
| 3 | 37.0 | male | Non-anginal pain | 130.0 | 250.0 | False | 0.0 | 187.0 | No | 3.5 | Downsloping | 0.0 | Normal | undiagnosed |
| 4 | 41.0 | female | Atypical angina | 130.0 | 204.0 | False | 2.0 | 172.0 | No | 1.4 | Upsloping | 0.0 | Normal | undiagnosed |


Now we have a 14-column dataset that contains six numerical variables and eight categorical variables. The following table summarizes the columns, together with whether our preliminary analysis suggested a possible correlation with the heart disease diagnosis.

| # | Variable | Description | Value | Possible Correlation |
|---|----------|-------------|-------|----------------------|
| 1 | age | Individual's age | numerical | Yes |
| 2 | sex | Individual's sex | male or female | Yes |
| 3 | chest_pain | Chest pain type | Typical angina, Atypical angina, Non-anginal pain, or Asymptomatic | Yes |
| 4 | resting_blood_pressure | Resting blood pressure (in mm Hg on admission to the hospital) | numerical | Worth rechecking (see Note 1 below) |
| 5 | cholesterol | Serum cholesterol in mg/dL | numerical | Needs re-analysis (see Note 2 below) |
| 6 | fasting_blood_sugar | Fasting blood sugar > 120 mg/dL | True or False | No |
| 7 | resting_electrocardiographic_results | Resting electrocardiographic results | <ul><li>0: normal</li><li>1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)</li><li>2: showing probable or definite left ventricular hypertrophy by Estes' criteria</li></ul> | No |
| 8 | max_heart_rate | Maximum heart rate achieved | numerical | Yes |
| 9 | exercise_induced_angina | Exercise induced angina | Yes or No | Yes |
| 10 | st_depression_exercise | ST depression induced by exercise relative to rest | numerical | Yes |
| 11 | slope_st | The slope of the peak exercise ST segment | Upsloping, Flat, or Downsloping | Yes (but too technical) |
| 12 | major_vessels | Number of major vessels (0-3) colored by fluoroscopy | numerical | Yes |
| 13 | thal | Thalassemia | Normal, Fixed defect, or Reversable defect | Yes |
| 14 | heart_disease | Diagnosis of presence of heart disease | diagnosed or undiagnosed | NA |

Note 1: For `resting_blood_pressure`, we concluded in the proposal that the variable does not appear to be correlated with the heart disease diagnosis. However, the visualization does suggest a pattern: the samples diagnosed with heart disease seem to have higher blood pressure. So we might consider including this variable in our analysis.

Note 2: For `cholesterol`, our preliminary analysis was not done correctly. As the Scaling and Centering section below shows, the mean serum cholesterol is about 247 mg/dL. The visualization in our proposal does not appear consistent with this value, so it probably does not represent the observations of this variable properly. We should redo this analysis before starting the actual analysis.
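
One way to recheck both notes numerically is to compare group-wise summaries of the two variables. The sketch below only illustrates the mechanics: it uses the five rows from the `hd_data.head()` preview above, which are far too few to conclude anything. In the notebook itself, the same `groupby` call could be applied to `hd_data`.

```python
import pandas as pd

# The five rows from the hd_data.head() preview above, restricted to the
# two variables discussed in Notes 1 and 2 plus the class label.
preview = pd.DataFrame({
    "resting_blood_pressure": [145.0, 160.0, 120.0, 130.0, 130.0],
    "cholesterol": [233.0, 286.0, 229.0, 250.0, 204.0],
    "heart_disease": [
        "undiagnosed", "diagnosed", "diagnosed", "undiagnosed", "undiagnosed",
    ],
})

# Mean and median of each variable within each diagnosis group.
group_summary = (
    preview
    .groupby("heart_disease")[["resting_blood_pressure", "cholesterol"]]
    .agg(["mean", "median"])
)
print(group_summary)
```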

## Scaling and Centering

### Variations in the numeric variables

We will show some statistics for all six numeric variables in the original data.

In [5]:
age_stat = hd_data.agg({'age': ['mean', 'median', 'min', 'max']}).round(decimals=1)
rbp_stat = hd_data.agg({'resting_blood_pressure': ['mean', 'median', 'min', 'max']}).round(decimals=2)
chol_stat = hd_data.agg({'cholesterol': ['mean', 'median', 'min', 'max']}).round(decimals=2)
max_hr_stat = hd_data.agg({'max_heart_rate': ['mean', 'median', 'min', 'max']}).round(decimals=2)
st_depression_exercise = hd_data.agg({'st_depression_exercise': ['mean', 'median', 'min', 'max']}).round(decimals=2)
major_vessels = hd_data.agg({'major_vessels': ['mean', 'median', 'min', 'max']}).round(decimals=2)

num_stat = pd.concat([age_stat, rbp_stat, chol_stat, max_hr_stat, st_depression_exercise, major_vessels], axis=1)
num_stat.transpose()

| | mean | median | min | max |
|---|---|---|---|---|
| age | 54.5 | 56.0 | 29.0 | 77.0 |
| resting_blood_pressure | 131.69 | 130.0 | 94.0 | 200.0 |
| cholesterol | 247.35 | 243.0 | 126.0 | 564.0 |
| max_heart_rate | 149.6 | 153.0 | 71.0 | 202.0 |
| st_depression_exercise | 1.06 | 0.8 | 0.0 | 6.2 |
| major_vessels | 0.68 | 0.0 | 0.0 | 3.0 |


These variables are on very different scales, so we will standardize them before using them in our analysis with the K-nearest neighbours algorithm.
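
The reason scale matters is that K-nearest neighbours compares observations by distance, and a variable measured in large units can dominate that distance. The sketch below assumes Euclidean distance and uses only two variables, `age` and `cholesterol`, for the first two rows of the data. The raw values come from the `hd_data.head()` preview and the standardized values from the `hd_scaled.head()` preview in this notebook.

```python
import numpy as np

# age and cholesterol for rows 0 and 1, in original units
raw_a = np.array([63.0, 233.0])
raw_b = np.array([67.0, 286.0])

# the same two rows after standardization (values from the hd_scaled preview)
scaled_a = np.array([0.936181, -0.276443])
scaled_b = np.array([1.378929, 0.744555])


def cholesterol_share(a, b):
    """Fraction of the squared Euclidean distance contributed by cholesterol."""
    squared_diff = (a - b) ** 2
    return squared_diff[1] / squared_diff.sum()


raw_share = cholesterol_share(raw_a, raw_b)
scaled_share = cholesterol_share(scaled_a, scaled_b)

print(f"raw units:     {raw_share:.3f}")
print(f"standardized:  {scaled_share:.3f}")
```

In raw units almost all of the distance between these two rows comes from cholesterol. After standardization the age difference contributes a visibly larger share.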

In [6]:
hd_preprocessor = make_column_transformer(
    (StandardScaler(), [
        "age",
        "resting_blood_pressure",
        "cholesterol",
        "max_heart_rate",
        "st_depression_exercise",
        "major_vessels"
    ]),
    ("passthrough", [
        "sex",
        "chest_pain",
        "fasting_blood_sugar",
        "resting_electrocardiographic_results",
        "exercise_induced_angina",
        "slope_st", 
        "thal",
        "heart_disease"
    ]),
)

hd_scaled = pd.DataFrame(
    hd_preprocessor.fit_transform(hd_data),
    columns=[
        "age",
        "resting_blood_pressure",
        "cholesterol",
        "max_heart_rate",
        "st_depression_exercise",
        "major_vessels",
        "sex",
        "chest_pain",
        "fasting_blood_sugar",
        "resting_electrocardiographic_results",
        "exercise_induced_angina",
        "slope_st", 
        "thal",
        "heart_disease"
    ],
)
hd_scaled.head()

| | age | resting_blood_pressure | cholesterol | max_heart_rate | st_depression_exercise | major_vessels | sex | chest_pain | fasting_blood_sugar | resting_electrocardiographic_results | exercise_induced_angina | slope_st | thal | heart_disease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.936181 | 0.75038 | -0.276443 | 0.017494 | 1.068965 | -0.721976 | male | Typical angina | True | 2.0 | No | Downsloping | Fixed defect | undiagnosed |
| 1 | 1.378929 | 1.596266 | 0.744555 | -1.816334 | 0.381773 | 2.478425 | male | Asymptomatic | False | 2.0 | Yes | Flat | Normal | diagnosed |
| 2 | 1.378929 | -0.659431 | -0.3535 | -0.89942 | 1.326662 | 1.411625 | male | Asymptomatic | False | 2.0 | Yes | Flat | Rreversable defect | diagnosed |
| 3 | -1.94168 | -0.095506 | 0.051047 | 1.63301 | 2.099753 | -0.721976 | male | Non-anginal pain | False | 0.0 | No | Downsloping | Normal | undiagnosed |
| 4 | -1.498933 | -0.095506 | -0.835103 | 0.978071 | 0.295874 | -0.721976 | female | Atypical angina | False | 2.0 | No | Upsloping | Normal | undiagnosed |
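
One thing to check before modelling: because the transformer output mixes scaled numbers with passed-through text labels, the combined array is likely to have `object` dtype, and the numeric columns of `hd_scaled` would then be `object` as well. The sketch below imitates that situation with two small hand-built arrays (stand-ins, not the real transformer output) and converts the numeric columns back to floats.

```python
import numpy as np
import pandas as pd

# Stand-ins for the two blocks that get concatenated:
# scaled numeric output (float) and passed-through labels (object).
scaled_part = np.array([[0.936181, -0.276443], [1.378929, 0.744555]])
label_part = np.array(
    [["male", "undiagnosed"], ["male", "diagnosed"]], dtype=object
)

# Stacking floats next to objects yields a single object-dtype array.
combined = np.hstack([scaled_part, label_part])
frame = pd.DataFrame(
    combined, columns=["age", "cholesterol", "sex", "heart_disease"]
)
print(frame.dtypes)

# Convert the numeric columns back to float64 explicitly.
frame = frame.astype({"age": float, "cholesterol": float})
print(frame.dtypes)
```

In the notebook, running `hd_scaled.dtypes` would confirm whether this conversion is needed.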


### Imbalance of values in categorical variables

We will also show the value counts for seven categorical variables in the original data. (We omit `slope_st` because it seems too technical for us.) *We may have to consider the imbalance of the values when we train our models.*

In [7]:
print(hd_data["sex"].value_counts(), end='\n\n')
print(hd_data["chest_pain"].value_counts(), end='\n\n')
print(hd_data["fasting_blood_sugar"].value_counts(), end='\n\n')
print(hd_data["resting_electrocardiographic_results"].value_counts(), end='\n\n')
print(hd_data["exercise_induced_angina"].value_counts(), end='\n\n')
print(hd_data["thal"].value_counts(), end='\n\n')
print(hd_data["heart_disease"].value_counts())

male      201
female     96
Name: sex, dtype: int64

Asymptomatic        142
Non-anginal pain     83
Atypical angina      49
Typical angina       23
Name: chest_pain, dtype: int64

False    254
True      43
Name: fasting_blood_sugar, dtype: int64

0.0    147
2.0    146
1.0      4
Name: resting_electrocardiographic_results, dtype: int64

No     200
Yes     97
Name: exercise_induced_angina, dtype: int64

Normal                164
Rreversable defect    115
Fixed defect           18
Name: thal, dtype: int64

undiagnosed    160
diagnosed      137
Name: heart_disease, dtype: int64
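
The imbalance is easier to compare across variables when the counts are expressed as proportions. The sketch below recomputes them from three of the count tables printed above. In the notebook, `value_counts(normalize=True)` on the corresponding `hd_data` column gives the same proportions directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = {
    "sex": pd.Series({"male": 201, "female": 96}),
    "fasting_blood_sugar": pd.Series({False: 254, True: 43}),
    "heart_disease": pd.Series({"undiagnosed": 160, "diagnosed": 137}),
}

# Share of each value within its variable, rounded to three decimals.
shares_by_variable = {
    name: (series / series.sum()).round(3) for name, series in counts.items()
}

for name, shares in shares_by_variable.items():
    print(name, shares.to_dict())
```

The target `heart_disease` is fairly balanced, while `fasting_blood_sugar` is heavily skewed towards `False`.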


## Training Data and Test Data

We will split the `hd_scaled` data into a training set and a test set and *use only the training set* from now on.

In [8]:
from sklearn.model_selection import train_test_split

hd_scaled_train, hd_scaled_test = train_test_split(hd_scaled, test_size=0.25, stratify=hd_scaled["heart_disease"])
print(hd_scaled_train["heart_disease"].value_counts(normalize=True))
print(hd_scaled_test["heart_disease"].value_counts(normalize=True))

undiagnosed    0.540541
diagnosed      0.459459
Name: heart_disease, dtype: float64
undiagnosed    0.533333
diagnosed      0.466667
Name: heart_disease, dtype: float64
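
The split above draws different rows each time the cell is run, because no seed is given. The sketch below shows how passing `random_state` would make the split reproducible. The labels are rebuilt from the class counts reported earlier (160 undiagnosed, 137 diagnosed), the single feature column is only a placeholder row index, and the seed value `123` is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels rebuilt from the class counts above; the feature column is a
# placeholder row index, not real data.
labels = np.array(["undiagnosed"] * 160 + ["diagnosed"] * 137)
features = np.arange(len(labels)).reshape(-1, 1)

SEED = 123  # hypothetical seed; any fixed integer works


def split(seed):
    """Stratified 75/25 split with a fixed seed."""
    return train_test_split(
        features, labels, test_size=0.25, stratify=labels, random_state=seed
    )


X_train, X_test, y_train, y_test = split(SEED)
X_train_again, _, _, _ = split(SEED)

print(len(y_train), len(y_test))
# Same seed, same rows in the training set.
print(np.array_equal(X_train, X_train_again))
```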
