## Heart Disease Appearance - Statistics and Data Analysis Research

**Content**:

1. [Dataset.](#1)

    1.1. [Data source and a brief overview.](#1.1)
    
    1.2. [Descriptive statistics.](#1.2)
    
2. [Feature selection.](#2)

    2.1. [Linear dependence. Covariance. Normalization.](#2.1)
    
    2.2. [Research question.](#2.2)
    
3. [Features analysis.](#3)

    3.1. [Numeric features.](#3.1)
    
    3.2. [Categorical features.](#3.2)
    
4. [Testing hypothesis.](#4)

    4.1. [Feature importance.](#4.1)
    
    4.2. [Conclusion.](#4.2)


### 1. Dataset

<a id='1'></a>

#### 1.1. Data source and brief overview

<a id='1.1'></a>

The dataset for this project is the **heart disease** dataset. It is a subset of 14 features that has been used in various medical experiments. It contains patients' health indicators as well as the **target variable**, which represents whether or not a patient was diagnosed with heart disease.

The heart disease dataset was initially built from the databases of 4 medical institutions in Switzerland, Hungary, and the US (see https://archive.ics.uci.edu/ml/datasets/heart+disease for more details).

The authors of the databases have requested that any publications resulting from the use of the data include the names of the **principal investigator responsible for the data collection at each institution**. They are:
1. Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
2. University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
3. University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
4. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.

**Brief overview**

The dataset looks as follows:

<img src='png/dataset.png' width=600/>

There are 1025 observations and 14 features. Every **unit of observation** is a patient, and every feature is one field of that patient's record. The data is **cross-sectional**, since all observations were made before the diagnosis was given, and **individual-level**, because each person's records are presented without aggregation (patients were previously identified by social IDs).

The features break down as follows:

* `age`: age in years
* `sex`: 1 = male; 0 = female
* `cp`: chest pain type (4 values)
* `trestbps`: resting blood pressure (in mm Hg on admission to the hospital)
* `chol`: serum cholesterol in mg/dl
* `fbs`: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
* `restecg`: resting electrocardiographic results (values 0, 1, 2)
* `thalach`: maximum heart rate achieved
* `exang`: exercise-induced angina (1 = yes; 0 = no)
* `oldpeak`: ST depression induced by exercise relative to rest
* `slope`: the slope of the peak exercise ST segment
* `ca`: number of major vessels (0-3) colored by fluoroscopy
* `thal`: 0 = normal; 1 = fixed defect; 2 = reversible defect
* `target`: disease (1) or no disease (0)

#### 1.2. Descriptive statistics

<a id='1.2'></a>

**First view**

Aggregated mean values of features grouped by `target` variable:

<img src='png/descriptive-statistics-aggregated.png' width=800/>

Based on the data, three things stand out. About 4 out of 5 people without heart disease (`target` = 0) in our sample are men (`sex` = 1). The two groups have clearly different subsample means for the `cp` variable (so homogeneous subsamples could probably be drawn). `age`, by contrast, shows no visible difference between the groups.

Without better knowledge of the sample's behavior and characteristics (variance, covariance, imbalances, probability density), it is impossible to draw reliable conclusions.
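
The grouped means above can be reproduced with a few lines of Python. The sketch below is only an illustration: the rows are made-up values, not records from the dataset, and `group_means` is a hypothetical helper name. It also shows why the mean of a 0/1 column such as `sex` can be read directly as a share of the group.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical mini-sample: a few made-up rows, NOT taken from the real dataset.
rows = [
    {"sex": 1, "cp": 0, "age": 63, "target": 0},
    {"sex": 1, "cp": 0, "age": 58, "target": 0},
    {"sex": 0, "cp": 1, "age": 55, "target": 0},
    {"sex": 0, "cp": 2, "age": 52, "target": 1},
    {"sex": 1, "cp": 2, "age": 60, "target": 1},
    {"sex": 0, "cp": 1, "age": 49, "target": 1},
]


def group_means(rows, by):
    """Return {group value: {feature: mean}} for every feature except `by`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[by]].append(row)
    return {
        key: {
            name: mean(r[name] for r in members)
            for name in members[0]
            if name != by
        }
        for key, members in groups.items()
    }


means_by_target = group_means(rows, by="target")
for target, means in sorted(means_by_target.items()):
    # The mean of the binary `sex` column is the share of men in the group.
    print(target, means)
```

With a tabular library the same aggregation is usually a one-liner, but the explicit version makes clear that each cell of the table is just a per-group average.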

**Categorical and numerical**

Among the 14 features, 9 are categorical and 5 are numeric.

Numeric:

<img src='png/numeric-features.png' width=600/>

At first glance, the numeric variables offer no insights, except for `oldpeak`. Its subsample means differ, and so do its distributions, judging by the quartile values ($Q_3\{\text{ill}\} = 1.0$, whereas $Q_2\{\text{healthy}\} = 1.4$): three quarters of ill patients have `oldpeak` at or below 1.0, while half of the healthy patients have 1.4 or more.
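
Quartiles like the ones quoted above can be computed per subsample with the standard library. This is a minimal sketch on made-up `oldpeak` readings (not real data); the `"inclusive"` interpolation method is an assumption, and other quartile conventions give slightly different values.

```python
from statistics import quantiles

# Hypothetical `oldpeak` readings, split by `target`; not real data.
oldpeak_by_target = {
    0: [0.0, 0.8, 1.4, 2.0, 3.1],
    1: [0.0, 0.0, 0.2, 1.0, 1.6],
}


def quartiles(values):
    """Return (Q1, Q2, Q3) using linear interpolation between data points."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return q1, q2, q3


for target, values in oldpeak_by_target.items():
    print(target, quartiles(values))
```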

Categorical:

<img src='png/categorical-features.png' width=700/>

### 2. Feature selection

<a id='2'></a>

The purpose of the investigation is to find the variables, or sets of variables, that **strongly relate** to the `target` variable (i.e. could *predict* its behaviour). There are two types of dependence: linear and non-linear.

#### 2.1. Linear dependence. Covariance. Normalization.

<a id='2.1'></a>

In statistics, linear dependence is measured by **covariance**:

$$\large\text{cov}(X,Y) = \mathbb{E}[(X - \mathbb{E}(X))(Y - \mathbb{E}(Y))] = \mathbb{E}[XY] - \mathbb{E}(X)\mathbb{E}(Y)$$

**Correlation** (Pearson's correlation) is the normalized version of covariance. Its absolute value ranges from $|\rho| = 0$ (no correlation) to $|\rho| = 1$ (full linear dependence):

$$\large\rho_{X,Y} = \text{corr}(X,Y) = \frac{\text{cov}(X,Y)}{\sigma_{X}\sigma_{Y}} = \frac{\mathbb{E}[XY] - \mathbb{E}(X)\mathbb{E}(Y)}{\sigma_{X}\sigma_{Y}}$$
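
The two formulas translate almost literally into code. The sketch below uses population moments (dividing by $n$) and made-up numbers; the names `age` and `max_rate` are chosen for illustration and hold no real measurements.

```python
from math import sqrt


def covariance(x, y):
    """Population covariance: E[XY] - E[X]E[Y]."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    mean_xy = sum(a * b for a, b in zip(x, y)) / n
    return mean_xy - mean_x * mean_y


def correlation(x, y):
    """Pearson correlation: cov(X, Y) / (sigma_X * sigma_Y)."""
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))


# Hypothetical values chosen only to illustrate the formulas.
age = [40, 50, 60, 70]
max_rate = [180, 165, 150, 135]  # falls by exactly 15 per 10 years

print(covariance(age, max_rate))   # negative: the variables move in opposite directions
print(correlation(age, max_rate))  # close to -1: perfectly linear, decreasing
```

Multiplying `max_rate` by any positive constant changes the covariance but leaves the correlation untouched, which is the sense in which correlation is "normalized".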

Heart disease dataset correlation coefficients:

<img src='visualizations/correlation-coefficients.png' width=600/>

The matrix is useful for interpreting linear dependencies. For instance, the `thalach` feature has a coefficient of -0.39 (moderate correlation) with `age` and only -0.04 (no correlation) with `trestbps`.

<img src="visualizations/thalach-vs-age-trestbps.png"/>

The linear regressions illustrated above show that `thalach` and `trestbps` have no linear dependence. As for the **mean squared errors** (MSEs) of these regressions, comparing their values directly is misleading, since the data is not on a standard scale (i.e. not *normalized*).

The main use of the correlation matrix is to explore the dependencies of the `target` variable. `target` has moderate correlation coefficients with, for example, `cp`: +0.43, `exang`: -0.44, `oldpeak`: -0.44, `thalach`: +0.42.
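
Why raw MSEs cannot be compared across regressions can be seen in a small experiment. The sketch below is illustrative only (made-up numbers, hypothetical helper names `standardize` and `fit_line_mse`): rescaling the response changes the MSE by the square of the scale factor, while after z-score standardization the slope equals Pearson's $\rho$ and the MSE equals $1 - \rho^2$, which is comparable between feature pairs.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu, sigma = fmean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def fit_line_mse(x, y):
    """Fit y = a + b*x by least squares; return (slope, mean squared error)."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    mse = fmean((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    return slope, mse


# Hypothetical data, used only to show the effect of scale.
x = [1, 2, 3, 4, 5]
y = [2.0, 1.0, 4.0, 3.0, 6.0]
y_rescaled = [10 * v for v in y]  # same relationship, different units

_, mse_raw = fit_line_mse(x, y)
_, mse_rescaled = fit_line_mse(x, y_rescaled)
slope_std, mse_std = fit_line_mse(standardize(x), standardize(y))

print(mse_raw, mse_rescaled)  # differ by a factor of about 100 purely because of units
print(slope_std, mse_std)     # slope equals Pearson's r; MSE equals 1 - r**2
```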

#### 2.2. Research question

<a id='2.2'></a>

Analyzing every feature in a dataset and its relationship to the other variables is a lengthy process, so it is practical to select several features to start from. Heart rate `thalach` is chosen because it correlates well with other features and could be very useful. Blood pressure `trestbps` and cholesterol level `chol` are chosen because they have the clearest medical explanation and are numeric, which leaves only two numeric variables (`age` and `oldpeak`) out. All three could play a vital role in the human organism. `cp` was noticed earlier to potentially have homogeneous subsamples.

Based on the reasoning above, four features will be explored in depth, and the **research question** is:

"*To what extent do maximum heart rate, resting blood pressure, cholesterol level, and chest pain type contribute to the appearance of heart disease among people (based on the medical database we have)?*"

To answer that question, the features will be analyzed in terms of their distributions ([section 3](#3)) and with a baseline algorithm ([section 4](#4)).

### 3. Features analysis

<a id='3'></a>

#### 3.1. Numeric features.

<a id='3.1'></a>

It is important to know the distributions of the variables, since they give an intuition of how each variable naturally behaves.

<img src='visualizations/thalach.png' width=600/>

Maximum heart rate closely resembles a **normal distribution**, which is a significant distribution for statistics and machine learning: it is extensively described mathematically (in terms of probability and the Central Limit Theorem) and present in many natural phenomena. If a population is normally distributed, it can be fully described by its *mean* and *variance*.

The PDFs above overlap, so grouping by this feature alone will not produce very effective results. Since the already selected *set* of features is under exploration, a feature can be dropped only after [section 4](#4).
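
One way to put a number on "the PDFs overlap" is the overlapping coefficient of two fitted normal distributions. The sketch below is an assumption-laden illustration, not the method used for the plots: it fits a normal distribution to each subsample from its mean and standard deviation and uses `statistics.NormalDist.overlap` from the Python standard library. The heart rates are made up.

```python
from statistics import NormalDist

# Hypothetical maximum heart rates for two groups; not real data.
rates_no_disease = [120, 132, 138, 141, 150, 158, 165]
rates_disease = [135, 148, 155, 160, 166, 172, 181]

# Fit a normal distribution to each subsample from its mean and standard deviation.
fit_no_disease = NormalDist.from_samples(rates_no_disease)
fit_disease = NormalDist.from_samples(rates_disease)

# Overlapping coefficient: shared area under the two PDFs, between 0 and 1.
overlap = fit_no_disease.overlap(fit_disease)
print(round(fit_no_disease.mean, 1), round(fit_disease.mean, 1), round(overlap, 2))
```

A coefficient near 1 means the feature alone can barely tell the groups apart; a coefficient near 0 means the groups are almost fully separated by it.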

<img src='visualizations/chol.png' width=600/>

<img src='visualizations/trestbps.png' width=600/>

Knowing only `trestbps` (and not the rest of the dataset), one can make predictions no better than a coin flip, since the distributions of this variable are practically identical in both subsamples. The situation is a bit better with the `chol` feature. Still, a set of features can have strong predictive power even if it contains features that perform poorly when used alone. This kind of relationship is not necessarily *linear*; most commonly it is *non-linear*.

#### 3.2. Categorical features

<a id='3.2'></a>

Categorical variables are discrete (whereas numeric variables are continuous), and their distributions are better visualized with *histograms* than with PDFs.

<img src='visualizations/cp.png' width=600/>

`cp` is a useful variable, since it contributes to the understanding of a patient's condition. $\approx 75\%$ of patients without disease have typical angina, whereas only $\approx 23\%$ of ill patients have that `cp` type. The probability that a person without heart disease has asymptomatic angina is $P(\text{cp = 3}|\text{target = 0}) = 0.065 = 6.5\%$, while $P(\text{cp = 2}|\text{target = 1}) = 0.416 = 41.6\%$, which is almost half of the sample of ill people (based on sample results). Consequently, `cp` could ensure proper group separation, which is valuable both for analysis and for predictive algorithms.
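
Conditional probabilities of this kind are relative frequencies within a `target` group. A minimal sketch, assuming the data is available as `(cp, target)` pairs; the pairs below are made up, and `conditional_probability` is a hypothetical helper name:

```python
from collections import Counter

# Hypothetical (cp, target) pairs; not taken from the real dataset.
pairs = [(0, 0), (0, 0), (0, 0), (3, 0), (0, 1), (2, 1), (2, 1), (1, 1)]


def conditional_probability(pairs, cp, target):
    """Estimate P(cp | target) as a relative frequency within the target group."""
    group_size = sum(1 for _, t in pairs if t == target)
    joint = Counter(pairs)[(cp, target)]
    return joint / group_size


print(conditional_probability(pairs, cp=0, target=0))  # 0.75
print(conditional_probability(pairs, cp=2, target=1))  # 0.5
```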

### 4. Testing hypothesis

<a id='4'></a>

To assess the overall performance of the chosen set of variables, an algorithm that predicts the `target` variable should be applied. The `RandomForestClassifier` ensemble method from the *scikit-learn* library will be used for that.
    
#### 4.1. Feature importances

<a id='4.1'></a>
    
The score with that set of features is $99.12\%$. The algorithm is able to predict almost every patient correctly.

*Permutation* importance scores are more reliable, since the model's *inner* (built-in) importance scores are affected by linear correlation between variables.

<img src='visualizations/4-feature-importances.png' width=800/>

As illustrated, `cp` and `thalach` are the most important variables for identifying heart disease. `trestbps` plays the least significant role. In fact, repeating the same analysis without that variable *does not* affect the quality of the algorithm's predictions.
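
The idea behind permutation importance is model-agnostic: shuffle one column, re-score the model, and record how much the score drops. The sketch below is a from-scratch illustration in plain Python, not the code that produced the figure. `toy_model` is a hypothetical stand-in for the fitted random forest, and the data is made up. For real use, scikit-learn ships `sklearn.inspection.permutation_importance`.

```python
import random


def accuracy(predict, X, y):
    """Share of rows for which predict(row) equals the true label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling each column, one at a time."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical stand-in model: predicts disease from column 0 only and
# ignores column 1, so column 1 must get an importance of exactly zero.
def toy_model(row):
    return 1 if row[0] >= 2 else 0


# Made-up rows: column 0 plays the role of `cp`, column 1 of `trestbps`.
X = [[0, 130], [0, 120], [1, 140], [2, 125], [2, 150], [3, 135]]
y = [0, 0, 0, 1, 1, 1]

print(permutation_importance(toy_model, X, y))
```

A feature the model never relies on scores zero, which matches the observation that removing the least important variable leaves prediction quality unchanged.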

#### 4.2. Conclusion

<a id='4.2'></a>
    
The answer to the research question is explicit. Based on the sample drawn from the population of patients of 4 medical institutions, maximum heart rate achieved, cholesterol level, chest pain type, and resting blood pressure together give high-accuracy predictions. Thus, the development of heart disease (i.e. the `target` variable) in humans is highly related to those records.