## Project: Predicting heart diseases with ML

The main objective of this study is to build a model that can __predict__ the occurrence of heart disease based on a combination of features (risk factors) describing the disease. Different machine learning __classification__ techniques will be implemented and compared using standard performance metrics such as accuracy.

The dataset used for this study was taken from the UCI Machine Learning Repository, where it is titled __[“Heart Disease Data Set”](http://archive.ics.uci.edu/ml/datasets/Heart+Disease)__.


Contents of the Notebook:

1. Dataset structure & description
2. Analyze, identify patterns, and explore the data
3. Data preparation
4. Modelling and predicting with Machine Learning


### Import libraries

In [1]:
# data analysis, splitting and wrangling
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


# import necessary machine learning libraries (from sklearn, preferably)


## 1. Dataset structure & description

The dataset used in this project contains 14 variables. The dependent (target) variable to be predicted, 'diagnosis', indicates whether a person is healthy or suffers from heart disease. Experiments with the Cleveland database have concentrated on distinguishing disease presence (values 1, 2, 3, 4) from absence (value 0). Several attribute values are missing and are marked with the symbol '?'. The dataset has no header row, so the column names have to be inserted manually.

### Features information:

- age - age in years
- sex - sex (1 = male; 0 = female)
- chest_pain - chest pain type (1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 4 = asymptomatic)
- blood_pressure - resting blood pressure (in mm Hg on admission to the hospital)
- serum_cholestoral - serum cholesterol in mg/dl
- fasting_blood_sugar - fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
- electrocardiographic - resting electrocardiographic results (0 = normal; 1 = having ST-T wave abnormality; 2 = probable or definite left ventricular hypertrophy)
- max_heart_rate - maximum heart rate achieved
- induced_angina - exercise induced angina (1 = yes; 0 = no)
- ST_depression - ST depression induced by exercise relative to rest
- slope - the slope of the peak exercise ST segment (1 = upsloping; 2 = flat; 3 = downsloping)
- no_of_vessels - number of major vessels (0-3) colored by fluoroscopy
- thal - 3 = normal; 6 = fixed defect; 7 = reversible defect
- diagnosis - the predicted attribute - diagnosis of heart disease (angiographic disease status) (Value 0 = < 50% diameter narrowing; Value 1 = > 50% diameter narrowing)

### Types of features:

__Categorical features__ (Has two or more categories and each value in that feature can be categorised by them): __sex, chest_pain__  


__Ordinal features__ (Variable having relative ordering or sorting between the values): __fasting_blood_sugar, electrocardiographic, induced_angina, slope, no_of_vessels, thal, diagnosis__


__Continuous features__ (Variable that can take any value between the minimum and maximum values in the feature column): __age, blood_pressure, serum_cholestoral, max_heart_rate, ST_depression__


### Load data

In [3]:
# name columns according to the feature information provided above


# read the file


# display the first 5 lines
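One way to fill in the loading cell is sketched here. The helper name `load_heart_data`, the file name in the final comment, and the three sample rows are assumptions made for illustration; the rows are not taken from the real file. The sketch passes the column names explicitly because the file has no header row, and tells pandas to treat '?' as a missing value.

```python
import io

import pandas as pd

# Column names follow the feature list above (the raw file has no header row).
COLUMN_NAMES = [
    "age", "sex", "chest_pain", "blood_pressure", "serum_cholestoral",
    "fasting_blood_sugar", "electrocardiographic", "max_heart_rate",
    "induced_angina", "ST_depression", "slope", "no_of_vessels", "thal",
    "diagnosis",
]


def load_heart_data(source):
    """Read a headerless CSV file or buffer and treat '?' as missing."""
    return pd.read_csv(source, header=None, names=COLUMN_NAMES, na_values="?")


# Made-up rows for illustration only; two of them contain a '?' entry.
sample = io.StringIO(
    "54,1,4,130,240,0,0,150,0,1.2,2,0,3,0\n"
    "57,0,4,130,250,0,0,160,1,1.0,2,?,3,2\n"
    "48,1,3,120,210,0,0,170,0,0.0,1,1,?,0\n"
)
df = load_heart_data(sample)
print(df.head())

# With the downloaded file (file name is an assumption):
# df = load_heart_data("processed.cleveland.data")
```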


In [5]:
# get info on the dataframe

In [6]:
# extract numeric columns and find categorical ones
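The two inspection cells above could look roughly like the following sketch. It uses a small made-up DataFrame with a subset of the columns, and it assumes the numeric columns are simply the five continuous features listed earlier; every other column is treated as categorical.

```python
import pandas as pd

# Toy data with a subset of the columns, for illustration only.
df = pd.DataFrame({
    "age": [54, 61, 47, 58],
    "sex": [1, 0, 1, 1],
    "chest_pain": [4, 3, 2, 4],
    "blood_pressure": [130, 140, 120, 150],
    "serum_cholestoral": [240, 260, 210, 300],
    "max_heart_rate": [150, 140, 170, 120],
    "ST_depression": [1.2, 0.0, 0.4, 2.5],
    "diagnosis": [0, 1, 0, 3],
})

# Column dtypes and non-null counts.
df.info()

# The continuous features named in the feature overview.
numeric_columns = [
    "age", "blood_pressure", "serum_cholestoral",
    "max_heart_rate", "ST_depression",
]
# Everything else is treated as categorical here.
categorical_columns = [c for c in df.columns if c not in numeric_columns]
df_numeric = df[numeric_columns]
print(categorical_columns)
```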


## 2. Analyze features, identify patterns, and explore the data

### Target value



In [7]:
# count values of explained variable

In [8]:
# create a boolean vector and map it with corresponding values (True=1, False=0)
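A minimal sketch of the two target cells above, using a made-up `diagnosis` series: it counts each class, then collapses the values 1 to 4 into a single "disease present" label via a boolean vector.

```python
import pandas as pd

# Made-up target values in the 0-4 range, for illustration only.
diagnosis = pd.Series([0, 2, 1, 0, 3, 0, 4, 1], name="diagnosis")

# Count how often each diagnosis value occurs.
print(diagnosis.value_counts().sort_index())

# Boolean vector: True where any disease is present (value > 0),
# then mapped to 1/0 as described in the cell comment.
binary_diagnosis = (diagnosis > 0).map({True: 1, False: 0})
print(binary_diagnosis.value_counts())
```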

### Numeric features

There are 5 numeric columns, so let's take care of them first.
Outliers may result from data-entry errors and add unwanted noise, so we need to assess whether they are plausible. Here, a data point is considered an outlier when it lies more than 3 standard deviations from the column mean.

In [9]:
# view of descriptive statistics
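The three-standard-deviation rule described above can be sketched as follows. The helper `find_outliers` and the sample values are hypothetical; the helper computes a z-score for every value and returns those whose absolute z-score exceeds the threshold.

```python
import pandas as pd


def find_outliers(series, n_std=3.0):
    """Return values lying more than n_std standard deviations from the mean."""
    z_scores = (series - series.mean()) / series.std()
    return series[z_scores.abs() > n_std]


# Made-up cholesterol-like values with one extreme entry at the end.
values = pd.Series([200, 210, 220, 230, 240] * 4 + [564],
                   name="serum_cholestoral")

# Descriptive statistics (count, mean, std, min, quartiles, max).
print(values.describe())

outliers = find_outliers(values)
print(outliers)
```

Note that with very few rows no value can reach a z-score of 3, so the rule only makes sense on a reasonably sized column.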


All extreme (min/max) values could occur in a real clinical scenario, hence the decision to keep them as they are. 

We can gain some intuition about relationships among the numeric features by drawing a scatter plot for each pair. The *pairplot* function from the Seaborn library does this efficiently.

## 3. Data Preparation

To make our dataset compatible with the machine learning algorithms in the scikit-learn library, we first need to handle all missing data.

There are many options we could consider when replacing a missing value, for example:
- A constant value that has meaning within the domain, such as 0, distinct from all other values
- A value from another randomly selected record
- A mean, median or mode value for the column
- A value estimated by another predictive model

In [10]:
# show columns having missing values


In [11]:
# fill missing values with mode
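The two cells above could be filled in along these lines. This is a sketch on a made-up DataFrame: it lists the columns that contain missing values and replaces each missing entry with the most frequent value (mode) of its column.

```python
import numpy as np
import pandas as pd

# Made-up data with two missing entries, for illustration only.
df = pd.DataFrame({
    "age": [54, 61, 47, 58, 66],
    "no_of_vessels": [0.0, np.nan, 0.0, 1.0, 2.0],
    "thal": [3.0, 7.0, np.nan, 7.0, 6.0],
})

# Columns that contain at least one missing value.
missing_counts = df.isnull().sum()
columns_with_missing = missing_counts[missing_counts > 0].index.tolist()
print(missing_counts[columns_with_missing])

# Replace missing entries with the column mode.
# mode() returns a Series, so [0] picks the first (smallest) mode on ties.
for column in columns_with_missing:
    most_frequent = df[column].mode()[0]
    df[column] = df[column].fillna(most_frequent)
```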

In [12]:
# extract the target variable


In [14]:
# split the data into train and test datasets
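Extracting the target and splitting the data might look like the sketch below. The 80/20 split, the `random_state` value, and stratification on the target are assumptions chosen for illustration, and the DataFrame is made up.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data with two features and a binary target, for illustration only.
df = pd.DataFrame({
    "age": list(range(40, 60)),
    "max_heart_rate": list(range(180, 140, -2)),
    "diagnosis": [0, 1] * 10,
})

# Separate the feature matrix from the target variable.
X = df.drop(columns="diagnosis")
y = df["diagnosis"]

# Hold out 20% of the rows; stratify keeps the class ratio equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```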


Many machine learning algorithms, especially distance-based and gradient-based ones, perform better when features are normalized or standardized first. Standardization rescales each feature so that a value expresses how many standard deviations it lies from the feature's mean. After standardization, each feature has a mean (µ) of 0 and a standard deviation (σ) of 1.

In [16]:
# scale feature matrices
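A minimal scaling sketch with `StandardScaler`, using tiny made-up matrices. The scaler is fitted on the training data only and then applied to the test data, so that no information from the test set leaks into the preprocessing step.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny made-up feature matrices, for illustration only.
X_train = np.array([[50.0, 120.0], [60.0, 140.0], [70.0, 160.0]])
X_test = np.array([[60.0, 130.0]])

scaler = StandardScaler()
# Learn the per-column mean and standard deviation from the training data.
X_train_scaled = scaler.fit_transform(X_train)
# Reuse the training statistics for the test data.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_test_scaled)
```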


## 4. Modelling and predicting with Machine Learning

Now you are free to use any machine learning algorithm you find appropriate to build a model for predicting the occurrence of heart disease. Good luck!
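As one possible starting point, the sketch below trains a logistic regression baseline and reports its test accuracy. The choice of model is an assumption, not a recommendation from the notebook, and the data is generated with `make_classification` as a stand-in for the prepared heart disease features.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 13 features and a binary target.
X, y = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scaling and the classifier are chained so both are fitted on training data only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Test accuracy: {accuracy:.3f}")
```

Swapping `LogisticRegression` for another scikit-learn classifier, such as `RandomForestClassifier` or `KNeighborsClassifier`, leaves the rest of the sketch unchanged, which makes it easy to compare models on the same split.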