### Objective

Detect the presence of heart disease in patients.

### Inspiration

This project was inspired by the UCI Center for Machine Learning and Intelligent Systems.

### Citation
http://archive.ics.uci.edu/ml/datasets.html

### Data Dictionary

Complete attribute documentation:

      1 age: age in years
      2 sex: sex (1 = male; 0 = female)
      3 cp: chest pain type
        -- Value 1: typical angina
        -- Value 2: atypical angina
        -- Value 3: non-anginal pain
        -- Value 4: asymptomatic
      4 trestbps: resting blood pressure (in mm Hg on admission to the 
        hospital)
      5 chol: serum cholesterol in mg/dl
      6 fbs: (fasting blood sugar > 120 mg/dl)  (1 = true; 0 = false)
      7 restecg: resting electrocardiographic results
        -- Value 0: normal
        -- Value 1: having ST-T wave abnormality (T wave inversions and/or ST 
                    elevation or depression of > 0.05 mV)
        -- Value 2: showing probable or definite left ventricular hypertrophy
                    by Estes' criteria
      8 thalach: maximum heart rate achieved
      9 exang: exercise induced angina (1 = yes; 0 = no)
     10 oldpeak: ST depression induced by exercise relative to rest
     11 slope: the slope of the peak exercise ST segment
        -- Value 1: upsloping
        -- Value 2: flat
        -- Value 3: downsloping
     12 ca: number of major vessels (0-3) colored by fluoroscopy
     13 thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
     14 num: diagnosis of heart disease (angiographic disease status)
        -- Value 0 (0): < 50% diameter narrowing
        -- Value 1 (1 to 4): > 50% diameter narrowing
        (in any major vessel: attributes 59 through 68 are vessels)

### Phase 1 : Import required libraries and datasets and write user-defined functions

In [1]:
#Import required libraries
import pandas as pd
import numpy as np
import seaborn as sns

In [2]:
from IPython.display import display_html

In [3]:
%matplotlib inline

In [4]:
def print_shape(obj):
    
    '''
    
    Function to print dimension(s) of object
    
    '''
    
    if isinstance(obj, list):
        print('Length of list is : ', len(obj))
    else:
        print('Shape of dataframe is : ', obj.shape)

In [5]:
def print_missing(df):
    
    '''
    
    Function to print missing value and missing percentage of each column of dataframe
    
    '''
    
    miss_val = pd.DataFrame(df.isnull().sum()).reset_index()
    miss_val.columns = ['Column', 'Missing']
    
    #Multiply by 100 so that the Missing% column holds a percentage rather than a fraction
    miss_perc = pd.DataFrame(100 * df.isnull().sum()/df.shape[0]).reset_index()
    miss_perc.columns = ['Column', 'Missing%']
    
    print_side_by_side(miss_val, miss_perc)

In [6]:
def print_side_by_side(*args):
    
    '''
    
    Function to print pandas dataframes side by side
    
    '''
    
    html_str=''
    
    for df in args:
        html_str+=df.to_html()
    
    #Match only opening <table tags so that closing </table> tags are left untouched
    display_html(html_str.replace('<table','<table style="display:inline"'),raw=True)
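The two helpers above show counts and proportions as two separate tables. As an alternative, a minimal sketch that reports both figures in a single table is given below; the name `missing_summary` and the toy data are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(frame):
    """Return one table with the count and percentage of missing values per column."""
    counts = frame.isnull().sum()
    return pd.DataFrame({'Missing': counts, 'Missing%': 100 * counts / len(frame)})

# Toy data for illustration only: one of four 'ca' values is missing
toy = pd.DataFrame({'ca': [0.0, np.nan, 2.0, 1.0], 'thal': [3.0, 7.0, 6.0, 3.0]})
summary = missing_summary(toy)
print(summary)
```

Returning a dataframe instead of printing it keeps the result usable in later cells, for example to filter for columns with any missing values.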

In [7]:
def print_dist(df, col, opt):
    
    '''
    
    Function to print distribution of a variable
    
    '''
    
    print('Distribution :\n')
        
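    if opt == 'num':
        #describe must be called; without parentheses only the bound method is printed
        print(df[col].describe())
        
    elif opt == 'cat':
        #normalize=True returns proportions instead of raw counts
        print(df[col].value_counts(normalize=True))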
    
    else:
        print('Wrong option!')
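For the `'cat'` option, the printed figures are proportions rather than counts. A minimal sketch of that behaviour on a made-up series (the values are illustrative, not from the dataset):

```python
import pandas as pd

# Hypothetical labels: two zeros, one 1, one 2
labels = pd.Series([0, 0, 1, 2], name='num')

# normalize=True divides each count by the total number of non-missing values
shares = labels.value_counts(normalize=True)
print(shares)
```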
        

In [8]:
#Set the column names as there are no headers in the csv file
colnames = ['age', 
            'sex', 
            'cp', 
            'trestbps', 
            'chol', 
            'fbs', 
            'restecg', 
            'thalach', 
            'exang', 
            'oldpeak', 
            'slope', 
            'ca', 
            'thal', 
            'num']

In [9]:
#Read the csv file
df = pd.read_csv('../data/raw/processed.cleveland.data', header=None, names=colnames, na_values=[-9.0, np.nan, '?'])
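To illustrate what the `na_values` argument does, the sketch below parses a small in-memory sample in which `'?'` marks a missing entry. The two rows and the reduced column list are assumptions made up for this example.

```python
import io
import pandas as pd

# Hypothetical sample, not taken from the dataset: the second row marks 'ca' as missing with '?'
sample = io.StringIO("63.0,1.0,0.0\n67.0,1.0,?\n")

# Same idea as the call above: no header row, explicit names, '?' treated as missing
toy = pd.read_csv(sample, header=None, names=['age', 'sex', 'ca'], na_values=['?'])

missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

Because `'?'` is converted to NaN at read time, the affected column is parsed as a float column rather than as text.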

In [10]:
#Print distribution of target variable
print_dist(df, 'num', 'cat')

Distribution :

0    0.541254
1    0.181518
2    0.118812
3    0.115512
4    0.042904
Name: num, dtype: float64


### Phase 2 : Data Manipulation

#### Part A : Missing value treatment

In [11]:
#Print missing values and percentage of missing
print_missing(df)

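| | Column | Missing |
|---|---|---|
| 0 | age | 0 |
| 1 | sex | 0 |
| 2 | cp | 0 |
| 3 | trestbps | 0 |
| 4 | chol | 0 |
| 5 | fbs | 0 |
| 6 | restecg | 0 |
| 7 | thalach | 0 |
| 8 | exang | 0 |
| 9 | oldpeak | 0 |

| | Column | Missing% |
|---|---|---|
| 0 | age | 0.0 |
| 1 | sex | 0.0 |
| 2 | cp | 0.0 |
| 3 | trestbps | 0.0 |
| 4 | chol | 0.0 |
| 5 | fbs | 0.0 |
| 6 | restecg | 0.0 |
| 7 | thalach | 0.0 |
| 8 | exang | 0.0 |
| 9 | oldpeak | 0.0 |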


Since very few values are missing, we drop the rows that contain them.

In [12]:
#Drop missing values
df = df.dropna()
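One way to confirm how many rows such a drop removes is to compare the row count before and after. The sketch below does this on toy data; the frame and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration: two of the three rows contain a missing value
toy = pd.DataFrame({'ca': [0.0, np.nan, 2.0], 'thal': [3.0, 7.0, np.nan]})

rows_before = len(toy)
cleaned = toy.dropna()  # drops every row that has at least one NaN
rows_dropped = rows_before - len(cleaned)
print(rows_dropped)
```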

Sanity check to confirm that the missing values have been removed:

In [13]:
#Print missing values and percentage of missing
print_missing(df)

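| | Column | Missing |
|---|---|---|
| 0 | age | 0 |
| 1 | sex | 0 |
| 2 | cp | 0 |
| 3 | trestbps | 0 |
| 4 | chol | 0 |
| 5 | fbs | 0 |
| 6 | restecg | 0 |
| 7 | thalach | 0 |
| 8 | exang | 0 |
| 9 | oldpeak | 0 |

| | Column | Missing% |
|---|---|---|
| 0 | age | 0.0 |
| 1 | sex | 0.0 |
| 2 | cp | 0.0 |
| 3 | trestbps | 0.0 |
| 4 | chol | 0.0 |
| 5 | fbs | 0.0 |
| 6 | restecg | 0.0 |
| 7 | thalach | 0.0 |
| 8 | exang | 0.0 |
| 9 | oldpeak | 0.0 |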


In [14]:
#Print distribution of target variable
print_dist(df, 'num', 'cat')

Distribution :

0    0.538721
1    0.181818
3    0.117845
2    0.117845
4    0.043771
Name: num, dtype: float64


#### Part B : Change the target variable

As mentioned in the data dictionary, the **num** field takes the values 0-4, where 0 indicates the absence of heart disease and 1-4 indicate its presence. Let us recode this field as a binary target to reflect that.

In [15]:
#Change 1-4 to 1 to reflect presence of heart disease
mask = df['num'] > 0
df.loc[mask, 'num'] = 1
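The same recoding can also be written as a single comparison. This is a sketch on a toy frame, offered as an equivalent alternative rather than the notebook's own approach:

```python
import pandas as pd

# Toy target column covering every original code 0-4
toy = pd.DataFrame({'num': [0, 1, 2, 3, 4]})

# The comparison yields True for codes 1-4; casting to int maps True -> 1 and False -> 0
toy['num'] = (toy['num'] > 0).astype('int64')
print(toy['num'].tolist())
```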

In [16]:
#Print distribution of target variable
print_dist(df, 'num', 'cat')

Distribution :

0    0.538721
1    0.461279
Name: num, dtype: float64


#### Part C : Change appropriate columns to integer 

In [17]:
#Specify columns to convert to integer
convert_cols = ['sex', 'fbs', 'exang']
#astype takes no axis argument; convert the selected columns directly
df[convert_cols] = df[convert_cols].astype('int64')

#### Part D : Change appropriate columns to string

In [18]:
#Specify columns to convert to string
convert_cols = ['cp', 'restecg', 'slope', 'ca', 'thal']
#Convert the selected columns in one call instead of applying row by row
df[convert_cols] = df[convert_cols].astype('str')
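Because these columns hold floats at the time of conversion, their string form keeps the decimal part. This is why the dummy columns created in Phase 4 carry names such as `cp_2.0`. A minimal sketch on made-up values:

```python
import pandas as pd

# Toy float column standing in for a categorical code such as 'cp'
toy = pd.DataFrame({'cp': [1.0, 2.0, 4.0]})

# Casting floats to str keeps the trailing '.0'
as_text = toy['cp'].astype(str)
print(as_text.tolist())
```

If labels without the decimal part were preferred, casting to an integer type before the string conversion would be one option; the notebook does not do this.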

### Phase 3 : Data Visualization

A dashboard has already been built; assuming the local Jupyter server is running, it can be accessed at:

http://localhost:8888/notebooks/heart_disease_prediction/notebooks/data_visualization.ipynb?dashboard#

A report of the same visualizations is available [here](http://localhost:8888/tree/heart_disease_prediction/reports/data_visualization.pdf).


### Phase 4 : Master Data Preparation

The following columns will be dummified (one-hot encoded):
+ cp
+ restecg
+ slope
+ ca
+ thal

#### Part A : Dummify required columns

In [19]:
#Specify columns to dummify and dummify them
dummify_cols = ['cp', 'restecg', 'slope', 'ca', 'thal']
#Pass the list as columns=; the second positional argument of get_dummies is prefix
df = pd.get_dummies(df, columns=dummify_cols, drop_first=True)
df.head()

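| | age | sex | trestbps | chol | fbs | thalach | exang | oldpeak | num | cp_2.0 | ... | cp_4.0 | restecg_1.0 | restecg_2.0 | slope_2.0 | slope_3.0 | ca_1.0 | ca_2.0 | ca_3.0 | thal_6.0 | thal_7.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63.0 | 1 | 145.0 | 233.0 | 1 | 150.0 | 0 | 2.3 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 1 | 67.0 | 1 | 160.0 | 286.0 | 0 | 108.0 | 1 | 1.5 | 1 | 0 | ... | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 67.0 | 1 | 120.0 | 229.0 | 0 | 129.0 | 1 | 2.6 | 1 | 0 | ... | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 3 | 37.0 | 1 | 130.0 | 250.0 | 0 | 187.0 | 0 | 3.5 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 41.0 | 0 | 130.0 | 204.0 | 0 | 172.0 | 0 | 1.4 | 0 | 1 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |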
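To show what `drop_first=True` does, the sketch below dummifies a single toy column. The frame is hypothetical; with three categories, the first one in sorted order (`'1.0'`) is dropped and becomes the implicit baseline.

```python
import pandas as pd

# Toy frame for illustration: 'cp' is stored as text, as after the string conversion above
toy = pd.DataFrame({'age': [63.0, 67.0, 41.0], 'cp': ['1.0', '4.0', '2.0']})

# columns= names the columns to encode; drop_first removes the first category of each
encoded = pd.get_dummies(toy, columns=['cp'], drop_first=True)
print(list(encoded.columns))
```

Dropping one level per variable avoids perfectly collinear dummy columns, which matters for models such as unregularized logistic regression.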
