# Practice with Heart Disease Data

**age**: age in years

**sex**: (1 = male; 0 = female)

**cp**: chest pain type

**trestbps**: resting blood pressure (in mm Hg on admission to the hospital)

**chol**: serum cholesterol in mg/dl

**fbs**: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)

**restecg**: resting electrocardiographic results

**thalach**: maximum heart rate achieved

**exang**: exercise induced angina (1 = yes; 0 = no)

**oldpeak**: ST depression induced by exercise relative to rest

**slope**: the slope of the peak exercise ST segment

**ca**: number of major vessels (0-3) colored by fluoroscopy

**thal**: 3 = normal; 6 = fixed defect; 7 = reversible defect

**target**: heart disease diagnosis (1 = heart disease; 0 = no heart disease)

In [7]:
!pip install matplotlib --upgrade
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import statsmodels.api as sm


from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import confusion_matrix, classification_report


In [None]:
heart = pd.read_csv('heart.csv')
heart.head()

# Exploratory Data Analysis

## 1. How many are suffering from heart disease? Plot the stats and include a conclusion statement at the end.

In [None]:
print(len(heart[heart['target'] == 1]))
print((len(heart[heart['target'] == 1]) / heart.shape[0]) * 100)

In [None]:
# Pass the column by keyword; recent seaborn versions no longer accept it positionally.
sns.countplot(x = 'target', data = heart)
plt.title('Heart Disease Sufferers')
plt.xlabel('Diagnosis')
plt.ylabel('Count')
plt.show()

More than half (54.45%) of patients in this dataset have been diagnosed with heart disease.

## 2. How many males and females have heart disease? Use only one plot to find the gender most impacted by heart disease.

In [None]:
# One bar per diagnosis, split by sex (0 = female, 1 = male).
sns.countplot(x = 'target', hue = 'sex', data = heart)
plt.legend(['F', 'M'])
plt.title('Heart Disease by Gender')
plt.xlabel('Diagnosis')
plt.ylabel('Count')
plt.show()

There are more men than women in this dataset. Men are more affected by heart disease than women here; however, a positive diagnosis is more common than a negative diagnosis for both sexes.
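Bar heights can be hard to compare by eye, so statements like these are worth checking against a cross-tabulation. The sketch below uses a small made-up `demo` DataFrame (hypothetical values, not the real `heart.csv`) with the same `sex` and `target` coding; substituting `heart` for `demo` would give the actual figures.

```python
import pandas as pd

# Hypothetical toy data with the same coding as the dataset
# (sex: 1 = male, 0 = female; target: 1 = disease, 0 = no disease).
demo = pd.DataFrame({
    'sex':    [1, 1, 1, 1, 1, 0, 0, 0],
    'target': [1, 1, 0, 0, 0, 1, 1, 0],
})

# Raw counts: rows are sex, columns are diagnosis.
counts = pd.crosstab(demo['sex'], demo['target'])

# Share of each sex with each diagnosis (every row sums to 1).
rates = pd.crosstab(demo['sex'], demo['target'], normalize='index')

print(counts)
print(rates.round(2))
```

Comparing the row-normalized rates rather than the raw counts separates "more men in the sample" from "men more often diagnosed".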

## 3. Create a visual representation of the frequency distribution of the thalach variable and find the relation between heart rate and heart disease. Run various statistical tests to provide a conclusion.

In [None]:
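# distplot is deprecated; histplot with a KDE overlay draws the same kind of plot.
sns.histplot(heart['thalach'], kde = True, stat = 'density')
plt.xlabel('Maximum Heart Rate')
plt.title('Maximum Heart Rate Distribution')
plt.show()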

In [None]:
# distplot is deprecated; histplot with a KDE overlay draws the same kind of plot.
sns.histplot(heart[heart['target']==1]['thalach'], kde = True, stat = 'density')
plt.title('Heart Rate Distribution in Patients with Heart Disease')
plt.xlabel('Maximum Heart Rate')
plt.show()

sns.histplot(heart[heart['target']==0]['thalach'], kde = True, stat = 'density')
plt.title('Heart Rate Distribution in Patients without Heart Disease')
plt.xlabel('Maximum Heart Rate')
plt.show()

It appears that patients with heart disease generally have a higher maximum heart rate, but to determine if this difference is statistically significant, statistical tests are needed. First, assumptions need to be checked. Is the data normally distributed?

In [None]:
stats.describe(heart[heart['target']==1]['thalach'])

In [None]:
stats.describe(heart[heart['target']==0]['thalach'])

Judging by the skewness and kurtosis reported by `stats.describe`, both distributions are close to normal, so I'll use a t-test.
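A more formal normality check is also possible. The sketch below is one way to do it, using the Shapiro-Wilk test from SciPy on synthetic data (the `sample` array is an assumed stand-in for one group's `thalach` values, not the real data).

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for one group's thalach values (assumed, not real data).
rng = np.random.default_rng(0)
sample = rng.normal(loc=150, scale=20, size=150)

# Skewness and excess kurtosis near 0 are consistent with normality.
skewness = stats.skew(sample)
excess_kurtosis = stats.kurtosis(sample)

# Shapiro-Wilk: a small p-value (e.g. < 0.05) is evidence against normality.
w_stat, p_value = stats.shapiro(sample)

print(f'skew={skewness:.2f}, kurtosis={excess_kurtosis:.2f}, Shapiro p={p_value:.3f}')
```

With the real data, the same calls would be run once per group, for example on `heart[heart['target']==1]['thalach']`.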

In [None]:
heart_pos = heart[heart['target'] ==1]
heart_neg = heart[heart['target'] ==0]

In [None]:
stats.ttest_ind(heart_pos['thalach'], heart_neg['thalach'])

The t-test shows a statistically significant difference in maximum heart rate between patients with and without heart disease.
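A significant p-value does not say how large the difference is. The sketch below, on synthetic groups (assumed values, not the real data), pairs Welch's t-test, which does not assume equal variances, with Cohen's d as a rough effect size.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the two groups (assumed values, not real data).
rng = np.random.default_rng(1)
group_pos = rng.normal(loc=160, scale=20, size=150)
group_neg = rng.normal(loc=140, scale=20, size=150)

# Welch's t-test: like ttest_ind, but without the equal-variance assumption.
t_stat, p_value = stats.ttest_ind(group_pos, group_neg, equal_var=False)

# Cohen's d: mean difference in units of the pooled standard deviation
# (this simple pooling is valid because both groups have the same size).
pooled_sd = np.sqrt((group_pos.var(ddof=1) + group_neg.var(ddof=1)) / 2)
cohens_d = (group_pos.mean() - group_neg.mean()) / pooled_sd

print(f't={t_stat:.2f}, p={p_value:.4f}, d={cohens_d:.2f}')
```

As a common rule of thumb, d around 0.2 is considered small, 0.5 medium, and 0.8 large.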

## 4. Find correlation matrix for all the variables with target. Find Mean, Min & Max of age and plot its distribution.

In [None]:
# Keep only the 'target' column of the correlation matrix and show it as a single row.
sns.heatmap(heart.corr()[['target']].T, annot = True)

In [None]:
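# distplot is deprecated; histplot with a KDE overlay draws the same kind of plot.
sns.histplot(heart['age'], kde = True, stat = 'density')
plt.xlabel('Age (Years)')
plt.axvline(heart['age'].mean(), color = 'black')
plt.title('Distribution of Age')
plt.show()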

In [None]:
print(f'The mean age is {round(heart.age.mean())} years')
print(f'The min age is {heart.age.min()} years')
print(f'The max age is {heart.age.max()} years')



## 5. Age and its relation to heart disease. Are young people more prone to heart disease?

In [None]:
plt.hist(heart_pos['age'], label ='Diseased', alpha = 0.5)
plt.hist(heart_neg['age'], label ='Healthy', alpha = 0.5)
plt.legend(loc = 'upper right')
plt.xlabel('Age (Years)')
plt.title('Age Distribution')
plt.axvline(heart_pos['age'].mean(), color = 'red')
plt.axvline(heart_neg['age'].mean(), color = 'blue')

In [None]:
stats.describe(heart_pos['age'])

In [None]:
stats.describe(heart_neg['age'])

In [None]:
stats.ttest_ind(heart_pos['age'], heart_neg['age'])

In [None]:
print(f"The average age of those with heart disease is {round(heart_pos['age'].mean())} years.")
print(f"The average age of those without heart disease is {round(heart_neg['age'].mean())} years.")

There is a significant difference in age between people with and without heart disease. On average, people with heart disease are younger than those without heart disease.

## 6. Plot chest pain type pie chart.

In [None]:
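# Sort by chest pain type so each count lines up with its label
# (value_counts() alone orders by frequency, not by type).
labels =['Type 1', 'Type 2', 'Type 3', 'Type 4']
values = heart['cp'].value_counts().sort_index().values
explode = (0.1, 0, 0, 0)
plt.pie(values, labels = labels, explode = explode)
plt.show()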

## 7. What is the max heart rate achieved in non-heart disease patients?

In [None]:
heart_neg['thalach'].max()

# Machine Learning Model

## 1. Test different machine learning models to see which has the highest accuracy.

In [None]:
X = heart.drop('target', axis = 1)
y = heart['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, stratify = y, random_state = 21)

In [None]:
X

### Logistic Regression

In [None]:
lr_grid = {
    'C': [0.1, 1, 10, 20],
    'solver': ['newton-cg', 'lbfgs', 'liblinear','sag', 'saga'],
    'max_iter': [50, 100, 1000, 10000]
}

model_lr_grid = GridSearchCV(LogisticRegression(max_iter = 1000), param_grid = lr_grid, verbose = 1, n_jobs = -1)
model_lr_grid.fit(X_train, y_train)

print(model_lr_grid.best_params_)
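Rather than copying the printed parameters by hand, the refitted best model can be taken directly from the search object. The sketch below shows the idea on synthetic data from `make_classification`; the names `X_demo`, `y_demo`, `grid`, and `best_lr` are illustrative and not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the heart data (assumed, not the real dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=21
)

# With the default refit=True, the best configuration is refitted on all of X_tr.
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid={'C': [0.1, 1, 10]}, cv=5)
grid.fit(X_tr, y_tr)

# best_estimator_ is a ready-to-use model trained with the winning parameters.
best_lr = grid.best_estimator_
test_score = best_lr.score(X_te, y_te)

print(grid.best_params_)
print(test_score)
```

This keeps the final model in sync with the search if the grid or the data changes.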

In [None]:
lr = LogisticRegression(C = 10, max_iter = 100, solver ='lbfgs')
lr.fit(X_train, y_train)

y_pred_lr = lr.predict(X_test)
confusion_df = pd.DataFrame(confusion_matrix(y_test, y_pred_lr), columns =['Predicted 0', 'Predicted 1'], index = ['Actual 0', 'Actual 1'])

print('Training Score: {}'.format(lr.score(X_train, y_train)))
print('Test Score: {}'.format(lr.score(X_test, y_test)))
print(classification_report(y_test, y_pred_lr))
print(confusion_df)

### Random Forest

In [None]:
rf = RandomForestClassifier()
rf.fit(X_train, y_train)

y_pred_rf = rf.predict(X_test)
confusion_df = pd.DataFrame(confusion_matrix(y_test, y_pred_rf), columns =['Predicted 0', 'Predicted 1'], index = ['Actual 0', 'Actual 1'])

print('Training Score: {}'.format(rf.score(X_train, y_train)))
print('Test Score: {}'.format(rf.score(X_test, y_test)))
print(classification_report(y_test, y_pred_rf))
print(confusion_df)

### KNN

In [None]:
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

y_pred_knn = knn.predict(X_test)

confusion_df = pd.DataFrame(
    confusion_matrix(y_test, y_pred_knn),
    index=["Actual 0", "Actual 1"],
    columns=["Predicted 0", "Predicted 1"],
)


print('Training Score: {}'.format(knn.score(X_train, y_train)))
print('Test Score: {}'.format(knn.score(X_test, y_test)))
print(classification_report(y_test, y_pred_knn))
print(confusion_df)

Logistic Regression is the best model. Random Forest gives the same precision and recall, but Logistic Regression is less overfit than Random Forest.
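One simple way to compare overfitting is the gap between training and test accuracy for each model. The sketch below runs that comparison on synthetic data (assumed, not the real dataset); the `models` and `gaps` dictionaries are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in with 13 features, like the heart data (assumed, not real).
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=21
)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(random_state=0),
    'KNN': KNeighborsClassifier(),
}

# A large positive gap (train accuracy well above test accuracy) suggests overfitting.
gaps = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    train_acc = model.score(X_tr, y_tr)
    test_acc = model.score(X_te, y_te)
    gaps[name] = train_acc - test_acc
    print(f'{name}: train={train_acc:.3f}, test={test_acc:.3f}, gap={gaps[name]:.3f}')
```

A single train/test split is noisy on a dataset this small, so cross-validated scores would give a steadier comparison.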

## 2. After choosing the best model, use it to predict from user-supplied inputs whether the user is likely to have heart disease.

In [None]:
import numpy as np

In [None]:
X_user = [25, 0, 0, 120, 193, 0, 1, 132, 0, 1.2, 1, 2, 3]

In [None]:
lr.predict(np.array(X_user).reshape(1, -1))

This user is predicted to not have heart disease.
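A bare class label hides how confident the model is. The sketch below shows one way to get a probability as well, and passes the input as a DataFrame so the column names match those seen during training. It uses a made-up three-column training set and a made-up labeling rule (both hypothetical, not the real data or model).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical training data with a few of the notebook's column names.
rng = np.random.default_rng(0)
train = pd.DataFrame({
    'age': rng.integers(30, 75, size=100),
    'thalach': rng.integers(90, 200, size=100),
    'oldpeak': rng.uniform(0, 4, size=100).round(1),
})
labels = (train['thalach'] > 150).astype(int)  # made-up rule, for illustration only

model = LogisticRegression(max_iter=1000).fit(train, labels)

# Build the user row with the same columns, in the same order, as the training data.
user = pd.DataFrame([[25, 132, 1.2]], columns=train.columns)

pred = model.predict(user)[0]             # predicted class: 0 or 1
proba = model.predict_proba(user)[0, 1]   # estimated probability of class 1

print(pred, round(proba, 3))
```

Reusing the training columns guards against passing the values in the wrong order.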

# Deployment of Model

## 1. Create a user-based form within the Jupyter notebook (documentation) to receive input from the user. Form should include all parameters needed to predict heart disease probability.

In [None]:
from ipywidgets import widgets

def heart_pred(age_in, sex_in, cp_in, trestbps_in, chol_in, fbs_in, restecg_in, thalach_in,exang_in,oldpeak_in,slope_in, ca_in, thal_in):
    if sex_in == 'Male':
        sex_bin = 1
    else:
        sex_bin = 0
    X = np.array([age_in, sex_bin, cp_in, trestbps_in, chol_in, fbs_in, restecg_in, thalach_in, exang_in, oldpeak_in, slope_in, ca_in, thal_in]).astype(float)
    pred = lr.predict(X.reshape(1, -1))
    if pred[0] == 1:
        print('Heart Disease Likely')
    else:
        print('No heart disease')


results = widgets.interactive(heart_pred, 
                              age_in = widgets.IntText(description = 'Age:'),
                              sex_in = widgets.Dropdown(options = ['Male', 'Female'], 
                                                        value ='Male', 
                                                        description ='Sex: '),
                              cp_in = widgets.Dropdown(options = ['0', '1', '2', '3'], 
                                                       value ='0',
                                                       description ='Chest Pain Type: '),
                              trestbps_in = widgets.IntText(description ='Resting BP (mm Hg): '),
                              chol_in = widgets.IntText(description = 'Cholesterol (mg/dL): '),
                              fbs_in = widgets.Dropdown(options = ['0', '1'], 
                                                        value = '0', 
                                                        description = 'Fasting Blood Sugar > 120 mg/dL? '),
                              restecg_in = widgets.Dropdown(options = ['0', '1'],
                                                            description = 'Electrocardiograph Results:'),
                              thalach_in = widgets.IntText(description ='Max Heart Rate'),
                              exang_in = widgets.Dropdown(options =['0', '1'], 
                                                          value ='0', 
                                                          description = 'Exercise-induced Angina?'),
                              oldpeak_in = widgets.FloatText(description ='ST Depression:', 
                                                            value = 1.2),
                              slope_in = widgets.Dropdown(options =['0', '1', '2'], 
                                                          value ='0', description ='Slope of ST peak:'),
                              ca_in = widgets.Dropdown(options =['0', '1', '2', '3'],
                                                       value ='0', 
                                                       description ='Number of vessels colored:'),
                              thal_in = widgets.Dropdown(options =['1', '2', '3'], 
                                                         value ='1', 
                                                         description ='Thalassemia:')
)

## 2. Code form to calculate results when user submits form. Test to see if right answer is provided.

Using the information from the first row of X_test:

In [None]:
X_test.iloc[0]

In [None]:
display(results)

In [None]:
#Check the answer:
y_test.iloc[0]

The correct diagnosis is given in this case.