## Heart Disease UCI
<a href = "https://archive.ics.uci.edu/ml/datasets/Heart+Disease">UCI Machine Learning repository</a>

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

%matplotlib inline

## Loading the data set

Let's load the data set and take a first look at it.

In [None]:
raw_dataset = pd.read_csv('../input/heart-disease-uci/heart.csv')

In [None]:
raw_dataset.head()

## Now let's understand what each column means
<br>
<li>age - age of the patient in years</li>
<li>sex - sex of the patient: 1 = male, 0 = female</li>
<li>cp - chest pain type (4 values)</li>
<li>trestbps - resting blood pressure (in mm Hg)</li>
<li>chol - serum cholesterol in mg/dl</li>
<li>fbs - fasting blood sugar > 120 mg/dl (1 = true, 0 = false)</li>
<li>restecg - resting electrocardiographic results (values 0, 1, 2)</li>
<li>thalach - maximum heart rate achieved</li>
<li>exang - exercise-induced angina (1 = yes, 0 = no)</li>
<li>oldpeak - ST depression induced by exercise relative to rest</li>
<li>slope - the slope of the peak exercise ST segment</li>
<li>ca - number of major vessels (0-3) colored by fluoroscopy</li>
<li>thal - 3 = normal; 6 = fixed defect; 7 = reversible defect</li>
<li>target - 1 = heart attack risk, 0 = no heart attack risk</li>
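
Several of the columns above are coded categories rather than continuous measurements. A quick way to tell them apart is to count distinct values per column. The sketch below uses a small made-up frame (`demo`) with a few of the column names rather than the real file, and the cut-off of 5 distinct values is an arbitrary assumption.

```python
import pandas as pd

# Made-up rows that only mimic a few of the columns described above.
demo = pd.DataFrame({
    "age": [63, 37, 41, 56, 57, 44],
    "sex": [1, 1, 0, 1, 0, 1],
    "cp": [3, 2, 1, 1, 0, 2],
    "chol": [233, 250, 204, 236, 354, 263],
})

# Count distinct values per column.
n_unique = demo.nunique()

# Assumption: columns with at most 5 distinct values are treated as coded categories.
categorical_cols = n_unique[n_unique <= 5].index.tolist()
continuous_cols = n_unique[n_unique > 5].index.tolist()

print("categorical:", categorical_cols)
print("continuous:", continuous_cols)
```

On the real data the same call would be `raw_dataset.nunique()`, and the cut-off would need adjusting by eye.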

## Data exploration

In [None]:
raw_dataset.describe()

In [None]:
# checking for correlation
raw_dataset.corr()

In [None]:
plt.figure(figsize=(12,8))

sns.heatmap(data = raw_dataset.corr(), annot=True)
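
A full heatmap can be hard to read when the main question is which features move with the target. As a rough sketch, the column of correlations with `target` can be pulled out and ranked by magnitude. The frame below is synthetic (the `feat_*` names are made up for illustration), not the heart data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the notebook's raw_dataset (not the real heart.csv).
rng = np.random.default_rng(0)
n = 200
target = rng.integers(0, 2, size=n)
demo = pd.DataFrame({
    "feat_pos": target + rng.normal(0, 0.5, size=n),   # built to rise with target
    "feat_neg": -target + rng.normal(0, 0.5, size=n),  # built to fall with target
    "feat_noise": rng.normal(0, 1, size=n),            # unrelated to target
    "target": target,
})

# Correlation of every feature with the target, excluding the target itself.
corr_with_target = demo.corr()["target"].drop("target")

# Reorder so the strongest correlations (by absolute value) come first.
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked)
```

Ranking by absolute value keeps strong negative correlations near the top, where they would otherwise be easy to overlook.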

## Inference 
<br>
There doesn't seem to be any strong positive correlation between any pair of variables.
Let's visualize the data further to look for interesting relationships.

In [None]:
plt.figure(figsize=(12,8))

# scatter plot of age against resting blood pressure, with target as hue

sns.scatterplot(x = raw_dataset['age'], y = raw_dataset['trestbps'], hue = raw_dataset['target'])

It looks like heart attack risk is fairly common among people aged 40-60, and more so when resting blood pressure is above 120. Let's take a closer look at these two variables.

In [None]:
plt.figure(figsize=(6,4))
plt.title("Resting Blood pressure vs Heart Risk")
sns.swarmplot(x = raw_dataset['target'],
              y = raw_dataset['trestbps'])

On a second look, the distribution seems similar for both target classes. Resting blood pressure (RBP) doesn't seem to be a contributing factor according to the data, even though many of the patients at risk of a heart attack have an RBP between 120 and 140.

In [None]:
#Checking the heart risk with age
plt.figure(figsize=(6,4))
plt.title("Age vs Heart Risk")
sns.swarmplot(x = raw_dataset['target'],
              y = raw_dataset['age'])

In [None]:
sns.set()
p = raw_dataset.hist(figsize=(20,20))

The most interesting of the histograms here are chol, cp, ca and oldpeak. Let's look at some swarm plots to see if they tell us anything more.

In [None]:
plt.figure(figsize=(6,4))
plt.title("Cholesterol vs Heart Risk")
sns.swarmplot(x = raw_dataset['target'],
              y = raw_dataset['chol'])

Most of the patients with heart attack risk seem to have cholesterol in the 200-300 range.

In [None]:
plt.figure(figsize=(6,4))
plt.title("cp vs Heart Risk")
sns.swarmplot(x = raw_dataset['target'],
              y = raw_dataset['cp'])

People with chest pain type 1 have a higher risk of a heart attack.

In [None]:
plt.figure(figsize=(6,4))
plt.title("Old Peak vs Heart Risk")
sns.swarmplot(x = raw_dataset['target'],
              y = raw_dataset['oldpeak'])

Most of the people who had a heart attack have an oldpeak below 2.

In [None]:
plt.figure(figsize=(6,4))
plt.title("CA vs Heart Risk")
sns.swarmplot(x = raw_dataset['target'],
              y = raw_dataset['ca'])

The majority of the heart attacks occurred in people with a ca of 0 or 1.
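
Readings taken from swarm plots can be backed with numbers. The sketch below shows one possible way, using a tiny made-up frame in place of `raw_dataset`: a per-group median for a continuous column and a normalized cross-tabulation for a coded column.

```python
import pandas as pd

# Made-up rows; only the column names follow the notebook.
demo = pd.DataFrame({
    "target":  [1, 1, 1, 1, 0, 0, 0, 0],
    "cp":      [1, 2, 2, 1, 0, 0, 1, 0],
    "oldpeak": [0.0, 0.5, 1.0, 0.5, 2.0, 3.0, 1.5, 2.5],
})

# Median of a continuous column within each target group.
median_oldpeak = demo.groupby("target")["oldpeak"].median()

# Share of each chest pain type within each target group (each row sums to 1).
cp_share = pd.crosstab(demo["target"], demo["cp"], normalize="index")

print(median_oldpeak)
print(cp_share)
```

Normalizing by row answers "what share of each target group has this chest pain type"; `normalize="columns"` would instead answer "what share of each chest pain type is at risk", which is the question a statement about risk per category actually needs.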

## Data preprocessing

In [None]:
shuffled_data_set = raw_dataset.sample(frac=1)

In [None]:
print(shuffled_data_set)
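
Because `sample(frac=1)` is called without a seed, the row order (and every score computed later) changes between runs. A minimal sketch of a reproducible shuffle, on a made-up frame and with an arbitrary seed of 42:

```python
import pandas as pd

# Made-up rows standing in for raw_dataset.
demo = pd.DataFrame({"age": [63, 37, 41, 56, 57], "target": [1, 1, 1, 0, 0]})

# Same seed, same row order on every run (42 is an arbitrary choice).
shuffled_a = demo.sample(frac=1, random_state=42)
shuffled_b = demo.sample(frac=1, random_state=42)

# The shuffle keeps the original index labels; reset them for a clean 0..n-1 index.
shuffled_clean = shuffled_a.reset_index(drop=True)
print(shuffled_clean)
```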

Creating numpy arrays of the shuffled dataset

In [None]:
X = shuffled_data_set.iloc[:,:-1].values
y = shuffled_data_set.iloc[:,-1].values

In [None]:
print(X)

In [None]:
print(y)

## Encoding categorical Data

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

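# One-hot encode cp (column index 2) and ca (column index 11); pass the other columns through unchanged
ct = ColumnTransformer(transformers = [('encoder', OneHotEncoder(), [2,11])], remainder='passthrough')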

X = np.array(ct.fit_transform(X))

In [None]:
print(X)
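
The printed array is easier to interpret once the column order is clear: the one-hot columns come first, followed by the passed-through columns. The toy example below illustrates this with three made-up columns and a single encoded column. `sparse_threshold=0` is added here only to force a dense result for printing.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy feature matrix: columns are [age, sex, cp]; only cp (index 2) is encoded here.
X_demo = np.array([
    [63, 1, 3],
    [37, 1, 2],
    [41, 0, 1],
    [56, 1, 1],
])

ct_demo = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [2])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
X_encoded = ct_demo.fit_transform(X_demo)

# Encoded columns come first (one per category: 1, 2, 3), then age and sex.
print(X_encoded)
```

Here the three cp categories become three indicator columns, so the matrix grows from 3 to 5 columns.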

## Splitting the data into train and test sets

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)

In [None]:
print(y_test)

## Making a simple logistic regression model

In [None]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(solver='liblinear')
classifier.fit(X_train,y_train)

## Predicting the results

In [None]:
y_pred_train = classifier.predict(X_train)
y_pred_test = classifier.predict(X_test)

Accuracy on train and test set

In [None]:
from sklearn.metrics import accuracy_score
acc_train = accuracy_score(y_train,y_pred_train)
acc_test = accuracy_score(y_test,y_pred_test)

In [None]:
print("The accuracy on the training set is: " + str(acc_train))
print("The accuracy on the test set is: " + str(acc_test))

The training set accuracy is lower than the test set accuracy. One of the reasons for this might be that the dataset is small (only a few hundred examples), so the score on a small test split is noisy. With more data we could train the model better. Let's try to improve it anyway.
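
With so few rows, a single train/test split can give a misleading score. One common remedy is k-fold cross-validation, sketched below on synthetic data from `make_classification` (300 rows and 13 features are assumed sizes, not the heart data).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the heart data (assumed sizes).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=5, random_state=1
)

# Five stratified folds: every row is used for testing exactly once.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(
    LogisticRegression(solver="liblinear"), X_demo, y_demo, cv=cv, scoring="accuracy"
)

print("Fold accuracies:", scores)
print("Mean accuracy: %.3f (std %.3f)" % (scores.mean(), scores.std()))
```

The spread across folds gives a rough idea of how much a single split's accuracy can move by chance.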

Let's check for any missing values and then scale the data.

In [None]:
shuffled_data_set.isnull().sum()

There is no missing data

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X = sc.fit_transform(shuffled_data_set.drop(['target'], axis = 1))

In [None]:
y = shuffled_data_set.iloc[:,-1].values

## Splitting the data again

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.15, random_state = 1)
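
The scaler above is fit on the full dataset before the split, so the test rows influence the mean and variance used for scaling. One way to avoid this, sketched here on synthetic data from `make_classification` (an assumption, not the heart data), is to put the scaler in a pipeline so it is fit on the training rows only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart data (assumed sizes).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=5, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.15, random_state=1
)

# The scaler inside the pipeline is fit on the training rows only;
# the test rows are transformed with the training mean and variance.
model = make_pipeline(StandardScaler(), LogisticRegression(solver="liblinear"))
model.fit(X_tr, y_tr)

test_accuracy = model.score(X_te, y_te)
print(test_accuracy)
```

On a dataset this small the difference is usually minor, but the pipeline form also works unchanged inside cross-validation and grid search.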

## Dimensionality reduction
We will apply LDA (linear discriminant analysis), as our data is labeled.

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis()
X_train = lda.fit_transform(X_train, y_train)
X_test = lda.transform(X_test)
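
LDA can produce at most `n_classes - 1` components, so with a binary target the features are reduced to a single column. A small sketch on synthetic data (assumed sizes, not the heart data) makes the shape change visible:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic two-class data with 13 features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=5, random_state=1
)

lda_demo = LinearDiscriminantAnalysis()
X_reduced = lda_demo.fit_transform(X_demo, y_demo)

# With two classes, LDA keeps n_classes - 1 = 1 component.
print(X_demo.shape, "->", X_reduced.shape)
```

Any classifier trained after this step therefore sees only one feature per patient.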

## Training the model

In [None]:
# let's use an SVM
from sklearn.svm import SVC
classifier = SVC()
classifier.fit(X_train,y_train)

## Prediction

In [None]:
y_pred_train = classifier.predict(X_train)
y_pred_test = classifier.predict(X_test)

from sklearn.metrics import accuracy_score
acc_train = accuracy_score(y_train,y_pred_train)
acc_test = accuracy_score(y_test,y_pred_test)

print("The accuracy on the training set is: " + str(acc_train))
print("The accuracy on the test set is: " + str(acc_test))

## Model Selection

In [None]:
from sklearn.model_selection import GridSearchCV
parameters = [{'C': [0.25,0.5,0.75,1], 'kernel': ['linear']},
                { 'C':[0.25,0.5,0.75,1], 'kernel': ['rbf'], 'gamma':[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]}]

grid_search = GridSearchCV(estimator=classifier, param_grid=parameters, scoring = 'accuracy', cv = 10, n_jobs = -1)
grid_search.fit(X_train,y_train)

In [None]:
best_accuracy = grid_search.best_score_
best_parameter = grid_search.best_params_

In [None]:
print("The best accuracy is : " + str(best_accuracy))
print("The best parameters are: " + str(best_parameter))

The best cross-validated accuracy is not much different from the accuracy of the default SVC above.
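
Note that `best_score_` is a mean cross-validated accuracy computed on the training data, so it is not directly comparable to a test-set accuracy. The sketch below, on synthetic data and with a deliberately smaller grid than the one above (both are assumptions for illustration), shows how the refit best model could also be scored on the held-out test set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the heart data (assumed sizes).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=5, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.15, random_state=1
)

# A smaller grid than the notebook's, to keep the sketch quick.
param_grid = [
    {"C": [0.25, 1], "kernel": ["linear"]},
    {"C": [0.25, 1], "kernel": ["rbf"], "gamma": [0.1, 0.5]},
]
search = GridSearchCV(SVC(), param_grid=param_grid, scoring="accuracy", cv=5)
search.fit(X_tr, y_tr)

# best_score_ is the mean cross-validated accuracy over the training folds;
# score() evaluates the refit best model on the held-out test rows.
cv_accuracy = search.best_score_
test_accuracy = search.score(X_te, y_te)
print(search.best_params_, cv_accuracy, test_accuracy)
```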