Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide.
Heart failure is a common event caused by CVDs and this dataset contains 12 features that can be used to predict mortality by heart failure.

Most cardiovascular diseases can be prevented by addressing behavioral risk factors such as tobacco use, unhealthy diet and obesity, physical inactivity, and harmful use of alcohol using population-wide strategies.

People with cardiovascular disease or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidemia, or already established disease) need early detection and management, where a machine learning model can be of great help.

* Source information and data: https://www.kaggle.com/andrewmvd/heart-failure-clinical-data

![https://www.udmi.net/wp-content/uploads/2020/02/UDMI_Cardiovascular-Disease.png](https://www.udmi.net/wp-content/uploads/2020/02/UDMI_Cardiovascular-Disease.png)



Source Image: https://www.udmi.net/cardiovascular-disease-risk/

In this work, we try to classify cardiovascular disease outcomes (DEATH_EVENT) using Random Forest and SVM classification. Hopefully this can be of great help for early detection and management of cardiovascular diseases.

Predictor variables used in classifying cardiovascular diseases:

1. age                       
2. anaemia                     
3. creatinine_phosphokinase    
4. diabetes                    
5. ejection_fraction          
6. high_blood_pressure         
7. platelets                 
8. serum_creatinine          
9. serum_sodium               
10. sex                         
11. smoking                     
12. time                       


Import Library

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler

Read Dataset

In [None]:
data = pd.read_csv('../input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv')
print('Dataset :',data.shape)
data.info()
data[0:10]

**VISUALIZING THE DATA**

In [None]:
# Distribution of DEATH_EVENT
data.DEATH_EVENT.value_counts()[0:30].plot(kind='bar')
plt.show()

# Plotting Heatmap
A heatmap is a graphical representation of numerical data in which the individual values of a matrix are shown as colors. The colors can denote the frequency of an event, the value of a metric in the data set, and so on. Different color schemes can be chosen depending on what the data should convey [2].

In [None]:
data1 = data[['age','anaemia','creatinine_phosphokinase','diabetes','ejection_fraction','high_blood_pressure',
'platelets','serum_creatinine','serum_sodium','sex','smoking','time']] #Subsetting the data
cor = data1.corr() #Calculate the correlation of the above variables
sns.heatmap(cor, square = True) #Plot the correlation as heat map

The plot above shows the heatmap of pairwise correlations among the variables. The color bar on the side gives the correlation scale. The lighter shade represents a higher correlation.
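
To make the heatmap values concrete, here is a minimal sketch of what a single cell holds, assuming the default Pearson method of `DataFrame.corr()`. The helper `pearson` and the toy columns are illustrative only and are not part of the notebook or the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy columns (made-up values, not taken from the dataset)
a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 4.0, 6.0, 8.0]   # rises linearly with a
c = [8.0, 6.0, 4.0, 2.0]   # falls linearly with a

r_ab = pearson(a, b)   # close to +1
r_ac = pearson(a, c)   # close to -1
print(round(r_ab, 6), round(r_ac, 6))
```

Values near +1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one.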

In [None]:
sns.set_style("whitegrid")
sns.pairplot(data, hue="DEATH_EVENT", height=3);  # `size` was renamed to `height` in seaborn 0.9
plt.show()

# SPLITTING DATA
Data for training and testing
We split the data into a training set, which is fed to the machine learning algorithm, and a test set, which is used to check that the trained classifier generalizes well to new data. This study holds out 5% of the samples as the test set, so that overfitting shows up as a gap between the training and test scores.
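
As a rough illustration of what a 5% hold-out means, the sketch below shuffles row indices and cuts off the test share, rounding the test size up as `train_test_split` does for a fractional `test_size`. The helper `holdout_indices` is hypothetical, and the 299-row count is an assumption about this dataset (the value reported by `data.shape`).

```python
import math
import random

def holdout_indices(n_samples, test_size, seed):
    """Shuffle row indices and split them into train and test parts."""
    n_test = math.ceil(test_size * n_samples)  # test share is rounded up
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]

# Assumed dataset size of 299 rows; 5% hold-out as in this notebook
train_idx, test_idx = holdout_indices(n_samples=299, test_size=0.05, seed=9)
print(len(train_idx), len(test_idx))  # 284 15
```

With so few test rows, a single misclassified patient moves the test accuracy by almost 7 percentage points, so the test score is a noisy estimate.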

In [None]:
from sklearn.model_selection import train_test_split
Y = data['DEATH_EVENT']
X = data.drop(columns=['DEATH_EVENT'])
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.05, random_state=9)

In [None]:
print('X train shape: ', X_train.shape)
print('Y train shape: ', Y_train.shape)
print('X test shape: ', X_test.shape)
print('Y test shape: ', Y_test.shape)

# 1. Random forest classification

Random forest is a supervised learning algorithm that builds a forest of randomized decision trees, most of the time trained with the bagging method. The essential idea of bagging is to average many noisy but approximately unbiased models, and thereby reduce the variance. Each tree is constructed using the following algorithm:

* Let $N$ be the number of training cases and $M$ the number of variables in the classifier.
* Let $m$ be the number of input variables used to determine the decision at a given node; $m<M$.
* Choose a training set for this tree by sampling $N$ cases with replacement from the training cases, and use the remaining cases to estimate the error.
* For each node of the tree, randomly choose $m$ variables on which to base the decision. Calculate the best partition of the training set from the $m$ variables.

For prediction, a new case is pushed down the tree and assigned the label of the terminal node where it ends. This process is repeated over all the trees in the ensemble, and the label that gets the most votes is reported as the prediction. We set the number of trees in the forest to 100.
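
The three random ingredients described above (bootstrap sampling, a random subset of variables per node, and voting) can be sketched without fitting real trees. All helper names and the example votes are hypothetical and serve only to illustrate the procedure.

```python
import random
from collections import Counter

def bootstrap_sample(n_cases, rng):
    """Draw n_cases indices with replacement; the unused ones are out-of-bag."""
    in_bag = [rng.randrange(n_cases) for _ in range(n_cases)]
    out_of_bag = sorted(set(range(n_cases)) - set(in_bag))
    return in_bag, out_of_bag

def random_feature_subset(n_features, m, rng):
    """Pick m of the M features to consider at one node (m < M)."""
    return rng.sample(range(n_features), m)

def majority_vote(tree_labels):
    """Return the label predicted by the most trees."""
    return Counter(tree_labels).most_common(1)[0][0]

rng = random.Random(9)
in_bag, out_of_bag = bootstrap_sample(n_cases=10, rng=rng)
candidates = random_feature_subset(n_features=12, m=3, rng=rng)

# Hypothetical labels returned by five trees for one new case
votes = [1, 0, 1, 1, 0]
prediction = majority_vote(votes)
print(in_bag, out_of_bag, candidates, prediction)
```

Note that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than counting hard votes, so this is a sketch of the textbook procedure rather than of the library's internals.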

Advantages of Random Forest:
* Runtimes are quite fast
* It is able to deal with unbalanced and missing data


In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# We define the model
rfcla = RandomForestClassifier(n_estimators=100,random_state=9,n_jobs=-1)

# We train model
rfcla.fit(X_train, Y_train)

# We predict target values
Y_predict5 = rfcla.predict(X_test)

In [None]:
# Accuracy (in %) of the model fitted above; no refit is needed
test_acc_rfcla = round(rfcla.score(X_test, Y_test) * 100, 2)
train_acc_rfcla = round(rfcla.score(X_train, Y_train) * 100, 2)

In [None]:
model1 = pd.DataFrame({
    'Model': ['Random Forest'],
    'Train Score': [train_acc_rfcla],
    'Test Score': [test_acc_rfcla]
})
model1.sort_values(by='Test Score', ascending=False)

In [None]:
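from sklearn.metrics import average_precision_score

# Average precision is designed for continuous scores, so use the predicted
# probability of the positive class (DEATH_EVENT = 1) instead of hard labels
Y_score5 = rfcla.predict_proba(X_test)[:, 1]
average_precision = average_precision_score(Y_test, Y_score5)

print('Average precision-recall score: {0:0.2f}'.format(
      average_precision))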
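
Average precision summarizes the precision-recall curve and is normally computed from continuous scores (such as predicted probabilities). The sketch below shows the idea under the assumption that no two scores are tied. The helper `average_precision_sketch` is hypothetical and is not the scikit-learn implementation, and the labels and scores are made up.

```python
def average_precision_sketch(y_true, scores):
    """Mean of the precision values at the ranks where a positive is found.

    Assumes binary labels (0/1), at least one positive, and no tied scores.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(y_true)
    true_positives = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            true_positives += 1
            total += true_positives / rank  # precision at this recall step
    return total / n_pos

# Made-up labels and scores for four cases
y_example = [1, 0, 1, 0]
score_example = [0.9, 0.8, 0.7, 0.1]
ap = average_precision_sketch(y_example, score_example)
print(round(ap, 4))
```

Here the positives are found at ranks 1 and 3, with precision 1 and 2/3, so the result is their mean.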

In [None]:
# The confusion matrix
rfcla_cm = confusion_matrix(Y_test, Y_predict5)
f, ax = plt.subplots(figsize=(5,5))
sns.heatmap(rfcla_cm, annot=True, linewidth=0.7, linecolor='black', fmt='g', ax=ax, cmap="BuPu")
plt.title('Random Forest Classification Confusion Matrix')
plt.xlabel('Y predict')
plt.ylabel('Y test')
plt.show()
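
The confusion matrix can be turned into summary rates. This is a small sketch, assuming the scikit-learn layout where rows are true labels and columns are predicted labels. The helper `rates_from_confusion` is hypothetical, and the counts are made up for illustration, not results from this notebook.

```python
def rates_from_confusion(cm):
    """Compute accuracy, precision and recall for the positive class (label 1).

    cm is a 2x2 nested list laid out as [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Made-up counts for a 15-row test set
example_cm = [[9, 1], [2, 3]]
accuracy, precision, recall = rates_from_confusion(example_cm)
print(round(accuracy, 2), round(precision, 2), round(recall, 2))
```

For a mortality classifier, recall on the positive class (deaths that were correctly flagged) is often of more interest than overall accuracy.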

# 2. SVM (Support Vector Machine) classification

SVMs (Support Vector Machines) have spread rapidly in recent years. The learning problem setting for SVMs corresponds to some unknown and nonlinear dependency (mapping, function) $y = f(x)$ between a high-dimensional input vector $x$ and a scalar output $y$. No information is available about the underlying joint probability functions, so distribution-free learning must be carried out. The only information available is a training data set $D = \{(x_i, y_i) \in X \times Y\},\ i = 1, \dots, l$, where $l$ stands for the number of training data pairs and is therefore equal to the size of the training data set $D$. Often $y_i$ is denoted as $d_i$, where $d$ stands for a desired (target) value. Hence, SVMs belong to the supervised learning techniques.

From the classification point of view, the goal of an SVM is to find a hyperplane in an $N$-dimensional space (where $N$ here is the number of features) that distinctly separates the data points. Hyperplanes are thus decision boundaries that help classify the data points: points falling on either side of the hyperplane are assigned to different classes.
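
The decision rule can be sketched in a few lines. The code below evaluates a linear hyperplane score and the RBF kernel used later in this notebook. The weights, bias, points, and `gamma` value are hypothetical and were chosen only to make the arithmetic easy to follow.

```python
import math

def linear_decision(w, b, x):
    """Signed score w.x + b; its sign tells which side of the hyperplane x is on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def rbf_kernel(x, z, gamma):
    """RBF kernel: exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

# Hypothetical hyperplane and points in a 2-dimensional feature space
w, b = [1.0, -1.0], 0.5
score_pos = linear_decision(w, b, [2.0, 1.0])   # 2 - 1 + 0.5 = 1.5
score_neg = linear_decision(w, b, [0.0, 2.0])   # 0 - 2 + 0.5 = -1.5
labels = [1 if s >= 0 else 0 for s in (score_pos, score_neg)]

# Kernel similarity is 1 for identical points and decays with distance
same = rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5)
far = rbf_kernel([1.0, 2.0], [4.0, 6.0], gamma=0.5)
print(labels, same, far)
```

Because the RBF kernel depends on distances between points, features measured on large scales (for example `platelets`) can dominate the similarity, which is one reason feature scaling is commonly applied before fitting an RBF SVM.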


In [None]:
from sklearn.ensemble import BaggingClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# We define the SVM model
svmcla = OneVsRestClassifier(BaggingClassifier(SVC(C=10,kernel='rbf',random_state=9, probability=True), 
                                               n_jobs=-1))

# We train model
svmcla.fit(X_train, Y_train)

# We predict target values
Y_predict2 = svmcla.predict(X_test)

In [None]:
# Accuracy (in %) of the model fitted above; refitting would train a
# different bagging ensemble than the one used for Y_predict2
test_acc_svm = round(svmcla.score(X_test, Y_test) * 100, 2)
train_acc_svm = round(svmcla.score(X_train, Y_train) * 100, 2)

In [None]:
model2 = pd.DataFrame({
    'Model': ['SVM'],
    'Train Score': [train_acc_svm],
    'Test Score': [test_acc_svm]
})
model2.sort_values(by='Test Score', ascending=False)

In [None]:
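from sklearn.metrics import average_precision_score

# Use the predicted probability of the positive class rather than hard labels
Y_score2 = svmcla.predict_proba(X_test)[:, 1]
average_precision = average_precision_score(Y_test, Y_score2)

print('Average precision-recall score: {0:0.2f}'.format(
      average_precision))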

In [None]:
# The confusion matrix (built from the SVM predictions, Y_predict2)
svm_cm = confusion_matrix(Y_test, Y_predict2)
f, ax = plt.subplots(figsize=(5,5))
sns.heatmap(svm_cm, annot=True, linewidth=0.7, linecolor='black', fmt='g', ax=ax, cmap="BuPu")
plt.title('SVM Classification Confusion Matrix')
plt.xlabel('Y predict')
plt.ylabel('Y test')
plt.show()

# Feature Selection

Here we drop 1. age, 2. anaemia, 4. diabetes, and 6. high_blood_pressure from the data. We start from the features below and then keep those selected by an L1-penalized linear SVM:
                
3. creatinine_phosphokinase    
5. ejection_fraction          
7. platelets                 
8. serum_creatinine          
9. serum_sodium               
10. sex                         
11. smoking                     
12. time                       



In [None]:
Y1 = data['DEATH_EVENT']
# Drop the target as well, otherwise DEATH_EVENT leaks into the features
X1 = data.drop(columns=['age','anaemia','diabetes','high_blood_pressure','DEATH_EVENT'])
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel

# L1-penalized linear SVM; features with zero coefficients are discarded
lsvc = LinearSVC(C=0.06, penalty="l1", dual=False,random_state=10).fit(X1, Y1)
model = SelectFromModel(lsvc, prefit=True)
X_new = model.transform(X1)
cc = list(X1.columns[model.get_support(indices=True)])
print(cc)
print(len(cc))

In [None]:
# Principal component analysis
from sklearn.decomposition import PCA

pca = PCA().fit(X1)
cum_var = np.cumsum(pca.explained_variance_ratio_)
plt.figure()
plt.plot(np.arange(1, len(cum_var) + 1), cum_var)
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance ratio')
plt.title('PCA Analysis')
plt.grid(True)
plt.show()

In [None]:
# Cumulative share of total variance explained by the first k components
variance = pd.Series(cum_var, index=range(1, len(cum_var) + 1))
print(variance)

In [None]:
X1 = data[cc] 
from sklearn.model_selection import train_test_split
X1_train, X1_test, Y1_train, Y1_test = train_test_split(X1, Y1, test_size=0.05, random_state=10)

In [None]:
# Random forest classification
rfcla.fit(X1_train, Y1_train)
Y1_predict5 = rfcla.predict(X1_test)
rfcla_cm = confusion_matrix(Y1_test, Y1_predict5)
score1_rfcla = rfcla.score(X1_test, Y1_test)

In [None]:
# Accuracy (in %) of the Random Forest already fitted on the selected features
test_acc_rfcla = round(rfcla.score(X1_test, Y1_test) * 100, 2)
train_acc_rfcla = round(rfcla.score(X1_train, Y1_train) * 100, 2)

In [None]:
# SVM classification
svmcla.fit(X1_train, Y1_train)
Y1_predict2 = svmcla.predict(X1_test)
svmcla_cm = confusion_matrix(Y1_test, Y1_predict2)
score1_svmcla = svmcla.score(X1_test, Y1_test)

In [None]:
# Score the SVM on the selected features (X1), not on the original X split
test_acc_svm2 = round(svmcla.score(X1_test, Y1_test) * 100, 2)
train_acc_svm2 = round(svmcla.score(X1_train, Y1_train) * 100, 2)

In [None]:
model3 = pd.DataFrame({
    'Model': ['Random Forest','SVM'],
    'Train Score': [train_acc_rfcla,train_acc_svm2 ],
    'Test Score': [test_acc_rfcla, test_acc_svm2]
})
model3.sort_values(by='Test Score', ascending=False)

In [None]:
# Two confusion matrices side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
ax1.set_title('Random Forest')
ax2.set_title('SVM Classification')

sns.heatmap(data=rfcla_cm, annot=True, linewidth=0.7, linecolor='black',cmap="BuPu" ,fmt='g', ax=ax1)
sns.heatmap(data=svmcla_cm, annot=True, linewidth=0.7, linecolor='black',cmap="BuPu" ,fmt='g', ax=ax2)
plt.show()

# Conclusion
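Random Forest performs better than SVM. The test accuracy of SVM is high even though its training accuracy is lower than expected, which might be due to dissimilarity between the test and training patterns. Because the test set holds only 5% of the samples, the test scores are based on very few patients and should be read with caution.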