
# **Heart Disease Prediction - Big Data**

## **Creating a Spark Session and reading the CSVs**

Two .csv files are imported from the GitHub repository.
The data is cleaned, and the model is then trained and tested on
"Cardiac Health Dataset (Train & Test).csv".
Finally, the model is tested on "FIC Heart Conditions Dataset (Test).csv".

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType
from pyspark.ml.feature import VectorAssembler

In [2]:
spark=SparkSession.builder.appName("Project").getOrCreate()

In [3]:
! git clone https://github.com/saaaady/Big-Data-Heart-Disease-Prediction

Cloning into 'Big-Data-Heart-Disease-Prediction'...
remote: Enumerating objects: 21, done.
remote: Counting objects: 100% (21/21), done.
remote: Compressing objects: 100% (20/20), done.
remote: Total 21 (delta 2), reused 0 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (21/21), 28.10 KiB | 5.62 MiB/s, done.
Resolving deltas: 100% (2/2), done.


In [4]:
ls

[0m[01;34mBig-Data-Heart-Disease-Prediction[0m/  [01;34msample_data[0m/


In [5]:
cd Big-Data-Heart-Disease-Prediction/

/content/Big-Data-Heart-Disease-Prediction


In [6]:
ls

'Cardiac Health Dataset (Train & Test).csv'  'FIC Heart Conditions Dataset (Test).csv'


In [7]:
data=spark.read.csv('/content/Big-Data-Heart-Disease-Prediction/Cardiac Health Dataset (Train & Test).csv',header=True,inferSchema=True)
ficdata=spark.read.csv('/content/Big-Data-Heart-Disease-Prediction/FIC Heart Conditions Dataset (Test).csv',header=True,inferSchema=True)

In [8]:
data.show()

+---+---+---------------+---+-----------+------------+-----------+------+---------------+-------------+-----------+-----------------------+--------+-------------+
|Age|Sex|Chest pain type| BP|Cholesterol|FBS over 120|EKG results|Max HR|Exercise angina|ST depression|Slope of ST|Number of vessels fluro|Thallium|Heart Disease|
+---+---+---------------+---+-----------+------------+-----------+------+---------------+-------------+-----------+-----------------------+--------+-------------+
| 70|  1|              4|130|        322|           0|          2|   109|              0|          2.4|          2|                      3|       3|     Presence|
| 67|  0|              3|115|        564|           0|          2|   160|              0|          1.6|          2|                      0|       7|      Absence|
| 57|  1|              2|124|        261|           0|          0|   141|              0|          0.3|          1|                      0|       7|     Presence|
| 64|  1|             

In [9]:
ficdata.show()

+---+---------+------+----------+-------------------------------------+----------------------------------------------------------------------------------------+-----+--------+----------+---------+-------+--------------+---------+--------+---+---------+-----+------------+---+------+----+--------+-----------+----------+-----+-----+---+-----+----+----------+-----+-----+-----+-------+--------------+----------+------+--------+------+---------------+--------------------+--------------------+----------------+---+--------+----+---+-------+-------+-----+-------+-----+---+----+---+---+--------+--------+---------+---------+
|Age|Age.Group|Gender|Locality  |Marital status                       |Life.Style                                                                              |Sleep|Category|Depression|Hyperlipi|Smoking|Family.History|F.History|Diabetes|HTN|Allergies|   BP|Thrombolysis|BGR|B.Urea|S.Cr|S.Sodium|S.Potassium|S.Chloride|C.P.K|CK.MB|ESR|  WBC| RBC|Hemoglobin|P.C.V|M.C.V|M.C.H|M.C.

## **Cleaning the Raw datasets**
The datasets have incorrectly labeled columns and missing or null entries. The FIC dataset also has extra columns that will not be used as features, so they are dropped.

**Cardiac Health Dataset Cleaning**

In [10]:
data.show()

+---+---+---------------+---+-----------+------------+-----------+------+---------------+-------------+-----------+-----------------------+--------+-------------+
|Age|Sex|Chest pain type| BP|Cholesterol|FBS over 120|EKG results|Max HR|Exercise angina|ST depression|Slope of ST|Number of vessels fluro|Thallium|Heart Disease|
+---+---+---------------+---+-----------+------------+-----------+------+---------------+-------------+-----------+-----------------------+--------+-------------+
| 70|  1|              4|130|        322|           0|          2|   109|              0|          2.4|          2|                      3|       3|     Presence|
| 67|  0|              3|115|        564|           0|          2|   160|              0|          1.6|          2|                      0|       7|      Absence|
| 57|  1|              2|124|        261|           0|          0|   141|              0|          0.3|          1|                      0|       7|     Presence|
| 64|  1|             

In [11]:
data=data.drop('Chest pain type')
data=data.withColumnRenamed('BP','Blood Pressure(mm Hg)').withColumnRenamed('Cholesterol','Cholesterol(mg/dl)').withColumnRenamed('FBS over 120','Fasting Blood Sugar > 120(mg/dl)').withColumnRenamed('EKG results','Resting EKG').withColumnRenamed('Max HR','Max Resting Heartrate').withColumnRenamed('Exercise angina','Exercise Angina')
data = data.replace("Absence", "0", "Heart Disease")
data = data.replace("Presence", "1", "Heart Disease")
data=data.withColumn("Heart Disease", data["Heart Disease"].cast(IntegerType()))

1. The "Chest pain type" column was dropped.
2. Columns were renamed to provide clarity.
3. Presence/Absence of heart disease is denoted by 1/0 respectively.
4. The "Heart Disease" column entries were cast to integer type.

In [12]:
data.show()

+---+---+---------------------+------------------+--------------------------------+-----------+---------------------+---------------+-------------+-----------+-----------------------+--------+-------------+
|Age|Sex|Blood Pressure(mm Hg)|Cholesterol(mg/dl)|Fasting Blood Sugar > 120(mg/dl)|Resting EKG|Max Resting Heartrate|Exercise Angina|ST depression|Slope of ST|Number of vessels fluro|Thallium|Heart Disease|
+---+---+---------------------+------------------+--------------------------------+-----------+---------------------+---------------+-------------+-----------+-----------------------+--------+-------------+
| 70|  1|                  130|               322|                               0|          2|                  109|              0|          2.4|          2|                      3|       3|            1|
| 67|  0|                  115|               564|                               0|          2|                  160|              0|          1.6|          2|             

**FIC Heart Conditions Dataset**

In [13]:
ficdata.show()

+---+---------+------+----------+-------------------------------------+----------------------------------------------------------------------------------------+-----+--------+----------+---------+-------+--------------+---------+--------+---+---------+-----+------------+---+------+----+--------+-----------+----------+-----+-----+---+-----+----+----------+-----+-----+-----+-------+--------------+----------+------+--------+------+---------------+--------------------+--------------------+----------------+---+--------+----+---+-------+-------+-----+-------+-----+---+----+---+---+--------+--------+---------+---------+
|Age|Age.Group|Gender|Locality  |Marital status                       |Life.Style                                                                              |Sleep|Category|Depression|Hyperlipi|Smoking|Family.History|F.History|Diabetes|HTN|Allergies|   BP|Thrombolysis|BGR|B.Urea|S.Cr|S.Sodium|S.Potassium|S.Chloride|C.P.K|CK.MB|ESR|  WBC| RBC|Hemoglobin|P.C.V|M.C.V|M.C.H|M.C.

In [14]:
ficdata=ficdata.drop('Age.Group','Locality  ','Others ','Marital status                       ','Family.History','Life.Style                                                                              ','Sleep','Category','Depression','Hyperlipi','Smoking','F.History','Diabetes','HTN','Allergies','Thrombolysis','BGR','B.Urea','S.Cr','S.Sodium','S.Potassium','S.Chloride','C.P.K','CK.MB','ESR','WBC','RBC','M.C.H','Hemoglobin','P.C.V','M.C.V','M.C.','M.C.H.C','PLATELET_COUNT','NEUTROPHIL','LYMPHO','MONOCYTE','EOSINO','        Others','CO','Diagnosis','Hypersensitivity','cp','trestbps','num','SK','SK.React','Reaction','Follow.Up')
ficdata=ficdata.withColumnRenamed('BP','Blood Pressure(mm Hg)').withColumnRenamed('Gender','Sex').withColumnRenamed('chol','Cholesterol(mg/dl)').withColumnRenamed('fbs','Fasting Blood Sugar > 120(mg/dl)').withColumnRenamed('restecg','Resting EKG').withColumnRenamed('thalach','Max Resting Heartrate').withColumnRenamed('exang','Exercise Angina').withColumnRenamed('oldpeak','ST depression').withColumnRenamed('slope','Slope of ST').withColumnRenamed('ca','Number of vessels fluro').withColumnRenamed('thal','Thallium').withColumnRenamed('Mortality','Heart Disease')
ficdata=ficdata.replace("Female", "0", "Sex")
ficdata=ficdata.replace("Male", "1", "Sex")
ficdata=ficdata.replace(0, 1, "Heart Disease")
ficdata=ficdata.withColumn("Sex", ficdata["Sex"].cast(IntegerType()))
ficdata=ficdata.withColumn("Blood Pressure(mm Hg)", ficdata["Blood Pressure(mm Hg)"].cast(IntegerType()))

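1. Columns that are not used as features were dropped.
2. Columns were renamed to match the Cardiac Health Dataset.
3. Male/Female is denoted by 1/0 respectively.
4. Zero entries in the "Heart Disease" column (renamed from "Mortality") were set to 1.
5. The "Sex" and "Blood Pressure(mm Hg)" column entries were cast to integer type.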

In [15]:
ficdata.show()

+---+---+---------------------+------------------+--------------------------------+-----------+---------------------+---------------+-------------+-----------+-----------------------+--------+-------------+
|Age|Sex|Blood Pressure(mm Hg)|Cholesterol(mg/dl)|Fasting Blood Sugar > 120(mg/dl)|Resting EKG|Max Resting Heartrate|Exercise Angina|ST depression|Slope of ST|Number of vessels fluro|Thallium|Heart Disease|
+---+---+---------------------+------------------+--------------------------------+-----------+---------------------+---------------+-------------+-----------+-----------------------+--------+-------------+
| 45|  0|                  100|               341|                               1|          2|                  136|              1|          3.0|          2|                      0|       7|            1|
| 51|  0|                   90|               305|                               0|          0|                  142|              1|          1.2|          2|             

In [16]:
data.describe().show()

+-------+-----------------+------------------+---------------------+------------------+--------------------------------+------------------+---------------------+------------------+------------------+------------------+-----------------------+------------------+------------------+
|summary|              Age|               Sex|Blood Pressure(mm Hg)|Cholesterol(mg/dl)|Fasting Blood Sugar > 120(mg/dl)|       Resting EKG|Max Resting Heartrate|   Exercise Angina|     ST depression|       Slope of ST|Number of vessels fluro|          Thallium|     Heart Disease|
+-------+-----------------+------------------+---------------------+------------------+--------------------------------+------------------+---------------------+------------------+------------------+------------------+-----------------------+------------------+------------------+
|  count|              270|               270|                  270|               270|                             270|               270|                  

The describe method was used on both dataframes to check for errors in the entries, e.g. Blood Pressure and Cholesterol cannot have a zero value.
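The same sanity check can also be written as an explicit count instead of being read off the `min` row. The plain-Python sketch below is only an illustration: `count_zeros`, the shortened column names and the sample rows are hypothetical, not part of the notebook or of the datasets.

```python
def count_zeros(rows, columns):
    """Count, per column, the rows whose value is zero or missing."""
    counts = {column: 0 for column in columns}
    for row in rows:
        for column in columns:
            # A missing key or None is treated like an invalid zero reading
            if row.get(column) is None or row[column] == 0:
                counts[column] += 1
    return counts

# Hypothetical rows; "bp" and "chol" stand in for the real column names
sample = [
    {"bp": 130, "chol": 322},
    {"bp": 0, "chol": 250},      # a blood pressure of zero is not plausible
    {"bp": 115, "chol": None},   # missing cholesterol value
]
zero_counts = count_zeros(sample, ["bp", "chol"])
```

A non-zero count for a column such as blood pressure would flag rows that need to be fixed or dropped before training.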

In [17]:
ficdata.describe().show()

+-------+-----------------+------------------+---------------------+------------------+--------------------------------+------------------+---------------------+------------------+------------------+------------------+-----------------------+------------------+-------------+
|summary|              Age|               Sex|Blood Pressure(mm Hg)|Cholesterol(mg/dl)|Fasting Blood Sugar > 120(mg/dl)|       Resting EKG|Max Resting Heartrate|   Exercise Angina|     ST depression|       Slope of ST|Number of vessels fluro|          Thallium|Heart Disease|
+-------+-----------------+------------------+---------------------+------------------+--------------------------------+------------------+---------------------+------------------+------------------+------------------+-----------------------+------------------+-------------+
|  count|              368|               368|                  368|               368|                             368|               368|                  368|           

## **Feature Selection**

In [18]:
for i in data.columns:
  print("{} and Heart Disease = {}".format(i,data.stat.corr("Heart Disease",i)))

Age and Heart Disease = 0.2123221874434284
Sex and Heart Disease = 0.29772075572408496
Blood Pressure(mm Hg) and Heart Disease = 0.1553826561757688
Cholesterol(mg/dl) and Heart Disease = 0.11802053060517001
Fasting Blood Sugar > 120(mg/dl) and Heart Disease = -0.016318834144205503
Resting EKG and Heart Disease = 0.18209075568278285
Max Resting Heartrate and Heart Disease = -0.4185139653265933
Exercise Angina and Heart Disease = 0.4193027091902966
ST depression and Heart Disease = 0.4179674372274274
Slope of ST and Heart Disease = 0.33761595723299065
Number of vessels fluro and Heart Disease = 0.45533645047270893
Thallium and Heart Disease = 0.5250203329618742
Heart Disease and Heart Disease = 1.0


We will keep all the columns as features, since none is completely redundant in indicating the presence or absence of heart disease. This is evident from the correlation between each feature and the Heart Disease label, which ranges from about -0.02 (Fasting Blood Sugar) to about 0.53 (Thallium).
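For reference, `data.stat.corr` computes the Pearson correlation coefficient. The plain-Python sketch below shows the same calculation; the `pearson` helper and the toy values are illustrative assumptions, not data from the notebook.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and the two sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy values: one feature and a 0/1 label
feature = [1.0, 2.0, 3.0, 4.0]
label = [0, 0, 1, 1]
r = pearson(feature, label)
```

A column always correlates perfectly with itself, which is why the last line of the output above reads 1.0 for "Heart Disease and Heart Disease".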

##  **Creating a Feature Vector Column using Vector Assembler**

Merging multiple columns into a single vector column.

In [19]:
data.columns

['Age',
 'Sex',
 'Blood Pressure(mm Hg)',
 'Cholesterol(mg/dl)',
 'Fasting Blood Sugar > 120(mg/dl)',
 'Resting EKG',
 'Max Resting Heartrate',
 'Exercise Angina',
 'ST depression',
 'Slope of ST',
 'Number of vessels fluro',
 'Thallium',
 'Heart Disease']

In [20]:
featuremaker=VectorAssembler(inputCols=['Age','Sex','Blood Pressure(mm Hg)','Cholesterol(mg/dl)','Fasting Blood Sugar > 120(mg/dl)','Resting EKG','Max Resting Heartrate','Exercise Angina','ST depression','Slope of ST','Number of vessels fluro','Thallium'],outputCol='Features as Vectors')

In [21]:
readyforml=featuremaker.transform(data)

In [22]:
readyforml.select('Features as Vectors').show()

+--------------------+
| Features as Vectors|
+--------------------+
|[70.0,1.0,130.0,3...|
|[67.0,0.0,115.0,5...|
|[57.0,1.0,124.0,2...|
|[64.0,1.0,128.0,2...|
|[74.0,0.0,120.0,2...|
|[65.0,1.0,120.0,1...|
|[56.0,1.0,130.0,2...|
|[59.0,1.0,110.0,2...|
|[60.0,1.0,140.0,2...|
|[63.0,0.0,150.0,4...|
|[59.0,1.0,135.0,2...|
|[53.0,1.0,142.0,2...|
|[44.0,1.0,140.0,2...|
|[61.0,1.0,134.0,2...|
|[57.0,0.0,128.0,3...|
|[71.0,0.0,112.0,1...|
|[46.0,1.0,140.0,3...|
|[53.0,1.0,140.0,2...|
|[64.0,1.0,110.0,2...|
|[40.0,1.0,140.0,1...|
+--------------------+
only showing top 20 rows



The above column, called "Features as Vectors", has been generated from the dataframe and appended as its last column. Each entry contains the values of all feature columns for one individual.
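Conceptually, the assembler concatenates the selected columns of each row into one numeric vector. The plain-Python sketch below mimics that idea; `assemble` and `to_sparse` are hypothetical helpers, not Spark APIs. The sparse form is included because Spark may print a row as `(size,[indices],[values])` when many of its entries are zero.

```python
def assemble(row, input_cols):
    """Concatenate the chosen columns of one row into a list of floats."""
    return [float(row[col]) for col in input_cols]

def to_sparse(vector):
    """Return (size, indices, values), keeping only the non-zero entries."""
    indices = [i for i, value in enumerate(vector) if value != 0.0]
    return (len(vector), indices, [vector[i] for i in indices])

# A few values from the first row of the Cardiac Health Dataset
row = {"Age": 70, "Sex": 1, "ST depression": 2.4, "Heart Disease": 1}
dense = assemble(row, ["Age", "Sex", "ST depression"])

# Hypothetical vector with a zero entry, shown in sparse form
sparse = to_sparse([34.0, 0.0, 118.0])
```

The label column is deliberately left out of the input columns, so that the model never sees the value it is supposed to predict.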

## **Importing and Training Machine Learning Model**
As this is a classification problem, i.e. a person either has heart disease or does not, we will use the logistic regression model.
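As a reminder of what the model computes, logistic regression passes a weighted sum of the features through the logistic (sigmoid) function to obtain a probability and then applies a threshold. The sketch below is a minimal plain-Python illustration; the weights, intercept and two-feature input are hypothetical, and the 0.5 threshold is assumed to match the library default.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict(features, weights, intercept, threshold=0.5):
    """Return (probability of class 1, predicted label) for one row."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(z)
    return probability, 1 if probability > threshold else 0

# Hypothetical coefficients for a two-feature model
p, label = predict([1.0, 2.0], weights=[0.5, 0.25], intercept=0.0)
```

Training consists of finding the weights and intercept that make these probabilities agree as closely as possible with the known labels.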

In [23]:
from pyspark.ml.classification import LogisticRegression

In [24]:
mldata=readyforml.select('Features as Vectors','Heart Disease')

We will train the model on a part of the "Cardiac Health Dataset" and test it on the remaining part. Then we will test the model on the "FIC Heart Conditions" dataset.
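A weighted random split of this kind assigns rows to the parts at random, so the resulting sizes are only approximately 80/20, and they change between runs unless a seed is fixed (`randomSplit` accepts an optional `seed` argument). The sketch below is a simplified plain-Python model of that behaviour under the assumption that each row is assigned independently; it is not Spark's implementation, and `random_split` is a hypothetical helper.

```python
import random

def random_split(rows, weights, seed=None):
    """Assign each row independently to one part, with the given weights."""
    rng = random.Random(seed)
    total = float(sum(weights))
    # Cumulative upper bounds, e.g. [0.8, 1.0] for weights [0.8, 0.2]
    bounds, running = [], 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)
    parts = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        # First part whose bound exceeds u; fall back to the last part
        index = next((i for i, b in enumerate(bounds) if u < b), len(parts) - 1)
        parts[index].append(row)
    return parts

# 270 row indices stand in for the 270 rows of the Cardiac Health Dataset
train, test = random_split(list(range(270)), [0.8, 0.2], seed=42)
```

Fixing the seed makes the split, and therefore the reported metrics, reproducible from one run to the next.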

In [25]:
trainingdata,testingdata=mldata.randomSplit([0.8,0.2])

In [26]:
heartmodel=LogisticRegression(labelCol='Heart Disease',featuresCol="Features as Vectors")

In [27]:
hdp=heartmodel.fit(trainingdata)

In [28]:
summary=hdp.summary

In [29]:
summary.predictions.describe().show()

+-------+-------------------+-------------------+
|summary|      Heart Disease|         prediction|
+-------+-------------------+-------------------+
|  count|                218|                218|
|   mean|0.42660550458715596|0.41284403669724773|
| stddev|0.49572219844795723|0.49347837484425744|
|    min|                0.0|                0.0|
|    max|                1.0|                1.0|
+-------+-------------------+-------------------+



## **Evaluation and Testing of Model - Cardiac Health Dataset**




In [30]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [31]:
predictions1=hdp.evaluate(testingdata)

In [32]:
predictions1.predictions.show(20)

+--------------------+-------------+--------------------+--------------------+----------+
| Features as Vectors|Heart Disease|       rawPrediction|         probability|prediction|
+--------------------+-------------+--------------------+--------------------+----------+
|(12,[0,2,3,6,9,11...|            0|[5.42219717339242...|[0.99560199775022...|       0.0|
|(12,[0,2,3,6,9,11...|            0|[4.04010987917666...|[0.98270871062913...|       0.0|
|(12,[0,2,3,6,9,11...|            0|[5.07519489675048...|[0.99378895000222...|       0.0|
|[34.0,0.0,118.0,2...|            0|[4.98577388663885...|[0.99321190627510...|       0.0|
|[35.0,1.0,126.0,2...|            1|[-0.5547385791837...|[0.36476572279464...|       1.0|
|[40.0,1.0,152.0,2...|            1|[1.96951919854923...|[0.87755946098982...|       0.0|
|[41.0,1.0,112.0,2...|            0|[3.77253472112344...|[0.97752312005992...|       0.0|
|[42.0,1.0,120.0,2...|            0|[2.22811873174028...|[0.90274631740333...|       0.0|
|[42.0,1.0

In [33]:
predictions1.predictions.describe().show()

+-------+------------------+-------------------+
|summary|     Heart Disease|         prediction|
+-------+------------------+-------------------+
|  count|                52|                 52|
|   mean|0.5192307692307693|0.46153846153846156|
| stddev|0.5045045954972344| 0.5033822257076337|
|    min|                 0|                0.0|
|    max|                 1|                1.0|
+-------+------------------+-------------------+



In [34]:
evaluator=BinaryClassificationEvaluator(rawPredictionCol="rawPrediction",labelCol='Heart Disease')

In [35]:
evaluator.evaluate(hdp.transform(testingdata))

0.9037037037037038

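This number is the area under the ROC curve (AUC), which is the default metric of BinaryClassificationEvaluator; it is not the accuracy. It measures how well the model ranks people with heart disease above people without it, and a value of about 0.90 indicates that the two classes are separated fairly well. Now we will test the model on the FIC dataset.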
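`BinaryClassificationEvaluator` reports the area under the ROC curve by default, which is a different quantity from accuracy. The plain-Python sketch below computes both on a hypothetical four-row example to show that they can disagree; the labels and scores are made up for illustration.

```python
def accuracy(labels, predictions):
    """Fraction of rows whose predicted label equals the true label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

def auc_roc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win, which matches the area under the ROC curve.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [0, 0, 1, 1]
scores = [0.1, 0.6, 0.4, 0.9]                  # hypothetical P(class 1)
preds = [1 if s > 0.5 else 0 for s in scores]  # threshold at 0.5
acc = accuracy(labels, preds)
auc = auc_roc(labels, scores)
```

Here the accuracy is 0.5 while the AUC is 0.75. To report accuracy in PySpark, one option is `MulticlassClassificationEvaluator` with `metricName="accuracy"`.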

## **Evaluation and Testing of Model - FIC Heart Conditions Dataset**

All the people in the FIC dataset were diagnosed with heart disease. Let's see how successful our model is at predicting all of the labels correctly.




In [36]:
ficdata.show()

+---+---+---------------------+------------------+--------------------------------+-----------+---------------------+---------------+-------------+-----------+-----------------------+--------+-------------+
|Age|Sex|Blood Pressure(mm Hg)|Cholesterol(mg/dl)|Fasting Blood Sugar > 120(mg/dl)|Resting EKG|Max Resting Heartrate|Exercise Angina|ST depression|Slope of ST|Number of vessels fluro|Thallium|Heart Disease|
+---+---+---------------------+------------------+--------------------------------+-----------+---------------------+---------------+-------------+-----------+-----------------------+--------+-------------+
| 45|  0|                  100|               341|                               1|          2|                  136|              1|          3.0|          2|                      0|       7|            1|
| 51|  0|                   90|               305|                               0|          0|                  142|              1|          1.2|          2|             

In [37]:
ficdata.columns

['Age',
 'Sex',
 'Blood Pressure(mm Hg)',
 'Cholesterol(mg/dl)',
 'Fasting Blood Sugar > 120(mg/dl)',
 'Resting EKG',
 'Max Resting Heartrate',
 'Exercise Angina',
 'ST depression',
 'Slope of ST',
 'Number of vessels fluro',
 'Thallium',
 'Heart Disease']

In [38]:
featuremaker=VectorAssembler(inputCols=['Age','Sex','Blood Pressure(mm Hg)','Cholesterol(mg/dl)','Fasting Blood Sugar > 120(mg/dl)','Resting EKG','Max Resting Heartrate','Exercise Angina','ST depression','Slope of ST','Number of vessels fluro','Thallium'],outputCol='Features as Vectors')

In [39]:
ficreadyforml=featuremaker.transform(ficdata)

In [40]:
ficreadyforml.select('Features as Vectors').show()

+--------------------+
| Features as Vectors|
+--------------------+
|[45.0,0.0,100.0,3...|
|[51.0,0.0,90.0,30...|
|[55.0,0.0,100.0,3...|
|[55.0,0.0,160.0,2...|
|[56.0,0.0,90.0,28...|
|[56.0,0.0,140.0,4...|
|[57.0,0.0,120.0,2...|
|[57.0,0.0,100.0,2...|
|[58.0,0.0,90.0,31...|
|[58.0,0.0,100.0,2...|
|[59.0,0.0,160.0,2...|
|[60.0,0.0,90.0,25...|
|[60.0,0.0,140.0,3...|
|[61.0,0.0,120.0,3...|
|[61.0,0.0,100.0,3...|
|[62.0,0.0,90.0,26...|
|[62.0,0.0,100.0,1...|
|[62.0,0.0,160.0,2...|
|[62.0,0.0,90.0,29...|
|[62.0,0.0,140.0,2...|
+--------------------+
only showing top 20 rows



In [41]:
ffdata=ficreadyforml.select('Features as Vectors','Heart Disease')
predictions2=hdp.evaluate(ffdata)
predictions2.predictions.describe().show(20)


+-------+-------------+------------------+
|summary|Heart Disease|        prediction|
+-------+-------------+------------------+
|  count|          368|               368|
|   mean|          1.0|0.7445652173913043|
| stddev|          0.0|0.4366990697024343|
|    min|            1|               0.0|
|    max|            1|               1.0|
+-------+-------------+------------------+
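Because every label in this dataset is 1, the mean of the prediction column is the fraction of patients the model identified correctly, i.e. the recall on the positive class. The sketch below reproduces that figure in plain Python; the count of 274 positive predictions is inferred from mean × count and is not printed by the notebook itself.

```python
def recall(labels, predictions):
    """Fraction of actual positives that were predicted positive."""
    predicted_for_positives = [p for y, p in zip(labels, predictions) if y == 1]
    if not predicted_for_positives:
        raise ValueError("no positive labels")
    hits = sum(1 for p in predicted_for_positives if p == 1)
    return hits / len(predicted_for_positives)

# 368 rows, all labelled 1; 274 predicted as 1 (inferred, see lead-in)
labels = [1] * 368
predictions = [1] * 274 + [0] * 94
fic_recall = recall(labels, predictions)
```

The result is about 0.745. An area under the ROC curve cannot be computed for this dataset, since it contains no negative examples to rank against.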



In [42]:
predictions2.predictions.show(20)

+--------------------+-------------+--------------------+--------------------+----------+
| Features as Vectors|Heart Disease|       rawPrediction|         probability|prediction|
+--------------------+-------------+--------------------+--------------------+----------+
|[45.0,0.0,100.0,3...|            1|[-1.0468751829600...|[0.25982560277479...|       1.0|
|[51.0,0.0,90.0,30...|            1|[0.28589069097959...|[0.57098980973713...|       0.0|
|[55.0,0.0,100.0,3...|            1|[-0.5702691636531...|[0.36117471906292...|       1.0|
|[55.0,0.0,160.0,2...|            1|[-2.1583326052281...|[0.10355513587861...|       1.0|
|[56.0,0.0,90.0,28...|            1|[-3.7207621957189...|[0.02364297724985...|       1.0|
|[56.0,0.0,140.0,4...|            1|[-4.2205025625215...|[0.01447855122161...|       1.0|
|[57.0,0.0,120.0,2...|            1|[0.34260705680107...|[0.58482366887448...|       0.0|
|[57.0,0.0,100.0,2...|            1|[2.69932385037531...|[0.93698673410870...|       0.0|
|[58.0,0.0