<a href="https://colab.research.google.com/github/ravikarora/Prediction-of-Cardiomegaly-Risk-Factor-using-Machine-Learning/blob/main/final_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **<p style="color:Blue;">About The Dataset :</p>**

age: Age of the patient

sex: Sex of the patient

cp: Chest pain type, 0 = Typical Angina, 1 = Atypical Angina, 2 = Non-anginal Pain, 3 = Asymptomatic

trtbps: Resting blood pressure (in mm Hg)

chol: Cholesterol in mg/dl fetched via BMI sensor

fbs: (fasting blood sugar > 120 mg/dl), 1 = True, 0 = False

restecg: Resting electrocardiographic results, 0 = Normal, 1 = ST-T wave abnormality, 2 = Left ventricular hypertrophy

thalachh: Maximum heart rate achieved

oldpeak: ST depression induced by exercise relative to rest

slp: Slope of the peak exercise ST segment

caa: Number of major vessels

thall: Thallium Stress Test result ~ (0,3)

exng: Exercise induced angina ~ 1 = Yes, 0 = No

output: Target variable



In [None]:
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
import xgboost as xgb
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

***Reading The Dataset***

In [None]:
heart=pd.read_csv('/content/heart.csv')
heart

***Checking the shape of DataFrame***

In [None]:
print('Number of rows are',heart.shape[0], 'and number of columns are ',heart.shape[1])

***Checking for null values***

In [None]:
heart.isnull().sum()/len(heart)*100

**No null values found**

***Checking For datatypes of the attributes***

In [None]:
heart.info()

**All attributes are of type 'int' except 'oldpeak'**

***Checking for duplicate rows***

In [None]:
heart[heart.duplicated()]


***Removing the duplicates***

In [None]:
heart.drop_duplicates(keep='first',inplace=True)

**Checking new shape**

In [None]:
print('Number of rows are',heart.shape[0], 'and number of columns are ',heart.shape[1])

# Pandas Profiling

In [None]:
# !pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip

In [None]:
# import pandas_profiling as pp
# pp.ProfileReport(heart)

# ***Data Visualization***

***Breakdown for chest pain***

In [None]:
x=(heart.cp.value_counts())
print(x)
p = sns.countplot(data=heart, x="cp")
plt.show()

1. Chest pain of type 0 ('Typical Angina') is the most common.
2. Chest pain of type 3 ('Asymptomatic') is the least common.
3. Patients with chest pain of type 0 make up almost 50% of all patients.
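
The shares behind observations like these can be computed directly from the category counts. The sketch below uses a short, hypothetical list of chest-pain codes (`cp_demo`, not taken from `heart.csv`) and only the standard library; on the real data, `heart.cp.value_counts(normalize=True)` should give the same kind of breakdown.

```python
from collections import Counter

# Hypothetical chest-pain codes for ten patients (not taken from heart.csv)
cp_demo = [0, 0, 2, 1, 0, 2, 3, 0, 1, 0]

counts = Counter(cp_demo)
total = len(cp_demo)

# Share of each chest-pain type, most frequent first
for cp_type, n in counts.most_common():
    print(f"type {cp_type}: {n} patients ({n / total:.0%})")
```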

***Breakdown of ECG***

In [None]:
x=(heart.restecg.value_counts())
print(x)
p = sns.countplot(data=heart, x="restecg")
plt.show()



The ECG count is almost the same for types 0 and 1, while the count for type 2 is almost negligible in comparison.

***Breakdown for Exercise Induced Angina***

In [None]:
x=(heart.exng.value_counts())
print(x)
p = sns.countplot(data=heart, x="exng")
plt.show()


***The count for exng = 0 (no exercise-induced angina) is more than double the count for exng = 1.***

***Breakdown for Thallium Stress Test***

In [None]:
x=(heart.thall.value_counts())
print(x)
p = sns.countplot(data=heart, x="thall")
plt.show()


***The thall count is highest for type 2 and lowest for type 0.***

***Heart Disease Vs Age***

In [None]:
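plt.figure(figsize=(10,10))
# histplot replaces the deprecated distplot; stat='density' keeps a density y-axis
sns.histplot(heart[heart['output'] == 0]["age"], color='green', kde=True, stat='density', label='output = 0')
sns.histplot(heart[heart['output'] == 1]["age"], color='red', kde=True, stat='density', label='output = 1')
plt.title('Disease versus Age')
plt.legend()
plt.show()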



In [None]:
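plt.figure(figsize=(10,10))
# histplot replaces the deprecated distplot; stat='density' keeps a density y-axis
sns.histplot(heart[heart['output'] == 0]["chol"], color='green', kde=True, stat='density', label='output = 0')
sns.histplot(heart[heart['output'] == 1]["chol"], color='red', kde=True, stat='density', label='output = 1')
plt.title('Disease versus Cholesterol')
plt.legend()
plt.show()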


***Pair Plot***

In [None]:
# plt.figure(figsize=(20,20))
# sns.pairplot(heart)
# plt.show()

***Violin Plot***

In [None]:
plt.figure(figsize=(13,13))
plt.subplot(2,3,1)
sns.violinplot(x = 'sex', y = 'output', data = heart)
plt.subplot(2,3,2)
sns.violinplot(x = 'thall', y = 'output', data = heart)
plt.subplot(2,3,3)
sns.violinplot(x = 'exng', y = 'output', data = heart)
plt.subplot(2,3,4)
sns.violinplot(x = 'restecg', y = 'output', data = heart)
plt.subplot(2,3,5)
sns.violinplot(x = 'cp', y = 'output', data = heart)
plt.xticks(fontsize=9, rotation=45)
plt.subplot(2,3,6)
sns.violinplot(x = 'fbs', y = 'output', data = heart)

plt.show()

# **Data preprocessing**

**There's no need for categorical encoding**

In [None]:
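# Features: every column except the first (age in the list above) and the last (output)
x = heart.iloc[:, 1:-1].values
# Target: the last column (output)
y = heart.iloc[:, -1].values
x,y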

**Splitting the dataset into training and testing data**

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size= 0.2, random_state= 0)

In [None]:
print('Shape for training data', x_train.shape, y_train.shape)
print('Shape for testing data', x_test.shape, y_test.shape)

**Feature Scaling**

In [None]:
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

In [None]:
x_train,x_test
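
As a rough illustration of what the scaling step does, the sketch below reproduces the standardization with plain NumPy on made-up numbers. The mean and standard deviation are computed from the training rows only and then reused for the test rows, which mirrors `fit_transform` on the training set followed by `transform` on the test set. The arrays `x_train_demo` and `x_test_demo` are hypothetical and not part of the notebook's data.

```python
import numpy as np

# Hypothetical toy data: 4 training rows and 2 test rows with 2 features each
x_train_demo = np.array([[120.0, 200.0],
                         [130.0, 240.0],
                         [140.0, 260.0],
                         [150.0, 300.0]])
x_test_demo = np.array([[135.0, 250.0],
                        [160.0, 180.0]])

# Statistics come from the training data only
mean = x_train_demo.mean(axis=0)
std = x_train_demo.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# The same mean and std are applied to both sets
x_train_scaled = (x_train_demo - mean) / std
x_test_scaled = (x_test_demo - mean) / std

print("train mean per feature:", x_train_scaled.mean(axis=0))
print("train std per feature: ", x_train_scaled.std(axis=0))
print("scaled test rows:\n", x_test_scaled)
```

Reusing the training statistics keeps information about the test rows out of the preprocessing step.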

**1. Logistic Regression**

In [None]:
model = LogisticRegression()
model.fit(x_train, y_train)
predicted=model.predict(x_test)
conf = confusion_matrix(y_test, predicted)
print ("Confusion Matrix : \n", conf)
print()
print()
print ("The accuracy of Logistic Regression is : ", accuracy_score(y_test, predicted)*100, "%")

In [None]:
from scikitplot.estimators import plot_feature_importances
from scikitplot.metrics import plot_confusion_matrix, plot_roc

In [None]:
Y_test_probs = model.predict_proba(x_test)

plot_roc(y_test, Y_test_probs, title="Logistic Regression", figsize=(12,6));

In [None]:
# !pip install scikit-plot
import scikitplot as skplt

Y_test_pred = predicted

fig = plt.figure(figsize=(15,6))

ax1 = fig.add_subplot(121)
skplt.metrics.plot_confusion_matrix(y_test, predicted,
                                    title="Confusion Matrix",
                                    cmap="Oranges",
                                    ax=ax1)

ax2 = fig.add_subplot(122)
skplt.metrics.plot_confusion_matrix(y_test, predicted,
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Purples",
                                    ax=ax2);
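
The second panel passes `normalize=True`. A minimal NumPy sketch of that idea follows, assuming the normalization is row-wise so that each true class sums to 1; the matrix `cm_demo` is hypothetical and does not reproduce the notebook's results.

```python
import numpy as np

# Hypothetical confusion matrix: rows are true classes, columns are predicted classes
cm_demo = np.array([[40, 10],
                    [5, 45]])

# Divide every row by its total so each row shows per-class fractions
row_totals = cm_demo.sum(axis=1, keepdims=True)
cm_normalized = cm_demo / row_totals

print(cm_normalized)
```

Read this way, the diagonal of the normalized matrix gives the fraction of each true class that was predicted correctly.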


**2.Gaussian Naive Bayes**

In [None]:
model = GaussianNB()
model.fit(x_train, y_train)

predicted = model.predict(x_test)

print("The accuracy of Gaussian Naive Bayes model is : ", accuracy_score(y_test, predicted)*100, "%")

In [None]:
Y_test_probs = model.predict_proba(x_test)

plot_roc(y_test, Y_test_probs,
                       title="Gaussian Naive Bayes", figsize=(12,6));

In [None]:
# !pip install scikit-plot
import scikitplot as skplt

Y_test_pred = predicted

fig = plt.figure(figsize=(15,6))

ax1 = fig.add_subplot(121)
skplt.metrics.plot_confusion_matrix(y_test, predicted,
                                    title="Confusion Matrix",
                                    cmap="Oranges",
                                    ax=ax1)

ax2 = fig.add_subplot(122)
skplt.metrics.plot_confusion_matrix(y_test, predicted,
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Purples",
                                    ax=ax2);


**3.Bernoulli Naive Bayes**

In [None]:
model = BernoulliNB()
model.fit(x_train, y_train)

predicted = model.predict(x_test)

print("The accuracy of Bernoulli Naive Bayes model is : ", accuracy_score(y_test, predicted)*100, "%")

* True Positive + True Negative : 54
* False Positive + False Negative : 7
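
The two counts above are enough to recover the accuracy, since accuracy is the number of correct predictions divided by the total number of predictions. A short worked example using only those two numbers:

```python
# Counts taken from the two bullet points above
correct = 54     # true positives + true negatives
incorrect = 7    # false positives + false negatives

total = correct + incorrect
accuracy = correct / total

print(f"Test samples: {total}")
print(f"Accuracy: {accuracy * 100:.2f} %")
```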

In [None]:
Y_test_probs = model.predict_proba(x_test)

plot_roc(y_test, Y_test_probs,
                       title="Bernoulli Naive Bayes", figsize=(12,6));

In [None]:
# !pip install scikit-plot
import scikitplot as skplt

Y_test_pred = predicted

fig = plt.figure(figsize=(15,6))

ax1 = fig.add_subplot(121)
skplt.metrics.plot_confusion_matrix(y_test, predicted,
                                    title="Confusion Matrix",
                                    cmap="Oranges",
                                    ax=ax1)

ax2 = fig.add_subplot(122)
skplt.metrics.plot_confusion_matrix(y_test, predicted,
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Purples",
                                    ax=ax2);


**4.Support Vector Machine**

In [None]:
model = SVC(probability=True)
model.fit(x_train, y_train)

predicted = model.predict(x_test)
print("The accuracy of SVM is : ", accuracy_score(y_test, predicted)*100, "%")



In [None]:
Y_test_probs = model.predict_proba(x_test)

skplt.metrics.plot_roc(y_test, Y_test_probs,
                       title="Support Vector Classifier (SVC)", figsize=(12,6));

In [None]:
# !pip install scikit-plot
import scikitplot as skplt

Y_test_pred = predicted

fig = plt.figure(figsize=(15,6))

ax1 = fig.add_subplot(121)
skplt.metrics.plot_confusion_matrix(y_test, predicted,
                                    title="Confusion Matrix",
                                    cmap="Oranges",
                                    ax=ax1)

ax2 = fig.add_subplot(122)
skplt.metrics.plot_confusion_matrix(y_test, predicted,
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Purples",
                                    ax=ax2);


**5.Random Forest**

In [None]:
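from sklearn.ensemble import RandomForestClassifier

# The target is a binary label, so use a classifier rather than a regressor
model = RandomForestClassifier(n_estimators = 100, random_state = 0)
model.fit(x_train, y_train)
predicted = model.predict(x_test)
print("The accuracy of Random Forest is : ", accuracy_score(y_test, predicted)*100, "%")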


**6.K Nearest Neighbours**

In [None]:

model = KNeighborsClassifier(n_neighbors = 1)
model.fit(x_train, y_train)
predicted = model.predict(x_test)


print(confusion_matrix(y_test, predicted))
print("The accuracy of KNN is : ", accuracy_score(y_test, predicted.round())*100, "%")




In [None]:
Y_test_probs = model.predict_proba(x_test)

skplt.metrics.plot_roc(y_test, Y_test_probs,
                       title="KNN", figsize=(12,6));

In [None]:
# !pip install scikit-plot
import scikitplot as skplt

Y_test_pred = predicted

fig = plt.figure(figsize=(15,6))

ax1 = fig.add_subplot(121)
skplt.metrics.plot_confusion_matrix(y_test, predicted,
                                    title="Confusion Matrix",
                                    cmap="Oranges",
                                    ax=ax1)

ax2 = fig.add_subplot(122)
skplt.metrics.plot_confusion_matrix(y_test, predicted,
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Purples",
                                    ax=ax2);


**Optimizing the KNN**

In [None]:
error_rate = []

for i in range(1, 40):

    model = KNeighborsClassifier(n_neighbors = i)
    model.fit(x_train, y_train)
    pred_i = model.predict(x_test)
    error_rate.append(np.mean(pred_i != y_test))

plt.figure(figsize =(10, 6))
plt.plot(range(1, 40), error_rate, color ='blue',
                linestyle ='dashed', marker ='o',
         markerfacecolor ='red', markersize = 10)

plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
plt.show()



We choose k = 7, because the error rate levels off for larger values of k.
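
One way to make this choice reproducible is to pick the smallest k that reaches the lowest error rate. The sketch below shows the idea on hypothetical error rates (`error_rate_demo` is illustrative and is not the notebook's output); in the notebook, the same lookup could be applied to the `error_rate` list built in the loop above.

```python
import numpy as np

# Hypothetical error rates for k = 1..10 (illustrative, not the notebook's results)
error_rate_demo = [0.21, 0.20, 0.16, 0.18, 0.15, 0.15, 0.11, 0.13, 0.11, 0.13]
k_values = list(range(1, len(error_rate_demo) + 1))

# np.argmin returns the first index of the minimum, i.e. the smallest such k
best_k = k_values[int(np.argmin(error_rate_demo))]

print("Smallest k with the lowest error:", best_k)
```

Preferring the smallest k among ties is a design choice made here for simplicity; a larger k gives a smoother decision boundary and may be preferable in other settings.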

In [None]:
model = KNeighborsClassifier(n_neighbors = 7)

model.fit(x_train, y_train)
predicted = model.predict(x_test)

print('Confusion Matrix :')
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, predicted))

print()
print()
print("The accuracy of KNN is : ", accuracy_score(y_test, predicted.round())*100, "%")


In [None]:
Y_test_probs = model.predict_proba(x_test)

skplt.metrics.plot_roc(y_test, Y_test_probs,
                       title="Optimized KNN", figsize=(12,6));

In [None]:
# !pip install scikit-plot
import scikitplot as skplt

Y_test_pred = predicted

fig = plt.figure(figsize=(15,6))

ax1 = fig.add_subplot(121)
skplt.metrics.plot_confusion_matrix(y_test, predicted,
                                    title="Confusion Matrix",
                                    cmap="Oranges",
                                    ax=ax1)

ax2 = fig.add_subplot(122)
skplt.metrics.plot_confusion_matrix(y_test, predicted,
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Purples",
                                    ax=ax2);


**7.XGBoost (Extreme Gradient Boosting)**

In [None]:
model = xgb.XGBClassifier(use_label_encoder=False)
model.fit(x_train, y_train)

predicted = model.predict(x_test)

cm = confusion_matrix(y_test, predicted)
print(cm)
print ("The accuracy of X Gradient Boosting is : ", accuracy_score(y_test, predicted)*100, "%")


In [None]:
Y_test_probs = model.predict_proba(x_test)

skplt.metrics.plot_roc(y_test, Y_test_probs,
                       title="XGBoost", figsize=(12,6));

In [None]:
# !pip install scikit-plot
import scikitplot as skplt

Y_test_pred = predicted

fig = plt.figure(figsize=(15,6))

ax1 = fig.add_subplot(121)
skplt.metrics.plot_confusion_matrix(y_test, predicted,
                                    title="Confusion Matrix",
                                    cmap="Oranges",
                                    ax=ax1)

ax2 = fig.add_subplot(122)
skplt.metrics.plot_confusion_matrix(y_test, predicted,
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Purples",
                                    ax=ax2);


**8.MLP PSO**

In [None]:
from PSOMLP import PSOMLP

# Sanity check on a synthetic dataset (separate names so x and y are not overwritten)
n_samples = 200
n_features = 5
x_demo = np.random.normal(-1, 1, size=(n_samples, n_features))
# the class is defined by a real function applied to each row
y_demo = np.array([1 if sum(a) >= 1 else 0 for a in x_demo])

pso = PSOMLP(hlayers=(10,))
mlp = pso.fit(x_demo, y_demo, iterations=100)
print("Accuracy on the synthetic training data:", 100 * mlp.score(x_demo, y_demo))

# Train on the heart data and evaluate on the held-out test set
mlp = pso.fit(x_train, y_train, iterations=100)
print("Accuracy for testing data:", 100 * mlp.score(x_test, y_test))

In [None]:
plt.style.use('fivethirtyeight')
# plot histograms for each variable
heart.hist(figsize = (18, 12))
plt.show()

**Neural Network**

In [None]:
import keras
from keras.models import Sequential
from keras.layers import Dense
model = Sequential()
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=25, validation_data=(x_test, y_test))

In [None]:
# Import the classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import roc_curve, roc_auc_score

# Instantiate the classifiers and make a list
classifiers = [LogisticRegression(random_state=42),
               GaussianNB(), BernoulliNB(), SVC(probability=True),
               KNeighborsClassifier(n_neighbors=7),
               RandomForestClassifier(random_state=42),
               xgb.XGBClassifier(use_label_encoder=False)]

# Train the models and collect one result row per classifier
rows = []
for cls in classifiers:
    model = cls.fit(x_train, y_train)
    # Probability of the positive class (column 1)
    yproba = model.predict_proba(x_test)[:, 1]

    fpr, tpr, _ = roc_curve(y_test, yproba)
    auc = roc_auc_score(y_test, yproba)

    rows.append({'classifiers': cls.__class__.__name__,
                 'fpr': fpr,
                 'tpr': tpr,
                 'auc': auc})

# Build the result table in one step (DataFrame.append was removed in pandas 2.0)
result_table = pd.DataFrame(rows, columns=['classifiers', 'fpr', 'tpr', 'auc'])

# Set name of the classifiers as index labels
result_table.set_index('classifiers', inplace=True)

In [None]:
fig = plt.figure(figsize=(8,6))

for i in result_table.index:
    plt.plot(result_table.loc[i]['fpr'],
             result_table.loc[i]['tpr'],
             label="{}, AUC={:.3f}".format(i, result_table.loc[i]['auc']))

plt.plot([0,1], [0,1], color='orange', linestyle='--')

plt.xticks(np.arange(0.0, 1.1, step=0.1))
plt.xlabel("False Positive Rate", fontsize=15)

plt.yticks(np.arange(0.0, 1.1, step=0.1))
plt.ylabel("True Positive Rate", fontsize=15)

plt.title('ROC Curve Analysis', fontweight='bold', fontsize=15)
plt.legend(prop={'size':13}, loc='lower right')

plt.show()
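
The AUC shown in the legend can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it that way with plain NumPy on four hypothetical samples (`y_true_demo` and `scores_demo` are illustrative, not model outputs from this notebook); it is meant as an intuition aid, not as a replacement for `roc_auc_score`.

```python
import numpy as np

# Hypothetical labels and predicted scores for four samples
y_true_demo = np.array([0, 0, 1, 1])
scores_demo = np.array([0.1, 0.4, 0.35, 0.8])

pos_scores = scores_demo[y_true_demo == 1]
neg_scores = scores_demo[y_true_demo == 0]

# Compare every positive score with every negative score
diff = pos_scores[:, None] - neg_scores[None, :]

# Correctly ordered pairs count 1, ties count 0.5
auc_demo = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

print("Pairwise AUC:", auc_demo)
```

Here three of the four positive-negative pairs are ordered correctly, so the value is 0.75.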


In [None]:
import matplotlib.pyplot as plt

# Compare the test accuracies (in %) of Logistic Regression and SVM
plt.bar(['Log Reg', 'SVM'], [88.5, 91.8])
plt.title('Model accuracy')
plt.ylabel('Accuracy (%)')
plt.show()

# **Conclusion**

1. Most of the models perform well on the test set.
2. SVM performs best on this dataset, with a test accuracy of 91.8% compared with 88.5% for Logistic Regression.