<center><h1>Hello everyone!</h1></center>
I created a notebook with a bit of analysis and prediction. I hope you like it.

<div style="text-align:center"><img src="https://www.eehealth.org/-/media/images/modules/blog/posts/heartline.jpg?h=500&w=750&hash=8147AFA9A68A838E7227B6524E566A99" /></div>

In [None]:
# Visualization
import pandas as pd
import plotly.graph_objects as go
import plotly.figure_factory as ff
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from plotly.subplots import make_subplots

# Model selection
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.preprocessing import MinMaxScaler

# Machine learning models
import xgboost as xgb
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier

import warnings
warnings.filterwarnings('ignore')

In [None]:
data = pd.read_csv('../input/heart-attack-analysis-prediction-dataset/heart.csv')

In [None]:
data

In [None]:
data.columns

<h1>Data Preparation</h1>
First we need to check whether there are any missing values or outliers.

In [None]:
# Check for missing values
data.isnull().sum()

It looks like our data has no missing values, so there is nothing to worry about here.

In [None]:
def box_plot(column, plot_name):
    fig = go.Figure()

    fig.add_trace(go.Box(
        y = column,
        name = ''
    ))

    fig.update_layout(
        template = 'plotly_dark',
        title_text = plot_name
    )

    fig.show()

In [None]:
box_plot(data['age'], 'Age box plot')

In [None]:
box_plot(data['trtbps'], 'Resting blood pressure box plot')

Here we can see a couple of outliers, but they are not that significant and are probably not errors at all. Some patients most likely did have such high readings.

In [None]:
box_plot(data['chol'], 'Cholesterol box plot')

In [None]:
data[data['chol'] > 500]

This very high cholesterol measurement deserves a closer look. After a few minutes of research, a reading above 500 seems possible (for example when triglycerides are measured), but such an extreme value could still distort our predictions later. We will delete this row from our dataset.
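
As a rough cross-check of what the box plot shows, the usual 1.5 × IQR rule can be computed by hand. This is only a sketch: `iqr_fences` is a hypothetical helper, the readings are made up (they are not taken from heart.csv), and Plotly may compute quartiles with a slightly different method, so these fences need not match the whiskers exactly.

```python
import pandas as pd


def iqr_fences(values, k=1.5):
    """Return the (lower, upper) Tukey fences for a numeric sequence."""
    s = pd.Series(values, dtype="float64")
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up cholesterol readings, for illustration only
chol = [200, 210, 220, 230, 240, 250, 260, 564]

lower, upper = iqr_fences(chol)
outliers = [v for v in chol if v < lower or v > upper]

print(f"fences: [{lower}, {upper}], flagged values: {outliers}")
```

On the real data the same helper could be called as `iqr_fences(data['chol'])` to see which rows fall outside the fences before deciding what to drop.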

In [None]:
data_del = data[data['chol'] < 500]

In [None]:
box_plot(data_del['chol'], 'Cholesterol box plot')

In [None]:
box_plot(data_del['thalachh'], 'Maximum heart rate achieved box plot')

In [None]:
box_plot(data_del['oldpeak'], 'Previous peak box plot')

<h1>Exploratory data analysis</h1>
First we have to answer a very important question: how does each feature relate to our target?

In [None]:
fig = go.Figure()

to_plot = data_del['output'].replace({0: 'Less chance of heart attack', 1: 'More chance of heart attack'}).value_counts()
labels = to_plot.index
values = to_plot.values

fig.add_trace(go.Pie(
    labels = labels,
    values = values,
    textinfo='percent'
))

fig.update_layout(
    title_text='Chance of heart attack',
    template='plotly_dark'
)

The good news is that our dataset is fairly balanced between the two classes, which makes prediction easier.

In [None]:
# Age
fig = go.Figure()

fig.add_trace(go.Histogram(
    x = data_del[data_del['output'] == 0]['age'],
    name = 'Less chance of heart attack'
))

fig.add_trace(go.Histogram(
    x = data_del[data_del['output'] == 1]['age'],
    name = 'More chance of heart attack'
))

fig.update_layout(
    width = 1000,
    template = 'plotly_dark',
    title_text = 'Age by chance of heart attack'
)

Conclusion: In this dataset, people between 40 and 55 years of age appear to have a higher chance of a heart attack, which seems a little odd to me. I always thought that the risk increases with age. Unfortunately, this dataset is too small to verify that, so it will have to remain a guess.
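
Raw histogram counts mix two things: how many people are in each age range and how many of them are in the higher-risk class. A minimal sketch of separating the two is to bin the ages and look at the share of `output == 1` per bin alongside the bin size. The data frame and the bin edges below are made up for illustration; only the column names mirror heart.csv.

```python
import pandas as pd

# Made-up rows, for illustration only
toy = pd.DataFrame({
    "age":    [35, 42, 44, 48, 51, 53, 58, 61, 63, 67],
    "output": [1,  1,  1,  0,  1,  1,  0,  0,  1,  0],
})

# Assumed bin edges; intervals are closed on the right, e.g. (40, 55]
bins = [29, 40, 55, 70]
toy["age_group"] = pd.cut(toy["age"], bins=bins)

# mean = share of the higher-risk class, count = number of people in the bin
summary = toy.groupby("age_group", observed=True)["output"].agg(["mean", "count"])
print(summary)
```

A bin with a high share but a small count should be read with caution, which is exactly the limitation mentioned above.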

In [None]:
# Male vs female chance of heart attack
fig = make_subplots(rows=1, cols=2, specs=[[{"type": "pie"}, {"type": "pie"}]], subplot_titles=['Female', 'Male'])

to_plot_female = data_del[data_del['sex'] == 0]['output'].replace({0: 'Less chance of heart attack', 1: 'More chance of heart attack'}).value_counts()
labels_female = to_plot_female.index
values_female = to_plot_female.values

to_plot_male = data_del[data_del['sex'] == 1]['output'].replace({0: 'Less chance of heart attack', 1: 'More chance of heart attack'}).value_counts()
labels_male = to_plot_male.index
values_male = to_plot_male.values

fig.add_trace(
    go.Pie(
    labels = labels_female,
    values = values_female,
    textinfo='percent',
    name='Female'),
    row=1,
    col=1
)

fig.add_trace(
    go.Pie(
    labels = labels_male,
    values = values_male,
    textinfo='percent',
    name='Male'),
    row=1,
    col=2
)

fig.update_layout(
    title_text='Male vs Female chance of heart attack',
    template='plotly_dark'
)

In [None]:
# Balance of sex feature
to_plot_balance = data_del['sex'].replace({0: 'Female', 1: 'Male'}).value_counts()
labels_balance = to_plot_balance.index
values_balance = to_plot_balance.values

fig = go.Figure()

fig.add_trace(go.Pie(
    labels = labels_balance,
    values = values_balance
))

fig.update_layout(
    title_text='Population by gender',
    template='plotly_dark'
)

Conclusion: Women appear more prone to heart attacks <br>
From the plots above, females have a higher chance of a heart attack, but at the same time the data are fairly unbalanced between the sexes. In my opinion this makes the conclusion unreliable.

In [None]:
# Resting blood pressure 
trtbps_1 = data_del[data_del['output'] == 1]['trtbps']
trtbps_0 = data_del[data_del['output'] == 0]['trtbps']

# Group data together
hist_data = [trtbps_1, trtbps_0]
group_labels = ['More chance of heart attack', 'Less chance of heart attack']

# Create distplot with the default bin size
fig = ff.create_distplot(hist_data, group_labels)

fig.update_layout(
    title_text = 'Resting blood pressure by chance of heart attack'
)

fig.show()

Conclusion: Resting blood pressure shows no clear relationship with the chance of a heart attack.

In [None]:
# Cholesterol
chol_1 = data_del[data_del['output'] == 1]['chol']
chol_0 = data_del[data_del['output'] == 0]['chol']

# Group data together
hist_data = [chol_1, chol_0]
group_labels = ['More chance of heart attack', 'Less chance of heart attack']

# Create distplot with the default bin size
fig = ff.create_distplot(hist_data, group_labels)

fig.update_layout(
    title_text = 'Cholesterol by chance of heart attack'
)

fig.show()

Conclusion: People with a cholesterol level between 170 and 250 mg/dl have a higher chance of a heart attack.

In [None]:
# Maximum heart rate achieved
thalachh_1 = data_del[data_del['output'] == 1]['thalachh']
thalachh_0 = data_del[data_del['output'] == 0]['thalachh']

# Group data together
hist_data = [thalachh_1, thalachh_0]
group_labels = ['More chance of heart attack', 'Less chance of heart attack']

# Create distplot with the default bin size
fig = ff.create_distplot(hist_data, group_labels)

fig.update_layout(
    title_text = 'Maximum heart rate achieved by chance of heart attack'
)

fig.show()

Conclusion: People who reached a maximum heart rate above 150 have a much higher chance of a heart attack.

To check how the other features correlate with the chance of a heart attack, we'll go the easy way and simply display the Pearson correlation values.

In [None]:
plt.figure(figsize=(8, 12))

# Use data_del so the heatmap matches the cleaned data used everywhere else
heatmap = sns.heatmap(data_del.corr()[['output']].sort_values(by='output', ascending=False),
                     vmin=-1, vmax=1, annot=True, cmap='BrBG')

heatmap.set_title('Features correlated with chance of heart attack', fontdict={'fontsize': 18}, pad=16);

Conclusion: This chart shows that only two features (fbs and chol) are barely correlated with the chance of a heart attack. The rest are fairly correlated, which should help with prediction.
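
Since the target is 0/1, the Pearson value for each feature is the point-biserial correlation, and a strong negative value is just as informative as a strong positive one. A small sketch of ranking features by absolute correlation follows. The values are made up, and only the column names come from heart.csv.

```python
import pandas as pd

# Made-up rows, for illustration only
toy = pd.DataFrame({
    "thalachh": [120, 130, 125, 160, 170, 165],
    "oldpeak":  [2.0, 1.5, 2.5, 0.5, 0.0, 1.0],
    "fbs":      [0, 1, 0, 0, 1, 0],
    "output":   [0, 0, 0, 1, 1, 1],
})

# Pearson correlation of every feature with the target
corr = toy.corr()["output"].drop("output")

# Order by absolute value so strong negative correlations are not buried
order = corr.abs().sort_values(ascending=False).index
ranked = corr.reindex(order)

print(ranked)
```

In this toy frame `oldpeak` is strongly negative and `fbs` sits at zero, so the ranking puts `fbs` last regardless of sign.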

<h1>Prediction</h1>

In [None]:
data_del

In [None]:
# Split data
X = data_del.drop(['output'], axis=1)
y = data_del['output']

In [None]:
# Scale values
scaler = MinMaxScaler()
X = scaler.fit_transform(X)

In [None]:
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=42)

In [None]:
def model_evaluation(model):
    # Train our model and predict
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Cross validation
    scores = cross_val_score(model, X, y, cv=3, scoring='f1')
    f1_scores_mean = scores.mean()
    print('F1 scores: {}'.format(scores))
    print('F1 mean score: {}'.format(f1_scores_mean))
    
    # Confusion matrix
    print('Confusion Matrix: ')
    matrix = confusion_matrix(y_test, y_pred)
    group_names = ['True Negative','False Positive','False Negative','True Positive']
    group_counts =['{0:0.0f}'.format(value) for value in matrix.flatten()]
    group_percentages = ['{0:.2%}'.format(value) for value in matrix.flatten()/np.sum(matrix)]
    
    labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in zip(group_names,group_counts,group_percentages)]
    labels = np.asarray(labels).reshape(2, 2)

    sns.heatmap(matrix, annot=labels, fmt='', cmap='rocket_r')
    
    return f1_scores_mean

In [None]:
# K-nearest neighbors 
KNN_model = KNeighborsClassifier()

# SVC
SVC_model = SVC()

# Logistic regression
LR_model = LogisticRegression()

# Decision tree
DT_model = DecisionTreeClassifier()

# Random Forest
RF_model = RandomForestClassifier()

# XGBoost
XGB_model = xgb.XGBClassifier()

# LightGBM
LGBM_model = LGBMClassifier()

In [None]:
f1_KNN = model_evaluation(KNN_model)

In [None]:
f1_SVC = model_evaluation(SVC_model)

In [None]:
f1_LR = model_evaluation(LR_model)

In [None]:
f1_DT = model_evaluation(DT_model)

In [None]:
f1_RF = model_evaluation(RF_model)

In [None]:
f1_XGB = model_evaluation(XGB_model)

In [None]:
f1_LGBM = model_evaluation(LGBM_model)

From the results above we can see that the models are unstable: the F1 score varies a lot from fold to fold. I'm pretty sure this is a result of the small size of the dataset. We will try to tune the logistic regression model to make it more stable. I chose this algorithm because it is simple, and simple models tend to work well on small datasets.
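
To put a number on that instability, one option is to repeat the cross-validation several times and look at the spread of the scores. The sketch below is not what this notebook does: it uses synthetic data from `make_classification` as a stand-in for heart.csv, and it wraps the scaler and the model in a pipeline so the scaler is fitted on the training folds only. All names ending in `_demo` are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the real data (roughly the same size and width)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=42
)

# Scaling inside the pipeline is re-fitted on each training fold
pipe = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))

# 3 folds repeated 5 times with different shuffles gives 15 scores
cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=5, random_state=42)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="f1")

print("F1 mean: {:.3f}, std: {:.3f}".format(scores.mean(), scores.std()))
```

A large standard deviation relative to the differences between models suggests that the ranking of the models is mostly noise.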

In [None]:
LR_new = LogisticRegression(
    solver = 'liblinear',
    penalty = 'l1',
    # C = 1
)

In [None]:
model_evaluation(LR_new)

Nice! After switching to an L1 penalty (scikit-learn's default is L2), our model became more stable. We achieved our goal: the model doesn't look overfitted and the F1 score is fair enough for such a small dataset. I think we can leave it like this.
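
The regularization strength `C` was left at its default above. If one wanted to tune it, `GridSearchCV` (already imported at the top) could search over a few values. This is a sketch under assumptions: the data are synthetic, the grid values are arbitrary, and the step names `scale` and `clf` are chosen only for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the real data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=42
)

pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("clf", LogisticRegression(solver="liblinear", penalty="l1")),
])

# Smaller C means stronger regularization; these values are arbitrary
param_grid = {"clf__C": [0.01, 0.1, 1, 10]}

search = GridSearchCV(
    pipe,
    param_grid,
    scoring="f1",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

With a dataset this small, the best `C` can change with the random seed, so the result is better read as a rough range than as a single optimal value.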