# Cardiovascular disease data

## by Aarzoo Kuhar

The dataset consists of 70 000 patient records with 12 features, such as age, gender, systolic blood pressure and diastolic blood pressure. The target class "cardio" equals 1 when the patient has cardiovascular disease and 0 when the patient is healthy.

The task is to predict the presence or absence of cardiovascular disease (CVD) using the patient examination results.

Data description
There are 3 types of input features:

Objective: factual information;
Examination: results of medical examination;
Subjective: information given by the patient

In [None]:
import pandas
import numpy as np

from sklearn import model_selection
from matplotlib import pyplot as plt
import seaborn as sns
import os

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
import warnings
warnings.filterwarnings("ignore")





In [None]:
dataframe = pandas.read_csv("../input/cardiovascular-disease-dataset/cardio_train.csv",sep=";")

In [None]:
dataframe.head()


In [None]:
dataframe.info()

All features are numerical: 12 integers and 1 decimal number (weight). The second column of the output shows how large the dataset is and how many non-null values there are for each field. We can use describe() to display sample statistics such as min, max, mean and std for each attribute:

In [None]:
dataframe.describe()

Age is measured in days and height in centimeters. Let's look at the numerical variables and how they are spread across the target class. For example, at what age does the number of people with CVD exceed the number of people without CVD?
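To answer a question like this, age first has to be converted from days to years. The sketch below does this on a tiny made-up frame and counts records per age and target class. The values are illustrative, not taken from the dataset, and the names `toy` and `age_years` are introduced here only for the example.

```python
import pandas as pd

# Toy data standing in for the real frame; values are made up for illustration.
toy = pd.DataFrame({
    "age": [18250, 18300, 20440, 20500, 21900, 21950],  # age in days
    "cardio": [0, 0, 1, 0, 1, 1],
})

# Convert days to whole years (integer division by 365 ignores leap days).
toy["age_years"] = toy["age"] // 365

# Rows: age in years, columns: cardio class, cells: number of records.
counts = pd.crosstab(toy["age_years"], toy["cardio"])

# Ages at which records with CVD outnumber records without CVD.
ages_cvd_majority = counts.index[counts[1] > counts[0]].tolist()
print(counts)
print(ages_cvd_majority)
```

On the real data, the same steps could be applied to `dataframe` directly.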

In [None]:
print("There are {} duplicated rows in the data frame".format(dataframe.duplicated().sum()))

# Discover and visualize the data to gain insights

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
dataframe.hist(bins=50, figsize=(20,15));


It can be observed that people over 55 years of age are more exposed to CVD. From the table above, we can see that there are outliers in ap_hi, ap_lo, weight and height. We will deal with them later.
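One common way to handle such outliers, shown here only as a sketch, is to drop rows that fall outside a chosen quantile range. The 2.5% and 97.5% cutoffs and the helper name `drop_outliers` are assumptions made for this illustration, not steps taken in this notebook, and the toy values are made up.

```python
import pandas as pd

def drop_outliers(df, columns, lower=0.025, upper=0.975):
    """Keep rows whose values in `columns` lie within the given quantile range."""
    mask = pd.Series(True, index=df.index)
    for col in columns:
        lo, hi = df[col].quantile([lower, upper])
        mask &= df[col].between(lo, hi)  # bounds are inclusive
    return df[mask]

# Made-up blood pressure readings with two implausible entries.
toy = pd.DataFrame({
    "ap_hi": [120, 130, 110, 140, 16020, 125, 115, 135, 120, 128],
    "ap_lo": [80, 85, 70, 90, 100, 80, 75, 85, 1100, 82],
})

cleaned = drop_outliers(toy, ["ap_hi", "ap_lo"])
print(len(toy), "->", len(cleaned))
```

Quantile cutoffs also remove some legitimate extreme values, so the range is a trade-off that would have to be chosen for the real data.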

# Visualizing the data with scatter plots


In [None]:
dataframe.plot(kind="scatter", x="id", y="age")

Setting alpha=0.1 makes it easier to see where the data points are dense. Coloring the points by the cardio class then gives a better visualization of where the disease is concentrated.



In [None]:
dataframe.plot(kind="scatter", x="id", y="age", alpha=0.1)

In [None]:
dataframe.plot(kind="scatter", x="id", y="age", alpha=0.4,s=dataframe["age"]/365, label="age", figsize=(10,7),c="cardio", cmap=plt.get_cmap("jet"), colorbar=True)


It can be observed that people over 55 years of age are more exposed to CVD.

# CORRELATION

The correlation coefficient ranges from -1 to 1. A value close to 1 indicates a strong positive correlation, a value close to -1 a strong negative correlation, and a value close to 0 no linear correlation.
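The three cases can be checked on small hand-made arrays. The sketch below uses `np.corrcoef`, which computes the Pearson coefficient; the arrays are made up for illustration. The last case shows that a coefficient of 0 rules out only a linear relationship.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# np.corrcoef returns a 2x2 matrix; entry [0, 1] is the correlation of the two inputs.
r_pos = np.corrcoef(x, 2 * x + 1)[0, 1]      # perfectly increasing linear relation
r_neg = np.corrcoef(x, -3 * x)[0, 1]         # perfectly decreasing linear relation
r_zero = np.corrcoef(x, (x - 3) ** 2)[0, 1]  # symmetric U-shape: no linear trend

print(r_pos, r_neg, r_zero)
```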



In [None]:
# check correlations
plt.subplots(figsize=(20,15))
sns.heatmap(dataframe.corr(), vmin = -0.5, vmax = 1, annot=True)


# Experimenting with Attribute Combinations

In [None]:
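# Ratio features; dividing by a 0/1 column such as alco yields inf or NaN where the denominator is 0.
dataframe["smoke_per_alco"] = dataframe["smoke"]/dataframe["alco"]
dataframe["age_per_gender"] = dataframe["age"]/dataframe["gender"]

dataframe["gender_per_weight"]=dataframe["gender"]/dataframe["weight"]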

In [None]:
corr_matrix = dataframe.corr()
corr_matrix["cholesterol"].sort_values(ascending=False)

We can see from the correlations that cholesterol, blood pressure (both ap_hi and ap_lo) and age have a strong relationship with cardiovascular disease.
Glucose (gluc) and cholesterol are also strongly related to each other.

In [None]:
dataframe.describe()

# Prepare the data for Machine Learning algorithms

In [None]:
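# Positional selection (original CSV order: id, age, ..., cholesterol, gluc, smoke, alco, active, cardio):
# X = columns 0-7 and 9-12, y = column 8 (gluc); .copy() avoids chained-assignment issues later.
X = dataframe.iloc[:, [0,1,2,3,4,5,6,7,9,10,11,12]].copy()
y = dataframe.iloc[:, 8]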
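Positional indices shift as soon as columns are added or reordered, so selecting by name makes it explicit which column is the target. The sketch below assumes the column names of the original CSV and the task stated at the top of this notebook, predicting `cardio`. The toy rows and the names `X_named`, `y_named` and `derived_feature` are hypothetical and used only for illustration.

```python
import pandas as pd

# Made-up rows; only a few of the real column names are included.
toy = pd.DataFrame({
    "id": [0, 1, 2],
    "age": [18000, 20000, 19000],
    "gluc": [1, 1, 2],
    "cardio": [0, 1, 1],
    "derived_feature": [0.1, 0.2, 0.3],  # stands in for engineered columns
})

target_name = "cardio"

# Drop the target, the row identifier and derived columns from the features.
X_named = toy.drop(columns=[target_name, "id", "derived_feature"])
y_named = toy[target_name]

print(list(X_named.columns))
print(y_named.tolist())
```

Dropping `id` is a choice made in this sketch, on the assumption that a row identifier carries no medical information.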

In [None]:
X.iloc[:,11]=np.ceil(X.iloc[:,11])

In [None]:
X.head()

# Handling Text and categorical values

In [None]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X.iloc[:, 8] = labelencoder_X.fit_transform(X.iloc[:, 8])

In [None]:
print(labelencoder_X.classes_)

In [None]:
X.head()

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)

In [None]:
dataframe_categorical = dataframe.loc[:,['cholesterol','gluc', 'smoke', 'alco', 'active']]
sns.countplot(x="variable", hue="value",data= pandas.melt(dataframe_categorical));

# Data Cleaning

Are there any NAs or missing values in the dataset?



In [None]:
dataframe.isnull().values.any()

In [None]:
sns.heatmap(X.isnull(),yticklabels=False,cbar=False,cmap='viridis')

In [None]:
# Fill missing values with the column median; assign back instead of using inplace on a slice.
median=X_test.iloc[:, 4].median()
X_test.iloc[:, 4] = X_test.iloc[:, 4].fillna(median)

In [None]:
median=X_test.iloc[:, 11].median()
X_test.iloc[:, 11] = X_test.iloc[:, 11].fillna(median)

In [None]:
median=X_train.iloc[:, 4].median()
X_train.iloc[:, 4] = X_train.iloc[:, 4].fillna(median)

In [None]:
median=X_train.iloc[:, 11].median()
X_train.iloc[:, 11] = X_train.iloc[:, 11].fillna(median)
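An alternative to filling each split with its own median is to learn the medians on the training set only and reuse them for the test set, so that no test-set statistics influence the preprocessing. A minimal sketch with scikit-learn's `SimpleImputer`, on made-up data (the frames `train` and `test` are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up frames with a few missing values.
train = pd.DataFrame({
    "weight": [60.0, np.nan, 80.0, 70.0],
    "ap_hi": [120.0, 130.0, np.nan, 110.0],
})
test = pd.DataFrame({
    "weight": [np.nan, 90.0],
    "ap_hi": [140.0, np.nan],
})

imputer = SimpleImputer(strategy="median")

# fit_transform learns the medians from the training data and fills it.
train_filled = pd.DataFrame(imputer.fit_transform(train), columns=train.columns)

# transform reuses the training medians for the test data.
test_filled = pd.DataFrame(imputer.transform(test), columns=test.columns)

print(imputer.statistics_)
print(test_filled)
```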

In [None]:
sns.heatmap(X_test.isnull(),yticklabels=False,cbar=False,cmap='viridis')

In [None]:
sns.heatmap(X_train.isnull(),yticklabels=False,cbar=False,cmap='viridis')

# REGRESSION MODEL

We now fit regression models on the training set and evaluate them on the test set.



In [None]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

In [None]:
X_train.head()

In [None]:
y_pred = lin_reg.predict(X_test)

Compare against the actual values:



In [None]:
from sklearn.metrics import mean_squared_error
lin_mse = mean_squared_error(y_test, y_pred)
lin_rmse = np.sqrt(lin_mse)
lin_rmse

In [None]:
from sklearn.metrics import mean_absolute_error

lin_mae = mean_absolute_error(y_test, y_pred)
lin_mae
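As a quick check of what these two metrics measure, the sketch below computes them by hand with NumPy on four made-up predictions. RMSE squares the errors before averaging, so large errors weigh more, while MAE averages the absolute errors.

```python
import numpy as np

# Made-up targets and predictions for illustration.
y_true = np.array([0.0, 1.0, 1.0, 0.0])
y_hat = np.array([0.5, 0.5, 1.0, 0.0])

errors = y_hat - y_true

rmse = np.sqrt(np.mean(errors ** 2))  # sqrt(0.125), about 0.354
mae = np.mean(np.abs(errors))         # 0.25

print(rmse, mae)
```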

In [None]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(X_train, y_train)

In [None]:
y_pred = tree_reg.predict(X_test)
tree_mse = mean_squared_error(y_test, y_pred)
tree_rmse = np.sqrt(tree_mse)
tree_rmse

# Fine-tune your model (MODEL COMPARISON)

In [None]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree_reg, X_train, y_train,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)

In [None]:
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

display_scores(tree_rmse_scores)

In [None]:
lin_scores = cross_val_score(lin_reg, X_train, y_train,
                             scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)
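The minus sign in `np.sqrt(-scores)` is needed because scikit-learn scorers follow a "greater is better" convention, so `neg_mean_squared_error` returns negated MSE values. A small sketch on synthetic data, generated here with `make_regression` rather than taken from the notebook's dataset (`X_demo` and `y_demo` are names introduced for the example):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem for illustration only.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# One negated MSE value per fold; all values are <= 0.
neg_mse = cross_val_score(LinearRegression(), X_demo, y_demo,
                          scoring="neg_mean_squared_error", cv=5)

# Flip the sign before taking the square root to obtain RMSE per fold.
rmse_per_fold = np.sqrt(-neg_mse)

print(neg_mse)
print(rmse_per_fold)
```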

We specify n_estimators=10 to avoid a warning that the default value is going to change to 100 in Scikit-Learn 0.22.

In [None]:
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor(n_estimators=10, random_state=42)
forest_reg.fit(X_train, y_train)

In [None]:
y_pred = forest_reg.predict(X_test)
forest_mse = mean_squared_error(y_test, y_pred)
forest_rmse = np.sqrt(forest_mse)
forest_rmse

In [None]:
from sklearn.model_selection import cross_val_score

forest_scores = cross_val_score(forest_reg, X_train, y_train,
                                scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)

In [None]:
scores = cross_val_score(lin_reg, X_train, y_train, scoring="neg_mean_squared_error", cv=10)
pandas.Series(np.sqrt(-scores)).describe()

# CLASSIFICATION

In [None]:

dec = DecisionTreeClassifier()

In [None]:
ran = RandomForestClassifier(n_estimators=100)

In [None]:
knn = KNeighborsClassifier(n_neighbors=100)

In [None]:
svm = SVC(random_state=1)
naive = GaussianNB()


In [None]:
models = {"Decision tree" : dec,
          "Random forest" : ran,
          "KNN" : knn,
          "Naive bayes" : naive}
scores= { }


In [None]:
for key, value in models.items():    
    model = value
    model.fit(X_train, y_train)
    scores[key] = model.score(X_test, y_test)

In [None]:
scores_frame = pandas.DataFrame(scores, index=["Accuracy Score"]).T
scores_frame.sort_values(by=["Accuracy Score"], axis=0 ,ascending=False, inplace=True)
scores_frame

In [None]:
plt.figure(figsize=(5,5))
sns.barplot(x=scores_frame.index,y=scores_frame["Accuracy Score"])
plt.xticks(rotation=45)

It seems that KNN and Random Forest algorithms are far ahead of the others.

In [None]:
# Random Forest
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train, y_train)

In [None]:
y_pred_rfc = rfc.predict(X_test)

In [None]:
# Random Forest Model Evaluation
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, y_pred_rfc))
print(classification_report(y_test, y_pred_rfc))

In [None]:
rfc.score(X_test, y_test)

K-Fold cross-validation of the Random Forest model

In [None]:
#Applying k-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies_rfc = cross_val_score(estimator=rfc, X=X_train, y=y_train, cv=10)

In [None]:
accuracies_rfc

In [None]:
accuracies_rfc.mean()

In [None]:
accuracies_rfc.std()

In [None]:
# KNN
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=100)
knn.fit(X_train, y_train)

In [None]:
y_pred_knn = knn.predict(X_test)

In [None]:
# KNN Model Evaluation
print(confusion_matrix(y_test, y_pred_knn))
print(classification_report(y_test, y_pred_knn))

In [None]:
knn.score(X_test, y_test)


In [None]:
#Applying k-Fold Cross Validation
accuracies_knn = cross_val_score(estimator=knn, X=X_train, y=y_train, cv=10)

In [None]:
accuracies_knn

In [None]:
accuracies_knn.mean()

In [None]:
accuracies_knn.std()

In [None]:
# Naive Bayes
from sklearn.naive_bayes import GaussianNB
nbc = GaussianNB()
nbc.fit(X_train, y_train)

In [None]:
y_pred_nbc = nbc.predict(X_test)


In [None]:
# Naive Bayes Model Evaluation
print(confusion_matrix(y_test, y_pred_nbc))
print(classification_report(y_test, y_pred_nbc))

In [None]:
nbc.score(X_test, y_test)

In [None]:
#Applying k-Fold Cross Validation
accuracies_nbc = cross_val_score(estimator=nbc, X=X_train, y=y_train, cv=10)

In [None]:
accuracies_nbc

In [None]:
accuracies_nbc.mean()

In [None]:
accuracies_nbc.std()


Cholesterol, blood pressure (both ap_hi and ap_lo) and age have a strong relationship with cardiovascular disease.

The naive Bayes model was the best performer of all models, with a mean accuracy score of 81.7%. K-fold cross-validation was used to check for overfitting.
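One way to make the overfitting check concrete is to compare training and validation accuracy across folds; a large gap suggests overfitting. The sketch below uses `cross_validate` with `return_train_score=True` on synthetic data from `make_classification`. The data and the names `X_demo`, `y_demo` and `gaps` are assumptions for illustration, not the notebook's dataset or results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem for illustration only.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

candidates = {
    "Naive Bayes": GaussianNB(),
    "Decision tree": DecisionTreeClassifier(random_state=0),
}

gaps = {}
for name, model in candidates.items():
    result = cross_validate(model, X_demo, y_demo, cv=10, return_train_score=True)
    # Mean training accuracy minus mean validation accuracy over the 10 folds.
    gaps[name] = result["train_score"].mean() - result["test_score"].mean()
    print(name, "train/validation gap:", gaps[name])
```

An unpruned decision tree typically fits its training folds almost perfectly, so its gap is a useful reference point when judging the other models.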