# About this notebook 

In this notebook we explore the dataset at https://www.kaggle.com/sulianova/cardiovascular-disease-dataset .
The dataset describes the cardiac condition of about 70k patients. We also fit a logistic regression to predict
whether a given patient has a cardiovascular disease.

# Table of contents

* Importing the data
* Feature analysis
* Data cleaning
* Exploratory data analysis
* Logistic regression
* Notes



What is the plan?

First we read the data and look at its basic information. Then we look at the distribution of each feature and compare the distributions between 'gender' = 1 and 2, and between 'cardio' = 0 and 1, taking notes as we go.
For the logistic regression we use the sklearn package.

# Importing the data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split, cross_val_score


data = pd.read_csv('../input/cardiovascular-disease-dataset/cardio_train.csv',
                   sep=';')  # fields in the file are separated by ';'

# Feature analysis

In [None]:
data['age'] = data['age'] / 365  # age is given in days; convert to years
data.head()







In [None]:
data.describe()

Features:

* Age | Objective Feature | age | int (days)
* Height | Objective Feature | height | int (cm) |
* Weight | Objective Feature | weight | float (kg) |
* Gender | Objective Feature | gender | categorical code | 1 - women, 2 - men
* Systolic blood pressure | Examination Feature | ap_hi | int |
* Diastolic blood pressure | Examination Feature | ap_lo | int |
* Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal |
* Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal |
* Smoking | Subjective Feature | smoke | binary |
* Alcohol intake | Subjective Feature | alco | binary |
* Physical activity | Subjective Feature | active | binary |
* Presence or absence of cardiovascular disease | Target Variable | cardio | binary |


All of the dataset values were collected at the moment of medical examination.


Here we can see some useful information:
* The mean age is 53.3 years and the youngest patient is almost 30 years old
* The "mean" of gender is 1.3 (1 - women, 2 - men), which means there are more women than men in the dataset
* The mean height is 164 cm
* The mean weight is 74.2 kg
* If the systolic and diastolic blood pressures are measured in mmHg, we shouldn't be getting negative minimum values, which suggests some transcription errors. Also, some values (visible in data.head()) are above 120 mmHg for ap_hi and 80 mmHg for ap_lo, which is above the range usually considered normal and may already indicate elevated blood pressure.
* The maximum values of ap_lo and ap_hi are far too high to make any sense, so we had better clean out the impossible values
* The mean of smoke is 0.08, so few people in this dataset smoke. The same can be said for alcohol intake (0.05).
* The mean of active is 0.8, so most people do regular physical exercise
* The mean of cardio is 0.499, so nearly half of the people in the dataset have some cardiovascular disease. This also means the dataset is balanced, so we don't need to rebalance it before applying our machine learning algorithm.
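
The balance argument in the last bullet relies on the fact that the mean of a 0/1 column equals the share of rows equal to 1. A minimal sketch of that check, assuming a small made-up frame named `toy` (the values are invented for illustration and are not taken from the dataset):

```python
import pandas as pd

# Toy frame standing in for the real data; values are invented for illustration.
toy = pd.DataFrame({'cardio': [0, 1, 1, 0, 1, 0, 0, 1]})

# For a binary column, the mean is the share of rows equal to 1.
positive_share = toy['cardio'].mean()

# value_counts(normalize=True) gives the share of each class explicitly.
class_shares = toy['cardio'].value_counts(normalize=True).sort_index()

print(positive_share)
print(class_shares)
```

On the real data, the same two calls on `data['cardio']` would show how close the two classes are to an even split.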

In [None]:
data.info()

Not a single value is missing (NaN).

# Data cleaning

In [None]:
data[data['ap_hi']<0]

It would make sense for the negative values to all be transcription errors, since apart from the sign they are all within the range of acceptable pressures. We can simply convert these negative values to positive and keep them in the dataset. The same can be said for ap_lo:

In [None]:
data[data['ap_lo']<0]

In [None]:
data['ap_lo']=data['ap_lo'].abs()
data['ap_hi']=data['ap_hi'].abs()
data.describe()

Let's define the acceptable values for the pressures:
* ap_hi has to satisfy 10 < ap_hi < 220
* ap_lo has to satisfy 10 < ap_lo < 190

In [None]:
data = data.loc[data['ap_lo']>10]
data = data.loc[data['ap_lo']<190]
data = data.loc[data['ap_hi']>10]
data = data.loc[data['ap_hi']<220]
data.describe()
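
The four filters can also be written as one boolean mask, which makes it easy to count how many rows are dropped. This is only a sketch on an invented frame named `toy`; the interval bounds are the ones defined in the text, everything else is an assumption for illustration:

```python
import pandas as pd

# Toy readings (invented): a few plausible rows and a few implausible ones.
toy = pd.DataFrame({
    'ap_hi': [120, 140, 16020, 110, 5, 150],
    'ap_lo': [80, 90, 100, 1000, 60, 95],
})

# Same open intervals as in the text: 10 < ap_hi < 220 and 10 < ap_lo < 190.
mask = (
    (toy['ap_hi'] > 10) & (toy['ap_hi'] < 220)
    & (toy['ap_lo'] > 10) & (toy['ap_lo'] < 190)
)

cleaned = toy.loc[mask]
dropped = len(toy) - len(cleaned)
print(f'kept {len(cleaned)} rows, dropped {dropped}')
```

Comparing `len()` before and after the filter is a quick way to see how much data a cleaning rule costs.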

We gave up a little over a thousand rows. Now we drop implausibly low weights (30 kg or less) and the 'id' column.

In [None]:
data = data.loc[data['weight']>30]
data=data.drop(columns='id')
data.describe()

# Exploratory data analysis
Now we need to look for differences between populations. First let's see if anything changes between men and women.

In [None]:
datam = data[data['gender']==2]
dataw = data[data['gender']==1]


fig, axs = plt.subplots(2, 2,figsize=(10,10))
axs[0, 0].hist(dataw['age'], bins=15,alpha=0.5,color='blue')
axs[0, 0].hist(datam['age'], bins=15,alpha=0.5,color='green')
axs[0, 0].legend(['women','men'])
axs[0, 0].set_title('age')
axs[0, 1].hist(dataw['height'],bins=15, alpha=0.5,color='blue')
axs[0, 1].hist(datam['height'],bins=15, alpha=0.5,color='green')
axs[0, 1].legend(['women','men'])
axs[0, 1].set_title('height')
axs[1, 0].hist(dataw['weight'],bins=15, alpha=0.5,color='blue')
axs[1, 0].hist(datam['weight'],bins=15, alpha=0.5,color='green')
axs[1, 0].legend(['women','men'])
axs[1, 0].set_title('weight')
axs[1, 1].hist(dataw['smoke'],bins=15, alpha=0.5,color='blue')
axs[1, 1].hist(datam['smoke'],bins=15, alpha=0.5,color='green')
axs[1, 1].legend(['women','men'])
axs[1, 1].set_title('smoke')



Here we can see:
* The age distribution is nearly the same for men and women (there are just more women in the dataset)
* Men in this dataset are taller on average than women
* The weight distribution is also similar for men and women, and both bell-shaped curves are skewed to the right
* More men smoke, even though there are more women in the dataset
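
To make "skewed to the right" concrete, here is a minimal sketch of a moment-based skewness computed by hand. The helper `sample_skewness` and the two lists are made up for illustration; pandas' `Series.skew()` additionally applies a small-sample bias correction, so its values differ slightly from this simple version.

```python
def sample_skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (no small-sample correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Invented weights in kg: one heavy outlier stretches the right tail.
right_tailed = [60, 62, 65, 66, 70, 72, 75, 120]
# Invented, perfectly symmetric values around 70.
symmetric = [60, 65, 70, 75, 80]

print(sample_skewness(right_tailed))  # positive: tail to the right
print(sample_skewness(symmetric))     # zero: symmetric
```

A positive value means the long tail is on the right, which is what a few very heavy patients would produce in a weight histogram.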

In [None]:
# skewness: > 0 skewed to the right; = 0 symmetrical; < 0 skewed to the left
print(dataw['weight'].skew(), datam['weight'].skew())

# DataFrame.size counts cells (rows x columns), so use the column mean instead
m = datam['smoke'].mean()  # proportion of men that smoke
w = dataw['smoke'].mean()  # proportion of women that smoke

print(w, m)

So the proportion of smokers among men is about 12 times that among women in this dataset.
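
The per-gender share of smokers can also be read from a grouped mean. The sketch below uses an invented frame named `toy` and also shows why `DataFrame.size` is not a row count: it counts cells, so dividing by it scales every proportion by the number of columns (the ratio between two groups is unaffected, but the proportions themselves are too small).

```python
import pandas as pd

# Toy data (invented): gender 1 = women, 2 = men; smoke is 0/1.
toy = pd.DataFrame({
    'gender': [1, 1, 1, 1, 2, 2, 2, 2],
    'smoke':  [0, 0, 0, 1, 1, 1, 0, 1],
    'height': [160, 165, 158, 170, 175, 180, 172, 178],
})

# Mean of a 0/1 column within each group = share of smokers in that group.
smoker_share = toy.groupby('gender')['smoke'].mean()

# len() counts rows; .size counts cells (rows x columns).
men = toy[toy['gender'] == 2]
rows, cells = len(men), men.size

print(smoker_share)
print(rows, cells)
```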

In [None]:

fig, axs = plt.subplots(2, 2,figsize=(10,10))
axs[0, 0].hist(dataw['ap_hi'], bins=12, alpha=0.5,color='blue')
axs[0, 0].hist(datam['ap_hi'], bins=12, alpha=0.5,color='green')
axs[0, 0].legend(['women','men'])
axs[0, 0].set_title('ap_hi')

axs[0, 1].hist(dataw['ap_lo'],bins=12, alpha=0.5,color='blue')
axs[0, 1].hist(datam['ap_lo'],bins=12, alpha=0.5,color='green')
axs[0, 1].legend(['women','men'])
axs[0, 1].set_title('ap_lo')

axs[1, 0].hist(dataw['cholesterol'],bins=4, alpha=0.5,color='blue')
axs[1, 0].hist(datam['cholesterol'],bins=4, alpha=0.5,color='green')
axs[1, 0].legend(['women','men'])
axs[1, 0].set_title('cholesterol')

axs[1, 1].hist(dataw['gluc'],bins=4, alpha=0.5,color='blue')
axs[1, 1].hist(datam['gluc'],bins=4, alpha=0.5,color='green')
axs[1, 1].legend(['women','men'])
axs[1, 1].set_title('gluc')

The distributions of pressure, cholesterol and glucose are nearly the same for men and women. Now we do the same comparison by cardiovascular condition.

In [None]:
data0 = data[data['cardio']==0]
data1 = data[data['cardio']==1]


fig, axs = plt.subplots(2, 2,figsize=(10,10))
axs[0, 0].hist(data0['age'], bins=10, alpha=0.5,color='blue')
axs[0, 0].hist(data1['age'], bins=10, alpha=0.5,color='green')
axs[0, 0].legend(['healthy','cardio'])
axs[0, 0].set_title('age')
axs[0, 1].hist(data0['height'],bins=12, alpha=0.5,color='blue')
axs[0, 1].hist(data1['height'],bins=12, alpha=0.5,color='green')
axs[0, 1].legend(['healthy','cardio'])
axs[0, 1].set_title('height')
axs[1, 0].hist(data0['weight'],bins=15, alpha=0.5,color='blue')
axs[1, 0].hist(data1['weight'],bins=15, alpha=0.5,color='green')
axs[1, 0].legend(['healthy','cardio'])
axs[1, 0].set_title('weight')
axs[1, 1].hist(data0['smoke'],bins=3, alpha=0.5,color='blue')
axs[1, 1].hist(data1['smoke'],bins=3, alpha=0.5,color='green')
axs[1, 1].legend(['healthy','cardio'])
axs[1, 1].set_title('smoke')

We can see that age is an important factor for predicting cardiovascular disease.

In [None]:
fig, axs = plt.subplots(2, 2,figsize=(10,10))
axs[0, 0].hist(data0['ap_hi'], bins=12, alpha=0.5,color='blue')
axs[0, 0].hist(data1['ap_hi'], bins=12, alpha=0.5,color='green')
axs[0, 0].legend(['healthy','cardio'])
axs[0, 0].set_title('ap_hi')

axs[0, 1].hist(data0['ap_lo'],bins=12, alpha=0.5,color='blue')
axs[0, 1].hist(data1['ap_lo'],bins=12, alpha=0.5,color='green')
axs[0, 1].legend(['healthy','cardio'])
axs[0, 1].set_title('ap_lo')

axs[1, 0].hist(data0['cholesterol'],bins=4, alpha=0.5,color='blue')
axs[1, 0].hist(data1['cholesterol'],bins=4, alpha=0.5,color='green')
axs[1, 0].legend(['healthy','cardio'])
axs[1, 0].set_title('cholesterol')

axs[1, 1].hist(data0['gluc'],bins=4, alpha=0.5,color='blue')
axs[1, 1].hist(data1['gluc'],bins=4, alpha=0.5,color='green')
axs[1, 1].legend(['healthy','cardio'])
axs[1, 1].set_title('gluc')

Here we can also identify ap_hi, cholesterol and glucose as important factors.

The correlation matrix...

In [None]:
plt.figure(figsize=(16, 6))
sns.heatmap(data.corr()).set_title('Correlation Heatmap', fontdict={'fontsize':12}, pad=12);


We can see that age, weight, ap_hi, ap_lo, cholesterol and gluc have the largest correlation with cardio. Now we can train a model to predict whether a patient has a cardiovascular disease. For this run we use logistic regression.
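
A heatmap can be hard to read precisely, so it may help to list the correlations with the target as numbers. A minimal sketch, assuming an invented frame `toy` whose target column is named `cardio` like in the dataset:

```python
import pandas as pd

# Toy frame (invented values) with a binary target named like the dataset's.
toy = pd.DataFrame({
    'age':    [40, 45, 50, 55, 60, 65],
    'height': [170, 160, 175, 165, 172, 168],
    'cardio': [0, 0, 0, 1, 1, 1],
})

# Pearson correlation of every feature with the target,
# ordered by absolute value so strong negative correlations also rank high.
target_corr = (
    toy.corr()['cardio']
    .drop('cardio')
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(target_corr)
```

Applied to the real `data`, the same expression would give one number per feature, which is easier to compare than colours.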

# Logistic regression

The strategy here is to use the features: age, gender, height, weight, systolic and diastolic pressures, cholesterol, glucose, smoke and active.

For training and validation, a single hold-out split is used: 80% of the dataset is used to train the model and 20% to validate it.
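
As a reminder of what the model computes, logistic regression passes a weighted sum of the features through the logistic (sigmoid) function to get a probability. The sketch below is illustrative only: the helper names, the two features chosen and the weights are hypothetical and are not fitted to this dataset.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one sample."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)

# Hypothetical weights for (age in years, ap_hi); NOT fitted to the dataset.
weights = [0.05, 0.04]
bias = -7.5

p = predict_proba([60, 150], weights, bias)
label = int(p >= 0.5)  # 0.5 is the usual decision threshold
print(p, label)
```

Fitting the model means choosing the weights and the bias so that these probabilities match the observed labels as well as possible.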



In [None]:
# feature columns used by the model ('alco' is left out, as in the list above)
features = ['age', 'gender', 'height', 'weight', 'ap_hi', 'ap_lo',
            'cholesterol', 'gluc', 'smoke', 'active']
log_data = data[features].to_numpy()

# hold out 20% of the rows for validation
train_data, test_data, train_target, test_target = train_test_split(
    log_data, data['cardio'], test_size=0.2)

lr = LogisticRegression().fit(train_data, train_target)

p = lr.score(test_data, test_target)  # mean accuracy on the held-out 20%
print(p)
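
A single split gives one accuracy value that depends on which rows land in the test set. One possible refinement is k-fold cross-validation with `cross_val_score`. The sketch below runs on synthetic data generated with NumPy (an assumption made so the example is self-contained); the scaling step is also an addition that is not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features (invented; two informative columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Inside a pipeline, the scaler is fitted on the training folds only.
model = make_pipeline(StandardScaler(), LogisticRegression())

# Five-fold cross-validation: five accuracy values instead of one.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())
```

The spread of the five scores gives a rough idea of how much the accuracy from a single split can vary.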


# Notes

Here we explored only a very simple logistic regression. We could also try neural networks, support vector machines, etc.
We made some assumptions along the way, and it is worth making them explicit.



* The systolic and diastolic pressures (columns ap_hi and ap_lo) were measured in mmHg
* These pressures sometimes had negative values and other impossible ones (>1000). We assumed that the negative ones were transcription errors and flipped their sign. We dropped the very large ones
* We also assumed a minimum valid weight of 30 kg, which may not hold if cases of dwarfism are in the dataset
* We defined the acceptable intervals for the pressures to be 10<ap_hi<220 and 10<ap_lo<190; this assumption was not based on any concrete reference (higher pressures may be possible)


This notebook was inspired by this work: https://www.kaggle.com/mnassrib/titanic-logistic-regression-with-python by Baligh Mnassri on the Titanic dataset



