<a href="https://colab.research.google.com/github/quicksilverri/medical_data_visualization/blob/main/medical_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Importing and examining the data

In [2]:
import pandas as pd

data = pd.read_csv('medical_examination.csv')

In [3]:
data.sample(5)

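| | id | age | sex | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 49691 | 70942 | 22126 | 1 | 167 | 59.0 | 120 | 80 | 1 | 1 | 0 | 0 | 1 | 0 |
| 1839 | 2595 | 19614 | 2 | 168 | 65.0 | 130 | 90 | 1 | 1 | 0 | 0 | 1 | 1 |
| 29279 | 41839 | 18332 | 1 | 160 | 77.0 | 120 | 80 | 1 | 1 | 0 | 0 | 1 | 0 |
| 10042 | 14329 | 20197 | 1 | 157 | 68.0 | 130 | 80 | 1 | 1 | 0 | 0 | 1 | 0 |
| 29262 | 41813 | 22562 | 1 | 168 | 66.0 | 120 | 70 | 1 | 1 | 0 | 0 | 1 | 0 |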


age: person's age in days  
sex: person's sex (categorical)  
height: person's height in centimeters  
weight: person's weight in kilograms  
ap_hi: person's systolic blood pressure  
ap_lo: person's diastolic blood pressure  
cholesterol: person's level of cholesterol (1: normal, 2: above normal, 3: well above normal)  
gluc: person's level of glucose (1: normal, 2: above normal, 3: well above normal)  
smoke: whether the person smokes  
alco: whether the person consumes alcohol  
active: whether the person is physically active  
cardio: whether the person has any cardiovascular disease  
(for smoke, alco, active and cardio, 1: True, 0: False)



In [4]:
data.dtypes

id               int64
age              int64
sex              int64
height           int64
weight         float64
ap_hi            int64
ap_lo            int64
cholesterol      int64
gluc             int64
smoke            int64
alco             int64
active           int64
cardio           int64
dtype: object

In [21]:
# every column has 70000 non-null values, so there are no NaN values to handle

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70000 entries, 0 to 69999
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           70000 non-null  int64  
 1   age          70000 non-null  int64  
 2   sex          70000 non-null  int64  
 3   height       70000 non-null  int64  
 4   weight       70000 non-null  float64
 5   ap_hi        70000 non-null  int64  
 6   ap_lo        70000 non-null  int64  
 7   cholesterol  70000 non-null  int64  
 8   gluc         70000 non-null  int64  
 9   smoke        70000 non-null  int64  
 10  alco         70000 non-null  int64  
 11  active       70000 non-null  int64  
 12  cardio       70000 non-null  int64  
dtypes: float64(1), int64(12)
memory usage: 6.9 MB
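
The non-null counts above can also be checked directly by counting missing values per column. The snippet below is a minimal sketch on a small made-up frame (`sample` and its values are hypothetical, not taken from the dataset); on the real `data` the same call should report zero for every column, given the `info()` output.

```python
import pandas as pd

# Hypothetical toy frame with two deliberately missing entries
sample = pd.DataFrame({
    'height': [167, 168, None],
    'weight': [59.0, None, 77.0],
})

# isna() marks missing cells with True; sum() counts them per column
missing_per_column = sample.isna().sum()

# Total number of missing cells in the whole frame
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print(total_missing)
```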


In [27]:
data.nunique()

id             70000
age             8076
sex                3
height           109
weight           287
ap_hi            153
ap_lo            157
cholesterol        3
gluc               3
smoke              2
alco               2
active             2
cardio             2
dtype: int64

In [31]:
data.describe()

| | id | age | sex | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 70000.0 | 70000.0 | 70000.0 | 70000.0 | 70000.0 | 70000.0 | 70000.0 | 70000.0 | 70000.0 | 70000.0 | 70000.0 | 70000.0 | 70000.0 |
| mean | 49972.4199 | 19468.865814 | 1.349843 | 164.359229 | 74.20569 | 128.817286 | 96.630414 | 1.366871 | 1.226457 | 0.088129 | 0.053771 | 0.803729 | 0.4997 |
| std | 28851.302323 | 2467.251667 | 0.477253 | 8.210126 | 14.395757 | 154.011419 | 188.47253 | 0.68025 | 0.57227 | 0.283484 | 0.225568 | 0.397179 | 0.500003 |
| min | 0.0 | 10798.0 | 1.0 | 55.0 | 10.0 | -150.0 | -70.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 25006.75 | 17664.0 | 1.0 | 159.0 | 65.0 | 120.0 | 80.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 50% | 50001.5 | 19703.0 | 1.0 | 165.0 | 72.0 | 120.0 | 80.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 75% | 74889.25 | 21327.0 | 2.0 | 170.0 | 82.0 | 140.0 | 90.0 | 2.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| max | 99999.0 | 23713.0 | 3.0 | 250.0 | 200.0 | 16020.0 | 11000.0 | 3.0 | 3.0 | 1.0 | 1.0 | 1.0 | 1.0 |


# Creating a new column

In [5]:
# BMI = weight (kg) / height (m) squared; height is stored in cm, hence the factor 10000
data['BMI'] = data['weight'] * 10000 / data['height'] ** 2

# 1 if BMI is above 25 (overweight), otherwise 0
data['overweight'] = (data['BMI'] > 25).astype(int)

data.sample(5)

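| | id | age | sex | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio | BMI | overweight |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 60192 | 85939 | 20984 | 1 | 157 | 68.0 | 110 | 70 | 1 | 1 | 0 | 0 | 1 | 0 | 27.587326 | 1 |
| 4990 | 7069 | 14398 | 2 | 177 | 82.0 | 110 | 80 | 1 | 1 | 1 | 1 | 1 | 0 | 26.173833 | 1 |
| 11775 | 16818 | 16701 | 2 | 174 | 102.0 | 130 | 80 | 1 | 1 | 0 | 0 | 0 | 0 | 33.690052 | 1 |
| 49498 | 70674 | 15973 | 1 | 168 | 63.0 | 120 | 80 | 1 | 1 | 0 | 0 | 1 | 0 | 22.321429 | 0 |
| 8817 | 12572 | 21954 | 1 | 166 | 65.0 | 120 | 79 | 3 | 3 | 0 | 0 | 1 | 1 | 23.588329 | 0 |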


In [6]:
# the task does not ask for a BMI column, so we drop it

data = data.drop('BMI', axis=1)
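
Since the BMI column is only needed to derive the flag, the same result can be computed without storing it. The sketch below assumes the same units as the dataset (weight in kilograms, height in centimeters); `overweight_flag` is a hypothetical helper name, and the two rows reuse height and weight values from the sample shown above.

```python
import pandas as pd

def overweight_flag(weight_kg: pd.Series, height_cm: pd.Series) -> pd.Series:
    """Return 1 where BMI is above 25, otherwise 0."""
    # Convert height to meters, then BMI = kg / m^2
    bmi = weight_kg / (height_cm / 100) ** 2
    # The boolean comparison is cast to integers (True -> 1, False -> 0)
    return (bmi > 25).astype(int)

# Two rows with height/weight values taken from the sample output
sample = pd.DataFrame({'height': [157, 168], 'weight': [68.0, 63.0]})
sample['overweight'] = overweight_flag(sample['weight'], sample['height'])

print(sample)
```

With this approach there is no temporary column to drop afterwards.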

# Normalizing data
Let's have a look at the _gluc_ and _cholesterol_ columns.

In [35]:
data['gluc'].value_counts()

1    59479
3     5331
2     5190
Name: gluc, dtype: int64

In [36]:
data['cholesterol'].value_counts()

1    52385
2     9549
3     8066
Name: cholesterol, dtype: int64

_Task:_ Normalize the data by making 0 always good and 1 always bad. If the value of cholesterol or gluc is 1, make the value 0. If the value is more than 1, make the value 1.

In [12]:
def normalize(level):
    # 1 (normal) becomes 0 (good); 2 and 3 (above normal) become 1 (bad)
    return 0 if level == 1 else 1

data['cholesterol'] = data['cholesterol'].apply(normalize)
data['gluc'] = data['gluc'].apply(normalize)

In [16]:
data['gluc'].value_counts()

0    59479
1    10521
Name: gluc, dtype: int64

In [17]:
data['cholesterol'].value_counts()

0    52385
1    17615
Name: cholesterol, dtype: int64
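
One caveat: the mapping overwrites the columns in place, so running the normalization cell a second time would remap the already normalized values (every 1 becomes 0 and every 0 becomes 1). A minimal sketch of a re-run-safe variant is shown below; it assumes the raw levels are kept in a separate frame, and `raw` and `normalized` are hypothetical names with made-up values.

```python
import pandas as pd

# Hypothetical raw levels: 1 = normal, 2 = above normal, 3 = well above normal
raw = pd.DataFrame({
    'cholesterol': [1, 2, 3, 1],
    'gluc': [1, 1, 3, 2],
})

# assign() returns a new frame, so `raw` keeps the original levels
# and this cell gives the same result however many times it is run
normalized = raw.assign(
    cholesterol=(raw['cholesterol'] > 1).astype(int),
    gluc=(raw['gluc'] > 1).astype(int),
)

print(normalized)
```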

# Plotting
_Task:_ Convert the data into long format and create a chart that shows the value counts of the categorical features using seaborn's catplot(). The dataset should be split by 'Cardio' so there is one chart for each cardio value. The chart should look like examples/Figure_1.png.
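
A possible approach is sketched below, not a confirmed solution: `pd.melt` reshapes the frame into long format, and seaborn's `catplot` with `kind='count'` draws one panel per `cardio` value. The list of categorical features, the helper names `to_long_format` and `draw_cat_plot`, and the toy `sample` frame are all assumptions; whether the result matches examples/Figure_1.png has not been checked.

```python
import pandas as pd

# Assumed list of categorical features; the task does not name them explicitly
CATEGORICAL = ['cholesterol', 'gluc', 'smoke', 'alco', 'active', 'overweight']

def to_long_format(df: pd.DataFrame) -> pd.DataFrame:
    """One row per (person, feature): columns cardio, variable, value."""
    return pd.melt(df, id_vars=['cardio'], value_vars=CATEGORICAL)

def draw_cat_plot(long_df: pd.DataFrame):
    """Count plot of each feature's values, one panel per cardio value."""
    import seaborn as sns  # imported here so the reshaping works without seaborn
    return sns.catplot(data=long_df, x='variable', hue='value',
                       col='cardio', kind='count')

# Hypothetical toy data with already normalized 0/1 columns
sample = pd.DataFrame({
    'cardio':      [0, 0, 1, 1],
    'cholesterol': [0, 1, 1, 0],
    'gluc':        [0, 0, 1, 0],
    'smoke':       [0, 0, 0, 1],
    'alco':        [0, 0, 0, 0],
    'active':      [1, 1, 0, 1],
    'overweight':  [1, 0, 1, 1],
})

long_df = to_long_format(sample)
print(long_df.head())
```

Calling `draw_cat_plot(long_df)` (or passing the melted real `data`) would then produce the figure, provided seaborn is installed.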

# Cleaning 

# Heatmap plot