# EDA Lab 1
For this lab, you'll get to explore the cardio dataset on your own. More information can be found about this dataset here: 
https://www.kaggle.com/sulianova/cardiovascular-disease-dataset#cardio_train.csv

## Download Dataset
This dataset is located in the Datasets folder of our GitHub repo. Make sure to pull the repo for any changes. 
You may need to update the file path below, depending on where you downloaded the repo on your computer: 

In [1]:
import pandas as pd
cardio = pd.read_csv("../../Datasets/cardio_train.csv", sep=",")

In [2]:
cardio.head()

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,0,18393,male,168,62.0,110,80,1,1,0,0,1,0
1,1,20228,female,156,85.0,140,90,3,1,0,0,1,1
2,2,18857,female,165,64.0,130,70,3,1,0,0,0,1
3,3,17623,male,169,82.0,150,100,1,1,0,0,1,1
4,4,17474,female,156,56.0,100,60,1,1,0,0,0,0


## One Hot Encode Gender Column
We're going to take a look at the effect of gender on cardiovascular disease. Before we can do that, we'll need to one-hot-encode the gender column using pandas  
`get_dummies(df, columns=['col1'])`. 

We can use the syntax above to create the encoded columns. The result is a new dataframe that holds the remaining original columns plus the encoded ones:

In [3]:
# first encode the categories for male and female
one_hot = pd.get_dummies(cardio, columns=['gender'])

In [4]:
one_hot.head()

Unnamed: 0,id,age,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,gender_female,gender_male
0,0,18393,168,62.0,110,80,1,1,0,0,1,0,0,1
1,1,20228,156,85.0,140,90,3,1,0,0,1,1,1,0
2,2,18857,165,64.0,130,70,3,1,0,0,0,1,1,0
3,3,17623,169,82.0,150,100,1,1,0,0,1,1,0,1
4,4,17474,156,56.0,100,60,1,1,0,0,0,0,1,0
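
As a quick check of what `get_dummies` does, here is a minimal sketch on a made-up three-row frame. The `toy` data and the `dtype=int` argument are assumptions for illustration, not part of the lab. It shows that the call returns a new dataframe and leaves the original untouched. Passing `dtype=int` asks for 0/1 integers, since newer pandas versions return booleans by default.

```python
import pandas as pd

# Hypothetical toy data, not taken from cardio_train.csv
toy = pd.DataFrame({"id": [0, 1, 2], "gender": ["male", "female", "female"]})

# get_dummies returns a NEW dataframe; dtype=int gives 0/1 instead of booleans
encoded = pd.get_dummies(toy, columns=["gender"], dtype=int)

print(list(encoded.columns))  # gender is replaced by one column per category
print(list(toy.columns))      # the original frame still has its gender column
```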


## Manually Encode Column
Since there are only two categories we need to convert, male and female, we can also approach this differently by turning the gender column into a binary column where: 
- 0 = female
- 1 = male

In [5]:
cardio.loc[cardio.gender == "female", 'gender'] = 0

In [6]:
cardio.head()

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,0,18393,male,168,62.0,110,80,1,1,0,0,1,0
1,1,20228,0,156,85.0,140,90,3,1,0,0,1,1
2,2,18857,0,165,64.0,130,70,3,1,0,0,0,1
3,3,17623,male,169,82.0,150,100,1,1,0,0,1,1
4,4,17474,0,156,56.0,100,60,1,1,0,0,0,0


Using the code above, convert male entries in the gender column to the integer 1. 

In [7]:
cardio.loc[cardio.gender == "male", 'gender'] = 1

In [8]:
cardio.head()

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,0,18393,1,168,62.0,110,80,1,1,0,0,1,0
1,1,20228,0,156,85.0,140,90,3,1,0,0,1,1
2,2,18857,0,165,64.0,130,70,3,1,0,0,0,1
3,3,17623,1,169,82.0,150,100,1,1,0,0,1,1
4,4,17474,0,156,56.0,100,60,1,1,0,0,0,0
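
The two `.loc` assignments can also be written as a single step. The sketch below uses `Series.map` with a dictionary on made-up data; the `toy` frame and the `gender_codes` name are illustrative assumptions. One side effect worth knowing: `map` yields an integer column here, whereas assigning integers into a string column with `.loc` may leave it with `object` dtype.

```python
import pandas as pd

# Hypothetical toy data, not taken from cardio_train.csv
toy = pd.DataFrame({"gender": ["male", "female", "female", "male"]})

# Same encoding as in the lab: 0 = female, 1 = male
gender_codes = {"female": 0, "male": 1}
toy["gender"] = toy["gender"].map(gender_codes)

print(toy["gender"].tolist())
```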


## Bin Weight Column
Rather than looking at age, as we did in our walk-through, this time we'll take a look at the effect of weight. Let's examine the weight distribution to get a better idea of how we can break up the bins. 

Since America just doesn't want to step away from the imperial system (/rant over), we'll need to convert the weight from kilograms to pounds. Use the following formula to convert the weight column from kilograms to pounds: `lb = kg * 2.2046`. 

Make sure to round and convert to an integer. 

In [11]:
cardio['weight'] = cardio.weight * 2.2046 

In [12]:
cardio.head()

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,0,18393,1,168,136.6852,110,80,1,1,0,0,1,0
1,1,20228,0,156,187.391,140,90,3,1,0,0,1,1
2,2,18857,0,165,141.0944,130,70,3,1,0,0,0,1
3,3,17623,1,169,180.7772,150,100,1,1,0,0,1,1
4,4,17474,0,156,123.4576,100,60,1,1,0,0,0,0
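
The conversion cell above leaves the weights as floats. Here is a minimal sketch of the rounding step the instructions mention, using the first three weights from the earlier `cardio.head()` output. The `weight_kg` and `weight_lb` names are made up for illustration.

```python
import pandas as pd

# First three weights (in kg) from the head() output earlier in the lab
weight_kg = pd.Series([62.0, 85.0, 64.0])

# Convert to pounds, round to the nearest pound, then store as integers
weight_lb = (weight_kg * 2.2046).round().astype(int)

print(weight_lb.tolist())
```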


Now let's look at the distribution using the describe function on the weight column: 

In [13]:
cardio.weight.describe()

count    70000.000000
mean       163.593864
std         31.736885
min         22.046000
25%        143.299000
50%        158.731200
75%        180.777200
max        440.920000
Name: weight, dtype: float64

Clearly we have some outliers. For now, let's pretend it's completely normal to be a 51-year-old man who weighs 22 pounds; we will deal with this when we get to outlier detection.

## Create Weight Bins
Let's keep things easy by binning our weight column into 50-pound buckets, starting at 50 and going up to 250. Create a list of bin edges in this range: 

In [33]:
bins = list(range(50, 300, 50))

In [34]:
bins

[50, 100, 150, 200, 250]

Before writing labels, check which intervals `pd.cut` produces for these bins. Weights that fall outside every interval show up as `nan`: 

In [36]:
for x in pd.cut(cardio.weight, bins).unique():
    print(x)

(100, 150]
(150, 200]
(200, 250]
(50, 100]
nan


Now create a list of labels, one per interval, and use the pandas `cut` function to add a column holding the label for each row: 

In [43]:
labels = ["<= 100", "100-150", "150-200", "200-250"]

In [44]:
cardio["weight_bins"] = pd.cut(cardio.weight, 
                              bins=bins,
                              labels=labels)

In [45]:
cardio.head()

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,weight_bins
0,0,18393,1,168,136.6852,110,80,1,1,0,0,1,0,100-150
1,1,20228,0,156,187.391,140,90,3,1,0,0,1,1,150-200
2,2,18857,0,165,141.0944,130,70,3,1,0,0,0,1,100-150
3,3,17623,1,169,180.7772,150,100,1,1,0,0,1,1,150-200
4,4,17474,0,156,123.4576,100,60,1,1,0,0,0,0,100-150
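
To see how `pd.cut` treats bin edges and out-of-range values, here is a small sketch with the same `bins` and `labels` as above. The `weights` values are hypothetical, chosen to land on the edges and the extremes.

```python
import pandas as pd

bins = [50, 100, 150, 200, 250]
labels = ["<= 100", "100-150", "150-200", "200-250"]

# Hypothetical weights in pounds, picked to hit bin edges and both extremes
weights = pd.Series([22.046, 100.0, 136.6852, 250.0, 440.92])

binned = pd.cut(weights, bins=bins, labels=labels)

# Intervals are right-inclusive by default: 100.0 lands in (50, 100]
# and 250.0 in (200, 250]. Values outside (50, 250] become NaN.
print(binned.tolist())
```

Note that the `"<= 100"` label really covers only weights above 50 and up to 100, since anything at or below 50 falls outside the first interval.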


And now you can use [dropna()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) to drop any rows where weight_bins contains NaN values: 

In [49]:
len(cardio)

70000

In [48]:
len(cardio[pd.isnull(cardio).any(axis=1)])

992

In [50]:
cardio = cardio.dropna(axis=0)

In [51]:
# check that we properly dropped null values
assert len(cardio) == 70000-992
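
Calling `dropna(axis=0)` with no other arguments drops rows that have a NaN in any column. If only the bin column should decide which rows go, the `subset` parameter restricts the check. A sketch on made-up data (the `toy` frame and its values are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 lacks a bin, row 2 lacks a height
toy = pd.DataFrame({
    "height": [168.0, 156.0, np.nan],
    "weight_bins": ["100-150", np.nan, "150-200"],
})

any_nan = toy.dropna(axis=0)                    # drops rows 1 and 2
bins_only = toy.dropna(subset=["weight_bins"])  # drops only row 1

print(len(any_nan), len(bins_only))
```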

## Examine Results
Now we can answer two important questions about our dataset. 

1) How many people in each weight bin have cardiovascular disease? 

In [53]:
cardio.groupby('weight_bins')["cardio"].sum()

weight_bins
<= 100       116
100-150    10795
150-200    18506
200-250     4855
Name: cardio, dtype: int64

2) How many men vs women have cardiovascular disease? 

In [56]:
len(cardio[(cardio.gender == 1) & (cardio.cardio == 1)])

12045

That count is for men (`gender == 1`). And how many women? 

Does this seem like a reasonable way to measure this relationship? 
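
One angle on that question: raw counts depend on how many people are in each group, so a larger group can show more cases even at the same rate. The sketch below uses made-up data (the `toy` values are assumptions, not results from the dataset) and puts the case count next to the group size and the proportion.

```python
import pandas as pd

# Hypothetical toy data: 4 men (gender == 1) and 2 women (gender == 0)
toy = pd.DataFrame({
    "gender": [1, 1, 1, 1, 0, 0],
    "cardio": [1, 1, 0, 0, 1, 0],
})

# sum = number with disease, count = group size, mean = proportion with disease
summary = toy.groupby("gender")["cardio"].agg(["sum", "count", "mean"])
print(summary)
```

In this toy frame the men have twice as many cases as the women, yet the proportion is the same in both groups.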