In [None]:
# Run this first! 
import micropip 

await micropip.install("matplotlib")

%matplotlib inline

from js import fetch

async def get_csv(url):
    res = await fetch(url)
    text = await res.text()
    filename = 'data.csv'
    with open(filename, 'w') as f:
        f.write(text)

await get_csv("https://raw.githubusercontent.com/sadams-teaching/PGPM-503-ENV/main/data/size_data.csv")

# Data Wrangling - Basic Biometric Data

In this activity, you'll be working with a dataset called "size_data.csv".

It contains data for 1000 simulated patients. 

Over the next sets of code blocks, you'll do some basic data exploration and wrangling. 

*Hint*: The first few code blocks are the same as last week's assignment (and most assignments will follow that pattern).


<span style="color: blue; background-color: white">**TASK**: Prepare your environment and load the dataset into Pandas</span>

Remember the command? Feel free to look back on old assignments. 
Set your dataframe name to whatever you want (ideally something intuitive).

<span style="color: blue; background-color: white">**TASK**: Use the "head" function to view the first few rows from the data frame</span>

## Initial thoughts: what's in our dataset?

You can see that our dataset has some basic measurements from a cohort of patients. 
What if we wanted more focused information about our dataset?

<span style="color: blue; background-color: white">**TASK**: Use the "describe" function to learn more about our dataset</span>

Use `describe` instead of `head` in the next code cell.

## Missing Data

We frequently encounter missing data in our datasets. 
Let's see if this dataset is any exception. 

Based on the results above, you should see that Systolic Blood Pressure has a lower "count" than the other variables, which means some of its values are missing. 

We can also use a more specific command to see how many cells are NA for each variable. 

<span style="color: blue; background-color: white">**TASK**: Look for NA values.</span>

Run the following command: 

```python
data.isnull().sum()
```

*Change "data" to whatever you named your data frame*
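To see what this command reports, here is a minimal sketch on a small made-up data frame. The `toy` frame and its values are invented for illustration and are not taken from size_data.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from size_data.csv
toy = pd.DataFrame({
    "Height": [65.0, 70.0, 62.0],
    "Systolic Blood Pressure": [120.0, np.nan, 135.0],
})

# isnull() marks each missing cell as True; sum() then counts the True values per column
na_counts = toy.isnull().sum()
print(na_counts)
```

Each column gets one number: how many of its cells are missing.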

## Deal with Missing Data

Pandas has several built-in methods for handling missing data. 
We are going to use a simple method that "imputes" (fills in) each missing value with the mean of the values that are present. 
Sometimes this is appropriate, and sometimes it isn't. 
Here's more information (warning, quite dense) about Pandas and missing data: https://pandas.pydata.org/docs/user_guide/missing_data.html

<span style="color: blue; background-color: white">**TASK**: Fill missing Systolic Blood Pressure with the mean</span>

Use the following command to create a new data frame with Systolic Blood Pressure NA values filled with the mean. 

```python
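# Fill only the Systolic Blood Pressure column, using that column's own mean
data.fillna({'Systolic Blood Pressure': data['Systolic Blood Pressure'].mean()})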
```
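As a rough illustration of mean imputation, the sketch below fills one column of a made-up data frame. The `toy` frame is hypothetical. Passing a dictionary to `fillna` is one way to restrict the fill to a single column, so that a missing value elsewhere is not overwritten with a blood pressure mean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from size_data.csv
toy = pd.DataFrame({
    "Weight": [150.0, np.nan, 180.0],
    "Systolic Blood Pressure": [120.0, np.nan, 140.0],
})

# mean() skips missing values by default: (120 + 140) / 2
sbp_mean = toy["Systolic Blood Pressure"].mean()

# A dict maps column name -> fill value, so only that column is filled
filled = toy.fillna({"Systolic Blood Pressure": sbp_mean})
print(filled)
```

Note that `fillna` returns a new data frame here; `toy` itself still contains the missing values unless you assign the result back.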

## Wrapping complex code in a function

Our command to fill NA values is kind of messy. 

<span style="color: blue; background-color: white">**TASK**: Make a function to fill missing SBP with the mean in a data frame</span>

Remember that we can use functions to isolate complex logic and make our code easier to read and understand. 

Let's wrap our missing data command into a function that we can call with a chain. 

Put the following in the next code cell: 

```python
def fill_missing_SBP_with_mean(df):
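    # Fill only the Systolic Blood Pressure column, using that column's own mean
    return df.fillna({'Systolic Blood Pressure': df['Systolic Blood Pressure'].mean()})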

data = data\
        .pipe(fill_missing_SBP_with_mean)
```
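If `.pipe` is unfamiliar: `df.pipe(f)` simply calls `f(df)` and returns the result, which is what lets several steps be chained in reading order. A minimal sketch, using a made-up `toy` frame and a hypothetical helper named `add_double`:

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({"x": [1, 2, 3]})

def add_double(df):
    # assign() returns a new frame with the extra column; the input is left untouched
    return df.assign(double=df["x"] * 2)

piped = toy.pipe(add_double)   # chain style
direct = add_double(toy)       # ordinary function call

# Both styles produce the same data frame
print(piped.equals(direct))
```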

## Calculate New Variables

We have height and weight, which are the components of BMI. 
Let's add a new column called "BMI" and calculate it. 

BMI is calculated with weight (in kg) and height (in m) as $weight/height^2$

We have weight in pounds and height in inches, so both need to be converted in our function. 
Let's do this in a chainable function so that we can isolate the math parts and add it to our data frame in line with other modifications. 
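Before applying this to a whole data frame, it can help to work one case by hand. The sketch below uses a single made-up patient (150 lb, 65 in), who is not a row from the dataset, with the standard conversion factors 0.0254 m per inch and 0.453592 kg per pound.

```python
# Hypothetical patient, chosen only to illustrate the arithmetic
weight_lb = 150.0
height_in = 65.0

height_m = height_in * 0.0254      # inches -> meters
weight_kg = weight_lb * 0.453592   # pounds -> kilograms

bmi_value = weight_kg / height_m ** 2
print(round(bmi_value, 1))  # 25.0
```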

Here's how to do it. 
You will just need to make sure that the data frame variable name matches yours. 

In [None]:
def height_inches_to_meters(df):
    # Adds a new column called height_meters
    df.loc[:, 'height_meters'] = df.loc[:, 'Height'] * 0.0254
    return df

def weight_pounds_to_kg(df):
    # Adds a new column called weight_kg
    df.loc[:, 'weight_kg'] = df.loc[:, 'Weight'] * 0.453592
    return df

def bmi(df):
    # Adds a new column called BMI
    df.loc[:, 'BMI'] = df.loc[:, 'weight_kg'] / (df.loc[:, 'height_meters'] ** 2)
    return df

# Update your variable name below if you use something different
data = data\
        .pipe(height_inches_to_meters)\
        .pipe(weight_pounds_to_kg)\
        .pipe(bmi)

data.head()

## Add Creatinine Clearance

You have height, weight, and serum creatinine. 
Perhaps you have an analysis that requires kidney function?

<span style="color: blue; background-color: white">**TASK**: Make a function that will calculate creatinine clearance. </span>

Use the same techniques as above to calculate creatinine clearance with the Cockcroft-Gault Equation. 

We don't know the subjects' sex, so use actual body weight and do not apply the 0.85 correction factor. 
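As a reminder of the arithmetic, here is a sketch of the Cockcroft-Gault equation for a single patient, without the 0.85 factor. It assumes age in years, weight in kilograms, and serum creatinine in mg/dL. The function name, argument names, and example values are hypothetical, so check the units and column names in your own data frame before adapting it into a chainable function.

```python
def cockcroft_gault(age_years, weight_kg, serum_creatinine_mg_dl):
    # Estimated creatinine clearance in mL/min:
    # ((140 - age) * weight) / (72 * serum creatinine)
    # Assumes weight is in kg and creatinine is in mg/dL
    return ((140 - age_years) * weight_kg) / (72 * serum_creatinine_mg_dl)

# Hypothetical patient: 40 years old, 72 kg, serum creatinine 1.0 mg/dL
crcl = cockcroft_gault(40, 72.0, 1.0)
print(crcl)  # 100.0
```

Since the equation takes weight in kilograms, the `weight_kg` column created earlier is a likely fit rather than the original weight in pounds.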

## Exploration

<span style="color: blue; background-color: white">**TASK**: Scatterplots!</span>

Remember from the first assignment how to make a scatterplot of two variables?

```python
data.plot.scatter(x = "variable_name", y = "variable_name")
```
Change the x and y variables to match columns you want to visualize. 
Try a few combinations. 
Before you move on, leave at least one of your scatterplots with variables you think might be correlated. 


You can also add a code comment to describe your findings (prepend the line with a "#").