# **Body Data**

#### Previously we focused on analyzing one or two **qualitative** variables. We now turn our focus to analyzing one **quantitative** variable by:

  * constructing a histogram
  * constructing a box plot
  * calculating numerical summaries
  * describing what the graph and numerical summaries tell us about the variable
  * comparing 1 or more quantitative variables among different categories of a qualitative variable


# **Importing Necessary Python Modules**

Python incorporates a variety of open-source add-ins called **modules** that add extra features to the basic setup. In the code below, each module's name follows the `import` statement, and its purpose is given in a comment after the hash sign (#).



In [1]:
import pandas as pd                 #Data analysis
import numpy as np                  #Calculations
import plotly.express as px         #Graphing
import matplotlib.pyplot as plt     #Graphing
from IPython.display import Image   #Display images
import warnings                     #Ignore version warnings
warnings.simplefilter('ignore', FutureWarning)


In [2]:
# Replace 'image_url' with the URL of the image you want to display
image_url = 'https://media.istockphoto.com/id/456054995/photo/dna-molecules-and-virtuvian-man.jpg?s=612x612&w=0&k=20&c=v5qZJ5Ty4RwDbyGRx_v-tYd1-LfTZwTi-Aend5Q_sqA='

# Display the image
Image(url=image_url)

# **Context**

The National Center for Health Statistics (NCHS) offers downloadable public-use data files through the Centers for Disease Control and Prevention's (CDC) FTP file server. Users of this service have access to data sets, documentation, and questionnaires from NCHS surveys and data collection systems.

Public-use data files are prepared and disseminated to provide access to the full scope of the data. This allows researchers to manipulate the data in a format appropriate for their analyses. NCHS makes every effort to release data collected through its surveys and data systems in a timely manner.

More information can be found at https://www.cdc.gov/nchs/data_access/ftp_data.htm.


# **About the Dataset**

This dataset contains 301 rows corresponding to a sample of Americans. A total of 16 variables are provided as listed below:

**Variables**

| Column     | Description                                                                 |
|------------|-----------------------------------------------------------------------------|
| AGE        | Age in years|
| GENDER (1=M) | Gender: 0=female, 1=male|
| PULSE      | Pulse rate in beats per minute (bpm)|
| SYSTOLIC   | Systolic blood pressure (mm Hg)|
| DIASTOLIC  | Diastolic blood pressure (mm Hg)|
| CATEGORY   | Blood Pressure Category based on the table below from the American Heart Association|
| HDL        | HDL cholesterol (mg/dL)|
| LDL        | LDL cholesterol (mg/dL)|
| WHITE      | White blood cell count (1000 cells/µL) |
| RED        | Red blood cell count (million cells/µL)|
| PLATE      | Platelet count (1000 cells/µL)|
| WEIGHT     | Weight (kg)|
| HEIGHT     | Height (cm)|
| WAIST      | Waist circumference (cm)|
| ARM CIRC   | Arm circumference (cm)|
| BMI        | Body mass index (kg/m²)|

# **Blood Pressure Category Table**

The table below from the American Heart Association classifies blood pressure into five (5) categories based on a combination of the individual's systolic and diastolic blood pressure.

In [3]:
# Replace 'image_url' with the URL of the image you want to display
image_url = 'https://www.heart.org/-/media/Images/Health-Topics/High-Blood-Pressure/Rainbow-Chart/blood-pressure-readings-chart.jpg?h=294&iar=0&mw=440&w=440&sc_lang=en'

# Display the image
Image(url=image_url)


# **A Snippet of the Data**

We can view a snippet of the data by first importing it directly from the URL below.


In [4]:
# Assign the web address to the name 'url'
url="https://raw.githubusercontent.com/thamilton562/STAT108_Projects_Students/main/Project1/Body%20Data.csv"

# df stands for data frame, which is what the raw data is referred to as.
# This code reads in the data file from the address specified in 'url'
df=pd.read_csv(url)

Next, we can display the data by *typing the name* of the DataFrame. To ensure we can see all columns, we'll use the `pd.set_option` function.

In [5]:
# Set display options to show all columns
pd.set_option('display.max_columns', None)
df

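| Unnamed: 0 | AGE | GENDER (1=M) | PULSE | SYSTOLIC | DIASTOLIC | CATEGORY | HDL | LDL | WHITE | RED | PLATE | WEIGHT | HEIGHT | WAIST | ARM CIRC | BMI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 43 | 0 | 80 | 100 | 70 | NORMAL | 73 | 68 | 8.7 | 4.80 | 319 | 98.6 | 172.0 | 120.4 | 40.7 | 33.3 |
| 1 | 38 | 0 | 94 | 134 | 94 | HYPERTENSION STAGE 2 | 36 | 223 | 6.9 | 4.47 | 297 | 108.2 | 154.4 | 120.3 | 44.3 | 45.4 |
| 2 | 69 | 0 | 58 | 138 | 80 | HYPERTENSION STAGE 1 | 40 | 140 | 8.1 | 4.60 | 286 | 79.2 | 155.7 | 103.5 | 34.2 | 32.7 |
| 3 | 44 | 0 | 66 | 114 | 66 | NORMAL | 45 | 136 | 8.0 | 4.09 | 263 | 64.2 | 157.6 | 89.7 | 32.5 | 25.8 |
| 4 | 72 | 0 | 56 | 110 | 72 | NORMAL | 53 | 102 | 6.9 | 4.15 | 215 | 98.2 | 168.6 | 115.3 | 38.5 | 34.5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 295 | 18 | 1 | 72 | 106 | 44 | NORMAL | 40 | 124 | 4.0 | 5.17 | 221 | 71.6 | 172.8 | 78.1 | 31.0 | 24.0 |
| 296 | 67 | 1 | 62 | 136 | 82 | HYPERTENSION STAGE 1 | 39 | 62 | 7.7 | 3.90 | 305 | 110.2 | 169.1 | 125.5 | 39.0 | 38.5 |
| 297 | 24 | 1 | 94 | 96 | 62 | NORMAL | 43 | 102 | 7.0 | 5.29 | 260 | 56.3 | 162.7 | 78.4 | 27.9 | 21.3 |
| 298 | 53 | 1 | 86 | 132 | 74 | HYPERTENSION STAGE 1 | 42 | 112 | 8.4 | 4.07 | 75 | 102.6 | 181.0 | 117.7 | 36.5 | 31.3 |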
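A column labeled `Unnamed: 0` usually appears when a CSV file was saved together with its row index. That is an assumption about this file, not something the notebook states. The sketch below uses a small hand-typed CSV string (hypothetical, held in `csv_text`) to show how the `index_col` argument of `pd.read_csv` could absorb such a column, and how `.shape` reports the number of rows and columns.

```python
import io
import pandas as pd

# Hypothetical CSV text that mimics a file saved with its row index:
# the first header cell is empty, so pandas names it 'Unnamed: 0'.
csv_text = ",AGE,PULSE\n0,43,80\n1,38,94\n2,69,58\n"

# Default read: the index column comes back as an ordinary column.
default_df = pd.read_csv(io.StringIO(csv_text))

# Reading with index_col=0 uses that first column as the row index instead.
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_df.columns.tolist())
print(indexed_df.columns.tolist())
print(indexed_df.shape)  # (number of rows, number of columns)
```

The rest of this notebook works either way, because every cell selects columns by name.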


# **INSTRUCTIONS**

* Use Python to analyze the data set and complete each of the following.
* **Replace ellipsis (...)** with the relevant names or code.  
* For problems that require a written response, **double click** the text box to start typing.

# **Analyzing a Quantitative Variable**
* We will work together in question 1 to analyze HDL cholesterol levels for the 300 patients in the data set.
* You will then complete question 2 on your own to analyze white blood cell counts.
* In question 3, we will compare HDL cholesterol levels and white blood cell counts among males and females.

# **Question 1**

Blood contains two types of cholesterol: low-density lipoprotein (LDL) and high-density lipoprotein (HDL).

LDL: Also known as "bad" cholesterol, LDL carries cholesterol from the liver to other parts of the body. High levels of LDL can lead to plaque buildup in the arteries, which can narrow and harden them. This can increase the risk of heart disease and stroke.

HDL: Also known as "good" cholesterol, HDL absorbs excess cholesterol from the blood and carries it back to the liver. The liver then removes the cholesterol from the body. High levels of HDL can lower the risk of heart disease and stroke.

(Source: https://www.cdc.gov/cholesterol/about/ldl-and-hdl-cholesterol-and-triglycerides.html#:~:text=What%20are%20LDL%20and%20HDL,for%20heart%20disease%20and%20stroke.)

**1a)** When analyzing a quantitative variable, it is a good idea to start with a graph. Construct a histogram for HDL cholesterol. Use 15 bins (class intervals).

What information do you get when you hover over a bar? Try different bars. Try parts of the graph.

ANS: ...

In [6]:
# Create the histogram, with the x-axis being HDL levels.
# nbins specifies the number of bins to use.
# Notice the addition of the units in the labels.
fig = px.histogram(x=df['...'], nbins = ...,
             title='...',
             labels={'x':...,'y':...})

# Update the vertical axis title.
fig.update_layout(yaxis_title="...")

# Print the histogram.
fig.show()

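To see what "bins" (class intervals) mean numerically, the sketch below counts made-up values into 4 equal-width intervals with `np.histogram`. The data and the variable names are illustrative assumptions, not part of the Body Data file, and Plotly may choose rounder bin edges than NumPy does, so an interactive chart need not match these counts exactly.

```python
import numpy as np

# Made-up measurements (illustrative only).
values = np.array([31, 35, 38, 42, 44, 47, 49, 52, 55, 58, 63, 71])

# Split the range from min to max into 4 equal-width class intervals
# and count how many values fall in each one.
counts, edges = np.histogram(values, bins=4)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f} to {right:.0f}: {count} values")
```

Each bar of a histogram corresponds to one of these intervals, and its height is the count for that interval.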

**1b)** Construct a boxplot for HDL cholesterol.

In [None]:
# Create the boxplot, with a title, and specify horizontal axis label.
# Include units in your axis label.
px.box(x=df[...],
       title=...,
       labels={'x':...})

**1c)** We learn in lecture that the 5-number summary consists of the minimum, Q<sub>1</sub>, median, Q<sub>3</sub>, and the maximum. Calculate the 5-number summary, the mean, and the standard deviation for HDL cholesterol. Round to 3 decimal places.

In [None]:
# .describe calculates all required summary statistics, and the number
#   of individuals in the data set.
descriptive_stats = df[[...]].describe().round(3)

# Print the results
...
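As a rough illustration of how a `.describe()` table can be read, the sketch below runs it on a made-up column named `X` (hypothetical, not from the Body Data file) and pulls out single entries with `.loc`. It also shows that the `std` row is the sample standard deviation, which divides by n - 1.

```python
import pandas as pd

# Made-up column (illustrative only).
toy = pd.DataFrame({'X': [2, 4, 4, 4, 5, 5, 7, 9]})

# Same pattern as the cell above: describe, then round.
toy_stats = toy[['X']].describe().round(3)
print(toy_stats)

# Individual entries can be read by row label and column name.
mean_x = toy_stats.loc['mean', 'X']
std_x = toy_stats.loc['std', 'X']

# For comparison: ddof=0 divides by n rather than n - 1.
population_std = round(toy['X'].std(ddof=0), 3)
print(mean_x, std_x, population_std)
```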

#### Investigation 1: The table above includes something labeled 25%. What does this refer to? What about 50%? What about 75%?

* 25% is ...
* 50% is ...
* 75% is ...

#### Investigation 2: Hover over the boxplot from (1b). Notice the values it provides.
* Verify the 5-number summary matches what is in the table. (Nothing to write)
* List the values of all the outliers below.

Outliers: ...
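Box plots commonly flag a value as an outlier when it lies more than 1.5 × IQR below Q<sub>1</sub> or above Q<sub>3</sub>. The sketch below applies that rule by hand to a made-up sample; the data and names are assumptions for illustration. Quartile conventions can differ slightly between tools, so values shown when hovering over a Plotly box plot may not agree with pandas to every decimal.

```python
import pandas as pd

# Made-up measurements (illustrative only, not taken from the Body Data file).
values = pd.Series([42, 45, 47, 50, 51, 53, 55, 58, 60, 95])

# Quartiles and interquartile range (pandas default: linear interpolation).
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# The 1.5 x IQR fences: anything outside them is flagged as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"Fences: {lower_fence} to {upper_fence}")
print("Outliers:", outliers.tolist())
```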



**1d)** Use the above information, in (1a)-(1c), to describe HDL cholesterol in terms of shape, outliers, center, and spread.
* Use the correct center and the correct spread based on the shape of the distribution.
* Specify which shape and which center you are using. For example, say "The mean is ..." or "The median is ...", rather than "The center is ..."
* When addressing outliers, if any, list a minimum of 3 values.
* Include units, if any, for all numbers.
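The reason the choice of center depends on shape can be sketched with a small made-up sample: a single extreme value moves the mean noticeably while leaving the median where it was. The numbers below are hypothetical and are not taken from the Body Data file.

```python
import pandas as pd

# Made-up sample with one unusually large value (illustrative only).
values = pd.Series([3, 4, 4, 5, 5, 6, 6, 7, 30])

# Center measures with the extreme value included.
mean_all = round(values.mean(), 2)
median_all = values.median()

# The same measures after dropping the extreme value.
trimmed = values[values < 30]
mean_trimmed = trimmed.mean()
median_trimmed = trimmed.median()

print(f"With the extreme value:    mean = {mean_all}, median = {median_all}")
print(f"Without the extreme value: mean = {mean_trimmed}, median = {median_trimmed}")
```

In this toy sample the median stays at the same value in both cases, while the mean shifts by more than two units.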



# **Question 2: Now you try**

Repeat Question 1 to analyze white blood cell count.

**2a)** Construct a histogram for white blood cell count. Use 20 class intervals.

In [None]:
# Create the histogram, with the x-axis being white blood cell counts.
# Include units in the labels.
fig = px.histogram(x=df[...],nbins = ...,
                  title=...,
                  labels={'x':'...',
                     'y':'...'})

# Update the vertical axis title.
fig.update_layout(yaxis_title="...")

# Print the histogram.
fig.show()

**2b)** Construct a boxplot for white blood cell count.

In [None]:
# Create the boxplot, with a title, and specify horizontal axis label.
# Include units in the labels.
px.box(x=df[...],
       title=...,
       labels={'x':...})

**2c)** You know that the 5-number summary consists of the minimum, Q<sub>1</sub>, median, Q<sub>3</sub>, and the maximum. Calculate the 5-number summary, the mean, and the standard deviation for white blood cell count. Round to 3 decimal places.

In [None]:
# Calculate the required summary statistics.
descriptive_stats = df[[...]].describe().round(...)

# Print the summary statistics.
...

#### From the output above, the mean is 6.542. How many white blood cells are there per µL of blood?

...

**2d)** Use the above information, in (2a)-(2c), to describe white blood cell count in terms of shape, outliers, center, and spread.
* Use the correct center and the correct spread based on the shape of the distribution.
* Specify which shape and which center you are using. For example, say "The mean is ..." or "The median is ...", rather than "The center is ..."
* When addressing outliers, if any, list a minimum of 3 values.
* Include units, if any, for all numbers.

...

# **Question 3**

It is common to compare the distribution of one or more quantitative variables among different categories of a qualitative variable.

How do the mean HDL and mean white blood cell counts compare among males versus females? Let's investigate.

**3a)** Calculate the mean HDL and white blood cell counts for males and females. Round to 2 decimal places.

In [None]:
# df[df['GENDER (1=M)'] == 0] filters the DataFrame to just females.
# [['HDL', 'WHITE']] selects just the 2 variables listed.
# .mean() calculates the mean for just those 2 variables
# female_means = assigns the name female_means to that filtered DataFrame with
#   the means calculated.
# Note: (==) tests if the comparison is true, rather than assigning a value (=).
#   If GENDER = 0, include that person in the new DataFrame.

# female_means = df[df['qual variable'] == 0][['quant 1', 'quant 2']].mean().round(#)
female_means = df[df['...'] == 0][['...', '...']].mean().round(...)


# Repeat the process for just males.
# male_means = df[df['qual'] == 1][['quant 1', 'quant 2']].mean().round(#)
male_means = df[df['GENDER (1=M)'] == ...][['...', '...']].mean().round(...)

# Creates a table (DataFrame) to display the information calculated above.
means_table = pd.DataFrame({'HDL': [male_means['HDL'], female_means['HDL']],
                            'White': [male_means['WHITE'], female_means['WHITE']],
                           },
                           index=['Male', 'Female'])
# Print the table.
print('Mean HDL and Mean White Blood Cell Count For Each Gender')
...
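The filter-then-average pattern above can be tried on a toy table first. The sketch below uses a made-up DataFrame with hypothetical column names (`GROUP`, `SCORE_A`, `SCORE_B`) and also shows `groupby`, an alternative that is not the approach used in this notebook but computes the means for every category in one step.

```python
import pandas as pd

# Made-up data (illustrative only, not taken from the Body Data file).
toy = pd.DataFrame({
    'GROUP':   [0, 0, 0, 1, 1, 1],
    'SCORE_A': [10.0, 12.0, 14.0, 20.0, 22.0, 27.0],
    'SCORE_B': [1.5, 2.5, 3.5, 4.0, 5.0, 6.0],
})

# Filter to one category, keep two columns, then average (pattern used above).
group0_means = toy[toy['GROUP'] == 0][['SCORE_A', 'SCORE_B']].mean().round(2)

# Alternative: let groupby split the rows by category and average each group.
grouped_means = toy.groupby('GROUP')[['SCORE_A', 'SCORE_B']].mean().round(2)

print(group0_means)
print(grouped_means)
```

Both approaches give the same means for group 0; `groupby` simply avoids repeating the filter once per category.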

**3b)** Compare the mean HDL cholesterol levels and the mean white blood cell counts for males and females. Always include units.

...

<br><br>
# Practice Submission
### Save as PDF
1. Run all code one last time
2. File -> Print Preview opens in a new browser window
3. Verify you can see all graphs. If not, go back to step 1.
4. File -> Print (or Ctrl+P / Cmd+P)
5. Change destination to PDF (don't save, yet)
6. Scroll through preview to make sure you can see your graphs entirely. If not, click Cancel. Make the browser window narrower. Go back to step 4.
7. Repeat steps 4-6 until you can see your graphs completely. But do not make them too narrow.
8. Save the PDF, taking note of where it is saved.

### Submit to GradeScope
1. Log in to the Canvas course
2. Click on GradeScope in the course navigation
3. Click on the STAT 108 course
4. Click on the "Tutorial 2 Practice Upload" assignment
5. Click the "Submit Work" or "Start Submission" button.
6. Select the PDF you just created.
7. Assign pages to questions by clicking on the question number then clicking on all pages that question is on.
8. Click "Submit"

# Keep this for your reference for Project 1. You are now ready to complete Project 1 on your own. If you have not yet done so, go to Canvas and navigate to the Projects module. There are 6 data sets. Pick one and click on the link for Project 1 for your chosen data set to open the related Jupyter Notebook. Once completed, follow the directions in the notebook for submitting.