# **Body Data**

#### In this tutorial you will analyze a qualitative variable by:

  * ordering the categories of a variable.
  * creating a frequency distribution and relative frequency distribution as one table, with the categories ordered.
  * creating a bar graph for a single variable, with the categories ordered.
  * creating a two-way table for analyzing two variables.
  * creating a side-by-side (comparative, grouped) bar graph.


# **Importing Necessary Python Modules**

Python supports a wide range of open-source add-ons called **modules** that extend its basic features. In the code below, each module's name follows the `import` statement, and its purpose is described in a comment after the hash symbol (#).



In [None]:
import pandas as pd                 #Data analysis
import numpy as np                  #Calculations
import plotly.express as px         #Graphing
import matplotlib.pyplot as plt     #Graphing
from IPython.display import Image   #Display images
import warnings                     #Ignore version warnings
warnings.simplefilter('ignore', FutureWarning)


In [None]:
# Replace 'image_url' with the URL of the image you want to display
image_url = 'https://media.istockphoto.com/id/456054995/photo/dna-molecules-and-virtuvian-man.jpg?s=612x612&w=0&k=20&c=v5qZJ5Ty4RwDbyGRx_v-tYd1-LfTZwTi-Aend5Q_sqA='

# Display the image
Image(url=image_url)

# **Context**

The National Center for Health Statistics (NCHS) offers downloadable public-use data files through the Centers for Disease Control and Prevention's (CDC) FTP file server. Users of this service have access to data sets, documentation, and questionnaires from NCHS surveys and data collection systems.

Public-use data files are prepared and disseminated to provide access to the full scope of the data. This allows researchers to manipulate the data in a format appropriate for their analyses. NCHS makes every effort to release data collected through its surveys and data systems in a timely manner.

More information can be found at https://www.cdc.gov/nchs/data_access/ftp_data.htm.


# **About the Dataset**

This dataset contains 301 rows corresponding to a sample of Americans. A total of 16 variables are provided as listed below:

**Variables**

| Column     | Description                                                                 |
|------------|-----------------------------------------------------------------------------|
| AGE        | Age in years|
| GENDER     | Gender: 0=female, 1=male|
| PULSE      | Pulse rate in beats per minute (bpm)|
| SYSTOLIC   | Systolic blood pressure (mm Hg)|
| DIASTOLIC  | Diastolic blood pressure (mm Hg)|
| CATEGORY   | Blood Pressure Category based on the table below from the American Heart Association|
| HDL        | HDL cholesterol (mg/dL)|
| LDL        | LDL cholesterol (mg/dL)|
| WHITE      | White blood cell count (1000 cells/µL) |
| RED        | Red blood cell count (million cells/µL)|
| PLATE      | Platelet count (1000 cells/µL)|
| WEIGHT     | Weight (kg)|
| HEIGHT     | Height (cm)|
| WAIST      | Waist circumference (cm)|
| ARM CIRC   | Arm circumference (cm)|
| BMI        | Body mass index (kg/m²)|

# **Blood Pressure Category Table**

The table below from the American Heart Association classifies blood pressure into five (5) categories based on a combination of the individual's systolic and diastolic blood pressure.
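
As a rough illustration of how such a classification combines the two readings, the sketch below encodes the commonly published American Heart Association thresholds in a hypothetical helper named `bp_category`. The function name and the label strings are assumptions made for illustration. The data set already supplies CATEGORY, so the exact spellings should be taken from the data rather than from this sketch.

```python
def bp_category(systolic, diastolic):
    """Return a blood pressure category label (hypothetical helper)."""
    # Check the most severe category first. 'or' means that either
    # reading alone is enough to place a person in the higher category.
    if systolic > 180 or diastolic > 120:
        return "HYPERTENSIVE CRISIS"
    if systolic >= 140 or diastolic >= 90:
        return "HYPERTENSION STAGE 2"
    if systolic >= 130 or diastolic >= 80:
        return "HYPERTENSION STAGE 1"
    if systolic >= 120:          # diastolic is already below 80 here
        return "ELEVATED"
    return "NORMAL"


# (systolic, diastolic) pairs in mm Hg
readings = [(100, 70), (125, 75), (138, 80), (134, 94)]
labels = [bp_category(s, d) for s, d in readings]
print(labels)
```

Note that a single high reading is enough: 134/94 lands in the second hypertension stage because of the diastolic value alone.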

In [None]:
# Replace 'image_url' with the URL of the image you want to display
image_url = 'https://www.heart.org/-/media/Images/Health-Topics/High-Blood-Pressure/Rainbow-Chart/blood-pressure-readings-chart.jpg?h=294&iar=0&mw=440&w=440&sc_lang=en'
# Display the image
Image(url=image_url)


# **A Snippet of the Data**

We can view a snippet of the data by first importing it directly from the URL below.


In [None]:
# Assign the web address of the data file to 'url'
url="https://raw.githubusercontent.com/ksuaray/STAT108F24_Projects_Jupyter/main/Project0/Body%20Data.csv"

# df stands for DataFrame, the pandas object that holds the raw data.
# This code reads in the data file from the address specified in 'url'
df=pd.read_csv(url)
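
After reading a file, it can help to check its size and exact column names before using them in code. The sketch below does this on a small made-up CSV held in memory. The text in `csv_text` and the name `toy` are hypothetical stand-ins, not the Body Data file.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a downloaded file
csv_text = """AGE,GENDER (1=M),ARM CIRC
43,0,40.7
18,1,31.0
"""

# read_csv accepts a file-like object as well as a URL or path
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)           # (number of rows, number of columns)
print(list(toy.columns))   # exact column names, including spaces

# Bracket notation needs the full name exactly as it appears above
print(toy['GENDER (1=M)'].tolist())
```

Column names that contain spaces or parentheses must be typed in full inside the brackets, with matching capitalization.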

Next, we can display the data by *typing the name* of the DataFrame on the last line of a cell. To ensure we can see all columns, we'll first call the *pd.set_option* function.
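
`pd.set_option` changes the display setting for the rest of the session. As an optional alternative, the sketch below uses `pd.option_context` to change the setting only inside a `with` block. The DataFrame `wide` is made-up data used purely for illustration.

```python
import pandas as pd

# Toy DataFrame with 30 columns (hypothetical data, not the Body Data file)
wide = pd.DataFrame([list(range(30))],
                    columns=[f"col{i}" for i in range(30)])

default_limit = pd.get_option('display.max_columns')

# Inside the 'with' block, no columns are hidden when the frame is printed
with pd.option_context('display.max_columns', None):
    inside_limit = pd.get_option('display.max_columns')
    print(wide)

# After the block, the previous limit is back in effect
after_limit = pd.get_option('display.max_columns')
print(default_limit, inside_limit, after_limit)
```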

In [None]:
# Set display options to show all columns
pd.set_option('display.max_columns', None)
df

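| Unnamed: 0 | AGE | GENDER (1=M) | PULSE | SYSTOLIC | DIASTOLIC | CATEGORY | HDL | LDL | WHITE | RED | PLATE | WEIGHT | HEIGHT | WAIST | ARM CIRC | BMI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 43 | 0 | 80 | 100 | 70 | NORMAL | 73 | 68 | 8.7 | 4.80 | 319 | 98.6 | 172.0 | 120.4 | 40.7 | 33.3 |
| 1 | 38 | 0 | 94 | 134 | 94 | HYPERTENSION STAGE 2 | 36 | 223 | 6.9 | 4.47 | 297 | 108.2 | 154.4 | 120.3 | 44.3 | 45.4 |
| 2 | 69 | 0 | 58 | 138 | 80 | HYPERTENSION STAGE 1 | 40 | 140 | 8.1 | 4.60 | 286 | 79.2 | 155.7 | 103.5 | 34.2 | 32.7 |
| 3 | 44 | 0 | 66 | 114 | 66 | NORMAL | 45 | 136 | 8.0 | 4.09 | 263 | 64.2 | 157.6 | 89.7 | 32.5 | 25.8 |
| 4 | 72 | 0 | 56 | 110 | 72 | NORMAL | 53 | 102 | 6.9 | 4.15 | 215 | 98.2 | 168.6 | 115.3 | 38.5 | 34.5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 295 | 18 | 1 | 72 | 106 | 44 | NORMAL | 40 | 124 | 4.0 | 5.17 | 221 | 71.6 | 172.8 | 78.1 | 31.0 | 24.0 |
| 296 | 67 | 1 | 62 | 136 | 82 | HYPERTENSION STAGE 1 | 39 | 62 | 7.7 | 3.90 | 305 | 110.2 | 169.1 | 125.5 | 39.0 | 38.5 |
| 297 | 24 | 1 | 94 | 96 | 62 | NORMAL | 43 | 102 | 7.0 | 5.29 | 260 | 56.3 | 162.7 | 78.4 | 27.9 | 21.3 |
| 298 | 53 | 1 | 86 | 132 | 74 | HYPERTENSION STAGE 1 | 42 | 112 | 8.4 | 4.07 | 75 | 102.6 | 181.0 | 117.7 | 36.5 | 31.3 |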


# **INSTRUCTIONS**

* Use Python to analyze the data set and complete each of the following.
* Replace each ellipsis (...) with the relevant names or code.  
* For problems that require a written response, double click the text box to start typing.

# **Analyzing a Qualitative Variable**

# **Question 1**

**1a)** In Project 0 we learned how to create a frequency table and a relative frequency table, and we printed them separately. It is often helpful to have these two pieces of information (frequency and relative frequency) in the same table. Today, we will construct a single table that contains the frequency and relative frequency for the variable CATEGORY.
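
To show the general pattern on a small scale, here is a sketch that builds a combined table for a made-up variable. The shirt-size data and the names `sizes`, `size_order`, `freq`, `rel_freq`, and `table` are hypothetical and are not part of the Body Data set.

```python
import pandas as pd

# Hypothetical toy data: shirt sizes for 8 people
sizes = pd.Series(['M', 'S', 'L', 'M', 'M', 'S', 'L', 'M'], name='SIZE')
size_order = ['S', 'M', 'L']          # desired (non-alphabetical) order

# Count each category, then put the rows in the chosen order
freq = sizes.value_counts().reindex(size_order).rename('Frequency')

# Relative frequency = category count / total number of observations
rel_freq = (freq / freq.sum()).rename('Relative Frequency')

# axis=1 places the two Series side by side as columns
table = pd.concat([freq, rel_freq], axis=1)
print(table)
```

Because both Series share the same ordered index, `pd.concat` lines the rows up by category, and the relative frequencies sum to 1.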

In [None]:
# Define the name of the variable to be analyzed
variable = df['...']  # Replace ... with the variable name.

# Reorder blood pressure categories in CATEGORY.
# Note that the category names are case sensitive and must match
#      the names in the data set exactly.
CatOrder = ['NORMAL', '...',
            '...', '...',
            '...' ]      # Replace all ... with the remaining 4 blood pressure
                         #   categories, in order of severity.

# Create the frequency table, put the blood pressure categories in the
#      order specified in CatOrder above, and rename the counts
#      column to Frequency.
freq_table = variable.value_counts().reindex(CatOrder)
freq_table = freq_table.rename('Frequency')

# Create the relative frequency table, and rename the counts column to
#      Relative Frequency.
relative_freq_table = freq_table/...   #See Project 0 if you don't remember
relative_freq_table = relative_freq_table.rename('Relative Frequency')

# Combine (concatenate) the two tables.
combined_table=pd.concat([freq_table, relative_freq_table], axis=1)

# Print the new, combined, table
... # What would you replace ... with to print the combined table?


**1b)** In Project 0 we constructed a frequency bar graph. Today we will construct a relative frequency bar graph of the variable CATEGORY.

In [None]:
# The argument in (...) tells the system which table to use to make the
#      bar graph. Since we want a relative frequency bar chart, which
#      table name should replace ... in this code?
dfrf = pd.DataFrame(...) # Replace ... with the name of the rel. freq. table

# Create the bar graph.
fig = px.bar(x=dfrf.index,y=dfrf['Relative Frequency'],
             title='...')                   # Main title
fig.update_layout(xaxis_title="...")        # x-axis title
fig.update_layout(yaxis_title="...")        # y-axis title
fig.show() # Notice that the categories are still in the order specified previously.

**1c)** Notable Fact:

...

**1d)** We want to compare blood pressure categories for the two genders. Why should we use **relative** frequencies instead of frequency counts to construct the side-by-side bar chart?

...

**1e)** Construct a relative frequency distribution table to compare blood pressure categories for each gender.
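
As a small-scale illustration of column-wise relative frequencies, the sketch below builds a two-way table for made-up data in two ways: by dividing the counts by their column totals, and with the `normalize='columns'` option of `pd.crosstab`. The variables SIZE and GROUP and all names in the code are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: 8 people, two qualitative variables
toy = pd.DataFrame({
    'SIZE':  ['S', 'M', 'M', 'S', 'S', 'M', 'M', 'M'],
    'GROUP': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
})

# Manual route: counts with margins, then divide by the column totals
counts = pd.crosstab(toy['SIZE'], toy['GROUP'], margins=True)
manual = counts.div(counts.loc['All'], axis=1)
manual = manual.drop(index='All', columns='All')

# Built-in route: normalize='columns' makes every column sum to 1
built_in = pd.crosstab(toy['SIZE'], toy['GROUP'], normalize='columns')

print(manual.round(3))
print(built_in.round(3))
```

Both routes should print the same numbers. Each column adds up to 1, so groups of different sizes can be compared fairly.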

In [None]:
# Create a two-way table (cross tabulation) of frequency counts of the
#  blood pressure category and gender variables.
# The first variable listed is the row (left) variable (CATEGORY)
# The second variable listed is the column (top) variable (GENDER)
# Replace each ... with the correct variable name from the data set.
crosstab = pd.crosstab(df['...'], df['...'], margins = True)

# Reorder the CATEGORY categories and add the column totals
crosstab = crosstab.reindex(index=CatOrder+['All'])

# print the frequency table. \n adds a line break (return)
print("Frequency:\n\n", crosstab, "\n\n") #Print frequency table. Notice that
                                      #row totals match the table in (1a) above.

# Create a relative frequency two-way table.
# crosstab.loc['All'] is the bottom row of column totals (one per gender).
# crosstab.div(..., axis = 1) matches those totals to the columns and divides
#  each cell by its column total, which is the gender total.
rel_freq_crosstab = crosstab.div(crosstab.loc['All'], axis = 1)

# Round to 3 decimal places
rel_freq_crosstab = rel_freq_crosstab.round(3)

# Keep the blood pressure categories in the order specified previously
#  (this also drops the 'All' row).
rel_freq_crosstab = rel_freq_crosstab.reindex(index=CatOrder)

# Print the table
print("Relative Frequency for each Gender:\n")
... # What would you replace ... with to print the relative frequency table?

**1f)** Construct a side-by-side bar chart to compare blood pressure categories for each gender.
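
Plotly Express expects the data in "long" format, with one row per bar. The sketch below shows what the `reset_index().melt(...)` reshaping step does on a tiny made-up table. The names `wide` and `long_table` and the SIZE and GROUP labels are hypothetical.

```python
import pandas as pd

# Hypothetical wide table: one row per size, one column per group
wide = pd.DataFrame(
    {'A': [0.25, 0.75], 'B': [0.40, 0.60]},
    index=pd.Index(['S', 'M'], name='SIZE'),
)

# reset_index() turns the row labels into an ordinary column;
# melt() then stacks the group columns into (SIZE, GROUP, value) rows
long_table = wide.reset_index().melt(
    id_vars='SIZE',
    var_name='GROUP',
    value_name='Relative Frequency',
)
print(long_table)
```

The 2-by-2 wide table becomes four rows, one for each size and group combination, which is the layout a grouped bar chart needs.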

In [None]:
# Adjusts the format of the table as required by Plotly.Express for graphing
rel_freq_crosstab_long = rel_freq_crosstab.reset_index().melt(
                         id_vars='CATEGORY',
                         var_name='GENDER (1=M)',
                         value_name='Count')

# Filter out the 'All' row and column
rel_freq_crosstab_long = rel_freq_crosstab_long[
    (rel_freq_crosstab_long['CATEGORY'] != 'All') &
    (rel_freq_crosstab_long['GENDER (1=M)'] != 'All')]

# Create a side-by-side bar chart
# barmode ensures gender bars are grouped by blood pressure category
fig = px.bar(rel_freq_crosstab_long,
             x='CATEGORY',     # places categories on horizontal axis
             y='Count',        # places Relative Frequency on vertical axis
             color='GENDER (1=M)',
             barmode='group',
             title='Relative Frequency of Blood Pressure Categories by Gender',
             labels={'Count': 'Relative Frequency',
                     'CATEGORY': 'Blood Pressure Category',
                     'GENDER (1=M)': 'Gender (1 = Male)'})

# Show the bar graph
fig.show()

**1g)** Describe the relationship between blood pressure and gender. Be sure to use words that indicate you are referencing relative frequency, not frequency.

...

# Keep this for your reference for Project 1. There is nothing to turn in after this tutorial.

# As soon as the data sets and Project 1 instructions are posted to Canvas, you can choose a data set and begin working on Project 1.