# **Body Data**

# **Inference for the Population Proportion**


#### This tutorial is designed to help you complete Project 3, in which you will create a confidence interval and conduct a hypothesis test using the PMACC process.


# **Importing Necessary Python Modules**

Python supports a wide range of open-source add-ons called **modules** that extend the basic setup. In the cell below, each module's name follows the `import` statement, and its purpose is noted in a comment after the hash symbol (#).



In [1]:
import pandas as pd                 #Data analysis
import numpy as np                  #Calculations
from IPython.display import Image   #Display images
from scipy.stats import norm        #Confidence Interval

In [2]:
# Replace 'image_url' with the URL of the image you want to display
image_url = 'https://media.istockphoto.com/id/456054995/photo/dna-molecules-and-virtuvian-man.jpg?s=612x612&w=0&k=20&c=v5qZJ5Ty4RwDbyGRx_v-tYd1-LfTZwTi-Aend5Q_sqA='

# Display the image
Image(url=image_url)

# **Context**

The National Center for Health Statistics (NCHS) offers downloadable public-use data files through the Centers for Disease Control and Prevention's (CDC) FTP file server. Users of this service have access to data sets, documentation, and questionnaires from NCHS surveys and data collection systems.

Public-use data files are prepared and disseminated to provide access to the full scope of the data. This allows researchers to manipulate the data in a format appropriate for their analyses. NCHS makes every effort to release data collected through its surveys and data systems in a timely manner.

More information can be found at https://www.cdc.gov/nchs/data_access/ftp_data.htm.


# **About the Dataset**

This dataset contains 301 rows corresponding to a sample of Americans. A total of 16 variables are provided as listed below:

**Variables**

| Column     | Description                                                                 |
|------------|-----------------------------------------------------------------------------|
| AGE        | Age in years|
| GENDER     | Gender: 0=female, 1=male|
| PULSE      | Pulse rate in beats per minute (bpm)|
| SYSTOLIC   | Systolic blood pressure (mm Hg)|
| DIASTOLIC  | Diastolic blood pressure (mm Hg)|
| CATEGORY   | Blood Pressure Category based on the table below from the American Heart Association|
| HDL        | HDL cholesterol (mg/dL)|
| LDL        | LDL cholesterol (mg/dL)|
| WHITE      | White blood cell count (1000 cells/µL) |
| RED        | Red blood cell count (million cells/µL)|
| PLATE      | Platelet count (1000 cells/µL)|
| WEIGHT     | Weight (kg)|
| HEIGHT     | Height (cm)|
| WAIST      | Waist circumference (cm)|
| ARM CIRC   | Arm circumference (cm)|
| BMI        | Body mass index (kg/m²)|

# **Blood Pressure Category Table**

The table below from the American Heart Association classifies blood pressure into five (5) categories based on a combination of the individual's systolic and diastolic blood pressure.

In [3]:
# Replace 'image_url' with the URL of the image you want to display
image_url = 'https://www.heart.org/-/media/Images/Health-Topics/High-Blood-Pressure/Rainbow-Chart/blood-pressure-readings-chart.jpg?h=294&iar=0&mw=440&w=440&sc_lang=en'

# Display the image
Image(url=image_url)


# **A Snippet of the Data**

We can view a snippet of the data by first importing it directly from the url below.


In [4]:
# Assign the web address of the data file to 'url'
url="https://raw.githubusercontent.com/thamilton562/STAT108_Projects_Students/main/Project1/Body%20Data.csv"

# df stands for DataFrame, which is what the raw data is referred to as.
# This code reads in the data file from the address specified in 'url'
df=pd.read_csv(url)

Next, we can display the data by *typing the name* of the DataFrame. To ensure we can see all columns, we'll first call the *pd.set_option* function.

In [5]:
# Set display options to show all columns
pd.set_option('display.max_columns', None)
df

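| | AGE | GENDER (1=M) | PULSE | SYSTOLIC | DIASTOLIC | CATEGORY | HDL | LDL | WHITE | RED | PLATE | WEIGHT | HEIGHT | WAIST | ARM CIRC | BMI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 43 | 0 | 80 | 100 | 70 | NORMAL | 73 | 68 | 8.7 | 4.80 | 319 | 98.6 | 172.0 | 120.4 | 40.7 | 33.3 |
| 1 | 38 | 0 | 94 | 134 | 94 | HYPERTENSION STAGE 2 | 36 | 223 | 6.9 | 4.47 | 297 | 108.2 | 154.4 | 120.3 | 44.3 | 45.4 |
| 2 | 69 | 0 | 58 | 138 | 80 | HYPERTENSION STAGE 1 | 40 | 140 | 8.1 | 4.60 | 286 | 79.2 | 155.7 | 103.5 | 34.2 | 32.7 |
| 3 | 44 | 0 | 66 | 114 | 66 | NORMAL | 45 | 136 | 8.0 | 4.09 | 263 | 64.2 | 157.6 | 89.7 | 32.5 | 25.8 |
| 4 | 72 | 0 | 56 | 110 | 72 | NORMAL | 53 | 102 | 6.9 | 4.15 | 215 | 98.2 | 168.6 | 115.3 | 38.5 | 34.5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 295 | 18 | 1 | 72 | 106 | 44 | NORMAL | 40 | 124 | 4.0 | 5.17 | 221 | 71.6 | 172.8 | 78.1 | 31.0 | 24.0 |
| 296 | 67 | 1 | 62 | 136 | 82 | HYPERTENSION STAGE 1 | 39 | 62 | 7.7 | 3.90 | 305 | 110.2 | 169.1 | 125.5 | 39.0 | 38.5 |
| 297 | 24 | 1 | 94 | 96 | 62 | NORMAL | 43 | 102 | 7.0 | 5.29 | 260 | 56.3 | 162.7 | 78.4 | 27.9 | 21.3 |
| 298 | 53 | 1 | 86 | 132 | 74 | HYPERTENSION STAGE 1 | 42 | 112 | 8.4 | 4.07 | 75 | 102.6 | 181.0 | 117.7 | 36.5 | 31.3 |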


# **INSTRUCTIONS**

* Use Python to analyze the data set and complete each of the following.
* **Replace each ellipsis (...)** with the relevant names or code.  
* For problems that require a written response, **double click** the text box to start typing.

# **Question 1**

## **Confidence Interval**

Construct and interpret the 90% confidence interval for the population proportion of people who have normal blood pressure. Write up the solution using the PMACC procedure.

**1.1) Parameter: Define the parameter, using correct notation.**

...

**1.2) Method: Name the method you will use.**

...

**1.3) Assumptions:**

Complete the code below to find out how many patients have normal blood pressure.

In [6]:
# Count total observations
n = len(df)

# Count total successes
# Replace the 1st ... with the variable name
# Replace the 2nd ... with the name of the blood pressure category to be analyzed.
# Optional: pass a default to get 0 instead of None when the category is absent, i.e. .get('...', 0)
obs_count = df['...'].value_counts().get('...')

print(f"{obs_count} out of {n} patients have normal blood pressure.")
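To see how `value_counts()` and `.get()` work together, here is a minimal sketch on a small made-up DataFrame. The column name `COLOR` and its labels are hypothetical and have nothing to do with the project data; the point is only to show what `.get()` returns with and without a default.

```python
import pandas as pd

# Hypothetical toy data; the column name and labels are illustrative only.
toy = pd.DataFrame({"COLOR": ["RED", "BLUE", "RED", "GREEN", "RED"]})

# value_counts() returns a Series that maps each label to its frequency.
counts = toy["COLOR"].value_counts()

red_count = counts.get("RED")            # label is present, so its count is returned
missing_plain = counts.get("PURPLE")     # label is absent, so None is returned
missing_safe = counts.get("PURPLE", 0)   # label is absent, so the default 0 is returned

print(red_count, missing_plain, missing_safe)   # prints: 3 None 0
```

Supplying a default of 0 keeps later arithmetic, such as dividing by the sample size, from failing on `None`.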


**Show that both assumptions are met.**

...
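The exact wording of the two assumptions comes from your course materials and is not restated in this tutorial. One common textbook pairing is a random sample plus a large-sample condition, for example at least 10 observed successes and at least 10 observed failures. The sketch below checks only the large-sample part, under that assumption; the function name, the counts, and the threshold of 10 are all hypothetical, so substitute the rule your course uses.

```python
def large_sample_ok(successes, n, threshold=10):
    """Return True when both the success and failure counts reach the threshold.

    The default threshold of 10 is an assumption; some textbooks use 5 or 15.
    """
    failures = n - successes
    return successes >= threshold and failures >= threshold

# Hypothetical counts, not taken from the project data.
print(large_sample_ok(120, 300))   # prints: True  (120 successes, 180 failures)
print(large_sample_ok(4, 300))     # prints: False (only 4 successes)
```

The random-sample assumption cannot be checked with code; it has to be justified from how the data were collected.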

**1.4) Calculate: Complete the code below to calculate the sample proportion of people who have normal blood pressure, and the confidence interval.**

In [None]:
# Define the confidence level
# Replace the ... with the stated confidence level,
#   as a decimal (ex: 0.83, not 83%)
CL = ...

# Calculate the values needed: p-hat, the critical value (cv), and the standard error (se).
p_hat = obs_count / n
cv = norm.ppf((1+CL)/2)
se = np.sqrt(p_hat * (1-p_hat) / n)

#Calculate the bounds of the interval
ci_lower = (p_hat - cv * se)
ci_upper = (...) #Replace ... with the formula for the upper bound.

print(f"p-hat = {obs_count}/{n} = {p_hat.round(5)}")
print(f"The {CL*100}% CI is ({ci_lower.round(5)}, {ci_upper.round(5)})")
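As a worked illustration of the same steps, the sketch below wraps them in a small function and runs it on made-up numbers. The function name `proportion_ci`, the counts (120 out of 300), and the 95% level are all hypothetical and are not the values for this project.

```python
import numpy as np
from scipy.stats import norm

def proportion_ci(successes, n, confidence):
    """Large-sample (z) confidence interval for a population proportion."""
    p_hat = successes / n                    # sample proportion
    cv = norm.ppf((1 + confidence) / 2)      # two-sided critical value
    se = np.sqrt(p_hat * (1 - p_hat) / n)    # standard error of p-hat
    return p_hat - cv * se, p_hat + cv * se

# Hypothetical example: 120 successes out of 300 at 95% confidence.
lower, upper = proportion_ci(120, 300, 0.95)
print(f"({lower:.4f}, {upper:.4f})")         # prints: (0.3446, 0.4554)
```

Note that the interval is symmetric around p-hat: the midpoint of the two bounds is 0.4, which is 120/300.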

**1.5) Communicate Results: Interpret the confidence interval calculated in 1.4 above. Round to 3 decimal places.**

...

**1.6) Use the confidence interval and show work to calculate the margin of error. Then, interpret the margin of error.**

**Calculation**

...


**Interpretation**

...
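Because this kind of interval is symmetric around p-hat, the margin of error can be recovered from the two bounds alone: it is half the width of the interval. A minimal sketch, using a hypothetical interval of (0.3446, 0.4554) rather than the one from this project:

```python
# Hypothetical interval bounds, not the project's answer.
lower, upper = 0.3446, 0.4554

margin_of_error = (upper - lower) / 2   # half the width of the interval
point_estimate = (upper + lower) / 2    # the midpoint recovers p-hat

print(round(margin_of_error, 4), round(point_estimate, 4))   # prints: 0.0554 0.4
```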

# **Question 2**

## **Hypothesis Test**

Is there convincing evidence that the true proportion of individuals who have normal blood pressure is different from 50%? Use α=0.10. Write up the solution using the PMACC procedure.

**2.1) Parameter: Define the parameter, using correct notation.**

...

**2.2) Method: Name the method you will use, and write the hypotheses.**

**Name:**

...


**Hypotheses:**

...

**2.3) Assumptions: Show that both assumptions are met.**


**1.**  ...

**2.**  ...

**2.4) Calculate: Complete the code below to calculate the values required.**

In [None]:
#Define P0, the value in H0.
p_0 = ...     #Replace ... with p0, the value in the null hypothesis.

# Calculate the values needed: p-hat and the standard error (se), estimated here from p-hat.
p_hat = obs_count / n
se = np.sqrt(p_hat * (1-p_hat) / n)

#Calculate the z-score of our p-hat, under the assumption H0 is true.
z_score = (p_hat - p_0) / se

#Calculate the p-value for 1- and 2-sided tests
p_value1 = (1 - norm.cdf(abs(z_score)))
p_value2 = 2 * p_value1

print(f"p-hat = {p_hat.round(7)}")
print(f"z-score = {z_score.round(7)}")
print(f"1 sided p-value = {p_value1.round(7)}")
print(f"2 sided p-value = {p_value2.round(7)}")
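The sketch below repeats the same calculation on made-up numbers so you can see each value take shape. The function name `one_proportion_z`, the counts (120 out of 300), and the null value 0.45 are hypothetical. By default the standard error is computed from p-hat, as in the cell above; the optional `use_null_se` flag is an addition for comparison, since many textbooks compute the standard error from p0 instead. Follow whichever convention your course uses.

```python
import numpy as np
from scipy.stats import norm

def one_proportion_z(successes, n, p_0, use_null_se=False):
    """Return the z-score and two-sided p-value for H0: p = p_0."""
    p_hat = successes / n
    # Default mirrors the tutorial cell (se from p-hat); the flag switches to p_0.
    p_for_se = p_0 if use_null_se else p_hat
    se = np.sqrt(p_for_se * (1 - p_for_se) / n)
    z_score = (p_hat - p_0) / se
    p_value = 2 * norm.sf(abs(z_score))      # sf(x) is 1 - cdf(x)
    return z_score, p_value

# Hypothetical example: 120 successes out of 300, testing H0: p = 0.45.
z, p = one_proportion_z(120, 300, 0.45)
z_null, p_null = one_proportion_z(120, 300, 0.45, use_null_se=True)

print(f"se from p-hat: z = {z:.4f}, two-sided p-value = {p:.4f}")
print(f"se from p_0:   z = {z_null:.4f}, two-sided p-value = {p_null:.4f}")
```

The two versions usually give similar results, and they move further apart as p-hat moves away from p0.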

**2.5) Communicate Results: What conclusion is made about the null hypothesis? And what does that mean about the alternate hypothesis?**

...

# **Question 3**

## **Do you make the same conclusion if you use the confidence interval?**

In question 2 you concluded that we either do have or do not have convincing evidence for the alternate hypothesis. Using your confidence interval from question 1, do you reach the same conclusion?

...
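One way to explore this question is to compute both decisions side by side. The sketch below does so for made-up numbers (120 successes out of 300, a null value of 0.45, and α = 0.10, all hypothetical). It assumes the confidence level equals 1 − α and that the interval and the test use the same standard error from p-hat, as in the cells above.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical inputs, not the project's values.
successes, n = 120, 300
p_0, alpha = 0.45, 0.10

p_hat = successes / n
se = np.sqrt(p_hat * (1 - p_hat) / n)

# Decision from the (1 - alpha) confidence interval: is p_0 outside it?
cv = norm.ppf(1 - alpha / 2)
ci_lower, ci_upper = p_hat - cv * se, p_hat + cv * se
p_0_outside_ci = not (ci_lower <= p_0 <= ci_upper)

# Decision from the two-sided test: is the p-value below alpha?
z_score = (p_hat - p_0) / se
p_value = 2 * norm.sf(abs(z_score))
reject_h0 = p_value < alpha

print(p_0_outside_ci, reject_h0)   # prints: True True
```

Under these assumptions the two checks are algebraically equivalent, so they should always agree; if the test used a standard error based on p0 instead, they could disagree in borderline cases.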

<br><br>


# Keep this for your reference for Project 3. You are now ready to complete Project 3 on your own. You may choose to work on the same data set that you used for Project 1 and/or Project 2, or you may choose to analyze a different data set. Pick just one (1) data set to analyze.