# **Body Data**

# **Analyzing The Linear Relationship Between Two Quantitative Variables**


#### This tutorial is designed to help you complete Project 2, in which you will be exploring the linear relationship between two quantitative variables. In this tutorial you will learn how to

  * construct a scatterplot;
  * compute and interpret the correlation coefficient;
  * add the regression line to the scatterplot;
  * calculate the prediction equation;
  * interpret the slope and y-intercept;
  * predict the value of the response variable for a given value of the explanatory variable; and
  * look up a value from the dataset.


# **Importing Necessary Python Modules**

Python supports a variety of open-source add-ons called **modules** that extend the basic setup. In the cell below, each module's name follows the `import` keyword, and its purpose is given in a comment after the hash symbol (#).



In [1]:
import pandas as pd                 #Data analysis
import numpy as np                  #Calculations
import plotly.express as px         #Graphing
#import matplotlib.pyplot as plt     #Graphing
from IPython.display import Image   #Display images
import statsmodels.api as sm        #Linear regression
#import warnings                     #Ignore version warnings
#warnings.simplefilter('ignore', FutureWarning)


In [2]:
# Replace 'image_url' with the URL of the image you want to display
image_url = 'https://media.istockphoto.com/id/456054995/photo/dna-molecules-and-virtuvian-man.jpg?s=612x612&w=0&k=20&c=v5qZJ5Ty4RwDbyGRx_v-tYd1-LfTZwTi-Aend5Q_sqA='

# Display the image
Image(url=image_url)

# **Context**

The National Center for Health Statistics (NCHS) offers downloadable public-use data files through the Centers for Disease Control and Prevention's (CDC) FTP file server. Users of this service have access to data sets, documentation, and questionnaires from NCHS surveys and data collection systems.

Public-use data files are prepared and disseminated to provide access to the full scope of the data. This allows researchers to manipulate the data in a format appropriate for their analyses. NCHS makes every effort to release data collected through its surveys and data systems in a timely manner.

More information can be found at https://www.cdc.gov/nchs/data_access/ftp_data.htm.


# **About the Dataset**

This dataset contains 301 rows corresponding to a sample of Americans. A total of 16 variables are provided as listed below:

**Variables**

| Column     | Description                                                                 |
|------------|-----------------------------------------------------------------------------|
| AGE        | Age in years|
| GENDER     | Gender: 0=female, 1=male (the column is named `GENDER (1=M)` in the data file)|
| PULSE      | Pulse rate in beats per minute (bpm)|
| SYSTOLIC   | Systolic blood pressure (mm Hg)|
| DIASTOLIC  | Diastolic blood pressure (mm Hg)|
| CATEGORY   | Blood Pressure Category based on the table below from the American Heart Association|
| HDL        | HDL cholesterol (mg/dL)|
| LDL        | LDL cholesterol (mg/dL)|
| WHITE      | White blood cell count (1000 cells/µL) |
| RED        | Red blood cell count (million cells/µL)|
| PLATE      | Platelet count (1000 cells/µL)|
| WEIGHT     | Weight (kg)|
| HEIGHT     | Height (cm)|
| WAIST      | Waist circumference (cm)|
| ARM CIRC   | Arm circumference (cm)|
| BMI        | Body mass index (kg/m²)|

# **Blood Pressure Category Table**

The table below from the American Heart Association classifies blood pressure into five (5) categories based on a combination of the individual's systolic and diastolic blood pressure.

In [3]:
# Replace 'image_url' with the URL of the image you want to display
image_url = 'https://www.heart.org/-/media/Images/Health-Topics/High-Blood-Pressure/Rainbow-Chart/blood-pressure-readings-chart.jpg?h=294&iar=0&mw=440&w=440&sc_lang=en'

# Display the image
Image(url=image_url)


# **A Snippet of the Data**

We can view a snippet of the data by first importing it directly from the URL below.


In [4]:
# Assign the web address of the data file to 'url'
url="https://raw.githubusercontent.com/thamilton562/STAT108_Projects_Students/main/Project1/Body%20Data.csv"

# df stands for DataFrame, the pandas object that holds the raw data.
# This line reads the data file from the address stored in 'url'.
df=pd.read_csv(url)

Next, we can display the data by *typing the name* of the DataFrame. To ensure we can see all columns, we'll first call the `pd.set_option` function.

In [5]:
# Set display options to show all columns
pd.set_option('display.max_columns', None)
df

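| | AGE | GENDER (1=M) | PULSE | SYSTOLIC | DIASTOLIC | CATEGORY | HDL | LDL | WHITE | RED | PLATE | WEIGHT | HEIGHT | WAIST | ARM CIRC | BMI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 43 | 0 | 80 | 100 | 70 | NORMAL | 73 | 68 | 8.7 | 4.80 | 319 | 98.6 | 172.0 | 120.4 | 40.7 | 33.3 |
| 1 | 38 | 0 | 94 | 134 | 94 | HYPERTENSION STAGE 2 | 36 | 223 | 6.9 | 4.47 | 297 | 108.2 | 154.4 | 120.3 | 44.3 | 45.4 |
| 2 | 69 | 0 | 58 | 138 | 80 | HYPERTENSION STAGE 1 | 40 | 140 | 8.1 | 4.60 | 286 | 79.2 | 155.7 | 103.5 | 34.2 | 32.7 |
| 3 | 44 | 0 | 66 | 114 | 66 | NORMAL | 45 | 136 | 8.0 | 4.09 | 263 | 64.2 | 157.6 | 89.7 | 32.5 | 25.8 |
| 4 | 72 | 0 | 56 | 110 | 72 | NORMAL | 53 | 102 | 6.9 | 4.15 | 215 | 98.2 | 168.6 | 115.3 | 38.5 | 34.5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 295 | 18 | 1 | 72 | 106 | 44 | NORMAL | 40 | 124 | 4.0 | 5.17 | 221 | 71.6 | 172.8 | 78.1 | 31.0 | 24.0 |
| 296 | 67 | 1 | 62 | 136 | 82 | HYPERTENSION STAGE 1 | 39 | 62 | 7.7 | 3.90 | 305 | 110.2 | 169.1 | 125.5 | 39.0 | 38.5 |
| 297 | 24 | 1 | 94 | 96 | 62 | NORMAL | 43 | 102 | 7.0 | 5.29 | 260 | 56.3 | 162.7 | 78.4 | 27.9 | 21.3 |
| 298 | 53 | 1 | 86 | 132 | 74 | HYPERTENSION STAGE 1 | 42 | 112 | 8.4 | 4.07 | 75 | 102.6 | 181.0 | 117.7 | 36.5 | 31.3 |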
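
Column names in the file do not always match the short names in the variable table (the preview shows the gender column as `GENDER (1=M)`), so it can help to list the exact names before writing any plotting or filtering code. The following is a minimal sketch, not part of the original notebook: `demo` is a hypothetical three-row stand-in for `df`, built from three rows of the preview.

```python
import pandas as pd

# Hypothetical stand-in for df: three rows and three columns copied from the preview.
demo = pd.DataFrame({
    'GENDER (1=M)': [0, 1, 1],
    'WAIST': [89.7, 78.1, 117.7],
    'BMI': [25.8, 24.0, 31.3],
})

column_names = demo.columns.tolist()   # exact spellings to use inside df['...']
n_rows, n_cols = demo.shape            # (number of rows, number of columns)

print(column_names)
print(f"{n_rows} rows, {n_cols} columns")
```

Running the same two lines on `df` itself (`df.columns.tolist()` and `df.shape`) gives the names and dimensions of the full data set.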


# **INSTRUCTIONS**

* Use Python to analyze the data set and complete each of the following.
* **Replace each ellipsis (...)** with the relevant names or code.  
* For problems that require a written response, **double click** the text box to start typing.

# **Determine which variable is explanatory and which is response.**

We want to predict a person's waist circumference when given their BMI. Which variable is the explanatory and which is the response variable?




Explanatory: ...

Response: ...

What do you think the direction (positive/negative) and strength (weak/moderate/strong) will be?

# ...

# **Question 1**

## **Scatterplot**

**1.1)** Construct a scatterplot that could be used to show the relationship described above.

In [6]:
# Create a scatter plot using Plotly Express
scatter_plot = px.scatter(df,
                 x='...', #explanatory variable name
                 y='...', #response variable name
                 labels={'...': 'Body Mass Index', #updates axis labels
                         '...': 'Waist Circumference'},
                 title='Scatterplot of BMI vs Waist Circumference')

# Updating layout: Students do not change anything in this block of code.
scatter_plot.update_layout(
    plot_bgcolor='rgba(255,255,255,1)',        # Sets the background color of the plot area to white with full opacity.
    xaxis=dict(
        showline=True,                         # Displays a line on the x-axis.
        showgrid=False,                        # Hides the grid lines on the x-axis.
        linecolor='black'                      # Sets the color of the x-axis line to black.
    ),
    yaxis=dict(
        showline=True,                         # Displays a line on the y-axis.
        showgrid=False,                        # Hides the grid lines on the y-axis.
        linecolor='black'                      # Sets the color of the y-axis line to black.
    ),
    title={
        'text': 'Scatterplot of BMI vs Waist Circumference',    # Sets the title text.
        'y': 0.9,                              # Positions the title 90% of the way up the plot.
        'x': 0.5,                              # Centers the title horizontally.
        'xanchor': 'center',                   # Anchors the title at its center on the x-axis.
        'yanchor': 'top',                      # Anchors the title at the top on the y-axis.
        'font': dict(
            size=16                            # Sets the title font size to 16 (smaller than default).
        ),
    }
)

# Show the plot
scatter_plot.show()

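If this cell is run before the `...` placeholders are replaced, Plotly raises the error below, which also lists the exact column names available in the DataFrame:

ValueError: Value of 'x' is not the name of a column in 'data_frame'. Expected one of ['AGE', 'GENDER (1=M)', 'PULSE', 'SYSTOLIC', 'DIASTOLIC', 'CATEGORY', 'HDL', 'LDL', 'WHITE', 'RED', 'PLATE', 'WEIGHT', 'HEIGHT', 'WAIST', 'ARM CIRC', 'BMI'] but received: ...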

**1.2)** Describe the relationship between BMI and waist circumference. Remember to include context and identify the direction, strength, and form.

# ...

Is this what you expected?

# ...

# **Question 2**
## Correlation Coefficient

**2.1)** Calculate the value of the correlation coefficient.

In [None]:
# Calculate the correlation coefficient between WAIST and BMI
# Replace the first set of ... with the response variable name
# Replace the second set of ... with the explanatory variable name
correlation = df['...'].corr(df['...'])

# Print the correlation coefficient
print(f"Correlation coefficient between WAIST and BMI: r = {correlation}")
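
To see what `.corr` is computing, the Pearson correlation coefficient (the default for pandas `Series.corr`) can be worked out by hand: divide the sum of cross-products of the deviations from the means by the square root of the product of the two sums of squared deviations. The sketch below uses only the standard library. The helper name `pearson_r` and the five made-up data points are illustrative assumptions, not part of the project data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-products and sums of squares of the deviations from the means
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Made-up illustrative data, not taken from the Body Data file
x_demo = [1, 2, 3, 4, 5]
y_demo = [2, 4, 5, 4, 5]

r_demo = pearson_r(x_demo, y_demo)
print(f"r = {r_demo:.4f}")
```

Because the formula treats the two variables symmetrically, swapping the explanatory and response variables does not change the value of r.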

**2.2)** Interpret the correlation coefficient.

# ...

# **Question 3**

## Least Squares Regression

**3.1)** Calculate the linear model and add the regression line to the scatterplot.

In [None]:
# Calculate the LSRL and add the regression line to the scatterplot.

# Extract the variables from the dataframe
X = df['...'] #explanatory variable
Y = df['...'] #response variable

# Add a constant (a column of ones) to the explanatory variable.
# Without it, statsmodels would fit a line with no y-intercept.
X_with_const = sm.add_constant(X)

# Fit the linear regression model (calculate slope and y-intercept)
model = sm.OLS(Y, X_with_const).fit()

# Recreate the scatterplot before adding the line; otherwise, each rerun of
# this cell would add another regression line to the existing figure.
scatter_plot = px.scatter(df,
                 x='BMI', #explanatory variable name
                 y='WAIST', #response variable name
                 labels={'BMI': 'Body Mass Index', #updates axis labels
                         'WAIST': 'Waist Circumference'},
                 title='Scatterplot of BMI vs Waist Circumference')

# Updating layout: Students do not change anything in this block of code.
scatter_plot.update_layout(
    plot_bgcolor='rgba(255,255,255,1)',        # Sets the background color of the plot area to white with full opacity.
    xaxis=dict(
        showline=True,                         # Displays a line on the x-axis.
        showgrid=False,                        # Hides the grid lines on the x-axis.
        linecolor='black'                      # Sets the color of the x-axis line to black.
    ),
    yaxis=dict(
        showline=True,                         # Displays a line on the y-axis.
        showgrid=False,                        # Hides the grid lines on the y-axis.
        linecolor='black'                      # Sets the color of the y-axis line to black.
    ),
    title={
        'text': 'Scatterplot of BMI vs Waist Circumference',    # Sets the title text.
        'y': 0.9,                              # Positions the title 90% of the way up the plot.
        'x': 0.5,                              # Centers the title horizontally.
        'xanchor': 'center',                   # Anchors the title at its center on the x-axis.
        'yanchor': 'top',                      # Anchors the title at the top on the y-axis.
        'font': dict(
            size=16                            # Sets the title font size to 16 (smaller than default).
        ),
    }
)

# Add the regression line to the scatterplot
scatter_plot.add_scatter(x=X, y=model.predict(X_with_const),
                         mode='lines', name='Regression Line')

# Show the plot
scatter_plot.show()
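
With a single explanatory variable, the least-squares slope and intercept that `sm.OLS(...).fit()` estimates have a closed form: the slope is the sum of cross-products of deviations divided by the sum of squared deviations of x, and the intercept is chosen so the line passes through the point of means. The following is a hand-computed sketch under those formulas. The function name `least_squares_line` and the made-up data are assumptions for illustration only.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least-squares line for one explanatory variable."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx                  # change in predicted y per one-unit increase in x
    intercept = y_bar - slope * x_bar    # forces the line through (x_bar, y_bar)
    return intercept, slope

# Made-up illustrative data, not taken from the Body Data file
x_demo = [1, 2, 3, 4, 5]
y_demo = [2, 4, 5, 4, 5]

b0, b1 = least_squares_line(x_demo, y_demo)
print(f"y-hat = {b0:.2f} + {b1:.2f} x")
```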

**3.2)** Print the coefficients (Intercept and Slope)



In [None]:
# Print the coefficients (Intercept and Slope)

# The coefficients were calculated when the regression line was added to the
# scatterplot and stored in 'model'. This assigns the values of the y-intercept
# and slope to 'coefficients', then prints the values.
coefficients = model.params
coefficients.name='Coefficients'
coefficients
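
`model.params` is a pandas Series, so each coefficient can also be pulled out by its label. When `sm.add_constant` is given a pandas object, the intercept is labeled `const`, and the slope carries the name of the explanatory column. A minimal sketch follows, using a hand-built Series as a stand-in for `model.params`. The numbers are placeholders, not the fitted values for this data set.

```python
import pandas as pd

# Stand-in for model.params; the values are placeholders, not real results.
coefficients = pd.Series({'const': 2.2, 'BMI': 0.6}, name='Coefficients')

intercept = coefficients['const']   # y-intercept of the fitted line
slope = coefficients['BMI']         # slope for the explanatory variable

print(f"y-hat = {intercept:.3f} + {slope:.3f} * BMI")
```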

**3.3)** Write the equation of the line that predicts waist circumference from BMI. In place of $\hat{y}$, you can type “y-hat” or “predicted y”.




# ...

# **Question 4**
### Slope
Interpret the slope

# ...

# **Question 5**
### Y-intercept
Interpret the y-intercept

# ...

# **Question 6**
### Prediction
Show work to predict the waist circumference of someone with a BMI of 40.8 kg/m².

# ...
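
The substitution asked for in Question 6 amounts to plugging the BMI value into the prediction equation. The sketch below shows the arithmetic only: the function `predict`, both coefficients, and the BMI value are placeholders chosen for illustration, so substitute the intercept and slope from your own fitted model.

```python
def predict(intercept, slope, x):
    """Predicted response: y-hat = intercept + slope * x."""
    return intercept + slope * x

# Placeholder values, not the fitted coefficients for the Body Data file
b0_placeholder = 10.0
b1_placeholder = 2.0
bmi_example = 30.0

y_hat = predict(b0_placeholder, b1_placeholder, bmi_example)
print(f"Predicted waist circumference: {y_hat:.1f} cm")
```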

# **Question 7**
### Residual
**7.1)** Look up the actual waist circumference of the person with a BMI of 40.8 kg/m².

In [None]:
# Filter the dataframe to find the row where BMI is 40.8
# Replace the first ... with the explanatory variable name
# Replace the second ... with the BMI value to match (a number, with no quotation marks)
# Replace the third ... with the response variable name
waist_circ = df[df['...'] == ...]['...']

# Extract the value of the waist circumference
waist = waist_circ.values[0]

# Print the result
print(f"The waist circumference for the person with a BMI of 40.8 is: {waist} cm")
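
The lookup above works by building a Boolean mask (`df['...'] == ...`) and keeping only the rows where it is true. Note that `.values[0]` takes the first match and raises an `IndexError` if no row matches. The sketch below is one way to make both cases explicit. It assumes a hypothetical DataFrame `demo` built from the first five rows of the preview and uses a BMI value from those rows rather than the one in the question.

```python
import pandas as pd

# Hypothetical stand-in for df: BMI and WAIST from the first five preview rows.
demo = pd.DataFrame({
    'BMI':   [33.3, 45.4, 32.7, 25.8, 34.5],
    'WAIST': [120.4, 120.3, 103.5, 89.7, 115.3],
})

target_bmi = 32.7
# Boolean mask selects the matching rows; 'WAIST' selects the column to return.
matches = demo.loc[demo['BMI'] == target_bmi, 'WAIST']

if matches.empty:
    print(f"No row has a BMI of {target_bmi}")
else:
    waist = matches.iloc[0]   # first match; there may be more than one
    print(f"{len(matches)} match(es); first waist circumference = {waist} cm")
```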

**7.2)** Compare your answer in question 6 to the actual value.


# ...
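
One common way to make the comparison in 7.2 concrete is the residual, defined as the observed value minus the predicted value. A positive residual means the line under-predicted, and a negative one means it over-predicted. The sketch below uses placeholder numbers only. The function name `residual` and both values are illustrative assumptions, not results from the data.

```python
def residual(observed, predicted):
    """Residual = observed value minus predicted value."""
    return observed - predicted

observed_waist = 75.0    # placeholder for the value looked up in 7.1
predicted_waist = 70.0   # placeholder for the prediction from Question 6

res = residual(observed_waist, predicted_waist)
if res > 0:
    direction = "under-predicted"
elif res < 0:
    direction = "over-predicted"
else:
    direction = "matched"

print(f"Residual = {res:.1f} cm; the model {direction} the observed value")
```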

<br><br>


# Keep this for your reference for Project 2. You are now ready to complete Project 2 on your own. You may work with the same data set you used for Project 1, or you may analyze a different one. Pick just one (1) data set to analyze.