<a href="https://colab.research.google.com/github/shawnwiggins/OLI_probability_and_statistics/blob/master/Lab_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


<center>
<br>
<img src="https://www.ccsf.edu/sites/default/files/inline-images/CCSF%20LOGO.png" width = 30%/>
<h1><strong>PROBABILITY AND STATISTICS</strong></h1>
<h2>LAB 2</h2>
<br><br>
</center>

**Caution! Make sure that you are signed in to your Google account and that you have saved a copy of this notebook to your Drive. If you do not do this, then your work will not be saved.** 

In this lab, you will use the R ecosystem called [Tidyverse](https://www.tidyverse.org/) to work with actual NHANES government data from the CDC. You will:

* Graphically and numerically summarize a categorical variable.
* Graphically and numerically summarize a quantitative variable.
* Remove outliers from a dataset using the 1.5IQR method.

You should have attended the live meeting for the week or viewed the recording of that meeting before working on this assignment. The meeting contained a demonstration of tasks similar to what you are asked to do here. 

The assigned DataCamp assignments explain more about how Tidyverse works. You are also encouraged to work together and get support in order to complete this assignment. Use the Pronto messenger in Canvas to ask the class for help.


## Task 0: Set Up the Lab

Every lab assignment will need to be set up before you can begin and will need to be set up again if you ever have to quit working and come back.

Run the following Code cell by pressing the "play" style button and wait for a SUCCESS statement. If you have any errors with the setup, make sure to contact the instructor for help.

In [None]:
#@title **Set Up the Lab**

print('The lab is being set up...')
library(tidyverse)
library(haven)
DEMO_J <- read_xpt(file="https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/DEMO_J.XPT")
VIC_J <- read_xpt(file="https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/VIC_J.XPT")

print('SUCCESS: The lab is set up and ready to go!')

## National Health and Nutrition Examination Survey (NHANES)

For several of the labs in this course, you will be exploring the [most recently published (2017-2018) results](https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2017) from the [National Health and Nutrition Examination Survey](https://www.cdc.gov/nchs/nhanes/index.htm) (NHANES). Watch the [video overview of NHANES](https://youtu.be/75Ur89rMsSA) for more insight into how the data is collected using mobile units across the country.




## NHANES 2017-2018 Demographics and Laboratory Data

This lab uses demographics and laboratory data from NHANES.

* The demographics data was downloaded from the NHANES website and saved in the data frame called `DEMO_J`.
* The laboratory data for Vitamin C was downloaded from the NHANES website and saved in the data frame called `VIC_J`. 

This data contains many encoded variables. Review the [2017-2018 Demographics document file](https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/DEMO_J.htm) and the [2017-2018 Vitamin C document file](https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/VIC_J.htm) to understand what these cryptic variable names mean.

### Missing Information

In the last lab, you may have noticed a lot of `NA` values in the dataset. These are missing values! Although this is a very detailed study, there are still missing data values. Those who are familiar with surveys in the medical world know that there are often many blank answers to questions. There are many reasons why missing values exist in a dataset. Some of those reasons are unavoidable and some are avoidable.

We need to decide what to do with the missing data. There are many techniques for dealing with missing values, but the simplest is to remove the individuals that have missing values. You will see this done with the `drop_na()` function.
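
As a rough illustration of what `drop_na()` does, here is a minimal sketch on a made-up data frame. The names `toy_survey`, `toy_scores`, `id`, and `score` are hypothetical and are not part of the NHANES data.

```r
library(tidyverse)

# Hypothetical toy data: two of the five respondents have a missing score.
toy_survey <- tibble(
  id    = 1:5,
  score = c(3, NA, 5, NA, 1)
)

# select() keeps only the column of interest, then drop_na() removes
# every remaining row that contains an NA.
toy_scores <- toy_survey %>%
  select(score) %>%
  drop_na()

# glimpse() prints the number of rows, the column types, and the first values.
glimpse(toy_scores)
```

Only the three rows with a recorded score remain in `toy_scores`. Keep in mind that dropping rows this way shrinks the dataset, so it is worth checking how many individuals are left afterwards.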

## Lab Tasks

### Categorical Variable Summaries

Using the following tasks, summarize the distribution of the categorical variable that represents the education level of adults in the study. An adult in this study is someone who is 20 years old or older. These tasks will use the `DEMO_J` demographics data.

#### Task 1: Select, Cleanup, Save

From the `DEMO_J` dataset:

* Select the categorical variable that represents adult educational levels using `select()`.
* Drop the individuals with missing values (`NA`) from the data set using `drop_na()`.
* Assign the data to a variable named `adult_education_levels` using `<-`.
* Use `glimpse()` to check out `adult_education_levels`.

You should notice that the values for this variable are encoded. You will need to look up what the values mean in the [2017-2018 Demographics document file](https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/DEMO_J.htm).




In [None]:
# Save the cleaned up adult education level data to a new variable.


#### Task 2: Numerical Summaries

Working from `adult_education_levels`, instead of `DEMO_J`:

* Using `group_by()`, `summarize()`, and `mutate()`, create a relative frequency table with a column showing the frequency for the above variable and a column showing the relative frequency for the above variable.
* Using `<-`, assign this table to the variable name `adult_education_levels_frequency_table`.
* Type `adult_education_levels_frequency_table` to see the table that you created.
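
The `group_by()`, `summarize()`, and `mutate()` pattern can be sketched on made-up data as follows. The names `toy_pets`, `pet`, and `toy_pet_table` are hypothetical and only serve as an illustration.

```r
library(tidyverse)

# Hypothetical toy data: one row per household, recording the type of pet.
toy_pets <- tibble(pet = c("cat", "dog", "dog", "fish", "dog", "cat"))

toy_pet_table <- toy_pets %>%
  group_by(pet) %>%                 # one group per category
  summarize(frequency = n()) %>%    # count the rows in each group
  mutate(relative_frequency = frequency / sum(frequency))  # share of the total

# Typing the name of the table prints it.
toy_pet_table
```

After `summarize()` the table has one row per category, so `sum(frequency)` inside `mutate()` is the total number of individuals and the relative frequencies add up to 1.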

In [None]:
# Create a relative frequency table for a categorical variable.


#### Task 3: Graphical Summaries

The table that you created is very informative, but a good visual goes a long way toward summarizing data for most of us.

* Using `ggplot()` with the layer `geom_col()` create a bar chart for the above variable. 
* Additionally, use `xlab()`, `ylab()`, and `ggtitle()` to add meaningful labels to the graphic.
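
A minimal sketch of a labeled bar chart built with `geom_col()`, using a small made-up frequency table. The names `toy_pet_table`, `pet`, `frequency`, and `toy_bar_chart`, as well as the label text, are hypothetical.

```r
library(tidyverse)

# Hypothetical frequency table with one row per category.
toy_pet_table <- tibble(
  pet       = c("cat", "dog", "fish"),
  frequency = c(2, 3, 1)
)

# geom_col() draws one bar per row and uses the y aesthetic as the bar height.
toy_bar_chart <- ggplot(toy_pet_table, aes(x = pet, y = frequency)) +
  geom_col() +
  xlab("Type of pet") +
  ylab("Number of households") +
  ggtitle("Pets in a made-up survey")

toy_bar_chart
```

If the categories are stored as numeric codes rather than text, wrapping the variable in `factor()` inside `aes()` is one way to have ggplot treat each code as a separate category.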

In [None]:
# Create a bar chart summary for the distribution of this variable


#### Task 4: Report your Findings

Provide a short paragraph summary of the distribution of the categorical variable. Make sure your summary addresses the following items:
* Specifically, identify the most frequent value(s) for the variable.
* Specifically, identify the least frequent value(s) for the variable.
* Point out anything you find interesting about the data in relation to how you view adults (20+) living in the United States.

Again, the variable's values are encoded, so you will need to look up what the values mean in the [2017-2018 Demographics document file](https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/DEMO_J.htm).

*Replace this text with your response to this task.*


### Quantitative Variable Summaries

Now, use the following tasks to summarize the distribution of the quantitative variable that represents the concentration levels of vitamin C (measured in milligrams per deciliter) for the individuals in the study. The following tasks will use the `VIC_J` vitamin C data.

#### Task 5: Select, Cleanup, Save

From the `VIC_J` dataset:

* Select the quantitative variable that represents the concentration of vitamin C (measured in mg/dL) in the individual using `select()`,
* Drop the individuals in the data set with missing values using `drop_na()`, and
* Assign the data to a variable named `vitamin_C` using `<-`.

In [None]:
# Save the cleaned up vitamin C concentration data to a new variable.


#### Task 6: Numerical Summaries

Using `summarize()`, create a table that shows the mean, standard deviation, median, IQR, and range for the vitamin C concentration.
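
As a sketch of how several statistics can be computed in one `summarize()` call, consider the made-up data below. The names `toy_heights`, `height_cm`, and the output column names are hypothetical, and the range is computed here as maximum minus minimum, which is an assumption about how the range is defined in this course.

```r
library(tidyverse)

# Hypothetical toy data: five heights in centimeters.
toy_heights <- tibble(height_cm = c(150, 160, 165, 170, 180))

# Each argument of summarize() becomes one column of a one-row table.
toy_height_summary <- toy_heights %>%
  summarize(
    mean_height   = mean(height_cm),
    sd_height     = sd(height_cm),
    median_height = median(height_cm),
    iqr_height    = IQR(height_cm),
    range_height  = max(height_cm) - min(height_cm)
  )

toy_height_summary
```

Note that R's `IQR()` uses its default quantile method, so on some datasets the result may differ slightly from quartiles found by hand.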

In [None]:
# Numerically summarize the distribution of the vitamin C variable.


#### Graphical Summaries

It can feel like there is a never-ending number of numerical summaries that can be used to summarize the distribution of a quantitative variable. Move on to create two common graphics for this type of data.

##### Task 7: Histogram

* Using `ggplot()` with the layer `geom_histogram()`, create a histogram for the distribution of the above variable. Use `bins = 30` for the number of bins in the histogram. 
* Additionally, use `xlab()` and `ggtitle()` to add meaningful labels to the graphic.
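
A minimal sketch of a histogram with 30 bins, drawn from simulated data. The names `toy_wait_times`, `minutes`, and `toy_histogram`, along with the label text, are hypothetical; the simulated values are only there to give the plot something to show.

```r
library(tidyverse)

# Hypothetical toy data: 200 simulated wait times, in minutes.
set.seed(1)
toy_wait_times <- tibble(minutes = rexp(200, rate = 1 / 10))

# A histogram needs only an x aesthetic; bins = 30 sets the number of bars.
toy_histogram <- ggplot(toy_wait_times, aes(x = minutes)) +
  geom_histogram(bins = 30) +
  xlab("Wait time (minutes)") +
  ggtitle("Made-up wait times")

toy_histogram
```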

In [None]:
# Create a histogram for the distribution of the vitamin C data


##### Task 8: Boxplot

* Using `ggplot()` with the layer `geom_boxplot()`, create a boxplot for the distribution of the above variable.
* Additionally, use `xlab()` and `ggtitle()` to add meaningful labels to the graphic.
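
A boxplot follows the same pattern as other ggplot graphics, sketched here on a small made-up dataset. The names `toy_values`, `value`, and `toy_boxplot`, along with the label text, are hypothetical.

```r
library(tidyverse)

# Hypothetical toy data: ten values, one of them far from the rest.
toy_values <- tibble(value = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30))

# Mapping the variable to x draws a horizontal boxplot.
toy_boxplot <- ggplot(toy_values, aes(x = value)) +
  geom_boxplot() +
  xlab("Value") +
  ggtitle("Made-up values")

toy_boxplot
```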

In [None]:
# Create a boxplot for the distribution of the vitamin C data


#### Task 9: Vitamin C Overdose?!

Both of those graphics are pretty challenging to read because it seems like there are some outliers!

* What is it about the boxplot (specifically) that indicates that there are outliers according to the 1.5 IQR method outlined in the OLI workbook?
* What is your opinion about why these people have much more vitamin C in their bodies compared to the others? (Is there a logical reason, do you think it is a mistake by the CDC lab technicians, should the individuals be concerned, etc.?)

*Replace this text with your response to this task.*


#### Task 10: Remove the Outliers

The OLI workbook says that any data values that are smaller than the value $Q_1 - 1.5 \times IQR$ or larger than the value $Q_3 + 1.5 \times IQR$ are outliers. This means that any data values between those two specifically calculated numbers are not outliers.

* Using `filter()`, remove the outliers from `vitamin_C`.
* Using `<-`, save the result as `vitamin_C_no_outliers`.
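
The two cutoffs can be computed directly inside `filter()`. The following is a minimal sketch on made-up data; the names `toy_values`, `value`, and `toy_no_outliers` are hypothetical, and the quartiles come from R's default `quantile()` method, which may differ slightly from quartiles found by hand.

```r
library(tidyverse)

# Hypothetical toy data: ten values, one of them far from the rest.
toy_values <- tibble(value = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30))

# filter() keeps the rows where every condition is TRUE, so this keeps
# only the values between Q1 - 1.5*IQR and Q3 + 1.5*IQR.
toy_no_outliers <- toy_values %>%
  filter(
    value >= quantile(value, 0.25) - 1.5 * IQR(value),
    value <= quantile(value, 0.75) + 1.5 * IQR(value)
  )

toy_no_outliers
```

For this toy data, $Q_1 = 3.25$, $Q_3 = 7.75$, and $IQR = 4.5$, so the cutoffs are $-3.5$ and $14.5$. The value 30 falls above the upper cutoff and is removed, leaving nine rows.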

In [None]:
# Remove the outliers according to the 1.5IQR method


#### Task 11: Numerical Summaries ... Again

Recalculate the numerical summaries from Task 6 for the vitamin C data now that the outliers have been removed.

In [None]:
# Numerically summarize the non-extreme vitamin C data


#### Task 12: Graphical Summary Again

Create a histogram to summarize the distribution of the vitamin C data now that the outliers have been removed. You can use 30 bins again for the histogram.

In [None]:
# Create a histogram to summarize the distribution of vitamin_C_no_outliers


#### Task 13: Mean vs. Median

Using the shape of the histogram in your reasoning, why is the mean value less than the median value?

*Replace this text with your response to this task.*


#### Task 14: Normal Data

Often in the scientific community, the word "normal" refers to when the histogram summary for the distribution of values is symmetric and unimodal. (*It is more specific than that, but let's avoid those details for now.*)

You would think that the amount of vitamin C in a person's body is normally distributed, but that does not seem to be the case when looking at the histogram.

* What visual evidence from the shape of the histogram suggests that this distribution is not normally distributed?
* Run the following cell to see the proportion of the data values that are more than 2 standard deviations below the mean for this data set.


In [None]:
#@title Run this calculation
proportion_more_than_2sd_below_mean <- vitamin_C_no_outliers %>%
  filter(LBXVIC < mean(LBXVIC) - 2*sd(LBXVIC)) %>%
  count()/count(vitamin_C_no_outliers)

cat('The proportion is ... ', proportion_more_than_2sd_below_mean[[1]], '😔')


* Using the Standard Deviation Rule for normal data from the OLI workbook, what proportion of data values "should" be more than 2 standard deviations below the mean?
* Provide your opinion as to why you think the vitamin C data is not more normal in shape. Again, normal refers to the unimodal, symmetric shape.

*Replace this text with your response to this task.*


# Submit the Lab

1. Double-check that you completed all the Lab Tasks.
2. Submit the Share link to this notebook by doing the following:
  1. Click the **Share** button at the top of this page.
  2. Click **Change** within the Get Link section of the pop-up.
  3. Change **Viewer** to **Commenter** within the Get Link section of the pop-up. 
  4. Click **Copy link** within the Get Link section of the pop-up.

