03-Lab-4.Rmd

# (PART\*) Lab 4 - BIOL 205 {-}

*Last updated `r Sys.Date()`*

# Scripting in R {-}

At this stage, you have your draft protocol submitted which includes an articulation of the statistical analyses you plan to use. In this lab, we'll start preparing for running those analyses. To do this, we'll be moving away from using an R Shiny App and instead use a few `R` scripts.

# Using the RScripts {-}

`R` scripts are text files containing the commands (aka code) and comments used to perform computations. For BIOL 205, we've written the majority of the code for you. In BIOL 202, you'll be doing much more of this from scratch. Even though much of the code has been written for you, we encourage you to read through the scripts in detail and see if you can figure out with the help of the comments exactly what's going on! 

There is a separate `R` script for each type of variable combination you might have for your research project. When it comes time for your final analysis, download the `R` script that corresponds to the types of variables you have in your experiment. In the mean time, download all or any that you're interested in looking through. 

* Both response and explanatory variables are categorical 
    * [BIOL205_Script_Categorical-Categorical.R](https://osf.io/download/sg7y6)
* Both response and explanatory variables are quantitative 
    * [BIOL205_Script_Quantitative-Quantitative.R](https://osf.io/download/adwg4)
* Categorical explanatory variable and quantitative response variable 
    * [BIOL205_Script_Quantitative-Categorical.R](https://osf.io/download/2hsk5)

<div class = "note">
While there are instructions within the `R` scripts themselves, the following sections describe how to use the `R` scripts in more detail. Be sure to read **everything here and in the `R` scripts!**

Alongside each line of code in the script is a comment proceeded with a hash tag (#). The comments are either descriptions of what the code is doing or instructions describing things you need to do.
</div>

<div class = "takeAway">
You must correctly set up your project and working directory using the instructions below for this to work. You must also not simply open your `R` script, but instead launch your RProject file and then from within RStudio load the script from the built in file manager.
</div>

The general workflow here is:

1. Launch RStudio using your RPoject file
2. Load your data
3. Provide the script with your variable names and the labels you'd like to use on your graphs
4. Generate a graph; the script will save these to your `figures/` directory
5. Run your descriptive stats; the script will save these to your `report/` directory where you can copy the relevant information to your RMarkdown file
6. Run you statistical analyses; the script will save these to your `report/` directory where you can copy the relevant information to your RMarkdown file

If you quite RStudio, you will need to re-run steps 1-3 to do any of steps 4-6 again.

### Step 1: Set your working directory {-}

If you haven't already, make sure your working directory is set according to the instructions [here](https://ubco-biology.github.io/Procedures-and-Guidelines/set-a-working-directory-in-rstudio.html){target="_blank"} in the Procedures and Guidelines Document.

Your directory should look something like this when you're done, that is, after downloading at least one script, ensuring your data is in the project's data folder, you've created an RMarkdown file, you've generated an RPoject file and put everything in its respective directory. Don't worry if you don't have all these things ready yet, just make sure that when you're ready to run your scripts with your data, this is the model you're working with.

```
├── BIOL205_RP/
│   ├── BIOL205_report.RProj
│   ├── data/
│   │   ├── 20221023_sample-data.csv
│   ├── figures/
│   ├── report/
│   │   ├── 20220101_Lab03_205_Assignment_V1.Rmd
│   ├── scripts/
│   │   ├── BIOL205_Script_Quantitative-Categorical.R
```

### Step 2: Installing & Loading Required Packages {-}

We'll be using a couple of features that are not part of the basic install of `R` in our scripts. We add additional features in `R` with packages.

Packages in `R` contain a set of functions, code, and data that you can use for your analysis. Before you can use the functions within a package, the package must be both installed in `R`, which we do once, and loaded into `R`, which we do each time we start the application.

<div class = "note">
While packages only need to be installed once, if you\'re on a lab computer, since the computers are re-set at the end of each day you may need to re-install packages at each log in.
</div>

To install packages noted in the `R` script, copy the installation lines of code without the preceding hashtag (#) ie. remove the `#` from

```
# install.packages("ggplot2")
```

so it looks like this

```
install.packages("ggplot2")
```

and copy it into the `R` console (lower left pane of RStudio). When you hit enter on your keyboard the package will install. Be sure to do this for all of the required packages noted in the `R` script; these are in the first few lines of each script. 

Once packages are installed, they must be loaded to your current session. You will have to do this each time you re-start `R.` To load packages use the `library()` function by running the lines of code pre-written into your `R` script. For example, run `library(ggplot2)` by placing your cursor in that line of code and clicking the 'Run' button at the top right of the working document. To load all packages at the same time, use your cursor to highlight all of the sections of code you want to run (lines 35-38 in the screenshot below), and then click 'run'.

![](images/run_button.png)

<div class = "note">
Running code can be done several ways in RStudio. You can go through line by line, putting you cursor in the line you wish to run, and executing each independently of the next. Or you can run a section of code by highlighting the 'chunk' you want to run.

The scripts are broken into 'steps'. It's probably best to highlight whole sections of each step and then running that chunk. But if you'd like to run line by line to get a better sense of what's happening, go for it!

To quickly run a line of code, place your cursor on the line and use `ctrl + Enter` or `Command + Enter` if you're on a Mac.
</div>

### Step 3: Uploading your data {-}

Replace the content within the quotes that read 'insert-data-file-name-here.csv' with the file name that your data is saved as to your computer. This will be in between lines 45 and 50 depending on the script you're using.

If my data file was called `20221023_sample-data.csv`, I'd change the following line

```
my_file <- paste0(dir,"insert-data-file-name-here.csv") # assign file name to a variable
```

to

```
my_file <- paste0(dir,"20221023_sample-data.csv") # assign file name to a variable
```

If your data file contains a header row, be sure that the following line, which reads

```
my_data <- read.csv(file = my_file, header = TRUE)
```

shows `header = TRUE`. If you have no header row, simply change the code to `header = FALSE`. 
    
### Step 4: Visualizing your data {-}

You\'ll need to tell `R` some information before you can create graphs or do analyses. Specifically, you need to assign the names of your variables and axis labels. You can do this by 

* Replacing "X variable name" and "Y variable name" with the names of your x and y variables
* Replacing "X label name" and "Y label name" with your desired x and y axes labels

So, we update the following lines

```
x_var <- "X variable name" # Replace with the name of your x variable
x_label <- "X label name" # Replace with your desired x axis label
y_var <- "Y variable name" # Replace with the name of your x variable
y_label <- "Y label name" # Replace with your desired y axis label
```

with something like

```
x_var <- "ht" # Replace with the name of your x variable
x_label <- "Height (in cm)" # Replace with your desired x axis label
y_var <- "day" # Replace with the name of your x variable
y_label <- "Day" # Replace with your desired y axis label
```

<div class = "note">
`R` is case sensitive. When providing your variable names, make sure they match exactly the variables used in your csv file!
</div>

<div class = "flag">
In the `R` script for a categorical explanatory and a categorical response variable, code for both a boxplot and a stripchart is included.

Recall that according to the [Biology Procedures and Guidelines document](https://ubco-biology.github.io/Procedures-and-Guidelines/figures.html), you should use a boxplot if the groups of your categorical variable have more than 20 data points. Alternatively, you should use a stripchart if each group contains less than 20 data points. Be sure to only produce one desired figure.
</div>

<div class = "advanced">
In some of the `R` scripts there are some additional pieces of code that you don't need to worry about for now; you'll learn more about these in BIOL 202!

For example, we have to factor the categorical variables before producing graphs. You will see this additional code if the script you are using includes categorical variables. 

In the code for creating a boxplot or stripchart, we refer to a function that calculates confidence intervals. So we have to define that function prior to running the code to produce the figure. You will find this additional piece of code in your `R` script if your experiment has a categorical explanatory variable and quantitative response variable.

There are some regions in the `R` scripts where variables are re-named or manipulated. If your up for it, take some time to read through the script and see if you can figure out exactly what's happening! 
</div>

Now you're ready to run the code to create your figures and statistical calculations! Remember, once the code has been run, your figures will show in the lower right panel of RStudio under 'Plots' as well as being exported as .png image to your `figures/` directory, and the results of your statistical analyses will be displayed in your `R` console, the lower left panel, as well as being saved as either .csv or .txt to your `reports/` directory. 

### Step 5: Calculate descriptive statistics {-}

The code here should run smoothly if everything in the proceeding steps was done correctly. Once the code has been run, your descriptive statistics will be printed in the console of RStudio (lower left panel). The instructions below describe how to interpret this output depending on the types of variables in your experiment. 

**Two Categorical Variables**

Here is an example of the <a href="https://ubco-biology.github.io/Procedures-and-Guidelines/glossary#Descriptive-statistics">descriptive statistics</a> output produced by the `R` script for two categorical variables. The explanatory (independent) variable and its associated groups appear in the first column, while the response (dependent) variable appears at the top. In this case the explanatory variable was species and the response variable was sex. The numbers provided in the table describe the frequency for each category. For example, in this sample there was 1 female Adelie penguin and 0 male Chinstrap penguins.

```{r, echo=FALSE, message=FALSE}
# load data
library(palmerpenguins)

# build contingeny table
with(penguins, table(species, sex))
```

**Two Quantitative Variables**

Here is an example of the output produced by the `R` script for two quantitative variables. The explanatory (independent) variable appears in the first row, while the response (dependent) variable appears in the second row. `n` indicates the sample size, while the following columns represent the <a href="https://ubco-biology.github.io/Procedures-and-Guidelines/glossary#Mean">mean</a>, <a href="https://ubco-biology.github.io/Procedures-and-Guidelines/glossary#Standard-deviation">standard deviation</a>, <a href="https://ubco-biology.github.io/Procedures-and-Guidelines/glossary#Median">median</a>, and <a href="https://ubco-biology.github.io/Procedures-and-Guidelines/glossary#Interquartile-range">interquartile range</a>, respectively.

```{r, echo=FALSE, message=FALSE}
library(tidyverse)
# subset and summarize first var
bill_length_summary <- penguins %>%
                        select(bill_length_mm) %>%
                        summarise(
                          n = n(),
                          mean = round(mean(bill_length_mm, na.rm = T), 1),
                          sd = round(sd(bill_length_mm, na.rm = T), 2),
                          median = round(median(bill_length_mm, na.rm = ), 1),
                          iqr = round(IQR(bill_length_mm, na.rm = T), 2)
                          ) %>%
                        as.data.frame()

# subset and summarize second var
bill_depth_summary <- penguins %>%
                          select(bill_depth_mm) %>%
                          summarise(
                            n = n(),
                            mean = round(mean(bill_depth_mm, na.rm = T), 1),
                            sd = round(sd(bill_depth_mm, na.rm = T), 2),
                            median = round(median(bill_depth_mm, na.rm = ), 1),
                            iqr = round(IQR(bill_depth_mm, na.rm = T), 2)
                            ) %>%
                          as.data.frame()

# merge vars
summary_table <- full_join(bill_length_summary, bill_depth_summary)

# update row names
rownames(summary_table) <- c("bill_length_mm", "bill_depth_mm")

#print results
summary_table
```

**One Categorical and One Quantitative Variable**

Here is an example of the output produced by the `R` script for one categorical and one quantitative variable. This table is organized so that the descriptive statistics of the quantitative response variable are reported based on the groups of the explanatory variable. `n` indicates the sample size, while the following columns represent the mean, standard deviation, median, and interquartile range, respectively. For example, the mean mass (g) for Adelie penguins in this sample is 3775 g. 

```{r, echo=FALSE, message=FALSE}
penguins %>%
  group_by(species) %>%
  summarise(
    n = n(),
    mean = round(mean(body_mass_g, na.rm = T), 1),
    sd = round(sd(body_mass_g, na.rm = T), 2),
    median = round(median(body_mass_g, na.rm = T), 1),
    iqr = round(IQR(body_mass_g, na.rm = T), 2)
  ) %>%
  as.data.frame()
```

Take a screenshot of these values or write them down for writing your lab report later. 

### Step 6: Performing statistical analyses {-}

Similar to with the descriptive statistics, the code here should run smoothly if everything in the proceeding steps was done correctly. The type of <a href="https://ubco-biology.github.io/Procedures-and-Guidelines/glossary#Statistical-analysis">statistical analyses</a> you will perform depends on the types of variables in your experiment.

<div class = "note">
Code for all possible statistical tests is included within the `R` scripts. Only run the code that corresponds with the type of variables in your experiment. Use the guidelines below to help you choose the appropriate statistical test.
</div>

**Both response (dependent) and explanatory (independent) variables are categorical**

* Both your response and explanatory variables have exactly 2 groups -> Use Fisher's Exact Test
* At least one of your response or explanatory variables has more than 2 groups -> Use Chi-Square Contingency Analysis 

**Both response (dependent) and explanatory (independent) variables are quantitative**

* Use Correlation Analysis

**Response (dependent) variable is quantitative and explanatory (independent) variable is categorical** 

* Your categorical variable has exactly 2 groups -> Use Two Sample T-test
* Your categorical variable has more than 2 groups -> Use ANOVA
    
When interpreting the output from a statistical analysis for this project, focus on the <a href="https://ubco-biology.github.io/Procedures-and-Guidelines/glossary#P-value">p-value</a> provided by `R.` You'll learn more about the other details shown by the output in BIOL 202! Below are some examples of output for each type of statistical test. 
    
**<a href="https://ubco-biology.github.io/Procedures-and-Guidelines/glossary#Fisher's-exact-test">Fisher's Exact Test</a>**

`Penguins` data set with `sex` and `species` variables - the p-value is evident:

```{r, echo=FALSE, message=FALSE}
fisher.table <- xtabs(~ species + sex, data = penguins)
fisher.results <- fisher.test(fisher.table)
fisher.results
```

**<a href="https://ubco-biology.github.io/Procedures-and-Guidelines/glossary#Chi-square-($\chi$^2^)-contingency-test">Chi-Square Contingency Analysis</a>**

`Penguins` data set with `sex` and `species` variables - the p-value is evident:

```{r, echo=FALSE, message=FALSE}
chi.table <- xtabs(~ sex + species, data = penguins) 
chisq.results <- chisq.test(chi.table) 
chisq.results
```

**<a href="https://ubco-biology.github.io/Procedures-and-Guidelines/glossary#Pearson-correlation">Correlation Analysis</a>**

`Penguins` data set with `bill_length_mm` and `bill_depth_mm` variables - the p-value is fairly evident, labeled p.value:

```{r, echo=FALSE, message=FALSE}
library(broom)
cor <- cor.test(
      x = penguins$bill_length_mm, 
      y = penguins$bill_depth_mm, 
      method = "pearson", 
      conf.level = 0.95, 
      alternative = "two.sided") 
                                                                                             
    cor.tidy <- tidy(cor) 
    cor.tidy 
```

**<a href="https://ubco-biology.github.io/Procedures-and-Guidelines/glossary#Two-sample-t-test">Two-Sample T-test</a>**

`Penguins` data set with `species` - filtered to Chinstrap and Adelie - and `body_mass_g` variables - the p-value is evident:

```{r, echo=FALSE, message=FALSE}
penguins_subset <- with(penguins, subset(penguins, species == "Chinstrap" | species == "Adelie"))
t.test(body_mass_g ~ species, data = penguins_subset, # chooses the variables and data set
       var.equal = TRUE, # assumption of equal variance is met
       conf.level = 0.95) # defines the confidence interval
```

**<a href="https://ubco-biology.github.io/Procedures-and-Guidelines/glossary#Analysis-of-variance">ANOVA</a>**

`Penguins` data set with `species` and `body_mass_g` variables - the p-value is under the header Pr(>F):

```{r, echo=FALSE, message=FALSE}
    lm <- lm(body_mass_g ~ species, data = penguins) 
    lm.anova <- anova(lm) 
    lm.anova  
```

# Practicing Using the `R` Scripts {-}

We\'ll practice using the `R` script for a categorical explanatory variable and quantitative response variable. Follow these steps:

* download this `R` script and save it in your `scripts/` folder
  * [BIOL205_Script_Quantitative-Categorical.R](https://osf.io/download/46mpk)
* download this sample data set and save it in your `data/` folder
  * [BIOL205_sample-data.csv](https://osf.io/download/3r76k)
* launch your project using your RProject file
* open the `R` `script from within RStudio
* follow the instructions provided [earlier](https://ubco-biology.github.io/BIOL-205-Lab-Manual/using-the-rscripts.html) for working with the script - setting up variable names etc - and run *ALL* code in the script
  * the boxplot should be automatically saved to your `figures/` folder.
  * the results of your descriptive statistics and statistical analyses should be automatically save to your `report/` folder.