# An Intro to R & RStudio
R is a programming language primarily designed for statistical computing, data analysis, and graphical visualization. Unlike Python or Java, R is a domain-specific language (DSL) optimized for statistics. Its syntax is array-oriented, and it emphasizes vectorized operations for efficiency.

### **Key Features of R**
- **Statistical \& Data Analysis**: Built-in functions for linear/nonlinear modeling, hypothesis testing, clustering, time-series analysis, and more.
- **Data Visualization**: High-quality plots through base graphics and packages such as ggplot2 (for publication-ready graphics).
- **Extensibility**: Over 18,000 packages on CRAN, plus repositories such as Bioconductor for bioinformatics and package collections such as tidyverse for data wrangling.
- **Interoperability**: Integrates with C/C++, Python, and SQL, and runs on Windows, macOS, and Linux.


### **Common Use Cases**
- **Bioinformatics**: DNA sequence analysis (Biostrings), RNA-seq (DESeq2).
- **Machine Learning**: Predictive modeling (caret, randomForest).

### **Ecosystem**
- **RStudio**: Preferred IDE (Integrated Development Environment) with tools for debugging, package management, and Knitr for reproducible reports.
- **CRAN \& Bioconductor**: Central repositories for packages necessary to run your analyses.

### This tutorial will introduce R and RStudio for bioinformatics, covering installation and basic syntax.

### Step 1: Install R & RStudio
The first thing you're going to do is install R and RStudio from the internet:
1. Download R from [CRAN](https://cran.r-project.org/) (choose your OS).
2. Download RStudio (Desktop version) from [RStudio](https://posit.co/download/rstudio-desktop/).

After you have these downloaded, go ahead and open up RStudio. Because RStudio is our IDE for R, we will be working in RStudio to conduct our analyses.

#### Really quick, what is the difference between R and RStudio? Do we need both? Let's go over it. 

---

##### 🧠 **R**: The Brain

**R** is the actual programming language. It does all the heavy lifting behind the scenes—math, data analysis, creating plots, running code, and more.

You can think of **R** as the engine of a car. It’s what makes everything go.

When you install R, you’re installing this core language. Technically, you could write and run R code in a plain text editor or a terminal window... but it wouldn’t be very user-friendly.

---

##### 🖥️ **RStudio**: The Friendly Workspace

**RStudio** is an **Integrated Development Environment (IDE)**—basically, a nice interface built *on top* of R to make your life easier.

It gives you:
- A **Console** to type and run code
- A **Script editor** to write and save code files
- A **Viewer** for plots and data tables
- A **File pane** to browse your project folders
- An **Environment pane** to see your variables and data

RStudio doesn't replace R—it just gives you a better way to interact with it.

---

#### In Short:

|              | **R**                          | **RStudio**                          |
|--------------|--------------------------------|--------------------------------------|
| What it is   | A programming language         | A program (IDE) to use R easily      |
| Role         | The engine                     | The dashboard                        |
| Required?    | Yes                             | Optional, but highly recommended     |
| How they work| R does the work                | RStudio sends instructions to R      |

---

So you’ll need to install **R first**, and then install **RStudio** to get that helpful interface.

### Let's talk about the layout of RStudio
RStudio is a user-friendly interface designed to make working with the R programming language easier, especially for data analysis and bioinformatics. When you open RStudio, you’ll see that the window is divided into four main panels (or panes), each with a specific role:

**1. Source (Script Editor) – Top Left**
* Purpose: This is where you write, edit, and save your R scripts (code files), RMarkdown documents, or notes.
* How to use: If you don’t see it at first, create a new script via the menu: File > New File > R Script. Code written here can be saved for future use.
* Tip: Write your code here and use the "Run" button or Ctrl+Enter (Windows) / Cmd+Enter (Mac) to send a line or selection to the Console.

**2. Console – Bottom Left**
* Purpose: This is where R executes your commands and displays the results. Think of it as R’s direct command line.
* How to use: Type commands here for quick tests or exploration. Results appear immediately below your command.
* Tip: Commands run here are not saved unless you copy them to a script.

**3. Environment/History – Top Right**
* Environment Tab: Shows all the objects (variables, data frames, functions) you’ve created in your current session. You can see their names, types, and sometimes a preview of their contents.
* History Tab: Keeps a record of all the commands you’ve run. You can reuse previous commands by clicking them.
* Tip: You can remove objects from the environment here, or click on them to inspect their contents.

**4. Files/Plots/Packages/Help/Viewer – Bottom Right**
* Files: Browse your computer’s folders and files.
* Plots: View any graphs or charts you create with R.
* Packages: See which R packages are installed; load or update them here.
* Help: Search for help on R functions and packages.
* Viewer: View web content or HTML outputs generated by RMarkdown.
* Tip: Click the tabs to switch between these tools as you work.

This organized layout helps you write, run, and manage your R code efficiently as you learn bioinformatics or any data science task in RStudio.

### Step 2: Basic Calculations
At its most basic level, R functions like a powerful calculator, allowing you to perform standard arithmetic operations. Here are some examples. (Remember, we're working in the "source" pane—you can copy and paste these code blocks directly into it if you'd like.)

In [None]:
6 + 4
6 / 2
6 * 4
6 ^ 4 

The `[1]` you see to the left of the output is called an “index” number. We’ll explore its usefulness more later, but for now, just know that it indicates the position of the first item in each row of the output. Since we currently only have one row of output, it’s always showing `[1]`.
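
To see why the index is useful, it can help to print something longer than one row. The sketch below uses a hypothetical vector called `many_numbers`, which is not part of the tutorial data. Where the rows break depends on how wide your console is, so your output may wrap differently.

```r
# A hypothetical vector of 30 numbers, used only for illustration
many_numbers <- 101:130

# Printing it wraps onto several rows; each row starts with the
# position of its first item in square brackets
many_numbers

# The same positions can be used to pull out a single item
many_numbers[13]
```

Each bracketed number at the start of a row tells you where that row picks up in the full set of values.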

### Step 3: Basic R Syntax
Let's go over some basic R syntax to get us started.

All of the following code will be run in your source pane. 

To start, let's get help on a function/command. For example, here is how to see the help info for the function that shows our current working directory in R:


In [None]:
?getwd

In the bottom right panel, you will see information and usage for this function. The same thing will happen if you write the `?getwd()` line of code and run it in the console. Try it out! 

#### What is a variable?
In R, most operations are performed on things stored in variables, which are called “objects.” Objects can hold different types of data, and the type of object—as well as the type of data it contains—affects what you can do with it. A variable in R is simply a name that refers to a storage location holding a value, such as a number, a piece of text (called a string), a vector, or more complex structures like data frames. Unlike some programming languages, R doesn’t require you to declare a variable ahead of time—it’s created automatically when you assign a value to it.

##### So how do we assign a variable?
We will use `<-` (preferred) or `=` to assign values to a variable. Let's look at an example below:

In [None]:
x <- 6

After executing the command, our “environment” pane should now show this variable there.
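
The Environment pane has a code equivalent, which can be handy when working outside RStudio. This is a small sketch using two hypothetical variable names (`a_arrow` and `a_equals`) to show that both assignment operators give the same result, and that `ls()` and `rm()` list and remove objects.

```r
# Both assignment operators create the same kind of object
a_arrow <- 6
a_equals = 6
identical(a_arrow, a_equals)

# ls() lists the names of the objects in the current environment
ls()

# rm() removes an object by name; exists() checks whether it is still there
rm(a_equals)
exists("a_equals")
```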

Now that we have the value “6” stored in the variable “x”, we can use the variable name in functions. Here are some examples doing the same calculations we performed above, but now with our variable.

In [None]:
x
x + 4
x / 2
x * 4
x ^ 4

We can also check what type of data is contained within this variable with the `class()` function:

In [None]:
class(x)

And we find out it is of class “numeric”. Let’s try storing a different type, like a word:

In [None]:
w <- "books" 
class(w)

In this case, the class of the object is “character.” When working with character data (text), you need to use quotes. If you leave out the quotes, R will assume you're referring to a variable with that name instead of treating it as plain text. This requirement doesn’t apply to numbers: for example, when we assigned the value 6 to the variable `x`, no quotes were needed. This is also why you can’t start a variable name with a number in R.
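
The difference between quoted and unquoted text can be sketched as follows. The names `w_text`, `six_as_text`, and `some_undefined_name` are made up for this example, and `tryCatch()` is used only so the error is captured as text instead of stopping the script.

```r
# With quotes, R stores the text itself
w_text <- "books"
class(w_text)

# A number inside quotes is also stored as text, not as a number
six_as_text <- "6"
class(six_as_text)

# Without quotes, R looks for an object with that name.
# No such object exists here, so R raises an error, which we capture.
lookup_result <- tryCatch(
  some_undefined_name,
  error = function(e) conditionMessage(e)
)
lookup_result
```

Running this should show that `lookup_result` holds an error message rather than a value, which is what you would see in the console if you forgot the quotes.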

We can also store multiple values in a single variable. In R, a one-dimensional object that holds multiple items is called a *vector*. To combine several items into one object, we use the `c()` function, which stands for “combine” (you will also see it described as “concatenate”). Here's an example of how to create a vector of numbers:

In [None]:
y <- c(5, 6, 7)
y
class(y)

Note that `class(y)` shows this vector is still of class “numeric”. It helps to get in the habit of checking what type of object you are working with.
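
Because R is built around vectors, arithmetic applies to every item at once. Here is a short sketch that recreates the same `y` vector and tries a few operations on it.

```r
# Recreate the vector from above so this block runs on its own
y <- c(5, 6, 7)

# Arithmetic is applied to every item in the vector
y * 2

# Two vectors of the same length are combined position by position
y + c(1, 2, 3)

# length() counts the items; square brackets pick one out by position
length(y)
y[2]
```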

Variables can also store tables of data. In this example, we’ll create another vector with three numbers and combine it with our previous vector to form what's called a *data frame*. (Don’t worry about memorizing all the terminology just yet—we’re just getting familiar with the concepts for now.) We’ll use the `data.frame()` function to do this and create a new variable called `our_table`.

In [None]:
z <- c(8, 9, 10)
our_table <- data.frame(y, z)

our_table

class(our_table)

Data frames are two-dimensional objects made up of rows and columns. In this case, the `data.frame()` function took our two vectors and placed them into a table, with each vector becoming a column by default. R also has another table-like structure called a *matrix*, which is similar but not the same. Depending on what you're trying to do, you may need to convert between a data frame and a matrix. If you run into an error while working with tables, this is one of the first things to check.
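
As a rough illustration of converting between the two table types, the sketch below rebuilds `our_table` and converts it with `as.matrix()` and `as.data.frame()`. The names `our_matrix` and `back_to_table` are introduced here only for the example.

```r
# Recreate the table from above so this block runs on its own
y <- c(5, 6, 7)
z <- c(8, 9, 10)
our_table <- data.frame(y, z)

# Convert the data frame to a matrix and check what we have
our_matrix <- as.matrix(our_table)
is.matrix(our_matrix)
is.data.frame(our_matrix)

# Convert the matrix back to a data frame
back_to_table <- as.data.frame(our_matrix)
is.data.frame(back_to_table)

# dim() reports rows and columns for both kinds of table
dim(our_table)
dim(our_matrix)
```

The `is.matrix()` and `is.data.frame()` checks are a quick way to confirm which structure you have when a function complains about its input.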

Let's try working with some practice genomic data:

In [None]:
gene <- "BRCA1"  # Character (string) variable
count <- 42      # Numeric variable
class(gene)
class(count)

What we've done here is create a variable called `gene` and store the string "`BRCA1`" in it. So `gene` now holds the name of a gene: in this case, BRCA1, which is well known for its role in breast and ovarian cancer susceptibility.

We've also created a variable called `count` and assigned it the numeric value `42`.

Ultimately, we are storing some pretty basic information:
* A gene name in the variable `gene`
* A numeric value in the variable `count`

**You will see these variables in your environment panel!**

Let's assign some other variables:

In [None]:
dna_sequence <- "ATGCGTA"  # Assign a DNA sequence  
protein_sequence <- "MADGKT"  # Assign a protein sequence  
print(dna_sequence)  # Output: "ATGCGTA"  

Here, we have done a few different things:
1. Assigning a DNA sequence to a variable
   * This creates a variable named `dna_sequence` and assigns it the string "`ATGCGTA`", which is a short segment of DNA (composed of the bases A, T, G, and C).
2. Assigning a protein sequence to a variable
   * This creates a variable called `protein_sequence` and stores the string "`MADGKT`" in it. This string represents a sequence of amino acids (each letter corresponds to a specific amino acid).
3. Printing the DNA sequence
   * The `print()` command will output the data you've assigned to `dna_sequence`. In this case, `[1] "ATGCGTA"`. The `[1]` just indicates the first element of the output (since vectors are the base data structure in R).
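
Once a sequence is stored as text, base R functions can already tell us a little about it. The following is a minimal sketch using only base R (not a bioinformatics package); `bases` and `base_counts` are names chosen for this example.

```r
# Recreate the sequence from above so this block runs on its own
dna_sequence <- "ATGCGTA"

# nchar() counts the characters in the string
nchar(dna_sequence)

# strsplit() breaks the string into single letters;
# [[1]] takes the first (and only) result from the list it returns
bases <- strsplit(dna_sequence, "")[[1]]
bases

# table() counts how many times each base appears
base_counts <- table(bases)
base_counts
```

For real sequence work, dedicated packages such as Biostrings are a better fit, but this shows that a character variable is something you can compute on.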

### Let's work with our practice data.

First, let's download some data from our GitHub to work with. If you want to work on the Binder, the data is already there for you. If you want to work on your local computer, please download the `R_basics_temp` folder from our GitHub. 

##### **Now, let's set up our working directory.**
Just like when using the command line or navigating files in a graphical user interface, it's important to know your current location within your computer's file system when working in R. The functions `getwd()` and `setwd()` help manage this. Most R commands follow a specific format: the function name is followed by parentheses. Any necessary inputs, called arguments, go inside the parentheses. For example, `getwd()` doesn't require any arguments; it simply tells you your current working directory.

In [None]:
getwd()

The pathname that prints out is your current working directory. If you want to change your working directory, use the `setwd()` function in the source pane or console.

In [None]:
setwd("~/Desktop/How-to-Bioinformatics/R_basics_temp")

In [None]:
getwd()

Now we see that our working directory has been changed, and we are now in the working directory we're going to be using!
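
If `setwd()` gives an error, the usual cause is a path that does not exist on your computer. One way to guard against that is sketched below. It assumes a stand-in folder created inside R's temporary directory (`project_dir` is a made-up name) so that the example runs anywhere; in practice you would put the path to your own `R_basics_temp` folder there.

```r
# A stand-in folder for illustration; replace with your own path
project_dir <- file.path(tempdir(), "R_basics_example")
dir.create(project_dir, showWarnings = FALSE)

# Only change directory if the folder really exists
if (dir.exists(project_dir)) {
  old_dir <- setwd(project_dir)  # setwd() hands back the previous directory
  print(getwd())
  setwd(old_dir)                 # return to where we started
} else {
  message("Folder not found: ", project_dir)
}
```

`file.path()` joins folder names with the right separator for your operating system, which avoids typing slashes by hand.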

In this tutorial, we will be using the "gene_annotations.txt" file. To get this table into R, we have to "read it in". You will often hear bioinformaticians refer to "uploading" data into R like this. One of the most common ways of reading tables into R is to use the `read.table()` function. To start, let’s try reading our “gene_annotations.txt” table into R with no arguments other than specifying the file name:

In [None]:
gene_annotations_tab <- read.table("gene_annotations.txt")

We see that we've gotten an error message. Error messages in R can seem confusing at first, but over time, they often start to make more sense. In this case, the key part of the error is the message at the end: “line 1 did not have 22 elements.” If we check our table, either in the console or using the graphical viewer, we can see that it should have 8 columns. This suggests that R is having trouble correctly splitting each line into columns, likely due to formatting issues in the data.

Let's investigate what the problem might be:

In [None]:
?read.table()

The help file appears in the bottom right pane. If we scan through it looking for anything related to setting the delimiter, we’ll find the argument `sep`. By default, `sep` is set to split on *any* whitespace—which includes both tabs and spaces. But if we look at our `"gene_annotations.txt"` file, we can see that there are spaces within the KO and COG annotation columns.

So what’s happening is that `read.table()` is treating every space as a column separator, which creates more columns than expected. Then it gives us an error saying, “Hey, the first line doesn’t have as many columns as the rest of the file!”—which is actually helpful, because it’s warning us that something’s not quite right with how the data is being read. Let’s try running the command again, but this time we’ll specify that the delimiter should be tabs only. In R, a tab character is written as a backslash followed by the letter "t", like this: `\t`.

In [None]:
gene_annotations_tab <- read.table("gene_annotations.txt", sep = "\t")

We see that this works without any errors! However, when we take a look at it with the `head()` function in R, we can see that upon reading in our data, R put our column names in the first row and created new column names (`V1`, `V2`, etc.). Let's see if we can fix that.

If we take another look at the help menu for `read.table()` in the bottom right pane, we’ll see there’s an argument called `header`, which is set to `FALSE` by default. Since our file does have a header row, let’s try running the command again—this time specifying that `header = TRUE`.

In [None]:
gene_annotations_tab <- read.table("gene_annotations.txt", sep = "\t", header = TRUE)

head(gene_annotations_tab)

There we go! It looks great. 
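
The three attempts above can be reproduced on a tiny file to see what each argument changes. This is a sketch with made-up contents written to a temporary file (`toy_file`), not the real `gene_annotations.txt`. It also uses `count.fields()`, which reports how many columns R finds on each line.

```r
# Write a tiny, made-up tab-delimited file to a temporary location
toy_file <- tempfile(fileext = ".txt")
writeLines(c(
  "gene_ID\tKO_ID\tKO_annotation",
  "gene_1\tK00001\talcohol dehydrogenase",
  "gene_2\tNA\tNA",
  "gene_3\tK99999\tsome made up protein"
), toy_file)

# How many columns does R see on each line?
count.fields(toy_file)              # splitting on any whitespace
count.fields(toy_file, sep = "\t")  # splitting on tabs only

# Attempt 1: defaults. The error is captured so the script keeps going.
default_try <- tryCatch(
  read.table(toy_file),
  error = function(e) conditionMessage(e)
)
default_try

# Attempt 2: tabs only, but the header row is treated as data
no_header <- read.table(toy_file, sep = "\t")
colnames(no_header)

# Attempt 3: tabs only, and the first row is used for column names
with_header <- read.table(toy_file, sep = "\t", header = TRUE)
colnames(with_header)
dim(with_header)
```

Checking `count.fields()` first is a quick way to spot lines that split into an unexpected number of columns before reading a large file.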

Now let's check our column names and the size/dimensions of our table!

In [None]:
colnames(gene_annotations_tab)

dim(gene_annotations_tab)

So our table is 84,784 rows by 8 columns, which matches what we see if we inspect the file in a command-line terminal (`wc -l gene_annotations.txt`), keeping in mind that `wc -l` also counts the header line.
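
A few related functions are worth knowing alongside `dim()`. The sketch below uses a small made-up data frame called `toy_tab` so it can run without the tutorial file.

```r
# A small, made-up table for illustration
toy_tab <- data.frame(
  gene_ID = c("gene_1", "gene_2", "gene_3"),
  KO_ID   = c("K00001", NA, "K99999")
)

# dim() gives rows first, then columns
dim(toy_tab)

# nrow() and ncol() give each number on its own
nrow(toy_tab)
ncol(toy_tab)

# str() prints a compact summary: column names, types, and first values
str(toy_tab)
```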

Now let’s create a new table so we can practice writing data to a file from R. You might have noticed that our `gene_annotations_tab` table contains some **`NA`** values. These are special placeholders in R that represent missing data. In this case, they appear in the KEGG and COG annotation and ID columns for genes that weren't annotated.

Let’s say we want to create a subset of the table that includes only the genes *with* KEGG annotations. The **`is.na()`** function in R is useful here—it checks whether each item in an object is **`NA`**. But since we’re interested in values that are *not* **`NA`**, we’ll use the **`!`** operator to reverse the result. 

In [None]:
KEGG_only_tab <- gene_annotations_tab[!is.na(gene_annotations_tab$KO_ID), ]

##### Let's break down what we just did:

- **`KEGG_only_tab`**  
  This is the name of the new variable we’re creating. It will store the subset of our original table.

- **`<-`**  
  This is the *assignment operator* in R. It tells R to store the result of the expression on the right into the variable on the left.

- **`gene_annotations_tab[!is.na(gene_annotations_tab$KO_ID), ]`**  
  This is the subsetting operation. While it looks complex, it follows a simple pattern once we break it into parts.

---

##### Breaking Down the Subsetting:

- **`gene_annotations_tab`**  
  This is our original data frame.

- **`[ ]`**  
  Square brackets are used to subset data in R. Inside the brackets, we specify which *rows* and *columns* we want. These are separated by a comma: `[rows, columns]`.

- **`!is.na(gene_annotations_tab$KO_ID)`**  
  This is the first argument (before the comma), and it defines which rows we want.  
  - `is.na()` checks for missing values (`NA`) in the `KO_ID` column.  
  - `!` inverts the result, so we’re selecting only the rows where `KO_ID` is *not* `NA`.

- **`,`**  
  The comma separates the row condition from the column selection.

- **(nothing after the comma)**  
  When nothing is specified for the columns, R keeps *all* columns in the resulting subset.
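
The `[rows, columns]` pattern described above can be tried out on a small table. This sketch uses a made-up data frame (`toy_tab`) with hypothetical column names, so it runs without the tutorial file.

```r
# A small, made-up table for illustration
toy_tab <- data.frame(
  gene_ID     = c("gene_1", "gene_2", "gene_3", "gene_4"),
  KO_ID       = c("K00001", NA, "K99999", NA),
  gene_length = c(900, 450, 1200, 300)
)

# Rows where KO_ID is not NA, all columns (nothing after the comma)
annotated <- toy_tab[!is.na(toy_tab$KO_ID), ]
annotated

# All rows (nothing before the comma), two named columns
ids_only <- toy_tab[, c("gene_ID", "KO_ID")]
ids_only

# Rows 1 and 2 of a single column, which comes back as a plain vector
first_two_lengths <- toy_tab[1:2, "gene_length"]
first_two_lengths
```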

---

Let's review how conditional statements return `TRUE` or `FALSE`—this logic is key to filtering data in R.

In R, when you filter or subset data, you're basically asking a **yes or no** question about each row:  
> “Should I keep this row?”

R answers that question using **TRUE** (yes, keep it) or **FALSE** (no, skip it).

For example, when we use something like this:

```r
!is.na(gene_annotations_tab$KO_ID)
```

We’re asking:  
> “Is this value **not** missing?”

R goes through the whole `KO_ID` column and checks each value. It returns something like:

```r
TRUE  TRUE  FALSE  TRUE ...
```

Then R uses those answers to decide which rows to keep.  
Only the rows where the answer was **TRUE** will end up in your new table.

So learning how these **TRUE/FALSE checks** work is really helpful—it’s the main way we filter and organize data in R!
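
TRUE/FALSE vectors can also be counted and located, which is a handy check before filtering. A short sketch with a made-up vector of IDs (`ko_ids`):

```r
# A made-up vector of IDs with some missing values
ko_ids <- c("K00001", NA, "K99999", NA, "K00002")

# One TRUE or FALSE per item
has_ko <- !is.na(ko_ids)
has_ko

# sum() treats TRUE as 1 and FALSE as 0, so this counts the TRUE values
sum(has_ko)

# which() gives the positions of the TRUE values
which(has_ko)

# Using the logical vector inside [ ] keeps only the TRUE positions
ko_ids[has_ko]
```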

---

I also want to mention the `!` operator really quick.

The **`!`** symbol in R is a *logical NOT* operator. It’s used to reverse the result of a logical test—turning `TRUE` into `FALSE`, and `FALSE` into `TRUE`.

For example, the function **`is.na()`** checks if a value is `NA` (missing), and returns `TRUE` for each item that *is* `NA`. But if we want to find the values that *aren’t* `NA`, we use **`!is.na()`**—this flips the `TRUE` and `FALSE` values, giving us `TRUE` only for the items that have actual values (i.e., not missing).

Here’s a quick example:

In [None]:
x <- c(3, NA, 5)
is.na(x)
# [1] FALSE  TRUE FALSE

!is.na(x)
# [1]  TRUE FALSE  TRUE

So `!` just means “not”—it’s a simple way to reverse a condition.

Since I am explaining the `!` operator, it’s a great idea to also include the `==` operator, since both are used in logical comparisons:

---

#### The `==` Operator: *“Is equal to?”*

In R, `==` is used to **check if two things are equal**. It returns either `TRUE` or `FALSE`.

Think of it as asking R:  
> “Is this value exactly equal to that one?”

##### Example:

```r
x <- 5
x == 5  # This will return TRUE, because x is equal to 5
x == 3  # This will return FALSE, because x is not equal to 3
```

> Note: A single equals sign `=` is used for assigning values (like `x = 5`), while `==` is used for *comparing* values.

---

#### The `!` Operator: *“Not”*

The `!` operator is used to **reverse** a logical condition. It flips `TRUE` to `FALSE`, and `FALSE` to `TRUE`.

Think of it as telling R:  
> “Give me the opposite of this.”

##### Example:

```r
x <- 5
x == 5      # TRUE
!(x == 5)   # FALSE — because we're asking if x is NOT equal to 5

is.na(x)      # FALSE — because x is not missing
!is.na(x)     # TRUE — meaning x *is* present (not NA)
```

---

#### Using Them Together

You’ll often see `==` and `!` used together in filtering data. For example:

```r
data[data$category == "A", ]     # keeps rows where category is "A"
data[data$category != "A", ]     # keeps rows where category is NOT "A"
```

Here, `!=` is just a shortcut that combines `!` and `==`—it means “not equal to.”
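
To make the filtering example runnable, here is a sketch with a small made-up data frame. The name `toy_data` and its columns are assumptions for illustration, standing in for the `data` object above.

```r
# A small, made-up table for illustration
toy_data <- data.frame(
  sample   = c("s1", "s2", "s3", "s4"),
  category = c("A", "B", "A", "C")
)

# Keep rows where category is "A"
only_a <- toy_data[toy_data$category == "A", ]
only_a

# Keep rows where category is NOT "A", written two ways
not_a      <- toy_data[toy_data$category != "A", ]
also_not_a <- toy_data[!(toy_data$category == "A"), ]

# Both versions select the same rows
identical(not_a, also_not_a)
```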

Let's move on.

If we peek at our new table with `head()`, we see that the first 6 rows all have KEGG annotations, whereas before some were `NA`:

In [None]:
head(KEGG_only_tab)

And we can also look at how many genes we dropped that didn’t have a KEGG annotation assigned to them:

In [None]:
dim(gene_annotations_tab) # 84,784 genes
dim(KEGG_only_tab) # 37,319 had KEGG annotations assigned

Now let’s save our new table—which includes only the genes that were annotated by KEGG—to a new tab-delimited file called **"KEGG_annotated.tsv"** (using the `.tsv` extension this time, which is more appropriate for tab-separated files).

We can use the **`write.table()`** function to do this. If we take a quick look at the help menu using `?write.table`, we’ll see that by default it uses a space to separate columns. But since we want **tab-separated** values, we need to add the argument:

```r
sep = "\t"
```

There are a couple more things we want to customize:
- **We don’t want R to include row numbers** in our output file (those leftmost numbers), so we’ll add:
  ```r
  row.names = FALSE
  ```
- **We also don’t want quotation marks** around our text fields (like the annotation columns), so we’ll use:
  ```r
  quote = FALSE
  ```

How would we know to do all this the first time? Honestly, we wouldn’t! It’s totally normal to write out a file, look at it, decide something looks off, and then Google or check the help menu to figure out how to fix it. Even experienced R users do this all the time.

So let’s use all these options in our command to write the file exactly how we want it.

In [None]:
write.table(KEGG_only_tab, "KEGG_annotated.tsv", sep = "\t", row.names = FALSE, quote = FALSE)
list.files() # checking it is there now in our working directory

---

As mentioned earlier, it’s a good habit to **take a quick look at the output file** in the terminal while you're figuring out the right options for writing it. This helps you confirm that the file looks the way you expect.

To do this, switch back to the **“Terminal” tab** in the bottom-left pane of RStudio, and then use the `less` command to view your new file without opening it in a separate program. For example:

```bash
less KEGG_annotated.tsv
```

This will let you scroll through the file and check if the formatting (like tabs, no quotes, no row numbers) looks correct.
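
If you would rather stay in R, a similar check can be done with `readLines()`, which returns the raw lines of a file as text. The sketch below writes a small made-up table (`toy_tab`) to a temporary file rather than touching `KEGG_annotated.tsv`.

```r
# A small, made-up table for illustration
toy_tab <- data.frame(
  gene_ID = c("gene_1", "gene_3"),
  KO_ID   = c("K00001", "K99999")
)

# Write it out with the same options used above
out_file <- tempfile(fileext = ".tsv")
write.table(toy_tab, out_file, sep = "\t", row.names = FALSE, quote = FALSE)

# Look at the first few raw lines; tabs show up as \t
first_lines <- readLines(out_file, n = 3)
first_lines

# Read the file back in to confirm it loads as expected
read_back <- read.table(out_file, sep = "\t", header = TRUE)
dim(read_back)
```

Seeing `\t` between the values, with no quotation marks and no leading row numbers, confirms the three options did what we wanted.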

### Step 4: R Packages

#### Installing and Using R Packages

One of the most powerful things about **R** is its huge community of people who create and share **packages**. These packages are collections of pre-written code that make it easier to perform specific tasks in R. Packages can help you analyze data, create plots, or even interact with other software tools.

To use a package, you first need to **install** it (if you haven’t already) and then **load** it into your R session. Let’s go over the three main ways to install packages in R.

---

##### 📚 **1. Installing Packages from CRAN with `install.packages()`**

Most of the time, you can install a package from **CRAN** (the Comprehensive R Archive Network) using the **`install.packages()`** function. For example, if you want to install the popular **`tidyverse`** package, you would run:

In [None]:
install.packages("tidyverse")

This will download and install the package. You’ll see some installation info printed in the console. Once it’s done, you can **load** the package into your session with the `library()` function:

In [None]:
library("tidyverse")
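
Installing a package every time a script runs is slow, so a common pattern is to install only when the package is missing. Here is a sketch of that pattern. It uses `"stats"`, which ships with R, purely so the example runs without downloading anything; `pkg_name` is a made-up variable name, and you would swap in the package you actually need.

```r
# "stats" ships with R, so nothing is downloaded in this example.
# Replace it with the package you need, for example "tidyverse".
pkg_name <- "stats"

# requireNamespace() returns TRUE if the package is already installed
if (!requireNamespace(pkg_name, quietly = TRUE)) {
  install.packages(pkg_name)
}

# character.only = TRUE tells library() that pkg_name is a variable
# holding the package name, not the name itself
library(pkg_name, character.only = TRUE)
```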

#### What if a package isn't available?

Sometimes, when you try to install a package with **`install.packages()`**, you might get a warning like this:

```r
install.packages("dada2")
Warning in install.packages:
  package ‘dada2’ is not available for this version of R
```

Don’t worry! This just means that the package isn’t available for your version of R through CRAN. You can usually find alternative installation instructions by Googling the package name. For example, for **`dada2`**, you would search for “install dada2” and find that you need to use a different method (explained below).

---

##### 🧬 **2. Installing Packages from Bioconductor with `BiocManager::install()`**

Some packages are hosted on **Bioconductor**, a repository for bioinformatics tools. To install packages from Bioconductor, you need to use **`BiocManager`**.

1. First, install **`BiocManager`** if you haven’t already:

In [None]:
install.packages("BiocManager")

2. Once **`BiocManager`** is installed, you can use it to install packages like **`dada2`** (which isn’t available on CRAN). For example:

In [None]:
BiocManager::install("dada2")
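
The `::` in `BiocManager::install()` means "use this function from this package" without loading the whole package with `library()`. The same syntax works for any installed package. A tiny sketch using `utils`, which ships with R (`first_three` is a name chosen for this example):

```r
# package::function() calls a function from an installed package
# without needing library() first
first_three <- utils::head(letters, 3)
first_three
```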

---

##### 🌐 **3. Installing Development Versions from GitHub with `devtools::install_github()`**

Sometimes you might want to install a development version of a package directly from **GitHub**. This is useful if you want to get the latest version of a package (perhaps to fix a bug) before it’s officially released.

1. First, install the **`devtools`** package if you don’t have it yet:

In [None]:
install.packages("devtools")

2. Then, use **`devtools::install_github()`** to install the package from GitHub. For example, to install the **`tidyr`** package from GitHub, you would run:

In [None]:
devtools::install_github("tidyverse/tidyr")

3. After installation, you can load the package as usual:

In [None]:
library("tidyr")

---

##### 📝 **Best Practices for Installing Packages**

- **Always follow the package’s installation instructions**: Many packages have specific requirements or steps for installation. Check the package documentation (usually available on CRAN, Bioconductor, or GitHub) for the recommended installation method.
- **Start with `install.packages()`**, as it works for most packages available on CRAN.
- If the package isn’t on CRAN or you get a warning, try **Bioconductor** with `BiocManager::install()`.
- For the latest development version, you can try **GitHub** using `devtools::install_github()`.

These are the most common ways to install packages in R, and following the package’s documentation is almost always the best approach.

---