# NGS Data Analysis Practice

This set of exercises will provide an insight into one of the most common applications of NGS data analysis: the identification of human mitochondrial variants from a sequenced sample.  

It will follow the simple workflow described during the [NGS Data Analysis](https://bit.ly/ngs-data) lesson:  

- quality check and preprocessing  
- alignment  
- variant calling  
- variant annotation 

Starting from the raw output of the sequencing machine, we will first clean the data, then align them against a reference mitochondrial genome to identify variants, and finally annotate these variants to look for interesting findings. 

The entire practice will be conducted online, so you will only need a functional web browser and will be able to follow it using your laptop or tablet. 

___ 

## Setup instructions 

### Galaxy

Our NGS data analysis will be performed using [Galaxy](https://usegalaxy.org), "an open source, web-based platform for data intensive biomedical research". It lets you explore and use many different bioinformatics tools without installing them on your machine, since they are all hosted on the Galaxy platform, ready to use with your own data. 

The only thing required to use Galaxy is registration. On the Galaxy home page, click on **Login or Register** in the top bar, then on **Register**. 

![](data/imgs/galaxy_1.jpg)

In the **Create account** page, fill in your email address and choose a password, then choose a username (only lowercase letters, numbers, `.`, `_` and `-` are allowed!) and finally click on **Submit**. 

You should see a notification saying that everything went well and that a verification email was sent to the address you provided. Please check your inbox and click on the verification link in order to start using Galaxy. 

### RStudio

RStudio is a widely used integrated development environment (IDE) for performing analyses with the R programming language. RStudio Cloud is a free online resource that lets you use R and RStudio right from your browser, without installing any software. 

You will need to register on [RStudio Cloud](https://rstudio.cloud) by clicking on **Sign up** in the upper right corner of its home page. Fill in your email address, choose a password and type your name and surname; you can also sign up using your Google credentials, if you want. After clicking on **Sign up** or **Sign up with Google**, you will have to choose a username.  
You will then receive a verification email; click on the verification link and log in to RStudio Cloud using your new account. 

___ 

## Tools overview

### Galaxy

All bioinformatics software available on Galaxy is listed in the **ToolBox** on the left. Every tool performs a specific task, and similar tools are grouped together in categories. Tools can also be searched by keyword using the search bar at the top. 

![](data/imgs/galaxy_2.jpg)

When you click on a specific tool, it is loaded in the main pane and is ready to be used on your data. Every tool has its own set of options, which can be tweaked to better suit your specific needs and provide better results.  
Below the main pane there is a help section, with more information on how that specific tool works.  
When you are ready to use the tool you chose, click on **Execute**. 

![](data/imgs/galaxy_3.jpg)

On the right there is the **History** pane. Every step you take during your analysis will be recorded here, along with details of the data it worked on and specific parameters used.  
When you launch a Galaxy tool, a new entry will be added to the **History**: at first it is <span style="color:gray">**gray**</span>, meaning that the analysis is queued and you need to wait; then the tool will start running, and the entry will be shown in <span style="color:yellow">**yellow**</span>; when the tool has successfully completed its job, the entry will be shown in <span style="color:green">**green**</span>, while if an error occurs it will be shown in <span style="color:red">**red**</span> (and you should raise your hand and ask for my help!). 

![](data/imgs/galaxy_4.jpg)

You can rename your history to something meaningful, and rename each step taken during the analysis, using the pencil icon in each entry. You can also view the results produced by each tool by clicking on the eye icon, and remove a specific step by clicking on the X icon. 

### RStudio

RStudio is composed of 4 main panels:  

- Source 
- Console 
- Files, Plots, Packages, Help, Viewer 
- Environment, History, Connections 

Each one of them serves a specific purpose, but we'll be using mostly the Console, Files and Plots tabs. 

The Console tab is where you will type your R commands. Each command is typed on one line, and is followed by a return (Enter key) to launch the command. You can try to type some math operations in the Console, and R will output their results. 

![](data/imgs/rstudio_1.jpg)
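
As a minimal sketch of what this looks like, the lines below can be typed into the Console one at a time. The numbers and the variable name `total` are arbitrary choices made for illustration.

```r
# Type each line in the Console and press Enter; R prints the result.
2 + 3      # addition
10 / 4     # division
2^10       # exponentiation

# A result can also be stored in a variable (here called total) for later use.
total <- 2 + 3
total * 4
```

Anything after a `#` on a line is a comment and is ignored by R.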

While you can use R as a calculator, most of its power comes from functions that perform specific actions on data. A function is called by writing its name followed by its arguments (some of which may be optional) in parentheses:  

```r 
my_function(argument1, argument2)
```
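
For a concrete case of this syntax, here is a small sketch using `round()`, a function that ships with base R. The input number and the variable names are arbitrary examples.

```r
# round() takes the number to round as its first argument and,
# optionally, the number of decimal places to keep (digits).
rounded <- round(3.14159, digits = 2)
print(rounded)

# digits is optional: when it is omitted, round() rounds to a whole number.
rounded_default <- round(3.14159)
print(rounded_default)
```

Typing `?round` in the Console opens the help page for the function, which lists all of its arguments.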

Groups of related functions are usually gathered together in packages, which can be loaded into R when needed. When you find an interesting package, first install it with `install.packages()` (needed only once), then load it with `library()` (needed in every new R session), like this: 

```r 
install.packages("my_package")
library(my_package)
```

Once the package is installed and loaded, you can call its functions using the syntax seen above. 
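
To try this without installing anything, the sketch below uses `tools`, a package that ships with R but is not loaded by default. The file name `sample.fastq` is made up for illustration and does not need to exist.

```r
# tools is distributed with R, so no install.packages() call is needed here.
library(tools)

# file_ext() returns the extension of a file name.
extension <- file_ext("sample.fastq")
print(extension)

# The package::function form calls a function without loading the package first.
base_name <- tools::file_path_sans_ext("sample.fastq")
print(base_name)
```

The `package::function` form is also handy when two loaded packages define a function with the same name.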