# ChemBioSys and AquaDiva | Genome-resolved metagenomics workshop

This is your interactive course script. It is designed as a [`Jupyter Notebook`](https://jupyter.org/), an interactive computational document that allows you to execute the code and tools we use throughout the course in a browser window. Here you find all the background information and instructions needed for the individual parts of the course.

## In a nutshell
This workshop aims to provide you with detailed insights into genome-resolved metagenomics. We will familiarize ourselves with the [`Linux commandline`](https://www.gnu.org/software/bash/) and then progress through the bioinformatic analyses that are necessary to reconstruct genomes from metagenome datasets.

The contents of the course are summarized in the figure below. Each course day will be accompanied by short discussions/presentations about the aspects we are working on.

<img src="img/chembiosys_metaG_workshop.png" alt="Course contents" width="750"/>

## The (rough) schedule

| **Course day** | **To do**                                                      | **Discussions / Presentations**                                          |
|----------------|----------------------------------------------------------------|--------------------------------------------------------------------------|
| _1 - morning_  | CLI 101, QAQC sequencing data                                  | Sequencing methods, past, present, future                                |
|                |                                                         **COFFEE BREAK 10.45-11.00**                                                      |   
|                | Short-read taxonomic profiling, metagenome coverage estimation | A few words about study design and metagenomics                          |
|                |                                                         **LUNCH BREAK 12.30-1.30**                                                        |
| _1 - afternoon_ | Metagenome assembly, binning, bin refinement                   | How do assembly and binning work                                        |
|                |                                                         **COFFEE BREAK 3.00-3.15**                                                        |
|                | Assessing genome completeness/contamination, taxonomic placement | The beauty of single copy marker genes                                 |
| _2 - morning_  | Phylogenomics                                                  | Things you need to know about phylogenetics/-omics                       |
|                |                                                         **COFFEE BREAK 10.45-11.00**                                                      |   
|                | Pangenomics                                                    | What is a pangenome and what is it good for                              |
|                |                                                         **LUNCH BREAK 12.30-1.30**                                                        |
| _2 - afternoon_ | Mining biosynthetic gene clusters                             | Background antiSMASH and related tools                                   |
|                |                                                         **COFFEE BREAK 3.00-3.15**                                                        |
|                |                                                         **OPEN DISCUSSION**                                                               |

## Managing expectations

Based on the survey we did with you, we know that most of you have very limited or basic knowledge of commandline work, sequence data processing as a whole, and genome-resolved metagenomics. At the same time, some of you probably have a very comprehensive background in some of the contents covered. This is not a one-size-fits-all workshop, but we try to address your interests as best we can.

After finishing the workshop you should...:

* ... have a good understanding of the steps involved when doing genome-resolved metagenomics,
* ... be aware of shortcomings, bottlenecks, and ways to solve problems that typically arise,
* ... know how to get started with genome-resolved metagenomics on your own.

Feel free to ask/approach us throughout the workshop if you have particular questions that we maybe don't cover in depth 😉.

## Session 00 | Getting everything up and running

For the workshop, we use computing resources provided by the German Network for Bioinformatics Infrastructure [`deNBI`](https://www.denbi.de/cloud). For every participant we have created a virtual machine (VM) with the following specs:

* 28 VCPUs
* 64 GB RAM
* 50 GB root disk

Each virtual machine (VM) comes with [`Ubuntu`](https://ubuntu.com/) as the pre-installed OS and features [`JupyterLab`](https://jupyter.org/) as the frontend for accessing [`Jupyter Notebooks`](https://jupyter-notebook.readthedocs.io/en/latest/). See below.

### Connecting to the deNBI server

To access `deNBI` virtual machines you either need a regular [`LifeScience RI`](https://lifescience-ri.eu/home.html) account, or a `deNBI` _hostel account_. You find instructions about how to get a _hostel account_ [here](https://signup.aai.lifescience-ri.eu/non/registrar/?vo=lifescience_hostel&targetnew=https%3A%2F%2Flifescience-ri.eu%2Faai%2Fhow-use&targetexisting=https%3A%2F%2Flifescience-ri.eu%2Faai%2Fhow-use&targetextended=https%3A%2F%2Flifescience-ri.eu%2Faai%2Fhow-use).

<img src="img/hostel_account.png" alt="Hostel account" width="750"/>

<font size="2"><i> Signing up for a hostel account. </i></font>


We have assigned one VM to each participant and provided each of you with details on how to connect to the VM and access `JupyterLab`.

<img src="img/deNBI_login.png" alt="Connect"/>

<font size="2"><i> How to connect. </i></font>

We only need access to the `JupyterLab` frontend.


## Session 00 | Using Jupyter Notebooks

`Jupyter Notebooks` are interactive, shareable computational documents that combine (executable) computer code, plain language descriptions, data (output, think graphs, tables), and interactive controls. Opening a notebook using `JupyterLab` as the graphical interface gives us a fast, easy-to-use environment for prototyping commands and tools.

![Say hello to the GUI](img/jupyter_GUI.png)

`Jupyter Notebooks` contain two different types of _cells_, _markdown cells_ and _code cells_. _Markdown cells_ contain formatted text. What you read right now is such a _markdown cell_. [`Markdown`](https://en.wikipedia.org/wiki/Markdown) is a "lightweight markup language for creating formatted text". If you _DOUBLE CLICK_ into the cell you read right now, you can see the formatting. A comprehensive guide to the `Markdown` syntax can be found [here](https://www.markdownguide.org/). 

---
❗**NOTE**

What makes `Jupyter Notebooks` even more awesome is that you can modify them while you use them. They are **not static**. You _DOUBLE CLICK_ a cell, modify it as you want (think additional comments), and run it (which, in the case of a markdown cell, translates into "typesetting" it). To run a cell, either click the _PLAY_ button or hit _SHIFT+ENTER_.

---

Below you find a _code cell_. Throughout the workshop we use _code cells_ to execute code on the commandline.

In [None]:
### An empty code cell
### every line starting with a "#" is considered a comment and not executed

During the workshop, we work through different `Jupyter Notebooks` that are executed on the `deNBI` server we have access to. 

## Session 00 | Basics of using the command-line interface (CLI)

Linux is king in bioinformatics, which makes the CLI omnipresent when you deal with any random task in bioinformatics. Let's spend some time familiarizing ourselves with a few basic, and useful commands.

Below you find a couple of _code cells_ that show what some exemplary commands/tools are doing.

---
❗**NOTE**

You know your way around the CLI? **Excellent**, the part below is really just optional and for participants that are not so familiar with the CLI yet.

---

### Getting to know some basic commands

In [1]:
### We move into our home directory
cd ~
### Print the working directory (where are we)
pwd
### We move to the data directory within our home folder
cd data
### The ls=list command allows us to show the content of the data folder
ls -lrth
### The data folder contains several subfolders
### containing needed databases, used data, the 
### jupyter notebooks, as well as conda environments
### see below

/home/ubuntu
total 16K
drwxrwxr-x  5 ubuntu ubuntu 4.0K Sep  7 06:12 DBs
drwxrwxr-x  3 ubuntu ubuntu 4.0K Sep  9 08:30 notebooks
drwxrwxr-x  2 ubuntu ubuntu 4.0K Sep 10 21:38 conda_env
drwxrwxr-x 14 ubuntu ubuntu 4.0K Sep 11 12:09 workshop


With the first command, `cd ~`, you always return to your home directory `/home/<username>`. `cd` stands for change directory and allows you to move to any specified location. The second command, `pwd`, prints the working directory (aka your current location). With `cd data` we move into the data directory that is located within your home folder. Finally, `ls -lrth` shows us the content of data. Everything listed after the `-` (here `-lrth`) is a so-called flag or option. `cd`, `pwd`, and `ls` are all command-line tools, and many of them come with a lot of options, which makes them very versatile and powerful. Let's figure out what the flags `-lrth` are doing. Command-line tools usually come with a help or manual, which you can, depending on the tool, call using `<command> --help`, `<command> -h`, or `man <command>`. What does the `ls` help tell us?!

In [4]:
### Calling the help of the program ls
ls --help

Usage: ls [OPTION]... [FILE]...
List information about the FILEs (the current directory by default).
Sort entries alphabetically if none of -cftuvSUX nor --sort is specified.

Mandatory arguments to long options are mandatory for short options too.
  -a, --all                  do not ignore entries starting with .
  -A, --almost-all           do not list implied . and ..
      --author               with -l, print the author of each file
  -b, --escape               print C-style escapes for nongraphic characters
      --block-size=SIZE      with -l, scale sizes by SIZE when printing them;
                               e.g., '--block-size=M'; see SIZE format below
  -B, --ignore-backups       do not list implied entries ending with ~
  -c                         with -lt: sort by, and show, ctime (time of last
                               modification of file status information);
                               with -l: show ctime and sort by name;
                               othe
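
The help output above is cut short, so here is a minimal sketch that takes `-lrth` apart one flag at a time. The scratch directory created with `mktemp -d` and the three file names are made up for illustration; `touch -t` only sets fixed modification times so that the sorting is reproducible.

```shell
# Hypothetical scratch directory with three files of known age
demo_dir=$(mktemp -d)
touch -t 202301010900 "$demo_dir/oldest.txt"
touch -t 202306010900 "$demo_dir/middle.txt"
touch -t 202309010900 "$demo_dir/newest.txt"

ls -l  "$demo_dir"   # -l: long format (permissions, owner, size, date)
ls -lh "$demo_dir"   # -h: with -l, print sizes in human-readable units
ls -t  "$demo_dir"   # -t: sort by modification time, newest first
ls -tr "$demo_dir"   # -r: reverse the sort order, newest last

# With -t and -r combined, the last line is the most recently modified file
last_listed=$(ls -tr "$demo_dir" | tail -n 1)
echo "last entry: $last_listed"

# Clean up the scratch directory
rm -r "$demo_dir"
```

Combined as `-lrth`, the most recently modified entry ends up at the bottom of the listing, right above your prompt, which is convenient in directories with many files.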

You lost your sense of time after the summer break. Check out `cal`.

In [None]:
### Calling the CLI's calendar
cal

You wonder how much memory your machine has left? Let's have a look.

In [None]:
### How much memory is left
free

You know you can call the help/manual of any commandline tool. These manuals can be really, really long. The [`tldr`](https://tldr.sh/) project is a community effort to streamline the manuals of commandline tools and provide useful examples for all of them. Check it out:

In [3]:
### Call tldr for ls
tldr ls

ls
List directory contents.
More information: https://www.gnu.org/software/coreutils/ls.

 - List files one per line:
   ls -1

 - List all files, including hidden files:
   ls -a

 - List all files, with trailing / added to directory names:
   ls -F

 - Long format list (permissions, ownership, size, and modification date) of all files:
   ls -la

 - Long format list with size displayed using human-readable units (KiB, MiB, GiB):
   ls -lh

 - Long format list sorted by size (descending):
   ls -lS

 - Long format list of all files, sorted by modification date (oldest first):
   ls -ltr

 - Only list directories:
   ls -d */


One concept that is often not easy to grasp for users that are new to the commandline is the difference between _relative_ and _absolute_ paths.

An _absolute_ path begins with the root directory and follows the tree branch by branch until the path to the desired directory or file is completed. What does that mean? On Linux machines, there is a directory that contains a lot of the system's programs: `/usr/bin`. In the root directory (`/`), there is a directory called `usr` that contains a directory called `bin`.

The difference between _absolute_ and _relative_ paths is that a _relative_ path starts from the current working directory. A couple of special symbols are used to represent relative positions in the file system: `.` refers to the working directory and `..` to the working directory's parent directory.

In [None]:
### We move to /usr/bin
cd /usr/bin
pwd
### The parent directory of /usr/bin is /usr
### we can get there by using absolute and relative paths
### 1| absolute path
cd /usr
pwd
### 2| relative path (first go back to /usr/bin, then move one level up)
cd /usr/bin
cd ..
pwd
### Back to the home directory
cd ~

We are lazy, we usually opt for the method that requires the least typing 😁.

You can use your newly acquired "navigation skills" to check out the filesystem of your machine a bit. Below you find some interesting directories/locations that are common for any Linux-based system.

### A (kind of) guided tour of the filesystem

You know how to move using the commandline, so let's use this knowledge to check out the file system.

|  **Directory** | **Comments**                                                     |
|----------------|------------------------------------------------------------------|
|      /         | the root directory                                               |
|      /bin      | programs needed by the system to boot and run                    |
|      /etc      | contains system-wide configuration files                         | 
|      /home     | each user has a directory in home                                |
|      /opt      | contains "optional software"                                     |
|      /root     | home folder of the root user (=sys admin)                        |
|      /tmp      | place for temporary, transient files                             |
|      /usr      | contains programs and support files used by users                |
|      /usr/bin  | executable programs held by the Linux system                     |
|      /var/log  | log files, records of system activity                            |

In [None]:
### How many programs does /usr/bin hold
cd /usr/bin
### ls lists the content of a directory
### here we pipe "|" its output into another program wc
### wc counts lines, words, etc. here it counts all files present
ls | wc -l
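
Pipes can also be chained. The sketch below shows the same counting idea on a small, reproducible example; the scratch directory and the empty placeholder files (named after common bioinformatics tools) are assumptions made purely for illustration.

```shell
# Hypothetical scratch directory with five empty placeholder files
demo_dir=$(mktemp -d)
touch "$demo_dir/bwa" "$demo_dir/bowtie2" "$demo_dir/samtools" \
      "$demo_dir/seqkit" "$demo_dir/spades.py"

# One pipe: ls writes one name per line, wc -l counts the lines
n_all=$(ls "$demo_dir" | wc -l)

# Two pipes: grep keeps only names starting with "s" before wc counts them
n_s=$(ls "$demo_dir" | grep '^s' | wc -l)

echo "entries: $n_all, starting with s: $n_s"

# Clean up the scratch directory
rm -r "$demo_dir"
```

Each program in the chain only sees the output of the program to its left, so small tools can be combined into more specific queries.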

### Dealing with files and directories

When you work on the commandline, you constantly deal with files and directories. Below you find some examples of commands that are needed to manipulate them.

#### Wildcards

One aspect that makes working with the commandline so powerful, and which adds a touch of magic to it, is _wildcards_. Wildcards are special characters that allow you to specify groups of files. Wildcards can be used with any command that accepts filenames and directories as arguments.

|  **Wildcard**  | **Match**                                                        |
|----------------|------------------------------------------------------------------|
|   *            | any sequence of characters (including none)                      |
|   ?            | any single character                                             |
|   \[ABC\]      | any character specified in [], here A, B, or C                   | 
|  \[!ABC\]      | any character NOT specified in []                                |
|  \[\[:class:\]\]   | any character that is a member of "class", see below         |
|  \[\[:alnum:\]\]   | any alphanumeric character                                   |
|  \[\[:alpha:\]\]   | any alphabetic character                                     |
|  \[\[:digit:\]\]   | any numeral                                                  |
|  \[\[:lower:\]\]   | any lowercase letter                                         |
|  \[\[:upper:\]\]   | any uppercase letter                                         |
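
The wildcards in the table can be tried out safely in a scratch directory. This is a minimal sketch: the directory and the file names are hypothetical. Note that in the shell a character class is written inside a bracket expression, for example `[[:digit:]]`.

```shell
# Hypothetical scratch directory with a handful of empty files
demo_dir=$(mktemp -d)
cd "$demo_dir"
touch sampleA.fastq sampleB.fastq sampleC.fasta sample1.fastq notes.txt

ls *.fastq              # * : every name ending in .fastq
ls sample?.fasta        # ? : exactly one character between "sample" and ".fasta"
ls sample[AB].*         # [AB] : sampleA.* or sampleB.*
ls sample[!AB].*        # [!AB] : any sample file that is neither A nor B
ls sample[[:digit:]].*  # [[:digit:]] : a numeral after "sample"

# Store a few results so that they can be inspected or counted
n_fastq=$(ls *.fastq | wc -l)
n_not_ab=$(ls sample[!AB].* | wc -l)
digit_hit=$(ls sample[[:digit:]].*)
echo "fastq files: $n_fastq, not A/B: $n_not_ab, digit match: $digit_hit"

# Leave and remove the scratch directory
cd "$OLDPWD"
rm -r "$demo_dir"
```

The shell expands the pattern before the command runs, so `ls` never sees the wildcard itself, only the list of matching names.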

#### mkdir, cp, mv, rm

The `mkdir` command is used to create directories.

In [None]:
### Let's create a directory and move into it
cd ~/data/workshop
mkdir 00_CLI
cd 00_CLI
### You can create multiple directories at once
mkdir test1 test2 test3
ls -lrt

The `cp` command copies files or directories.

In [None]:
### We create an empty .txt file with touch
pwd
touch test.txt
### and copy it from test.txt to new.txt
cp test.txt new.txt
### you can copy files into directories the same way
cp new.txt test1/
### check out the help of cp for commonly used options, e.g. -R
tldr cp
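
The `-R` option mentioned in the cell above copies a directory together with everything inside it. A minimal sketch, assuming a hypothetical scratch directory and made-up folder and file names:

```shell
# Hypothetical project folder with one nested file
work=$(mktemp -d)
mkdir -p "$work/project/raw"
touch "$work/project/raw/reads.fastq"

# Without -R, cp skips directories; with -R the whole tree is copied
cp -R "$work/project" "$work/project_backup"

# The nested file is now also present in the copy
copied=$(ls "$work/project_backup/raw")
echo "copied: $copied"

# Clean up the scratch directory
rm -r "$work"
```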

`mv` is used in very much the same way as `cp`. If `cp` is _copy & paste_, then `mv` is the equivalent of _cut & paste_.

In [None]:
### Move new.txt
mv new.txt test2/
### check out the help of mv for commonly used options, e.g. -i
tldr mv
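
The heading of this part also lists `rm`, which the cells above do not exercise. Here is a minimal sketch of how it is typically used; the scratch directory and all file names are hypothetical, so nothing in your workshop folder is touched.

```shell
# Hypothetical scratch directory with a file and a nested folder
scratch=$(mktemp -d)
touch "$scratch/obsolete.txt"
mkdir -p "$scratch/old_results/logs"
touch "$scratch/old_results/logs/run1.log"

# Remove a single file
rm "$scratch/obsolete.txt"

# Plain rm refuses to remove directories; -r removes a directory and its content
rm -r "$scratch/old_results"

# Count what is left (ls -A also lists hidden entries)
remaining=$(ls -A "$scratch" | wc -l)
echo "entries left in scratch directory: $remaining"

# rmdir only removes empty directories, which makes it a useful safety net
rmdir "$scratch"
```

There is no trash bin on the commandline: what `rm` removes is gone. Adding `-i` makes `rm` ask for confirmation before each removal.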

### A quick word about conda

Throughout the course we use many different software tools. A lot of these tools have conflicting dependencies (tool X relies on package Z, tool Y too, but tool Y needs version 0.2 of Z, while tool X needs version 0.1). In order to have them available anyway, we use `virtual environments` (and `containers`) a lot in bioinformatics.

`conda` is a package manager and tool to set up and manage `virtual environments`. `Virtual environments` can be best imagined as encapsulations that are to a certain extent isolated from the respective computer system/operating system. 

By default, every user on our servers can make use of global `conda` environments.

One can check available `conda` environments as follows:

`conda info --envs`

Environments are activated and deactivated using the following commands:

`conda activate <name_of_environment>`

`conda deactivate`
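
The commands above can be tried with the `base` environment, which every `conda` installation provides. This is a minimal sketch, assuming a bash shell: the `command -v` guard and the `conda_status` variable are additions for illustration, and the `shell.bash hook` line is only needed where `conda activate` has not been initialised yet (for example in plain scripts).

```shell
if command -v conda >/dev/null 2>&1; then
    # Make "conda activate" available in this shell session
    eval "$(conda shell.bash hook)"
    conda info --envs      # list the available environments
    conda activate base    # enter the default environment
    conda deactivate       # and leave it again
    conda_status="available"
else
    conda_status="missing"
fi
echo "conda is ${conda_status} on this machine"
```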

### Glossary

Below you find the definitions of common metagenomics vocabulary (modified from [here](https://anvio.org/vocabulary/)).

`Read recruitment`: A set of computational strategies to align sequencing reads to one or more reference sequences. Read recruitment is the basis for determining coverage.

`Coverage`: Average number of reads that map to each (!) nucleotide position in a reference sequence. Proxy for abundance in the context of metagenome analysis.

`Detection`: The proportion of nucleotides in a given reference sequence that are covered by at least one short read.
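
To make `Coverage` and `Detection` concrete, here is a minimal worked example. The per-position read depths are invented for a reference of ten nucleotides; real values would come from a read recruitment step.

```shell
# Hypothetical read depth at each of 10 positions of a reference sequence
depths="0 0 3 5 4 0 2 6 0 0"

# Coverage: average depth over all positions (20 reads / 10 positions)
coverage=$(printf '%s\n' $depths | awk '{ sum += $1 } END { printf "%.2f", sum / NR }')

# Detection: fraction of positions covered by at least one read (5 of 10)
detection=$(printf '%s\n' $depths | awk '$1 > 0 { covered++ } END { printf "%.2f", covered / NR }')

echo "coverage: $coverage, detection: $detection"
```

In this toy case the mean coverage is 2.00 although half of the reference is not covered at all, which is why the two values are usually looked at together.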

`Contigs`: A contiguous segment of DNA that is often ‘assembled’ from short reads or long reads, but still represents only a fraction of the longer context to which it belongs.

`Binning`: Grouping contigs that belong to the same population, often based on differential coverage and sequence composition data (e.g. GC-content, k-mer profiles).

`MAG`: A metagenome-assembled genome, that is, a genome reconstructed or recovered from a metagenome. People often also refer to MAGs as bins.

`Population`: Co-existing microbes in an environment whose genomes are similar enough to map to the context of the same reference genome.

`Metagenome`: The entire DNA content of an environment, includes extracellular DNA, can include host DNA when looking for instance at gut or plant microbiomes.

`Pangenome`: The entire collection of genes found in two or more genomes.

`Phylogenomics`: The practice of inferring evolutionary history and relationships between different organisms, based on genomic differences across multiple conserved genes.

`Gene cluster`: Fundamental units of pangenomes which appear in the literature also as ‘protein clusters’, ‘orthogroups’, ‘groups of orthologous genes’, or ‘operational protein families’ (and they should not be confused with biosynthetic gene clusters which describe functionally related genes that belong to the same operon in a single chromosome). Commonly used computational strategies for pangenomics that consider entire contents of input genomes determine gene clusters typically by (1) identifying all genes among a set of genomes, (2) computing similarities between each gene using translated DNA sequences, and (3) determining which genes are homologous enough to be described in the same cluster. Hence, a gene cluster in a given pangenome corresponds to a de novo identified virtual construct that contains one or more genes from one or more genomes.


`Single-copy core gene`: A gene that is found in the vast majority of genomes and yet occurs only once within a single genome.

`Completion`: A rough estimate of how completely a set of contigs represents a full genome based on the presence or absence of single-copy core genes (SCGs) they contain. 

`Redundancy/Contamination`: A measure of how many copies of each single-copy core gene (SCG) are found within a genome. Due to the special single-copy nature of SCGs, their occurrence as multi-copy in a genome is commonly used as an estimate of the level of ‘contamination’ within a genome bin.
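
The last two definitions can be illustrated with a small calculation. This is a sketch under stated assumptions: the SCG copy numbers are invented, and the formulas used here (share of SCGs found at least once, and extra copies relative to the number of SCGs) are one simple convention. Dedicated tools use larger, often lineage-specific marker sets and may compute these values differently.

```shell
# Hypothetical copy numbers of 10 single-copy core genes in one genome bin
# 0 = not found, 1 = found once, 2 or more = found in multiple copies
scg_counts="1 1 0 1 2 1 1 0 1 3"

# Completion: percentage of SCGs found at least once (8 of 10)
completion=$(printf '%s\n' $scg_counts | awk '$1 >= 1 { present++ } END { printf "%.1f", 100 * present / NR }')

# Redundancy: extra copies beyond the first, relative to the number of SCGs (3 of 10)
redundancy=$(printf '%s\n' $scg_counts | awk '$1 > 1 { extra += $1 - 1 } END { printf "%.1f", 100 * extra / NR }')

echo "completion: ${completion}%, redundancy: ${redundancy}%"
```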



---
🔓**SUMMARY**

* 💻 we have set up access to the server we are using, 
* 👊 familiarized ourselves with the JupyterLab interface, 
* 🔍 and dabbled in command-line usage! 
  
--- 

<sub> © Carl-Eric Wegner, 2023-08 </sub>