# ChemBioSys and AquaDiva | Genome-resolved metagenomics workshop

## Session 04 | Phylogenomics and Phylogenetics

Date: 18-19 September 2023

---
### **What is a phylogenetic tree?**

A phylogenetic tree is a diagram of the evolutionary relationships among biological organisms; phylogenetics is the study of those relationships and traces the shared patterns between lineages. Trees can be inferred from DNA sequences, amino acid sequences or even morphology.

<font size="2"> Darwin's tree of life from 'On the Origin of Species'</font>

![Darwintreeoflife](attachment:0BMupm.jpg)



.......................

❓ **Fun fact:**
- The term phylogeny derives from 'Phylogenie', which was first introduced by Ernst Haeckel in 1866 in Jena.

.......................


---

### **Where do we use phylogenetic trees?**
 1. Evolutionary progression of species across time 
     - e.g., in ancient metagenomics and population genetics<br>
<br>
 2. Organise and explore species with similar traits, producing possibly similar compounds 
     - e.g., in drug and natural product discovery<br>
<br>
 3. Follow progression of viruses 
     - e.g. in virology and disease transmission<br>
<br>
 4. Explore family trees 
     - e.g., Ancestry(c) and 23andMe(c)<br>
<br>
 5. Explore biodiversity changes in relation to environmental factors
     - e.g. in ecology



<font size="2"> [source](https://evolution.berkeley.edu/what-are-evograms/the-emergence-of-humans/#) - Evolution of Hominids across time  </font>


![hominid_evo.jpg](attachment:hominid_evo.jpg)

<font size="2"> [source](https://onlinelibrary.wiley.com/doi/10.1002/vms3.394) - Evolution of COVID-19  </font>


![vms3394-fig-0002-m.png](attachment:vms3394-fig-0002-m.png)

---

### **Phylogenetic terminology**

- **Branches/edges**: the lines connecting the nodes, each representing a lineage (vertical lines when the tree is drawn with the root at the bottom)
- **Internal nodes**: the points where branches diverge, each representing a speciation event from a common ancestor
- **Root**: the trunk at the base of the tree
- **Root node**: the most recent common ancestor of all of the taxa represented on the tree
- **Time**: runs from the oldest at the bottom to the most recent at the top
- **Clade/monophyletic**: a group of taxa that includes a common ancestor and all of its descendants
- **Paraphyletic**: a group that includes a common ancestor but excludes one or more of its descendants
- **Polyphyletic**: a group of taxa that excludes their most recent common ancestor

For more info. please refer to this [article](https://www.nature.com/scitable/topicpage/reading-a-phylogenetic-tree-the-meaning-of-41956/#)
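
The terms above can be tried out on a tree written in Newick format, the plain-text notation commonly used for tree files. The snippet below is a minimal sketch on a hypothetical four-taxon tree (the taxa `A` to `D` and both helper functions are made up for illustration): it lists the leaves and counts the internal nodes, including the root node.

```bash
# Toy tree in Newick format (hypothetical taxa A-D with branch lengths)
newick='((A:0.1,B:0.2):0.3,(C:0.4,D:0.5):0.6);'

# Leaves: a leaf name directly follows "(" or ","
list_leaves() {
  printf '%s\n' "$1" | grep -oE '[(,][A-Za-z0-9_.]+' | tr -d '(,'
}

# Internal nodes (root node included): every ")" closes exactly one
count_internal_nodes() {
  printf '%s\n' "$1" | tr -cd ')' | wc -c | tr -d ' '
}

list_leaves "$newick"            # prints A, B, C and D on separate lines
count_internal_nodes "$newick"   # prints 3
```

Here `(A,B)` and `(C,D)` are clades, whereas a group made only of `B` and `C` would not include their most recent common ancestor.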


<font size="2"> [source1](https://www.researchgate.net/publication/8567873_Phylogenetic_analyses_A_brief_introduction_to_methods_and_their_application) + [source2](https://users.ugent.be/~avierstr/principles/phylogeny.html) - Tree terminology </font>

![treeterm.gif ](attachment:treeterm.gif)

![Tree-terminology-Considering-the-same-true-phylogeny-as-FIGURE-1-the-figure-shows-some.png](attachment:Tree-terminology-Considering-the-same-true-phylogeny-as-FIGURE-1-the-figure-shows-some.png)

---


### 🔧 **Game time:**
| Term | Colour |
| --- | --- |
| Paraphyletic | Green  |
| Polyphyletic | Blue   |
| Clade        | Red    |
| Not a clade  | Orange |


<img src="img/game_tree.png" width="500"/>
<font size="2"> 

---

### **Types of phylogenetic trees**
 - **Phylogram**
     - Non-ultrametric: branch lengths measure the amount of evolutionary change (substitutions and site variation), so the distances from the root to the leaves differ.<br>
<br>
 - **Chronogram**
     - Ultrametric: branch lengths reflect time, so the distance from the root to every leaf is the same.


---

### **Inferring phylogenetic trees**

Trees are constructed using mathematical models of sequence change (transitions, transversions, deletions and insertions) together with one of the following approaches:

- **maximum parsimony**
     - selects the tree that requires the fewest evolutionary changes<br>
<br>
- **maximum likelihood (ML)**
     - evaluates candidate trees and gives each one a likelihood score
     - the likelihoods of the individual aligned positions are multiplied to give a total likelihood for each tree. The tree with the maximum likelihood is the one under which the observed alignment is most probable.
     - most commonly used<br>
<br>
- **MCMC Bayesian inference**
     - Markov chain Monte Carlo (MCMC) sampling uses probability distributions to describe the uncertainty of all unknowns, including the model parameters. *Catch:* prior knowledge or assumptions about the parameters are required; they are combined with the likelihood of the observed data to obtain a posterior distribution of the parameters.
     - e.g. BEAST<br>
<br>
- **distance based**
     - pairwise distances, e.g. Mash distances
     - followed by clustering, e.g. neighbor joining (which does not assume a constant evolutionary rate) or UPGMA (which does)
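
To make the distance-based idea concrete, here is a minimal sketch of the simplest pairwise distance, the p-distance: the fraction of aligned positions at which two sequences differ. The function name and the toy sequences are assumptions for illustration only; tools such as Mash estimate distances differently (from k-mer sketches rather than from an alignment).

```bash
# p-distance between two aligned sequences of equal length.
# Columns containing a gap ("-") in either sequence are skipped.
p_distance() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = length(a); diff = 0; compared = 0
    for (i = 1; i <= n; i++) {
      x = substr(a, i, 1); y = substr(b, i, 1)
      if (x == "-" || y == "-") continue
      compared++
      if (x != y) diff++
    }
    printf "%.3f\n", (compared ? diff / compared : 0)
  }'
}

p_distance "ACGTACGT" "ACGTTCGA"   # 2 differences over 8 sites
p_distance "ACGT-CGT" "ACGTACGA"   # 1 difference over 7 ungapped sites
```

A matrix of such pairwise distances is the input for a clustering step such as neighbor joining.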

---
### **Issues to consider before constructing phylogenetic trees**
- **Stratified sampling**?
    - make sure that the sample set is large enough to cover the population or lineage to be studied<br>
<br>
- **Alignment based tree**? 
    - what to align: genomes, orthologues, core genes, etc.<br>
<br>
- **Genome qualities**? 
    - sequences should come from high quality genomes (e.g. coverage and MIMAG quality criteria)<br>
<br>
- **Sequence qualities**? 
    - complete or incomplete sequences<br>
<br>
- **Rooted or unrooted** tree?
    - should the tree be rooted on a hypothetical common ancestor, or is the direction of evolutionary transformation not important (branch lengths = amount of genetic change)?<br>
<br>
- **Timescale important**?<br>
<br>
- **Bootstrapping required**? 
    - It helps to estimate the reliability and robustness of the branches in the phylogenetic tree by resampling the alignment columns with replacement and rebuilding the tree each time; the support of a branch is the frequency at which it appears in the set of bootstrap trees.
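
As a rough illustration of how a bootstrap value is read, the sketch below counts how often a clade appears across a set of bootstrap replicates. The replicate data and the `A+B` clade notation are invented for this example; real tools compare tree topologies rather than text labels.

```bash
# Each line lists the clades recovered in one bootstrap replicate (toy data)
replicates='A+B C+D
A+B C+D
A+C B+D
A+B C+D'

# Percentage of replicates in which the given clade was recovered
bootstrap_support() {  # usage: bootstrap_support CLADE < replicates
  awk -v clade="$1" '
    { for (i = 1; i <= NF; i++) if ($i == clade) { hits++; break } }
    END { printf "%d\n", (NR ? 100 * hits / NR : 0) }'
}

printf '%s\n' "$replicates" | bootstrap_support "A+B"   # 3 of 4 replicates
```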

### <code style="background:blue;color:black">**Hands on time!**</code>


---

For phylogenetic tree construction we used the highest quality bins (completeness >95% and contamination <2%). Either these bins were not classified to species level, so a phylogenetic analysis is required to identify the closest genome, or we want to know which genomes share the same genomic traits with our bins for downstream functional analysis.

**Note** The bin names will differ in each of your cases, so we showcase the phylogenetic analysis using three bins that can be found here `~/data/workshop/05_GTOTREE/bins`


| Bins | Completeness | Contamination |
| --- | --- | --- |
| BIN_X aka Nitrospira | 100 | 0 |
| BIN_Y aka E.coli | 99.48 | 0.15 |
| BIN_Z aka Y.pestis | 98.69 | 0.86 |

Here we chose 3 tools to showcase a typical phylogenetic analysis, which is also described in detail in [Klapper, Hübner and Ibrahim *et al.*, 2023](https://www.science.org/doi/10.1126/science.adf5300). 

Alternative tools include, amongst others, [BEAST](https://github.com/beast-dev/beast-mcmc), [RAxML-NG](https://github.com/amkozlov/raxml-ng), [IQ-TREE](https://github.com/Cibiv/IQ-TREE) and [Mash](https://github.com/marbl/Mash).

---

#### (1) GToTree
- a so-called quick and dirty approach (using approximately-maximum-likelihood) to identify which clade the unknown bin falls in. It is a phylogenomics pipeline that wraps many tools associated with phylogenetic tree construction.
- if you want to align your bins against the entire bacterial domain, you would use the HMM set based on 74 single copy genes generated by [Hug et al. 2016](https://www.nature.com/articles/nmicrobiol201648). These so-called SCGs/core genes are orthologous genes shared by the genomes in question.

<font size="2"> [source](https://github.com/AstrobioMike/GToTree/) - GToTree  </font>
![GToTree-Overview.png](attachment:GToTree-Overview.png)

---

**Download ref genomes (bacteria-archaea) from NCBI = 4924**

On 05.09.2023:
- complete genomes
- assembly
- taxonomy check OK
- latest refseq
- representative

In [None]:
# Load the environment
conda activate blast
#conda activate phylogenomics_I

# Set-up the directory paths
PROJECT_DIR=~/data/workshop
OUTPUT_DIR=$PROJECT_DIR/05_GTOTREE
BINS=$OUTPUT_DIR/bins

mkdir -p "$OUTPUT_DIR"
cd "$OUTPUT_DIR"

# Grab the search query from the Assembly database on NCBI first - https://www.ncbi.nlm.nih.gov/assembly -
# (parentheses balanced so that the filters apply to both Archaea and Bacteria)
esearch -db assembly -query '(("Archaea"[Organism] OR Archaea[All Fields]) OR ("Bacteria"[Organism] OR Bacteria[All Fields])) AND (latest[filter] AND "complete genome"[filter] AND "representative genome"[filter] AND all[filter] NOT anomalous[filter] AND "taxonomy check ok"[filter])' \
| esummary | xtract -pattern DocumentSummary -element AssemblyAccession > "$OUTPUT_DIR/assembly-accs-complete-genomes.txt"
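
Before handing the accession list to GToTree, it can be worth checking that every line looks like an assembly accession. The following is a small sketch under the assumption that accessions follow the usual `GCF_`/`GCA_` pattern with nine digits and a version suffix; the helper name is hypothetical.

```bash
# Count the lines that look like NCBI assembly accessions (e.g. GCF_001273775.1)
count_valid_accessions() {  # reads one accession per line on stdin
  grep -cE '^GC[AF]_[0-9]{9}\.[0-9]+$' || true
}

printf 'GCF_001273775.1\nGCA_014701235.1\nnot_an_accession\n' | count_valid_accessions
```

On the real file, `count_valid_accessions < assembly-accs-complete-genomes.txt` can be compared against `wc -l` of the same file; the two numbers should agree.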


**Construct the phylogenetic tree**:
 - the single copy gene set used corresponds to bacteria and archaea (24 genes)
 - you can find all HMM sets already available for GToTree [here](https://github.com/AstrobioMike/GToTree/blob/master/hmm_sets/hmm-sources-and-info.tsv)

<img src="img/hmm_gtotree_scgs.png" width="500"/>
<font size="2"> 




<span style="color:red">*Note - The following takes about 3 hours and 7 minutes to run. We will **skip** that today.*
</span>



In [None]:
# Load the environment
conda activate gtotree

# Set-up the directory paths
PROJECT_DIR=~/data/workshop
OUTPUT_DIR=$PROJECT_DIR/05_GTOTREE
BINS=$OUTPUT_DIR/bins

mkdir -p "$OUTPUT_DIR"
cd "$OUTPUT_DIR"

# Create a file with fasta paths
ls $BINS/* > fasta_files.txt

# Run gtotree
GToTree -a $OUTPUT_DIR/assembly-accs-complete-genomes.txt \
        -f fasta_files.txt -H Bacteria_and_Archaea \
        -t -L Species,Strain -j 25 \
        -o refseq-compgenomes
        
conda deactivate


Results: 
1. Check the log file to see whether some genomes were removed from the tree 

```
less ~/data/workshop/05_GTOTREE/refseq-compgenomes/gtotree-runlog.txt
```
2. Download the tree file `~/data/workshop/05_GTOTREE/refseq-compgenomes/refseq-compgenomes.tre`<br>
<br>
3. Open [iTOL](https://itol.embl.de/) in the browser, click on `My trees` then `Tree upload` and drag the tree in.<br>
<br>
4. Download the `leaf_colours.txt` and drag the file into the tree. This colours the branches containing your samples of interest. **Note** Change the entries according to your own bin names.<br>
<br>
5. We can now infer the bin taxonomy:

| Bins | Closest taxa |
| --- | --- |
| BIN_X aka Nitrospira | Nitrospira moscoviensis |
| BIN_Y aka E.coli | Escherichia fergusonii |
| BIN_Z aka Y.pestis | Yersinia pestis |
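
The `leaf_colours.txt` file from step 4 is a plain-text iTOL annotation file. The sketch below shows one way such a file could be generated for your own bin names. It assumes the iTOL `TREE_COLORS` template layout (header, `SEPARATOR TAB`, `DATA`, then one `ID  branch  colour  style  width` line per leaf); please compare against the template offered on the iTOL help pages before relying on it. The function name and the colour are hypothetical.

```bash
# Write an iTOL branch-colour annotation for a list of leaf IDs (tab-separated)
make_itol_colours() {  # usage: make_itol_colours COLOUR LEAF_ID...
  colour=$1; shift
  printf 'TREE_COLORS\nSEPARATOR TAB\nDATA\n'
  for leaf in "$@"; do
    # columns: leaf ID, annotation type, colour, line style, width factor
    printf '%s\tbranch\t%s\tnormal\t2\n' "$leaf" "$colour"
  done
}

make_itol_colours '#ff0000' BIN_X BIN_Y BIN_Z
```

Redirecting the output to `leaf_colours.txt` gives a file that can be dragged onto the tree; the leaf IDs must match the leaf names in the tree exactly.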

Main tree

<img src="img/gtotree_main.png" width="500"/>
<font size="2"> 

Pruned: BIN_X

<img src="img/gtotree_main_bin_X.png" width="500"/>
<font size="2"> 


Pruned: BIN_Y and BIN_Z

<img src="img/gtotree_pruned_biny_z.png" width="500"/>
<font size="2"> 

We can now download the closest genomes found to share a common ancestral node and use them for downstream analyses, e.g. biosynthetic gene cluster alignment, pangenome, SNP and genetic variant analyses. 

As an example, we will only process the Nitrospira bin (BIN_X) downstream to explore which genome our unknown bin is closest to in the Nitrospira genus.

**BIN_X - aka Nitrospira**: shares an LCA with GCF_001273775.1 Nitrospira moscoviensis NSP M-1. When the tree is pruned to this clade, we obtain this one genome sharing the LCA, plus two other genomes that are phylogenetically more distant (GCF_014701235.1_Dissulfurispira_thermophila_T55J and GCF_946900835.1_Nitrospina_watsonii_347). A quick search in the [NCBI](https://www.ncbi.nlm.nih.gov/) assembly database shows that there is only one representative genome for each of these genera. Therefore it is most likely that this bin is very similar to Nitrospira moscoviensis NSP M-1.


---

#### (2) PhyloPhlAn 3.0
- uses clade-specific maximally informative phylogenetic markers
- maps the core genes of the input genomes and metagenome-assembled genomes against these markers (the companion `phylophlan_metagenomic` tool additionally uses Mash distances). The nucleotide FASTA files are used as input and the sequences are translated. The phylogenetic markers are core genes obtained from UniRef90. 
- UniRef is a clustered UniProtKB database that aims to reduce redundancy and computational load while maintaining a high coverage of protein sequences. It groups protein sequences from UniProtKB into clusters based on sequence similarity (50%, 90% or 100% identity).
- for usage, please refer to the [documentation](https://github.com/biobakery/phylophlan)

In our example here, BIN_X shared an ancestral node with GCF_001273775.1_Nitrospira_moscoviensis_NSP M-1. Looking at [UniRef90](https://www.uniprot.org/uniref?query=Nitrospira+moscoviensis&facets=identity%3A0.9), there are 5,065 core genes. To test whether the choice of core genes influences the phylogeny, we can repeat the analysis using Dissulfurispira thermophila, which has 2,197 core genes.



<img src="img/phyloPhlAn_overview.png" width="500"/>
<font size="2">

<font size="2">[Asnicar et al. 2020](https://www.nature.com/articles/s41467-020-16366-7) 

---

<span style="color:red">*Note - The following takes about 9 days to run and needs 20 GB of space to store the 700 genomes. We will **skip** that today due to time and space limitations.*
</span>

**Download the root genome from NCBI (Assembly database)**
- The root genome is Dissulfurispira thermophila and can be downloaded from [NCBI](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_014701235.1/)

In [None]:
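# Download is through the NCBI FTP server (https://ftp.ncbi.nlm.nih.gov/genomes/)
PROJECT_DIR=~/data/workshop

# Keep the root genome under the project directory, where the later steps expect it
mkdir -p "$PROJECT_DIR/root_genome"
cd "$PROJECT_DIR/root_genome"

wget https://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Dissulfurispira_thermophila/representative/GCF_014701235.1_ASM1470123v1/GCF_014701235.1_ASM1470123v1_genomic.fna.gz
# Decompress the downloaded file
gunzip *.gz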

**Download all genomes corresponding to Nitrospira (743 genomes) from NCBI** :

All Nitrospira genomes are downloaded from the NCBI assembly database. Another way to reduce the genetic noise caused by many similar (complete and incomplete) genomes in the reference database is to cluster the genomes into representative genomes based on FastANI and genome quality. This can be done with [dRep](https://github.com/MrOlm/drep). Here we will skip this part due to time and space restrictions.

In [None]:
# Load the environment
conda activate blast

# Set-up the directory paths
PROJECT_DIR=~/data/workshop
OUTPUT_DIR=$PROJECT_DIR/06_PHYLOPHLAN
BINS=$PROJECT_DIR/05_GTOTREE/bins

mkdir -p "$OUTPUT_DIR"
cd "$OUTPUT_DIR"

# Grab the search query from the Assembly database on NCBI first (parentheses balanced)
esearch -db assembly -query '("Nitrospira"[Organism] OR "Nitrospiria"[Organism] OR Nitrospira[All Fields]) AND (latest[filter] AND ("complete genome"[filter] OR "scaffold level"[filter] OR "contig level"[filter]) AND all[filter] NOT anomalous[filter])' \
| esummary | xtract -pattern DocumentSummary -element AssemblyAccession > assembly-accs.txt

conda deactivate

# Load the environment
conda activate bit

# Download the fasta files corresponding to the accessions
mkdir -p "$OUTPUT_DIR/fasta"
cd "$OUTPUT_DIR/fasta"

bit-dl-ncbi-assemblies -w $OUTPUT_DIR/assembly-accs.txt -f fasta -j 20
gzip -d *.gz 


**Construct the tree**
- here we use `DIAMOND` to map the core genes across gene orthologues and `trimAl` to trim the sequences
- here we use `MAFFT` for multiple sequence alignment
- here we use `FastTree` for the initial tree and `RAxML` for the best representative tree with bootstrapping

In [None]:
# Load the environment
conda activate phylophlan

# Set-up the directory paths
PROJECT_DIR=~/data/workshop
OUTPUT_DIR=$PROJECT_DIR/06_PHYLOPHLAN
BINS=$PROJECT_DIR/05_GTOTREE/bins

mkdir -p "$OUTPUT_DIR/fasta"
cd "$OUTPUT_DIR"

cp $BINS/BIN_X* $OUTPUT_DIR/fasta/
cp $PROJECT_DIR/root_genome/*.fna $OUTPUT_DIR/fasta/

# Change the extension from .fa to .fna (skipped if no .fa files are present)
for file in "$OUTPUT_DIR"/fasta/*.fa; do
    [ -e "$file" ] || continue
    mv "$file" "${file%.fa}.fna"
done

mkdir -p "$OUTPUT_DIR/tree"
cd "$OUTPUT_DIR/tree"

# Make config file: alignment based on translated sequences, using mafft and diamond. If nucleotide alignments are needed, use --force_nucleotides here and in the phylophlan command.
phylophlan_write_config_file \
    -d a \
    -o $OUTPUT_DIR/tree/config.cfg \
    --db_aa diamond \
    --map_dna diamond \
    --map_aa diamond \
    --msa mafft \
    --trim trimal \
    --tree1 fasttree \
    --tree2 raxml
    
# Set up the database (either the general PhyloPhlAn database or a species-specific one).
# Here Nitrospira was used because otherwise the query seqs would be removed for not fulfilling the hmm models created to retain the sequences.
mkdir -p $OUTPUT_DIR/tree/core_genes/

phylophlan_setup_database \
    -g s__Nitrospira_moscoviensis --database_update \
    -o $OUTPUT_DIR/tree/core_genes/ \
    --verbose
    
## Add the --clean_all flag if a previous phylophlan run crashed
phylophlan \
-i $OUTPUT_DIR/fasta \
-d $OUTPUT_DIR/tree/core_genes/s__Nitrospira_moscoviensis \
-o $OUTPUT_DIR/TREE \
--diversity low \
--accurate \
-f $OUTPUT_DIR/tree/config.cfg \
--nproc 28 --min_num_markers 1 \
--verbose > $OUTPUT_DIR/tree/phylophlan_output.log

conda deactivate

# add bootstrapping
cd $OUTPUT_DIR/TREE 
#raxmlHPC-PTHREADS-SSE3 -m PROTCATLG -f a -x 12345 -p 12345 -# 100 -T 28 -w $OUT/TREE_chlorobiales_genomes/fasta_concatenated.aln -s $OUT/TREE_chlorobiales_genomes/*.aln -n bootstrap.tre
raxmlHPC-PTHREADS-SSE3 -s fasta_concatenated.aln -n bootstrap_tree.tre -f a -m PROTCATLG -N 100 -p 1989 -x 1989 -T 30


**Annotate the tree**

In [None]:
# Load the environment
conda activate blast

# Set-up the directory paths
PROJECT_DIR=~/data/workshop
OUTPUT_DIR=$PROJECT_DIR/06_PHYLOPHLAN
BINS=$PROJECT_DIR/05_GTOTREE/bins

mkdir -p "$OUTPUT_DIR"
cd "$OUTPUT_DIR"

# Grab the assembly accession for annotation
ls $OUTPUT_DIR/fasta > ass_acc.txt

# Filter the txt file: strip the .fna extension from each file name
#sed -i 's/ /_/g'  ass_acc.txt
sed -i 's/\.fna$//' ass_acc.txt

file=ass_acc.txt
lines=$(cat $file)

for line in $lines ; do
esearch -db assembly -query "$line" |esummary | \
xtract -pattern DocumentSummary \
-sep "\t" -element AssemblyAccession,Taxid,AssemblyName,Organism,assembly-status >> ass_acc_2.txt ; done

# Create a iTOL accession to lineage dataset:
# Print the acc. list
awk -F '\t' '{print $1"\t"$4}' \
ass_acc_2.txt > ass_acc_itol2.txt

Results: 
1. Download the tree file `~/data/workshop/06_PHYLOPHLAN/TREE/xx.tre`.<br>
<br>
2. Open [iTOL](https://itol.embl.de/) in the browser, click on `My trees` then `Tree upload` and drag the tree in.<br>
<br>
3. Add the contents of `ass_acc_itol2.txt` to `leaf_labels.txt`. **Note** Check the separator you are using.<br>
<br>
4. Download the `leaf_labels.txt` and drag the file into the tree. This renames your leaf labels according to the NCBI taxonomic classification. **Note** Change the bin names in there accordingly.<br>
<br>
5. Our unknown bin is sharing the ancestral node with ... ?
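
The label file used in the steps above follows the iTOL `LABELS` annotation layout: a `LABELS` header, a `SEPARATOR` line, a `DATA` line, and then one `ID<TAB>new label` row per leaf. As a minimal sketch (the function name is hypothetical and the example row is only for illustration), a two-column table like `ass_acc_itol2.txt` could be wrapped as follows:

```bash
# Wrap "ID<TAB>label" rows read from stdin into an iTOL LABELS annotation
make_itol_labels() {
  printf 'LABELS\nSEPARATOR TAB\nDATA\n'
  # keep only rows that have both an ID and a label
  awk -F '\t' 'NF >= 2 && $1 != "" { print $1 "\t" $2 }'
}

printf 'GCF_001273775.1\tNitrospira moscoviensis\n' | make_itol_labels
```

With the real data this would be `make_itol_labels < ass_acc_itol2.txt > leaf_labels.txt`. The IDs in the first column have to match the leaf names in the tree exactly, so they may need adjusting if the tree uses the full file names.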

---
As both methods are based on SCG alignments, what about a distance-based comparison of orthologous fragments at the nucleotide level? Does it show the same clade separation?

---

#### (3) Average nucleotide identities (ANIs) 
- FastANI is a fast alignment-free method for whole-genome Average Nucleotide Identity (ANI) comparison
- ANI is the mean nucleotide identity of the orthologous fragments shared between two genomes; FastANI estimates it with MinHash-based (Mashmap) mapping rather than full alignment. 
- here we create a distance based tree using [FastANI](https://github.com/ParBLiSS/FastANI). An alternative is using [Mash](https://github.com/marbl/mash)



<img src="img/fastani_doc.png" width="500"/>
<font size="2"> 

<font size="2">[Jain et al. 2018](https://www.nature.com/articles/s41467-018-07641-9)

---

**All genomes**

In [None]:
# Load the environment
conda activate fastani

# Set-up the directory paths
PROJECT_DIR=~/data/workshop
OUTPUT_DIR=$PROJECT_DIR/07_ANI
GENOMES=$PROJECT_DIR/06_PHYLOPHLAN/fasta
SCRIPTS=~/data/notebooks/scripts

mkdir -p "$OUTPUT_DIR"
cd "$OUTPUT_DIR"

# Download/copy over the pruned genomes
cp $GENOMES/*.fna $OUTPUT_DIR

# Prepare the genome ids for fastani (list only the genome fasta files)
ls *.fna > acc_IDs.txt

# Compare all genomes pairwise using 600-bp fragments (--fragLen 600; the fastANI default is 3000)
fastANI --ql acc_IDs.txt --rl acc_IDs.txt -t 25 -k 16 --fragLen 600 -o ANI_output_fg3000.txt --matrix

awk -F '\t' '{print $1"\t"$2"\t"$3}' ANI_output_fg3000.txt > ANI_output1_fg3000.txt
awk 'BEGIN {FS=OFS="\t"} 
           {col[$1]; row[$2]; val[$2,$1]=$3}
     END   {for(c in col) printf "%s", OFS c; print "";
            for(r in row)
              {printf "%s", r;
               for(c in col) printf "%s", OFS val[r,c]
               print ""}}' ANI_output1_fg3000.txt > ANI_OUT_MATRIX_fg3000.txt

# Load the environment
conda activate rst

# Using the matrix as input run the R script ANI.R
Rscript $SCRIPTS/ANI.R $OUTPUT_DIR

# move files to new dir
mkdir $OUTPUT_DIR/all_genomes
cp ANI_output_fg3000.txt $OUTPUT_DIR/all_genomes/
cp ANI_output_fg3000.txt.matrix $OUTPUT_DIR/all_genomes/
cp ANI_output1_fg3000.txt $OUTPUT_DIR/all_genomes/
cp ANI_OUT_MATRIX_fg3000.txt $OUTPUT_DIR/all_genomes/
cp ANI_output_fg3000.pdf $OUTPUT_DIR/all_genomes/
cp ANI_output_fg3000.png $OUTPUT_DIR/all_genomes/
cp acc_IDs.txt $OUTPUT_DIR/all_genomes/
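
The second `awk` command in the cell above turns the long three-column fastANI table (query, reference, ANI) into a wide matrix. The sketch below shows the same reshaping on toy data. It differs from the cell in two hedged ways: it keeps rows and columns in order of first appearance, and it writes `NA` for pairs without a reported value (fastANI does not report pairs whose ANI is far below roughly 80%). The function name and the genome names `g1` to `g3` are made up.

```bash
# Reshape "query<TAB>reference<TAB>ANI" rows into a reference x query matrix
pivot_ani() {
  awk 'BEGIN { FS = OFS = "\t" }
       { if (!($1 in seen_c)) { seen_c[$1]; cols[++nc] = $1 }
         if (!($2 in seen_r)) { seen_r[$2]; rows[++nr] = $2 }
         val[$2, $1] = $3 }
       END { for (j = 1; j <= nc; j++) printf "%s%s", OFS, cols[j]; print ""
             for (i = 1; i <= nr; i++) {
               printf "%s", rows[i]
               for (j = 1; j <= nc; j++) {
                 v = ((rows[i], cols[j]) in val) ? val[rows[i], cols[j]] : "NA"
                 printf "%s%s", OFS, v
               }
               print "" } }'
}

printf 'g1\tg1\t100\ng1\tg2\t97.5\ng2\tg1\t97.4\ng2\tg2\t100\ng3\tg3\t100\n' | pivot_ani
```

Filling the gaps explicitly keeps every row the same width, which makes the matrix easier to read into R.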


ANI heatmap:

<img src="img/ANI_all_genomes.png" width="500" height="500">
<font size="2"> 

---

As the figure is very noisy and we are only interested in the 'interesting' blue box, we will prune the tree. From `~/data/workshop/07_ANI/all_genomes/ANI_output_fg3000.pdf` we copy all genome IDs that fall in the same cluster as BIN_X into a new text file. This can be found in `~/data/workshop/07_ANI/acc_IDs_2.txt`.
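
As an alternative to copying the IDs by hand, the cluster could also be approximated directly from the fastANI output. This is only a sketch under stated assumptions: the 95% threshold is a hypothetical cut-off chosen for illustration (it is not the criterion used to build `acc_IDs_2.txt`), and the function name and toy rows are made up.

```bash
# List the reference genomes whose ANI to the query is at or above a threshold
select_cluster() {  # usage: select_cluster QUERY_NAME MIN_ANI < fastani_output
  awk -F '\t' -v q="$1" -v min="$2" \
    'index($1, q) && $3 + 0 >= min + 0 { print $2 }' | sort -u
}

printf 'BIN_X.fna\tBIN_X.fna\t100\nBIN_X.fna\tGCF_A.fna\t98.2\nBIN_X.fna\tGCF_B.fna\t81.0\nGCF_A.fna\tBIN_X.fna\t98.1\n' \
  | select_cluster BIN_X 95
```

The resulting list can be compared with the cluster seen in the heatmap before it is used as input for the next run.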

**BIN_X cluster only**

In [None]:
# Load the environment
conda activate fastani

# Set-up the directory paths
PROJECT_DIR=~/data/workshop
OUTPUT_DIR=$PROJECT_DIR/07_ANI
GENOMES=$PROJECT_DIR/06_PHYLOPHLAN/fasta
SCRIPTS=~/data/notebooks/scripts

mkdir -p "$OUTPUT_DIR"
cd "$OUTPUT_DIR"

# Download/copy over the pruned genomes
cp $GENOMES/*.fna $OUTPUT_DIR

# Prepare the genome ids for fastani (list only the genome fasta files)
ls *.fna > acc_IDs.txt

# Compare only the genomes listed in acc_IDs_2.txt, using 600-bp fragments (--fragLen 600; the fastANI default is 3000)
fastANI --ql acc_IDs_2.txt --rl acc_IDs_2.txt -t 25 -k 16 --fragLen 600 -o ANI_output_fg3000.txt --matrix

awk -F '\t' '{print $1"\t"$2"\t"$3}' ANI_output_fg3000.txt > ANI_output1_fg3000.txt
awk 'BEGIN {FS=OFS="\t"} 
           {col[$1]; row[$2]; val[$2,$1]=$3}
     END   {for(c in col) printf "%s", OFS c; print "";
            for(r in row)
              {printf "%s", r;
               for(c in col) printf "%s", OFS val[r,c]
               print ""}}' ANI_output1_fg3000.txt > ANI_OUT_MATRIX_fg3000.txt

# Load the environment
conda activate rst

# Using the matrix as input run the R script ANI.R
Rscript $SCRIPTS/ANI.R $OUTPUT_DIR


ANI heatmap:

<img src="img/ANI_pruned.png" width="500" height="500">
<font size="2"> 

---

---

### What is a pangenome?
A pangenome is the complete set of genes present in a group of related organisms. It provides insight into the evolutionary dynamics and genomic variation across the studied organisms/species. A pangenome analysis clusters genes based on sequence similarity and generates a pangenome matrix representing the presence or absence of genes across the strains; it therefore does not use a database itself but relies on the annotation tool used. A pangenome comprises:

- Core genes: genes that are shared by all groups being studied. These genes typically represent essential functions or traits that are conserved across the population.

- Dispensable genes: genes that are present in some individuals or subgroups but not in others. These may confer specific adaptations or traits that are beneficial under certain conditions but not universally required.

- Accessory genes: genes that are present in only a subset of individuals or strains within the species. These genes often contribute to the diversity and variability observed within the population, potentially influencing traits such as virulence, antibiotic resistance, or metabolic capabilities.
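
The distinction between core and non-core genes can be read straight off a presence/absence matrix. The following is a minimal sketch on a toy matrix; the gene names, genome names and the simplified two-way split into `core` (present in every genome) and `accessory` (present in only some) are assumptions for illustration and do not reproduce the output format of any particular tool.

```bash
# Classify genes from a CSV presence/absence matrix (header: gene,genome1,...)
classify_genes() {
  awk -F ',' 'NR == 1 { n = NF - 1; next }
    { present = 0
      for (i = 2; i <= NF; i++) present += ($i == 1)
      cls = (present == n) ? "core" : "accessory"
      print $1 "\t" present "/" n "\t" cls }'
}

printf 'gene,BIN_X,GCF_A,GCF_B\nrpoB,1,1,1\nnxrA,1,1,0\nhypX,1,0,0\n' | classify_genes
```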

There are many tools to carry out a pangenome analysis, e.g. [Panaroo](https://github.com/gtonkinhill/panaroo), [PGAP](https://academic.oup.com/bioinformatics/article/28/3/416/188161), etc. Here we chose [Roary](https://github.com/sanger-pathogens/Roary) to showcase it.


Today we will compare our unknown genome only with the 38 genomes found to be in the same clade as our BIN_X. We will use the same file we created above, `~/data/workshop/07_ANI/acc_IDs_2.txt`

**Annotate the genomes and explore the Pangenomes**

In [None]:
# Load the environment
conda activate prokka

# Set-up the directory paths
PROJECT_DIR=~/data/workshop
OUTPUT_DIR=$PROJECT_DIR/08_ROARY
GENOMES=$PROJECT_DIR/06_PHYLOPHLAN/fasta
SCRIPTS=~/data/notebooks/scripts

mkdir -p "$OUTPUT_DIR"
cd "$OUTPUT_DIR"

# Annotate each genome listed in acc_IDs_2.txt with prokka (one output directory per genome)
while IFS= read -r F; do
  N=$(basename "$F" .fna)
  mkdir -p "$OUTPUT_DIR/$N"
  prokka --quiet --metagenome --outdir "$OUTPUT_DIR/$N" --prefix "$N" --locustag "$N" --force --compliant --cpus 28 "$GENOMES/$F"
done < "$PROJECT_DIR/07_ANI/acc_IDs_2.txt"

# Load the environment
conda activate roary

roary -p 28 -e -n -v -r $OUTPUT_DIR/*/*.gff -f $OUTPUT_DIR/run

# Load the environment
conda activate rst

Rscript $SCRIPTS/ROARY.R $OUTPUT_DIR/run



Results: 
1. If you open `~/data/workshop/08_ROARY/run/gene_presence_absence.csv` you will find all details of the core/shared genes and accessory genes for further downstream functional and genome analysis<br>
<br>
2. CLI result:
``` 
[1] "The number of genes detected across the pangenomes (including hypothetical proteins) is equal to: 34589"
[1] "The number of genes detected across the pangenomes (excluding those without gene label) is equal to: 3712"

```
🔧 **Game time:**

We find that there is another genome in the database that has an identical pangenome, any idea which one ?


Pangenome heatmap: BIN_X *vs.* clade genomes without hypothetical genes

<img src="img/pangenomes.png" width="500" height="500">
<font size="2"> 

---

---

### 🔓**SUMMARY**

* 🎓 We defined phylogenetics and the different types of phylogenetic trees
* 📈 We discussed the points we need to consider before setting up a tree
* 🔍 We constructed ML and distance based phylogenetic trees
* 🎯 We defined pangenomes
* 🌲 We tested out a pangenome analysis
  
--- 
<sub> © Anan Ibrahim, 2023-09 </sub>