# Pipeline exercise

Fortunately, there are computational pipelines that let you process many samples jointly and that make the whole workflow more user-friendly. These pipelines also help to produce a consistent, documented, and therefore reproducible workflow. Here we are going to use the [SECAPR pipeline](https://github.com/AntonelliLab/seqcap_processor) on a dataset of **Ultraconserved Elements (UCEs)** that were sampled for the genus *Topaza* in South America.

![img.png](./img/topaza_distribution_sampling_map.png)

It is not clear whether the existing morphological species assignments are justified or whether there are cryptic species within these morphospecies. We want to use this UCE dataset to generate a phylogeny (species tree) of these samples, define coalescent species, and see whether these assignments agree with population genetic analyses using SNP data extracted from the UCEs.

In this tutorial you'll go through the following steps:


![](./img/secapr_workflow.png)



_________

**0)** Let's first make sure that we are connected to the correct software environment (`forbio_env`)

In [None]:
%%bash
module load Anaconda3/5.1.0
source activate forbio_env
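
To confirm that the activation worked, one option is to check the `CONDA_DEFAULT_ENV` variable, which conda sets to the name of the active environment. The following is a minimal sketch; the function name `check_env` is hypothetical and not part of conda or SECAPR.

```shell
# Hypothetical helper (not part of conda or SECAPR):
# succeeds if the named conda environment is currently active.
check_env() {
    local expected="$1"
    if [ "${CONDA_DEFAULT_ENV:-}" = "$expected" ]; then
        echo "OK: environment '$expected' is active"
    else
        echo "WARNING: expected '$expected' but found '${CONDA_DEFAULT_ENV:-none}'" >&2
        return 1
    fi
}

# Print a reminder if the tutorial environment is not active
check_env forbio_env || echo "Run 'source activate forbio_env' first"
```

Keep in mind that each `%%bash` cell starts its own shell, so the environment may need to be activated again within the cell where you run the check.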


_______

**1)** Then copy the pipeline tutorial folder into your directory at `/work/users/USERNAME/`.

<div class="alert alert-block alert-danger">
This is not your default home directory! Make sure you are really working in `/work/users/USERNAME/`
</div>


In [None]:
%%bash
cp -r /work/projects/forbio/tutorials_tobi/pipeline/ /work/users/USERNAME/


______

**2)** Now let's run the cleaning and trimming script for all of your samples.

<div class="alert alert-block alert-info">
We append `2> warnings.txt` to several of the following commands because the cluster prints many warning messages when loading some of the SECAPR dependencies. The `2>` redirects these messages (standard error) into the file `warnings.txt` instead of printing them in the notebook.
</div>
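
As a small illustration of what `2>` does, the sketch below uses a hypothetical function `noisy_command` that stands in for a SECAPR call. Its regular output is still printed, while everything written to standard error ends up in a `warnings.txt` file (here placed in a temporary directory so that the demo does not touch your project files).

```shell
# Work in a temporary directory so the demo does not touch your project files
demo_dir=$(mktemp -d)

# Hypothetical stand-in for a command that prints both results and warnings
noisy_command() {
    echo "result: 42 reads processed"          # goes to standard output (stdout)
    echo "Warning: deprecated dependency" >&2  # goes to standard error (stderr)
}

# '2>' redirects only stderr; stdout is still printed as usual
noisy_command 2> "$demo_dir/warnings.txt"

# The warning was captured in the file instead of cluttering the output
cat "$demo_dir/warnings.txt"
```

Note that `2>` overwrites `warnings.txt` each time a command is run; `2>>` would append to the file instead.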

Every command has a help function that shows you the available options. Check out the help function of `secapr clean_reads`:

In [None]:
%%bash
secapr clean_reads -h

Now run the cleaning and trimming with this command. Feel free to add any flags you feel are necessary.

In [None]:
%%bash
secapr clean_reads --input raw_reads/ --config helpfiles/adapters_info_topaza.txt --output cleaned_reads --index single --headCrop 10 2> warnings.txt

Once it is running, **INTERRUPT THIS COMMAND** using `ctrl+c`. Since cleaning all samples takes around 30 minutes, we will instead submit it as a job script. You can find the job script in the `scripts` folder. Fill in the correct paths and submit it with:

In [None]:
%%bash
sbatch scripts/clean_trim_secapr.sh
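
For orientation, a SLURM job script for this step might look roughly like the sketch below. This is an assumption, not the content of the provided `scripts/clean_trim_secapr.sh`: all `#SBATCH` values are placeholders, the working directory assumes the tutorial folder was copied as in step **1)**, and your cluster may require additional options such as an account name. The block only writes the example to a temporary file and prints it.

```shell
# Write an example job script to a temporary location (all #SBATCH values are placeholders)
job_script="$(mktemp -d)/clean_trim_example.sh"

cat > "$job_script" <<'EOF'
#!/bin/bash
#SBATCH --job-name=clean_trim
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --output=clean_trim_%j.log

# Load the same software environment as in step 0
module load Anaconda3/5.1.0
source activate forbio_env

# Assumed location of the tutorial folder; replace USERNAME with your own user name
cd /work/users/USERNAME/pipeline

# Same cleaning command as in the notebook cell
secapr clean_reads --input raw_reads/ --config helpfiles/adapters_info_topaza.txt --output cleaned_reads --index single --headCrop 10 2> warnings.txt
EOF

# Show the result; a file like this is what gets handed to sbatch
cat "$job_script"
```

After submitting a job, `squeue -u $USER` lists your queued and running jobs.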


______

**3)** You can check the quality of the cleaned reads for all samples using the `secapr quality_check` command. This will create a plot `QC_plots.pdf` with an overview of the failed and passed tests of all samples (you can skip this step for this exercise).

In [None]:
%%bash
secapr quality_check --input cleaned_reads/ --output quality_test 2> warnings.txt


______

**4)** Now run a de novo assembly:

In [None]:
%%bash
secapr assemble_reads --input ./cleaned_reads/ --output ./contigs_abyss --assembler abyss
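
After the assembly finishes, a quick sanity check is to count how many contigs were assembled per sample. The following is a minimal sketch, assuming that each sample's contigs are stored as a FASTA file; `count_fasta_records` is a hypothetical helper, and the demo file is made up so that the block can run on its own.

```shell
# Hypothetical helper: count sequences in a FASTA file (one '>' header per sequence)
count_fasta_records() {
    grep -c '^>' "$1"
}

# Made-up demo file with three short contigs
demo_fasta=$(mktemp)
printf '>contig_1\nACGTACGT\n>contig_2\nTTGACA\n>contig_3\nGGCC\n' > "$demo_fasta"

# Prints the number of contigs in the demo file
count_fasta_records "$demo_fasta"
```

To apply this to your own results, call the function on the contig FASTA files inside `contigs_abyss/`; the exact file names depend on your samples and SECAPR version.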


______

**5)** Extract the target regions:

In [None]:
%%bash
secapr find_target_contigs --contigs contigs_abyss/ --reference helpfiles/Tetrapods-UCE-2.5Kv1.fasta --output target_contigs  

Check the output folder and have a look at the `match_table.txt` file.
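
To summarize the table, one could count how many target loci were found per sample. The sketch below assumes that `match_table.txt` is tab-separated, with a header row of sample names, the locus name in the first column, and `1`/`0` marking presence or absence; check your own file before relying on this. The helper `loci_per_sample` and the demo table are hypothetical.

```shell
# Hypothetical helper: count the loci found (value 1) for each sample column
loci_per_sample() {
    awk -F'\t' '
        NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; n = NF; next }
        { for (i = 2; i <= n; i++) if ($i == 1) found[i]++ }
        END { for (i = 2; i <= n; i++) printf "%s\t%d\n", name[i], found[i] }
    ' "$1"
}

# Made-up demo table in the assumed format
demo_table=$(mktemp)
printf 'locus\tsample_A\tsample_B\n' >  "$demo_table"
printf 'uce-1\t1\t1\n'               >> "$demo_table"
printf 'uce-2\t1\t0\n'               >> "$demo_table"
printf 'uce-3\t0\t1\n'               >> "$demo_table"
printf 'uce-4\t1\t0\n'               >> "$demo_table"

# Prints one line per sample: sample name and number of loci found
loci_per_sample "$demo_table"
```

With your own data, the call would be `loci_per_sample target_contigs/match_table.txt`, assuming the file has the layout described above.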


______

**6)** Build multiple sequence alignments (MSAs) across all our samples:

In [None]:
%%bash
secapr align_sequences --sequences target_contigs/extracted_target_contigs_all_samples.fasta --output alignments/contig_alignments/ --aligner mafft --output-format fasta --no-trim --ambiguous


______

**7)** Now we run the reference assembly, using the consensus sequence of each contig MSA from step **6)** as the reference.

In [None]:
%%bash
secapr reference_assembly --reads cleaned_reads --reference_type alignment-consensus --reference alignments/contig_alignments --output remapped_reads --min_coverage 4

Inspect some of the files again using `samtools tview`.

______

**8)** You can use the `secapr locus_selection` function to find the loci that were assembled across all samples. You can set the number of extracted loci (`--n`) very high to ensure that all loci that are present in all samples are extracted.

In [None]:
%%bash
secapr locus_selection --input remapped_reads --output locus_selection/ --n 2000 --read_cov 3
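
The idea behind this step, keeping only loci that are covered well enough in every sample, can be illustrated with a small sketch. This is not SECAPR's implementation: the table format, the helper `loci_in_all_samples`, and the coverage values are made up for illustration.

```shell
# Hypothetical helper: print loci whose coverage is >= the threshold in every sample
# $1 = tab-separated table (header row, locus name in column 1), $2 = minimum coverage
loci_in_all_samples() {
    awk -F'\t' -v min="$2" '
        NR == 1 { next }                      # skip the header row
        {
            keep = 1
            for (i = 2; i <= NF; i++) if ($i + 0 < min) keep = 0
            if (keep) print $1
        }
    ' "$1"
}

# Made-up average read coverage per locus and sample
demo_cov=$(mktemp)
printf 'locus\tsample_A\tsample_B\n' >  "$demo_cov"
printf 'uce-1\t12.5\t8.0\n'          >> "$demo_cov"
printf 'uce-2\t2.1\t9.3\n'           >> "$demo_cov"
printf 'uce-3\t3.0\t4.4\n'           >> "$demo_cov"

# With a threshold of 3, uce-2 is dropped because sample_A only reaches 2.1
loci_in_all_samples "$demo_cov" 3
```

A single poorly covered sample is enough to exclude a locus, which is why the number of shared loci can be much smaller than the number of loci recovered in any one sample.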


______

**9)** Now we can build alignments from these loci:

In [None]:
%%bash
secapr align_sequences --sequences locus_selection/joined_fastas_selected_loci.fasta --output alignments/exon_intron_alignments/ --aligner mafft --output-format fasta --no-trim --ambiguous

What is the difference between these alignments and the ones we created in step **6)** ?
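
One way to approach this question is to compare the number of sequences and the alignment length for the same locus in both folders. The following is a minimal sketch; `alignment_summary` is a hypothetical helper and the demo alignment is made up.

```shell
# Hypothetical helper: print number of sequences and alignment length of a FASTA alignment
alignment_summary() {
    awk '
        /^>/ { nseq++; next }
        nseq == 1 { len += length($0) }     # sum the (possibly wrapped) lines of the first sequence
        END { printf "%d sequences, %d columns\n", nseq, len }
    ' "$1"
}

# Made-up demo alignment with two sequences of ten columns each
demo_aln=$(mktemp)
printf '>sample_A\nACGT--ACGT\n>sample_B\nACGTTTACGT\n' > "$demo_aln"

alignment_summary "$demo_aln"
```

Run the helper on an alignment file from `alignments/contig_alignments/` and on the corresponding file from `alignments/exon_intron_alignments/` and compare the two summaries.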


______

**10)** SECAPR also has a function that enables allele phasing. This will produce two separate BAM files per sample, which in turn can be summarized into two separate sequences (allele sequences) per sample and locus:

In [None]:
%%bash
secapr phase_alleles --input remapped_reads/ --output allele_sequences --min_coverage 3


______

**11)** Finally, we can use the sequences derived from the phased BAM files to generate allele sequence alignments (MSAs) for all samples:

In [None]:
%%bash
secapr align_sequences --sequences allele_sequences/joined_allele_fastas.fasta --output alignments/allele_alignments/ --aligner mafft --output-format fasta --no-trim --ambiguous