# Assembly pipeline results

***

## Introduction

When your sample data is in the Pathogen Informatics databases, it becomes available to the automated analysis pipelines. Once the pipelines have been requested and run, you can use the `pf` scripts to retrieve the results of each one.

The genome assembly pipeline used depends on the sequencing data and the organism:

  * [bacteria assembly](http://mediawiki.internal.sanger.ac.uk/index.php/Pathogen_Informatics_Bacterial_Assembly_Pipeline)
  * [virus assembly](http://mediawiki.internal.sanger.ac.uk/index.php/Pathogen_Informatics_Viral_Assembly_Pipeline)
  * [eukaryote assembly](http://mediawiki.internal.sanger.ac.uk/index.php/Pathogen_Informatics_Eukaryote_Assembly_Pipeline)
  * [pacbio assembly](http://mediawiki.internal.sanger.ac.uk/index.php/Pathogen_Informatics_Automated_PacBio_Assembly_Pipeline)

We can use `pf assembly` to return the location of assembly pipeline results.

In this section of the tutorial we will cover:

  * using `pf assembly` to get assembly pipeline results
  * filtering `pf assembly` results by program
  * using `pf assembly` to symlink assembly pipeline results
  * using `pf assembly` to get assembly statistics

***

## Exercise 8

**First, let's tell the system the location of our tutorial configuration file.**

In [9]:
export PF_CONFIG_FILE=$PWD/data/pathfind.conf



**Let's take a look at the `pf assembly` usage.**

In [None]:
pf assembly -h

In [20]:
pf info -t study -i 664

5477_6#1        Tw01_0055                 NA                        NA                        NA
5477_6#2        HK01_0168                 NA                        NA                        NA
5477_6#3        HK01_0036                 NA                        NA                        NA
5477_6#4        HK01_0147                 NA                        NA                        NA
5477_6#5        M01_9962                  NA                        NA                        NA
5477_6#6        M01_9964                  NA                        NA                        NA
5477_6#7        M01_0087                  NA                        NA                        NA
5477_6#8        M01_9995                  NA                        NA                        NA
5477_6#9        V01_99112                 NA                        NA                        NA
5477_6#10       V01_01107                 NA                        NA                        NA
5477_6#11       delt

In [22]:
pf info -t study -i 607

10018_1#1       APP_N2_OP1                NA                        APP_N2_OP1                NA
10018_1#2       APP_IN_2                  NA                        APP_IN_2                  NA
10018_1#3       APP_T1_OP2                NA                        APP_T1_OP2                NA
10018_1#4       APP_IN_4                  NA                        APP_IN_4                  NA
10018_1#5       APP_N1_OP2                NA                        APP_N1_OP2                NA
10018_1#6       APP_N1_OP1                NA                        APP_N1_OP1                NA
10018_1#7       APP_T1_OP1                NA                        APP_T1_OP1                NA
10018_1#8       APP_IN_3                  NA                        APP_IN_3                  NA
10018_1#9       APP_N2_OP2                NA                        APP_N2_OP2                NA
10018_1#10      APP_IN_1                  NA                        APP_IN_1                  NA
10018_1#11      APP_

**Now, let's get the assembly pipeline results for lane 10018_1#2.**

In [None]:
pf assembly -t lane -i 10018_1#2

This returns the locations of the FASTA-formatted contig files which were produced by the assembly pipeline. 
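Since the results are ordinary FASTA files, a quick sanity check is to count the sequences in one of them. The sketch below is a minimal example, not part of `pf`: the `count_contigs` helper and the demonstration file are made up for illustration, and it relies only on FASTA header lines starting with ">".

```bash
# Count sequences in a FASTA file by counting header lines (those starting with ">").
# count_contigs is a hypothetical helper, not a pf command.
count_contigs() {
    grep -c '^>' "$1"
}

# Demonstration on a tiny, made-up FASTA file (not real pipeline output).
demo_fasta=$(mktemp)
printf '>contig_1\nACGTACGT\n>contig_2\nGGCC\n' > "$demo_fasta"
count_contigs "$demo_fasta"
rm -f "$demo_fasta"
```

In practice you would pass one of the paths returned by `pf assembly` to `count_contigs` instead of the demonstration file.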

By default, `pf assembly` returns the scaffolded contigs. But what if you want to see all of the assembled contigs? To get these, we can use the `--filetype` or `-f` option.

In [None]:
pf assembly -t lane -i 10018_1#2 -f all

This returns two files: "contigs.fa" and "unscaffolded_contigs.fa".

Notice that these results are located in a directory called "spades_assembly". This tells us that [SPAdes](http://cab.spbu.ru/software/spades/) was the program used to generate the assembly. A quick way to filter assembly pipeline results by program is to use the `--program` or `-P` option.
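The directory-name convention described above can also be read from the shell. This is a rough sketch that assumes every assembler writes its results to a directory named `<program>_assembly`, which the tutorial only confirms for SPAdes. The `assembler_from_path` helper and the example path are hypothetical.

```bash
# Extract the assembler name from a path containing a "<program>_assembly" directory.
# Assumption: the "<program>_assembly" pattern holds for assemblers other than SPAdes.
assembler_from_path() {
    printf '%s\n' "$1" | tr '/' '\n' | sed -n 's/_assembly$//p' | head -n 1
}

# The path below is hypothetical; real paths come from `pf assembly`.
assembler_from_path "/hypothetical/root/10018_1#2/spades_assembly/contigs.fa"
```

For the hypothetical path above this prints `spades`.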

**Let's get all assembly pipeline results for run 10018_1 which were generated using "spades".**

In [None]:
pf assembly -t lane -i 10018_1 -P spades

Here we can see that SPAdes was used to generate assemblies for lanes 10018_1#2, 10018_1#3 and 10018_1#50.  We can symlink these assemblies into a directory using the `--symlink` or `-l` option.

**Let's symlink the assembly pipeline results for run 10018_1 which were generated with SPAdes to "10018_1_spades".**

In [None]:
pf assembly -t lane -i 10018_1 -P spades -l 10018_1_spades

In [None]:
ls 10018_1_spades
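A plain `ls` shows the link names but not where they point. The sketch below, which uses only standard shell tools, prints each symlink next to its target. The `list_symlink_targets` helper and the demonstration file names are made up for illustration.

```bash
# Print each symlink in a directory together with the file it points to.
# list_symlink_targets is a hypothetical helper, not a pf command.
list_symlink_targets() {
    for link in "$1"/*; do
        [ -L "$link" ] || continue
        printf '%s -> %s\n' "$(basename "$link")" "$(readlink "$link")"
    done
}

# Demonstration in a throwaway directory with made-up file names.
demo_dir=$(mktemp -d)
touch "$demo_dir/source.fa"
ln -s "$demo_dir/source.fa" "$demo_dir/link.fa"
list_symlink_targets "$demo_dir"
rm -rf "$demo_dir"
```

In the tutorial you could run `list_symlink_targets 10018_1_spades` to see which assembly each link refers to.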

We can also get some statistics from our assembly results using the `--stats` or `-s` option.

**Let's get some assembly statistics for lane 10018_1#2.**

In [7]:
pf assembly -t lane -i 10018_1#2 -s 

ERROR: output file "10018_1_2.assemblyfind_stats.csv" already exists; not overwriting existing file. Use "-F" to force overwriting


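This generates a file called "10018_1_2.assemblyfind_stats.csv" which contains our assembly statistics. If the file already exists, as in the output above, `pf assembly` will not overwrite it unless you add the `-F` option.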
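To check for an existing statistics file before running the command, you can derive the expected file name from the lane ID. The sketch below assumes the file name is the lane ID with "#" replaced by "_" plus the suffix ".assemblyfind_stats.csv". That pattern is inferred from the single example in this tutorial, and `stats_filename` is a hypothetical helper.

```bash
# Derive the expected statistics file name from a lane ID by replacing "#" with "_".
# Assumption: the naming pattern is inferred from one example (10018_1#2) only.
stats_filename() {
    printf '%s.assemblyfind_stats.csv\n' "$(printf '%s' "$1" | tr '#' '_')"
}

# Report whether the statistics file for this lane is already present.
stats_file=$(stats_filename '10018_1#2')
if [ -e "$stats_file" ]; then
    echo "$stats_file already exists"
else
    echo "$stats_file does not exist yet"
fi
```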

In [8]:
cat 10018_1_2.assemblyfind_stats.csv

Lane,"Assembly Type","Total Length","No Contigs","Avg Contig Length","Largest Contig",N50,"Contigs in N50",N60,"Contigs in N60",N70,"Contigs in N70",N80,"Contigs in N80",N90,"Contigs in N90",N100,"Contigs in N100","No scaffolded bases (N)","Total Raw Reads","Reads Mapped","Reads Unmapped","Reads Paired","Reads Unpaired","Total Raw Bases","Total Bases Mapped","Total Bases Mapped (Cigar)","Average Read Length","Maximum Read Length","Average Quality","Insert Size Average","Insert Size Std Dev"
10018_1#2,"Scaffold: Correction, Normalisation, Primer Removal + SPAdes + Improvement"
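With this many columns, it helps to list the column names with their positions before picking values out of the file. The sketch below splits only the header row on commas. That is safe there because no column name contains a comma, but it would not be safe for the data rows, where the quoted "Assembly Type" value contains commas. The `csv_header_columns` helper is hypothetical.

```bash
# Print the column names of a CSV header line, one per line, with their position.
# This simple split is only safe for the header row, whose names contain no commas.
csv_header_columns() {
    head -n 1 "$1" | tr -d '"' | tr ',' '\n' | awk '{ printf "%d\t%s\n", NR, $0 }'
}

# Demonstration using the first seven column names from the header shown above.
demo_csv=$(mktemp)
printf 'Lane,"Assembly Type","Total Length","No Contigs","Avg Contig Length","Largest Contig",N50\n' > "$demo_csv"
csv_header_columns "$demo_csv"
rm -f "$demo_csv"
```

Running `csv_header_columns 10018_1_2.assemblyfind_stats.csv` would list every column in the real file in the same way.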


***

## Questions

**Q1: How many assembly files are returned by default for lane 10018_1#50?**

In [None]:
# Enter your answer here

**Q2: Which program was used to generate the assembly for lane 10018_1#51?**  
_Hint: look at the location path_

In [None]:
# Enter your answer here

**Q3: Symlink the assembly/assemblies generated by "IVA" for run 10018_1 into a new directory called "iva_results".**  
_Hint: don't forget to filter the results if more than one program has been used_

In [None]:
# Enter your answer here

**Q4: How many contigs were assembled by velvet for lane 5477_6#1 and what is the N50?**  
_Hint: you'll need to get some statistics for this lane and filter by program_

In [None]:
# Enter your answer here

In [None]:
# Enter your answer here

***

## What's next?

For a quick recap of how to get SNP calling pipeline results, head back to [SNP calling pipeline results](snp-pipeline-results.ipynb).

Otherwise, let's move on to how to get your [annotation pipeline results](annotation-pipeline-results.ipynb).