# Answers

***

Here are the answers to the questions from each of the tutorial sections.

  * [Introduction](#Introduction)
  * [Finding your data](#Finding-your-data)
  * [Sample information and accessions](#Sample-information-and-accessions)
  * [Analysis pipeline status](#Analysis-pipeline-status)
  * [QC pipeline results](#QC-pipeline-results)
  * [Mapping pipeline results](#Mapping-pipeline-results)
  * [SNP calling pipeline results](#SNP-pipeline-results)
  * [Assembly pipeline results](#Assembly-pipeline-results)
  * [Annotation pipeline results](#Annotation-pipeline-results)
  * [RNA-Seq expression pipeline results](#RNA-Seq-expression-pipeline-results)

**First, let's tell the system the location of our tutorial configuration file.**

In [None]:
export PF_CONFIG_FILE=$PWD/data/pathfind.conf

***

## Introduction

**Q1: How many lanes are associated with study 2363?**

**88**

For this search, you need to set the type (`-t`) to study and the id (`-i`) to 2363. You can then pipe the locations returned by `pf data` into `wc -l` to count the number of locations (lines) returned.

In [None]:
pf data -t study -i 2363 | wc -l

**Q2: How many lanes are returned if you search using the file "data/lanes_to_search.txt"?** 

**25**

For this search, you need to set the type (`-t`) to file and the id (`-i`) to the location of the file, "data/lanes_to_search.txt". You can then pipe the locations returned by `pf data` into `wc -l` to count the number of locations (lines) returned.

In [None]:
pf data -t file -i data/lanes_to_search.txt | wc -l

You can check that all the lanes in the file have been found by counting the number of lanes in the file.

In [None]:
wc -l data/lanes_to_search.txt

***

## Finding your data

**Q1: What is the location of the top level directory for data and results associated with lane 10050_2#1?**

The location of the top directory on disk is:

**/Users/vo1/sanger-pathogens/pathogen-informatics-training/Notebooks/PathFind/data/prokaryotes/seq-pipelines/Streptococcus/pneumoniae/TRACKING/2363/2363STDY5509234/SLX/7376889/10050_2#1**

In [None]:
pf data -t lane -i 10050_2#1

**Q2: What is the location of the two FASTQ files associated with lane 10050_2#1?**

The locations of the two FASTQ files are:

**/Users/vo1/sanger-pathogens/pathogen-informatics-training/Notebooks/PathFind/data/prokaryotes/seq-pipelines/Streptococcus/pneumoniae/TRACKING/2363/2363STDY5509234/SLX/7376889/10050_2#1/10050_2#1_2.fastq.gz**

**/Users/vo1/sanger-pathogens/pathogen-informatics-training/Notebooks/PathFind/data/prokaryotes/seq-pipelines/Streptococcus/pneumoniae/TRACKING/2363/2363STDY5509234/SLX/7376889/10050_2#1/10050_2#1_1.fastq.gz**

You need to use the `-f` or `--filetype` option to get the location of the FASTQ files.

In [None]:
pf data -t lane -i 10050_2#1 -f fastq

**Q3: Symlink the FASTQ files from study 2363 into a directory called "study_2363_links". How many FASTQ files were symlinked to "study_2363_links"?**

**176**

First, we need to get the FASTQ files for study 2363 using the `-f` or `--filetype` option in case there are any non-FASTQ files. We then add the `-l` or `--symlink` option with the directory we want to symlink to, "study_2363_links".

In [None]:
pf data -t study -i 2363 -f fastq -l study_2363_links

We then look at the contents of "study_2363_links" with `ls` and count the number of files (lines) returned with `wc -l`.

In [None]:
ls study_2363_links | wc -l
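
Note that `ls | wc -l` counts every entry in the directory, so it only equals the number of symlinked FASTQ files if nothing else is in there. As a minimal sketch of how to count symlinks only, the example below builds a throwaway directory (all file names are hypothetical and unrelated to the tutorial data) and compares the two counts.

```shell
# Build a throwaway directory containing two symlinks and one regular file.
# All names here are hypothetical and unrelated to the tutorial data.
workdir=$(mktemp -d)
touch "$workdir/target_1.fastq.gz" "$workdir/target_2.fastq.gz"
mkdir "$workdir/links"
ln -s "$workdir/target_1.fastq.gz" "$workdir/links/a_1.fastq.gz"
ln -s "$workdir/target_2.fastq.gz" "$workdir/links/a_2.fastq.gz"
touch "$workdir/links/notes.txt"

# `ls | wc -l` counts every entry; `find -type l` counts symlinks only.
# `tr -d ' '` strips the padding that some versions of wc add.
all_entries=$(ls "$workdir/links" | wc -l | tr -d ' ')
symlinks_only=$(find "$workdir/links" -maxdepth 1 -type l | wc -l | tr -d ' ')

echo "entries: $all_entries, symlinks: $symlinks_only"
rm -rf "$workdir"
```

On the tutorial data the same idea would be `find study_2363_links -maxdepth 1 -type l | wc -l`.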

**Q4: What reference was used to map lane 10050_2#1 during QC and what percentage of the reads were mapped to the reference?**

**Streptococcus_pneumoniae_INV200_v1** and **89.9%**

First, we need to get the statistics for lane 10050_2#1 using the `-s` or `--stats` option.

In [None]:
pf data -t lane -i 10050_2#1 -sF

Then, we need to find the "Reference" and "Mapped %" columns in the statistics file (10050_2_1.pathfind_stats.csv).

In [None]:
cat 10050_2_1.pathfind_stats.csv
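
If you would rather not scan the whole file by eye, the two columns can be pulled out by header name. The following is a minimal sketch, assuming the statistics file is plain comma-separated text with no quoted commas and that the header row contains "Reference" and "Mapped %" exactly. The temporary file is a hypothetical, cut-down stand-in for the real statistics file, which has more columns.

```shell
# Hypothetical, cut-down stand-in for 10050_2_1.pathfind_stats.csv.
# Only the two header names and the two values come from the tutorial.
stats_file=$(mktemp)
cat > "$stats_file" <<'EOF'
Lane,Reference,Mapped %
10050_2#1,Streptococcus_pneumoniae_INV200_v1,89.9
EOF

# Find the column numbers from the header row,
# then print those two columns (tab-separated) for every data row.
selected=$(awk -F',' '
  NR == 1 {
    for (i = 1; i <= NF; i++) {
      if ($i == "Reference") ref = i
      if ($i == "Mapped %")  pct = i
    }
    next
  }
  { print $ref "\t" $pct }
' "$stats_file")

echo "$selected"
rm -f "$stats_file"
```

To try this on the tutorial data, replace `"$stats_file"` with `10050_2_1.pathfind_stats.csv` and drop the lines that create and remove the temporary file.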

***

## Sample information and accessions

**Q1: What is the sample name that corresponds with lane 10050_2#1?**

**2363STDY5509234**

We can use the default output from running `pf info` with the identifier type (`-t` or `--type`) set as "lane" and the identifier (`-i` or `--id`) as 10050_2#1 to get the sample name.

In [None]:
pf info -t lane -i 10050_2#1

We could also have used `pf accession`.

In [None]:
pf accession -t lane -i 10050_2#1

**Q2: What is the lane name that corresponds with sample 2363STDY5509320?**

**10050_2#87**

We can use the default output from running `pf info` with the identifier type (`-t` or `--type`) set as "sample" and the identifier (`-i` or `--id`) as 2363STDY5509320 to get the lane name.

In [None]:
pf info -t sample -i 2363STDY5509320

Again, we could also have used `pf accession`.

In [None]:
pf accession -t sample -i 2363STDY5509320

**Q3: What are the sample and lane names of the last lane in the file "data/lanes_to_search.txt"?**

**10050_2#25** and **2363STDY5509258**

We can use the default output from running `pf info` with the identifier type (`-t` or `--type`) set as "file" and the identifier (`-i` or `--id`) as "data/lanes_to_search.txt" to get the lane and sample names. To get the last line output (analogous to the last line in the file) we can use `tail -1`.

In [None]:
pf info -t file -i data/lanes_to_search.txt | tail -1

Again, we could also have used `pf accession`.

In [None]:
pf accession -t file -i data/lanes_to_search.txt | tail -1

**Q4: What are the sample and lane accessions for lane 10050_2#1?**

**ERS225583** and **ERR331391**

We can use the default output from running `pf accession` with the identifier type (`-t` or `--type`) set as "lane" and the identifier (`-i` or `--id`) as 10050_2#1 to get the lane and sample accessions.

In [None]:
pf accession -t lane -i 10050_2#1

**Q5: What are the two URLs which can be used to download the FASTQ files for lane 10050_2#1 from the ENA?**

**ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR331/ERR331391/ERR331391_1.fastq.gz**

and 

**ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR331/ERR331391/ERR331391_2.fastq.gz**

We can get the ENA download URLs by running `pf accession` with the identifier type (`-t` or `--type`) set as "lane" and the identifier (`-i` or `--id`) as 10050_2#1 with the option `-f` or `--fastq`.

In [None]:
pf accession -t lane -i 10050_2#1 -f

This will generate "fastq_urls.txt" which contains the two URLs you're looking for.

In [None]:
cat fastq_urls.txt

_Note: if the file "fastq&#95;urls.txt" already exists you will need to remove it before you can use `pf accession` to create it again._

***

## Analysis pipeline status

**Q1: Has the assembly pipeline been run on lane 10018_1#1? If so, what is the status?**

**No**.  

The status for the assembly pipeline for lane 10018_1#1 is '-' which means that the assembly pipeline has not been run for this data.

In [None]:
pf status -t lane -i 10018_1#1

**Q2: Which lanes in study 607 has the assembly pipeline been run on?**

**10018_1#2**, **10018_1#2**, **10018_1#2**, **10018_1#2** and  **10018_1#51** 

We can pipe the output of `pf status` for study 607 into `awk`. The assembly pipeline status is found in column 9 and we want to filter for values which are "Done".  This should return five lanes.

In [None]:
pf status -t study -i 607 | awk '$9 == "Done"'

**Q3: How many lanes in study 607 has the mapping pipeline been run on?**

**41**

The command structure here is similar to before except we want to filter values for the mapping pipeline in column 4. We can then count the number of lines returned with `wc -l`.

In [None]:
pf status -t study -i 607 | awk '$4 == "Done"' | wc -l
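
The two `awk` filters above can be tried out without access to the tracking data. The sketch below uses a made-up table as a stand-in for `pf status` output: the lane names and statuses are invented, and only the column positions (mapping in column 4, assembly in column 9) follow the text above. It also shows that `awk` can do the counting itself.

```shell
# Made-up stand-in for `pf status` output: nine whitespace-separated columns,
# with the mapping status in column 4 and the assembly status in column 9.
status_table='lane_A Done Done Done - - - - Done
lane_B Done Done - - - - - -
lane_C Done Done Done - - - - -'

# Print the name (column 1) of each lane whose assembly status is "Done".
assembled=$(printf '%s\n' "$status_table" | awk '$9 == "Done" {print $1}')

# Count lanes whose mapping status is "Done" inside awk, so `wc -l` is not needed.
# `n + 0` makes awk print 0 rather than an empty string when nothing matches.
mapped_count=$(printf '%s\n' "$status_table" | awk '$4 == "Done" {n++} END {print n + 0}')

echo "assembled: $assembled"
echo "mapped: $mapped_count"
```

A pattern with no action, as in `awk '$9 == "Done"'`, prints the whole matching line. Adding `{print $1}` keeps only the first column.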

***

## QC pipeline results

**Q1: What percentage of the reads from lane 10050_2#1 were "unclassified" by Kraken?**

**0.03**

We can use the default output from running `pf qc` with the identifier type (`-t` or `--type`) set as "lane" and the identifier (`-i` or `--id`) as 10050_2#1 to get the location of the Kraken report. We then use `xargs` to pass this location to `head` so that we can see the first few lines of the report.

In [None]:
pf qc -t lane -i 10050_2#1 | xargs head

**Q2: What percentage of the reads from the lane 10050_2#1 were classified to the genus _Streptococcus_ by Kraken?**

**99.9%**

We can write a summary of the Kraken report using the `--summary` or `-s` option. Here we called this file "qc_genus_summary.csv". To set the taxonomic level for the summary we use the `--level` or `-L` option. Genus is represented by a "G".

In [None]:
pf qc -t lane -i 10050_2#1 -L G -s qc_genus_summary.csv

We then look at the summary file to see what percentage of reads were classified to the genus _Streptococcus_.

In [None]:
head qc_genus_summary.csv

***

## Mapping pipeline results

**Q1: How many BAM files are returned by default for lane 25243_6#1?**

**3** 

You can use `grep -c` to count the number of returned locations ending in .bam (".bam$"). Notice we use a dollar sign to signify the end as we don't want to count the index files (.bam.bai).

In [None]:
pf map -t lane -i 25243_6#1 | grep -c ".bam$"
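
To see what the dollar sign changes, here is a small sketch on an invented file listing (the paths are hypothetical, not real `pf map` output). It also escapes the dot as `\.` so that it matches a literal full stop rather than any character.

```shell
# Hypothetical file listing: the directory and file names are invented for illustration.
listing='/some/dir/111.pe.markdup.bam
/some/dir/111.pe.markdup.bam.bai
/some/dir/222.pe.markdup.bam
/some/dir/222.pe.markdup.bam.bai'

# Without the anchor, the index files (.bam.bai) are counted as well.
unanchored=$(printf '%s\n' "$listing" | grep -c '\.bam')
# With the anchor, only lines that end in .bam are counted.
anchored=$(printf '%s\n' "$listing" | grep -c '\.bam$')

echo "unanchored: $unanchored, anchored: $anchored"
```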

**Q2: Which mappers have been used with the mapping pipeline for lane 25243_6#1?**

**bwa** and **smalt**

We can use the `--details` or `-d` option to get information about which mapper and reference were used to generate each of the BAM files.  Then we can use `awk` to get the 3rd column which contains the mapper.  

In [None]:
pf map -t lane -i 25243_6#1 -d | awk '{print $3}' 

If you want, you can also `sort` the output and then find the unique mappers with `uniq`.

In [None]:
pf map -t lane -i 25243_6#1 -d | awk '{print $3}' | sort | uniq
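
The `sort` step matters because `uniq` only collapses duplicate lines that sit next to each other. The sketch below illustrates this on an invented stand-in for the `--details` output: the lane and reference names are hypothetical, and only the column layout (reference in column 2, mapper in column 3) follows the text above.

```shell
# Invented stand-in for `pf map -d` output:
# column 2 holds the reference and column 3 the mapper.
details='lane_X ref_one smalt
lane_X ref_one bwa
lane_X ref_two smalt'

# uniq alone leaves both copies of "smalt" because they are not adjacent.
unsorted=$(printf '%s\n' "$details" | awk '{print $3}' | uniq)
# Sorting first puts duplicates next to each other, so uniq can collapse them.
unique_mappers=$(printf '%s\n' "$details" | awk '{print $3}' | sort | uniq)

echo "without sort:"
echo "$unsorted"
echo "with sort:"
echo "$unique_mappers"
```

`sort -u` gives the same result as `sort | uniq` in a single step.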

**Q3: Which references have been used with the mapping pipeline for lane 25243_6#1?**

**Salmonella_enterica_subsp_enterica_serovar_Typhimurium_VNB151_ST34_v0.1**
and 
**Salmonella_enterica_subsp_enterica_serovar_Typhimurium_str_D23580_GCF_000027025_1**

You can use the same command as before, except this time we are looking for the references in column 2 with `awk`.

In [None]:
pf map -t lane -i 25243_6#1 -d | awk '{print $2}' | sort | uniq

**Q4: What percentage of the reads from lane 25243_6#1 were mapped to "Salmonella_enterica_subsp_enterica_serovar_Typhimurium_VNB151_ST34_v0.1" using bwa?**

**99.0%**

First, we need to filter our returned mapping pipeline results by reference using the `--reference` or `-R` option. Then we write the comma-delimited statistics for the returned BAM files to file using the `--stats` or `-s` option.

In [None]:
pf map -t lane -i 25243_6#1 -R "Salmonella_enterica_subsp_enterica_serovar_Typhimurium_VNB151_ST34_v0.1" -s

This generates "25243_6_1.mapping_stats.csv" which we can filter by mapper (column 10) using `awk` and return only the mapping percentage (column 12).

In [None]:
awk -F',' '$10=="bwa" {print $12}' 25243_6_1.mapping_stats.csv

***

## SNP pipeline results

**Q1: How many lanes from run 10018_1 has the SNP calling pipeline been completed on?**  

**3**

You can use `pf status` to tell you which of the lanes in run 10018_1 the SNP calling pipeline has been completed on.

In [None]:
pf status -t lane -i 10018_1

To count these you can get all of the rows where the SNP calling is "Done" (column 7) with `awk` and then count the number of lines returned with `wc -l`.

In [None]:
pf status -t lane -i 10018_1 | awk '$7=="Done"' | wc -l

**Q2: How many gzipped VCF files are returned by default for lane 10018_1#20?**

**1**

In [None]:
pf snp -t lane -i 10018_1#20

**Q3: Which mapper and reference was used by the SNP calling pipeline for lane 10018_1#20?**

**smalt** and **Streptococcus_suis_P1_7_v1**

You can get the mapper and reference information using the `--details` or `-d` option.

In [None]:
pf snp -t lane -i 10018_1#20 -d

**Q4: Generate the pseudogenome for lane 10018_1#20 excluding the reference.**

To generate the pseudogenome you can use the `--pseudogenome` or `-p` option and `--exclude-reference` or `-x` option to exclude the reference.

In [None]:
pf snp -t lane -i 10018_1#20 -p -x

**Q5: Symlink the gzipped VCF files generated by the SNP calling pipeline for run 10018_1 to a new directory called "10010_1_vcfs".**

You can symlink the VCF files using the `--symlink` or `-l` option, followed by the name of the directory you want to create.

In [None]:
pf snp -t lane -i 10018_1 -l 10010_1_vcfs

***

## Assembly pipeline results

**Q1: How many assembly files are returned by default for lane 10018_1#50?**

**2**

Assemblies have been generated using IVA and SPAdes (look at the result paths).

In [None]:
pf assembly -t lane -i 10018_1#50

**Q2: Which program was used to generate the assembly for lane 10018_1#51?**  

**velvet**

Look at the end of the path - "10018_1#51/**velvet**_assembly/contigs.fa".

In [None]:
pf assembly -t lane -i 10018_1#51

**Q3: Symlink the assembly/assemblies generated by "IVA" for run 10018_1 into a new directory called "iva_results".**  

In [None]:
pf assembly -t lane -i 10018_1 -P iva -l iva_results

**Q4: How many contigs were assembled for lane 5477_6#1 and what is the N50?**  

**66** contigs with an N50 of **61,250**

First, you need to generate the statistics file using the `--stats` or `-s` option.

In [None]:
pf assembly -t lane -i 5477_6#1 -s

Then, you need to look at the contents.

In [None]:
cat 5477_6_1.assemblyfind_stats.csv

***

## Annotation pipeline results

**Q1: How many GFF files are returned by default for lane 10018_1#1?**

**2**

There are two GFF files returned, one for an IVA assembly and one for a SPAdes assembly.

In [None]:
pf annotation -t lane -i 10018_1#1

**Q2: What is the location of the annotation for the SPAdes assembly of lane 10018_1#1?**

The location of the SPAdes annotation is:

**/Users/vo1/sanger-pathogens/pathogen-informatics-training/Notebooks/PathFind/data/prokaryotes/seq-pipelines/Actinobacillus/pleuropneumoniae/TRACKING/607/APP_N2_OP1/SLX/APP_N2_OP1_7492530/10018_1#1/spades_assembly/annotation/10018_1#1.gff**

You need to use the `--program` or `-P` option to filter the results by assembler.

In [None]:
pf annotation -t lane -i 10018_1#1 -P spades

**Q3: What is the location of the translated CDS sequence file for the SPAdes assembly of lane 10018_1#1?**

The location of the translated CDS sequence file for the SPAdes assembly is:

**/Users/vo1/sanger-pathogens/pathogen-informatics-training/Notebooks/PathFind/data/prokaryotes/seq-pipelines/Actinobacillus/pleuropneumoniae/TRACKING/607/APP_N2_OP1/SLX/APP_N2_OP1_7492530/10018_1#1/spades_assembly/annotation/10018_1#1.faa**

To get the translated CDS sequence file you need to use the `--filetype` or `-f` option.

In [None]:
pf annotation -t lane -i 10018_1#1 -P spades -f faa

**Q4: How many of the assemblies for run 5477_6 contain the gene "_dnaG_"?**  

**3**

You need to use the `--gene` or `-g` option to search for a gene name.

In [None]:
pf annotation -t lane -i 5477_6 -g dnaG 

***

## RNA-Seq expression pipeline results

***