# Answers

***

Here are the answers to the questions from each of the tutorial sections.

  * [Introduction](#Introduction)
  * [Finding your data](#Finding-your-data)
  * [Sample information and accessions](#Sample-information-and-accessions)

**First, let's tell the system the location of our tutorial configuration file.**

In [None]:
export PF_CONFIG_FILE=$PWD/data/pathfind.conf

***

## Introduction

**Q1: How many lanes are associated with study 2363?**

**88**

For this search, you need to set the type (`-t`) to study and the id (`-i`) to 2363. You can then pipe the locations returned by `pf data` into `wc -l` to count the number of locations (lines) returned.

In [None]:
pf data -t study -i 2363 | wc -l

**Q2: How many lanes are returned if you search using the file "data/lanes_to_search.txt"?** 

**25**

For this search, you need to set the type (`-t`) to file and the id (`-i`) to the location of the file, "data/lanes_to_search.txt". You can then pipe the locations returned by `pf data` into `wc -l` to count the number of locations (lines) returned.

In [None]:
pf data -t file -i data/lanes_to_search.txt | wc -l

You can check that all the lanes in the file have been found by counting the number of lanes in the file.

In [None]:
wc -l data/lanes_to_search.txt
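
One caveat when comparing the two counts: `wc -l` counts newline characters, so a file whose last line has no trailing newline is reported as one line short. The sketch below uses a temporary stand-in file (hypothetical, not the tutorial's "data/lanes_to_search.txt"; only the lane names come from this tutorial) and shows `grep -c ''`, which also counts an unterminated last line.

```shell
# Stand-in lane list with NO trailing newline after the last lane (hypothetical file).
lanes_file=$(mktemp)
printf '10050_2#1\n10050_2#25\n10050_2#87' > "$lanes_file"

# wc -l counts newline characters, so the unterminated last line is missed.
# Reading from stdin keeps the filename out of the output.
newline_count=$(wc -l < "$lanes_file" | tr -d ' ')

# grep -c '' counts every line, including an unterminated last one.
line_count=$(grep -c '' "$lanes_file")

echo "wc -l: $newline_count, grep -c: $line_count"
rm -f "$lanes_file"
```

If the two numbers disagree for your own lane list, the file is probably missing its final newline.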

***

## Finding your data

**Q1: What is the location of the top level directory for data and results associated with lane 10050_2#1?**

The location of the top directory on disk is:

**/Users/vo1/sanger-pathogens/pathogen-informatics-training/Notebooks/PathFind/data/prokaryotes/seq-pipelines/Streptococcus/pneumoniae/TRACKING/2363/2363STDY5509234/SLX/7376889/10050_2#1**

In [None]:
pf data -t lane -i 10050_2#1

**Q2: What is the location of the two FASTQ files associated with lane 10050_2#1?**

The locations of the two FASTQ files are:

**/Users/vo1/sanger-pathogens/pathogen-informatics-training/Notebooks/PathFind/data/prokaryotes/seq-pipelines/Streptococcus/pneumoniae/TRACKING/2363/2363STDY5509234/SLX/7376889/10050_2#1/10050_2#1_2.fastq.gz**

**/Users/vo1/sanger-pathogens/pathogen-informatics-training/Notebooks/PathFind/data/prokaryotes/seq-pipelines/Streptococcus/pneumoniae/TRACKING/2363/2363STDY5509234/SLX/7376889/10050_2#1/10050_2#1_1.fastq.gz**

You need to use the `-f` or `--filetype` option to get the location of the FASTQ files.

In [None]:
pf data -t lane -i 10050_2#1 -f fastq

**Q3: Symlink the FASTQ files from study 2363 into a directory called "study_2363_links". How many FASTQ files were symlinked to "study_2363_links"?**

**176**

First, we need to get the FASTQ files for study 2363 using the `-f` or `--filetype` option, in case there are non-FASTQ files. We then add the `-l` or `--symlink` option with the name of the directory we want to symlink into, "study_2363_links".

In [None]:
pf data -t study -i 2363 -f fastq -l study_2363_links

We then look at the contents of "study_2363_links" with `ls` and count the number of files (lines) returned with `wc -l`.

In [None]:
ls study_2363_links | wc -l
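
As a sanity check, 176 is consistent with the 88 lanes found for study 2363 in the Introduction, assuming every lane has two FASTQ files as lane 10050_2#1 does. The sketch below checks that arithmetic and then shows a stricter count that includes only symlinks. It uses temporary stand-in directories rather than the real "study_2363_links", and "notes.txt" is a hypothetical stray file.

```shell
# 88 lanes (Introduction, Q1) x 2 FASTQ files per lane (assumed for every lane).
expected_links=$((88 * 2))
echo "expected: $expected_links"

# Stand-in directories: one holding the real files, one holding the symlinks.
target_dir=$(mktemp -d)
links_dir=$(mktemp -d)
touch "$target_dir/10050_2#1_1.fastq.gz" "$target_dir/10050_2#1_2.fastq.gz"
ln -s "$target_dir/10050_2#1_1.fastq.gz" "$links_dir/"
ln -s "$target_dir/10050_2#1_2.fastq.gz" "$links_dir/"
touch "$links_dir/notes.txt"   # stray regular file that `ls | wc -l` would also count

# Count only the symlinks inside the directory.
link_count=$(find "$links_dir" -type l | wc -l | tr -d ' ')
echo "symlinks: $link_count"

rm -rf "$target_dir" "$links_dir"
```

Counting with `find -type l` avoids over-counting if the directory already contained other files before the symlinks were created.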

**Q4: What reference was used to map lane 10050_2#1 during QC and what percentage of the reads were mapped to the reference?**

**Streptococcus_pneumoniae_INV200_v1** and **89.9%**

First, we need to get the statistics for lane 10050_2#1 using the `-s` or `--stats` option.

In [None]:
pf data -t lane -i 10050_2#1 -s

Then, we need to find the "Reference" and "Mapped %" columns in the statistics file (10050_2_1.pathfind_stats.csv).

In [None]:
cat 10050_2_1.pathfind_stats.csv
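
Rather than reading the whole file by eye, the two columns can be pulled out by header name. This is a minimal sketch that assumes the statistics file is comma-separated, has a single header row containing "Reference" and "Mapped %", and has no commas inside fields; the real layout of 10050_2_1.pathfind_stats.csv may differ. The helper `csv_column` and the stand-in file are hypothetical, with values taken from the answer above.

```shell
# csv_column FILE HEADER: print the values in the column whose header is HEADER.
# Hypothetical helper; assumes a simple CSV with no quoted or embedded commas.
csv_column() {
    awk -F',' -v name="$2" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == name) col = i; next }
        col     { print $col }
    ' "$1"
}

# Stand-in statistics file (hypothetical layout; values from the answer above).
stats_file=$(mktemp)
printf 'Lane,Reference,Mapped %%\n' > "$stats_file"
printf '10050_2#1,Streptococcus_pneumoniae_INV200_v1,89.9\n' >> "$stats_file"

reference=$(csv_column "$stats_file" "Reference")
mapped=$(csv_column "$stats_file" "Mapped %")
echo "$reference $mapped"
rm -f "$stats_file"
```

Looking the column up by name rather than by position keeps the command working if the column order differs from what is assumed here.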

***

## Sample information and accessions

**Q1: What is the sample name that corresponds with lane 10050_2#1?**

**2363STDY5509234**

We can use the default output from running `pf info` with the identifier type (`-t` or `--type`) set as "lane" and the identifier (`-i` or `--id`) as 10050_2#1 to get the sample name.

In [None]:
pf info -t lane -i 10050_2#1

We could also have used `pf accession`.

In [None]:
pf accession -t lane -i 10050_2#1

**Q2: What is the lane name that corresponds with sample 2363STDY5509320?**

**10050_2#87**

We can use the default output from running `pf info` with the identifier type (`-t` or `--type`) set as "sample" and the identifier (`-i` or `--id`) as 2363STDY5509320 to get the lane name.

In [None]:
pf info -t sample -i 2363STDY5509320

Again, we could also have used `pf accession`.

In [None]:
pf accession -t sample -i 2363STDY5509320

**Q3: What are the sample and lane names of the last lane in the file "data/lanes_to_search.txt"?**

**2363STDY5509258** (sample) and **10050_2#25** (lane)

We can use the default output from running `pf info` with the identifier type (`-t` or `--type`) set as "file" and the identifier (`-i` or `--id`) as "data/lanes_to_search.txt" to get the lane and sample names. To get the last line output (analogous to the last line in the file) we can use `tail -1`.

In [None]:
pf info -t file -i data/lanes_to_search.txt | tail -1

Again, we could also have used `pf accession`.

In [None]:
pf accession -t file -i data/lanes_to_search.txt | tail -1
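
To confirm which lane is last in the input file itself, the same `tail` approach works directly on the file. The sketch below uses a temporary stand-in file (hypothetical; only the lane names come from this tutorial) and the `-n 1` spelling, which is the POSIX form of `tail -1`.

```shell
# Stand-in for data/lanes_to_search.txt (hypothetical contents).
lanes_file=$(mktemp)
printf '10050_2#1\n10050_2#25\n' > "$lanes_file"

# -n 1 prints only the final line; it is the POSIX spelling of the older -1 form.
last_lane=$(tail -n 1 "$lanes_file")
echo "$last_lane"
rm -f "$lanes_file"
```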

**Q4: What are the sample and lane accessions for lane 10050_2#1?**

**ERS225583** and **ERR331391**

We can use the default output from running `pf accession` with the identifier type (`-t` or `--type`) set as "lane" and the identifier (`-i` or `--id`) as 10050_2#1 to get the sample and lane accessions.

In [None]:
pf accession -t lane -i 10050_2#1

**Q5: What are the two URLs which can be used to download the FASTQ files for lane 10050_2#1 from the ENA?**

**ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR331/ERR331391/ERR331391_1.fastq.gz**

and 

**ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR331/ERR331391/ERR331391_2.fastq.gz**

We can get the ENA download URLs by running `pf accession` with the identifier type (`-t` or `--type`) set as "lane" and the identifier (`-i` or `--id`) as 10050_2#1, together with the option `-f` or `--fastq`.

In [None]:
pf accession -t lane -i 10050_2#1 -f

This will generate "fastq_urls.txt", which contains the two URLs you're looking for.

In [None]:
cat fastq_urls.txt

*Note: if the file "fastq_urls.txt" already exists, you will need to remove it before you can use `pf accession` to create it again.*
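
That clean-up step can be made safe to repeat. In the sketch below, `rm -f` removes the file if it exists and stays silent if it does not. The demonstration works on a stand-in file inside a temporary directory (the `scratch_dir` name is hypothetical) so that nothing in your working directory is touched.

```shell
# Work in a scratch directory so the demonstration cannot delete a real file.
scratch_dir=$(mktemp -d)
urls_file="$scratch_dir/fastq_urls.txt"

# Stand-in for a fastq_urls.txt left over from an earlier run.
printf 'ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR331/ERR331391/ERR331391_1.fastq.gz\n' > "$urls_file"

# -f: remove the file if present, and do not complain if it is absent.
rm -f "$urls_file"
rm -f "$urls_file"   # second call is a harmless no-op
rm_status=$?

if [ -e "$urls_file" ]; then stale=yes; else stale=no; fi
echo "stale file present: $stale"
rm -rf "$scratch_dir"
```

In the tutorial directory the equivalent would be `rm -f fastq_urls.txt` before rerunning the `pf accession` command above.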

***