# Finding a reference

## Introduction

For a reference to be used with the Pathogen Informatics analysis pipelines, it must first be in the pathogen databases. All complete bacterial genomes from [RefSeq](https://www.ncbi.nlm.nih.gov/refseq/) are automatically imported and updated. User-submitted references are also available.

We can look at the available reference sequences using `pf ref`. Unlike the other `pf` scripts we've looked at, it doesn't require a type (`-t`). It only needs a partial id (`--id` or `-i`).

In this section of the tutorial we will cover:

  * using `pf ref` to find a FASTA-formatted reference
  * using `pf ref` to find the GFF annotation for a reference
  * using `pf ref` to symlink a reference

## Exercise 11

We can use `pf ref` to find a reference from part of its name.

**Let's take a look at the usage information for `pf ref`.**

```
pf ref -h
```

So, if we wanted to see which mouse (_Mus musculus_) references are available, we could run:

```
pf ref -i Mus_musculus
```
or 
```
pf ref -i "Mus musculus"
```

Notice that in the first command there is no space between the genus and species. Instead, it is replaced with an underscore!

Both `pf ref` commands above give the same output:

    No exact match for "Mus_musculus". Did you mean:
      [1] Mus_musculus_mm10
      [2] Mus_musculus_mm9
      [a] all references
    Which reference?

You would then enter the number corresponding to the reference you need. For example, to get the location of the reference "Mus_musculus_mm10", we would enter 1, which would give us:

    /path/to/refs/Mus/musculus/Mus_musculus_mm10.fa

We can also use the `--all` or `-A` option to list all of the available references that match our query.

**Let's see which _Salmonella_ _enterica_ references are available.**

```
pf ref -i "Salmonella enterica" -A
```

This gives us the location of each reference FASTA file on disk. However, maybe we just want to see the reference names. We can do this using the `--reference-names` or `-R` option. The names are useful when you need to specify a reference while requesting the analysis pipelines on the command line.

**Now, let's get the _Salmonella_ _enterica_ reference names.**

```
pf ref -i "Salmonella enterica" -A -R
```
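A list of names like this can be handed to standard shell tools. The sketch below counts the entries with `wc -l`, assuming the names are printed one per line. The names in it are made up for illustration and stand in for real `pf ref` output.

```shell
# Mock list standing in for the output of `pf ref ... -A -R`.
# These names are hypothetical, not real entries in the databases.
names='Example_species_strain_A_v1
Example_species_strain_B_v1
Example_species_strain_C_v0.1'

# Count the entries, one per line. `tr` strips the padding that some
# versions of `wc` add around the number.
count=$(printf '%s\n' "$names" | wc -l | tr -d ' ')
echo "$count"
```

In practice, the mock list would be replaced by piping the real command output into `wc -l`.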

Notice the version numbers at the end of the reference names. The references usually follow a naming convention based on their source:

  * RefSeq accession (e.g. GCF_001887015_1) - complete genome imported from RefSeq
  * version (v) >=1 (e.g. v1) - genome requested by user and imported from public repository (e.g. ENA/GenBank)
  * version (v) <1 (e.g. v0.1) - internal genome assembly requested by user
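
The convention above can be turned into a rough check on a reference name. This is a minimal sketch: the function name, the output labels and the example names are all hypothetical, and it assumes the version is the last underscore-separated part of the name. Since the convention is only "usually" followed, treat the result as a guess.

```shell
# classify_ref NAME: print the likely source of a reference, going only
# by the naming convention. Function name and labels are illustrative.
classify_ref() {
    name="$1"
    suffix="${name##*_}"    # text after the last underscore

    # A RefSeq accession (GCF_ followed by digits) anywhere in the name.
    case "$name" in
        *GCF_[0-9]*) echo "refseq"; return ;;
    esac

    # Otherwise decide on the version suffix.
    case "$suffix" in
        v0|v0.*) echo "internal" ;;   # version < 1
        v[1-9]*) echo "public" ;;     # version >= 1
        *)       echo "unknown" ;;
    esac
}

# Hypothetical reference names, one per source type.
for ref in Example_species_GCF_001887015_1 \
           Example_species_strain_v1 \
           Example_species_strain_v0.1; do
    printf '%s\t%s\n' "$ref" "$(classify_ref "$ref")"
done
```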
  
But perhaps you don't want the FASTA file and instead want the reference annotation (i.e. the GFF file). To get this, we need to use the `--filetype` or `-f` option.
  
**Let's get the annotation (GFF) locations for the available _Salmonella_ _enterica_ references.**

```
pf ref -i "Salmonella enterica" -A -f gff
```

Finally, you might want to use the reference files in an analysis. The simplest way is to symlink them using the `--symlink` or `-l` option.

**Let's symlink our _Salmonella_ _enterica_ reference genomes to a directory called "salmonella_enterica_refs".**

```
pf ref -i "Salmonella enterica" -A -l salmonella_enterica_refs
```

```
ls salmonella_enterica_refs
```
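
To see what a directory of symlinks looks like without touching any real references, the sketch below builds one by hand with `ln -s` in a temporary directory. It illustrates symlinking in general. How `pf ref` creates its links internally is not described here, and the file and directory names are made up.

```shell
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
mkdir -p "$workdir/refs" "$workdir/links"

# Mock reference file; the name is hypothetical.
touch "$workdir/refs/Example_species_strain_v1.fa"

# Link every mock FASTA file into the target directory.
for fasta in "$workdir"/refs/*.fa; do
    ln -s "$fasta" "$workdir/links/$(basename "$fasta")"
done

# Each entry in links/ is a symlink pointing back at the original file.
ls -l "$workdir/links"
```

Because the links only point at the originals, they take up almost no space, and removing the throwaway directory afterwards with `rm -rf "$workdir"` leaves nothing behind.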

## Questions

**Q1: How many _Streptococcus pneumoniae_ references are available?**  
_Hint: you can use `wc` to count the number of references returned_

```
# Enter your answer here
```

**Q2: What is the location of the annotation (GFF) file for _Streptococcus pneumoniae P1031_?**

```
# Enter your answer here
```

**Q3: Symlink the annotation (GFF) file for _Streptococcus pneumoniae P1031 v1_ to your current directory.**

```
# Enter your answer here
```

## What's next?

You can head back to [RNA-Seq expression pipeline results](rnaseq-pipeline-results.ipynb).

Otherwise, let's move on to looking at [troubleshooting](troubleshooting.ipynb).