# Finding your data

***

## Introduction

To search for the location(s) of data stored in the pathogen databases, we can use `pf data`. In the [previous](intro.ipynb) section, we looked at two options which are used by most of the pf scripts, **type** (**`-t`**) and **id** (**`-i`**).

In this section of the tutorial we will be looking at several other functions which `pf data` can perform that may be useful when finding, sharing or using your sequencing data. 

By default, `pf data` will return a directory. It not only contains the imported sequence data, but also the results of any of the analysis pipelines which have been run on that data.

In this section of the tutorial we will cover:

  * the `pf data` command format
  * using `pf data` to find the top level directory where sequence data and analysis pipeline results are stored
  * using `pf data` to find sequence data files
  * using `pf data` to symlink files and directories
  * using `pf data` to compress files and directories
  * using `pf data` to generate sequencing data statistics

### Filetypes

However, you might not want to know the top level directory location. You might want to know where the sequence data files are and what they are called so that you can use them in a downstream analysis. To do this, we ask `pf data` to find the sequence files using the **filetype** (**`--filetype`** or **`-f`**) option.
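As a rough sketch of what "use them in a downstream analysis" could look like, the helper below reads file paths from standard input and counts the reads in each gzipped FASTQ file. It assumes that `pf data` prints one path per line; `count_fastq_reads` is a hypothetical name, not part of the pf scripts.

```shell
# Count the reads in each gzipped FASTQ file whose path arrives on stdin.
# Assumes one path per line; count_fastq_reads is a hypothetical helper name.
count_fastq_reads() {
    while IFS= read -r fastq_path; do
        # A FASTQ record is four lines long, so reads = lines / 4.
        line_count=$(gzip -dc "$fastq_path" | wc -l)
        printf '%s\t%d\n' "$fastq_path" "$((line_count / 4))"
    done
}

# Intended use (needs the pf scripts and a configured database):
#   pf data -t lane -i 5477_6#1 -f fastq | count_fastq_reads
```

Reading the paths from a pipe means the sequence files are never copied; they are read in place.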


### Symlinking

Pathogen Informatics asks users not to copy sequence data or results that are already in the pathogen databases. This is because copying data uses up precious disk space. 

Instead, we ask users to **symlink** the data. Symlinks contain no data; they simply reference the location of the original file or directory. To most commands, the symlink looks like the original file, but the operations the command performs (e.g. reading from the file) are directed to the original file which the symlink points to.
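To make the idea concrete, here is a small sketch that uses the standard `ln -s` command rather than `pf data`. The file names are made up for illustration and everything happens in a temporary directory.

```shell
# Illustrative only: demo_original.txt and demo_link.txt are made-up names.
workdir=$(mktemp -d)
echo "ACGT" > "$workdir/demo_original.txt"

# Create a symlink that points at the original file.
ln -s "$workdir/demo_original.txt" "$workdir/demo_link.txt"

# Reading through the symlink returns the contents of the original file.
link_contents=$(cat "$workdir/demo_link.txt")
echo "$link_contents"

# readlink shows where the symlink points.
link_target=$(readlink "$workdir/demo_link.txt")
echo "$link_target"
```

The symlink itself stores only the path of the original file, which is why it takes up almost no disk space.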

You can symlink a file or directory that's returned by a `pf data` search by using the **`--symlink`** or **`-l`** option.

### Archiving or compressing data


### Getting statistics

***

## Exercise 2

**First, let's tell the system the location of our tutorial configuration file.**

In [None]:
export PF_CONFIG_FILE=$PWD/data/pathfind.conf

You can see the available options for `pf data` using the **`--help`** or **`-h`** option.

**Let's take a look at the usage information for `pf data`**.

In [40]:
pf data -h

usage:
      pf data --id <id> --type <ID type> [options]

description:
    This pathfind command will output the path(s) on disk to the data
    associated with sequencing run(s). Specify the type of data using --type
    and give the accession, name or identifier for the data using --id.
    
    You can search for data using several types of ID: lane, library, sample,
    study, or species. *Note* that searching using study or species can
    produce a large number of results and can be very slow.
    
    Use "pf man" or "pf man data" to see more information.

options:
    --id -i                     ID or name of file containing IDs [Required;
                                Env: PF_ID]
    --type -t                   ID type. Use "file" to read IDs from file [
                                Required; Possible values: database, file,
                                lane, library, sample, species, study; Env:
                            

Here we can see that the basic `pf data` command uses just the **type** (**`--type`** or **`-t`**) and **id** (**`--id`** or **`-i`**) options.

```
 pf data --id <id> --type <ID type> [options]
```

**Let's search for the location of data associated with lane 5477_6#1.**

In [None]:
pf data -t lane -i 5477_6#1

The disk location `pf data` returned is the **top level** directory where all of the data and results associated with lane 5477_6#1 are stored.

### Filetypes

We may want to find the sequence data files which were imported so that we can use them for a subsequent analysis.

**Let's find the FASTQ files which were imported for lane 5477_6#1.**

In [None]:
pf data -t lane -i 5477_6#1 -f fastq

As this is Illumina paired-end data, two gzipped (.gz) FASTQ-formatted sequence data files are returned, corresponding to the left (_1) and right (_2) reads.
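Downstream tools often need both files of a pair. As a minimal sketch, assuming the `_1.fastq.gz` / `_2.fastq.gz` naming described above, the helper below derives the path of the right-read file from the left-read file. `mate_fastq` is a hypothetical name, not part of the pf scripts.

```shell
# Given the path of a left-read file ending in _1.fastq.gz, print the
# path of its mate. mate_fastq is a hypothetical helper name.
mate_fastq() {
    forward_path=$1
    case "$forward_path" in
        *_1.fastq.gz)
            # Strip the _1 suffix and append the _2 suffix instead.
            printf '%s\n' "${forward_path%_1.fastq.gz}_2.fastq.gz"
            ;;
        *)
            echo "not a left-read FASTQ file: $forward_path" >&2
            return 1
            ;;
    esac
}

# Quote lane names so the shell treats the # as a literal character.
mate_fastq '5477_6#1_1.fastq.gz'
```

The helper only rewrites the name; it does not check that the mate file exists on disk.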

### Symlinking

We don't want to copy these files to where we're running the analysis because this uses up disk space unnecessarily. Instead, we'll symlink them.

**First, let's try symlinking our two FASTQ files from lane 5477_6#1 into your current working directory.**

In [None]:
pf data -t lane -i 5477_6#1 -f fastq -l .

Now, if we look at the directory with `ls` we should see our two symlinked files "5477_6#1_1.fastq.gz" and "5477_6#1_2.fastq.gz".

In [None]:
ls

But, if we take a closer look using `ls -l` we can see that those files are symlinks to our tutorial data files.

In [None]:
ls -l | grep fastq

**Now, let's try symlinking to a new directory called "my_lanes".**

In [None]:
pf data -t lane -i 5477_6#1 -f fastq -l my_lanes

We can now see that a new directory called "my_lanes" has been created.

In [None]:
ls

And inside the "my_lanes" directory are our two symlinked files.

In [None]:
ls -l my_lanes
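Because a symlink only stores a path, it stops working if the original file is moved or deleted. The sketch below is one way to check a directory of symlinks for broken links using standard shell tests; `find_broken_links` is a hypothetical helper name, not a pf command.

```shell
# Print any symlink in the given directory whose target no longer exists.
# find_broken_links is a hypothetical helper name.
find_broken_links() {
    for entry in "$1"/*; do
        # -L is true for a symlink; -e follows the link, so it fails
        # when the target is missing.
        if [ -L "$entry" ] && [ ! -e "$entry" ]; then
            printf '%s\n' "$entry"
        fi
    done
}

# Intended use after the symlinking step above:
#   find_broken_links my_lanes
```

If nothing is printed, every symlink in the directory still points at an existing file or directory.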

So far, we've been symlinking our FASTQ files. But what if we want to symlink all of the data and results associated with our lane?

**Instead of symlinking just our sequence data, let's symlink all of the data and results for lane 5477_6#1 to a new directory called "my_lane_data".**

In [None]:
pf data -t lane -i 5477_6#1 -l my_lane_data

Looking inside "my_lane_data" we see a directory which has the same name as our lane, 5477_6#1. This directory is symlinked to the tutorial data directory for this lane. 

In [None]:
ls -l my_lane_data

**Finally, let's try symlinking the data and results for all lanes associated with a study.**

In [None]:
pf data -t study -i 664 -l my_study_lanes

Here we see 11 symlinked directories which have the names of the 11 lanes associated with study 664.

In [None]:
ls -l my_study_lanes

### Archiving data

### Data statistics

In [47]:
pf data -t lane -i 5477_6#1 -s



In [43]:
ls -lart | tail

-rw-r--r--   1 vo1  1662     0 10 Aug 19:13 166720.pe.raw.sorted.bam.cover
-rw-r--r--   1 vo1  1662     0 10 Aug 19:13 166720.pe.raw.sorted.bam.bc
-rw-r--r--   1 vo1  1662     0 10 Aug 19:13 166720.pe.raw.sorted.bam.bas
-rw-r--r--   1 vo1  1662     0 10 Aug 19:13 166720.pe.raw.sorted.bam.bai
-rw-r--r--   1 vo1  1662     0 10 Aug 19:13 166720.pe.raw.sorted.bam
-rw-r--r--   1 vo1  1662     0 10 Aug 19:13 166720.pe.markdup.snp
-rw-r--r--   1 vo1  1662     0 10 Aug 19:13 166720.pe.markdup.bam_graphs
-rw-r--r--   1 vo1  1662     0 10 Aug 19:13 166720.pe.markdup.bam.bc
-rw-r--r--   1 vo1  1662     0 10 Aug 19:13 166720.pe.markdup.bam.bai
-rw-r--r--   1 vo1  1662     0 10 Aug 19:13 166720.pe.markdup.bam


In [46]:
pf data -t study -i 664 --stats

                                                                                

***

## Questions

***

## What's next?

For a quick recap of what the pf scripts are, head back to the [introduction](introduction.ipynb).

Otherwise, let's move on to [sample information and accessions](information-and-accessions.ipynb).