# ARTIC pipeline example

This is a brief run-through of the commands needed to run the ARTIC pipeline. We will also cover some of the files that the pipeline produces.

***

## Using this notebook

We are using a [Jupyter notebook](https://jupyter.org/) for this example so that we can host it on [Binder](https://mybinder.org). If you want to run the commands yourself on the command line, remove the leading `!` from each command in this notebook (the `!` tells Jupyter to execute a system command).

To run this notebook, you can click on each cell and press Run. Be sure to wait for the cell to complete before moving on to the next one. It might take a minute or so for each cell to complete.

## Data

To begin, we need some data. This repository already has some data for you to use, which was generated from a SARS-CoV-2 positive control sample at the University of Birmingham. If you want to obtain the data for yourself, you can run the following:

```
wget http://artic.s3.climb.ac.uk/BHAM-Run88-PTC.fastq
```

This test data is only the FASTQ reads from the positive control sample. We have already basecalled, demultiplexed and filtered them from the original FAST5 data for this sample.

> Because we only have FASTQ data, this example will use the **medaka** workflow of the ARTIC pipeline. This is because the **nanopolish** version requires FAST5 data as well as FASTQ.


To run the **medaka** workflow of the ARTIC pipeline on this data, we need to know:

* what version of the ARTIC primer scheme was used
  * version 3
* what [medaka model](https://github.com/nanoporetech/medaka#models) to use
  * r941_min_high_g351 (`{pore}_{device}_{caller variant}_{caller version}`)
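
The model name packs four fields into one underscore-separated string, following the pattern shown above. As a minimal sketch, it can be unpacked as follows. The `MedakaModel` type and `parse_model_name` helper are hypothetical names introduced only for illustration, and the code assumes the name has exactly four underscore-separated fields:

```python
from typing import NamedTuple


class MedakaModel(NamedTuple):
    """Fields of a model name: {pore}_{device}_{caller variant}_{caller version}."""
    pore: str
    device: str
    caller_variant: str
    caller_version: str


def parse_model_name(name: str) -> MedakaModel:
    # Hypothetical helper: assumes exactly four underscore-separated fields.
    fields = name.split("_")
    if len(fields) != 4:
        raise ValueError(
            f"expected 4 underscore-separated fields, got {len(fields)}: {name!r}"
        )
    return MedakaModel(*fields)


model = parse_model_name("r941_min_high_g351")
print(model)
```

Reading the fields back (`model.pore`, `model.caller_version`, and so on) is a quick way to check that the model you pick matches the flow cell and basecaller used for your run.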

As well as the FASTQ reads, we will also need:

* primer scheme (BED format)
* reference sequence (FASTA format)

### Primer scheme and reference sequence

Although the ARTIC pipeline will download these for us, we can also get them for ourselves in order to familiarise ourselves with them:

In [None]:
!artic-tools get_scheme --schemeVersion 3 scov2

This will have downloaded the primer scheme (`scov2.v3.primer.bed`) and the reference sequence (`scov2.v3.reference.fasta`). You can get some quick stats on the primer scheme using artic-tools:

In [None]:
!artic-tools validate_scheme scov2.v3.primer.bed

The primer scheme file is in BED format, with the following columns:


| column | name       | type         | description                                               |
| :----- | :--------- | :----------- | :-------------------------------------------------------- |
| 1      | chrom      | string       | primer reference sequence                                 |
| 2      | chromStart | int          | starting position of the primer in the reference sequence |
| 3      | chromEnd   | int          | ending position of the primer in the reference sequence   |
| 4      | name       | string       | primer name                                               |
| 5      | primerPool | int          | primer pool<sup>\*</sup>                                  |
| 6      | strand     | string (+/-) | primer direction                                          |

<sup>\*</sup> column 5 in the BED spec is an int for score, whereas here we are using it to denote primerPool.

If you want to look at the primer scheme file, we can do that here with some Python:

In [None]:
with open("scov2.v3.primer.bed", 'r') as f:
    print(f.read())
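
The column layout in the table above can also be turned into typed records, which makes the scheme easier to inspect than raw text. The following is a rough sketch rather than part of the ARTIC pipeline: the `Primer` type and `parse_primer_bed` function are hypothetical names, the example rows are made up for illustration, and the pool column is kept as text on the assumption that a scheme file might store a pool label rather than a number:

```python
from typing import List, NamedTuple


class Primer(NamedTuple):
    chrom: str        # column 1: primer reference sequence
    chrom_start: int  # column 2: start position in the reference
    chrom_end: int    # column 3: end position in the reference
    name: str         # column 4: primer name
    primer_pool: str  # column 5: primer pool (kept as text, see assumption above)
    strand: str       # column 6: "+" or "-"


def parse_primer_bed(text: str) -> List[Primer]:
    """Parse BED text with the six columns described in the table."""
    primers = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        chrom, start, end, name, pool, strand = line.split()[:6]
        primers.append(Primer(chrom, int(start), int(end), name, pool, strand))
    return primers


# Made-up rows for illustration only; not copied from the real scheme file.
example_bed = (
    "ref1\t30\t54\texample_1_LEFT\t1\t+\n"
    "ref1\t385\t410\texample_1_RIGHT\t1\t-\n"
    "ref1\t320\t342\texample_2_LEFT\t2\t+\n"
)

primers = parse_primer_bed(example_bed)
for p in primers:
    # Print the primer name, its length, its pool and its direction.
    print(p.name, p.chrom_end - p.chrom_start, p.primer_pool, p.strand)
```

To try this on the downloaded scheme, pass the contents of `scov2.v3.primer.bed` to `parse_primer_bed` instead of `example_bed`.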

## Running the pipeline

Now that we have the primer scheme, the reference sequence and our FASTQ data, we can run the pipeline!

In [None]:
!artic minion \
    --normalise 200 \
    --threads 2 \
    --medaka \
    --medaka-model r941_min_high_g351 \
    --strict \
    --read-file ../data/BHAM-Run88-PTC.fastq.gz \
    scov2/V3 \
    my_example

That's it! Let's have a quick run-through of the parameters we used so that we can understand what was happening.

|parameter|explanation|
|:--------|:----------|
|`--normalise`| This caps amplicon coverage to 200 reads, used mainly to speed up the pipeline run. |
|`--threads`| This sets the number of CPU threads to use during the pipeline. We set this to 2 here as that is the limit on Binder, but if you are playing along at home you can increase this to make things run a bit more quickly. |
|`--medaka`| This tells the ARTIC pipeline to use the **medaka** workflow|
|`--medaka-model`| This specifies which model to use for the **medaka** program calls.|
|`--strict`| This runs an additional filtering of reported variants, checking them in overlap regions of the primer scheme to see if they are artifacts reported in only one primer pool.|
|`--read-file`| This tells the pipeline where to find the reads.|
|`scov2/V3`| This specifies the name of the primer scheme and the version to use. If it isn't found locally, the pipeline will try finding it in the ARTIC primer scheme repository.|
|`my_example`| The name to give this pipeline run. All output will have this prepended to the filenames.|
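
If you would rather drive the pipeline from a Python script than from a notebook cell, the same call can be assembled as an argument list. This is only a sketch: every flag and value is copied from the `artic minion` command above, and the `subprocess` call is left commented out because it assumes the ARTIC pipeline is installed and the read file exists at that path:

```python
import shlex

# Flags and values copied from the `artic minion` call above.
command = [
    "artic", "minion",
    "--normalise", "200",
    "--threads", "2",
    "--medaka",
    "--medaka-model", "r941_min_high_g351",
    "--strict",
    "--read-file", "../data/BHAM-Run88-PTC.fastq.gz",
    "scov2/V3",       # primer scheme name and version
    "my_example",     # run name, prepended to output filenames
]

# shlex.join quotes each argument where needed and returns one shell-ready string.
print(shlex.join(command))

# To actually run it (assumes the ARTIC pipeline is installed):
# import subprocess
# subprocess.run(command, check=True)
```

Keeping the arguments in a list makes it easy to change one value, such as the thread count, without retyping the whole command.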


In [None]:
!multiqc .