<a href="https://colab.research.google.com/github/sanjaynagi/AmpSeeker/blob/main/docs/AmpSeeker_workshop.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<center>
<img src="https://drive.google.com/uc?id=10AHbOZGReBnqkTylrW1bQqm7l_ROF0qg" alt="AmpSeeker Logo" width="380" height="260">
</center>

# Targeted Genomic Surveillance of Insecticide Resistance in *An. gambiae* s.l. with AmpSeeker and Ag-vampIR

Welcome to this hands-on workshop on amplicon sequencing in African malaria vectors. We will recap genomic surveillance of malaria vectors and the use of workflow managers, then analyse amplicon sequencing data with AmpSeeker.

*   [AmpSeeker GitHub repo](https://github.com/sanjaynagi/AmpSeeker)
*   [AmpSeeker Documentation](https://sanjaynagi.github.io/AmpSeeker/)
*   [Ag-vampIR-AmpSeeker results book](https://sanjaynagi.github.io/agvampir002-results/intro.html)
*   [Workshop Slides](https://docs.google.com/presentation/d/102D4w7hT2PRECYRvXQmf2K-_VFhVnzKunSD8LMF9OqY/edit?usp=sharing)


#### The Ag-vampIR amplicon panel

Ag-vampIR (the *Anopheles gambiae* Vector Amplicon Marker Panel for Insecticide Resistance) is an Illumina amplicon sequencing panel designed for surveillance of insecticide resistance in malaria vectors. The panel:

- Contains 80 amplicons targeting 90 SNP markers in the *An. gambiae* genome
- Targets known insecticide resistance loci and ancestry informative markers

<center>
<img src="https://raw.githubusercontent.com/sanjaynagi/AmpSeeker/refs/heads/main/docs/agvampir-ampseeker.png" alt="AmpSeeker Logo" width="600" height="600">
</center>

#### AmpSeeker

AmpSeeker is a computational workflow for analysing Illumina amplicon sequencing data. It works on amplicon data from any diploid organism and has additional modules specific to the Ag-vampIR panel. AmpSeeker:

- Provides end-to-end analysis from raw sequencing data to variant calling and downstream analyses
- Is built on Snakemake, ensuring reproducible and automated analysis
- Generates interactive visualizations and a local webpage for easy data exploration

AmpSeeker streamlines the entire analytical pipeline, making genomic surveillance more accessible to researchers without extensive bioinformatics expertise.

<br></br>

## Workshop Aims

In this workshop, we aim to:

1. Introduce you to the principles of targeted genomic surveillance for insecticide resistance monitoring
2. Use AmpSeeker to analyse Ag-vampIR amplicon panel data, taken from the recent manuscript.

## Prerequisites
- Linux or macOS
- Conda (preferably with Mamba also installed)
- Snakemake and Pandas: `mamba install -c conda-forge -c bioconda snakemake pandas`
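
Before starting, it can help to confirm that the command-line tools are actually on your `PATH`. The following is a minimal sketch using only the Python standard library; the tool lists and function names are assumptions made for illustration and are not part of AmpSeeker.

```python
import shutil

# Command-line tools used in this workshop; mamba is optional.
# These lists are illustrative assumptions, not an official requirement list.
REQUIRED_TOOLS = ["git", "conda", "snakemake"]
OPTIONAL_TOOLS = ["mamba"]


def find_tools(names):
    """Map each tool name to its location on PATH, or None if it is not found."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names):
    """Return the tools from `names` that are not on PATH."""
    return [name for name, path in find_tools(names).items() if path is None]


for name, path in find_tools(REQUIRED_TOOLS + OPTIONAL_TOOLS).items():
    print(f"{name:10s} {path or 'NOT FOUND'}")
```

Any tool reported as `NOT FOUND` needs to be installed, or its conda environment activated, before continuing.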

## Set up the workflow

**1. Clone the AmpSeeker repository to your system and set up conda**

`git clone https://github.com/sanjaynagi/AmpSeeker.git AgvampIR-workshop`

or, if you are on the h3bionet server:

`. /srv/UVRI-LSTM/AmpSeeker/setup.sh`

**2. Prepare the Illumina input data**


You can download the data from this [Dropbox link](https://www.dropbox.com/scl/fi/o3r3nkiv8nfebgaryg5cp/14_02_2024_MiSeq_output.zip?rlkey=0t34t8sx1ondox1bdz1k16nlf&st=qips0cfy&dl=0) (~4 GB total), or fetch either the full BCL folder or a smaller subset of the reads from the command line:

```bash
# Full BCL folder (~4 GB)
curl -L "https://www.dropbox.com/scl/fi/o3r3nkiv8nfebgaryg5cp/14_02_2024_MiSeq_output.zip?rlkey=0t34t8sx1ondox1bdz1k16nlf&st=qips0cfy&dl=1" -o agvampir-ms-bcl-output.zip && unzip agvampir-ms-bcl-output.zip -d ag-vampir-ms-output

# Subset of the reads (~120 MB), saved under its own file name
curl -L "https://www.dropbox.com/scl/fi/fvjfb9hbckrp9ltjnloby/agvampir002.zip?rlkey=24c0c4qh6ypr1l5p2h1agm0l7&st=salhhbv3&dl=1" -o agvampir002.zip && unzip agvampir002.zip -d ag-vampir-ms-output
```

After the download is complete, make sure the archive has been extracted and that the resulting folder sits inside the `AgvampIR-workshop/resources/` directory.
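
To confirm where the extracted run folder ended up, you can search for its `SampleSheet.csv`. This is a small sketch, assuming the run folder holds a `SampleSheet.csv` at its top level; the function name `find_run_dirs` is hypothetical and not part of AmpSeeker.

```python
from pathlib import Path


def find_run_dirs(root):
    """Return every directory under `root` that contains a SampleSheet.csv."""
    return sorted(path.parent for path in Path(root).rglob("SampleSheet.csv"))
```

Calling `find_run_dirs("resources")` from the `AgvampIR-workshop/` directory lists the candidate run folders. An empty list means the data has not been extracted into `resources/` yet.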

### Explore the metadata and config input files
AmpSeeker takes sample metadata from either a `SampleSheet.csv` or a `metadata.tsv` file, and the workflow itself is configured with a `config.yaml` file. Let's take a look at:

1.  The SampleSheet within the Illumina `14_02_2024_MiSeq_output` directory.

```bash
# View the SampleSheet from the Illumina run directory (path on the workshop cluster;
# adjust it if you downloaded the data into resources/ instead)
cat /DATASETS/ampseeker/14_02_2024_MiSeq_output/SampleSheet.csv
```

2.  The example config.yaml in the `config` directory.

```bash
# View the example config file
cat config/config.yaml
```
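
An Illumina sample sheet is divided into bracketed sections such as `[Header]`, `[Reads]` and `[Data]`, with the per-sample rows in `[Data]`. The sketch below shows one way to pull those rows out with the Python standard library. The example sheet and its values are hypothetical, and this is not how AmpSeeker itself reads the file.

```python
import csv
import io

# Hypothetical, minimal sample sheet used only for illustration.
EXAMPLE = """[Header]
Date,14/02/2024
[Reads]
151
151
[Data]
Sample_ID,Sample_Name,index,index2
S1,sample_one,ATCACG,CGATGT
S2,sample_two,TTAGGC,TGACCA
"""


def read_data_section(text):
    """Return the rows of the [Data] section as a list of dicts."""
    lines = text.splitlines()
    start = None
    for i, line in enumerate(lines):
        # Section headers may carry trailing commas, e.g. "[Data],,,".
        if line.strip().rstrip(",").lower() == "[data]":
            start = i
            break
    if start is None:
        raise ValueError("no [Data] section found")
    body = []
    for line in lines[start + 1:]:
        if line.startswith("["):  # the next section begins
            break
        if line.strip(",").strip():  # skip blank or comma-only lines
            body.append(line)
    return list(csv.DictReader(io.StringIO("\n".join(body))))


rows = read_data_section(EXAMPLE)
print([row["Sample_ID"] for row in rows])
```

To try it on real data, pass the text of your own `SampleSheet.csv` instead of `EXAMPLE`.
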
<br></br>

## Setting up reference genomes

AmpSeeker requires a reference genome for alignment and variant calling. For the Ag-vampIR panel, this is the *Anopheles gambiae* AgamP4 reference: a FASTA sequence file and a GFF3 annotation file.

### Download the reference genomes

If you're working on your own machine and do not already have these files, you can download them from VectorBase with the script in `resources/reference`. Run it from the `AgvampIR-workshop/` directory:
```bash
bash ./resources/reference/download-reference-genome.sh
```

<!--
On the workshop cluster, the reference genomes are already available. We can't directly link to the /DATASETS/ directory, as the workflow will then try to create genome indexes there, and we dont have write permissions. We can create a symbolic link to use them:

```bash
# Create symbolic links to the reference genomes
ln -s /DATASETS/ampseeker/resources/reference/AgamP4.fa ampseeker-workshop/resources/reference/
ln -s /DATASETS/ampseeker/resources/reference/AgamP4.gff3 ampseeker-workshop/resources/reference/
``` -->

<br></br>

## Create the configuration file (config.yaml)
Now let's create a configuration file for our analysis. We'll start by copying the example config file and then modify the copy for our dataset:

```bash
cp config/config.yaml config/config_workshop.yaml
```

Now edit `config/config_workshop.yaml` to match our dataset. The file should look something like this:

```yaml
# Dataset and panel information
dataset: agvampir-workshop
panel: ag-vampir
cohort-columns:
  - location
  - taxon
targets: config/ag-vampir.bed

# Illumina directory (using the MiSeq output)
illumina-dir: resources/14_02_2024_MiSeq_output

# Reference genome information
reference-fasta: resources/reference/Anopheles-gambiae-PEST_CHROMOSOMES_AgamP4.fa
reference-gff3: resources/reference/Anopheles-gambiae-PEST_BASEFEATURES_AgamP4.12.gff3
reference-snpeffdb: Anopheles_gambiae
custom-snpeffdb: False

# Input file type options
from-bcl: True
fastq:
  auto: True

# Quality control options
quality-control:
  sample-total-reads-threshold: 250
  amplicon-total-reads-threshold: 1000
  fastp: True
  coverage: True
  stats: True
  multiqc: True

# Analysis options
analysis:
  sample-map: False
  population-structure: True
  genetic-diversity: True
  allele-frequencies: True

# Build Jupyter book
build-jupyter-book: True
```
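
A common cause of a failed run is a config entry that points at a file or folder that does not exist. The sketch below checks the path-valued keys from the example above. The choice of keys and the function names are assumptions for illustration; AmpSeeker does not provide this helper, and `illumina-dir` only matters when starting from BCL data as in this example. Loading the file relies on PyYAML, which is normally installed alongside Snakemake.

```python
from pathlib import Path

# Keys from the example config whose values are file or directory paths.
PATH_KEYS = ["targets", "illumina-dir", "reference-fasta", "reference-gff3"]


def load_config(path):
    """Load a YAML config file into a dict (requires PyYAML)."""
    import yaml  # imported lazily so missing_paths() has no third-party dependency

    with open(path) as handle:
        return yaml.safe_load(handle)


def missing_paths(config, base="."):
    """Return (key, value) pairs for path keys that are unset or do not exist under `base`."""
    problems = []
    for key in PATH_KEYS:
        value = config.get(key)
        if value is None or not (Path(base) / value).exists():
            problems.append((key, value))
    return problems
```

For example, `missing_paths(load_config("config/config_workshop.yaml"))`, run from the `AgvampIR-workshop/` directory, returns an empty list when every path resolves.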

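## Activate conda and Snakemake

Before running the workflow, activate the conda environment in which you installed Snakemake and Pandas (with `conda activate` followed by the name of that environment), and work from the `AgvampIR-workshop/` directory.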

## Run AmpSeeker

Now that we have our configuration file and data set up, we can run AmpSeeker on our dataset. First, let's do a dry run to see what steps the workflow will execute:

```bash
snakemake --cores 4 --use-conda --configfile config/config_workshop.yaml -n
```

This will show you all the steps that will be executed without actually running them. If everything looks good, you can run the full analysis:

```bash
snakemake --cores 4 --use-conda --configfile config/config_workshop.yaml
```
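
If you prefer to drive the workflow from a Python script or notebook cell, the two commands above can be wrapped so that the real run starts only after a successful dry run. This is a sketch under that assumption; the function names are hypothetical, and the flags are exactly those used above.

```python
import subprocess


def snakemake_command(configfile, cores=4, dry_run=False):
    """Build the snakemake argument list used in this workshop."""
    cmd = ["snakemake", "--cores", str(cores), "--use-conda", "--configfile", configfile]
    if dry_run:
        cmd.append("-n")  # dry run: list the jobs without executing them
    return cmd


def run_workflow(configfile, cores=4):
    """Dry-run first, then launch the real run only if the dry run succeeded."""
    # check=True raises CalledProcessError if snakemake exits with an error.
    subprocess.run(snakemake_command(configfile, cores, dry_run=True), check=True)
    subprocess.run(snakemake_command(configfile, cores), check=True)
```

Calling `run_workflow("config/config_workshop.yaml")` from the `AgvampIR-workshop/` directory mirrors the two shell commands in sequence.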

<br></br>

#### Open the results web book

After AmpSeeker has completed running, it will generate a Jupyter Book with all the results in an easy-to-browse format. There are two ways to view this book:

**Option 1: Using Python's built-in HTTP server**

```bash
cd results/ampseeker-results/_build/html/
python -m http.server
```

Then, open your web browser and go to `http://localhost:8000` to view the book.
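
The same server can be started from Python without changing directory first. The sketch below uses only the standard library. The function name `make_server` is hypothetical, and unlike `python -m http.server`, which listens on all interfaces by default, it binds to `127.0.0.1` only.

```python
import functools
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def make_server(directory, port=8000):
    """Create a local-only HTTP server that serves files from `directory`."""
    handler = functools.partial(SimpleHTTPRequestHandler, directory=str(directory))
    return ThreadingHTTPServer(("127.0.0.1", port), handler)
```

To serve the book, call `make_server("results/ampseeker-results/_build/html").serve_forever()` and stop it with Ctrl+C.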

**Option 2: Opening the HTML file directly (if running on a local machine)**

If you're running AmpSeeker on your local machine, you can simply open the index.html file directly in your web browser:

```bash
open results/ampseeker-results/_build/html/index.html  # For macOS
# or
xdg-open results/ampseeker-results/_build/html/index.html  # For Linux
```

<br></br>

## Exploring the results

The Jupyter Book contains several sections that together provide a comprehensive analysis of your amplicon sequencing data:

1.  **Run Information**: Overview of the sequencing run and sample metadata
2.  **Read Quality**: Quality metrics for raw and processed reads
3.  **Coverage**: Depth and breadth of coverage across amplicons
4.  **Variant Calling**: SNPs and other variants identified
5.  **Population Structure**: Principal component analysis of genetic variation
6.  **Genetic Diversity**: Measures of diversity within and between populations
7.  **Ag-vampIR Specific Analyses**: If using the Ag-vampIR panel:
    *   **Species Identification**: Results of the species classification
    *   **Kdr Analysis**: Analysis of knockdown resistance mutations

Take some time to explore the different sections and visualizations in the web page.
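
When reading allele frequencies for resistance mutations in a diploid organism, remember that each called sample contributes two alleles. The sketch below illustrates that arithmetic for a single biallelic site. It is not AmpSeeker's implementation, and the genotype encoding (0 for reference, 1 for alternate, `None` for a missing call) is an assumption made for this example.

```python
def alt_allele_frequency(genotypes):
    """Frequency of the alternate allele among called diploid genotypes.

    Each genotype is a pair of allele codes (0 = reference, 1 = alternate);
    None marks a sample with no call. Returns None if no sample was called.
    """
    called = [genotype for genotype in genotypes if genotype is not None]
    if not called:
        return None
    alt_count = sum(allele == 1 for genotype in called for allele in genotype)
    return alt_count / (2 * len(called))


# Three called samples (hom-ref, het, hom-alt) and one missing call:
# 3 alternate alleles out of 6 called alleles.
print(alt_allele_frequency([(0, 0), (0, 1), (1, 1), None]))  # 0.5
```

Samples without a call are excluded from the denominator, so a low call rate shrinks the effective sample size behind a reported frequency.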

---

Workshop complete! :)