## Configuring the workflow

Before running AmpSeeker, we need to configure it by selecting which analyses to run. This is done by editing the file `config.yaml` in the `config` directory, which contains a number of options and parameters.

# AmpSeeker Configuration Guide

## Overview
AmpSeeker's configuration is primarily managed through the `config.yaml` file located in the `config` directory. This document is a guide to configuring the pipeline for your specific needs.

## Prerequisites
Before running AmpSeeker from BCL files, users must prepare a sample sheet. An example template containing all possible i5 and i7 combinations can be found at:
```
AmpSeeker/resources/exampleSampleSheet.csv
```
The sample sheet should be placed in the Illumina run folder. If running AmpSeeker directly from FASTQ files, no sample sheet is required.

## Configuration File Structure
The `config.yaml` file is organized into several main sections:

### Dataset and Panel Configuration
```yaml
dataset: ag-vampir-002
panel: ag-vampir
metadata: config/metadata_ms.tsv
cohort-columns:
  - location
  - taxon
targets: config/ag-vampir.bed
```

- `dataset`: Unique identifier for your dataset
- `panel`: Specifies the sequencing panel being used (e.g., `ag-vampir` for the *Anopheles gambiae* vampIR panel)
- `metadata`: Path to the metadata file containing sample information
- `cohort-columns`: List of columns from the metadata file used to define cohorts for analysis
- `targets`: BED file defining the target regions
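
Because every entry in `cohort-columns` has to be present in the metadata file, it can help to check the header before launching a run. The following is a minimal sketch, not part of AmpSeeker itself: the function name `missing_cohort_columns` and the `sample_id` column in the example are hypothetical, and the metadata is assumed to be tab-separated with a header row.

```python
import csv
import io


def missing_cohort_columns(metadata_lines, cohort_columns):
    """Return the cohort columns that are absent from the metadata header.

    `metadata_lines` is any iterable of text lines, e.g. an open file handle
    for the TSV given under `metadata` in config.yaml.
    """
    reader = csv.reader(metadata_lines, delimiter="\t")
    header = next(reader, [])  # an empty file yields an empty header
    return [col for col in cohort_columns if col not in header]


# Hypothetical metadata content; in practice, open the real metadata file.
example = io.StringIO("sample_id\tlocation\ttaxon\nS1\tsite_a\tgambiae\n")
print(missing_cohort_columns(example, ["location", "taxon", "year"]))
# → ['year']
```

An empty list means all configured cohort columns were found.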

### Input Data Configuration
```yaml
illumina-dir: resources/14_02_2024_MiSeq_output/
bcl-convert: True
fastq:
  auto: True
```

- `illumina-dir`: Directory containing the Illumina MiSeq output. If starting the workflow from FASTQ, this is not required.
- `bcl-convert`: Boolean flag indicating whether BCL files need to be converted to FASTQ
- `fastq.auto`: When set to `True`, enables automatic FASTQ file naming based on metadata. When `False`, the metadata must contain `fq1` and `fq2` columns giving the paths to each sample's FASTQ files.
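
The interaction between `fastq.auto` and the metadata columns can be checked up front. The sketch below is illustrative only and is not AmpSeeker's own validation: the helper name `fastq_column_problems` is hypothetical, the config is assumed to be already loaded into a dictionary, and the metadata is assumed to be tab-separated with a header row.

```python
import csv
import io


def fastq_column_problems(config, metadata_lines):
    """List problems with the fq1/fq2 metadata columns, if they are needed."""
    if config["fastq"]["auto"]:
        return []  # automatic naming: fq1/fq2 columns are not expected

    reader = csv.DictReader(metadata_lines, delimiter="\t")
    fieldnames = reader.fieldnames or []
    problems = [f"missing column: {col}" for col in ("fq1", "fq2")
                if col not in fieldnames]
    if problems:
        return problems

    # Header is line 1, so the first data row is line 2.
    for line_no, row in enumerate(reader, start=2):
        for col in ("fq1", "fq2"):
            if not (row[col] or "").strip():
                problems.append(f"line {line_no}: empty {col}")
    return problems


# Hypothetical config and metadata, for illustration only.
config = {"fastq": {"auto": False}}
metadata = io.StringIO("sample_id\tfq1\nS1\treads/S1_1.fastq.gz\n")
print(fastq_column_problems(config, metadata))
# → ['missing column: fq2']
```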

### Reference Genome Configuration
```yaml
reference-name: AgamP4
reference-fasta: resources/reference/Anopheles-gambiae-PEST_CHROMOSOMES_AgamP4.fa
reference-gff3: resources/reference/Anopheles-gambiae-PEST_BASEFEATURES_AgamP4.12.gff3
reference-snpeffdb: Anopheles_gambiae
```

- `reference-name`: Name identifier for the reference genome
- `reference-fasta`: Path to the reference genome FASTA file
- `reference-gff3`: Path to the GFF3 annotation file
- `reference-snpeffdb`: Name of the SnpEff database for variant annotation

### Quality Control Settings
```yaml
quality-control:
  sample-total-reads-threshold: 250
  amplicon-total-reads-threshold: 1000
  fastp: True
  coverage: True
  stats: True
  multiqc: True
```

- `sample-total-reads-threshold`: Minimum number of reads required per sample
- `amplicon-total-reads-threshold`: Minimum number of reads required per amplicon
- `fastp`: Enable FASTQ quality control and trimming
- `coverage`: Enable coverage analysis
- `stats`: Enable generation of basic BAM and VCF statistics
- `multiqc`: Enable MultiQC report generation
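
To get a feel for what a given read threshold implies, the sketch below flags samples whose total read count falls below `sample-total-reads-threshold`. It only illustrates the idea of a minimum-read cutoff: how AmpSeeker applies the threshold internally is not described here, the function name and read counts are hypothetical, and treating a count equal to the threshold as passing is an assumption.

```python
def flag_low_read_samples(read_counts, threshold):
    """Return the sorted sample IDs whose total reads are below `threshold`."""
    return sorted(sample for sample, n_reads in read_counts.items()
                  if n_reads < threshold)


# Hypothetical per-sample read counts, for illustration only.
quality_control = {"sample-total-reads-threshold": 250}
read_counts = {"S1": 1200, "S2": 90, "S3": 250}

print(flag_low_read_samples(read_counts,
                            quality_control["sample-total-reads-threshold"]))
# → ['S2']
```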

### Analysis Configuration
```yaml
analysis:
  igv: True
  sample-map: False
  pca: True
  genetic-diversity: True
  allele-frequencies: True
```

- `igv`: Generate IGV session files for visualization of alignments
- `sample-map`: Generate geographic sample distribution maps (requires latitude and longitude data)
- `pca`: Perform Principal Component Analysis
- `genetic-diversity`: Calculate genetic diversity metrics
- `allele-frequencies`: Calculate allele frequencies

### Output Configuration
```yaml
build-jupyter-book: True
```

- `build-jupyter-book`: Generate an interactive HTML report using Jupyter Book

## Usage Notes

1. **File Paths**: All paths in the configuration file should be relative to the AmpSeeker root directory.

2. **Sample Sheet Requirements**: When running from BCL files (`bcl-convert: True`), ensure your sample sheet follows the format specified in `AmpSeeker/resources/exampleSampleSheet.csv`.

3. **Cohort Analysis**: The `cohort-columns` specified must exist in your metadata file for proper cohort-based analyses.

4. **Quality Control Thresholds**: Adjust the read thresholds based on your experimental design and expected coverage.

5. **Reference Genome**: Ensure all reference files are properly formatted and match each other before running the pipeline.
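
One concrete way to check that reference files match each other (note 5) is to confirm that every contig named in the targets BED file also appears in the reference FASTA. The following is a minimal sketch under that interpretation, with hypothetical function names; it compares contig names only and does not validate coordinates or the GFF3 file.

```python
def fasta_contigs(lines):
    """Collect contig names (first word after '>') from FASTA header lines."""
    return {line[1:].split()[0] for line in lines
            if line.startswith(">") and line[1:].strip()}


def bed_contigs(lines):
    """Collect contig names from the first column of a BED file."""
    contigs = set()
    for line in lines:
        # Skip blank lines, comments and optional BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        contigs.add(line.split()[0])
    return contigs


def unknown_contigs(fasta_lines, bed_lines):
    """Return the sorted BED contigs that are missing from the FASTA."""
    return sorted(bed_contigs(bed_lines) - fasta_contigs(fasta_lines))


# Hypothetical file contents; in practice, pass open file handles for the
# paths given under `reference-fasta` and `targets`.
fasta = [">2L example description\n", "ACGT\n", ">2R\n", "ACGT\n"]
bed = ["2L\t100\t200\tamp1\n", "3R\t5\t50\tamp2\n"]
print(unknown_contigs(fasta, bed))
# → ['3R']
```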

**If the example config is out of date, up-to-date versions are stored in the `.test/config` folder.**  
**If you have any issues configuring the pipeline, please watch the video walkthrough first, then raise an issue on GitHub or [email me](mailto:sanjay.c.nagi@gmail.com).**