## Configuring the workflow

RNA-Seq-Pop is configured with a configuration file (`config.yaml`). This document describes the purpose of each configuration parameter.

The configuration file looks like the following:

```
metadata: config/samples.tsv                    

dataset: 'Ag_Bouake'                            

fastq:
      auto: True                                          

```

**metadata**: Path to the sample metadata file.  
**dataset**: A name for this dataset.  
**fastq: auto**: If True, the fastq files are expected in `resources/reads/` and follow the `{sampleID}_1.fastq.gz` pattern. If False, `fq1` and `fq2` columns in the sample metadata point to the read files for each sample.
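
As a rough illustration of the `auto: True` naming convention, the Python sketch below builds the read paths that would be looked up for a sample. The helper name `expected_fastq_paths`, the sample ID, and the `_2` suffix for the second mate are assumptions made for illustration; only the `_1` pattern and the `resources/reads/` directory come from the description above.

```python
from pathlib import Path

READS_DIR = Path("resources/reads")  # directory named in the description above


def expected_fastq_paths(sample_id: str) -> tuple[Path, Path]:
    """Return the (mate 1, mate 2) paths implied by the auto naming pattern.

    Hypothetical helper: the `_2` suffix for the second mate is an assumption.
    """
    fq1 = READS_DIR / f"{sample_id}_1.fastq.gz"
    fq2 = READS_DIR / f"{sample_id}_2.fastq.gz"
    return fq1, fq2


if __name__ == "__main__":
    # "Kisumu1" is a made-up sample ID used only for this example.
    for path in expected_fastq_paths("Kisumu1"):
        print(path)
```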

```
contigs: ['2L', '2R', '3L', '3R', 'X']

reference:
  genome:
        "resources/reference/Anopheles-gambiae-PEST_CHROMOSOMES_AgamP4.fa"       
  transcriptome:
        "resources/reference/Anopheles-gambiae-PEST_TRANSCRIPTS_AgamP4.12.fa"      
  gff:
        "resources/reference/Anopheles-gambiae-PEST_BASEFEATURES_AgamP4.12.gff3"  
  snpeffdb:
        "Anopheles_gambiae"                                                       
  genes2transcripts:                                                                
        "resources/exampleGene2TranscriptMap.tsv"                                 
```

**contigs**   
A list of the contigs you wish to analyse. The names must match those in the reference files.

**Reference files**   
*genome*: Path to the genome (.fa)  
*transcriptome*: Path to the transcriptome (.fa)  
*gff*: Path to the genome feature file (.gff3)  
*snpeffdb*: Name of the SnpEff database to use  
*genes2transcripts*: Path to the gene-to-transcript map (.tsv)  

```
contrasts:
      - 'Kisumu_gambiaeCont'
      - 'gambiaeCont_gambiaePM'

cutadapt:
      activate: False
      adaptors:

QualityControl:                               
      activate: True
```

**contrasts**  
A list of the pairwise contrasts you wish to run, from values in the treatment column of the sample metadata file. The format is `control_case`.
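
The `control_case` format can be sketched in Python as follows. This is a minimal illustration, not the workflow's own code: the function name `parse_contrast` is hypothetical, and splitting on a single underscore assumes that treatment names themselves contain no underscores.

```python
def parse_contrast(contrast: str, treatments: set[str]) -> tuple[str, str]:
    """Split a 'control_case' string and check both names against the treatments.

    Hypothetical helper; assumes treatment names contain no underscores.
    """
    parts = contrast.split("_")
    if len(parts) != 2:
        raise ValueError(f"expected 'control_case', got {contrast!r}")
    unknown = [name for name in parts if name not in treatments]
    if unknown:
        raise ValueError(f"not in the treatment column: {unknown}")
    control, case = parts
    return control, case


if __name__ == "__main__":
    # Treatment values taken from the example contrasts in the config above.
    treatments = {"Kisumu", "gambiaeCont", "gambiaePM"}
    for contrast in ["Kisumu_gambiaeCont", "gambiaeCont_gambiaePM"]:
        control, case = parse_contrast(contrast, treatments)
        print(f"{contrast}: control={control}, case={case}")
```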

**cutadapt**  
*activate*: If True, trim reads  
*adaptors*: Adaptor sequences to be trimmed  

**QualityControl**  
*activate*: If True, run quality control analyses  

```
DifferentialExpression:
      activate: True                          # Activate differential expression analyses

      venn:
            activate: True
            padj_threshold: 0.05
            groups: ['gambiaeCont_gambiaePM', 'Kisumu_gambiaeCont']

      progressiveGenes:                       
            activate: True                    
            padj_threshold: 0.05
            fc_threshold: 1
            groups: "Kisumu_gambiaeCont_gambiaePM"

      GSEA:                                 # Activate fgsea Gene set enrichment analysis 
            activate: True
            gaf: "resources/reference/VectorBase-50_AgambiaePEST_GO.gaf"
            KeggSpeciesID: "aga"            # 3 letter KEGG species identifier. "aga" = Anopheles gambiae/coluzzii. 
            replaceString: "AgaP_"          # An.gambiae KEGG pathways have this string appended which needs removing
```

**Differential Expression**  
*activate*: If True, run differential expression analysis  

**venn**  
*activate*: If True, make Venn diagrams of differentially expressed genes  
*padj_threshold*: Adjusted p-value threshold to call significance  
*groups*: String or list of strings indicating the DE comparisons to draw a Venn diagram for  
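
To make the role of `padj_threshold` in the Venn step concrete, here is a small Python sketch that selects significant genes per comparison and computes the regions of a two-set Venn diagram. The function names, the gene IDs, the adjusted p-values, and the use of a strict `<` comparison are all assumptions for illustration.

```python
def significant_genes(padj_by_gene: dict[str, float], padj_threshold: float = 0.05) -> set[str]:
    """Genes whose adjusted p-value is below the threshold (strict '<' is an assumption)."""
    return {gene for gene, padj in padj_by_gene.items() if padj < padj_threshold}


def venn_regions(sig_a: set[str], sig_b: set[str]) -> dict[str, set[str]]:
    """The three regions of a two-set Venn diagram."""
    return {"only_a": sig_a - sig_b, "both": sig_a & sig_b, "only_b": sig_b - sig_a}


if __name__ == "__main__":
    # Made-up adjusted p-values for two comparisons.
    comparison_a = {"gene1": 0.001, "gene2": 0.20, "gene3": 0.04}
    comparison_b = {"gene1": 0.03, "gene2": 0.01, "gene3": 0.60}
    regions = venn_regions(significant_genes(comparison_a), significant_genes(comparison_b))
    for name, genes in regions.items():
        print(name, sorted(genes))
```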

**progressiveGenes**  
*activate*: If True, run the progressiveGenes analysis, which finds genes that are consistently up- or downregulated across two comparisons.  
*padj_threshold*: Adjusted p-value threshold to call significance  
*fc_threshold*: Fold-change threshold to call overexpression  
*groups*: String or list of strings, each naming the 3 populations that define the 2 DE comparisons to be compared. In this example, the analysis asks which genes upregulated in Kisumu v gambiaeCont are also upregulated in gambiaeCont v gambiaePM.  
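
The way a three-population `groups` string maps onto two DE comparisons can be sketched as below. The function name `progressive_comparisons` is hypothetical, and the sketch assumes population names contain no underscores.

```python
def progressive_comparisons(group: str) -> list[str]:
    """Turn 'A_B_C' into the two consecutive comparisons 'A_B' and 'B_C'.

    Hypothetical helper; assumes population names contain no underscores.
    """
    populations = group.split("_")
    if len(populations) != 3:
        raise ValueError(f"expected three populations, got {group!r}")
    first, middle, last = populations
    return [f"{first}_{middle}", f"{middle}_{last}"]


if __name__ == "__main__":
    print(progressive_comparisons("Kisumu_gambiaeCont_gambiaePM"))
```

With the example value from the config, this yields the two contrasts `Kisumu_gambiaeCont` and `gambiaeCont_gambiaePM`.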

**GSEA**  
*activate*: If True, run fgsea gene set enrichment analysis.  
*gaf*: Path to the GAF file with GO annotations  
*KeggSpeciesID*: 3-letter KEGG species identifier. "aga" = *Anopheles gambiae/coluzzii*.  
*replaceString*: *An. gambiae* KEGG pathways have this string appended, which needs removing.  

```
VariantAnalysis:
      activate: True
      caller: 'freebayes'
      ploidy: 10                          # Ploidy level for freebayes to call at (generally we are using pooled samples). For diploid organisms, this should be 2 * number of individuals in each pool
      chunks: 9                           # Number of chunks to split each chromosome into when parallelising freebayes. 9 or less is recommended. 

      pca:                                # Run PCA on the genotype data?
            activate: True
            missingness: 1 
      
      summaryStatistics:                  # Estimate Population Genetic Summary Statistics such as Dxy, Pi
            activate: True
            missingness: 1
      
      selection:                          # Calculate Fst and PBS per gene and in windows
            activate: True
            missingness: 1
            
            # Population Branch Statistic analysis (Needs three conditions, two closely related and an outgroup)
            pbs:
                  activate: True          
                  contrasts: 
                        - 'gambiaePM_gambiaeCont_Kisumu'
                  windownames: ['small', 'large']
                  windowsizes: [1000, 5000]
                  windowsteps: [500, 1000]
```

**VariantAnalysis**  
*activate*: If True, genome alignment and variant calling are performed.  
*caller*: Currently only 'freebayes' is supported.  
*ploidy*: The ploidy at which to call variants. For pooled data, this should be the number of individuals per pool multiplied by the organism's ploidy.  
*chunks*: The number of chunks to split each chromosome into to parallelise variant calling. 9 or fewer is recommended.  
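
The ploidy rule for pooled samples is simple arithmetic, sketched here in Python. The function name `pooled_ploidy` is hypothetical; a pool of 5 diploid individuals is used as an example because it gives the value 10 shown in the config above.

```python
def pooled_ploidy(individuals_per_pool: int, organism_ploidy: int = 2) -> int:
    """Ploidy to pass to the variant caller for a pooled sample."""
    if individuals_per_pool < 1 or organism_ploidy < 1:
        raise ValueError("both values must be positive integers")
    return individuals_per_pool * organism_ploidy


if __name__ == "__main__":
    print(pooled_ploidy(5))                      # 5 diploid individuals per pool -> 10
    print(pooled_ploidy(1, organism_ploidy=2))   # a single diploid individual -> 2
```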

**pca**  
*activate*: If True, run PCA on the SNP data  
*missingness*: A filter on the proportion of samples that must have data at a given position. 1 means a genomic position can have no missing calls in any sample.
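
One plausible reading of the `missingness` filter is sketched below: a position is kept when the fraction of samples with a call is at least the configured value. This is an interpretation of the description above, not the workflow's actual code; the function name and the use of `None` to mark a missing call are assumptions.

```python
from typing import Optional, Sequence


def passes_missingness(calls: Sequence[Optional[int]], missingness: float) -> bool:
    """Keep a position if enough samples have a call.

    `calls` holds one entry per sample; None marks a missing call (an assumption).
    """
    if not calls:
        raise ValueError("at least one sample is required")
    called = sum(call is not None for call in calls)
    return called / len(calls) >= missingness


if __name__ == "__main__":
    position = [0, 1, None, 2]  # made-up calls for four samples, one missing
    print(passes_missingness(position, 1))    # one call is missing, so the position is dropped
    print(passes_missingness(position, 0.5))  # 3 of 4 samples have calls, so it is kept
```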

**summaryStatistics**  
*activate*: If True, run genetic diversity analyses.   
*missingness*: A filter on the proportion of samples that must have data at a given position. 1 means a genomic position can have no missing calls in any sample.

**selection**  
*activate*: If True, run Fst analysis  
*missingness*: A filter on the proportion of samples that must have data at a given position. 1 means a genomic position can have no missing calls in any sample.

**pbs**   
*activate*: If True, run population branch statistic analysis. Requires 3 suitable treatment groups.  
*contrasts*: List of the 3 groups as strings separated by an underscore.  
*windownames*: Names for the window configurations. The selection analysis can be run with multiple window sizes.  
*windowsizes*: Window sizes, one per window name  
*windowsteps*: Window steps, one per window name  
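
The three window lists are read in parallel, so each name is paired with the size and step at the same index. The Python sketch below shows that pairing and how a size/step pair produces overlapping windows. The function names are hypothetical, the units of the windows are not stated here, and dropping a trailing partial window is an assumption.

```python
def window_configs(names: list[str], sizes: list[int], steps: list[int]) -> dict[str, tuple[int, int]]:
    """Pair each window name with its (size, step)."""
    if not (len(names) == len(sizes) == len(steps)):
        raise ValueError("windownames, windowsizes and windowsteps must have equal length")
    return {name: (size, step) for name, size, step in zip(names, sizes, steps)}


def sliding_windows(n_sites: int, size: int, step: int) -> list[tuple[int, int]]:
    """Half-open (start, end) windows; a trailing partial window is dropped (assumption)."""
    return [(start, start + size) for start in range(0, n_sites - size + 1, step)]


if __name__ == "__main__":
    configs = window_configs(["small", "large"], [1000, 5000], [500, 1000])
    for name, (size, step) in configs.items():
        windows = sliding_windows(10_000, size, step)
        print(name, len(windows), windows[:2])
```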

```
      # For An. gambiae s.l only - run Ancestry Informative Marker analysis. AIM datasets included in git repo.
      ancestry:                                
            activate: True
            missingness: 0.5                                # proportion between 0 and 1 
            gambcolu: "resources/gamb_vs_colu.zarr"         # path to gambcolu AIMs
            arab: "resources/gambcolu_vs_arab.zarr"         # path to arab AIMs

      # For Anopheles gambiae s.l only - Run analysis with compkaryo to determine proportion % karyotype
      karyotype:                            
            activate: True
            inversions:
                  - "2La"
                  - "2Rb"
```

**ancestry** (*An. gambiae* only)   
*activate*: If True, run AIM analysis.    
*missingness*: A filter on the proportion of samples that must have data at a given position. 1 means a genomic position can have no missing calls in any sample.  
*gambcolu*: Path to the gambiae vs coluzzii AIMs  
*arab*: Path to the gambiae/coluzzii vs arabiensis AIMs  

**karyotype** (*An. gambiae* only)    
*activate*: If True, run karyotyping module.  
*inversions*: A list of inversions in compKaryo to analyse.

```
miscellaneous:
      # Creates allele count .xlsx reports on specific mutations of choice
      VariantsOfInterest:             
            activate: True                             
            path: "resources/exampleMutations.tsv"         
      
      GeneFamiliesHeatmap:
            activate: True
            eggnog: resources/Anogam_long.pep_eggnog_diamond.emapper.annotations.GO
            pfam: resources/Anogam_long.pep_Pfamscan.seqs

      # find DE genes that lie under Ag1000g selective sweeps 
      # Only applicable for Anopheles gambiae s.l
      sweeps:
            activate: True
            padj_threshold: 0.05
            fc_threshold: 1.5
```

**VariantsOfInterest**  
*activate*: If True, run variants of interest analysis   
*path*: Path to variants of interest data  

**GeneFamiliesHeatmap**   
*activate*: If True, make heatmaps of gene families using GO terms and Pfam domains.  
*eggnog*: Path to the eggNOG file with GO annotations for your organism (produced by eggnog-mapper).  
*pfam*: Path to the Pfam domains for your organism.  

**sweeps** (*An. gambiae* only)  
*activate*: If True, run analysis to determine if DE genes lie under known selective sweep loci.   
*padj_threshold*: threshold to call significance  
*fc_threshold*: threshold to call overexpression  