In [1]:
import pandas as pd
from IPython.display import SVG

# Metagenomics

## Steps of the analysis

* QC of raw reads with [fastqc](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
* Read trimming with [fastp](https://github.com/OpenGene/fastp)
    * Removes adapters, drops reads with average PHRED quality below 20
    * Does QC of trimmed reads
* Mapping reads to reference with [minimap2](https://github.com/lh3/minimap2)
* Filter alignment files with samtools view
    * Removes reads that are not mapped in proper pairs
    * Removes reads with a mapping PHRED score below 60 (in minimap2 this is equivalent to 30)
* Call variants with [freebayes](https://github.com/freebayes/freebayes)
    * Allele needs to be present 3 times
    * Minimum allele fraction is 0.05
    * Minimum 30 PHRED mapping quality (redundant because of filtering)
    * Minimum 20 PHRED base quality
* Annotate alleles with [SnpEff](https://pcingola.github.io/SnpEff/snpeff/build_db/#step-2-option-2-building-a-database-from-genbank-files)
* Filter alleles with custom Python script [vcf_exporter.py](https://github.com/sulheim/leakage/blob/eric/code/meta_sequencing/results/vcf_exporter.py)
    * All frequencies
        * Minimum allele frequency: 0.05
        * Minimum PHRED quality calculated by freebayes: 30
        * Minimum allele count: 5
        * Minimum depth at allele site: 30
    * Fixed alleles:
        * Minimum allele frequency: 0.9
        * Minimum PHRED quality calculated by freebayes: 30
        * Minimum depth at allele site: 30
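
The two threshold sets above can be sketched as a small pandas filter. This is only an illustration and not the code in `vcf_exporter.py`: the function name, the constant names, and the demo rows are made up, and treating every threshold as inclusive (`>=`) is an assumption. The column names follow the CSV columns described further down.

```python
import pandas as pd

# Thresholds copied from the list above; names are hypothetical.
ALL_FILTER = {"freq": 0.05, "qual": 30, "alt_count": 5, "depth": 30}
FIXED_FILTER = {"freq": 0.9, "qual": 30, "depth": 30}


def filter_variants(df, thresholds):
    """Keep rows whose value is >= the minimum in every listed column."""
    mask = pd.Series(True, index=df.index)
    for column, minimum in thresholds.items():
        mask &= df[column] >= minimum
    return df[mask]


# Made-up variants, for illustration only.
demo = pd.DataFrame({
    "pos": [100, 200, 300, 400],
    "qual": [50.0, 10.0, 80.0, 45.0],
    "depth": [60, 60, 100, 25],
    "freq": [0.10, 0.50, 0.95, 0.95],
    "alt_count": [6, 30, 95, 24],
})

all_filtered = filter_variants(demo, ALL_FILTER)      # keeps pos 100 and 300
fixed_filtered = filter_variants(demo, FIXED_FILTER)  # keeps pos 300 only
```

Position 200 is dropped for low quality and position 400 for low depth; position 100 passes the general filter but is not frequent enough to count as fixed.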

## File and directory structure

There was a disk quota issue so the processing was done on `/work/FAC/FBM/DMF/smitri/evomicrocomm/seq_snorre/data/meta_sequencing`.
`references_sequencing/` contains FASTA and annotation files, `meta_sequencing/` contains all files from the samples.  
The code is available in this repo under `../../code/meta_sequencing/` where mainly the Snakefile is important.  
`vcf_exporter.py` grabs the files from freebayes and filters the alleles. The output is a CSV that I dump under `../../code/meta_sequencing/results` in different folders:
* `all_variants`:
    * All variants output by freebayes (note there is already some default filtering)
* `all_filtered_variants`:
    * All variants filtered according to the `vcf_exporter.py` filter described above
* `fixed_filtered_variants`:
    * Fixed alleles filtered according to the `vcf_exporter.py` filter described above
      
The `csv` contains the following columns:  
`chrom,pos,qual,depth,freq,alt,alt_count,ref,type,len,eff,gene,product,linegroup,sample`
* `freq`: Allele frequency
* `alt`: Nucleotide, deletion, or insertion in the allele
* `alt_count`: Number of reads with the observed allele
* `ref`: Nucleotide or sequence in reference
* `type`: Type of mutation
* `eff`: Effect on the translated feature if the allele lies in a CDS
* `linegroup`: Unique identifiers for each allele (also unique across samples)
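
A quick way to check that an exported file has this layout is to compare its header against the column list. The sketch below reads from an in-memory string instead of a real results file, and the single data row is made up purely for illustration.

```python
import io

import pandas as pd

EXPECTED_COLUMNS = (
    "chrom,pos,qual,depth,freq,alt,alt_count,ref,type,len,"
    "eff,gene,product,linegroup,sample"
).split(",")

# One made-up row; real values come from vcf_exporter.py output.
csv_text = (
    ",".join(EXPECTED_COLUMNS) + "\n"
    "chr1,1000,120.5,80,0.25,T,20,C,snp,1,missense_variant,"
    "geneX,hypothetical protein,lg1,sample_A\n"
)

table = pd.read_csv(io.StringIO(csv_text))

# Columns that are expected but absent from the file (empty if all present).
missing = [column for column in EXPECTED_COLUMNS if column not in table.columns]
print(missing)
```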

## Some quality checks

We can look at the regions of the knockout genes AceE and SucB to check that no strains were mixed up during processing. Everything looks good!

### AceE (Position 119504-122167)

![meta_acee](../screenshots/meta_alignments_AceE.png)

### SucB (Position 756978-758195)

![meta_sucb](../screenshots/meta_alignments_SucB.png)

## Coverage
![coverage](../screenshots/fig1_coverage.png)

## Frequencies of alleles across samples

![allel_freq](../screenshots/fig2_frequencies.png)

Fixed alleles (frequency >= 0.9):

In [6]:
df = pd.read_csv('../../code/meta_sequencing/results/fixed_filtered_variants/all_samples.filtered.csv')[['gene','product','sample']]
tmp = df[df['sample'] == 'AceE_M2_D44']
print(tmp)

     gene                                            product       sample
136  ldhA  fermentative D-lactate dehydrogenase, NAD-depe...  AceE_M2_D44
137   NaN                                                NaN  AceE_M2_D44
138  ynfM              putative arabinose efflux transporter  AceE_M2_D44
139  galS  galactose- and fucose-inducible galactose regu...  AceE_M2_D44
140   NaN                                                NaN  AceE_M2_D44
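
To get an overview instead of printing one sample at a time, the fixed alleles can be counted per sample. The sketch below is only an illustration: it rebuilds the `gene` and `sample` columns from the printed output above rather than reading the results file, and the names `n_fixed` and `n_annotated` are made up. Rows where `gene` is `NaN` have no gene annotation.

```python
import numpy as np
import pandas as pd

# Rebuilt from the printed output above (gene and sample columns only).
df_demo = pd.DataFrame({
    "gene": ["ldhA", np.nan, "ynfM", "galS", np.nan],
    "sample": ["AceE_M2_D44"] * 5,
})

# "size" counts all rows per sample, "count" skips rows where gene is NaN.
summary = df_demo.groupby("sample").agg(
    n_fixed=("gene", "size"),
    n_annotated=("gene", "count"),
)
print(summary)
```

With the full table loaded through `pd.read_csv` as in the cell above, the same `groupby` gives one row per sample.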
