# ASE calling pipeline

Mostly based on Yanyu's work in 2017. The pipeline aligns `fastq` reads with `STAR` to produce `bam` files, then adjusts the mapping via `WASP` to account for allele-specific mapping bias, and finally calls genotypes and allele-specific expression (ASE) via `QuASAR`.

## Overview of procedure

1. Gather required resources
2. Align reads to genome
3. Remove biased reads
4. Genotype and ASE calling

The pipeline is implemented in SoS, as shown in the rest of this SoS notebook. It can be executed directly from this notebook on a local or remote computer. For more information, see the [SoS website](https://github.com/vatlab/SOS).

To run the pipeline:

```
sos run nb.ipynb hg19_reference
sos run nb.ipynb obtain_samples
sos run nb.ipynb align
sos run nb.ipynb call
sos run nb.ipynb quasar
```

### Environment configurations

In [5]:
[global]
cwd = path('~/Documents/m6A/Data/ASE')
parameter: ncpu = 20
# Resource files
resource_dir = f"{cwd:a}/hg19"
ref_fa = "hg19.fa"
ref_gtf = 'Homo_sapiens.GRCh38.91.gtf.gz'
wasp_dir = f"{cwd:a}/WASP-master/mapping"
# Sample files
sample_dir = f"{cwd:a}/samples"
from collections import OrderedDict
## A list of sample names (keys) and their corresponding FASTQ files (values)
samples = OrderedDict({'ENCLB279NMT': ['ENCFF824TZM', 'ENCFF176JNE']})
fastq = paths([[f"{sample_dir}/{s}/{q}.fastq.gz" for q in samples[s]] for s in samples])
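
To make the sample bookkeeping concrete, here is a plain-Python sketch of how `samples` expands into the flat `fastq` list. It uses `os.path` rather than SoS `paths`, assumes that `paths` flattens the nested list, and uses a placeholder `sample_dir` that is not the notebook's real location.

```python
import os
from collections import OrderedDict

# Placeholder directory for illustration; the notebook derives it from `cwd`
sample_dir = "/data/ASE/samples"

# Sample name -> accessions of its two paired-end FASTQ files
samples = OrderedDict({'ENCLB279NMT': ['ENCFF824TZM', 'ENCFF176JNE']})

# Flatten into one list, keeping the two mates of each sample adjacent
fastq = [
    os.path.join(sample_dir, s, f"{q}.fastq.gz")
    for s in samples
    for q in samples[s]
]

for f in fastq:
    print(f)
```

Because the two mates of a sample sit next to each other in the list, grouping the list two at a time (as the alignment step does with `group_by = 2`) yields one pair per sample.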

## Resource preparation

### hg19 human reference data

Obtain `hg19.fa` and `Homo_sapiens.GRCh38.91.gtf.gz`.

In [3]:
[hg19_reference_1 (download)]
# Download `hg19.2bit` and `twoBitToFa` from {ucsc_url}
ucsc_url = "http://hgdownload.cse.ucsc.edu"
output: f"{resource_dir}/hg19.2bit", f"{resource_dir}/twoBitToFa"
download: dest_dir = resource_dir, expand = True
    {ucsc_url}/goldenPath/hg19/bigZips/hg19.2bit
    {ucsc_url}/admin/exe/linux.x86_64/twoBitToFa

[hg19_reference_2 (decompress hg19.fa)]
# Use `twoBitToFa` to extract `hg19.fa` from `hg19.2bit`
output: f"{resource_dir}/{ref_fa}"
bash: expand = True
    chmod +x {_input[1]}
    {_input[1]} {_input[0]} {_output}

[hg19_reference_3 (gene annotations)]
# Download `Homo_sapiens.GRCh38.91.gtf.gz` from Ensembl
# https://useast.ensembl.org/info/data/ftp/index.html
ensembl_ftp = 'ftp://ftp.ensembl.org/pub/release-91/gtf/homo_sapiens/'
output: f"{resource_dir}/{ref_gtf}"
download: dest_dir = resource_dir, expand = True
    {ensembl_ftp}/{ref_gtf}

Additionally, `WASP` uses a pre-defined list of SNPs and removes the mapping bias caused by them. The list was generated by Yanyu using 1000 Genomes SNPs in VCF format, with a MAF filter, as input to the `WASP/mapping/extract_vcf_snps.sh` command. Here I just take this pre-compiled list.
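
For orientation, the sketch below shows roughly what such a conversion does. It is not Yanyu's script or the `WASP` shell script itself. It assumes that the SNP files hold one `position ref alt` row per biallelic SNP, and that the MAF filter can be read from an `AF=` entry in the VCF `INFO` column. The function name `vcf_to_wasp_snps`, the `min_maf` threshold and the example records are all made up for illustration.

```python
def vcf_to_wasp_snps(vcf_lines, min_maf=0.05):
    """Yield 'position ref alt' rows for biallelic SNPs passing a MAF filter.

    Hypothetical helper: `min_maf` and the use of INFO/AF are assumptions.
    """
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip VCF header lines
        fields = line.rstrip("\n").split("\t")
        pos, ref, alt, info = fields[1], fields[3], fields[4], fields[7]
        # Keep single-base substitutions only (drops indels and multi-allelic sites)
        if len(ref) != 1 or len(alt) != 1:
            continue
        af = None
        for entry in info.split(";"):
            if entry.startswith("AF="):
                af = float(entry[3:])
        if af is None:
            continue
        # Minor allele frequency is the smaller of AF and 1 - AF
        if min(af, 1.0 - af) >= min_maf:
            yield f"{pos} {ref} {alt}"


# Made-up records, for illustration only
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t10583\trs1\tG\tA\t100\tPASS\tAF=0.14",
    "chr1\t10611\trs2\tC\tG\t100\tPASS\tAF=0.02",
    "chr1\t13302\trs3\tC\tCT\t100\tPASS\tAF=0.11",
]
rows = list(vcf_to_wasp_snps(example))
print(rows)
```

Only the first record survives here: the second fails the MAF threshold and the third is an insertion rather than a SNP.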

### Software tools

In [None]:
[wasp]
output: f"{wasp_dir}/README.md"
download: decompress = True, dest_file = f'{cwd:a}/WASP.zip'
    https://github.com/bmvdgeijn/WASP/archive/master.zip
bash: expand = True
    rm -f {cwd:a}/WASP.zip

[star]
# Quite time & resource consuming (3hrs, 32GB memory)
depends: Py_Module('docker')
output: f"{resource_dir}/genomeParameters.txt"
bash: workdir = f'{cwd:a}', docker_image = 'bschiffthaler/ngs', expand=True
     STAR --runMode genomeGenerate \
        --genomeDir {resource_dir} \
        --genomeFastaFiles {resource_dir}/{ref_fa} \
        --sjdbGTFtagExonParentTranscript {resource_dir}/{ref_gtf} \
        --runThreadN {ncpu}

`WASP` apparently only works with Python 2, so a `conda` environment is needed for it:

```
conda create -n py27 python=2.7
source activate py27
conda install pytables=2.4.0
pip install pysam
```

and activate it in the `WASP` steps with

```
source activate py27
```

## Get samples

FIXME: add description -- what are these samples?

**Caution: these samples are paired-end reads.** This is different from Yanyu's workflow and will impact `STAR` and `WASP` commands.

In [1]:
[obtain_samples]
# Download samples from ENCODE
# https://www.encodeproject.org/experiments/ENCSR384KAN/
encode_url = 'https://www.encodeproject.org/files'
input: for_each = 'fastq', concurrent = True
output: fastq, group_by = 1
download: dest_dir = sample_dir, expand = True
    {encode_url}/{_fastq:bnn}/@@download/{_fastq:b}
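
The download URL is assembled from each FASTQ path with SoS format specifiers. The sketch below reproduces that in plain Python, assuming `:b` means "basename" and `:bnn` means "basename with two extensions removed". The helper `encode_download_url` and the example path are hypothetical.

```python
import os

encode_url = 'https://www.encodeproject.org/files'


def encode_download_url(fastq_path):
    """Build the ENCODE download URL for one local FASTQ path (hypothetical helper)."""
    base = os.path.basename(fastq_path)  # counterpart of {_fastq:b}
    # Strip ".gz" and then ".fastq", the counterpart of {_fastq:bnn}
    accession = os.path.splitext(os.path.splitext(base)[0])[0]
    return f"{encode_url}/{accession}/@@download/{base}"


# Placeholder local path, for illustration only
url = encode_download_url("/data/ASE/samples/ENCLB279NMT/ENCFF824TZM.fastq.gz")
print(url)
```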

## Alignment

### Prefiltering alignment

Align with `STAR`, followed by `samtools` to remove reads with mapping quality below the given cutoff.

In [None]:
[align_1 (STAR prefiltering alignment)]
parameter: qual_cutoff = 10
depends: sos_step('star')
input: fastq, group_by = 2, concurrent = True
output: [f"{sample_dir}/{x}.qual{qual_cutoff}.bam" for x in samples], group_by = 1
bash: workdir = f'{cwd:a}', docker_image = 'bschiffthaler/ngs', expand=True
    STAR --genomeDir {resource_dir} \
        --readFilesIn {_input[0]} {_input[1]} \
        --readFilesCommand zcat \
        --runThreadN {ncpu} --outStd BAM_SortedByCoordinate \
        --outSAMtype BAM SortedByCoordinate \
        --sjdbGTFtagExonParentTranscript {resource_dir}/{ref_gtf} |
    samtools view -bq {qual_cutoff} > {_output}
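
The effect of `samtools view -bq {qual_cutoff}` can be illustrated on SAM text: alignments whose MAPQ (the fifth column) is below the cutoff are dropped. This is a pure-Python sketch on made-up records, not a replacement for `samtools`. The function name `filter_by_mapq` is hypothetical.

```python
def filter_by_mapq(sam_lines, qual_cutoff=10):
    """Yield header lines and alignments with MAPQ >= qual_cutoff."""
    for line in sam_lines:
        if line.startswith("@"):
            yield line  # header lines pass through unchanged
            continue
        mapq = int(line.split("\t")[4])  # MAPQ is the 5th SAM column
        if mapq >= qual_cutoff:
            yield line


# Made-up SAM records, for illustration only
sam = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t99\tchr1\t100\t255\t4M\t=\t200\t104\tACGT\tIIII",
    "r2\t99\tchr1\t150\t3\t4M\t=\t250\t104\tACGT\tIIII",
    "r3\t99\tchr1\t300\t0\t4M\t=\t400\t104\tACGT\tIIII",
]
kept_names = [
    line.split("\t")[0]
    for line in filter_by_mapq(sam, qual_cutoff=10)
    if not line.startswith("@")
]
print(kept_names)
```

If `STAR`'s default MAPQ scheme is in effect (255 for uniquely mapped reads, 3 or lower for multi-mapped reads), a cutoff of 10 effectively keeps only uniquely mapped reads.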

### WASP-informed remap

Remap reads after [`WASP`](https://github.com/bmvdgeijn/WASP/tree/master/mapping) adjustment.
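
The idea behind the remap step can be sketched as follows: reads overlapping SNPs are re-aligned with their alleles swapped, and a read is kept only if every swapped copy maps back to the original position. The code below is a simplified illustration of that rule on made-up data, not `WASP`'s implementation. The function `keep_unbiased` and the dictionary layout are assumptions.

```python
def keep_unbiased(original, remapped):
    """Return reads whose allele-swapped copies all remap to the original position.

    original: read name -> (chrom, pos) from the first alignment
    remapped: read name -> list of (chrom, pos), one per allele-swapped copy;
              None marks a copy that failed to remap
    """
    kept = []
    for read, pos in original.items():
        copies = remapped.get(read, [])
        if copies and all(p == pos for p in copies):
            kept.append(read)
    return kept


# Made-up positions, for illustration only
original = {
    "readA": ("chr1", 1000),
    "readB": ("chr1", 2000),
    "readC": ("chr2", 500),
}
remapped = {
    "readA": [("chr1", 1000)],                # maps back to the same place: keep
    "readB": [("chr1", 2000), ("chr7", 88)],  # one copy moved: discard
    "readC": [None],                          # copy failed to remap: discard
}
unbiased = keep_unbiased(original, remapped)
print(unbiased)
```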

In [None]:
[align_2 (WASP intersecting SNP): shared = {'wasp_split': 'step_output'}]
# WASP finding unbiased reads intersecting with SNP
depends: sos_step('wasp')
input: group_by = 1, concurrent = True
output: [[f"{x:n}.remap.fq1.gz", f"{x:n}.remap.fq2.gz", f"{x:n}.to.remap.bam", f"{x:n}.keep.bam"] for x in _input], group_by = 4
bash: workdir = f'{cwd:a}', expand = True
    source activate py27
    python {wasp_dir}/find_intersecting_snps.py {_input} \
        --snp_dir {resource_dir}/wasp_snp_list \
        --is_sorted --is_paired_end

[align_3 (STAR post alignment)]
# Align WASP remap with STAR
# Followed by samtools to remove reads with quality less than {qual_cutoff}
parameter: qual_cutoff = 10
input: group_by = 4, pattern = '{name}.{qual}.remap.{ext}', concurrent = True
output: expand_pattern(f'{_name[0]}.remapped.qual{qual_cutoff}.bam')
bash: workdir = f'{cwd:a}', docker_image = 'bschiffthaler/ngs', expand=True
    STAR --genomeDir {resource_dir} \
        --readFilesIn {_input[0]} {_input[1]} \
        --readFilesCommand zcat \
        --runThreadN {ncpu} --outStd BAM_SortedByCoordinate \
        --outSAMtype BAM SortedByCoordinate \
        --sjdbGTFtagExonParentTranscript {resource_dir}/{ref_gtf} |
    samtools view -bq {qual_cutoff} > {_output}

[align_4 (WASP remove ambiguously mapped reads)]
to_remap = paths([wasp_split[i:i+4][2] for i in range(0, len(wasp_split), 4)])
input: group_by = 1, paired_with = 'to_remap', pattern = '{name}.remapped.{ext}', concurrent = True
output: expand_pattern(f'{_name[0]}.keep_remapped.{_ext[0]}')
bash: workdir = f'{cwd:a}', expand=True
    source activate py27
    python {wasp_dir}/filter_remapped_reads.py {_input} {_to_remap} {_output}

[align_5 (Merge WASP adjusted and originally kept BAM)]
kept = paths([wasp_split[i:i+4][3] for i in range(0, len(wasp_split), 4)])
input: group_by = 1, paired_with = 'kept', pattern = '{name}.keep_remapped.{ext}', concurrent = True
output: expand_pattern(f'{_name[0]}.wasp_remapped.{_ext[0]}')
bash: workdir = f'{cwd:a}', docker_image = 'bschiffthaler/ngs', expand=True
    samtools merge - {_input} {_kept} | samtools sort -o {_output} -
    samtools index {_output}

[align_6 (WASP remove duplicate reads )]
input: group_by = 1, pattern = '{name}.bam', concurrent = True
output: expand_pattern(f'{_name[0]}.dedup.bam')
bash: workdir = f'{cwd:a}', expand=True
    source activate py27
    python {wasp_dir}/rmdup.py {_input} {_output}
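
Steps `align_4` and `align_5` recover per-sample files from the shared `wasp_split` list by slicing it four at a time. The sketch below reproduces that indexing in plain Python with hypothetical sample names (`S1`, `S2`), assuming `wasp_split` holds the four `align_2` outputs of each sample in declaration order.

```python
# Hypothetical prefiltered BAM files, one per sample
bam_files = ["samples/S1.qual10.bam", "samples/S2.qual10.bam"]

# align_2 declares four outputs per input BAM, in this order
wasp_split = []
for bam in bam_files:
    stem = bam[:-len(".bam")]
    wasp_split += [
        f"{stem}.remap.fq1.gz",   # index 0: mate 1 reads to remap
        f"{stem}.remap.fq2.gz",   # index 1: mate 2 reads to remap
        f"{stem}.to.remap.bam",   # index 2: original alignments of those reads
        f"{stem}.keep.bam",       # index 3: reads not overlapping any SNP
    ]

# Same slicing as in align_4 and align_5
to_remap = [wasp_split[i:i+4][2] for i in range(0, len(wasp_split), 4)]
kept = [wasp_split[i:i+4][3] for i in range(0, len(wasp_split), 4)]

print(to_remap)
print(kept)
```

Each list ends up with one entry per sample, which is what allows `paired_with` to match them one-to-one with the per-sample inputs of those steps.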