# Analyzing cappable-seq pilot experiment

## Pipeline

Similar to the `dusp11_clip_seq` pipeline:
1. Trim any adapter sequences using `cutadapt`
2. Deplete any ribosomal RNA using `SortMeRNA`
3. Align to hybrid WSN/human genome index using `STAR`
4. Quantify transcripts using `Salmon`
5. Measure differential expression using `DESeq2`

## Ribodeplete
We shouldn't have much rRNA, given the ribodepletion of inputs and the nature of cappable-seq samples, but we will still use `SortMeRNA` to align against human rRNA sequences and see what comes out.

In [2]:
# %load ../sortRNA.sh
#! /bin/bash

source env.sh

sudo mkdir -p ribodepleted_reads
sudo chown "$USER_ID:$GROUP_ID" ribodepleted_reads
sudo chmod -R 775 ribodepleted_reads

for file in $(find trimmed_reads/ -name "*.fastq.gz"); do
    base=$(basename "$file" ".fastq.gz")
    echo "Sorting $base"
    # -aligned collects reads matching the rRNA reference; -other collects everything else
    sortmerna -ref human_rRNAs.fasta -reads "$file" --threads 4 -fastx \
        -workdir "ribodepleted_reads/$base" \
        -aligned "ribodepleted_reads/${base}_rRNA" \
        -other "ribodepleted_reads/${base}_nonrRNA"
    sudo rm -R "ribodepleted_reads/$base/" # Clean up the working directory
done
echo "Sortmerna complete"
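To "see what comes out" of the rRNA sort, one option is to count the reads that landed in each output file. The following is a minimal sketch, assuming standard four-line FASTQ records; the function names `count_fastq_reads` and `rrna_percent` and the toy files are made up for illustration and are not part of the original pipeline.

```bash
# Count reads in a FASTQ file (plain or gzipped); assumes 4 lines per record.
count_fastq_reads() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | awk 'END { print int(NR / 4) }'
}

# Print the rRNA share (in percent) given rRNA and non-rRNA read counts.
rrna_percent() {
    awk -v r="$1" -v n="$2" 'BEGIN {
        t = r + n
        if (t == 0) print "NA"; else printf "%.1f\n", 100 * r / t
    }'
}

# Demo on toy files (hypothetical data, one rRNA read and three others).
demo_dir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' > "$demo_dir/toy_rRNA.fq"
printf '@r2\nGGCC\n+\nIIII\n@r3\nTTAA\n+\nIIII\n@r4\nCCGG\n+\nIIII\n' > "$demo_dir/toy_nonrRNA.fq"
r=$(count_fastq_reads "$demo_dir/toy_rRNA.fq")
n=$(count_fastq_reads "$demo_dir/toy_nonrRNA.fq")
echo "rRNA reads: $r, non-rRNA reads: $n, rRNA share: $(rrna_percent "$r" "$n")%"
rm -r "$demo_dir"
```

On real data, the two functions would be pointed at each sample's `_rRNA` and `_nonrRNA` outputs in `ribodepleted_reads/`.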

## Alignment

### Make `STAR` index

Use the concatenated WSN/FFLUC and hg38 genomes to form a hybrid index for alignment. Use the homemade WSN_annotated.gtf for flu and the FFLUC spike-in control.

In [None]:
cat genome/GCF_000001405.40_GRCh38.p14_genomic.fasta genome/WSN_Mehle.fasta > genome/hybrid_genome.fasta
cat genome/genomic.gtf genome/WSN_annotated.gtf > genome/hybrid_annotated.gtf
STAR --runThreadN 4 --runMode genomeGenerate --genomeDir genome/star_indices --genomeFastaFiles genome/hybrid_genome.fasta --sjdbGTFfile genome/hybrid_annotated.gtf --sjdbOverhang 100 --limitGenomeGenerateRAM 30000000000
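Because both the FASTA and the GTF are built by concatenation, a mismatch between the sequence names in the two files is an easy mistake to make. The sketch below lists names that appear in column 1 of a GTF but have no matching FASTA header. The function name `missing_seqnames` and the toy files are hypothetical; this is only a sanity check, not something `STAR` requires you to run.

```bash
# Print sequence names from column 1 of a GTF that have no matching
# FASTA header (first whitespace-delimited token after ">").
missing_seqnames() {
    fasta=$1
    gtf=$2
    awk -F '\t' '
        FNR == NR {                          # first file: the FASTA
            if (/^>/) { sub(/^>/, "", $0); split($0, a, /[ \t]/); seen[a[1]] = 1 }
            next
        }
        /^#/ || NF < 9 { next }              # skip GTF comments and malformed lines
        !($1 in seen) && !reported[$1]++ { print $1 }
    ' "$fasta" "$gtf"
}

# Demo on toy files: "HA_toy" is annotated but absent from the FASTA.
demo_dir=$(mktemp -d)
printf '>chr_toy human-like contig\nACGT\n>PB2_toy\nGGCC\n' > "$demo_dir/toy.fasta"
printf 'chr_toy\tsrc\texon\t1\t4\t.\t+\t.\tgene_id "g1";\n' > "$demo_dir/toy.gtf"
printf 'HA_toy\tsrc\texon\t1\t4\t.\t+\t.\tgene_id "g2";\n' >> "$demo_dir/toy.gtf"
echo "Annotated sequences missing from the FASTA:"
missing_seqnames "$demo_dir/toy.fasta" "$demo_dir/toy.gtf"
rm -r "$demo_dir"
```

An empty result means every annotated sequence name was found in the FASTA.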

### Align to hybrid index

In [2]:
# %load ../run_star_ribodepleted.sh
#!/bin/bash

source env.sh

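set -e
trap cleanup EXIT

# The EXIT trap fires on every exit, so only unload the genome here if the
# script failed; the success path at the end of the script unloads it explicitly.
cleanup () {
    status=$?
    if [ "$status" -ne 0 ]; then
        echo "STAR alignment failed, removing genome from memory"
        STAR --genomeLoad Remove \
            --genomeDir genome/star_indices \
            --outFileNamePrefix star_alignments/exit/exit # remove the genome from memory
        sudo rm -R star_alignments/exit || true # remove the exit folder
        echo "STAR genome removal complete"
    fi
}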

sudo mkdir -p /home/ubuntu/blockvolume/dusp11_clip-seq/ribodepleted_star_alignments
sudo chown -R $USER_ID:$GROUP_ID /home/ubuntu/blockvolume/dusp11_clip-seq/ribodepleted_star_alignments
sudo chmod -R 775 /home/ubuntu/blockvolume/dusp11_clip-seq/ribodepleted_star_alignments

# Align to the genome and also write transcriptome-space alignments (TranscriptomeSAM)
for file in ribodepleted_reads/*_trimmed_nonrRNA.fq.gz; do
    g=$(basename "$file" _trimmed_nonrRNA.fq.gz)
    p=${g#ultraplex_demux_}
    out="ribodepleted_star_alignments/${p}/${p}_ribodepleted" # shared prefix for this sample's files
    mkdir -p "ribodepleted_star_alignments/${p}" # make sure the per-sample directory exists
    echo "Aligning $p"
    STAR --runThreadN 4 \
        --genomeDir genome/star_indices \
        --readFilesIn "$file" \
        --outFileNamePrefix "${out}_" \
        --quantMode TranscriptomeSAM \
        --genomeLoad LoadAndKeep \
        --outReadsUnmapped Fastx \
        --readFilesCommand zcat
    echo "Converting $p to bam"
    samtools view -b -o "${out}_aligned.bam" \
        "${out}_Aligned.out.sam" # convert to bam (-b forces BAM output)
    sudo rm "${out}_Aligned.out.sam" # remove the sam file
    echo "Sorting $p"
    samtools sort "${out}_Aligned.toTranscriptome.out.bam" \
        -o "${out}_transcriptome_sorted.bam" \
        -@ 4 # sort the transcriptome bam file
    samtools sort "${out}_aligned.bam" \
        -o "${out}_sorted.bam" \
        -@ 4 # sort the bam file
    echo "Indexing $p"
    samtools index "${out}_sorted.bam" \
        -o "${out}_sorted.bai" \
        -@ 4 # index the sorted bam file
    samtools index "${out}_transcriptome_sorted.bam" \
        -o "${out}_transcriptome_sorted.bai" \
        -@ 4 # index the sorted transcriptome bam file
done

# Cleanup RAM
echo "STAR alignment complete, removing genome from memory"
STAR --genomeLoad Remove \
    --genomeDir genome/star_indices \
    --outFileNamePrefix star_alignments/exit/exit # remove the genome from memory
sudo rm -R star_alignments/exit # remove the exit folder
echo "STAR genome removal complete"
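The alignment script relies on `trap ... EXIT` to unload the shared genome if something goes wrong. An EXIT trap runs on every exit, successful or not, so a handler that should only act on failure has to inspect the exit status. The toy sketch below shows that pattern in isolation; `run_with_cleanup` and `on_exit` are hypothetical names, and the `echo` lines stand in for the real cleanup commands.

```bash
# Run a command in a subshell that has an EXIT trap. The handler reads $?
# (the status the subshell is exiting with) to tell failure from success.
run_with_cleanup() {
    (
        on_exit() {
            status=$?
            if [ "$status" -ne 0 ]; then
                echo "failed with status $status, cleaning up"
            else
                echo "finished normally"
            fi
        }
        trap on_exit EXIT
        "$@"
    )
}

run_with_cleanup true            # handler reports a normal exit
run_with_cleanup false || true   # handler reports the failure
```

The subshell keeps the trap from leaking into the calling shell, which is convenient for experimenting in a notebook cell.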

## Quantify reads: alignment-independent (bypass `STAR`)

Now we need to quantify reads before testing for differential expression. `DESeq2` can take counts from `Salmon` (in fact, this is the official recommendation in the docs). In `dusp11_clip-seq` I discussed quantification methods, including simple gene counts and more sophisticated transcript counts. We'll use the genome alignments to inform transcript quantification (and maybe test alignment-independent quantification once this approach is done) before passing those results to `DESeq2`.

`Salmon` needs alignments to the transcriptome; luckily, `STAR` can output these. It first aligns to the genome and then maps those alignments to the transcriptome, while still preserving any novel reads that don't map to the provided transcriptome. (Note: this is different from the clip-seq approach, where we simply wanted to find peaks in those files, not quantify reads.) However, this is an alignment-dependent approach.

`Salmon` can also quantify in an alignment-independent manner and, counterintuitively, this method seems to be more accurate than alignment-dependent methods. This is because alignment-dependent methods (Cufflinks, HTSeq, featureCounts, maybe `Salmon` in alignment-dependent mode) can vastly underestimate abundance when reads come from sequences with >90% sequence similarity to each other.

Somewhat confusingly, while alignment-independent mapping is superior for quantification, the resulting counts are more accurate when summarized per gene rather than per transcript. This is because transcripts of the same gene can share conserved UTRs and other sequences, and differ only as splice isoforms, so reads are hard to assign to a single transcript. I think the best approach here is therefore to quantify in an alignment-independent manner and collapse these counts down to the gene level. This is recommended here.

To accomplish this, I'll use `salmon` to map reads to transcripts without aligning, and then collapse these into counts per gene before passing them to `DESeq2`. This makes the `STAR` alignment obsolete.

We need a transcript index covering both human and WSN transcripts for our hybrid genome, which is tricky. `gffread` can take a gff/gtf file and a genome file and extract transcripts from the genome sequences using the annotation, writing a transcripts.fasta file. Simply combining the annotation files and the genome files results in `gffread` extracting only human or only WSN sequences, for some reason. To get around this, I ran `gffread` on the WSN genome with the WSN annotation and on the human genome with the human annotation, and then combined the outputs. This should work with the combined human/WSN genome that I made for `STAR` when building indexes for `salmon`.
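Collapsing transcript-level counts to gene level amounts to summing over the transcripts of each gene. The sketch below does this with `awk`, assuming a two-column, tab-separated transcript-to-gene map and the standard `quant.sf` column order (Name, Length, EffectiveLength, TPM, NumReads). The function name `collapse_to_genes`, the file names, and the toy values are made up for illustration; in practice this step is commonly done in R with `tximport` before handing counts to `DESeq2`.

```bash
# Sum Salmon NumReads (column 5 of quant.sf) per gene, using a
# tab-separated map of transcript_id -> gene_id.
# Transcripts missing from the map are silently skipped.
collapse_to_genes() {
    tx2gene=$1
    quant=$2
    awk -F '\t' -v OFS='\t' '
        FNR == NR { gene[$1] = $2; next }   # first file: the map
        FNR == 1 { next }                   # skip the quant.sf header
        ($1 in gene) { sum[gene[$1]] += $5 }
        END { for (g in sum) print g, sum[g] }
    ' "$tx2gene" "$quant" | sort
}

# Demo with toy data (hypothetical IDs and counts).
demo_dir=$(mktemp -d)
printf 'tx1\tGENE_A\ntx2\tGENE_A\ntx3\tGENE_B\n' > "$demo_dir/tx2gene.tsv"
printf 'Name\tLength\tEffectiveLength\tTPM\tNumReads\n' > "$demo_dir/quant.sf"
printf 'tx1\t1000\t800\t5.0\t10\n' >> "$demo_dir/quant.sf"
printf 'tx2\t1200\t1000\t2.0\t5.5\n' >> "$demo_dir/quant.sf"
printf 'tx3\t900\t700\t3.0\t7\n' >> "$demo_dir/quant.sf"
collapse_to_genes "$demo_dir/tx2gene.tsv" "$demo_dir/quant.sf"
rm -r "$demo_dir"
```

This plain sum ignores the transcript-length correction that dedicated tools can apply, so treat it as an illustration of the idea rather than a replacement for them.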

### Make `salmon` index

Now we need to index the transcriptome (WSN and human) so that `salmon` can map to it. First, run `gffread` with the annotation (gtf) files I downloaded/created for each genome to extract only the transcripts.

In [None]:
gffread -w genome/WSN_transcripts.fasta -g genome/WSN_Mehle.fasta genome/WSN_annotated.gtf
gffread -w genome/human_transcripts.fasta -g genome/GCF_000001405.40_GRCh38.p14_genomic.fasta genome/genomic.gtf
cat genome/WSN_transcripts.fasta genome/human_transcripts.fasta > genome/hybrid_transcripts.fasta

Note: concatenating the gff files and genome files and then running `gffread` yields either WSN transcripts only or human transcripts only. This is probably a formatting issue with the homemade gff file. Instead, extract transcripts from each genome separately and concatenate the results.

Also, WSN_transcripts.fasta contains the HA and NA gene segments for some reason. Manually remove them so they don't break downstream processing, and figure out later why `gffread` is including them.
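The manual removal could also be scripted. The sketch below drops FASTA records by ID with `awk`. It assumes the unwanted records can be identified by the first token of their header line; the IDs `HA` and `NA` in the demo are guesses at how those records are named, and `drop_fasta_records` is a hypothetical helper, so check the actual headers in WSN_transcripts.fasta before using anything like this.

```bash
# Drop FASTA records whose ID (first token after ">") appears in the
# comma-separated list given as $2. Sequence lines follow their header.
drop_fasta_records() {
    awk -v ids="$2" '
        BEGIN { n = split(ids, a, ","); for (i = 1; i <= n; i++) drop[a[i]] = 1 }
        /^>/ { id = substr($1, 2); keep = !(id in drop) }
        keep
    ' "$1"
}

# Demo on a toy FASTA (hypothetical record names and sequences).
demo_dir=$(mktemp -d)
printf '>PB2 toy\nACGT\n>HA toy\nGGGG\n>NA toy\nCCCC\n>NP toy\nAAAA\n' > "$demo_dir/toy.fasta"
drop_fasta_records "$demo_dir/toy.fasta" "HA,NA"
rm -r "$demo_dir"
```

Writing the filtered output to a new file, rather than editing in place, keeps the original `gffread` output around for debugging.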

Now we can make the `salmon` index from the transcripts, using the genome as decoy sequences for more accurate mapping and automatic filtering of DNA contamination.

In [None]:
# First make a list of genomic decoys
grep "^>" genome/hybrid_genome.fasta | cut -d " " -f 1 > genome/decoys.txt
sed -i.bak -e 's/>//g' genome/decoys.txt

# Make a gentrome file of transcriptome FIRST followed by genome to form the index
cat genome/hybrid_transcripts.fasta genome/hybrid_genome.fasta > genome/hybrid_gentrome.fasta

# Make salmon index
salmon index -t genome/hybrid_gentrome.fasta -d genome/decoys.txt -p 4 -i genome/salmon_index
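Before building the index, it can be worth checking that the decoy list and the gentrome agree: every name in decoys.txt should match a FASTA header, and the decoy (genome) records should come after all transcript records. The sketch below is one way to check this; `check_decoys` is a hypothetical helper and the toy files are made up, so this is not a `salmon` feature.

```bash
# Check that every decoy name has a matching FASTA header and that no
# non-decoy record appears after the first decoy record.
# Prints any problems and returns non-zero if one is found.
check_decoys() {
    decoys=$1
    gentrome=$2
    awk '
        FNR == NR { if (NF) decoy[$1] = 1; next }   # first file: decoy names
        /^>/ {
            id = substr($1, 2)
            if (id in decoy) { found[id] = 1; in_decoys = 1 }
            else if (in_decoys) { print "non-decoy after decoys: " id; bad = 1 }
        }
        END {
            for (d in decoy) if (!(d in found)) { print "decoy missing from FASTA: " d; bad = 1 }
            exit (bad ? 1 : 0)
        }
    ' "$decoys" "$gentrome"
}

# Demo on toy files: transcripts first, then the two decoy sequences.
demo_dir=$(mktemp -d)
printf 'chr_toy\nPB2_toy\n' > "$demo_dir/decoys.txt"
printf '>tx1\nACGT\n>tx2\nTTGA\n>chr_toy\nACGTTTGA\n>PB2_toy\nGGCC\n' > "$demo_dir/gentrome.fasta"
if check_decoys "$demo_dir/decoys.txt" "$demo_dir/gentrome.fasta"; then
    echo "decoys OK"
fi
rm -r "$demo_dir"
```

On the real files, this would be called as `check_decoys genome/decoys.txt genome/hybrid_gentrome.fasta`.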