# Analysis of *A. thaliana* RNA-Seq data with pyrpipe 
Use public *A. thaliana* RNA-Seq data to assemble transcripts.

In [1]:
from pyrpipe import sra,mapping,assembly,qc,tools
from pyrpipe import pyrpipe_utils as pu
from pyrpipe import pyrpipe_engine as pe
#First get the SRR accessions of the runs. For this, one can use the Python package pysradb or the R package SRAdb
#I will consider the following randomly selected accessions
#athalRuns=['SRR976159','SRR978411','SRR978410','SRR971778','SRR1058116','SRR1058118','SRR1058121','SRR1058110','SRR1058120','SRR1058117','SRR1104134','SRR1104133','SRR1104135','SRR1104136','SRR1105825']
athalRunsSmol=['SRR976159','SRR978411','SRR971778']
#set your working directory if you don't want to use the current working directory
workingDir="athal_out"
#create working directory
if not pu.check_paths_exist(workingDir):
    pu.mkdir(workingDir)

Logs will be saved to 2020-03-16-14_33_21_pyrpipe.log


## Download genome and gtf

In [2]:
GENOME=workingDir+"/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa"
GTF=workingDir+"/Arabidopsis_thaliana.TAIR10.45.gtf"

if not pu.check_files_exist(GENOME):
    print("Downloading genome fasta file")
    wget="wget ftp://ftp.ensemblgenomes.org/pub/release-46/plants/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz -q -O "+GENOME+".gz"
    pe.execute_command(wget.split(),verbose=True,logs=False)
    pe.execute_command(['gunzip',GENOME+".gz"],verbose=True,logs=False)
    
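if not pu.check_files_exist(GTF):
    #note: this fetches the release-46 annotation (TAIR10.46) but stores it under the local name held in GTF (TAIR10.45)
    print("Downloading GTF file")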
    wget="wget ftp://ftp.ensemblgenomes.org/pub/release-46/plants/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.46.gtf.gz -O "+GTF+".gz"
    pe.execute_command(wget.split(),verbose=True,logs=False)
    pe.execute_command(['gunzip',GTF+".gz"],verbose=True,logs=False)
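
Where `wget` or `gunzip` is not available, the same download-and-decompress step can be sketched with the Python standard library alone. The helper name `fetch_and_gunzip` is hypothetical and not part of pyrpipe; this is a minimal sketch that assumes the URL points to a single gzip-compressed file.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path


def fetch_and_gunzip(url, dest):
    """Download a .gz file from `url` and write the decompressed data to `dest`.

    Hypothetical helper (not part of pyrpipe). The download is skipped
    when `dest` already exists; the path to `dest` is returned either way.
    """
    dest = Path(dest)
    if dest.exists():
        return dest
    dest.parent.mkdir(parents=True, exist_ok=True)
    gz_path = dest.with_name(dest.name + ".gz")
    # urlretrieve accepts ftp://, http(s):// and file:// URLs
    urllib.request.urlretrieve(url, gz_path)
    # decompress in a streaming fashion so a large genome is never held in memory
    with gzip.open(gz_path, "rb") as src, open(dest, "wb") as out:
        shutil.copyfileobj(src, out)
    # remove the compressed copy, mirroring what gunzip does
    gz_path.unlink()
    return dest
```

With the variables defined above, a call could look like `fetch_and_gunzip(url, GENOME)`, where `url` is the Ensembl FTP address used in the `wget` command.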
    
        

## Download data and create SRA objects
First, we download all data (fastq files) to disk and create pyrpipe.SRA objects.

In [3]:

##download all data in athalRunsSmol
sraObjects=[]

for x in athalRunsSmol:
    thisSraOb=sra.SRA(x,workingDir)
    if thisSraOb.download_fastq():
        sraObjects.append(thisSraOb)
    else:
        print("Download failed:"+x)

print("Following runs downloaded:")
for ob in sraObjects:
    print(ob.srr_accession)

$ fasterq-dump -O athal_out/SRR976159 -o SRR976159.fastq -e 8 -f SRR976159
Time taken:0:00:39
$ fasterq-dump -O athal_out/SRR978411 -o SRR978411.fastq -e 8 -f SRR978411
Time taken:0:00:41
$ fasterq-dump -O athal_out/SRR971778 -o SRR971778.fastq -e 8 -f SRR971778
Time taken:0:00:53
Following runs downloaded:
SRR976159
SRR978411
SRR971778


## Saving current session
I downloaded the SRA files first because **in a typical HPC setting, one might have access to special data-transfer nodes**. These nodes can download data efficiently but do not allow expensive computations. Data can also be downloaded from compute nodes, **but you will burn much of your computing time/allocation just downloading the data**. It is therefore a good idea to download the data separately and then start the processing.

We can save the objects created with pyrpipe and restore our session later on a compute node.

In [22]:
# save current session
from pyrpipe import pyrpipe_session
pyrpipe_session.save_session(filename="mySession",add_timestamp=True,out_dir=workingDir)

Session saved to: athal_out/mySession_20200316140640.pyrpipe


True

## Restoring saved session
We can restore the pyrpipe session using the saved session file (saved with .pyrpipe extension).

**Note:** After restoring a session, a new log file will be generated to store the logs.

In [23]:
#first clear current session used by notebook
%reset
print(sraObjects)

Once deleted, variables cannot be recovered. Proceed (y/[n])? y


NameError: name 'sraObjects' is not defined

In [24]:
#restore session
from pyrpipe import pyrpipe_session
#update the pyrpipe session file below
pyrpipe_session.restore_session("athal_out/mySession_20200316140640.pyrpipe")
print(sraObjects)

Session restored.
[<pyrpipe.sra.SRA object at 0x7fe0f0dcea10>, <pyrpipe.sra.SRA object at 0x7fe0f0dcee90>, <pyrpipe.sra.SRA object at 0x7fe0f0dce690>]
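
To make the save/restore idea concrete, here is a rough analogy built on the standard-library `pickle` module. It is not how `pyrpipe_session` is implemented, and the function names `save_objects` and `restore_objects` are hypothetical; the sketch only shows the general pattern of writing named objects to a timestamped file and loading them back later, possibly on another node.

```python
import pickle
import time
from pathlib import Path


def save_objects(objects, out_dir, filename="mySession"):
    """Pickle a dict of name -> object to a timestamped file and return its path.

    Hypothetical helper for illustration; only picklable objects are supported.
    """
    stamp = time.strftime("%Y%m%d%H%M%S")
    path = Path(out_dir) / f"{filename}_{stamp}.pkl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as fh:
        pickle.dump(objects, fh)
    return path


def restore_objects(path):
    """Load the dict written by save_objects. Only unpickle files you trust."""
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

In a notebook, `globals().update(restore_objects(path))` would put the saved names back into the current namespace, which is roughly the effect seen after `restore_session` above.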


## Performing fastq quality control
After running fasterq-dump, each SRA object is updated with the paths of its fastq files. To perform fastq quality control we can use ```trimgalore``` or ```bbduk.sh```.

In [4]:
#using bbduk
pathToAdapters="adapters2.fa"
#arguments to pass to bbduk
bbdOpts={"ktrim":"r","k":"23","mink":"11","qtrim":"'rl'","trimq":"10","--":("-Xmx2g",),"ref":pathToAdapters}
#an object for running bbduk.sh with specified parameters
bbdOb=qc.BBmap()
#start QC
for ob in sraObjects:
    ob.perform_qc(bbdOb,**bbdOpts)
    
#after finishing view the current fastq files in the sra objects

for ob in sraObjects:
    print("SRR Accession: {}, fastq files: {}. {}".format(ob.srr_accession,ob.localfastq1Path,ob.localfastq2Path))
    
    if ob.fastqFilesExistsLocally():
        print("Both files exist!!")
    else:
        print("Error")
        raise Exception("Fastq files not found")


Performing QC using bbduk.sh
$ bbduk.sh in=athal_out/SRR976159/SRR976159_1.fastq in2=athal_out/SRR976159/SRR976159_2.fastq out=athal_out/SRR976159/SRR976159_1_bbduk.fastq out2=athal_out/SRR976159/SRR976159_2_bbduk.fastq threads=8 ktrim=r k=23 mink=11 qtrim='rl' trimq=10 ref=adapters2.fa -Xmx2g
Time taken:0:01:01
Performing QC using bbduk.sh
$ bbduk.sh in=athal_out/SRR978411/SRR978411_1.fastq in2=athal_out/SRR978411/SRR978411_2.fastq out=athal_out/SRR978411/SRR978411_1_bbduk.fastq out2=athal_out/SRR978411/SRR978411_2_bbduk.fastq threads=8 ktrim=r k=23 mink=11 qtrim='rl' trimq=10 ref=adapters2.fa -Xmx2g
Time taken:0:00:49
Performing QC using bbduk.sh
$ bbduk.sh in=athal_out/SRR971778/SRR971778_1.fastq in2=athal_out/SRR971778/SRR971778_2.fastq out=athal_out/SRR971778/SRR971778_1_bbduk.fastq out2=athal_out/SRR971778/SRR971778_2_bbduk.fastq threads=8 ktrim=r k=23 mink=11 qtrim='rl' trimq=10 ref=adapters2.fa -Xmx2g
Time taken:0:01:02
SRR 
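
The logged commands show how the `bbdOpts` dictionary ends up on the `bbduk.sh` command line: regular entries appear as `key=value` tokens, and the tuple stored under the special `"--"` key is appended unchanged at the end. The function below is a hypothetical re-creation of that mapping for illustration, inferred only from the log output above; pyrpipe's actual internals may differ.

```python
def bbduk_args(options):
    """Flatten an options dict into bbduk.sh-style arguments.

    Hypothetical helper (not pyrpipe code). Regular entries become
    key=value tokens; the special "--" key holds a tuple of tokens
    that are passed through unchanged and placed last.
    """
    args = [f"{key}={value}" for key, value in options.items() if key != "--"]
    args.extend(options.get("--", ()))
    return args


bbdOpts = {"ktrim": "r", "k": "23", "mink": "11", "qtrim": "'rl'",
           "trimq": "10", "--": ("-Xmx2g",), "ref": "adapters2.fa"}
print(" ".join(bbduk_args(bbdOpts)))
# prints: ktrim=r k=23 mink=11 qtrim='rl' trimq=10 ref=adapters2.fa -Xmx2g
```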

## Aligning clean reads to the reference genome
After finishing fastq quality control we will map reads to the reference genome.

In [5]:
#using hisat2
hs=mapping.Hisat2(index="",threads=6)

No Hisat2 index provided. Please build index now to generate an index using build_Index()....


In [6]:
#We can build a hisat2 index if one doesn't already exist. This index will be bound to the Hisat2 object, hs.
hisat2_buildArgs={"-p":"8","-a":"","-q":""}
#start building
#parameters are out directory, index name, reference genome
if hs.build_index(workingDir+"/athalIndex","athalInd",GENOME,**hisat2_buildArgs) :
    print("Indexing done.")
    
#check the index present in hisat2 object
if hs.check_index():
    print("Index {} exists".format(hs.hisat2_index))
    

Building hisat index...
Hisat2 index with same name already exists. Exiting...
Indexing done.
Index athal_out/athalIndex/athalInd exists


In [7]:
#start alignment
hsOpts={"--dta-cufflinks":"","-p":"10"}
samList=[]
for ob in sraObjects:
    print("Processing {}...".format(ob.srr_accession))
    thisSam=hs.perform_alignment(ob,**hsOpts) #note the parameter p supplied here will replace the parameter "threads" passed during object construction
    if thisSam:
        samList.append(thisSam)
print("Alignment done!! Sam files:"+ ",".join(samList))    

Processing SRR976159...
$ hisat2 -1 athal_out/SRR976159/SRR976159_1_bbduk.fastq -2 athal_out/SRR976159/SRR976159_2_bbduk.fastq -S athal_out/SRR976159/SRR976159_hisat2.sam -p 10 -x athal_out/athalIndex/athalInd --dta-cufflinks
Time taken:0:01:20
Processing SRR978411...
$ hisat2 -1 athal_out/SRR978411/SRR978411_1_bbduk.fastq -2 athal_out/SRR978411/SRR978411_2_bbduk.fastq -S athal_out/SRR978411/SRR978411_hisat2.sam -p 10 -x athal_out/athalIndex/athalInd --dta-cufflinks
Time taken:0:01:10
Processing SRR971778...
$ hisat2 -1 athal_out/SRR971778/SRR971778_1_bbduk.fastq -2 athal_out/SRR971778/SRR971778_2_bbduk.fastq -S athal_out/SRR971778/SRR971778_hisat2.sam -p 10 -x athal_out/athalIndex/athalInd --dta-cufflinks
Time taken:0:01:23
Alignment done!! Sam files:athal_out/SRR976159/SRR976159_hisat2.sam,athal_out/SRR978411/SRR978411_hisat2.sam,athal_out/SRR971778/SRR971778_hisat2.sam
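
The comment in the alignment cell notes that `-p` passed at call time replaces the `threads` value given when the `Hisat2` object was constructed, and the logged commands indeed show `-p 10`. A minimal sketch of that precedence rule follows, assuming options are merged as plain dictionaries; `hisat2_args` is a hypothetical name, and the token order is not meant to match pyrpipe's output exactly.

```python
def hisat2_args(defaults, overrides):
    """Merge per-call options over defaults and flatten them into argv tokens.

    Hypothetical helper (not pyrpipe code). An empty value marks a bare
    flag such as --dta-cufflinks, which takes no argument.
    """
    merged = {**defaults, **overrides}  # keys from `overrides` win
    argv = []
    for flag, value in merged.items():
        argv.append(flag)
        if value != "":
            argv.append(str(value))
    return argv


constructor_opts = {"-p": "6"}                    # threads=6 at construction
call_opts = {"--dta-cufflinks": "", "-p": "10"}  # options passed to the call
print(hisat2_args(constructor_opts, call_opts))
# prints: ['-p', '10', '--dta-cufflinks']
```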


## Using samtools
```pyrpipe``` implements a basic high-level samtools API through which samtools functionality can be accessed. Note that users can also use the library ```pysam``` for advanced SAM/BAM/VCF/BCF functionality.

In [8]:
samOb=tools.Samtools()
#sam to sorted bam
bamList=[]
i=0
for sam in samList:
    print("Processing:"+sam)
    thisBam=samOb.sam_sorted_bam(sam,delete_sam=True,delete_bam=True,objectid=sraObjects[i].srr_accession) #add the object id to keep track of process and object. helpful in debugging and reports later
    i+=1
    if thisBam:
        bamList.append(thisBam)
print("Sorted bam files:"+",".join(bamList))

###Some Examples using pysam###
#for details see: https://pysam.readthedocs.io/en/latest/
#import pysam
#pysam.sort("-@","8","-o","sortedBam.bam","in.bam")
#pysam.merge("-@","8","myMerge",*bamList,"-f")

Processing:athal_out/SRR976159/SRR976159_hisat2.sam
$ samtools view -o athal_out/SRR976159/SRR976159_hisat2.bam -@ 6 -b athal_out/SRR976159/SRR976159_hisat2.sam
Time taken:0:00:37
$ samtools sort -o athal_out/SRR976159/SRR976159_hisat2_sorted.bam -@ 6 athal_out/SRR976159/SRR976159_hisat2.bam
Time taken:0:00:18
Processing:athal_out/SRR978411/SRR978411_hisat2.sam
$ samtools view -o athal_out/SRR978411/SRR978411_hisat2.bam -@ 6 -b athal_out/SRR978411/SRR978411_hisat2.sam
Time taken:0:00:29
$ samtools sort -o athal_out/SRR978411/SRR978411_hisat2_sorted.bam -@ 6 athal_out/SRR978411/SRR978411_hisat2.bam
Time taken:0:00:17
Processing:athal_out/SRR971778/SRR971778_hisat2.sam
$ samtools view -o athal_out/SRR971778/SRR971778_hisat2.bam -@ 6 -b athal_out/SRR971778/SRR971778_hisat2.sam
Time taken:0:00:37
$ samtools sort -o athal_out/SRR971778/SRR971778_hisat2_sorted.bam -@ 6 athal_out/SRR971778/SRR971778

## Transcript assembly
We can use stringtie to perform transcript assembly.

In [9]:
st=assembly.Stringtie()
gtfList=[]
i=0
for bam in bamList:
    print("Processing:"+bam)
    gtfList.append(st.perform_assembly(bam,reference_gtf=GTF,objectid=sraObjects[i].srr_accession))
    i+=1

print("Final GTFs:"+",".join(gtfList))

Processing:athal_out/SRR976159/SRR976159_hisat2_sorted.bam
$ stringtie -o athal_out/SRR976159/SRR976159_hisat2_sorted_stringtie.gtf -p 8 -G athal_out/Arabidopsis_thaliana.TAIR10.45.gtf athal_out/SRR976159/SRR976159_hisat2_sorted.bam
Time taken:0:00:40
Processing:athal_out/SRR978411/SRR978411_hisat2_sorted.bam
$ stringtie -o athal_out/SRR978411/SRR978411_hisat2_sorted_stringtie.gtf -p 8 -G athal_out/Arabidopsis_thaliana.TAIR10.45.gtf athal_out/SRR978411/SRR978411_hisat2_sorted.bam
Time taken:0:00:33
Processing:athal_out/SRR971778/SRR971778_hisat2_sorted.bam
$ stringtie -o athal_out/SRR971778/SRR971778_hisat2_sorted_stringtie.gtf -p 8 -G athal_out/Arabidopsis_thaliana.TAIR10.45.gtf athal_out/SRR971778/SRR971778_hisat2_sorted.bam
Time taken:0:00:39
Final GTFs:athal_out/SRR976159/SRR976159_hisat2_sorted_stringtie.gtf,athal_out/SRR978411/SRR978411_hisat2_sorted_stringtie.gtf,athal_out/SRR971778/SRR971778_hisat2_sorted_stringtie.gtf
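
A quick sanity check on the assembled GTF files is to count how many records of each feature type they contain. GTF is a tab-separated format with nine columns, the third of which is the feature type (StringTie writes `transcript` and `exon` records). The helper `count_features` below is a hypothetical stdlib-only sketch and is not part of pyrpipe.

```python
from collections import Counter


def count_features(gtf_lines):
    """Count GTF records by feature type (column 3), skipping comments and blanks.

    Hypothetical helper for illustration. `gtf_lines` can be an open file
    handle or any iterable of lines.
    """
    counts = Counter()
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # a well-formed GTF record has nine tab-separated columns
        if len(fields) >= 9:
            counts[fields[2]] += 1
    return counts
```

For example, `with open(gtfList[0]) as fh: print(count_features(fh))` would report the number of transcripts and exons assembled for the first run.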


## Generating analysis reports
pyrpipe_diagnostic.py lets users generate different types of reports and summaries. The following commands can be run from the shell.



**Generate a pdf report**
[Output](https://github.com/urmi-21/pyrpipe/blob/master/examples%28case-studies%29/Athaliana_transcript_assembly/2020-03-16-14_33_21_pyrpipe.pdf)

In [11]:
!pyrpipe_diagnostic.py report pyrpipe_logs/2020-03-16-14_33_21_pyrpipe.log

Report written to 2020-03-16-14_33_21_pyrpipe.pdf


**Dump all commands to a shell file**
[Output](https://github.com/urmi-21/pyrpipe/blob/master/examples%28case-studies%29/Athaliana_transcript_assembly/2020-03-16-14_33_21_pyrpipe.sh)

In [12]:
!pyrpipe_diagnostic.py shell pyrpipe_logs/2020-03-16-14_33_21_pyrpipe.log

Generating bash script
shell commands written to 2020-03-16-14_33_21_pyrpipe.sh


**Generate multiqc report**
[Output](https://github.com/urmi-21/pyrpipe/blob/master/examples%28case-studies%29/Athaliana_transcript_assembly/multiqc_report.html)

In [13]:
!pyrpipe_diagnostic.py multiqc -r pyrpipe_logs/2020-03-16-14_33_21_pyrpipe.log

Generating html report with multiqc
[INFO   ]         multiqc : This is MultiQC v1.8
[INFO   ]         multiqc : Template    : default
[INFO   ]         multiqc : Searching   : /home/usingh/work/urmi/hoap/pyrpipe/case_studies/Athaliana_transcript_assembly/tmp
[INFO   ]         bowtie2 : Found 3 reports
[INFO   ]         multiqc : Compressing plot data
[INFO   ]         multiqc : Report      : multiqc_report.html
[INFO   ]         multiqc : Data        : multiqc_data
[INFO   ]         multiqc : MultiQC complete
Removing /home/usingh/work/urmi/hoap/pyrpipe/case_studies/Athaliana_transcript_assembly/tmp/SRR976159_fasterq-dump.txt
Removing /home/usingh/work/urmi/hoap/pyrpipe/case_studies/Athaliana_transcript_assembly/tmp/SRR978411_fasterq-dump.txt
Removing /home/usingh/work/urmi/hoap/pyrpipe/case_studies/Athaliana_transcript_assembly/tmp/SRR971778_fasterq-dump.txt
Removin

**Generate runtime benchmarks**
[Output](https://github.com/urmi-21/pyrpipe/tree/master/examples%28case-studies%29/Athaliana_transcript_assembly/benchmark_reports)

In [14]:
!pyrpipe_diagnostic.py benchmark pyrpipe_logs/2020-03-16-14_33_21_pyrpipe.log

Generating benchmarks
parsing log...
done.
Benchmark report saved to:/home/usingh/work/urmi/hoap/pyrpipe/case_studies/Athaliana_transcript_assembly/tmp/benchmark_reports
