<h1>Open source RNA-Seq pipeline for identification of novel-mirs and their gene regulatory networks<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#MiRPipe-Flowchart" data-toc-modified-id="MiRPipe-Flowchart-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>MiRPipe Flowchart</a></span></li><li><span><a href="#FastQ-Files" data-toc-modified-id="FastQ-Files-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>FastQ Files</a></span></li><li><span><a href="#Pre-processing" data-toc-modified-id="Pre-processing-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Pre-processing</a></span><ul class="toc-item"><li><span><a href="#Quality-Check-of-Fastq-Files" data-toc-modified-id="Quality-Check-of-Fastq-Files-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Quality Check of Fastq Files</a></span></li><li><span><a href="#Adaptor-Timming-and-Fastq-Splitting" data-toc-modified-id="Adaptor-Timming-and-Fastq-Splitting-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Adaptor Timming and Fastq Splitting</a></span></li></ul></li><li><span><a href="#Sequence-Alignment" data-toc-modified-id="Sequence-Alignment-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Sequence Alignment</a></span><ul class="toc-item"><li><span><a href="#Downloading-the-Mirdeep*-aligner-and-mirbase" data-toc-modified-id="Downloading-the-Mirdeep*-aligner-and-mirbase-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span><strong>Downloading the Mirdeep* aligner and mirbase</strong></a></span></li><li><span><a href="#piRNA-Pipeline" data-toc-modified-id="piRNA-Pipeline-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>piRNA Pipeline</a></span></li><li><span><a href="#miRNA-sequence-Alignment" data-toc-modified-id="miRNA-sequence-Alignment-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>miRNA sequence Alignment</a></span></li></ul></li><li><span><a href="#Post-Alignment-Analysis" data-toc-modified-id="Post-Alignment-Analysis-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Post-Alignment Analysis</a></span><ul class="toc-item"><li><span><a 
href="#Processing-of-raw-counts-from-Mirdeep*-results" data-toc-modified-id="Processing-of-raw-counts-from-Mirdeep*-results-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span><strong>Processing of raw counts from Mirdeep* results</strong></a></span></li></ul></li><li><span><a href="#Blast-Search" data-toc-modified-id="Blast-Search-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Blast Search</a></span><ul class="toc-item"><li><span><a href="#Search-for-Novel-miRNA-sequence-in-DashR-Database" data-toc-modified-id="Search-for-Novel-miRNA-sequence-in-DashR-Database-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>Search for Novel miRNA sequence in DashR Database</a></span></li><li><span><a href="#DashR-Results-post-processing" data-toc-modified-id="DashR-Results-post-processing-6.2"><span class="toc-item-num">6.2&nbsp;&nbsp;</span>DashR Results post-processing</a></span></li></ul></li><li><span><a href="#Seed-based-Sequence-Clustering" data-toc-modified-id="Seed-based-Sequence-Clustering-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Seed based Sequence Clustering</a></span><ul class="toc-item"><li><span><a href="#Dictionary-Preparation" data-toc-modified-id="Dictionary-Preparation-7.1"><span class="toc-item-num">7.1&nbsp;&nbsp;</span><strong>Dictionary Preparation</strong></a></span></li><li><span><a href="#CD-HIT" data-toc-modified-id="CD-HIT-7.2"><span class="toc-item-num">7.2&nbsp;&nbsp;</span>CD-HIT</a></span></li><li><span><a href="#Seed-Clustering-Algorithm" data-toc-modified-id="Seed-Clustering-Algorithm-7.3"><span class="toc-item-num">7.3&nbsp;&nbsp;</span>Seed Clustering Algorithm</a></span></li></ul></li><li><span><a href="#miRNA-Differential-expression-Analysis-using-DESeq2" data-toc-modified-id="miRNA-Differential-expression-Analysis-using-DESeq2-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>miRNA Differential expression Analysis using DESeq2</a></span></li><li><span><a href="#piRNA-Counts-Analysis" 
data-toc-modified-id="piRNA-Counts-Analysis-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>piRNA Counts Analysis</a></span></li><li><span><a href="#piRNA-Differential-expression-Analysis-using-DESeq2" data-toc-modified-id="piRNA-Differential-expression-Analysis-using-DESeq2-10"><span class="toc-item-num">10&nbsp;&nbsp;</span>piRNA Differential expression Analysis using DESeq2</a></span></li><li><span><a href="#Finding-significantly-altered-differentially-expressed-miRNAs" data-toc-modified-id="Finding-significantly-altered-differentially-expressed-miRNAs-11"><span class="toc-item-num">11&nbsp;&nbsp;</span>Finding significantly altered differentially expressed miRNAs</a></span></li><li><span><a href="#Output" data-toc-modified-id="Output-12"><span class="toc-item-num">12&nbsp;&nbsp;</span>Output</a></span></li></ul></div>

# MiRPipe Flowchart

![title](miRPipe_Flowchart.png)

# FastQ Files
**Loading libraries**

In [2]:
import os
import csv
import sys
import numpy as np
import pickle
import pandas as pd
import subprocess
import random
from splinter import Browser
import time
from operator import add
from os import path
import threading
from threading import Semaphore
screenlock = Semaphore(value=1)

In [3]:
%%capture
from tqdm import tqdm_notebook as tqdm
tqdm().pandas()

**Declaring ENV Variables**

In [4]:
# setting env variables:
os.environ['HOME_DIR'] = os.getcwd()
os.environ['SEQ_DIR'] = os.path.join(os.environ['HOME_DIR'],'data')
os.environ['Tools_DIR'] = os.path.join(os.environ['HOME_DIR'],'Tools')
# Create the log directory if it does not exist yet, then export its path
log_dir = os.path.join(os.environ['SEQ_DIR'],'LOG_DIR')
os.makedirs(log_dir, exist_ok=True)
os.environ['LOG_DIR'] = log_dir
    
os.environ['REF_DIR'] = os.path.join(os.environ['HOME_DIR'], 'Tools',
                                     'MDS_command_line_v38/MDS_command_line')
print('Please be sure that all the fastq files are present in %s. All the results will be saved in the same folder also.' %os.environ['SEQ_DIR'])

Please be sure that all the fastq files are present in /mnt/disk2/MiRPipe_Docker/data. All the results will be saved in the same folder also.


Ensure that the path above is correct and that the fastq files are present in it.
For the pipeline to run smoothly, make sure that only the input fastq files are present in the data folder.
Delete all other clutter before re-running the pipeline for trouble-free execution.
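
The check described above can be automated before each run. The following is a minimal sketch, not part of the pipeline: the helper name `unexpected_files` and the list of accepted suffixes are assumptions, and `LOG_DIR` and `sample_list.csv` are ignored because later cells of this notebook expect them inside the data folder.

```python
import os

def unexpected_files(seq_dir,
                     allowed_suffixes=('.fastq', '.fastq.gz', '.fq', '.fq.gz'),
                     ignore=('LOG_DIR', 'sample_list.csv')):
    """Return entries of seq_dir that are neither fastq inputs nor ignored names.

    allowed_suffixes and ignore are illustrative defaults (assumptions).
    """
    extras = []
    for name in sorted(os.listdir(seq_dir)):
        if name in ignore:
            continue
        if not name.lower().endswith(allowed_suffixes):
            extras.append(name)
    return extras
```

Calling `unexpected_files(os.environ['SEQ_DIR'])` after the environment variables are set returns the names that should be moved or deleted; an empty list means the folder is clean.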

In [5]:
!echo $HOME_DIR
!echo $SEQ_DIR
!echo $LOG_DIR
!echo $REF_DIR

/mnt/disk2/MiRPipe_Docker
/mnt/disk2/MiRPipe_Docker/data
/mnt/disk2/MiRPipe_Docker/data/LOG_DIR
/mnt/disk2/MiRPipe_Docker/Tools/MDS_command_line_v38/MDS_command_line


# Pre-processing
## Quality Check of Fastq Files

In [None]:
'''
    The following script performs these tasks:
    1. Quality checking of each sample using FastQC.
    2. Preparation of a detailed, graphical summary of the quality reports
       from all samples using MultiQC.
'''
!Rscript $HOME_DIR/scripts/FastQC.R

## Adaptor Timming and Fastq Splitting

In [None]:
# Single definition of the default adaptor so the menu text and the value used stay in sync
default_adaptor = 'TGGAATTCTCGGGTGCCAAGG'
print('The default adaptor sequence used for adaptor trimming is ' +
      default_adaptor + '. If your adaptor sequence is different, please '
      'enter your data-specific adaptor sequence. Please select the appropriate option.')
print('1. Default adaptor sequence: ' + default_adaptor)
print('2. User-specific adaptor sequence')
user_choice = int(input('Please select your option: '))

if user_choice == 1:
    adaptor = default_adaptor
elif user_choice == 2:
    adaptor = input('Please enter the adaptor sequence: ')
else:
    raise ValueError('Please select either 1 or 2.')
os.environ['adaptor'] = adaptor
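
A user-supplied adaptor is passed on unchanged to the trimming and alignment steps, so a quick sanity check can catch typing mistakes early. The sketch below is an illustration only: `is_valid_adaptor` is a hypothetical helper, and the accepted alphabet (A, C, G, T and N) is an assumption about what the downstream tools accept.

```python
import re

def is_valid_adaptor(sequence):
    """Return True if sequence is non-empty and uses only A, C, G, T or N.

    The accepted alphabet is an assumption made for this sketch.
    """
    return bool(re.fullmatch(r'[ACGTN]+', sequence.strip().upper()))

print(is_valid_adaptor('TGGAATTCTCGGGTGCCAAGG'))
print(is_valid_adaptor('TGGA ATTC'))
```

The check could be applied to `adaptor` before it is exported to `os.environ['adaptor']`.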


In [None]:
# Adaptor trimming using trim_galore
!bash scripts/adaptor_trimming.sh 1> data/LOG_DIR/adaptor_trimming.stdout 2>data/LOG_DIR/adaptor_trimming.stderr

# Read-length-based splitting of the fastq files (into 3 parts) using bbduk
!bash scripts/fastq_split.sh 1> data/LOG_DIR/fastq_split.stdout 2> data/LOG_DIR/fastq_split.stderr

# Sequence Alignment 

## **Downloading the Mirdeep* aligner and mirbase**

In [None]:
print('Please enter your choice: ')
print("Option 1: hg19 based alignment using Mirdeep* with miRBase v19 ")
print("Option 2: hg19 based alignment using Mirdeep* with miRBase v20")
print("Option 3: hg38 based alignment using Mirdeep* with miRBase v21")
print("Option 4: hg38 based alignment using Mirdeep* with miRBase v22 (Default Condition)")
print("Please enter either 1, 2, 3 or 4")
choice = int(input("Please enter your choice:  "))

if choice == 1:
    print("You chose hg19 and miRBase v19 based alignment using Mirdeep*")
    conf_file = open(os.path.join(os.environ['Tools_DIR'],'mirdeep.conf'),"w")    
    line = "Human Genome = hg19" + "\n"
    line += "miRBase = 19"
    conf_file.write(line)    
    print("Downloading Mirdeep*....")
    command = "wget --content-disposition  http://sourceforge.net/projects/mirdeepstar/files/MDS_command_line_v37.zip -P $Tools_DIR && "
    command += "unzip -o $Tools_DIR/MDS_command_line_v37.zip -d $Tools_DIR"
    subprocess.call(command, shell=True)    
    print("Downloading Mirdeep* is complete. Now downloading hg19 human reference genome....")
    ref_var = 'hg19'    
    os.environ['REF_DIR'] = os.path.join(os.environ['HOME_DIR'], 'Tools',
                                     'MDS_command_line_v37/MDS_command_line')
    print("")
    command = ""
    # Download into the same folder that bowtie-build reads from
    command += "mkdir -p $HOME_DIR/refs/hg19 && "
    command += "wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz -P $HOME_DIR/refs/hg19 && "
    command += "pigz -p 5 -d -f $HOME_DIR/refs/hg19/hg19.fa.gz && echo Download complete. Now building bowtie indexes && "
    command += "$HOME_DIR/Tools/bowtie-1.2.3-linux-x86_64/bowtie-build --threads 8 $HOME_DIR/refs/hg19/hg19.fa $HOME_DIR/refs/hg19/hg19"
    subprocess.call(command, shell=True)
    print("Human reference genome hg19 has been downloaded and index file generation is complete.")

elif choice == 2:
    print("You chose hg19 and miRBase v20 based alignment using Mirdeep*")
    conf_file = open(os.path.join(os.environ['Tools_DIR'],'mirdeep.conf'),"w")    
    line = "Human Genome = hg19" + "\n"
    line += "miRBase = 20"
    conf_file.write(line)    
    command = ""
    command += "wget --content-disposition  http://sourceforge.net/projects/mirdeepstar/files/MDS_command_line_v37.zip -P $Tools_DIR && "
    command += "unzip -o $Tools_DIR/MDS_command_line_v37.zip -d $Tools_DIR && "
    command += "rm $HOME_DIR/Tools/MDS_command_line_v37/MDS_command_line/genome/hg19/miRBase/* && "
    command += "wget --content-disposition ftp://mirbase.org/pub/mirbase/20/hairpin.fa.gz -P $HOME_DIR/Tools/MDS_command_line_v37/MDS_command_line/genome/hg19/miRBase && "
    command += "pigz -p 5 -d $HOME_DIR/Tools/MDS_command_line_v37/MDS_command_line/genome/hg19/miRBase/hairpin.fa.gz && "
    command += "wget --content-disposition  ftp://mirbase.org/pub/mirbase/20/mature.fa.gz -P $HOME_DIR/Tools/MDS_command_line_v37/MDS_command_line/genome/hg19/miRBase && "
    command += "pigz -p 5 -d $HOME_DIR/Tools/MDS_command_line_v37/MDS_command_line/genome/hg19/miRBase/mature.fa.gz &&  "
    command += "wget --content-disposition  ftp://mirbase.org/pub/mirbase/20/genomes/hsa.gff3 -P $HOME_DIR/Tools/MDS_command_line_v37/MDS_command_line/genome/hg19/miRBase && "
    command += "mv $HOME_DIR/Tools/MDS_command_line_v37/MDS_command_line/genome/hg19/miRBase/hsa.gff3 $HOME_DIR/Tools/MDS_command_line_v37/MDS_command_line/genome/hg19/miRBase/knownMiR.gff3"                
    subprocess.call(command, shell=True)   
    print("Downloading Mirdeep* is complete. Now downloading hg19 human reference genome....")
    ref_var = 'hg19'    
    os.environ['REF_DIR'] = os.path.join(os.environ['HOME_DIR'], 'Tools',
                                     'MDS_command_line_v37/MDS_command_line')
    print("")
    command = ""
    # Download into the same folder that bowtie-build reads from
    command += "mkdir -p $HOME_DIR/refs/hg19 && "
    command += "wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz -P $HOME_DIR/refs/hg19 && "
    command += "pigz -p 5 -d -f $HOME_DIR/refs/hg19/hg19.fa.gz && echo Download complete. Now building bowtie indexes && "
    command += "$HOME_DIR/Tools/bowtie-1.2.3-linux-x86_64/bowtie-build --threads 8 $HOME_DIR/refs/hg19/hg19.fa $HOME_DIR/refs/hg19/hg19"
    subprocess.call(command, shell=True)
    print("Human reference genome hg19 has been downloaded and index file generation is complete.")

elif choice == 3:
    print("You chose hg38 and miRBase v21 based genome alignment using Mirdeep*")
    conf_file = open(os.path.join(os.environ['Tools_DIR'],'mirdeep.conf'),"w")    
    line = "Human Genome = hg38" + "\n"
    line += "miRBase = 21"
    conf_file.write(line)
    print("Downloading Mirdeep*....")
    command = ""
    command += "rm -rf $Tools_DIR/MDS_command_line_v38/MDS_command_line/genome/hg38 && " 
    command += "wget --content-disposition  http://sourceforge.net/projects/mirdeepstar/files/Index_files/hg38.zip/download -P $Tools_DIR/ && "
    command += "unzip -o $Tools_DIR/hg38.zip -d $Tools_DIR/MDS_command_line_v38/MDS_command_line/genome/ && "
    command += "rm -r $Tools_DIR/MDS_command_line_v38/MDS_command_line/genome/hg19"
    subprocess.call(command, shell=True)   
    print("Downloading Mirdeep* is complete. Now downloading hg38 for piRNA pipeline...")
    ref_var = 'hg38'   
    command = ""    
    command += "mkdir -p $HOME_DIR/refs/hg38 && "
    command += "wget --content-disposition http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz -P $HOME_DIR/refs/hg38 && "
    command += "pigz -p 5 -d -f $HOME_DIR/refs/hg38/hg38.fa.gz && echo Download complete. Now building bowtie indexes && "
    command += "$HOME_DIR/Tools/bowtie-1.2.3-linux-x86_64/bowtie-build --threads 8 $HOME_DIR/refs/hg38/hg38.fa $HOME_DIR/refs/hg38/hg38"
    subprocess.call(command, shell=True)   
    print("Human reference genome hg38 has been downloaded and index file generation is complete.")

elif choice == 4:
    print("You chose hg38 and miRBase v22 based genome alignment using Mirdeep*(Default Condition)")
    conf_file = open(os.path.join(os.environ['Tools_DIR'],'mirdeep.conf'),"w")    
    line = "Human Genome = hg38" + "\n"
    line += "miRBase = 22"
    conf_file.write(line)
    ref_var = 'hg38' 
    print("Downloading hg38 human reference genome")
    command = ""
    command += "mkdir -p $HOME_DIR/refs/hg38 && "
    command += "wget --content-disposition http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz -P $HOME_DIR/refs/hg38 && "
    command += "pigz -p 5 -d -f $HOME_DIR/refs/hg38/hg38.fa.gz && echo Download complete. Now building bowtie indexes && "
    command += "$HOME_DIR/Tools/bowtie-1.2.3-linux-x86_64/bowtie-build --threads 8 $HOME_DIR/refs/hg38/hg38.fa $HOME_DIR/refs/hg38/hg38"
    subprocess.call(command, shell=True)   
    print("Human reference genome hg38 has been downloaded and index file generation is complete.")

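else:
    # No configuration file is opened for an invalid choice
    conf_file = None
    print("Please enter a valid option (1, 2, 3 or 4) and re-run this cell.")

if conf_file is not None:
    conf_file.close()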
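
The four branches above write the same two-line `mirdeep.conf` and differ only in the genome build and miRBase release. A table-driven sketch of that mapping is shown below; `REFERENCE_OPTIONS` and `reference_config` are hypothetical names introduced for illustration, and the download commands are deliberately left out.

```python
# Menu choice -> (genome build, miRBase release), as listed in the menu above
REFERENCE_OPTIONS = {
    1: ('hg19', '19'),
    2: ('hg19', '20'),
    3: ('hg38', '21'),
    4: ('hg38', '22'),
}

def reference_config(choice):
    """Return the text written to mirdeep.conf for a menu choice."""
    if choice not in REFERENCE_OPTIONS:
        raise ValueError('choice must be 1, 2, 3 or 4')
    genome, mirbase = REFERENCE_OPTIONS[choice]
    return 'Human Genome = ' + genome + '\n' + 'miRBase = ' + mirbase

print(reference_config(4))
```

Keeping the mapping in one place makes it harder for the menu text, the configuration file and `ref_var` to drift apart.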

## piRNA Pipeline

In [None]:
def piRNA_pipeline():
    cmd = 'bash scripts/piRNA_pipeline.sh 1> data/LOG_DIR/piRNA_pipeline.stdout 2> data/LOG_DIR/piRNA_pipeline.stderr'
    os.system(cmd)

threading.Thread(target=piRNA_pipeline).start()

## miRNA sequence Alignment

Uncomment the following command if you want to perform sequence alignment sequentially.

In [None]:
# # Sequential sample processing
# !bash $HOME_DIR/scripts/Mirdeep_star.sh 1>$LOG_DIR/Mirdeep_star.stdout 2>$LOG_DIR/Mirdeep_star.stderr

**Preparing batches for miRNA sequence alignment for multi-thread processing**

In [None]:
def mirdeep_star(batch,batch_id):    
    os.chdir(os.environ['REF_DIR'])
    files = os.listdir(os.getcwd())    
    ref_id = os.path.join(os.getcwd(),'genome')
    if 'hg19' in os.listdir(ref_id):
        ref_idx = 'hg19'
    elif 'hg38' in os.listdir(ref_id):
        ref_idx = 'hg38'
    else:
        print('Reference index is missing in ', ref_id)
        sys.exit(2)
    
    batch = pd.read_csv(os.path.join(os.environ['SEQ_DIR'],batch),index_col=0)
    for idx in range(batch.shape[0]):
        basename = batch.iloc[idx,0].split('.')[0] +'_trimmed.fastq'
        path_new = os.path.join(os.environ['SEQ_DIR'],'fastq_17_24',basename)
        
        # Constraining miRDeep* to take only 4 GB in each thread.
        # os.system does not raise when the input file is missing, so check explicitly.
        if os.path.isfile(path_new):
            command = 'java -Xmx4096m -jar MD.jar -g '+ ref_idx + ' -a ' + os.environ['adaptor'] + ' -t 17 -l 24 -s -20 -r 5 -p 20 -m 101 ' + path_new + ' && rm ' + path_new
            os.system(command)
        else:
            screenlock.acquire()
            print('Sample ',basename,' is not present in the directory.')
            screenlock.release()
    
    screenlock.acquire()
    print('---------------------------')
    print(batch_id, ' is complete.')
    print('---------------------------')
    screenlock.release()

In [None]:
# Collecting the sample_list provided by user
sample_list = pd.read_csv('data/sample_list.csv',index_col=0)

# Generating batches
# Every sample count falls into exactly one tier (including exactly 10 samples)
if sample_list.shape[0] <= 10:
    no_of_batch = 2
elif sample_list.shape[0] <= 30:
    no_of_batch = 4
elif sample_list.shape[0] <= 50:
    no_of_batch = 6
elif sample_list.shape[0] <= 100:
    no_of_batch = 8
else:
    no_of_batch = 10
    
file_list_splitted = np.array_split(sample_list, no_of_batch)
for batch_idx in range(len(file_list_splitted)):
    file_list_splitted[batch_idx].to_csv('data/batch_'+str(batch_idx+1)+'.csv',encoding='utf-8',index=True)
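
The tiering rule used to choose the number of batches can be written as a small function that is easy to test at its boundaries. This is a sketch under one assumption: a cohort of exactly 10 samples is placed in the smallest tier. The name `number_of_batches` is hypothetical.

```python
def number_of_batches(n_samples):
    """Map a sample count to the number of batches (tiers as in the cell above).

    Assumption: exactly 10 samples belong to the smallest tier.
    """
    tiers = [(10, 2), (30, 4), (50, 6), (100, 8)]
    for upper_bound, batches in tiers:
        if n_samples <= upper_bound:
            return batches
    return 10

for n in (5, 10, 11, 30, 31, 100, 101):
    print(n, number_of_batches(n))
```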

In [None]:
# Calling mirdeep* for each batch (except first) in individual threads.
for i in range(2,no_of_batch+1):
    print('Running batch_'+ str(i)+ ' on thread-'+ str(i))
    threading.Thread(target=mirdeep_star,args=('batch_'+str(i)+'.csv',
                                               'batch_'+str(i),)).start()    
    
# Running batch_1
print('Running batch_1 on thread-1')
mirdeep_star('batch_1.csv','batch_1')
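
The cell above starts one thread per batch and processes `batch_1` in the calling thread, but it does not wait for the other threads before the notebook moves on. The sketch below shows the same pattern with an explicit `join`, so that post-alignment steps start only after every batch has finished. `run_batches` and `record_batch` are hypothetical names, and `record_batch` is a stand-in for the real alignment function.

```python
import threading

def run_batches(no_of_batch, worker):
    """Run worker(batch_id) for batch_2..batch_N in threads and batch_1 in the caller."""
    threads = []
    for i in range(2, no_of_batch + 1):
        thread = threading.Thread(target=worker, args=('batch_' + str(i),))
        thread.start()
        threads.append(thread)
    worker('batch_1')
    # Block until every background batch has finished
    for thread in threads:
        thread.join()

completed = []
completed_lock = threading.Lock()

def record_batch(batch_id):
    # Placeholder for the real work; the lock protects the shared list
    with completed_lock:
        completed.append(batch_id)

run_batches(4, record_batch)
print(sorted(completed))
```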

# Post-Alignment Analysis
## **Processing of raw counts from Mirdeep* results**

In [None]:
os.chdir(os.environ['HOME_DIR'])
# Collect the miRDeep* result files, skipping the "known" result files
files = [i for i in os.listdir("data/fastq_17_24") if (".result" in i and not "known" in i)]
result_data = {}
print('Preparing master look up table for result data')
for file in files:
    result_data[file] = {}
    with open(os.path.join("data/fastq_17_24", file), 'r') as handle:
        readfile = handle.readlines()
    header = readfile[0].split("\t")
    for line in readfile[1:]:
        fields = line.split("\t")
        # Key each row by its first column and map header columns 1..10 to the row's values
        result_data[file][fields[0]] = {header[col]: fields[col] for col in range(1, 11)}

        
print('Master lookup table is generated. Now collecting all unique miRs from master lookup table')
mir_dict = {}
for file in (result_data.keys()):
    print('Working on ',file)
    for k,v in result_data[file].items():
        i = 1
        if not k in mir_dict.keys():
            mir_dict[k] = [v['chr'],v['mature_loci'].split('-')[0],
                           v['mature_loci'].split('-')[1],v['mature miR']]
        else:
            mir_dict1 = mir_dict.copy()
            for k1,v1 in mir_dict1.items():
                new_loc_start = int(v['mature_loci'].split('-')[0])
                new_loc_end = int(v['mature_loci'].split('-')[1])
                new_mir_seq = v['mature miR']
                if k1 == k:
                    if v1[0] == v['chr']:                          
                        if new_loc_start == int(v1[1]) and new_loc_end == int(v1[2]):
                            if new_mir_seq == v1[3]:
                                pass
                    else:
                        mir_dict[k+'_'+str(i)] = [v['chr'],v['mature_loci'].split('-')[0],
                                       v['mature_loci'].split('-')[1],v['mature miR']]
                        i += 1


print('Preparing dataframe for all unique miRs and samples....')
counts = pd.DataFrame(columns=list(result_data.keys()))
mir_name = list(mir_dict.keys())
index = 0

for k,v in (mir_dict.items()):
    for file in result_data.keys():
        for k1,v1 in result_data[file].items():
            if v1['mature miR'] == v[3]:
                loc_start = int(v1['mature_loci'].split('-')[0])
                loc_end = int(v1['mature_loci'].split('-')[1])
                loc_chr = v1['chr']
                if loc_chr == v[0]:
                    if loc_start == int(v[1]) and loc_end == int(v[2]):   
                        counts.loc[k,file] = int(v1['expression(number of mature reads)'])
                   
                        
counts['mir_ID'] = mir_name
counts = counts.set_index('mir_ID')
counts['chr'] = [v[0] for v in mir_dict.values()]
counts['chr_start'] = [v[1] for v in mir_dict.values()]
counts['chr_end'] = [v[2] for v in mir_dict.values()]
counts['miRNA_sequence'] = [v[3] for v in mir_dict.values()]

# rearranging columns
cols = counts.columns.tolist()
cols = cols[-4:]+ cols[:-4]
counts = counts[cols]

counts = counts.fillna(0)
print(counts.shape)
counts.to_csv("data/counts.csv",encoding='utf-8',index=True)
counts.head()
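
The master lookup table built above is a nested dictionary: result file, then miRNA name, then column name. The sketch below shows the inner two levels on a toy tab-separated table. The column names `chr`, `mature_loci`, `mature miR` and `expression(number of mature reads)` are the ones the cell above reads; the first column name, the row names and all values are made up for illustration, and `parse_result_table` is a hypothetical helper.

```python
import csv
import io

def parse_result_table(text):
    """Parse tab-separated text into {row name: {column name: value}}."""
    reader = csv.reader(io.StringIO(text), delimiter='\t')
    header = next(reader)
    table = {}
    for fields in reader:
        if not fields:
            continue
        table[fields[0]] = dict(zip(header[1:], fields[1:]))
    return table

# Toy input: names, coordinates and counts are invented for this example
example = (
    "name\tchr\tmature_loci\tmature miR\texpression(number of mature reads)\n"
    "hsa-miR-example\tchr1\t100-121\tACGTACGTACGTACGTACGTAC\t42\n"
    "novel_example\tchr2\t500-521\tTTGCATTGCATTGCATTGCATT\t7\n"
)

table = parse_result_table(example)
start, end = table['novel_example']['mature_loci'].split('-')
print(table['novel_example']['chr'], int(start), int(end))
```

Splitting `mature_loci` on `-` gives the start and end coordinates that the count matrix stores as `chr_start` and `chr_end`.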

In [None]:
counts_known = counts[counts.index.str.contains('hsa')]
counts_unknown = counts[~counts.index.str.contains('hsa')]
print('All %d known and %d novel miRNA are successfully separated' %(counts_known.shape[0],counts_unknown.shape[0]))

# Blast Search
## Search for Novel miRNA sequence in DashR Database

In [None]:
'''
   Run this cell only the first time you run this notebook. After it completes
   successfully, the results are saved to a pickle file that can be reloaded
   later without repeating the search, which saves a lot of time.
   This code has been tested with Firefox version 66.0.4 (64-bit).
'''

log_writer = open("data/LOG_DIR/DashR_database_searching.txt", 'w')
log_writer.write('\n')
log_writer.write('\n')
log_writer.write('###################################################################################\n')
log_writer.write('************************Starting DashR analysis*********************************\n')
log_writer.write('###################################################################################\n')
log_writer.write('\n')
log_writer.write('\n')

novel_mirs = list(counts_unknown.index.values)
gecko = os.path.abspath('geckodriver')
dashR_Results = []
browser = Browser('firefox',executable_path=gecko,headless=True)
browser.visit('http://dashr2.lisanwanglab.org/search.php#')

if ref_var == 'hg19':
    #  Select DASHR2 GEO HG19 as reference
    xpath = '/html/body/div/div[1]/div/div/select/option[3]' 
else:
    #  Select DASHR2 GEO HG38 as reference
    xpath = '/html/body/div/div[1]/div/div/select/option[4]'
browser.find_by_xpath(xpath).click()
time.sleep(3)

browser.find_by_text('Search by sequence ').click()
time.sleep(3)
n_idx = len(list(counts_unknown.loc[:,'miRNA_sequence']))
n = 0
for idx in tqdm(range(n_idx)):
    name = list(counts_unknown.index.values)[idx]
    seq = list(counts_unknown.loc[:,'miRNA_sequence'])[idx]
    dashR_Results.append(name)     
    log_writer.write('\n\nSearch for ' + str(name) + ' having sequence : ' + str(seq) + '\n')
    browser.fill('querySeq', seq)
    time.sleep(2)
    xpath = '//*[@id="search"]/div[4]/div/button'    
    browser.find_by_xpath(xpath).click()
    if not browser.is_text_present('No matches found'):
        n += 1
        xpath = '//*[@id="sequence-results"]/pre/table' 
        results = browser.find_by_xpath(xpath)
        i=0
        for search_result in results:
            title = search_result.text.encode('utf8') 
            link = search_result["href"]             
            dashR_Results.append((title, link)) 
            i += 1   
        log_writer.write('Total ' + str(i) + ' results are found in DashR Database for ' + str(name) + '\n')
        log_writer.write('Searching for ' + str(name) + ' novel-mirna is complete.\n')
    else:
        dashR_Results.append('miRNA NOT FOUND')
        log_writer.write('No results are found for '+ str(name) + '\n')  
    


browser.quit()

log_writer.write('\n\n***************************************************************\n')
log_writer.write('\n\nTotal number of matched novel miRNAs are '+ str(n) + '\n')
with open("data/dashR_Results.pickle", 'wb') as handle:
    pickle.dump(dashR_Results, handle, protocol=pickle.HIGHEST_PROTOCOL)

log_writer.write('\nSearched results are saved in data/dashR_Results.pickle\n')
log_writer.write('\n\n***************************************************************\n')
log_writer.close()

Uncomment the following cell if you want to load the pickle file obtained from dashR search module.

In [None]:
# '''
#    Get DashR database searching results from saved pickle file.   
# '''
# import pickle
# print('Loading saved results from DashR Database....')
# with open("data/dashR_Results.pickle", 'rb') as handle:
#     dashR_Results = pickle.load(handle)
# len(dashR_Results)

## DashR Results post-processing

In [None]:
def loc_check(rna_name, df1,df2,idx):
    '''
        Return True when the DashR hit described by rna_name lies on the same
        chromosome as the matching miRNA and within 3 nt of its locus.
    '''
    rna_name1 = rna_name.split(' ')[1]
    rna_chr1 = rna_name.split(' ')[3].split(':')[0]
    # Read the hit's coordinates from the function argument, not from a global variable
    hit_range = rna_name.split(' ')[3].split(':')[1].split('[')[0].split('-')
    rna_chr_start1 = int(hit_range[0])
    rna_chr_end1 = int(hit_range[1])
    try:
        rna_chr2 = df1.loc[rna_name1,'chr']
        rna_chr_start2 = int(df1.loc[rna_name1,'chr_start'])
        rna_chr_end2 = int(df1.loc[rna_name1,'chr_end'])
    except Exception:
        # The hit is not in df1: fall back to the novel miRNA's own locus in df2
        rna_name1 = dashR_Results[idx]
        rna_chr2 = df2.loc[rna_name1,'chr']
        rna_chr_start2 = int(df2.loc[rna_name1,'chr_start'])
        rna_chr_end2 = int(df2.loc[rna_name1,'chr_end'])

    return bool((rna_chr1 == rna_chr2) and (rna_chr_start1 >= rna_chr_start2 -3) and (rna_chr_end1 <= rna_chr_end2 + 3))
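
The location test in `loc_check` reduces to two steps: parse a location token into chromosome, start and end, then compare it with a reference locus using a 3 nt tolerance. The sketch below isolates those steps. The token layout `chr:start-end[strand]` is inferred from the string splitting in `loc_check` and is an assumption about the DashR output; `parse_location` and `within_tolerance` are hypothetical names.

```python
def parse_location(token):
    """Split a token such as 'chr1:100-121[+]' into (chromosome, start, end)."""
    chrom, rest = token.split(':')
    start, end = rest.split('[')[0].split('-')
    return chrom, int(start), int(end)

def within_tolerance(hit, ref, tolerance=3):
    """Apply the same comparison as loc_check to two (chr, start, end) tuples."""
    hit_chr, hit_start, hit_end = hit
    ref_chr, ref_start, ref_end = ref
    return (hit_chr == ref_chr
            and hit_start >= ref_start - tolerance
            and hit_end <= ref_end + tolerance)

hit = parse_location('chr1:98-123[+]')
print(within_tolerance(hit, ('chr1', 100, 121)))
```

The comparison is one-sided: a hit that starts later or ends earlier than the reference locus also passes, so it tests containment within the padded locus and not equality of the two intervals.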

In [None]:
log_writer = open("data/LOG_DIR/DashR_results_processing.txt", 'w')
log_writer.write('\n')
log_writer.write('\n')
log_writer.write('###################################################################################\n')
log_writer.write('************************Starting DashR Results Analysis****************************\n')
log_writer.write('###################################################################################\n')
log_writer.write('\n')
log_writer.write('\n')


miRNA_count = 0
novel_contains_miRNA = []
novel_miRNA1 = counts_unknown.copy()
# Count-only copies of both tables (annotation columns removed)
meta_cols = ['miRNA_sequence', 'chr', 'chr_start', 'chr_end']
counts_unknown2 = counts_unknown.drop(meta_cols, axis=1)
counts_known2 = counts_known.drop(meta_cols, axis=1)
old_seq = list(counts_known.loc[:,'miRNA_sequence'].values)
old_chr = list(counts_known.loc[:,'chr'].values)
old_chr_start = list(counts_known.loc[:,'chr_start'].values)
old_chr_end = list(counts_known.loc[:,'chr_end'].values)
add_index_entry = list(counts_known.index.values)
novel_name1 = []
similar_known_mirna = []
mirna_updated = 0
for idx in (range(0,len(dashR_Results),2)):
    results = dashR_Results[idx+1]    
    if isinstance((dashR_Results[idx+1]),tuple):
        results = list(dashR_Results[idx+1])[0].decode("utf-8")          
        miRNA_count += 1
        rna_type_results = results.split('\n')[1:]
        for rna_type in rna_type_results:
            rna_type1 = rna_type.split(' ')[1]
            if 'hsa' in rna_type1:   
                similar_known_mirna.append(rna_type1)            
                idx2 = list(counts_unknown2.index.values).index(dashR_Results[idx])
                count_to_add = list(map(int,list(counts_unknown2.iloc[idx2,:].values)))                
                if rna_type1 in list(counts_known.index.values):    
                    counts_add_flag = loc_check(rna_type,counts_known,counts_unknown,idx)
                    if counts_add_flag:
                        mirna_updated += 1
                        count_increament = list(counts_known2.loc[rna_type1,:].values)                        
                        count_increament = list(map(add,count_increament,count_to_add))               
                        counts_known2.iloc[list(counts_known2.index.values).index(rna_type1),:] = count_increament                        
                        novel_miRNA1 = novel_miRNA1.drop([str(dashR_Results[idx])],axis = 0)
                        print(dashR_Results[idx], ' has been added to the known miR count matrix. ')
                        log_entry = dashR_Results[idx] + ' has been added to the known miR count matrix. '
                        log_writer.write(log_entry + '\n')
                    else:
                        log_entry ='NOT UPDATED: ' + dashR_Results[idx] + ' belongs to the same family of ' + rna_type1 + ' but their genomic locations are different.'
                        log_writer.write(log_entry + '\n')                        
                else:
                    counts_add_flag = loc_check(rna_type,counts_known,counts_unknown,idx)
                    if counts_add_flag:
                        count_unknown_idx = list(counts_unknown2.index.values).index(str(dashR_Results[idx]))
                        # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
                        counts_known2 = pd.concat([counts_known2, counts_unknown2.iloc[[count_unknown_idx],:]])
                        add_index_entry.append(rna_type1)
                        old_seq.append(counts_unknown.loc[str(dashR_Results[idx]),'miRNA_sequence'])
                        old_chr.append(counts_unknown.loc[str(dashR_Results[idx]),'chr'])
                        old_chr_start.append(counts_unknown.loc[str(dashR_Results[idx]),'chr_start'])
                        old_chr_end.append(counts_unknown.loc[str(dashR_Results[idx]),'chr_end'])
                        novel_miRNA1 = novel_miRNA1.drop([str(dashR_Results[idx])],axis = 0)
                        print('ADDED: New mir added')
                        log_entry ='ADDED: New mir added  = ' + dashR_Results[idx]
                        log_writer.write(log_entry + '\n')                        
                    else:                        
                        log_entry ='NOT ADDED: ' + dashR_Results[idx] + ' belongs to the same family of ' + rna_type1 + ' but their genomic locations are different.'
                        log_writer.write(log_entry + '\n')                        
                break
            else:
                log_entry = 'Novelmir '+str(dashR_Results[idx])+' is found as : '+ rna_type1 + '. Hence it will not be considered'
                log_writer.write(log_entry + '\n')
    else:
        log_entry = 'Novelmir '+str(dashR_Results[idx])+' is not matched with any known miRNA in DashR database.'
        log_writer.write(log_entry + '\n')
            

counts_known2['chr'] = old_chr
counts_known2['chr_start'] = old_chr_start
counts_known2['chr_end'] = old_chr_end
counts_known2['miRNA_sequence'] = old_seq

# rearranging columns
cols = counts_known2.columns.tolist()
cols = cols[-4:]+ cols[:-4]
counts_known2 = counts_known2[cols]

counts_known2.index = add_index_entry
log_writer.close()

In [None]:
print(counts_unknown2.shape)
print(novel_miRNA1.shape)
print(counts_known2.shape)

# Seed based Sequence Clustering
## **Dictionary Preparation**

In [None]:
seed_dict = {}
Xseed_dict = {}
known_mir_dict = {}
for index in counts_known2.index:
    seed = counts_known2.loc[index,'miRNA_sequence'][1:7]
    Xseed = counts_known2.loc[index,'miRNA_sequence'][7:]
    known_mir_dict[index] = [seed,counts_known2.loc[index,'chr'],counts_known2.loc[index,'chr_start'],
                              counts_known2.loc[index,'chr_end'],counts_known2.loc[index,'miRNA_sequence']]
    seed_dict[index] = seed
    Xseed_dict[index] = Xseed
    

unknown_mir_dict = {}
for index in novel_miRNA1.index:
    seed = novel_miRNA1.loc[index,'miRNA_sequence'][1:7]
    Xseed = novel_miRNA1.loc[index,'miRNA_sequence'][7:]
    unknown_mir_dict[index] = [seed,novel_miRNA1.loc[index,'chr'],novel_miRNA1.loc[index,'chr_start'],
                              novel_miRNA1.loc[index,'chr_end'],novel_miRNA1.loc[index,'miRNA_sequence']]
    seed_dict[index] = seed
    Xseed_dict[index] = Xseed
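
The slicing used in the cell above can be illustrated on a single sequence. This is a minimal sketch: the helper name `split_seed` and the example sequence are assumptions made for illustration and are not part of the pipeline. It only shows that `[1:7]` selects nucleotides 2 to 7 (the seed) and `[7:]` selects everything after the seed.

```python
def split_seed(sequence):
    """Return (seed, xseed) using the same slices as the dictionary cell."""
    seed = sequence[1:7]   # nucleotides 2-7 (0-based slice 1:7)
    xseed = sequence[7:]   # the remainder of the sequence after the seed
    return seed, xseed


# Hypothetical 22-nt example sequence
seed, xseed = split_seed("UGAGGUAGUAGGUUGUAUAGUU")
print(seed, xseed)
```

The seed is always six nucleotides long, while the length of the outer region depends on the length of the mature sequence.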
    

**Fasta File**

In [None]:
count = 1
writer = open("data/seed_seqs.fasta", 'w')
for key in known_mir_dict.keys():
    line = ">" + str(count) + "\t" + key + "\n"
    line += str(known_mir_dict[key][0]) + "\n"
    writer.write(line)
    count+=1

for key in unknown_mir_dict.keys():
    line = ">" + str(count) + "\t" + key + "\n"
    line += str(unknown_mir_dict[key][0]) + "\n"
    writer.write(line)
    count+=1
writer.close()

In [None]:
count = 1
writer = open("data/seed_outer_seqs.fasta", 'w')
for key in known_mir_dict.keys():
    line = ">" + str(count) + "\t" + key + "\n"
    line += str(known_mir_dict[key][-1][7:]) + "\n"
    writer.write(line)
    count+=1

for key in unknown_mir_dict.keys():
    line = ">" + str(count) + "\t" + key + "\n"
    line += str(unknown_mir_dict[key][-1][7:]) + "\n"
    writer.write(line)
    count+=1
writer.close()

## CD-HIT

In [None]:
# Execute CD-HIT command

!cd Tools/cd-hit-v4.6.8-2017-1208/; echo "***** Clustering miRs Seed Sequences Only**********"; \
                                                           ./cd-hit -l 5 -n 2 -c 1 \
                                                                    -i ../../data/seed_seqs.fasta \
                                                                    -o ../../data/cdhit_seed;     \
                                                           echo " "; \
                                                           echo "***** Clustering miRs Rest Sequences Only**********"; \
                                                           ./cd-hit -l 8 -i ../../data/seed_outer_seqs.fasta \
                                                                    -o ../../data/cdhit_seed_outer;

## Seed Clustering Algorithm

Preparing lookup table for Xseed Clusters

In [None]:
"""
Cluster dictionary preparation
Cluster : cluster_no:[mir1,mir2]
fetch_id returns the miR name stored in the FASTA header whose numeric ID equals xid
"""
def fetch_id(xseed_id_fasta, xid):
    for line in xseed_id_fasta:
        if line:
            if '>' in line[0]:   
                if xid == line.split('\t')[0].split('>')[1]:
                    return line.split('\t')[1]


"""

"""

def cluster_dict_preparation(cluster_file, fasta_file):
    cluster_file = open(cluster_file,'r').read().split('\n')
    fasta_id = open(fasta_file,'r').read().split('\n')
    cluster_dict = {}    
    for line in cluster_file:
        if line:
            if '>' in line[0]:
                line_no = cluster_file.index(line)
                flag = True    
                k = line[1:].replace(' ','_')
                xid = cluster_file[line_no+1].split('>')[1].split('.')[0]
                cluster_dict[k] = [fetch_id(fasta_id,xid)]
                i = 2
                while flag:
                    if cluster_file[line_no+i]:
                        if '>' in cluster_file[line_no+i][0]:
                            flag = False
                            pass
                        else:
                            xid = cluster_file[line_no+i].split('>')[1].split('.')[0]
                            cluster_dict[k].append(fetch_id(fasta_id,xid))
                            i += 1
                    else:
                        flag = False
                        
    return cluster_dict


seed_cluster = cluster_dict_preparation('data/cdhit_seed.clstr','data/seed_seqs.fasta')
Xseed_cluster = cluster_dict_preparation('data/cdhit_seed_outer.clstr','data/seed_outer_seqs.fasta')
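
The same lookup table can also be built in a single pass over the `.clstr` text. The sketch below is an alternative illustration, not the function used by the pipeline: `parse_clstr`, the sample cluster text and the miR names are hypothetical, and it assumes that each member line contains the numeric FASTA ID between `>` and `...`, which is how the cell above reads it.

```python
def parse_clstr(clstr_text, id_to_name):
    """Map each cluster header to the list of miR names it contains."""
    clusters = {}
    current = None
    for line in clstr_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            # Header line such as ">Cluster 0" becomes the key "Cluster_0"
            current = line[1:].replace(' ', '_')
            clusters[current] = []
        elif current is not None:
            # Member line: the numeric ID sits between '>' and '...'
            seq_id = line.split('>')[1].split('...')[0]
            clusters[current].append(id_to_name.get(seq_id))
    return clusters


# Hypothetical cluster file content and ID-to-name table
sample_clstr = (
    ">Cluster 0\n"
    "0\t6aa, >1... *\n"
    "1\t6aa, >3... at 100.00%\n"
    ">Cluster 1\n"
    "0\t6aa, >2... *\n"
)
names = {"1": "hsa-miR-A", "2": "novel-1", "3": "novel-2"}
print(parse_clstr(sample_clstr, names))
```

Reading the file once avoids the repeated `index()` searches, which may matter when the number of clusters is large.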

In [None]:
"""
Only for outer seed region clusters
rev_dict : mir_name:[cluster1,cluster2]
"""


def rev_dictionary(cluster):
    rev_dict = defaultdict(list)
    mir_list = list(known_mir_dict.keys())
    for k in list(unknown_mir_dict.keys()):
        mir_list.append(k)
        
    for mir in mir_list:
        for k,v in cluster.items():
            for mir1 in v:
                if mir == mir1:
                    rev_dict[mir].append(k)
                    
    return dict(rev_dict)


exact_seed_rev_dict = rev_dictionary(exact_seed_cluster)
seed_rev_dict = rev_dictionary(seed_cluster)
Xseed_rev_dict = rev_dictionary(Xseed_cluster)
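
The reverse lookup amounts to inverting the cluster dictionary. A minimal sketch follows, assuming a small hypothetical cluster table; unlike `rev_dictionary`, the helper `invert_clusters` (a name introduced here) does not restrict the result to miRs present in `known_mir_dict` or `unknown_mir_dict`.

```python
from collections import defaultdict


def invert_clusters(cluster):
    """Map each member name to the list of cluster keys that contain it."""
    reverse = defaultdict(list)
    for cluster_key, members in cluster.items():
        for member in members:
            reverse[member].append(cluster_key)
    return dict(reverse)


# Hypothetical cluster table
example = {"Cluster_0": ["hsa-miR-A", "novel-2"], "Cluster_1": ["novel-1"]}
print(invert_clusters(example))
```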

In [None]:
"""
Function to update and merge the counts if all conditions (seed,seq and loc) are satisfied
Input : df1,df2,mir1 (from df1), mir2 (from df2)
Output : df1_modified (with updated counts of mir1) and df2_modified (with mir2 removed)
"""

def update_counts(df1,df2,mir1,mir2):
    try:
        mir1_counts = list(df1.loc[mir1,:].values)[4:]
        mir2_counts = list(df2.loc[mir2,:].values)[4:]
        mir1_counts = list(map(add,mir1_counts,mir2_counts))
        df1.loc[mir1,:] = list(df1.loc[mir1,:].values)[:4] + mir1_counts
        df2 = df2.drop(mir2,axis = 0)    
        return df1, df2
    except KeyError:
        print('Either mir1 or mir2 has been removed.')
        return df1, df2
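
The count merge inside `update_counts` is an element-wise addition of the sample columns while the first four annotation fields are kept from the first row. The sketch below shows that step on plain lists; `merge_rows` and the example values are hypothetical, and it assumes the four annotation fields come first, as in the rearranged count tables.

```python
from operator import add


def merge_rows(row_a, row_b, n_meta=4):
    """Keep the annotation fields of row_a and add the counts of row_b to it."""
    meta = row_a[:n_meta]
    summed = list(map(add, row_a[n_meta:], row_b[n_meta:]))
    return meta + summed


# Hypothetical rows: chr, chr_start, chr_end, miRNA_sequence, then per-sample counts
known = ["chr1", 100, 121, "UGAGGUAGUAGGUUGUAUAGUU", 10, 0, 5]
novel = ["chr1", 100, 121, "UGAGGUAGUAGGUUGUAUAGUU", 2, 3, 0]
print(merge_rows(known, novel))
```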

In [None]:
log_writer = open("data/LOG_DIR/seed_based_clustering.txt", 'w')
log_writer.write('\n')
log_writer.write('\n')
log_writer.write('###################################################################################\n')
log_writer.write('************************Starting Seed-based Clustering****************************\n')
log_writer.write('###################################################################################\n')
log_writer.write('\n')
log_writer.write('\n')

counts_known3 = counts_known2.copy()
novel_miRNA2 = novel_miRNA1.copy()
mir_reannotation = pd.DataFrame()
mir_old_name = []
mir_new_name = []
novel_mir_seq = []
family_name = []
attribute = []

for k,v in tqdm(seed_cluster.items()):    
    novel_flag = True
    novel_mir_name = []
    known_mir_name = []
    for mir in seed_cluster[k]:
        if not 'hsa' in mir:
            novel_mir_name.append(mir)
            novel_flag = False
        else:
            known_mir_name.append(mir)
            
    if not novel_flag:
        char_no = 0
        for n_mir in novel_mir_name:
            n_chr_loc = unknown_mir_dict[n_mir][1]
            n_chr_loc_start = int(unknown_mir_dict[n_mir][2])
            n_chr_loc_end = int(unknown_mir_dict[n_mir][3])
            if known_mir_name:
                for k_mir in known_mir_name:
                    novel_mir_name = [i for i in novel_mir_name if i is not None]
                    if novel_mir_name and n_mir in novel_mir_name:                        
                        k_chr_loc = known_mir_dict[k_mir][1]
                        k_chr_loc_start = int(known_mir_dict[k_mir][2])
                        k_chr_loc_end = int(known_mir_dict[k_mir][3])
                        if k_mir in Xseed_cluster[Xseed_rev_dict[n_mir][0]]:
                            if (n_chr_loc == k_chr_loc) and (n_chr_loc_start == k_chr_loc_start) and (n_chr_loc_end == k_chr_loc_end):
                                counts_known3,novel_miRNA2 = update_counts(counts_known3,novel_miRNA2,k_mir,n_mir)
                                log_entry = n_mir + '\t' + unknown_mir_dict[n_mir][-1] + '\t' + n_chr_loc + '\t' + str(n_chr_loc_start) + '\t' + str(n_chr_loc_end)
                                log_writer.write(log_entry + '\n')                                            
                                log_entry = k_mir + '\t' + known_mir_dict[k_mir][-1] + '\t' + k_chr_loc + '\t' + str(k_chr_loc_start) + '\t' + str(k_chr_loc_end)
                                log_writer.write(log_entry + '\n')                                            
                                log_entry = n_mir + ' will be merged to ' + k_mir
                                log_writer.write(log_entry + '\n')                                            
                                novel_mir_name[novel_mir_name.index(n_mir)] = None 
                                mir_old_name.append(n_mir)
                                mir_new_name.append(k_mir)
                                family_name.append(k_mir)
                                novel_mir_seq.append(unknown_mir_dict[n_mir][-1])
                                attribute.append('Same Seed, Xseed and loc')
                                break
                            else:
                                if char_no < 26:
                                    new_name = k_mir+'-'+ chr(97 + char_no)
                                else:
                                    a,b,c = random.randint(0,25), random.randint(0,25), random.randint(0,25)
                                    new_name = k_mir+'-'+ chr(97 + a) + chr(97 + b) + chr(97 + c)
                                log_entry = n_mir + '\t' + unknown_mir_dict[n_mir][-1] + '\t' + n_chr_loc + '\t' + str(n_chr_loc_start) + '\t' + str(n_chr_loc_end)
                                log_writer.write(log_entry + '\n')                                            
                                log_entry = k_mir + '\t' + known_mir_dict[k_mir][-1] + '\t' + k_chr_loc + '\t' + str(k_chr_loc_start) + '\t' + str(k_chr_loc_end)
                                log_writer.write(log_entry + '\n')
                                log_entry = n_mir + ' will be renamed as ' + new_name
                                log_writer.write(log_entry + '\n')                                    
                                novel_miRNA2 = novel_miRNA2.rename(index={n_mir : new_name})
                                novel_mir_name[novel_mir_name.index(n_mir)] = None
                                char_no += 1
                                mir_old_name.append(n_mir)
                                mir_new_name.append(new_name)
                                family_name.append(k_mir)
                                novel_mir_seq.append(unknown_mir_dict[n_mir][-1])
                                attribute.append('Functionally Same')
                                break
                        else:
                            log_entry = n_mir + '\t' + unknown_mir_dict[n_mir][-1] + '\t' + n_chr_loc + '\t' + str(n_chr_loc_start) + '\t' + str(n_chr_loc_end)
                            log_writer.write(log_entry + '\n')                                            
                            log_entry = k_mir + '\t' + known_mir_dict[k_mir][-1] + '\t' + k_chr_loc + '\t' + str(k_chr_loc_start) + '\t' + str(k_chr_loc_end)
                            log_writer.write(log_entry + '\n')
                            log_entry = n_mir + ' will be treated as pure novel miR with same phylogenetic tree member of ' + k_mir + ' family'
                            log_writer.write(log_entry + '\n')
                            # Mark the novel miR itself as processed (index it in novel_mir_name, not known_mir_name)
                            novel_mir_name[novel_mir_name.index(n_mir)] = None
                            mir_old_name.append(n_mir)
                            mir_new_name.append(n_mir)
                            family_name.append(k_mir)
                            novel_mir_seq.append(unknown_mir_dict[n_mir][-1])
                            attribute.append('Same Phylogenetic Tree')
                            break
                                
                                
                        
            
        novel_mir_name = [i for i in novel_mir_name if i is not None]                        
        if len(novel_mir_name)>1:  
            for n_mir in novel_mir_name:
                if n_mir is not None:
                    char_no = 1
                    novel_mir_name1 = novel_mir_name.copy()
                    novel_mir_name1.remove(n_mir)
                    n_chr_loc = unknown_mir_dict[n_mir][1]
                    n_chr_loc_start = int(unknown_mir_dict[n_mir][2])
                    n_chr_loc_end = int(unknown_mir_dict[n_mir][3])
                    seed_common_mir = list(set(novel_mir_name1) & set(exact_seed_cluster[exact_seed_rev_dict[n_mir][0]]))
                    if seed_common_mir:                    
                        Xseed_common_mir = list(set(seed_common_mir) & set(Xseed_cluster[Xseed_rev_dict[n_mir][0]]))
                        if Xseed_common_mir:
                            for c_mir1 in Xseed_common_mir:
                                c_chr_loc = unknown_mir_dict[c_mir1][1]
                                c_chr_loc_start = int(unknown_mir_dict[c_mir1][2])
                                c_chr_loc_end = int(unknown_mir_dict[c_mir1][3])
                                if (n_chr_loc == c_chr_loc) and (n_chr_loc_start == c_chr_loc_start) and (n_chr_loc_end == c_chr_loc_end):
                                    log_entry = n_mir + '\t' + unknown_mir_dict[n_mir][-1] + '\t' + n_chr_loc + '\t' + str(n_chr_loc_start) + '\t' + str(n_chr_loc_end)
                                    log_writer.write(log_entry + '\n')                                            
                                    log_entry = c_mir1 + '\t' + unknown_mir_dict[c_mir1][-1] + '\t' + c_chr_loc + '\t' + str(c_chr_loc_start) + '\t' + str(c_chr_loc_end)
                                    log_writer.write(log_entry + '\n')
                                    log_entry = n_mir + ' will be merged to ' + c_mir1
                                    log_writer.write(log_entry + '\n')                                    
                                    _,novel_miRNA2 = update_counts(novel_miRNA2,novel_miRNA2,n_mir,c_mir1)
                                    novel_mir_name[novel_mir_name.index(c_mir1)] = None
                                    mir_old_name.append(n_mir)
                                    mir_new_name.append(c_mir1)
                                    family_name.append(c_mir1)
                                    novel_mir_seq.append(unknown_mir_dict[n_mir][-1])
                                    attribute.append('Same Seed, Xseed and loc')
                                    break
                                else:
                                    if char_no < 26:
                                        new_name = n_mir+'-'+ chr(97 + char_no)
                                    else:
                                        a,b,c = random.randint(0,25), random.randint(0,25), random.randint(0,25)
                                        new_name = n_mir+'-'+ chr(97 + a) + chr(97 + b) + chr(97 + c)
                                    log_entry = n_mir + '\t' + unknown_mir_dict[n_mir][-1] + '\t' + n_chr_loc + '\t' + str(n_chr_loc_start) + '\t' + str(n_chr_loc_end)
                                    log_writer.write(log_entry + '\n')                                            
                                    log_entry = c_mir1 + '\t' + unknown_mir_dict[c_mir1][-1] + '\t' + c_chr_loc + '\t' + str(c_chr_loc_start) + '\t' + str(c_chr_loc_end)
                                    log_writer.write(log_entry + '\n')
                                    log_entry = n_mir + ' and ' + c_mir1 + ' are functionally same. So ' + c_mir1 + ' will be renamed to ' + new_name
                                    log_writer.write(log_entry + '\n')                                    
                                    novel_miRNA2 = novel_miRNA2.rename(index={c_mir1 : new_name})
                                    novel_mir_name[novel_mir_name.index(c_mir1)] = None
                                    char_no += 1
                                    mir_old_name.append(c_mir1)
                                    mir_new_name.append(new_name)
                                    family_name.append(n_mir)
                                    novel_mir_seq.append(unknown_mir_dict[n_mir][-1])
                                    attribute.append('Functionally Same')
                                    break
                        else:
                            for c_mir1 in seed_common_mir:
                                # Look up the genomic location of c_mir1 before logging it
                                c_chr_loc = unknown_mir_dict[c_mir1][1]
                                c_chr_loc_start = int(unknown_mir_dict[c_mir1][2])
                                c_chr_loc_end = int(unknown_mir_dict[c_mir1][3])
                                log_entry = n_mir + '\t' + unknown_mir_dict[n_mir][-1] + '\t' + n_chr_loc + '\t' + str(n_chr_loc_start) + '\t' + str(n_chr_loc_end)
                                log_writer.write(log_entry + '\n')                                            
                                log_entry = c_mir1 + '\t' + unknown_mir_dict[c_mir1][-1] + '\t' + c_chr_loc + '\t' + str(c_chr_loc_start) + '\t' + str(c_chr_loc_end)
                                log_writer.write(log_entry + '\n')
                                log_entry = n_mir + ' and ' + c_mir1 + ' belong to the same family of the phylogenetic tree.'
                                log_writer.write(log_entry + '\n')                                     
                                char_no += 1
                                novel_mir_name[novel_mir_name.index(c_mir1)] = None
                                mir_old_name.append(c_mir1)
                                mir_new_name.append(c_mir1)
                                family_name.append(n_mir)
                                novel_mir_seq.append(unknown_mir_dict[n_mir][-1])
                                attribute.append('Same Phylogenetic Tree')
                                                            
            
mir_reannotation['Old_miR_ID'] = mir_old_name
mir_reannotation['New_miR_ID'] = mir_new_name
mir_reannotation['Family'] = family_name
mir_reannotation['Relation'] = attribute
mir_reannotation['sequence'] = novel_mir_seq
mir_reannotation.to_csv("data/mir_reannotation.csv",encoding='utf-8',index=False)
log_writer.close()

Results from functional annotation module are saved in **data/mir_reannotation.csv** and processing details are saved in **data/LOG_DIR/seed_based_clustering.txt**.
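
The three values written to the `Relation` column follow from two checks on a pair of miRs that already share a seed cluster. The sketch below restates that decision in isolation; the function names and the example miR name are hypothetical, and the suffix helper only covers the case of fewer than 26 renamed members (the pipeline falls back to three random letters beyond that).

```python
def classify_relation(same_xseed_cluster, same_location):
    """Return the Relation label for two miRs that share a seed cluster."""
    if same_xseed_cluster and same_location:
        return 'Same Seed, Xseed and loc'   # counts are merged into one entry
    if same_xseed_cluster:
        return 'Functionally Same'          # the miR is renamed with a letter suffix
    return 'Same Phylogenetic Tree'         # the miR keeps its own name


def suffixed_name(base_name, char_no):
    """Append a lowercase letter suffix: 0 -> 'a', 1 -> 'b', and so on."""
    return base_name + '-' + chr(97 + char_no)


print(classify_relation(True, False), suffixed_name('hsa-miR-X', 0))
```

Here "same location" means identical chromosome, start and end coordinates, which is the comparison used in the clustering cell.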

In [None]:
# counts_known3 holds the known miRNA counts after merging in the seed-based clustering step
frame = [counts_known3,novel_miRNA2]
counts_final = pd.concat(frame)
counts_final.to_csv("data/counts_final.csv",encoding='utf-8',index=True)
print(counts_final.shape)

In [None]:
# Preparing the final count matrix for differential expression analysis
counts_final = counts_final.reindex(sorted(counts_final.columns), axis=1)
counts_final = counts_final.drop(['chr','chr_start','chr_end','miRNA_sequence'],axis=1)
counts_final.to_csv("data/counts_final_1.csv",encoding='utf-8',index=True)

# miRNA Differential expression Analysis using DESeq2

In [None]:
"""
    Before running this module, please confirm that the order of samples
    listed in sample_list.csv is the same as the order of columns in
    counts_final_1.csv. For example, if the samples in sample_list.csv are:
    Sample        File        Condition
    SampleA   sampleA.fastq    treated
    SampleB   sampleB.fastq    treated
    SampleC   sampleC.fastq    treated
    SampleD   sampleD.fastq    treated
       .           .              .   
       .           .              .   
       .           .              .   
       .           .              .   
       
    The column order in counts_final_1.csv should be:
    
       .    sampleA    sampleB    sampleC    sampleD    
       .      25         102         10         5
       .       .          .           .         .
       .       .          .           .         .
       .       .          .           .         .
"""
# !Rscript $HOME_DIR/scripts/install_packages.R

!Rscript $HOME_DIR/scripts/norm_diff_exp.R miRNA
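
The ordering requirement described in the docstring can be checked before calling the R script. This is a minimal sketch under an assumption: the count columns are named after the FASTQ file stems (for example `sampleA.fastq` gives `sampleA`), as in the example above. The helper name `columns_match_samples` is introduced here for illustration.

```python
import os


def columns_match_samples(sample_files, count_columns):
    """True when the FASTQ file stems appear in the same order as the count columns."""
    stems = [os.path.splitext(os.path.basename(f))[0] for f in sample_files]
    return stems == list(count_columns)


# Hypothetical file names and column names
files = ["sampleA.fastq", "sampleB.fastq", "sampleC.fastq"]
print(columns_match_samples(files, ["sampleA", "sampleB", "sampleC"]))
print(columns_match_samples(files, ["sampleB", "sampleA", "sampleC"]))
```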

# piRNA Counts Analysis

In [None]:
os.environ['COUNT_DIR'] = os.path.join(os.environ['HOME_DIR'],'data','piRNA','pirna_counts')   
# os.chdir(os.environ['COUNT_DIR'])

raw_counts = pd.DataFrame()
for file in tqdm(os.listdir(os.environ['COUNT_DIR'])):
    if file.endswith('_counts.txt'):
        file_open = open(os.path.join(os.environ['COUNT_DIR'],file),"r").read().split("\n")        
        pi_name = []
        pi_count = []
        for line in file_open:
            if len(line) > 1:
                pi_name.append(line.split("\t")[-2])
                pi_count.append(line.split("\t")[-1])
        raw_counts[file.split("_")[0]] = pi_count
        

raw_counts["piRNA_ID"] = pi_name
raw_counts = raw_counts.set_index(["piRNA_ID"])

raw_counts = raw_counts.apply(pd.to_numeric)
raw_counts = raw_counts.groupby(['piRNA_ID']).sum()


# Count filtering: removing piRNAs that are not expressed in any sample
raw_counts1 = pd.DataFrame(columns = raw_counts.columns)
for index in list(raw_counts.index.values):
    row_values = [int(val) for val in raw_counts.loc[index,:].values]
    if sum(row_values) == 0:
        continue
    raw_counts1.loc[index,:] = row_values
        
# Sorting columns
raw_counts1 = raw_counts1.reindex(sorted(raw_counts1.columns), axis=1)
# Saving Results
raw_counts1.to_csv(os.path.join(os.environ['HOME_DIR'],'data','piRNA','piRNA_raw_counts.csv'),encoding='utf-8',index=True)
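
The parsing and filtering steps can be followed on a few lines of a single counts file. The sketch below is a simplified, single-sample illustration in plain Python: `collect_counts`, the piRNA names and the leading field in each line are hypothetical, and only the last two tab-separated fields (ID and count) are used, as in the cell above. The pipeline itself applies the zero filter across all samples.

```python
from collections import defaultdict


def collect_counts(lines):
    """Sum counts per piRNA ID and drop IDs whose total is zero."""
    totals = defaultdict(int)
    for line in lines:
        if len(line) > 1:
            fields = line.split("\t")
            totals[fields[-2]] += int(fields[-1])
    return {name: total for name, total in totals.items() if total > 0}


# Hypothetical lines from one *_counts.txt file
example_lines = ["chr1\tpiR-a\t3", "chr2\tpiR-a\t2", "chr1\tpiR-b\t0", ""]
print(collect_counts(example_lines))
```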



# piRNA Differential expression Analysis using DESeq2

In [None]:
!Rscript $HOME_DIR/scripts/norm_diff_exp.R piRNA

# Finding significantly altered differentially expressed miRNAs

In [None]:
'''
    This module saves the counts, log2 fold change and p_adj values of the significantly
    differentially expressed miRNAs found in the previous step.
'''

df1 = pd.read_csv(os.path.join(os.environ['SEQ_DIR'],'miRNA_significantly_DE_mir.csv'), sep=',',header=None)
df2 = pd.read_csv(os.path.join(os.environ['SEQ_DIR'],'counts_final.csv'), sep=',',header=0)
try:    
    df2 = df2.set_index(['mirID'])
except KeyError:
    a = df2[df2.columns[df2.columns.str.startswith('Unnamed:')]]
    df2['mirID'] = list(a.loc[:,list(a.columns.values)[0]].values)

    # Remove Unnamed column
    df2 = df2[df2.columns[~df2.columns.str.startswith('Unnamed:')]]
    df2 = df2.set_index(['mirID'])
df3 = pd.DataFrame()
for idx in range(df1.shape[0]):
    idx2 = list(df2.index.values).index(df1.iloc[idx,0])   
    df3 = df3.append(df2.iloc[idx2,:])

# rearranging columns
cols = df3.columns.tolist()
cols = cols[-4:]+ cols[:-4]
df3 = df3[cols]
# Saving counts of significantly differentially expressed miRNAs

df3.to_csv(os.path.join(os.environ['SEQ_DIR'],'miRNA_DEmir_Counts.csv'),encoding='utf-8',index=True)

# Add expression details in above results

df4 = pd.read_csv(os.path.join(os.environ['SEQ_DIR'],"miRNA_DESeq_expression_data_controlIsUnreated.csv"),sep=',',header=0,index_col=0)
df4 = df4.set_index(['cts...1.'])

df5 = pd.DataFrame()
p_adj = []
up_down_expr = []
fold_change = []
SE_mir = list(df3.index.values)
for mir in SE_mir:    
    idx = list(df4.index.values).index(mir)
    p_adj_val = "{0:.5f}".format(df4.loc[str(mir),'padj'])
    p_adj.append(p_adj_val)
    if df4.loc[str(mir),'log2FoldChange']>0:
        up_down_expr_val = 'up'
    else:
        up_down_expr_val = 'down'
        
    up_down_expr.append(up_down_expr_val)
    fold_change.append(df4.loc[str(mir),'log2FoldChange'])
        
df5['miRNA ID'] = SE_mir
df5['p_adj'] = p_adj
df5['Regulation'] = up_down_expr
df5['Fold Change'] = fold_change
df5.to_csv(os.path.join(os.environ['SEQ_DIR'],'miRNA_DEmir_Res.tsv'),sep='\t',encoding='utf-8',index=True)
print(df5.shape)
df5
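
The per-miRNA summary built in the loop above reduces to one small rule: the sign of `log2FoldChange` decides the regulation label and `padj` is formatted to five decimals. The sketch below shows that rule on plain numbers; `describe_expression` and the input values are hypothetical. As in the cell above, the `Fold Change` field stores the log2 value, and a log2 fold change of exactly zero is labelled `down`.

```python
def describe_expression(log2_fold_change, p_adj):
    """Return the fields written to miRNA_DEmir_Res.tsv for one miRNA."""
    regulation = 'up' if log2_fold_change > 0 else 'down'
    return {
        'p_adj': "{0:.5f}".format(p_adj),
        'Regulation': regulation,
        'Fold Change': log2_fold_change,
    }


print(describe_expression(1.5, 0.00012345))
print(describe_expression(-2.0, 0.04))
```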

# Output

You can check your results in the following files:

    1. miRNA_significantly_DE_mir.csv : This file contains the list of significantly differentially expressed miRNAs (Location: In the same directory where all fastq files are stored).
    
    2. miRNA_DEmir_Res.tsv  : This file contains the significantly differentially expressed miRNAs along with their p_adj values, regulation direction and log2 fold change (Location: In the same directory where all fastq files are stored).
    
    3. miRNA_DEmir_Counts.csv : This file contains counts of significantly differentially expressed miRNAs in all samples (Location: In the same directory where all fastq files are stored).
    
    4. mir_reannotation.csv : This file contains results from functional annotation modules (Location: In the same directory where all fastq files are stored).
    
    5. counts_final_1.csv : This file contains the raw counts of all miRNAs obtained from all samples before differential expression analysis (Location: data/counts_final_1.csv).
    
    6. piRNA_raw_counts.csv : This file contains raw counts of all piRNAs obtained from all samples before differential expression analysis (Location: piRNA/piRNA_raw_counts.csv).
    
    7. piRNA_significantly_DE_mir.csv : This file contains the list of all significantly differentially expressed piRNAs (Location: piRNA/piRNA_significantly_DE_mir.csv).