Skip to content

DeepCPP, a program making protein coding potential prediction.

Notifications You must be signed in to change notification settings

yuuuuzhang/DeepCPP

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

24 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

DeepCPP: A deep neural network based on nucleotide bias information and minimum distribution similarity feature selection for RNA coding potential prediction

The codes and data here are used to predict RNA protein coding potential. It will give researchers usegul guidelines to explore the functions of newly found RNAs.

PUBLICATION

Please cite this paper if using the codes here: Zhang Y, Jia C, Fullwood MJ, Kwoh CK. DeepCPP: a deep neural network based on nucleotide bias information and minimum distribution similarity feature selection for RNA coding potential prediction. Briefings in Bioinformatics. 2020 Mar 30.

EXPLANATION

This repository contains four folders, code, rawdata, input_files and outputfiles.

Code folder:

contains the python codes.

utils_lncRNA.py -- functions will be used.  
DeepCPP.ipynb -- user interface.  

rawdata folder:

This folder contains the new sORF human test datasets in [1].

input_files folder:

18 models:
human normal model: human1.h5, human2.h5, human3.h5
human sORF model: hsorf1.h5, hsorf2.h5, hsorf3.h5
vertebrate normal model: vert1.h5, vert2.h5, vert3.h5
vertebrate sORF model: vsorf1.h5, vsorf2.h5, vsorf3.h5
insect normal model: insect1.h5, insect2.h5, insect3.h5
insect sORF model: isorf1.h5, isorf2.h5, isorf3.h5
6 hexamer score files:
human: humanlnc_6mermean.csv, humanmrna_6mermean.csv
vertebrate: vertlnc_6mermean.csv, vertmrna_6mermean.csv
insect: insectlnc_6mermean.csv, insectmrna_6mermean.csv
2 feature set files:
used for normal data: normal566.csv
used for sORF data: sorf450.csv
1 example input file:
human sORF data: humansorf.fa, the first 232 samples are mRNA, the last 232 samples are lncRNA.

output_files folder:

This folder contains the predicted results of input file.

WORKING MECHANISM OF DeepCPP

The overview of the mechanism of DeepCPP is illustrated in Figure 1. Firstly, 8 kinds of features are extracted from the raw sequence, then we use the first 566 features for normal data and 450 features for sORF data sorted by our newly proposed feature selection method, MDS, as the input feature subsets to the deep network. Finally, the average scores will be calculated from three prebuilt models and the coding potential prediction label will be given. alt text figure 1.An overview of the mechanism of DeepCPP.

USAGE:

Based on python3.
Python modules:

numpy  
pandas  
csv  
keras
Bio
math

will be used. keras is backened on tensorflow.
Download all the files firstly, open DeepCPP.ipynb, change code:

test_model('.../input_files/','.../output_files/',
           'humansorf.fa',species,type)

'humansorf.fa' is the input file name, it must be in .fa or .fasta format.
species: three options are avalibale: 'human','vert' and 'insect'.
type: 'normal' or 'sorf'. 'normal' means normaldata, 'sorf'means RNAs with short ORF.
The predicted results is located in output_files.

More details can be found from [1]

REFERANCE

[1] Zhang Y, Jia C, Fullwood MJ, Kwoh CK. DeepCPP: a deep neural network based on nucleotide bias information and minimum distribution similarity feature selection for RNA coding potential prediction. Briefings in bioinformatics. 2021 Mar;22(2):2073-84.

CONTACT

If you have any inqueries, please contact YU007@e.ntu.edu.sg

About

DeepCPP, a program making protein coding potential prediction.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published