gene-prediction

Introduction

This project is based on the 2008 paper 'Gene prediction in metagenomic fragments: A large scale machine learning approach' by Hoff et al. The gene_prediction_pipeline.ipynb notebook trains a gene prediction algorithm in two stages, a linear discriminant followed by a neural network, and provides an easy way to get predictions for sequences. The linear discriminant reduces the high-dimensional features to a lower-dimensional representation, and the neural network predicts whether a given sequence is a gene. I started this project to gain a deeper understanding of the machine learning approach behind the Orphelia gene prediction algorithm and to sharpen my ML and programming skills.
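To make the two-stage idea concrete, here is a minimal forward-pass sketch in plain Python. It only illustrates the data flow (high-dimensional features collapsed to a score, scores fed to a small network); it is not the notebook's code. The function names, the sigmoid activations, the layer sizes, and every weight value are assumptions chosen for illustration, and no training is shown.

```python
import math


def discriminant_score(features, weights, bias=0.0):
    """Stage 1: project a feature vector onto a single score (w . x + b)."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def network_probability(inputs, hidden_weights, hidden_biases,
                        output_weights, output_bias):
    """Stage 2: one hidden layer plus a sigmoid output read as P(gene)."""
    hidden = [
        sigmoid(discriminant_score(inputs, w, b))
        for w, b in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(discriminant_score(hidden, output_weights, output_bias))


# Hypothetical 4-dimensional feature vector and made-up weights.
codon_features = [0.10, 0.40, 0.30, 0.20]
stage1_weights = [1.5, -0.5, 0.25, 2.0]
score = discriminant_score(codon_features, stage1_weights)

# The reduced score is combined with another scalar feature (GC content here).
gc_fraction = 0.52
probability = network_probability(
    [score, gc_fraction],
    hidden_weights=[[0.8, -1.2], [-0.3, 0.9]],
    hidden_biases=[0.1, -0.1],
    output_weights=[1.1, -0.7],
    output_bias=0.05,
)
print(f"score={score:.3f} probability={probability:.3f}")
```

In a real pipeline the weights come from fitting the discriminant and the network on labelled coding and noncoding sequences; the sketch only shows why the first stage keeps the network's input small.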

Contents

Preprocess Genome
- Extract coding/noncoding sequences
- Shuffle/split data
Extract Features
- Monocodon (trinucleotide) frequency
- Dicodon (hexanucleotide) frequency
- Translation initiation site (TIS)
- GC content
Train Linear Discriminant for Dimensionality Reduction
Train Binary Neural Network for Gene Prediction
Predict Sequences (FASTA)
- Prediction from genome sequence
- Prediction from FASTA input file
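As a rough illustration of the input and feature steps listed above, the sketch below reads FASTA records and computes GC content plus monocodon and dicodon frequencies using only the standard library. The helper names (read_fasta, gc_content, codon_word_frequencies) are hypothetical and are not taken from the repository's modules. It also assumes the sequence is already in the correct reading frame and that dicodon windows advance one codon at a time; the notebook may do either differently.

```python
from collections import Counter


def read_fasta(lines):
    """Parse FASTA-formatted lines into a {header: sequence} dictionary."""
    records, header, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        records[header] = "".join(chunks)
    return records


def gc_content(seq):
    """Fraction of bases in seq that are G or C (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def codon_word_frequencies(seq, codons_per_word=1):
    """Relative frequency of in-frame words of 1 codon (3 nt) or 2 codons (6 nt).

    Windows start at every codon boundary, so dicodon windows overlap by one
    codon. Words containing characters other than A, C, G, T are skipped.
    """
    width = 3 * codons_per_word
    words = [seq[i:i + width] for i in range(0, len(seq) - width + 1, 3)]
    counts = Counter(w for w in words if set(w) <= set("ACGT"))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}


# Usage on a tiny in-memory example instead of a real input file.
example = [">fragment_1", "ATGAAA", "TGA"]
for name, sequence in read_fasta(example).items():
    print(name, gc_content(sequence))
    print(codon_word_frequencies(sequence, codons_per_word=1))
    print(codon_word_frequencies(sequence, codons_per_word=2))
```

Returning sparse dictionaries keeps the example short; a model would normally need fixed-length vectors (64 entries for monocodons, 4096 for dicodons) with zeros for unseen words.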

TODO

Implement batch training (to cope with the large amount of training data)
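One possible starting point for this item is a generator that groups any iterable of training examples into fixed-size batches, so the full dataset never has to sit in memory at once. This is a sketch of the general pattern, not planned project code; the name iter_batches and the batch size are assumptions.

```python
from itertools import islice


def iter_batches(records, batch_size):
    """Yield successive lists of at most batch_size items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Usage: the source can be a lazy generator, e.g. one reading features from disk.
feature_stream = ((i, i % 2) for i in range(10))  # hypothetical (feature, label) pairs
for batch in iter_batches(feature_stream, batch_size=4):
    print(len(batch), batch)
```

Each batch would then be handed to whatever incremental update step the chosen training framework offers.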

Setup

With Google Colab

First, open a new Google Colab notebook. Mount your Google Drive and change to your working directory.

from google.colab import drive
drive.mount('/content/gdrive')

# Change working directory
%cd gdrive/MyDrive/

Next, clone this repository and move into the directory.

# Clone git repository (copy all files to Google Colab)
!git clone https://github.com/viggy-ravi/gene-prediction.git

# Go to gene-prediction folder
%cd gene-prediction/

Open the gene_prediction_pipeline.ipynb file and install the necessary dependencies (the repository includes a requirements.txt). You will then be able to replicate the results from this notebook.
