tjx711/noncoding-recurrence

Noncoding Mutation Recurrence Prediction

This project explores regression models for predicting noncoding mutation recurrence in cancer. The goal is to find the most accurate model, rank noncoding mutations by their predicted recurrence from high to low, and select the top N mutations as candidate cancer driver mutations to be validated by biological experiments.

Author: Tanjin Xu, tjx711@gmail.com
Respondent: Dr. Stephen A. Ramsey, stephen.ramsey@oregonstate.edu
Org: Dr. Ramsey Lab, Oregon State University, http://lab.saramsey.org/

The overall procedure has three main steps:

  1. Noncoding mutation annotation, feature extraction, and recurrence calculation
  2. Regression analysis with three main categories of nonlinear models:
    • Generalized linear model including Poisson and Negative Binomial (written in R)
    • Ensemble of decision trees including Random Forest and Boosting (written in Python)
    • Deep neural network (written in Python)
  3. Result analysis and plotting
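
As a rough illustration of the recurrence calculation in step 1 and the top-N ranking described in the introduction, here is a minimal Python sketch. It is not the project's code: the function names `count_recurrence` and `top_n_candidates`, the tuple layout of a mutation call, and the toy data are all hypothetical. It also assumes that "recurrence" means the number of distinct tumor samples carrying the same mutation, which this README does not define explicitly.

```python
from collections import defaultdict


def count_recurrence(calls):
    """Map each mutation key to the number of distinct samples carrying it.

    Assumption: recurrence = number of distinct samples with the same
    (chromosome, position, alternate allele).
    """
    samples_by_mutation = defaultdict(set)
    for sample_id, chrom, pos, alt in calls:
        samples_by_mutation[(chrom, pos, alt)].add(sample_id)
    return {key: len(samples) for key, samples in samples_by_mutation.items()}


def top_n_candidates(scores, n):
    """Return the n mutation keys with the highest score, highest first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [key for key, _ in ranked[:n]]


# Toy calls (sample_id, chromosome, position, alternate allele); values are made up.
calls = [
    ("S1", "chr1", 1000, "A"),
    ("S2", "chr1", 1000, "A"),
    ("S2", "chr1", 1000, "A"),  # duplicate call in the same sample counts once
    ("S3", "chr1", 2000, "T"),
    ("S1", "chr2", 3000, "G"),
]
recurrence = count_recurrence(calls)

# In the real pipeline the scores would be model predictions; the observed
# counts are used here only as a stand-in.
candidates = top_n_candidates(recurrence, 2)
```

Keeping the ranking function independent of how the scores were produced means the same helper could be applied to the output of any of the regression models in step 2.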

Project directory structure:

  • backup: periodic backups of the project directory.
  • data: downloaded raw feature data, intermediate data, and the input data matrix for regression analysis.
  • docs: reference documents
  • nils: the supplementary data of the paper by Nils J. Fredriksson.
  • SAE: ANN and SAE code (run on GPU).
  • source: source code of the project including data processing, feature extraction, and regression analysis.

Code examples explanation:

Remaining work:

  • Add more features. Some considerations: 1) more specific transcription factors; 2) the distance to oncogenes/tumor-suppressor genes; 3) DNA shape and histone information; 4) chromosome length; and others.
  • Remaining work for regression analysis: 1) fit a zero-truncated negative binomial regression to the current noncoding mutation data; 2) give special handling to the extreme outliers, i.e. the rare mutations with extremely high frequency; 3) check the predicted recurrence of the Poisson/NB models, e.g. compare the 100 mutations with the highest predicted recurrence against their true frequency values; 4) consider treating the task as a ranking regression problem; 5) carefully re-examine the feature data.
  • Binary classification problem. The challenge is to construct a balanced training set consisting of positive cases (true noncoding driver mutations) and negative cases (validated passenger mutations).
  • Focus on specific cancer types. Modeling all cancer types together may be more difficult because of the high heterogeneity among tumor samples from different cancer types.
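
Two of the regression items above can be sketched in a few lines of Python. The sketch below is illustrative only and not part of the project's code: `ztnb_logpmf` and `top_k_overlap` are hypothetical names, the negative binomial is assumed to use the mean/dispersion (NB2) parameterization, and all numbers are made up. The first function gives the log-probability of a zero-truncated negative binomial, obtained by dividing the ordinary negative binomial probability by 1 - P(Y = 0). The second measures how many of the k mutations with the highest predicted recurrence are also among the k with the highest observed frequency.

```python
import math


def ztnb_logpmf(y, mu, theta):
    """Log-probability of a count y >= 1 under a zero-truncated negative binomial.

    Assumption: mu is the mean and theta the dispersion (size) of the
    untruncated NB2 distribution.
    """
    if y < 1:
        return float("-inf")
    log_p_zero = theta * math.log(theta / (theta + mu))
    log_nb = (
        math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
        + log_p_zero
        + y * math.log(mu / (theta + mu))
    )
    # Condition on y > 0: divide by 1 - P(Y = 0).
    return log_nb - math.log1p(-math.exp(log_p_zero))


def top_k_overlap(predicted, observed, k):
    """Fraction of the k highest-predicted keys that are also among the k highest-observed."""
    def top(scores):
        return set(sorted(scores, key=scores.get, reverse=True)[:k])
    return len(top(predicted) & top(observed)) / k


# Sanity check: the truncated probabilities over y = 1, 2, ... should sum to about 1.
total = sum(math.exp(ztnb_logpmf(y, mu=1.5, theta=0.8)) for y in range(1, 2000))

# Toy predicted and observed recurrences keyed by hypothetical mutation IDs.
predicted = {"a": 3.0, "b": 2.0, "c": 0.5, "d": 0.1}
observed = {"a": 10, "b": 1, "c": 5, "d": 1}
overlap = top_k_overlap(predicted, observed, k=2)
```

For actually fitting such a model, the summed `ztnb_logpmf` over all mutations would serve as the log-likelihood to maximize, with `mu` linked to the features (for example through a log link); that fitting step is left out here.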
