This project explores regression models for predicting the recurrence of noncoding mutations in cancer. The goal is to find the most accurate model, rank the noncoding mutations by predicted recurrence from high to low, and select the top N mutations as candidate cancer driver mutations to be validated by biological experiments.
Author: Tanjin Xu, tjx711@gmail.com
Respondent: Dr. Stephen A. Ramsey, stephen.ramsey@oregonstate.edu
Org: Dr. Ramsey Lab, Oregon State University, http://lab.saramsey.org/
The overall procedure has three main steps:
- Noncoding mutation annotation, feature extraction, and recurrence calculation
- Regression analysis with three main categories of nonlinear models:
  - Generalized linear models, including Poisson and negative binomial regression (written in R)
  - Ensembles of decision trees, including Random Forest and boosting (written in Python)
  - Deep neural networks (written in Python)
- Result analysis and plotting
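The final ranking step of this procedure can be sketched as follows. This is a minimal illustration, not the project's code: the function name `top_n_candidates` and the mutation identifiers are hypothetical, and the predicted recurrences are placeholder values standing in for the output of any fitted regression model.

```python
from typing import Dict, List, Tuple


def top_n_candidates(predicted: Dict[str, float], n: int) -> List[Tuple[str, float]]:
    """Return the n mutations with the highest predicted recurrence.

    `predicted` maps a mutation identifier to its predicted recurrence.
    Ties keep their original insertion order because sorted() is stable.
    """
    ranked = sorted(predicted.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Placeholder predictions (hypothetical identifiers and values).
predicted = {"mut_a": 1.2, "mut_b": 5.0, "mut_c": 0.4, "mut_d": 3.5}

for mutation_id, recurrence in top_n_candidates(predicted, 2):
    print(mutation_id, recurrence)
```

The selected mutations would then be the candidates passed on to experimental validation.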
Project directory structure:
- backup: periodic backups of the project directory.
- data: downloaded raw feature data, intermediate data, and the input data matrix for regression analysis.
- docs: reference documents
- nils: supplementary data from the paper by Nils J. Fredriksson.
- SAE: artificial neural network (ANN) and stacked autoencoder (SAE) code (runs on GPU).
- source: source code of the project including data processing, feature extraction, and regression analysis.
Description of the main source files:
- noncoding_extract_features.py : the main entry point for feature extraction.
- noncoding_mut_stats.py : calculates mutation recurrence within a 101-bp window.
- run_features.sh : a bash script that runs feature extraction on different raw feature inputs in BED format.
- run_stats.sh : a bash script that runs the noncoding mutation recurrence calculation.
- join_features.sh : a bash script that joins all extracted features into one large matrix.
- noncoding_mut_anno.py : annotates the noncoding mutations among all input mutations downloaded from the COSMIC database.
- noncoding_mut_regression.py : the main file for exploring conventional machine learning regression models, including Random Forest, AdaBoost, and Gradient Boosting.
- noncoding_glm_regr.v4.R : the main file for exploring generalized linear models, including Poisson and negative binomial regression.
- noncoding_sae_baseline.py : a simple artificial neural network regression with one hidden layer.
- noncoding_sae.py : stacked autoencoder regression.
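To illustrate what a windowed recurrence count could look like, here is a minimal sketch. It is not taken from noncoding_mut_stats.py, whose exact definition is not described here. It assumes that each mutation occurrence is a (chromosome, position) pair, that the 101-bp window is centered on the mutation (50 bp on each side), and that a mutation counts itself. The function name `window_recurrence` is hypothetical.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict
from typing import Dict, List, Tuple


def window_recurrence(mutations: List[Tuple[str, int]], window: int = 101) -> List[int]:
    """Count, for each mutation, the mutations on the same chromosome that
    fall inside a window of `window` bp centered on it (itself included)."""
    half = window // 2  # 50 bp on each side for a 101-bp window (assumption)

    # Group positions by chromosome and sort them for binary search.
    by_chrom: Dict[str, List[int]] = defaultdict(list)
    for chrom, pos in mutations:
        by_chrom[chrom].append(pos)
    for positions in by_chrom.values():
        positions.sort()

    counts = []
    for chrom, pos in mutations:
        positions = by_chrom[chrom]
        lo = bisect_left(positions, pos - half)   # first position >= pos - half
        hi = bisect_right(positions, pos + half)  # one past last position <= pos + half
        counts.append(hi - lo)
    return counts


# Hypothetical input: one entry per observed mutation occurrence.
example = [("chr1", 100), ("chr1", 150), ("chr1", 151), ("chr2", 100)]
print(window_recurrence(example))
```

Sorting once per chromosome and using binary search keeps the count at O(n log n) overall, which matters when the input contains millions of mutations.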
Remaining work:
- Add more features. Candidates include: 1) more specific transcription factors; 2) distance to oncogenes/tumor-suppressor genes; 3) DNA shape and histone information; 4) chromosome length.
- Remaining regression analysis: 1) fit a zero-truncated negative binomial regression to the current noncoding mutation data; 2) handle the rare extreme outliers with very high frequency; 3) check the predictions of the Poisson/NB models, i.e., compare the top 100 mutations by predicted recurrence against their true frequencies; 4) consider framing the task as a ranking regression problem; 5) carefully re-examine the feature data.
- Binary classification formulation. The challenge is to construct a balanced training set consisting of positive cases (true noncoding driver mutations) and negative cases (validated passenger mutations).
- Focus on specific cancer types. Considering all cancer types together may be more difficult because tumor samples are highly heterogeneous across cancer types.
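As background for the zero-truncated negative binomial item above, the sketch below shows the probability mass function such a model would be built on. It is an illustration under stated assumptions, not an implementation used in this project: the negative binomial is parameterized by its mean `mu` and dispersion (size) `r`, and the function names `nb_pmf` and `ztnb_pmf` are hypothetical. Zero truncation is presumably of interest because every observed mutation has a recurrence of at least one, so a count of zero can never appear in the data.

```python
import math


def nb_pmf(y: int, mu: float, r: float) -> float:
    """Negative binomial pmf with mean mu and dispersion (size) r."""
    log_p = (
        math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
        + r * math.log(r / (r + mu))
        + y * math.log(mu / (r + mu))
    )
    return math.exp(log_p)


def ztnb_pmf(y: int, mu: float, r: float) -> float:
    """Zero-truncated pmf: P(Y = y | Y > 0) = P(Y = y) / (1 - P(Y = 0))."""
    if y < 1:
        return 0.0  # zero counts are excluded by construction
    return nb_pmf(y, mu, r) / (1.0 - nb_pmf(0, mu, r))


# The truncated probabilities over y = 1, 2, ... should sum to (almost) 1.
total = sum(ztnb_pmf(y, mu=2.0, r=1.5) for y in range(1, 500))
print(round(total, 6))
```

In a regression setting, `mu` would be linked to the features (for example through a log link) and the parameters estimated by maximizing the sum of the log of `ztnb_pmf` over the observed recurrences.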
References