See the wiki for information on the master table, figures, and files.
The full commands for generating files can be found in the Makefile.
Command line tools
Generate train/test data files with non-binding intervals:
$ make train_test_split
Data File Format
Files generated when calling
Options (defined in Makefile):
- minimum number of samples (
- repeat threshold type (
- repeat threshold cutoff (
- number of non-binding intervals to generate on each side per binding interval (
- Searches within [1 kb, 10 kb] on both sides of the binding interval, then extends the range in 10 kb increments if needed.
- binding (0 or 1)
- P53match_count (per motif)
- P53match_score_max (per motif)
- P53match_score_sum (per motif)
- 2-mer count proportions (10)
- 3-mer count proportions (32)
- 6-mer count proportions (2080)
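The feature counts in parentheses (10, 32, 2080) match the number of distinct k-mers left once each k-mer is collapsed with its reverse complement. The text does not state this, so treat it as an inference. A minimal Python sketch under that assumption follows; the function names are illustrative and not taken from the repository:

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    revcomp = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, revcomp)


def canonical_kmers(k):
    """All distinct canonical k-mers of length k, sorted."""
    return sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})


def kmer_proportions(seq, k):
    """Proportion of each canonical k-mer among the unambiguous k-mer windows of seq."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        # Skip windows containing ambiguous bases such as N.
        if set(window) <= set("ACGT"):
            counts[canonical(window)] += 1
    total = sum(counts.values())
    return {km: (counts[km] / total if total else 0.0) for km in canonical_kmers(k)}


if __name__ == "__main__":
    for k in (2, 3, 6):
        print(k, len(canonical_kmers(k)))  # 10, 32 and 2080 canonical k-mers
```

For even k, palindromic k-mers (for example `AT` or `CG`) are their own reverse complement, which is why the 2-mer and 6-mer counts are slightly more than half of 4^k.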
1. Full unprocessed data file:
2. Training set:
3. Testing set:
The full file (1) is unprocessed. The TEST_SIZE variable in the Makefile determines the train/test split ratio for files (2) and (3); the default is 1:1. The split is based on binding intervals only: each non-binding interval follows the same split as the binding interval it was derived from. These files are used in the machine learning process described below.
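The grouped split described above can be sketched in plain Python. This is an illustration, not the repository's implementation: the `parent_id` key, the function name, and expressing the ratio as a fraction (0.5 for a 1:1 split) are all assumptions.

```python
import random


def split_by_binding_interval(rows, test_size=0.5, seed=0):
    """Split rows into (train, test) so each binding interval's group stays together.

    Each row is a dict with a hypothetical "parent_id" key naming the binding
    interval that the row is, or was derived from.
    """
    parent_ids = sorted({row["parent_id"] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(parent_ids)
    # The ratio is applied to binding intervals, not to individual rows.
    n_test = round(len(parent_ids) * test_size)
    test_ids = set(parent_ids[:n_test])
    train = [row for row in rows if row["parent_id"] not in test_ids]
    test = [row for row in rows if row["parent_id"] in test_ids]
    return train, test


# Four binding intervals, each with two derived non-binding intervals.
rows = []
for pid in ("b1", "b2", "b3", "b4"):
    rows.append({"parent_id": pid, "binding": 1})
    rows.append({"parent_id": pid, "binding": 0})
    rows.append({"parent_id": pid, "binding": 0})

train, test = split_by_binding_interval(rows, test_size=0.5)
```

Splitting on the parent interval rather than on rows keeps a binding interval and its flanking non-binding intervals out of opposite sets.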
Machine Learning Analysis
sklearn.svm.SVC – Support Vector Classifier, based on Support Vector Machine.
sklearn.feature_selection.RFE – Recursive Feature Elimination.
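A minimal sketch of how these two classes fit together, using synthetic data in place of the training and testing files. The kernel, the number of features to keep, and the data shapes are assumptions chosen for illustration, not the settings used in this project. RFE ranks features by the estimator's `coef_` or `feature_importances_`, so the SVC here uses a linear kernel.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Synthetic stand-in for the feature table (assumption: 20 features, 5 informative).
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)
X_train, X_test = X[:100], X[100:]
y_train, y_test = y[:100], y[100:]

# Recursively drop the lowest-weighted feature until 5 remain.
selector = RFE(estimator=SVC(kernel="linear"), n_features_to_select=5, step=1)
selector.fit(X_train, y_train)

# Column indices of the retained features.
selected = [i for i, keep in enumerate(selector.support_) if keep]
# Accuracy of the final SVC on the held-out half, using only the selected features.
accuracy = selector.score(X_test, y_test)
print(selected, accuracy)
```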
Generate master table with either max FE or max MACS score under sample columns:
$ make results/datafiles/ChIP_peaks_master_table_fe.txt
$ make results/datafiles/ChIP_peaks_master_table_macs.txt
- Add FE and peak length columns to sample .bed files, concatenate all samples into one file
- Merge using
- Use merged regions from
- Combine information into one table
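The idea behind these steps (merge overlapping peaks across samples, then record each sample's maximum score per merged region) can be illustrated in plain Python. This is a sketch under assumptions, not the pipeline's commands: the tuple layout, function name, and the choice to merge touching intervals are hypothetical, and the real workflow relies on the tools invoked in the Makefile.

```python
def merge_peaks(peaks):
    """Merge overlapping peaks and keep each sample's maximum score per region.

    peaks: iterable of (chrom, start, end, sample, score) tuples.
    Returns a list of (chrom, start, end, {sample: max_score}) tuples.
    """
    merged = []
    for chrom, start, end, sample, score in sorted(peaks):
        last = merged[-1] if merged else None
        # Overlapping or touching intervals on the same chromosome are merged.
        if last and last[0] == chrom and start <= last[2]:
            last[2] = max(last[2], end)
            last[3][sample] = max(score, last[3].get(sample, score))
        else:
            merged.append([chrom, start, end, {sample: score}])
    return [tuple(region) for region in merged]


peaks = [
    ("chr1", 100, 200, "A", 5.0),
    ("chr1", 150, 300, "B", 7.5),
    ("chr1", 180, 250, "A", 9.0),
    ("chr2", 10, 50, "B", 3.0),
]
table = merge_peaks(peaks)
# table == [("chr1", 100, 300, {"A": 9.0, "B": 7.5}), ("chr2", 10, 50, {"B": 3.0})]
```

Each merged region becomes one row of the table, with one score column per sample; samples with no peak in a region are simply absent from that row's dictionary.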