p53-chip-seq-data

See the wiki for information about the master table, figures, and files.

The full commands for generating the files are in the Makefile.

Requirements

R

Python

Command line tools

Binding Dataset

Quickstart

Generate train/test data files with non-binding intervals:

$ make train_test_split

Data File Format

This section describes the files generated by make train_test_split.

Options (defined in Makefile):

  • minimum number of samples (MINSAMPLES)
  • repeat threshold type (REP_THRESHOLD_TYPE)
  • repeat threshold cutoff (REP_CUTOFF)
  • number of non-binding intervals to generate on each side per binding interval (N_NONBINDING_INTERVALS)
    • The search first covers 1 kb to 10 kb on each side of the binding interval, then extends in 10 kb increments if needed.
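
The flanking search described in the last option could be sketched as follows. This is only one reading of "+10kb increments": each extra round is assumed to look at the next 10 kb band further out. The function name flank_windows and its parameters are hypothetical and are not taken from the Makefile or the repository scripts.

```python
from typing import Iterator, Tuple

Window = Tuple[int, int]  # half-open [start, end) in base pairs


def flank_windows(start: int, end: int, inner: int = 1_000,
                  step: int = 10_000, max_rounds: int = 3
                  ) -> Iterator[Tuple[Window, Window]]:
    """Yield (left, right) search windows around a binding interval.

    Round 1 covers 1 kb to 10 kb from each edge; later rounds are assumed
    to cover the next 10 kb band (10-20 kb, 20-30 kb, ...).
    """
    near = inner
    for round_no in range(1, max_rounds + 1):
        far = round_no * step
        # Clamp at 0 so the left window never runs off the chromosome start.
        left = (max(0, start - far), max(0, start - near))
        # The right window is not clamped: chromosome lengths are not known here.
        right = (end + near, end + far)
        yield left, right
        near = far  # the next round starts where this one stopped


windows = list(flank_windows(50_000, 50_200))
```

A caller would stop at the first round in which enough non-binding intervals (N_NONBINDING_INTERVALS per side) fit inside the windows.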

Columns:

  • binding (0 or 1)
  • length
  • repeat_proportion
  • GC_content
  • average_phastCon
  • P53match_count (per motif)
  • P53match_score_max (per motif)
  • P53match_score_sum (per motif)
  • 2-mer count proportions (10)
  • 3-mer count proportions (32)
  • 6-mer count proportions (2080)
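
The feature counts 10, 32, and 2080 are what one gets when a k-mer and its reverse complement are counted as a single feature. The source does not state that this is how the columns are built, so the sketch below is an assumption that happens to reproduce those numbers; the function names are illustrative only.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def canonical_kmers(k: int) -> list:
    """All distinct canonical k-mers over the alphabet A, C, G, T."""
    return sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})


def kmer_proportions(seq: str, k: int) -> dict:
    """Proportion of each canonical k-mer among all valid k-mers in seq."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # skip windows containing N or other symbols
            counts[canonical(kmer)] += 1
    total = sum(counts.values())
    return {km: (counts[km] / total if total else 0.0)
            for km in canonical_kmers(k)}


for k in (2, 3, 6):
    print(k, len(canonical_kmers(k)))  # 2 10, 3 32, 6 2080
```

For even k, palindromic k-mers equal their own reverse complement, which is why the count is (4^k + 4^(k/2)) / 2 rather than exactly half of 4^k.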

Generated Files

  1. Full unprocessed data file: etc/peaks_merged_features__minsamples_{int}__rep_{max/min}{0-1}__nonbinding_{int}.txt
  2. Training set: results/datafiles/peaks_merged_features__minsamples_{int}__rep_{max/min}{0-1}__nonbinding_{int}-preprocessed_train.txt
  3. Testing set: results/datafiles/peaks_merged_features__minsamples_{int}__rep_{max/min}{0-1}__nonbinding_{int}-preprocessed_test.txt

The full file (1) is unprocessed.

The TEST_SIZE variable in the Makefile sets the train/test split ratio for files (2) and (3); the default is 1:1. The split is made on binding intervals only: each non-binding interval goes to the same set as the binding interval it was derived from. These files are used in the machine learning analysis described below.
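
A minimal sketch of such a grouped split is shown below. It assumes each row carries an identifier of the binding interval it came from (the peak_id key here is hypothetical; the actual files may track this differently) and that TEST_SIZE behaves like a fraction, with 0.5 corresponding to the 1:1 default.

```python
import random


def split_binding_ids(binding_ids, test_size=0.5, seed=0):
    """Split unique binding-interval IDs into (train_ids, test_ids)."""
    ids = sorted(set(binding_ids))
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_test = round(len(ids) * test_size)
    return set(ids[n_test:]), set(ids[:n_test])


# Toy rows: one binding interval plus its derived non-binding intervals per peak.
rows = [
    {"peak_id": "peak1", "binding": 1},
    {"peak_id": "peak1", "binding": 0},
    {"peak_id": "peak2", "binding": 1},
    {"peak_id": "peak2", "binding": 0},
    {"peak_id": "peak3", "binding": 1},
    {"peak_id": "peak3", "binding": 0},
    {"peak_id": "peak4", "binding": 1},
    {"peak_id": "peak4", "binding": 0},
]

train_ids, test_ids = split_binding_ids(r["peak_id"] for r in rows)

# Non-binding rows inherit the assignment of their source binding interval.
train_rows = [r for r in rows if r["peak_id"] in train_ids]
test_rows = [r for r in rows if r["peak_id"] in test_ids]
```

Splitting on the interval ID rather than on individual rows keeps a binding interval and its flanking non-binding intervals from landing on opposite sides of the split.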

Machine Learning Analysis

Model used: sklearn.svm.SVC – Support Vector Classifier, based on Support Vector Machine.

Ranking method: sklearn.feature_selection.RFE – Recursive Feature Elimination.
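
As a rough illustration of how these two scikit-learn classes fit together, the sketch below ranks features of a synthetic dataset. The linear kernel, the number of features to keep, and the synthetic data are all assumptions for the example and are not taken from this repository. RFE needs an estimator that exposes feature weights, which SVC does only with kernel="linear".

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed training file (assumption).
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

# A linear kernel exposes coef_, which RFE uses to decide what to drop.
selector = RFE(estimator=SVC(kernel="linear"),
               n_features_to_select=3, step=1)
selector.fit(X, y)

ranking = selector.ranking_   # 1 = kept; larger values were eliminated earlier
selected = selector.support_  # boolean mask of the kept features
```

With step=1, one feature is removed per round, so the ranking gives a full ordering of the eliminated features.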

Master Table

Quickstart

Generate the master table with either the maximum FE or the maximum MACS score in the sample columns:

$ make results/datafiles/ChIP_peaks_master_table_fe.txt
$ make results/datafiles/ChIP_peaks_master_table_macs.txt

Procedure

  1. Add FE and peak length columns to the sample .bed files, then concatenate all samples into one file
  2. Merge using bedtools
    1. MACSscore_summary_valid_merged.bed
  3. Use merged regions from MACSscore_summary_valid_merged.bed
  4. Combine information into one table
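
The merge step can be pictured with the pure-Python sketch below. It mimics the default behaviour of bedtools merge (overlapping and book-ended intervals on the same chromosome are combined) and keeps the maximum score per merged region. It is a simplification for illustration: the repository calls bedtools itself, and the real master table keeps scores per sample rather than a single maximum.

```python
from typing import List, Tuple

Peak = Tuple[str, int, int, float]  # chrom, start, end, score


def merge_peaks(peaks: List[Peak]) -> List[Peak]:
    """Merge overlapping or book-ended peaks, keeping the maximum score."""
    merged: List[Peak] = []
    for chrom, start, end, score in sorted(peaks):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Extend the current region and keep the best score seen so far.
            _, m_start, m_end, m_score = merged[-1]
            merged[-1] = (chrom, m_start, max(m_end, end), max(m_score, score))
        else:
            merged.append((chrom, start, end, score))
    return merged


peaks = [
    ("chr1", 100, 200, 5.0),
    ("chr1", 150, 300, 8.0),
    ("chr1", 300, 350, 2.0),
    ("chr2", 100, 200, 4.0),
    ("chr1", 500, 600, 1.0),
]
merged = merge_peaks(peaks)
```

Here the first three chr1 peaks collapse into one region from 100 to 350 with score 8.0, while the remaining two peaks stay separate.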

About

Basic machine learning on genomic data
