update data-raw readme files
slowkow committed Mar 5, 2015
1 parent 048c126 commit 9f5bdbf
Showing 4 changed files with 209 additions and 8 deletions.
33 changes: 33 additions & 0 deletions data-raw/ITFP/README.md
@@ -1,3 +1,36 @@
# ITFP: an integrated platform of mammalian transcription factors

Guangyong Zheng,
Kang Tu,
Qing Yang,
Yun Xiong,
Chaochun Wei,
Lu Xie,
Yangyong Zhu,
and Yixue

<http://dx.doi.org/10.1093/bioinformatics/btn439>

## Summary

Investigation of transcription factors (TFs) and their downstream regulated
genes (targets) is a significant issue in the post-genome era, which can
provide a brand new vision for some vital biological processes. However,
information on TFs and their targets in mammals is far from sufficient. Here,
we developed an integrated TF platform (ITFP), which includes abundant
mammalian TFs and their targets. In the current release, ITFP includes 4105
putative TFs and 69,496 potential TF-target pairs for human, 3134 putative TFs
and 37,040 potential TF-target pairs for mouse, and 1114 putative TFs and
18,055 potential TF-target pairs for rat. In short, ITFP will serve as an
important resource for the transcription research community and provide strong
support for regulatory network studies.

- - -

# ITFP Website

<http://itfp.biosino.org/itfp>

# Introduction

Investigation of transcription factors (TFs) and their downstream regulated
14 changes: 8 additions & 6 deletions data-raw/Neph2012/README.md
@@ -1,11 +1,13 @@
# Circuitry and Dynamics of Human Transcription Factor Regulatory Networks

Shane Neph
Andrew B. Stergachis
Alex Reynolds
Richard Sandstrom
Elhanan Borenstein
John A. Stamatoyannopoulos
Shane Neph,
Andrew B. Stergachis,
Alex Reynolds,
Richard Sandstrom,
Elhanan Borenstein,
John A. Stamatoyannopoulos,

<http://dx.doi.org/10.1016/j.cell.2012.04.040>

## Highlights

31 changes: 29 additions & 2 deletions data-raw/TRED/README.md
@@ -1,6 +1,33 @@
# Transcriptional Regulatory Element Database
# TRED: a Transcriptional Regulatory Element Database and a platform for in silico gene regulation studies

<https://cb.utdallas.edu/cgi-bin/TRED/>
Fang Zhao,
Zhenyu Xuan,
Lihua Liu,
Michael Q. Zhang

<http://dx.doi.org/10.1093/nar/gki004>

In order to understand gene regulation, accurate and comprehensive knowledge
of transcriptional regulatory elements is essential. Here, we report our
efforts in building a mammalian Transcriptional Regulatory Element Database
(TRED) with associated data analysis functions. It collects cis- and
trans-regulatory elements and is dedicated to easy data access and analysis
for both single-gene-based and genome-scale studies. Distinguishing features
of TRED include: (i) relatively complete genome-wide promoter annotation for
human, mouse and rat; (ii) availability of gene transcriptional regulation
information including transcription factor binding sites and experimental
evidence; (iii) data accuracy ensured by hand curation; (iv) an efficient user
interface for easy and flexible data retrieval; and (v) implementation of
on-the-fly sequence analysis tools. TRED can provide good training datasets
for further genome-wide cis-regulatory element prediction and annotation,
assist detailed functional studies and facilitate the deciphering of gene
regulatory networks (http://rulai.cshl.edu/TRED).

- - -

# TRED Website

<https://cb.utdallas.edu/cgi-bin/TRED/tred.cgi?process=home>

# Introduction

139 changes: 139 additions & 0 deletions data-raw/UCSC/README.md
@@ -0,0 +1,139 @@
# Transcription Factor ChIP-seq (161 factors) from ENCODE with Factorbook Motifs

<http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=wgEncodeRegTfbsClusteredV3>

# Description

This track shows regions of transcription factor binding derived from a large
collection of ChIP-seq experiments performed by the ENCODE project, together
with DNA binding motifs identified within these regions by the ENCODE
Factorbook repository.

Transcription factors (TFs) are proteins that bind to DNA and interact with RNA
polymerases to regulate gene expression. Some TFs contain a DNA binding domain
and can bind directly to specific short DNA sequences ('motifs'); others bind
to DNA indirectly through interactions with TFs containing a DNA binding
domain. High-throughput antibody capture and sequencing methods (e.g. chromatin
immunoprecipitation followed by sequencing, or 'ChIP-seq') can be used to
identify regions of TF binding genome-wide. These regions are commonly called
ChIP-seq peaks.

ENCODE TFBS ChIP-seq data were processed using the computational pipeline
developed by the ENCODE Analysis Working Group to generate uniform peaks of TF
binding. Peaks for 161 transcription factors in 91 cell types are combined here
into clusters to produce a summary display showing occupancy regions for each
factor and motif sites within the regions when identified. Additional views of
the underlying ChIP-seq data and documentation on the methods used to generate
it are available from the ENCODE Uniform TFBS track.

# Display Conventions

A gray box encloses each peak cluster of transcription factor occupancy, with
the darkness of the box being proportional to the maximum signal strength
observed in any cell line contributing to the cluster. The HGNC gene name for
the transcription factor is shown to the left of each cluster. Within a
cluster, a green highlight indicates the highest scoring site of a
Factorbook-identified canonical motif for the corresponding factor. (NOTE:
motif highlights are shown only in browser windows of size 50,000 bp or less,
and their display can be suppressed by unchecking the highlight motifs box on
the track configuration page). Arrows on the highlight designate the matching
strand of the motif.

The cell lines where signal was detected for the factor are identified by
single-letter abbreviations shown to the right of the cluster. The darkness of
each letter is proportional to the signal strength observed in the cell line.
Abbreviations starting with capital letters designate ENCODE cell types
identified for intensive study - Tier 1 and Tier 2 - while those starting with
lowercase letters designate Tier 3 cell lines.
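
The capitalization rule above can be expressed as a small helper. This is only
an illustrative sketch: the function name `tier_group` and the returned labels
are made up here and are not part of the UCSC tools.

```python
def tier_group(abbreviation: str) -> str:
    """Return the ENCODE tier group implied by a cell line abbreviation.

    Hypothetical helper: it only encodes the display convention that
    capital letters mark Tier 1 and Tier 2 cell types and lowercase
    letters mark Tier 3 cell lines.
    """
    if not abbreviation or not abbreviation[0].isalpha():
        raise ValueError(f"not a letter abbreviation: {abbreviation!r}")
    # Capital first letter: Tier 1 or Tier 2. Lowercase: Tier 3.
    return "Tier 1/2" if abbreviation[0].isupper() else "Tier 3"
```

The mapping from each letter to an actual cell line is given by the
abbreviation table on the cluster details page, not by this helper.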

Click on a peak cluster to see more information about the TF/cell assays
contributing to the cluster, the cell line abbreviation table, and details
about the highest scoring canonical motif in the cluster.

# Methods

Peaks of transcription factor occupancy from uniform processing of ENCODE
ChIP-seq data by the ENCODE Analysis Working Group were filtered to exclude
datasets that did not pass the integrated quality metric (see "Quality Control"
section of Uniform TFBS) and then were clustered using the UCSC hgBedsToBedExps
tool. Scores were assigned to peaks by multiplying the input signal values by a
normalization factor calculated as the ratio of the maximum score value (1000)
to the signal value at one standard deviation from the mean, with values
exceeding 1000 capped at 1000. This has the effect of distributing scores up to
the mean plus one standard deviation across the score range, while assigning
all higher values the maximum score. The cluster score is the highest score for
any peak
contributing to the cluster.
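
As a rough sketch, the scoring described above might look like the following
in Python. Several details are assumptions rather than documented behavior:
the reference signal is taken to be the mean plus one standard deviation, the
population standard deviation is used, and scores are left as floats (BED
scores are integers, so the real pipeline presumably rounds). The function
names are hypothetical.

```python
from statistics import mean, pstdev

MAX_SCORE = 1000


def normalize_scores(signals):
    """Scale raw signal values into the 0-1000 score range."""
    # Assumed reference point: signal value one standard deviation above the mean.
    reference = mean(signals) + pstdev(signals)
    if reference <= 0:
        raise ValueError("signals must have a positive mean plus standard deviation")
    # Normalization factor: ratio of the maximum score to the reference signal.
    factor = MAX_SCORE / reference
    # Values that would exceed the maximum score are capped at it.
    return [min(MAX_SCORE, s * factor) for s in signals]


def cluster_score(peak_scores):
    """The cluster score is the highest score of any contributing peak."""
    return max(peak_scores)
```

With this choice of factor, a signal exactly at the reference point maps to
1000, and everything stronger is capped at 1000.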

The Factorbook motif discovery and annotation pipeline uses the MEME-ChIP and
FIMO tools from the MEME software suite in conjunction with machine learning
methods and manual curation to merge discovered motifs with known motifs
reported in Jaspar and TransFac. Motif identifications reported in Wang et al.
2012 (below) were supplemented in this track with more recent data (derived
from newer ENCODE datasets - Jan 2011 through Mar 2012 freezes), provided by
the Factorbook team. Motif identifications from all datasets were merged; when
a motif was duplicated in multiple cell lines, the most significant reported
value (qvalue) was picked. The scores for the selected best-scoring
motif sites were then transformed to -log10.
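
The merge and transform step can be sketched as below. This is not the
Factorbook code: the function name, the `(site_key, qvalue)` input shape, and
the idea of identifying a motif site by a hashable key are assumptions made
for illustration.

```python
import math


def best_motif_scores(sites):
    """Keep the most significant q-value per motif site, then take -log10.

    `sites` is an iterable of (site_key, qvalue) pairs, where `site_key`
    is any hashable identifier for a motif site (hypothetical layout).
    """
    best = {}
    for key, qvalue in sites:
        # A smaller q-value is more significant, so keep the minimum.
        if key not in best or qvalue < best[key]:
            best[key] = qvalue
    # Transform to -log10; a q-value of exactly 0 would raise ValueError here.
    return {key: -math.log10(q) for key, q in best.items()}
```

After the transform, larger scores mean more significant motif matches.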

# Release Notes

Release 4 (February 2014) of this track adds display of the Factorbook motifs.
Release 3 (August 2013) added 124 datasets (690 total, vs. 486 in Release 2),
representing all ENCODE TF ChIP-seq passing quality assessment through the
ENCODE March 2012 data freeze. The peaks used to generate these clusters were
called with less stringent thresholds than used during the January 2011 uniform
processing shown in Release 2 of this track. The contributing datasets are
displayed as individual tracks in the ENCODE Uniform TFBS track, which is
available along with the primary data tracks in the ENC TF Binding Supertrack
page. The clustering for V3/V4 is based on the transcription factor target, and
so differs from V2 where clustering was based on antibody.

For the V3/V4 releases, a new track table format, 'factorSource', was used to
represent the primary clusters table and downloads file,
wgEncodeRegTfbsClusteredV3. This format consists of standard BED5 fields (see
File Formats) followed by an experiment count field (expCount) and finally two
fields containing comma-separated lists. The first list field (expNums)
contains numeric identifiers for experiments, keyed to the
wgEncodeRegTfbsClusteredInputsV3 table, which includes such information as the
experiment's underlying Uniform TFBS table name, factor targeted, antibody
used, cell type, treatment (if any), and laboratory source. The second list
field (expScores) contains the scores for the corresponding experiments. For
convenience, the file downloads directory for this track also contains a BED
file, wgEncodeRegTfbsClusteredWithCellsV3, that lists each cluster with the
cluster score followed by a comma-separated list of cell types.
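
A minimal parser for one line in this format could look like the sketch
below. It assumes tab-separated fields in the order described above (BED5,
then expCount, expNums, expScores) and tolerates a trailing comma in the list
fields. The class and function names are hypothetical, and the consistency
check between expCount and the list lengths is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FactorSourceRecord:
    chrom: str
    start: int
    end: int
    name: str               # factor name for the cluster
    score: int              # cluster score
    exp_count: int          # expCount
    exp_nums: List[int]     # expNums: keys into the inputs table
    exp_scores: List[float]  # expScores: one score per experiment


def _split_list(field, cast):
    # Skip empty items so that a trailing comma is harmless.
    return [cast(item) for item in field.split(",") if item]


def parse_factor_source_line(line):
    """Parse one tab-separated factorSource line into a record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 fields, got {len(fields)}")
    chrom, start, end, name, score, exp_count, exp_nums, exp_scores = fields[:8]
    record = FactorSourceRecord(
        chrom, int(start), int(end), name, int(score), int(exp_count),
        _split_list(exp_nums, int), _split_list(exp_scores, float),
    )
    # Assumed invariant: both lists hold exactly expCount entries.
    if not (record.exp_count == len(record.exp_nums) == len(record.exp_scores)):
        raise ValueError("expCount does not match the list lengths")
    return record
```

Pairing `exp_nums` with `exp_scores` by position gives the score each
experiment contributed to the cluster.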

# Credits

This track shows ChIP-seq data from the Myers Lab at the HudsonAlpha Institute
for Biotechnology and from the labs of Michael Snyder, Mark Gerstein, and
Sherman Weissman at Yale University, Peggy Farnham at the University of
Southern California, Kevin Struhl at Harvard, Kevin White at the University of
Chicago, and Vishy Iyer at the University of Texas, Austin. These data were
processed into uniform peak calls by the ENCODE Analysis Working Group pipeline
developed by Anshul Kundaje. The clustering of the uniform peaks was performed
by UCSC.
The Factorbook motif identifications and localizations (and valuable assistance
with interpretation) were provided by Jie Wang, Bong Hyun Kim and Jiali Zhuang
of the Zlab (Weng Lab) at UMass Medical School.

# References

Gerstein MB, Kundaje A, Hariharan M, Landt SG, Yan KK, Cheng C, Mu XJ, Khurana
E, Rozowsky J, Alexander R et al. Architecture of the human regulatory network
derived from ENCODE data. Nature. 2012 Sep 6;489(7414):91-100. PMID: 22955619

Wang J, Zhuang J, Iyer S, Lin X, Whitfield TW, Greven MC, Pierce BG, Dong X,
Kundaje A, Cheng Y et al. Sequence features and chromatin structure around the
genomic regions bound by 119 human transcription factors. Genome Res. 2012
Sep;22(9):1798-812. PMID: 22955990; PMC: PMC3431495

Wang J, Zhuang J, Iyer S, Lin XY, Greven MC, Kim BH, Moore J, Pierce BG, Dong
X, Virgil D et al. Factorbook.org: a Wiki-based database for transcription
factor-binding data generated by the ENCODE consortium. Nucleic Acids Res. 2013
Jan;41(Database issue):D171-6. PMID: 23203885; PMC: PMC3531197

# Data Release Policy

While primary ENCODE data was subject to a restriction period as described in
the ENCODE data release policy, this restriction does not apply to the
integrative analysis results, and all primary data underlying this track have
passed the restriction date. The data in this track are freely available.
