SeEN-seq Analysis Pipeline

This repository contains R scripts for analyzing Selected Engagement on Nucleosome sequencing (SeEN-seq) datasets.

Quick Start:

The recommended environment definition for SeEN-seq analysis can be found in env/SeEN_seq_environment.yaml.
Example input files can be found in the example folder.
A quick environment and test-run setup can be found at STAR_Protocol/0_setup.sh.
The SeEN-seq analysis script used as the template for the code blocks in the STAR Protocol paper can be found at STAR_Protocol/STAR_protocol_SeEN_seq_analysis.R.

Step I: Input preparation

0) Fastq files

In SeEN-seq analysis, each fastq file is assumed to derive from one band (i.e., a bound or unbound fraction) of the Electrophoretic Mobility Shift Assay (EMSA) gel on which the SeEN-seq experiment was run. In the scheme used in this repository, samples are named S<ID>, e.g., S1, S2, ... However, you can use any other names that suit your needs when analyzing your own data.

For demonstration purposes, the example data have been deposited in the Sequence Read Archive (SRA) under accession PRJNA1305216. A full list of SRA Run IDs can be found at example/PRJNA1305216_SraAccList.txt.

1) Sample Information

An example input table compatible with our workflow can be found at example/sample_metadata_PE.tsv in this repository. The sample metadata record information about each sample (gel band) that could be important for downstream analysis, for example condition, replicate, protein concentration, etc.

Because read alignment and counting are performed with the R package QuasR, the table must also contain the column(s) indicating where the fastq file(s) of each sample can be found.

1.1 Required and recommended sample table columns

Mandatory columns

sample: The sample ID corresponding to each set of fastq input files. Values must be unique across rows, because each row describes one band of your EMSA gel.
- Note that this column is renamed to SampleName in the table used for read alignment by QuasR.
lane: The ID of each EMSA lane. This is later used for the column names of the output SummarizedExperiment object.
fraction: The fraction type of the EMSA gel in which the band is located. Typical fractions for SeEN-seq are simply "Bound" and "Unbound". However, a SeEN-seq experiment can have more than two fraction types. For example, some transcription factors may show multiple bound fractions in the gel.
- Raw read counts are split by fraction type in the final SummarizedExperiment output.
condition: General annotation of the condition used in the experiment. This is mainly used for final plotting.

Columns for QuasR package

Depending on the layout of your datasets, the following columns are required:

Single-end dataset:
- FileName: Path to the fastq file of each sample.
Paired-end dataset:
- FileName1, FileName2: Paths to the read 1 and read 2 fastq files of each sample.
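
As a rough illustration of how the sample table relates to the table used for alignment, the sketch below derives a QuasR-style paired-end table (FileName1, FileName2, SampleName) from a sample table by renaming the sample column. The pipeline itself is written in R; this Python snippet only illustrates the table layout. The function name `to_quasr_table` and the example rows and file paths are hypothetical.

```python
import csv
import io

def to_quasr_table(sample_tsv: str) -> str:
    """Return a tab-delimited table with FileName1, FileName2, SampleName columns."""
    reader = csv.DictReader(io.StringIO(sample_tsv), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["FileName1", "FileName2", "SampleName"])
    for row in reader:
        # The 'sample' column becomes 'SampleName'; other metadata are dropped.
        writer.writerow([row["FileName1"], row["FileName2"], row["sample"]])
    return out.getvalue()

# Hypothetical two-band example (file paths are placeholders).
example = (
    "sample\tlane\tfraction\tcondition\tFileName1\tFileName2\n"
    "S1\tlane1\tBound\tTF\tS1_R1.fastq.gz\tS1_R2.fastq.gz\n"
    "S2\tlane1\tUnbound\tTF\tS2_R1.fastq.gz\tS2_R2.fastq.gz\n"
)
quasr_table = to_quasr_table(example)
print(quasr_table)
```

For a single-end dataset, the same idea applies with a single FileName column in place of FileName1 and FileName2.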

NOTE: With the exception of the sample column, columns in the sample table can contain duplicated values. However, the SummarizedExperiment of SeEN-seq data is assembled based on the lane information. The lane metadata are built from the unique rows that remain after the sample and fraction columns are removed. Please make sure that the resulting lane metadata do not contain duplicated values in the lane column, i.e., that all bands of one lane share the same annotation.
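
The check described in this note can be sketched as follows. This is a minimal Python illustration, not the pipeline's R code: the function name `build_lane_metadata` is hypothetical, and dropping the file-path columns along with sample and fraction is an assumption made here because those paths differ for every band.

```python
def build_lane_metadata(rows, drop=("sample", "fraction",
                                    "FileName", "FileName1", "FileName2")):
    """Collapse per-band rows into unique per-lane rows and validate them."""
    samples = [r["sample"] for r in rows]
    if len(samples) != len(set(samples)):
        raise ValueError("duplicated values in the 'sample' column")

    unique = []
    for r in rows:
        # Assumption: file-path columns are dropped too, since they are per band.
        reduced = {k: v for k, v in r.items() if k not in drop}
        if reduced not in unique:
            unique.append(reduced)

    lanes = [r["lane"] for r in unique]
    conflicting = sorted({lane for lane in lanes if lanes.count(lane) > 1})
    if conflicting:
        raise ValueError(f"lane(s) with conflicting metadata: {conflicting}")
    return unique

# Hypothetical example: two lanes, each with a bound and an unbound band.
rows = [
    {"sample": "S1", "lane": "lane1", "fraction": "Bound", "condition": "TF"},
    {"sample": "S2", "lane": "lane1", "fraction": "Unbound", "condition": "TF"},
    {"sample": "S3", "lane": "lane2", "fraction": "Bound", "condition": "control"},
    {"sample": "S4", "lane": "lane2", "fraction": "Unbound", "condition": "control"},
]
lane_meta = build_lane_metadata(rows)
print(lane_meta)
```

If two bands of the same lane carried different condition values, the lane would appear twice in the collapsed table and the function would raise an error.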

2) Reference Sequence Information

The reference data should correspond to the construct sequences used in your SeEN-seq experiment. As shown in the figure below, two types of input are needed:

FASTA file of the reference sequences.
Table of sequence metadata.

2.1 Fasta file

A standard fasta file containing the sequence constructs used in your SeEN-seq experiment. See the example at example/ref_sequence.fasta.

2.2 Reference metadata table

A table recording information about the sequence constructs. See the example in example/ref_metadata.tsv. The following columns are required by the scripts used in this pipeline:

ref_name: Name of the reference sequence. This must correspond to the sequence names used in the fasta input file.
position: The motif position in the construct. This is used when plotting the enrichment score.
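
Because ref_name must match the fasta sequence names, it can be worth comparing the two before running the analysis. The Python sketch below shows one way to do so; it is not part of the R pipeline. The function names and the example sequences are hypothetical, and taking the first whitespace-delimited token of each header line as the sequence name is an assumption that may need adjusting to your fasta headers.

```python
def fasta_names(fasta_text):
    """Return sequence names from fasta header lines.

    Assumption: the name is the first whitespace-delimited token after '>'.
    """
    names = []
    for line in fasta_text.splitlines():
        if line.startswith(">") and line[1:].strip():
            names.append(line[1:].split()[0])
    return names

def check_reference(fasta_text, ref_names):
    """Report names present in only one of the two inputs."""
    in_fasta = set(fasta_names(fasta_text))
    in_table = set(ref_names)
    return {
        "missing_in_table": sorted(in_fasta - in_table),
        "missing_in_fasta": sorted(in_table - in_fasta),
    }

# Hypothetical inputs with one mismatch in each direction.
fasta_text = ">construct_pos10\nACGTACGT\n>construct_pos20\nACGTACGT\n"
ref_names = ["construct_pos10", "construct_pos30"]
report = check_reference(fasta_text, ref_names)
print(report)
```

Both lists in the report should be empty before proceeding to alignment.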

Step II: Run The Analysis

A step-by-step analysis of SeEN-seq datasets, as specified in the STAR Protocol paper, can be found at STAR_Protocol/STAR_protocol_SeEN_seq_analysis.R. This script does not depend on any other scripts. Wrapper functions might be added to this repository in the future.

The outputs

`SummarizedExperiment` Object

The lane metadata (derived from the sample table), the reference table, and the count tables (separated by fraction type) are gathered in a SummarizedExperiment container object (see the SummarizedExperiment package documentation for more information). This helps ensure that the rows and columns of the count and enrichment tables (assays) are properly tied to the reference metadata (rowData) and lane metadata (colData), respectively.

Basic QC

One basic quality control for SeEN-seq analysis is to check whether the constructs with the motif inserted at certain positions are over- or under-represented. This can have multiple causes. For example, (i) some constructs are present at a higher concentration than others, intentionally or not, or (ii) inserting the motif at a certain position creates a recognition site for a restriction enzyme used in the experiment. The basic QC plot should help spot these issues.
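
One simple way to quantify such representation issues is to compare each construct's share of reads with the share expected if all constructs were equally abundant. The Python sketch below illustrates this idea only; it is not the QC implemented in the R script. The function name, the example counts, the uniform (equimolar) expectation, and the two-fold flagging threshold are all assumptions chosen for illustration.

```python
def representation(counts):
    """Map each construct to observed read share divided by the uniform share 1/n."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads counted")
    n = len(counts)
    return {name: (c / total) * n for name, c in counts.items()}

# Hypothetical read counts per construct (motif position) in one sample.
counts = {"pos10": 100, "pos20": 100, "pos30": 10, "pos40": 190}
ratios = representation(counts)

# Assumed threshold: flag constructs more than two-fold away from expectation.
flagged = sorted(name for name, r in ratios.items() if r < 0.5 or r > 2.0)
print(ratios)
print(flagged)
```

A ratio near 1 means the construct is represented as expected under the equimolar assumption; flagged constructs deserve a closer look at the library design and the restriction sites involved.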

Enrichment Plots

The nucleosome binding preference of a transcription factor can be shown as an enrichment score, that is, the log2 fold change of the signal between two fractions (e.g., log2(bound/unbound)).
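
The enrichment score can be sketched in a few lines. This Python example is a minimal illustration of log2(bound/unbound) per construct, not the pipeline's R code. The function name, the example counts, and the pseudocount (added here so that zero counts do not cause a division by zero) are assumptions; no library-size normalization is applied in this sketch.

```python
import math

def enrichment(bound, unbound, pseudocount=1.0):
    """Return log2((bound + pc) / (unbound + pc)) for each construct.

    The pseudocount is an assumption of this sketch, not a documented
    parameter of the pipeline.
    """
    return {
        name: math.log2((bound[name] + pseudocount) / (unbound[name] + pseudocount))
        for name in bound
    }

# Hypothetical raw counts per construct in the two fractions of one lane.
bound = {"pos10": 7, "pos20": 1}
unbound = {"pos10": 1, "pos20": 7}
scores = enrichment(bound, unbound)
print(scores)
```

Positive scores indicate constructs enriched in the bound fraction, and negative scores indicate constructs enriched in the unbound fraction.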

Key Version History

Release v0.1.1: The version used for STAR Protocol.
Release v0.1.0: The initial version used for STAR Protocol submission.

Reference

The SeEN-seq assay was originally described in:

Michael, Alicia K., Ralph S. Grand, Luke Isbel, Simone Cavadini, Zuzanna Kozicka, Georg Kempf, Richard D. Bunker et al. "Mechanisms of OCT4-SOX2 motif readout on nucleosomes." Science 368, no. 6498 (2020): 1460-1465.
