ModCRElib: ModCRE to be deployed as a downloadable Package

Installation instructions:

Download this repository.
Download the pbm and pdb database archives:
a) wget http://aleph.upf.edu/modcrefiles/pbm.tgz (60 GB)
b) wget http://aleph.upf.edu/modcrefiles/pdb.tgz (57 GB)
Place these archives into the downloaded repository.
Decompress them:
a) tar -xvzf pbm.tgz (110 GB)
b) tar -xvzf pdb.tgz (88 GB)
Replace the empty pwm folder in pbm with the provided pwms folder in PWMdatabase (PWMdatabase/pwms).
Ensure all dependencies are installed.
Change the paths in ModCRElib/configure/config.ini to point to the correct locations for your installation.
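After editing config.ini it can help to check that the paths you entered actually exist. The following is a minimal sketch, not part of ModCRElib. It assumes that config.ini is a standard INI file readable by Python's configparser and that any value starting with "/" is meant to be a path; both are assumptions, and the function name missing_paths is hypothetical.

```python
import configparser
import os


def missing_paths(config_text):
    """Return (section, key, value) for every path-like value that does not exist."""
    # interpolation=None keeps '%' characters in values from being interpreted.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    missing = []
    for section in parser.sections():
        for key, value in parser.items(section):
            # Heuristic (an assumption): values starting with '/' are paths.
            if value.startswith("/") and not os.path.exists(value):
                missing.append((section, key, value))
    return missing
```

To use it, read ModCRElib/configure/config.ini into a string and pass it to missing_paths; an empty list means every path-like value was found on disk.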

Dependencies:

Python

Bio → from Biopython (pip install biopython)
SBILib → comes with the package but can be installed separately (pip install SBILib; see https://github.com/structuralbioinformatics/SBILib for dependencies)
bs4 → BeautifulSoup (pip install beautifulsoup4)
bottle → lightweight web framework (pip install bottle)
ihm → Integrative Modeling library (pip install ihm)
matplotlib → plotting library (pip install matplotlib)
numpy → numerical computing (pip install numpy)
pandas → data analysis (pip install pandas)
plotly → interactive plotting (pip install plotly)
scipy → scientific computing (pip install scipy)
seaborn → statistical plotting (pip install seaborn)
sklearn → scikit-learn (pip install scikit-learn)
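A quick way to see which of the Python dependencies above are still missing is to ask Python whether each module can be found. This is only a sketch: the module names are the import names from the list above, and the helper missing_modules is hypothetical, not part of ModCRElib.

```python
import importlib.util

# Import names taken from the dependency list above.
REQUIRED_MODULES = [
    "Bio", "SBILib", "bs4", "bottle", "ihm", "matplotlib",
    "numpy", "pandas", "plotly", "scipy", "seaborn", "sklearn",
]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Calling missing_modules(REQUIRED_MODULES) returns the names that still need to be installed. It does not check versions.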

Environmental

BLAST+ (run on 2.12.0)
CD-HIT (run on 4.8.1)
Clustal-Omega (run on 1.2.4)
ClustalW2 (run on 2.1)
EMBOSS (run on 6.6.0)
Ghostscript (run on 9.53.3)
HMMER (run on 3.3.2)
MEME (run on 5.1.1)
Modeller (run on 10.3)
Python (run on 3.8.6)
TMalign (run on Version 30012025)
dssp (run on 3.0.0)
x3dna (run on 2.5.0)
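The external tools can be checked in a similar way by looking them up on the PATH. The executable names below are assumptions (typical names for some of these packages) and may differ on your system; ModCRElib itself locates the tools through config.ini, so this sketch is only a convenience check.

```python
import shutil

# Hypothetical executable names; adjust them to match your installation.
EXPECTED_TOOLS = [
    "blastp", "cd-hit", "clustalo", "clustalw2", "hmmsearch",
    "meme", "gs", "TMalign", "mkdssp", "fiber",
]


def missing_tools(names):
    """Return the executables that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]
```

missing_tools(EXPECTED_TOOLS) lists the tools that were not found. A tool that is installed outside the PATH will be reported as missing even if config.ini points to it correctly.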

Using ModCRElib

Example 1

Modelling a Transcription Factor (TF) from an amino acid sequence

------

We need a single- or multi-FASTA file containing the amino acid sequence of the protein to be modelled (AHR_Example.fa is given as an example)
modelling.sh contains the command to model the protein; the following edits are necessary:
pdb="/home/pgohl/ModCRE_Package/pdb" needs to be replaced with the location of the downloaded pdb folder on your machine
-i input protein sequence in FASTA format (can be multi-FASTA)
-o output directory (models)
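Before running modelling.sh it can be useful to confirm that the input file really is a protein FASTA file. The sketch below is not part of ModCRElib; it parses FASTA text with plain Python and flags records that are empty or contain characters outside the 20 standard amino acids plus X. The function names are hypothetical.

```python
def read_fasta(text):
    """Parse FASTA text into a dict mapping record id to sequence."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the record id.
            header = line[1:].split()
            current = header[0] if header else "unnamed"
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# The 20 standard amino acids plus X for unknown residues.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWYX")


def invalid_records(records):
    """Return ids of records that are empty or contain unexpected characters."""
    return [name for name, seq in records.items()
            if not seq or not set(seq.upper()) <= AMINO_ACIDS]
```

Reading AHR_Example.fa into a string and passing the result of read_fasta to invalid_records should give an empty list for a clean input.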

------

We now have a folder containing the modelled transcription factor in PDB format.
You can view the file in Chimera, or online at https://www.rcsb.org/3d-view by uploading the file.

------

Example 2

Predict TF binding specificity

------

bin/renumberModels.py is a script that ensures the atom and amino acid numbering of the models from modelling.sh is continuous (not always needed)
arg 1 = folder containing the PDB files to be renumbered
arg 2 = output folder
(python bin/renumberModels.py models remodels)
pwm.sh predicts the binding specificity
pdb="/home/pgohl/ModCRE_Package/pdb" Needs to be replaced with the location of the downloaded pdb folder on your machine
pbm="/home/pgohl/ModCRE_Package/pbm" Needs to be replaced with the location of the downloaded pbm folder on your machine
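To illustrate what continuous numbering means in the renumbering step above, here is a minimal sketch that rewrites the atom serial numbers and residue numbers of PDB ATOM/HETATM records so that they count up from 1 without gaps. It is not the code of bin/renumberModels.py. It assumes the standard fixed-width PDB columns and numbers residues continuously across the whole file; the real script may treat chains differently.

```python
def renumber_pdb_lines(lines):
    """Return PDB lines with atom serials and residue numbers made continuous."""
    renumbered = []
    serial = 0
    residue_index = 0
    previous_key = None
    for line in lines:
        if line.startswith(("ATOM", "HETATM")):
            # Chain id (column 22) plus residue number and insertion code
            # (columns 23-27) identify the residue in the original file.
            key = (line[21], line[22:27])
            if key != previous_key:
                residue_index += 1
                previous_key = key
            serial += 1
            # Columns 7-11 hold the serial, 23-26 the residue number;
            # the insertion code (column 27) is blanked.
            line = f"{line[:6]}{serial:>5}{line[11:22]}{residue_index:>4} {line[27:]}"
        renumbered.append(line)
    return renumbered
```

All other record types are passed through unchanged, so this sketch does not update CONECT or TER records that refer to the old serial numbers.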

------

We now have a predicted binding specificity of the transcription factor in pwm and meme format in the output folder specified in the previous command.
Remember that ModCRE is designed to use PWMs in aggregate to predict binding sites and individual PWM predictions may be more or less accurate.
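To get a quick impression of a predicted motif, the letter-probability matrix of a MEME file can be read and reduced to a consensus sequence. The sketch below is not part of ModCRElib. It assumes the usual MEME text layout (a "MOTIF" line followed by a "letter-probability matrix: ... w= N ..." header and N rows) and a DNA alphabet in the order A, C, G, T; the function names are hypothetical.

```python
def parse_meme_matrices(text):
    """Return {motif_name: rows}; each row holds the A, C, G, T probabilities."""
    motifs = {}
    name = None
    rows_left = 0
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "MOTIF" and len(fields) > 1:
            name = fields[1]
            motifs[name] = []
        elif fields[0] == "letter-probability" and name is not None:
            # The header carries "w= <width>": the number of rows that follow.
            rows_left = int(fields[fields.index("w=") + 1])
        elif rows_left > 0:
            motifs[name].append([float(value) for value in fields])
            rows_left -= 1
    return motifs


def consensus(rows, alphabet="ACGT"):
    """Return the most probable letter at each position (first letter wins ties)."""
    return "".join(alphabet[row.index(max(row))] for row in rows)
```

A consensus string discards most of the information in a PWM, so it is only a sanity check and not a substitute for the aggregate use of PWMs described above.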

------

Example 3

Scan a DNA sequence for binding sites

------

pwm/make_scan_ready.py is a script that generates a database file of the predicted PWMs from the previous step in the correct format for scanning.
- line 4 of the script references "database.txt" (the database file generated by pwm.sh) and the name of the new file to be used for scanning
scan_sequence.sh is the script to run a scan with.
-i the DNA sequence file in FASTA format that is to be scanned
-o the name of the output folder
-s the species identifier (if restricting the scan to TFs of a given species)
-ft the FIMO threshold value to be used in designating hits
--db the location of the database file to be used (the default option is the larger pwm folder provided)
-c cluster complexes into connected binding sites

------

There are several ways to inspect the results. The TFs binding the sequence and their binding sites can be retrieved from the file orthologs.json.
There we can see the name of the TF producing the hit, the start and end indices of the binding site along the DNA sequence, and the various orthologs that bind.
Alternatively, scan_sequence.sh (with the parameters used above) will generate thread files of the TF and its orthologs binding the DNA sequence in the folder aux_files.
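The exact layout of orthologs.json is not described here, so a first step when exploring it from Python is to print its structure. The sketch below makes no assumption about the key names; it walks whatever JSON it is given and reports keys and value types down to a chosen depth. The function name describe is hypothetical.

```python
import json


def describe(node, depth=0, max_depth=2):
    """Return indented lines summarising keys and value types of parsed JSON."""
    lines = []
    indent = "  " * depth
    if isinstance(node, dict):
        for key, value in node.items():
            lines.append(f"{indent}{key}: {type(value).__name__}")
            if depth < max_depth:
                lines.extend(describe(value, depth + 1, max_depth))
    elif isinstance(node, list) and node:
        # Only the first element is inspected, assuming a homogeneous list.
        lines.append(f"{indent}[{len(node)} items, first is {type(node[0]).__name__}]")
        if depth < max_depth:
            lines.extend(describe(node[0], depth + 1, max_depth))
    return lines
```

Loading the file with json.load and printing "\n".join(describe(data)) gives an outline from which the TF names, binding positions and orthologs can then be picked out.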

------

Example 4

Generate a model of a TF attached to a predicted binding site along a full length of DNA

------

We can view the TFs binding to the full scanned DNA sequence as predicted in the previous step. For this we first need to process the files a little.

------

modelling.sh contains the option to use thread files as input instead of the previously used multi-FASTA files
-i a file containing a list of the locations of the thread files to be used (Threads_list.txt is provided)
-t indicates that threads are used instead of amino acid sequences
-o the output folder location
We copy the desired models (from the output of the previous step) into a folder containing all the binary interactions that we would like to use in the modelled complex
rename_complex_input.py is a Python script that prepares scanning output file names for complex builder (e.g. python rename_complex_input.py BinaryInteractions/A9YTQ3.5nj8_1A.18-29.pdb ---> BinaryInteractions/A9YTQ3.5nj8_1A.18-29:1:243_TF.pdb)
- the file name must follow this format:
{UniprotAccession}.{PDBID}{Chain}.{index of binding start}-{index of binding end}:{model start index}:{model end index}{a label}.pdb
BuildComplex.sh is the script that will build a complex based on binary interaction files contained in a given folder
a) /soft/system/software/x3dna/2.3/bin/fiber will generate a pdb file for a given DNA sequence
-seq the dna sequence to be used (in the case of modelling the scanning results use the same sequence)
-b the dna conformation to be applied followed by output location
b) exe/complexbuilder.py will build the complex
-d the folder containing the binary interaction files
-o the output folder location
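The file-name format required for the binary interaction files (see rename_complex_input.py above) can be checked with a regular expression before running BuildComplex.sh. This is a sketch derived only from the format string and the example name given above; the field names in the pattern are hypothetical and the script itself may accept or reject names differently.

```python
import os
import re

# {accession}.{PDB id + chain}.{bind start}-{bind end}:{model start}:{model end}{label}.pdb
NAME_PATTERN = re.compile(
    r"^(?P<accession>[^.]+)\."
    r"(?P<template>[^.]+)\."
    r"(?P<bind_start>\d+)-(?P<bind_end>\d+)"
    r":(?P<model_start>\d+):(?P<model_end>\d+)"
    r"(?P<label>.*)\.pdb$"
)


def parse_complex_name(path):
    """Return the fields of a complex-builder file name, or None if it does not match."""
    match = NAME_PATTERN.match(os.path.basename(path))
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("bind_start", "bind_end", "model_start", "model_end"):
        fields[key] = int(fields[key])
    return fields
```

For the example above, the renamed file A9YTQ3.5nj8_1A.18-29:1:243_TF.pdb parses successfully, while the original name A9YTQ3.5nj8_1A.18-29.pdb returns None because the model start and end indices are missing.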

------

Now the modelled complex can be viewed in the output folder (Complex/fragment_1-100/dna__1-100_aa.pdb).

------

Example 5

Generate thread files from a modelled TF

------

get_best_bindings_threads.sh produces thread files for use in modelling and retrieving scores.
pdb="/home/pgohl/ModCRE_Package/pdb" Needs to be replaced with the location of the downloaded pdb folder on your machine
pbm="/home/pgohl/ModCRE_Package/pbm" Needs to be replaced with the location of the downloaded pbm folder on your machine
-i (the pdb of the transcription factor to be used)
-o (output directory)
--seq FASTA file containing the DNA sequence to be bound
--dna nucleotide sequence of the binding site
--pwm meme file for the transcription factor

------

The thread file can be used to score the binding of a TF along any DNA sequence that matches the binding site length.
The previous steps need not be repeated for other DNA sequences; simply create a thread file for the relevant substituted sequence by
replacing the DNA sequences at the bottom of the thread file (in the threads folder):
>dna
CAGCTGGCTGTG;0
//
>dna_fixed
CAGCTGGCTGTG;0
//
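The substitution described above can also be scripted. The sketch below is not part of ModCRElib; it relies only on the ">dna" and ">dna_fixed" blocks shown above, keeps whatever follows the ";" unchanged, and refuses sequences whose length differs from the original binding site. The function name is hypothetical.

```python
def replace_thread_dna(thread_text, new_dna):
    """Return thread_text with the sequences under '>dna' and '>dna_fixed' replaced."""
    lines = thread_text.splitlines()
    for index, line in enumerate(lines[:-1]):
        if line.strip() in (">dna", ">dna_fixed"):
            # The sequence line looks like "CAGCTGGCTGTG;0"; keep the part after ';'.
            old_dna, separator, suffix = lines[index + 1].partition(";")
            if len(old_dna) != len(new_dna):
                raise ValueError("new sequence must match the binding site length")
            lines[index + 1] = new_dna + separator + suffix
    return "\n".join(lines) + "\n"
```

Writing the returned text to a new file in the threads folder gives a thread file for the substituted sequence, leaving the original untouched.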

------

Example 6

Generate a scoring profile of a TF-DNA interaction

------

get_best_score.sh produces a scoring profile for the transcription factor on the target DNA sequence.
-i (the input is a thread file as produced by get_best_bindings_threads.sh)
-o (output directory)

------

Finally we have a score profile (statistical potentials) for the TF binding along the tested DNA binding site.

------

Example 7

Generate a scoring profile plot for a TF along a DNA sequence

------

get_score_profiles.sh will generate the raw scores that will be used to plot the scoring profile
-i a file listing the folders that contain the models to be used
-d a DNA sequence in FASTA format to profile the TF binding against
-o output name for tables and plots (default is the name of the input FOLDER)

------

The output will be stored in each folder listed in the input file. A folder will have been generated, named
profilerinput.txt_profiling.34_272 by default (profilerinput.txt being the input file, profiling the folder from that file).
Inside are the individual model scores (pickle files) as well as the mean tables (csv files).

------

bin/plotprofile.py will generate a plot from the mean score file
arg 1 = path to the mean table to be plotted
arg 2 = location the plot is to be stored at
arg 3 = column (score type) to be plotted (the available options are printed if the given column is not in the file)
(python bin/plotprofile.py profiling/profilerinput.txt_profiling.34_272/profilerinput.txt_profiling.Profiletest_1.mean.csv profiling/energy_plot.out_normal_s3dc_dd.png normal_s3dc_dd)
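If you want to work with the mean table directly instead of plotting it, a single score column can be read with the standard csv module. This sketch assumes the mean table is comma-separated with a header row and numeric values, which is an assumption about the file layout; like plotprofile.py, it reports the available columns when the requested one is absent. The function name is hypothetical.

```python
import csv
import io


def read_column(csv_text, column):
    """Return the values of one score column of a mean table as floats."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    if column not in columns:
        raise KeyError(f"{column!r} not found; available columns: {', '.join(columns)}")
    return [float(row[column]) for row in reader]
```

For example, reading the mean.csv file into a string and calling read_column(text, "normal_s3dc_dd") returns that score along the DNA sequence as a list.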

There is support for running jobs in parallel. To do this, the relevant information in the Cluster field of config.ini must be filled in.
After that, simply run jobs with the parallel parameter.

Example 8

Aggregate PWM clusters for a TF

------

get_json.sh will generate input (json files) for subsequent steps
exe/get_json.py takes positional arguments:
home path
Fasta format file of the TFs to be run
file containing TF codes (uniprot) corresponding to TFs in Fasta file
Table of family labels for TF codes (provided in files)
Table of nearest neighbors (the one provided in files covers cases of 30-100% sequence identity)
folder of the pwms generated with modcre (output of pwm.sh)
output folder name
uniprot label indicator (uniprot)
get_aggregates.sh uses the following variables:
input={name of the output from get_json.sh}
modcre={folder of the modcre predicted pwms (pwm.sh output)}
pvalue={P-value threshold of TOMTOM similarity between two PWMs (default 0.005)}
threshold={Distance threshold of the agglomerative clustering (default is 0.01)}
length={Length of the binding site of the output PWMs}
ModCRElib/msa/aggregate_pwms.py takes some arguments that the user may wish to change
--jaspar location of jaspar pwm database
--cisbp location of cisbp pwm database
--hocomoco location of hocomoco pwm database
-o output folder
--info logfile to write to
--dummy dummy folder to use
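To give an idea of what a distance threshold does in agglomerative clustering, here is a small pure-Python sketch. It is not the implementation in ModCRElib/msa/aggregate_pwms.py: it uses single linkage (two items end up in the same cluster whenever a chain of pairwise distances at or below the threshold connects them), which is an assumption, and the distance function is supplied by the caller rather than derived from TOMTOM.

```python
def cluster_by_threshold(names, distance, threshold):
    """Single-linkage clustering: join items whose distance is <= threshold."""
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the cluster representative (with path halving).
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if distance(first, second) <= threshold:
                parent[find(first)] = find(second)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values())
```

A lower threshold yields more, tighter clusters; a higher one merges more PWMs into each cluster.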

------

The output will be saved as a folder for each uniprot id provided (in the example given P35869).
Each of these folders will contain a single folder for each of the clusters detected.
Within the cluster folder will be found the memes for all of the cluster member PWMs as well as a mean PWM for the cluster.
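As an illustration of what a mean PWM is, the sketch below averages several PWMs position by position. It is a simplified stand-in, not the code used by ModCRElib: it assumes all PWMs have the same length, are already aligned, and store four values (A, C, G, T) per position. Real cluster members may need to be aligned or trimmed first.

```python
def mean_pwm(pwms):
    """Average equal-length PWMs position by position and renormalise each row."""
    if not pwms:
        raise ValueError("at least one PWM is required")
    width = len(pwms[0])
    if any(len(pwm) != width for pwm in pwms):
        raise ValueError("all PWMs must have the same number of positions")
    averaged = []
    for position in range(width):
        # Sum each letter over all PWMs, then divide by the row total so the
        # four values at this position add up to 1.
        sums = [sum(pwm[position][letter] for pwm in pwms) for letter in range(4)]
        total = sum(sums)
        averaged.append([value / total for value in sums])
    return averaged
```

Each input PWM is a list of rows, one row per position, for example [[1.0, 0.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0]].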

------

Other available functionalities that may be of interest

The exe folder contains other executable programs that, while not the main focus of ModCRElib, may still be useful.

TFinderSelect.py
Retrieves info from a pdb or uniprot entry to decide if a protein is a TF
build_dna.py
Turns a DNA string into a pdb file
clean.py
Cleans a PDB file
contacts.py
Gets the contacts calculated from a pdb file.
dimers.py
Check if input is a component of a dimer and retrieve monomer ids and contacts
homologs.py
Given a BLAST or HMM file as input, creates a file of homologs
interface.py
Will search for the interface of interaction within a pdb
merge_pwms.py
Combine predicted pwms into a single averaged pwm
mmcif_to_pdb.py
Converts mmCIF format to PDB
mmcifs_to_pdbs.py
Converts a folder of mmCIF format files to PDB
model_IMP.py
Use IMP to model macro-complexes
nearest_neighbour.py
Calculates the closest similar sequences (nearest neighbours) of each TF and compares their PWMs using TOMTOM. It also compares the PWMs modelled for each TF with the dataset of PWMs. Boxplots are built to compare the success of both under different conditions and with different scores derived from the TOMTOM results
pbm.py and pdb.py
Create the pbm and pdb directories that are provided at http://aleph.upf.edu/modcrefiles/
pdb2thread.py
converts pdb to a thread file

