Sigma70Pred: A Highly Accurate Method for Predicting Sigma70 Promoters in Escherichia coli K-12 Strains
Sigma70Pred is a computational tool developed for predicting sigma70 promoters in Escherichia coli K-12 strains.
The sigma70 factor plays a crucial role in prokaryotic transcription and regulates most housekeeping genes. Sigma70Pred uses nucleotide sequence-based features and machine learning models to classify DNA sequences as sigma70 promoters or non-promoters with high accuracy.
Web Server: https://webs.iiitd.edu.in/raghava/sigma70pred/
Patiyal, S., Singh, N., Ali, M. Z., Pundir, D. S., and Raghava, G. P. S. Sigma70Pred: A highly accurate method for predicting sigma70 promoter in Escherichia coli K-12 strains. Frontiers in Microbiology, 13, 1042127, 2022.
https://doi.org/10.3389/fmicb.2022.1042127
The tool and dataset are also available on Zenodo at https://doi.org/10.5281/zenodo.20163591
Promoters are regulatory DNA regions located upstream of transcription start sites and are responsible for controlling gene expression. In prokaryotes, promoters are recognized by RNA polymerase together with sigma factors.
Sigma70 is one of the most important sigma factors because it regulates the transcription of most housekeeping genes in Escherichia coli. Sigma70 promoters usually contain conserved sequence regions near the -10 and -35 positions upstream of the transcription start site.
Accurate prediction of sigma70 promoters is important for understanding bacterial gene regulation, transcriptional control, genome annotation, and regulatory network analysis.
Data Compilation: The benchmark dataset was obtained from RegulonDB 9.0 and contained 741 sigma70 promoters and 1400 non-promoters from Escherichia coli K-12. An independent dataset was created using RegulonDB 10.8 and contained 1134 sigma70 promoters and 638 non-promoters.
Methodology: Sigma70Pred uses machine learning models trained on nucleotide sequence-based features. Around 8465 features were generated, including dinucleotide auto-correlation, dinucleotide cross-correlation, dinucleotide auto cross-correlation, Moran auto-correlation, normalized Moreau-Broto auto-correlation, pseudo tri-nucleotide composition, motif counts, GC skew, AT skew, and other nucleotide-based descriptors.
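To make a few of the descriptors named above concrete, here is a minimal sketch of GC skew, AT skew, and motif counting. It uses the textbook definitions, (G - C)/(G + C) and (A - T)/(A + T). The function names are hypothetical and the exact formulas used by Nfeature may differ, so treat this as an illustration rather than the tool's implementation.

```python
def skew(seq, x, y):
    """Return (count(x) - count(y)) / (count(x) + count(y)); 0.0 if neither base occurs."""
    seq = seq.upper()
    cx, cy = seq.count(x), seq.count(y)
    return (cx - cy) / (cx + cy) if (cx + cy) else 0.0


def gc_skew(seq):
    # Textbook GC skew; assumed, not taken from the Sigma70Pred source code.
    return skew(seq, "G", "C")


def at_skew(seq):
    # Textbook AT skew; assumed, not taken from the Sigma70Pred source code.
    return skew(seq, "A", "T")


def motif_count(seq, motif):
    """Count occurrences of motif in seq, allowing overlaps."""
    seq, motif = seq.upper(), motif.upper()
    k = len(motif)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == motif)


if __name__ == "__main__":
    example = "TTGACAGGCTATAAT"  # toy sequence, not from the benchmark dataset
    print(gc_skew(example))                # 0.2
    print(at_skew(example))                # 0.0
    print(motif_count(example, "TATAAT"))  # 1
```

Each function returns one number per sequence, so a feature vector is simply these values concatenated for every descriptor and motif of interest.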
Predictive Modeling: Allows users to submit DNA sequences and predict whether they are sigma70 promoters or non-promoters.
High Accuracy: The best SVM-based model achieved 97.38% accuracy, AUROC 0.996, and MCC 0.94 on the benchmark dataset.
Independent Validation: On the independent dataset from RegulonDB 10.8, Sigma70Pred achieved 90.41% accuracy, AUROC 0.953, and MCC 0.794.
Constitutive Promoter Prediction: The model successfully predicted constitutive promoters with 81.46% accuracy at the default threshold.
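For readers unfamiliar with the metrics reported above, the sketch below computes accuracy and the Matthews correlation coefficient (MCC) from a binary confusion matrix. The counts in the example are invented for illustration and are not the study's results. AUROC is omitted because it requires ranked prediction scores rather than counts.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical confusion-matrix counts, chosen only to show the arithmetic.
    tp, tn, fp, fn = 90, 80, 20, 10
    print(accuracy(tp, tn, fp, fn))       # 0.85
    print(round(mcc(tp, tn, fp, fn), 4))  # 0.7035
```

MCC ranges from -1 to 1 and accounts for all four cells of the confusion matrix, which makes it more informative than accuracy when the promoter and non-promoter classes are unequal in size, as they are in both datasets.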
Feature Generation: The study generated more than 8000 nucleotide sequence-based features using Nfeature.
Feature Selection: Recursive Feature Elimination was used to select the top 200 most relevant features.
Machine Learning Models: Several classifiers were tested, including Decision Tree, Random Forest, K-Nearest Neighbor, eXtreme Gradient Boosting, Gaussian Naive Bayes, and Support Vector Machine.
Best Classifier: The Support Vector Machine-based model performed best among all tested classifiers.
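The feature selection and classification steps above could be sketched with scikit-learn as follows. This is an assumption-laden illustration: the data are synthetic, the feature counts are scaled down (10 selected from 40 rather than 200 from thousands), and the choice of a linear-kernel SVM as the ranking estimator inside RFE, the RBF kernel for the final model, and all parameter values are assumptions, not settings taken from the study.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for a sequence feature matrix (rows: sequences, columns: features).
X, y = make_classification(
    n_samples=300, n_features=40, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# RFE needs an estimator that exposes coef_ or feature_importances_,
# so a linear-kernel SVM is assumed here as the ranking model.
selector = RFE(SVC(kernel="linear"), n_features_to_select=10, step=5)
selector.fit(X_train, y_train)

# Train the final classifier on the selected features only (kernel is an assumption).
clf = SVC(kernel="rbf")
clf.fit(selector.transform(X_train), y_train)

n_selected = int(selector.support_.sum())
test_accuracy = clf.score(selector.transform(X_test), y_test)
print(n_selected, round(test_accuracy, 3))
```

Fitting the selector on the training split only, and then applying it to the held-out split, keeps the feature ranking from seeing the test data.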
Predict Module: Allows users to classify submitted DNA sequences as sigma70 promoters or non-promoters.
Scan Module: Allows users to scan long DNA or genome sequences to identify possible sigma70 promoter regions using overlapping windows of 81 bp.
Design Module: Generates possible mutants of a submitted sequence and predicts whether mutations can convert a sigma70 promoter into a non-promoter or vice versa.
Standalone Package: Sigma70Pred is also available as Python and Perl-based standalone software for local use.
Docker Availability: The package is also distributed through GPSRdocker.
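The windowing behind the Scan module and the mutant enumeration behind the Design module can be illustrated with the short sketch below. The function names are hypothetical, the step size of 1 bp is an assumption (the description above only says the windows overlap), and the mutants are restricted to single-base substitutions, which may be narrower than what the Design module actually generates. Each window or mutant would then be passed to the classifier, which is not shown here.

```python
def windows(seq, size=81, step=1):
    """Yield (start, subsequence) for each overlapping window of `size` bp."""
    for start in range(0, len(seq) - size + 1, step):
        yield start, seq[start:start + size]


def point_mutants(seq, alphabet="ACGT"):
    """Yield (position, new_base, mutant) for every single-base substitution."""
    for i, base in enumerate(seq):
        for alt in alphabet:
            if alt != base:
                yield i, alt, seq[:i] + alt + seq[i + 1:]


if __name__ == "__main__":
    genome = "ACGT" * 25  # toy 100-bp sequence, not real genomic data
    print(len(list(windows(genome))))  # 20 windows of 81 bp

    first_window = next(windows(genome))[1]
    print(len(list(point_mutants(first_window))))  # 243 mutants (81 positions x 3 bases)
```

A sequence shorter than the window size yields no windows, so short inputs should go to the Predict module instead.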
Bacterial Promoter Prediction: Sigma70Pred can identify sigma70 promoter regions in Escherichia coli K-12 DNA sequences.
Genome Annotation: The tool can help annotate regulatory regions in prokaryotic genomes.
Gene Regulation Studies: Sigma70Pred can support studies of transcriptional regulation and promoter architecture.
Synthetic Biology: The design module can help modify promoter sequences for synthetic biology applications.
Comparative Genomics: The scan module can be used to investigate promoter-like regions in bacterial genomic sequences.
Regulatory Network Analysis: Accurate promoter prediction can help build and refine bacterial gene regulatory networks.
Prof. Gajendra P. S. Raghava (Corresponding Author)
Email: raghava@iiitd.ac.in
Department of Computational Biology, Indraprastha Institute of Information Technology Delhi, New Delhi, India
Sigma70Pred was developed with financial support from the Department of Biotechnology, Government of India.
The authors also acknowledge Megha Mathur and Anjali Dhall for Python scripts used to generate features and for help in figure preparation.