Github repository for the Liu and Samee 2021 publication, predicting single nucleotide mutation rate variations with a combination of DNA shape features and sequence context features.

sameelab/mutprediction-with-shape

Predicting mutation rate variations with DNA shape

Zian Liu

Last updated: 7/5/2022

Introduction

This is the GitHub repository for the Liu Z and Samee MAH 2021 (pending) publication, "Mutation rate variations in the human genome are encoded in DNA shape". The manuscript is currently under review and has been posted to bioRxiv (https://www.biorxiv.org/content/10.1101/2021.01.15.426837v2); see the bottom of the page for citation information.

This repo contains the main notebook for our publication, the code and snippets used to generate our final results, and the scripts used for our TFBS analysis section.

Workflow

Our workflow is documented in the .ipynb notebooks located in the Notebook/ directory. Make sure to download the Python library script, along with either a) the individual numbered notebooks for individual steps or b) Publication_note.ipynb for all the steps.

Installation

The program runs on Python version 3. The following packages are required:

  • numpy
  • pandas
  • joblib
  • sklearn (installed as scikit-learn)

along with their dependencies.
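To confirm that the environment is ready before opening the notebooks, one option is to check that each listed package can be found. The following is a minimal sketch and is not part of this repository; the helper name missing_packages is hypothetical.

```python
import importlib.util
import sys

# Import names of the packages listed above
# (the scikit-learn distribution is imported as "sklearn").
REQUIRED = ("numpy", "pandas", "joblib", "sklearn")


def missing_packages(names=REQUIRED):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    if sys.version_info.major < 3:
        sys.exit("Python 3 is required.")
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are available.")
```

The check only looks each package up and does not import it, so it runs quickly and does not verify package versions.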

What are the input data?

For the mutation rate data, please request it from Dr. Benjamin F. Voight; the data is also available from Dr. Voight's GitHub. We strongly encourage you to contact Dr. Voight and request access first, and to cite their study if you use their data.

As you might have noticed, we included an input mutation rate data file in our example script directory. We strongly discourage you from using this data directly for other purposes. This input data was generated by one of our in-production pipelines and then re-formatted to match the format of the Aggarwala and Voight data. It is intended as a toy dataset, and we do not currently have documentation for how to generate it. If you are interested, please stay tuned, as we plan to release our pipeline on the Samee Lab GitHub; alternatively, contact us and we will be more than happy to pass the data (as well as the steps to generate it) to you.

For the DNAshape reference table, we have included a 7-mer reference table in the "data_input" directory. We also encourage checking out Zian Liu's GitHub repo, which contains scripts for extracting the reference table from the DNAshapeR package. Please make sure to cite the four DNAshapeR papers when using this Excel spreadsheet.

For the DNAshapeR package, please visit Tsu-Pei Chiu's GitHub page for more information.

How can I run this for myself?

We have included our main Jupyter notebook ("Publication_note.ipynb") and the components (different "Notebook" ipynbs and the functions script) as reference documents. We have separately prepared an example pipeline in the "pipeline_example" directory.

To run our model, call:

python main.py input_mutation_file reference_dnashape_file.xlsx

Run this from the example directory, and make sure that the python command refers to Python version 3. The README file included there explains the steps in more detail, and the script file is annotated so that you can follow along.
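If you would rather launch the example from another Python script, the command above can be wrapped with subprocess. This is only a sketch under stated assumptions: the helper names build_command and run_pipeline are hypothetical, the file names are placeholders for your own inputs, and it assumes main.py sits in the "pipeline_example" directory.

```python
import subprocess
import sys


def build_command(mutation_file, shape_file, script="main.py"):
    """Build the argument list for the command shown above.

    sys.executable is the interpreter running this script, so the child
    process uses the same Python 3 installation.
    """
    return [sys.executable, script, mutation_file, shape_file]


def run_pipeline(mutation_file, shape_file, workdir="pipeline_example"):
    """Run main.py from `workdir` (assumed location of the example script).

    Raises subprocess.CalledProcessError if the script exits with an error.
    """
    command = build_command(mutation_file, shape_file)
    return subprocess.run(command, cwd=workdir, check=True)
```

Because the command runs with workdir as its working directory, relative input paths are resolved from that directory, just as when calling the script by hand.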

Where are your TFBS analyses?

We have included our TFBS analysis scripts in the TFBS_analysis/ directory. Please read the directory-specific README for more information, and please don't hesitate to reach out to Zian if anything seems especially confusing.

Citations

If you are using the input data from Aggarwala and Voight, please make sure to cite:

  • Aggarwala, V. & Voight, B. F. An expanded sequence context model broadly explains variability in polymorphism levels across the human genome. Nature Genetics 48, 349–355 (2016).

If you are using any data pertinent to the DNAshape method, the DNAshapeR package, or our curated DNA shape tables, please make sure to cite all four of the following:

  • Chiu, T.-P. et al. DNAshapeR: an R/Bioconductor package for DNA shape prediction and feature encoding. Bioinformatics 32, 1211–1213 (2016).
  • Chiu, T.-P., Rao, S., Mann, R. S., Honig, B. & Rohs, R. Genome-wide prediction of minor-groove electrostatic potential enables biophysical modeling of protein–DNA binding. Nucleic Acids Res 45, 12565–12576 (2017).
  • Li, J. et al. Expanding the repertoire of DNA shape features for genome-scale studies of transcription factor binding. Nucleic Acids Res 45, 12877–12887 (2017).
  • Rao, S. et al. Systematic prediction of DNA shape changes due to CpG methylation explains epigenetic effects on protein–DNA binding. Epigenetics & Chromatin 11, 6 (2018).

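For all other uses pertinent to our work, note that our manuscript is currently under review. In the meantime, please cite the following bioRxiv submission:

  • Liu, Z. & Samee, M. A. H. Mutation rate variations in the human genome are encoded in DNA shape. bioRxiv https://www.biorxiv.org/content/10.1101/2021.01.15.426837v2 (2021).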

Contact

Please contact Md. Abul Hassan Samee, Ph.D. (samee@bcm.edu) for questions related to our publication or other logistics-related questions.

Please contact Zian Liu (zian.liu@bcm.edu) for questions specifically related to our research. If you are accessing this page in or after Spring 2023 and do not hear back from Zian within 2 days, please email Dr. Samee directly.

Thank you!
