Software plays a vital role in modern scientific research, so scientific software must remain both accessible and of high quality. Software Heritage (https://www.softwareheritage.org/) supports sustainable and reproducible science by serving as a global archive for software preservation. This project examines current trends in the development of bioinformatics software, using information gathered from the abstracts of articles published on PubMed (https://pubmed.ncbi.nlm.nih.gov/). Using the PubMed, GitHub, and Software Heritage APIs, we collect information on approximately 10,000 scientific software packages. We then determine the proportion of software that is archived, assess development dynamics, and evaluate whether the software is accessible through the links provided in the publications. The workflow is implemented with Snakemake, so the whole analysis can be rerun from scratch.
Clone the repository:
git clone https://github.com/zhukovanadezhda/bioinformatics-software.git
cd bioinformatics-software
Install Miniconda and Mamba, then create the bioinfosoft conda environment:
mamba env create -f binder/environment.yml
conda activate bioinfosoft
Remark: to deactivate an active environment, use:
conda deactivate
The workflow analysis requires API keys for PubMed, GitHub and Software Heritage.
To get API keys:
- For PubMed, go to the bottom of the NCBI Account Settings page.
- For GitHub, go to the Personal access tokens page of your account. There is no need to select specific scopes.
Create the file .env to store the API keys in the following format:
GITHUB_TOKEN=...
PUBMED_TOKEN=...
SWH_TOKEN=...
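Before launching the workflow, it can help to confirm that the .env file defines all three keys. The following is a minimal sketch, not part of the repository: the helper names parse_env and missing_keys are hypothetical, and it assumes .env sits in the current directory and uses plain KEY=value lines as shown above.

```python
from pathlib import Path

# Keys listed in the .env format above.
REQUIRED_KEYS = ("GITHUB_TOKEN", "PUBMED_TOKEN", "SWH_TOKEN")


def parse_env(text):
    """Parse KEY=value lines into a dict, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values):
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")  # assumed location: the repository root
    text = env_path.read_text() if env_path.exists() else ""
    missing = missing_keys(parse_env(text))
    if missing:
        print("Missing API keys:", ", ".join(missing))
    else:
        print("All API keys found.")
```

Saved under any name and run from the repository root, the script reports which keys still need a value. It only checks that the keys are present and does not validate them against the APIs.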
Run the analysis with the Snakemake workflow:
snakemake --cores 1 --use-conda
All results will be stored in the data folder.