Dear Community,
After careful consideration, we have decided to shift our focus to new and innovative initiatives that will better serve our community and align with our long-term goals.
Effective Date: May 12th, 2025
Impact on Users:
The project repository will be archived and set to read-only mode, ensuring that it remains accessible for reference. While no further updates, bug fixes, or support will be provided, we encourage you to explore the wealth of knowledge and resources available in the repository.
Licensing: The project will remain under its current open-source license, allowing others to fork and continue development if they choose.
We understand that this change may come as a surprise, but we are incredibly grateful for your support and contributions over the years. Your dedication has been instrumental in the success of this project, and we look forward to your continued involvement in our future endeavors.
Thank you for your understanding and support.
Jupyter Notebook is a useful tool for data scientists working on genomics data analysis. This repo demonstrates the use of Azure Notebooks for genomics data analysis with GATK, Picard, Bioconductor, and Python libraries.
Here is the list of sample notebooks in this repo:
genomics.ipynb: Analysis from 'uBAM' to 'structured data table' analysis.
genomicsML.ipynb: Train Machine Learning models with Genomics + Clinical Data
genomics-platinum-genomes.ipynb: Accessing Illumina Platinum Genomes data from Azure Open Datasets* and to make initial data analysis.
genomics-reference-genomes.ipynb: Accessing reference genomes from Azure Open Datasets*
genomics-clinvar.ipynb: Accessing ClinVar data from Azure Open Datasets*
genomics-giab.ipynb: Accessing Genome in a Bottle data from Azure Open Datasets*
SnpEff.ipynb: Accessing SnpEff databases from Azure Open Datasets*
1000 Genomes.ipynb: Accessing 1000 Genomes dataset from Azure Open Datasets*
GATKResourceBundle.ipynb: Accessing GATK resource bundle from Azure Open Datasets*
ENCODE.ipynb: Accessing ENCODE dataset from Azure Open Datasets*
genomics-OpenCRAVAT.ipynb: Accessing OpenCRAVAT dataset from Azure Open Datasets and deploy built-in Azure Data Science VM for OpenCRAVAT*
Bioconductor.ipynb: Pulling Bioconductor Docker image from Microsoft Container Registry
simtotable.ipynb: Simulate NGS data, use Cromwell on Azure OR Microsoft Genomics service for secondary analysis and convert the gVCF data to a structured data table.
igv_jupyter_extension_sample.ipynb: Download sample VCF file from Azure Open Datasets and use igv-jupyter extension on Jupyter Lab environment.
radiogenomics.ipynb: Combine DICOM, VCF and gene expression data for patient segmentation analysis.
fhir+PacBio.ipynb: Convert Synthetic FHIR and PacBio VCF Data to parquet and Explore with Azure Synapse Analytics
fhir-vcf-clustering.ipynb: Convert Synthetic FHIR and PacBio VCF Data to parquet and Explore with Azure Synapse Analytics
graphragforgenomics.ipynb: Use GraphRAG for genomics annotation.
*Technical note: Explore Azure Genomics Data Lake with Azure Storage Explorer
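Several of the notebooks above turn VCF or gVCF output into a structured data table. As a rough illustration of that idea only, and not the code used in any of the notebooks, here is a minimal sketch in plain Python. The function names (`vcf_to_rows`, `rows_to_csv`) and the two sample records are hypothetical and made up for this example; it reads only the eight fixed VCF columns and ignores per-sample genotype fields.

```python
import csv
import io

# The eight fixed columns that every VCF record starts with.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def vcf_to_rows(vcf_text):
    """Turn VCF text into a list of dicts, one per variant record."""
    rows = []
    for line in vcf_text.splitlines():
        # Skip blank lines, "##" meta lines, and the "#CHROM" header line.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        row = dict(zip(VCF_COLUMNS, fields[:8]))
        row["POS"] = int(row["POS"])  # positions are 1-based integers
        rows.append(row)
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=VCF_COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical two-record sample, invented purely for illustration.
SAMPLE_VCF = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t12345\t.\tA\tG\t50\tPASS\tDP=30\n"
    "chr2\t67890\t.\tC\tT\t99\tPASS\tDP=42\n"
)

table = vcf_to_rows(SAMPLE_VCF)
print(rows_to_csv(table))
```

For real data, a dedicated VCF parser is likely a better choice than splitting lines by hand, since it also handles multi-allelic sites, INFO sub-fields, and sample columns.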
For further details on creation of Azure ML workspace please visit this page.
This chapter uses the cloud notebook server in your workspace for an install-free and pre-configured experience. Use your own environment if you prefer to have control over your environment, packages and dependencies.
Follow along with this video or use the detailed steps below to clone and run the tutorial from your workspace.
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.