# dbGaP Reporter
This notebook queries the database of Genotypes and Phenotypes [dbGaP](https://www.ncbi.nlm.nih.gov/gap/) for studies and reports access requests for datasets. The number of access requests per study is a measure of data reuse. In addition, this notebook queries [Europe PMC](https://europepmc.org/) for publications and preprints that cite or mention dbGaP accession numbers.

Created: 2023-03-09

Author : Peter W. Rose (pwrose@ucsd.edu)

In [9]:
#@title Enter a query term and then select **Run All** from the **Runtime** menu. For an exact match enclose query term in quotes. { run: "auto", vertical-output: true, form-width: "50%", display-mode: "form" }
#@markdown ### Enter query term or dbGap accession number
query = "\"COVID-19\"" #@param {type:"string"}
print(f"Query: {query}")

Query: "COVID-19"


In [10]:
%%capture
#@title Installing software on Google Colab
![ ! -f "installed" ] && pip -q install selenium
![ ! -f "installed" ] && apt-get update
![ ! -f "installed" ] && apt-get install firefox && touch installed

In [11]:
#@title Importing packages
import os
import shutil
import glob
import time
from tqdm import tqdm
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options
from google.colab import data_table

In [12]:
#@title Running query
TMP_DIR = "/tmp"
filepath = os.path.join(TMP_DIR, "studies.csv")

def driversetup(download_dir):
    options = Options()
    #run Selenium in headless mode
    options.add_argument("--headless")
    options.add_argument("--no-sandbox")
    # https://stackoverflow.com/questions/60170311/how-to-switch-download-directory-using-selenium-firefox-python
    # 0: download to the desktop, 1 download to the default "Downloads" directory, 2 use specified directory
    options.set_preference("browser.download.folderList", 2)
    options.set_preference("browser.download.manager.showWhenStarting", False)
    options.set_preference("browser.download.dir", download_dir)
    options.set_preference("browser.helperApps.neverAsk.saveToDisk", "text/csv")
    
    # https://stackoverflow.com/questions/42204897/how-to-set-up-a-selenium-python-environment-for-firefox
    driver = webdriver.Firefox(options=options)
    driver.implicitly_wait(5)

    return driver

def download_dbgap_studies(query, filepath):
    # clean up any previously downloaded csv files
    files = glob.glob(os.path.join(TMP_DIR, "*.csv"))
    for file in files:
        os.remove(file)
    
    # download csv file
    driver = driversetup(TMP_DIR)
    driver.get(f"https://www.ncbi.nlm.nih.gov/gap/advanced_search/?TERM={query}")
    time.sleep(3)
    #print("Running: ", driver.title)
    button = driver.find_element(By.CLASS_NAME, "svr_container")
    time.sleep(3)
    button.click()
    # wait until download is completed
    for step in tqdm(range(15)):
        time.sleep(1)
    #time.sleep(15)
    driver.close()
                  
    # move downloaded csv file to a standard location
    move_studies_file(filepath)
    
def move_studies_file(filepath):
    """ Move downloaded file to a specified standard location"""
    # the file name of the downloaded csv file is unknown in advance,
    # but there should be only one csv file.
    files = glob.glob(os.path.join(TMP_DIR, "*.csv"))
    if len(files) == 1:
        shutil.move(files[0], filepath)
    else:
        print("query error")
        
filepath = "studies.csv"
download_dbgap_studies(query, filepath)

studies = pd.read_csv(filepath, usecols=["accession", "name", "description", "Study Design", "Study Consent",])

100%|██████████| 15/15 [00:15<00:00,  1.00s/it]


In [13]:
#@title Table of studies
print(f"Number of studies for {query}:", studies.shape[0])
data_table.DataTable(studies, include_index=False, num_rows_per_page=10)

Number of studies for "COVID-19": 175


Unnamed: 0,accession,name,description,Study Design,Study Consent
0,phs001672.v10.p1,Veterans Administration (VA) Million Veteran P...,MVP is an ongoing prospective cohort study and...,Prospective Longitudinal Cohort,HMB-MDS --- Health/medical/biomedical (mds)
1,phs002245.v1.p1,Genetic Determinants of Susceptibility to Seve...,The goal of the project is to identify genetic...,Interventional,GRU --- General research use
2,phs002258.v1.p1,Shotgun Transcriptome and Isothermal Profiling...,"In less than nine months, the Severe Acute Res...",Metagenomics,"GRU-IRB-PUB-COL --- General research use (irb,..."
3,phs002299.v1.p1,PETAL Network: Outcomes Related to COVID-19 Tr...,"ORCHID was a multicenter, blinded, placebo-con...",Clinical Trial,HMB --- Health/medical/biomedical
4,phs002300.v1.p1,Successful Clinical Response in Pneumonia Ther...,The goal of this study is to iteratively ident...,Prospective Longitudinal Cohort,HMB-IRB --- Health/medical/biomedical (irb)
...,...,...,...,...,...
170,phs003128.v1.p1,Rapid Acceleration of Diagnostics - Underserve...,"Prospective, mixed-methods cohort study of chi...",Case Set,GRU --- General research use
171,phs003246.v1.p1,Multimodal Immune Profiling to Determine Mecha...,We conducted a prospective cohort study of adu...,Prospective Longitudinal Cohort,"GRU-IRB-COL --- General research use (irb, col)"
172,phs002315.v1.p1,Integrated Analysis of Multimodal Single-Cell ...,The simultaneous measurement of multiple modal...,Case Set,GRU --- General research use
173,phs003086.v1.p1,Single Cell Transcriptomic Data from CD4+ T Ce...,Multisystem inflammatory syndrome in children ...,Case-Control,DS-AIID --- Disease-specific (autoimmune/infla...


In [14]:
#@title Data access requests grouped by requestor
def get_download_url(accession):
    return "https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetAuthorizedRequestDownload.cgi?study_id=" + accession

def get_authorized_requests(studies):
    authorized_requests = pd.DataFrame()

    for _, row in tqdm(studies.iterrows(), total=studies.shape[0]):
        try:
            df = pd.read_csv(get_download_url(row["accession"]), 
                             usecols=["Requestor", "Affiliation", "Project", "Date of approval", "Request status", 
                                      "Public Research Use Statement", "Technical Research Use Statement"],
                            sep="\t")
            df["accession"] = row["accession"]
            df["name"] = row["name"]
            authorized_requests = pd.concat([authorized_requests, df], ignore_index=True)
        except:
            print(f"Skipping: {row['accession']} - no data access through dbGaP.")
                                        
    return authorized_requests

requests = get_authorized_requests(studies)

# group requests to create a summary view
if requests.shape[0] > 0: 
    # summary = requests.groupby(["Requestor", "Affiliation", "Project", "Date of approval", "Request status",
    #                             "Public Research Use Statement", "Technical Research Use Statement"], 
    #                             as_index=False)["accession"].agg(', '.join)
    summary = requests.groupby(["Requestor", "Affiliation", "Project", "Request status",
                                "Public Research Use Statement", "Technical Research Use Statement"], 
                                as_index=False)[["Date of approval","accession"]].agg(", ".join)

    # keep only unique dates
    summary["Date of approval"] = summary["Date of approval"].str.split(", ").apply(set).str.join(", ")

    # show most frequent requests first
    summary["Number of requests"] = summary["accession"].str.count(",") + 1
    summary.sort_values(by="Number of requests", ascending=False, inplace=True)

# show results
print()
print()
print("Number of data access requests :", requests.shape[0])

if requests.shape[0] > 0:
    print("Number of unique requestors    :", len(requests["Requestor"].unique()))
    print("Number of unique studies       :", len(requests["accession"].unique()))
    display(data_table.DataTable(summary, include_index=False, num_rows_per_page=10))

 25%|██▍       | 43/175 [00:30<01:25,  1.54it/s]

Skipping: phs002577.v1.p1 - no data access through dbGaP.


100%|██████████| 175/175 [01:56<00:00,  1.50it/s]



Number of data access requests : 380
Number of unique requestors    : 273
Number of unique studies       : 70





Unnamed: 0,Requestor,Affiliation,Project,Request status,Public Research Use Statement,Technical Research Use Statement,Date of approval,accession,Number of requests
234,"Rose, Peter","UNIVERSITY OF CALIFORNIA, SAN DIEGO",Analysis and Evaluation of RADx-rad Datasets,approved,The National Institutes of Health launched the...,This request supports the RADx-rad program out...,"2023, Mar07","phs002522.v1.p1, phs002523.v1.p1, phs002524.v1...",48
14,"Anwar, Mohd Mozharul",NIH,Exploration of Wearable Device Data for COVID-19,approved,Wearable devices collect various physiological...,The objective of the proposed research is to e...,"Mar27, Dec19, 2023, Feb07, 2022","phs002516.v1.p1, phs002519.v1.p1, phs002523.v1...",11
46,"Ciofani, Danielle","BROAD INSTITUTE, INC.",Confirmation of RAS approval workflow for RADx...,approved,I will be conducting testing to confirm that I...,"I am a co-I of the RADx Data Hub program, and ...","Nov14, 2022","phs002516.v1.p1, phs002533.v1.p1, phs002534.v1...",8
186,"Miguez, Maria-Jose",FLORIDA INTERNATIONAL UNIVERSITY,System analysis for COVID humoral response,approved,Multisystem inflammatory syndrome in children ...,Multisystem inflammatory syndrome in children ...,"Feb10, 2023","phs002781.v1.p1, phs002945.v1.p1",2
36,"Chan, Kei Hang",BROWN UNIVERSITY,NHLBI TOPMed Whole-genome Sequencing Program (...,approved,"We use epidemiological, statistical, and bioin...",Many complex traits such as diabetes and cardi...,"2022, Dec15","phs002299.v1.p1, phs002363.v1.p1",2
...,...,...,...,...,...,...,...,...,...
105,"HAN, SHIZHONG",JOHNS HOPKINS UNIVERSITY,Integrative analysis of multi-omics datasets f...,closed,"Alcoholism has a genetic basis, but identifica...",Our goal is to identify risk genes underlying ...,"2019, Dec30",phs001672.v10.p1,1
104,"Guo, Yan",XI'AN JIAOTONG UNIVERSITY,Genomic analysis for human complex diseases,rejected,Many human complex diseases/traits have strong...,Many human complex diseases/traits are highly ...,"2020, Jan13",phs001672.v10.p1,1
103,"Guo, Shicheng",JOHNSON/JOHNSON/PHARM/RES/ DEVELOPMENT,"GWAS to identify disease genes, drug targets a...",approved,This study plans to identify genetic differenc...,Large genetic cohorts provide an opportunity t...,"2020, Nov13",phs001672.v10.p1,1
102,"Gui, Hongsheng",HENRY FORD HEALTH SYSTEM,Investigation of substance use on risk of suic...,approved,Is there any connection between substance addi...,The project aims to study the correlation and ...,"2021, Jul28",phs001672.v10.p1,1


In [15]:
#@title Detailed table of data access requests
print("Number of data access requests :", requests.shape[0])

if requests.shape[0] > 0:
    print("Number of unique requestors    :", len(requests["Requestor"].unique()))
    print("Number of unique studies       :", len(requests["accession"].unique()))
    display(data_table.DataTable(requests, include_index=False, num_rows_per_page=10))

Number of data access requests : 380
Number of unique requestors    : 273
Number of unique studies       : 70


Unnamed: 0,Requestor,Affiliation,Project,Date of approval,Request status,Public Research Use Statement,Technical Research Use Statement,accession,name
0,"Aday, Aaron",VANDERBILT UNIVERSITY MEDICAL CENTER,Peripheral Artery Disease and Major Depressive...,"Mar03, 2023",approved,People with depression are at higher risk of d...,Overview Major depressive disorder (MDD) is ph...,phs001672.v10.p1,Veterans Administration (VA) Million Veteran P...
1,"Adebamowo, Sally",UNIVERSITY OF MARYLAND BALTIMORE,NIH PRIMED Consortium Coordinated Application,"Dec14, 2022",approved,"Polygenic risk scores (PRS), are a genetic est...",The Polygenic Risk Methods in Diverse Populati...,phs001672.v10.p1,Veterans Administration (VA) Million Veteran P...
2,"AGRAWAL, ARPANA",WASHINGTON UNIVERSITY,Neurobiological bases of psychiatric traits,"Apr08, 2020",approved,Genetic variation contributes to health outcom...,Genetic variation contributes to variability i...,phs001672.v10.p1,Veterans Administration (VA) Million Veteran P...
3,"Almasy, Laura",CHILDREN'S HOSP OF PHILADELPHIA,Polygenic effects on complex traits across dev...,"Jun22, 2022",approved,Genetic markers are an important source of var...,Summary statistics from genome-wide associatio...,phs001672.v10.p1,Veterans Administration (VA) Million Veteran P...
4,"Andreassen, Ole",UNIVERSITY OF OSLO,TOP study,"Apr21, 2020",approved,"Severe mental disorders such as schizophrenia,...",The overall research objective is to gain more...,phs001672.v10.p1,Veterans Administration (VA) Million Veteran P...
...,...,...,...,...,...,...,...,...,...
375,"Vlachos, Ioannis",BETH ISRAEL DEACONESS MEDICAL CENTER,"Effect of genetic, genomic, and microbiomic va...","Mar20, 2023",approved,Information is often lost in extensive cohort ...,In this project we aim to identify the effects...,phs001886.v4.p1,Single Cell Analysis of Human Parturition: The...
376,"Wang, Kevin",STANFORD UNIVERSITY,long noncoding RNA (lncRNA) regulation of huma...,"Feb01, 2022",approved,The mechanisms that trigger preterm labor have...,We are interested in analyzing preterm labor (...,phs001886.v4.p1,Single Cell Analysis of Human Parturition: The...
377,"Worthey, Elizabeth",UNIVERSITY OF ALABAMA AT BIRMINGHAM,Understanding the molecular underpinnings of c...,"Mar20, 2023",approved,Chorangiomas are benign placental capillary le...,The main objective of the research proposed is...,phs001886.v4.p1,Single Cell Analysis of Human Parturition: The...
378,"Zhang, Liye",SHANGHAITECH UNIVERSITY,Explore human placenta in term and preterm par...,"Apr16, 2021",closed,Parturition is essential for the reproductive ...,Research Objectives: Explore the mechanism of ...,phs001886.v4.p1,Single Cell Analysis of Human Parturition: The...


In [16]:
#@title Publications that cite or mention dbGaP accession numbers
studies["dbgap"] = studies["accession"].apply(lambda s: s.split(".")[0])
# get list of publications from Europe PMC
dbgap_pub = pd.read_csv("ftp://ftp.ebi.ac.uk/pub/databases/pmc/TextMinedTerms/dbgap.csv")
pubs = studies.merge(dbgap_pub, on="dbgap")
print("Number of publications:", pubs.shape[0])
display(data_table.DataTable(pubs, include_index=False, num_rows_per_page=10))

Number of publications: 59


Unnamed: 0,accession,name,description,Study Design,Study Consent,dbgap,PMCID,EXTID,SOURCE
0,phs001672.v10.p1,Veterans Administration (VA) Million Veteran P...,MVP is an ongoing prospective cohort study and...,Prospective Longitudinal Cohort,HMB-MDS --- Health/medical/biomedical (mds),phs001672,PMC7614108,36601961,MED
1,phs001672.v10.p1,Veterans Administration (VA) Million Veteran P...,MVP is an ongoing prospective cohort study and...,Prospective Longitudinal Cohort,HMB-MDS --- Health/medical/biomedical (mds),phs001672,PMC9515437,36174398,MED
2,phs001672.v10.p1,Veterans Administration (VA) Million Veteran P...,MVP is an ongoing prospective cohort study and...,Prospective Longitudinal Cohort,HMB-MDS --- Health/medical/biomedical (mds),phs001672,PMC9256707,35790736,MED
3,phs001672.v10.p1,Veterans Administration (VA) Million Veteran P...,MVP is an ongoing prospective cohort study and...,Prospective Longitudinal Cohort,HMB-MDS --- Health/medical/biomedical (mds),phs001672,PMC9233006,35449297,MED
4,phs001672.v10.p1,Veterans Administration (VA) Million Veteran P...,MVP is an ongoing prospective cohort study and...,Prospective Longitudinal Cohort,HMB-MDS --- Health/medical/biomedical (mds),phs001672,PMC9853312,35422469,MED
5,phs001672.v10.p1,Veterans Administration (VA) Million Veteran P...,MVP is an ongoing prospective cohort study and...,Prospective Longitudinal Cohort,HMB-MDS --- Health/medical/biomedical (mds),phs001672,PMC8917986,34865855,MED
6,phs001672.v10.p1,Veterans Administration (VA) Million Veteran P...,MVP is an ongoing prospective cohort study and...,Prospective Longitudinal Cohort,HMB-MDS --- Health/medical/biomedical (mds),phs001672,PMC8323712,34139859,MED
7,phs001672.v10.p1,Veterans Administration (VA) Million Veteran P...,MVP is an ongoing prospective cohort study and...,Prospective Longitudinal Cohort,HMB-MDS --- Health/medical/biomedical (mds),phs001672,PMC8159880,33853351,MED
8,phs001672.v10.p1,Veterans Administration (VA) Million Veteran P...,MVP is an ongoing prospective cohort study and...,Prospective Longitudinal Cohort,HMB-MDS --- Health/medical/biomedical (mds),phs001672,PMC8064427,33629108,MED
9,phs001672.v10.p1,Veterans Administration (VA) Million Veteran P...,MVP is an ongoing prospective cohort study and...,Prospective Longitudinal Cohort,HMB-MDS --- Health/medical/biomedical (mds),phs001672,PMC7485556,32451486,MED
