# ReportDbGaPRequests (Version for Google Colab)
This notebook runs a query on the dbGaP website and returns matching studies. For each study, it retrieves the authorized dataset access requests.

The number of access requests per study is a measure of data reuse.

Created: 2023-02-25
Author: Peter W. Rose (pwrose@ucsd.edu)

In [None]:
#@title Enter dbGaP query term and then select ```Run All``` from ```Runtime``` menu {run: "auto"}
query = '' #@param {type:"string"}
print(f"dbGaP query: {query}")

dbGaP query: radx-rad
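
The query term is later interpolated directly into the dbGaP advanced-search URL. A term containing spaces or reserved characters may need percent-encoding first. The following is a minimal sketch of that step; the helper name `build_search_url` and the second example term are hypothetical and not part of the notebook.

```python
from urllib.parse import quote

# Base URL as used in the "Running query" cell of this notebook
SEARCH_URL = "https://www.ncbi.nlm.nih.gov/gap/advanced_search/"

def build_search_url(term: str) -> str:
    """Return the advanced-search URL with the query term percent-encoded.

    Hypothetical helper for illustration only.
    """
    return f"{SEARCH_URL}?TERM={quote(term, safe='')}"

print(build_search_url("radx-rad"))
print(build_search_url("COVID-19 wastewater"))
```

A simple term such as `radx-rad` passes through unchanged, so this would only matter for multi-word or punctuated queries.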


In [None]:
%%capture
#@title Installing software on Google Colab
!pip install selenium
!apt-get update
!apt-get install firefox

In [None]:
#@title Importing packages
import os
import shutil
import glob
import time
from tqdm import tqdm
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options
from google.colab import data_table
data_table.enable_dataframe_formatter()
pd.set_option('display.max_colwidth', None)

In [None]:
#@title Running query
# directory where Firefox saves the downloaded csv file
TMP_DIR = "/tmp"

def driversetup(download_dir):
    options = Options()
    #run Selenium in headless mode
    options.add_argument("--headless")
    options.add_argument("--no-sandbox")
    # https://stackoverflow.com/questions/60170311/how-to-switch-download-directory-using-selenium-firefox-python
    # 0: download to the desktop, 1 download to the default "Downloads" directory, 2 use specified directory
    options.set_preference("browser.download.folderList", 2)
    options.set_preference("browser.download.manager.showWhenStarting", False)
    options.set_preference("browser.download.dir", download_dir)
    options.set_preference("browser.helperApps.neverAsk.saveToDisk", "text/csv")
    
    # https://stackoverflow.com/questions/42204897/how-to-set-up-a-selenium-python-environment-for-firefox
    driver = webdriver.Firefox(options=options)
    driver.implicitly_wait(5)

    return driver

def download_dbgap_studies(query, filepath):
    # clean up any previously downloaded csv files
    files = glob.glob(os.path.join(TMP_DIR, "*.csv"))
    for file in files:
        os.remove(file)
    
    # download csv file
    driver = driversetup(TMP_DIR)
    driver.get(f"https://www.ncbi.nlm.nih.gov/gap/advanced_search/?TERM={query}")
    time.sleep(3)
    print("Running: ", driver.title)
    button = driver.find_element(By.CLASS_NAME, "svr_container")
    time.sleep(3)
    print("Downloading file: studies.csv")
    button.click()
    # wait until download is completed
    time.sleep(15)
    # quit() ends the whole browser session, not just the current window
    driver.quit()
                  
    # move downloaded csv file to a standard location
    move_studies_file(filepath)
    
def move_studies_file(filepath):
    """ Move downloaded file to a specified standard location"""
    # the file name of the downloaded csv file is unknown in advance,
    # but there should be only one csv file.
    files = glob.glob(os.path.join(TMP_DIR, "*.csv"))
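    if len(files) == 1:
        shutil.move(files[0], filepath)
    else:
        # fail early so a missing or ambiguous download is not silently ignored
        raise RuntimeError(f"query error: expected 1 downloaded csv file in {TMP_DIR}, found {len(files)}")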
        
filepath = "studies.csv"
download_dbgap_studies(query, filepath)

studies = pd.read_csv(filepath, usecols=["accession", "name", "description", "Study Design", "Study Consent",])

Running:  dbGaP Advanced Search
Downloading file: studies.csv
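
The cell above waits a fixed 15 seconds for the download to complete. As an alternative, the download directory could be polled until a csv file is present. The sketch below assumes that Firefox keeps a temporary `*.part` file next to the target while a download is in progress; the function name `wait_for_csv` and its parameters are hypothetical.

```python
import glob
import os
import tempfile
import time

def wait_for_csv(download_dir, timeout=60.0, poll=0.5):
    """Poll download_dir until a csv file exists and no *.part file remains.

    Hypothetical helper; assumes in-progress downloads leave a *.part file.
    Returns the path of the first csv file found, or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        csv_files = glob.glob(os.path.join(download_dir, "*.csv"))
        partial_files = glob.glob(os.path.join(download_dir, "*.part"))
        if csv_files and not partial_files:
            return csv_files[0]
        time.sleep(poll)
    raise TimeoutError(f"No completed csv download in {download_dir} after {timeout} s")

# Demonstration with a temporary directory standing in for the download directory
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "export.csv"), "w").close()
    found = wait_for_csv(demo_dir, timeout=5)
    print(os.path.basename(found))
```

In the notebook, a call such as `wait_for_csv(TMP_DIR)` could stand in for `time.sleep(15)`, so a fast download would not wait the full interval and a slow one would not be cut off.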


In [None]:
#@title Table of studies
print(f"Number of studies for {query}:", studies.shape[0])
data_table.DataTable(studies, include_index=False, num_rows_per_page=10)

Number of studies for radx-rad: 48


Unnamed: 0,accession,name,description,Study Design,Study Consent
0,phs002585.v1.p1,Rapid Acceleration of Diagnostics - Radical (RADx-rad): AICORE-kids,"This work is directed at characterizing pediatric COVID-19 and stratifying incoming patients by projected (future) disease severity. Such stratification has several implications: immediately improving treatment planning, and as disease mechanistic regulatory milestones intended to conform with the Emergency Use Authorization (EUA) programs in effect for SARS-CoV-2 diagnostics. Note for data in RADx",Case Set,GRU --- General research use
1,phs002525.v1.p1,Rapid Acceleration of Diagnostics - Radical (RADx-rad): SF-RAD: Development and Proof-of-Concept Implementation of the South Florida Miami RADx-rad SARS-CoV-2 Wastewater-Based Surveillance Infrastructure,"The University of Miami (UM), with three primary campuses in Miami, Florida, is geographically spread within one of the worst current COVID-19 hotbeds. UM has deployed an elaborate human surveillance strategies. Working closely with the RADx-rad Data Coordination Center (DCC), this application (SF-RAD) will develop and implement data standards and",Case Set,GRU --- General research use
2,phs002782.v1.p1,Rapid Acceleration of Diagnostics - Radical (RADx-rad): NIEHS Diagnostic-Prognostic RNAseq,"Infectious disease outbreaks like Coronavirus Disease 2019 (COVID-19) can overwhelm healthcare systems when screening tools are scarce or lacking. In the face of an ongoing COVID-19 pandemic and with single-plex , long queue times, backlogs in COVID-19 diagnoses, and delayed access to specialized treatment for COVID-19 patients. The goal of this RADx funded project",Case Set,GRU --- General research use
3,phs002679.v1.p1,Rapid Acceleration of Diagnostics - Radical (RADx-rad): Wastewater Detection of COVID-19,"When faced with a pandemic such as SARS-Coronavirus-2 (SAR-CoV-2), the virus responsible for COVID-19, timely risk assessment and action are required to prevent public health impacts to entire communities. Because existing and emerging variants from wastewater, and 3) design platforms for communicating wastewater variant results to the public. Note for data in RADx",Collection,GRU --- General research use
4,phs002600.v1.p1,Rapid Acceleration of Diagnostics - Radical (RADx-rad): Portable GC Detector for COVID Diagnostics,"The data herein combines GC-MS and GC-DMS analysis of exhaled breath vapor compounds. The intent of this study is to develop a portable GC-DMS system to diagnose SARS-CoV-2 infections from , weight, symptoms at time of sampling, etc., was also collected. Note for data in RADx: Instructions for requesting individual-level data are available on",Collection,GRU --- General research use
5,phs002603.v1.p1,Rapid Acceleration of Diagnostics - Radical (RADx-rad): Diagnosis of MIS-C in Febrile Children,"The recent emergence of SARS-CoV2 and resultant pandemic of COVID-19 disease has overwhelmed global health systems and led to over 200,000 American deaths to date. While initial reports suggested that diagnostic strategy to distinguish children with MIS-C from children with other causes of fever. Note for data in RADx: Instructions for requesting individual",Prospective Longitudinal Cohort,GRU --- General research use
6,phs002685.v1.p1,Rapid Acceleration of Diagnostics - Radical (RADx-rad): DNA Star SAS-CoV-2 Rapid Test,"Automated, rapid diagnostics with little sample collection and preparation are needed to identify and trace affected persons in times when hyper-infectious pathogens cause pandemics. Frequent, low cost and highly scalable to results in minutes) and cost effective ( $3 per test). Note for data in RADx: Instructions for requesting individual-level data are available on",Case Set,GRU --- General research use
7,phs002583.v1.p1,Rapid Acceleration of Diagnostics - Radical (RADx-rad): A Rapid Breathalyzer Diagnostics Platform for COVID-19,"We propose to develop a novel testing platform that detects SARS-CoV-2 virions in a patient's breath. When a person exhales into the COVID breathalyzer, droplets and other emitted particles are , and has accurate reporting with high sensitivity and specificity. Note for data in RADx: Instructions for requesting individual-level data are available",Collection,GRU --- General research use
8,phs002604.v1.p1,Rapid Acceleration of Diagnostics - Radical (RADx-rad): Tracking the COVID-19 Epidemic in Sewage (TRACES),"Wastewater based testing (WBT) holds great promise for cost-effective population surveillance and transmission tracking of SARS-CoV-2, but optimal sampling modalities and protocols are unknown. Taking advantage of a diverse inner develop point-of-use microfluidics systems for timely WBT. Note for data in RADx: Instructions for requesting individual-level data are available on RADx",Prospective Longitudinal Cohort,GRU --- General research use
9,phs002524.v1.p1,Rapid Acceleration of Diagnostics - Radical (RADx-rad): Validation of Smart Masks for Surveillance of COVID-19,Vulnerable populations do not just need testing - they need surveillance. The ideal surveillance tool would operate in the background with minimal involvement of the population to be tested; it . Note for data in RADx: Instructions for requesting individual-level data are available on RADx Data Hub at https://radx-hub.nih.gov/home. Apply for data,Collection,GRU --- General research use


In [None]:
#@title Table of approved requests for datasets
def get_download_url(accession):
    return "https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetAuthorizedRequestDownload.cgi?study_id=" + accession

def get_authorized_requests(studies):
    authorized_requests = pd.DataFrame()

    for _, row in tqdm(studies.iterrows(), total=studies.shape[0]):
        try:
            df = pd.read_csv(get_download_url(row["accession"]), 
                             usecols=["Requestor", "Affiliation", "Project", "Date of approval", "Request status", 
                                      "Public Research Use Statement", "Technical Research Use Statement"],
                            sep="\t")
            df["accession"] = row["accession"]
            df["name"] = row["name"]
            authorized_requests = pd.concat([authorized_requests, df], ignore_index=True)
        except Exception:
            print(f"Skipping: {row['accession']} - no data access through dbGaP.")
                                        
    return authorized_requests

requests = get_authorized_requests(studies)
print()
print()
print("Number of authorized requests :", requests.shape[0])
print("Number of unique requestors   :", len(requests["Requestor"].unique()))
print("Number of unique studies      :", len(requests["accession"].unique()))
data_table.DataTable(requests, include_index=False, num_rows_per_page=10)

100%|██████████| 48/48 [00:27<00:00,  1.76it/s]



Number of authorized requests : 51
Number of unique requestors   : 3
Number of unique studies      : 48





Unnamed: 0,Requestor,Affiliation,Project,Date of approval,Request status,Public Research Use Statement,Technical Research Use Statement,accession,name
0,"Rose, Peter","UNIVERSITY OF CALIFORNIA, SAN DIEGO",Analysis and Evaluation of RADx-rad Datasets,"Mar07, 2023",approved,"The National Institutes of Health launched the RADx? Radical program to support innovative, non-traditional diagnostic approaches to address gaps in COVID-19 testing and surveillance. This request is to support the evaluation of RADx-rad program outcomes by the RADx-rad Discoveries and Research Center (DCC).","This request supports the RADx-rad program outcomes evaluation by the RADx-rad Discoveries and Research Center (DCC). DCC facilitates “data standardization, harmonization, integration, and analysis across RADx-rad projects and coordinates quality control, data curation, and analyses, and provides tools to monitor progress, performance, and use of the curated data” (RFA-OD-20-019). To fulfill these requirements, we will perform the following evaluations: 1. Measure the extent of shared data elements across (i) projects within the same or related FOA areas, (ii), projects across all RADx-rad. 2. Evaluate the quality and consistency of data submissions. 3. Compare the performance of diagnostic methods for analytical and clinical performance. 4. Correlate analytical with clinical performance. 5. Develop data analysis templates and use cases to demonstrate data integration across RADx-rad projects. 6. Evaluate the ease of use and capabilities of the NIH Data Hub to (i) access datasets, (ii) data dictionaries, (iii) develop and run data analyses (for # 1. – 5.) in the Data Hub workbench.",phs002585.v1.p1,Rapid Acceleration of Diagnostics - Radical (RADx-rad): AICORE-kids
1,"Ciofani, Danielle","BROAD INSTITUTE, INC.",Testing Data Access for RADx developer (2),"Dec12, 2022",approved,I will be confirming data access using the Hub.,I am testing data access for select RADx program data. I will be using the Data Hub to access the data.,phs002525.v1.p1,Rapid Acceleration of Diagnostics - Radical (RADx-rad): SF-RAD: Development and Proof-of-Concept Implementation of the South Florida Miami RADx-rad SARS-CoV-2 Wastewater-Based Surveillance Infrastructure
2,"Rose, Peter","UNIVERSITY OF CALIFORNIA, SAN DIEGO",Analysis and Evaluation of RADx-rad Datasets,"Mar07, 2023",approved,"The National Institutes of Health launched the RADx? Radical program to support innovative, non-traditional diagnostic approaches to address gaps in COVID-19 testing and surveillance. This request is to support the evaluation of RADx-rad program outcomes by the RADx-rad Discoveries and Research Center (DCC).","This request supports the RADx-rad program outcomes evaluation by the RADx-rad Discoveries and Research Center (DCC). DCC facilitates “data standardization, harmonization, integration, and analysis across RADx-rad projects and coordinates quality control, data curation, and analyses, and provides tools to monitor progress, performance, and use of the curated data” (RFA-OD-20-019). To fulfill these requirements, we will perform the following evaluations: 1. Measure the extent of shared data elements across (i) projects within the same or related FOA areas, (ii), projects across all RADx-rad. 2. Evaluate the quality and consistency of data submissions. 3. Compare the performance of diagnostic methods for analytical and clinical performance. 4. Correlate analytical with clinical performance. 5. Develop data analysis templates and use cases to demonstrate data integration across RADx-rad projects. 6. Evaluate the ease of use and capabilities of the NIH Data Hub to (i) access datasets, (ii) data dictionaries, (iii) develop and run data analyses (for # 1. – 5.) in the Data Hub workbench.",phs002525.v1.p1,Rapid Acceleration of Diagnostics - Radical (RADx-rad): SF-RAD: Development and Proof-of-Concept Implementation of the South Florida Miami RADx-rad SARS-CoV-2 Wastewater-Based Surveillance Infrastructure
3,"Rose, Peter","UNIVERSITY OF CALIFORNIA, SAN DIEGO",Analysis and Evaluation of RADx-rad Datasets,"Mar07, 2023",approved,"The National Institutes of Health launched the RADx? Radical program to support innovative, non-traditional diagnostic approaches to address gaps in COVID-19 testing and surveillance. This request is to support the evaluation of RADx-rad program outcomes by the RADx-rad Discoveries and Research Center (DCC).","This request supports the RADx-rad program outcomes evaluation by the RADx-rad Discoveries and Research Center (DCC). DCC facilitates “data standardization, harmonization, integration, and analysis across RADx-rad projects and coordinates quality control, data curation, and analyses, and provides tools to monitor progress, performance, and use of the curated data” (RFA-OD-20-019). To fulfill these requirements, we will perform the following evaluations: 1. Measure the extent of shared data elements across (i) projects within the same or related FOA areas, (ii), projects across all RADx-rad. 2. Evaluate the quality and consistency of data submissions. 3. Compare the performance of diagnostic methods for analytical and clinical performance. 4. Correlate analytical with clinical performance. 5. Develop data analysis templates and use cases to demonstrate data integration across RADx-rad projects. 6. Evaluate the ease of use and capabilities of the NIH Data Hub to (i) access datasets, (ii) data dictionaries, (iii) develop and run data analyses (for # 1. – 5.) in the Data Hub workbench.",phs002782.v1.p1,Rapid Acceleration of Diagnostics - Radical (RADx-rad): NIEHS Diagnostic-Prognostic RNAseq
4,"Rose, Peter","UNIVERSITY OF CALIFORNIA, SAN DIEGO",Analysis and Evaluation of RADx-rad Datasets,"Mar07, 2023",approved,"The National Institutes of Health launched the RADx? Radical program to support innovative, non-traditional diagnostic approaches to address gaps in COVID-19 testing and surveillance. This request is to support the evaluation of RADx-rad program outcomes by the RADx-rad Discoveries and Research Center (DCC).","This request supports the RADx-rad program outcomes evaluation by the RADx-rad Discoveries and Research Center (DCC). DCC facilitates “data standardization, harmonization, integration, and analysis across RADx-rad projects and coordinates quality control, data curation, and analyses, and provides tools to monitor progress, performance, and use of the curated data” (RFA-OD-20-019). To fulfill these requirements, we will perform the following evaluations: 1. Measure the extent of shared data elements across (i) projects within the same or related FOA areas, (ii), projects across all RADx-rad. 2. Evaluate the quality and consistency of data submissions. 3. Compare the performance of diagnostic methods for analytical and clinical performance. 4. Correlate analytical with clinical performance. 5. Develop data analysis templates and use cases to demonstrate data integration across RADx-rad projects. 6. Evaluate the ease of use and capabilities of the NIH Data Hub to (i) access datasets, (ii) data dictionaries, (iii) develop and run data analyses (for # 1. – 5.) in the Data Hub workbench.",phs002679.v1.p1,Rapid Acceleration of Diagnostics - Radical (RADx-rad): Wastewater Detection of COVID-19
5,"Rose, Peter","UNIVERSITY OF CALIFORNIA, SAN DIEGO",Analysis and Evaluation of RADx-rad Datasets,"Mar07, 2023",approved,"The National Institutes of Health launched the RADx? Radical program to support innovative, non-traditional diagnostic approaches to address gaps in COVID-19 testing and surveillance. This request is to support the evaluation of RADx-rad program outcomes by the RADx-rad Discoveries and Research Center (DCC).","This request supports the RADx-rad program outcomes evaluation by the RADx-rad Discoveries and Research Center (DCC). DCC facilitates “data standardization, harmonization, integration, and analysis across RADx-rad projects and coordinates quality control, data curation, and analyses, and provides tools to monitor progress, performance, and use of the curated data” (RFA-OD-20-019). To fulfill these requirements, we will perform the following evaluations: 1. Measure the extent of shared data elements across (i) projects within the same or related FOA areas, (ii), projects across all RADx-rad. 2. Evaluate the quality and consistency of data submissions. 3. Compare the performance of diagnostic methods for analytical and clinical performance. 4. Correlate analytical with clinical performance. 5. Develop data analysis templates and use cases to demonstrate data integration across RADx-rad projects. 6. Evaluate the ease of use and capabilities of the NIH Data Hub to (i) access datasets, (ii) data dictionaries, (iii) develop and run data analyses (for # 1. – 5.) in the Data Hub workbench.",phs002600.v1.p1,Rapid Acceleration of Diagnostics - Radical (RADx-rad): Portable GC Detector for COVID Diagnostics
6,"Rose, Peter","UNIVERSITY OF CALIFORNIA, SAN DIEGO",Analysis and Evaluation of RADx-rad Datasets,"Mar07, 2023",approved,"The National Institutes of Health launched the RADx? Radical program to support innovative, non-traditional diagnostic approaches to address gaps in COVID-19 testing and surveillance. This request is to support the evaluation of RADx-rad program outcomes by the RADx-rad Discoveries and Research Center (DCC).","This request supports the RADx-rad program outcomes evaluation by the RADx-rad Discoveries and Research Center (DCC). DCC facilitates “data standardization, harmonization, integration, and analysis across RADx-rad projects and coordinates quality control, data curation, and analyses, and provides tools to monitor progress, performance, and use of the curated data” (RFA-OD-20-019). To fulfill these requirements, we will perform the following evaluations: 1. Measure the extent of shared data elements across (i) projects within the same or related FOA areas, (ii), projects across all RADx-rad. 2. Evaluate the quality and consistency of data submissions. 3. Compare the performance of diagnostic methods for analytical and clinical performance. 4. Correlate analytical with clinical performance. 5. Develop data analysis templates and use cases to demonstrate data integration across RADx-rad projects. 6. Evaluate the ease of use and capabilities of the NIH Data Hub to (i) access datasets, (ii) data dictionaries, (iii) develop and run data analyses (for # 1. – 5.) in the Data Hub workbench.",phs002603.v1.p1,Rapid Acceleration of Diagnostics - Radical (RADx-rad): Diagnosis of MIS-C in Febrile Children
7,"Rose, Peter","UNIVERSITY OF CALIFORNIA, SAN DIEGO",Analysis and Evaluation of RADx-rad Datasets,"Mar07, 2023",approved,"The National Institutes of Health launched the RADx? Radical program to support innovative, non-traditional diagnostic approaches to address gaps in COVID-19 testing and surveillance. This request is to support the evaluation of RADx-rad program outcomes by the RADx-rad Discoveries and Research Center (DCC).","This request supports the RADx-rad program outcomes evaluation by the RADx-rad Discoveries and Research Center (DCC). DCC facilitates “data standardization, harmonization, integration, and analysis across RADx-rad projects and coordinates quality control, data curation, and analyses, and provides tools to monitor progress, performance, and use of the curated data” (RFA-OD-20-019). To fulfill these requirements, we will perform the following evaluations: 1. Measure the extent of shared data elements across (i) projects within the same or related FOA areas, (ii), projects across all RADx-rad. 2. Evaluate the quality and consistency of data submissions. 3. Compare the performance of diagnostic methods for analytical and clinical performance. 4. Correlate analytical with clinical performance. 5. Develop data analysis templates and use cases to demonstrate data integration across RADx-rad projects. 6. Evaluate the ease of use and capabilities of the NIH Data Hub to (i) access datasets, (ii) data dictionaries, (iii) develop and run data analyses (for # 1. – 5.) in the Data Hub workbench.",phs002685.v1.p1,Rapid Acceleration of Diagnostics - Radical (RADx-rad): DNA Star SAS-CoV-2 Rapid Test
8,"Rose, Peter","UNIVERSITY OF CALIFORNIA, SAN DIEGO",Analysis and Evaluation of RADx-rad Datasets,"Mar07, 2023",approved,"The National Institutes of Health launched the RADx? Radical program to support innovative, non-traditional diagnostic approaches to address gaps in COVID-19 testing and surveillance. This request is to support the evaluation of RADx-rad program outcomes by the RADx-rad Discoveries and Research Center (DCC).","This request supports the RADx-rad program outcomes evaluation by the RADx-rad Discoveries and Research Center (DCC). DCC facilitates “data standardization, harmonization, integration, and analysis across RADx-rad projects and coordinates quality control, data curation, and analyses, and provides tools to monitor progress, performance, and use of the curated data” (RFA-OD-20-019). To fulfill these requirements, we will perform the following evaluations: 1. Measure the extent of shared data elements across (i) projects within the same or related FOA areas, (ii), projects across all RADx-rad. 2. Evaluate the quality and consistency of data submissions. 3. Compare the performance of diagnostic methods for analytical and clinical performance. 4. Correlate analytical with clinical performance. 5. Develop data analysis templates and use cases to demonstrate data integration across RADx-rad projects. 6. Evaluate the ease of use and capabilities of the NIH Data Hub to (i) access datasets, (ii) data dictionaries, (iii) develop and run data analyses (for # 1. – 5.) in the Data Hub workbench.",phs002583.v1.p1,Rapid Acceleration of Diagnostics - Radical (RADx-rad): A Rapid Breathalyzer Diagnostics Platform for COVID-19
9,"Rose, Peter","UNIVERSITY OF CALIFORNIA, SAN DIEGO",Analysis and Evaluation of RADx-rad Datasets,"Mar07, 2023",approved,"The National Institutes of Health launched the RADx? Radical program to support innovative, non-traditional diagnostic approaches to address gaps in COVID-19 testing and surveillance. This request is to support the evaluation of RADx-rad program outcomes by the RADx-rad Discoveries and Research Center (DCC).","This request supports the RADx-rad program outcomes evaluation by the RADx-rad Discoveries and Research Center (DCC). DCC facilitates “data standardization, harmonization, integration, and analysis across RADx-rad projects and coordinates quality control, data curation, and analyses, and provides tools to monitor progress, performance, and use of the curated data” (RFA-OD-20-019). To fulfill these requirements, we will perform the following evaluations: 1. Measure the extent of shared data elements across (i) projects within the same or related FOA areas, (ii), projects across all RADx-rad. 2. Evaluate the quality and consistency of data submissions. 3. Compare the performance of diagnostic methods for analytical and clinical performance. 4. Correlate analytical with clinical performance. 5. Develop data analysis templates and use cases to demonstrate data integration across RADx-rad projects. 6. Evaluate the ease of use and capabilities of the NIH Data Hub to (i) access datasets, (ii) data dictionaries, (iii) develop and run data analyses (for # 1. – 5.) in the Data Hub workbench.",phs002604.v1.p1,Rapid Acceleration of Diagnostics - Radical (RADx-rad): Tracking the COVID-19 Epidemic in Sewage (TRACES)
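
The introduction describes the number of access requests per study as a measure of data reuse. A minimal sketch of how that count could be derived from a table shaped like `requests` follows. The sample rows, accessions, and the helper name `summarize_reuse` are made up for illustration and are not notebook output.

```python
import pandas as pd

# Hypothetical stand-in for the `requests` table built above (made-up values)
sample = pd.DataFrame({
    "accession": ["phs000001.v1.p1", "phs000001.v1.p1", "phs000002.v1.p1"],
    "Requestor": ["Doe, Jane", "Roe, Richard", "Doe, Jane"],
})

def summarize_reuse(requests_df):
    """Count authorized requests and distinct requestors per study accession."""
    return (
        requests_df.groupby("accession")
        .agg(n_requests=("Requestor", "size"),
             n_requestors=("Requestor", "nunique"))
        .reset_index()
        .sort_values("n_requests", ascending=False, ignore_index=True)
    )

reuse = summarize_reuse(sample)
print(reuse)
```

Studies without any authorized request do not appear in such a summary; a left merge with the `studies` table would be one way to include them with a count of zero.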
