# RADx Reporter (Version for Google Colab)
This notebook runs a query on the dbGaP website for RADx projects and returns matching studies. For each study, it retrieves the authorized dataset access requests. Access requests for testing the RADx Data Hub can be excluded.

The number of access requests per study is a measure of data reuse.

Created: 2023-03-09

Author: Peter W. Rose (pwrose@ucsd.edu)

In [42]:
#@title Enter a dbGaP query term, then select ```Run All``` from the ```Runtime``` menu {run: "auto"}
#@markdown ### Enter a query term for dbGaP
query = 'radx-dht' #@param {type:"string"}
print(f"dbGaP query: {query}")
# requestors whose access requests count as tests of the RADx Data Hub
developers = ["Rose, Peter ", "Ciofani, Danielle ", "Krishnamurthy, Ashok ", "Claypool, Kajal "]
#@markdown ### Exclude test requests
exclude_tests = True #@param {type:"boolean"}
print(f"exclude test requests: {exclude_tests}")

dbGaP query: radx-dht
exclude test requests: True
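
The `developers` list in the cell above is matched against the `Requestor` column to drop test requests. A minimal sketch of that filter on made-up rows, assuming (as the trailing spaces in the list suggest) that names may carry stray whitespace. The helper `exclude_requestors` and the sample rows are hypothetical and not part of the notebook:

```python
import pandas as pd

def exclude_requestors(requests, names):
    """Drop rows whose Requestor matches one of `names`, ignoring surrounding whitespace."""
    # hypothetical helper: normalize both sides so "Rose, Peter " equals "Rose, Peter"
    normalized = [name.strip() for name in names]
    mask = requests["Requestor"].str.strip().isin(normalized)
    return requests[~mask]

# made-up sample rows for illustration only
sample = pd.DataFrame({
    "Requestor": ["Rose, Peter ", "Rose, Peter", "Doe, Jane "],
    "Project": ["test A", "test B", "study C"],
})
kept = exclude_requestors(sample, ["Rose, Peter "])
print(kept["Requestor"].tolist())
```

Stripping whitespace on both sides keeps the filter working even if a name appears with and without a trailing space.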


In [43]:
%%capture
#@title Installing software on Google Colab
!pip install selenium
!apt-get update
!apt-get install -y firefox

In [44]:
#@title Importing packages
import os
import shutil
import glob
import time
from tqdm import tqdm
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options
from google.colab import data_table
data_table.enable_dataframe_formatter()
pd.set_option('display.max_colwidth', None)

In [45]:
#@title Running query
# directory that receives the file downloaded by Firefox
TMP_DIR = "/tmp"

def driversetup(download_dir):
    options = Options()
    # run Selenium in headless mode
    options.add_argument("--headless")
    options.add_argument("--no-sandbox")
    # https://stackoverflow.com/questions/60170311/how-to-switch-download-directory-using-selenium-firefox-python
    # browser.download.folderList: 0 = desktop, 1 = default "Downloads" directory, 2 = directory set in browser.download.dir
    options.set_preference("browser.download.folderList", 2)
    options.set_preference("browser.download.manager.showWhenStarting", False)
    options.set_preference("browser.download.dir", download_dir)
    options.set_preference("browser.helperApps.neverAsk.saveToDisk", "text/csv")
    
    # https://stackoverflow.com/questions/42204897/how-to-set-up-a-selenium-python-environment-for-firefox
    driver = webdriver.Firefox(options=options)
    driver.implicitly_wait(5)

    return driver

def download_dbgap_studies(query, filepath):
    # clean up any previously downloaded csv files
    files = glob.glob(os.path.join(TMP_DIR, "*.csv"))
    for file in files:
        os.remove(file)
    
    # download csv file
    driver = driversetup(TMP_DIR)
    try:
        driver.get(f"https://www.ncbi.nlm.nih.gov/gap/advanced_search/?TERM={query}")
        time.sleep(3)
        print("Running: ", driver.title)
        button = driver.find_element(By.CLASS_NAME, "svr_container")
        time.sleep(3)
        print("Downloading file: studies.csv")
        button.click()
        # wait until download is completed
        time.sleep(15)
    finally:
        # quit() ends the browser session even if one of the steps above fails
        driver.quit()

    # move downloaded csv file to a standard location
    move_studies_file(filepath)
    
def move_studies_file(filepath):
    """Move the downloaded file to the specified standard location."""
    # the file name of the downloaded csv file is unknown in advance,
    # but there should be exactly one csv file.
    files = glob.glob(os.path.join(TMP_DIR, "*.csv"))
    if len(files) != 1:
        # stop here instead of continuing with a missing or stale studies file
        raise RuntimeError(f"query error: expected 1 csv file in {TMP_DIR}, found {len(files)}")
    shutil.move(files[0], filepath)
        
filepath = "studies.csv"
download_dbgap_studies(query, filepath)

studies = pd.read_csv(filepath, usecols=["accession", "name", "description", "Study Design", "Study Consent"])

Running:  dbGaP Advanced Search
Downloading file: studies.csv
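
The fixed `time.sleep(15)` above assumes the download always finishes within 15 seconds. A possible alternative, sketched here under the assumption that Firefox keeps a temporary `.part` file next to an unfinished download, is to poll the download directory until exactly one finished csv file is present. The helper name `wait_for_csv` and its `timeout` and `poll` parameters are hypothetical and not part of the notebook:

```python
import glob
import os
import time

def wait_for_csv(download_dir, timeout=60.0, poll=0.5):
    """Return the path of the single csv file in download_dir once its download has finished."""
    deadline = time.monotonic() + timeout
    while True:
        csv_files = glob.glob(os.path.join(download_dir, "*.csv"))
        # assumption: an unfinished Firefox download leaves a *.part file behind
        partial_files = glob.glob(os.path.join(download_dir, "*.part"))
        if len(csv_files) == 1 and not partial_files:
            return csv_files[0]
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no finished csv file in {download_dir} after {timeout} s")
        time.sleep(poll)
```

In `download_dbgap_studies`, a call such as `wait_for_csv(TMP_DIR)` could stand in for the fixed sleep, so that short downloads return early and slow ones fail with a clear error.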


In [46]:
#@title Table of studies
print(f"Number of studies for {query}:", studies.shape[0])
data_table.DataTable(studies, include_index=False, num_rows_per_page=10)

Number of studies for radx-dht: 10


Unnamed: 0,accession,name,description,Study Design,Study Consent
0,phs002537.v1.p1,Rapid Acceleration of Diagnostics - Digital Health Technologies (RADx-DHT): COVID-19 Experience Study (C19EX) Survey,This was conducted virtually through the Achievement studies platform during the current COVID-19 pandemic. Participants were asked to complete a survey every day to capture information about whether they had : https://rapids.ll.mit.edu/10.57895/6m5z-je42 Note for data in RADx: Instructions for requesting individual-level data are available on RADx Data Hub at,Case Set,GRU --- General research use
1,phs002539.v1.p1,Rapid Acceleration of Diagnostics - Digital Health Technologies (RADx-DHT): Large Scale Flu Surveillance Study (LSFS),"The purpose of this study was to better understand behavioral and physiological functioning in relation to recent self-reported influenza and influenza-like-illness (ILI), including coronavirus disease (COVID-19). Over 65,000 Achievement members in RADx: Instructions for requesting individual-level data are available on RADx Data Hub at https://radx-hub.nih.gov/home. Apply for data access in",Collection,GRU --- General research use
2,phs002534.v1.p1,Rapid Acceleration of Diagnostics - Digital Health Technologies (RADx-DHT): NIH Digital Health Solutions for COVID-19: Team SAE,The goal of this project is to develop a smartphone-based platform to monitor and support individuals with COVID-19 symptoms (who may need testing) and those who have already tested positive. : https://rapids.ll.mit.edu/10.57895/wv88-by98 Note for data in RADx: Instructions for requesting individual-level data are available on RADx Data Hub at,Case Set,GRU --- General research use
3,phs002538.v1.p1,Rapid Acceleration of Diagnostics - Digital Health Technologies (RADx-DHT): ILI Labels and Longitudinal Novel Engagement with Symptom Surveillance (ILLNESS) Study,"This study is a prospective observational study, approximately seven months in duration. Participants were asked to complete a weekly survey online asking about their ILI (influenza-like illness) experience over the for data in RADx: Instructions for requesting individual-level data are available on RADx Data Hub at https://radx-hub.nih.gov/home. Apply for data",Prospective Longitudinal Cohort,GRU --- General research use
4,phs002533.v1.p1,Rapid Acceleration of Diagnostics - Digital Health Technologies (RADx-DHT): Digital Health Solutions for COVID-19: COVID Community Action and Research Engagement (COVID-CARE),"Vibrent Health will expand the Vibrent Digital Health Solutions Platform (DHSP) implementation to additional populations among diverse user groups for additional validation of the technology's performance, usability, and reliability in solution and the NCI data hub. DOI: https://rapids.ll.mit.edu/10.57895/ravs-1b57 Note for data in RADx: Instructions for requesting individual-level data are",Case Set,GRU --- General research use
5,phs002535.v1.p1,Rapid Acceleration of Diagnostics - Digital Health Technologies (RADx-DHT): Personalized Analytics and Wearable Biosensor Platform for Early Detection of Covid-19 Decompensation (DECODE),"The goal of this project is to develop an artificial intelligence-based data analytics and cloud computing platform, paired with U.S. Food and Drug Administration (FDA)-cleared wearable devices, to create a (ROC) area under the curve (AUC) as the metric of performance. DOI: https://rapids.ll.mit.edu/10.57895/6d2f-c112 Note for data in RADx: Instructions for",Case Set,GRU --- General research use
6,phs002516.v1.p1,Rapid Acceleration of Diagnostics - Digital Health Technologies (RADx-DHT): NIH Digital Health Solutions for COVID-19: IBM Covid19 Contact Tracing and Data Exchange Tools,The goal of this project is to develop both contact tracing and secure data exchange tools. The contact tracing solution securely combines data from a variety of sources (including manual -19. DOI: https://rapids.ll.mit.edu/10.57895/h0an-m559 Note for data in RADx: Instructions for requesting individual-level data are available on RADx,Prospective Longitudinal Cohort,GRU --- General research use
7,phs002540.v1.p1,Rapid Acceleration of Diagnostics - Digital Health Technologies (RADx-DHT): NIH Digital Health Solutions for COVID-19: SAFER-COVID - Integration of Testing and Digital Health,"SAFER-COVID provides a set of self-management tools to consumers to track symptoms, test results, vaccine record, and environmental factors, such as exposure to others. Consumers may choose to integrate data , activity risk assessment and self-management within SAFER-COVID. DOI: https://rapids.ll.mit.edu/10.57895/cmt5-gh78 Note for data in RADx: Instructions for",Prospective Longitudinal Cohort,GRU --- General research use
8,phs002628.v1.p1,Rapid Acceleration of Diagnostics - Digital Health Technologies (RADx-DHT): NIH Digital Health Solutions for COVID-19: Clear2Go - A Digital Identity Wallet for Health Status,"Clear2Go is a solution/app that provides digital, non-refutable cryptographic proof of testing or vaccination that can be used to evaluate risk of allowing individuals to return to normal work, travel, : https://rapids.ll.mit.edu/10.57895/b2d6-8060 Note for data in RADx: Instructions for requesting individual-level data are available on RADx Data Hub at",Clinical Trial,GRU --- General research use
9,phs002519.v1.p1,Rapid Acceleration of Diagnostics - Digital Health Technologies (RADx-DHT): Covidseeker and COVID-19 Citizen Science: Leveraging Citizen Science and Real-Time Geospatial Temporal Mobile Data for Digital Contact Tracing and SARS-CoV-2 Hotspotting,The Covidseeker and COVID-19 Citizen Science Study integrates a retrospectively-determined geolocation digital program into an established digital infrastructure housed within the NIH-funded Eureka platform to enroll SARS-CoV-2 positive and negative ://rapids.ll.mit.edu/10.57895/me7r-vp06 Note for data in RADx: Instructions for requesting individual-level data are available on RADx Data Hub at https://radx,Case Set,GRU --- General research use


In [47]:
#@title Table of approved requests for datasets
def get_download_url(accession):
    return "https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetAuthorizedRequestDownload.cgi?study_id=" + accession

def get_authorized_requests(studies):
    """Collect the authorized access requests of all studies in one DataFrame."""
    columns = ["Requestor", "Affiliation", "Project", "Date of approval", "Request status",
               "Public Research Use Statement", "Technical Research Use Statement"]
    frames = []

    for _, row in tqdm(studies.iterrows(), total=studies.shape[0]):
        try:
            df = pd.read_csv(get_download_url(row["accession"]), usecols=columns, sep="\t")
        except Exception:
            print(f"Skipping: {row['accession']} - no data access through dbGaP.")
            continue
        df["accession"] = row["accession"]
        df["name"] = row["name"]
        frames.append(df)

    # concatenate once; return an empty DataFrame if no requests were retrieved
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

requests = get_authorized_requests(studies)

# exclude test requests
if exclude_tests:
    requests = requests[~requests["Requestor"].isin(developers)]

print(requests["Requestor"].unique())
print()
print()
print("Number of authorized requests :", requests.shape[0])
print("Number of unique requestors   :", len(requests["Requestor"].unique()))
print("Number of unique studies      :", len(requests["accession"].unique()))
data_table.DataTable(requests, include_index=False, num_rows_per_page=10)

100%|██████████| 10/10 [00:07<00:00,  1.39it/s]

['Anwar, Mohd Mozharul' 'Davis-Dusenbery, Brandi ']


Number of authorized requests : 11
Number of unique requestors   : 2
Number of unique studies      : 10





Unnamed: 0,Requestor,Affiliation,Project,Date of approval,Request status,Public Research Use Statement,Technical Research Use Statement,accession,name
0,"Anwar, Mohd Mozharul",NIH,Exploration of Wearable Device Data for COVID-19,"Dec19, 2022",approved,"Wearable devices collect various physiological signals and measurements such as heart-beat rate, respiration rate, sleep, and body movement. This study explores wearable device data for detection, risk measurement, or prediction of disease like COVID-19.","The objective of the proposed research is to explore how wearable device data from different sources can be aggregated and analyzed to draw meaningful scientific insights, such as detection, risk stratification, or prediction of disease. The datasets will be utilized to build various machine learning/deep learning models. In the process, the datasets will be standardized, normalized, and harmonized. Furthermore, it will be studied how to leverage survey data in interpreting wearable device data. The different models will be compared on the basis of various performance metrics.",phs002537.v1.p1,Rapid Acceleration of Diagnostics - Digital Health Technologies (RADx-DHT): COVID-19 Experience Study (C19EX) Survey
2,"Anwar, Mohd Mozharul",NIH,Exploration of Wearable Device Data for COVID-19,"Dec19, 2022",approved,"Wearable devices collect various physiological signals and measurements such as heart-beat rate, respiration rate, sleep, and body movement. This study explores wearable device data for detection, risk measurement, or prediction of disease like COVID-19.","The objective of the proposed research is to explore how wearable device data from different sources can be aggregated and analyzed to draw meaningful scientific insights, such as detection, risk stratification, or prediction of disease. The datasets will be utilized to build various machine learning/deep learning models. In the process, the datasets will be standardized, normalized, and harmonized. Furthermore, it will be studied how to leverage survey data in interpreting wearable device data. The different models will be compared on the basis of various performance metrics.",phs002539.v1.p1,Rapid Acceleration of Diagnostics - Digital Health Technologies (RADx-DHT): Large Scale Flu Surveillance Study (LSFS)
4,"Anwar, Mohd Mozharul",NIH,Exploration of Wearable Device Data for COVID-19,"Dec19, 2022",approved,"Wearable devices collect various physiological signals and measurements such as heart-beat rate, respiration rate, sleep, and body movement. This study explores wearable device data for detection, risk measurement, or prediction of disease like COVID-19.","The objective of the proposed research is to explore how wearable device data from different sources can be aggregated and analyzed to draw meaningful scientific insights, such as detection, risk stratification, or prediction of disease. The datasets will be utilized to build various machine learning/deep learning models. In the process, the datasets will be standardized, normalized, and harmonized. Furthermore, it will be studied how to leverage survey data in interpreting wearable device data. The different models will be compared on the basis of various performance metrics.",phs002534.v1.p1,Rapid Acceleration of Diagnostics - Digital Health Technologies (RADx-DHT): NIH Digital Health Solutions for COVID-19: Team SAE
6,"Anwar, Mohd Mozharul",NIH,Exploration of Wearable Device Data for COVID-19,"Dec19, 2022",approved,"Wearable devices collect various physiological signals and measurements such as heart-beat rate, respiration rate, sleep, and body movement. This study explores wearable device data for detection, risk measurement, or prediction of disease like COVID-19.","The objective of the proposed research is to explore how wearable device data from different sources can be aggregated and analyzed to draw meaningful scientific insights, such as detection, risk stratification, or prediction of disease. The datasets will be utilized to build various machine learning/deep learning models. In the process, the datasets will be standardized, normalized, and harmonized. Furthermore, it will be studied how to leverage survey data in interpreting wearable device data. The different models will be compared on the basis of various performance metrics.",phs002538.v1.p1,Rapid Acceleration of Diagnostics - Digital Health Technologies (RADx-DHT): ILI Labels and Longitudinal Novel Engagement with Symptom Surveillance (ILLNESS) Study
9,"Anwar, Mohd Mozharul",NIH,Exploration of Wearable Device Data for COVID-19,"Dec19, 2022",approved,"Wearable devices collect various physiological signals and measurements such as heart-beat rate, respiration rate, sleep, and body movement. This study explores wearable device data for detection, risk measurement, or prediction of disease like COVID-19.","The objective of the proposed research is to explore how wearable device data from different sources can be aggregated and analyzed to draw meaningful scientific insights, such as detection, risk stratification, or prediction of disease. The datasets will be utilized to build various machine learning/deep learning models. In the process, the datasets will be standardized, normalized, and harmonized. Furthermore, it will be studied how to leverage survey data in interpreting wearable device data. The different models will be compared on the basis of various performance metrics.",phs002533.v1.p1,Rapid Acceleration of Diagnostics - Digital Health Technologies (RADx-DHT): Digital Health Solutions for COVID-19: COVID Community Action and Research Engagement (COVID-CARE)
11,"Anwar, Mohd Mozharul",NIH,Exploration of Wearable Device Data for COVID-19,"Feb07, 2023",approved,"Wearable devices collect various physiological signals and measurements such as heart-beat rate, respiration rate, sleep, and body movement. This study explores wearable device data for detection, risk measurement, or prediction of disease like COVID-19.","The objective of the proposed research is to explore how wearable device data from different sources can be aggregated and analyzed to draw meaningful scientific insights, such as detection, risk stratification, or prediction of disease. The datasets will be utilized to build various machine learning/deep learning models. In the process, the datasets will be standardized, normalized, and harmonized. Furthermore, it will be studied how to leverage survey data in interpreting wearable device data. The different models will be compared on the basis of various performance metrics.",phs002535.v1.p1,Rapid Acceleration of Diagnostics - Digital Health Technologies (RADx-DHT): Personalized Analytics and Wearable Biosensor Platform for Early Detection of Covid-19 Decompensation (DECODE)
12,"Anwar, Mohd Mozharul",NIH,Exploration of Wearable Device Data for COVID-19,"Dec19, 2022",approved,"Wearable devices collect various physiological signals and measurements such as heart-beat rate, respiration rate, sleep, and body movement. This study explores wearable device data for detection, risk measurement, or prediction of disease like COVID-19.","The objective of the proposed research is to explore how wearable device data from different sources can be aggregated and analyzed to draw meaningful scientific insights, such as detection, risk stratification, or prediction of disease. The datasets will be utilized to build various machine learning/deep learning models. In the process, the datasets will be standardized, normalized, and harmonized. Furthermore, it will be studied how to leverage survey data in interpreting wearable device data. The different models will be compared on the basis of various performance metrics.",phs002516.v1.p1,Rapid Acceleration of Diagnostics - Digital Health Technologies (RADx-DHT): NIH Digital Health Solutions for COVID-19: IBM Covid19 Contact Tracing and Data Exchange Tools
15,"Anwar, Mohd Mozharul",NIH,Exploration of Wearable Device Data for COVID-19,"Dec19, 2022",approved,"Wearable devices collect various physiological signals and measurements such as heart-beat rate, respiration rate, sleep, and body movement. This study explores wearable device data for detection, risk measurement, or prediction of disease like COVID-19.","The objective of the proposed research is to explore how wearable device data from different sources can be aggregated and analyzed to draw meaningful scientific insights, such as detection, risk stratification, or prediction of disease. The datasets will be utilized to build various machine learning/deep learning models. In the process, the datasets will be standardized, normalized, and harmonized. Furthermore, it will be studied how to leverage survey data in interpreting wearable device data. The different models will be compared on the basis of various performance metrics.",phs002540.v1.p1,Rapid Acceleration of Diagnostics - Digital Health Technologies (RADx-DHT): NIH Digital Health Solutions for COVID-19: SAFER-COVID - Integration of Testing and Digital Health
17,"Anwar, Mohd Mozharul",NIH,Exploration of Wearable Device Data for COVID-19,"Dec19, 2022",approved,"Wearable devices collect various physiological signals and measurements such as heart-beat rate, respiration rate, sleep, and body movement. This study explores wearable device data for detection, risk measurement, or prediction of disease like COVID-19.","The objective of the proposed research is to explore how wearable device data from different sources can be aggregated and analyzed to draw meaningful scientific insights, such as detection, risk stratification, or prediction of disease. The datasets will be utilized to build various machine learning/deep learning models. In the process, the datasets will be standardized, normalized, and harmonized. Furthermore, it will be studied how to leverage survey data in interpreting wearable device data. The different models will be compared on the basis of various performance metrics.",phs002628.v1.p1,Rapid Acceleration of Diagnostics - Digital Health Technologies (RADx-DHT): NIH Digital Health Solutions for COVID-19: Clear2Go - A Digital Identity Wallet for Health Status
19,"Anwar, Mohd Mozharul",NIH,Exploration of Wearable Device Data for COVID-19,"Dec19, 2022",approved,"Wearable devices collect various physiological signals and measurements such as heart-beat rate, respiration rate, sleep, and body movement. This study explores wearable device data for detection, risk measurement, or prediction of disease like COVID-19.","The objective of the proposed research is to explore how wearable device data from different sources can be aggregated and analyzed to draw meaningful scientific insights, such as detection, risk stratification, or prediction of disease. The datasets will be utilized to build various machine learning/deep learning models. In the process, the datasets will be standardized, normalized, and harmonized. Furthermore, it will be studied how to leverage survey data in interpreting wearable device data. The different models will be compared on the basis of various performance metrics.",phs002519.v1.p1,Rapid Acceleration of Diagnostics - Digital Health Technologies (RADx-DHT): Covidseeker and COVID-19 Citizen Science: Leveraging Citizen Science and Real-Time Geospatial Temporal Mobile Data for Digital Contact Tracing and SARS-CoV-2 Hotspotting
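
Since the number of access requests per study is taken as a measure of data reuse, a per-study tally is a natural follow-up to the table above. A minimal sketch on made-up rows, assuming a DataFrame with the same `accession` column as `requests`. The function name `requests_per_study`, the `num_requests` column, and the sample values are hypothetical:

```python
import pandas as pd

def requests_per_study(requests):
    """Count authorized requests per study accession, most requested first."""
    counts = requests.groupby("accession").size()
    return counts.sort_values(ascending=False).rename("num_requests").reset_index()

# made-up sample rows for illustration only
sample = pd.DataFrame({
    "accession": ["phsA", "phsB", "phsA", "phsA"],
    "Requestor": ["Doe, Jane", "Doe, Jane", "Roe, Richard", "Poe, Alex"],
})
result = requests_per_study(sample)
print(result)
```

Applied to the notebook's data, `requests_per_study(requests)` would give one row per accession with its request count.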
