## NIMH Data Archive Data Access

In this notebook, a HEAL user can access an NIMH Data Archive (NDA) data package using a miNDAR data package ID, an NDA username and password for the data package, and the NDA API.

Before using this notebook, the user must have been granted access to the dataset they wish to download, and have created a miNDAR for the data package. (A miNDAR is a cloud-based Oracle database that contains a copy of a data package.)  

* You can request access to an NDA dataset from the [Data Permissions tab on the NDA site](https://nda.nih.gov/user/dashboard/data_permissions.html). If you do not have an NDA account, you can create one by first logging in using your NIH login credentials.
* Once you have access, you can create a miNDAR from the data package by [following the instructions here](https://docs.google.com/document/d/1-DuyUke3I7CK_riRvVDIesgtnjhJBQVlDu9tQpe2JYo/edit?usp=sharing). 

Once you have a miNDAR for your dataset, this notebook requires only two inputs from you: your NDA username and password, entered in the credentials cells below, and the miNDAR data package ID, defined in the package ID cell.

* The **miNDAR ID** can be found in the `ID` column on the [Data Packages tab of the NDA portal](https://nda.nih.gov/user/dashboard/packages.html).
* Data packages and their corresponding **miNDAR IDs** expire after 60 days and must be generated again in order to download the data.

Please note: Users are responsible for complying with all aspects of their Data Use Agreement, including deleting any accessed data in accordance with the parameters specified in their DUA.

### Import Necessary Packages

In [1]:
import base64
import requests
import json
import urllib.request
import shutil
from pathlib import Path
import os
import getpass

### Enter Your NDA Credentials

Enter your NDA username and password when prompted by the two cells below. Note these are NOT your RAS credentials; this username and password are set when you create an account at NDA. You can view your NDA username and reset your NDA password if needed by visiting your Profile at the NIMH Data Archive.

The cells read both values with `getpass`, which does not echo what you type, so your credentials are not displayed or saved in the notebook. They are held in memory only, so you must re-enter them in each new workspace session.

In [None]:
nda_username = getpass.getpass('NDA username: ')

In [15]:
nda_password = getpass.getpass('NDA password: ')

 ········


### Define NIMH Data Package ID

Enter your miNDAR ID, replacing the example value in the cell below.

In [18]:
packageId = 1228116 # miNDAR package ID as integer 
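
The package ID is interpolated into every package endpoint requested later in this notebook. A minimal sketch of a helper that validates the ID and builds those URLs is shown below; `package_url` and `API_ROOT` are hypothetical names introduced for illustration, and the only endpoint paths assumed are the ones this notebook already uses.

```python
API_ROOT = 'https://nda.nih.gov/api/package'  # base URL used by the requests in this notebook


def package_url(package_id, *parts):
    """Build a package endpoint URL (hypothetical helper, not part of the NDA API)."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(package_id, bool) or not isinstance(package_id, int) or package_id <= 0:
        raise ValueError('package ID must be a positive integer')
    return '/'.join([API_ROOT, str(package_id), *parts])


print(package_url(1228116, 'files'))
print(package_url(1228116, 'files', 'batchGeneratePresignedUrls'))
```

Rejecting non-integer IDs early gives a clear error before any request is sent, which may be easier to diagnose than an HTTP error from a malformed URL.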

### Encode A Credentials Object From Your Username And Password

In [19]:
credentials = nda_username + ':' + nda_password
# Encode as UTF-8 so that non-ASCII characters in a password do not raise an error
credentials = base64.b64encode(credentials.encode('utf-8')).decode('ascii')
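
The encoding step above produces the token for an HTTP Basic `Authorization` header. The sketch below illustrates the same technique with placeholder credentials; `basic_auth_value` is a hypothetical helper name, and the username and password are made up for the example.

```python
import base64


def basic_auth_value(username, password):
    """Return the value of an HTTP Basic Authorization header (hypothetical helper)."""
    token = base64.b64encode(f'{username}:{password}'.encode('utf-8')).decode('ascii')
    return 'Basic ' + token


# Placeholder credentials for illustration only.
value = basic_auth_value('example_user', 'example_pass')
print(value)

# Base64 is an encoding, not encryption: anyone who sees the header can reverse it.
decoded = base64.b64decode(value.split(' ', 1)[1]).decode('utf-8')
print(decoded)
```

Because the token is trivially reversible, it should be treated like the password itself and only sent over HTTPS.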

### Create Header And Test Request
Create the headers to be used for all requests and send an initial HTTP request to test the connection.

In [20]:
headers = {
    'Authorization': 'Basic ' + credentials,
    'User-Agent': 'Example Client',
    'Accept': 'application/json'
}

response = requests.get('https://nda.nih.gov/api/package/auth', headers=headers)

if response.status_code != requests.codes.ok:
    print('failed to authenticate')
    response.raise_for_status()

### Request File IDs & Names
Return all file names and their associated file IDs as a nested dictionary keyed by file ID.

In [21]:
response = requests.get('https://nda.nih.gov/api/package/' + str(packageId) + '/files', headers=headers)
response.raise_for_status()  # stop here on an HTTP error status
results = response.json()['results']

# Map each file ID to its download alias
api_files = {}
for f in results:
    api_files[f['package_file_id']] = {'name': f['download_alias']}
api_files

{10600799118: {'name': 'README.pdf'},
 10600799117: {'name': 'dataset_collection.txt'},
 10600799116: {'name': 'package_info.txt'},
 10600799115: {'name': 'package_file_metadata_1228116.txt.gz'},
 10600799114: {'name': 'stai01.txt'},
 10600799113: {'name': 'tsi01.txt'},
 10600799112: {'name': 'sfhs01.txt'},
 10600799111: {'name': 'traitfear01.txt'},
 10600799110: {'name': 'scid01.txt'},
 10600799109: {'name': 'pid501.txt'},
 10600799108: {'name': 'ptsd01.txt'},
 10600799107: {'name': 'sds01.txt'},
 10600799106: {'name': 'masq01.txt'},
 10600799105: {'name': 'ndar_subject01.txt'},
 10600799104: {'name': 'meaq01.txt'},
 10600799103: {'name': 'pdss01.txt'},
 10600799102: {'name': 'lec01.txt'},
 10600799101: {'name': 'ius01.txt'},
 10600799100: {'name': 'lsps01.txt'},
 10600799099: {'name': 'gadss01.txt'},
 10600799098: {'name': 'bisbas01.txt'},
 10600799097: {'name': 'fpq01.txt'},
 10600799096: {'name': 'combat01.txt'},
 10600799095: {'name': 'drs01.txt'},
 10600799094: {'name': 'bdi01.tx
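
Because `api_files` is keyed by file ID, finding the ID for a given file name requires a scan over the dictionary. A minimal sketch is shown below, using a few entries copied from the listing above; `sample_files`, `ids_by_suffix`, and `id_by_name` are hypothetical names introduced for illustration.

```python
# A few entries copied from the listing above; in the notebook, pass api_files instead.
sample_files = {
    10600799118: {'name': 'README.pdf'},
    10600799117: {'name': 'dataset_collection.txt'},
    10600799115: {'name': 'package_file_metadata_1228116.txt.gz'},
    10600799114: {'name': 'stai01.txt'},
}


def ids_by_suffix(files, suffix):
    """Return the file IDs whose name ends with the given suffix."""
    return [file_id for file_id, info in files.items() if info['name'].endswith(suffix)]


def id_by_name(files, name):
    """Return the first file ID with exactly this name, or None if there is none."""
    return next((file_id for file_id, info in files.items() if info['name'] == name), None)


print(ids_by_suffix(sample_files, '.txt'))
print(id_by_name(sample_files, 'stai01.txt'))
```

The resulting list of IDs can be used wherever this notebook expects a selection of file IDs.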

### Request Files' Presigned URLs
Use the data package file IDs to return a presigned URL for each file in the data package. Opening a presigned URL in a browser downloads the associated file to your local machine.

To pick a file, either copy and paste one of the file IDs listed above or index into the list of file IDs, as shown below.

In [22]:
response = requests.post('https://nda.nih.gov/api/package/' + str(packageId) + '/files/batchGeneratePresignedUrls', json=list(api_files.keys()), headers=headers)
response.raise_for_status()  # stop here on an HTTP error status
results = response.json()['presignedUrls']

for url in results:
    api_files[url['package_file_id']]['download'] = url['downloadURL']

file_ids = list(api_files.keys())
api_files[file_ids[0]]['download']                       

'https://gpop.s3.amazonaws.com/ndar_data/QueryPackages/PRODDB/README.pdf?X-Amz-Security-Token=FwoGZXIvYXdzEJL%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaDEk0MdEj3gSSaoqGZiKBAvaPicToXlTo4ilbJ0a84iHPidJbdPXdfBK5OlvHgR5LzZ91v4f8IDupGMwtjeoy60qZdKIJZ6L4Me1QkuY3CG2fMqWNP%2BElVjgf5oA32e47jfLp2W%2FQDRxhtM659wZ8IDfEsliUlt8JXe4qyHWlK5EXW93el52ReiFllmYVKAp0FcIIRiYQizz7Lr%2FzKa9Q4ysMr7ovC81SDnkfyHMJMsfa2pzBCRnUEBdeu9%2FDHG6j9j6kXqWTjDpDP0nmMHZ2bV9x3mPCVVGqWj%2FhyhcbzqdliSvyCr8tnKO93ldTVeRZUqnFUom2qCGhYHatau2ZF48wZtsF%2FvnouOpncksxqdHaKJKx6bEGMiljppcF96Il%2BAMVPlrzyW09omxw5U3s%2FYxBL9ksIW8Bp%2B6OjBWBSPCiSQ%3D%3D&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20240507T163634Z&X-Amz-SignedHeaders=host&X-Amz-Expires=129599&X-Amz-Credential=ASIAZAAXFM2FIVQ4GVU3%2F20240507%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Signature=57ddddefe3a013a64c69a66fc22ecf783889791b92ced5bda83a5a73ed4bd81d'
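
Presigned URLs are time-limited. In an AWS Signature Version 4 URL like the one above, `X-Amz-Date` records when the URL was signed and `X-Amz-Expires` gives its lifetime in seconds. The sketch below computes the expiry time from those two parameters; `presigned_url_expiry` is a hypothetical helper name, and the example URL is a shortened stand-in that keeps only the two relevant values from the output above.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlsplit, parse_qs


def presigned_url_expiry(url):
    """Return the UTC time at which a SigV4 presigned URL expires (hypothetical helper)."""
    query = parse_qs(urlsplit(url).query)
    signed_at = datetime.strptime(query['X-Amz-Date'][0], '%Y%m%dT%H%M%SZ')
    signed_at = signed_at.replace(tzinfo=timezone.utc)
    return signed_at + timedelta(seconds=int(query['X-Amz-Expires'][0]))


# Shortened, hypothetical URL carrying only the two parameters of interest.
example = ('https://example.invalid/README.pdf'
           '?X-Amz-Date=20240507T163634Z&X-Amz-Expires=129599')
expiry = presigned_url_expiry(example)
print(expiry.isoformat())
print(expiry > datetime.now(timezone.utc))  # True only while the URL is still valid
```

If a URL has expired, requesting a fresh set of presigned URLs with the cell above should produce working links again.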

### Download All Files In The Data Package
Use the complete list of data package file IDs to download all of the files in the data package.

In [23]:
for key in api_files:

    name = api_files[key]['name']
    downloadUrl = api_files[key]['download']

    # Create a downloads directory - you may change the download directory path here
    file = Path('downloads') / name

    # Create non-existent directories; package files have their
    # own directory structure, and this will ensure that it is
    # kept intact when downloading.
    file.parent.mkdir(parents=True, exist_ok=True)

    # Initiate the download, streaming the response body to disk.
    try:
        with urllib.request.urlopen(downloadUrl) as dl, open(file, 'wb') as out_file:
            shutil.copyfileobj(dl, out_file)
    except OSError as e:
        # Report the failure instead of silently skipping the file.
        print('failed to download ' + name + ': ' + str(e))
        

### Download Select Files By File ID
Download files associated with select data package file IDs.

In [18]:
select_file_ids = [file_ids[0], file_ids[1], file_ids[2]]
for key in select_file_ids:

    name = api_files[key]['name']
    downloadUrl = api_files[key]['download']

    # Create a downloads directory - you may change the download directory path here
    file = Path('downloads') / name

    # Create non-existent directories; package files have their
    # own directory structure, and this will ensure that the structure is
    # kept intact when downloading.
    file.parent.mkdir(parents=True, exist_ok=True)

    # Initiate the download, streaming the response body to disk.
    try:
        with urllib.request.urlopen(downloadUrl) as dl, open(file, 'wb') as out_file:
            shutil.copyfileobj(dl, out_file)
    except OSError as e:
        # Report the failure instead of silently skipping the file.
        print('failed to download ' + name + ': ' + str(e))
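
Both download cells repeat the same steps: build the destination path, create its parent directories, and stream the response to disk. A minimal sketch of factoring those steps into one function is shown below. `download_file` is a hypothetical helper name, and the demonstration uses a local `file://` URL in place of a presigned URL so that it runs without network access or NDA credentials.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path


def download_file(url, name, root='downloads'):
    """Stream url to root/name, creating parent directories (hypothetical helper)."""
    destination = Path(root) / name
    destination.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as dl, open(destination, 'wb') as out_file:
        # copyfileobj copies between open file-like objects in chunks.
        shutil.copyfileobj(dl, out_file)
    return destination


# Offline demonstration: a local file:// URL stands in for a presigned URL.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / 'source.txt'
    source.write_bytes(b'example contents\n')
    saved = download_file(source.as_uri(), 'nested/copy.txt', root=Path(tmp) / 'downloads')
    copied = saved.read_bytes()

print(copied)
```

In the notebook, each loop body could then reduce to a single call such as `download_file(api_files[key]['download'], api_files[key]['name'])`.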
        