# Data Steward's reference: HEAL Companion Tool platform reference

This notebook shows Data Stewards how to find the data they need for the HEAL companion tool. You can pull out the relevant study metadata if you have a study's **HDP ID, appl_id, or project number**.  

**Here is the data you can get from running this notebook:**
* appl_id (pulled from NIH RePORTER into MDS)
* clinical trials ID number - NCTID (if submitted to the platform)
* study name and project number (pulled from NIH RePORTER into MDS)
* archive status and (if applicable) archive date
* whether a study is registered, who registered it, and when
* data repository selection indicated in MDS
* repository study ID to indicate data submitted to repo
* when the CEDAR form was last updated
* number of completed fields and percent completion for each of the 9 sections of the CEDAR form
* a list of CEDAR fields that do not have data
* overall number of fields and percent completion for the whole CEDAR form


Note: this notebook does NOT require installing the Gen3SDK

## Background info - how to use the notebook

This notebook includes some cells with code (these have a slightly gray background), and some cells with text (like this cell you are reading, and the cells above and below this one). We will need to "run" the cells with code, in order, waiting for the previous cell to finish running before moving on, in order to pull out the information we need. The cells with text do not need you to act on them at all; however, if you also "run" these cells for convenience, it will not cause any problems. *Note: if you accidentally double-click in a text cell, it will change to show the markdown format for the cell. Simply run the text cell to change it back to the nicer text formatting.*  

### First time opening a notebook:

When you open this notebook for the first time, you will need to **set the kernel to be Python 3**. You can do this by clicking in the upper right corner of the notebook, where it says "No Kernel." Set the kernel to Python 3 in the dropdown.  

### How to execute a cell

Running a cell executes its code, produces any output, and moves the focus to the next cell. Here are the steps to run a cell.  

1) *(If you didn't just run a previous cell)* **Click in the cell** you want to execute. This is to make sure you are running the cell you want to run. You can always tell which cell is currently the one in active focus because it will have a thick blue vertical line to the left of the cell.
2) In the line of menu icons at the top of the notebook, find the **icon that looks like a triangle** pointing right (like a Play button icon). **Click this button**. This is the icon to run a cell and automatically move to the next one. Note it only runs one cell at a time -- it will not run the next cell until you click the button again.
3) Just to the left of the upper left corner of the cell, you will see **square brackets ([])**. If they have a `*` in them, the cell is still running - wait until it is done to run the next cell. If you see a number in them, it means it is complete and you can run the next cell. The numbers will increase sequentially for each cell run (so, the first cell you run will get 1, and the second cell you run will be 2, etc - regardless of whether you are re-running a cell or what order you have done them in).

> If the cell has any output, it will print out below the cell. Please be sure to **read the text cells** to understand if you need to enter any information before the cell runs, or if you need to do anything with the information produced.

4) If your cell has finished running, and you don't need to enter any information on the next cell, you can **click the Run button again** to run the next cell. Sometimes you need to scroll to be sure you are keeping up with the cell in focus (look for the blue line on the left.)

Questions? Reach out to Sara on Slack or at smvgarcia@uchicago.edu.

## Start here

This will import all the python libraries needed to run the commands in the notebook. If you don't already have the following Python libraries installed, you will need to install them first (follow the links for information about installing them). If you aren't sure what's installed -- you can always just try running the next cell and see if you get an error. Reach out to Sara (smvgarcia@uchicago.edu) if you have any questions about this.  
* [Pandas](https://pandas.pydata.org/docs/getting_started/install.html)
* [NumPy](https://numpy.org/install/)
* [Requests](https://pypi.org/project/requests/)
* JSON, OS, Sys, DateTime, and Webbrowser are all part of the Python standard library, so these don't need to be installed separately.
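
If you would rather check what is installed before running the import cell, the short sketch below is one way to do it. It is only an illustration: the helper name `find_missing` is made up for this example, and it relies on `importlib.util.find_spec` from the Python standard library, which reports whether a module can be found without actually importing it.

```python
from importlib.util import find_spec

def find_missing(module_names):
    """Return the names from module_names that Python cannot find."""
    return [name for name in module_names if find_spec(name) is None]

# the third-party libraries this notebook imports
missing = find_missing(["pandas", "numpy", "requests"])
if missing:
    print("Please install:", ", ".join(missing))
else:
    print("All required libraries are installed.")
```

If anything is listed as missing, follow the installation links above before moving on.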

In [None]:
import os,sys
import webbrowser
import pandas as pd
import numpy as np
import requests
import json
from datetime import date

## Gather the metadata into a usable format
In the next series of cells, we will gather the metadata and get it into a dataframe so that we can pull out pieces later as needed.

In [None]:
cedar_fields = ["data",
                "study_type",
                "minimal_info",
                "data_availability",
                "metadata_location",
                "study_translational_focus",
                "human_subject_applicability",
                "human_condition_applicability",
                "human_treatment_applicability",
                "time_of_registration",
                "time_of_last_cedar_updated"]

# Create a function to clean the metadata so that all unfilled dictionaries or lists are seen as nan (technically np.nan)
# leave empty strings as `''`
def clean_data(df):
    for col in df.columns:
        for i in range(len(df)):
            if type(df[col].iloc[i]) == bool or type(df[col].iloc[i]) == np.bool_:
                continue
            elif type(df[col].iloc[i]) == dict:
                if not any(list(df[col].iloc[i].values())):
                    df.at[i, col]= np.nan
            elif type(df[col].iloc[i]) == int or type(df[col].iloc[i]) == np.float64 or type(df[col].iloc[i]) == float:
                continue
            elif type(df[col].iloc[i]) == list:
                if len(df[col].iloc[i]) == 0:
                    df.at[i, col]= np.nan
    return df


# Create a function to transform the metadata to a dataframe format
# Use the clean_data function during transformation
def transform_data(meta_dict):
    df = pd.DataFrame.from_dict(meta_dict)
    df = df.T
    df.index.name = 'guids'
    df.reset_index(inplace=True)
    df = clean_data(df)
    return df

### Find MDS record for the study searching by project number, appl_id, or hdpid

In the next cell, choose only 1 of the 3 numbered options based on whether you would like to use your project number, HDP ID, or appl_id. For the option you choose - remove the "#" at the front of the line to make it active code. Make sure you only have 1 option with "#" removed from its lines. Replace the # for any options you are NOT using. The output will be a query. You can compare the structure of the output query to the examples shown below to ensure the input (the HDP ID, the appl_id, or the project number) was handled correctly.

The cell after that will open a new browser tab with the public MDS record for the study. If you want, this will let you examine the MDS for other information manually. This is just for convenience and reference - you do not need to do anything with the record. Go back to your Jupyter notebook to keep moving forward.

Here are several example queries to pull up the MDS using project number, HDP ID, or appl_id.
```
Example to get MDS from project number (project_number)
https://healdata.org/mds/metadata?data=True&limit=1000&gen3_discovery.project_number=1R01HL150523-01

Example to get MDS from HDP ID (_hdp_uid)
https://healdata.org/mds/metadata?data=True&limit=1000&gen3_discovery._hdp_uid=HDP01068

Example to get MDS from appl_id
https://healdata.org/mds/metadata?data=True&limit=1000&gen3_discovery.appl_id=10590120
```
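
The three example queries differ only in the name of the search field and its value. As a rough sketch of that pattern (the helper `build_query` and the constant `MDS_URL` are names invented for this illustration, not part of the notebook), the URL can be assembled with `urllib.parse.urlencode` from the standard library:

```python
from urllib.parse import urlencode

# base address taken from the example queries above
MDS_URL = "https://healdata.org/mds/metadata"

def build_query(field, value):
    """Build an MDS query URL for one gen3_discovery field.

    field is 'project_number', '_hdp_uid', or 'appl_id'.
    """
    params = {"data": "True", "limit": 1000, f"gen3_discovery.{field}": value}
    return f"{MDS_URL}?{urlencode(params)}"

print(build_query("_hdp_uid", "HDP01068"))
print(build_query("appl_id", 10590120))
```

Printing the result and comparing it to the examples above is a quick way to confirm that the identifier was handled correctly.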

In [None]:
# Choose only ONE of the 3 numbered options below to input your project number, HDP ID, or appl_id.
# You should only have 1 option with "#" removed from its lines. Replace the # for any options you are NOT using.

# (1) If you have the NIH project number (eg, 1R01HL150523-01), remove the "#" from the front of the
# next 2 lines, and replace the "your project number" with your project number (leave the quote marks in)
# projnumber = 'your project number'
# query = 'https://healdata.org/mds/metadata?data=True&limit=1000&gen3_discovery.project_number={}'.format(projnumber)

# (2) If you have the HDP ID (eg, HDP01068), remove the "#" from the front of the
# next 2 lines, and replace the "your HDP ID" with your HDP ID (leave the quote marks in)
hpid = 'HDP00010'
query = 'https://healdata.org/mds/metadata?data=True&limit=1000&gen3_discovery._hdp_uid={}'.format(hpid)

# (3) If you have the appl_id (eg, 10590120), remove the "#" from the front of the
# next 2 lines, and replace the "your appl_id" with your appl_id (leave the quote marks in)
# apID = 'your appl_id'
# query = 'https://healdata.org/mds/metadata?data=True&limit=1000&gen3_discovery.appl_id={}'.format(apID)

print(query)

https://healdata.org/mds/metadata?data=True&limit=1000&gen3_discovery._hdp_uid=HDP00010


In [None]:
# run this cell to trigger opening the MDS page for your study of interest in a new browser tab
# this will let you examine the MDS for other information manually if you want
# note: one tester found this did not auto-open the URL. If that happens you can click the output of
# the previous cell to see the MDS page.

webbrowser.open(query)

False

In [None]:
# pull the MDS info from the query, then (optionally) show the MDS to make sure it worked
# unless you uncomment the line at the bottom, there will be no output

response = requests.get(query)

# if you want to double-check that you pulled something down, you can uncomment the next line.
#response.text

In [None]:
# since the MDS is in json format, decode the json
# this cell prints the decoded JSON so you can check that it worked

response_json = response.json()
print(response_json)
# for a more readable view of the decoded JSON, uncomment the next line
# display(response_json)

{'HDP00010': {'_guid_type': 'unregistered_discovery_metadata', 'nih_reporter': {'foa': 'DA19-008', 'terms': '<Academic Medical Centers><University Medical Centers><Aging><Opioid Analgesics><opioid painkiller><opioid pain reliever><opioid pain medication><opioid anesthetic><opioid analgesia><opiate pain reliever><opiate pain medication><opiate analgesic><opiate analgesia><Artificial Intelligence><Machine Intelligence><Computer Reasoning><Buprenorphine><Chicago><Cities><Clinical Trials><Communities><Death><Cessation of life><Diamorphine><Diacetylmorphine><Heroin><abuses drugs><abuse of drugs><Drug abuse><Professional Education><Epidemic><National Government><Federal Government><Gender Identity><Geography><Health><Hospitals><Illinois><Indiana><Learning><life course><Life Cycle><Life Cycle Stages><Los Angeles><Mentorship><Methods><Midwest US><Midwest U.S.><Midwest><Midwestern United States><Morbidity><Morbidity - disease rate><mortality><National Institutes of Health><NIH><United States Na

In [None]:
# Return the HDP ID corresponding to your query

guids = response_json.keys()
print(guids)

dict_keys(['HDP00010'])


In [None]:
# Map the API json response to a workable format so we can find our fields easily
# There won't be any output from this cell.

guids = response_json.keys()
metadata = {'nih_metadata': {}, 'ctgov_metadata': {}, 'gen3_metadata': {}}

no_gen3_metadata = []
missed_keys = []

for guid in guids:

    if 'gen3_discovery' in response_json[guid].keys():
        metadata['gen3_metadata'][guid] = response_json[guid]['gen3_discovery'] # get majority metadata

        if '_guid_type' in response_json[guid].keys():
            metadata['gen3_metadata'][guid]['registration_status'] = response_json[guid]['_guid_type'] # get registration status

        if 'study_metadata' in response_json[guid]['gen3_discovery'].keys():
            for key1 in response_json[guid]['gen3_discovery']['study_metadata'].keys():
                for key2 in response_json[guid]['gen3_discovery']['study_metadata'][key1].keys():
                    if key1 in cedar_fields:
                        metadata['gen3_metadata'][guid][f'cedar_study_metadata.{key1}.{key2}'] = response_json[guid]['gen3_discovery']['study_metadata'][key1][key2]
                    else:
                        metadata['gen3_metadata'][guid][f'study_metadata.{key1}.{key2}'] = response_json[guid]['gen3_discovery']['study_metadata'][key1][key2]
            del metadata['gen3_metadata'][guid]['study_metadata']


    if 'nih_reporter' in response_json[guid].keys():
        metadata['nih_metadata'][guid] = response_json[guid]['nih_reporter']

    if 'clinicaltrials_gov' in response_json[guid].keys():
        metadata['ctgov_metadata'][guid] = response_json[guid]['clinicaltrials_gov']
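
The flattening step in the cell above can be easier to follow on a tiny example. The sketch below is an illustration only: the record, its field names, the shortened section list, and the helper `flatten_study_metadata` are all made up for this example. It shows how nested `study_metadata` entries become dotted column names, with CEDAR sections getting the `cedar_study_metadata` prefix.

```python
# shortened stand-in for the cedar_fields list defined earlier
cedar_sections = {"minimal_info", "data_availability"}

# hypothetical record, shaped like one entry under 'gen3_discovery'
record = {
    "project_number": "1R01XX000000-01",
    "study_metadata": {
        "minimal_info": {"study_name": "Example study"},
        "other_section": {"some_field": []},
    },
}

def flatten_study_metadata(record, cedar_sections):
    """Copy top-level fields and turn nested study_metadata into dotted keys."""
    flat = {k: v for k, v in record.items() if k != "study_metadata"}
    for section, fields in record.get("study_metadata", {}).items():
        # CEDAR sections get a different prefix from the other sections
        prefix = "cedar_study_metadata" if section in cedar_sections else "study_metadata"
        for field, value in fields.items():
            flat[f"{prefix}.{section}.{field}"] = value
    return flat

flat = flatten_study_metadata(record, cedar_sections)
for key in sorted(flat):
    print(key)
```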

In [None]:
# Use the transform_data function we created earlier to convert the metadata from json format to a dataframe.
# Unless you uncomment the line at the bottom, there will be no output

df1 = transform_data(metadata['gen3_metadata'])
df2 = transform_data(metadata['ctgov_metadata'])
df3 = transform_data(metadata['nih_metadata'])

# note that df2 (the Clinical Trials metadata dataframe) will be expected to be empty for any study that does NOT have an NCTId.
# you can verify that any dataframe has data by uncommenting the following line and replacing "df1" with the df2 or df3 - whatever you want to check
# display(df1)
# df3.head()

In [None]:
# this cell copies the appl_id column from df3 and adds it to df1
# make a dataframe with the appl_id col from df3 - it's always there
df_apid = df3['appl_id']

# drop the appl_id col from df1 -- it's only there sometimes and we don't want duplicate cols
# (drop returns a new dataframe, so the result must be assigned back to df1)
df1 = df1.drop(columns=['appl_id'], errors='ignore')

# add df_apid to df1
df1 = pd.concat([df1, df_apid], axis=1)

## Pull out the relevant metadata
Now that all the metadata is in a dataframe, we can easily pull out the data we need.  

### Basic details for verification
First - let's verify that the name, project number, and other basic study details are what we expect.  
* Study Name
* Project Number
* Study PIs

### Is the study registered?
Several fields below show you whether the study was registered, when, and by whom.

**Who registered the study?**
This will not produce the user's name. Instead, it will provide the HEAL login name for the person who registered the study. There are 2 possible response types:
* **Email address:** the Gmail or InCommon login used by the registering user
* **ORCID:** a series of 4 groups of 4 digits each, separated with a hyphen (####-####-####-####)

You can usually reverse-lookup the ORCID using the search bar at [orcid.org](https://orcid.org/)

If the study is not registered, the response will be "na".
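
If you need to sort many registrant values at once, the response types described above can be told apart with a simple pattern check. This is a minimal sketch, not part of the notebook: the function name `classify_registrant` is invented here, and the pattern assumes the standard ORCID iD layout, in which the very last character may be the letter "X" instead of a digit.

```python
import re

# four groups of four characters; the final character may be a digit or "X"
ORCID_PATTERN = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")

def classify_registrant(username):
    """Label a registrant value as 'not registered', 'ORCID', 'email', or 'unknown'."""
    if username == "na":
        return "not registered"
    if ORCID_PATTERN.match(username):
        return "ORCID"
    if "@" in username:
        return "email"
    return "unknown"

# made-up example values
for value in ["0000-0002-1825-0097", "someone@example.org", "na"]:
    print(value, "->", classify_registrant(value))
```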

### Is the study archived? ("Active status")

Archived studies are shown as "archived" instead of "live" in the "Active Status" column below. We archive a study when it is no longer wanted on the platform, either because it is a different funding year of a study already on the platform, or because there is some other reason we don't want it to show up on the discovery page.  

**If a study is archived, the PI is not expected to do anything further with it.** A study that is archived is not expected to be registered (although it may have been before archiving), nor would it need to have any study-level metadata.  

### Other relevant ID numbers and links

* NIH Application ID (appl_id)  
* NIH RePORTER link  
* Clinical Trials ID (NCT ID)  

In [None]:
# here, rowdf is the row from the df that relates to the guid it's currently looping thru
def mydf1function(rowdf):
    projname = rowdf.iloc[0]['project_title']
    projnumber = rowdf.iloc[0]['project_number']
    projPI = rowdf.iloc[0]['investigators_name']
    ap_id = rowdf.iloc[0]['appl_id']
    url = rowdf.iloc[0]['cedar_study_metadata.metadata_location.nih_reporter_link']
    ctid = rowdf.iloc[0]['cedar_study_metadata.metadata_location.clinical_trials_study_ID']

    if rowdf.iloc[0]['registration_status'] == 'discovery_metadata_archive':
        archivestatus = 'archived'
        archivedate = rowdf.iloc[0]['archive_date']
    else:
        archivestatus = 'live'
        archivedate = 'na'
    if rowdf.iloc[0]['is_registered'] == True:
        regstatus = 'is registered'
        regdate = rowdf.iloc[0]['time_of_registration']
        reguser = rowdf.iloc[0]['registrant_username']
    else:
        regstatus = 'not registered'
        regdate = 'na'
        reguser = 'na'

    return {
        'Study name': projname,
        'Project number': projnumber,
        'Study PIs': projPI,
        'Registration status': regstatus,
        'Registration date': regdate,
        'Registering user': reguser,
        'Active status': archivestatus,
        'Archive date': archivedate,
        'NIH application ID': ap_id,
        'NIH RePORTER link': url,
        'Clinical Trials ID': ctid
    }

res_series = df1.groupby('guids').apply(mydf1function)
res_df = pd.DataFrame(res_series.tolist(), index=res_series.index)
res_df.T

guids,HDP00010
Study name,Great Lakes Node of the Drug Abuse Clinical Tr...
Project number,1UG1DA049467-01
Study PIs,[Niranjan Karnik]
Registration status,not registered
Registration date,na
Registering user,na
Active status,live
Archive date,na
NIH application ID,9839124
NIH RePORTER link,https://reporter.nih.gov/project-details/9839124


## Data Repository Metadata

If there is no repository in the MDS, the `cedar_study_metadata.metadata_location.data_repositories` field will report np.nan (or, in the MDS, will just be an empty list). If there has been one or more data repositories selected and reported to the platform, this section will report back the full list of repositories selected.

For studies with at least one repository, these fields are present in the MDS:  

* `repository_name`: the name of the repository  
* `repository_study_ID`: this should be the study-level identifier at the repository (this will be blank if they have reported a repository, but have not yet reported depositing their study data)  

PIs should not get a study ID until after they have deposited data, so the `repository_study_ID` field can be an indicator for whether study data has been deposited (and reported to the platform). Note that this field will only exist if they have reported a repository to the platform.  

This section of the notebook does the following:  

* Identifies whether any repository selections have been reported to the platform  
* Reports all repository selections for the study on the platform
* Determines whether there has been at least one repo selected
* Reports any repository study IDs reported for a repository
* Determines whether there has been a study ID reported for each repo selected for the study
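
The steps above can be sketched on a small made-up list of repository entries. This is an illustration under assumptions, not the notebook's own code: the repository names and IDs are invented, the helper names `get_values` and `summarize_repos` are hypothetical, and the key `repository_study_ID` is spelled the way the notebook's code cell looks it up.

```python
def get_values(dict_list, key):
    """Collect the value for key from every dictionary in the list that has it."""
    return [d[key] for d in dict_list if key in d]

def summarize_repos(repos):
    """Return (names, study IDs, any repo selected?, all repos have a study ID?)."""
    names = get_values(repos, "repository_name")
    study_ids = get_values(repos, "repository_study_ID")
    any_selected = bool(names)
    # a deposit counts as reported only when every repo has a non-empty study ID
    all_deposited = any_selected and len(study_ids) == len(names) and all(study_ids)
    return names, study_ids, any_selected, all_deposited

# hypothetical repository entries: one with a study ID, one still blank
repos = [
    {"repository_name": "Repo A", "repository_study_ID": "A-123"},
    {"repository_name": "Repo B", "repository_study_ID": ""},
]
print(summarize_repos(repos))
print(summarize_repos([]))
```

A list comprehension is used here instead of recursion because it reads the list once and stays short; either approach collects the same values.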

In [None]:
# all_repo_data is a list of dictionaries. need to loop through the dictionaries in the list
def get_repo_values(lst, key):
    # base case: if the list is empty, return an empty list
    if not lst:
        return []
    # get the first dictionary in the list
    first_dict = lst[0]
    # check if the key is in the first dictionary
    if key in first_dict:
        # if the key is in the dictionary, add the value to the result list
        result = [first_dict[key]]
    else:
        # if the key is not in the dictionary, the result list is empty
        result = []
    # recursively call the function on the rest of the list
    result += get_repo_values(lst[1:], key)
    return result

reported_repo = []

for index, row in df1.iterrows():
    h_id = row['guids']
    all_repo_data = row['cedar_study_metadata.metadata_location.data_repositories']
    # clean_data() replaces an empty repository list with np.nan,
    # so anything that is not a list means no repository was reported
    if not isinstance(all_repo_data, list):
        all_repo_names = []
        all_repo_IDs = []
        any_repo_sel = 'No repo has been submitted'
        repo_deposit = 'No data submissions have been reported'
    else:
        all_repo_names = get_repo_values(all_repo_data, 'repository_name')
        # an empty list means no repository names were found
        if not all_repo_names:
            any_repo_sel = 'No repo has been submitted'
        else:
            any_repo_sel = 'Yes'
        all_repo_IDs = get_repo_values(all_repo_data, 'repository_study_ID')
        if not all_repo_IDs:
            repo_deposit = 'No data submissions have been reported'
        elif '' in all_repo_IDs:
            repo_deposit = 'At least one repo selection does not have a data submission reported'
        else:
            repo_deposit = 'Yes'

    # record the results for this study (inside the loop, so every row is kept)
    reported_repo.append(
        [
            h_id,
            all_repo_names,
            any_repo_sel,
            all_repo_IDs,
            repo_deposit
        ]
    )

col_names = [
    'guids',
    'Repo Selections',
    'Selected at least 1 repo?',
    'Repo Study IDs',
    'All repos have submitted data?'
]
repo_info = pd.DataFrame(reported_repo, columns=col_names)

repo_info.T

Unnamed: 0,0
guids,HDP00010
Repo Selections,[]
Selected at least 1 repo?,No repo has been submitted
Repo Study IDs,[]
All repos have submitted data?,No data submissions have been reported


### CEDAR completion

The next series of cells examines when the CEDAR form was last updated and CEDAR completion percentage.

**CEDAR form completion - By Section**

There are 9 sections in the CEDAR form:
* Minimal Info
* Metadata Location (read note about this section)
* Data Availability
* Study Translational Focus
* Study Type
* Human Treatment Applicability
* Human Condition Applicability
* Human Subject Applicability
* Data

This next huge code block goes through all the metadata fields in the CEDAR form. For each section, it counts the number of fields with data entered (that is, fields that are not empty or null) and the total number of fields possible in that section. It then calculates percent completion for each section *(note that total CEDAR percent completion is not calculated here - that's coming later)*.  

There will be no output when you run this cell.

A note on the Metadata Location section:

In the CEDAR form, the Metadata Location section only has 2 fields - application ID (which is autopopulated) and "other websites". In the MDS, there are a number of other fields in the Metadata Location section that are not on the CEDAR form (see below for full listing of fields in Metadata Location in MDS). To avoid having very skewed percentages for the section, I am only looking at whether the "other websites" field is populated, and manually adding the completion for the application ID field since it's autopopulated.

Fields in the Metadata Location section in the MDS:
* "data_repositories" (not in CEDAR form)
* "nih_reporter_link" (not in CEDAR form, autopopulated)
* "nih_application_id" (autopopulated in CEDAR form)
* "other_study_websites"
* "clinical_trials_study_ID" (not in CEDAR form)
* "cedar_study_level_metadata_template_instance_ID" (not in CEDAR form, very irrelevant for users)

**What CEDAR fields are missing?**

Part of this code also looks at all the CEDAR fields (excepting the ones in Metadata Location that are not in the CEDAR form) and makes a list of all the fields in each section that do not have data in them.

This can be helpful, for example, when evaluating whether a 70% completion seems reasonable for a particular study: you can check whether the missing questions are ones that are often not applicable to a study. Also, if a PI is confused and thinks they completed more fields, you can share this list of fields that are not completed. (Note: the field names are not completely clear to a typical user who does not often use the MDS, so I do not recommend sharing this list by default with all PIs.)
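
To make the counting concrete, here is a minimal sketch of the same idea on a toy row. It is an illustration only: it uses a plain dictionary instead of a dataframe row, the field names are made up, and the helpers `is_blank` and `section_completion` are hypothetical names. A field counts as missing when it holds an empty string, the string "0", or NaN.

```python
import math

def is_blank(value):
    """True for the 'no data' markers: '', '0', or NaN."""
    if isinstance(value, float) and math.isnan(value):
        return True
    return value in ("", "0")

def section_completion(row, section):
    """Return (completed, total, percent, missing field names) for one CEDAR section."""
    prefix = f"cedar_study_metadata.{section}."
    fields = {k: v for k, v in row.items() if k.startswith(prefix)}
    missing = [k for k, v in fields.items() if is_blank(v)]
    completed = len(fields) - len(missing)
    percent = round(100 * completed / len(fields), 1) if fields else 0.0
    return completed, len(fields), percent, missing

# toy row with hypothetical field names
row = {
    "cedar_study_metadata.study_type.field_a": "Clinical",
    "cedar_study_metadata.study_type.field_b": "",
    "cedar_study_metadata.study_type.field_c": float("nan"),
    "cedar_study_metadata.study_type.field_d": ["x"],
    "cedar_study_metadata.data.field_e": "0",
}
completed, total, percent, missing = section_completion(row, "study_type")
print(f"{completed} of {total} fields completed ({percent}%)")
print("Missing:", missing)
```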

**How to convert the field name in the list to CEDAR field names**

The field names beginning with `cedar_study_metadata` have a system to their names that allows us to easily see what part of the CEDAR form the field is from. Let's look at this field name as an example:

`cedar_study_metadata.human_treatment_applicability.treatment_investigation_stage_or_type`

There are 3 parts to the field name, separated by periods.

* `cedar_study_metadata` - this indicates the field is from the CEDAR form
* `human_treatment_applicability` - this indicates which of the 9 sections in the CEDAR form this field is from
* `treatment_investigation_stage_or_type` - this tells us what the field in the CEDAR form is. You can use this [study-level metadata schema](https://github.com/HEAL/heal-metadata-schemas/blob/main/for-investigators-how-to/study-level-metadata-fields/study-metadata-schema-for-humans.md) to connect the field names to more human-readable questions for more information about the fields.
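
The three-part naming scheme can also be taken apart in code. The sketch below is only an illustration (the helper name `split_field_name` is invented here); it splits a field name at the first two periods and turns the section part into a readable label.

```python
def split_field_name(field_name):
    """Split 'source.section.field' into its three parts."""
    source, section, field = field_name.split(".", 2)
    return {
        "source": source,
        "section": section,
        # e.g. 'human_treatment_applicability' -> 'Human Treatment Applicability'
        "section_label": section.replace("_", " ").title(),
        "field": field,
    }

parts = split_field_name(
    "cedar_study_metadata.human_treatment_applicability.treatment_investigation_stage_or_type"
)
print(parts["section_label"])
print(parts["field"])
```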

In [None]:
# create a list to store all the gathered data
cedar_comp_info = []

# these are fields in the Metadata Location section in the MDS that are not on the CEDAR form.
# We are excluding them to keep from confusing the PIs if we need to share the list
noncedar = [
    'cedar_study_metadata.metadata_location.data_repositories',
    'cedar_study_metadata.metadata_location.nih_reporter_link',
    'cedar_study_metadata.metadata_location.nih_application_id',
    'cedar_study_metadata.metadata_location.clinical_trials_study_ID',
    'cedar_study_metadata.metadata_location.cedar_study_level_metadata_template_instance_ID'
]

#for each row
#if the column name begins with cedar_study_metadata.XXX.
#loop through the columns with that prefix
#count if not empty string/nan/0
for index, row in df1.iterrows():
    # use this row's value (not always row 0) so each study reports its own CEDAR update time
    if 'time_of_last_cedar_updated' in row.index:
        cedar_update = row['time_of_last_cedar_updated']
    else:
        cedar_update = "na"

    hid = row['guids']

    #if the column name begins with cedar_study_metadata.minimal_info.
    sel_min_info = row.index.str.startswith("cedar_study_metadata.minimal_info.")
    # how many fields are completed in Minimal Info section
    not_empty_string = (row.loc[sel_min_info] != "")
    not_0 = (row.loc[sel_min_info] != "0")
    not_nan = ~row.loc[sel_min_info].isna()
    completed_min_info = (not_empty_string & not_0 & not_nan).value_counts().get(True, default=0)
    # how many total fields in Minimal Info section
    total_min_info=len(row.loc[sel_min_info])
    # find the fields that are not completed
    is_empty_string = (row.loc[sel_min_info] == "")
    is_0 = (row.loc[sel_min_info] == "0")
    is_nan = row.loc[sel_min_info].isna()
    is_noncedar = row.loc[sel_min_info].index.isin(noncedar)
    missing_min_info = (is_empty_string | is_0 | is_nan) & ~is_noncedar
    is_missing_min_info = row.loc[sel_min_info].loc[missing_min_info].index.tolist()

    #if the column name is cedar_study_metadata.metadata_location.other_study_websites
    # if is not nan
    completed_websites = 2 if row["cedar_study_metadata.metadata_location.other_study_websites"] is not np.nan else 1
    # only 2 fields, including autopop field
    total_websites = 2
    #if the column name begins with cedar_study_metadata.metadata_location.
    sel_met_loc = row.index.str.startswith("cedar_study_metadata.metadata_location.")
    # Find the fields that are not completed
    is_empty_string = (row.loc[sel_met_loc] == "")
    is_0 = (row.loc[sel_met_loc] == "0")
    is_nan = row.loc[sel_met_loc].isna()
    is_noncedar = row.loc[sel_met_loc].index.isin(noncedar)
    missing_met_loc = (is_empty_string | is_0 | is_nan) & ~is_noncedar
    is_missing_met_loc = row.loc[sel_met_loc].loc[missing_met_loc].index.tolist()

    #if the column name begins with cedar_study_metadata.data_availability.
    sel_data_avail = row.index.str.startswith("cedar_study_metadata.data_availability.")
    # how many total fields in Data Availability section
    total_data_avail=len(row.loc[sel_data_avail])
    # how many fields are completed in Data Availability section
    not_empty_string = (row.loc[sel_data_avail] != "")
    not_0 = (row.loc[sel_data_avail] != "0")
    not_nan = ~row.loc[sel_data_avail].isna()
    completed_data_avail = (not_empty_string & not_0 & not_nan).value_counts().get(True, default=0)
    # Find the fields that are not completed
    is_empty_string = (row.loc[sel_data_avail] == "")
    is_0 = (row.loc[sel_data_avail] == "0")
    is_nan = row.loc[sel_data_avail].isna()
    is_noncedar = row.loc[sel_data_avail].index.isin(noncedar)
    missing_data_avail = (is_empty_string | is_0 | is_nan) & ~is_noncedar
    is_missing_data_avail = row.loc[sel_data_avail].loc[missing_data_avail].index.tolist()

    #if the column name begins with cedar_study_metadata.study_translational_focus.
    sel_trans_focus = row.index.str.startswith("cedar_study_metadata.study_translational_focus.")
    # how many total fields in Study Translational Focus section
    total_trans_focus=len(row.loc[sel_trans_focus])
    # how many fields are completed in Study Translational Focus section
    not_empty_string = (row.loc[sel_trans_focus] != "")
    not_0 = (row.loc[sel_trans_focus] != "0")
    not_nan = ~row.loc[sel_trans_focus].isna()
    completed_trans_focus = (not_empty_string & not_0 & not_nan).value_counts().get(True, default=0)
    # Find the fields that are not completed
    is_empty_string = (row.loc[sel_trans_focus] == "")
    is_0 = (row.loc[sel_trans_focus] == "0")
    is_nan = row.loc[sel_trans_focus].isna()
    is_noncedar = row.loc[sel_trans_focus].index.isin(noncedar)
    missing_trans_focus = (is_empty_string | is_0 | is_nan) & ~is_noncedar
    is_missing_trans_focus = row.loc[sel_trans_focus].loc[missing_trans_focus].index.tolist()

    #if the column name begins with cedar_study_metadata.study_type.
    sel_study_type = row.index.str.startswith("cedar_study_metadata.study_type.")
    # how many total fields in Study Type section
    total_study_type=len(row.loc[sel_study_type])
    # how many fields are completed in Study Type section
    not_empty_string = (row.loc[sel_study_type] != "")
    not_0 = (row.loc[sel_study_type] != "0")
    not_nan = ~row.loc[sel_study_type].isna()
    completed_study_type = (not_empty_string & not_0 & not_nan).value_counts().get(True, default=0)
    # Find the fields that are not completed
    is_empty_string = (row.loc[sel_study_type] == "")
    is_0 = (row.loc[sel_study_type] == "0")
    is_nan = row.loc[sel_study_type].isna()
    is_noncedar = row.loc[sel_study_type].index.isin(noncedar)
    missing_study_type = (is_empty_string | is_0 | is_nan) & ~is_noncedar
    is_missing_study_type = row.loc[sel_study_type].loc[missing_study_type].index.tolist()

    #if the column name begins with cedar_study_metadata.human_treatment_applicability.
    sel_hum_treat = row.index.str.startswith("cedar_study_metadata.human_treatment_applicability.")
    # how many total fields in Human Treatment Applicability section
    total_hum_treat=len(row.loc[sel_hum_treat])
    # how many fields are completed in Human Treatment Applicability section
    not_empty_string = (row.loc[sel_hum_treat] != "")
    not_0 = (row.loc[sel_hum_treat] != "0")
    not_nan = ~row.loc[sel_hum_treat].isna()
    completed_hum_treat = (not_empty_string & not_0 & not_nan).value_counts().get(True, default=0)
    # Find the fields that are not completed
    is_empty_string = (row.loc[sel_hum_treat] == "")
    is_0 = (row.loc[sel_hum_treat] == "0")
    is_nan = row.loc[sel_hum_treat].isna()
    is_noncedar = row.loc[sel_hum_treat].index.isin(noncedar)
    missing_hum_treat = (is_empty_string | is_0 | is_nan) & ~is_noncedar
    is_missing_hum_treat = row.loc[sel_hum_treat].loc[missing_hum_treat].index.tolist()

    #if the column name begins with cedar_study_metadata.human_condition_applicability.
    sel_hum_cond = row.index.str.startswith("cedar_study_metadata.human_condition_applicability.")
    # how many total fields in Human Condition Applicability section
    total_hum_cond=len(row.loc[sel_hum_cond])
    # how many fields are completed in Human Condition Applicability section
    not_empty_string = (row.loc[sel_hum_cond] != "")
    not_0 = (row.loc[sel_hum_cond] != "0")
    not_nan = ~row.loc[sel_hum_cond].isna()
    completed_hum_cond = (not_empty_string & not_0 & not_nan).value_counts().get(True, default=0)
    # Find the fields that are not completed
    is_empty_string = (row.loc[sel_hum_cond] == "")
    is_0 = (row.loc[sel_hum_cond] == "0")
    is_nan = row.loc[sel_hum_cond].isna()
    is_noncedar = row.loc[sel_hum_cond].index.isin(noncedar)
    missing_hum_cond = (is_empty_string | is_0 | is_nan) & ~is_noncedar
    is_missing_hum_cond = row.loc[sel_hum_cond].loc[missing_hum_cond].index.tolist()

    #if the column name begins with cedar_study_metadata.human_subject_applicability.
    sel_hum_subj = row.index.str.startswith("cedar_study_metadata.human_subject_applicability.")
    # how many total fields in Human Subject Applicability section
    total_hum_subj=len(row.loc[sel_hum_subj])
    # how many fields are completed in Human Subject Applicability section
    not_empty_string = (row.loc[sel_hum_subj] != "")
    not_0 = (row.loc[sel_hum_subj] != "0")
    not_nan = ~row.loc[sel_hum_subj].isna()
    completed_hum_subj = (not_empty_string & not_0 & not_nan).value_counts().get(True, default=0)
    # Find the fields that are not completed
    is_empty_string = (row.loc[sel_hum_subj] == "")
    is_0 = (row.loc[sel_hum_subj] == "0")
    is_nan = row.loc[sel_hum_subj].isna()
    is_noncedar = row.loc[sel_hum_subj].index.isin(noncedar)
    missing_hum_subj = (is_empty_string | is_0 | is_nan) & ~is_noncedar
    is_missing_hum_subj = row.loc[sel_hum_subj].loc[missing_hum_subj].index.tolist()

    #if the column name begins with cedar_study_metadata.data.
    sel_data = row.index.str.startswith("cedar_study_metadata.data.")
    # how many total fields in Data section
    total_data=len(row.loc[sel_data])
    # how many fields are completed in Data section
    not_empty_string = (row.loc[sel_data] != "")
    not_0 = (row.loc[sel_data] != "0")
    not_nan = ~row.loc[sel_data].isna()
    completed_data = (not_empty_string & not_0 & not_nan).value_counts().get(True, default=0)
    # Find the fields that are not completed
    is_empty_string = (row.loc[sel_data] == "")
    is_0 = (row.loc[sel_data] == "0")
    is_nan = row.loc[sel_data].isna()
    is_noncedar = row.loc[sel_data].index.isin(noncedar)
    missing_data = (is_empty_string | is_0 | is_nan) & ~is_noncedar
    is_missing_data = row.loc[sel_data].loc[missing_data].index.tolist()

    cedar_comp_info.append(
        [
            hid,
            cedar_update,
            round((100 * completed_min_info / total_min_info), 1),
            completed_min_info,
            total_min_info,
            is_missing_min_info,
            round((100 * (completed_websites) / (total_websites)), 1),
            completed_websites,
            total_websites,
            is_missing_met_loc,
            round((100 * completed_data_avail / total_data_avail), 1),
            completed_data_avail,
            total_data_avail,
            is_missing_data_avail,
            round((100 * completed_trans_focus / total_trans_focus), 1),
            completed_trans_focus,
            total_trans_focus,
            is_missing_trans_focus,
            round((100 * completed_study_type / total_study_type), 1),
            completed_study_type,
            total_study_type,
            is_missing_study_type,
            round((100 * completed_hum_treat / total_hum_treat), 1),
            completed_hum_treat,
            total_hum_treat,
            is_missing_hum_treat,
            round((100 * completed_hum_cond / total_hum_cond), 1),
            completed_hum_cond,
            total_hum_cond,
            is_missing_hum_cond,
            round((100 * completed_hum_subj / total_hum_subj), 1),
            completed_hum_subj,
            total_hum_subj,
            is_missing_hum_subj,
            round((100 * completed_data / total_data), 1),
            completed_data,
            total_data,
            is_missing_data
        ])

In [None]:
# create a list to store all the gathered data
cedar_comp_info = []

# these are fields in the Metadata Location section in the MDS that are not on the CEDAR form.
# We are excluding them to keep from confusing the PIs if we need to share the list
noncedar = [
    'cedar_study_metadata.metadata_location.data_repositories',
    'cedar_study_metadata.metadata_location.nih_reporter_link',
    'cedar_study_metadata.metadata_location.nih_application_id',
    'cedar_study_metadata.metadata_location.clinical_trials_study_ID',
    'cedar_study_metadata.metadata_location.cedar_study_level_metadata_template_instance_ID'
]

#for each row
#if the column name begins with cedar_study_metadata.XXX.
#loop through the columns with that prefix
#count if not empty string/NaN/0
for index, row in df1.iterrows():
    # read the timestamp from the current row, not always from the first row of df1
    if 'time_of_last_cedar_updated' in df1:
        cedar_update = row['time_of_last_cedar_updated']
    else:
        cedar_update = ''

    hid = row['guids']
    searchhid = 'HDP00011'
    if hid == searchhid: print(f'####### {searchhid}')
    #if the column name begins with cedar_study_metadata.minimal_info.
    sel_min_info = row.index.str.startswith("cedar_study_metadata.minimal_info.")
    # how many fields are completed in Minimal Info section
    not_empty_string = (row.loc[sel_min_info] != "")
    not_0 = (row.loc[sel_min_info] != "0")
    not_nan = ~row.loc[sel_min_info].isna()
    completed_min_info = (not_empty_string & not_0 & not_nan).value_counts().get(True, default=0)
    if hid == searchhid: print(f'completed_min_info: {completed_min_info}')
    # how many total fields in Minimal Info section
    total_min_info=len(row.loc[sel_min_info])
    # find the fields that are not completed
    is_empty_string = (row.loc[sel_min_info] == "")
    is_0 = (row.loc[sel_min_info] == "0")
    is_nan = row.loc[sel_min_info].isna()
    is_noncedar = row.loc[sel_min_info].index.isin(noncedar)
    missing_min_info = (is_empty_string | is_0 | is_nan) & ~is_noncedar
    is_missing_min_info = row.loc[sel_min_info].loc[missing_min_info].index.tolist()

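    #if the column name is cedar_study_metadata.metadata_location.other_study_websites
    # if is not nan
    # NaN never compares equal to anything (even itself), so `!= np.nan` is always True;
    # test for a float NaN explicitly instead
    websites = row["cedar_study_metadata.metadata_location.other_study_websites"]
    websites_is_nan = isinstance(websites, float) and np.isnan(websites)
    completed_websites = 1 if websites_is_nan else 2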
    if hid == searchhid: print(f'completed_websites: {completed_websites}')

    # only 2 fields, including autopop field
    total_websites = 2
    #if the column name begins with cedar_study_metadata.metadata_location.
    sel_met_loc = row.index.str.startswith("cedar_study_metadata.metadata_location.")
    # Find the fields that are not completed
    is_empty_string = (row.loc[sel_met_loc] == "")
    is_0 = (row.loc[sel_met_loc] == "0")
    is_nan = row.loc[sel_met_loc].isna()
    is_noncedar = row.loc[sel_met_loc].index.isin(noncedar)
    missing_met_loc = (is_empty_string | is_0 | is_nan) & ~is_noncedar
    is_missing_met_loc = row.loc[sel_met_loc].loc[missing_met_loc].index.tolist()

    #if the column name begins with cedar_study_metadata.data_availability.
    sel_data_avail = row.index.str.startswith("cedar_study_metadata.data_availability.")
    # how many total fields in Data Availability section
    total_data_avail=len(row.loc[sel_data_avail])
    # how many fields are completed in Data Availability section
    not_empty_string = (row.loc[sel_data_avail] != "")
    not_0 = (row.loc[sel_data_avail] != "0")
    not_nan = ~row.loc[sel_data_avail].isna()
    completed_data_avail = (not_empty_string & not_0 & not_nan).value_counts().get(True, default=0)
    if hid == searchhid: print(f'completed_data_avail: {completed_data_avail}')
    # Find the fields that are not completed
    is_empty_string = (row.loc[sel_data_avail] == "")
    is_0 = (row.loc[sel_data_avail] == "0")
    is_nan = row.loc[sel_data_avail].isna()
    is_noncedar = row.loc[sel_data_avail].index.isin(noncedar)
    missing_data_avail = (is_empty_string | is_0 | is_nan) & ~is_noncedar
    is_missing_data_avail = row.loc[sel_data_avail].loc[missing_data_avail].index.tolist()

    #if the column name begins with cedar_study_metadata.study_translational_focus.
    sel_trans_focus = row.index.str.startswith("cedar_study_metadata.study_translational_focus.")
    # how many total fields in Study Translational Focus section
    total_trans_focus=len(row.loc[sel_trans_focus])
    # how many fields are completed in Study Translational Focus section
    not_empty_string = (row.loc[sel_trans_focus] != "")
    not_0 = (row.loc[sel_trans_focus] != "0")
    not_nan = ~row.loc[sel_trans_focus].isna()
    completed_trans_focus = (not_empty_string & not_0 & not_nan).value_counts().get(True, default=0)
    if hid == searchhid: print(f'completed_trans_focus: {completed_trans_focus}')
    # Find the fields that are not completed
    is_empty_string = (row.loc[sel_trans_focus] == "")
    is_0 = (row.loc[sel_trans_focus] == "0")
    is_nan = row.loc[sel_trans_focus].isna()
    is_noncedar = row.loc[sel_trans_focus].index.isin(noncedar)
    missing_trans_focus = (is_empty_string | is_0 | is_nan) & ~is_noncedar
    is_missing_trans_focus = row.loc[sel_trans_focus].loc[missing_trans_focus].index.tolist()

    #if the column name begins with cedar_study_metadata.study_type.
    sel_study_type = row.index.str.startswith("cedar_study_metadata.study_type.")
    # how many total fields in Study Type section
    total_study_type=len(row.loc[sel_study_type])
    # how many fields are completed in Study Type section
    not_empty_string = (row.loc[sel_study_type] != "")
    not_0 = (row.loc[sel_study_type] != "0")
    not_nan = ~row.loc[sel_study_type].isna()
    completed_study_type = (not_empty_string & not_0 & not_nan).value_counts().get(True, default=0)
    if hid == searchhid: print(f'completed_study_type: {completed_study_type}')
    # Find the fields that are not completed
    is_empty_string = (row.loc[sel_study_type] == "")
    is_0 = (row.loc[sel_study_type] == "0")
    is_nan = row.loc[sel_study_type].isna()
    is_noncedar = row.loc[sel_study_type].index.isin(noncedar)
    missing_study_type = (is_empty_string | is_0 | is_nan) & ~is_noncedar
    is_missing_study_type = row.loc[sel_study_type].loc[missing_study_type].index.tolist()

    #if the column name begins with cedar_study_metadata.human_treatment_applicability.
    sel_hum_treat = row.index.str.startswith("cedar_study_metadata.human_treatment_applicability.")
    # how many total fields in Human Treatment Applicability section
    total_hum_treat=len(row.loc[sel_hum_treat])
    # how many fields are completed in Human Treatment Applicability section
    not_empty_string = (row.loc[sel_hum_treat] != "")
    not_0 = (row.loc[sel_hum_treat] != "0")
    not_nan = ~row.loc[sel_hum_treat].isna()
    completed_hum_treat = (not_empty_string & not_0 & not_nan).value_counts().get(True, default=0)
    if hid == searchhid: print(f'completed_hum_treat: {completed_hum_treat}')
    # Find the fields that are not completed
    is_empty_string = (row.loc[sel_hum_treat] == "")
    is_0 = (row.loc[sel_hum_treat] == "0")
    is_nan = row.loc[sel_hum_treat].isna()
    is_noncedar = row.loc[sel_hum_treat].index.isin(noncedar)
    missing_hum_treat = (is_empty_string | is_0 | is_nan) & ~is_noncedar
    is_missing_hum_treat = row.loc[sel_hum_treat].loc[missing_hum_treat].index.tolist()

    #if the column name begins with cedar_study_metadata.human_condition_applicability.
    sel_hum_cond = row.index.str.startswith("cedar_study_metadata.human_condition_applicability.")
    # how many total fields in Human Condition Applicability section
    total_hum_cond=len(row.loc[sel_hum_cond])
    # how many fields are completed in Human Condition Applicability section
    not_empty_string = (row.loc[sel_hum_cond] != "")
    not_0 = (row.loc[sel_hum_cond] != "0")
    not_nan = ~row.loc[sel_hum_cond].isna()
    completed_hum_cond = (not_empty_string & not_0 & not_nan).value_counts().get(True, default=0)
    if hid == searchhid: print(f'completed_hum_cond: {completed_hum_cond}')
    # Find the fields that are not completed
    is_empty_string = (row.loc[sel_hum_cond] == "")
    is_0 = (row.loc[sel_hum_cond] == "0")
    is_nan = row.loc[sel_hum_cond].isna()
    is_noncedar = row.loc[sel_hum_cond].index.isin(noncedar)
    missing_hum_cond = (is_empty_string | is_0 | is_nan) & ~is_noncedar
    is_missing_hum_cond = row.loc[sel_hum_cond].loc[missing_hum_cond].index.tolist()

    #if the column name begins with cedar_study_metadata.human_subject_applicability.
    sel_hum_subj = row.index.str.startswith("cedar_study_metadata.human_subject_applicability.")
    # how many total fields in Human Subject Applicability section
    total_hum_subj=len(row.loc[sel_hum_subj])
    # how many fields are completed in Human Subject Applicability section
    not_empty_string = (row.loc[sel_hum_subj] != "")
    not_0 = (row.loc[sel_hum_subj] != "0")
    not_nan = ~row.loc[sel_hum_subj].isna()
    completed_hum_subj = (not_empty_string & not_0 & not_nan).value_counts().get(True, default=0)
    if hid == searchhid: print(f'completed_hum_subj: {completed_hum_subj}')
    # Find the fields that are not completed
    is_empty_string = (row.loc[sel_hum_subj] == "")
    is_0 = (row.loc[sel_hum_subj] == "0")
    is_nan = row.loc[sel_hum_subj].isna()
    is_noncedar = row.loc[sel_hum_subj].index.isin(noncedar)
    missing_hum_subj = (is_empty_string | is_0 | is_nan) & ~is_noncedar
    is_missing_hum_subj = row.loc[sel_hum_subj].loc[missing_hum_subj].index.tolist()

    #if the column name begins with cedar_study_metadata.data.
    sel_data = row.index.str.startswith("cedar_study_metadata.data.")
    # how many total fields in Data section
    total_data=len(row.loc[sel_data])
    # how many fields are completed in Data section
    not_empty_string = (row.loc[sel_data] != "")
    not_0 = (row.loc[sel_data] != "0")
    not_nan = ~row.loc[sel_data].isna()
    completed_data = (not_empty_string & not_0 & not_nan).value_counts().get(True, default=0)
    if hid == searchhid: print(f'completed_data: {completed_data}')
    # Find the fields that are not completed
    is_empty_string = (row.loc[sel_data] == "")
    is_0 = (row.loc[sel_data] == "0")
    is_nan = row.loc[sel_data].isna()
    is_noncedar = row.loc[sel_data].index.isin(noncedar)
    missing_data = (is_empty_string | is_0 | is_nan) & ~is_noncedar
    is_missing_data = row.loc[sel_data].loc[missing_data].index.tolist()
    # Collapse original 2 cells into single
    overall_total = total_min_info + total_websites + total_data_avail + total_trans_focus + total_study_type + total_hum_treat + total_hum_cond + total_hum_subj + total_data
    if hid == searchhid: print(f'total_min_info: {total_min_info}')
    if hid == searchhid: print(f'total_websites: {total_websites}')
    if hid == searchhid: print(f'total_data_avail: {total_data_avail}')
    if hid == searchhid: print(f'total_trans_focus: {total_trans_focus}')
    if hid == searchhid: print(f'total_study_type: {total_study_type}')
    if hid == searchhid: print(f'total_hum_treat: {total_hum_treat}')
    if hid == searchhid: print(f'total_hum_cond: {total_hum_cond}')
    if hid == searchhid: print(f'total_hum_subj: {total_hum_subj}')
    if hid == searchhid: print(f'total_data: {total_data}')

    overall_complete = completed_min_info + completed_websites + completed_data_avail + completed_trans_focus + completed_study_type + completed_hum_treat + completed_hum_cond + completed_hum_subj + completed_data
    if hid == searchhid: print(f'>>> overall_complete: {overall_complete}')
    if hid == searchhid: print(f'>>> overall_total: {overall_total}')
    overall_pct = round((100 * overall_complete / overall_total), 1)

    cedar_comp_info.append(
        [
            hid,
            cedar_update,
            round((100 * completed_min_info / total_min_info), 1),
            completed_min_info,
            total_min_info,
            is_missing_min_info,
            round((100 * (completed_websites) / (total_websites)), 1),
            completed_websites,
            total_websites,
            is_missing_met_loc,
            round((100 * completed_data_avail / total_data_avail), 1),
            completed_data_avail,
            total_data_avail,
            is_missing_data_avail,
            round((100 * completed_trans_focus / total_trans_focus), 1),
            completed_trans_focus,
            total_trans_focus,
            is_missing_trans_focus,
            round((100 * completed_study_type / total_study_type), 1),
            completed_study_type,
            total_study_type,
            is_missing_study_type,
            round((100 * completed_hum_treat / total_hum_treat), 1),
            completed_hum_treat,
            total_hum_treat,
            is_missing_hum_treat,
            round((100 * completed_hum_cond / total_hum_cond), 1),
            completed_hum_cond,
            total_hum_cond,
            is_missing_hum_cond,
            round((100 * completed_hum_subj / total_hum_subj), 1),
            completed_hum_subj,
            total_hum_subj,
            is_missing_hum_subj,
            round((100 * completed_data / total_data), 1),
            completed_data,
            total_data,
            is_missing_data,
            overall_pct,
            overall_complete
        ])

cedar_comp_info
today = date.today()
pd.DataFrame(cedar_comp_info).to_csv(f'HEAL_CEDAR_{today}_CEDAR_COMP_INFO_V20.csv')
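
The cell above repeats the same three steps for every CEDAR section: select the columns with a given prefix, count the values that are not blank, and list the blank ones. As a minimal sketch (not the notebook's own code), those steps could be wrapped in one helper. The function name `section_completion` and the toy field names are hypothetical; the blank rule (NaN, empty string, or the string "0") and the `noncedar` exclusion follow the cell above.

```python
import numpy as np
import pandas as pd


def section_completion(row, prefix, noncedar=()):
    """Return (completed, total, missing_fields) for the columns that start with prefix."""
    section = row.loc[row.index.str.startswith(prefix)]
    # A field is treated as blank if it is NaN, an empty string, or the string "0"
    is_blank = section.isna() | (section == "") | (section == "0")
    completed = int((~is_blank).sum())
    total = len(section)
    # Fields that are not on the CEDAR form are kept out of the "missing" list
    mask = (is_blank & ~section.index.isin(noncedar)).to_numpy()
    missing = section.index[mask].tolist()
    return completed, total, missing


# Toy row with hypothetical field names, for illustration only
row = pd.Series({
    "guids": "HDP_EXAMPLE",
    "cedar_study_metadata.minimal_info.field_a": "some text",
    "cedar_study_metadata.minimal_info.field_b": "",
    "cedar_study_metadata.minimal_info.field_c": np.nan,
    "cedar_study_metadata.minimal_info.field_d": "0",
    "cedar_study_metadata.data.field_e": "yes",
})

completed, total, missing = section_completion(row, "cedar_study_metadata.minimal_info.")
print(completed, total, missing)
```

Calling the helper once per prefix would give the same three values the cell computes for each section.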

**Let's see it!**
This cell takes all the information we just collected and puts it into a dataframe where we can view it. You should see a data table when this cell is complete.

Important note: the very top row in the data table has NO row header -- that is the index row number (starting at 0).

In [None]:
col_names = [
    "guids",
    "Last CEDAR update",
    "Minimal Info % complete",
    "# complete in Minimal Info",
    "total # in Minimal Info",
    "Missing fields in Minimal Info",
    "Metadata Location % complete",
    "# complete in Metadata Location",
    "total # in Metadata Location",
    "Missing fields in Metadata Location",
    "Data Availability % complete",
    "# complete in Data Availability",
    "total # in Data Availability",
    "Missing fields in Data Availability",
    "Study Translational Focus % complete",
    "# complete in Study Translational Focus",
    "total # in Study Translational Focus",
    "Missing fields in Study Translational Focus",
    "Study Type % complete",
    "# complete in Study Type",
    "total # in Study Type",
    "Missing fields in Study Type",
    "Human Treatment Applicability % complete",
    "# complete in Human Treatment Applicability",
    "total # in Human Treatment Applicability",
    "Missing fields in Human Treatment Applicability",
    "Human Condition Applicability % complete",
    "# complete in Human Condition Applicability",
    "total # in Human Condition Applicability",
    "Missing fields in Human Condition Applicability",
    "Human Subject Applicability % complete",
    "# complete in Human Subject Applicability",
    "total # in Human Subject Applicability",
    "Missing fields in Human Subject Applicability",
    "Data % complete",
    "# complete in Data",
    "total # in Data",
    "Missing fields in Data",
    "Overall % complete (all sections)",
    "Overall # complete (all sections)"
]
complxn_stats = pd.DataFrame(cedar_comp_info, columns=col_names)

complxn_stats.T

| | 0 |
|---|---|
| guids | HDP00011 |
| Last CEDAR update | 2022-08-22T09:00:52-07:00 |
| Minimal Info % complete | 50.0 |
| # complete in Minimal Info | 2 |
| total # in Minimal Info | 4 |
| Missing fields in Minimal Info | [cedar_study_metadata.minimal_info.alternative... |
| Metadata Location % complete | 100.0 |
| # complete in Metadata Location | 2 |
| total # in Metadata Location | 2 |
| Missing fields in Metadata Location | [cedar_study_metadata.metadata_location.other_... |
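
The table display cuts off the lists of missing fields. One way to read a full list is to pull the cell value out directly. The sketch below uses a toy stand-in for `complxn_stats` with hypothetical field names, so it shows only the access pattern, not real data.

```python
import pandas as pd

# Toy stand-in for complxn_stats; the field names are hypothetical
toy_stats = pd.DataFrame({
    "guids": ["HDP_EXAMPLE"],
    "Missing fields in Minimal Info": [[
        "cedar_study_metadata.minimal_info.example_field_a",
        "cedar_study_metadata.minimal_info.example_field_b",
    ]],
})

# Read the list stored in row 0 instead of relying on the truncated table display
missing = toy_stats.loc[0, "Missing fields in Minimal Info"]
for field in missing:
    print(field)
```

Alternatively, `pd.set_option("display.max_colwidth", None)` asks pandas not to truncate long cell values when it displays a table.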


**Overall CEDAR form completion**

Calculate overall CEDAR form completion by adding the number of completed fields and dividing by the total number of fields.
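
As a small worked example of that calculation: 6 completed fields out of 51 gives 100 × 6 / 51, which rounds to 11.8. The helper name `percent_complete` is hypothetical, and returning 0.0 for an empty form is an assumption made here to avoid dividing by zero.

```python
def percent_complete(completed, total):
    """Percent of fields completed, rounded to one decimal place."""
    # Assumption: an empty section or form is reported as 0.0 rather than raising an error
    if total == 0:
        return 0.0
    return round(100 * completed / total, 1)


print(percent_complete(6, 51))
```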

In [None]:
overall = []

# the nine CEDAR sections, as named in the complxn_stats column headers
sections = [
    "Minimal Info", "Metadata Location", "Data Availability",
    "Study Translational Focus", "Study Type", "Human Treatment Applicability",
    "Human Condition Applicability", "Human Subject Applicability", "Data",
]

for index, row in complxn_stats.iterrows():
    hid = row['guids']
    # add up this row's own counts, not the variables left over from the last loop iteration
    overall_total = sum(row[f"total # in {s}"] for s in sections)
    overall_complete = sum(row[f"# complete in {s}"] for s in sections)
    overall_pct = round((100 * overall_complete / overall_total), 1)
    overall.append(
        [hid,
         overall_pct,
         overall_complete
        ])
col_names = [
    'guids',
    'Overall % Complete',
    'Overall # Complete (possible 51)'
]
overall_df = pd.DataFrame(overall, columns=col_names)
overall_df

| | guids | Overall % Complete | Overall # Complete (possible 51) |
|---|---|---|---|
| 0 | HDP00011 | 11.8 | 6 |


### Create an additional dataframe with data that can help us make decisions about completion percentage

We know that many studies can have "complete" CEDAR forms that are not 100% complete. This is because some of the CEDAR questions do not apply to some studies, and they should be left blank. Below, I've listed some examples of study characteristics that determine whether some CEDAR questions apply. We will create a data frame to capture some info relevant to these characteristics, so we can include it on a report that helps us determine whether we should expect 100% completion or not.  

* Is the study focusing on Pain or Treatment of a Pain Condition (as opposed to opioid use or treatment of OUD)?  A few questions only apply to studies that focus on pain or treatment of pain. Look at the **Relevant Opioid use and-or Pain condition - Category (Hum Cond Appl)** field below.
* Is the study treatment-focused (focused on learning more about a treatment, intervention, or solution for a human pain or opioid use condition), or condition-focused (focused on learning more about a human pain or opioid use condition)? Some questions only apply to treatment-focused studies. Look at the **Transl Focus - Condition or Treatment** field below.
* Is the study a human-subjects study? Note that MANY of the human-subjects fields DO apply even to studies that are NOT human-subjects studies. But, some are exclusive to human-subjects studies. Look at the **Type of study subjects** field below.
* Are your study results applicable to certain categories of people (even if they are not human subjects studies)? Some or all of the questions in the Human Subject Applicability section may be left blank if your study does not have particular applicability to specific groups of people. (These fields are in the Human Subject Applicability section.)
* Does your study have a relevant website? (This is in the Metadata Location section.)
* Do you have an alternative study name or description you would want to associate with your study? (This is in the Minimal Info section.)
* Will your study collect data? Not all HEAL studies collect or produce data. Some develop methods and protocols for future studies. If your study will NOT collect or produce data, but will produce shareable products other than data (e.g. protocols, slide decks, etc.), there are some questions that do not apply to your study. Look at the **Will it produce data (Data Avail)** field below.
* Can you estimate when your study will start and finish collecting data? When it will start and finish releasing data? (These are in the Data Availability section.)
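
Before reading too much into a completion percentage, it may help to check which of the four characteristic fields named above are themselves blank for a study. The sketch below is illustrative only: the toy dataframe and its placeholder values are made up, and the blank rule (NaN, empty string, or the string "0") is borrowed from the completion cell earlier in the notebook.

```python
import numpy as np
import pandas as pd

characteristic_cols = [
    "Transl Focus - Condition or Treatment",
    "Type of study subjects",
    "Relevant Opioid use and-or Pain condition - Category (Hum Cond Appl)",
    "Will it produce data (Data Avail)",
]

# Toy stand-in for the characteristics dataframe; all values are placeholders
toy = pd.DataFrame(
    {
        characteristic_cols[0]: ["placeholder", np.nan],
        characteristic_cols[1]: ["placeholder", ""],
        characteristic_cols[2]: ["placeholder", "placeholder"],
        characteristic_cols[3]: ["placeholder", "0"],
    },
    index=pd.Index(["HDP_A", "HDP_B"], name="guids"),
)

# True where a characteristic is blank (NaN, empty string, or "0")
blank = toy[characteristic_cols].isna() | toy[characteristic_cols].isin(["", "0"])

# For each study, list the characteristics that still have no answer
unanswered = {
    hid: [col for col in characteristic_cols if blank.loc[hid, col]]
    for hid in blank.index
}
print(unanswered)
```

A study with unanswered characteristics is one where it is harder to judge which blank CEDAR fields are expected.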


In [None]:
def study_characteristics(row2df):
    """Collect the CEDAR answers that indicate which questions apply to a study."""
    # each group holds the rows for one HDP ID; read the values from its first row
    trans_focus = row2df.iloc[0]['cedar_study_metadata.study_translational_focus.study_translational_focus'] # condition or treatment
    subj_type = row2df.iloc[0]['cedar_study_metadata.study_type.study_subject_type']
    condition_category = row2df.iloc[0]['cedar_study_metadata.human_condition_applicability.condition_category']
    produce_data = row2df.iloc[0]['cedar_study_metadata.data_availability.produce_data']

    return {
        'Transl Focus - Condition or Treatment': trans_focus,
        'Type of study subjects': subj_type,
        'Relevant Opioid use and-or Pain condition - Category (Hum Cond Appl)': condition_category,
        'Will it produce data (Data Avail)': produce_data,
    }

res2_series = df1.groupby('guids').apply(study_characteristics)
res2_df = pd.DataFrame(res2_series.tolist(), index=res2_series.index)
res2_df.T

## Combining all the dataframes

We have created 5 different dataframes, each with information important to us. Let's merge these dataframes into 1 big dataframe, in an order that is useful to us. This will give us one single dataframe that we can call on to populate a report for each HDP ID.

In [None]:
#overall_df
#res_df
#repo_info
#complxn_stats
#res2_df

merged_df = pd.merge(overall_df, res_df, how='outer', on='guids')
merged_df = pd.merge(merged_df, repo_info, how='outer', on='guids')
merged_df = pd.merge(merged_df, complxn_stats, how='outer', on='guids')
merged_df = pd.merge(merged_df, res2_df, how='outer', on='guids')
merged_df = merged_df.rename(columns={'guids': 'HDP ID'})
merged_df.T
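
To see what an outer merge on `guids` does, here is a toy sketch with two small made-up dataframes. An outer merge keeps every HDP ID that appears in either dataframe and fills the gaps with NaN. The `indicator=True` option is not used in the cell above; it is added here only to show where each row came from.

```python
import pandas as pd

# Two toy dataframes that share only one HDP ID; all values are made up
left = pd.DataFrame({"guids": ["HDP_A", "HDP_B"], "Overall % Complete": [11.8, 50.0]})
right = pd.DataFrame({"guids": ["HDP_B", "HDP_C"], "registered": [True, False]})

# how="outer" keeps the IDs from both sides; indicator=True adds a "_merge" column
merged = pd.merge(left, right, how="outer", on="guids", indicator=True)
print(merged)
```

Rows marked `left_only` or `right_only` in the `_merge` column are IDs that one of the dataframes is missing.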

## Export data to a .csv
You can export all the data in this dataframe to a CSV which you can open in Excel.

In [None]:
today = date.today()
merged_df.T.to_csv(f'HEAL_CEDAR_{today}.csv')
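
Because the export writes the transposed dataframe, each row of the CSV is a field and each column is a study. A minimal sketch of reading such a file back into the original orientation follows. It uses a toy dataframe and an in-memory buffer in place of a real file, so nothing is written to disk.

```python
import io

import pandas as pd

# Toy stand-in for merged_df; the values are made up
toy = pd.DataFrame({"HDP ID": ["HDP_EXAMPLE"], "Overall % Complete": [11.8]})

# Write the transposed table, as the export cell does, into an in-memory buffer
buffer = io.StringIO()
toy.T.to_csv(buffer)
buffer.seek(0)

# Read it back: the first CSV column holds the field names, so use it as the index,
# then transpose to get one row per study again
round_trip = pd.read_csv(buffer, index_col=0).T
print(round_trip)
```

In this round trip the numbers come back as text, because each CSV column mixes text and numeric values, so they may need converting before any arithmetic.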

## Save a copy of your notebook (optional)

To save a record of your work, choose a (short) identifier for this study evaluation -- enter that value in the quotations for the "identifier" variable in the first line.

When you run the cell, there will be no output, but a copy of this notebook will be saved in the same local folder as this current notebook, with the name "CEDAR_complxn_{identifier}_{today}.ipynb" (for example, CEDAR_complxn_9889726_2023-12-15.ipynb)

In [None]:
identifier = "maybe HDPID or appl_id"
today = date.today()
%notebook CEDAR_complxn_{identifier}_{today}.ipynb
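
If the identifier might contain spaces or other characters that are awkward in file names, it could be cleaned up before it is used. The sketch below is one possible approach, not part of the notebook: the helper name `notebook_copy_name` and the rule of replacing anything other than letters, digits, hyphens, and underscores are assumptions.

```python
import re
from datetime import date


def notebook_copy_name(identifier, today=None):
    """Build the copy's file name in the CEDAR_complxn_{identifier}_{today}.ipynb pattern."""
    today = today or date.today()
    # Assumed rule: collapse any run of other characters into a single underscore
    safe = re.sub(r"[^A-Za-z0-9_-]+", "_", str(identifier)).strip("_")
    return f"CEDAR_complxn_{safe}_{today}.ipynb"


print(notebook_copy_name(9889726, date(2023, 12, 15)))
```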

## Appendix

Optional, for reference: Running this cell will create a list of all the data columns in the Gen3 Discovery data frame (df1) produced above (the dataframe which includes the CEDAR form).

This list is only here for reference - you do not need to do anything with this list.

In [None]:
sorted(list(df1.columns))
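
The full column list can be long. As an optional sketch, the CEDAR columns could be counted per section by reading the part of each name that follows `cedar_study_metadata.`. The function name `count_cedar_sections` and the toy column names are hypothetical; with the real data you would pass `df1.columns` instead.

```python
from collections import Counter


def count_cedar_sections(columns, root="cedar_study_metadata."):
    """Count columns per CEDAR section, based on the name that follows the root prefix."""
    counts = Counter()
    for name in columns:
        if name.startswith(root):
            parts = name[len(root):].split(".", 1)
            # Only count names shaped like root + section + "." + field
            if len(parts) == 2:
                counts[parts[0]] += 1
    return dict(counts)


# Toy column names, for illustration only
toy_columns = [
    "guids",
    "cedar_study_metadata.minimal_info.example_field_a",
    "cedar_study_metadata.minimal_info.example_field_b",
    "cedar_study_metadata.data.example_field_c",
]
print(count_cedar_sections(toy_columns))
```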