# MIDRC Pre-Ingestion QC Report using the planxton_midrc library
---
### Purpose:
Perform basic quality checks on batches of data received from MIDRC data contributors before ingestion work begins, so that contributors can be notified of any issues with their submissions as early as possible.

Authors: Chris Meyer, PhD & Dan Biber, MS <br>
Oct 2024 <br>

### Procedure:
1. Metadata is downloaded
2. The following items are checked for every dataset marked for ingestion:
    - [Appropriate cloud resource file structure](#meta_data_download)
    - [Metadata files named as expected for our ingestion scripts](#sort_batch_tsvs)
    - [Each metadata file can be loaded and has data in it](#metadata_valid)
    - [Types of properties match their values](#pre_ingest_qc_check)
    - [Main QC Script](#main_script)
        - Linking properties are appropriately populated
        - Properties do not contain special characters and are complete
        - Uniqueness of submitter_ids within a submitted batch
    - [Other submitter_ids that already exist in MIDRC represent the same values](#other_xchecks) (e.g., when a case is resubmitted, confirm that it is the same case)
    - [Submitted series_uids are unique compared to what currently exists in MIDRC](#series_uid_xcheck) 

3. A report is provided to data submitters and this notebook is saved for a specific batch


In [None]:
# Prepare Python environment

import pandas as pd
import json
import pathlib
from pathlib import Path
import sys, os
from gen3.submission import Gen3Submission
from gen3.auth import Gen3Auth
from gen3.index import Gen3Index
from gen3.query import Gen3Query

In [None]:
# Append the directory containing the planxton.py module to sys.path

# If the user's github directory sits directly under their home directory (e.g. "Users/userid/"), the following should work
plx_path = os.path.expanduser("~/github/midrc-scripts/")

# Append to sys.path
sys.path.append(plx_path)

print("Plan(x)ton path:", plx_path)

# Import the planxton class from the planxton.py module
from planxton_midrc import planxton_midrc

In [None]:
#Setting up connection to both MIDRC staging and MIDRC validate staging
s_api = 'https://staging.midrc.org'
s_cred = os.path.expanduser('~/Downloads/midrc-staging-credentials.json')

vs_api = 'https://validatestaging.midrc.org/'
vs_cred = os.path.expanduser('~/Downloads/midrc-validatestaging-credentials.json')

s_plx = planxton_midrc(s_api, s_cred)
s_exp = s_plx.expansion()

vs_plx = planxton_midrc(vs_api, vs_cred)
vs_exp = vs_plx.expansion()

#Change your cwd if you are not already in the correct working directory
wd_path = os.path.expanduser('~/Documents/Projects/MIDRC/sheep_dog_ingestion/RSNA/RSNA_20240528')
os.chdir(wd_path)

cd = os.getcwd()
print("Your current working directory is set to: \n", cd, "\n\n")

#Test that Gen3Submission and Gen3Auth are initialized correctly in Plan(x)ton
print(s_plx.fetch_programs())
print(s_plx.fetch_projects())

print(vs_plx.fetch_programs())
print(vs_plx.fetch_projects())

<a class="anchor" id="meta_data_download"></a>
## Download the batch metadata TSVs and clinical/image manifests
---
Run the following in a Linux/Unix shell:

* a. Pull data from AWS bucket to utilityvm.midrc.csoc, e.g.:
```
aws s3 sync s3://external-data-midrc-replication/replicated-data-acr/RSNA_20220812/ RSNA_20220812/ --exclude "*" --include "*.tsv"
```
* b. Sync the data to your local machine for submission, or run this notebook directly on the utility VM via an IPython shell, e.g.:
```
wd="/Users/christopher/Documents/Notes/MIDRC/data/ssot-s3"
batch="RSNA_20230303"
rsync -rP utilityvm.midrc.csoc:/home/ubuntu/download/${batch} ${wd}
```



In [None]:
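# Expand "~" up front: os.chdir() does not expand it and would fail on the raw string
downloads = os.path.expanduser("~/Documents/Projects/MIDRC/sheep_dog_ingestion/RSNA/")
# Change the batch name for each new batch!
batch = "RSNA_20240813"
batch_dir = os.path.join(downloads, batch)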
print(batch_dir)

The notebook is designed to run from inside the folder where the batch metadata was just downloaded, so change into that directory first.

In [None]:
os.chdir(batch_dir)
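Since `os.chdir` does not expand `~` and raises an error if the folder is missing, it can help to resolve and verify the batch directory before changing into it. The following is a minimal sketch; the helper name `resolve_batch_dir` is hypothetical and is not part of planxton_midrc.

```python
from pathlib import Path


def resolve_batch_dir(downloads: str, batch: str) -> Path:
    """Expand '~', join the downloads folder with the batch name, and verify it exists."""
    batch_path = Path(downloads).expanduser() / batch
    if not batch_path.is_dir():
        # Fail early with a clear message instead of a later, less obvious error
        raise FileNotFoundError(f"Batch directory not found: {batch_path}")
    return batch_path
```

The returned path could then be passed to `os.chdir`, which accepts path-like objects.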

<a class="anchor" id="sort_batch_tsvs"></a>
## Sort the TSVs into manifests, submission TSVs, and supplemental/other
---
Provide the batch name ("batch") and the directory where the batch TSVs are located ("batch_dir")



In [None]:
#Main pre-ingest QC function
#Common variables for both commons
organization = 'RSNA'
date = '20240813'

#Program and Project identifier for MIDRC Staging 
s_program = 'Open'
s_project = 'R1'

#Program and Project identifier for MIDRC validate staging
vs_program = 'SEQ_Open'
vs_project = 'R3'

#This is the Program Project Batch (ppb) object for staging
s_ppb = s_plx.create_ppb(s_program, s_project, organization, date)

#This is the Program Project Batch (ppb) object for validatestaging
vs_ppb = vs_plx.create_ppb(vs_program, vs_project, organization, date)

#batch_tsvs is an object used by many planxton_midrc functions; it lists the batch's metadata files and their local paths
batch_tsvs = s_plx.sort_batch_tsvs(s_ppb,batch_dir)
batch_tsvs

In [None]:
## Display batch TSV information

if len(batch_tsvs["other_tsvs"]) > 0:
    print("CAUTION!!: Other TSVs are not matched with data model and require special attention:")
    display(batch_tsvs["other_tsvs"])
if len(batch_tsvs["nomatch_tsvs"]) > 0:
    print("CAUTION!!: TSVs that don't match regex for finding TSVs and require special attention:")
    display(batch_tsvs["nomatch_tsvs"])

print("Clinical manifests:")
display(batch_tsvs["clinical_manifests"])
print("Image manifests:")
display(batch_tsvs["image_manifests"])
print("Submission TSVs:")
display(batch_tsvs["node_tsvs"])


<a class="anchor" id="metadata_valid"></a>
## Ensuring that the Batch Metadata is not empty and can be loaded


In [None]:
# Check if each tsv can be loaded and has data in it

file_test = {}

for tsv,path in batch_tsvs['node_tsvs'].items():
    file_test[tsv] = pd.read_csv(path, sep='\t').shape

display(file_test)
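A completely empty file makes `pd.read_csv` raise `pandas.errors.EmptyDataError`, which would stop the loop above at the first bad file. The sketch below is one possible way to collect shapes and problems for every TSV in a single pass; the function name `summarize_tsvs` is hypothetical, and it assumes the same `{node: path}` mapping as `batch_tsvs['node_tsvs']`.

```python
import pandas as pd


def summarize_tsvs(node_tsvs):
    """Return ({node: (rows, cols)}, [(node, problem), ...]) for a {node: path} mapping."""
    shapes, problems = {}, []
    for node, path in node_tsvs.items():
        try:
            df = pd.read_csv(path, sep="\t", dtype=str)
        except (pd.errors.EmptyDataError, pd.errors.ParserError, OSError) as err:
            # File is empty, malformed, or missing: record it and keep going
            problems.append((node, f"could not load: {err}"))
            continue
        shapes[node] = df.shape
        if df.shape[0] == 0:
            # The header parsed but there are no data rows
            problems.append((node, "header only, no data rows"))
    return shapes, problems
```

Any node listed in the second return value is a candidate for the removal cell that follows.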

In [None]:
#Only for removing empty tsv files from the node_tsvs

# keys_to_remove = ['visit', 'procedure']
# for key in keys_to_remove:
#     if key in batch_tsvs['node_tsvs']:
#         del batch_tsvs['node_tsvs'][key]

# display(batch_tsvs)

<a class="anchor" id="main_script"></a>
## Main QC Section 
- Linking properties are appropriately populated
- Properties do not contain special characters and are complete
- Uniqueness of submitter_ids within a submitted batch

In [None]:
# Expand "~" so the report directory is an absolute path
report_output_dir = os.path.expanduser("~/Documents/Projects/MIDRC/sheep_dog_ingestion/RSNA/RSNA_20240920")

qc_report = s_plx.pre_ingest_qc_check(s_ppb, batch_tsvs, report_output_dir)
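The internals of `pre_ingest_qc_check` are not shown in this notebook. As an illustration only, two of the listed checks (uniqueness of submitter_ids and special characters) could be sketched with pandas as below. The function names are hypothetical, and treating "special characters" as anything outside printable ASCII is an assumption rather than the library's actual rule.

```python
import pandas as pd


def duplicated_submitter_ids(df, id_col="submitter_id"):
    """Return every row whose identifier appears more than once in the batch."""
    return df[df.duplicated(subset=id_col, keep=False)]


def values_with_special_chars(df, pattern=r"[^\x20-\x7E]"):
    """Return (column, value) pairs that contain characters outside printable ASCII."""
    hits = []
    for col in df.columns:
        values = df[col].dropna().astype(str)
        for value in values[values.str.contains(pattern, regex=True)]:
            hits.append((col, value))
    return hits
```

Both helpers return the offending rows or values rather than a pass/fail flag, so the results can be copied directly into a report for the data submitter.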

<a class="anchor" id="other_xchecks"></a>
## Checking for duplicate submitter_ids that may be problematic

In [None]:
existing_img_study_df = s_plx.get_img_study_node(s_ppb)
print(existing_img_study_df.shape)

In [None]:
sub_img_study_df = pd.read_csv(batch_tsvs['node_tsvs']['imaging_study'], sep='\t')
img_study_overlaps = s_plx.img_study_xcheck(sub_img_study_df,existing_img_study_df)

print(img_study_overlaps)

In [None]:
existing_case_df = s_plx.get_case_node(s_ppb)
print(existing_case_df.shape)
sub_case_df = pd.read_csv(batch_tsvs['node_tsvs']['case'], sep='\t')
case_overlaps = s_plx.case_xcheck(sub_case_df,existing_case_df)

In [None]:
case_overlaps
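To illustrate what "the same values" means for a resubmitted record, the sketch below compares the columns shared by a submitted table and an existing table for each overlapping submitter_id. It is not the implementation of `case_xcheck` or `img_study_xcheck`; the function name `conflicting_resubmissions` and the string-based comparison are assumptions.

```python
import pandas as pd


def conflicting_resubmissions(submitted, existing, id_col="submitter_id"):
    """Return (submitter_id, column, submitted_value, existing_value) for each mismatch."""
    shared = [c for c in submitted.columns if c in existing.columns and c != id_col]
    # Inner join keeps only identifiers present in both tables
    merged = submitted.merge(existing, on=id_col, suffixes=("_new", "_old"))
    conflicts = []
    for col in shared:
        # Compare as strings so that missing values on both sides count as equal
        new = merged[f"{col}_new"].astype(str)
        old = merged[f"{col}_old"].astype(str)
        mismatch = new != old
        for sid, n, o in zip(merged.loc[mismatch, id_col], new[mismatch], old[mismatch]):
            conflicts.append((sid, col, n, o))
    return conflicts
```

An empty result would suggest that every resubmitted record matches what is already stored, at least for the shared columns.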

<a class="anchor" id="series_uid_xcheck"></a>
## Checking for duplicate series uids

In [None]:
stag_series_nodes = s_plx.get_series_nodes(s_ppb)


In [None]:
v_series_nodes = vs_plx.get_series_nodes(vs_ppb)


In [None]:
staging_conflicting_uids = s_plx.series_uid_xcheck(batch_tsvs=batch_tsvs, series_df_dict=stag_series_nodes)
vstag_conflicting_uids = vs_plx.series_uid_xcheck(batch_tsvs=batch_tsvs, series_df_dict=v_series_nodes)

In [None]:
staging_conflicting_uids

In [None]:
vstag_conflicting_uids
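Conceptually, the series UID cross-check is a set intersection between the UIDs in the submitted batch and those already in the commons. The sketch below shows that idea under stated assumptions: it is not the `series_uid_xcheck` implementation, the function name is hypothetical, both inputs are assumed to be dictionaries of DataFrames keyed by node name, and the column is assumed to be called `series_uid`.

```python
import pandas as pd


def overlapping_series_uids(submitted_frames, existing_frames, uid_col="series_uid"):
    """Return the sorted series UIDs found in both the submitted and the existing tables."""

    def collect(frames):
        uids = set()
        for df in frames.values():
            # Skip tables that do not carry the UID column at all
            if uid_col in df.columns:
                uids.update(df[uid_col].dropna().astype(str))
        return uids

    return sorted(collect(submitted_frames) & collect(existing_frames))
```

Any UID in the result would need to be reviewed with the data contributor before ingestion.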