# Quality Check of Data in the Staging and ValidateStaging Environments Prior to Monthly Release
---
by Johnbright Anyaibe, M.S

Scientific Support Analyst at the Center for Translational Data Science at University of Chicago

April 2023

---

The purpose of this notebook is to run basic quality checks on batches of data received from MIDRC data contributors prior to the monthly release. This is a post-ingestion check.
It covers the data in both [MIDRC-Staging](https://staging.midrc.org/dd) and [MIDRC-ValidateStaging](https://validatestaging.midrc.org/dd).

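Import the necessary libraries (pandas, numpy, os, sys) and the gen3 SDK for accessing the MIDRC API. The `Gen3Expansion` class is loaded from a local clone of the `cgmeyer/gen3sdk-python` repository.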

In [None]:
import pandas as pd
import numpy as np
import os
import sys
import gen3
from gen3.submission import Gen3Submission
from gen3.auth import Gen3Auth
from gen3.index import Gen3Index
from gen3.query import Gen3Query
git_dir='/Users/johnbrightanyaaibe/Documents/GitHub'
sdk_dir='/cgmeyer/gen3sdk-python'
sys.path.insert(1, '{}{}'.format(git_dir,sdk_dir))
from expansion.expansion import Gen3Expansion
%run /Users/johnbrightanyaaibe/Documents/GitHub/cgmeyer/gen3sdk-python/expansion/expansion.py


Setting up authentication credentials for the three MIDRC environments used in this notebook: the main data commons (`api`), ValidateStaging (`vapi`), and Staging (`sapi`).

In [None]:
api = 'https://data.midrc.org'
cred = '/Users/johnbrightanyaaibe/Downloads/midrc-credentials.json'
auth = Gen3Auth(api, refresh_file=cred)
sub = Gen3Submission(api, auth)
query = Gen3Query(auth)
index = Gen3Index(auth)
exp = Gen3Expansion(api,auth,sub)
exp.get_project_ids()
########################
vapi = 'https://validatestaging.midrc.org'
vcred = '/Users/johnbrightanyaaibe/Downloads/midrc-validatestaging-credentials.json'
vauth = Gen3Auth(vapi, refresh_file=vcred)
vsub = Gen3Submission(vapi, vauth)
vquery = Gen3Query(vauth)
vexp = Gen3Expansion(vapi,vauth,vsub)
vexp.get_project_ids()
########################
sapi = 'https://staging.midrc.org'
scred = '/Users/johnbrightanyaaibe/Downloads/midrc-staging-credentials.json'
sauth = Gen3Auth(sapi, refresh_file=scred)
ssub = Gen3Submission(sapi, sauth)
squery = Gen3Query(sauth)
sexp = Gen3Expansion(sapi,sauth,ssub)
sexp.get_project_ids()


Defining variables for the batch, user, and node that will be used in the code.

In [None]:
batch = ""
user = ""
node = ""


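Downloading the batch's metadata TSV files from the utility VM to the user's local machine with `rsync`, so that they can be read in below.

In [None]:
# the leading "!" runs the line as a shell command from the notebook cell
!rsync -rP utilityvm.midrc.csoc:/home/ubuntu/download/RSNA_20230117/ /Users/johnbrightanyaaibe/Documents/Notes/MIDRC/submission_tsvs/RSNA_20230117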


The following code reads the TSV files in a specific folder and prints the number of rows in each file.

- It first defines the `folder_path` variable with the path to the folder where the TSV files are located, and the `file_suffix` variable with the suffix of the TSV files to be read.
- It then loops through all the files in the `folder_path` directory using `os.listdir()`.
- For each file in the directory, it checks if the file name ends with the specified `file_suffix` using the `.endswith()` method.
- If the file name ends with the specified suffix, it creates a full path to the file using `os.path.join()` and reads the file into a Pandas DataFrame using `pd.read_csv()`. 
- Finally, it prints the length of the DataFrame using `len(df)` along with the name of the file.

Overall, this code is useful for quickly checking the size of multiple TSV files in a directory.

In [None]:
folder_path = "/Users/johnbrightanyaaibe/Documents/Notes/MIDRC/submission_tsvs/RSNA_20230117"
file_suffix = "RSNA_20230117.tsv"
for filename in os.listdir(folder_path):
    if filename.endswith(file_suffix):
        file_path = os.path.join(folder_path, filename)
        df = pd.read_csv(file_path, sep="\t")
        print(f"Length of {filename}: {len(df)}")


Cross-checking the metadata with what has been submitted and validating the metadata.

In [None]:
# read in the given metadata
filename = "/Users/johnbrightanyaaibe/Documents/Notes/MIDRC/submission_tsvs/RSNA_20230117/originals/case_RSNA_20230117.tsv"
batch_df = pd.read_csv(filename,sep='\t',header=0,dtype=str)
batch_ids = list(set(batch_df.submitter_id))
print(f"Unique submitter_ids in batch: {len(batch_ids)}")

# crosscheck the given metadata with what's been submitted
sexp.get_node_tsvs(node=node,overwrite=False,remove_empty=True)
node_filename= "path to exported node tsv"
node_df = pd.read_csv(node_filename,sep='\t',header=0,dtype=str)

# rows of the exported node TSV whose submitter_id appears in the batch
hdf = node_df.loc[node_df.submitter_id.isin(batch_ids)]
print(f"Batch records found in {node}: {len(hdf)}")
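
A minimal sketch of the same cross-check on toy data, assuming both tables key their records on a `submitter_id` column. The IDs and the names `matched_df` and `missing_ids` are made up for illustration. Besides selecting the matching rows, it lists which batch IDs are absent from the exported node TSV, which is usually the first thing to follow up on.

```python
import pandas as pd

# Toy stand-ins for the contributor's batch TSV and the exported node TSV.
batch_df = pd.DataFrame({"submitter_id": ["case-1", "case-2", "case-3", "case-3"]})
node_df = pd.DataFrame({"submitter_id": ["case-1", "case-3", "case-9"]})

# Unique IDs the contributor sent in this batch.
batch_ids = set(batch_df["submitter_id"])

# Rows of the node export that belong to this batch.
matched_df = node_df.loc[node_df["submitter_id"].isin(batch_ids)]

# Batch IDs that do not appear in the node export at all.
missing_ids = sorted(batch_ids - set(node_df["submitter_id"]))

print(f"{len(matched_df)} of {len(batch_ids)} batch IDs found; missing: {missing_ids}")
```

An empty `missing_ids` list would suggest that every record in the batch reached the environment.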


If there are any removals, the following code can be used together with a manifest of the removed records to count how many of the batch's cases were removed.

In [None]:
file1 = "/Users/johnbrightanyaaibe/Documents/Notes/MIDRC/submission_tsvs/RSNA_20230117/case_RSNA_20230117.tsv"
file2 = "/Users/johnbrightanyaaibe/Documents/Notes/MIDRC/submission_tsvs/RSNA_20230117/remove4grandchallenge2_image_manifest_RSNA_20230117.tsv"

df1 = pd.read_csv(file1, delimiter='\t')
df2 = pd.read_csv(file2, delimiter='\t')

# extract the case_ids column from both dataframes
case_ids1 = set(df1['case_ids'])
case_ids2 = set(df2['case_ids'])

# count the number of common values between the two columns
common_values_count = len(case_ids1.intersection(case_ids2))

print(f"There are {common_values_count} common values in the case_ids column of the two files.")
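
One possible refinement, sketched on made-up TSV content: reading the ID column as strings and stripping whitespace before intersecting. Without `dtype=str`, pandas may parse an ID such as `001` as the integer `1` in one file and not in another, so matching IDs could silently fail to intersect. The helper name `read_ids` and the sample rows are hypothetical.

```python
import io
import pandas as pd

# Toy TSV contents; in practice these would be paths to the two files.
case_tsv = "case_ids\tsex\n001\tF\n002\tM\n003\tF\n"
manifest_tsv = "case_ids\tfile_name\n002\ta.dcm\n003 \tb.dcm\n\tc.dcm\n"

def read_ids(source, column="case_ids"):
    """Return the set of non-empty, whitespace-stripped IDs in one TSV column."""
    df = pd.read_csv(source, sep="\t", dtype=str)  # keep leading zeros
    ids = df[column].dropna().str.strip()          # drop blanks, trim spaces
    return set(ids[ids != ""])

case_ids = read_ids(io.StringIO(case_tsv))
removed_ids = read_ids(io.StringIO(manifest_tsv))

# IDs present in both the case file and the removal manifest.
common = sorted(case_ids & removed_ids)
print(f"{len(common)} common values: {common}")
```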


The following code loops through all the TSV files in a directory and reports how many records each file has in common with the manifest of the removed records.

In [None]:
# initialize an empty dictionary to store the results
results = {}

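# loop through all TSV files in the batch directory
# (assumes `folder_path` and `case_ids2` are still defined from the cells above)
directory = folder_path
for filename in os.listdir(directory):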
    if filename.endswith(".tsv"):
        filepath = os.path.join(directory, filename)
        df1 = pd.read_csv(filepath, delimiter='\t')
        if 'case_ids' in df1.columns:
            case_ids1 = set(df1['case_ids'])
            common_values_count = len(case_ids1.intersection(case_ids2))
            results[filename] = common_values_count

# print the results
for filename, count in results.items():
    print(f"{filename}: {count} common values with manifest of removed records.")
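
For reuse across batches, the loop could be wrapped in a function that takes the directory and the set of removed IDs as arguments. The sketch below runs on throwaway files in a temporary directory; the function name `count_common_ids` and the file contents are illustrative assumptions, not part of the MIDRC tooling.

```python
import tempfile
from pathlib import Path

import pandas as pd

def count_common_ids(directory, removed_ids, column="case_ids"):
    """Map each TSV file name in `directory` to its number of distinct IDs shared with `removed_ids`."""
    removed = set(removed_ids)
    results = {}
    for path in sorted(Path(directory).glob("*.tsv")):
        df = pd.read_csv(path, sep="\t", dtype=str)
        if column in df.columns:  # skip files without the ID column
            results[path.name] = len(set(df[column].dropna()) & removed)
    return results

# Build three small TSVs to exercise the function.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "case.tsv").write_text("case_ids\nA\nB\nC\n")
    Path(tmp, "study.tsv").write_text("case_ids\tstudy\nB\ts1\nB\ts2\nD\ts3\n")
    Path(tmp, "other.tsv").write_text("submitter_id\nB\n")
    results = count_common_ids(tmp, removed_ids={"B", "C"})

for filename, count in results.items():
    print(f"{filename}: {count} common values with manifest of removed records.")
```

As in the loop above, the count is of distinct IDs, so a file with several rows for the same case counts that case once.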
