# Updating the Master Sequestration Locations List (in the VM)
---
by Eric Giger

Data Submission Technician at the Center for Translational Data Science at the University of Chicago

January 2023

---

### Outline
After receiving the sequestration results,
1. Download the COMPLETED_sequestration_data_ORG_DATE.tsv from [ValidateStaging](https://validatestaging.midrc.org/) to the VM.
2. Append new case_ids to the current master sequestration list.
3. Save the new list and archive the old one.
4. Notify the team channel that a new masterlist is available.

## Getting Started

### For this notebook, you will have to copy and paste each cell into an `ipython3` shell in the utility VM.

In [None]:
# Import necessary packages
import pandas as pd
import sys, os
import glob, copy
import numpy as np
from pathlib import Path
from datetime import date

In [None]:
# Configure your gen3-client profile
profile = "midrc-validatestaging"
api="https://validatestaging.midrc.org"
vm_cred_path = "/home/ubuntu/wd/creds/midrc-validatestaging-credentials.json"

os.system("gen3-client configure --profile={} --apiendpoint={} --cred={}".format(profile,api,vm_cred_path))
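
`os.system` sends one formatted string through the shell, and its return code is easy to overlook. A minimal sketch of an alternative, assuming the same `gen3-client configure` flags shown above, builds the command as an argument list and runs it with `subprocess.run`. The helper names `build_configure_cmd` and `configure_profile` are hypothetical and not part of the original notebook.

```python
import subprocess

def build_configure_cmd(profile, api, cred_path):
    """Return the gen3-client configure command as an argument list (hypothetical helper)."""
    return [
        "gen3-client", "configure",
        "--profile={}".format(profile),
        "--apiendpoint={}".format(api),
        "--cred={}".format(cred_path),
    ]

def configure_profile(profile, api, cred_path):
    # check=True raises CalledProcessError if gen3-client exits with a non-zero status
    return subprocess.run(build_configure_cmd(profile, api, cred_path), check=True)
```

With the variables defined in the cell above, the call would be `configure_profile(profile, api, vm_cred_path)`.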

## Step 1: Download Completed Results

After being notified the sequestration results are available, we will have to go through the `/submission` endpoint to obtain the `object_id` associated with the file.\
We will then use the `gen3-client` to download the completed results to the VM.

As a reminder: \
In order to do this, we will need to configure a profile and then use the `object_id` obtained from the `/submission` endpoint. 

In [None]:
# Download the completed results
profile = "midrc-validatestaging"
object_id=""
download_path="/home/ubuntu/wd/sequestration/completed"

os.system("gen3-client download-single --profile={} --guid={} --download-path={}".format(profile,object_id,download_path))
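
Because `object_id` starts out as an empty string in the cell above, it is easy to run the download before pasting in the value from the `/submission` endpoint. The sketch below, which assumes the same `download-single` flags used above, refuses to build the command while `object_id` is empty. The helper name `build_download_cmd` is hypothetical.

```python
def build_download_cmd(profile, object_id, download_path):
    """Return the gen3-client download-single command as an argument list (hypothetical helper)."""
    if not object_id or not object_id.strip():
        raise ValueError("object_id is empty; copy it from the /submission endpoint first")
    return [
        "gen3-client", "download-single",
        "--profile={}".format(profile),
        "--guid={}".format(object_id.strip()),
        "--download-path={}".format(download_path),
    ]
```

The resulting list can be passed to `subprocess.run(..., check=True)` after `import subprocess`, so that a failed download raises an error instead of passing silently.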

## Step 2: Append New Cases

In [None]:
# Import necessary packages
import pandas as pd
import sys, os
import glob, copy
import numpy as np
from pathlib import Path
from datetime import date

In [None]:
# set your working directory and change to it
wd_dir = "/home/ubuntu/wd"
os.chdir(wd_dir)

In [None]:
### grab our local copy of the most up to date master list
master_list = glob.glob('master_sequestration_locations_*.tsv')
master_filename = master_list[0]
display(master_filename)

Notice the naming convention above - we include the number of cases and the date the masterlist was updated.
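
The naming convention can also be read back programmatically. The sketch below assumes the filename has the form `master_sequestration_locations_<count>_<YYYY-MM-DD>.tsv`, matching how the file is saved in Step 3. It also picks the newest file by the date in its name, which helps if more than one master list is left in the working directory, since `glob.glob` returns files in arbitrary order. The names `parse_master_filename` and `newest_master` are hypothetical.

```python
import re
from datetime import date
from pathlib import Path

MASTER_PATTERN = re.compile(
    r"^master_sequestration_locations_(\d+)_(\d{4}-\d{2}-\d{2})\.tsv$"
)

def parse_master_filename(filename):
    """Return (case_count, update_date), or None if the name does not match the convention."""
    match = MASTER_PATTERN.match(Path(filename).name)
    if match is None:
        return None
    return int(match.group(1)), date.fromisoformat(match.group(2))

def newest_master(filenames):
    """Return the filename with the latest date in its name; raise if none match."""
    dated = []
    for name in filenames:
        info = parse_master_filename(name)
        if info is not None:
            dated.append((info[1], name))
    if not dated:
        raise FileNotFoundError("no master_sequestration_locations_*.tsv file found")
    return max(dated)[1]
```

In the session above, `master_filename = newest_master(master_list)` would replace `master_list[0]`.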

In [None]:
mf = copy.deepcopy(pd.read_csv(master_filename,sep='\t')) 
display(mf)

In [None]:
# define your LOCAL COMPLETED directory (where we downloaded the COMPLETED results)
comp_dir = "/home/ubuntu/wd/sequestration/completed"

In [None]:
# collect the COMPLETED files into a list (sometimes we receive multiple batches at the same time)
completed_files=glob.glob('{}/COMPLETED_sequestration_data_*_*.tsv'.format(comp_dir))

display(completed_files)

In [None]:
# fall back to the current master list if there are no COMPLETED files to process
new_mf = mf.copy()
for file in completed_files:
    cf = pd.read_csv(file, sep='\t')
    cf['case_ids'] = cf['submitter_id']
    # make a copy of the completed file with only the necessary fields
    i = cf[['case_ids', 'dataset', 'project_id']].copy()
    # select only the case_ids not currently in the master list
    hdf = i.loc[~i['case_ids'].isin(mf.case_ids)].reset_index(drop=True)
    print("There are {} new case_ids from {}".format(len(hdf.case_ids), file))
    # add new sequestration results to masterlist
    new_mf = pd.concat([mf, hdf]).drop_duplicates().reset_index(drop=True)
    # carry the updated list forward so the next batch is compared against it
    mf = new_mf.copy()

# use case_ids as the index so it is written as the first column of the saved file
new_mf = new_mf.set_index('case_ids')

display(new_mf)
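
To see what the append step does on a small scale, the sketch below wraps the same logic in a function and runs it on toy data. It assumes the master list has the columns `case_ids`, `dataset`, and `project_id`, and that the COMPLETED file carries the case identifier in `submitter_id`, as in the loop above. The function name `append_new_cases` and all values in the toy frames are made up for illustration. It renames `submitter_id` rather than copying it, which gives the same selected columns.

```python
import pandas as pd

def append_new_cases(master, completed):
    """Return (updated_master, new_rows); only case_ids missing from master are appended."""
    candidates = completed.rename(columns={"submitter_id": "case_ids"})
    candidates = candidates[["case_ids", "dataset", "project_id"]]
    new_rows = candidates.loc[~candidates["case_ids"].isin(master["case_ids"])]
    updated = pd.concat([master, new_rows]).drop_duplicates().reset_index(drop=True)
    return updated, new_rows.reset_index(drop=True)

# Toy data with made-up values, for illustration only
master = pd.DataFrame({
    "case_ids": ["case-1", "case-2"],
    "dataset": ["Open", "Seq"],
    "project_id": ["P-A", "P-A"],
})
completed = pd.DataFrame({
    "submitter_id": ["case-2", "case-3"],
    "dataset": ["Seq", "Open"],
    "project_id": ["P-A", "P-B"],
})
updated, new_rows = append_new_cases(master, completed)
print(len(new_rows), len(updated))  # prints "1 3"
```

Only `case-3` is new here, so the master list grows from two rows to three.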

## Step 3: Save New Masterlist and Archive the Old

In [None]:
# the file name for the master list consists of the number of unique case_ids and the date the master list was updated
length = len(new_mf) # number of unique case_ids
today = date.today().strftime('%Y-%m-%d') # today's date

display(length)
display(today)

In [None]:
# save the file
new_masterlist = "master_sequestration_locations_{}_{}.tsv".format(length,today)
new_mf.to_csv(new_masterlist,sep='\t')
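
The count in the filename is the number of rows, and `drop_duplicates()` only removes rows that are identical in every column. If a case_id ever appeared with two different `dataset` or `project_id` values, the row count would overstate the number of unique case_ids. A minimal sketch of a check on the saved file, assuming it has a `case_ids` column header as written above (the helper name `count_unique_case_ids` is hypothetical):

```python
import csv

def count_unique_case_ids(tsv_path):
    """Count the distinct values in the case_ids column of a tab-separated file."""
    with open(tsv_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return len({row["case_ids"] for row in reader})
```

In the session above, `count_unique_case_ids(new_masterlist) == length` should hold. If it does not, the list contains repeated case_ids that need a closer look before the file is shared.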

In [None]:
# Move old masterlist to archive
os.chdir(wd_dir)
os.system("mv {} /home/ubuntu/wd/sequestration/archive".format(master_filename))

In [None]:
# move COMPLETED files to archive
for file in completed_files:
    os.system("mv {} /home/ubuntu/wd/sequestration/completed/archive".format(file))
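
The `mv` calls assume the archive directories already exist. They would also archive the new masterlist if it happened to have the same name as the old one, which can occur when the notebook is run twice on the same day with no new cases. A sketch of a more defensive move using `shutil`, with a hypothetical helper `archive_file` and an optional `keep` argument naming a file that must stay in place:

```python
import shutil
from pathlib import Path

def archive_file(src, archive_dir, keep=None):
    """Move src into archive_dir; do nothing if src is the file named by keep."""
    src = Path(src)
    if keep is not None and src.resolve() == Path(keep).resolve():
        return None  # old and new master lists share a name; leave the file in place
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)  # create the archive folder if missing
    return shutil.move(str(src), str(archive_dir / src.name))
```

For the old masterlist this would be called as `archive_file(master_filename, "/home/ubuntu/wd/sequestration/archive", keep=new_masterlist)`, and for each COMPLETED file as `archive_file(file, "/home/ubuntu/wd/sequestration/completed/archive")`.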

## Step 4: Notify Team Channel That a New Masterlist Is Available