In [1]:
import pandas as pd
import os

pd.set_option("display.max_columns", 500)
pd.set_option("display.width", 1000)
pd.set_option("display.max_colwidth", 1000)

In [2]:
dataverse_path = "../dataverse_files/Tables"

# print all files in the directory
for file in os.listdir(dataverse_path):
    print(file)

author.table.csv
pointer.table.csv
ref.table.csv
person.table.csv
deferrals.csv
indications.csv
study_results.csv
sigs.csv
canonical.triples.csv
drugs.csv
study.table.csv
hemonc_rels.csv
variant.table.csv
hemonc_classes.csv
exclusions.csv


## Author -- not interesting

## Drugs

| Variable                  | Description                                                                                       | Type       |
|---------------------------|---------------------------------------------------------------------------------------------------|------------|
| drug                      | The HemOnc preferred name for a drug component                                                    | Value set  |
| drug_CUI                  | The HemOnc unique code identifying the drug                                                       | integer    |
| drug_INN                  | The International Non-proprietary Name of the drug, which should almost always be the same as the drug field | string     |
| ATC                       | The ATC code(s) for the drug                                                                      | string     |
| investigational           | Binary indicator of whether the drug is investigational (TRUE) or approved by any worldwide regulatory body (FALSE) | logical    |
| multiagent                | Binary indicator of whether the drug has multiple components                                      | logical    |
| main_class                | The principal class to which a drug belongs, preferably mechanistic (a.k.a. primary mechanism of action) | Value set  |
| class_type                | The type of class                                                                                 | Value set  |
| CanMED_major_class        | The major class(es) as defined by NCI CanMED with additional local review                         | Value set  |
| CanMED_major_class_CUI    | The HemOnc unique code identifying the major class(es)                                            | integer    |
| CanMED_minor_class        | The minor class(es) as defined by NCI CanMED with additional local review                         | Value set  |
| CanMED_minor_class_CUI    | The HemOnc unique code identifying the minor class(es)                                            | integer    |
| in_regimen                | Binary indicator of whether the drug is used as a primary component of a regimen                  | logical    |
| date_added                | The date on which the drug was added to HemOnc.                                                   | Calendar date |

In [3]:
drugs = pd.read_csv(dataverse_path + "/drugs.csv")
drugs.head()

Unnamed: 0,drug,drug_CUI,drug_INN,atc,investigational,multiagent,main_class,class_type,CanMED_major_class,CanMED_major_class_CUI,CanMED_minor_class,CanMED_minor_class_CUI,date_added
0,Abarelix,3652,abarelix,L02BX01,False,False,GnRH antagonist,mechanistic,Androgen receptor inhibitor,46196.0,GnRH antagonist,45632,D-2019-08-29
1,Abciximab,3,abciximab,B01AC13,False,False,Anti-GPIIb-IIIa antibody,mechanistic,,,,,D-2019-05-27
2,Abemaciclib,4,abemaciclib,,False,False,CDK4/6 inhibitor,mechanistic,CDK inhibitor,44963.0,CDK4 inhibitor|CDK6 inhibitor,32998|32999,D-2019-05-27
3,Abexinostat,5,abexinostat,,True,False,HDAC inhibitor,mechanistic,Enzyme inhibitor,46122.0,HDAC inhibitor,44966,D-2019-05-27
4,Abiraterone,6,abiraterone,L02BX03,False,False,CYP17 inhibitor,mechanistic,Androgen receptor inhibitor,46196.0,,,D-2019-05-27


1. What is the main class of drug x?
   - drug column is the name of the drug
   - main_class column is the main class of the drug

2. More specific version: "What is the primary mechanism of action of the drug?"
   - Same as above, but restricted to rows where class_type = mechanistic

3. Does the following drug form a primary component of a regimen for condition y?
   - drug column is the name of the drug
   - in_regimen column is a binary indicator of whether the drug is used as a primary component of a regimen (listed in the data dictionary, but not among the columns in the drugs.head() output above)
   - Would need to join to the regimen table to get the condition
   - Lower-priority question
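
Questions 1 and 2 above could be generated roughly as follows. This is a minimal sketch, not a finished pipeline: `main_class_qa` is a hypothetical helper, and `drugs_demo` is a small stand-in built from the `drugs.head()` rows shown earlier so the cell runs without the CSV.

```python
import pandas as pd

# Stand-in for the real table: a subset of the drugs.head() rows and columns.
drugs_demo = pd.DataFrame(
    {
        "drug": ["Abarelix", "Abciximab", "Abemaciclib", "Abexinostat", "Abiraterone"],
        "main_class": [
            "GnRH antagonist",
            "Anti-GPIIb-IIIa antibody",
            "CDK4/6 inhibitor",
            "HDAC inhibitor",
            "CYP17 inhibitor",
        ],
        "class_type": ["mechanistic"] * 5,
    }
)


def main_class_qa(df, mechanistic_only=False):
    """Return one question/answer dict per drug (hypothetical helper)."""
    rows = df.dropna(subset=["drug", "main_class"])
    if mechanistic_only:
        # Question 2: only classes that describe a mechanism of action.
        rows = rows[rows["class_type"] == "mechanistic"]
        template = "What is the primary mechanism of action of {drug}?"
    else:
        # Question 1: any class type.
        template = "What is the main class of {drug}?"
    return [
        {"question": template.format(drug=r.drug), "answer": r.main_class}
        for r in rows.itertuples(index=False)
    ]


qa_pairs = main_class_qa(drugs_demo, mechanistic_only=True)
print(qa_pairs[0])
```

On the real table, pass `drugs` instead of `drugs_demo`; the function uses only the `drug`, `main_class`, and `class_type` columns.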

## Indications


| Variable               | Description                                                                                                                                                                                                                               | Type          |
|------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|
| component              | The drug or biologic agent which is the subject of 1 or more regulatory approvals                                                                                                                                                         | Value set     |
| regulator              | If approved, the regulatory body issuing the approval                                                                                                                                                                                     | Value set     |
| date                   | Date of approval in ISO 8601 format, if available. Otherwise, “Uncertain”                                                                                                                                                                 | Date          |
| condition              | The cancer or non-malignant condition for which the drug is approved                                                                                                                                                                      | Value set     |
| accelerated            | Indicator of whether the approval was an accelerated approval                                                                                                                                                                             | Logical       |
| withdrawn              | Indicator of whether the indication event was a withdrawal                                                                                                                                                                                | Logical       |
| first_in_class         | Indicator of whether the approval was the first approval for this class of drugs.                                                                                                                                                         | Logical       |
| note                   | Captures one of several error conditions: “No linked condition”; “Investigational drug” (not an error per se but by definition there won’t be an approval event); “No changes in FDA indication section” (on the website); “No record of FDA approval”; “No month/year information” as well as "To be dissected" for new additions | Value set     |
| context                | The context for which the drug is approved.                                                                                                                                                                                               | String        |
| stage_or_status        | The stage or status for which the drug is approved.                                                                                                                                                                                       | String        |
| risk_stratification    | If applicable, the risk strata for which the drug is approved                                                                                                                                                                             | String        |
| phenotype              | If specified, a demographic or other restriction on the approval (e.g., postmenopausal women).                                                                                                                                            | String        |
| prior_therapy          | Prior treatment exposure requirements within the approval. If the requirement is no prior treatment (untreated); this value is set to 0                                                                                                   | String        |
| prior_therapy_negation | If the prior treatment requirement is an absence of a named treatment exposure, this is TRUE                                                                                                                                              | Logical       |
| prior_therapy_setting  | If described, the setting in which prior treatments were given                                                                                                                                                                            | String        |
| response_contingency   | If there is a prior therapy, an event-based contingency if specified (e.g., complete surgical resection; progression)                                                                                                                     | String        |
| time_contingency       | If there is a prior therapy requirement, a time-based contingency if specified (e.g., within 6 months, within 12 months)                                                                                                                  | String        |
| prior_biomarker        | Prior treatment requirements that are contingent upon certain biomarker criteria.                                                                                                                                                         | String        |
| with                   | If specified, a requirement that the approved drug be given with one or more other drugs.                                                                                                                                                 | Value set     |
| biomarker              | If applicable, gene mutation- and/or protein expression-specific requirements.                                                                                                                                                            | String        |
| biomarker_negation     | If there is a biomarker-specific requirement, an indicator of whether this is a presence or absence of said biomarker. If an entire gene is negated (e.g., BRAF + TRUE) this is equivalent to a wild-type state                          | Logical       |
| study_yn               | If TRUE, the regulatory indication cites 1 or more studies as evidence to support the indication. If FALSE, no studies are listed in the package insert.                                                                                  | Logical       |
| study                  | The study cited to support the indication. If more than one study is cited for a particular indication, each row will contain one study                                                                                                   | Value set     |
| string                 | The raw regulatory indication string from HemOnc.org HTML                                                                                                                                                                                 | String        |
| date_added             | The date on which the indication was added to the table.                                                                                                                                                                                  | Calendar date |

In [5]:
indications = pd.read_csv(dataverse_path + "/indications.csv", encoding="latin1")
indications.head()

Unnamed: 0,component,regulator,date,condition,accelerated,withdrawn,first_in_class,note,context,stage_or_status,risk_stratification,demographics,ineligibility,prior_therapy,prior_therapy_negation,prior_therapy_setting,response_contingency,time_contingency,prior_biomarker,with,biomarker,biomarker_finding,study_yn,study,string,date_added
0,Abarelix,FDA,D-2003-11-25,Prostate cancer,False,False,True,,,Symptomatic,,men,,,,,,,,,,,True,Koch et al. 2003,"2003-11-25: Approved for palliative treatment of men with advanced symptomatic <a href=""/wiki/Prostate_cancer"" title=""Prostate cancer"">prostate cancer</a>, in whom LHRH agonist therapy is not appropriate and who refuse surgical castration, and have one or more of the following: risk of neurological compromise due to metastases, ureteral or bladder outlet obstruction due to local encroachment or metastatic disease, or severe bone pain from skeletal metastases persisting on narcotic analgesia. <i>(Based on Koch et al. 2003)</i>",D-2022-09-05
1,Abciximab,FDA,D-1994-12-22,NONE,False,False,True,No linked condition,,,,,,,,,,,,,,,,,1994-12-22: Initial approval (label not available at Drugs @ FDA),D-2022-09-05
2,Abemaciclib,EMA,D-2018-09-26,Breast cancer,False,False,False,,,Locally advanced OR Metastatic,,women,,,,,,,,Aromatase inhibitor OR Fulvestrant,HR and HER2,positive|negative,,,"2018-09-26: Initial authorization as Verzenios for the treatment of women with hormone receptor (HR) positive, human epidermal growth factor receptor 2 (HER2) negative locally advanced or metastatic <a href=""/wiki/Breast_cancer"" title=""Breast cancer"">breast cancer</a> in combination with an aromatase inhibitor or fulvestrant as initial endocrine-based therapy, or in women who have received prior endocrine therapy.",D-2023-09-17
3,Abemaciclib,EMA,D-2018-09-26,Breast cancer,False,False,False,,,Locally advanced OR Metastatic,,women,,,,,,,,Aromatase inhibitor OR Fulvestrant,HR and HER2,positive|negative,,,"2018-09-26: Initial authorization as Verzenios for the treatment of women with hormone receptor (HR) positive, human epidermal growth factor receptor 2 (HER2) negative locally advanced or metastatic <a href=""/wiki/Breast_cancer"" title=""Breast cancer"">breast cancer</a> in combination with an aromatase inhibitor or fulvestrant as initial endocrine-based therapy, or in women who have received prior endocrine therapy. In pre- or perimenopausal women, the endocrine therapy should be combined with a luteinising hormone-releasing hormone (LHRH) agonist.",D-2023-09-17
4,Abemaciclib,EMA,D-2022-04-01,Breast cancer,False,False,False,,Adjuvant,node-positive early,high risk of recurrence,,,,,,,,,Endocrine therapy,HR and HER2,positive|negative,,,"2022-04-01: Extension of indication to include Verzenios in combination with endocrine therapy for adjuvant treatment of patients with hormone receptor (HR) positive, human epidermal growth factor receptor 2 (HER2)-negative, node-positive early <a href=""/wiki/Breast_cancer"" title=""Breast cancer"">breast cancer</a> at high risk of recurrence.",D-2023-09-17


1. Which of the following drugs has an approved indication for "condition x"?
   - component column is the name of the drug
   - condition column is the condition for which the drug is approved
   - Make sure withdrawn is False
   - study is the associated study
   - Side note: the options must not include more than one right answer
     - i.e. for the drug of interest, get all associated conditions, filter those out, then sample the incorrect options from what remains

2. For which stage or status is the drug "drug x" approved to treat condition "condition x"?
   - component column is the name of the drug
   - stage_or_status column is the stage or status for which the drug is approved
   - condition column is the condition for which the drug is approved
   - Make sure withdrawn is False
   - study is the associated study
   - Side note: same as above for condition; make sure the drug isn't approved for more than one stage/status
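
The "only one right answer" constraint from the side notes can be sketched as below for question 1. `build_mcq` is a hypothetical helper, and `ind_demo` is a toy table that reuses the `indications.csv` column names; "Drug C", "Drug D", "Drug E", and their conditions are placeholders, not real records. One assumption to flag: any drug with a row for the target condition, withdrawn or not, is kept out of the distractor pool.

```python
import random

import pandas as pd

# Toy stand-in for indications.csv; rows with "Drug ..." are placeholders.
ind_demo = pd.DataFrame(
    {
        "component": ["Abarelix", "Abemaciclib", "Abemaciclib", "Drug C", "Drug D", "Drug E"],
        "condition": [
            "Prostate cancer",
            "Breast cancer",
            "Breast cancer",
            "Condition C",
            "Condition D",
            "Breast cancer",
        ],
        "withdrawn": [False, False, False, False, False, True],
    }
)


def build_mcq(df, condition, n_options=4, seed=0):
    """Build one multiple-choice item with exactly one correct drug (hypothetical helper)."""
    rng = random.Random(seed)
    # Keep only rows that are not withdrawal events.
    active = df[df["withdrawn"].eq(False)]
    correct = sorted(active.loc[active["condition"] == condition, "component"].dropna().unique())
    # Exclude every drug linked to this condition, withdrawn or not,
    # so that no distractor is arguably a second right answer.
    linked = set(df.loc[df["condition"] == condition, "component"].dropna())
    pool = sorted(set(df["component"].dropna()) - linked)
    if not correct or len(pool) < n_options - 1:
        return None  # not enough material for a well-formed item
    answer = rng.choice(correct)
    options = rng.sample(pool, n_options - 1) + [answer]
    rng.shuffle(options)
    return {
        "question": f'Which of the following drugs has an approved indication for "{condition}"?',
        "options": options,
        "answer": answer,
    }


item = build_mcq(ind_demo, "Breast cancer")
print(item)
```

The same pattern should carry over to question 2 by grouping on `component` and `condition` and keeping only pairs with a single distinct `stage_or_status`. This sketch does not check whether an approval row is followed by a later withdrawal row for the same drug and condition.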

## Person -- not interesting

## Pointer 

### Core Variables

| Variable | Description | Type | Allowed Values |
|-|-|-|-|
| condition | Cancer or non-malignant condition | Value set | From Condition class |
| biomarker | Biomarker-defined condition | String | Any |
| context | Treatment context | Value set | From Context class |
| regimen | The regimen | Value set | From Regimen class |
| regimen_cui | Concept code for regimen | Value set | From Regimen class |

### Documentation Links

| Variable | Description | Type | Allowed Values |
|-|-|-|-|
| h2_html | Level 2 heading pointer | String | URL |
| h3_html | Level 3 heading pointer | String | URL |
| notes | HemOnc.org notes | String | Any |
| version | Copy date | Calendar date | YYYY-MM-DD |

In [6]:
pointer = pd.read_csv(dataverse_path + "/pointer.table.csv", encoding="latin1")
pointer.head()

Unnamed: 0,tracer,condition,biomarker,context,regimen,regimen_cui,h2_html,h3_html,notes,version
0,,Acquired hemophilia A,,All lines of therapy,Cyclophosphamide and Prednisolone,795,https://hemonc.org/wiki/Acquired_hemophilia_A#Cyclophosphamide_&_Prednisolone,N/A - no variants or protocol chunks,,2024-09-06
1,15.2.1.1,Acquired hemophilia A,,All lines of therapy,Cyclophosphamide and Prednisolone,795,https://hemonc.org/wiki/Acquired_hemophilia_A#Cyclophosphamide_&_Prednisolone,N/A - no variants or protocol chunks,,2024-12-01
2,,Acquired hemophilia A,,All lines of therapy,Cyclophosphamide and Prednisone,797,https://hemonc.org/wiki/Acquired_hemophilia_A#Cyclophosphamide_&_Prednisone,https://hemonc.org/wiki/Acquired_hemophilia_A#Regimen_variant_#1,"<p><i>Note: Per Shaffer et al. 1997, ""the lower ends of these ranges became the standard therapy to minimize side effects""</i>",2024-09-06
3,,Acquired hemophilia A,,All lines of therapy,Cyclophosphamide and Prednisone,797,https://hemonc.org/wiki/Acquired_hemophilia_A#Cyclophosphamide_&_Prednisone,https://hemonc.org/wiki/Acquired_hemophilia_A#Regimen_variant_#2,,2024-09-06
4,,Acquired hemophilia A,,All lines of therapy,Cyclophosphamide and Prednisone,797,https://hemonc.org/wiki/Acquired_hemophilia_A#Cyclophosphamide_&_Prednisone,https://hemonc.org/wiki/Acquired_hemophilia_A#Regimen_variant_#3,,2024-09-06


### Full Variable Descriptions

| Variable       | Description                                                                                                                                     | Type           | Allowed Values/Format                               | Display Name       |
|----------------|-------------------------------------------------------------------------------------------------------------------------------------------------|----------------|-----------------------------------------------------|--------------------|
| condition      | The cancer or non-malignant condition in which the regimen is instantiated                                                                      | Value set      | Any `concept_name` from the Condition concept class | Disease Condition  |
| biomarker      | If applicable, the biomarker-defined condition                                                                                                  | String         |                                                     | Biomarker          |
| context        | The treatment context in which the regimen is instantiated (e.g., first-line metastatic)                                                        | Value set      | Any `concept_name` from the Context concept class   | Treatment Context  |
| regimen        | The regimen                                                                                                                                     | Value set      | Any `concept_name` from the Regimen concept class   | Regimen            |
| regimen_cui    | The concept code corresponding to the regimen                                                                                                   | Value set      | Any `concept_code` from the Regimen concept class   |                    |
| h2_html        | The pointer back to the HemOnc.org level 2 heading (regimen-level)                                                                              | String         | URL                                                 |                    |
| h3_html        | If there are variants or protocol chunks, the pointer back to the HemOnc.org level 3 heading (variant-level)                                    | String         | URL                                                 |                    |
| notes          | Notes taken from HemOnc.org                                                                                                                     | String         |                                                     |                    |
| version        | The date on which the local HemOnc.org copy was obtained                                                                                        | Calendar date  | `YYYY-MM-DD` (ISO 8601)                             |                    |

1. Which of the following regimens is used to treat condition x?
   - regimen column is the name of the regimen
   - condition column is the condition for which the regimen is used
   - context column is the treatment context in which the regimen is used
   - Is this really different from what we are already doing, or just a simpler version?

2. Which of the following conditions is a biomarker-defined condition?
   - condition column is the condition
   - biomarker column is the biomarker-defined condition
     - [nan 'MET' 'RET-mutated_first-line_therapy' 'ROS1' 'HPV' 'HRD' 'VHL' ... ...]

- Can we use the h2 and h3 HTML blocks retrieved from the HemOnc.org website to get the information about the regimen, condition, biomarker, and context?
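
Because the pointer table repeats a regimen once per variant and per `version`, both questions probably need de-duplication first. The sketch below shows one way to do that; `pointer_demo` is a stand-in built from the `pointer.head()` rows above, and the variable names are illustrative only.

```python
import pandas as pd

# Stand-in for the real table: a subset of the pointer.head() rows and columns.
pointer_demo = pd.DataFrame(
    {
        "condition": ["Acquired hemophilia A"] * 5,
        "biomarker": [None] * 5,
        "context": ["All lines of therapy"] * 5,
        "regimen": [
            "Cyclophosphamide and Prednisolone",
            "Cyclophosphamide and Prednisolone",
            "Cyclophosphamide and Prednisone",
            "Cyclophosphamide and Prednisone",
            "Cyclophosphamide and Prednisone",
        ],
        "regimen_cui": [795, 795, 797, 797, 797],
    }
)

# Question 1: one row per (condition, context, regimen), so variants and
# repeated versions of the same regimen collapse into a single candidate.
regimens = (
    pointer_demo[["condition", "context", "regimen", "regimen_cui"]]
    .drop_duplicates()
    .reset_index(drop=True)
)

# Question 2: conditions with at least one non-missing biomarker value.
biomarker_conditions = sorted(
    pointer_demo.loc[pointer_demo["biomarker"].notna(), "condition"].unique()
)

print(regimens)
print(biomarker_conditions)
```

In this toy subset no row has a biomarker, so the second list is empty; on the full table it would contain the conditions behind values such as 'MET' or 'ROS1'.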

## Ref

| Variable | Description | Type |
|----------|-------------|------|
| study | The study (clinical trial) | Value set |
| reference | The HemOnc concept name for reference | Value set |
| title | The reference title | String |
| pmid | The PubMed URL for a given publication | Value set |
| doi | If available, the DOI URL for a given reference | Value set |
| url | If DOI not available, the direct URL to the original article. If DOI is available, the URL that the DOI resolves to | Value set |
| pmcid | If available, the PubMed Central PMCID for a given reference | Value set |
| condition | The cancer or non-malignant condition being reported in the reference | Value set |
| journal | Journal name in abbreviated MEDLINE format | Value set |
| biblio | Publication date, journal volume, pages sourced from PubMed | String |
| pub.date | The date of publication derived from the biblio field, defaulting to ePub information when available. If day is not specified, the midpoint of the month or quarter is used instead | Calendar date |
| order | The chronologic order in which the reference is published, relative to the global table | Integer |
| update | The chronologic rank of the reference related to a study. The first (primary) study publication is 0, the next is 1, and so forth | Integer |
| ref_type | The reference type. By default, the first reference (update = 0) is always called "Primary". Efficacy updates are abbreviated as "Update" | Value set |
| date_added | The date on which the reference-pmid-condition triad was added to the table | Calendar date |
| date_last_modified | The last date on which any of the fields in the row were modified | Calendar date |
| modifications | A list of the last modifications made to the data row | String |
| temp | Reserved field for troubleshooting | String |
| xml_parse_error | Binary indicator of whether reutils failed to retrieve details for a given PMID | Binary |

In [None]:
ref = pd.read_csv(dataverse_path + "/ref.table.csv", encoding="latin1")
ref.head()

Unnamed: 0,study,reference,title,pmid,doi,url,pmcid,condition,journal,biblio,pub.date,order,update,ref_type,date_added,date_last_modified,modifications,temp,xml_parse_error
0,006/027/ICI,006/027/ICI::00,"A double-blind, placebo-controlled, randomized...",https://pubmed.ncbi.nlm.nih.gov/20931299/,https://doi.org/10.1007/s12032-010-9700-3,,No PMCID,Cervical cancer,Med Oncol,2011 Dec;28 Suppl 1:S540-6. Epub 2010 Oct 8,2010-10-08,3806,0,Primary,2020-02-16,2023-08-15,; url updated from https://link.springer.com/a...,,False
1,01-002-0601,01-002-0601::00,Thalidomide-dexamethasone compared with melpha...,https://pubmed.ncbi.nlm.nih.gov/18955563/,https://doi.org/10.1182/blood-2008-07-169565,,No PMCID,Multiple myeloma,Blood,2009 Apr 9;113(15):3435-42. Epub 2008 Oct 27,2008-10-27,3192,0,Primary,2020-10-13,2024-01-04,; url updated from to https://doi.org/10.1182...,,False
2,01-002-0601,01-002-0601::01,Thalidomide maintenance treatment increases pr...,https://pubmed.ncbi.nlm.nih.gov/20418244/,https://doi.org/10.3324/haematol.2009.020586,,https://www.ncbi.nlm.nih.gov/pmc/articles/pmc2...,Multiple myeloma,Haematologica,2010 Sep;95(9):1548-54. Epub 2010 Apr 23,2010-04-23,3637,1,Update,2020-10-13,2024-09-06,; doi updated from to https://doi.org/10.3324...,,False
3,03-C-0110,03-C-0110::00,Phase II study of bevacizumab in patients with...,https://pubmed.ncbi.nlm.nih.gov/22430271/,https://doi.org/10.1200/jco.2011.39.6853,,https://www.ncbi.nlm.nih.gov/pmc/articles/pmc3...,Kaposi sarcoma,J Clin Oncol,2012 May 1;30(13):1476-83. Epub 2012 Mar 19,2012-03-19,4241,0,Primary,2021-03-25,2023-08-15,; url updated from https://ascopubs.org/doi/10...,,False
4,03-TTD-01,03-TTD-01::00,Phase III study of capecitabine plus oxaliplat...,https://pubmed.ncbi.nlm.nih.gov/17548839/,https://doi.org/10.1200/jco.2006.09.8467,,No PMCID,Colorectal cancer,J Clin Oncol,2007 Sep 20;25(27):4224-30. Epub 2007 Jun 4,2007-06-04,2804,0,Primary,2021-09-06,2023-08-15,; url updated from https://ascopubs.org/doi/10...,,False


1. What is the main condition being reported in the reference "reference x"?
   - reference column is the name of the reference
   - condition column is the condition being reported in the reference

2. Retrieval benchmark
   - Match id to condition and stage, as we currently do
   - Goal: given condition + stage, retrieve the relevant treatment documents
   - Metric: does the pmid in our table appear in the top k?

Side note
- Guessing we are using pub.date?
- The update column is useful here: it tells you which reference is the most recent for a study
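
The "does pmid appear in top k" check can be written as a simple hit@k test. This is a sketch under assumptions: `normalize_pmid` and `hit_at_k` are hypothetical helpers, the gold PMIDs are the two Multiple myeloma references from the `ref.head()` output above, and the retrieved list is a made-up retriever output with placeholder ids.

```python
def normalize_pmid(value):
    """Extract the trailing id from a PubMed URL like those in the pmid column."""
    return value.rstrip("/").rsplit("/", 1)[-1]


def hit_at_k(retrieved_pmids, gold_pmids, k):
    """Return True if any gold PMID appears among the first k retrieved PMIDs."""
    return any(pmid in gold_pmids for pmid in retrieved_pmids[:k])


# Gold set: references for study 01-002-0601 (Multiple myeloma) in ref.head().
gold = {
    normalize_pmid("https://pubmed.ncbi.nlm.nih.gov/18955563/"),
    normalize_pmid("https://pubmed.ncbi.nlm.nih.gov/20418244/"),
}

# Hypothetical ranked retriever output; "00000001" and "00000002" are placeholders.
retrieved = ["00000001", "20418244", "00000002"]

print(hit_at_k(retrieved, gold, k=1))
print(hit_at_k(retrieved, gold, k=2))
```

Averaging `hit_at_k` over all condition + stage queries gives a hit rate at k. Whether the gold set should hold every reference for a study or only the row with the highest `update` value is still an open choice.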

# SIG Variables

## Core Identifiers

| Variable | Description | Type | Multiple Values | Allowed Values |
|-|-|-|-|-|
| study | Clinical trial for SIG | Value set | Yes (pipe-delimited) | From Study concept class |
| regimen | Studied regimen | Value set | No | From Regimen concept class |
| regimen_cui | Regimen concept code | CUI | No | From Regimen concept class |
| variant | Local variant name | String | No | Variant #01; Variant #02; etc |
| variant_cui | Variant concept code | CUI | No | From Regimen Variant class |

## Component Information

| Variable | Description | Type | Multiple Values | Allowed Values |
|-|-|-|-|-|
| component | SIG intervention | Value set | No | From Component/Procedure class |
| component_cui | Component concept code | CUI | No | From Component/Procedure class |
| component_role | Component type | Value set | No | locoregional; primary systemic; secondary systemic |
| phase | Treatment phase | Value set | No | Induction; Maintenance; etc |
| portion | Protocol portion | String | No | Any |

## Timing and Sequence

| Variable | Description | Type | Multiple Values | Allowed Values |
|-|-|-|-|-|
| timing | Temporal criteria | String | No | Any |
| timing_unit | Timing unit | Value set | No | Course; Cycle; Day; Month; Week |
| timing_sequence | Drug prescription units | Integer | Yes (comma-sep) | 1,2,3,4 etc |
| timing_bounded | Has definite end | Logical | No | TRUE; FALSE |
| cycle_length_lb | Cycle lower bound | Numeric | No | Any |
| cycle_length_ub | Cycle upper bound | Numeric | No | Any |
| cycle_length_unit | Cycle length unit | Value set | No | days; weeks; months; years |
| step_number | SIG step specification | Value set | No | 1 of 1; 1 of 2; etc |

## Dosing Information

| Variable | Description | Type | Multiple Values | Allowed Values |
|-|-|-|-|-|
| class | SIG classification | Value set | No | Non-canonical; IV canonical; Rad Sig; etc |
| doseMinNum | Minimum dose | Numeric | No | Any |
| doseMaxNum | Maximum dose | Numeric | No | Any |
| doseUnit | Dose units | Value set | No | mg; mg/kg; Gy; etc |
| doseUnit_cui | Dose unit code | CUI | No | From Unit concept class |
| doseCapNum | Maximum capped dose | Numeric | No | Any |
| doseCapUnit | Cap dose units | Value set | No | mg; GBq; etc |
| doseCapUnit_cui | Cap unit code | CUI | No | From Unit concept class |
| divided | Divide dose across day | Logical | No | TRUE; FALSE |

## Administration Details

| Variable | Description | Type | Multiple Values | Allowed Values |
|-|-|-|-|-|
| route | Delivery route | Value set | No | IV; PO; SC; etc |
| route_cui | Route concept code | CUI | No | From Route concept class |
| allDays | Administration days | Integer | Yes (comma-sep) | 1,8,15,22 etc |
| durationMinNum | Min admin time | Numeric | No | Any |
| durationMaxNum | Max admin time | Numeric | No | Any |
| durationUnit | Duration unit | Value set | No | second; minute; hour; day |
| durationUnit_cui | Duration unit code | CUI | No | From Unit concept class |
| frequency | Admin interval | Value set | No | once per day; twice per day; etc |
| frequency_cui | Frequency code | CUI | No | From Repeat Unit class |

## Additional Specifications

| Variable | Description | Type | Multiple Values | Allowed Values |
|-|-|-|-|-|
| branch | Conditional criteria | String | No | none or descriptor |
| branch_type | Branch type | Value set | No | N/A or from sig_branch_types |
| cyclesigs | Applicable cycle SIG | String | No | Any |
| cyclesigs_note | Cycle SIG notes | String | No | Any |
| inParens | Parenthetical content | String | No | Any |
| sequence | Absolute sequence | String | No | Any |
| seq.rel | Relative sequence | String | No | Any |
| seq.rel.what | Relative to what | String | No | Any |
| tail | Misc ending strings | String | No | Any |
| temp | Troubleshooting field | String | No | Any |
| date_added | Addition date | Calendar date | No | YYYY-MM-DD |



In [None]:
sig = pd.read_csv(dataverse_path + "/sigs.csv", encoding="latin1")
sig.head()

Unnamed: 0,study,regimen,regimen_cui,phase,portion,component,component_cui,component_role,timing,timing_unit,cycle_length_lb,cycle_length_ub,cycle_length_unit,timing_sequence,timing_bounded,branch,branch_type,cyclesigs,cyclesigs_note,step_number,class,doseMinNum,doseMaxNum,doseUnit,doseUnit_cui,divided,doseCapNum,doseCapUnit,doseCapUnit_cui,route,route_cui,allDays,durationMinNum,durationMaxNum,durationUnit,durationUnit_cui,frequency,frequency_cui,inParens,sequence,seq.rel,seq.rel.what,tail,variant,variant_cui,temp,date_added
0,MDACC ID01-233,(90)YFC,3071,,-,Cyclophosphamide,122,primary systemic,,,1.0,1.0,indeterminate,1,True,none,,One course,,1 of 1,IV intermittent canonical Sig,750.0,750.0,mg/m^2,134448.0,False,,,,IV,44957.0,"-5,-4,-3",,,,,once per day,139562.0,,,,,-,Variant #01,129495.0,,2023-09-19
1,MDACC ID01-233,(90)YFC,3071,,-,Fludarabine,224,primary systemic,,,1.0,1.0,indeterminate,1,True,none,,One course,,1 of 1,IV intermittent canonical Sig,30.0,30.0,mg/m^2,134448.0,False,,,,IV,44957.0,"-5,-4,-3",,,,,once per day,139562.0,,,,,-,Variant #01,129495.0,,2023-09-19
2,MDACC ID01-233,(90)YFC,3071,,-,Ibritumomab tiuxetan,262,primary systemic,,,1.0,1.0,indeterminate,1,True,none,,One course,,1 of 2,Non-canonical Sig,,,,,False,,,,IV,44957.0,,,,,,once,139563.0,,,,,-,Variant #01,129495.0,,2023-09-19
3,MDACC ID01-233,(90)YFC,3071,,-,Ibritumomab tiuxetan,262,primary systemic,,,1.0,1.0,indeterminate,1,True,none,,One course,,2 of 2,Rad Sig,0.4,0.4,mCi/kg,134443.0,False,32.0,GBq,134428.0,IV,44957.0,,,,,,once,139563.0,(maximum dose of 32 mCi/1.2 GBq),,,,-,Variant #01,129495.0,,2023-09-19
4,MDACC ID01-233,(90)YFC,3071,,-,Rituximab,446,primary systemic,,,1.0,1.0,indeterminate,1,True,none,,One course,,1 of 1,IV intermittent canonical Sig,250.0,250.0,mg/m^2,134448.0,False,,,,IV,44957.0,"-14,-7",,,,,once per day,139562.0,,,,,-,Variant #01,129495.0,,2023-09-19


1. When giving the drug "drug x" in the regimen "regimen x", what is the minimum dose?
   - component column is the name of the drug
   - regimen column is the name of the regimen
   - regimen_cui column is the concept code for the regimen
   - doseMinNum column is the minimum dose 

2. When giving the drug "drug x" in the regimen "regimen x", what is the maximum dose?
    - component column is the name of the drug
    - regimen column is the name of the regimen
    - regimen_cui column is the concept code for the regimen
    - doseMaxNum column is the maximum dose

3. When giving the drug "drug x" in the regimen "regimen x", what is the delivery route?
    - component column is the name of the drug
    - regimen column is the name of the regimen
    - regimen_cui column is the concept code for the regimen
    - route column is the delivery route

Note: These questions are really hard. I imagine they are only answerable with RAG, and many of these details won't be reported in the abstract.
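
A minimal sketch of how the three question templates above could be filled from rows of `sigs.csv`. The helper name `make_sig_questions` and the toy frame are hypothetical; only the column names (`component`, `regimen`, `regimen_cui`, `doseMinNum`, `doseMaxNum`, `doseUnit`, `route`) and the toy values come from the `sig.head()` output above. Rows without a numeric dose are skipped for the dose questions, which is an assumption about how to handle non-canonical SIGs.

```python
import pandas as pd

def make_sig_questions(sig: pd.DataFrame) -> list[dict]:
    """Build question/answer pairs for minimum dose, maximum dose, and route."""
    templates = {
        "doseMinNum": "minimum dose",
        "doseMaxNum": "maximum dose",
        "route": "delivery route",
    }
    questions = []
    for _, row in sig.iterrows():
        for column, label in templates.items():
            value = row[column]
            if pd.isna(value):
                continue  # e.g. non-canonical SIGs carry no numeric dose
            answer = str(value)
            if column != "route" and pd.notna(row["doseUnit"]):
                # Attach the unit so "750" is not ambiguous
                answer = f"{value:g} {row['doseUnit']}"
            questions.append({
                "question": (
                    f'When giving the drug "{row["component"]}" in the regimen '
                    f'"{row["regimen"]}", what is the {label}?'
                ),
                "answer": answer,
                "regimen_cui": row["regimen_cui"],
            })
    return questions

# Toy rows copied from the sig.head() output above (subset of columns)
toy_sig = pd.DataFrame({
    "component": ["Cyclophosphamide", "Ibritumomab tiuxetan"],
    "regimen": ["(90)YFC", "(90)YFC"],
    "regimen_cui": [3071, 3071],
    "doseMinNum": [750.0, None],
    "doseMaxNum": [750.0, None],
    "doseUnit": ["mg/m^2", None],
    "route": ["IV", "IV"],
})
sig_questions = make_sig_questions(toy_sig)
for q in sig_questions:
    print(q["question"], "->", q["answer"])
```

Keeping `regimen_cui` on each record makes it possible to link a question back to the regimen concept later.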


Then we should move on to open-ended questions scored by a model grader:
- Is the following prescription valid for regimen x?
- Clash-eval style perturbations: increase the dose, alter the route, etc.
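
A rough sketch of the perturbation idea, assuming one row of the `sig` table as input. The function name `perturb_prescription`, the 10x dose factor, and the use of `"PO"` as the swapped-in route are all illustrative assumptions, not values from HemOnc. Whether a perturbed prescription is truly invalid would still need clinical review.

```python
import pandas as pd

def perturb_prescription(row: pd.Series, dose_factor: float = 10.0,
                         wrong_route: str = "PO") -> list[dict]:
    """Return the original prescription plus deliberately altered copies."""
    base = {
        "regimen": row["regimen"],
        "component": row["component"],
        "dose": row["doseMinNum"],
        "unit": row["doseUnit"],
        "route": row["route"],
    }
    cases = [{**base, "label": "valid", "change": "none"}]
    if pd.notna(base["dose"]):
        # Scale the dose far outside the reported value
        cases.append({**base, "dose": base["dose"] * dose_factor,
                      "label": "invalid", "change": "dose increased"})
    if pd.notna(base["route"]) and base["route"] != wrong_route:
        # Swap in a different route (hypothetical choice)
        cases.append({**base, "route": wrong_route,
                      "label": "invalid", "change": "route altered"})
    return cases

# Toy row based on the Rituximab line of the sig.head() output above
toy_row = pd.Series({
    "regimen": "(90)YFC", "component": "Rituximab",
    "doseMinNum": 250.0, "doseUnit": "mg/m^2", "route": "IV",
})
cases = perturb_prescription(toy_row)
for case in cases:
    print(case["label"], case["change"], case["dose"], case["unit"], case["route"])
```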

# Study Results

## Core Study Variables

| Variable | Description | Type | Multiple Values | Allowed Values |
|-|-|-|-|-|
| study | The study (clinical trial) | Value set | No | Any concept_name from Study class |
| context | Treatment context of study-regimen dyad | Value set | No | Any contextPretty from context.table |
| regimen | The regimen being studied | Value set | No | Any concept_name from Regimen class |
| r_modifier | Regimen specs (cycles, dosing) | String | No | Any |
| condition | Cancer/condition being studied | Value set | No | Any concept_name from Condition class |
| biomarker | Biomarker-defined subtype | String | No | HGVS with exceptions |

## Comparator Information

| Variable | Description | Type | Multiple Values | Allowed Values |
|-|-|-|-|-|
| comparator | RCT comparator arm(s) | Value set | Yes (pipe-delimited) | Any concept_name from Regimen/Procedure classes |
| c_modifier | Comparator specifications | String | Yes (pipe-delimited) | Blank if non-comparative, "none" if comparative |

## Comparator Codes

| Variable | Description | Type | Multiple Values | Allowed Values |
|-|-|-|-|-|
| comparator_code | Comparator incompletion codes | Factor | Yes (pipe-delimited) | See codes below |

Comparator Code Values:
- none: Default
- 333: Approved drugs regimen, potential wiki addition (experimental arm)
- 555: Regimen on stub page
- 666: Approved drugs being tested, results pending
- 777: Contains investigational component, wanted upon approval
- 888: Approved drugs regimen wanted for wiki
- 999: Unwanted regimen for wiki (negative trials)

## Study Endpoints

| Variable | Description | Type | Multiple Values | Allowed Values |
|-|-|-|-|-|
| efficacy | HTML efficacy string | String | No | Any |
| toxicity | HTML toxicity string | String | No | Any |
| error | Parsing error indicator | Logical | No | TRUE; FALSE |
| endpoint | Study endpoints | Value set | No | From Endpoint class; Could not be determined |
| endpoint_type | Endpoint designation | Value set | No | Primary; Secondary; Co-primary; Undesignated |
| arm_type | Trial arm type vs comparator | Value set | No | Control; De-escalation; Escalation; In/Out-class switch |

## Metrics and Statistics

| Variable | Description | Type | Multiple Values | Allowed Values |
|-|-|-|-|-|
| metric | Effect size metric | String | No | Any |
| metric_version | Efficacy update number | Integer | No | 1, 2, 3, ... |
| metric.num.this.arm | Regimen point estimate | String | No | Numeric, NYR, ERROR |
| metric.num.that.arm | Comparator point estimate | String | No | Numeric, NYR, ERROR |
| metric.unit | Time units if applicable | Value set | No | days; weeks; months; years; rates |
| statistic | Type of statistic | Value set | No | HR; sHR; aHR; RR; OR |
| estimate | Central estimate value | Numeric | No | Any |
| est.lb | CI lower bound | Numeric | No | Any |
| est.ub | CI upper bound | Numeric | No | Any |
| est.ci | CI confidence level (e.g. 95%) | String | No | XX.XX% |
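
Because `metric.num.this.arm` and `metric.num.that.arm` are strings that mix numbers with the sentinels `NYR` and `ERROR`, they need converting before any arithmetic. A small sketch on a hypothetical toy frame (the `.numeric` suffix and the `this_arm_is_nyr` flag are names made up for this example):

```python
import pandas as pd

# Hypothetical toy values covering the allowed cases: numeric, NYR, ERROR, missing
toy_metrics = pd.DataFrame({
    "metric.num.this.arm": ["8.0", "NYR", "ERROR", None],
    "metric.num.that.arm": ["7.3", "12", "NYR", None],
})

for col in ["metric.num.this.arm", "metric.num.that.arm"]:
    # Non-numeric sentinels and missing values become NaN
    toy_metrics[col + ".numeric"] = pd.to_numeric(toy_metrics[col], errors="coerce")

# Keep track of which rows were NYR, since NaN alone loses that information
toy_metrics["this_arm_is_nyr"] = toy_metrics["metric.num.this.arm"].eq("NYR")
print(toy_metrics)
```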

## Administrative Fields

| Variable | Description | Type | Multiple Values | Allowed Values |
|-|-|-|-|-|
| temp | Troubleshooting field | String | No | Any |
| date_added | Addition date | Calendar date | No | YYYY-MM-DD |
| date_last_modified | Last modified date | Calendar date | No | YYYY-MM-DD |
| field_last_modified | Modified fields list | String | Yes (pipe-delimited) | Any |

Notes:
- "No" in the "Multiple Values" column means the field holds a single value
- Calendar dates follow ISO 8601 format
- Value sets refer to controlled vocabularies within the system
- Pipe-delimited fields allow multiple values separated by "|"
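
A short sketch of unpacking a pipe-delimited field into one row per value. The study and regimen names here are placeholders, not real rows from `study_results.csv`.

```python
import pandas as pd

# Placeholder rows: one study with two comparator arms, one with a single arm
toy_results = pd.DataFrame({
    "study": ["STUDY-A", "STUDY-B"],
    "regimen": ["Regimen W", "Regimen V"],
    "comparator": ["Regimen X|Regimen Y", "Regimen Z"],
})

# Split on the literal pipe, then give each comparator its own row
toy_results["comparator"] = toy_results["comparator"].str.split("|")
long_results = toy_results.explode("comparator", ignore_index=True)
print(long_results)
```

The same pattern should apply to the other pipe-delimited columns (`c_modifier`, `comparator_code`, `field_last_modified`), although whether their values line up position by position with `comparator` is not something the data dictionary states.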

In [None]:
study_results = pd.read_csv(dataverse_path + "/study_results.csv", encoding="latin1")
study_results.head()

Unnamed: 0,study,context,regimen,r_modifier,condition,biomarker,comparator,c_modifier,comparator_code,efficacy,toxicity,error,endpoint,endpoint_type,arm_type,metric,metric_version,metric.num.this.arm,metric.num.that.arm,metric.unit,statistic,estimate,est.lb,est.ub,est.ci,temp,date_added,date_last_modified,field_last_modified
0,5501068,Advanced or metastatic disease first-line,Cisplatin and Docetaxel (DC),,Non-small cell lung cancer squamous,,Docetaxel and Nedaplatin,none,none,Might have inferior PFS,,False,PFS,Undesignated,Control,,,,,,,,,,,,2023-05-15,,
1,20020408,Advanced or metastatic disease subsequent line...,Best supportive care,,Colorectal cancer,,Panitumumab monotherapy,none,none,Inferior PFS,,False,PFS,Primary,Control,,,,,,,,,,,,2023-05-15,,
2,20020408,Advanced or metastatic disease subsequent line...,Panitumumab monotherapy,,Colorectal cancer,RAS,Best supportive care,none,none,Superior PFS (primary endpoint)<br/>Median PFS...,,False,PFS,Primary,Escalation,Median PFS,1.0,8.0,7.3,weeks,HR,0.54,0.44,0.66,95%,,2023-05-15,2023-09-08,
3,20050181,Advanced or metastatic disease second-line,FOLFIRI,,Colorectal cancer,RAS,FOLFIRI and Panitumumab,none,none,Seems to have inferior PFS,,False,PFS,Co-primary,Control,,,,,,,,,,,,2023-05-15,,
4,20050181,Advanced or metastatic disease second-line the...,FOLFIRI,,Colorectal cancer,,FOLFIRI and Panitumumab,none,none,Seems to have inferior PFS,,False,PFS,Co-primary,Control,,,,,,,,,,,,2023-05-15,,


1. Have studies shown that drug x is effective in treating condition y at stage z? Y/N
   - regimen column is the name of the regimen; study_results has no component column, so the drug has to be linked through another table (e.g. sigs, which has both regimen and component)
   - condition column is the condition being studied
   - context column is the treatment context of the study-regimen dyad, the closest thing here to stage or status
   - study column is the study reporting the result
   - efficacy column is the efficacy of the regimen
   - stage_or_status, withdrawn (must be False), and study_yn (must be True) are not columns of study_results; they would have to be joined in from the table that defines them (presumably indications)

2. Which of the two regimens is more effective in treating condition x?
   - regimen column is the name of the regimen
   - condition column is the condition for which the regimen is used
   - context column is the treatment context in which the regimen is used
   - study column is the study reporting the comparison
   - comparator column is the name of the other regimen (may be pipe-delimited)
   - efficacy column is the efficacy of the regimen relative to the comparator
   - withdrawn (must be False) and study_yn (must be True) are not columns of study_results; apply these filters only after joining in the table that defines them
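
One possible way to turn a `study_results` row into the two-regimen question, sketched under several assumptions: the helper `comparison_question` is hypothetical, the winner is read off the first word of the `efficacy` string, and hedged wording such as "Seems to have inferior PFS" is skipped rather than guessed at. Note that `efficacy` refers to a specific endpoint (PFS in the rows shown), so "more effective" is a simplification. The toy rows copy the `study_results.head()` output above, including the truncated context string exactly as displayed.

```python
import pandas as pd

def comparison_question(row: pd.Series):
    """Turn one study_results row into a two-regimen question, or None."""
    comparator = str(row["comparator"])
    if "|" in comparator:
        return None  # more than one comparator arm, not a two-way choice
    efficacy = str(row["efficacy"])
    if efficacy.startswith("Superior"):
        winner = row["regimen"]
    elif efficacy.startswith("Inferior"):
        winner = row["comparator"]
    else:
        return None  # hedged or missing efficacy wording is skipped
    return {
        "question": (
            f"Which regimen is more effective for {row['condition']} "
            f"({row['context']}): {row['regimen']} or {row['comparator']}?"
        ),
        "answer": winner,
        "study": row["study"],
    }

# Rows 1 and 3 of the head() output above; context is truncated in that display
row_clear = pd.Series({
    "study": "20020408",
    "context": "Advanced or metastatic disease subsequent line...",
    "regimen": "Best supportive care",
    "condition": "Colorectal cancer",
    "comparator": "Panitumumab monotherapy",
    "efficacy": "Inferior PFS",
})
row_hedged = pd.Series({
    "study": "20050181",
    "context": "Advanced or metastatic disease second-line",
    "regimen": "FOLFIRI",
    "condition": "Colorectal cancer",
    "comparator": "FOLFIRI and Panitumumab",
    "efficacy": "Seems to have inferior PFS",
})
q_clear = comparison_question(row_clear)
q_hedged = comparison_question(row_hedged)
print(q_clear)
print(q_hedged)
```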


# Study Table

## Core Study Information

| Variable | Description | Type | Multiple Values | Allowed Values |
|-|-|-|-|-|
| study | The study (clinical trial) | Value set | No | From Study concept class |
| registry | Clinical trial registry | Value set | No | ClinicalTrials.gov; ISRCTN; EudraCT; UMIN |
| trial_id | Clinical trial identifier | Value set | No | From Clinical trial ID class |
| condition | Cancer/condition studied | Value set | No | From Condition concept class |
| enrollment | Years of enrollment | String | No | Any |
| phase | Clinical trial phase | String | No | Any |

## Study Design

| Variable | Description | Type | Multiple Values | Allowed Values |
|-|-|-|-|-|
| study_design | Study design type | Value set | No | Non-randomized; De-escalation; Escalation; Mixed; CBD |
| study_design_imputed | Design was imputed | Logical | No | TRUE; FALSE |
| sact | Uses systemic therapy | Logical | No | TRUE; FALSE |
| protocol | Has multi-phase protocols | Logical | No | TRUE; FALSE |

## Regulatory Information

| Variable | Description | Type | Multiple Values | Allowed Values |
|-|-|-|-|-|
| fda_reg_study | FDA regulatory labeling (column `reg_study` in study.table.csv) | Logical | No | TRUE; FALSE |
| fda_unreg_study | FDA withdrawal decision (column `unreg_study` in study.table.csv) | Logical | No | TRUE; FALSE |
| start | Enrollment start date | String | No | YYYY; YYYY-MM-DD; NR |
| end | Enrollment end date | String | No | YYYY; YYYY-MM-DD; ongoing; NR |

## Organization Details

| Variable | Description | Type | Multiple Values | Allowed Values |
|-|-|-|-|-|
| study_group | Study groups involved | Value set | Yes (pipe-delimited) | From Study Group class |
| sponsor | Study sponsors | String | No | Any |
| sponsor_type | Type of sponsor | Value set | No | Academic; Consortium; Community; Cooperative; Government; Industry |

## Administrative

| Variable | Description | Type | Multiple Values | Allowed Values |
|-|-|-|-|-|
| temp | Troubleshooting field | String | No | Any |
| date_added | Addition date | Calendar date | No | YYYY-MM-DD |
| date_last_modified | Last modified date | Calendar date | No | YYYY-MM-DD |

In [7]:
study_table = pd.read_csv(dataverse_path + "/study.table.csv", encoding="latin1")
study_table.head()

Unnamed: 0,study,registry,trial_id,condition,condition_cui,enrollment,phase,study_design,study_design_imputed,sact,protocol,reg_study,unreg_study,start,end,study_group,sponsor,sponsor_type,temp,date_added,date_last_modified
0,KN035-BTC,ClinicalTrials.gov,NCT03478488,Cholangiocarcinoma,580.0,,,CBD,False,,False,False,False,,,,"3D Medicines (Sichuan) Co., Ltd.",Pharmaceutical industry,,D-2022-09-08,
1,KN035-CN-006,ClinicalTrials.gov,NCT03667170,Mismatch repair deficient malignancy,624.0,2018-2019,Phase 2,Non-randomized,False,True,False,False,False,2018.0,2019.0,,"3D Medicines (Sichuan) Co., Ltd.",Pharmaceutical industry,,D-2023-03-16,
2,AMPECT,ClinicalTrials.gov,NCT02494570,PEComa,37829.0,2016-2018,Phase 2,Non-randomized,False,True,False,True,False,2016.0,2018.0,,"Aadi Bioscience, Inc.",Pharmaceutical industry,,D-2022-01-05,
3,AB06006,ClinicalTrials.gov,NCT00814073,Systemic mastocytosis,669.0,2009-2015,Phase 3,CBD,False,True,False,False,False,2009.0,2015.0,,AB Science,Pharmaceutical industry,,D-2019-05-27,
4,AB07001,ClinicalTrials.gov,NCT01506336,Gastrointestinal stromal tumor,602.0,2009-2011,Randomized phase 2,CBD,False,True,False,False,False,2009.0,2011.0,,AB Science,Pharmaceutical industry,,D-2019-05-27,
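
The `start` and `end` fields allow several formats (YYYY, YYYY-MM-DD, NR, ongoing), and pandas reads the year-only values above as floats such as `2018.0`. A small sketch of normalising them to a nullable integer year; the helper name `extract_year` and the toy values are made up for this example.

```python
import pandas as pd

def extract_year(values: pd.Series) -> pd.Series:
    """Pull a leading four-digit year out of mixed start/end values."""
    # astype(str) turns floats like 2018.0 into "2018.0" and missing values
    # into "nan"/"None"; anything without a leading year does not match
    years = values.astype(str).str.extract(r"^(\d{4})", expand=False)
    return pd.to_numeric(years, errors="coerce").astype("Int64")

# Toy values covering the allowed formats plus a float and a missing value
toy_dates = pd.Series(["2016", "2009-05-01", "NR", "ongoing", 2018.0, None])
years = extract_year(toy_dates)
print(years)
```

Using the nullable `Int64` dtype keeps "NR", "ongoing", and missing values as `<NA>` instead of forcing the whole column to float.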


# Variant

## Core Information

| Variable | Description | Type | Multiple Values | Allowed Values |
|-|-|-|-|-|
| study | Associated clinical trials | Value set | Yes (pipe-delimited) | From Study concept class |
| regimen | Studied regimen | Value set | No | From Regimen concept class |
| variant | Normalized variant name | String | No | Variant #01; Variant #02; etc |
| variant_cui | Variant concept code | Value set | No | From Regimen Variant class |

## Structure Counters

| Variable | Description | Type | Multiple Values | Allowed Values |
|-|-|-|-|-|
| portions | Number of portions | Integer | No | 0, 1, 2, ... |
| components | Number of components | Integer | No | 0, 1, 2, ... |
| timings | SIGs with timing instruction | Integer | No | 0, 1, 2, ... |
| branches | SIGs with branching | Integer | No | 0, 1, 2, ... |
| sigs | Total SIGs present | Integer | No | 0, 1, 2, ... |
| cyclesigs | Total cycleSIGs present | Integer | No | 0, 1, 2, ... |
| routes | Total routes present | Integer | No | 0, 1, 2, ... |

## SIG Specifications

| Variable | Description | Type | Multiple Values | Allowed Values |
|-|-|-|-|-|
| allSIGsHaveDose | All SIGs have dosing | Logical | No | TRUE; FALSE |
| allSIGsHaveDoseUnit | All SIGs have dose unit | Logical | No | TRUE; FALSE |
| allSIGsHaveRoute | All SIGs have route | Logical | No | TRUE; FALSE |
| allSIGsHaveSchedule | All SIGs have schedule | Logical | No | TRUE; FALSE |
| allSIGsHaveDuration | All SIGs have duration | Logical | No | TRUE; FALSE |
| allSIGsHaveDurationUnit | All SIGs have duration unit | Logical | No | TRUE; FALSE |
| allSIGsHaveFrequency | All SIGs have frequency | Logical | No | TRUE; FALSE |
| allSIGsHaveSequence | All SIGs have sequencing | Logical | No | TRUE; FALSE |
| allSIGsHaveCycleSIGs | All SIGs have cycleSIG | Logical | No | TRUE; FALSE |
| fullySpecified | All indicators are TRUE | Logical | No | TRUE; FALSE |

## Administrative

| Variable | Description | Type | Multiple Values | Allowed Values |
|-|-|-|-|-|
| blob | Concatenated variant components | String | No | Any |
| version | Variant version number | Integer | No | 1, 0, -1, etc |
| date_added | Addition date | Calendar date | No | YYYY-MM-DD |
| valid | Row validity indicator | Logical | No | TRUE; FALSE |

Notes:
- Versions count downward: the current version is 1, and earlier versions are 0, -1, etc.
- Duration fields default to TRUE for non-IV medications
- All logical fields are TRUE/FALSE only
- Calendar dates follow ISO 8601 format
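
Given the versioning note above, a question generator would probably want only the current, valid row for each variant. A minimal sketch on a toy frame whose values mirror the first three rows of the `variant.head()` output shown in this section:

```python
import pandas as pd

# Subset of columns, values taken from the first three rows of the variant table
toy_variant = pd.DataFrame({
    "regimen": ["(90)YFC", "(90)YFC", "(90)YFC, then allo HSCT"],
    "variant": ["Variant #01", "Variant #01", "Variant #01"],
    "variant_cui": [129495, 129495, 129496],
    "version": [1, 0, 0],
    "valid": [True, False, False],
})

# Keep rows that are both the current version and flagged as valid
current_variants = toy_variant[(toy_variant["version"] == 1) & toy_variant["valid"]]
print(current_variants)
```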

In [11]:
variant = pd.read_csv(dataverse_path + "/variant.table.csv", encoding="latin1")
variant.head()

Unnamed: 0,study,regimen,variant,variant_cui,portions,components,timings,branches,sigs,cyclesigs,routes,allSIGsHaveDose,allSIGsHaveDoseUnit,allSIGsHaveRoute,allSIGsHaveSchedule,allSIGsHaveDuration,allSIGsHaveDurationUnit,allSIGsHaveFrequency,allSIGsHaveSequence,allSIGsHaveCycleSIGs,fullySpecified,blob,version,date_added,valid,temp
0,MDACC ID01-233,(90)YFC,Variant #01,129495,0,5,0,0,6,0,1,False,False,False,False,False,False,False,False,False,False,"(90)YFC|-|Cyclophosphamide|primary systemic||none|N/A|One course|1 of 1|750|750|mg/m^2|FALSE|||IV|-5,-4,-3||||once per day|||||-|-|Fludarabine|primary systemic||none|N/A|One course|1 of 1|30|30|mg/m^2|FALSE|||IV|-5,-4,-3||||once per day|||||-|-|Ibritumomab tiuxetan|primary systemic||none|N/A|One course|1 of 2||||FALSE|||IV|||||once|||||-|-|Ibritumomab tiuxetan|primary systemic||none|N/A|One course|2 of 2|0.4|0.4|mCi/kg|FALSE|32|GBq|IV|||||once|(maximum dose of 32 mCi/1.2 GBq)||||-|-|Rituximab|primary systemic||none|N/A|One course|1 of 1|250|250|mg/m^2|FALSE|||IV|-14,-7||||once per day|||||-",1,2024-03-14,True,
1,MDACC ID01-233,(90)YFC,Variant #01,129495,0,5,0,0,6,0,1,False,False,False,False,False,False,False,False,False,False,"(90)YFC|-|Allogeneic stem cell|primary systemic||none|N/A||1 of 1||||FALSE||||0|||||||||-|-|Cyclophosphamide|primary systemic||none|N/A||1 of 1|750|750|mg/m^2|FALSE|||IV|-5,-4,-3||||once per day|||||-|-|Fludarabine|primary systemic||none|N/A||1 of 1|30|30|mg/m^2|FALSE|||IV|-5,-4,-3||||once per day|||||-|-|Ibritumomab tiuxetan|primary systemic||none|N/A||1 of 2||||FALSE|||IV|||||once|||||-|-|Ibritumomab tiuxetan|primary systemic||none|N/A||2 of 2|0.4|0.4|mCi/kg|FALSE|32|GBq|IV|||||once|(maximum dose of 32 mCi/1.2 GBq)||||-|-|Rituximab|primary systemic||none|N/A||1 of 1|250|250|mg/m^2|FALSE|||IV|-14,-7||||once per day|||||-",0,2023-09-19,False,
2,MDACC ID01-233,"(90)YFC, then allo HSCT",Variant #01,129496,0,5,0,0,6,0,1,False,False,False,False,False,False,False,False,False,False,"(90)YFC, then allo HSCT|-|Allogeneic stem cells|primary systemic||none|N/A|One course|1 of 1||||FALSE||||0|||||||||-|-|Cyclophosphamide|primary systemic||none|N/A|One course|1 of 1|750|750|mg/m^2|FALSE|||IV|-5,-4,-3||||once per day|||||-|-|Fludarabine|primary systemic||none|N/A|One course|1 of 1|30|30|mg/m^2|FALSE|||IV|-5,-4,-3||||once per day|||||-|-|Ibritumomab tiuxetan|primary systemic||none|N/A|One course|1 of 2||||FALSE|||IV|||||once|||||-|-|Ibritumomab tiuxetan|primary systemic||none|N/A|One course|2 of 2|0.4|0.4|mCi/kg|FALSE|32|GBq|IV|||||once|(maximum dose of 32 mCi/1.2 GBq)||||-|-|Rituximab|primary systemic||none|N/A|One course|1 of 1|250|250|mg/m^2|FALSE|||IV|-14,-7||||once per day|||||-",0,2024-06-15,False,
3,MDACC ID01-233,"(90)YFC, then allo HSCT",Variant #01,129496,0,5,0,0,6,0,1,False,False,False,False,False,False,False,False,False,False,"(90)YFC, then allo HSCT|-|Allogeneic stem cell|primary systemic||none|N/A|One course|1 of 1||||FALSE||||0|||||||||-|-|Cyclophosphamide|primary systemic||none|N/A|One course|1 of 1|750|750|mg/m^2|FALSE|||IV|-5,-4,-3||||once per day|||||-|-|Fludarabine|primary systemic||none|N/A|One course|1 of 1|30|30|mg/m^2|FALSE|||IV|-5,-4,-3||||once per day|||||-|-|Ibritumomab tiuxetan|primary systemic||none|N/A|One course|1 of 2||||FALSE|||IV|||||once|||||-|-|Ibritumomab tiuxetan|primary systemic||none|N/A|One course|2 of 2|0.4|0.4|mCi/kg|FALSE|32|GBq|IV|||||once|(maximum dose of 32 mCi/1.2 GBq)||||-|-|Rituximab|primary systemic||none|N/A|One course|1 of 1|250|250|mg/m^2|FALSE|||IV|-14,-7||||once per day|||||-",-1,2024-03-14,False,
4,MDACC ID01-233,"(90)YFC, then allo HSCT",Variant #01,129496,0,5,0,0,6,0,1,False,False,False,False,False,False,False,False,False,False,"(90)YFC, then allo HSCT|-|Allogeneic stem cell|primary systemic||none|N/A||1 of 1||||FALSE||||0|||||||||-|-|Cyclophosphamide|primary systemic||none|N/A||1 of 1|750|750|mg/m^2|FALSE|||IV|-5,-4,-3||||once per day|||||-|-|Fludarabine|primary systemic||none|N/A||1 of 1|30|30|mg/m^2|FALSE|||IV|-5,-4,-3||||once per day|||||-|-|Ibritumomab tiuxetan|primary systemic||none|N/A||1 of 2||||FALSE|||IV|||||once|||||-|-|Ibritumomab tiuxetan|primary systemic||none|N/A||2 of 2|0.4|0.4|mCi/kg|FALSE|32|GBq|IV|||||once|(maximum dose of 32 mCi/1.2 GBq)||||-|-|Rituximab|primary systemic||none|N/A||1 of 1|250|250|mg/m^2|FALSE|||IV|-14,-7||||once per day|||||-",-2,2023-09-19,False,


In [None]:
# Print the blob column of the first row (positional lookup, independent of index labels)
print(variant["blob"].iloc[0])

(90)YFC|-|Cyclophosphamide|primary systemic||none|N/A|One course|1 of 1|750|750|mg/m^2|FALSE|||IV|-5,-4,-3||||once per day|||||-|-|Fludarabine|primary systemic||none|N/A|One course|1 of 1|30|30|mg/m^2|FALSE|||IV|-5,-4,-3||||once per day|||||-|-|Ibritumomab tiuxetan|primary systemic||none|N/A|One course|1 of 2||||FALSE|||IV|||||once|||||-|-|Ibritumomab tiuxetan|primary systemic||none|N/A|One course|2 of 2|0.4|0.4|mCi/kg|FALSE|32|GBq|IV|||||once|(maximum dose of 32 mCi/1.2 GBq)||||-|-|Rituximab|primary systemic||none|N/A|One course|1 of 1|250|250|mg/m^2|FALSE|||IV|-14,-7||||once per day|||||-


1. Which of the following is a valid variant of regimen x?
   - regimen column is the name of the regimen; link it to regimen and regimen_cui in the sig/pointer tables
   - variant column is the variant of the regimen
   - valid column is True
   
   - Note: This one needs more thought. HemOnc is probably not completely comprehensive of all possible variants, so a plausible distractor could turn out to be a real variant; we may need a clash-eval style question
   - i.e. make the wrong answers off by 10x/100x so that they are definitely incorrect
   - Ideally we would manually review some, or use the UMLS database to check
   - Lower-priority question