### Generate synthetic questions to evaluate LLMs more broadly for specific cancer questions ###


In [1]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0" # note the GPU index as a string 

# Import libraries

In [1]:
import requests
import pandas as pd
import numpy as np
import io

# Construct data from the CIViC database for eval questions
- At the heart of CIViC is the clinical evidence statement: a piece of information manually curated from trusted medical literature about a molecular profile (variant) or genomic ‘event’ that has implications for protein function, oncogenicity, cancer predisposition, diagnosis (aka molecular classification), prognosis, or predictive response to therapy. For example, “Patients with BRAF V600 mutations respond well to the drug dabrafenib”. A molecular profile comprises one or more variants, each of which may be a single nucleotide substitution, a small insertion or deletion, an RNA gene fusion, a chromosomal rearrangement, an RNA expression pattern (e.g. over-expression), etc. Each clinical evidence statement corresponds to a single citable publication.
- Evidence Items follow a structured knowledge model with required fields (Molecular Profile Name (Gene/Variant), Source, Variant Origin, Disease, Evidence Statement, Evidence Type, Evidence Level, Evidence Direction, Significance, and Trust Rating) and additional optional fields (Associated Phenotypes, etc.). For some Evidence Types, additional required or optional fields become available (e.g., Predictive Evidence Types require a Therapy/Drug Name).
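
The knowledge model described above can be sketched as a small data class. This is only an illustration: `EvidenceItem` is a hypothetical name, the field names borrow the TSV column names used later in this notebook, and mapping "Source" to `citation` and "Trust Rating" to `rating` is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvidenceItem:
    """Hypothetical container for one evidence item (not a CIViC API)."""
    # required fields, named after the TSV columns (mapping is an assumption)
    molecular_profile: str
    citation: str
    variant_origin: str
    disease: str
    evidence_statement: str
    evidence_type: str
    evidence_level: str
    evidence_direction: str
    significance: str
    rating: float
    # optional fields
    phenotypes: Optional[str] = None
    therapies: Optional[str] = None

    def __post_init__(self):
        # Predictive evidence types additionally require a therapy/drug name
        if self.evidence_type == 'Predictive' and not self.therapies:
            raise ValueError('Predictive evidence requires a therapy')
```

The `__post_init__` check mirrors the one type-specific rule stated above; other evidence types may carry their own requirements that this sketch does not model.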

In [2]:
curl_cmd = 'https://civicdb.org/downloads/nightly/nightly-ClinicalEvidenceSummaries.tsv'

In [3]:
response = requests.get(curl_cmd)

In [4]:
response.status_code

200

In [5]:
tsv_data = io.StringIO(response.content.decode('utf-8'))

In [8]:
evidence_info = pd.read_csv(tsv_data, sep='\t')

In [9]:
evidence_info.head()

Unnamed: 0,molecular_profile,molecular_profile_id,disease,doid,phenotypes,therapies,therapy_interaction_type,evidence_type,evidence_direction,evidence_level,...,citation,nct_ids,rating,evidence_status,evidence_id,variant_origin,last_review_date,evidence_civic_url,molecular_profile_civic_url,is_flagged
0,JAK2 V617F,64,Lymphoid Leukemia,1037.0,,,,Diagnostic,Supports,B,...,"Levine et al., 2005",,4.0,accepted,1,Somatic,2023-01-09 21:46:26 UTC,https://civicdb.org/links/evidence_items/1,https://civicdb.org/links/molecular_profiles/64,False
1,PDGFRA D842V,99,Gastrointestinal Stromal Tumor,9253.0,,,,Diagnostic,Supports,B,...,"Lasota et al., 2004",,3.0,accepted,2,Somatic,2023-01-09 21:46:27 UTC,https://civicdb.org/links/evidence_items/2,https://civicdb.org/links/molecular_profiles/99,False
2,DNMT3A R882,32,Acute Myeloid Leukemia,9119.0,,,,Diagnostic,Supports,B,...,"LaRochelle et al., 2011",,2.0,accepted,3,Somatic,2023-01-09 21:46:25 UTC,https://civicdb.org/links/evidence_items/3,https://civicdb.org/links/molecular_profiles/32,False
3,DNMT3A R882,32,Acute Myeloid Leukemia,9119.0,,,,Diagnostic,Supports,B,...,"Ribeiro et al., 2012",,3.0,accepted,4,Somatic,2023-01-09 21:46:25 UTC,https://civicdb.org/links/evidence_items/4,https://civicdb.org/links/molecular_profiles/32,False
4,JAK2 V617F,64,Chronic Myeloid Leukemia,8552.0,,,,Diagnostic,Supports,B,...,"Levine et al., 2005",,4.0,accepted,5,Somatic,2023-01-09 21:46:26 UTC,https://civicdb.org/links/evidence_items/5,https://civicdb.org/links/molecular_profiles/64,False


In [10]:
evidence_info.columns

Index(['molecular_profile', 'molecular_profile_id', 'disease', 'doid',
       'phenotypes', 'therapies', 'therapy_interaction_type', 'evidence_type',
       'evidence_direction', 'evidence_level', 'significance',
       'evidence_statement', 'citation_id', 'source_type', 'asco_abstract_id',
       'citation', 'nct_ids', 'rating', 'evidence_status', 'evidence_id',
       'variant_origin', 'last_review_date', 'evidence_civic_url',
       'molecular_profile_civic_url', 'is_flagged'],
      dtype='object')

In [11]:
evidence_info.shape

(4528, 25)

In [12]:
evidence_info['evidence_status'].unique()

array(['accepted'], dtype=object)

In [13]:
evidence_info['evidence_level'].unique()

array(['B', 'D', 'C', 'E', 'A'], dtype=object)

In [14]:
evidence_info['evidence_level'].value_counts()

evidence_level
C    1573
B    1501
D    1228
A     194
E      32
Name: count, dtype: int64

In [15]:
# exclude fusion and frameshift mutations (frameshifts are hard to match with GDC annotations)
# note: both substring matches are case-sensitive
ssms_df = evidence_info[
    ~evidence_info['molecular_profile'].str.contains('Fusion')
    & ~evidence_info['molecular_profile'].str.contains('FS')
]

In [16]:
ssms_df.shape

(3521, 25)

In [17]:
ssms_df['molecular_profile'].unique()

array(['JAK2 V617F', 'PDGFRA D842V', 'DNMT3A R882', ...,
       'C19MC Amplification', 'C19MC TTYH1::C19MC fusion',
       'C19MC v::C19MC fusion'], dtype=object)
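
The output above still lists profiles ending in lowercase `fusion`, because the substring test for `'Fusion'` is case-sensitive. A minimal sketch of a case-insensitive filter follows; the toy series is illustrative, and treating `::` as a fusion separator is an assumption based on the names shown above.

```python
import pandas as pd

# toy profile names; the last two mirror entries in the output above
profiles = pd.Series([
    'JAK2 V617F',
    'C19MC Amplification',
    'C19MC TTYH1::C19MC fusion',
    'C19MC v::C19MC fusion',
])

# case=False ignores case; regex=False treats the pattern as a literal string
is_fusion = (
    profiles.str.contains('fusion', case=False, regex=False)
    | profiles.str.contains('::', regex=False)
)
non_fusions = profiles[~is_fusion]
```

Applying such a filter to the full table would change the row counts reported in the cells that follow.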

In [18]:
ssms_df = ssms_df.drop_duplicates(['molecular_profile'])

In [19]:
ssms_df.shape

(1478, 25)

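### combination mutation data
- how many ssm + ssm (two simple somatic mutations)
- ssm + cnv (a simple somatic mutation plus a copy number variant)
- cnv + cnv (two copy number variants)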

In [20]:
ssms_df[ssms_df['molecular_profile'].str.contains('AND')].head()

Unnamed: 0,molecular_profile,molecular_profile_id,disease,doid,phenotypes,therapies,therapy_interaction_type,evidence_type,evidence_direction,evidence_level,...,citation,nct_ids,rating,evidence_status,evidence_id,variant_origin,last_review_date,evidence_civic_url,molecular_profile_civic_url,is_flagged
64,BRAF V600E AND BRAF V600M,4170,Melanoma,1909.0,,Dabrafenib,,Predictive,Supports,C,...,"Ponti et al., 2012",,1.0,accepted,73,Somatic,2023-01-11 17:59:40 UTC,https://civicdb.org/links/evidence_items/73,https://civicdb.org/links/molecular_profiles/4170,False
78,BRAF V600E AND NF1 Loss,5379,Melanoma,1909.0,,Vemurafenib,,Predictive,Supports,D,...,"Nissan et al., 2014",,3.0,accepted,90,Somatic,2025-02-25 23:34:09 UTC,https://civicdb.org/links/evidence_items/90,https://civicdb.org/links/molecular_profiles/5379,False
80,BRAF V600E AND BRAF Amplification,4173,Colorectal Cancer,9256.0,,Selumetinib,,Predictive,Supports,E,...,"Corcoran et al., 2010",,4.0,accepted,92,Somatic,2023-01-12 04:51:38 UTC,https://civicdb.org/links/evidence_items/92,https://civicdb.org/links/molecular_profiles/4173,False
216,FLT3 ITD AND FLT3 D835Y,4616,Acute Myeloid Leukemia,9119.0,,Sorafenib,,Predictive,Supports,C,...,"Man et al., 2012",,4.0,accepted,247,Somatic,2023-10-17 15:51:35 UTC,https://civicdb.org/links/evidence_items/247,https://civicdb.org/links/molecular_profiles/4616,False
518,TYMS 5' TANDEM REPEAT,261,Colorectal Cancer,9256.0,,"Irinotecan,Fluorouracil",Combination,Predictive,Supports,B,...,"Martinez-Balibrea et al., 2010",,3.0,accepted,678,Rare Germline,2023-01-09 21:46:30 UTC,https://civicdb.org/links/evidence_items/678,https://civicdb.org/links/molecular_profiles/261,False
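
The last row above (`TYMS 5' TANDEM REPEAT`) is not a combination: the plain substring `AND` also matches inside `TANDEM`. One possible way to avoid this, sketched on a toy series, is to require word boundaries around `AND`.

```python
import pandas as pd

# toy profile names taken from the table above
profiles = pd.Series([
    'BRAF V600E AND NF1 Loss',
    "TYMS 5' TANDEM REPEAT",
    'FLT3 ITD AND FLT3 D835Y',
])

# \b anchors 'AND' at word boundaries, so 'TANDEM' no longer matches
is_combination = profiles.str.contains(r'\bAND\b', regex=True)
combinations = profiles[is_combination]
```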


### decompose molecular profile into:
- gene
- mutation
- disease

In [21]:
ssms_df_copy = ssms_df.copy()
ssms_df_copy['gene'] = ssms_df_copy['molecular_profile'].str.split(' ').str[0]
ssms_df_copy['mutation'] = ssms_df_copy['molecular_profile'].str.split(' ').str[1]

In [22]:
ssms_df_copy.shape

(1478, 27)

In [23]:
ssms_df_copy.head()

Unnamed: 0,molecular_profile,molecular_profile_id,disease,doid,phenotypes,therapies,therapy_interaction_type,evidence_type,evidence_direction,evidence_level,...,rating,evidence_status,evidence_id,variant_origin,last_review_date,evidence_civic_url,molecular_profile_civic_url,is_flagged,gene,mutation
0,JAK2 V617F,64,Lymphoid Leukemia,1037.0,,,,Diagnostic,Supports,B,...,4.0,accepted,1,Somatic,2023-01-09 21:46:26 UTC,https://civicdb.org/links/evidence_items/1,https://civicdb.org/links/molecular_profiles/64,False,JAK2,V617F
1,PDGFRA D842V,99,Gastrointestinal Stromal Tumor,9253.0,,,,Diagnostic,Supports,B,...,3.0,accepted,2,Somatic,2023-01-09 21:46:27 UTC,https://civicdb.org/links/evidence_items/2,https://civicdb.org/links/molecular_profiles/99,False,PDGFRA,D842V
2,DNMT3A R882,32,Acute Myeloid Leukemia,9119.0,,,,Diagnostic,Supports,B,...,2.0,accepted,3,Somatic,2023-01-09 21:46:25 UTC,https://civicdb.org/links/evidence_items/3,https://civicdb.org/links/molecular_profiles/32,False,DNMT3A,R882
7,KRAS G12,76,Acute Leukemia,12603.0,,,,Diagnostic,Supports,B,...,3.0,accepted,8,Somatic,2023-01-09 21:46:26 UTC,https://civicdb.org/links/evidence_items/8,https://civicdb.org/links/molecular_profiles/76,False,KRAS,G12
8,NRAS Q61,94,Melanoma,1909.0,,,,Diagnostic,Supports,B,...,3.0,accepted,10,Somatic,2023-01-09 21:46:27 UTC,https://civicdb.org/links/evidence_items/10,https://civicdb.org/links/molecular_profiles/94,False,NRAS,Q61
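
Splitting on every space and taking the second token truncates multi-word variant names. A minimal sketch that splits only on the first space is shown below; the toy names are illustrative, and the sketch still keeps only the first gene of a combination profile.

```python
import pandas as pd

# toy profile names, including multi-word variants
profiles = pd.Series([
    'JAK2 V617F',
    'PIK3CA Exon 21 Mutation',
    "TYMS 5' TANDEM REPEAT",
])

# n=1 splits on the first space only; expand=True returns two columns
parts = profiles.str.split(' ', n=1, expand=True)
parts.columns = ['gene', 'mutation']
```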


### check if a mutation is an ssm
- add an `is_ssm` flag

In [24]:
ssms_df_copy['is_ssm'] = ssms_df_copy['mutation'].str.contains(r'[A-Z]\d+[A-Z]')

In [25]:
ssms_df_copy['is_ssm'].value_counts()

is_ssm
False    771
True     707
Name: count, dtype: int64
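
`str.contains` accepts the pattern anywhere in the string, so a stricter definition of a fully specified ssm is possible. The sketch below compares the two on toy values; `V600E/K` is a hypothetical label chosen to show the difference, and whether the strict form is preferable depends on the nomenclature in the data.

```python
import pandas as pd

# toy mutation labels; None stands in for a missing value
mutations = pd.Series(['V617F', 'R882', 'ITD', 'V600E/K', None])

pattern = r'[A-Z]\d+[A-Z]'
# contains: the pattern may appear anywhere in the string
loose = mutations.str.contains(pattern, na=False)
# fullmatch: the entire string must be one ref-position-alt token
strict = mutations.str.fullmatch(pattern, na=False)
```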

### V1 eval dataset test
- test fully specified ssms

In [26]:
ssms_df_copy[ssms_df_copy['is_ssm']].shape

(707, 28)

In [27]:
ssms_df_copy[ssms_df_copy['is_ssm']].to_csv('/opt/gpudata/aartiv/rag_rig/civic_full_ssms.csv')

In [28]:

data_for_eval = pd.read_csv('/opt/gpudata/aartiv/rag_rig/civic_full_ssms.csv', sep=',', index_col=0)


In [29]:
data_for_eval.shape

(707, 28)

In [30]:
data_for_eval.head(n=6)

Unnamed: 0,molecular_profile,molecular_profile_id,disease,doid,phenotypes,therapies,therapy_interaction_type,evidence_type,evidence_direction,evidence_level,...,evidence_status,evidence_id,variant_origin,last_review_date,evidence_civic_url,molecular_profile_civic_url,is_flagged,gene,mutation,is_ssm
0,JAK2 V617F,64,Lymphoid Leukemia,1037.0,,,,Diagnostic,Supports,B,...,accepted,1,Somatic,2023-01-09 21:46:26 UTC,https://civicdb.org/links/evidence_items/1,https://civicdb.org/links/molecular_profiles/64,False,JAK2,V617F,True
1,PDGFRA D842V,99,Gastrointestinal Stromal Tumor,9253.0,,,,Diagnostic,Supports,B,...,accepted,2,Somatic,2023-01-09 21:46:27 UTC,https://civicdb.org/links/evidence_items/2,https://civicdb.org/links/molecular_profiles/99,False,PDGFRA,D842V,True
10,MAP2K1 P124S,82,Melanoma,1909.0,,Selumetinib,,Predictive,Supports,D,...,accepted,12,Somatic,2023-01-09 21:46:27 UTC,https://civicdb.org/links/evidence_items/12,https://civicdb.org/links/molecular_profiles/82,False,MAP2K1,P124S,True
11,MAP2K1 Q56P,83,Melanoma,1909.0,,Selumetinib,,Predictive,Supports,D,...,accepted,13,Somatic,2023-01-09 21:46:27 UTC,https://civicdb.org/links/evidence_items/13,https://civicdb.org/links/molecular_profiles/83,False,MAP2K1,Q56P,True
15,ARAF S214C,10,Lung Non-small Cell Carcinoma,3908.0,,Sorafenib,,Predictive,Supports,C,...,accepted,17,Somatic,2023-01-09 21:46:24 UTC,https://civicdb.org/links/evidence_items/17,https://civicdb.org/links/molecular_profiles/10,False,ARAF,S214C,True
19,NRAS G13D,93,Melanoma,1909.0,,Tanespimycin,,Predictive,Supports,C,...,accepted,21,Somatic,2023-01-09 21:46:27 UTC,https://civicdb.org/links/evidence_items/21,https://civicdb.org/links/molecular_profiles/93,False,NRAS,G13D,True


In [31]:
data_exclude_fusions = data_for_eval[~data_for_eval.molecular_profile.str.contains('Fusion') &
                                     ~data_for_eval.molecular_profile.str.contains(':') &
                                     ~data_for_eval.molecular_profile.str.contains('/')
                                     ]

In [32]:
data_exclude_fusions.shape

(699, 28)

In [33]:
len(data_exclude_fusions['gene'].unique())

109

In [34]:
data_exclude_fusions.columns

Index(['molecular_profile', 'molecular_profile_id', 'disease', 'doid',
       'phenotypes', 'therapies', 'therapy_interaction_type', 'evidence_type',
       'evidence_direction', 'evidence_level', 'significance',
       'evidence_statement', 'citation_id', 'source_type', 'asco_abstract_id',
       'citation', 'nct_ids', 'rating', 'evidence_status', 'evidence_id',
       'variant_origin', 'last_review_date', 'evidence_civic_url',
       'molecular_profile_civic_url', 'is_flagged', 'gene', 'mutation',
       'is_ssm'],
      dtype='object')

In [35]:
new_mapping = {
    'gene': 'gene_x',
    'mutation': 'mutation_x'
}
new_colnames = [ new_mapping[col] if col in new_mapping  else col for col in data_exclude_fusions.columns ]

In [36]:
new_colnames[-5:-1]

['molecular_profile_civic_url', 'is_flagged', 'gene_x', 'mutation_x']

In [37]:
data_exclude_fusions.columns = new_colnames

# Define templates for question variety for llm evals

In [38]:
# generate synthetic data for different types of questions
# note: these templates are for single mutations
# for combination mutations, rely on hand-curated examples from CIViC and the literature

template_questions = [
    'What percentage of cancers have simple somatic mutations and copy number variants in [gene_x] in the genomic data commons for [disease]?',
    'How often is the [gene_x] [mutation_x] found in [disease] in the genomic data commons?',
    'What is the occurrence rate of [gene_x] [mutation_x] in [disease] in the genomic data commons?',
    'What is the frequency of simple somatic mutations and copy number variants in [gene_x] in [disease] in the genomic data commons?',
    'What is the incidence of simple somatic mutations and copy number variants in [gene_x] in the genomic data commons for [disease]?',
    'What is the frequency of [gene_x] [mutation_x] in [disease] in the genomic data commons?',
    'How common are simple somatic mutations and copy number variants in [gene_x] in [disease] in the genomic data commons?',
    'What fraction of cases have simple somatic mutations and copy number variants in [gene_x] in [disease] in the genomic data commons?',
    'What is the rate of occurrence of [gene_x] [mutation_x] mutation in [disease] in the genomic data commons?',
    'How frequently are [gene_x] [mutation_x] mutations detected in [disease] in the genomic data commons?',
    'What proportion of cancer patients exhibit simple somatic mutations and copy number variants in [gene_x] in [disease] in the genomic data commons?',
    'What is the frequency of somatic [gene_x] heterozygous deletion in [disease] in the genomic data commons?',
    'What is the incidence of somatic [gene_x] homozygous deletion in [disease] in the genomic data commons?',
    'Can you provide the frequency of [gene_x] gain in [disease] in the genomic data commons?',
    'In [disease] data from the genomic data commons, what is the frequency of [gene_x] amplification?',
    'What is the frequency of microsatellite instability in [disease] in the genomic data commons?',
    'How common is microsatellite instability in [disease] cases in the genomic data commons?',
    'Can you provide the prevalence of microsatellite instability in [disease] in the genomic data commons?',
    'What percentage of [disease] patients have microsatellite instability in the genomic data commons?',
    'How often is microsatellite instability observed in [disease] in the genomic data commons?',
    'In [disease], what is the occurrence rate of microsatellite instability in the genomic data commons?',
    'What is the incidence of microsatellite instability in [disease] in the genomic data commons?',
]
template_questions_without_mutations = [
    query
    for query in template_questions
    if 'mutation_x' not in query
]


In [39]:
len(template_questions)

22

In [40]:
len(template_questions_without_mutations)

17

**Synthetic question answer generation**

**Dataset 1**

Generate a simpler and smaller df for first NER test
 - Remove = or * from mutation nomenclature
 - Remove gene/mutation duplicates


In [41]:
dataset1 = data_exclude_fusions[['gene_x', 'mutation_x', 'disease', 'molecular_profile']]

In [42]:
dataset1.head(n=6)

Unnamed: 0,gene_x,mutation_x,disease,molecular_profile
0,JAK2,V617F,Lymphoid Leukemia,JAK2 V617F
1,PDGFRA,D842V,Gastrointestinal Stromal Tumor,PDGFRA D842V
10,MAP2K1,P124S,Melanoma,MAP2K1 P124S
11,MAP2K1,Q56P,Melanoma,MAP2K1 Q56P
15,ARAF,S214C,Lung Non-small Cell Carcinoma,ARAF S214C
19,NRAS,G13D,Melanoma,NRAS G13D


In [43]:
dataset1.shape

(699, 4)

In [44]:
dataset1.to_csv('../csvs/civic_evidence_data.csv')

In [45]:
def generate_q(row):
    """Return up to 5 questions built from randomly chosen templates for one row."""
    mutation_x = row['mutation_x']
    gene_x = row['gene_x']
    disease = row['disease']

    if not isinstance(disease, str):
        # fall back to a generic term if the disease is missing (NaN)
        disease = 'cancer'

    # randomly choose 5 distinct templates from the list
    question_list = np.random.choice(template_questions, size=5, replace=False)
    mod_question_list = []
    for question in question_list:
        try:
            question = question.replace("[mutation_x]", mutation_x).replace("[gene_x]", gene_x).replace("[disease]", disease)
            mod_question_list.append(question)
        except Exception as e:
            # skip this template, e.g. when gene or mutation is not a string
            print(f'unable to generate question: {e}')

    return mod_question_list
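
The function above draws templates from the global NumPy random state, so each run produces a different set of questions. A minimal sketch of a seeded variant follows; `fill_templates`, the toy templates, and the seed value are all hypothetical and only illustrate the idea.

```python
import numpy as np

# three toy templates in the same placeholder style as template_questions
toy_templates = [
    'How often is the [gene_x] [mutation_x] found in [disease]?',
    'What is the frequency of [gene_x] amplification in [disease]?',
    'How common is microsatellite instability in [disease]?',
]


def fill_templates(templates, gene, mutation, disease, k, rng):
    """Pick k distinct templates with rng and substitute the placeholders."""
    chosen = rng.choice(templates, size=k, replace=False)
    return [
        str(t).replace('[gene_x]', gene)
              .replace('[mutation_x]', mutation)
              .replace('[disease]', disease)
        for t in chosen
    ]


rng = np.random.default_rng(seed=0)  # seed value is arbitrary
questions = fill_templates(toy_templates, 'JAK2', 'V617F', 'Lymphoid Leukemia', 2, rng)
```

Passing the generator explicitly keeps the sampling reproducible across runs without touching the global random state.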



In [46]:
dataset1_copy = dataset1.copy()
dataset1_copy['questions'] = dataset1_copy.apply(generate_q, axis=1)

In [47]:
dataset1_exploded = dataset1_copy.explode('questions', ignore_index=True)

In [48]:
dataset1_exploded.shape

(3495, 5)

In [49]:
# count the unique questions (the same question can be generated for several rows)
dataset1_exploded['questions'].nunique()


1822

In [50]:
dataset1_exploded.head()

Unnamed: 0,gene_x,mutation_x,disease,molecular_profile,questions
0,JAK2,V617F,Lymphoid Leukemia,JAK2 V617F,What is the incidence of microsatellite instab...
1,JAK2,V617F,Lymphoid Leukemia,JAK2 V617F,What is the frequency of somatic JAK2 heterozy...
2,JAK2,V617F,Lymphoid Leukemia,JAK2 V617F,What proportion of cancer patients exhibit sim...
3,JAK2,V617F,Lymphoid Leukemia,JAK2 V617F,How frequently are JAK2 V617F mutations detect...
4,JAK2,V617F,Lymphoid Leukemia,JAK2 V617F,What is the frequency of microsatellite instab...


In [52]:
dataset1_exploded_unique = dataset1_exploded.drop_duplicates(subset=['questions'])

In [53]:
dataset1_exploded_unique.shape

(1822, 5)

In [54]:
dataset1_exploded_unique.to_csv('../csvs/dataset1.simple_eval.csv')

### for combination mutations:
- Use full TSV from civic evidence
- filter for "AND"
- exclude fusions

In [55]:
# note: the plain substring 'AND' also matches names such as "TYMS 5' TANDEM REPEAT"
combination_mutation_data = ssms_df_copy[
    ssms_df_copy.molecular_profile.str.contains('AND')
    & ~ssms_df_copy.molecular_profile.str.contains('Fusion')
]

In [56]:
combination_mutation_data.shape

(14, 28)

In [57]:
combination_mutation_data

Unnamed: 0,molecular_profile,molecular_profile_id,disease,doid,phenotypes,therapies,therapy_interaction_type,evidence_type,evidence_direction,evidence_level,...,evidence_status,evidence_id,variant_origin,last_review_date,evidence_civic_url,molecular_profile_civic_url,is_flagged,gene,mutation,is_ssm
64,BRAF V600E AND BRAF V600M,4170,Melanoma,1909.0,,Dabrafenib,,Predictive,Supports,C,...,accepted,73,Somatic,2023-01-11 17:59:40 UTC,https://civicdb.org/links/evidence_items/73,https://civicdb.org/links/molecular_profiles/4170,False,BRAF,V600E,True
78,BRAF V600E AND NF1 Loss,5379,Melanoma,1909.0,,Vemurafenib,,Predictive,Supports,D,...,accepted,90,Somatic,2025-02-25 23:34:09 UTC,https://civicdb.org/links/evidence_items/90,https://civicdb.org/links/molecular_profiles/5379,False,BRAF,V600E,True
80,BRAF V600E AND BRAF Amplification,4173,Colorectal Cancer,9256.0,,Selumetinib,,Predictive,Supports,E,...,accepted,92,Somatic,2023-01-12 04:51:38 UTC,https://civicdb.org/links/evidence_items/92,https://civicdb.org/links/molecular_profiles/4173,False,BRAF,V600E,True
216,FLT3 ITD AND FLT3 D835Y,4616,Acute Myeloid Leukemia,9119.0,,Sorafenib,,Predictive,Supports,C,...,accepted,247,Somatic,2023-10-17 15:51:35 UTC,https://civicdb.org/links/evidence_items/247,https://civicdb.org/links/molecular_profiles/4616,False,FLT3,ITD,False
518,TYMS 5' TANDEM REPEAT,261,Colorectal Cancer,9256.0,,"Irinotecan,Fluorouracil",Combination,Predictive,Supports,B,...,accepted,678,Rare Germline,2023-01-09 21:46:30 UTC,https://civicdb.org/links/evidence_items/678,https://civicdb.org/links/molecular_profiles/261,False,TYMS,5',False
610,EGFR Amplification AND EGFR EGFRVIII,4245,Glioblastoma,3068.0,,Afatinib,,Predictive,Supports,C,...,accepted,773,Somatic,2023-08-25 16:47:08 UTC,https://civicdb.org/links/evidence_items/773,https://civicdb.org/links/molecular_profiles/4245,False,EGFR,Amplification,False
921,MET Amplification AND MET Splice Site (c.3028G>A),4366,Lung Non-small Cell Carcinoma,3908.0,,Crizotinib,,Predictive,Supports,C,...,accepted,1095,Somatic,2023-03-27 21:58:17 UTC,https://civicdb.org/links/evidence_items/1095,https://civicdb.org/links/molecular_profiles/4366,False,MET,Amplification,False
1254,MTOR E2014K AND MTOR E2419K,4368,Transitional Cell Carcinoma,2671.0,,"Pazopanib,Everolimus",Combination,Predictive,Supports,C,...,accepted,1438,Somatic,2023-03-27 22:28:49 UTC,https://civicdb.org/links/evidence_items/1438,https://civicdb.org/links/molecular_profiles/4368,False,MTOR,E2014K,True
2734,PIK3CA Exon 21 Mutation AND PIK3CA Exon 10 Mut...,5317,Colorectal Cancer,9256.0,,,,Prognostic,Supports,B,...,accepted,5350,Somatic,2024-12-05 21:39:24 UTC,https://civicdb.org/links/evidence_items/5350,https://civicdb.org/links/molecular_profiles/5317,False,PIK3CA,Exon,False
3160,BRAF Amplification AND ( BRAF V600E OR BRAF V6...,4174,Melanoma,1909.0,,"Vemurafenib,Dabrafenib",Substitutes,Predictive,Supports,B,...,accepted,6262,Somatic,2023-01-12 04:52:45 UTC,https://civicdb.org/links/evidence_items/6262,https://civicdb.org/links/molecular_profiles/4174,False,BRAF,Amplification,False


In [58]:
combination_mutation_data.to_csv('../csvs/combination_mutation_data.csv')

In [59]:
dataset1_exploded_unique.head(n=6)

Unnamed: 0,gene_x,mutation_x,disease,molecular_profile,questions
0,JAK2,V617F,Lymphoid Leukemia,JAK2 V617F,What is the incidence of microsatellite instab...
1,JAK2,V617F,Lymphoid Leukemia,JAK2 V617F,What is the frequency of somatic JAK2 heterozy...
2,JAK2,V617F,Lymphoid Leukemia,JAK2 V617F,What proportion of cancer patients exhibit sim...
3,JAK2,V617F,Lymphoid Leukemia,JAK2 V617F,How frequently are JAK2 V617F mutations detect...
4,JAK2,V617F,Lymphoid Leukemia,JAK2 V617F,What is the frequency of microsatellite instab...
5,PDGFRA,D842V,Gastrointestinal Stromal Tumor,PDGFRA D842V,Can you provide the prevalence of microsatelli...
