# Objectives

This notebook creates the files needed to run the *retrieval* pipelines defined in the TrialGPT repository. It also builds each patient's clinical history internally, from the patient forms in MatchTrial that are stored in the database behind Metabase.

The notebook is therefore organized into the following milestones:

- Milestone 1. Obtain the patients' clinical histories in the appropriate JSON format
- Milestone 2. Run the ```keyword_generation.py``` pipeline and obtain the keywords
- Milestone 3. Create ```queries.jsonl```
- Milestone 4. Create ```id2queries.json```
- Milestone 5. Download ```trial_info.json```
- Milestone 6. Create ```corpus.jsonl```
- Milestone 7. Create ```test.tsv```



# Setup

In [2]:
import json
import pandas as pd
from collections import defaultdict
import jsonlines

# Milestone 1

## Loading the diagnoses and mapping their question IDs to the actual question text

In [3]:
with open('/Users/agustinlopez/Repositories/TrialGPT/dataset/metabase.json', 'r') as file:
    json2 = json.load(file)

In [4]:
print("Structure of the original JSON:", type(json2))
if isinstance(json2, list):
    print("Number of objects in the list:", len(json2))
    print("Example of the first object:", json2[0])

Structure of the original JSON: <class 'list'>
Number of objects in the list: 100
Example of the first object: {'Question 535__responseValue': '0-4 weeks', 'province': 'Asturias', 'Question 546__responseValue': 'No', 'Question 529__responseValue': 'Yes', 'cacheStatus': 2, 'country': 'SPAIN', 'Question 537__responseValue': 'Yes', 'Question 553__responseValue': 'No', 'city': 'Santullano', 'Question 547__responseValue': "Basedow's disease (Grave's disase)", 'birthday': '1972-01-01', 'lng': -5.7821242, 'gender': 'Woman', 'userProfile': 'Patient', 'Question 549__responseValue': 'Severe respiratory disease and/or infection', 'count': 1, 'userId': 906, 'Question 544__responseValue': 'I have not received any', 'Question 533__responseValue': 'Durvalumab', 'dianosticId': 2132, 'Question 527__responseValue': 'Yes', 'Question 531__responseValue': 'Yes', 'diagnosticCreatedDate': '2021-02-04', 'Question 543__responseValue': 'Yes', 'Question 536__responseValue': 'Yes', 'Question 539__responseValue': 'Yes

In [5]:
mapeo = {
    "Question 567__responseValue": "Tumor type",
    "Question 545__responseValue": "ECOG",
    "Question 554__responseValue": "Mutation",
    "Question 544__responseValue": "How many treatments in the metastasic setting have you received?",
    "Question 533__responseValue": "Of the treatments that you have received, please mark all the ones that you remember",
    "Question 528__responseValue": 'Do you remember when did you participate in a clinical trial?',
    "Question 527__responseValue": 'Have you ever participated in a clinical trial?',
    "Question 529__responseValue": 'Was it a clinical trial about cancer?',
    "Question 530__responseValue": 'When was it diagnosed?',
    "Question 531__responseValue": "Have you had any antitumor treatments?",
    "Question 534__responseValue": 'Have you relapsed after being treated?',
    "Question 535__responseValue": 'When did you receive the last treatment?',
    "Question 536__responseValue": 'Has the diagnosed cancer spread to the lymph nodes?',
    "Question 537__responseValue": 'Has developed any metastasis?',
    "Question 543__responseValue": 'Has the metastasis spread to the liver?',
    "Question 541__responseValue": 'Has the metastasis spread to the bones?',
    "Question 539__responseValue": 'Have you been diagnosed with a brain metastasis?',
    "Question 546__responseValue": 'Have you previously suffered another type of cancer?',
    "Question 547__responseValue": 'Currently, do you encounter any of the following diseases or conditions?',
    "Question 548__responseValue": 'Do you remember if you are taking any of the following medications?',
    "Question 549__responseValue": 'Have you previously suffered any of the following diseases?',
    "Question 550__responseValue": 'In the last four weeks, have you had any major surgery or have you received a transplant?',
    "Question 551__responseValue": 'Have you had a blood test done within the last four weeks?',
    "Question 552__responseValue": 'Are the blood test values normal? (Mark those that have some type of mark or annotation)',
    "Question 553__responseValue": 'Have you been performed a biopsy of the diagnosed tumor?'
}

In [6]:
mapeo.values()

dict_values(['Tumor type', 'ECOG', 'Mutation', 'How many treatments in the metastasic setting have you received?', 'Of the treatments that you have received, please mark all the ones that you remember', 'Do you remember when did you participate in a clinical trial?', 'Have you ever participated in a clinical trial?', 'Was it a clinical trial about cancer?', 'When was it diagnosed?', 'Have you had any antitumor treatments?', 'Have you relapsed after being treated?', 'When did you receive the last treatment?', 'Has the diagnosed cancer spread to the lymph nodes?', 'Has developed any metastasis?', 'Has the metastasis spread to the liver?', 'Has the metastasis spread to the bones?', 'Have you been diagnosed with a brain metastasis?', 'Have you previously suffered another type of cancer?', 'Currently, do you encounter any of the following diseases or conditions?', 'Do you remember if you are taking any of the following medications?', 'Have you previously suffered any of the following diseas

In [7]:
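# Rename every question key to its human-readable question text.
nuevo_json = []
for obj in json2:
    nuevo_obj = {}
    for key, value in obj.items():
        # Take the part of the key before '__' (e.g. 'Question 535').
        question_number = key.split('__')[0]
        # Keys without a mapping (province, city, userId, ...) are kept unchanged.
        nuevo_key = mapeo.get(question_number + '__responseValue', key)
        nuevo_obj[nuevo_key] = value
    nuevo_json.append(nuevo_obj)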

In [8]:
print("Example:", nuevo_json[0])


Example: {'When did you receive the last treatment?': '0-4 weeks', 'province': 'Asturias', 'Have you previously suffered another type of cancer?': 'No', 'Was it a clinical trial about cancer?': 'Yes', 'cacheStatus': 2, 'country': 'SPAIN', 'Has developed any metastasis?': 'Yes', 'Have you been performed a biopsy of the diagnosed tumor?': 'No', 'city': 'Santullano', 'Currently, do you encounter any of the following diseases or conditions?': "Basedow's disease (Grave's disase)", 'birthday': '1972-01-01', 'lng': -5.7821242, 'gender': 'Woman', 'userProfile': 'Patient', 'Have you previously suffered any of the following diseases?': 'Severe respiratory disease and/or infection', 'count': 1, 'userId': 906, 'How many treatments in the metastasic setting have you received?': 'I have not received any', 'Of the treatments that you have received, please mark all the ones that you remember': 'Durvalumab', 'dianosticId': 2132, 'Have you ever participated in a clinical trial?': 'Yes', 'Have you had an

In [23]:
with open('/Users/agustinlopez/Repositories/TrialGPT/dataset/metabase_modificado3.json', 'w') as file:
    json.dump(nuevo_json, file, ensure_ascii=False, indent=4)

print("The JSON file has been modified and saved successfully.")


The JSON file has been modified and saved successfully.
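
Because keys missing from ```mapeo``` are silently kept under their original name, it can help to list any question keys that were not renamed. The following is a minimal sketch; the helper ```unmapped_question_keys``` and the sample data are hypothetical and not part of the original notebook.

```python
def unmapped_question_keys(records, mapping):
    """Return the sorted 'Question ...' keys that have no entry in mapping."""
    seen = {key for record in records for key in record}
    return sorted(
        key for key in seen
        if key.startswith("Question ") and key not in mapping
    )

# Hypothetical sample data, for illustration only.
sample_records = [
    {"Question 545__responseValue": "1", "province": "Asturias"},
    {"Question 999__responseValue": "Yes"},
]
sample_mapping = {"Question 545__responseValue": "ECOG"}

print(unmapped_question_keys(sample_records, sample_mapping))
```

In the notebook this would be called as ```unmapped_question_keys(json2, mapeo)```; an empty list means every question key has a readable name.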


# Milestone 2

## Keyword generation

We adapt ```keyword_generation.py``` into ```update_keyword_generation.py``` so that it no longer reads ```queries.jsonl```, because that file is created only after the keywords have been generated. We then run ```python trialgpt_retrieval/update_keyword_generation.py metabase_modificado3.json gpt-4-turbo```.
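
The later cells read a ```summary``` and a ```conditions``` field for each diagnostic ID from the generated keyword file. A quick structural check before continuing might look like the sketch below. The helper ```check_keyword_output``` is hypothetical, and it assumes that ```summary``` is a string and ```conditions``` is a list, which the notebook does not state explicitly.

```python
def check_keyword_output(results):
    """Return the IDs whose entry lacks a 'summary' string or a 'conditions' list."""
    bad_ids = []
    for qid, entry in results.items():
        has_summary = isinstance(entry.get("summary"), str)
        # Assumption: 'conditions' is stored as a list of condition strings.
        has_conditions = isinstance(entry.get("conditions"), list)
        if not (has_summary and has_conditions):
            bad_ids.append(qid)
    return bad_ids

# Hypothetical sample data, for illustration only.
sample_results = {
    "1": {"summary": "Patient summary text.", "conditions": ["condition a"]},
    "2": {"summary": "Summary without conditions."},
}

print(check_keyword_output(sample_results))
```

An empty result suggests the file has the shape the following milestones expect.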

# Milestone 3

## Creating the ```queries.jsonl``` file

After completing the first retrieval step with ```update_keyword_generation.py```, which produces ```retrieval_keywords_gpt-4-turbo_metabase_modificado3.json```, we create ```queries.jsonl```. We model it on the ```queries.jsonl``` files that the repository provides for ```sigir```, ```trec_2021```, and ```trec_2022```.

In [26]:
input_file = "/Users/agustinlopez/Repositories/TrialGPT/results/retrieval_keywords_gpt-4-turbo_metabase_modificado3.json"

output_file = "/Users/agustinlopez/Repositories/TrialGPT/dataset/metabase/queries.jsonl"

In [27]:
with open(input_file, "r") as f:
    results = json.load(f)

In [28]:
queries = []

for qid, result in results.items():
    query_text = result["summary"]
    queries.append({"_id": qid, "text": query_text})

In [29]:
with open(output_file, "w") as f:
    for query in queries:
        json.dump(query, f)
        f.write("\n")

print(f"File {output_file} generated successfully.")

File /Users/agustinlopez/Repositories/TrialGPT/dataset/metabase/queries.jsonl generated successfully.


# Milestone 4

## Creating the ```id2queries.json``` file

In [9]:
def load_queries_jsonl(file_path):
    id2text = {}
    with open(file_path, 'r') as f:
        for line in f:
            entry = json.loads(line.strip())
            _id = entry["_id"]
            text = entry["text"]
            id2text[_id] = text
    return id2text

def load_metabase_modificado(file_path):
    with open(file_path, 'r') as f:
        data = json.load(f)
    return data

In [10]:
queries_file = "/Users/agustinlopez/Repositories/TrialGPT/dataset/metabase/queries.jsonl"
# Despite its name, metabase_file points to the keyword-generation output, not to the Metabase export.
metabase_file = "/Users/agustinlopez/Repositories/TrialGPT/results/retrieval_keywords_gpt-4-turbo_metabase_modificado3.json"

In [12]:
id2text = load_queries_jsonl(queries_file)
metabase_data = load_metabase_modificado(metabase_file)

In [13]:
id2text

{'2132': 'The patient is a woman from Santullano, Asturias, Spain, born in 1972, diagnosed with neuroendocrine cancer more than a year ago. She has participated in a cancer clinical trial and has undergone tumor removal but has not received any metastatic treatments. The cancer has metastasized to the liver, lymph nodes, and brain. She has a history of severe respiratory disease and/or infection, and she is a smoker. Her daily activity is restricted, and she spends some hours in bed but is capable of self-care. She has abnormal liver function tests (AST/SGOT, ALT/SGPT).',
 '2553': 'The patient is a woman from Zaragoza, Spain, born in 1971, diagnosed with breast cancer more than a year ago. She has participated in a clinical trial and has undergone multiple treatments including tumor removal. The cancer has metastasized to the brain and lymph nodes but not to the liver or bones. She has received three or more treatments in the metastatic setting and is currently pregnant. She has a slig

In [14]:
metabase_data.keys()

dict_keys(['2132', '2553', '2624', '3030', '3033', '3133', '3236'])

In [15]:
id2queries = {}

for _id in metabase_data.keys():
    # queries.jsonl was built from the GPT-4 summaries, so "raw" holds the
    # same text as the summary here.
    raw_query = id2text.get(_id, "")

    gpt_4_turbo_summary = metabase_data[_id]["summary"]
    gpt_4_turbo_conditions = metabase_data[_id]["conditions"]

    id2queries[_id] = {
        "raw": raw_query,
        "gpt-4-turbo": {
            "summary": gpt_4_turbo_summary,
            "conditions": gpt_4_turbo_conditions
        }
    }

In [16]:
output_file = "/Users/agustinlopez/Repositories/TrialGPT/dataset/metabase/id2queries.json"
with open(output_file, "w") as f:
    json.dump(id2queries, f, indent=4)

print(f"File {output_file} generated successfully.")

File /Users/agustinlopez/Repositories/TrialGPT/dataset/metabase/id2queries.json generated successfully.


# Milestone 5

## Downloading trial_info.json

First, we obtain the trials and their fields. We could extract the same information from the Blissey DB, but the authors of the paper already parse the clinicaltrials.gov data and publish it as ```trial_info.json```, which we obtain by running ```wget -O dataset/trial_info.json https://ftp.ncbi.nlm.nih.gov/pub/lu/TrialGPT/trial_info.json```.
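
Where ```wget``` is not available, the same download can be done from Python with the standard library. This is only a sketch: the helper ```download_if_missing``` is hypothetical, and the call is left commented out because the file is large.

```python
import urllib.request
from pathlib import Path

TRIAL_INFO_URL = "https://ftp.ncbi.nlm.nih.gov/pub/lu/TrialGPT/trial_info.json"

def download_if_missing(url, dest):
    """Download url to dest unless dest already exists; return the destination path."""
    dest = Path(dest)
    if not dest.exists():
        # Create the target folder (e.g. dataset/) if it is not there yet.
        dest.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, str(dest))
    return dest

# Example call (performs the full download, so it is commented out here):
# download_if_missing(TRIAL_INFO_URL, "dataset/trial_info.json")
```

Skipping the download when the file already exists avoids fetching it again each time the notebook is re-run.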

In [None]:
input_file = 'dataset/trial_info.json'

# Milestone 6

## Creating the corpus.jsonl file

After downloading ```trial_info.json```, we restrict it to the trials that we have in Metabase. This lets us compare this LLM-based process with our manual coding, and it also shrinks the set of trials and therefore the computational load. We then give the data the format used in the ```corpus.jsonl``` files of the sample datasets provided by the authors of the TrialGPT repository.

## Restricting the Blissey trials to the trials in Metabase

In [24]:
allowed_ncts = [
    "NCT03110107", "NCT03157128", "NCT03316638", "NCT03369223", "NCT03595059",
    "NCT03783403", "NCT03821935", "NCT03893955", "NCT03917381", "NCT03922204",
    "NCT04069026", "NCT04083599", "NCT04237649", "NCT04243499", "NCT04254107",
    "NCT04303858", "NCT04336241", "NCT04423029", "NCT04458259", "NCT04644068",
    "NCT04648202", "NCT04740424", "NCT04777994", "NCT04839991", "NCT04868877",
    "NCT04958239", "NCT05029882", "NCT05180474", "NCT05194072", "NCT05216432",
    "NCT05298592", "NCT05396833", "NCT03175224", "NCT03520075", "NCT03767075",
    "NCT03833154", "NCT03974022", "NCT04032847", "NCT04077463", "NCT04185883",
    "NCT04449874", "NCT04512430", "NCT04553692", "NCT04624204", "NCT04657068",
    "NCT04699188", "NCT04772235", "NCT04886804", "NCT04913285", "NCT04919811",
    "NCT05007782", "NCT05048797", "NCT05060432", "NCT05076396", "NCT05101265",
    "NCT05117476", "NCT05132075", "NCT05142189", "NCT05176483", "NCT05209295",
    "NCT05280314", "NCT05298423", "NCT05325866", "NCT05353257", "NCT05356741",
    "NCT05384626", "NCT05498428", "NCT05527782", "NCT05565378", "NCT05614102",
    "NCT05614739", "NCT05620134", "NCT05647122", "NCT05661578", "NCT05668585",
    "NCT05684276", "NCT05694013", "NCT05718297", "NCT05765734", "NCT05765851",
    "NCT05784012", "NCT05789069", "NCT05867121", "NCT05925530", "NCT05967689",
    "NCT05980598", "NCT06331598", "NCT02264678", "NCT02817633", "NCT02912949",
    "NCT03093116", "NCT03448042", "NCT03574779", "NCT03742895", "NCT03748186",
    "NCT03781934", "NCT03872778", "NCT03964233", "NCT04009681", "NCT04053673",
    "NCT04104776", "NCT04180371", "NCT04250155", "NCT04259450", "NCT04260529",
    "NCT04278144", "NCT04300556", "NCT04417465", "NCT04442126", "NCT04455620",
    "NCT04526106", "NCT04589845", "NCT04725474", "NCT04735978", "NCT04762602",
    "NCT04799054", "NCT04855435", "NCT04855929", "NCT04857138", "NCT04901806",
    "NCT04924075", "NCT04925284", "NCT04936178", "NCT04953897", "NCT04953910",
    "NCT04983810", "NCT05002270", "NCT05013554", "NCT05052255", "NCT05063318",
    "NCT05067283", "NCT05118789", "NCT05123482", "NCT05141474", "NCT05155332",
    "NCT05159388", "NCT05242822", "NCT05262400", "NCT05262530", "NCT05307705",
    "NCT05358379", "NCT05372367", "NCT05389462", "NCT05407675", "NCT05430555",
    "NCT05435339", "NCT05443126", "NCT05471856", "NCT05480865", "NCT05489211",
    "NCT05503797", "NCT05543629", "NCT05544552", "NCT05581719", "NCT05584670",
    "NCT05732831", "NCT05797831", "NCT05826600", "NCT05835609", "NCT05839600",
    "NCT05840224", "NCT05841563", "NCT05859464", "NCT05888831", "NCT05904496",
    "NCT05940571", "NCT06039384", "NCT06130553", "NCT02568267", "NCT02576431",
    "NCT03114319", "NCT03400332", "NCT03459222", "NCT03564691", "NCT03568656",
    "NCT03600883", "NCT03645928", "NCT03656718", "NCT03733990", "NCT03761017",
    "NCT03767348", "NCT03792724", "NCT03845166", "NCT03861793", "NCT03864042",
    "NCT03894618", "NCT03918278", "NCT04044859", "NCT04083976", "NCT04095273",
    "NCT04101357", "NCT04140500", "NCT04147234", "NCT04418661", "NCT04521686",
    "NCT04564027", "NCT04639219", "NCT04759846", "NCT04808362", "NCT04830124",
    "NCT04857372", "NCT04959266", "NCT04983238", "NCT05072106", "NCT05129280",
    "NCT05155254", "NCT02628067", "NCT03319628", "NCT03396445", "NCT03530397",
    "NCT03786484", "NCT04144140", "NCT04234113", "NCT04389632", "NCT04561362",
    "NCT04564417", "NCT04642365", "NCT04665921", "NCT04802876", "NCT04914897",
    "NCT04991740", "NCT05116891", "NCT05488314"
]

In [17]:
input_file = 'dataset/trial_info.json'

In [20]:
%cd ..

/Users/agustinlopez/Repositories/TrialGPT


In [21]:
with open(input_file, "r") as f:
    corpus = json.load(f)

In [22]:
corpus

{'NCT00000110': {'brief_title': 'Influence of Diet and Endurance Running on Intramuscular Lipids Measured at 4.1 TESLA',
  'phase': '',
  'drugs': "['magnetic resonance spectroscopy', 'dietary fat']",
  'drugs_list': ['magnetic resonance spectroscopy', 'dietary fat'],
  'diseases': "['Obesity']",
  'diseases_list': ['Obesity'],
  'enrollment': '',
  'inclusion_criteria': 'inclusion criteria: \n\n Healthy volunteers (developmental phase) \n\n Healthy endurance-trained subjects \n\n Maximum age for males is 39 \n\n Maximum age for females is 49',
  'exclusion_criteria': '',
  'brief_summary': 'The purpose of this pilot investigation is to use 1 H Magnetic Resonance Spectroscopy (MRS) to 1) document the change in intra-muscular lipid stores (IML) before and after a prolonged bout of endurance running and, 2) determine the pattern (time course) of IML replenishment following an extremely low-fat diet (10% of energy from fat) and a moderate-fat diet (35% of energy from fat). Specifically, t

In [32]:
from tqdm import tqdm

# Keep only the trials whose NCT ID is in the allowed list.
filtered_data = {}

for nct_id in tqdm(allowed_ncts, desc="Filtering NCTs"):
    if nct_id in corpus:
        filtered_data[nct_id] = corpus[nct_id]

print(f"Kept {len(filtered_data)} entries in filtered_data.")


Filtering NCTs: 100%|██████████| 223/223 [00:00<00:00, 104798.86it/s]

Kept 207 entries in filtered_data.
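
The loop iterates over 223 IDs but keeps 207 entries, so 16 IDs are either absent from ```trial_info.json``` or repeated in ```allowed_ncts```. A minimal sketch for telling the two cases apart follows; the helpers ```missing_ids``` and ```duplicated_ids``` and the sample data are hypothetical.

```python
from collections import Counter

def missing_ids(requested_ids, available):
    """Return the requested IDs that are not keys of available, in order."""
    return [nct_id for nct_id in requested_ids if nct_id not in available]

def duplicated_ids(requested_ids):
    """Return the sorted IDs that appear more than once in requested_ids."""
    counts = Counter(requested_ids)
    return sorted(nct_id for nct_id, n in counts.items() if n > 1)

# Hypothetical miniature data, for illustration only.
sample_allowed = ["NCT-A", "NCT-B", "NCT-C", "NCT-A"]
sample_corpus = {"NCT-A": {}, "NCT-C": {}}

print(missing_ids(sample_allowed, sample_corpus))
print(duplicated_ids(sample_allowed))
```

In the notebook the calls would be ```missing_ids(allowed_ncts, corpus)``` and ```duplicated_ids(allowed_ncts)```.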





In [33]:
type(filtered_data)

dict

## Building the corpus dictionary

In [38]:
def dict_corpus2(nct_id, row):
    # trial_info.json is keyed by NCT ID, so the ID comes from the dictionary
    # key rather than from a field inside the row.
    brief_title = row.get('brief_title', '')
    # Fall back to the brief title when the row has no 'official_title'.
    title = row.get('official_title', brief_title)
    phase = row.get('phase', '')
    drugs = row.get('drugs', '')
    drugs_list = row.get('drugs_list', [])
    # trial_info.json stores the conditions under 'diseases' and 'diseases_list'.
    diseases = row.get('diseases', '')
    diseases_list = row.get('diseases_list', [])
    enrollment = str(row.get('enrollment', ''))
    inclusion_criteria = row.get('inclusion_criteria', '')
    exclusion_criteria = row.get('exclusion_criteria', '')
    brief_summary = row.get('brief_summary', '')
    # The retrieval text concatenates the summary with both criteria blocks.
    text = f"{brief_summary} {inclusion_criteria} {exclusion_criteria}"

    return {
        "_id": str(nct_id),
        "title": title,
        "text": text,
        "metadata": {
            "brief_title": brief_title,
            "phase": phase,
            "drugs": drugs,
            "drugs_list": drugs_list,
            "diseases": diseases,
            "diseases_list": diseases_list,
            "enrollment": enrollment,
            "inclusion_criteria": inclusion_criteria,
            "exclusion_criteria": exclusion_criteria,
            "brief_summary": brief_summary
        }
    }

In [40]:
formatted_data = [dict_corpus2(nct_id, data) for nct_id, data in filtered_data.items()]

In [41]:
output_file = 'results/corpus.jsonl'
with open(output_file, 'w') as f:
    for entry in formatted_data:
        f.write(json.dumps(entry) + '\n')

print(f"Formatted data saved to {output_file}")

Formatted data saved to results/corpus.jsonl


# Milestone 7

## Creating ```test.tsv```

We create ```test.tsv```. The relevance score can be 0, 1, or 2. In our case we use only two values: 0 if the NCT does not match the diagnosticID in Metabase, and 2 if it does. We assign these scores manually.
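
The scoring rule can be sketched in code once the manual matches are available as pairs. The helper ```write_qrels``` and its arguments are hypothetical. The header ```query-id```, ```corpus-id```, ```score``` is an assumption and should be checked against the ```test.tsv``` files of the repository's sample datasets.

```python
import csv

def write_qrels(path, query_ids, corpus_ids, matches):
    """Write a TSV with score 2 for (query, trial) pairs in matches and 0 otherwise."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        # Assumed header; verify it against an existing test.tsv.
        writer.writerow(["query-id", "corpus-id", "score"])
        for qid in query_ids:
            for nct_id in corpus_ids:
                score = 2 if (qid, nct_id) in matches else 0
                writer.writerow([qid, nct_id, score])

# Example call with hypothetical IDs (the matches set comes from the manual coding):
# write_qrels("test.tsv", ["Q1"], ["NCT-A", "NCT-B"], {("Q1", "NCT-B")})
```

Here ```matches``` would be a set of ```(diagnostic_id, nct_id)``` tuples taken from the manual coding in Metabase, with the diagnostic IDs as strings so that they match the ```_id``` values in ```queries.jsonl```.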