## **Testing Trial2Vec repository**

In [5]:
from trial2vec import download_embedding
t2v_emb = download_embedding()

Download pretrained Trial2Vec model, save to ./trial_search/pretrained_trial2vec.
Load pretrained Trial2Vec model from ./trial_search/pretrained_trial2vec
load predictor config file from ./trial_search/pretrained_trial2vec/model_config.json


In [9]:
from trial2vec import Trial2Vec

model = Trial2Vec(device="cpu")

model.from_pretrained()

Load pretrained Trial2Vec model from ./trial_search/pretrained_trial2vec
load predictor config file from ./trial_search/pretrained_trial2vec/model_config.json


In [20]:
inputs = ['Pharmacologic Treatment Augmentation in Chronic Depression',
          'Pharmacologic Treatment Augmentation in Depression']
outputs = model.sentence_vector(inputs)

In [22]:
from sklearn.metrics.pairwise import cosine_similarity

In [21]:
outputs.shape

torch.Size([2, 128])

In [26]:
cosine_similarity(outputs[0:1], outputs[1:2])

array([[0.9546094]], dtype=float32)
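
The score above is the dot product of the two embedding vectors divided by the product of their norms. A minimal pure-Python sketch of that formula follows; the helper name `cosine_sim` and the toy 3-dimensional vectors are illustrative assumptions, not Trial2Vec outputs.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two equal-length sequences of numbers."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors (hypothetical), sharing one of their two non-zero components
u = [1.0, 0.0, 1.0]
v = [1.0, 1.0, 0.0]
print(cosine_sim(u, v))

# Parallel vectors point in the same direction, so the similarity is close to 1
print(cosine_sim([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

Two near-identical trial titles, as in the cell above, should therefore score close to 1.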

## **Parse inclusion, exclusion criteria with lxml**

In [2]:
import re
from glob import glob
from lxml import etree

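def extract_criteria(xml_file):
    """Return (inclusion, exclusion) criteria text from a ClinicalTrials.gov XML record."""
    # Parse the XML file
    tree = etree.parse(xml_file)
    root = tree.getroot()

    # Find the criteria textblock; some records may not have one
    textblocks = root.xpath("//eligibility/criteria/textblock")
    if not textblocks or not textblocks[0].text:
        return "No inclusion criteria specified", "No exclusion criteria specified"
    criteria_textblock = textblocks[0].text

    # Split the criteria into inclusion and exclusion
    criteria_parts = criteria_textblock.split("Exclusion Criteria:")
    # Take the text after the inclusion header, or the whole first part if the header is absent
    inclusion_criteria = criteria_parts[0].split("Inclusion Criteria:")[-1].strip()
    exclusion_criteria = criteria_parts[1].strip() if len(criteria_parts) > 1 else "No exclusion criteria specified"

    return inclusion_criteria, exclusion_criteria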

def process_criteria(criteria_text):
    lines = criteria_text.split('\n')
    lines = [line.strip() for line in lines if line.strip()]
    
    criteria_list = []
    for line in lines:
        # Remove bullet points and other formatting
        line = re.sub(r'^\s*-\s*', '', line)
        line = re.sub(r'^\s*•\s*', '', line)
        if line and not line[0].isupper() and criteria_list:
            criteria_list[-1] += ' ' + line
        else:
            criteria_list.append(line)
    
    return criteria_list

In [3]:
xml_paths = glob("/Users/titipata/Desktop/Misc_docs/ClinicalTrialGov/ctg-public-xml/*/*.xml")

In [4]:
xml_paths[10]

'/Users/titipata/Desktop/Misc_docs/ClinicalTrialGov/ctg-public-xml/NCT0483xxxx/NCT04834141.xml'

In [3]:
inclusion_text, exclusion_text = extract_criteria(xml_paths[10])
inclusion_list = process_criteria(inclusion_text)
exclusion_list = process_criteria(exclusion_text)

In [44]:
inclusion_list

['Adults aged 18 and over willing to attend the study.',
 'For those who join the thoracic kyphosis group, individuals with a kyphosis angle ≥ 40 degrees.',
 'Individuals with a kyphosis angle < 40 degrees for the control group.']

In [45]:
exclusion_list

['Spine trauma, surgery, bone pathology, arthritis etc. have a history of illness',
 "Kyphotic deformity types are rounded back, Scheuermann's disease, hunched back, flat back and Dowager hump.",
 'Any spinal deformity, bone abnormality, postural deformity and disc herniation with / without peripheral symptoms.',
 'Body mass index (BMI), which is an indicator of obesity, is more than > 30.',
 'Complaining of balance problems, coordination problems, other neurological or vestibular diseases that affect body balance and posture.',
 'Having any orthopedic or neurological disease that affects the body joints or the integrity of the musculoskeletal system.',
 'Use of any medication that can cause dizziness or drowsiness in the last months.']
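
`extract_criteria` splits on the exact strings "Inclusion Criteria:" and "Exclusion Criteria:", so a record that writes the headers in a different case or without the colon would not be split as intended. Below is a hedged sketch of a more tolerant splitter. The function name `split_criteria` and the sample text are hypothetical, and it assumes each header starts its own line.

```python
import re

# Header at the start of a line, any case, optional trailing colon
_HEADER = re.compile(
    r"^[ \t]*(inclusion|exclusion)\s+criteria[ \t]*:?",
    re.IGNORECASE | re.MULTILINE,
)

def split_criteria(text):
    """Return {'inclusion': str, 'exclusion': str}; missing sections are ''."""
    sections = {"inclusion": "", "exclusion": ""}
    matches = list(_HEADER.finditer(text))
    for i, m in enumerate(matches):
        # A section runs from the end of its header to the start of the next header
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        key = m.group(1).lower()
        # Append, so a repeated header does not overwrite earlier text
        sections[key] = (sections[key] + "\n" + text[m.end():end]).strip()
    return sections

# Made-up sample, not taken from a real record
sample = """
Inclusion criteria
  - Adults aged 18 and over
EXCLUSION CRITERIA:
  - History of spine surgery
"""
parts = split_criteria(sample)
print(parts["inclusion"])
print(parts["exclusion"])
```

Each returned section can then be passed to `process_criteria` as before.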

## **Gemini to generate inclusion and exclusion criteria**

In [34]:
from google.oauth2 import service_account
from google.cloud import aiplatform
from vertexai.generative_models import GenerativeModel, Part, Image

project_name = "cpf-generative-ai" # put your project name here
json_path = "/Users/titipata/Documents/git/cpf-generative-ai-c7a9717e186a.json"
credentials = service_account.Credentials.from_service_account_file(json_path) # put the path to your JSON key file here
aiplatform.init(project=project_name, credentials=credentials)

In [32]:
from lxml import etree

def extract_info_and_generate_prompt(xml_file):
    # Parse the XML file
    tree = etree.parse(xml_file)
    root = tree.getroot()

    # Extract title and description
    title = root.xpath("//brief_title")[0].text.strip()
    description = root.xpath("//detailed_description/textblock")[0].text.strip()

    # Generate prompt for Gemini
    prompt = f"""Based on the following clinical study information, generate potential inclusion and exclusion criteria:

Title: {title}

Description: {description}

Please provide a list of inclusion criteria and a list of exclusion criteria that would be appropriate for this study. Format your response as follows:

Inclusion Criteria:
1. [Criterion 1]
2. [Criterion 2]
...

Exclusion Criteria:
1. [Criterion 1]
2. [Criterion 2]
...
"""

    return prompt

def get_response(prompt, model_name="gemini-1.5-flash"):
    """
    Get a response from the AI model.
    
    Args:
    prompt (str): The question or text for the model to answer
    model_name (str): Name of the model to use (default is "gemini-1.5-flash")
    
    Returns:
    str: The text of the model's reply
    """
    model = GenerativeModel(model_name)
    response = model.generate_content(prompt)
    return response.text

# Example usage
xml_file = xml_paths[10]  # Replace with the path to your XML file
gemini_prompt = extract_info_and_generate_prompt(xml_file)
print(gemini_prompt)

Based on the following clinical study information, generate potential inclusion and exclusion criteria:

Title: Correlation Between Thoracic Kyphosis Posture and Static Balance

Description: Background:

Kyphosis is roughly a slight forward curvature of the spine. A slight kyphosis or
posterior curvature is normal throughout the human body and is present in every
individual. Hyperkyphotic is a kyphotic angle greater than 40° commonly measured on a
lateral X-ray measured by the Cobb method between C7 and T12. Postural stability or
balance is the ability to keep the center of mass within the boundaries of the support
base. Moving the center of mass beyond the boundaries of the support base may cause
postural instability and loss of balance. There are studies showing that kyphotic posture
affects the center of gravity and affects fall in the elderly, but there are limited
studies on the effect of balance in young individuals.

The Aim of This Study Is:

To study the correlat

In [9]:
criteria_pred = get_response(gemini_prompt)

In [11]:
print(criteria_pred)

## Inclusion Criteria:

1. **Age:** Participants must be between the ages of [specify age range] (e.g., 18-35 years old). This ensures the study focuses on a specific age group and minimizes confounding factors from different age groups.
2. **Thoracic Kyphosis:** Participants must have a measurable thoracic kyphosis angle, determined via a lateral X-ray using the Cobb method. This ensures the study includes individuals with varying degrees of kyphosis for correlation analysis.
3. **No History of Neurological Disorders:** Participants should have no history of neurological disorders or injuries that could affect their balance (e.g., stroke, Parkinson's disease, spinal cord injury). This eliminates potential confounding factors related to neurological impairment.
4. **No History of Musculoskeletal Conditions:** Participants should not have any existing musculoskeletal conditions that could affect their posture or balance (e.g., scoliosis, osteoarthritis, spinal stenosis). This ensures th
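
The generated criteria come back as Markdown with numbered items and bold labels, whereas the parsed study criteria are plain Python lists. A rough sketch of converting such a response into lists follows, assuming items look like `N. **Label:** text` under an "Inclusion Criteria" or "Exclusion Criteria" heading. The function name `parse_generated_criteria` and the shortened sample text are illustrative assumptions; real responses may not follow this format.

```python
import re

def parse_generated_criteria(markdown_text):
    """Group numbered items under the most recent Inclusion/Exclusion heading."""
    result = {"inclusion": [], "exclusion": []}
    current = None
    for line in markdown_text.splitlines():
        line = line.strip()
        heading = re.match(
            r"^#*\s*\**\s*(inclusion|exclusion)\s+criteria", line, re.IGNORECASE
        )
        if heading:
            current = heading.group(1).lower()
            continue
        item = re.match(r"^\d+\.\s+(.*)", line)
        if item and current:
            # Drop the Markdown bold markers around the label
            result[current].append(item.group(1).replace("**", ""))
    return result

# Made-up, shortened sample in the same shape as the response above
sample = """## Inclusion Criteria:

1. **Age:** Participants must be adults.
2. **Thoracic Kyphosis:** Measurable kyphosis angle.

## Exclusion Criteria:

1. **Pregnancy:** Pregnant participants are excluded.
"""
parsed = parse_generated_criteria(sample)
print(parsed["inclusion"])
print(parsed["exclusion"])
```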

In [16]:
def generate_comparison_prompt(xml_file, gemini_generated_criteria):
    actual_inclusion, actual_exclusion = extract_criteria(xml_file)
    
    prompt = f"""Compare the following sets of inclusion and exclusion criteria:

Actual Criteria (from the study):

Inclusion Criteria:
{actual_inclusion}

Exclusion Criteria:
{actual_exclusion}

Generated Criteria:
{gemini_generated_criteria}

Please analyze and compare these two sets of criteria. In your analysis:
1. Identify any criteria that are present in both sets.
2. Highlight any important criteria from the actual set that are missing in the generated set.
3. Point out any criteria in the generated set that might not be necessary or relevant based on the actual criteria.
4. Provide an overall assessment of how well the generated criteria match the actual criteria in terms of relevance and comprehensiveness.
5. Suggest any improvements that could be made to the generated criteria to better align with the actual study requirements.

Format your response in JSON format in a clear, structured manner, addressing each of these points separately."""

    return prompt
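
Because the prompt asks for JSON, the reply may arrive wrapped in a Markdown code fence, and it may also be malformed or cut off. The sketch below shows one way to load such a reply into a Python object. The helper name `parse_json_response` and the sample reply are hypothetical and not part of the Vertex AI SDK.

```python
import json
import re

FENCE = "`" * 3  # built programmatically so this example contains no literal fence

def parse_json_response(text):
    """Parse JSON from a model reply, tolerating a surrounding Markdown code fence."""
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)\s*" + re.escape(FENCE)
    match = re.search(pattern, text, re.DOTALL)
    payload = match.group(1) if match else text
    try:
        return json.loads(payload)
    except json.JSONDecodeError:
        # Malformed or truncated reply: let the caller decide what to do
        return None

# Made-up reply in the shape requested by the prompt
reply = FENCE + 'json\n{"analysis": {"common_criteria": []}}\n' + FENCE
print(parse_json_response(reply))
```

Returning `None` rather than raising keeps a batch run over many trials from stopping at the first bad reply; raising would be the stricter alternative.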

In [21]:
actual_inclusion, actual_exclusion = extract_criteria(xml_file)

In [30]:
# print("Inclusion Criteria: \n")
# print(actual_inclusion)
# print("Exclusion Criteria: \n")
# print(actual_exclusion)

# print("\n\nPredicted Criteria: \n")
# print(criteria_pred)

In [17]:
comparison_prompt = generate_comparison_prompt(xml_file, criteria_pred)
comparison = get_response(comparison_prompt)

In [19]:
print(comparison)

```json
{
  "analysis": {
    "common_criteria": [
      {
        "criteria": "Age",
        "description": "Both sets include an age criterion, though the actual criteria simply states 'adults aged 18 and over', while the generated criteria specifies an age range."
      },
      {
        "criteria": "History of Neurological Disorders",
        "description": "Both sets exclude individuals with a history of neurological disorders, with the actual criteria mentioning 'other neurological or vestibular diseases that affect body balance and posture' and the generated criteria focusing on balance-affecting neurological conditions."
      },
      {
        "criteria": "History of Musculoskeletal Conditions",
        "description": "Both sets exclude individuals with existing musculoskeletal conditions that could affect posture or balance. The actual criteria mentions 'orthopedic or neurological disease that affects the body joints or the integrity of the musculoskeletal system', while th

## **Generate CoT**

In [24]:
def extract_info_from_xml(xml_file):
    # Parse the XML file
    tree = etree.parse(xml_file)
    root = tree.getroot()

    # Extract title and description
    title = root.xpath("//brief_title")[0].text.strip()
    description = root.xpath("//detailed_description/textblock")[0].text.strip()

    # Reuse extract_criteria (defined earlier) instead of repeating the split logic
    inclusion_criteria, exclusion_criteria = extract_criteria(xml_file)

    return {
        "title": title,
        "description": description,
        "inclusion_criteria": inclusion_criteria,
        "exclusion_criteria": exclusion_criteria
    }

In [7]:
xml_file = xml_paths[10]
extracted_keys = extract_info_from_xml(xml_file)

In [38]:
def generate_explanation_prompt(title, description, inclusion_criteria, exclusion_criteria):
    prompt = f"""Study Title: {title}

Study Description: {description}

Based on the study title, description, and the following inclusion and exclusion criteria, please provide a detailed explanation for why each criterion was likely chosen. Consider the study's objectives, potential confounding factors, and ethical considerations in your explanation.

Inclusion Criteria: {inclusion_criteria}

Exclusion Criteria: {exclusion_criteria}

For each criterion, please provide:
1. The likely reason for including or excluding this specific group according to the title and description of the study.
2. How this criterion relates to the study's objectives.
3. Any potential biases or limitations this criterion might introduce.
4. Any ethical considerations related to this criterion.

Please structure your response by addressing each criterion separately, clearly indicating whether it's an inclusion or exclusion criterion."""
    return prompt

In [39]:
generate_prompt = generate_explanation_prompt(
    extracted_keys["title"], extracted_keys["description"],
    extracted_keys["inclusion_criteria"], extracted_keys["exclusion_criteria"]
)

In [40]:
explanation = get_response(generate_prompt)

In [41]:
print(explanation)

## Inclusion/Exclusion Criteria Analysis:

**1. Inclusion Criterion: Adults aged 18 and over willing to attend the study.**

* **Reason:** This criterion ensures the study participants are legally capable of providing informed consent and understanding the study procedures.  
* **Relation to Objectives:** Focusing on adults eliminates age-related developmental variations in spinal curvature and balance abilities, allowing for a more focused analysis of the relationship between thoracic kyphosis and balance in a mature population.
* **Potential Biases:**  This criterion may introduce bias by limiting the study to a specific age group.  There may be different relationships between kyphosis and balance in adolescents or older adults.
* **Ethical Considerations:** This criterion aligns with ethical research practices by ensuring participants are of legal age to provide informed consent and participate in the study.

**2. Inclusion Criterion: Individuals with a kyphosis angle ≥ 40 degrees (