## **My Feb 16th Tasks:**

### **Part 1: Retrieving and Processing Training Data:**

1) Use 10,000 Randomly Generated PMIDs and retrieve their corresponding Titles from PubMed.

2) Using these titles, 10,000 Runnable PubMed Boolean Queries are generated using ChatGPT (Known as the **Gold-Standard Query**)

3) Retrieve the top 5 PMIDs and Titles for each of the Gold-Standard Queries

### **Part 2: Fine Tuning GPT Models:**

4) Use the five retrieved Titles and corresponding Queries to fine-tune the following GPT models:

    - bioGPT-Large
    - bioGPT
    - GPT2

In [1]:
import random
import requests
from Bio import Entrez

import warnings
warnings.filterwarnings("ignore")


In [2]:
import torch

# Setup device-agnostic code
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

#### **Part 1: Retrieving and Processing Training Data:**



The code cells below accomplish the following:

1) Find a way to get random PubMed IDs (PMIDs)

2) For each PMID, get the Article Title and save this data
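One detail worth noting for the first step: drawing each PMID independently can return the same ID more than once. A minimal sketch, assuming the same 1 to 30,000,000 range used in the cell below, draws distinct candidates up front. `sample_candidate_pmids` is a hypothetical helper, and every candidate would still have to be checked against PubMed afterwards.

```python
import random

def sample_candidate_pmids(k, upper=30_000_000, seed=None):
    """Draw k distinct candidate PMIDs from 1..upper (validity is NOT checked here)."""
    rng = random.Random(seed)
    # Sampling from a range draws without replacement and avoids building the full list
    return rng.sample(range(1, upper + 1), k)

candidates = sample_candidate_pmids(50, seed=0)
print(len(candidates), len(set(candidates)))  # both counts are 50: no duplicates
```

Passing a `seed` makes the draw repeatable, which helps when the same set of PMIDs has to be regenerated later.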

In [3]:
def is_valid_pmid(pmid):
    try:
        # Query PubMed API to check if PMID exists
        response = requests.get(f'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&id={pmid}&retmode=json')
        data = response.json()
        return 'error' not in data
    except Exception as e:
        print(f"An error occurred: {e}")
        return False

def generate_random_pmid():
    while True:
        # Generate a random number within the valid PMID range
        random_pmid = random.randint(1, 30000000)

        # Check if the generated PMID is valid
        if is_valid_pmid(random_pmid):
            return random_pmid

# Generate 50 random, valid PMIDs and store them in a list
generated_pmids = [generate_random_pmid() for _ in range(50)]

Entrez.email = "richard.finney@torontomu.ca"

results = []

for PMID in generated_pmids:
    handle = Entrez.efetch(db="pubmed", id=PMID, retmode="xml")
    record = Entrez.read(handle)
    articles = record['PubmedArticle']

    if articles:
        medline_citation = articles[0].get('MedlineCitation', {})
        article = medline_citation.get('Article', {})
        title = article.get('ArticleTitle', 'Title not available')

        results.append({
            'PMID': PMID,
            'Title': title,
        })

print("Number of valid results:", len(results))


Number of valid results: 49


The list `results` now holds the PMID and the Title retrieved from PubMed for each randomly generated PMID that returned an article record (49 of the 50 in this run).

### **In this part of the code, we:**

3) Create a ChatGPT prompt where you provide it with the title and it returns a PubMed runnable Boolean Query

4) Post-process so that the query is runnable on PubMed

As noted above, we take the Title of each randomly generated PMID and use ChatGPT to create a PubMed-runnable Boolean Query. Some post-processing is also needed, because the response sometimes contains an explanation alongside the Query, and we want only the Query for PubMed.

This Query forms the Gold-Standard and is added to each entry of our original results list for organization.
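The post-processing step could look roughly like the sketch below. It is a heuristic under stated assumptions, not the method used in this notebook: `extract_boolean_query` is a hypothetical helper, and it assumes the query sits on a single line that contains an upper-case Boolean operator together with a quote or parenthesis.

```python
import re

def extract_boolean_query(response_text):
    """Return the line of a ChatGPT response that most looks like a Boolean query."""
    # Keep lines that contain an upper-case operator and a quote or opening parenthesis
    candidates = [
        line.strip()
        for line in response_text.splitlines()
        if re.search(r"\b(AND|OR|NOT)\b", line) and re.search(r'[("]', line)
    ]
    if candidates:
        # Assumption: the longest matching line is the query itself
        return max(candidates, key=len)
    # Fall back to the whole response, collapsed onto one line
    return " ".join(response_text.split())

# Made-up response text, used only to illustrate the helper
sample_response = (
    "Sure! Here is a query:\n\n"
    '("kidney" OR "renal") AND "image registration"\n\n'
    "This query searches for both concepts."
)
print(extract_boolean_query(sample_response))
```

Responses that spread the query over several lines would need a different rule, so the extracted queries are still worth spot-checking by hand.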

### **Dr. Ensan, Leandra, could I ask you to review the prompts given to ChatGPT and the responses retrieved, to see if we can improve these in any way?**

In [4]:
import os
import re
import time
from openai import OpenAI

# Read the API key from the environment instead of hard-coding it in the notebook
client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
)

def get_completion(prompt, model="gpt-3.5-turbo"):
    messages = [{"role": "user", "content": prompt}]
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,
    )
    return response.choices[0].message.content

responses=[]

for result in results:

    prompt = "For the following article title Can you generate a PubMed runnable Boolean Query. Title: " + str(result['Title'])
    print(prompt)
    response = get_completion(prompt)
    responses.append(response)

    query = response
    print('-----------------------------------------------------------------------')
    print('')

#     print("PMID:", result['PMID'])
#     print('')
    print("Parsed Query:")
    print(query)
    result['Query'] = query
    print('')
    print('-----------------------------------------------------------------------')


For the following article title Can you generate a PubMed runnable Boolean Query. Title: Activity and expression of 15-hydroxyprostaglandin dehydrogenase in cultured chorionic trophoblast and villous trophoblast cells and in chorionic explants at term with and without spontaneous labor.
-----------------------------------------------------------------------

Parsed Query:
("15-hydroxyprostaglandin dehydrogenase" OR "15-HPGD") AND ("chorionic trophoblast" OR "villous trophoblast" OR "chorionic explants") AND ("spontaneous labor" OR "term") AND ("activity" OR "expression")

-----------------------------------------------------------------------
For the following article title Can you generate a PubMed runnable Boolean Query. Title: Segmentation-driven image registration- application to 4D DCE-MRI recordings of the moving kidneys.
-----------------------------------------------------------------------

Parsed Query:
("Segmentation-driven image registration" OR "image registration driven b

### **Finally, we:**

5) Run each query through PubMed, get the top 5-10 results from the query (That boolean query will be considered the gold-standard for those 5-10 documents). You also need to get the title, abstract, MeSH terms, and Keywords for each of those documents and save all that data
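The cells below retrieve only the PMIDs and Titles. For the abstract, MeSH terms, and Keywords mentioned in this step, a sketch along the following lines may help. It assumes the nested layout that `Entrez.read` produces for PubMed XML (`MedlineCitation`, `Article`, `Abstract`, `MeshHeadingList`, `KeywordList`); `extract_article_fields` is a hypothetical helper, and the example record is a plain-dict stand-in with made-up values rather than real PubMed data.

```python
def extract_article_fields(pubmed_article):
    """Collect title, abstract, MeSH terms and keywords from one parsed PubmedArticle entry."""
    citation = pubmed_article.get("MedlineCitation", {})
    article = citation.get("Article", {})

    # AbstractText is a list of sections; join them into one string
    abstract_parts = article.get("Abstract", {}).get("AbstractText", [])
    abstract = " ".join(str(part) for part in abstract_parts)

    # Each MeSH heading carries a DescriptorName element
    mesh_terms = [str(h["DescriptorName"]) for h in citation.get("MeshHeadingList", [])]

    # KeywordList is a list of keyword lists, so flatten it
    keywords = [str(kw) for kw_list in citation.get("KeywordList", []) for kw in kw_list]

    return {
        "Title": str(article.get("ArticleTitle", "")),
        "Abstract": abstract,
        "MeSH": mesh_terms,
        "Keywords": keywords,
    }

# Stand-in shaped like Entrez.read(handle)["PubmedArticle"][0]; all values are placeholders
example = {
    "MedlineCitation": {
        "Article": {
            "ArticleTitle": "Example title",
            "Abstract": {"AbstractText": ["First section.", "Second section."]},
        },
        "MeshHeadingList": [{"DescriptorName": "Kidney"}, {"DescriptorName": "Humans"}],
        "KeywordList": [["registration", "segmentation"]],
    }
}
fields = extract_article_fields(example)
print(fields)
```

Missing sections fall back to empty strings or lists, so records without an abstract or MeSH headings do not raise an error.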

In [5]:
def search_pubmed(query, num_results=5):
    Entrez.email = "richard.finney@torontomu.ca"

    # Search query in Pubmed database
    handle = Entrez.esearch(db="pubmed", term=query, retmax=num_results)
    record = Entrez.read(handle)
    handle.close()

    # Retrieve the list of PubMed IDs (PMID)
    pmids = record["IdList"]

    return pmids

In [6]:
list_of_Golden_PMIDs = []

for result in results:
    result['Golden_PMIDs'] = search_pubmed(result['Query'], num_results=5)
    print('--------------------------------------------------------------------------------------------------------------')
    print("Original Generated PMID:")
    print(result['PMID'])
    print('')
    print("Title Used to create Gold-Standard Query:")
    print(result['Title'])
    print('')
    print("ChatGPT-created Golden Standard Query based on Title of original PMID:")
    print(result['Query'])
    print('')
    print("Top 5 PMIDs retrieved from the Golden Standard Query:")
    print(result['Golden_PMIDs'])
    print('')

--------------------------------------------------------------------------------------------------------------
Original Generated PMID:
10649182

Title Used to create Gold-Standard Query:
Activity and expression of 15-hydroxyprostaglandin dehydrogenase in cultured chorionic trophoblast and villous trophoblast cells and in chorionic explants at term with and without spontaneous labor.

ChatGPT-created Golden Standard Query based on Title of original PMID:
("15-hydroxyprostaglandin dehydrogenase" OR "15-HPGD") AND ("chorionic trophoblast" OR "villous trophoblast" OR "chorionic explants") AND ("spontaneous labor" OR "term") AND ("activity" OR "expression")

Top 5 PMIDs retrieved from the Golden Standard Query:
['10649182', '10599732', '8829211']

--------------------------------------------------------------------------------------------------------------
Original Generated PMID:
24710831

Title Used to create Gold-Standard Query:
Segmentation-driven image registration- application to 4D 

Below, we only keep the Original PMIDs and corresponding Queries if they were able to retrieve 5 PMIDs:

In [7]:
cleansed_results=[]

for result in results:

    if len(result['Golden_PMIDs'])==5:
        cleansed_results.append({
            'PMID': result['PMID'],
            'Title': result['Title'],
            'Query': result['Query'],
            'Golden_PMIDs': result['Golden_PMIDs']

        })

Now we retrieve the Titles for each PMID retrieved from the Gold-Standard Queries, and append them to our working list:

In [8]:
for cleansed_result in cleansed_results:

    Golden_Titles=[]

    for Golden_PMID in cleansed_result['Golden_PMIDs']:

        handle = Entrez.efetch(db="pubmed", id=Golden_PMID, retmode="xml")
        record = Entrez.read(handle)
        articles = record['PubmedArticle']

        if articles:
            medline_citation = articles[0].get('MedlineCitation', {})
            article = medline_citation.get('Article', {})
            title = article.get('ArticleTitle', 'Title not available')

            Golden_Titles.append(title)

        else:
            Golden_Titles.append('')

    cleansed_result['Golden Titles'] = Golden_Titles

The final list, containing the original randomly generated PMID, the ChatGPT-generated Gold-Standard Query, and the 5 retrieved PMIDs with their corresponding Titles, is held in the list "cleansed_results" and saved to a JSON file below:

In [9]:
import json

# Specify the file path where you want to save the JSON file
json_file_path = 'cleansed_results.json'

# Write the list of dictionaries to a JSON file
with open(json_file_path, 'w') as json_file:
    json.dump(cleansed_results, json_file, indent=4)

print("JSON file created successfully.")

JSON file created successfully.


In [10]:
import json

def preprocess_intents_json(intents_file):
    with open(intents_file, "r") as f:
        data = json.load(f)

    preprocessed_data = []

    for entry in data:
        preprocessed_data.append(f"User: {entry['Golden Titles']}\n")
        preprocessed_data.append(f"Assistant: {entry['Query']}\n")

    return "".join(preprocessed_data)


def save_preprocessed_data(preprocessed_data, output_file):
    with open(output_file, "w", encoding="utf-8") as f:
        f.write(preprocessed_data)


intents_file = "cleansed_results.json"
output_file = "Golden_Query_Titles.txt"


preprocessed_data = preprocess_intents_json(intents_file)
save_preprocessed_data(preprocessed_data, output_file)
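Because `entry['Golden Titles']` is a list, the f-string in `preprocess_intents_json` writes its Python representation (brackets and quotes included) into the training text. One possible alternative, shown only as a sketch, joins the titles into plain text first. `format_training_example` is a hypothetical helper and the ` | ` separator is an assumption, not a format the models require.

```python
def format_training_example(titles, query, separator=" | "):
    """Build one User/Assistant pair with the titles joined as plain text."""
    # Skip empty titles (appended when a PMID returned no article record)
    joined = separator.join(t.strip() for t in titles if t)
    return f"User: {joined}\nAssistant: {query}\n"

# Placeholder titles and query, for illustration only
print(format_training_example(["First title.", "", "Second title."], '"a" AND "b"'))
```

Whichever format is chosen, the same one has to be used when prompting the fine-tuned models later.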

### **Part 2: Fine Tuning GPT Models**

**1) GPT2**

In [11]:
from transformers import pipeline, set_seed
from transformers import BioGptTokenizer, BioGptForCausalLM
from transformers import Trainer, TrainingArguments
from transformers import TextDataset, DataCollatorForLanguageModeling

In [12]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer, GPT2Config
from transformers import TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments

In [13]:
def fine_tune_gpt2(model_name, train_file, output_dir):
    # Load GPT-2 model and tokenizer
    model = GPT2LMHeadModel.from_pretrained(model_name)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    tokenizer = GPT2Tokenizer.from_pretrained(model_name)

    # Load training dataset
    train_dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=train_file,
        block_size=128)

    # Create data collator for language modeling
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=False)

    # Set training arguments
    training_args = TrainingArguments(
        output_dir=output_dir,
        overwrite_output_dir=True,
        num_train_epochs=5,
        per_device_train_batch_size=4,
        save_steps=10_000,
        save_total_limit=2,
    )

    # Train the model
    trainer = Trainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        train_dataset=train_dataset,
    )

    trainer.train()

    # Save the fine-tuned model
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)

In [14]:
# Fine-tune the model
fine_tune_gpt2("gpt2", "Golden_Query_Titles.txt", "D:\\GPT2")



**2) BioGPT**

In [18]:
from transformers import BioGptTokenizer, BioGptForCausalLM

def fine_tune_bioGPT(train_file, output_dir):
    # bioGPT model and tokenizer
    tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
    model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)

    # Load training dataset
    train_dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=train_file,
        block_size=128)

    # Create data collator for language modeling
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=False)

    # Set training arguments
    training_args = TrainingArguments(
        output_dir=output_dir,
        overwrite_output_dir=True,
        num_train_epochs=5,
        per_device_train_batch_size=1,
        save_steps=10_000,
        save_total_limit=2,
    )

    # Train the model
    trainer = Trainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        train_dataset=train_dataset,
    )

    trainer.train()

    # Save the fine-tuned model
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
    
fine_tune_bioGPT("Golden_Query_Titles.txt", 'D:\\bioGPT')

OutOfMemoryError: CUDA out of memory. Tried to allocate 166.00 MiB. GPU 0 has a total capacity of 8.00 GiB of which 0 bytes is free. Of the allocated memory 13.90 GiB is allocated by PyTorch, and 914.39 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

**3) bioGPT-Large**

In [15]:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="microsoft/BioGPT-Large")


In [16]:
from transformers import AutoTokenizer, AutoModelForCausalLM

def fine_tune_bioGPT_Large(train_file, output_dir):
    # bioGPT-Large model and tokenizer
    tokenizer = AutoTokenizer.from_pretrained("microsoft/BioGPT-Large")
    model = AutoModelForCausalLM.from_pretrained("microsoft/BioGPT-Large")
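    # Note: Trainer places the model on its own device (CUDA when available),
    # so moving the model to "cpu" here does not by itself keep training on the CPU
    device = "cpu"
    model.to(device)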

    # Load training dataset
    train_dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=train_file,
        block_size=128)

    # Create data collator for language modeling
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=False)

    # Set training arguments
    training_args = TrainingArguments(
        output_dir=output_dir,
        overwrite_output_dir=True,
        num_train_epochs=5,
        per_device_train_batch_size=4,
        save_steps=10_000,
        save_total_limit=2,
    )

    # Train the model
    trainer = Trainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        train_dataset=train_dataset,
    )

    trainer.train()

    # Save the fine-tuned model
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
    
fine_tune_bioGPT_Large("Golden_Query_Titles.txt", 'D:\\bioGPT-Large')

OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 8.00 GiB of which 0 bytes is free. Of the allocated memory 13.90 GiB is allocated by PyTorch, and 914.39 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
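Both out-of-memory errors report 13.90 GiB allocated by PyTorch on an 8 GiB GPU. This suggests, as an unverified assumption, that models loaded in earlier cells were still held in memory when training started, so restarting the kernel before each run is worth trying first. Beyond that, `TrainingArguments` has options that trade speed for memory. The sketch below only assembles the keyword arguments; the values are illustrative and have not been tested on this hardware.

```python
# Memory-oriented settings; each key is an existing TrainingArguments parameter,
# but the values are untested assumptions for an 8 GiB GPU
memory_saving_args = {
    "per_device_train_batch_size": 1,  # smallest per-step batch
    "gradient_accumulation_steps": 4,  # accumulate gradients over 4 steps before each update
    "gradient_checkpointing": True,    # recompute activations instead of storing them
}

effective_batch_size = (
    memory_saving_args["per_device_train_batch_size"]
    * memory_saving_args["gradient_accumulation_steps"]
)
print("Effective batch size:", effective_batch_size)

# Possible usage inside the fine-tuning functions (requires transformers):
# training_args = TrainingArguments(
#     output_dir=output_dir, num_train_epochs=5, **memory_saving_args
# )
```

With these values the effective batch size stays at 4, matching the GPT2 run. bioGPT-Large is much larger than the other two models, so full fine-tuning may still not fit in 8 GiB even with these settings. To train on the CPU instead, recent `transformers` releases accept `use_cpu=True` in `TrainingArguments` (older releases used `no_cuda=True`).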