# Weight Technique 2: Vectorization

Vectorization: Convert both the MedLM output and the RAG text into vectors using a suitable embedding model, so that the two texts can be compared numerically (for example, with cosine similarity).
Popular choices include:
1. Universal Sentence Encoder (USE): A pre-trained model from Google that generates sentence embeddings.
2. BERT: A powerful language model that can be fine-tuned for various tasks, including sentence embedding.
3. Word2Vec: A model that learns word embeddings based on their co-occurrence in text.

In [1]:
import json
from google.cloud import aiplatform
from google.cloud.aiplatform.gapic.schema import predict
from google.protobuf import json_format
from google.protobuf.struct_pb2 import Value
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from google.cloud import bigquery
from langchain_openai import ChatOpenAI
from langchain_experimental.sql import SQLDatabaseChain
from langchain.sql_database import SQLDatabase
import warnings
warnings.filterwarnings('ignore')

In [2]:
from transformers import AutoModel, AutoTokenizer
from sentence_transformers import SentenceTransformer
import torch



In [3]:
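import os

# Read the key from the environment instead of hard-coding it
# (assumes the OPENAI_API_KEY environment variable is set).
openai_api_key = os.environ["OPENAI_API_KEY"]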

In [4]:
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)




In [7]:
# Initialize BigQuery client
bigquery_client = bigquery.Client()

# Define the LLM (OpenAI GPT-4 32k via LangChain)
llm = ChatOpenAI(openai_api_key=openai_api_key, model="gpt-4-32k")

# Manually define the SQLDatabase
# Assume you have a BigQuery connection string or credentials file
connection_string = "bigquery://us-gcp-ame-con-5b680-sbx-1/mimic_iv_hosp_icu_dataset"

# Create the SQLDatabase instance
db = SQLDatabase.from_uri(connection_string)

# Create the SQLDatabaseChain
chain = SQLDatabaseChain(llm=llm, database=db)

# Define your natural language query
natural_language_query = "List of medication prescribed to patients after post op limit 1"

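# Generate SQL and execute.
# Note: chain.run() returns the chain's final natural-language answer (the SQL
# is generated and executed internally), so despite its name `sql_query`
# holds the answer text, not a SQL statement.
sql_query = chain.run(natural_language_query)
print("Generated SQL:", sql_query)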

Generated SQL: The medication prescribed to the patient with subject_id 15130648 and hadm_id 20919522 after post op was 'OxyCODONE (Immediate Rel 5mg TAB)'. The prescription started at 2175-12-29 18:00:00.


In [16]:
rag_text = sql_query  # RAG text: the answer returned by the SQL chain above
rag_tokens = tokenizer(rag_text, return_tensors="pt")
with torch.no_grad():  # inference only, no gradients needed
    rag_embeddings = model(**rag_tokens).last_hidden_state[:, 0, :]  # [CLS] token embedding


In [10]:
client_options = {"api_endpoint": "us-central1-aiplatform.googleapis.com"}

# Initialize the client used to create and send requests.
# It only needs to be created once and can be reused for multiple requests.
client = aiplatform.gapic.PredictionServiceClient(client_options=client_options)

instance_dict = {"content": "Summarize " + rag_text}
instance = json_format.ParseDict(instance_dict, Value())
instances = [instance]

parameters_dict = {
    "candidateCount": 1,
    "maxOutputTokens": 500,
    "temperature": 0.2,
    "topP": 0.8,
    "topK": 40,
}
parameters = json_format.ParseDict(parameters_dict, Value())

response = client.predict(
    endpoint="projects/us-gcp-ame-con-5b680-sbx-1/locations/us-central1/publishers/google/models/medlm-large",
    instances=instances,
    parameters=parameters,
)
print("response")

predictions = response.predictions
for prediction in predictions:
    print(" prediction:", dict(prediction))

response
 prediction: {'content': " Oxycodone is an opioid pain medication. It is used to relieve moderate to severe pain. Oxycodone can slow or stop your breathing, and may be habit-forming. Oxycodone may also cause severe constipation, which can lead to other serious problems. Do not use oxycodone if you have severe breathing problems, a blockage in your stomach or intestines, or if you have recently used alcohol, sedatives, tranquilizers, or other opioid medications. Do not use oxycodone if you are allergic to it or to other opioid medications, such as codeine, hydrocodone, or morphine. To use oxycodone safely, follow your doctor's instructions and the directions on the medication label. Do not take more or less oxycodone than prescribed, and do not take it more often than prescribed. Do not crush, chew, or dissolve the tablets. Swallow them whole. Oxycodone can cause side effects, including drowsiness, dizziness, nausea, vomiting, constipation, and headache. If you experience any o

In [11]:
medlm_output = ''' Oxycodone is an opioid pain medication. It is used to relieve moderate to severe pain.
Oxycodone can slow or stop your breathing, and may be habit-forming. Oxycodone may also cause severe constipation, which can lead to other serious problems. 
Do not use oxycodone if you have severe breathing problems, a blockage in your stomach or intestines, or if you have recently used alcohol, sedatives, tranquilizers, 
or other opioid medications. Do not use oxycodone if you are allergic to it or to other opioid medications, such as codeine, hydrocodone, or morphine.
To use oxycodone safely, follow your doctor's instructions and the directions on the medication label. 
Do not take more or less oxycodone than prescribed, and do not take it more often than prescribed. Do not crush, chew, or dissolve the tablets. Swallow them whole. 
Oxycodone can cause side effects, including drowsiness, dizziness, nausea, vomiting, constipation, and headache. If you experience any of these side effects, talk to your doctor. 
Do not drink alcohol while taking oxycodone. Alcohol can increase the risk of serious side effects, including overdose and death. Do not drive or operate machinery while taking oxycodone. 
Oxycodone can impair your thinking and judgment. If you are pregnant or breastfeeding, talk to your doctor before taking oxycodone. Oxycodone can pass into breast milk and may harm a nursing baby. 
If you are taking oxycodone for a long time, your doctor may recommend that you take a lower dose of the medication over time to reduce your risk of addiction and other side effects. 
Do not stop taking oxycodone suddenly without talking to your doctor. Stopping the medication suddenly can cause withdrawal symptoms, such as anxiety, sweating, nausea, and vomiting. 
If you have any questions or concerns about oxycodone, talk to your doctor.'''

medlm_tokens = tokenizer(medlm_output, return_tensors="pt")
medlm_embeddings = model(**medlm_tokens).last_hidden_state[:, 0, :]  # Get the first token's embedding


In [17]:
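# Recompute both embeddings with a sentence-embedding model. These replace the
# BERT [CLS] vectors computed above and are what the cosine similarity uses.
sentence_transformer = SentenceTransformer("all-mpnet-base-v2")
medlm_embeddings = sentence_transformer.encode(medlm_output)
rag_embeddings = sentence_transformer.encode(rag_text)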


In [18]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity([medlm_embeddings], [rag_embeddings])[0][0]
print("Cosine Similarity:", cosine_sim)


Cosine Similarity: 0.4793439
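
For reference, the score above is the cosine of the angle between the two embedding vectors: their dot product divided by the product of their lengths. The following is a minimal pure-Python sketch of that formula. The helper name `cosine` and the toy vectors are illustrative assumptions and are not part of the notebook, which uses scikit-learn for this step.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (||u|| * ||v||)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors standing in for real embeddings (hypothetical values).
print(cosine([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # same direction
print(cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # orthogonal directions
```

Because only the direction of the vectors matters, a long MedLM answer and a short RAG sentence can still score highly if they are about the same topic. A value marginally above 1 for identical texts, as reported for Question 3, is most likely floating-point rounding.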


In [19]:
from rouge import Rouge
from nltk.translate.bleu_score import sentence_bleu

# Calculate BLEU score
bleu_score = sentence_bleu([rag_text.split()], medlm_output.split())
print("BLEU Score:", bleu_score)

# Calculate ROUGE score
rouge = Rouge()
rouge_scores = rouge.get_scores(medlm_output, rag_text)
print("ROUGE Scores:", rouge_scores)


BLEU Score: 6.227395038190965e-232
ROUGE Scores: [{'rouge-1': {'r': 0.19230769230769232, 'p': 0.034013605442176874, 'f': 0.05780346565404803}, 'rouge-2': {'r': 0.0, 'p': 0.0, 'f': 0.0}, 'rouge-l': {'r': 0.15384615384615385, 'p': 0.027210884353741496, 'f': 0.04624277201242956}}]
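
A BLEU score on the order of 1e-232 is effectively zero. It most likely arises because `sentence_bleu` by default combines 1- to 4-gram precisions with equal weights and no smoothing, so the score collapses as soon as one n-gram order has no matches. The sketch below counts clipped n-gram matches per order to show where the overlap disappears. The helper names `ngrams` and `clipped_matches` and the two short strings are illustrative assumptions, not the notebook's data.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_matches(reference, hypothesis, n):
    """Return (matched, total) for hypothesis n-grams, clipping each match
    count at the number of times that n-gram occurs in the reference."""
    ref_counts = Counter(ngrams(reference, n))
    hyp_counts = Counter(ngrams(hypothesis, n))
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched, sum(hyp_counts.values())

# Illustrative strings (hypothetical, whitespace-tokenized like the cell above).
reference = "The most frequently prescribed prescription drug is Insulin.".split()
hypothesis = "Insulin is a hormone that is produced by the pancreas".split()

for n in range(1, 5):
    matched, total = clipped_matches(reference, hypothesis, n)
    print(f"{n}-gram matches: {matched}/{total}")
```

If a non-degenerate score is wanted for such short references, one option is to pass a `SmoothingFunction` method from `nltk.translate.bleu_score` through the `smoothing_function` argument of `sentence_bleu`. Whether a smoothed BLEU is meaningful for comparing a long summary against a one-sentence reference is a separate question.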


# No weights 

In [15]:
client_options = {"api_endpoint": "us-central1-aiplatform.googleapis.com"}

# Initialize the client used to create and send requests.
# It only needs to be created once and can be reused for multiple requests.
client = aiplatform.gapic.PredictionServiceClient(client_options=client_options)

instance_dict = {"content": "Summarize " + sql_query}
instance = json_format.ParseDict(instance_dict, Value())
instances = [instance]

parameters_dict = {
    "candidateCount": 1,
    "maxOutputTokens": 500,
    "temperature": 0.2,
    "topP": 0.8,
    "topK": 40,
}
parameters = json_format.ParseDict(parameters_dict, Value())

response = client.predict(
    endpoint="projects/us-gcp-ame-con-5b680-sbx-1/locations/us-central1/publishers/google/models/medlm-large",
    instances=instances,
    parameters=parameters,
)
print("response")

predictions = response.predictions
for prediction in predictions:
    print(" prediction:", dict(prediction))

response
 prediction: {'content': " Oxycodone is an opioid pain medication. It is used to relieve moderate to severe pain. Oxycodone can slow or stop your breathing, and may be habit-forming. Oxycodone may also cause severe constipation, which can lead to other serious problems. Do not use oxycodone if you have severe breathing problems, a blockage in your stomach or intestines, or if you have recently used alcohol, sedatives, tranquilizers, or other opioid medications. Do not use oxycodone if you are allergic to it or to other opioid medications, such as codeine, hydrocodone, or morphine. To use oxycodone safely, follow your doctor's instructions and the directions on the medication label. Do not take more or less oxycodone than prescribed, and do not take it more often than prescribed. Do not crush, chew, or dissolve the tablets. Swallow them whole. Oxycodone can cause side effects, including drowsiness, dizziness, nausea, vomiting, constipation, and headache. If you experience any o

In [20]:
from sentence_transformers import SentenceTransformer, util

# Load a pre-trained sentence embedding model
model = SentenceTransformer('all-mpnet-base-v2')

# MedLM output
medlm_output = ''' Oxycodone is an opioid pain medication. It is used to relieve moderate to severe pain. Oxycodone can slow or stop your breathing, and may be habit-forming. Oxycodone may also cause severe constipation, which can lead to other serious problems. Do not use oxycodone if you have severe breathing problems, a blockage in your stomach or intestines, or if you have recently used alcohol, sedatives, tranquilizers, or other opioid medications. Do not use oxycodone if you are allergic to it or to other opioid medications, such as codeine, hydrocodone, or morphine. To use oxycodone safely, follow your doctor's instructions and the directions on the medication label. Do not take more or less oxycodone than prescribed, and do not take it more often than prescribed. Do not crush, chew, or dissolve the tablets. Swallow them whole. Oxycodone can cause side effects, including drowsiness, dizziness, nausea, vomiting, constipation, and headache. If you experience any of these side effects, talk to your doctor. Do not drink alcohol while taking oxycodone. Alcohol can increase the risk of serious side effects, including overdose and death. Do not drive or operate machinery while taking oxycodone. Oxycodone can impair your thinking and judgment. If you are pregnant or breastfeeding, talk to your doctor before taking oxycodone. Oxycodone can pass into breast milk and may harm a nursing baby. If you are taking oxycodone for a long time, your doctor may recommend that you take a lower dose of the medication over time to reduce your risk of addiction and other side effects. Do not stop taking oxycodone suddenly without talking to your doctor. Stopping the medication suddenly can cause withdrawal symptoms, such as anxiety, sweating, nausea, and vomiting. If you have any questions or concerns about oxycodone, talk to your doctor.'''

# Your reference text
rag_output = '''The medication prescribed to the patient with subject_id 15130648 and hadm_id 20919522 after post op was 'OxyCODONE (Immediate Rel 5mg TAB)'. The prescription started at 2175-12-29 18:00:00'''

# Encode the sentences into vectors
medlm_embedding = model.encode(medlm_output)
reference_embedding = model.encode(rag_output)

# Calculate cosine similarity
# (named cosine_sim so it does not shadow sklearn's cosine_similarity function)
cosine_sim = util.cos_sim(medlm_embedding, reference_embedding)[0][0]

print(f"Cosine Similarity: {cosine_sim}")

Cosine Similarity: 0.4728148281574249


In [43]:
from rouge import Rouge
from nltk.translate.bleu_score import sentence_bleu

# Calculate BLEU score
bleu_score = sentence_bleu([rag_text.split()], medlm_output.split())
print("BLEU Score:", bleu_score)

# Calculate ROUGE score
rouge = Rouge()
rouge_scores = rouge.get_scores(medlm_output, rag_text)
print("ROUGE Scores:", rouge_scores)

BLEU Score: 9.947643045471176e-232
ROUGE Scores: [{'rouge-1': {'r': 0.45454545454545453, 'p': 0.15151515151515152, 'f': 0.22727272352272732}, 'rouge-2': {'r': 0.0, 'p': 0.0, 'f': 0.0}, 'rouge-l': {'r': 0.36363636363636365, 'p': 0.12121212121212122, 'f': 0.1818181780681819}}]


tokenizer : Splits the input text into subword tokens (WordPiece for BERT) and maps them to input IDs.

model : Passes the tokens through BERT and returns a hidden-state vector for every token.

last_hidden_state[:, 0, :] : Selects the first token's vector, which for BERT is the [CLS] token, as a single vector representing the whole input.

SentenceTransformer : Loads pre-trained models designed specifically to produce sentence embeddings. The cosine similarities reported in this notebook are computed from these embeddings (all-mpnet-base-v2), not from the BERT [CLS] vectors, which are overwritten before the comparison.

Important Notes:

BERT Model Choice: The choice of BERT model affects the quality of the embeddings. Experiment with different models to find the best one for your task. Note that bert-base-uncased accepts at most 512 tokens, so longer inputs must be truncated or split.

Sentence Embeddings: For longer texts, prefer sentence embeddings, which are trained to capture the overall meaning of a passage in one vector.

Cosine Similarity: Cosine similarity is a good measure of semantic closeness, but it says nothing about factual accuracy or completeness, so it should not be read as a measure of overall text quality.
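
To make the `[:, 0, :]` slice concrete, the sketch below contrasts it with mean pooling over non-padding tokens, a common alternative for turning token vectors into one sentence vector. Mean pooling is an assumption introduced here for comparison and is not something this notebook applies to BERT. Nested lists stand in for a `last_hidden_state` tensor of shape (batch, tokens, hidden), and the helper names `cls_pool` and `mean_pool` are hypothetical.

```python
# Toy stand-in for last_hidden_state with shape (batch=1, tokens=3, hidden=2).
# Values are hypothetical; a real BERT-base output has hidden size 768.
last_hidden_state = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]]
attention_mask = [[1, 1, 0]]  # the third position is padding

def cls_pool(hidden):
    """Equivalent of hidden[:, 0, :]: the first token vector of each sequence."""
    return [sequence[0] for sequence in hidden]

def mean_pool(hidden, mask):
    """Average the token vectors of each sequence, ignoring masked positions."""
    pooled = []
    for sequence, seq_mask in zip(hidden, mask):
        kept = [vec for vec, keep in zip(sequence, seq_mask) if keep]
        if not kept:
            raise ValueError("sequence has no unmasked tokens")
        dim = len(kept[0])
        pooled.append([sum(vec[d] for vec in kept) / len(kept) for d in range(dim)])
    return pooled

print(cls_pool(last_hidden_state))
print(mean_pool(last_hidden_state, attention_mask))
```

Either pooling yields one vector per input, which is the shape the cosine similarity step expects.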


# Question 1

In [5]:
sql_query = "The medication 'Papain 2.5 % Solution' was used to treat a patient with id 12749568, who was admitted on 2136-11-09"
rag_text = sql_query  # Replace with your actual RAG text
rag_tokens = tokenizer(rag_text, return_tensors="pt")
rag_embeddings = model(**rag_tokens).last_hidden_state[:, 0, :]  # Get the first token's embedding

In [6]:
client_options = {"api_endpoint": "us-central1-aiplatform.googleapis.com"}

# Initialize the client used to create and send requests.
# It only needs to be created once and can be reused for multiple requests.
client = aiplatform.gapic.PredictionServiceClient(client_options=client_options)

instance_dict = {"content": "Summarize " + rag_text}
instance = json_format.ParseDict(instance_dict, Value())
instances = [instance]

parameters_dict = {
    "candidateCount": 1,
    "maxOutputTokens": 500,
    "temperature": 0.2,
    "topP": 0.8,
    "topK": 40,
}
parameters = json_format.ParseDict(parameters_dict, Value())

response = client.predict(
    endpoint="projects/us-gcp-ame-con-5b680-sbx-1/locations/us-central1/publishers/google/models/medlm-large",
    instances=instances,
    parameters=parameters,
)
print("response")

predictions = response.predictions
for prediction in predictions:
    print(" prediction:", dict(prediction))

response
 prediction: {'citationMetadata': <proto.marshal.collections.maps.MapComposite object at 0x7f8b8ecf6bc0>, 'safetyAttributes': <proto.marshal.collections.maps.MapComposite object at 0x7f8b8ecf7c40>, 'content': ' Papain 2.5% solution is a medication that is used to treat a variety of conditions. It is a proteolytic enzyme, which means that it helps to break down proteins. Papain is derived from the papaya fruit, and it has been used for centuries to treat a variety of medical conditions. Papain 2.5% solution can be used to treat a variety of conditions, including: Wounds: Papain can help to clean and debride wounds, and it can also help to promote healing. Skin conditions: Papain can be used to treat a variety of skin conditions, such as eczema and psoriasis. Digestive problems: Papain can help to improve digestion and relieve symptoms of indigestion, such as heartburn and bloating. Inflammation: Papain can help to reduce inflammation in the body. Papain 2.5% solution is a safe 

In [7]:
medlm_output = '''Papain 2.5% solution is a medication that is used to treat a variety of conditions. It is a proteolytic enzyme, which means that it helps to break down proteins. Papain is derived from the papaya fruit, and it has been used for centuries to treat a variety of medical conditions. Papain 2.5% solution can be used to treat a variety of conditions, including: Wounds: Papain can help to clean and debride wounds, and it can also help to promote healing. Skin conditions: Papain can be used to treat a variety of skin conditions, such as eczema and psoriasis. Digestive problems: Papain can help to improve digestion and relieve symptoms of indigestion, such as heartburn and bloating. Inflammation: Papain can help to reduce inflammation in the body. Papain 2.5% solution is a safe and effective medication that can be used to treat a variety of conditions. It is important to follow the directions on the medication label and to talk to your doctor or pharmacist if you have any questions about the medication.'''

medlm_tokens = tokenizer(medlm_output, return_tensors="pt")
medlm_embeddings = model(**medlm_tokens).last_hidden_state[:, 0, :]  # Get the first token's embedding


In [8]:
sentence_transformer = SentenceTransformer("all-mpnet-base-v2")
medlm_embeddings = sentence_transformer.encode(medlm_output)
rag_embeddings = sentence_transformer.encode(rag_text)


In [9]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity([medlm_embeddings], [rag_embeddings])[0][0]
print("Cosine Similarity:", cosine_sim)

Cosine Similarity: 0.66296136


In [10]:
from rouge import Rouge
from nltk.translate.bleu_score import sentence_bleu

# Calculate BLEU score
bleu_score = sentence_bleu([rag_text.split()], medlm_output.split())
print("BLEU Score:", bleu_score)

# Calculate ROUGE score
rouge = Rouge()
rouge_scores = rouge.get_scores(medlm_output, rag_text)
print("ROUGE Scores:", rouge_scores)


BLEU Score: 0.014366817644349382
ROUGE Scores: [{'rouge-1': {'r': 0.35, 'p': 0.08139534883720931, 'f': 0.13207546863652553}, 'rouge-2': {'r': 0.15, 'p': 0.023622047244094488, 'f': 0.040816324179740064}, 'rouge-l': {'r': 0.35, 'p': 0.08139534883720931, 'f': 0.13207546863652553}}]


# Question 2

In [11]:
sql_query = "The most frequently prescribed prescription drug is Insulin."
rag_text = sql_query  # Replace with your actual RAG text
rag_tokens = tokenizer(rag_text, return_tensors="pt")
rag_embeddings = model(**rag_tokens).last_hidden_state[:, 0, :]  # Get the first token's embedding

In [12]:
client_options = {"api_endpoint": "us-central1-aiplatform.googleapis.com"}

# Initialize the client used to create and send requests.
# It only needs to be created once and can be reused for multiple requests.
client = aiplatform.gapic.PredictionServiceClient(client_options=client_options)

instance_dict = {"content": "Summarize " + rag_text}
instance = json_format.ParseDict(instance_dict, Value())
instances = [instance]

parameters_dict = {
    "candidateCount": 1,
    "maxOutputTokens": 500,
    "temperature": 0.2,
    "topP": 0.8,
    "topK": 40,
}
parameters = json_format.ParseDict(parameters_dict, Value())

response = client.predict(
    endpoint="projects/us-gcp-ame-con-5b680-sbx-1/locations/us-central1/publishers/google/models/medlm-large",
    instances=instances,
    parameters=parameters,
)
print("response")

predictions = response.predictions
for prediction in predictions:
    print(" prediction:", dict(prediction))

response
 prediction: {'citationMetadata': <proto.marshal.collections.maps.MapComposite object at 0x7f8ce0128ee0>, 'content': " Insulin is a hormone that is produced by the pancreas and is essential for the body's use of glucose (sugar) for energy. It is used to treat diabetes, a condition in which the body does not produce enough insulin or does not use insulin effectively. Insulin is available in several different forms, including injectable, oral, and inhaled forms. The type and dose of insulin that is prescribed will depend on the individual's needs.", 'safetyAttributes': <proto.marshal.collections.maps.MapComposite object at 0x7f8ce0129900>}


In [13]:
medlm_output = "Insulin is a hormone that is produced by the pancreas and is essential for the body's use of glucose (sugar) for energy. It is used to treat diabetes, a condition in which the body does not produce enough insulin or does not use insulin effectively. Insulin is available in several different forms, including injectable, oral, and inhaled forms. The type and dose of insulin that is prescribed will depend on the individual's needs."
medlm_tokens = tokenizer(medlm_output, return_tensors="pt")
medlm_embeddings = model(**medlm_tokens).last_hidden_state[:, 0, :]  # Get the first token's embedding


In [14]:
sentence_transformer = SentenceTransformer("all-mpnet-base-v2")
medlm_embeddings = sentence_transformer.encode(medlm_output)
rag_embeddings = sentence_transformer.encode(rag_text)


In [15]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity([medlm_embeddings], [rag_embeddings])[0][0]
print("Cosine Similarity:", cosine_sim)

Cosine Similarity: 0.5294906


In [16]:
from rouge import Rouge
from nltk.translate.bleu_score import sentence_bleu

# Calculate BLEU score
bleu_score = sentence_bleu([rag_text.split()], medlm_output.split())
print("BLEU Score:", bleu_score)

# Calculate ROUGE score
rouge = Rouge()
rouge_scores = rouge.get_scores(medlm_output, rag_text)
print("ROUGE Scores:", rouge_scores)

BLEU Score: 8.202720662090001e-232
ROUGE Scores: [{'rouge-1': {'r': 0.5, 'p': 0.07692307692307693, 'f': 0.13333333102222228}, 'rouge-2': {'r': 0.0, 'p': 0.0, 'f': 0.0}, 'rouge-l': {'r': 0.375, 'p': 0.057692307692307696, 'f': 0.09999999768888894}}]
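
The `r`, `p`, and `f` values above are recall, precision, and F-score. Since the MedLM output is passed as the hypothesis and the RAG text as the reference, recall is the share of reference tokens found in the MedLM output and precision is the share of MedLM tokens found in the reference. The sketch below shows that arithmetic for unigrams. It assumes unique whitespace-separated tokens, which may differ from the `rouge` package's own tokenization, and the function name `rouge1` and example strings are hypothetical.

```python
def rouge1(hypothesis, reference):
    """Unigram overlap on unique whitespace tokens: recall, precision, F1."""
    hyp = set(hypothesis.split())
    ref = set(reference.split())
    overlap = len(hyp & ref)
    recall = overlap / len(ref) if ref else 0.0
    precision = overlap / len(hyp) if hyp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"r": recall, "p": precision, "f": f1}

# Hypothetical short strings, not the notebook's data.
scores = rouge1("insulin is used to treat diabetes", "insulin is a hormone")
print(scores)
```

A pattern of moderate recall with very low precision, as in several of the results here, is what one would expect when a long generated answer is scored against a one-sentence reference.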


# Question 3

In [17]:
sql_query = "The most frequent order types in provider order entries are: 1. Medications with 17690066 entries 2. Lab with 6565578 entries 3. General Care with 5690221 entries 4. ADT orders with 2171456 entries 5. IV therapy with 2140236 entries"
rag_text = sql_query  # Replace with your actual RAG text
rag_tokens = tokenizer(rag_text, return_tensors="pt")
rag_embeddings = model(**rag_tokens).last_hidden_state[:, 0, :]  # Get the first token's embedding

In [18]:
client_options = {"api_endpoint": "us-central1-aiplatform.googleapis.com"}

# Initialize the client used to create and send requests.
# It only needs to be created once and can be reused for multiple requests.
client = aiplatform.gapic.PredictionServiceClient(client_options=client_options)

instance_dict = {"content": "Summarize " + rag_text}
instance = json_format.ParseDict(instance_dict, Value())
instances = [instance]

parameters_dict = {
    "candidateCount": 1,
    "maxOutputTokens": 500,
    "temperature": 0.2,
    "topP": 0.8,
    "topK": 40,
}
parameters = json_format.ParseDict(parameters_dict, Value())

response = client.predict(
    endpoint="projects/us-gcp-ame-con-5b680-sbx-1/locations/us-central1/publishers/google/models/medlm-large",
    instances=instances,
    parameters=parameters,
)
print("response")

predictions = response.predictions
for prediction in predictions:
    print(" prediction:", dict(prediction))

response
 prediction: {'citationMetadata': <proto.marshal.collections.maps.MapComposite object at 0x7f8b8ed4f730>, 'safetyAttributes': <proto.marshal.collections.maps.MapComposite object at 0x7f8b8ed4eec0>, 'content': ' The most frequent order types in provider order entries are: \n1. Medications with 17690066 entries \n2. Lab with 6565578 entries \n3. General Care with 5690221 entries \n4. ADT orders with 2171456 entries \n5. IV therapy with 2140236 entries'}


In [19]:
medlm_output = "The most frequent order types in provider order entries are: 1. Medications with 17690066 entries 2. Lab with 6565578 entries 3. General Care with 5690221 entries 4. ADT orders with 2171456 entries 5. IV therapy with 2140236 entries"
medlm_tokens = tokenizer(medlm_output, return_tensors="pt")
medlm_embeddings = model(**medlm_tokens).last_hidden_state[:, 0, :]  # Get the first token's embedding


In [20]:
sentence_transformer = SentenceTransformer("all-mpnet-base-v2")
medlm_embeddings = sentence_transformer.encode(medlm_output)
rag_embeddings = sentence_transformer.encode(rag_text)


In [21]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity([medlm_embeddings], [rag_embeddings])[0][0]
print("Cosine Similarity:", cosine_sim)

Cosine Similarity: 1.0000002


In [22]:
from rouge import Rouge
from nltk.translate.bleu_score import sentence_bleu

# Calculate BLEU score
bleu_score = sentence_bleu([rag_text.split()], medlm_output.split())
print("BLEU Score:", bleu_score)

# Calculate ROUGE score
rouge = Rouge()
rouge_scores = rouge.get_scores(medlm_output, rag_text)
print("ROUGE Scores:", rouge_scores)

BLEU Score: 1.0
ROUGE Scores: [{'rouge-1': {'r': 1.0, 'p': 1.0, 'f': 0.999999995}, 'rouge-2': {'r': 1.0, 'p': 1.0, 'f': 0.999999995}, 'rouge-l': {'r': 1.0, 'p': 1.0, 'f': 0.999999995}}]


# Question 4

In [23]:
sql_query = "The unique names of the tests given to patients who were admitted and the doctor comments VRE isolated are 'R/O VANCOMYCIN RESISTANT ENTEROCOCCUS', 'WOUND CULTURE', and 'FECAL CULTURE - R/O VIBRIO'"
rag_text = sql_query  # Replace with your actual RAG text
rag_tokens = tokenizer(rag_text, return_tensors="pt")
rag_embeddings = model(**rag_tokens).last_hidden_state[:, 0, :]  # Get the first token's embedding

In [24]:
client_options = {"api_endpoint": "us-central1-aiplatform.googleapis.com"}

# Initialize the client used to create and send requests.
# It only needs to be created once and can be reused for multiple requests.
client = aiplatform.gapic.PredictionServiceClient(client_options=client_options)

instance_dict = {"content": "Summarize " + rag_text}
instance = json_format.ParseDict(instance_dict, Value())
instances = [instance]

parameters_dict = {
    "candidateCount": 1,
    "maxOutputTokens": 500,
    "temperature": 0.2,
    "topP": 0.8,
    "topK": 40,
}
parameters = json_format.ParseDict(parameters_dict, Value())

response = client.predict(
    endpoint="projects/us-gcp-ame-con-5b680-sbx-1/locations/us-central1/publishers/google/models/medlm-large",
    instances=instances,
    parameters=parameters,
)
print("response")

predictions = response.predictions
for prediction in predictions:
    print(" prediction:", dict(prediction))

response
 prediction: {'content': " The doctor's comments indicate that the patient was admitted to the hospital and that the patient was isolated due to a vancomycin-resistant enterococcus (VRE) infection. VRE is a type of bacteria that is resistant to the antibiotic vancomycin. VRE infections can be serious and can cause a variety of symptoms, depending on the location of the infection. Treatment for VRE infections typically involves the use of antibiotics that are effective against VRE. The specific antibiotics that are used will depend on the individual patient and the severity of the infection. In some cases, surgery may be necessary to remove infected tissue.", 'citationMetadata': <proto.marshal.collections.maps.MapComposite object at 0x7f8b8ed4f5e0>, 'safetyAttributes': <proto.marshal.collections.maps.MapComposite object at 0x7f8b8ed4fdf0>}


In [25]:
medlm_output = "The doctor's comments indicate that the patient was admitted to the hospital and that the patient was isolated due to a vancomycin-resistant enterococcus (VRE) infection. VRE is a type of bacteria that is resistant to the antibiotic vancomycin. VRE infections can be serious and can cause a variety of symptoms, depending on the location of the infection. Treatment for VRE infections typically involves the use of antibiotics that are effective against VRE. The specific antibiotics that are used will depend on the individual patient and the severity of the infection. In some cases, surgery may be necessary to remove infected tissue."
medlm_tokens = tokenizer(medlm_output, return_tensors="pt")
medlm_embeddings = model(**medlm_tokens).last_hidden_state[:, 0, :]  # Get the first token's embedding

In [26]:
sentence_transformer = SentenceTransformer("all-mpnet-base-v2")
medlm_embeddings = sentence_transformer.encode(medlm_output)
rag_embeddings = sentence_transformer.encode(rag_text)

In [27]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity([medlm_embeddings], [rag_embeddings])[0][0]
print("Cosine Similarity:", cosine_sim)

Cosine Similarity: 0.7876705


In [28]:
from rouge import Rouge
from nltk.translate.bleu_score import sentence_bleu

# Calculate BLEU score
bleu_score = sentence_bleu([rag_text.split()], medlm_output.split())
print("BLEU Score:", bleu_score)

# Calculate ROUGE score
rouge = Rouge()
rouge_scores = rouge.get_scores(medlm_output, rag_text)
print("ROUGE Scores:", rouge_scores)

BLEU Score: 3.293401528213429e-155
ROUGE Scores: [{'rouge-1': {'r': 0.3448275862068966, 'p': 0.16393442622950818, 'f': 0.22222221785432106}, 'rouge-2': {'r': 0.06666666666666667, 'p': 0.022222222222222223, 'f': 0.03333332958333376}, 'rouge-l': {'r': 0.3103448275862069, 'p': 0.14754098360655737, 'f': 0.1999999956320989}}]


# Question 5

In [29]:
sql_query = "The radiology findings of a patient who received ORTHOPAEDICS service in discharge text are: EXAMINATION:  CHEST (PORTABLE AP) INDICATION:  ___ year old woman with urosepsis with fever rule out pneumonia TECHNIQUE:  CHEST (PORTABLE AP) COMPARISON:  ___ IMPRESSION:Left internal jugular line tip is at the level of mid SVC.  Cardiomegaly is substantial, unchanged."
rag_text = sql_query  # Replace with your actual RAG text
rag_tokens = tokenizer(rag_text, return_tensors="pt")
rag_embeddings = model(**rag_tokens).last_hidden_state[:, 0, :]  # Get the first token's embedding

In [30]:
client_options = {"api_endpoint": "us-central1-aiplatform.googleapis.com"}

# Initialize the client used to create and send requests.
# It only needs to be created once and can be reused for multiple requests.
client = aiplatform.gapic.PredictionServiceClient(client_options=client_options)

instance_dict = {"content": "Summarize " + rag_text}
instance = json_format.ParseDict(instance_dict, Value())
instances = [instance]

parameters_dict = {
    "candidateCount": 1,
    "maxOutputTokens": 500,
    "temperature": 0.2,
    "topP": 0.8,
    "topK": 40,
}
parameters = json_format.ParseDict(parameters_dict, Value())

response = client.predict(
    endpoint="projects/us-gcp-ame-con-5b680-sbx-1/locations/us-central1/publishers/google/models/medlm-large",
    instances=instances,
    parameters=parameters,
)
print("response")

predictions = response.predictions
for prediction in predictions:
    print(" prediction:", dict(prediction))

response
 prediction: {'citationMetadata': <proto.marshal.collections.maps.MapComposite object at 0x7f8b8c16db40>, 'safetyAttributes': <proto.marshal.collections.maps.MapComposite object at 0x7f8b8c16eb00>, 'content': ' There is no focal infiltrate, effusion or pneumothorax. The left internal jugular line tip is at the level of mid SVC, which is an appropriate position. The heart is enlarged, but this is unchanged from previous studies. There is no evidence of pneumonia, effusion, or pneumothorax.'}


In [31]:
medlm_output = "There is no focal infiltrate, effusion or pneumothorax. The left internal jugular line tip is at the level of mid SVC, which is an appropriate position. The heart is enlarged, but this is unchanged from previous studies. There is no evidence of pneumonia, effusion, or pneumothorax"
medlm_tokens = tokenizer(medlm_output, return_tensors="pt")
medlm_embeddings = model(**medlm_tokens).last_hidden_state[:, 0, :]  # BERT [CLS] (first-token) embedding; overwritten by the SentenceTransformer encoding in the next cell

In [32]:
sentence_transformer = SentenceTransformer("all-mpnet-base-v2")
medlm_embeddings = sentence_transformer.encode(medlm_output)
rag_embeddings = sentence_transformer.encode(rag_text)

In [33]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity([medlm_embeddings], [rag_embeddings])[0][0]
print("Cosine Similarity:", cosine_sim)

Cosine Similarity: 0.6625615
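
For reference, cosine similarity is the dot product of the two embedding vectors divided by the product of their lengths, so it depends only on direction and not on magnitude. A minimal pure-Python sketch of that formula follows. The `cosine` helper and the toy vectors are illustrative only; the notebook itself uses scikit-learn's `cosine_similarity` on the SentenceTransformer embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length numeric sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal directions
```

Vectors pointing the same way score 1.0 and orthogonal vectors score 0.0, which is the scale on which the values reported in this notebook should be read.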


In [34]:
from rouge import Rouge
from nltk.translate.bleu_score import sentence_bleu

# Calculate BLEU score
bleu_score = sentence_bleu([rag_text.split()], medlm_output.split())
print("BLEU Score:", bleu_score)

# Calculate ROUGE score
rouge = Rouge()
rouge_scores = rouge.get_scores(medlm_output, rag_text)
print("ROUGE Scores:", rouge_scores)

BLEU Score: 0.177510274862403
ROUGE Scores: [{'rouge-1': {'r': 0.26666666666666666, 'p': 0.34285714285714286, 'f': 0.299999995078125}, 'rouge-2': {'r': 0.1836734693877551, 'p': 0.21428571428571427, 'f': 0.19780219283178374}, 'rouge-l': {'r': 0.26666666666666666, 'p': 0.34285714285714286, 'f': 0.299999995078125}}]
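
To make the `r`, `p`, and `f` fields easier to read: ROUGE-1 recall is the share of reference unigrams that appear in the candidate, precision is the share of candidate unigrams that appear in the reference, and `f` is their harmonic mean. The sketch below is a simplified illustration using lowercase whitespace tokens and clipped counts. It is not the `rouge` package's implementation, so its numbers may differ from the output above, and `rouge1` is a hypothetical helper name.

```python
from collections import Counter


def rouge1(candidate, reference):
    """Simplified unigram-overlap recall, precision, and F1."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipped overlap).
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"r": recall, "p": precision, "f": f1}


scores = rouge1("the heart is enlarged", "the heart is enlarged but unchanged")
print(scores)
```

In the notebook, the candidate is `medlm_output` and the reference is `rag_text`, so a short summary of a long RAG text tends to show higher precision than recall, and a long elaboration of a short RAG text shows the reverse.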


# Question 6

In [35]:
sql_query = "The radiology findings for the patient whose Chief Complaint was Abdominal pain in the discharge text were: EXAMINATION:  CT ABD AND PELVIS WITH CONTRAST INDICATION:  ___ with history of feeding tube, purulent drainage from wound site. Difficult venous access.NO_PO contrast Abscess?  G-tube placement? TECHNIQUE:  Single phase contrast: MDCT axial images were acquired through the abdomen and pelvis"
rag_text = sql_query  # RAG text that the MedLM summary is compared against
rag_tokens = tokenizer(rag_text, return_tensors="pt")
rag_embeddings = model(**rag_tokens).last_hidden_state[:, 0, :]  # BERT [CLS] (first-token) embedding; overwritten by the SentenceTransformer encoding below

In [36]:
client_options = {"api_endpoint": "us-central1-aiplatform.googleapis.com"}

# Initialize the client that will be used to create and send requests.
# This client only needs to be created once and can be reused for multiple requests.
client = aiplatform.gapic.PredictionServiceClient(client_options=client_options)

instance_dict = {"content": "Summarize " + rag_text}
instance = json_format.ParseDict(instance_dict, Value())
instances = [instance]

parameters_dict = {
    "candidateCount": 1,
    "maxOutputTokens": 500,
    "temperature": 0.2,
    "topP": 0.8,
    "topK": 40,
}
parameters = json_format.ParseDict(parameters_dict, Value())

response = client.predict(
    endpoint="projects/us-gcp-ame-con-5b680-sbx-1/locations/us-central1/publishers/google/models/medlm-large",
    instances=instances,
    parameters=parameters,
)

print("response")
predictions = response.predictions
for prediction in predictions:
    print(" prediction:", dict(prediction))

response
 prediction: {'citationMetadata': <proto.marshal.collections.maps.MapComposite object at 0x7f8b7fe0b520>, 'safetyAttributes': <proto.marshal.collections.maps.MapComposite object at 0x7f8b7fe08d90>, 'content': ' following the administration There is a small amount of free fluid in the right lower quadrant. There is also a small amount of free fluid in the pelvis. There is no evidence of an abscess. The feeding tube is in good position.'}


In [37]:
medlm_output = "following the administration There is a small amount of free fluid in the right lower quadrant. There is also a small amount of free fluid in the pelvis. There is no evidence of an abscess. The feeding tube is in good position."
medlm_tokens = tokenizer(medlm_output, return_tensors="pt")
medlm_embeddings = model(**medlm_tokens).last_hidden_state[:, 0, :]  # BERT [CLS] (first-token) embedding; overwritten by the SentenceTransformer encoding in the next cell

In [38]:
sentence_transformer = SentenceTransformer("all-mpnet-base-v2")
medlm_embeddings = sentence_transformer.encode(medlm_output)
rag_embeddings = sentence_transformer.encode(rag_text)

In [39]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity([medlm_embeddings], [rag_embeddings])[0][0]
print("Cosine Similarity:", cosine_sim)

Cosine Similarity: 0.5889751


In [40]:
from rouge import Rouge
from nltk.translate.bleu_score import sentence_bleu

# Calculate BLEU score
bleu_score = sentence_bleu([rag_text.split()], medlm_output.split())
print("BLEU Score:", bleu_score)

# Calculate ROUGE score
rouge = Rouge()
rouge_scores = rouge.get_scores(medlm_output, rag_text)
print("ROUGE Scores:", rouge_scores)

BLEU Score: 2.6353194467779707e-155
ROUGE Scores: [{'rouge-1': {'r': 0.10714285714285714, 'p': 0.23076923076923078, 'f': 0.14634145908387877}, 'rouge-2': {'r': 0.017543859649122806, 'p': 0.03125, 'f': 0.022471905506881388}, 'rouge-l': {'r': 0.10714285714285714, 'p': 0.23076923076923078, 'f': 0.14634145908387877}}]
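
The BLEU score above (about 2.6e-155) is effectively zero. With the default settings, `sentence_bleu` combines 1-gram to 4-gram precisions as a geometric mean, so a single n-gram order with no matches drives the whole score toward zero even when some words overlap. The sketch below computes clipped n-gram precisions in plain Python on a made-up sentence pair to show the effect. `ngram_precision` is a hypothetical helper, not NLTK's code.

```python
from collections import Counter


def ngram_precision(candidate_tokens, reference_tokens, n):
    """Clipped n-gram precision of the candidate against a single reference."""
    cand = Counter(
        tuple(candidate_tokens[i:i + n]) for i in range(len(candidate_tokens) - n + 1)
    )
    ref = Counter(
        tuple(reference_tokens[i:i + n]) for i in range(len(reference_tokens) - n + 1)
    )
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum((cand & ref).values())
    return matched / total


# Made-up example sentences, not taken from the notebook's data.
reference = "no evidence of an abscess in the pelvis".split()
candidate = "there is no evidence of abscess".split()
for n in range(1, 5):
    print(n, ngram_precision(candidate, reference, n))
```

Here the 4-gram precision is zero although the lower orders are not. One option, if a non-degenerate score is wanted for such pairs, is to pass one of NLTK's `SmoothingFunction` methods to `sentence_bleu` through its `smoothing_function` argument; the scores reported in this notebook were computed without smoothing.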


# Question 7

In [41]:
sql_query = "The patient with note_id '17829563-RR-2' had a CT HEAD W/O CONTRAST examination where they found 'Routine unenhanced head CT was performed and viewed in brain, intermediate and bone windows.  Coronal and sagittal reformats were also performed. DOSE:  Total DLP (Head) = 903 mGy-cm. COMPARISON: ...'. The patient with note_id '14707889-RR-3' was indicated with 'History of hep C without treatment.  Presenting with abdominal distension.  Question cirrhosis. COMPARISON:  None.FINDINGS:  The liver is shrunken and nodular in appearance consistent with\ncirrhosis.  There is no evidence of intrahepatic biliary duct dilatation."
rag_text = sql_query  # RAG text that the MedLM summary is compared against
rag_tokens = tokenizer(rag_text, return_tensors="pt")
rag_embeddings = model(**rag_tokens).last_hidden_state[:, 0, :]  # BERT [CLS] (first-token) embedding; overwritten by the SentenceTransformer encoding below

In [42]:
client_options = {"api_endpoint": "us-central1-aiplatform.googleapis.com"}

# Initialize the client that will be used to create and send requests.
# This client only needs to be created once and can be reused for multiple requests.
client = aiplatform.gapic.PredictionServiceClient(client_options=client_options)

instance_dict = {"content": "Summarize " + rag_text}
instance = json_format.ParseDict(instance_dict, Value())
instances = [instance]

parameters_dict = {
    "candidateCount": 1,
    "maxOutputTokens": 500,
    "temperature": 0.2,
    "topP": 0.8,
    "topK": 40,
}
parameters = json_format.ParseDict(parameters_dict, Value())

response = client.predict(
    endpoint="projects/us-gcp-ame-con-5b680-sbx-1/locations/us-central1/publishers/google/models/medlm-large",
    instances=instances,
    parameters=parameters,
)

print("response")
predictions = response.predictions
for prediction in predictions:
    print(" prediction:", dict(prediction))

response
 prediction: {'safetyAttributes': <proto.marshal.collections.maps.MapComposite object at 0x7f8b8cb2ba00>, 'content': " The patient with note_id '17829563-RR-2' had a CT HEAD W/O CONTRAST examination. The report indicates that the patient has a normal head CT.\n\nThe patient with note_id '14707889-RR-3' had an ultrasound of the abdomen. The report indicates that the patient has cirrhosis of the liver.", 'citationMetadata': <proto.marshal.collections.maps.MapComposite object at 0x7f8b7fe0bac0>}


In [43]:
medlm_output = "The patient with note_id '17829563-RR-2' had a CT HEAD W/O CONTRAST examination. The report indicates that the patient has a normal head CT. The patient with note_id '14707889-RR-3' had an ultrasound of the abdomen. The report indicates that the patient has cirrhosis of the liver."
medlm_tokens = tokenizer(medlm_output, return_tensors="pt")
medlm_embeddings = model(**medlm_tokens).last_hidden_state[:, 0, :]  # BERT [CLS] (first-token) embedding; overwritten by the SentenceTransformer encoding in the next cell

In [44]:
sentence_transformer = SentenceTransformer("all-mpnet-base-v2")
medlm_embeddings = sentence_transformer.encode(medlm_output)
rag_embeddings = sentence_transformer.encode(rag_text)

In [45]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity([medlm_embeddings], [rag_embeddings])[0][0]
print("Cosine Similarity:", cosine_sim)

Cosine Similarity: 0.868711


In [46]:
from rouge import Rouge
from nltk.translate.bleu_score import sentence_bleu

# Calculate BLEU score
bleu_score = sentence_bleu([rag_text.split()], medlm_output.split())
print("BLEU Score:", bleu_score)

# Calculate ROUGE score
rouge = Rouge()
rouge_scores = rouge.get_scores(medlm_output, rag_text)
print("ROUGE Scores:", rouge_scores)

BLEU Score: 0.12243273996930239
ROUGE Scores: [{'rouge-1': {'r': 0.2463768115942029, 'p': 0.6538461538461539, 'f': 0.35789473286648205}, 'rouge-2': {'r': 0.15476190476190477, 'p': 0.38235294117647056, 'f': 0.22033897894857804}, 'rouge-l': {'r': 0.2318840579710145, 'p': 0.6153846153846154, 'f': 0.3368421012875346}}]


# Question 8

In [47]:
sql_query = "The radiology findings of a patient with a past medical history of coronary artery disease are as follows: There is a mild homogeneous plaque in the proximal internal carotid artery without significant increase in peak systolic velocities. The peak systolic velocity in the common carotid artery is 86 cm/sec."
rag_text = sql_query  # RAG text that the MedLM summary is compared against
rag_tokens = tokenizer(rag_text, return_tensors="pt")
rag_embeddings = model(**rag_tokens).last_hidden_state[:, 0, :]  # BERT [CLS] (first-token) embedding; overwritten by the SentenceTransformer encoding below

In [48]:
client_options = {"api_endpoint": "us-central1-aiplatform.googleapis.com"}

# Initialize the client that will be used to create and send requests.
# This client only needs to be created once and can be reused for multiple requests.
client = aiplatform.gapic.PredictionServiceClient(client_options=client_options)

instance_dict = {"content": "Summarize " + rag_text}
instance = json_format.ParseDict(instance_dict, Value())
instances = [instance]

parameters_dict = {
    "candidateCount": 1,
    "maxOutputTokens": 500,
    "temperature": 0.2,
    "topP": 0.8,
    "topK": 40,
}
parameters = json_format.ParseDict(parameters_dict, Value())

response = client.predict(
    endpoint="projects/us-gcp-ame-con-5b680-sbx-1/locations/us-central1/publishers/google/models/medlm-large",
    instances=instances,
    parameters=parameters,
)

print("response")
predictions = response.predictions
for prediction in predictions:
    print(" prediction:", dict(prediction))

response
 prediction: {'content': " The patient has a mild homogeneous plaque in the proximal internal carotid artery, which is the artery that supplies blood to the brain. This plaque does not appear to be causing any significant obstruction to blood flow, as the peak systolic velocity in the common carotid artery is normal. However, the patient's history of coronary artery disease suggests that they may be at increased risk of developing future cardiovascular events, such as a stroke or heart attack. It is important for the patient to continue to follow up with their doctor and manage their risk factors for cardiovascular disease, such as high blood pressure, high cholesterol, and diabetes.", 'citationMetadata': <proto.marshal.collections.maps.MapComposite object at 0x7f8b8c16ebf0>, 'safetyAttributes': <proto.marshal.collections.maps.MapComposite object at 0x7f8b7fe0a470>}


In [49]:
medlm_output = "The patient has a mild homogeneous plaque in the proximal internal carotid artery, which is the artery that supplies blood to the brain. This plaque does not appear to be causing any significant obstruction to blood flow, as the peak systolic velocity in the common carotid artery is normal. However, the patient's history of coronary artery disease suggests that they may be at increased risk of developing future cardiovascular events, such as a stroke or heart attack. It is important for the patient to continue to follow up with their doctor and manage their risk factors for cardiovascular disease, such as high blood pressure, high cholesterol, and diabetes."
medlm_tokens = tokenizer(medlm_output, return_tensors="pt")
medlm_embeddings = model(**medlm_tokens).last_hidden_state[:, 0, :]  # BERT [CLS] (first-token) embedding; overwritten by the SentenceTransformer encoding in the next cell

In [50]:
sentence_transformer = SentenceTransformer("all-mpnet-base-v2")
medlm_embeddings = sentence_transformer.encode(medlm_output)
rag_embeddings = sentence_transformer.encode(rag_text)

In [51]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity([medlm_embeddings], [rag_embeddings])[0][0]
print("Cosine Similarity:", cosine_sim)

Cosine Similarity: 0.72898984


In [52]:
from rouge import Rouge
from nltk.translate.bleu_score import sentence_bleu

# Calculate BLEU score
bleu_score = sentence_bleu([rag_text.split()], medlm_output.split())
print("BLEU Score:", bleu_score)

# Calculate ROUGE score
rouge = Rouge()
rouge_scores = rouge.get_scores(medlm_output, rag_text)
print("ROUGE Scores:", rouge_scores)

BLEU Score: 0.18550839090700794
ROUGE Scores: [{'rouge-1': {'r': 0.6666666666666666, 'p': 0.32, 'f': 0.4324324280496713}, 'rouge-2': {'r': 0.4222222222222222, 'p': 0.18095238095238095, 'f': 0.2533333291333334}, 'rouge-l': {'r': 0.6388888888888888, 'p': 0.30666666666666664, 'f': 0.41441441003165325}}]


# Question 9

In [53]:
sql_query = "Two patients had allergies to Morphine. Their admission notes include the following information: 1. The first patient has an altered mental status and is in the MEDICINE department. No major surgical or invasive procedure was performed. 2. The second patient was admitted to the UROLOGY department for urinary incontinence. This patient is also allergic to Lipitor and Oxycodone, and no major surgical or invasive procedure was reported. The specific medications on admission for these patients are not provided in the available details."
rag_text = sql_query  # RAG text that the MedLM summary is compared against
rag_tokens = tokenizer(rag_text, return_tensors="pt")
rag_embeddings = model(**rag_tokens).last_hidden_state[:, 0, :]  # BERT [CLS] (first-token) embedding; overwritten by the SentenceTransformer encoding below

In [54]:
client_options = {"api_endpoint": "us-central1-aiplatform.googleapis.com"}

# Initialize the client that will be used to create and send requests.
# This client only needs to be created once and can be reused for multiple requests.
client = aiplatform.gapic.PredictionServiceClient(client_options=client_options)

instance_dict = {"content": "Summarize " + rag_text}
instance = json_format.ParseDict(instance_dict, Value())
instances = [instance]

parameters_dict = {
    "candidateCount": 1,
    "maxOutputTokens": 500,
    "temperature": 0.2,
    "topP": 0.8,
    "topK": 40,
}
parameters = json_format.ParseDict(parameters_dict, Value())

response = client.predict(
    endpoint="projects/us-gcp-ame-con-5b680-sbx-1/locations/us-central1/publishers/google/models/medlm-large",
    instances=instances,
    parameters=parameters,
)

print("response")
predictions = response.predictions
for prediction in predictions:
    print(" prediction:", dict(prediction))

response
 prediction: {'citationMetadata': <proto.marshal.collections.maps.MapComposite object at 0x7f8b7fe0a020>, 'safetyAttributes': <proto.marshal.collections.maps.MapComposite object at 0x7f8b7fe098d0>, 'content': ' There are two patients who have allergies to Morphine. The first patient has an altered mental status and is in the MEDICINE department. No major surgical or invasive procedure was performed. The second patient was admitted to the UROLOGY department for urinary incontinence. This patient is also allergic to Lipitor and Oxycodone, and no major surgical or invasive procedure was reported. The specific medications on admission for these patients are not provided in the available details.'}


In [55]:
medlm_output = "There are two patients who have allergies to Morphine. The first patient has an altered mental status and is in the MEDICINE department. No major surgical or invasive procedure was performed. The second patient was admitted to the UROLOGY department for urinary incontinence. This patient is also allergic to Lipitor and Oxycodone, and no major surgical or invasive procedure was reported. The specific medications on admission for these patients are not provided in the available details."
medlm_tokens = tokenizer(medlm_output, return_tensors="pt")
medlm_embeddings = model(**medlm_tokens).last_hidden_state[:, 0, :]  # BERT [CLS] (first-token) embedding; overwritten by the SentenceTransformer encoding in the next cell

In [56]:
sentence_transformer = SentenceTransformer("all-mpnet-base-v2")
medlm_embeddings = sentence_transformer.encode(medlm_output)
rag_embeddings = sentence_transformer.encode(rag_text)

In [57]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity([medlm_embeddings], [rag_embeddings])[0][0]
print("Cosine Similarity:", cosine_sim)

Cosine Similarity: 0.9855383


In [59]:
from rouge import Rouge
from nltk.translate.bleu_score import sentence_bleu

# Calculate BLEU score
bleu_score = sentence_bleu([rag_text.split()], medlm_output.split())
print("BLEU Score:", bleu_score)

# Calculate ROUGE score
rouge = Rouge()
rouge_scores = rouge.get_scores(medlm_output, rag_text)
print("ROUGE Scores:", rouge_scores)

BLEU Score: 0.8143610145564801
ROUGE Scores: [{'rouge-1': {'r': 0.8448275862068966, 'p': 0.9245283018867925, 'f': 0.8828828778930282}, 'rouge-2': {'r': 0.8133333333333334, 'p': 0.8840579710144928, 'f': 0.847222217230903}, 'rouge-l': {'r': 0.8448275862068966, 'p': 0.9245283018867925, 'f': 0.8828828778930282}}]


# Question 10

In [60]:
sql_query = "The three most frequent procedures are 'Venous catheterization, not elsewhere classified' with 13928 occurrences, 'Insertion of Infusion Device into Superior Vena Cava, Percutaneous Approach' with 10061 occurrences and 'Other nonoperative respiratory measurements' with 10041 occurrences."
rag_text = sql_query  # RAG text that the MedLM summary is compared against
rag_tokens = tokenizer(rag_text, return_tensors="pt")
rag_embeddings = model(**rag_tokens).last_hidden_state[:, 0, :]  # BERT [CLS] (first-token) embedding; overwritten by the SentenceTransformer encoding below

In [61]:
client_options = {"api_endpoint": "us-central1-aiplatform.googleapis.com"}

# Initialize the client that will be used to create and send requests.
# This client only needs to be created once and can be reused for multiple requests.
client = aiplatform.gapic.PredictionServiceClient(client_options=client_options)

instance_dict = {"content": "Summarize " + rag_text}
instance = json_format.ParseDict(instance_dict, Value())
instances = [instance]

parameters_dict = {
    "candidateCount": 1,
    "maxOutputTokens": 500,
    "temperature": 0.2,
    "topP": 0.8,
    "topK": 40,
}
parameters = json_format.ParseDict(parameters_dict, Value())

response = client.predict(
    endpoint="projects/us-gcp-ame-con-5b680-sbx-1/locations/us-central1/publishers/google/models/medlm-large",
    instances=instances,
    parameters=parameters,
)

print("response")
predictions = response.predictions
for prediction in predictions:
    print(" prediction:", dict(prediction))

response
 prediction: {'citationMetadata': <proto.marshal.collections.maps.MapComposite object at 0x7f8b7f9ab3a0>, 'content': " The three most frequent procedures are 'Venous catheterization, not elsewhere classified' with 13928 occurrences, 'Insertion of Infusion Device into Superior Vena Cava, Percutaneous Approach' with 10061 occurrences and 'Other nonoperative respiratory measurements' with 10041 occurrences.\n\n'Venous catheterization, not elsewhere classified' is the insertion of a catheter into a vein. It is a common procedure that is used for a variety of purposes, such as administering fluids, medications, or blood products, or for collecting blood samples.\n\n'Insertion of Infusion Device into Superior Vena Cava, Percutaneous Approach' is the placement of a catheter into the superior vena cava, a large vein that carries blood from the upper body to the heart. This procedure is often used to administer fluids, medications, or nutrition to patients who are unable to receive the

In [62]:
medlm_output = "The three most frequent procedures are 'Venous catheterization, not elsewhere classified' with 13928 occurrences, 'Insertion of Infusion Device into Superior Vena Cava, Percutaneous Approach' with 10061 occurrences and 'Other nonoperative respiratory measurements' with 10041 occurrences.'Venous catheterization, not elsewhere classified' is the insertion of a catheter into a vein. It is a common procedure that is used for a variety of purposes, such as administering fluids, medications, or blood products, or for collecting blood samples.\n\n'Insertion of Infusion Device into Superior Vena Cava, Percutaneous Approach' is the placement of a catheter into the superior vena cava, a large vein that carries blood from the upper body to the heart. This procedure is often used to administer fluids, medications, or nutrition to patients who are unable to receive them through their veins.'Other nonoperative respiratory measurements' refers to a variety of tests that are used to assess the function of the lungs. These tests may include spirometry, which measures the amount of air that a person can breathe in and out, and pulse oximetry, which measures the level of oxygen in the blood. The frequency of these procedures reflects the fact that they are all common and important medical interventions. They are used to diagnose and treat a variety of conditions, and they can play a vital role in improving patient outcomes."
medlm_tokens = tokenizer(medlm_output, return_tensors="pt")
medlm_embeddings = model(**medlm_tokens).last_hidden_state[:, 0, :]  # BERT [CLS] (first-token) embedding; overwritten by the SentenceTransformer encoding in the next cell

In [63]:
sentence_transformer = SentenceTransformer("all-mpnet-base-v2")
medlm_embeddings = sentence_transformer.encode(medlm_output)
rag_embeddings = sentence_transformer.encode(rag_text)

In [64]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity([medlm_embeddings], [rag_embeddings])[0][0]
print("Cosine Similarity:", cosine_sim)

Cosine Similarity: 0.85894895


In [65]:
from rouge import Rouge
from nltk.translate.bleu_score import sentence_bleu

# Calculate BLEU score
bleu_score = sentence_bleu([rag_text.split()], medlm_output.split())
print("BLEU Score:", bleu_score)

# Calculate ROUGE score
rouge = Rouge()
rouge_scores = rouge.get_scores(medlm_output, rag_text)
print("ROUGE Scores:", rouge_scores)

BLEU Score: 0.14933878337064085
ROUGE Scores: [{'rouge-1': {'r': 1.0, 'p': 0.26666666666666666, 'f': 0.4210526282548477}, 'rouge-2': {'r': 1.0, 'p': 0.17989417989417988, 'f': 0.3049327328416015}, 'rouge-l': {'r': 1.0, 'p': 0.26666666666666666, 'f': 0.4210526282548477}}]


In [66]:
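# Mean of the ten per-question cosine similarities, copied from the cell outputs above
average_cosine = (
    0.85894895 + 0.9855383 + 0.72898984 + 0.868711 + 0.5889751
    + 0.6625615 + 0.7876705 + 1.0000002 + 0.5294906 + 0.66296136
) / 10
print(" Average Cosine with added Vector Weights: ", average_cosine)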

 Average Cosine with added Vector Weights:  0.7673847349999999
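
The mean alone hides how widely the per-question scores vary: the values in the averaging cell above range from about 0.53 to 1.00. A small sketch, using only the standard library, that reports the spread alongside the mean. The list is copied from that cell, and `cosine_scores` is an illustrative variable name.

```python
import statistics

# Per-question cosine similarities, copied from the averaging cell above.
cosine_scores = [
    0.85894895, 0.9855383, 0.72898984, 0.868711, 0.5889751,
    0.6625615, 0.7876705, 1.0000002, 0.5294906, 0.66296136,
]

print("mean: ", statistics.mean(cosine_scores))
print("min:  ", min(cosine_scores))
print("max:  ", max(cosine_scores))
print("stdev:", statistics.stdev(cosine_scores))
```

Keeping the scores in a list also makes it straightforward to append new questions without editing the arithmetic by hand.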
