# RAG vs Finetuning

In [63]:
#Importing Libraries
import os
from openai import OpenAI
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers import GrobidParser
import pandas as pd

In [64]:
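#OpenAI API key (removed for privacy; read here from an environment variable, whose name is an assumption)
openai_api_key = os.environ["OPENAI_API_KEY"]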

## Model Pipeline and Data Preparation

### RAG 

In [65]:
#Load the PDFs and parse them with GROBID (GrobidParser needs a running GROBID server)
loader = GenericLoader.from_filesystem(
    "/Users/Scott/Downloads/test input/",
    glob="*",
    suffixes=[".pdf"],
    parser=GrobidParser(segment_sentences=False),
)

data = loader.load()

In [66]:
#Checking Metadata
data[1].metadata

{'text': 'In this paper, we introduce Dynamically Rewired Message Passing (DRew), a novel framework for layer-dependent, multi-hop message passing that takes a principled approach to information flow, is robust to over-squashing, and can be applied to any MPNN for deep learning on graphs.',
 'para': '0',
 'bboxes': "[[{'page': '1', 'x': '307.44', 'y': '303.89', 'h': '234.00', 'w': '9.03'}, {'page': '1', 'x': '307.44', 'y': '315.85', 'h': '235.25', 'w': '9.03'}, {'page': '1', 'x': '307.44', 'y': '328.19', 'h': '234.00', 'w': '8.64'}, {'page': '1', 'x': '307.44', 'y': '340.15', 'h': '234.00', 'w': '8.64'}, {'page': '1', 'x': '307.44', 'y': '352.10', 'h': '202.11', 'w': '8.64'}]]",
 'pages': "('1', '1')",
 'section_title': 'Introduction',
 'section_number': '1.',
 'paper_title': 'DRew: Dynamically Rewired Message Passing with Delay',
 'file_path': '/Users/Scott/Downloads/test input/2305.08018v2.pdf'}

In [67]:
# Import libraries
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
import os
from langchain_chroma import Chroma

In [68]:
#Split the document using RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap = 100) #CHANGE CHUNK SIZE?
docs = splitter.split_documents(data) 

#Embed the documents in a persistent Chroma vector Database
embedding_function = OpenAIEmbeddings(openai_api_key=openai_api_key)
vectorstore = Chroma.from_documents(
    docs,
    embedding=embedding_function,
    persist_directory=os.getcwd()
)


#Configure the vector store as a retriever
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k":3}
)
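
To make the `chunk_size`, `chunk_overlap`, and `k` settings above more concrete, here is a minimal, dependency-free sketch of the two ideas involved: fixed-size chunking with overlap, and top-k retrieval by cosine similarity. It is not how `RecursiveCharacterTextSplitter` or Chroma work internally (the real splitter prefers natural separators, and the real vectors come from OpenAI embeddings). The helper names `chunk_text`, `bag_of_words`, `cosine_similarity`, and `top_k` are hypothetical, and word counts stand in for embeddings.

```python
import math
from collections import Counter


def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def bag_of_words(text):
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, chunks, k=3):
    """Return the k chunks most similar to the query."""
    query_vec = bag_of_words(query)
    scored = [(cosine_similarity(query_vec, bag_of_words(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
```

A larger `chunk_overlap` repeats more text between neighbouring chunks, which lowers the chance that a sentence is cut off from its context but increases the number of chunks to embed.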

In [69]:
#Import Libraries
from langchain_core.prompts import ChatPromptTemplate

In [70]:
# Add placeholders to the message string
message = """
Answer the following question using the context provided:

Context:
{context}

Question:
{question}

Answer:
"""

# Create a chat prompt template from the message string
prompt_template = ChatPromptTemplate.from_messages([("human", message)])
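
As a rough illustration of what the `{context}` and `{question}` placeholders are for, the sketch below joins a few retrieved chunks and substitutes them into the same message string with plain `str.format`. `ChatPromptTemplate` performs its own substitution when it is invoked, so this is only a stand-in; `fill_prompt` is a hypothetical helper and the example chunk is paraphrased from the metadata shown earlier.

```python
MESSAGE = """
Answer the following question using the context provided:

Context:
{context}

Question:
{question}

Answer:
"""


def fill_prompt(chunks, question, template=MESSAGE):
    """Join retrieved chunks into one context string and fill the template."""
    context = "\n\n".join(chunk.strip() for chunk in chunks)
    return template.format(context=context, question=question)


filled = fill_prompt(
    ["DRew is a framework for layer-dependent, multi-hop message passing."],
    "What is DRew?",
)
print(filled)
```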

In [71]:
#Import Libraries
from langchain_openai import ChatOpenAI
from langchain_core.runnables import RunnablePassthrough
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

In [89]:
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.7, openai_api_key=openai_api_key)

# Adjacent string literals are concatenated, so each sentence ends with a space.
system_prompt = (
    "You are an assistant for question-answering tasks specifically about the provided PDF documents. "
    "Use the following pieces of retrieved context to answer the questions. "
    "Use as many PDF documents loaded as possible. "
    "Do not use any external knowledge or information outside of the PDF loaded."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)


question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

response = rag_chain.invoke({"input": "What considerations should researchers take into account when fine-tuning pre-trained models on datasets from diverse atomic domains?"})
rag_message = response['answer']
print(rag_message)

Researchers should consider the following factors when fine-tuning pre-trained models on datasets from diverse atomic domains:

1. **Dataset Selection**: Choose datasets that cover a wide range of atomic domains to ensure generalization of the pre-trained models. In the provided context, datasets were selected from materials, small molecules, and large molecules domains.

2. **Task Diversity**: Select a diverse set of fine-tuning tasks within each atomic domain to test the generalization capabilities of the pre-trained models effectively.

3. **Model Architecture**: Ensure that the pre-trained model architecture is suitable for the specific atomic domains being considered. Fine-tuning on diverse datasets may require adjustments or optimization of the model architecture.

4. **Hyperparameter Tuning**: Optimize hyperparameters during fine-tuning to achieve better performance across different atomic domains. Fine-tuning on diverse datasets may require specific hyperparameter settings for 
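
The dictionary returned by a chain built with `create_retrieval_chain` carries the retrieved documents under `"context"` alongside `"answer"`, so it is possible to check which papers an answer drew on. The sketch below counts retrieved chunks per paper and section using the `paper_title` and `section_title` metadata keys shown earlier. `summarize_sources` is a hypothetical helper, and the response here is a hand-built stand-in so the example runs without an API call.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_sources(response):
    """Count retrieved chunks per (paper_title, section_title) pair."""
    counts = Counter()
    for doc in response.get("context", []):
        title = doc.metadata.get("paper_title", "unknown paper")
        section = doc.metadata.get("section_title", "unknown section")
        counts[(title, section)] += 1
    return counts


# Stand-in for a real chain response; only the metadata fields matter here.
drew_title = "DRew: Dynamically Rewired Message Passing with Delay"
fake_response = {
    "input": "What is DRew?",
    "context": [
        SimpleNamespace(metadata={"paper_title": drew_title, "section_title": "Introduction"}),
        SimpleNamespace(metadata={"paper_title": drew_title, "section_title": "Introduction"}),
        SimpleNamespace(metadata={"paper_title": drew_title}),
    ],
    "answer": "(answer text)",
}

for (title, section), n in summarize_sources(fake_response).items():
    print(f"{n} chunk(s) from '{title}' / {section}")
```

With the real chain, calling `summarize_sources(response)` would show whether the three retrieved chunks came from one paper or several, which is worth knowing given that the system prompt asks the model to use as many of the loaded PDFs as possible.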

### Finetuning

In [73]:
client = OpenAI(
  api_key=openai_api_key,
)

In [74]:
#File path for Training and Validation Prompts in JSONL format
training_file_name = "/Users/Scott/Documents/Python/Homework/Finetuning/trainingprompts2.jsonl"
validation_file_name = "/Users/Scott/Documents/Python/Homework/Finetuning/validationprompts.jsonl"

In [75]:
#Printing head of Training Prompts
training_prompts = pd.read_json(training_file_name, lines=True)
print(training_prompts.head(10))

                                            messages
0  [{'role': 'system', 'content': 'An assistant c...
1  [{'role': 'system', 'content': 'An assistant c...
2  [{'role': 'system', 'content': 'An assistant c...
3  [{'role': 'system', 'content': 'An assistant c...
4  [{'role': 'system', 'content': 'An assistant c...
5  [{'role': 'system', 'content': 'An assistant c...
6  [{'role': 'system', 'content': 'An assistant c...
7  [{'role': 'system', 'content': 'An assistant c...
8  [{'role': 'system', 'content': 'An assistant c...
9  [{'role': 'system', 'content': 'An assistant c...


In [76]:
#Printing head of Validation Prompts
validation_prompts = pd.read_json(validation_file_name, lines=True)
print(validation_prompts.head(5))

                                            messages
0  [{'role': 'system', 'content': 'An assistant c...
1  [{'role': 'system', 'content': 'An assistant c...
2  [{'role': 'system', 'content': 'An assistant c...
3  [{'role': 'system', 'content': 'An assistant c...
4  [{'role': 'system', 'content': 'An assistant c...
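
As the previews suggest, each line of these JSONL files is one JSON object whose `messages` list holds `role`/`content` pairs. Before uploading, it can help to confirm that every line parses and has that shape. The sketch below is one way to do it; `validate_chat_jsonl` is a hypothetical helper, and the allowed role set and the "at least one assistant message" rule are assumptions about what these particular files should contain.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}  # assumption about these files


def validate_chat_jsonl(lines):
    """Return a list of (line_number, problem) tuples; an empty list means no problems found."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append((number, "missing or empty 'messages' list"))
            continue
        roles = []
        for message in messages:
            if not isinstance(message, dict) or "role" not in message or "content" not in message:
                problems.append((number, "message without 'role' and 'content'"))
                break
            roles.append(message["role"])
        else:
            # Only reached when every message had both keys.
            unknown = sorted(set(roles) - ALLOWED_ROLES)
            if unknown:
                problems.append((number, f"unexpected roles: {unknown}"))
            elif "assistant" not in roles:
                problems.append((number, "no assistant message to learn from"))
    return problems


good_line = json.dumps({"messages": [
    {"role": "system", "content": "An assistant chatbot."},
    {"role": "user", "content": "A question?"},
    {"role": "assistant", "content": "An answer."},
]})
print(validate_chat_jsonl([good_line, "{not json"]))
```

Against the real files, the same function can be given an open file handle, for example `validate_chat_jsonl(open(training_file_name))`, since iterating a text file yields its lines.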


In [77]:
#Uploading Training and Validation JSONL files to the OpenAI client
training_file_id = client.files.create(
  file=open(training_file_name, "rb"),
  purpose="fine-tune"
)

validation_file_id = client.files.create(
  file=open(validation_file_name, "rb"),
  purpose="fine-tune"
)

#Printing Training and Validation File IDs
print(f"Training File: {training_file_id}")
print(f"Validation File: {validation_file_id}")

Training File: FileObject(id='file-pC79JuTTgRUwW60nvOp7MY4P', bytes=32906, created_at=1726143512, filename='trainingprompts2.jsonl', object='file', purpose='fine-tune', status='processed', status_details=None)
Validation File: FileObject(id='file-AZjWuOECv5v9YBYEq8LfwEnw', bytes=12494, created_at=1726143513, filename='validationprompts.jsonl', object='file', purpose='fine-tune', status='processed', status_details=None)


In [78]:
#Creating Fine-Tuning model Job
model_training_response = client.fine_tuning.jobs.create(
  training_file=training_file_id.id, 
  validation_file=validation_file_id.id,
  model="gpt-3.5-turbo", 
  hyperparameters={
    "n_epochs": 15,
    "batch_size": 3,
    "learning_rate_multiplier": 0.3
  }
)
print(f"Training Response: {model_training_response}")

Training Response: FineTuningJob(id='ftjob-UWnIEWAhixlmajG719m18S7T', created_at=1726143515, error=Error(code=None, message=None, param=None), fine_tuned_model=None, finished_at=None, hyperparameters=Hyperparameters(n_epochs=15, batch_size=3, learning_rate_multiplier=0.3), model='gpt-3.5-turbo-0125', object='fine_tuning.job', organization_id='org-4rTIJZ8QzdOrbetyoHq0qqB1', result_files=[], seed=2089879020, status='validating_files', trained_tokens=None, training_file='file-pC79JuTTgRUwW60nvOp7MY4P', validation_file='file-AZjWuOECv5v9YBYEq8LfwEnw', estimated_finish=None, integrations=[], user_provided_suffix=None)
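
The job above is still in `validating_files` and `fine_tuned_model` is `None`; the model name only becomes available once the job finishes. A minimal sketch of waiting for that, assuming the job is polled with `client.fine_tuning.jobs.retrieve`, follows. `wait_for_job` and `make_stub_client` are hypothetical helpers, the set of terminal statuses reflects my understanding of the API rather than anything shown in this notebook, and the poll interval is arbitrary. The demo runs against a stub so that it works offline.

```python
import time
from types import SimpleNamespace

TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}  # assumed terminal states


def wait_for_job(client, job_id, poll_seconds=30, sleep=time.sleep):
    """Poll a fine-tuning job until it reaches a terminal status, then return it."""
    while True:
        job = client.fine_tuning.jobs.retrieve(job_id)
        if job.status in TERMINAL_STATUSES:
            return job
        sleep(poll_seconds)


def make_stub_client(statuses):
    """Hypothetical offline stand-in that replays a fixed sequence of statuses."""
    remaining = list(statuses)

    def retrieve(job_id):
        # Hand out statuses in order; keep repeating the last one.
        status = remaining.pop(0) if len(remaining) > 1 else remaining[0]
        model = "ft:stub-model" if status == "succeeded" else None
        return SimpleNamespace(id=job_id, status=status, fine_tuned_model=model)

    jobs = SimpleNamespace(retrieve=retrieve)
    return SimpleNamespace(fine_tuning=SimpleNamespace(jobs=jobs))


stub = make_stub_client(["validating_files", "running", "succeeded"])
job = wait_for_job(stub, "ftjob-example", sleep=lambda seconds: None)
print(job.status, job.fine_tuned_model)
```

With the real client this would be called as `wait_for_job(client, model_training_response.id)`, and the returned job's `fine_tuned_model` is the value to pass as `model=` when querying the fine-tuned model.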


In [79]:
completion = client.chat.completions.create(
  model="ft:gpt-3.5-turbo-0125:personal::A1DibUK6",
  messages=[
    {"role": "system", "content": "An assistant chatbot trained on academic papers used to answer questions that intersect AI and Drug Discovery with respect to Graph Theory and Molecular Modelling"},
    {"role": "user", "content": "What are somethings that i need to be careful about when using GCNs and drug discovery?"}
  ]
)
print(completion.choices[0].message.content)

One major consideration is the interpretability of GCN predictions for drug-target interactions. Model explainability is crucial in understanding how molecular features influence predictions.


# Comparing Responses

Responses are compared in two pairings:

- Finetuning vs base model
- Finetuning vs RAG

Each answer is rated by an LLM to judge its quality.

### Instantiate Base Model

In [82]:
base_response = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": "An assistant chatbot trained on academic papers used to answer questions that intersect AI and Drug Discovery with respect to Graph Theory and Molecular Modelling"},
    {"role": "user", "content": "What considerations should researchers take into account when fine-tuning pre-trained models on datasets from diverse atomic domains?"}
  ]
)

In [84]:
base_message = base_response.choices[0].message.content
print(base_message)

When fine-tuning pre-trained models on datasets from diverse atomic domains in the context of AI and Drug Discovery, researchers should consider the following key considerations:

1. **Domain-specific features**: Different atomic domains may have unique features and characteristics. Researchers should carefully analyze and understand the specific features of the atomic domains they are working with to ensure the pre-trained models can capture domain-specific patterns effectively during fine-tuning.

2. **Data preprocessing**: Preprocessing techniques such as normalization, feature scaling, and data augmentation may need to be adapted to suit the characteristics of the diverse atomic domains. This step is crucial to ensure that the pre-trained models can generalize well to new datasets from different atomic domains.

3. **Transfer learning strategies**: Researchers should consider appropriate transfer learning strategies to fine-tune pre-trained models effectively on diverse atomic doma

### Using GPT for evaluation of answers

In [92]:
eval_response = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": "An assistant chatbot used to evaluate answers and rank them on a scale of 1-5. Rate the responses based on quality of the response and logic behind the response. "},
    {"role": "user", "content": rag_message }
  ]
)

In [93]:
eval_message = eval_response.choices[0].message.content
print(eval_message)

This is a detailed and comprehensive response that covers essential factors for researchers to consider when fine-tuning pre-trained models on datasets from diverse atomic domains. Each point is clearly articulated and provides a logical reasoning behind it. The response not only highlights the key considerations but also explains why they are crucial for achieving better performance and generalization of models across different atomic domains. Overall, this response demonstrates a strong understanding of the topic and presents the information in a structured and informative manner.

I would rate this response a 5 out of 5 for its quality and thoroughness.
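
The evaluator above receives only the answer text and returns its score inside a paragraph of prose. Two small additions could make the ratings easier to compare across the RAG, fine-tuned, and base answers: passing the question along with the answer, and extracting the numeric score. The sketch below shows one way to do both. `build_eval_messages` and `extract_rating` are hypothetical helpers, and the regular expression assumes the evaluator phrases its score as "N out of 5" or "N/5".

```python
import re

EVAL_SYSTEM_PROMPT = (
    "An assistant chatbot used to evaluate answers and rank them on a scale of 1-5. "
    "Rate the responses based on quality of the response and logic behind the response."
)


def build_eval_messages(question, answer):
    """Build chat messages that show the evaluator both the question and the answer."""
    user_content = f"Question:\n{question}\n\nAnswer to evaluate:\n{answer}"
    return [
        {"role": "system", "content": EVAL_SYSTEM_PROMPT},
        {"role": "user", "content": user_content},
    ]


def extract_rating(evaluation_text):
    """Return the integer rating found in the text, or None if no rating is found."""
    match = re.search(r"(?<![\d.])([1-5])\s*(?:out of|/)\s*5\b", evaluation_text)
    return int(match.group(1)) if match else None


sample = "I would rate this response a 5 out of 5 for its quality and thoroughness."
print(extract_rating(sample))
```

The messages from `build_eval_messages` can be passed as `messages=` to `client.chat.completions.create`, once per answer, so that each answer is scored against the same question.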
