<div class="list-group">

  <h3 class="list-group-item list-group-item-action active" data-toggle="list" role='tab' aria-controls="home"
      style="background-color:DODGERBLUE; color:white">Table of Contents</h3>

  <a href='#Import Packages' style="background-color:white; color:grey" class="list-group-item list-group-item-action">1. Import Packages</a>

  <a href='#Define LLM' style="background-color:white; color:grey" class="list-group-item list-group-item-action">2. Define LLM</a>

  <a href='#Process PDFs' style="background-color:white; color:grey" class="list-group-item list-group-item-action">3. Process PDFs</a>
  
  <a href='#Create Embeddings' style="background-color:white; color:grey" class="list-group-item list-group-item-action">4. Create Embeddings</a>
  
  <a href='#Create Vector Database' style="background-color:white; color:grey" class="list-group-item list-group-item-action">5. Create Vector Database</a>

  <a href='#Query Relevant information' style="background-color:white; color:grey" class="list-group-item list-group-item-action">6. Query Relevant Information</a>

  <a href='#Define Prompt Template' style="background-color:white; color:grey" class="list-group-item list-group-item-action">7. Define Prompt Template</a>

  <a href='#Define Langchain' style="background-color:white; color:grey" class="list-group-item list-group-item-action">8. Define Langchain</a>

  <a href='#Structure Response' style="background-color:white; color:grey" class="list-group-item list-group-item-action">9. Structure Response</a>

  <a href='#Structure Response into Tabular Format for User Consumption' style="background-color:white; color:grey" class="list-group-item list-group-item-action">10. Structure Response into Tabular Format for User Consumption</a>

</div>

#### <a id='Import Packages' style="color:blue;font-size:140%;"> 1. Import Packages</a>

In [43]:
# Import Langchain modules
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.vectorstores import Chroma
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain.evaluation import load_evaluator
# Other modules and packages
import os
import tempfile
import streamlit as st  
import pandas as pd
from dotenv import load_dotenv
import uuid 

In [44]:
#read api key from .env file
load_dotenv() 

True

In [45]:
#get API key for testing
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
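
`os.environ.get` returns `None` when the variable is absent, so a missing key only surfaces later as an authentication error. A minimal sketch of a fail-fast check follows. The helper name `require_env` and the demo key value are assumptions made for illustration; they are not part of the notebook or of python-dotenv.

```python
import os


def require_env(name, env=None):
    """Return the value of an environment variable or raise a clear error.

    `require_env` is a hypothetical helper used only for illustration.
    """
    env = os.environ if env is None else env
    value = env.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value


# Pass an explicit mapping so the example runs without a real key
print(require_env("OPENAI_API_KEY", env={"OPENAI_API_KEY": "sk-demo"}))
```

In the notebook itself, the call would be `require_env("OPENAI_API_KEY")` after `load_dotenv()`.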

#### <a id='Define LLM' style="color:blue;font-size:140%;"> 2. Define LLM</a>

In [46]:
# Choose the model based on the task and its requirements (cost, latency, quality)
llm = ChatOpenAI(model="gpt-4o-mini", api_key=OPENAI_API_KEY)
#test
llm.invoke("Tell me a joke about dogs")

AIMessage(content='Why did the dog sit in the shade?\n\nBecause he didn’t want to become a hot dog!', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 21, 'prompt_tokens': 13, 'total_tokens': 34, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_0392822090', 'id': 'chatcmpl-BPh8L5kcmAXlDgq9MLFos9duZv6tx', 'finish_reason': 'stop', 'logprobs': None}, id='run-e1a77553-1f5f-4d0d-b043-645c37ca6fd0-0', usage_metadata={'input_tokens': 13, 'output_tokens': 21, 'total_tokens': 34, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}})

#### <a id='Process PDFs' style="color:blue;font-size:140%;"> 3. Process PDFs</a>

In [47]:
# load PDF document
loader = PyPDFLoader("data/NeuCF.pdf")
pages = loader.load()
pages

[Document(metadata={'producer': 'pdfTeX-1.40.17', 'creator': 'LaTeX with hyperref package', 'creationdate': '2017-08-29T01:23:59+00:00', 'author': '', 'keywords': '', 'moddate': '2017-08-29T01:23:59+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016) kpathsea version 6.2.2', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'data/NeuCF.pdf', 'total_pages': 10, 'page': 0, 'page_label': '1'}, page_content='Neural Collaborative Filtering∗\nXiangnan He\nNational University of\nSingapore, Singapore\nxiangnanhe@gmail.com\nLizi Liao\nNational University of\nSingapore, Singapore\nliaolizi.llz@gmail.com\nHanwang Zhang\nColumbia University\nUSA\nhanwangzhang@gmail.com\nLiqiang Nie\nShandong University\nChina\nnieliqiang@gmail.com\nXia Hu\nTexas A&M University\nUSA\nhu@cse.tamu.edu\nTat-Seng Chua\nNational University of\nSingapore, Singapore\ndcscts@nus.edu.sg\nABSTRACT\nIn recent years, deep neural networks have yielded immense\nsuccess on speech r

In [48]:
# Split the document into smaller chunks so the model can focus on one passage at a time.
# Tune chunk_size and chunk_overlap to handle different PDF layouts.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,
    chunk_overlap=200,
    length_function=len,
    separators=['\n\n', '\n', ' '],  # try paragraph breaks first, then line breaks, then spaces
)
chunks = text_splitter.split_documents(pages)
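
To see what `chunk_size` and `chunk_overlap` control, here is a simplified sketch that cuts text into overlapping character windows. It is not how `RecursiveCharacterTextSplitter` works internally: the hypothetical `split_with_overlap` function ignores separators and cuts purely by character count.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap.

    Simplified stand-in for illustration: it ignores separators and
    cuts purely by character count.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows = []
    for start in range(0, len(text), step):
        windows.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return windows


sample = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for piece in toy_chunks:
    print(piece)
```

Each window repeats the last three characters of the previous one, which is what keeps a sentence that straddles a boundary visible in both chunks.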

#### <a id='Create Embeddings' style="color:blue;font-size:140%;"> 4. Create Embeddings</a>

In [49]:
def get_embedding_function():
    embeddings = OpenAIEmbeddings(
        model = 'text-embedding-ada-002',
        openai_api_key = OPENAI_API_KEY
    )
    return embeddings
embedding_function = get_embedding_function()
test_vector = embedding_function.embed_query("dog")

In [50]:
# Build intuition for how embeddings support retrieval: semantically similar texts have a smaller embedding distance
evaluator = load_evaluator(evaluator="embedding_distance",
                           embeddings=embedding_function)
evaluator.evaluate_strings(prediction='dog', reference='poodle')

{'score': 0.1402799252940825}

In [51]:
# Check that the embedding places "dog" closer to "poodle" than "cat" is to "poodle"
# (double quotes outside, single quotes inside: nested identical quotes fail before Python 3.12)
print(f"The distance between dog and poodle is {evaluator.evaluate_strings(prediction='dog', reference='poodle')}")
print(f"The distance between cat and poodle is {evaluator.evaluate_strings(prediction='cat', reference='poodle')}")

The distance between dog and poodle is {'score': 0.1402799252940825}
The distance between cat and poodle is {'score': 0.2088812781695495}
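
The scores above can be read as distances between embedding vectors. The sketch below assumes the evaluator reports cosine distance (1 minus cosine similarity), which is worth confirming against the LangChain documentation. The three-dimensional vectors are made up for illustration and are not real embeddings.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Made-up toy vectors, not real embeddings
dog = [0.9, 0.1, 0.0]
poodle = [0.8, 0.2, 0.1]
cat = [0.1, 0.9, 0.0]

# A smaller distance means the two vectors point in more similar directions
print(cosine_distance(dog, poodle) < cosine_distance(cat, poodle))
```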


#### <a id='Create Vector Database' style="color:blue;font-size:140%;"> 5. Create Vector Database</a>

In [52]:
# Create a Chroma vector database to store the embeddings of all chunks.
# Each document needs a unique ID; otherwise Chroma creates duplicate vectors for the same file.
def create_vectorstore(chunks, embedding_function, vectorstore_path):
    # Derive a deterministic ID for each chunk from its content
    ids = [str(uuid.uuid5(uuid.NAMESPACE_DNS, doc.page_content)) for doc in chunks]

    # Keep only the first chunk for each ID. The IDs are collected in a list
    # (not read back from the set) so that unique_ids[i] matches unique_chunks[i].
    seen_ids = set()
    unique_ids = []
    unique_chunks = []

    for chunk, chunk_id in zip(chunks, ids):
        if chunk_id not in seen_ids:
            seen_ids.add(chunk_id)
            unique_ids.append(chunk_id)
            unique_chunks.append(chunk)

    # Embed the unique chunks and store them in Chroma
    vectorstore = Chroma.from_documents(documents=unique_chunks,
                                        ids=unique_ids,
                                        embedding=embedding_function,
                                        persist_directory=vectorstore_path)

    vectorstore.persist()

    return vectorstore
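
The de-duplication step relies on two properties: `uuid.uuid5` is deterministic, so identical text always maps to the same ID, and the IDs must stay in the same order as the documents they describe. A minimal sketch on plain strings follows; `dedupe_by_content` is a hypothetical helper and does not touch Chroma.

```python
import uuid


def dedupe_by_content(texts):
    """Return (ids, unique_texts) in matching order, without duplicate content."""
    seen = set()
    ids, unique_texts = [], []
    for text in texts:
        # uuid5 is deterministic: the same text always yields the same ID
        content_id = str(uuid.uuid5(uuid.NAMESPACE_DNS, text))
        if content_id not in seen:
            seen.add(content_id)
            ids.append(content_id)
            unique_texts.append(text)
    return ids, unique_texts


toy_ids, toy_texts = dedupe_by_content(["alpha", "beta", "alpha"])
print(toy_texts)
print(len(toy_ids))
```

Collecting the IDs in a list rather than converting a set back to a list matters because set iteration order is not tied to insertion order, so IDs could otherwise be paired with the wrong documents.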

In [53]:
# Create the vector store
vectorstore = create_vectorstore(chunks=chunks,
                                 embedding_function=embedding_function,
                                 vectorstore_path="vectorstore_chroma_pdf")

#### <a id='Query Relevant information' style="color:blue;font-size:140%;"> 6. Query Relevant Information</a>

In [54]:
#load vector store
vectorstore = Chroma(persist_directory='vectorstore_chroma_pdf', embedding_function=embedding_function)

In [55]:
#create retriever and get relevant chunks
retriever = vectorstore.as_retriever(search_type='similarity')  # nearest-neighbour search; Chroma uses L2 distance by default unless the collection is configured for cosine

In [56]:
#test out the relevant chunks finding
relevant_chunks = retriever.invoke("what's the advantages of this model?")
relevant_chunks
# The output below shows the chunks the retriever considers closest to the question.


[Document(metadata={'total_pages': 10, 'producer': 'pdfTeX-1.40.17', 'creator': 'LaTeX with hyperref package', 'creationdate': '2017-08-29T01:23:59+00:00', 'author': '', 'moddate': '2017-08-29T01:23:59+00:00', 'trapped': '/False', 'page': 1, 'subject': '', 'source': 'data/NeuCF.pdf', 'title': '', 'keywords': '', 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016) kpathsea version 6.2.2', 'page_label': '2'}, page_content='can see, MF models the two-way interaction of user and item\nlatent factors, assuming each dimension of the latent space\nis independent of each other and linearly combining them\nwith the same weight. As such, MF can be deemed as a\nlinear model of latent factors.\nFigure 1 illustrates how the inner product function can\nlimit the expressiveness of MF. There are two settings to be\nstated clearly beforehand to understand the example well.\nFirst, since MF maps users and items to the same latent\nspace, the similarity between two users ca
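
Conceptually, a similarity retriever embeds the question and returns the stored chunks whose vectors lie closest to it. The sketch below shows that ranking step with made-up two-dimensional vectors and Euclidean distance. The function `top_k` and its `k` parameter are illustrative assumptions; real vector stores typically use approximate indexes rather than this exhaustive scan.

```python
import math


def top_k(query_vec, doc_vecs, k=4):
    """Return the indices of the k vectors closest to query_vec (Euclidean)."""
    distances = [(math.dist(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    distances.sort()
    return [i for _, i in distances[:k]]


# Made-up toy vectors standing in for chunk embeddings
toy_vectors = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1], [0.5, 0.5]]
print(top_k([1.0, 0.0], toy_vectors, k=2))
```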

#### <a id='Define Prompt Template' style="color:blue;font-size:140%;"> 7. Define Prompt Template</a>

In [57]:
# Prompt template
# {context} is filled with the chunks returned by the retriever; {question} is the user's question
PROMPT_TEMPLATE = """
You are a research assistant for question-answering task.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer, say that you don't know.
DON'T MAKE UP ANYTHING YOU ARE NOT SURE ABOUT.

{context}

---

Answer the question based on the above context: {question}
"""


In [58]:
# Concatenate context text
context_text = "\n\n---\n\n".join([doc.page_content for doc in relevant_chunks])

# Create prompt
prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
prompt = prompt_template.format(context=context_text, 
                                question="What is the title of the paper?")
print(prompt)

Human: 
You are a research assistant for question-answering task.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer, say that you don't know.
DON'T MAKE UP ANYTHING YOU ARE NOT SURE ABOUT.

can see, MF models the two-way interaction of user and item
latent factors, assuming each dimension of the latent space
is independent of each other and linearly combining them
with the same weight. As such, MF can be deemed as a
linear model of latent factors.
Figure 1 illustrates how the inner product function can
limit the expressiveness of MF. There are two settings to be
stated clearly beforehand to understand the example well.
First, since MF maps users and items to the same latent
space, the similarity between two users can also be measured
with an inner product, or equivalently 2, the cosine of the
angle between their latent vectors. Second, without loss of
2Assuming latent vectors are of a unit length.

---

play a pivotal role in alleviating
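
The prompt is assembled in two steps: the retrieved chunks are joined with a visible separator, then substituted into the template together with the question. A stdlib-only sketch of the same idea using `str.format` follows. The shortened `TEMPLATE` and the `build_prompt` helper are assumptions for illustration and do not reproduce `ChatPromptTemplate`.

```python
TEMPLATE = """Use the following context to answer the question.

{context}

---

Answer the question based on the above context: {question}
"""


def build_prompt(chunks, question):
    """Join the chunks with a separator and fill the template placeholders."""
    context = "\n\n---\n\n".join(chunks)
    return TEMPLATE.format(context=context, question=question)


toy_prompt = build_prompt(["first chunk", "second chunk"], "What is the title?")
print(toy_prompt)
```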

#### <a id='Define Langchain' style="color:blue;font-size:140%;"> 8. Define Langchain</a>

In [59]:
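def format_docs(docs):
    # Join the chunks retrieved for the current question (use the argument,
    # not the global relevant_chunks, so each query gets its own context)
    return "\n\n---\n\n".join([doc.page_content for doc in docs])

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt_template
    | llm
)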

In [60]:
rag_chain.invoke("what's the advantages of this model?")

AIMessage(content="The advantages of the model, particularly the NeuMF (Neural Matrix Factorization) model, include:\n\n1. **High Expressiveness**: NeuMF combines both linear matrix factorization (MF) and non-linear multilayer perceptron (MLP) models, allowing for improved modeling of relationships between users and items.\n\n2. **Improved Performance**: NeuMF demonstrates consistent improvements in performance over other methods across different ranking positions, showcasing its efficacy in generating recommendations.\n\n3. **Statistically Significant Improvements**: The performance enhancements provided by NeuMF have been verified to be statistically significant, indicating strong reliability in its effectiveness.\n\n4. **Overfitting Management**: While GMF shows strong performance, the report notes that NeuMF is less prone to overfitting due to its advanced architecture, allowing for better generalization across datasets.\n\n5. **Robust Against Different Factors**: NeuMF maintains r
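
The `|` syntax chains steps so that each output feeds the next, and the leading dict runs several steps on the same input. The sketch below imitates that behaviour with plain Python. The `Step` class, `fan_out`, and the toy components are hypothetical stand-ins, not LangChain's implementation.

```python
class Step:
    """Tiny stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # self | other: run self first, then feed its result to other
        return Step(lambda x: other.invoke(self.invoke(x)))

    def invoke(self, x):
        return self.func(x)


def fan_out(**steps):
    """Run several steps on the same input and collect the results in a dict."""
    return Step(lambda x: {name: step.invoke(x) for name, step in steps.items()})


# Hypothetical toy components standing in for retriever, formatter, prompt, and LLM
retrieve = Step(lambda q: [f"chunk about {q}", "another chunk"])
format_chunks = Step(lambda docs: "\n\n---\n\n".join(docs))
passthrough = Step(lambda q: q)
fill_prompt = Step(lambda d: f"{d['context']}\n\nQuestion: {d['question']}")
fake_llm = Step(lambda prompt: f"[answer based on {len(prompt)} characters]")

toy_chain = (
    fan_out(context=retrieve | format_chunks, question=passthrough)
    | fill_prompt
    | fake_llm
)
print(toy_chain.invoke("NeuMF"))
```

The question reaches both branches of `fan_out`: one branch retrieves and formats context, the other passes the question through unchanged, which mirrors the role of `RunnablePassthrough()` in the chain.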

#### <a id='Structure Response' style="color:blue;font-size:140%;"> 9. Structure Response</a>

In [61]:
class AnswerWithSources(BaseModel):
    """
    An answer to the question, with sources and reasoning.
    """
    answer: str = Field(description="Answer to the question")
    sources: str = Field(description="Full direct text chunk from the context used to answer the question")
    reasoning: str = Field(description="Explain the reasoning of the answer based on sources")

class ExtractedInfo(BaseModel):
    """Extracted information about the research article"""
    paper_summary: AnswerWithSources
    model_motivation: AnswerWithSources
    model_architecture: AnswerWithSources
    model_limitation: AnswerWithSources
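
A schema like this gives a fixed shape to validate the model's reply against. The sketch below shows the idea with stdlib `dataclasses` and `json` instead of Pydantic: a JSON payload is accepted only if every section and field is present. The names `Answer` and `parse_extracted_info` and the placeholder payload are assumptions for illustration, not real model output.

```python
import json
from dataclasses import dataclass


@dataclass
class Answer:
    answer: str
    sources: str
    reasoning: str


REQUIRED_SECTIONS = ("paper_summary", "model_motivation",
                     "model_architecture", "model_limitation")


def parse_extracted_info(raw_json):
    """Parse a JSON string into {section: Answer}, raising on missing sections."""
    data = json.loads(raw_json)
    missing = [name for name in REQUIRED_SECTIONS if name not in data]
    if missing:
        raise ValueError(f"missing sections: {missing}")
    # Answer(**...) raises TypeError if a field is absent or unexpected
    return {name: Answer(**data[name]) for name in REQUIRED_SECTIONS}


# Placeholder payload, not real model output
placeholder = {name: {"answer": "a", "sources": "s", "reasoning": "r"}
               for name in REQUIRED_SECTIONS}
parsed = parse_extracted_info(json.dumps(placeholder))
print(parsed["paper_summary"].answer)
```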

In [62]:
rag_chain = (
            {"context": retriever | format_docs, "question": RunnablePassthrough()}
            | prompt_template
            | llm.with_structured_output(ExtractedInfo, strict=True)
        )

rag_chain.invoke("Give me the paper summary, model motivation, model architecture, and model limitation of the research paper.")

# Example of sending several questions in one call with batch().
# It is kept inside a string literal so that it does not run.
"""
questions = [
    "Give me the title, summary, publication date, authors of the research paper.",
    "Give me the experiment dataset, baseline and evaluation metrics of the research paper.",
    "Give me the model motivation, architecture, and limitations."
]
results = rag_chain.batch(questions) 
for q, info in zip(questions, results):
    print("▶ QUESTION:", q)
    print("▶ ANSWER:", info)
    print()
"""




#### <a id='Structure Response into Tabular Format for User Consumption' style="color:blue;font-size:140%;"> 10. Structure Response into Tabular Format for User Consumption</a>

In [63]:
structured_response = rag_chain.invoke("Give me the paper summary, model motivation, model architecture, and model limitation of the research paper.")
df = pd.DataFrame([structured_response.dict()])

# Transform into a table with three rows: 'answer', 'source', and 'reasoning'
answer_row = []
source_row = []
reasoning_row = []

for col in df.columns:
    answer_row.append(df[col][0]['answer'])
    source_row.append(df[col][0]['sources'])
    reasoning_row.append(df[col][0]['reasoning'])

# Create a new dataframe with one row per field and one column per extracted section
structured_response_df = pd.DataFrame(
    [answer_row, source_row, reasoning_row],
    columns=df.columns,
    index=['answer', 'source', 'reasoning'],
)
structured_response_df

| | paper_summary | model_motivation | model_architecture | model_limitation |
|---|---|---|---|---|
| answer | The research paper discusses matrix factorizat... | The motivation behind the model is to enhance ... | The architecture involves projecting users and... | A key limitation of the proposed model is its ... |
| source | MF models the two-way interaction of user and ... | Despite the effectiveness of MF for collaborat... | Popularized by the Netflix Prize, MF has becom... | It's worth noting that large factors may cause... |
| reasoning | The text provides a clear description of MF an... | The motivation is rooted in the realization th... | The architecture description combines both lin... | The limitation is directly noted in the text, ... |
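
The reshaping step is a pivot: a mapping of section to fields becomes a mapping of field to sections, so each field forms a row. A dependency-free sketch of that pivot follows; `to_rows` and the placeholder values are assumptions for illustration, not real model output.

```python
def to_rows(nested, fields=("answer", "sources", "reasoning")):
    """Pivot {section: {field: value}} into {field: {section: value}}."""
    return {
        field: {section: values[field] for section, values in nested.items()}
        for field in fields
    }


# Placeholder values, not real model output
toy_response = {
    "paper_summary": {"answer": "A1", "sources": "S1", "reasoning": "R1"},
    "model_motivation": {"answer": "A2", "sources": "S2", "reasoning": "R2"},
}
toy_rows = to_rows(toy_response)
for field, row in toy_rows.items():
    print(field, row)
```

With pandas available, passing the nested dict straight to `pd.DataFrame(...)` should give a similar layout, since a dict of dicts maps outer keys to columns and inner keys to the index.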
