In [1]:
import os
from langchain.chains import RetrievalQA
from langchain.document_loaders import PyPDFLoader
from langchain.llms import GooglePalm

## Pre-trained Large Language Model
This LLM serves as the generator in the RAG pipeline: it turns the retrieved context into a natural-language answer.

In [2]:
# read the API key from the second line of the text file
with open("model_API_keys/palm_api.txt", "r") as api_file:
    api_key = api_file.readlines()[1].strip() # second line, without the trailing newline (\n)

In [3]:
# pre-trained model
model = GooglePalm(google_api_key=api_key)


In [4]:
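def get_answer(model, question: str):
    """Send the question straight to the LLM (no retrieval) and print the reply."""
    print(f"Question: {question}")
    try:
        answer = model(question)
        print(f"Answer:\n{answer}")
    except Exception: # the model call itself failed; no similarity search runs at this stage
        print("Similarity search failed.")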
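
The helper above prints a fixed message when the call fails, which hides the actual cause. A minimal sketch of a variant that returns the answer and reports the exception type is shown here. The names `ask`, `echo_model` and `failing_model` are hypothetical stand-ins used only for illustration; they are not part of LangChain or of this notebook.

```python
from typing import Callable, Optional


def ask(model: Callable[[str], str], question: str) -> Optional[str]:
    """Return the model's answer, or None if the call raises."""
    try:
        return model(question)
    except Exception as err:
        # Report what actually went wrong instead of a fixed message.
        print(f"Model call failed: {type(err).__name__}: {err}")
        return None


def echo_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM that always succeeds."""
    return f"echo: {prompt}"


def failing_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call that raises."""
    raise ValueError("request rejected")


print(ask(echo_model, "What to do if you are tired?"))
print(ask(failing_model, "What to do if you are tired?"))
```

Returning `None` on failure lets the caller decide how to react, and the printed exception type makes it easier to tell a rejected request from, say, a network error.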

In [5]:
# sample questions
sample_Q = "3rd degree burns on palms, what to do?"
get_answer(model, sample_Q)

sample_Q = "What to do if you are tired?"
get_answer(model, sample_Q)

Question: 3rd degree burns on palms, what to do?
Similarity search failed.
Question: What to do if you are tired?
Answer:
**1. Get enough sleep.** This is the most important thing you can do to avoid feeling tired. Most adults need around 7-8 hours of sleep per night. If you're not getting enough sleep, you'll likely feel tired during the day, even if you've had a cup of coffee.
2. **Take breaks throughout the day.** If you're feeling tired, don't try to push through it. Take a break and do something relaxing, like reading, listening to music, or taking a short nap.
3. **Eat healthy foods.** Eating a healthy diet can help you have more energy. Make sure to eat plenty of fruits, vegetables, and whole grains.
4. **Exercise regularly.** Exercise can help improve your mood and energy levels. Aim for at least 30 minutes of moderate-intensity exercise most days of the week.
5. **Manage stress.** Stress can take a toll on your physical and mental health, leading to fatigue. Find healthy ways 

## Load PDF Document
`PyPDFLoader` reads the PDF and returns one `Document` per page.

In [6]:
# load pdf
# pdf = os.path.join("data", "nurseslabs-cram-sheet.pdf") # PDF file
pdf = os.path.join("data", "GAN.pdf") # PDF file
loader = PyPDFLoader(pdf) # PDF loader
docs = loader.load() # load document

In [7]:
print(f"Number of pages in {os.path.basename(pdf)}: {len(docs)}")
print(f"5th page: {docs[4]}")

Number of pages in GAN.pdf: 9
5th page: page_content='Theorem 1. The global minimum of the virtual training criterion C(G)is achieved if and only if\npg=pdata. At that point, C(G)achieves the value −log 4 .\nProof. Forpg=pdata,D∗\nG(x) =1\n2, (consider Eq. 2). Hence, by inspecting Eq. 4 at D∗\nG(x) =1\n2, we\nﬁndC(G) = log1\n2+ log1\n2=−log 4 . To see that this is the best possible value of C(G), reached\nonly forpg=pdata, observe that\nEx∼pdata[−log 2] + Ex∼pg[−log 2] =−log 4\nand that by subtracting this expression from C(G) =V(D∗\nG,G), we obtain:\nC(G) =−log(4) +KL(\npdata\ued79\ued79\ued79\ued79pdata+pg\n2)\n+KL(\npg\ued79\ued79\ued79\ued79pdata+pg\n2)\n(5)\nwhere KL is the Kullback–Leibler divergence. We recognize in the previous expression the Jensen–\nShannon divergence between the model’s distribution and the data generating process:\nC(G) =−log(4) + 2·JSD (pdata∥pg) (6)\nSince the Jensen–Shannon divergence between two distributions is always non-negative and zero\nonly when t

## Retrieval-Augmented Generation (RAG) Pipeline

In [8]:
from transformers import AutoTokenizer, pipeline
# from langchain.llms import HuggingFaceHub # <-- currently invalid for Q&A task
from langchain import HuggingFacePipeline
from langchain.chains import RetrievalQA
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter, TokenTextSplitter, NLTKTextSplitter, SpacyTextSplitter
from langchain.embeddings import HuggingFaceInstructEmbeddings, HuggingFaceEmbeddings
from langchain.vectorstores import Cassandra, Chroma, FAISS # vector database

In [9]:
# Document Splitting: split the document into small chunks
chunk_size = 500
text_splitter = CharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=0)
chunks = text_splitter.split_documents(docs)
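
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch that cuts a string into fixed-size character windows. It is not how `CharacterTextSplitter` works internally: that class first splits on a separator and then merges pieces up to `chunk_size`, so its chunk boundaries will differ. The function name `split_fixed` is a hypothetical helper for illustration.

```python
from typing import List


def split_fixed(text: str, chunk_size: int, chunk_overlap: int = 0) -> List[str]:
    """Cut text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


print(split_fixed("abcdefghij", chunk_size=4, chunk_overlap=0))
print(split_fixed("abcdefghij", chunk_size=4, chunk_overlap=2))
```

With no overlap, a sentence that straddles a boundary is cut in two; a small overlap keeps some shared context between neighbouring chunks at the cost of storing more text.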

In [10]:
embedding_model = HuggingFaceInstructEmbeddings

In [11]:
# Pre-trained Text Embedding Model
device = "cuda"
query_instruction = "Represent the query for retrieval: "
embeddings = embedding_model(query_instruction=query_instruction,
                             model_kwargs={"device": device},
                            )

load INSTRUCTOR_Transformer
max_seq_length  512


In [12]:
# vector database (options: Cassandra, Chroma, FAISS)
# index the chunks from the splitting step; passing `docs` would store whole pages instead
db = FAISS.from_documents(documents=chunks, embedding=embeddings)

In [13]:
# vector retrieval from database
search_type = "similarity"
k = 3 # top k similar documents
retriever = db.as_retriever(
    search_type=search_type,
    search_kwargs={"k": k}
    )
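
Conceptually, a `"similarity"` search with `k = 3` ranks the stored vectors by their distance to the query vector and keeps the three closest. The sketch below does this by brute force in plain Python. The Euclidean metric and the function name `top_k` are assumptions made for illustration; the actual FAISS index may use a different metric and a much faster search.

```python
import math
from typing import List, Sequence, Tuple


def top_k(query: Sequence[float],
          vectors: Sequence[Sequence[float]],
          k: int) -> List[Tuple[int, float]]:
    """Return (index, distance) for the k stored vectors closest to the query."""
    scored = [(i, math.dist(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Toy 2-D "embeddings"; real embedding vectors have hundreds of dimensions.
stored = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.0, 2.0]]
print(top_k([0.0, 0.0], stored, k=3))
```

Note that a nearest-neighbour search always returns `k` results, even when none of the stored texts is actually relevant to the query.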

In [14]:
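# show the best of the top k texts retrieved from the vector database
# (this query appears to target the nursing cram sheet; with GAN.pdf loaded, the nearest match is unrelated)
similar_texts = retriever.get_relevant_documents("ABG Values of CO2")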
print(similar_texts[0].page_content)

3. One can approximately model all conditionals p(xS|x̸S)whereSis a subset of the indices
ofxby training a family of conditional models that share parameters. Essentially, one can use
adversarial nets to implement a stochastic extension of the deterministic MP-DBM [11].
4.Semi-supervised learning : features from the discriminator or inference net could improve perfor-
mance of classiﬁers when limited labeled data is available.
5.Efﬁciency improvements: training could be accelerated greatly by divising better methods for
coordinating GandDor determining better distributions to sample zfrom during training.
This paper has demonstrated the viability of the adversarial modeling framework, suggesting that
these research directions could prove useful.
Acknowledgments
We would like to acknowledge Patrice Marcotte, Olivier Delalleau, Kyunghyun Cho, Guillaume
Alain and Jason Yosinski for helpful discussions. Yann Dauphin shared his Parzen window eval-
uation code with us. We would like to thank

In [15]:
QA_chain = RetrievalQA.from_chain_type(
    llm=model,
    chain_type="stuff", # other options: map_reduce, refine, etc.
    retriever=retriever,
    return_source_documents=False
    )
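
The `"stuff"` chain type places all retrieved texts into a single prompt and sends that prompt to the LLM once. A rough sketch of the idea follows. The prompt wording and the function name `build_stuff_prompt` are hypothetical and are not LangChain's actual template.

```python
from typing import List


def build_stuff_prompt(question: str, contexts: List[str]) -> str:
    """Concatenate every retrieved text into one prompt for a single LLM call."""
    context_block = "\n\n".join(contexts)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context_block}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_stuff_prompt(
    "What does the paper propose?",
    ["First retrieved passage.", "Second retrieved passage."],
)
print(prompt)
```

Because everything goes into one prompt, this approach is simple and needs only one model call, but the combined text of the `k` retrieved documents has to fit within the model's context window. Options such as `map_reduce` and `refine` instead process the documents in several calls.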

In [17]:
query = "What does the paper propose?"
response = QA_chain(query)
print(f"Question: {query}\nAnswer: {response['result']}")

query = "Who are the authors?"
response = QA_chain(query)
print(f"Question: {query}\nAnswer: {response['result']}")

Question: What does the paper propose?
Answer: A new framework for estimating generative models via an adversarial process
Question: Who are the authors?
Answer: Ian J. Goodfellow, Jean Pouget-Abadie∗, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair†, Aaron Courville, Yoshua Bengio
