# RAG Application: Ask Questions from a PDF Document using Large Language Models

Retrieval-Augmented Generation (RAG) is a generative AI framework that combines pre-trained large language models (LLMs) with external data sources. RAG improves the output of LLMs by using fresh data from authoritative knowledge bases and enterprise systems to generate more reliable responses.

This project uses RAG to answer questions about a PDF document. The system embeds the question, retrieves the most relevant passages from the PDF, and passes them to the large language model as context for generating a response. This way, we can extract precise information from a document.
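
To make the retrieve-then-generate flow concrete, here is a minimal sketch in plain Python. It is not the LangChain pipeline used in this notebook: it ranks passages by simple word overlap instead of embeddings, and `tokenize`, `retrieve`, and `build_prompt` are hypothetical helper names I made up for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, passages, k=1):
    """Rank passages by how many words they share with the question."""
    q_words = tokenize(question)
    ranked = sorted(
        passages,
        key=lambda p: len(q_words & tokenize(p)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, context):
    """Place the retrieved passages in front of the question."""
    return f"Context: {' '.join(context)}\n\nQuestion: {question}"


passages = [
    "Gemma is a family of open models from Google.",
    "Ollama runs large language models locally.",
]
question = "What is Gemma?"
context = retrieve(question, passages)

# The prompt built here is what a language model would receive.
print(build_prompt(question, context))
```

A real system replaces the word-overlap ranking with embedding similarity and sends the prompt to a model, which is what the rest of this notebook does.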

## 0. Set Up Ollama

I used [Ollama](https://ollama.com) because it's an easy way to get up and running with large language models locally on my computer.

In this case, I used the [TinyLlama](https://arxiv.org/pdf/2401.02385.pdf) model, developed by the StatNLP Research Group at the Singapore University of Technology and Design.

In your terminal, run:

```bash
ollama run tinyllama
```

## 1. Loading Environment Variables and Setting Up the Model

In [9]:
import os
from dotenv import load_dotenv

# If you want to use the OpenAI API, you need to set the OPENAI_API_KEY environment variable
load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
#MODEL = "gpt-3.5-turbo"

MODEL = "tinyllama"
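
`os.getenv` returns `None` when a variable is missing, so a forgotten `OPENAI_API_KEY` only shows up later as an authentication error. The sketch below shows one way to fail early. `resolve_api_key` is a hypothetical helper, not part of the notebook or of any library, and it assumes the same rule as the next cell: model names starting with `gpt` use OpenAI, and everything else runs locally.

```python
import os


def resolve_api_key(model_name, env=os.environ):
    """Return the OpenAI key for gpt-* models, or None for local models."""
    if not model_name.startswith("gpt"):
        return None  # local Ollama models need no API key
    key = env.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file")
    return key


# TinyLlama runs locally, so no key is required.
print(resolve_api_key("tinyllama"))
```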

## 2. Prepare Embeddings and Test the Model

In [10]:
from langchain_community.llms import Ollama
from langchain_openai.chat_models import ChatOpenAI
from langchain_community.embeddings import OllamaEmbeddings
from langchain_openai.embeddings import OpenAIEmbeddings

if MODEL.startswith("gpt"):
    model = ChatOpenAI(openai_api_key=OPENAI_API_KEY, model=MODEL)
    embeddings = OpenAIEmbeddings()
else:
    model = Ollama(model=MODEL)
    embeddings = OllamaEmbeddings(model=MODEL)

model.invoke("what is machine learning in a few words?")

'Machine Learning (ML) is a field of artificial intelligence that enables machines to learn from experience and improve their performance over time without being programmed specifically for those tasks. In simple terms, ML helps machines to "learn" from data or experiences to make decisions and perform tasks better than they would be able to do on their own.'

In [11]:
from langchain_core.output_parsers import StrOutputParser

parser = StrOutputParser()

chain = model | parser 
chain.invoke("what is machine learning in a few words?")

'Machine learning (ML) is a field of artificial intelligence that allows computers to learn and improve their own behavior based on data inputted by humans or other machines. In simpler terms, ML is a type of AI algorithm that enables machines to think like humans and improve on their performance based on the patterns they encounter. It involves using algorithms to process massive amounts of data and make predictions or decisions based on it, which can be beneficial for various applications in various industries.'
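
The `|` operator pipes the output of one component into the next. The toy sketch below shows how such chaining can be built in plain Python with `__or__`. It is only an illustration of the idea, not LangChain's actual implementation, and `Step`, `fake_model`, and `fake_parser` are hypothetical names.

```python
class Step:
    """A hypothetical stand-in for a chain component; wraps a plain function."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # a | b builds a new step that runs a, then feeds its output to b.
        return Step(lambda value: other.invoke(self.invoke(value)))


# The fake model returns a message-like dict; the fake parser extracts the text.
fake_model = Step(lambda prompt: {"content": f"echo: {prompt}"})
fake_parser = Step(lambda message: message["content"])

toy_chain = fake_model | fake_parser
print(toy_chain.invoke("hello"))  # echo: hello
```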

## 3. Load the PDF Document

In [12]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("gemma-model.pdf")
pages = loader.load_and_split()
pages

[Document(page_content='3/10/24, 8:32 PM Gemma: Google introduces new state-of-the-art open models\nhttps://blog.google/technology/developers/gemma-open-models/ 1/5DEVELOPERS\nGemma: Introducing new state-\nof-the-art open models\nGemma is built for responsible AI development from the same research and\ntechnology used to create Gemini models.\nFeb 21, 2024·3 min read\nJJeanine Banks\nVP & GM, Developer X and\nDevRelTTris Warkentin\nDirector, Google DeepMindShare\nListen to article\n7 minutes\nThe Keyword', metadata={'source': 'gemma-model.pdf', 'page': 0}),
 Document(page_content='3/10/24, 8:32 PM Gemma: Google introduces new state-of-the-art open models\nhttps://blog.google/technology/developers/gemma-open-models/ 2/5At Google, we believe in making AI helpful for everyone. We have a long history of contributing innovations\nto the open community, such as with Transformers, TensorFlow, BERT, T5, JAX, AlphaFold, and AlphaCode.\nToday, we’re excited to introduce a new generation of open
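
As far as I know, `load_and_split` uses a text splitter to break long pages into smaller chunks. The sketch below shows the general idea with fixed-size character windows that overlap. It is a simplified stand-in, not the splitter LangChain actually uses, and `chunk_text`, `chunk_size`, and `overlap` are hypothetical names.

```python
def chunk_text(text, chunk_size=40, overlap=10):
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


sample = "abcdefghij" * 10  # 100 characters
for chunk in chunk_text(sample):
    print(len(chunk), chunk)
```

The overlap keeps a sentence that straddles a boundary visible in both neighboring chunks, which tends to help retrieval.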

In [13]:
from langchain.prompts import PromptTemplate

template = """
Answer the question based on the context below. If you can't 
answer the question, respond with "I don't know".

Context: {context}

Question: {question}
"""

prompt = PromptTemplate.from_template(template)
print(prompt.format(context="Here is some context", question="Here is a question"))


Answer the question based on the context below. If you can't 
answer the question, respond with "I don't know".

Context: Here is some context

Question: Here is a question



## 4. Chain the Prompt, Model, and Parser

In [14]:
chain = prompt | model | parser

In [15]:
chain.input_schema.schema()

{'title': 'PromptInput',
 'type': 'object',
 'properties': {'context': {'title': 'Context', 'type': 'string'},
  'question': {'title': 'Question', 'type': 'string'}}}

In [17]:
chain.invoke(
    {
        "context": "i like to deploy machine learning models as web apps", 
        "question": "what do you think is my college background?"
    }
)

"I can't answer for your college background due to the lack of context. However, in general, a degree in computer science or a related field would indicate that you have a thorough understanding of machine learning and web application development. If your program includes the deployment feature, it's likely that you possess relevant knowledge and skills in this area as well."

## 5.0 Use a Vector Database to Store and Retrieve the Results

In [18]:
from langchain_community.vectorstores import DocArrayInMemorySearch

vectorstore = DocArrayInMemorySearch.from_documents(pages, embedding=embeddings)




In [19]:
retriever = vectorstore.as_retriever()
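
Conceptually, a vector store keeps one embedding vector per chunk and returns the chunks whose vectors are closest to the question's vector. The sketch below is a toy version with hand-written two-dimensional vectors. `TinyVectorStore` is a hypothetical class, not the `DocArrayInMemorySearch` implementation, and cosine similarity is assumed here as the distance measure because it is a common choice.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class TinyVectorStore:
    """Hypothetical in-memory store: keeps (vector, text) pairs in a list."""

    def __init__(self):
        self.items = []

    def add(self, vector, text):
        self.items.append((vector, text))

    def search(self, query_vector, k=1):
        # Sort stored items from most to least similar to the query vector.
        ranked = sorted(
            self.items,
            key=lambda item: cosine_similarity(query_vector, item[0]),
            reverse=True,
        )
        return [text for _, text in ranked[:k]]


store = TinyVectorStore()
store.add([1.0, 0.0], "page about Gemma benchmarks")
store.add([0.0, 1.0], "page about responsible AI")

print(store.search([0.9, 0.1]))
```

In the notebook, the embedding model produces the vectors, and the retriever plays the role of `search`.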

In [20]:
from operator import itemgetter

chain = (
    {
        "context": itemgetter("question") | retriever,
        "question": itemgetter("question"),
    }
    | prompt
    | model
    | parser
)
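
In this chain, `itemgetter("question")` pulls the question out of the input dict, and the retriever returns a list of document objects that is placed into the prompt as-is. One of the batch answers later in this notebook appears to quote a raw `Document(page_content=...` representation, which suggests the model sees that list's printed form. A common remedy is to join the page text into one string first. The sketch below shows the idea with stand-ins: `FakeDocument`, `fake_retriever`, and `format_docs` are hypothetical names, not part of the notebook.

```python
from dataclasses import dataclass
from operator import itemgetter


@dataclass
class FakeDocument:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str


def format_docs(docs):
    """Join the text of each retrieved document into one context string."""
    return "\n\n".join(doc.page_content for doc in docs)


def fake_retriever(question):
    # A stub that ignores the question; a real retriever runs a similarity search.
    return [
        FakeDocument("Gemma is a family of open models."),
        FakeDocument("Gemma models can be fine-tuned."),
    ]


inputs = {"question": "What is Gemma?"}
question = itemgetter("question")(inputs)  # same extraction step the chain uses
context = format_docs(fake_retriever(question))
print(context)
```

If you want to try this in the notebook's chain, the context entry would become something like `itemgetter("question") | retriever | format_docs`. I have not run that variant here, so treat it as a suggestion.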

In [23]:
questions = [
    "What makes the Gemma model special?",
    "Why is Gemma model a new state-of-the-art?",
]

for question in questions:
    print(f"Question: {question}")
    print(f"Answer: {chain.invoke({'question': question})}")
    print()

Question: What makes the Gemma model special?
Answer: The Gemma model is unique in several ways, including its large size and ability to run directly on a developer's laptop or desktop computer without the need for specialized hardware. Additionally, Gemma surpasses other open models on key benchmarks while adhering to our responsible and safe outputs guidelines. These factors make Gemma an effective choice for developers and researchers looking to build safe and responsible AI applications.

Question: Why is Gemma model a new state-of-the-art?
Answer: Answer: Gemma model is a new and innovative open models that are designed with responsible and safe outputs, as per the AI Principles. It surpasses significantly larger models on key benchmarks while adhering to our rigorous standards for safe and responsible outputs. This makes it an excellent choice for developers and researchers in the field of AI development from a trustworthy and responsible perspective.



## 5.1 Streaming Questions to the Language Model

`stream` returns the response in chunks as the model generates it, instead of waiting for the full answer. Printing each chunk as it arrives produces the typewriter effect you see in chatbots.

In [29]:
for s in chain.stream({"question": "Can I fine-tune Gemma on my own data?"}):
    print(s, end="", flush=True)

The answer to the question is "Yes, you can fine-tune Gemma models on your own data." This means that you can train and tune Gemma models on your specific data and use them for different applications or tasks. In other words, you can customize the training process and ensure that the model is tailored to your needs.
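
The typewriter effect comes from consuming a generator chunk by chunk. Here is a minimal sketch with a fake stream that yields one word at a time. `fake_stream` and its `delay` parameter are hypothetical, and a real model decides its own chunk boundaries.

```python
import time


def fake_stream(text, delay=0.0):
    """Yield the text one word at a time, as a streaming model might yield chunks."""
    for word in text.split(" "):
        if delay:
            time.sleep(delay)  # simulate the time the model takes per chunk
        yield word + " "


# Same printing pattern as the notebook cell above.
for chunk in fake_stream("Yes, you can fine-tune Gemma."):
    print(chunk, end="", flush=True)
print()
```

Setting `delay` to a small value such as `0.1` makes the effect visible in a terminal.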

## 5.2 Batching Questions to the Language Model

`batch` sends a list of questions to the chain in a single call. This is useful when you have many questions and don't want to wait for the model to answer them one by one: the chain processes the inputs in parallel and returns the answers in the same order as the questions.

In [40]:
questions = [
    "Can I use TensorFlow and Keras with Gemma?",
    "Is there debugging support?",
]

In [41]:

chain.batch([{"question": q} for q in questions])

["Yes, TensoRFlow and Keras can be used with Gemma for training and evaluating AI models. As mentioned in the context given earlier, Gemma's state-of-the-art open models support multiple frameworks, tools, and hardware platforms, including NVIDIA GPUs, from data center to the cloud to local RTX AI PCs, ensuring industry-leading performance and cost-efficient infrastructure across GPU, TPU, and CPU. \n\nFurthermore, Gemma is designed with our AI Principles at the forefront, making it safe and reliable for responsible AI development while using extensive fine-tuining and reinforce learning techniques to align its instruction-tuned models with responsible behavior standards. It also conducts robust evaluation with manual red-teaming, automatied adversarial testing, and assessments of model capabilities for dangerous activities to prioritize building safe and responsible AI applications.",
 'Yes, the given document (i.e., [document(page_content="3/10/24, 8:32 PM Gemma: Google introduce..."
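
To illustrate what running questions in parallel means, here is a minimal sketch that uses a thread pool from the standard library. It is not LangChain's code: `fake_invoke` stands in for a call to the chain, and this `batch` function and its `max_workers` parameter are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_invoke(inputs):
    """Stand-in for chain.invoke; a real call would query the model."""
    return f"answer to: {inputs['question']}"


def batch(invoke, inputs_list, max_workers=4):
    """Run invoke on every input concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(invoke, inputs_list))


questions = [
    "Can I use TensorFlow and Keras with Gemma?",
    "Is there debugging support?",
]
print(batch(fake_invoke, [{"question": q} for q in questions]))
```

Threads suit this workload because each call spends most of its time waiting for the model to respond.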