# Retrieval Augmented Generation with LangChain

In this notebook, we'll build a simple, naive RAG pipeline with LangChain. We use OpenAI for embeddings and LLM responses, and [FAISS](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/) as the vector store.

In [1]:
from operator import itemgetter
import openai
import faiss
import os
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_community.chat_models import ChatOpenAI
from langchain_community.embeddings import OpenAIEmbeddings
from dotenv import load_dotenv
load_dotenv()
api_key = os.getenv("OPENAI_API_KEY")



## Naive RAG

The cells below show a very simple version of RAG that does not load a document. We index a single sentence in the vector store and have the LLM answer a question based only on that sentence.

In [5]:
vectorstore = FAISS.from_texts(
    ["Addy ran to CCRB"], embedding=OpenAIEmbeddings(api_key=api_key)
)


retriever = vectorstore.as_retriever()

template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
prompt = PromptTemplate.from_template(template)

model = ChatOpenAI(api_key=api_key)



In [6]:
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)



In [10]:
chain.invoke("who is addy?")

'Addy is a person.'

In [12]:
template = """Answer the question based only on the following context:
{context}

Question: {question}

Answer in the following language: {language}
"""
prompt = PromptTemplate.from_template(template)

chain = (
    {
        "context": itemgetter("question") | retriever,
        "question": itemgetter("question"),
        "language": itemgetter("language"),
    }
    | prompt
    | model
    | StrOutputParser()
)
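
In the chain above, `itemgetter("question") | retriever` pulls the `question` field out of the input dict and hands it to the retriever, while the other two entries pass their fields through unchanged. The routing can be sketched in plain Python, with no LangChain involved. `fake_retriever` is a hypothetical stand-in for the FAISS retriever, and the input values are made up for illustration.

```python
from operator import itemgetter


def fake_retriever(question: str) -> list[str]:
    """Hypothetical stand-in for the FAISS retriever; returns canned text."""
    return [f"(documents retrieved for: {question})"]


inputs = {"question": "who is addy?", "language": "marathi"}

# Each value in the mapping is computed from the same input dict.
prompt_inputs = {
    "context": fake_retriever(itemgetter("question")(inputs)),
    "question": itemgetter("question")(inputs),
    "language": itemgetter("language")(inputs),
}
print(prompt_inputs)
```

The resulting dict has exactly the three keys the prompt template expects: `context`, `question`, and `language`.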

In [30]:
vectorstore = FAISS.from_texts(
    ["Addy ran to CCRB", "Addy is a woman", "Addy fell while running"],
    embedding=OpenAIEmbeddings(api_key=api_key),
)

retriever = vectorstore.as_retriever()
chain = (
    {
        "context": itemgetter("question") | retriever,
        "question": itemgetter("question"),
        "language": itemgetter("language"),
    }
    | prompt
    | model
    | StrOutputParser()
)
chain.invoke({"question": "What was addy doing and what happened during that action?", "language": "marathi"})

'अड्डी धावत होत्या आणि दौडताना पडल्या.'
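
To make the retrieval step less opaque, here is a toy sketch of what a vector store does conceptually: embed the query, score every stored text by similarity, and return the best matches. This is not how FAISS or `OpenAIEmbeddings` work internally. The `embed` function below is an assumed stand-in that uses simple word counts, and `retrieve` is a hypothetical helper.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, texts: list[str], k: int = 2) -> list[str]:
    """Return the k texts most similar to the query, best match first."""
    q = embed(query)
    return sorted(texts, key=lambda t: cosine(q, embed(t)), reverse=True)[:k]


texts = ["Addy ran to CCRB", "Addy is a woman", "Addy fell while running"]
top = retrieve("Who fell while running?", texts, k=1)
print(top)
```

Word counts only match exact words, so this toy version would miss paraphrases that a learned embedding model can catch. That gap is the main reason real pipelines use dense embeddings.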

### Naive RAG with Documents

Now, we will perform RAG over an Environmental Science text. You can find the PDF in the [Drive](https://drive.google.com/drive/folders/1EBnXiHcnpZNQ3IWwXOFQLbRJCVQG4sXb?usp=drive_link).

In [2]:
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.output_parsers import StrOutputParser

In [3]:
loader = PyPDFLoader("environmental_sci.pdf")

# The text splitter is used to split the document into chunks
# Mess with the parameters to see how it affects the output
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=100,
    length_function=len,
    is_separator_regex=False,
)

chunks = loader.load_and_split(text_splitter=text_splitter)

print(chunks[25].page_content)

form the hydrosphere, the air constitutes the atmosphere, and the biosphere
contains the entire community of living organisms.
Materials move cyclically among these spheres. They originate in the rocks
(lithosphere) and are released by weathering or by volcanism. They enter
water (hydrosphere) from where those serving as nutrients are taken up
by plants and from there enter animals and other organisms (biosphere).
From living organisms they may enter the air (atmosphere) or water
(hydrosphere). Eventually they enter the oceans (hydrosphere), where
they are taken up by marine organisms (biosphere). These return them to
the air (atmosphere), from where they are washed to the ground by rain,
thus returning to the land.
The idea that biogeochemical cycles are components of an overall system raises an obvious question:
what drives this system? It used to be thought that the global system is purely mechanical, driven by
physical forces, and, indeed, this is the way it can seem. Volcanoes, fr

In [41]:
len(chunks)
print(chunks[4])

page_content='4 Biosphere 137
32. Biosphere, biomes, biogeography 137
33. Major biomes 141
34. Nutrient cycles 147
35. Respiration and photosynthesis 151
36. Trophic relationships 151
37. Energy, numbers, biomass 160
38. Ecosystems 163
39. Succession and climax 168
40. Arrested successions 172
41. Colonization 176
42. Stability, instability, and reproductive strategies 179
43. Simplicity and diversity 183
44. Homoeostasis, feedback, regulation 188
45. Limits of tolerance 192
Further reading 197
References 197
5 Biological Resources 200
46. Evolution 200
47. Evolutionary strategies and game theory 206
48. Adaptation 210
49. Dispersal mechanisms 214
50. Wildlife species and habitats 218
51. Biodiversity 222
52. Fisheries 227
53. Forests 233
54. Farming for food and fibre 239
55. Human populations and demographic change 249
56. Genetic engineering 250
Further reading 257
Notes 257
References 258
6 Environmental Management 261
57. Wildlife conservation 261
58. Zoos, nature reserves, wilder

In [43]:
# We will now use the from_documents method to create a vectorstore from the chunks
vectorstore = FAISS.from_documents(
    chunks, embedding=OpenAIEmbeddings(api_key=api_key)
)

# The number of retrieved chunks must be passed via search_kwargs
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
prompt = PromptTemplate.from_template(template)

In [44]:
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

In [45]:
# An overly complicated one-liner to test what the top 5 most similar chunks are to the question
# Use this to make sense of the output of the next cell
print("\n\n".join([x.page_content for x in vectorstore.similarity_search("What is the main cause of global warming?", k=5)]))

46 / Basics of Environmental Science
atmospheric greenhouse effect is real and important, and the gases which cause it are justly known
as ‘greenhouse gases’.
Both the global climate and atmospheric concentrations of greenhouse gases vary from time to time.
Studies of air trapped in bubbles inside ice cores from Greenland and from the Russian Vostok
station in Antarctica have revealed a clear and direct relationship between these variations and air
temperature, in the case of the Vostok cores back to about 160 000 years ago. The correlation is
convincing, although it is possible that the fluctuating greenhouse-gas concentration is an effect of
temperature change rather than the cause of it. As temperatures rose at the end of the last ice age, the
increase in the atmospheric concentration of carbon dioxide lagged behind the temperature (CALDER,
1999) and so carbon dioxide cannot have been the cause of the warming. There is also evidence that
the carbon dioxide concentration was far from

In [46]:
chain.invoke("What is the main cause of global warming?")

'The main cause of global warming is debated, with some scientists attributing it to changes in solar output and volcanic eruptions rather than human intervention.'

Try RAG yourself! Take a file of your choice and apply the same concepts. 

In [6]:
loader = PyPDFLoader("Manifesto.pdf")

# The text splitter is used to split the document into chunks
# Mess with the parameters to see how it affects the output
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=100,
    length_function=len,
    is_separator_regex=False,
)

chunks = loader.load_and_split(text_splitter=text_splitter)

model = ChatOpenAI(api_key=api_key)

vectorstore = FAISS.from_documents(
    chunks, embedding=OpenAIEmbeddings(api_key=api_key)
)

# The number of retrieved chunks must be passed via search_kwargs
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
prompt = PromptTemplate.from_template(template)

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)



In [20]:
chain.invoke("True or false: The government should control the economy of the entire nation.")

'True'

Mess with the splitting method ([LangChain splitters](https://python.langchain.com/docs/modules/data_connection/document_transformers/)), the parameters to the splitter, and the number of retrieved chunks that are injected into the LLM's prompt as context. These will significantly impact how the LLM performs and answers questions.
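
To get a feel for how `chunk_size` and `chunk_overlap` interact, the sketch below uses a plain sliding window over characters. This is a deliberate simplification and an assumption for illustration: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and newlines, so its chunk boundaries and counts will differ. `split_fixed` is a hypothetical helper, and the 10,000-character text is made up.

```python
def split_fixed(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Slide a window of chunk_size characters, stepping by chunk_size - chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


text = "x" * 10_000  # hypothetical 10,000-character document
for size, overlap in [(2000, 100), (500, 100), (500, 0)]:
    pieces = split_fixed(text, size, overlap)
    print(f"chunk_size={size}, chunk_overlap={overlap}: {len(pieces)} chunks")
```

Smaller chunks and larger overlaps both increase the number of chunks, which means more embedding calls and more candidates competing for the few slots retrieved into the prompt.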

## Advanced RAG

We leave this as an (optional) challenge for you. How can we implement advanced RAG methods in LangChain?

1. Find some data that you would like to perform RAG over. 
2. Implement some form of advanced search with LangChain. 

Note: The LangChain [EnsembleRetriever](https://python.langchain.com/docs/modules/data_connection/retrievers/ensemble) may be of use.
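
As a starting point, the LangChain documentation describes the EnsembleRetriever as merging the results of several retrievers with Reciprocal Rank Fusion. The sketch below shows that idea in plain Python. It is not LangChain's implementation: the function name, the document ids, and the constant `c=60` (a commonly used default) are all assumptions for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, weights=None, c=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(weight / (c + rank)) over the lists it
    appears in, with rank starting at 1.
    """
    if weights is None:
        weights = [1.0] * len(rankings)
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)


keyword_hits = ["doc_a", "doc_b", "doc_c"]  # hypothetical keyword-search results
vector_hits = ["doc_b", "doc_d", "doc_a"]   # hypothetical vector-search results
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Documents that rank well in both lists rise to the top, so a keyword retriever and a vector retriever can cover each other's blind spots.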

In [None]:
pass