<a href="https://colab.research.google.com/github/zubair-hafeez/RaG-OpenAI/blob/main/RAG_OpenAI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Retrieval augmented generation

In retrieval augmented generation (RAG), an LLM retrieves contextual documents from an external dataset as part of its execution.

This is useful if we want to ask questions about specific documents (e.g., our PDFs, a set of videos, etc.).

In [None]:
# @title
! pip install langchain
! pip install openai
! pip install langchain-community
! pip install pypdf
! pip install tiktoken
! pip install chromadb
! pip install lark

Collecting langchain
  Downloading langchain-0.2.12-py3-none-any.whl.metadata (7.1 kB)
Collecting langchain-core<0.3.0,>=0.2.27 (from langchain)
  Downloading langchain_core-0.2.28-py3-none-any.whl.metadata (6.2 kB)
Collecting langchain-text-splitters<0.3.0,>=0.2.0 (from langchain)
  Downloading langchain_text_splitters-0.2.2-py3-none-any.whl.metadata (2.1 kB)
Collecting langsmith<0.2.0,>=0.1.17 (from langchain)
  Downloading langsmith-0.1.96-py3-none-any.whl.metadata (13 kB)
Collecting tenacity!=8.4.0,<9.0.0,>=8.1.0 (from langchain)
  Downloading tenacity-8.5.0-py3-none-any.whl.metadata (1.2 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain-core<0.3.0,>=0.2.27->langchain)
  Downloading jsonpatch-1.33-py2.py3-none-any.whl.metadata (3.0 kB)
Collecting orjson<4.0.0,>=3.9.14 (from langsmith<0.2.0,>=0.1.17->langchain)
  Downloading orjson-3.10.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (50 kB)

In [None]:
import os
import openai
import sys
sys.path.append('/content/')

In [None]:
from google.colab import userdata

# Read the key once from Colab's secret store and reuse it
openai_api_key = userdata.get('OPENAI_API_KEY')
openai.api_key = openai_api_key
userdata.get('DUMMY_API_KEY')

'sk-XXXXXXXXXXXXXXX'

## PDFs

Let's load a PDF. The course this notebook follows uses a [transcript](https://see.stanford.edu/materials/aimlcs229/transcripts/MachineLearning-Lecture01.pdf) from Andrew Ng's famous CS229 course; here we load a local file, `Sample.pdf`, instead. Text extracted from a PDF is not always clean, so words and sentences are sometimes split unexpectedly.

In [None]:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("Sample.pdf")
pages = loader.load()

Each page is a `Document`.

A `Document` contains text (`page_content`) and `metadata`.
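As a rough mental model (not LangChain's actual class), a `Document` can be pictured as a small record with two fields. The `SimpleDocument` name below is hypothetical and only mirrors the two attributes mentioned above.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded page: text plus descriptive metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# One record per PDF page, shaped like the objects the loader returns
example_page = SimpleDocument(
    page_content="Essential Generative AI Training Program Details",
    metadata={"source": "Sample.pdf", "page": 0},
)
print(example_page.metadata["source"], len(example_page.page_content))
```

Keeping the source file and page number next to the text is what later lets an answer be traced back to the page it came from.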

In [None]:
len(pages)

11

In [None]:
page = pages[0]

In [None]:
print(page.page_content[0:500])

Essential
Generative
AI
Training
Program
Details
Updated
July
28,
2024,
10:45
pm
PKT
Table
of
Contents:
1.0
Schedule:
Essential
Gen-AI
Training
-
Online
only
4
2.0
Module
1:
Intro
to
Gen-AI
Opening
Session,
Agenda
5
3.0
Important
Online
Links
6
1)
Microsoft
Teams
Link
for
the
Main
Training
Sessions:
6
2)
Join
us
on
Follow
Up
Technical
meetings
at
7
PM
PKT
every
weekday.
6
3)
Link
to
The
White
Paper
on
Gen-AI
&
Applications
for
Vertical
Industries
6
4)
Link
to
Generative
AI
&
New
Opportunities,
a


In [None]:
page.metadata

{'source': 'Sample.pdf', 'page': 0}

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

In [None]:
chunk_size = 26
chunk_overlap = 4

In [None]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

Why doesn't this split the string below?

In [None]:
text1 = 'abcdefghijklmnopqrstuvwxyz'

In [None]:
r_splitter.split_text(text1)

In [None]:
text2 = 'abcdefghijklmnopqrstuvwxyzabcdefg'

In [None]:
r_splitter.split_text(text2)

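OK, this splits the string. With `chunk_overlap = 4` as set above, the second chunk should start with the last four characters of the first one. Try other values of `chunk_overlap` to see how the overlap changes.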
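The behavior on `text1` and `text2` can be pictured with a plain sliding window. This is a simplified sketch, not LangChain's implementation: `sliding_chunks` is a hypothetical helper, and it assumes the text contains no separators to split on and that `chunk_overlap` is smaller than `chunk_size`.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size that share chunk_overlap characters."""
    if len(text) <= chunk_size:
        return [text]  # already fits in one chunk, so there is nothing to split
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window reached the end of the text
        start += step
    return chunks


print(sliding_chunks("abcdefghijklmnopqrstuvwxyz", 26, 4))
print(sliding_chunks("abcdefghijklmnopqrstuvwxyzabcdefg", 26, 4))
```

A 26-character string with `chunk_size = 26` fits in a single chunk, which is why `text1` comes back unsplit. The 33-character string needs a second window, which starts 4 characters before the end of the first.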

In [None]:
text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"

In [None]:
r_splitter.split_text(text3)

In [None]:
c_splitter.split_text(text3)

In [None]:
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator = ' '
)
c_splitter.split_text(text3)
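`CharacterTextSplitter` splits on a single separator, which defaults to `"\n\n"`. Since `text3` contains no blank line, the first `c_splitter` call should leave it in one piece, while `separator = ' '` gives the splitter something to cut on. The sketch below illustrates that idea in plain Python; `split_on_separator` is a hypothetical helper that ignores overlap for simplicity and is not the library's code.

```python
def split_on_separator(text, separator, chunk_size):
    """Split on one separator, then greedily merge pieces up to chunk_size."""
    chunks = []
    current = ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Keep growing the chunk; a single oversized piece is kept whole
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "a b c d e f g h i j k l m n o p q r s t u v w x y z"
print(split_on_separator(sample, "\n\n", 26))  # no blank line, so one oversized chunk
print(split_on_separator(sample, " ", 26))     # words merged into chunks of at most 26 chars
```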

Try your own examples!

## Recursive splitting details

`RecursiveCharacterTextSplitter` is recommended for generic text.

In [None]:
some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n  \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""

In [None]:
len(some_text)

In [None]:
c_splitter = CharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separator = ' '
)
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separators=["\n\n", "\n", " ", ""]
)

In [None]:
c_splitter.split_text(some_text)

In [None]:
r_splitter.split_text(some_text)
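The `separators` list is tried in order: paragraphs first, then lines, then words. A minimal sketch of that recursion is shown below. It is an illustration under simplifying assumptions (no overlap, hard cuts once the separators run out), and `recursive_split` is a hypothetical helper rather than the library's implementation.

```python
def recursive_split(text, separators, chunk_size):
    """Split on the coarsest separator first; recurse with finer ones for oversized pieces."""
    if len(text) <= chunk_size:
        return [text]
    if not separators:
        # No separators left: fall back to fixed-width cuts
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks = []
    current = ""
    for piece in text.split(sep):
        if len(piece) > chunk_size:
            # Flush what we have, then break the large piece down further
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, finer, chunk_size))
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


demo_text = "First paragraph here.\n\nSecond paragraph is a bit longer than the first."
for chunk in recursive_split(demo_text, ["\n\n", "\n", " "], chunk_size=40):
    print(repr(chunk))
```

The short first paragraph stays intact because it fits within the limit. Only the longer second paragraph is broken down at word boundaries, which is the reason a recursive splitter tends to keep related sentences together.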

# Vectorstores and Embeddings

Recall the overall workflow for retrieval augmented generation (RAG): document loading, splitting, storage, retrieval, and question answering.

We just discussed `Document Loading` and `Splitting`.

In [None]:
from langchain.document_loaders import PyPDFLoader

# Load PDF
loaders = [
    # A single PDF here; add more loaders to this list to index several documents
    PyPDFLoader("Sample.pdf")
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

In [None]:
# Split
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 150
)

In [None]:
splits = text_splitter.split_documents(docs)

In [None]:
len(splits)

16

## Embeddings

Let's take our splits and embed them.

In [None]:
from langchain.embeddings.openai import OpenAIEmbeddings
embedding = OpenAIEmbeddings(openai_api_key=openai_api_key)


In [None]:
word1 = "i love burgers"
word2 = "the weather is great today"
word3 = "Its raining outside"

In [None]:
embedding1 = embedding.embed_query(word1)
embedding2 = embedding.embed_query(word2)
embedding3 = embedding.embed_query(word3)

In [None]:
import numpy as np

In [None]:
np.dot(embedding2, embedding3)

0.8503085707989149
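The dot product works as a similarity score here because, for vectors of unit length, it equals the cosine of the angle between them. OpenAI embeddings are, to our knowledge, normalized to unit length, but that is an assumption worth checking for other embedding models. The sketch below uses made-up 3-dimensional vectors (real embeddings have many more dimensions) to show the general cosine formula.

```python
import math


def dot(u, v):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Dot product divided by the product of the vector lengths."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


# Toy vectors, invented for illustration only
weather_vec = [0.6, 0.8, 0.0]
rain_vec = [0.8, 0.6, 0.0]
burger_vec = [0.0, 0.0, 1.0]

print(cosine_similarity(weather_vec, rain_vec))    # close in direction, high score
print(cosine_similarity(weather_vec, burger_vec))  # orthogonal, score of zero
```

With the real embeddings, comparing `embedding1` against `embedding2` and `embedding3` in the same way shows how an unrelated sentence scores against the two weather sentences.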

## Vectorstores

In [None]:
from langchain.vectorstores import Chroma

In [None]:
persist_directory = 'docs/chroma/'

In [None]:
!rm -rf ./docs/chroma  # remove old database files if any

In [None]:
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory
)

In [None]:
print(vectordb._collection.count())

32


### Similarity Search

In [None]:
question = "What is Pak Angels Generative AI Training Program?"

In [None]:
docs = vectordb.similarity_search(question,k=3)

In [None]:
len(docs)

3
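Conceptually, a similarity search embeds the question, scores every stored chunk against it, and returns the `k` best matches. The brute-force sketch below shows that idea with invented 2-dimensional vectors; `top_k` and `toy_index` are hypothetical names, and a real vector store such as Chroma may use an approximate index rather than scoring every entry.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot_uv = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot_uv / (norm_u * norm_v)


def top_k(query_vec, indexed, k=3):
    """Return the texts of the k entries whose vectors are closest to query_vec."""
    ranked = sorted(indexed, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# (text, vector) pairs standing in for embedded chunks
toy_index = [
    ("schedule", [1.0, 0.0]),
    ("hackathon", [0.0, 1.0]),
    ("module agenda", [0.9, 0.1]),
    ("links", [0.5, 0.5]),
]
print(top_k([1.0, 0.0], toy_index, k=3))
```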

Let's save this so we can use it later!

In [None]:
# With chromadb 0.4 and later, data is persisted automatically and this call is deprecated
vectordb.persist()


# Retrieval

Retrieval is the centerpiece of our retrieval augmented generation (RAG) flow.

Let's get our vectorDB from before.

## Vectorstore retrieval


### Similarity Search

In [None]:
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
persist_directory = 'docs/chroma/'

In [None]:
embedding = OpenAIEmbeddings(openai_api_key=openai_api_key)
vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)


In [None]:
print(vectordb._collection.count())

32


# Question Answering

We discussed `Document Loading` and `Splitting` as well as `Storage` and `Retrieval`.

Let's load our vectorDB.

The code below sets the OpenAI model name explicitly. The course this notebook follows pinned the model version used during filming until its deprecation, at the time scheduled for September 2023.
LLM responses often vary from run to run, and they may differ significantly between model versions.

In [None]:
llm_name = "gpt-3.5-turbo"

In [None]:
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
persist_directory = 'docs/chroma/'
embedding = OpenAIEmbeddings(openai_api_key=openai_api_key)
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)

In [None]:
print(vectordb._collection.count())

32


In [None]:
question = "What are major topics for this class?"
docs = vectordb.similarity_search(question,k=3)
len(docs)

3

In [None]:
from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI(model_name=llm_name, temperature=0, openai_api_key=openai_api_key)


### RetrievalQA chain

In [None]:
from langchain.chains import RetrievalQA

In [None]:
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever()
)

In [None]:
result = qa_chain.invoke({"query": question})


In [None]:
result["result"]

'The major topics for this class seem to revolve around organizing a hackathon, understanding Gen-AI models like GPT, data cleaning and preprocessing for generative AI models, choosing the right model for your needs, and discussions on text, code, audio, and image models. Additionally, there are sessions on common pitfalls in AI models and interactive sessions like quizzes and discussions.'

### Prompt

In [None]:
from langchain.prompts import PromptTemplate

# Build prompt
template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Use three sentences maximum. Keep the answer as concise as possible. Always say "thanks for asking!" at the end of the answer.
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)
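To see what the chain sends to the model, it can help to fill a template by hand. By default `RetrievalQA.from_chain_type` uses the "stuff" strategy, which places all retrieved chunks into the `{context}` slot. The sketch below mimics that with plain string formatting; `toy_template`, `build_prompt`, and the sample chunks are hypothetical, and the exact way the library joins chunks is an assumption here.

```python
toy_template = """Use the following pieces of context to answer the question at the end.
{context}
Question: {question}
Helpful Answer:"""


def build_prompt(retrieved_texts, question):
    """Join the retrieved chunks and substitute them into the template."""
    context = "\n\n".join(retrieved_texts)
    return toy_template.format(context=context, question=question)


# Invented chunks standing in for the retriever's results
retrieved = [
    "Module 1: Intro to Gen-AI, opening session and agenda.",
    "Follow-up technical sessions run every weekday.",
]
print(build_prompt(retrieved, "What is the agenda for module 1?"))
```

Because every retrieved chunk is pasted into one prompt, the number and size of chunks must fit within the model's context window.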


In [None]:
# Run chain
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)

In [None]:
llm.invoke("What are the objectives of Pak Angels generative ai essential training?").content

'1. To provide participants with a comprehensive understanding of generative artificial intelligence and its applications in various industries.\n2. To equip participants with the necessary skills and knowledge to develop and implement generative AI solutions.\n3. To help participants understand the ethical implications and considerations of using generative AI technology.\n4. To enable participants to effectively communicate and collaborate with other professionals in the field of generative AI.\n5. To empower participants to stay updated on the latest advancements and trends in generative AI technology.\n6. To foster a community of like-minded individuals who are passionate about generative AI and its potential impact on society.'

In [None]:
question = "What is the agenda for module 1"
result = qa_chain.invoke({"query": question})
result["result"]


'The agenda for Module 1 is an introduction to Gen-AI during the opening session. Thanks for asking!'

In [None]:
result["source_documents"][0]

Document(metadata={'page': 0, 'source': 'Sample.pdf'}, page_content='Arif)\n10\nModule\n5:\nHands-on\nwith\nGenerative\nAI\nModels\n(Trainer:\nZubair\nZafar)\n11\nModule\n6:\nDeveloping\nGenerative\nAI\nApplications\n(Trainer:\nMuhammad\nDanish\nIqbal)\n12\nModule\n7:\nResponsible\nUse\nof\nGenerative\nAI\n(Trainer:\nAbdullah\nArif)\n12\nModule\n8:\nHow\nto\nSell\nThese\nGenerative\nAI\nSkills\nLike\na\nPro\n(Trainer:\nHassan\nSyed\n,\nNaeem \nZafar,\nM.\nAnwar\nKhan)\n12\n6.0\nFollow\nup\nTechnical\nSessions:\nAll\nWeekdays\nOnline\n13\nPurpose\nof\nFollow\nup\nOnline\nSessions\n13\n7.0\nHackathon\nConducted\nat\nthe\nEnd\nof\nGen-AI\nTraining\n15\nIntroduction\nto\nHackathons\n15\niCodeGuru\nand\nHackathons\n15')

Note: the LLM response varies between runs. The point is simply that the model does not have access to past questions or answers; this will be covered in the next section.

# References
https://learn.deeplearning.ai/courses/langchain-chat-with-your-data/


