# RAG with Pinecone and Chroma

## Similarity Searching using top_k

In this notebook, you will learn to load data, split it, create embeddings and store them in either Pinecone or ChromaDB, and perform similarity search, which is a way to ask questions of your documents.

Covered topics:
  - LangChain
    - Document Loaders
    - Text Splitters
    - Chat Models
  - Retrieval Augmented Generation (RAG)

## Setup

Begin by installing the required Python libraries and cloning a repository with `git`. The `langchain-tutorials` repository inspired this notebook, and much of the work here was drawn from Greg's work. Shout-out!

If you'd like to use your own data, comment out the `git clone` command and update the `file_path` in `TextLoader` appropriately.

In [12]:
%pip install -qU langchain pinecone-client python-dotenv \
  openai cohere tiktoken chromadb
# Version: 0.0.164

!git clone https://github.com/gkamradt/langchain-tutorials.git

Note: you may need to restart the kernel to use updated packages.


fatal: destination path 'langchain-tutorials' already exists and is not an empty directory.


In [13]:
# PDF Loaders. If unstructured gives you a hard time, try PyPDFLoader
from langchain.document_loaders import UnstructuredPDFLoader, OnlinePDFLoader, PyPDFLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
#from dotenv import load_dotenv
import os

#load_dotenv()

### Stage the Data Loader

Customize the file path of the data you'd like to load.

If you're in Google Colab and cloned the repo above, the relative path below should work as-is because Colab's working directory is `/content`. If the file is not found, use the absolute path by prepending `/content/`.

In [14]:
loader = TextLoader(file_path="langchain-tutorials/data/PaulGrahamEssays/vb.txt")

## Other options for loaders 
# loader = PyPDFLoader("../data/field-guide-to-data-science.pdf")
# loader = UnstructuredPDFLoader("../data/field-guide-to-data-science.pdf")
# loader = OnlinePDFLoader("https://wolfpaulus.com/wp-content/uploads/2017/05/field-guide-to-data-science.pdf")

In [15]:
# Load the data
data = loader.load()

Let's take a look at the data we loaded before processing it further.

In [16]:
# Note: If you're using PyPDFLoader then it will split by page for you already
print (f'You have {len(data)} document(s) in your data')
print (f'There are {len(data[0].page_content)} characters in your sample document')
print (f'Here is a sample: {data[0].page_content[:200]}')

You have 1 document(s) in your data
There are 9156 characters in your sample document
Here is a sample: January 2016Life is short, as everyone knows. When I was a kid I used to wonder
about this. Is life actually short, or are we really complaining
about its finiteness?  Would we be just as likely to fe


### Chunk your data up into smaller documents

While we could pass the entire essay to a model with a long context window, we want to be picky about which information we share with our model. The better the signal-to-noise ratio, the more likely we are to get the right answer.

The first thing we'll do is chunk up our document into smaller pieces. The goal will be to take only a few of those smaller pieces and pass them to the LLM.

In [17]:
# Note: If you're using PyPDFLoader then we'll be splitting for the 2nd time.
# This is optional, test out on your own data.

text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=128)
texts = text_splitter.split_documents(data)

In [18]:
# Let's see how many small chunks we have
print (f'Now you have {len(texts)} documents')

Now you have 23 documents
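
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally (that class also tries to split on separators such as paragraph and line breaks). The function name `chunk_text` and the sample string are hypothetical and exist only for illustration.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only: real splitters also respect separators.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Hypothetical sample text, small enough to inspect by eye
sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
```

Each chunk repeats the last 4 characters of the previous one, which is why a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.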


### Create embeddings of your documents to get ready for semantic search

In [19]:
from langchain.vectorstores import Chroma, Pinecone
from langchain.embeddings.openai import OpenAIEmbeddings
import pinecone

Check whether an environment variable holds your API key. If not, the value you put below is used instead.

In [20]:
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY', '')

embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
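
An empty key only fails later, when the first API call is made. As a sketch of one way to fail earlier, the hypothetical helper below (`resolve_api_key` is not part of LangChain or OpenAI) reads the environment first, falls back to a typed-in value, and raises if both are empty. The `env` parameter and the fake keys are assumptions added so the example runs without real credentials.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    name: str,
    fallback: str = "",
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the key from the environment, else the fallback; raise if both are empty."""
    env = os.environ if env is None else env
    key = env.get(name) or fallback
    if not key:
        raise RuntimeError(f"{name} is not set and no fallback was provided")
    return key


# Demonstration with a fake environment, so no real key is needed
fake_env = {"OPENAI_API_KEY": "sk-example"}
print(resolve_api_key("OPENAI_API_KEY", env=fake_env))
print(resolve_api_key("PINECONE_API_KEY", fallback="typed-in-key", env=fake_env))
```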

### Option #1: Pinecone
If you want to use Pinecone, run the code below; otherwise, skip to the Chroma section below it. You must first set up an account at [Pinecone.io](https://www.pinecone.io/).

In [27]:
PINECONE_API_KEY = os.getenv('PINECONE_API_KEY', '')
PINECONE_API_ENV = os.getenv('PINECONE_API_ENV', 'us-central1-gcp') # You may need to switch with your env

# initialize pinecone
pinecone.init(
    api_key=PINECONE_API_KEY,  # find at app.pinecone.io
    environment=PINECONE_API_ENV  # next to api key in console
)

# Create the index
#pinecone.create_index('langchaintest', dimension=1536)

# connect to index
index_name = "langchaintest"

docsearch = Pinecone.from_texts([t.page_content for t in texts], embeddings, index_name=index_name)

### Option #2: Chroma

Chroma runs locally, so it is private and easy to set up without an account.

In [21]:
# load it into Chroma
vectorstore = Chroma.from_documents(texts, embeddings)

Let's test it out by searching for the documents most closely related to our query:

In [22]:
# In our code, ChromaDB uses `vectorstore`
query = "What is great about having kids?"
docs = vectorstore.similarity_search(query)

In [23]:
# Print each document that was returned
for doc in docs:
    print (f"{doc.page_content}\n")

how it tricks you.  The area under the curve is small, but its shape
jabs into your consciousness like a pin.The things that matter aren't necessarily the ones people would
call "important."  Having coffee with a friend matters.  You won't
feel later like that was a waste of time.One great thing about having small children is that they make you
spend time on things that matter: them. They grab your sleeve as
you're staring at your phone and say "will you play with me?" And

the question, and the answer is that life actually is short.Having kids showed me how to convert a continuous quantity, time,
into discrete quantities. You only get 52 weekends with your 2 year
old.  If Christmas-as-magic lasts from say ages 3 to 10, you only
get to watch your child experience it 8 times.  And while it's
impossible to say what is a lot or a little of a continuous quantity
like time, 8 is not a lot of something.  If you had a handful of 8

January 2016Life is short, as everyone knows. When I was a ki
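
Conceptually, a similarity search embeds the query, scores it against every stored vector, and returns the `k` best matches (LangChain's `similarity_search` accepts a `k` argument for this). The sketch below shows that idea with cosine similarity in plain Python. The metric is an assumption for illustration, since each vector store is configured with its own distance function, and the tiny 3-dimensional vectors are made up, not real embeddings.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (assumes neither is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (index, score) for the k vectors most similar to the query."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # highest score first
    return scored[:k]


# Made-up toy vectors standing in for document embeddings
doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query_vec = [1.0, 0.2, 0.0]

results = top_k(query_vec, doc_vecs, k=2)
print([index for index, _ in results])
```

A brute-force scan like this is fine for a handful of chunks. Dedicated vector stores exist because they index the vectors so the search stays fast at much larger scale.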

### Query those docs to get your answer back

I use `ChatOpenAI` because I prefer the other LangChain classes that go with it. You may also use an LLM class (rather than a chat model) for the same task.

<s>
I encourage you to specify the model name because you will save money.

- For example, "gpt-3.5-turbo-1106" is ***far far far*** cheaper than the base models.
</s>

Base models were retired on January 4th, 2024.

In [32]:
from langchain.chat_models import ChatOpenAI
from langchain.chains.question_answering import load_qa_chain

In [38]:
llm = ChatOpenAI(model_name="gpt-3.5-turbo-1106", temperature=0.7, openai_api_key=OPENAI_API_KEY)
chain = load_qa_chain(llm, chain_type="stuff")

In [29]:
# In our code, Pinecone uses `docsearch`; if you chose Chroma, use `vectorstore` instead
query = "What is great about having kids?"
docs = docsearch.similarity_search(query)

In [39]:
# Run a chain to perform RAG
chain.run(input_documents=docs, question=query)

'One great thing about having small children is that they make you spend time on things that matter: them. They grab your sleeve as you\'re staring at your phone and say "will you play with me?" This helps you focus on the important things in life and spend quality time with your children.'
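
The `"stuff"` chain type places every retrieved document into a single prompt alongside the question. The sketch below imitates that idea in plain Python so you can see what the model receives. The function `build_stuff_prompt` and its template wording are hypothetical and are not LangChain's actual prompt. The two chunks are shortened excerpts from the search results above.

```python
from typing import List


def build_stuff_prompt(chunks: List[str], question: str) -> str:
    """Concatenate all retrieved chunks into one prompt (hypothetical template)."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Shortened excerpts of the retrieved chunks, used here as plain strings
retrieved = [
    "One great thing about having small children is that they make you "
    "spend time on things that matter: them.",
    "Having kids showed me how to convert a continuous quantity, time, "
    "into discrete quantities.",
]

prompt = build_stuff_prompt(retrieved, "What is great about having kids?")
print(prompt)
```

Because everything goes into one prompt, this approach only works while the combined chunks fit in the model's context window, which is another reason to keep chunks small and retrieve only a few.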