# **VectorStore and Embeddings**

Vector stores and embeddings let us retrieve relevant information from large text datasets efficiently. Embeddings convert text into numerical vectors that capture semantic meaning, so search can match on context and similarity rather than exact keywords. Vector stores index these embeddings for fast, scalable similarity search, which underpins applications such as recommendation systems, information retrieval, and other natural language processing tasks. Together they make it practical to search large amounts of text by meaning.

In [1]:
%%capture
# update or install the necessary libraries
!pip install --upgrade langchain langchain_community langchain_aws pypdf tiktoken chromadb

In [2]:
import os
from google.colab import userdata
os.environ["AWS_ACCESS_KEY_ID"] = userdata.get('AWS_ACCESS_KEY_ID')
os.environ["AWS_SECRET_ACCESS_KEY"] = userdata.get('AWS_SECRET_ACCESS_KEY')
os.environ["AWS_DEFAULT_REGION"] = userdata.get('AWS_DEFAULT_REGION')

We just discussed `Document Loading` and `Splitting`. The next two cells repeat those steps to prepare the data.


In [4]:
from langchain_community.document_loaders import PyPDFLoader

# Load PDF
loaders = [
    # Each lecture is loaded once; loading a file twice would simulate messy, duplicated data
    PyPDFLoader("/content/content/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("/content/content/MachineLearning-Lecture02.pdf"),
    PyPDFLoader("/content/content/MachineLearning-Lecture03.pdf"),
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

In [5]:
# Split
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,    # maximum characters per chunk
    chunk_overlap=150,  # characters shared between neighbouring chunks
)

splits = text_splitter.split_documents(docs)

len(splits)

151
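
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplified stand-in, not the algorithm `RecursiveCharacterTextSplitter` uses (which also tries to break on separators such as paragraphs and sentences). The function name `chunk_with_overlap` and the sample string are made up for illustration.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Hypothetical sample text, small enough to inspect by eye.
sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
```

Each chunk starts with the last three characters of the previous one, which is the role `chunk_overlap = 150` plays in the cell above: context that straddles a boundary appears in both neighbouring chunks.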

# **Embeddings**

Embedding is a technique that transforms text or other data into numerical vectors, capturing semantic relationships and contextual meaning. These vectors enable machines to process and analyze the data more effectively, facilitating tasks such as search, recommendation, and natural language understanding.

Let's take our splits and embed them.

In [6]:
# Embeddings

from langchain_aws import BedrockEmbeddings

embedding = BedrockEmbeddings(
    model_id="amazon.titan-embed-text-v2:0"
)

sentence1 = "i like Workplace conditions"
sentence2 = "i like Employees  Efficiency and Effectiveness"
sentence3 = " Employee’s Characteristics and Creativity"

embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)
embedding3 = embedding.embed_query(sentence3)


In [7]:
import numpy as np

np.dot(embedding1, embedding3)


np.float64(0.22245856395925306)
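
A raw dot product equals cosine similarity only when both vectors have unit length, and whether a model returns unit-length vectors depends on its configuration. The sketch below normalises explicitly. It uses small made-up vectors in place of real embeddings, and `cosine_similarity` is an illustrative helper, not a LangChain function.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional vectors standing in for real embeddings.
v1 = [1.0, 2.0, 0.0]
v2 = [2.0, 4.0, 0.0]  # same direction as v1, different length
v3 = [0.0, 0.0, 5.0]  # orthogonal to v1

print(cosine_similarity(v1, v2))  # close to 1: same direction
print(cosine_similarity(v1, v3))  # 0: nothing in common
```

With NumPy the same quantity can be written as `np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))`.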

# **Vectorstores**

A vector store is a database designed to store and manage numerical vectors, such as embeddings, for efficient retrieval and similarity search. It enables quick and accurate matching of vectors, facilitating tasks like nearest neighbor search, clustering, and recommendation systems based on vector similarity.

In [8]:
from langchain_community.vectorstores import Chroma
persist_directory = 'docs/chroma/'
# !rm -rf ./docs/chroma  # remove old database files if any
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory
)

print(vectordb._collection.count())

151
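
Conceptually, a vector store keeps (vector, text) pairs and returns the texts whose vectors are closest to a query vector. The sketch below is a brute-force, in-memory illustration of that idea with made-up 2-dimensional vectors. The class `TinyVectorStore` is hypothetical and does not reflect how Chroma is implemented; a production store relies on an index instead of scanning every vector.

```python
import math


class TinyVectorStore:
    """Brute-force store: keeps every vector and scans all of them per query."""

    def __init__(self) -> None:
        self._items: list[tuple[list[float], str]] = []

    def add(self, vector: list[float], text: str) -> None:
        self._items.append((vector, text))

    @staticmethod
    def _cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norms if norms else 0.0

    def similarity_search(self, query: list[float], k: int = 3) -> list[str]:
        """Return the k stored texts most similar to the query vector."""
        ranked = sorted(
            self._items,
            key=lambda item: self._cosine(query, item[0]),
            reverse=True,
        )
        return [text for _, text in ranked[:k]]


# Made-up vectors and labels, for illustration only.
store = TinyVectorStore()
store.add([1.0, 0.0], "course logistics")
store.add([0.9, 0.1], "contact email for questions")
store.add([0.0, 1.0], "linear regression")

top = store.similarity_search([0.0, 0.9], k=1)
print(top)
```

The scan costs one similarity computation per stored vector, which is fine for 151 chunks but is the reason dedicated vector stores exist for larger collections.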


In [9]:
question = "is there an email i can ask for help"

In [10]:
docs = vectordb.similarity_search(question, k=3)

In [11]:
docs[0].page_content

"cs229-qa@cs.stanford.edu. This goes to an account that's read by all the TAs and me. So \nrather than sending us email individually, if you send email to this account, it will \nactually let us get back to you maximally quickly with answers to your questions.  \nIf you're asking questions about homework problems, please say in the subject line which \nassignment and which question the email refers to, since that will also help us to route \nyour question to the appropriate TA or to me appropriately and get the response back to \nyou quickly.  \nLet's see. Skipping ahead — let's see — for homework, one midterm, one open and term \nproject. Notice on the honor code. So one thing that I think will help you to succeed and \ndo well in this class and even help you to enjoy this class more is if you form a study \ngroup.  \nSo start looking around where you're sitting now or at the end of class today, mingle a \nlittle bit and get to know your classmates. I strongly encourage you to form st

In [12]:
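# Let's save this so we can use it later!
# Chroma 0.4 and newer write to persist_directory automatically, so the explicit
# call below is deprecated and only needed with older versions.
# vectordb.persist()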


# **Failure modes**

This works well, and basic similarity search will get you about 80% of the way there with very little effort.

But some failure modes can creep in.

Here are some edge cases that can arise; we'll fix them in the next class.

In [13]:
question = "what did they say about matlab?"

In [14]:
docs = vectordb.similarity_search(question, k=5)

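Watch for duplicate chunks: if the same file is indexed more than once (for example, a second copy of MachineLearning-Lecture01.pdf), identical chunks are returned side by side.

Semantic search fetches the most similar chunks, but does not enforce diversity.

In the output below, docs[0] and docs[1] are not identical, but both come from the same page of Lecture 1.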

In [15]:
docs[0]

Document(metadata={'total_pages': 22, 'title': '', 'creationdate': '2008-07-11T11:25:23-07:00', 'moddate': '2008-07-11T11:25:23-07:00', 'author': '', 'source': '/content/content/MachineLearning-Lecture01.pdf', 'page_label': '9', 'page': 8, 'creator': 'PScript5.dll Version 5.2.2', 'producer': 'Acrobat Distiller 8.1.0 (Windows)'}, page_content='into his office and he said, "Oh, professor, professor, thank you so much for your \nmachine learning class. I learned so much from it. There\'s this stuff that I learned in your \nclass, and I now use every day. And it\'s helped me make lots of money, and here\'s a \npicture of my big house."  \nSo my friend was very excited. He said, "Wow. That\'s great. I\'m glad to hear this \nmachine learning stuff was actually useful. So what was it that you learned? Was it \nlogistic regression? Was it the PCA? Was it the data networks? What was it that you \nlearned that was so helpful?" And the student said, "Oh, it was the MATLAB."  \nSo for those of you

In [16]:
docs[1]

Document(metadata={'page': 8, 'creator': 'PScript5.dll Version 5.2.2', 'total_pages': 22, 'author': '', 'producer': 'Acrobat Distiller 8.1.0 (Windows)', 'moddate': '2008-07-11T11:25:23-07:00', 'page_label': '9', 'creationdate': '2008-07-11T11:25:23-07:00', 'title': '', 'source': '/content/content/MachineLearning-Lecture01.pdf'}, page_content='those homeworks will be done in either MATLAB or in Octave, which is sort of — I \nknow some people call it a free version of MATLAB, which it sort of is, sort of isn\'t.  \nSo I guess for those of you that haven\'t seen MATLAB before, and I know most of you \nhave, MATLAB is I guess part of the programming language that makes it very easy to \nwrite codes using matrices, to write code for numerical routines, to move data around, to \nplot data. And it\'s sort of an extremely easy to learn tool to use for implementing a lot of \nlearning algorithms.  \nAnd in case some of you want to work on your own home computer or something if you \ndon\'t have
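
One simple guard against repeated results is to drop chunks whose text has already been seen. The sketch below assumes search results are objects with a `page_content` attribute, as in the output above. The `Chunk` dataclass, the `drop_duplicate_chunks` helper, and the sample texts are stand-ins introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Stand-in for a retrieved document with text and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def drop_duplicate_chunks(results: list[Chunk]) -> list[Chunk]:
    """Keep the first occurrence of each distinct text, preserving rank order."""
    seen: set[str] = set()
    unique = []
    for chunk in results:
        key = chunk.page_content.strip()
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique


# Hypothetical ranked results containing one exact repeat.
results = [
    Chunk("homework uses MATLAB or Octave", {"page": 8}),
    Chunk("homework uses MATLAB or Octave", {"page": 8}),  # exact duplicate
    Chunk("MATLAB makes matrix code easy", {"page": 8}),
]
unique = drop_duplicate_chunks(results)
print([c.page_content for c in unique])
```

This removes only exact repeats. LangChain vector stores also expose `max_marginal_relevance_search`, which trades some similarity for diversity among the returned chunks.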

This exposes a second failure mode.

The query below asks about the third lecture, but the results also include chunks from other lectures: "third lecture" is structured information that the semantic embedding does not capture well.

In [17]:
question = "what did they say about regression in the third lecture?"
docs = vectordb.similarity_search(question,k=5)
for doc in docs:
    print(doc.metadata)
print(docs[4].page_content)

{'producer': 'Acrobat Distiller 8.1.0 (Windows)', 'author': '', 'creator': 'PScript5.dll Version 5.2.2', 'title': '', 'creationdate': '2008-07-11T11:25:03-07:00', 'source': '/content/content/MachineLearning-Lecture03.pdf', 'page_label': '1', 'page': 0, 'moddate': '2008-07-11T11:25:03-07:00', 'total_pages': 16}
{'title': '', 'moddate': '2008-07-11T11:25:05-07:00', 'source': '/content/content/MachineLearning-Lecture02.pdf', 'creationdate': '2008-07-11T11:25:05-07:00', 'author': '', 'producer': 'Acrobat Distiller 8.1.0 (Windows)', 'total_pages': 18, 'page': 0, 'page_label': '1', 'creator': 'PScript5.dll Version 5.2.2'}
{'producer': 'Acrobat Distiller 8.1.0 (Windows)', 'page_label': '4', 'creationdate': '2008-07-11T11:25:03-07:00', 'creator': 'PScript5.dll Version 5.2.2', 'title': '', 'source': '/content/content/MachineLearning-Lecture03.pdf', 'page': 3, 'moddate': '2008-07-11T11:25:03-07:00', 'total_pages': 16, 'author': ''}
{'creationdate': '2008-07-11T11:25:03-07:00', 'page_label': '7',
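
Because "third lecture" is a structured constraint and not a topic, one way to honour it is to filter on metadata instead of relying on the embedding. The sketch below post-filters a hypothetical result list by its `source` field. The sample dictionaries and the `filter_by_source` helper are illustrative; the paths are copied from the metadata printed above.

```python
def filter_by_source(results: list[dict], source: str) -> list[dict]:
    """Keep only results whose metadata 'source' equals the requested file."""
    return [r for r in results if r["metadata"].get("source") == source]


# Hypothetical search results shaped like the metadata printed above.
results = [
    {"metadata": {"source": "/content/content/MachineLearning-Lecture03.pdf", "page": 0}},
    {"metadata": {"source": "/content/content/MachineLearning-Lecture02.pdf", "page": 0}},
    {"metadata": {"source": "/content/content/MachineLearning-Lecture03.pdf", "page": 3}},
]

lecture3 = filter_by_source(results, "/content/content/MachineLearning-Lecture03.pdf")
print([r["metadata"]["page"] for r in lecture3])
```

Filtering after the search can leave fewer than `k` results. The LangChain Chroma wrapper's `similarity_search` also accepts a `filter` dictionary on metadata, so the restriction can be applied inside the store instead.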

# **Let's Do an Activity**

## **Objective**

In this activity, you will learn to use embeddings and vector stores to perform efficient similarity searches and data retrieval. You will practice creating embeddings from text data, storing them in a vector store, and retrieving relevant information based on similarity queries.

## **Scenario**

You are building a recommendation system that suggests documents based on user queries. To achieve this, you will use LangChain to create embeddings from text data and store these embeddings in a vector store. You will then use the vector store to find the most relevant documents for a given query.

## **Steps**

* Load and Split Documents
* Create Embeddings
* Store Embeddings in Vector Store
* Perform Similarity Search
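
The four steps above can be sketched end to end without any external service. This is a toy illustration only: the bag-of-words `embed` function stands in for a real embedding model such as the Bedrock one used earlier, a plain list stands in for the vector store, and the sample texts are made up.

```python
import math
from collections import Counter

# 1. Load and split: made-up documents, split into sentences as "chunks".
documents = [
    "Linear regression fits a line to data. Gradient descent minimises the cost.",
    "Email the teaching staff with questions. Form a study group early.",
]
chunks = [s.strip() for doc in documents for s in doc.split(".") if s.strip()]


# 2. Create embeddings: a toy word-count vector.
#    A real model captures meaning; this one only counts shared words.
def embed(text: str) -> Counter:
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norms = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norms if norms else 0.0


# 3. Store embeddings: keep (embedding, chunk) pairs in a list.
index = [(embed(chunk), chunk) for chunk in chunks]


# 4. Similarity search: rank stored chunks against the embedded query.
def search(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]


best = search("who do I email with questions", k=1)
print(best)
```

For the activity itself, replace each toy piece with its LangChain counterpart from this notebook: the PDF loader and text splitter for step 1, `BedrockEmbeddings` for step 2, `Chroma.from_documents` for step 3, and `similarity_search` for step 4.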