# **VectorStore and Embeddings**

Vector stores and embeddings let us retrieve relevant passages from large text collections efficiently. Embeddings convert text into numerical vectors that capture semantic meaning, so search can match on context and similarity rather than exact keywords. Vector stores index these embeddings to support fast, scalable similarity search, which underpins recommendation systems, information retrieval, and other natural language processing tasks.

In [None]:
%%capture
# update or install the necessary libraries
!pip install --upgrade langchain langchain_community langchain_aws pypdf tiktoken chromadb python-dotenv

In [None]:
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

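# load_dotenv() has already copied the .env values into os.environ.
# Fail early with a clear message if any AWS setting is missing.
for key in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION"):
    if not os.getenv(key):
        raise EnvironmentError(f"{key} is not set; add it to your .env file")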

We just discussed `Document Loading` and `Splitting`.


In [None]:
from langchain_community.document_loaders import PyPDFLoader

# Load PDF
loaders = [
    # Duplicate documents on purpose - messy data
    PyPDFLoader("./content/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("./content/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("./content/MachineLearning-Lecture02.pdf"),
    PyPDFLoader("./content/MachineLearning-Lecture03.pdf"),
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

In [None]:
# Split
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,    # maximum characters per chunk
    chunk_overlap=150   # maximum characters of overlap between consecutive chunks
)

splits = text_splitter.split_documents(docs)

len(splits)

# **Embeddings**

Embedding is a technique that transforms text or other data into numerical vectors, capturing semantic relationships and contextual meaning. These vectors enable machines to process and analyze the data more effectively, facilitating tasks such as search, recommendation, and natural language understanding.

<br>


Let's take our splits and embed them.

In [None]:
# Embeddings

from langchain_aws import BedrockEmbeddings

embedding = BedrockEmbeddings(
    model_id="amazon.titan-embed-text-v2:0"
)

sentence1 = "i like Workplace conditions"
sentence2 = "i like Employees  Efficiency and Effectiveness"
sentence3 = " Employee’s Characteristics and Creativity"

embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)
embedding3 = embedding.embed_query(sentence3)


In [None]:
import numpy as np

np.dot(embedding1, embedding3)
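
The dot product above works as a similarity score, but it grows with vector length as well as with alignment. As a minimal sketch, assuming tiny made-up 3-dimensional vectors rather than real Bedrock embeddings, the cell below contrasts the raw dot product with cosine similarity, which divides out the lengths. The helper `cosine_similarity` and the three vectors are hypothetical names introduced only for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy "embeddings"; real models return far more dimensions
v_short = [1.0, 2.0, 0.0]
v_long = [10.0, 20.0, 0.0]   # same direction as v_short, ten times longer
v_other = [0.0, 0.0, 5.0]    # orthogonal to both vectors above

# The raw dot product is inflated by the length of v_long
dot_same_direction = float(np.dot(v_short, v_long))

# Cosine similarity depends only on direction
cos_same_direction = cosine_similarity(v_short, v_long)
cos_orthogonal = cosine_similarity(v_short, v_other)

print(dot_same_direction, cos_same_direction, cos_orthogonal)
```

If the embedding model already returns unit-length vectors, the two scores coincide and `np.dot` alone is enough.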


# **Vectorstores**

A vector store is a database designed to store and manage numerical vectors, such as embeddings, for efficient retrieval and similarity search. It enables quick and accurate matching of vectors, facilitating tasks like nearest neighbor search, clustering, and recommendation systems based on vector similarity.
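
To make the idea concrete, here is a rough sketch of what a vector store does at its simplest: keep vectors next to their texts and rank them against a query vector. `TinyVectorStore`, its methods, and the sample vectors are hypothetical and exist only for this illustration. It scans every stored vector, whereas production stores such as Chroma typically rely on approximate nearest-neighbor indexes to scale.

```python
import numpy as np

class TinyVectorStore:
    """Hypothetical in-memory store that scans every vector on each query."""

    def __init__(self):
        self.vectors = []
        self.texts = []

    def add(self, text, vector):
        v = np.asarray(vector, dtype=float)
        self.vectors.append(v / np.linalg.norm(v))  # store unit-length vectors
        self.texts.append(text)

    def similarity_search(self, query_vector, k=3):
        q = np.asarray(query_vector, dtype=float)
        q = q / np.linalg.norm(q)
        scores = np.stack(self.vectors) @ q         # cosine similarity per stored vector
        top = np.argsort(-scores)[:k]               # indices of the k highest scores
        return [(self.texts[i], float(scores[i])) for i in top]

store = TinyVectorStore()
# Made-up texts and 3-dimensional vectors standing in for real embeddings
store.add("office hours and contact email", [0.9, 0.1, 0.0])
store.add("linear regression derivation", [0.0, 0.2, 0.9])
store.add("course logistics", [0.7, 0.6, 0.1])

results = store.similarity_search([1.0, 0.0, 0.0], k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```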

In [None]:
from langchain_community.vectorstores import Chroma
persist_directory = 'docs/chroma/'
# !rm -rf ./docs/chroma  # remove old database files if any
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory
)

print(vectordb._collection.count())

In [None]:
question = "is there an email i can ask for help"

In [None]:
docs = vectordb.similarity_search(question,k=3)

In [None]:
docs[0].page_content

In [None]:
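# Let's save this so we can use it later!
# With chromadb 0.4 or later, data is written to persist_directory automatically,
# so this call only matters on older versions.
vectordb.persist()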

# **Failure modes**

This seems great, and basic similarity search will get you 80% of the way there very easily.

But there are some failure modes that can creep up.

Here are some edge cases that can arise - we'll fix them in the next class.

In [None]:
question = "what did they say about matlab?"

In [None]:
docs = vectordb.similarity_search(question,k=5)

Notice that we're getting duplicate chunks (because of the duplicate MachineLearning-Lecture01.pdf in the index).

Semantic search fetches all similar documents, but does not enforce diversity.

`docs[0]` and `docs[1]` are identical.

In [None]:
docs[0]

In [None]:
docs[1]
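
One simple way to see the problem, and a possible stopgap, is to drop results whose text has already appeared higher in the ranking. The sketch below assumes a hypothetical `Chunk` class standing in for a LangChain `Document`, with placeholder strings rather than real lecture text. `drop_exact_duplicates` is an illustrative helper, not part of LangChain or Chroma.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def drop_exact_duplicates(results):
    """Keep the first occurrence of each distinct page_content, preserving rank order."""
    seen = set()
    unique = []
    for doc in results:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            unique.append(doc)
    return unique

# Placeholder results: the first two entries mimic the same chunk indexed twice
ranked = [
    Chunk("placeholder text about MATLAB", {"page": 8}),
    Chunk("placeholder text about MATLAB", {"page": 8}),
    Chunk("placeholder text about Octave", {"page": 8}),
]

unique = drop_exact_duplicates(ranked)
print(len(ranked), "->", len(unique))
```

This only removes exact copies. Near-duplicate chunks still crowd the results, and after filtering you may end up with fewer than `k` documents.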

We can see a new failure mode.

The question below asks about the third lecture, but the results include chunks from other lectures as well.

In [None]:
question = "what did they say about regression in the third lecture?"
docs = vectordb.similarity_search(question,k=5)
for doc in docs:
    print(doc.metadata)
print(docs[4].page_content)
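
The phrase "third lecture" is a structured constraint, and the embedding of the question does not reliably capture it. As a hedged illustration of one workaround, the sketch below filters results on their `source` metadata after the search. The result dictionaries and page numbers are made up, and `filter_by_source` is a hypothetical helper.

```python
def filter_by_source(results, source):
    """Keep only results whose metadata 'source' equals the given path."""
    return [r for r in results if r["metadata"].get("source") == source]

# Hypothetical search results shaped like the metadata printed above
results = [
    {"metadata": {"source": "./content/MachineLearning-Lecture03.pdf", "page": 0}},
    {"metadata": {"source": "./content/MachineLearning-Lecture02.pdf", "page": 2}},
    {"metadata": {"source": "./content/MachineLearning-Lecture03.pdf", "page": 14}},
]

third_lecture_only = filter_by_source(
    results, "./content/MachineLearning-Lecture03.pdf"
)
for r in third_lecture_only:
    print(r["metadata"])
```

Filtering after the search can leave fewer than `k` results. LangChain's Chroma wrapper also accepts a `filter` argument on `similarity_search`, which applies the constraint inside the store and is probably the better fit when the source is known up front.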

# **Let's Do an Activity**

## **Objective**

In this activity, you will learn to use embeddings and vector stores to perform efficient similarity searches and data retrieval. You will practice creating embeddings from text data, storing them in a vector store, and retrieving relevant information based on similarity queries.

## **Scenario**

You are building a recommendation system that suggests documents based on user queries. To achieve this, you will use LangChain to create embeddings from text data and store these embeddings in a vector store. You will then use the vector store to find the most relevant documents for a given query.

## **Steps**

* Load and Split Documents
* Create Embeddings
* Store Embeddings in Vector Store
* Perform Similarity Search