# Vectorstores and Embeddings

Recall the overall workflow for retrieval augmented generation (RAG):

![overview.jpeg](attachment:overview.jpeg)

In [12]:
#!pip install openai langchain
#!pip install tiktoken
#!pip install python-dotenv
import os
import openai
import sys
from google.colab import drive
drive.mount('/content/drive')
sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
# Expect OPENAI_API_KEY to come from the environment or the local .env file; never hard-code or print it.
openai.api_key = os.environ['OPENAI_API_KEY']
!cd /content/drive/MyDrive/python_colab_prj/langchain-course/docs;ls -lt

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
total 302
-rw------- 1 root root 186730 Jul  8 07:07 Stanley_Yao_Resume.pdf
-rw------- 1 root root 121525 Jul  8 06:36 Gloria_Li_Resume.pdf


We just discussed `Document Loading` and `Splitting`.

In [14]:
!pip install pypdf
from langchain.document_loaders import PyPDFLoader

# Load PDF
loaders = [
    # Two different resumes, each loaded once
    PyPDFLoader("/content/drive/MyDrive/python_colab_prj/langchain-course/docs/Gloria_Li_Resume.pdf"),
    PyPDFLoader("/content/drive/MyDrive/python_colab_prj/langchain-course/docs/Stanley_Yao_Resume.pdf")
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

Collecting pypdf
  Downloading pypdf-3.12.0-py3-none-any.whl (254 kB)
Installing collected packages: pypdf
Successfully installed pypdf-3.12.0


In [15]:
# Split
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 150
)

In [16]:
splits = text_splitter.split_documents(docs)
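
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size character windows that overlap. It is a simplification for illustration only: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and newlines, which this sketch does not do. The function name `split_with_overlap` and the toy text are hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

toy_text = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = split_with_overlap(toy_text, chunk_size=10, chunk_overlap=3)
print(toy_chunks)
```

Each chunk repeats the last three characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk more often than it would without overlap.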

In [26]:
print(f'Number of splits: {len(splits)}')
print(type(splits))
print(splits)

Number of splits: 8
<class 'list'>
[Document(page_content='Gloria Li  \n+1 332.248.6208 | yl4661@columbia.edu  | linkedin.com/in/yutong -li-415780229    \nEDUCATION  \nColumbia University                                                                                                                      New York, NY \nB.A. in Economics                                                                                                                       Expected May 2023  \n• Cumulative GPA:  3.7/4.0；Honors: Dean ’s List（2021-2022） \n• Relevant Coursework:  Econometrics  (A-), Java and programming  (A), Financial Economics  (A-), \nCorporate Finance  (A), Statistics  (A) \n• Standardized Test Score:  GRE 335/340  (Verbal Reasoning: 165, 95th percentile; Quantitative \nReasoning:170, 96th percentile)  \nLanguages:  Mandarin , English , Japanese  \nTechnical Skills : SQL, Java, Excel, PowerPoint, Financial Modeling   \n \nPROFESSIONAL  EXPERIENCE  \nArk Technology        

## Embeddings

Let's take our splits and embed them.

In [27]:
from langchain.embeddings.openai import OpenAIEmbeddings
embedding = OpenAIEmbeddings()

In [28]:
sentence1 = "i like dogs"
sentence2 = "i like canines"
sentence3 = "the weather is ugly outside"

In [29]:
embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)
embedding3 = embedding.embed_query(sentence3)

In [30]:
import numpy as np

In [31]:
np.dot(embedding1, embedding2)

0.9631853877103518

In [32]:
np.dot(embedding1, embedding3)

0.7709997651294672

In [33]:
np.dot(embedding2, embedding3)

0.7596334120325523
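
The dot products above act as similarity scores: the two dog sentences score higher with each other than either does with the weather sentence. Assuming the embeddings are normalized to unit length, a dot product is the same as cosine similarity. A minimal sketch with made-up 3-dimensional vectors (the `toy_*` names and values are hypothetical, not real embeddings):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy vectors standing in for sentence embeddings
toy_dogs = [0.9, 0.1, 0.0]
toy_canines = [0.8, 0.2, 0.1]
toy_weather = [0.1, 0.2, 0.9]

sim_close = cosine_similarity(toy_dogs, toy_canines)
sim_far = cosine_similarity(toy_dogs, toy_weather)
print(sim_close > sim_far)
```

Dividing by the norms makes the score independent of vector length, so it can also be used with embeddings that are not normalized.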

## Vectorstores

In [34]:
#!git clone https://github.com/nmslib/hnswlib.git
#!cd ./hnswlib
#!python setup.py install

!pip install chromadb

Collecting chromadb
  Downloading chromadb-0.3.26-py3-none-any.whl (123 kB)
Collecting requests>=2.28 (from chromadb)
  Downloading requests-2.31.0-py3-none-any.whl (62 kB)
Collecting hnswlib>=0.7 (from chromadb)
  Downloading hnswlib-0.7.0.tar.gz (33 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting clickhouse-connect>=0.5.7 (from chromadb)
  Downloading clickhouse_connect-0.6.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (966 kB)

In [35]:
from langchain.vectorstores import Chroma

In [36]:
persist_directory = '/content/drive/MyDrive/python_colab_prj/langchain-course/docs/chroma/'

In [38]:
!rm -rf /content/drive/MyDrive/python_colab_prj/langchain-course/docs/chroma  # remove old database files if any

In [39]:
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory
)

In [40]:
print(vectordb._collection.count())

8


### Similarity Search

In [46]:
question = "anything about Java programming?"

In [47]:
docs = vectordb.similarity_search(question,k=3)

In [48]:
len(docs)

3

In [49]:
docs[0].page_content

'such as Pinecone, Chroma DB for embedding and querying via similarity searches. • Personal pilot AI project https://stan.cool (working in progress and launch ETA late August, 2023 ) Stock Analyst Sponge Capital | Beijing, China | September 2021 - April 2022 (PE firm in Beijing managing ¥2 Billion equity) • Prepared investment materials and secured financing for investment deals. • Conducted in-depth analysis on 100+ corporations across various markets, providing investment recommendations to the executive team. • Delivered oral and written reports on general economic trends, individual corporations, and entire industries. • Valued and priced securities based on thorough analysis. • Contributed to quarterly close and monthly forecast processes. Game Designer  Laya Box | Beijing, Beijing, China  May 2021 - September 2021 (Provides largest H5 engine and developer community in China) • Supported game balance by analyzing statistics, virtual goods, economics, and user motivations. • Utiliz
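
Conceptually, `similarity_search` embeds the question and returns the `k` stored chunks whose vectors score highest against it. The sketch below shows that idea as a brute-force scan over a tiny in-memory list. It is an illustration under stated assumptions, not how Chroma is implemented: a real vector store typically uses an approximate nearest-neighbor index, and the function name, vectors, and query here are hypothetical.

```python
import numpy as np

def top_k_by_dot_product(query_vector, stored_vectors, k):
    """Return the indices of the k stored vectors with the highest dot product."""
    scores = np.asarray(stored_vectors, dtype=float) @ np.asarray(query_vector, dtype=float)
    # argsort is ascending, so reverse it and keep the first k indices
    return [int(i) for i in np.argsort(scores)[::-1][:k]]

toy_store = [
    [1.0, 0.0],  # hypothetical chunk 0
    [0.0, 1.0],  # hypothetical chunk 1
    [0.8, 0.6],  # hypothetical chunk 2
]
toy_query = [0.9, 0.1]
top_indices = top_k_by_dot_product(toy_query, toy_store, k=2)
print(top_indices)
```

Note that the ranking always returns `k` results, even when none of the stored chunks is actually relevant to the question.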

Let's save this so we can use it later!

In [45]:
vectordb.persist()

## Failure modes

This seems great, and basic similarity search will get you 80% of the way there very easily.

But some failure modes can crop up.

Here are some edge cases that can arise - we'll fix them in the next class.

In [50]:
question = "what did they say about Experiences?"

In [51]:
docs = vectordb.similarity_search(question,k=5)

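Similarity search can return duplicate chunks when the same document is indexed more than once.

Semantic search fetches the most similar chunks, but does not enforce diversity.

In this run the two resumes are each loaded once, so `docs[0]` and `docs[1]` below are different chunks. Note that `docs[0]` is reviewer-comment text extracted from the PDF rather than resume content.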
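
Because retrieval does not enforce diversity, one simple guard is to drop exact repeats after the search returns. A minimal sketch, assuming the results are plain strings (with LangChain results one would compare each `page_content`). The function name and toy data are hypothetical, and this only removes exact matches, not near-duplicates.

```python
def drop_exact_duplicates(texts):
    """Keep the first occurrence of each text, preserving ranking order."""
    seen = set()
    unique = []
    for text in texts:
        if text not in seen:
            seen.add(text)
            unique.append(text)
    return unique

toy_results = ["chunk A", "chunk A", "chunk B", "chunk A", "chunk C"]
unique_results = drop_exact_duplicates(toy_results)
print(unique_results)
```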

In [52]:
docs[0]

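Document(page_content='Commented [JQ2]: For the Experience section, my suggestion is that the first half of each bullet point states which tools you used and what you did, and the second half states what outcome/achievement you reached. It is also best to quantify the result. For example: used pytorch to train xxx model, what accuracy/roc the model reached, and what insights the model ultimately brought you/the company \nCommented [JQ3]: For example, this sentence could go on to describe what impact this recommendation had on the team, how it influenced their investment direction, and whether company revenue rose as a result', metadata={'source': '/content/drive/MyDrive/python_colab_prj/langchain-course/docs/Stanley_Yao_Resume.pdf', 'page': 0})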

In [53]:
docs[1]

Document(page_content='interview samples  \n▪ Compel led data gathered from various corporations’ financial reports using Tableau , enhanc ed project \nefficiency by 25%  \n▪ Fostered communication between stakeholders, engineering, operation, and marketing  to manage timeline s \n▪ Develop ed milestones and facilitated internal collaboration , improved speeds of production circle by 30% \nyear over year  \n▪ Developed a road map for mobile U ser Generated Content platform, led a cross -functional product team, \nmanage d prioritization and monitor ed the work to ensure whole -product readiness  \n▪ Defin ed key performance metrics  such as click -through  rate and sign -on rate  by collaborating with C- level \nsuite and key stakeholders to  improve methodologies across the company  \n▪ Analy zed the performance of the mobile  User Generated Content platform , formed the report based on \nchanges in users’ preferences  and forecast ed the trends to improve operations  \n \nChina Poly 

We can see a new failure mode.

The question below asks about graduation year, but the results include chunks unrelated to it, such as the reviewer comments returned as `docs[4]`.

In [54]:
question = "what did they say about year of graduate?"

In [55]:
docs = vectordb.similarity_search(question,k=5)

In [56]:
for doc in docs:
    print(doc.metadata)

{'source': '/content/drive/MyDrive/python_colab_prj/langchain-course/docs/Stanley_Yao_Resume.pdf', 'page': 1}
{'source': '/content/drive/MyDrive/python_colab_prj/langchain-course/docs/Gloria_Li_Resume.pdf', 'page': 0}
{'source': '/content/drive/MyDrive/python_colab_prj/langchain-course/docs/Gloria_Li_Resume.pdf', 'page': 0}
{'source': '/content/drive/MyDrive/python_colab_prj/langchain-course/docs/Stanley_Yao_Resume.pdf', 'page': 0}
{'source': '/content/drive/MyDrive/python_colab_prj/langchain-course/docs/Stanley_Yao_Resume.pdf', 'page': 0}
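
A quick way to see where the results came from is to tally the `source` field of each result's metadata. A minimal sketch, assuming the metadata has been collected into a list of dicts; the file names below are hypothetical placeholders that mirror the pattern printed above.

```python
from collections import Counter

# Hypothetical metadata mirroring the five results above
toy_metadata = [
    {"source": "resume_a.pdf", "page": 1},
    {"source": "resume_b.pdf", "page": 0},
    {"source": "resume_b.pdf", "page": 0},
    {"source": "resume_a.pdf", "page": 0},
    {"source": "resume_a.pdf", "page": 0},
]

source_counts = Counter(item["source"] for item in toy_metadata)
print(source_counts.most_common())
```

With real results, the list could be built with `[doc.metadata for doc in docs]`.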


In [57]:
print(docs[4].page_content)

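Commented [JQ2]: For the Experience section, my suggestion is that the first half of each bullet point states which tools you used and what you did, and the second half states what outcome/achievement you reached. It is also best to quantify the result. For example: used pytorch to train xxx model, what accuracy/roc the model reached, and what insights the model ultimately brought you/the company 
Commented [JQ3]: For example, this sentence could go on to describe what impact this recommendation had on the team, how it influenced their investment direction, and whether company revenue rose as a result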


Approaches discussed in the next lecture can be used to address both!