#### Vector stores and embeddings
- Use embeddings and a vector store to turn the split chunks into an index for easy retrieval.

![vector store](https://python.langchain.com/assets/images/vector_stores-9dc1ecb68c4cb446df110764c9cc07e0.jpg)

In [1]:
import os

In [2]:
# PaLM API key
GOOGLE_API_KEY = os.environ["GOOGLE_API_KEY"]

In [3]:
from langchain.document_loaders import PyPDFLoader

In [4]:
loaders = [
    PyPDFLoader("./docs/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("./docs/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("./docs/MachineLearning-Lecture02.pdf"),
    PyPDFLoader("./docs/MachineLearning-Lecture03.pdf")
]
# Lecture01 is loaded twice on purpose, to simulate duplicated data in the index

In [5]:
pdf_docs = []
for loader in loaders:
    pdf_docs.extend(loader.load())

In [6]:
len(pdf_docs)

78

In [7]:
pdf_docs[0].page_content[:200]

'MachineLearning-Lecture01  \nInstructor (Andrew Ng):  Okay. Good morning. Welcome to CS229, the machine \nlearning class. So what I wanna do today is ju st spend a little time going over the logistics \n'

In [8]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [9]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500, chunk_overlap=150)

In [10]:
splits = text_splitter.split_documents(pdf_docs)

In [11]:
len(splits)

209
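
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size splitting with overlap. It is a simplification for illustration only: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and sentences, which this sketch does not do. The function name `split_with_overlap` and the toy string are assumptions, not part of the notebook.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Toy input (hypothetical): 26 characters, windows of 10 with 3 shared characters
chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)
```

The tail of each chunk repeats at the head of the next one, so a sentence cut at a chunk boundary still appears whole in at least one chunk.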

#### GooglePalm embeddings

In [12]:
from langchain.embeddings import GooglePalmEmbeddings
embedding = GooglePalmEmbeddings(google_api_key=GOOGLE_API_KEY)

#### Vector store
- Using Chroma, an open-source vector database.

In [13]:
from langchain.vectorstores import Chroma

In [14]:
persist_directory = "./vector_database/chroma/"

In [15]:
!rm -rf ./vector_database/chroma
# remove old database files if any

In [16]:
vectordb = Chroma.from_documents(
    documents=splits, embedding=embedding, persist_directory=persist_directory)

In [17]:
print(vectordb._collection.count())

209


#### Similarity search to answer a query

In [18]:
question = "is there an email I can ask for help"

In [19]:
docs = vectordb.similarity_search(question, k=3)

In [20]:
len(docs)

3

In [21]:
print(docs[0].page_content)

cs229-qa@cs.stanford.edu. This goes to an acc ount that's read by all the TAs and me. So 
rather than sending us email individually, if you send email to this account, it will 
actually let us get back to you maximally quickly with answers to your questions.  
If you're asking questions about homework probl ems, please say in the subject line which 
assignment and which question the email refers to, since that will also help us to route 
your question to the appropriate TA or to me  appropriately and get the response back to 
you quickly.  
Let's see. Skipping ahead — let's see — for homework, one midterm, one open and term 
project. Notice on the honor code. So one thi ng that I think will help you to succeed and 
do well in this class and even help you to enjoy this cla ss more is if you form a study 
group.  
So start looking around where you' re sitting now or at the end of class today, mingle a 
little bit and get to know your classmates. I strongly encourage you to form study gro

In [22]:
vectordb.persist()
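
Conceptually, a similarity search embeds the question and returns the `k` stored chunks whose vectors are closest to it. The sketch below illustrates that idea with cosine similarity on hand-made 2-D vectors. It is an assumption-laden toy: the vectors are invented, the helper names (`cosine_similarity`, `top_k`) are hypothetical, and the metric Chroma actually uses depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy "embeddings" (hypothetical values, not real model output)
toy_docs = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6], [-1.0, 0.0]]
toy_query = [1.0, 0.1]
nearest = top_k(toy_query, toy_docs, k=2)
print(nearest)
```

A real vector store replaces the exhaustive loop with an approximate nearest-neighbour index, but the ranking idea is the same.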

#### Failure modes: some examples where similarity search can fail

In [23]:
question = "what did they say about matlab?"

In [24]:
docs = vectordb.similarity_search(question, k=5)

In [25]:
docs[0]

Document(page_content='those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people call it a free ve rsion of MATLAB, which it sort  of is, sort of isn\'t.  \nSo I guess for those of you that haven\'t s een MATLAB before, and I know most of you \nhave, MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to \nplot data. And it\'s sort of an extremely easy to  learn tool to use for implementing a lot of \nlearning algorithms.  \nAnd in case some of you want to work on your  own home computer or something if you \ndon\'t have a MATLAB license, for the purposes of  this class, there\'s also — [inaudible] \nwrite that down [inaudible] MATLAB — there\' s also a software package called Octave \nthat you can download for free off the Internet. And it has somewhat fewer features than MATLAB, but it\'s free, and for the purposes of  this class,

In [26]:
docs[1]

Document(page_content='those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people call it a free ve rsion of MATLAB, which it sort  of is, sort of isn\'t.  \nSo I guess for those of you that haven\'t s een MATLAB before, and I know most of you \nhave, MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to \nplot data. And it\'s sort of an extremely easy to  learn tool to use for implementing a lot of \nlearning algorithms.  \nAnd in case some of you want to work on your  own home computer or something if you \ndon\'t have a MATLAB license, for the purposes of  this class, there\'s also — [inaudible] \nwrite that down [inaudible] MATLAB — there\' s also a software package called Octave \nthat you can download for free off the Internet. And it has somewhat fewer features than MATLAB, but it\'s free, and for the purposes of  this class,

#### These are identical chunks, because Lecture01 was loaded twice. Semantic search returns the most similar chunks but does not enforce diversity.
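
One way to avoid indexing exact duplicates in the first place is to drop repeated chunk texts before building the vector store. The sketch below is an illustration under assumptions, not a step from the notebook: `dedupe_chunks` is a hypothetical helper that works on plain strings, and it only catches byte-for-byte identical text.

```python
import hashlib


def dedupe_chunks(chunks):
    """Keep the first occurrence of each chunk text, preserving order."""
    seen = set()
    unique = []
    for text in chunks:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


# Toy chunk texts (hypothetical); the first and last are identical
toy_chunks = [
    "homeworks will be done in either MATLAB or Octave",
    "recap of the linear regression model",
    "homeworks will be done in either MATLAB or Octave",
]
unique_chunks = dedupe_chunks(toy_chunks)
print(len(toy_chunks), "->", len(unique_chunks))
```

With LangChain `Document` objects, the same idea would hash each document's `page_content`. Near-duplicates that differ by a few characters are not caught by exact hashing.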

In [27]:
# another example
question = "what did they say about regression in the third lecture?"

In [28]:
docs = vectordb.similarity_search(question, k=5)

In [29]:
docs[0].page_content[:100]

'MachineLearning-Lecture03  \nInstructor (Andrew Ng) :Okay. Good morning and welcome b ack to the thir'

In [30]:
docs[1].page_content[:100]

'Instructor (Andrew Ng) :All right, so who thought driving could be that dramatic, right? \nSwitch bac'

In [31]:
for doc in docs:
    print(doc.metadata)

{'page': 0, 'source': './docs/MachineLearning-Lecture03.pdf'}
{'page': 2, 'source': './docs/MachineLearning-Lecture02.pdf'}
{'page': 14, 'source': './docs/MachineLearning-Lecture03.pdf'}
{'page': 9, 'source': './docs/MachineLearning-Lecture02.pdf'}
{'page': 0, 'source': './docs/MachineLearning-Lecture03.pdf'}


In [32]:
print(docs[4].page_content)

this to denote the predicted value of “by my hypothesis H” on the input XI. And my 
hypothesis was franchised by the vector of gram s as theta and so we said that this was 
equal to some from theta J, si J, and more  theta transpose X. And we had the convention 
that X subscript Z is equal to one so this accounts for the intercept term in our linear regression model. And lowercas e n here was the notation I was using for the number of 
features in my training set. Okay? So in  the example when trying to predict housing 
prices, we had two features, the size of the house and the number of bedrooms. We had 
two features and there was – li ttle n was equal to two. So just to finish recapping the 
previous lecture, we define d this quadratic cos function J of theta equals one-half, 
something I equals one to m, theta of XI mi nus YI squared where this is the sum over our 
m training examples and my training set. So lowercase m was the notation I’ve been 
using to denote the number of train
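
The question asks about the third lecture, yet the metadata shows that two of the five results come from Lecture02: the embedding captures the topic (regression) but not the structured constraint "in the third lecture". A common remedy is to filter on metadata. The sketch below shows the idea in plain Python on dictionaries that mirror the metadata printed above; `filter_by_source` is a hypothetical helper, not a LangChain function.

```python
def filter_by_source(results, source):
    """Keep only the results whose metadata 'source' matches the given path."""
    return [r for r in results if r["metadata"].get("source") == source]


# Metadata copied from the search results printed above
toy_results = [
    {"metadata": {"page": 0, "source": "./docs/MachineLearning-Lecture03.pdf"}},
    {"metadata": {"page": 2, "source": "./docs/MachineLearning-Lecture02.pdf"}},
    {"metadata": {"page": 14, "source": "./docs/MachineLearning-Lecture03.pdf"}},
    {"metadata": {"page": 9, "source": "./docs/MachineLearning-Lecture02.pdf"}},
    {"metadata": {"page": 0, "source": "./docs/MachineLearning-Lecture03.pdf"}},
]
lecture3_only = filter_by_source(toy_results, "./docs/MachineLearning-Lecture03.pdf")
print([r["metadata"]["page"] for r in lecture3_only])
```

Filtering after the search can leave fewer than `k` results. Depending on the LangChain version, `similarity_search` on the Chroma wrapper may also accept a `filter` dictionary that applies the constraint inside the store, which is worth checking in the documentation for the installed version.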

#### MMR search: Maximal Marginal Relevance
- Retrieve chunks that are both relevant to the query and distinct from each other, so the results cover more diverse information.

In [33]:
question = "what did they say about matlab?"

In [34]:
vectordb.max_marginal_relevance_search(question, k=3)[0].page_content[:100]

'those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people '

In [35]:
vectordb.max_marginal_relevance_search(question, k=3)[1].page_content[:100]

'simplicity of this algorithm will let us come back and use it as a building block. Okay? \nBut that’s'

In [36]:
vectordb.max_marginal_relevance_search(question, k=3)[2].page_content[:100]

'Student: [Inaudible]?  \nInstructor (Andrew Ng) :Yeah, I threw a lot of notations  at you today. So M'

#### The MMR results are more diverse than the similarity search results: the duplicated MATLAB chunk is no longer returned twice.
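
The selection rule behind MMR can be sketched in a few lines: pick chunks one at a time, each time choosing the candidate with the best trade-off between similarity to the query and similarity to the chunks already picked. The code below is a minimal illustration on invented 2-D vectors, not LangChain's implementation. The names `cosine` and `mmr` and the toy data are assumptions, and `lambda_mult` here is simply the weight of the relevance term.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Greedily select k indices, balancing relevance against redundancy."""
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy: similarity to the closest already-selected vector
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


# Toy vectors (hypothetical): index 1 is an exact duplicate of index 0
toy_docs = [[1.0, 0.0], [1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
toy_query = [1.0, 0.2]
picked = mmr(toy_query, toy_docs, k=2)
print(picked)
```

A plain top-2 similarity search on these vectors would return both copies of the duplicate. MMR penalizes the second copy for being identical to the first and picks a different vector instead. Lowering `lambda_mult` favours diversity, raising it favours relevance.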