## Libraries and Setup

In [1]:
import openai
from langchain_openai import OpenAIEmbeddings, OpenAI
from langchain_chroma import Chroma
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
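# Community integrations live in langchain_community in recent LangChain releases
from langchain_community.retrievers import SVMRetriever, TFIDFRetriever
from langchain_community.document_loaders import PyPDFLoader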
from langchain.text_splitter import RecursiveCharacterTextSplitter

from dotenv import load_dotenv, find_dotenv

import warnings
warnings.filterwarnings('ignore')

In [2]:
_ = load_dotenv(find_dotenv())

In [3]:
embeddings = OpenAIEmbeddings()
llm = OpenAI(temperature=0)

## Retrieval

In [4]:
persist_directory = "db/chroma/"

In [5]:
vector_db = Chroma(
    embedding_function=embeddings,
    persist_directory=persist_directory
)

In [6]:
print(vector_db._collection.count())

208


### Maximum Marginal Relevance

Maximum marginal relevance (MMR) selects results that are relevant to the query while also being diverse among themselves, so near-duplicate chunks do not crowd out other useful content.
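To make the trade-off concrete, here is a minimal pure-Python sketch of greedy MMR selection. It follows the common formulation `lambda_mult * relevance - (1 - lambda_mult) * redundancy`; the function names (`cosine`, `mmr_select`) and the toy 2-D vectors are illustrative assumptions, not the code Chroma or LangChain runs internally.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedily pick k document indices, balancing relevance and diversity.

    lambda_mult=1.0 reduces to plain similarity ranking;
    lower values penalize candidates that resemble already-selected documents.
    """
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy is the highest similarity to anything already chosen.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy (hypothetical) embeddings: docs 0 and 1 are near-duplicates, doc 2 differs.
query = [1.0, 0.0]
docs = [
    [0.9, 0.1],
    [0.88, 0.12],
    [0.5, -0.5],
]

similarity_only = mmr_select(query, docs, k=2, lambda_mult=1.0)
with_diversity = mmr_select(query, docs, k=2, lambda_mult=0.5)
print(similarity_only, with_diversity)
```

With `lambda_mult=1.0` the two near-duplicates are returned together; with `lambda_mult=0.5` the second pick switches to the less similar but non-redundant document.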

In [7]:
texts = [
    """The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).""",
    """A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.""",
    """A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.""",
]

In [8]:
small_db = Chroma.from_texts(
    texts=texts,
    embedding=embeddings
)

In [9]:
question = "Tell me about all-white mushrooms with large fruiting bodies"

In [10]:
small_db.similarity_search(
    query=question, 
    k=2
)

[Document(id='0f35251a-f213-4208-9779-5d91b4489357', metadata={}, page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'),
 Document(id='078b1a09-0440-4a70-871f-96c66a3a9abe', metadata={}, page_content='The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).')]

In [11]:
small_db.max_marginal_relevance_search(
    query=question, 
    k=2, 
    fetch_k=3
)

[Document(id='0f35251a-f213-4208-9779-5d91b4489357', metadata={}, page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'),
 Document(id='cdd3128e-978c-4d16-a0ac-399f67bf4ffa', metadata={}, page_content='A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.')]

#### Addressing Diversity

In the VectorStores and Embeddings notebook we ran into one problem: how to enforce diversity in the search results.

In [12]:
question = "What did they say about Matlab?"

**Using Similarity Search**

In [13]:
similarity_search_document = vector_db.similarity_search(
    query=question, 
    k=2
)

In [14]:
similarity_search_document[0].page_content[0:100]

'those homeworks will be done in either MATLAB or in Octave, which is sort of — I \nknow some people c'

In [15]:
similarity_search_document[1].page_content[0:100]

'those homeworks will be done in either MATLAB or in Octave, which is sort of — I \nknow some people c'

**Using Maximum Marginal Relevance**

Note the difference when using MMR: the second result is no longer a duplicate of the first.

In [16]:
maximum_marginal_relevance_document = vector_db.max_marginal_relevance_search(
    query=question,
    k=2,
)

In [17]:
maximum_marginal_relevance_document[0].page_content[0:100]

'those homeworks will be done in either MATLAB or in Octave, which is sort of — I \nknow some people c'

In [18]:
maximum_marginal_relevance_document[1].page_content[0:100]

'he says it in sort of a really touching, sincere way, and then he has this — you can see it \nin his '

### Self Query Retriever

In [19]:
question = "what did they say about regression in the third lecture?"

In [20]:
docs = vector_db.similarity_search(
    query=question,
    k=3,
    filter={"source":"documents/MachineLearning-Lecture03.pdf"}
)

In [21]:
for doc in docs:
    print(doc.metadata)

{'page': 0, 'page_label': '1', 'source': 'documents/MachineLearning-Lecture03.pdf'}
{'page': 13, 'page_label': '14', 'source': 'documents/MachineLearning-Lecture03.pdf'}
{'page': 4, 'page_label': '5', 'source': 'documents/MachineLearning-Lecture03.pdf'}


**Addressing Specificity: working with metadata using self-query retriever**

In the search above, the metadata filter was written by hand. To infer it from the question itself, we can use SelfQueryRetriever, which uses an LLM to extract:

1. The query string to use for vector search
2. A metadata filter to pass in as well

Most vector databases support metadata filters, so this doesn't require any new databases or indexes.
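The two pieces listed above can be pictured with a small, self-contained sketch. Here a hand-written `ToyStructuredQuery` stands in for what the LLM would produce, and `apply_filter` mimics an equality-only metadata filter over in-memory records. Both names and the sample records are hypothetical; this is not how SelfQueryRetriever or Chroma implement filtering.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStructuredQuery:
    """Hypothetical container for the two parts a self-query step produces."""
    query: str                                   # text used for vector search
    filter: dict = field(default_factory=dict)   # exact-match metadata constraints


def apply_filter(records, structured):
    """Keep only records whose metadata matches every key in the filter."""
    return [
        r for r in records
        if all(r["metadata"].get(k) == v for k, v in structured.filter.items())
    ]


# Toy records shaped like the notebook's chunks (contents are made up).
records = [
    {"text": "intro to the course",
     "metadata": {"source": "documents/MachineLearning-Lecture01.pdf", "page": 0}},
    {"text": "linear regression recap",
     "metadata": {"source": "documents/MachineLearning-Lecture03.pdf", "page": 0}},
    {"text": "locally weighted regression",
     "metadata": {"source": "documents/MachineLearning-Lecture03.pdf", "page": 4}},
]

# What an LLM might extract from:
# "what did they say about regression in the third lecture?"
structured = ToyStructuredQuery(
    query="regression",
    filter={"source": "documents/MachineLearning-Lecture03.pdf"},
)

matches = apply_filter(records, structured)
for r in matches:
    print(r["metadata"])
```

Only the filtered subset would then be ranked by vector similarity against `structured.query`.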

In [22]:
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The lecture the chunk is from, should be one of `documents/MachineLearning-Lecture01.pdf`, `documents/MachineLearning-Lecture02.pdf`, or `documents/MachineLearning-Lecture03.pdf`",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page from the lecture",
        type="integer",
    ),
]

In [23]:
document_content_description = "Lecture notes"
retriever = SelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vector_db,
    document_contents=document_content_description,
    metadata_field_info=metadata_field_info,
    verbose=True
)

In [24]:
docs = retriever.get_relevant_documents(query=question)

In [25]:
for doc in docs:
    print(doc.metadata)

{'page': 0, 'page_label': '1', 'source': 'documents/MachineLearning-Lecture03.pdf'}
{'page': 10, 'page_label': '11', 'source': 'documents/MachineLearning-Lecture03.pdf'}
{'page': 5, 'page_label': '6', 'source': 'documents/MachineLearning-Lecture03.pdf'}
{'page': 2, 'page_label': '3', 'source': 'documents/MachineLearning-Lecture03.pdf'}


### Contextual Compression

In [26]:
def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))

In [27]:
compressor = LLMChainExtractor.from_llm(llm=llm)

In [28]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vector_db.as_retriever()
)

In [29]:
question = "What did they say about Matlab?"

In [30]:
compressed_docs = compression_retriever.get_relevant_documents(query=question)

In [31]:
pretty_print_docs(compressed_docs)

Document 1:

- those homeworks will be done in either MATLAB or in Octave
- I know some people call it a free version of MATLAB
- MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to plot data
- it's sort of an extremely easy to learn tool to use for implementing a lot of learning algorithms
- there's also a software package called Octave that you can download for free off the Internet
- it has somewhat fewer features than MATLAB, but it's free, and for the purposes of this class, it will work for just about everything
- once a colleague of mine at a different university, not at Stanford, actually teaches another machine learning course
----------------------------------------------------------------------------------------------------
Document 2:

- those homeworks will be done in either MATLAB or in Octave
- I know some people call it a free version of MATLAB
- MATLAB is

### Combining Various Techniques

In [32]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vector_db.as_retriever(search_type='mmr')
)

In [33]:
compressed_docs = compression_retriever.get_relevant_documents(query=question)
pretty_print_docs(compressed_docs)

Document 1:

- those homeworks will be done in either MATLAB or in Octave
- I know some people call it a free version of MATLAB
- MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to plot data
- it's sort of an extremely easy to learn tool to use for implementing a lot of learning algorithms
- there's also a software package called Octave that you can download for free off the Internet
- it has somewhat fewer features than MATLAB, but it's free, and for the purposes of this class, it will work for just about everything
- once a colleague of mine at a different university, not at Stanford, actually teaches another machine learning course
----------------------------------------------------------------------------------------------------
Document 2:

"Oh, it was the MATLAB."
----------------------------------------------------------------------------------------------------


### Other Types of Retrievals

It's worth noting that a vector database is not the only kind of tool for retrieving documents.

The LangChain retriever abstraction includes other ways to retrieve documents, such as TF-IDF or SVM.
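As a rough illustration of how a TF-IDF retriever can rank text without any embeddings, here is a pure-Python sketch. The whitespace tokenizer, the smoothed IDF formula, and the names `tokenize` and `tfidf_rank` are assumptions made for this example; it is a toy, not the implementation behind `TFIDFRetriever`.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive whitespace tokenizer (an assumption; real tokenizers do more)."""
    return text.lower().split()


def tfidf_rank(query, docs):
    """Return document indices sorted by TF-IDF cosine similarity to the query."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}

    def vectorize(tokens):
        tf = Counter(tokens)
        # Terms never seen in the corpus are ignored.
        return {t: count * idf[t] for t, count in tf.items() if t in idf}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    q = vectorize(tokenize(query))
    scores = [cosine(q, vectorize(tokens)) for tokens in tokenized]
    return sorted(range(n), key=lambda i: scores[i], reverse=True)


# Made-up snippets for illustration only.
toy_docs = [
    "homeworks will be done in matlab or octave",
    "linear regression fits a line to data",
    "octave is a free package",
]

ranking = tfidf_rank("matlab homeworks", toy_docs)
print(ranking[0])
```

Because scoring depends only on shared terms, this kind of retriever needs no embedding model, but it also cannot match paraphrases that share no words with the query.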

In [34]:
loader = PyPDFLoader('documents/MachineLearning-Lecture01.pdf')

In [35]:
pages = loader.load()

In [36]:
all_pages_text = [page.page_content for page in pages]

In [37]:
joined_page_text = " ".join(all_pages_text)

In [38]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,
    chunk_overlap=150
)
splits = text_splitter.split_text(joined_page_text)

**SVM Retriever**

In [39]:
svm_retriever = SVMRetriever.from_texts(
    texts=splits,
    embeddings=embeddings
)

In [40]:
question = "What did they say about Matlab?"

In [41]:
svm_retriever_document = svm_retriever.get_relevant_documents(query=question)
svm_retriever_document[0]

Document(metadata={}, page_content="yourselves. You can also come and talk to me or the TAs if you want to brainstorm ideas \nwith us.  \nOkay. So one more organizational question. I'm curious, how many of you know \nMATLAB? Wow, cool, quite a lot. Okay. So as part of the — act ually how many of you \nknow Octave or have used Octave? Oh, okay, much smaller number.  \nSo as part of this class, especially in the homeworks, we'll ask you to implement a few \nprograms, a few machine learning algorithms as part of the homeworks. And most of  those homeworks will be done in either MATLAB or in Octave, which is sort of — I \nknow some people call it a free version of MATLAB, which it sort of is, sort of isn't.  \nSo I guess for those of you that haven't seen MATLAB before, and I know most of you \nhave, MATLAB is I guess part of the programming language that makes it very easy to \nwrite codes using matrices, to write code for numerical routines, to move data around, to \nplot data. And it's 

**TF-IDF Retriever**

In [42]:
tfidf_retriever = TFIDFRetriever.from_texts(texts=splits)

In [43]:
question = "What did they say about Matlab?"

In [44]:
tfidf_retriever_document = tfidf_retriever.get_relevant_documents(question)
tfidf_retriever_document[0]

Document(metadata={}, page_content="yourselves. You can also come and talk to me or the TAs if you want to brainstorm ideas \nwith us.  \nOkay. So one more organizational question. I'm curious, how many of you know \nMATLAB? Wow, cool, quite a lot. Okay. So as part of the — act ually how many of you \nknow Octave or have used Octave? Oh, okay, much smaller number.  \nSo as part of this class, especially in the homeworks, we'll ask you to implement a few \nprograms, a few machine learning algorithms as part of the homeworks. And most of  those homeworks will be done in either MATLAB or in Octave, which is sort of — I \nknow some people call it a free version of MATLAB, which it sort of is, sort of isn't.  \nSo I guess for those of you that haven't seen MATLAB before, and I know most of you \nhave, MATLAB is I guess part of the programming language that makes it very easy to \nwrite codes using matrices, to write code for numerical routines, to move data around, to \nplot data. And it's 