## Libraries and Setup

In [1]:
import openai
import datetime
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from dotenv import load_dotenv, find_dotenv

import warnings
warnings.filterwarnings('ignore')

In [2]:
_ = load_dotenv(find_dotenv())

In [3]:
embeddings = OpenAIEmbeddings()
llm = ChatOpenAI(temperature=0)

## Retrieval

In [4]:
persist_directory = "db/chroma/"

In [5]:
vector_db = Chroma(
    persist_directory=persist_directory,
    embedding_function=embeddings
)

In [6]:
print(vector_db._collection.count())

208


In [7]:
question = "What are the major topics for this class?"

In [8]:
docs = vector_db.similarity_search(
    query=question,
    k=3
)

In [9]:
len(docs)

3

## RetrievalQA Chain

In [10]:
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vector_db.as_retriever()
)

In [11]:
result = qa_chain.invoke(question)

In [12]:
result["result"]

'The major topics for this class include machine learning, statistics, and algebra. Additionally, there will be discussions on extensions of the material covered in the main lectures.'

## Prompts

In [13]:
template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, \
just say that you don't know; don't try to make up an answer. Use three sentences maximum. Keep the answer as concise \
as possible. Always say "thanks for asking!" at the end of the answer.
{context}
Question: {question}
Answer: """

In [14]:
QA_CHAIN_PROMPT = PromptTemplate.from_template(template=template)

In [15]:
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vector_db.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt":QA_CHAIN_PROMPT}
)

In [16]:
question = "Is probability a class topic?"

In [17]:
result = qa_chain.invoke(question)

In [18]:
result["result"]

'Yes, probability is a class topic as the instructor assumes familiarity with basic probability and statistics. Thanks for asking!'

In [19]:
result["source_documents"][0]

Document(id='23265565-d6e5-4d35-b358-8f876bea0149', metadata={'page': 4, 'page_label': '5', 'source': 'documents/MachineLearning-Lecture01.pdf'}, page_content="of this class will not be very programming intensive, although we will do some \nprogramming, mostly in either MATLAB or Octave. I'll say a bit more about that later.  \nI also assume familiarity with basic probability and statistics. So most undergraduate \nstatistics class, like Stat 116 taught here at Stanford, will be more than enough. I'm gonna \nassume all of you know what random variables are, that all of you know what expectation \nis, what a variance or a random variable is. And in case of some of you, it's been a while \nsince you've seen some of this material. At some of the discussion sections, we'll actually \ngo over some of the prerequisites, sort of as a refresher course under prerequisite class. \nI'll say a bit more about that later as well.  \nLastly, I also assume familiarity with basic linear algebra. And ag

## Map Reduce

In [20]:
map_reduce_qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vector_db.as_retriever(),
    chain_type="map_reduce"
)

In [21]:
result = map_reduce_qa_chain.invoke(question)
result["result"]

'Yes, probability is a class topic in the document.'
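
A conceptual sketch of the map-reduce idea may help here. It is not LangChain's implementation: `map_reduce_answer` and `fake_llm` are hypothetical names, the prompts are invented for illustration, and the fake model only records the prompts it receives so the number of calls can be counted.

```python
from typing import Callable, List


def map_reduce_answer(question: str, docs: List[str], llm: Callable[[str], str]) -> str:
    # Map step: ask the question against each chunk independently.
    partial_answers = [
        llm(f"Context: {doc}\nQuestion: {question}\nAnswer:") for doc in docs
    ]
    # Reduce step: one final call that sees only the partial answers.
    summaries = "\n".join(partial_answers)
    return llm(f"Summaries:\n{summaries}\nQuestion: {question}\nFinal answer:")


# Hypothetical stand-in for a real model (assumption, not a LangChain class).
calls: List[str] = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"answer #{len(calls)}"


docs = ["chunk one", "chunk two", "chunk three"]
final = map_reduce_answer("Is probability a class topic?", docs, fake_llm)
print(len(calls))  # 4: one call per chunk, plus one call to combine
```

Because the final call sees only the partial answers and not the original chunks, detail can be lost along the way, which is one plausible reason the answer above is so terse.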

## Refine

In [22]:
refine_qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vector_db.as_retriever(),
    chain_type="refine"
)

In [23]:
result = refine_qa_chain.invoke(question)
result["result"]

'The additional context provided does not significantly impact the original answer, as it already addresses the topic of probability being covered in the class. The instructor mentions using a probabilistic interpretation to derive the next learning algorithm, which will be the first classification algorithm discussed in the class. This further reinforces the importance of understanding probability in the context of machine learning algorithms. The mention of using discussion sections for refresher topics like statistics and algebra, as well as for extensions of the main lecture material, does not directly impact the relevance of probability as a class topic.'
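
The phrasing of the answer above ("The additional context provided does not significantly impact the original answer") reflects how a refine-style chain works: the model is shown its previous answer together with one more chunk at each step. The following is a minimal sketch of that loop under assumed prompts; `refine_answer` and `fake_llm` are hypothetical names, not LangChain APIs.

```python
from typing import Callable, List


def refine_answer(question: str, docs: List[str], llm: Callable[[str], str]) -> str:
    if not docs:
        raise ValueError("at least one document is required")
    # The first chunk produces an initial answer.
    answer = llm(f"Context: {docs[0]}\nQuestion: {question}\nAnswer:")
    # Every later chunk is shown together with the answer so far.
    for doc in docs[1:]:
        answer = llm(
            f"Existing answer: {answer}\n"
            f"Additional context: {doc}\n"
            f"Question: {question}\nRefined answer:"
        )
    return answer


# Hypothetical stand-in for a real model (assumption, not a LangChain class).
calls: List[str] = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"answer #{len(calls)}"


final = refine_answer("Is probability a class topic?", ["a", "b", "c"], fake_llm)
print(len(calls))  # 3: the calls run sequentially, one per chunk
```

Unlike the map-reduce approach, the calls here cannot run in parallel, since each one depends on the previous answer.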

## Map ReRank

In [24]:
map_rerank_qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vector_db.as_retriever(),
    chain_type="map_rerank"
)

In [25]:
result = map_rerank_qa_chain.invoke(question)
result["result"]

'Yes, probability is a class topic mentioned in the context.'
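
As a rough sketch of the re-rank idea: each chunk is answered on its own, each answer carries a confidence score, and the highest-scoring answer is returned. In a real chain the score would be parsed from the model's text output; here `fake_scoring_llm` is a hypothetical stand-in that returns an `(answer, score)` pair directly, and the scoring rule is invented for illustration.

```python
from typing import Callable, List, Tuple


def map_rerank_answer(
    question: str, docs: List[str], llm: Callable[[str], Tuple[str, int]]
) -> str:
    # Answer the question once per chunk; each result is (answer, score).
    scored = [llm(f"Context: {doc}\nQuestion: {question}") for doc in docs]
    # Keep only the answer with the highest score.
    best_answer, _ = max(scored, key=lambda pair: pair[1])
    return best_answer


# Hypothetical scorer (assumption): high score if the chunk mentions probability.
def fake_scoring_llm(prompt: str) -> Tuple[str, int]:
    context = prompt.split("\n")[0]
    score = 100 if "probability" in context else 0
    return (f"Based on: {context}", score)


docs = ["homework policy", "basic probability and statistics", "MATLAB or Octave"]
best = map_rerank_answer("Is probability a class topic?", docs, fake_scoring_llm)
print(best)  # Based on: Context: basic probability and statistics
```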

## RetrievalQA Limitations

RetrievalQA does not preserve conversational history: each call to `invoke` is answered independently of any earlier questions and answers.

In [26]:
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vector_db.as_retriever()
)

In [27]:
question = "Is probability a class topic?"
result = qa_chain.invoke(question)
result["result"]

'Yes, probability is a class topic in the course being described. The instructor assumes familiarity with basic probability and statistics, so it is likely that probability concepts will be covered in the class.'

In [28]:
question = "Why are those prerequisites needed?"
result = qa_chain.invoke(question)
result["result"]

'The prerequisites mentioned in the context are needed because the course assumes familiarity with basic concepts in probability and statistics, as well as basic linear algebra. Understanding these concepts is essential for grasping the material covered in the machine learning course. For example, knowledge of probability and statistics is crucial for understanding algorithms and their performance, while linear algebra is fundamental for understanding how machine learning algorithms work with matrices and vectors.'

Note: LLM responses vary between runs. Some responses to the follow-up question do mention probability, but that may be gleaned from the retrieved documents rather than from the earlier exchange. The point is simply that the chain has no access to past questions or answers. This is covered in the Chat notebook.
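
To make the limitation concrete, here is a small illustration of the difference between a prompt built from the current question alone and one that also carries earlier turns. Both helper functions and the prompt layout are hypothetical and are not how RetrievalQA or any LangChain chat chain builds its prompts.

```python
from typing import List, Tuple


def build_stateless_prompt(question: str) -> str:
    # Only the current question reaches the model.
    return f"Question: {question}\nAnswer:"


def build_prompt_with_history(question: str, history: List[Tuple[str, str]]) -> str:
    # Earlier question/answer pairs are prepended to the current question.
    past = "\n".join(f"Q: {q}\nA: {a}" for q, a in history)
    return f"{past}\nQuestion: {question}\nAnswer:"


history = [("Is probability a class topic?", "Yes, basic probability is assumed.")]
follow_up = "Why are those prerequisites needed?"

stateless = build_stateless_prompt(follow_up)
stateful = build_prompt_with_history(follow_up, history)

print("probability" in stateless)  # False: "those" has nothing to refer to
print("probability" in stateful)   # True: the earlier turn is visible
```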