In [1]:
from dotenv import load_dotenv, find_dotenv

# Load environment variables (e.g. OPENAI_API_KEY) from the nearest .env file
_ = load_dotenv(find_dotenv())

In [2]:
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings

# Open the Chroma vector store persisted under db/chroma/
persist_directory = 'db/chroma/'
embedding = OpenAIEmbeddings()
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)

In [11]:
print(vectordb._collection.count())

209


In [12]:
question = "What are major topics for this class?"
docs = vectordb.similarity_search(question, k=3)  # embeds the query via the embeddings API; no chat-model call
len(docs)

3
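
As a rough mental model of what similarity_search(question, k=3) returns, here is a minimal sketch of top-k retrieval over toy vectors. The two-dimensional vectors, the chunk names, and the use of cosine similarity are all assumptions for illustration; the metric and index Chroma actually uses are not shown in this notebook.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, doc_vecs, k):
    """Return the names of the k stored vectors most similar to the query."""
    ranked = sorted(
        doc_vecs,
        key=lambda name: cosine(query_vec, doc_vecs[name]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical embeddings; real ones come from the embedding model
doc_vecs = {
    "syllabus": [0.9, 0.1],
    "logistics": [0.2, 0.8],
    "prerequisites": [0.7, 0.3],
    "matlab": [0.1, 0.9],
}

nearest = top_k([1.0, 0.0], doc_vecs, k=3)
print(nearest)
```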

In [15]:
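from langchain.chat_models import ChatOpenAI
import langchain

# temperature=0 keeps the answers as deterministic as possible
llm = ChatOpenAI(temperature=0)

# Print a trace of every chain and LLM call
langchain.debug = True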

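By default, RetrievalQA uses the "stuff" chain type: all retrieved documents are stuffed into a single prompt.

When the retrieved documents do not fit in the context window, the map_reduce, refine, and map_rerank chain types can be used instead. They process the documents in separate LLM calls, so they use more API calls.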
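
To make the cost difference concrete, here is a minimal sketch of the two strategies using a stand-in LLM that only counts calls. `CountingLLM`, `stuff`, and `map_reduce` are hypothetical helpers written for this note, not LangChain APIs, and the four chunks are an assumed retrieval result.

```python
class CountingLLM:
    """Stand-in for a chat model: counts calls instead of calling an API."""

    def __init__(self):
        self.calls = 0

    def __call__(self, prompt):
        self.calls += 1
        return f"answer #{self.calls}"

def stuff(llm, docs, question):
    # One prompt holding every chunk: a single LLM call
    context = "\n\n".join(docs)
    return llm(f"{context}\n\nQuestion: {question}")

def map_reduce(llm, docs, question):
    # Map step: one LLM call per chunk
    partial_answers = [llm(f"{doc}\n\nQuestion: {question}") for doc in docs]
    # Reduce step: one more call to combine the partial answers
    combined = "\n\n".join(partial_answers)
    return llm(f"{combined}\n\nQuestion: {question}")

docs = ["chunk one", "chunk two", "chunk three", "chunk four"]
question = "Is probability a class topic?"

stuff_llm = CountingLLM()
stuff(stuff_llm, docs, question)

mr_llm = CountingLLM()
map_reduce(mr_llm, docs, question)

print(stuff_llm.calls, mr_llm.calls)
```

With a tight requests-per-minute limit, the extra calls of the second strategy are what trigger rate-limit retries.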

In [20]:
from langchain.chains import RetrievalQA
qa_chain = RetrievalQA.from_chain_type(
    llm,
    chain_type="stuff",  # the default: all retrieved chunks go into one prompt
    retriever=vectordb.as_retriever()
)

In [21]:
result = qa_chain({"query": question})

[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQA] Entering Chain run with input:
[0m{
  "query": "What are major topics for this class?"
}
[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQA > 3:chain:StuffDocumentsChain] Entering Chain run with input:
[0m[inputs]
[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQA > 3:chain:StuffDocumentsChain > 4:chain:LLMChain] Entering Chain run with input:
[0m{
  "question": "What are major topics for this class?",
  "context": "statistics for a while or maybe algebra, we'll go over those in the discussion sections as a \nrefresher for those of you that want one.  \nLater in this quarter, we'll also use the disc ussion sections to go over extensions for the \nmaterial that I'm teaching in the main lectur es. So machine learning is a huge field, and \nthere are a few extensions that we really want  to teach but didn't have time in the main \nlectures for.\n\nstatistics for a while or maybe algebra, we'll go over those in the discu

In [18]:
print(result['result'])

The major topics for this class are machine learning and its various applications.


Define a custom prompt: this is the main piece of information given to the LLM.

In [22]:
from langchain.prompts import PromptTemplate

# Build prompt
template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Use three sentences maximum. Keep the answer as concise as possible. Always say "thanks for asking!" at the end of the answer. 
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)
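
To see what the LLM actually receives, here is a minimal sketch that fills a template of the same shape with plain `str.format`. The shortened template, the sample chunk, and `fill_prompt` are assumptions for illustration; joining chunks with a blank line mirrors the "\n\n" separators visible in the debug trace.

```python
template = (
    "Use the following pieces of context to answer the question at the end.\n"
    "{context}\n"
    "Question: {question}\n"
    "Helpful Answer:"
)

def fill_prompt(template, chunks, question):
    """Join the retrieved chunks and substitute them into the template."""
    context = "\n\n".join(chunks)
    return template.format(context=context, question=question)

# Hypothetical retrieved chunk, paraphrased from the lecture transcript
chunks = ["I also assume familiarity with basic probability and statistics."]

prompt_text = fill_prompt(template, chunks, "Is statistics a class topic?")
print(prompt_text)
```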


In [29]:
# Build the chain with MMR retrieval, source documents, and the custom prompt
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(search_type="mmr"),
    return_source_documents=True,
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)

In [24]:
question = "Is statistics a class topic?"

In [30]:
result = qa_chain({"query":question})

[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQA] Entering Chain run with input:
[0m{
  "query": "Is statistics a class topic?"
}
[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQA > 3:chain:StuffDocumentsChain] Entering Chain run with input:
[0m[inputs]
[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQA > 3:chain:StuffDocumentsChain > 4:chain:LLMChain] Entering Chain run with input:
[0m{
  "question": "Is statistics a class topic?",
  "context": "statistics for a while or maybe algebra, we'll go over those in the discussion sections as a \nrefresher for those of you that want one.  \nLater in this quarter, we'll also use the disc ussion sections to go over extensions for the \nmaterial that I'm teaching in the main lectur es. So machine learning is a huge field, and \nthere are a few extensions that we really want  to teach but didn't have time in the main \nlectures for.\n\nof this class will not be very program ming intensive, although we will do some \nprogrammi

In [31]:
print(result)

{'query': 'Is statistics a class topic?', 'result': 'Yes, statistics is a class topic. Thanks for asking!', 'source_documents': [Document(page_content="statistics for a while or maybe algebra, we'll go over those in the discussion sections as a \nrefresher for those of you that want one.  \nLater in this quarter, we'll also use the disc ussion sections to go over extensions for the \nmaterial that I'm teaching in the main lectur es. So machine learning is a huge field, and \nthere are a few extensions that we really want  to teach but didn't have time in the main \nlectures for.", metadata={'page': 8, 'source': 'ml_doc1.pdf'}), Document(page_content="of this class will not be very program ming intensive, although we will do some \nprogramming, mostly in either MATLAB or Octa ve. I'll say a bit more about that later.  \nI also assume familiarity with basic proba bility and statistics. So most undergraduate \nstatistics class, like Stat 116 taught here at Stanford, will be more than enou

In [34]:
for source in result['source_documents']:
    print(source.page_content, source.metadata)

statistics for a while or maybe algebra, we'll go over those in the discussion sections as a 
refresher for those of you that want one.  
Later in this quarter, we'll also use the disc ussion sections to go over extensions for the 
material that I'm teaching in the main lectur es. So machine learning is a huge field, and 
there are a few extensions that we really want  to teach but didn't have time in the main 
lectures for. {'page': 8, 'source': 'ml_doc1.pdf'}
of this class will not be very program ming intensive, although we will do some 
programming, mostly in either MATLAB or Octa ve. I'll say a bit more about that later.  
I also assume familiarity with basic proba bility and statistics. So most undergraduate 
statistics class, like Stat 116 taught here at Stanford, will be more than enough. I'm gonna 
assume all of you know what ra ndom variables are, that all of you know what expectation 
is, what a variance or a random variable is. And in case of some of you, it's been a while 

In [35]:
question = "Is probability a class topic?"
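# map_reduce: one LLM call per retrieved chunk, then typically one more call to combine the partial answers
qa_chain_mr = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    chain_type="map_reduce"
)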

result = qa_chain_mr({"query": question})

[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQA] Entering Chain run with input:
[0m{
  "query": "Is probability a class topic?"
}
[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQA > 3:chain:MapReduceDocumentsChain] Entering Chain run with input:
[0m[inputs]
[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQA > 3:chain:MapReduceDocumentsChain > 4:chain:LLMChain] Entering Chain run with input:
[0m{
  "input_list": [
    {
      "context": "of this class will not be very program ming intensive, although we will do some \nprogramming, mostly in either MATLAB or Octa ve. I'll say a bit more about that later.  \nI also assume familiarity with basic proba bility and statistics. So most undergraduate \nstatistics class, like Stat 116 taught here at Stanford, will be more than enough. I'm gonna \nassume all of you know what ra ndom variables are, that all of you know what expectation \nis, what a variance or a random variable is. And in case of some of you, it's been a whil

Retrying langchain.chat_models.openai.ChatOpenAI.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised RateLimitError: Rate limit reached for default-gpt-3.5-turbo in organization org-Ipkgjb8eV212suWndEgA3XmD on requests per min. Limit: 3 / min. Please try again in 20s. Contact us through our help center at help.openai.com if you continue to have issues. Please add a payment method to your account to increase your rate limit. Visit https://platform.openai.com/account/billing to add a payment method..

[36;1m[1;3m[llm/end][0m [1m[1:chain:RetrievalQA > 3:chain:MapReduceDocumentsChain > 4:chain:LLMChain > 5:llm:ChatOpenAI] [28.16s] Exiting LLM run with output:
[0m{
  "generations": [
    [
      {
        "text": "Yes, probability is mentioned as a prerequisite for the class. The instructor assumes familiarity with basic probability and statistics.",
        "generation_info": {
          "finish_reason": "stop"
        },
        "message": {
          "lc": 1,
          "type": "constructor",
          "id": [
            "langchain",
            "schema",
            "messages",
            "AIMessage"
          ],
          "kwargs": {
            "content": "Yes, probability is mentioned as a prerequisite for the class. The instructor assumes familiarity with basic probability and statistics.",
            "additional_kwargs": {}
          }
        }
      }
    ]
  ],
  "llm_output": {
    "token_usage": {
      "prompt_tokens": 383,
      "completion_tokens": 22,
      "to


[36;1m[1;3m[llm/end][0m [1m[1:chain:RetrievalQA > 3:chain:MapReduceDocumentsChain > 9:chain:LLMChain > 10:llm:ChatOpenAI] [14.53s] Exiting LLM run with output:
[0m{
  "generations": [
    [
      {
        "text": "Based on the provided information, it is not clear whether probability is a specific topic in the class.",
        "generation_info": {
          "finish_reason": "stop"
        },
        "message": {
          "lc": 1,
          "type": "constructor",
          "id": [
            "langchain",
            "schema",
            "messages",
            "AIMessage"
          ],
          "kwargs": {
            "content": "Based on the provided information, it is not clear whether probability is a specific topic in the class.",
            "additional_kwargs": {}
          }
        }
      }
    ]
  ],
  "llm_output": {
    "token_usage": {
      "prompt_tokens": 167,
      "completion_tokens": 20,
      "total_tokens": 187
    },
    "model_name": "gpt-3.5-turbo"
  },


RetrievalQA limitation: it does not preserve chat memory/history, so each question is answered on its own.
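
The limitation can be pictured with a minimal sketch in which the chain is replaced by a stub. `stateless_qa` and `with_history` are hypothetical helpers, not LangChain APIs; folding earlier turns into the follow-up question is one possible workaround, shown here only as an assumption.

```python
def stateless_qa(question):
    """Stand-in for qa_chain: it sees only the current question."""
    return f"retrieve and answer for: {question!r}"

def with_history(history, question):
    """Hypothetical workaround: fold earlier turns into the new question."""
    past = " ".join(f"Q: {q} A: {a}" for q, a in history)
    return f"{past} Follow-up: {question}".strip()

history = [
    ("Is probability a class topic?", "Yes, it is assumed as a prerequisite."),
]
follow_up = "Why are those prerequisites needed?"

# Without history, nothing tells the retriever what "those" refers to
print(stateless_qa(follow_up))

# With history folded in, the earlier topic travels with the question
standalone = with_history(history, follow_up)
print(stateless_qa(standalone))
```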

In [36]:
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever()
)

In [37]:
question = "Is probability a class topic?"
result = qa_chain({"query": question})
result["result"]

[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQA] Entering Chain run with input:
[0m{
  "query": "Is probability a class topic?"
}
[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQA > 3:chain:StuffDocumentsChain] Entering Chain run with input:
[0m[inputs]
[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQA > 3:chain:StuffDocumentsChain > 4:chain:LLMChain] Entering Chain run with input:
[0m{
  "question": "Is probability a class topic?",
  "context": "of this class will not be very program ming intensive, although we will do some \nprogramming, mostly in either MATLAB or Octa ve. I'll say a bit more about that later.  \nI also assume familiarity with basic proba bility and statistics. So most undergraduate \nstatistics class, like Stat 116 taught here at Stanford, will be more than enough. I'm gonna \nassume all of you know what ra ndom variables are, that all of you know what expectation \nis, what a variance or a random variable is. And in case of some of you, it's 

'Yes, probability is a topic that will be covered in this class. The instructor assumes familiarity with basic probability and statistics, so it is expected that students have prior knowledge in this area.'

In [38]:
question = "why are those prerequesites needed?"
result = qa_chain({"query": question})
result["result"]

[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQA] Entering Chain run with input:
[0m{
  "query": "why are those prerequesites needed?"
}
[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQA > 3:chain:StuffDocumentsChain] Entering Chain run with input:
[0m[inputs]
[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQA > 3:chain:StuffDocumentsChain > 4:chain:LLMChain] Entering Chain run with input:
[0m{
  "question": "why are those prerequesites needed?",
  "context": "So, see, most of us use learning algorithms half a dozen, a dozen, maybe dozens of times \nwithout even knowing it.  \nAnd of course, learning algorithms are also  doing things like giving us a growing \nunderstanding of the human genome. So if so meday we ever find a cure for cancer, I bet \nlearning algorithms will have had a large role in that. That's sort of the thing that Tom \nworks on, yes?  \nSo in teaching this class, I sort of have thre e goals. One of them is just to I hope convey \nsome of my own 

'The prerequisites are needed because they provide the foundational knowledge and skills necessary to understand and apply machine learning algorithms. \n\nBasic knowledge of computer science and computer skills is important because machine learning algorithms often involve programming and working with data. Understanding concepts like big-O notation helps in analyzing the efficiency and scalability of algorithms.\n\nFamiliarity with probability and statistics is necessary because machine learning involves working with data and making predictions based on statistical models. Knowledge of random variables, expectation, variance, and probability distributions is essential in understanding and implementing machine learning algorithms.\n\nBasic knowledge of linear algebra is required because many machine learning algorithms involve matrix operations and linear transformations. Understanding concepts like matrices, vectors, matrix multiplication, and matrix inverse is important in working w