### Loading the vector store

In [1]:
from dotenv import load_dotenv
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores.pgvector import PGVector

load_dotenv()

COLLECTION_NAME = "documents"
DB_CONNECTION = "postgresql://postgres:supa-jupyteach@192.168.0.77:54328/postgres"

def get_vectorstore():
    embeddings = OpenAIEmbeddings()

    db = PGVector(
        embedding_function=embeddings,
        collection_name=COLLECTION_NAME,
        connection_string=DB_CONNECTION,
    )
    return db

db = get_vectorstore()
retriever = db.as_retriever()
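
Since `load_dotenv()` is already called above, one option is to keep the connection string out of the notebook and read it from the environment instead. The sketch below assumes a hypothetical `DB_CONNECTION` variable defined in your `.env` file; the variable name, the helper name `get_db_connection`, and the error message are illustrative choices, not something the setup above requires.

```python
import os


def get_db_connection(env_var="DB_CONNECTION"):
    """Return the Postgres connection string stored in an environment variable.

    `env_var` is a hypothetical name; use whatever key your .env file defines.
    """
    value = os.environ.get(env_var)
    if not value:
        # Fail early with a readable message instead of a confusing DB error later
        raise RuntimeError(f"Set {env_var} in your environment or .env file")
    return value
```

With this in place, the constant above could be written as `DB_CONNECTION = get_db_connection()` after the `load_dotenv()` call.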

### Connecting to the LLM

To connect to the LLM we will follow recent (August 2023) recommendations from the langchain team.

For reference you can use this [blog post](https://blog.langchain.dev/conversational-retrieval-agents/) and this [guide](https://python.langchain.com/docs/use_cases/question_answering/conversational_retrieval_agents?ref=blog.langchain.dev)

The tools those resources recommend are imported as follows:

In [None]:
from langchain.agents.agent_toolkits import create_conversational_retrieval_agent
from langchain.agents.agent_toolkits import create_retriever_tool

We start with the `create_conversational_retrieval_agent` function. 

Let's check its docstring

In [9]:
create_conversational_retrieval_agent?

Signature:
create_conversational_retrieval_agent(
    llm: langchain.schema.language_model.BaseLanguageModel,
    tools: List[langchain.tools.base.BaseTool],
    remember_intermediate_steps: bool = True,
    memory_key: str = 'chat_history',
    system_message: Optional[langchain.schema.messages.SystemMessage] = None,
    verbose:

Note that we need to pass three arguments:

1. `llm`: an instance of a `BaseLanguageModel` subclass. We will use `langchain.chat_models.ChatOpenAI`.
2. `tools`: a list of langchain [tools](https://python.langchain.com/docs/modules/agents/tools/). Tools let the LLM access external resources, for example by searching the web or running Python code. We will use a tool to do retrieval, and langchain helps us build it via the `create_retriever_tool` function we imported.
3. `system_message`: This is where we customize the system prompt (the set of instructions for the LLM). This is where you will spend most of your time, and you will need to create many (dozens!) of variations to see what works best.

Note that if we didn't pass a custom system prompt, a default one would be added. You can read the contents of the default prompt [here](https://github.com/langchain-ai/langchain/blob/master/libs/langchain/langchain/agents/agent_toolkits/conversational_retrieval/openai_functions.py#L17-L24). I'll also include it right here so we can discuss...

```python
def _get_default_system_message() -> SystemMessage:
    return SystemMessage(
        content=(
            "Do your best to answer the questions. "
            "Feel free to use any tools available to look up "
            "relevant information, only if necessary"
        )
    )
```

Notice that the default prompt instructs the LLM to answer questions **and** to "Feel free to use any tools available to look up relevant information, only if necessary". We should always include that second sentence in our own prompts. Without an instruction to use tools, the LLM may never call the retriever.
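
Because it is easy to drop that sentence while iterating on prompt variations, a small helper can append it automatically. This is a minimal sketch; `TOOL_INSTRUCTION` and `with_tool_instruction` are hypothetical names introduced here for illustration and are not part of langchain.

```python
# The tool-usage sentence copied from the default system message shown above
TOOL_INSTRUCTION = (
    "Feel free to use any tools available to look up "
    "relevant information, only if necessary"
)


def with_tool_instruction(prompt_text):
    """Append the tool-usage sentence unless the prompt already contains it."""
    if TOOL_INSTRUCTION in prompt_text:
        return prompt_text
    return prompt_text.rstrip() + "\n\n" + TOOL_INSTRUCTION + "\n"


print(with_tool_instruction("You are a helpful teaching assistant."))
```

The returned text can then be used wherever a system prompt string is expected.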

Below I will show an example of items 1, 2, and 3. In your work you can keep items 1 and 2 fixed, but you will need to customize item 3 as described above. To support this workflow, I will define a function that always creates items 1 and 2 and takes the text for item 3 as an argument...

In [11]:
from langchain.schema.messages import SystemMessage

def create_chain(system_message_text):
    # step 1: create llm
    from langchain.chat_models import ChatOpenAI
    llm = ChatOpenAI(temperature=0)
    
    # step 2: create retriever tool
    tool = create_retriever_tool(
        retriever,
        "search_course_content",
        "Searches and returns documents regarding the contents of the course and notes from the instructor.",
    )
    tools = [tool]

    # step 3: create system message from the text passed in as an argument
    system_message = SystemMessage(content=system_message_text)

    # return the chain
    return create_conversational_retrieval_agent(
        llm=llm, 
        tools=tools, 
        verbose=True, 
        system_message=system_message
    )

Finally, here is an example of a system prompt that I wrote and a few messages showing how to interact with the returned agent...

In [16]:
example_system_prompt_text = """\
You are a helpful, knowledgeable, and smart teaching assistant.

You specialize in helping students understand concepts their instructors teach by:

1. Explaining concepts in concise, simple, and clear language
2. Providing additional examples of the topics being discussed
3. Summarizing content from the instructor, which will be provided to you along with the student's question

Feel free to use any tools available to look up relevant information, only if necessary
"""

example_chat = create_chain(example_system_prompt_text)

We are now ready to chat with this AI!

We do so by calling `example_chat` as a function and passing in a dictionary with a single key, `input`, whose value is the student's message.

In [17]:
example_chat({"input": "Hi, I'm Spencer"})



> Entering new AgentExecutor chain...
Hello Spencer! How can I assist you today?

> Finished chain.


{'input': "Hi, I'm Spencer",
 'chat_history': [HumanMessage(content="Hi, I'm Spencer"),
  AIMessage(content='Hello Spencer! How can I assist you today?')],
 'output': 'Hello Spencer! How can I assist you today?',
 'intermediate_steps': []}

Notice no retrieval was done! This is good.
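
The empty `intermediate_steps` list in the output above is what tells us no tool was called. Here is a minimal sketch of turning that observation into a programmatic check, assuming the result dictionary has the shape shown above. `used_retrieval` is a hypothetical helper name, and the example dictionary is retyped by hand with `chat_history` left out.

```python
def used_retrieval(result):
    """Return True if the agent recorded at least one intermediate (tool) step."""
    return len(result.get("intermediate_steps", [])) > 0


# Same shape as the output above; chat_history omitted for brevity
greeting_result = {
    "input": "Hi, I'm Spencer",
    "output": "Hello Spencer! How can I assist you today?",
    "intermediate_steps": [],
}

print(used_retrieval(greeting_result))  # False
```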

Now let's ask a question and see retrieval happen

Notice that we simply call `example_chat` again. Langchain will keep track of the entire history of the chat for us, so we just call back into this same function.

Also notice that I'm storing the result to a variable called `result`. We'll unpack this later...

In [18]:
result = example_chat({"input": "What did the professor say are the four core reshaping operations for a pandas DataFrame?"})



> Entering new AgentExecutor chain...

Invoking: `search_course_content` with `core reshaping operations for pandas DataFrame`


[Document(page_content="Hello, this is Spencer Lion and I'm really excited about our topic today today. We're talking about group-by operations and pandas, which in my opinion is one of the very coolest and most powerful operations we can do. Let's take a stock of our pandas journey thus far. We started out by learning about the core data types and pandas. This includes the series and the data frame. We then learned how we can do operations such as extracting values or subsets of values from our series and data frame objects. We learned about how we can do arithmetic, either on single values or on entire columns or entire data frames, all at once. We then studied how we can organize our data in pandas using the index and the column names. We saw how a careful selection of the index and columns names could help with anal

In [21]:
print(result["output"])

The professor mentioned that the four core reshaping operations for a pandas DataFrame are:

1. Set Index: This operation sets one or more columns as the index of the DataFrame, allowing for easier data manipulation and analysis based on the index values.

2. Reset Index: This operation resets the index of the DataFrame back to the default integer index, removing any previously set index.

3. Stack: This operation reshapes the DataFrame from wide to long format by stacking the column labels into a single column, creating a hierarchical index.

4. Unstack: This operation reshapes the DataFrame from long to wide format by unstacking the hierarchical index and spreading the values from a single column into multiple columns.

These four operations are fundamental for reshaping and organizing data in pandas.


Excellent! This is spot on. And notice that retrieval happened.

We can see it in the printout above, but we can also check `result` for more details

The retrieval details will be contained in `result["intermediate_steps"]`

In [None]:
type(result['intermediate_steps'])

In [28]:
len(result['intermediate_steps'])

1

This is a list of all the intermediate steps the agent took. Here it has only one element, so let's inspect it.

In [29]:
type(result['intermediate_steps'][0])

tuple

In [30]:
len(result['intermediate_steps'][0])

2

A two element tuple... let's unpack

In [31]:
x1, x2 = result['intermediate_steps'][0]

In [32]:
type(x1)

langchain.schema.agent.AgentActionMessageLog

In [33]:
type(x2)

list

Ok so the first element of the tuple in `result['intermediate_steps'][0]` is a log (record) of what intermediate step happened.

In [34]:
x1

AgentActionMessageLog(tool='search_course_content', tool_input='core reshaping operations for pandas DataFrame', log='\nInvoking: `search_course_content` with `core reshaping operations for pandas DataFrame`\n\n\n', message_log=[AIMessage(content='', additional_kwargs={'function_call': {'name': 'search_course_content', 'arguments': '{\n  "__arg1": "core reshaping operations for pandas DataFrame"\n}'}})])

We see that this is an intermediate step where the LLM used the `search_course_content` tool. Recall that in `create_chain` we wrapped the retriever in a tool and gave it that name. The log message above shows exactly what was sent to the retriever.

Finally, the second element in the `result['intermediate_steps'][0]` tuple has a list of the documents (chunks) that were retrieved

In [35]:
x2

[Document(page_content="Hello, this is Spencer Lion and I'm really excited about our topic today today. We're talking about group-by operations and pandas, which in my opinion is one of the very coolest and most powerful operations we can do. Let's take a stock of our pandas journey thus far. We started out by learning about the core data types and pandas. This includes the series and the data frame. We then learned how we can do operations such as extracting values or subsets of values from our series and data frame objects. We learned about how we can do arithmetic, either on single values or on entire columns or entire data frames, all at once. We then studied how we can organize our data in pandas using the index and the column names. We saw how a careful selection of the index and columns names could help with analysis because pandas will align the data for us using the index and column names. We then took some time to understand how to reshape data, how to maybe transform it from
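
The unpacking steps above can be collected into one helper that summarizes every intermediate step. This is a sketch under assumptions: `summarize_steps` is a hypothetical name, and the `FakeAction` stand-in only mimics the `tool` and `tool_input` attributes seen on `AgentActionMessageLog`, so the example runs without a database or an API key.

```python
from collections import namedtuple

# Stand-in for AgentActionMessageLog, limited to the two attributes used below
FakeAction = namedtuple("FakeAction", ["tool", "tool_input"])


def summarize_steps(intermediate_steps):
    """Return one (tool, tool_input, n_documents) tuple per intermediate step."""
    summary = []
    for action, documents in intermediate_steps:
        summary.append((action.tool, action.tool_input, len(documents)))
    return summary


steps = [
    (
        FakeAction("search_course_content", "core reshaping operations for pandas DataFrame"),
        ["chunk 1", "chunk 2"],  # placeholder strings standing in for Document objects
    )
]
print(summarize_steps(steps))
```

With a real result, the call would be `summarize_steps(result["intermediate_steps"])`.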