<a href="https://colab.research.google.com/github/rastringer/promptcraft/blob/main/agents_vectorstores.ipynb" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Agents and vectorstores

In this notebook, we will explore one of the most fun features of LangChain: agents and their toolkits.

Agents have access to tools such as JSON, Wikipedia, Web Search, GitHub or Pandas Dataframes, and choose which tool to call based on the user's input.
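
To make the idea of tool selection concrete, here is a minimal sketch in plain Python. It is not how LangChain implements agents: in a real agent the LLM decides which tool to call, whereas `choose_tool` below is a hypothetical keyword rule standing in for that decision, and both tools are invented for illustration.

```python
import math
from typing import Callable, Dict


# Hypothetical tools: each one maps a text input to a text output.
def calculator(expression: str) -> str:
    """Handle requests of the form 'sqrt <number>'."""
    value = float(expression.split()[-1])
    return str(math.sqrt(value))


def word_counter(text: str) -> str:
    """Return the number of whitespace-separated words in the input."""
    return str(len(text.split()))


TOOLS: Dict[str, Callable[[str], str]] = {"sqrt": calculator, "count": word_counter}


def choose_tool(user_input: str) -> str:
    """Stand-in for the LLM's decision step: pick a tool by keyword."""
    return "sqrt" if user_input.lower().startswith("sqrt") else "count"


def run_agent(user_input: str) -> str:
    """Route the input to the chosen tool and return its output."""
    tool_name = choose_tool(user_input)
    return TOOLS[tool_name](user_input)


print(run_agent("sqrt 16"))
print(run_agent("how many words here"))
```

The dictionary of named callables is the part that carries over to real agents: each tool has a name and a text-in, text-out interface, and the agent's job is to pick one.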

See [here](https://python.langchain.com/docs/integrations/toolkits/) for a full list of agent toolkits.

We will use the following technologies:

* Vertex AI Generative Studio

* LangChain, a framework for building applications with large language models

* The open-source Chroma vector store database

## Data Retrieval with LLMs and Embeddings

Matching customer queries to products via embeddings and Retrieval Augmented Generation.

### Overview

This notebook demonstrates one method of using large language models to interact with data. Using the Wayfair [WANDS](https://www.aboutwayfair.com/careers/tech-blog/wayfair-releases-wands-the-largest-and-richest-publicly-available-dataset-for-e-commerce-product-search-relevance) dataset of more than 42,000 products, we will go through the following steps:

* Download the data into a pandas dataframe and take a smaller 1,000-row sample set

* Merge the product titles and descriptions, then generate embeddings for the merged text

* Prompt an LLM to retrieve details and relevant documents related to queries.
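
The steps above can be sketched end to end without any external services. The example below is a toy, assuming word counts as "embeddings" and three hypothetical product rows; the notebook itself uses a sentence-transformer model and a vector store instead.

```python
import math
from collections import Counter

# Hypothetical rows standing in for the WANDS product table.
products = [
    {"product_name": "slow cooker", "product_description": "programmable 6 quart slow cooker"},
    {"product_name": "door mat", "product_description": "durable outdoor door mat"},
    {"product_name": "bed sheets", "product_description": "soft cotton bed sheets"},
]


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real model returns a dense float vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Step 1: merge name and description. Step 2: embed each merged text.
texts = [f"{p['product_name']} {p['product_description']}" for p in products]
vectors = [embed(t) for t in texts]


def retrieve(query: str, k: int = 1) -> list:
    """Step 3: rank the stored texts by similarity to the query."""
    q = embed(query)
    ranked = sorted(range(len(texts)), key=lambda i: cosine(q, vectors[i]), reverse=True)
    return [texts[i] for i in ranked[:k]]


print(retrieve("durable door mat"))
```

In the full pipeline the retrieved texts are then passed to the LLM as context for answering the query.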


<img src="https://assets.wfcdn.com/im/01139917/resize-h800-w800%5Ecompr-r85/2315/231567967/Capricornus+3+Seater+Sofa.jpg" width="300"/> <img src="https://assets.wfcdn.com/im/07725066/resize-h800-w800%5Ecompr-r85/1584/158440119/Vancasso+BOMOOTIUR+Stoneware+Dinnerware+-+Set+of+18.jpg" width="300"/>


In [None]:
# Install the packages
! pip install --upgrade google-cloud-aiplatform
! pip install "shapely<2.0.0"
! pip install sentence_transformers
! pip install langchain
! pip install pypdf
! pip install pydantic==1.10.8
! pip install chromadb==0.3.26
! pip install langchain[docarray]
! pip install typing-inspect==0.8.0 typing_extensions==4.5.0

In [None]:
# Automatically restart kernel after installs so that your environment can access the new packages
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

In [None]:
from google.colab import auth
auth.authenticate_user()

### SDK and Project Initialization

In [None]:
# Fill in your GCP project_id and region
PROJECT_ID = "<..>"
REGION = "<..>"

import vertexai

vertexai.init(project=PROJECT_ID, location=REGION)

### Import LangChain tools

In [None]:
# Utils
import time
from typing import List

# Langchain
import langchain
from pydantic import BaseModel

print(f"LangChain version: {langchain.__version__}")

# Vertex AI
from google.cloud import aiplatform
from langchain.chat_models import ChatVertexAI
from langchain.embeddings import VertexAIEmbeddings
from langchain.llms import VertexAI
from langchain.schema import HumanMessage, SystemMessage

print(f"Vertex AI SDK version: {aiplatform.__version__}")

### Import data

In [None]:
!wget -q https://raw.githubusercontent.com/wayfair/WANDS/main/dataset/product.csv

In [None]:
import pandas as pd
product_df = pd.read_csv("product.csv", sep='\t')

In [None]:
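# Keep the first 1,000 rows, then drop rows with missing values
# (the result may therefore contain fewer than 1,000 rows)
product_df = product_df[:1000].dropna()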

In [None]:
len(product_df)

In [None]:
# Reduce the df to columns of interest
product_df = product_df.filter(["product_id", "product_name", "product_description", "average_rating"], axis=1)

In [None]:
product_df.head()

### Import and initialize pandas dataframe agent

In [None]:
from langchain.agents import create_pandas_dataframe_agent
from langchain.agents.agent_types import AgentType

In [None]:
agent = create_pandas_dataframe_agent(VertexAI(temperature=0), product_df, verbose=True)

In [None]:
agent.run("how many rows are there?")

In [None]:
agent.run("How many beds are there with a rating of > 4?")
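
Agent answers are worth checking by hand. The sketch below shows one plausible reading of the question above (name contains "bed", rating strictly greater than 4) in plain Python. The rows are hypothetical and only reuse the dataframe's column names; the agent may interpret the question differently.

```python
# Hypothetical rows; the real dataframe has the same column names.
rows = [
    {"product_name": "solid wood platform bed", "average_rating": 4.5},
    {"product_name": "upholstered bed", "average_rating": 3.5},
    {"product_name": "slow cooker", "average_rating": 4.8},
    {"product_name": "metal bed frame", "average_rating": 4.2},
]


def count_matching(rows, keyword, min_rating):
    """Count rows whose name contains `keyword` and whose rating exceeds `min_rating`."""
    return sum(
        1
        for row in rows
        if keyword in row["product_name"].lower() and row["average_rating"] > min_rating
    )


beds_above_four = count_matching(rows, "bed", 4)
print(beds_above_four)
```

Note that a substring match on "bed" also counts names such as "bedding", so the figure depends on how the filter is defined.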

In [None]:
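from langchain.agents import create_csv_agent

# Assumption: "data.csv" is meant to hold the reduced product dataframe,
# so write it to disk before handing the path to the CSV agent
product_df.to_csv("data.csv", index=False)

agent = create_csv_agent(
    VertexAI(temperature=0),
    "data.csv",
    verbose=True,
    agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
)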

In [None]:
agent.run("How many rows are there?")

In [None]:
agent.run("Do any product descriptions mention polypropylene pile? Output them as JSON please")

In [None]:
agent.run("What is the square root of all ratings for product names featuring sofas")

## Vector stores

We will explore embedding vectors and vector stores in more detail in the subsequent notebooks. Let's see what's possible by concatenating our `product_name` and `product_description` columns and creating a text file from the result. We can then create embeddings and perform various retrieval and Q&A tasks.

We will use the open source [Chroma](https://docs.trychroma.com/) vector store.

In [None]:
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.document_loaders import TextLoader

In [None]:
product_df['text_data'] = product_df['product_name'] + " " + product_df['product_description']

In [None]:
# Save the "text_data" column to a text file
text_file_path = "combined_text_data.txt"
product_df['text_data'].to_csv(text_file_path, sep='\t', index=False, header=False)


In [None]:
# Load the text file as a single document (it is split into chunks in the next cell)
loader = TextLoader("combined_text_data.txt")
documents = loader.load()

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 150
)

docs = text_splitter.split_documents(documents)

In [None]:
len(docs)
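
To see what `chunk_size` and `chunk_overlap` control, here is a simplified sliding-window splitter. It is only a sketch of the overlap idea: `RecursiveCharacterTextSplitter` first tries to break on separators such as paragraphs and lines, which this hypothetical `split_with_overlap` helper does not attempt.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Slice text into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last three characters of the previous one, so text that straddles a boundary still appears whole in at least one chunk.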

In [None]:
from langchain.vectorstores import Chroma

# Clear any previous vector store
!rm -rf ./docs/chroma

In [None]:
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
db = Chroma.from_documents(docs, embedding_function)

In [None]:
query = "Is there a slow cooker?"
docs = db.similarity_search(query, k=2)  # k sets the number of results returned

In [None]:
docs[0]

In [None]:
query = "Recommend a durable door mat"
docs = db.similarity_search(query, k=2)  # k sets the number of results returned

In [None]:
docs

### Retrieval

In [None]:
from langchain.chains import RetrievalQA

llm = VertexAI(
    model_name="text-bison@001",
    max_output_tokens=1024,
    temperature=0.1,
    top_p=0.8,
    top_k=40,
    verbose=True,
)

qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=db.as_retriever()
)

### Prompt

In [None]:
from langchain.prompts import PromptTemplate

# Build prompt
template = """Use the following pieces of context to answer the question at the end. \
If you don't know the answer, just say that you don't know, \
don't try to make up an answer. Use three sentences maximum. \
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate(
    input_variables=["context", "question"],
    template=template,
)
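
The sketch below shows roughly what happens to a template like this at query time: the retrieved chunks are joined into one context string and substituted along with the question. It uses plain `str.format` rather than LangChain, the retrieved texts are hypothetical, and the blank-line separator between chunks is an assumption about the chain's default behaviour.

```python
template = (
    "Use the following pieces of context to answer the question at the end.\n"
    "{context}\n"
    "Question: {question}\n"
    "Helpful Answer:"
)


def build_prompt(retrieved_chunks: list, question: str) -> str:
    """Join retrieved chunks and substitute them into the template."""
    context = "\n\n".join(retrieved_chunks)  # assumed separator between chunks
    return template.format(context=context, question=question)


prompt = build_prompt(
    ["soft cotton bed sheets", "microfiber sheet set"],  # hypothetical retrieved text
    "Can you recommend comfortable bed sheets?",
)
print(prompt)
```

Printing the filled prompt is a quick way to check that the context and the question land where the template expects them.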


In [None]:
# Rebuild the chain with the custom prompt and return the source documents
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=db.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)

In [None]:
question = "Can you recommend comfortable bed sheets?"
result = qa_chain({"query": question})
result["result"]