In [None]:
!pip install langchain==0.0.142 openai==0.27.4 beautifulsoup4==4.12.2 chromadb==0.3.21 GitPython==3.1.31

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

from langchain.document_loaders import (
    GitLoader,
    YoutubeLoader,
    DataFrameLoader
)
from langchain.text_splitter import CharacterTextSplitter
from langchain.schema import Document
from langchain.embeddings import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain.agents import initialize_agent, Tool

# Implementing a sales & support agent with LangChain
## Learn how to develop a chatbot that can answer questions based on the information provided in your company's documentation

Recently, I have been fascinated by the power of ChatGPT and its ability to construct various types of chatbots. I have tried and written about multiple approaches to implementing a chatbot that can access external information to improve its answers. I joined a few Discord channels during my chatbot coding sessions, hoping to get some help as the libraries are relatively new, and not much documentation is available yet. To my amazement, I found custom bots that could answer most of the questions for the given library.

The idea is to provide the chatbot the ability to dig through various resources like company documentation, code, or other content in order to allow it to answer company support questions. Since I already have some experience with chatbots, I decided to test how hard it is to implement a custom bot with access to the company's resources.
In this blog post, I will walk you through how I used OpenAI's models with the LangChain library to implement a sales & support agent that can answer questions about building applications with the Neo4j graph database. The agent can also help you debug or produce any Cypher statement you are struggling with. Such an agent could then be deployed to serve users on Discord or other platforms.

We will be using the LangChain library to implement the support bot. The library is easy to use and provides an excellent integration of LLM prompts and Python code, allowing us to develop chatbots in only a few minutes. In addition, the library supports a range of LLMs, text embedding models, and vector databases, along with utility functions that help us load and embed frequent types of files we might come across, like text, PowerPoint, images, HTML, PDF, and more.

## LangChain document loaders
First, we must preprocess the company's resources and store them in a vector database. Luckily, LangChain can help us load external data, calculate text embeddings, and store the documents in a vector database of our choice.
The first step is to load the text into documents. LangChain offers a variety of helper functions that take various formats and types of data and produce documents as output. These helper functions are called document loaders.

Neo4j has a lot of its documentation available in GitHub repositories. Conveniently, LangChain provides a document loader that takes a repository URL as input and produces a document for each file in the repository. Additionally, we can use the filter function to ignore files during the loading process if needed.

We will begin by loading the AsciiDoc files from Neo4j's knowledge base repository.

In [2]:
# Knowledge base
kb_loader = GitLoader(
    clone_url="https://github.com/neo4j-documentation/knowledge-base",
    repo_path="./repos/kb/",
    branch="master",
    file_filter=lambda file_path: file_path.endswith(".adoc")
    and "articles" in file_path,
)
kb_data = kb_loader.load()
print(len(kb_data))

309


Wasn't that easy as pie? The GitLoader function clones the repository and loads the relevant files as documents. In this example, we specified that the file must end with the .adoc suffix and be part of the articles folder. In total, 309 articles were loaded. We also have to be mindful of the size of the documents. For example, GPT-3.5-turbo has a token limit of roughly 4000 tokens, while GPT-4 allows roughly 8000 tokens in a single request. While the number of words is not identical to the number of tokens, it is still a good estimator.
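The filter passed to `file_filter` is just a predicate on the file path, so it can also be written as a named function that is easier to test on its own. The sketch below is a slightly stricter variant of the lambda used above: the helper name `is_kb_article` is hypothetical, and it assumes POSIX-style paths such as `./repos/kb/articles/foo.adoc`.

```python
from pathlib import PurePosixPath


def is_kb_article(file_path: str) -> bool:
    """Return True for AsciiDoc files located under an 'articles' directory.

    Hypothetical helper; assumes POSIX-style paths.
    """
    path = PurePosixPath(file_path)
    # Compare whole directory names, so "my-articles-old/x.adoc" is rejected,
    # unlike the substring check `"articles" in file_path`.
    return path.suffix == ".adoc" and "articles" in path.parts[:-1]


print(is_kb_article("./repos/kb/articles/foo.adoc"))
print(is_kb_article("./repos/kb/README.adoc"))
```

Such a function could then be passed to the loader as `file_filter=is_kb_article`.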

Next, we will load the documentation of the Graph Data Science repository. Here, we will use a text splitter to make sure none of the chunks exceed 2000 characters, since CharacterTextSplitter measures chunk_size in characters by default. Again, I know that the number of characters is not equal to the number of tokens, but it is a good approximation. The chunk size threshold can significantly affect how documents are found and retrieved.
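To get a feel for how text length relates to the token limits mentioned above, a rough estimate can be computed without calling any API. The sketch below assumes the common rule of thumb of about four characters per token for English text. The function names and the ratio are assumptions, and a real tokenizer will return different counts.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from the character count (a heuristic, not a tokenizer)."""
    if chars_per_token <= 0:
        raise ValueError("chars_per_token must be positive")
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(text: str, token_limit: int = 4000) -> bool:
    """Check whether the estimated token count stays within a model limit."""
    return estimate_tokens(text) <= token_limit


chunk = "a" * 2000  # a chunk of the maximum size used in this post
print(estimate_tokens(chunk), fits_in_context(chunk))
```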

In [3]:
# Define text chunk strategy
splitter = CharacterTextSplitter(
  chunk_size=2000, 
  chunk_overlap=50,
  separator=" "
)
# GDS guides
gds_loader = GitLoader(
    clone_url="https://github.com/neo4j/graph-data-science",
    repo_path="./repos/gds/",
    branch="master",
    file_filter=lambda file_path: file_path.endswith(".adoc") 
    and "pages" in file_path,
)
gds_data = gds_loader.load()
# Split documents into chunks
gds_data_split = splitter.split_documents(gds_data)
print(len(gds_data_split))

771
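To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window splitter in plain Python. It is only an illustration that assumes lengths are counted in characters. It is not LangChain's implementation, which additionally splits on the separator so that words are not cut in half.

```python
def sliding_chunks(text: str, chunk_size: int = 2000, chunk_overlap: int = 50) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
```

The overlap means that a sentence falling on a chunk boundary is still partly visible in both neighbouring chunks.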


We could load other Neo4j repositories that contain documentation. However, the idea is to show various data loading methods and not explore all of Neo4j's repositories containing documentation. Therefore, we will move on and look at other types of data sources.

For example, say that we want to load a YouTube video as a document source for our chatbot. Neo4j has its own YouTube channel, and I even appear in a video or two. Two years ago, I presented how to implement an information extraction pipeline.
With LangChain, we can take the captions of the video and load them as documents with only three lines of code.

In [4]:
# Youtube
yt_loader = YoutubeLoader("1sRgsEKlUr0")
yt_data = yt_loader.load()
yt_data_split = splitter.split_documents(yt_data)
print(len(yt_data_split))

10


It couldn't get any easier than this. Next, we will look at loading documents from a Pandas dataframe. A month ago, I retrieved articles from Neo4j's Medium publication for a separate blog post. Since we want to bring external information about Neo4j to the bot, we can also use the content of the Medium articles.

In [5]:
# Medium
article_url = "https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/medium/neo4j_articles.csv"
medium = pd.read_csv(article_url, sep=";")
medium["source"] = medium["url"]
medium_loader = DataFrameLoader(
    medium[["text", "source"]], 
    page_content_column="text")
medium_data = medium_loader.load()
medium_data_split = splitter.split_documents(medium_data)
print(len(medium_data_split))

2244


Here, we used Pandas to load a CSV file from GitHub, copied the url column into a source column, and used the DataFrameLoader function to load the articles as documents. Since Medium posts could exceed 4000 tokens, we used the text splitter to split the articles into multiple chunks.
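Conceptually, the loader turns every row into one document: the column named in `page_content_column` becomes the text, and the remaining columns become metadata. A plain-Python sketch of that mapping, using dictionaries instead of a dataframe (the `Doc` class and the `rows_to_docs` helper are hypothetical stand-ins, not LangChain classes):

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a document with text and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def rows_to_docs(rows: list[dict], page_content_column: str = "text") -> list[Doc]:
    """Turn each row into a Doc; all other columns end up in the metadata."""
    docs = []
    for row in rows:
        metadata = dict(row)  # copy so the input row is not modified
        content = metadata.pop(page_content_column)
        docs.append(Doc(page_content=content, metadata=metadata))
    return docs


rows = [{"text": "Graph databases store relationships.", "source": "https://example.com/post"}]
docs = rows_to_docs(rows)
print(docs[0])
```

This is why the code above keeps only the text and source columns: everything besides the text travels along as metadata.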

The last source we will use is the Stack Overflow API. Stack Overflow is a web platform where users help others solve coding problems. Their API does not require any authorization. Therefore, we can use the API to retrieve questions with accepted answers that are tagged with the Neo4j tag.

In [6]:
# Stackoverflow
so_data = []
for i in range(1, 20):
    # Define the Stack Overflow API endpoint and parameters
    api_url = "https://api.stackexchange.com/2.3/questions"
    params = {
        "order": "desc",
        "sort": "creation",
        "filter": "!-MBrU_IzpJ5H-AG6Bbzy.X-BYQe(2v-.J",
        "tagged": "neo4j",
        "site": "stackoverflow",
        "pagesize": 100,
        "page": i,
    }
    # Send GET request to Stack Overflow API
    response = requests.get(api_url, params=params)
    data = response.json()
    # Retrieve the resolved questions
    resolved_questions = [
        question
        for question in data["items"]
        if question["is_answered"] and question.get("accepted_answer_id")
    ]

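    # Combine each resolved question with its accepted answer into one document
    for question in resolved_questions:
        accepted_answer = [x["body"] for x in question["answers"] if x["is_accepted"]][0]
        # Build a plain string; calling str() on a tuple would store the tuple repr
        text = (
            "Title: " + question["title"] + "\n"
            + "Question: " + BeautifulSoup(question["body"], "html.parser").get_text() + "\n"
            + "Answer: " + BeautifulSoup(accepted_answer, "html.parser").get_text()
        )
        source = question["link"]
        so_data.append(Document(page_content=text, metadata={"source": source}))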
print(len(so_data))


775


Each accepted answer and the original question are used to construct a single document. Since most Stack Overflow questions and answers do not exceed 4000 tokens, we skipped the text-splitting step.
Now that we have loaded the documentation resources as documents, we can move on to the next step.
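The Stack Overflow loop above requests a fixed number of pages. The Stack Exchange API also returns a `has_more` flag in every response, so the paging could stop as soon as there is nothing left to fetch. A minimal sketch, where `fetch_page` is a hypothetical callable that returns the parsed JSON for a given page number (for example, a thin wrapper around `requests.get`):

```python
from typing import Callable


def collect_items(fetch_page: Callable[[int], dict], max_pages: int = 19) -> list[dict]:
    """Collect items page by page until the API reports no more results."""
    items = []
    for page in range(1, max_pages + 1):
        data = fetch_page(page)
        items.extend(data.get("items", []))
        if not data.get("has_more", False):
            break
    return items


# Stand-in for the real API so the sketch runs offline
fake_pages = {
    1: {"items": [{"question_id": 1}, {"question_id": 2}], "has_more": True},
    2: {"items": [{"question_id": 3}], "has_more": False},
}
calls = []


def fake_fetch(page: int) -> dict:
    calls.append(page)
    return fake_pages[page]


questions = collect_items(fake_fetch)
print(len(questions), calls)
```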
## Storing documents in a vector database
A chatbot finds relevant information by comparing the vector embedding of questions with document embeddings. A text embedding is a machine-readable representation of text in the form of a vector or, more plainly, a list of floats. In this example, we will use the ada-002 model provided by OpenAI to embed documents.
The whole idea behind vector databases is the ability to store vectors and provide fast similarity searches. The vectors are usually compared using cosine similarity. LangChain includes integration with a variety of vector databases. To keep things simple, we will use the Chroma vector database, which can run as a local in-memory database. For a more serious chatbot application, we would want a persistent database that doesn't lose data once the script or notebook is closed.
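To illustrate what a similarity search does, the following sketch computes cosine similarity in plain Python and ranks a few documents against a query. The three-dimensional vectors and document names are made up for the example; real embeddings have far more dimensions, and a vector database uses much faster search than this linear scan.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query: list[float], doc_vectors: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the names of the k documents most similar to the query."""
    ranked = sorted(
        doc_vectors,
        key=lambda name: cosine_similarity(query, doc_vectors[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy vectors standing in for real embeddings
doc_vectors = {
    "cypher_guide": [0.9, 0.1, 0.0],
    "pricing": [0.0, 0.2, 0.9],
    "gds_install": [0.7, 0.6, 0.1],
}
query_vector = [1.0, 0.2, 0.0]
print(top_k(query_vector, doc_vectors))
```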

We will create two collections of documents. The first will be more sales and marketing oriented, containing documents from Medium and YouTube. The second collection focuses more on support use cases and consists of documentation and Stack Overflow documents.

In [7]:
# Define embedding model
OPENAI_API_KEY = "OPENAI_API_KEY"
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

sales_data = medium_data_split + yt_data_split
sales_store = Chroma.from_documents(
    sales_data, embeddings, collection_name="sales"
)

support_data = kb_data + gds_data_split + so_data
support_store = Chroma.from_documents(
    support_data, embeddings, collection_name="support"
)

Using embedded DuckDB without persistence: data will be transient
Using embedded DuckDB without persistence: data will be transient


This script runs each document through OpenAI's text embedding API and inserts the resulting embedding along with the text into the Chroma database. Embedding all of the documents cost $0.80, which is a reasonable price.
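The embedding cost scales linearly with the number of embedded tokens. As a back-of-the-envelope check, the sketch below assumes a price of $0.0004 per 1,000 tokens. That rate is an assumption and should be checked against OpenAI's current pricing. At that rate, $0.80 would correspond to about two million tokens.

```python
def embedding_cost(total_tokens: int, usd_per_1k_tokens: float = 0.0004) -> float:
    """Estimated cost in USD for embedding total_tokens (price is an assumption)."""
    return total_tokens / 1000 * usd_per_1k_tokens


def tokens_for_budget(budget_usd: float, usd_per_1k_tokens: float = 0.0004) -> int:
    """How many tokens a given budget covers at the assumed price."""
    return round(budget_usd / usd_per_1k_tokens * 1000)


print(tokens_for_budget(0.80))
```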
## Question answering using external context

The last thing to do is to implement two separate question-answering flows. The first will handle the sales & marketing requests, while the other will handle support. The LangChain library uses LLMs for reasoning and providing answers to the user. Therefore, we start by defining the LLM. Here, we will be using the GPT-3.5-turbo model from OpenAI.

In [8]:
llm = ChatOpenAI(
    model_name="gpt-3.5-turbo",
    temperature=0,
    openai_api_key=OPENAI_API_KEY,
    max_tokens=512,
)

Implementing a question-answering flow is about as easy as it gets with LangChain. We only need to provide the LLM to be used along with the retriever that is used to fetch relevant documents. Additionally, we have the option to customize the LLM prompt used to answer questions.

In [9]:
sales_template = """As a Neo4j marketing bot, your goal is to provide accurate and helpful information about Neo4j,
a powerful graph database used for building various applications.
You should answer user inquiries based on the context provided and avoid making up answers.
If you don't know the answer, simply state that you don't know.
Remember to provide relevant information about Neo4j's features, benefits,
and use cases to assist the user in understanding its value for application development.

{context}

Question: {question}"""
SALES_PROMPT = PromptTemplate(
    template=sales_template, input_variables=["context", "question"]
)
sales_qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=sales_store.as_retriever(),
    chain_type_kwargs={"prompt": SALES_PROMPT},
)
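Conceptually, the "stuff" chain type places the text of all retrieved documents into the {context} slot of the prompt before the LLM is called. A rough plain-Python sketch of that step, with hypothetical document snippets and a shortened template. The blank-line separator between documents is an assumption about how they are joined.

```python
def build_stuff_prompt(template: str, documents: list[str], question: str) -> str:
    """Fill the prompt template with the joined documents and the question."""
    context = "\n\n".join(documents)
    return template.format(context=context, question=question)


template = "Answer based on the context only.\n\n{context}\n\nQuestion: {question}"
retrieved = ["Neo4j is a graph database.", "Cypher is Neo4j's query language."]
prompt = build_stuff_prompt(template, retrieved, "What is Cypher?")
print(prompt)
```

Because everything ends up in a single prompt, the chunk size and the number of retrieved documents together have to fit within the model's token limit.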

The most important part of the sales prompt is to prohibit the LLM from giving answers that are not based on official company resources. Remember, LLMs can sound very assertive while providing invalid information. We would like to avoid that scenario and the problems that follow when a bot promises or sells non-existent features. We can test the sales question-answering flow by asking the following question:

In [10]:
print(sales_qa.run("What are the main benefits of using Neo4j?"))

Neo4j is a powerful graph database that offers several benefits for application development. Some of the main benefits include:

1. High Performance: Neo4j is designed to handle complex queries and large datasets, making it ideal for applications that require high performance.

2. Flexible Data Model: Neo4j's flexible data model allows developers to easily store and query complex relationships between data points, making it ideal for applications that require complex data modeling.

3. Scalability: Neo4j is highly scalable, allowing developers to easily add new nodes and relationships as their application grows.

4. Easy to Use: Neo4j's query language, Cypher, is easy to learn and use, making it accessible to developers of all skill levels.

5. Real-Time Insights: Neo4j's ability to quickly analyze complex relationships between data points allows developers to gain real-time insights into their data, making it ideal for applications that require real-time data analysis.

Overall, Neo4j

The response to the question seems relevant and accurate. Remember, the information to construct this response came from Medium articles.

Next, we will implement the support question-answering flow. Here, we will allow the LLM to use its own knowledge of Cypher and Neo4j to help solve the user's problem if the context doesn't provide enough information.

In [11]:
support_template = """
As a Neo4j Customer Support bot, you are here to assist with any issues 
a user might be facing with their graph database implementation and Cypher statements.
Please provide as much detail as possible about the problem, how to solve it, and steps a user should take to fix it.
If the provided context doesn't provide enough information, you are allowed to use your knowledge and experience to offer you the best possible assistance.

{context}

Question: {question}"""

SUPPORT_PROMPT = PromptTemplate(
    template=support_template, input_variables=["context", "question"]
)

support_qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=support_store.as_retriever(),
    chain_type_kwargs={"prompt": SUPPORT_PROMPT},
)


And again, we can test the support question-answering abilities. I took a random question from Neo4j's Discord server.

In [12]:
print(support_qa.run("""
I am having my graph in a VM and i want to use GDS Plugins in my graph.
I didn't see any proper documentation to install it in my server where it was working only on locally.
Anyone clear me this to install the GDS Plugin in the VM ?
"""))

To install the GDS plugin on a Neo4j server running on a VM, you can follow the manual installation steps provided in the Neo4j documentation. 

1. Download the `neo4j-graph-data-science-[version].zip` file from the Neo4j Download Center.
2. Unzip the archive and move the `neo4j-graph-data-science-[version].jar` file into the `$NEO4J_HOME/plugins` directory.
3. Add the following to your `$NEO4J_HOME/conf/neo4j.conf` file:
```
dbms.security.procedures.unrestricted=gds.*
```
4. Check if the procedure allowlist is enabled in the `$NEO4J_HOME/conf/neo4j.conf` file and add the GDS library if necessary:
```
dbms.security.procedures.allowlist=gds.*
```
5. Restart Neo4j.

After following these steps, you should be able to use the GDS plugin on your Neo4j server running on the VM. You can verify the installation by running the `gds.version()` function in Neo4j Browser.


The response is quite to the point. Remember, we retrieved the Graph Data Science documentation and are using it as context to form the chatbot's answers.
## Agent implementation
We now have two separate instructions and stores for sales and support responses. If we had to put a human in the loop to distinguish between the two, the whole point of the chatbot would be lost. Luckily, we can use a LangChain agent to decide which tool to use based on the user input. First, we need to define the available tools of an agent along with instructions on when and how to use them.

In [13]:
# the zero-shot-react-description agent will use the "description" string to select tools
tools = [
    Tool(
        name = "sales",
        func=sales_qa.run,
        description="""useful for when a user is interested in various Neo4j information, 
                       use-cases, or applications. A user is not asking for any debugging, but is only
                       interested in general advice for integrating and using Neo4j.
                       Input should be a fully formed question."""
    ),
    Tool(
        name = "support",
        func=support_qa.run,
        description="""useful for when a user asks to optimize or debug a Cypher statement or needs
                       specific instructions how to accomplish a specified task. 
                       Input should be a fully formed question."""
    ),
]

The description of a tool is used by an agent to identify when and how to use a tool. For example, the support tool should be used to optimize or debug a Cypher statement and the input to the tool should be a fully formed question.

The last thing we need to do is to initialize the agent.

In [None]:
agent = initialize_agent(tools, llm, agent="zero-shot-react-description", verbose=True)

Now we can go ahead and test the agent on a couple of questions.

In [15]:
agent.run("""What are some GPT-4 applications with Neo4j?""")



> Entering new AgentExecutor chain...
This question is asking for general advice on using Neo4j with GPT-4, so the best tool to use would be sales.
Action: sales
Action Input: "What are some use-cases for integrating Neo4j with GPT-4?"
Observation: There are several use cases for integrating Neo4j with GPT-4. One example is using GPT-4 as a domain expert to help extract knowledge from a video transcript and then storing that information in a Neo4j knowledge graph. Another example is using GPT-4 for relationship extraction to accelerate knowledge graph construction in Neo4j. Additionally, GPT-4 can be used in conjunction with Neo4j to develop a knowledge graph-based chatbot that provides answers based on data stored in the knowledge graph. Overall, integrating Neo4j with GPT-4 can enhance the capabilities of both technologies and enable more powerful and accurate applications.
Thought: I now know the final answer
Final Answer: Ther

'There are several use cases for integrating Neo4j with GPT-4, including using GPT-4 as a domain expert to extract knowledge from a video transcript and storing it in a Neo4j knowledge graph, using GPT-4 for relationship extraction to accelerate knowledge graph construction, and developing a knowledge graph-based chatbot that provides answers based on data stored in the knowledge graph.'
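The trace above shows the plain-text format the agent's LLM produces: an Action line naming the tool and an Action Input line with the text to pass to it. A simplified sketch of how such output could be parsed and routed to a tool follows. The regular expression, the `route` helper, and the stand-in tool functions are hypothetical and are not LangChain's actual parser.

```python
import re

# "Action:" names the tool, "Action Input:" carries the text passed to it
ACTION_PATTERN = re.compile(
    r"Action:\s*(?P<tool>.+?)\s*\nAction Input:\s*(?P<input>.*)", re.DOTALL
)


def parse_action(llm_output: str) -> tuple[str, str]:
    """Extract the tool name and tool input from the LLM's text output."""
    match = ACTION_PATTERN.search(llm_output)
    if match is None:
        raise ValueError("no Action / Action Input pair found")
    tool_name = match.group("tool").strip()
    tool_input = match.group("input").strip().strip('"')
    return tool_name, tool_input


def route(llm_output: str, tools: dict) -> str:
    """Call the tool selected by the LLM with the input it provided."""
    tool_name, tool_input = parse_action(llm_output)
    if tool_name not in tools:
        raise KeyError(f"unknown tool: {tool_name}")
    return tools[tool_name](tool_input)


# Stand-ins for sales_qa.run and support_qa.run
tools = {
    "sales": lambda q: f"[sales] {q}",
    "support": lambda q: f"[support] {q}",
}
llm_output = (
    "This question is asking for general advice.\n"
    "Action: sales\n"
    'Action Input: "What are some use-cases for integrating Neo4j with GPT-4?"'
)
print(route(llm_output, tools))
```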

In [19]:
agent.run("""
Hello everyone, is there a way to execute a weighted shortest path query in the Neo4j Community edition?
All I have found on the internet was gds library or the algo library, which are both unavailable the community version.
""")



> Entering new AgentExecutor chain...
This sounds like a technical question that requires specific instructions or optimization.

Action: support
Action Input: Ask the user for more information about their specific use case and what they are trying to accomplish with the weighted shortest path query. Ask if they have tried any alternative methods or libraries.
Observation: As a Neo4j Customer Support bot, I would like to ask for more information about your specific use case and what you are trying to accomplish with the weighted shortest path query. Have you tried any alternative methods or libraries? This will help me provide you with the best possible assistance.
Thought: Based on the user's response, I may need to provide specific instructions or suggest alternative methods.

Action: support
Action Input: Based on the user's response, provide specific instructions or suggest alternative methods for executing a weighted shorte

"To execute a weighted shortest path query in the Neo4j Community edition, you can use the APOC library or implement Dijkstra's algorithm manually in Cypher. The APOC library provides a built-in function for Dijkstra's algorithm, while the manual implementation requires a Cypher query that finds the shortest path based on the sum of the weights of the relationships in the path."