<a href="https://colab.research.google.com/github/quantranvr/all-in-one/blob/main/LangChain_QA_w_RAG_part_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Build a chatbot that responds based on given websites and chat history

**Problem:**

A singer found [this article](https://www.music-tomorrow.com/blog/how-spotify-recommendation-system-works-a-complete-guide-2022) about Spotify's recommender system interesting. However, he doesn't know much about AI and data, so he struggles to understand the article's content.

Build a chatbot that uses the information in the article to help him better understand what he can do to influence the recommendation algorithms. **Remember to cite evidence from the article that supports the chatbot's responses**.

Keep in mind that he will want to ask many questions, and later questions will likely relate to earlier ones. Hence, **the chatbot should be able to take the chat history into account**!

1. Build a chatbot without citations or chat history (**done**)
2. Build a chatbot with citations but without chat history (**done**)
3. Build a chatbot with both citations and chat history (**not yet**)

# Setup

In [None]:
# wrap output
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

In [None]:
# install
!pip install --upgrade --quiet langchain langchain_community langchainhub langchain_openai chromadb bs4

In [None]:
# openai api key
import getpass
import os

os.environ["OPENAI_API_KEY"] = getpass.getpass()


# Import

In [None]:
# load
from langchain_community.document_loaders import WebBaseLoader
import bs4
# split
from langchain.text_splitter import RecursiveCharacterTextSplitter
# store & index
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
# retrieve and generate
from langchain import hub
from langchain_openai import ChatOpenAI
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain_core.output_parsers import StrOutputParser
from langchain.prompts import ChatPromptTemplate, PromptTemplate

# Load documents

In [None]:
# load

web_paths = (
    "https://www.music-tomorrow.com/blog/how-spotify-recommendation-system-works-a-complete-guide-2022",
)

bs_strainer = bs4.SoupStrainer(class_=("wrapper page-title", "article-rich-text w-richtext"))

loader = WebBaseLoader(
    web_paths = web_paths,
    bs_kwargs = {"parse_only": bs_strainer}
)

docs = loader.load()

print(f"Number of docs = {len(docs)}")

total_chars = 0
for i in range(len(docs)):
    doc_chars = len(docs[i].page_content)
    print(f"Doc #{i+1} has {doc_chars} characters")
    total_chars += doc_chars

print(f"Total characters = {total_chars}")

Number of docs = 1
Doc #1 has 23706 characters
Total characters = 23706


# Split into chunks

In [None]:
# split
splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 200,
    add_start_index = True,
)

splits = splitter.split_documents(docs)

print(f"Number of chunks = {len(splits)}")

Number of chunks = 30
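
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-width chunking with overlap. It is a simplification: `RecursiveCharacterTextSplitter` also prefers to break on separators such as paragraph and line breaks, so its real chunk boundaries will differ. `sliding_chunks` is a hypothetical helper written for illustration, not part of LangChain.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[tuple[int, str]]:
    """Return (start_index, chunk) pairs; neighbouring chunks share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks

# Toy input (not the article): 10 characters, windows of 4 with an overlap of 2.
demo = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(demo)
```

The start index recorded with each chunk plays the same role as the `start_index` metadata requested by `add_start_index = True`: it lets you locate a chunk in the original document.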


# Store and index chunks

In [None]:
# store and index
vectorstore = Chroma.from_documents(
    documents = splits,
    embedding = OpenAIEmbeddings(),
)
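
As a rough illustration of what a vector store does at query time, the sketch below ranks a few stored vectors by cosine similarity to a query vector. Everything in it is an assumption made for illustration: the two-dimensional vectors are made up (real embeddings have many more dimensions), `top_k` is a hypothetical helper, and the index structure and distance metric that Chroma actually uses may differ.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], indexed: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the texts of the k stored vectors most similar to query_vec."""
    ranked = sorted(
        indexed,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

# Made-up "embeddings" for three placeholder chunks.
toy_index = [
    ("chunk about playlists", [1.0, 0.0]),
    ("chunk about audio features", [0.0, 1.0]),
    ("chunk about both", [1.0, 1.0]),
]
nearest = top_k([1.0, 0.1], toy_index, k=2)
print(nearest)
```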

In [None]:
# retrieve and generate
retriever = vectorstore.as_retriever()
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

rag_prompt = hub.pull("rlm/rag-prompt")
trained_sentences = 2
retrieved_sentences = 3

instruction = """\
You are an assistant for question-answering tasks. \
If user's input is not a question, behave like a kind and ready-to-help chatbot. \
Else, firstly use your trained knowledge to answer the question in at most two sentences, starting with "From my knowledge, ". \
If you don't know then don't answer! \
Next, use the following pieces of retrieved context to answer the question in at most three sentences, starting with "From the articles, ". \
If you don't know then say "there is no information about that!" \
Keep the answer concise. \
"""
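def format_rag_prompt(ragp, instruction):
    # Replace the first line of the pulled prompt's template with our instruction.
    # Works on the prompt passed in as `ragp` (mutated in place and returned).
    contents = ragp.messages[0].prompt.template.split("\n")
    contents[0] = instruction
    ragp.messages[0].prompt.template = "\n".join(contents)
    return ragp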

rag_prompt = format_rag_prompt(rag_prompt, instruction)


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Chatbot 0 - without citation and chat history consideration

In [None]:
# chatbot_0: without citation and chat history consideration
chatbot_0 = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | llm
    | StrOutputParser()
)

In [None]:
# test chatbot_0
chatting = False
while chatting:
    print("You: ", end="\n")
    question = input()
    if "thank you and see you again" == question.lower():
        print("\nChatbot:")
        print("It is my pleasure to chat with you today!")
        break
    else:
        print("\nChatbot:")
        chatbot_0_msg = chatbot_0.invoke(question)
        print(chatbot_0_msg)

    print("\n--------------------\n")

# Chatbot 1 - with citation but no chat history consideration

In [None]:
def verify_source(answer_from_docs, chunk):
    prompt = PromptTemplate.from_template(
        """
        Is it true that \
        the following statement:
        {statement}
        is deduced from the following text:
        {text}

        Only answer with True or False!
        """
    )
    # Treat any reply containing "true" (case-insensitive) as a positive verdict.
    reply = llm.invoke(prompt.format(text=chunk.page_content, statement=answer_from_docs)).content
    return "true" in reply.lower()
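
One caveat about the substring check in `verify_source`: it accepts any reply that contains "true", including phrasings such as "not true" or "untrue". A stricter alternative, sketched below under the assumption that the model usually follows the "Only answer with True or False!" instruction, is to accept only a bare verdict and treat everything else as undecided. `parse_true_false` is a hypothetical helper, not something the notebook defines.

```python
def parse_true_false(reply: str):
    """Map an LLM reply to True or False, or None when it is not a bare verdict."""
    # Drop surrounding whitespace and trailing punctuation such as "True." or "False!"
    word = reply.strip().strip(".!\"'").lower()
    if word == "true":
        return True
    if word == "false":
        return False
    return None  # anything else is ambiguous; the caller decides what to do

print(parse_true_false("True"))
print(parse_true_false(" False!\n"))
print(parse_true_false("It is not true."))
```

Returning `None` for ambiguous replies lets the caller choose between skipping the citation and asking the model again.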

In [None]:
def get_proof(answer_from_docs, chunk):
    prompt = PromptTemplate.from_template(
        """
        the following statement:
        {statement}
        is deduced from the following text:
        {text}

        The proof of the statement can be found in which sentences in the text?
        Use bullet list to list the sentences!
        """
    )
    return llm.invoke(prompt.format(text=chunk.page_content, statement=answer_from_docs)).content

In [None]:
def output_parser(output: dict):
    full_ans = output["answer"].content

    # Keep only the paragraph(s) grounded in the retrieved context.
    rag_ans = [u for u in full_ans.split("\n\n") if 'From the articles, ' in u]
    if len(rag_ans) == 0:
        # Nothing to cite (e.g. a greeting), so return the answer unchanged.
        return full_ans

    answer_from_docs = rag_ans[0].replace('From the articles, ', '')
    citations = []
    for chunk in output["context"]:
        # Cite a chunk only if the LLM confirms that it supports the answer.
        if verify_source(answer_from_docs, chunk):
            proof = get_proof(answer_from_docs, chunk).replace("\n", "\n\n")
            link = chunk.metadata['source']
            citations.append(proof + "\n\n" + link)

    ref_str = "\n\n".join(["References:"] + citations)
    return full_ans + "\n\n" + ref_str
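
The first step of `output_parser`, isolating the paragraph that starts with the "From the articles, " marker, can be tried on its own without calling the LLM. The sketch below mirrors that step; `extract_article_answer` is a hypothetical helper, and the sample strings are placeholders rather than real model output.

```python
MARKER = "From the articles, "

def extract_article_answer(full_ans: str, marker: str = MARKER):
    """Return the first paragraph containing the marker, with the marker removed, else None."""
    for paragraph in full_ans.split("\n\n"):
        if marker in paragraph:
            return paragraph.replace(marker, "")
    return None

# Placeholder answer in the two-paragraph shape the prompt asks for.
sample = "From my knowledge, <model answer>.\n\nFrom the articles, <grounded answer>."
print(extract_article_answer(sample))

# A reply without the marker (such as a greeting) yields None, so no citations are needed.
print(extract_article_answer("Hello! How can I assist you today?"))
```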

In [None]:
# chatbot_1: with citation but no chat history consideration

chatbot_1 = RunnableParallel(
    {"context": retriever, "question": RunnablePassthrough()}
).assign(
    answer = (
        RunnablePassthrough.assign(context = lambda x: format_docs(x["context"]))
        | rag_prompt
        | llm
    )
) | output_parser

In [None]:
# test chatbot_1
chatting = True
while chatting:
    print("You: ", end="\n")
    question = input()
    if "thank you and see you again" == question.lower():
        print("\nChatbot:")
        print("It is my pleasure to chat with you today!")
        break
    else:
        print("\nChatbot:")
        chatbot_1_msg = chatbot_1.invoke(question)
        print(chatbot_1_msg)

    print("\n--------------------\n")

You: 
Hello

Chatbot:
Hello! How can I assist you today?

--------------------

You: 
How Spotify's recommender use collaborative filtering?

Chatbot:
From my knowledge, Spotify's recommender uses collaborative filtering by comparing users' listening history and recommending songs that similar users have enjoyed.

From the articles, Spotify's collaborative filtering model is trained on a sample of user-generated playlists, chosen based on the passion, care, love, and time users put into creating them. The algorithm compares users' listening history and recommends songs that similar users have enjoyed. However, Spotify has also moved towards focusing on the track's organizational similarity by studying playlist and listening session co-occurrence.

References:

- "Today, the Spotify collaborative filtering model is trained on a sample of ~700 million user-generated playlists selected out of the much broader set of all user-generated playlists on the platform."

- "The main principle for