<a href="https://colab.research.google.com/github/sacherjc/content-chatbot/blob/main/Shareable_copy_of_CapsidChatBot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Create your own chatbot!
Based on this tutorial from Marc Päpper: https://www.paepper.com/blog/posts/build-q-and-a-bot-of-your-website-using-langchain/

In [None]:
# First, make a copy of this notebook! That way you'll be able to save it

# This clones the GitHub repository that accompanies the tutorial
!git clone https://github.com/mpaepper/content-chatbot.git

# This changes the directory to content-chatbot
%cd content-chatbot/

Cloning into 'content-chatbot'...
remote: Enumerating objects: 36, done.
remote: Counting objects: 100% (35/35), done.
remote: Compressing objects: 100% (22/22), done.
remote: Total 36 (delta 14), reused 31 (delta 12), pack-reused 1
Unpacking objects: 100% (36/36), 272.76 KiB | 6.99 MiB/s, done.
/content/content-chatbot


In [None]:
# This installs the packages listed in the tutorial repository's requirements.txt
!pip install -r requirements.txt

In [None]:
# This lets you add your API key so the code below can call OpenAI's large language models (e.g. ChatGPT)
import os

# add your OpenAI API key here, between the quotes
os.environ['OPENAI_API_KEY'] = ''
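
Pasting the key straight into the notebook makes it easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming you don't mind typing the key each session, is to prompt for it with Python's standard `getpass` module. The helper name `ensure_api_key` and its `prompt` parameter are hypothetical and only used for illustration.

```python
import os
from getpass import getpass


def ensure_api_key(var_name="OPENAI_API_KEY", prompt=getpass):
    """Return the key stored in the environment, prompting for it if it is missing.

    `prompt` is any callable that takes a message and returns a string.
    It defaults to getpass, so the typed key is not echoed to the screen.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        key = prompt(f"Enter your {var_name}: ").strip()
        if not key:
            raise ValueError(f"{var_name} was left empty")
        # Store it so later cells can read it from the environment.
        os.environ[var_name] = key
    return key
```

Running `ensure_api_key()` in its own cell would then ask for the key once and leave it in `os.environ['OPENAI_API_KEY']` for the rest of the session.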


In [None]:
# Optional: Run this so you can access your Google drive files from Colab.

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive
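
Because Colab discards the files in its workspace when a session ends, it can help to copy anything you want to keep into the mounted Drive. The sketch below uses only the standard library. The helper name `backup_file` is hypothetical, and the destination folder in the usage note is an assumed example path, so change it to wherever you want the copy to live.

```python
import shutil
from pathlib import Path


def backup_file(src, dest_dir):
    """Copy src into dest_dir (creating the folder if needed) and return the new path."""
    src = Path(src)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # copy2 also preserves the file's timestamps where possible.
    return Path(shutil.copy2(src, dest_dir / src.name))
```

For example, `backup_file('links.txt', '/content/drive/MyDrive/content-chatbot-backup')` would place a copy of links.txt in a Drive folder of that (assumed) name.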


In [None]:
# Before running this code, create a new file called links.txt in the content-chatbot folder
# (the Capsid & Tail links are available here: https://drive.google.com/file/d/1YB5LuhNsr_v-WjloYiPoxZSsIUel_TW0/view?usp=share_link).
# Note: Google Colab doesn't keep files after a session is closed, so if you want to
# reuse links.txt later, make sure there's a copy of it in your Google Drive too.
# Open links.txt here in Colab by double-clicking on it (left panel, under Files).
# Paste your URLs into the empty file, one link per line (I used Capsid & Tail blog post URLs,
# but you can use any URLs you're interested in!), then press Cmd+S (Ctrl+S on Windows/Linux) to save.


# this opens/reads your links.txt file
with open('links.txt', 'r') as f:
  links = f.read()

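# this creates a list of links, skipping blank lines (e.g. the one left by a trailing newline)
links_split = [link.strip() for link in links.splitlines() if link.strip()]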

# this prints your list below so you can make sure it worked
print(links_split)
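
A typo or a stray non-URL line in links.txt will otherwise only surface later, when the scraper tries to fetch it. As a sketch of an optional check, the hypothetical helper `clean_links` below strips whitespace, drops blank lines and duplicates, and keeps only entries that parse as http or https URLs, using the standard library's `urllib.parse`. The sample text is made up for the demonstration.

```python
from urllib.parse import urlparse


def clean_links(raw_text):
    """Split raw_text into lines and return (good, rejected) lists of unique entries."""
    good, rejected, seen = [], [], set()
    for line in raw_text.splitlines():
        link = line.strip()
        if not link or link in seen:
            continue  # skip blank lines and repeats
        seen.add(link)
        parts = urlparse(link)
        if parts.scheme in ("http", "https") and parts.netloc:
            good.append(link)
        else:
            rejected.append(link)
    return good, rejected


sample = (
    "https://phage.directory/capsid/tabular-data\n"
    "\n"
    "not a url\n"
    "https://phage.directory/capsid/tabular-data\n"
)
good, rejected = clean_links(sample)
print(good)      # ['https://phage.directory/capsid/tabular-data']
print(rejected)  # ['not a url']
```

Printing the rejected entries before scraping gives you a chance to fix links.txt instead of hitting a request error partway through.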

In [None]:
# This block scrapes the text from each of the links in your links_split list

# First, it imports a bunch of stuff it needs
import argparse
import pickle
import requests
import xmltodict

from bs4 import BeautifulSoup
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter

# this function extracts text from a URL
def extract_text_from(url):
    html = requests.get(url).text
    soup = BeautifulSoup(html, features="html.parser")
    text = soup.get_text()

    lines = (line.strip() for line in text.splitlines())
    return '\n'.join(line for line in lines if line)


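if __name__ == '__main__':
    # These command-line options are left over from the tutorial's original script.
    # The sitemap is still downloaded and parsed here, but the result (raw) is not
    # used below: the pages to scrape come from links_split instead.
    parser = argparse.ArgumentParser(description='Embedding website content')
    parser.add_argument('-s', '--sitemap', type=str, required=False,
            help='URL to your sitemap.xml', default='https://www.paepper.com/sitemap.xml')
    parser.add_argument('-f', '--filter', type=str, required=False,
            help='Text which needs to be included in all URLs which should be considered',
            default='https://www.paepper.com/blog/posts')
    # parse_known_args() tolerates arguments it doesn't recognize,
    # such as ones a notebook kernel may add
    args, _ = parser.parse_known_args()

    r = requests.get(args.sitemap)
    xml = r.text
    raw = xmltodict.parse(xml)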

    # This loops through each URL in the links_split list, extracts the text from each
    # webpage, and saves the text + the URL as a dictionary in a list.

    # this creates a new empty list called pages
    pages = []

    # this is a 'for loop' which goes through each URL in the links_split list one by one
    # and appends a dictionary holding that page's text and URL to the pages list
    for url in links_split:
        pages.append({'text': extract_text_from(url), 'source': url})

    # This is the splitter: it splits each page's text at line breaks into chunks of up to
    # about 1500 characters and collects the chunks, along with the URL each one came from
    text_splitter = CharacterTextSplitter(chunk_size=1500, separator="\n")
    docs, metadatas = [], []
    for page in pages:
        splits = text_splitter.split_text(page['text'])
        docs.extend(splits)
        metadatas.extend([{"source": page['source']}] * len(splits))
        print(f"Split {page['source']} into {len(splits)} chunks")

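    # this embeds every chunk with OpenAI's embedding model, builds a FAISS index from the results,
    # and saves the index to faiss_store.pkl so that later cells can load it
    store = FAISS.from_texts(docs, OpenAIEmbeddings(), metadatas=metadatas)
    with open("faiss_store.pkl", "wb") as f:
        pickle.dump(store, f)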
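
To make the splitting step more concrete, here is a simplified stand-in written in plain Python. It is not LangChain's `CharacterTextSplitter` implementation (which also handles details such as chunk overlap). It only illustrates the idea of packing newline-separated pieces into chunks of at most `chunk_size` characters and repeating the source URL once per chunk. The function name `split_into_chunks`, the sample text, and the example.com URL are made up for this illustration.

```python
def split_into_chunks(text, chunk_size=1500, separator="\n"):
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size characters.

    A single piece longer than chunk_size is kept whole rather than cut.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece:
            continue  # ignore empty pieces from repeated separators
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            # The next piece would overflow, so close the current chunk.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


page = {"text": "alpha\nbeta\ngamma\ndelta", "source": "https://example.com/post"}
splits = split_into_chunks(page["text"], chunk_size=11)
# One metadata entry per chunk, mirroring the loop in the cell above.
metadatas = [{"source": page["source"]}] * len(splits)
print(splits)          # ['alpha\nbeta', 'gamma\ndelta']
print(len(metadatas))  # 2
```

Keeping the source next to each chunk is what later allows an answer to list the pages it was drawn from.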

In [None]:
# Ask questions here! Remove the # in front of a question to run it (a leading # stops that line from running)

#!python ask_question.py "Are people using qpcr for phage therapy?"

#!python ask_question.py "What is the State of Phage survey, how many articles feature it, and what did it find?"

# !python ask_question.py "What is the STAMP trial?"

!python ask_question.py "What is a spreadsheet useful for in the phage space?"

!python ask_question.py "Are phages useful in urinary infections? What has been their track record?"

#!python ask_question.py "Who treated a sea turtle with phages?"
#!python ask_question.py "List the australian phage researchers we have talked about in capsid & tail"
#!python ask_question.py "How might you count phages during phage therapy?"




> Entering new VectorDBQAWithSourcesChain chain...

> Finished chain.
Answer:  Spreadsheets are useful for laying out two-dimensional data and tracking relationships in a manual way.

Sources: https://phage.directory/capsid/how-to-organize-phage-biobank-data, https://phage.directory/capsid/tabular-data


> Entering new VectorDBQAWithSourcesChain chain...

> Finished chain.
Answer:  Phages have been used to reduce biofilm biomass in a human urine model and have been administered to patients in a phase 1/2 clinical trial to evaluate their PhageBank™ technology as a treatment for UTIs.

Sources: 
https://phage.directory/capsid/rumen-phage
https://phage.directory/capsid/salvage-phage-therapy
https://phage.directory/capsid/phage-therapy-access-india
https://phage.directory/capsid/go-viral-adaptive-phage-therapeutics


In [None]:
# This is to make a Chat App version of the above
import pickle
from langchain.prompts.prompt import PromptTemplate
from langchain.llms import OpenAI
from langchain.chains import ChatVectorDBChain

_template = """Given the following conversation and a follow up question,
rephrase the follow up question to be a standalone question.
Chat History:
{chat_history}
Follow Up Input: {question}
Standalone question:"""
CONDENSE_QUESTION_PROMPT = PromptTemplate.from_template(_template)

template = """You are an AI assistant for answering questions about
phage-related blog posts published on Capsid & Tail.
You are given the following extracted parts of
a long document and a question. Provide a conversational answer.
If you don't know the answer, just say "Hmm, I'm not sure.".
Don't try to make up an answer. If the question is not about
phages or phage therapy, politely inform them that you are tuned
to only answer questions about phages.
Question: {question}
=========
{context}
=========
Answer in Markdown:"""
QA = PromptTemplate(template=template, input_variables=["question", "context"])


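def get_chain(vectorstore):
    # temperature=0 keeps the model's answers as repeatable as possible
    llm = OpenAI(temperature=0)
    # the chain first condenses a follow-up into a standalone question,
    # then answers it using chunks retrieved from the vector store
    qa_chain = ChatVectorDBChain.from_llm(
        llm,
        vectorstore,
        qa_prompt=QA,
        condense_question_prompt=CONDENSE_QUESTION_PROMPT,
    )
    return qa_chain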


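if __name__ == "__main__":
    # load the FAISS index that the scraping cell saved earlier
    with open("faiss_store.pkl", "rb") as f:
        vectorstore = pickle.load(f)
    qa_chain = get_chain(vectorstore)
    chat_history = []
    print("Chat with the Capsid & Tail bot (type 'quit' or leave the line blank to stop):")
    while True:
        print("Your question:")
        question = input().strip()
        # stop the loop instead of running forever
        if question.lower() in ("", "quit", "exit"):
            break
        result = qa_chain({"question": question, "chat_history": chat_history})
        # keep each question/answer pair so follow-up questions have context
        chat_history.append((question, result["answer"]))
        print(f"AI: {result['answer']}")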
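
The chat loop passes the growing `chat_history` list back into the chain on every turn, which is what lets a follow-up question be rephrased into a standalone one. The sketch below shows just that bookkeeping with a stand-in chain, so it runs without an API key or a vector store. `echo_chain` and `run_turns` are hypothetical names, and the stub does not call any model.

```python
def echo_chain(inputs):
    """Stand-in for the real chain: reports how many earlier turns it was given."""
    turns = len(inputs["chat_history"])
    return {"answer": f"(stub) saw {turns} earlier turn(s) before: {inputs['question']}"}


def run_turns(chain, questions):
    """Feed questions to the chain one by one, carrying the history forward."""
    chat_history = []
    for question in questions:
        result = chain({"question": question, "chat_history": chat_history})
        # Store the pair as a (question, answer) tuple, as the chat loop does.
        chat_history.append((question, result["answer"]))
    return chat_history


history = run_turns(echo_chain, ["What is a phage?", "How are they counted?"])
for question, answer in history:
    print(f"Q: {question}\nAI: {answer}")
```

The second answer reports one earlier turn, which confirms that the first exchange was handed to the chain along with the follow-up question.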