<a href="https://colab.research.google.com/github/vred13/detective-chatbot/blob/dev/DetectiveBot2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

I am starting over from my previous attempt, as that notebook had become too complicated.  I will pull in the data cleaning and data loading steps, but shift focus from LangGraph with multiple agents and a supervisor to a single agent with multiple tools.

# Creation of a Detective Bot
I first envisioned this project when I saw ads on Facebook for an app that lets you talk to a fictional character.  Sadly, those were based on canned responses, but I thought it would be a lovely way to test out an LLM and LangChain.  

Whenever I create a data project for myself, the first thing I question is the collection of data.  In this case I decided on public domain detective novels, specifically those that focus on a single detective or team of detectives as the main detection force.  That narrowed the data down a bit: I have a full set of Sherlock Holmes works by Sir Arthur Conan Doyle, 6 books from The Hardy Boys series by Franklin W. Dixon, and 9 of the works detailing the escapades of Hercule Poirot by Agatha Christie.  


## Data Collection
To collect this data, I went to [Project Gutenberg](https://www.gutenberg.org/), which is a library of over 70,000 books for which the copyright has expired.  I searched the site for detective novels and came up with the three sets of detective books listed above: Sherlock Holmes, Hercule Poirot, and The Hardy Boys.  

Next I need to get the text of these books into Python for analysis.  There is a Python package for accessing Project Gutenberg called Gutenbergpy, which was my starting point.  I also made a list of the book ids for each set of novels, which appears in the code below.

The package's header stripping still left a lot to clean up, so I wrote some of my own functions to grab the text directly from the website using `urllib`, `re`, `json`, and `nltk`.  I used the code at https://jss367.github.io/getting-text-from-project-gutenberg.html as a starting point and edited from there.

### Installing Packages

In [None]:
!pip install --upgrade pip
!pip install jupyter_server
!pip install docarray
!pip install hnswlib
!pip install tiktoken
!pip install langchain
!pip install lark
!pip install rapidocr-onnxruntime
!pip install sentence-transformers
!pip install faiss-gpu accelerate
!pip install ctransformers[gptq]
!pip install --upgrade gradio
!pip install -U langchain-community
!pip install git+https://github.com/huggingface/transformers
!pip install langchain-huggingface
!pip install langchainhub
#!pip install optimum
#!pip install auto-gptq
#!pip install bitsandbytes
#!pip install huggingface-hub
#!pip install einops

Collecting safetensors==0.3.1 (from exllama==0.1.0->ctransformers[gptq])
  Using cached safetensors-0.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.5 kB)
Using cached safetensors-0.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: safetensors
  Attempting uninstall: safetensors
    Found existing installation: safetensors 0.4.3
    Uninstalling safetensors-0.4.3:
      Successfully uninstalled safetensors-0.4.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
transformers 4.42.0.dev0 requires safetensors>=0.4.1, but you have safetensors 0.3.1 which is incompatible.
Successfully installed safetensors-0.3.1
Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-bpzkbu39
  Running com

### Loading Packages

In [None]:
import os
from urllib import request
import nltk
import re
import json
import numpy as np
import pandas as pd
import pickle
from langchain.vectorstores import FAISS
from langchain_community.llms import CTransformers
from google.colab import userdata
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_core.prompts import ChatPromptTemplate
from langchain.memory import ConversationBufferMemory
from langchain_huggingface import ChatHuggingFace
from langchain.tools.render import render_text_description
from langchain_huggingface import HuggingFacePipeline
from langchain_community.embeddings import HuggingFaceEmbeddings
import torch
from transformers import pipeline
from langchain_core.runnables import Runnable, RunnablePassthrough
from langchain_core.tools import BaseTool

from langchain.agents.format_scratchpad.tools import (
    format_to_tool_messages,
)
from langchain.agents.output_parsers.tools import ToolsAgentOutputParser
from langchain.vectorstores import FAISS, utils
import faiss
from transformers import AutoModelForCausalLM, AutoTokenizer

os.environ["OPENAI_API_KEY"] = userdata.get('OPEN_AI_KEY')
os.environ["NVIDIA_API_KEY"] = userdata.get('NVIDIA_API_KEY')
# Set other API keys similarly
os.environ["HF_TOKEN"] = userdata.get('HF_TOKEN')

In [None]:
from huggingface_hub import notebook_login

notebook_login()


## Clean Data

In [None]:
# Sherlock Holmes book IDs
sherlock = [48320, 244, 2852, 2097, 834, 108, 69700, 2350, 2346]

# Hercule Poirot book IDs
hercule = [863, 58866, 69087, 70114, 72824, 67160, 67173, 66446, 61262]

# Hardy Boys book IDs
hardy_boys = [73102, 72958, 72840, 70236, 70083, 69988]


In [None]:
def get_book_metadata(id):
  url = "https://gutendex.com/books/?ids="+ str(id)
  response = request.urlopen(url)
  response_json = json.loads(response.read())
  return response_json
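
To make the shape of the metadata concrete, here is a minimal sketch of pulling the title and authors out of a Gutendex-style response. The payload is hand-written and abbreviated rather than a real API response, and `extract_title_and_authors` is a hypothetical helper. The notebook itself only relies on `results[0]['title']`; the `authors` field is my assumption about the schema.

```python
import json

# Hand-written, abbreviated payload in the shape the notebook relies on.
# The 'authors' field is an assumption about the Gutendex schema.
SAMPLE_RESPONSE = json.dumps({
    "count": 1,
    "results": [
        {
            "id": 244,
            "title": "A Study in Scarlet",
            "authors": [{"name": "Doyle, Arthur Conan"}],
        }
    ],
})


def extract_title_and_authors(response_text):
    """Return (lowercased title, author names) for the first result, or None if empty."""
    payload = json.loads(response_text)
    results = payload.get("results", [])
    if not results:
        # An unknown id comes back with an empty result list in this sketch
        return None
    book = results[0]
    title = book["title"].lower()
    authors = [author["name"] for author in book.get("authors", [])]
    return title, authors


print(extract_title_and_authors(SAMPLE_RESPONSE))
```

Checking for an empty `results` list before indexing avoids an `IndexError` when an id is mistyped.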

In [None]:
def create_gutenberg_project_url(book_id):
  url = "https://www.gutenberg.org/files/" + str(book_id) + "/" + str(book_id) +"-0.txt"
  return url
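
Not every book is guaranteed to be stored under the `-0.txt` file name, so a fallback can be useful. The sketch below tries a list of candidate URLs in order. The second URL pattern is an assumption on my part about where Project Gutenberg also keeps plain text files, and both `candidate_gutenberg_urls` and `fetch_first_available` are hypothetical helpers, not part of this notebook.

```python
from urllib import request


def candidate_gutenberg_urls(book_id):
    """Return plain-text URLs to try for a book, most preferred first."""
    base = "https://www.gutenberg.org"
    return [
        f"{base}/files/{book_id}/{book_id}-0.txt",       # pattern used in this notebook
        f"{base}/cache/epub/{book_id}/pg{book_id}.txt",  # assumed alternative location
    ]


def fetch_first_available(urls, fetch=None):
    """Try each URL in order and return (url, text) for the first one that loads."""
    if fetch is None:
        # Default fetcher: download and decode, dropping a leading byte-order mark
        def fetch(url):
            return request.urlopen(url).read().decode("utf-8-sig")
    for url in urls:
        try:
            return url, fetch(url)
        except OSError:
            # urllib's URLError and HTTPError are subclasses of OSError
            continue
    raise LookupError("none of the candidate URLs could be fetched")
```

Passing `fetch` as a parameter makes it possible to try the logic without touching the network, for example `fetch_first_available(candidate_gutenberg_urls(244), fetch=my_fake_fetch)`.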

In [None]:
def text_from_gutenberg(title, author, url, path = 'corpora/canon_texts/', return_raw = False, return_tokens = False):
    # Convert inputs to lowercase
    title = title.lower()
    author = author.lower()

    # Check if the file is stored locally
    filename = path + title +'.txt'
    if os.path.isfile(filename) and os.stat(filename).st_size != 0:
        print("{title} file already exists".format(title=title))
        print(filename)
        with open(filename, 'r') as f:
            raw = f.read()
    else:
        print("{title} file does not already exist. Grabbing from Project Gutenberg".format(title=title))
        response = request.urlopen(url)
        raw = response.read().decode('utf-8-sig')
        print("Saving {title} file".format(title=title))
        with open(filename, 'w') as outfile:
            outfile.write(raw)

    if return_raw:
        return raw

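    # Option to return tokens; the header and footer are trimmed first
    # (find_beginning_and_end is defined in the next cell)
    if return_tokens:
      return nltk.word_tokenize(find_beginning_and_end(raw, title, author))
    else:
      return find_beginning_and_end(raw, title, author)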

In [None]:
def find_beginning_and_end(raw, title, author):
    '''
    Return the body of a book from the raw Project Gutenberg text by trimming
    everything up to the START marker (plus the title and author, if found)
    and everything from the END marker onward.
    '''
    lowered = raw.lower()
    start_regex = r'\*\*\*\s?start of th(is|e) project gutenberg ebook.*\*\*\*'
    draft_start_position = re.search(start_regex, lowered)
    if draft_start_position is None:
        return raw
    beginning = draft_start_position.end()
    # Escape the title and author so punctuation in them is matched literally
    title_position = re.search(re.escape(title.lower()), lowered[beginning:])
    if title_position:
        beginning += title_position.end()
        # If the title is present, check for the author's name as well
        author_position = re.search(re.escape(author.lower()), lowered[beginning:])
        if author_position:
            beginning += author_position.end()
    end_regex = r'end of th(is|e) project gutenberg ebook'
    end_position = re.search(end_regex, lowered)
    # Fall back to the end of the file if the END marker is missing
    end = end_position.start() if end_position else len(raw)

    text = raw[beginning:end]

    return text
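
To show what the marker-based trimming does, here is a small self-contained sketch run on a made-up string. It is a simplification of the function above: it only looks for the START and END marker lines and skips the title and author search. `strip_markers` and `SAMPLE` are names I made up for this illustration.

```python
import re

# Marker lines that Project Gutenberg places around the body of each book
START = re.compile(r"\*\*\*\s?start of th(is|e) project gutenberg ebook.*\*\*\*", re.IGNORECASE)
END = re.compile(r"\*\*\*\s?end of th(is|e) project gutenberg ebook", re.IGNORECASE)


def strip_markers(raw):
    """Keep only the text between the START and END marker lines."""
    start = START.search(raw)
    end = END.search(raw)
    begin = start.end() if start else 0       # no START marker: keep from the top
    stop = end.start() if end else len(raw)   # no END marker: keep to the bottom
    return raw[begin:stop].strip()


# Made-up text laid out like a Project Gutenberg file
SAMPLE = (
    "The Project Gutenberg eBook of An Example\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK AN EXAMPLE ***\n"
    "Chapter 1\nIt was a dark night.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK AN EXAMPLE ***\n"
    "Licence text follows...\n"
)

print(strip_markers(SAMPLE))
```

Even after this step the title page, contents, and preface remain inside the body, so more cleaning is still needed.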

In [None]:

def clean_book(id, author):
    book_meta_data = get_book_metadata(id)['results'][0]
    # This gets a book by its gutenberg id number
    book = text_from_gutenberg(book_meta_data['title'],author, create_gutenberg_project_url(id), path = "/content/drive/MyDrive/Detective Bot/data/")
    return book

In [None]:

sherlock_clean = [0]*len(sherlock)

for i in range(len(sherlock)):
  sherlock_clean[i]=clean_book(sherlock[i], "Arthur Conan Doyle")

#sherlock_df = pd.DataFrame({'series': ['Sherlock Holmes']*len(sherlock), 'raw_text': sherlock_raw, 'clean_text': sherlock_clean, 'clean_text2':sherlock_clean2})

NameError: name 'request' is not defined

In [None]:
hercule_clean = [0]*len(hercule)
for i in range(len(hercule)):
  hercule_clean[i]=clean_book(hercule[i], 'Agatha Christie')

#hercule_df = pd.DataFrame({'series': ['Hercule Poirot']*len(hercule), 'raw_text': hercule_raw, 'clean_text': hercule_clean})

In [None]:
hardy_boys_clean = [0]*len(hardy_boys)
for i in range(len(hardy_boys)):
  hardy_boys_clean[i]=clean_book(hardy_boys[i], 'Franklin W. Dixon')

#hardy_boys_df = pd.DataFrame({'series': ['Hardy Boys']*len(hardy_boys), 'raw_text': hardy_boys_raw, 'clean_text': hardy_boys_clean})

In [None]:
del sherlock_clean, hardy_boys_clean, hercule_clean

## Create Vector Stores from Clean Data

After spending a long time looking for a common pattern that would let me strip the title page and table of contents from every book, I realized there wasn't one.  Instead, I opened each book individually as a txt document and deleted the title page, contents, and any preface by hand.  I will now load all of the books back in and put the text into a dataframe with a column labeling the series, a column holding the title, and a column holding the full text of the book.  To open the clean data, you will need to install and load the libraries, define the `get_book_metadata` function, and run the cell with the book ids, then run the cells below.

In [None]:
# Sherlock Holmes book IDs
sherlock = [48320, 244, 2852, 2097, 834, 108, 69700, 2350, 2346]

# Hercule Poirot book IDs
hercule = [863, 58866, 69087, 70114, 72824, 67160, 67173, 66446, 61262]

# Hardy Boys book IDs
hardy_boys = [73102, 72958, 72840, 70236, 70083, 69988]


In [None]:
def get_book_metadata(id):
  url = "https://gutendex.com/books/?ids="+ str(id)
  response = request.urlopen(url)
  response_json = json.loads(response.read())
  return response_json

In [None]:
def open_clean_files(id, path):
  # Look up the title, which is also the file name used when the book was saved
  book_meta_data = get_book_metadata(id)['results'][0]
  title = book_meta_data['title'].lower()
  filename = path + title +'.txt'
  with open(filename, 'r') as f:
    raw = f.read()
  return raw


In [None]:
sherlock_clean = [0]*len(sherlock)
sherlock_label = ['sherlock']*len(sherlock)
sherlock_title_list = [0]*len(sherlock)
for i in range(len(sherlock)):
  sherlock_clean[i] = open_clean_files(sherlock[i], path = "/content/drive/MyDrive/Detective Bot/data/")
  sherlock_title_list[i]=get_book_metadata(sherlock[i])['results'][0]['title'].lower()



hercule_clean = [0]*len(hercule)
hercule_label = ['hercule']*len(hercule)
hercule_title_list = [0]*len(hercule)
for i in range(len(hercule)):
  hercule_clean[i] = open_clean_files(hercule[i], path = "/content/drive/MyDrive/Detective Bot/data/")
  hercule_title_list[i]=get_book_metadata(hercule[i])['results'][0]['title'].lower()


hardy_boys_clean = [0]*len(hardy_boys)
hardy_boys_label = ['hardy boys'] *len(hardy_boys)
hardy_boys_title_list = [0]*len(hardy_boys)
for i in range(len(hardy_boys)):
  hardy_boys_clean[i] = open_clean_files(hardy_boys[i], path = "/content/drive/MyDrive/Detective Bot/data/")
  hardy_boys_title_list[i]=get_book_metadata(hardy_boys[i])['results'][0]['title'].lower()


sherlock_df = pd.DataFrame({'label': sherlock_label, 'text': sherlock_clean, 'title': sherlock_title_list})
hercule_df = pd.DataFrame({'label': hercule_label, 'text': hercule_clean, 'title': hercule_title_list})
hardy_boys_df = pd.DataFrame({'label': hardy_boys_label, 'text': hardy_boys_clean, 'title': hardy_boys_title_list})



Now to take the data and add it to a Vector Database for each set of stories.
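
Before embedding, each book is split into overlapping chunks (750 characters with a 200 character overlap in the cell that follows). The sketch below illustrates the idea of overlap with plain fixed-width windows. It is a simplification, not the algorithm `RecursiveCharacterTextSplitter` uses, since that splitter prefers to break on separators such as paragraphs and spaces. `chunk_with_overlap` and `build_records` are hypothetical names for this illustration.

```python
def chunk_with_overlap(text, chunk_size=750, chunk_overlap=200):
    """Split text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


def build_records(books, chunk_size=750, chunk_overlap=200):
    """Pair every chunk with the title it came from, given (title, text) tuples."""
    records = []
    for title, text in books:
        for chunk in chunk_with_overlap(text, chunk_size, chunk_overlap):
            records.append((chunk, {"source": title}))
    return records


print(chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2))
# -> ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap means a sentence cut at a chunk boundary still appears whole in a neighbouring chunk, which helps retrieval at the cost of some duplicated text.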




In [None]:
def create_vector_database(df, index_name):
  from langchain.text_splitter import RecursiveCharacterTextSplitter
  text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 750,
    chunk_overlap = 200
  )

  embedding = HuggingFaceEmbeddings(model_name="hkunlp/instructor-large")
  textlist = df['text'].tolist()
  titlelist = df['title'].tolist()
  docs = []
  metadatas = []
  for i, d in enumerate(textlist):
    splits = text_splitter.split_text(d)
    docs.extend(splits)
    metadatas.extend([{"source": titlelist[i]}] * len(splits))

  # Here we create an in-memory vector store from the chunks; it is saved to disk in the next cell.
  store = FAISS.from_texts(docs, embedding, metadatas=metadatas)

  return embedding, store

In [None]:
sherlock_embed, sherlock_docsearch = create_vector_database(sherlock_df, 'sherlock')
hercule_embed, hercule_docsearch = create_vector_database(hercule_df, 'hercule')
hardy_embed, hardy_docsearch = create_vector_database(hardy_boys_df, 'hardy_boys')
with open('/content/drive/MyDrive/Detective Bot/data/sherlock_vec_db.pkl', 'wb') as f:
  pickle.dump([sherlock_embed, sherlock_docsearch],f)

with open('/content/drive/MyDrive/Detective Bot/data/hercule_vec_db.pkl', 'wb') as f:
  pickle.dump([hercule_embed, hercule_docsearch],f)

with open('/content/drive/MyDrive/Detective Bot/data/hardy_vec_db.pkl', 'wb') as f:
  pickle.dump([hardy_embed, hardy_docsearch],f)


Here is where you need to restart the session and reload all packages to run the next section.

## Load Vector Stores

In [None]:
#load the vector stores from pickles
with open('/content/drive/MyDrive/Detective Bot/data/sherlock_vec_db.pkl','rb') as f:
  sherlock_vdb = pickle.load(f)
sherlock_embed = sherlock_vdb[0]
sherlock_docsearch = sherlock_vdb[1]

with open('/content/drive/MyDrive/Detective Bot/data/hercule_vec_db.pkl', 'rb') as f:
  hercule_vdb = pickle.load(f)
hercule_embed = hercule_vdb[0]
hercule_docsearch = hercule_vdb[1]

with open('/content/drive/MyDrive/Detective Bot/data/hardy_vec_db.pkl', 'rb') as f:
  hardy_boys_vdb = pickle.load(f)

hardy_embed = hardy_boys_vdb[0]
hardy_docsearch = hardy_boys_vdb[1]

## Create Tools Function

In [None]:
def create_tool(name, vectorstore, description):
  from langchain.tools.retriever import create_retriever_tool
  retriever = vectorstore.as_retriever()

  # End the description with ". " so it does not run into the usage instructions
  tool = create_retriever_tool(retriever, name, description + ". Do not call this tool more than once. Do not call another tool if this returns results.")

  return tool

## Choose LLM

In [None]:
#LLM


llm = HuggingFacePipeline.from_model_id(
    model_id="MaziyarPanahi/Phi-3-mini-4k-instruct-v0.3",
    task="text-generation",
    pipeline_kwargs=dict(
        max_new_tokens=1000,
        do_sample=False,
        repetition_penalty=1.03,
        trust_remote_code=True
    ),
    device=0
)

#generate_text = pipeline(model="TheBloke/stablelm-zephyr-3b-GPTQ")

#llm = HuggingFacePipeline(pipeline=generate_text)

# Use a pipeline as a high-level helper
#from transformers import pipeline

#pipe = pipeline("text-generation", model="microsoft/Phi-3-mini-4k-instruct", trust_remote_code=True, device_map="cuda")

chat_model = ChatHuggingFace(llm=llm)



Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.




## Create Agent Tools

In [None]:
sherlock_tool = create_tool('sherlock_db', sherlock_docsearch, "Good for making up mysteries set in London in the 1900s or answering questions about Sherlock mysteries, and telling those mysteries from Sherlock Holmes' point of view")
hercule_tool = create_tool('hercule_db', hercule_docsearch, "Good for making up mysteries set in Europe in the 1920s or answering questions about Hercule Poirot mysteries, and telling those mysteries from Hercule Poirot's point of view")
hardy_boys_tool = create_tool('hardy_boys_db', hardy_docsearch, "Good for making up mysteries set in the US in the 1940s or answering questions about Hardy Boys mysteries, and telling those mysteries from The Hardy Boys' point of view")


## Make prompt

In [None]:


tools=[sherlock_tool, hercule_tool, hardy_boys_tool]

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            # Adjacent string literals are concatenated, so each sentence ends with a space
            "I want you to respond and answer like {character} using the tone, manner and vocabulary {character} would use. "
            "Do not write any explanations. Only answer like {character}. You must know all of the knowledge of {character}. "
            "Use sherlock_db if {character} is Sherlock or Sherlock Holmes or Holmes and you are Sherlock Holmes. "
            "Use hercule_db if {character} is Hercule or Poirot or Hercule Poirot and you are Hercule Poirot. "
            "Use hardy_boys_db if {character} is Frank or Joe or Frank Hardy or Joe Hardy and you are part of The Hardy Boys. "
            "Respond to the following {input} as {character} would."
        ),
        ("placeholder", "{chat_history}"),
        (
            "human", "Hi {character},{input}"
        ),
        ("placeholder", "{agent_scratchpad}"),
    ]
)
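
The prompt above asks the model to pick the right tool from the character name. As a point of comparison, the same rules can be written as a plain lookup. This is only a sketch of the routing rules, not something the agent in this notebook uses, and `TOOL_ALIASES` and `tool_for_character` are hypothetical names.

```python
# Names from the system prompt, grouped by the tool they should route to
TOOL_ALIASES = {
    "sherlock_db": {"sherlock", "sherlock holmes", "holmes"},
    "hercule_db": {"hercule", "poirot", "hercule poirot"},
    "hardy_boys_db": {"frank", "joe", "frank hardy", "joe hardy"},
}


def tool_for_character(character):
    """Return the tool name for a character, or None if the name is not recognised."""
    name = character.strip().lower()
    for tool_name, aliases in TOOL_ALIASES.items():
        if name in aliases:
            return tool_name
    return None


print(tool_for_character("Sherlock Holmes"))  # -> sherlock_db
print(tool_for_character("Joe"))              # -> hardy_boys_db
```

A lookup like this is deterministic, whereas routing through the prompt depends on the model following the instructions.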


## Create Agent to run

In [None]:


agent= create_tool_calling_agent(chat_model, tools, prompt)

# max_iterations=1 allows a single agent step; raise it if the agent should answer after using a tool
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True, max_iterations=1, early_stopping_method='force')


In [None]:
agent_executor.invoke({"character": "Sherlock Holmes", "input": "Who is the love of your life?"})

The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.

> Entering new AgentExecutor chain...

You are not running the flash-attention implementation, expect numerical differences.

<|im_start|>system
I want you to respond and answer like Sherlock Holmes using the tone, manner and vocabulary Sherlock Holmes would use.Do not write any explanations. Only answer like Sherlock Holmes. You must know all of the knowledge of Sherlock Holmes.Use sherlock_db if Sherlock Holmes is Sherlock or Sherlock Holmes or Holmes and you are Sherlock Holmes.Use hercule_db if Sherlock Holmes is Hercule or Poirot or Hercule Poirot and you are Hercule Poirot.Use hardy_boys_db if Sherlock Holmes is Frank or Joe or Frank Hardy or Joe Hardy and you are part of The Hardy Boys.Respond to the following Who is the love of your life? as Sherlock Holmes would.<|im_end|>
<|im_start|>user
Hi Sherlock Holmes,Who is the love of your life?<|im_end|>
<|im_start|>assistant
Ah, my dear Watson, that question delves into the realm of personal sentiment, which is rather uncharted territory for a detective such as myself. However, I must confess that my pursuits have always been in the service of

{'character': 'Sherlock Holmes',
 'input': 'Who is the love of your life?',
 'output': '<|im_start|>system\nI want you to respond and answer like Sherlock Holmes using the tone, manner and vocabulary Sherlock Holmes would use.Do not write any explanations. Only answer like Sherlock Holmes. You must know all of the knowledge of Sherlock Holmes.Use sherlock_db if Sherlock Holmes is Sherlock or Sherlock Holmes or Holmes and you are Sherlock Holmes.Use hercule_db if Sherlock Holmes is Hercule or Poirot or Hercule Poirot and you are Hercule Poirot.Use hardy_boys_db if Sherlock Holmes is Frank or Joe or Frank Hardy or Joe Hardy and you are part of The Hardy Boys.Respond to the following Who is the love of your life? as Sherlock Holmes would.<|im_end|>\n<|im_start|>user\nHi Sherlock Holmes,Who is the love of your life?<|im_end|>\n<|im_start|>assistant\nAh, my dear Watson, that question delves into the realm of personal sentiment, which is rather uncharted territory for a detective such as mys

Now that I have an agent that will respond as one of the four detectives, it's time to create a user interface for it.
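
The chat function in the next cell expects messages of the form "Hi <character>, <request>" and has to strip the echoed prompt from the model output. Here is a minimal standalone sketch of those two steps, so they can be tried without loading the model. `parse_message` and `extract_reply` are hypothetical helpers, and the assistant tag is the one visible in the output above.

```python
ASSISTANT_TAG = "<|im_start|>assistant\n"


def parse_message(message):
    """Split 'Hi <character>, <request>' into (character, request)."""
    # Split on the first comma only so commas inside the request survive
    greeting, separator, request_text = message.partition(",")
    if not separator or not greeting.lower().startswith("hi "):
        raise ValueError("expected a message like 'Hi Sherlock Holmes, <question>'")
    character = greeting[3:].strip()
    return character, request_text.strip()


def extract_reply(raw_output):
    """Keep only the text after the last assistant tag in the model output."""
    return raw_output.split(ASSISTANT_TAG)[-1].strip()


print(parse_message("Hi Sherlock Holmes, who did it, and why?"))
# -> ('Sherlock Holmes', 'who did it, and why?')
```

Raising a `ValueError` for a malformed message gives the interface a clear error to show instead of an `IndexError` from a missing comma.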

In [None]:
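import gradio as gr
from langchain.schema import AIMessage, HumanMessage
def chatbotmessages(message, history):
  # Convert Gradio's (human, ai) history pairs to LangChain messages.
  # Note: this list is built but not currently passed to the agent.
  history_langchain_format = []
  for human, ai in history:
      history_langchain_format.append(HumanMessage(content=human))
      history_langchain_format.append(AIMessage(content=ai))
  history_langchain_format.append(HumanMessage(content=message))
  # Messages look like "Hi <character>, <request>"; split on the first comma only
  # so that commas inside the request are preserved
  greeting, _, user_input = message.partition(',')
  character = greeting[3:].strip()
  user_input = user_input.strip()
  result = agent_executor.invoke({"character": character, "input": user_input})
  # The model echoes the whole prompt, so keep only the text after the last assistant tag
  response = character + ":" + result['output'].split('<|im_start|>assistant\n')[-1]
  return response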



gr.ChatInterface(chatbotmessages, textbox = gr.Textbox("Hi ..."), title = "Ask a detective", description= "Ask Sherlock Holmes, Hercule Poirot, Frank Hardy or Joe Hardy any question. Start your message with Hi and the name of the person you want to ask followed by a comma and your request.").launch()

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://f85af7699dd47da368.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


