<a href="https://colab.research.google.com/github/vred13/detective-chatbot/blob/main/DetectiveBot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Creation of a Detective Bot
I first envisioned this project after seeing Facebook ads for an app that lets you talk to a fictional character.  Sadly, those characters relied on canned responses, but it struck me as a lovely way to test out an LLM and LangChain.  

Whenever I create a data project for myself, the first thing I question is how the data will be collected.  In this case I decided on public domain detective novels, specifically those that focus on a single detective or team of detectives as the main detection force.  That narrowed things down a bit: I have a full set of Sherlock Holmes works by Sir Arthur Conan Doyle, 6 books from The Hardy Boys series by Franklin W. Dixon, and 9 works detailing the escapades of Hercule Poirot by Agatha Christie.  


## Data Collection
To collect this data, I went to [Project Gutenberg](https://www.gutenberg.org/), which is a library of over 70,000 books for which the copyright has expired.  I searched it for detective novels and came up with the three sets of detective books listed above: Sherlock Holmes, Hercule Poirot, and The Hardy Boys.  

Next I need to get the text of these books into Python for analysis.  There is a Python package for accessing Project Gutenberg called Gutenbergpy, which was my starting point.  I also made a list of the book IDs for each set of novels, which appear in the code.

Gutenbergpy's header stripping still left a lot to deal with, so I wrote some of my own functions to grab the text directly from the website using urllib, re, json, and nltk.  I used the code at https://jss367.github.io/getting-text-from-project-gutenberg.html as a starting point and edited from there.
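
To make the marker-based trimming concrete, here is a minimal sketch run on a tiny made-up sample rather than a real book. `SAMPLE` and `strip_markers` are hypothetical names used only for illustration, and the sketch assumes the file carries the usual `*** START OF ...` and `*** END OF ...` marker lines.

```python
import re

# Made-up miniature of a Project Gutenberg file; real files are far longer.
SAMPLE = (
    "The Project Gutenberg eBook of An Example\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK AN EXAMPLE ***\n"
    "Chapter I\n"
    "It was a dark and stormy night.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK AN EXAMPLE ***\n"
    "License text follows here.\n"
)

START_MARKER = re.compile(
    r"\*\*\*\s?START OF TH(?:IS|E) PROJECT GUTENBERG EBOOK.*\*\*\*", re.IGNORECASE
)
END_MARKER = re.compile(
    r"\*\*\*\s?END OF TH(?:IS|E) PROJECT GUTENBERG EBOOK", re.IGNORECASE
)


def strip_markers(raw):
    """Return the text between the START and END marker lines, if present."""
    start = START_MARKER.search(raw)
    end = END_MARKER.search(raw)
    # Fall back to the whole text when a marker is missing
    begin_index = start.end() if start else 0
    end_index = end.start() if end else len(raw)
    return raw[begin_index:end_index].strip()


body = strip_markers(SAMPLE)
print(body)
```

Everything outside the two marker lines is discarded, which is why the title page and contents that sit after the START marker still need separate handling.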

In [None]:
import os
from urllib import request
import nltk
import re
import json

In [None]:
def get_book_metadata(id):
  url = "https://gutendex.com/books/?ids="+ str(id)
  response = request.urlopen(url)
  response_json = json.loads(response.read())
  return response_json

In [None]:
def create_gutenberg_project_url(book_id):
  url = "https://www.gutenberg.org/files/" + str(book_id) + "/" + str(book_id) +"-0.txt"
  return url
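
The `-0.txt` pattern may not exist for every book on Project Gutenberg. Gutendex records also carry a `formats` mapping from MIME type to download URL, so one alternative is to read the plain-text link from the metadata. The sketch below works on a trimmed, made-up record; `sample_record` and `plain_text_url` are hypothetical names, and the URLs are placeholders rather than real download links.

```python
# Trimmed, made-up record in the shape Gutendex returns under "results".
# The URLs are placeholders, not real download links.
sample_record = {
    "id": 244,
    "title": "A Study in Scarlet",
    "authors": [{"name": "Doyle, Arthur Conan"}],
    "formats": {
        "text/html": "https://example.org/244.html",
        "text/plain; charset=utf-8": "https://example.org/244.txt",
    },
}


def plain_text_url(record):
    """Return the first plain-text download URL in a record, or None."""
    for mime_type, url in record.get("formats", {}).items():
        if mime_type.startswith("text/plain"):
            return url
    return None


print(plain_text_url(sample_record))
```

Returning `None` when no plain-text format is listed lets the caller fall back to a constructed URL.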

In [80]:
def text_from_gutenberg(title, author, url, path = 'corpora/canon_texts/', return_raw = False, return_tokens = False):
    # Convert inputs to lowercase
    title = title.lower()
    author = author.lower()

    # Check if the file is stored locally
    filename = path + title + '.txt'
    if os.path.isfile(filename) and os.stat(filename).st_size != 0:
        print("{title} file already exists".format(title=title))
        print(filename)
        with open(filename, 'r', encoding='utf-8') as f:
            raw = f.read()
    else:
        print("{title} file does not already exist. Grabbing from Project Gutenberg".format(title=title))
        response = request.urlopen(url)
        # 'utf-8-sig' drops a leading byte order mark if the file has one
        raw = response.read().decode('utf-8-sig')
        print("Saving {title} file".format(title=title))
        with open(filename, 'w', encoding='utf-8') as outfile:
            outfile.write(raw)

    if return_raw:
        return raw

    # Option to return tokens of the trimmed text
    if return_tokens:
        return nltk.word_tokenize(find_beginning_and_end(raw, title, author))
    else:
        return find_beginning_and_end(raw, title, author)

In [81]:
def find_beginning_and_end(raw, title, author):
    '''
    Return the body of a book from the raw Project Gutenberg text by trimming
    everything up to the START marker (plus the first mention of the title and
    author after it) and everything from the END marker onward.
    '''
    lowered = raw.lower()
    start_regex = r'\*\*\*\s?start of th(is|e) project gutenberg ebook.*\*\*\*'
    start_match = re.search(start_regex, lowered)
    if start_match is None:
        return raw
    beginning = start_match.end()
    # Escape the title and author so characters such as "." are matched literally
    title_match = re.search(re.escape(title.lower()), lowered[beginning:])
    if title_match:
        beginning += title_match.end()
        # If the title is present, check for the author's name as well
        author_match = re.search(re.escape(author.lower()), lowered[beginning:])
        if author_match:
            beginning += author_match.end()
    end_regex = r'end of th(is|e) project gutenberg ebook'
    end_match = re.search(end_regex, lowered)
    # Keep everything to the end of the file if the END marker is missing
    end = end_match.start() if end_match else len(raw)

    return raw[beginning:end]

In [82]:
import pandas as pd
import numpy as np


In [83]:
# Sherlock Holmes book IDs
sherlock = [48320, 244, 2852, 2097, 834, 108, 69700, 2350, 2346]

# Hercule Poirot book IDs
hercule = [863, 58866, 69087, 70114, 72824, 67160, 67173, 66446, 61262]

# Hardy Boys book IDs
hardy_boys = [73102, 72958, 72840, 70236, 70083, 69988]


In [84]:

def clean_book(id, author):
    # Look up the book's metadata (including its title) by its Gutenberg ID number
    book_meta_data = get_book_metadata(id)['results'][0]
    # Download the book (or load the cached copy) and trim the Gutenberg header and footer
    book = text_from_gutenberg(book_meta_data['title'], author, create_gutenberg_project_url(id), path="/content/drive/MyDrive/Detective Bot/data/")
    return book

In [85]:

sherlock_clean = [0]*len(sherlock)

for i in range(len(sherlock)):
  sherlock_clean[i]=clean_book(sherlock[i], "Arthur Conan Doyle")

#sherlock_df = pd.DataFrame({'series': ['Sherlock Holmes']*len(sherlock), 'raw_text': sherlock_raw, 'clean_text': sherlock_clean, 'clean_text2':sherlock_clean2})

adventures of sherlock holmes: illustrated file does not already exist. Grabbing from Project Gutenberg
Saving adventures of sherlock holmes: illustrated file
a study in scarlet file does not already exist. Grabbing from Project Gutenberg
Saving a study in scarlet file
the hound of the baskervilles file does not already exist. Grabbing from Project Gutenberg
Saving the hound of the baskervilles file
the sign of the four file does not already exist. Grabbing from Project Gutenberg
Saving the sign of the four file
the memoirs of sherlock holmes file does not already exist. Grabbing from Project Gutenberg
Saving the memoirs of sherlock holmes file
the return of sherlock holmes file does not already exist. Grabbing from Project Gutenberg
Saving the return of sherlock holmes file
the case-book of sherlock holmes file does not already exist. Grabbing from Project Gutenberg
Saving the case-book of sherlock holmes file
his last bow: an epilogue of sherlock holmes file does not already exist. G

In [87]:
hercule_clean = [0]*len(hercule)
for i in range(len(hercule)):
  hercule_clean[i]=clean_book(hercule[i], 'Agatha Christie')

#hercule_df = pd.DataFrame({'series': ['Hercule Poirot']*len(hercule), 'raw_text': hercule_raw, 'clean_text': hercule_clean})

the mysterious affair at styles file already exists
/content/drive/MyDrive/Detective Bot/data/the mysterious affair at styles.txt
the murder on the links file does not already exist. Grabbing from Project Gutenberg
Saving the murder on the links file
the murder of roger ackroyd file does not already exist. Grabbing from Project Gutenberg
Saving the murder of roger ackroyd file
the big four file does not already exist. Grabbing from Project Gutenberg
Saving the big four file
the mystery of the blue train file does not already exist. Grabbing from Project Gutenberg
Saving the mystery of the blue train file
the hunter's lodge case file does not already exist. Grabbing from Project Gutenberg
Saving the hunter's lodge case file
the missing will file does not already exist. Grabbing from Project Gutenberg
Saving the missing will file
the plymouth express affair file does not already exist. Grabbing from Project Gutenberg
Saving the plymouth express affair file
poirot investigates file does n

In [88]:
hardy_boys_clean = [0]*len(hardy_boys)
for i in range(len(hardy_boys)):
  hardy_boys_clean[i]=clean_book(hardy_boys[i], 'Franklin W. Dixon')

#hardy_boys_df = pd.DataFrame({'series': ['Hardy Boys']*len(hardy_boys), 'raw_text': hardy_boys_raw, 'clean_text': hardy_boys_clean})

the shore road mystery file does not already exist. Grabbing from Project Gutenberg
Saving the shore road mystery file
hunting for hidden gold file does not already exist. Grabbing from Project Gutenberg
Saving hunting for hidden gold file
the missing chums file does not already exist. Grabbing from Project Gutenberg
Saving the missing chums file
the secret of the old mill file does not already exist. Grabbing from Project Gutenberg
Saving the secret of the old mill file
the tower treasure file does not already exist. Grabbing from Project Gutenberg
Saving the tower treasure file
the house on the cliff file does not already exist. Grabbing from Project Gutenberg
Saving the house on the cliff file


In [89]:
del sherlock_clean, hardy_boys_clean, hercule_clean

After spending a long time looking for a common pattern that would strip the title page and contents from every book, I realized there wasn't one.  Instead, I opened each book individually as a text file and deleted the title page, contents, and any preface by hand.  I will now load all of the books back in and put the text into a dataframe with one column labeling the series and one column holding the full text of the book.
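
Hand editing is easy to get slightly wrong, so a quick scan for leftover front matter can help before loading the files. This is only a sketch: `LEFTOVER_PATTERNS` and `find_leftovers` are hypothetical names, and the pattern list is an assumption about what typically remains, not an exhaustive check.

```python
import re

# Phrases that usually belong to Project Gutenberg boilerplate or front matter.
# The list is an assumption and can be extended.
LEFTOVER_PATTERNS = [
    r"project gutenberg",
    r"^\s*contents\s*$",
    r"^\s*preface\s*$",
]


def find_leftovers(text):
    """Return the patterns that still match somewhere in a cleaned book."""
    return [
        pattern
        for pattern in LEFTOVER_PATTERNS
        if re.search(pattern, text, flags=re.IGNORECASE | re.MULTILINE)
    ]


# Made-up snippets standing in for a cleaned and an uncleaned file
cleaned = "Chapter I\nThe Science of Deduction\n"
uncleaned = "CONTENTS\n\nChapter I\nProduced for Project Gutenberg\n"
print(find_leftovers(cleaned))
print(find_leftovers(uncleaned))
```

An empty list means none of the listed phrases were found; it does not prove the file is fully clean.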

In [91]:
def open_clean_files(id, path):
  # The cleaned files are named after the lowercased title from the metadata
  book_meta_data = get_book_metadata(id)['results'][0]
  title = book_meta_data['title'].lower()
  filename = path + title + '.txt'
  with open(filename, 'r') as f:
    raw = f.read()
  return raw


In [94]:
sherlock_clean = [0]*len(sherlock)
sherlock_label = ['sherlock']*len(sherlock)
for i in range(len(sherlock)):
  sherlock_clean[i] = open_clean_files(sherlock[i], path = "/content/drive/MyDrive/Detective Bot/data/")


hercule_clean = [0]*len(hercule)
hercule_label = ['hercule']*len(hercule)
for i in range(len(hercule)):
  hercule_clean[i] = open_clean_files(hercule[i], path = "/content/drive/MyDrive/Detective Bot/data/")


hardy_boys_clean = [0]*len(hardy_boys)
hardy_boys_label = ['hardy boys'] *len(hardy_boys)
for i in range(len(hardy_boys)):
  hardy_boys_clean[i] = open_clean_files(hardy_boys[i], path = "/content/drive/MyDrive/Detective Bot/data/")


sherlock_df = pd.DataFrame({'label': sherlock_label, 'text': sherlock_clean})
hercule_df = pd.DataFrame({'label': hercule_label, 'text': hercule_clean})
hardy_boys_df = pd.DataFrame({'label': hardy_boys_label, 'text': hardy_boys_clean})

full_df = pd.concat([sherlock_df, hercule_df, hardy_boys_df], ignore_index=True)
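
The three near-identical loops above could also be folded into one pass over a mapping from series label to book IDs. The following is a minimal sketch of that idea, not the code used in this notebook: `build_dataset` and `fake_loader` are hypothetical names, and the loader is a stand-in so the example runs without Google Drive or network access.

```python
import pandas as pd


def build_dataset(series_ids, load_text):
    """Build one row per book from a {label: [book ids]} mapping."""
    rows = [
        {"label": label, "text": load_text(book_id)}
        for label, book_ids in series_ids.items()
        for book_id in book_ids
    ]
    return pd.DataFrame(rows, columns=["label", "text"])


# Stand-in loader so the sketch runs without Google Drive or network access
def fake_loader(book_id):
    return "text of book {}".format(book_id)


demo_df = build_dataset({"sherlock": [244, 2852], "hardy boys": [73102]}, fake_loader)
print(demo_df)
```

In the notebook, the loader would be a small wrapper around `open_clean_files` with the data path filled in.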

In [95]:
full_df.head()

Unnamed: 0,label,text
0,sherlock,\nAdventure I\n\nA SCANDAL IN BOHEMIA\n\n\nI\n...
1,sherlock,\nPART I.\n\n\n(_Being a reprint from the Remi...
2,sherlock,\nChapter 1.\nMr. Sherlock Holmes\n\n\n M...
3,sherlock,\n\nChapter I\nThe Science of Deduction\n\n\nS...
4,sherlock,"\nI. Silver Blaze\n\n\n I am afraid, Wats..."


`full_df` is a dataframe that contains the full text of each book along with a label for its series. I am saving it in my Google Drive alongside the cleaned individual book files.

In [96]:
full_df.to_csv("/content/drive/MyDrive/Detective Bot/data/full_dataset.csv", sep="|")
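
When the file is read back, the same separator has to be passed, and `index_col=0` keeps the saved index from showing up as an `Unnamed: 0` column. The sketch below checks the round trip in memory on a made-up two-row frame (`sample_df` is hypothetical), including a text field that contains newlines and a `|` character.

```python
import io

import pandas as pd

# Made-up frame with the same columns as full_df
sample_df = pd.DataFrame(
    {
        "label": ["sherlock", "hercule"],
        "text": ["\nChapter I\nA line with a | pipe", "Plain text"],
    }
)

# Write to an in-memory buffer instead of Google Drive
buffer = io.StringIO()
sample_df.to_csv(buffer, sep="|")

# Read back with the same separator, using the first column as the index
buffer.seek(0)
restored = pd.read_csv(buffer, sep="|", index_col=0)

# Without index_col=0 the saved index becomes an extra column
buffer.seek(0)
without_index = pd.read_csv(buffer, sep="|")

print(restored["text"].tolist() == sample_df["text"].tolist())
print(list(without_index.columns))
```

Fields that contain the separator or a line break are quoted on write, so they come back intact on read.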