<a href="https://colab.research.google.com/github/vred13/detective-chatbot/blob/dev/DetectiveBot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Creation of a Detective Bot
I first envisioned this project after seeing Facebook ads for an app that lets you talk to a fictional character.  Sadly, those apps relied on canned responses, but I thought the idea would be a lovely way to test out an LLM and LangChain.  

Whenever I create a data project for myself, the first thing I question is how the data will be collected.  In this case I decided on public domain detective novels, specifically those that focus on a single detective or team of detectives as the main detecting force.  That narrowed the data down a bit: I have a full set of Sherlock Holmes works by Sir Arthur Conan Doyle, 6 books from The Hardy Boys series by Franklin W. Dixon, and 9 of the works detailing the escapades of Hercule Poirot by Agatha Christie.  


## Data Collection
To collect this data, I went to [Project Gutenberg](https://www.gutenberg.org/), which is a library of over 70,000 books for which the copyright has expired.  I searched the site for detective novels and came up with the three sets of detective books listed above: Sherlock Holmes, Hercule Poirot, and The Hardy Boys.  

Next I need to get the text of these books into Python for analysis.  There is a Python package for accessing Project Gutenberg called Gutenbergpy, and that is what I will use.  I also made a list of the book IDs for each set of novels, which appears in the code.

In [None]:
!pip install gutenbergpy gutenberg-cleaner



In [None]:
import os
from urllib import request
import nltk
import re
import json

In [None]:
def get_book_metadata(book_id):
  # Query the Gutendex API for the metadata of a single Project Gutenberg book
  url = "https://gutendex.com/books/?ids=" + str(book_id)
  with request.urlopen(url) as response:
    response_json = json.loads(response.read())
  return response_json
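
The metadata returned by Gutendex also carries the author, so the author name would not have to be typed by hand. Below is a minimal sketch, assuming the response has the shape of the hand-written `sample_response` (a `results` list whose entries have a `title` and an `authors` list with `name` in "Last, First" form). `title_and_author` is a hypothetical helper and not part of this notebook.

```python
def title_and_author(metadata):
    """Pull the title and first author out of a Gutendex-style response dict."""
    results = metadata.get("results", [])
    if not results:
        raise ValueError("No book found in the metadata response")
    book = results[0]
    authors = book.get("authors", [])
    name = authors[0]["name"] if authors else ""
    # Assumed name format: "Doyle, Arthur Conan" becomes "Arthur Conan Doyle"
    if ", " in name:
        last, first = name.split(", ", 1)
        name = first + " " + last
    return book["title"], name


# Hand-written sample for illustration, not a captured API response
sample_response = {
    "count": 1,
    "results": [
        {
            "id": 244,
            "title": "A Study in Scarlet",
            "authors": [{"name": "Doyle, Arthur Conan"}],
        }
    ],
}

print(title_and_author(sample_response))
```

The name returned this way may not match the byline printed inside the book (for example "A. Conan Doyle"), so it is a starting point rather than a replacement for checking the text.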

In [None]:
def create_gutenberg_project_url(book_id):
  url = "https://www.gutenberg.org/files/" + str(book_id) + "/" + str(book_id) +"-0.txt"
  return url
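
Not every book is guaranteed to be stored under the `-0.txt` file name, so it can help to have fallbacks ready. This is a small sketch that builds a list of URLs to try in order; the two alternative path layouts are assumptions about how Project Gutenberg serves plain text, and `candidate_gutenberg_urls` is a hypothetical name.

```python
def candidate_gutenberg_urls(book_id):
    """Return plain-text URLs to try in order for one book ID."""
    book_id = str(book_id)
    base = "https://www.gutenberg.org"
    return [
        f"{base}/files/{book_id}/{book_id}-0.txt",       # pattern used in this notebook
        f"{base}/cache/epub/{book_id}/pg{book_id}.txt",  # assumed alternative layout
        f"{base}/files/{book_id}/{book_id}.txt",         # assumed alternative layout
    ]


for url in candidate_gutenberg_urls(108):
    print(url)
```

A download loop could walk this list and keep the first URL that responds without an error.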

In [None]:
def text_from_gutenberg(title, author, url, path = 'corpora/canon_texts/', return_raw = False, return_tokens = False):
    # Convert inputs to lowercase
    title = title.lower()
    author = author.lower()

    # Check if the file is stored locally
    filename = path + title
    if os.path.isfile(filename) and os.stat(filename).st_size != 0:
        print("{title} file already exists".format(title=title))
        print(filename)
        with open(filename, 'r') as f:
            raw = f.read()
    else:
        print("{title} file does not already exist. Grabbing from Project Gutenberg".format(title=title))
        response = request.urlopen(url)
        raw = response.read().decode('utf-8-sig')
        print("Saving {title} file".format(title=title))
        with open(filename, 'w') as outfile:
            outfile.write(raw)

    if return_raw:
        return raw

    # Strip the Project Gutenberg header and footer
    text = find_beginning_and_end(raw, title, author)

    # Option to return tokens instead of the cleaned string
    if return_tokens:
        return nltk.word_tokenize(text)
    return text

In [None]:
def find_beginning_and_end(raw, title, author):
    '''
    Return the body of a book: the text between the Project Gutenberg start
    and end markers, skipping the title and author lines after the header.
    '''

    # Non-greedy '.*?' stops at the first closing '***' rather than the last one
    start_regex = r'\*\*\*\s?START OF TH(IS|E) PROJECT GUTENBERG EBOOK.*?\*\*\*'
    draft_start_position = re.search(start_regex, raw)
    if draft_start_position is None:
        # The marker may wrap onto a second line; re.S lets '.' match newlines
        draft_start_position = re.search(start_regex, raw, flags=re.S)
        print(draft_start_position)
    if draft_start_position is None:
        raise ValueError("No Project Gutenberg start marker found for " + title)

    beginning = draft_start_position.end()
    print(beginning)

    # Search a lowercased copy so the title and author match regardless of case
    body = raw[beginning:].lower()
    title_position = re.search(re.escape(title.lower()), body)
    if title_position:
        beginning += title_position.end()
        # If the title is present, check for the author's name as well
        author_position = re.search(re.escape(author.lower()), body[title_position.end():])
        if author_position:
            beginning += author_position.end()

    end_regex = r'end of th(is|e) project gutenberg ebook'
    end_position = re.search(end_regex, raw.lower())
    print(end_position)

    text = raw[beginning:end_position.start()]

    return text
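
To see the marker logic in isolation, here is a minimal sketch on a made-up string. It is a simplified variation of the function above: it assumes both markers are present, it also consumes the `***` in front of the end marker, and it does not skip the title or author. `strip_gutenberg_markers` and `sample` are hypothetical names.

```python
import re

# Start marker through its closing '***'; re.S allows the marker to wrap lines
START = re.compile(r"\*\*\*\s?START OF TH(?:IS|E) PROJECT GUTENBERG EBOOK.*?\*\*\*", re.S)
# End marker, including the leading '***' so it does not leak into the text
END = re.compile(r"\*\*\*\s?END OF TH(?:IS|E) PROJECT GUTENBERG EBOOK", re.I)


def strip_gutenberg_markers(raw):
    """Return the text between the start and end markers, trimmed."""
    start = START.search(raw)
    end = END.search(raw)
    if start is None or end is None:
        raise ValueError("Start or end marker not found")
    return raw[start.end():end.start()].strip()


sample = (
    "junk header\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK A STUDY IN SCARLET ***\n"
    "\n"
    "Chapter 1\n"
    "Body text.\n"
    "\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK A STUDY IN SCARLET ***\n"
    "license"
)

print(repr(strip_gutenberg_markers(sample)))  # 'Chapter 1\nBody text.'
```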

In [None]:
import pandas as pd
import numpy as np


In [None]:
#Sherlock Holmes Book IDs
sherlock = [48320, 244, 2852, 2097, 834, 108, 69700, 2350, 2346]

#Hercule Poirot Book IDs
hercule = [863, 58866, 69087, 70114, 72824, 67160, 67173, 66446, 61262]

#Hardy Boys Book IDs
hardy_boys = [73102, 72958, 72840, 70236, 70083, 69988]


In [None]:
import gutenbergpy.textget
from gutenberg_cleaner import simple_cleaner, super_cleaner
def clean_book(book_id, author):
    # Look up the title on Gutendex, then download the book (or load the cached
    # copy) and strip its Project Gutenberg header and footer
    book_meta_data = get_book_metadata(book_id)['results'][0]
    book = text_from_gutenberg(book_meta_data['title'], author, create_gutenberg_project_url(book_id), path = "/content/drive/MyDrive/Detective Bot/data/")
    return book

In [None]:
sherlock_raw=[0]*len(sherlock)
sherlock_clean = [0]*len(sherlock)
sherlock_clean2 = [0]*len(sherlock)
sherlock_extra_header = [4544, 1853, 984, 1551, 1429, 1570, 5403, 1400, 943]
for i in range(len(sherlock)):
  sherlock_clean[i]=clean_book(sherlock[i], "A. Conan Doyle")

# Only the cleaned text is filled in by this cell, so raw_text is left out here
sherlock_df = pd.DataFrame({'series': ['Sherlock Holmes']*len(sherlock), 'clean_text': sherlock_clean})

adventures of sherlock holmes: illustrated file already exists
/content/drive/MyDrive/Detective Bot/data/adventures of sherlock holmes: illustrated
583
<re.Match object; span=(576527, 576562), match='end of this project gutenberg ebook'>
a study in scarlet file already exists
/content/drive/MyDrive/Detective Bot/data/a study in scarlet
780
<re.Match object; span=(240220, 240254), match='end of the project gutenberg ebook'>
the hound of the baskervilles file already exists
/content/drive/MyDrive/Detective Bot/data/the hound of the baskervilles
846
<re.Match object; span=(354988, 355022), match='end of the project gutenberg ebook'>
the sign of the four file already exists
/content/drive/MyDrive/Detective Bot/data/the sign of the four
775
<re.Match object; span=(232917, 232951), match='end of the project gutenberg ebook'>
the memoirs of sherlock holmes file already exists
/content/drive/MyDrive/Detective Bot/data/the memoirs of sherlock holmes
803
<re.Match object; span=(570370, 570404), 

In [None]:
hercule_raw=[0]*len(hercule)
hercule_clean = [0]*len(hercule)
for i in range(len(hercule)):
  hercule_raw[i], hercule_clean[i]=raw_and_clean_book(hercule[i])

hercule_df = pd.DataFrame({'series': ['Hercule Poirot']*len(hercule), 'raw_text': hercule_raw, 'clean_text': hercule_clean})

In [None]:
hardy_boys_raw=[0]*len(hardy_boys)
hardy_boys_clean = [0]*len(hardy_boys)
for i in range(len(hardy_boys)):
  hardy_boys_raw[i], hardy_boys_clean[i]=raw_and_clean_book(hardy_boys[i])

hardy_boys_df = pd.DataFrame({'series': ['Hardy Boys']*len(hardy_boys), 'raw_text': hardy_boys_raw, 'clean_text': hardy_boys_clean})
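
The Hercule Poirot and Hardy Boys cells above call `raw_and_clean_book`, which is not defined anywhere in this notebook. Below is a minimal sketch of what such a helper could look like, assuming it wraps `get_text_by_id` and `strip_headers` from `gutenbergpy.textget`. The `fetch` and `strip` parameters are hypothetical additions so the function can be tried without a network connection.

```python
def raw_and_clean_book(book_id, fetch=None, strip=None):
    """Return (raw, header-free) versions of a book as a tuple."""
    if fetch is None or strip is None:
        # Imported lazily so the function can be tested with stand-ins
        import gutenbergpy.textget as textget
        fetch = fetch or textget.get_text_by_id
        strip = strip or textget.strip_headers
    raw = fetch(book_id)   # full file, including Project Gutenberg boilerplate
    clean = strip(raw)     # same text with the header and footer removed
    return raw, clean


# Offline demonstration with stand-in functions instead of real downloads
raw, clean = raw_and_clean_book(
    1,
    fetch=lambda book_id: b"HEADER body",
    strip=lambda data: data[len(b"HEADER "):],
)
print(raw, clean)
```

With the gutenbergpy functions both values are bytes, which matches the `b'...'` entries in the table further down; call `.decode('utf-8')` on them if plain strings are needed.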

In [None]:
#Full Dataframe of all novels with a column labeling the series
full_df = pd.concat([sherlock_df, hercule_df, hardy_boys_df], ignore_index= True)

In [None]:
full_df.head(20)

Unnamed: 0,series,raw_text,clean_text
0,Sherlock Holmes,b'generously made available by The Internet Ar...,b'Project Gutenberg\'s Adventures of Sherlock ...
1,Sherlock Holmes,b'\n\n\n\nA STUDY IN SCARLET\n\nBy A. Conan Do...,"b""The Project Gutenberg eBook of A Study In Sc..."
2,Sherlock Holmes,b'\ncover \n\n\n\nTHE HOUND OF THE BASKERVILLE...,b'The Project Gutenberg eBook of The Hound of ...
3,Sherlock Holmes,b'\n\n\n\ncover\n\n\n\n\nThe Sign of the Four\...,b'The Project Gutenberg eBook of The Sign of t...
4,Sherlock Holmes,b'\n\n\n\ncover \n\n\n\n\nTHE MEMOIRS OF SHERL...,"b""The Project Gutenberg eBook of The Memoirs o..."
5,Sherlock Holmes,"b""HOLMES ***\n\n\n\n\nThe Return of Sherlock H...","b""The Project Gutenberg eBook of The Return of..."
6,Sherlock Holmes,b'HOLMES ***\n\n\n\n\n\n\n\n\n THE CASE-BOOK ...,b'The Project Gutenberg eBook of The case-book...
7,Sherlock Holmes,b'\xef\xbb\xbf*** START OF THE PROJECT GUTENBE...,b'\xef\xbb\xbf*** START OF THE PROJECT GUTENBE...
8,Sherlock Holmes,b'\n\n\n\n\n\n\n\n\nThe Adventure of the Bruce...,b'The Project Gutenberg EBook of The Adventure...
9,Hercule Poirot,b'\n\n\n\nThe Mysterious Affair at Styles\n\nb...,b'The Project Gutenberg eBook of The Mysteriou...


The header-free text that Gutenbergpy returns still needs a lot of cleaning, especially at the beginning.  

In [None]:
print(sherlock_clean[0][:200])







Produced by The Online Distributed Proofreading Team at
http://www.pgdp.net (This file was produced from images
generously made available by The Internet Archive/American
Libraries.)









ADV


I've got a few options for cleaning: there is a library for Gutenberg files that does cleaning, NeMo Curator could also be used, or I could write a cleaner specific to my data.  What I want is for the headers to disappear and for the `\r\n` characters to be gone.  
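
As a starting point for the custom option, here is a minimal sketch that only handles the line endings and the long runs of blank lines seen in the output above. It does not remove production credits such as the "Produced by" block, and `tidy_text` is a hypothetical name.

```python
import re


def tidy_text(text):
    """Normalise line endings and collapse long runs of blank lines."""
    # Convert Windows (\r\n) and bare carriage-return (\r) endings to \n
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse three or more newlines into a single blank line
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Drop leading and trailing whitespace
    return text.strip()


messy = "\r\n\r\n\r\nProduced by X\r\n\r\n\r\n\r\nADV\r\n"
print(repr(tidy_text(messy)))  # 'Produced by X\n\nADV'
```

Because the function works on strings, bytes returned by Gutenbergpy would need to be decoded first.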

In [None]:
create_gutenberg_project_url(108)

'https://www.gutenberg.org/files/108/108-0.txt'