<a href="https://colab.research.google.com/github/zalcandil/authentic-connections/blob/master/Ebook_Embeddings_Search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Notes on usage:

- Upload an epub file representing the ebook you want to search (tip: ever heard of [libgen](https://libgen.is/)?).
- Re-run the last cell using different queries to keep searching the same book. 
  - The cost of embedding the query is trivial compared to the cost of embedding the whole book, so this is the cheap part.
  - Query search results are also appended to 'results.txt' in Files.
- You'll need an OpenAI API key to run this notebook, which you can get [here](https://beta.openai.com/account/api-keys) (signing up for an OpenAI account gives you \$18 of credits). 
- Choose a model based on price/performance considerations (more info on [pricing](https://openai.com/api/pricing/)). 
  - Heads up: embedding long books, or using the bigger models (Curie and Davinci), can get really expensive.
  - Make sure to set Usage Limits for your OpenAI account.
- Embeddings for the book you upload will be saved in Files (in the left menu bar) under the title 'embeddings-{first chapter}-{last chapter}-{model name}-{epub filename}.json'. 
  - Download this file and upload it (instead of an epub) on your next runtime session in order to avoid calling the OpenAI API again.
- Run 'process_file' with 'preview_mode' set to True at first to check which range of chapters you want to index. This helps you avoid needlessly creating embeddings for chapters like 'Notes' and 'Works Cited'.


In [None]:
!pip install -q openai ebooklib
import openai
import json
import ebooklib
from ebooklib import epub
from bs4 import BeautifulSoup
from os.path import exists
from IPython.display import HTML, display
import numpy as np
import math

[?25l[K     |███████▍                        | 10 kB 20.5 MB/s eta 0:00:01[K     |██████████████▊                 | 20 kB 26.2 MB/s eta 0:00:01[K     |██████████████████████          | 30 kB 30.6 MB/s eta 0:00:01[K     |█████████████████████████████▍  | 40 kB 28.5 MB/s eta 0:00:01[K     |████████████████████████████████| 44 kB 2.4 MB/s 
[?25h  Installing build dependencies ... [?25l[?25hdone

In [None]:
# upload epub (or json of book embeddings generated by this program)
from google.colab import files
uploaded = files.upload()
path = next(iter(uploaded))

In [None]:
openai.api_key = "sk-" #@param {type:"string"}
model = 'babbage' #@param ['ada', 'babbage', 'curie', 'davinci']

In [None]:
def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

In [None]:
def part_to_chapter(part):
    soup = BeautifulSoup(part.get_body_content(), 'html.parser')
    paragraphs = [para.get_text().strip() for para in soup.find_all('p')]
    paragraphs = [para for para in paragraphs if len(para) > 0]
    if len(paragraphs) == 0:
        return None
    title = ' '.join([heading.get_text() for heading in soup.find_all('h1')])
    return {'title': title, 'paras': paragraphs}

min_words_per_para = 150
max_words_per_para = 500

def format_paras(chapters):
    for i, chapter in enumerate(chapters):
        paras = chapter['paras']
        j = 0
        # Use a while loop (not a fixed range) so that paragraphs inserted
        # by the split below are processed as well.
        while j < len(paras):
            k = j
            # Merge the following paragraphs into a short one until it is long enough.
            while len(paras[j].split()) < min_words_per_para and k < len(paras) - 1:
                paras[j] += '\n' + paras[k + 1]
                paras[k + 1] = ''
                k += 1
            # Split an overlong paragraph; the remainder becomes the next paragraph.
            split_para = paras[j].split()
            if len(split_para) > max_words_per_para:
                paras.insert(j + 1, ' '.join(split_para[max_words_per_para:]))
                paras[j] = ' '.join(split_para[:max_words_per_para])
            j += 1

        # Drop the entries emptied by merging.
        chapter['paras'] = [para.strip() for para in paras if len(para.strip()) > 0]
        if len(chapter['title']) == 0:
            chapter['title'] = '(Unnamed) Chapter {no}'.format(no=i + 1)

def print_previews(chapters):
    for (i, chapter) in enumerate(chapters):
        title = chapter['title']
        wc = len(' '.join(chapter['paras']).split())
        paras = len(chapter['paras'])
        initial = chapter['paras'][0][:20]
        preview = '{}: {} | wc: {} | paras: {}\n"{}..."\n'.format(i, title, wc, paras, initial)
        print(preview)

def get_chapters(book_path, print_chapter_previews, first_chapter, last_chapter):
    book = epub.read_epub(book_path)
    parts = list(book.get_items_of_type(ebooklib.ITEM_DOCUMENT))
    # Parse each part once and skip parts without any paragraphs.
    chapters = [chapter for chapter in map(part_to_chapter, parts) if chapter is not None]
    last_chapter = min(last_chapter, len(chapters) - 1)
    chapters = chapters[first_chapter:last_chapter + 1]
    format_paras(chapters)
    if print_chapter_previews:
        print_previews(chapters)
    return chapters
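
The merge-and-split rule used by `format_paras` is easier to follow on a toy input. The following is a minimal sketch, not the notebook's own code: `chunk_paragraphs` is a hypothetical helper that returns a new list instead of mutating the chapters in place, and it joins merged text with spaces rather than newlines. The tiny `min_words` and `max_words` values are only there to keep the example readable.

```python
def chunk_paragraphs(paras, min_words=150, max_words=500):
    """Hypothetical helper: merge short paragraphs, then split long ones."""
    chunks = []
    buffer = []  # words collected for the chunk currently being built
    for para in paras:
        buffer.extend(para.split())
        # Emit full-size chunks while the buffer exceeds the maximum.
        while len(buffer) > max_words:
            chunks.append(' '.join(buffer[:max_words]))
            buffer = buffer[max_words:]
        # Emit the buffer as soon as it reaches the minimum size.
        if len(buffer) >= min_words:
            chunks.append(' '.join(buffer))
            buffer = []
    if buffer:  # trailing words that never reached the minimum
        chunks.append(' '.join(buffer))
    return chunks

paras = ['one two', 'three four five', 'six', 'seven eight nine ten eleven twelve']
chunks = chunk_paragraphs(paras, min_words=4, max_words=5)
print(chunks)
```

Each resulting chunk is what would later be sent for embedding, so the two limits trade off passage granularity against the number of API calls.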

In [None]:
doc_model = 'text-search-{model}-doc-001'.format(model=model)
query_model = 'text-search-{model}-query-001'.format(model=model)

def get_embedding(text, doc=True):
    text = text.replace("\n", " ")
    model = doc_model if doc else query_model
    response = openai.Embedding.create(input=[text], model=model)
    return response['data'][0]['embedding']

def get_embeddings(chapters):
    embeddings = []
    for chapter in chapters:
        for para in chapter['paras']:
            embeddings.append(get_embedding(para))
    return embeddings
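
`get_embeddings` sends one request per passage. Since `get_embedding` already passes `input` as a list, grouping several passages per request may reduce the number of calls; check the API's per-request limits before relying on this. The sketch below only shows the batching logic: `embed_in_batches` and `embed_batch` are hypothetical names, and `fake_embed_batch` is a stand-in so the example runs without an API key.

```python
def batched(items, batch_size):
    """Yield successive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(texts, embed_batch, batch_size=16):
    """Hypothetical helper: embed `texts` by calling `embed_batch` once per batch.

    `embed_batch` is an assumed callable that takes a list of strings and
    returns one vector per string, in the same order.
    """
    embeddings = []
    for batch in batched(texts, batch_size):
        embeddings.extend(embed_batch(batch))
    return embeddings

# Stand-in for a real API call: "embeds" each text as [len(text)].
def fake_embed_batch(batch):
    return [[float(len(text))] for text in batch]

vectors = embed_in_batches(['a', 'bb', 'ccc'], fake_embed_batch, batch_size=2)
print(vectors)
```

In the notebook, `embed_batch` would wrap the embedding request and extract the vectors from the response in input order.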

In [None]:
def read_json(json_path):
    print('Loading embeddings from "{}"'.format(json_path))
    with open(json_path, 'r') as f:
        values = json.load(f)
    return (values['chapters'], np.array(values['embeddings']))

def read_epub(book_path, json_path, preview_mode, first_chapter, last_chapter):
    chapters = get_chapters(book_path, preview_mode, first_chapter, last_chapter)
    if preview_mode:
        return (chapters, None)
    print('Generating {} embeddings for chapters {}-{} in "{}"\n'.format(model, first_chapter, last_chapter, book_path))
    embeddings = get_embeddings(chapters)
    with open(json_path, 'w') as f:
        json.dump({'chapters': chapters, 'embeddings': embeddings}, f)
    return (chapters, np.array(embeddings))
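
The cache file written by `read_epub` is plain JSON with two keys, 'chapters' and 'embeddings'. As a minimal sketch of the round trip, using toy values and a temporary file (both are assumptions for illustration, not data from a real book):

```python
import json
import os
import tempfile

import numpy as np

# Toy stand-ins for the notebook's `chapters` and `embeddings` values.
chapters = [{'title': 'Chapter 1', 'paras': ['first passage', 'second passage']}]
embeddings = [[0.1, 0.2], [0.3, 0.4]]

# Write the same structure that read_epub saves.
json_path = os.path.join(tempfile.mkdtemp(), 'embeddings-demo.json')
with open(json_path, 'w') as f:
    json.dump({'chapters': chapters, 'embeddings': embeddings}, f)

# Read it back the way read_json does.
with open(json_path, 'r') as f:
    values = json.load(f)
loaded_chapters = values['chapters']
loaded_embeddings = np.array(values['embeddings'])  # one row per passage
print(loaded_embeddings.shape)
```

Because the passages are stored next to their embeddings, a downloaded JSON file is enough to search the book again without the original epub.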

In [None]:
def process_file(path, preview_mode=False, first_chapter=0, last_chapter=math.inf):
    values = None
    if path.endswith('.json'):
        values = read_json(path)
    elif path.endswith('.epub'):
        json_path = 'embeddings-{}-{}-{}-{}.json'.format(first_chapter, last_chapter, model, path)
        if exists(json_path):
            values = read_json(json_path)
        else:
            values = read_epub(path, json_path, preview_mode, first_chapter, last_chapter) 
    else:
        print('Invalid file format. Either upload an epub or a json of book embeddings.')        
    return values

Loading embeddings from "/content/embeddings-8-18-babbage-Charles C. Mann - The Wizard and the Prophet_ Two Remarkable Scientists and Their Dueling Visions to Shape Tomorrow’s World-Knopf Publishing Group (2018).epub.json"


In [None]:
# Comments below only relevant if you want to save yourself some API calls.

# Run this with 'preview_mode' on if you want to figure out which chapters to include.
# For example, after running 'process_file(path, preview_mode=True)',
# you might notice that chapters 1-7 and 19-27 are useless endnotes/intro stuff.
# You can then run 'process_file(path, first_chapter=8, last_chapter=18)'.

chapters, embeddings = process_file(path)

In [None]:
def print_and_write(text, f):
    print(text)
    f.write(text + '\n')

def cos_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def para_index_to_info(index, chapters):
    for chapter in chapters:
        paras_len = len(chapter['paras'])
        if index < paras_len:
            return chapter['paras'][index], chapter['title'], index
        index -= paras_len
    return None

def search(query, embeddings, n=3):
    query_embedding = np.array(get_embedding(query, doc=False))
    # Score every passage once, then keep the indices of the n highest scores.
    scores = [cos_similarity(embedding, query_embedding) for embedding in embeddings]
    results = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]

    # Append to 'results.txt'; the with-block closes the file afterwards.
    with open('results.txt', 'a') as f:
        header_msg = 'Results for query "{}" in "The Wizard and the Prophet.epub"'.format(query)
        print_and_write(header_msg, f)
        for result in results:
            para, title, para_no = para_index_to_info(result, chapters)
            result_msg = '\nChapter: "{}", Passage number: {}, Score: {:.2f}\n"{}"'.format(title, para_no, scores[result], para)
            print_and_write(result_msg, f)
        print_and_write('\n', f)
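
`search` ranks passages by calling `cos_similarity` once per embedding. For larger books the same ranking can be computed with a single matrix-vector product. This is a sketch under the assumption that the embeddings form a 2-D array with one row per passage; `top_n_by_cosine` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def top_n_by_cosine(embeddings, query_embedding, n=3):
    """Hypothetical helper: return (indices, scores) of the n most similar rows."""
    embeddings = np.asarray(embeddings, dtype=float)
    query_embedding = np.asarray(query_embedding, dtype=float)
    # Cosine similarity of every row against the query in one product.
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_embedding)
    scores = embeddings @ query_embedding / norms
    # argsort is ascending, so reverse it and keep the first n indices.
    order = np.argsort(scores)[::-1][:n]
    return order, scores[order]

toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
indices, scores = top_n_by_cosine(toy_embeddings, [1.0, 0.0], n=2)
print(indices, scores)
```

The returned indices are flat passage indices, so they can be passed to `para_index_to_info` in the same way as the `results` list in `search`.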

In [None]:
query = 'scene of vought and huxley meeting' #@param {type:"string"}
search(query, embeddings)

Results for query "scene of vought and huxley meeting" in "The Wizard and the Prophet.epub"

Chapter: "[ EIGHT ] The Prophet", Passage number: 4, Score: 0.39
"Julian Huxley, 1964 Credit 77
After the meeting Huxley and Vogt talked. Surely it was an exciting moment for Vogt. Speaking to Huxley, with his first-class Oxford degree, his links to scientists around the world, his string of best-selling books, was about as far from the Chincha Islands as it was possible to get. And Huxley had sought out Vogt, had questions for him, possible plans. No record exists of their conversation, though presumably Vogt talked about his forthcoming book, Road to Survival. Whatever the course of discussion, it is clear that Vogt satisfied Huxley. The two men kept in touch, sometimes by letter, sometimes through their mutual acquaintance, Vogt’s friendly rival Fairfield Osborn.
During the next year Huxley watched Road become an explosive best seller, making Vogt—and Osborn, who had published a competing bo