# What to do when a RAG system hallucinates?

This script is designed to extract text content from a PDF file and present it as a list of strings, where each string represents the text content of a single page within the PDF. It is written in Python and makes use of two external libraries: `openai` and `PyMuPDF`.

## Dependencies Installation

First, the script installs necessary Python libraries using `pip`, Python's package installer:

- `openai`: The client library for OpenAI's APIs. It is not needed for the PDF extraction itself; it is used later in the notebook to generate question-answer pairs from the extracted text.
- `PyMuPDF` (imported as `fitz`): This is the main library used for interacting with PDF files in the script. It provides functionalities to open, read, and manipulate PDF documents.


In [1]:
! pip install openai

Collecting openai
  Downloading openai-1.12.0-py3-none-any.whl (226 kB)
Collecting httpx<1,>=0.23.0 (from openai)
  Downloading httpx-0.26.0-py3-none-any.whl (75 kB)
Collecting httpcore==1.* (from httpx<1,>=0.23.0->openai)
  Downloading httpcore-1.0.2-py3-none-any.whl (76 kB)
Collecting h11<0.15,>=0.13 (from httpcore==1.*->httpx<1,>=0.23.0->openai)
  Downloading h11-0.14.0-

In [2]:
import openai

In [3]:
! pip install PyMuPDF

Collecting PyMuPDF
  Downloading PyMuPDF-1.23.21-cp310-none-manylinux2014_x86_64.whl (4.4 MB)
Collecting PyMuPDFb==1.23.9 (from PyMuPDF)
  Downloading PyMuPDFb-1.23.9-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (30.6 MB)
Installing collected packages: PyMuPDFb, PyMuPDF
Successfully installed PyMuPDF-1.23.21 PyMuPDFb-1.23.9


## Function: `read_pdf_content`

### Purpose

The core of the script is defined in the `read_pdf_content` function. This function is designed to open a PDF file, iterate through its pages, and extract the text content from each page.

### Parameters

- `pdf_path (str)`: A string parameter that takes the file path to the PDF document to be read.

### Process

1. **Initialization**: A list named `content_list` is initialized to store the text content of each page.
2. **Opening the PDF**: The function uses `fitz.open(pdf_path)` to open the PDF file specified by the `pdf_path` argument. The `with` statement ensures that the PDF file is properly closed after its contents are read, which is a good practice for resource management.
3. **Reading Pages**: The function iterates over each page in the document (`for page in doc:`) and uses the `get_text()` method to extract the text content of the current page.
4. **Storing Text**: The extracted text for each page is appended to the `content_list`.

### Returns

- The function returns `content_list`, a list of strings, where each string contains the text content of a respective PDF page.


In [4]:
import fitz  # PyMuPDF

def read_pdf_content(pdf_path):
    """
    Reads a PDF and returns its content as a list of strings.

    Args:
    pdf_path (str): The file path to the PDF.

    Returns:
    list of str: A list where each element is the text content of a PDF page.
    """
    content_list = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            content_list.append(page.get_text())

    return content_list

### Execution and Timing

The cell starts with the `%%time` cell magic (specific to Jupyter Notebooks and IPython), which measures the execution time of the whole cell. It then calls the `read_pdf_content` function with the PDF file path (`"/content/stories.pdf"`) and stores the extracted text in the `scraped_content` variable. A second cell replaces the newline characters in each page with spaces.

In [6]:
%%time

scraped_content = read_pdf_content("/content/stories.pdf")

CPU times: user 27.9 ms, sys: 1.88 ms, total: 29.7 ms
Wall time: 103 ms


In [7]:
%%time

scraped_content = [page.replace('\n', ' ') for page in scraped_content]

CPU times: user 11 µs, sys: 1e+03 ns, total: 12 µs
Wall time: 15.3 µs
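
The extracted pages shown further down still contain PDF ligature glyphs (for example `ﬁ`) and runs of double spaces. Below is a minimal sketch of a slightly more thorough cleanup, assuming Unicode NFKC normalization is acceptable for the text. `clean_page` is a hypothetical helper and is not part of the original notebook. It does not repair mis-mapped glyphs such as `Ɵ`, which come from the PDF's font encoding.

```python
import unicodedata


def clean_page(text: str) -> str:
    """Normalize compatibility characters and collapse whitespace (hypothetical helper)."""
    # NFKC expands compatibility ligatures such as "ﬁ" (U+FB01) into "fi".
    normalized = unicodedata.normalize("NFKC", text)
    # str.split() with no arguments splits on any run of whitespace,
    # so re-joining with one space removes newlines and double spaces.
    return " ".join(normalized.split())


sample = "schools of ﬁsh,\nand  the Statue"
print(clean_page(sample))  # schools of fish, and the Statue
```

It could be applied to every page with `[clean_page(page) for page in scraped_content]`.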


In [9]:
len(scraped_content)

2

In [10]:
scraped_content[0]

"In the submerged world of New York City, 2080, John Storyteller navigates the aquaƟc avenues in his  one-man submarine, a lone courier among the corals and skyscrapers. Once a bustling metropolis, now a  silent underwater realm, the city had succumbed to the rising Ɵdes, but life, as always, found a way to  persevere. John, a mailman of the new era, took pride in his unique role, connecƟng the submerged  city's inhabitants with the world above the waves.    Each morning, John would seal himself within his vessel, the SS Narrator, a sleek submarine painted with  the vibrant colors of forgoten sunsets. His route took him past landmarks that had taken on new lives  beneath the sea; Times Square teemed with schools of ﬁsh, and the Statue of Liberty stood watch over  the depths. John delivered leters, packages, and memories, weaving through the waterlogged streets  with a skill born of years behind the helm.    But John's job was more than just a profession; it was a calling. In a world wh

In [11]:
scraped_content[1]

'In 200 A.D., ancient China was a realm of wonder and dragons, where the skies were streaked with the  majesƟc creatures as commonly as roads were tread by horses in other lands. Into this world, Thomas  Wingless, a man of the future with no dragon to call his own, found himself inexplicably transported.  With his pale complexion, foreign features, and inability to speak the local dialect, Thomas stood out like  a moon in a star-ﬁlled sky.    Despite his iniƟal bewilderment, Thomas was a man of considerable resolve. He quickly learned that in  this era, dragons were not just beasts of burden but symbols of power and presƟge. Without a dragon,  one was considered lesser, almost incomplete. Thomas, dubbed "Wingless" by those he encountered,  sought to carve a place for himself in this fantasƟcal society.    His journey was one of both hardship and enlightenment. Thomas devoted himself to understanding the  culture that surrounded him, from the art of calligraphy to the intricate rituals 

## Set up `GPT`

The given code snippet demonstrates how to interact with the OpenAI API using Python to generate text responses for queries and to create question-answer pairs based on provided content. Here's a high-level overview of its components:


### OpenAI API Key and Client Initialization

- **OPENAI_API_KEY = "sk-xxx"**: This line sets up the API key required to authenticate requests to the OpenAI API. Replace `"sk-xxx"` with your actual API key.
- **openai_client = openai.OpenAI(api_key=OPENAI_API_KEY)**: Initializes the OpenAI client with the given API key. This client is used to make requests to the API.
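
Hard-coding a key in a notebook makes it easy to leak when the notebook is shared. One common alternative, sketched here, is to read the key from an environment variable. `load_api_key` is a hypothetical helper, and the variable name `OPENAI_API_KEY` is an assumption about how the environment is set up.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in the given environment variable."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail early with a clear message instead of sending an empty key.
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


# Usage in the notebook (not executed here, since it needs a real key):
# openai_client = openai.OpenAI(api_key=load_api_key())
```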


### Function: `call_chatgpt`

This function generates a response to a user's query using a specified language model from OpenAI's suite.

- **Parameters**:
  - `query (str)`: The user's question or query.
  - `model (str)`: The ID of the OpenAI language model to use. Defaults to `"gpt-3.5-turbo"`, a variant known for its speed and efficiency.
- **Process**:
  - Constructs a conversation context with system instructions and the user's query.
  - Calls the OpenAI API to generate a response based on this context and the chosen model.
  - Extracts and returns the generated response text.
- **Returns**: A string containing the generated response to the query.



In [12]:
OPENAI_API_KEY = "sk-xxx"
openai_client = openai.OpenAI(api_key=OPENAI_API_KEY)


def call_chatgpt(query: str, model: str = "gpt-3.5-turbo") -> str:
    """
    Generates a response to a query using the specified language model.
    Args:
        query (str): The user's query that needs to be processed.
        model (str, optional): The language model to be used. Defaults to "gpt-3.5-turbo".
    Returns:
        str: The generated response to the query.
    """

    # Prepare the conversation context with system and user messages.
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": f"Question: {query}."},
    ]

    # Use the OpenAI client to generate a response based on the model and the conversation context.
    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
    )

    # Extract the content of the response from the first choice.
    content: str = response.choices[0].message.content

    # Return the generated content.
    return content

### Function: `prompt_engineered_api`

Aims to generate question-answer pairs based on the provided text.

- **Process**:
  - Takes a string of text as input.
  - Constructs a prompt that instructs the AI to generate question-answer content in a specific format, with questions labeled `### Human:` and answers labeled `### Assistant:`.
  - Calls `call_chatgpt` with this prompt to generate the content.
- **Returns**: The generated question-answer pairs.
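
The function returns one raw string per call. If structured pairs are needed later, such a string can be split on the `### Human:` and `### Assistant:` labels. The following is a sketch under the assumption that the model follows the requested format; `parse_qa_pairs` is a hypothetical helper and not part of the original notebook.

```python
import re

# A question runs from "### Human:" to "### Assistant:"; an answer runs
# until the next "### Human:" label or the end of the string.
PAIR_PATTERN = re.compile(
    r"### Human:\s*(?P<question>.*?)\s*### Assistant:\s*(?P<answer>.*?)"
    r"(?=\s*### Human:|\Z)",
    re.DOTALL,
)


def parse_qa_pairs(raw: str) -> list[tuple[str, str]]:
    """Split a '### Human: ... ### Assistant: ...' string into (question, answer) pairs."""
    return [
        (m.group("question").strip(), m.group("answer").strip())
        for m in PAIR_PATTERN.finditer(raw)
    ]


example = (
    "### Human: Who is John?\n### Assistant: A courier.\n\n"
    "### Human: Where does he work?\n### Assistant: In New York City."
)
for question, answer in parse_qa_pairs(example):
    print(question, "->", answer)
```

Responses that drift from the format produce fewer pairs than expected, so checking the count against the five pairs requested in the prompt is a cheap sanity test.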


In [13]:
def prompt_engineered_api(text: str):

    prompt = f"""
        I have the following content: {text}

        I want to create some question-answer content that has the following format:

        ### Human:
        ### Assistant:

        Make sure to write question and answer based on the content I provided.

        The ### Human means it's a question, and the ### Assistant means it's an answer.

        Make sure to write five sets of ### Human: ### Assistant per line.
    """

    resp = call_chatgpt(prompt)

    return resp

In [14]:
from tqdm import tqdm

### Loop for Processing Content

- The loop iterates over `scraped_content`, a list of texts.
- For each item in the list, it uses `prompt_engineered_api` to generate question-answer pairs.
- The generated content is collected in `raw_content_for_train`, presumably for training or analysis purposes.
- Utilizes `tqdm` to show a progress bar, providing visual feedback on the process's advancement.
- The `%%time` magic command at the beginning measures and displays the total execution time of the cell.



In [15]:
%%time

raw_content_for_train = []
for i in tqdm(range(len(scraped_content))):
    resp = prompt_engineered_api(scraped_content[i])
    raw_content_for_train.append(resp)

100%|██████████| 2/2 [00:10<00:00,  5.49s/it]

CPU times: user 110 ms, sys: 12.4 ms, total: 122 ms
Wall time: 11 s





In [16]:
len(raw_content_for_train)

2

In [17]:
raw_content_for_train[0]

"### Human: Who is John Storyteller and what is his role in the submerged world of New York City?\n### Assistant: John Storyteller is a lone courier who navigates the aquatic avenues of New York City in his one-man submarine. He is responsible for connecting the submerged city's inhabitants with the world above the waves by delivering letters, packages, and memories.\n\n### Human: How has New York City changed in the year 2080?\n### Assistant: In the year 2080, New York City has become a silent underwater realm due to the rising tides. It has transformed into a submerged world where corals and skyscrapers coexist, and the once bustling metropolis now breathes beneath the sea.\n\n### Human: What is special about John's approach to his job as a mailman in the submerged city?\n### Assistant: John takes pride in his unique role as a mailman. Despite living in a world dominated by digital communication, he maintains the personal touch of handwritten notes and physical parcels. He believes i

In [18]:
raw_content_for_train[1]

"### Human: What was Thomas Wingless' initial experience like when he arrived in ancient China?\n### Assistant: Thomas Wingless initially stood out due to his foreign features, pale complexion, and inability to speak the local dialect. He found himself bewildered in this unfamiliar world.\n\n### Human: How did Thomas seek to fit into this fantastical society?\n### Assistant: Thomas sought to carve a place for himself in ancient China by immersing himself in the culture, learning the local customs, and working alongside peasants in the fields.\n\n### Human: What unique skill did Thomas possess that earned him respect and fascination among the community?\n### Assistant: Thomas shared tales and knowledge from his own time, captivating the imaginations of the locals with stories of distant futures and worlds beyond their own. This unique skill earned him respect and fascination.\n\n### Human: Who played an important role in Thomas' journey in ancient China?\n### Assistant: A dragonkeeper, 

## Save `str` to `.txt`

The given Python script includes a function designed to save a list of strings to separate text files within a specified directory. Here's a breakdown of its components and functionality:


### Function Definition
- **Function Name**: `save_strings_as_files`
- **Parameters**:
  - `string_list`: A list of strings that the user wants to save as text files.
  - `directory`: An optional parameter that specifies the directory where the text files should be saved. It defaults to "output_files" if not provided.

### Core Functionality
1. **Directory Creation**: Initially, the function checks if the specified directory exists. If not, it creates the directory. This step ensures that there is a place to save the text files without any errors.
   
2. **Iterating Over Strings**: The function iterates through each string in the provided list. For each string, it performs the following actions:
   - Generates a unique filename using the format `file_{index}.txt`, where `{index}` is the current position of the string in the list. This approach helps avoid filename conflicts.
   - Saves the filename in a list `list_of_names` for later use or reference.
   - Constructs the full path to where the file will be saved by combining the directory path and the filename.
   - Opens a new text file at the constructed path and writes the string to it.

3. **Feedback to User**: After processing all strings in the list, the function prints a message indicating how many files have been saved and in which directory.

4. **Return Value**: The function returns `list_of_names`, which contains all the filenames that were generated and used for saving the strings. This list can be useful for further processing or verification.



In [19]:
import os

def save_strings_as_files(string_list, directory="output_files"):
    """
    Saves each string in the list as a separate .txt file.

    Args:
    - string_list: List of strings to be saved as .txt files.
    - directory: The directory where the .txt files will be saved.

    Returns:
    - list of str: The generated filenames (without the directory prefix).
    """
    # Ensure the output directory exists
    os.makedirs(directory, exist_ok=True)
    list_of_names = []

    # Iterate over the list of strings
    for index, string in enumerate(string_list):
        # Define the filename with an index to avoid name conflicts
        filename = f"file_{index}.txt"
        list_of_names.append(filename)
        # Create a full path for the file
        filepath = os.path.join(directory, filename)

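        # Open the file and write the string (explicit UTF-8 so non-ASCII
        # characters in the generated text are written consistently)
        with open(filepath, 'w', encoding='utf-8') as file:
            file.write(string)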

    print(f"Saved {len(string_list)} files in '{directory}'.")
    return list_of_names

### Execution and Timing
- After defining the function, the notebook assigns the generated question-answer strings (`raw_content_for_train`) to `string_list`.
- It then calls `save_strings_as_files` with `string_list` as the argument, saving each string as a text file and keeping the returned filenames in `list_of_names`. The `%%time` magic command measures the time taken for this operation.

This script is particularly useful for applications that need to persist textual data to the filesystem, enabling batch processing of strings and efficient file management.





In [20]:
%%time

string_list = raw_content_for_train
list_of_names = save_strings_as_files(string_list)

Saved 2 files in 'output_files'.
CPU times: user 0 ns, sys: 1.44 ms, total: 1.44 ms
Wall time: 1.45 ms


In [21]:
list_of_names

['file_0.txt', 'file_1.txt']

## Understanding the Document Embedding Code


### Library Installation

The code begins by installing necessary Python libraries:

- `chromadb`: A library for efficient vector storage and similarity search.
- `langchain`: A toolkit for building language applications, providing functionalities for document loading, text splitting, and embeddings.
- `sentence-transformers`: A library for sentence and document embeddings using transformer models.


In [22]:
# library
! pip install chromadb
! pip install langchain
! pip install sentence-transformers

Collecting chromadb
  Downloading chromadb-0.4.22-py3-none-any.whl (509 kB)
Collecting chroma-hnswlib==0.7.3 (from chromadb)
  Downloading chroma_hnswlib-0.7.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.4 MB)
Collecting fastapi>=0.95.2 (from chromadb)
  Downloading fastapi-0.109.2-py3-none-any.whl (92 kB)
Collecting uvicorn[standard]>=0.18.3 (from ch

### Imports

Key components from the `langchain` library are imported, including:

- `TextLoader` for loading text data from files.
- `SentenceTransformerEmbeddings` for generating embeddings of text data.
- `CharacterTextSplitter` for splitting text into smaller chunks.
- `Chroma` from `langchain.vectorstores` for storing and managing the embeddings.


In [23]:
# import
from langchain.document_loaders import TextLoader
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

In [24]:
persist_dir = "/content/db"

In [25]:
os.chdir("/content/output_files")

### Document Loading and Preparation

- An empty list, `all_documents`, is initialized to store the loaded documents.
- Documents are loaded from the file names listed in `file_names` using `TextLoader` and appended to `all_documents`.
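- The documents are then split into smaller chunks using `CharacterTextSplitter`, with a target size of 1000 characters (`chunk_size`) and no overlap between chunks (`chunk_overlap=0`). The splitter breaks the text on a separator (a blank line by default) and merges the pieces up to the target size, so each chunk here holds one or more whole question-answer pairs.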
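
To make the chunking step concrete, here is a simplified pure-Python illustration of separator-based splitting. It is only a sketch of the general idea and not LangChain's actual implementation; `split_on_separator` and its defaults are assumptions chosen for the example.

```python
def split_on_separator(text: str, chunk_size: int = 1000, separator: str = "\n\n") -> list[str]:
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size characters."""
    chunks: list[str] = []
    current = ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        # Keep growing the current chunk while it fits. A single piece longer
        # than chunk_size is kept whole rather than cut mid-sentence.
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


text = "aaaa\n\nbbbb\n\ncccc"
print(split_on_separator(text, chunk_size=10))  # ['aaaa\n\nbbbb', 'cccc']
```

Because the pieces are never cut in the middle, a question and its answer stay in the same chunk as long as they are not separated by a blank line.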


### Embedding and Persistence

- An embedding function is created using `SentenceTransformerEmbeddings` with a specified model (`all-MiniLM-L6-v2`), which is designed for generating dense vector representations of the text data.
- The `Chroma` database is initialized with the chunked documents and the embedding function. The `persist_directory` argument specifies where to store the vector data.
- Finally, the `.persist()` method is called on the `Chroma` database to ensure that the embeddings are saved to disk.


### Performance Measurement

- The `%%time` magic command at the beginning of the code block is used to measure the execution time of the code that follows it.


In [26]:
%%time

# Initialize an empty list to hold all documents
file_names = list_of_names  # File names returned by save_strings_as_files
all_documents = []  # Collects the loaded documents from every file

# Iterate over each file and load its contents
for file_name in file_names:
    loader = TextLoader(file_name)
    documents = loader.load()
    all_documents.extend(documents)

# Split the loaded documents into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(all_documents)

# Create the open-source embedding function
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# Load the documents into Chroma
db = Chroma.from_documents(docs, embedding_function, persist_directory=persist_dir)

# Call .persist to ensure the vectors are written
db.persist()

CPU times: user 8.76 s, sys: 1.49 s, total: 10.3 s
Wall time: 21.1 s


## Similarity Search

The next two cells query the vector store using its similarity search functionality. Each cell is preceded by the `%%time` magic command used in Jupyter notebooks to measure the execution time of the cell.

### Part 1: Basic Similarity Search

```python
%%time

# query it
query = "Who is john storytelleer?"
docs = db.similarity_search(query)

# print results
print(docs[0].page_content)
```

### Explanation

- **Timing the Execution:** The `%%time` command at the beginning is a special command used in Jupyter Notebooks to measure the execution time of the code cell.
  
- **Defining the Query:** The code defines a string variable `query` with the value "Who is john storytelleer?". This query is intended to search for information related to a person (possibly misspelled as "storytelleer") named John within a database.

- **Performing the Search:** The `similarity_search` method of the `db` object (the Chroma vector store built earlier) is called with `query` as its argument. The query is embedded with the same embedding function as the stored chunks, and the closest chunks are returned; by default LangChain returns four results (`k=4`).

- **Displaying Results:** The search returns a list of `Document` objects. The cell prints `page_content`, the text of the chunk, for the first (closest) result.

In [48]:
%%time

# query it
query = "Who is john storytelleer?"
docs = db.similarity_search(query)

# print results
print(docs[0].page_content)

### Human: What does John Storyteller symbolize in the submerged world?
### Assistant: John Storyteller, in his submarine, symbolizes a lifeline in the submerged world of New York City. He is a reminder that even in the most changed of worlds, some things, like the need to connect and keep humanity connected, remain ever the same.
CPU times: user 23.7 ms, sys: 0 ns, total: 23.7 ms
Wall time: 25.4 ms


### Part 2: Similarity Search with Scores

```python
%%time

# query it
query = "Who is john storytelleer?"
docs = db.similarity_search_with_score(query)

# print results
print(docs)
```

- **Timing the Execution:** Similar to the first part, the `%%time` command is used to measure how long the code cell takes to execute.

- **Defining the Query:** The query remains the same as in the first part, aiming to find information about "john storytelleer".

- **Performing the Search with Scores:** This time the `similarity_search_with_score` method is used. It returns the same nearest chunks as `similarity_search`, each paired with a score. With LangChain's Chroma wrapper the score is a distance rather than a similarity, so lower values mean a closer match to the query.

- **Displaying Results:** The entire `docs` object is printed. It is a list of `(Document, score)` tuples, so each retrieved chunk can be inspected together with its distance from the query.
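
The idea that a lower score means a closer match can be illustrated with toy vectors. The sketch below uses squared Euclidean distance, which is assumed here to be the default metric of the Chroma collection; the two-dimensional vectors and their names are made up for the example and are not real embeddings.

```python
def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy "embeddings": hypothetical values, chosen only to show the ordering.
query_vec = [1.0, 0.0]
chunks = {
    "near": [0.9, 0.1],
    "far": [-1.0, 0.5],
    "exact": [1.0, 0.0],
}

# Sort ascending by distance: the best match comes first.
ranked = sorted(chunks.items(), key=lambda item: squared_l2(query_vec, item[1]))
for name, vec in ranked:
    print(name, round(squared_l2(query_vec, vec), 3))
```

An identical vector has distance 0, which is why a query copied almost verbatim from a stored chunk gets a score close to zero.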

In [49]:
%%time

# query it
query = "Who is john storytelleer?"
docs = db.similarity_search_with_score(query)

# print results
print(docs)

[(Document(page_content='### Human: What does John Storyteller symbolize in the submerged world?\n### Assistant: John Storyteller, in his submarine, symbolizes a lifeline in the submerged world of New York City. He is a reminder that even in the most changed of worlds, some things, like the need to connect and keep humanity connected, remain ever the same.', metadata={'source': 'file_0.txt'}), 0.9765369703563733), (Document(page_content="### Human: What is special about John's approach to his job as a mailman in the submerged city?\n### Assistant: John takes pride in his unique role as a mailman. Despite living in a world dominated by digital communication, he maintains the personal touch of handwritten notes and physical parcels. He believes in the magic of a letter and the power of a package to bridge distances.\n\n### Human: How does John feel connected to the city's pulse?\n### Assistant: Hovering beside green-tinged windows or the open doors of underwater habitats, John feels a pr

In [50]:
len(docs)

4

In [42]:
type(docs[0]), len(docs[0])

(tuple, 2)

In [43]:
import pandas as pd

In [51]:
i = 0

[docs[i][0].page_content, docs[i][0].metadata['source'], docs[i][1]]

['### Human: What does John Storyteller symbolize in the submerged world?\n### Assistant: John Storyteller, in his submarine, symbolizes a lifeline in the submerged world of New York City. He is a reminder that even in the most changed of worlds, some things, like the need to connect and keep humanity connected, remain ever the same.',
 'file_0.txt',
 0.9765369703563733]

## Investigation of Similarity Results

This Python script performs a series of steps to query a database for documents similar to a given query and then export those results to a CSV file. The process can be broken down into the following steps:

### Explanation of the following code

```python
%%time

# query it
query = "WHATEVER_CONTENT_YOU_DESIRE"
docs = db.similarity_search_with_score(query)

# print results
tmp_search_result_in_df = pd.DataFrame([[docs[i][0].page_content, docs[i][0].metadata['source'], docs[i][1]] for i in range(len(docs))])
tmp_search_result_in_df.columns = ['content', 'source', 'score']
tmp_search_result_in_df.to_csv("/content/tmp_result.csv")
tmp_search_result_in_df
```

### 1. Query Definition

The cell starts by defining a query string. In the template above the query is a placeholder (`"WHATEVER_CONTENT_YOU_DESIRE"`); the executed cell below uses the beginning of a stored question-answer pair about what John Storyteller symbolizes, so at least one chunk should match it almost exactly.

### 2. Similarity Search

Using the `similarity_search_with_score` method of `db` (the Chroma vector store created earlier in the notebook), the cell retrieves the chunks closest to the query. The query is embedded with the same embedding function as the stored chunks, and each result comes with a distance score, where lower values mean a closer match.

### 3. Results Formatting

The search results are then formatted into a pandas DataFrame. This step involves creating a list of lists, where each inner list contains the content of a document, its source metadata, and the similarity score assigned by the search method. The DataFrame is structured with columns named 'content', 'source', and 'score'.


### 4. Exporting Results

The script exports the DataFrame to a CSV file named `tmp_result.csv` located in the `/content` directory. This allows for easy sharing or further analysis of the results outside the Python environment.


### 5. Displaying the DataFrame

Finally, the DataFrame containing the search results and their respective scores is displayed. This step is useful for immediate inspection of the results within a notebook or script output.


### Execution Timing

The script is wrapped in a cell magic command (`%%time`), which is specific to Jupyter notebooks. This command measures and displays the execution time of the cell, providing insight into the performance of the query and the subsequent processing steps.

Overall, the script demonstrates a practical application of text similarity search in a database, followed by data manipulation with pandas, and concludes with both data persistence (via CSV export) and data presentation (displaying the DataFrame).
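
One possible way to act on these scores is to discard retrieved chunks whose distance exceeds a cutoff before passing them to a language model. The sketch below is illustrative and is not a step in the original notebook: `filter_by_distance` is a hypothetical helper, the cutoff of 1.0 is arbitrary, and the rows are copied from the output of the final cell.

```python
def filter_by_distance(results, max_distance):
    """Keep (content, source, score) rows whose distance is at most max_distance."""
    return [row for row in results if row[2] <= max_distance]


# Rows copied from the final results table: (content, source, score).
results = [
    ("### Human: What does John Storyteller symboliz...", "file_0.txt", 0.021491),
    ("### Human: Who is John Storyteller and what is...", "file_0.txt", 0.409301),
    ("### Human: What is special about John's approa...", "file_0.txt", 0.693118),
    ("### Human: What was Thomas Wingless' initial e...", "file_1.txt", 1.352468),
]

MAX_DISTANCE = 1.0  # hypothetical cutoff; tune it on your own data
kept = filter_by_distance(results, MAX_DISTANCE)
print([source for _, source, _ in kept])  # ['file_0.txt', 'file_0.txt', 'file_0.txt']
```

With this cutoff, the chunk from the unrelated story (`file_1.txt`) is dropped, while the three chunks about John Storyteller are kept.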


In [53]:
%%time

# query it
query = "### Human: What does John Storyteller symbolize in the submerged world? ### Assistant: John Storyteller, in his submarine, symbolizes a lifeline in the submerged world of New York City. He is a reminder that even in the most changed of worlds, some things,"
docs = db.similarity_search_with_score(query)

# print results
tmp_search_result_in_df = pd.DataFrame([[docs[i][0].page_content, docs[i][0].metadata['source'], docs[i][1]] for i in range(len(docs))])
tmp_search_result_in_df.columns = ['content', 'source', 'score']
tmp_search_result_in_df.to_csv("/content/tmp_result.csv")
tmp_search_result_in_df

CPU times: user 45.9 ms, sys: 0 ns, total: 45.9 ms
Wall time: 45.8 ms


| | content | source | score |
|---|---|---|---|
| 0 | ### Human: What does John Storyteller symboliz... | file_0.txt | 0.021491 |
| 1 | ### Human: Who is John Storyteller and what is... | file_0.txt | 0.409301 |
| 2 | ### Human: What is special about John's approa... | file_0.txt | 0.693118 |
| 3 | ### Human: What was Thomas Wingless' initial e... | file_1.txt | 1.352468 |
