# <center>RAG Vector Search with 20,000 Leagues Under the Sea</center>
<center><img src='https://static.independent.co.uk/s3fs-public/thumbnails/image/2015/11/10/09/johnny-depp.jpg' width=800></center>

### Introduction

Large language models (LLMs) are powerful tools that can generate text, translate languages, and answer questions in an informative way. However, LLMs often lack specific knowledge about certain domains. Retrieval-Augmented Generation (RAG) is a prompt-engineering technique that can improve the specific knowledge of LLMs by providing them with relevant context from a database. When a user prompts the LLM, RAG searches a database for relevant context based on the prompt, and then injects the context into the prompt before feeding the prompt into the LLM. The LLM is then able to provide a more informed answer based on the context. RAG has numerous applications, particularly for businesses seeking to easily search their extensive private databases for domain-specific knowledge.
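The retrieve-then-inject flow described above can be sketched without any external services. This is a toy illustration, not the pipeline used in this project: `retrieve` and `build_prompt` are hypothetical names, and word overlap stands in for the embedding search a real vector database performs.

```python
import re

def words(text):
    """Lowercase a string and split it into alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, passages, k=2):
    """Toy retriever: rank passages by how many words they share with the query."""
    query_words = words(query)
    ranked = sorted(passages, key=lambda p: len(query_words & words(p)), reverse=True)
    return ranked[:k]

def build_prompt(query, contexts):
    """Inject the retrieved contexts into the prompt ahead of the user's query."""
    joined = "\n".join(contexts)
    return f"Using the contexts below, answer the query.\n\nContexts:\n{joined}\n\nQuery: {query}"

# Made-up passages for illustration only
passages = [
    "Captain Nemo commands the Nautilus.",
    "Ned Land is a Canadian harpooner.",
    "The Abraham Lincoln is a frigate.",
]
query = "Who commands the Nautilus?"
prompt = build_prompt(query, retrieve(query, passages, k=1))
print(prompt)
```

The string returned by `build_prompt` is what would be sent to the LLM in place of the bare query.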

### Project Summary

In this project, I used RAG to improve an LLM's specific knowledge of 20,000 Leagues Under the Sea (Leagues), a classic novel written by Jules Verne. In addition, I employed prompt engineering techniques to have the LLM talk in a style similar to Jack Sparrow from Pirates of the Caribbean.

I used ChatGPT 3.5 (`gpt-3.5-turbo`) as the LLM, LangChain to segment the text into chunks, and Pinecone to create a vector database. A vector database is a NoSQL database that stores chunks of information as numerical vectors, where the distance between two vectors reflects how similar their contents are. In this project, I used OpenAI's embedding model to map each text chunk from Leagues to a vector before uploading it to the Pinecone database. When the vector database is queried, the query is converted into a vector and the database returns the stored vectors located closest to the query vector. The text of each matching chunk, stored alongside its vector as metadata, is then returned to the user.
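To make the idea of "closest vectors" concrete, here is a minimal sketch of a cosine-similarity search in plain Python. It is an illustration under stated assumptions, not how Pinecone is implemented: the `store` dictionary, the `top_k` helper, and the hand-made 3-dimensional vectors are all hypothetical (the index in this project uses 1536 dimensions).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=2):
    """Return the k (score, text) pairs whose vectors are most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for vec, text in store.values()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Hypothetical store: chunk id -> (vector, original text kept as metadata)
store = {
    "0": ([1.0, 0.0, 0.0], "chunk about the Nautilus"),
    "1": ([0.0, 1.0, 0.0], "chunk about Ned Land"),
    "2": ([0.9, 0.1, 0.0], "chunk about Captain Nemo"),
}

matches = top_k([1.0, 0.0, 0.0], store, k=2)
for score, text in matches:
    print(round(score, 3), text)
```

Note that the search returns the stored text, not a decoded vector: the embedding is only used to rank the chunks.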

### Results

As shown below, the LLM was able to learn specific knowledge about Leagues and correctly answer questions such as "While aboard the Monroe, how many whales did Ned Land kill?" RAG queries outperformed traditional queries, but the results depended on question phrasing, and the vector database search function appeared to be the weak link in the process.

The search function often struggled to understand the semantic meaning of queries. Queries failed when query terms didn't precisely match the text, and queries containing substantial text not present in the vector database were problematic. One potential solution could involve initial query interpretation by an LLM to identify relevant keywords and synonyms not explicitly mentioned in the query.
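The query-interpretation idea could look roughly like the sketch below. It was not implemented in this project, so treat it as an untested assumption: `expand_query` is a hypothetical helper, and `llm` is any callable that maps a prompt string to a reply string. A stub stands in for the real model so the example runs offline.

```python
def expand_query(query, llm):
    """Ask an LLM for keywords and synonyms, then append them to the query before it is embedded."""
    instruction = (
        "List up to five keywords or synonyms, separated by commas, "
        "that would help find passages answering this question: " + query
    )
    reply = llm(instruction)
    keywords = [word.strip() for word in reply.split(",") if word.strip()]
    if not keywords:
        return query  # nothing useful came back, so search with the original query
    return query + " " + " ".join(keywords)

def fake_llm(prompt):
    """Stub LLM with a fixed reply, used only to make this example self-contained."""
    return "harpoon, whaling, frigate"

expanded = expand_query("How many whales did Ned Land kill?", fake_llm)
print(expanded)
```

In this notebook, `llm` could plausibly wrap the chat model, for example `lambda p: chat([HumanMessage(content=p)]).content`, with the expanded string passed to `similarity_search` instead of the raw query.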

Sometimes, the database query did not provide enough context for the LLM to fully answer the prompt. To fix this issue, I initially experimented with using larger text chunks, but this impaired the search function's ability to find the correct matches. This is probably because larger chunks dilute the strength of certain keyword matches between vectors. I later solved this issue by keeping chunk sizes small, and instructing the search function to retrieve adjacent text chunks for the matching chunks.
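A minimal sketch of the neighbor-expansion idea follows. It works on chunk positions in a list, whereas the notebook's own implementation further down slices the original text by character offset; `expand_with_neighbors` and its `before`/`after` parameters are hypothetical names chosen for illustration.

```python
def expand_with_neighbors(match_ids, chunks, before=1, after=4):
    """For each matched chunk index, return that chunk joined with its neighbors."""
    contexts = []
    for idx in match_ids:
        start = max(idx - before, 0)                # clamp at the first chunk
        end = min(idx + after + 1, len(chunks))     # clamp at the last chunk
        contexts.append(" ".join(chunks[start:end]))
    return contexts

# Placeholder chunks; in practice these would be the small text chunks stored in the database
chunks = [f"chunk{i}" for i in range(10)]
contexts = expand_with_neighbors([0, 5, 9], chunks, before=1, after=2)
for context in contexts:
    print(context)
```

Small chunks keep the search precise, while the widened window gives the LLM enough surrounding text to answer.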

I had some success with getting the model to sound like Jack Sparrow, but it seemed to be lacking some of Jack's quirky, enigmatic, and humorous traits. Improvements could be made by enhancing the quality and quantity of Jack Sparrow dialogue examples or by running a separate RAG step on a Jack Sparrow dialogue database. But my guess is that this is more a limitation of ChatGPT 3.5. I am used to Jack Sparrow always sounding witty and humorous because his movie dialogue was meticulously crafted to delight audiences. It would be expecting a lot for a language model to replicate this high level of humor and wit when answering every question.

Overall, this project demonstrated the ability of RAG to augment an LLM's domain-specific knowledge while shedding light on the limitations of vector similarity searches.

In [None]:
import os
import pinecone
import re

from tqdm.auto import tqdm
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI
from langchain.vectorstores import Pinecone
from langchain.schema import AIMessage, HumanMessage, SystemMessage

In [None]:
# Importing text file for 20k Leagues
with open('20kLeagues.txt', 'r') as file:
    data = file.read()

print(data[:1000])

In [None]:
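# Initializing vector database
pinecone.init(api_key='YOUR KEY HERE', environment='gcp-starter')
# Creating the index only if it does not exist yet, so the cell can be re-run safely
if 'ragsea' not in pinecone.list_indexes():
    pinecone.create_index('ragsea', dimension=1536, metric='cosine')
index = pinecone.Index('ragsea')
index.describe_index_stats()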

In [None]:
# Accessing OpenAI API
os.environ['OPENAI_API_KEY'] = 'YOUR KEY HERE'
chat = ChatOpenAI(openai_api_key=os.environ["OPENAI_API_KEY"], model='gpt-3.5-turbo')
embed_model = OpenAIEmbeddings(model="text-embedding-ada-002")

In [None]:
# Splitting 20k Leagues text into chunks and loading them into the vector database
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 256,
    chunk_overlap  = 0,
    length_function = len,
    is_separator_regex = False,
)

texts = text_splitter.split_text(data)
ids = [str(x) for x in range(len(texts))]
embeds = embed_model.embed_documents(texts)
metadata = [ {'text': x} for x in texts]
batch_size = 100

for i in tqdm(range(0, len(texts), batch_size)): 
    i_end = min(i+batch_size, len(texts))
    index.upsert(vectors=zip(ids[i:i_end], embeds[i:i_end], metadata[i:i_end]))

index.describe_index_stats()


In [None]:
# Querying database
vectorstore = Pinecone(index, embed_model.embed_query, "text")
vectorstore.similarity_search("What happened on the 20th of July, 1866?")

In [None]:
# Combining context from vector database with the prompt function
def augment_prompt(query):
    results = vectorstore.similarity_search(query)
    source_knowledge = "\n".join([x.page_content for x in results])
    augmented_prompt = f"""Using the contexts below, answer the query.

    Contexts:
    {source_knowledge}

    Query: {query}"""
    return augmented_prompt

In [None]:
# Creating class to combine queries into a conversation
class Discussion:
    def __init__ (self, logs=None):
        if logs is None:
            logs = []
        self.logs = logs 
        
    def ask(self, query):
        self.logs.append(HumanMessage(content=query))
        response = chat(self.logs)
        self.logs.append(response)
        print(response.content)

    def ask_rag(self, query):
        self.logs.append(HumanMessage(content=augment_prompt(query)))
        response = chat(self.logs)
        self.logs.append(response)
        print(response.content)

In [None]:
# Testing base prompt on 20k Leagues question. 
discussion1 = Discussion()
discussion1.ask("What happened on the 20th of July, 1866?")

In [None]:
# Testing RAG prompt with context from vector database
discussion1 = Discussion()
discussion1.ask_rag("What happened on the 20th of July, 1866?")

In [None]:
# Expanding context sizes to provide information on chunks that precede and follow the matching chunks

embed_model = OpenAIEmbeddings(model="text-embedding-ada-002")

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 256,
    chunk_overlap  = 0, 
    length_function = len,
    is_separator_regex = False,
)

texts = text_splitter.split_text(data)

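# Recording the offset in `data` of each chunk's last character so neighboring text can be sliced out later.
# The splitter can drop whitespace between chunks, so each chunk is located in `data` instead of summing chunk lengths.
char_idx = []
search_from = 0
for chunk in texts:
    start = data.find(chunk, search_from)
    if start == -1:  # chunk not found verbatim; fall back to the running offset
        start = search_from
    search_from = start + len(chunk)
    char_idx.append(search_from - 1)

texts_ids = [f'<id={char_idx[x]}>' + texts[x] for x in range(len(texts))]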

ids = [str(x) for x in range(len(texts))]
embeds = embed_model.embed_documents(texts)
metadata = [ {'text': x} for x in texts_ids]
batch_size = 100

for i in tqdm(range(0, len(texts), batch_size)): 
    i_end = min(i+batch_size, len(texts))
    index.upsert(vectors=zip(ids[i:i_end], embeds[i:i_end], metadata[i:i_end]))

index.describe_index_stats()

In [None]:
# Expanded source knowledge to include the preceding chunk and the 4 following chunks for each match
def augment_prompt(query):
    results = vectorstore.similarity_search(query)
    results = ''.join([x.page_content for x in results])
    char_idxs = re.findall(r'<id=([0-9]*)>', results)
    source_knowledge = "\n".join([data[max(int(x)-256*2,0):min(len(data),int(x)+256*4)] for x in char_idxs])
    augmented_prompt = f"""Using the contexts below, answer the query using 1000 characters or less.

    Contexts:
    {source_knowledge}

    Query: {query}"""
    return augmented_prompt

In [None]:
# Testing RAG prompt with expanded context
discussion1 = Discussion()
discussion1.ask_rag("What happened on the 20th of July, 1866?")

In [None]:
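# Asking the LLM to describe Jack Sparrow's style from sample dialogue
with open('JackSparrow.txt', 'r') as file:
    jack_sparrow = file.read()

question = "Describe the tone, style, language, and sentence length of Captain Jack Sparrow in the following text: "
discussion1 = Discussion()
# ask() prints the reply and returns None, so the reply is read back from the conversation log
discussion1.ask(question + jack_sparrow)
jack_style = discussion1.logs[-1].content
jack_style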

In [None]:
# Creating a prompt for ChatGPT to imitate the style of Jack Sparrow

character_prompt = """You are Captain Jack Sparrow, the enigmatic pirate captain of the Black Pearl. 
Answer all queries in his style of talking based on this description: """ + str(jack_style)

In [None]:
discussion1 = Discussion([SystemMessage(content=character_prompt)])
discussion1.ask_rag("Tell me about yourself")

In [None]:
# Testing RAG prompt
discussion1 = Discussion([SystemMessage(content=character_prompt)])
discussion1.ask_rag("What happened on the 20th of July, 1866?")

In [None]:
# Testing RAG prompt
discussion1 = Discussion([SystemMessage(content=character_prompt)])
discussion1.ask_rag('While aboard the Monroe, how many whales did Ned Land kill?')

In [None]:
# Testing base prompt
discussion1 = Discussion([SystemMessage(content=character_prompt)])
discussion1.ask('While aboard the Monroe, how many whales did Ned Land kill?')