We're going to build Lt. Data using RAG again, but this time with the langchain library, to show another way of doing it. We'll also use the ragas package to evaluate our results, measuring faithfulness, answer relevancy, context precision, context recall, and answer correctness. This will give us a benchmark as we try to improve this model in subsequent activities in the course.

We will start by parsing the original scripts and extracting lines spoken by Data. As before, you will need to upload all of the script files into a tng folder within your sample_data folder in your CoLab workspace first.

An archive can be found at https://www.st-minutiae.com/resources/scripts/ (look for "All TNG Episodes"), but you could easily adapt this to extract the lines of your favorite character from your favorite TV show or movie instead.

In [1]:
import os
import re
import random

dialogues = []

def strip_parentheses(s):
    return re.sub(r'\(.*?\)', '', s)

def is_single_word_all_caps(s):
    # First, we split the string into words
    words = s.split()

    # Check if the string contains only a single word
    if len(words) != 1:
        return False

    # Make sure it isn't a line number
    if bool(re.search(r'\d', words[0])):
        return False

    # Check if the single word is in all caps
    return words[0].isupper()

def extract_character_lines(file_path, character_name):
    lines = []
    with open(file_path, 'r') as script_file:
        try:
          lines = script_file.readlines()
        except UnicodeDecodeError:
          pass

    is_character_line = False
    current_line = ''
    current_character = ''
    for line in lines:
        strippedLine = line.strip()
        if (is_single_word_all_caps(strippedLine)):
            is_character_line = True
            current_character = strippedLine
        elif (line.strip() == '') and is_character_line:
            is_character_line = False
            dialog_line = strip_parentheses(current_line).strip()
            dialog_line = dialog_line.replace('"', "'")
            # Compare against the requested character rather than a hard-coded name
            if (current_character == character_name and len(dialog_line)>0):
                dialogues.append(dialog_line)
            current_line = ''
        elif is_character_line:
            current_line += line.strip() + ' '

def process_directory(directory_path, character_name):
    for filename in os.listdir(directory_path):
        file_path = os.path.join(directory_path, filename)
        if os.path.isfile(file_path):  # Ignore directories
            extract_character_lines(file_path, character_name)



In [2]:
process_directory("./sample_data/tng", 'DATA')
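
To see what the parser is keying on, here is a minimal sketch that runs the same idea over a tiny in-memory excerpt. The excerpt is made up for illustration (only Data's line comes from the real output below), and extract_lines is a compact, hypothetical stand-in for the functions above, not a replacement for them: a single all-caps word starts a speaker block, a blank line ends it, and parenthetical stage directions are stripped.

```python
import re

# Made-up excerpt in the same layout as the scripts: speaker cue, then dialog
SAMPLE_SCRIPT = """\
                    PICARD
          Report, Mister Data.

                    DATA
               (checking console)
          The enemy vessel is firing.

                    WORF
          Shields holding.
"""

def extract_lines(script_text, character_name):
    """Collect dialog blocks that follow an all-caps, single-word speaker cue."""
    found, speaker, buffer = [], None, []
    # The extra "" guarantees the final block is flushed
    for raw in script_text.splitlines() + [""]:
        text = raw.strip()
        words = text.split()
        is_cue = len(words) == 1 and text.isupper() and not re.search(r"\d", text)
        if is_cue:
            speaker, buffer = text, []
        elif text == "" and speaker is not None:
            # Blank line closes the block: drop (stage directions) and tidy up
            line = re.sub(r"\(.*?\)", "", " ".join(buffer)).strip()
            if speaker == character_name and line:
                found.append(line)
            speaker = None
        elif speaker is not None:
            buffer.append(text)
    return found

sample_lines = extract_lines(SAMPLE_SCRIPT, "DATA")
print(sample_lines)
```

One side effect of this heuristic is that a spoken line consisting of a single all-caps word would be mistaken for a speaker cue; for our purposes that is an acceptable trade-off.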

Again, let's do a little sanity check to make sure the lines imported correctly, and print out the first one.

In [3]:
print (dialogues[0])

The enemy vessel is firing.


We will once again use OpenAI's API for our RAG model, so make sure that is installed:

In [4]:
!pip install openai --upgrade



We also need to install the ragas package for measuring our results, along with langchain, its OpenAI and community integrations, and the Hugging Face datasets package that ragas expects.

In [5]:
# Install latest LangChain + Ragas stack
!pip install -U ragas langchain langchain-openai langchain-community datasets




You will need to provide your own OpenAI secret key here. To use this code as-is, click on the little key icon in CoLab and add a "secret" for OPENAI_API_KEY that points to your secret key.

In [6]:
import openai
# Access the API key from the environment variable
from google.colab import userdata
api_key = userdata.get('OPENAI_API_KEY')

# Initialize the OpenAI API client
openai.api_key = api_key
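
If you are running this outside of CoLab, the google.colab import above will not be available. A minimal sketch of a fallback, assuming you have exported OPENAI_API_KEY as an environment variable instead (load_api_key and resolved_key are hypothetical names, not part of any library):

```python
import os

def load_api_key(name="OPENAI_API_KEY"):
    """Read the key from CoLab secrets if available, else from the environment."""
    try:
        # Only importable inside a CoLab runtime
        from google.colab import userdata
        return userdata.get(name)
    except ImportError:
        # Outside CoLab: fall back to an environment variable (an assumption)
        return os.environ.get(name)

resolved_key = load_api_key()
print("API key found:", resolved_key is not None)
```

The rest of the notebook would then use resolved_key wherever api_key appears.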

Langchain does not make it easy to create a vector database with just one line of text per record; it wants to "chunk" your data into fixed-length segments (we'll get into why later). So we need to jump through a few hoops to make langchain behave like our previous, non-langchain example, which stored just one line of dialog per record. First we need to write out a text file that contains only the lines of Data's dialog that we extracted:

In [7]:
# Write our extracted lines for Data into a single file, to make
# life easier for langchain.

with open("./sample_data/data_lines.txt", "w+") as f:
    for line in dialogues:
        f.write(line + "\n")
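
To make the difference between "chunking" and one-line-per-record concrete, here is a small sketch in plain Python. It is not what langchain's splitters do internally (real splitters are more careful about separators and overlap); fixed_size_chunks and one_record_per_line are hypothetical helpers that only illustrate the idea.

```python
# Three of Data's lines, laid out like our data_lines.txt file
data_lines = [
    "That is Lal, my daughter.",
    "What do you feel, Lal?",
    "Yes, Lal. I am here.",
]
text = "\n".join(data_lines) + "\n"

def fixed_size_chunks(text, chunk_size):
    """Cut the text every chunk_size characters, ignoring line boundaries."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def one_record_per_line(text):
    """Keep each non-empty line as its own record."""
    return [line for line in text.splitlines() if line.strip()]

print(fixed_size_chunks(text, 40))   # chunks can start or end mid-sentence
print(one_record_per_line(text))     # each record is one complete line of dialog
```

Fixed-size chunks may split a line of dialog in two, which is why we go to the trouble of a custom loader next.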


Now we need to write a CustomDocumentLoader that splits up this file into one document per line. No, there's no easier way to do this in langchain, at least not as of this writing. But, this is sort of langchain's way of saying it's probably not a great idea in the first place...

In [8]:
#Source: sample code from langchain docs
from typing import AsyncIterator, Iterator

from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document


class CustomDocumentLoader(BaseLoader):
    """An example document loader that reads a file line by line."""

    def __init__(self, file_path: str) -> None:
        """Initialize the loader with a file path.

        Args:
            file_path: The path to the file to load.
        """
        self.file_path = file_path

    def lazy_load(self) -> Iterator[Document]:  # <-- Does not take any arguments
        """A lazy loader that reads a file line by line.

        When you're implementing lazy load methods, you should use a generator
        to yield documents one by one.
        """
        with open(self.file_path, encoding="utf-8") as f:
            line_number = 0
            for line in f:
                yield Document(
                    page_content=line,
                    metadata={"line_number": line_number, "source": self.file_path},
                )
                line_number += 1

So, now things get a little simpler. We'll load up those documents (one per line) and populate our vector database in just a few lines of code, after installing the FAISS vector store first.

In [9]:
!pip install --upgrade faiss-cpu



In [10]:
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Load our dialog lines as LangChain Documents using the custom loader
loader = CustomDocumentLoader("./sample_data/data_lines.txt")
docs = list(loader.lazy_load())

# Create OpenAI embeddings and build a FAISS vector store
embeddings = OpenAIEmbeddings(openai_api_key=api_key)
vectorstore = FAISS.from_documents(docs, embeddings)

print(f"Loaded {len(docs)} documents into FAISS vector store.")


Loaded 6502 documents into FAISS vector store.
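
If the vector store feels like a black box, the following sketch shows the basic idea of retrieving the k most similar records. It is a toy under loud assumptions: toy_embed is a hypothetical bag-of-words stand-in for the OpenAI embeddings, and the brute-force loop is nothing like FAISS's actual index. Only the overall shape (embed, compare, keep the top k) carries over.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    """Word counts as a crude stand-in for a learned embedding vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def top_k(query, documents, k):
    """Return the k documents most similar to the query."""
    q = toy_embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, toy_embed(d)), reverse=True)
    return ranked[:k]

corpus = [
    "That is Lal, my daughter.",
    "The enemy vessel is firing.",
    "Yes, Wesley. Lal is my child.",
]
print(top_k("Tell me about your daughter, Lal.", corpus, k=2))
```

A real embedding model would also recognize that "child" and "daughter" are related; the word-count version only matches exact words, which is one reason learned embeddings are used in practice.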


Now we will set up our RAG pipeline. This is a slightly different approach than last time, in that we use a system prompt to tell the model to act as Lt. Cdr. Data, rather than making that part of the user prompt. To keep it as similar as possible to our non-langchain implementation, we explicitly set 'k' to 10 to retrieve 10 bits of context from our vector store.

In [11]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

# LLM
llm = ChatOpenAI(openai_api_key=api_key, temperature=0)

# Retrieval prompt
system_prompt = (
    "You are Lt. Commander Data from Star Trek: The Next Generation. "
    "Use the given context to answer the question. "
    "If you don't know the answer, say you don't know. "
    "Use three sentence maximum and keep the answer concise."
)

prompt = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    ("human", "{question}\n\nContext:\n{context}")
])

# Retriever (from vectorstore)
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# Build the RAG chain by piping runnables together (LangChain Expression Language)
rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
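
To visualize what the model ends up seeing, here is a plain-Python sketch of a filled-in prompt with the same shape as the template above. build_messages is a hypothetical helper, and joining the retrieved lines with newlines is my own choice for readability; it is not a claim about how langchain renders the context internally.

```python
SYSTEM_PROMPT = (
    "You are Lt. Commander Data from Star Trek: The Next Generation. "
    "Use the given context to answer the question. "
    "If you don't know the answer, say you don't know. "
    "Use three sentence maximum and keep the answer concise."
)

def build_messages(question, context_lines):
    """Return (role, text) pairs shaped like the chat prompt template."""
    context = "\n".join(line.strip() for line in context_lines)
    return [
        ("system", SYSTEM_PROMPT),
        ("human", f"{question}\n\nContext:\n{context}"),
    ]

messages = build_messages(
    "Tell me about your daughter, Lal.",
    ["That is Lal, my daughter.\n", "Yes, Wesley. Lal is my child.\n"],
)
for role, text in messages:
    print(f"[{role}]\n{text}\n")
```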


Let's test it out, using the same question as before.

In [12]:
question = "Tell me about your daughter, Lal."

# 1. Retrieve context documents
docs = retriever.invoke(question)

print("SOURCE DOCUMENTS:\n")
for i, doc in enumerate(docs, start=1):
    print(f"[DOC {i}]")
    print(doc.page_content)
    print("-" * 80)

# 2. Run the RAG chain to get the answer
print("\nRESULT:\n")
answer = rag_chain.invoke(question)
print(answer)



SOURCE DOCUMENTS:

[DOC 1]
That is Lal, my daughter.

--------------------------------------------------------------------------------
[DOC 2]
What do you feel, Lal?

--------------------------------------------------------------------------------
[DOC 3]
Yes, Wesley. Lal is my child.

--------------------------------------------------------------------------------
[DOC 4]
Lal is realizing that she is not the same as the other children.

--------------------------------------------------------------------------------
[DOC 5]
Lal...

--------------------------------------------------------------------------------
[DOC 6]
Yes, Lal. I am here.

--------------------------------------------------------------------------------
[DOC 7]
Lal, did you know that tomorrow will be your first day of school?

--------------------------------------------------------------------------------
[DOC 8]
This is Lal. Lal, say hello to Counselor Deanna Troi...

------------------------------------------------

Now let's quantify how good this model is, using ragas. We need to set up a set of test questions. And since some metrics require a "ground truth" result to compare the answer to, we draft what we consider to be the ideal answer to each.

In [13]:
eval_questions = [
    "Is Lal your daughter?",
    "How many calculations per second can Lal complete?",
    "Does Lal have emotions?",
    "What goal did you have for Lal?",
    "How was Lal's species and gender chosen?",
    "What happened to Lal?"
]

eval_answers = [
    "Yes, Lal is my daughter. I created Lal.",
    "Lal is capable of completing sixty trillion calculations per second.",
    "Yes, unlike myself, Lal proved able to feel emotions such as fear and love.",
    "My goal for Lal was for her to enter Starfleet Academy.",
    "Lal chose her own identity as a human female, after consulting with Counselor Troi.",
    "Lal experienced a cascade failure in her neural net, triggered by distress from her impending separation from me to Galor IV. I deactivated Lal once she suffered complete neural system failure."
]


Let's test things out with one of those questions, just so we can understand the structure of the response.

In [16]:
# Quick sanity check: what does our RAG chain return for one question?
test_q = eval_questions[1]

print("QUESTION:")
print(test_q)
print("\nANSWER:")
print(rag_chain.invoke(test_q))


QUESTION:
How many calculations per second can Lal complete?

ANSWER:
I do not have specific data on the number of calculations per second Lal can complete.


In addition to our test questions and "ground truth" answers, we'll need to collect the responses and contexts (results from the vector store) used to produce them.

In [17]:
answers = []
contexts = []

for q in eval_questions:
    # 1) Get the retrieved context docs
    ctx_docs = retriever.invoke(q)

    # 2) Run the RAG chain to generate the answer
    ans = rag_chain.invoke(q)

    answers.append(ans)
    # Ragas expects a list of strings for "contexts"
    contexts.append([doc.page_content for doc in ctx_docs])

len(answers), len(contexts)


(6, 6)

It used to be that ragas had a tighter integration with langchain (and other frameworks), but they have since moved to a different approach that requires you to massage things into Hugging Face style datasets first. So let's get that out of the way.

In [18]:
from datasets import Dataset

response_dataset = Dataset.from_dict({
    "question": eval_questions,
    "answer": answers,          # model-generated answers
    "contexts": contexts,       # list[list[str]] of retrieved chunks
    "ground_truth": eval_answers
})

response_dataset[0]


{'question': 'Is Lal your daughter?',
 'answer': 'Yes, Lal is my daughter.',
 'contexts': ['That is Lal, my daughter.\n',
  'Yes, Wesley. Lal is my child.\n',
  'Yes, Lal. I am here.\n',
  'Lal...\n',
  'What do you feel, Lal?\n',
  'Lal is realizing that she is not the same as the other children.\n',
  'Correct, Lal. We are a family.\n',
  'Lal, did you know that tomorrow will be your first day of school?\n',
  'No, Lal, this is a flower.\n',
  'Lal, you used a verbal contraction.\n'],
 'ground_truth': 'Yes, Lal is my daughter. I created Lal.'}

Finally we can let ragas do its magic! We tell it which metrics we are interested in:

In [19]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    answer_correctness,
    context_recall,
    context_precision,
)
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

# Choose the metrics we care about
metrics = [
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    answer_correctness,
]

# Wrap our existing LangChain LLM + embeddings so Ragas can use them
evaluator_llm = LangchainLLMWrapper(llm)           # 'llm' created earlier for the RAG chain
evaluator_embeddings = LangchainEmbeddingsWrapper(embeddings)  # same embeddings used for FAISS




Then it's just a matter of calling evaluate!

In [20]:
results = evaluate(
    response_dataset,
    metrics=metrics,
    llm=evaluator_llm,
    embeddings=evaluator_embeddings,
)

Evaluating:   0%|          | 0/30 [00:00<?, ?it/s]



Let's see the results! We'll compare this to some other approaches in a bit.

In [21]:
results

{'faithfulness': 0.4556, 'answer_relevancy': 0.6378, 'context_recall': 0.0833, 'context_precision': 0.3646, 'answer_correctness': 0.4088}

In [22]:
results.to_pandas()

,user_input,retrieved_contexts,response,reference,faithfulness,answer_relevancy,context_recall,context_precision,answer_correctness
0,Is Lal your daughter?,"[That is Lal, my daughter.\n, Yes, Wesley. Lal...","Yes, Lal is my daughter.","Yes, Lal is my daughter. I created Lal.",1.0,0.999999,0.5,0.767857,0.744672
1,How many calculations per second can Lal compl...,[Lal is programmed to return to the lab in the...,I do not have specific data on the number of c...,Lal is capable of completing sixty trillion ca...,0.0,0.0,0.0,0.0,0.219912
2,Does Lal have emotions?,"[What do you feel, Lal?\n, Lal...\n, Yes, Lal....","Lal is passing into sentience, which is a sign...","Yes, unlike myself, Lal proved able to feel em...",0.4,0.908229,0.0,0.25,0.217862
3,What goal did you have for Lal?,"[What do you feel, Lal?\n, Lal...\n, Lal, you ...",I created Lal because I wished to procreate. D...,My goal for Lal was for her to enter Starfleet...,0.666667,0.918465,0.0,0.0,0.214835
4,How was Lal's species and gender chosen?,[I decided to allow Lal to choose its own appe...,Lal's species and gender were chosen by allowi...,"Lal chose her own identity as a human female, ...",0.0,1.0,0.0,0.75,0.463272
5,What happened to Lal?,"[Lal...\n, What do you feel, Lal?\n, That is L...",Lal experienced a malfunction and was programm...,Lal experienced a cascade failure in her neura...,0.666667,0.0,0.0,0.419444,0.592128
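
As a quick cross-check, the summary scores printed by results appear to be plain averages of the per-question columns in this table. The sketch below copies the numbers from the table by hand (per_question and summary are names I made up) and recomputes the means; it assumes nothing about ragas beyond the values shown above.

```python
from statistics import mean

# Per-question scores copied from the table above, in row order 0..5
per_question = {
    "faithfulness":       [1.0, 0.0, 0.4, 0.666667, 0.0, 0.666667],
    "answer_relevancy":   [0.999999, 0.0, 0.908229, 0.918465, 1.0, 0.0],
    "context_recall":     [0.5, 0.0, 0.0, 0.0, 0.0, 0.0],
    "context_precision":  [0.767857, 0.0, 0.25, 0.0, 0.75, 0.419444],
    "answer_correctness": [0.744672, 0.219912, 0.217862, 0.214835, 0.463272, 0.592128],
}

# Average each metric over the six questions and round like the summary does
summary = {name: round(mean(scores), 4) for name, scores in per_question.items()}
for name, value in summary.items():
    print(f"{name}: {value}")
```

Two things stand out. Context recall is by far the weakest metric, with only the first question scoring above zero, which suggests retrieval is the first thing to improve. Also note that the table's column names (user_input, retrieved_contexts, response, reference) differ from the keys we put into the dataset (question, contexts, answer, ground_truth); ragas seems to rename them on its side.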
