Instead of using fine-tuning, we'll use RAG to build our own "Commander Data" based on everything he ever said in the scripts.

To summarize the high-level approach:

- We'll first parse all of the scripts to extract every line from Data, much as we did in the fine-tuning example.
- Then we'll use the OpenAI embeddings API to compute an embedding vector for each of his lines. Lines with similar meanings end up with vectors that are close together, which gives us a way to measure the similarity between any two lines.
- RAG calls for use of a vector database to store these lines with the associated embedding vectors. To keep things simple, we'll use a local database called chromadb. There are plenty of cloud-based vector database services out there as well.
- Then we'll write a small retrieval function that fetches the N most similar lines from the vector database for a given query.
- Those similar lines are then added as context to the prompt before it is handed off to the chat API.

I'm intentionally not using langchain or some other higher-level framework, because this is actually pretty simple without it.
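
To make the idea of "similar lines" concrete, here is a minimal sketch of cosine similarity, one common way to compare embedding vectors. The three-dimensional vectors and the `cosine_similarity` helper are made up for illustration; real embeddings come from the API and have many more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (hypothetical values, not real API output)
lal_line = [0.9, 0.1, 0.0]
daughter_query = [0.8, 0.2, 0.1]
warp_core_line = [0.0, 0.2, 0.9]

sim_related = cosine_similarity(daughter_query, lal_line)
sim_unrelated = cosine_similarity(daughter_query, warp_core_line)

# The related pair should score noticeably higher than the unrelated pair
print(sim_related, sim_unrelated)
```

The vector database does this kind of comparison for us at scale, so we never have to compare the query against every line by hand.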

First, let's install chromadb:

In [None]:
!pip install chromadb


Collecting chromadb
  Downloading chromadb-1.3.5-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.2 kB)
Collecting build>=1.0.3 (from chromadb)
  Downloading build-1.3.0-py3-none-any.whl.metadata (5.6 kB)
Collecting pybase64>=1.4.1 (from chromadb)
  Downloading pybase64-1.4.2-cp312-cp312-manylinux1_x86_64.manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_5_x86_64.whl.metadata (8.7 kB)
Collecting posthog<6.0.0,>=2.4.0 (from chromadb)
  Downloading posthog-5.4.0-py3-none-any.whl.metadata (5.7 kB)
Collecting onnxruntime>=1.14.1 (from chromadb)
  Downloading onnxruntime-1.23.2-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (5.1 kB)
Collecting opentelemetry-exporter-otlp-proto-grpc>=1.2.0 (from chromadb)
  Downloading opentelemetry_exporter_otlp_proto_grpc-1.38.0-py3-none-any.whl.metadata (2.4 kB)
Collecting pypika>=0.48.9 (from chromadb)
  Downloading PyPika-0.48.9.tar.gz (67 kB)

Now we'll parse all of the scripts and extract every line of dialog spoken by "DATA". This is almost exactly the same code as the preprocessing script from our fine-tuning example. Note that you will first need to upload all of the script files into a tng folder within the sample_data folder in your Colab workspace.

An archive can be found at https://www.st-minutiae.com/resources/scripts/ (look for "All TNG Episodes"), but you could easily adapt this to extract the lines of your favorite character from your favorite TV show or movie instead.

In [None]:
import os
import re

dialogues = []

def strip_parentheses(s):
    return re.sub(r'\(.*?\)', '', s)

def is_single_word_all_caps(s):
    # First, we split the string into words
    words = s.split()

    # Check if the string contains only a single word
    if len(words) != 1:
        return False

    # Make sure it isn't a line number
    if bool(re.search(r'\d', words[0])):
        return False

    # Check if the single word is in all caps
    return words[0].isupper()

def extract_character_lines(file_path, character_name):
    lines = []
    with open(file_path, 'r') as script_file:
        try:
            lines = script_file.readlines()
        except UnicodeDecodeError:
            # Skip files that can't be decoded with the default encoding
            pass

    is_character_line = False
    current_line = ''
    current_character = ''
    for line in lines:
        stripped_line = line.strip()
        if is_single_word_all_caps(stripped_line):
            # A lone all-caps word marks the start of a character's speech
            is_character_line = True
            current_character = stripped_line
        elif stripped_line == '' and is_character_line:
            # A blank line ends the speech; keep it if the character matches
            is_character_line = False
            dialog_line = strip_parentheses(current_line).strip()
            dialog_line = dialog_line.replace('"', "'")
            if current_character == character_name and len(dialog_line) > 0:
                dialogues.append(dialog_line)
            current_line = ''
        elif is_character_line:
            current_line += stripped_line + ' '

def process_directory(directory_path, character_name):
    for filename in os.listdir(directory_path):
        file_path = os.path.join(directory_path, filename)
        if os.path.isfile(file_path):  # Ignore directories
            extract_character_lines(file_path, character_name)
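
A quick way to see what the two helper functions do is to run them on a few strings. The sample strings below are made up, and the helpers are repeated so that the snippet runs on its own.

```python
import re

def strip_parentheses(s):
    # Remove parenthesized stage directions such as "(puzzled)"
    return re.sub(r'\(.*?\)', '', s)

def is_single_word_all_caps(s):
    words = s.split()
    if len(words) != 1:
        return False
    if re.search(r'\d', words[0]):
        return False
    return words[0].isupper()

print(strip_parentheses("(puzzled) Intriguing.").strip())
print(is_single_word_all_caps("DATA"))            # a character cue
print(is_single_word_all_caps("CAPTAIN PICARD"))  # two words, so not a cue
print(is_single_word_all_caps("42A"))             # contains a digit, so not a cue
```

Note that a cue consisting of more than one word is not recognized as a character name, so any dialog following such a cue is skipped by this parser.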



In [None]:
process_directory("./sample_data/tng", 'DATA')

Let's do a little sanity check to make sure the lines imported correctly, and print out the first one.

In [None]:
print(dialogues[0])

The enemy vessel is firing.


Now we'll define the dimensionality for our embedding vectors that will be stored in Chroma.

Chroma will simply store:

- An **ID** for each line of dialog
- The **text** of the line
- The **embedding vector** (a list of floats)

We don't need `docarray` or custom document classes for this approach — plain Python lists and strings are enough.


In [None]:
# Dimensionality of the embeddings we will request from OpenAI.
# This must match the `dimensions` argument we pass when creating embeddings.
embedding_dimensions = 128


It's time to start computing embeddings for each line with OpenAI, so let's make sure the openai package is installed:

In [None]:
!pip install openai --upgrade



Let's initialize the OpenAI client, and test creating an embedding for a single line of dialog just to make sure it works.

You will need to provide your own OpenAI secret key here. To use this code as-is, click on the little key icon in Colab and add a secret named OPENAI_API_KEY that contains your secret key.
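
If you are running this outside Colab, `google.colab` won't be importable. One possible fallback, sketched here, is to read the key from an environment variable instead. The `get_api_key` helper is hypothetical and not part of the original notebook.

```python
import os

def get_api_key(name="OPENAI_API_KEY"):
    """Return the key from Colab secrets if available, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
        return userdata.get(name)
    except ImportError:
        # Assumption: outside Colab, the key was exported as an environment variable
        key = os.environ.get(name)
        if not key:
            raise RuntimeError(f"Set the {name} environment variable first.")
        return key
```

With a helper like this, the client could be created as `OpenAI(api_key=get_api_key())` in either environment.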

In [None]:
from google.colab import userdata

from openai import OpenAI
client = OpenAI(api_key=userdata.get('OPENAI_API_KEY'))

embedding_model = "text-embedding-3-small"

response = client.embeddings.create(
    input=dialogues[1],
    dimensions=embedding_dimensions,
    model=embedding_model
)

print(response.data[0].embedding)

[0.09909163415431976, 0.2058548778295517, 0.08390823751688004, -0.039756521582603455, -0.13281475007534027, -0.02327454648911953, -0.06748619675636292, 0.1062837690114975, -0.04762791469693184, -0.1171518862247467, 0.07999251782894135, 0.037498991936445236, -0.08670517802238464, -0.07104230672121048, -0.00662275729700923, 0.042593419551849365, -0.004672390408813953, -0.10812176018953323, -0.04347245767712593, 0.08454754203557968, 0.0033638214226812124, 0.07659623771905899, -0.019129080697894096, -0.04698861390352249, 0.06756611168384552, -0.07835431396961212, 0.14168505370616913, -0.02099703811109066, 0.055099744349718094, 0.004777275491505861, 0.19195008277893066, -0.07040300965309143, 0.036779776215553284, -0.05905541777610779, 0.13904793560504913, -0.013565164990723133, 0.009419698268175125, 0.07575715333223343, -0.08814360946416855, 0.03544124215841293, 0.0470685251057148, -0.015043548308312893, -0.08942221105098724, -0.017321057617664337, -0.059295155107975006, 0.03811831399798393

Let's double check that we do in fact have embeddings of 128 dimensions as we specified.

In [None]:
print(len(response.data[0].embedding))

128


OK, now let's compute embeddings for every line Data ever said. A single embeddings request can only hold a limited amount of input, so we send the lines in batches of 128. (That batch size is arbitrary; it just happens to match the 128 dimensions we chose for each vector.)

In [None]:
# Generate embeddings for everything Data ever said, 128 lines at a time.
embeddings = []

for i in range(0, len(dialogues), 128):
  dialog_slice = dialogues[i:i+128]
  slice_embeddings = client.embeddings.create(
    input=dialog_slice,
    dimensions=embedding_dimensions,
    model=embedding_model
  )

  embeddings.extend(slice_embeddings.data)
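
The slicing pattern in the loop above can be pulled out into a small generator, which makes it easy to check the batch sizes without calling the API. This is only a sketch; `chunks` is a hypothetical helper name, and the generated list stands in for `dialogues`.

```python
def chunks(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

sample_lines = [f"line {i}" for i in range(300)]  # stand-in for `dialogues`
batch_sizes = [len(batch) for batch in chunks(sample_lines, 128)]
print(batch_sizes)  # [128, 128, 44]
```

The last batch is simply whatever is left over, so no line is dropped or sent twice.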

Let's check how many embeddings we actually got back in total.

In [None]:
print(len(embeddings))

6502


Now let's insert every line and its embedding vector into our vector database.

We'll use **Chroma**, a lightweight local vector store that:

- Runs entirely in your notebook process
- Requires no external service or account
- Can persist data to a local directory

We'll create a persistent collection under `./sample_data/chroma_db` and store each dialog line along with its embedding and an integer ID.

Be sure to create that sample_data/chroma_db directory first!
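
One way to create the directory from inside the notebook, rather than through the Colab file browser, is sketched below. It assumes the notebook's working directory is the one that contains sample_data.

```python
import os

# exist_ok=True makes this safe to re-run if the directory already exists
os.makedirs("./sample_data/chroma_db", exist_ok=True)
```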


In [None]:
import chromadb

# Create (or connect to) a local Chroma DB on disk
chroma_client = chromadb.PersistentClient(
    path="./sample_data/chroma_db"
)

# Create (or get) a collection for Data's dialog lines
collection = chroma_client.get_or_create_collection(
    name="data_dialogues"
)

ids = [str(i) for i in range(len(dialogues))]
docs = dialogues
embs = [e.embedding for e in embeddings]

# Chroma has a max batch size; stay safely under it
BATCH_SIZE = 1000   # you can bump this up, as long as < 5461

for start in range(0, len(docs), BATCH_SIZE):
    end = start + BATCH_SIZE
    batch_ids = ids[start:end]
    batch_docs = docs[start:end]
    batch_embs = embs[start:end]

    collection.add(
        ids=batch_ids,
        documents=batch_docs,
        embeddings=batch_embs
    )

print(f"Inserted {len(docs)} dialog lines into Chroma in batches of {BATCH_SIZE}.")


Inserted 6502 dialog lines into Chroma in batches of 1000.


Let's try querying our vector database for lines similar to a query string.

First we need to compute the embedding vector for our query string, then we'll query the vector database for the top 10 matches based on the similarities encoded by their embedding vectors.

In [None]:
# Perform a search query using Chroma
queryText = 'Lal, my daughter'

# Create an embedding for the query text
response = client.embeddings.create(
    input=queryText,
    dimensions=embedding_dimensions,
    model=embedding_model
)
query_embedding = response.data[0].embedding

# Query Chroma for the 10 most similar dialog lines
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=10
)

print("Most similar lines to:", queryText)
print("--------------------------------------------------")
for line in results["documents"][0]:
    print(line)


Most similar lines to: Lal, my daughter
--------------------------------------------------
That is Lal, my daughter.
Lal...
What do you feel, Lal?
Yes, Lal. I am here.
Correct, Lal. We are a family.
No, Lal, this is a flower.
Lal, you used a verbal contraction.
Yes, Wesley. Lal is my child.
Lal's creation is entirely dependent on me. I am giving it knowledge and skills that are stored in my brain... its programming reflects mine in the same way a human child's genes reflect its parent's genes...
Perhaps. I created Lal because I wished to procreate. Despite what happened to her, I still have that wish.
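
Conceptually, a query like this ranks the stored vectors by their distance to the query vector and returns the closest ones. The brute-force sketch below illustrates that idea with made-up two-dimensional vectors. It is not how Chroma is implemented (Chroma uses an approximate nearest-neighbor index rather than scanning every vector), and `squared_l2` and `top_n` are hypothetical helpers.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def top_n(query_vec, docs, vecs, n):
    """Return the n documents whose vectors are closest to query_vec."""
    ranked = sorted(zip(docs, vecs), key=lambda pair: squared_l2(query_vec, pair[1]))
    return [doc for doc, _ in ranked[:n]]

docs = [
    "That is Lal, my daughter.",
    "The enemy vessel is firing.",
    "Yes, Wesley. Lal is my child.",
]
# Toy vectors (hypothetical values, not real embeddings)
vecs = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.3]]

closest = top_n([1.0, 0.0], docs, vecs, n=2)
print(closest)
```

With these toy values, the two lines about Lal rank ahead of the unrelated line, which is the same behavior we saw from the real query above.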


Let's put it all together! We'll write a generate_response function that:

- Computes an embedding for the query passed in
- Queries our vector database for the 10 lines most similar to that query (you could experiment with using more or fewer)
- Constructs a prompt that adds in these similar lines as context, to try to nudge ChatGPT in the right direction using our external data
- Feeds the augmented prompt into the chat completions API to get our response.

That's RAG!

In [None]:
def generate_response(question: str) -> str:
    """
    Generate a response in the voice of Lt. Cmdr. Data using RAG.

    Steps:
    - Embed the user's question
    - Query Chroma for similar dialog lines from Data
    - Use those lines as context for a chat completion
    """

    # Search for similar dialogues in the vector DB (Chroma)
    response = client.embeddings.create(
        input=question,
        dimensions=embedding_dimensions,
        model=embedding_model
    )
    query_embedding = response.data[0].embedding

    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=10
    )

    # Extract relevant context from search results
    context_lines = results["documents"][0]
    context = ""
    for line in context_lines:
        context += f"\"{line}\"\n"

    prompt = (
        f"Lt. Commander Data is asked: '{question}'. "
        f"How might he respond, given his previous responses similar to this topic, "
        f"listed here:\n{context}"
    )

    print("PROMPT with RAG:\n")
    print(prompt)
    print("\nRESPONSE:\n")

    # Use OpenAI API to generate a response based on the context
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are Lt. Cmdr. Data from Star Trek: The Next Generation."},
            {"role": "user", "content": prompt}
        ]
    )

    return completion.choices[0].message.content
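
The prompt-assembly step is plain string formatting, so it can be tried without calling any API. The sketch below mirrors the format used in `generate_response`; `build_prompt` is a hypothetical helper and not part of the original notebook.

```python
def build_prompt(question, context_lines):
    """Assemble the augmented prompt from a question and the retrieved lines."""
    # Each retrieved line is quoted and placed on its own line
    context = "".join(f'"{line}"\n' for line in context_lines)
    return (
        f"Lt. Commander Data is asked: '{question}'. "
        f"How might he respond, given his previous responses similar to this topic, "
        f"listed here:\n{context}"
    )

prompt = build_prompt(
    "Tell me about your daughter, Lal.",
    ["That is Lal, my daughter.", "Correct, Lal. We are a family."],
)
print(prompt)
```

Separating this step out makes it easier to experiment with different prompt wording while keeping the retrieval code unchanged.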


Let's try it out! Note that the final response does seem to be drawing from the model's own training, but it is building upon the specific lines we gave it, allowing us to have some control over its output.

In [None]:
print(generate_response("Tell me about your daughter, Lal."))

PROMPT with RAG:

Lt. Commander Data is asked: 'Tell me about your daughter, Lal.'. How might he respond, given his previous responses similar to this topic, listed here:
"That is Lal, my daughter."
"What do you feel, Lal?"
"Lal..."
"Correct, Lal. We are a family."
"Yes, Doctor. It is an experience I know too well. But I do not know how to help her. Lal is passing into sentience. It is perhaps the most difficult stage of development for her."
"Lal is realizing that she is not the same as the other children."
"Yes, Wesley. Lal is my child."
"That is precisely what happened to Lal at school. How did you help him?"
"This is Lal. Lal, say hello to Counselor Deanna Troi..."
"I am sorry I did not anticipate your objections, Captain. Do you wish me to deactivate Lal?"


RESPONSE:

"Lal is my daughter. She is currently undergoing a challenging stage of development as she transitions into sentience. I am doing my best to guide and support her through this process, but it is not without its diff