# Philosophy with Vector Embeddings, OpenAI and Astra DB

In this quickstart you will learn how to build a "philosophy quote finder & generator" using OpenAI's vector embeddings and DataStax [Astra DB](https://docs.datastax.com/en/astra/home/astra.html) as the vector store for data persistence.

The basic workflow of this notebook is outlined below. You will evaluate and store the vector embeddings for a number of quotes by famous philosophers, use them to build a powerful search engine and, after that, even a generator of new quotes!

The notebook exemplifies some of the standard usage patterns of vector search -- while showing how easy it is to get started with [Astra DB](https://docs.datastax.com/en/astra/home/astra.html).


For documentation on AstraPy (the Python client to use the Data API), [click here](https://docs.datastax.com/en/astra-db-serverless/api-reference/python-client.html).

Table of contents:
- Setup
- Create vector collection
- Connect to OpenAI
- Load quotes into the Vector Store
- Use case 1: **quote search engine**
- Use case 2: **quote generator**
- Cleanup

### How it works

**Indexing**

Each quote is made into an embedding vector with OpenAI's embeddings API. These vectors are saved in the Vector Store for later use in searching. Some metadata, including the author's name and a few other pre-computed tags, is stored alongside to allow for search customization.

![1_vector_indexing](https://raw.githubusercontent.com/datastaxdevs/mini-demo-astradb-astrapy-philosophy/main/images/philo1.png)

**Search**

To find a quote similar to the provided search quote, the latter is made into an embedding vector on the fly, and this vector is used to query the store for similar vectors ... i.e. similar quotes that were previously indexed. The search can optionally be constrained by additional metadata ("find me quotes by Spinoza similar to this one ...").

![2_vector_search](https://raw.githubusercontent.com/datastaxdevs/mini-demo-astradb-astrapy-philosophy/main/images/philo2.png)

The key point here is that "quotes similar in content" translates, in vector space, to vectors that are metrically close to each other: thus, vector similarity search effectively implements semantic similarity. _This is the key reason vector embeddings are so powerful._

The sketch below tries to convey this idea. Each quote, once it's made into a vector, is a point in space. Well, in this case it's on a sphere, since OpenAI's embedding vectors, like most others, are normalized to _unit length_. Oh, and the sphere is actually not three-dimensional, but rather 1536-dimensional!

So, in essence, a similarity search in vector space returns the vectors that are closest to the query vector:

![3_vector_space](https://raw.githubusercontent.com/datastaxdevs/mini-demo-astradb-astrapy-philosophy/main/images/philo3.png)
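
To make the geometry concrete, here is a minimal sketch of a brute-force similarity search over unit vectors. It uses made-up three-dimensional vectors in place of real 1536-dimensional embeddings, and the names `normalize`, `cosine_similarity`, `nearest` and `toy_store` are illustrative assumptions rather than part of the notebook or of any library. Astra DB runs the actual search server-side; this only shows the idea.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    """Scalar product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, store, n):
    """Return the n labels whose vectors are most similar to the query."""
    ranked = sorted(
        store,
        key=lambda label: cosine_similarity(query, store[label]),
        reverse=True,
    )
    return ranked[:n]

# Toy "embeddings" (hypothetical values, for illustration only):
toy_store = {
    "quote_a": normalize([1.0, 0.1, 0.0]),
    "quote_b": normalize([0.9, 0.3, 0.1]),
    "quote_c": normalize([-1.0, 0.0, 0.2]),
}
toy_query = normalize([1.0, 0.2, 0.0])

closest = nearest(toy_query, toy_store, 2)
print(closest)
```

Here `quote_c` points almost opposite to the query, so it is ranked last and left out of the top two.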

**Generation**

Given a suggestion (a topic or a tentative quote), the search step is performed, and the first returned results (quotes) are fed into an LLM prompt which asks the generative model to invent a new text along the lines of the passed examples _and_ the initial suggestion.

![4_quote_generation](https://raw.githubusercontent.com/datastaxdevs/mini-demo-astradb-astrapy-philosophy/main/images/philo4.png)

## Setup

Install and import the necessary dependencies:

In [1]:
!pip install --quiet \
    "astrapy>=2.0,<3.0" \
    "openai>=1.73,<2.0" \
    "datasets>=3.5,<4.0"

In [2]:
from getpass import getpass
from collections import Counter

from astrapy import DataAPIClient
from astrapy.info import CollectionDefinition

import openai
from datasets import load_dataset

### Connection parameters

Please retrieve your database credentials on your Astra dashboard ([info](https://docs.datastax.com/en/astra-db-serverless/get-started/quickstart.html#create-a-database-and-store-your-credentials)): you will supply them momentarily.

Example values:

- API Endpoint: `https://01234567-89ab-cdef-0123-456789abcdef-us-east1.apps.astra.datastax.com`
- Token: `AstraCS:6gBhNmsk135...`

In [4]:
ASTRA_DB_API_ENDPOINT = input("Please enter your API Endpoint:")
ASTRA_DB_APPLICATION_TOKEN = getpass("Please enter your Token")

_keyspace = input("Please enter your Astra DB keyspace (leave empty for default):")
ASTRA_DB_KEYSPACE = _keyspace if _keyspace else None  # None will signal 'use defaults' to astrapy

Please enter your API Endpoint:https://d6fe45eb-a768-4baa-bdac-59e100a42588-us-east-2.apps.astra.datastax.com
Please enter your Token··········
Please enter your Astra DB keyspace (leave empty for default):


### Instantiate an Astra DB client and database handle

In [5]:
astra_db_client = DataAPIClient()
astra_db = astra_db_client.get_database(
    ASTRA_DB_API_ENDPOINT,
    token=ASTRA_DB_APPLICATION_TOKEN,
    keyspace=ASTRA_DB_KEYSPACE,
)

## Create vector collection

The only parameter you need to specify when constructing the `CollectionDefinition` is the dimension of the vectors you'll store. Other parameters, such as the similarity metric to use for searches, are optional.

In [6]:
coll_name = "philosophers_astra_db"
collection = astra_db.create_collection(
    coll_name,
    definition=CollectionDefinition.builder().set_vector_dimension(1536).build(),
)

## Connect to OpenAI

### Set up your secret key

In [7]:
OPENAI_API_KEY = getpass("Please enter your OpenAI API Key: ")

Please enter your OpenAI API Key: ··········


### A test call for embeddings

Quickly check how one can get the embedding vectors for a list of input texts:

In [8]:
client = openai.OpenAI(api_key=OPENAI_API_KEY)
embedding_model_name = "text-embedding-3-small"

result = client.embeddings.create(
    input=[
        "This is a sentence",
        "A second sentence"
    ],
    model=embedding_model_name,
)

_Note: the above is the syntax for OpenAI v1.0+. If using previous versions, the code to get the embeddings will look different._

In [9]:
print(f"len(result.data)              = {len(result.data)}")
print(f"result.data[1].embedding      = {str(result.data[1].embedding)[:55]}...")
print(f"len(result.data[1].embedding) = {len(result.data[1].embedding)}")

len(result.data)              = 2
result.data[1].embedding      = [0.009456743486225605, 0.0014919530367478728, -0.036199...
len(result.data[1].embedding) = 1536


## Load quotes into the Vector Store

Get a dataset with the quotes. _(We adapted and augmented the data from [this Kaggle dataset](https://www.kaggle.com/datasets/mertbozkurt5/quotes-by-philosophers), ready to use in this demo.)_

In [10]:
philo_dataset = load_dataset("datastax/philosopher-quotes")["train"]

A quick inspection:

In [11]:
print("An example entry:")
print(philo_dataset[16])

An example entry:
{'author': 'aristotle', 'quote': 'Love well, be loved and do something of value.', 'tags': 'love;ethics'}


Check the dataset size:

In [12]:
author_count = Counter(entry["author"] for entry in philo_dataset)
print(f"Total: {len(philo_dataset)} quotes. By author:")
for author, count in author_count.most_common():
    print(f"    {author:<20}: {count} quotes")

Total: 450 quotes. By author:
    aristotle           : 50 quotes
    schopenhauer        : 50 quotes
    spinoza             : 50 quotes
    hegel               : 50 quotes
    freud               : 50 quotes
    nietzsche           : 50 quotes
    sartre              : 50 quotes
    plato               : 50 quotes
    kant                : 50 quotes


### Write to the vector collection

Now compute the embeddings for the quotes and save them into the Vector Store, along with the text itself and the metadata you'll use later.

To optimize speed and reduce the number of calls, you'll send the quotes to the OpenAI embedding service in batches.

To store the quote objects, you will use the `insert_many` method with the full list of documents to insert.

You can name the documents' fields as you prefer -- except for the embedding vectors, which need to be stored in the special `"$vector"` field.

In [13]:
OPENAI_BATCH_SIZE = 80

num_batches = ((len(philo_dataset) + OPENAI_BATCH_SIZE - 1) // OPENAI_BATCH_SIZE)

quotes_list = philo_dataset["quote"]
authors_list = philo_dataset["author"]
tags_list = philo_dataset["tags"]

full_documents_to_insert = []

print("Computing embeddings: ", end="")
for batch_i in range(num_batches):
    b_start = batch_i * OPENAI_BATCH_SIZE
    b_end = (batch_i + 1) * OPENAI_BATCH_SIZE
    # compute the embedding vectors for this batch:
    b_emb_results = client.embeddings.create(
        input=quotes_list[b_start : b_end],
        model=embedding_model_name,
    )
    # prepare documents for insertion (attach the embedding to the other quote info):
    b_docs = []
    for entry_idx, emb_result in zip(range(b_start, b_end), b_emb_results.data):
        if tags_list[entry_idx]:
            tags = {
                tag: True
                for tag in tags_list[entry_idx].split(";")
            }
        else:
            tags = {}
        b_docs.append({
            "quote": quotes_list[entry_idx],
            "$vector": emb_result.embedding,
            "author": authors_list[entry_idx],
            "tags": tags,
        })
    # append to the full document list:
    full_documents_to_insert += b_docs
    print(f"[{len(b_docs)}]", end="")

print(f"\nInserting {len(full_documents_to_insert)} documents to Astra DB ...")
insertion_result = collection.insert_many(full_documents_to_insert)

print(f"Finished storing {len(insertion_result.inserted_ids)} entries.")

Computing embeddings: [80][80][80][80][80][50]
Inserting 450 documents to Astra DB ...
Finished storing 450 entries.
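
The batching arithmetic in the cell above can be checked in isolation. The sketch below uses hypothetical helper names (`batch_bounds`, `tags_to_dict`) and does not call OpenAI or Astra DB; it only reproduces the ceiling-division slicing and the conversion of the semicolon-separated tag string into a dictionary.

```python
def batch_bounds(total, batch_size):
    """Yield (start, end) index pairs covering range(total) in chunks of batch_size."""
    num_batches = (total + batch_size - 1) // batch_size  # ceiling division
    for batch_i in range(num_batches):
        start = batch_i * batch_size
        end = min(start + batch_size, total)  # clamp the last, shorter batch
        yield start, end

def tags_to_dict(tags_string):
    """Turn 'love;ethics' into {'love': True, 'ethics': True}; empty input gives {}."""
    if not tags_string:
        return {}
    return {tag: True for tag in tags_string.split(";")}

batch_sizes = [end - start for start, end in batch_bounds(450, 80)]
print(batch_sizes)
print(tags_to_dict("love;ethics"))
```

For 450 quotes and a batch size of 80, the batch sizes agree with the counts logged above. The notebook cell does not clamp `b_end` explicitly; it relies on list slicing and `zip` to stop at the end of the data, which gives the same result.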


## Use case 1: **quote search engine**

For the quote-search functionality, you first need to turn the input quote into a vector and then use it to query the store (while also passing any optional metadata constraints to the search call).

Encapsulate the search-engine functionality into a function for ease of re-use. At its core is the `find` method of the collection:

In [18]:
def find_quote_and_author(query_quote, n, author=None, tags=None):
    query_vector = client.embeddings.create(
        input=[query_quote],
        model=embedding_model_name,
    ).data[0].embedding
    filter_clause = {}
    if author:
        filter_clause["author"] = author
    if tags:
        filter_clause["tags"] = {}
        for tag in tags:
            filter_clause["tags"][tag] = True
    #
    results = collection.find(
        filter_clause,
        sort={"$vector": query_vector},
        limit=n,
        projection={"quote": True, "author": True, "_id": False},
    )
    return results.to_list()
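
To see which filter the function builds for a given combination of arguments, the filter-building step can be pulled out into a standalone helper. `build_filter` is a hypothetical name used here for illustration; its logic mirrors the corresponding lines of `find_quote_and_author`.

```python
def build_filter(author=None, tags=None):
    """Assemble the metadata filter the same way find_quote_and_author does."""
    filter_clause = {}
    if author:
        filter_clause["author"] = author
    if tags:
        # each requested tag must be present (stored as True) on the document
        filter_clause["tags"] = {tag: True for tag in tags}
    return filter_clause

no_filter = build_filter()
author_filter = build_filter(author="nietzsche")
combined_filter = build_filter(author="spinoza", tags=["politics", "ethics"])

print(no_filter)
print(author_filter)
print(combined_filter)
```

With no arguments the filter is an empty dictionary, so the search is constrained only by vector similarity.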

### Putting search to the test

Passing just a quote:

In [15]:
find_quote_and_author("We struggle all our life for nothing", 3)

[{'quote': 'Life to the great majority is only a constant struggle for mere existence, with the certainty of losing it at last.',
  'author': 'schopenhauer'},
 {'quote': 'To endure life remains, when all is said, the first duty of all living being Illusion can have no value if it makes this more difficult for us.',
  'author': 'freud'},
 {'quote': 'To live is to suffer, to survive is to find some meaning in the suffering.',
  'author': 'nietzsche'}]

Search restricted to an author:

In [16]:
find_quote_and_author("We struggle all our life for nothing", 2, author="nietzsche")

[{'quote': 'To live is to suffer, to survive is to find some meaning in the suffering.',
  'author': 'nietzsche'},
 {'quote': 'What makes us heroic?--Confronting simultaneously our supreme suffering and our supreme hope.',
  'author': 'nietzsche'}]

Search constrained to a tag (out of those saved earlier with the quotes):

In [17]:
find_quote_and_author("We struggle all our life for nothing", 2, tags=["politics"])

[{'quote': 'He who seeks equality between unequals seeks an absurdity.',
  'author': 'spinoza'},
 {'quote': 'One... gets an impression that civilization is something which was imposed on a resisting majority by a minority which understood how to obtain possession of the means to power and coercion. It is, of course, natural to assume that these difficulties are not inherent in the nature of civilization itself but are determined by the imperfections of the cultural forms which have so far been developed.',
  'author': 'freud'}]

### Cutting out irrelevant results

The vector similarity search generally returns the vectors that are closest to the query, even if that means results that might be somewhat irrelevant if there's nothing better.

To keep this issue under control, you can get the actual "similarity" between the query and each result, and then implement a cutoff on it, effectively discarding results that are beyond that threshold.
Tuning this threshold correctly is not an easy problem: here, we'll just show you the way.

To get a feel for how this works, try the following query and play with the choice of quote and threshold to compare the results. Note that the similarity is returned as the special `$similarity` field in each result document, and only if you pass `include_similarity=True` to the search method.

_Note (for the mathematically inclined): this value is **a rescaling between zero and one** of the cosine similarity between the vectors, i.e. of the scalar product divided by the product of the norms of the two vectors. In other words, this is 0 for opposite-facing vectors and +1 for parallel vectors. For other measures of similarity (cosine is the default), check the similarity metric setting of the `CollectionDefinition` passed to `create_collection` and the [documentation on allowed values](https://docs.datastax.com/en/astra-serverless/docs/develop/dev-with-json.html#metric-types)._
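
As a rough illustration of the rescaling and of the cutoff, the sketch below assumes the mapping is the linear one, `(1 + cosine) / 2`, which is consistent with the end points stated in the note (0 for opposite vectors, 1 for parallel ones) but is an assumption here, not something this notebook verifies. The function names and the sample result documents are hypothetical.

```python
import math

def cosine(a, b):
    """Scalar product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rescaled_similarity(a, b):
    """Assumed linear rescaling of the cosine from [-1, 1] to [0, 1]."""
    return (1.0 + cosine(a, b)) / 2.0

def apply_cutoff(results, threshold):
    """Keep only the results whose similarity reaches the threshold."""
    return [res for res in results if res["$similarity"] >= threshold]

parallel = rescaled_similarity([1.0, 0.0], [1.0, 0.0])
opposite = rescaled_similarity([1.0, 0.0], [-1.0, 0.0])
orthogonal = rescaled_similarity([1.0, 0.0], [0.0, 1.0])
print(parallel, orthogonal, opposite)

# Made-up result documents, shaped like those returned with include_similarity=True:
sample_results = [
    {"quote": "first", "$similarity": 0.746},
    {"quote": "second", "$similarity": 0.682},
    {"quote": "third", "$similarity": 0.410},
]
kept = apply_cutoff(sample_results, threshold=0.7)
print([res["quote"] for res in kept])
```

Under this assumption two unrelated (orthogonal) vectors score 0.5, which is why a useful threshold tends to sit well above zero.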

In [22]:
quote = "Animals are our equals."
# quote = "Be good."
# quote = "This teapot is strange."

metric_threshold = 0.52

quote_vector = client.embeddings.create(
    input=[quote],
    model=embedding_model_name,
).data[0].embedding

results_full = collection.find(
    sort={"$vector": quote_vector},
    limit=8,
    projection={"quote": True, "_id": False},
    include_similarity=True,
)
results = [res for res in results_full if res["$similarity"] >= metric_threshold]

print(f"{len(results)} quotes within the threshold:")
for idx, result in enumerate(results):
    print(f"    {idx}. [similarity={result['$similarity']:.3f}] \"{result['quote'][:70]}...\"")

8 quotes within the threshold:
    0. [similarity=0.746] "The assumption that animals are without rights, and the illusion that ..."
    1. [similarity=0.728] "Man is the only animal that must be encouraged to live...."
    2. [similarity=0.727] "Animals are in possession of themselves; their soul is in possession o..."
    3. [similarity=0.725] "At his best, man is the noblest of all animals; separated from law and..."
    4. [similarity=0.718] "Because Christian morality leaves animals out of account, they are at ..."
    5. [similarity=0.715] ".... we are a part of nature as a whole, whose order we follow...."
    6. [similarity=0.694] "Better to have beasts that let themselves be killed than men who run a..."
    7. [similarity=0.682] "Blessed are the weak who think that they are good because they have no..."


## Use case 2: **quote generator**

For this task you need another component from OpenAI, namely an LLM that generates the quote (based on input obtained by querying the Vector Store).

You also need a template for the prompt that will be filled for the generate-quote LLM completion task.

In [24]:
completion_model_name = "gpt-4.1-mini"

generation_prompt_template = """Generate a single short philosophical quote on the given topic,
similar in spirit and form to the provided actual example quotes.
Do not exceed 20-30 words in your quote.

REFERENCE TOPIC: "{topic}"

ACTUAL EXAMPLES:
{examples}
"""

As with search, this functionality is best wrapped into a handy function (which internally uses search):

In [25]:
def generate_quote(topic, n=2, author=None, tags=None):
    hits = find_quote_and_author(query_quote=topic, n=n, author=author, tags=tags)
    if hits:
        prompt = generation_prompt_template.format(
            topic=topic,
            examples="\n".join(f"  - {document['quote']}" for document in hits),
        )
        # a little logging:
        print("** quotes found:")
        for document in hits:
            print(f"**    - {document['quote']} ({document['author']})")
        print("** end of logging")
        #
        response = client.chat.completions.create(
            model=completion_model_name,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,
            max_tokens=320,
        )
        return response.choices[0].message.content.replace('"', '').strip()
    else:
        print("** no quotes found.")
        return None

_Note: similar to the case of the embedding computation, the code for the Chat Completion API would be slightly different for OpenAI prior to v1.0._
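
To inspect the prompt that reaches the model without spending any API calls, the template-filling step can be run on its own. The sketch below uses a local copy of the template and two quotes from the dataset as stand-in hits (they are not the output of an actual search); `render_prompt` and `sample_hits` are hypothetical names.

```python
prompt_template = """Generate a single short philosophical quote on the given topic,
similar in spirit and form to the provided actual example quotes.
Do not exceed 20-30 words in your quote.

REFERENCE TOPIC: "{topic}"

ACTUAL EXAMPLES:
{examples}
"""

# Stand-in search results, shaped like the documents returned by find_quote_and_author:
sample_hits = [
    {"quote": "Happiness is the reward of virtue.", "author": "aristotle"},
    {"quote": "Love well, be loved and do something of value.", "author": "aristotle"},
]

def render_prompt(topic, hits):
    """Fill the template with the topic and one bullet line per example quote."""
    examples = "\n".join(f"  - {document['quote']}" for document in hits)
    return prompt_template.format(topic=topic, examples=examples)

prompt = render_prompt("politics and virtue", sample_hits)
print(prompt)
```

Printing the rendered prompt is a quick way to check that the examples appear one per line before sending it to the completion model.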

### Putting quote generation to the test

Just passing a text (a "quote", but one can actually just suggest a topic since its vector embedding will still end up at the right place in the vector space):

In [26]:
q_topic = generate_quote("politics and virtue")
print("\nA new generated quote:")
print(q_topic)

** quotes found:
**    - Our moral virtues benefit mainly other people; intellectual virtues, on the other hand, benefit primarily ourselves; therefore the former make us universally popular, the latter unpopular. (schopenhauer)
**    - Happiness is the reward of virtue. (aristotle)
** end of logging

A new generated quote:
- True politics should cultivate virtue, not just power.


Use inspiration from just a single philosopher:

In [27]:
q_topic = generate_quote("animals", author="schopenhauer")
print("\nA new generated quote:")
print(q_topic)

** quotes found:
**    - The assumption that animals are without rights, and the illusion that our treatment of them has no moral significance, is a positively outrageous example of Western crudity and barbarity. Universal compassion is the only guarantee of morality. (schopenhauer)
**    - Because Christian morality leaves animals out of account, they are at once outlawed in philosophical morals; they are mere 'things,' mere means to any ends whatsoever. They can therefore be used for vivisection, hunting, coursing, bullfights, and horse racing, and can be whipped to death as they struggle along with heavy carts of stone. Shame on such a morality that is worthy of pariahs, and that fails to recognize the eternal essence that exists in every living thing, and shines forth with inscrutable significance from all eyes that see the sun! (schopenhauer)
** end of logging

A new generated quote:
The way we treat animals reflects our morality; compassion towards all beings is the true measure 

## Cleanup

If you want to remove all resources used for this demo, run this cell (_warning: this will irreversibly delete the collection and its data!_):

In [None]:
collection.drop()