# Overview

This tutorial shows you how to use [DataStax Astra DB](https://docs.datastax.com/en/astra-serverless/docs/index.html), a fast and highly scalable vector database, to build a food recommendation system based on the similarity of historical food review comments.

## Dataset

The food review dataset used in this tutorial is a subset of the Amazon Fine Food Review dataset that has been preprocessed by OpenAI. The subset can be found at this [OpenAI GitHub link](https://github.com/openai/openai-cookbook/blob/main/examples/data/fine_food_reviews_1k.csv) and the full dataset can be found at this [Kaggle link](https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews).

## Text Embedding Model

The text embedding model used in this tutorial is Google's [Vertex AI Gecko model](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings), ***textembedding-gecko***, which has an output dimension size of ***768***.

With a little bit of tweaking, you can easily make this tutorial compatible with other text embedding models, such as OpenAI's ***text-embedding-ada-002*** model ([link](https://platform.openai.com/docs/guides/embeddings/what-are-embeddings)), which has an output dimension size of ***1536***, or Hugging Face's [MiniLM-L6-V2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2), which has an output dimension size of ***384***.
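
The table created later in this notebook declares its embedding column as `VECTOR<FLOAT, N>`, so switching models mostly means changing `N`. The following is a minimal sketch of that idea. The lookup table `EMBEDDING_DIMENSIONS` and the helper `vector_column_type` are hypothetical names used only for illustration; the dimensions are the ones quoted above.

```python
# Hypothetical lookup table; the dimensions are the ones quoted in this section.
EMBEDDING_DIMENSIONS = {
    "textembedding-gecko": 768,
    "text-embedding-ada-002": 1536,
    "all-MiniLM-L6-v2": 384,
}


def vector_column_type(model_name: str) -> str:
    """Return the CQL vector column type matching the model's output dimension."""
    dimension = EMBEDDING_DIMENSIONS[model_name]
    return f"VECTOR<FLOAT, {dimension}>"


print(vector_column_type("textembedding-gecko"))
```

Keeping the dimension in one place avoids a mismatch between the embeddings you generate and the column that stores them.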

## Environment Setup

To run this tutorial, you first need to meet the following prerequisites:
1) Create a free-tier Astra account and create an Astra Vector Database. (Link: https://astra.datastax.com)
2) Create a GCP service account with the necessary permissions, including access to Vertex AI.
3) Optionally (but highly recommended), install the [gcloud CLI](https://cloud.google.com/sdk/docs/install) and the [Astra CLI](https://awesome-astra.github.io/docs/pages/astra/astra-cli/#1-installation) to make it easier to configure the connection to the remote Astra Vector database and the Google Vertex AI service.

The procedure in the notebook below has been tested on a locally installed Jupyter notebook. If you want to run it on a cloud-based Jupyter notebook service, you may need to make some minor changes to the code, such as adding code to upload the required files to the cloud Jupyter notebook instance. The main workflow stays the same: get the embedding values for the food reviews, write them to the database, and make recommendations by executing similarity-based vector searches.
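
To make the workflow concrete, here is a minimal in-memory sketch of the recommendation idea: rank stored reviews by the cosine similarity between their embedding and a query embedding. It is only an illustration under stated assumptions. The vectors are made-up 3-dimensional toys rather than real Gecko embeddings, the search is brute force rather than the indexed search the database performs, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def recommend(query_vec, reviews, top_k=2):
    """Return the top_k (summary, similarity) pairs closest to query_vec."""
    scored = [(summary, cosine_similarity(query_vec, vec)) for summary, vec in reviews]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy 3-dimensional "embeddings"; real Gecko embeddings have 768 dimensions.
toy_reviews = [
    ("Great burger buns", [0.9, 0.1, 0.0]),
    ("Bitter coffee", [0.0, 0.2, 0.9]),
    ("Tasty hamburger patties", [0.8, 0.3, 0.1]),
]

for summary, score in recommend([1.0, 0.2, 0.0], toy_reviews):
    print(f"{score:.3f}  {summary}")
```

In the notebook itself, the ranking step is delegated to the database through an `ORDER BY ... ANN OF` query instead of a Python loop.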


### Configure Astra Vector database connection

Assuming you have `Astra CLI` installed and an Astra Vector database (as well as a keyspace) has already been created from the Astra Console:

1) Run the following command to set up the Astra CLI environment. Once you have run this command, you will be able to run any other Astra CLI commands.
```
$ astra setup --token <AstraCS_token>
```
*NOTE*: The specified token value must start with *AstraCS:....*. You can get an AstraCS token by following the instructions in this [doc](https://docs.datastax.com/en/astra-serverless/docs/manage/org/manage-tokens.html#_create_application_token). Make sure that you have the *Organization Administrator* role to avoid any permission limitations later on.

2) Run the following command to create a dotenv environment for the database. This environment is stored in a hidden file called `.env`, which contains all the environment variables needed to connect to the database.

```
$ astra db create-dotenv <database_name> -k <keyspace_name>
```
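
Before connecting, it can help to confirm that the settings the notebook reads are actually present. The sketch below checks the three variable names that the notebook's connection helpers look up. The helper `missing_astra_vars` is a hypothetical name, and the check assumes the `.env` file has already been loaded into the process environment (for example with `load_dotenv()`).

```python
import os

# Variable names read by the notebook's connection helpers.
REQUIRED_ASTRA_VARS = (
    "ASTRA_DB_SECURE_BUNDLE_PATH",
    "ASTRA_DB_APPLICATION_TOKEN",
    "ASTRA_DB_KEYSPACE",
)


def missing_astra_vars(env=None):
    """Return the required variable names that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ASTRA_VARS if not env.get(name)]


missing = missing_astra_vars()
if missing:
    print("Missing settings:", ", ".join(missing))
else:
    print("All Astra DB settings are present.")
```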

### Verify connection to the Google Vertex AI service

Assuming you have `gcloud CLI` installed and configured (see this [Google doc](https://cloud.google.com/sdk/docs/initializing)), run the following script to verify the connection to the Google Vertex AI service. If the connection is good, you should get a JSON output that includes the text embedding value for the input text.
```
MODEL_ID="textembedding-gecko"
PROJECT_ID=<your_project>
REGION_NAME=<your_region>

curl \
-X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
"https://${REGION_NAME}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${REGION_NAME}/publishers/google/models/${MODEL_ID}:predict" -d \
$'{
  "instances": [
    { "content": "What is life?"}
  ]
}'

```
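
For readers who prefer Python, the same request can be assembled as follows. This sketch only builds the URL and JSON body and does not send anything. The helper `build_predict_request` and the project ID `my-project` are hypothetical placeholders, and the URL follows the regional endpoint pattern used in the script above.

```python
import json


def build_predict_request(project_id, region, model_id, text):
    """Return the (url, body) pair for a text-embedding predict call."""
    url = (
        f"https://{region}-aiplatform.googleapis.com/v1/projects/{project_id}"
        f"/locations/{region}/publishers/google/models/{model_id}:predict"
    )
    # json.dumps always produces valid JSON, so no stray commas can slip in.
    body = json.dumps({"instances": [{"content": text}]})
    return url, body


url, body = build_predict_request(
    "my-project", "us-central1", "textembedding-gecko", "What is life?"
)
print(url)
print(body)
```

Sending the request still requires an access token, such as the one printed by `gcloud auth print-access-token`, in the `Authorization` header.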

### Local Jupyter instance

Put the food review dataset in a subfolder `data` in the current folder, as below:

```
$ tree .
.
├── data
│   └── fine_food_reviews_1k.csv
└── food_review_vector_with_astra_db.ipynb
```

Start the local Jupyter instance from the current folder with the following command:
```
jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10
```


In [None]:
!pip install python-dotenv tiktoken google-cloud-aiplatform cassandra-driver

In [None]:
import os
import math
import datetime
import pandas as pd
import numpy as np
import time

from dotenv import load_dotenv
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
from cassandra.concurrent import execute_concurrent_with_args

from vertexai.language_models import TextEmbeddingModel

##
# DO NOT change these values !
#
# - Concurrent batch prediction jobs (Quota limit is 5) (see: https://cloud.google.com/vertex-ai/docs/quotas)
MAX_ALLOWED_CONCURRENT_PREDICTION_JOB_NUM = 5
# - Google Gecko text embedding dimension is 768
VERTEX_AI_EMBEDDING_DIMENSION=768


##
# Access the environment variables from the .env file for the corresponding Astra DB
# NOTE: the ".env" file is prepared using the following Astra CLI command
#       astra db create-dotenv <astradb_name> -k <keyspace_name>
#
load_dotenv()

##
# Helper function to connect to Astra DB CQL session
#
def getAdbCqlSession(keyspace):
    sec_bundle_file = os.environ.get('ASTRA_DB_SECURE_BUNDLE_PATH')
    access_token = os.environ.get('ASTRA_DB_APPLICATION_TOKEN')
    
    cluster = Cluster(
        cloud={
            "secure_connect_bundle": sec_bundle_file,
        },
        auth_provider=PlainTextAuthProvider(
            "token",
            access_token,
        )
    )
    
    cqlSession = cluster.connect()
    # Only switch keyspace when one is configured (also handles None)
    if keyspace:
        cqlSession.set_keyspace(keyspace)
    
    return cqlSession
    
def getCQLKeyspace():
    return os.environ.get('ASTRA_DB_KEYSPACE')

In [None]:
##
# Load the food review data
#
food_review_df = pd.read_csv('./data/fine_food_reviews_1k.csv')

##
# Add an empty column to store the embedding value for the combination of 'Summary' and 'Text' columns
#
food_review_df['Embedding'] = None

##
# Convert the 'Time' value (Unix epoch seconds) to a datetime object.
# Otherwise, the insert statement later will fail with an incompatible type
#
food_review_df['Time'] = food_review_df['Time'].apply(lambda x: datetime.datetime.fromtimestamp(x))

print(food_review_df.info())
food_review_df.head(n=5)

total_row = food_review_df.shape[0]
total_batch = math.ceil(total_row/MAX_ALLOWED_CONCURRENT_PREDICTION_JOB_NUM)
print(f"total_row={total_row}, total_batch={total_batch}")

##
# Helper function to get the starting row index of a batch
#
def get_batch_start_rowidx(batchidx):
    assert 0 <= batchidx < total_batch
    return batchidx * MAX_ALLOWED_CONCURRENT_PREDICTION_JOB_NUM

##
# Helper function to get the ending row index (inclusive) of a batch
#
def get_batch_end_rowidx(batchidx):
    assert 0 <= batchidx < total_batch
    return min((batchidx+1) * MAX_ALLOWED_CONCURRENT_PREDICTION_JOB_NUM - 1, total_row - 1)
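
The batch arithmetic above can be checked in isolation. The following sketch reproduces the same start and end calculation as a standalone generator; `batch_ranges` is a hypothetical name, and the row count and batch size are toy values rather than the notebook's.

```python
import math


def batch_ranges(total_rows, batch_size):
    """Yield inclusive (start, end) row indexes for each batch."""
    for batch_idx in range(math.ceil(total_rows / batch_size)):
        start = batch_idx * batch_size
        # The last batch may be shorter than batch_size.
        end = min(start + batch_size - 1, total_rows - 1)
        yield start, end


print(list(batch_ranges(12, 5)))
```

With 12 rows and a batch size of 5, the last batch covers only rows 10 and 11, which is why the end index is capped at `total_rows - 1`.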

In [None]:
##
# Helper function to get the text embedding vector using Google Vertex AI API
#    "textembedding-gecko@001" embedding dimension is fixed at 768
#
def vertex_ai_text_embeddings(text_arr) -> tuple:
    start_time = time.time()
    
    embedding_model_name = "textembedding-gecko@001"
    model = TextEmbeddingModel.from_pretrained(embedding_model_name)
    embeddings = model.get_embeddings(text_arr)
    
    end_time = time.time()
    
    # Return the elapsed time (in seconds) together with the embeddings
    return (end_time - start_time), embeddings

##
# Get the food review embeddings in batches using the Vertex AI API and update the dataframe accordingly
#
review_text_arr_by_batch = []
fetch_embedding_batch_duration = []

for batchidx in range(0, total_batch):    
    start_row_idx = get_batch_start_rowidx(batchidx)
    end_row_idx = get_batch_end_rowidx(batchidx)
    
    print(f"Get embeddings of the food reviews for batch {batchidx+1} [{start_row_idx}, {end_row_idx}] ...")

    review_text_arr_by_batch.clear()
    for rowidx in range(start_row_idx, end_row_idx+1):
        review_text_arr_by_batch.append(
            food_review_df.iloc[rowidx]['Summary'] + " : " + food_review_df.iloc[rowidx]['Text'])
    
    # Call the Vertex embedding API in batch
    batch_duration,embedding_list = vertex_ai_text_embeddings(review_text_arr_by_batch)
    fetch_embedding_batch_duration.append(batch_duration)

    # For each batch, store the returned embedding in the 'Embedding' cell
    #   of the matching row in the data frame
    for rowidx in range(start_row_idx, end_row_idx+1):
        food_review_df.at[rowidx, 'Embedding'] = embedding_list[rowidx-start_row_idx].values

print(f"""\nVertex Embedding API call duration statistics (per batch): 
   min={np.min(fetch_embedding_batch_duration)}s, 
   max={np.max(fetch_embedding_batch_duration)}s, 
   avg={np.mean(fetch_embedding_batch_duration)}s
""")
food_review_df.head()

In [None]:
adb_keyspace = getCQLKeyspace()
adb_cql_session = getAdbCqlSession(adb_keyspace)

## 
# CQL statement to create the 'food_review' C* table
#
cql_schema_stmt=f"""CREATE TABLE IF NOT EXISTS {adb_keyspace}.food_review (
  id INT PRIMARY KEY,
  time TIMESTAMP,
  product_id TEXT,
  user_id TEXT,
  score INT,
  summary TEXT,
  text TEXT,
  embedding VECTOR<FLOAT, {VERTEX_AI_EMBEDDING_DIMENSION}>
);"""
print(f"cql_schema_stmt={cql_schema_stmt}")

adb_cql_session.execute(cql_schema_stmt)

In [None]:
##
# CQL prepared statement for inserting one record into the 'food_review' table
#
review_insert_stmt = adb_cql_session.prepare(f"""
    INSERT INTO {adb_keyspace}.food_review(id, time, product_id, user_id, score, summary, text, embedding) 
    VALUES(?,?,?,?,?,?,?,?)
    """
)

##
# Helper function to insert the food reviews with embedding values for a batch
# - batchidx: the batch index
# - concurrent: whether to use cassandra.concurrent library
# 
def insert_with_embedding_batch(batchidx, concurrent=False):
    assert 0 <= batchidx < total_batch
    
    start_time = time.time()
    
    start_row_idx = get_batch_start_rowidx(batchidx)
    end_row_idx = get_batch_end_rowidx(batchidx)
    
    if not concurrent:
        # Insert the rows of the batch one at a time
        for rowidx in range(start_row_idx, end_row_idx+1):
            adb_cql_session.execute(review_insert_stmt, food_review_df.iloc[rowidx])
    else:
        # Collect the rows of the batch and insert them concurrently
        parameters = []
        for rowidx in range(start_row_idx, end_row_idx+1):
            parameters.append(food_review_df.iloc[rowidx])
        execute_concurrent_with_args(adb_cql_session,
                                     review_insert_stmt, 
                                     parameters,
                                     concurrency=MAX_ALLOWED_CONCURRENT_PREDICTION_JOB_NUM)
        
    end_time = time.time()
    
    return (end_time - start_time)

##
# Insert the food reviews with embeddings in batch
#
adb_insert_duration = []

# Concurrent insert is much faster 
concurrent_insert = True

for batchidx in range(0, total_batch):
    start_row_idx = get_batch_start_rowidx(batchidx)
    end_row_idx = get_batch_end_rowidx(batchidx)
    
    print(f"Insert food review with the embedding value for batch {batchidx+1} [{start_row_idx}, {end_row_idx}] ...")
    batch_duration = insert_with_embedding_batch(batchidx, concurrent_insert)    
    adb_insert_duration.append(batch_duration)
    
print(f"""\nAstra DB C* table batch insert duration statistics with concurrent insert ({concurrent_insert}): 
   min={np.min(adb_insert_duration)}s, 
   max={np.max(adb_insert_duration)}s, 
   avg={np.mean(adb_insert_duration)}s
""") 

In [None]:
##
# Helper function to query the food reviews using vector search and/or other regular searches
# - stmt: the CQL statement to execute
# - top: print the first N records (default is not to print any query results)
#
def food_review_query(stmt, top=None):
    start_time = time.time()
    
    results = adb_cql_session.execute(stmt)

    cnt=0
    for result in results:
        cnt += 1
        if top and cnt <= top:
            print(result)
            
    end_time = time.time()
        
    return (end_time - start_time), cnt

##
# Define a Storage-Attached Index (SAI) on the regular column 'product_id' 
#
product_index_creation_stmt=f"""CREATE CUSTOM INDEX IF NOT EXISTS food_review_product_index
    ON {adb_keyspace}.food_review(product_id) USING 'StorageAttachedIndex';
"""
adb_cql_session.execute(product_index_creation_stmt)

##
# Define a Storage-Attached Index (SAI) on the vector column 'embedding' 
# - supported similarity functions (COSINE is the default):
#   * COSINE
#   * DOT_PRODUCT
#   * EUCLIDEAN
ANN_INDEX_MODE = "'COSINE'"
ann_index_creation_stmt=f"""CREATE CUSTOM INDEX IF NOT EXISTS food_review_ann_index
    ON {adb_keyspace}.food_review(embedding) USING 'StorageAttachedIndex'
    WITH OPTIONS = {{ 'similarity_function': {ANN_INDEX_MODE} }};
"""
adb_cql_session.execute(ann_index_creation_stmt)
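
The three similarity options differ in how they treat vector length. The sketch below computes the textbook definitions on toy 2-dimensional vectors to show the difference. It is an illustration only: the function names are hypothetical, and the scores the database reports may be scaled differently from these raw values.

```python
import math


def dot_product(a, b):
    """Sum of element-wise products; grows with vector length."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Dot product of the normalized vectors; ignores vector length."""
    return dot_product(a, b) / (
        math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    )


def euclidean_distance(a, b):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


query = [1.0, 0.0]
longer = [3.0, 0.0]  # same direction as query, three times the length

print("cosine:", cosine(query, longer))
print("dot product:", dot_product(query, longer))
print("euclidean distance:", euclidean_distance(query, longer))
```

Cosine treats the two vectors as identical because they point in the same direction, while the dot product and the Euclidean distance both change with the length of the vectors.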

In [None]:
duration,query_embeddings = vertex_ai_text_embeddings(["hamburger is good"])

In [None]:
vector_query_limit = 100

# Pure vector search 
query_duration,cnt = food_review_query(f"""
    SELECT 
        id, 
        time, 
        product_id, 
        user_id, score, 
        summary, 
        text
    FROM {adb_keyspace}.food_review 
    ORDER BY embedding ANN OF {query_embeddings[0].values}
    LIMIT {vector_query_limit};
    """,
top=2)

print(f"""\nPure vector search with maximum {vector_query_limit} records: 
   result_cnt={cnt}, 
   query_duration={query_duration}s
""") 

In [None]:
# Both vector search and regular search
query_duration,cnt = food_review_query(f"""
    SELECT 
        id, 
        time, 
        product_id, 
        user_id, score, 
        summary, 
        text
    FROM {adb_keyspace}.food_review 
    WHERE product_id in ('B0006UFY46', 'B0077HIJYS')
    ORDER BY embedding ANN OF {query_embeddings[0].values}
    LIMIT {vector_query_limit};
    """,
top=2)

print(f"""\nVector search combined with a product_id filter, with maximum {vector_query_limit} records: 
   result_cnt={cnt}, 
   query_duration={query_duration}s
""") 