## Text Embeddings with Pgvector
##### Using SageMaker JumpStart Foundation Models and Amazon RDS with the Pgvector extension
---


In this lab we'll set up a Pgvector database table in an RDS Postgres instance, vectorize an SEC filing document, store the resulting embeddings in the database, and then experiment with querying them.

---
##### Lab Agenda:
1. Set up dependencies
2. Set up the database for Pgvector
3. Set up the Text Embeddings model from a SageMaker endpoint
4. Create Text Embeddings
5. Vectorize an SEC filing document and store in Pgvector
6. Test the database with some querying

---
### 1. Set up dependencies
Check the Python version and import the environment settings.


In [None]:
# check the python version. should be 3.10.x
!python -V

In [None]:
# uncomment if you're running this on a new instance or kernel from lab 1 
# !pip install --upgrade pip --quiet
# !pip install --upgrade psycopg2 --quiet
# !pip install --upgrade pgvector --quiet

!pip install --upgrade tiktoken --quiet
!pip install --upgrade langchain --quiet
!pip install --upgrade sagemaker --quiet
!pip install --upgrade beautifulsoup4 --quiet

Imports and Settings

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import sys, json
sys.path.append("libs")

from sagemaker_embeddings_model import SagemakerEmbeddingsModel
from sagemaker.session import Session
from sagemaker.jumpstart.notebook_utils import list_jumpstart_models
from bs4 import BeautifulSoup

from pgclient import PgClient
from doc2vex import DocToVex
import sagemaker_utils


# TIKTOKEN_ENCODING = 'p50k_base'
TEXT_EMBEDDING_ENDPOINT = "### SET TEXT EMBEDDING ENDPOINT HERE ###"
EMBEDDING_VECTOR_SIZE = 384 # this depends on the model used to do the embeddings
REGION = "us-east-1"
DB_SETTINGS_FILE = "dbsettings.json"
TABLE_NAME = "embeddings"

---

### 2. Set up the database for Pgvector

Fetch the database settings and instantiate the DB client

In [None]:

# Load the connection settings from the JSON settings file
with open(DB_SETTINGS_FILE, 'r', encoding='utf-8') as f:
    dbsettings = json.load(f)

# Setup the database client instance and connect
db = PgClient(dbsettings)
db.connect() # hardcoded to 'postgres' database for this demo

Execute the setup statements to install the vector extension and create a new embeddings table

In [None]:
print("creating pgvector extension...")
db.execute("CREATE EXTENSION IF NOT EXISTS vector;")
db.register_vector()

print(f"creating {TABLE_NAME} table")
db.execute(f"DROP TABLE IF EXISTS {TABLE_NAME};")
create_statement = f"""CREATE TABLE IF NOT EXISTS {TABLE_NAME} (
                        id bigserial primary key, 
                        content text, 
                        source text, 
                        descriptions_embeddings vector({EMBEDDING_VECTOR_SIZE}));"""

db.execute(create_statement)
print(f"{TABLE_NAME} table created")
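
The statements above interpolate `TABLE_NAME` and `EMBEDDING_VECTOR_SIZE` directly into SQL text. One way to guard against a mistyped or unsafe value is to validate both before building the statement. The following is a minimal sketch of that idea; `build_create_table` is a hypothetical helper and is not part of `PgClient` or pgvector.

```python
import re

# Hypothetical helper for illustration; not part of PgClient or pgvector.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_create_table(table_name: str, vector_size: int) -> str:
    """Return a CREATE TABLE statement for an embeddings table.

    Raises ValueError if the table name is not a plain identifier or the
    vector size is not a positive integer.
    """
    if not _IDENTIFIER.fullmatch(table_name):
        raise ValueError(f"unsafe table name: {table_name!r}")
    if not isinstance(vector_size, int) or vector_size <= 0:
        raise ValueError(f"invalid vector size: {vector_size!r}")
    return (
        f"CREATE TABLE IF NOT EXISTS {table_name} ("
        "id bigserial PRIMARY KEY, "
        "content text, "
        "source text, "
        f"descriptions_embeddings vector({vector_size}));"
    )


print(build_create_table("embeddings", 384))
```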

Check the table with a simple query

In [None]:
rez = db.query(f"select id from {TABLE_NAME}")
print("records: ", len(rez))

---
### 3. Set up the Text Embeddings model from a SageMaker endpoint

Take a look at the Jumpstart FM text embedding options

In [None]:
sagemaker_utils.list_jumpstart_models()


We've already deployed the text embedding model to an endpoint. Let's verify by listing the currently deployed SageMaker endpoints

In [None]:
sagemaker_utils.list_endpoints()


In [None]:
# Set the TEXT_EMBEDDING_ENDPOINT var to the correct endpoint name from the list above
TEXT_EMBEDDING_ENDPOINT = "### SET TEXT EMBEDDING ENDPOINT HERE ###"

---
### 4. Create Text Embeddings

Now let's use the text embedding endpoint to create our first embedding

In [None]:
# First - create a Sagemaker session
sm_session = Session()
# Instantiate our endpoint model class
embedder = SagemakerEmbeddingsModel(TEXT_EMBEDDING_ENDPOINT,sm_session)

txt = "LETS EMBED THIS TEXT!"

vec = embedder.query_endpoint(txt)
print("")
print(vec)
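
The table column was declared as `vector(EMBEDDING_VECTOR_SIZE)`, so the vectors returned by the endpoint need to have exactly that many dimensions or the inserts later in the lab will fail. A quick check along these lines can catch a mismatch early. This is a sketch only: `check_embedding_size` is a hypothetical helper, and it assumes `query_endpoint` returns a flat sequence of floats, which the lab does not state explicitly.

```python
from typing import Sequence


def check_embedding_size(vec: Sequence[float], expected_size: int) -> None:
    """Raise ValueError if the embedding length differs from the column size."""
    if len(vec) != expected_size:
        raise ValueError(
            f"embedding has {len(vec)} dimensions, expected {expected_size}"
        )


# Stand-in vector; in the lab this would be the `vec` returned by the endpoint.
sample = [0.0] * 384
check_embedding_size(sample, 384)
print("embedding size OK")
```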

---
### 5. Vectorize an SEC filing document and store in Pgvector

Integrate the LangChain text splitter and tiktoken to create a document vectorizer

In [None]:
# instantiate the vectorizer helper
vectorizor = DocToVex(sm_session, TEXT_EMBEDDING_ENDPOINT)


source_docs = [
    "data/financial/0000003153-20-000004.html",
]

def read_doc_text(filename):
    """Return the plain text of an HTML file or a text file."""
    if filename.lower().endswith(".html"):
        print('parsing html')
        with open(filename, 'r') as f:
            soup = BeautifulSoup(f, 'html.parser')
        # parse text from html file
        text = soup.get_text()
        # replace non-breaking spaces with regular spaces
        text = text.replace('\xa0', ' ')
        return text

    else:
        print('parsing text')
        with open(filename, "r") as f:
            text = f.read()
        return text

for doc in source_docs:
    # open doc and read all text
    print(f'vectorizing {doc}')
    doctext = read_doc_text(doc)

    chunks, vex = vectorizor.get_document_vectors(doctext)
    
    # print(f'converted {len(vex)} chunks from source doc')
    # insert each chunk with its vector; the source column records the doc name and chunk index
    insert_query = f"insert into {TABLE_NAME} (content, source, descriptions_embeddings) values (%s, %s, %s)"
    for i, (chunk, vec) in enumerate(zip(chunks, vex)):
        docid = f"{doc}_{i}"
        db.execute(insert_query, (chunk, docid, vec))

print("")
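
The internals of `DocToVex` are not shown in this lab; the description only says it combines a LangChain text splitter with tiktoken. To illustrate the general chunk-with-overlap idea, here is a rough sketch that splits on whitespace-delimited words instead of tokens. The `chunk_words` function and its default sizes are assumptions made for illustration, not the lab's actual implementation.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 200, overlap: int = 20) -> List[str]:
    """Split text into chunks of at most chunk_size words.

    Consecutive chunks share `overlap` words, so text near a chunk
    boundary appears in both neighbouring chunks.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(sample, chunk_size=4, overlap=1):
    print(chunk)
```

A token-based splitter works the same way but counts tokens rather than words, which keeps each chunk within the embedding model's input limit.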

---
Check the record count 

In [None]:
rez = db.query(f"select id, source from {TABLE_NAME}")
print(len(rez))

---
### 6. Test the database with some querying

Create a text query, get a vector representation of it from our endpoint and then find similarities in the vector store

In [None]:
text_question = "Who is the Chief Executive Officer?"
# text_question = "How many members are on the Board of Directors?"

# bring back our embedder from before to create a vector out of the query text
question_embedding = embedder.query_endpoint(text_question) 
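
The query in the next cell places the embedding inside the SQL text, so it depends on how the Python object prints. Pgvector's text input format is a bracketed, comma-separated list such as `[1,2,3]`; a Python list happens to print in a compatible form, but a NumPy array, for example, prints without commas. The sketch below builds the literal explicitly. `to_vector_literal` is a hypothetical helper, and it assumes the embedding is a flat sequence of numbers.

```python
from typing import Iterable


def to_vector_literal(values: Iterable[float]) -> str:
    """Format numbers as a pgvector text literal, e.g. '[0.1,0.2,0.3]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


print(to_vector_literal([0.25, -1, 3]))
```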

In [None]:
%%time
# now we query Pgvector with the embedded question to look for similar text chunks

# <-> l2 distance
# <=> cosine distance
# <#> inner product
# Note: <#> returns the negative inner product since Postgres only supports ASC order index scans on operators
query = """SELECT id, source, content,
            descriptions_embeddings <-> '{vec}' AS distance
            FROM {table}
            ORDER BY descriptions_embeddings <-> '{vec}' LIMIT 1;""".format(vec=question_embedding, table=TABLE_NAME)

rez = db.query(query)

print(rez[0][0], "==",rez[0][1])
print("distance: ", rez[0][3])
print("======================")
print(rez[0][2])
print("")
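
To make the three operators listed in the comments above more concrete, here is a plain-Python sketch of the quantity each one computes. The function names are made up for illustration, and the code assumes two non-zero vectors of equal length; Pgvector itself does this work inside the database.

```python
import math
from typing import Sequence


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between a and b, as computed by `<->`."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """One minus the cosine similarity, as computed by `<=>`."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)  # assumes neither vector is all zeros


def negative_inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Negated dot product, as returned by `<#>`."""
    return -sum(x * y for x, y in zip(a, b))


a = [1.0, 2.0]
b = [2.0, 1.0]
print("l2:", l2_distance(a, b))
print("cosine:", cosine_distance(a, b))
print("negative inner product:", negative_inner_product(a, b))
```

For all three, a smaller value means a closer match, which is why the query orders results in ascending order.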

---
#### This concludes the Text Embeddings section