**Disclaimer**
This Jupyter notebook is derived from and builds upon the following notebook. Credit and appreciation are extended to the original author(s) for their foundational work, which has been adapted and expanded for the current purpose.
https://github.com/build-on-aws/langchain-embeddings/blob/main/notebooks/03_build_pgvector_db.ipynb

# Supercharging Vector Similarity Search with Amazon Aurora and pgvector
In this Jupyter Notebook, you'll explore how to store vector embeddings in a vector database using [Amazon Aurora](https://aws.amazon.com/es/rds/aurora/) and the pgvector extension. This approach is particularly useful for applications that require efficient similarity searches on high-dimensional data, such as natural language processing, image recognition, and recommendation systems.

[Amazon Aurora](https://aws.amazon.com/es/rds/aurora/) is a fully managed relational database service from Amazon Web Services (AWS). Its PostgreSQL-compatible edition supports the [pgvector](https://github.com/pgvector/pgvector) extension, which adds a `vector` data type and distance operators for similarity search. pgvector provides approximate nearest-neighbor indexes (IVFFlat and, since version 0.5.0, HNSW) to speed up these searches. A `vector` column can store up to 16,000 dimensions, and it can be indexed when it has up to 2,000 dimensions.
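
The main distance operators pgvector exposes are `<->` (Euclidean distance), `<#>` (negative inner product), and `<=>` (cosine distance). As a rough illustration of what each one computes, here is a pure-Python sketch. The function names are hypothetical, and the code stands in for the database operators rather than calling them:

```python
import math


def l2_distance(a, b):
    """Euclidean distance, the quantity behind pgvector's <-> operator."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def negative_inner_product(a, b):
    """Negative dot product, the quantity behind pgvector's <#> operator."""
    return -sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity, the quantity behind pgvector's <=> operator."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)


print(l2_distance([0.0, 0.0], [3.0, 4.0]))
print(negative_inner_product([1.0, 2.0], [3.0, 4.0]))
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
```

In all three cases a smaller value means a closer match, which is why the inner product is negated.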

For developers and data engineers with experience in relational databases and PostgreSQL, Amazon Aurora with pgvector offers a powerful and familiar solution for managing vector datastores, especially when dealing with structured datasets. Alternatively, Amazon Relational Database Service (RDS) for PostgreSQL is also a suitable option, particularly if you require specific PostgreSQL versions.

Both Amazon Aurora and Amazon RDS for PostgreSQL offer horizontal scaling capabilities for read queries, with a maximum of 15 replicas. Additionally, Amazon Aurora PostgreSQL provides a Serverless v2 option, which automatically scales compute and memory resources based on your application's demand, simplifying operations and capacity planning.

In [None]:
!pip install -q "psycopg[binary]" langchain_postgres langchain_community langchain_aws langchain_experimental datasets

In [None]:
import json
import boto3
import pandas as pd
from datasets import load_dataset

from langchain_community.document_loaders import DataFrameLoader
from langchain_core.prompts import ChatPromptTemplate
from langchain_postgres import PGVector
from langchain_aws import ChatBedrock, BedrockEmbeddings

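**Set up the database connection:** Make sure an Amazon Aurora PostgreSQL instance is running and reachable from this notebook. The next cell reads the database credentials from an AWS Secrets Manager secret, which is expected to contain the keys `username`, `password`, `host`, `port`, and `dbname`.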

In [None]:
session = boto3.session.Session()
client = session.client(
    service_name='secretsmanager',
)

response = client.get_secret_value(SecretId="phoenix-demo-db-credential")
secret = json.loads(response['SecretString'])
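# Print only the non-sensitive fields; never log the password
print({key: value for key, value in secret.items() if key != "password"})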

In [None]:
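import psycopg

# SQLAlchemy URL used by PGVector; "+psycopg" selects the psycopg 3 driver installed above
connection = f"postgresql+psycopg://{secret['username']}:{secret['password']}@{secret['host']}:{secret['port']}/{secret['dbname']}"

# Establish a direct psycopg connection for ad hoc SQL queries
conn = psycopg.connect(
    host=secret["host"],
    port=secret["port"],
    dbname=secret["dbname"],
    user=secret["username"],
    password=secret["password"],
)
# Create a cursor to run queries
cur = conn.cursor()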
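
If the password contains characters such as `@`, `/`, or `:`, interpolating it directly into a connection URL produces a string that cannot be parsed correctly. One way to guard against this is to percent-encode the credentials with `urllib.parse.quote_plus`. The sketch below makes several assumptions: the helper name `build_connection_url` and the sample secret are hypothetical, and the `postgresql+psycopg` scheme assumes the psycopg 3 driver is used through SQLAlchemy.

```python
from urllib.parse import quote_plus


def build_connection_url(secret, scheme="postgresql+psycopg"):
    """Build a database URL from a secret dict, percent-encoding the credentials."""
    user = quote_plus(str(secret["username"]))
    password = quote_plus(str(secret["password"]))
    return f"{scheme}://{user}:{password}@{secret['host']}:{secret['port']}/{secret['dbname']}"


# Hypothetical secret, for illustration only
sample_secret = {
    "username": "demo_user",
    "password": "p@ss/word",
    "host": "example-cluster.local",
    "port": 5432,
    "dbname": "postgres",
}
sample_url = build_connection_url(sample_secret)
print(sample_url)
```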

## Load a Hugging Face Dataset into the PGVector Store

In [None]:
bedrock_client = boto3.client("bedrock-runtime", region_name="us-east-1") 
bedrock_embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v2:0", client=bedrock_client)
llm = ChatBedrock(model_id="anthropic.claude-3-sonnet-20240229-v1:0", client=bedrock_client)

In [None]:
# Load the aws_whitepapers dataset from Hugging Face
ds = load_dataset("si3mshady/aws_whitepapers")

# Convert the training split to a DataFrame
df = pd.DataFrame(ds["train"])

# Preview the first five rows
df.head(5)

In [None]:
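def load_and_split_semantic(embeddings):
    # NOTE: `embeddings` is currently unused. Despite the function name,
    # load_and_split() applies LangChain's default recursive character
    # splitter rather than a semantic (embedding-based) one.
    loader = DataFrameLoader(df, page_content_column="Content")
    docs = loader.load_and_split()
    print(f"docs: {len(docs)}")
    return docs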
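
To make the splitting step more concrete, here is a simplified fixed-size splitter with overlap. It is only a sketch of the general idea and not LangChain's algorithm, which also tries to break on paragraph and sentence separators. The name `chunk_text` and its default sizes are hypothetical.

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split `text` into chunks of up to `chunk_size` characters that share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


example_chunks = chunk_text("abcdefghij", chunk_size=4, overlap=2)
print(example_chunks)
```

The overlap means that text cut at a chunk boundary still appears whole in the neighboring chunk, at the cost of storing some text twice.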

In [None]:
# function to create vector store
def create_vectorstore(embeddings, collection_name, conn):
    vectorstore = PGVector(
        embeddings=embeddings,
        collection_name=collection_name,
        connection=conn,
        use_jsonb=True,
    )
    return vectorstore

In [None]:
docs = load_and_split_semantic(bedrock_embeddings)

In [None]:
collection_name_text = "aws_whitepapers"
vectorstore = create_vectorstore(bedrock_embeddings, collection_name_text, connection)

In [None]:
# Embed the documents and add them to the vector store.
# This takes roughly 10-15 minutes.
vectorstore.add_documents(docs)
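
Because a single `add_documents` call over the whole dataset gives no feedback until it finishes, it can help to insert the documents in batches and report progress after each one. The following is a sketch under assumptions: `batched` is a hypothetical helper and the batch size of 100 is arbitrary. The loop against the live vector store is shown only as a comment.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Intended use in this notebook (commented out because it needs the live
# vector store and Bedrock access):
# for i, batch in enumerate(batched(docs, 100), start=1):
#     vectorstore.add_documents(batch)
#     print(f"batch {i} stored")

example_batches = list(batched(list(range(7)), 3))
print(example_batches)
```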

## Verify successful loading of dataset

In [None]:
vectorstore.similarity_search("what are the pillars in AWS well architected framework?", k=5)

In [None]:
vectorstore.similarity_search_with_relevance_scores("what is the durability of s3?", k=5)
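
`similarity_search_with_relevance_scores` returns `(document, score)` pairs in which a higher score means a closer match, and the scores are intended to fall between 0 and 1. A common follow-up is to discard weak matches before passing context to a model. The sketch below assumes a hypothetical helper `filter_by_score` and an arbitrary threshold of 0.5, and it uses plain strings in place of `Document` objects:

```python
def filter_by_score(results, min_score=0.5):
    """Keep (document, score) pairs whose score is at least `min_score`, best first."""
    kept = [(doc, score) for doc, score in results if score >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Stand-in results; real calls return (Document, float) tuples
sample_results = [("chunk A", 0.82), ("chunk B", 0.31), ("chunk C", 0.67)]
relevant = filter_by_score(sample_results, min_score=0.5)
print(relevant)
```

A suitable threshold depends on the embedding model and the data, so it is worth inspecting a few real queries before fixing a value.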

### Retrieve information using Amazon Bedrock

In [None]:
template = """
You are an AI assistant tasked with answering questions based on provided context. Your goal is to provide accurate and relevant answers using only the information given.

Here is the context you should use to answer the question:

<context>
{context}
</context>

Now, here is the question you need to answer:

<question>
{query}
</question>

Instructions:
1. Carefully read and analyze the provided context.
2. Identify key information in the context that is relevant to the question.
3. Formulate an answer to the question using only the information from the given context.
4. If the context does not contain enough information to fully answer the question, state this clearly in your response.
5. Do not use any external knowledge or information not present in the provided context.
6. Keep your answer concise and to the point, while ensuring it fully addresses the question.

Format your response as follows:
1. Begin with a brief answer to the question.
2. Follow with a more detailed explanation, if necessary.
3. If you're quoting directly from the context, use quotation marks and indicate the quote's location in the context.

Remember, it's important to rely solely on the given context and not to introduce any external information or assumptions in your answer.
"""

In [None]:
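query = "what is the durability of s3?"
prompt = ChatPromptTemplate.from_template(template)


def parse_docs(docs):
    # Join the text of the retrieved documents so the prompt receives plain
    # text instead of the repr of a list of Document objects
    return {
        'query': query,
        'context': "\n\n".join(doc.page_content for doc in docs)
    }


llm = ChatBedrock(
    model_id="anthropic.claude-3-sonnet-20240229-v1:0",
    region_name="us-east-1",
)

# Retriever output flows through parse_docs, then into the prompt and the model
chain = vectorstore.as_retriever() | parse_docs | prompt | llm

# invoke() returns an AIMessage; print only its text content
print(chain.invoke(query).content)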


Learn more: 
- [Leverage pgvector and Amazon Aurora PostgreSQL for Natural Language Processing, Chatbots and Sentiment Analysis](https://aws.amazon.com/es/blogs/database/leverage-pgvector-and-amazon-aurora-postgresql-for-natural-language-processing-chatbots-and-sentiment-analysis/)

## Delete vectorDB

In [None]:
# Drops the tables PGVector created, which removes every collection stored in them
vectorstore.drop_tables()