<a href="https://colab.research.google.com/github/withpi/cookbook-withpi/blob/main/colabs/Sagemaker_Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://withpi.ai"><img src="https://withpi.ai/logo/logoFullBlack.svg" width="240px"></a>

<a href="https://code.withpi.ai"><font size="4">Documentation</font></a>

<a href="https://withpi.ai"><font size="4">Copilot</font></a>

# Embeddings

Pi has published its Pi Embedding model for deployment on AWS SageMaker.

It takes as input a list of items to embed and returns a list of embeddings.
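As a rough sketch of that input and output, the helpers below build the JSON body used by the cells later in this notebook (a `query` list plus a `batch` flag) and decode a response. The function names `build_request_body` and `parse_embeddings` are hypothetical. The response shape, a JSON list with one vector per input, is an assumption inferred from how the sample cell indexes `results[0]`.

```python
import json


def build_request_body(texts, batch=False):
    """Serialize texts into the JSON body shape used by the cells in this notebook."""
    if not texts or not all(isinstance(t, str) for t in texts):
        raise ValueError("texts must be a non-empty list of strings")
    return json.dumps({"query": list(texts), "batch": batch})


def parse_embeddings(raw_body):
    """Decode a response body, assumed to be a JSON list with one vector per input."""
    embeddings = json.loads(raw_body)
    if not isinstance(embeddings, list):
        raise ValueError("expected a JSON list of embeddings")
    return embeddings


body = build_request_body(["A document to embed"])
print(body)

# Made-up response bytes for illustration only; this is not real model output.
fake_response = b"[[0.1, -0.2, 0.3]]"
print(parse_embeddings(fake_response))
```

Keeping serialization separate from the network call makes the payload easy to inspect before sending it to the endpoint.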

Deploy the model to a SageMaker endpoint in your own AWS account. This notebook shows how to run inference against that endpoint.

You will need appropriate secrets in your notebook to access your account, such as `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY` and `AWS_SESSION_TOKEN`. When running locally, authenticate to AWS in the normal manner.
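A quick way to catch a missing secret before any endpoint call is to check the environment directly. The sketch below is only illustrative and the helper name `missing_aws_vars` is hypothetical. It treats all three variables as required, which fits temporary credentials; long-lived IAM user keys do not use a session token, so drop that entry if it does not apply to you.

```python
import os

# The three variables named above. AWS_SESSION_TOKEN only applies to temporary credentials.
REQUIRED_AWS_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN")


def missing_aws_vars(env=None):
    """Return the names of credential variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_AWS_VARS if not env.get(name)]


missing = missing_aws_vars()
if missing:
    print(f"Missing AWS credential variables: {', '.join(missing)}")
else:
    print("All expected AWS credential variables are set.")
```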

Start by installing packages and adding environment variables.

In [1]:
%pip install boto3 tqdm


import os
from google.colab import userdata

os.environ["AWS_ACCESS_KEY_ID"] = userdata.get('AWS_ACCESS_KEY_ID')
os.environ["AWS_SECRET_ACCESS_KEY"] = userdata.get("AWS_SECRET_ACCESS_KEY")
os.environ["AWS_SESSION_TOKEN"] = userdata.get("AWS_SESSION_TOKEN")

Collecting boto3
  Downloading boto3-1.40.9-py3-none-any.whl.metadata (6.7 kB)
Collecting botocore<1.41.0,>=1.40.9 (from boto3)
  Downloading botocore-1.40.9-py3-none-any.whl.metadata (5.7 kB)
Collecting jmespath<2.0.0,>=0.7.1 (from boto3)
  Downloading jmespath-1.0.1-py3-none-any.whl.metadata (7.6 kB)
Collecting s3transfer<0.14.0,>=0.13.0 (from boto3)
  Downloading s3transfer-0.13.1-py3-none-any.whl.metadata (1.7 kB)
Downloading boto3-1.40.9-py3-none-any.whl (140 kB)
Downloading botocore-1.40.9-py3-none-any.whl (14.0 MB)
Downloading jmespath-1.0.1-py3-none-any.whl (20 kB)
Downloading s3transfer-0.13.1-py3-none-any.whl (85 kB)

## Sample inference

Run the cell below to check that everything is working.

Before running it, set `endpoint_name` to the name of your SageMaker endpoint and `region_name` to the region it is deployed in.

In [5]:
import boto3
import json
import time

# Initialize the SageMaker runtime client
# Update the region if needed
sagemaker_runtime = boto3.client('sagemaker-runtime', region_name='us-east-1')

# Your endpoint configuration
endpoint_name = 'MarketplaceEndpoint'

latencies = []
for _ in range(10):
  start = time.perf_counter()
  response = sagemaker_runtime.invoke_endpoint(
      EndpointName=endpoint_name,
      ContentType='application/json',
      Body=json.dumps({"query": ["A document to embed"], "batch": False})
  )
  stop = time.perf_counter()
  latencies.append(f"{stop-start:.3f}")

print(f"Latencies: {latencies}")
results = json.loads(response['Body'].read().decode())
display("Retrieved embeddings")
display(f"Sample dimensions: {results[0][:5]}")

Latencies: ['0.149', '0.130', '0.137', '0.165', '0.157', '0.158', '0.170', '0.168', '0.153', '0.162']


'Retrieved embeddings'

'Sample dimensions: [-0.11114501953125, 0.0201416015625, -0.031494140625, -0.038177490234375, 0.016082763671875]'
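The cell above stores each latency as a formatted string, so the values need converting back to floats before any arithmetic. As a small sketch, the snippet below summarizes the ten latencies from this sample run; the list is copied from the printed output, and your own numbers will differ.

```python
import statistics

# Latencies (in seconds) copied from the sample run above.
latencies = ['0.149', '0.130', '0.137', '0.165', '0.157',
             '0.158', '0.170', '0.168', '0.153', '0.162']

values = [float(x) for x in latencies]
summary = {
    "min": min(values),
    "median": statistics.median(values),
    "mean": statistics.fmean(values),
    "max": max(values),
}

# Report in milliseconds for readability.
for name, seconds in summary.items():
    print(f"{name}: {seconds * 1000:.1f} ms")
```

Ten sequential calls is a small sample, so treat the summary as a sanity check and not as a benchmark.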

## Load test

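The cell below sends many concurrent batch requests to the endpoint to measure throughput. Use the result to estimate how many instances you need for your anticipated traffic. Note that the reported figure counts whitespace-separated terms, which may differ from the number of tokens the model's tokenizer produces.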

In [8]:
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm
import time

num_terms = 1024
batch_size = 16
max_concurrency = 16
num_batches=1000

large_payload = {
    "query": [
        " ".join(["term"]*num_terms)
    ]*batch_size,
    "batch": True
}

def make_call():
  response = sagemaker_runtime.invoke_endpoint(
      EndpointName=endpoint_name,
      ContentType='application/json',
      Body=json.dumps(large_payload)
  )
  return response


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=max_concurrency) as executor:
  futures = [executor.submit(make_call) for _ in range(num_batches)]
  for future in tqdm(as_completed(futures), total=len(futures)):
    # Will throw on error to abort the test
    result = future.result()
stop = time.perf_counter()

elapsed = stop-start
total_embedded = num_terms*batch_size*num_batches

display(f"Total terms embedded: {total_embedded}")
display(f"Elapsed time: {elapsed:.2f} seconds")
display(f"Throughput: {total_embedded / elapsed:.2f} tokens/second")

100%|██████████| 1000/1000 [02:16<00:00,  7.35it/s]


'Total terms embedded: 16384000'

'Elapsed time: 137.20 seconds'

'Throughput: 119417.21 tokens/second'
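To turn the measured throughput into an instance count, one rough approach is to divide the anticipated load by the per-instance throughput, leaving some headroom. The sketch below assumes the load test above ran against a single instance and that throughput scales roughly linearly with instance count; neither is confirmed by this notebook. The function name `instances_needed`, the 70% headroom default, and the target load are hypothetical.

```python
import math


def instances_needed(target_per_second, measured_per_second, headroom=0.7):
    """Estimate instance count, keeping each instance at or below `headroom` utilization."""
    if target_per_second < 0 or measured_per_second <= 0 or not 0 < headroom <= 1:
        raise ValueError("target must be >= 0, measured > 0, and headroom in (0, 1]")
    usable_per_instance = measured_per_second * headroom
    return max(1, math.ceil(target_per_second / usable_per_instance))


measured = 119417.21  # terms/second from the run above (assumed to be one instance)
target = 500_000      # hypothetical anticipated load, in the same units

print(f"Estimated instances: {instances_needed(target, measured)}")
```

Re-run the load test with payload sizes and concurrency that match your real traffic before relying on the estimate, since throughput depends on document length and batch size.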