# Scale Embeddings with Snowflake Notebooks on Container Runtime

[Snowflake Notebooks on Container Runtime](https://docs.snowflake.com/en/user-guide/ui-snowsight/notebooks-on-spcs) are a powerful IDE option for building ML workloads at scale. Container Runtime (Public Preview) gives you a flexible container infrastructure that supports building and operationalizing a wide variety of resource-intensive ML workflows entirely within Snowflake.

### What you'll be building
Now, imagine you're a Data Scientist looking to experiment with an open source embedding model and evaluate a dataset with it before deciding to deploy it for a large batch embeddings generation (inference) job.

- In the first part of this Notebook, you will load an embedding model and use a GPU to generate embeddings on a sample dataset (68K records).

- In the second part, you will evaluate a sampled RAG dataset (100K records) that has various questions and associated context chunks ("labels"). After evaluation, you will deploy the embedding model and perform inference on the full RAG dataset (10M context chunks).

First, we will import some basic libraries and get the Snowflake session object. We will also install some libraries that we'll need.

In [None]:
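import pandas as pd
import numpy as np
from snowflake.snowpark.context import get_active_session

# Get the active Snowpark session for this Notebook
session = get_active_session()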

# Add a query tag to the session. This helps with debugging and performance monitoring.
session.query_tag = {"origin":"sf_sit-is", 
                     "name":"cr_notebooks_embeddings", 
                     "version":{"major":1, "minor":0},
                     "attributes":{"is_quickstart":1, "source":"notebook"}}

# Set session context 
session.use_role("DEEPSEEK_ROLE") 

# Print the current role, warehouse, and database/schema
print(f"role: {session.get_current_role()} | WH: {session.get_current_warehouse()} | DB.SCHEMA: {session.get_fully_qualified_current_schema()}")

In [None]:
! pip install sentence-transformers --quiet

## PART 1: Getting started with embeddings

Let's load an open source embedding model using `SentenceTransformer()` and show how we can generate embeddings on a sample sentence dataset and store those embeddings as a `VectorType()` in a Snowflake table.

In [None]:
from sentence_transformers import SentenceTransformer

# Take an example sentence transformer from HF
embed_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2',
                                  trust_remote_code=True,
                                  device='cuda')

Let's load a sample sentence dataset called `SST2`.

In [None]:
from datasets import load_dataset

# Load SST2 dataset and rename columns
sst2_data = load_dataset('sst2')
df_pd = pd.DataFrame(sst2_data['train'])
df_pd = df_pd.rename(columns={'idx': 'IDX', 'sentence': 'SENTENCE', 'label': 'LABEL'})

df_pd.shape

Now, we're ready to generate embeddings on this dataset.

In [None]:
# Generate embeddings 
embeddings = embed_model.encode(df_pd["SENTENCE"].to_list(), 
                          show_progress_bar=True)
df_pd['EMBEDDING'] = embeddings.tolist()
df_pd.head()

We will now create a Snowpark DataFrame to store the results.

In [None]:
df_sdf = session.create_dataframe(df_pd)
df_sdf.limit(5)

Let's cast the embeddings into Snowflake's `VectorType()`.

In [None]:
from snowflake.snowpark.types import VectorType

df_sdf = df_sdf.with_column('EMBEDDING', df_sdf['EMBEDDING'].cast(VectorType(float, 384)))
df_sdf.limit(5)
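
The cast above assumes that every embedding has exactly 384 values, which is the output size of `all-MiniLM-L6-v2`. As a minimal sketch, a quick length check before casting could look like the following. The helper name `check_embedding_dims` and the toy data are hypothetical and only stand in for the `EMBEDDING` column.

```python
from typing import List, Sequence

EXPECTED_DIM = 384  # output size of all-MiniLM-L6-v2, matching VectorType(float, 384)


def check_embedding_dims(vectors: Sequence[Sequence[float]],
                         expected_dim: int = EXPECTED_DIM) -> List[int]:
    """Return the positions of vectors whose length differs from expected_dim."""
    return [i for i, vec in enumerate(vectors) if len(vec) != expected_dim]


# Toy data (hypothetical): two valid rows and one truncated row
toy_embeddings = [[0.0] * 384, [0.1] * 384, [0.2] * 383]
bad_rows = check_embedding_dims(toy_embeddings)
print(bad_rows)  # [2]
```

In the Notebook, the same check could be run on `df_pd['EMBEDDING'].to_list()`; an empty list means every row matches the declared vector dimension.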

Now, we're ready to write the results into a Snowflake table. 

In [None]:
df_sdf[['SENTENCE',
        'EMBEDDING']].write.save_as_table('SST2_EMBEDDINGS', mode="overwrite")

Let's take a look at what this table looks like.

In [None]:
SELECT * FROM SST2_EMBEDDINGS 
LIMIT 5

## PART 2: Evaluate a RAG dataset and perform large scale batch inference

Let's now experiment with generating embeddings for a RAG solution. We will calculate embeddings for the `CONTEXT` chunks we have and for a sample set of `QUESTIONS`. The goal is to evaluate whether the correct chunk is retrieved for each question.

If we're happy with the accuracy, we will go ahead and deploy the model to generate embeddings at scale on a larger dataset.

Let's load an open source RAG dataset from HuggingFace and oversample to create a larger dataset (~10M records).

In [None]:
# For more info: https://huggingface.co/datasets/neural-bridge/rag-dataset-12000
ds = load_dataset("neural-bridge/rag-dataset-12000")

df_pd_rag = pd.DataFrame(ds['train'])
df_pd_rag = df_pd_rag.rename(columns={'context': 'CONTEXT'})

# Oversample to create a larger dataset
newdf = df_pd_rag.loc[np.repeat(df_pd_rag.index, 100)].reset_index(drop=True)

for i in range(10):
    # Each pass appends another copy of newdf because overwrite=False
    print(f'Writing batch {i + 1} of 10')
    session.write_pandas(newdf, "RAG_DATASET_10M", 
                         auto_create_table=True, 
                         overwrite=False,
                         chunk_size=10000)
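
To see where the ~10M figure comes from, here is a small sketch of the arithmetic behind the oversampling: each row is repeated 100 times, and the result is appended to the table 10 times. The base row count of 9,600 is an assumption about the size of the `train` split (check `df_pd_rag.shape` to confirm), and both helper names are hypothetical.

```python
def repeat_each(index, times):
    """Pure-Python analogue of np.repeat: each element is repeated consecutively."""
    return [i for i in index for _ in range(times)]


def total_rows(base_rows, repeats, appends):
    """Rows in the final table after repeating each row and appending several times."""
    return base_rows * repeats * appends


print(repeat_each([0, 1, 2], 2))  # [0, 0, 1, 1, 2, 2]

BASE_ROWS = 9_600  # assumption: size of the train split
print(total_rows(BASE_ROWS, repeats=100, appends=10))  # 9600000
```

Because every row is repeated, the final table contains many identical `CONTEXT` chunks, which matters later when we compare retrieved chunks with labeled chunks.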

Since we're just experimenting and evaluating at this stage, let's sample this dataset to 100k records.

In [None]:
df_rag_sample = session.table('RAG_DATASET_10M').limit(100000).to_pandas()
df_rag_sample = df_rag_sample.rename(columns={'question': 'QUESTION'})

Now, we can go ahead and generate embeddings on the `CONTEXT` chunks in our dataset.

In [None]:
context_sample_list = df_rag_sample["CONTEXT"].to_list()

context_embeddings = embed_model.encode(context_sample_list,
                                        show_progress_bar=True)

df_rag_sample['CONTEXT_EMBEDDINGS'] = context_embeddings.tolist()
df_rag_sample.head()

Let's select a sample of 1000 questions to evaluate and generate embeddings for those as well.

In [None]:
df_rag_sample_q = pd.DataFrame(df_rag_sample[["QUESTION", "CONTEXT"]].sample(1000))
df_rag_sample_q = df_rag_sample_q.rename(columns={"CONTEXT": "LABELED_CONTEXT"})

question_sample_list = df_rag_sample_q["QUESTION"].to_list()

question_embeddings = embed_model.encode(question_sample_list, 
                                show_progress_bar=True)

df_rag_sample_q['QUESTION_EMBEDDINGS'] = question_embeddings.tolist()
df_rag_sample_q.head()

We will want to keep track of the correct `CONTEXT` per `QUESTION` as well.

In [None]:
question_labels_list = df_rag_sample_q["LABELED_CONTEXT"].to_list()

Finally, we can evaluate our embedding model on our chosen sample of questions to generate a relevance score.

We'll be using the `util.semantic_search()` function from `sentence_transformers` to select the top `CONTEXT` per `QUESTION` to see whether we pick the correct `CONTEXT` chunk.

In [None]:
from sentence_transformers import util

hits = util.semantic_search(question_embeddings, context_embeddings, top_k=1)
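
To make the shape of `hits` concrete, here is a minimal pure-Python sketch of a top-1 search by cosine similarity. It is not the library's implementation; `top1_search` and the toy 2-dimensional vectors are hypothetical and only mimic the output format, which is one list per question containing a dict with `corpus_id` and `score`.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top1_search(queries, corpus):
    """For each query, return a one-element list with the best-matching corpus entry."""
    hits = []
    for q in queries:
        scores = [cosine(q, c) for c in corpus]
        best = max(range(len(corpus)), key=scores.__getitem__)
        hits.append([{"corpus_id": best, "score": scores[best]}])
    return hits


# Toy vectors (hypothetical) standing in for context and question embeddings
toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_queries = [[0.9, 0.1], [0.1, 0.8]]
toy_hits = top1_search(toy_queries, toy_corpus)
print([h[0]["corpus_id"] for h in toy_hits])  # [0, 1]
```

The `corpus_id` value is the position of the matched vector in the corpus list, which is why it can later be joined back to the row index of the context DataFrame.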

In [None]:
aDict = {}
for n, item in enumerate(hits):
    item[0]["QUESTION"] = question_sample_list[n]
    item[0]["LABELED_CONTEXT"] = question_labels_list[n]
    aDict[n] = item[0]

In [None]:
results_df = pd.DataFrame.from_dict(aDict, orient='index')
results_df.head()

Join the results to the original dataset on `corpus_id` to retrieve the `CONTEXT` chunk that was selected for each question.

In [None]:
merged_df = pd.merge(df_rag_sample, results_df, 
                      left_index=True, 
                      right_on=['corpus_id'], how='inner')
merged_df.head()

In [None]:
correct_results = merged_df[merged_df['CONTEXT']==merged_df['LABELED_CONTEXT']].count()

We compute the accuracy value now to see how many times the correct `CONTEXT` chunk was pulled for each of our sample questions.

In [None]:
accuracy = correct_results.values[0]/merged_df.count()['LABELED_CONTEXT'] * 100
f'''Percent accuracy: {accuracy:.2f}%'''
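
The accuracy calculation can also be sketched without pandas. The following is a minimal illustration under stated assumptions: `top1_accuracy` is a hypothetical helper, and the toy chunk labels stand in for the `CONTEXT` and `LABELED_CONTEXT` columns.

```python
def top1_accuracy(retrieved, labeled):
    """Percentage of questions whose retrieved chunk equals the labeled chunk."""
    if len(retrieved) != len(labeled):
        raise ValueError("retrieved and labeled must have the same length")
    if not labeled:
        raise ValueError("at least one question is required")
    matches = sum(r == l for r, l in zip(retrieved, labeled))
    return matches / len(labeled) * 100


# Toy data (hypothetical): three of four questions retrieve the labeled chunk
retrieved = ["chunk A", "chunk B", "chunk C", "chunk A"]
labeled = ["chunk A", "chunk B", "chunk D", "chunk A"]
acc = top1_accuracy(retrieved, labeled)
print(f"Percent accuracy: {acc:.2f}%")  # Percent accuracy: 75.00%
```

Comparing chunk text rather than row positions is a deliberate choice here: the dataset was oversampled, so a retrieved duplicate of the labeled chunk still counts as correct.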

The accuracy looks good enough for us to proceed and deploy the embedding model to perform a batch inference job on the full ~10M `CONTEXT` chunks.

In order to deploy the model, we will be using [Snowflake Model Registry](https://docs.snowflake.com/developer-guide/snowflake-ml/model-registry/overview?utm_cta=snowpark-dg-hero-card).

The Snowflake Model Registry lets you securely manage models and their metadata in Snowflake, regardless of origin. The model registry stores machine learning models as first-class schema-level objects in Snowflake so they can easily be found and used by others in your organization. You can create registries and store models in them using Python classes in the Snowpark ML library. Models can have multiple versions, and you can designate a version as the default.

After you have stored a model, you can invoke its methods (equivalent to functions or stored procedures) to perform model operations, such as inference.

First, let's create a `Registry` instance.

In [None]:
from snowflake.ml.registry import Registry

# Create Model Registry
reg = Registry(
    session=session, 
    database_name=session.get_current_database(), 
    schema_name=session.get_current_schema()
    )

reg

We need to specify sample input data in order to log this model.

In [None]:
sample_input_data = session.table('RAG_DATASET_10M').limit(10)
sample_input_data = sample_input_data[['CONTEXT']]

Now, we can log the model.

In [None]:
# Logging the sentence_transformers model, using pip requirements, deployment against the gpu
mv = reg.log_model(embed_model,
                   model_name="sentence_transformer_minilm",
                   version_name='v1',
                   pip_requirements=["sentence-transformers", "torch", "transformers"], 
                   conda_dependencies=["pyopenssl >= 22.0.0"],
                   sample_input_data = sample_input_data,
                   options = {"cuda_version": "11.8"},
                   comment = "Model artifact associated with deployment against GPU"
                  )

Because you're using pip requirements, this model will be deployed as a service on SPCS using the new Model Serving functionality. It will not run on the warehouse. If you logged it using conda requirements, it could also run on the warehouse.

Let's make sure the model got logged.

In [None]:
reg.show_models()

We can also get a reference to the model version using `get_model()` and list the functions we can call on it with `show_functions()`.

In [None]:
mv = reg.get_model('sentence_transformer_minilm').version('V1')

In [None]:
mv.show_functions()

Now we need to create a service that will host our model on GPUs. Let's make sure our service can use as many GPUs as we have access to outside of the single GPU that our Notebook is using. During `setup.sql` we set 4 GPUs (nodes) to be the max capacity, so we can dedicate 3 to the inference service.

**Note:** This step takes some time and will print log statements below.

In [None]:
#Create the service and call it:
mv.create_service(service_name="minilm_gpu_service",
                  service_compute_pool="GPU_NV_S_COMPUTE_POOL",
                  image_repo=f"{session.get_current_database()}.{session.get_current_schema()}.MY_INFERENCE_IMAGES",
                  ingress_enabled=True,
                  build_external_access_integration="ALLOW_ALL_INTEGRATION", #allows access to pypi to build
                  gpu_requests = "1", #max number of GPUs needs to match GPU nodes in the compute pool Small --> 1 instance
                  max_instances = 3
                )

In [None]:
-- Run this to check whether status = RUNNING
SHOW SERVICES IN COMPUTE POOL GPU_NV_S_COMPUTE_POOL;

Once the inference service is ready, let's load our full 10M dataset.

In [None]:
full_rag_dataset = session.table('RAG_DATASET_10M')
full_rag_dataset

Now, let's run the inference job and save the embeddings to a Snowflake table. 

**Note:** This step will also take some time and runs asynchronously, so you can monitor the underlying query under `Monitoring > Query History`.

Once it's completed, you should be able to see the table that was created to store the embeddings within the `EMBEDDING_MODEL_QUICKSTART_DB` database.

In [None]:
from snowflake.snowpark import functions as F
from snowflake.snowpark import types as T

output_10M = mv.run(full_rag_dataset[['CONTEXT']], function_name = 'encode', service_name = 'minilm_gpu_service')
output_10M = output_10M.with_column('"output_feature_0"', F.col('"output_feature_0"').cast(T.VectorType(float, 384)))\
                       .select('CONTEXT', '"output_feature_0"')
                       
output_10M = output_10M.rename(F.col('"output_feature_0"'), "CONTEXT_EMBEDDING")

# We can now run an async job 
output_10M.write.mode('overwrite').save_as_table('RAG_DATASET_10M_OUTPUT', block = False)

## Conclusion

Within this Notebook, you loaded an embedding model, generated embeddings using a GPU, evaluated a dataset for a RAG solution, deployed the embedding model, performed large scale batch inference, and saved results to a Snowflake table without a lot of complex infrastructure setup and management.