# Ingest massive amounts of data to a Vector DB (Amazon OpenSearch)
**_Use of Amazon OpenSearch as a vector database for storing embeddings_**

This notebook works well with the `conda_python3` kernel on a SageMaker Notebook `ml.t3.xlarge` instance.

---
---

## Contents

1. [Objective](#Objective)
1. [Background](#Background-(Problem-Description-and-Approach))
1. [Overall Workflow](#Overall-Workflow)
1. [Create scripts for ingesting data into OpenSearch](#Create-scripts-for-ingesting-data-into-OpenSearch)
1. [Download the data from the web and upload to S3](#Download-the-data-from-the-web-and-upload-to-S3)
1. [Load the data in an OpenSearch index (Local mode)](#Load-the-data-in-a-OpenSearch-index-(Local-mode))
1. [Load the data in an OpenSearch index via SageMaker Processing Job (Distributed mode)](#Load-the-data-in-a-OpenSearch-index-via-SageMaker-Processing-Job-(Distributed-mode))
1. [Conclusion](#Conclusion)

---

## Objective

This notebook illustrates how to use [`langchain`](https://python.langchain.com/en/latest/index.html), Amazon SageMaker endpoints, and an Amazon SageMaker Processing job to convert large amounts of data into embeddings and ingest the text along with its embeddings into an Amazon OpenSearch index.

We use the documents from [sagemaker.readthedocs.io/en/stable](https://sagemaker.readthedocs.io/en/stable) as the dataset to convert into embeddings. The [`gpt-j-6b`](https://huggingface.co/EleutherAI/gpt-j-6b) large language model (LLM) is used to generate the embeddings.

To understand the code, you might also find it useful to refer to:

- *[The langchain OpenSearch documentation](https://python.langchain.com/en/latest/ecosystem/opensearch.html)*
- *[Amazon OpenSearch service documentation](https://docs.aws.amazon.com/opensearch-service/index.html)*
- *[SageMaker Processing Job](https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job.html)*
---

## Background (Problem Description and Approach)

- **Problem statement**: 

Using LLMs for information retrieval tasks (such as question answering) requires converting both the knowledge corpus and the user questions into vector embeddings. We want to generate these embeddings using an LLM hosted as an Amazon SageMaker endpoint and store them in a vector database of choice, such as Amazon OpenSearch. For large amounts of data (TBs or PBs), we need a scalable system that can convert the documents into embeddings, store them in a vector database, and provide low-latency similarity search.

- **Our approach**: 

1. Host the LLM used to generate the embeddings as a SageMaker endpoint with `instance_count` set to > 1 (the exact number depends on how long it takes to generate embeddings for the amount of data we have and on how much we want to spend; more instances cost more but take less time).

1. Place the corpus of data in S3 (each document is a file stored as an object in S3).

1. Use a Python script that uses [langchain](https://python.langchain.com/en/latest/index.html) and [Opensearch-py](https://pypi.org/project/opensearch-py/) to ingest the data into OpenSearch. Run the script locally on this notebook for testing.

1. Create a SageMaker Processing job with `instance_count` set to > 1 (usually matching the `instance_count` for the SageMaker endpoint).

    Each instance of the SageMaker Processing job runs a script that does the following:
    - Processes a subset of files from S3.
    - Uses langchain to read the files from the local filesystem and split them into chunks.
    - Creates a langchain `OpenSearchVectorSearch` object and provides it with a `SagemakerJumpstartEmbeddings` object that enables it to talk to our SageMaker endpoint.
    - Uses the langchain `OpenSearchVectorSearch` object to create or get an existing OpenSearch index and then ingests documents into the index, each containing the original `text`, `embeddings` and `metadata`.
    - Does this using Python multiprocessing to achieve parallelization even within a single processing job instance and to ensure maximum use of the SageMaker endpoint instance's GPU.
    > **The advantage of using langchain as a wrapper for interfacing with a vector database is that it provides a generic pattern that can be used with any LLM and any vector store. Langchain automatically uses the OpenSearch bulk ingestion API endpoint rather than ingesting data one record at a time. Furthermore, langchain provides an opinionated JSON structure, designed for information retrieval use cases, that stores the text and metadata along with the embeddings themselves in an OpenSearch index.**
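
The per-instance steps listed above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the notebook's actual script: `shard`, `chunk_text`, `ingest_shard` and `ingest_in_parallel` are hypothetical names, the character-based splitter stands in for a langchain text splitter, and `ingest_shard` only counts chunks where the real script would call the embedding endpoint and write to OpenSearch.

```python
from multiprocessing import Pool


def shard(items, num_shards):
    """Split items round-robin into num_shards lists (one per worker)."""
    return [items[i::num_shards] for i in range(num_shards)]


def chunk_text(text, chunk_size=20, overlap=5):
    """Split text into fixed-size character chunks that overlap by `overlap`."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop once the remaining tail is already covered by the previous chunk.
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]


def ingest_shard(docs):
    """Placeholder worker: chunk each document and return the chunk count.

    Assumption: the real worker would embed each chunk through the
    SageMaker endpoint and bulk-ingest the results into OpenSearch here.
    """
    return sum(len(chunk_text(doc)) for doc in docs)


def ingest_in_parallel(docs, num_workers):
    """Process the shards in separate processes and return the total chunk count."""
    with Pool(processes=num_workers) as pool:
        counts = pool.map(ingest_shard, shard(docs, num_workers))
    return sum(counts)
```

On platforms that start worker processes with `spawn`, `ingest_in_parallel` should be called from under an `if __name__ == "__main__":` guard. The same round-robin idea applies one level up, where each processing instance receives its own subset of the S3 objects.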

- **Our tools**: [Amazon SageMaker SDK](https://sagemaker.readthedocs.io/en/stable/), [langchain](https://python.langchain.com/en/latest/index.html) and [Opensearch-py](https://pypi.org/project/opensearch-py/).
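
To make the bulk ingestion and document structure mentioned in the approach more concrete, here is a hedged sketch of how a bulk request body could be assembled by hand. The newline-delimited action/document layout is the OpenSearch Bulk API format; the field names `vector_field`, `text` and `metadata` are assumptions about the structure langchain writes, and `build_bulk_body` is a hypothetical helper rather than part of langchain or opensearch-py.

```python
import json


def build_bulk_body(index_name, texts, embeddings, metadatas):
    """Build a newline-delimited Bulk API body with one action line per document."""
    lines = []
    for text, vector, metadata in zip(texts, embeddings, metadatas):
        # Action line: tells OpenSearch to index the document that follows.
        lines.append(json.dumps({"index": {"_index": index_name}}))
        # Document line: field names are assumed, see the note above.
        lines.append(json.dumps({"vector_field": vector, "text": text, "metadata": metadata}))
    # The Bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"
```

In the notebook itself this work is delegated to langchain, so a helper like this is only useful for understanding what a single request carries.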


---

## Overall Workflow

**Prerequisite**

The following prerequisites must be set up by running [this CloudFormation template](./template.yaml) before running this notebook:
- A SageMaker endpoint for generating embeddings.
- An Amazon OpenSearch cluster for storing embeddings.
    - The OpenSearch cluster's access credentials (username and password), stored in AWS Secrets Manager by following the steps described [here](https://docs.aws.amazon.com/secretsmanager/latest/userguide/managing-secrets.html).
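
As a rough illustration of how the stored credentials might be read back, the sketch below uses the boto3 Secrets Manager client. It assumes the secret is a JSON string with `username` and `password` keys; the function names, the secret name and the region are hypothetical and would need to match your own setup.

```python
import json


def parse_opensearch_secret(secret_string):
    """Return (username, password) from a JSON-encoded secret string.

    Assumption: the secret stores the keys "username" and "password".
    """
    secret = json.loads(secret_string)
    return secret["username"], secret["password"]


def fetch_opensearch_credentials(secret_name, region_name):
    """Fetch the secret from AWS Secrets Manager and parse it."""
    # Imported lazily so the parser above can be used without AWS access.
    import boto3

    client = boto3.client("secretsmanager", region_name=region_name)
    response = client.get_secret_value(SecretId=secret_name)
    return parse_opensearch_secret(response["SecretString"])
```

Keeping the parsing separate from the AWS call makes it possible to check the expected secret layout without network access.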

The overall workflow for this notebook is as follows:
1. Install the required Python packages and store session information in local variables.
1. Download data from source and upload to S3.
1. Run the Python script locally to ingest a subset of data into an OpenSearch index for testing.
1. Run a SageMaker Processing job that reads all data from S3 and runs the same Python script as above to ingest the data into OpenSearch.
    - As part of this step we also create a custom container to package the langchain and opensearch-py Python packages.
1. Do a similarity search with embeddings stored in the OpenSearch index for an input query.
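
For intuition about the last step, similarity search ranks stored embeddings by how close they are to the query embedding. The sketch below is a brute-force version using cosine similarity over a small in-memory dictionary. It is only meant to show the idea: `cosine_similarity` and `top_k` are hypothetical helpers, the two-dimensional vectors are made up, and a vector database would typically use an approximate nearest-neighbor index instead of scanning every vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=2):
    """Return the ids of the k stored vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in indexed.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Toy index: document id -> embedding (values invented for illustration).
indexed = {"doc-a": [1.0, 0.0], "doc-b": [0.0, 1.0], "doc-c": [1.0, 1.0]}
nearest = top_k([1.0, 0.0], indexed, k=2)
```

Here `nearest` holds the ids of the two documents whose embeddings point in the direction most similar to the query.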

---

## Step 1: Setup
Install the required packages.

In [1]:
print("I am in")

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
awscli 1.27.150 requires PyYAML<5.5,>=3.10, but you have pyyaml 6.0 which is incompatible.
sparkmagic 0.20.5 requires nest-asyncio==1.5.5, but you have nest-asyncio 1.5.6 which is incompatible.
sparkmagic 0.20.5 requires pandas<2.0.0,>=0.17.1, but you have pandas 2.0.1 which is incompatible.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
autovizwidget 0.20.5 requires pandas<2.0.0,>=0.20.1, but you have pandas 2.0.1 which is incompatible.
hdijupyterutils 0.20.5 requires pandas<2.0.0,>=0.17.1, but you have pandas 2.0.1 which is incompatible.
sparkmagic 0.20.5 requires nest-asyncio==1.5.5, but you have nest-asyncio 1.5.6 which is incompatible.
sparkmagic 0.20.5 require