# Ingestion into Amazon OpenSearch Serverless

### Set up environment

In [107]:
%pip install llama-index-llms-bedrock
%pip install llama-index-retrievers-bedrock
%pip install llama-index-vector-stores-opensearch
%pip install llama-index-embeddings-bedrock
%pip install requests-aws4auth

Collecting llama-index-llms-bedrock
  Using cached llama_index_llms_bedrock-0.1.6-py3-none-any.whl (8.2 kB)
Collecting llama-index-llms-anthropic<0.2.0,>=0.1.7
  Using cached llama_index_llms_anthropic-0.1.10-py3-none-any.whl (6.1 kB)
Installing collected packages: llama-index-llms-anthropic, llama-index-llms-bedrock
Successfully installed llama-index-llms-anthropic-0.1.10 llama-index-llms-bedrock-0.1.6

[notice] A new release of pip available: 22.3.1 -> 24.0
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.
Collecting llama-index-retrievers-bedrock
  Using cached llama_index_retrievers_bedrock-0.1.0-py3-none-any.whl (3.6 kB)
Installing collected packages: llama-index-retrievers-bedrock
Successfully installed llama-index-retrievers-bedrock-0.1.0

In [1]:
import nest_asyncio
nest_asyncio.apply()

### Configure OpenSearch and create collection (via console)

1. Go to Amazon OpenSearch Service in the AWS console.
2. Select Serverless, then Get started.
3. Create a collection (this notebook uses "rag-bedrock" as the collection name).
4. Wait for the collection to finish creating, which takes about 5 minutes.
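
The console steps above can also be scripted. The following is a minimal sketch, assuming boto3's `opensearchserverless` client and an AWS-owned encryption key. The helper names and the `rag-bedrock-enc` policy name are hypothetical, and a network policy and a data access policy (not shown here) are still needed before the collection can be reached from the notebook.

```python
import json
import time


def encryption_policy(collection_name: str) -> str:
    """Build an encryption policy document that uses an AWS-owned key."""
    return json.dumps(
        {
            "Rules": [
                {
                    "ResourceType": "collection",
                    "Resource": [f"collection/{collection_name}"],
                }
            ],
            "AWSOwnedKey": True,
        }
    )


def create_collection(name: str = "rag-bedrock", region: str = "us-east-1") -> str:
    """Create a vector search collection and return its endpoint once active."""
    import boto3  # imported here so the policy helper works without AWS access

    aoss = boto3.client("opensearchserverless", region_name=region)
    # A collection needs a matching encryption policy before it can be created.
    aoss.create_security_policy(
        name=f"{name}-enc",  # hypothetical policy name
        type="encryption",
        policy=encryption_policy(name),
    )
    aoss.create_collection(name=name, type="VECTORSEARCH")
    # Poll until the collection leaves the CREATING state.
    while True:
        details = aoss.batch_get_collection(names=[name])["collectionDetails"]
        if details and details[0]["status"] != "CREATING":
            break
        time.sleep(15)
    if details[0]["status"] != "ACTIVE":
        raise RuntimeError(f"collection ended in status {details[0]['status']}")
    return details[0]["collectionEndpoint"]
```

Calling `create_collection()` with valid AWS credentials would return the endpoint URL to use as `host` when connecting for indexing.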

### Connect to OpenSearch for indexing

In [2]:
from opensearchpy import OpenSearch, AsyncOpenSearch, AsyncHttpConnection, AWSV4SignerAsyncAuth

In [3]:
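import boto3

# SigV4 signing scope: 'aoss' is the service name for OpenSearch Serverless.
service = 'aoss'
region = 'us-east-1'

# Use the default boto3 credential chain (environment, profile, or role).
session = boto3.Session(region_name=region)
credentials = session.get_credentials()

# Async signer that matches the AsyncHttpConnection used by the vector client.
auth = AWSV4SignerAsyncAuth(credentials, region, service)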

In [19]:
from llama_index.vector_stores.opensearch import (
    OpensearchVectorStore,
    OpensearchVectorClient,
)

host = 'https://abhpc33ml2wxb5lfm7mj.us-east-1.aoss.amazonaws.com'
index_name = 'cohere-index'
text_field = "content"
embedding_field = "embedding"

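# The extra keyword arguments (use_ssl, verify_certs, http_auth,
# connection_class) are passed through to the underlying opensearch-py client.
client = OpensearchVectorClient(
    host,
    index_name,
    1024,  # embedding dimension for cohere.embed-english-v3
    embedding_field=embedding_field,
    text_field=text_field,
    use_ssl=True,
    verify_certs=True,
    http_auth=auth,
    connection_class=AsyncHttpConnection,
)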
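
SigV4 signatures are scoped to a region and a service, so if the values given to the signer do not match the collection endpoint, requests are likely to be rejected with an authorization error. The sketch below is one way to catch that early. It assumes the endpoint follows the `<collection-id>.<region>.<service>.amazonaws.com` pattern seen in `host`, and `parse_aoss_endpoint` is a hypothetical helper, not part of any library.

```python
from typing import Tuple
from urllib.parse import urlparse


def parse_aoss_endpoint(endpoint: str) -> Tuple[str, str]:
    """Return (region, service) from '<id>.<region>.<service>.amazonaws.com'."""
    hostname = urlparse(endpoint).hostname or ""
    parts = hostname.split(".")
    if len(parts) != 5 or parts[-2:] != ["amazonaws", "com"]:
        raise ValueError(f"unexpected endpoint format: {endpoint!r}")
    _collection_id, region, service = parts[:3]
    return region, service


# Compare the endpoint's scope with the values used for signing.
endpoint_region, endpoint_service = parse_aoss_endpoint(
    "https://abhpc33ml2wxb5lfm7mj.us-east-1.aoss.amazonaws.com"
)
if (endpoint_region, endpoint_service) != ("us-east-1", "aoss"):
    raise ValueError("signer region/service do not match the endpoint")
```

In the notebook, the literal values would be replaced by `host`, `region`, and `service`.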

In [60]:
from llama_index.core import Settings
from llama_index.embeddings.bedrock import BedrockEmbedding

# Configure the embedding model before building the index: the index
# picks up Settings.embed_model when it is constructed.
Settings.embed_model = BedrockEmbedding(model="cohere.embed-english-v3")
Settings.chunk_size = 256

In [41]:
from llama_index.core import VectorStoreIndex

vector_store = OpensearchVectorStore(client)
index = VectorStoreIndex.from_vector_store(vector_store, verbose=True)
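
The dimension passed to `OpensearchVectorClient` (1024) has to match the length of the vectors the embedding model returns. A minimal sketch of a sanity check follows, assuming the embedding model exposes `get_text_embedding` as LlamaIndex embedding classes do. `check_embedding_dim` is a hypothetical helper, and calling it on the Bedrock model makes one real embedding request.

```python
def check_embedding_dim(embed_model, expected_dim: int = 1024) -> int:
    """Embed a short probe string and verify the vector length."""
    vector = embed_model.get_text_embedding("dimension probe")
    if len(vector) != expected_dim:
        raise ValueError(
            f"embedding has {len(vector)} dimensions, index expects {expected_dim}"
        )
    return len(vector)


# In the notebook (requires Bedrock access):
# check_embedding_dim(Settings.embed_model, expected_dim=1024)
```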

### Parsing & Ingestion

Todo: create an OpenSearch cluster. Is this the same as OpenSearch Serverless?
* It looks like Serverless is part of Amazon OpenSearch Service.

Todo: use LlamaParse and ingest the data into OpenSearch  
Todo: figure out how to ingest into OpenSearch

Reference: https://docs.llamaindex.ai/en/stable/examples/vector_stores/OpensearchDemo/?h=opensearch

In [56]:
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

--2024-04-17 22:21:52--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay.txt’

2024-04-17 22:21:53 (14.0 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]

In [57]:
from llama_index.core import SimpleDirectoryReader

docs = SimpleDirectoryReader("./data/paul_graham/").load_data()

In [58]:
for doc in docs:
    index.insert(doc, verbose=True)
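
The loop above stops at the first failed insert. For larger document sets it may help to record failures and keep going. The sketch below is one way to do that; it assumes only that the index object has an `insert(document)` method, and `insert_all` is a hypothetical helper rather than a LlamaIndex API.

```python
def insert_all(index, documents):
    """Insert documents one at a time, collecting failures instead of stopping.

    Returns (number_inserted, failures), where failures is a list of
    (position, error_description) tuples.
    """
    failed = []
    for position, document in enumerate(documents):
        try:
            index.insert(document)
        except Exception as exc:  # broad on purpose: record the error and continue
            failed.append((position, repr(exc)))
    return len(documents) - len(failed), failed


# In the notebook:
# inserted, failed = insert_all(index, docs)
# print(f"inserted {inserted} documents, {len(failed)} failed")
```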