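## Indexing data into AOSS

- In this section we index data into Amazon OpenSearch Serverless (AOSS).
- Libraries such as LangChain could be used for the whole workflow, but here LangChain is used only for simple data preprocessing.
- Indexing into AOSS and searching it are done with boto3 and opensearch-py.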

In [None]:
!pip install -q pypdf

In [33]:
!pip list | grep 'pypdf\|langchain'

langchain                 0.2.5
langchain-aws             0.1.7
langchain-community       0.2.5
langchain-core            0.2.9
langchain-text-splitters  0.2.1
pypdf                     4.2.0


In [None]:
import os

# data_path = os.path.join("sample-data", "mortgage_kr_guide.pdf")
# data_path = os.path.join("sample-data", "school_edu_guide.pdf")
data_path = os.path.join("sample-data", "cs50_with_ai.pdf")

In [None]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader(data_path)

In [None]:
pages = loader.load_and_split()

In [None]:
len(pages)
# print(pages[20].page_content)

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False,
)
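
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch using a plain fixed-size sliding window. It is a simplified stand-in, not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break on separators such as paragraphs and newlines, while this sketch ignores them. The function name `sliding_window_chunks` and the sample string are made up for illustration.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap.

    Illustration only: unlike RecursiveCharacterTextSplitter,
    this ignores separators entirely.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last four characters of the previous one, which is the role the 200-character overlap plays in the configuration above: text near a chunk boundary appears in both neighbouring chunks.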

In [None]:
documents = text_splitter.split_documents(pages)

In [None]:
print(f"Number of chunks after splitting: {len(documents)}")
print(f"Text sample: {documents[10].page_content}")

In [None]:
print(documents[10].metadata["source"])
print(documents[10].metadata["page"])

In [None]:
import json
import boto3

bedrock = boto3.client("bedrock-runtime")
embedding_model_id = "amazon.titan-embed-text-v2:0"
embedding_dimension = 1024

def get_embedding_output(query):
    """Return the Titan embedding for `query` as a list of floats, or False on error."""
    try:
        body = {
            "inputText": query,
            "dimensions": embedding_dimension,
            "normalize": True
        }

        response = bedrock.invoke_model(
            body=json.dumps(body), 
            modelId=embedding_model_id,
            accept='application/json',
            contentType='application/json')

        response_body = json.loads(response.get("body").read())
        embedding = response_body.get("embedding")
        return embedding
    except Exception as e:
        print(f"Error: {e}")
        return False
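
Because `get_embedding_output` returns a falsy value on any error, a failed chunk is simply skipped later on. If transient failures such as throttling are a concern (an assumption, nothing in this notebook shows them occurring), a retry wrapper along these lines could help. `embed_with_retry` and `flaky_embedder` are hypothetical names; the flaky embedder exists only so the sketch runs without calling Bedrock.

```python
import time


def embed_with_retry(embed_fn, text, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call embed_fn(text) until it returns a truthy result or attempts run out.

    embed_fn is expected to behave like get_embedding_output: a list of
    floats on success, a falsy value on failure.
    """
    for attempt in range(max_attempts):
        embedding = embed_fn(text)
        if embedding:
            return embedding
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * (2 ** attempt))
    return None


# Hypothetical embedder that fails twice, then succeeds (offline demo only).
calls = {"count": 0}


def flaky_embedder(text):
    calls["count"] += 1
    return [0.1, 0.2] if calls["count"] >= 3 else False


delays = []  # record requested delays instead of actually sleeping
result = embed_with_retry(flaky_embedder, "hello", sleep=delays.append)
print(result, delays)
```

In the indexing loop this would be used as `embed_with_retry(get_embedding_output, content)` in place of the direct call.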

In [None]:
data_list = []

for doc in documents:
    content = doc.page_content
    meta = doc.metadata
    embedding = get_embedding_output(content)
    
    if embedding and len(embedding) == embedding_dimension:
        data_list.append({
            "content": content,
            "content_embeddings": embedding,
            "metadata": meta,
        })
        print("Embedding created")
    else:
        print(f"Error: {content}")

In [None]:
print(f"Raw doc size: {len(documents)}")
print(f"Data to index size: {len(data_list)}")

In [None]:
%store -r

In [None]:
try:
    print(collection_name)
    print(vector_index_name)
    print(aoss_endpoint)
except NameError:
    collection_name = "rag-hol-aoss-collection"
    vector_index_name = "rag-hol-index-vector"
    aoss_endpoint = "1zo3f6fuhn7vowcv1ld7.us-west-2.aoss.amazonaws.com"
    

In [None]:
from opensearchpy import OpenSearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth
import boto3
import botocore
import time

import sagemaker

sess = sagemaker.Session()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

service = 'aoss'
credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key,
                   region, service, session_token=credentials.token)


In [None]:
def get_aoss_client(host):
    client = OpenSearch(
        hosts=[{'host': host, 'port': 443}],
        http_auth=awsauth,
        use_ssl=True,
        verify_certs=True,
        connection_class=RequestsHttpConnection,
        timeout=6000
    )
    return client

In [None]:
aoss_client = get_aoss_client(aoss_endpoint)


In [None]:
for data in data_list:
    try:
        response = aoss_client.index(index=vector_index_name, body=data)
        print(response)
    except Exception as e:
        print(f"Error: {e}")

### Searching the data

- After the steps above, the data is indexed. Items were inserted one at a time here, but in a real environment you should consider inserting them with [bulk](https://github.com/opensearch-project/opensearch-py/blob/main/guides/bulk.md) where appropriate.
- Once indexing has succeeded, you can run searches.
- The cells below test the most basic forms of semantic search and lexical search.
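
As a rough sketch of the bulk option mentioned above, the items in `data_list` can be wrapped as bulk actions and sent through the `helpers.bulk` function of opensearch-py. The helper names `build_bulk_actions` and `index_in_bulk` and the sample items are made up for illustration, and the sketch has not been run against an AOSS collection here.

```python
def build_bulk_actions(items, index_name):
    """Yield one bulk 'index' action per item; no document ID is set."""
    for item in items:
        yield {"_op_type": "index", "_index": index_name, "_source": item}


def index_in_bulk(client, items, index_name):
    """Send all items through opensearch-py's bulk helper.

    The import is lazy so the action builder above can be tried
    without opensearch-py installed.
    """
    from opensearchpy import helpers

    # raise_on_error=False collects per-document failures instead of raising.
    success_count, errors = helpers.bulk(
        client, build_bulk_actions(items, index_name), raise_on_error=False
    )
    return success_count, errors


# Offline check of the action format with made-up items.
sample_items = [
    {"content": "first chunk", "content_embeddings": [0.1, 0.2], "metadata": {"page": 0}},
    {"content": "second chunk", "content_embeddings": [0.3, 0.4], "metadata": {"page": 1}},
]
actions = list(build_bulk_actions(sample_items, "rag-hol-index-vector"))
print(len(actions), actions[0]["_index"])
```

With the client from this notebook, the call would be `index_in_bulk(aoss_client, data_list, vector_index_name)`, which returns the number of successfully indexed documents and a list of per-document errors.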


In [None]:
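# Newly indexed documents can take more than a minute to become available.
# The ID below comes from an earlier run; replace it with an `_id` printed by the indexing loop above.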
sample_out = aoss_client.get(index=vector_index_name, id="1%3A0%3AlaHCR5ABWp_sIC9zthBC")
print(sample_out)

In [None]:
# Korean query: "How should chatbots be used in education?"
vector = get_embedding_output("교육에서 챗봇을 어떻게 활용해야 하나요")

In [None]:
vector_query = {
  "query": {
    "knn": {
      "content_embeddings": {
        "vector": vector,
        "k": 5
      }
    }
  }
}

In [None]:
response = aoss_client.search(index=vector_index_name, body=vector_query, size=3)

In [None]:
vector_search_result = [result["_source"]["content"] for result in response["hits"]["hits"]]

In [None]:
vector_search_result
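
Conceptually, the `knn` query above asks for the `k` stored vectors closest to the query vector. The sketch below shows that idea as a brute-force ranking by cosine similarity over a few toy two-dimensional vectors. It is an illustration only: AOSS uses an approximate nearest-neighbour index, and the distance function depends on how the index was created, which this section does not show. The helper names and toy documents are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, docs, k):
    """Score every document against the query and keep the k best."""
    scored = [
        (cosine_similarity(query_vector, doc["content_embeddings"]), doc["content"])
        for doc in docs
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy documents using the same field names as data_list.
toy_docs = [
    {"content": "chatbots in class", "content_embeddings": [1.0, 0.0]},
    {"content": "mortgage rates", "content_embeddings": [0.0, 1.0]},
    {"content": "AI tutors", "content_embeddings": [0.8, 0.6]},
]
results = top_k([1.0, 0.0], toy_docs, k=2)
for score, content in results:
    print(f"{score:.2f} {content}")
```

In the query above, `k` limits how many neighbours the vector search collects, while `size=3` limits how many hits the response returns.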

In [None]:
# query_text = "교육에서 챗봇을 어떻게 활용해야 하나요"  # Korean version of the query below
query_text = "How to use chatbot for education?"
keyword_query = {"query": {"match": {"content": query_text}}}

In [None]:
response = aoss_client.search(index=vector_index_name, body=keyword_query, size=3)

In [None]:
keyword_search_results = [result["_source"]["content"] for result in response["hits"]["hits"]]

In [None]:
keyword_search_results

In [None]:
%store aoss_client