# Similarity search with LangChain and OpenAI

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/integrations/langchain/langchain-vector-store.ipynb)


This notebook shows an example of similarity search for a search query and demonstrates metadata filtering. First we use `langchain` to split the documents into chunks, then we index them into Elasticsearch via [`ElasticsearchStore.from_documents`](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.elasticsearch.ElasticsearchStore.html#langchain.vectorstores.elasticsearch.ElasticsearchStore.from_documents).

## Install packages and import modules


In [5]:
# install packages
%pip install langchain openai elasticsearch tiktoken

Note: you may need to restart the kernel to use updated packages.


In [6]:
# import modules
from getpass import getpass
from langchain.vectorstores import ElasticsearchStore
from langchain.embeddings.openai import OpenAIEmbeddings
from urllib.request import urlopen
from langchain.text_splitter import CharacterTextSplitter
import json

## Connect to Elasticsearch


We'll use ElasticsearchStore to connect to Elasticsearch, which makes it easy to create and index data. In the ElasticsearchStore instance, we set the embedding to OpenAIEmbeddings so the texts can be embedded, and we set the Elasticsearch index name used in this example.

In [7]:
ES_URL = input('Elasticsearch URL (e.g. https://127.0.0.1:9200): ')
ES_USER = "elastic"
ES_USER_PASSWORD = getpass('elastic user password: ')
CERT_PATH = input('Path to the Elasticsearch pem file: ')
# How to create the pem file: https://cdax.ch/2022/02/20/elasticsearch-python-workshop-1-the-basics/

# set OpenAI API key
OPENAI_API_KEY = getpass("OpenAI API key: ")


In [8]:
from elasticsearch import Elasticsearch

client = Elasticsearch(
    ES_URL,
    basic_auth=(ES_USER, ES_USER_PASSWORD),
    ca_certs=CERT_PATH
)

# Delete any existing index so the example starts from a clean state
if client.indices.exists(index="workplace_index"):
    client.indices.delete(index="workplace_index")

In [9]:
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

vector_store = ElasticsearchStore(
    embedding=embeddings,
    es_connection=client,
    index_name="workplace_index"
)


## Download the dataset

Let's download the sample dataset and deserialize the documents.

In [10]:
url = "https://raw.githubusercontent.com/elastic/elasticsearch-labs/main/example-apps/chatbot-rag-app/data/data.json"

response = urlopen(url)

workplace_docs = json.loads(response.read())
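
The later cells read four fields from each record: `name`, `summary`, `rolePermissions`, and `content`. As a quick sanity check, a sketch like the following could confirm that shape before chunking. The helper `missing_fields` and the sample record are hypothetical and invented for illustration; only the field names come from this notebook.

```python
# Field names taken from the chunking cell of this notebook.
EXPECTED_FIELDS = ("name", "summary", "rolePermissions", "content")

def missing_fields(record, expected=EXPECTED_FIELDS):
    """Return the expected field names that are absent from `record`."""
    return [field for field in expected if field not in record]

# Made-up record for illustration; not taken from the real dataset.
sample_record = {
    "name": "Example Policy",
    "summary": "A short summary.",
    "rolePermissions": ["demo", "manager"],
    "content": "Full text of the document.",
}

print(missing_fields(sample_record))
print(missing_fields({"name": "Only a name"}))
```

Against the real data, the same check could be run as `[missing_fields(doc) for doc in workplace_docs]`, where an empty list for every record means the chunking cell will not raise a `KeyError`.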

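## Split documents into passages

We'll use a simple splitter to chunk these documents into passages of up to 800 characters with no overlap. Note that `CharacterTextSplitter` measures `chunk_size` in characters by default, not tokens. Because it cuts only at its separator (`"\n\n"` by default), a single piece longer than 800 characters is kept whole, which is why the cell output reports a few oversized chunks.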


In [11]:
metadata = []
content = []

for doc in workplace_docs:
    content.append(doc["content"])
    metadata.append({
        "name": doc["name"],
        "summary": doc["summary"],
        "rolePermissions": doc["rolePermissions"],
    })

text_splitter = CharacterTextSplitter(chunk_size=800, chunk_overlap=0)
docs = text_splitter.create_documents(content, metadatas=metadata)

Created a chunk of size 866, which is longer than the specified 800
Created a chunk of size 1120, which is longer than the specified 800
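
The oversized-chunk warnings can be reproduced with a simplified split-then-merge sketch. This is not LangChain's actual implementation: the function `split_and_merge` is hypothetical, and it only illustrates the general idea of cutting at a separator and greedily packing the pieces.

```python
def split_and_merge(text, chunk_size, separator="\n\n"):
    """Greedy sketch: split on `separator`, then pack pieces into chunks
    of at most `chunk_size` characters. A single piece longer than
    `chunk_size` is kept whole, since there is nowhere else to cut."""
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Either the piece fits, or it is oversized and starts a chunk alone.
            current = candidate
        else:
            # Adding the piece would overflow: close the chunk, start a new one.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

# The 30-character run has no separator inside, so it exceeds chunk_size=12.
text = "aaaa\n\nbbbb\n\n" + "c" * 30 + "\n\ndd"
for chunk in split_and_merge(text, chunk_size=12):
    print(len(chunk), repr(chunk))
```

In this toy input the first two pieces are merged into one chunk, while the long run is emitted on its own even though it is larger than the requested size.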


## Index data into Elasticsearch

Next, we'll index the data into Elasticsearch using [ElasticsearchStore.from_documents](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.elasticsearch.ElasticsearchStore.html#langchain.vectorstores.elasticsearch.ElasticsearchStore.from_documents).


In [12]:
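# from_documents is a classmethod that builds and returns a new store,
# so it is called on the class rather than on the vector_store instance
documents = ElasticsearchStore.from_documents(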
    docs,
    embeddings,
    es_connection=client,
    index_name="workplace_index"
)

## Result function

Next, we create a small function that displays query results as human-readable output. The examples below use it to show their results.

In [13]:
def showResults(output):
    print("Total results: ", len(output))
    for doc in output:
        print(doc)

## Query the dataset using similarity_search

Now that the sample data is indexed in Elasticsearch, we'll run a similarity search for the query `How does the compensation work?`. By default, it returns the top `4` documents.

In [14]:
query = "How does the compensation work?"
results = documents.similarity_search(query)

showResults(results)

Total results:  4
page_content='Compensation Bands:\nBased on the job levels, the following compensation bands have been established:\na. Entry-Level Band: This band encompasses salary ranges for employees in entry-level positions. It aims to provide competitive compensation for individuals starting their careers within the company.\n\nb. Intermediate-Level Band: This band covers salary ranges for employees who have gained moderate experience and expertise in their respective roles. It rewards employees for their growing skill set and contributions.\n\nc. Senior-Level Band: The senior-level band includes salary ranges for experienced employees who have attained advanced skills and have a proven track record of delivering results. It reflects the increased responsibilities and expectations placed upon these individuals.' metadata={'name': 'Compensation Framework For It Teams', 'summary': 'This document outlines a compensation framework for IT teams. It includes job levels, compensation 
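
Conceptually, `similarity_search` embeds the query and returns the `k` stored chunks whose vectors are closest to it. Elasticsearch performs this with its own vector search; the brute-force cosine ranking below is only a sketch of the idea. The vectors are made up, and the helpers `cosine_similarity` and `top_k` are hypothetical, not part of LangChain or Elasticsearch.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, indexed, k=4):
    """Return the k (text, score) pairs most similar to the query."""
    scored = [(text, cosine_similarity(query_vector, vec)) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 2-dimensional "embeddings", invented for illustration.
indexed = [
    ("compensation bands", [0.9, 0.1]),
    ("vacation policy", [0.1, 0.9]),
    ("salary review", [0.8, 0.3]),
]
for text, score in top_k([1.0, 0.0], indexed, k=2):
    print(f"{score:.3f}  {text}")
```

Raising `k` simply keeps more entries from the same ranked list, so each result set for a larger `k` starts with the results returned for a smaller one.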

## Query the dataset to show the top 10 documents

Now we set `k=10` and run the same query to see the top `10` documents.



In [15]:
query = "How does the compensation work?"
results = documents.similarity_search(query, k=10)

showResults(results)

Total results:  10
page_content='Compensation Bands:\nBased on the job levels, the following compensation bands have been established:\na. Entry-Level Band: This band encompasses salary ranges for employees in entry-level positions. It aims to provide competitive compensation for individuals starting their careers within the company.\n\nb. Intermediate-Level Band: This band covers salary ranges for employees who have gained moderate experience and expertise in their respective roles. It rewards employees for their growing skill set and contributions.\n\nc. Senior-Level Band: The senior-level band includes salary ranges for experienced employees who have attained advanced skills and have a proven track record of delivering results. It reflects the increased responsibilities and expectations placed upon these individuals.' metadata={'name': 'Compensation Framework For It Teams', 'summary': 'This document outlines a compensation framework for IT teams. It includes job levels, compensation

## Query the dataset with metadata filtering

Now we add metadata filtering by keyword at query time, matching `rolePermissions` to `manager`.

In [16]:
query = "How does the compensation work?"
results = documents.similarity_search(query, filter={"match": {"metadata.rolePermissions": "manager"}})

showResults(results)


Total results:  4
page_content='Compensation Bands:\nBased on the job levels, the following compensation bands have been established:\na. Entry-Level Band: This band encompasses salary ranges for employees in entry-level positions. It aims to provide competitive compensation for individuals starting their careers within the company.\n\nb. Intermediate-Level Band: This band covers salary ranges for employees who have gained moderate experience and expertise in their respective roles. It rewards employees for their growing skill set and contributions.\n\nc. Senior-Level Band: The senior-level band includes salary ranges for experienced employees who have attained advanced skills and have a proven track record of delivering results. It reflects the increased responsibilities and expectations placed upon these individuals.' metadata={'name': 'Compensation Framework For It Teams', 'summary': 'This document outlines a compensation framework for IT teams. It includes job levels, compensation 
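
The effect of the filter can be pictured as narrowing the candidate set to chunks whose metadata matches before they are ranked. Elasticsearch applies the filter as part of the search itself, so the plain-Python sketch below only illustrates the outcome. It assumes `rolePermissions` is a list of role strings; the helper `filter_by_role` and the sample documents are hypothetical.

```python
def filter_by_role(docs, role):
    """Keep only documents whose metadata lists `role` in rolePermissions."""
    return [d for d in docs if role in d["metadata"].get("rolePermissions", [])]

# Made-up documents for illustration; not taken from the real dataset.
docs = [
    {"text": "compensation bands", "metadata": {"rolePermissions": ["manager"]}},
    {"text": "office map", "metadata": {"rolePermissions": ["demo"]}},
    {"text": "salary review", "metadata": {"rolePermissions": ["demo", "manager"]}},
]

visible = filter_by_role(docs, "manager")
print([d["text"] for d in visible])
```

Only the documents that pass the filter would then compete for the top `k` places, which is why a filtered query can return different chunks than the unfiltered one.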