Feedback to shins777@gmail.com

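This example shows how to use Google BigQuery as a vector database.
Many commercial and open-source vector databases are available, but combining BigQuery's broad feature set with vector-database-specific capabilities can improve search performance and make for an efficient development environment.
The embedding model used here is Google's Gecko embedding model.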

This example is written against the LangChain API.

# Install libraries

In [None]:
!pip install --upgrade --quiet langchain langchain-community langchain-google-vertexai google-cloud-aiplatform google-cloud-bigquery

# GCP authentication and environment setup

In [None]:
from google.colab import auth
auth.authenticate_user()

In [None]:
PROJECT_ID="ai-hangsik"
REGION="asia-northeast3"
MODEL = "gemini-pro"

# Set and show the GCP project
!gcloud config set project {PROJECT_ID}
!gcloud config get-value project

# Embedding setup / dataset and table configuration

*   https://api.python.langchain.com/en/v0.0.339/embeddings/langchain.embeddings.vertexai.VertexAIEmbeddings.html#



In [None]:
from langchain_google_vertexai import VertexAIEmbeddings

EMBEDDING_MODEL = "textembedding-gecko-multilingual@latest"

embedding = VertexAIEmbeddings(
    model_name=EMBEDDING_MODEL, project=PROJECT_ID
)

In [None]:
from google.cloud import bigquery

DATASET = "vector_db"
TABLE = "vector_table"

client = bigquery.Client(project=PROJECT_ID, location=REGION)
client.create_dataset(dataset=DATASET, exists_ok=True)



*   https://python.langchain.com/docs/integrations/vectorstores/bigquery_vector_search
*   https://api.python.langchain.com/en/stable/vectorstores/langchain_community.vectorstores.bigquery_vector_search.BigQueryVectorSearch.html#langchain_community.vectorstores.bigquery_vector_search.BigQueryVectorSearch



In [None]:
from langchain_community.vectorstores.utils import DistanceStrategy
from langchain_community.vectorstores import BigQueryVectorSearch

table = BigQueryVectorSearch(
    project_id=PROJECT_ID,
    dataset_name=DATASET,
    table_name=TABLE,
    location=REGION,
    embedding=embedding,

    #https://api.python.langchain.com/en/stable/vectorstores/langchain_community.vectorstores.utils.DistanceStrategy.html#langchain_community.vectorstores.utils.DistanceStrategy
    distance_strategy=DistanceStrategy.COSINE

)
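
The `distance_strategy` argument controls how a query vector is compared with the stored vectors. As a rough, library-independent illustration (plain Python, not BigQuery's implementation), the sketch below computes cosine distance and Euclidean distance for the same toy vectors. The vectors and function names are invented for this example.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity: compares direction only, ignoring vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance: sensitive to both direction and length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query_vec = [1.0, 0.0]
same_direction = [3.0, 0.0]  # same direction as the query, but three times longer
nearby_point = [1.0, 0.5]    # close in space, slightly different direction

print(cosine_distance(query_vec, same_direction), cosine_distance(query_vec, nearby_point))
print(euclidean_distance(query_vec, same_direction), euclidean_distance(query_vec, nearby_point))
```

Under cosine distance `same_direction` is the closer vector, while under Euclidean distance `nearby_point` is, so the choice of strategy can change the ranking.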

# Text file to embed

Put the information you want to embed in this text file.

In [None]:
import pandas as pd

terms = pd.read_csv('./term1.csv',sep="|", encoding='utf-8-sig')
terms

The `context` column holds the text or paragraphs to be embedded.
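
The notebook reads `./term1.csv` with `|` as the separator and uses a `context` column (a `context_title` column also appears in a commented-out metadata line). A minimal sketch of a file in that assumed layout, parsed with the standard library only; the two rows are invented sample data, not the contents of the real file:

```python
import csv
import io

# Invented sample content in the assumed layout: a header row, then one
# pipe-separated record per line.
sample = (
    "context_title|context\n"
    "Article 1|The purpose of this agreement is to define the terms of service.\n"
    "Article 2|The customer may cancel the contract within 14 days.\n"
)

rows = list(csv.DictReader(io.StringIO(sample), delimiter="|"))

# The texts to embed, plus optional per-text metadata in matching order.
all_texts = [row["context"] for row in rows]
metadatas = [{"context_title": row["context_title"]} for row in rows]

print(all_texts)
print(metadatas)
```

With this layout, any passage that itself contains a `|` character would need to be quoted in the file.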

In [None]:
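# Texts to embed: one entry per row of the `context` column.
all_texts = terms['context'].to_list()

# Optional: attach each row's title as metadata.
#metadatas = [ {'context_title': row['context_title'] } for idx, row in terms.iterrows()]
#table.add_texts(all_texts, metadatas=metadatas)

# Embed the texts and store them in the BigQuery table.
table.add_texts(all_texts)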

# Querying data by sentence similarity


*   https://api.python.langchain.com/en/stable/vectorstores/langchain_community.vectorstores.bigquery_vector_search.BigQueryVectorSearch.html#langchain_community.vectorstores.bigquery_vector_search.BigQueryVectorSearch.similarity_search



In [None]:
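import time

start = time.time()

query = "Enter your question here."

# brute_force=True requests an exact search rather than an index-based one.
docs = table.similarity_search(query, k=5, brute_force=True)

for doc in docs:
  print(doc.page_content)

# Report how long the search took, in seconds.
elapsed = time.time() - start
print(elapsed)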
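
`brute_force=True` asks for an exact search that compares the query against every stored row instead of going through a vector index. Conceptually that is a full scan followed by a top-k selection, as in this plain-Python sketch. The rows, vectors, and function names are invented; BigQuery performs the real work server-side.

```python
import heapq
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))

def brute_force_search(query_vec, rows, k):
    """Score every row against the query, then keep the k closest."""
    scored = ((cosine_distance(query_vec, vec), text) for text, vec in rows)
    return heapq.nsmallest(k, scored)

# Toy "table": (text, embedding) pairs with made-up 3-dimensional vectors.
rows = [
    ("refund policy", [0.9, 0.1, 0.0]),
    ("shipping times", [0.1, 0.9, 0.1]),
    ("cancellation terms", [0.8, 0.3, 0.1]),
]

top = brute_force_search([1.0, 0.0, 0.0], rows, k=2)
for distance, text in top:
    print(f"{distance:.3f}  {text}")
```

The cost of this approach grows linearly with the number of rows, which is why an index-based approximate search is usually preferred for large tables.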

# Querying with a vector

In [None]:
query_vector = embedding.embed_query(query)
docs = table.similarity_search_by_vector(query_vector, k=5)
for doc in docs:
  print(doc.page_content)

# Optimizing for query similarity and diversity

*   https://api.python.langchain.com/en/stable/vectorstores/langchain_community.vectorstores.bigquery_vector_search.BigQueryVectorSearch.html#langchain_community.vectorstores.bigquery_vector_search.BigQueryVectorSearch.max_marginal_relevance_search



In [None]:
docs = table.max_marginal_relevance_search(
    query=query,
    k=5,
    fetch_k=30,
    lambda_mult=0.5,
    brute_force=True,
)
for doc in docs:
  print(doc.page_content)
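
Maximal marginal relevance (MMR) first fetches `fetch_k` candidates by similarity, then picks `k` of them while trading relevance to the query against redundancy with what is already selected; `lambda_mult` sets that trade-off. The following is a sketch of the general idea in plain Python, not LangChain's exact code. The toy 2-D vectors and the helper `unit` are invented, and `candidates` stands in for the fetched results.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def mmr(query_vec, candidates, k, lambda_mult):
    """Return indices of up to k candidates chosen by maximal marginal relevance.

    lambda_mult=1.0 ranks purely by similarity to the query; lower values
    penalize candidates that resemble ones already selected.
    """
    relevance = [cosine_similarity(query_vec, c) for c in candidates]
    selected = []
    while len(selected) < min(k, len(candidates)):
        best_idx, best_score = None, -math.inf
        for i, cand in enumerate(candidates):
            if i in selected:
                continue
            # Redundancy: highest similarity to anything already picked.
            redundancy = max(
                (cosine_similarity(cand, candidates[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
    return selected

def unit(deg):
    """Hypothetical helper: 2-D unit vector at the given angle in degrees."""
    rad = math.radians(deg)
    return [math.cos(rad), math.sin(rad)]

query_vec = unit(0)
candidates = [unit(10), unit(12), unit(-40)]  # two near-duplicates, one outlier

print(mmr(query_vec, candidates, k=2, lambda_mult=1.0))
print(mmr(query_vec, candidates, k=2, lambda_mult=0.5))
```

With `lambda_mult=1.0` the two near-duplicates are returned; with `lambda_mult=0.5` the second pick becomes the outlier, because the near-duplicate is penalized for resembling the first pick.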

# Similarity with Score

*   https://api.python.langchain.com/en/stable/vectorstores/langchain_community.vectorstores.bigquery_vector_search.BigQueryVectorSearch.html#langchain_community.vectorstores.bigquery_vector_search.BigQueryVectorSearch.similarity_search_with_relevance_scores




In [None]:
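# Each result is a (Document, relevance_score) tuple.
tuples = table.similarity_search_with_relevance_scores(query, k=5)

# Map score -> passage text. Results with identical scores overwrite each other.
context = {}

for doc, score in tuples:
    context[score] = doc.page_content
    # print(f"==[{score}]==")
    # print(doc.page_content)

context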
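
The loop above keys `context` by score, so two documents with an identical score would overwrite each other. One possible alternative keeps the results in a list and applies an optional score cutoff. This sketch uses a stand-in `Doc` class and invented scores rather than the notebook's data; `min_score` is a hypothetical parameter, and it assumes that a higher score means a more relevant document.

```python
from collections import namedtuple

# Stand-in for a LangChain Document; only page_content is used here.
Doc = namedtuple("Doc", ["page_content"])

def collect_context(scored_docs, min_score=0.0):
    """Keep (document, score) pairs at or above min_score, best first.

    Assumes higher scores mean more relevant documents.
    """
    kept = [(doc, score) for doc, score in scored_docs if score >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [{"score": score, "text": doc.page_content} for doc, score in kept]

scored_docs = [
    (Doc("Clause A"), 0.71),
    (Doc("Clause B"), 0.83),
    (Doc("Clause C"), 0.71),  # same score as Clause A; a dict keyed by score would drop one
    (Doc("Clause D"), 0.40),
]

context_items = collect_context(scored_docs, min_score=0.5)
print(context_items)
```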

# Running Gemini Pro - BigQuery as a Grounding Service


Responsible AI setting
*   HarmCategory : https://cloud.google.com/vertex-ai/docs/reference/rest/v1/HarmCategory
*   HarmBlockThreshold : https://cloud.google.com/php/docs/reference/cloud-ai-platform/0.31.0/V1.SafetySetting.HarmBlockThreshold

In [None]:
from langchain_google_vertexai import HarmBlockThreshold, HarmCategory

safety_settings = {
    HarmCategory.HARM_CATEGORY_UNSPECIFIED: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_ONLY_HIGH,
    HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
    HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_NONE,
}

*   VertexAI API : https://api.python.langchain.com/en/stable/llms/langchain_google_vertexai.llms.VertexAI.html#langchain_google_vertexai.llms.VertexAI

In [None]:
from langchain_google_vertexai.llms import VertexAI

gemini_pro = VertexAI(
    model_name=MODEL,
    project=PROJECT_ID,
    location=REGION,
    verbose=True,
    streaming=False,
    safety_settings=safety_settings,
    temperature=0.2,
    top_p=1,
    top_k=40,
)

In [None]:
from langchain.prompts import PromptTemplate

query = "Enter your question here."

prompt = PromptTemplate.from_template("""

  You are an AI assistant that provides legal advice.
  Answer the Question below by reasoning step by step, based strictly on the individual items in the Context, and explain the grounds for your answer.
  Context : {context}
  Question : {question}

  """)

# Fill the template with the retrieved context and the question.
prompt_text = prompt.format(context=context,
                            question=query)

print(f"Prompt : {prompt_text}")
print(f"Answer : {gemini_pro.invoke(prompt_text)}")
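
Because `context` is a Python dict, `prompt.format(context=context, ...)` inserts its `repr`, braces and all, into the prompt. A possible refinement is to render the retrieved passages as a numbered list first. The sketch below uses plain `str.format` and invented passages so that it runs without LangChain; `TEMPLATE` and `render_context` are hypothetical names.

```python
# Simplified stand-in for the notebook's prompt template.
TEMPLATE = (
    "Answer the Question using only the Context.\n"
    "Context :\n{context}\n"
    "Question : {question}\n"
)

def render_context(passages):
    """Turn a list of passages into a numbered block, one passage per line."""
    return "\n".join(f"[{i}] {text}" for i, text in enumerate(passages, start=1))

# Invented passages standing in for the retrieved documents.
passages = [
    "The customer may cancel within 14 days.",
    "Refunds take 5 business days.",
]

prompt_text = TEMPLATE.format(
    context=render_context(passages),
    question="How long do I have to cancel?",
)
print(prompt_text)
```

Numbering the passages also gives the model something concrete to cite when it explains the grounds for its answer.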
