### Pipeline Overview
1. Outlook API → collect JSON data
2. Strip HTML and convert to txt → load into Azure Data Lake Storage Gen2
3. Load the txt files in Databricks and split them with RecursiveCharacterTextSplitter
4. Generate OpenAI embeddings
5. Build the vector DB in Azure AI Search
6. Assemble the RAG pipeline (retrieval → GPT-4o-based answer)
7. Deploy the Streamlit chatbot UI

1. Outlook API → collect JSON data
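
The notebook does not show the collection code for this step. The following is a minimal sketch, assuming the mail is read through Microsoft Graph's message listing endpoint and that the caller supplies a `fetch_page` function performing the authenticated HTTP GET. The names `iter_messages`, `save_messages`, and `fetch_page` are hypothetical. It shows how paged results can be followed through `@odata.nextLink` and written out as JSON.

```python
import json
from typing import Callable, Dict, Iterator, List, Optional

# Assumed endpoint: Microsoft Graph's message listing for the signed-in user.
GRAPH_MESSAGES_URL = "https://graph.microsoft.com/v1.0/me/messages"


def iter_messages(
    fetch_page: Callable[[str], Dict],
    first_url: str = GRAPH_MESSAGES_URL,
) -> Iterator[Dict]:
    """Yield every message, following '@odata.nextLink' until the last page."""
    url: Optional[str] = first_url
    while url:
        # One JSON page looks like {"value": [...], "@odata.nextLink": "..."}.
        page = fetch_page(url)
        yield from page.get("value", [])
        # The next-page link is absent on the final page, which ends the loop.
        url = page.get("@odata.nextLink")


def save_messages(messages: List[Dict], path: str) -> None:
    """Write the collected messages to a JSON file, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(messages, f, ensure_ascii=False, indent=2)
```

In practice `fetch_page` could wrap an HTTP client call that sends a bearer token and returns the parsed JSON body. Keeping it as a parameter lets the paging logic be exercised without network access.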

2. Strip HTML and convert to txt → load into Azure Data Lake Storage Gen2
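
The HTML-stripping code is not included in the notebook, so the method actually used is unknown. As one possible approach, the sketch below uses only the standard library's `html.parser` to keep visible text and drop `<script>` and `<style>` contents. The names `_TextExtractor` and `html_to_text` are hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag in ("br", "p", "div"):
            # Treat block-level tags as line breaks so paragraphs stay separated.
            self._parts.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self) -> str:
        # Trim each line and drop the empty ones.
        lines = (line.strip() for line in "".join(self._parts).splitlines())
        return "\n".join(line for line in lines if line)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML mail body."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()
```

The resulting string can then be written to a `.txt` file and uploaded to the storage account configured in the cells that follow.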

In [None]:
# Read the service principal secret from a Databricks secret scope (scope and key left blank here).
# Avoid printing the secret value.
service_credential = dbutils.secrets.get(scope="",key="")

# Configure OAuth (service principal) access to the ADLS Gen2 account for this Spark session.
spark.conf.set("fs.azure.account.auth.type.<storagename>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storagename>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storagename>.dfs.core.windows.net", "d6297dc2-2fc5-478a-9584-2a9e8cae2cf1")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storagename>.dfs.core.windows.net", service_credential)
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storagename>.dfs.core.windows.net", "https://login.microsoftonline.com/785087ba-1e72-4e7d-b1d1-4a9639137a66/oauth2/token")

# These calls only set configuration; access is not verified until the first read.
print("Data Lake Storage Gen2 OAuth configuration set")

In [None]:
configs = {"fs.azure.account.auth.type": "OAuth",
          "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
          "fs.azure.account.oauth2.client.id": "d6297dc2-2fc5-478a-9584-2a9e8cae2cf1",
          "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="key-vault-secret3",key="secretkv22"),
          "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/785087ba-1e72-4e7d-b1d1-4a9639137a66/oauth2/token"}

# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
  source = "abfss://datalake-container@storagezb.dfs.core.windows.net/",
  mount_point = "/mnt/my-mount",
  extra_configs = configs)

In [None]:
mount_point = "/mnt/my-mount"

# List the files and directories under the mount point.
files = dbutils.fs.ls(mount_point)
files

3. Load the txt files in Databricks and split them with RecursiveCharacterTextSplitter

In [None]:

%pip install langchain langchain-community langchain-text-splitters langchain-openai python-dotenv

In [None]:
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,
    chunk_overlap=200,
    length_function=len
)
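file_path = ''  # path to a txt file under the mount point (left blank here)
loader = TextLoader(file_path=file_path)
# Load the file and split it into overlapping chunks of at most 1500 characters.
document_list = loader.load_and_split(text_splitter=text_splitter)
document_list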
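
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a simplified fixed-window splitter. It is not the algorithm `RecursiveCharacterTextSplitter` uses, which also tries to break at separators such as paragraph and line boundaries. The function name `sliding_window_chunks` is hypothetical.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With the notebook's settings (`chunk_size=1500`, `chunk_overlap=200`), each window in this sketch would advance by 1300 characters, so the last 200 characters of one chunk reappear at the start of the next.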

4. Generate OpenAI embeddings

In [None]:
from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings

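# Load environment variables (such as OPENAI_API_KEY) from a .env file, if one exists.
load_dotenv()

# The API key is left blank here; avoid hardcoding a real key in the notebook.
embedding = OpenAIEmbeddings(model='text-embedding-3-large', openai_api_key='')
embedding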

In [None]:
%pip install --upgrade --quiet azure-search-documents azure-identity

5. Build the vector DB in Azure AI Search

In [None]:
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchableField,
    SearchFieldDataType,
    SearchIndex,
    SimpleField,
)

# Initialize the search index client (endpoint, key, and index name left blank here)
service_endpoint = ''
key = ''
index_name = ''

index_client = SearchIndexClient(service_endpoint, AzureKeyCredential(key))

# Define the index schema. SimpleField is not full-text searchable,
# so the text fields that need search use SearchableField instead.
# Note: this schema has no vector field. The LangChain AzureSearch store below
# creates its own index (including a vector field) when the index it is given does not exist.
fields = [
    SimpleField(name="id", type=SearchFieldDataType.String, key=True, filterable=True),
    SearchableField(name="source", filterable=True),
    SearchableField(name="page_content"),
]

index = SearchIndex(name=index_name, fields=fields)

# Create the index
index_client.create_index(index)

In [None]:

from langchain_community.vectorstores.azuresearch import AzureSearch

# `embedding` is the OpenAIEmbeddings instance created in step 4.
vector_store = AzureSearch(azure_search_endpoint='',
                            azure_search_key='',
                            index_name='',
                            embedding_function=embedding.embed_query)

# Embed each chunk and upload it to the index.
vector_store.add_documents(documents=document_list)
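
Conceptually, a vector store ranks stored chunks by how close their embeddings are to the query embedding. The sketch below does this by brute force with cosine similarity over plain Python lists. It is only an illustration of the idea, not how Azure AI Search indexes or ranks vectors, and the names `cosine_similarity` and `top_k` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    k: int = 4,
) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k most similar document vectors."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # highest similarity first
    return scored[:k]
```

Here `query_vec` would correspond to the embedding of the question, and `doc_vecs` to the embeddings of the chunks in `document_list`.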

6. Assemble the RAG pipeline (retrieval → GPT-4o-based answer)

In [None]:
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain import hub

# Korean query: "Show me the mail from Kim Minju"
query = '김민주한테 온 메일 알려줘'

# Perform a similarity search to inspect what the retriever returns
docs = vector_store.similarity_search(
    query=query,
    search_type="similarity",
)


# llm.invoke alone was enough before, but now the retrieved documents must be passed along with the question, so add a prompt.
llm = ChatOpenAI(model='gpt-4o')
prompt = hub.pull('rlm/rag-prompt')

qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vector_store.as_retriever(),
    chain_type_kwargs={"prompt": prompt}
)

# Calling the chain object directly is deprecated; use invoke() instead.
ai_message = qa_chain.invoke({'query': query})
ai_message
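
`RetrievalQA` with its default "stuff" chain type places the retrieved documents into the prompt's context slot before calling the LLM. The sketch below mimics that assembly step in plain Python. `Doc`, `TEMPLATE`, and `build_prompt` are hypothetical stand-ins, and the template wording is an assumption, not the text of `rlm/rag-prompt`.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


# Hypothetical template; the real 'rlm/rag-prompt' wording is not reproduced here.
TEMPLATE = (
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\nAnswer:"
)


def build_prompt(question: str, docs: List[Doc]) -> str:
    """Join the retrieved chunks into one context block and fill the template."""
    context = "\n\n".join(d.page_content for d in docs)
    return TEMPLATE.format(context=context, question=question)
```

The string returned by `build_prompt` is what would be sent to the chat model in this simplified picture; the real chain also handles message roles and output parsing.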

7. Deploy the Streamlit chatbot UI

In [None]:
!streamlit run chat.py --server.runOnSave true