<a href="https://colab.research.google.com/github/sugarforever/WTFAcademyChatBot/blob/main/WTFAcademyChatBotChroma.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

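This Python notebook demonstrates how to combine langchain's QA chain with Chroma to implement semantic search over a blog.
To use it, create a `.env` file locally and set a valid OpenAI API key in it, as shown in `.env.example`.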
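
As a rough illustration of what reading a `.env` file involves, here is a minimal sketch using only the standard library. The helper `load_env_file` is hypothetical, and the variable name `OPENAI_API_KEY` inside the file is an assumption, since the contents of `.env.example` are not shown here. A dedicated package would handle quoting and edge cases more thoroughly.

```python
import os
import tempfile

def load_env_file(path):
    """Parse simple KEY=VALUE lines. Hypothetical helper, not part of langchain."""
    values = {}
    with open(path, encoding="utf-8") as f:
        for raw in f:
            line = raw.strip()
            # Skip blank lines, comments, and lines without '='.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values

# Demo with a throwaway file; the key below is a dummy value (assumption).
with tempfile.TemporaryDirectory() as tmp:
    env_path = os.path.join(tmp, ".env")
    with open(env_path, "w", encoding="utf-8") as f:
        f.write("# example\nOPENAI_API_KEY=sk-dummy\n")
    env = load_env_file(env_path)

print(env["OPENAI_API_KEY"])
```

The parsed value could then be passed wherever the notebook expects the API key.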

In [None]:
%pip install openai
%pip install chromadb
%pip install langchain
%pip install unstructured

# Document preprocessing

In [None]:
from langchain.document_loaders import DirectoryLoader

In [None]:
def load_all_courses(directory):
  # Load every Markdown file under `directory`, including subdirectories.
  loader = DirectoryLoader(directory, glob="**/*.md")
  docs = loader.load()

  return docs

docs = load_all_courses("/Users/wushangcheng/Desktop/Archive")
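
To see which files a pattern like `**/*.md` selects, the sketch below builds a small temporary directory tree and applies the same pattern with `pathlib`. The file names are made up for the demo, and it assumes the loader's glob follows `pathlib` semantics.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical layout: one top-level file and one subdirectory.
    (root / "docker").mkdir()
    (root / "intro.md").write_text("# Intro", encoding="utf-8")
    (root / "docker" / "mysql.md").write_text("# MySQL", encoding="utf-8")
    (root / "docker" / "notes.txt").write_text("skip me", encoding="utf-8")

    # "**/*.md" matches Markdown files in root and in every subdirectory.
    matched = sorted(p.relative_to(root).as_posix() for p in root.glob("**/*.md"))

print(matched)
```

Only the two `.md` files are selected; `notes.txt` is ignored.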

In [None]:
print(f'You have {len(docs)} document(s) in your data')

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
split_docs = text_splitter.split_documents(docs)

In [None]:
print(f'Now you have {len(split_docs)} documents')
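
The effect of `chunk_size` and `chunk_overlap` can be illustrated with a deliberately naive fixed-width splitter. This is a simplified sketch, not the algorithm `RecursiveCharacterTextSplitter` uses (that one tries to break on separators such as paragraphs and lines first); `split_fixed` is a hypothetical helper.

```python
def split_fixed(text, chunk_size, chunk_overlap=0):
    """Naive fixed-width splitter. Hypothetical helper for illustration only."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Each new chunk starts `step` characters after the previous one.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "a" * 2500
no_overlap = split_fixed(text, chunk_size=1000, chunk_overlap=0)
with_overlap = split_fixed(text, chunk_size=1000, chunk_overlap=200)

print([len(c) for c in no_overlap])
print([len(c) for c in with_overlap])
```

With no overlap, neighbouring chunks share no text, so a sentence cut at a chunk boundary loses its context. A non-zero overlap trades some duplication for continuity.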

# Embedding - OpenAI

In [None]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
import os

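# Read the key from the environment; "123" is only a placeholder and must be replaced with a valid key.
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "123")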

In [None]:
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

In [None]:
persist_directory = 'blog_vector_storage'

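This step is a one-time initialization: it converts the documents into vectors and stores them on disk so that later runs can load them directly. (It only needs to be executed once.)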

In [None]:
# vectorstore = Chroma.from_documents(split_docs, embeddings, persist_directory=persist_directory)
# vectorstore.persist()
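
Rather than commenting the initialization in and out by hand, one option is to check whether the persist directory already holds data. The sketch below shows only that check. `needs_indexing` is a hypothetical helper, and it makes no assumption about which files Chroma writes: any non-empty directory is treated as already initialized.

```python
import os
import tempfile

def needs_indexing(persist_directory):
    """Return True when the directory is missing or empty (hypothetical helper)."""
    return not os.path.isdir(persist_directory) or not os.listdir(persist_directory)

with tempfile.TemporaryDirectory() as tmp:
    store = os.path.join(tmp, "blog_vector_storage")
    first_run = needs_indexing(store)      # directory does not exist yet
    os.makedirs(store)
    with open(os.path.join(store, "index.bin"), "w") as f:
        f.write("placeholder")             # stand-in for persisted data
    second_run = needs_indexing(store)

print(first_run, second_run)
```

In the notebook, the embedding step would run only when the check returns `True`.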

In [None]:
# Load the vectorstore from disk
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embeddings)

# Query: "How do I create a MySQL container with Docker?"
query = "docker 如何创建一个mysql容器？"
docs = vectordb.similarity_search(query)
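
Conceptually, a similarity search ranks stored vectors by how close they are to the query vector. The sketch below uses cosine similarity on hand-written three-dimensional vectors. Both the vectors and the file names are invented for the demo, and the metric Chroma actually applies depends on its configuration, so treat this as an illustration of the idea only.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, corpus, k=2):
    """Return the names of the k vectors most similar to the query."""
    scored = sorted(
        corpus.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]

# Toy embeddings (assumption): real embeddings have many more dimensions.
corpus = {
    "docker-mysql.md": [0.9, 0.1, 0.0],
    "git-basics.md": [0.0, 0.2, 0.9],
    "docker-compose.md": [0.7, 0.6, 0.1],
}
query_vec = [1.0, 0.2, 0.0]
results = top_k(query_vec, corpus)
print(results)
```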

Once the relevant context has been retrieved via the embeddings, we can start the QA step.

In [None]:
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [None]:
llm = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)
chain = load_qa_chain(llm, chain_type="map_reduce")

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
split_docs = text_splitter.split_documents(docs)

chain.run(input_documents=split_docs, question=query)
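
The `map_reduce` chain type follows a general pattern: process each chunk independently, then combine the partial results into one answer. The sketch below mimics that shape with plain string functions in place of LLM calls. `map_step` and `reduce_step` are hypothetical stand-ins, not langchain's implementation.

```python
def map_step(chunk, question):
    """Stand-in for one LLM call per chunk: keep sentences that mention the keyword."""
    keyword = question.lower()
    return [s.strip() for s in chunk.split(".") if keyword in s.lower()]

def reduce_step(partials):
    """Stand-in for the final LLM call: merge partial results, dropping duplicates."""
    merged = []
    for partial in partials:
        for sentence in partial:
            if sentence not in merged:
                merged.append(sentence)
    return ". ".join(merged)

# Made-up chunks for the demo.
chunks = [
    "Docker runs containers. Use docker run to start MySQL.",
    "Git tracks changes. MySQL needs a root password.",
]
partials = [map_step(c, "mysql") for c in chunks]
answer = reduce_step(partials)
print(answer)
```

Because each chunk is handled separately in the map phase, this pattern can cope with more text than fits in a single prompt, at the cost of extra calls.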