<a href="https://colab.research.google.com/github/ychoi-kr/chatgpt-langchain/blob/main/chapter5/5_1_Data_connection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import os

os.environ["OPENAI_API_KEY"] = "your-openai-api-key"

In [None]:
# If you registered your OpenAI API key in Colab secrets
from google.colab import userdata
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

In [None]:
!pip install pydantic==2.6.4



In [None]:
!pip install langchain==0.1.14 openai==1.16.2 langchain-openai==0.1.1



## 5-1 Data connection

### Document loaders

In [None]:
!pip install GitPython==3.1.36



In [None]:
from langchain.document_loaders import GitLoader

def file_filter(file_path):
    return file_path.endswith(".mdx")

loader = GitLoader(
    clone_url="https://github.com/langchain-ai/langchain",
    repo_path="./langchain",
    branch="master",
    file_filter=file_filter,
)

raw_docs = loader.load()
print(len(raw_docs))

306
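
The `file_filter` argument above is an ordinary function that receives a file path and returns a boolean. As a minimal sketch of how such a filter could be generalized to several extensions, the helper below builds one from a list of suffixes. The name `make_extension_filter` and the second and third sample paths are hypothetical and are not part of the original notebook.

```python
def make_extension_filter(*extensions):
    """Return a file_filter that accepts paths ending in any given extension."""
    suffixes = tuple(extensions)

    def file_filter(file_path):
        # str.endswith accepts a tuple and is True if any suffix matches.
        return file_path.endswith(suffixes)

    return file_filter


# Hypothetical usage: accept both .md and .mdx files.
md_or_mdx = make_extension_filter(".md", ".mdx")
print(md_or_mdx("docs/docs/integrations/platforms/aws.mdx"))
print(md_or_mdx("README.md"))
print(md_or_mdx("setup.py"))
```

The returned function could be passed as `file_filter=md_or_mdx` in the same way as the hand-written filter in the cell above.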


### Document transformers

In [None]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)

docs = text_splitter.split_documents(raw_docs)
print(len(docs))



985
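
To give a feel for what a character-based splitter does with `chunk_size`, here is a simplified pure-Python sketch. It assumes a blank-line separator and a greedy merge of pieces up to the size limit. It is an illustration only, not the `CharacterTextSplitter` implementation, and the function name `split_text` is hypothetical.

```python
def split_text(text, chunk_size=1000, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size."""
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if not current or len(candidate) <= chunk_size:
            # Either the chunk is empty (an oversized piece is kept whole)
            # or the piece still fits into the current chunk.
            current = candidate
        else:
            # The piece does not fit: close the current chunk and start a new one.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "\n\n".join(["a" * 40, "b" * 40, "c" * 40])
print([len(chunk) for chunk in split_text(sample, chunk_size=100)])  # prints [82, 40]
```

In this sketch a single piece longer than `chunk_size` is never cut, so chunks can exceed the limit when the text has few separators.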


### Text embedding models

In [None]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

In [None]:
!pip install tiktoken==0.6.0



In [None]:
# The query is in Korean: "Is there a DocumentLoader for reading data from AWS S3?"
query = "AWS의 S3에서 데이터를 읽기 위한 DocumentLoader가 있나요?"

vector = embeddings.embed_query(query)
print(len(vector))
print(vector)

1536
[-0.015598894545180425, -0.01281194535456013, 0.02173419229436915, -0.01617366025430552, -0.024086724220867564, 0.045232782602289565, -0.019675725461699215, -0.00775266368148702, -0.004063465041003099, 0.022148558184080493, 0.010539613337768589, 0.015465227408120603, 0.0009882975994966953, -0.018860360023453496, -0.007438547213247999, -0.002778594761723421, 0.01765735951520527, -0.009623996615405455, -0.00923636340804804, -0.036651386675718586, 0.012905511605443969, -0.015692460796064265, -0.005654097825286196, 0.001779436893605124, -0.002656624104515987, 0.009002446849515894, 0.007766030488325257, -0.04012671733811326, 0.007224680632142573, 0.002614853124184792, 0.052637914428256424, -0.013085961867945712, -0.006790264299343238, -0.022736692097027642, -0.010045046607027826, -0.014663228311051844, 0.00836753074244937, -0.000708015380853774, -0.006108564601173522, -0.021466858020249502, 0.002041757509214909, -0.0064828310016926995, -0.0005818675091979021, -0.04344165190638437, 0.00
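
An embedding is just a list of floats, so two texts can be compared by comparing their vectors. One common measure is cosine similarity. The sketch below uses made-up 3-dimensional vectors in place of the 1536-dimensional ones above, and assumes neither vector is all zeros. Which metric a particular vector store uses is not shown in this notebook.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (hypothetical values, for illustration only).
query_vec = [0.1, 0.9, 0.2]
doc_a = [0.2, 0.8, 0.1]   # points in roughly the same direction as the query
doc_b = [0.9, 0.1, 0.0]   # points in a different direction

print(cosine_similarity(query_vec, doc_a))
print(cosine_similarity(query_vec, doc_b))
```

A value closer to 1 means the two vectors point in more similar directions, so `doc_a` scores higher than `doc_b` here.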

### Vector stores

In [None]:
!pip install chromadb==0.4.24



In [None]:
from langchain.vectorstores import Chroma

db = Chroma.from_documents(docs, embeddings)
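
Conceptually, a vector store keeps (vector, text) pairs and returns the texts whose vectors lie nearest to a query vector. The toy class below is a rough in-memory sketch of that idea using squared Euclidean distance and a linear scan. `ToyVectorStore`, its methods, and the sample data are hypothetical, and real stores such as Chroma use indexes and persistence that are not modeled here.

```python
import heapq


class ToyVectorStore:
    """Minimal in-memory store: linear scan over (vector, text) pairs."""

    def __init__(self):
        self._items = []

    def add(self, vector, text):
        self._items.append((list(vector), text))

    def search(self, query_vector, k=4):
        def squared_distance(item):
            vector, _ = item
            return sum((x - y) ** 2 for x, y in zip(vector, query_vector))

        # Pick the k items with the smallest distance to the query.
        nearest = heapq.nsmallest(k, self._items, key=squared_distance)
        return [text for _, text in nearest]


store = ToyVectorStore()
store.add([0.0, 0.0], "origin")
store.add([1.0, 1.0], "near")
store.add([5.0, 5.0], "far")
print(store.search([0.9, 0.9], k=2))  # prints ['near', 'origin']
```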

### Retriever

In [None]:
retriever = db.as_retriever()

In [None]:
# The query is in Korean: "Is there a DocumentLoader that can load data from AWS S3?"
query = "AWS S3에서 데이터를 불러올 수 있는 DocumentLoader가 있나요?"

context_docs = retriever.get_relevant_documents(query)
print(f"len = {len(context_docs)}")

first_doc = context_docs[0]
print(f"metadata = {first_doc.metadata}")
print(first_doc.page_content)

len = 4
metadata = {'file_name': 'aws.mdx', 'file_path': 'docs/docs/integrations/platforms/aws.mdx', 'file_type': '.mdx', 'source': 'docs/docs/integrations/platforms/aws.mdx'}
See a [usage example](/docs/integrations/text_embedding/sagemaker-endpoint).
```python
from langchain_community.embeddings import SagemakerEndpointEmbeddings
from langchain_community.llms.sagemaker_endpoint import ContentHandlerBase
```

## Document loaders

### AWS S3 Directory and File

>[Amazon Simple Storage Service (Amazon S3)](https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-folders.html)
> is an object storage service.
>[AWS S3 Directory](https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-folders.html)
>[AWS S3 Buckets](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingBucket.html)

See a [usage example for S3DirectoryLoader](/docs/integrations/document_loaders/aws_s3_directory).

See a [usage example for S3FileLoader](/docs/integrations/document_loaders/aws_s3_file).


### RetrievalQA (Chain)

In [None]:
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
qa_chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)

qa_chain.invoke(query)

[chain/start] [1:chain:RetrievalQA] Entering Chain run with input:
{
  "query": "AWS S3에서 데이터를 불러올 수 있는 DocumentLoader가 있나요?"
}
[chain/start] [1:chain:RetrievalQA > 3:chain:StuffDocumentsChain] Entering Chain run with input:
[inputs]
[chain/start] [1:chain:RetrievalQA > 3:chain:StuffDocumentsChain > 4:chain:LLMChain] Entering Chain run with input:
{
  "question": "AWS S3에서 데이터를 불러올 수 있는 DocumentLoader가 있나요?",
  "context": "See a [usage example](/docs/integrations/text_embedding/sagemaker-endpoint).\n```python\nfrom langchain_community.embeddings import SagemakerEndpointEmbeddings\nfrom langchain_community.llms.sagemaker_endpoint import ContentHandlerBase\n```\n\n## Document loaders\n\n### AWS S3 Directory and File\n\n>[Amazon Simple Storage Service (Amazon S3)](https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-folders.html)\n> is an object storage service.\n>[AWS S3 Directory](https://docs.aws.amazon.c

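{'query': 'AWS S3에서 데이터를 불러올 수 있는 DocumentLoader가 있나요?',
 'result': '네, AWS S3에서 데이터를 불러오는 두 가지 Document Loader가 있습니다: `S3DirectoryLoader`와 `S3FileLoader`. 이 두 Document Loader를 사용하여 AWS S3에서 데이터를 로드할 수 있습니다.'}

In English, the result reads: "Yes, there are two Document Loaders that load data from AWS S3: `S3DirectoryLoader` and `S3FileLoader`. You can use these two Document Loaders to load data from AWS S3."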
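
The trace above shows the retrieved documents arriving at the LLM as a single `"context"` string next to the `"question"`. With `chain_type="stuff"`, the idea is simply to concatenate all retrieved texts into one prompt. The sketch below illustrates that idea in plain Python. The function name, the blank-line separator, and the prompt wording are assumptions for illustration, not the template LangChain actually uses.

```python
def build_stuff_prompt(question, documents, separator="\n\n"):
    """Join all document texts into one context block and append the question."""
    context = separator.join(documents)
    return (
        "Use the following pieces of context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical document texts standing in for retrieved page_content values.
prompt = build_stuff_prompt(
    "Is there a DocumentLoader for AWS S3?",
    ["See a usage example for S3DirectoryLoader.", "See a usage example for S3FileLoader."],
)
print(prompt)
```

Because every document is placed in the same prompt, this approach is simple but limited by the model's context window, which is one reason the documents were split into chunks earlier.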