# Loading and processing data with LangChain

Create a data loader using Git. This will clone the whole Azure Docs repository.

In [1]:
from langchain.document_loaders import GitLoader

# Data loader - documentation on GitHub, filter only markdown files
loader = GitLoader(
    clone_url="https://github.com/MicrosoftDocs/azure-docs",
    repo_path="./data/",
    file_filter=lambda file_path: file_path.endswith(".md"),
)


data = loader.load()
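The `file_filter` argument above is just a callable that takes a file path and returns a boolean. A minimal sketch of a named filter, assuming you also want to match upper-case extensions such as `.MD`; `is_markdown` is a hypothetical helper for illustration, not part of LangChain:

```python
import os


def is_markdown(file_path: str) -> bool:
    """Return True when the path has a .md extension, ignoring case."""
    # os.path.splitext("docs/README.MD") -> ("docs/README", ".MD")
    _, ext = os.path.splitext(file_path)
    return ext.lower() == ".md"


print(is_markdown("docs/deployment.md"))  # True
print(is_markdown("README.MD"))           # True
print(is_markdown("main.tf"))             # False
```

It could be passed to the loader as `file_filter=is_markdown` instead of the lambda.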

Let's see how many files were downloaded and what they are.

In [2]:
print(f"Downloaded {len(data)} files")
print("----------------------------")
for file in data:
    print(file.metadata["file_path"])

Downloaded 10 files
----------------------------
README.md
demo_kube\README.md
docs\compute_isolation.md
docs\deployment.md
docs\network_isolation.md
docs\policies.md
docs\storage_isolation.md
docs\terraform.md
modules\aks-apps-rbac\docs\terraform.md
modules\aks-system\docs\terraform.md


Now we will configure a text splitter to split large Markdown files into smaller chunks.

Note: LangChain supports context-aware splitting of Markdown, which would be great for very big files. The issue is that I do not want to make chunks too small (this might destroy context), and Azure Docs has a single file per page. Some files are small enough that splitting them by chapter is not worthwhile. Big ones might benefit, but then even a chapter needs to be split into smaller pieces (or third-level headings like ### would need to be used, producing too many chunks that are too small overall).
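To make the trade-off concrete, here is a rough pure-Python sketch of what splitting by headings means. It is not LangChain's implementation; `split_by_headers` and its `max_level` parameter are hypothetical names used only for illustration. With `max_level=2` a `###` heading stays inside its parent section, which keeps chunks larger:

```python
import re

FENCE = "`" * 3  # Markdown code-fence marker


def split_by_headers(markdown: str, max_level: int = 2) -> list:
    """Split Markdown text into sections at headings of level <= max_level."""
    heading = re.compile(r"^(#{1,%d})\s+(.*)$" % max_level)
    sections = []
    current = {"title": None, "lines": []}
    in_fence = False
    for line in markdown.splitlines():
        # Lines inside fenced code blocks are never treated as headings
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else heading.match(line)
        if match:
            # Close the previous section before starting a new one
            if current["title"] is not None or current["lines"]:
                sections.append(current)
            current = {"title": match.group(2).strip(), "lines": []}
        else:
            current["lines"].append(line)
    if current["title"] is not None or current["lines"]:
        sections.append(current)
    return [
        {"title": s["title"], "text": "\n".join(s["lines"]).strip()}
        for s in sections
    ]


sample = "# Title\nintro\n## Network\nnet text\n### Detail\nmore\n## Storage\nstore text"
for section in split_by_headers(sample):
    print(section["title"], "->", len(section["text"]), "characters")
```

Raising `max_level` to 3 would cut at `###` as well and produce more, smaller sections, which is exactly the effect described in the note.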

In [3]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=25)
all_splits = text_splitter.split_documents(data)
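The meaning of `chunk_size` and `chunk_overlap` can be illustrated with a simplified fixed-window splitter. This is only a sketch: `RecursiveCharacterTextSplitter` prefers to cut on separators such as paragraph and line breaks, while the hypothetical `chunk_with_overlap` below cuts at exact character positions.

```python
def chunk_with_overlap(text: str, chunk_size: int = 500, chunk_overlap: int = 25) -> list:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


text = "".join(str(i % 10) for i in range(1200))
chunks = chunk_with_overlap(text)
print([len(c) for c in chunks])            # [500, 500, 250]
print(chunks[0][-25:] == chunks[1][:25])   # True: neighbours share 25 characters
```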

How many chunks did we get?

In [4]:
len(all_splits)

113

Let's look at one chunk.

In [5]:
all_splits[2]

Document(page_content='- For single source of truth provide YAML manifest with all cluster parameters (shared components, namespaces, applications) that will be consumed by both Azure resources deployment (Terraform) and Kubernetes resources deployment (ArgoCD).\r\n- Each application will use its own namespace that is network isolated from others and only communication allowed is via Ingress or Azure API Management self-hosted gateway.', metadata={'source': 'README.md', 'file_path': 'README.md', 'file_name': 'README.md', 'file_type': '.md'})

Configure the URL and API key for the Azure OpenAI service.

In [None]:
%env BASE_URL = https://tom-canada-openai.openai.azure.com
%env API_KEY = mykey
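Because the later cells read these values from `os.environ`, it can help to fail early with a clear message when one of them is missing. A minimal sketch; `require_env` is a hypothetical helper, not part of any library:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a descriptive error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

For example, `base_url = require_env("BASE_URL")` would raise immediately if the `%env` cell was skipped, instead of failing later with a `KeyError`.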

Get embeddings for the chunks and store them in FAISS.

In [26]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
import os

embedding = OpenAIEmbeddings(
    openai_api_base=os.environ["BASE_URL"],
    openai_api_key=os.environ["API_KEY"],
    openai_api_type="azure",
    openai_api_version="2023-05-15",
    deployment="text-embedding-ada-002",
    model="text-embedding-ada-002",
    chunk_size=16)   # Note: chunk_size here is misleading - it is really a batch size, i.e. how many texts are sent to the API at once

vectorstore = FAISS.from_documents(documents=all_splits, embedding=embedding)
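If `chunk_size` on the embeddings object really acts as a batch size, as the comment above says, then the 113 chunks would be sent in groups of at most 16. A small sketch of that batching arithmetic; `batched` is a hypothetical helper and no API is called here:

```python
def batched(items: list, batch_size: int):
    """Yield consecutive slices of items with at most batch_size elements each."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


texts = [f"chunk {i}" for i in range(113)]  # stand-in for the 113 chunks
batches = list(batched(texts, 16))
print(len(batches))      # 8 batches
print(len(batches[-1]))  # the last batch holds 1 text
```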

Try a question and run a similarity search.

In [29]:
question = "How is network configured?"
docs = vectorstore.similarity_search(question)
docs

[Document(page_content='## Network architecture and isolation\r\n\r\n\r\n```mermaid\r\nflowchart TD\r\n    subgraph api_subnet\r\n        api_server\r\n    end;\r\n\r\n    api_subnet --> main_subnet\r\n    api_subnet --> confidential_app3_subnet\r\n    api_subnet --> confidential_app4_subnet\r\n\r\n    subgraph main_subnet\r\n        subgraph standard_app1_namespace\r\n            subgraph standard_app1_component1\r\n                standard_app1_service1 --> standard_app1_pod1\r\n                standard_app1_service1 --> standard_app1_pod2', metadata={'source': 'docs\\network_isolation.md', 'file_path': 'docs\\network_isolation.md', 'file_name': 'network_isolation.md', 'file_type': '.md'}),
 Document(page_content='```\r\n\r\nThis will create example hub and spoke topology with Azure Firewall, Azure VPN and jump server. T. Since solution is using private endpoints your deployment server needs to be in VNET - in demo solution you can use either jump server or connect via P2S VPN.\r\n\r
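Conceptually, similarity search embeds the question and returns the chunks whose vectors are closest to it. The sketch below does this by brute force on toy 2-D vectors, assuming Euclidean (L2) distance as the metric; the names `l2_distance` and `top_k` are hypothetical, and a real FAISS index does the same job far more efficiently on high-dimensional embeddings.

```python
import math


def l2_distance(a: list, b: list) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query: list, vectors: list, k: int = 4) -> list:
    """Return the indices of the k vectors closest to the query."""
    ranked = sorted(range(len(vectors)), key=lambda i: l2_distance(query, vectors[i]))
    return ranked[:k]


toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [0.1, 0.1]]
nearest = top_k([0.0, 0.2], toy_vectors, k=2)
print(nearest)  # [4, 0]
```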

This is how we save the FAISS index to a file.

In [30]:
vectorstore.save_local(folder_path=".", index_name="azuredocs")
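To check that the save worked, you can look for the files on disk. The sketch below assumes that `save_local` writes `<index_name>.faiss` and `<index_name>.pkl` into the folder; that naming is an assumption worth verifying against your LangChain version, and `index_files` and `index_exists` are hypothetical helpers.

```python
from pathlib import Path


def index_files(folder_path: str, index_name: str) -> list:
    """Paths the saved index is assumed to consist of (assumption, see above)."""
    folder = Path(folder_path)
    return [folder / f"{index_name}.faiss", folder / f"{index_name}.pkl"]


def index_exists(folder_path: str, index_name: str) -> bool:
    """True only when every expected file is present."""
    return all(path.is_file() for path in index_files(folder_path, index_name))


print([path.name for path in index_files(".", "azuredocs")])
print(index_exists(".", "azuredocs"))
```

Since the `.pkl` file is a pickle, it is safest to load only indexes you created yourself.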

Try loading the FAISS index from the file and run a similarity search.

In [34]:
vectorstore2 = FAISS.load_local(folder_path=".", index_name="azuredocs", embeddings=embedding)
question = "How is network configured?"
docs = vectorstore2.similarity_search(question)
docs

[Document(page_content='## Network architecture and isolation\r\n\r\n\r\n```mermaid\r\nflowchart TD\r\n    subgraph api_subnet\r\n        api_server\r\n    end;\r\n\r\n    api_subnet --> main_subnet\r\n    api_subnet --> confidential_app3_subnet\r\n    api_subnet --> confidential_app4_subnet\r\n\r\n    subgraph main_subnet\r\n        subgraph standard_app1_namespace\r\n            subgraph standard_app1_component1\r\n                standard_app1_service1 --> standard_app1_pod1\r\n                standard_app1_service1 --> standard_app1_pod2', metadata={'source': 'docs\\network_isolation.md', 'file_path': 'docs\\network_isolation.md', 'file_name': 'network_isolation.md', 'file_type': '.md'}),
 Document(page_content='```\r\n\r\nThis will create example hub and spoke topology with Azure Firewall, Azure VPN and jump server. T. Since solution is using private endpoints your deployment server needs to be in VNET - in demo solution you can use either jump server or connect via P2S VPN.\r\n\r