# Loading and processing data with LangChain

Create a data loader using Git. This clones the whole Azure Docs repository.

In [5]:
from langchain.document_loaders import GitLoader

# Data loader - documentation on GitHub, filter only markdown files
loader = GitLoader(
    clone_url="https://github.com/MicrosoftDocs/azure-docs",
    repo_path="./data/",
    file_filter=lambda file_path: file_path.endswith(".md"),
)


data = loader.load()
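
The `file_filter` argument takes any callable that receives a file path and returns a boolean, so the filter can be narrowed to a subset of the repository instead of every markdown file. The following is a minimal sketch, not part of the original notebook. The helper name `make_md_filter`, the folder `articles/virtual-network`, and the file name `overview.md` are illustrative assumptions. Whether the loader passes a path relative to the repository or one prefixed with `repo_path` is also not confirmed here, so the sketch matches folder segments anywhere in the path.

```python
from pathlib import PurePosixPath


def make_md_filter(subfolder: str):
    """Return a predicate that accepts only .md files located under `subfolder`."""
    wanted = PurePosixPath(subfolder).parts

    def accept(file_path: str) -> bool:
        # Normalize Windows separators so the same predicate works everywhere.
        path = PurePosixPath(file_path.replace("\\", "/"))
        if path.suffix.lower() != ".md":
            return False
        parts = path.parts
        # Accept if `wanted` appears as a consecutive run of path segments.
        return any(
            parts[i:i + len(wanted)] == wanted
            for i in range(len(parts) - len(wanted) + 1)
        )

    return accept


# Hypothetical example: keep only markdown files under one documentation folder.
md_in_vnet = make_md_filter("articles/virtual-network")
print(md_in_vnet("./data/articles/virtual-network/overview.md"))
print(md_in_vnet("./data/README.md"))
```

The returned predicate could be passed as `file_filter=md_in_vnet` in place of the lambda above.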

Let's see how many files were downloaded and what they are.

In [6]:
print(f"Downloaded {len(data)} files")
print("----------------------------")
for file in data:
    print(file.metadata["file_path"])

Downloaded 28040 files
----------------------------
CODE_OF_CONDUCT.md
CONTRIBUTING.md
README.md
SECURITY.md
ThirdPartyNotices.md
articles/azure-glossary-cloud-terminology.md
articles/cloud-services-php-create-web-role.md
articles/nodejs-use-node-modules-azure-apps.md
articles/third-party-notices.md
articles/vs-azure-tools-storage-explorer-accessibility.md
articles/vs-azure-tools-storage-explorer-blobs.md
articles/vs-azure-tools-storage-explorer-files.md
articles/vs-azure-tools-storage-manage-with-storage-explorer.md
includes/DDoS-Protection-region-requirement.md
includes/DDoS-Protection-virtual-network-relocate-note.md
includes/active-directory-authentication-configure-certificate-authorities.md
includes/active-directory-authentication-configure-revocation.md
includes/active-directory-authentication-connect-azuread.md
includes/active-directory-authentication-get-trusted-azuread.md
includes/active-directory-authentication-new-trusted-azuread.md
includes/active-directory-authentication-
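
With tens of thousands of files, a per-folder summary is easier to read than the full listing. The following is a small sketch, assuming a hypothetical helper `count_by_top_level`. The sample paths are taken from the listing above and stand in for the real metadata.

```python
from collections import Counter


def count_by_top_level(paths):
    """Count paths by their first folder; files in the repo root go under '(root)'."""
    counts = Counter()
    for p in paths:
        head, sep, _ = p.partition("/")
        counts[head if sep else "(root)"] += 1
    return counts


# Stand-in data; a few of the paths printed above.
sample_paths = [
    "README.md",
    "SECURITY.md",
    "articles/azure-glossary-cloud-terminology.md",
    "includes/DDoS-Protection-region-requirement.md",
]
print(count_by_top_level(sample_paths).most_common())
```

On the loaded documents, the input would be `[d.metadata["file_path"] for d in data]`.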

Now we will configure a text splitter to split large markdown files into smaller chunks.

Note: LangChain supports context-aware splitting of Markdown, which would be useful for very large files. The problem is that I do not want chunks to be too small (that can destroy context), and Azure Docs uses a single file per page. Some files are small enough that splitting them by chapter is not worthwhile. Large files would benefit, but even a chapter would then need to be split further (or third-level headings such as ### would have to be used, producing too many chunks that are too small).
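
The trade-off in the note can be seen with a simplified header-based splitter. This is a pure-Python sketch, not LangChain's Markdown splitter. The function name `split_by_headers` and the sample document are assumptions made for illustration, and the sketch does not skip `#` lines that appear inside code blocks.

```python
import re


def split_by_headers(markdown: str, max_level: int = 2):
    """Split markdown into sections that start at headers of level <= max_level."""
    header = re.compile(r"^(#{1,%d})\s+\S" % max_level)
    sections, current = [], []
    for line in markdown.splitlines():
        # A qualifying header closes the section collected so far.
        if header.match(line) and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections


doc = (
    "# Title\n\nIntro.\n\n## Setup\n\nSteps.\n\n"
    "### Details\n\nMore.\n\n## Usage\n\nText."
)

for level in (2, 3):
    parts = split_by_headers(doc, max_level=level)
    print(level, len(parts), [len(p) for p in parts])
```

Going one header level deeper yields more sections, each shorter, which is the effect the note is concerned about.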

In [7]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size = 2000, chunk_overlap = 100)
all_splits = text_splitter.split_documents(data)
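
To build intuition for `chunk_size` and `chunk_overlap`, here is a deliberately simplified sliding-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally: the real splitter first breaks text on separators such as blank lines and then merges the pieces, so its boundaries and overlaps differ. The function name `sliding_chunks` and the synthetic text are assumptions.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


# Synthetic 5000-character text standing in for one markdown file.
text = "".join(str(i % 10) for i in range(5000))
chunks = sliding_chunks(text, chunk_size=2000, chunk_overlap=100)
print([len(c) for c in chunks])
print(chunks[0][-100:] == chunks[1][:100])
```

The overlap means the end of one chunk is repeated at the start of the next, so a sentence cut at a boundary still appears whole in at least one chunk when it is shorter than the overlap.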

How many chunks did we get?

In [8]:
len(all_splits)

188945

Let's look at one of the chunks.

In [9]:
all_splits[2]

Document(page_content="# Microsoft Azure Documentation\n\nWelcome to the open-source [documentation](https://learn.microsoft.com/azure/?product=popular) of [Microsoft Azure](https://azure.microsoft.com). Please review this README file to understand how you can assist in contributing to the Microsoft Azure documentation. \n\n## Getting Started\n\nContributing to open source is more than just providing updates. It's also about letting us know when there is an issue. Read our [Contributing guidance](CONTRIBUTING.md) to find out more.\n\n### Prerequisites\n\nYou've decided to contribute. That's great! To contribute to the documentation, you need a few tools.\n\n#### GitHub\n\nContributing to the documentation requires a GitHub account. If you don't have an account, follow the instructions for [GitHub account setup](https://learn.microsoft.com/contribute/get-started-setup-github) from our contributor guide.\n\n#### Tools\n\nTo install the necessary tools, follow the instructions for [Instal

Configure the URL and API key for the Azure OpenAI service.

In [None]:
%env BASE_URL = https://tom-canada-openai.openai.azure.com
%env API_KEY = mykey

Get embeddings for the chunks and store them in FAISS.

In [14]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
import os

embedding = OpenAIEmbeddings(
    openai_api_base=os.environ["BASE_URL"],
    openai_api_key=os.environ["API_KEY"],
    openai_api_type="azure",
    openai_api_version="2023-05-15",
    deployment="text-embedding-ada-002",
    model="text-embedding-ada-002",
    chunk_size=16)   # Note: chunk_size here is misleading - it is more like a batch size, i.e. how many texts are sent to the API at once

vectorstore = FAISS.from_documents(documents=all_splits, embedding=embedding)
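
Since `chunk_size=16` behaves like a batch size, the number of embedding calls can be estimated up front. The sketch below uses a hypothetical `batched` helper and stand-in texts. It assumes one API request per batch, which is an assumption about the client and not something confirmed here.

```python
import math
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Stand-in texts; the real input would be the page_content of each chunk.
texts = [f"chunk {i}" for i in range(40)]
sizes = [len(batch) for batch in batched(texts, 16)]
print(sizes)

# Estimate for the 188945 chunks above, assuming one request per batch of 16.
estimated_requests = math.ceil(188945 / 16)
print(estimated_requests)
```

At this scale the embedding step takes a long time and is subject to service rate limits, so it is worth trying the pipeline on a small slice such as `all_splits[:100]` first.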

Try a question and run a similarity search.

In [15]:
question = "How to configure routing on virtual network?"
docs = vectorstore.similarity_search(question)
docs

[Document(page_content='1. In the search box at the top of the portal, enter **Virtual machine**. Select **Virtual machines** in the search results.\n\n1. Select **vm-nva**.\n\n1. In the **Overview** select **Stop** if the virtual machine is running.\n\n1. Select **Networking** in **Settings**.\n\n1. In **Networking** select the network interface name next to **Network Interface:**. The interface name is the virtual machine name and random numbers and letters. In this example, the interface name is **vm-nva271**. \n\n1. In the network interface properties, select **IP configurations** in **Settings**.\n\n1. Select the box next to **Enable IP forwarding**.\n\n1. Select **Apply**.\n\n1. When the apply action completes, select **ipconfig1**.\n\n1. In **Assignment** in **ipconfig1** select **Static**.\n\n1. In **Private IP address** enter **10.0.253.10**.\n\n1. Select **Save**.\n\n1. When the save action completes, return to the networking configuration for **vm-nva**.\n\n1. In **Networkin
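
Conceptually, a similarity search embeds the question and returns the stored chunks whose vectors lie closest to it. The following is a brute-force sketch of that idea with toy 3-dimensional vectors. The texts, the vectors, and the choice of Euclidean distance are assumptions for illustration. Real embeddings from `text-embedding-ada-002` have 1536 dimensions, and FAISS uses optimized index structures instead of a Python loop.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, stored, k=2):
    """Return the texts of the k stored (text, vector) pairs closest to query_vec."""
    ranked = sorted(stored, key=lambda item: l2_distance(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]


# Toy "embeddings"; these values are made up for the example.
stored = [
    ("route tables", [0.9, 0.1, 0.0]),
    ("blob storage", [0.0, 0.2, 0.9]),
    ("virtual network peering", [0.8, 0.3, 0.1]),
]
query_vec = [1.0, 0.0, 0.0]
print(top_k(query_vec, stored, k=2))
```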

This is how we save the FAISS index to a file.

In [16]:
vectorstore.save_local(folder_path=".", index_name="azuredocs")
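
Before loading the index elsewhere, it can help to confirm that the saved files are present. The sketch below assumes that `save_local` writes two files named `<index_name>.faiss` and `<index_name>.pkl` into the folder. That naming is an assumption about the library version in use, and the helper `missing_index_files` is hypothetical.

```python
from pathlib import Path


def missing_index_files(folder_path: str, index_name: str):
    """Return the expected index files that are not present in folder_path."""
    folder = Path(folder_path)
    # Assumed naming: one FAISS index file plus one pickled docstore file.
    expected = [folder / f"{index_name}.faiss", folder / f"{index_name}.pkl"]
    return [str(p) for p in expected if not p.is_file()]


missing = missing_index_files(".", "azuredocs")
if missing:
    print("Missing index files:", missing)
else:
    print("Index files found")
```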

Try loading the FAISS index from the file and running a similarity search.

In [17]:
vectorstore2 = FAISS.load_local(folder_path=".", index_name="azuredocs", embeddings=embedding)
question = "How is network configured?"
docs = vectorstore2.similarity_search(question)
docs

[Document(page_content='### Networking', metadata={'source': 'articles/app-service/environment/version-comparison.md', 'file_path': 'articles/app-service/environment/version-comparison.md', 'file_name': 'version-comparison.md', 'file_type': '.md'}),
 Document(page_content='## Networking', metadata={'source': 'articles/storage/blobs/security-recommendations.md', 'file_path': 'articles/storage/blobs/security-recommendations.md', 'file_name': 'security-recommendations.md', 'file_type': '.md'}),
 Document(page_content='## Networking', metadata={'source': 'articles/storage/queues/security-recommendations.md', 'file_path': 'articles/storage/queues/security-recommendations.md', 'file_name': 'security-recommendations.md', 'file_type': '.md'}),
 Document(page_content="### Configure network on first node\n\nFollow these steps to configure the network for your device.\n\n1. In the local web UI of your device, go to the **Get started** page. \n\n2. On the **Network** tile, select **Configure**.  \