In [1]:
from langchain_community.document_loaders import WebBaseLoader 

USER_AGENT environment variable not set, consider setting it to identify your requests.
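The warning above can usually be avoided by setting `USER_AGENT` before the loader module is imported. A minimal sketch, assuming the check happens at import time; the identifier string `rag-flow-notebook/0.1` is a hypothetical placeholder, not something from this notebook:

```python
import os

# Hypothetical identifier; replace it with a string that describes your application.
# setdefault keeps any value that is already present in the environment.
os.environ.setdefault("USER_AGENT", "rag-flow-notebook/0.1")

# Import the loader only after the variable is set, for example:
# from langchain_community.document_loaders import WebBaseLoader
print(os.environ["USER_AGENT"])
```

In a notebook this would go in the very first cell, ahead of the `WebBaseLoader` import.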


# RAG Flow

Load data -> Documents -> Split documents into chunks -> Convert chunks into vectors (vector embeddings) -> Store vectors in a vector store

1. Data Loading -> Load the source data
2. Data Transformation -> Divide the loaded data into chunks
3. Data Embedding -> Convert the chunks into vectors
4. Vector Store -> Store the vectors in a database

## 1. Data Loading

In [2]:
loader = WebBaseLoader("https://python.langchain.com/docs/integrations/document_loaders/web_base/")
loader

<langchain_community.document_loaders.web_base.WebBaseLoader at 0x11435b830>

In [3]:
docs = loader.load()
docs

[Document(metadata={'source': 'https://python.langchain.com/docs/integrations/document_loaders/web_base/', 'title': 'WebBaseLoader | 🦜️🔗 LangChain', 'description': 'This covers how to use WebBaseLoader to load all text from HTML webpages into a document format that we can use downstream. For more custom logic for loading webpages look at some child class examples such as IMSDbLoader, AZLyricsLoader, and CollegeConfidentialLoader.', 'language': 'en'}, page_content='\n\n\n\n\nWebBaseLoader | 🦜️🔗 LangChain\n\n\n\n\n\n\nSkip to main contentJoin us at  Interrupt: The Agent AI Conference by LangChain on May 13 & 14 in San Francisco!IntegrationsAPI ReferenceMoreContributingPeopleError referenceLangSmithLangGraphLangChain HubLangChain JS/TSv0.3v0.3v0.2v0.1💬SearchProvidersAnthropicAWSGoogleHugging FaceMicrosoftOpenAIMoreProvidersAcreomActiveloop Deep LakeAerospikeAI21 LabsAimAINetworkAirbyteAirtableAlchemyAleph AlphaAlibaba CloudAnalyticDBAnnoyAnthropicAnyscaleApache Software FoundationApache D

## 2. Data Transformation
Text -> Chunks

In [4]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
text_chunks = text_splitter.split_documents(docs)
text_chunks

[Document(metadata={'source': 'https://python.langchain.com/docs/integrations/document_loaders/web_base/', 'title': 'WebBaseLoader | 🦜️🔗 LangChain', 'description': 'This covers how to use WebBaseLoader to load all text from HTML webpages into a document format that we can use downstream. For more custom logic for loading webpages look at some child class examples such as IMSDbLoader, AZLyricsLoader, and CollegeConfidentialLoader.', 'language': 'en'}, page_content='WebBaseLoader | 🦜️🔗 LangChain'),
 Document(metadata={'source': 'https://python.langchain.com/docs/integrations/document_loaders/web_base/', 'title': 'WebBaseLoader | 🦜️🔗 LangChain', 'description': 'This covers how to use WebBaseLoader to load all text from HTML webpages into a document format that we can use downstream. For more custom logic for loading webpages look at some child class examples such as IMSDbLoader, AZLyricsLoader, and CollegeConfidentialLoader.', 'language': 'en'}, page_content='Skip to main contentJoin us a
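To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries a list of separators such as paragraph and line breaks); the helper `chunk_text` is a hypothetical name used only for illustration:

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters
chunks = chunk_text(sample, chunk_size=12, chunk_overlap=4)
for chunk in chunks:
    print(repr(chunk))
```

The last 4 characters of each chunk reappear at the start of the next one, which is what keeps a sentence that straddles a boundary retrievable from either side.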

## 3. Data Embedding
Chunks -> Vectors 

In [5]:
from langchain_community.embeddings import OllamaEmbeddings
embeddings = OllamaEmbeddings(model="gemma2:2b")  # if model is omitted, the default is "llama2"
embeddings

OllamaEmbeddings(base_url='http://localhost:11434', model='gemma2:2b', embed_instruction='passage: ', query_instruction='query: ', mirostat=None, mirostat_eta=None, mirostat_tau=None, num_ctx=None, num_gpu=None, num_thread=None, repeat_last_n=None, repeat_penalty=None, temperature=None, stop=None, tfs_z=None, top_k=None, top_p=None, show_progress=False, headers=None, model_kwargs=None)
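The idea behind "Chunks -> Vectors" is that texts with similar meaning end up as vectors that point in similar directions. The sketch below uses a toy word-count embedding (the hypothetical `toy_embed`, with a made-up vocabulary) in place of a real model, purely to show how two vectors are compared with cosine similarity:

```python
import math
from collections import Counter


def toy_embed(text: str, vocabulary: list[str]) -> list[float]:
    """Toy stand-in for an embedding model: count how often each vocabulary word occurs."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


vocabulary = ["loader", "web", "page", "boiler", "energy"]  # made-up vocabulary
v1 = toy_embed("web page loader", vocabulary)
v2 = toy_embed("loader for a web page", vocabulary)
v3 = toy_embed("boiler energy", vocabulary)

print(cosine_similarity(v1, v2))  # close to 1: same vocabulary words
print(cosine_similarity(v1, v3))  # 0.0: no shared vocabulary words
```

A real embedding model produces dense vectors with hundreds or thousands of dimensions, but the comparison step works the same way.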

## 4. Vector Store

In [6]:
from langchain_community.vectorstores import FAISS
faiss_db = FAISS.from_documents(documents=text_chunks, embedding=embeddings)
faiss_db

<langchain_community.vectorstores.faiss.FAISS at 0x11644f410>

In [7]:
query = "There are reasonable limits to concurrent request"
result = faiss_db.similarity_search(query)
result[0].page_content

"[Document(metadata={'source': 'https://www.govinfo.gov/content/pkg/CFR-2018-title10-vol3/xml/CFR-2018-title10-vol3-sec431-86.xml'}, page_content='\\n\\n10\\nEnergy\\n3\\n2018-01-01\\n2018-01-01\\nfalse\\nUniform test method for the measurement of energy efficiency of commercial packaged boilers.\\nÂ§ 431.86\\nSection Â§ 431.86\\n\\nEnergy\\nDEPARTMENT OF ENERGY\\nENERGY CONSERVATION\\nENERGY EFFICIENCY PROGRAM FOR CERTAIN COMMERCIAL AND INDUSTRIAL EQUIPMENT\\nCommercial Packaged Boilers\\nTest Procedures\\n\\n\\n\\n\\n§\\u2009431.86\\nUniform test method for the measurement of energy efficiency of commercial packaged boilers.\\n(a) Scope. This section provides test procedures, pursuant to the Energy Policy and Conservation Act (EPCA), as amended, which must be followed for measuring the combustion efficiency and/or thermal efficiency of a gas- or oil-fired commercial packaged boiler.\\n(b) Testing and Calculations. Determine the thermal efficiency or combustion efficiency of commercia
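Conceptually, `similarity_search` embeds the query and returns the stored chunks whose vectors lie closest to it. The sketch below is a brute-force illustration with hand-written 2-D vectors and a hypothetical `ToyVectorStore` class; it assumes Euclidean distance as the metric and is not the FAISS implementation:

```python
import math


class ToyVectorStore:
    """Hypothetical in-memory store: keeps (vector, text) pairs and scans them all on search."""

    def __init__(self) -> None:
        self._items: list[tuple[list[float], str]] = []

    def add(self, vector: list[float], text: str) -> None:
        self._items.append((list(vector), text))

    def similarity_search(self, query_vector: list[float], k: int = 4) -> list[str]:
        # Sort every stored item by Euclidean distance to the query, nearest first.
        ranked = sorted(self._items, key=lambda item: math.dist(item[0], query_vector))
        return [text for _, text in ranked[:k]]


store = ToyVectorStore()
store.add([1.0, 0.0], "chunk about rate limits")
store.add([0.0, 1.0], "chunk about boilers")
store.add([0.9, 0.1], "chunk about concurrency")

top = store.similarity_search([1.0, 0.0], k=2)
print(top)
```

When a result looks unrelated to the query, printing `result[0].metadata` alongside `page_content` is a quick way to see which source document the chunk came from.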