# Data Ingestion

### Text Loader

In [4]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("speech.txt")
text_document = loader.load()
text_document

[Document(metadata={'source': 'speech.txt'}, page_content='The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.\n\nJust because we fight without rancor and without selfish object, seeking nothing for ourselves but what we shall wish to share with all free peoples, we shall, I feel confident, conduct our operations as belligerents without passion and ourselves observe with proud punctilio the principles of right and of fair play we profess to be fighting for.\n\nI have said nothing of the governments allied with the Imperial government of Germany because they have not made war

In [5]:
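import os
from dotenv import load_dotenv
load_dotenv(dotenv_path='.env')

# os.environ accepts only strings: assigning os.getenv(...) directly raises
# "TypeError: str expected, not NoneType" when the key is missing from .env.
langchain_api_key = os.getenv("LANGCHAIN_API_KEY")
if langchain_api_key is not None:
    os.environ["LANGCHAIN_API_KEY"] = langchain_api_key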

### Web-Based Loader

In [6]:
from langchain_community.document_loaders import WebBaseLoader
import bs4

# Load only the parts of the HTML page whose class matches the SoupStrainer filter below
loader = WebBaseLoader(web_paths=('https://www.myvi.in/new-connection/buy-prepaid-sim-connection-online',),
                       bs_kwargs=dict(parse_only=bs4.SoupStrainer(
                           class_=('faq-section container','slick-track')
                       )))
text_document = loader.load()

USER_AGENT environment variable not set, consider setting it to identify your requests.
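
The warning above appears because no `USER_AGENT` environment variable is defined when the loader is imported. A minimal sketch of one way to define it, assuming the assignment runs before `WebBaseLoader` is imported; the value `"rag-notebook/0.1"` and the helper `ensure_user_agent` are hypothetical names chosen for illustration:

```python
import os

# Hypothetical identifier for this notebook; any descriptive string can be used.
DEFAULT_USER_AGENT = "rag-notebook/0.1"

def ensure_user_agent(default: str = DEFAULT_USER_AGENT) -> str:
    """Set USER_AGENT only if it is not already defined; return the value in effect."""
    return os.environ.setdefault("USER_AGENT", default)

user_agent = ensure_user_agent()
```

Using `setdefault` keeps any value that was already exported in the shell or loaded from `.env`.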


In [7]:
text_document

[Document(metadata={'source': 'https://www.myvi.in/new-connection/buy-prepaid-sim-connection-online'}, page_content="\nPrepaid SIM Connection FAQs\n\n\n\n\n\nHow do I get free Vi prepaid SIM?\n\n\n\n\nVisit Vi Prepaid Connection Page and select a\nprepaid plan\nEnter your pin code and contact details\nChoose your preferred mobile number\nComplete the checkout and get free prepaid SIM card at your doorstep\n\n\n\n\n\n\n\n\nWhat are the new SIM offers on Vi Prepaid connection?\n\n\n\nVi has many exciting deals and offers with new prepaid SIM cards. Purchase a new Vi Prepaid Connection and enjoy different offers with select packs like:\n\nUnlimited calls with 100 SMS/day\n1-year access to JioHotstar Mobile\nSubscription to Vi Movies & TV\nWeekend data rollover\nFree data from 12AM to 6AM\n\n\n\n\n\n\n\n\nWhat is the price of a new Vi SIM?\n\n\n\nThe new Vi SIM price is zero. The new SIM is provided for free.\n\n\n\n\n\n\n\n\nHow do I get 130GB free data with Vi Guarantee?\n\n\n\nIf you ha

### PDF Loader

In [8]:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("Agreement.pdf")
docs = loader.load()

In [9]:
docs

[Document(metadata={'producer': 'iTextSharp 5.2.1 (c) 1T3XT BVBA', 'creator': 'PyPDF', 'creationdate': '2025-09-04T12:56:54+05:30', 'moddate': '2025-09-08T12:31:26+05:30', 'source': 'Agreement.pdf', 'total_pages': 5, 'page': 0, 'page_label': '1'}, page_content='Particulars Amount Paid GRN/Transaction Id  Date\nStamp Duty Rs. 2125.00/- MH007995706202526P 02/09/2025\nDHC Rs. 300/- 0925028615545 02/09/2025\nRegistration Fee Rs. 1000.00/- MH007995706202526P 02/09/2025\nLEAVE AND LICENSE AGREEMENT\n This agreement is made and executed on 03/09/2025 at Thane\nBetween,\n1) Name: Mrs Nadar Christa Willington , Age : About 39 Years, PAN : BEVPN9224H Residing at:\nFlat No:506, Floor No:5, Building Name:Christia B, Block Sector:Lodha Dombivl East ,\nRoad:Kalyan Shill Road,Casa Bella Gold , Dombivli, Thane, Maharashtra, 421204\nHEREINAFTER called ‘the Licensor (which expression shall mean and include the Licensor above\nnamed and also his/her/their respective heirs, successors, assigns, executors 

# Data Transformation

In [10]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
documents = text_splitter.split_documents(docs)

documents[:5]

[Document(metadata={'producer': 'iTextSharp 5.2.1 (c) 1T3XT BVBA', 'creator': 'PyPDF', 'creationdate': '2025-09-04T12:56:54+05:30', 'moddate': '2025-09-08T12:31:26+05:30', 'source': 'Agreement.pdf', 'total_pages': 5, 'page': 0, 'page_label': '1'}, page_content='Particulars Amount Paid GRN/Transaction Id  Date\nStamp Duty Rs. 2125.00/- MH007995706202526P 02/09/2025\nDHC Rs. 300/- 0925028615545 02/09/2025\nRegistration Fee Rs. 1000.00/- MH007995706202526P 02/09/2025\nLEAVE AND LICENSE AGREEMENT\n This agreement is made and executed on 03/09/2025 at Thane\nBetween,\n1) Name: Mrs Nadar Christa Willington , Age : About 39 Years, PAN : BEVPN9224H Residing at:\nFlat No:506, Floor No:5, Building Name:Christia B, Block Sector:Lodha Dombivl East ,\nRoad:Kalyan Shill Road,Casa Bella Gold , Dombivli, Thane, Maharashtra, 421204\nHEREINAFTER called ‘the Licensor (which expression shall mean and include the Licensor above\nnamed and also his/her/their respective heirs, successors, assigns, executors 
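
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window splitter. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one first tries to break on separators such as paragraph and line breaks); `split_with_overlap` is a hypothetical helper that only shows how consecutive chunks share characters.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=2)
# chunks == ["abcdefghij", "ijklmnopqr", "qrstuvwxyz"]
```

With `chunk_size=1000` and `chunk_overlap=20`, as in the cell above, each chunk can repeat up to 20 characters from the end of the previous one, so a sentence cut at a boundary is less likely to lose its context.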

# Vector Embeddings & Store

In [11]:
from langchain_ollama.embeddings import OllamaEmbeddings
embedding_model = OllamaEmbeddings(model="nomic-embed-text")

### ChromaDB

In [12]:
from langchain_community.vectorstores import Chroma
chroma_db = Chroma.from_documents(documents, embedding_model)

In [13]:
query = "prepaid"
result = chroma_db.similarity_search(query)
result[0].page_content


'Type of Party,\nName & UID\n Date & Time of\nAdmission\nDate ,Time of\nVerification\nwith UIDAI\nInformation received from\nUIDAI(Name,Gender,Aadhaar/Ref No,Photo)\nLicensor\nMrs Nadar\nChrista\nWillington\n03/09/2025\n11:14:46 AM\n03/09/2025\n11:15:09 AM\nChrista Willington Nadar,\nFemale, 1249332383419289600\n \nLicensee\n. Rishabh K\nSharma\n02/09/2025\n07:26:14 PM\n02/09/2025\n07:26:53 PM\nRishabh K. Sharma, Male,\n1210851368124370944\n \nidentifier for all\nexecutants\n- Nikhil Dey\n03/09/2025\n02:53:12 PM\n03/09/2025\n02:53:35 PM\nNikhil Jitendra Dey, Male,\n1167491594943225856\n \nidentifier for all\nexecutants\n- Prerna Jadhav\n03/09/2025\n03:31:45 PM\n03/09/2025\n03:32:14 PM\nPrerna Rahul Jadhav, Female,\n1212349220277276672\n \nLEAVE AND LICENSE AGREEMENT\nPage 5 of 5Registered as Document No.17404/2025 at the Joint S.R. Thane 11 on 08/09/2025\nThumb Impression of SRO'
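
The `"prepaid"` query returns a chunk of the agreement because this store was built only from the PDF chunks, and a nearest-neighbour search returns the closest stored item whether or not it is relevant. The sketch below illustrates this with toy vectors and cosine similarity. The vectors, texts, and the `top_k` helper are made up for illustration, and a real vector store may use a different distance metric.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=1):
    """Return the k (score, text) pairs with the highest cosine similarity."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]

# Toy 2-D "embeddings"; real embeddings have hundreds of dimensions.
store = [
    ("license fee clause", [1.0, 0.0]),
    ("witness details", [0.8, 0.6]),
]
unrelated_query = [0.0, 1.0]  # points away from most of the store
best_score, best_text = top_k(unrelated_query, store, k=1)[0]
```

A result comes back even for a weakly related query, so it is worth looking at the score as well as the text; LangChain vector stores expose `similarity_search_with_score` for that purpose.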

### FAISS DB

In [94]:
from langchain_community.vectorstores import FAISS
FAISSdb = FAISS.from_documents(documents, embedding_model)


In [96]:
query = "rent per month"
result = FAISSdb.similarity_search(query)
result[0].page_content


'1) Period: That the Licensor hereby grants to the Licensee herein a revocable leave and license,\nto occupy the Licensed Premises, described in Schedule I hereunder written without creating any\ntenancy rights or any other rights, title and interest in favour of the Licensee for a period of 22\nMonths commencing from 01/09/2025 and ending on 30/06/2027\n2) License Fee & Deposit:That the Licensee shall pay to the Licensor the following amount per\nmonth towards the compensation for the use of the said Licensed premises.\na) Rs. 36000/-(Thirty-Six Thousand Only) per month for the first 11 months,\nb) Rs. 39600/-(Thirty-Nine Thousand Six Hundred Only) per month for the next 11 months.\n The amount of monthly compensation License fee shall be payable within first five days of the\nconcerned month of Leave and License. Licensees shall also pay to the Licensor Rs. 100000\ninterest free refundable deposit, for the use of the said Licensed premises.'

### LanceDB

In [97]:
from langchain_community.vectorstores import LanceDB
Lancedb = LanceDB.from_documents(documents, embedding_model)

In [108]:
query = "Rishabh k sharma address"
result = Lancedb.similarity_search(query)
result[0].page_content


'Name & Address  Photo Thumb Verified Digitally\nsigned\nLicensor\nMrs Nadar Christa Willington\nAddress:Flat No:506, Floor No:5, Building\nName:Christia B, Block Sector:Lodha Dombivl\nEast , Road:Kalyan Shill Road,Casa Bella Gold ,\nDombivli, Thane, Maharashtra, 421204\n \n \nNot Available\nLicensee\nMr.. Rishabh K Sharma\nAddress:Flat No:803, Floor No:8, Building\nName:Lodha Amara Wing 31, Block Sector:Near\nAir Force Station , Road:Kolshet Road, Thane\nWest , Thane, Maharashtra, 400607\n \n \nNot Available\nWitness of execution of all executants\n- Nikhil Dey\nAddress: Block Sector:-, Road:-, ., Thane,\nMaharashtra, 400602\n \n \nNot Required\nWitness of execution of all executants\n- Prerna Jadhav\nAddress: Block Sector:-, Road:-, ., Thane,\nMaharashtra, 400602\n \n \nNot Required\n\xa0 _______________________________________________________________________\nAdmission Of Execution / Identification\nThe following parties have admitted that they have executed the Agreement of Leave a

# Chain and Prompt Retrieval

In [14]:
from langchain_ollama import OllamaLLM

llm = OllamaLLM(model="Gemma3:4b")

llm

OllamaLLM(model='Gemma3:4b')

### Designing chat prompt template

In [15]:
from langchain_core.prompts import ChatPromptTemplate
prompt = ChatPromptTemplate.from_template(
    """
    Answer the following question based only on the provided context. Think step by step before providing a detailed answer.
    I will tip you $1000 if the user finds the answer helpful.
    <context>
    {context}
    </context>
    Question: {input}
    """
)
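
The template has two placeholders, `{context}` and `{input}`. As a rough illustration of what happens when they are filled, the sketch below uses plain `str.format` on a shortened copy of the template. Joining the chunks with a blank line is an assumption about how the texts are combined, and `fill_prompt` is a hypothetical helper, not a LangChain function.

```python
TEMPLATE = (
    "Answer the following question based only on the provided context.\n"
    "<context>\n{context}\n</context>\n"
    "Question: {input}"
)

def fill_prompt(chunks: list[str], question: str) -> str:
    """Join the chunk texts and substitute them, with the question, into the template."""
    context = "\n\n".join(chunks)
    return TEMPLATE.format(context=context, input=question)

filled = fill_prompt(
    ["Licensee: Mr. Rishabh K Sharma", "Period: 22 months"],
    "Who is the licensee?",
)
```

Matching opening and closing `<context>` tags give the model an unambiguous boundary between the retrieved text and the question.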

### Chain Introduction
#### Create Stuff Documents Chain

In [16]:
from langchain_classic.chains.combine_documents import create_stuff_documents_chain
document_chain = create_stuff_documents_chain(llm, prompt)

Retrievers: A retriever is an interface that returns documents given an unstructured query. It is more general than a vector store.
A retriever does not need to be able to store documents, only to return (or retrieve) them. Vector stores can be used as the backbone of a retriever, but there are other types of retrievers as well.
https://python.langchain.com/docs/modules/data_connection/retrievers/
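
To illustrate why a retriever is more general than a vector store, here is a sketch of one that uses no embeddings at all and matches on shared words instead. `Doc`, `Retriever`, and `KeywordRetriever` are hypothetical names for this example, not LangChain classes.

```python
from dataclasses import dataclass, field
from typing import Protocol

@dataclass
class Doc:
    page_content: str
    metadata: dict = field(default_factory=dict)

class Retriever(Protocol):
    """Anything that maps a query string to a list of documents."""
    def retrieve(self, query: str) -> list[Doc]: ...

class KeywordRetriever:
    """Returns documents that share at least one word with the query."""

    def __init__(self, docs: list[Doc]) -> None:
        self._docs = docs

    def retrieve(self, query: str) -> list[Doc]:
        terms = set(query.lower().split())
        return [d for d in self._docs if terms & set(d.page_content.lower().split())]

def first_hit(retriever: Retriever, query: str) -> str:
    """Works with any object that satisfies the Retriever protocol."""
    hits = retriever.retrieve(query)
    return hits[0].page_content if hits else ""

docs = [Doc("license fee per month"), Doc("witness of execution")]
hit = first_hit(KeywordRetriever(docs), "monthly fee")
```

Code written against the interface (`first_hit` here) does not need to know whether the documents come from a vector index, a keyword match, or some other source.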

In [17]:
retriver = chroma_db.as_retriever()
retriver


VectorStoreRetriever(tags=['Chroma', 'OllamaEmbeddings'], vectorstore=<langchain_community.vectorstores.chroma.Chroma object at 0x11d86db80>, search_kwargs={})

Retrieval chain: This chain takes a user query and passes it to the retriever to fetch relevant documents. Those documents (and the original inputs) are then passed to an LLM to generate a response.
https://python.langchain.com/docs/modules/chains/
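
The data flow of a retrieval chain can be sketched in a few lines. This is not how `create_retrieval_chain` is implemented; `run_retrieval_chain`, `fake_retriever`, and `fake_llm` are hypothetical stand-ins that only show the order of the steps. The output keys `input`, `context`, and `answer` mirror the ones the real chain returns.

```python
from typing import Callable

def run_retrieval_chain(
    question: str,
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str], str],
) -> dict:
    """Retrieve chunks, stuff them into one prompt, and ask the model."""
    context = retrieve(question)  # step 1: fetch relevant chunks
    prompt = (
        "<context>\n" + "\n\n".join(context) + "\n</context>\n"
        "Question: " + question
    )  # step 2: place all chunks and the question in a single prompt
    answer = generate(prompt)  # step 3: the model answers from that prompt
    return {"input": question, "context": context, "answer": answer}

# Stand-ins for a real retriever and LLM, used only to show the data flow.
def fake_retriever(query: str) -> list[str]:
    return ["Licensee: Mr. Rishabh K Sharma"]

def fake_llm(prompt: str) -> str:
    return f"(model output for a prompt of {len(prompt)} characters)"

result = run_retrieval_chain("Who is the licensee?", fake_retriever, fake_llm)
```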

In [18]:
from langchain_classic.chains import create_retrieval_chain

retrival_chain = create_retrieval_chain(retriver, document_chain)

In [22]:
# The query is in romanized Hindi: "What is Rishabh's address?"
response = retrival_chain.invoke({'input':"rishabh ka address kya hai "})
response['answer']

'Mr.. Rishabh K Sharma\nAddress:Flat No:803, Floor No:8, Building\nName:Lodha Amara Wing 31, Block Sector:Near\nAir Force Station , Road:Kolshet Road, Thane\nWest , Thane, Maharashtra, 400607'
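
Besides `answer`, the dictionary returned by the retrieval chain also carries the retrieved documents under `context`, which makes it possible to check where an answer came from. The sketch below runs on a stand-in response: `FakeDoc` and `summarize_sources` are hypothetical, the page numbers are invented, and the metadata keys `source` and `page` are assumed to match what `PyPDFLoader` produced earlier.

```python
from dataclasses import dataclass, field

@dataclass
class FakeDoc:
    page_content: str
    metadata: dict = field(default_factory=dict)

def summarize_sources(response: dict) -> list[str]:
    """List 'source (page N)' for each retrieved document, without duplicates."""
    seen = []
    for doc in response.get("context", []):
        label = f"{doc.metadata.get('source', '?')} (page {doc.metadata.get('page', '?')})"
        if label not in seen:
            seen.append(label)
    return seen

# Stand-in for a chain response; page numbers are made up for illustration.
fake_response = {
    "answer": "...",
    "context": [
        FakeDoc("chunk a", {"source": "Agreement.pdf", "page": 3}),
        FakeDoc("chunk b", {"source": "Agreement.pdf", "page": 3}),
        FakeDoc("chunk c", {"source": "Agreement.pdf", "page": 4}),
    ],
}
sources = summarize_sources(fake_response)
```

On the real `response`, the same helper would be called as `summarize_sources(response)`.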