<a href="https://colab.research.google.com/github/singhraj00/langchain-tutorial/blob/main/RAG-Document-Loaders.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![](https://images.prismic.io/turing/6564bfb2531ac2845a2562f5_RAG_Process_38a4465948.jpg?auto=format,compress)

## Retrieval-Augmented Generation (RAG) is a technique that combines information retrieval with language generation: a model retrieves relevant documents from a knowledge base and then uses them as context to generate accurate, grounded responses.


## Benefits of Using RAG
- Access to up-to-date information
- Better privacy
- No limit on document size

## RAG Components
- Document Loaders
- Text Splitters
- Vector Databases
- Retrievers

## Document Loaders

## **Document Loaders** are components in LangChain used to load data from various sources into a standardized format (usually Document objects), which can then be used for chunking, embedding, retrieval, and generation.

## For Example:

``Document(page_content="this actual text content",
  metadata={"source":"filename.pdf",....})``

# TextLoader

## **TextLoader** is a simple and commonly used document loader in LangChain that reads plain text (.txt) files and converts them into LangChain Document objects.

## Use Case
- Ideal for loading chat logs, scraped text, transcripts, code snippets, or any plain text data into a LangChain pipeline.

## Limitation
- Works only with plain-text files; it does not parse formats such as PDF or DOCX.
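
As a rough mental model (not LangChain's actual implementation), loading a text file amounts to reading it with a given encoding and wrapping the resulting string in one document whose metadata records the source path. In the sketch below, `SimpleDocument` and `load_text_file` are hypothetical stand-ins written only for illustration; real code should use `Document` and `TextLoader`.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for LangChain's Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path: str, encoding: str = "utf-8") -> list[SimpleDocument]:
    """Read one plain-text file and return it as a single-element list."""
    text = Path(path).read_text(encoding=encoding)
    return [SimpleDocument(page_content=text, metadata={"source": path})]
```

The return value is a list with one element, which matches the shape `TextLoader.load()` produces in the example that follows: one file, one document.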

## Code Example

In [2]:
!pip install langchain langchain_core langchain-community


In [3]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("/content/ipl.txt",encoding='utf-8')

document = loader.load()

print(document)

[Document(metadata={'source': '/content/ipl.txt'}, page_content="**Indian Premier League (IPL) - History, Achievements, and Records**\n\n### **Introduction to IPL**\nThe Indian Premier League (IPL) is a professional Twenty20 cricket league established by the Board of Control for Cricket in India (BCCI) in 2008. It is one of the most popular and lucrative cricket leagues globally, attracting top players from around the world. The IPL follows a franchise-based model, where teams represent different cities in India.\n\n### **History and Evolution**\nThe first season of the IPL was played in 2008, with Rajasthan Royals emerging as the inaugural champions. Over the years, the league has grown exponentially in terms of viewership, sponsorship, and global appeal. The IPL introduced the concept of player auctions, franchise ownership, and a mix of international and domestic talent, revolutionizing cricket.\n\nSome of the key milestones in IPL history include:\n- Introduction of the Decision Re

In [4]:
print(type(document))

<class 'list'>


In [5]:
print(len(document))

1


In [6]:
print(document[0])

page_content='**Indian Premier League (IPL) - History, Achievements, and Records**

### **Introduction to IPL**
The Indian Premier League (IPL) is a professional Twenty20 cricket league established by the Board of Control for Cricket in India (BCCI) in 2008. It is one of the most popular and lucrative cricket leagues globally, attracting top players from around the world. The IPL follows a franchise-based model, where teams represent different cities in India.

### **History and Evolution**
The first season of the IPL was played in 2008, with Rajasthan Royals emerging as the inaugural champions. Over the years, the league has grown exponentially in terms of viewership, sponsorship, and global appeal. The IPL introduced the concept of player auctions, franchise ownership, and a mix of international and domestic talent, revolutionizing cricket.

Some of the key milestones in IPL history include:
- Introduction of the Decision Review System (DRS) in 2018.
- Expansion of the league with ne

In [11]:
print(type(document[0]))

<class 'langchain_core.documents.base.Document'>


In [9]:
print(document[0].page_content)

**Indian Premier League (IPL) - History, Achievements, and Records**

### **Introduction to IPL**
The Indian Premier League (IPL) is a professional Twenty20 cricket league established by the Board of Control for Cricket in India (BCCI) in 2008. It is one of the most popular and lucrative cricket leagues globally, attracting top players from around the world. The IPL follows a franchise-based model, where teams represent different cities in India.

### **History and Evolution**
The first season of the IPL was played in 2008, with Rajasthan Royals emerging as the inaugural champions. Over the years, the league has grown exponentially in terms of viewership, sponsorship, and global appeal. The IPL introduced the concept of player auctions, franchise ownership, and a mix of international and domestic talent, revolutionizing cricket.

Some of the key milestones in IPL history include:
- Introduction of the Decision Review System (DRS) in 2018.
- Expansion of the league with new teams in dif

In [10]:
print(document[0].metadata)

{'source': '/content/ipl.txt'}


## Full Code With LLM Integration

In [12]:
!pip install langchain_groq



In [16]:
from langchain_community.document_loaders import TextLoader
from langchain_groq import ChatGroq
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from google.colab import userdata

API_KEY = userdata.get('GROQ_API_KEY')

## load data

loader = TextLoader("/content/ipl.txt",encoding='utf-8')

document = loader.load()

llm = ChatGroq(model_name="llama-3.3-70b-versatile",api_key=API_KEY,temperature=0)

prompt = PromptTemplate(template="Summarize this document first:\n{document}\nThen tell me which team is the most successful in the IPL according to the document.",
                        input_variables=["document"]
                        )

parser = StrOutputParser()

chain = prompt | llm | parser

result = chain.invoke({"document":document[0].page_content})

print(result)

**Summary of the Document:**
The document discusses the Indian Premier League (IPL), a professional Twenty20 cricket league established in 2008. It covers the history, evolution, achievements, and records of the IPL, as well as its impact on cricket. The league has grown exponentially in terms of viewership, sponsorship, and global appeal, and has introduced innovative concepts such as player auctions and franchise ownership. The document also highlights notable moments, records, and statistics in IPL history, including the most successful teams, highest individual scores, and fastest centuries.

**Most Successful Team in IPL:**
According to the document, the most successful teams in the IPL are:
1. **Mumbai Indians (MI)**
2. **Chennai Super Kings (CSK)**
Both teams have won multiple IPL titles, making them the most successful franchises in the league.


# PyPDFLoader

## **PyPDFLoader** is a document loader in LangChain that loads content from a PDF file and converts each page into a Document object.

## For Example

``[
  Document(page_content="text from page 1", metadata={"page":0,"source":"file.pdf",....}),
  Document(page_content="text from page 2", metadata={"page":1,"source":"file.pdf",....}),
  .........
]``

## Limitation
- It uses the pypdf library under the hood, so it does not work well with scanned PDFs or complex layouts.

## Code Example

In [17]:
!pip install PyPDF



In [18]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("/content/DSMP.pdf")

document = loader.load()

print(document)

[Document(metadata={'producer': 'Skia/PDF m136 Google Docs Renderer', 'creator': 'PyPDF', 'creationdate': '', 'title': 'DSMP 2.0 FAQs', 'source': '/content/DSMP.pdf', 'total_pages': 6, 'page': 0, 'page_label': '1'}, page_content='About  Data  Science  Mentorship  Program   ●  What  is  the  course  fee  for   Data  Science  Mentorship  Program  (DSMP  \n2.0)\n The  total  course  fee  for  the  DSMP  course  is  Rs  10,600.  This  includes  the  \nfee\n \nfor\n \nboth\n \nDSMP\n \n1.0\n \nand\n \nDSMP\n \n2.0\n  ●  What  is  the  total  duration  of  the  course?  The  total  duration  of  the  course  is  about  6-8  months.    ●  Are  Deep  Learning  and  NLP  a  part  of  the  DSMP  2.0   course?  No,  NLP  and  Deep  Learning  both  are  not  a  part  of  this  program’s  \ncurriculum.\n  ●  What  if  I  miss  a  live  session?  Will  I  get  a  recording  of  the  session?  Yes,  all  our  sessions  are  recorded,  so  even  if  you  miss  a  session  you  can  \nlogin\n \ninto\n 

In [19]:
print(len(document))

6


## Extract Page 1

In [20]:
print(document[0].page_content)
print(document[0].metadata)

About  Data  Science  Mentorship  Program   ●  What  is  the  course  fee  for   Data  Science  Mentorship  Program  (DSMP  
2.0)
 The  total  course  fee  for  the  DSMP  course  is  Rs  10,600.  This  includes  the  
fee
 
for
 
both
 
DSMP
 
1.0
 
and
 
DSMP
 
2.0
  ●  What  is  the  total  duration  of  the  course?  The  total  duration  of  the  course  is  about  6-8  months.    ●  Are  Deep  Learning  and  NLP  a  part  of  the  DSMP  2.0   course?  No,  NLP  and  Deep  Learning  both  are  not  a  part  of  this  program’s  
curriculum.
  ●  What  if  I  miss  a  live  session?  Will  I  get  a  recording  of  the  session?  Yes,  all  our  sessions  are  recorded,  so  even  if  you  miss  a  session  you  can  
login
 
into
 
our
 
portal
 
and
 
watch
 
the
 
recording
 
as
 
per
 
your
 
convenience.
  ●  Where  can  I  find  the  class  schedule?  You  will  find  the  class  schedule  in  your  course  dashboard  once  you  
enroll
 
for
 
the
 
course.
 
 ●  What  is  t

## Extract Page 2

In [22]:
print(document[1].page_content)
print(document[1].metadata)

●  What  is  the  language  spoken  by  the  instructor  during  the  sessions?  Hinglish.   ●  How  will  I  be  informed  about  the  upcoming  class?  You  will  get  mail  from  our  side  before  every  session  once  you  enroll  in  
the
 
course.
 
  ●  Can  I  do  this  course  if  I  am  from  a  non-tech  background?  Yes,  absolutely.  However,  you’ll  need  to  be  consistent.   ●  I  am  late,  can  I  join  the  program  in  the  middle?  Absolutely,  you  can  join  the  program  any  time.   ●  If  I  join/pay  in  the  middle,  will  I  be  able  to  see  all  the  past  lectures?  Yes,  once  you’re  enrolled  in  the  course,  you  will  be  able  to  see  all  the  
past
 
content
 
in
 
your
 
dashboard.
  ●  Where  do  I  have  to  submit  the  task?  You  don’t  have  to  submit  the  task.  We  will  provide  you  with  the  solutions,  
you
 
have
 
to
 
self
 
evaluate
 
the
 
task
 
yourself.
  ●  Will  we  do  case  studies  in  the  program?  Yes,  there 
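
The extracted text above contains doubled spaces and stray newlines between words, which is common when a PDF's layout does not map cleanly to plain text. One possible cleanup step before chunking is to collapse all whitespace runs, sketched here with the standard library only. `normalize_whitespace` is a hypothetical helper, and the sample string is a shortened imitation of the output above.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse every run of spaces, tabs, and newlines into one space."""
    return re.sub(r"\s+", " ", text).strip()


raw = "The  total  course  fee  includes  the  \nfee\n \nfor\n \nboth"
print(normalize_whitespace(raw))
# The total course fee includes the fee for both
```

This also removes genuine paragraph breaks, so it suits cases where the text will be re-split by a text splitter anyway.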

## More PDF Loaders in LangChain

- PDFs with tables/columns: **PDFPlumberLoader**
- Scanned/image PDFs: **UnstructuredPDFLoader, AmazonTextractPDFLoader**
- Need layout and image data: **PyMuPDFLoader**
- Want best structure extraction: **UnstructuredPDFLoader**

# DirectoryLoader

## **DirectoryLoader** is a document loader that lets you load multiple documents from a directory (folder) of files.
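
Before pointing a loader at a folder, it can help to check which files a glob pattern actually selects. The sketch below uses only `pathlib`. `preview_matches` is a hypothetical helper, and it assumes non-recursive matching, in the spirit of the `glob="*.pdf"` argument used in this notebook.

```python
from pathlib import Path


def preview_matches(directory: str, pattern: str = "*.pdf") -> list[str]:
    """Return the sorted names of files in `directory` that match `pattern`."""
    return sorted(p.name for p in Path(directory).glob(pattern) if p.is_file())
```

For example, `preview_matches("/content/Data/")` would list the PDF files sitting directly in that folder, without opening any of them.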



In [26]:
!pip install  unstructured[pdf]


In [29]:
from langchain_community.document_loaders import DirectoryLoader,PyPDFLoader

loader = DirectoryLoader("/content/Data/",glob="*.pdf",loader_cls=PyPDFLoader)

documents = loader.load()

print(documents)

[Document(metadata={'producer': 'PyFPDF 1.7.2 http://pyfpdf.googlecode.com/', 'creator': 'PyPDF', 'creationdate': 'D:20250403132430', 'source': '/content/Data/LangChain_Core_Components.pdf', 'total_pages': 2, 'page': 0, 'page_label': '1'}, page_content='LangChain Core Components\nLangChain Core Components\nLangChain provides essential building blocks that make it easier to develop applications using large\nlanguage models. These components help structure and manage interactions effectively.\n1. **LLMs (Large Language Models)**\n   - Interface with various language models such as OpenAI\'s GPT, Hugging Face models, and\nopen-source alternatives.\n   - Example:\n     ```python\n     from langchain.llms import OpenAI\n     llm = OpenAI(model_name="gpt-4", openai_api_key="your_api_key")\n     print(llm.predict("Explain LangChain components."))\n     ```\n2. **Chains**\n   - Chains help link multiple components together, allowing structured workflows.\n   - Example:\n     ```python\n     fr

In [30]:
print(len(documents))

55


In [31]:
print(documents[0].page_content)
print(documents[0].metadata)

LangChain Core Components
LangChain Core Components
LangChain provides essential building blocks that make it easier to develop applications using large
language models. These components help structure and manage interactions effectively.
1. **LLMs (Large Language Models)**
   - Interface with various language models such as OpenAI's GPT, Hugging Face models, and
open-source alternatives.
   - Example:
     ```python
     from langchain.llms import OpenAI
     llm = OpenAI(model_name="gpt-4", openai_api_key="your_api_key")
     print(llm.predict("Explain LangChain components."))
     ```
2. **Chains**
   - Chains help link multiple components together, allowing structured workflows.
   - Example:
     ```python
     from langchain.chains import LLMChain
     from langchain.prompts import PromptTemplate
     template = PromptTemplate(template="Translate '{text}' to French.", input_variables=["text"])
     chain = LLMChain(llm=llm, prompt=template)
     print(chain.run("Hello, how are yo

In [32]:
print(documents[1].page_content)
print(documents[1].metadata)

```
4. **Agents**
   - Agents use LLMs to determine the next action dynamically, rather than following a fixed
sequence.
   - Example:
     ```python
     from langchain.agents import initialize_agent, AgentType
     from langchain.tools import Tool
     tools = [Tool(name="Search", func=lambda x: "Result for " + x, description="Search for
information")]
     agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
verbose=True)
     print(agent.run("Search for LangChain tutorials."))
     ```
5. **Tools**
   - Tools allow LangChain applications to interact with external APIs, databases, and more.
Conclusion:
LangChain's core components provide the foundation for building powerful AI-driven applications,
enabling structured workflows and advanced interactions.
{'producer': 'PyFPDF 1.7.2 http://pyfpdf.googlecode.com/', 'creator': 'PyPDF', 'creationdate': 'D:20250403132430', 'source': '/content/Data/LangChain_Core_Components.pdf', 'total_pages': 2, 'page': 1, 'pa

## Limitation
- load() reads every matching file into memory before returning, so large directories take a lot of time and memory.

## Solution

### load() vs lazy_load()

- load():
  - **Eager loading**: loads everything at once.
  - Returns a **list of Document** objects.
  - Loads all documents **immediately** into memory.
  - Best when:
     - The number of documents is **small**.
     - You want everything loaded upfront.
- lazy_load():
  - **Lazy loading**: loads on demand.
  - Returns a **generator of Document** objects.
  - Documents are **not loaded all at once**; they are fetched **one at a time, as needed**.
  - Best when:
    - You are dealing with **large documents or lots of files**.
    - You want to **stream** processing (e.g. chunking, embedding) without using lots of memory.
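
The difference between eager and lazy loading can be seen with plain Python generators, independent of LangChain. In this minimal sketch, `read_file`, `load`, and `lazy_load` are hypothetical stand-ins, and the `reads` list simply records when each pretend file is actually read.

```python
from typing import Iterator

reads: list[str] = []  # records when each (pretend) file is actually read


def read_file(name: str) -> str:
    """Stand-in for an expensive file read."""
    reads.append(name)
    return f"contents of {name}"


def load(names: list[str]) -> list[str]:
    """Eager: read every file now and return a list."""
    return [read_file(n) for n in names]


def lazy_load(names: list[str]) -> Iterator[str]:
    """Lazy: read each file only when the caller asks for the next item."""
    for n in names:
        yield read_file(n)


files = ["a.pdf", "b.pdf", "c.pdf"]

docs = load(files)
print(len(reads))  # 3: all files were read before load() returned

reads.clear()
gen = lazy_load(files)
print(len(reads))  # 0: nothing has been read yet
first = next(gen)
print(len(reads))  # 1: only the first file has been read
```

Calling a generator function returns almost instantly because the work is deferred until iteration, so timing that call alone says little about the total loading cost.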

## Code Example

## load() Function

In [37]:
from langchain_community.document_loaders import DirectoryLoader,PyPDFLoader
import time


loader = DirectoryLoader("/content/Data/",glob="*.pdf",loader_cls=PyPDFLoader)

start_time = time.time()

documents = loader.load()

end_time = time.time()

print(f"Time taken: {end_time - start_time} seconds")

print(documents)

Time taken: 31.36706042289734 seconds
[Document(metadata={'producer': 'PyFPDF 1.7.2 http://pyfpdf.googlecode.com/', 'creator': 'PyPDF', 'creationdate': 'D:20250403132430', 'source': '/content/Data/LangChain_Core_Components.pdf', 'total_pages': 2, 'page': 0, 'page_label': '1'}, page_content='LangChain Core Components\nLangChain Core Components\nLangChain provides essential building blocks that make it easier to develop applications using large\nlanguage models. These components help structure and manage interactions effectively.\n1. **LLMs (Large Language Models)**\n   - Interface with various language models such as OpenAI\'s GPT, Hugging Face models, and\nopen-source alternatives.\n   - Example:\n     ```python\n     from langchain.llms import OpenAI\n     llm = OpenAI(model_name="gpt-4", openai_api_key="your_api_key")\n     print(llm.predict("Explain LangChain components."))\n     ```\n2. **Chains**\n   - Chains help link multiple components together, allowing structured workflows.\n

## lazy_load() Function

In [39]:
from langchain_community.document_loaders import DirectoryLoader,PyPDFLoader
import time


loader = DirectoryLoader("/content/Data/",glob="*.pdf",loader_cls=PyPDFLoader)

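start_time = time.time()

# lazy_load() only creates a generator here; no file has been read yet,
# so the time measured below covers generator creation only.
documents = loader.lazy_load()

end_time = time.time()

print(f"Time taken: {end_time - start_time} seconds")

# The PDFs are actually read as this loop consumes the generator.
for doc in documents:
  print(doc.metadata)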

Time taken: 0.00015115737915039062 seconds
{'producer': 'PyFPDF 1.7.2 http://pyfpdf.googlecode.com/', 'creator': 'PyPDF', 'creationdate': 'D:20250403132430', 'source': '/content/Data/LangChain_Core_Components.pdf', 'total_pages': 2, 'page': 0, 'page_label': '1'}
{'producer': 'PyFPDF 1.7.2 http://pyfpdf.googlecode.com/', 'creator': 'PyPDF', 'creationdate': 'D:20250403132430', 'source': '/content/Data/LangChain_Core_Components.pdf', 'total_pages': 2, 'page': 1, 'page_label': '2'}
{'producer': 'Microsoft: Print To PDF', 'creator': 'PyPDF', 'creationdate': '2025-03-28T17:31:11+05:30', 'author': '', 'moddate': '2025-03-28T17:31:11+05:30', 'title': 'Langchain-Chains.ipynb - Colab', 'source': '/content/Data/langchain-chains.pdf', 'total_pages': 9, 'page': 0, 'page_label': '1'}
{'producer': 'Microsoft: Print To PDF', 'creator': 'PyPDF', 'creationdate': '2025-03-28T17:31:11+05:30', 'author': '', 'moddate': '2025-03-28T17:31:11+05:30', 'title': 'Langchain-Chains.ipynb - Colab', 'source': '/conte

## WebBaseLoader

### **WebBaseLoader is a document loader in LangChain used to load and extract text content from web pages (URLs).**

### **It uses BeautifulSoup under the hood to parse HTML and extract visible text.**

### When to use:
- For blogs, news, articles, or public websites where the content is primarily text-based and static.

### Limitation
- Does not handle JavaScript-heavy pages well (use SeleniumURLLoader for that).
- Loads only static content (what's in the HTML, not what loads after the page renders).
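
To illustrate why static extraction misses script-rendered content, here is a small sketch that pulls text out of an HTML string. It uses the standard library's `html.parser` as a stand-in for BeautifulSoup, so it is not how WebBaseLoader is implemented. `VisibleTextParser`, `visible_text`, and the sample page are all made up for the example.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def visible_text(html: str) -> str:
    """Return the non-script, non-style text of an HTML string."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


page = """
<html><body>
  <h1>MacBook Air</h1>
  <div id="price"></div>
  <script>document.getElementById("price").textContent = "80,990";</script>
</body></html>
"""
print(visible_text(page))
# MacBook Air
```

The price never appears in the output because it exists only after a browser runs the script. This is the gap a browser-driven loader is meant to close.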

## Code Example

In [42]:
from langchain_community.document_loaders import WebBaseLoader




loader = WebBaseLoader("https://www.flipkart.com/apple-macbook-air-m2-8-gb-256-gb-ssd-mac-os-monterey-mly13hn-a/p/itmd65f6b189fc7b?pid=COMGFB2GBKDVYBDD&lid=LSTCOMGFB2GBKDVYBDDLKF8WC&marketplace=FLIPKART&store=6bo%2Fb5g&spotlightTagId=default_FkPickId_6bo%2Fb5g&srno=b_1_3&otracker=browse&fm=organic&iid=b4095891-a522-49c1-8781-555174c2a035.COMGFB2GBKDVYBDD.SEARCH&ppt=browse&ppn=browse&ssid=149hv7tpo00000001743688422982")

documents = loader.load()

print(documents)

[Document(metadata={'source': 'https://www.flipkart.com/apple-macbook-air-m2-8-gb-256-gb-ssd-mac-os-monterey-mly13hn-a/p/itmd65f6b189fc7b?pid=COMGFB2GBKDVYBDD&lid=LSTCOMGFB2GBKDVYBDDLKF8WC&marketplace=FLIPKART&store=6bo%2Fb5g&spotlightTagId=default_FkPickId_6bo%2Fb5g&srno=b_1_3&otracker=browse&fm=organic&iid=b4095891-a522-49c1-8781-555174c2a035.COMGFB2GBKDVYBDD.SEARCH&ppt=browse&ppn=browse&ssid=149hv7tpo00000001743688422982'}, page_content='Site is overloaded')]


In [46]:
from langchain_community.document_loaders import SeleniumURLLoader

loader = SeleniumURLLoader(urls=["https://www.flipkart.com/apple-macbook-air-m2-8-gb-256-gb-ssd-mac-os-monterey-mly13hn-a/p/itmd65f6b189fc7b"])
documents = loader.load()

print(documents)


[Document(metadata={'source': 'https://www.flipkart.com/apple-macbook-air-m2-8-gb-256-gb-ssd-mac-os-monterey-mly13hn-a/p/itmd65f6b189fc7b', 'title': 'Apple MacBook AIR Apple M2 - (8 GB/256 GB SSD/Mac OS Monterey) MLY13HN/A Rs.99900 Price in India - Buy Apple MacBook AIR Apple M2 - (8 GB/256 GB SSD/Mac OS Monterey) MLY13HN/A Starlight Online - Apple : Flipkart.com', 'description': 'No description found.', 'language': 'en'}, page_content="Flipkart\n\nExplore Plus\n\n\n\nLogin\n\nNew customer?\n\nSign Up\n\nBecome a Seller\n\nMore\n\nCart\n\nElectronicsTVs & AppliancesMenWomenBaby & KidsHome & FurnitureSports, Books & MoreFlightsOffer Zone\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nApple MacBook AIR Apple M2 - (8 GB/256 GB SSD/Mac OS Monterey) MLY13HN/A\n\nHome\n\nComputers\n\nLaptops\n\nApple Laptops\n\nApple MacBook AIR Apple M2 - (8 GB/256 GB SSD/Mac OS Monterey) MLY13HN/A (13.6 Inch, Starlight, 1.24 kg)\n\nShare\n\nApple MacBook AIR Apple M2 - (8 GB/256 GB SSD/Mac OS Monterey) MLY

In [47]:
print(len(documents))

1


In [48]:
print(documents[0].page_content)

Flipkart

Explore Plus



Login

New customer?

Sign Up

Become a Seller

More

Cart

ElectronicsTVs & AppliancesMenWomenBaby & KidsHome & FurnitureSports, Books & MoreFlightsOffer Zone























Apple MacBook AIR Apple M2 - (8 GB/256 GB SSD/Mac OS Monterey) MLY13HN/A

Home

Computers

Laptops

Apple Laptops

Apple MacBook AIR Apple M2 - (8 GB/256 GB SSD/Mac OS Monterey) MLY13HN/A (13.6 Inch, Starlight, 1.24 kg)

Share

Apple MacBook AIR Apple M2 - (8 GB/256 GB SSD/Mac OS Monterey) MLY13HN/A (13.6 Inch, Starlight, 1.24 kg)

4.7



17,536 Ratings & 1,087 Reviews



₹2,848/month

36 months EMI Plan with BOBCARD

Special price

₹80,990

₹99,900

18% off

+ ₹49 Protect Promise Fee Learn more

Secure delivery by 10 Apr, Thursday

Hurry, Only a few left!

Available offers



Bank Offer5% Unlimited Cashback on Flipkart Axis Bank Credit Card

T&C



Bank Offer10% off on BOBCARD EMI Transactions, up to ₹1,500 on orders of ₹5,000 and above

T&C



Special PriceGet extra 18% off (pri

In [49]:
print(documents[0].metadata)

{'source': 'https://www.flipkart.com/apple-macbook-air-m2-8-gb-256-gb-ssd-mac-os-monterey-mly13hn-a/p/itmd65f6b189fc7b', 'title': 'Apple MacBook AIR Apple M2 - (8 GB/256 GB SSD/Mac OS Monterey) MLY13HN/A Rs.99900 Price in India - Buy Apple MacBook AIR Apple M2 - (8 GB/256 GB SSD/Mac OS Monterey) MLY13HN/A Starlight Online - Apple : Flipkart.com', 'description': 'No description found.', 'language': 'en'}


## Extract Text Data From Multiple Websites

In [53]:
from langchain_community.document_loaders import SeleniumURLLoader
from langchain_groq import ChatGroq
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from google.colab import userdata

API_KEY = userdata.get('GROQ_API_KEY')


loader = SeleniumURLLoader(urls=["https://www.flipkart.com/apple-macbook-air-m2-8-gb-256-gb-ssd-mac-os-monterey-mly13hn-a/p/itmd65f6b189fc7b",
                                 "https://www.amazon.com/Apple-2022-MacBook-Laptop-chip/dp/B0DLHT9VXW/ref=sr_1_1?crid=1O898322LAI1K&dib=eyJ2IjoiMSJ9.VJ5gfxCfTp_MGdHmaWtcu4f8V9Rc4MO8pfenGJSOdBwErNNGh5qZPioINRcPNSjc97KQYlt1mI3OcUW0K8Wk765x8_VMAVOtI3eLjdl5BonhUZlKAnCIWEDUbC6aIC-0wDHxTzxyW_7PDqxWDtqHNyHs8aRjddWWGna6uQb2tO6EbqiGRFjP2byuWYSjbdndv1lzx2ICHug3fyPzOFDk8DSvGaSIZz-42S_QnVs_9Xs.n1DiCL7LkyM1Z6yNiChXngB3i2cwcnoXsKPDYV3PErk&dib_tag=se&keywords=apple%2Bm2%2B15%22%2Bmacbook%2Bair%2B8gb%2Bram%2B256gb%2Bssd&qid=1743689194&sprefix=apple%2Bmacbook%2Bair%2Bm2%2B8gb%2Caps%2C395&sr=8-1&th=1"])
documents = loader.load()

print(len(documents))


llm = ChatGroq(model_name="llama-3.3-70b-versatile",api_key=API_KEY,temperature=0)

prompt = PromptTemplate(template="Answer the following question \n {question} from the following text - \n {text}",
                        input_variables=["question","text"]
                        )

parser = StrOutputParser()

chain = prompt | llm | parser

result = chain.invoke({'question':"what is the product that we are talking about?",'text':documents[0].page_content})

print(result)


2
The product being discussed is the Apple MacBook AIR Apple M2 - (8 GB/256 GB SSD/Mac OS Monterey) MLY13HN/A.
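
The cell above loads two pages but passes only `documents[0]` to the chain. One way to let the model see every loaded page is to join them into a single labelled string first, sketched below. `build_context` is a hypothetical helper, the 2000-character cap is an arbitrary assumption to keep prompts short, and the `SimpleNamespace` objects are stand-ins that mimic a Document's two attributes.

```python
from types import SimpleNamespace


def build_context(docs, max_chars_per_doc: int = 2000) -> str:
    """Join documents into one string, labelling each with its source."""
    sections = []
    for i, doc in enumerate(docs, start=1):
        source = doc.metadata.get("source", "unknown")
        body = doc.page_content[:max_chars_per_doc]  # arbitrary cap (assumption)
        sections.append(f"[Page {i}] {source}\n{body}")
    return "\n\n".join(sections)


# Stand-in objects with the same two attributes as a LangChain Document.
docs = [
    SimpleNamespace(page_content="MacBook Air M2 listing",
                    metadata={"source": "https://example.com/a"}),
    SimpleNamespace(page_content="MacBook Air M2, 15 inch",
                    metadata={"source": "https://example.com/b"}),
]
print(build_context(docs))
```

With real documents, the result could be passed as the `text` input, for example `chain.invoke({'question': ..., 'text': build_context(documents)})`.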


## CSVLoader

### **CSVLoader** is a document loader that loads a CSV file into LangChain Document objects, one per row by default.

In [55]:
from langchain_community.document_loaders import CSVLoader

loader = CSVLoader("/content/sample_data/california_housing_train.csv")

data = loader.load()

print(data[0])

page_content='longitude: -114.310000
latitude: 34.190000
housing_median_age: 15.000000
total_rooms: 5612.000000
total_bedrooms: 1283.000000
population: 1015.000000
households: 472.000000
median_income: 1.493600
median_house_value: 66900.000000' metadata={'source': '/content/sample_data/california_housing_train.csv', 'row': 0}


In [56]:
print(len(data))

17000


## This means the dataset contains 17,000 rows, one Document per row.

In [58]:
print(data[16999])

page_content='longitude: -124.350000
latitude: 40.540000
housing_median_age: 52.000000
total_rooms: 1820.000000
total_bedrooms: 300.000000
population: 806.000000
households: 270.000000
median_income: 3.014700
median_house_value: 94600.000000' metadata={'source': '/content/sample_data/california_housing_train.csv', 'row': 16999}
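
The per-row format shown above (one `column: value` line per field, plus `source` and `row` metadata) can be approximated with the standard `csv` module. This is an illustrative sketch rather than CSVLoader's actual code. `rows_to_documents` is a hypothetical helper that returns plain dicts, and the sample CSV is made up.

```python
import csv
import io


def rows_to_documents(csv_text: str, source: str) -> list[dict]:
    """Turn each CSV row into a dict with 'page_content' and 'metadata'."""
    documents = []
    for i, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append(
            {"page_content": content, "metadata": {"source": source, "row": i}}
        )
    return documents


sample = "longitude,latitude\n-114.31,34.19\n-124.35,40.54\n"
docs = rows_to_documents(sample, source="sample.csv")
print(len(docs))                # 2
print(docs[0]["page_content"])  # two lines: "longitude: -114.31" and "latitude: 34.19"
```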


## How to create a custom Document Loader?


## Implementation

In [None]:
from typing import AsyncIterator, Iterator

from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document


class CustomDocumentLoader(BaseLoader):
    """An example document loader that reads a file line by line."""

    def __init__(self, file_path: str) -> None:
        """Initialize the loader with a file path.

        Args:
            file_path: The path to the file to load.
        """
        self.file_path = file_path

    def lazy_load(self) -> Iterator[Document]:  # <-- Does not take any arguments
        """A lazy loader that reads a file line by line.

        When you're implementing lazy load methods, you should use a generator
        to yield documents one by one.
        """
        with open(self.file_path, encoding="utf-8") as f:
            line_number = 0
            for line in f:
                yield Document(
                    page_content=line,
                    metadata={"line_number": line_number, "source": self.file_path},
                )
                line_number += 1

    # alazy_load is OPTIONAL.
    # If you leave out the implementation, a default implementation which delegates to lazy_load will be used!
    async def alazy_load(
        self,
    ) -> AsyncIterator[Document]:  # <-- Does not take any arguments
        """An async lazy loader that reads a file line by line."""
        # Requires aiofiles (install with pip)
        # https://github.com/Tinche/aiofiles
        import aiofiles

        async with aiofiles.open(self.file_path, encoding="utf-8") as f:
            line_number = 0
            async for line in f:
                yield Document(
                    page_content=line,
                    metadata={"line_number": line_number, "source": self.file_path},
                )
                line_number += 1

## Test

In [None]:
with open("./meow.txt", "w", encoding="utf-8") as f:
    quality_content = "meow meow🐱 \n meow meow🐱 \n meow😻😻"
    f.write(quality_content)

loader = CustomDocumentLoader("./meow.txt")

In [None]:
## Test out the lazy load interface
for doc in loader.lazy_load():
    print()
    print(type(doc))
    print(doc)

## Reference: https://python.langchain.com/docs/how_to/document_loader_custom/