### Step 1: Data Ingestion

In [31]:
### Document Structure

from langchain_core.documents import Document
print("✅ LangChain core loaded successfully!")

✅ LangChain core loaded successfully!


In [32]:
doc = Document(
  page_content="This is the main text content I am using to create a RAG.",
  metadata={
    "source": "example.pdf",
    "pages": 1,
    "author": "Sanjoy Kumar Das",
    "date_created": "2025-10-30"
  }
)
doc

Document(metadata={'source': 'example.pdf', 'pages': 1, 'author': 'Sanjoy Kumar Das', 'date_created': '2025-10-30'}, page_content='This is the main text content I am using to create a RAG.')

In [33]:
### Create the directory that will hold the sample text files

import os
os.makedirs("../data/text_files", exist_ok=True)


In [34]:
sample_texts = {
  "../data/text_files/python_intro.txt": """Python is a high-level, interpreted, general-purpose programming language known for its readability and versatility. Created by Guido van Rossum and first released in 1991, it has become one of the most popular programming languages globally.
Key characteristics of Python:
High-level: Python abstracts away low-level details like memory management, allowing developers to focus on problem-solving.
Interpreted: Python code is executed line by line by an interpreter, rather than being compiled into machine code beforehand. This enables a fast edit-test-debug cycle.
Object-oriented: Python supports object-oriented programming (OOP), allowing for the creation of reusable code and modular program design. It also supports other paradigms like procedural and functional programming.
Readable syntax: Python's syntax is designed to be clear and concise, often using English-like keywords and fewer syntactical constructions compared to other languages. This enhances code readability and maintainability.
Versatile and general-purpose: Python is not specialized for any particular domain and can be used for a wide range of applications, including:
Web development (with frameworks like Django and Flask)
Data science and machine learning (with libraries like NumPy, Pandas, Scikit-learn, and TensorFlow)
Automation and scripting
Software development
Scientific computing
Game development
Extensive standard library and ecosystem: Python boasts a rich standard library with pre-built modules for various tasks, reducing the need to write code from scratch. It also has a vast ecosystem of third-party libraries and frameworks.
Cross-platform: Python applications can run on various operating systems, including Windows, macOS, and Linux, without significant modifications.
Large and active community: Python benefits from a large and supportive community, providing ample resources, documentation, and assistance for learners and developers.""",

"../data/text_files/machine_learning.txt": """ 
Machine learning is a branch of artificial intelligence where computers learn from data to find patterns and make decisions with minimal human intervention. It uses algorithms to analyze data and build models that can then be used to make predictions or classify new information without being explicitly programmed for every task. This technology powers applications like recommendation systems, fraud detection, and image recognition. 

How it works
Algorithms and data: Algorithms are fed large datasets to learn from. This training process involves identifying patterns and relationships within the data.
Model creation: The result of the training process is a model. An algorithm is a set of rules, while a model is the output that can be used to perform a task.
Predictions and refinement: Once trained, the model can make predictions or decisions on new, unseen data. As it receives more data, it can refine its performance and improve its accuracy over time, similar to how humans improve with practice. 
Key characteristics and benefits
Handles massive data: Machine learning can process vast amounts of data, finding patterns that humans might miss.
Adapts dynamically: Systems can evolve and adapt as new data becomes available.
Drives smarter decisions: It provides data-driven insights for tasks like predicting customer behavior or detecting fraud.
Personalizes experiences: It is used to tailor suggestions for users, such as in streaming or e-commerce services. 

Applications
Image and speech recognition: Enabling computers to understand and interpret visual or audio information.
Natural Language Processing (NLP): Allowing computers to understand and process human language.
Recommendation systems: Suggesting products, movies, or music based on user history.
Fraud detection: Identifying suspicious activities in financial transactions.
Healthcare: Analyzing medical images and patient data to assist with diagnosis."""
}

for filepath, content in sample_texts.items():
  with open(filepath, "w", encoding="utf-8") as f:
    f.write(content)


print("Files created...")

Files created...
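
As a quick sanity check before loading, the files just written can be listed with their sizes. This is a minimal sketch using only the standard library; the helper name `summarize_text_files` is hypothetical, and the directory in the usage comment is the one assumed by the cells above.

```python
from pathlib import Path


def summarize_text_files(directory):
    """Return (file name, character count) pairs for every .txt file in `directory`."""
    summary = []
    for path in sorted(Path(directory).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        summary.append((path.name, len(text)))
    return summary


# Usage in this notebook (assumes the cell above has already run):
# for name, n_chars in summarize_text_files("../data/text_files"):
#     print(f"{name}: {n_chars} characters")
```

An empty result would mean the directory path is wrong or the write step did not run.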


In [35]:
### Text Loader
from langchain_community.document_loaders import TextLoader
loader = TextLoader("../data/text_files/python_intro.txt", encoding="utf-8")
document = loader.load()
print(document)

[Document(metadata={'source': '../data/text_files/python_intro.txt'}, page_content="Python is a high-level, interpreted, general-purpose programming language known for its readability and versatility. Created by Guido van Rossum and first released in 1991, it has become one of the most popular programming languages globally.\nKey characteristics of Python:\nHigh-level: Python abstracts away low-level details like memory management, allowing developers to focus on problem-solving.\nInterpreted: Python code is executed line by line by an interpreter, rather than being compiled into machine code beforehand. This enables a fast edit-test-debug cycle.\nObject-oriented: Python supports object-oriented programming (OOP), allowing for the creation of reusable code and modular program design. It also supports other paradigms like procedural and functional programming.\nReadable syntax: Python's syntax is designed to be clear and concise, often using English-like keywords and fewer syntactical c

In [36]:
### Directory Loader

from langchain_community.document_loaders import DirectoryLoader

## Load all the text files from the directory

dir_loader = DirectoryLoader(
  "../data/text_files",
  glob="**/*.txt", ## Pattern to match files
  loader_cls=TextLoader, ## loader class to use
  loader_kwargs={'encoding': 'utf-8'},
  show_progress=False
)

documents = dir_loader.load()
documents



[Document(metadata={'source': '..\\data\\text_files\\machine_learning.txt'}, page_content=' \nMachine learning is a branch of artificial intelligence where computers learn from data to find patterns and make decisions with minimal human intervention. It uses algorithms to analyze data and build models that can then be used to make predictions or classify new information without being explicitly programmed for every task. This technology powers applications like recommendation systems, fraud detection, and image recognition. \n\nHow it works\nAlgorithms and data: Algorithms are fed large datasets to learn from. This training process involves identifying patterns and relationships within the data.\nModel creation: The result of the training process is a model. An algorithm is a set of rules, while a model is the output that can be used to perform a task.\nPredictions and refinement: Once trained, the model can make predictions or decisions on new, unseen data. As it receives more data, i
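
Displaying the full list floods the notebook with text, so a compact summary is often easier to read. The sketch below assumes only that each item exposes `.metadata` (a dict with a `source` key) and `.page_content`, as the `Document` objects above do. `preview_documents` is a hypothetical helper, not part of LangChain. It also converts the Windows-style backslashes seen in `source` to forward slashes, so the summary looks the same on every platform.

```python
from pathlib import PureWindowsPath


def preview_documents(docs, width=60):
    """Return one summary line per document: source, length, and a short excerpt."""
    lines = []
    for d in docs:
        # PureWindowsPath accepts both "\\" and "/" separators, so as_posix()
        # yields forward slashes regardless of where the notebook was run.
        source = PureWindowsPath(d.metadata.get("source", "unknown")).as_posix()
        # Collapse runs of whitespace so the excerpt stays on one line.
        excerpt = " ".join(d.page_content.split())[:width]
        lines.append(f"{source} | {len(d.page_content)} chars | {excerpt}")
    return lines


# Usage in this notebook:
# for line in preview_documents(documents):
#     print(line)
```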

In [37]:
from langchain_community.document_loaders import PyPDFLoader, PyMuPDFLoader
## Load all the PDF files from the directory

dir_pdf_loader = DirectoryLoader(
  "../data/pdf",
  glob="**/*.pdf", ## Pattern to match files
  loader_cls=PyMuPDFLoader, ## loader class to use
  show_progress=False
)

pdf_documents = dir_pdf_loader.load()
pdf_documents

[Document(metadata={'producer': 'Microsoft® Word 2013', 'creator': 'Microsoft® Word 2013', 'creationdate': '2025-10-30T21:23:34-04:00', 'source': '..\\data\\pdf\\Artificial intelligence.pdf', 'file_path': '..\\data\\pdf\\Artificial intelligence.pdf', 'total_pages': 1, 'format': 'PDF 1.5', 'title': '', 'author': 'Microsoft account', 'subject': '', 'keywords': '', 'moddate': '2025-10-30T21:23:34-04:00', 'trapped': '', 'modDate': "D:20251030212334-04'00'", 'creationDate': "D:20251030212334-04'00'", 'page': 0}, page_content='Artificial intelligence (AI) is the ability of machines to perform tasks that typically \nrequire human intelligence, such as learning, reasoning, problem-solving, and \ndecision-making. AI systems achieve this by processing vast amounts of data to \nidentify patterns and adapt their behavior, allowing them to understand and respond \nto human language, recognize objects, and make predictions.  \nHow AI works \n\uf0b7 \nLearning from data: Instead of being explicitly p
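
The metadata above includes `source`, `page`, and `total_pages`, which suggests the loader emits one `Document` per PDF page. Under that assumption, a small helper can confirm how many pages were loaded from each file. `pages_per_source` is a hypothetical name, and the function relies only on the `source` metadata key visible in the output.

```python
from collections import Counter


def pages_per_source(docs):
    """Count how many loaded documents (pages) came from each source file."""
    return dict(Counter(d.metadata.get("source", "unknown") for d in docs))


# Usage in this notebook:
# for source, n_pages in pages_per_source(pdf_documents).items():
#     print(f"{source}: {n_pages} page(s)")
```

Comparing these counts against each file's `total_pages` value is a cheap way to spot pages that failed to load.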

### Step 2: Chunking