### Data Ingestion

In [5]:
### document data structure

from langchain_core.documents import Document


In [7]:
doc = Document(
    page_content='This is the main content I am using to create the RAG',
    metadata={
        "source": "example.txt",
        "pages": 1,
        "author": "satyam",
        "date_created": "2025-01-01"
    }
)
doc

Document(metadata={'source': 'example.txt', 'pages': 1, 'author': 'satyam', 'date_created': '2025-01-01'}, page_content='This is the main content I am using to create the RAG')
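
Because `metadata` is an ordinary dictionary, it can later be used to filter documents by source, author, or date. The sketch below illustrates the idea with a small stand-in dataclass so that it runs without any dependencies. `SimpleDoc` and `filter_by_metadata` are hypothetical names, not part of LangChain; the same dictionary lookups should apply to a real `Document`.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDoc:
    """Hypothetical stand-in mirroring Document's two main fields."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def filter_by_metadata(docs, **criteria):
    """Return the docs whose metadata matches every key/value in criteria."""
    return [
        d for d in docs
        if all(d.metadata.get(k) == v for k, v in criteria.items())
    ]


docs = [
    SimpleDoc("intro to python", {"source": "example.txt", "author": "satyam"}),
    SimpleDoc("intro to ml", {"source": "other.txt", "author": "someone"}),
]
matches = filter_by_metadata(docs, author="satyam")
print([d.metadata["source"] for d in matches])
```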

In [9]:
### create the folder for the sample text files

import os
os.makedirs("../data/text_files", exist_ok=True)

In [11]:
sample_texts={
    "../data/text_files/python_intro.txt":"""Python Programming Introduction

- Python is a high-level, interpreted programming language created by Guido van Rossum in 1991.
- It is known for its simple syntax, readability, and beginner-friendly design.
- Python supports multiple programming paradigms including procedural, object-oriented, and functional programming.
- Python uses indentation to define code blocks instead of curly braces.

Core Features of Python

- Easy to learn and use due to clean and readable syntax.
- Interpreted language, meaning code executes line by line.
- Dynamically typed, so variable types are determined at runtime.
- Cross-platform support including Windows, macOS, and Linux.
- Extensive standard library for file handling, networking, OS operations, and more.
- Automatic memory management with garbage collection.
- Large ecosystem of third-party libraries available via pip.

Data Types in Python

- Numeric types: int, float, complex.
- Sequence types: list, tuple, range.
- Text type: str.
- Mapping type: dict.
- Set types: set, frozenset.
- Boolean type: bool.

Control Flow in Python

- Conditional statements: if, elif, else.
- Looping constructs: for loop and while loop.
- Loop control statements: break, continue, pass.
- Exception handling using try, except, finally.

Functions in Python

- Functions are defined using the def keyword.
- Functions can accept positional and keyword arguments.
- Supports default arguments.
- Supports variable-length arguments using *args and **kwargs.
- Lambda functions allow creation of anonymous functions.

Object-Oriented Programming in Python

- Classes are defined using the class keyword.
- Supports inheritance and multiple inheritance.
- Encapsulation is achieved using public, protected, and private conventions.
- Polymorphism allows methods to behave differently based on object context.
- Special methods like __init__, __str__, and __repr__ customize object behavior.

Modules and Packages

- A module is a Python file containing functions and classes.
- Modules can be imported using import keyword.
- Packages are collections of modules organized in directories.
- Virtual environments help isolate dependencies.

Popular Python Libraries

- NumPy for numerical computing.
- Pandas for data analysis.
- Matplotlib and Seaborn for data visualization.
- Flask and Django for web development.
- TensorFlow and PyTorch for machine learning.

Applications of Python

- Web development.
- Data science and analytics.
- Machine learning and artificial intelligence.
- Automation and scripting.
- Desktop application development.
- Game development.
- Cloud and DevOps scripting.

Advantages of Python

- Large and active community support.
- Rich documentation and tutorials.
- Strong integration capabilities with other languages.
- Rapid prototyping and development.
- Highly scalable for small to enterprise-level applications.""",
"../data/text_files/machine_learning.txt":"""Machine Learning Introduction

- Machine Learning is a subset of Artificial Intelligence that enables systems to learn from data without being explicitly programmed.
- It focuses on building models that improve performance as they are exposed to more data.
- Machine Learning algorithms identify patterns and relationships within structured or unstructured data.
- The goal of Machine Learning is to make predictions, classifications, or decisions based on input data.

Types of Machine Learning

- Supervised Learning involves training a model using labeled data.
- Common supervised tasks include regression and classification.
- Unsupervised Learning works with unlabeled data to find hidden patterns.
- Common unsupervised tasks include clustering and dimensionality reduction.
- Semi-supervised Learning uses a combination of labeled and unlabeled data.
- Reinforcement Learning trains agents using rewards and penalties based on actions.

Supervised Learning Algorithms

- Linear Regression is used for predicting continuous values.
- Logistic Regression is used for binary classification problems.
- Decision Trees split data based on feature values.
- Random Forest combines multiple decision trees to improve accuracy.
- Support Vector Machines find optimal hyperplanes for classification.
- K-Nearest Neighbors classifies data based on closest data points.

Unsupervised Learning Algorithms

- K-Means Clustering groups similar data points together.
- Hierarchical Clustering builds nested clusters.
- Principal Component Analysis reduces dimensionality.
- DBSCAN identifies clusters based on density.

Reinforcement Learning Concepts

- Agent interacts with an environment.
- State represents the current situation.
- Action is a decision taken by the agent.
- Reward is feedback from the environment.
- Policy defines the strategy of the agent.
- Q-Learning is a common reinforcement learning algorithm.

Machine Learning Workflow

- Data Collection from databases, APIs, or sensors.
- Data Cleaning to handle missing values and outliers.
- Feature Engineering to create meaningful input variables.
- Data Splitting into training and testing sets.
- Model Selection based on problem type.
- Model Training using training data.
- Model Evaluation using metrics.
- Model Deployment into production.

Evaluation Metrics

- Accuracy measures correct predictions.
- Precision measures positive prediction correctness.
- Recall measures sensitivity to actual positives.
- F1 Score balances precision and recall.
- Mean Squared Error evaluates regression performance.
- ROC-AUC measures classification performance.

Overfitting and Underfitting

- Overfitting occurs when a model learns noise instead of patterns.
- Underfitting occurs when a model fails to capture data patterns.
- Regularization techniques help reduce overfitting.
- Cross-validation improves model generalization.

Popular Machine Learning Libraries

- NumPy for numerical computation.
- Pandas for data manipulation.
- Scikit-learn for classical machine learning algorithms.
- TensorFlow for deep learning.
- PyTorch for deep learning research and production.
- XGBoost for gradient boosting.

Applications of Machine Learning

- Fraud detection in banking systems.
- Recommendation systems in e-commerce.
- Image and speech recognition.
- Medical diagnosis support systems.
- Autonomous vehicles.
- Stock market prediction models.

Advantages of Machine Learning

- Automates decision-making processes.
- Improves accuracy over time with more data.
- Handles large and complex datasets.
- Identifies hidden patterns not visible to humans."""
}

for filepath, content in sample_texts.items():
    # write each sample document to disk as UTF-8
    with open(filepath, 'w', encoding='utf-8') as f:
        f.write(content)

print("sample text files created")

sample text files created
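
A quick check that the files were written correctly can save debugging later, when the loaders read them back. The following is a minimal sketch of a write-then-verify pattern. `write_text_files` is a hypothetical helper, and the demo writes to a temporary directory rather than `../data/text_files`.

```python
import tempfile
from pathlib import Path


def write_text_files(files):
    """Write each {path: content} pair as UTF-8 and return {filename: character count}."""
    sizes = {}
    for path, content in files.items():
        path = Path(path)
        path.parent.mkdir(parents=True, exist_ok=True)  # create missing folders
        path.write_text(content, encoding="utf-8")
        # read the file back to confirm the content round-trips
        sizes[path.name] = len(path.read_text(encoding="utf-8"))
    return sizes


with tempfile.TemporaryDirectory() as tmp:
    demo = {
        Path(tmp) / "text_files" / "a.txt": "hello",
        Path(tmp) / "text_files" / "b.txt": "hello world",
    }
    sizes = write_text_files(demo)

print(sizes)
```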


In [15]:
### read a single text file with TextLoader
from langchain_community.document_loaders import TextLoader

loader = TextLoader('../data/text_files/python_intro.txt', encoding='utf-8')
document = loader.load()  # returns a list with one Document for the whole file
document


[Document(metadata={'source': '../data/text_files/python_intro.txt'}, page_content='Python Programming Introduction\n\n- Python is a high-level, interpreted programming language created by Guido van Rossum in 1991.\n- It is known for its simple syntax, readability, and beginner-friendly design.\n- Python supports multiple programming paradigms including procedural, object-oriented, and functional programming.\n- Python uses indentation to define code blocks instead of curly braces.\n\nCore Features of Python\n\n- Easy to learn and use due to clean and readable syntax.\n- Interpreted language, meaning code executes line by line.\n- Dynamically typed, so variable types are determined at runtime.\n- Cross-platform support including Windows, macOS, and Linux.\n- Extensive standard library for file handling, networking, OS operations, and more.\n- Automatic memory management with garbage collection.\n- Large ecosystem of third-party libraries available via pip.\n\nData Types in Python\n\n- 

In [None]:
### load every text file in a directory with DirectoryLoader
from langchain_community.document_loaders import DirectoryLoader

loader = DirectoryLoader(
    "../data/text_files",
    glob="**/*.txt",  # match .txt files in this folder and all subfolders
    loader_cls=TextLoader,  # loader class used for every matched file
    loader_kwargs={'encoding': 'utf-8'},  # passed to TextLoader's constructor
    show_progress=False
)

document = loader.load()
document

[Document(metadata={'source': '../data/text_files/python_intro.txt'}, page_content='Python Programming Introduction\n\n- Python is a high-level, interpreted programming language created by Guido van Rossum in 1991.\n- It is known for its simple syntax, readability, and beginner-friendly design.\n- Python supports multiple programming paradigms including procedural, object-oriented, and functional programming.\n- Python uses indentation to define code blocks instead of curly braces.\n\nCore Features of Python\n\n- Easy to learn and use due to clean and readable syntax.\n- Interpreted language, meaning code executes line by line.\n- Dynamically typed, so variable types are determined at runtime.\n- Cross-platform support including Windows, macOS, and Linux.\n- Extensive standard library for file handling, networking, OS operations, and more.\n- Automatic memory management with garbage collection.\n- Large ecosystem of third-party libraries available via pip.\n\nData Types in Python\n\n- 
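
The `glob="**/*.txt"` pattern matches `.txt` files in the target directory and in any nested subdirectory. The sketch below reproduces only that matching step with the standard library's `pathlib`, to illustrate the pattern's semantics rather than `DirectoryLoader`'s internals. `find_text_files` and the temporary file names are made up for the demo.

```python
import tempfile
from pathlib import Path


def find_text_files(root, pattern="**/*.txt"):
    """Return files under root that match the glob pattern, as sorted relative paths."""
    root = Path(root)
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.glob(pattern)
        if p.is_file()
    )


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "nested").mkdir()
    (base / "top.txt").write_text("a", encoding="utf-8")
    (base / "nested" / "deep.txt").write_text("b", encoding="utf-8")
    (base / "notes.md").write_text("c", encoding="utf-8")  # wrong extension, not matched
    found = find_text_files(base)

print(found)
```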

In [22]:
### load every PDF in a directory with DirectoryLoader
from langchain_community.document_loaders import PyPDFLoader, PyMuPDFLoader
from langchain_community.document_loaders import DirectoryLoader

loader = DirectoryLoader(
    "../data/pdf",
    glob="**/*.pdf",  # match .pdf files in this folder and all subfolders
    loader_cls=PyMuPDFLoader,  # loader class used for every matched file; PyPDFLoader is an alternative
    show_progress=False
)

document = loader.load()
document


[Document(metadata={'producer': 'Skia/PDF m147 Google Docs Renderer', 'creator': '', 'creationdate': '', 'source': '../data/pdf/objectdetection.pdf', 'file_path': '../data/pdf/objectdetection.pdf', 'total_pages': 1, 'format': 'PDF 1.4', 'title': 'objectdetection', 'author': '', 'subject': '', 'keywords': '', 'moddate': '', 'trapped': '', 'modDate': '', 'creationDate': '', 'page': 0}, page_content='Title: A Comparative Study of CNN-Based Object Detection Models \n \nAbstract: \nThis research analyzes single-stage and two-stage object detection models using a benchmark \nimage dataset. \n \nIntroduction: \nObject detection identifies and localizes objects within images using bounding boxes. \n \nMethodology: \n \nEvaluated Faster R-CNN (two-stage model). \n \nEvaluated YOLO (single-stage model). \n \nCompared inference speed and detection accuracy. \n \nResults: \n \nYOLO achieved faster inference. \n \nFaster R-CNN produced higher accuracy. \n \nConclusion: \nModel selection depends on 

In [None]:
### simple rag pipeline with Groq

from langchain_groq import ChatGroq
import os
from dotenv import load_dotenv

load_dotenv()

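## initialize the Groq LLM
groq_api_key = os.getenv('GROQ_API_KEY')
if not groq_api_key:
    raise ValueError("GROQ_API_KEY is not set; add it to your .env file or environment")

# model availability on Groq changes over time; check that 'gemma2-9b-it' is still served
llm = ChatGroq(groq_api_key=groq_api_key, model_name='gemma2-9b-it', temperature=0.1, max_tokens=1024)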

### Simple rag function
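
One way such a function might look: retrieve the most relevant texts for a question, place them in a prompt, and pass the prompt to the model. This is a sketch under assumptions, not the notebook's final implementation. Retrieval here is naive word overlap rather than embeddings; `tokenize`, `retrieve`, `build_prompt`, and `simple_rag` are hypothetical names; and the model is passed in as any callable that maps a prompt string to an answer string, so the example runs with a stub instead of a Groq call.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, texts, k=2):
    """Rank texts by how many words they share with the question; keep the top k."""
    q = tokenize(question)
    ranked = sorted(texts, key=lambda t: len(q & tokenize(t)), reverse=True)
    return ranked[:k]


def build_prompt(question, context_chunks):
    """Assemble a prompt that asks the model to answer only from the given context."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def simple_rag(question, texts, llm, k=2):
    """Retrieve context, build the prompt, and return the model's answer."""
    chunks = retrieve(question, texts, k=k)
    return llm(build_prompt(question, chunks))


corpus = [
    "Python uses indentation to define code blocks.",
    "K-Means Clustering groups similar data points together.",
    "Random Forest combines multiple decision trees.",
]

# Stub model for the demo: it simply echoes the prompt it receives.
answer = simple_rag("What does K-Means clustering do?", corpus, llm=lambda p: p, k=1)
print(answer)
```

With the `llm` object created above, the stub could be swapped for something like `lambda p: llm.invoke(p).content`, and `corpus` for the `page_content` of the loaded documents.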
