## Introduction to Data Ingestion


In [2]:
from typing import List, Dict, Any
import pandas as pd

In [3]:
from langchain_core.documents import Document
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    CharacterTextSplitter,
    TokenTextSplitter
)

## Understanding the Document Data Structure in LangChain

In [None]:
## Simple document creation
doc = Document(
    page_content = "This is the main content that will be embedded and searched.",
    metadata = {
        "source": "example.txt",
        "author": "satish kumar",
        "page": 1
    }
)

print("Document Structure")
print(f"Document Content: {doc.page_content}")
print(f"Document metadata: {doc.metadata}")

Document Structure
Document Content: This is the main content that will be embedded and searched.
Document metadata: {'source': 'example.txt', 'author': 'satish kumar', 'page': 1}
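
To make the structure concrete, here is a minimal stand-in built from a plain dataclass. `SimpleDocument` and `filter_by_source` are hypothetical names used only for illustration; they are not part of LangChain, and the real `Document` class does more than this sketch. The point is that metadata travels with the text, so it can be used later for filtering or attribution.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a document: text plus a metadata dict."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def filter_by_source(docs: List[SimpleDocument], source: str) -> List[SimpleDocument]:
    """Return only the documents whose metadata 'source' matches."""
    return [d for d in docs if d.metadata.get("source") == source]


docs = [
    SimpleDocument("Python basics.", {"source": "example.txt", "page": 1}),
    SimpleDocument("ML basics.", {"source": "other.txt", "page": 1}),
]
matches = filter_by_source(docs, "example.txt")
print(len(matches), matches[0].page_content)
```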


## Text Files (.txt) - The Simplest Case {#2-text-files}

In [None]:
## Create a simple text file
import os
os.makedirs("data/text_files", exist_ok=True)

In [10]:
sample_txts = {
    "data/text_files/python_intro.txt" : """
### **Introduction to Python**

Python is a **high-level, interpreted, and general-purpose programming language** known for its simplicity, readability, and versatility. It was created by **Guido van Rossum** and released in **1991**.

---

### **Key Features of Python**

1. **Easy to Learn & Readable**

   * Syntax is similar to English, making it beginner-friendly.

2. **Interpreted Language**

   * No need to compile code; it runs line by line using the Python interpreter.

3. **Dynamically Typed**

   * You don’t need to declare variable types explicitly.

   ```python
   x = 10   # Integer
   x = "Hello"  # Now a string (no type declaration required)
   ```

4. **Cross-Platform**

   * Works on Windows, macOS, Linux, and more.

5. **Extensive Standard Library**

   * Provides built-in modules for file handling, math, networking, etc.

6. **Supports Multiple Paradigms**

   * Procedural, Object-Oriented, and Functional programming.

---

### **Basic Python Example**

```python
# Hello World Program
print("Hello, World!")

# Variables
name = "Alice"
age = 25
print(f"My name is {name} and I am {age} years old.")

# Simple Function
def greet(name):
    return f"Hello, {name}!"

print(greet("Bob"))
```

---

### **Where is Python Used?**

* Web Development (e.g., Django, Flask)
* Data Science & Machine Learning (e.g., NumPy, Pandas, Scikit-learn)
* Automation & Scripting
* Game Development
* IoT (Internet of Things)
* Cybersecurity & Networking
* Artificial Intelligence (AI)

---

Would you like me to create a **beginner-friendly Python learning roadmap** (syntax → data types → loops → OOP → projects) or a **crash course with examples**? Or both?

""",
"data/text_files/machinelearning_intro.txt": """
  ### **Introduction to Machine Learning (ML)**

**Machine Learning (ML)** is a subset of Artificial Intelligence (AI) that focuses on developing algorithms that enable computers to **learn from data** and make predictions or decisions without being explicitly programmed.

---

### **Key Concepts**

1. **Data**

   * ML models learn from historical data to identify patterns and make predictions.

2. **Model**

   * A mathematical representation that maps inputs to outputs based on training data.

3. **Training**

   * The process of teaching the model by feeding it data and adjusting parameters to minimize errors.

4. **Prediction / Inference**

   * Once trained, the model predicts outcomes for new, unseen data.

---

### **Types of Machine Learning**

1. **Supervised Learning**

   * Model is trained with labeled data (input + output).
   * Examples:

     * Predicting house prices (regression).
     * Classifying emails as spam or not (classification).

2. **Unsupervised Learning**

   * Model learns patterns from unlabeled data (no output provided).
   * Examples:

     * Customer segmentation (clustering).
     * Dimensionality reduction (PCA).

3. **Reinforcement Learning**

   * Model learns by interacting with an environment and receiving rewards or penalties.
   * Examples:

     * Training robots to walk.
     * Game-playing AI (e.g., AlphaGo).

---

### **Common Algorithms**

* **Linear Regression** – Predicts continuous values.
* **Logistic Regression** – Used for classification.
* **Decision Trees & Random Forests** – Tree-based decision models.
* **Support Vector Machines (SVM)** – Separates data using hyperplanes.
* **Neural Networks** – Basis of deep learning for complex data like images, speech.

---

### **Applications of Machine Learning**

* Recommendation systems (Netflix, Amazon)
* Fraud detection in banking
* Predictive maintenance in industries
* Autonomous vehicles
* Healthcare diagnostics

---

Would you like me to provide a **step-by-step roadmap to learn ML** (from Python basics → data handling → ML algorithms → projects), or a **mini crash course with Python code examples**? Or both?

"""
}



for filepath, content in sample_txts.items():
    with open(filepath, 'w', encoding="utf-8") as f:
        f.write(content)

print("Text files are successfully created!!!")

Text files are successfully created!!!


## TextLoader - Read Single File

In [17]:
from langchain.document_loaders import TextLoader

loader = TextLoader("data/text_files/python_intro.txt", encoding="utf-8")

document = loader.load()

print(document)

[Document(metadata={'source': 'data/text_files/python_intro.txt'}, page_content='\n### **Introduction to Python**\n\nPython is a **high-level, interpreted, and general-purpose programming language** known for its simplicity, readability, and versatility. It was created by **Guido van Rossum** and released in **1991**.\n\n---\n\n### **Key Features of Python**\n\n1. **Easy to Learn & Readable**\n\n   * Syntax is similar to English, making it beginner-friendly.\n\n2. **Interpreted Language**\n\n   * No need to compile code; it runs line by line using the Python interpreter.\n\n3. **Dynamically Typed**\n\n   * You don’t need to declare variable types explicitly.\n\n   ```python\n   x = 10   # Integer\n   x = "Hello"  # Now a string (no type declaration required)\n   ```\n\n4. **Cross-Platform**\n\n   * Works on Windows, macOS, Linux, and more.\n\n5. **Extensive Standard Library**\n\n   * Provides built-in modules for file handling, math, networking, etc.\n\n6. **Supports Multiple Paradigms

## DirectoryLoader - Multiple Text Files

In [20]:
from langchain.document_loaders import DirectoryLoader

## Load all files in a directory

dir_loader = DirectoryLoader(
    "data/text_files",
    glob="**/*.txt",  # Pattern to match files, including subdirectories
    loader_cls=TextLoader,  # Loader class applied to each matched file
    loader_kwargs={"encoding": "utf-8"},
    show_progress=True
)

documents = dir_loader.load()

print(f"Loaded {len(documents)} documents")

for i, doc in enumerate(documents):
    print(f"\n Document {i+1}")
    print(f" Source: {doc.metadata['source']}")
    print(f" Length: {len(doc.page_content)} characters")

100%|██████████| 2/2 [00:00<00:00, 1257.29it/s]

Loaded 2 documents

 Document 1
 Source: data/text_files/python_intro.txt
 Length: 1677 characters

 Document 2
 Source: data/text_files/machinelearning_intro.txt
 Length: 2152 characters
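
The glob-then-load pattern above can be sketched with the standard library alone. This is a simplified illustration, not how `DirectoryLoader` is implemented: `load_text_dir` is a hypothetical helper, and it returns plain dicts rather than `Document` objects. It shows that `**/*.txt` matches text files at any depth while skipping other extensions.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def load_text_dir(root: str, pattern: str = "**/*.txt") -> List[Dict[str, object]]:
    """Read every file matching `pattern` under `root` into a content/metadata dict."""
    records = []
    for path in sorted(Path(root).glob(pattern)):
        if path.is_file():
            records.append({
                "page_content": path.read_text(encoding="utf-8"),
                "metadata": {"source": str(path)},
            })
    return records


# Build a throwaway directory so the example does not depend on data/text_files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "nested").mkdir()
    (Path(tmp) / "a.txt").write_text("alpha", encoding="utf-8")
    (Path(tmp) / "nested" / "b.txt").write_text("beta", encoding="utf-8")
    (Path(tmp) / "notes.md").write_text("ignored", encoding="utf-8")
    loaded = load_text_dir(tmp)

print(f"Loaded {len(loaded)} files")
for record in loaded:
    print(len(record["page_content"]), "characters")
```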





## Text Splitting Strategies


In [28]:
## Different document text splitters

from langchain.text_splitter import (
    CharacterTextSplitter,
    TokenTextSplitter,
    RecursiveCharacterTextSplitter
)

In [None]:
## Method 1: Character based splitting

text = documents[0].page_content

print(" CHARACTER TEXT SPLITTER ")

# The separator is applied once, not recursively.
# If a single segment exceeds chunk_size, it is kept whole as an oversized chunk
# (with a warning), because there is no smaller separator to fall back on.
# More rigid than RecursiveCharacterTextSplitter, which better preserves logical boundaries.

char_splitter = CharacterTextSplitter(
    separator="\n", # Split on new lines
    chunk_size=200, # Max chunk size in characters
    chunk_overlap=20, # Overlap between chunks
    length_function=len # How to measure chunk size
)

char_chunks = char_splitter.split_text(text)
# For a list of Document objects, use: char_splitter.split_documents(documents)

print(f"Created {len(char_chunks)} chunks")
print(f"First Chunk: {char_chunks[0][:100]} ...")

Created a chunk of size 202, which is longer than the specified 200


 CHARACTER TEXT SPLITTER 
Created 10 chunks
First Chunk: ### **Introduction to Python** ...


In [36]:
print(char_chunks[0])
print("---------------------")
print(char_chunks[1])

### **Introduction to Python**
---------------------
Python is a **high-level, interpreted, and general-purpose programming language** known for its simplicity, readability, and versatility. It was created by **Guido van Rossum** and released in **1991**.
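
The warning about a 202-character chunk follows from splitting once on the separator and then merging pieces greedily. The sketch below is a simplified model of that behavior, assuming no overlap; `split_on_separator` is a hypothetical helper, not LangChain's code. A piece that is longer than `chunk_size` on its own stays whole because there is nothing smaller to split on.

```python
from typing import List


def split_on_separator(text: str, separator: str, chunk_size: int) -> List[str]:
    """Split once on `separator`, then greedily merge pieces up to `chunk_size`.

    Simplified illustration: no overlap handling, and a piece longer than
    `chunk_size` is kept whole because there is no smaller separator to try.
    """
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece  # may exceed chunk_size on its own
    if current:
        chunks.append(current)
    return chunks


sample = "short line\n" + "x" * 30 + "\nanother short line\nlast"
chunks = split_on_separator(sample, "\n", chunk_size=20)
for c in chunks:
    print(len(c), repr(c[:20]))
```

Here the 30-character line ends up as its own chunk even though `chunk_size` is 20, which mirrors the oversized paragraph in the output above.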


In [None]:
## Method 2: Recursive Character Splitting

# RecursiveCharacterTextSplitter applies the separator hierarchy first;
# chunk_size only decides whether a piece needs further splitting.
# Order of operations:
#   1. Start with the largest separator (e.g., \n\n for paragraphs).
#   2. Split the text into pieces using that separator.
#   3. Check whether each piece is smaller than or equal to chunk_size:
#        - If yes, keep it (small adjacent pieces are merged up to chunk_size).
#        - If no, recursively split that piece using the next smaller separator (e.g., \n for lines).
#   4. Repeat until a piece fits within chunk_size, or no separators remain,
#      in which case the final fallback is splitting purely by character count.

recursive_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=200,
    chunk_overlap=20,
    length_function=len
)

recursive_chunks = recursive_splitter.split_text(text)

print(f"Created {len(recursive_chunks)} chunks")
print(f"First Chunk: {recursive_chunks[0][:100]}")

Created 13 chunks
First Chunk: ### **Introduction to Python**
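
The order of operations described in the cell above can be sketched in a few lines. This is a simplified model under stated assumptions: `recursive_split` is a hypothetical function, it ignores `chunk_overlap`, and it is not LangChain's implementation. It does show the core idea: try the largest separator, merge small neighbors, and recurse with the next separator only on pieces that are still too long.

```python
from typing import List


def recursive_split(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Split on the first separator, merging small pieces; recurse on oversized ones."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Final fallback: cut purely by character count.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small neighbors
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Still too long: retry with the next, smaller separator.
            chunks.extend(recursive_split(piece, rest, chunk_size))
    if current:
        chunks.append(current)
    return chunks


sample = "Para one is short.\n\nPara two has a first line\nand a second line that is longer."
result = recursive_split(sample, ["\n\n", "\n", " ", ""], chunk_size=30)
for chunk in result:
    print(len(chunk), repr(chunk))
```

Unlike the single-separator approach, no chunk here exceeds `chunk_size`: the long second line is broken at a space instead of being kept whole.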


In [None]:
## Method 3: Token-based splitting

# A token is the unit of text a language model processes. It is not the same as a word or a character; it can be:
#   - a whole word (e.g., "cat")
#   - part of a word (e.g., "comp" and "uter" from "computer")
#   - punctuation or whitespace (e.g., "," or " ")

token_splitter = TokenTextSplitter(
    chunk_size=50, # Size in tokens (not characters)
    chunk_overlap=10
)

token_chunks = token_splitter.split_text(text)
print(f"Created {len(token_chunks)} chunks")
print(f"First Chunk: {token_chunks[0][:100]}")

Created 13 chunks
First Chunk: 
### **Introduction to Python**

Python is a **high-level, interpreted, and general-purpose programm
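
Token-based splitting amounts to sliding a fixed-size window over the token sequence, advancing by `chunk_size - chunk_overlap` each step. The sketch below illustrates that windowing only. It assumes whitespace-separated words as stand-in tokens, whereas a real tokenizer produces sub-word units, and `window_tokens` is a hypothetical helper rather than a LangChain function.

```python
from typing import List


def window_tokens(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Slide a window of `chunk_size` tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows: List[List[str]] = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return windows


# Whitespace "tokens" are a stand-in; a real tokenizer produces sub-word units.
tokens = "one two three four five six seven eight nine ten".split()
windows = window_tokens(tokens, chunk_size=4, chunk_overlap=1)
for w in windows:
    print(" ".join(w))
```

Each window starts with the last token of the previous one, which is the overlap that helps preserve context across chunk boundaries.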
