### Intro to Data Ingestion

In [13]:
import os
from typing import List, Dict, Any
import pandas as pd

In [None]:
from langchain_core.documents import Document
from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    CharacterTextSplitter,
    TokenTextSplitter,
)

### Understanding document structure

In [15]:
# Create a Document: page_content holds the text, metadata describes it

doc = Document(
    page_content="main content that will be embedded and searched",
    metadata = {
        "source":"example.text",
        "page":1,
        "author": "shibu thomas",
        "create_date":"2026-01-01"
    }
)

### Text file

In [18]:
# creating a text file

import os
os.makedirs("data/textfiles", exist_ok=True)


sample_texts = {"data/textfiles/intro.txt":"""



What is Python?

Python is a high-level, easy-to-read programming language used for:

Web development 🌐

Automation & scripting 🤖

Data analysis & AI 🧠

Game development 🎮

It’s popular because it looks almost like English and lets you build things fast.








""", "data/textfiles/mlbasics.txt": """

Machine Learning (ML) is a subset of artificial intelligence that focuses on building systems that can learn from data and improve their performance over time. Instead of relying on hard-coded rules, ML algorithms analyze patterns in historical data to make predictions, classifications, or decisions. This approach makes ML especially useful for problems where rules are complex or hard to define manually.

There are three main types of machine learning: supervised learning, unsupervised learning, and reinforcement learning. In supervised learning, models are trained using labeled data, such as predicting house prices or detecting spam emails. Unsupervised learning works with unlabeled data to find hidden patterns, like customer segmentation. Reinforcement learning involves learning through trial and error, commonly used in robotics and game-playing AI.

The machine learning process typically involves collecting data, cleaning and preparing it, choosing a suitable algorithm, training the model, and evaluating its performance. Good data quality is often more important than complex algorithms, as poor data can lead to inaccurate results regardless of the model used.







"""}


for filepath,content in sample_texts.items():
    with open(filepath, 'w', encoding='utf-8') as f:
        f.write(content)



### Text loader - single file

In [21]:
from langchain_community.document_loaders import TextLoader


loader = TextLoader("data/textfiles/intro.txt", encoding="utf-8")

document = loader.load()

print(document)





[Document(metadata={'source': 'data/textfiles/intro.txt'}, page_content='\n\n\n\nWhat is Python?\n\nPython is a high-level, easy-to-read programming language used for:\n\nWeb development 🌐\n\nAutomation & scripting 🤖\n\nData analysis & AI 🧠\n\nGame development 🎮\n\nIt’s popular because it looks almost like English and lets you build things fast.\n\n\n\n\n\n\n\n\n')]
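
The loading step above can be approximated in plain Python: read the file with the given encoding and pair the text with a `source` entry. This is only a sketch under stated assumptions, not `TextLoader`'s actual implementation; `load_text_file` is a hypothetical helper and a plain dict stands in for `Document`.

```python
import os
import tempfile


def load_text_file(path: str, encoding: str = "utf-8") -> dict:
    """Hypothetical stand-in for a text loader: content plus source metadata."""
    with open(path, "r", encoding=encoding) as f:
        text = f.read()
    return {"page_content": text, "metadata": {"source": path}}


# Write a small file to a temporary directory so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "intro.txt")
with open(sample_path, "w", encoding="utf-8") as f:
    f.write("What is Python?\n")

record = load_text_file(sample_path)
print(record["metadata"])
print(repr(record["page_content"]))
```

Passing the encoding explicitly matters for files that contain emoji or curly quotes, as the sample text here does.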


### Text loader - multiple files

In [23]:
from langchain_community.document_loaders import DirectoryLoader


dir_loader = DirectoryLoader(
    "data/textfiles", glob="**/*.txt", 
    loader_cls= TextLoader, 
    show_progress= True,
    loader_kwargs= {"encoding":"utf-8"}
)



documents = dir_loader.load()










100%|██████████| 2/2 [00:00<00:00, 1268.69it/s]
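
To preview which files a pattern such as `**/*.txt` selects, the same pattern can be tried with `pathlib`. The sketch below assumes the loader's matching behaves like `Path.glob` for this pattern; the directory tree and file names are made up for illustration.

```python
import tempfile
from pathlib import Path

# Build a throwaway directory tree (hypothetical file names, illustration only).
root = Path(tempfile.mkdtemp())
(root / "nested").mkdir()
(root / "intro.txt").write_text("intro", encoding="utf-8")
(root / "nested" / "mlbasics.txt").write_text("ml", encoding="utf-8")
(root / "notes.md").write_text("not matched", encoding="utf-8")

# "**/*.txt" matches .txt files in the root and in any subdirectory.
matched = sorted(p.relative_to(root).as_posix() for p in root.glob("**/*.txt"))
print(matched)
```

The `.md` file is skipped, while the `.txt` file in the subdirectory is picked up because `**` also matches nested directories.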


### Text splitting strategies

In [50]:
from langchain_text_splitters import(RecursiveCharacterTextSplitter,
                                    CharacterTextSplitter,
                                    TokenTextSplitter)


# Method 1: CharacterTextSplitter

text = documents[0].page_content

char_splitter = CharacterTextSplitter(
    separator="\n",       # the only place where the text may be cut
    chunk_size=20,        # target maximum chunk length, in characters
    chunk_overlap=5,      # characters shared between consecutive chunks
    length_function=len,  # how chunk length is measured
)

char_chunks = char_splitter.split_text(text)

for chunk in char_chunks:
    print(chunk)



Created a chunk of size 67, which is longer than the specified 20
Created a chunk of size 24, which is longer than the specified 20


What is Python?
Python is a high-level, easy-to-read programming language used for:
Web development 🌐
Automation & scripting 🤖
Data analysis & AI 🧠
Game development 🎮
It’s popular because it looks almost like English and lets you build things fast.
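
The "longer than the specified 20" warnings appear because the text is only ever cut at the separator: a line with no `"\n"` inside it cannot be shortened, however small `chunk_size` is. A simplified sketch of that split-then-merge idea follows. It omits overlap, `split_on_separator` is a hypothetical helper, and it is not the library's implementation.

```python
def split_on_separator(text: str, separator: str, chunk_size: int) -> list[str]:
    """Cut at `separator`, then greedily merge pieces up to `chunk_size`."""
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single piece longer than chunk_size is kept whole: there is
            # no separator inside it at which to cut.
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = (
    "What is Python?\n"
    "Python is a high-level, easy-to-read programming language used for:\n"
    "Web development"
)
chunks = split_on_separator(sample, separator="\n", chunk_size=20)
for chunk in chunks:
    print(len(chunk), chunk)
```

The middle line comes out at 67 characters, which matches the size reported in the first warning above.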


### Recursive character text splitter

In [48]:
text = documents[0].page_content

recursive_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", "  ", " ", ""],  # tried in order, coarsest first
    chunk_size=20,        # target maximum chunk length, in characters
    chunk_overlap=5,      # characters shared between consecutive chunks
    length_function=len,  # how chunk length is measured
)

recursive_chunks = recursive_splitter.split_text(text)

for chunk in recursive_chunks:
    print(chunk)

What is Python?
Python is a
is a high-level,
easy-to-read
programming
language used for:
Web development 🌐
Automation &
& scripting 🤖
Data analysis & AI
& AI 🧠
Game development 🎮
It’s popular
because it looks
almost like English
and lets you build
things fast.
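
Unlike the single-separator splitter, the recursive variant falls back to a finer separator whenever a piece is still too long, which is why no chunk above exceeds 20 characters. The idea can be sketched as below. This is a simplified illustration without overlap, `recursive_split` is a hypothetical helper, and its chunk boundaries will not match the library's output exactly.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split on the coarsest separator first; recurse on pieces that are too long."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    head, rest = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(head):
        if not piece.strip():
            continue
        if len(piece) > chunk_size:
            # Flush what has been merged so far, then retry with finer separators.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, rest, chunk_size))
            continue
        candidate = piece if not current else current + head + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = (
    "What is Python?\n\n"
    "Python is a high-level, easy-to-read programming language used for:"
)
result = recursive_split(sample, ["\n\n", "\n", " "], chunk_size=20)
for chunk in result:
    print(chunk)
```

Ordering the separators from paragraphs down to single spaces keeps related text together for as long as the size limit allows.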


### Token text splitter

In [52]:
text = documents[0].page_content

token_splitter = TokenTextSplitter(
    chunk_size=20,    # maximum chunk length, in tokens (not characters)
    chunk_overlap=5,  # tokens shared between consecutive chunks
)

token_chunks = token_splitter.split_text(text)

for chunk in token_chunks:
    print(chunk)





What is Python?

Python is a high-level, easy-to-
, easy-to-read programming language used for:

Web development üåê


 üåê

Automation & scripting ü§ñ

Data analysis & AI ÔøΩ
 analysis & AI üß†

Game development üéÆ

It‚Äôs popular
It‚Äôs popular because it looks almost like English and lets you build things fast.




 things fast.









