## Data Ingestion 


In [None]:
import os
from typing import List, Dict, Any
from langchain_core.documents import Document
from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    CharacterTextSplitter,
    TokenTextSplitter,
)
from langchain_community.document_loaders import (
    TextLoader,
    DirectoryLoader,
    PyPDFLoader,
    PyMuPDFLoader,
    UnstructuredPDFLoader,
)
print("Setup Completed!")

Setup Completed!


### Use TextLoader from langchain_community.document_loaders to load data from a text file.

In [None]:


# Load one UTF-8 text file; load() returns a list of Document objects.
loader = TextLoader("data/text_files/python_intro.txt", encoding="utf-8")
documents = loader.load()
print(type(documents))
print(documents)

<class 'list'>
[Document(metadata={'source': 'data/text_files/python_intro.txt'}, page_content='Python Programming Introduction\n\nPython is a high-level, interpreted programming language known for its simplicity and readability.\nCreated by Guido van Rossum and first released in 1991, Python has become one of the most popular\nprogramming languages in the world.\n\nPython has various levels, I learnt python using the book "Byte of Python" in 2012 on Python 2, though this was a fantastic book my use of python\nremained confined to writing short scripts in DevOps and I never delved deeper in modular programming. With use of Jupyter notebooks its so different now. \nI remember one of my old scripts where I would take decisions based on IP address structure, where the logic behind each octet was different. I used python extensively on\nDELL iDRAC project to write a lot of middleware in Python, that used other libraries written in C to communicate with hardware. \nWith ML the use of Python

#### Load multiple text files from a directory and create Document objects for each file.

In [7]:
dir_loader = DirectoryLoader(
    "data/text_files",
    glob="**/*.txt",  # match .txt files in this directory and its subdirectories
    loader_cls=TextLoader,
    loader_kwargs={"encoding": "utf-8"},
    show_progress=True,
)
documents = dir_loader.load()

print(f"Loaded {len(documents)} documents")
for i, doc in enumerate(documents):
    print(f"\nDocument {i+1}: ")
    print(f" Source: {doc.metadata['source']}")
    print(f" Length: {len(doc.page_content)} characters")

100%|██████████| 2/2 [00:00<00:00, 2521.37it/s]

Loaded 2 documents

Document 1: 
 Source: data/text_files/python_intro.txt
 Length: 1223 characters

Document 2: 
 Source: data/text_files/machine_learning.txt
 Length: 715 characters
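
The directory loading step can be sketched with the standard library alone. This is only an illustration of the idea (walk a glob pattern, read each file, record its path as `source`), not LangChain's implementation. The helper `load_text_files` and the temporary files are hypothetical names made up for the example, and plain dicts stand in for `Document` objects.

```python
import tempfile
from pathlib import Path


def load_text_files(root, pattern="**/*.txt", encoding="utf-8"):
    """Read every file under root that matches pattern into a plain dict.

    Illustrative stand-in: each dict mirrors the two fields of a Document
    (page_content and metadata), but it is not a LangChain object.
    """
    records = []
    for path in sorted(Path(root).glob(pattern)):
        if path.is_file():
            records.append({
                "page_content": path.read_text(encoding=encoding),
                "metadata": {"source": str(path)},
            })
    return records


# Build a throwaway directory so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    (root / "a.txt").write_text("alpha", encoding="utf-8")
    (root / "nested" / "b.txt").write_text("beta", encoding="utf-8")
    (root / "notes.md").write_text("ignored", encoding="utf-8")

    demo_records = load_text_files(root)

for record in demo_records:
    print(record["metadata"]["source"], len(record["page_content"]))
```

The `**/` prefix makes the pattern recursive, so both the top-level file and the nested one are picked up, while the `.md` file is skipped.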





In [10]:
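# Split the first document on newlines with CharacterTextSplitter and inspect the chunks.
text = documents[0].page_content
print("Character Text Splitter")
char_splitter = CharacterTextSplitter(
    separator="\n",       # split only on single newlines
    chunk_size=200,       # target maximum chunk length, measured by length_function
    chunk_overlap=20,     # characters that may be shared between neighbouring chunks
    length_function=len,
)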

char_chunks = char_splitter.split_text(text)
print(f"Created {len(char_chunks)} chunks")
for chunk in char_chunks:
    print(chunk)
    print("-----------")

Character Text Splitter
Created 8 chunks
Python Programming Introduction
Python is a high-level, interpreted programming language known for its simplicity and readability.
-----------
Created by Guido van Rossum and first released in 1991, Python has become one of the most popular
programming languages in the world.
-----------
Python has various levels, I learnt python using the book "Byte of Python" in 2012 on Python 2, though this was a fantastic book my use of python
-----------
remained confined to writing short scripts in DevOps and I never delved deeper in modular programming. With use of Jupyter notebooks its so different now.
-----------
I remember one of my old scripts where I would take decisions based on IP address structure, where the logic behind each octet was different. I used python extensively on
-----------
DELL iDRAC project to write a lot of middleware in Python, that used other libraries written in C to communicate with hardware.
-----------
With ML the use of Pyt
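
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal pure-Python sketch of the split-then-merge idea: cut the text on one separator, then greedily pack the pieces into chunks, carrying trailing pieces forward when they fit within the overlap budget. It is a simplified illustration under my own assumptions, not the library's code. The function `split_and_merge` and the sample string are hypothetical.

```python
def split_and_merge(text, separator="\n", chunk_size=200, chunk_overlap=20):
    """Split on one separator, then greedily pack the pieces into chunks.

    Simplified illustration: blank pieces are dropped, and a single piece
    longer than chunk_size is kept whole rather than cut further.
    """
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks = []
    current = []  # pieces of the chunk under construction

    def length(parts):
        return len(separator.join(parts))

    for piece in pieces:
        if current and length(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Keep only trailing pieces that fit in the overlap budget
            # and still leave room for the incoming piece.
            while current and (
                length(current) > chunk_overlap
                or length(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)

    if current:
        chunks.append(separator.join(current))
    return chunks


sample = "alpha beta\ngamma\ndelta epsilon\nzeta"
demo_chunks = split_and_merge(sample, chunk_size=20, chunk_overlap=6)
for chunk in demo_chunks:
    print(repr(chunk))
```

In this toy run the piece `gamma` ends the first chunk and also starts the second, which is the overlap at work. Overlap is taken in whole pieces, so it only appears when a trailing piece is short enough to fit the budget.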

In [11]:
# RecursiveCharacterTextSplitter tries the separators in order and falls back to the next one only for pieces that are still larger than chunk_size.
print("Recursive Character Text Splitter")
recursive_splitter = RecursiveCharacterTextSplitter(
    separators = ["\n\n", "\n", " ", ""],
    chunk_size = 200,
    chunk_overlap = 20,
    length_function = len
)

recursive_chunks = recursive_splitter.split_text(text)
print(f"Created {len(recursive_chunks)} chunks")
for chunk in recursive_chunks:
    print(chunk + "\n---")

Recursive Character Text Splitter
Created 10 chunks
Python Programming Introduction
---
Python is a high-level, interpreted programming language known for its simplicity and readability.
Created by Guido van Rossum and first released in 1991, Python has become one of the most popular
---
programming languages in the world.
---
Python has various levels, I learnt python using the book "Byte of Python" in 2012 on Python 2, though this was a fantastic book my use of python
---
remained confined to writing short scripts in DevOps and I never delved deeper in modular programming. With use of Jupyter notebooks its so different now.
---
I remember one of my old scripts where I would take decisions based on IP address structure, where the logic behind each octet was different. I used python extensively on
---
DELL iDRAC project to write a lot of middleware in Python, that used other libraries written in C to communicate with hardware.
---
With ML the use of Python has skyrocketted, Its like in
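
The separator fallback order can be sketched in plain Python as well. This is a deliberately reduced illustration, not the library's algorithm: it shows only the recursion (coarse separator first, finer separators for pieces that are still too long) and leaves out the merging of small pieces and the overlap handling. The function `recursive_split` and the sample text are hypothetical.

```python
def recursive_split(text, separators=("\n\n", "\n", " ", ""), chunk_size=200):
    """Split with the first separator; re-split oversized pieces with the next.

    Simplified illustration: small pieces are not merged back together
    and no overlap is applied.
    """
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        return [text]  # nothing left to try, keep the oversized piece

    separator, rest = separators[0], separators[1:]
    if separator == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    pieces = []
    for part in text.split(separator):
        pieces.extend(recursive_split(part, rest, chunk_size))
    return pieces


sample = "Title\n\nfirst line of body\nsecond line\n\nsupercalifragilistic"
demo_pieces = recursive_split(sample, chunk_size=12)
for piece in demo_pieces:
    print(repr(piece))
```

Paragraph breaks are tried first, so `Title` survives intact. The long body line is broken on spaces only because it did not fit after the newline split, and the single long word is cut by position as a last resort.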