## **Text Splitting (Chunking) from Scratch**

### **Chunking Function**

In [1]:
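def fixed_length_chunking(text, chunk_size):
    """Split text into chunks of at most `chunk_size` words each."""
    # Split on any whitespace to get a flat list of words
    words = text.split()
    # Walk the word list in strides of chunk_size and re-join each slice
    chunks = [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
    return chunks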

In [2]:
# Example Usage
TEXT = "Machine learning is a field of AI that enables computers to learn from data. Deep learning is a subset of machine learning focused on neural networks. These techniques have revolutionized industries like healthcare, finance, and robotics."

**Chunking Function Explained**

👉 Split the text into a list of words:

```python
words = text.split()
```

In [3]:
words = TEXT.split()
print(words)

['Machine', 'learning', 'is', 'a', 'field', 'of', 'AI', 'that', 'enables', 'computers', 'to', 'learn', 'from', 'data.', 'Deep', 'learning', 'is', 'a', 'subset', 'of', 'machine', 'learning', 'focused', 'on', 'neural', 'networks.', 'These', 'techniques', 'have', 'revolutionized', 'industries', 'like', 'healthcare,', 'finance,', 'and', 'robotics.']


👉 Slice the word list in strides of `chunk_size` and join each slice back into a string:

```python
chunks = [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
```

In [4]:
CHUNK_SIZE = 10

for i in range(0, len(words), CHUNK_SIZE):
    print(i)

0
10
20
30


In [5]:
words[0:0 + CHUNK_SIZE]

['Machine',
 'learning',
 'is',
 'a',
 'field',
 'of',
 'AI',
 'that',
 'enables',
 'computers']

In [6]:
# i still holds 30 from the loop in cell [4], so this joins the final slice
' '.join(words[i:i + CHUNK_SIZE])

'industries like healthcare, finance, and robotics.'

In [7]:
words[10:10 + CHUNK_SIZE]

['to', 'learn', 'from', 'data.', 'Deep', 'learning', 'is', 'a', 'subset', 'of']

In [9]:
chunks = fixed_length_chunking(text=TEXT, chunk_size=CHUNK_SIZE)
chunks

['Machine learning is a field of AI that enables computers',
 'to learn from data. Deep learning is a subset of',
 'machine learning focused on neural networks. These techniques have revolutionized',
 'industries like healthcare, finance, and robotics.']

In [10]:
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk}\n")

Chunk 1: Machine learning is a field of AI that enables computers

Chunk 2: to learn from data. Deep learning is a subset of

Chunk 3: machine learning focused on neural networks. These techniques have revolutionized

Chunk 4: industries like healthcare, finance, and robotics.
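
The function above produces chunks that share no words. As a minimal sketch of how an overlap between neighbouring chunks could be added to the same word-based approach, the variant below advances by `chunk_size - overlap` words instead of `chunk_size`. The name `chunk_words_with_overlap` and the `overlap` parameter are hypothetical and are not part of the original notebook.

```python
def chunk_words_with_overlap(text, chunk_size, overlap=0):
    """Split text into chunks of `chunk_size` words; neighbours share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the window reached the end; skip a trailing chunk that is pure overlap
    return chunks


sample = "one two three four five six seven eight nine ten"
overlap_chunks = chunk_words_with_overlap(sample, chunk_size=4, overlap=1)
for i, chunk in enumerate(overlap_chunks):
    print(f"Chunk {i+1}: {chunk}")
```

With `overlap=0` this behaves like `fixed_length_chunking`; with a positive overlap, the last word(s) of one chunk reappear at the start of the next, which helps keep context across chunk boundaries.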



## **Chunking with Character Text Splitter**

In [11]:
text = """
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.

Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.

Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.

Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
"""

In [12]:
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(separator=" ", chunk_size=50, chunk_overlap=0)


In [13]:
chunks = text_splitter.split_text(text=text)
chunks

['Lorem ipsum dolor sit amet, consectetur',
 'adipiscing elit. Sed do eiusmod tempor incididunt',
 'ut labore et dolore magna aliqua.\n\nUt enim ad',
 'minim veniam, quis nostrud exercitation ullamco',
 'laboris nisi ut aliquip ex ea commodo',
 'consequat.\n\nDuis aute irure dolor in reprehenderit',
 'in voluptate velit esse cillum dolore eu fugiat',
 'nulla pariatur.\n\nExcepteur sint occaecat cupidatat',
 'non proident, sunt in culpa qui officia deserunt',
 'mollit anim id est laborum.']

In [14]:
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk}\n")

Chunk 1: Lorem ipsum dolor sit amet, consectetur

Chunk 2: adipiscing elit. Sed do eiusmod tempor incididunt

Chunk 3: ut labore et dolore magna aliqua.

Ut enim ad

Chunk 4: minim veniam, quis nostrud exercitation ullamco

Chunk 5: laboris nisi ut aliquip ex ea commodo

Chunk 6: consequat.

Duis aute irure dolor in reprehenderit

Chunk 7: in voluptate velit esse cillum dolore eu fugiat

Chunk 8: nulla pariatur.

Excepteur sint occaecat cupidatat

Chunk 9: non proident, sunt in culpa qui officia deserunt

Chunk 10: mollit anim id est laborum.
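
To build intuition for what a separator-based splitter does, here is a rough from-scratch sketch: split on the separator, then greedily merge pieces until adding the next one would exceed `chunk_size` characters. The function `greedy_character_chunks` is a hypothetical illustration, not LangChain's implementation. The library's own merging logic differs in details, so its chunk boundaries may not match this sketch exactly.

```python
def greedy_character_chunks(text, separator=" ", chunk_size=50):
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size characters."""
    # Drop empty pieces produced by repeated separators
    pieces = [p for p in text.strip().split(separator) if p]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Still fits, or this is a single piece that alone exceeds chunk_size
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


demo_chunks = greedy_character_chunks("aaa bbb ccc ddd eee", separator=" ", chunk_size=7)
print(demo_chunks)
```

Note that a single piece longer than `chunk_size` is kept whole rather than cut, because this approach only ever breaks at the separator.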



In [28]:
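# TextLoader lives in langchain_community in current LangChain releases;
# the older `langchain.document_loaders` import path is deprecated.
from langchain_community.document_loaders import TextLoader

loader = TextLoader(r"data\note.txt")  # Windows-style relative path
docs = loader.load()

docs[0].page_content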

'The untimely death of billionaire industrialist Sunjay Kapur has left a deep void in both the corporate boardrooms and celebrity circles he straddled with equal ease. The 57-year-old chairman of Sona Comstar, a leading global player in automotive components, died suddenly on June 12, 2025, while playing polo in London. Medical reports suggest a rare and fatal anaphylactic shock, reportedly triggered after he accidentally swallowed a bee during the match.\n\nWhile tributes continue to pour in for the visionary businessman who took his father’s company global, the spotlight has now shifted to the inevitable and complicated question: Who inherits his estimated Rs 10,300 crore ($1.2 billion) fortune?\n\nSunjay Kapur’s death has triggered a swirl of speculation regarding the succession of his business empire and personal wealth. At the time of his passing, Sona Comstar had a market capitalisation of approximately Rs 31,000 crore ($4 billion). Under his leadership, the company expanded rapi

In [29]:
len("Utkarsh Sinha")

13

In [30]:
len(docs[0].page_content)

4177

In [None]:
splitter = CharacterTextSplitter(separator="\n", chunk_size=3000, chunk_overlap=250)
split_docs = splitter.split_documents(docs)
split_docs
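
Because a character splitter only cuts at the separator, a stretch of text with no `"\n"` in it can end up longer than `chunk_size`. A small helper like the one below can confirm how long each chunk actually is. This is only a sketch: `chunk_length_report` is a hypothetical name, and it works on plain strings, so for the documents above you would pass something like `[d.page_content for d in split_docs]`.

```python
def chunk_length_report(chunks, chunk_size):
    """Return the length of every chunk and the indices of chunks longer than chunk_size."""
    lengths = [len(chunk) for chunk in chunks]
    oversized = [i for i, n in enumerate(lengths) if n > chunk_size]
    return lengths, oversized


# Stand-in strings used purely for illustration
example_chunks = ["short chunk", "a somewhat longer chunk of text", "x" * 40]
lengths, oversized = chunk_length_report(example_chunks, chunk_size=35)
print("Lengths:", lengths)
print("Oversized chunk indices:", oversized)
```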