# Chunking

This notebook explores several methods for splitting a text into chunks.

In [52]:
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from utils import get_works_from_years

In [32]:
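# Load the works published in 2023 (helper from the local utils module)
df = get_works_from_years(start_year=2023, end_year=2023)
# Use the text of the second work (row position 1) as the running example
text = df.text.iloc[1]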

In [33]:
# Get to the meat of the paper
start = text.index("1. Introduction")
# Examine only a part of the paper (i.e., the introduction)
end = text.index("2. Analytical Techniques in Metabolomics")

text = text[start:end]
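
The two `text.index` calls above raise a bare `ValueError` if a heading is missing, and the end heading is searched from the start of the document rather than from the introduction onward. A minimal sketch of a more defensive variant is shown below. `extract_section` is a hypothetical helper written for illustration (it is not part of `utils` or LangChain), and it assumes the end marker should only be looked for after the start marker.

```python
def extract_section(text: str, start_marker: str, end_marker: str) -> str:
    """Return the slice of text from start_marker up to (not including) end_marker."""
    start = text.find(start_marker)
    if start == -1:
        raise ValueError(f"start marker not found: {start_marker!r}")

    # Search for the end marker only after the start marker, so an earlier
    # occurrence (e.g. in a table of contents) is not picked up by mistake.
    end = text.find(end_marker, start + len(start_marker))
    if end == -1:
        raise ValueError(f"end marker not found after start: {end_marker!r}")

    return text[start:end]


# Toy input for illustration only; it is not taken from the paper.
sample = "Abstract. 1. Introduction Plants are studied. 2. Methods We measured."
print(repr(extract_section(sample, "1. Introduction", "2. Methods")))
```

In this notebook, `extract_section(text, "1. Introduction", "2. Analytical Techniques in Metabolomics")` would be the equivalent call on the full paper text, provided the end heading does not also appear before the introduction.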

In [38]:
text

'1. IntroductionTo understand the biological pathway underlying the phenotype of plants, a systemsbiology approach can be used [1–3]. In systems biology, the information and interac-tion of the functional physical structure and the genetic information are integrated toprovide a comprehensive model of the organism (Figure 1). Different high-throughputtechnologies are used to study the genetic program of the various -omics fields: genomics,transcriptomics, proteomics, and metabolomics.Metabolomics was the newest field added to the systems biology toolbox at the begin-ning of the 21st century. Metabolomics gives a quantitative and qualitative overview ofall the metabolites, small molecules with a molecular weight of 30–3000 Da, present in anorganism with various properties and functions [4]. There are approximately 1,000,000 dif-ferent metabolites available in the plant kingdom, which makes metabolomics a challengingfield [5]. Moreover, the metabolome changes quite quickly due to circadia

### Basic Method: Fixed-Length

**Pros:**
* Straightforward

**Cons:**
* Often splits sentences across chunk boundaries
* Requires specifying a `chunk_size` and a `chunk_overlap`, and it is unclear how either should be chosen
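
To make the two parameters concrete, here is a minimal sketch of strictly fixed-length chunking in plain Python. `fixed_length_chunks` is a hypothetical helper written for illustration; it is not how LangChain's splitters are implemented, since those also try to break on separator characters. Each window is `chunk_size` characters long, and consecutive windows share `chunk_overlap` characters.

```python
def fixed_length_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into character windows of chunk_size that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text, so the
        # final chunk is never just a repeat of the previous chunk's tail.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


# Toy input for illustration only.
sample = "Metabolomics is a young field. It studies small molecules."
for chunk in fixed_length_chunks(sample, chunk_size=25, chunk_overlap=5):
    print(repr(chunk))
```

Because the window boundaries ignore the content entirely, chunks routinely end in the middle of a sentence or even a word, which is the first drawback listed above.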

In [56]:
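text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # maximum chunk length, as measured by length_function
    chunk_overlap=200,    # maximum overlap between consecutive chunks
    length_function=len,  # measure length in characters
)
# Wrap each resulting chunk in a Document; the text is in `page_content`
docs = text_splitter.create_documents([text])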

In [57]:
print(f"Number of chunks: {len(docs)}.")

Number of chunks: 5.
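
One rough way to gauge how often chunks cut a sentence is to check whether each chunk ends with sentence-final punctuation. The sketch below is a heuristic only: `chunk_report` and `SENTENCE_ENDINGS` are hypothetical names introduced for illustration, and the check will misjudge cases such as abbreviations or closing quotation marks.

```python
SENTENCE_ENDINGS = (".", "!", "?")


def chunk_report(chunks: list[str]) -> list[dict]:
    """Return the length of each chunk and whether it appears to end a sentence."""
    report = []
    for i, chunk in enumerate(chunks):
        report.append({
            "index": i,
            "length": len(chunk),
            # Heuristic: ignore trailing whitespace, then look at the last character.
            "ends_sentence": chunk.rstrip().endswith(SENTENCE_ENDINGS),
        })
    return report


# Toy chunks for illustration only.
sample_chunks = [
    "Plants are complex.",
    "Metabolomics gives a quanti",
    "tative overview [4].",
]
for row in chunk_report(sample_chunks):
    print(row)
```

In this notebook, the same report could be produced with `chunk_report([doc.page_content for doc in docs])`.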


In [58]:
for i, doc in enumerate(docs):
    print(f"Doc {i}:")
    print(doc.page_content)
    print()

Doc 0:
1. IntroductionTo understand the biological pathway underlying the phenotype of plants, a systemsbiology approach can be used [1–3]. In systems biology, the information and interac-tion of the functional physical structure and the genetic information are integrated toprovide a comprehensive model of the organism (Figure 1). Different high-throughputtechnologies are used to study the genetic program of the various -omics fields: genomics,transcriptomics, proteomics, and metabolomics.Metabolomics was the newest field added to the systems biology toolbox at the begin-ning of the 21st century. Metabolomics gives a quantitative and qualitative overview ofall the metabolites, small molecules with a molecular weight of 30–3000 Da, present in anorganism with various properties and functions [4]. There are approximately 1,000,000 dif-ferent metabolites available in the plant kingdom, which makes metabolomics a challengingfield [5]. Moreover, the metabolome changes quite quickly due to ci