In [22]:
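def dp_text_split(text, target):
    """Split text at newlines into chunks whose lengths are close to target.

    Returns (total_error, chunks), where total_error is the sum of
    |len(chunk) - target| over all chunks chosen by the DP.
    """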
    # Split the text into lines
    lines = text.split('\n')
    line_lengths = [len(line) for line in lines]
    n = len(lines)

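    # dp[i]: minimum total error for splitting the first i lines
    # splits[i]: index of the first line of the last chunk in that solution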
    dp = [float('inf')] * (n + 1)
    dp[0] = 0  # No error at the start
    splits = [-1] * (n + 1)

    # Dynamic programming logic to calculate the minimum error
    for i in range(1, n + 1):
        total_length = 0
        for j in range(i, 0, -1):
            total_length += line_lengths[j - 1] + 1  # +1 to account for newline

            # Stop once a multi-line chunk exceeds the target; a single
            # line (j == i) is still allowed to overshoot on its own.
            if total_length - 1 > target and j != i:
                break

            error = abs(total_length - 1 - target)
            if dp[i] > dp[j - 1] + error:
                dp[i] = dp[j - 1] + error
                splits[i] = j - 1

    # Backtracking to get the chunks
    chunks = []
    i = len(splits) - 1
    while i > 0:
        j = splits[i]
        chunk = '\n'.join(lines[j:i])
        if chunk.strip():  # Skip whitespace-only chunks (their error still counts in dp[n])
            chunks.append(chunk)
        i = j

    chunks.reverse()

    # Return the total error and the chunks
    total_error = dp[n]
    return total_error, chunks

# Testing the combined function
text = "Title\n1\n2\n3\n\nline1\nline2 is too long\n\nline3"
target = 15
total_error, result = dp_text_split(text, target)

# Printing the total error
print(f"Total Error: {total_error}")

# Printing the chunks
for idx, chunk in enumerate(result):
    print(f"Chunk {idx + 1}:\n{chunk}\n")


Total Error: 24
Chunk 1:
Title
1
2
3


Chunk 2:
line1

Chunk 3:
line2 is too long

Chunk 4:

line3
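
The total error reported above can be cross-checked without the DP table. The following is a minimal sketch, not part of the original notebook: `brute_force_min_error` is a hypothetical helper that scores every possible set of cut positions under the same rule as `dp_text_split` (a chunk may exceed the target only if it is a single line). It is exponential in the number of lines, so it is only usable on tiny inputs like the test string.

```python
from itertools import combinations


def brute_force_min_error(text, target):
    """Exhaustively score every way of cutting `text` at newlines."""
    lengths = [len(line) for line in text.split('\n')]
    n = len(lengths)
    best = float('inf')

    # A split is a set of cut positions chosen among the n - 1 gaps between lines.
    for k in range(n):
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            total = 0
            for start, end in zip(bounds, bounds[1:]):
                # Joined length: characters plus the newlines between lines.
                size = sum(lengths[start:end]) + (end - start - 1)
                if size > target and end - start > 1:
                    total = float('inf')  # Multi-line chunks may not overshoot.
                    break
                total += abs(size - target)
            best = min(best, total)
    return best


text = "Title\n1\n2\n3\n\nline1\nline2 is too long\n\nline3"
print(brute_force_min_error(text, 15))  # 24, the same value the DP reports
```

Agreement on a small input does not prove the DP correct in general, but a mismatch here would point to an error in the recurrence or the break condition.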



In [23]:
import pandas as pd
df_text = pd.read_parquet('./staging/text/transformed.parquet').dropna(subset=['len_text'])
df_text.info()
df_text.head()

<class 'pandas.core.frame.DataFrame'>
Index: 682 entries, 0 to 1370
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   wiki_id   682 non-null    object 
 1   url       682 non-null    object 
 2   page      682 non-null    object 
 3   len_text  682 non-null    float64
dtypes: float64(1), object(3)
memory usage: 26.6+ KB


Unnamed: 0,wiki_id,url,page,len_text
0,Q43436,https://en.wikipedia.org/wiki/Pearl,"Pearl\n\nA pearl is a hard, glistening object ...",38359.0
1,Q43088,https://en.wikipedia.org/wiki/Ruby,Ruby\n\nRuby is a pinkish red to blood-red col...,13666.0
2,Q5283,https://en.wikipedia.org/wiki/Diamond,Diamond\n\nDiamond is a solid form of the elem...,61340.0
3,Q573870,https://en.wikipedia.org/wiki/Bi_(jade),Bi (jade)\n\nThe bi (Chinese: 璧) is a type of ...,3694.0
5,Q138979,https://en.wikipedia.org/wiki/Nephrite,"NephriteNephrite is a variety of the calcium, ...",6175.0


In [24]:
d = df_text.to_dict(orient='records')
text = d[0]['page']
print(text[:2000])

Pearl

A pearl is a hard, glistening object produced within the soft tissue (specifically the mantle) of a living shelled mollusk or another animal, such as fossil  conulariids. Just like the shell of a mollusk, a pearl is composed of calcium carbonate (mainly aragonite or a mixture of aragonite and calcite)[3] in minute crystalline form, which has deposited in concentric layers. The ideal pearl is perfectly round and smooth, but many other shapes, known as baroque pearls, can occur. The finest quality of natural pearls have been highly valued as gemstones and objects of beauty for many centuries. Because of this, pearl has become a metaphor for something rare, fine, admirable and valuable.

The most valuable pearls occur spontaneously in the wild, but are extremely rare. These wild pearls are referred to as natural pearls. Cultured or farmed pearls from pearl oysters and freshwater mussels make up the majority of those currently sold. Imitation pearls are also widely sold in inexpensi
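
Because `dp_text_split` only cuts at newlines, any single line longer than the target ends up as its own over-length chunk. A quick way to see how many such lines a page contains before splitting is sketched below. `overlong_lines` is a hypothetical helper, and the sample string is made up for illustration rather than taken from the dataset.

```python
def overlong_lines(text, target):
    """Return (line_number, length) for every line longer than `target`."""
    return [
        (number, len(line))
        for number, line in enumerate(text.split('\n'), start=1)
        if len(line) > target
    ]


# Synthetic sample: a title, a 1200-character paragraph, and a short one.
sample = "Pearl\n\n" + "x" * 1200 + "\n\nshort paragraph"
print(overlong_lines(sample, 1000))  # [(3, 1200)]
```

Each line it reports contributes at least `length - target` to the total error, whatever the DP does with the remaining lines.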

In [27]:
total_error, result = dp_text_split(text, 1000)
# Printing the total error
print(f"Number of chunks: {len(result)}")
print(f"Total Error: {total_error}")

# Printing the chunks
for idx, chunk in enumerate(result):
    print(f"Chunk {idx + 1}:\n{chunk}\n")

Number of chunks: 47
Total Error: 10499
Chunk 1:
Pearl

A pearl is a hard, glistening object produced within the soft tissue (specifically the mantle) of a living shelled mollusk or another animal, such as fossil  conulariids. Just like the shell of a mollusk, a pearl is composed of calcium carbonate (mainly aragonite or a mixture of aragonite and calcite)[3] in minute crystalline form, which has deposited in concentric layers. The ideal pearl is perfectly round and smooth, but many other shapes, known as baroque pearls, can occur. The finest quality of natural pearls have been highly valued as gemstones and objects of beauty for many centuries. Because of this, pearl has become a metaphor for something rare, fine, admirable and valuable.


Chunk 2:
The most valuable pearls occur spontaneously in the wild, but are extremely rare. These wild pearls are referred to as natural pearls. Cultured or farmed pearls from pearl oysters and freshwater mussels make up the majority of those current