In [5]:
# ! pip install -U langchain
# ! pip install langchain_text_splitters

In [15]:
data = """
Once you've loaded documents, you'll often want to transform them to better suit your application. The simplest example is you may want to split a long document into smaller chunks that can fit into your model's context window. LangChain has a number of built-in document transformers that make it easy to split, combine, filter, and otherwise manipulate documents.

When you want to deal with long pieces of text, it is necessary to split up that text into chunks. As simple as this sounds, there is a lot of potential complexity here. Ideally, you want to keep the semantically related pieces of text together. What "semantically related" means could depend on the type of text. This notebook showcases several ways to do that.

At a high level, text splitters work as following:

Split the text up into small, semantically meaningful chunks (often sentences).
Start combining these small chunks into a larger chunk until you reach a certain size (as measured by some function).
Once you reach that size, make that chunk its own piece of text and then start creating a new chunk of text with some overlap (to keep context between chunks).
That means there are two different axes along which you can customize your text splitter:

How the text is split
How the chunk size is measured

"""

In [16]:
len("""Once you've loaded documents, you'll often want to transform them to better suit your application. The simplest example is you may want to split a long document into smaller chunks that can fit into your model's context window. LangChain has a number of built-in document transformers that make it easy to split, combine, filter, and otherwise manipulate documents.
""")

366

In [20]:
len("""When you want to deal with long pieces of text, it is necessary to split up that text into chunks. As simple as this sounds, there is a lot of potential complexity here. Ideally, you want to keep the semantically related pieces of text together. What "semantically related" means could depend on the type of text. This notebook showcases several ways to do that.
""")

363

### Understanding RecursiveCharacterTextSplitter

In [24]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# chunk_size is measured in characters by default (length_function=len).
chunk_size = 800
chunk_overlap = 0

splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
chunks = splitter.split_text(data)

# Print each chunk's index and character length, followed by its text.
for i, chunk in enumerate(chunks):
    print(i, len(chunk))
    print(chunk)

0 781
Once you've loaded documents, you'll often want to transform them to better suit your application. The simplest example is you may want to split a long document into smaller chunks that can fit into your model's context window. LangChain has a number of built-in document transformers that make it easy to split, combine, filter, and otherwise manipulate documents.

When you want to deal with long pieces of text, it is necessary to split up that text into chunks. As simple as this sounds, there is a lot of potential complexity here. Ideally, you want to keep the semantically related pieces of text together. What "semantically related" means could depend on the type of text. This notebook showcases several ways to do that.

At a high level, text splitters work as following:
1 501
Split the text up into small, semantically meaningful chunks (often sentences).
Start combining these small chunks into a larger chunk until you reach a certain size (as measured by some function).
Once you
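
The splitting behaviour above can be approximated in plain Python. This is a simplified sketch, not LangChain's implementation: the function name `recursive_split` and the toy `sample` string are made up for illustration. It tries the coarsest separator first, greedily merges pieces up to `chunk_size`, and only falls back to finer separators for pieces that are still too large.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified sketch of recursive splitting (not LangChain's actual code)."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, finer)

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # the piece still fits: keep merging
            continue
        if current:
            chunks.append(current)  # close the chunk built so far
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "aaa bbb\n\nccc ddd eee\n\nfff"
print(recursive_split(sample, chunk_size=10))
# ['aaa bbb', 'ccc ddd', 'eee', 'fff']
```

The middle paragraph is 11 characters long, so it is the only one that gets split again on spaces; the other paragraphs stay intact.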

### Understanding chunk_overlap

In [3]:
data = """Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans."""

In [4]:
print(data)

Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.


In [6]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# A small chunk_size makes the effect of chunk_overlap easy to see.
chunk_size = 40
chunk_overlap = 10

splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

chunks = splitter.split_text(data)

# Print each chunk's index and character length, followed by its text.
for i, chunk in enumerate(chunks):
    print(i, len(chunk))
    print(chunk)

0 40
Madam Speaker, Madam Vice President, our
1 36
our First Lady and Second Gentleman.
2 39
Members of Congress and of Congress and
3 32
and the Cabinet. Justices of the
4 31
of the Supreme Court. My fellow
5 20
My fellow Americans.
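
To see why some chunks above start with words from the previous chunk and others do not, here is a plain-Python sketch of the merge step. It is an approximation written for this notebook, not LangChain's code, and the name `split_with_overlap` is hypothetical. It assumes each word keeps its leading space when lengths are counted, so a trailing word is carried over only if it fits within `chunk_overlap` characters including that space.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Merge space-separated words into chunks that share a short tail."""
    words = text.split(" ")
    # Every piece except the first keeps its leading space.
    pieces = [words[0]] + [" " + w for w in words[1:]]

    chunks, window, total = [], [], 0
    for piece in pieces:
        if window and total + len(piece) > chunk_size:
            chunks.append("".join(window).strip())
            # Drop pieces from the front until the remaining tail is within
            # chunk_overlap and leaves room for the incoming piece.
            while total > chunk_overlap or (
                total > 0 and total + len(piece) > chunk_size
            ):
                total -= len(window.pop(0))
        window.append(piece)
        total += len(piece)
    if window:
        chunks.append("".join(window).strip())
    return chunks


speech = (
    "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. "
    "Members of Congress and of Congress and the Cabinet. "
    "Justices of the Supreme Court. My fellow Americans."
)
for i, chunk in enumerate(split_with_overlap(speech, chunk_size=40, chunk_overlap=10)):
    print(i, len(chunk), chunk)
```

With `chunk_size=40` and `chunk_overlap=10` the sketch yields the same six chunks as the output above. " Gentleman." is 11 characters with its leading space, which exceeds the overlap budget, so chunk 2 starts fresh at "Members".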


### CharacterTextSplitter

In [25]:
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    chunk_size=128,
    chunk_overlap=0,
    separator=" "
)

In [26]:
texts = text_splitter.split_text(data)

In [27]:
texts

["Once you've loaded documents, you'll often want to transform them to better suit your application. The simplest example is you",
 "may want to split a long document into smaller chunks that can fit into your model's context window. LangChain has a number of",
 'built-in document transformers that make it easy to split, combine, filter, and otherwise manipulate documents.\n\nWhen you want',
 'to deal with long pieces of text, it is necessary to split up that text into chunks. As simple as this sounds, there is a lot of',
 'potential complexity here. Ideally, you want to keep the semantically related pieces of text together. What "semantically',
 'related" means could depend on the type of text. This notebook showcases several ways to do that.\n\nAt a high level, text',
 'splitters work as following:\n\nSplit the text up into small, semantically meaningful chunks (often sentences).\nStart combining',
 'these small chunks into a larger chunk until you reach a certain size (as measured b

### RecursiveCharacterTextSplitter

In [34]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [21]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=20,
    length_function=len,
)

In [22]:
texts = text_splitter.split_text(data)

In [23]:
texts

['Once you\'ve loaded documents, you\'ll often want to transform them to better suit your application. The simplest example is you may want to split a long document into smaller chunks that can fit into your model\'s context window. LangChain has a number of built-in document transformers that make it easy to split, combine, filter, and otherwise manipulate documents.\n\nWhen you want to deal with long pieces of text, it is necessary to split up that text into chunks. As simple as this sounds, there is a lot of potential complexity here. Ideally, you want to keep the semantically related pieces of text together. What "semantically related" means could depend on the type of text. This notebook showcases several ways to do that.\n\nAt a high level, text splitters work as following:',
 'Split the text up into small, semantically meaningful chunks (often sentences).\nStart combining these small chunks into a larger chunk until you reach a certain size (as measured by some function).\nOnce 

### Token Chunking

In [46]:
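# Requires the tiktoken package. chunk_size and chunk_overlap are counted
# in tokens here, not characters.
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=128, chunk_overlap=20
)
texts = text_splitter.split_text(data)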
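
Token chunking changes how chunk size is measured, not where the text is split. The sketch below illustrates that idea with a pluggable length function. Everything in it is hypothetical: `merge_by_length` is a made-up helper and `count_words` is a crude stand-in for a real tokenizer such as tiktoken.

```python
def merge_by_length(pieces, chunk_size, length_function=len, joiner=" "):
    """Greedily join pieces while length_function(chunk) stays within chunk_size."""
    chunks, current = [], []
    for piece in pieces:
        candidate = joiner.join(current + [piece])
        if current and length_function(candidate) > chunk_size:
            chunks.append(joiner.join(current))
            current = [piece]
        else:
            current.append(piece)
    if current:
        chunks.append(joiner.join(current))
    return chunks


def count_words(text):
    # Hypothetical stand-in for a tokenizer: one "token" per word.
    return len(text.split())


sentences = ["One two three.", "Four five.", "Six seven eight nine.", "Ten."]

# Same pieces, two different ways of measuring size.
by_chars = merge_by_length(sentences, chunk_size=30)
by_words = merge_by_length(sentences, chunk_size=4, length_function=count_words)
print(by_chars)
print(by_words)
```

Measured in characters, the four sentences merge into two chunks; measured in words with a budget of 4, each sentence ends up in its own chunk.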

In [47]:
texts[0]

"Once you've loaded documents, you'll often want to transform them to better suit your application. The simplest example is you may want to split a long document into smaller chunks that can fit into your model's context window. LangChain has a number of built-in document transformers that make it easy to split, combine, filter, and otherwise manipulate documents."

### Splitting Code

In [35]:
from langchain_text_splitters import (
    Language,
    RecursiveCharacterTextSplitter,
)

In [36]:
with open("app_updated.py") as fp:
    data = fp.read()

In [51]:
print(data)

In [52]:

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=200, chunk_overlap=0
)

# Split the source on Python-aware boundaries, then display the chunks.
python_docs = python_splitter.split_text(data)
python_docs
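
A language-aware splitter prefers boundaries that mean something in the language, such as the start of a function or class, before falling back to blank lines and spaces. The sketch below shows only that first step for Python, using a regular expression. It is an illustration under that assumption, not the separator list LangChain actually uses, and `split_python_source` and the sample `source` are made up.

```python
import re


def split_python_source(source):
    """Split before every top-level `def` or `class` line (simplified sketch)."""
    # The lookahead keeps the keyword at the start of the following block.
    parts = re.split(r"\n(?=(?:def|class) )", source)
    return [part.strip("\n") for part in parts if part.strip()]


source = '''import os

def load(path):
    return open(path).read()

class App:
    def run(self):
        return load("app.py")
'''

blocks = split_python_source(source)
for block in blocks:
    print(repr(block))
```

The indented `def run` is not treated as a boundary because it does not start at column zero, so the method stays with its class.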

### Markdown splitter

In [29]:
data = """
# Fun in California

## Driving

Try driving on the 1 down to San Diego

### Food

Make sure to eat a burrito while you're there

## Hiking

Go to Yosemite
"""

In [30]:
from langchain_text_splitters import MarkdownHeaderTextSplitter

In [31]:
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)


In [32]:
# Use the MarkdownHeaderTextSplitter defined above (`splitter`), not the
# `text_splitter` left over from an earlier section.
texts = splitter.split_text(data)

In [34]:
# Each result is a Document: the section text plus the headers it sits under.
for i, chunk in enumerate(texts):
    print(i, chunk.metadata)
    print(chunk.page_content)
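
Header-based splitting can be pictured as tracking which headers are currently "open" and attaching them to each block of body text. The following plain-Python sketch shows that idea. It is not the library's implementation, and `split_markdown_by_headers` and its dictionary output format are hypothetical.

```python
def split_markdown_by_headers(markdown, max_level=3):
    """Group body lines under the headers that are active when they appear."""
    sections, active, body = [], {}, []

    def flush():
        text = "\n".join(body).strip()
        if text:
            metadata = {f"Header {lvl}": active[lvl] for lvl in sorted(active)}
            sections.append({"content": text, "metadata": metadata})
        body.clear()

    for line in markdown.splitlines():
        stripped = line.strip()
        hashes = len(stripped) - len(stripped.lstrip("#"))
        if 1 <= hashes <= max_level and stripped[hashes:hashes + 1] == " ":
            flush()
            # A new header replaces its own level and everything deeper.
            for level in [lvl for lvl in active if lvl >= hashes]:
                del active[level]
            active[hashes] = stripped[hashes:].strip()
        else:
            body.append(line)
    flush()
    return sections


doc = """
# Fun in California

## Driving

Try driving on the 1 down to San Diego

### Food

Make sure to eat a burrito while you're there

## Hiking

Go to Yosemite
"""

sections = split_markdown_by_headers(doc)
for section in sections:
    print(section["metadata"], "|", section["content"])
```

Each section keeps its full header path, so "Go to Yosemite" is tagged with "Hiking" and no longer inherits "Driving" or "Food".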


### Semantic Chunking

In [64]:
!pip install --quiet langchain_experimental

In [12]:
# ! pip install langchain_openai

In [39]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

In [9]:
# ! pip install sentence-transformers

In [43]:
# Never hard-code API keys in a notebook. OpenAIEmbeddings() reads the key
# from the OPENAI_API_KEY environment variable.
text_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=50,
)
# With default breakpoint settings:
# text_splitter = SemanticChunker(OpenAIEmbeddings())

In [44]:
data = """
Paris, the capital city of France, is renowned for its rich history, iconic landmarks, and vibrant culture. From the magnificent Eiffel Tower standing tall against the skyline to the charming cobblestone streets of Montmartre, Paris exudes an undeniable allure. Visitors flock to marvel at masterpieces in the Louvre Museum, stroll along the Seine River, and indulge in exquisite French cuisine at quaint bistros. With its romantic ambiance and timeless elegance, Paris continues to captivate hearts and minds as one of the most enchanting cities in the world.

Computers have revolutionized the way we live, work, and interact with the world around us. From the early mechanical calculators to the sophisticated machines of today, computers have evolved exponentially, becoming integral to nearly every aspect of modern life. These powerful devices process vast amounts of data, enable seamless communication across the globe, and drive innovation in fields ranging from science and medicine to business and entertainment. With their unparalleled capabilities and ever-expanding potential, computers continue to shape the course of human progress in profound ways.

Adolf Hitler, the infamous dictator of Nazi Germany, rose to power in the 1930s, plunging Europe into the depths of World War II and perpetrating unspeakable atrocities during the Holocaust. Through his fiery rhetoric and ruthless policies, Hitler sought to impose his radical ideology of racial purity and expansionist ambitions upon the world. His regime systematically persecuted and murdered millions of innocent civilians, forever staining history with the horrors of genocide. Despite his ultimate defeat and the downfall of the Third Reich, Hitler's legacy serves as a haunting reminder of the dangers of unchecked tyranny and the enduring importance of safeguarding democracy and human rights."""

In [45]:
chunks = text_splitter.split_text(data)

# Print each chunk's index and character length, followed by its text.
for i, chunk in enumerate(chunks):
    print(i, len(chunk))
    print(chunk)

0 561

Paris, the capital city of France, is renowned for its rich history, iconic landmarks, and vibrant culture. From the magnificent Eiffel Tower standing tall against the skyline to the charming cobblestone streets of Montmartre, Paris exudes an undeniable allure. Visitors flock to marvel at masterpieces in the Louvre Museum, stroll along the Seine River, and indulge in exquisite French cuisine at quaint bistros. With its romantic ambiance and timeless elegance, Paris continues to captivate hearts and minds as one of the most enchanting cities in the world.
1 91
Computers have revolutionized the way we live, work, and interact with the world around us.
2 369
From the early mechanical calculators to the sophisticated machines of today, computers have evolved exponentially, becoming integral to nearly every aspect of modern life. These powerful devices process vast amounts of data, enable seamless communication across the globe, and drive innovation in fields ranging from science and
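
The percentile breakpoint used above can be sketched without any embedding model. The example below is a simplified illustration, not SemanticChunker's code: the 2-D `embeddings` are invented toy vectors, and the helper names are hypothetical. It measures the cosine distance between consecutive sentences and starts a new chunk wherever that distance exceeds the chosen percentile of all distances.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def percentile(values, pct):
    """Linear-interpolation percentile for 0 <= pct <= 100."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def semantic_chunks(sentences, embeddings, threshold_pct):
    """Break between sentences whose embedding distance exceeds the percentile."""
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
    threshold = percentile(distances, threshold_pct)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sentences = [
    "Paris is the capital of France.",
    "The Louvre is in Paris.",
    "Computers process data.",
    "Processors execute instructions.",
]
# Invented toy embeddings: the first two point one way, the last two another.
embeddings = [(1.0, 0.0), (0.9, 0.1), (0.1, 0.9), (0.0, 1.0)]

topic_chunks = semantic_chunks(sentences, embeddings, threshold_pct=50)
print(topic_chunks)
```

A lower percentile produces more breakpoints and therefore smaller chunks, which is one plausible reason the output above splits inside a single paragraph when the threshold is set to 50.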

### PDF parser

In [41]:
# ! pip install unstructured
# ! pip install pdf2image
# ! pip install -U unstructured pdf2image pdfminer

In [2]:
# ! pip install -U pdfminer.six


In [9]:
# ! pip install pillow_heif
# ! pip install opencv-python

In [12]:
# ! pip install pytesseract

In [15]:
# ! pip install unstructured_pytesseract

In [19]:
# ! pip install unstructured_inference

In [1]:
# ! pip install pikepdf

In [46]:
import os
from unstructured.partition.pdf import partition_pdf
from unstructured.staging.base import elements_to_json

In [47]:
filename = "SalesforceFinancial.pdf"

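# Extract the elements (titles, narrative text, tables, ...) from the PDF.
elements = partition_pdf(
    filename=filename,
    # "hi_res" runs a layout-detection model instead of plain text extraction.
    strategy="hi_res",
    # Store each table's reconstructed HTML in element.metadata.text_as_html.
    infer_table_structure=True,
    # Layout-detection model to use.
    model_name="yolox",
)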

Following dependencies are missing: pikepdf. Please install them using `pip install pikepdf`.
PDF text extraction failed, skip text extraction...
Some weights of the model checkpoint at microsoft/table-transformer-structure-recognition were not used when initializing TableTransformerForObjectDetection: ['model.backbone.conv_encoder.model.layer4.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer3.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer2.0.downsample.1.num_batches_tracked']
- This IS expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model 

In [48]:
elements

[<unstructured.documents.elements.NarrativeText at 0x7f7333abaf50>,
 <unstructured.documents.elements.NarrativeText at 0x7f7328730410>,
 <unstructured.documents.elements.NarrativeText at 0x7f7328730510>,
 <unstructured.documents.elements.NarrativeText at 0x7f7328730610>,
 <unstructured.documents.elements.NarrativeText at 0x7f7328730710>,
 <unstructured.documents.elements.NarrativeText at 0x7f7328730890>,
 <unstructured.documents.elements.Title at 0x7f73287b24d0>,
 <unstructured.documents.elements.NarrativeText at 0x7f7328730990>,
 <unstructured.documents.elements.NarrativeText at 0x7f7328730a90>,
 <unstructured.documents.elements.Table at 0x7f7328730b90>,
 <unstructured.documents.elements.Title at 0x7f73285e83d0>,
 <unstructured.documents.elements.Text at 0x7f73285e8e10>,
 <unstructured.documents.elements.NarrativeText at 0x7f7328730c50>,
 <unstructured.documents.elements.NarrativeText at 0x7f7328730dd0>]

In [14]:
elements[-5].metadata.text_as_html

'<table><thead><th>Revenue")</th><th>Guidance $7.69 - $7.70 Billion</th><th>Guidance $31.7 - $31.8 Billion</th></thead><tr><td>Y/Y Growth</td><td>~21%</td><td>~20%</td></tr><tr><td>FX Impact?)</td><td>~($200M) y/y FX</td><td>~($600M) y/y FX)</td></tr><tr><td>GAAP operating margin</td><td></td><td>~3.8%</td></tr><tr><td>Non-GAAP operating margin”)</td><td></td><td>~20.4%</td></tr><tr><td>GAAP earnings (loss) per share</td><td>($0.03) - ($0.02)</td><td>$0.38 - $0.40</td></tr><tr><td>Non-GAAP earnings per share</td><td>$1.01 - $1.02</td><td>$4.74 - $4.76</td></tr><tr><td>Operating Cash Flow Growth (Y/Y)</td><td></td><td>~21% - 22%</td></tr><tr><td>Current Remaining Performance Obligation Growth (Y/Y)</td><td>~15%</td><td></td></tr></table>'

In [50]:
elements[1].text

'Operating Margin: First quarter GAAP operating margin was 0.3%. First quarter non-GAAP operating margin was 17.6%.'

In [21]:
elements[1].metadata.to_dict()

{'detection_class_prob': 0.9110447764396667,
 'coordinates': {'points': ((185.0060272216797, 247.2501678466797),
   (185.0060272216797, 310.4765930175781),
   (1545.081298828125, 310.4765930175781),
   (1545.081298828125, 247.2501678466797)),
  'system': 'PixelSpace',
  'layout_width': 1700,
  'layout_height': 2200},
 'last_modified': '2024-03-01T15:10:33',
 'filetype': 'application/pdf',
 'languages': ['eng'],
 'page_number': 1,
 'filename': 'SalesforceFinancial.pdf'}
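
The `text_as_html` string shown earlier is easier to work with once it is turned into rows of cells. Below is a small sketch that does this with the standard library's `html.parser`. The class name `TableRows` is made up, and `table_html` is an abridged copy of the table output above (OCR artifacts such as `Revenue")` are kept as they appear). It treats `<thead>` as a row because the extracted HTML has no `<tr>` inside the header.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag in ("thead", "tr"):
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag in ("thead", "tr") and self._row:
            self.rows.append(self._row)
            self._row = None


# Abridged from the text_as_html output shown earlier in this notebook.
table_html = (
    '<table><thead><th>Revenue")</th><th>Guidance $7.69 - $7.70 Billion</th>'
    '<th>Guidance $31.7 - $31.8 Billion</th></thead>'
    '<tr><td>Y/Y Growth</td><td>~21%</td><td>~20%</td></tr>'
    '<tr><td>GAAP operating margin</td><td></td><td>~3.8%</td></tr></table>'
)

parser = TableRows()
parser.feed(table_html)
for row in parser.rows:
    print(row)
```

Empty cells are kept as empty strings so every row has the same number of columns, which makes the result straightforward to load into a table structure later.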