# Document Transformers

After we load a Document object from a source, its text is stored as a string in `page_content`. In some situations that string is too long to feed into a model, whether an embedding model or a chat model.

- LangChain provides Document Transformers that split strings into chunks.
- These chunks can then be embedded individually.
- The two most common splitters either split on a specific character or split based on token counts.

In [5]:
from langchain.text_splitter import CharacterTextSplitter


In [1]:
with open("some_data/FDR_State_of_Union_1944.txt") as file:
    speech_text = file.read()

In [2]:
len(speech_text)

21927

In [3]:
len(speech_text.split())  # no. of words

3750

## Text Splitter
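Even though the separator is `"\n\n"`, it is not strictly enforced as a chunk boundary. The splitter first splits on the separator, then merges adjacent pieces while the result stays within `chunk_size` (1000 characters here). This keeps chunks close to the target size and avoids very small chunks.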
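The split-then-merge behaviour can be sketched in plain Python. This is only an illustrative approximation, not LangChain's implementation: the function name `split_and_merge` is hypothetical, chunk overlap is ignored, and a single piece longer than `chunk_size` is assumed to be kept whole.

```python
def split_and_merge(text, separator="\n\n", chunk_size=1000):
    """Split on `separator`, then greedily merge neighbouring pieces.

    Hypothetical helper for illustration; not part of LangChain.
    """
    # Step 1: split on the separator and drop empty pieces.
    pieces = [p for p in text.split(separator) if p.strip()]

    chunks, current = [], ""
    for piece in pieces:
        # Step 2: try to append the next piece to the chunk being built.
        candidate = piece if not current else current + separator + piece
        if not current or len(candidate) <= chunk_size:
            # An oversized first piece is kept whole (assumption).
            current = candidate
        else:
            # Adding the piece would exceed chunk_size: start a new chunk.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa\n\nbbbb\n\n" + "c" * 20 + "\n\ndd"
chunks = split_and_merge(sample, chunk_size=10)
```

With `chunk_size=10`, the two short paragraphs are merged into one chunk, while the 20-character paragraph stays in a chunk of its own even though it exceeds the limit.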

In [7]:
text_splitter = CharacterTextSplitter(separator="\n\n", chunk_size=1000)

In [8]:
texts = text_splitter.create_documents([speech_text])

In [9]:
len(texts)

28
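As a rough sanity check, the chunk count can be compared with the character count reported earlier. The sketch below only reuses the numbers shown in the outputs above (21927 characters, 28 chunks, `chunk_size=1000`). It assumes no chunk overlap and ignores the separators dropped between chunks, so the figures are approximate.

```python
import math

total_chars = 21927   # len(speech_text) from the earlier cell
chunk_size = 1000     # value passed to CharacterTextSplitter
num_chunks = 28       # len(texts) from the cell above

# Fewest chunks possible if every chunk were filled to exactly chunk_size.
min_chunks = math.ceil(total_chars / chunk_size)

# Approximate average chunk length actually obtained.
avg_chunk_len = total_chars / num_chunks

print(min_chunks, round(avg_chunk_len))
```

The splitter returns more chunks than this minimum because chunks end at paragraph boundaries, so most of them stop short of 1000 characters.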

In [10]:
texts[0]

Document(page_content="This Nation in the past two years has become an active partner in the world's greatest war against human slavery.\n\nWe have joined with like-minded people in order to defend ourselves in a world that has been gravely threatened with gangster rule.\n\nBut I do not think that any of us Americans can be content with mere survival. Sacrifices that we and our allies are making impose upon us all a sacred obligation to see to it that out of this war we and our children will gain something better than mere survival.\n\nWe are united in determination that this war shall not be followed by another interim which leads to new disaster- that we shall not repeat the tragic errors of ostrich isolationism—that we shall not repeat the excesses of the wild twenties when this Nation went for a joy ride on a roller coaster which ended in a tragic crash.", metadata={})

## TikToken
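Tiktoken is OpenAI's tokenizer library. It runs locally, so counting tokens incurs no API cost. With `from_tiktoken_encoder`, `chunk_size` is measured in tokens rather than characters.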
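Token-based splitting differs from character-based splitting mainly in how the length of a chunk is measured. The sketch below illustrates that idea with a pluggable length function. It is not LangChain's code: `merge_by_length` and `count_words` are hypothetical names, and a whitespace word count stands in for a real tokenizer such as tiktoken, so the counts will not match real token counts.

```python
def merge_by_length(pieces, chunk_size, length_function=len, separator="\n\n"):
    """Greedily merge pieces while length_function(chunk) <= chunk_size.

    Hypothetical helper for illustration; not part of LangChain.
    """
    chunks, current = [], []
    for piece in pieces:
        candidate = separator.join(current + [piece])
        if current and length_function(candidate) > chunk_size:
            # The merged chunk would be too long: close the current one.
            chunks.append(separator.join(current))
            current = [piece]
        else:
            current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


def count_words(text):
    # Stand-in for a tokenizer (assumption): one "token" per word.
    return len(text.split())


paragraphs = ["one two three", "four five", "six seven eight nine"]

# Length measured in characters (the default, len).
by_chars = merge_by_length(paragraphs, chunk_size=25)

# Length measured in "tokens" via the stand-in counter.
by_tokens = merge_by_length(paragraphs, chunk_size=4, length_function=count_words)
```

The merging logic is identical in both calls. Only the unit of `chunk_size` changes, which is why the same text can yield a different number of chunks under each measure.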

In [11]:
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=500)

In [12]:
texts = text_splitter.split_text(speech_text)  # or create_documents

In [13]:
len(texts)

15

In [14]:
texts[0]

'This Nation in the past two years has become an active partner in the world\'s greatest war against human slavery.\n\nWe have joined with like-minded people in order to defend ourselves in a world that has been gravely threatened with gangster rule.\n\nBut I do not think that any of us Americans can be content with mere survival. Sacrifices that we and our allies are making impose upon us all a sacred obligation to see to it that out of this war we and our children will gain something better than mere survival.\n\nWe are united in determination that this war shall not be followed by another interim which leads to new disaster- that we shall not repeat the tragic errors of ostrich isolationism—that we shall not repeat the excesses of the wild twenties when this Nation went for a joy ride on a roller coaster which ended in a tragic crash.\n\nWhen Mr. Hull went to Moscow in October, and when I went to Cairo and Teheran in November, we knew that we were in agreement with our allies in our